UNSW uses Office 365, to my infinite regret, and migrated to a hybrid Azure domain in 2015. That’s screwed up the only thing I care about: reliable email.

Here are some adventures in Office365 hell.

“Intermittent mail service failures”

As with all proper mail system issues, it was DNS – Outlook DNS is badly broken in hitherto undocumented ways. UNSW’s IT service team also don’t care about the infrastructure they use, presumably because they think they can just throw money at Microsoft to make any and all problems go away. Good work.

This story’s told in a series of emails, scraped out of the UNSW Service Desk system. Names preserved to incriminate the incompetent.

(UNSW CAsd incident 1791533)

me to UNSW IT Service Centre, 2017-09-19 15:26

Hi there,

I called earlier to report a non-deterministic mail service failure of the IMAP service from Outlook.com. The issue promptly disappeared when I made that support call, but naturally reappeared mere moments after.

This issue has been intermittent over the last few days but has not led to any loss of service until today; at around midnight tonight, mail ceased being retrievable via IMAP, and I’ve seen only error messages since.

I use Fetchmail, and deliver mail into my own mail service, but have been able to reproduce this fault in all mail clients I can lay my hands on (Mutt, Thunderbird, Claws-Mail).

Fetchmail reports the following session information, where lines beginning ‘<’ are received and lines beginning ‘>’ are sent.

< * OK The Microsoft Exchange IMAP4 service is ready. [...]
> A0001 CAPABILITY
< * CAPABILITY IMAP4 IMAP4rev1 AUTH=PLAIN AUTH=XOAUTH2 SASL-IR UIDPLUS MOVE ID UNSELECT CHILDREN IDLE NAMESPACE LITERAL+
< A0001 OK CAPABILITY completed.
> A0002 LOGIN "z5017851@ad.unsw.edu.au" *
< A0002 OK LOGIN completed.
> A0003 SELECT "INBOX"
< A0003 BAD User is authenticated but not connected.
> A0004 LOGOUT
< * BYE Microsoft Exchange Server 2016 IMAP4 server signing off.
< A0004 OK LOGOUT completed.

My recollection of the IMAP protocol is somewhat fuzzy, but I believe the transaction marked A0002 is illegal according to the spec; a LOGIN should return an error if the login couldn’t occur, but a quick Google suggests this is a credentials issue.
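To pin down exactly which command fails in a Fetchmail log like the one above, something along these lines would do. It's only a rough sketch: `first_failure` and the two regular expressions are my own illustrative names, and it assumes the `<`/`>` line prefixes described above and Fetchmail-style tags such as `A0003`.

```python
import re

# A tagged status reply as logged by Fetchmail, e.g.
# "< A0003 BAD User is authenticated but not connected."
TAGGED = re.compile(
    r"^<\s+(?P<tag>[A-Za-z]\d+)\s+(?P<status>OK|NO|BAD)\b\s*(?P<text>.*)$"
)
# A command sent by the client, e.g. '> A0003 SELECT "INBOX"'
SENT = re.compile(r"^>\s+(?P<tag>[A-Za-z]\d+)\s+(?P<command>\S+)")


def first_failure(transcript):
    """Return (tag, command, status, text) for the first non-OK tagged reply.

    Returns None if every tagged reply in the transcript is OK.
    """
    commands = {}
    for line in transcript.splitlines():
        line = line.strip()
        sent = SENT.match(line)
        if sent:
            # Remember which command each tag belongs to.
            commands[sent["tag"]] = sent["command"]
            continue
        reply = TAGGED.match(line)
        if reply and reply["status"] != "OK":
            tag = reply["tag"]
            return (tag, commands.get(tag), reply["status"], reply["text"])
    return None
```

Run over the failing session above, this picks out the SELECT tagged A0003 with status BAD; the LOGIN tagged A0002 is passed over because the server answered OK to it, which is precisely the confusing part.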

This appears to be reproducible on all servers in the DNS pool for outlook.office365.com.

I’ve attached complete mail logs showing a working and failed exchange from my mail server, gelfort.rulingia.com.au (103.243.244.19). These are timestamped in Australian Eastern Standard Time.

I’d greatly appreciate assistance in getting this issue resolved promptly, as this is a proper loss of service, and degrading to the Outlook web or mobile clients is categorically awful.


reply from Riyanal Chea at UNSW IT Service Centre, 2017-09-20 09:17

Dear Jashank,

It could be issue with server during last week or during the time you tried to access your Office 365 mailbox.

Since you are using a third party software accessing your mailbox via IMAP, it could be many issue that may cause the issue rather than the Office 365 mailbox.

Please also note that from 31st of October 2017, you won’t be able to connect to your Office 365 mailbox via IMAP anymore as Microsoft retire the RPC over Http service.

This means any email clients using IMAP will no longer work. But with that said, I am not sure why your IMAP client stop working all of sudden.

Perhaps it is best if you use Outlook client with appropriate update. You can also access your Office 365 mailbox via web (OWA).


my response, 2017-09-20 11:21

At 2017-09-19 23:14:54 +0000, Riyanal Chea wrote:
> Since you are using a third party software accessing your mailbox via
> IMAP, it could be many issue that may cause the issue rather than the
> Office 365 mailbox.
>
> But with that said, I am not sure why your IMAP client stop working
> all of sudden.

IMAP is a well-defined standard. An issue like this would mean users accessing the service using any compliant IMAP client will see the same issue. I’ve tried a number of IMAP clients, and have found this issue in every one of them.

One resource on the Internet suggests this is either

- an actual bug in Exchange (which wouldn’t surprise me at all),
- my password changing,
- the Office365 service is “protecting” my account from frequent accesses (I poll every four minutes, as Exchange’s IMAP IDLE implementation is unreliable), or
- that my mailbox has become shared.

As far as I’m aware (though please correct me if this has changed), my mailbox isn’t shared, and there haven’t been any changes (visible to mere mortals, anyway) to the Exchange service. Last I checked, my zPass was correct, and I’ve used those credentials to log into both UNSW services (like CAsd) and Office365 services.

> Please also note that from 31st of October 2017, you won’t be able to connect to your Office 365 mailbox via IMAP anymore as Microsoft retire the RPC over Http service. This means any email clients using IMAP will no longer work.

Uh, what? The IMAPv4 service is a totally separate connector to the classic Exchange RPC-over-HTTP service. Disabling that connector shouldn’t affect other protocols like IMAP or POP3 (unless Exchange is even more badly designed than I thought).

I’ve discovered that the POP3 endpoint still works, and am in the process of degrading my infrastructure to use it. This is a totally non-optimal solution, due to deficiencies in the POP3 protocol.

> Perhaps it is best if you use Outlook client with appropriate update.

Not an option:

- I don’t run Windows or OS X.
- I digitally sign my email.
- I prefer MUAs that comply with Internet standards.

> You can also access your Office 365 mailbox via web (OWA).

As noted in my original email:

> degrading to the Outlook web or mobile clients is categorically awful.

I look forward to this service being restored; thanks for your investigations.


reply from Riyanal Chea, 2017-09-20 14:16

Can you send me screen shot of your imap settings?


my response, 2017-09-20 14:32

The best I can do is this plain-text Fetchmail configuration:

# fetchmail(1) config for jashank@gelfort.rulingia.com.au
# see <http://www.fetchmail.info/fetchmail-man.html>

# only one `poll` can `idle`; otherwise, polls with interval $daemon
set daemon 240

# unsw.edu.au: maillard -> zmail (fmrc on home.ri) -> office365
poll outlook.office365.com proto imap ssl port 993
    user "z5017851@ad.unsw.edu.au"
    fetchall
    no rewrite # don't rewrite From header for Dovecot spooling
    mda "/usr/local/libexec/dovecot/dovecot-lda -e -d jashank -m UNSW"
    #idle # XXX totally boned by broken imap in msexchange

Over the last few hours, I’ve seen a huge spike in intermittency: mail is now unreliably available over IMAP, which is an improvement on no service at all.


reply from Dawesh Chand at UNSW IT Service Centre, 2017-09-25 13:25

As UNSW is using Office 365 as its email platform, investigation of this would be best handled by Microsoft. We can log a support ticket for you with Microsoft. Please provide the best contact number on which a Microsoft support engineer can contact you.


my response, 2017-09-27 14:11

(By this point, I’d had enough of this delightful dance, so I went and diagnosed and worked around it myself.)

Hi there,

At 2017-09-25 13:24:39 +1000, Dawesh Chand wrote:
> As UNSW is using Office 365 as its email platform, investigation of
> this would be best handled by Microsoft.

I doubt this, and I wasted a pleasant Saturday evening investigating and resolving this issue myself. As I expected, my infrastructure was working correctly, and Microsoft’s cloud “offering” is at fault.

Here’s the summary. Put this in the knowledge base, because it looks like nobody (else) knows this.

My infrastructure runs on a system in the Equinix SY3 data centre. To save time and effort, I don’t run a fully recursive, authoritative DNS resolver; instead, I delegate to my upstream service provider, who (perfectly validly) use Google’s Public DNS service.

Google don’t operate a non-trivial point of presence in Australia; most requests are served from somewhere in south-eastern Asia, likely in Singapore, but possibly also Taiwan.

Microsoft’s DNS infrastructure does geolocation, apparently based on the subnet of the resolver; in this case, the actual resolver would be in the Google facility serving south-eastern Asia.

A well-supported DNS extension exists, EDNS0 Client Subnet, described in RFC 7871, which allows a public resolver to pass along information about the client’s subnet, so that a geographically appropriate answer can be returned. Bizarrely for what is now such a common protocol extension, Microsoft don’t implement this in their DNS infrastructure, so requests relayed through a public resolver always appear to originate from the resolver’s location rather than the client’s.
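For the curious, the option itself is tiny. The following is a minimal sketch of how a resolver could encode it, following the wire layout in RFC 7871; `build_ecs_option` is a hypothetical helper of my own, not part of any DNS library, and the /24 prefix is just an example choice.

```python
import ipaddress
import struct

ECS_OPTION_CODE = 8  # EDNS option code assigned to Client Subnet (RFC 7871)


def build_ecs_option(address, prefix_len):
    """Encode a Client Subnet option for `address`, cut down to `prefix_len` bits."""
    # strict=False lets us pass a host address; the host bits are masked off.
    network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
    family = 1 if network.version == 4 else 2  # 1 = IPv4, 2 = IPv6
    # Only the octets needed to cover the prefix are sent.
    octets = (prefix_len + 7) // 8
    truncated = network.network_address.packed[:octets]
    # FAMILY, SOURCE PREFIX-LENGTH, SCOPE PREFIX-LENGTH (zero in a query), ADDRESS
    payload = struct.pack("!HBB", family, prefix_len, 0) + truncated
    # OPTION-CODE and OPTION-LENGTH come before the payload.
    return struct.pack("!HH", ECS_OPTION_CODE, len(payload)) + payload
```

For my mail server's address, `build_ecs_option("103.243.244.19", 24)` produces an 11-byte option that carries only `103.243.244`: enough for a geolocating name server to work out roughly where the client is, if it bothers to look.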

Now, from my laptop, which is currently on Uniwide, a DNS resolution debugging tool (like dig(1) or drill(1)) reports:

jashank@jaenelle:~$ dig outlook.office365.com
;; ANSWER SECTION:
outlook.office365.com.  64      IN      CNAME   outlook.ha.office365.com.
outlook.ha.office365.com. 5     IN      CNAME   outlook.office365.com.g.office365.com.
outlook.office365.com.g.office365.com. 98 IN CNAME outlook-au.office365.com.
outlook-au.office365.com. 197   IN      A       40.100.144.226
outlook-au.office365.com. 197   IN      A       40.100.144.242
outlook-au.office365.com. 197   IN      A       40.100.145.146
outlook-au.office365.com. 197   IN      A       40.100.145.162
outlook-au.office365.com. 197   IN      A       40.100.151.2
outlook-au.office365.com. 197   IN      A       40.100.151.18
outlook-au.office365.com. 197   IN      A       40.100.151.114
outlook-au.office365.com. 197   IN      A       40.100.151.130

From my workstation, which resolves using an Australian DNS service (operated by IIPC):

jashank@alyzon:~$ drill outlook.office365.com
;; ANSWER SECTION:
outlook.office365.com.  33      IN      CNAME   outlook.ha.office365.com.
outlook.ha.office365.com.       33      IN      CNAME   outlook.office365.com.g.office365.com.
outlook.office365.com.g.office365.com.  273     IN      CNAME   outlook-au.office365.com.
outlook-au.office365.com.       90      IN      A       40.100.151.2
outlook-au.office365.com.       90      IN      A       40.100.151.18
outlook-au.office365.com.       90      IN      A       40.100.145.146
outlook-au.office365.com.       90      IN      A       40.100.145.162
outlook-au.office365.com.       90      IN      A       40.100.144.242
outlook-au.office365.com.       90      IN      A       40.100.151.114
outlook-au.office365.com.       90      IN      A       40.100.151.130
outlook-au.office365.com.       90      IN      A       40.100.144.226

And finally, from my mail infrastructure, previously described:

jashank@gelfort:~$ drill outlook.office365.com
[...]
;; ANSWER SECTION:
outlook.office365.com.  52      IN      CNAME   outlook.ha.office365.com.
outlook.ha.office365.com.       10      IN      CNAME   outlook.office365.com.g.office365.com.
outlook.office365.com.g.office365.com.  198     IN      CNAME   outlook-apacsouth.office365.com.
outlook-apacsouth.office365.com.        171     IN      A       40.100.17.34
outlook-apacsouth.office365.com.        171     IN      A       40.100.54.226
outlook-apacsouth.office365.com.        171     IN      A       40.100.29.226
outlook-apacsouth.office365.com.        171     IN      A       40.100.29.34
outlook-apacsouth.office365.com.        171     IN      A       40.100.54.2
[...]

Microsoft, of course, operate multiple data centres, all behind their high-availability and geolocation services. One nice property of a HA service is that every node in the HA cluster should appear identical, so they can be used in DNS round-robin A-records, for example.

(Another HA approach is clever tricks with BGP; I’m much less familiar with this, having never set it up, but it’s in common use by Google, Akamai, and Telstra, off the top of my head – it’s much more a carrier-grade solution.)

The geolocation step has, however, given a totally different DNS RR, as Microsoft have decided my infrastructure isn’t in Australia but rather in the service zone for apacsouth.
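Comparing these answers by eye gets tedious, so here is a rough sketch that boils an ANSWER SECTION down to the name at the end of the CNAME chain and its A records. `summarise_answer` is an illustrative name of my own, and it assumes the five-column `name TTL class type value` layout that dig(1) and drill(1) print above.

```python
def summarise_answer(answer_section):
    """Return (final_name, sorted_addresses) from dig/drill ANSWER SECTION text."""
    cnames = {}
    addresses = {}
    for line in answer_section.splitlines():
        fields = line.split()
        # Expect: name, TTL, class, type, value. Skip comments and elisions.
        if len(fields) != 5 or fields[2] != "IN":
            continue
        name, _ttl, _cls, rtype, value = fields
        if rtype == "CNAME":
            cnames[name] = value
        elif rtype == "A":
            addresses.setdefault(name, []).append(value)
    # Start from the name that no CNAME points at, i.e. the name queried.
    targets = set(cnames.values())
    starts = [n for n in cnames if n not in targets]
    name = starts[0] if starts else next(iter(addresses), None)
    # Follow the chain, guarding against loops.
    seen = set()
    while name in cnames and name not in seen:
        seen.add(name)
        name = cnames[name]
    return name, sorted(addresses.get(name, []))
```

Fed the three outputs above, the first two end at `outlook-au.office365.com.` and the third at `outlook-apacsouth.office365.com.`, which is the whole problem in one line.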

In a sensible world, this shouldn’t cause a problem: your HA should already deal with this, and if you’re running infrastructure across multiple data centres, you obviously already have a backbone tying them together anyway.

So, requests to outlook.office365.com, whether handled by the RR record for outlook-apacsouth.office365.com or outlook-au.office365.com, should behave totally identically, right? Because your HA already accounts for that, and if one data centre can’t handle a request, it gets passed over the backbone to be handled elsewhere.

What a brilliant idea! Your infrastructure becomes even more robust, and is totally geographically transparent. It’s a pity the network infrastructure clowns at Microsoft don’t seem to have realised it yet.

So, if I use outlook-au.office365.com as an IMAP server, my mailbox appears as normal, and all works well. At the moment, I don’t have any mail:

< * OK The Microsoft Exchange IMAP4 service is ready. [...]
> A0001 CAPABILITY
< * CAPABILITY IMAP4 IMAP4rev1 AUTH=PLAIN AUTH=XOAUTH2 SASL-IR UIDPLUS ID UNSELECT CHILDREN IDLE NAMESPACE LITERAL+
< A0001 OK CAPABILITY completed.
> A0002 LOGIN "z5017851@ad.unsw.edu.au" *
< A0002 OK LOGIN completed.
> A0003 SELECT "INBOX"
< * 0 EXISTS
< * 0 RECENT
< * FLAGS (\Seen \Answered \Flagged \Deleted \Draft $MDNSent)
< * OK [PERMANENTFLAGS (\Seen \Answered \Flagged \Deleted \Draft $MDNSent)] Permanent flags
< * OK [UIDVALIDITY 14] UIDVALIDITY value
< * OK [UIDNEXT 20517] The next unique identifier value
< A0003 OK [READ-WRITE] SELECT completed.
> A0004 LOGOUT
< * BYE Microsoft Exchange Server 2016 IMAP4 server signing off.
< A0004 OK LOGOUT completed.
No mail for z5017851@ad.unsw.edu.au at outlook-au.office365.com

But if I ask outlook-apacsouth.office365.com:

< * OK The Microsoft Exchange IMAP4 service is ready. [...]
> A0001 CAPABILITY
< * CAPABILITY IMAP4 IMAP4rev1 AUTH=PLAIN AUTH=XOAUTH2 SASL-IR UIDPLUS MOVE ID UNSELECT CHILDREN IDLE NAMESPACE LITERAL+
< A0001 OK CAPABILITY completed.
> A0002 LOGIN "z5017851@ad.unsw.edu.au" *
< A0002 OK LOGIN completed.
> A0003 SELECT "INBOX"
< A0003 BAD User is authenticated but not connected.
> A0004 LOGOUT
< * BYE Microsoft Exchange Server 2016 IMAP4 server signing off.
< A0004 OK LOGOUT completed.
client/server synchronization error while fetching from z5017851@ad.unsw.edu.au@outlook-apacsouth.office365.com

Ah, delightful.

And apparently, nobody within the Outlook group is aware of this bug. Their “connect to the service using IMAP” interface cleverly tells me to connect to outlook.office365.com, which is Just Plain Wrong.
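If you want to check which regional names behave for your own mailbox, a probe along these lines should do it. This is a sketch using Python's standard `imaplib`, not what Fetchmail does internally; `probe_inbox` and its `connect` parameter are illustrative names of my own, and the credentials are yours to supply.

```python
import imaplib


def probe_inbox(host, user, password, connect=imaplib.IMAP4_SSL):
    """Log in to `host`, try to SELECT INBOX, and return a short status string."""
    conn = connect(host)  # IMAP4_SSL defaults to port 993
    try:
        conn.login(user, password)
        # readonly=True makes imaplib issue EXAMINE, so no flags are touched
        status, _data = conn.select("INBOX", readonly=True)
        return f"{host}: SELECT {status}"
    except imaplib.IMAP4.error as exc:
        # imaplib raises this for a rejected LOGIN and for a tagged BAD reply
        return f"{host}: failed ({exc})"
    finally:
        try:
            conn.logout()
        except Exception:
            pass  # the connection may already be gone
```

Calling it once with `outlook-au.office365.com` and once with `outlook-apacsouth.office365.com` should show the same split as the two Fetchmail sessions above, assuming the fault is still present.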

AAAAAAAAAAARGH.

At 2017-09-25 13:24:39 +1000, Dawesh Chand wrote:
> We can log a support ticket for you with Microsoft. Please provide
> the best contact number on which a Microsoft support engineer can
> contact you.

I’d greatly appreciate the opportunity to howl with deranged fury at a Microsoft support engineer for the sheer incompetence displayed in the Office365 platform.

~jashank