Microsoft + Unicode = 😱
December 29th, 2016
Hold onto your hats, everybody, Microsoft did something wrong.
At work, I maintain a daemon which can be issued commands through an email server. This means I have to deal with all of the terrible things email providers do to their outgoing mail. We have ways to fix broken HTML, broken headers, and any terrible combination of the two which SMTP providers throw at us. Today, we added another tool to that kit.
If you send any character from a supplementary plane of Unicode (that is, any code point above U+FFFF), Outlook will do some pretty terrible things to it.
From: redacted <redacted@outlook.com>
Sent: Redacted 01, 2016 12:00 AM
To: redacted
Subject: Redacted
=
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? :)? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??=
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? :(? ?? ?? ?? ?? ?? ?=
? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?=
? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ? ?? ?? ? ?? ?? ?? ?? =
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? =
?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? =
?? ???????? ???????? ?? ??????????? ??????????? ?? ???????? ??????????? ???=
???????? ??????????? ???????? ???????? ??????????? ??????????? ??????????? =
???????? ???????? ??????????? ??????????? ??????????? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ? ?? ?? ?? ?? ?? ?? ?? ?? ?? ?? ??
Uh oh. Those of you with some knowledge of Unicode should know exactly what's wrong, here. Let's see if the HTML source validates our fears.
<div> </div>
<h3>�� �� �� �� �� &#55357;&#=
56839; ☺️ �� �� �� �� �� ... <=
/h3>
Yep. Microsoft is abusing UTF-16 again. The five-digit &#55357;-style escapes everywhere are a dead giveaway.
Some lazy software vendors use UTF-16 as a sort of fixed-width encoding. Most human languages only touch characters in the first 65536 code points, so "UTF-16 is good enough" is a popular misconception. However, to reach characters with code points higher than 65535 (i.e. emoji, semi-rare Asian scripts), UTF-16 needs surrogate pairs: two 16-bit code units, each taking very specific values within 0xD800 to 0xDFFF, which together encode a 20-bit offset into the supplementary planes. This isn't ambiguous because the Unicode standard reserves 0xD800 through 0xDFFF and guarantees those values will never be assigned as code points for characters. It's not pretty, but it sort of works, sometimes. Wikipedia has a pretty good writeup on it here.
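The arithmetic is simple enough to sketch (in Python here rather than a Lisp, purely for illustration): subtract 0x10000, put the top ten bits in the high surrogate and the bottom ten in the low.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    assert cp > 0xFFFF, "only supplementary-plane code points need surrogates"
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

# U+1F600 GRINNING FACE -> the pair Outlook would escape as &#55357;&#56832;
print(to_surrogate_pair(0x1F600))  # (55357, 56832), i.e. (0xD83D, 0xDE00)
```

Note how the two halves land in disjoint sub-ranges: highs in 0xD800-0xDBFF, lows in 0xDC00-0xDFFF, which is what makes the pairing unambiguous.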
In the plaintext, each character is rendered as two question marks, because Microsoft's method for encoding Unicode in HTML is abysmally wrong. They're encoding each code point to UTF-16, and then escaping the two resulting surrogate values as separate numeric character references. Since numeric character references refer to code points, and surrogate values are never valid as characters, those references will never render on a standards-adhering parser.
In fact, the W3C specifies exactly the opposite.
One point worth special note is that values of numeric character references (such as &#8364; or &#x20AC; for the euro sign €) are escaped as Unicode code points – no matter what encoding you use for your document.
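To make the contrast concrete, here is a small Python sketch of both behaviors (the function names are mine): escaping the code point itself, versus escaping each UTF-16 code unit the way Outlook does.

```python
def escape_correct(text: str) -> str:
    """Escape non-ASCII characters as decimal NCRs over code points (the standard way)."""
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

def escape_like_outlook(text: str) -> str:
    """Escape each UTF-16 code unit separately (the broken behavior described above)."""
    raw = text.encode("utf-16-le")
    shorts = [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]
    return "".join(chr(u) if u < 128 else f"&#{u};" for u in shorts)

smiley = "\U0001F600"
print(escape_correct(smiley))       # &#128512;  -- renders everywhere
print(escape_like_outlook(smiley))  # &#55357;&#56832;  -- two surrogate references
```

A standards-adhering parser turns the first string back into the emoji and the second into replacement characters, exactly the difference seen in the two mails above.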
It's a tiny mistake - often inconsequential. All your emojis display in-browser without issue thanks to some workarounds, even if the plaintext mail is totally unusable. Above is some emoji sent from Gmail; below is from Outlook.
The real problem with this is codebases around the globe are now littered with functions like this one:
(defn ms-to-str
  "Convert Microsoft's uniquely terrible unicode transfer method into a string"
  [s]
  ...)
Sadly, I don't own the rights to this function, so I'm not allowed to show much more than that. Below, however, is the general routine for fixing this issue.
- Unescape an escaped integer. This is either a full code point or a surrogate high.
- Check whether the integer is a surrogate high (0xD800 to 0xDBFF). If so, grab the next integer as well; this is the surrogate low.
- If there are any printing characters between a surrogate high and low, Microsoft really blew it and you're out of luck.
- If it's a surrogate high, pack the high and the low (in that order) into a little-endian byte array. Otherwise, pack your solitary short into it.
- Feed this into your trusty UTF-16LE decoder and you'll hopefully get a single code point. If you get two or three code points back, check your endianness and decoder parameters.
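The steps above can be sketched in Python (the original function is not mine to share, and the name fix_outlook_ncrs and the decimal-only regex here are my own assumptions, not its real interface):

```python
import re

_NCR = re.compile(r"&#(\d+);")  # decimal numeric character references only

def fix_outlook_ncrs(html: str) -> str:
    """Re-pair surrogate NCRs (&#55357;&#56832; style) back into real characters."""
    out = []
    pos = 0
    while True:
        m = _NCR.search(html, pos)
        if not m:
            out.append(html[pos:])
            return "".join(out)
        out.append(html[pos:m.start()])
        value = int(m.group(1))
        pos = m.end()
        if value > 0xFFFF:              # already a correct code-point escape
            out.append(chr(value))
            continue
        units = [value]
        if 0xD800 <= value <= 0xDBFF:   # surrogate high: the low must follow at once
            m2 = _NCR.match(html, pos)
            if m2 and 0xDC00 <= int(m2.group(1)) <= 0xDFFF:
                units.append(int(m2.group(1)))
                pos = m2.end()
            else:
                out.append(m.group(0))  # printing chars between high and low: give up
                continue
        elif 0xDC00 <= value <= 0xDFFF:
            out.append(m.group(0))      # orphaned low surrogate: leave it alone
            continue
        # pack the shorts little-endian and hand them to a UTF-16LE decoder
        packed = b"".join(u.to_bytes(2, "little") for u in units)
        out.append(packed.decode("utf-16-le"))

print(fix_outlook_ncrs("hi &#55357;&#56832;"))  # hi 😀
```

Packing through an actual UTF-16LE decode, rather than doing the bit math inline, mirrors the routine above and gets the surrogate arithmetic validated for free.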
Hopefully that will be of some help to somebody. Or, better yet, hopefully Microsoft will encode their code points properly in the future. Every new function I have to add to our parser is another chunk of time wasted on something silly.