I Found a Twitter Bug!


I found a Twitter bug! Hah!

Specifically, certain characters which must be escaped in the GSM 03.38 character encoding are being decoded with the wrong encoding when posted to Twitter from Verizon Wireless SMS, and are showing up as ? in text messages sent by Twitter to Verizon Wireless customers via SMS.

I should add that I didn’t find this bug alone – @elliotreed asked why I used question marks to note something in a tweet when I had actually used square brackets around some text. Some quick investigation with him revealed the more specific nature of the problem, but it wasn’t until I actually found out that there was such a thing as GSM encoding that I came up with a hypothesis to explain the character weirdness.

As far as I can tell, Verizon’s HTTP/SMS gateway is now doing the GSM/UTF-8 mapping internally, but Twitter is assuming it still has to send GSM bytes to Verizon, so the encoding is happening twice, or at least attempting to happen twice. In one direction, Verizon chokes on the GSM two-byte sequences, since their leading escape byte isn’t a printable character in UTF-8. In the other, Twitter receives certain ASCII-range one-byte UTF-8 characters but converts them as if they were GSM one-byte characters, resulting in a totally different UTF-8 character!
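Here is a rough sketch of the second half of that hypothesis: what happens when plain ASCII punctuation is wrongly decoded as GSM 03.38 default-alphabet characters. The table values come from the GSM 03.38 default alphabet; the names `GSM_AT_ASCII_POSITIONS` and `misread_ascii_as_gsm` are made up for illustration, and this is only my guess at the behavior, not Twitter's actual code.

```python
# Illustration only: a few GSM 03.38 default-alphabet code points that sit
# where ASCII keeps its bracket-like punctuation. This is not a full codec;
# other positions differ too (for example '@', '$' and '_'), and this sketch
# passes those through unchanged.
GSM_AT_ASCII_POSITIONS = {
    0x5B: "Ä",  # ASCII '['
    0x5C: "Ö",  # ASCII '\'
    0x5D: "Ñ",  # ASCII ']'
    0x5E: "Ü",  # ASCII '^'
    0x7B: "ä",  # ASCII '{'
    0x7C: "ö",  # ASCII '|'
    0x7D: "ñ",  # ASCII '}'
    0x7E: "ü",  # ASCII '~'
}


def misread_ascii_as_gsm(text: str) -> str:
    """Return text as it would look if its ASCII codes were treated as GSM codes."""
    return "".join(GSM_AT_ASCII_POSITIONS.get(ord(ch), ch) for ch in text)


print(misread_ascii_as_gsm("[note] {a}"))  # ÄnoteÑ äañ
```

Letters, digits, and spaces have the same codes in both encodings, which is why only the punctuation comes out wrong.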

The GSM-to-UTF-8 encoding bug, shown here for square brackets, curly braces, tilde, backslash, and caret.

The GSM encoding doesn’t allow certain characters as single-byte characters; it appears to be a way to shove a number of European characters into a 7-bit mutant ASCII, with control characters and certain punctuation replaced by characters from the Latin-1 codepage. To some extent this makes sense: with SMS messages limited to 140 bytes (160 seven-bit characters), you want to avoid multibyte encodings while still supporting commonly used characters (a 16-bit encoding, UCS-2, is used for non-Roman languages). Unfortunately, this leaves [, ], ~, {, }, \, |, and ^ out in the cold. As a programmer, I use these punctuation characters often as separators in various notations, so it is perhaps not surprising that one of my tweets revealed the problem. These characters can be sent as a two-byte sequence in the GSM encoding, but each sequence starts with the escape byte 0x1B, which read as UTF-8 is the non-printable ESC control character rather than the start of any printable character.
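To make the escape mechanism concrete, here is a minimal sketch of how those eight characters are encoded. The second-byte values are from the GSM 03.38 extension table; the names `GSM_EXTENSION` and `encode_punctuation_as_gsm` are hypothetical, and the sketch assumes every other character keeps its ASCII code, which holds for letters, digits, and spaces but not for every symbol.

```python
GSM_ESCAPE = 0x1B  # escape byte that introduces an extension-table character

# Second byte of the two-byte sequence for each escaped character.
GSM_EXTENSION = {
    "^": 0x14,
    "{": 0x28,
    "}": 0x29,
    "\\": 0x2F,
    "[": 0x3C,
    "~": 0x3D,
    "]": 0x3E,
    "|": 0x40,
}


def encode_punctuation_as_gsm(text: str) -> bytes:
    """Encode ASCII text as unpacked GSM codes, escaping the extension characters.

    Simplification: non-escaped characters are emitted with their ASCII code,
    so symbols such as '@' or '$' would come out wrong in a real codec.
    Non-ASCII input raises UnicodeEncodeError.
    """
    out = bytearray()
    for ch in text:
        if ch in GSM_EXTENSION:
            out += bytes([GSM_ESCAPE, GSM_EXTENSION[ch]])
        else:
            out += ch.encode("ascii")
    return bytes(out)


encoded = encode_punctuation_as_gsm("[hi]")
print(encoded)       # b'\x1b<hi\x1b>'
print(len(encoded))  # 6 codes for a 4-character message
```

Each escaped character takes up two positions in the message instead of one, so a tweet full of brackets eats into the SMS length limit faster than plain text does.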

I would have thought that the Age of Unicode would have ended many of these non-standard, application-specific encodings (plus, given the way mobile carriers love to gouge on SMS, making your characters take more bytes would make them more money!). Moving away from such encodings looks like exactly what Verizon is trying to do by exposing UTF-8 at the edge of their network… they just didn’t tell anyone that they had changed encodings, or if they did, Twitter hasn’t acted on the change yet.

Since Twitter disabled their help ticket creation (probably because too many stupid people were posting the same questions without reading the FAQs), I reported the bug using the Twitter API ticketing system on Google Code.

Short story: if you use any of the punctuation characters above in your tweets, expect texting Twitter users with Verizon to see ?, and expect to receive tweets from them with weird European characters, until this is fixed by one or both parties.


2 responses to “I Found a Twitter Bug!”

  1. I thought Verizon used CDMA, not GSM? Do I misunderstand the terms, etc? Please advise.

  2. The problem is just the GSM text encoding, which was so named because it was developed as part of adding SMS to GSM (as far as I can tell). I would assume Verizon used (and perhaps still does, when you ignore the HTTP gateway) the encoding for SMS compatibility with other networks, although you could imagine them doing the translation at the edge of their network. This doesn’t have anything to do with the underlying radio technology or how SMS messages are sent, just how the bytes are interpreted.
