Blog Archives

How I Internet

Reading Online

Looking at my recent blog history, you’ll find that it has been rather book-centric. This is largely a function of a quick book review being easier to write than a longer, more personal post; however, it belies how much of my time I actually spend reading books. I sometimes bemoan the fact that I read less than I used to, but I think I can chalk that behavior up to three factors:

  • I read a lot more in high school
  • I still get to read more than most people
  • I now read more content online

The first point is part of growing up, and the second point is part of a larger sociological question that I’m not qualified to address, so I’ll focus on the third point: how and where do I find and read short- and long-form content on the web? The list probably won’t be too surprising (Twitter, Facebook, blogs, news sites, etc.), but I’ll go into more detail on what clients I use to keep track of everything. It should not be surprising that my acquisition of an iPad in April of 2010 significantly changed how I interact with text online.

This has been a topic kicking around my head for close to a year, since I spend a lot of time connected, although some of my reading/archiving methods have changed over time. The most recent inspiration to write this up was a discussion I had with my mom back in October about how to save articles that she finds online, the way one might clip an article from a physical newspaper. Another one was this post from Brett Nordquist in May of last year about personal online recommendations, in which we happen to use a lot of the same sources/services.

Below the cut, my rather verbose recommendations on how to quickly filter a wide variety of text content online for eventual reading.


Posted in Reviews, Social Media

I Found a Twitter Bug!

I found a Twitter bug! Hah!

Specifically, certain characters which must be escaped in the GSM 03.38 character encoding are getting treated as the wrong encoding when posted to Twitter from Verizon Wireless SMS, and showing up as ? in text messages sent by Twitter to Verizon Wireless customers via SMS.

I should add that I didn’t find this bug alone – @elliotreed asked why I used question marks to note something in a tweet when I had actually used square brackets around some text. Some quick investigation with him revealed the more specific nature of the problem, but it wasn’t until I actually found out that there was such a thing as GSM encoding that I came up with a hypothesis to explain the character weirdness.

As far as I can tell, Verizon’s HTTP/SMS gateway is now doing the GSM/UTF-8 mapping internally, but Twitter is assuming it still has to send GSM bytes to Verizon, so the encoding is happening twice, or at least attempting to happen twice. Verizon chokes on the GSM two-byte characters, since an escape byte followed by a second byte doesn’t mean anything as UTF-8 text, while Twitter receives certain ASCII-range one-byte UTF-8 characters but converts them as if they were GSM one-byte characters, resulting in a totally different UTF-8 character!

The GSM-to-UTF-8 encoding bug, shown here for square brackets, curly braces, tilde, backslash, and caret.

The GSM encoding doesn’t allow certain characters as single-byte characters; this appears to be a way to shove a number of European characters into a 7-bit mutant ASCII, with control characters and certain punctuation replaced by characters from the Latin-1 codepage. To some extent this makes sense, given that with the 160-character length limit on SMS messages (140 bytes, packed seven bits to a character) you want to avoid multibyte encodings while still supporting commonly used characters (UTF-16 is used for non-roman languages). Unfortunately, this leaves [, ], ~, {, }, \, |, and ^ out in the cold. As a programmer, I use these punctuation characters often as separators in various notations, so it is perhaps not surprising that one of my tweets revealed the problem. These characters can be sent as a two-byte sequence in the GSM encoding, but those start with an escape byte 0x1B, which a UTF-8 decoder reads as the ASCII ESC control character rather than as part of a printable character.
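To make the hypothesis concrete, here is a minimal Python 3 sketch of the two halves of the suspected double conversion. The byte values come from the GSM 03.38 tables; the function names are made up for illustration, and none of this is Twitter's or Verizon's actual code.

```python
# Hypothetical illustration of the suspected bug; function names are invented.

# GSM 03.38 extension table: these are sent as ESC (0x1B) plus a second byte.
GSM_ESCAPED = {
    "^": 0x14, "{": 0x28, "}": 0x29, "\\": 0x2F,
    "[": 0x3C, "~": 0x3D, "]": 0x3E, "|": 0x40,
}

# What those characters' ASCII byte values mean in the GSM default alphabet.
GSM_SINGLE_BYTE = {
    0x5B: "Ä", 0x5C: "Ö", 0x5D: "Ñ", 0x5E: "Ü",
    0x7B: "ä", 0x7C: "ö", 0x7D: "ñ", 0x7E: "ü",
}


def encode_gsm(text):
    """Encode text as unpacked GSM bytes, escaping the extension characters.

    Simplification: every other character is passed through with its ASCII
    value, which is only right for letters, digits, space, and the
    punctuation that GSM and ASCII share (not @, $, _ or `).
    """
    out = bytearray()
    for ch in text:
        if ch in GSM_ESCAPED:
            out += bytes([0x1B, GSM_ESCAPED[ch]])
        else:
            out.append(ord(ch))
    return bytes(out)


def misread_ascii_as_gsm(text):
    """Reinterpret ASCII bytes as GSM bytes (the suspected inbound mistake)."""
    return "".join(GSM_SINGLE_BYTE.get(ord(ch), ch) for ch in text)


if __name__ == "__main__":
    print(misread_ascii_as_gsm("see [1]"))  # brackets become Ä and Ñ
    print(encode_gsm("[1]"))                # two ESC-prefixed pairs around "1"
```

The first call mimics a tweet sent from a phone, where plain ASCII brackets get mapped as if they were GSM bytes. The second shows the escaped bytes that would be handed to a gateway that is now expecting UTF-8.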

I would have thought that the Age of Unicode would have ended many of these non-standard application-specific encodings (and plus, given the way mobile carriers love to gouge on SMS, if they make your characters take more bytes, they get more money!). It looks like that’s exactly what Verizon is trying to do, in moving to exposing UTF-8 on the edge of their network… they just didn’t tell anyone that they had changed encodings, or if they have, Twitter hasn’t acted on the change yet.

Since Twitter disabled their help ticket creation (probably because too many stupid people were posting the same questions without reading the FAQs), I reported the bug using the Twitter API ticketing system on Google Code.

Short story: if you use any of the punctuation characters above in your tweets, expect texting Twitter users with Verizon to see ?, and expect to receive tweets from them with weird European characters, until this is fixed by one or both parties.

Posted in Computers, Software

Tweetworks Python API

Version 1.0.0b1 of the tweetworks package for Python 2.6 is now available. This package implements the web service API for Tweetworks, a Web 2.0 service that facilitates threaded conversations on top of Twitter.

This is definitely a beta, because while I’ve tested everything I can think of, I haven’t tried writing anything seriously complicated with it, although I certainly plan to. Comments and questions are welcome here, or find me in the Tweetworks Developers group or as @UltraNurd. I admit that the documentation is a little light at the moment.

If you’re interested in using Tweetworks programmatically from Python, or want to know more about the service, read on.


Posted in Code Projects

Connections at the MIT Museum

Today I took my Little Brother Patrick to the MIT Museum near Central Square. I had seen a blurb on their website about an exhibit on social media, and I wanted to check it out (and, since it was billed as interactive, I thought he would enjoy it as well, even though he’s 11).

The exhibit, Connections, features a number of interactive art and technology installations from MIT’s Sociable Media Group. I took a few mediocre iPhone pictures of some of the displays, all of which were very interesting. I love the cool stuff that arises when art and technology collide.

The first thing you see when you walk into the museum right now is the piece Metropath(ologies), which consists of several projectors pointed at some big white pillars, plus some speakers, a camera, and several screens. Technically, the first thing you see is the disclaimer that your image and voice may be recorded when interacting with the piece, but I think that’s really cool.



Twitter word clouds projected onto the white pillars of the Metropath(ologies) installation at the MIT Museum, with a visitor moving among them.

They also have some interesting ways of visualizing some Twitter (and other?) feeds; based on the posts, they’re from about 2 weeks ago, not live, but I’m not sure what kind of harvesting they do to produce the visualization. As part of another piece, Lexigraphs I, they also have some stylized views of personal word clouds.


3-D visualization of Twitter posts from a few weeks ago


Person-shaped Twitter word cloud

One of the other Data Portraits was the piece Themail, which gives a timeline word cloud of personal e-mail correspondence between three close individuals. I have e-mail archives going back a long time, and I’d be interested in seeing what these look like over time; aggregated they’d just be a roughly Zipfian distribution of English, but presumably there’d be visible spikes as certain topics came and went.
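As a rough idea of how such spikes could be pulled out of an archive, here is a small Python 3 sketch. It is not how Themail works (I don't know its method); it simply ranks each month's words by how over-represented they are compared to the whole archive. The function names and the (month, text) input format are assumptions made for the example.

```python
from collections import Counter, defaultdict


def words(text):
    """Lowercase a message and split it into alphabetic words."""
    cleaned = "".join(c.lower() if c.isalpha() else " " for c in text)
    return cleaned.split()


def spiking_words(messages, top=3):
    """Return the most over-represented words for each month.

    messages is an iterable of (month, text) pairs, e.g. ("2009-03", "...").
    A word's score is its share of the month's words divided by its share
    of all words in the archive, so everyday words like "the" score near 1.
    """
    overall = Counter()
    by_month = defaultdict(Counter)
    for month, text in messages:
        tokens = words(text)
        overall.update(tokens)
        by_month[month].update(tokens)

    total = sum(overall.values())
    result = {}
    for month, counts in sorted(by_month.items()):
        month_total = sum(counts.values())
        score = {
            w: (n / month_total) / (overall[w] / total)
            for w, n in counts.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(score, key=lambda w: (-score[w], w))
        result[month] = ranked[:top]
    return result


if __name__ == "__main__":
    sample = [
        ("2009-01", "the move the boxes the truck"),
        ("2009-02", "the wedding the cake the wedding"),
    ]
    print(spiking_words(sample, top=2))
```

On a real archive you would want far more text per month and probably a minimum count per word, since rare words get inflated scores with this kind of ratio.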


Timelines for three individuals' e-mail accounts over 3+ years

Finally, there was a live display of data from the Mycrocosm service, of which I couldn’t get a reasonable picture. It seems to be very similar in concept to the tool Daytum, which I started using several weeks ago. The big difference is that MIT explicitly wants to study your usage patterns on the Mycrocosm service, while Daytum has a nice Twitter DM method for submitting data items while mobile.

The exhibit is up through September 13th, 2009, so there’s plenty of time to check it out. There are admissions discounts for students, but it’s free if you’re a Big Brother (or Big Sister) there with your Little :oP.




You can click any of the images above to view a larger version, or see the entire (small) gallery.

Posted in Reviews, Social Media

Compiling Django with Twitter support as a Mac OS X Universal Binary


This post is a guide for building your own version of Apache’s mod_python as a Universal Binary in order to support a custom Django install containing the Twitter libraries. As you can probably gather, this information is likely only useful to advanced Mac users who are comfortable in Terminal with compiling and installing software from source. If you’re still interested, gird your loins, crack your knuckles, grab some Mountain Dew, and read on.

Mac OS X 10.5 “Leopard” is yet another step forward into the world of 64-bit. At the same time, Apple has to support both PowerPC and Intel architectures. This is no mean feat, and this is where “fat” or Universal binaries come in.  Apple also has an explanation of Universal binaries, although it’s heavy on PR. This is all well and good, but there is one problem: once you make this leap, all of your library dependencies must contain the architecture you’re running as. Much software is still built as 32-bit only; while it may be a “fat” binary, containing both Intel and PowerPC machine code, it only has the 32-bit versions thereof. For reference, the names of the various architecture flags:

           32-bit     64-bit
Intel      i386       x86_64
PowerPC    ppc7400    ppc64

Huzzah naming conventions! There’s a lot of history in those names. I’ve linked to the relevant Wikipedia articles if you’re curious; these flags will be coming up again later when configuring various builds. The main thing to note is that most build configurations default to i386 on Intel Macs (even though Core 2 and Xeon processors are natively 64-bit), probably because most software is developed for 32-bit versions of Windows and Linux. As you’ll see, we’ll be overriding that default in several places to get this whole mess working.

Unfortunately, Universality is a cancer, which in my case starts with the Apple-shipped version of the Apache web server in 10.5, a universal binary. Everything it touches needs to be Universal as well, so that Apache can run as a 64-bit process by default. I wanted to add Django support on my web server via mod_python, specifically to play with the Twitter API, which meant I also needed to build python-twitter and its dependencies, as well as a MySQL python module to allow Django to talk to my database. None of these are included in the default Leopard version of Python 2.5.1.

After getting all of this set up, and trying to start my test Django app, mod_python was giving me errors about architecture. As it turns out, the included version of Python is only a “fat” 32-bit binary, not a Universal binary… which means all of the new Python modules I just compiled to support Twitter and Django were only 32-bit, which in turn means that the included Universal version of Apache and mod_python couldn’t use them. Yay.
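A quick way to confirm which flavor a given Python is actually running as is to ask the interpreter itself. This is a minimal sketch using only the standard library, and the pointer_bits name is just something picked for the example. Run it with the same Python binary that mod_python loads, since a different install on your PATH may give a different answer.

```python
import platform
import struct


def pointer_bits():
    """Return the pointer width of the running interpreter, in bits."""
    # "P" is the struct format code for a C void pointer.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("interpreter is running as %d-bit" % pointer_bits())
    print("machine type reported by the OS: %s" % platform.machine())
```

A 32-bit result here, alongside an Apache that runs as 64-bit, would be consistent with the architecture errors described above. Note that this only reports the slice of the binary that is currently running, not every architecture the file contains.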

Below the cut you’ll find my complete instructions for compiling all of the relevant components and their dependencies. I also took the opportunity to update to the latest release version of Python 2.6 and MySQL 5.1, and as a side effect my database server is now running as a 64-bit process. Progress has been made here. Feel free to comment or contact me if you have questions.


Posted in How-Tos

Nicolas Ward

Software engineer in Natural Language Processing research by day; gamer, reader, and aspiring UltraNurd by night. Husband to Andrle
Creative Commons License
