The permanence of Twitter

In 2010, the Library of Congress announced it had started archiving all of Twitter.

Have you ever sent out a “tweet” on the popular Twitter social media service?  Congratulations: Your 140 characters or less will now be housed in the Library of Congress.

That’s right.  Every public tweet, ever, since Twitter’s inception in March 2006, will be archived digitally at the Library of Congress. That’s a LOT of tweets, by the way: Twitter processes more than 50 million tweets every day, with the total numbering in the billions.

On April 28, 2010, the LOC published a FAQ about what was collected. It mentioned that only the public tweets would make it into the archive.

Private account information and deleted tweets will not be part of the archive…There will be at least a six-month window between the original date of a tweet and its date of availability for research use.

The last update from the LOC was in January 2013. From it, we learned the LOC had almost finished unpacking the original archive file provided by Twitter. In addition, the LOC is now receiving hourly updates from the Twitter “firehose.”[1]

I might be wrong, but it looks like there is no such thing as a deleted tweet. The firehose is a real-time stream of tweets. The updates to the LOC are gathered from the firehose and uploaded hourly. So, any tweet posted is going to make it into the uploaded file. If the tweet is deleted later, it was still included in the uploaded file.

So, how does the LOC know if a tweet was deleted? I don’t think there’s a way for it to know. From the published information, the LOC isn’t doing any analysis of the tweets. The LOC receives the upload, and moves the tweets in it to permanent storage.[2]

If it receives a tweet in the hourly upload, and that tweet is deleted the next day, how could it tell? I don’t think the LOC is going to search for each tweet to make sure it’s still live. Doing that would mean continually searching the archive and removing the deleted tweets. It would also mean destroying the archive’s value as a historical record. That would defeat the purpose of storing the tweets to begin with.
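To make the reasoning concrete, here is a toy model of the pipeline as I understand it from the published information. It is only a sketch: `LiveService`, `Archive`, and every method name are hypothetical stand-ins I made up, not anything Twitter, Gnip, or the LOC actually runs. The one assumption that matters is that the archive only ever appends what it receives.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tweet:
    tweet_id: int
    text: str


@dataclass
class LiveService:
    """Hypothetical stand-in for Twitter: tweets can be posted and deleted."""
    tweets: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)  # firehose items not yet shipped

    def post(self, tweet):
        self.tweets[tweet.tweet_id] = tweet
        self.pending.append(tweet)  # every public tweet enters the stream

    def delete(self, tweet_id):
        # Deletion only changes the live service, never a shipped batch.
        self.tweets.pop(tweet_id, None)

    def hourly_batch(self):
        # Hand over everything streamed since the last upload.
        batch, self.pending = self.pending, []
        return batch


@dataclass
class Archive:
    """Hypothetical append-only store (assumption): it has no delete method."""
    stored: list = field(default_factory=list)

    def ingest(self, batch):
        self.stored.extend(batch)


service = LiveService()
archive = Archive()

service.post(Tweet(1, "Hello, world"))
service.post(Tweet(2, "I will regret this"))
archive.ingest(service.hourly_batch())  # the hourly upload

service.delete(2)  # deleted the next day

live_ids = sorted(service.tweets)
archived_ids = [t.tweet_id for t in archive.stored]
print(live_ids)      # [1]
print(archived_ids)  # [1, 2]
```

In this model the deleted tweet disappears from the live service but stays in the archive, because nothing ever tells the archive about the deletion.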

So, once a tweet is bundled up and shipped to the LOC, it’s in the archives forever.

And you thought a permanent record didn’t exist.

  1. The LOC tweet provider, Gnip, is one of the two big data partners of Twitter.
  2. In this case, tape archives.