Words are powerful things. Lately I have been text mining social data feeds and investigating how Hadoop, R, Azure and SQL Server 2012 fit into the big picture. I wrote a SQL Server Integration Services (SSIS) package whose script task pulls Twitter data from a syndication feed, parses out the words, hash tags, screen names and links, and stores the results in a SQL Server 2012 (or optionally SQL Azure) database. Mining the Twitter status keywords against a selection of “best” and “worst” words in the English language brings back some interesting and slightly depressing results.
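To make the parsing step concrete, here is a minimal sketch in Python rather than the SSIS script task itself. The regular expressions and the `parse_tweet` name are my own assumptions for illustration; the real package's rules may well differ.

```python
import re

# Hypothetical patterns, chosen for illustration only.
LINK_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def parse_tweet(text):
    """Split a tweet into links, hash tags, screen names and plain words."""
    links = LINK_RE.findall(text)
    # Remove links first so URL fragments are not counted as words.
    rest = LINK_RE.sub(" ", text)
    hashtags = HASHTAG_RE.findall(rest)
    screen_names = MENTION_RE.findall(rest)
    # Strip hash tags and mentions, then keep the remaining alphabetic tokens.
    rest = MENTION_RE.sub(" ", HASHTAG_RE.sub(" ", rest))
    words = [w.lower() for w in WORD_RE.findall(rest)]
    return {
        "links": links,
        "hashtags": hashtags,
        "screen_names": screen_names,
        "words": words,
    }


if __name__ == "__main__":
    print(parse_tweet("Loving the new #Azure docs from @scobleizer http://example.com/a"))
```

Each of the four lists would then map to rows in the database, one row per token.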
I started by entering the Twitter screen names to follow into the database. I targeted a few Canadian bike companies (don’t ask why). Each time the package runs, it adds the last 20 tweets and parses out the screen names of users mentioned in each tweet, which are then followed in turn. And so on. This recursion very quickly builds a six-degrees-of-separation trail and some fairly random discussions.
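The way the follow list snowballs can be sketched as below. This is an illustration under assumptions, not the package's actual logic: `fetch_recent` is a hypothetical callable standing in for the syndication feed, and the sample data is invented.

```python
import re

MENTION_RE = re.compile(r"@(\w+)")


def crawl(seeds, fetch_recent, runs=2):
    """Run the collection `runs` times, following newly mentioned users each time.

    fetch_recent(name) is a hypothetical callable that returns the most
    recent tweets (as strings) for a screen name.
    """
    followed = [s.lower() for s in seeds]  # screen names in discovery order
    known = set(followed)
    stored = []                            # (screen_name, text) pairs
    stored_keys = set()                    # avoids storing a tweet twice
    for _ in range(runs):
        discovered = []
        for name in list(followed):
            for text in fetch_recent(name):
                if (name, text) in stored_keys:
                    continue
                stored_keys.add((name, text))
                stored.append((name, text))
                # Every mentioned user becomes someone to follow next run.
                for mention in MENTION_RE.findall(text):
                    mention = mention.lower()
                    if mention not in known:
                        known.add(mention)
                        discovered.append(mention)
        followed.extend(discovered)
    return followed, stored


# Invented sample data standing in for the real feed.
SAMPLE = {
    "bikeco": ["Great ride with @alice today"],
    "alice": ["Thanks @bikeco! cc @bob"],
    "bob": ["Hello world"],
}


def fake_fetch(name):
    return SAMPLE.get(name, [])[:20]  # the package keeps the last 20 tweets


if __name__ == "__main__":
    names, tweets = crawl(["bikeco"], fake_fetch, runs=3)
    print(names, len(tweets))
```

Starting from a single seed, each run pulls in one more ring of mentioned users, which is why the topics drift away from bikes so quickly.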
Running this process for about 4 days, sometimes 2-3 times per day, produced 5599 tweets. Originally I was looking at using R and Hadoop to analyze the results, which is a bit like bringing a ballistic missile to a knife fight. Slicing this data with SQL takes a couple of seconds at most. Reading the entire Twitter firehose or analyzing historic tweet data might change the architecture in the future. For now, things are working pretty well.
Of the 5599 individual tweets, 9 contain the “best words” and 2135 contain the “worst words” as rated by Vocabula Review. That’s 38% of the sample that have an aura of foolishness or odium, and less than 0.2% that have an aura of fun and majesty. The sample is fairly small, with the top word “valley” coming up only 3 times.
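I ran the counts in SQL, but the idea is easy to sketch in Python. The word lists below are hypothetical placeholders, not the Vocabula Review lists, and `share_matching` is a name made up for this example.

```python
import re

WORD_RE = re.compile(r"[a-z]+")


def share_matching(tweets, word_list):
    """Return (count, percent) of tweets containing at least one listed word."""
    wanted = {w.lower() for w in word_list}
    hits = sum(1 for t in tweets if wanted & set(WORD_RE.findall(t.lower())))
    percent = 100.0 * hits / len(tweets) if tweets else 0.0
    return hits, percent


if __name__ == "__main__":
    # Invented tweets and placeholder word lists, for illustration only.
    tweets = [
        "Napa Valley is lovely",
        "I literally cannot",
        "nothing here",
        "Actually, literally done",
    ]
    best = ["valley", "azure"]
    worst = ["literally", "actually"]
    print(share_matching(tweets, best))
    print(share_matching(tweets, worst))
```

A tweet counts once no matter how many listed words it contains, which matches counting tweets rather than word occurrences.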
Another dataset, seeded with a more technology-centric list of Twitter users like Robert Scoble and some Microsoft folks I follow, brought back similar results. Running this process over the course of a month saved 59,583 tweets containing 796,171 words, links, screen names, emoticons and hash tags.
Of the 796k words, 24,171 came up in the “worst words” filter. That’s about 3%. A measly 282 came up in the “best words” filter. That’s less than 0.04%.
The top 5 “best words” that came up were:
- Valley makes sense, with Silicon Valley, Napa Valley, and those other west coast valleys being discussed.
- Azure makes sense, since a steady stream of Windows Azure propaganda continually bubbles from Microsoft.
- Simplicity comes up a few times when people talk about Apple or iPad.
- Bliss comes up because of Rob Bliss, a Washington Times columnist, and some comments about cranberry bliss bars.
- Recherche, well, let’s chalk that up to the fact that some of the best words in the English language are French. Mon dieu.
With only 140 characters to leverage, you would think that people would use words like “animadversion” or “cachinnation” to provide deep and meaningful expression. Instead, you get the logorrhea that is the Twitter dataset.
Check out www.vocabula.com to improve your tweets and amaze your followers with fun and majesty.