Thursday, 28 July 2011

Analyzing a Twitter hashtag with R

I've been wanting to have a go at this since reading this great post on the subject by Heuristic Andrew. When news broke that the Prime Minister's Director of Strategy had proposed, amongst other things, to abolish maternity leave, Twitter reacted with the mocking hashtag #blueskyhilton.

I decided on this hashtag because I thought its stream of tweets would be relatively short. The Twitter search API only goes up to 1500 tweets anyway. In fact #blueskyhilton turns out to have had 240 mentions, which is a great size to start with.
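For what it's worth, the 1500 figure presumably falls out of the paging used in the code further down: 15 pages of up to 100 results each. A minimal sketch of that arithmetic, reusing the old search.twitter.com Atom URL from this post purely as an illustration (the variable names are mine, and I'm assuming `reserved = TRUE` so that the `#` is escaped on any version of R):

```r
# Sketch only: shows the paging arithmetic, not a working API call.
query    <- URLencode("#blueskyhilton", reserved = TRUE)  # "#" becomes "%23"
pages    <- 1:15   # 15 pages ...
per_page <- 100    # ... of up to 100 tweets each

# One URL per page; paste0() recycles the fixed parts across all pages.
urls <- paste0("http://search.twitter.com/search.atom?q=", query,
               "&rpp=", per_page, "&page=", pages)

# Upper bound on the number of tweets the loop can ever collect.
max_tweets <- length(urls) * per_page
max_tweets
```

With only 240 mentions, most of those 15 pages simply come back empty.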

There are a few conclusions to be drawn:

1) As a technical aside, I need to find a better way to cope with shortened URLs.

2) I don't think we've reached the limits of what visualisation can do in presenting the information. On Heuristic Andrew there is a cluster dendrogram plot, which is fine for a statistically minded audience, but the rest of the population might find it a little perplexing.

3) I need to draw firmer lines between who is tweeting, what they are tweeting, and retweets.

4) The tweets seem to be concentrated among a small number of users, but that may be due to the small sample size. Offhand ideas from a political backroom boy whom most voters don't recognise don't excite Twitter.

5) Not all Tweeters on an issue will use the same hashtag, so I may need a wider range of search terms.

6) This type of analysis would benefit from proper scripts and repeat sampling.

7) The corner of Twitter that was involved in #blueskyhilton won't be sending any Christmas cards to No. 10.
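Points 1 and 3 could probably be tackled with a bit of base R before the text ever reaches tm. What follows is only a rough sketch under my own assumptions: the three tweets are made up, retweets are assumed to be the old manual "RT @user" style, and the variable names are hypothetical.

```r
# Three invented tweets, not real data.
tweets <- c(
  "Abolish maternity leave? Really? http://bit.ly/abc123 #blueskyhilton",
  "RT @alice: Abolish maternity leave? Really? http://bit.ly/abc123 #blueskyhilton",
  "Next up, abolishing weekends #blueskyhilton"
)

# Flag retweets: text that starts with "RT @user".
is_retweet <- grepl("^RT @[A-Za-z0-9_]+", tweets)

# Pull out the user being retweeted; NA for original tweets.
retweeted_user <- ifelse(is_retweet,
                         sub("^RT @([A-Za-z0-9_]+).*$", "\\1", tweets),
                         NA_character_)

# Drop URLs before punctuation is removed, so shortened links do not
# survive as terms such as "httpbitlyesdjf9".
no_urls <- gsub("https?://[^[:space:]]+", "", tweets)
no_urls <- trimws(gsub("[[:space:]]+", " ", no_urls))  # tidy leftover spaces

no_urls[!is_retweet]  # original tweets only, with links stripped
```

Feeding `no_urls[!is_retweet]` into the corpus instead of the raw vector would keep retweets from inflating the term counts, at the cost of losing the information about which tweets travelled furthest.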

Anyway, here's most of the code I used. Don't forget you'll need the XML package.


> library(XML)
> bsh.vectors <- character(0)
> for (page in 1:15)
+ {
+ # Build the search URL for this page of results (up to 100 tweets per page)
+ twitter_q <- URLencode('#blueskyhilton')
+ twitter_url <- paste('http://search.twitter.com/search.atom?q=', twitter_q, '&rpp=100&page=', page, sep='')
+ # Parse the Atom feed and collect the title of every entry
+ bsh.xml <- xmlParseDoc(twitter_url, asText=FALSE)
+ bsh.vector <- xpathSApply(bsh.xml, '//s:entry/s:title', xmlValue, namespaces=c('s'='http://www.w3.org/2005/Atom'))
+ bsh.vectors <- c(bsh.vector, bsh.vectors)
+ }
> length(bsh.vectors)
[1] 240
> install.packages('tm')
> require(tm)
Loading required package: tm
> # Build a corpus, lower-case it, then drop stop words, the hashtag itself and punctuation
> bsh.corpus <- Corpus(VectorSource(bsh.vectors))
> bsh.corpus <- tm_map(bsh.corpus, tolower)
> my_stopwords <- c(stopwords('english'), '#blueskyhilton', 'blueskyhilton')
> bsh.corpus <- tm_map(bsh.corpus, removeWords, my_stopwords)
> bsh.corpus <- tm_map(bsh.corpus, removePunctuation)
> bsh.dtm <- TermDocumentMatrix(bsh.corpus)
> bsh.dtm
A term-document matrix (849 terms, 240 documents)

Non-/sparse entries: 1785/201975
Sparsity : 99%
Maximal term length: 36
Weighting : term frequency (tf)
> findFreqTerms(bsh.dtm, lowfreq=10)
[1] "abolish" "archiebland" "benefits"
[4] "day" "growth" "hashtag"
[7] "hilton" "ideas" "months"
[10] "people" "policy" "psbook"
[13] "steve" "stevehiltontapes" "wednesday"
[16] "week"
> findAssocs(bsh.dtm, 'hilton', 0.20)
hilton           steve            blood            brain
1.00             0.98             0.43             0.43
notices          alan             equivalent       httpbitlyesdjf9
0.43             0.36             0.36             0.36
minimal          partridge        somalia          political
0.36             0.36             0.36             0.32
cut              phildyson        research         send
0.31             0.31             0.31             0.31
ideas            hashtag          jimthehedgehog   nine
0.29             0.26             0.26             0.25
senior           servant          supply
0.25             0.25             0.22
> findAssocs(bsh.dtm, 'hilton', 0.10)
hilton           steve            blood
1.00             0.98             0.43
brain            notices          alan
0.43             0.43             0.36
equivalent       httpbitlyesdjf9  minimal
0.36             0.36             0.36
partridge        somalia          political
0.36             0.36             0.32
cut              phildyson        research
0.31             0.31             0.31
send             ideas            hashtag
0.31             0.29             0.26
jimthehedgehog   nine             senior
0.26             0.25             0.25
servant          supply           adambienkov
0.25             0.22             0.18
andy             automatic        born
0.18             0.18             0.18
calsberg         coulson          counterbalance
0.18             0.18             0.18
dangerous        director         fascistic
0.18             0.18             0.18
fucking          hit              httpbitlyo9txpg
0.18             0.18             0.18
httpcoameekab    httpcoeaaligl    leaked
0.18             0.18             0.18
meme             minute           moron
0.18             0.18             0.18
plans            plonkers         quota
0.18             0.18             0.18
spawns           strategy         streak
0.18             0.18             0.18
stupid           thejamesdixon    twitter
0.18             0.18             0.18
months           civil            benefits
0.17             0.16             0.14
archiebland      cameron          comprehension
0.13             0.13             0.11
defying          politicians      thestevehiltontapes
0.11             0.11             0.11

Created by Pretty R at inside-R.org
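As far as I understand it, the numbers findAssocs reports are correlations between one term's counts across the tweets and every other term's counts. Here is a small base R sketch of that idea on a toy term-document matrix; the counts are invented and this is not tm's actual implementation, just the same kind of calculation done by hand.

```r
# Toy term-document matrix: rows are terms, columns are tweets.
# All counts are made up for illustration.
tdm <- rbind(
  hilton = c(1, 1, 0, 1, 0),
  steve  = c(1, 1, 0, 1, 0),
  policy = c(0, 1, 1, 0, 0),
  week   = c(0, 0, 1, 0, 1)
)
colnames(tdm) <- paste0("tweet", 1:5)

# Correlate every term's count vector with that of "hilton".
assoc <- apply(tdm, 1, function(counts) cor(counts, tdm["hilton", ]))
assoc <- sort(round(assoc, 2), decreasing = TRUE)

# Keep only the terms at or above a cut-off, like the 0.20 call above.
strong <- assoc[assoc >= 0.20]
strong
```

In this toy matrix "steve" always appears alongside "hilton" and so correlates perfectly, while "week" only ever appears in tweets without "hilton" and ends up at -1. That goes some way to explaining why "steve" scores 0.98 in the real output.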
