Ventures in Data Science: February 2012

I tried to grab the latest 1,500 tweets mentioning Boris Johnson, then did the same for Ken Livingstone, and got back about 1,450 each. There are more Boris tweets out there, which is why his sample covers a shorter period of time. This was done with the twitteR package.

Now using the tm package I found:

Terms most frequently associated with "Boris Johnson"

> findFreqTerms(boris.dtm, lowfreq=50)
 [1] "2012"          "boris"
 [3] "brits"         "campaign"
 [5] "child"         "ginger"
 [7] "hair"          "harry"
 [9] "hustings"      "johnson"
[11] "johnsons"      "london"
[13] "looks"         "love"
[15] "mayor"         "mayoral"
[17] "mayoroflondon" "olympic"
[19] "olympics"      "prince"
[21] "sheeran"       "spanking"
[23] "ticket"        "transparency"

Terms most frequently associated with "Ken Livingstone"

> findFreqTerms(ken.dtm, lowfreq=50)
 [1] "admitted"      "backboris"
 [3] "banker"        "banquier"
 [5] "beacon"        "boris"
 [7] "boriss"        "brian"
 [9] "campaign"      "candidate"
[11] "cutoutandkeep" "cycle"
[13] "davehill"      "election"
[15] "equality"      "extraordinary"
[17] "free"          "gay"
[19] "guide"         "hang"
[21] "hire"          "hustings"
[23] "idea"          "ill"
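
For anyone without tm installed, the idea behind a frequency cut-off like lowfreq=50 can be sketched in base R. This is a rough stand-in and not how tm does it internally: the tweets vector is made-up sample text, frequent_terms is a hypothetical helper, and the threshold is 2 rather than 50 only because the sample is tiny. It also does no stop word removal, so words like "the" survive.

```r
# Made-up sample tweets, for illustration only (not from the real data set).
tweets <- c("Boris Johnson at the mayoral hustings",
            "Ken Livingstone speaks at the hustings",
            "Mayoral campaign: Boris vs Ken")

# Hypothetical helper: return the terms that occur at least `lowfreq` times.
frequent_terms <- function(texts, lowfreq) {
  # Lower-case everything and split on runs of non-alphanumeric characters.
  words <- unlist(strsplit(tolower(texts), "[^a-z0-9]+"))
  words <- words[nzchar(words)]   # drop empty strings left by the split
  counts <- table(words)
  sort(names(counts)[counts >= lowfreq])
}

frequent_terms(tweets, lowfreq = 2)
```

On real data the cut-off matters a lot: set it too low and you drown in noise, too high and only the candidates' own names come back.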

Then I used the sentiment package to classify each tweet as neutral, positive or negative, and displayed the results using ggplot2.

> table(classify_polarity(borisemot, algorithm="voter",verbose=FALSE))

          0.500001                  1
               272               1178
1.999996000008e-06              1e-06
                41               2628
            500001           negative
               231                 41
           neutral           positive
              1178                231

> table(classify_polarity(kentext, algorithm="voter",verbose=FALSE))

   0.200000319999872 6.66666222222519e-07    9.99999000001e-07
                 129                   37                  100
            negative              neutral             positive
                 586                  585                  316
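
Raw counts are hard to compare when the two samples are slightly different sizes, so here is a minimal sketch that turns the negative, neutral and positive counts into shares per candidate. The numbers are the class counts as I read them from the console output above (the Boris figures come from a wrapped table, so treat that reading as an assumption); the polarity and shares names are just for illustration.

```r
# Class counts per candidate, as read from the table() output above.
polarity <- rbind(
  boris = c(negative = 41,  neutral = 1178, positive = 231),
  ken   = c(negative = 586, neutral = 585,  positive = 316)
)

# Row-wise proportions, so each candidate's shares sum to 1.
shares <- prop.table(polarity, margin = 1)

# Show as percentages rounded to one decimal place.
round(100 * shares, 1)
```

Looking at shares rather than counts makes the gap in negative tweets between the two candidates much easier to see.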

Points to make

1) Boris is getting more coverage than Ken.
2) I'd want to know whether Ken's negative count is so high because of a wider "non-political" public mood or because of politically converted Conservative tweeters.
3) In future analysis it could be a good idea to restrict the sample to tweeters that only mention a candidate once. If they mention a candidate 10 times, let's face it, they've probably already decided how to vote.
4) I'm not convinced Boris's neutral tweets are all neutral.
5) This analysis would benefit from being done every day to pick up trends.
6) Potentially has a much faster turnaround time than telephone and internet polling.
7) This would only work on large races like the London mayoral election. I doubt the Doncaster mayoral election would generate the same level of interest on Twitter.
8) Ideally sentiment should be weighted by followers. If someone tweets 200 really negative tweets but they only have 5 followers, what does it matter in the grand scheme of things? Not a lot.
9) The most closely associated terms need analysis by an expert who knows the specifics, otherwise they look random.
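
Points 3 and 8 could be sketched roughly like this. Everything here is an assumption for illustration: the scored data frame is made up, and the followers column is something you would have to look up separately, since it is not among the columns toDataFrame() returned for me.

```r
# Made-up classified tweets; `followers` is an assumed extra column.
scored <- data.frame(
  screenName = c("a", "a", "a", "b", "c", "d"),
  polarity   = c("negative", "negative", "negative",
                 "positive", "neutral", "positive"),
  followers  = c(5, 5, 5, 1200, 300, 40),
  stringsAsFactors = FALSE
)

# Point 3: keep only users who appear exactly once in the sample.
mentions <- table(scored$screenName)
once <- scored[scored$screenName %in% names(mentions)[mentions == 1], ]

# Point 8: share of each polarity, weighted by follower count.
weighted <- tapply(once$followers, once$polarity, sum) / sum(once$followers)
round(weighted, 3)
```

In this toy sample the one account that tweeted three negative tweets drops out entirely, and the remaining sentiment is dominated by whoever has the most followers, which is exactly the trade-off to think about before trusting either adjustment.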

So who's going to win?

If you had asked me before, I would have expected Boris's negative ratings to be higher than they are. That Ken's positives are higher than Boris's, although marginally, may reflect the reality of a revived centre left after its exit from government in 2010. Looking at the polls before starting this, I thought it was too close to call. Afterwards, if the method of analysis works (and that is a massive if), I would lean to the Boris side of too close to call. But if you make judgements based on this data as to who is going to win in May then you're an idiot. So more evidence is required.

This post has been brought to you with gratefully received help from Jeffrey Breen, Heuristic Andrew and the sentiment package. Check them out.

As every political geek will know, midday on Wednesday when Parliament is in session means Prime Minister's Questions. This week Labour Opposition Leader Ed Miliband led on the NHS reforms, so how did Twitter react? As you can see, the Prime Minister's answers weren't convincing for his opponents on Twitter, as there was a huge spike in the use of the #dropthebill hashtag. Interestingly, the effect lasted after PMQs had finished.

> require(twitteR)
> tweets = searchTwitter("#dropthebill", n=1500)
> length(tweets)
[1] 1499
> library(plyr)
> tweets.df = ldply(tweets, function(t) t$toDataFrame() )
> names(tweets.df)
 [1] "text"         "favorited"    "replyToSN"    "created"
 [5] "truncated"    "replyToSID"   "id"           "replyToUID"
 [9] "statusSource" "screenName"
> range(tweets.df$created)
[1] "2012-02-22 08:09:59 UTC" "2012-02-22 16:28:16 UTC"
> require(ggplot2)
> qplot(created, data=tweets.df, geom="histogram", main="#dropthebill on 22nd February 2012", xlab="Time", binwidth=1200)
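
The binwidth=1200 is, as far as I know, measured in seconds on a date-time axis, so each bar covers 20 minutes. Here is a minimal base R sketch of the same counting, using made-up timestamps rather than the real tweets. The created, bin_seconds and bin_start names are just for illustration, and ggplot2 may place its bin edges differently from the simple round-down used here.

```r
# Made-up timestamps for illustration (UTC), not the real tweets.
created <- as.POSIXct(c("2012-02-22 12:01:10", "2012-02-22 12:05:00",
                        "2012-02-22 12:19:59", "2012-02-22 12:20:00",
                        "2012-02-22 12:45:30"), tz = "UTC")

bin_seconds <- 1200  # 20 minutes, matching binwidth=1200

# Round each timestamp down to the start of its 20-minute bin.
bin_start <- as.POSIXct(floor(as.numeric(created) / bin_seconds) * bin_seconds,
                        origin = "1970-01-01", tz = "UTC")

# Count tweets per bin, labelled by the bin's start time.
counts <- table(format(bin_start, "%H:%M"))
counts
```

A table like this is handy for checking exactly when the spike starts and how long it lasts, which is harder to read off a plot.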

Friday 24 February 2012

Ken vs Boris: Can Twitter point towards the result?

Wednesday 22 February 2012

What happens to the #dropthebill hashtag when the PM mentions his NHS reforms?