Ventures in Data Science: Ken vs Boris: Can Twitter point towards the result?

I tried to grab the latest 1,500 tweets mentioning Boris Johnson, and then the same for Ken Livingstone, and got back about 1,450 for each. There are more Boris tweets out there, which is why his sample covers a shorter period of time. This was done with the twitteR package.

Now using the tm package I found:

Terms most frequently associated with "Boris Johnson"

> findFreqTerms(boris.dtm, lowfreq=50)
 [1] "2012"          "boris"
 [3] "brits"         "campaign"
 [5] "child"         "ginger"
 [7] "hair"          "harry"
 [9] "hustings"      "johnson"
[11] "johnsons"      "london"
[13] "looks"         "love"
[15] "mayor"         "mayoral"
[17] "mayoroflondon" "olympic"
[19] "olympics"      "prince"
[21] "sheeran"       "spanking"
[23] "ticket"        "transparency"

Terms most frequently associated with "Ken Livingstone"

> findFreqTerms(ken.dtm, lowfreq=50)
 [1] "admitted"      "backboris"
 [3] "banker"        "banquier"
 [5] "beacon"        "boris"
 [7] "boriss"        "brian"
 [9] "campaign"      "candidate"
[11] "cutoutandkeep" "cycle"
[13] "davehill"      "election"
[15] "equality"      "extraordinary"
[17] "free"          "gay"
[19] "guide"         "hang"
[21] "hire"          "hustings"
[23] "idea"          "ill"
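The code that built boris.dtm and ken.dtm is not shown, so here is a rough base R sketch of the idea behind findFreqTerms: count how often each term occurs across all tweets and keep the ones at or above a threshold. The three tweets and the helper names tokenise and freq_terms are made up for illustration, and the cleaning is much cruder than what tm can do.

```r
# Hypothetical tweets standing in for the real search results.
tweets <- c(
  "Boris Johnson at the mayoral hustings in London",
  "London mayor Boris talks Olympics tickets",
  "Hustings tonight: Boris vs Ken for London mayor"
)

# Lower-case the text, replace punctuation with spaces, split on spaces.
tokenise <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9 ]", " ", x)
  words <- unlist(strsplit(x, " +"))
  words[nzchar(words)]  # drop empty strings left by leading spaces
}

# Terms occurring at least `lowfreq` times across all tweets,
# which is roughly what findFreqTerms(dtm, lowfreq = ...) reports.
freq_terms <- function(texts, lowfreq) {
  counts <- table(tokenise(texts))
  sort(names(counts)[counts >= lowfreq])
}

print(freq_terms(tweets, lowfreq = 2))
```

With only three tweets a threshold of 2 is used here; the post used lowfreq=50 on roughly 1,450 tweets per candidate.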

I then used the sentiment package to classify each tweet as negative, neutral or positive, and plotted the results with ggplot2.

> table(classify_polarity(borisemot, algorithm="voter", verbose=FALSE))

          0.500001                  1 1.999996000008e-06              1e-06
               272               1178                 41               2628
            500001           negative            neutral           positive
               231                 41               1178                231

> table(classify_polarity(kentext, algorithm="voter", verbose=FALSE))

   0.200000319999872 6.66666222222519e-07    9.99999000001e-07
                 129                   37                  100
            negative              neutral             positive
                 586                  585                  316
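The raw counts are hard to compare because the two samples are slightly different sizes, so it helps to turn them into percentages. This small sketch uses only the negative, neutral and positive counts reported above. The numerically labelled entries appear to be internal scores that table() tallied along with the labels, and they are ignored here. The variable names are mine.

```r
# Polarity counts copied from the two table() outputs above.
boris <- c(negative = 41,  neutral = 1178, positive = 231)
ken   <- c(negative = 586, neutral = 585,  positive = 316)

# Convert each candidate's counts to percentages of his own sample.
shares <- rbind(
  Boris = round(100 * prop.table(boris), 1),
  Ken   = round(100 * prop.table(ken), 1)
)

print(shares)
```

On these counts Boris comes out at about 2.8% negative, 81.2% neutral and 15.9% positive, while Ken is about 39.4% negative, 39.3% neutral and 21.3% positive.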

Points to make

1) Boris is getting more coverage than Ken.
2) I'd want to know whether Ken's negative count is so high because of a wider, non-political public mood or because of politically committed Conservative tweeters.
3) In future analysis it could be a good idea to restrict the sample to tweeters who mention a candidate only once. If they mention a candidate 10 times, let's face it, they've probably already decided how to vote.
4) I'm not convinced Boris's neutral tweets are all neutral.
5) This analysis would benefit from being repeated every day to pick up trends.
6) Potentially has a much faster turnaround time than telephone and internet polling.
7) This would only work on large races like the London mayoral election. I doubt the Doncaster mayoral election would generate the same level of interest on Twitter.
8) Ideally sentiment should be weighted by followers. If someone sends 200 really negative tweets but has only 5 followers, what does it matter in the grand scheme of things? Not a lot.
9) The most frequent terms need analysis by an expert who knows the specifics of the race; otherwise they look random.
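Points 3 and 8 are easy to sketch in base R. What follows is a minimal illustration on invented data, not something I ran on the real tweets: the data frame, its column names and the helper weighted_share are all hypothetical, and summing follower counts per polarity is only one crude way of weighting by followers.

```r
# Hypothetical data: one row per tweet. All values are invented.
tweets <- data.frame(
  user      = c("a", "a", "a", "b", "c", "d"),
  followers = c(5, 5, 5, 2000, 300, 50),
  polarity  = c("negative", "negative", "negative",
                "positive", "neutral", "negative"),
  stringsAsFactors = FALSE
)

# Point 3: keep only tweeters who mention the candidate exactly once.
tweets_per_user <- table(tweets$user)
single_users <- names(tweets_per_user)[tweets_per_user == 1]
once_only <- tweets[tweets$user %in% single_users, ]

# Point 8: share of each polarity when every tweet counts in
# proportion to its author's follower count.
weighted_share <- function(df) {
  totals <- tapply(df$followers, df$polarity, sum)
  totals / sum(totals)
}

raw_share <- prop.table(table(tweets$polarity))
weighted  <- weighted_share(tweets)

print(once_only)
print(round(raw_share, 3))
print(round(weighted, 3))
```

In this toy data user "a" supplies three of the four negative tweets but has only 5 followers, so the negative share falls sharply once followers are taken into account.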

So who's going to win?

If you had asked me beforehand, I would have expected Boris's negative ratings to be higher than they are. That Ken's positives are higher than Boris's, although only marginally, may reflect the reality of a revived centre left after its exit from government in 2010. Looking at the polls before starting this, I thought the race was too close to call. Afterwards, if the method of analysis works (and that is a massive if), I would lean towards the Boris side of too close to call. But if you make judgements about who is going to win in May based on this data, then you're an idiot. So more evidence is required.

This post has been brought to you with the gratefully received help of Jeffrey Breen, Heuristic Andrew and the sentiment package. Check them out.

Friday, 24 February 2012
