Now using the tm package I found:
Terms most frequently associated with "Boris Johnson"
> findFreqTerms(boris.dtm, lowfreq=50)
 [1] "2012"          "boris"
 [3] "brits"         "campaign"
 [5] "child"         "ginger"
 [7] "hair"          "harry"
 [9] "hustings"      "johnson"
[11] "johnsons"      "london"
[13] "looks"         "love"
[15] "mayor"         "mayoral"
[17] "mayoroflondon" "olympic"
[19] "olympics"      "prince"
[21] "sheeran"       "spanking"
[23] "ticket"        "transparency"
Terms most frequently associated with "Ken Livingstone"
> findFreqTerms(ken.dtm, lowfreq=50)
 [1] "admitted"      "backboris"
 [3] "banker"        "banquier"
 [5] "beacon"        "boris"
 [7] "boriss"        "brian"
 [9] "campaign"      "candidate"
[11] "cutoutandkeep" "cycle"
[13] "davehill"      "election"
[15] "equality"      "extraordinary"
[17] "free"          "gay"
[19] "guide"         "hang"
[21] "hire"          "hustings"
[23] "idea"          "ill"
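For anyone unfamiliar with what findFreqTerms is doing, here is a rough base R sketch of the same idea: count every term across the tweets and keep the ones at or above a threshold. The tweets and the threshold of 2 are made up for illustration (the real run used lowfreq=50 on a document-term matrix), and this is not how tm implements it internally. Note that it measures raw frequency within the matching tweets rather than any statistical association.

```r
# Toy stand-in for the tweet text; the real corpus is not reproduced here.
tweets <- c(
  "boris johnson mayor london",
  "boris hustings london olympics",
  "ken hustings campaign london"
)

# Split each tweet into lower-case terms on whitespace.
terms <- unlist(strsplit(tolower(tweets), "\\s+"))

# Count how often each term occurs across all tweets.
term_counts <- table(terms)

# Keep terms at or above the threshold (hypothetical value; the post used 50).
lowfreq <- 2
frequent_terms <- sort(names(term_counts)[as.vector(term_counts) >= lowfreq])
print(frequent_terms)
```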
Then I used the sentiment package to classify each tweet as neutral, positive or negative, and plotted the results with ggplot2.
> table(classify_polarity(borisemot, algorithm="voter",verbose=FALSE))

          0.500001                  1
               272               1178
1.999996000008e-06              1e-06
                41               2628
            500001           negative
               231                 41
           neutral           positive
              1178                231
> table(classify_polarity(kentext, algorithm="voter",verbose=FALSE))

   0.200000319999872 6.66666222222519e-07    9.99999000001e-07
                 129                   37                  100
            negative              neutral             positive
                 586                  585                  316
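The numeric entries in the output above appear because table() was applied to the whole result, which seems to be a character matrix holding the scores as well as the labels. A minimal sketch of tabulating just the label column follows. The matrix here is invented, and the column names (POS, NEG, POS/NEG, BEST_FIT) are my assumption about what classify_polarity returns, so check them against your own output before relying on this.

```r
# Hypothetical stand-in for the matrix that classify_polarity() appears to return.
# The scores are made up; only the layout matters for this sketch.
polarity <- matrix(
  c("1.03", "0.45", "2.29", "positive",
    "0.45", "1.03", "0.44", "negative",
    "0.45", "0.45", "1",    "neutral",
    "1.03", "0.45", "2.29", "positive"),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("POS", "NEG", "POS/NEG", "BEST_FIT"))
)

# table() on the whole matrix counts every cell, scores included.
all_cells <- table(polarity)

# Tabulating only the label column gives one count per class.
label_counts <- table(polarity[, "BEST_FIT"])
print(label_counts)
```

With the real data, the three counts of interest are the negative, neutral and positive entries already visible in the output above.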
1) Boris is getting more coverage than Ken.
2) I'd want to know whether the Ken negative is so high because of a wider "non political" public mood or because of politically committed Conservative tweeters.
3) In future analysis it could be a good idea to restrict the sample to tweeters that only mention a candidate once. If they mention a candidate 10 times, let's face it, they've probably already decided how to vote.
4) I'm not convinced Boris's neutral tweets are all neutral.
5) This analysis would benefit from being done every day to pick up trends.
6) This approach potentially has a much faster turnaround time than telephone and internet polling.
7) This would only work on large races like the London mayoral election. I doubt the Doncaster mayoral election would generate the same level of interest on Twitter.
8) Ideally sentiment should be weighted by followers. If someone tweets 200 really negative tweets but they only have 5 followers, what does it matter in the grand scheme of things? Not a lot.
9) The most closely associated terms need analysis by an expert who knows the specifics; otherwise they look random.
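Points 3 and 8 are easy enough to sketch in base R. The data frame below is entirely invented (user names, follower counts and labels are hypothetical, and the column names are my own), so treat this as one possible way of doing it rather than something I ran on the real tweets.

```r
# Hypothetical per-tweet data; user names, follower counts and labels are invented.
tweets <- data.frame(
  user      = c("a", "a", "a", "b", "c", "d"),
  followers = c(5, 5, 5, 1000, 200, 50),
  polarity  = c("negative", "negative", "negative",
                "positive", "neutral", "negative"),
  stringsAsFactors = FALSE
)

# Point 3: keep only users who mention the candidate exactly once.
mentions <- table(tweets$user)
once_users <- names(mentions)[as.vector(mentions) == 1]
once <- tweets[tweets$user %in% once_users, ]

# Point 8: weight each remaining tweet by its author's follower count.
weighted <- tapply(once$followers, once$polarity, sum)
weighted_share <- weighted / sum(weighted)
print(round(weighted_share, 2))
```

In this toy example the prolific user "a" drops out entirely, and the single positive tweet from the account with 1000 followers dominates the weighted share. Whether follower count is the right weight is an open question; it is simply the one suggested in point 8.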
So who's going to win?
If you had asked me beforehand, I would have expected Boris's negative ratings to be higher than they are. That Ken's positives are higher than Boris's, although only marginally, may reflect the reality of a revived centre left after its exit from government in 2010. Looking at the polls before starting this, I thought it was too close to call. Afterwards, if the method of analysis works, and that is a massive if, I would lean towards the Boris side of too close to call. But if you make judgements based on this data as to who is going to win in May, then you're an idiot. So more evidence is required.
This post has been brought to you with the gratefully received help of Jeffrey Breen, Heuristic Andrew and the sentiment package. Check them out.