Friday, 24 February 2012

Ken vs Boris: Can Twitter point towards the result?

I tried to grab the latest 1,500 tweets mentioning Boris Johnson, then did the same for Ken Livingstone, and got back about 1,450 each. There are more Boris tweets out there, which is why his tweets cover a shorter period of time. This was done with the twitteR package.
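For anyone wanting to reproduce the grab, it went along roughly these lines (a sketch; the object names are mine, not necessarily what I used at the time):

> # grab up to 1500 recent tweets per candidate (API limits permitting)
> require(twitteR)
> boris.tweets <- searchTwitter("Boris Johnson", n=1500)
> ken.tweets <- searchTwitter("Ken Livingstone", n=1500)
> # keep just the text for the tm and sentiment steps below
> boris.text <- sapply(boris.tweets, function(t) t$getText())
> ken.text <- sapply(ken.tweets, function(t) t$getText())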

Now, using the tm package, I found the frequent terms below.
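The document-term matrices were built along roughly these lines (again a sketch, assuming the boris.text vector from above; the exact cleaning steps may have differed):

> require(tm)
> boris.corpus <- Corpus(VectorSource(boris.text))
> # lower-case everything, strip punctuation and common English stopwords
> boris.corpus <- tm_map(boris.corpus, content_transformer(tolower))
> boris.corpus <- tm_map(boris.corpus, removePunctuation)
> boris.corpus <- tm_map(boris.corpus, removeWords, stopwords("english"))
> boris.dtm <- DocumentTermMatrix(boris.corpus)

The same steps build ken.dtm from ken.text.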

Terms most frequently associated with "Boris Johnson"

> findFreqTerms(boris.dtm, lowfreq=50)
 [1] "2012"          "boris"
 [3] "brits"         "campaign"
 [5] "child"         "ginger"
 [7] "hair"          "harry"
 [9] "hustings"      "johnson"
[11] "johnsons"      "london"
[13] "looks"         "love"
[15] "mayor"         "mayoral"
[17] "mayoroflondon" "olympic"
[19] "olympics"      "prince"
[21] "sheeran"       "spanking"
[23] "ticket"        "transparency"

Terms most frequently associated with "Ken Livingstone"

> findFreqTerms(ken.dtm, lowfreq=50)
 [1] "admitted"      "backboris"
 [3] "banker"        "banquier"
 [5] "beacon"        "boris"
 [7] "boriss"        "brian"
 [9] "campaign"      "candidate"
[11] "cutoutandkeep" "cycle"
[13] "davehill"      "election"
[15] "equality"      "extraordinary"
[17] "free"          "gay"
[19] "guide"         "hang"
[21] "hire"          "hustings"
[23] "idea"          "ill"

Then I used the sentiment package to classify the tweets as negative, neutral or positive, and displayed the results using ggplot2.

> table(classify_polarity(borisemot, algorithm="voter", verbose=FALSE))

          0.500001                  1 
               272               1178 
1.999996000008e-06              1e-06 
                41               2628 
            500001           negative 
               231                 41 
           neutral           positive 
              1178                231 

> table(classify_polarity(kentext, algorithm="voter", verbose=FALSE))

   0.200000319999872 6.66666222222519e-07    9.99999000001e-07 
                 129                   37                  100 
            negative              neutral             positive 
                 586                  585                  316 
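The plotting step isn't shown above; the ggplot2 display was done along these lines (a sketch using the BEST_FIT column that classify_polarity returns, with borisemot and kentext being the tweet text vectors as above):

> require(sentiment)
> require(ggplot2)
> boris.pol <- classify_polarity(borisemot, algorithm="voter")
> ken.pol <- classify_polarity(kentext, algorithm="voter")
> # stack the best-fit labels for both candidates into one data frame
> sentiments <- data.frame(
+   candidate = c(rep("Boris", nrow(boris.pol)), rep("Ken", nrow(ken.pol))),
+   polarity = c(boris.pol[, "BEST_FIT"], ken.pol[, "BEST_FIT"]))
> ggplot(sentiments, aes(polarity, fill=candidate)) + geom_bar(position="dodge")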



Points to make

1) Boris is getting more coverage than Ken.
2) I'd want to know whether Ken's negatives are so high because of a wider "non-political" public mood or because of already-converted Conservative tweeters.
3) In future analysis it could be a good idea to restrict the sample to tweeters who only mention a candidate once. If they mention a candidate 10 times, let's face it, they've probably already decided how to vote.
4) I'm not convinced Boris's neutral tweets are all neutral.
5) This analysis would benefit from being done every day to pick up trends.
6) It potentially has a much faster turnaround time than telephone and internet polling.
7) This would only work on large races like the London mayoral election. I doubt the Doncaster mayoral election would generate the same level of interest on Twitter.
8) Ideally sentiment should be weighted by followers. If someone sends 200 really negative tweets but only has 5 followers, what does it matter in the grand scheme of things? Not a lot. A sketch of what this weighting might look like follows this list.
9) The most closely associated terms need analysis by an expert who knows the specifics, otherwise they look random.
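On point 8, here's a minimal sketch of one way to do the weighting. The score and followers columns are hypothetical, standing in for a per-tweet polarity of -1, 0 or 1 and the tweeter's follower count:

> # hypothetical per-tweet data: polarity score and the tweeter's followers
> weighted.df <- data.frame(score = c(1, -1, -1, 0, 1),
+                           followers = c(12000, 5, 5, 340, 89))
> # a log weight stops one celebrity account swamping everybody else
> weighted.df$weight <- log1p(weighted.df$followers)
> with(weighted.df, sum(score * weight) / sum(weight))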

So who's going to win?

If you'd asked me before doing this, I would have expected Boris's negative ratings to be higher than they are. That Ken's positives are higher than Boris's, if only marginally, may reflect the reality of a revived centre left after its exit from government in 2010. Looking at the polls before starting this, I thought it was too close to call. Afterwards, if the method of analysis works, and that is a massive if, I would lean towards the Boris side of too close to call. But if you make judgements about who is going to win in May based on this data alone then you're an idiot. So: more evidence required.

This post has been brought to you with the gratefully received help of Jeffrey Breen, Heuristic Andrew and the sentiment package. Check them out.

Wednesday, 22 February 2012

What happens to the #dropthebill hashtag when the PM mentions his NHS reforms?

As every political geek will know, midday on Wednesday when Parliament is in session means Prime Minister's Questions. This week the Leader of the Opposition, Ed Miliband, led on the NHS reforms, so how did Twitter react? As you can see, the Prime Minister's answers weren't convincing for his opponents on Twitter: there was a huge spike in the use of the #dropthebill hashtag. Interestingly, the effect lasted after PMQs had finished.

> require(twitteR)
> tweets = searchTwitter("#dropthebill", n=1500)
> length(tweets)
[1] 1499
> library(plyr)
> tweets.df = ldply(tweets, function(t) t$toDataFrame())
> names(tweets.df)
 [1] "text" "favorited" "replyToSN" "created" [5] "truncated" "replyToSID" "id" "replyToUID" [9] "statusSource" "screenName"
 > range(tweets.df$created) [1] "2012-02-22 08:09:59 UTC" "2012-02-22 16:28:16 UTC"
> require(ggplot2)
> qplot(tweets.df$created, data=tweets.df, geom="histogram", main="#dropthebill on 22nd February 2012", xlab="Time", binwidth=1200)

Wednesday, 4 January 2012

to the K-means and beyond ...

One of the rarer types of article in journals is the sort that offers a potted history of a discipline or sub-discipline. While these articles may not make a new contribution to knowledge, they are a great resource for anyone wanting to get up to speed with a field.

A great example of such an article is Data Clustering: 50 Years Beyond K-Means by Anil K. Jain (pdf file). As K-means is still one of the most popular clustering algorithms, it just goes to show, as in other areas of statistics, that if your way of doing things arrives first and becomes commonly adopted, it will often trundle along even when better alternatives appear in the literature.

Tuesday, 3 January 2012

How to trim a mean in R

Good news! My secondhand copy of Fundamentals of Modern Statistical Methods by Rand R. Wilcox has winged its way safely across the Atlantic. In celebration, here's how to compute trimmed means in R. It's really quite simple, but as you can see below it can have a big effect on the results.

> # Let's find some data to compare the mean and the trimmed mean
> firstlotofdata <- c(14,13,10,11,19,14,12,12,12,16,18,16,13,14,15,18,17,16,14,11)
> # Some data with a couple of big outliers will also be helpful
> secondlotofdata <- c(15,18,11,11,12,12,19,17,14,13,12,18,17,16,17,14,11,12,875,4572)
> firstlotofdata
 [1] 14 13 10 11 19 14 12 12 12 16 18 16 13 14 15 18 17
[18] 16 14 11
> secondlotofdata
 [1]   15   18   11   11   12   12   19   17   14   13
[11]   12   18   17   16   17   14   11   12  875 4572
> mean(firstlotofdata)
[1] 14.25
> mean(secondlotofdata)
[1] 285.3
> mean(firstlotofdata, 0.1)
[1] 14.1875
> mean(secondlotofdata, 0.1)
[1] 14.8125
> mean(secondlotofdata, 0.05)
[1] 62.38889
> length(secondlotofdata)
[1] 20

As you can see, the first lot of data doesn't produce much difference between the mean and the trimmed mean, as the values are nicely bunched together and free from outliers. This is as expected: when the data are roughly symmetric with no outliers, as with a Gaussian distribution, trimming changes the result very little.

The difference is evident in the second lot of data, which has two massive outliers that hugely distort the result. If you want to trim the mean by removing 20% of cases in total, you write mean(mydata, trim = 0.1), which drops 10% of cases from the top and 10% from the bottom of the distribution. With a length of 20 that means removing 4 cases. As you can see, trim = 0.05 removes only 2 cases, which keeps the 875 in and leaves a trimmed mean of over 60.
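You can check exactly which cases go by doing the 10% trim by hand: sort the data and drop the two smallest and two largest values, which reproduces mean(secondlotofdata, 0.1).

> # drop the 2 smallest and 2 largest of the 20 values (10% from each end)
> mean(sort(secondlotofdata)[3:18])
[1] 14.8125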

Friday, 28 October 2011

Regular Expressions

It's always helpful when R user groups put their presentations online, and I find regular expressions very useful for data munging.
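As a quick taste of the kind of munging I mean, here's a small made-up example: stripping messy currency strings down to numbers with gsub.

> # made-up messy price strings
> prices <- c("£1,250", " £87 ", "£3,400,000")
> # strip everything that isn't a digit, then convert to numeric
> as.numeric(gsub("[^0-9]", "", prices))
[1]    1250      87 3400000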

Thursday, 27 October 2011

Common Statistical Errors

Check out this journal article (pdf) on common statistical errors in medical experiments. Don't worry, it's only the medical research that's meant to keep you alive, and the statistical tests that prove its effectiveness are often not up to scratch. Anyway, below are the errors they identified; a short R illustration of one of the commonest, the paired versus unpaired t-test mix-up, follows the lists.

Statistical errors and deficiencies related to the design of a study

  • Study aims and primary outcome measures not clearly stated or unclear
  • Failure to report number of participants or observations (sample size)
  • Failure to report withdrawals from the study
  • No a priori sample size calculation/effect-size estimation (power calculation)
  • No clear a priori statement or description of the Null-Hypothesis under investigation
  • Failure to use and report randomisation
  • Method of randomisation not clearly stated
  • Failure to use and report blinding if possible
  • Failure to report initial equality of baseline characteristics and comparability of study groups
  • Use of an inappropriate control group
  • Inappropriate testing for equality of baseline characteristics

Statistical errors and deficiencies related to data analysis

  • Use of wrong statistical tests
  • Incompatibility of statistical test with type of data examined
  • Unpaired tests for paired data or vice versa     
  • Inappropriate use of parametric methods
  • Use of an inappropriate test for the hypothesis under investigation
  • Inflation of Type I error
  • Failure to include a multiple-comparison correction
  • Inappropriate post-hoc Subgroup analysis
  • Typical errors with Student’s t-test
  • Failure to prove test assumptions
  • Unequal sample sizes for paired t-test
  • Improper multiple pair-wise comparisons of more than two groups
  • Use of an unpaired t-test for paired data or vice versa
  • Typical errors with chi-square-tests
  • No Yates-continuity correction reported if small numbers
  • Use of chi-square when expected numbers in a cell are <5
  • No explicit statement of the tested Null-Hypotheses
  • Failure to use multivariate techniques to adjust for confounding factors 

Errors related to the documentation of statistical methods applied

  • Failure to specify/define all tests used clearly and correctly
  • Failure to state number of tails
  • Failure to state if test was paired or unpaired
  • Wrong names for statistical tests
  • Referring to unusual or obscure methods without explanation or reference
  • Failure to specify which test was applied on a given set of data if more than one test was done
  • “Where appropriate” statement

Statistical errors and deficiencies related to the presentation of study data

  • Inadequate graphical or numerical description of basic data
  • Mean but no indication of variability of the data
  • Giving SE instead of SD to describe data
  • Use of mean (SD) to describe non-normal data
  • Failure to define the ± notation used to describe variability, or use of unlabeled error bars
  • Inappropriate and poor reporting of results
  • Results given only as p-values, no confidence intervals given
  • Confidence intervals given for each group rather than for contrasts
  • “p = NS”, “p <0.05” or other arbitrary thresholds instead of reporting exact p-values
  • Numerical information given to an unrealistic level of precision

Statistical errors and deficiencies related to the interpretation of study findings

  • Wrong interpretation of results
  • “non significant” interpreted as “no effect”, or “no difference”
  • Drawing conclusions not supported by the study data
  • Significance claimed without data analysis or statistical test mentioned
  • Poor interpretation of results
  • Disregard for Type II error when reporting non-significant results
  • Missing discussion of the problem of multiple significance testing if done
  • Failure to discuss sources of potential bias and confounding factors
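
To make one of the commonest of these concrete, here is the promised illustration of the paired versus unpaired t-test mix-up, on simulated before/after data (my own toy example, not from the article):

> # simulated before/after measurements for 15 patients
> before <- rnorm(15, mean=120, sd=15)
> after <- before - 5 + rnorm(15, sd=3)  # a consistent small drop per patient
> t.test(before, after)                  # unpaired: ignores the pairing
> t.test(before, after, paired=TRUE)     # paired: tests the within-patient change

On a typical run the unpaired test misses the drop because the between-patient spread swamps it, while the paired test picks it up easily.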




Wednesday, 26 October 2011

Pick a number any number

I was looking at GDP per person the other day on the UN HDI website and thought the data was quite interesting, not least because there was a lot of incomplete data for the 1980 figures compared to the 2010 ones. That appeared to be partly because they didn't have the data for whatever reason, but also because a lot of new countries have been created since 1980.

Anyway, I thought it would be good to analyse the data to see which countries had done well or not so well economically over the last 30 or so years. The results were interesting in themselves, more on which in another post, but I did notice one particularly striking finding.


> cor(Y1980, newgdp, method = "pearson")
[1] 0.7785629
> cor(Y1980, newgdp, method = "spearman")
[1] 0.9456404
> cor(Y1980, newgdp, method = "kendall")
[1] 0.8039875

With the type of data I had, two continuous variables, I needed a test of correlation. I actually looked at all three tests, Pearson, Spearman and Kendall, as you can see above, and the difference in the results on the same data did surprise me somewhat. So getting the statistical test right is important, as the differences can be large.
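The gap makes sense once you remember that Pearson measures linear association while Spearman and Kendall work on ranks, and GDP per person is heavily right-skewed. A quick simulation on made-up log-normal data (my own illustration, not the HDI figures) shows a similar pattern:

> # skewed, log-normal data with a monotone relationship plus noise
> x <- exp(rnorm(150))
> y <- exp(log(x) + rnorm(150))
> cor(x, y, method = "pearson")
> cor(x, y, method = "spearman")
> cor(x, y, method = "kendall")

On data like this, Spearman usually comes out well above Pearson, which gets dragged around by the extreme values in the right tail; Kendall's tau sits on a different scale again, so the three aren't directly comparable numbers.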