Friday, 28 October 2011
Regular Expressions
Thursday, 27 October 2011
Common Statistical Errors
Statistical errors and deficiencies related to the design of a study
- Study aims and primary outcome measures not clearly stated
- Failure to report number of participants or observations (sample size)
- Failure to report withdrawals from the study
- No a priori sample size calculation/effect-size estimation (power calculation)
- No clear a priori statement or description of the Null-Hypothesis under investigation
- Failure to use and report randomisation
- Method of randomisation not clearly stated
- Failure to use and report blinding if possible
- Failure to report initial equality of baseline characteristics and comparability of study groups
- Use of an inappropriate control group
- Inappropriate testing for equality of baseline characteristics
Statistical errors and deficiencies related to data analysis
- Use of wrong statistical tests
- Incompatibility of statistical test with type of data examined
- Unpaired tests for paired data or vice versa
- Inappropriate use of parametric methods
- Use of an inappropriate test for the hypothesis under investigation
- Inflation of Type I error
- Failure to include a multiple-comparison correction (see the R sketch after this list)
- Inappropriate post-hoc Subgroup analysis
- Typical errors with Student’s t-test
- Failure to check test assumptions
- Unequal sample sizes for paired t-test
- Improper multiple pair-wise comparisons of more than two groups
- Use of an unpaired t-test for paired data or vice versa
- Typical errors with chi-square-tests
- No Yates-continuity correction reported if small numbers
- Use of chi-square when expected numbers in a cell are <5
- No explicit statement of the tested Null-Hypotheses
- Failure to use multivariate techniques to adjust for confounding factors
Errors related to the documentation of statistical methods applied
- Failure to specify/define all tests used clearly and correctly
- Failure to state number of tails
- Failure to state if test was paired or unpaired
- Wrong names for statistical tests
- Referring to unusual or obscure methods without explanation or reference
- Failure to specify which test was applied on a given set of data if more than one test was done
- Vague “where appropriate” statements (e.g. “statistical tests were applied where appropriate”)
Statistical errors and deficiencies related to the presentation of study data
- Inadequate graphical or numerical description of basic data
- Mean but no indication of variability of the data
- Giving SE instead of SD to describe data
- Use of mean (SD) to describe non-normal data
- Failure to define the ± notation used to describe variability, or use of unlabelled error bars
- Inappropriate and poor reporting of results
- Results given only as p-values, no confidence intervals given
- Confidence intervals given for each group rather than for contrasts
- “p = NS”, “p <0.05” or other arbitrary thresholds instead of reporting exact p-values
- Numerical information given to an unrealistic level of precision
Statistical errors and deficiencies related to the interpretation of study findings
- Wrong interpretation of results
- “non significant” interpreted as “no effect”, or “no difference”
- Drawing conclusions not supported by the study data
- Significance claimed without data analysis or statistical test mentioned
- Poor interpretation of results
- Disregard for Type II error when reporting non-significant results
- Missing discussion of the problem of multiple significance testing if done
- Failure to discuss sources of potential bias and confounding factors
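As a rough illustration of how a few of the data-analysis pitfalls above can be avoided in base R. The data frame and variable names here are invented purely for the example.

# Hypothetical data: one continuous outcome measured in three groups
set.seed(42)
dat <- data.frame(outcome = c(rnorm(20, 10), rnorm(20, 11), rnorm(20, 10.5)),
                  group = rep(c("A", "B", "C"), each = 20))

# Check the normality assumption before reaching for a t-test or ANOVA
shapiro.test(dat$outcome[dat$group == "A"])

# Check equality of variances between the two groups being compared
var.test(outcome ~ group, data = subset(dat, group %in% c("A", "B")))

# Pairwise comparisons of more than two groups, with a multiplicity correction
pairwise.t.test(dat$outcome, dat$group, p.adjust.method = "bonferroni")

# Report the exact p-value and a confidence interval, not just "p < 0.05"
t.test(outcome ~ group, data = subset(dat, group %in% c("A", "B")))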
Wednesday, 26 October 2011
Pick a number any number
Anyway I thought it would be good to analyse the data to see which countries had done well or not so well economically over the last 30 or so years. The results were interesting in themselves (more on that in another post), but I did notice one particularly striking finding.
> cor(Y1980, newgdp, method = "pearson")
[1] 0.7785629
> cor(Y1980, newgdp, method = "spearman")
[1] 0.9456404
> cor(Y1980, newgdp, method = "kendall")
[1] 0.8039875
With the type of data I had, two continuous variables, I needed to run a correlation test on them. I actually had a look at all three methods, Pearson, Spearman and Kendall, as you can see above, and the difference in the results on the same data did surprise me somewhat. So getting the statistical test right is important, as the differences can be large.
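As an aside, cor only returns the coefficient; base R's cor.test also gives a p-value and, for Pearson, a confidence interval. A quick sketch using the same variables as above:

# Pearson: coefficient, exact p-value and 95% confidence interval
cor.test(Y1980, newgdp, method = "pearson")

# Spearman works on ranks, so it is less sensitive to skew and outliers,
# which is probably why it comes out higher on GDP-type data like this
cor.test(Y1980, newgdp, method = "spearman")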
Tuesday, 25 October 2011
4 Ways
Here's the code
# 1st plot
E <- ggplot(wecon30, aes(A1980, B2010)) +
  geom_smooth(method=lm, se=FALSE) +
  xlab("1980 GDP per person US dollars purchasing power parity") +
  ylab("2010 GDP per person US dollars ppp") +
  opts(title = "GDP per person 1980 & 2010") +
  geom_text(aes(label=Economy, x = A1980, y = B2010), colour="black", size=3)

# 2nd plot
E + facet_grid(continent ~ .)

# 4th plot
F <- ggplot(wecon30, aes(A1980, B2010, color=Continent)) +
  geom_smooth(method=lm, se=FALSE) +
  xlab("1980 GDP per person US dollars purchasing power parity") +
  ylab("2010 GDP per person US dollars ppp") +
  opts(title = "GDP per person 1980 & 2010") +
  geom_point(pch=19)

# 3rd plot
F + facet_grid(Continent ~ .)
More on this tomorrow...
Wednesday, 19 October 2011
#Rstats code that makes you go YES!
So when I found out how to use a regular expression to obliterate their legacy it felt great. Data providers, please note: it's not helpful to have a number in a spreadsheet, say 2389, and then add a space and the number one after it to create the monster that is 2389 1.
My solution below uses gsub, which covers all instances, or sub if you just want to change the first one. It takes the pattern you want to find first, then the replacement, and finally the data. I worried it might also remove the 1s at the start of the numbers I wanted to keep, but it didn't. Phew!
> gsub(" 1", " ", MyData$variablename)
Being able to remove one object is something you learn early on. Later on you want to have a clear out and then you find out how to obliterate everything. That will carry you so far but what I have been hankering after for a bit is something to remove everything except the object you're still interested in working on.
> rm(list=setdiff(ls(), "MyData"))
Working with categorical data I wanted a way to easily count the number of each category in a single variable, for instance how many men and women there are in a gender variable. You might expect a function called count, but there isn't one in base R. Still, summary does the job if you give it the data frame and variable name.
>summary(MyData$variablename)
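For a plain count of each category, table does the same job directly (summary only gives counts when the variable is stored as a factor). A small sketch, with a made-up gender variable standing in for whatever you are counting:

# How many of each category, e.g. men and women
table(MyData$gender)

# The same counts expressed as proportions
prop.table(table(MyData$gender))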
Saturday, 15 October 2011
Basic Matrices in R
> # create a 4x3 matrix
> a = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12), nrow=4)
> a
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> # arrange the numbers by row
> a = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12), nrow=4, byrow=T)
> a
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
> # Combine two 2x2 matrices by row and by column
> e = matrix(c(1,2,3,4), nrow=2)
> f = matrix(c(5,6,7,8), nrow=2)
> e
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> f
     [,1] [,2]
[1,]    5    7
[2,]    6    8
> cbind(e,f)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> rbind(e,f)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
[3,]    5    7
[4,]    6    8
> # Take a single element from matrix a: row 3, column 2
> a[3,2]
[1] 8
> # Take a 2x3 subset of matrix a, change the numbers and combine it back into the original
> a[c(3,4), c(1,2,3)]
     [,1] [,2] [,3]
[1,]    7    8    9
[2,]   10   11   12
> a[c(3,4), c(1,2,3)] = matrix(c(1,2,3,4,5,6), nrow=2)
> a
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    1    3    5
[4,]    2    4    6
More here
Saturday, 30 July 2011
Artificial Intelligence - a new mode of learning
The same process has hardly started in higher education, but the more you think about it, the generation that grew up learning everything online is going to find it weird to pay some institution £9,000 a year to sit in a lecture hall when you can learn the same stuff online for free or a lot less. University will still be a central feature of many people's lives, as the desire to move out of home is a great one for many teenagers, but increasingly people are going to do their learning on the net.
Step forward Peter Norvig and Sebastian Thrun with their Introduction to Artificial Intelligence class, the online version of the introduction to artificial intelligence class they run at Stanford. While the students there have to get dressed to learn, people who learn online can stay in their underpants while they feast their minds on the same brainfood.
No, this won't get you a Stanford degree, but the Jeffrey Archers of the future will surely be putting STANFORD on their CVs in as big letters as they can cope with, safe in the knowledge that they avoided the earthquake risk of visiting California before the big one strikes. You also get to compete with the people who do go to Stanford. For people whose access to HE is restricted this is a great opportunity, and one which should be extended as far as possible. Elites will be made more porous. Society will be less bound by results in examinations taken at the age of 18 that people really don't care about just a few years later. This is, to cut a long story short, a good thing.
So if you want to learn about all the interesting things that make up the field of artificial intelligence, then why not sign up? It's free after all.
Thursday, 28 July 2011
Analyzing a Twitter hashtag with R
I decided on this hashtag because I thought it would be relatively small; the Twitter API only goes up to 1500 tweets anyway. In fact #blueskyhilton turns out to have had 240 mentions, which is a good size to be starting with.
There are a few conclusions to be drawn:
1) On a technical aside, I need to find a better way to cope with shortened URLs (one rough option is sketched after this list)
2) I don't think we've reached the limits of what visualisation can do in presenting the information. On Heuristic Andrew there is a cluster dendrogram plot, which is fine for a statistically minded audience, but the rest of the population might find it a little perplexing.
3) I need to draw clearer lines between who is tweeting, what they are tweeting and what are retweets.
4) The focus of the tweets seems to come from a small number of users, but that may be due to the small sample size. Offhand ideas from a political backroom boy that most voters don't recognise don't excite Twitter.
5) Not all Tweeters on an issue will use the same hashtag so I may need a wider range of search terms.
6) This type of analysis would benefit from proper scripts and repeat sampling.
7) The corner of Twitter that was involved in #blueskyhilton won't be sending any Christmas cards to No. 10.
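On point 1, one rough option (not what I actually ran here) would be to strip anything that looks like a URL from the tweet text before building the corpus, so that shortened links don't end up as tokens like httpbitlyesdjf9:

# Remove anything starting with http up to the next whitespace
bsh.vectors <- gsub("http[^[:space:]]+", "", bsh.vectors)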
Anyway, here's most of the code I used. Don't forget you'll need the XML package. For some reason Pretty R has turned <- into < in places. I think I prefer ugly R.
> bsh.vectors <- character(0)
> for (page in c(1:15))
+ {
+ twitter_q <-URLencode('#blueskyhilton')
+ twitter_url = paste('http://search.twitter.com/search.atom?q=',twitter_q,'&rpp=100&page=', page, sep='')
+ bsh.xml <-xmlParseDoc(twitter_url, asText=F)
+ bsh.vector <-xpathSApply(bsh.xml,'//s:entry/s:title', xmlValue, namespaces =c('s'='http://www.w3.org/2005/Atom'))
+ bsh.vectors <- c(bsh.vector, bsh.vectors)
+ }
> length(bsh.vectors)
[1] 240
> install.packages('tm')
> require(tm)
Loading required package: tm
> bsh.corpus <-Corpus(VectorSource(bsh.vectors))
> bsh.corpus <-tm_map(bsh.corpus, tolower)
> my_stopwords <- c(stopwords('english'), '#blueskyhilton', 'blueskyhilton')
> bsh.corpus <- tm_map(bsh.corpus, removeWords, my_stopwords)
> bsh.corpus <- tm_map(bsh.corpus, removePunctuation)
> bsh.dtm <-TermDocumentMatrix(bsh.corpus)
> bsh.dtm
A term-document matrix (849 terms, 240 documents)
Non-/sparse entries: 1785/201975
Sparsity : 99%
Maximal term length: 36
Weighting : term frequency (tf)
> findFreqTerms(bsh.dtm, lowfreq=10)
[1] "abolish" "archiebland" "benefits"
[4] "day" "growth" "hashtag"
[7] "hilton" "ideas" "months"
[10] "people" "policy" "psbook"
[13] "steve" "stevehiltontapes" "wednesday"
[16] "week"
> findAssocs(bsh.dtm, 'hilton', 0.20)
hilton steve blood brain
1.00 0.98 0.43 0.43
notices alan equivalent httpbitlyesdjf9
0.43 0.36 0.36 0.36
minimal partridge somalia political
0.36 0.36 0.36 0.32
cut phildyson research send
0.31 0.31 0.31 0.31
ideas hashtag jimthehedgehog nine
0.29 0.26 0.26 0.25
senior servant supply
0.25 0.25 0.22
> findAssocs(bsh.dtm, 'hilton', 0.10)
hilton steve blood
1.00 0.98 0.43
brain notices alan
0.43 0.43 0.36
equivalent httpbitlyesdjf9 minimal
0.36 0.36 0.36
partridge somalia political
0.36 0.36 0.32
cut phildyson research
0.31 0.31 0.31
send ideas hashtag
0.31 0.29 0.26
jimthehedgehog nine senior
0.26 0.25 0.25
servant supply adambienkov
0.25 0.22 0.18
andy automatic born
0.18 0.18 0.18
calsberg coulson counterbalance
0.18 0.18 0.18
dangerous director fascistic
0.18 0.18 0.18
fucking hit httpbitlyo9txpg
0.18 0.18 0.18
httpcoameekab httpcoeaaligl leaked
0.18 0.18 0.18
meme minute moron
0.18 0.18 0.18
plans plonkers quota
0.18 0.18 0.18
spawns strategy streak
0.18 0.18 0.18
stupid thejamesdixon twitter
0.18 0.18 0.18
months civil benefits
0.17 0.16 0.14
archiebland cameron comprehension
0.13 0.13 0.11
defying politicians thestevehiltontapes
0.11 0.11 0.11
Wednesday, 27 July 2011
What Statistical Test Should I Use?
Many a statistics student is grateful to the magnificent Andy Field and his Discovering Statistics Using SPSS, in particular page 822 in the 3rd edition. My revision aid uses the same information but presents it in a more digestible format.
Which Statistical Test?
Monday, 4 July 2011
How to win the Lottery with R
Firstly, don't use rnorm, as it draws from a continuous normal distribution and won't give you whole numbers; instead use sample, as I've set out below. You can also use it on other data, not just numbers, but that's outside the scope of this post.
sample takes: what we're sampling from (in this case the numbers 1 to 50), how many samples you want to take (i.e. 5), and whether you want to be able to pick the same thing more than once (being a lottery, we don't, so replace=F). For the EuroMillions you need to sample twice: once for the main numbers (5 numbers between 1 and 50) and then again for the 2 stars (2 numbers between 1 and 11). Anyway, this is how to do it.
> lotto<-sample(1:50,5,replace=F)
> lotto2<-sample(1:11,2,replace=F)
> lotto
[1] 16 17 4 49 30
> lotto2
[1] 5 1
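As a side note, if you want the same "random" picks to come out every time you run the code, set the random seed first:

> set.seed(2011)  # any number will do; it just makes the draw repeatable
> sample(1:50, 5, replace=F)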
So will this actually help you to win the lottery? Alas not, but by using random numbers you're less likely to choose a common combination of numbers and have to share the jackpot with all the other punters who love to use 1,2,3,4,5,6,7 and other similarly common numerical choices.
More on choosing random numbers in R, and in the VERY unlikely event that you do win, you may need this: Stuff Rich People Love
Friday, 24 June 2011
When the OECD met ggplot2
I wanted to have a go at some of the more advanced graphing available in R, so that meant the ggplot2 package which people keep going on about. I got the last 6 months of growth figures off the OECD website for all the member countries that had produced them, and kept throwing various combinations of R/ggplot2 code at the data until I got what I wanted.
What I found was that, as with R itself, the ggplot2 learning curve is steep, but it does give total control if you're prepared to spend the time on it. I still haven't worked out how to add the quartiles automatically (a possible approach is sketched after the code below), but adding the horizontal lines by hand I think worked better, as I could make the outer ones dashed, and finding the figures was easy enough with quantile(GDP).
This is also the first time I've used RStudio and I have to say it's great. Perhaps I found it a little slow, but it's such a pretty and helpful environment to work in that I'm going to stick with it.
Anyway here is the code I ended up using.
ggplot(oecd2, aes(x=CC, y=GDP)) +
  geom_text(aes(label=Country, x = CC, y = GDP), colour="black", size=5) +
  xlab("Country") + ylab("Growth as % GDP Change + Average & Quartiles") +
  opts(title = "Growth in OECD Countries (Q4 2010 - Q1 2011)") +
  geom_hline(yintercept=0.5, linetype=2) +
  geom_hline(yintercept=1.6, linetype=1) +
  geom_hline(yintercept=1.9, linetype=2)
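On the bit I hadn't worked out: geom_hline will take a vector of intercepts, so the quartile lines could probably be computed with quantile and added in one go rather than typed in by hand. A sketch along those lines, untested against this particular data set:

library(ggplot2)

# quartiles of the growth figures, computed rather than hard-coded
q <- quantile(oecd2$GDP, probs = c(0.25, 0.5, 0.75))

ggplot(oecd2, aes(x = CC, y = GDP)) +
  geom_text(aes(label = Country), colour = "black", size = 5) +
  geom_hline(yintercept = q[c(1, 3)], linetype = 2) +  # outer quartiles, dashed
  geom_hline(yintercept = q[2], linetype = 1) +        # median, solid
  xlab("Country") + ylab("Growth as % GDP Change + Average & Quartiles")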
Wednesday, 22 June 2011
Data Without Borders - Doing Good With Data
Anyway I got a tweet from DataMiner UK which mentioned Data Without Borders, which sounded cool, so I signed up for the email and below is what we all got back. If you're interested in this, why not sign up as well?
Well that was awesomely unexpected.
Some quick updates on next steps:
New Website: We’ll be working to roll out a proper site in early July where NGOs / non-profits can submit their proposals and where data people, be they statisticians, hackers, or other, can get involved and learn about upcoming events / partnerships.
Kickoff Event – NY Hack Day: Our first event will likely take the form of a NY-based hack day slated for the fall, in which three selected organizations will present proposals for short-term projects they'd like assistance with (that we’ll help you plan out). We’ll then have local data folk come and help realize their visions or consult on other ways to tackle the problems. Of course there will probably be pizza and beer.
Other Kickoffs: Not in NY? Fear not! We ultimately plan for DWB to be a free exchange of data scientists, regardless of locale. As this is our first event, logistics dictate that we do something close to home, but we’re also looking into sponsoring similar events across the country / world (if you’re someone that could help us make that happen, get in touch!) Heck, if you’re inclined, feel free to throw your own and tell us how it goes. We can compare notes.
New Name for the Site: This may seem trivial, but the site needs a proper name. Data is already borderless, Data Scientists Without Borders is a mouthful and exclusive sounding, and I’ve heard rumor that other professions ‘without borders’ are very protective of trademark. So before this goes too far, suggest a name for this initiative by emailing datawithoutborders AT gmail.com with your clever titles. Right now I’m partial to DataCorps (thanks, Mike Dewar!), so feel free to +1 that if you’re feeling unimaginative.
Once again, thanks to everyone who’s shown interest in this cause and who wants to get involved. If you emailed me personally and I haven’t gotten back to yet, my apologies. I’m working my way through the list and will get back to you as soon as I can. Ping me at datawithoutborders AT gmail.com if you haven’t heard from me by the end of the week.
We’ll send another update when the site’s live and will send word as our kickoff event approaches. Thanks again for inspiring me, Internet, and we're looking forward to doing amazing things with you all.
Best,
Jake
Thursday, 21 April 2011
Pairing up
Visualisation is a very easy thing to do in R. For instance when you just have two variables you're interested in you can simply plot them.
We live in a complex world with tonnes of data, and generally you're going to be interested in data sets where you want to examine more than two variables. How can you do that? Dead easy: use the pairs function.
This gives a much more interesting exploration of the data. I like to add a smoothed trend line with panel=panel.smooth, as this has greater visual impact and helps guide the eye to select variables for further analysis.
Below is the code I used.
> londondata<-read.table(file.choose(), header=T, sep=",")
> attach(londondata)
> summary(londondata)
> names(londondata)
> plot(JSArateNov09, Hpinflation)
> pairs(londondata)
> pairs(londondata,panel=panel.smooth)
The information came from the London Data Store. All figures are for 2009 unless otherwise indicated, and are by borough. The data variable names explained:
Area - Borough
HousePrice96 - 1996 Median house price
HousePrice09 - 2009 Median house price
EnterpriseBirth09 - New company creations in 2009
ActiveEnterprise - Number of businesses in 2009
5yearsurvival - Percentage of businesses still trading after 5 years
JSArateNov09 - Number of JSA claimants
Hpinflation - House price inflation from 1996-2009
Update
Here is a great video from Mike K Smith about working with multiple graphs in R. I think my favourite tip is windows(record=T), which enables more than one graph at a time to be displayed.