Friday, 28 October 2011
Regular Expressions
Thursday, 27 October 2011
Common Statistical Errors
Statistical errors and deficiencies related to the design of a study
- Study aims and primary outcome measures not clearly stated
- Failure to report number of participants or observations (sample size)
- Failure to report withdrawals from the study
- No a priori sample size calculation/effect-size estimation (power calculation)
- No clear a priori statement or description of the Null-Hypothesis under investigation
- Failure to use and report randomisation
- Method of randomisation not clearly stated
- Failure to use and report blinding if possible
- Failure to report initial equality of baseline characteristics and comparability of study groups
- Use of an inappropriate control group
- Inappropriate testing for equality of baseline characteristics
Statistical errors and deficiencies related to data analysis
- Use of wrong statistical tests
- Incompatibility of statistical test with type of data examined
- Unpaired tests for paired data or vice versa
- Inappropriate use of parametric methods
- Use of an inappropriate test for the hypothesis under investigation
- Inflation of Type I error
- Failure to include a multiple-comparison correction (see the R sketch after this list)
- Inappropriate post-hoc Subgroup analysis
- Typical errors with Student’s t-test
- Failure to check test assumptions
- Unequal sample sizes for paired t-test
- Improper multiple pair-wise comparisons of more than two groups
- Use of an unpaired t-test for paired data or vice versa
- Typical errors with chi-square-tests
- No Yates-continuity correction reported if small numbers
- Use of chi-square when expected numbers in a cell are <5
- No explicit statement of the tested Null-Hypotheses
- Failure to use multivariate techniques to adjust for confounding factors
Errors related to the documentation of statistical methods applied
- Failure to specify/define all tests used clearly and correctly
- Failure to state number of tails
- Failure to state if test was paired or unpaired
- Wrong names for statistical tests
- Referring to unusual or obscure methods without explanation or reference
- Failure to specify which test was applied on a given set of data if more than one test was done
- Vague “where appropriate” statements (e.g. “statistical tests were applied where appropriate”)
Statistical errors and deficiencies related to the presentation of study data
- Inadequate graphical or numerical description of basic data
- Mean but no indication of variability of the data
- Giving SE instead of SD to describe data
- Use of mean (SD) to describe non-normal data
- Failure to define the ± notation used to describe variability, or use of unlabelled error bars
- Inappropriate and poor reporting of results
- Results given only as p-values, no confidence intervals given
- Confidence intervals given for each group rather than for contrasts
- “p = NS”, “p <0.05” or other arbitrary thresholds instead of reporting exact p-values
- Numerical information given to an unrealistic level of precision
Statistical errors and deficiencies related to the interpretation of study findings
- Wrong interpretation of results
- “non significant” interpreted as “no effect”, or “no difference”
- Drawing conclusions not supported by the study data
- Significance claimed without data analysis or statistical test mentioned
- Poor interpretation of results
- Disregard for Type II error when reporting non-significant results
- Missing discussion of the problem of multiple significance testing if done
- Failure to discuss sources of potential bias and confounding factors
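As a rough illustration of how a few of the data-analysis pitfalls above can be avoided in base R. The data frame and variable names here are invented purely for the example.

# Hypothetical data: one continuous outcome measured in three groups
set.seed(42)
dat <- data.frame(outcome = c(rnorm(20, 10), rnorm(20, 11), rnorm(20, 10.5)),
                  group = rep(c("A", "B", "C"), each = 20))

# Check the normality assumption before reaching for a t-test or ANOVA
shapiro.test(dat$outcome[dat$group == "A"])

# Check equality of variances between the two groups being compared
var.test(outcome ~ group, data = subset(dat, group %in% c("A", "B")))

# Pairwise comparisons of more than two groups, with a multiplicity correction
pairwise.t.test(dat$outcome, dat$group, p.adjust.method = "bonferroni")

# Report the exact p-value and a confidence interval, not just "p < 0.05"
t.test(outcome ~ group, data = subset(dat, group %in% c("A", "B")))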
Wednesday, 26 October 2011
Pick a number any number
Anyway I thought it would be good to analyse the data to see which countries had done well or not so well economically over the last 30 or so years. The results were interesting in themselves (more on that in another post), but I did notice one particularly striking finding.
> cor(Y1980, newgdp, method = "pearson")
[1] 0.7785629
> cor(Y1980, newgdp, method = "spearman")
[1] 0.9456404
> cor(Y1980, newgdp, method = "kendall")
[1] 0.8039875
With the type of data I had, two continuous variables, I needed to run a correlation test on them. I actually had a look at all three methods, Pearson, Spearman and Kendall, as you can see above, and the difference in the results on the same data did surprise me somewhat. So getting the statistical test right is important, as the differences can be large.
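As an aside, cor only returns the coefficient; base R's cor.test also gives a p-value and, for Pearson, a confidence interval. A quick sketch using the same variables as above:

# Pearson: coefficient, exact p-value and 95% confidence interval
cor.test(Y1980, newgdp, method = "pearson")

# Spearman works on ranks, so it is less sensitive to skew and outliers,
# which is probably why it comes out higher on GDP-type data like this
cor.test(Y1980, newgdp, method = "spearman")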
Tuesday, 25 October 2011
4 Ways
Here's the code
# 1st plot
E <- ggplot(wecon30, aes(A1980, B2010)) +
  geom_smooth(method=lm, se=FALSE) +
  xlab("1980 GDP per person US dollars purchasing power parity") +
  ylab("2010 GDP per person US dollars ppp") +
  opts(title = "GDP per person 1980 & 2010") +
  geom_text(aes(label=Economy, x = A1980, y = B2010), colour="black", size=3)

# 2nd plot
E + facet_grid(continent ~ .)

# 4th plot
F <- ggplot(wecon30, aes(A1980, B2010, color=Continent)) +
  geom_smooth(method=lm, se=FALSE) +
  xlab("1980 GDP per person US dollars purchasing power parity") +
  ylab("2010 GDP per person US dollars ppp") +
  opts(title = "GDP per person 1980 & 2010") +
  geom_point(pch=19)

# 3rd plot
F + facet_grid(Continent ~ .)
More on this tomorrow...
Wednesday, 19 October 2011
#Rstats code that makes you go YES!
So when I found out how to use a regular expression to obliterate their legacy it felt great. Data providers, please note: it's not helpful to have a number in a spreadsheet, say 2389, and then add a space and the number one after it to create the monster that is 2389 1.
My solution below uses gsub, which covers all instances, or sub if you just want to change the first one. It takes the pattern you want to find first, then the replacement, and finally the data. I worried it might also remove the 1s at the start of the numbers I wanted to keep, but it didn't. Phew!
> gsub(" 1", " ", MyData$variablename)
Being able to remove one object is something you learn early on. Later on you want to have a clear out and then you find out how to obliterate everything. That will carry you so far but what I have been hankering after for a bit is something to remove everything except the object you're still interested in working on.
> rm(list=setdiff(ls(), "MyData"))
Working with categorical data I wanted a way to easily count the number of each category in a single variable, for instance how many men and women there are in a gender variable. You might expect a function called count, but there isn't one in base R. Still, summary does the job if you give it the data frame and variable name.
>summary(MyData$variablename)
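For a plain count of each category, table does the same job directly (summary only gives counts when the variable is stored as a factor). A small sketch, with a made-up gender variable standing in for whatever you are counting:

# How many of each category, e.g. men and women
table(MyData$gender)

# The same counts expressed as proportions
prop.table(table(MyData$gender))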
Saturday, 15 October 2011
Basic Matrices in R
> # create a 4x3 matrix
> a = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12), nrow=4)
> a
     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12
> # arrange the numbers by row
> a = matrix(c(1,2,3,4,5,6,7,8,9,10,11,12), nrow=4, byrow=T)
> a
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
> # Combine two 2x2 matrices by row and by column
> e = matrix(c(1,2,3,4), nrow=2)
> f = matrix(c(5,6,7,8), nrow=2)
> e
     [,1] [,2]
[1,]    1    3
[2,]    2    4
> f
     [,1] [,2]
[1,]    5    7
[2,]    6    8
> cbind(e,f)
     [,1] [,2] [,3] [,4]
[1,]    1    3    5    7
[2,]    2    4    6    8
> rbind(e,f)
     [,1] [,2]
[1,]    1    3
[2,]    2    4
[3,]    5    7
[4,]    6    8
> # Take a single element from matrix a: row 3, column 2
> a[3,2]
[1] 8
> # Take a 2x3 subset of matrix a, change the numbers and combine it back into the original
> a[c(3,4), c(1,2,3)]
     [,1] [,2] [,3]
[1,]    7    8    9
[2,]   10   11   12
> a[c(3,4), c(1,2,3)] = matrix(c(1,2,3,4,5,6), nrow=2)
> a
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    1    3    5
[4,]    2    4    6
More here
Saturday, 30 July 2011
Artificial Intelligence - a new mode of learning
The same process has hardly started in higher education, but the more you think about it, the generation that grew up learning everything online is going to find it weird to pay some institution £9,000 a year to sit in a lecture hall when you can learn the same stuff online for free or a lot less. University will still be a central feature of many people's lives, as the desire to move out of home is a great one for many teenagers, but increasingly people are going to do their learning on the net.
Step forward Peter Norvig and Sebastian Thrun with their Introduction to Artificial Intelligence class, the online version of the introduction to artificial intelligence class they run at Stanford. While the students there have to get dressed to learn, people who learn online can stay in their underpants while they feast their minds on the same brainfood.
No, this won't get you a Stanford degree, but the Jeffrey Archers of the future will surely be putting STANFORD on their CVs in as big letters as they can cope with, safe in the knowledge that they avoided the earthquake risk of visiting California before the big one strikes. You also get to compete with the people who do go to Stanford. For people whose access to HE is restricted this is a great opportunity, and one which should be extended as far as possible. Elites will be made more porous. Society will be less bound by results in examinations taken at the age of 18 that people really don't care about just a few years later. This is, to cut a long story short, a good thing.
So if you want to learn about all the interesting things that make up the field of artificial intelligence, then why not sign up? It's free after all.
Thursday, 28 July 2011
Analyzing a Twitter hashtag with R
I decided on this hashtag because I thought it would be relatively small; the Twitter API only goes up to 1500 tweets anyway. In fact #blueskyhilton turns out to have had 240 mentions, which is a good size to be starting with.
There are a few conclusions to be drawn:
1) On a technical aside, I need to find a better way to cope with shortened URLs (one rough option is sketched after this list)
2) I don't think we've reached the limits of what visualisation can do in presenting the information. On Heuristic Andrew there is a cluster dendrogram plot, which is fine for a statistically minded audience, but the rest of the population might find it a little perplexing.
3) I need to draw clearer lines between who is tweeting, what they are tweeting and what are retweets.
4) The focus of the tweets seems to come from a small number of users, but that may be due to the small sample size. Offhand ideas from a political backroom boy that most voters don't recognise don't excite Twitter.
5) Not all Tweeters on an issue will use the same hashtag so I may need a wider range of search terms.
6) This type of analysis would benefit from proper scripts and repeat sampling.
7) The corner of Twitter that was involved in #blueskyhilton won't be sending any Christmas cards to No. 10.
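On point 1, one rough option (not what I actually ran here) would be to strip anything that looks like a URL from the tweet text before building the corpus, so that shortened links don't end up as tokens like httpbitlyesdjf9:

# Remove anything starting with http up to the next whitespace
bsh.vectors <- gsub("http[^[:space:]]+", "", bsh.vectors)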
Anyway, here's most of the code I used. Don't forget you'll need the XML package. For some reason Pretty R has turned <- into < in places. I think I prefer ugly R.
> bsh.vectors <- character(0)
> for (page in c(1:15))
+ {
+ twitter_q <-URLencode('#blueskyhilton')
+ twitter_url = paste('http://search.twitter.com/search.atom?q=',twitter_q,'&rpp=100&page=', page, sep='')
+ bsh.xml <-xmlParseDoc(twitter_url, asText=F)
+ bsh.vector <-xpathSApply(bsh.xml,'//s:entry/s:title', xmlValue, namespaces =c('s'='http://www.w3.org/2005/Atom'))
+ bsh.vectors <- c(bsh.vector, bsh.vectors)
+ }
> length(bsh.vectors)
[1] 240
> install.packages('tm')
> require(tm)
Loading required package: tm
> bsh.corpus <-Corpus(VectorSource(bsh.vectors))
> bsh.corpus <-tm_map(bsh.corpus, tolower)
> my_stopwords <- c(stopwords('english'), '#blueskyhilton', 'blueskyhilton')
> bsh.corpus <- tm_map(bsh.corpus, removeWords, my_stopwords)
> bsh.corpus <- tm_map(bsh.corpus, removePunctuation)
> bsh.dtm <-TermDocumentMatrix(bsh.corpus)
> bsh.dtm
A term-document matrix (849 terms, 240 documents)
Non-/sparse entries: 1785/201975
Sparsity : 99%
Maximal term length: 36
Weighting : term frequency (tf)
> findFreqTerms(bsh.dtm, lowfreq=10)
[1] "abolish" "archiebland" "benefits"
[4] "day" "growth" "hashtag"
[7] "hilton" "ideas" "months"
[10] "people" "policy" "psbook"
[13] "steve" "stevehiltontapes" "wednesday"
[16] "week"
> findAssocs(bsh.dtm, 'hilton', 0.20)
hilton steve blood brain
1.00 0.98 0.43 0.43
notices alan equivalent httpbitlyesdjf9
0.43 0.36 0.36 0.36
minimal partridge somalia political
0.36 0.36 0.36 0.32
cut phildyson research send
0.31 0.31 0.31 0.31
ideas hashtag jimthehedgehog nine
0.29 0.26 0.26 0.25
senior servant supply
0.25 0.25 0.22
> findAssocs(bsh.dtm, 'hilton', 0.10)
hilton steve blood
1.00 0.98 0.43
brain notices alan
0.43 0.43 0.36
equivalent httpbitlyesdjf9 minimal
0.36 0.36 0.36
partridge somalia political
0.36 0.36 0.32
cut phildyson research
0.31 0.31 0.31
send ideas hashtag
0.31 0.29 0.26
jimthehedgehog nine senior
0.26 0.25 0.25
servant supply adambienkov
0.25 0.22 0.18
andy automatic born
0.18 0.18 0.18
calsberg coulson counterbalance
0.18 0.18 0.18
dangerous director fascistic
0.18 0.18 0.18
fucking hit httpbitlyo9txpg
0.18 0.18 0.18
httpcoameekab httpcoeaaligl leaked
0.18 0.18 0.18
meme minute moron
0.18 0.18 0.18
plans plonkers quota
0.18 0.18 0.18
spawns strategy streak
0.18 0.18 0.18
stupid thejamesdixon twitter
0.18 0.18 0.18
months civil benefits
0.17 0.16 0.14
archiebland cameron comprehension
0.13 0.13 0.11
defying politicians thestevehiltontapes
0.11 0.11 0.11
Wednesday, 27 July 2011
What Statistical Test Should I Use?
Many a statistics student is grateful to the magnificent Andy Field and his Discovering Statistics Using SPSS, in particular page 822 in the 3rd edition. My revision aid uses the same information but presents it in a more digestible format.
Which Statistical Test?
Monday, 4 July 2011
How to win the Lottery with R
Firstly, don't use rnorm, as it draws from a continuous normal distribution and won't give you whole numbers; instead use sample, as I've set out below. You can also use it on other data, not just numbers, but that's outside the scope of this post.
sample takes: what we're sampling from (in this case the numbers 1 to 50), how many samples you want to take (i.e. 5), and whether you want to be able to pick the same thing more than once (being a lottery, we don't, so replace=F). For the EuroMillions you need to sample twice: once for the main numbers (5 numbers between 1 and 50) and then again for the 2 stars (2 numbers between 1 and 11). Anyway, this is how to do it.
> lotto<-sample(1:50,5,replace=F)
> lotto2<-sample(1:11,2,replace=F)
> lotto
[1] 16 17 4 49 30
> lotto2
[1] 5 1
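As a side note, if you want the same "random" picks to come out every time you run the code, set the random seed first:

> set.seed(2011)  # any number will do; it just makes the draw repeatable
> sample(1:50, 5, replace=F)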
So will this actually help you to win the lottery? Alas not, but by using random numbers you're less likely to choose a common combination of numbers and have to share the jackpot with all the other punters who love to use 1,2,3,4,5,6,7 and other similarly common numerical choices.
More on choosing random numbers in R, and in the VERY unlikely event that you do win, you may need this: Stuff Rich People Love
Friday, 24 June 2011
When the OECD met ggplot2
I wanted to have a go at some of the more advanced graphing available in R, so that meant the ggplot2 package which people keep going on about. I got the last 6 months of growth figures off the OECD website for all the member countries that had produced them, and kept throwing various combinations of R/ggplot2 code at the data until I got what I wanted.
What I found was that, as with R itself, the ggplot2 learning curve is steep, but it does give total control if you're prepared to spend the time on it. I still haven't worked out how to add the quartiles automatically (a possible approach is sketched after the code below), but adding the horizontal lines by hand I think worked better, as I could make the outer ones dashed, and finding the figures was easy enough with quantile(GDP).
This is also the first time I've used RStudio and I have to say it's great. Perhaps I found it a little slow, but it's such a pretty and helpful environment to work in that I'm going to stick with it.
Anyway here is the code I ended up using.
ggplot(oecd2, aes(x=CC, y=GDP)) +
  geom_text(aes(label=Country, x = CC, y = GDP), colour="black", size=5) +
  xlab("Country") + ylab("Growth as % GDP Change + Average & Quartiles") +
  opts(title = "Growth in OECD Countries (Q4 2010 - Q1 2011)") +
  geom_hline(yintercept=0.5, linetype=2) +
  geom_hline(yintercept=1.6, linetype=1) +
  geom_hline(yintercept=1.9, linetype=2)
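On the bit I hadn't worked out: geom_hline will take a vector of intercepts, so the quartile lines could probably be computed with quantile and added in one go rather than typed in by hand. A sketch along those lines, untested against this particular data set:

library(ggplot2)

# quartiles of the growth figures, computed rather than hard-coded
q <- quantile(oecd2$GDP, probs = c(0.25, 0.5, 0.75))

ggplot(oecd2, aes(x = CC, y = GDP)) +
  geom_text(aes(label = Country), colour = "black", size = 5) +
  geom_hline(yintercept = q[c(1, 3)], linetype = 2) +  # outer quartiles, dashed
  geom_hline(yintercept = q[2], linetype = 1) +        # median, solid
  xlab("Country") + ylab("Growth as % GDP Change + Average & Quartiles")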
Wednesday, 22 June 2011
Data Without Borders - Doing Good With Data
Anyway I got a tweet from DataMiner UK which mentioned Data Without Borders, which sounded cool, so I signed up for the email and below is what we all got back. If you're interested in this, why not sign up as well?
Well that was awesomely unexpected.
Some quick updates on next steps:
New Website: We’ll be working to roll out a proper site in early July where NGOs / non-profits can submit their proposals and where data people, be they statisticians, hackers, or other, can get involved and learn about upcoming events / partnerships.
Kickoff Event – NY Hack Day: Our first event will likely take the form of a NY-based hack day slated for the fall, in which three selected organizations will present proposals for short-term projects they'd like assistance with (that we’ll help you plan out). We’ll then have local data folk come and help realize their visions or consult on other ways to tackle the problems. Of course there will probably be pizza and beer.
Other Kickoffs: Not in NY? Fear not! We ultimately plan for DWB to be a free exchange of data scientists, regardless of locale. As this is our first event, logistics dictate that we do something close to home, but we’re also looking into sponsoring similar events across the country / world (if you’re someone that could help us make that happen, get in touch!) Heck, if you’re inclined, feel free to throw your own and tell us how it goes. We can compare notes.
New Name for the Site: This may seem trivial, but the site needs a proper name. Data is already borderless, Data Scientists Without Borders is a mouthful and exclusive sounding, and I’ve heard rumor that other professions ‘without borders’ are very protective of trademark. So before this goes too far, suggest a name for this initiative by emailing datawithoutborders AT gmail.com with your clever titles. Right now I’m partial to DataCorps (thanks, Mike Dewar!), so feel free to +1 that if you’re feeling unimaginative.
Once again, thanks to everyone who’s shown interest in this cause and who wants to get involved. If you emailed me personally and I haven’t gotten back to yet, my apologies. I’m working my way through the list and will get back to you as soon as I can. Ping me at datawithoutborders AT gmail.com if you haven’t heard from me by the end of the week.
We’ll send another update when the site’s live and will send word as our kickoff event approaches. Thanks again for inspiring me, Internet, and we're looking forward to doing amazing things with you all.
Best,
Jake
Thursday, 21 April 2011
Pairing up
Visualisation is a very easy thing to do in R. For instance when you just have two variables you're interested in you can simply plot them.
We live in a complex world with tonnes of data, and generally you're going to be interested in data sets where you want to examine more than two variables. How can you do that? Dead easy: use the pairs function.
This gives a much more interesting exploration of the data. I like to add a smoothed trend line with panel=panel.smooth, as this has greater visual impact and helps guide the eye to select variables for further analysis.
Below is the code I used.
> londondata<-read.table(file.choose(), header=T, sep=",")
> attach(londondata)
> summary(londondata)
> names(londondata)
> plot(JSArateNov09, Hpinflation)
> pairs(londondata)
> pairs(londondata,panel=panel.smooth)
The information came from the London Data Store. All figures are for 2009 unless otherwise indicated, and are by borough. The data variable names explained:
Area - Borough
HousePrice96 - 1996 Median house price
HousePrice09 - 2009 Median house price
EnterpriseBirth09 - New company creations in 2009
ActiveEnterprise - Number of businesses in 2009
5yearsurvival - Percentage of businesses still trading after 5 years
JSArateNov09 - Number of JSA claimants
Hpinflation - House price inflation from 1996-2009
Update
Here is a great video from Mike K Smith about working with multiple graphs in R. I think my favourite tip is windows(record=T), which enables more than one graph at a time to be displayed.