Wednesday, 4 January 2012
One of the rarer types of article in journals is the sort that seeks to give a potted history of a discipline or sub-discipline, and while such pieces may not make a new contribution to knowledge they are a great resource for anyone wanting to get up to speed with a topic.
A great example is Data Clustering: 50 Years Beyond K-Means by Anil K. Jain (pdf file). K-means is still one of the most popular clustering algorithms, which just goes to show, as in other areas of statistics, that if your way of doing things arrives first and becomes commonly adopted it will often trundle along even when better alternatives appear in the literature.
Tuesday, 3 January 2012
How to trim a mean in R
Good news!
My secondhand copy of Fundamentals of Modern Statistical Methods by Rand R. Wilcox has winged its way safely across the Atlantic. In celebration, here's how to do trimmed means in R. It's really quite simple, but as you can see below it can have a big effect on the results.
> #Lets find some data to look at the mean and the trimmed mean
> firstlotofdata <- c(14,13,10,11,19,14,12,12,12,16,18,16,13,14,15,18,17,16,14,11)
> #Some data that also has some outliers could also be helpful
> secondlotofdata <- c(15,18,11,11,12,12,19,17,14,13,12,18,17,16,17,14,11,12,875,4572)
> firstlotofdata
 [1] 14 13 10 11 19 14 12 12 12 16 18 16 13 14 15 18 17 16 14 11
> secondlotofdata
 [1] 15 18 11 11 12 12 19 17 14 13 12 18 17 16 17 14 11 12 875 4572
> mean(firstlotofdata)
[1] 14.25
> mean(secondlotofdata)
[1] 285.3
> mean(firstlotofdata, 0.1)
[1] 14.1875
> mean(secondlotofdata, 0.1)
[1] 14.8125
> mean(secondlotofdata, 0.05)
[1] 62.38889
> length(secondlotofdata)
[1] 20
As you can see, the first lot of data doesn't produce much of a gap between the mean and the trimmed mean, as the data are nicely bunched together and free from outliers. That is as expected: when the data are roughly Gaussian the mean and the trimmed mean estimate the same thing and come out very close.
The difference is obvious in the second lot of data, which has two massive outliers that hugely distort the ordinary mean. If you want to trim 20% of cases in total you code mean(mydata, 0.1), which drops 10% of cases from each end of the distribution; with a length of 20 that means removing 4 cases. As you can see, a trim of 0.05 only removes 2 cases, which keeps the 875 in and results in a trimmed mean of over 60.
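If you want to see exactly what the trim argument is doing, here is a minimal sketch that reproduces the 0.1 trimmed mean by hand, assuming (as R's mean does) that floor(n * trim) observations are dropped from each end before averaging:
> x <- sort(secondlotofdata)          # order the data
> k <- floor(length(x) * 0.1)         # cases to drop at each end (2 here)
> mean(x[(k + 1):(length(x) - k)])    # should match mean(secondlotofdata, 0.1)
[1] 14.8125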
Friday, 28 October 2011
Regular Expressions
It's always helpful when R user groups put their presentations online, and I find regular expressions very useful for data munging.
Thursday, 27 October 2011
Common Statistical Errors
Check out this journal article (pdf) on common statistical errors in medical experiments. Don't worry: it's only the medical research that's meant to keep you alive, and the statistical tests that prove its effectiveness are often not up to scratch. Anyway, below are the errors they identified (with a small R illustration of the multiple-testing point after the list).
Statistical errors and deficiencies related to the design of a study
Statistical errors and deficiencies related to data analysis
Errors related to the documentation of statistical methods applied
Statistical errors and deficiencies related to the presentation of study data
Statistical errors and deficiencies related to the interpretation of study findings
Statistical errors and deficiencies related to the design of a study
- Study aims and primary outcome measures not clearly stated or unclear
- Failure to report number of participants or observations (sample size)
- Failure to report withdrawals from the study
- No a priori sample size calculation/effect-size estimation (power calculation)
- No clear a priori statement or description of the Null-Hypothesis under investigation
- Failure to use and report randomisation
- Method of randomisation not clearly stated
- Failure to use and report blinding if possible
- Failure to report initial equality of baseline characteristics and comparability of study groups
- Use of an inappropriate control group
- Inappropriate testing for equality of baseline characteristics
Statistical errors and deficiencies related to data analysis
- Use of wrong statistical tests
- Incompatibility of statistical test with type of data examined
- Unpaired tests for paired data or vice versa
- Inappropriate use of parametric methods
- Use of an inappropriate test for the hypothesis under investigation
- Inflation of Type I error
- Failure to include a multiple-comparison correction
- Inappropriate post-hoc subgroup analysis
- Typical errors with Student’s t-test
- Failure to prove test assumptions
- Unequal sample sizes for paired t-test
- Improper multiple pair-wise comparisons of more than two groups
- Use of an unpaired t-test for paired data or vice versa
- Typical errors with chi-square-tests
- No Yates-continuity correction reported if small numbers
- Use of chi-square when expected numbers in a cell are <5
- No explicit statement of the tested Null-Hypotheses
- Failure to use multivariate techniques to adjust for confounding factors
Errors related to the documentation of statistical methods applied
- Failure to specify/define all tests used clearly and correctly
- Failure to state number of tails
- Failure to state if test was paired or unpaired
- Wrong names for statistical tests
- Referring to unusual or obscure methods without explanation or reference
- Failure to specify which test was applied on a given set of data if more than one test was done
- “Where appropriate” statement
Statistical errors and deficiencies related to the presentation of study data
- Inadequate graphical or numerical description of basic data
- Mean but no indication of variability of the data
- Giving SE instead of SD to describe data
- Use of mean (SD) to describe non-normal data
- Failure to define the ± notation used to describe variability, or use of unlabeled error bars
- Inappropriate and poor reporting of results
- Results given only as p-values, no confidence intervals given
- Confidence intervals given for each group rather than for contrasts
- “p = NS”, “p <0.05” or other arbitrary thresholds instead of reporting exact p-values
- Numerical information given to an unrealistic level of precision
Statistical errors and deficiencies related to the interpretation of study findings
- Wrong interpretation of results
- “non significant” interpreted as “no effect”, or “no difference”
- Drawing conclusions not supported by the study data
- Significance claimed without data analysis or statistical test mentioned
- Poor interpretation of results
- Disregard for Type II error when reporting non-significant results
- Missing discussion of the problem of multiple significance testing if done
- Failure to discuss sources of potential bias and confounding factors
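To make one of those items concrete, the multiple-comparison one: base R's p.adjust shows how quickly results that look significant melt away once you correct for the number of tests. The p-values below are made up purely for illustration.
> rawp <- c(0.001, 0.008, 0.020, 0.041, 0.049, 0.060, 0.120, 0.300, 0.450, 0.800)
> sum(rawp < 0.05)                       # five pass 0.05 with no correction at all
> p.adjust(rawp, method = "bonferroni")  # controls the family-wise Type I error rate
> p.adjust(rawp, method = "holm")        # Holm does the same job less conservatively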
Wednesday, 26 October 2011
Pick a number any number
I was looking at GDP per person the other day on the UN HDI website and I thought the data was quite interesting, not least because there was a lot of incomplete data for the 1980 figure compared to the 2010 one. That appeared to be partly because they didn't have the data for whatever reason, but also because a lot of new countries have been created since 1980.
Anyway, I thought it would be good to analyze the data to see which countries had done well or not so well economically over the last 30 or so years. The results were interesting in themselves, more on which in another post, but I did notice a particularly striking finding.
> cor(Y1980, newgdp, method = "pearson")
[1] 0.7785629
> cor(Y1980, newgdp, method = "spearman")
[1] 0.9456404
> cor(Y1980, newgdp, method = "kendall")
[1] 0.8039875
With the type of data I had, two continuous variables, I needed to do a test of correlation on them. I actually had a look at all three tests, Pearson, Spearman and Kendall, as you can see above, and the difference in the results on the same data did surprise me somewhat. So getting the statistical test right is important, as the differences can be large.
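If you want a p-value and, for Pearson, a confidence interval rather than just the coefficient, cor.test takes the same method argument; a quick sketch using the same Y1980 and newgdp vectors as above:
> cor.test(Y1980, newgdp, method = "pearson")    # coefficient plus test and confidence interval
> cor.test(Y1980, newgdp, method = "spearman")   # rank-based, so far less swayed by the huge GDP outliers
> cor.test(Y1980, newgdp, method = "kendall")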
Tuesday, 25 October 2011
4 Ways
Plotting the data by country rather than by point is informative but is bunched in the bottom left.
One potential way to solve the problem of overcrowding is to facet the data by region.
The use of points here solves the problem of overcrowding but it makes it far too hard to compare the slope of the regression line.
I think this is the most effective implementation as having the regression lines for each region shows effectively the rise of Asia, the unchanging poverty of Africa and the malaise in the Middle East.
Here's the code
# 1st plot
E <- ggplot(wecon30, aes(A1980, B2010)) +
  geom_smooth(method=lm, se=FALSE) +
  xlab("1980 GDP per person US dollars purchasing power parity") +
  ylab("2010 GDP per person US dollars ppp") +
  opts(title = "GDP per person 1980 & 2010") +
  geom_text(aes(label=Economy, x = A1980, y = B2010), colour="black", size=3)

# 2nd plot
E + facet_grid(Continent ~ .)

# 4th plot
F <- ggplot(wecon30, aes(A1980, B2010, color=Continent)) +
  geom_smooth(method=lm, se=FALSE) +
  xlab("1980 GDP per person US dollars purchasing power parity") +
  ylab("2010 GDP per person US dollars ppp") +
  opts(title = "GDP per person 1980 & 2010") +
  geom_point(pch=19)

# 3rd plot
F + facet_grid(Continent ~ .)
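One caveat if you are on a newer version of ggplot2: opts() has since been replaced, so if it isn't available the 4th plot can be written with labs() instead, something along these lines:
F <- ggplot(wecon30, aes(A1980, B2010, colour = Continent)) +
  geom_smooth(method = lm, se = FALSE) +
  geom_point(pch = 19) +
  labs(title = "GDP per person 1980 & 2010",
       x = "1980 GDP per person US dollars purchasing power parity",
       y = "2010 GDP per person US dollars ppp")
F + facet_grid(Continent ~ .)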
More on this tomorrow...
Wednesday, 19 October 2011
#Rstats code that makes you go YES!
In an ideal world we would get all the data we need in exactly the right format. But this is not an ideal world. Data is often put together for purposes different from what you want to use it for or it might just be messy. There are people in this world who want to put footnote numbers in all the cells of your data and yes they are free to roam the streets and do this.
So when I found out how to use a regular expression to obliterate their legacy it felt great. Data providers, please note: it's not helpful to have a number in a spreadsheet, say 2389, and then add a space and the number one after it to create the monster that is 2389 1.
My solution below uses gsub, which covers all instances (or sub if you just want to change the first one). It takes the pattern you want to find first, then the replacement, and finally the data. I thought it might remove the 1s at the start of the main numbers I wanted to keep, but it didn't. Phew!
> gsub( " 1"," " , MyData$variablename)
Being able to remove one object is something you learn early on. Later on you want to have a clear out and then you find out how to obliterate everything. That will carry you so far but what I have been hankering after for a bit is something to remove everything except the object you're still interested in working on.
> rm(list=setdiff(ls(), "MyData"))
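The same trick keeps more than one object if you pass setdiff a vector of names (MyOtherData is just a made-up example):
> rm(list = setdiff(ls(), c("MyData", "MyOtherData")))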
Working with categorical data, I wanted a way to easily count the number of each category in a single variable, for instance how many men and women there are in a gender variable. You might expect a function called count, but base R doesn't have one. Still, summary does the job if you give it the data frame and variable name (as long as the variable is a factor).
>summary(MyData$variablename)
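table does the same counting job and, unlike summary, doesn't need the variable to be a factor:
> table(MyData$variablename)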