Tuesday, 3 January 2012

How to trim a mean in R

Good news! My secondhand copy of Fundamentals of Modern Statistical Methods by Rand R. Wilcox has winged its way safely across the Atlantic. In celebration, here's how to do trimmed means in R. It's really quite simple, but as you can see it can have a big effect on the results.

> # Let's create some data to compare the mean and the trimmed mean
> firstlotofdata <- c(14,13,10,11,19,14,12,12,12,16,18,16,13,14,15,18,17,16,14,11)
> # Some data that also has a couple of outliers will be helpful
> secondlotofdata <- c(15,18,11,11,12,12,19,17,14,13,12,18,17,16,17,14,11,12,875,4572)
> firstlotofdata
 [1] 14 13 10 11 19 14 12 12 12 16 18 16 13 14 15 18 17
[18] 16 14 11
> secondlotofdata
 [1]   15   18   11   11   12   12   19   17   14   13
[11]   12   18   17   16   17   14   11   12  875 4572
> mean(firstlotofdata)
[1] 14.25
> mean(secondlotofdata)
[1] 285.3
> mean(firstlotofdata, 0.1)
[1] 14.1875
> mean(secondlotofdata, 0.1)
[1] 14.8125
> mean(secondlotofdata, 0.05)
[1] 62.38889
> length(secondlotofdata)
[1] 20

As you can see, the first lot of data shows very little difference between the mean (14.25) and the 10% trimmed mean (14.1875), because the data are nicely bunched together and free from outliers. This is as expected: when the data are roughly symmetric, as with a Gaussian distribution, the mean and the trimmed mean estimate the same centre, so they should be close, though not necessarily identical in any one sample.
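To see why trimming barely changes the first result, it can help to look at which values a 10% trim would actually discard. This is only a quick sketch: the names sorted_first, dropped and kept are mine, not from the session above, and it assumes the trim count is the floor of n times the trim fraction, which is how I understand mean() to behave.

```r
firstlotofdata <- c(14,13,10,11,19,14,12,12,12,16,18,16,13,14,15,18,17,16,14,11)

sorted_first <- sort(firstlotofdata)
n <- length(sorted_first)
k <- floor(n * 0.1)   # cases dropped from EACH end when trim = 0.1

# The k smallest and k largest values are the ones a 10% trim discards
dropped <- c(head(sorted_first, k), tail(sorted_first, k))
kept <- sorted_first[(k + 1):(n - k)]

dropped       # 10 11 18 19
mean(kept)    # 14.1875, the same as mean(firstlotofdata, 0.1)
```

None of the dropped values sits far from the rest, so removing them hardly moves the average.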

The difference is evident in the second lot of data, which has two massive outliers (875 and 4572) that hugely distort the mean. So if you want to trim the mean by removing 20% of cases, you code mean(mydata, 0.1), which takes 10% of cases from both the top and the bottom of the distribution. With a length of 20 this means removing 4 cases, 2 from each end. Using 0.05 removes only 2 cases, 1 from each end, so the 875 stays in and the trimmed mean is still over 60.
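If you want to check the arithmetic by hand, here is a minimal sketch of the same calculation. manual_trimmed_mean is a hypothetical helper of my own, not a base R function, and it assumes a trim fraction below 0.5 and no missing values.

```r
# Hypothetical helper (not part of base R): sort, drop floor(n * trim)
# cases from each end, then average what is left.
manual_trimmed_mean <- function(x, trim = 0.1) {
  n <- length(x)
  k <- floor(n * trim)              # cases removed from EACH end
  sorted <- sort(x)
  mean(sorted[(k + 1):(n - k)])
}

secondlotofdata <- c(15,18,11,11,12,12,19,17,14,13,12,18,17,16,17,14,11,12,875,4572)

manual_trimmed_mean(secondlotofdata, 0.1)    # drops 11, 11, 875 and 4572
manual_trimmed_mean(secondlotofdata, 0.05)   # drops one 11 and 4572; 875 stays in
mean(secondlotofdata, trim = 0.1)            # built-in version for comparison
```

Naming the argument, as in mean(x, trim = 0.1), does the same thing as mean(x, 0.1) but makes it more obvious what the second number is for.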
