Wednesday 4 January 2012

to the K-means and beyond ...

One of the rarer types of article in journals is the sort that gives a potted history of a discipline or sub-discipline. While such articles may not make a new contribution to knowledge, they are a great resource for anyone wanting to get up to speed with a field.

A great example of such an article is Data Clustering: 50 Years Beyond K-Means by Anil K. Jain (pdf file). As K-means is still one of the most popular clustering algorithms, it goes to show, as in other areas of statistics, that if your way of doing things arrives first and becomes widely adopted it will often trundle along even when better alternatives appear in the literature.

Tuesday 3 January 2012

How to trim a mean in R

Good news! My secondhand copy of Fundamentals of Modern Statistical Methods by Rand R. Wilcox has winged its way safely across the Atlantic. In celebration, here's how to do trimmed means in R. It's really quite simple, but as you can see below it can have a big effect on the results.

> # Let's find some data to look at the mean and the trimmed mean
> firstlotofdata <- c(14,13,10,11,19,14,12,12,12,16,18,16,13,14,15,18,17,16,14,11)
> # Some data that also has some outliers could also be helpful
> secondlotofdata <- c(15,18,11,11,12,12,19,17,14,13,12,18,17,16,17,14,11,12,875,4572)
> firstlotofdata
 [1] 14 13 10 11 19 14 12 12 12 16 18 16 13 14 15 18 17
[18] 16 14 11
> secondlotofdata
 [1]   15   18   11   11   12   12   19   17   14   13
[11]   12   18   17   16   17   14   11   12  875 4572
> mean(firstlotofdata)
[1] 14.25
> mean(secondlotofdata)
[1] 285.3
> mean(firstlotofdata, trim = 0.1)
[1] 14.1875
> mean(secondlotofdata, trim = 0.1)
[1] 14.8125
> mean(secondlotofdata, trim = 0.05)
[1] 62.38889
> length(secondlotofdata)
[1] 20

As you can see, the first lot of data doesn't produce much difference between the mean and the trimmed mean, as the values are nicely bunched together and free from outliers. That is as expected: for symmetric distributions such as the Gaussian, the trimmed mean estimates the same quantity as the ordinary mean, so with well-behaved data the two will be close.

The difference is obvious in the second lot of data, where two massive outliers hugely distort the ordinary mean. To trim 20% of cases in total you code mean(mydata, trim = 0.1), which drops 10% of cases from each end of the sorted data. With a length of 20 that means removing 4 cases. As you can see, a trim of 0.05 removes only 2 cases, which keeps the 875 in and leaves a trimmed mean of over 60.
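To make the mechanics concrete, here's a small sketch of what the trim argument does under the hood: sort the data, drop floor(n * trim) cases from each end, and take the ordinary mean of what's left. The helper function name is my own invention, not part of base R.

```r
# Hypothetical helper illustrating what mean(x, trim = ...) does:
# sort, drop floor(n * trim) cases from each tail, average the rest.
trimmed_mean_by_hand <- function(x, trim) {
  x <- sort(x)
  k <- floor(length(x) * trim)      # cases to drop from each tail
  mean(x[(k + 1):(length(x) - k)])  # mean of the remaining middle
}

secondlotofdata <- c(15, 18, 11, 11, 12, 12, 19, 17, 14, 13,
                     12, 18, 17, 16, 17, 14, 11, 12, 875, 4572)

trimmed_mean_by_hand(secondlotofdata, 0.1)  # 14.8125, same as mean(secondlotofdata, trim = 0.1)
```

With trim = 0.1 and n = 20 this drops the two smallest and two largest values (including both outliers), which is why the result matches the built-in trimmed mean above.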