Ventures in Data Science: November 2012

Wednesday, 28 November 2012

A Tutorial for ggplot2

Here is a quick ggplot2 tutorial from Isomorphismes from which I've completed the plots below

These two lines of code produce the same plot but show the differences between qplot and ggplot.

> qplot(clarity, data=diamonds, fill=cut, geom="bar")
> ggplot(diamonds, aes(clarity, fill=cut)) + geom_bar()

Data displayed with a continuous scale (top) and discrete scale (bottom)

> qplot(wt, mpg, data=mtcars, colour=cyl)

> qplot(wt, mpg, data=mtcars, colour=factor(cyl))

I think it works better with different colours for the factors but you can change shapes too.

>qplot(wt, mpg, data=mtcars, shape=factor(cyl))

Dodge is probably better for comparing data but lets face it fill is prettier

> qplot(clarity, data=diamonds, geom="bar", fill=cut, position="dodge")

>qplot(clarity, data=diamonds, geom="bar", fill=cut, position="fill")

Not with this data but this is the great plot to use for comparing over a time series.

> qplot(clarity, data=diamonds, geom="freqpoly", group=cut, colour=cut, position="identity")

Changed this one to get better smoothers. More info on that here.

> qplot(wt, mpg, data=mtcars, colour=factor(cyl), geom=c("smooth", "point"), method=glm)

When dealing with lots of data points overplotting is a common problem as you can see from the first plot above.

> t.df <- data.frame(x=rnorm(4000), y=rnorm(4000))
> p.norm <- ggplot(t.df, aes(x,y))
> p.norm + geom_point()

There are 3 easy ways to deal with it. Make the points more transparent. Reduce the size of the points. Make the points hollow.

> p.norm + geom_point(alpha=.15)

> p.norm + geom_point(shape=".")

> p.norm + geom_point(shape=1)

This is also helpful for saving plots

> jpeg('rplot.jpg')
> plot(x,y)
> dev.off()

#Don't forget to turn it back on again

> dev.new()

Wednesday, 14 November 2012

Data Munging in R

Whatever the fancy analysis or visualization carried out after work in R practically always starts with some data munging. Load the data, get rid of some columns you don't need, rename some of the other columns, check the data structure and make sure the data is in the format you want. This is pretty standard it can also be pretty baffling when you start to learn R's notoriously steep learning curve.

So to help out I've posted some of my code with notes. This was for data in CSV format and local was what I named the dataframe as the data referred to local councils. Remember if you don't know quite how to use a function eg gsub then try ?gsub and that'll bring up the help file which often contains helpful examples. So why not find some data of your own and try out out this yourself.

local<-read.table(file.choose(), header=T, sep=",")
head(local)
#tidy the dataframe up a bit. Removing unnecessary columns ect.
names(local)
local <- subset( local, select = -c(Old.ONS.code, ONS.code, Party.code ))
#check they have been removed
names(local)
# Rename some other columns by column number
> names(local)[1]<-"Name"
> names(local)[13]<-"CutPerHead"
> names(local)[14]<-"Benefit"
> names(local)[15]<-"YouthBenefit"
> names(local)[16]<-"DeprivationRanking"
> names(local)[17]<-"PublicSector"
> names(local)[18]<-"ChildPoverty"
#check they have been amended
names(local)
#check the structure of the dataframe
> str(local)
#Notice that the £ sign has got CutPerHead defined as a factor which we don't want
#As all the numbers are minus we can simply remove the all non numerical characters
local$CutPerHead <- gsub("[^0-9]", "", local$CutPerHead)
#Lets check that went OK
head(local$CutPerHead)
#Oops have removed the decimal point. Lets put it back in.
local$CutPerHead<-as.numeric(local$CutPerHead)
local$CutPerHead<-(local$CutPerHead/100)