Thursday 21 April 2011

Pairing up

An ideal way to start analysing data is simply to take a look at it. A visualisation is a far more effective way for the brain to spot patterns, correlations or clusters than looking at numbers in a spreadsheet, or loading the data into R and running the summary function. That's fine as far as it goes, but there are more effective things that can be done.

Visualisation is very easy to do in R. For instance, when you have just two variables you're interested in, you can simply plot them against each other with the plot function.

We live in a complex world with tonnes of data, and generally you're going to be interested in data sets where you want to examine more than two variables. How can you do that? Dead easy: use the pairs function, which draws a scatterplot for every pair of variables in the data set.

This gives a much more interesting exploration of the data. I like to add a smoothed trend line to each panel with panel=panel.smooth (strictly a lowess curve rather than a straight regression line), as this has greater visual impact and helps guide the eye to variables worth further analysis.


Below is the code I used.

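> # Read the CSV chosen interactively; the first row holds the column names
> londondata <- read.table(file.choose(), header = TRUE, sep = ",")
> attach(londondata)                         # columns can now be used by bare name
> summary(londondata)
> names(londondata)
> plot(JSArateNov09, Hpinflation)            # two variables
> pairs(londondata)                          # every pair of variables
> pairs(londondata, panel = panel.smooth)    # same, with a smoothed line per panel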
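
A variation on the same steps, as a minimal sketch rather than what I actually ran. It assumes R's built-in iris data set as a stand-in for the London file (so it runs without the CSV), uses with() instead of attach(), and keeps only the numeric columns before calling pairs, since pairs can refuse columns that hold text.

```r
# Stand-in data: the built-in iris data set (not the London data)
dat <- iris

# Plot two variables without attach(): with() looks the names up inside dat
with(dat, plot(Sepal.Length, Petal.Length))

# Keep only the numeric columns, then draw the scatterplot matrix
num <- dat[sapply(dat, is.numeric)]
pairs(num, panel = panel.smooth)
```

With the London data you would swap dat for londondata and use the column names listed below.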

The information was from the London Data Store. All figures are for 2009 unless otherwise indicated, and are by borough. Variable names explained:

Area - Borough
HousePrice96 - 1996 Median house price
HousePrice09 - 2009 Median house price
EnterpriseBirth09 - New company creations in 2009
ActiveEnterprise - Number of businesses in 2009
5yearsurvival - Percentage of businesses trading after 5 years
JSArateNov09 - Number of JSA (Jobseeker's Allowance) claimants
Hpinflation - House price inflation from 1996-2009
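
One thing to watch with these names: by default read.table tidies column names into valid R names, so a name starting with a digit, such as 5yearsurvival, gets an X put in front of it. The sketch below shows this using a tiny made-up CSV (the borough name and value are placeholders, not real figures) passed in through the text argument.

```r
# What R does to a name that starts with a digit
make.names("5yearsurvival")          # "X5yearsurvival"

# Placeholder CSV text, invented purely to show the header handling
csv_text <- "Area,5yearsurvival\nExampleBorough,0\n"

# Default behaviour: names are tidied up on the way in
tidy <- read.table(text = csv_text, header = TRUE, sep = ",")
names(tidy)                          # "Area" "X5yearsurvival"

# check.names = FALSE keeps the header exactly as written in the file
raw <- read.table(text = csv_text, header = TRUE, sep = ",",
                  check.names = FALSE)
raw$`5yearsurvival`                  # untidied names need backticks
```

So after reading the file in the usual way, the column to plot is X5yearsurvival rather than 5yearsurvival.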

Update
Here is a great video from Mike K Smith about working with multiple graphs in R. I think my favourite tip is windows(record=T), which (on Windows) records each graph as it is drawn, so you can page back through earlier ones instead of losing them when a new plot appears.
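
If you want two graphs side by side in the same window on any platform, one common approach is par(mfrow = ...). This is a minimal sketch, not something taken from the video, and it uses the built-in mtcars data set as a stand-in.

```r
# Split the plotting window into 1 row and 2 columns, keeping the old settings
op <- par(mfrow = c(1, 2))

plot(mtcars$wt, mtcars$mpg, main = "Weight vs mpg")
hist(mtcars$mpg, main = "Distribution of mpg")

# Put the plotting window back to how it was
par(op)
```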