Friday, 24 June 2011

When the OECD met ggplot2

Click on the picture to see full size.

I wanted to have a go at some of the more advanced graphing available in R, so that meant ggplot2, which people keep going on about. I got the last six months' growth figures from the OECD website for all the member countries that had published them, and kept throwing various combinations of R/ggplot2 code at it until I got what I wanted.

What I found was that, like R itself, ggplot2 has a steep learning curve, but it does give total control if you're prepared to spend the time on it. I still haven't worked out how to add the quartiles automatically, but adding the horizontal lines by hand worked better, I think, as I could make the outer ones dashed, and finding the figures was easy enough with quantile(GDP).

This is also the first time I've used RStudio and I have to say it's great. Perhaps I found it a little slow, but it's such a pretty and helpful environment to work in that I'm going to stick with it.

Anyway, here is the code I ended up using.

# Note the trailing +'s: if a line starts with + instead, R treats each
# line as a separate, incomplete expression and the plot never builds.
ggplot(oecd2, aes(x = CC, y = GDP)) +
  geom_text(aes(label = Country), colour = "black", size = 5) +
  xlab("Country") +
  ylab("Growth as % GDP Change + Average & Quartiles") +
  opts(title = "Growth in OECD Countries (Q4 2010 - Q1 2011)") +
  geom_hline(yintercept = 0.5, linetype = 2) +
  geom_hline(yintercept = 1.6, linetype = 1) +
  geom_hline(yintercept = 1.9, linetype = 2)
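The line positions above are hard-coded from the output of quantile(GDP). A sketch of computing them in code instead, assuming the same oecd2 data frame (note the solid line here is the median rather than the average the post uses, since quantile() doesn't return a mean):

```r
# Compute the quartiles once, then reuse them as the line positions
q <- quantile(oecd2$GDP, probs = c(0.25, 0.5, 0.75))

ggplot(oecd2, aes(x = CC, y = GDP)) +
  geom_text(aes(label = Country), size = 5) +
  geom_hline(yintercept = q[1], linetype = 2) +  # lower quartile, dashed
  geom_hline(yintercept = q[2], linetype = 1) +  # median, solid
  geom_hline(yintercept = q[3], linetype = 2)    # upper quartile, dashed
```

This way the lines track the data automatically if the OECD figures are updated.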

Wednesday, 22 June 2011

Data Without Borders - Doing Good With Data

You know what the internet's like. Someone gets a good idea, they tell their friends, who pass it on, and before you know it you're getting email from someone in a country which makes you go something like, "Suriname? Really? That's in South America? I thought it sounded quite Asian. Are we sure it's not next to Cambodia?"

Anyway, I got a tweet from DataMiner UK which mentioned Data Without Borders, which sounded cool, so I signed up for the email, and below is what we all got back. If you're interested in this, why not sign up as well?


Well that was awesomely unexpected.

What I thought would just be a casual blog post about a project I hoped to quietly roll out in the fall became a lightning rod for some of the most enthusiastic, engaged, and socially conscious data folk that I ever could have imagined. As of my writing this, over 300 of you have shown your interest in this initiative by signing up to stay in the loop on the Data Without Borders email list. Considering I envisioned 20 of us sitting around a borrowed office to tackle this problem, that turnout seems incredible to me. In addition, I've been getting emails from around the globe from people with amazing socially conscious tech projects and an unbridled enthusiasm for using tech and data to help others. To all of you, whether you’re an excel ninja working with disenfranchised communities or just an interested observer, thank you for signing up and getting involved.

Some quick updates on next steps:

New Website: We’ll be working to roll out a proper site in early July where NGOs / non-profits can submit their proposals and where data people, be they statisticians, hackers, or other, can get involved and learn about upcoming events / partnerships.

Kickoff Event – NY Hack Day: Our first event will likely take the form of a NY-based hack day slated for the fall, in which three selected organizations will present proposals for short-term projects they'd like assistance with (that we’ll help you plan out). We’ll then have local data folk come and help realize their visions or consult on other ways to tackle the problems. Of course there will probably be pizza and beer.

Other Kickoffs: Not in NY? Fear not! We ultimately plan for DWB to be a free exchange of data scientists, regardless of locale. As this is our first event, logistics dictate that we do something close to home, but we’re also looking into sponsoring similar events across the country / world (if you’re someone that could help us make that happen, get in touch!) Heck, if you’re inclined, feel free to throw your own and tell us how it goes. We can compare notes.

New Name for the Site: This may seem trivial, but the site needs a proper name. Data is already borderless, Data Scientists Without Borders is a mouthful and exclusive sounding, and I’ve heard rumor that other professions ‘without borders’ are very protective of trademark. So before this goes too far, suggest a name for this initiative by emailing datawithoutborders AT gmail.com with your clever titles. Right now I’m partial to DataCorps (thanks, Mike Dewar!), so feel free to +1 that if you’re feeling unimaginative.

Once again, thanks to everyone who’s shown interest in this cause and who wants to get involved. If you emailed me personally and I haven’t gotten back to you yet, my apologies. I’m working my way through the list and will get back to you as soon as I can. Ping me at datawithoutborders AT gmail.com if you haven’t heard from me by the end of the week.

We’ll send another update when the site’s live and will send word as our kickoff event approaches. Thanks again for inspiring me, Internet, and we're looking forward to doing amazing things with you all.


Best,

Jake

Thursday, 21 April 2011

Pairing up

An ideal way to start analysing data is simply to take a look at it. A visualisation of data is a far more effective way for the brain to spot patterns, correlations or clusters than simply looking at numbers in a spreadsheet or loading the data into R and running the summary function. That's fine as far as it goes, but there are more effective things that can be done.

Visualisation is a very easy thing to do in R. For instance when you just have two variables you're interested in you can simply plot them.

We live in a complex world with tonnes of data, and generally you're going to be interested in data sets where you want to examine more than two variables. How can you do that? Dead easy: use the pairs function.

This gives a much more interesting exploration of the data. I like to add a regression line with panel = panel.smooth, as this has greater visual impact and helps guide the eye when selecting variables for further analysis.


Below is the code I used.

londondata <- read.table(file.choose(), header = TRUE, sep = ",")  # pick the CSV interactively
attach(londondata)                       # put the columns on the search path
summary(londondata)                      # quick summary of each variable
names(londondata)                        # list the column names
plot(JSArateNov09, Hpinflation)          # simple two-variable scatter plot
pairs(londondata)                        # scatter plot of every pair of variables
pairs(londondata, panel = panel.smooth)  # the same, with smoothed trend lines

The information was from the London Data Store. All figures are for 2009 unless otherwise indicated, and are by borough. The data variable names explained:

Area - Borough
HousePrice96 - 1996 Median house price
HousePrice09 - 2009 Median house price
EnterpriseBirth09 - New company creations in 2009
ActiveEnterprise - Number of businesses in 2009
5yearsurvival - Percentage of businesses trading after 5 years
JSArateNov09 - Number of JSA claimants
Hpinflation - House price inflation from 1996-2009
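The Hpinflation column can be reconstructed from the two house-price columns. A sketch, assuming it is the simple percentage change in median house price from 1996 to 2009:

```r
# Percentage change in median house price, 1996 -> 2009 (assumed definition)
londondata$Hpinflation <- with(londondata,
  (HousePrice09 - HousePrice96) / HousePrice96 * 100)
```

For example, a borough whose median price went from 100,000 to 350,000 would get an Hpinflation value of 250.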

Update
Here is a great video from Mike K Smith about working with multiple graphs in R. I think my favourite tip is windows(record = TRUE), which enables more than one graph at a time to be displayed.
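A minimal sketch of that tip (windows() is Windows-only; on other platforms dev.new() opens an extra graphics window instead):

```r
windows(record = TRUE)  # record plots so Page Up / Page Down pages back through them
plot(rnorm(50))         # first plot
hist(rnorm(100))        # second plot; the first is still recoverable with Page Up
```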