Wednesday 14 November 2012

Data Munging in R


Whatever the fancy analysis or visualization carried out afterwards, work in R practically always starts with some data munging: load the data, get rid of the columns you don't need, rename some of the others, check the data structure and make sure the data is in the format you want. This is pretty standard, but it can also be pretty baffling when you are starting up R's notoriously steep learning curve.

So to help out I've posted some of my code with notes. The data was in CSV format, and I named the dataframe local because the data referred to local councils. Remember, if you don't know quite how to use a function, e.g. gsub, then try ?gsub and that'll bring up the help file, which often contains helpful examples. So why not find some data of your own and try this out yourself?
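
If you don't have a CSV to hand, here is a minimal sketch for following along. The council names and figures are invented for illustration, and toy and csv_text are hypothetical names; read.csv(text = ...) reads the same format from a string instead of a file.

```r
# Made-up data for illustration only; not taken from the real council file.
csv_text <- "Council,Cut per head
Anytown,-£12.50
Otherville,-£8.25"

# read.csv() is read.table() with header = TRUE and sep = "," as defaults.
# stringsAsFactors = FALSE keeps text columns as character rather than factor.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Spaces in column headers are converted to dots, e.g. "Cut.per.head".
names(toy)
str(toy)
```

The dots in the column names are why the real data below has columns such as Old.ONS.code.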


local <- read.table(file.choose(), header=TRUE, sep=",")
head(local)
#tidy the dataframe up a bit, removing unnecessary columns etc.
names(local)
local <- subset( local, select = -c(Old.ONS.code, ONS.code, Party.code ))
#check they have been removed
names(local)
# Rename some other columns by column number
names(local)[1] <- "Name"
names(local)[13] <- "CutPerHead"
names(local)[14] <- "Benefit"
names(local)[15] <- "YouthBenefit"
names(local)[16] <- "DeprivationRanking"
names(local)[17] <- "PublicSector"
names(local)[18] <- "ChildPoverty"
#check they have been amended
names(local)
#check the structure of the dataframe
str(local)
#Notice that the £ sign has got CutPerHead defined as a factor, which we don't want
#As all the numbers are negative we can simply remove all the non-numeric characters
local$CutPerHead <- gsub("[^0-9]", "", local$CutPerHead)
#Let's check that went OK
head(local$CutPerHead)
#Oops, we have removed the decimal point. Let's put it back in.
local$CutPerHead <- as.numeric(local$CutPerHead)
#The values had two decimal places, so dividing by 100 restores them
local$CutPerHead <- local$CutPerHead/100
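
The divide-by-100 step only works if every value has exactly two decimal places, and the cleaned numbers also lose their minus sign. A slightly safer sketch, assuming the values look like "-£12.50", keeps the digits, the decimal point and the minus sign in the pattern. The vector cuts is made-up example data, not the real council figures.

```r
# Made-up values for illustration; the real column may be formatted differently.
cuts <- c("-£12.50", "-£8.25", "-£103.00")

# Keep digits, the decimal point and the minus sign; strip everything else.
# Inside [ ], a trailing "-" is a literal hyphen rather than a range.
cleaned <- as.numeric(gsub("[^0-9.-]", "", cuts))

cleaned
str(cleaned)
```

On the dataframe this would be local$CutPerHead <- as.numeric(gsub("[^0-9.-]", "", local$CutPerHead)), with no division needed afterwards.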
