Ventures in Data Science: #Rstats code that makes you go YES!

In an ideal world we would get all the data we need in exactly the right format. But this is not an ideal world. Data is often put together for purposes different from what you want to use it for or it might just be messy. There are people in this world who want to put footnote numbers in all the cells of your data and yes they are free to roam the streets and do this.

So when I found out how to use a regular expression to obliterate their legacy it felt great. Data providers please note it's not helpful to have a number in a spreadsheet for arguments sake 2389 and then add a digit of white space and the number one after to create the monster that is 2389 1.

My solution below uses gsub which covers all instances or sub if you just want to change the first one . It works with the pattern you want to find first, then the replacement and finally the name of the data. I thought it could remove all the 1's at the start of main numbers I wanted to keep but it didn't. Phew!

> gsub( " 1"," " , MyData$variablename)

Being able to remove one object is something you learn early on. Later on you want to have a clear out and then you find out how to obliterate everything. That will carry you so far but what I have been hankering after for a bit is something to remove everything except the object you're still interested in working on.

 
> rm(list=setdiff(ls(), "MyData"))

Working with categorical data I wanted a way to easily count the number of each category in a single variable. Indeed there should be a function called count but there isn't. For instance if you wanted to know from a gender variable how many men and women there were. Still there is summary that still does the job if you put the data frame and variable name

>summary(MyData$variablename)

Ventures in Data Science

Wednesday 19 October 2011

#Rstats code that makes you go YES!

No comments:

Post a Comment