Saturday 30 July 2011

Artificial Intelligence - a new mode of learning

The internet is a disruptive technology that takes existing hierarchies and squishes, pounds and generally beats them into something new. Take, for instance, the press, where traditional paper media is looking more anachronistic by the second and is set to be replaced by something we're not really sure about yet.

The same process has hardly started in higher education, but the more you think about it, the generation that grew up learning everything online is going to find it weird to pay some institution £9,000 a year to sit in a lecture hall when you can learn the same stuff online for free or a lot less. University will still be a central feature of many people's lives, as the desire to move out of home is a great one for many teenagers, but increasingly people are going to do their learning on the net.

Step forward Peter Norvig and Sebastian Thrun with their Introduction to Artificial Intelligence class, the online version of the introduction to artificial intelligence class they run at Stanford. While the students there have to get dressed to learn stuff, people who learn online can stay in their underpants while they feast their minds on the same brainfood.

No, this won't get you a Stanford degree, but the Jeffrey Archers of the future will surely be putting STANFORD in as big letters as their CVs can cope with, safe in the knowledge they avoided the earthquake risk of visiting California before the big one strikes. You also get to compete with the people who do go to Stanford. For people whose access to HE is restricted this is a great opportunity, which should be extended as far as possible. Elites will be made more porous. Society will be less bound by results in examinations at the age of 18 that people really don't care about just a few years later. This is, to cut a long story short, a good thing.

So if you want to learn about all the interesting things that make up the field of artificial intelligence, why not sign up? It's free, after all.

Thursday 28 July 2011

Analyzing a Twitter hashtag with R

I've been wanting to have a go at this since reading this great post on the subject by Heuristic Andrew. When news broke that the Prime Minister's Director of Strategy had proposed to abolish maternity leave, amongst other things, Twitter reacted with the mocking hashtag #blueskyhilton.

I decided on this hashtag because I thought it would be relatively short; the Twitter API only goes up to 1500 tweets anyway. In fact #blueskyhilton turns out to have had 240 mentions, which is a great size to be starting with.

There are a few conclusions to be drawn:

1) On a technical aside, I need to find a better way to cope with shortened URLs (one rough fix is sketched after this list).

2) I don't think we've reached the limits of what visualisation can do in presenting the information. On Heuristic Andrew there is a cluster dendrogram plot, which is fine for a statistically minded audience, but the rest of the population might find it a little perplexing. (A simple barplot alternative is sketched at the end of this post.)

3) I need to draw clearer lines between who is tweeting, what they are tweeting, and what is a retweet.

4) The focus of the tweets seems to come from a small number of users, but that may be due to the small sample size. Offhand ideas from a political backroom boy that most voters don't recognise don't excite Twitter.

5) Not all Tweeters on an issue will use the same hashtag, so I may need a wider range of search terms.

6) This type of analysis would benefit from proper scripts and repeat sampling.

7) The corner of Twitter that was involved in #blueskyhilton won't be sending any Christmas cards to No. 10.
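
On point 1, one rough approach is to strip anything that looks like a link from the tweet text before it goes into the corpus, so bit.ly links don't end up mangled into terms like "httpbitlyesdjf9". A minimal sketch, not tested against the full feed: removeURLs is a name I've made up, and the regex is crude rather than a complete URL matcher. It would slot in just before the Corpus() call in the code below.

> # Hypothetical helper: crudely delete anything starting with http
> removeURLs <- function(x) gsub('http[[:alnum:][:punct:]]*', '', x)
> bsh.vectors <- removeURLs(bsh.vectors)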

Anyway, here's most of the code I used. Don't forget you'll need the XML package.


> require(XML)
Loading required package: XML
> bsh.vectors <- character(0)
> # Pull up to 15 pages of 100 results from the Twitter search API
> for (page in c(1:15))
+ {
+ twitter_q <- URLencode('#blueskyhilton')
+ twitter_url <- paste('http://search.twitter.com/search.atom?q=', twitter_q, '&rpp=100&page=', page, sep='')
+ bsh.xml <- xmlParseDoc(twitter_url, asText=F)
+ bsh.vector <- xpathSApply(bsh.xml, '//s:entry/s:title', xmlValue, namespaces=c('s'='http://www.w3.org/2005/Atom'))
+ bsh.vectors <- c(bsh.vector, bsh.vectors)
+ }
> length(bsh.vectors)
[1] 240
> install.packages('tm')
> require(tm)
Loading required package: tm
> # Build a corpus and clean it before making the term-document matrix
> bsh.corpus <- Corpus(VectorSource(bsh.vectors))
> bsh.corpus <- tm_map(bsh.corpus, tolower)
> my_stopwords <- c(stopwords('english'), '#blueskyhilton', 'blueskyhilton')
> bsh.corpus <- tm_map(bsh.corpus, removeWords, my_stopwords)
> bsh.corpus <- tm_map(bsh.corpus, removePunctuation)
> bsh.dtm <- TermDocumentMatrix(bsh.corpus)
> bsh.dtm
A term-document matrix (849 terms, 240 documents)

Non-/sparse entries: 1785/201975
Sparsity : 99%
Maximal term length: 36
Weighting : term frequency (tf)
> findFreqTerms(bsh.dtm, lowfreq=10)
[1] "abolish" "archiebland" "benefits"
[4] "day" "growth" "hashtag"
[7] "hilton" "ideas" "months"
[10] "people" "policy" "psbook"
[13] "steve" "stevehiltontapes" "wednesday"
[16] "week"
> findAssocs(bsh.dtm, 'hilton', 0.20)
hilton steve blood brain
1.00 0.98 0.43 0.43
notices alan equivalent httpbitlyesdjf9
0.43 0.36 0.36 0.36
minimal partridge somalia political
0.36 0.36 0.36 0.32
cut phildyson research send
0.31 0.31 0.31 0.31
ideas hashtag jimthehedgehog nine
0.29 0.26 0.26 0.25
senior servant supply
0.25 0.25 0.22
> findAssocs(bsh.dtm, 'hilton', 0.10)
hilton steve blood
1.00 0.98 0.43
brain notices alan
0.43 0.43 0.36
equivalent httpbitlyesdjf9 minimal
0.36 0.36 0.36
partridge somalia political
0.36 0.36 0.32
cut phildyson research
0.31 0.31 0.31
send ideas hashtag
0.31 0.29 0.26
jimthehedgehog nine senior
0.26 0.25 0.25
servant supply adambienkov
0.25 0.22 0.18
andy automatic born
0.18 0.18 0.18
calsberg coulson counterbalance
0.18 0.18 0.18
dangerous director fascistic
0.18 0.18 0.18
fucking hit httpbitlyo9txpg
0.18 0.18 0.18
httpcoameekab httpcoeaaligl leaked
0.18 0.18 0.18
meme minute moron
0.18 0.18 0.18
plans plonkers quota
0.18 0.18 0.18
spawns strategy streak
0.18 0.18 0.18
stupid thejamesdixon twitter
0.18 0.18 0.18
months civil benefits
0.17 0.16 0.14
archiebland cameron comprehension
0.13 0.13 0.11
defying politicians thestevehiltontapes
0.11 0.11 0.11

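On point 2, one quick alternative to a dendrogram is a plain barplot of the most frequent terms, which anyone can read. A minimal sketch using the bsh.dtm built above; the cutoff of 10 just mirrors the findFreqTerms call, and the plot title is my own.

> # Total frequency of each term across all 240 tweets
> term.freqs <- sort(rowSums(as.matrix(bsh.dtm)), decreasing=TRUE)
> # Keep terms appearing at least 10 times and plot them
> barplot(term.freqs[term.freqs >= 10], las=2, main='#blueskyhilton: most frequent terms')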

Wednesday 27 July 2011

What Statistical Test Should I Use?

Here is a handy rough-and-ready guide to which statistical test you should use on your data. Yes, the colours are bright. Yes, it is a bit flowchartastic, but as a revision resource I have found it useful.

Many a statistics student is grateful to the magnificent Andy Field and his Discovering Statistics Using SPSS, in particular page 822 in the 3rd edition. My revision aid uses the same information but presents it in a more digestible format.

Which Statistical Test?

Monday 4 July 2011

How to win the Lottery with R

Firstly, don't play the lottery! You're not going to win. But with lottery fever gripping the nation (the EuroMillions jackpot is a 12X rollover standing at £154 million), this is as good a time as any to put R to one of its more practical uses: choosing random lottery numbers.

Firstly, don't use rnorm, as it won't give you whole numbers; instead use sample, like I've set out below. You can also use it on other data, not just numbers, but that's outside the scope of this post.

sample takes three arguments here: what we're sampling from (in this case the numbers 1 to 50), how many samples you want to take (i.e. 5), and whether you want to be able to pick the same thing more than once (this being a lottery, we don't, so replace=F).

For the EuroMillions you need to sample twice: once for the main numbers (5 numbers between 1 and 50) and then again for the 2 stars (2 numbers between 1 and 11). Anyway, this is how to do it.


> lotto <- sample(1:50, 5, replace=F)
> lotto2 <- sample(1:11, 2, replace=F)
> lotto
[1] 16 17  4 49 30
> lotto2
[1] 5 1


So will this actually help you to win the lottery? Alas not, but by using random numbers you're less likely to choose a common combination of numbers and have to share the jackpot with all the other punters who love to use 1,2,3,4,5,6,7 and other similar common numerical choices.
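
If you play every week (against my advice), you could wrap both draws up in a little function rather than typing them out each time. A quick sketch: euromillions.ticket is just a name I've made up, and the sorting is only for readability.

> euromillions.ticket <- function() {
+ # 5 main numbers from 1-50 and 2 lucky stars from 1-11, no repeats
+ list(main = sort(sample(1:50, 5, replace=F)),
+      stars = sort(sample(1:11, 2, replace=F)))
+ }
> euromillions.ticket()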

More on choosing random numbers in R, and in the VERY unlikely event you do win, you may need this: Stuff Rich People Love