GEOG5870/1M: Web-based GIS A course on web-based mapping

Tour de France data

This dataset is tdf_tweets.csv, a table of 1000 ‘tweets’ harvested from the Twitter Streaming API using a modified version of tweepy, a pre-existing Python library.

In fact, the original dataset was much larger, consisting of roughly 42,000 tweets. These were filtered from dozens of gigabytes of Twitter data using a long R script, which took around 1 day to run on a fast computer! A random sample of 1000 of these tweets was taken for this module. To address confidentiality issues, a number of changes were made to the files using another R script called tdf-clean because it cleans Tour de France data. Get it?!

Key features of this script include:

Removal of superfluous variables

Raw Twitter data contains a large number of variables per tweet. To reduce its size, the table was shrunk both in length (number of rows) and in width (number of columns).
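The shrinking step can be sketched in a few lines of R. This is not the tdf-clean script itself: the toy `raw` table and the `user_id` and `source` columns are hypothetical stand-ins, and only the column names lat, lon, created and text are taken from the tables shown later on this page.

```r
# Toy stand-in for the raw data; the real table is far longer and wider.
set.seed(2014)  # make the random sample reproducible
raw <- data.frame(
  lat = runif(50, 50, 55),
  lon = runif(50, -3, 0),
  created = format(as.POSIXct("2014-07-05 07:00:00", tz = "UTC") + 60 * (1:50)),
  text = paste("tweet number", 1:50),
  user_id = 1:50,   # hypothetical superfluous variable
  source = "web",   # hypothetical superfluous variable
  stringsAsFactors = FALSE
)

keep <- c("lat", "lon", "created", "text")  # reduce width: columns to retain
n_keep <- 10                                # reduce length: rows to retain (1000 in the module's dataset)

# Take a random sample of rows and keep only the chosen columns
small <- raw[sample(nrow(raw), n_keep), keep]
dim(small)
```

Sampling row indices with `sample()` and subsetting by column name keeps both reductions in a single, readable line.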

Removal of sensitive text

It is hard to identify what constitutes ‘sensitive’ text, so any words which were unusual, or which contained HTML links or the identifying @ symbol, were removed.
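A minimal sketch of this kind of word filter is given below. The function name `remove_sensitive` and the pattern it uses are assumptions made for illustration, not the contents of tdf-clean. The rule used to decide which words count as ‘unusual’ is not given here, so that part is left out.

```r
# Drop any word that contains an @ symbol or looks like a web link.
# Hypothetical helper: the pattern below is an assumption, not the original rule.
remove_sensitive <- function(txt) {
  vapply(strsplit(trimws(txt), "\\s+"), function(words) {
    keep <- !grepl("@|https?://|www\\.", words)
    paste(words[keep], collapse = " ")
  }, character(1))
}

remove_sensitive(c(
  "Great day with @someone at the tdf http://example.com/abc",
  "No links here"
))
```

Splitting on whitespace first means a whole word is dropped as soon as any part of it matches, which errs on the side of removing too much rather than too little.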

Setting maximum number of words

To make the text shorter and more manageable, a maximum number of words per tweet was set. To remove excess words, a new R function was defined and run on the dataset.
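A function of this kind might look like the sketch below. The name `limit_words` and the default of 10 words are assumptions: the longest sample texts shown later on this page have ten words, which is consistent with such a limit, but the value actually used is not stated.

```r
# Keep only the first `max_words` words of each text.
# Hypothetical helper; the default limit of 10 is an assumption.
limit_words <- function(txt, max_words = 10) {
  vapply(strsplit(trimws(txt), "\\s+"), function(words) {
    paste(head(words, max_words), collapse = " ")
  }, character(1))
}

limit_words("one two three four five", max_words = 3)
```

Because the function is vectorised over `txt`, it could be applied to a whole column in one call, for example `tweets$text <- limit_words(tweets$text)`.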

Analysing the data

There are many options for analysing the data. R is a powerful data analysis tool; to load the data and begin to analyse it in R, try the following:
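A minimal sketch of a first session, assuming tdf_tweets.csv sits in the working directory and has columns named lat, lon, created and text as in the tables on this page. The fallback rows, the UTC time zone and the treatment of (0, 0) coordinates as missing locations are assumptions added so that the example runs on its own.

```r
f <- "tdf_tweets.csv"  # assumed to be in the working directory

if (file.exists(f)) {
  tweets <- read.csv(f, stringsAsFactors = FALSE)
} else {
  # Fallback so the sketch runs without the file:
  # three rows typed in from the tables on this page
  tweets <- read.csv(text = 'lat,lon,created,text
51.50886,-0.265492,2014-07-05 07:29:04,but this As its my first Tour de France Im
53.80314,-1.542609,2014-07-03 12:15:50,Great view from 2 YB office TDF2014 TDFyorkshire leeds http
0.00000,0.000000,2014-07-06 18:37:12,Hello today was was in York I am a amp',
    stringsAsFactors = FALSE)
}

# Parse the timestamps (time zone assumed to be UTC)
tweets$created <- as.POSIXct(tweets$created, tz = "UTC")

# First look at the data
head(tweets[, c("lat", "lon", "created")])
summary(tweets$lat)

# Tweets at exactly (0, 0) probably carry no location;
# treating them as missing is an assumption
located <- tweets[!(tweets$lat == 0 & tweets$lon == 0), ]
nrow(located)
```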

lat       lon        created
51.50886  -0.265492  2014-07-05 07:29:04
53.80314  -1.542609  2014-07-03 12:15:50
 0.00000   0.000000  2014-07-06 18:37:12
54.04125  -1.565414  2014-07-05 09:20:11
53.99056  -1.799216  2014-07-06 12:31:03

Table: Latitude and longitude of tweets

created              text
2014-07-05 07:29:04  but this As its my first Tour de France Im
2014-07-03 12:15:50  Great view from 2 YB office TDF2014 TDFyorkshire leeds http
2014-07-06 18:37:12  Hello today was was in York I am a amp
2014-07-05 09:20:11  here we are at Ripleycastle http
2014-07-06 12:31:03  Another great day tdf http

Table: Sample of text associated with tweets