This dataset, tdf_tweets.csv, is a table of 1000 tweets harvested from the Twitter Streaming API using a modified version of tweepy, a pre-existing Python library.
In fact, the original dataset was much larger, consisting of roughly 42,000 tweets. These were filtered from dozens of gigabytes of Twitter data using a long R script, which took around a day to run on a fast computer! A random selection of 1000 of these tweets was taken for this module. To overcome confidentiality issues, a number of changes were made to the files using another R script called tdf-clean, because it cleans Tour de France data. Get it?!
Key features of this script include:
Raw Twitter data contains many fields. To reduce its size, the table was shrunk both in length (number of rows) and in width (number of columns):
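The shrinking step can be sketched as follows. The column names, the number of rows, and the sample size of 1000 are assumptions for illustration; only the "fewer rows, fewer columns" idea comes from the text above.

```r
# A sketch of shrinking a table in length and width (all names here
# are illustrative, not the actual tdf-clean code)
big <- data.frame(lat = runif(5000), lon = runif(5000),
                  created = Sys.time(), text = "example",
                  extra1 = NA, extra2 = NA)

keep_cols <- c("lat", "lon", "created", "text")   # shrink width
small <- big[sample(nrow(big), 1000), keep_cols]  # shrink length: random sample of 1000 rows
nrow(small)  # 1000
ncol(small)  # 4
```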
It is hard to identify what constitutes 'sensitive' text, so any words that were unusual, contained html links, or contained the identifying @ symbol were removed:
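One way this removal could be done is sketched below. The function name is hypothetical, and only the @ and html-link rules are shown; the actual tdf-clean script also removed unusual words, which would need a dictionary check.

```r
# Hedged sketch: drop any word containing "@" or an html link
# (the real tdf-clean rules may differ)
strip_sensitive <- function(x) {
  words <- strsplit(x, " ")[[1]]
  keep <- !grepl("@|http", words)  # flag identifying words
  paste(words[keep], collapse = " ")
}

strip_sensitive("great view @leeds http://t.co/abc TDF2014")
# "great view TDF2014"
```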
To make the text shorter and more manageable, a maximum word length was set. To remove excess words, a new R function was defined and run on the dataset:
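A function of this kind might look like the sketch below. The function name and the limit of 10 words are assumptions; the source only says that excess words were removed up to some maximum.

```r
# Hypothetical sketch: keep at most n words per tweet
truncate_words <- function(x, n = 10) {
  words <- strsplit(x, " ")[[1]]
  paste(head(words, n), collapse = " ")
}

truncate_words("one two three four five six seven eight nine ten eleven twelve")
# "one two three four five six seven eight nine ten"
```

Run over a whole column, this would be applied with `sapply(tdf$text, truncate_words)`.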
There are many options for analysing the data. R is a powerful data analysis tool; to load and begin exploring the data in R, try the following:
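A minimal loading sketch is shown below. It assumes tdf_tweets.csv sits in the working directory; the column names `lat`, `lon` and `created` are taken from the tables that follow.

```r
# Load the tweets table (adjust the path if the file lives elsewhere)
if (file.exists("tdf_tweets.csv")) {
  tdf <- read.csv("tdf_tweets.csv", stringsAsFactors = FALSE)
  # Inspect the spatial and temporal columns shown in the tables below
  head(tdf[, c("lat", "lon", "created")])
}
```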
| lat | lon | created |
|---|---|---|
| 51.50886 | -0.265492 | 2014-07-05 07:29:04 |
| 53.80314 | -1.542609 | 2014-07-03 12:15:50 |
| 0.00000 | 0.000000 | 2014-07-06 18:37:12 |
| 54.04125 | -1.565414 | 2014-07-05 09:20:11 |
| 53.99056 | -1.799216 | 2014-07-06 12:31:03 |

table: Latitude and longitude of tweets
| created | text |
|---|---|
| 2014-07-05 07:29:04 | but this As its my first Tour de France Im |
| 2014-07-03 12:15:50 | Great view from 2 YB office TDF2014 TDFyorkshire leeds http |
| 2014-07-06 18:37:12 | Hello today was was in York I am a amp |
| 2014-07-05 09:20:11 | here we are at Ripleycastle http |
| 2014-07-06 12:31:03 | Another great day tdf http |

table: Sample of text associated with tweets