This dataset, tdf_tweets.csv, is a table of 1000 tweets harvested from the Twitter Streaming API using a modified version of tweepy, a pre-existing Python library.
In fact, the original dataset was much larger, consisting of roughly 42,000 tweets. These were filtered from dozens of gigabytes of Twitter data using a long R script, which took around a day to run on a fast computer! A random selection of 1000 of these tweets was taken for this module. To overcome confidentiality issues, a number of changes were made to the files using another R script called tdf-clean, because it cleans Tour de France data. Get it?!
Key features of this script include:
Raw Twitter data contains many fields. To reduce its size, the table was shrunk both in length (number of rows) and in width (number of columns):
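The shrinking step can be sketched as follows. The column names, the number of rows, and the sample size of 1000 are assumptions for illustration; only the "fewer rows, fewer columns" idea comes from the text above.

```r
# A sketch of shrinking a table in length and width (all names here
# are illustrative, not the actual tdf-clean code)
big <- data.frame(lat = runif(5000), lon = runif(5000),
                  created = Sys.time(), text = "example",
                  extra1 = NA, extra2 = NA)

keep_cols <- c("lat", "lon", "created", "text")   # shrink width
small <- big[sample(nrow(big), 1000), keep_cols]  # shrink length: random sample of 1000 rows
nrow(small)  # 1000
ncol(small)  # 4
```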
It is hard to identify what constitutes 'sensitive' text, so any words that were unusual, contained html links, or contained the identifying @ symbol were removed:
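One way this removal could be done is sketched below. The function name is hypothetical, and only the @ and html-link rules are shown; the actual tdf-clean script also removed unusual words, which would need a dictionary check.

```r
# Hedged sketch: drop any word containing "@" or an html link
# (the real tdf-clean rules may differ)
strip_sensitive <- function(x) {
  words <- strsplit(x, " ")[[1]]
  keep <- !grepl("@|http", words)  # flag identifying words
  paste(words[keep], collapse = " ")
}

strip_sensitive("great view @leeds http://t.co/abc TDF2014")
# "great view TDF2014"
```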
To make the text shorter and more manageable, a maximum word length was set. To remove excess words, a new R function was defined and run on the dataset:
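A function of this kind might look like the sketch below. The function name and the limit of 10 words are assumptions; the source only says that excess words were removed up to some maximum.

```r
# Hypothetical sketch: keep at most n words per tweet
truncate_words <- function(x, n = 10) {
  words <- strsplit(x, " ")[[1]]
  paste(head(words, n), collapse = " ")
}

truncate_words("one two three four five six seven eight nine ten eleven twelve")
# "one two three four five six seven eight nine ten"
```

Run over a whole column, this would be applied with `sapply(tdf$text, truncate_words)`.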
There are many options for analysing the data. R is a powerful data analysis tool; to load and begin exploring the data in R, try the following:
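A minimal loading sketch is shown below. It assumes tdf_tweets.csv sits in the working directory; the column names `lat`, `lon` and `created` are taken from the tables that follow.

```r
# Load the tweets table (adjust the path if the file lives elsewhere)
if (file.exists("tdf_tweets.csv")) {
  tdf <- read.csv("tdf_tweets.csv", stringsAsFactors = FALSE)
  # Inspect the spatial and temporal columns shown in the tables below
  head(tdf[, c("lat", "lon", "created")])
}
```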
| lat | lon | created |
|---|---|---|
| 51.50886 | -0.265492 | 2014-07-05 07:29:04 |
| 53.80314 | -1.542609 | 2014-07-03 12:15:50 |
| 0.00000 | 0.000000 | 2014-07-06 18:37:12 |
| 54.04125 | -1.565414 | 2014-07-05 09:20:11 |
| 53.99056 | -1.799216 | 2014-07-06 12:31:03 |

table: Latitude and longitude of tweets
| created | text |
|---|---|
| 2014-07-05 07:29:04 | but this As its my first Tour de France Im |
| 2014-07-03 12:15:50 | Great view from 2 YB office TDF2014 TDFyorkshire leeds http |
| 2014-07-06 18:37:12 | Hello today was was in York I am a amp |
| 2014-07-05 09:20:11 | here we are at Ripleycastle http |
| 2014-07-06 12:31:03 | Another great day tdf http |

table: Sample of text associated with tweets