As part of Advanced Topics in Computer Science (COMP SCI 3020), I worked on a project, supervised by Dr Claudia Szabo, benchmarking the use of MapReduce on unstructured data. The project involved collecting large unstructured data sets from Twitter, processing the data with MapReduce in several different ways, and producing benchmarks for each of these data analysis approaches.

The data analysis approaches we pursued were hashtag (#hashtag) frequency, user mention (@username) frequency, a timeline analysis of both, and general word frequency over time.
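To give a feel for the frequency analyses, here is a minimal single-process sketch of the map, shuffle and reduce steps in Python. It is an illustration only, not the project's actual code: the function names, the regular expressions used to recognise hashtags and mentions, and the choice to lowercase tokens are all assumptions, and a real job would run the same steps on a MapReduce cluster rather than in memory.

```python
import re
from collections import defaultdict
from itertools import chain

# Assumed token patterns: '#' or '@' followed by word characters.
HASHTAG = re.compile(r"#\w+")
MENTION = re.compile(r"@\w+")


def map_tokens(tweet_text, pattern):
    """Map step: emit a (token, 1) pair for every match in one tweet."""
    # Lowercasing is an assumption so that #YesAllWomen and #yesallwomen merge.
    return [(token.lower(), 1) for token in pattern.findall(tweet_text)]


def shuffle(pairs):
    """Shuffle step: group the emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(grouped):
    """Reduce step: sum the values for each key to get a frequency."""
    return {key: sum(values) for key, values in grouped.items()}


def frequency(tweets, pattern):
    """Run map, shuffle and reduce over an iterable of tweet texts."""
    pairs = chain.from_iterable(map_tokens(text, pattern) for text in tweets)
    return reduce_counts(shuffle(pairs))
```

Calling `frequency(tweets, HASHTAG)` or `frequency(tweets, MENTION)` on a list of tweet texts returns a dictionary mapping each token to its count. A timeline variant could use a (time bucket, token) pair as the key instead of the token alone.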

The output of this analysis was then visualised using D3 and can be viewed via the links below.

Yes All Women

Analysis of tweets during the trending phase of the #yesallwomen hashtag.

User Mention Wordcloud

Hashtag Wordcloud