Big Data, Big Questions | Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data
Abstract
Twitter seems to provide a ready source of data for researchers interested in public opinion and popular communication. Indeed, tweets are routinely integrated into the visual presentation of news and scholarly publishing in the form of summary statistics, tables, and charts provided by commercial analytics software. Without a clear description of how the underlying data were collected, stored, cleaned, and analyzed, however, readers cannot assess their validity. To illustrate the critical importance of evaluating the production of Twitter data, we offer a systematic comparison of two common sources of tweets: the publicly accessible Streaming API and the “fire hose” provided by Gnip PowerTrack. This study represents an important step toward higher standards for the reporting of social media research.