With private companies increasingly controlling the production and consumption of information on their platforms, their data on user habits has grown especially valuable to researchers. However, these data rarely provide a complete picture of media consumption, since companies such as Facebook censor their datasets out of privacy concerns. While such censoring makes it impossible to identify individual behavior in large-scale data, it may also dramatically alter the conclusions drawn from that data.

In a newly published Research Note, “Examining potential bias in large-scale censored data,” co-authored with MIT’s Jennifer Allen and Markus Mobius of Microsoft Research, David Rothschild and Lab Director Duncan Watts investigate this bias using Facebook data. They demonstrate that even “big data” can yield questionable estimates when censored, and they urge caution during the research process.

The limits of large-scale data censorship

Allen et al. find that, compared to an uncensored dataset, Facebook’s URL dataset severely overestimates the relative shares of news and fake news clicked on the platform, by twofold and fourfold, respectively. Key to this bias is the Facebook dataset’s minimum share threshold, which censors URLs that are publicly shared fewer than 100 times. As a result, news and fake news are overrepresented in Facebook’s dataset, since such content is more likely to be shared publicly than other types of URLs.
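To make the mechanism concrete, here is a minimal sketch of how a share threshold can inflate a category's apparent share of clicks. The records below are invented toy numbers, not figures from the paper or from Facebook's dataset, and the function names are hypothetical. Only the threshold of 100 public shares comes from the description above.

```python
from collections import defaultdict

# From the text: URLs publicly shared fewer than 100 times are censored.
SHARE_THRESHOLD = 100

# Hypothetical toy records: (category, clicks, public_shares).
# These numbers are invented purely for illustration.
URLS = [
    ("news", 500, 400),
    ("news", 300, 50),
    ("fake_news", 100, 150),
    ("other", 2000, 20),
    ("other", 1500, 120),
    ("other", 600, 10),
]


def click_shares(urls):
    """Return each category's fraction of all clicks in `urls`."""
    totals = defaultdict(int)
    for category, clicks, _shares in urls:
        totals[category] += clicks
    grand_total = sum(totals.values())
    return {category: n / grand_total for category, n in totals.items()}


def censor(urls, threshold=SHARE_THRESHOLD):
    """Keep only URLs whose public share count meets the threshold."""
    return [u for u in urls if u[2] >= threshold]


full = click_shares(URLS)
censored = click_shares(censor(URLS))

for category in full:
    before = full[category]
    after = censored.get(category, 0.0)
    print(f"{category}: uncensored {before:.1%}, censored {after:.1%}")
```

In this toy data, the heavily clicked but rarely shared "other" URLs drop out once the threshold is applied, so the remaining news and fake news URLs account for a larger fraction of the surviving clicks. The size of the effect depends entirely on the invented numbers and is not meant to reproduce the twofold and fourfold estimates reported by the authors.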

Beyond exposing this bias, Allen et al.’s research raises further questions about the relationships between viewing, clicking, and sharing URLs. They find that certain types of content are under- or over-represented in Facebook’s dataset depending on whether they are geared towards individual interaction (such as gaming content and micro-targeted ads) or public sharing.

The authors discuss how to preserve privacy in large-scale datasets without a share threshold while still allowing for detailed analysis. They also compare the research and privacy trade-offs of these large-scale data with those of smaller but fully representative data. By probing the limits of the current methodology, they shed light on the need for more accurate descriptive statistics and analysis.

Read the full article in the Harvard Kennedy School (HKS) Misinformation Review.


