Data sharing mechanism or programme
February 13, 2020
Solomon Messing, Saurav Mahanti, Christina DeGregorio, Zagreb Mukerjee, Bogdan State, Bennett Hillenbrand, Chaya Nayak, Arjun Wilkins, Gary King, Nathaniel Persily
Subject tag: Data Access
This document details a data set designed to allow researchers to study the distribution of URLs on Facebook and how users interacted with them. We’ve protected this data set using differential privacy, which adds enough noise to provide precise guarantees that no significant additional information about individuals can be learned from the data (beyond what is already available from any external source). That means that while no one can learn anything significant about individuals from the data set (including whether they are in it at all), researchers can still use the data to uncover broad time-series or group-level trends and relationships of interest. The data set summarizes the demographics of people who viewed, shared, and otherwise interacted with web pages (URLs) shared on Facebook from January 1, 2017 through July 31, 2019. A URL is included if it was shared (as an original post or reshare) with “public” privacy settings more than 100 times (plus Laplace(5) noise to minimize information leakage).
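The noisy inclusion threshold described above can be sketched as follows. This is an illustrative implementation of the general noisy-threshold technique, not Facebook's production code: the function names and defaults are ours, with the threshold (100) and noise scale (Laplace(5)) taken from the text.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def include_url(public_share_count, threshold=100, scale=5.0, rng=None):
    """Noisy-threshold test: add Laplace(scale) noise to the public
    share count and include the URL only if the noisy count exceeds
    the threshold. The noise hides whether a count just above or just
    below the cutoff was driven by any single user's shares."""
    rng = rng or random.Random()
    return public_share_count + laplace_noise(scale, rng) > threshold
```

With a scale of 5, the noise almost never exceeds a few dozen shares in magnitude, so URLs far from the 100-share cutoff are included or excluded essentially deterministically; only counts near the threshold are genuinely uncertain.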
The URLs have been canonicalized (standardized) and processed (as detailed below) to remove potentially private and/or sensitive data. Aggregate data on user actions are provided both for URLs shared publicly and for those shared under the “share to friends” privacy setting. These data were collected by logging actions taken on Facebook. Logs were processed using a combination of Hive, Presto, and Spark under the Dataswarm job-execution framework. To construct this data set, we processed approximately an exabyte of raw data from the Facebook platform, including more than 50 terabytes per day of interaction metrics and more than 1 petabyte per day of exposure (view) data. The data set contains about 38 million URLs, more than half a trillion rows, and more than 10 trillion cell values.
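The exact canonicalization rules are not specified here, but the general idea is to map the many surface forms of a URL to one standard form so that engagement counts aggregate correctly. A minimal sketch of that kind of standardization, assuming lowercased scheme and host, a dropped fragment, and a hypothetical list of stripped tracking parameters:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical list of tracking parameters to strip; the actual
# canonicalization rules used for the data set are not public.
TRACKING_PARAMS = {"fbclid", "utm_source", "utm_medium",
                   "utm_campaign", "utm_term", "utm_content"}

def canonicalize(url):
    """Return a standardized form of a URL: lowercase scheme and host,
    drop the fragment, and remove known tracking query parameters."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k.lower() not in TRACKING_PARAMS]
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",   # normalize an empty path to "/"
        urlencode(query),
        "",                  # drop the fragment
    ))
```

For example, `HTTP://Example.com/Article?utm_source=x&id=7#top` and `http://example.com/Article?id=7` would canonicalize to the same string and so be counted as one URL.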
Due to legal constraints, data from users who have chosen to delete their accounts are not represented in our data set, which may have a larger impact on URLs shared further in the past (this release is aggregated to month-year breakdowns). Users who “deactivate” but do not delete their accounts remain in the data set. We have taken measures to remove URLs, and their associated engagement statistics, that link to known child exploitative imagery. We have also removed the URL, “Title”, and “Blurb” fields for links to known non-consensual intimate imagery and to suicide and self-harm content, although the associated engagement statistics for these links remain in the data set. The data set includes posts that have since been taken down for Community Standards violations. To learn how Facebook defines and measures key issues, refer to the Community Standards Enforcement Report; the numbers cited in that report are not comparable to the data presented in this RFP, as they reference different underlying data. For additional information, see: Community Standards, Enforcement Report Guide.
[This entry was sourced with minor edits from the Carnegie Endowment’s Partnership for Countering Influence Operations and its baseline datasets initiative. You can find more information here: https://ceip.knack.com/pcio-baseline-datasets]