Consider a few numbers: By the end of 2014, the number of mobile phone subscriptions worldwide is expected to reach 7 billion, nearly equal to the world’s population. More than 1.82 billion people communicate on some form of social network, and almost 14 billion sensor-laden everyday objects (trucks, health monitors, GPS devices, refrigerators, etc.) are now connected and communicating over the Internet, creating a steady stream of real-time, machine-generated data.
Much of the data generated by these devices is today controlled by corporations. These companies are in effect “owners” of terabytes of data and metadata. Companies use this data to aggregate, analyze, and track individual preferences, provide more targeted consumer experiences, and add value to the corporate bottom line.
At the same time, even as we witness a rapid “datafication” of the global economy, access to data is emerging as an increasingly critical issue, essential to addressing many of our most important social, economic, and political challenges. While the rise of the Open Data movement has opened up over a million datasets around the world, much of this openness is limited to government (and, to a lesser extent, scientific) data. Access to corporate data remains extremely limited. This is a lost opportunity. If corporate data (in the form of Web clicks, tweets, online purchases, sensor data, call data records, etc.) were made available in a de-identified and aggregated manner, researchers, public interest organizations, and third parties would gain greater insight into patterns and trends that could help inform better policies and lead to greater public good (including combatting Ebola).
Corporate data sharing holds tremendous promise. But its potential—and limitations—are also poorly understood. In what follows, we share early findings of our efforts to map this emerging open data frontier, along with a set of reflections on how to safeguard privacy and other citizen and consumer rights while sharing. Understanding the practice of shared corporate data—and assessing the associated risks—is an essential step in increasing access to socially valuable data held by businesses today. This is a challenge certainly worth exploring during the forthcoming OpenUp conference!
Understanding and classifying current corporate data sharing practices
Corporate data sharing remains very much a fledgling field. There has been little rigorous analysis of the different ways of sharing or of their impacts. Nonetheless, our initial mapping of the landscape suggests there have been six main categories of activity (i.e., ways of sharing) to date:
1. Research partnerships, in which corporations share data with universities and other research organizations. Through partnerships with corporate data providers, several research organizations are conducting experiments using de-identified and aggregated samples of consumer datasets and other sources of data to analyze social trends. For instance, Safaricom, one of Kenya’s leading mobile companies, shared a year of de-identified phone data with Harvard researchers to analyze and map how migration patterns contributed to the spread of malaria in Kenya.
2. Prizes and challenges, in which companies make data available to qualified applicants (including civic hackers, pro bono data scientists, and other expert users) who compete to develop new apps or discover innovative uses for the data. Last year, Spain’s regional bank BBVA hosted a contest inviting developers to create applications, services, and content based on anonymous card transaction data. The first prize went to an application called Qkly, which helps users manage time by estimating what time of day a given site or destination will be most crowded (thus helping users, for example, avoid lines).
3. Trusted intermediaries, where companies share data with a limited number of known partners for analysis, modeling, and other value chain activities. For example, companies from the consumer packaged goods, retail, and over-the-counter health care industries often share data with firms such as Information Resources, Inc. (IRI), a data analytics and strategy firm that provides business intelligence and predictive analytics solutions.
4. Application programming interfaces (APIs), which enable access to streams of corporate data for developers and others to conduct testing, product development, and data analytics. Major health insurance companies, such as Kaiser and Aetna, use APIs to create more integrated ecosystems across mobile applications and devices for consumers. Aetna’s CarePass API gives consumers access to their personal data to sync with wearable health platforms such as Fitbit or the Apple Watch.
5. Intelligence products, where companies share (often aggregated) data that provides general insight into market conditions, customer demographic information, or other broad trends. Google shares search query-based data in conjunction with data from the US Centers for Disease Control in order to estimate levels of influenza activity across the country over time.
6. Corporate data cooperatives or pooling, in which corporations (and other important data holders, such as government agencies) group together to create “collaborative databases” with shared data resources. For example, through its Accelerating Medicines Partnership, the US National Institutes of Health (NIH) is helping organize data pooling among the world’s largest biopharmaceutical companies in order to identify promising drug and diagnostic targets for Alzheimer’s disease, systemic lupus erythematosus, rheumatoid arthritis, and diabetes.
Assessing risks of corporate data sharing
Although shared corporate data offers several benefits for researchers, public interest organizations, and other companies, it also carries risks, especially regarding personally identifiable information (PII). When aggregated, PII can help reveal trends and broad demographic patterns. But if PII is inadequately scrubbed and aggregated data is linked to specific individuals, this can lead to identity theft, discrimination, profiling, and other violations of individual freedom. It can also lead to significant legal ramifications for corporate data providers.
Based on our initial research, we have found that most companies are aware of these risks and have taken steps to de-identify aggregated datasets. Such steps include partnering with academic experts and experimenting with new de-identification methods. It is important to point out, however, that there are no industry standards or widely accepted best practices for de-identification of corporate data. Complete anonymization would of course provide the safest way to scrub datasets of PII, but it might also reduce the “granularity” and thus the usefulness of the data.
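To make the trade-off between privacy and granularity concrete, here is a minimal sketch in Python. It is not drawn from any of the companies mentioned above: the records, the field names (subscriber_id, region, hour), and the min_count threshold are all hypothetical, and small-group suppression is only one simple technique among many. It does not by itself guarantee protection against re-identification.

```python
from collections import Counter

# Hypothetical call records; "subscriber_id" stands in for any direct identifier.
RECORDS = [
    {"subscriber_id": "s01", "region": "north", "hour": 9},
    {"subscriber_id": "s02", "region": "north", "hour": 9},
    {"subscriber_id": "s03", "region": "north", "hour": 9},
    {"subscriber_id": "s04", "region": "north", "hour": 10},
    {"subscriber_id": "s05", "region": "north", "hour": 10},
    {"subscriber_id": "s06", "region": "north", "hour": 11},
    {"subscriber_id": "s07", "region": "south", "hour": 9},
    {"subscriber_id": "s08", "region": "south", "hour": 10},
]


def aggregate(records, group_keys, min_count=3):
    """Count records per group and suppress groups smaller than min_count.

    Only the fields named in group_keys survive, so direct identifiers
    such as subscriber_id never reach the output. The threshold of 3 is
    an illustrative assumption, not a recommended standard.
    """
    counts = Counter(tuple(r[k] for k in group_keys) for r in records)
    return {group: n for group, n in counts.items() if n >= min_count}


coarse = aggregate(RECORDS, ("region",))       # one bucket per region
fine = aggregate(RECORDS, ("region", "hour"))  # one bucket per region and hour

print(coarse)  # {('north',): 6}
print(fine)    # {('north', 9): 3}
```

In this toy dataset, the coarse view keeps 6 of the 8 records but says nothing about time of day, while the finer view is more detailed yet has to suppress all but 3 records because most groups are too small to publish safely.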
Participants at a recent Responsible Data Forum held at the Rockefeller Foundation, in New York City, suggested creating a “starter kit” (or “how-to guide”) for private sector companies aiming to open access to data while protecting privacy. In addition to this starter kit, companies, researchers, and governments could also start developing a safety ranking system based on a “taxonomy of harms.” More broadly, further thought and discussion are required to determine de-identification methods and standards, including ways to prevent re-identification.
Mapping the next frontier
Beyond the broad taxonomies presented above, there is almost no systematic analysis of the practice, risks, and impact of corporate data sharing. A more comprehensive mapping of the field of corporate data sharing is urgently needed. Such a mapping would draw on a wide range of case studies and examples to identify opportunities and gaps, evaluate risks, provide evidence of impact, determine best practices in de-identification techniques and privacy frameworks, and ultimately inspire more corporations to allow access to their data. “Opening Up” corporate data is the next frontier of open data. The potential societal benefits that could flow from accessing corporate data are tremendous, but they will only be realized when the public (consumers, citizens, and companies themselves) has solid evidence of those benefits as well as trust in the way data is shared and accessed.
This guest blog was written by Stefaan G. Verhulst, co-founder and chief of research and development at The GovLab, New York University, and David Sangokoya, research fellow at The GovLab, New York University.