Agencies like the Census Bureau want better commercial off-the-shelf (COTS) technologies for protecting data privacy and computation, so they can securely link datasets and make predictions about the coronavirus pandemic.
In April, the bureau launched two new surveys and an interactive data hub to begin filling holes in the government’s understanding of COVID-19’s social and economic impacts. But surveys take time and offer only a snapshot of the population, when the bureau could instead be linking data from text-mined emergency room visits to its own data.
If industry could provide a better tool for securing the environment in which data is stored and analyzed, ensuring trust, then more datasets could be linked, painting a comprehensive geographic and economic picture of the virus, said Cavan Capps, the big-data lead at the Census Bureau, during a Data Coalition webinar Wednesday.
Linking hospital administrative data to cell phone data, as Apple and Google wanted to do, would make contact tracing very efficient, but there’s not enough public trust in current technology, Capps said.
“When we’re actually making decisions, when we’re running these models, when we’re tracking people, do you want any individual to basically sign a piece of paper and say, ‘I promise I won’t tell anyone about you?’” Capps asked. “Or would you rather have more rigorous mathematical protections?”
Currently there is no “silver bullet” solution, said Lynne Parker, White House deputy chief technology officer. She pointed to several reasons: Data de-identification can be accidentally undone when the scrubbed data is combined with other sources of information. Data aggregation limits analytics. Simulating data raises concerns about accuracy and reverse engineering, while homomorphic encryption, which allows computations to run on data while it stays encrypted, hurts performance and speed.
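To make the de-identification risk concrete, here is a minimal Python sketch of how scrubbed records can be re-identified when they are joined with another source on shared attributes. Every record, field name and the `reidentify` helper are made up for illustration; nothing here reflects actual Census Bureau or hospital data.

```python
# Hypothetical, made-up records for illustration only.
# The "de-identified" visits have names removed but keep quasi-identifiers.
deidentified_visits = [
    {"zip": "20001", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"zip": "20002", "birth_year": 1975, "sex": "M", "diagnosis": "asthma"},
    {"zip": "20001", "birth_year": 1990, "sex": "M", "diagnosis": "covid-19"},
]

# A second, separately available source that still carries names.
public_roll = [
    {"name": "Alice Example", "zip": "20001", "birth_year": 1980, "sex": "F"},
    {"name": "Bob Example", "zip": "20002", "birth_year": 1975, "sex": "M"},
    {"name": "Carol Example", "zip": "20003", "birth_year": 1980, "sex": "F"},
]

QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def reidentify(scrubbed, public, keys=QUASI_IDENTIFIERS):
    """Return (name, diagnosis) pairs where the quasi-identifiers match exactly one person."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])

    matches = []
    for row in scrubbed:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # a unique match re-attaches a name to the record
            matches.append((names[0], row["diagnosis"]))
    return matches


reidentified = reidentify(deidentified_visits, public_roll)
```

In this toy data, two of the three scrubbed visits match exactly one named person, which is the kind of accidental undoing Parker described.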
Other techniques and technologies also have their weaknesses, she said. Data enclaves, centralized services favored by academia where users can work with sensitive research data, don’t scale well. Differential privacy, in which systems publicly share information on group patterns while withholding information on individuals in a dataset, waters down insights. And the security of multi-party computation, a subfield of cryptography that allows different parties to jointly compute on their combined data while keeping each party’s inputs private, hasn’t been fully vetted.
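As a rough illustration of the idea behind multi-party computation, the Python sketch below uses additive secret sharing, one common building block, to let several parties learn a total without revealing their individual numbers. It simulates all parties in a single process, and the modulus, function names and hospital counts are assumptions chosen for the example, not anything described by Parker or the bureau.

```python
import secrets

# Assumed modulus for the sketch; it only needs to exceed the true total.
MODULUS = 2**61 - 1


def share(value, n_parties):
    """Split value into n random-looking shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs):
    """Simulate parties summing their inputs while exchanging only shares."""
    n = len(private_inputs)
    # distributed[i][j] is the share that party i sends to party j.
    distributed = [share(v, n) for v in private_inputs]
    # Each party j adds the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Combining the published partial sums reveals the total, not the inputs.
    return sum(partial_sums) % MODULUS


hospital_counts = [120, 45, 310]  # hypothetical private inputs
total = secure_sum(hospital_counts)
```

No single share reveals anything about the input it came from, but a real deployment would also have to handle networking, colluding parties and malicious behavior, which this sketch ignores.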
“Much more needs to be done to create scalable solutions that are not just a point solution for a particular data sharing goal, but an approach that can scale to more use cases,” Parker said. “So I close with a call to all of you across industry, academia and government: What we need is a better pathway forward for addressing data sharing hurdles more quickly and in the shorter term.”
Some technologies, like trusted computing, show promise: Encryption and decryption cost less, and the protection breaks only when there’s a backdoor in the microcode, Capps said.
The Census Bureau ran a pilot with the Defense Advanced Research Projects Agency and found that trusted execution environments (TEEs) scale better than multi-party computation. The TEE core of an eight-core computer was able to collect, process, tabulate and link 1 million transactions a minute for 20 minutes. TEEs are walled off from the rest of a computer in a way that makes them more secure.
The bureau wants to arrange a pilot with Microsoft Azure to spread the processing across hundreds of computers, Capps said.
At the same time, the bureau is working with the University of California, Berkeley to take Spark, a parallel processing system for large datasets, and see if it can handle regressions, machine learning and other tasks.
Capps envisions linking data in commercial clouds without providing anyone direct access and then running that data through a filter, using differential privacy to add noise, before publishing the information. The Census Bureau hires academics to hack its de-identified data in an attempt to reidentify it.
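The noise-adding filter Capps describes can be sketched with the Laplace mechanism, a standard way to apply differential privacy to counts. The Python below is a minimal illustration under stated assumptions: the county names and counts are invented, `epsilon=0.5` is an arbitrary privacy budget, and each person is assumed to affect only one count by at most 1. It is not the bureau's actual method.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_counts(counts, epsilon, rng=None):
    """Add Laplace noise to each count before publication.

    Assumes sensitivity 1: adding or removing one person changes
    a single count by at most 1, so the noise scale is 1 / epsilon.
    """
    rng = rng or random.Random()
    scale = 1.0 / epsilon
    return {key: value + laplace_noise(scale, rng) for key, value in counts.items()}


# Hypothetical linked tabulation; the values are made up.
linked_counts = {"county_a": 1520, "county_b": 87, "county_c": 430}
published = noisy_counts(linked_counts, epsilon=0.5, rng=random.Random(0))
```

A smaller `epsilon` means more noise and stronger privacy, which is the trade-off behind the complaint that differential privacy waters down insights. Python's `random` module is used here for readability only; a production system would need a carefully vetted noise source.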
Another data-collection effort related to the pandemic — contact tracing of people exposed to the coronavirus — will provide another test of public trust in government agencies at all levels, officials said.
“Many concerns around the contact tracing apps and tools are concerns that this information will be repurposed,” said Kelsey Finch, senior counsel at the Future of Privacy Forum. “We’ve seen potential drops in adoption rates related to law enforcement taking location information around the protests [of systemic racism] recently.”
Capps doesn’t believe the government can foresee all the secondary uses of its data.
A decade ago, Google developed a Flu Trends tool that geolocated flu searches and even out-predicted the Centers for Disease Control and Prevention’s own disease surveillance model for a time.
“It was a secondary use we hadn’t anticipated,” Capps said. “So I don’t think we’re going to tell data scientists to shut down their brains.”