People have to overcome a steep learning curve to use granular data from the American Community Survey — perhaps the Census Bureau’s best-known product — because of its structure and lack of metadata.
Historically available only in common tabular file formats like CSV, the dataset requires reference to a separate data dictionary document to understand it. But now, developers and data scientists will be able to use the ACS data more easily and build apps from it because it has been transformed into linked data, Census Chief Marketing Officer Jeff Meisel announced Saturday during a panel at the SXSW Conference in Austin.
The Austin-based data.world, funded by the National Science Foundation, brought on then-graduate student Jonathan Ortiz to address problems with the granular ACS data, known as the Public Use Microdata Sample.
“What comes to you in the microdata survey file … is essentially just: one piece is the CSV, which has coded values throughout, and you constantly have to refer back and forth to the data dictionary,” said Ortiz, who now works as a data scientist for data.world, in an interview with FedScoop. “And the data dictionary is a human-readable document, it’s not computer-readable at all.”
But semantic technology allows users to “put that metadata into the data itself so that you’re consuming both at the same time, and you’re also able to use unique identifiers for each of the data resources in that data so the computer can actually understand them, make sense of them.”
The tradeoff in getting the metadata is that “the size of the data explodes when you start incorporating all this other information.”
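To make the contrast concrete, here is a minimal Python sketch of the idea Ortiz describes: a coded CSV row is joined with its data dictionary so that each value carries its own meaning and each record gets a unique identifier. The column codes, labels, and the `http://example.org/pums/` namespace are illustrative assumptions, not copied from the official data dictionary or from data.world's actual model.

```python
import csv
import io

# Hypothetical two-column extract; real microdata files have many more coded columns.
CODED_CSV = "SEX,MAR\n1,5\n2,1\n"

# Hypothetical slice of a data dictionary, hand-transcribed for this sketch.
DATA_DICTIONARY = {
    "SEX": {"label": "sex", "values": {"1": "Male", "2": "Female"}},
    "MAR": {"label": "marital status",
            "values": {"1": "Married", "5": "Never married"}},
}

BASE = "http://example.org/pums/"  # placeholder namespace, not a real endpoint


def to_triples(csv_text, dictionary, base=BASE):
    """Turn coded rows into (subject, predicate, object) triples with decoded values."""
    triples = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for row_number, row in enumerate(reader, start=1):
        subject = f"{base}person/{row_number}"  # unique identifier per record
        for column, code in row.items():
            entry = dictionary[column]
            predicate = base + entry["label"].replace(" ", "_")
            triples.append((subject, predicate, entry["values"][code]))
    return triples


triples = to_triples(CODED_CSV, DATA_DICTIONARY)

# Rough size comparison in characters: the self-describing form is much larger.
coded_size = len(CODED_CSV)
linked_size = sum(len(" ".join(t)) for t in triples)

for t in triples:
    print(t)
print(coded_size, linked_size)
```

Even in this toy case, the triples take up many times more characters than the coded CSV, which is a small-scale picture of the size tradeoff Ortiz mentions.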
To address the storage issue, Amazon Web Services is making the linked data available as an AWS public dataset, so anyone can analyze the data in the cloud without downloading or storing a copy. The old formats will still be available, Ortiz said. Most spreadsheet programs can easily read a CSV file.
“The intention was not to completely replace it and we don’t want to — that’s not the interest here. I think the people who are comfortable using that in its format and enjoy using it, and get value out of it, are going to continue to do so,” Ortiz said. “This is just a new way of modeling and distributing the data. And hopefully a new set of users get different use out of it.”
Ortiz said he hopes web and app developers will use the data delivered in this way to build resources for the public.
“I believe that linked data is the future,” he said, adding “that by providing this it’ll provide other people, other folks out there, semantic web enthusiasts, data engineers, developers, researchers, etc. to begin to enrich their analysis and enrich their own data by linking it to the Census.”
For a data scientist or data engineer trying to make use of the CSV version, Ortiz said, “it’s like learning a new language, it’s like learning a new programming language.” With the data translated into linked data, he said, people can now use real, human terms and concepts to query it. “And you can uncover things more quickly using that because you’re not learning a new language, essentially.”
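As a rough illustration of querying with human terms, the Python sketch below filters a handful of hand-written triples by readable labels rather than numeric codes. The triples, the predicate names, and the `match` and `people_where` helpers are all hypothetical; a real linked data store would more likely be queried with a language such as SPARQL.

```python
# Hypothetical decoded triples; identifiers and labels are made up for this sketch.
TRIPLES = [
    ("person/1", "sex", "Male"),
    ("person/1", "marital_status", "Never married"),
    ("person/2", "sex", "Female"),
    ("person/2", "marital_status", "Married"),
    ("person/3", "sex", "Female"),
    ("person/3", "marital_status", "Never married"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    ]


def people_where(triples, **conditions):
    """Return sorted subjects that satisfy every predicate=value condition."""
    result = None
    for predicate, value in conditions.items():
        subjects = {s for s, _, _ in match(triples, predicate=predicate, obj=value)}
        result = subjects if result is None else result & subjects
    return sorted(result or [])


# With a coded CSV, the same question would be phrased in codes that must be
# looked up in the dictionary first; here the query reads like the question.
never_married_women = people_where(
    TRIPLES, sex="Female", marital_status="Never married"
)
print(never_married_women)
```

The point of the sketch is only that the query is written in the vocabulary of the question itself, with no lookup step in between.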