Data.gov has caught a lot of flak recently for hosting unstructured data formats, and while the General Services Administration is working to improve the site, its chief architect says there’s an explanation for the predominance of such files.
A report published this year by two George Mason University researchers found the site relies heavily on PDF and HTML files, which they rated as low quality in terms of machine readability: one star out of a possible five.
Data.gov Chief Architect Philip Ashlock told FedScoop that some of those HTML files, for example, reflect cases where an agency links to a tool that “helps you specify which part of the data set you want to download,” rather than directly to a data set.
“So there’s an intermediary step that you have to go through to get access to the data on a web page,” Ashlock said. “And so those are sort of issues that we’re trying to work through.”
“It’s something we actually need to sort of better respond to because … there’s a lot of nuance to the way those stats are put together,” he said of the report.
Alex Howard, senior analyst for the Sunlight Foundation, told FedScoop the predominance of PDFs and HTML isn’t necessarily a bad thing.
“It’s something there. It may be that you’re putting a table that should be structured into PDF, but at least it’s available for the public,” Howard said. “It’s better to release an image of a table than nothing at all.”
He added: “I think at this point the challenge that Data.gov has is it’s a relatively small staff, run out of [the General Services Administration], that’s essentially acting as a platform for all of the agencies to publish data to. And that there has not still been enough oversight, and auditing and kind of pushing on all the data publishers to make sure that they are creating structured data from the beginning, that is then disclosed as open data in machine readable format.”
Ashlock acknowledged the problem of scope, noting that “the sort of scale of all this data that has been inventoried by Data.gov is a good thing, but also presents a challenge in terms of getting data quality good and consistent.”
He said he hopes the recently announced U.S. Data Federation may address some of these challenges by helping coordinate efforts and tailor data to specific use cases.
“Certainly the open data policy creates challenges in terms of its scope, because it’s so comprehensive,” Ashlock said. “But it also established some basic standards for how information is published, which gives us some ability to kind of help track and measure those inconsistencies, which is something that I think you know we’re tackling and prioritizing over time.”
He also said he thinks the predominance of PDFs, which the researchers cited as the third-most-common format, might have come from the U.S. Geological Survey, which has a “massive collection of data sets… where they have archival maps, historical maps that have been scanned and geocoded.”
“Those are actual 100-year-old maps that are being provided that have some geospatial data to have them, you know, make sense in more of a data-driven context, but they’re actually scanned paper because they’re 100 years old,” he said. “So we haven’t quite gotten the breakdown of what portion of the PDFs that’s accounting for, but I know that collection is in the tens of thousands.”
Data.gov’s legacy, and its future
Socrata CEO Kevin Merritt told FedScoop Data.gov was great for getting the conversation started.
“It helped everybody recognize that data is super valuable and there needs to be a resource where an individual can go and find government data,” Merritt said.
But Merritt noted the repository has faced some challenges, particularly since officials said they would not host the data on the site.
“Data.gov does not host data directly, but rather aggregates metadata about open data resources in one centralized location,” the website states on its about page.
Merritt says that policy has created an inconsistent user experience.
“They said they would never host the data, that the agency would always be the [host] of the data, and as a result that kind of forced Data.gov into being a catalogue of links,” he said. “When you get to the end of the link, the experience is so different from data set to data set.”
Ashlock later noted in an email that the agency does host some data sets directly, but “the goal is for data to be hosted on the domain name of the data set publisher.”
“This doesn’t mean that GSA or Data.gov can’t or won’t ever be helping to provide or manage servers or other infrastructure that sit behind that domain name… and we have certainly been exploring what we can do to help remove bottlenecks for agencies hosting their data,” he said.
Ashlock said files linked directly from metadata on Data.gov will have the same user experience regardless of who hosts the file.
“Often the problem is the one we discussed earlier, where the files are not linked to directly, but instead through intermediary webpages and that in fact can be a problem for inconsistent user experience,” Ashlock said. “The problem there has nothing to do with Data.gov not hosting the data though.”
Howard says there is still work to be done to improve Data.gov.
“If the desired impact is to build something that developers, businesses, transparency advocates, journalists and others can quickly search through and find data sets that are relevant to the stories that they want to tell … then I think it’s fair to say that there are still some challenges there,” he said.
Over the past few years, the number one concern Howard has heard from journalists about the site is that “what they need is not what they’re finding on Data.gov, and what they do find there is out of date or it needs to be significantly transformed or broken down.”
Howard noted that it’s never been easier to convert information into structured data.
“The bad news I think is that there’s going to be a transition coming on here and understandably the next administration is going to be looking carefully at what’s worked and what hasn’t,” Howard said. “And these platforms need to do what they’re supposed to do: increase transparency, accountability, participation, collaboration.”
Hudson Hollister, founder and executive director of the Data Coalition, says that while it’s great Data.gov exists, it isn’t yet the management tool it will become in the future: a place agency officials can go to get data that helps them make decisions.
“I don’t think any open data project will last for the long term unless its data sets are useful for managers, for answering internal questions,” Hollister said. “And up until now, I really haven’t heard much about data sets on Data.gov being used for internal management. I think that’s really the key.”
Agency officials are using government data in specific verticals, he noted, “but not because they’ve been discovered on Data.gov; rather because the people who are figuring out how to publish them are the same people using them to answer management questions.”
“Data.gov’s lack of usefulness for managers isn’t really a problem — this is just a stage in the evolution,” Hollister said.
Correction: Oct. 21, 2016
An earlier version of this story misstated the agency that could be responsible for Data.gov’s large number of PDFs. It is the U.S. Geological Survey that uploads thousands of geocoded maps.