The job of making government data more useful — to the public and government workers — comes down to three things: making data cleaner, more accessible and ultimately independent of the systems they were created on, said three top government data officials at an industry forum Thursday.
That remains a monumental task, but one that is gaining momentum, thanks in part to the emerging presence of chief data executives within federal agencies.
While data managers aren’t new, their role and authority have gained greater importance within the federal government following the release of the Obama administration’s
Open Data Policy in May 2013. Since then, the departments of Transportation and Agriculture, among others, have followed the lead of the Federal Communications Commission, the Federal Reserve Board and other agencies in appointing chief data executives to focus greater attention on data management.
Donna Roy, executive director for the Department of Homeland Security’s Information Sharing Environment Office, has spent the past eight years working on cross-agency programs aimed at making data easier to share and more useful. Speaking at a forum hosted by
AFCEA Bethesda, Roy held up DHS’s Data Framework program as an example of what it and others are doing to make data more readily available and useful.
The data challenge that emerged from the Boston Marathon bombing, she said, is the extensive number of DHS systems — more than 40 — holding “somewhere over 900 data sets” that analysts must log on to separately. That process is cumbersome, and it’s easy to miss important information, she said.
The reason for that stems from the fact that “we have never separated our systems from our data,” she said.
“We’re starting to do that at the department … moving these highly valuable data sets” into what Roy referred to as a “data lake” — where department data exists independently of the systems on which it was created or into which it might later be ingested.
“And the way we do that … is to get really specific about how we use the data, how we promise the public we’re going to use the data — and ensure we go about … putting it into the data lake,” she said.
Roy used the analogy of how music files now can be played on a variety of technical platforms and devices, carrying with them supplemental information, such as images or a history of the artists, or coding that can contextually restrict how the file can be used.
Making DHS’s vast amount of data independent, yet accessible only to those with the appropriate viewing privileges, remains a massive challenge for department officials.
The department, however, has made progress in recent months, Roy said, thanks to the DHS Data Framework, an information technology program that supports advanced data architecture and governance processes.
DHS Data Framework defines four elements for controlling data:
1) User attributes — which identify characteristics about the user requesting access, such as organization, clearance and training.
2) Data tags — which label the type of data involved, where the data originated and when it was ingested.
3) Context information — which combines what type of search and analysis can be conducted (or the function of the information), with the purpose for which data can be used.
4) Dynamic access control policies — which evaluate user attributes, data tags and context against rules for granting or denying access to DHS data in the repository, based on legal authorities and related policies.
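The four elements above resemble what is often called attribute-based access control. As a hedged illustration only — all class names, attribute values and policy rules below are invented for this sketch and are not DHS's actual implementation — a policy engine might evaluate an access request like this:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of attribute-based access control, loosely modeled
# on the framework's four elements. Names and rules are illustrative only.

@dataclass
class User:
    organization: str
    clearance: str                       # element 1: user attributes
    training: set = field(default_factory=set)

@dataclass
class DataTag:                           # element 2: data tags
    data_type: str                       # what type of data is involved
    origin: str                          # where it originated
    ingested: str                        # when it was ingested (ISO date)

@dataclass
class Context:                           # element 3: context information
    function: str                        # type of search/analysis
    purpose: str                         # purpose the data may be used for

CLEARANCE_RANK = {"public": 0, "confidential": 1, "secret": 2, "top-secret": 3}

# Element 4: dynamic access-control policy — every rule is a predicate
# over (user, tag, context); all rules must pass for access to be granted.
POLICY_RULES = [
    lambda u, t, c: CLEARANCE_RANK[u.clearance] >= CLEARANCE_RANK["secret"],
    lambda u, t, c: t.data_type in u.training,        # trained on this data type
    lambda u, t, c: c.purpose == "counterterrorism",  # authorized purpose only
]

def grant_access(user: User, tag: DataTag, context: Context) -> bool:
    """Evaluate user attributes, data tags and context against the policy."""
    return all(rule(user, tag, context) for rule in POLICY_RULES)
```

In this toy policy, a cleared analyst trained on a data type gets access for an authorized purpose, while the same request made for a different purpose, or by a user without the clearance or training, is denied.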
The data cleansing challenge
But finding and accessing the right data is only 20 percent of the data problem, she said. The other 80 percent of the problem “is all around making data usable, sanitizing the data as much as possible as it goes into the lake.”
“If you look at what most data scientists are doing, they’re doing mostly data janitor work,” she said.
Bobby Jones, acting chief data officer at the Department of Agriculture, concurred, saying that putting the policies and systems in place to clean up data – and generate clean data in the first place – is one of his top priorities.
“Data cleansing takes on a couple of connotations,” Jones said. The first involves cleaning the metadata – eliminating duplication, misspellings, broken URLs and other faulty tags. Jones, who also serves as deputy chief information officer for policy and planning at USDA, said his office is also developing automated scripts to speed up the metatag cleanup process, adding that the metatagging in 36 percent of the department’s datasets has been addressed since August.
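The kind of automated metadata cleanup Jones describes might look like the following sketch — the field names and rules are invented for illustration and are not USDA's actual schema or scripts:

```python
import re

# Illustrative metadata cleanup: deduplicate records, normalize titles
# and flag malformed URL tags. Field names are hypothetical, not USDA's.

URL_PATTERN = re.compile(r"^https?://[^\s/$.?#][^\s]*$")

def clean_metadata(records):
    """Return (cleaned, flagged): deduplicated records plus any with bad URLs."""
    seen = set()
    cleaned, flagged = [], []
    for rec in records:
        title = " ".join(rec.get("title", "").split())  # collapse stray whitespace
        url = rec.get("url", "").strip()
        key = (title.lower(), url)
        if key in seen:                                 # duplicate entry
            continue
        seen.add(key)
        entry = {"title": title, "url": url}
        if not URL_PATTERN.match(url):                  # broken or malformed URL
            flagged.append(entry)
        cleaned.append(entry)
    return cleaned, flagged
```

Running this over a catalog dump would shrink duplicated entries and produce a worklist of records whose URL tags need manual repair.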
The second focus, he said, is on the data itself. “The first thing we want to do is improve the machine-readable part of the data” by trying to convert more of the department’s documents into XML file formats and “increase our APIs so we have direct links to the information itself,” he said.
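As a hedged illustration of that machine-readability step — the record fields here are invented for the sketch, not USDA's formats — converting a flat record parsed from a legacy document into XML might look like:

```python
import xml.etree.ElementTree as ET

# Illustrative only: turn a flat record (e.g. extracted from a legacy
# document) into an XML fragment that a downstream API could serve.
def record_to_xml(record: dict, root_tag: str = "dataset") -> str:
    root = ET.Element(root_tag)
    for key, value in record.items():
        child = ET.SubElement(root, key)
        child.text = str(value)
    return ET.tostring(root, encoding="unicode")
```

Once data is in a structured format like this, exposing it through an API is a matter of serving the fragments rather than re-parsing documents on every request.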
Knowing whether the department’s agencies are actually taking all the necessary steps to cleanse their data remains a central challenge, Jones acknowledged. To tackle that issue, department officials have assigned data stewards to each of USDA’s agencies and offices, to develop strategies and action plans to improve data management.
The Department of Transportation is also looking at ways to increase the underlying value of the data the department generates and collects.
Improving the value of data involves working with a range of entities outside of the department, said Daniel Morgan, who was recruited four months ago to become the Department of Transportation’s first chief data officer. That includes working with a host of state and local agencies, as well as “info-mediaries, who are trying to add value to our information,” he said.
Integrating and analyzing data is an essential part of DOT’s efforts to improve national highway safety. The department, for instance, pulls together data on medication use by truck drivers, detailed accident reports and related road condition reports to look for ways to reduce motor vehicle fatalities, he explained.
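The cross-dataset linkage Morgan describes can be sketched as a simple join — all field names and data shapes below are hypothetical, invented only to illustrate the idea:

```python
# Hypothetical sketch: link accident reports to driver medication records
# and road conditions by shared IDs. Fields are illustrative, not DOT's.

def join_safety_data(accidents, medications, road_conditions):
    """Combine three datasets into one record per accident."""
    meds_by_driver = {}
    for m in medications:
        meds_by_driver.setdefault(m["driver_id"], []).append(m["medication"])
    condition_by_segment = {r["segment_id"]: r["condition"] for r in road_conditions}
    combined = []
    for a in accidents:
        combined.append({
            "accident_id": a["accident_id"],
            "medications": meds_by_driver.get(a["driver_id"], []),
            "road_condition": condition_by_segment.get(a["segment_id"], "unknown"),
        })
    return combined
```

Analysts could then look for patterns across the joined records — say, accidents clustering around particular medications on particular road conditions — that no single source system would reveal on its own.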
But as chief data officer, Morgan also has a larger and longer-term role. Morgan has been charged with guiding DOT’s “intelligent transportation” strategies as vast amounts of real-time data from vehicles and traffic sensors begin to knit together in new and powerful ways.
“What you see is the convergence of electrical and civil engineering [disciplines] as well as transportation and adding an IT component, which is a big cultural change for the transportation” industry, he said.
It’s also a hint of the growing urgency federal officials face in getting on top of their data management policies.