Link rot is killing federal data


The federal government is the biggest offender when it comes to link rot: links that are inactive or have been changed. (Image: iStockphoto)

Editor’s note: This story has been updated to reflect that some federal agencies are not mandated to keep hard copies of some of their documents, including decisions of their hearings.

The Internet is decaying. The basis for the future of technology and information, the 21st century’s savior, has turned into a fetid heap of broken links — and it’s only getting worse.

The phenomenon is called link rot, and since 2008 the Georgetown Law Library and the Chesapeake Digital Preservation Group have been keeping a vigilant eye on its progression.

Link rot occurs when a website links out to another website as a reference, but when the link is clicked, the most frustrating message on the Internet rears its head: “404: Not Found.” Or the website looks completely different from how it looked when it was originally referenced, forcing the user to hunt down the data. Link rot typically happens because a server has been taken down, a company has gone out of business or a website has moved.
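As a rough illustration of those two failure modes, a link checker might compare what a link returns today with what was there when it was cited. The sketch below is hypothetical: the function names and the SHA-256 fingerprint are assumptions made for illustration, not the method used by the Chesapeake group. It works on a status code and page body that some other tool has already fetched.

```python
import hashlib
from typing import Optional


def fingerprint(body: bytes) -> str:
    """Return a SHA-256 digest of the page body as it was when cited."""
    return hashlib.sha256(body).hexdigest()


def classify_link(status: Optional[int], body: bytes, cited_fingerprint: str) -> str:
    """Classify a cited link from an already-fetched status code and body.

    status is None when the server could not be reached at all.
    """
    if status is None:
        return "unreachable"  # server taken down or domain gone
    if status in (404, 410):
        return "missing"      # the classic "404: Not Found"
    if 200 <= status < 300:
        # The page loads, but is it still the page that was cited?
        if fingerprint(body) == cited_fingerprint:
            return "intact"
        return "changed"
    return "error"            # any other response, such as a server error
```

An exact fingerprint flags even trivial edits, such as a changed footer, so a real checker would likely need a looser comparison.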

The government is not immune to this plight. Actually, it is link rot’s biggest offender. The Chesapeake Digital Preservation Group has found that, of the original dataset of websites it began with in 2008, “the content at dot-gov domains showed the highest increase in link rot. More than 50 percent of the material posted to government domains disappeared from the original documented Web addresses,” according to the 2013 study.

The New York Times reported half the links referenced in Supreme Court opinions were victims of link rot. But the rest of the federal government and state governments are losing data, too.

“From a law point of view, it’s huge,” Mary Jo Lazun, head of collection management at the Maryland State Law Library, told FedScoop. “We rely on past things that are written in the past. [With link rot], we can’t see how they evolve. It’s really challenging.”

The library has realized printing out hard copies and putting them in a folder is the best way to retain the information at this point, Lazun said.

Link rot takes more subtle forms as well. Link shorteners, such as Bitly, are used to make links fit in tweets. The service issues a redirect to the destination website when the link is clicked. If a link shortening company were to go out of business or lose the use of its server, the links would be unavailable, according to Eric Mill, a developer at the Sunlight Foundation.
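The dependency Mill describes can be seen in a toy model of a shortening service. This is a sketch only: the class, its method names and the choice of a 301 status code are assumptions for illustration, not a description of how Bitly or any other real service is built.

```python
from typing import Dict, Optional, Tuple


class LinkShortener:
    """Toy model of a shortener: a table from short codes to long URLs."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def shorten(self, code: str, long_url: str) -> None:
        """Register a short code for a destination URL."""
        self._table[code] = long_url

    def resolve(self, code: str) -> Tuple[int, Optional[str]]:
        """Return an HTTP-style (status, destination) pair for a short code."""
        target = self._table.get(code)
        if target is None:
            return 404, None   # nothing to redirect to
        return 301, target     # redirect to the destination

    def shut_down(self) -> None:
        """Simulate the service losing its server: the table is gone."""
        self._table.clear()
```

The destination page never moves in this model, yet every short link to it fails once the table is gone, because the short link only ever pointed at the service.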

The loss of government data can cause major problems for fans of open government. According to E. Dana Neacsu, reference librarian at Columbia Law School, starting with the FDR administration, all federal agencies have been mandated to publish their rules and rulemaking process. However, there is no such mandate for their decision-making process. In recent years, since the Internet made publishing very affordable, many federal agencies have opted to publish their decisions on their websites, but they have the choice of what to publish and for how long, Neacsu said.

Others worry about more malicious uses of link rot.

“People rewrite information on the Internet all the time,” Mill said. Without an archive, rewritten information becomes the only information. Of course, important information can be deleted without a trace, too.

“[The site] had existed before there was a political reality or scandal for the Obama administration,” Mill said of Obama’s presidential transition website. At the height of the Edward Snowden scandal, a clause about the importance of protecting whistle-blowers mysteriously went missing from the site.

There are, however, safeguards against data decay on the Internet.

“Link rot is an immutable part of the Internet, but there are things we can do,” Mill said. The Wayback Machine, hosted by the Internet Archive, has stored more than 240 billion Web pages on its servers dating back to 1996. Users can select a date and, if the Wayback Machine has crawled that site, they can see what the website looked like on that date.
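The same date-based lookup can be done programmatically through the Internet Archive’s public availability endpoint, which returns the snapshot closest to a requested date. The sketch below assumes that endpoint’s documented JSON shape (an `archived_snapshots` object containing a `closest` entry). The function names are hypothetical, and `lookup` needs network access.

```python
import json
from datetime import date
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url: str, on: date) -> str:
    """Build the availability query for a page and a target date."""
    params = {"url": page_url, "timestamp": on.strftime("%Y%m%d")}
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(payload: dict) -> Optional[str]:
    """Extract the archived URL from a response, or None if there is a gap."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, on: date) -> Optional[str]:
    """Ask the archive for the snapshot nearest to the given date."""
    with urlopen(build_query(page_url, on), timeout=10) as response:
        return closest_snapshot(json.load(response))
```

A result of `None` means the archive has no snapshot of that address to offer.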

However, considering the size and scope of the Internet, even 240 billion Web pages aren’t enough to deter link rot.

“[The Wayback Machine] is very sophisticated, but flawed; you see lots of gaps,” Mill said. Part of the flaw comes from a file called robots.txt, which a site’s developer uses to tell crawlers which parts of the site they should not visit or archive. There is also the daunting problem of storing everything the federal government publishes online every time there is a change.
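To make that mechanism concrete, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The rules, the example.gov addresses and the crawler names are made up for illustration. They are not taken from any real agency site or archive crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one crawler is shut out of the whole site,
# and every other crawler is kept out of a single directory.
ROBOTS_TXT = """\
User-agent: archive-bot
Disallow: /

User-agent: *
Disallow: /drafts/
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let this crawler fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


report = "https://example.gov/reports/2013.html"
draft = "https://example.gov/drafts/memo.html"

print(may_crawl(ROBOTS_TXT, "archive-bot", report))  # archiving crawler
print(may_crawl(ROBOTS_TXT, "other-bot", report))    # any other crawler
print(may_crawl(ROBOTS_TXT, "other-bot", draft))     # excluded directory
```

Under these rules the crawler named archive-bot is refused everywhere, so a well-behaved archive would record nothing from the site. Other crawlers may fetch the report but not the drafts directory. The file is only a request; it works because crawlers choose to honor it.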

“I don’t want to take a guess, but it would be huge,” Mill said, referring to the amount of data storage that would be needed.

Harvard University is currently beta-testing an archiving site that permanently hosts a link as it was on the day it was created. However, it doesn’t archive sites that already exist, and links have to be added manually.

Neacsu suggested creating a consortium of libraries to share the burden of archiving the information. Each library would be assigned a certain section of the government to archive, and the effort would be spearheaded by a network in the Library of Congress.

“Preservation is one of the roles of the government,” Mill said. Lazun agrees: “The government has a responsibility to keep these records.”
