Can digital data last forever?



The National Archives in College Park, Md., has enough records from federal agencies to stretch 30,000 feet. (Photo: Wikimedia Commons)

In mid-1799, while ransacking Egypt, Napoleon's army stumbled on a stone with three different scripts chiseled into its face. Measuring about 4 feet high and broken at the top, the 2,000-plus-year-old stone proved to be a dramatic find. Not only was it prized for its historical value, but the stone also helped scholars decipher the ancient Egyptian language.

Now, 214 years later, the Rosetta Stone still holds the marks inscribed into it. The stone is an important piece of data that has survived military coups, natural disasters and ruthless conquerors.

Which raises the questions: What will happen to our data in 2,000 years, or even 100 years? Can our data last… forever?

In December 2010, the Obama administration unveiled the cloud-first policy. The framework mandated that agencies take full advantage of cloud computing to maximize capacity and minimize cost. By moving services such as email, project data and administrative applications to the cloud, the government estimated it could cut as much as $20 billion from its $80 billion IT budget.

In essence, the cloud will be the main way the government keeps its data.

When “the cloud” comes to mind, one tends to think of data floating into space and staying there safely, forever, never to be bothered unless someone calls on it.

But that’s far from actual reality. The cloud is physical, degradable and finite.

“I always get upset when someone thinks of the cloud up here,” said Fenella France, chief of the preservation research and testing division at the Library of Congress, as she held her hands in the air. “It’s on something; it’s not just up there.”

What it’s on are servers. Google is estimated to run more than a million of them to store all of its data, and each server can hold terabytes of information. Servers require a lot of upkeep to stay operational: Google uses huge fans and intricate cooling systems just to keep them at the right temperature.

But data still deteriorates even when treated properly. So, how long can data last on certain platforms?

Finding the answer is part of France’s job at the Library of Congress. Deep in the basement of the Madison building is the preservation lab, where members of France’s team test different media for longevity.

“We try to predict how things age by testing them,” she said. “The environment is so critical in regards to making something last, so we increase the humidity and temperature on things like CDs.”

Humidity, temperature and light cause processes such as oxidation, corrosion and the breaking of chemical bonds, which destroy data. France pulled out two CDs: one had been exposed to 500 hours of heat at 176 degrees and one had not. The exposed CD was completely transparent, except for the paint describing the song titles.

France’s division even has CDs that have been playing for more than 20 years. After 10 years, some of them had a 4 percent loss in data.

The quickest way to lose the data on a CD is to label it; the label starts the deterioration process immediately, according to France.

France’s team works with all media, testing everything from vinyl records to paper to servers and hard drives.

So how long does data last on modern media?

Unless you have a very high-quality CD, two to five years is the best you’ll get before you’re in trouble. DVDs are more at risk because they hold more data; if one section is lost, much more data is at stake.

Magnetic hard drives only get about seven years of use, and flash drives eight to 10 years. Premium-grade media such as gold-plated CDs and enterprise hard drives add years.

Now, floppy disks can last… wait, floppy disks? When is the last time someone used the poor floppy disk, with its meager 1.44 megabytes of storage? Come to think of it, when is the last time a computer had a floppy drive?

Which brings up the next problem with making data last: system obsolescence.

It’s hard enough keeping track of how long data can last on a certain medium, but preservationists also have to worry about the system that can extract the data.

Data-storing media such as punch cards, magnetic drum memory, punched tape, laserdiscs, 8-tracks, VHS tapes, Betamax, Zip disks, cassette tapes, MiniDiscs and the floppy disk are almost impossible to read because the systems that extract the data are no longer made, are broken or have been trashed.

“Interfaces go away,” said Thomas Youkel, chief of the enterprise systems group at the Library of Congress. “Some interfaces just don’t exist anymore.”

To fight system obsolescence, the Library of Congress does its best to collect old systems. At the library’s Culpeper, Va., campus, a huge warehouse is operated by robots whose only job is to put VHS tapes into players and record them onto a more modern platform.

Old systems are especially needed when members of Congress leave office. The library usually doesn’t receive their files until their tenures end, so an influx of obsolete media arrives all at once, forcing preservationists to hunt for old technology.

Even if preservationists could update everything, and there is a lot, they would have to do it all over again when the next big thing came out.

Youkel knows all about updating; he is constantly refreshing technology at the Library of Congress to keep data living on. His rule of thumb for Linear Tape-Open tapes (more on these later) is they are only good for two tape generations.

“When it’s two generations old, systems can’t read them anymore,” Youkel said. A drive will write to the current generation and the one before it, and it can read tapes up to two generations old. But three generations? Forget it; it’s time to upgrade.
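Youkel's rule of thumb can be captured in a few lines of code. This is a rough illustrative sketch, not software the Library of Congress uses; the generation numbers are hypothetical examples.

```python
# A sketch of the LTO backward-compatibility rule of thumb described above:
# a drive writes to its own generation and the one before it, and can read
# tapes up to two generations back.

def can_write(drive_gen: int, tape_gen: int) -> bool:
    """A drive writes to its own generation and the previous one."""
    return drive_gen - 1 <= tape_gen <= drive_gen

def can_read(drive_gen: int, tape_gen: int) -> bool:
    """A drive reads tapes up to two generations old."""
    return drive_gen - 2 <= tape_gen <= drive_gen

# Hypothetical example: a generation-6 drive and some older tapes.
print(can_read(6, 4))   # True: two generations old, still readable
print(can_read(6, 3))   # False: three generations old, time to migrate
print(can_write(6, 4))  # False: too old to write to
```

Once a tape falls outside the readable window, the only option is to copy its contents forward to a newer generation while a compatible drive still exists.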

So, why isn’t everything just upgraded?

The amount of data just in the Library of Congress and the National Archives is staggering, not to mention in all the federal agencies, Congress and the White House.

The Library of Congress currently houses more than 155 million physical objects. Everything from the contents of Abe Lincoln’s pocket when he died to a 5-by-7-foot book featuring images of Bhutan is stored there. It receives 15,000 items and adds 11,000 items to its collection every working day. It has 838 miles of bookshelves, 3.4 million recordings, 13.5 million photographs and 5.4 million maps, many of which have been or will be digitized. The library also digitally archives every tweet, ever. Additionally, it has archived 422 terabytes of the Internet in the Internet Archive.

The National Archives has more than 4.5 million cubic feet of documents and 500 terabytes of electronic records that can be found online. Each year, it gets enough records from federal agencies to stretch 30,000 feet. When the Bush administration left the White House, it gave the archives 82 terabytes of information. And the 2010 census? That was 300 terabytes.

The Library of Congress has experts who decide what should be archived and what should be trashed. The Archives, on the other hand, takes in everything agencies deem permanent, which is about 1 to 2 percent of their files, according to David Lake of the National Archives’ strategic systems management division.

Federal agencies digitize their records as well; they are required to keep their files for specified amounts of time before the files are either sent to the Archives or discarded. The holding time can vary from 90 days to 50 or 75 years, according to Michael Carlson, the National Archives’ special assistant for the Office of the Chief Records Officer.

“For agencies, the average is about seven years,” he said.

How can we keep all that data when digital media need constant upkeep, updates and are invariably deteriorating and becoming obsolete?

The answer, for now at least, is LTO tape. Despite technological advances, tape is still the best way to store data. Enterprise-class tapes have an error rate of about one bit in 10^19. The tapes can last about 30 years, and the systems usually go out of date before the tapes start deteriorating.
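To put that error rate in perspective, a little back-of-envelope arithmetic (using decimal terabytes) shows how much data could be read before a single bad bit would be expected:

```python
# Back-of-envelope arithmetic for the quoted enterprise-tape error rate:
# at one unrecoverable error per 10**19 bits, how many terabytes can be
# read before a single bad bit is expected?

bit_error_rate = 1e-19        # one error per 10**19 bits read
bits_per_terabyte = 8e12      # 1 TB = 8 * 10**12 bits (decimal terabytes)

terabytes_per_error = 1 / (bit_error_rate * bits_per_terabyte)
print(f"{terabytes_per_error:,.0f} TB read per expected bit error")
# -> 1,250,000 TB, i.e. about 1.25 exabytes
```

That is roughly 1.25 exabytes of reading per expected error, which is why tape remains the medium of choice for archival backup.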

The Library of Congress is currently updating its tapes to the 8-terabyte form. The systems can no longer read the 1-terabyte form, which is now three generations old.

“The tapes look like old-school game cartridges,” said Jane Mandelbaum, information technology specialist at the Library of Congress. “They are very rugged to make sure the tape is as stable as possible.”

The Library of Congress keeps the tapes in a tape library.

“It looks like a fridge, but much bigger, like the size of a small office,” Youkel said. At capacity, the library can hold 12,500 tapes.

“We don’t want to get to the point when we have to worry about deterioration,” Youkel said. He has a team of 45 that constantly maintain, upgrade and patch data systems.

The most vulnerable time for the tapes is when they are handled, which is why robots do most of the work.

“The goal is not to have humans touch [the tape],” Youkel said. Robots copy data onto the tapes and onto other media, from which it can be put on servers for dissemination to the public. A workflow is created to put the data on servers kept in a 15,300-square-foot data center or an 11,000-square-foot remote center.

One very special copy is also made: a backup copy on LTO tape, never to be touched again until it’s time to upgrade.

This second copy is sent to a location that isn’t exactly secret, but the Library of Congress would prefer everyone didn’t know. The backups, according to France, are stored in disaster-proof buildings.

The Archives does the same thing with a secure facility in West Virginia.

“It’s basically racks of servers, storage equipment and cooling capacity,” Lake said.

Creating redundant storage systems keeps data safe. Data preservation’s main adage is LOCKSS: lots of copies keep stuff safe. If data becomes corrupt on one medium, there are always backups that can fill in the blanks. The Archives and the Library of Congress also use software that alerts them when files need to be migrated to a new platform or are in danger of deterioration.
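The basic mechanics behind such alerting software are simple: store a checksum for each file, then periodically compare every copy against it and repair any copy that has drifted. The sketch below is a hypothetical illustration of that idea, not the actual software either institution runs.

```python
import hashlib

def sha256(data: bytes) -> str:
    """Fingerprint a copy so silent corruption can be detected."""
    return hashlib.sha256(data).hexdigest()

def audit_and_repair(copies, reference_hash):
    """Replace any copy whose checksum no longer matches the reference."""
    good = next(c for c in copies if sha256(c) == reference_hash)
    return [c if sha256(c) == reference_hash else good for c in copies]

original = b"1799: stone with three scripts found in Egypt."
ref = sha256(original)  # recorded when the file was first archived

# Three redundant copies; the middle one has silently corrupted.
copies = [original, b"1799: stone with three s###rupted###", original]
repaired = audit_and_repair(copies, ref)
print(all(sha256(c) == ref for c in repaired))  # True
```

As long as at least one copy still matches the recorded checksum, the corrupted copies can be regenerated from it, which is exactly why "lots of copies" is the adage.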

Just as important as the data is what is dubbed metadata. Metadata is data about data. It logs the author, date, standards used and how to access the data once it’s been stored.

Basically, metadata is its own Rosetta Stone; it’s a translation of our current technological language for people in the future so they can access the data stored today. Imagine if someone a thousand years from now somehow found a perfectly preserved CD that never deteriorated. By reading the metadata, that person would know what kinds of files are on the disc, what it is made of and how it is read. He or she might even be able to build a device capable of reading the disc.
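A preservation metadata record might look something like the following. The field names and values here are invented for illustration; they are not the Library of Congress's actual schema.

```python
import json

# A hypothetical preservation-metadata record covering the kinds of fields
# described above: author, date, standards used and how to access the data.
record = {
    "title": "Oral history interview, audio master",
    "author": "Interviewer unknown",
    "date_created": "1993-06-14",
    "standards": ["WAV (PCM, 96 kHz / 24-bit)"],
    "access": "Readable with any PCM-capable audio decoder",
    "checksum_sha256": "computed at ingest",
}

# Stored as plain, human-readable text alongside the data itself.
print(json.dumps(record, indent=2))
```

The point is that the record is self-describing and stored in as plain a form as possible, so a future reader needs no special software to make sense of it.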

For that reason, and for generations closer to our era, preservationists store metadata with care.

“Our job is not to worry about forever, but to hand things over to the next generation by preserving content, processes and metadata,” Mandelbaum said.

So, can digital data last forever?

Maybe, but it will take a lot of work.

“[Digital data] certainly aren’t clay tablets,” Lake said. “[It] takes a lot of care and feeding. The key is doing it in a sustainable way. Doing it in a sustainable way is hard. Institutions are collaborating on standards, and the Library of Congress is at the forefront.”

The more redundancy built in, the more metadata stored, the better it is for the future of data.

“Then, it’s just a matter of monitoring when technology goes and when systems go obsolete,” said Lake.

Not everything can be inscribed in stone, and even the Rosetta Stone, the Constitution and the Declaration of Independence will wither away with the passing of time.

But maybe, with generations working together, people thousands of years from now will still know what was imprinted on those relics.

“The U.S. has committed to this for the life of the Republic,” Carlson said.
