01 February 2014

The medium is irrelevant. It's the program that counts.

The proposal to use Blu-ray discs in a robotic retrieval system[1] may seem regressive to some, brilliant to others (as illustrated in the comments).  Why not tape?  Why not spinning disks?

In fact, the medium is largely irrelevant.

In the 1980s, when I was a young archivist and data CDs were also new, many in the profession wondered about the longevity and stability of records in electronic formats.  We didn’t know how we would read the media when the hardware was obsolete.

At the same time, CDs seemed like something of a godsend.  They could hold lots of data relatively cheaply at a time when a gigabyte of storage was extraordinarily expensive.  And CDs lasted forever.

In fact, CDs don’t last forever.  The Image Permanence Institute, which regularly tests the longevity of photographic materials, tested the longevity of Kodak’s Photo-CD.  (The test is based on the Arrhenius function, which predicts how chemical activity – deterioration – accelerates at increased heat and humidity.  Put the film or disk in a very hot, very humid chamber, and test for signs of failure.  You can use the results to reasonably estimate the time it will take to see those same effects at normal temperatures and humidities.)[2]
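To make the extrapolation concrete, here is a minimal sketch of the temperature half of that reasoning, using the standard Arrhenius relation.  It is not the Image Permanence Institute's procedure: it ignores humidity entirely, and the activation energy, chamber temperature, and failure time below are hypothetical numbers chosen only for illustration.  Real tests estimate the activation energy from runs at several temperatures.

```python
import math

GAS_CONSTANT = 8.314  # J/(mol*K)


def acceleration_factor(activation_energy_j_mol, use_temp_c, test_temp_c):
    """How many times faster deterioration runs at test_temp_c than at use_temp_c.

    Uses the Arrhenius relation: rate is proportional to exp(-Ea / (R * T)),
    with T in kelvin. Humidity is NOT modeled in this sketch.
    """
    use_kelvin = use_temp_c + 273.15
    test_kelvin = test_temp_c + 273.15
    exponent = (activation_energy_j_mol / GAS_CONSTANT) * (
        1.0 / use_kelvin - 1.0 / test_kelvin
    )
    return math.exp(exponent)


def estimated_life_years(days_to_failure_in_chamber, activation_energy_j_mol,
                         use_temp_c, test_temp_c):
    """Scale the observed time-to-failure in the chamber up to storage conditions."""
    factor = acceleration_factor(activation_energy_j_mol, use_temp_c, test_temp_c)
    return days_to_failure_in_chamber * factor / 365.25


if __name__ == "__main__":
    # Hypothetical inputs: Ea = 80 kJ/mol, chamber at 80 C, storage at 20 C,
    # failure observed after 60 days in the chamber.
    years = estimated_life_years(60, 80_000.0, 20.0, 80.0)
    print(f"Estimated life at 20 C: {years:.0f} years")
```

The estimate is extremely sensitive to the assumed activation energy, which is one reason batch-to-batch variation in materials makes a single "certified" lifetime hard to trust.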

The Photo-CD tested out at roughly two hundred years.  The trick is, Kodak’s Photo-CD used fundamentally different materials from commercially produced audio CDs.  Those use fundamentally different materials from writeable/rewriteable CDs.  To further complicate matters, it is impractical to test the longevity of commercially available R/W CDs and similar media because different manufacturers use different materials, and even the same manufacturer may vary materials from batch to batch.  Yes, they use a dye layer.  But the dye layer may vary slightly from batch to batch.  The article quotes Facebook’s Giovanni Coglitore saying, “Each disc is certified for 50 years of operation; you can actually get some discs that are certified for 1,000 years of reliability.”  Which immediately raises the question, certified by whom?  The manufacturer?  (See: fox, hen house.)  Or an independent testing organization, such as the Image Permanence Institute?

Ultimately, though, longevity doesn’t matter because the medium doesn’t matter. 

This bit of circumlocution gets to an important lesson archivists learned.  It’s the bitstream that matters.  Is it important whether you’re reading the signal for that photo from the original chip in your camera, the spinning disk in your laptop, a cloud service, or a thumb drive?  (The technical answer is, “It depends on the need for forensic information to authenticate the original and fully understand its context.”  But that’s another story told well by Matthew Kirschenbaum, and given that FB is creating the discs for its internal use, it’s less relevant here.)

Archivists realized that the record – the information we acquire for permanent preservation and access – is contained in the signal.  The signal is portable.  It can be moved from one medium to another.  Who cares about whether Garth Brooks’ audio CDs are still around?  Get an MP3, store it in the cloud, and access it anywhere.

The real question is not the longevity of the media or a particular format, but whether the bitstream can be moved with integrity.  Does the copy exactly match the original?  With traditional, analog media, noise increases with each copy.  With digital information, we can verify that the signal is an exact match without any loss of fidelity. 
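As a minimal sketch of what "moved with integrity" means in practice, the snippet below copies a file to a new location and then compares the two byte for byte.  The function name is my own, and a production system would also handle errors and very large files more carefully.

```python
import filecmp
import shutil


def copy_and_verify(source_path, destination_path):
    """Copy a file, then confirm the copy matches the original byte for byte."""
    shutil.copyfile(source_path, destination_path)
    # shallow=False forces a comparison of the file contents rather than
    # relying only on size and modification time.
    return filecmp.cmp(source_path, destination_path, shallow=False)
```

A result of True means the bitstream on the new medium is identical to the one on the old medium, whatever those media happen to be.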

Digital archivists go a step further.  They capture and store a hash value of the bitstream.  By recalculating the hash in the future, archivists can be assured that the bitstream has not changed for some reason (malicious or accidental).
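Here is a minimal sketch of that practice using Python's standard hashlib module.  The choice of SHA-256 and the function names are my assumptions; an archives may standardize on a different algorithm and will need somewhere trustworthy to keep the recorded values.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, reading it in chunks to limit memory use."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def is_unchanged(path, recorded_digest, algorithm="sha256"):
    """Recalculate the digest and compare it with the value captured earlier."""
    return file_digest(path, algorithm) == recorded_digest
```

At acquisition, store the result of file_digest alongside the record.  At any later date, is_unchanged reports whether the bitstream still matches that stored value.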

In fact, archivists should go two steps further.  Preserve at least two independent copies of any given file.  Periodically, the system will recalculate the hash for each copy and compare it to the initial hash.  If there’s a difference, the defective copy can be replaced with the other (presumably uncorrupted) copy.  This approach is different from a backup.  If a file is corrupted before being rechecked, the corrupted file can cascade through the backups, with the uncorrupted version lost.  Keeping at least two independent copies helps prevent that scenario.  (David S. H. Rosenthal, a former engineer for Sun Microsystems, has done research suggesting that for truly robust protection, the system should maintain a minimum of seven copies.  This may not be practical for most data, but for archives – that small percentage of records of historical importance that protect individuals’ rights and interests – it may be reasonable.  For more information on this approach, see LOCKSS.org, based at Stanford.)
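The two-copy audit described above can be sketched as follows.  This is a simplified illustration with names of my own invention, not the LOCKSS protocol, and it assumes the initial hash itself is stored somewhere safe from the same failures that threaten the copies.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def audit_and_repair(copy_a, copy_b, initial_digest):
    """Check two independent copies against the initial hash and repair if possible.

    Returns one of: "ok", "repaired_a", "repaired_b", "both_damaged".
    """
    a_intact = sha256_of(copy_a) == initial_digest
    b_intact = sha256_of(copy_b) == initial_digest

    if a_intact and b_intact:
        return "ok"
    if a_intact:
        # Copy B no longer matches; overwrite it with the intact copy A.
        shutil.copyfile(copy_a, copy_b)
        return "repaired_b"
    if b_intact:
        # Copy A no longer matches; overwrite it with the intact copy B.
        shutil.copyfile(copy_b, copy_a)
        return "repaired_a"
    # Neither copy matches the initial hash; nothing here can be trusted.
    return "both_damaged"
```

Note the contrast with a backup: a damaged copy is never propagated, because a copy is only used as a repair source after it has been checked against the initial hash.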

Having the software to read the bitstream is yet another problem.  An archives with 5.25-inch disks may find a drive to move the bitstream to more accessible media, but may not have a copy of WordPerfect 5.1, XyWrite, WordStar, or other software that can make the bitstream intelligible.

Ultimately, a true digital archives has three key components.  The collection of records themselves – the content.  The place to store them – a Blu-ray jukebox, the cloud, spinning disks, or tape.  And a program to sustain the archives into the future so that the records are properly preserved and people can use them.

For digital archives, questions of appraisal and selection may not be significantly different.  We can use existing approaches to help identify records that merit acquisition.  And if the medium doesn’t matter, the place doesn’t matter; storage will change over time as technology evolves.

What may be the most critical element of a digital archives is the program to ensure the long-term preservation of and access to the records.  Paper records are familiar; people know the medium, and few need to study that technology.  Paper lasts decades, so there’s no urgency.  And the profession has established practices that have worked – more or less well – for decades.  But what we took for granted must now be re-examined and reconsidered.  You don’t put digital records in an acid-neutral box, stored at a reasonable temperature and humidity.  What is the digital equivalent of a box?  How do you know if digital records are showing signs of decay and need to be refreshed?

Ultimately, you need a program and policies that explicitly think through what we took for granted.  The medium doesn’t matter.  But the signal does.  Preserving that signal is as much a matter of policy and procedure as it is of technology.[3]

Clayton State University offers a course in digital preservation as part of its Master of Archival Studies Program.  Not only does the course give students the knowledge they need to develop a digital preservation program, it structures the process over fifteen weeks.  Practicing archivists can use the course to develop a program for their archives in a reasonable amount of time.  That time allows the student to get input from others and to consider the plan in the context of their specific work environment.

Because courses are fully online and live, they’re available across the nation.  Web-based video conferencing ensures lectures and discussion are engaging.  Practicing archivists are invited to take specific courses in the program as part of their ongoing professional development.

For more information, see the MAS program website at http://www.clayton.edu/mas/.  Or feel free to contact me.

[1] Jon Brodkin, 31 Jan 2014. http://arstechnica.com/information-technology/2014/01/why-facebook-thinks-blu-ray-discs-are-perfect-for-the-data-center/; checked 1 February 2014.
[2] Although focused on photographic materials, see “A Consumer Guide to Understanding Permanence Testing” (Image Permanence Institute, 2009). https://www.imagepermanenceinstitute.org/webfm_send/311; checked 31 Jan 2014 and 1 February 2014.
[3] Although I don’t make a direct reference to her work in this post, I want to acknowledge all that I’ve learned about digital preservation from Nancy McGovern.