Digital Preservation – It is Hard!

Happy New Year everyone!

Preserving scholarly works and the history of the world, of countries, and all the way down to individual institutions has been happening for a very long time in very different ways. The decision to preserve something for the long haul always lags the initial creation of content. Generally, the value of the content and the intent to preserve it are based on the impact of the content, which takes some time to become clear. Libraries are the institutions that make decisions on what to preserve, how to preserve it and how to make it available. This is a lot of hard work, often led by the special collections and archives staff in the libraries.

But in the last 20 to 25 years, things have changed dramatically thanks to the internet and advances in technology, and content is now predominantly born-digital. Since the wide adoption of social media, the volume of digital content has exploded. This applies to written text and audio as well as video. These advances have democratized content creation and distribution like never before, which, like everything else, has its positives and negatives. Of course, the advanced technology makes it easy to save content, and content creators generally take advantage of this and keep pretty much everything they create, although often not up to the standards that libraries need to meet to guarantee long-term preservation. This poses enormous challenges to libraries. There has been some excellent work in this area led by the Library of Congress, and you can read about it in detail here. I just want to touch on some of the technical aspects in this post.

Born-digital materials are a reality we need to contend with. Printing them and preserving the printouts is not an option because it has its own problems. There are several clear advantages to born-digital materials. For example, they are much easier to find because advanced tools can index them automatically in a myriad of ways. Most of them also carry valuable metadata, such as the time of creation and the name and version of the software used to create them. For images, music and video, you often also get the geographical location where they were produced, the date and timestamp, and so on.

One of the major issues with born-digital materials is the sheer amount of content out there. If we look at scholarly work, many scholars these days use multiple outlets for discussing their work in addition to publishing books or publishing in peer-reviewed journals. It is fair to say that most publishers are making every possible effort to preserve electronic versions of this content so that they can provide direct access to published materials, especially journal articles. In addition, services such as JSTOR and Portico provide federated electronic content with long-term preservation of scholarly content in mind. However, most scholars also write blogs, tweet, podcast, create YouTube videos and so on, many of which may be worth preserving. The process the experts go through to decide which of these are worth preserving becomes far more complex as the number of outlets grows.

When you need to preserve a physical object, it is a tangible thing that the preserving institution possesses. Electronic media, in most cases, are held by reference: you have URLs for the content, but these are not permanent. Given the volume of content, it is not always possible to make a local copy and preserve it as is. URLs are notoriously problematic in that, for a variety of reasons, they change or the content behind them is taken down. Attempts such as permalinks can be useful, but they depend on the content creator knowing about them and setting them up properly, something that cannot be relied on. The Internet Archive is a fantastic initiative to preserve portions of the internet, and for a certain subset of content you can use its Wayback Machine to look at past snapshots of websites, for example. But it has its limitations, because what you get to see are snapshots, and some critical matters of interest may have happened in between the snapshots.

Some of the items that are worthy of preservation include institutional matters. Certain prepared speeches by senior leaders, correspondence between senior leaders and descriptions of milestone or key events at the institution are good examples of archival materials that preserve the institutional history. As you can imagine, these are all entirely digital these days. How exactly does one sift through millions of emails to decide what should be preserved? Or should we simply preserve the entire mailboxes of key people because it is the easiest and cheapest way to do it? Regardless of the methodology, it is an onerous task. During the days of handwritten letters, it was much easier, I believe.

When it comes to born-digital content, authenticity is a huge issue. Changes can be made in so many different ways that elude detection systems. Of course, there have been cases of forgery all through history, but when done on paper, they are easier to detect, especially now, with the advances in technology. There are ways to establish the authenticity of digital content. One is to save a checksum of the original content and recalculate it on the copy being viewed: if the two values match, the content has not changed since the checksum was recorded.
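To make the checksum idea concrete, here is a minimal sketch in Python using the standard `hashlib` module. The function names `sha256_of` and `is_unchanged` are my own, and SHA-256 is just one reasonable choice of checksum, not something any particular standard requires.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`.

    The file is read in chunks so that large files do not
    have to fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_digest):
    """Compare the file's current digest with the one recorded earlier."""
    return sha256_of(path) == recorded_digest
```

You would call `sha256_of` once when the item enters the collection, store the result somewhere safe, and call `is_unchanged` whenever the item is viewed or audited. Keep in mind that a matching checksum only shows the file has not changed since the checksum was recorded. It says nothing about who created the content, and it is only as trustworthy as the place where the recorded value is kept.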

Then there is the sustainability of formats. How do we know that the current formats will be readable 25-50 years from now? Once things are printed or handwritten on paper, they will remain readable if properly preserved. But DOC files, PDF files and the like will change as the technology advances. Even if they remain readable, there are no guarantees that the layouts can be preserved exactly. We know this to be the case already with embedded images and tables, which don’t necessarily render the same way they did in some document types a few years back. This, to me, poses the biggest challenge, even though there are standards being developed for long-term preservation of electronic formats.

It is inevitable that some formats will become obsolete, and we should be prepared to transform digital content over time. The time and effort it will take us to convert what will become a huge collection of born-digital materials is beyond comprehension at this point.
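A first step toward sizing that conversion effort is simply knowing which formats a collection holds. The sketch below, again in Python, counts files by extension under a folder. The function name `format_inventory` is hypothetical, and a file extension is only a rough proxy for the real format, so treat this as an illustration and not as proper format identification.

```python
from collections import Counter
from pathlib import Path


def format_inventory(root):
    """Count the files under `root` by lowercase file extension.

    Files with no extension are grouped under "(none)".
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling `format_inventory("some/collection").most_common()` lists the extensions from most to least frequent, which gives a rough sense of where a future migration would have to start.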

The point of all of this is that, convenience aside, born-digital materials pose a huge challenge from the preservation perspective. The Library of Congress and several open source community efforts are working on these problems, but the technologies are not waiting!

I would like to thank Karen Bohrer, Director of Library Collections, for reviewing this post and providing suggestions!
