Mar
2017
Long Term Data Storage – DNA?
I was listening to a Science Friday podcast and one of the topics was storing data using DNA. Both DNA computing and DNA storage have always fascinated me. As we all know, DNA has a very long shelf life! The half-life of DNA has recently been estimated to be 521 years under certain conditions, such as vacuum packing and storage at -80 degrees Celsius. That would roughly translate to over a million years of shelf life before the DNA disintegrates into something that is no longer useful. Given what we have seen of the shelf lives of digital storage media such as CDs and DVDs, this sounds pretty good.
In addition, the “storage density” of DNA is very impressive! In a recent study, scientists from Columbia University reported that they have developed a method through which they can store 215 petabytes of data in just 1 gram of DNA. A petabyte is a mere million gigabytes, and 215 of them fit in 1 gram of DNA. How impressive is that! If you are interested in an overview of some of the earlier studies on this subject, you might want to read this article.
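To put that density figure in perspective, here is a quick back-of-the-envelope sketch. The 215 petabytes per gram comes from the study mentioned above; the function names and the 4,000 GB drive used as an example are just my own illustration, and I am assuming decimal units (1 PB = 1,000,000 GB).

```python
# Back-of-the-envelope arithmetic for the density figure quoted in the post.
PETABYTES_PER_GRAM = 215             # figure reported by the Columbia group
GIGABYTES_PER_PETABYTE = 1_000_000   # assumption: decimal units

def gigabytes_per_gram(petabytes_per_gram: float = PETABYTES_PER_GRAM) -> float:
    """Convert a density in PB/gram to GB/gram."""
    return petabytes_per_gram * GIGABYTES_PER_PETABYTE

def grams_needed(data_gigabytes: float,
                 petabytes_per_gram: float = PETABYTES_PER_GRAM) -> float:
    """Grams of DNA needed to hold the given amount of data at that density."""
    return data_gigabytes / gigabytes_per_gram(petabytes_per_gram)

if __name__ == "__main__":
    drive_gb = 4_000  # hypothetical 4 TB drive, for illustration only
    micrograms = grams_needed(drive_gb) * 1_000_000
    print(f"1 gram holds {gigabytes_per_gram():,.0f} GB")
    print(f"A {drive_gb:,} GB drive would need about {micrograms:.1f} micrograms of DNA")
```

At this density, the contents of that hypothetical drive would fit in roughly 19 micrograms of DNA.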
OK, it lasts a long time and is very efficient in terms of capacity, so what’s the catch?
If you read through some of the links above, you will see several issues. Before we go any further, I want to be clear that we are talking about DNA synthesized in a lab. Though it is chemically identical to the DNA found in organisms, it does not behave the same way. For example, it does not replicate or undergo mutations.
The major issues with DNA storage are that it is prohibitively expensive, both to create and to read data back from. The whole process is slow and is not what we are used to; you need a lot of planning. The current method is to first take whatever you want to store in digital form and use one of the encoding methods to create the sequence of base pairs that will form the DNA. The Columbia group’s method (which uses fountain codes for error correction) has been deemed the most efficient method so far.
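To make the encoding step a little more concrete, here is a minimal sketch of the simplest possible mapping: two bits per base. This is not the Columbia group's fountain-code method, and real schemes add constraints and error correction on top. The mapping (00 to A, 01 to C, 10 to G, 11 to T) and the function names are my own assumptions for illustration.

```python
# Naive 2-bits-per-base mapping (illustrative assumption, NOT the
# fountain-code scheme mentioned in the post).
BASES = "ACGT"  # index 0..3 corresponds to bit pairs 00, 01, 10, 11

def encode(data: bytes) -> str:
    """Turn each byte into four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)

def decode(sequence: str) -> bytes:
    """Reverse of encode: every four bases become one byte."""
    if len(sequence) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(sequence), 4):
        byte = 0
        for base in sequence[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)  # ValueError on a non-ACGT base
        out.append(byte)
    return bytes(out)

if __name__ == "__main__":
    strand = encode(b"Hi")
    print(strand)           # CAGACGGC
    print(decode(strand))   # b'Hi'
```

A text sequence like this is what you would hand to a synthesis company. The hard and expensive parts come after that.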
The DNA formed this way will carry redundant copies of the data so that errors can be corrected. The sequence that is generated then needs to be sent to a company that can synthesize the DNA and send it back to you. This is not cheap. You also need to store it appropriately so that the DNA doesn’t easily disintegrate. Then, you need to be able to read it back. This is pretty challenging. You basically read the sequence back (typically through a chemical reaction), another costly and time-consuming endeavor.
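One simple way to see why extra copies help: if you read several copies of the same strand and each read has a few errors in different places, a position-by-position majority vote usually recovers the original. This is only a sketch of the general idea, assuming reads of equal length with substitution errors only; it is not how the fountain-code approach works, and the function name is made up for this example.

```python
from collections import Counter

def majority_vote(reads: list[str]) -> str:
    """Rebuild a sequence by taking the most common base at each position.

    Assumes all reads have the same length and only contain substitution
    errors (no inserted or missing bases), which is a simplification.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(read) != length for read in reads):
        raise ValueError("all reads must have the same length")
    # zip(*reads) yields one tuple per position holding that position's
    # base from every read.
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*reads))

if __name__ == "__main__":
    reads = [
        "CAGACGGC",  # clean read
        "CAGTCGGC",  # error at position 3
        "GAGACGGC",  # error at position 0
    ]
    print(majority_vote(reads))  # CAGACGGC
```

More copies mean more tolerance for errors, but also more DNA to synthesize and sequence, which is part of why the cost adds up.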
Today, we may store millions of documents in cloud storage, but in order to get to one, we don’t have to scan through all of them. Those of us who have used magnetic tapes will remember this issue. Files were stored sequentially on a magnetic tape, and to get to one, the tape needed to go through all the previous files. Though DNA storage is not the same as magnetic tape, and reading any arbitrary portion is possible, it requires proper planning and encoding. Random access is the method used to access information on disks, and we now take it for granted. Scientists are working on random access to information stored in DNA.
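The difference between the two access styles is easy to show in a few lines. In this sketch the "tape" is just a list that has to be walked from the start, while the dictionary stands in for whatever addressing scheme gets planned into the encoding. The file names and helper functions are hypothetical, and this says nothing about how addressing is actually done chemically.

```python
def tape_read(tape, name):
    """Sequential access: walk from the start until the file is found.

    Returns the payload and how many files had to be passed over,
    including the one we wanted.
    """
    for position, (file_name, payload) in enumerate(tape):
        if file_name == name:
            return payload, position + 1
    raise KeyError(name)

def build_index(tape):
    """Random access: plan an address for every file up front."""
    return {file_name: payload for file_name, payload in tape}

if __name__ == "__main__":
    # Hypothetical archive contents, for illustration only.
    tape = [("letter.txt", b"A"), ("photo.jpg", b"B"), ("manuscript.pdf", b"C")]

    payload, files_scanned = tape_read(tape, "manuscript.pdf")
    print(payload, files_scanned)    # b'C' 3

    index = build_index(tape)
    print(index["manuscript.pdf"])   # b'C', fetched directly by its address
```

The lookup is only cheap because the addresses were decided at encoding time, which is the "proper planning" part.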
DNA storage offers one of the highest information densities around, but what makes it most interesting is its shelf life! Can you imagine tucking away some of the most valuable digitized materials, such as manuscripts and artwork, and having the peace of mind that they will last for hundreds of years, if not millions? Today, we happily digitize and store everything and don’t think about how long the current storage media, storage protocols and retrieval methods will continue to live! Though several initiatives are in the works in terms of Digital Preservation, something like DNA storage could potentially be a game changer.
It is in the archival of valuable data that I see the real value for something like this. Will it have problems in terms of authenticity, potential loss, etc.? Absolutely, but they are not much different from the issues we have always faced in preserving information.