Storing 10 petabytes in one teaspoon of DNA

Using DNA to store data could help to avert a looming data storage crisis.

In 2018, people watched 4.33 million videos on YouTube, sent 159 million emails and posted 49,000 photographs on Instagram every minute of the year, among other data uses. At this rate, we will produce 418 zettabytes of data this year, according to the World Economic Forum, and even more in the future. A single zettabyte is a trillion gigabytes.

Our current methods of storing all this data are not sustainable, for several reasons. Most digital archives are now stored on magnetic and optical data storage systems, but we will run out of the materials used to produce these in less than a century, if that.

Meanwhile, the environmental and economic cost of server farms, which already make up three percent of global electricity use and two percent of greenhouse gas emissions, will soar.

‘All of YouTube in a teaspoon’

While scientists have been investigating alternative methods of storing data, one stands out.

DNA-based data storage, which stores information in manmade strands of DNA, has three key advantages. It has extremely high data storage density, remains stable for hundreds of years, and requires very little power.

In 2019, scientists in Israel announced that they had developed a way to store more than 10 petabytes, or 10 million gigabytes, in a single gram of DNA. This means that, theoretically, all of YouTube’s data could be stored in a teaspoon of DNA.

Although scientists have been working on DNA-based data storage methods for nearly a decade, major obstacles remain, and this is where Singapore can play a key role.

The key challenges

First, a quick explanation of how DNA-based data storage works.

Each DNA molecule consists of linked components called nucleotides, which come in four types: guanine, cytosine, adenine and thymine, represented by the letters G, C, A and T. To store information in DNA, digital data, which consists of 0s and 1s, is translated into sequences made up of the letters G, C, A and T.

Companies or other organisations then manufacture synthetic DNA molecules representing those translated sequences and store them. To retrieve the data, the synthetic DNA molecules are sequenced, and the output translated back into the original digital information.
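To make the translation step concrete, here is a minimal sketch in Python. The two-bits-per-nucleotide table below is an assumption chosen for illustration, not the mapping used by any particular research group; practical schemes add error correction and constraints on the sequences they produce.

```python
# Hypothetical mapping for illustration only: each pair of bits becomes one nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Translate bytes into a string of the letters A, C, G and T."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Translate a string of A, C, G and T back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # CAGACGGC
print(decode(strand))  # b'Hi'
```

In this sketch, each byte becomes four nucleotides, and decoding simply reverses the table. It assumes the strand is read back without any errors, which is not the case in practice.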

While this method has been tried and tested, there are significant challenges.

The cost of sequencing DNA has fallen dramatically in recent years. The cost of producing the synthetic DNA molecules, however, is still prohibitively high. Currently, it costs about US$5 million (S$6.7 million) to store just one gigabyte of data, a lot of money for not even a full DVD movie!
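A quick back-of-the-envelope calculation shows the scale of the problem. The per-gigabyte figure is the one quoted above; the 4.7-gigabyte capacity of a single-layer DVD is an assumption added here for comparison.

```python
COST_PER_GB_USD = 5_000_000  # figure quoted in the text
DVD_CAPACITY_GB = 4.7        # assumption: a standard single-layer DVD


def storage_cost_usd(gigabytes: float) -> float:
    """Estimated cost of writing the given amount of data into synthetic DNA."""
    return gigabytes * COST_PER_GB_USD


print(f"US${storage_cost_usd(DVD_CAPACITY_GB):,.0f}")  # US$23,500,000
```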

Creating DNA molecules and sequencing them also involve biochemical and biophysical processes that are prone to errors. The process of writing DNA to produce the synthetic molecules, for example, is vulnerable to substitution, insertion and deletion errors.
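The three error types can be illustrated with a short Python sketch. The helper functions and the example strand are hypothetical and exist only to show the effect of each error. A substitution corrupts one position, whereas an insertion or a deletion shifts every nucleotide that follows it.

```python
def substitute(strand: str, pos: int, base: str) -> str:
    """Replace the nucleotide at pos with a different one."""
    return strand[:pos] + base + strand[pos + 1:]


def insert(strand: str, pos: int, base: str) -> str:
    """Add an extra nucleotide before position pos."""
    return strand[:pos] + base + strand[pos:]


def delete(strand: str, pos: int) -> str:
    """Drop the nucleotide at position pos."""
    return strand[:pos] + strand[pos + 1:]


def mismatches(a: str, b: str) -> int:
    """Count differing positions, comparing the two strands position by position."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


original = "ACGTACGT"
print(mismatches(original, substitute(original, 2, "T")))  # 1
print(mismatches(original, insert(original, 2, "T")))      # 7
print(mismatches(original, delete(original, 2)))           # 6
```

A naive position-by-position comparison copes with the substitution but is thrown off by the single insertion or deletion, which is one reason these errors are harder to correct.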

The Singapore connection

In Singapore, several teams of researchers are hard at work on these problems.

At the National University of Singapore, Associate Professor Poh Chueh Loo, Associate Professor Yew Wen Shan, and their colleagues are working on more efficient ways to synthesise DNA sequences.

The Singapore University of Technology and Design’s Advanced Coding and Signal Processing Laboratory, where I am a visiting scholar, is another local nexus of research in the field.

The laboratory, under the leadership of Associate Professor Cai Kui, its founder, has been developing algorithms to prevent, detect and correct errors in writing and sequencing DNA.

We have found, for instance, that when the same nucleotide is repeated more than four times in a row, the probability of sequencing errors rises substantially. We have also described how to design algorithms to translate data into strands of nucleotides that meet various error-limiting conditions.
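As an illustration of one such error-limiting condition, the Python sketch below checks whether a strand avoids long runs of the same nucleotide. The function names and the default limit of four are assumptions based on the observation above; this is not the laboratory's own code.

```python
from itertools import groupby


def longest_run(strand: str) -> int:
    """Length of the longest stretch of one repeated nucleotide."""
    return max((len(list(group)) for _, group in groupby(strand)), default=0)


def within_run_limit(strand: str, max_run: int = 4) -> bool:
    """True if no nucleotide is repeated more than max_run times in a row."""
    return longest_run(strand) <= max_run


print(within_run_limit("ACGGGGT"))   # True: the longest run is GGGG
print(within_run_limit("ACGGGGGT"))  # False: GGGGG is five in a row
```

An encoder that respects the condition would only ever emit strands for which this check passes.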

Furthermore, we calculated the maximum number of data bits that can be stored per nucleotide if a constraint is imposed to prevent too many repetitions of a nucleotide in a row.
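One way to estimate such a limit numerically is to count how many strands of a given length obey the constraint and see how fast that count grows. The Python sketch below uses a standard counting argument and is an illustration under stated assumptions, not necessarily the method used in the research described here. The function names and the default length of 200 are hypothetical choices.

```python
import math


def count_strands(length: int, max_run: int) -> int:
    """Count strands over A, C, G, T in which no nucleotide repeats more than max_run times in a row."""
    if length == 0:
        return 1
    # ending[r] = number of valid strands whose final run has length r
    ending = [0] * (max_run + 1)
    ending[1] = 4
    for _ in range(length - 1):
        nxt = [0] * (max_run + 1)
        nxt[1] = 3 * sum(ending)       # switch to one of the 3 other nucleotides
        for r in range(1, max_run):
            nxt[r + 1] = ending[r]     # repeat the last nucleotide once more
        ending = nxt
    return sum(ending)


def bits_per_nucleotide(max_run: int, length: int = 200) -> float:
    """Approximate data bits per nucleotide from the growth rate of the count."""
    return math.log2(count_strands(length, max_run) / count_strands(length - 1, max_run))


for limit in (1, 2, 3, 4):
    print(limit, round(bits_per_nucleotide(limit), 4))
```

Without any constraint, each nucleotide could carry exactly two bits. In this sketch, forbidding all immediate repeats lowers the figure to about 1.585 bits, while allowing runs of up to four keeps it just under two.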

Much more work needs to be done to make DNA-based data storage viable, including on problems such as how to restore lost data.

In hard disk drives, data is stored in fixed places, so even if you lose some data, the fact that you know what is supposed to go where can help you to restore the missing pieces. A pool of DNA, however, is like coffee in a pot, with free-floating molecules. This makes data restoration much more difficult.

Still, DNA-based data storage remains one of the most promising solutions to our impending data storage crisis. And Singapore, with its vibrant research sector and deep expertise in the sciences, is well positioned to be a leader in this research field.