The dominant technology across genomics today is next-generation sequencing (NGS). NGS is effectively a massively parallel sequencing technique: it decodes many individual DNA fragments of a genome at the same time.
A human genome is over three billion bases, or nucleotides. Sequencing just one human genome generates about 100 GB of data, represented by about 300 million individual sequencing reads. The machines that produce these reads can generate roughly 16 terabytes of data over a 24-hour period. That is an incredible amount of data.
In simple arithmetic, a single 24-hour run covers 24 human genomes. Some labs have anywhere from 2 to 10 of these instruments and run them constantly, day after day, all week. The real challenge, then, is to analyze that amount of data so that you can get meaningful information out of it. How do you handle something this vast, and how do you move it around quickly? Do you do this on-premises or in the cloud? How do you implement technology to complete a whole genome sequence (WGS) on a reasonable time scale?
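To give a feel for the scale, here is a minimal back-of-the-envelope sketch using only the figures quoted above (roughly 16 TB per instrument per 24 hours, 2 to 10 instruments per lab, running all week). The function name `weekly_output_tb` is hypothetical, and the result is an illustrative estimate, not a measured value.

```python
# Rough estimate of a lab's weekly raw sequencing output.
# Both constants come from the figures quoted in the text above.
TB_PER_INSTRUMENT_PER_DAY = 16
DAYS_PER_WEEK = 7


def weekly_output_tb(instruments: int) -> int:
    """Terabytes produced per week if every instrument runs nonstop."""
    return instruments * TB_PER_INSTRUMENT_PER_DAY * DAYS_PER_WEEK


for n in (2, 10):
    print(f"{n} instruments: about {weekly_output_tb(n)} TB per week")
```

Under these assumptions, a lab at the small end produces a few hundred terabytes a week and a lab at the large end passes a petabyte, all of which has to be stored, moved, and analyzed.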
High Performance Computing
This is where high-performance computing (HPC) comes in, providing the power at the back end. Today, in a standard on-premises environment, processing a single WGS takes about 150 hours. In the cloud, it depends on how many resources you allocate: it can take anywhere from 60 to 150 hours.
This is Lenovo’s baseline and the target our genomics R&D group has been focused on: finding the best hardware and software recipe to support high-throughput volumes, deliver accelerated execution speeds, and increase usability in an environment that secures data privacy. The result of this research is an appliance called GOAST, which stands for Genomics Optimization and Scalability Tool. GOAST is an easy-to-use, plug-and-play, scalable offering that uses specially tuned hardware to accelerate the open-source Genome Analysis Toolkit (GATK) software suite from the Broad Institute.
Typical environments run GATK workloads in 60 to 150 hours per human genome. GOAST cuts that time to under an hour per genome, accelerating analytics up to 167X. Before GOAST, this level of performance was found only in costly GPU-based infrastructures. But GOAST does it all with standard off-the-shelf (OTS) components, and at about 50% of the cost.
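The relationship between these numbers can be checked with a short sketch. It assumes, as a reading of the figures rather than a statement from Lenovo, that the "up to 167X" speedup is measured against the 150-hour end of the baseline range. The function name `implied_runtime_hours` is hypothetical.

```python
def implied_runtime_hours(baseline_hours: float, speedup: float) -> float:
    """Runtime after acceleration: baseline divided by the speedup factor."""
    return baseline_hours / speedup


# Figures quoted in the text; pairing 167X with the 150-hour baseline
# is an assumption made for this illustration.
BASELINE_HOURS = 150
MAX_SPEEDUP = 167

runtime = implied_runtime_hours(BASELINE_HOURS, MAX_SPEEDUP)
print(f"Implied runtime: {runtime * 60:.0f} minutes per genome")
```

Under that assumption the implied runtime is just under 54 minutes, which is consistent with the "under an hour per genome" claim.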
Lenovo is the only partner who delivers this degree of performance in a cost-effective, open source, CPU-based solution. With GOAST’s revolutionary speed and accessibility, genomics researchers in life sciences, precision medicine, and infectious diseases can understand their data faster, make discoveries sooner, and, most importantly, save more lives.
The Importance of AI
Sequencing of the COVID-19 virus genome and its variants needs to be done quickly. The less time it takes, the sooner work can proceed on the next RNA vaccine. Enabling life sciences research, and specifically empowering COVID-19 discoveries, requires HPC environments that support high-throughput volumes, accelerate speeds, optimize infrastructure, and protect data privacy. But some of the speed in genomics and sequencing also comes from the software component, and this is where artificial intelligence (AI) comes in.
AI is already everywhere, yet many of us do not realize when we are dealing with an AI bot rather than a human voice, or that our bank is using AI to protect us from fraud. Genomics is no different. Two years ago, roughly 300 publications listed on PubMed alone related to AI and genetics/genomics; that number has since exploded to around 2,000.
Having a back-end AI facility that can make sense of data and recognize differences between one genome and another, all under very explicit rules, is incredibly helpful. Time to information and time to decision get shorter and shorter with AI. Imagine a system that can look at a million pages a minute versus a human being who might manage just one.
We have experts who can advise you on a complete, end-to-end deployment of population-level genomics, from workload planning to cluster sizing to accelerating secondary and tertiary NGS workflows. Lenovo’s commitment to developing and adopting cutting-edge technological innovation is enabling the worldwide movement toward sequencing ever-larger samples and is empowering scientists on the front lines of COVID-19 research to accelerate their path to discovery. The tech revolution of accessible innovation is here, whether for basic research, infectious disease, or precision medicine.