IT Requirements for Genomics Research

Genomics research has grown tremendously in the past 5 to 10 years, with massive advancements in computing and technology making work in the field easier and cheaper. While the first draft of the human genome, produced by the Human Genome Project and announced in 2000, carried an estimated sequencing cost of almost $300 million, many diagnostic labs and service providers can now sequence an individual's whole genome for under $1,000.

To find out more about the role of IT in these advancements, we spoke to Sinisa Nikolic, Director of High Performance Computing and AI at Lenovo Asia-Pacific.

The field of genomics research has grown exponentially over the past decade or so. Could you share more about the type of compute landscape that has supported these advancements?

The applications associated with genomics, or really omics in general (the umbrella term for the entire suite of disciplines within the genomics landscape), are what we call “embarrassingly parallel”. High Performance Computing (HPC) is built around codes and algorithms that can run on multiple machines at the same time. Instead of pushing everything through a single stream, you can break the work into multiple streams to increase the throughput of these codes through the system. These omics codes love parallel workloads and work streams.
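To make the idea of an embarrassingly parallel omics workload concrete, here is a minimal, purely illustrative Python sketch (not part of GOAST, GATK, or any real pipeline) in which each chromosome is treated as an independent task that can be fanned out across CPU cores; the function and its "result" are hypothetical stand-ins.

```python
# Minimal sketch of an "embarrassingly parallel" omics-style workload:
# each chromosome (or genomic region) is processed independently, so the
# work can be spread across many cores or nodes with no coordination.
from multiprocessing import Pool

CHROMOSOMES = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]

def analyse_region(chromosome: str) -> tuple[str, int]:
    """Placeholder for a real per-region step (alignment, variant calling, ...)."""
    # A real pipeline would read and process the data for this region here.
    simulated_variant_count = sum(ord(c) for c in chromosome)  # stand-in work
    return chromosome, simulated_variant_count

if __name__ == "__main__":
    # Each region is an independent task, so throughput scales with worker count.
    with Pool(processes=8) as pool:
        for chrom, count in pool.map(analyse_region, CHROMOSOMES):
            print(f"{chrom}: {count} (placeholder result)")
```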

In today’s technology environment, we see multicore processors running at incredibly high rates of 3.5 GHz or more. When you have fast processors, you need to feed them with fast bandwidth, fast channels, fast memory, and everything else within the nodes or compute infrastructure. You need uber-fast I/O bandwidth, because omics applications generate incredible amounts of data, especially when you are doing sequencing, and you need a lot of ultra-fast memory. It all then connects together with a very fast interconnect, with a software stack around it that lets you manage the system and centrally manage the nodes.

From that perspective, the last decade has been a slow burn, because HPC has been around for a long, long time. We’ve been working on the fastest of the fast, and the best of the best algorithms available in whichever industry we were involved in, and things have just gotten faster and faster. The systems have increased in performance while physically shrinking, which allows clients to build out a little bit more. This has also tied in with other factors such as climate change and data centre developments. That is the landscape we are in today, with technology being developed to address the different application areas.

Lenovo has been working on something called GOAST, or the Genomics Optimization and Scalability Tool. Could you share more about what exactly GOAST is, and the background of the project?

In 2007, I think the first whole genome sequence of a single individual was completed, and that cost about $1 million. We’ve heard of Moore’s Law, the observation that the capacity of systems doubles roughly every 18 months. Now we are seeing genomic data outgrow Moore’s Law, with the volume of data doubling every 7 months, which is incredible.
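As a rough, purely illustrative piece of arithmetic based on the doubling periods quoted above (about 18 months for system capacity versus about 7 months for data volume), the snippet below shows how quickly the gap between compute and data opens up:

```python
# Illustrative arithmetic only: compare growth under the doubling periods
# quoted above (~18 months for hardware capacity, ~7 months for genomic data).
def growth_factor(years: float, doubling_period_months: float) -> float:
    """How many times the quantity multiplies over the given number of years."""
    return 2 ** (years * 12 / doubling_period_months)

for years in (1, 3, 5):
    hw = growth_factor(years, 18)   # hardware capacity
    data = growth_factor(years, 7)  # genomic data volume
    print(f"{years} yr: hardware x{hw:,.0f}, data x{data:,.0f}, gap x{data / hw:,.0f}")
```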

In addition to that, the sequencers being used in laboratories are increasing their throughput. We are able to sequence 60 whole human genomes, which is about 6 terabytes of data, in a 24-hour period. All of this is driving vast amounts of data, and questions about what we are going to do with it. This is something we were not able to do back in 2007. Jumping forward a decade to 2017, the compute workload to analyse a single genome still took anywhere from 40 to 150 hours to complete with the technology that was available.
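Using only the figures quoted here (roughly 60 genomes sequenced per 24 hours, and 40 to 150 hours of compute per genome), a back-of-the-envelope calculation shows how many analyses would have to run concurrently just to keep pace with a single sequencing facility; the snippet is illustrative arithmetic, not a capacity-planning tool.

```python
# Back-of-the-envelope arithmetic from the figures quoted above:
# a facility sequencing ~60 genomes per 24 hours, with each genome
# taking 40-150 hours of downstream compute to analyse.
GENOMES_PER_DAY = 60

def concurrent_analyses_needed(hours_per_genome: float) -> float:
    """How many analyses must run in parallel to keep up with the sequencers."""
    return GENOMES_PER_DAY * hours_per_genome / 24

for hours in (40, 150):
    print(f"{hours} h per genome -> ~{concurrent_analyses_needed(hours):.0f} concurrent runs")
```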

In 2017, Intel started a project with the Broad Institute, called the Broad Institute Stack, which was effectively the Genome Analysis Toolkit (GATK) code. They optimised the code to shorten the time needed to just 11 hours for a single genome. Some time ago, Lenovo saw this as a distinct opportunity: firstly, to be a valued member of this community. We talk about solving humanity’s greatest challenges, and this was one of those challenges that required us to look at the vast amount of data and see what was needed. So we optimised, tuned, and developed GOAST. It is still a workstream with ongoing optimisation and tuning, but when we first finished GOAST, we got the time from 11 hours down to just 1 hour. At the same time, the cost has also fallen from almost $1 million down to around $1,000 per genome. This is the magnitude of advancement we are seeing.
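For context on what the GATK workload being timed here looks like in practice, below is a heavily simplified sketch of driving one core step of a germline pipeline (variant calling with GATK's HaplotypeCaller) from Python. The file paths are placeholders, and the interview does not specify exactly which pipeline stages GOAST optimises; a full whole-genome run also includes alignment, duplicate marking, and other steps.

```python
# Rough sketch of invoking one core step of a GATK germline pipeline
# (variant calling with HaplotypeCaller). File names are placeholders;
# a production pipeline adds alignment, duplicate marking, recalibration, etc.
import subprocess

reference = "GRCh38.fasta"            # reference genome (placeholder path)
aligned_reads = "sample.bam"          # aligned, sorted, indexed reads (placeholder)
output_vcf = "sample.variants.vcf.gz"

cmd = [
    "gatk", "HaplotypeCaller",
    "-R", reference,
    "-I", aligned_reads,
    "-O", output_vcf,
]
subprocess.run(cmd, check=True)  # raises if the GATK step fails
```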

This was the reason we developed GOAST: to meet the requirement of generating insights from data in an incredibly short amount of time. In critical situations such as Ebola, or now COVID-19, you need to make decisions about drug production or virus sequencing quickly, for the benefit of humanity. That is where we came from, that is where we are today, and moving forward it is only getting faster.

I understand that GOAST was built using only off-the-shelf components, and that it is a CPU-based system rather than a GPU system. Can you share more about the impact that has on costs, and the thinking behind those decisions?

When you look at these applications, getting them onto FPGAs or GPUs requires a lot of coding, and you need very specific API call-outs. That adds a level of complexity for the user community, the development community, and the support community. We gave ourselves the absolute guideline that we would not go down that path. Bespoke or boutique-type solutions for wet labs or small laboratories can also be 50% to 80% more expensive than off-the-shelf components.

What we have today are unbelievably dense nodes that incorporate Intel processors, incredibly fast memory, high I/O bandwidth, and fast I/O subsystems. We take all of these, together with the open-source GATK code from the Broad Institute Stack, and modify, optimise, tune, and package it as a professional service that we can install for our clients on their premises. Speaking of the machine and installation, these are not massive machines. They are pretty small, about the size of a bar fridge, and they sit nicely in a 19-inch rack. The base machine is a 2U up to a 4U-type machine. That is what I like to call off-the-shelf and off-the-charts. It allows supportability, performance, and usability, and lets these capabilities be democratised for more users.

You mentioned installing the machines on premises. What are the advantages of using GOAST on premises?

The design premise for us was on-premises deployment. When we talk about the movement of data and latency, having data near where you process it is very important. Everyone knows about edge computing today, so I guess this is a “heavy edge” or “thick edge”. It is about keeping the data close to where you process it.

The second thing is regulation and data privacy. Our genetic data has the potential to be misused, so we have to adhere to data privacy laws that protect our sequences from any potential security-related issues. Over the last two years, the number of hacking incidents and data leaks has increased exponentially. So, our design consideration was to have it faster, better, stronger, and more secure than it would be in a cloud environment.

Finally, we are still in the time of COVID, and thankfully we are starting to open up. How has GOAST helped in research about the pandemic, and other strains or pandemics that may come in the future?

Today, our knowledge of the COVID virus is a lot better than it was 18 months ago. In itself, it is quite miraculous that we have this understanding of the virus and how it propagates and spreads, built up as people do the research and sequencing.

This is possible because of all the things I spoke about earlier, such as the sequencing systems and processing capabilities that we have today. In addition, the collaboration between pharmaceutical organisations and the various smaller wet labs was a key driver. I spoke about parallelisation of systems at the start, but now you have the parallelisation of the same problem across multiple organisations. Application suites like GOAST that allow you to sequence faster and shrink the time to insight have been incredibly important to the development of the mRNA vaccines. Another important process here is testing in silico, which lets you simulate the effects of the vaccine and its impact on the body by processing the data through GOAST.

This week, Pfizer announced their treatment pill, and these developments will continue. This is not the first pandemic we have had, and it will not be the last, but our progress over the past 18 months has been amazing, and we can sleep a bit better knowing that technology like this can deliver results and protect us whenever it happens again. I spoke about solving humanity’s greatest challenges, and this is a key way we are contributing to that cause.