A Gordon Bell Special Prize finalist for high-performance computing-based COVID-19 research has taught large language models (LLMs) a new vocabulary – gene sequences – that can unlock insights into genomics, epidemiology and protein engineering.
Published in October, this groundbreaking work is a collaboration of more than two dozen academic and commercial researchers from Argonne National Laboratory, NVIDIA, the University of Chicago and others.
The research team built an LLM to track genetic mutations and predict variants of concern in SARS-CoV-2, the virus that causes COVID-19. While most LLMs applied to biology to date have been trained on datasets of small molecules or proteins, this project is one of the first models trained on raw nucleotide sequences – the smallest units of DNA and RNA.
“We hypothesized that moving from protein-level data to gene-level data could help us build better models to understand COVID variants,” said Arvind Ramanathan, a computational biologist at Argonne, who has led the project. “By training our model to track the entire genome and all the changes that appear in its evolution, we can make better predictions not just about COVID, but about any disease with enough genomic data.”
The Gordon Bell Prizes, considered the Nobel Prize for high-performance computing, will be presented at this week’s SC22 conference by the Association for Computing Machinery, which represents around 100,000 computing experts worldwide. Since 2020, the group has awarded a special prize for outstanding research that advances the understanding of COVID with HPC.
LLM training on a four-letter language
LLMs have long been trained on human languages, which typically include a few dozen letters that can be organized into tens of thousands of words and joined together into longer sentences and paragraphs. The language of biology, on the other hand, has only four letters representing nucleotides – A, T, G and C in DNA, or A, U, G and C in RNA – arranged in different sequences as genes.
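One way to picture how a four-letter alphabet becomes model-readable "words" is to group nucleotides into short tokens such as codons (triplets). The sketch below is illustrative only – the function name and the simple non-overlapping 3-mer scheme are assumptions, not the team's exact tokenizer:

```python
def tokenize_codons(seq: str, k: int = 3) -> list[str]:
    """Split a nucleotide string into non-overlapping k-mers (codons for k=3).

    Simplifying assumption: a trailing fragment shorter than k is dropped.
    Real tokenizers also handle padding and ambiguous bases like 'N'.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

# A short DNA snippet becomes a "sentence" of codon tokens:
tokens = tokenize_codons("ATGGCGTAA")
print(tokens)  # ['ATG', 'GCG', 'TAA']
```

Grouping letters this way gives the model a vocabulary of 4^3 = 64 possible codon tokens rather than just four characters.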
While fewer letters may seem like a simpler challenge for AI, language models for biology are actually much more complicated. Indeed, the genome – made up of more than 3 billion nucleotides in humans and around 30,000 nucleotides in coronaviruses – is difficult to break down into distinct and meaningful units.
“When it comes to understanding the code of life, a major challenge is that the sequencing information in the genome is quite large,” Ramanathan said. “The meaning of a sequence of nucleotides can be affected by another sequence much further away than the next sentence or paragraph would be in a human text. This could exceed the equivalent of chapters in a book.”
NVIDIA collaborators on the project devised a hierarchical diffusion method that allowed the LLM to process long strings of about 1,500 nucleotides as if they were sentences.
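Treating ~1,500-nucleotide stretches as sentences amounts to windowing the genome into fixed-size chunks before modeling. A minimal sketch of that idea – the window size matches the figure quoted above, but the non-overlapping stride and function name are illustrative assumptions, not the paper's hierarchical scheme:

```python
def window_genome(genome: str, size: int = 1500, stride: int = 1500) -> list[str]:
    """Cut a long nucleotide string into sentence-like windows.

    stride == size yields non-overlapping windows; a smaller stride would
    overlap them. The final window may be shorter than `size`.
    """
    return [genome[i:i + size] for i in range(0, len(genome), stride)]

# A ~30,000-nucleotide coronavirus-length genome yields 20 windows of 1,500:
genome = "ACGT" * 7500  # 30,000 nucleotides (synthetic placeholder sequence)
windows = window_genome(genome)
print(len(windows), len(windows[0]))  # 20 1500
```

The hierarchical part of the published method then models relationships across these windows, so long-range effects between distant regions of the genome are not lost at the chunk boundaries.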
“Standard language models struggle to generate long, coherent sequences and to learn the underlying distribution of different variants,” said paper co-author Anima Anandkumar, senior director of AI research at NVIDIA and Bren Professor in the Computing + Mathematical Sciences Department at Caltech. “We developed a diffusion model that operates at a higher level of detail, which allows us to generate realistic variants and capture better statistics.”
Predicting COVID Variants of Concern
Using open-source data from the Bacterial and Viral Bioinformatics Resource Center, the team first pretrained their LLM on more than 110 million gene sequences from prokaryotes, which are single-celled organisms like bacteria. They then fine-tuned the model on 1.5 million high-quality genomic sequences of the COVID virus.
By pretraining on a larger dataset, the researchers also ensured that their model could generalize to other prediction tasks in future projects, making it one of the first whole-genome-scale models with this capability.
Once fine-tuned on the COVID data, the LLM was able to distinguish between the genomic sequences of the virus variants. It was also able to generate its own nucleotide sequences, predicting potential mutations in the COVID genome that could help scientists anticipate future variants of concern.
“Most researchers have been tracking mutations in the COVID virus spike protein, particularly the domain that binds to human cells,” Ramanathan said. “But there are other proteins in the viral genome that undergo frequent mutations that are important to understand.”
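Tracking mutations across the whole genome, rather than only the spike region, can be illustrated with a position-by-position comparison of two aligned sequences. This is a simplified sketch – real pipelines first align the sequences and handle insertions and deletions, which the equal-length assumption below sidesteps:

```python
def point_mutations(reference: str, variant: str) -> list[tuple[int, str, str]]:
    """List (position, reference_base, variant_base) for each substitution.

    Assumes the two sequences are already aligned and of equal length;
    indels would require a real alignment step, not shown here.
    """
    if len(reference) != len(variant):
        raise ValueError("sequences must be aligned to the same length")
    return [(i, r, v)
            for i, (r, v) in enumerate(zip(reference, variant))
            if r != v]

# Two substitutions between a reference snippet and a variant snippet:
muts = point_mutations("ATGGCGTAA", "ATGACGTGA")
print(muts)  # [(3, 'G', 'A'), (7, 'A', 'G')]
```

Applied genome-wide, this kind of comparison surfaces mutations in every viral protein, not just the spike's receptor-binding domain.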
The model could also integrate with popular protein structure prediction models like AlphaFold and OpenFold, the paper says, helping researchers simulate viral structure and study the impact of genetic mutations on the ability of a virus to infect its host. OpenFold is one of the pre-trained language models included in the NVIDIA BioNeMo LLM service for developers applying LLMs to computational biology and chemistry applications.
Boosting AI training with GPU-accelerated supercomputers
The team developed their AI models on supercomputers powered by NVIDIA A100 Tensor Core GPUs, including Argonne’s Polaris, the U.S. Department of Energy’s Perlmutter, and NVIDIA’s in-house Selene system. Scaling up to these powerful systems, they achieved performance of over 1,500 exaflops in training runs, creating the largest biological language models to date.
“We are working with models today that have up to 25 billion parameters, and we expect that to grow significantly in the future,” Ramanathan said. “The size of the model, the length of the genetic sequences and the amount of training data needed mean that we really need the computational power provided by supercomputers with thousands of GPUs.”
The researchers estimate that training a version of their model with 2.5 billion parameters took more than a month on about 4,000 GPUs. The team, which was already investigating LLMs for biology, spent about four months on the project before releasing the paper and code to the public. The GitHub page includes instructions for other researchers to run the model on Polaris and Perlmutter.
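The scale of that run can be put into rough GPU-hour terms using only the figures quoted above; treating "more than a month" as 30 days gives a lower-bound estimate:

```python
# Back-of-the-envelope estimate from the article's own figures.
gpus = 4000          # "about 4,000 GPUs"
days = 30            # "more than a month" -- 30 days as a lower bound
hours_per_day = 24

gpu_hours = gpus * days * hours_per_day
print(f"{gpu_hours:,} GPU-hours")  # 2,880,000 GPU-hours
```

That is nearly three million A100 GPU-hours for a single 2.5-billion-parameter training run, before counting the larger 25-billion-parameter configuration.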
The NVIDIA BioNeMo framework, available in early access on the NVIDIA NGC hub for GPU-optimized software, helps researchers scale large biomolecular language models across multiple GPUs. Part of the NVIDIA Clara Discovery collection of drug discovery tools, the framework will support chemistry, protein, DNA, and RNA data formats.
Join NVIDIA at SC22 and watch a replay of the special address.
The image at the top depicts the COVID strains sequenced by the researchers’ LLM. Each dot is color coded by COVID variant. Image courtesy of Bharat Kale, Max Zvyagin, and Michael E. Papka of Argonne National Laboratory.