How Natural Language Processing AI Is Detecting COVID-19 Variants

February 1, 2021 - 8 minutes read

Natural language processing (NLP) is a branch of artificial intelligence (AI) used to analyze words, phrases, tone, and underlying meaning. But that’s not all NLP is being used for. Recently, computational biology researchers from Cambridge-based MIT started applying NLP algorithms to predict virus mutations and generate protein sequences, allowing them to detect new COVID-19 variants before they become widespread.

You might be wondering how a technology that’s set up to understand linguistics is able to understand genetics. It turns out that many characteristics of biology can be interpreted similarly to words and sentences. According to Bonnie Berger, a computational biologist at MIT and an author of the research, “We’re learning the language of evolution.”

Finding Hidden Viruses

Berger and her research team aren’t the first to train their NLP algorithms on protein sequences and genetic codes. Researchers from Salesforce and teams led by geneticist George Church are also chipping away at this interesting idea. Berger’s team takes it a bit further by using NLP to predict virus mutations that antibodies would fail to detect, known as viral immune escape.

At the foundation of Berger’s research is an analogy: the immune system interprets a virus much as humans interpret a sentence. The team uses grammar and semantics (that is, meaning) to describe the genetic and evolutionary fitness of a virus. An infectious, successful virus is grammatically correct, whereas an unsuccessful virus is grammatically incorrect. Likewise, viruses with different mutations have different semantics, so a virus whose meaning has diverged might need different antibodies to find it.

The research team used an LSTM (long short-term memory network), a type of recurrent neural network, which can be trained on less data than many other NLP models while maintaining strong performance. Generally, NLP works by transforming words into mathematical representations (vectors of numbers), so that words with similar meanings map to similar representations. Such a representation is called an embedding. For viruses, the NLP algorithms grouped sequences by the similarity of their embeddings, which makes it easier for researchers to see how closely related different mutations are.
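To make the idea of an embedding concrete, here is a minimal Python sketch. It is not the MIT team's code: the three-number vectors and the names variant_a, variant_b, and variant_c are made up for illustration, and cosine similarity is assumed as the closeness measure because it is a common choice for comparing embeddings.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional embeddings, invented for this example.
# A trained model would produce much longer vectors.
embeddings = {
    "variant_a": [0.9, 0.1, 0.0],
    "variant_b": [0.8, 0.2, 0.1],
    "variant_c": [0.0, 0.1, 0.9],
}

def nearest(name, table):
    """Return the other entry whose embedding is most similar to `name`."""
    return max(
        (other for other in table if other != name),
        key=lambda other: cosine_similarity(table[name], table[other]),
    )

closest_to_a = nearest("variant_a", embeddings)
```

In this toy table, variant_a and variant_b point in nearly the same direction, so they would land in the same group, while variant_c sits far away from both.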

The Fine Print

The LSTM-based NLP model was trained on thousands of genetic sequences from three different viruses. There were 60,000 unique sequences for a strain of HIV, 45,000 for a strain of influenza, and between 3,000 and 4,000 sequences for a strain of COVID-19. Brian Hie, a graduate student at MIT who worked on building the models, explains, “There’s less data for the coronavirus because there’s been less surveillance.”

The main goal of the research is to find mutations that an immune system might overlook, which means looking for a changed meaning that is still grammatically correct. The system looks for sequences with similar grammatical structure but very different meaning, and it flags for review the mutations whose meanings have changed the most. To test this approach, the team used a standard machine learning metric that scores predictions between 0.5 (no better than chance) and 1 (perfect).
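The flagging step can be sketched in a few lines of Python. This is only an illustration of the idea described above, not the researchers' actual method: the Candidate class, the scores, the 0.5 cutoff, and the function name flag_for_review are all hypothetical, and the way the team really combines grammaticality with semantic change may differ.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    mutation: str           # hypothetical label for a mutated sequence
    grammaticality: float   # assumed score: higher = more plausible as a viable virus
    semantic_change: float  # assumed score: higher = meaning shifted further

def flag_for_review(candidates, min_grammaticality=0.5, top_k=2):
    """Keep grammatically plausible candidates, then return the top_k
    whose meaning has changed the most."""
    viable = [c for c in candidates if c.grammaticality >= min_grammaticality]
    viable.sort(key=lambda c: c.semantic_change, reverse=True)
    return viable[:top_k]

# Made-up scores, for illustration only.
candidates = [
    Candidate("mut_1", grammaticality=0.9, semantic_change=0.1),
    Candidate("mut_2", grammaticality=0.8, semantic_change=0.7),
    Candidate("mut_3", grammaticality=0.2, semantic_change=0.9),
    Candidate("mut_4", grammaticality=0.7, semantic_change=0.4),
]

flagged = flag_for_review(candidates)
flagged_names = [c.mutation for c in flagged]
```

Here mut_3 has the largest change in meaning but is dropped because it is not grammatical, which mirrors the point that an escape mutation still has to produce a working virus.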

The team took the top mutations detected by the NLP algorithm and checked them against real viruses in a lab to see how many were escape mutations. The accuracy scores ranged from 0.69 for HIV to 0.85 for coronavirus. According to the team, these results are better than results achieved by state-of-the-art algorithms.
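The article does not name the scoring method. A scale that runs from 0.5 (chance) to 1 (perfect) matches the area under the ROC curve (AUC), so the sketch below assumes that is the kind of score meant. The function and the sample numbers are illustrative only and are not the team's data.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs where the positive example gets the
    higher score. Ties count as half a win."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up example: model scores for four mutations, where label 1 means
# the lab confirmed an escape mutation and 0 means it did not.
example_scores = [0.9, 0.4, 0.6, 0.2]
example_labels = [1, 1, 0, 0]
example_auc = auc(example_scores, example_labels)
```

In this example, three of the four positive-negative pairs are ordered correctly, giving 0.75. Libraries such as scikit-learn offer an equivalent function (roc_auc_score) for larger datasets.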

Thinking Ahead

For now, the research is more about pushing the limits of technology and applying it in a new way to biology. But in the future, the research has the potential to make an impact in public health. Knowing which mutations are newest and how severe they are can make it easier for public health officials and hospitals to plan ahead for the upcoming flu season, giving them a sense of how well the public’s immune systems are likely to cope this year.

The team ran their models on the new COVID-19 variants out of the UK, Denmark, South Africa, Singapore, and Malaysia. They found that there is a high potential for immune escape in all of the variants. However, the model missed a change in the South African variant, a miss that may have helped the team catch a loophole in their algorithm. Berger says she thinks the problem is that “it consists of multiple mutations, and we believe a combinatorial effect is coming into play.”

Even though the accuracy and the efficacy of the NLP algorithm could use some improvement, it’s a great start, and it markedly accelerates the traditionally slow process of sequencing viral genomes and studying any mutations in a lab. Whereas the lab work can take weeks, the NLP algorithm takes only a few hours. According to Bryan Bryson, a biologist at MIT who also worked on the research, “It’s wild to be simultaneously updating your model and then running to the lab to test it in experiments. This is the very best of computational biology.”

Bryson says this is just the beginning of how we apply NLP to biology, adding that analogies can go a long way in this field. As an example, Hie says that the researchers’ approach can be applied to the study of drug resistance. A bacterial protein that’s resistant to an antibiotic or a cancer protein that has become resistant to chemotherapy can be studied with NLP. Hie says that in biology, “there’s a lot of creative ways we can start interpreting language models.”

The Future of Biology

As we find more and more areas of biology where emerging technologies like AI, NLP, and machine learning can be applied, we will be able to put decades of data to work for us. Analysis time will be cut down drastically, allowing us to spend more time preparing, understanding, and making connections. The research from MIT has opened up a new way of thinking about analogies between language and biology, which is invaluable in itself.

And Bryson, Berger, and Hie are confident that biology won’t just take from NLP, it’ll give back: NLP algorithms may one day be inspired by biological concepts and systems.