**Large Language Models: Unlocking the Language of Biology**
David Baker, Demis Hassabis, and George Church are pioneers in the field of AI-driven protein design. Large language models like GPT-4 have gained immense popularity due to their exceptional natural language processing capabilities. However, the real potential of these models lies in their ability to understand the language of biology. Biology, as it turns out, can be decoded and programmed using digital systems, akin to the principles of modern computing. In this article, we explore how large language models can revolutionize the field of biology and enable us to design novel proteins.
**Biology as a Digital System**
Biology, at its core, can be considered an information processing system. DNA, the genetic material present in every living organism, encodes instructions using an alphabet of four nucleotide bases: A, C, G, and T. This system shares conceptual similarities with modern computing, which relies on binary code (0 and 1). Both are discrete, digital codes, so both can be stored, read, and manipulated computationally. Proteins, the building blocks of life, are likewise composed of a linear sequence of amino acids drawn from an alphabet of 20 standard residues. This one-dimensional sequence largely determines the three-dimensional structure and function of a protein. Again, this is a sequence-based system that lends itself well to the capabilities of language models.
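To make the digital analogy concrete, here is a minimal Python sketch. It packs a DNA string into an integer at two bits per base, then translates a short DNA fragment into amino acids. The bit assignment (A=00, C=01, G=10, T=11) is an arbitrary choice made for illustration, and the codon table is deliberately partial, covering only the five codons used in the example.

```python
# Illustrative sketch: DNA as a base-4 digital code.
# The bit assignment below is an arbitrary assumption, not a biological standard.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

# Partial excerpt of the standard genetic code ("*" marks a stop codon).
CODON_TABLE = {"ATG": "M", "GCT": "A", "AAA": "K", "TGG": "W", "TAA": "*"}


def encode_dna(seq: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def decode_dna(value: int, length: int) -> str:
    """Unpack `length` bases from an integer produced by encode_dna."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def translate(seq: str) -> str:
    """Read codons (3 bases each) and map them to amino acids until a stop."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(encode_dna("ACGT"))            # 27, i.e. 0b00011011
print(decode_dna(27, 4))             # ACGT
print(translate("ATGGCTAAATGGTAA"))  # MAKW
```

Four bases need two bits each, while three-base codons give 64 combinations that map onto 20 amino acids plus stop signals. That mapping is what lets a DNA sequence be read as a protein sequence.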
**Leveraging Large Language Models in the Life Sciences**
Large language models excel when provided with vast amounts of data, which lets them pick up statistical patterns and structure that are difficult for humans to spot or articulate. This learned structure can then be used to generate novel and sophisticated output. For instance, language models like ChatGPT have learned to hold conversations on a wide range of topics by training on large amounts of text from the internet. Similarly, text-to-image models like Midjourney have learned to create original imagery by training on billions of images. The true power of these models lies in applying them to biological data and allowing them to learn the language of life itself.
**The Role of Large Language Models in Protein Design**
In the near future, the most exciting application of large language models in the life sciences is the design of novel proteins. Proteins are vital to all life forms as they perform various essential functions, such as digestion, muscle contraction, and hormone production. Their versatility arises from their ability to adopt diverse structures and functions. Understanding protein folding, which refers to the way proteins acquire their three-dimensional shape, is critical for comprehending their function and the workings of life.
**The Protein Folding Problem and AI**
Determining a protein’s structure from its amino acid sequence alone has been an open challenge in biology for over half a century; it is known as the protein folding problem. In a groundbreaking result, AlphaFold, developed by DeepMind, predicted protein structures with unprecedented accuracy. Its deep learning architecture takes as input a multiple sequence alignment (MSA), a long-established bioinformatics technique that lines up evolutionarily related sequences so that conserved and co-varying positions become visible. While MSA is powerful, it is computationally intensive and of limited use for orphan proteins, which have few or no closely related sequences to align against.
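To illustrate what an MSA looks like and the kind of signal it exposes, the sketch below scores each column of a small alignment by how conserved it is. The alignment is hand-made and hypothetical, and this is not how AlphaFold consumes an MSA internally. Real pipelines first build the alignment with dedicated search tools, and that search is the computationally expensive step.

```python
from collections import Counter

# Hypothetical, hand-made alignment of related sequences; "-" marks a gap.
alignment = [
    "MKTAYIAK",
    "MKSAYLAK",
    "MRTAY-AK",
    "MKTAFIAK",
]


def column_conservation(aligned: list[str]) -> list[float]:
    """For each column, return the fraction of rows that carry the most
    common non-gap residue (1.0 means the position is fully conserved)."""
    n_rows = len(aligned)
    scores = []
    for column in zip(*aligned):
        residues = Counter(r for r in column if r != "-")
        top_count = residues.most_common(1)[0][1] if residues else 0
        scores.append(top_count / n_rows)
    return scores


for position, score in enumerate(column_conservation(alignment)):
    print(position, score)
```

Highly conserved columns hint at positions that matter for structure or function. An orphan protein would contribute only a single row here, which leaves no such signal to extract.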
**The Promise of Protein Language Models**
Protein language models, trained on protein sequences rather than English words, have emerged as a promising approach to the protein folding problem. These models have shown a striking ability to capture the complex patterns and relationships between protein sequence, structure, and function. Unlike AlphaFold, they can predict a structure from a single protein sequence, without first building a multiple sequence alignment. Models like ESM-2/ESMFold have been reported to approach AlphaFold's accuracy on many proteins while running significantly faster. Because they do not depend on finding related sequences, they are also a promising fit for orphan proteins.
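One common way to train a language model on sequences is masked-token prediction: hide some positions and ask the model to recover them from context. The sketch below shows only the data preparation step for a protein sequence, with each amino acid treated as one token. The 15% mask fraction, the `<mask>` symbol, and the function names are illustrative assumptions, and no actual model is trained here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "<mask>"                       # hypothetical mask symbol


def tokenize(seq: str) -> list[str]:
    """Treat each amino acid letter as one token."""
    return list(seq)


def mask_tokens(tokens: list[str], mask_fraction: float = 0.15, seed: int = 0):
    """Hide a random subset of tokens.

    Returns the masked token list and a dict {position: original residue}
    that a model would be trained to predict from the surrounding context.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


tokens = tokenize("MKTAYIAKQRQISFVK")
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

To fill in hidden residues well across many sequences, a model has to pick up on which residues tend to co-occur. That learned context is what lets a protein language model work from a single sequence.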
**Inventing New Proteins**
Large language models have the potential to not only predict protein structures but also invent entirely new proteins. Instead of reading a sequence and predicting its properties, a model can be run generatively: given desired properties, it proposes novel protein sequences that do not exist in nature. The existing repertoire of natural proteins is minuscule compared to the number of theoretically possible proteins. This presents an exciting opportunity to design proteins with specific functions and capabilities that could transform various fields, from medicine to materials science.
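The idea of generating new sequences from learned statistics can be shown with a deliberately tiny stand-in for a language model: a bigram (first-order Markov) model over amino acids. The training sequences are made up, and the technique is far simpler than the neural models discussed here. It also does not condition on desired properties, and nothing guarantees that its output folds or functions.

```python
import random
from collections import defaultdict

START, END = "^", "$"  # hypothetical start/end markers


def train_bigram(sequences: list[str]):
    """Count how often each residue follows each other residue."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        padded = START + seq + END
        for current, following in zip(padded, padded[1:]):
            counts[current][following] += 1
    return counts


def sample_sequence(counts, rng: random.Random, max_len: int = 50) -> str:
    """Sample a new sequence one residue at a time.

    Assumes the model was trained on at least one sequence.
    """
    out = []
    current = START
    while len(out) < max_len:
        options = counts[current]
        residues = list(options)
        weights = [options[r] for r in residues]
        current = rng.choices(residues, weights=weights, k=1)[0]
        if current == END:
            break
        out.append(current)
    return "".join(out)


# Made-up training sequences, for illustration only.
training = ["MKTAYIAK", "MKSAYLAK", "MRTAFIAK"]
model = train_bigram(training)
print(sample_sequence(model, random.Random(0)))
```

Even this toy can emit sequences that are not in its training set while still respecting the local patterns it saw. Large generative protein models apply the same sample-from-learned-statistics idea at far greater scale.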
Large language models have shown remarkable success in modeling the language of biology. By capturing the relationship between protein sequences and their structures, these models pave the way for new discoveries and advances in the field. The ability to design proteins with desired properties opens up a world of possibilities for solving complex problems and improving many aspects of our lives. With continued research and development, large language models are well placed to play a pivotal role in shaping the future of biology and medicine.