Dr. Gareth Sessel M.D., M.S.


AI and the language of drug discovery

Traditional innovation in biology and the life sciences follows paths such as "DNA → RNA → protein" and "disease → target → drug". Yet it's widely reported that developing and launching a typical drug costs around $2 billion, takes over a decade, and succeeds only about 10% of the time. This highlights a critical need for fresh approaches, particularly through AI adoption.

Amidst AI's growing buzz, its applications in healthcare and life sciences are showing real promise. Eric Topol, a leading authority in this space, explains that “not long ago, scientists might spend 2 or 3 years to define the 3-dimensional structure of a protein, … Now that can be done for nearly all proteins in a matter of minutes, thanks to advances in AI. Even new proteins, not existing in nature, never previously conceived, can now be created.” This leap in capability illustrates AI's potential to accelerate and innovate in protein research, a foundational aspect of drug discovery.

Many industry leaders are taking note, with Sanofi's CEO Paul Hudson announcing a decisive "All-In on AI" approach, signaling a shift towards more AI-driven methodologies. The momentum is further bolstered by notable partnerships that showcase Big Tech's significant foray into AI-enhanced drug discovery.


Some recent examples demonstrating industry momentum include:

  • VantAI and Bristol Myers Squibb embarked on a strategic collaboration in February 2024, focusing on discovering new molecular glues.

  • Almirall partnered with Microsoft in February 2024 to innovate in medical dermatology using AI and digital technologies.

  • NVIDIA revealed strategic collaborations with Amgen subsidiary deCODE genetics and with Genentech in January 2024 and November 2023, respectively, to leverage its AI platforms in genomics and drug discovery.

  • Isomorphic Labs secured partnerships with Novartis and Eli Lilly in January 2024, aiming for novel chemistry solutions to undruggable targets, in deals potentially worth $3 billion.

  • Iambic Therapeutics closed a $100M Series B financing in October 2023 to advance its AI-discovered therapeutics into clinical development, and entered a collaboration with NVIDIA.

  • PostEra announced a collaboration with Amgen in November 2023 to utilize its AI platform for small molecule program advancements.

  • Genesis Therapeutics received a $200M investment co-led by Andreessen Horowitz in August 2023 to advance AI-based drug candidates.

  • Recursion revealed a collaboration with (and a $50M investment from) NVIDIA in July 2023, following its acquisitions of Cyclica and Valence in May 2023 to expand its generative AI capabilities.

  • XtalPi announced a partnership with Merck and secured a $250M deal with Eli Lilly in April and May 2023, respectively.

  • Mitsui and NVIDIA unveiled “Tokyo-1” in March 2023, an NVIDIA DGX supercomputer, to accelerate drug development and cut costs for Japan's leading pharma companies.

All these alliances, coupled with the FDA's increasing engagement with AI in regulatory processes, mark a pivotal industry shift. A standout example of AI's real-world impact is the collaboration between XtalPi and Pfizer, which helped accelerate the launch of Paxlovid for COVID-19 treatment. By combining their AI technologies, the two companies streamlined the drug's development process, showcasing the efficiency and effectiveness of AI in bringing critical treatments to market.


The application of generative AI spans from understanding disease processes and designing novel proteins to streamlining clinical trials, with AI-generated drug candidates already entering clinical trials. Echoing this sentiment, David Reese, CTO at Amgen, refers to the current phase as a "hinge moment" in drug discovery, indicating a new era of rapid and innovative drug development. This post will explain how generative AI is becoming a standard tool in drug discovery, transforming drug development processes and producing tangible patient benefits.


Background – the need for innovation in the pharma industry

Despite the commonly cited $2 billion figure, estimating the average cost to develop and launch a new drug is complex, and published estimates range from $300 million to a staggering $4.5 billion. The drug development journey from initial discovery to market approval comprises multiple costly stages, including early research, preclinical tests (both in-vitro and in-vivo), and several clinical trial phases with human participants. Overall costs are shaped by several key factors:

  • Trial phases: Success rates and durations of clinical trials play a crucial role. Each phase, from early research to human trials, contributes to the overall cost based on its length and outcome.

  • Capitalization: This reflects the need to finance the investment over several years before any revenue is generated, and accounts for the cost of capital.

  • Failed drug candidates: Expenses from failed drugs add to the total, as the investment in R&D covers both successful and unsuccessful attempts.

  • Drug specifics: The nature of the drug itself, including its class, its therapeutic area, and whether it was developed in-house or acquired, influences the cost.

  • Company attributes: Characteristics, such as size, resources, and expertise, also play a part.

  • Regulatory factors: The regulatory pathway, whether standard or accelerated, further complicates the economic analysis.


A sample of the median and mean R&D costs for drug development, represented in 2018 US dollars and stratified by therapeutic area, is shown in Figure 1.

Figure 1: Mean and median costs for new drug discovery, for FDA-approved drugs between 2009 and 2018, stratified by therapeutic area and reported in 2018 US dollars. Taken directly from Wouters et al., "Estimated Research and Development Investment Needed to Bring a New Medicine to Market, 2009-2018," JAMA, March 2020.

These static figures, as substantial as they are, lead us to a broader, dynamic trend that further complicates the picture. This is where Eroom's Law comes into play, which, in a twist, flips the script on the tech industry's Moore's Law ("Eroom" is "Moore" spelled backwards). Eroom's Law observes that the cost of drug development tends to double roughly every nine years. This trend is evidenced by the decreasing number of new drugs approved per billion dollars spent: despite pouring more money into research, the process is becoming less efficient and more expensive. This escalating cost raises alarms about the long-term viability of the pharmaceutical industry's business model, emphasizing the urgent need for innovative methods to boost output and lower costs.

 

AI to the rescue: pharma's new frontier

I previously wrote about and explained the terms AI, machine learning (ML), neural networks, and deep learning (in the context of their applications in primary care). To learn more about these technologies and techniques, refer to my earlier blog post here. However, since that post was published in August 2022, generative AI has emerged as a mature and disruptive technology – one that I did not discuss at the time. It is therefore useful to introduce the following definitions.

  • Transformer models: an advanced type of neural network architecture capable of processing data elements in parallel, rather than sequentially. Their key feature, "attention," allows them to focus on and weigh the importance of different parts of the input data, enhancing their ability to grasp context and relationships within text (a minimal code sketch of attention follows these definitions). This makes transformers exceptionally good at handling complex language tasks.

  • Generative AI: a class of AI systems designed to generate novel content, such as text, images, music, or synthetic data, that is similar but not identical to the data they were trained on.

  • Large language models (LLMs): a subset of generative AI, built using transformer models and specifically focused on understanding and generating language. They are trained on very large amounts of data to predict the next ‘word’ in a sequence, enabling them to generate a coherent and contextually relevant output based on the input they receive. These models are specialized in linguistic tasks, and operate based on the statistical properties of language learned during their training.
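To make the "attention" idea above concrete, here is a minimal sketch of scaled dot-product attention, the core operation inside transformer models, implemented with plain numpy on made-up toy data (a single attention head, with no learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Compare every query against every key; the resulting weights
    # decide how much each value contributes to each output.
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted mix of values

# Toy example: a "sentence" of 3 tokens, each a 4-dimensional vector.
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
print(scaled_dot_product_attention(Q, K, V))  # one context-aware vector per token
```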

 

Biochemistry in conversation: LLMs speak the language of drug discovery

It may be surprising that a technology designed to understand and generate human language can assist with drug design. However, we are finding that inherent linguistic properties exist throughout the universe: mathematics is the language of physics, defining the fundamental laws that govern the cosmos; chemistry is the language of biology, translating atomic and molecular interactions into the fabric of life; biology, in turn, is the language of ecology, narrating the interdependent web of living organisms and their environments.

By conceptualizing disciplines like mathematics, physics, and chemistry as distinct languages, we unlock a transformative perspective: these are not merely tools for describing phenomena but intricate languages with their own syntax, grammar, and vocabulary that define the universe's underlying principles. In this light, LLMs become more than just text generators; they emerge as universal translators capable of deciphering and synthesizing these languages. This capability enables them to navigate the complex interplay of forces, reactions, and interactions that underpin our reality. When applied to drug discovery, LLMs leverage their deep understanding of these universal languages to predict how molecular structures can interact with biological systems in novel ways.

The potential of LLMs therefore extends significantly beyond just generating human language. When applied to fields like drug discovery, generative AI can yield remarkable innovations, potentially uncovering new medications at a pace and with a level of creativity that could surpass traditional, human-led efforts. This suggests that the power of generative AI might not only expedite scientific discovery but also venture into realms of innovation previously considered beyond the reach of human ingenuity.

 

Breaking down the steps to drug discovery

For AI-assisted drug discovery, the journey from initial concept to potential treatment is streamlined into five key steps. These steps specifically focus on how AI technologies assist in the early stages of drug discovery. They guide us from understanding the biological underpinnings of diseases to designing and refining drug candidates. It's important to note that these steps do not include subsequent phases like drug trials or approvals, where AI also holds transformative potential (this could be the subject of a future blog post). The following five steps will demonstrate how AI is reshaping the landscape of drug discovery, making the process faster, more efficient, and increasingly precise:

 

  1. Target identification: Recognizing molecules and proteins that are critical for the progression of a disease and that can be targeted by a drug to slow down or halt the disease process without impacting normal, healthy functions.

  2. Target characterization: Developing a comprehensive model of the biological target to understand its structure and function.

  3. Hit discovery and refinement: Identifying initial compound "hits" that interact with the target, followed by the hit-to-lead process to select and refine promising candidates.

  4. Lead optimization: Fine-tuning the chemical structure of lead compounds, enhancing their pharmacodynamic and pharmacokinetic properties.

  5. Synthesis and testing: Synthesizing the optimized molecules and subjecting them to laboratory experiments to validate their effectiveness and safety, and iterating based on real-world data (Lab-in-the-Loop).

 

Step 1: Target identification

To understand how biological functions and diseases arise from molecular interactions, it's important to combine data from various "omics" fields. This integrated approach, known as multi-omics, studies multiple classes of biological molecules together to gain insights into the structure, function, and dynamics of organisms, aiming to offer a complete and systematic view of biological systems and disease processes. The main omics fields include:

  • Genomics: Studies an organism's entire set of DNA, including all of its genes, to understand genetic influences on health and disease.

  • Transcriptomics: Examines the complete set of RNA transcripts produced by the genome under specific circumstances or in a specific cell, providing insights into gene expression.

  • Proteomics: Focuses on the entire set of proteins produced or modified by an organism, which helps in understanding the functions and structures of proteins in biological processes.

  • Metabolomics: Inspects metabolites, which are the end products of cellular processes within cells, tissues, or organisms, and can reflect the organism's physiological state.

  • Epigenomics: Observes the epigenetic modifications on the genetic material of a cell, which can influence gene expression without altering the DNA sequence itself.

  • Lipidomics: Analyzes lipids within a cell or organism, which is crucial for understanding cellular membranes, energy storage, and signaling molecules.

  • Glycomics: Researches carbohydrates and glycoconjugates in biological molecules and organisms, important for understanding cell-cell communication and pathogen-host interactions.

 

These techniques can pinpoint the variations in gene expression, protein levels, metabolite presence, and epigenetic changes between healthy and diseased samples, or between those that do and do not respond to a specific medication. This holistic view allows researchers to construct detailed models of disease processes, revealing the complex interplay between various molecular components. By analyzing these interactions, scientists can identify critical "rate-limiting steps" in the disease pathway. These are key junctures where disease progression can be most effectively halted with minimal impact on normal physiological functions. By identifying these pivotal points (i.e. the molecular targets), multi-omics paves the way for the development of drugs designed to modulate them, thereby creating new, targeted therapies that disrupt the disease process while avoiding collateral damage and unintended side-effects.
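As a toy illustration of the kind of comparison that surfaces candidate targets, the sketch below ranks genes by the fold-change in their average expression between diseased and healthy samples. The gene names and expression values are entirely made up for illustration; real multi-omics analyses integrate many more data types and proper statistics:

```python
import numpy as np
import pandas as pd

genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]

# Hypothetical expression matrices: rows = genes, columns = samples.
healthy = pd.DataFrame(np.random.default_rng(1).lognormal(2.0, 0.3, (4, 5)), index=genes)
diseased = pd.DataFrame(np.random.default_rng(2).lognormal(2.0, 0.3, (4, 5)), index=genes)

# log2 fold-change of mean expression: a crude first filter for genes
# whose activity differs between disease and health.
log2_fc = np.log2(diseased.mean(axis=1) / healthy.mean(axis=1))

# Rank genes by the magnitude of the change; the top entries become
# candidate targets for closer biological scrutiny.
print(log2_fc.abs().sort_values(ascending=False))
```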

By bringing together diverse omics data, multi-omics can uncover new insights into diseases, identify potential biomarkers and therapeutic targets, and help tailor treatments to individuals, potentially transforming pharmaceutical sciences and leading to the creation of new and effective treatments.

 

Step 2: Target characterization

Let’s suppose that you want to create a new key for a specific lock. You need to understand the precise 3D structure of the lock in question, so that the key you create will fit appropriately. In the biological equivalent, it is essential to understand the functional aspects of the biological target, including its precise shape, charge distribution, and reactive sites. This knowledge allows researchers to identify "druggable" sites, i.e. regions of the target where a new "key" (i.e. a small molecule or ligand) could bind and modulate the protein's function. Determining a protein's 3D structure from its sequence is the famous "protein folding problem", a longstanding hurdle for generations of scientists. Today, the target protein's 3D structure can be modeled using computational techniques or advanced machine learning models.

Protein folding modeling is a computational approach aimed at predicting the three-dimensional structure of proteins from their amino acid sequences. This field addresses one of the most pivotal challenges in biochemistry and molecular biology, given that a protein's function is intricately linked to its structure. Advances in this area have profound implications for drug discovery, understanding diseases, and engineering novel proteins.

 

Multiple Sequence Alignment (MSA)

MSA is a foundational technique in protein folding modeling, in which the amino acid sequences of multiple evolutionarily related proteins are aligned to identify conserved regions. These alignments highlight similarities and differences among the sequences, providing crucial insights into the protein's structure and function. MSAs help identify functionally important sites and can indicate the presence of certain structural motifs, aiding the prediction of the protein's three-dimensional conformation.
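As a small example of how an MSA is used in practice, the sketch below reads an alignment with Biopython (assuming a Clustal-format file named family.aln, a hypothetical input) and flags highly conserved columns, which often correspond to functionally important sites:

```python
from collections import Counter
from Bio import AlignIO  # pip install biopython

# Assumed input: a Clustal-format alignment of evolutionarily related proteins.
alignment = AlignIO.read("family.aln", "clustal")

for i in range(alignment.get_alignment_length()):
    column = alignment[:, i]  # the residues of every sequence at position i
    residue, count = Counter(column).most_common(1)[0]
    conservation = count / len(column)
    # Columns where nearly all sequences agree (and are not gaps) are
    # candidates for functionally or structurally important sites.
    if residue != "-" and conservation >= 0.9:
        print(f"position {i}: '{residue}' conserved in {conservation:.0%} of sequences")
```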

 

Large Language Models (LLMs) for Protein Folding

LLMs have been adapted to interpret the "language" of amino acids in protein sequences. These models learn from vast databases of known protein structures and sequences, capturing the complex patterns and rules that govern protein folding. By treating proteins as strings of letters (similar to words in a sentence), LLMs can predict how a linear sequence of amino acids will fold into a 3D structure, considering the biological and physical constraints of protein folding.

Noteworthy models that use MSA and LLM techniques for protein folding modeling include:

  • AlphaFold and AlphaFold2:

    You've likely come across the names "DeepMind" and "AlphaFold" and wondered why they resonate so strongly across various circles, far beyond just the scientific community. You may have read about the landmark 2016 victory of DeepMind's AlphaGo program over the world champion Go player, Lee Sedol. In 2018, DeepMind (owned by Google's parent company, Alphabet) made news again when its AlphaFold tool placed first in the CASP competition, which assesses computational methods on their ability to predict novel protein folds.

    AlphaFold represented a breakthrough in protein folding modeling, especially with its second iteration, AlphaFold2. AlphaFold2 employs LLM techniques to solve the problem of protein folding by predicting the structure directly from the amino acid sequence. It combines this with innovative strategies for interpreting MSAs to understand evolutionary relationships and variations, resulting in highly accurate predictions of protein structures. AlphaFold2's performance in the CASP14 competition marked a significant milestone, achieving accuracy comparable to experimental methods like X-ray crystallography and cryo-electron microscopy for a wide range of proteins.

  • OpenFold:

OpenFold is an open-source implementation and extension of the AlphaFold2 methodology. It aims to democratize access to advanced protein structure prediction technologies, allowing researchers worldwide to contribute to and benefit from developments in this field. OpenFold maintains compatibility with AlphaFold2's model parameters and incorporates improvements and optimizations that enhance its performance and utility in various research contexts.

  • ESMFold:

ESMFold, developed by researchers from Meta AI and from academia, is another state-of-the-art model for protein structure prediction. It is part of the Evolutionary Scale Modeling (ESM) series of protein language models, which learn the evolutionary context of proteins from vast numbers of sequences. Unlike AlphaFold2, ESMFold does not require an MSA at prediction time: it predicts a structure directly from a single sequence using a highly optimized transformer-based architecture, which is what allows it to scale so efficiently (a minimal usage sketch follows below).
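As promised above, here is a minimal ESMFold usage sketch following the publicly documented fair-esm API; the sequence is just an arbitrary example, and in practice a GPU is strongly recommended:

```python
import torch
import esm  # pip install fair-esm (ESMFold needs its additional folding dependencies)

# Load the pretrained ESMFold model, as documented in the ESM repository.
model = esm.pretrained.esmfold_v1()
model = model.eval()  # append .cuda() here if a GPU is available

# A single amino acid sequence is the only input: no MSA is needed.
sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"

with torch.no_grad():
    pdb_string = model.infer_pdb(sequence)  # predicted 3D structure as PDB text

with open("prediction.pdb", "w") as f:
    f.write(pdb_string)
```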

 

Together, these advancements in protein folding modeling are revolutionizing our understanding of proteins and opening new frontiers in biology, medicine, and biotechnology.

 

Step 3: Hit discovery and refinement

The hit-to-lead process in drug discovery bridges the initial screening of chemical compounds (hits) and the development of those compounds into viable drug candidates (leads). The journey begins with the identification of "hits," which are compounds that show initial promise in binding to and altering the behavior of a biological target.

Traditionally, hit discovery has utilized high-throughput screening (HTS). HTS is an experimental approach that involves physically testing each compound within a large library, to identify those that induce a desired response. HTS uses automation, robotics, liquid handling systems, and sensitive detectors to quickly conduct millions of chemical, genetic, or pharmacological tests. The main drawbacks of HTS include the high cost of setting up and running HTS facilities, the need for large amounts of materials, and the high rate of false-positive and false-negative results, which require further validation.

Virtual screening is a cost-effective and rapid method to evaluate extensive libraries of existing and novel ligands against biological targets. In contrast to HTS, this process uses AI to identify promising chemical compounds as potential hits, all within a virtual environment. Virtual screening can prioritize compounds for further testing, saving time and resources in the drug discovery process. Additionally, generative models can learn the patterns of the molecules they have encountered, and then use those learnings to suggest new molecules (similar but not identical to those they studied). "Docking" algorithms then simulate how each molecule fits into the binding site of the target protein. This process evaluates the interaction between the compound and the target, providing insights into the binding affinity, preferred binding poses, and potential effects on the target's function. Among the confirmed hits, some "leads" will stand out based on their potency, specificity, or other desirable attributes. These are prioritized for further development based on a set of predefined criteria, such as their ability to interact with the disease target, the novelty of their chemical structure, their potential for optimization, and their pharmacodynamic and pharmacokinetic properties.
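To ground the idea, here is a deliberately simple similarity-based virtual screen using RDKit. A fingerprint similarity score stands in for the learned scoring functions and docking simulations described above, and the tiny "library" and reference compound are arbitrary examples:

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# A known active compound for our hypothetical target (aspirin, as a stand-in).
reference = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
ref_fp = AllChem.GetMorganFingerprintAsBitVect(reference, 2, nBits=2048)

# A tiny library of candidate molecules, encoded as SMILES strings.
library = {
    "salicylic acid": "O=C(O)c1ccccc1O",
    "paracetamol": "CC(=O)Nc1ccc(O)cc1",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
}

# Rank candidates by Tanimoto similarity to the known active; compounds
# scoring highly would be prioritized as potential "hits".
for name, smiles in library.items():
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    print(f"{name}: {DataStructs.TanimotoSimilarity(ref_fp, fp):.2f}")
```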

For virtual screening, the way these molecules are described or drawn out in a computationally accessible format can vary. For example, they might be represented as a simple line of text with a specific format (known as SMILES), or as a diagram that shows how the atoms are connected, which is called a molecular graph. These representations allow for the efficient screening, comparison, and optimization of molecules. For those of you who want more details about SMILES, molecular fingerprints, and graph networks for digitally encoding the explicit structures and features of molecules, I have included appendix A at the end of this post.

 

Step 4: Lead optimization

This phase involves iterative cycles of virtual synthesis and testing to refine the lead compounds' properties, further enhancing their interactions with the target. The goal is to produce leads with optimal pharmacological properties and minimal off-target effects. This involves a careful balance of potency, safety, and drug-like characteristics such as ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties. During the lead optimization phase, docking plays a role in iteratively improving the chemical structure of lead compounds. AI-driven docking can quickly evaluate modifications to the chemical structure, predicting how these changes could affect the compound's binding affinity and specificity for the target. This allows for a more efficient exploration of the chemical space around a hit compound, guiding medicinal chemists to design more potent and selective drug candidates.
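As a small illustration of the property checks involved, the sketch below computes a crude drug-likeness report with RDKit using Lipinski's rule of five. This is only a first-pass filter; the learned ADMET models described here go far beyond such rules:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

def drug_likeness_report(smiles: str) -> dict:
    # Lipinski's rule of five: orally active drugs tend to satisfy these bounds.
    mol = Chem.MolFromSmiles(smiles)
    return {
        "molecular_weight": Descriptors.MolWt(mol),       # preferably <= 500
        "logP": Descriptors.MolLogP(mol),                 # preferably <= 5
        "h_bond_donors": Lipinski.NumHDonors(mol),        # preferably <= 5
        "h_bond_acceptors": Lipinski.NumHAcceptors(mol),  # preferably <= 10
    }

print(drug_likeness_report("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin passes comfortably
```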

AI models use “embeddings” and other molecular representations to forecast properties like solubility, toxicity, and bioavailability, which are critical for the success of a therapeutic candidate. This analysis of the “Structure-Activity Relationship” (SAR) is an essential part of the hit-to-lead process, where the relationship between the chemical structure of compounds and their biological activity is studied and used to guide further compound modifications. For those of you who want more details about embedding techniques, I have included appendix B at the end of this post.

 

Step 5: Synthesis and testing

"Lab in the loop" is a concept in AI-assisted drug discovery that integrates real-world laboratory experiments with computational models and simulations to create a dynamic, iterative process of drug development. In this approach, AI and machine learning models are not used in isolation but are continuously updated and refined based on experimental data from the lab. Here's how it works in the context of AI-assisted drug discovery: 

  1. Initial Computational Predictions: The process begins with AI models making predictions about potential drug candidates, such as identifying promising compounds through virtual screening, predicting their pharmacological properties, or suggesting novel chemical structures with desired therapeutic effects.

  2. Laboratory Validation: Predictions made by AI models are then tested in the laboratory setting. This involves synthesizing the suggested compounds and testing their biological activity, pharmacokinetics, toxicity, and other relevant properties in vitro (in test tubes) or in vivo (in live models).

  3. Data Feedback Loop: The results from these laboratory experiments are fed back into the AI models. This real-world data is invaluable for validating the models' predictions, highlighting their accuracy, and identifying any discrepancies or areas for improvement.

  4. Model Refinement: With the new experimental data, AI models can be refined, retrained, or recalibrated to improve their predictive capabilities. This step allows the models to learn from the experimental outcomes, enhancing their reliability and accuracy for future predictions.

  5. Iterative Process: The cycle of prediction, validation, and refinement continues iteratively, with each loop equipping the models to make more accurate predictions over time.
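Here is the promised toy sketch of that loop. The "assay" is a stand-in function with synthetic data and noise, and a random forest stands in for whatever predictive model a real pipeline would use; the point is the shape of the predict-test-retrain cycle, not the specifics:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def run_assay(candidates):
    # Placeholder for real synthesis and laboratory testing: a hidden
    # activity function plus experimental noise.
    true_weights = np.array([0.8, -0.5, 0.3, 0.0, 0.6])
    return candidates @ true_weights + rng.normal(0.0, 0.1, len(candidates))

# A virtual library of candidate molecules, as toy feature vectors.
library = rng.uniform(0.0, 1.0, size=(500, 5))

# Seed the loop with a small batch of measured compounds.
measured_X = library[:10]
measured_y = run_assay(measured_X)

model = RandomForestRegressor(n_estimators=100, random_state=0)
for cycle in range(3):
    model.fit(measured_X, measured_y)               # step 4: refine model on lab data
    predictions = model.predict(library)            # step 1: predict the whole library
    batch = library[np.argsort(predictions)[-10:]]  # pick the most promising batch
    results = run_assay(batch)                      # step 2: the "lab" tests the batch
    measured_X = np.vstack([measured_X, batch])     # step 3: feed the results back
    measured_y = np.concatenate([measured_y, results])
    print(f"cycle {cycle}: best measured activity so far = {measured_y.max():.2f}")
```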

 

The "lab in the loop" approach exemplifies the synergy between computational and experimental sciences in drug discovery, leveraging the strengths of each to expedite and enhance the drug development process. It represents a more holistic and adaptive strategy in AI-assisted drug discovery, where real-world data continuously informs and improves computational predictions, leading to more efficient and effective drug discovery pipelines.

After the "lab in the loop" phase, synthesized compounds undergo a crucial final phase of real-world validation. This stage involves a comprehensive suite of experiments designed to confirm the virtual results and ensure the compounds' efficacy and safety. In vitro experiments, conducted in controlled laboratory environments, examine the compounds' interactions at a cellular level, providing insights into their mechanisms of action. In vivo studies, often involving animal models, offer a broader perspective on the compounds' pharmacokinetics, pharmacodynamics, and potential toxicity within a living organism.

These rigorous preclinical tests are essential stepping stones towards the ultimate goal: human trials. Only compounds that show promising results and a favorable safety profile progress to this phase. Human trials unfold in a structured clinical trial pipeline, starting with small-scale Phase I trials to assess safety, moving through Phase II to evaluate efficacy and optimal dosing, and culminating in larger Phase III trials to confirm effectiveness and monitor adverse effects. This meticulous process ensures that only the most promising and safe compounds can become the next breakthrough treatments.

 

Conclusion

Each step of the drug discovery process, as outlined in this article, showcases the unique contributions of AI: pinpointing viable targets, modeling biological interactions, discovering and refining potential treatments, and validating these findings through rigorous real-world experimentation.

Viewing disciplines like mathematics, physics, and chemistry through the lens of language - with their distinct syntaxes and vocabularies - opens a revolutionary perspective. In this light, Large Language Models (LLMs) extend beyond their initial function as sophisticated text generators. They become universal translators with the profound ability to decipher and synthesize the intricate languages of science, offering unparalleled insights into the fabric of biological phenomena.

While our focus has primarily been on the early stages of drug discovery and development, this narrative lays the groundwork for the ensuing journey through clinical trials and regulatory pathways, suggesting a wider sphere of AI's impact across the entire spectrum of the pharmaceutical industry.

 

Appendix A: Explicit Molecular Representations

The following methods explicitly encode molecular structures and features, directly representing atoms, bonds, and molecular topology in a way that is closely tied to the chemical structure of the molecules. They are designed for specific computational tasks such as database searching, similarity comparisons, and structural analysis. (A short code sketch after the list shows all three side by side.)

  • SMILES (Simplified Molecular Input Line Entry System) is a notation system that encodes a molecule's structure as a line of text. Each character or string of characters within a SMILES string represents atoms, bonds, and their arrangements within the molecule. SMILES is designed to be a compact, human-readable, and machine-parsable way to represent molecular structures.

  • Molecular fingerprints are typically binary strings (sequences of 0s and 1s) or sometimes numerical vectors, where each position or bit in the string represents the presence or absence of a particular molecular feature, such as a specific atom, bond type, or functional group. Molecular fingerprints are not designed to be human-readable in the same way that SMILES strings are, and the interpretation of each bit is based on the specific rules or algorithm used to generate the fingerprint. As a result, molecular fingerprints are better suited to computational tasks that require rapid comparison and analysis of molecular structures, such as similarity searches and clustering, due to their binary and compact format. In contrast, SMILES are more intuitive for humans but are typically less efficient for computational tasks like rapid similarity searches or bulk comparisons.

  • Graphs represent a fundamentally different approach to molecular representation, emphasizing the relational aspects of molecular structures. In graph-based representations, molecules are modeled as graphs where atoms are represented as nodes and bonds as edges connecting these nodes. This approach is particularly powerful for capturing the topological (connectivity) and structural features of molecules, making it highly suitable for a range of computational tasks, especially those involving machine learning models designed to work with graph data. Graph representations capture detailed structural information in a way that is more explicit and flexible than fingerprints and can be more directly related to the molecular structure than SMILES strings. While not as compact as fingerprints or as straightforward as SMILES for simple storage and communication, graphs offer excellent capabilities for interactive visualization and exploration of molecular structures.
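For the hands-on reader, here is a short RDKit sketch showing all three representations for a single molecule (aspirin, as an arbitrary example):

```python
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin, written as a SMILES string
mol = Chem.MolFromSmiles(smiles)   # parsed into an RDKit molecule object

# 1. SMILES: canonicalization yields one standard string per structure.
print(Chem.MolToSmiles(mol))

# 2. Fingerprint: a fixed-length bit vector for fast similarity searches.
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
print(fp.GetNumOnBits(), "bits set out of", fp.GetNumBits())

# 3. Graph: walk the atoms (nodes) and bonds (edges) explicitly.
for bond in mol.GetBonds():
    a, b = bond.GetBeginAtom(), bond.GetEndAtom()
    print(f"{a.GetSymbol()}{a.GetIdx()} -{bond.GetBondTypeAsDouble()}- {b.GetSymbol()}{b.GetIdx()}")
```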

 

Appendix B: Abstract Molecular Representations

Embeddings capture abstract patterns and relationships learned from data. These representations are less directly interpretable in terms of chemical structure but are highly useful for machine learning applications, where they can facilitate tasks like property prediction, molecule generation, and clustering based on learned features rather than explicit structural descriptors.

In the context of molecules, embedding techniques take a molecular representation (which could be based on SMILES, molecular graphs, fingerprints, etc.) and transform it into a high-dimensional vector: a list of numbers, each representing a different feature or characteristic of the molecule, that together form a detailed profile of it. In other words, each dimension of this vector captures some abstract feature or pattern of the molecule, which might relate to its chemical properties, structural motifs, or other characteristics important for the task at hand, such as predicting biological activity or solubility. These vectors capture complex patterns and relationships within the data, making them suitable for machine learning tasks such as property prediction, clustering, or similarity searches. Embeddings are particularly useful because they can capture nuanced relationships in a way that is amenable to mathematical operations and algorithms. (A short code sketch follows below.)
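To illustrate the idea, the sketch below compresses explicit fingerprints into small dense vectors with PCA. In real pipelines this compression is learned by a neural network rather than by PCA, so that each dimension captures a task-relevant feature, but the resulting "one dense vector per molecule" shape is the same:

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.decomposition import PCA

smiles_list = [
    "CC(=O)Oc1ccccc1C(=O)O",       # aspirin
    "O=C(O)c1ccccc1O",             # salicylic acid
    "CC(=O)Nc1ccc(O)cc1",          # paracetamol
    "Cn1cnc2c1c(=O)n(C)c(=O)n2C",  # caffeine
    "CCO",                         # ethanol
]

# Start from an explicit representation: 2048-bit Morgan fingerprints.
X = np.zeros((len(smiles_list), 2048))
for row, s in enumerate(smiles_list):
    mol = Chem.MolFromSmiles(s)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    arr = np.zeros((2048,))
    DataStructs.ConvertToNumpyArray(fp, arr)
    X[row] = arr

# Compress each molecule into a 3-dimensional dense vector (an "embedding").
embeddings = PCA(n_components=3).fit_transform(X)
for s, e in zip(smiles_list, embeddings):
    print(f"{s}: {np.round(e, 2)}")
```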