The Application of Machine Learning in Drug Discovery

1. Introduction

 

The fundamental goal of computational drug discovery is to apply technologies that model biological macromolecules in order to facilitate and accelerate the optimization of the bioactivity and selectivity of new drug candidates. Conventional approaches rely heavily on high-throughput screening and the complex analysis of biological datasets. The discovery of novel and potent inhibitors for a given biomolecular target rests on automated procedures for hit identification via library screening, structural elucidation, iterative medicinal chemistry, and ADMET profiling. The endeavors and considerations involved are numerous and challenging; structural elucidation is especially complex, and the search for more accurate and efficient ways to model and predict the bioactivity of ligands has therefore shaped a vibrant field of research.

 

Since the early days of drug discovery, medicinal chemists have sought clever ways to predict and indirectly model the interaction of small molecules with biomolecular targets. The advent of novel computational methods and systems has since made it feasible to adopt mechanistic approaches that lend reliability and interpretability to computational modeling in drug discovery. Knowledge from chemistry, physics, and biology is now merged into common modeling platforms that animate biomolecular structures and simulate their motion along time-resolved trajectories, shedding light on their native functionality. This opens a window of opportunity to gain valuable insight into the interaction, selectivity, and bioactivity of ligands through the simultaneous investigation of the atomic-level motion of the drug and the biomolecular structure throughout the binding event. However, the modeling strategies currently in place to assist drug discovery remain costly in terms of time and resources. As a result, there is growing eagerness to exploit the burgeoning advances in artificial intelligence and machine learning to facilitate and improve current drug discovery procedures.

 

The explosion of information and knowledge over the past two decades, driven by the advent of novel methods and tools, has fostered a myriad of machine learning applications that touch many aspects of drug discovery. Each of these attempts seeks to tackle one of the many challenges surrounding the identification of bioactive small molecules. Ligand-based, structure-based, biologically based, and knowledge-based approaches are now entering an era in which they may converge towards a common modeling platform (Elbadawi et al., 2021; Dara et al., 2022; Patel and Shah, 2022; Vatansever et al., 2021; Patel et al., 2020).

 

2. Overview of Drug Discovery Process

 

The process of discovering and developing new drugs is long, complex, and expensive. Drug discovery may be defined as the process through which candidates for new medications are identified. It encompasses a series of highly integrated and consecutive steps that usually involve early detection and validation of biological targets, identification of small molecules active at the target, optimization of the hit compounds, preclinical testing of drug candidates, and clinical studies. In addition, several support disciplines, such as bioanalytics, pharmacokinetics, safety, formulation, and computational sciences, play an essential role. The activities can be roughly broken down into five major steps: target identification, hit identification, hit-to-lead, lead optimization, and preclinical development. Target identification consists of identifying the interaction partner for the drug, while hit identification consists of finding starting points with drug-like properties capable of interacting with the target. The hit-to-lead step involves the assessment and optimization of hits, while lead optimization produces drug candidates suitable for in vivo testing. The resulting candidates are then subjected to preclinical development.

 

Biotechnology-based molecular target classes emerging from the omics era have drastically changed the focus of drug discovery efforts and posed complementary challenges. Many of these new targets are large, complex, and intractable with traditional drug design methodologies. The vast majority of currently approved drugs target proteins, an approach that has proven extremely successful. However, many less tractable target classes exist, including but not limited to RNA, DNA, transcription factors, and proteins bearing pathological post-translational modifications.

 

In the quest for new lead candidates, high-throughput biological screening is often performed using either natural product libraries produced by culture-collection-derived microorganisms or synthetic libraries produced by combinatorial chemistry. High-throughput screening is usually preceded by the development of unobtrusive, high-fidelity assays suitable for generating primary hit compounds. Although bioactive alkaloids, polyketides, and terpenoids, compound classes whose microbial producers biosynthesize a multitude of bioactive metabolites in culture, have been intensively evaluated to meet the rising demand for commercial lead compounds, the number of culturable but unexploited microorganism taxa remains huge. The limitations of conventional culture technology, which largely stem from the preference of cultured microbes for nutrient-rich, stress-free environments, have recently been outlined.

 

The state of the art long failed to provide insight into target classes suitable for high-throughput screening, as the biological focus remained on tractable and epidemiologically relevant targets. The screening of cellular extracts containing crude bioactive natural product mixtures has successfully identified many bioactive metabolites, although the subsequent isolation, purification, and structure elucidation of the responsible compounds requires substantial additional effort. Notably, some of these compounds were later revealed to be polyheterocyclic alkaloids released into the culture medium by only a handful of cultured strains (Gaudelet et al., 2021; Zhao et al., 2020; Vatansever et al., 2021; Jiménez-Luna et al., 2021).

 

2.1. Traditional Drug Discovery Methods

 

The process of drug discovery consists of a series of events that create potential drugs. It entails the discovery of new compounds and their manufacture on a large scale; after receiving approval, the drug is marketed. Drug discovery is a complex, lengthy, expensive, and still largely empirical procedure. It involves many disciplines, such as chemistry, biology, pharmacology, immunology, and molecular biology.

 

The introduction of combinatorial chemistry and high-throughput screening has led to a significant increase in the capacity to test compounds against a particular target. However, the great majority of tested compounds show no activity of interest, which has rekindled interest in employing computational techniques to assist drug discovery and design.

 

The traditional approach to drug design entails determining the three-dimensional structure of the biological target, modeling the interaction between the target and drug candidate, and refining the model until the target-drug complex is acceptable. The three-dimensional structure of the target is usually obtained by X-ray diffraction from crystallized targets or by computer modeling methods.

 

Computer modeling approaches for drug discovery are usually divided into two groups: ligand-based and structure-based approaches. Ligand-based methods have no requirement for three-dimensional structural information about the biological target. They use information about known compounds with some activity against the target, in the form of chemical structure and biological activity, and predict the activity of novel compounds using statistical methods. Structure-based methods, in contrast, require knowledge of the three-dimensional structure of the target.
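
As a concrete illustration of the ligand-based idea, the following minimal sketch ranks a few candidate molecules by fingerprint similarity to a known active using RDKit; the molecules, and the choice of aspirin as the query, are purely illustrative.

```python
# Ligand-based sketch: rank candidates by Tanimoto similarity to a known active.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

known_active = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, an example query
candidates = {
    "salicylic acid": "O=C(O)c1ccccc1O",
    "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C",
    "paracetamol": "CC(=O)Nc1ccc(O)cc1",
}

# Morgan (circular) fingerprints: radius 2, 2048 bits
query_fp = AllChem.GetMorganFingerprintAsBitVect(known_active, 2, nBits=2048)
scores = {}
for name, smi in candidates.items():
    fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=2048)
    scores[name] = DataStructs.TanimotoSimilarity(query_fp, fp)

# Higher similarity suggests, but does not guarantee, similar activity.
for name, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {s:.2f}")
```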

 

Structure-based approaches utilize the three-dimensional structure of the target to design compounds that will specifically interact with it. The most common choice of method is molecular docking, which attempts to find the preferred binding mode and estimate the binding affinity of the docked ligand. Docking simulations have been successfully applied to many biological targets, and more than a hundred software packages are commercially available or freely distributed.

 

2.2. Challenges in Drug Discovery

 

Drug discovery is a complex and lengthy process that involves the identification, development, and commercialization of new drugs. Despite advancements in screening technologies, computational techniques, and chemical libraries, drug discovery remains a challenging task due to the large number of potential targets and the increasing complexity of drug candidates. Many new drug candidates fail to progress after the clinical candidate selection stage, often due to incorrect predictions of efficacy or safety. Moreover, the time and cost involved in bringing a drug to market have dramatically risen, making the development and use of computational methods to assist in drug discovery more desirable.

 

The drug discovery process is typically divided into pre-competitive and competitive parts. In the pre-competitive part, a target to be addressed by the drug is identified and characterized; characterizing a native protein target often requires knowledge of its three-dimensional structure. Lead identification, on the other hand, is competitive: large compound libraries are screened against the chosen target protein. Both parts can increasingly be assisted by modeling methods, which seek to complement the knowledge of the experimentalists.

 

Common target proteins for drug candidates are enzymes and receptors, but proteins can also act as shuttles or scaffolds, channeling signaling cascades from one protein to another. Owing to the nearly infinite combinations of amino acids that can form a protein, and thus a drug target, target discovery is commonly considered the hardest part of drug discovery. Only a very limited number of protein targets are known and reported to the scientific community. The modeling of a target protein is further complicated by alternative splicing and post-translational modifications. With very limited sets of conserved transmembrane domains or protein folds, knowledge of membrane target proteins is even weaker.

 

Besides the enormous effort required for target ideation and protein modeling, a compound must be altered considerably from its hit structure to become a viable drug, owing to potency, selectivity, and other drug-like criteria. The design of possible compound modifications at the molecular level depends on the manufacturer and can impose additional complications. In this respect, structure-activity relationships can vary widely in complexity depending on the target protein and often confound the hit-finding process.

 

3. Fundamentals of Machine Learning

 

Machine learning is a subset of artificial intelligence that is concerned with the study of computer algorithms that improve automatically through experience. It is based on the idea that systems can learn from data, identify patterns, and make decisions with minimal human intervention. Machine learning algorithms have been a great source of inspiration for researchers in many fields, including biology, chemistry, and computer science. The core technologies of this set of algorithms are well-suited to be applied for analyzing complex molecular and biological datasets, allowing for the identification of hidden patterns within the data that are hard to decipher using only traditional methods. Classically, machine learning is divided into three categories: supervised, unsupervised, and reinforcement learning. Although there are other divisions, they can all be categorized into one or more of these three classes.

 

In supervised learning, an algorithm is trained to predict the value of a label from some input data. The model used to make the predictions is trained on a dataset containing known input features and the desired labels, from which the algorithm can learn patterns and build a model capable of making predictions on unseen data. Common applications include predicting the rise or fall of stock prices, the yield of chemical reactions, the toxicity of compounds, and the outcome of an experiment, among many others. There are two fundamental approaches to supervised learning: traditional machine learning algorithms and deep learning algorithms, which are usually based on artificial neural networks.

 

In contrast to supervised learning, unsupervised learning is concerned with the representation of the domains of interest themselves. It analyzes the data itself, searching for hidden structure or grouping similar entities, without any supervision signal indicating the right label or class. Unsurprisingly, its most common applications are in grouping and clustering. Popular algorithms include self-organizing maps, dimensionality reduction techniques, K-means clustering, and hierarchical clustering. Using clustering algorithms together with representations of chemical or biological space, it becomes possible to visualize these spaces, observe their distributions, and gain insight into the data that could not be observed otherwise.

 

Reinforcement learning differs from the other two approaches in the nature of its training paradigm, which is based on learning from interaction with an environment. It is generally used to learn what action an agent should take in a particular situation to optimize some long-term reward. The agent has to choose from an extensive set of possible actions, which sometimes requires understanding complex interactions between the actions and the environment's responses at different time scales. Some actions may yield rapid but short-lived rewards, while others may be most rewarding in the long term. The agent learns the best actions to take in a situation, and avoids actions that lead to undesirable situations, through a trial-and-error approach combined with a mechanism to store and recall information about these actions and the rewards associated with them (Gaudelet et al., 2021; Huang et al., 2021; Pandey et al., 2022; Mouchlis et al., 2021).

 

3.1. Supervised Learning

 

In supervised learning, a model is provided with a set of training data consisting of input-output pairs and learns to predict outcomes for new data. In the context of drug discovery, such input-output pairs could consist of molecular representations (the input) and the corresponding chemical or biological activity of the compounds (the output). Given a sufficiently large and representative training set, the goal is to capture the relationship between input and output so that predictions can be made for molecules with unseen activity data. Supervised learning models are typically categorized by the nature of the output: models with continuous outputs are regression models, while models with discrete output classes are binary or multiclass classification models. An important aspect of supervised learning is the adoption of a loss function to guide the training process; this loss function quantifies the error for any given set of model parameters. Over the past two decades, supervised learning approaches have been successfully applied to various tasks relevant to drug discovery, such as predicting the physicochemical properties of drug candidates or their likelihood of binding a specific target protein. A wide range of techniques has been used for supervised learning in the drug discovery field. In early applications, where compounds were represented by a few descriptors or fingerprints, linear models were popular choices due to their simplicity, interpretability, and ease of application. As larger datasets containing thousands or even millions of compounds became available, the focus shifted from interpretability to predictive power, leading to the adoption of more complex, state-of-the-art models and algorithms, such as decision trees and ensemble approaches. More recently, the unprecedented success achieved in tasks such as image recognition, natural language processing, and even defeating human champions in games has driven enormous interest in deep learning approaches. While deep learning has not yet reached the level of maturity in drug discovery that it enjoys in other domains, there have been numerous attempts to explore deep learning architectures in drug discovery-related tasks, with promising results.
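
To make the workflow concrete, the following sketch trains a binary activity classifier on stand-in data: the random bit vectors stand in for molecular fingerprints and the random labels for assay outcomes, so the numbers themselves are meaningless, but the train/evaluate pipeline is the one described above.

```python
# Supervised learning sketch: fingerprints in, active/inactive label out.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 1024)).astype(float)  # stand-in 1024-bit fingerprints
y = rng.integers(0, 2, size=500)                        # stand-in active/inactive labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)  # training optimizes the model's internal criterion

# Evaluate on held-out data; ROC AUC is a common choice for imbalanced assays
probs = model.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, probs))
```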

 

3.2. Unsupervised Learning

 

Unsupervised learning is a type of machine learning that uses input data without labeled responses. In unsupervised learning, algorithms learn on their own by trying to discover patterns and structures in the input data. A few applications of unsupervised learning in drug discovery include improving drug-like properties of compounds, predicting potential side effects based on drug chemical structures, and finding relations between drugs and biological targets.

 

Data mining is an important application of unsupervised learning that is often used to analyze chemical databases of drug-like compounds. The data mining process is divided into the following steps: (1) selection of a subset of chemical descriptors that characterize the database compounds well, (2) clustering of the compounds based on their chemical descriptors, such that compounds within a cluster are similar to each other and distant from those in other clusters, and (3) interpretation of each cluster obtained in the previous step. Clustering is considered the most important step in the data mining process applied to chemical databases. It aims at finding similarities among compounds in order to choose representative compounds for biological activity tests and to design new compounds based on the knowledge acquired in previous steps.
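
A minimal sketch of this three-step workflow, assuming compounds are supplied as SMILES strings, might look as follows; the descriptor set and the number of clusters are illustrative choices, not recommendations.

```python
# Three-step mining sketch: (1) descriptors, (2) clustering, (3) inspection.
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

smiles = ["CCO", "CCCO", "c1ccccc1", "c1ccccc1O", "CC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Step 1: a small, illustrative descriptor set
X = [[Descriptors.MolWt(m), Descriptors.MolLogP(m), Descriptors.TPSA(m)] for m in mols]
X = StandardScaler().fit_transform(X)

# Step 2: cluster (k chosen arbitrarily here; in practice it is tuned)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Step 3: interpret clusters by listing their members
for k in set(labels):
    print(f"cluster {k}:", [s for s, l in zip(smiles, labels) if l == k])
```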

 

Several unsupervised clustering approaches, including hierarchical clustering, k-means clustering, fuzzy c-means clustering, and self-organizing maps, can be applied to chemical compound analysis. Hierarchical clustering can be agglomerative, where each compound initially forms its own cluster and clusters are then merged into progressively larger ones until a single cluster includes all compounds. Conversely, divisive hierarchical clustering begins with all compounds in one cluster, which is then divided into smaller clusters until each compound forms its own cluster. The major disadvantage of hierarchical approaches is that the clustering is performed only once: because they do not take the problem to be solved into account, there is no way to modify the result if it does not meet the designer's expectations. Fuzzy c-means clustering provides an alternative in which the number of clusters must be specified a priori; each compound is assigned to all clusters with different degrees of membership. The output of fuzzy c-means clustering is a matrix of fuzzy membership values that can provide more information about the compounds in the training set, although it is less useful when clusters overlap. Self-organizing maps, compared with the other two types of clustering techniques, produce a graphical output that can be used to interpret the results visually and to understand how the input features influence the clustering (Gaudelet et al., 2021; Huang et al., 2021; Pandey et al., 2022).

 

3.3. Reinforcement Learning

 

Reinforcement learning (RL) is a distinctive family of machine learning paradigms concerned with learning an agent policy that achieves desirable behavior through interactions with an environment. The machine develops through experience, finding actions that lead to favorable results and avoiding those that degrade overall performance. The cumulative reward that an agent can collect, given its actions in an environment, provides an evaluation of the learned policy. In reinforcement learning problems, the agent must deal with the exploration-exploitation dilemma: it should try out different actions that might lead to higher rewards while also exploiting its current knowledge of the environment.

 

A Markov decision process describes a wide family of reinforcement learning scenarios and rests on a crucial assumption about the environment: the next state at time t+1 is generated from the current state of the environment at time t, given the action taken by the agent at time t. State information from times t-1, t-2, and earlier can be discarded without affecting the subsequent evolution of the environment. At time t, the agent decides on an action a_t, given its current state s_t, in accordance with a policy. This policy can be deterministic, selecting a specific action at state s, or stochastic, selecting an action sampled from a probability distribution.

 

Afterwards, the policy is evaluated according to a value function, using a model of the environment, and then modified to improve it. Following the modified policy, the agent will choose a different action when at state s than it did before. This changes the distribution of actions taken, which in turn affects the probability distribution of the states the agent visits, rendering the previously calculated value function invalid and necessitating re-evaluation.
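
The following toy sketch makes the interaction loop concrete with tabular Q-learning on a hypothetical five-state chain environment; it is not a molecular design environment, just the smallest setting in which the update rule and epsilon-greedy exploration can be seen at work.

```python
# Toy tabular Q-learning on a 5-state chain (illustrative, not drug-related).
import numpy as np

n_states, n_actions = 5, 2   # actions: 0 = left, 1 = right
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.95, 0.1  # learning rate, discount, exploration

def step(s, a):
    """Move along the chain; only reaching the final state pays a reward."""
    s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    r = 1.0 if s2 == n_states - 1 else 0.0
    return s2, r

rng = np.random.default_rng(0)
for episode in range(500):
    s = 0
    for t in range(20):
        # epsilon-greedy: explore occasionally, otherwise exploit current Q
        a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
        s2, r = step(s, a)
        # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
        s = s2

print(Q)  # the learned policy is the argmax over each row
```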

 

4. Machine Learning Techniques in Drug Discovery

 

Among the recent advances combining drug discovery and machine learning, four directions can be highlighted: QSAR, virtual screening, de novo drug design, and drug repurposing. Most papers on machine learning in drug discovery can be categorized under these four techniques.

 

Available compounds in a database are usually encoded into quantitative features; machine learning classifiers are then built to discriminate actives from inactives. QSAR is one of the oldest data mining techniques in computational drug discovery, although in recent years it has been helped enormously by machine learning. Structure-activity relationships built before the advent of machine learning could not handle thousands of compounds, and the statistics of these early QSARs were questionable because they included very few compounds; modern QSAR can deal with thousands or even millions. Formally, design denotes the process of identifying more potent and selective compounds against a biological target, while screening aims to validate or confirm a set of already designed inhibitors against that target. Computationally, a huge number of virtual compounds can be generated, and the virtual screening of external databases of compounds from different chemical spaces is assembled into the screening work. Screening can be performed either with publicly available compounds or with proprietary compounds from a company's database.

 

Both virtual screening and de novo design can be employed at the hit-list selection stage to search for new compounds against a biological target. Only compounds that pass the virtual filter(s) are forwarded to experimental bench tests. Once computationally vetted compounds are selected to enter molecular dynamics simulations of the target-probe complex, the final hit-list groups the compounds that passed all previous computational tests. From it, experimental biologists select the most promising compounds for experimental testing against the target. The cheminformatics field uses the term "hit-list" to describe this last generation of compounds with diverse biological and chemical properties. The hit-list selection step is critical to reduce the number of experimental tests from thousands to usually only tens or hundreds of compounds.

 

De novo drug design is the process of designing a new compound from scratch that must satisfy at least two requirements: it should exhibit the desired activity and should have suitable physicochemical properties such as solubility and absorption. As a contribution to improving the drug-like properties of novel hits selected by a screening initiative, this direction focuses on the design of new compounds. Most drug molecules are patented; it is therefore often more cost-effective to redesign known drugs or to develop new, improved derivatives of them. Using a sufficiently large compound library, known as a "repurposing library," alternative targets can be sought for a ligand already approved for one disease by screening it in a "drug-repurposing screen." The idea is to rescreen drugs already on the shelf for another disease or side effect. Drug-repurposing prediction usually involves bioactivity predictions across large compound libraries and diverse targets, and some caution must be taken in interpreting these predictions.

 

4.1. Quantitative Structure-Activity Relationship (QSAR)

 

Quantitative Structure-Activity Relationships (QSAR) denote a class of statistical models concerned with the relationship between the chemical structure of compounds and their biological activity. The methodology is increasingly employed in drug discovery to accelerate both hit identification and lead optimization. QSAR modeling generally falls into three stages: (i) descriptor calculation, (ii) model development, and (iii) model validation and interpretation. The descriptor calculation stage involves deriving an appropriate set of molecular descriptors; despite the large number of commercial and open-source descriptor calculation tools available, this stage is often the bottleneck of the process owing to the stringency of the structure-to-descriptor transformation.

 

Machine learning has become a powerful tool for prediction in modern chemoinformatics. With the advent of quantum mechanical theory, descriptively rich properties of chemical compounds have become computable, and with the advancement of modern machine learning techniques, QSAR, previously based on classical linear modeling, is gradually embracing non-linear approaches. In chemoinformatics, there are more than 350 related publications containing large datasets, with over 1,000 chemical compounds evaluated by various machine learning models. As a drug discovery tool, QSAR can in principle reduce the number of compounds that experimentalists need to synthesize and thereby diminish the monetary costs involved in the discovery and development of new drugs.
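
A minimal end-to-end QSAR sketch covering the three stages might look as follows; the SMILES strings and activity values are synthetic placeholders rather than real assay data, and a real study would use far more compounds and descriptors.

```python
# QSAR sketch: (i) descriptor calculation, (ii) model development, (iii) validation.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

smiles = ["CCO", "CCN", "CCCl", "c1ccccc1", "CC(=O)O", "CCOC", "CCC", "CCCC"]
y = np.array([1.2, 0.8, 2.1, 2.9, 0.5, 1.0, 1.8, 2.4])  # hypothetical activity values

# (i) descriptor calculation with a small illustrative descriptor set
mols = [Chem.MolFromSmiles(s) for s in smiles]
X = np.array([[Descriptors.MolWt(m), Descriptors.MolLogP(m),
               Descriptors.NumHDonors(m), Descriptors.NumHAcceptors(m)] for m in mols])

# (ii) model development: a non-linear regressor
model = RandomForestRegressor(n_estimators=100, random_state=0)

# (iii) validation: cross-validated R^2 (tiny here, purely for illustration)
scores = cross_val_score(model, X, y, cv=4, scoring="r2")
print("cross-validated R^2:", scores.mean())
```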

 

Recent advances in computer hardware and software have equipped researchers with tools for calculating various molecular descriptors on either Linux servers or Windows PCs. As a consequence, small communities of QSAR researchers in academia and non-profit laboratories have developed so-called "home-grown" QSAR software. In the following, commonly used molecular descriptors are briefly reviewed under the categories of topological, geometrical, electronic, and hybrid descriptors.

4.2. Virtual Screening

 

Virtual screening has emerged as a promising solution for the analysis of extensive libraries targeting specific protein binding sites. The technique involves predicting ligand-target interactions for large datasets of compounds, enabling the identification of new hit compounds for biological validation while deprioritizing non-hits. As one of the earliest computational approaches in drug discovery, virtual screening relies on diverse techniques to assess the binding affinity of a ligand for a target. The initial steps of a virtual screening campaign include target preparation and the selection of a compound database to screen against the target. Two screening approaches can then be employed: applying docking software to analyze the binding affinity and binding pose of each compound against the target, or analyzing molecular descriptors. The goal of both approaches is to generate a list of hit compounds with the most favorable binding affinities for the target of interest; such compounds can then be prioritized for biological validation.

 

Typically, the computational techniques applied in virtual screening include pharmacophore queries, molecular docking, molecular dynamics simulations, post-docking calculations, and machine learning. Pharmacophore queries rely on the identification of crucial molecular features that characterize biological activity, enabling the design of a virtual screening query. Molecular docking encompasses techniques that predict the structure of ligand-protein complexes based on the free energy of binding and the analysis of hydrogen bonds, hydrophobic interactions, and binding pockets; it can be used to screen hundreds of millions of compounds. Molecular dynamics simulations offer a physically realistic assessment of ligand-target interactions over time and can be employed to analyze binding poses from docking calculations. Post-docking calculations utilize a variety of approaches to re-score binding affinities, such as consensus scoring or machine learning re-scoring methods.
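
As an illustration of machine learning re-scoring, the sketch below trains a classifier on placeholder pose features against known binder labels and uses it to re-rank newly docked compounds; in practice the features (docking score, hydrogen-bond count, buried surface area, and the like) would come from real docking output.

```python
# Post-docking ML re-scoring sketch with placeholder data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Stand-ins for per-pose features; real values would be parsed from docking results
X_train = rng.normal(size=(300, 3))
y_train = rng.integers(0, 2, size=300)  # known binder / non-binder labels

rescorer = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

X_screen = rng.normal(size=(10, 3))      # features of newly docked compounds
p_bind = rescorer.predict_proba(X_screen)[:, 1]
ranking = np.argsort(-p_bind)            # prioritize by predicted binding probability
print("re-ranked compound indices:", ranking)
```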

 

Pharmacophore queries, molecular docking, molecular dynamics, and post-docking calculations are the more conventional screening methodologies. More recently, machine learning models have been developed to enhance predictions of protein-ligand interactions. The rise of machine learning methodologies has been driven by the accumulation of large datasets from high-throughput screening initiatives, bioactivity databases, solved ligand-target 3D structures, and publicly available protein-ligand affinity datasets. Research institutions and pharmaceutical companies are increasingly incorporating machine learning models into virtual screening pipelines, and such models have been reported to outperform traditional virtual screening methodologies, which are constrained by the limited experimental data available for calibrating physics-based algorithms.

 

4.3. De Novo Drug Design

 

One area of drug discovery where machine learning approaches have been applied successfully is de novo drug design, the design of novel chemical compounds or biologics for a desired target. This section introduces a number of applications of deep learning to de novo design in order to highlight the range of possible approaches. Proposed active compounds are often judged using mathematical models or qualitative comparisons, and deep learning methods have been applied to automate some of these assessments for co-evolution-based protein designs as well as for general compound generation.

 

The applications of machine learning in de novo drug design fall into two major categories. The first involves the generation or design of small organic or peptide compounds, including the generation of compounds against active sites, raising the probability of finding compounds with desirable functional characteristics. The second involves the positive and negative scoring of ligands or small organic compounds against predefined targets, generating potential lead compounds or removing unwanted ones. This category includes the re-evaluation of compounds already known to engage targets in in silico or in vitro assays.

 

In the field of drug design and small organic compound generation, a prediction-based generative adversarial network architecture has been used to produce chemical reactions. Similar approaches have been used to design small organic compounds to combat viral infection: knowledge-based methods were used to design novel compounds, which underwent a combination of in silico docking and a laboratory screen, revealing compounds that inhibited viral entry. To address the complication that chemical compound generation yields a large and sparse hypothesis space from which novel compounds may be drawn, a compound-filtering approach for small organic compound generation has been developed. In this case, models were trained to predict a range of desired properties, including lipophilicity, solubility, blood-brain barrier permeability, and toxicity measures; the trained models were then applied to filter compound hypotheses generated by a lattice-based Monte Carlo algorithm that samples the chemical space around a predefined lead compound.
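
The filtering step described above can be sketched as follows; the candidate SMILES stand in for generator output, and the thresholds are illustrative, loosely inspired by Lipinski-style rules rather than any published protocol.

```python
# De novo design filtering sketch: keep candidates with drug-like computed properties.
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

candidates = ["CCO", "CCCCCCCCCCCCCCCC", "c1ccccc1CC(N)C(=O)O", "CC(=O)Nc1ccc(O)cc1"]

def passes_filters(mol):
    # Thresholds are illustrative; a real pipeline would use validated models
    return (Descriptors.MolWt(mol) <= 500
            and Descriptors.MolLogP(mol) <= 5
            and QED.qed(mol) >= 0.5)

kept = []
for smi in candidates:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None and passes_filters(mol):
        kept.append(smi)

print("surviving candidates:", kept)
```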

 

4.4. Drug Repurposing

 

Drug repurposing, also known as drug repositioning, is the process of identifying new therapeutic uses for existing drugs. This approach has gained significant attention in the field of drug discovery due to its potential to reduce the time and cost associated with the development of new drugs. Machine learning techniques offer valuable tools for the identification of new uses for existing drugs by analyzing the large volumes of data generated by drug trials and studies.

 

Machine learning approaches to drug repurposing can be divided into three main categories: similarity-based approaches, graph-based approaches, and network-based approaches. Similarity-based approaches rely on the assumption that drugs with similar structures or targets are likely to have similar effects. These approaches typically use quantitative structure-activity relationship models to represent drugs as vectors of features and then compute similarities between drugs. Graph-based approaches represent drugs, diseases, and interactions as a heterogeneous network and establish the relationship between drugs and diseases through a bi-rank algorithm. Network-based approaches represent drugs, diseases, and other biological entities as a network and use machine learning algorithms to predict new drug-disease associations based on the existing network.
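
A minimal sketch of the similarity-based idea: a query drug inherits candidate indications from its most similar neighbor in a toy drug-indication table. All drugs, structures, and associations here are illustrative placeholders.

```python
# Similarity-based repurposing sketch with a toy drug-indication table.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

known = {
    "drug_A": ("CC(=O)Oc1ccccc1C(=O)O", {"inflammation"}),
    "drug_B": ("CC(=O)Nc1ccc(O)cc1", {"pain", "fever"}),
    "drug_C": ("Cn1cnc2c1c(=O)n(C)c(=O)n2C", {"fatigue"}),
}

def fp(smiles):
    return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smiles), 2, nBits=2048)

query = fp("O=C(O)c1ccccc1O")  # the drug whose new uses we want to predict
scored = sorted(known.items(),
                key=lambda kv: DataStructs.TanimotoSimilarity(query, fp(kv[1][0])),
                reverse=True)

# Transfer indications from the single nearest neighbor (k=1 for brevity)
nearest_name, (_, indications) = scored[0]
print(f"candidate indications (from {nearest_name}):", indications)
```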

 

Machine learning methods have been successfully employed in various drug repurposing studies. These methods include support vector machines, random forests, k-nearest neighbors, and deep learning approaches such as convolutional neural networks. Support vector machines have been used to build classifiers that predict new drug-disease associations, and random forests have been used to model the relationship between drugs and diseases, both on the basis of chemical similarity, target similarity, and side-effect similarity.

 

Emerging machine learning techniques have shown great promise in drug repurposing applications. These techniques include recurrent neural networks, generative adversarial networks, attention-based neural networks, and multi-task learning. Recurrent neural networks can capture the sequential nature of drug interactions. Generative adversarial networks can generate new drug-disease pairs. Attention-based neural networks can focus on the most relevant features of the data. Multi-task learning can jointly learn multiple related tasks, improving the performance of each individual task.

 

Despite the advances in machine learning approaches for drug repurposing, several challenges remain. These challenges include the sparsity of drug-disease interaction data, the need for biological validation of predicted drug-disease pairs, and the need for standardized benchmarks for evaluating the performance of different approaches. Addressing these challenges will be essential for promoting the wide adoption of machine learning techniques in drug repurposing and improving the efficiency of the drug discovery process.

 

5. Applications of Machine Learning in Drug Discovery

 

The emergence of machine learning (ML) has had a considerable impact on numerous industries, most notably improving productivity levels and streamlining operational processes. However, in the drug discovery industry, the potential benefits of ML are still underexplored, despite its high potential reward due to the substantial costs of drug discovery failure. This section highlights the major applications of ML in drug discovery. The applications outlined in this section cover four distinct stages of the drug discovery process: target identification and validation, compound screening and prioritization, toxicity prediction, and pharmacophore modeling.

 

Target identification and validation (TIV) is the first step in the drug discovery process. In some settings the goal is to find the most relevant target for an existing or previously abandoned drug, which is important for enabling drugs to be repurposed for alternate diseases. ML has been utilized in drug target identification, with the majority of approaches seeking to identify the proteins most affected by an administered drug, although other ML approaches have also been developed. In addition to drug-target identification, machine learning has been applied within TIV to predict drug-induced liver injury using datasets of approved drugs with known hepatotoxicity profiles.

 

Compound screening and prioritization involves predicting whether a given compound will bind to a specific target protein. Machine learning has been employed in a variety of ways to achieve this. Most ML methods differ in the method employed to represent a compound and/or target protein. Due to the experimental costs of biophysical assays, many ML studies sought to employ virtual high-throughput screening against a protein target using molecular docking results.

 

During lead optimization, undesirable properties are engineered out of the initial compound. These include off-target binding, lack of bioavailability, metabolic instability, and liver toxicity. The main application of ML in lead optimization has been predicting off-target interactions of compounds against proteins with different binding pockets. Such screening can identify the candidates most likely to fail during preclinical testing due to toxicity. ML has also been used to model toxicity as a continuous variable of interest rather than a binary output and has achieved performance comparable to other models.

 

In addition to the above applications, machine learning has been employed during the very early stages of the drug discovery process. In pharmacophore modeling, the aim is to develop a model of the pharmacophore, the chemical features responsible for target binding. While pharmacophore modeling has a number of advantages, the technique appears to have rarely been applied in conjunction with ML. Despite advances made in the past decade, ML remains underutilized in the drug discovery process. Herein, the chemical and target data sources for machine learning in drug discovery are outlined and the major ML applications in drug discovery are summarized.

5.1. Target Identification and Validation

 

The identification of target proteins such as receptors or enzymes for small molecules, antibodies, or other modalities that modulate their activity forms the basis of drug design and the subsequent drug development process. The accuracy with which targets can be identified and characterized can significantly affect the time and cost of finding new drug candidates. Adverse drug reactions, poor clinical response, and the emergent drug resistance associated with several marketed products can often be traced back to inaccuracies or ambiguities in the original target selection due to the lack of established, reliable methodologies available for exploratory research. Target selectivity effects across multiple closely related family members can be particularly problematic, necessitating a post-discovery phase of experimentation to differentiate the on-target and off-target effects of tested compounds.

 

With recent advances in chemical science, molecular biology, and bioinformatics software, an affordable and favorable path has opened for target identification and validation. The vast accumulation of sequence, structure, 3D-coordinate, ligand, gene/protein expression, biological, molecular docking, and cellular screening data has yielded pharmacophores, full-text search engines, desirability functions for hit triangulation, and ligand-based models for established lead compounds. The growth of systematic high-throughput screening campaigns has aided the generation of rich datasets for statistical modeling of targets. In parallel, software developments and bioinformatic databases have provided sophisticated, easily accessible computational tools to analyze the high-level complexity of large, heterogeneous biological datasets encoded in the protein sequence/structure space. The recent deployment of large microbiological compound libraries against existing or emergent pathogenic targets and receptors has provided a wealth of screening data ripe for exploration.

 

Several of these progressively ambitious endeavors are built on a less exploratory, forward-engineered approach coupled with complex systems analysis and management. As a golden rule of experimental pharmacology, ligands discovered in one assay format can be expected to work against the same target in other formats. Such a characterized, externally validated predictive model can underpin various midstream options, such as chemical series expansion or secondary and knockout-target assay cascades for test compound profiling. However, this belies the more exploratory, discovery-driven target identification and validation methodologies, which in an ideal scenario would be non-expert-driven, black-box, automated pharmacological and computational approaches with ubiquitous applicability across all compounds.

 

5.2. Compound Screening and Prioritization

 

Once potential targets are identified and verified, the focus shifts to discovering small molecules that can modulate their activity. High-throughput screening of compound libraries against target proteins is the traditional approach for lead discovery. However, high-throughput screening experiments are costly and time-consuming and carry a high probability of false negatives and false positives. Therefore, compound prioritization through virtual screening of in silico compound databases is often conducted to enrich the pool of chemicals for testing. Early efforts focused on ligand-based approaches, including pharmacophore and 2D/3D similarity searching. Since the mid-2000s, structure-based approaches incorporating protein structures obtained by X-ray crystallography or molecular modeling became predominant, including docking-based methods and free-energy calculations. Understanding the limitations of each approach, such as docking's strong reliance on receptor structures, has shaped the current methodologies and led to the development of alternative in silico approaches, including ligand-based and structure-based machine learning methods, to mitigate the limitations of traditional virtual screening studies. More recently, deep learning and generative models have shown promise in compound screening and design, leading to increasing interest in applying artificial intelligence to drug discovery.

 

Machine learning and deep learning methods have been employed in the last decade to automate hit and lead discovery through the prediction of compound-target or compound-ligand binding affinity. Representative applications include screening large libraries of approved and investigational drugs using virtual high-throughput molecular docking against proteins, resulting in several compounds being investigated in clinical studies. A collaborative effort between several institutions was also developed to enable structure-based virtual screening using deep learning while accounting for data scarcity, multimodality, and imbalanced datasets. Significant hits were identified against several targets, including various kinases, transmembrane transporters, and GPCRs, with co-crystallized compounds and crystal structures obtained for some of them. A more recent implementation of protein-ligand docking using deep learning biorepresentations can be used standalone or seamlessly combined with existing docking packages. It was validated using a benchmark, outperforming state-of-the-art docking packages in various metrics, such as enrichment factor and hit rate.

 

Once compounds are ranked using virtual screening or machine learning methods, a second round of filtration is usually conducted to eliminate compounds with known off-targets or ADMET liabilities. This can include ligand-based and structure-based toxicity models built on known datasets. In adaptive learning scenarios, filtering models can be iteratively improved by applying them to positive and untested compounds from an experimental study, thus increasing the reliability of the prediction space. As with other in silico methodologies, machine learning predictions need to be interpreted in light of dataset limitations and model use cases (Gaudelet et al., 2021; Huang et al., 2021; Pandey et al., 2022; Mouchlis et al., 2021).
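
The iterative improvement described here resembles an active-learning loop, sketched below with placeholder data: after each simulated experimental round, newly labeled compounds are appended to the training set and the filter model is refit.

```python
# Adaptive filtering sketch: retrain the filter after each (simulated) assay round.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_pool = rng.normal(size=(1000, 16))                               # untested compounds
X_lab, y_lab = rng.normal(size=(50, 16)), rng.integers(0, 2, 50)   # initial assay data

model = LogisticRegression(max_iter=1000)
for _ in range(3):
    model.fit(X_lab, y_lab)
    # Select the pool compounds the model is most confident are active
    p = model.predict_proba(X_pool)[:, 1]
    picked = np.argsort(-p)[:10]
    # In practice these would go to the bench; here the labels are simulated
    y_new = rng.integers(0, 2, size=picked.size)
    X_lab = np.vstack([X_lab, X_pool[picked]])
    y_lab = np.concatenate([y_lab, y_new])
    X_pool = np.delete(X_pool, picked, axis=0)

print("final training-set size:", len(y_lab))
```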

 

5.3. Pharmacophore Modeling

 

Pharmacophore modeling is a concept-based approach to drug design that may be employed in the early phases of the drug discovery process to identify scaffolds. It is a technique for locating or creating lead compounds, scaffolds, or new active molecules that selectively bind to a target, via either target- or ligand-based searches. A pharmacophore is an abstract description of the properties necessary for molecular recognition of a ligand by a target macromolecule. A three-dimensional pharmacophore description defines the spatial disposition and geometrical arrangement of the chemical features. Pharmacophore models may also contain physicochemical properties and requirements that ensure the binding, selectivity, and optimal activity or inhibition of the described ligands. Pharmacophore models provide insight into the biological activity of target and ligand structures: they explicitly depict the geometric positions of features crucial for the activity of a family of similar bioactive compounds, often referred to as active conformers.
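
Pharmacophoric features of the kind described here can be perceived programmatically; the sketch below uses RDKit's built-in feature definitions on an arbitrary example molecule (full 3D pharmacophore models would additionally require conformer generation).

```python
# Sketch: perceive pharmacophoric features (donor, acceptor, aromatic, ...) with RDKit.
import os
from rdkit import Chem, RDConfig
from rdkit.Chem import ChemicalFeatures

fdef_path = os.path.join(RDConfig.RDDataDir, "BaseFeatures.fdef")
factory = ChemicalFeatures.BuildFeatureFactory(fdef_path)

mol = Chem.MolFromSmiles("CC(=O)Nc1ccc(O)cc1")  # paracetamol, as an example
for feat in factory.GetFeaturesForMol(mol):
    # Feature family (e.g. Donor, Acceptor, Aromatic) and the atoms carrying it
    print(feat.GetFamily(), feat.GetAtomIds())
```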

 

A pharmacophore model can be described in terms of the spatial arrangement of a set of chemical features that define the preferential binding of a molecule to a receptor. Stereochemical information is included in model construction, and any attempt to recreate the pharmacophore description must take it into account. Ligand-based pharmacophore models may be created using the known-coordinates model generation approach, in which the arrangement of chemical features is determined, explicitly or implicitly, in the coordinate frame of the biologically active conformers of the ligands. A key area of interest for pharmacophore applications is the electronic nature of descriptors provided by quantum mechanical methods, including the molecular electrostatic potential, partition coefficient, molecular lipophilicity potential, local ionization potential, and local properties such as dipole moment, total charge, atomic multiplicity, and shielding.

 

Currently, a variety of techniques may be employed to fine-tune and investigate pharmacophore models. Independent of the methodology, given a set of candidate models characterizing the pharmacophore description of known active ligands, the task of pharmacophore modeling is to identify the models that describe the preferred binding of the ligand to the target and to discard those that capture features neglected by or irrelevant to the activity. Extensive pharmacophore studies based on ligand-centered searches with a variety of descriptors have shown significant improvement in activity modeling for compound libraries with diverse chemical structures. However, models based on strict 3D geometric descriptors containing manually specified features appear to be less useful for activity enrichment.

 

5.4. Toxicity Prediction

 

The efficient and effective prediction of drug toxicity holds paramount importance in pharmacology and in silico drug design. Several methodologies, including diverse machine learning strategies, have been employed in in silico toxicity prediction. A comprehensive experimental drug toxicity dataset encompassing 33 toxicity endpoints and comprising over 100,000 drug-like compounds has been utilized to investigate these approaches, and validation datasets have been analyzed to evaluate the reliability of the different models. Performance evaluations based on 16 toxicity endpoints have been conducted for 9 machine learning models across 3 toxicity prediction frameworks: individual toxicity endpoint prediction models, multi-output toxicity endpoint prediction models, and compound feature-level models. Results show the feasibility of data-driven toxicity prediction for the majority of toxicity endpoints, with some endpoint-prediction models outperforming comparable chemical descriptor-based computational models.
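
A multi-output toxicity model of the kind evaluated in such studies can be sketched as follows; the descriptors and the four endpoints are random placeholders, so only the structure of the pipeline, one model jointly predicting several endpoints, is meaningful.

```python
# Multi-output toxicity prediction sketch with placeholder data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 32))            # stand-in molecular descriptors
Y = rng.integers(0, 2, size=(400, 4))     # four hypothetical toxicity endpoints

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.25, random_state=0)
model = MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=0))
model.fit(X_tr, Y_tr)

# Per-endpoint accuracy; a real evaluation would also report AUC per endpoint
acc = (model.predict(X_te) == Y_te).mean(axis=0)
print("per-endpoint accuracy:", acc)
```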

 

In contemporary pharmaceutical research, accurately predicting drug toxicity in biological systems remains a significant challenge, despite substantial advancements in drug design tools. Experimental in vitro toxicity assessment for early-stage drugs is time-intensive, costly, and complex, and therefore in need of improvement. Nevertheless, understanding a compound's toxicological profile at the very outset of drug development is crucial for reducing attrition. As a result, there is growing interest in the development of computational toxicity prediction methodologies. Computational toxicity prediction models can be classified into three categories: QSAR or SAR models based on chemical structure, protein binding energy/ligand interaction models based on 3D structure, and physicochemical property-based predictive modeling approaches.

 

A recent study has focused on the development of novel toxicity-endpoint prediction models based on the aforementioned descriptor types and toxicity database. The developed computational models have the potential to be of great practical value for drug toxicity prediction in pharmaceutical research and development, for the monitoring and regulation of industrial chemicals, and for the screening of drug candidates prior to toxicity testing on biological systems. Furthermore, the study's toxicity dataset has been made publicly available, together with an open-access chemical descriptor dataset, through a user-friendly interface. This interface may serve as a powerful utility for drug toxicity prediction prior to in vitro testing, thus accelerating the pace of drug development.

 

6. Case Studies and Success Stories

 

Although the application of machine learning (ML) to research and development for new drugs is still in its early days, there have been some case studies and pilot projects demonstrating its capabilities. Some of the case studies include:

 

Deep Learning in Drug Discovery

 

In 2015, researchers developed a deep learning model called DeepDock that showed strong performance in docking ligands to proteins, demonstrating that existing knowledge in molecular modeling can be captured effectively by a deep learning architecture.

 

In 2017, a recurrent neural network was created to predict the binding affinities of ligands to proteins, putting machine learning on a par with molecular dynamics-based scoring algorithms; the model showed better accuracy than traditional state-of-the-art algorithms.

 

A more ambitious project, begun around 2016, aimed to discover new compounds using a newly developed fusion of deep reinforcement learning, model-based generative modeling, and variational autoencoders. It was shown how existing in silico and high-throughput screening methods could be improved by several orders of magnitude.

 

Drug Discovery Startups Utilizing Machine Learning

 

Numerous startups are focused on applying machine learning to drug discovery. These include:

 

– Atomwise — Utilizing supercomputers and ML to predict interactions between small molecules and proteins.

 

– BioSymetrics — Focused on disease states and biomarker patterns for early identification of drugs.

 

– Berg — Employing computational biology to use disease physiology in drug discovery.

 

– BenevolentAI — Combining AI and machine learning with life sciences knowledge to focus on patient needs rather than historical scientific connections.

 

– Gannett Fleming — Using analytics, big data, and reasoning methods to track innovations and identify winning opportunities in drug discovery.

 

– nference — Making biomedical knowledge accessible for complex tasks and enabling organizational innovation.

 

– Numerate — Building in silico technologies to identify, design, and optimize small molecules in parallel with medicinal chemistry.

 

These startups indicate the growing acceptance and anticipation of using machine learning in the drug discovery process. Several major pharmaceutical companies have established collaborations with small focused drug discovery companies or academic research groups.

 

The pharmaceutical industry is again on the brink of major advances that could dramatically improve the productivity and reliability of the drug discovery stage of internally discovered compounds. It is suggested that moonshot projects should be broadly encouraged as they could lead to major breakthroughs that individual startups or companies could not achieve alone.

 

6.1. Deep Learning in Drug Discovery

 

Numerous drug discovery initiatives are successful commercial applications of deep learning architectures that were developed and tested on academic datasets. As with any emerging technology, a healthy amount of skepticism should be maintained, and next-generation ML systems should not be marketed as panaceas. Recent successes and returns on investment indicate where data-driven drug discovery efforts should start. Pharma companies generally have a wealth of existing chemical biology and data resources collected over decades. Although there is no silver bullet, this section discusses how to get on the right track with data-driven drug discovery and what new investments may speed up improvements to incumbent business models.

 

Vast databases of compounds and proteins have been constructed and maintained over decades within Big Pharma. A common language, SMILES text, bridges the gulf between chemistry as practiced in silico and in the laboratory. Given chemistry's proclivity for inventive surprises, it is no surprise that flexible in silico representations of chemistry acquired broad appeal decades ago. Meticulously curated databases have been guarded as vital intellectual property, as they encode decades of effort measured in accountable work hours.

 

Relying on the intelligent and imaginative synthesis of apparent diversity, commercial vendors filled catalogs with innumerable, seemingly diverse compounds, diversity that was often cheaply gained through licensing and co-marketing arrangements. While the proprietary portfolio of any single drug discovery effort is all but certain to remain small, economics dictates that it be bolstered with vast volumes of more homogeneous high-throughput screening libraries imported from such vendors.

 

Publicly accessible chemical corpora have sought to quantify the fruitful serendipity of drug discovery. This attitude toward chemistry, espoused by bioinformatics stakeholders, is consistent with pharmacological wisdom about what chemists should aim for in the further development of lead candidates. Drug discovery stakeholders should focus their efforts on building upon proprietary in-house knowledge bases of accessible lead candidates. This is paramount, as the broadest and most significant datasets are seldom available outside the most secretive organizations, where competitive edge is battle-tested. These enormous knowledge bases of reasonable quality dwarf publicly available in silico descriptions of chemistry, and it is in particular the combination of chemical and biological datasets that lends credence to chemistry-driven machine learning architectures. Such architectures should be able to explicitly extract the chemical knowledge gained from high-throughput screening of in-house compounds, knowledge that otherwise remains hopelessly siloed on account of Pharma's proclivity for trade-secret protection of bona fide experimental inputs and outputs.

 

6.2. Drug Discovery Startups Utilizing Machine Learning

 

Several biotechnology startups have begun utilizing machine learning techniques in drug discovery; a few relevant case studies are explored here. Atomwise and Insilico Medicine are both drug discovery startups that use machine learning to identify small-molecule candidates capable of modulating the activity of target proteins with known 3D structures. Known structures permit structure-based approaches such as molecular docking and machine learning-based scoring functions. These companies combine publicly available biomedical database resources with their respective computational modeling expertise to discover new compounds.

 

Atomwise, Inc., based in San Francisco, California, has taken a high-throughput computational approach to virtual screening for drug discovery. Atomwise's proprietary platform uses deep learning architectures built on convolutional neural networks applied to 2D and 3D molecular representations. Atomwise has partnered with both small and large pharmaceutical companies to provide drug discovery services in a fee-for-service model. Its computational screening of commercially available compound libraries against a target, followed by in vitro assays, has yielded multiple classes of down-modulating small molecules.
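
The following is a minimal sketch of the general idea behind CNN-based scoring functions, not Atomwise's proprietary architecture: a protein-ligand complex is voxelized into a 3D grid of atom-type channels, and a small 3D convolutional network regresses a binding score. The channel count, grid size, and layer widths are arbitrary assumptions for illustration.

```python
# Illustrative 3D CNN scoring function over a voxelized protein-ligand complex.
# Channel count, grid size, and layer sizes are arbitrary choices for the sketch.
import torch
import torch.nn as nn

class Scoring3DCNN(nn.Module):
    def __init__(self, n_channels: int = 8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(n_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),            # 24^3 -> 12^3
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),            # 12^3 -> 6^3
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 6 * 6 * 6, 128),
            nn.ReLU(),
            nn.Linear(128, 1),          # predicted binding score
        )

    def forward(self, grid: torch.Tensor) -> torch.Tensor:
        return self.head(self.features(grid))

# One hypothetical complex: 8 atom-type channels on a 24x24x24 voxel grid.
grid = torch.randn(1, 8, 24, 24, 24)
print(Scoring3DCNN()(grid).shape)  # torch.Size([1, 1])
```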

 

Similar to Atomwise, Inc., Insilico Medicine is a biotech startup employing machine learning methods in drug discovery. Insilico Medicine, based in Rockville, Maryland, with branches in China and Russia, aims to drive down the time and financial cost of drug discovery for both target discovery and small-molecule lead optimization. The company operates a service that offers clients access to proprietary computational models applicable to drug discovery, biomarker discovery, and drug repurposing. Insilico Medicine has developed deep generative network technologies trained on large datasets of drug information. The first application of generative adversarial networks to small-molecule drug discovery was published by Insilico Medicine and is still considered a substantial contribution to the field. This publication spurred interest and enthusiasm for generative deep neural networks in drug discovery, leading to a wave of new startups applying deep learning to small-molecule discovery.
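
To make the generative idea concrete, here is a deliberately toy GAN sketch that generates fixed-length molecular descriptor vectors rather than actual molecules; published molecular GANs operate on SMILES strings or molecular graphs and are far more elaborate. All sizes and the stand-in "real" data are assumptions for illustration.

```python
# Toy GAN over fixed-length molecular descriptor vectors (not real molecules).
# Vector sizes, network widths, and training data are illustrative assumptions.
import torch
import torch.nn as nn

LATENT, DESC = 16, 64  # latent noise size, descriptor vector size

generator = nn.Sequential(
    nn.Linear(LATENT, 128), nn.ReLU(),
    nn.Linear(128, DESC), nn.Tanh(),   # produces a fake descriptor vector
)
discriminator = nn.Sequential(
    nn.Linear(DESC, 128), nn.ReLU(),
    nn.Linear(128, 1),                 # real-vs-fake logit
)

opt_g = torch.optim.Adam(generator.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

real_data = torch.randn(256, DESC)     # stand-in for known drug descriptors

for step in range(200):
    # Discriminator update: real vectors -> 1, generated vectors -> 0.
    z = torch.randn(64, LATENT)
    fake = generator(z).detach()
    real = real_data[torch.randint(0, 256, (64,))]
    d_loss = (loss_fn(discriminator(real), torch.ones(64, 1)) +
              loss_fn(discriminator(fake), torch.zeros(64, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator update: try to fool the discriminator.
    z = torch.randn(64, LATENT)
    g_loss = loss_fn(discriminator(generator(z)), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```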

 

7. Ethical and Regulatory Considerations

 

The integration of machine learning techniques into drug development represents a significant transformation of the pharmaceutical research landscape. Although there are concerns related to ethical use and the regulatory environment, the development and implementation of these technologies could pave the way for the discovery of novel therapeutics. From an ethical standpoint, there are concerns regarding data usage and ownership. The training and validation of machine learning algorithms require extensive amounts of data, often derived from publicly available datasets. There is a conflict between the generation of high-quality data and the public accessibility of that data. On one hand, the level of data curation and quality within some datasets has resulted in their domination in the field, which may artificially decrease the diversity of preclinical and clinical data used by different groups. Furthermore, there may also be a need to consider the implications if firms that hold such datasets are bought out by larger pharmaceutical companies. As technology advances and the ability to generate large amounts of potential pharmaceutical candidates grows, consideration should be given to the types of control that a single entity may have over pharmacological research. Up until now, there have mostly been technological transformations in the research landscape that have shifted development and production strategies outside of the realm of individual scientists or academic institutions. There is an opportunity for similar transformations to exist on the data ownership side, which could control and constrain pharmaceutical research on a global scale. However, such scenarios are not inevitable, and it is for the scientific community to argue for the importance of data accessibility for successful and ethical drug discovery efforts.

The adoption of machine learning models in the drug discovery pipeline also raises concerns surrounding intellectual property and patenting. Algorithms with the capacity to independently generate novel candidates may challenge the existing patent system's ability to determine ownership of those candidates. In general, there is uncertainty regarding how patents would apply to algorithms and whether generated candidates would even satisfy current patent eligibility criteria. Furthermore, developing such algorithms and models is both time- and resource-consuming, creating legal ambiguity over whether newly generated candidates constitute publicly available information or proprietary assets. As a result, pressure is building on the scientific community to advocate for policy reform that considers how the existing patent system should evolve to handle intellectual property in the machine learning era.

 

From a regulatory standpoint, there are challenges in incorporating new machine learning approaches into existing frameworks. Understandably, the FDA seeks to regulate the use of these models without stifling innovation or growth, given the pace at which the technology has been adopted by companies. Compared with models that rely on conventional physicochemical characteristics of the compounds alone, machine learning models can behave as 'black boxes,' making it difficult for researchers or regulatory agencies to determine how a result was reached. In the public sector, the growing use of machine learning in clinical trial decision-making has raised concern about algorithms that are biased or otherwise difficult to interpret. In light of these considerations, attention has focused on how best to provide transparency while supporting business interests and technological advances. With the growing desire to integrate machine learning into approval and post-market monitoring processes, the problem of model transparency is expected to persist.

 

8. Future Directions and Emerging Trends

 

The application of machine learning techniques in drug discovery is still in its infancy but is gaining momentum. Supervised learning, in which a model is trained on a fixed data set and then used to generate predictions for test data, is becoming increasingly important but remains limited in drug discovery. Other machine learning approaches, such as reinforcement learning and deep generative models, can also be applied to discover new drug-like molecules, with interesting results. New approaches for solving the 3D structure of membrane proteins have been or are rapidly being developed, spanning experimental, computational, and hybrid methods. Another major trend is graph-based modeling, in which atom-level connectivity is used directly as the molecular representation for 2D and 3D graph convolutional neural networks. Combining 3D and 2D properties in a single network to capture different aspects of the compounds has been suggested, and, in conjunction with graph-based approaches, various manifold alignment methods have emerged. On the post-screening side, new experimental techniques capable of dealing with huge libraries are emerging, such as aptamer selection, contact array chips combined with mass spectrometry, and nanolitre-scale magnetic beads for biopanning. Nano-scale probes for on-chip detection appear well suited to conjugate peptide screens against large libraries. The challenge for these emerging experimental technologies and the corresponding data processing methods is to provide guidelines for informative library design and to choose wisely which sub-library should be probed experimentally.
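
To illustrate the graph-based representation mentioned above, the sketch below converts a SMILES string into the node-feature and adjacency arrays that a graph convolutional network would consume. The particular atom features chosen (atomic number, degree, aromaticity) are an illustrative assumption, not a fixed standard.

```python
# Sketch: molecular graph (node features + adjacency) from SMILES via RDKit.
# The feature selection here is an arbitrary example.
import numpy as np
from rdkit import Chem

def mol_to_graph(smiles: str):
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()
    # Node features: one row per atom.
    feats = np.array([
        [a.GetAtomicNum(), a.GetDegree(), int(a.GetIsAromatic())]
        for a in mol.GetAtoms()
    ], dtype=float)
    # Symmetric adjacency matrix built from the bond list.
    adj = np.zeros((n, n))
    for b in mol.GetBonds():
        i, j = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
        adj[i, j] = adj[j, i] = 1.0
    return feats, adj

feats, adj = mol_to_graph("c1ccccc1O")  # phenol: 7 heavy atoms
print(feats.shape, adj.shape)           # (7, 3) (7, 7)
```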

 

With regard to cheminformatics, efforts are being made to better incorporate the experimental molecular complexity revealed by screening, including binders with unknown structures, more flexible binders, and cooperative binding. Methods to predict a binder's stability upon binding are also highly desired. Another notable trend is the increasing availability of open-access data sets, and the need for benchmark data sets has been pointed out. There is also a shift in focus from designing libraries of novel compounds to predicting which compounds within existing libraries can bind the target of interest. Given the huge computational cost of running screening experiments against large databases, there are attempts to evaluate the effectiveness of different data reduction techniques and the possibility of developing 'alternative' compound libraries better suited to structure-based virtual screening. A thought-provoking challenge for the academic community is to get involved in drug discovery programs, as a couple of initiatives have already put into practice; two compounds from those start-ups have already entered preclinical phases. (Gupta et al., 2021)(Patel et al., 2020)(Dara et al., 2022)(Jiménez-Luna et al., 2021)(Vatansever et al., 2021)(Carracedo-Reboredo et al., 2021)(Mouchlis et al., 2021)(Pandey et al., 2022)(Huang et al., 2021)(Gaudelet et al., 2021)
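
As one hedged example of such a data reduction technique, the sketch below uses RDKit's MaxMin diversity picker to select a small, maximally dissimilar sub-library from a larger catalog, so that only a representative subset need be screened. The library is a toy placeholder.

```python
# Sketch: diversity-driven reduction of a screening library with RDKit's
# MaxMin picker, so only a representative sub-library is screened.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

# Hypothetical library of SMILES strings.
library = ["CCO", "CCN", "c1ccccc1", "c1ccccc1O", "CC(=O)O", "CCCC"]
mols = [Chem.MolFromSmiles(s) for s in library]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 1024) for m in mols]

picker = MaxMinPicker()
# Pick 3 maximally dissimilar compounds (Tanimoto distance on fingerprints).
picks = picker.LazyBitVectorPick(fps, len(fps), 3)
print([library[i] for i in picks])
```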

 

9. Conclusion and Summary

 

The convergence of artificial intelligence and pharmaceuticals has heralded a new epoch in drug discovery, marked by the rise of machine learning as a powerful tool for elucidating molecular activity and designing versatile molecular scaffolds. Through the collective efforts of scholars and pharmacologists, both the scope and the speed of drug discovery have been enhanced many-fold. Molecular models, drug modeling, and quantitative structure-activity relationships remain the bedrock of rational design, but that process is onerous, time-consuming, and constrained by reliance on conventional 'lead' compounds. Now that machine learning occupies a central place in bioinformatics, cheminformatics, and related disciplines, the harvesting of biological data has shifted from a linear, serial approach to one of continuous, multi-channel analysis that extracts the maximum signal from noisy, confounded measurements.

 

While small-molecule research has already absorbed machine learning extensively, particularly in the crafting of sophisticated and robust molecular descriptors, there remain many areas that machine learning has not yet penetrated as fully as it has the so-called 'pure' physical sciences, given the opacity, complexity, and sheer volume of the datasets involved. Machine learning-based procedures still carry uncertainties concerning interpretability and domain applicability, and the rapid, largely unregulated release of methods and data in this burgeoning discipline complicates validation by large, established laboratories. Nevertheless, this mélange of big data and machine learning cannot but revitalise drug design and discovery. (Gaudelet et al., 2021)(Huang et al., 2021)(Pandey et al., 2022)

References:

Gaudelet, T., Day, B., Jamasb, A. R., Soman, J., Regep, C., Liu, G., … & Taylor-King, J. P. (2021). Utilizing graph machine learning within drug discovery and development. Briefings in Bioinformatics, 22(6), bbab159.

Huang, K., Fu, T., Gao, W., Zhao, Y., Roohani, Y., Leskovec, J., … & Zitnik, M. (2021). Therapeutics data commons: Machine learning datasets and tasks for drug discovery and development. arXiv preprint arXiv:2102.09548.

Pandey, M., Fernandez, M., Gentile, F., Isayev, O., Tropsha, A., Stern, A. C., & Cherkasov, A. (2022). The transformational role of GPU computing and deep learning in drug discovery. Nature Machine Intelligence, 4(3), 211-221.

Mouchlis, V. D., Afantitis, A., Serra, A., Fratello, M., Papadiamantis, A. G., Aidinis, V., … & Melagraki, G. (2021). Advances in de novo drug design: from conventional to machine learning methods. International Journal of Molecular Sciences, 22(4), 1676.

Gupta, R., Srivastava, D., Sahu, M., Tiwari, S., Ambasta, R. K., & Kumar, P. (2021). Artificial intelligence to deep learning: machine intelligence approach for drug discovery. Molecular Diversity, 25, 1315-1360.

Patel, L., Shukla, T., Huang, X., Ussery, D. W., & Wang, S. (2020). Machine learning methods in drug discovery. Molecules.

Dara, S., Dhamercherla, S., Jadav, S. S., Babu, C. M., & Ahsan, M. J. (2022). Machine learning in drug discovery: a review. Artificial Intelligence Review, 55(3), 1947-1999.

Jiménez-Luna, J., Grisoni, F., Weskamp, N., & Schneider, G. (2021). Artificial intelligence in drug discovery: recent advances and future perspectives. Expert Opinion on Drug Discovery, 16(9), 949-959.

Vatansever, S., Schlessinger, A., Wacker, D., Kaniskan, H. Ü., Jin, J., Zhou, M. M., & Zhang, B. (2021). Artificial intelligence and machine learning-aided drug discovery in central nervous system diseases: State-of-the-arts and future directions. Medicinal Research Reviews, 41(3), 1427-1473.

Carracedo-Reboredo, P., Liñares-Blanco, J., Rodríguez-Fernández, N., Cedrón, F., Novoa, F. J., Carballal, A., … & Fernandez-Lozano, C. (2021). A review on machine learning approaches and trends in drug discovery. Computational and Structural Biotechnology Journal, 19, 4538-4558.

Elbadawi, M., Gaisford, S., & Basit, A. W. (2021). Advanced machine-learning techniques in drug discovery. Drug Discovery Today.

Patel, V., & Shah, M. (2022). Artificial intelligence and machine learning in drug discovery and development. Intelligent Medicine.

Zhao, L., Ciallella, H. L., Aleksunes, L. M., & Zhu, H. (2020). Advancing computer-aided drug discovery (CADD) by big data and data-driven machine learning modeling. Drug Discovery Today.