The Application of Machine Learning in Chemistry Research

1. Introduction to Machine Learning and Chemistry

Numerous cutting-edge technologies and methodologies have been proposed for knowledge discovery in various research fields. One of the most important emerging methodologies among them is machine learning. In general, machine learning can be seen as the science of finding patterns and structure in vast volumes of data. In other words, the main objective of machine learning is to extract "implied" or hidden knowledge from data that we cannot uncover solely through human intuition. Nowadays, machine learning is widely used across research fields, including artificial intelligence, statistics, neural networks, robotics, and pattern recognition. For instance, machine learning has become an invaluable tool for image and voice recognition through supervised learning, unsupervised learning, and reinforcement learning. Indeed, much recent effort has been made to show how machine learning can be incorporated into conventional chemical physics research.

 

Most research in chemistry is devoted to predicting molecular properties and behavior from theory and experiments. Such predictions become increasingly complex for larger molecular systems and materials. As a result, researchers rely on computer simulations to explore chemical and material systems. When more systematic learning of material properties of interest from computer simulations or experiments is required, machine learning can provide a framework to construct a series of models that capture the key properties of the system. This tutorial aims to introduce the strategy and techniques to develop machine learning models useful for chemical properties and phenomena. In particular, the relationship between the types of available data, the research question, and the candidate machine learning methods should be considered. The tutorial also discusses the problems that often occur when training machine learning models, the possible sources of errors, and the tools to evaluate models and understand why they fail or succeed, allowing chemists without a strong machine learning background to critically evaluate the relevance and reliability of these models.

 

1.1. Overview of Machine Learning

 

The scientific domain of machine learning examines how to design algorithms that learn autonomously from data. Systems employing machine learning enhance their own behavior by detecting patterns automatically. Machine learning is a branch of artificial intelligence primarily concerned with enabling a program to develop an adequate model based on previous experience; the program then applies the acquired knowledge to make decisions or forecast future results in light of new data or circumstances. It has a wide range of potential applications, including spam filtering, face recognition, and medical diagnosis.

 

A fundamental skill in machine learning is building a model that captures the data's patterns well enough to make accurate predictions. The ultimate goal is a model that encapsulates the fundamental principles and mechanisms by which the data were created. To be confident in the model, one must design an algorithm that not only fits the data but also generalizes to previously unseen data. In regression, the primary objective is to forecast or understand numerical outcomes; k-nearest neighbors, linear regression, decision trees, and random forests are common regression models. Binary and multi-class classification approaches may be used to forecast or understand class labels; these are supported by algorithms such as logistic regression, decision trees, random forests, k-nearest neighbors, support vector machines, and neural networks.
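As a minimal sketch of the regression models listed above, the following fits each to the same synthetic dataset using standard scikit-learn APIs; the dataset itself is artificial and chosen only for illustration.

```python
# Sketch: fitting several common regression models to one synthetic dataset.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))            # one synthetic feature
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)  # noisy nonlinear target

models = {
    "linear": LinearRegression(),
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, round(model.score(X, y), 3))     # R^2 on the training data
```

Because the target here is nonlinear, the tree-based models fit it more closely than linear regression; comparing such scores (on held-out data, in practice) is how one chooses among candidate models.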

 

1.2. Overview of Chemistry Research

 

Chemistry comes from the word "al-kīmiyā," said to mean "knowledge of the composition." In modern terms, chemistry is a branch of science. In 1911 it was defined as "the science of substances," and in 1927 as "the science of substances and of the changes which they undergo." Chemistry is also defined in other ways, such as "the study of matter, its properties, the changes that matter undergoes, and the energy changes associated with those processes." According to this definition, pure chemistry, the fundamental science, attempts to answer the questions "What is it?", "What can it do?", and "What does it do?" Modern chemistry, which seeks applications in all areas of life, often also asks "Why?"

 

Chemistry research is carried out in many fields required by society. It not only meets the basic needs of the community but also contributes to the development of daily life. In the modern era, advances in computer and internet technology have created new needs. In chemistry these include searching for new compounds, predicting protein structures or ligand-protein interactions, and developing knowledge of structure-property relationships needed to design hitherto unknown materials. Computational chemistry is the subdivision of chemistry that incorporates simulations and calculations into chemistry research. Given the substantial amount of work in simulating molecules on classical computers, these data sets can be, and increasingly are, used to train machine learning (ML) algorithms.

 

2. Fundamentals of Machine Learning

 

Machine learning is a part of the field of artificial intelligence that involves developing algorithms and models capable of learning from data to make predictions or decisions. It has been widely used in the field of computer science and has been applied to all sorts of problems, ranging from playing games and natural language processing to managing self-driving cars.

 

In order to train these models, one needs reproducible data divided into two sets: a "training" set, containing inputs paired with their currently known desired outcomes, used to fit the algorithm, and a "test" set used to validate the model. There exist various types of machine learning, grouped under three main procedural types: supervised learning (SL), unsupervised learning (USL), and reinforcement learning (RL). SL is as described above. USL algorithms are trained on unlabeled data: one is given input values and expected to pick out trends, classes, or structures without knowing beforehand what to look for. USL is often used to group similar data points together (e.g., k-means clustering). Finally, RL algorithms build a strategy of actions guided only by an occasional reward. Say, for example, a computer program is tasked with playing a game: every (large) number of steps, something good occurs, and the program's goal is to capitalize on the chance. Given that most chemistry topics rely on knowing both the reactants and products in an experiment, we will focus primarily on SL in this review.
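The training/test protocol described above can be sketched with scikit-learn's train_test_split on a toy dataset; all values here are illustrative.

```python
# Sketch: holding out a test set for validation, as described in the text.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 examples, 2 features (toy data)
y = np.arange(50) % 2               # toy labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(len(X_train), len(X_test))    # 40 training examples, 10 held-out
```

The held-out 20% never touches the fitting step, so its score estimates how the model will behave on genuinely new data.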

 

2.1. Supervised Learning

 


 

Supervised learning is a type of machine learning algorithm that uses labeled inputs, which means it learns from the training dataset. In other words, supervised learning is about establishing rules that replicate an input-output behavior. There are three primary methods of supervised learning: classification, regression, and ranking.

 

2.1.1. Classification

Classification is a kind of supervised learning technique that categorizes inputs into classes. An algorithm models the mapping function from input to output and is trained with labeled data: the inputs are split into examples, each carrying a label. Classification can take several forms: binary (two classes), multiclass (more than two classes), or multi-label (more than one class per example). Common applications of classification include filtering, ranking, and identification problems. In chemistry research, this method can be used, for instance, for activity label prediction by assessing various ligand-based activity descriptors and physical and chemical descriptors. Classification is also a workable method to predict off-targets for drugs, a binary task evaluated by computing model accuracy, precision, and recall.
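The accuracy, precision, and recall mentioned above can be computed directly with scikit-learn; the labels below are made up purely for illustration of a binary "off-target" style task.

```python
# Sketch: evaluation metrics for a binary classification task.
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]   # hypothetical ground-truth labels
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]   # hypothetical model predictions

print("accuracy:", accuracy_score(y_true, y_pred))    # 6/8 = 0.75 correct
print("precision:", precision_score(y_true, y_pred))  # 3/4 of predicted positives are true
print("recall:", recall_score(y_true, y_pred))        # 3/4 of true positives recovered
```

Precision and recall matter in chemistry screening because false positives waste synthesis effort while false negatives discard viable candidates; accuracy alone hides that trade-off.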

 

However, linear discriminant analysis is not a recommended algorithm for this problem, as it has a limited decision region and classifies only certain groups well. Kim et al. developed a convolutional neural network-based model called DeepHook, used to build a classification model for identifying consecutive nucleosome discs and for predicting the binding positions of transcription factors. For another complex classification problem, scientists developed an SVM-based predictive model for mammalian lethal activity; the model showed good performance, especially in terms of sensitivity (the ability to predict true positives). Sumasri et al. developed predictive models based on a deep neural network and an SVM for classifying anti-breast-cancer peptides.

 

2.2. Unsupervised Learning

 

Unsupervised learning, correspondingly, does not require instructional supervision. Typically, its task is to find patterns in data without specifying a target feature or output: it autonomously discovers patterns, clusters, or "latent" relationships. Such methods have become popular with chemical sensing data, helping analysts categorize large amounts of chemical data. There are several prominent real-life applications of unsupervised learning. One of the best examples is market segmentation to understand customer purchase behavior: analysts use the age of a customer and the products they buy to classify the type of consumer. Another day-to-day use of unsupervised learning is the discovery of hidden topics in a text document. An analyst can take a contract document or a patient's medical history and classify the discussion areas, such as "personal information", "confidential information", "medical history", "allergy", "immunization", and "treatment/prescription".
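As a minimal sketch of the clustering idea above, k-means can group unlabeled points into clusters without being told what to look for; the two point groups below are synthetic.

```python
# Sketch: k-means discovers two groups in unlabeled data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(30, 2))  # one synthetic group
blob_b = rng.normal(loc=5.0, scale=0.5, size=(30, 2))  # a well-separated group
X = np.vstack([blob_a, blob_b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
# Points generated from the same blob should land in the same cluster.
print(set(labels[:30]), set(labels[30:]))
```

Note that the algorithm never sees which blob a point came from; the grouping emerges from the data alone, which is exactly the unsupervised setting described above.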

 

In chemistry research, predictive models are generally created to relate chemical entities, namely atoms, substructures, or molecules, to an associated property. Regression, classification, deep learning, neural networks, and gradient boosting are predominantly used as supervised learning frameworks, and most supervised algorithms perform best on high-quality labeled data. The availability of labeled data is the biggest drawback of supervised algorithms such as decision trees and random forests compared with unsupervised algorithms such as k-means. Labeling is time-consuming and hard to scale because a skilled chemist must often determine the label for each property. Sometimes, moreover, relationships in chemical data are latent and can only be recognized by leveraging unsupervised learning. Unsupervised algorithms such as DBSCAN, PCA, ICA, NMF, t-SNE, and UMAP play a prominent role in analyzing molecular distributions, feature extraction, dimensionality reduction of molecular fingerprints, batch-effect correction, clustering, and molecular embedding.

 

2.3. Reinforcement Learning

 

Reinforcement learning (RL) is a machine learning framework in which one or more agents interact with an environment: the agent receives state observations from the environment and acts on it, potentially receiving a reward signal that it seeks to maximize; deep reinforcement learning, in which neural networks parameterize the agent, is a subset of this framework. Reinforcement learning inherently takes a long-term perspective. There are no shortcuts: learning quality control and purity specifications takes many generations and environments; preventing epidemic chaos requires learning the aggregate effects of new patients entering a community and taking preventive or mitigative actions; and discovering conditions that allow a reaction to occur can require exploring thousands of reaction conditions. In molecular design, robust target prediction and de novo molecular design require exploration of synthesized and unsynthesized chemical spaces. As such, RL is the framework of choice for autonomous experimentation and molecular design.
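The reward-maximization loop described above can be sketched in its simplest form: an epsilon-greedy agent choosing among hypothetical "reaction conditions", each with an unknown mean yield. Everything here is synthetic and not any published RL algorithm for chemistry.

```python
# Minimal sketch of explore/exploit reward maximization (a multi-armed bandit).
import random

random.seed(0)
true_yields = [0.2, 0.5, 0.8]          # hidden mean reward per condition
counts = [0, 0, 0]
estimates = [0.0, 0.0, 0.0]

for step in range(2000):
    if random.random() < 0.1:          # explore: try a random condition
        arm = random.randrange(3)
    else:                              # exploit: pick current best estimate
        arm = estimates.index(max(estimates))
    reward = true_yields[arm] + random.gauss(0, 0.05)  # noisy observed yield
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]  # running mean

print("best condition:", estimates.index(max(estimates)))
```

Real chemistry RL replaces the three "arms" with a vast action space of conditions or molecular edits, but the core loop of acting, observing a reward, and updating a value estimate is the same.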

 

Chemical systems often require model-free autonomous experimentation due to several constraints, such as: (a) a lack of scientific background for low-molecular-weight intermediates; (b) poorly understood reaction conditions; or (c) little experimental data to mine. Regarding ongoing work, initiatives to develop in silico catalyst-steering tools for mono- and polymetallic systems are underway, as well as accelerators for organic synthesis. RL-based systems have proven capable of accelerating product discovery in the context of retrosynthesis, that is, working out how to break down an organic molecule into simpler precursors (educts). However, it is important to stress that exploration in chemical space is fraught with complex correlations enforced by, e.g., hyperconjugation, allylic strain, steric crowding, conjugation, and through-bond electronic effects: these relationships tie individual choices of atomic substructures in an organic molecule to the efficiency of elementary experimental processes involving intermediates such as radicals, carbocations, carbenes, oxetanes, oxiranes, and oxaziridines. To date, a combination of algorithms (interactive Bayesian optimization with RL) used as an autonomous reactor controller has continuously improved the yield of more than a dozen prima facie unrelated reactions in a multi-day experiment, relative to a human operator. Notably, the true bottleneck in such closed-loop autonomous work, we believe, is the limited rate and scope of work that collaborators can physically supply. (Alhazmi et al., 2022)(Bennett and Abolhasani, 2022)(Coley et al., 2020)

 

3. Chemoinformatics and Computational Chemistry

 

Chemoinformatics studies the research and application of new drugs, focusing on large-scale screening of bioactive compound libraries rather than on the selection of "drug target" proteins. In the early stages, certain rules are found; new compounds of a given structure are then designed based on these rules, and their activity is tested again. However, there are some inevitable difficulties in this process: knowing the principle and knowing how to apply it are two different things, and new compounds may hit new targets. This experimental cycle can take several years. Although computational chemistry handles large amounts of data, it is limited by the assumptions stacked into its methods. In fact, traditional methods and deep neural networks (DNNs) are complementary.

 

Machine learning is applied in theoretical chemistry, such as molecular dynamics, quantum mechanics, and computational chemistry, as well as in chemoinformatics. Chemoinformatics relies on computational chemistry; traditional computational methods have limited modeling capability, but enhanced-sampling methods based on quantum mechanics and molecular dynamics have made deep learning methods possible. In quantum mechanics, the Schrödinger equation cannot be solved exactly for all but the simplest systems, so methods divide into brute-force numerical approaches and approximate methods; variational methods are among the most widely used of the latter. QSAR descriptors based on quantum-mechanical calculations, combined with topological and electronic descriptors accounting for electron sharing, have also been established, and DNNs are particularly useful for studying such mixed descriptors. Because quantum-mechanical and chemical feature descriptors have physical meaning, models built on them are easier to interpret.

 

3.1. Molecular Descriptors

 

The most important molecular features used here are low-dimensional molecular descriptors, such as constitutional or topological indices and electronic and thermodynamic indices. These descriptors are significant because they allow an original molecule, annotated by a physicist or chemist, to be converted into a numeric vector that can be fed into machine learning models. Some molecular descriptors can be interpreted physically or chemically, such as boiling points, HOMO-LUMO energy gaps, and predicted excited states.
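The molecule-to-vector step described above can be sketched in miniature. Real work would use a descriptor library such as RDKit; here a toy "constitutional" descriptor (element counts) is computed from a molecular formula string, purely for illustration.

```python
# Sketch: converting a molecule (here, its formula) into a numeric vector.
import re

def formula_to_vector(formula, elements=("C", "H", "N", "O")):
    """Count occurrences of selected elements in a formula like 'C2H6O'."""
    counts = dict.fromkeys(elements, 0)
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol in counts:
            counts[symbol] += int(number) if number else 1
    return [counts[e] for e in elements]

print(formula_to_vector("C2H6O"))    # ethanol -> [2, 6, 0, 1]
print(formula_to_vector("C6H5NO2"))  # nitrobenzene -> [6, 5, 1, 2]
```

Once every molecule maps to a fixed-length vector like this, any of the models discussed earlier can be trained on the result; richer descriptors simply add more informative components to the vector.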

 

In molecular design, molecular descriptors can also serve as the observed endpoints to be optimized in tasks such as molecular generation or synthesis planning, since these tasks aim to optimize such features. One aim of chemistry is to produce a compound in the fewest steps, because this is more cost-effective, more energy-efficient, and can minimize waste. This contrasts with black-box algorithms, where the input layer of molecular descriptors is merely a set of numbers carrying no explicit chemical information: such an obscured input layer prevents explaining the predictions of the black-box model or performing feature engineering. It may also be interesting to focus on reducing the number of molecular descriptors and on alternate methods of designing descriptors, such as fuzzy machine learning.

 

3.2. Quantitative Structure-Activity Relationship (QSAR)

 


 

Quantitative structure-activity relationship (QSAR) modeling is based on the concept that similar structures have similar activities. QSAR is a research field that uses computer-aided methods to describe, in mathematical form, the relationship between structure and activity. The field mainly uses molecular electronic wave functions, spectral observations, in vitro and in vivo activity or pharmacokinetics, and other data to establish a multivariate (often three-dimensional) relationship between the physicochemical properties and the biological activity of a series of organic compounds. This mathematical relationship expresses affinity, specificity, enzyme-inhibition ability, biological activity, pharmacological activity, and pharmacokinetic behavior as functions of the three-dimensional quantitative structure of the molecule. It can map chemical compounds with similar biological activity and provide an important source of new compounds for drug research. At present, QSAR has achieved remarkable results, for example in searching for better antineoplastic active ingredients and screening safer and more effective photosensitizers.
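At its mathematical core, the multivariate relationship described above is often a linear model relating descriptor values to an activity. The sketch below fits such a model by least squares; the data and coefficients are entirely synthetic.

```python
# Sketch: a linear QSAR-style model, activity = X @ w, fit by least squares.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))                # 3 descriptors for 40 compounds
true_w = np.array([1.5, -0.7, 0.3])         # hidden structure-activity weights
y = X @ true_w + rng.normal(0, 0.05, 40)    # simulated measured activities

w, *_ = np.linalg.lstsq(X, y, rcond=None)   # recover the weights from data
print(np.round(w, 2))                       # close to [1.5, -0.7, 0.3]
```

The fitted weights directly quantify how each descriptor contributes to activity, which is why simple linear QSAR models remain popular when interpretability matters more than raw accuracy.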

 

At present, with the spread of computer calculation in molecular simulation, QSAR uses information about the molecular system under study to establish quantitative structure-activity relationships for active or inactive compounds. However, the arrival of complex mixed systems has strained this method. Establishing connections and predicting the structural chemistry of mixtures remains one of the research focuses of the field: even a single cluster is difficult to study, let alone systems approaching Avogadro-scale numbers of molecules.

 

3.3. Virtual Screening

 

The concept of virtual screening lies in predicting potential drug-like compounds from chemical libraries using computational approaches: a technique used by scientists and researchers in chemistry and related fields. Cheminformatics employs vast collections of chemical compounds, and the problem lies in the cost of evaluating each compound by conventional means: isolating a drug target, synthesizing or obtaining the corresponding library of compounds, and testing each individually by high-throughput screening (HTS). With the aid of machine learning, vast libraries of compounds may be screened so that only the most promising are further investigated. These rapid assessment and analysis tools save time and resources in the drug development efforts of large and small pharmaceutical companies alike. They also provide information about molecular shape and properties, substructure and fragment detection, and fast database search and querying.
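The screening idea above reduces, computationally, to scoring every compound in a library and keeping only the top candidates. The sketch below uses a stand-in score rather than a real docking or ML model; compound names and scores are hypothetical.

```python
# Sketch: rank a compound library by a score and keep the top hits.
def screen(library, score_fn, top_k=3):
    """Return the top_k highest-scoring compounds."""
    ranked = sorted(library, key=score_fn, reverse=True)
    return ranked[:top_k]

# Hypothetical library: (name, stand-in predicted-activity score) pairs.
library = [("cmpd_a", 0.91), ("cmpd_b", 0.12), ("cmpd_c", 0.77),
           ("cmpd_d", 0.45), ("cmpd_e", 0.88)]
hits = screen(library, score_fn=lambda c: c[1])
print([name for name, _ in hits])   # ['cmpd_a', 'cmpd_e', 'cmpd_c']
```

In practice the score function is the expensive part (a docking run or trained model per compound), but the ranking-and-triage logic that saves laboratory effort is exactly this simple.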

 

A virtual screening process begins with the setup of a hypothesis: possibly a 3D model or the binding site of a target protein. Potential drug-like compounds are identified through a series of docking calculations based on molecular modeling: the compound library is evaluated as a whole by investigating how each compound interacts with the model. Furthermore, other in silico methods such as ADMET prediction, biosimulation, or ranking may be used. The in silico results must always be evaluated with laboratory experiments, including biological, chemical, and physical methods and auxiliary techniques, by which the computational results are confirmed. Host- or structure-based approaches using the gene sequence of the novel coronavirus and the three-dimensional structure of its main protease (3CLpro/Mpro) and/or viral RNA as drug targets have been undertaken computationally in a hierarchy of studies to find the most potent antivirals, later affirmed in the laboratory. (Ghislat et al., 2021)(Aziz et al., 2022)(Shen et al., 2021)

 

4. Machine Learning Models in Chemistry

 

Machine learning (ML) methods have been widely used in chemistry to assist human work or to do what humans can hardly do. Four of the most common ML models used in chemistry are support vector machines, random forests, XGBoost, and deep learning. Support vector machine theory provides a rigorous mathematical foundation for why some learning algorithms generalize well, but SVM algorithms only work well with numerical features, which is one reason SVMs are less favored in chemistry research. Deep learning algorithms can automatically learn features from molecular structures, and they can often be trained without fully labeled data, so regardless of the type of descriptors used, deep learning can be a fast and accurate choice. However, tuning a deep neural network for best performance usually takes much time, and the model is prone to overfitting.

 

In addition, ensemble learning methods such as random forests and light gradient boosting machines are strong candidates. An important feature of ensemble methods is that they often return satisfactory results when enough base models are combined. In real-life chemistry research, multiple datasets can be used as input, so a good ensemble model can often outperform a single random forest. Light gradient boosting can be used to create a first-tier model predicting the target properties; the predictions can then be added as a new input feature to a second-tier light gradient boosting model to obtain an even more accurate prediction. This stacking strategy is one reason gradient boosting methods such as XGBoost are widely used in toxicity prediction. The extensive application of machine learning in chemistry has produced many successes, for example in atomic energies, molecular activity prediction, pharmaceutical optimization, and de novo drug design; these applications have greatly improved the efficiency of subsequent laboratory research.
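The two-tier scheme described above can be sketched as follows. scikit-learn's GradientBoostingRegressor stands in for LightGBM/XGBoost, and the data are synthetic; this is an illustration of the stacking idea, not any published pipeline.

```python
# Sketch: tier-1 model predictions become an extra feature for a tier-2 model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))                       # 4 synthetic features
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(0, 0.1, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tier1 = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
# Append tier-1 predictions as a new feature column for tier 2.
X_tr2 = np.column_stack([X_tr, tier1.predict(X_tr)])
X_te2 = np.column_stack([X_te, tier1.predict(X_te)])
tier2 = GradientBoostingRegressor(random_state=0).fit(X_tr2, y_tr)
print(round(tier2.score(X_te2, y_te), 3))           # held-out R^2
```

A caveat worth noting: fitting tier 2 on tier-1 predictions made on the same training rows can leak information; production stacking typically generates the tier-1 feature with out-of-fold predictions.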

 

4.1. Support Vector Machines (SVM)

 

Support Vector Machines (SVMs) are machine learning algorithms that require training to classify data, and for this reason SVMs are identified as supervised models. SVMs are mostly used in classification problems, but they can also be used for regression. The classification accuracy of SVMs is often competitive with other machine learning models because, whereas many models do not explicitly shape the decision region in feature space, an SVM maximizes the margin between the decision boundary and each class of data points.

 

Studies have shown that SVMs are commonly used in the chemistry field. SVMs have been used in QSAR of xanthine oxidase inhibitors, in modeling the reduction catalytic activity of graphene oxide, and in designing peptide-based anti-inflammatory cyclooxygenase-2 inhibitors. SVMs have also been applied to predictive modeling of adhesive rheology, anticancer activity, molecular toxicity, perovskite solar cell peak power, water solubility, and persistent organic pollutants absorbed in aerosol organic carbon. Moreover, SVMs show high performance in predicting antioxidant activities. Training an SVM amounts to solving a convex optimization problem, so it has a unique solution and a global minimum; the objective combines a regularization term with an error term weighted by a penalty factor, which sets the tolerated distance of data points from the margin.
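The margin maximization and penalty factor described above can be illustrated with a linear SVM on two clearly separable point groups; the data are synthetic and C is scikit-learn's penalty parameter.

```python
# Sketch: a linear SVM separating two synthetic classes with a maximal margin.
import numpy as np
from sklearn.svm import SVC

X = np.array([[0, 0], [0, 1], [1, 0],      # class 0
              [4, 4], [4, 5], [5, 4]])     # class 1
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # C is the penalty factor
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # -> [0 1]
print(len(clf.support_vectors_))              # the points that define the margin
```

Only the support vectors, the points nearest the boundary, determine the fitted hyperplane; the rest of the training data could be removed without changing the model.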

 

4.2. Random Forest

 

Though Random Forest is rooted in decision trees, it is set apart by the bootstrap-aggregation (bagging) method it uses to combine the outcomes of many trees. Random Forest can be used even if the data have missing values, and there is no need to build a separate tree model for each combination of input variables iteratively. The name "Random" reflects the fact that each tree is grown on a random subset of variables, which makes the trees largely independent; these independent trees are then combined.

 

In the first step, Random Forest constructs decision trees from paired input-target data. Each tree is built on a bootstrap sample of the original data (bagging), drawn with replacement, to obtain n_bootstraps datasets of the same size as X and Y.

 

A Random Forest averages predictions from many trees, which reduces the variance of the results. Many application papers employ Random Forest in chemistry. A good example of Random Forest use in chemometric data mining is the problem of modeling pre-equilibrium systems. Pre-equilibria have been extensively studied previously using multivariate data analysis methods, and we wish to retain the same methodology to demonstrate the improvement offered by alternative methods. It can be observed from the summary in Table I that the prediction results are similar across the four methods and to the original dataset, whether a multi-split is performed or not; average absolute errors vary from half to one log unit. Moreover, the read-across Dice coefficients, a measure of the correctness of the decision tree, are virtually identical for the respective schemes.

 

4.3. Neural Networks

 

Neural networks, also known as artificial neural networks (ANNs), can also be classified as supervised learning models. They consist of an interconnected group of nodes. Each node assigns weights to its inputs, and the weighted signals are passed through an activation function. The nodes are organized into layers: input, hidden, and output. The input and output layers feed the input into the model's hidden layers and combine the predictions, respectively, while the hidden layers contain the nodes that compute an output from the preceding layer's activations and weights.
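The layer structure just described can be sketched as a single forward pass: weighted inputs, an activation function, and an output layer combining the hidden activations. All weights below are made up for illustration.

```python
# Sketch: forward pass through a tiny network with one hidden layer.
import numpy as np

def relu(z):
    """A common activation function: max(0, z) elementwise."""
    return np.maximum(0, z)

x = np.array([0.5, -1.0, 2.0])        # input layer (3 features)
W1 = np.array([[0.2, -0.5, 0.1],      # hidden layer: 2 nodes, 3 weights each
               [0.7,  0.3, -0.2]])
b1 = np.array([0.1, -0.1])
W2 = np.array([[1.0, -1.0]])          # output layer: 1 node
b2 = np.array([0.05])

h = relu(W1 @ x + b1)                 # hidden activations: [0.9, 0.0]
out = W2 @ h + b2                     # network output: [0.95]
print(h, out)
```

Training consists of adjusting W1, b1, W2, and b2 (by backpropagation) so that this forward pass reproduces the labeled outputs; the forward computation itself is nothing more than the matrix products shown.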

 

Applications in Chemistry: The field of cheminformatics benefits from advances in neural networks because large datasets are available. In particular, large-scale adoption of QSPR (quantitative structure-property relationship) and QSPKR (quantitative structure-pharmacokinetic relationship) modeling has occurred for many chemicals across many areas. Mixtures, polymers, and other complex systems have been more difficult to handle, but important developments for these systems are beginning to emerge. This type of problem also stands to benefit the most from focused model design; since many of these systems are governed by the covalent and noncovalent bonding phenomena chemistry has historically investigated, the feasibility of a predict-and-correct scheme (once a sufficiently large and diverse training set is compiled) should not be discounted. ANNs have generally proven most useful in quantitative applications where the dimensionality of chemical space is augmented by additional input features not directly related to atom-atom topological or substructural information, such as transition states. (Talevi, 2022)(Butler and Riley, 2022)(Ruiz and Talevi, 2022)(Vora et al., 2023)(Scioli and Ruiz, 2022)(Kosugi and Hosea, 2020)(Matter et al., 2022)

 

5. Applications in Drug Discovery

 

Drug discovery is an important, fundamental field of chemistry research. In recent years, machine learning technology has been applied to drug discovery research, saving time and money by accelerating the process and increasing its efficiency. Large amounts of data have been mined by machine learning algorithms to optimize existing drugs, predict new side effects, and propose new molecular structures for future development. Machine learning has also been widely used in bioinformatics research, such as the design of new peptides and the prediction of protein folding and function. The experience of the drug discovery field suggests that if machine learning can be further applied in areas such as materials and organic synthesis, long-term investment in this research will pay off.

 

Drug discovery is a high-risk, high-cost, and low-return process. It generally takes at least ten years to develop a new drug from a candidate, and only about one in ten new drug candidates is eventually marketed, often because of disproportionate toxicity of the candidate compounds. The human body may have unexpected adverse responses even to drugs that have passed preclinical and clinical trials, and it is not feasible to test the entire world population in clinical trials. Therefore, the identification of new target candidates and lead compounds is the basis for reducing the cost and risk of drug development. Machine learning algorithms can be used to match protein activities with active-ingredient information, predict possible drug off-targets, and optimize compound libraries.

5.1. Virtual Screening of Drug Candidates

 

Virtual screening (VS) of drug candidates is an extremely important step in the process of new drug discovery. Following verification of drug targets, relatively mature drug candidates proceed to the later stages of development, including animal testing and clinical trials. In this context, effective screening of drug candidates helps control investment in the research and development phase. In terms of data sources, the PDB (Protein Data Bank), NCBI PubChem, ZINC, and other databases are the main sources of enzyme-ligand interaction data, and using receptors or ligands from these databases can greatly reduce the cost of collecting data or labeling samples.

 

Pharmacophore modeling, molecular docking, and dynamics simulation all require large amounts of structural and pharmacokinetic information; if the data are insufficient or missing, obvious errors may occur. In general, the higher the quality of the training set, the higher the average prediction accuracy of the model. However, molecular biology experiments are expensive and time consuming, and data from verified binding-activity experiments are scarce. Public databases can supply three-dimensional coordinates derived from protein amino-acid sequences, while ligand-related data can be generated by chemical synthesis, compounding, and other experiments. Combining a ligand with its protein to form an enzyme-ligand complex then allows high-quality enzyme-ligand interaction data to be determined.

 

5.2. De Novo Drug Design

 

In silico drug-like molecules span a vast chemical space that can be used to design new drugs from scratch. Structure-based, pharmacophore-based, and ligand-based optimization are the three most common applications of machine learning in de novo drug design. In structure-based approaches, docking models (e.g., idock) have incorporated machine learning to accelerate geometric optimization during screening, and significant improvements in docking scores and hit rates have been observed. Machine learning can likewise refine the pharmacophore optimization procedure, improving the quality of pharmacophores derived from protein kinases and thereby increasing selectivity and preference for the target kinase. In ligand-based approaches, methods spanning molecular fingerprints, de novo molecular generation and growing, molecular optimization, and drug-likeness estimation have used machine learning to build predictive QSAR models that estimate drug efficacy and safety.
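The ligand-based QSAR models mentioned above can be reduced to a very small sketch: fit a linear relation between molecular descriptors and activity by ordinary least squares. The descriptors (logP, scaled molecular weight) and the activity values are synthetic, generated here from a known linear rule so the fit is verifiable; a real QSAR model would use curated descriptors and far more compounds.

```python
# Minimal linear QSAR sketch: activity = w0 + w1*logP + w2*(MW/100),
# fitted by solving the normal equations with Gaussian elimination.

def fit_ols(X, y):
    """Solve (X^T X) w = X^T y with partial pivoting."""
    n = len(X[0])
    A = [[sum(X[k][i] * X[k][j] for k in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[k][i] * y[k] for k in range(len(X))) for i in range(n)]
    for col in range(n):                      # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n                             # back substitution
    for i in reversed(range(n)):
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

# Rows: [1 (bias), logP, molecular weight / 100] -- invented descriptors.
X = [[1, 1.2, 1.8], [1, 2.5, 2.3], [1, 3.1, 3.0], [1, 0.8, 1.5]]
y = [1.0 + 2.0 * r[1] + 0.5 * r[2] for r in X]   # synthetic "activity"
w = fit_ols(X, y)                                # recovers [1.0, 2.0, 0.5]
```

Once fitted, `w` can score any new descriptor vector in microseconds, which is what makes QSAR screening of large virtual libraries practical.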

 

De novo drug design is entering the era of artificial intelligence (AI)-assisted drug research through the design of new molecules. Its concepts, algorithms, and development trends cannot be separated from mathematical models of the dataset, ML, and their combination with chemical theory. In the future drug discovery process, the rapid synthesis of active compounds will greatly improve the efficiency of generating drug candidates, so de novo drug design has bright application prospects. Its significance lies in increasing the value of existing compounds, or of compounds whose patents have expired, by designing structurally novel functional drug molecules through discovery, improvement, and repurposing. Machine learning can increase the incremental value of a ligand in de novo design and lead to a substantial number of new therapeutic candidates.

 

5.3. Pharmacophore Modeling

 


The pharmacophore concept, often attributed to Paul Ehrlich in the early twentieth century, denotes the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response. A pharmacophore model defines the conserved 3D interactions between a molecule and its target, and it is one of the most popular tools in ligand-based drug discovery. Its application fields are various; popular uses include: 1) 3D database search, typically 3D substructure search on small-molecule databases; and 2) binding-site similarity analysis, in which pharmacophores are used to compare the binding sites of different protein structures to identify potential off-targets, or to cluster binding sites based on similarities in their pharmacophoric features.

 

Furthermore, combined with machine learning, pharmacophore models can be employed in protein-ligand binding prediction: several studies have used pharmacophore models as “fingerprints” encoding input features for various machine-learning algorithms, either to predict binding affinity to a target protein or to discriminate inhibitors from non-binders. A pharmacophore model can be treated as a vector of (dis-)similarity values, in which the molecular descriptor encodes 3D structural features of the ligand as well as the potential pharmacophores present in the relevant protein target across the sampled poses. Using the concept of a “hypothetical binding conformation space,” a 3D pharmacophore fingerprint was shown to discriminate active compounds from decoys and to predict binding affinities, with an internal score that showed predictive power in ranking docking scores. (Schaller et al., 2020; Muhammed and Akı-Yalcın, 2021; Tyagi et al., 2022; Kumar et al., 2022; Liu et al., 2022; Dhakal et al., 2022; Karki et al., 2023; Rezaei et al., 2020; Ji et al., 2021; Liu et al., 2021)
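The pharmacophore-fingerprint idea above can be sketched very simply: each molecule becomes a fixed-length bit vector over a vocabulary of feature-pair patterns, and a 1-nearest-neighbour rule separates actives from decoys. The feature vocabulary, the assignments, and the labels here are all invented for illustration, not derived from real molecules.

```python
# Toy pharmacophore-fingerprint classifier: bit vectors over a fixed
# (hypothetical) vocabulary of pharmacophore feature pairs, classified
# by the nearest labelled training example in Hamming distance.

FEATURES = ["donor-acceptor", "donor-aromatic", "acceptor-hydrophobe",
            "aromatic-hydrophobe", "donor-donor"]

def encode(present: set) -> list:
    """Bit vector over the fixed pharmacophore-pair vocabulary."""
    return [1 if f in present else 0 for f in FEATURES]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

train = [  # (fingerprint, label) pairs with invented patterns
    (encode({"donor-acceptor", "donor-aromatic"}), "active"),
    (encode({"donor-acceptor", "acceptor-hydrophobe"}), "active"),
    (encode({"aromatic-hydrophobe"}), "decoy"),
    (encode({"donor-donor"}), "decoy"),
]

def predict(fp):
    """Label of the nearest training fingerprint (1-NN)."""
    return min(train, key=lambda t: hamming(fp, t[0]))[1]

query = encode({"donor-acceptor", "donor-aromatic", "acceptor-hydrophobe"})
label = predict(query)
```

Published work replaces the Hamming 1-NN rule with stronger learners (random forests, deep networks), but the encoding step is the essential bridge from pharmacophores to machine learning input.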

 

6. Materials Science and Machine Learning

 

The coupling between materials science and machine learning is a rapidly growing area of research that is expected to become a fundamental driver of technological growth in the 21st century. The number of possible combinations of constituent elements, compositions, and processing conditions, multiplied by other material attributes and factors, means that we have an almost inestimable number of nanomaterials at our disposal. Many new combinations and variations of nanomaterials are currently being evaluated with machine learning by groups seeking an edge in a competitive market, and this will also drive optimization of the base chemicals. Machine learning tools can help design a wide range of materials, including metals, stainless steels, soft magnetic materials, and ceramics.

 

Bringing academia and industry together, educating materials chemists, and inspiring the application of machine learning in the sector are necessary next steps. Major drivers of growth are increasing corporate and academic R&D in the commercial sector, the expanding use of machine learning in pharmaceutical research, and the growing demand for machine learning in big-data processing. Real-world demonstrations that machine learning delivers genuine value are also important for the success of new systems. The environmental influence of machine learning as deployed in this field has yet to be investigated or quantified, although recent quantum-computational models and machine learning techniques are already used as alternative computational methods in catalyst discovery and in environmental impact assessments for the chemicals to be made.

 

6.1. Materials Property Prediction

 

The prediction of intrinsic materials properties, for example band structures, formation energies, and elastic constants, is the fundamental application of the machine learning techniques covered here. Materials are one of the key elements for developing and sustaining society. Traditional materials science is an empirical discipline that relies mainly on the expertise and knowledge of researchers and is labor- and capital-intensive. In recent years, however, advances in computing and mathematics have made it possible to use existing property data to train models that make efficient property predictions for new materials. The advantages of this technique are hard to overstate. First, the approach can be not only competitive with empirical research in accuracy but also faster and cheaper, requiring fewer materials to be tested. Second, it has opened the door to large-scale simulations that were previously impossible.
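A minimal property-prediction sketch in the spirit of the paragraph above: estimate a formation energy for a new composition by distance-weighted averaging over known (descriptor, property) pairs. The descriptors (electronegativity difference, mean atomic radius) and the energies are invented; a real workflow would draw them from a DFT database.

```python
# Distance-weighted k-nearest-neighbour regression over composition
# descriptors. All numeric values are illustrative stand-ins.
import math

# (electronegativity difference, mean atomic radius in angstrom) -> eV/atom
known = [
    ((1.8, 1.2), -2.1),
    ((0.4, 1.5), -0.3),
    ((2.3, 1.0), -3.0),
    ((1.0, 1.4), -1.1),
]

def predict_energy(x, k=2):
    """Average of the k nearest neighbours, weighted by 1/distance."""
    nearest = sorted((math.dist(x, d), e) for d, e in known)[:k]
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * e for w, (_, e) in zip(weights, nearest)) / sum(weights)

e_new = predict_energy((1.7, 1.25))   # interpolates between neighbours
```

Even this crude scheme shows the appeal: once the descriptor table exists, a new composition is scored instantly, with no simulation or synthesis required.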

 

In addition to keeping track of different materials and their properties, this technique can be used to optimize materials with respect to their properties, and it is hoped that it will play an important role in materials design, with the potential to discover new materials. In general, however, the type of data used to train a machine learning model determines which technique is applicable, and most studies consequently limit their scope. It is therefore important to explore machine learning methods that can handle general inputs across material classes such as metals, catalysts, photovoltaics, superconductors, capacitors, batteries, oxides, magnetic and hydrogen-storage materials, thermoelectrics, half-Heusler compounds, organic and carbon-based materials, and ultraviolet nonlinear optical crystals.

 

6.2. Accelerated Materials Discovery

 

Identifying suitable successors to established functional hard and soft materials is an enduring bottleneck in materials science research, and the combinatorial nature of materials space is as enthralling as it is challenging. Strictly speaking, materials are a subset of chemical space, and this set is so vast that there is no way to enumerate, let alone store, every possible combination of even a modest number of components. Materials chemists must therefore confine their informed guesses to a smaller domain, typically by restricting the compositions considered and so sidestepping inorganic chemistry's immense combinatorics. Even in this narrower universe, however, discovery remains a gamble. As a consequence, the path to new devices, and to the metals, superconductors, and other materials behind them, is ambiguous. High-throughput methods can streamline the discovery process, but they still require that many permutations be synthesized and tested before promising systems can be identified definitively. Moreover, the resulting high-dimensional data require machine learning models to capture their subtleties and complexity. The record shows that machine learning substantially speeds up this trial-and-error approach and efficiently navigates these high-dimensional spaces; early demonstrations on intelligent polymers and on inorganic electronic and optical materials illustrate how extensively machine learning can explore them.

 

Machine learning, particularly in its most popular manifestation of neural networks, is invigorating the field and has been embraced by inorganic chemists. As this trend continues, machine-learned surrogate models and guidelines could direct experimental work on, for instance, metal-organic frameworks, metal-organic polymers for the removal of pollutants from wastewater, and solar-energy-capturing hybrid perovskites. This incipient approach substitutes for the usual guesswork about hypothetical materials, and it shows promise for industries keen to obtain a rapid return on the cost of their instruments and ventures. Machine learning models enable high-throughput screening and the prediction of structures and reactions, and in inorganic chemistry they remove the need for many expensive and time-consuming experiments. By combining computer scripts and orchestration tools with workflow frameworks and pre-built repositories of compound properties, it is feasible to obtain the results of thousands of virtual materials experiments each day. In this manner, the materials of the future could be rapidly proposed by machine learning and then developed through the expertise of human chemists. (Cai et al., 2020; Fang et al., 2022; Leonard et al., 2021; Greener et al., 2022; Taye, 2023; Yao et al., 2023; Yang et al., 2024; Sahu et al., 2022; Pramod et al., 2021; Telikani et al., 2021)
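The screening workflow described above can be sketched in a few lines: a cheap surrogate model scores many virtual candidates, and only the top-ranked few would be sent on to expensive simulation or synthesis. The surrogate function and the candidate grid are stand-ins invented for illustration, not a real model of any material.

```python
# Minimal high-throughput virtual screening loop: enumerate candidate
# (composition, processing) pairs, score them with a cheap surrogate,
# keep the best few for expensive follow-up.
import itertools

def surrogate_score(frac_a: float, temperature: int) -> float:
    """Stand-in surrogate: peaks at frac_a=0.6, T=800 K (invented)."""
    return -((frac_a - 0.6) ** 2) - ((temperature - 800) / 1000) ** 2

candidates = itertools.product(
    [i / 10 for i in range(11)],        # composition fraction of A
    range(300, 1301, 100),              # processing temperature (K)
)
top = sorted(candidates, key=lambda c: surrogate_score(*c), reverse=True)[:3]
# `top` now holds the candidates worth sending to expensive evaluation.
```

In a production setting the surrogate would be a trained model and the loop would sit inside a workflow framework that dispatches the top candidates to DFT jobs or a robotic laboratory.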

 

7. Quantum Chemistry and Machine Learning

 


 

Why do we all talk about artificial intelligence (AI) and machine learning (ML)? They have been the hot topics of the last few years. ML in particular is everywhere, from your email spam filter to your recommendation lists on YouTube and online shops. The development of ML algorithms has been driven by enormous computational power and the availability of big datasets. A running joke goes: if you are using an algorithm from the 1950s, you are doing AI; if you are using an algorithm from 2000, you are doing ML; and if you are using something you do not understand at all, that is deep learning. Indeed, the rise of ML is closely tied to deep learning and to the torrent of available data. For chemists, the field perhaps best placed to appreciate the essence of machine learning is quantum chemistry: it is a vast field in which computation takes a large share of research time, the cost of calculating what we need to know is hefty, and even powerful computers struggle to keep up. This is where machine learning comes in.

 

Quantum mechanics is the microscopic model used to explain the atomic world: protons and neutrons are condensed in a nucleus surrounded by one or more electron clouds. From this model we can build up our classical macroscopic world, as many atoms combine to produce molecules and matter. Scientists use quantum chemistry to calculate and understand the properties of these systems, from molecules down to their smallest constituents. The space of possible systems that could enter our quantum-mechanical simulations is essentially unbounded, and ML can be used to develop models trained on quantum-mechanics calculations: a machine learning model relates quantitative or qualitative variables on the basis of accumulated data, and such models can strongly support the development of high-performance, inexpensive, and environmentally friendly materials. At the same time, the substantial cost of generating quantum-mechanical training data still hinders the development of machine learning in this area. In a nutshell, quantum chemistry and machine learning are complementary and together provide a powerful framework for modeling molecules.

 

7.1. Quantum Machine Learning Algorithms

 

Quantum computing promises to revolutionize computational methods in quantum chemistry. Developing quantum algorithms that effectively deliver results for practical chemistry problems is fundamental to making use of quantum machines, and recent research in this direction has produced promising outcomes. Even though large-scale quantum computers are not yet available, we expect that by the time they become widespread, scientists will have established a number of quantum machine learning algorithms delivering quantum-classical hybrid schemes for a variety of quantum-mechanics simulations.

 

Machine learning is a further strategy of broad interest in quantum mechanics and computational chemistry. Novel quantum machine learning approaches extend the applicability of machine learning to the reproduction of potential energy surfaces and other molecular properties that depend closely on molecular structure. The computational intensity of ab initio molecular simulations, in which molecular forces are evaluated from DFT calculations rather than from parameter-based molecular mechanics, has raised interest in neural network potentials, which use deep learning to produce molecular forces that can be assessed quickly. Nevertheless, because conventional machine learning methods were formulated for classical data, they may be ill-suited to problems that are inherently quantum mechanical. A fresh breed of quantum machine learning methods that compute directly with qubits is emerging, and these algorithms are likely to offer advantages over classical learning methods for modeling molecular physics when run on a quantum computer.

 

7.2. Applications in Quantum Chemistry

 

Machine learning platforms are typically used in drug discovery when running a quantum-chemical calculation is simply not feasible or would take too long to generate a useful amount of data. A fallback for quantum chemistry may soon be unnecessary, however, as using machine learning itself to reproduce quantum-chemical output has become a popular field in computational chemistry.

 

Machine learning in quantum chemistry can be an excellent tool for speeding up the generation of quantum-mechanical results. A large amount of training data containing the results of such computations is needed to make statistical predictions of the computed observables for a given set of quantum-mechanical input parameters. Once trained, these systems can return, in a few milliseconds, answers that a quantum-chemistry calculation would take weeks or even months to produce. Machine learning can also reveal general trends and relationships between computed observables, such as relationships between quantum energy states.
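The train-once, predict-in-milliseconds pattern above can be illustrated with kernel ridge regression, a workhorse of quantum machine learning: a handful of energies from a model 1D potential (standing in for expensive electronic-structure results) train a surrogate that then predicts new geometries instantly. The Morse parameters, training points, and kernel width are all assumptions chosen for the sketch; production work would use a library and real reference data.

```python
# Kernel ridge regression on a model 1D potential energy surface,
# with pure-Python linear algebra. The Morse function stands in for
# ab initio energies.
import math

def morse(r, d=1.0, a=1.5, r0=1.0):
    """Model potential standing in for electronic-structure energies."""
    return d * (1 - math.exp(-a * (r - r0))) ** 2

def gauss(r1, r2, sigma=0.5):
    return math.exp(-((r1 - r2) ** 2) / (2 * sigma ** 2))

train_r = [0.7, 0.9, 1.1, 1.4, 1.8, 2.4]     # "expensive" reference points
train_e = [morse(r) for r in train_r]
lam = 1e-6                                    # ridge regularisation

# Solve (K + lam*I) alpha = e by Gaussian elimination with pivoting.
n = len(train_r)
K = [[gauss(train_r[i], train_r[j]) + (lam if i == j else 0.0)
      for j in range(n)] for i in range(n)]
alpha = train_e[:]
for col in range(n):                          # forward elimination
    piv = max(range(col, n), key=lambda r: abs(K[r][col]))
    K[col], K[piv] = K[piv], K[col]
    alpha[col], alpha[piv] = alpha[piv], alpha[col]
    for r in range(col + 1, n):
        f = K[r][col] / K[col][col]
        for c in range(col, n):
            K[r][c] -= f * K[col][c]
        alpha[r] -= f * alpha[col]
for i in reversed(range(n)):                  # back substitution
    alpha[i] = (alpha[i] - sum(K[i][j] * alpha[j]
                               for j in range(i + 1, n))) / K[i][i]

def predict(r):
    """Surrogate energy: a weighted sum of kernels, evaluated instantly."""
    return sum(a * gauss(r, ri) for a, ri in zip(alpha, train_r))
```

Each call to `predict` costs six kernel evaluations, regardless of how expensive the reference calculations behind `train_e` were.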

 

Metaphorically, training these machine learning platforms can be thought of as learning the potential energy of the molecules, and there are accordingly many different learning schemes, reflecting the wide plethora of possible chemical systems. Moreover, it is common for two sets of data to map to essentially the same machine learning output, making it difficult to predict which of two quantum-chemical parameterizations will give better observables. It is therefore important to test predictive systems thoroughly for such discrepancies in the predicted energy output. (Dral, 2020; Sajjan et al., 2022; von Lilienfeld et al., 2020; Khan and Robles-Kelly, 2020; Westermayr et al., 2021; Huang and von Lilienfeld, 2020; Batra et al., 2021; Huang and von Lilienfeld, 2021; Rosen et al., 2021; Keith et al., 2021)

 

8. Challenges and Future Directions

 

Although great advances have been achieved in machine learning for materials analysis, many unsolved challenges remain in machine learning for chemistry research. For a start, more machine learning for materials informatics could be grounded in chemical insight, given the inherent physical properties of materials. Challenges also exist in the automated handling of organic compounds. Furthermore, less attention has been paid to the environmental impact of materials, and some of the materials currently studied are too expensive to be realized. Evolutionary prediction during optimization requires generous relaxation conditions, and components should be adjustable for different optimization processes. Long computation times and the vast effort of data collection also hinder model performance, and training and evaluation data should be collected from the same source if broad generalization is claimed. Other factors include the high costs sunk in facilities, personnel, and other experimental expenses; unpredictable experimental cycles; and uncertainty both in the selection of materials and in the results of machine learning. It is difficult to automate the adaptation of results across different physical systems, and some guidance must be added during research to avoid divergence during optimization.

 

In the future, building on what has been achieved in machine learning for chemistry research and on the ongoing heavy investment in R&D for machine-learning-driven materials discovery, many new machine-learning-based paradigms will be developed and applied beyond laboratory data. With these paradigms, future generations of novel materials will have unprecedented properties and functions beyond our imagination. Machine learning will be instrumental in providing insights that accelerate discoveries in chemistry, where elucidating molecular and electronic properties calls for optimal screening processes and targeted experimental strategies. Meanwhile, commercial benefits are already being pursued, with algorithms put to industrial use for real-time tracking and automatic optimization in conjunction with effective “experimental machine learning” paradigms.

 

8.1. Interpretability and Explainability

 

Many efforts today address transparency and fairness in machine learning systems. This is especially important in the natural sciences and cheminformatics, because chemical data are sourced from many different locations, and their varying quality and potential inconsistencies directly affect the models used in drug design. Cheminformatics has made progress in this area, but its models still suffer from limited reliability in predicting properties.

 

It is important to build efficient and reliable data systems to explain pathological aspects, ensuring that data relating to patients are not statistically compromised. Such data can be used to identify preventive or therapeutic agents, such as inhibitors or substrate-transport proteins that affect genes or various metabolic pathways.

 

To make drug design more effective in terms of development time and success rate, and to find newer therapeutic molecules, large volumes of data are subjected to various experimental evaluations. The results of these evaluations are then modeled by ML systems that generate further information and perform additional optimization, through simulation on new structures, for future synthesis studies. This helps to further improve the properties of the drug.

 

Given the large volumes of data generated, and the size often required for the structural analysis of molecules, one of the most pressing problems is overfitting. Overfitting occurs when a model memorizes the particular examples seen during training instead of learning patterns that generalize, a failure encouraged by the high dimensionality of big-data molecular descriptors: the model becomes so tightly coupled to the extracted training information that genuine learning is hindered.
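The overfitting failure mode above can be demonstrated in a few lines: a model that simply memorises its training set scores perfectly there but fails on held-out data, while a deliberately simple model generalises. The data are synthetic (y roughly 2x plus fixed "noise"), invented for the demonstration.

```python
# Overfitting in miniature: a lookup-table "model" versus a simple rule.
train = [(0.0, 0.1), (1.0, 2.2), (2.0, 3.9), (3.0, 6.1)]
test_set = [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)]

memo = dict(train)

def memoriser(x):
    """Overfit model: memorises training pairs; nearest-key fallback."""
    return memo.get(x, memo[min(memo, key=lambda k: abs(k - x))])

def simple(x):
    """Deliberately simple model: y = 2x."""
    return 2.0 * x

def mse(model, data):
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# The memoriser "wins" on training data (zero error) but the simple
# rule wins on unseen data -- the signature of overfitting.
train_errors = (mse(memoriser, train), mse(simple, train))
test_errors = (mse(memoriser, test_set), mse(simple, test_set))
```

Comparing training and held-out error like this is exactly the diagnostic one should apply to any QSAR or property model before trusting it.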

 

8.2. Data Quality and Quantity

 

Data quality and quantity are central issues for the future of machine learning in chemistry research. A vast quantity of chemical data is necessary to increase prediction precision and machine learning performance as a whole. However, the current tools to measure and predict data quality for large-scale molecular datasets, such as those provided by PubChem, are critically lacking. Moreover, one must show that numerical predictions come with trustworthy uncertainty estimates before they add real value in applications; much like AI used for predictive policing, predictions whose uncertainties are too great impose costs that society cannot afford.

 

In essence, we need: 1. proper analytical tools that tell chemical data labs when to stop experimental runs at a given measurement error; and 2. proper uncertainty measures tied to relevant chemistry decision-quality metrics, which by default are application- and system-specific.

 

For data quality, the data must be: 1. large enough to add value to future predictions, which depending on the system may mean many hundreds of trajectories or more; and 2. accurate enough that prediction uncertainties are comparable to the noise of the experimental data, including its systematic bias.
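One common way to attach the uncertainty estimates called for above is a bootstrap ensemble: refit the model on resamples of the data and report the spread of the resulting predictions. The "model" here is just the mean of invented assay values, so the sketch shows the mechanism rather than any real chemistry.

```python
# Bootstrap-ensemble uncertainty estimate over invented assay values.
import random
import statistics

random.seed(0)
measurements = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2]   # invented assay readings

ensemble = []
for _ in range(200):                             # bootstrap resampling
    sample = [random.choice(measurements) for _ in measurements]
    ensemble.append(statistics.mean(sample))     # one "model" per resample

prediction = statistics.mean(ensemble)           # report value +/- spread
uncertainty = statistics.stdev(ensemble)
```

The same resample-and-refit recipe applies unchanged when the mean is replaced by a regression or neural-network model, which is why ensembles are a popular first uncertainty estimate in chemistry applications.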

 

All good AI tools of the future need to be flexible, cheap, and routine to use and, if applied in industry, approved and considered safe by national authorities.

 

8.3. Ethical Considerations

 


 

Machine learning has become an evolving tool in chemistry, but there are certainly ethical considerations to take into account. Using a tool without a precise understanding of its workings is, after all, like having dessert before the meal. Classically, the first responsibility of researchers is to avoid handing their computational research tools to people not competent to use them; we remind users as early as possible that cheminformatics outputs are useful only when experts check them. Thorough care and responsibility should always be kept in mind in research and development. Ethical considerations around machine learning in the context of big data also deserve attention, although there are not yet many cheminformatics applications that raise them. Certainly, for long-term objectives and visions in research and development, machine learning carries various ethical responsibilities, and the question is wide and deep. No reader will doubt that if the prepared data are incorrect, or if the selected compounds do not meet the criteria, the expression “garbage in, garbage out” applies.

 

 

In addition, new legislation and regulations, such as the General Data Protection Regulation, can have an impact on machine learning once the cheminformatics community catches up with other application areas. Ethical machine learning questions are broad and involve responsibilities both inside and outside the workplace: researchers and developers must take the initiative and anticipate consequences beyond the walls of their labs and offices. Responsible cheminformatics means applying the best available knowledge and understanding throughout the process, from initial data collection to the decisions later made from the conveyed results. Precautionary principles are indeed essential for managing machine learning reasonably and appropriately in cheminformatics and chemical research. (Fantin and Balaji, 2022; Hiran et al., 2021; Latif et al., 2023; Taye, 2023; Valente et al., 2023; Bai et al., 2020; Sarker, 2021; Mahadevkar et al., 2022; Shanthamallu and Spanias, 2021; Habehh and Gohel, 2021)


References:

Alhazmi, K., Albalawi, F., and Sarathy, S. M. “A reinforcement learning-based economic model predictive control framework for autonomous operation of chemical reactors.” Chemical Engineering Journal (2022). [PDF]

Bennett, J. A. and Abolhasani, M. “Autonomous chemical science and engineering enabled by self-driving laboratories.” Current Opinion in Chemical Engineering (2022). sciencedirect.com

Coley, Connor W., Natalie S. Eyke, and Klavs F. Jensen. “Autonomous discovery in the chemical sciences part I: Progress.” Angewandte Chemie International Edition 59, no. 51 (2020): 22858-22893. wiley.com

Ghislat, G., Rahman, T., and Ballester, P. J. “Recent progress on the prospective application of machine learning to structure-based virtual screening.” Current opinion in chemical biology (2021). sciencedirect.com

Aziz, Mubashir, Syeda Abida Ejaz, Seema Zargar, Naveed Akhtar, Abdullahi Tunde Aborode, Tanveer A. Wani, Gaber El-Saber Batiha, Farhan Siddique, Mohammed Alqarni, and Ashraf Akintayo Akintola. “Deep learning and structure-based virtual screening for drug discovery against NEK7: a novel target for the treatment of cancer.” Molecules 27, no. 13 (2022): 4098. mdpi.com

Shen, Chao, Gaoqi Weng, Xujun Zhang, Elaine Lai-Han Leung, Xiaojun Yao, Jinping Pang, Xin Chai et al. “Accuracy or novelty: what can we gain from target-specific machine-learning-based scoring functions in virtual screening?.” Briefings in Bioinformatics 22, no. 5 (2021): bbaa410. [HTML]

Talevi, Alan. “In Silico ADME: QSPR/QSAR.” In The ADME Encyclopedia: A Comprehensive Guide on Biopharmacy and Pharmacokinetics, pp. 525-531. Cham: Springer International Publishing, 2022. [HTML]

Butler, Philip, and Robert J. Riley. “In vitro ADME assays and in vivo extrapolations.” In The ADME Encyclopedia: A Comprehensive Guide on Biopharmacy and Pharmacokinetics, pp. 537-547. Cham: Springer International Publishing, 2022. [HTML]

Ruiz, Maria Esperanza, and Alan Talevi. “In Vitro Drug Dissolution: Modalities.” In The ADME Encyclopedia: A Comprehensive Guide on Biopharmacy and Pharmacokinetics, pp. 547-557. Cham: Springer International Publishing, 2022. [HTML]

Vora, Lalitkumar K., Amol D. Gholap, Keshava Jetha, Raghu Raj Singh Thakur, Hetvi K. Solanki, and Vivek P. Chavda. “Artificial intelligence in pharmaceutical technology and drug delivery design.” Pharmaceutics 15, no. 7 (2023): 1916. mdpi.com

Scioli Montoto, Sebastián, and María Esperanza Ruiz. “Intravenous Administration: Technological Considerations.” In The ADME Encyclopedia: A Comprehensive Guide on Biopharmacy and Pharmacokinetics, pp. 575-582. Cham: Springer International Publishing, 2022. [HTML]

Kosugi, Y. and Hosea, N. “Direct comparison of total clearance prediction: computational machine learning model versus bottom-up approach using in vitro assay.” Molecular Pharmaceutics (2020). amazonaws.com


Matter, B., Bourne, D. W. A., and Kompella, U. B. “A high-throughput LC-MS/MS method for the simultaneous quantification of twenty-seven drug molecules in ocular tissues.” Aaps PharmSciTech (2022). researchgate.net

Schaller, David, Dora Šribar, Theresa Noonan, Lihua Deng, Trung Ngoc Nguyen, Szymon Pach, David Machalz, Marcel Bermudez, and Gerhard Wolber. “Next generation 3D pharmacophore modeling.” Wiley Interdisciplinary Reviews: Computational Molecular Science 10, no. 4 (2020): e1468. wiley.com

Muhammed, Muhammed Tilahun, and Esin Akı-yalcın. “Pharmacophore modeling in drug discovery: methodology and current status.” Journal of the Turkish Chemical Society Section A: Chemistry 8, no. 3 (2021): 749-762. dergipark.org.tr

Tyagi, R., Singh, A., Chaudhary, K. K., and Yadav, M. K. “Pharmacophore modeling and its applications.” Bioinformatics (2022). researchgate.net

Kumar, Sivakumar Prasanth, Nandan Y. Dixit, Chirag N. Patel, Rakesh M. Rawal, and Himanshu A. Pandya. “PharmRF: A machine-learning scoring function to identify the best protein–ligand complexes for structure-based pharmacophore screening with high enrichments.” Journal of Computational Chemistry 43, no. 12 (2022): 847-863.

Liu, X., Feng, H., Wu, J., and Xia, K. “Dowker complex based machine learning (DCML) models for protein–ligand binding affinity prediction.” PLoS Computational Biology (2022).

Dhakal, Ashwin, Cole McKay, John J. Tanner, and Jianlin Cheng. “Artificial intelligence in the prediction of protein–ligand interactions: recent advances and future directions.” Briefings in Bioinformatics 23, no. 1 (2022): bbab476.

Karki, R., Gadiya, Y., Gribbon, P., and Zaliani, A. “Pharmacophore-Based Machine Learning Model To Predict Ligand Selectivity for E3 Ligase Binders.” ACS Omega (2023).

Rezaei, Mohammad A., Yanjun Li, Dapeng Wu, Xiaolin Li, and Chenglong Li. “Deep learning in drug design: protein–ligand binding affinity prediction.” IEEE/ACM Transactions on Computational Biology and Bioinformatics 19, no. 1 (2020): 407-417.

Ji, Beihong, Xibing He, Jingchen Zhai, Yuzhao Zhang, Viet Hoang Man, and Junmei Wang. “Machine learning on ligand-residue interaction profiles to significantly improve binding affinity prediction.” Briefings in Bioinformatics 22, no. 5 (2021): bbab054.

Liu, X., Feng, H., Wu, J., and Xia, K. “Persistent spectral hypergraph based machine learning (PSH-ML) for protein–ligand binding affinity prediction.” Briefings in Bioinformatics (2021).

Cai, J., Chu, X., Xu, K., Li, H., and Wei, J. “Machine learning-driven new material discovery.” Nanoscale Advances (2020).

Fang, Jiheng, Ming Xie, Xingqun He, Jiming Zhang, Jieqiong Hu, Yongtai Chen, Youcai Yang, and Qinglin Jin. “Machine learning accelerates the materials discovery.” Materials Today Communications 33 (2022): 104900.

Leonard, Kevin C., Faruque Hasan, Helen F. Sneddon, and Fengqi You. “Can artificial intelligence and machine learning be used to accelerate sustainable chemistry and engineering?” ACS Sustainable Chemistry & Engineering 9, no. 18 (2021): 6126-6129.

Greener, Joe G., Shaun M. Kandathil, Lewis Moffat, and David T. Jones. “A guide to machine learning for biologists.” Nature Reviews Molecular Cell Biology 23, no. 1 (2022): 40-55.

Taye, M. M. “Understanding of machine learning with deep learning: architectures, workflow, applications and future directions.” Computers (2023).

Yao, Zhenpeng, Yanwei Lum, Andrew Johnston, Luis Martin Mejia-Mendoza, Xin Zhou, Yonggang Wen, Alán Aspuru-Guzik, Edward H. Sargent, and Zhi Wei Seh. “Machine learning for a sustainable energy future.” Nature Reviews Materials 8, no. 3 (2023): 202-215.

Yang, Lidong, Jialin Jiang, Fengtong Ji, Yangmin Li, Kai-Leung Yung, Antoine Ferreira, and Li Zhang. “Machine learning for micro- and nanorobots.” Nature Machine Intelligence (2024): 1-14.

Sahu, Mehar, Rohan Gupta, Rashmi K. Ambasta, and Pravir Kumar. “Artificial intelligence and machine learning in precision medicine: A paradigm shift in big data analysis.” Progress in Molecular Biology and Translational Science 190, no. 1 (2022): 57-100.

Pramod, Akshara, Harsh Sankar Naicker, and Amit Kumar Tyagi. “Machine learning and deep learning: Open issues and future research directions for the next 10 years.” In Computational Analysis and Deep Learning for Medical Care: Principles, Methods, and Applications (2021): 463-490.

Telikani, Akbar, Amirhessam Tahmassebi, Wolfgang Banzhaf, and Amir H. Gandomi. “Evolutionary machine learning: A survey.” ACM Computing Surveys (CSUR) 54, no. 8 (2021): 1-35.

Dral, P. O. “Quantum chemistry in the age of machine learning.” The Journal of Physical Chemistry Letters (2020).

Sajjan, Manas, Junxu Li, Raja Selvarajan, Shree Hari Sureshbabu, Sumit Suresh Kale, Rishabh Gupta, Vinit Singh, and Sabre Kais. “Quantum machine learning for chemistry and physics.” Chemical Society Reviews 51, no. 15 (2022): 6475-6573.

von Lilienfeld, O. Anatole, Klaus-Robert Müller, and Alexandre Tkatchenko. “Exploring chemical compound space with quantum-based machine learning.” Nature Reviews Chemistry 4, no. 7 (2020): 347-358.

Khan, T. M. and Robles-Kelly, A. “Machine learning: Quantum vs classical.” IEEE Access (2020).

Westermayr, Julia, Michael Gastegger, Kristof T. Schütt, and Reinhard J. Maurer. “Perspective on integrating machine learning into computational chemistry and materials science.” The Journal of Chemical Physics 154, no. 23 (2021).

Huang, B. and von Lilienfeld, O. A. “Quantum machine learning using atom-in-molecule-based fragments selected on the fly.” Nature Chemistry (2020).

Batra, Kushal, Kimberley M. Zorn, Daniel H. Foil, Eni Minerali, Victor O. Gawriljuk, Thomas R. Lane, and Sean Ekins. “Quantum machine learning algorithms for drug discovery applications.” Journal of Chemical Information and Modeling 61, no. 6 (2021): 2641-2647.

Huang, B. and von Lilienfeld, O. A. “Ab initio machine learning in chemical compound space.” Chemical Reviews (2021).

Rosen, Andrew S., Shaelyn M. Iyer, Debmalya Ray, Zhenpeng Yao, Alán Aspuru-Guzik, Laura Gagliardi, Justin M. Notestein, and Randall Q. Snurr. “Machine learning the quantum-chemical properties of metal–organic frameworks for accelerated materials discovery.” Matter 4, no. 5 (2021): 1578-1597.

Keith, John A., Valentin Vassilev-Galindo, Bingqing Cheng, Stefan Chmiela, Michael Gastegger, Klaus-Robert Müller, and Alexandre Tkatchenko. “Combining machine learning and computational chemistry for predictive insights into chemical systems.” Chemical Reviews 121, no. 16 (2021): 9816-9872.

Fantin Irudaya Raj, E., and M. Balaji. “Application of deep learning and machine learning in pattern recognition.” In Advance Concepts of Image Processing and Pattern Recognition: Effective Solution for Global Challenges, pp. 63-89. Singapore: Springer Singapore, 2022.

Hiran, K. K., Jain, R. K., Lakhwani, K., and Doshi, R. Machine Learning: Master Supervised and Unsupervised Learning Algorithms with Real Examples (English Edition). 2021.

Latif, Siddique, Heriberto Cuayáhuitl, Farrukh Pervez, Fahad Shamshad, Hafiz Shehbaz Ali, and Erik Cambria. “A survey on deep reinforcement learning for audio-based applications.” Artificial Intelligence Review 56, no. 3 (2023): 2193-2240.

Valente, J., António, J., Mora, C., and Jardim, S. “Developments in image processing using deep learning and reinforcement learning.” Journal of Imaging (2023).

Bai, Q., Li, S., Yang, J., Song, Q., Li, Z., and Zhang, X. “Object detection recognition and robot grasping based on machine learning: A survey.” IEEE Access (2020).

Sarker, I. H. “Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions.” SN Computer Science (2021).

Mahadevkar, Supriya V., Bharti Khemani, Shruti Patil, Ketan Kotecha, Deepali R. Vora, Ajith Abraham, and Lubna Abdelkareim Gabralla. “A review on machine learning styles in computer vision—techniques and future directions.” IEEE Access 10 (2022): 107293-107329.

Shanthamallu, U. S. and Spanias, A. “Machine and deep learning algorithms and applications.” (2021).

Habehh, H. and Gohel, S. “Machine learning in healthcare.” Current Genomics (2021).