Machine Learning for Predictive Insights into Chemical Systems

1. Introduction to Machine Learning in Chemistry

 

Machine learning is increasingly being applied to chemical data to generate valuable models, allowing a range of useful questions, including mechanistic questions about chemical reactivity, to be answered. In this chapter, we demonstrate the utility of machine learning in providing predictive and mechanistic insights into chemical systems, with particular emphasis on reaction prediction. Predictive machine learning models for chemistry have been shown to yield excellent results, and where data are scarce, methods for scaling machine learning models to limited data show great promise. We believe that chemistry research in the future could be increasingly driven by integrative approaches, in which computational experiments inform in-the-lab experiments and experimental data are in turn used to validate computational models. [1][2][3]

 

For example, as the chapters by Coley, Datquenave, and Meier illustrate, machine learning models can be used to predict which chemical reactions are likely to succeed in generating products with specific yields, and which are likely to fail on thermodynamic grounds. Such predictions can then guide the synthetic chemist, and considerable time and resources can be saved by avoiding dead-ends, or exploring them only in an informed way.

 

Understanding the fundamental and emergent properties of chemical systems using computation is one of the grand challenges in chemistry and should be addressed using multi-scale models, from electronic structure theory through to more coarse-grained models. Machine learning holds great promise for contributing to this field. However, state-of-the-art electronic structure theory in most cases performs best near stationary points on the potential energy surface, and effects such as spin-orbit coupling and non-adiabatic transitions remain challenging to treat within density functional theory. In these cases, it would be preferable to use machine learning techniques to mine existing data to predict chemical reactivity, as is now being done successfully in the world of synthetic organic chemistry. In this chapter, we aim to provide a clear foundation for applying machine learning to chemical reactivity and chemical systems. [1][2][3][4][5]

 

1.1. Overview of Machine Learning Techniques

 

As machine learning techniques become more advanced, it becomes increasingly difficult to navigate all of the options. Numerous algorithms, such as decision trees, random forests, support vector machines, Bayesian inference, and more, are available, raising the additional challenge of deciding which technique is optimal for a given prediction problem. The primary objective of this guide is to offer a comprehensive overview of the machine learning technologies and approaches currently employed for predictive insights in chemistry, allowing researchers to better understand the broad types of methods in use and their relevance.

 

While deep learning has proven itself a powerful technique for investigating compounds through molecular generation, property identification, and behavior estimation, for several chemical systems linear regression and PCA remain suitable for predictive analytics based on experimental measurements, and they facilitate interpretation of the main contributors to molecular properties.
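As a concrete illustration of the interpretable end of this spectrum, the sketch below fits a one-descriptor linear model in closed form. The descriptor/property values are invented for illustration; no chemistry library is assumed.

```python
# Minimal sketch: one-descriptor least-squares fit, e.g. relating a
# molecular descriptor (x) to a measured property (y).
def fit_line(xs, ys):
    """Closed-form simple linear regression: returns (slope, intercept)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical descriptor/property pairs (exactly linear, for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
slope, intercept = fit_line(xs, ys)
```

The fitted slope is directly interpretable as the sensitivity of the property to the descriptor, which is precisely the appeal of such models over black-box alternatives.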

 

This guide’s objectives are as follows:

• To examine the various machine learning and chemoinformatics techniques for predictive modeling.

• To provide context and guidance for applying these methods in the field of chemistry, along with a more detailed examination of the most applicable techniques.

• To complement the application of machine learning algorithms by providing practical predictive modeling best practice (summarized in PredictionML’s 20 Best Practices). [6][7][8][9][10]

 

2. Fundamental Concepts in Chemistry

 

Chemistry is invariably concerned with predicting the properties of matter from the laws of physics. These properties can include, for instance, thermodynamics and reaction rates, atomic and molecular structures, properties of bulk matter (e.g. electronic, magnetic, thermal), spectroscopic and scattering signals, nonlinear optical phenomena, radiative and electrospray- or electron-ionization scattering cross sections and final states, and particle velocity distributions and their relaxation on timescales from femtoseconds to astronomical. These properties all emerge from the collective behavior of large numbers of interacting electrons and nuclei, which in the limit reduce to their constituent elementary particles and antiparticles. While the Schrödinger equation forms the universal basis for obtaining many of these properties, exact solutions can be derived for so few systems that it must be replaced by approximate solutions for almost all practical problems. These approximations can leave relationships between properties hidden from any analytical approach.

 

Density functional theory (DFT), or classical analogs, can provide approximations based on numerical or intelligent sampling. These can in turn be used as inputs to machine learning (ML) algorithms. To be sure, understanding the behavior of ML tools is greatly facilitated by a strong background in the physical concepts from which ML-specific ab initio methodologies have evolved. Our intention in this brief summary is therefore to itemize the physics, chemistry, and biology concepts and other practicalities that are assumed, or sometimes taken for granted, when implementing ML algorithms for predictive insights into chemical systems. We have organized them around root concepts in quantum chemistry and in fundamental atomic and molecular physics for atomic and diatomic molecular systems. [11][12][13][14][15][16]

 

2.1. Chemical Bonds and Molecular Structures

 

In order to discuss the techniques used in machine learning for advanced applications in chemistry, we present an introduction to traditional chemical knowledge of chemical bonds and molecular structures. This background is intended to provide readers with a way of understanding the basic physical processes underlying the more advanced computational techniques. This, in turn, allows for increased insight into the procedures used for prediction and into the properties and compounds that can be predicted.

 

Quantum mechanically, the physical process surrounding chemical bonding is one of extremely high dimension. Despite this, there are several basic dimensions that allow for at least a partial understanding of the process. The first of these is geometry. The vast majority of properties of a molecular system can be expressed as functions of the positions of the nuclei. By representing a molecular system as a set of nuclear positions and nuclear charges, chemistry enters the realm of coordinate-based machine learning. A second dimension is electronic. When a chemical bond forms between systems, the orbitals of the different atoms interact and reconfigure themselves, with bonds forming so as to minimize the total electronic energy. Because of the multi-electron interactions and many-body nature of electronic structure, electronic degrees of freedom are often difficult for machine learning approaches. However, as we shall discuss in Section 2.3 and throughout this chapter, electronic-level machine learning is increasingly feasible. In this chapter, rather, we focus primarily on geometry. [17][18][10][19][20]
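The coordinate-based view of a molecule as nuclear charges plus positions can be made concrete with a descriptor such as the Coulomb matrix. The sketch below is a minimal pure-Python version, using an illustrative H2-like geometry (two unit charges 1.4 bohr apart); it omits refinements such as row sorting or eigenvalue-based invariance that practical implementations add.

```python
import math

def coulomb_matrix(charges, positions):
    """Coulomb-matrix representation built from nuclear charges Z_i and
    Cartesian positions R_i (atomic units):
    M_ii = 0.5 * Z_i**2.4,  M_ij = Z_i * Z_j / |R_i - R_j|."""
    n = len(charges)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i][j] = 0.5 * charges[i] ** 2.4
            else:
                r = math.dist(positions[i], positions[j])
                M[i][j] = charges[i] * charges[j] / r
    return M

# Illustrative H2-like geometry: two unit nuclear charges 1.4 bohr apart.
M = coulomb_matrix([1, 1], [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)])
```

Because the entries depend only on charges and interatomic distances, the representation is invariant to rigid translation and rotation of the molecule, which is exactly the property coordinate-based learning needs.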

 

3. Data Collection and Preprocessing

 

Data collection and preprocessing is a crucial aspect of any machine learning study. In this section, we describe issues specific to collecting and preprocessing data for chemical systems. The process starts with the generation of candidate structures and atom-level data, drawn from a number of scattered databases that house valuable chemical information. These databases include static properties of chemical substances useful for process analysis, such as the Inorganic Crystal Structure Database. They also contain libraries of molecules and their properties that prove incredibly useful for predicting molecular properties computationally, including the Cambridge Structural Database and the GDB databases, which contain stable organic molecules.

 

Data from these sources are not always collected to fixed standards and often exhibit known issues, such as incomplete coverage of characterized properties. The volume of experimental data is best illustrated by the CSD, which contains estimated molecular geometries and substructures for over a million organic molecules. Principles have been proposed and implemented to address quality assurance for new high-throughput experimental data; delegating to experts in high-throughput simulation, together with statistical validation of the data, appears to be the best way of tackling the ever-growing ‘big data’ problem. These data are often converted from alternative ways of characterizing the property under study, including density functional theory. Each form of the property (termed a ‘tag’ in the CSD) measures a different underlying quantity from which the same characterizing property can be deduced, e.g., a molecule’s reaction energy, functional group, or potential for radiation absorption. Using additional tags can provide further insight into both the reaction energy and the functional group in question. [21][22][23][24][25]
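A minimal quality-assurance step of the kind described above might simply drop incomplete records and reconcile duplicate depositions before modeling. The sketch below assumes hypothetical field names ("id", "energy") rather than any real database schema.

```python
# Hypothetical raw records pulled from several databases; the field names
# ("id", "energy") are illustrative, not a real database schema.
records = [
    {"id": "mol-1", "energy": -1.17},
    {"id": "mol-2", "energy": None},   # incomplete entry
    {"id": "mol-1", "energy": -1.17},  # duplicate deposition
    {"id": "mol-3", "energy": -0.52},
]

def clean(records):
    """Keep the first complete record per identifier; drop the rest."""
    seen, kept = set(), []
    for rec in records:
        if rec["energy"] is None or rec["id"] in seen:
            continue
        seen.add(rec["id"])
        kept.append(rec)
    return kept

cleaned = clean(records)
```

Real pipelines add unit harmonization and outlier checks on top of this, but even this filtering prevents duplicates and missing values from silently biasing a model.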

 

3.1. Types of Chemical Data

 

The analysis of chemical systems typically generates several identifiable types of data. Chemical data are commonly divided into two main types: data represented as independent samples, and data representing continuous or sequential measurements of a particular experimental system or sample.

 

Discrete Chemical Data: Discrete chemical data arise from experimental procedures that include high-throughput assay techniques such as transcriptomics, proteomics, metagenomics, and so forth. Data of this type represent independent samples that are collected, in some cases, by very high-throughput instruments. It is common for such chemical experiments to explore changes across chemical structure space, so information on chemical structure is part of the dataset. Discrete data that represent structural changes in an experiment typically come in two forms: matched datasets, where the experiment involves some perturbation to a specific chemical structure; and mismatched datasets, where the experiment is structured as a longitudinal time series or a nested experimental design. From the machine learning point of view, these experimental samples are typically collected into a matrix X with n rows (samples) and d columns (features), which may, but need not, be chemical structural features of the systems involved in the experiment or assay, together with properties y on which to base predictions. The properties y are typically either independent measurements (i.e., independent of X) whose relationship to X is to be discovered, or other dependent variables used for cross-validation or hyper-parameter optimization.

Continuous Chemical Data: Continuous chemical data arise in experimental work when a system is measured continuously across time, or when a large number of measurements or time points is collected per sample. Data of this type underpin the data analytics field known as chemometrics. Computational studies also generate continuous chemical data, as well as combinations of discrete and continuous data such as ‘MV/MT chemical datasets’ and irregular sampling, for kinetic modeling, analysis of structural dynamics via time-resolved phenomena, and several types of QSAR (quantitative structure-activity relationship) end-point. From the application point of view, continuous chemical data describe a chemical system as a time-dependent series of measurements collected per sample. Common examples include kinetic or time-dependent data such as the build-up of a transient species, time-resolved measurements, the progress of some process as a function of time, or any quantity recorded over time (chemiluminescence, calorimetry, vapor pressure, etc.) while one or more experimental conditions are varied. Data of this type are also collected to describe a system through repeated measurements of the same quantity at different times, with trials run in parallel, for instance by monitoring repeated or multi-sample sub-sampling in parallel, or concentration at a fixed point in the system while a condition is changed in a semi-batch or continuous-flow/continuous-feed reactor. Such data can also have a prescribed size; for instance, the number of available or feasible concentration samples of the inflow or effluent of the reactor vessels. [26][27][28][29][30]
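The matrix view of discrete data described above, with n samples, d descriptor columns, and properties y held out for validation, can be sketched as follows. The descriptor values are hypothetical, and the split is a plain random hold-out.

```python
import random

# Hypothetical discrete dataset: each row of X holds d descriptor values
# for one sample; y holds the property to predict.
X = [[1.2, 0.0, 3.1],
     [0.9, 1.0, 2.8],
     [1.5, 0.0, 3.3],
     [0.7, 1.0, 2.5]]
y = [10.1, 7.4, 11.0, 6.9]

def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Random hold-out split: reserve a fraction of samples for validation."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(X) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

X_tr, y_tr, X_te, y_te = train_test_split(X, y)
```

For matched or longitudinal designs, the split must respect group structure (all measurements of one sample on the same side of the split); the random split shown here is appropriate only for truly independent samples.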

 

4. Feature Engineering and Selection

 

In chemical datasets, the representation is high-dimensional, and many potential descriptors can be extracted and utilized for modeling. This creates an opportunity for feature engineering, in which multiple descriptors are combined and monitored to find the choice that optimizes performance. More generally, feature selection is the reduction of the number of input features used to develop models, which can also be carried out directly with commercial machine learning tools. Here too, the hope is to improve model output in some way, though generally not to the same extent as with feature engineering. However, these two techniques have not been deeply integrated into cheminformatic analyses, especially regression workflows; to the best of our knowledge, this is the first publication to do so. Here, we demonstrate that such approaches improve regression model performance against multiple criteria in selected modeling efforts, such as QSPR development.

 

The development of computational models that predict key descriptors of small molecules or systems is a critical discipline in cheminformatics. It is often constrained by the choice of input features, resulting in the prioritization and use of known, validated, or relevant descriptors. Removing irrelevant or unnecessary descriptors is widely regarded as an essential pre-processing step in the model development workflow, ideally performed before model building or tuning. There are many regression techniques that can build predictive models, from random forest regressors to convolutional neural networks. Many, however, follow the same basic framework, adding dimensionality-reduction techniques and transformations that aim to reduce overfitting, or “fitting noise”. Providing a model with high-dimensional input features can lead to longer model build times, overfitting, and additional complexity. Model interpretability can often be improved by retaining only those input features shown to be relevant for the dataset at hand, thereby removing components of the vector representation that are likely to contribute noise. [31][32][33][34]
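A minimal version of this pre-processing step, removing near-constant descriptors and then one of each highly correlated pair, could look like the following. The descriptor names and thresholds are illustrative.

```python
import statistics

def filter_descriptors(X, names, var_tol=1e-12, corr_max=0.95):
    """Drop near-constant descriptors, then drop one of each pair of
    highly correlated descriptors (a simple pre-processing filter)."""
    cols = list(zip(*X))  # column-wise view of the descriptor table
    keep = [i for i, c in enumerate(cols)
            if statistics.pvariance(c) > var_tol]

    def corr(a, b):
        """Pearson correlation between two descriptor columns."""
        ma, mb = statistics.fmean(a), statistics.fmean(b)
        num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
        den = (sum((x - ma) ** 2 for x in a) *
               sum((y - mb) ** 2 for y in b)) ** 0.5
        return num / den

    selected = []
    for i in keep:
        if all(abs(corr(cols[i], cols[j])) < corr_max for j in selected):
            selected.append(i)
    return [names[i] for i in selected]

# Hypothetical descriptor table: "const" is constant, "mw2" duplicates "mw".
names = ["mw", "logp", "const", "mw2"]
X = [[100.0, 1.2, 7.0, 200.0],
     [150.0, 0.8, 7.0, 300.0],
     [120.0, 2.1, 7.0, 240.0]]
kept = filter_descriptors(X, names)
```

The greedy pass keeps the first descriptor of each correlated group; more careful variants rank descriptors by relevance to the target before deciding which of a correlated pair to discard.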

 

4.1. Importance of Feature Selection in Chemistry

 

In chemistry, accurate and efficient predictive models are essential. Feature selection, or relevance measurement, is therefore essential as well, since it increases model accuracy and simplicity by filtering out noisy or uncorrelated features. This helps computationally expensive algorithms scale and, done properly, improves interpretability. When feature selection is not used, algorithmic results often need to be checked for redundancy among input features. This alone is not enough, however: in a good feature set, each feature must of course be relevant, but the features also need to be as uncorrelated, or mutually exclusive, as possible. To use machine learning efficiently in chemistry, a method is needed that can identify features that both separate chemicals produced by different methods and are relevant for the learning algorithm.

 

Relevance measures based on information theory can also be used to pre-filter molecular features. These methods are guaranteed to yield a dimensionality-reduced representation of the local input feature space that is permutation invariant. Moving one level up the hierarchy, one can generate a set of local features which, in combination, enable a system to decide whether to narrow the region of input space of interest based on local input data. Other attempts simply consider average features over large regions, as in quantitative structure-activity relationship (QSAR) theory in medicinal chemistry. Feature selection lies at the heart of AI in bioinformatics but is often arrived at only after molecular alignment. Given the molecular weight, toxicity, reaction mechanisms, and so forth of any pair of chemicals, it is desirable to know whether a potential relation or interaction between a fixed set of chemicals is model-driven. [35][36][37]
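As a sketch of an information-theoretic relevance measure, the snippet below computes the mutual information between a discretized feature and an activity label. The binary descriptors and labels are invented for illustration.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete variables —
    an information-theoretic relevance measure for pre-filtering."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), with counts folded in
        mi += p * math.log2(p * n * n / (px[x] * py[y]))
    return mi

# Hypothetical binary descriptors vs. an "active"(1)/"inactive"(0) label:
# f1 tracks the label perfectly, f2 is unrelated noise.
label = [1, 1, 0, 0, 1, 0, 1, 0]
f1    = [1, 1, 0, 0, 1, 0, 1, 0]
f2    = [1, 0, 1, 0, 1, 0, 1, 0]
```

Ranking candidate descriptors by this score gives a simple relevance pre-filter; pairing it with a redundancy penalty between descriptors yields the familiar minimum-redundancy, maximum-relevance style of selection.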

 

5. Supervised Learning in Chemistry

 

In chemistry, machine learning models that predict numerical values belong to supervised learning for regression tasks. Quantities that can be predicted numerically include direct values of molecular descriptors, molecular properties interpreted from those descriptors, elements of molecular mechanisms, and so on. Any mathematical model that provides numeric predictions can serve as a regression model, and each algorithm can be tailored to specific data by tuning its so-called “algorithm-specific hyper-parameters”. Common machine learning regression methods include support vector machine regression (SVR), k-nearest neighbors (KNN), and kernel ridge regression (KRR) based on the Gaussian kernel. In particular, kernel ridge regression combined with the Gaussian kernel is well suited to learning from continuous molecular representations, and several works approximating the kernel coefficients within deep learning architectures have now been published.
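A bare-bones kernel ridge regression with a Gaussian kernel can be written in a few lines: build the kernel matrix, solve the regularized linear system (K + λI)α = y, and predict by kernel-weighted sums. The 1D training data below are hypothetical, and a small dense solver stands in for the optimized linear algebra a real implementation would use.

```python
import math

def gaussian_kernel(a, b, sigma=1.0):
    return math.exp(-(a - b) ** 2 / (2 * sigma ** 2))

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def krr_fit(xs, ys, lam=1e-6):
    """Kernel ridge regression: solve (K + lam*I) alpha = y."""
    n = len(xs)
    K = [[gaussian_kernel(xs[i], xs[j]) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    return solve(K, ys)

def krr_predict(xs, alpha, x):
    return sum(a * gaussian_kernel(xi, x) for a, xi in zip(alpha, xs))

# Hypothetical 1D training data (e.g. a descriptor value -> a property).
xs = [0.0, 1.0, 2.5, 4.0]
ys = [1.0, 0.2, -0.5, 0.3]
alpha = krr_fit(xs, ys)
```

The regularization strength λ and kernel width σ are exactly the "algorithm-specific hyper-parameters" mentioned above; in practice they are tuned by cross-validation rather than fixed as here.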

 

Supervised learning models that make qualitative predictions, i.e., that classify objects into categories or find relationships between them (classifiers), address classification problems. In chemistry, classification research focuses primarily on searching for molecular activity. Binary classifiers assign objects to one of two groups or categories, such as “active” and “inactive” or “toxic” and “non-toxic”; multi-class classifiers define three or more classes. Different types of classifiers may perform differently depending on the specific kind of descriptor used (e.g., deep learning architectures may classify large molecular representations well). Such algorithms can be divided into two broad groups. [38][39]
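One of the simplest binary classifiers of the kind described above is a k-nearest-neighbour vote over descriptor vectors. The sketch below labels hypothetical two-descriptor points as "active" or "inactive".

```python
import math
from collections import Counter

def knn_classify(X_train, labels, x, k=3):
    """k-nearest-neighbour vote: assign the majority label among the
    k training points closest to x in descriptor space."""
    order = sorted(range(len(X_train)),
                   key=lambda i: math.dist(X_train[i], x))
    votes = Counter(labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-descriptor points labelled "active"/"inactive".
X_train = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3),
           (2.0, 2.1), (2.2, 1.9), (1.9, 2.0)]
labels = ["inactive"] * 3 + ["active"] * 3
pred = knn_classify(X_train, labels, (2.05, 2.0))
```

The same voting scheme extends unchanged to multi-class problems; only the label set grows.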

 

5.1. Regression and Classification Algorithms

 

Regressing or classifying chemical systems is a common task for cheminformatics and machine learning. Cheminformatics seeks to apply well-known or newly developed algorithms to drug design, chemical compound prediction, or the search for molecules possessing specific chemical or physical properties. The general problem of regressing or classifying an endpoint, such as solubility or pIC50, has been studied extensively and is well developed. Often, the approaches are more restrictive than general regression or classification with a neural network or tree ensemble, which can be either a pro or a con depending on the task.

 

There are many unique challenges in applying machine learning, in this case regression and classification, to molecular or chemical systems. A well-known limitation is that a model can interpolate within its training data (here, the laboratory and simulation data used to train the models) but cannot reliably extrapolate to out-of-distribution (e.g., new regions of chemical) space. For a new molecule, especially one not originating from the same rule-based chemistry that shaped the training set, many practitioners trust a large margin of error rather than the model’s own confidence estimates when extrapolating.

 

Another key challenge in applying machine learning to chemical or molecular domains is severe dataset imbalance. For instance, a typical molecular descriptor may be heavily skewed towards zero when the descriptor does not apply to a molecule or simply need not be computed, leading to large sparse tensors or matrices. Many drug-design and cheminformatics datasets are also inherently biased, because most chemical systems investigated are unsuccessful or uninteresting, and only a few are successful relative to the total number. The algorithms nonetheless regress or classify them into some chemical or system property without revealing much more about the molecules themselves. [40][41][34]
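One common, if blunt, response to such imbalanced datasets is random oversampling of the minority class before training. The sketch below balances a hypothetical screen with six inactive and two active compounds; more sophisticated options (class weights, synthetic sampling) exist but are not shown.

```python
import random

def oversample_minority(X, y, seed=0):
    """Random oversampling: duplicate minority-class samples until all
    classes are equally represented."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for cls, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(cls)
    return X_out, y_out

# Hypothetical screen: 6 inactive (0) compounds, 2 active (1) compounds.
X = [[float(i)] for i in range(8)]
y = [0, 0, 0, 0, 0, 0, 1, 1]
X_bal, y_bal = oversample_minority(X, y)
```

Oversampling must be applied only to the training split, never before a train/test split, or the duplicated rows leak into the evaluation set and inflate apparent performance.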

 

6. Unsupervised Learning in Chemistry

 

Continuous progress in scientific understanding is built on gathering new data, often with the intent to generalize trends and/or predict outcomes from past behavior. In chemistry, a vast and complex science, a myriad of information is routinely generated, for example to evaluate new compounds, their biological effects, their physicochemical properties, and their processing characteristics, among others. Typically, machine learning algorithms are employed to gain predictive insights into the field at hand. Unsupervised learning, by contrast, provides a different but essential perspective on data analysis. Hence, this section pulls together data-analysis strategies for predictive insight when the typical responses for model training are unknown.

 

In this way, unsupervised learning is akin to beginning with a blank piece of paper, without knowing in advance what findings to expect. Fundamentally, there is no single approach to unsupervised learning because there is no single right answer. A survey of chemistry research papers finds 8,541 and 3,896 titles referring to clustering and dimensionality reduction (unsupervised learning algorithms), respectively, with hundreds of journal articles applying these methods specifically to cheminformatics tasks (Fig. 1). Other unsupervised learning methods, such as anomaly detection, community detection, and topic modeling, have also been used in chemistry but are not discussed here.

 

In this section, we introduce unsupervised learning as part of the broader space of machine learning and motivate why the topic matters to general users and practitioners. We identify why the clustering problem in chemistry is particularly important and contrast this with approaches that aim to detect outliers in the data. We also discuss how dimensionality reduction is generally useful in preparing data for further unsupervised learning, before closing with our view of the field’s current relevance and overall importance. [42][43][44]

 

6.1. Clustering and Dimensionality Reduction Techniques

 

Chemistry, like other disciplines, often deals with complicated and large datasets that are rich in diversity, and many relevant questions remain to be answered. Recognizing patterns and relationships in the retrieved data, obscured as they are by noise, is a crucial capability that guides exploration of the chemistry landscape. Clustering and dimensionality reduction techniques represent the unsupervised learning part of the machine learning toolbox and are important for uncovering such patterns and relationships. The former groups data points by their similarities and dissimilarities, for example collecting points whose mutual distance falls below a cut-off value, whereas the latter summarizes the dataset by expressing it in terms of a smaller number of linearly independent variables.
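A minimal version of the clustering half of this toolbox is plain k-means, sketched below on two well-separated groups of hypothetical 2D descriptor vectors, with fixed starting centroids so the run is deterministic.

```python
import math

def kmeans(points, centroids, iters=20):
    """Plain k-means: assign each point to its nearest centroid, then
    move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda c: math.dist(p, centroids[c]))
            clusters[i].append(p)
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centroids)]
    return centroids, clusters

# Two hypothetical, well-separated groups of 2D descriptor vectors.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centroids, clusters = kmeans(points, centroids=[(0.0, 0.0), (1.0, 1.0)])
```

On real chemical data one typically standardizes the descriptors and chooses the number of clusters by an internal criterion (e.g. a silhouette-style score) rather than fixing k in advance.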

 

It is well known that including extraneous variables can decrease a model’s predictive accuracy rather than improve it. Both clustering and dimensionality reduction have many applications in chemistry; such unsupervised learning techniques have been employed, for instance, to highlight hidden structure within chemical and biological data and to identify regions across chemical space. [45][46]

 

7. Deep Learning for Chemical Systems

 

Deep learning models have been increasingly used across a variety of chemical problems. Neural networks are used to model the mathematical relationships linking molecular structures to their properties, and convolutional neural networks are employed to learn molecular representations directly from raw images, enabling high-throughput screening. There are now numerous efforts in which predictive insights – sensitivity maps, molecular predictions, and targeted chemical and property descriptors – provide the basis for decision support in molecular and reaction discovery and pathway optimization. Nonetheless, parts of the more classical physical-chemistry community remain reluctant to adopt these data-driven models, citing their black-box behavior, the hype surrounding them, and their appetite for ever more data. For the most part, however, the bottlenecks to using deep learning for predictive insights are (1) the technical rigidity and expertise required to train heterogeneous neural networks and (2) the cost of data acquisition and design when building representative training sets for efficient target-property and sub-property prediction in materials design by multi-fidelity learning.

 

Harnessing statistical reasoning with more data and more parameters, deep learning has penetrated the materials community with marked success. For chemistry, the language of these materials, the space of possible designs is being extended by the powerful generative models that underlie breakthroughs in molecular generation, enzyme design, crystallization, and reaction proposal and scoring (see Fig. 2). These advances are pushing knowledge discovery and data-driven design to new frontiers. Nonetheless, there are also multi-fidelity problems and regulatory considerations to master. While deep learning models applied to chemical systems have at times exceeded expert-level performance, their current limitations need more attention, and leveraging additional data that can support actionable conclusions is a definite necessity. The time has come, then, to take stock of deep learning and to appreciate its value. In this chapter, we review progress in this newer aspect of differentiable programming for predictive insights and decision support, considering the benefits, bottlenecks, and trade-offs of its application. [8][7][47]

 

7.1. Neural Networks and Convolutional Neural Networks

 

Neural networks analyze a wide variety of data types using feature detectors. In text data, for example, these feature detectors might learn about letter combinations or word usage, and in image data they might learn about local objects or spatial relations. Because neural networks are capable of learning and representing highly complex, high-dimensional relationships, these models have become key tools of chemical informatics. Equally critical to their widespread application was the development of efficient optimization procedures for deep neural networks. Delving into the technical details of these optimization procedures is beyond the scope of this chapter; instead, we direct readers to excellent discussions of the topic in the statistical literature. Here, we take these optimization procedures as given and focus on the big picture.

 

Convolutional neural networks are a specific type of neural network that analyze one part of a pattern at a time. In other words, convolutional neural networks impose a strong inductive bias that the data have local, typically two-dimensional, structure. Convolutional operations scan the two-dimensional data for specific local patterns, encoding those local patterns into deeper and deeper feature representations. When many such layers are stacked, the resulting models are examples of deep learning architectures: they have more decision-making layers (between the input and final layers, or after the convolutional operations) than simpler models such as the support vector machines discussed earlier. [48][49][50]
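The local pattern-scanning step can be illustrated with a plain valid-mode 2D convolution (strictly, cross-correlation, as most deep learning frameworks implement it); the tiny "image" and vertical-edge kernel below are invented for illustration.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation: slide the kernel over the grid
    and record the local pattern response at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

# A vertical-edge detector applied to a tiny two-region "image":
# the response peaks exactly where the 0 -> 1 boundary sits.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
response = convolve2d(image, kernel)
```

In a trained network the kernel weights are learned rather than hand-set, and stacking such operations with nonlinearities is what builds the progressively deeper feature representations described above.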

 

8. Applications of Machine Learning in Chemistry

 

Machine learning tools find many applications in chemistry, from predicting compounds’ solubility to planning new synthetic routes. Applications with a major focus on biological activity include virtual screening, predictions of compounds’ reactivity (often labeled quantitative structure-reactivity relationships), toxicity, and the kinetics of biochemical pathways. In the wider area of medicinal chemistry, machine learning has recently reshaped application areas ranging from predictive toxicology and chemical informatics to computational chemistry.

 

Computer-aided drug design and discovery has been revolutionized in recent years by machine learning methods developed specifically for chemical systems. These applications lie along a computer-aided drug discovery (CADD) pipeline in which machine learning algorithms trained on large databases of chemical or bioactivity data rapidly evaluate small molecules against various objectives: assessing synthetic accessibility, tracking potential liabilities, estimating the affinity of small molecules for a drug target from available structural data (chemogenomics), or running “de novo” drug design to propose new molecules. In some cases, descriptive models can also guide a mechanistic understanding of the drug discovery process on both targets and off-targets (adverse drug events), based on existing drug and small-molecule datasets. The generality of current machine learning approaches also allows accurate prediction of key materials properties for novel compounds with entirely different functional group moieties. [51][41][52]

 

8.1. Drug Discovery and Design

 

Machine learning has found incredibly diverse applications in chemistry, where it aids in designing and rapidly assessing a bevy of prospective solutions with valuable behavioral profiles. These computational capabilities can be used to design new ways of intercepting and treating human disease, such as nanotechnology-enabled drug delivery or immunotherapy. Following inspired early work, reports of machine learning applied to drug discovery and design have appeared with increasing frequency across the chemistry literature. High-performing machine learning models have enabled novel advances in in vitro, assay-driven measurement of toxic and protective molecular activity. Machine learning has also been used to predict byproduct formation in biopharmaceutical manufacturing, significantly reducing development times, and to model important properties of pharmaceutical compounds, such as the most abundant mass-spectral fragment or the partition coefficient. More broadly, thousands of projects have begun to use generative machine learning models to design new drug-like molecules in the hope of discovering new treatments for human diseases.

 

The emergence of powerful, open-source machine learning software such as scikit-learn and TensorFlow has opened the floodgates for applications of machine learning by researchers across scientific disciplines. The ability to train one’s own custom machine learning model on a scientifically relevant dataset has drawn many graduate students and junior faculty to the field. As mentioned, this burgeoning community of machine learning researchers includes those involved in the NCCR MAP-SFr Chemical Space project, with two of these doctoral students being part of the author team for this paper. Many of the projects cover a broad spectrum of medicinal and biochemical areas, relying in part on machine learning to support the use of concepts from inorganic chemistry. For example, one project uses machine learning as a tool to determine bio-conformation in new protein structures. This differs from case studies on predicting cis-peptide bonds and the resulting functional components that can drive phosphorylation of proteins, which is of particular interest given a focus of the larger Swiss research project. [41][53][54]
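As a concrete illustration of this workflow, the sketch below trains a custom scikit-learn model end to end. The feature matrix is synthetic and merely stands in for molecular descriptors (an assumption on our part; in practice the features would come from a cheminformatics toolkit):

```python
# Minimal sketch: train a custom regressor on a chemistry-style dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))                              # 8 hypothetical descriptors
y = X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=500)   # toy target property

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"test MAE: {mae:.3f}")
```

The same skeleton (fit, predict, score on a held-out split) applies regardless of which descriptors or which scikit-learn estimator a project chooses.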

 

9. Challenges and Limitations

 

The predictions from machine learning models can be successfully correlated with experimental outcomes in many areas of chemistry. In general, machine learning can produce models with superior performance to statistical methods based on relatively simple assumptions about intermolecular interactions. In addition, the automated nature of modern machine learning workflows can make predictions more reproducible and reduce human variability. However, several caveats should be considered before interpreting machine learning predictions as new chemical insights. Machine learning models often behave as complex black boxes because of the number of variables included in the input vector and in the model itself. This lack of interpretability could hinder the potential industrial application of a method for regulatory or patent reasons, or simply require significant additional validation and calibration against experimental data to build confidence in the model’s validity. Even though machine learning models are derived from data, the sheer number of internal parameters makes knowledge representation challenging and might preclude the development of new hypotheses or suggestions for chemical innovation.

 

Another important consideration when working with machine learning models is the quality and quantity of data. Because machine learning models learn from data, model quality depends heavily on the quality of that input data. Issues such as bias and circular reasoning in the underlying training datasets can diminish the predictive capabilities of the resulting model. Additionally, informed model-based predictions are not always feasible in regions of descriptor or chemical space that are not adequately sampled by experimental data, since the necessary statistics are simply unavailable there. [3][2][4]
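One pragmatic response to under-sampled regions is an applicability-domain check. The sketch below is one illustrative heuristic of our own choosing (not a method prescribed here): flag any query whose distance to the training set exceeds a percentile of the training set's own nearest-neighbor distances.

```python
# Sketch of a simple applicability-domain check on synthetic descriptors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X_train = rng.normal(size=(300, 5))          # the sampled descriptor space

nn = NearestNeighbors(n_neighbors=2).fit(X_train)
# nearest non-self neighbor distance for every training point
d_train = nn.kneighbors(X_train)[0][:, 1]
threshold = np.percentile(d_train, 95)

def in_domain(x):
    """True if the query lies within the well-sampled descriptor region."""
    d = nn.kneighbors(x.reshape(1, -1), n_neighbors=1)[0][0, 0]
    return bool(d <= threshold)

print(in_domain(np.zeros(5)))        # near the data centre: expected in-domain
print(in_domain(np.full(5, 10.0)))   # far outside the sampled region
```

Predictions for queries that fail such a check should be reported with an explicit caveat rather than taken at face value.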

 

9.1. Interpretability and Explainability

 

One of the main points of attention when predicting complex real-world scenarios is interpreting and explaining the predictions of the models. In the context of chemistry, where mistakes can have a detrimental real-world impact, transparency and interpretability are even more critical. On one hand, with computational modeling and machine learning tools now widely available and accessible, a clear understanding of a model is of utmost importance for “trusting” its predictions and for preventing users from relying on them with unwarranted confidence. On the other hand, the purpose of the model output is often to provide scientific insight; a model then has to be transparent enough to allow inference about its behavior, for example, to justify why a molecule is predicted to be toxic.

Historically, random-forest-type models have been easier to explain, but one has to be careful with over-weighting feature importances: an importance score is meaningful for a given model and is not necessarily interpretable by a human. Identifying which explanation techniques are most reliable for an individual model in a given context is therefore important. A further concern is the robustness of model interpretations, which can be affected by many factors, including data sparsity, and can change when a different algorithm is used.
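To make the caution about importances concrete, the sketch below (on synthetic data) contrasts a random forest's built-in impurity-based importances with permutation importances, which measure how much shuffling each feature degrades performance:

```python
# Comparing two importance measures for the same random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # only features 0 and 1 matter

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
impurity = clf.feature_importances_              # model-internal (impurity) view
perm = permutation_importance(
    clf, X, y, n_repeats=10, random_state=0
).importances_mean                               # behavioral (permutation) view

print("impurity:   ", np.round(impurity, 3))
print("permutation:", np.round(perm, 3))
```

When the two rankings disagree, that disagreement is itself a signal that the "explanation" reflects the model's internals rather than a human-interpretable chemical truth.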

 

Neural networks and related artificial intelligence methods generally fall into the inherently non-interpretable set of tools, at least on the surface: predictions cannot be read directly from the coefficients or decision path of the model. For example, if two systems are learned as “similar” merely because parts of their molecular descriptor representations are close, while the chemistry of the two is in direct contrast, a deep learning model may degenerate into a closest-distance matching model that has not learned the context of the descriptors; if the test data then carry the opposite sign for the property, an incorrect prediction can be expected. More generally, the mapping between input and output can become highly non-linear and the data non-separable. The local complexity appears to increase with depth, up to around 20 layers, beyond which any further complexity arises from redundant representations that make no further difference to the model. A related concern is the effect of perturbing the data: adversarial examples, in which small input perturbations produce large output changes, demonstrate that the non-linear pathway from input to output is not smoothly perturbation invariant. [55][56][57]

 

10. Future Directions and Emerging Trends

 


It is hoped that this review has guided readers through the many avenues available for applying machine learning to gain predictive insights into chemical and biological systems. By providing case studies and a comparison of methods in the field, we have attempted to highlight both the quantitative accord and the differentiators present in the area. Applications of machine learning with quantum mechanics can be grouped into three broad categories: predicting molecular “fingerprints,” such as enthalpies of formation and dipole moments, that serve as inputs to larger-scale simulations; enhancing the efficiency or accuracy of explicit first-principles Hamiltonian (Hartree-Fock, density functional, etc.) methods; and creating compact representations of configurations, enabling the construction of efficient models of ensembles of atomic configurations. Some of the most exciting opportunities come at the intersection of quantum mechanics with machine learning. No matter how accurate a learned potential is, if the approximations in the underlying electronic structure are unphysical or impossible to compute, the resulting predictions are bound to disappoint in some settings, especially when comparing the energies of candidate models in screening applications. Overall, identifying the most fruitful future research directions is a shared endeavor among a number of scientific communities, and building bridges early will drive investigation and innovation. In this vein of integration, grand challenges are afoot that combine deep learning and molecular simulation parameterization in excitingly novel ways.
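The second category above, enhancing cheap methods with learned corrections, is often realized as delta-learning: a model is trained on the difference between a low-level and a high-level method and then used to correct new low-level results. The sketch below uses toy analytic functions as stand-ins for the two levels of theory (an assumption for illustration, not output of real electronic-structure codes):

```python
# Delta-learning sketch: learn accurate(x) - cheap(x), then correct cheap(x).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

def cheap(x):      # stand-in for a low-level method (e.g. a fast approximation)
    return x ** 2

def accurate(x):   # stand-in for an expensive high-level method
    return x ** 2 + 0.3 * np.sin(3 * x)

x_train = np.linspace(-1.0, 1.0, 50).reshape(-1, 1)
delta = (accurate(x_train) - cheap(x_train)).ravel()   # the learned correction

model = make_pipeline(PolynomialFeatures(9), Ridge(alpha=1e-8))
model.fit(x_train, delta)

x_new = np.array([[0.5]])
corrected = cheap(x_new).ravel() + model.predict(x_new)
print(corrected)
```

Because the correction is usually smoother and smaller than either energy surface, it can be learned from far less high-level data than the surface itself.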

 

10.1. Integration of Machine Learning with Quantum Chemistry

 


 

Throughout this work, we present a number of examples of the remarkable synergy between machine learning and quantum chemistry. In many ways, machine learning has made calculations at the accuracy of quantum methods, often labeled “hard” for their large computational cost, routine. The cooperation of these two fields continues to deepen: quantum-mechanical simulations such as ab initio molecular dynamics supply the data from which we identify inputs and systems for training machine learning models, and those models in turn help us design molecules. The integration of machine learning with first-principles quantum chemistry has continued to expand the scope of possible calculations and is absolutely central to providing predictive analytics of chemical systems.
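This loop, where quantum calculations supply training data and the learned surrogate supplies cheap predictions, can be illustrated in a few lines. The Morse-form "QM" energy below is a stand-in assumption, not output from a real electronic-structure code:

```python
# Fit a cheap surrogate to energies from an (assumed) quantum data source.
import numpy as np
from sklearn.kernel_ridge import KernelRidge

def qm_energy(r):
    """Stand-in for an expensive electronic-structure call (Morse-like potential)."""
    return (1.0 - np.exp(-1.5 * (r - 1.0))) ** 2

r_train = np.linspace(0.6, 3.0, 40).reshape(-1, 1)   # training "geometries"
E_train = qm_energy(r_train).ravel()

surrogate = KernelRidge(kernel="rbf", alpha=1e-6, gamma=5.0)
surrogate.fit(r_train, E_train)

r_test = np.array([[1.0], [2.0]])
preds = surrogate.predict(r_test)    # cheap predictions at new bond lengths
print(preds)
```

Once fitted, the surrogate can be queried thousands of times (for geometry scans, screening, or design loops) at negligible cost compared with the underlying quantum method.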

 

Thus, this work is not intended to be the definitive word on predictive machine learning models of molecular systems. It is built around some of the most accurate quantum methods in the context of generating bottom-up classical models and combining them with machine learning. It should be read not as a showcase of what machine learning can do, but as a new piece of the puzzle in combining accelerated chemical reaction design with a generative bottom-up approach to creating new molecules and understanding catalytic mechanisms. Given this focused scope, we ask the reader not to dismiss the effort if, judged by the mean absolute error (MAE) results alone, our machine learning predictions lack raw accuracy. [58][59][60]
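For reference, the MAE used above to judge raw accuracy is simply the average magnitude of the prediction errors, computed here on illustrative (made-up) values:

```python
# Mean absolute error: mean of |prediction - reference|.
import numpy as np

y_true = np.array([1.2, 0.8, 2.5])   # illustrative reference energies
y_pred = np.array([1.0, 1.0, 2.4])   # illustrative model predictions
mae = np.mean(np.abs(y_pred - y_true))
print(mae)   # ≈ 0.167
```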


References:

[1] P. Schwaller, A. C. Vaucher, T. Laino, et al., “Prediction of chemical reaction yields using deep learning,” Machine Learning: Science and Technology, 2021.

[2] F. Strieth-Kalthoff, F. Sandfort, et al., “Machine learning for chemical reactivity: the importance of failed experiments,” Angewandte Chemie, 2022.

[3] A. M. Zuranski and J. I. Martinez Alvarado, “Predicting reaction yields via supervised learning,” Accounts of Chemical Research, 2021.

[4] J. A. Keith, V. Vassilev-Galindo, B. Cheng, et al., “Combining machine learning and computational chemistry for predictive insights into chemical systems,” Chemical …, 2021.

[5] J. Schleinitz, M. Langevin, Y. Smail, et al., “Machine learning yield prediction from NiCOlit, a small-size literature data set of nickel catalyzed C–O couplings,” Journal of the American Chemical Society, 2022.

[6] B. Debus, H. Parastar, P. Harrington, et al., “Deep learning in analytical chemistry,” Trends in Analytical Chemistry, 2021.

[7] W. P. Walters and R. Barzilay, “Applications of deep learning in molecule generation and molecular property prediction,” Accounts of Chemical Research, 2020.

[8] Z. Jiao, P. Hu, H. Xu, and Q. Wang, “Machine learning and deep learning in chemical health and safety: a systematic review of techniques and applications,” ACS Chemical Health & Safety, 2020.

[9] J. Kim, S. Park, D. Min, and W. Kim, “Comprehensive survey of recent drug discovery using deep learning,” International Journal of Molecular Sciences, 2021.

[10] O. A. von Lilienfeld and K. R. Müller, “Exploring chemical compound space with quantum-based machine learning,” Nature Reviews Chemistry, 2020.

[11] B. Huang, G. F. von Rudorff, and O. A. von Lilienfeld, “The central role of density functional theory in the AI age,” Science, 2023.

[12] H. Al-Mahayni, X. Wang, J. P. Harvey, et al., “Experimental methods in chemical engineering: Density functional theory,” Canadian Journal of Chemical Engineering, 2021.

[13] B. Kalita, L. Li, R. J. McCarty, and K. Burke, “Learning to approximate density functionals,” Accounts of Chemical Research, 2021.

[14] Y. Elbaz, D. Furman, et al., “Modeling diffusion in functional materials: from density functional theory to artificial intelligence,” Advanced Functional Materials, 2020.

[15] M. M. Denner, M. H. Fischer, and T. Neupert, “Efficient learning of a one-dimensional density functional theory,” Physical Review Research, 2020.

[16] K. Ryczko, S. J. Wetzel, and R. G. Melko, “Toward orbital-free density functional theory with small data sets and deep learning,” Journal of Chemical Theory and Computation, 2022.

[17] B. Dereka, Q. Yu, N. H. C. Lewis, W. B. Carpenter, et al., “Crossover from hydrogen to chemical bonding,” Science, 2021.

[18] B. Bauer, S. Bravyi, M. Motta, and G. K. L. Chan, “Quantum algorithms for quantum chemistry and quantum materials science,” Chemical Reviews, 2020.

[19] J. Behler, “Four generations of high-dimensional neural network potentials,” Chemical Reviews, 2021.

[20] M. Wuttig, C. F. Schön, J. Lötfering, P. Golub, et al., “Revisiting the nature of chemical bonding in chalcogenides to explain and design their properties,” Advanced …, 2023.

[21] D. H. Bowskill and I. J. Sugden, “Crystal structure prediction methods for organic molecules: State of the art,” Annual Review of …, 2021.

[22] B. Cheng, R. R. Griffiths, S. Wengert, et al., “Mapping materials and molecules,” Accounts of Chemical Research, 2020.

[23] A. Gavezzotti, “The Crystalline States of Organic Compounds,” 2021.

[24] S. Luo, B. Xing, M. Faizan, J. Xie, K. Zhou, et al., “Inorganic Crystal Structure Prototype Database Based on Unsupervised Learning of Local Atomic Environments,” The Journal of …, 2022.

[25] D. Balcells and B. B. Skjelstad, “tmQM dataset—quantum geometries and properties of 86k transition metal complexes,” Journal of Chemical Information and Modeling, 2020.

[26] S. Brown, R. Tauler, and B. Walczak, “Comprehensive chemometrics: chemical and biochemical data analysis,” 2020.

[27] H. Parastar and R. Tauler, “Big (bio)chemical data mining using chemometric methods: a need for chemists,” Angewandte Chemie, 2022.

[28] M. Otto, “Chemometrics: statistics and computer application in analytical chemistry,” 2023.

[29] B. L. Milman and I. K. Zhurkovich, “Big data in modern chemical analysis,” Journal of Analytical Chemistry, 2020.

[30] P. H. Stefanuto, A. Smolinska, and J. F. Focant, “Advanced chemometric and data handling tools for GC×GC-TOF-MS: Application of chemometrics and related advanced data handling in chemical separations,” Trends in Analytical Chemistry, 2021.

[31] A. Banerjee and K. Roy, “How to correctly develop q-RASAR models for predictive cheminformatics,” Expert Opinion on Drug Discovery, 2024.

[32] R. Thareja, J. Singh, and P. Bansal, “Computational tools in cheminformatics,” in Chemoinformatics and Bioinformatics in the Pharmaceutical Sciences, Elsevier, 2021.

[33] C. Humer, H. Heberle, F. Montanari, T. Wolf, et al., “ChemInformatics Model Explorer (CIME): exploratory analysis of chemical model explanations,” Journal of Cheminformatics, 2022.

[34] K. Martinez-Mayorga, A. Madariaga-Mazon, et al., “The impact of chemoinformatics on drug discovery in the pharmaceutical industry,” Expert opinion on …, 2020.

[35] M. Ouqamra and D. Bouilly, “Unsupervised Idealization of Nano-Electronic Sensors Recordings with Concept Drifts: An Information Theory Approach for Non-Stationary Single-Molecule Data,” Electrochemical Society Meeting, 2020.

[36] P. El Kafrawy, H. Fathi, M. Qaraad, A. K. Kelany, et al., “An efficient SVM-based feature selection model for cancer classification using high-dimensional microarray data,” IEEE, 2021.

[37] X. Han, D. Li, P. Liu, and L. Wang, “Feature selection by recursive binary gravitational search algorithm optimization for cancer classification,” Soft Computing, 2020.

[38] J. Hastings, M. Glauer, A. Memariani, F. Neuhaus, et al., “Learning chemistry: exploring the suitability of machine learning for the task of structure-based chemical ontology classification,” Journal of …, 2021.

[39] K. Huang and H. Zhang, “Classification and regression machine learning models for predicting aerobic ready and inherent biodegradation of organic chemicals in water,” Environmental Science & Technology, 2022.

[40] Y. Chen and J. Kirchmair, “Cheminformatics in natural product-based drug discovery,” Molecular Informatics, 2020.

[41] N. Brown, P. Ertl, R. Lewis, T. Luksch, D. Reker, et al., “Artificial intelligence in chemistry and drug design,” Computer-Aided Molecular Design, Springer, 2020.

[42] T. Gimadiev, R. Nugmanov, D. Batyrshin, et al., “Combined graph/relational database management system for calculated chemical reaction pathway data,” Journal of Chemical Information and Modeling, 2021.

[43] M. Metcalf, N. P. Bauman, K. Kowalski, et al., “Resource-efficient chemistry on quantum computers with the variational quantum eigensolver and the double unitary coupled-cluster approach,” Journal of Chemical Physics, 2020.

[44] T. C. Ricard, X. Zhu, and S. S. Iyengar, “Capturing Weak Interactions in Surface Adsorbate Systems at Coupled Cluster Accuracy: A Graph-Theoretic Molecular Fragmentation Approach Improved through …,” Journal of Chemical Theory and Computation, 2023.

[45] B. M. S. Hasan and A. M. Abdulazeez, “A review of principal component analysis algorithm for dimensionality reduction,” Journal of Soft Computing …, 2021.

[46] S. Nanga, A. T. Bawah, B. A. Acquaye, M. I. Billa, et al., “Review of dimension reduction methods,” Journal of Data Analysis, 2021.

[47] A. M. Schweidtmann, E. Esche, A. Fischer, et al., “Machine learning in chemical engineering: A perspective,” Chemie Ingenieur Technik, vol. 93, no. 1-2, pp. 1-19, 2021.

[48] Z. Li, F. Liu, W. Yang, S. Peng, et al., “A survey of convolutional neural networks: analysis, applications, and prospects,” IEEE Transactions on Neural Networks and Learning Systems, 2021.

[49] A. A. Tulbure, A. A. Tulbure, and E. H. Dulf, “A review on modern defect detection models using DCNNs – Deep convolutional neural networks,” Journal of Advanced Research, 2022.

[50] Y. Liu, H. Pu, and D. W. Sun, “Efficient extraction of deep image features using convolutional neural network (CNN) for applications in detecting and analysing complex food matrices,” Trends in Food Science & Technology, 2021.

[51] A. Talevi, “Computer-aided drug discovery and design: recent advances and future prospects,” Computational Drug Discovery and Design, 2023.

[52] L. Zhao, H. L. Ciallella, L. M. Aleksunes, and H. Zhu, “Advancing computer-aided drug discovery (CADD) by big data and data-driven machine learning modeling,” Drug Discovery Today, 2020.

[53] C. Selvaraj, I. Chandra, and S. K. Singh, “Artificial intelligence and machine learning approaches for drug design: challenges and opportunities for the pharmaceutical industries,” Molecular Diversity, 2021.

[54] M. Staszak, K. Staszak, K. Wieszczycka, et al., “Machine learning in drug design: Use of artificial intelligence to explore the chemical structure–biological activity relationship,” Wiley, 2022.

[55] A. E. A. Allen and A. Tkatchenko, “Machine learning of material properties: Predictive and interpretable multilinear models,” Science Advances, 2022.

[56] R. Dybowski, “Interpretable machine learning as a tool for scientific discovery in chemistry,” New Journal of Chemistry, 2020.

[57] N. Omidvar, H. S. Pillai, S. H. Wang, T. Mou, et al., “Interpretable machine learning of chemical bonding at solid surfaces,” Journal of Physical Chemistry C, 2021.

[58] S. Jerbi, L. J. Fiderer, H. Poulsen Nautrup, et al., “Quantum machine learning beyond kernel methods,” Nature Communications, 2023.

[59] D. Peral-García, J. Cruz-Benito, et al., “Systematic literature review: Quantum machine learning and its applications,” Computer Science Review, 2024.

[60] M. C. Caro, H. Y. Huang, M. Cerezo, K. Sharma, et al., “Generalization in quantum machine learning from few training data,” Nature Communications, 2022.
