Machine Learning for Chemical Reactions

1. Introduction to Machine Learning in Chemistry

 

Machine learning (ML) has emerged as a transformative force in the field of chemistry, fundamentally altering how researchers approach the modeling and analysis of chemical reactions. By employing data-driven techniques, ML enables the identification of complex patterns and relationships within large datasets that traditional methodologies may overlook (Meuwly, 2021). This capability is particularly beneficial in reaction prediction and optimization, where ML algorithms can analyze previous experimental results to forecast outcomes of new reactions, thereby enhancing the efficiency of research and development efforts. Furthermore, ML’s integration into chemical sciences facilitates the discovery of novel materials and reaction pathways. Techniques such as Bayesian inference allow for the incorporation of experimental knowledge into computational models, significantly improving their accuracy and reliability. Additionally, ML approaches address challenges that conventional methods struggle to solve, such as the explicit simulation of reactive networks in dynamic systems. As these technologies continue to evolve, their applications promise to revolutionize experimental design and data interpretation in chemistry, fostering a new era of innovation within the discipline. (Meuwly, 2021; Kovács et al., 2021)

 

1.1. Overview of Machine Learning Techniques

 

Machine learning (ML) techniques have become increasingly integral to the analysis and prediction of chemical reactions, offering innovative solutions to complex challenges in cheminformatics. Supervised learning methods, such as neural networks and support vector machines, are particularly effective for modeling relationships within labeled datasets, allowing researchers to predict reaction outcomes based on historical data (Meuwly, 2021). Conversely, unsupervised learning techniques, including clustering methods, enable the identification of patterns and groupings in unlabeled data, which can reveal underlying chemical behaviors and trends. Reinforcement learning, although less common in this domain, holds promise for optimizing reaction pathways by learning from the feedback of experimental outcomes.

The efficacy of these machine learning approaches is significantly enhanced by careful feature selection, data preprocessing, and robust model evaluation. Feature selection is crucial in reducing dimensionality and improving the interpretability of models, while preprocessing ensures that datasets are clean and structured for analysis. Model evaluation techniques, such as cross-validation, are essential for assessing the predictive performance of algorithms and ensuring their reliability in practical applications. Collectively, these methodologies lay the foundation for leveraging machine learning to transform traditional approaches to chemical reaction dynamics, paving the way for advances in reaction planning and predictive modeling. (Meuwly, 2021; Westermayr et al., 2021; Stocker et al., 2020; Keith et al., 2021; Zhu et al., 2022)
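As a concrete illustration of the model-evaluation step discussed above, the following sketch implements k-fold cross-validation in plain Python. The baseline "model" (predicting the mean training target) and all function names are illustrative assumptions; a real workflow for reaction data would typically rely on an established library such as scikit-learn.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, k=5):
    """Average test mean-squared error over k train/test splits."""
    folds = k_fold_indices(len(X), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = fit([X[j] for j in train], [y[j] for j in train])
        preds = [predict(model, X[j]) for j in test]
        mse = sum((p - y[j]) ** 2 for p, j in zip(preds, test)) / len(test)
        errors.append(mse)
    return sum(errors) / k

# Hypothetical baseline model: always predict the mean training target.
fit_mean = lambda X, y: sum(y) / len(y)
predict_mean = lambda model, x: model
```

Held-out estimates of this kind are what make the reliability claims about a trained reaction model testable in practice.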

 

1.2. Applications of Machine Learning in Chemical Reactions

 

Machine learning (ML) has emerged as a pivotal tool in enhancing the understanding and execution of chemical reactions, fundamentally transforming various aspects of chemical research. By harnessing large datasets, ML algorithms can predict reaction outcomes with remarkable accuracy, optimize conditions for desired results, and facilitate the discovery of novel synthetic routes. For instance, in catalysis, ML models allow for the efficient screening of potential catalysts, significantly reducing the time and resources required to identify optimal candidates (Meuwly, 2021). This capability is particularly valuable in drug discovery, where the identification of reaction pathways can lead to the design of more effective pharmaceutical compounds. Moreover, the integration of ML with high-throughput experimentation has accelerated the pace of research, enabling chemists to explore vast chemical spaces more efficiently than traditional methods would allow. Techniques such as Bayesian inference further enhance ML applications by incorporating experimental knowledge into computational models, thereby refining predictions and improving outcomes. As the field continues to evolve, the potential of ML to tackle complex problems in chemical reactions—such as simulating reactive networks in combustion or addressing formally intractable issues—underscores its transformative impact on chemical methodologies and innovations. (Zuranski et al., 2021; Schwaller et al., 2021; Heid and Green, 2021)

 

2. Fundamentals of Chemical Reactions

 

The process of reacting substances to form new products is a fundamental aspect of chemistry. These chemical reactions typically rely on the breaking and forming of chemical bonds between the reacting species. Transformation of molecules is thus expected to involve the movement of atomic nuclei and the associated evolution of time-dependent wave functions for the electrons, treating the electrons as degrees of freedom that follow the motion of the nuclei. According to quantum mechanics, the evolution of the wave functions is governed by the time-dependent Schrödinger equation that incorporates the total interaction between the interacting atoms explicitly.

 

Although the wave functions are defined at all times, the complex interaction potential affects the time evolution predominantly in the vicinity of a chemical reaction. A chemical reaction proceeds from separated reactants to products over a time interval much shorter than the overall time scale of the process. In practice, the reactants are generally prepared at a higher potential energy, introduced via heating or lasers, above the lowest-energy configurations of the total interaction potential. Distinct potential energy surfaces corresponding to different electronic states give rise to competing reaction pathways, realized through the variation of the nuclear coordinates in time.

 

Although quantum mechanics imposes no explicit “rules” on possible reactants and products, certain products can clearly be associated with “similar” classes of reactants. For example, metal or π-complexes tend to undergo fission into radicals, leaving more ionic products behind. Addition of a neutral molecule to a planar ring system in an electrophilic substitution often requires rotating part of the ring out of plane, disrupting the more covalent π bonds. More generally, reactants tend to resemble their products with respect to charge, multiplicity, symmetry, and higher invariants. Exploring how simplicity, consistency, and robustness emerge from complexity at, or near, chemical reactions reveals fundamental limits on the applicability of approximate measures of universal character. Unlike the detailed numerical results of individual tests, consideration of large sets of thousands of conceivable reactions provides access to local, symmetrized measures, some of which can be identified with quantities familiar from quantum scattering theory in a mechanism-independent interpretation. (Shieh et al., 2021; O’Neill and Boulatov, 2021; Mateti et al., 2021; Wang et al., 2021; Huo et al., 2021; Grambow et al., 2020)

 

2.1. Types of Chemical Reactions

 

A chemical reaction is the process by which the atoms of one or more substances are rearranged to form different substances. A chemical reaction always involves a chemical change; a purely physical change, such as a change of state, is not a chemical reaction. Chemical reactions can be classified in a variety of ways. The most basic distinction is energetic: in some reactions energy is absorbed (endothermic), while in others energy is released (exothermic). Chemical reactions can also be grouped according to the types of reactants and/or products. This section covers the most common types of chemical reactions, with classification schemes based on the basic types of reactants, the phases of reactants and products, and the energy changes.

 

  1. Combustion Reactions. In combustion reactions, heat is released when a compound reacts with oxygen. Most combustion reactions involve hydrocarbons, or compounds that contain hydrogen and carbon. When hydrocarbons are burned, carbon dioxide and water are produced. When compounds that contain sulfur are burned, sulfur dioxide is produced. Combustion reactions can be classified as complete or incomplete. A complete combustion reaction occurs when there is sufficient oxygen present, leading to the production of carbon dioxide and water as the only products. Incomplete combustion occurs when there is insufficient oxygen available, producing carbon monoxide or carbon along with carbon dioxide and water.

 

  2. Combination Reactions. In combination reactions, two or more reactants combine to form a single product. Combination reactions are useful for synthesizing larger molecules from smaller precursors and are frequently observed in nature. For example, hydrogen and chlorine combine to form hydrogen chloride, and sodium and chlorine combine to form sodium chloride. There are several types of combination reactions: a) element and element, b) element and compound, c) compound and compound. In the simplest case, A + B → AB, where A and B are elements and AB is a compound.

 

  3. Decomposition Reactions. In decomposition reactions, a single reactant breaks down into two or more products. Decomposition reactions usually require an input of energy in the form of heat, light, or electricity. For example, water can be decomposed into gaseous hydrogen and oxygen by electrolysis: 2H₂O → 2H₂ + O₂. Note that melting ice or boiling water is a physical change, not a decomposition reaction, since the water molecules remain intact. Types of decomposition include a single compound decomposing into its elements, decomposition driven by heat (thermal decomposition), and decomposition driven by electricity (electrolysis). (Meuwly, 2021; Smith, 2020; Grommet et al., 2020; O’Neill and Boulatov, 2021; Joudeh and Linke, 2022; Ijaz et al., 2020)
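The three reaction types above can be caricatured with simple counting rules. The sketch below is a toy, rule-based classifier operating on lists of species formula strings; the rules and the formula notation are illustrative assumptions, not a general-purpose tool.

```python
def classify_reaction(reactants, products):
    """Toy rules mirroring the combustion / combination / decomposition types."""
    # Combustion: reaction with O2 yielding CO2 and H2O.
    if "O2" in reactants and "CO2" in products and "H2O" in products:
        return "combustion"
    # Combination: several reactants merge into a single product (A + B -> AB).
    if len(reactants) >= 2 and len(products) == 1:
        return "combination"
    # Decomposition: one reactant splits into several products (AB -> A + B).
    if len(reactants) == 1 and len(products) >= 2:
        return "decomposition"
    return "other"
```

Rules like these only scratch the surface; data-driven classifiers, discussed later in this article, learn much richer distinctions from reaction datasets.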

 

2.2. Key Parameters and Variables in Chemical Reactions

 

To understand chemical reactions, it is necessary to explore the key parameters of chemical reactions. A chemical reaction is described by a reaction network or system of ordinary differential equations that describes how the concentrations of different chemical species vary with time. In each reaction or transition, one or more chemical species are transformed into others. The velocity of each reaction depends on the concentrations of the species involved and on global rate constants, which depend on the system conditions and on the nature of the reaction.

The rate equations govern the continuous change of the concentrations of the chemical species over time. For each chemical species, with concentration denoted x, an equation describes the variation of its concentration with time, with contributions from every reaction in which the species participates. Stoichiometric coefficients indicate how many molecules of each species participate in a reaction: a species with a zero coefficient on the product side appears only as a reactant, and one with a zero coefficient on the reactant side appears only as a product. All parameters appearing in the rate equations are constants, defined for a particular temperature and pressure.

 

The reaction system is described by parameters that capture the coupling between chemical species. Each chemical reaction changes the concentrations of a potentially different set of species, and because a reaction can occur more than once, each one is labeled by a unique index. The stoichiometry vector of a reaction contains the numbers of molecules of each species consumed and produced, with negative signs for reactants and positive signs for products: reactants are the species consumed, and products are what is produced in a particular reaction. Collecting one such vector per reaction yields the stoichiometry matrix, which, together with the rate laws, defines the rate equations of all species; its integer entries indicate whether, and with what multiplicity, each species is involved in each reaction. (Temkin et al., 2020; Unsleber and Reiher, 2020; Kumar, 2020; Hsu and Chen, 2022)
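To make the stoichiometry-matrix formalism concrete, the sketch below encodes the toy network A + B → C, C → A + B under mass-action kinetics. The species names, rate constants, and reaction orders are invented for illustration only.

```python
# Toy network: R1: A + B -> C,  R2: C -> A + B (mass-action kinetics).
species = ["A", "B", "C"]
rate_constants = [2.0, 0.5]   # k1, k2 (illustrative values)

# Stoichiometry matrix S: one row per species, one column per reaction.
# Entries are products minus reactants (reactants negative, products positive).
S = [
    [-1, +1],   # A
    [-1, +1],   # B
    [+1, -1],   # C
]

# Reaction order of each species in each reaction (exponents in the rate law).
orders = [
    [1, 0],     # A
    [1, 0],     # B
    [0, 1],     # C
]

def rates(conc):
    """Mass-action rate v_j = k_j * prod_i c_i ** order_ij."""
    v = []
    for j, k in enumerate(rate_constants):
        r = k
        for i, c in enumerate(conc):
            r *= c ** orders[i][j]
        v.append(r)
    return v

def dcdt(conc):
    """Right-hand side of the rate equations: dc/dt = S v(c)."""
    v = rates(conc)
    return [sum(S[i][j] * v[j] for j in range(len(v))) for i in range(len(conc))]
```

Feeding `dcdt` to any standard ODE integrator yields the concentration trajectories of the reaction network.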

 

3. Data Collection and Preprocessing

 

Machine learning applications in chemistry often rely on data collected from multiple sources. A variety of databases and data sources containing information about chemical reactions are available, including both manually curated and computationally generated databases. Public chemical reaction datasets, which contain molecule information in SMILES representation, are also provided, and dedicated reaction datasets for reaction prediction can be downloaded directly. In addition to commercial databases, reaction datasets generated by algorithms are available; like the experimental databases developed by researchers, these may contain reactions not covered by the databases mentioned above.

 

Although chemical reactions are represented as SMILES strings in databases, they require processing before they can be used to develop machine learning models. Because the size and format of the data can vary, preprocessing ensures a proper format for the model training process and adds features that facilitate the learning task. Data cleaning is therefore performed first: the reaction dataset is checked for formatting problems and corrected where needed, and large, diverse datasets are screened for inappropriate entries that can cause problems during training. Reaction SMILES containing salts or polymers can lead to erroneous predictions and should be excluded, as should entries with no product information, non-canonical SMILES notation, or measured quantities other than the reaction energy change.
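A minimal sketch of such cleaning filters, operating purely at the string level on reaction SMILES of the form `reactants>>products`, is given below. A real pipeline would parse the SMILES with a cheminformatics toolkit such as RDKit; the substring checks here (for common salt ions and polymer attachment points) are simplified, illustrative stand-ins.

```python
def keep_reaction(rxn):
    """Return True if a reaction SMILES string passes the simple filters."""
    parts = rxn.split(">>")
    if len(parts) != 2:
        return False                 # malformed: expect reactants>>products
    reactants, products = parts
    if not products.strip():
        return False                 # no product information
    if "*" in rxn:
        return False                 # '*' often marks polymer attachment points
    if any(ion in rxn for ion in ("[Na+]", "[K+]", "[Cl-]")):
        return False                 # crude salt screen (illustrative only)
    return True

def clean_dataset(reactions):
    """Keep only the reaction strings that pass all filters."""
    return [r for r in reactions if keep_reaction(r)]
```
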

 

Feature engineering is performed next. First, the size of the dataset is examined, along with how it affects the prediction performance of the trained models. To maximize performance, there should be a balance between the number of reactions in the training set and the number of distinct molecules it contains; if the molecules in a dataset are too large, prediction performance suffers. Datasets should also span a variety of solvent conditions, reactant types, and other properties to avoid overfitting the trained model to specific types of reactions. To ensure diverse training datasets, similarity-based clustering is employed to track how similar the training and test sets are. Additionally, several features based on the SMILES format are found to provide good prediction performance, and their calculation is demonstrated. (Vaucher et al., 2021; Grambow et al., 2020)

3.1. Sources of Chemical Reaction Data

 

Rational reaction prediction has been a longstanding challenge in chemistry and chemical engineering. The thorough indexing of thousands of known reactions dating as far back as 1866, together with modern advances in computational power, has recently set the stage for the development of deep learning techniques, focusing primarily on organic chemistry reactions. Computer-assisted synthesis and reaction prediction hold high promise for accelerating the discovery of new molecules with desired properties, such as new drugs, catalysts, or materials, and thereby enhancing the efficacy and efficiency of R&D. Automated methods for the prediction of chemical reactions are based on the prior learning of a set of known chemical transformations, preferably from patent and journal databases of known reactions between different chemical entities. These transformations are then used to predict further reactions involving novel chemical entities. An unbiased statistical analysis of common transformations in organic chemistry reactions is the starting point. Chemical reaction transformations are often represented in terms of the atom and bond changes associated with the reaction arrow. An automatic database indexing procedure is considered that extracts all the relevant reaction transformations from a set of about eight million organic chemistry reactions. The knowledge extracted is utilized in reaction prediction studies in which the approach is applied to predict novel reactions not included in the original database. Several examples of successful predictions and their subsequent experimental verification are presented. The database of reactions is further utilized in the development of a graphical display of reaction schemes based on the common transformations. Such a display may serve as an intuitive starting point in reaction prediction studies.
Overall, automated methods for chemical reaction prediction have been developed and tested in the basic chemical environment of organic reactions. The analysis of the approach and the results indicate that a knowledge-based, biomimetic approach provides sufficient robustness and completeness. (Shen et al., 2021; Kovács et al., 2021)

 

3.2. Data Cleaning and Feature Engineering

 

Machine learning for chemical reactions relies on quality data, which in turn leads to quality models. The first step focuses on data cleaning and feature engineering techniques, considering the inputs, targets, and training datasets. For chemical reactions, the input features to be engineered are the reaction reagents or the molecular graphs describing the reagents. For the targets, either the reaction products or the reaction energies can be considered. For the product generation model, there exists a need to acquire the reaction product dataset and prepare it as an input for the machine learning task. For the energy estimation model, the corresponding energies also need to be extracted.

 

The data is first cleaned to remove erroneous reaction strings from the dataset. Strings can be flawed for various reasons, and it is important to identify them. For product generation, reactions resulting in an empty product string are considered erroneous. For energy estimation, reactions likely to give erroneous energies are removed, such as certain types of ring contraction or specific reagents that have previously been shown to produce problematic products. Strings that are chemically implausible, for example those implying negative valences, or that cannot be parsed by the cheminformatics library, are also discarded.

 

Once the data is cleaned, all reaction strings are converted into a predefined format suitable for the model architecture. For the product generation model, unreacted reagents are enclosed in square brackets, and the reaction conditions are described as reactants. For the energy estimation model, the product is replaced by a mask token, and dataset-source tokens are appended to the end of the strings to indicate the origin of each entry. Chemical reaction data preprocessed in this manner are made available.

 

Feature engineering is also required for the chemical graph data. For the product generation model, the chemical graph representation of the reaction strings is computed and converted into input graph data for the model architecture. For the energy estimation model, the chemical graphs of the reagents and products are computed separately, and a total of four feature matrices are prepared as inputs to the model architecture. (Meuwly, 2021; Schweidtmann et al., 2021; Jorner et al., 2021; Dobbelaere et al., 2021)
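As an illustration of the graph feature engineering described above, the sketch below turns a toy molecular graph (atom symbols plus bond index pairs) into the node-feature and adjacency matrices a graph model would consume. The atom vocabulary and all function names are illustrative assumptions, not the representation used in the cited works.

```python
ATOM_VOCAB = ["C", "N", "O", "H"]    # illustrative one-hot vocabulary

def featurize(atoms, bonds):
    """Build a one-hot node feature matrix X and a symmetric adjacency matrix A."""
    n = len(atoms)
    X = [[1.0 if a == s else 0.0 for s in ATOM_VOCAB] for a in atoms]
    A = [[0.0] * n for _ in range(n)]
    for i, j in bonds:
        A[i][j] = A[j][i] = 1.0      # undirected bond
    return X, A

def reaction_features(reagents, products):
    """For an energy-type model: separate feature matrices for each side.

    `reagents` and `products` are lists of (atoms, bonds) graphs.
    """
    return [featurize(*g) for g in reagents], [featurize(*g) for g in products]
```
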

 

4. Supervised Learning Models for Chemical Reactions

 

Machine learning (ML) methods play a compelling role where training data are available but limited prior knowledge of the system precludes physically based deterministic models. For modeling small organic compounds, supervised ML models for chemical reactions, of both regression and classification type, were constructed using curated training sets of thermochemical reaction heats and reaction-path free energies. Regression models based on kernel ridge regression, gradient boosting regression, random forest regression, and deep neural networks were used for numerical values of chemical properties, while classification models based on K-nearest neighbors, Naive Bayes, logistic regression, support vector classification, decision trees, random forest classification, and multilayer perceptrons were used to discriminate classes of chemical compounds. By incorporating a simple chemical structure descriptor, the models enable interpolation of the training data even when complex chemical structures are employed. Going further, the active sites of base and acid catalysts, which determine reactivity, were predicted using computed frequency analyses.

 

4.1. Regression Models

 

Supervised learning regression models for chemical reactions were constructed to predict (i) the reaction heat, the thermochemical property of a chemical reaction, and (ii) the reaction-path free energy, the energy profile through the transition state of a chemical reaction. Both quantities are essential for elucidating a reaction mechanism, enabling step-by-step observation from the initial state to the final state, which underlies the rationalization of nearly all catalytic reactions. For small molecules, more than 4,000 training entries of high-throughput computed reaction heats and transition-state barrier free-energy changes were selected by curation of the supported-metal subset of a computational dataset of over 80,000 heterogeneous catalytic reactions. In the homogeneous oxidation training set, alkanes and aromatics were selected as organic compounds, and reactions were limited to metal-mediated oxidation of adsorbed species. Fully connected neural networks, K-nearest neighbors, and kernel ridge regression were used as regression models.

 

4.2. Classification Models

 

A classification model for chemical reaction type discrimination was constructed to identify (i.e., classify) a chemical reaction from a representation drawn simply from a surface image. Images of ceria quadrate- and rod-based exposed surfaces in 2D and 3D were compiled as a training dataset, with added reaction labels of CO oxidation and NOx–CO oxidation. Supervised machine learning methods are discussed for classifying chemical reactions across many types, even from surface-image-based visualization. Unsupervised clustering models can leave some of the clusters composing parts of the training reaction information undetected and readily misclassify unseen chemical reactions; to classify many reaction types without losing any specific type, supervised learning models better exploit the known training information. Here, nine supervised classification models with different algorithms were constructed: K-nearest neighbors, Naive Bayes, support vector classifier, decision tree, multilayer perceptron, logistic regression, random forest classifier, gradient boosting classification, and AdaBoost classification.

 

4.1. Regression Models

 

There are three broad categories of supervised learning tasks: regression, multi-task regression, and classification. In the case of regression, the model is trained to predict a continuous quantity. In the context of chemical reactions, a large number of chemical properties of molecules can be cast as regression-like tasks. Here, the aim is to predict thermochemistry, barrier heights, or free energy differences. In contrast, classification tasks return a discrete label and are applicable to other types of chemical reactions, such as the prediction of reaction products, where the model predicts distinct structures.

 

Regression Models. The goal of regression models is to predict a continuous quantity. Given a set of independent variables X and a dependent variable y, the model learns a function f(X) = y, providing estimates ŷ as close to y as possible. The quality of the estimates is measured by a loss function L(y, ŷ), for example the mean squared error, MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)², taken over a dataset D of n data points.

Within the supervised learning framework, the optimal parameters w* are found by minimizing the loss over the training dataset. Framing the regression problem within a machine learning framework brings many modeling opportunities. A popular and widely applied model for the task of regression is kernel ridge regression. Given training energies E of n data points, the data are mapped into a feature space X with a similarity measure defined by a kernel function. Here, a Gaussian kernel is chosen:

k(x, x′) = exp(−‖x − x′‖² / (2σ²)),

where σ is the kernel width.

The kernel values between the training points are collected into the design matrix K, and the model predicts the energy of a new point F as:

Ê(F) = Σᵢ αᵢ k(F, Xᵢ),  α = (K + λI)⁻¹ E,

where the coefficients α are obtained by solving the regularized linear system and λ is a small regularization parameter.

When learning from training data, prior knowledge of the parameters is specified, and the posterior is updated according to new experimental data. The quality of a prediction can therefore be expressed in terms of uncertainties, which are computationally expensive to obtain because they generally require the inversion of an n × n matrix. Whether to trust an estimate can be decided from the prediction uncertainty, screening reaction estimates by their predicted accuracy: points far from the training data are uncertain, so the model trusts points closer to the training data, reducing the chance of falsely accepting new candidates far from the known training points.
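The kernel ridge regression model just described can be sketched in a few lines of numpy. The one-dimensional training data, σ, and λ below are illustrative choices, and the function names are ours; this is a minimal sketch of the method, not the implementation used in the studies cited here.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    """Pairwise Gaussian kernel k(x, x') = exp(-||x - x'||^2 / (2 sigma^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, E, sigma=1.0, lam=1e-8):
    """Solve (K + lam I) alpha = E for the regression coefficients."""
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), E)

def krr_predict(X_train, alpha, X_new, sigma=1.0):
    """Predict energies of new points from kernel similarities to training data."""
    return gaussian_kernel(X_new, X_train, sigma) @ alpha
```

With a small λ, the model interpolates the training energies almost exactly, while larger λ trades fit quality for smoothness and robustness to noise.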

 

4.2. Classification Models

 


 

Classification models provide a means to group similar data points together based on their shared characteristics. In a classification model, these characteristics, or features, of a data set are defined first, and then a mathematical equation is created to represent the model. The best-fit model will then be passed to a prediction algorithm that will classify new data points into one of the groups based on the shared characteristics. Generally, the features need to be numerically encoded for the machine learning algorithm to utilize them, which can range from simple integer numbers assigned to each classification to several columns of quantitative information represented by numbers.

 

In the context of chemical reactions, given a particular starting compound, a reaction is classified according to its product structure. In an example classification model implementation, starting compounds are represented by five main structural characteristics that were calculated using cheminformatics software, including footprint cell parameters, hydrogen bond donors and acceptors, and tetravalent centers.

 

Once the features are calculated, they are passed to the machine learning model. The model is then trained on a collection of reactions, each associated with the experimental conditions commonly deposited in an electronic database. Two classifiers are implemented: Support Vector Machines and Random Forest. Results from both classifiers are compared, with the Random Forest showing better precision and recall, at 66% and 79% respectively. To reduce the risk of cholesterol-synthesis reactions being missed, a combined model implementation is used, consisting of the union of the outputs of the Support Vector Machine and Random Forest models. This combined approach increases sensitivity by another 10%, to 89%.
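The union-of-classifiers idea can be sketched directly: a reaction is flagged positive if either model flags it, trading some precision for sensitivity. The two threshold "classifiers" below are trivial stand-ins, not trained SVM or Random Forest models.

```python
def union_ensemble(clf_a, clf_b):
    """Combine two binary classifiers by logical OR of their predictions."""
    return lambda x: clf_a(x) or clf_b(x)

def sensitivity(clf, samples, labels):
    """Fraction of true positives recovered (recall on the positive class)."""
    positives = [x for x, y in zip(samples, labels) if y]
    if not positives:
        return 0.0
    return sum(1 for x in positives if clf(x)) / len(positives)
```

Because the union fires whenever either member fires, its sensitivity is always at least that of the better individual classifier, which is exactly the behavior exploited above.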

 

Once the initial model is established, several refinements of the classifiers are applied on top of the existing approach. First, model generalizability is severely hampered in regions of low data availability due to overfitting, with the Random Forest failing to correctly classify any reactions under these conditions. This was addressed by setting aside a portion of the available data for classification, which suggests a need to curate an additional group of cholesterol-synthesis conditions that were less stringently enforced during database compilation.

 

In addition to improving the cholesterol classifier’s performance, this would also benefit other classifiers aimed at more common products. Finally, hierarchical and multi-class classifiers are investigated. (Wen et al., 2022; Liu et al., 2021)

 

5. Unsupervised Learning Techniques

 

In unsupervised learning, unlabeled datasets are used, with the goal of dividing them into subgroups or inferring relationships between data items. The most frequent task in machine learning applied to chemical reactions is finding patterns in data. Such techniques can be classified as dimensionality reduction or clustering approaches. Dimensionality reduction typically shrinks the data size by more than an order of magnitude, transferring the data to a new coordinate system in which a small number of new coordinates describes the majority of the variance and information in the data. After this processing step, clustering techniques are employed to recover and classify groups in the reduced dataset. Although dimensionality reduction and clustering can be computed separately, some approaches simultaneously reduce the data size and project it into a space where clusters tend to be better defined.

 

In order to narrow the search, only clustering algorithms that partition the reduced datasets into a predefined number of clusters are described. Implementations as simple scripts with no additional external requirements are provided. In the context of chemical reactions, the normal practice after obtaining a clustering result is to visualize the data’s representation in the reduced space and then analyze the composition of the resulting clusters in the original variables. Both the datasets and clustering results are thus understood within a machine learning approach applied to chemical space. The implementation is an unsupervised approach to data classification; that is, the models used for the analysis and interpretation of results use no knowledge of the sample composition or the nature of the chemical reaction, only the information in the spectral variability.

 

As an unsupervised approach to chemical data clustering, the k-means algorithm applied to a UV-Vis spectral dataset is proposed. The input consists of curves from a simple photochemical reaction, an experimental dataset drawn from chemical space. To facilitate understanding of the clustering process, dimensionality reduction by PCA is first performed on the dataset. Since PCA is the most widely used technique in chemometrics, an illustrative example of chemical-space clustering of experimental UV-Vis spectra is presented first. Following this simple application, a more complex clustering of synthetic near-infrared spectral data is described, a case in which the co-addition of chemicals led to reclassification problems. In both applications, the clusters were first obtained on the reduced datasets, and cluster composition was then analyzed in the original variables.
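The PCA step can be sketched via an eigendecomposition of the covariance matrix, as below. The rows of X stand in for mean-centered spectra; the data and function name are illustrative, not the chemometrics implementation referenced above.

```python
import numpy as np

def pca(X, n_components=2):
    """Project mean-centered rows of X onto the top principal components."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)       # feature-by-feature covariance
    vals, vecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:n_components]
    W = vecs[:, order]                   # principal directions as columns
    return Xc @ W, vals[order]           # scores and explained variances
```

The returned scores are the low-dimensional coordinates on which a clustering algorithm such as k-means is subsequently run.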

 

5.1. Clustering Algorithms

 

Clustering refers to grouping a set of objects so that objects in the same cluster are more similar to one another than to objects in other clusters. The goal is to choose a grouping that is meaningful and useful. Clustering has significant applications in biology, medicine, industry, and the earth sciences, including the analysis of chemical reactions and the detection of diseases. In each case, the task is to group related objects.

 

The need for clustering algorithms arises from the variability and complexity of data. As the volume of available data grows exponentially, so does the need for effective clustering techniques, driven by the desire to uncover hidden patterns and relationships. An ideal clustering technique should produce high-quality clusters, determine an appropriate number of clusters, run in reasonable time, and work effectively across widely different datasets.

 

To build a deeper understanding of the requirements of clustering techniques, three representative algorithms are introduced. K-means is the most commonly used clustering method, but it requires the number of clusters to be specified in advance and is sensitive to the density, shape, and distribution of clusters within the dataset; finding the globally optimal clustering configuration is NP-hard. K-medoids is similar to K-means but, unlike K-means, can be run on datasets with an arbitrary distance function. A probabilistic alternative, the expectation-maximization (EM) algorithm, finds good locally optimal configurations but, like K-means, requires an initial choice of clusters. In general, K-means is recognized as a fast clustering technique with satisfactory results. (Date et al., 2020)(Dobbelaere et al., 2021)
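A brief comparison of K-means and the EM algorithm (via a Gaussian mixture) on synthetic blobs illustrates the shared requirement that the number of clusters be fixed up front. The data and the use of scikit-learn are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Three well-separated synthetic blobs; both methods need k in advance.
X = np.vstack([rng.normal(loc, 0.3, (50, 2))
               for loc in ([0.0, 0.0], [4.0, 0.0], [0.0, 4.0])])

# K-means: hard assignments via squared Euclidean distance to centroids.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# EM on a Gaussian mixture: soft probabilistic assignments, hardened here.
em_labels = GaussianMixture(n_components=3, random_state=0).fit(X).predict(X)
```

On well-separated data the two agree up to label permutation; they diverge when clusters overlap or are non-spherical.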

 

6. Deep Learning in Chemical Reactions

 

Deep learning, a rapidly developing paradigm of machine learning, has achieved exceptional performance in fields such as image recognition, speech recognition, and drug discovery. Deep learning technologies have recently taken off in computational chemistry and materials science, filling the gap between the rapid advancement of high-throughput screening experiments and the lack of direct modeling with atomic-level accuracy. Deep neural networks (DNNs) have been widely applied to molecular property prediction, including solubility, toxicity, and formation energy, with accuracy comparable to DFT. The massive molecular datasets accumulated in knowledge bases can be mined to train deep neural networks across many related tasks. DNNs have also been applied to chemical reaction prediction. This section reviews research-level applications of DNNs to chemical reaction prediction and introduces representative datasets, neural network architectures, and approaches to predicting various chemical reactions for the design of desired synthetic routes or products.

 

Chemical reactions are central to chemistry and materials science, and information about reactions has long been buried in literature and data sources that are nowadays available in large volume. To utilize this pre-existing knowledge, many databases have been built to archive information on different types of chemical reactions, covering reactants, products, experimental observations, intermediates, and mechanisms. Various computer-aided platforms have been proposed as assistant "virtual chemists" to help experimental chemists explore these knowledge bases. The essential task is to efficiently recover the buried reaction information or to predict desired reactions or products. This requires in-depth investigation of the knowledge bases, a thorough understanding of the reaction knowledge they contain, and then quantitative modeling of the chemical reactions.

 

Toward this target, rule-based approaches have been built on computer-aided reaction knowledge bases. However, they have inherent difficulties with generalization, being either overgeneralized or too specific, and thus may miss many valid reactions. This has prompted the development of data-driven approaches, which explore the complex relationships among reactants, products, and reaction conditions end-to-end, bypassing the arduous process of reasoning out and designing reaction rules. To correlate the chemical substructures of the species with their tendency to undergo certain reactions, modern DNNs can recognize the structural features of the species relevant to a reaction pattern in a domain-free manner. Architectures have been proposed that directly relate structural representations of molecular graphs to reaction descriptions. Beyond qualitative reaction prediction, post-analyses of predicted reactions, including mechanisms or energy barriers, can be carried out by DNN architectures that treat reactions sequentially with the help of attention mechanisms.

 

6.1. Neural Networks for Reaction Prediction

 

This review of neural networks for chemical reaction prediction has two major sections: methods and applications. The methods section covers the design choices that have been explored in the proposed architectures, including reaction encodings, network architectures, and training strategies. The applications section reviews the diverse areas in which these methods are being explored, including the prediction of simple organic reactions, catalyzed transformations, and abiotic transformations.

 

The approaches covered in this review span a wide range of reaction classes and data sets. Data-driven approaches have been proposed for predicting the products of a great diversity of organic transformations, from nucleophilic substitutions to cycloadditions and rearrangements. Machine learning models have further been trained to reproduce substrate and product distributions observed in experiments. Public data sets for many of these transformations have fostered research in this area. A distinct but related endeavor is the development of computational platforms to predict product distributions in multistep organic reactions. Assessing chemical feasibility under broader criteria, such as reagent cost or product toxicity, will likely constitute a future challenge for the community.

 

While more complex transformations are being explored, most of the public benchmarks focus on organic chemistry. Models trained only on organic substrates can struggle with more complex transformations, as recent studies have shown on abiotic reactions. Further, many existing data sets are quite small and focus on a single reaction class. This is in contrast to the abundance of experimental knowledge and publications on mechanisms of unimolecular and bimolecular transformations in catalyzed reactions, coordination reactions, condensation reactions, and others. It will therefore be exciting to see these reaction classes and, more importantly, chemistry outside of the organic paradigm explored using data-driven approaches.

 

With regard to the reaction encodings, this review shows that in some cases, it is possible to effectively combine operation encodings and interaction fingerprints. While these fingerprints represent distinct but complementary aspects of the reaction, using them jointly increases the computational footprint of the model. A direction to further explore would be to benchmark a hybrid encoding of this type against a pure interaction fingerprint model. Such a comparison could help illustrate whether all six dimensions of the corresponding interaction fingerprint are essential in widening the applicability of machine learning methods.

 

6.2. Reinforcement Learning in Reaction Optimization

 

The performance of virtual screening can be improved by combining reinforcement learning with generative neural networks. This involves formulating an RL problem as a Markov decision process, defining a reward function based on the predicted activity of compounds toward a target receptor (computed by separate predictive models), and implementing an algorithm based on policy gradients. A generative architecture is then trained adversarially to propose compounds, with a discriminator that ranks them by predicted target activity. The utility of the generative system is demonstrated by its ability to explore target spaces unrecognized by the initial generative model.

 

In reinforcement learning, a computer agent interacts with an environment in discrete time steps, producing a sequence of actions that affect the state of the environment and, in turn, reward signals that provide feedback on the performance of those actions. The objective is to estimate a policy, a mapping from states to actions, that maximizes the total reward. RL problems can be formulated as Markov decision processes using a state space, an action space, and a reward function defined for each state-action pair; when the agent cannot observe the state directly, the problem becomes a partially observable Markov decision process. A policy can be deterministic or stochastic, prescribing either a unique action for each state or a distribution over actions for each state, respectively.
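A toy illustration of these definitions treats the choice among a few candidate reaction conditions as a one-step MDP with a softmax policy. The reward table and learning rate are invented for the example, and the exact (expected) policy-gradient update is used rather than a sampled one, so the run is deterministic; this is a sketch of the principle, not of any published algorithm.

```python
import numpy as np

# Hypothetical expected reward for each of four candidate actions.
rewards = np.array([0.1, 0.3, 0.9, 0.2])
theta = np.zeros(4)  # policy parameters (logits)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

lr = 0.5
for _ in range(500):
    probs = softmax(theta)
    # Exact policy gradient for a one-step MDP:
    #   d E[r] / d theta_j = pi_j * (r_j - E[r])
    theta += lr * probs * (rewards - probs @ rewards)

probs = softmax(theta)
best_action = int(np.argmax(probs))
```

Gradient ascent concentrates the policy on the highest-reward action; a sampled (REINFORCE-style) variant would replace the exact gradient with a noisy single-action estimate.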

 

Two distinct strategies for exploring the action space follow from the choice of policy: deterministic policies act in a fixed manner given the current model, whereas stochastic policies, often referred to as exploration policies, draw actions from an explicit probability distribution. Stochastic policies typically converge toward deterministic ones as their probability distributions sharpen. Because stochastic policies retain more freedom over the action space, they can try actions a deterministic policy never would; deterministic policies, by contrast, are prone to being trapped in local optima before finding satisfactory solutions. (Walters and Barzilay, 2020)(Wang et al., 2022)

7. Performance Evaluation and Model Selection

 

The reliability of any machine learning model hinges on its validation against data not used during training. The simplest way to satisfy this requirement is to randomly split the dataset into two segments: a training set, comprising 70-90% of the data for model fitting and hyperparameter tuning, and a testing set containing the remaining 10-30% for final validation. For large datasets (roughly 10,000 samples or more), a fixed training/testing split is generally adequate; for smaller datasets, methods such as k-fold cross-validation, leave-one-out cross-validation, or stratified sampling variants make better use of the limited data. Cross-validation ensures that each sample serves both as a training element and as a test element, thereby yielding a more thorough evaluation. In simple k-fold cross-validation, the dataset is divided into k folds (commonly k = 5 or 10), and each fold is successively used as the test set while the remaining k-1 folds form the training set.
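Both procedures can be sketched with scikit-learn; the synthetic regression data, the 80/20 ratio, and the ridge model are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 0.1, 100)

# Fixed 80/20 split: fit on the training portion, validate on held-out data.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_r2 = Ridge(alpha=1.0).fit(X_tr, y_tr).score(X_te, y_te)

# 5-fold cross-validation: every sample serves exactly once as a test point.
fold_scores = []
for tr, te in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    fold_scores.append(Ridge(alpha=1.0).fit(X[tr], y[tr]).score(X[te], y[te]))
```

Averaging the fold scores gives the cross-validated performance estimate; its spread across folds also indicates how sensitive the model is to the particular split.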

 

Various metrics exist that provide a detailed view of how well a model has learned the mapping from input to output space. For regression problems, three widely used metrics are considered: Mean Absolute Error, Mean Squared Error, and the R² score. When the model must classify targets into different classes, a confusion matrix can be constructed that counts the correctly and incorrectly classified targets per class; from it, efficiency metrics such as precision, recall, and F1-score can be calculated. A detailed description of these metrics follows.

 

Mean Absolute Error (MAE) gives the mean absolute deviation of the predicted values from the true values; a lower MAE indicates a better model:

MAE = (1/N) ∑(i=1 to N) |y_i – y_i'|

Mean Squared Error (MSE) gives the mean squared deviation from every point to the predicted regression line, penalizing large errors more strongly; a lower MSE likewise indicates a better model:

MSE = (1/N) ∑(i=1 to N) (y_i – y_i')²

The R² score gives the proportion of the variance in the dependent variable that is explained by the independent variables; a higher R² indicates a better model:

R² = 1 – SS_res/SS_tot, where SS_res = ∑(i=1 to N) (y_i – y_i')² and SS_tot = ∑(i=1 to N) (y_i – y_mean)²

For classification problems, model efficiency can be assessed through a confusion matrix. For a four-class problem, for instance, a 4×4 confusion matrix counts the correctly and incorrectly classified targets per class. From the confusion matrix, precision, recall, and the F1-score follow:

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)

F1-score = 2 × (Precision × Recall) / (Precision + Recall)

where TP is the number of targets correctly classified as class i, FP is the number of targets incorrectly classified as class i, and FN is the number of targets of class i classified into another class. The F1-score is the harmonic mean of precision and recall.
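These formulas translate directly into code. The NumPy sketch below, with hand-picked toy values, reproduces each metric exactly as defined above.

```python
import numpy as np

def mae(y, y_pred):
    return np.mean(np.abs(y - y_pred))

def mse(y, y_pred):
    return np.mean((y - y_pred) ** 2)

def r2_score(y, y_pred):
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

def precision_recall_f1(y_true, y_hat, cls):
    tp = np.sum((y_hat == cls) & (y_true == cls))  # correctly labeled cls
    fp = np.sum((y_hat == cls) & (y_true != cls))  # wrongly labeled cls
    fn = np.sum((y_hat != cls) & (y_true == cls))  # cls sent elsewhere
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Toy regression and classification examples.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 5.0])
y_true = np.array([0, 0, 1, 1, 1])
y_hat = np.array([0, 1, 1, 1, 0])
```

With these values, MAE and MSE are both 0.25, R² is 0.8, and precision, recall, and F1 for class 1 all equal 2/3.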

 

7.1. Metrics for Assessing Model Performance

 

The evaluation of models is a crucial part of the modeling process, allowing objective assessment of a model's suitability for its intended use. With the growing number of machine learning tools, model choices, and deployment options, experimentalists should be properly equipped to assess model suitability. In many cases, assessment is left to automated tuning and test-set predictions, at the risk of deploying drastically wrong models and, ultimately, unachievable research goals, wasted resources, and missed opportunities.

 

Formally, model performance evaluation boils down to finding one or more statistics that quantify the agreement between the ‘truth’ and the corresponding model’s prediction. Each statistic needs to be clearly calculated and interpreted to be meaningful. Models can be compared based on one or more figures of merit, potentially sorting them based on fit quality. It is important to note that a quantitative assessment of model performance does not guarantee the applicability of a model. A model with near-perfect agreement to the data may yield nonphysical results outside the training set.

 

An important aspect of selecting figures of merit is recognizing the type of numerical data being compared: only commensurate quantities can enter a performance calculation. Figures of merit are no absolute guarantee of model correctness or applicability, and they give insight only into the fitness of a model relative to the provided data. Model selection should therefore rest on both model performance and physical interpretability and plausibility. The evaluation metrics discussed here are intended as a reasonable first step, guiding the experimentalist toward models that yield accurate results in the intended use cases.

 

7.2. Cross-Validation Techniques

 

Cross-validation techniques are widely used to determine how the outcomes of a statistical analysis will generalize to an independent data set. They involve partitioning a sample into complementary subsets, estimating models on one subset and validating them on the other. The sample is then re-partitioned and the procedure repeated, so that all observations are used for both training and validation, and the results are averaged across partitions to produce a single estimate. In this setting, machine learning researchers aim to predict a property of a chemical system from a collection of systems with known properties, each described by descriptors; these may be experimentally known, such as the presence of certain functional groups, or computationally inexpensive, such as the pKa of an ionizable atom.

 

A crude approach is to consider only molecules whose property has been calculated with the same underlying computational method; unfortunately, this cannot be applied when a collection of molecules with known properties is simply given. The obvious trick for avoiding overfitting is to divide the set into a training set and a test set: fit models on the training set while computing the mean and variance of the actual property in the test set. A good model should yield predictions whose mean is close to the mean of the actual property and whose variance is small and close to that of the training set. This defines a crude model selection procedure, explored herein.

 

However, three considerations render such a procedure impractical. First, for a typical data set, the fraction of molecules with known property values is small, so a large fraction of the accumulated data would be wasted as a test set. Second, with 14 descriptors available for each compound but attention limited to a single property, the pKa, the input matrix of descriptors is ill-conditioned, and a proper learning algorithm ought to perform some form of regularization. Third, a single fixed split yields a high-variance performance estimate that depends on the particular partition chosen. A different approach is to randomly select some percentage of the compounds for which the property has been calculated (generally with the same computational method); molecules in the resulting test set are excluded from model construction. The procedure is then repeated k times, each time partitioning the data set randomly into different training and testing sets, and the performance of the models is averaged to yield an overall error estimate that holds across chemically diverse subsets. This approach is known as cross-validation. (Bates et al., 2024)(Wainer and Cawley, 2021)
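This repeated random-split scheme (sometimes called Monte Carlo cross-validation) can be sketched with scikit-learn's ShuffleSplit; the synthetic data and the linear model are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
y = X @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(0, 0.1, 60)

# k = 10 repeated random 75/25 partitions; test molecules never enter
# model construction, and the averaged score is the CV estimate.
splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
scores = [LinearRegression().fit(X[tr], y[tr]).score(X[te], y[te])
          for tr, te in splitter.split(X)]
cv_estimate = float(np.mean(scores))
```

Unlike k-fold cross-validation, the random partitions may overlap, so a sample can appear in several test sets; the averaging still damps the variance of any single split.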

 

8. Challenges and Limitations

 

Despite the recent successes of machine learning models for chemical reactions, there remain significant challenges and limitations that must be addressed. One of the primary challenges is data sparsity and quality. Many machine learning methods rely heavily on high-quality datasets for training models, yet datasets suitable for training accurate models of chemical reactivity are sparse. To address this issue, methods must be developed to exploit small datasets rather than relying on large datasets that are often not available. Train/test splits must also be developed that take into account the specific type of reaction being predicted, as the most extreme cases often differ significantly in character from intermediate cases.

 

Another major challenge is the interpretability of models. In many branches of machine learning, including economic forecasting and image recognition, black-box models have been introduced that dramatically outperform the previous state of the art. However, the ability to obtain insight into the mechanistic basis of model predictions is lost when black-box models are used. In a number of chemical subfields, including organometallic catalysis, understanding the mechanistic basis of a predicted reactivity profile is as important as accurately predicting that profile. Models must be generated that balance interpretability with predictive accuracy.

 

There are also other possible directions for improvement, such as the increased incorporation of physical insight into the design of machine learning approaches. The incorporation of domain-specific knowledge structures, or even basic laws of physics, into the design of machine learning approaches has recently been proposed in other fields. Similar ideas would be beneficial for approaches to the prediction of chemical reactivity.

 

8.1. Data Sparsity and Quality

 

In practice, there are several limitations or challenges with the use of machine learning in chemical reaction databases. Some of these include:

  1. Data sparsity: Ideally, one would have a large library of chemical reactions to train machine learning models, but in real life, this is often limited. In chemical databases, where indirect computer searching of reactions is performed, there are often thousands of possibilities, such as searching three terms within a series of conditions, such as solvents, temperature, and reaction time. Manually interpreting all of these is impractical, but machine learning could do this more easily.

 

  2. Quality of data: Outside of indirect searching, and where chemical data is pooled into large databases, the information might be of highly differing quality. If data is pooled from many sources, one database could have many useful reactions, all of which might be one of dozens of closely related processes with slightly different conditions. Thus, the subsequent data might all be in similar chemical spaces and not necessarily representative of the overall reaction space. In this way, data from one database could lead to poor predictions in another database.

 

  3. Interpretation of models: Chemical space is very hard to visualize, and the mapping by machine learning techniques is similarly very abstract. There is thus a great need for visualization tools that allow for an interpretable and rational understanding of these models.

 

In light of these challenges and criteria, it becomes clear that in many industries only a small subset of approaches from the broader machine learning field is practical, highlighting the need to standardize approaches and methodologies across the chemical and machine learning communities.

 

8.2. Interpretability of Models

 

Model interpretation is generally more challenging with ML methods than with a traditional approach. We would like to analyze which features are most important for predicting a certain product of the reaction with an ML model. This is relatively simple with traditional models because each feature has a specific coefficient that interprets its role in the prediction. However, for more complex models, it is less than obvious how to interpret the contribution of each feature in the prediction.

 

Some methods have been proposed to interpret black-box ML models. For instance, one can train a surrogate explainable model on the predictions of the black box model and use it to analyze its interpretation. Other methods study how the prediction changes with small perturbations in the features and, thus, analyze how much the prediction varies as a function of each feature. Several open-source software libraries exist that help implement these methods. Most of them measure feature importance, but some can create overall model visualizations to interpret how models use each feature.
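One such perturbation-based method, permutation importance, can be sketched from scratch: permuting a feature column destroys its relationship with the target, and the resulting drop in score measures how much the model relies on that feature. The data, the random-forest model, and the choice to score on the training data are illustrative simplifications.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 3.0 * X[:, 0] + rng.normal(0, 0.1, 300)  # only feature 0 carries signal

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
base = model.score(X, y)

importances = []
for j in range(3):
    Xp = X.copy()
    Xp[:, j] = rng.permutation(Xp[:, j])  # break feature j's link to y
    importances.append(base - model.score(Xp, y))
```

Here the informative feature shows a large score drop while the irrelevant ones barely matter; in practice the permutation is repeated several times and evaluated on held-out data.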

 

Model interpretability is especially important in scientific domains, where the practitioner aims not only at predicting a certain property but also at understanding the reasons behind it. (Jorner et al., 2021)(Shim et al., 2022)

 

9. Future Directions in the Field

 

Machine learning has shown its promise for chemistry and many subdisciplines of chemistry and biochemistry. On the one hand, the progress of computational power and various mathematical functionalities for this research area have democratized access for researchers. On the other hand, these successes have raised challenges. Currently, only a few methods can be regarded as best practices, such as graph neural networks for chemical representation, and yet there are many candidates vying for the dubious title of best practice in other areas, such as the research fields of data uncertainty quantification and ML experiment design.

 

Looking forward in the near term, it is likely that current best practices in ML will be implemented in packages and integrated into existing computational packages for quantum mechanics calculations, classical molecular dynamics, force field parameterization, etc., for ease of access by chemists. ML methods will be further developed to address currently unmet needs in chemistry, such as the representation and estimation of polyoxometalate energy surfaces. As ML applications are increasingly published and widely used, proper data stewardship practices will become necessary to curate databases from the relevant public literature with metadata amenable to ML applications.

 

In the mid-term horizon, there is hope that ML extensions of quantum mechanics, such as ML quantum chemistry, will be robust and accurate enough to be widely usable. Deep generative models of graph construction will also be augmented by ground-state heuristics and machine reasoning models for synthesis planning. Such planning models are likely to search chemical databases for past successes in the literature; those paths will then be analyzed, mathematically transformed, and ideally distilled into relevant chemical knowledge to be preserved in knowledge bases, in the same spirit as the IUPAC Gold Book rules. The combination of these two modeling methods should help build real-time, automated, high-throughput ML planning integrated with robotic synthesis experimentation.

 

9.1. Integration of Quantum Mechanics with Machine Learning

 

Modeling chemical reactions by predicting how the atomic connectivity changes while tracking the molecular energy is extremely computationally demanding, mainly due to the so-called "curse of dimensionality." The Hamiltonian is an n-electron operator that depends on the 3N nuclear coordinates in a rather complex way. Classical chemical reaction trajectories are high-dimensional (N > 100 for organic molecules), and even when limited to a small number of variables, quantum dynamics is 6N-dimensional (12N in the case of wave packet approaches). Therefore, although somewhat successful, quantum wave packet reaction dynamics quickly hits a bottleneck for N > 10. Yet such methods are much needed for the next decade's challenges in energy technologies to mitigate climate change, such as artificial photosynthesis and solar fuels, CO2 reduction to hydrocarbons, and nitrogen reduction to ammonia.

 

One class of solutions to the curse of dimensionality in quantum dynamics is to move from expansions in the coordinate representation to auxiliary-variable representations in which the many-body Green functions resemble classical path integrals. This greatly reduces the number of terms needed for an equal level of approximation, but still applies only to small numbers of degrees of freedom. The machine learning modeling paradigm, by contrast, does not depend on the many-body nature of the wavefunctions and distributions.

 

Conceptually, the idea is rather simple and similar to the classical case. Machine learning can produce learned potentials that reproduce the "skeleton" of the energetics, or the ensemble quantum likelihood, while the rest of the description is preserved. To ensure accuracy and stability, the models must be trained over a region of configuration space large enough to cover both the energetically accessible regions sampled during dynamics and the boundaries beyond which the models' accuracy degrades. Related interpolation-based methods have been applied to the modeling of reduced density operators and to quantum simulations relevant to molecular chemistry.

 

9.2. Advancements in Automated Synthesis Planning

 

In recent years, the large design space of organic chemistry has been an attractive target for the application of machine learning methods, as exploratory synthesis could enable high-impact discoveries in fields like drug development, organic materials, and agrochemistry. As a subfield of organic synthesis, synthesis planning has been studied for decades, and advances have been made using both heuristic and AI-based strategies. However, because they cover only a fraction of possible reactions, those methods cannot keep up with the ever-increasing pace of laboratory discovery aided by novel high-throughput experimental techniques.

 

As an alternative to retrosynthesis, the task of predicting the product of a reaction given its reagents has been actively studied in recent years, with neural network-based approaches yielding significant improvements. Since such models do not require careful definition of reaction templates and are therefore easier to train, they have garnered more interest in the machine learning community. Models trained on large collections of reaction databases can read a reaction in SMILES notation and reliably propose the corresponding product structure.
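Before such a model can read a reaction, the SMILES string must be split into tokens. A simplified version of the regular-expression tokenizers commonly used for this purpose is sketched below; the exact pattern and the example reaction string are illustrative, and production tokenizers cover more of the SMILES grammar.

```python
import re

# Multi-character tokens such as Cl, Br, and bracket atoms like [NH4+]
# must be kept whole rather than split into single characters.
TOKEN_RE = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnops]|[()=#+\-\.\\/:@>%]|\d)"
)

def tokenize(smiles):
    tokens = TOKEN_RE.findall(smiles)
    # Sanity check: the tokens must reassemble the original string.
    assert "".join(tokens) == smiles, "unrecognized characters in SMILES"
    return tokens

# Toy esterification-style reaction SMILES: reactants >> product.
rxn = "CCO.CC(=O)Cl>>CC(=O)OCC"
tokens = tokenize(rxn)
```

Each token then maps to an integer index for a sequence-to-sequence model, with the `>>` separator marking the boundary between reactants and products.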

 

For the retrosynthesis task itself, a GAN-based model has been proposed. Molecular structures are represented as graphs, where nodes and edges correspond to atoms and bonds, respectively. User queries can be either the SMILES notation of a molecule or a list of substructures to exclude. Reaction databases containing bidirectional transformations of the underlying structures are then mined. Using a modified neural network structure, reaction networks are modeled retrosynthetically, and their efficiency for reactions of the input molecules is predicted by machine learning. Among other strengths, this approach highlights the use of user-defined constraints.

 

Experimental assays are used to probe the predicted reaction conditions and to reveal the final candidates. Initially, approximately 118,000 possible reactions are predicted based on a database of approximately 60,000 transformations. Through several filtering stages, this number is narrowed down to selected reactions that entail the least manipulation of the structure and fall within the scope of a set of chemicals and a three-dimensional manifold defined by the physicochemical properties of the reactants. Intra-network communication allows for selection to focus on specific reaction categories, such as coupling, hydroxylation, oxidation, and ring-opening. Further discriminator networks ensure diversity and novelty of the selected candidates.

 

Several GAN-inspired approaches to retrosynthetic reaction modeling have further advanced the state of the art in AI-driven reaction prediction. Despite their algorithmic advances, these approaches remain dependent on the available reaction databases. Given the sporadic and selective nature of database collections, coverage of reaction space is uneven, understudied chemistry remains underrepresented, transferability is uncertain, and the learned priors are biased toward well-sampled patterns. The lack of any unsupervised layer hinders pre-training on uncurated data; although automatic SMILES screening partially compensates, spurious strings inevitably remain. Exploratory modeling therefore aims for database completeness and broad-coverage reaction sampling within the realm of parsed chemical knowledge. (Oliveira et al., 2022)(Badowski et al., 2020)(Strieth-Kalthoff et al., 2020)

References:

Meuwly, M. “Transformative Applications of Machine Learning for Chemical Reactions.” (2021). [PDF]

Meuwly, M. “Machine learning for chemical reactions.” Chemical Reviews (2021). google.com

Kovács, D. P., McCorkindale, W., and Lee, A. A. “Quantitative interpretation explains machine learning models for chemical reaction prediction and uncovers bias.” Nature communications (2021). nature.com

Westermayr, Julia, Michael Gastegger, Kristof T. Schütt, and Reinhard J. Maurer. “Perspective on integrating machine learning into computational chemistry and materials science.” The Journal of Chemical Physics 154, no. 23 (2021). aip.org

Stocker, S., Csanyi, G., Reuter, K., and Margraf, J. T. “Machine learning in chemical reaction space.” Nature communications (2020). nature.com

Keith, John A., Valentin Vassilev-Galindo, Bingqing Cheng, Stefan Chmiela, Michael Gastegger, Klaus-Robert Muller, and Alexandre Tkatchenko. “Combining machine learning and computational chemistry for predictive insights into chemical systems.” Chemical reviews 121, no. 16 (2021): 9816-9872. acs.org

Zhu, Li-Tao, Xi-Zhong Chen, Bo Ouyang, Wei-Cheng Yan, He Lei, Zhe Chen, and Zheng-Hong Luo. “Review of machine learning for hydrodynamics, transport, and reactions in multiphase flows and reactors.” Industrial & Engineering Chemistry Research 61, no. 28 (2022): 9901-9949. whiterose.ac.uk

Zuranski, Andrzej M., Jesus I. Martinez Alvarado, Benjamin J. Shields, and Abigail G. Doyle. “Predicting reaction yields via supervised learning.” Accounts of chemical research 54, no. 8 (2021): 1856-1865. nsf.gov

Schwaller, Philippe, Alain C. Vaucher, Teodoro Laino, and Jean-Louis Reymond. “Prediction of chemical reaction yields using deep learning.” Machine learning: science and technology 2, no. 1 (2021): 015016. iop.org

Heid, Esther, and William H. Green. “Machine learning of reaction properties via learned representations of the condensed graph of reaction.” Journal of Chemical Information and Modeling 62, no. 9 (2021): 2101-2110. acs.org

Shieh, Peyton, Megan R. Hill, Wenxu Zhang, Samantha L. Kristufek, and Jeremiah A. Johnson. “Clip chemistry: diverse (bio)(macro) molecular and material function through breaking covalent bonds.” Chemical Reviews 121, no. 12 (2021): 7059-7121. [HTML]

O’Neill, R. T. and Boulatov, R. “The many flavours of mechanochemistry and its plausible conceptual underpinnings.” Nature Reviews Chemistry (2021). [HTML]

Mateti, Srikanth, Motilal Mathesh, Zhen Liu, Tao Tao, Thrinathreddy Ramireddy, Alexey M. Glushenkov, Wenrong Yang, and Ying Ian Chen. “Mechanochemistry: A force in disguise and conditional effects towards chemical reactions.” Chemical Communications 57, no. 9 (2021): 1080-1092. [HTML]

Wang, Zi, Xujun Zheng, Tetsu Ouchi, Tatiana B. Kouznetsova, Haley K. Beech, Sarah Av-Ron, Takahiro Matsuda et al. “Toughening hydrogels through force-triggered chemical reactions that lengthen polymer strands.” Science 374, no. 6564 (2021): 193-196. mit.edu

Huo, Shuaidong, Pengkun Zhao, Zhiyuan Shi, Miancheng Zou, Xintong Yang, Eliza Warszawik, Mark Loznik, Robert Göstl, and Andreas Herrmann. “Mechanochemical bond scission for the activation of drugs.” Nature chemistry 13, no. 2 (2021): 131-139. rug.nl

Grambow, C. A., Pattanaik, L., and Green, W. H. “Reactants, products, and transition states of elementary chemical reactions based on quantum chemistry.” Scientific data (2020). nature.com

Smith, M. B. “March’s advanced organic chemistry: reactions, mechanisms, and structure.” (2020). ethz.ch

Grommet, A. B., Feller, M., and Klajn, R. “Chemical reactivity under nanoconfinement.” Nature nanotechnology (2020). researchgate.net

Joudeh, N. and Linke, D. “Nanoparticle classification, physicochemical properties, characterization, and applications: a comprehensive review for biologists.” Journal of Nanobiotechnology (2022). springer.com

Ijaz, Irfan, Ezaz Gilani, Ammara Nazir, and Aysha Bukhari. “Detail review on chemical, physical and green synthesis, classification, characterizations and applications of nanoparticles.” Green chemistry letters and reviews 13, no. 3 (2020): 223-245. tandfonline.com

Temkin, O. N., Zeigarnik, A. V., and Bonchev, D. G. “Chemical reaction networks: a graph-theoretical approach.” (2020). [HTML]

Unsleber, J. P. and Reiher, M. “The exploration of chemical reaction networks.” Annual review of physical chemistry (2020). [PDF]

Kumar, A. “Control of nonlinear differential algebraic equation systems with applications to chemical processes.” (2020). [HTML]

Hsu, S. B. and Chen, K. C. “Ordinary differential equations with applications.” (2022). worldscientific.com

Vaucher, Alain C., Philippe Schwaller, Joppe Geluykens, Vishnu H. Nair, Anna Iuliano, and Teodoro Laino. “Inferring experimental procedures from text-based representations of chemical reactions.” Nature communications 12, no. 1 (2021): 2573. nature.com

Shen, Yuning, Julia E. Borowski, Melissa A. Hardy, Richmond Sarpong, Abigail G. Doyle, and Tim Cernak. “Automation and computer-assisted planning for chemical synthesis.” Nature Reviews Methods Primers 1, no. 1 (2021): 1-23. nsf.gov

Schweidtmann, Artur M., Erik Esche, Asja Fischer, Marius Kloft, Jens‐Uwe Repke, Sebastian Sager, and Alexander Mitsos. “Machine learning in chemical engineering: A perspective.” Chemie Ingenieur Technik 93, no. 12 (2021): 2029-2039. wiley.com

Jorner, Kjell, Anna Tomberg, Christoph Bauer, Christian Sköld, and Per-Ola Norrby. “Organic reactivity from mechanism to machine learning.” Nature Reviews Chemistry 5, no. 4 (2021): 240-255. [HTML]

Dobbelaere, Maarten R., Pieter P. Plehiers, Ruben Van de Vijver, Christian V. Stevens, and Kevin M. Van Geem. “Machine learning in chemical engineering: strengths, weaknesses, opportunities, and threats.” Engineering 7, no. 9 (2021): 1201-1211. sciencedirect.com

Wen, Mingjian, Samuel M. Blau, Xiaowei Xie, Shyam Dwaraknath, and Kristin A. Persson. “Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining.” Chemical science 13, no. 5 (2022): 1446-1458. rsc.org

Liu, Yiyang, Jinze Liu, Hongzhen He, Shanru Yang, Yixiao Wang, Jin Hu, Huan Jin, Tianxiang Cui, Gang Yang, and Yong Sun. “A review of enhancement of biohydrogen productions by chemical addition using a supervised machine learning method.” Energies 14, no. 18 (2021): 5916. mdpi.com

Date, Mihir S., Devin O’Brien, Danielle J. Botelho, Terry W. Schultz, Daniel C. Liebler, Trevor M. Penning, and Daniel T. Salvito. “Clustering a chemical inventory for safety assessment of fragrance ingredients: identifying read-across analogs to address data gaps.” Chemical research in toxicology 33, no. 7 (2020): 1709-1718. acs.org

Walters, W. P. and Barzilay, R. “Applications of deep learning in molecule generation and molecular property prediction.” Accounts of chemical research (2020). [HTML]

Wang, Zhengyang, Meng Liu, Youzhi Luo, Zhao Xu, Yaochen Xie, Limei Wang, Lei Cai et al. “Advanced graph and sequence neural networks for molecular property prediction and drug discovery.” Bioinformatics 38, no. 9 (2022): 2579-2586. oup.com

Bates, Stephen, Trevor Hastie, and Robert Tibshirani. “Cross-validation: what does it estimate and how well does it do it?.” Journal of the American Statistical Association 119, no. 546 (2024): 1434-1445. [PDF]

Wainer, J. and Cawley, G. “Nested cross-validation when selecting classifiers is overzealous for most practical applications.” Expert Systems with Applications (2021). [PDF]

Shim, Eunjae, Joshua A. Kammeraad, Ziping Xu, Ambuj Tewari, Tim Cernak, and Paul M. Zimmerman. “Predicting reaction conditions from limited data through active transfer learning.” Chemical science 13, no. 22 (2022): 6655-6668. rsc.org

Oliveira, João CA, Johanna Frey, Shuo-Qing Zhang, Li-Cheng Xu, Xin Li, Shu-Wen Li, Xin Hong, and Lutz Ackermann. “When machine learning meets molecular synthesis.” Trends in Chemistry 4, no. 10 (2022): 863-885. [HTML]

Badowski, Tomasz, Ewa P. Gajewska, Karol Molga, and Bartosz A. Grzybowski. “Synergy between expert and machine‐learning approaches allows for improved retrosynthetic planning.” Angewandte Chemie International Edition 59, no. 2 (2020): 725-730. wiley.com

Strieth-Kalthoff, Felix, Frederik Sandfort, Marwin HS Segler, and Frank Glorius. “Machine learning the ropes: principles, applications and directions in synthetic chemistry.” Chemical Society Reviews 49, no. 17 (2020): 6154-6168. [HTML]
