Machine Learning for Organic Small Molecule Modelling

1. Introduction to Organic Small Molecule Modelling

 

Predicting the molecular properties of small organic molecules composed of elements such as carbon, hydrogen, and oxygen is fundamental to many applications in chemistry and biochemistry. However, detailed knowledge of these properties is often not easily accessible by either experiment or theoretical methods, so the ability to predict them is very appealing. In recent years, many machine learning models have been developed to do just this, and with the pace of model development we can expect more models that are ready to be used for the analysis of large organic chemical data sets. In this chapter, we introduce the reader to the use of machine learning models to predict chemical reactivity properties, a task of great importance in organic chemistry for both the pharmaceutical and agrochemical industries.

 

1.1. Overview of Organic Small Molecules

 

The term “organic” is used for any molecule containing carbon, with a few exceptions such as metal cyanides, carbonates, and the oxides of carbon. Organic small molecules are versatile building blocks for many structurally diverse naturally occurring and man-made chemicals, with great potential for interacting with biological targets in specific ways. This potential has been exploited for drugs, agrochemicals, organic electronic materials, and other compounds of intrinsic interest. Organic small molecules form the overwhelming majority of known drugs and other bioactive molecules appearing in historical bioactive compound collections. The space of such compounds available to be assessed for potential activity towards biological targets of interest has grown very large, and historically it has been investigated through biological screening interrogating many target–small molecule combinations.

 

Small molecule design towards multiple diverse targets can, however, be enabled by approaches that go beyond resource-limited traditional lead compound optimization against a single known target. Identification and generation of chemical diversity are now relatively routine through synthetic organic chemistry, and this diversity can be concentrated into subsets associated with defined biological targets. Earlier data collections contain several “classical” compounds associated with a limited number of specific biological targets, such as the group of known nucleoside kinase inhibitors; such models are often excluded from the discussion because they predate modern organic informatics drug discovery. Any text understandably has to limit its analysis to a manageable subset of what is a very large data domain.

 

1.2. Importance in Drug Discovery and Material Science

 

Small molecules, typically defined by molecular weights below 900 Daltons, possess high structural diversity, binding affinity, and selectivity. In drug discovery, small molecules can interface with proteins to accelerate, decelerate, or neutralize the action of specific disease targets. Common oral medicines such as aspirin, metformin, and azithromycin greatly improve quality of life as a result of small molecule chemistry. With the development of synthesis technology and machine learning/computational prediction models, the number of known chemical compounds continues to grow rapidly each year. Scientists combine machine learning tools with big data to develop and optimize novel materials, generally focusing on catalysts for organic and polymer synthesis, bio-building blocks, bioprocess performance enhancers, skincare and cosmetology ingredients, flavor and fragrance ingredients, agrochemicals, polymers, and more. Increasing demand continues to drive the development and optimization of organic small molecule modeling tools.

 

In recent years, computational methods, including high-performance computing, quantum mechanics, molecular mechanics, empirical approaches, ab initio calculations, and machine-learning-based methods, have narrowed the gap between experiment and computational design for searching and optimizing organic materials. However, fast and reliable machine learning descriptions of large datasets are mostly based on force field parameters derived from quantum mechanics calculations, and applying quantum mechanics to a large number of molecules is often unaffordable due to its computational expense. Force field development and new formulations therefore need guidance that reduces the number of expensive quantum mechanics iterations. Another important task for accelerating knowledge transfer from quantum mechanics results is selecting small, representative subsets of the full set of molecular descriptors, such as atomic numbers and positions, atomic charges, polarizability, atomic spin moments, and bond orders. With a good feature extraction method, quantum mechanics calculations can focus on the elements with a strong influence on the desired chemical applications. (Walters and Barzilay, 2020)(Zhao et al., 2020)(Chen et al., 2020)(Dou et al., 2023)(Mahmood et al., 2023)

 

2. Fundamentals of Machine Learning

 

In this section, we provide a brief overview of some key concepts from machine learning introduced earlier in the text. One way to think of supervised learning is as the process of approximating an unknown function f: X → Y that maps an input space to an output space, given a sequence of n training samples (x1, y1), (x2, y2), …, (xn, yn) drawn i.i.d. from a distribution D over X × Y, where Y = R corresponds to regression tasks and Y = {−1, +1} is typical for classification tasks. The performance of a hypothesis h learned from these training data is measured by a risk that penalizes the difference between h(x) and the true output, such as mean squared error for regression or misclassification rate for classification. Because this risk cannot be evaluated over the potentially infinite input space X, performance is instead measured on finite samples from D, such as validation and test errors, which leads to empirical risk minimization. Popular techniques for approximating the unknown function include kernel regression and kernel support vector machines, each associated with a different set of trade-offs in terms of complexity, dimensionality, and scalability.
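As a concrete illustration, the sketch below fits a kernel ridge regression model by minimizing regularized empirical risk on synthetic descriptor data; the dataset, descriptor count, and hyperparameters are illustrative assumptions rather than values from the text.

```python
# Minimal sketch of empirical risk minimization with kernel ridge regression.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                                   # 200 "molecules", 8 toy descriptors
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)   # noisy synthetic target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.1)         # regularized kernel regression
model.fit(X_tr, y_tr)                                           # minimizes the empirical risk
print("test MSE:", mean_squared_error(y_te, model.predict(X_te)))
```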

 

In unsupervised learning tasks, no labeled training data are available to discriminate the quality of different candidate models. A canonical problem in unsupervised learning is clustering, which aims to partition the training data into a fixed number k of clusters so that points in the same cluster are close with respect to a pre-specified distance measure on the input space X. For example, the k-means clustering algorithm attempts to minimize the average squared Euclidean distance from each data point to its nearest cluster center, updating the cluster centers at the end of each iteration and converging to a local minimum of the chosen objective. Another unsupervised learning problem is dimensionality reduction, which represents the data with fewer dimensions than the input space, often through techniques such as principal component analysis or independent component analysis. (Cohn and Holm, 2021)(Glielmo et al., 2021)(Caron et al., 2020)(Zhan et al., 2020)
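The following minimal sketch runs k-means clustering and a PCA projection on a toy descriptor matrix with scikit-learn; the matrix and cluster count are assumptions for illustration only.

```python
# Toy k-means clustering and PCA dimensionality reduction.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 16))                                # synthetic 16-D descriptor matrix

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
X2 = PCA(n_components=2).fit_transform(X)                     # project to 2 principal components
print(labels[:10], X2.shape)                                  # cluster assignments and (300, 2)
```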

2.1. Supervised Learning

 

In a supervised learning approach, one aims to develop a model that captures the relationship between multiple input explanatory variables and a scalar or categorical response. For organic chemistry, the main example of this approach is QSPR, where the input features are some representation of the molecular structure and the scalar output represents an experimentally observed physical property. There are numerous examples of how this may be framed in organic chemistry research, including the development of models for organic small molecule synthesis routes, the prediction of logP, and the prediction of relative energy differences for reaction intermediates. In this field, the input features involve a large but manageable number of interactions, provided the input dimensions are selected appropriately. Much recent research in this space glosses over a major concern associated with supervised learning techniques: the availability of labeled data, which fundamentally limits inference.

 

Assuming the availability of labeled data, it is worth taking a moment to appreciate a common approach for organic small molecule (OSM) property prediction: feature generation. This approach is intuitive in nature, as it converts each molecule into a relevant and interpretable array of features. When the input array is sufficiently informative for a specific property, predictive modeling can thrive. Ensuring that this relevance assumption is met is non-trivial; as such, one should aim to standardize the input feature representation across property prediction questions to facilitate more rapid development.
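A hedged sketch of feature generation with RDKit follows: it converts a SMILES string into a small array of interpretable descriptors plus a Morgan fingerprint. The particular descriptor choices and fingerprint parameters are illustrative, not prescribed by the text.

```python
# Sketch: turning a SMILES string into a feature array with RDKit.
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors

def featurize(smiles: str):
    """Return interpretable descriptors plus a 2048-bit Morgan fingerprint, or None on parse failure."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    descriptors = [
        Descriptors.MolWt(mol),     # molecular weight
        Descriptors.MolLogP(mol),   # Crippen logP estimate
        Descriptors.TPSA(mol),      # topological polar surface area
    ]
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    return descriptors + list(fp)

print(len(featurize("CCO")))        # ethanol -> 3 descriptors + 2048 bits = 2051 features
```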

 

2.2. Unsupervised Learning

 

A central task of unsupervised learning falls under the umbrella of density estimation. In simple terms, given a data set, density estimation learns an approximation to the underlying probability distribution of the data. Kernel density estimation is one of the most commonly used techniques to solve this problem. In kernel density estimation, a probability distribution, typically a Gaussian, is placed on every data point in the training set, and the final density used for subsequent tasks is the normalized sum of all these distributions. However, evaluating the estimator at a new query point requires a sum over all training points, which becomes expensive for large training sets and scales poorly to high-dimensional inputs.
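As a small illustration, the sketch below fits a Gaussian kernel density estimate with scikit-learn and evaluates the learned log-density on a grid; the one-dimensional data and bandwidth are assumptions for demonstration.

```python
# Gaussian kernel density estimation on toy 1-D data.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(2)
samples = rng.normal(loc=0.0, scale=1.0, size=(500, 1))    # training data

kde = KernelDensity(kernel="gaussian", bandwidth=0.3).fit(samples)
grid = np.linspace(-4, 4, 9).reshape(-1, 1)                # query points (can be unseen data)
log_density = kde.score_samples(grid)                      # log p(x) at each query point
print(np.exp(log_density).round(3))
```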

 

Generative modeling is another crucial task in unsupervised learning. The main aim is to model the process behind the data, which usually means being able to generate new samples from a learned model. Several techniques can do this, and one of the most widely employed is the variational autoencoder (VAE). These models are generative: rather than modeling the data points themselves, they focus on describing the underlying probability distribution of the data. VAEs approximate the posterior distribution over latent variables by optimizing a variational lower bound on the data likelihood.
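A minimal VAE sketch in PyTorch is given below, assuming toy feature vectors in place of real molecular encodings; it shows the reparameterization trick and the two-term variational objective (reconstruction error plus KL divergence) described above.

```python
# Minimal variational autoencoder sketch on toy feature vectors.
import torch
import torch.nn as nn

class TinyVAE(nn.Module):
    def __init__(self, n_in=32, n_latent=4):
        super().__init__()
        self.enc = nn.Linear(n_in, 16)
        self.mu = nn.Linear(16, n_latent)
        self.logvar = nn.Linear(16, n_latent)
        self.dec = nn.Sequential(nn.Linear(n_latent, 16), nn.ReLU(), nn.Linear(16, n_in))

    def forward(self, x):
        h = torch.relu(self.enc(x))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(x, recon, mu, logvar):
    recon_err = ((recon - x) ** 2).sum(dim=1).mean()              # reconstruction term
    kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp()).sum(dim=1)).mean()  # KL to N(0, I)
    return recon_err + kl

x = torch.randn(64, 32)          # a batch of hypothetical molecular feature vectors
model = TinyVAE()
recon, mu, logvar = model(x)
print(elbo_loss(x, recon, mu, logvar).item())
```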

 

2.3. Reinforcement Learning

 

In reinforcement learning, an agent interacts with an environment by performing actions and observing the resulting state and a reward. The goal is to find an optimal policy mapping states to actions that maximizes the expected cumulative reward, or return. The problem is often formulated as a Markov decision process (MDP) or a partially observable Markov decision process (POMDP). In the MDP setting, given a starting distribution over states, one can define value and action-value functions with respect to a policy. In the POMDP setting, value functions can also be defined, and existing algorithms can compute them. Typically, a policy is represented as a neural network whose parameters are learned using policy gradients or Q-learning.

 

Consider the framework of molecular design, where at each time step the agent picks a substituent for a position on the molecule being designed. The goal is to modify the molecule to satisfy a set of design targets, such as log P_oct. This can be formulated as a partially observable Markov decision process in which the states are encoded states of molecules, the actions are the atoms or substituents to be added, and the rewards reflect the design targets. One can train a policy to pick the action that is predicted to move the molecule toward the design target; a toy sketch is given below. A critical issue with reinforcement learning in this context is the temporal dependency between actions and the complexity of the policy. Because the search space is large, gradient-based learning is used to guide the search, but instability arises from the correlation between the next state and the policy error, and REINFORCE is usually much harder to apply to molecular design than to structure-based docking. Future work may open up more possibilities for molecular design. (Boczar and Michalska, 2024)(Akrom et al., 2023)(Levatić et al., 2020)(Pham and Le, 2024)
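The toy REINFORCE sketch below illustrates the policy-gradient idea on a one-step substituent choice; the five "substituents" and their reward values are hypothetical stand-ins for a predicted design target such as logP.

```python
# Toy REINFORCE: learn to pick the highest-reward substituent.
import torch

torch.manual_seed(0)
logits = torch.zeros(5, requires_grad=True)              # policy parameters, one per substituent
opt = torch.optim.Adam([logits], lr=0.1)
reward_table = torch.tensor([0.1, 0.9, 0.2, 0.4, 0.3])   # hypothetical predicted design scores

for step in range(200):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()                               # sample a substituent
    reward = reward_table[action]
    loss = -dist.log_prob(action) * reward               # REINFORCE gradient estimator
    opt.zero_grad()
    loss.backward()
    opt.step()

# In practice a baseline is subtracted from the reward to reduce variance.
print("learned probabilities:", torch.softmax(logits, dim=0).detach())
```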

 

3. Data Preprocessing for Small Molecule Modelling

 

3.1. Structure of Small Molecules

 

Small molecule models add complexity when considering how the data should be presented to the machine learning model. The major types of structural information for small molecules are bonding patterns and element types, and different models and molecular representations have been proposed for them. Two common approaches to represent small molecules are SMILES strings and molecular graphs. The Simplified Molecular Input Line Entry System (SMILES) is a de facto standard for compactly encoding chemical structures as strings and is easily produced from 2D or 3D molecular graphics software. In this representation, chemical elements are written as one- or two-letter atomic symbols, and bonding information is written with bond symbols ('-', '=', '#', '/', and '\'). A SMILES string can begin at any atom, although canonicalization algorithms choose a specific starting atom by convention. The string length grows with the size of the molecule, so SMILES is a variable-length representation.
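A short RDKit example of parsing and canonicalizing a SMILES string follows; the molecule (phenol) is chosen only for illustration.

```python
# Parsing and canonicalizing a SMILES string with RDKit.
from rdkit import Chem

smi = "c1ccccc1O"                        # phenol, written starting from a ring atom
mol = Chem.MolFromSmiles(smi)
print(Chem.MolToSmiles(mol))             # the canonical form chosen by RDKit's algorithm
print(mol.GetNumAtoms())                 # heavy-atom count: 7 (6 C + 1 O)
```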

 

3.2. TDT Table

 

In addition to the core RDKit functionality introduced earlier, RDKit provides the PandasTools wrapper to prepare and plot 2D small molecules. PandasTools allows scientists to preprocess small molecule files from CSV or SD file formats: a Pandas DataFrame can be formed after reading CSV files that were created manually or obtained from databases that store compound structures. Because it is a Pandas DataFrame, any column in the file can be graphed whenever required, and RDKit functions can be applied to the molecule column to obtain SMILES as well as physical, thermodynamic, and pharmacokinetic properties. Operations to depict structures, import SD or PDB format files, and modify molecule structures can also be performed in a Jupyter Notebook. The RDKit PandasTools module can process data that mixes drug-like, fragment-like, peptide-like, sugar-like, and nucleotide-like molecules.
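The sketch below shows the PandasTools workflow described above on a tiny hand-made DataFrame; the SMILES column and the molecular-weight calculation are illustrative choices.

```python
# PandasTools: attach RDKit molecules to a DataFrame and compute a property per row.
import pandas as pd
from rdkit.Chem import Descriptors, PandasTools

df = pd.DataFrame({"SMILES": ["CCO", "c1ccccc1", "CC(=O)O"]})   # ethanol, benzene, acetic acid
PandasTools.AddMoleculeColumnToFrame(df, smilesCol="SMILES", molCol="ROMol")

df["MolWt"] = df["ROMol"].map(Descriptors.MolWt)                # apply an RDKit function per molecule
print(df[["SMILES", "MolWt"]])
```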

3.3. UMAP Embedding of Small Molecules

 

After the SMILES, properties, and/or fingerprints of the molecules are recorded in the TDT table, the data are ready for handling. Where molecules do not obey drug-likeness rules or the data are of low quality, the first step is to detect any rows that contain 'NaN' in the columns defining the atoms and molecule counts; once these broken rows are removed, the data can be used. Uniform Manifold Approximation and Projection (UMAP) is a technique for dimension reduction and manifold learning that is well suited to visualizing large sets of molecular data. In our workflow, UMAP is used to project high-dimensional molecular representations, such as fingerprints, into a low-dimensional embedding in which structurally related molecules lie close together.
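A hedged sketch of such a fingerprint-based UMAP embedding follows; it requires the third-party umap-learn package, and the molecule list and UMAP parameters are illustrative assumptions.

```python
# UMAP projection of Morgan fingerprints for a handful of example molecules.
import numpy as np
import umap                                  # pip install umap-learn
from rdkit import Chem
from rdkit.Chem import AllChem

smiles = ["CCO", "CCCO", "c1ccccc1", "c1ccccc1O", "CC(=O)O", "CCC(=O)O"]
mols = [Chem.MolFromSmiles(s) for s in smiles]
fps = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024)) for m in mols])

embedding = umap.UMAP(n_neighbors=3, n_components=2, random_state=0).fit_transform(fps)
print(embedding.shape)                       # (6, 2): one 2-D coordinate per molecule
```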

 

In conclusion, we describe the preprocessing of bioactivity and chemical stability datasets in SMILES or TDT format so that they can be used immediately for machine learning at the cost of a few lines of code, and we provide a quick demonstration of studying small molecules using RDKit. A diverse set of molecules representing various chemical families can be summarized by a collection of up to 17 factors, and such a diversity set can be represented compactly by a molecular fingerprint that is independent of molecular size, built either from a hash of the atom composition with counts or by iteratively walking the full set of molecular labelings. RDKit and Pymatgen serve as our primary packages for solving molecular preprocessing problems.

 

3.4. Feature Selection

 

A model has a better chance of making good predictions if it is based on more relevant information, which makes feature selection a central step for small molecule modeling. A well-curated dataset can provide some 120 structural, topological, electrostatic, quantum mechanical, and vibrational descriptors that can be used to predict several physical properties of organic molecules, and these features can be ranked by their importance for the property being modeled. In reality, time and cost are limiting factors, and it is not feasible to calculate 120 features for millions of molecules. It therefore becomes desirable to have methods for computing structural, topological, or electrostatic features directly from the molecule's 2D layout. Other chemical properties can be practically estimated using only connectivity-based descriptors and empirically derived weights for each descriptor: joining a connectivity table specifying bond orders, atomic symbols, and element numbers with a file of atomic descriptors allows these weights to estimate each structural feature at the layout level. More than a thousand atomic descriptors are used in the dataset, and each feature's utility is assessed before practical use. Two strategies appear most promising for feature selection and categorization: predicting a physical property with the fewest features, and predicting the greatest number of physical properties with the fewest features. These considerations present a danger of model underfitting when developing machine learning tools to design organic small molecules; an importance-based ranking sketch follows below.
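As an illustration of importance-based ranking, the sketch below fits a random forest on synthetic descriptors and sorts them by impurity-based importance; the data and the two "truly relevant" descriptors are constructed for demonstration.

```python
# Ranking descriptors by random forest feature importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20))                                  # 20 candidate descriptors
y = 3 * X[:, 0] + X[:, 5] + rng.normal(scale=0.1, size=400)     # only descriptors 0 and 5 matter

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(forest.feature_importances_)[::-1]         # most important first
print("top 5 descriptors by importance:", ranking[:5])          # indices 0 and 5 should lead
```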

 

3.5. Normalization and Standardization

 

Perhaps the simplest way to preprocess the inputs and outputs is to normalize each property to the range 0 to 1. Mathematically, this is performed using P' = (P − P_min) / (P_max − P_min), where P is the original property and P' is the normalized property. The minimum and maximum values of the property are determined by examining the value of the property across all molecules in the dataset. Normalization requires knowing the data range and arguably embeds assumptions about the dataset that may not be entirely accurate; for instance, tying the scale to the observed range can be counterintuitive for a property such as logS. To address this issue, we can standardize the output value instead.

 

The goal of standardization is to transform a property to have zero mean and unit variance. This is done using the formula P' = (P − μ) / σ, where P is the original property and P' is the standardized value in deviation scores; the mean (μ) and standard deviation (σ) are calculated from the dataset. However, most molecular properties are not normally distributed, so standardizing them is not generally recommended, and the transformation has difficulty when predicting beyond the training range, even though many regression algorithms assume normally distributed targets. Hence, while normalization and standardization are possible, they may not always be required. If, however, the output violates statistical assumptions, for example because of physical or chemical restrictions, a transformation should be considered. (Li and Fourches, 2021)(Rajan et al., 2021)(Krenn et al., 2020)(Jo et al., 2020)
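Both transformations are a few lines of NumPy, as the sketch below shows on toy property values; in practice the statistics should be computed on the training split only.

```python
# Min-max normalization and standardization of a toy property vector.
import numpy as np

P = np.array([1.2, 3.4, 2.2, 5.0, 4.1])             # toy property values

P_norm = (P - P.min()) / (P.max() - P.min())        # normalization: P' = (P - P_min) / (P_max - P_min)
P_std = (P - P.mean()) / P.std()                    # standardization: P' = (P - mu) / sigma
print(P_norm.round(2), P_std.round(2))
```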

 

4. Machine Learning Models for Small Molecule Modelling

 

In comparison to classical force fields, machine learning potentials offer a more accurate representation of molecular properties at a similar or reduced computational cost. By explicitly taking into account the chemical environment of an atom, machine learning potentials have been shown to be transferable across different organic molecules while precisely representing different chemical behaviors. In turn, these properties have allowed for a wide range of applications, including the use of machine learning models in the search for reaction mechanisms, modeling of reaction pathways, as a catalyst descriptor, or in the characterization of both thermodynamics and kinetics of reactions. In principle, these methods should also improve ligand docking in drug-like compound libraries, with potential medical applications.

 

Through atomistic simulations, it is possible to sample extensively the chemical environments to which an atom is exposed, catalyzing the identification of the laws that govern the system's behavior. However, while data-driven methods can take advantage of this, it is often difficult to understand the underlying basis of these techniques. Small molecule data are currently largely restricted to molecules with small transition element centers and have not yet been as extensively utilized as more complex data sets. Furthermore, the prediction of binding affinities to protein targets is hindered by our inability to estimate, in a cost-effective manner, the free energy of binding of small drug compounds to proteins. Despite the challenges, considerations for the use of machine learning models for organic small molecules have been discussed, and potentials have been applied to protein–ligand interactions and to the modeling of the electronic structure of organic molecules. This results in a relatively large amount of publicly available domain knowledge, which should serve as a guide.

 

4.1. Linear Regression

 

Simple regression techniques, like linear regression, are a good first step for predicting continuous target properties of an organic small molecule. Well-known indices and properties for which they can be useful include the octanol/water partition coefficient, polar surface area, and molar refractivity. The technique takes a set of features related to the molecule and models the target property as an affine function of those features. This linear model, which captures the average or expected relationship between the features and the target, can be trained from labeled examples using simple linear algebra. Given a feature matrix in which each row corresponds to an example and each column to a feature, the general idea is to solve a set of linear equations. However, the matrix is not always of full column rank, in which case the least-squares system has no unique solution and requires a pseudoinverse or regularization.
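The least-squares fit can be written directly with NumPy, as in the sketch below; np.linalg.lstsq returns a minimum-norm solution even when the feature matrix is rank-deficient. The feature matrix and weights here are synthetic.

```python
# Linear regression by least squares on a synthetic feature matrix.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))                        # rows = molecules, columns = features
w_true = np.array([1.0, -0.5, 0.0, 2.0, 0.3])
y = X @ w_true + rng.normal(scale=0.05, size=100)    # noisy affine target

# lstsq handles rank-deficient X by returning the minimum-norm solution.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("rank:", rank, "recovered weights:", w.round(2))
```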

 

Simple linear regression can be quite useful for quick approximations or for learning about the relationship between a single feature and the target. It also serves a useful pedagogical purpose, because the separability it implies can give insight into why models perform as they do. As the number of features increases, however, appropriately capturing their joint relationship to the target becomes increasingly complex, and the realities of noise mean that most datasets will inevitably violate the strict linearity assumption of simple regression models. While regularization can alleviate some of these concerns, linear regression can also suffer in regimes where the range of predicted values is so narrow that any improvement to an individual prediction is tiny. These models are therefore rarely the final choice for predicting complex or expensive molecular properties, although they remain useful baselines before specialized models are trained for such tasks.

 

4.2. Decision Trees

 

In a decision tree, the nodes represent the features being split, the edges represent the decision about which side of a threshold a value falls on, and the leaves represent the target with which a data point is associated. Importantly, each split should result in child nodes that are purer with respect to the target than the parent; that is, the split should divide the data into groups that predominantly fall into a single target class. Here, purity simply means the fraction of a group belonging to a single target class; a very skewed group, for example one with a Gini impurity of 0.1, consists almost entirely of a single class. The quality of splits in a decision tree can be evaluated in terms of the decrease in Gini impurity or in terms of entropy.

 

When a decision tree is constructed, the feature that provides the most information per choice is used to create the next split, after which that information has effectively been used up for the data in each branch. The information in the data is quantified by the Gini impurity or by the entropy. Information theory defines entropy in terms of the uncertainty about the value a discrete random variable will take, with the items in the data acting as the random variable; entropy measures the amount of surprise on learning the value the variable takes. The Gini impurity is similar to the entropy and, for binary classification, ranges from 0 to 0.5; a Gini impurity or entropy of 0 corresponds to the purest state. The first and most informative split forms the root, the topmost node of the tree.
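The two impurity measures are simple to compute, as this sketch shows for binary class-probability vectors; the values are purely illustrative.

```python
# Gini impurity and entropy for a class-probability vector.
import numpy as np

def gini(p):
    """Gini impurity: 0 for a pure node, 0.5 at worst for binary classes."""
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

def entropy(p):
    """Shannon entropy in bits: 0 for a pure node, 1 at worst for binary classes."""
    p = np.asarray(p)
    p = p[p > 0]                                   # avoid log(0)
    return -np.sum(p * np.log2(p))

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))       # 0.5 and 1.0: maximally impure binary split
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))       # 0.0 and 0.0: pure node
```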

 

4.3. Random Forests

 

There are two similar classifiers in the family of Random Forest algorithms. As the name suggests, both are ensemble classification methods, combining many individual classifiers. The first uses a collection of decision trees grown on bootstrap samples and produces an estimate based on the votes of the individual trees; the second uses a different, more randomized approach to determine the split nodes of the trees and aggregates them into an estimate. Both methods can determine the importance of the variables and can provide insight into the relevance of chemical features. For the Random Forest algorithm to function, we need to collect a large set of descriptors for a given chemical space; Random Forest then uses these data to generate a model for predicting the activity or property of interest. In contrast to other methods, a number of highly informative molecular descriptors can be identified, which play a major role in the portrayal of chemical data, so the resulting model describes the chemistry under scrutiny. The method also often achieves high recall and precision.
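A minimal scikit-learn sketch of this workflow follows, reporting out-of-bag score, precision, and recall on synthetic "active/inactive" data; the descriptors and labels are fabricated for illustration.

```python
# Random forest classification with OOB score, precision, and recall.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 10))                         # toy descriptor matrix
y = (X[:, 0] + X[:, 1] > 0).astype(int)                # toy "active/inactive" label

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
clf.fit(X_tr, y_tr)                                    # bootstrapped trees, majority vote

pred = clf.predict(X_te)
print("OOB:", round(clf.oob_score_, 3),
      "precision:", round(precision_score(y_te, pred), 3),
      "recall:", round(recall_score(y_te, pred), 3))
```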

 

The method can also examine the impact of chemical features on endpoint behavior; in particular, it permits examination of both feature importance and feature contribution. Data should be pre-processed so that good results are obtained, for example by centering and scaling to unit standard deviation. Bootstrapping is usually used in Random Forest and can be computationally intensive, although there are ways to reduce the number of decision trees required. In summary, the Random Forest method offers strong recall and precision, tolerance of instances with missing descriptor values, insight into which features are influential, and simple summaries through variable importances and the model trees themselves.

 

4.4. Support Vector Machines

 

SVMs are a type of classification or regression algorithm that finds the hyperplane in the input space that maximally separates the training data into two classes (classification) or that lies as close as possible to as many data points as possible (regression). The hyperplane is selected to maximize the margin to the closest training points of each class, which makes the model more robust against noise in the data and against idiosyncrasies of the particular training set. The position of the data points with respect to the margin defines the functional form of the SVM loss function, which is built to solve the optimization problem while reducing the risk of overfitting to the training data. In support vector classification, the hyperplane with the largest margin is selected, with specific techniques used to generalize the algorithm to datasets that are not separable by a single hyperplane.

 

Kernel functions can make SVM models non-linear by implicitly mapping the input space into a higher-dimensional space where the classes become separable, but most kernel functions make the models computationally intensive for large datasets and can erode the SVM's natural generalization advantage. SVMs are considered advantageous compared to simpler algorithms for chemical property prediction given their resistance to overfitting, in some cases yielding models that are more robust and generalizable on new datasets, including data from other geographical locations or laboratory conditions. (Pinheiro et al., 2021)(Mueller et al., 2020)(Schran et al., 2021)(Pinheiro et al., 2020)
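The sketch below trains a scikit-learn SVC with an RBF kernel on a toy, non-linearly separable dataset; the data-generating rule is an assumption chosen so that no single hyperplane separates the classes in the original space.

```python
# Kernel SVM classification on a non-linearly separable toy problem.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(400, 6))
y = (np.sum(X ** 2, axis=1) > 6).astype(int)     # label depends on radius, not a hyperplane

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = SVC(kernel="rbf", C=1.0, gamma="scale")     # RBF kernel maps inputs to a higher-dimensional space
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```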

 

5. Evaluation Metrics for Model Performance

 

This section focuses on the evaluation and comparison of the models developed in our study. The training and validation performance monitored during model optimization suggests that more complex ensemble techniques mainly provide better generalization ability, showing improved overall performance. Descriptors that are fast and simple to compute do not, on their own, provide a satisfactory model. When organic functional group classifications were used as an alternative target, the resulting performance was worse than that of the compound-level models built on the same descriptor set. The validation results of several compound-based models thus reveal that ensemble models generally offer higher sensitivity and consistency. Cross-validated model performances and the confusion matrices of the selected models are discussed in detail.

 

Model performance of the successful models is evaluated using the confusion matrix. The confusion matrices associated with the test dataset for the selected models, and the metrics derived from them, assist in the validation of classification models. We use the most common binary classification confusion matrix format, in which each row represents the actual class and each column the predicted class; the true positive, true negative, false positive, and false negative counts provide a concrete picture of classification performance. Additionally, three performance metrics, sensitivity, specificity, and positive predictive value, are typically chosen to characterize binary classification across different class sizes.

5.1. Mean Squared Error

 

Given an input space X and a target space Y, the mean squared error (MSE) loss measures the deviation of a predicted function h from the target function f, with h(x) intended to approximate f(x) for x ∈ X. In machine learning, the loss is computed over a minibatch of n samples: MSE = (1/n) Σi (h(xi) − yi)². Setting n = 1 recovers the usual scalar squared-error definition. Applied to machine learning models, the formula becomes a multivariate quadratic function of the prediction errors. The definition is flexible, since it is independent of the particular form of the input space X and target space Y, and it is one of the simplest loss functions for regression problems.
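Written out in NumPy, the batch MSE is a one-liner, as the minimal sketch below shows on illustrative values.

```python
# Mean squared error over a batch of n samples.
import numpy as np

def mse(y_true, y_pred):
    """MSE = (1/n) * sum((y_true - y_pred)^2)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean((y_true - y_pred) ** 2)

print(round(mse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]), 6))   # 0.02
```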

 

5.2. Root Mean Squared Error

 

Now we evaluate the predictive accuracy of the models on the eight regression tasks in terms of the RMSE between predicted and exact values. As expected, regression tasks one to three sit in the middle of the range, with the first two showing smaller RMSE. The three solubility prediction tasks show particularly good performance, with the lowest RMSE. The task of predicting the melting point of benzene, the smallest molecule among the organic solvents considered, can be challenging for the machine learning model: a best RMSE of 29.7 corresponds to a deviation of 3.61 kJ/mol from the melting point. Relative to their ranges of melting points, the other cores show better performance.

 

The four toxicity prediction tasks have RMSE below 60, including two with much smaller RMSE than the other tasks. Among them, the Zellweger syndrome and orphan nuclear receptor ligand activity tasks have RMSE below 30, and these are the tasks where the predictor provides the most value to users. Our second set of models has worse RMSE than the first in all regression tasks, indicating possible overfitting when new molecular structures are designed with these models.

 

5.3. R-Squared

 

For ML regression, the correlation between predicted and observed values, and the quality of the regression performed when building the QSAR model, are of utmost importance. This can be analyzed using R or R² in statistics. R-squared (R²) tells you how well the data fit the model: it lies between 0 and 1, with 1 being the best possible fit, and is the proportion of the variance in the dependent variable that is predictable from the independent variable(s). Equivalently, R² is the square of the correlation between the observed and predicted values of the dependent variable. Ideally, aim for an R² value above 0.7. Note that R² never decreases when new descriptors are added, which can give a misleading impression of improvement.

 

It is calculated using the following formula: R² = 1 − (SS_Res / SS_Tot), where:

 

SS_Res is the sum of squared residuals, i.e., the squared differences between the predicted values and the actual values, and SS_Tot is the total sum of squares, i.e., the squared differences between the actual values and their mean. An R² close to 1 indicates a good fit because the model accounts for most of the variability in the data. However, a basic assumption in regression analysis is that the model should fit the data already in hand, and large values of R² can result from overly complex models. R² is also less helpful when choosing one model among many candidates; in such cases, researchers often report the adjusted R-squared, which corrects for the number of independent variables used in the model. (Ganaie et al., 2022)(Di Napoli et al., 2020)(Ampomah et al., 2020)(Alqahtani et al., 2022)
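The formula translates directly into code, as in the sketch below, which computes R² from scratch on illustrative values; scikit-learn's r2_score implements the same definition.

```python
# R-squared from its definition, checked against scikit-learn.
import numpy as np
from sklearn.metrics import r2_score

def r_squared(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ss_res = np.sum((y_true - y_pred) ** 2)            # residuals: actual vs predicted
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total: actual vs mean
    return 1.0 - ss_res / ss_tot

y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [2.1, 3.9, 6.2, 7.8]
print(round(r_squared(y_true, y_pred), 4))             # 0.995
print(round(r2_score(y_true, y_pred), 4))              # same value from scikit-learn
```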

 

6. Advanced Machine Learning Techniques for Small Molecule Modelling

 

The previous sections have demonstrated that the basic concepts of machine learning are clearly applicable to rich-structure, multi-modal small molecule problems that connect to a wider research field, and that, with enough data, it is straightforward to train highly accurate models for a given property. Here, we consider further improved models that maintain sparsity and handle rare data points, and then discuss improvement strategies: the importance of more informative training data, utilizing various types of additional data, extending models to multi-task learning, implementation through Bayesian techniques, and finally the important issue of model or ensemble selection.

 

As we have already seen, scarce data for a particular task can lead to overfitting, that is, to many high-variance models for a particular variable. Conversely, in small molecule drug design, most chemically feasible molecules will have poor properties with respect to the objective of interest, so the uninteresting regions of chemical space will dominate these sparse but accurately measured task data. Exploiting sparsity is therefore essential for formulating multi-modal models of interesting chemical compounds.

 

6.1. Deep Learning

 

A variety of techniques exist for applying deep learning to organic chemistry. Methods span generative models, which take cheminformatics data and generate new compounds, and predictive models, which predict molecular properties such as response to disease or chemical properties such as the partition coefficient. Generative models commonly use recurrent or convolutional networks to derive molecular structure from mass spectral data or from string formats, and such techniques have shown good performance in proposing practical compounds. Predictive models for molecular properties may use similar architectures, and hybrid models incorporating both visualization and predictive components are also common. Techniques typified in machine learning, such as reinforcement learning, have been adapted to optimize the experimental synthesis of target compounds, with online access to reaction conditions and products adding a new dimension to virtual synthesis predictions. Care is needed when evaluating such deep learning models, as seemingly excellent performance can result from overfitting, reflecting poor generalization or the selection of the wrong model. There are also technical challenges in training deep learning models, with unbalanced data issues and tasks such as localization, detection, and segmentation not always transferring easily from other domains.

 

6.2. Ensemble Learning

 

In classification, regression, and clustering problems, one commonly encounters situations in which a single model does not perform well. The most common remedy is to build multiple models. In this chapter, we introduce several concepts related to the use of multiple models. Ensemble learning, also known as classifier or model combination, model fusion, or meta-learning, refers to any method that uses multiple models to obtain better predictive performance than any single model alone.

Traditionally, given an ensemble of N homogeneous models, the predictions are distinguished and combined by assigning a weight, i.e., a confidence in the prediction, to each model; these weights are then used to combine the outputs linearly. Beyond their frequent denoising effect, the weights help the ensemble identify and favor the expert models over the weak ones. Some important questions remain open, such as the optimal choice of the weighting scheme, and how to treat a test example whose confidence profile differs significantly from anything encountered during training. (Walters and Barzilay, 2020)(Baskin, 2020)(Cáceres et al., 2020)

 

7. Challenges and Limitations in Small Molecule Modelling with Machine Learning

 

Currently, a significant barrier to the widespread application of machine learning models in small molecule modeling is the large-scale aggregation and curation of high-quality experimental data. Existing databases contain a vast number of known organic small molecule structures, yet property data exist for only around 10–15% of them. Traditional rules of thumb have eased the identification of desirable new organic synthesis routes, but chemical scientists sometimes rely on empirical experience for certain types of targets, which can reduce the reliability and novelty of a designed route.

 

For machine learning model training, the number of datasets in which multiple properties are measured simultaneously by different experimental methods is limited. Furthermore, curating structured data for small molecule modeling with controlled value ranges, labels carrying error bounds and thresholds, or predefined corrections is difficult. Although databases covering related but different properties exist, they cannot conveniently be combined for general multi-property model training owing to incompatible normalization scales. For some specific types of multi-property models, the lack of labels coupling three or more variables remains a significant limitation of small molecule modeling with machine learning.

 

7.1. Data Quality and Quantity

 

The quantity of X-ray crystallography data for organic small molecules is very large, with an estimated cumulative total of approximately 1 million data sets. The number of generated data sets is strongly correlated with publication rates, indicating a consistent level of data production. Moreover, there can be many data sets for a single molecular structure, as single-crystal X-ray diffraction experiments can yield multiple data sets at different temperatures. At the current rate, a further one million data sets are added approximately every 7.5 years.

 

Estimates of the number of synthesizable small organic molecules suggest that there are, in theory, around 10^60 molecular structures up to a mass of 1000 Da. In terms of purchasable compounds, molecules up to a mass of 350 Da are readily commercially available, and it is estimated that there are 10^14 purchasable compounds up to a mass of 500 Da. With sufficient time and effort, all available molecules could in principle be tested experimentally by X-ray crystallography. These numbers suggest that the accuracy of a high-throughput trained model would be limited by the observational uncertainty of the training data rather than by insufficient coverage of molecular structural space. However, because of data linkage challenges, only approximately 90,000 high-resolution X-ray crystallography datasets are presently available.

 

7.2. Interpretability and Explainability

 

The successes and predictive power of machine learning are fundamentally based on learning from the complexity of often high-dimensional data. In organic small molecule discovery and design, such high-dimensional data may originate from explicit experimental measurement of a biophysical or physicochemical property. A key practical question, once a machine learning model has been built, is how the complexities learned by the model can be usefully exploited: to make informed rational design decisions based on model output, to guide optimal and bespoke model training strategies, and to capture and understand model predictions that mimic, provide insight into, or generalize over the complexity of molecular design itself. In this sense, interpretability makes machine learning a “learned friend” for small molecule generation. The need for model interpretability and biological relevance in organic small molecule predictions cannot be overemphasized in translational medicinal chemistry and in molecular design from big heterogeneous data. High-throughput biophysical screening, and even low-information-content data sets such as biochemical ligand binding confirmations, store hidden nuggets of invaluable biological information deeply connected to the genesis or reappraisal of a novel small molecule. (Simpkin, 2020)(Carbery, 2023)(Nass et al., 2021)(Thomas, 2021)

 

8. Applications of Machine Learning in Small Molecule Modelling

 

In this chapter, we described recent applications of machine learning to a diverse set of problems, highlighting the best performing models in each area. The scope of machine learning research and applications in organic small molecule modeling, however, goes well beyond what could be discussed in this brief chapter, and we believe that the adoption of machine learning tools is poised to grow further in the coming years. Magnetic modeling, molecular data fusion, and transfer learning are a few areas where machine learning methods could be leveraged much more, and the development of domain-specific models, domain adaptation, model interpretability, and good modeling practices are additional areas that deserve greater emphasis in the research community.

 

In the rest of this section, we review a variety of problems, focusing primarily on current state-of-the-art methods and performance while not excluding older work, across the areas of solvent effects in organic synthesis planning, organic molecule property prediction, non-covalent interaction energy prediction, and physical and bioactivity prediction, carried out with cheminformatics methods using molecular fingerprints and, where possible, graph representations and graph kernels.

 

8.1. Drug Discovery

 

Drug discovery is an expensive, challenging, and time-consuming process, and many drugs prove ineffective or show serious side effects that were not detected in preclinical trials. A methodology that could predict drug effectiveness and side effects from the structure of a compound would therefore be enormously powerful. Since the beginning of drug discovery, computational models have been used to assess chemical compounds, and not all of them required a physical understanding of the compound's behavior; a rapid and efficient correlation of chemical structure with biological activity is essential, allowing complex biological processes to be modeled with relatively simple calculations. Early drug discovery was based mainly on molecular modeling to study the structure of molecules and assist drug design. These classical methods integrate sparse basic chemical information about target proteins and find connections with related protein sequences that carry such information, using an iterative, cautious sequence of database searches, alignments, and visual inspections. In QSAR, the structure of a compound is used to predict that compound's activity; in ligand docking, one searches for a compound's optimal binding pose. These methods have had real successes but have remained too slow, too complex, and not predictive enough, and sequence- and structure-based approaches often reach only peripheral targets in drug gene families with complex biological functions. With the growth of genomic knowledge, biotechnology tools have generated an increasingly large amount of data, yet translating all of this information into molecular solutions remains a challenge amid the rapid expansion of molecular biology and related fields. The huge increase in sequence and structural data, combined with the slow evolution and poor selectivity of classical methodologies, has fostered the development of new approaches. In this context, computational methods capable of predicting toxicity emerge as a hope for reducing the time and cost of obtaining new drugs, and machine learning can be an important part of this family of approaches.

 

8.2. Material Design

 

This work is a collaboration between two labs investigating the potential of machine learning to accelerate data collection and to learn important features of organic small molecules, compared against molecular dynamics-based protocols. Using a combination of molecular dynamics and the high-quality structures learned by our neural networks, we demonstrate the ability to predict the accuracy of ligand binding in key modern therapeutic targets. These results open the door to using machine learning to guide the direction of organic synthesis by quickly learning information about the molecules of interest. In this section, we demonstrate the concept by first validating the capacity of the networks to learn basic constraints in small molecule modeling and to generate novel, chemically sound organic small molecules. We then demonstrate the capability of surrogate networks to interact with the parameters of organic synthesis. Lastly, we experiment with neural network design of small organic molecules and discuss interactions with the schedules and policies that future combinatorial chemistry factories might use. Our results are promising and motivate future efforts along this general theme. (Dou et al., 2023)(Chen et al., 2020)(Li et al., 2021)(Ye and Ouyang, 2021)

 

9. Ethical Considerations in Machine Learning for Small Molecule Modelling

 

While machine learning methods, used responsibly, can widen the pool of researchers who benefit from large and complex models, researchers must recognize the risk of biased models, as well as ethical concerns about intellectual property, access to models, and accountability. Changing how model development is done and increasing the diversity of data organizers, methods, applications, and user groups will counteract these concerns. Meeting these goals will require new ways of training the next generation of chemical research scientists, and care from established researchers when changing their approaches to model development. The development and use of machine learning in the molecular sciences can raise ethical issues, but ethics are not separate from the other practical considerations relevant to machine learning development, such as data bias or the selection of relevant model inputs. To best address these issues, the practice of chemical research must change, but it is more important for researchers to internalize these principles than to follow an ever-lengthening list of rules. (Adlung et al., 2021)(Ahmed et al., 2020)(Patel and Shah, 2022)(Greener et al., 2022)

 

10. Future Perspectives and Emerging Trends

 

In this review, we have discussed the different subfields of small molecule design and reviewed the current trends and state-of-the-art machine learning methods proposed to address them, together with the available benchmark datasets, essential preprocessing and featurization steps, and an evaluation and comparison of the various methods. The key findings from our observations are as follows. Methods such as deep neural networks excel at modeling quantum chemistry properties and some subfields; however, because chemical space is highly sparse, a great deal of information about the target tasks is often needed to train predictive models, and sparsely sampled data lead to reduced accuracy. Since the machine learning problems encountered in molecular design are interlinked and demanding, next-generation machine-learning-driven explorations will be interconnected, involving many players: pharmaceutical chemists, machine learning researchers, and people from many other fields with different expertise. Contributions from every voice, including early-, mid-, and late-career researchers across many institutions, should be collectively encouraged. In summary, following these future perspectives, well-motivated methods should lead to better-designed organic small molecules. Given that the target changes constantly, the choice of model is influenced greatly by the target aims, hardware, financial resources, and the trial-and-error knowledge learned iteratively among them. Four working modes are suggested: i) keep the catalogued models up to date, because they change quickly, with field experts acting as consultants to those working on the machine learning side; ii) have a translator check the necessary tasks in the field, making divide-and-conquer projects clear and easy to reproduce with exact execution; iii) let many expert machine learners bring their methods to build large numbers of models, iteratively grasping fast-developing data, with a completion time limit of three years and with the methods and models at every level used to decide whether predictions are feasible at that time; and iv) distill one final model that captures as much information as possible in an abbreviated, diverse method that is fast and reproducible. (Balytskyi, 2023)(Selvaratnam, 2020)(Lin and Mo, 2023)

References:

Walters, W. P. and Barzilay, R. “Applications of deep learning in molecule generation and molecular property prediction.” Accounts of chemical research (2020). [HTML]

Zhao, Z. W., del Cueto, M., Geng, Y., and Troisi, A. “Effect of increasing the descriptor set on machine learning prediction of small molecule-based organic solar cells.” Chemistry of Materials (2020). [HTML]

Chen, Guang, Zhiqiang Shen, Akshay Iyer, Umar Farooq Ghumman, Shan Tang, Jinbo Bi, Wei Chen, and Ying Li. “Machine-learning-assisted de novo design of organic molecules and polymers: opportunities and challenges.” Polymers 12, no. 1 (2020): 163. mdpi.com

Dou, Bozheng, Zailiang Zhu, Ekaterina Merkurjev, Lu Ke, Long Chen, Jian Jiang, Yueying Zhu, Jie Liu, Bengong Zhang, and Guo-Wei Wei. “Machine learning methods for small data challenges in molecular science.” Chemical Reviews 123, no. 13 (2023): 8736-8780. nih.gov

Mahmood, Asif, Yahya Sandali, and Jin-Liang Wang. “Easy and fast prediction of green solvents for small molecule donor-based organic solar cells through machine learning.” Physical Chemistry Chemical Physics 25, no. 15 (2023): 10417-10426. archive.org

Cohn, R. and Holm, E. “Unsupervised Machine Learning Via Transfer Learning and k-Means Clustering to Classify Materials Image Data.” Integrating Materials and Manufacturing Innovation (2021). [PDF]

Glielmo, Aldo, Brooke E. Husic, Alex Rodriguez, Cecilia Clementi, Frank Noé, and Alessandro Laio. “Unsupervised learning methods for molecular simulation data.” Chemical Reviews 121, no. 16 (2021): 9722-9758. acs.org

Caron, Mathilde, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, and Armand Joulin. “Unsupervised learning of visual features by contrasting cluster assignments.” Advances in neural information processing systems 33 (2020): 9912-9924. neurips.cc

Zhan, Xiaohang, Jiahao Xie, Ziwei Liu, Yew-Soon Ong, and Chen Change Loy. “Online deep clustering for unsupervised representation learning.” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6688-6697. 2020. thecvf.com

Boczar, D. and Michalska, K. “A review of machine learning and QSAR/QSPR Predictions for complexes of organic molecules with cyclodextrins.” Molecules (2024). nih.gov

Akrom, Muhamad, Totok Sutojo, Ayu Pertiwi, Supriadi Rustad, and Hermawan Kresno Dipojono. “Investigation of Best QSPR-Based Machine Learning Model to Predict Corrosion Inhibition Performance of Pyridine-Quinoline Compounds.” In Journal of Physics: Conference Series, vol. 2673, no. 1, p. 012014. IOP Publishing, 2023. iop.org

Levatić, Jurica, Michelangelo Ceci, Tomaž Stepišnik, Sašo Džeroski, and Dragi Kocev. “Semi-supervised regression trees with application to QSAR modelling.” Expert Systems with Applications 158 (2020): 113569. [HTML]

Pham, T. H. and Le, P. K. “A data-driven QSPR model for screening organic corrosion inhibitors for carbon steel using machine learning techniques.” RSC advances (2024). rsc.org

Li, X. and Fourches, D. “SMILES pair encoding: a data-driven substructure tokenization algorithm for deep learning.” Journal of chemical information and modeling (2021). chemrxiv.org

Rajan, K., Zielesny, A., and Steinbeck, C. “STOUT: SMILES to IUPAC names using neural machine translation.” Journal of Cheminformatics (2021). springer.com

Krenn, Mario, Florian Häse, AkshatKumar Nigam, Pascal Friederich, and Alan Aspuru-Guzik. “Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation.” Machine Learning: Science and Technology 1, no. 4 (2020): 045024. iop.org

Jo, J., Kwak, B., Choi, H. S., and Yoon, S. “The message passing neural networks for chemical property prediction on SMILES.” Methods (2020). sciencedirect.com

Pinheiro, M., Ge, F., Ferré, N., Dral, P. O., and Barbatti, M. “Choosing the right molecular machine learning potential.” Chemical Science (2021). rsc.org

Mueller, T., Hernandez, A., and Wang, C. “Machine learning for interatomic potential models.” The Journal of chemical physics (2020). aip.org

Schran, Christoph, Fabian L. Thiemann, Patrick Rowe, Erich A. Müller, Ondrej Marsalek, and Angelos Michaelides. “Machine learning potentials for complex aqueous systems made simple.” Proceedings of the National Academy of Sciences 118, no. 38 (2021): e2110077118. pnas.org

Pinheiro, Gabriel A., Johnatan Mucelini, Marinalva D. Soares, Ronaldo C. Prati, Juarez LF Da Silva, and Marcos G. Quiles. “Machine learning prediction of nine molecular properties based on the SMILES representation of the QM9 quantum-chemistry dataset.” The Journal of Physical Chemistry A 124, no. 47 (2020): 9854-9866. researchgate.net

Ganaie, Mudasir A., Minghui Hu, Ashwani Kumar Malik, Muhammad Tanveer, and Ponnuthurai N. Suganthan. “Ensemble deep learning: A review.” Engineering Applications of Artificial Intelligence 115 (2022): 105151. [PDF]

Di Napoli, Mariano, Francesco Carotenuto, Andrea Cevasco, Pierluigi Confuorto, Diego Di Martire, Marco Firpo, Giacomo Pepe, Emanuele Raso, and Domenico Calcaterra. “Machine learning ensemble modelling as a tool to improve landslide susceptibility mapping reliability.” Landslides 17, no. 8 (2020): 1897-1914. [HTML]

Ampomah, E. K., Qin, Z., and Nyame, G. “Evaluation of tree-based ensemble machine learning models in predicting stock price direction of movement.” Information (2020). mdpi.com

Alqahtani, A., Shah, M. I., Aldrees, A., and Javed, M. F. “Comparative assessment of individual and ensemble machine learning models for efficient analysis of river water quality.” Sustainability (2022). mdpi.com

Baskin, I. I. “The power of deep learning to ligand-based novel drug discovery.” Expert opinion on drug discovery (2020). [HTML]

Cáceres, E. L., Tudor, M., and Cheng, A. C. “Deep learning approaches in predicting ADMET properties.” Future Medicinal Chemistry (2020). tandfonline.com

Simpkin, A. J. “Exploring Unconventional Approaches to Molecular Replacement in X-ray Crystallography with SIMBAD.” (2020). [HTML]

Carbery, A. “Designing more efficient compound libraries for crystallographic fragment screening.” (2023). ox.ac.uk

Nass, Karol, Camila Bacellar, Claudio Cirelli, Florian Dworkowski, Yaroslav Gevorkov, Daniel James, Philip JM Johnson et al. “Pink-beam serial femtosecond crystallography for accurate structure-factor determination at an X-ray free-electron laser.” IUCrJ 8, no. 6 (2021): 905-920. iucr.org

Thomas, W. C. “Convergent Allostery in Bacillus subtilis Ribonucleotide Reductase Studied by SAXS, Crystallography, and Cryo-EM.” (2021). [HTML]

Li, Xiaobo, Phillip M. Maffettone, Yu Che, Tao Liu, Linjiang Chen, and Andrew I. Cooper. “Combining machine learning and high-throughput experimentation to discover photocatalytically active organic molecules.” Chemical Science 12, no. 32 (2021): 10742-10754. rsc.org

Ye, Z. and Ouyang, D. “Prediction of small-molecule compound solubility in organic solvents by machine learning algorithms.” Journal of cheminformatics (2021). springer.com

Adlung, L., Cohen, Y., Mor, U., and Elinav, E. “Machine learning in clinical decision making.” Med (2021). cell.com

Ahmed, Z., Mohamed, K., Zeeshan, S., and Dong, X. Q. “Artificial intelligence with multi-functional machine learning platform development for better healthcare and precision medicine.” Database (2020). oup.com

Patel, V. and Shah, M. “Artificial intelligence and machine learning in drug discovery and development.” Intelligent Medicine (2022). sciencedirect.com

Greener, Joe G., Shaun M. Kandathil, Lewis Moffat, and David T. Jones. “A guide to machine learning for biologists.” Nature reviews Molecular cell biology 23, no. 1 (2022): 40-55. ucl.ac.uk

Balytskyi, Y. “Interdisciplinary Applications of Physics and Machine Learning.” (2023). [HTML]

Selvaratnam, B. “Applications of Machine Learning in Materials Chemistry.” (2020). [HTML]

Lin, J. and Mo, F. “Empowering Research in Chemistry and Materials Science through Intelligent Algorithms.” Artificial Intelligence Chemistry (2023). sciencedirect.com
