Integrating Machine Learning into Computational Chemistry and Materials Science

1. Introduction to Computational Chemistry and Materials Science

 

Computational chemistry is the science of predicting chemical properties and the outcomes of chemical transformations using computational methods, and it serves as a tool for understanding complex molecular behavior. Computational materials science applies such methods to design new materials and to understand and engineer their properties. Both fields have grown rapidly over the past decade and are likely to remain central to understanding molecular behavior, and thereby to innovation in materials design, in the coming years. However, current computational methods suffer from significant handicaps. They are slow, often delivering a reliable prediction about a material only after it has already been made on the basis of partial understanding. They rest on a large and complex formalism that can be difficult to adapt from one class of problems to another. They often rely on approximations that fail for certain systems. And, in materials science, the space of candidate compositions and structures is dauntingly large, so that a screening study can vary only a few properties at a time while holding everything else fixed.

 

Machine learning shows the potential to revolutionize these important areas of scientific inquiry by easing the limitations of first-principles methods and by making it possible to study systems too complex for fully predictive theory, absent a revolution in algorithm development. Indeed, machine learning methods have already entered computational chemistry: stable low-coordinate atomic clusters have been predicted without expensive calculations, chemical reaction energies are being estimated with surrogate models, and maps of chemical compound space have been constructed to direct high-throughput screening of new zeolite structures. Here, we present a review of machine learning methods in these rapidly developing fields.

 

2. Fundamentals of Machine Learning

 

Machine learning has been a broad and intensively studied research field in the 21st century. Its essence is to enable machines to "learn" without being explicitly programmed: after building a model of, or learning a suitable representation from, the input data, guided by known outputs or rewards, the resulting system makes decisions through data-driven reasoning. Machine learning has a long tradition, and its three classical forms are supervised learning, unsupervised learning, and reinforcement learning. Supervised learning, widely applicable to regression and classification, makes predictions from training data with known, visible targets; its challenge is that one must compress possibly high-dimensional feature data into a model, which can cost interpretability with respect to predefined properties or make accurate prediction difficult. Unsupervised learning, by contrast, uses clustering or association to infer intrinsic structure from the features alone, with no preset targets. Reinforcement learning learns from rewards rather than labels; its reward design is often unintuitive, and gathering learning experience can be costly or time-consuming in difficult environments or domains.
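The supervised and unsupervised paradigms described above can be sketched in a few lines. The descriptor matrix and target below are synthetic toys, not chemical data, and scikit-learn is assumed to be available:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Supervised learning: descriptors X come with known targets y
# (here a toy linear "property" with a little noise).
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
model = LinearRegression().fit(X, y)

# Unsupervised learning: no targets at all; group the samples
# purely by the structure of their features.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```

Reinforcement learning, the third paradigm, has no fixed dataset at all: an agent generates its own training experience by interacting with an environment, as illustrated in Section 9.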

 

Like other nature-inspired, bio-mimicking systems for learning and reasoning, classical neural networks, from simple multilayer fully connected architectures to deeply stacked autoencoders, map low-level data into higher-level representation spaces, yielding increasingly abstract and compressed (though not necessarily principled) latent structure; in chemistry such learned latent factors can even serve as "force fields." This work aims to provide useful, broadly applicable strategies for an audience of computer scientists, computational scientists, materials simulators, and quantum chemists, helping them grasp the practical physics, applications, and characteristics of typical chemical, quantum, and machine-learning methods. In this section, a reference-focused overview of machine learning is provided. Since our target audience is not machine learning specialists, we avoid an extensive treatment of machine learning principles and algorithms, and instead point readers working on electronic structure, natural and disordered polymers, ionic fluids and solids, and related topics to references that discuss machine learning applications, principles, and algorithms. (Pugliese et al., 2021)(Cioffi et al., 2020)(Xu et al., 2021)

 

3. Applications of Machine Learning in Computational Chemistry

 

This section explores the benefits and impact of machine learning in applications for computational chemistry. We first note applications of machine learning in quantum chemistry simulation and in virtual high-throughput screening of organic compounds for electronic and optoelectronic material properties. In molecular modeling especially, a range of complex problems is attracting effort worldwide: electronic structure, dynamics, spectra, electronic transport, non-equilibrium processing, and nanoscale confinement. Machine learning can also offer a route to a quantum-level understanding of bulk materials, nanoparticles, and active surfaces, followed by their low-dimensional reduction into new chemical, electronic, and photonic materials. As a final example, we show how machine learning can be used for the further chemical task of reaction discovery, automatically suggesting synthetic conditions that enable a desired, targeted transformation.

 

Breakthroughs in machine learning and improvements in its capabilities have transformed computational chemistry and materials science in a number of ways. Improved prediction accuracy, and the ability to make reasonably accurate approximate predictions very rapidly, has enabled the generation of vast datasets and given researchers a broader range of tools to direct research and development. Projects such as the Open Quantum Materials Database (OQMD) have leveraged machine learning to perform efficient high-throughput materials calculations, significantly scaling up and refining search spaces for a variety of problems. As a result, machine learning tools have proved effective in, for example, screening millions of organic molecules to identify the most promising charge-transport materials for widespread use in organic photovoltaics (OPVs) and organic light-emitting diodes (OLEDs). (Westermayr et al., 2021)(Morgan and Jacobs, 2020)(Hong et al., 2020)

 

4. Applications of Machine Learning in Materials Science

 

In several separate sections, we discuss relevant agreed-upon background material (e.g., theory, algorithms, relationships to other methods) as well as context for discussion in other parts of the paper. Rather than describing this point more than once, we keep the topic sentence relatively abstract and then immediately turn to specifics. As in the previous section, we first place machine learning in its scientific context. We then survey its numerous applications in computational chemistry and materials science: screening materials for particular applications, design of novel molecules and materials, property optimization, prediction and interpolation of potential energy surfaces, acceleration of molecular dynamics (coarse graining, force learning) and construction of accurate force fields, data-driven estimation of physical properties (equations of state, phase diagrams, transport properties, etc.), and analysis of molecular and materials structure, properties, and synthesis.

 

In materials science in particular, machine learning can be applied to a property of interest (screening novel materials), to a pair of material and property (the same screening can be extended to a wider list of available properties), or to property learning and prediction (for example, N2, Ar, and water adsorption on Cu-BTC MOFs, or CO on FePt), the goal being to select novel materials with a desired property. The bottom-up approach in artificial intelligence (AI) is indeed witnessing a major comeback, as evidenced by the significant, though not yet fully revealed, interest of the private sector. The efficient integration of disciplines such as computational chemistry with AI (built on several aspects of machine learning) reflects this trend and is producing genuinely interdisciplinary research outcomes, perhaps the most salient example being Pande's project simulating proteins and their folding on Folding@home. (Westermayr et al., 2021)(Morgan and Jacobs, 2020)(Rodrigues et al., 2021)(Hong et al., 2020)

 

5. Challenges and Limitations of Integrating Machine Learning in Computational Chemistry and Materials Science

 

1. Data quality: Owing to scarcity, the cost of computation, or the lack of experimental control parameters, the amount of data available for learning in the molecular sciences is still far less than machine learning methods typically assume. Molecular datasets are, moreover, oriented toward complex chemistry problems, such as intricate reaction networks and the understanding of electronic properties, which are intrinsically difficult. In addition, data quality itself remains a challenge, for example the precision and calibration of computations, the errors incurred in converting experimentally derived scattering patterns or X-ray crystal structures into properties of interest, and the strength of the perturbation introduced by the probe itself.

 

2. Interpretability: Explainability remains an open problem in designing transparent models for the molecular sciences. There is often a tension between transparency and performance: increasing the accuracy of machine learning models in scientific applications can reduce their interpretability. This matters for data-driven decision-making applications such as the design of chemical states and the identification of candidate experimental targets. Even when transparent, sparse models are used, transparency becomes hard to maintain once large descriptor vectors and high-dimensional feature maps are involved; sparsity simplifies, but does not solve, the interpretability problem for neural networks. Misleading interpretations, even from nominally "glass-box" models, can still result from incorrect input variables, weak assumptions about the relationship between features and targets, or other unknown limitations of the final model. Furthermore, noisy data and highly correlated features, common in chemical and biological datasets, can make attributions of importance to individual variables unreliable, and the additional control procedures applied during model development are themselves prone to overfitting. The reliability of such interpretations therefore also needs to be assessed.

 

3. Model robustness: The quality of a model depends strongly on how the data are split for training and validation. Models that perform well under random cross-validation can still produce erratic predictions on genuinely out-of-distribution inputs, a recurring problem in the machine learning molecular sciences. An overfit model trained on low-quality data can achieve high cross-validation performance under a particular analysis specification and yet fail against external validation criteria. Likewise, a model that classifies well on one design set may transfer poorly to another, and an apparent ability to predict, say, chemical toxicity may not survive a change of chemical series. Many published machine learning models in the molecular sciences are therefore of limited practical relevance: the applicability domain is rarely specified, and successful independent replication studies are often missing. (Wigh et al., 2022)(Greener et al., 2022)(Van et al., 2022)(Walters and Barzilay, 2020)
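The splitting issue above can be made concrete with a small, entirely synthetic sketch: when a dataset contains families of near-duplicate compounds, random cross-validation leaks family members across folds and can report strong performance even when there is no real signal, whereas a group-aware split does not (scikit-learn assumed; the "compounds" and "property" are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)

# 40 "parent" compounds, each appearing 5 times as a near-duplicate analogue.
parents = rng.normal(size=(40, 5))
targets = rng.normal(size=40)            # property with NO learnable signal
X = np.repeat(parents, 5, axis=0) + 0.01 * rng.normal(size=(200, 5))
y = np.repeat(targets, 5)
groups = np.repeat(np.arange(40), 5)     # family label for each sample

model = RandomForestRegressor(n_estimators=50, random_state=0)
r2_random = cross_val_score(model, X, y,
                            cv=KFold(5, shuffle=True, random_state=0)).mean()
r2_group = cross_val_score(model, X, y, groups=groups, cv=GroupKFold(5)).mean()
# Random splits leak near-duplicates across folds and look deceptively good;
# grouping whole families into one fold reveals that there is no real signal.
```

A scaffold- or family-aware split is the chemistry analogue of `GroupKFold` here; the principle is the same.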

 

6. Data Preprocessing and Feature Engineering for Machine Learning in Chemistry and Materials Science

 

Modern research in computational chemistry and materials science produces ever-increasing amounts of experimental and theoretical data. However, strong nonlinearities, dependence on often obscure interaction patterns, and explicitly quantum mechanical behavior limit the predictive power attainable by physical models alone. This landscape is fertile ground for machine learning: approximating complex functions from abundant data is the specialty of models such as artificial neural networks, and remarkably predictive models have emerged in recent years. Nevertheless, simulating matter, from its chemistry up to larger-scale phases and properties, involves complex preprocessing of large amounts of data. What steps do researchers take to prepare numerical inputs for ML? This section provides an overview of current practice.

 

Raw data usually consist of numerical quantities measured or calculated by experimental and computational methods, with characteristics and units often specific to those methodologies. A common requirement unites these raw inputs: physical, chemical, or structural information has to be represented as vectors, tensors, or other regular grid-like quantities in 2D (or 3D). The simplest preprocessing converts text input into numerical vectors of fixed length; properties and units, including derivatives and other auxiliary quantities, behave like categorical data when treated as ML inputs. To adapt them for classical ML tools, most researchers (and the corresponding software libraries) apply the simplest form of scaling: normalizing all training inputs to zero mean and unit variance. Standard and min/max scalers are in common use, although robust scalers, which are less sensitive to outliers, also appear. The specific choice of ML algorithm and follow-up analysis may dictate the most appropriate scaling. (Shimakawa et al., 2024)(Song et al., 2022)(Pinheiro et al., 2020)(Hrivnak et al., 2020)
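A minimal sketch of the two common scalers mentioned above, using scikit-learn on a hypothetical descriptor matrix whose columns carry very different units and ranges:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical descriptors for three compounds: molecular weight (g/mol),
# dipole moment (D), HOMO-LUMO gap (eV) -- wildly different scales.
X = np.array([[180.2, 1.85, 4.1],
              [ 46.1, 1.69, 7.3],
              [342.3, 3.10, 3.8]])

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
X_mm  = MinMaxScaler().fit_transform(X)     # each column rescaled to [0, 1]
```

In practice, the scaler is fit on the training split only and then applied unchanged to validation and test data, to avoid information leakage.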

 

7. Supervised Learning Algorithms for Predictive Modeling in Chemistry and Materials Science

 

The most common machine learning algorithms used in these fields are "supervised" learning algorithms. In supervised learning, the model is given a set of input data along with the expected outputs and attempts to learn the mapping between them. Once trained, the model can be used to predict properties or behaviors of new materials and chemical systems. The models most used in predictive modeling are regression algorithms (for predicting continuous values), classification algorithms (for predicting discrete values), and ensemble methods (which combine other models for predictive modeling tasks).

 

Regression methods are used when the target property is continuous, that is, when it can take any value. Common regression algorithms in predictive modeling include artificial neural networks (including force-field-like architectures), random forest regression, gradient boosting regression, and support vector regression. Classification methods differ from regression in that the model is instructed to place predicted values into predefined categories; for example, one may train a model to predict whether a molecule is likely to be toxic based on its chemical structure and other properties. Common classification methods include k-nearest neighbors, decision trees, random forests, and support vector machines. Ensembles, or meta-learners, are algorithms that combine several learning models (either of the same algorithm type or across types) to improve performance and generalizability. Let us now look at some best practices and specific considerations when leveraging supervised learning for predictive modeling in chemistry and materials science.
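The regression/classification distinction can be sketched with one model family. The descriptors and targets below are synthetic stand-ins, and scikit-learn is assumed:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))   # hypothetical molecular descriptors

# Regression: continuous target, e.g. a formation-energy-like quantity.
y_reg = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=200)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y_reg)

# Classification: discrete target, e.g. a toy "toxic" / "non-toxic" label.
y_cls = (X[:, 0] + X[:, 1] > 0).astype(int)
cls = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y_cls)
```

The same descriptor matrix feeds both tasks; only the target type (and hence the model's loss and output) changes.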

 

One of the challenges in predictive modeling is the tradeoff between model performance (i.e., the errors in our predictions on the training data) and overfitting. Overfitting occurs when a model becomes so complex that it learns the noise in the training data, which reduces its ability to generalize to new, unseen data. Understanding model complexity and overfitting is particularly important in predictive modeling, where we must consider the physical nature of our data (e.g., is my molecule predicted to have physical properties that make sense?). For this reason, careful selection of molecular and materials "features" (e.g., chemical descriptors for molecules) as inputs to predictive models, as well as sound approaches for dealing with missing data, can have a significant impact on the validity of model predictions. Careful feature selection is thus of utmost importance when constructing a model that will predict material properties from molecular structures. Here, we give a brief overview of general trends to consider when selecting input features for a machine learning algorithm. (Huang et al., 2020)(Boateng et al., 2020)(Fan et al., 2021)(Lee et al., 2020)
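The complexity/overfitting tradeoff can be demonstrated on deliberately noisy toy data: an unconstrained decision tree fits the training set almost perfectly yet cross-validates worse than a regularized one (scikit-learn assumed; data synthetic):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 5))
y = X[:, 0] + rng.normal(size=150)   # true signal plus heavy noise

deep = DecisionTreeRegressor(max_depth=None, random_state=0)   # memorizes noise
shallow = DecisionTreeRegressor(max_depth=3, random_state=0)   # regularized

train_r2 = deep.fit(X, y).score(X, y)              # near-perfect on training data
cv_deep = cross_val_score(deep, X, y, cv=5).mean()      # poor generalization
cv_shallow = cross_val_score(shallow, X, y, cv=5).mean()  # better generalization
```

The deep tree's near-perfect training score is the warning sign described above; only held-out performance reflects what the model will do on new compounds.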

 

8. Unsupervised Learning Algorithms for Pattern Recognition in Chemistry and Materials Science

 

Unsupervised learning is used for pattern recognition and clustering, uncovering hidden structures and relationships in data. It helps us identify molecules with similar chemical structures and properties so as to better understand the absorption, distribution, metabolism, excretion, and toxicity (ADMET) of drug-like molecules. Specific unsupervised learning approaches include clustering, dimensionality reduction, and density estimation.

 

Dimensionality reduction is particularly important for chemical and materials applications because there are often hundreds of molecular descriptors, for example in drug discovery. With dimensionality reduction, choosing a few major descriptors is the essence of chemical structure-activity relationships. Density estimation is typically used for sampling and data generation, including compound diversity analysis and rule-based combinatorial approaches in chemistry, as well as molecule generation in adversarial settings. Clustering can be used to group compounds with similar activities, which is of particular interest in target identification. Moreover, clustering can also be used to identify and group Caco-2 active (or passive) compounds according to kinase activity in ADMET research for effective drugs. Although clustering, dimensionality reduction, and density estimation are further subdivided, their applications to bioactive compounds and materials are wide-ranging.
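A compact sketch of dimensionality reduction followed by clustering, on a synthetic descriptor matrix with a hidden low-dimensional structure (scikit-learn assumed; the "compounds" are invented):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# 100 hypothetical compounds x 50 highly correlated descriptors:
# the data are generated from only 3 latent factors plus small noise.
latent = rng.normal(size=(100, 3))
X = latent @ rng.normal(size=(3, 50)) + 0.05 * rng.normal(size=(100, 50))

pca = PCA(n_components=3).fit(X)
X_red = pca.transform(X)                     # 50 descriptors -> 3 components
explained = pca.explained_variance_ratio_.sum()
clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_red)
```

Because the 50 descriptors are redundant, three principal components capture almost all the variance, and clustering in the reduced space is both cheaper and less noisy than in the raw descriptor space.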

 

Unsupervised learning can offer new approaches for discovering hidden patterns and relationships from available representations, i.e., for uncovering "chemical semantics." Compounds with similar features or properties (such as structural information or molecular or material specifications) can potentially be clustered and evaluated together. Likewise, densely overlapping reactions with similar structural patterns in a dataset can be grouped by clustering. Compounds sharing similar features may cluster usefully for bioactivity prediction, especially among closely related protein kinases. In ADMET studies, compounds can be clustered by fingerprint features to discover shared CYP3A4 inhibition properties or to predict their activities. Furthermore, unsupervised methods can also be used for density estimation and novelty search: random samples can be screened for new compounds matching desired ADME specifications, so that candidate products are selected before practical organic synthesis. In QM/MD simulation, structures and their stable energies, from early intermediates and transition states to the final stable product, can be sampled with clustering algorithms. Although clustering attracts growing interest, results vary, and compounds assigned to the same cluster may occupy merely notional categories. The validity and applicability of clustering results depend heavily on the chosen substructure features, similarity measures, and density distributions. Note that the way data similarity is represented and evaluated is closely linked to clustering performance, since effectiveness may depend on the similarity representation. (Glielmo et al., 2021)(Oliveri et al., 2021)(Westermayr et al., 2021)(Karthikeyan and Priyakumar, 2022)
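Since the similarity measure is so central to fingerprint-based clustering, here is a minimal sketch of the Tanimoto (Jaccard) similarity on toy binary fingerprints; the 8-bit vectors are illustrative only, far shorter than real fingerprints:

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprint vectors:
    |intersection of on-bits| / |union of on-bits|."""
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / union if union else 1.0

# Hypothetical 8-bit fingerprints for three compounds.
fp1 = [1, 0, 1, 1, 0, 0, 1, 0]
fp2 = [1, 0, 1, 0, 0, 0, 1, 0]   # close analogue of fp1
fp3 = [0, 1, 0, 0, 1, 1, 0, 1]   # structurally unrelated
```

A distance of `1 - tanimoto(a, b)` can be fed directly into hierarchical or density-based clustering of compound libraries.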

 

9. Reinforcement Learning in Chemistry and Materials Science

 

Reinforcement learning for decision-making and optimization has been the subject of many recent studies across various tasks in materials science and chemistry. For trajectory optimization problems, prior work has shown that reinforcement learning can discover adaptive strategies that outperform prevailing methods. In addition, reward-driven approaches have shown some success in applying reinforcement learning to molecular design, such as retrosynthesis, as well as to chemical property prediction.

 

Reinforcement learning aims to learn a decision policy for tasks that involve a sequence of choices, so as to maximize a long-term objective. In recent years, reinforcement learning methods have been applied with significant success to a large number of challenging decision-making tasks. Many such works show how reinforcement learning can support decision-making in chemistry, biotechnology, and materials science by learning strategies for materials discovery and synthesis. A macroscopic view of the policy can reveal an adaptive strategy, while observing and analyzing how such strategies evolve during training can reveal how the trajectory is optimized. In this section, we survey several published applications of reinforcement learning in materials and chemistry, spanning many different reinforcement learning approaches and tasks including molecular design, retrosynthesis, catalyst design, crystal-structure prediction, polymer design, materials synthesis planning, and optimization of autonomous synthesizers. (Walters and Barzilay, 2020)(Horwood and Noutahi, 2020)(Simm et al., 2020)(Shen et al., 2021)
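The core loop of value-based reinforcement learning can be sketched with tabular Q-learning on a deliberately tiny toy environment: an "agent" steps along a 1-D chain of five states, and only the last state (standing in for a hypothetical property optimum) yields a reward. This is an illustration of the update rule, not of any published chemistry application:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2          # actions: 0 = step left, 1 = step right
Q = np.zeros((n_states, n_actions)) # tabular action-value estimates
alpha, gamma, eps = 0.5, 0.9, 0.5   # learning rate, discount, exploration

for episode in range(400):
    s = 0
    for _ in range(40):
        # Epsilon-greedy: explore randomly, otherwise take the best-known action.
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else 0.0
        # Q-learning update toward the bootstrapped target.
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
        if r:                        # episode ends at the rewarded state
            break

policy = Q.argmax(axis=1)            # learned greedy policy per state
```

After training, the greedy policy steps right from every non-terminal state, i.e., it has learned the delayed-reward sequence of choices described above.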

 

10. Deep Learning Techniques for Chemistry and Materials Science

 

Among machine learning techniques, the most influential have been deep learning methods: neural networks of many layers, and modifications thereof (e.g., convolutional neural networks). Here, we discuss applications of deep learning in materials science and chemistry. Deep learned models are now widely used to predict properties as a function of molecular and material structure, i.e., to represent structure-property relationships. Beyond generating interest, deep learning has achieved a number of successes in these fields, including many state-of-the-art algorithms for image- and graph-based tasks in chemistry, physics, and materials science. Notably, deep learning has extended the reach of machine learning in these fields from regression (predicting real-valued atomization energies or bulk moduli from known compositions) to, for instance, predicting the crystal structure of a hypothetical material given its composition.

 

Deep learning is also used for molecular generation and optimization. For example, variational autoencoders can be trained on molecules to learn how to generate new (previously unseen) molecules similar to those in the training set. Different models are trained on different datasets of molecules or materials, e.g., organic molecules, polymers, and crystalline materials. Many DNNs, especially convolutional neural networks (CNNs), employ hierarchical feature representations that capture complex, spatially varying features in images or in spatially ordered data such as molecular and crystal structures; related graph neural networks can likewise take a molecular graph as input or return one as output. (Choudhary et al., 2022)(Westermayr et al., 2021)(Oviedo et al., 2022)(Morgan and Jacobs, 2020)

 

11. Transfer Learning in Chemistry and Materials Science

 

A challenge of prominent importance in chemistry and materials science in recent years is the lack of large labeled datasets. This scarcity causes problems for machine learning models in one of the following ways: the model overfits the small dataset, leading to poor generalization when faced with new instances that are more representative of the underlying data distribution; or the model is tied to the chemical descriptors it was trained on and cannot be generalized to accommodate the slight variations found in related chemical tasks. To address this issue, transfer learning, in which knowledge acquired on one task is leveraged to improve learning on another, has become increasingly attractive. We therefore provide an overview of transfer learning methodologies tailored to the specificities of both chemical and materials data.

 

More specifically, a great deal of research effort has been dedicated to the problem of data scarcity in transfer learning applications within both quantum mechanics-based (i.e., computational chemistry) and data-driven frameworks. Transfer learning gained popularity through the adaptation of state-of-the-art neural networks pre-trained on large-scale tasks, such as image classification. Deep learning is a key area where knowledge transfer has found prominence, particularly in natural language processing, reinforcement learning, and the development of featurizations from raw data, such as IUPAC chemical names, into molecular representations suitable for learning. (Jablonka et al., 2020)(Choudhary et al., 2022)(Reiser et al., 2022)(Olivetti et al., 2020)
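The pretrain-then-fine-tune pattern can be sketched with scikit-learn's `MLPRegressor` and its `warm_start` flag, which keeps the learned weights between calls to `fit`. The "source" and "target" properties below are synthetic stand-ins for an abundant related task and a scarce task of interest:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# "Source" task: abundant data for a related (toy) property.
X_src = rng.normal(size=(2000, 10))
y_src = np.sin(X_src[:, 0]) + X_src[:, 1]

# "Target" task: only 40 labeled samples of a shifted variant of that property.
X_tgt = rng.normal(size=(40, 10))
y_tgt = np.sin(X_tgt[:, 0]) + X_tgt[:, 1] + 0.3

# Pre-train on the large source task; warm_start=True means the next fit()
# continues from the learned weights instead of reinitializing.
net = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=300,
                   warm_start=True, random_state=0)
net.fit(X_src, y_src)          # pre-training on the data-rich task
net.set_params(max_iter=50)
net.fit(X_tgt, y_tgt)          # brief fine-tuning on the scarce task
```

In practice one often also freezes early layers or lowers the learning rate during fine-tuning; `warm_start` is the simplest form of weight reuse available in this API.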

 

12. Interpretable Machine Learning Models for Chemistry and Materials Science

 

Interpretable machine learning models have recently garnered significant attention in chemistry and materials science. The major motivation for using interpretable models is to validate a model's predictions, since the perceived quality of a model depends greatly on how well its predictions conform to known chemistry. In addition, interpretable machine learning models can help chemists and materials scientists extract relevant chemical insight from predictive models. A substantial number of works in the literature describe the development and application of interpretable model architectures as well as descriptor or feature importance techniques.

 

The leading interpretable model types, in which feature importance can be obtained trivially, are tree-based models, including bagging and boosting techniques, which allow users to determine which features contribute most to the decision function. For non-tree-based models, which often provide the best predictive performance, a more elaborate approach must be employed to calculate descriptor importance. Within computational chemistry and materials science, transparency and explainability of the employed machine learning models are crucial for establishing trust. This article provides a compact, up-to-date review of current state-of-the-art methodologies for obtaining chemically interpretable machine learning models in computational chemistry and materials science. (Oviedo et al., 2022)(Omidvar et al., 2021)(Allen and Tkatchenko, 2022)(Jia et al., 2023)
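The tree-based case is indeed trivial, as a small synthetic example shows: when only two of six hypothetical descriptors actually drive the property, a random forest's built-in importances recover them (scikit-learn assumed):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Six hypothetical descriptors; only the first two drive the toy property.
X = rng.normal(size=(300, 6))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=300)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
importance = forest.feature_importances_   # impurity-based, sums to 1
ranked = importance.argsort()[::-1]        # most important descriptors first
```

For non-tree models, model-agnostic alternatives such as permutation importance or SHAP values play the analogous role, at greater computational cost.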

13. Quantum Machine Learning for Chemical and Materials Systems

 

Machine learning (ML) is a promising tool to solve complex problems for chemical and materials systems. In recent years, it has made strides in predicting energies, intermolecular interactions, and excited state properties of quantum systems. As the foundational theory of many ML methods shares an intimate connection with the field of quantum mechanics, new opportunities arise when ML approaches are combined with quantum computing, a field that is emerging as a powerful way to solve quantum mechanical problems.

 

The molecular simulation, drug and materials discovery, and chemical engineering industries all profit significantly from the solutions that machine learning can provide. More importantly, quantum property prediction, quantum-based reactive force field development, and materials and molecules discovery will benefit from higher accuracy and computational performance gains. Machine learning intersects with practically every field of engineering and science, confirming its great versatility. When combined with quantum computing, it has the potential to drive a technological revolution. Quantum machine learning may eventually surpass classically computed results, which could have far-reaching implications for scientific analysis and technological innovation. (Jamalpur et al., 2020)(Brunton and Kutz, 2022)(Pan et al., 2022)

 

14. High-Throughput Screening and Accelerated Material Discovery using Machine Learning

 


High-throughput screening for the accelerated discovery of new materials has been one of the most successful domains where machine learning has greatly impacted scientific research. During the new materials discovery phase, high-throughput screening involves the generation of a massive amount of experimental results or theoretical simulations that are then studied to determine key points of interest for further experimentation or designed synthesis. Moreover, using predictive modeling to connect the structure of a composition space with the response property allows for guiding efficient materials discovery. Thus, using automated computational and experimental methods in conjunction with machine learning is a powerful approach for accelerating the discovery of functional materials and devices.

 

Materials discovery is a complex, multistage process, consisting of key parts such as high-throughput screening, materials data analysis, and materials synthesis, with many recent studies exploring the role of machine learning in the discovery process. In this section, we detail a workflow linking the synthesis of combinatorial libraries to property measurements and to high-throughput computational materials science studies.

 

With the advent of automated workflows and recent scientific computing technologies, change is afoot in three critical areas. First, our ability to automate sample preparation and measurement enables the generation of vast amounts of data. Second, advances in computational methods and experimental techniques make it possible to design libraries of complex parametric studies in silico. Finally, improved statistical fitting techniques mean we can now accurately model the property-composition space of real or computational materials and predict into unexplored regions in search of new functional materials. In this section, we review advances of the past couple of years in combining high-throughput screening with the power of computation to design and discover new functional materials by automatic enumeration of the compositional space. Whereas early discovery efforts searched structured databases of known compounds spanning fixed combinations of metal elements, recent software advances allow the complete compositional space to be searched. (Duan et al., 2022)(Huang et al., 2022)(Chen et al., 2024)(Cai et al., 2020)(Rodríguez-Martínez et al., 2021)
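The enumerate-then-rank pattern behind such screening can be sketched in a few lines. The grid, the ternary A-B-C system, and the surrogate function below are all hypothetical; in a real workflow the surrogate would be a trained ML model:

```python
import itertools

def surrogate_property(a, b, c):
    # Stand-in for a trained ML model's prediction of a target property;
    # this toy function simply peaks near the composition A0.5 B0.3 C0.2.
    return -((a - 0.5) ** 2 + (b - 0.3) ** 2 + (c - 0.2) ** 2)

# Enumerate all ternary compositions a + b + c = 1 on a 10% grid.
grid = [x / 10 for x in range(11)]
compositions = [(a, b, round(1 - a - b, 10))
                for a, b in itertools.product(grid, grid) if a + b <= 1]

# Rank the whole composition space by predicted property, best first.
ranked = sorted(compositions,
                key=lambda comp: surrogate_property(*comp), reverse=True)
best = ranked[0]   # composition predicted to be most promising
```

Only the top-ranked candidates would then be passed on to expensive first-principles calculations or to synthesis, which is the source of the acceleration.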

 

15. Ethical Considerations in Machine Learning Applications for Chemistry and Materials Science

 

Ethical concerns rise to the forefront as machine learning sees widespread application in the interdisciplinary research domains of computational chemistry and materials science. Three of the principal ethical challenges are research transparency, experimental bias arising from the datasets being used, and data privacy.

 

First, to ensure that a research article is reproducible, the authors should document their use of machine learning in sufficient detail that others can independently obtain the same results. Second, experimental bias can arise in computational chemistry when high-throughput or data-mining studies inadvertently include only "good" samples, for example a dataset containing only molecules from reactions that went to completion, which hides failed experiments from the model. Third, especially when working with human-subject data or vulnerable populations, researchers must not invade individuals' privacy and should ensure that all protocols comply with applicable regulations before results are analyzed and published. While granting ethical approval ultimately rests with individual institutions, the application of machine learning in chemistry and materials science must be responsible and transparent, with the goal of advancing the state of the art while keeping decision-making and control in the hands of chemists and materials scientists. (Rodrigues et al., 2021) (Hippalgaonkar et al., 2023) (Ninduwezuor-Ehiobu et al., 2023) (Mobarak et al., 2023)
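The dataset-bias point can be demonstrated numerically. The toy data below are entirely synthetic and the yield threshold is an invented stand-in for "reactions that went to completion"; the sketch shows that discarding failed samples systematically distorts a fitted trend.

```python
# Selection bias sketch: least-squares slope on a full synthetic
# dataset versus a "curated" subset that keeps only high-yield samples.
def slope(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

xs = list(range(10))
# True trend has slope 2; deterministic +/-3 "noise" stands in for scatter.
ys = [2 * x + (3 if x % 2 else -3) for x in xs]

full = slope(xs, ys)
# "Curated" set: keep only samples with yield-like response y >= 8.
kept = [(x, y) for x, y in zip(xs, ys) if y >= 8]
biased = slope([x for x, _ in kept], [y for _, y in kept])
print(round(full, 2), round(biased, 2))
```

Truncating on the response variable removes low outcomes preferentially at low x, so the curated fit underestimates the true slope, the same mechanism by which completion-only reaction datasets mislead a model.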

 

16. Future Directions and Emerging Trends in Integrating Machine Learning in Chemistry and Materials Science

 


 

Reflection on developments in the field as a whole, and particularly on those reviewed in this article, gives insight into emerging trends and topics likely to be of interest in the coming years. In this outlook, we highlight what is likely to become increasingly important in the future.

Predictive modeling will see further development in the coming years, especially in connection with ab initio quantum chemistry. Considerable effort is currently directed at mitigating the curse of dimensionality and the high cost of training these models. The goal remains to develop accurate nonlinear models, an approach that has been particularly effective in the area of QSPR.
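As a concrete example of the nonlinear-model approach mentioned above, the sketch below fits a Gaussian kernel ridge regressor to a one-dimensional toy descriptor, using only the standard library. The descriptor, target function, and hyperparameters are illustrative assumptions, not taken from any specific QSPR study.

```python
# Hedged sketch of a nonlinear QSPR-style model: Gaussian kernel ridge
# regression, with a tiny Gaussian-elimination solver for the linear system.
import math

def rbf(a, b, gamma=10.0):
    # Gaussian (RBF) kernel between two scalar descriptors.
    return math.exp(-gamma * (a - b) ** 2)

def solve(A, b):
    # Gaussian elimination with partial pivoting for a small dense system.
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

train_x = [i / 9 for i in range(10)]
train_y = [math.sin(3 * x) for x in train_x]   # a nonlinear toy "property"

lam = 1e-6                                     # ridge regularization
K = [[rbf(a, b) + (lam if i == j else 0.0) for j, b in enumerate(train_x)]
     for i, a in enumerate(train_x)]
alpha = solve(K, train_y)                      # (K + lam*I) alpha = y

def predict(x):
    return sum(a * rbf(x, t) for a, t in zip(alpha, train_x))

print(round(predict(0.5), 3))                  # interpolates sin(3*0.5)
```

The same structure, a kernel matrix over descriptors plus a regularized solve, underlies many production QSPR models; real applications replace the scalar descriptor with high-dimensional molecular representations.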

 

The ultimate aim of much of the work in this area is, like that of computational chemistry itself, to contribute to the discovery of new materials. Expanding the exploratory space through data-driven discovery has already been a popular topic and is likely to remain so. The range of properties that can serve as inputs to these systems will, however, broaden in the future. Traditionally, the chemical integrity of a material has been an important determinant of its desirability, but this has been changing for a number of years: the growing prominence of fields such as nanotechnology and structural biology has drawn attention to properties that were not considered important when earlier models were developed.

 

A final and rather fundamental shift concerns interdisciplinary working. In the early stages of integrating chemistry and machine learning, and particularly computational chemistry and machine learning, the collaboration was essentially a one-way street, with chemists enlisting machine-learning specialists. In recent years, however, a number of machine-learning researchers have joined chemical and materials companies, primarily to develop substantial proof-of-concept systems. This movement into the application areas is likely to increase in the coming years. (Huang and Von Lilienfeld, 2021) (Hermann et al., 2023) (Lam et al., 2020) (Orupattur et al., 2020)

References:


Pugliese, R., Regondi, S., and Marini, R. "Machine learning-based approach: Global trends, research directions, and regulatory standpoints." Data Science and Management, 2021.

Cioffi, Raffaele, et al. "Artificial intelligence and machine learning applications in smart production: Progress, trends, and directions." Sustainability 12.2 (2020): 492.

Xu, Yongjun, et al. "Artificial intelligence: A powerful paradigm for scientific research." The Innovation 2.4 (2021).

Westermayr, Julia, et al. "Perspective on integrating machine learning into computational chemistry and materials science." The Journal of Chemical Physics 154.23 (2021).

Morgan, D. and Jacobs, R. "Opportunities and challenges for machine learning in materials science." Annual Review of Materials Research, 2020.

Hong, Yang, et al. "Machine learning and artificial neural network accelerated computational discoveries in materials science." Wiley Interdisciplinary Reviews: Computational Molecular Science 10.3 (2020): e1450.

Rodrigues, Jose F., et al. "Big data and machine learning for materials science." Discover Materials 1 (2021): 1-27.

Wigh, Daniel S., Jonathan M. Goodman, and Alexei A. Lapkin. "A review of molecular representation in the age of machine learning." Wiley Interdisciplinary Reviews: Computational Molecular Science 12.5 (2022): e1603.

Greener, Joe G., et al. "A guide to machine learning for biologists." Nature Reviews Molecular Cell Biology 23.1 (2022): 40-55.

Van Tilborg, Derek, Alisa Alenicheva, and Francesca Grisoni. "Exposing the limitations of molecular machine learning with activity cliffs." Journal of Chemical Information and Modeling 62.23 (2022): 5938-5951.

Walters, W. P. and Barzilay, R. "Applications of deep learning in molecule generation and molecular property prediction." Accounts of Chemical Research, 2020.

Shimakawa, H., Kumada, A., and Sato, M. "Extrapolative prediction of small-data molecular property using quantum mechanics-assisted machine learning." npj Computational Materials, 2024.

Song, Zilin, et al. "Mechanistic insights into enzyme catalysis from explaining machine-learned quantum mechanical and molecular mechanical minimum energy pathways." ACS Physical Chemistry Au 2.4 (2022): 316-330.

Pinheiro, Gabriel A., et al. "Machine learning prediction of nine molecular properties based on the SMILES representation of the QM9 quantum-chemistry dataset." The Journal of Physical Chemistry A 124.47 (2020): 9854-9866.

Hrivnak, Tomas, et al. "Accurate nonlinear optical properties of solvated para-nitroaniline predicted by an electrostatic discrete local field approach." The Journal of Physical Chemistry B 124.45 (2020): 10195-10209.

Huang, Jiun-Chi, et al. "Predictive modeling of blood pressure during hemodialysis: A comparison of linear model, random forest, support vector regression, XGBoost, LASSO regression and ensemble method." Computer Methods and Programs in Biomedicine 195 (2020): 105536.

Boateng, Ernest Yeboah, Joseph Otoo, and Daniel A. Abaye. "Basic tenets of classification algorithms K-nearest-neighbor, support vector machine, random forest and neural network: A review." Journal of Data Analysis and Information Processing 8.4 (2020): 341-357.

Fan, G. F., Yu, M., Dong, S. Q., Yeh, Y. H., and Hong, W. C. "Forecasting short-term electricity load using hybrid support vector regression with grey catastrophe and random forest modeling." Utilities Policy, 2021.

Lee, H., Wang, J., and Leblon, B. "Using linear regression, random forests, and support vector machine with unmanned aerial vehicle multispectral images to predict canopy nitrogen weight in corn." Remote Sensing, 2020.

Glielmo, Aldo, et al. "Unsupervised learning methods for molecular simulation data." Chemical Reviews 121.16 (2021): 9722-9758.

Oliveri, P., Malegori, C., Mustorgi, E., and Casale, M. "Qualitative pattern recognition in chemistry: Theoretical background and practical guidelines." Microchemical Journal, 2021.

Karthikeyan, A. and Priyakumar, U. D. "Artificial intelligence: machine learning for chemical sciences." Journal of Chemical Sciences, 2022.

Horwood, J. and Noutahi, E. "Molecular design in synthetically accessible chemical space via deep reinforcement learning." ACS Omega, 2020.

Simm, Gregor, Robert Pinsler, and José Miguel Hernández-Lobato. "Reinforcement learning for molecular design guided by quantum mechanics." International Conference on Machine Learning. PMLR, 2020.

Shen, Cynthia, et al. "Deep molecular dreaming: inverse machine learning for de-novo molecular design and interpretability with surjective representations." Machine Learning: Science and Technology 2.3 (2021): 03LT02.

Choudhary, Kamal, et al. "Recent advances and applications of deep learning methods in materials science." npj Computational Materials 8.1 (2022): 59.

Oviedo, Felipe, et al. "Interpretable and explainable machine learning for materials science and chemistry." Accounts of Materials Research 3.6 (2022): 597-607.

Jablonka, K. M., Ongari, D., Moosavi, S. M., and Smit, B. "Big-data science in porous materials: materials genomics and machine learning." Chemical Reviews, 2020.

Reiser, Patrick, et al. "Graph neural networks for materials science and chemistry." Communications Materials 3.1 (2022): 93.

Olivetti, Elsa A., et al. "Data-driven materials research enabled by natural language processing and information extraction." Applied Physics Reviews 7.4 (2020).

Omidvar, Noushin, et al. "Interpretable machine learning of chemical bonding at solid surfaces." The Journal of Physical Chemistry Letters 12.46 (2021): 11476-11487.

Allen, A. E. A. and Tkatchenko, A. "Machine learning of material properties: Predictive and interpretable multilinear models." Science Advances, 2022.

Jia, X., Wang, T., and Zhu, H. "Advancing computational toxicology by interpretable machine learning." Environmental Science & Technology, 2023.

Jamalpur, Bhavana, et al. "Machine learning intersections and challenges in deep learning." IOP Conference Series: Materials Science and Engineering. Vol. 981. No. 2. IOP Publishing, 2020.

Brunton, S. L. and Kutz, J. N. "Data-driven science and engineering: Machine learning, dynamical systems, and control." 2022.

Pan, I., Mason, L. R., and Matar, O. K. "Data-centric Engineering: integrating simulation, machine learning and statistics. Challenges and opportunities." Chemical Engineering Science, 2022.

Duan, Chenru, Aditya Nandy, and Heather J. Kulik. "Machine learning for the discovery, design, and engineering of materials." Annual Review of Chemical and Biomolecular Engineering 13.1 (2022): 405-429.

Huang, E-Wen, et al. "Machine-learning and high-throughput studies for high-entropy materials." Materials Science and Engineering: R: Reports 147 (2022): 100645.

Chen, Chi, et al. "Accelerating Computational Materials Discovery with Machine Learning and Cloud High-Performance Computing: from Large-Scale Screening to Experimental Validation." Journal of the American Chemical Society 146.29 (2024): 20009-20018.

Cai, J., Chu, X., Xu, K., Li, H., and Wei, J. "Machine learning-driven new material discovery." Nanoscale Advances, 2020.

Rodríguez-Martínez, Xabier, Enrique Pascual-San-José, and Mariano Campoy-Quiles. "Accelerating organic solar cell material's discovery: high-throughput screening and big data." Energy & Environmental Science 14.6 (2021): 3301-3322.

Hippalgaonkar, Kedar, et al. "Knowledge-integrated machine learning for materials: lessons from gameplaying and robotics." Nature Reviews Materials 8.4 (2023): 241-260.

Ninduwezuor-Ehiobu, Nwakamma, et al. "Tracing the evolution of AI and machine learning applications in advancing materials discovery and production processes." Engineering Science & Technology Journal 4.3 (2023): 66-83.

Mobarak, Md Hosne, et al. "Scope of machine learning in materials research—A review." Applied Surface Science Advances 18 (2023): 100523.

Huang, B. and Von Lilienfeld, O. A. "Ab initio machine learning in chemical compound space." Chemical Reviews, 2021.

Hermann, Jan, et al. "Ab initio quantum chemistry with neural-network wavefunctions." Nature Reviews Chemistry 7.10 (2023): 692-709.

Lam, Yu-hong, et al. "Applications of quantum chemistry in pharmaceutical process development: Current state and opportunities." Organic Process Research & Development 24.8 (2020): 1496-1507.

Orupattur, N. V., Mushrif, S. H., and Prasad, V. "Catalytic materials and chemistry development using a synergistic combination of machine learning and ab initio methods." Computational Materials Science, 2020.