This is a repost of an article that I recently published on LinkedIn Pulse.

I recently published an article in the Journal of Chemical Physics, entitled *Tree based machine learning framework for predicting ground state energies of molecules* (link to article and preprint). The article discusses in detail the application of machine learning algorithms to the prediction of ground state energies of molecules.

The current standard for computationally efficient electronic structure simulations is undoubtedly Density Functional Theory (DFT). DFT offers efficient calculations of the electronic structure of many-electron systems (molecules and solids) by working with the electron density instead of wavefunctions. The idea is based on the theorems of Hohenberg and Kohn, which state that the ground state energy of a many-electron system is a functional of the electron density. While the theory is exact in its formulation, the form of this universal functional is unknown, and approximations are used in practical computations. Nevertheless, decades of research have improved the accuracy of these approximate functionals, and DFT has become the workhorse of modern electronic structure theory.

Although DFT provides a computationally efficient method for electronic structure predictions, computational design and discovery of new materials require a vast number of simulations in order to screen the chemical compound space. Given limited computational resources, such studies are very time consuming and often prohibitively expensive. It is therefore crucial to find methods that predict the electronic structure of materials without performing costly simulations. Machine learning is well suited to this task, since learning algorithms can be trained on an existing database of electronic structure calculations, and predictions for new materials can then be made at negligible cost. With the growing availability of large molecular and materials databases, this approach has become feasible.

In the article, I used the PubChem database to construct a dataset of electronic structure calculations for 16,242 molecules made up of C, H, N, O, P and S (CHNOPS). To construct features (a.k.a. descriptors) for each molecule, I followed the approach of Rupp et al. and used Coulomb matrices. For each molecule, the Coulomb matrix is defined as

$$
C_{ij} =
\begin{cases}
0.5\, Z_i^{2.4}, & i = j \\
\dfrac{Z_i Z_j}{|\mathbf{R}_i - \mathbf{R}_j|}, & i \neq j
\end{cases}
$$

where the Z's are atomic numbers and the **R**'s are atomic positions. I set the maximum number of atoms in the analysis to 50, so molecules with fewer than 50 atoms have their Coulomb matrices padded with rows and columns of zeros to make them 50×50. These matrices (more precisely, the upper triangular part unrolled into a vector) or their eigenvalues were used to construct a design matrix, which in turn was used to train machine learning algorithms to predict the atomization energies of the molecules in the dataset (calculated using DFT with the Quantum Espresso package). The atomization energy is a complex nonlinear function of these features, and training the learning algorithm provides an accurate fit to this function. For example, the figure below depicts the dependence of the atomization energies in the dataset on the first two principal components of the design matrix:

This peculiar nonlinear dependence cannot be captured by simple linear models, so learning algorithms such as neural networks and boosted regression trees are a natural choice for the task. In the article, I argue that the boosted regression tree algorithm (as implemented in the XGBoost package) is preferred due to its computational efficiency and high accuracy.
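
The Coulomb matrix features described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the code from the article: it assumes the standard Rupp et al. convention (diagonal 0.5 Z_i^2.4, off-diagonal Z_i Z_j / |R_i - R_j|, atomic units), includes the diagonal in the unrolled upper triangle, and sorts eigenvalues by decreasing magnitude. The function names and the H2 geometry are hypothetical.

```python
import numpy as np

MAX_ATOMS = 50  # padding size quoted in the text


def coulomb_matrix(Z, R, size=MAX_ATOMS):
    """Zero-padded Coulomb matrix (assumed Rupp et al. convention)."""
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    if n > size:
        raise ValueError("molecule has more atoms than the padded size")
    # Pairwise distances |R_i - R_j|
    dist = np.linalg.norm(R[:, None, :] - R[None, :, :], axis=-1)
    np.fill_diagonal(dist, 1.0)  # avoid 0/0; the diagonal is overwritten below
    C = np.outer(Z, Z) / dist
    np.fill_diagonal(C, 0.5 * Z ** 2.4)
    padded = np.zeros((size, size))
    padded[:n, :n] = C
    return padded


def upper_triangle_features(C):
    """Unroll the upper triangle (diagonal included, an assumption) into a vector."""
    return C[np.triu_indices_from(C)]


def eigenvalue_features(C):
    """Eigenvalues of the symmetric matrix, sorted by decreasing magnitude."""
    w = np.linalg.eigvalsh(C)
    return w[np.argsort(-np.abs(w))]


# Hypothetical example: an H2-like molecule with a 1.4 bohr bond length
C_h2 = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.0, 0.0, 1.4]])
x_triu = upper_triangle_features(C_h2)  # one row of the design matrix
x_eig = eigenvalue_features(C_h2)       # alternative, shorter feature vector
```

Stacking one such vector per molecule gives the design matrix. The eigenvalue representation is much shorter (50 entries instead of 1275 for a 50×50 matrix) and does not depend on the ordering of the atoms, at the cost of discarding some structural information.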

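To give a feel for why boosted regression trees handle nonlinear targets that a linear model misses, here is a toy sketch of gradient boosting with depth-one trees (stumps) under squared loss, written in plain NumPy. It is a deliberately simplified stand-in and not XGBoost itself, which adds regularization, deeper trees and many optimizations. The synthetic data, function names and hyperparameters are all assumptions chosen for illustration.

```python
import numpy as np


def fit_stump(X, r):
    """Best single-split regression stump for residuals r (squared error)."""
    best = (np.inf, 0, 0.0, r.mean(), r.mean())
    for j in range(X.shape[1]):
        # Candidate thresholds: midpoints between sorted unique feature values
        values = np.unique(X[:, j])
        for t in (values[:-1] + values[1:]) / 2.0:
            left = X[:, j] <= t
            lv, rv = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lv) ** 2).sum() + ((r[~left] - rv) ** 2).sum()
            if sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]


def predict_stump(stump, X):
    j, t, lv, rv = stump
    return np.where(X[:, j] <= t, lv, rv)


def fit_boosting(X, y, n_rounds=100, learning_rate=0.1):
    """Each round fits a stump to the current residuals and adds a damped copy."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(X, y - pred)
        pred += learning_rate * predict_stump(stump, X)
        stumps.append(stump)
    return base, stumps, learning_rate


def predict_boosting(model, X):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, X) for s in stumps)


# Synthetic nonlinear target (hypothetical, not the molecular dataset)
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1] ** 2
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]

# Linear least-squares baseline with an intercept column
A_train = np.column_stack([np.ones(len(X_train)), X_train])
A_test = np.column_stack([np.ones(len(X_test)), X_test])
coef = np.linalg.lstsq(A_train, y_train, rcond=None)[0]
mse_linear = np.mean((A_test @ coef - y_test) ** 2)

model = fit_boosting(X_train, y_train)
mse_boost = np.mean((predict_boosting(model, X_test) - y_test) ** 2)
print(f"linear MSE: {mse_linear:.4f}, boosted stumps MSE: {mse_boost:.4f}")
```

On data like this, the boosted model should reach a clearly lower held-out error than the linear baseline, since each new stump corrects whatever structure the previous ones left in the residuals. For real work, a library implementation such as XGBoost is the sensible choice.
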
The Python and R scripts, as well as the data used for the article, are available from the RoboBohr GitHub repository. A demonstration of data visualization and model training is also available from this link. I am actively working on this project, and new extensions and further data analyses will be available in the future, so stay tuned!

To conclude, I believe that by making it possible to predict electronic properties without running a new simulation for each molecule or material, machine learning techniques open up exciting pathways for the rational design of new compounds. Combined with numerous efforts to catalog and standardize datasets (such as Materials Project and AFLOW), these methods will be invaluable for many scientific and technological applications.