Machine learning for alchemy

It is no news that applications of machine learning span a vast range of fields, from artificial intelligence to the social sciences. One application I am especially excited about is discovering and designing new compounds. The list of potential consequences is long, including better methods for drug discovery and the computational design of compounds (digital alchemy!).

In a previous post, I discussed my own work in this area and shared my GitHub repository, which contains the code and workflow for predicting ground-state energies of molecules. I have recently published a data set on Kaggle containing the molecular structures of tens of thousands of compounds. The data is collected from the PubChem repository and currently covers only a subset of the available compounds. More data will be added over time as I collect and process it.

The new data set contains files in a format that allows data scientists to engineer their own features rather than stick to pre-engineered ones. To make this possible, I have compiled the data in JSON format. For example, the data entry for a given molecule looks like this:

{'En': 37.801,
 'atoms': [{'type': 'O', 'xyz': [0.3387, 0.9262, 0.46]},
           {'type': 'O', 'xyz': [3.4786, -1.7069, -0.3119]},
           {'type': 'O', 'xyz': [1.8428, -1.4073, 1.2523]},
           {'type': 'O', 'xyz': [0.4166, 2.5213, -1.2091]},
           {'type': 'N', 'xyz': [-2.2359, -0.7251, 0.027]},
           {'type': 'C', 'xyz': [-0.7783, -1.1579, 0.0914]},
           {'type': 'C', 'xyz': [0.1368, -0.0961, -0.5161]},
           {'type': 'C', 'xyz': [-3.1119, -1.7972, 0.659]},
           {'type': 'C', 'xyz': [-2.4103, 0.5837, 0.784]},
           {'type': 'C', 'xyz': [-2.6433, -0.5289, -1.426]},
           {'type': 'C', 'xyz': [1.4879, -0.6438, -0.9795]},
           {'type': 'C', 'xyz': [2.3478, -1.3163, 0.1002]},
           {'type': 'C', 'xyz': [0.4627, 2.1935, -0.0312]},
           {'type': 'C', 'xyz': [0.6678, 3.1549, 1.1001]},
           {'type': 'H', 'xyz': [-0.7073, -2.1051, -0.4563]},
           {'type': 'H', 'xyz': [-0.5669, -1.3392, 1.1503]},
           {'type': 'H', 'xyz': [-0.3089, 0.3239, -1.4193]},
           {'type': 'H', 'xyz': [-2.9705, -2.7295, 0.1044]},
           {'type': 'H', 'xyz': [-2.8083, -1.921, 1.7028]},
           {'type': 'H', 'xyz': [-4.1563, -1.4762, 0.6031]},
           {'type': 'H', 'xyz': [-2.0398, 1.417, 0.1863]},
           {'type': 'H', 'xyz': [-3.4837, 0.7378, 0.9384]},
           {'type': 'H', 'xyz': [-1.9129, 0.5071, 1.7551]},
           {'type': 'H', 'xyz': [-2.245, 0.4089, -1.819]},
           {'type': 'H', 'xyz': [-2.3, -1.3879, -2.01]},
           {'type': 'H', 'xyz': [-3.7365, -0.4723, -1.463]},
           {'type': 'H', 'xyz': [1.3299, -1.3744, -1.7823]},
           {'type': 'H', 'xyz': [2.09, 0.1756, -1.3923]},
           {'type': 'H', 'xyz': [-0.1953, 3.128, 1.7699]},
           {'type': 'H', 'xyz': [0.7681, 4.1684, 0.7012]},
           {'type': 'H', 'xyz': [1.5832, 2.901, 1.6404]}],
 'id': 1,
 'shapeM': [259.66, 4.28, 3.04, 1.21, 1.75, 2.55,
            0.16, -3.13, -0.22, -2.18, -0.56, 0.21, 0.17, 0.09]}

The description of each field is as follows:

  1. En: The molecular energy calculated using a force-field method. This is the target variable to be predicted.
  2. atoms: The element symbol and position (x, y, z coordinates) of each atom. This is the raw input for feature engineering.
  3. id: The PubChem ID of the compound.
  4. shapeM: The so-called “shape multipoles” (a feature that describes the shape of the molecule), which can also be used for feature engineering.
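
To make the field layout concrete, here is a minimal Python sketch that parses one record and separates the target from the raw inputs. It assumes each record is a JSON object with exactly the four fields listed above. The helper name `split_entry` is hypothetical, and the record is shortened to three of the atoms from the example entry to keep the snippet small.

```python
import json
from collections import Counter

# A shortened record in the format described above: only three atoms of the
# example entry are kept. Strict JSON needs double quotes for json.loads.
raw = """
{"En": 37.801,
 "atoms": [{"type": "O", "xyz": [0.3387, 0.9262, 0.46]},
           {"type": "N", "xyz": [-2.2359, -0.7251, 0.027]},
           {"type": "C", "xyz": [-0.7783, -1.1579, 0.0914]}],
 "id": 1,
 "shapeM": [259.66, 4.28, 3.04, 1.21, 1.75, 2.55,
            0.16, -3.13, -0.22, -2.18, -0.56, 0.21, 0.17, 0.09]}
"""


def split_entry(entry):
    """Split one molecule record into target, element symbols, and coordinates."""
    target = entry["En"]                                     # value to predict
    symbols = [atom["type"] for atom in entry["atoms"]]      # e.g. "O", "C"
    coords = [tuple(atom["xyz"]) for atom in entry["atoms"]] # (x, y, z) tuples
    return target, symbols, coords


entry = json.loads(raw)
target, symbols, coords = split_entry(entry)
element_counts = Counter(symbols)  # how many atoms of each element

print("PubChem ID:", entry["id"])
print("Target energy:", target)
print("Element counts:", dict(element_counts))
print("Shape multipoles:", len(entry["shapeM"]), "values")
```

Everything a model sees beyond `En` has to be derived from `symbols`, `coords`, and `shapeM`, which is where the feature engineering comes in.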

The data contains molecules with different numbers and types of atoms, so it is challenging to come up with features that describe every molecule in a unique way. I have included a simple kernel on Kaggle as a starter, which constructs basic features to be fed into a machine learning algorithm. Keep in mind, however, that a much better set of features can be engineered.
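
One way to handle the varying molecule size is to map every molecule to a vector of fixed length. The sketch below is an illustration only, not the starter kernel: it counts atoms per element over an assumed vocabulary (`ELEMENTS`) and appends the minimum, mean, and maximum interatomic distance. The function name `molecule_features` and the toy geometry are made up for the example. These features are deliberately crude, and different molecules can easily share the same vector, which is exactly the uniqueness problem described above.

```python
import math
from itertools import combinations

# Assumed element vocabulary (illustrative); the real data may need more.
ELEMENTS = ["H", "C", "N", "O"]


def molecule_features(atoms):
    """Return a fixed-length feature vector for a list of atom records.

    Layout: one count per entry of ELEMENTS, followed by the minimum,
    mean, and maximum pairwise distance between atoms.
    """
    counts = [sum(1 for atom in atoms if atom["type"] == el) for el in ELEMENTS]
    distances = [
        math.dist(a["xyz"], b["xyz"])  # Euclidean distance, Python 3.8+
        for a, b in combinations(atoms, 2)
    ]
    if distances:
        stats = [min(distances), sum(distances) / len(distances), max(distances)]
    else:
        stats = [0.0, 0.0, 0.0]  # single-atom or empty input has no pairs
    return counts + stats


# Toy two-atom geometry (made up, not taken from the data set).
toy_atoms = [
    {"type": "O", "xyz": [0.0, 0.0, 0.0]},
    {"type": "H", "xyz": [3.0, 4.0, 0.0]},
]
features = molecule_features(toy_atoms)
print(features)
```

Because the vector length depends only on `ELEMENTS`, molecules with 3 atoms and molecules with 30 atoms end up in the same feature space and can be passed to any standard regressor.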

I invite data scientists and computational physicists/chemists to check out this data set and think about engineering features that predict the energies of these molecules as accurately as possible!
