RoboBohr – A machine learning tool for molecular discovery

Synopsis

RoboBohr is a machine learning tool for predicting the electronic structure of molecules. It operates on data collected from the PubChem database and constructs a feature vector to describe each molecule. These feature vectors can then be fed into machine learning algorithms to predict atomization energies.

Operation

RoboBohr currently has four modes of operation:

  • query: Reads input SDF files and creates a list of objects, one per entry in the input, each holding the atom types and coordinates. This list is then used to create input files for the pwscf code of the Quantum ESPRESSO package.
  • createFeatures: Generates a design matrix from the list of objects obtained in the query step and saves it to a file.
  • cluster: Creates job submission files for running the pwscf input files in a high-performance computing (HPC) environment. The Torque and Slurm scheduling systems are supported.
  • outcomes: Analyzes the output files generated by the pwscf runs, stores the relevant outcome quantities (e.g. ground state energies), and creates a log file.
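
To make the query step more concrete, here is a minimal Python sketch of reading SDF records into objects that hold atom types and coordinates. It is an illustration only, not RoboBohr's own code: the names Molecule and parse_sdf are hypothetical, the sample geometry is illustrative, and the parser assumes V2000-style records terminated by a "$$$$" line with whitespace-separated atom fields.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Molecule:
    """Hypothetical container for one SDF entry (not RoboBohr's own class)."""
    name: str
    symbols: List[str]
    coords: List[Tuple[float, float, float]]


def _parse_record(lines: List[str]) -> Molecule:
    # V2000 layout: three header lines, then the counts line,
    # whose first three characters hold the number of atoms.
    n_atoms = int(lines[3][0:3])
    symbols, coords = [], []
    for line in lines[4:4 + n_atoms]:
        # Assumption: x, y, z and the element symbol are whitespace-separated.
        fields = line.split()
        coords.append((float(fields[0]), float(fields[1]), float(fields[2])))
        symbols.append(fields[3])
    return Molecule(name=lines[0].strip(), symbols=symbols, coords=coords)


def parse_sdf(text: str) -> List[Molecule]:
    """Split SDF text on '$$$$' terminator lines and parse each record."""
    molecules, record = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                molecules.append(_parse_record(record))
            record = []
        else:
            record.append(line)
    # Lines after the last '$$$$' (an unterminated record) are ignored.
    return molecules


# Illustrative single-entry input (a water molecule, made-up geometry).
SAMPLE_SDF = "\n".join([
    "water",
    "  hypothetical-example",
    "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.1173 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000    0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "    0.0000   -0.7572   -0.4692 H   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "  1  3  1  0  0  0  0",
    "M  END",
    "$$$$",
])

molecules = parse_sdf(SAMPLE_SDF)
```

Each Molecule in the resulting list carries what a later step would need to write a pwscf input file or build a row of the design matrix.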

Utilities are also provided to construct raw data in JSON format, which allows feature engineering beyond what RoboBohr provides by default.
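
As a sketch of what engineering features from the JSON data could look like, the Python example below turns one record into a fixed-length feature vector based on the Coulomb matrix, a widely used molecular descriptor. Everything here is an assumption made for illustration: the record layout ("id", "atoms", "type", "xyz"), the function names, and the choice of descriptor are hypothetical and may differ from the actual RoboBohr JSON schema and default features.

```python
import json
import math

# Atomic numbers for a few light elements (extend as needed).
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "F": 9}


def coulomb_matrix(symbols, coords):
    """Coulomb matrix: 0.5 * Z_i**2.4 on the diagonal,
    Z_i * Z_j / |R_i - R_j| off the diagonal."""
    n = len(symbols)
    z = [ATOMIC_NUMBER[s] for s in symbols]
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 0.5 * z[i] ** 2.4
            else:
                matrix[i][j] = z[i] * z[j] / math.dist(coords[i], coords[j])
    return matrix


def feature_vector(record, max_atoms):
    """Flatten the upper triangle of the Coulomb matrix, zero-padded to
    max_atoms so that every molecule yields a vector of the same length."""
    symbols = [atom["type"] for atom in record["atoms"]]
    coords = [atom["xyz"] for atom in record["atoms"]]
    cm = coulomb_matrix(symbols, coords)
    n = len(symbols)
    features = []
    for i in range(max_atoms):
        for j in range(i, max_atoms):
            features.append(cm[i][j] if i < n and j < n else 0.0)
    return features


# Hypothetical JSON input: a list of records, here a single H2 molecule.
raw = '[{"id": "h2", "atoms": [{"type": "H", "xyz": [0.0, 0.0, 0.0]},' \
      ' {"type": "H", "xyz": [0.0, 0.0, 1.4]}]}]'
records = json.loads(raw)

# One row per molecule gives a design matrix.
design_matrix = [feature_vector(r, max_atoms=3) for r in records]
```

The zero-padding keeps all rows the same length, which most learning algorithms require. Note that this plain Coulomb matrix depends on the order of the atoms, so a practical variant would sort the rows or use the eigenvalues instead.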

Articles

  • Preprint of the published paper
  • A minimally technical blog post about RoboBohr
  • A notebook (in R) that illustrates the use of the data generated by RoboBohr to train models
  • Two Kaggle datasets:
    • Data used in the original article with pre-engineered features
    • Data in JSON format containing more molecules and allowing users to engineer their own features

 
