{ "cells": [ { "cell_type": "markdown", "id": "6370a49e", "metadata": {}, "source": [ "[Open in Colab](https://colab.research.google.com/github/ddmms/camml-tutorials/blob/main/notebooks/01-intro/tutorial.ipynb)" ] }, { "cell_type": "markdown", "id": "ca147b2d", "metadata": {}, "source": [ "# Tutorial on Basics of Machine Learning in Computational Molecular Science" ] }, { "cell_type": "markdown", "id": "2de94b30", "metadata": {}, "source": [ "## Prelude\n", "\n", "\n", "This tutorial is a basic introduction to machine learning (ML) workflows in the context of molecular simulation and computational materials research. The notebook introduces workflows and concepts related to\n", "\n", "- Data preparation and analysis\n", "- Generating chemical representations and features\n", "- Dimensionality Reduction\n", "- Clustering\n", "- Kernel-based model fitting\n", "- Hyperparameter Optimization\n", "- Uncertainty Quantification\n", "\n", "The required dependencies include\n", "- [Atomic Simulation Environment (ase)](https://wiki.fysik.dtu.dk/ase/): library we will use to store molecular structures and properties\n", "- [scikit-learn](https://scikit-learn.org/stable/): machine learning library\n", "- [dscribe](https://singroup.github.io/dscribe/stable/): library to generate molecular representations (descriptors)\n", "- [openTSNE](https://opentsne.readthedocs.io/en/stable/): library for t-SNE dimensionality reduction\n", "\n", "As you work through the notebook and its tasks, consult the documentation pages of these packages if you get stuck.\n", "\n", "We will be working on two datasets of molecules, one that explores configurational space (data from molecular dynamics), and one that explores composition space (a dataset of different molecules):\n", "1. Dataset of five short molecular dynamics trajectories of cyclohexane, courtesy of \"Quantum Chemistry in the Age of Machine Learning\", edited by Pavlo Dral (2022)\n", "2. 
The QM7 dataset of organic molecules with up to 7 non-hydrogen atoms, courtesy of Rupp et al., Phys. Rev. Lett. 108, 058301 (2012) and qmlcode.org\n", "\n", "Both datasets feature structures and energies.\n", "\n", "\n", "Reinhard Maurer, University of Warwick (2025)" ] }, { "cell_type": "code", "execution_count": null, "id": "78e3c6c3", "metadata": {}, "outputs": [], "source": [ "# only needed if you run this notebook in Google Colab\n", "\n", "# import locale\n", "# locale.getpreferredencoding = lambda: \"UTF-8\"\n", "\n", "# ! pip install ase scikit-learn dscribe opentsne data-tutorials\n", "# get_ipython().kernel.do_shutdown(restart=True)\n" ] }, { "cell_type": "code", "execution_count": null, "id": "7c2e781b", "metadata": {}, "outputs": [], "source": [ "# get the data \n", "\n" ] }, { "cell_type": "code", "execution_count": null, "id": "276949ab", "metadata": {}, "outputs": [], "source": [ "#basic stuff\n", "import os\n", "import sys\n", "from functools import partial\n", "import numpy as np\n", "import ase\n", "import random\n", "from ase.io import read, write\n", "from ase.visualize import view\n", "from matplotlib import pyplot as plt\n", "import matplotlib as mpl\n", "import pandas\n", "import seaborn as sns\n", "from ase.build import molecule\n", "from weas_widget import WeasWidget\n", "\n", "#ML stuff\n", "import dscribe\n", "import sklearn\n", "from sklearn.metrics.pairwise import pairwise_kernels\n", "from sklearn.model_selection import train_test_split\n", "from tqdm.auto import tqdm # progress bars for loops\n", "\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "id": "c0d9bd4d", "metadata": {}, "source": [ "## Part 1: Data Preparation and Analysis" ] }, { "cell_type": "markdown", "id": "13f16b8b", "metadata": {}, "source": [ "Let's start with a dataset of five short MD simulations of different conformers of cyclohexane.\n", "\n", "We start five independent simulations,\n", "each initialized in one of the known cyclohexane conformers shown in the figure below. 
The MD simulations were run for 10,000 time\n", "steps each. The data contains the atom positions, velocities, energies and forces. These types of MD simulations will explore the energy landscape and settle within\n", "local and/or global minima.\n", "\n", "" ] }, { "cell_type": "markdown", "id": "14247a58", "metadata": {}, "source": [ "### Read and analyse Molecular Dynamics Data" ] }, { "cell_type": "code", "execution_count": null, "id": "6e10ccd5", "metadata": {}, "outputs": [], "source": [ "# read in the frames from each MD simulation\n", "traj = []\n", "names = ['chair', 'twist-boat', 'boat', 'half-chair', 'planar']\n", "rgb_colors = [(0.13333333333333333, 0.47058823529411764, 0.7098039215686275),\n", "              (0.4588235294117647, 0.7568627450980392, 0.34901960784313724),\n", "              (0.803921568627451, 0.6078431372549019, 0.16862745098039217),\n", "              (0.803921568627451, 0.13725490196078433, 0.15294117647058825),\n", "              (0.4392156862745098, 0.2784313725490196, 0.611764705882353),]\n", "\n", "ranges = np.zeros((len(names), 2), dtype=int)\n", "conf_idx = np.zeros(len(names), dtype=int)\n", "\n", "for i, n in enumerate(names):\n", "    frames = read(f'./cyclohexane_data/MD/{n}.xyz', '::')\n", "\n", "    for frame in frames:\n", "        # wrap each frame in its box\n", "        frame.wrap(eps=1E-10)\n", "\n", "        # mask each frame so that descriptors are only centered on carbon (#6) atoms\n", "        mask = np.zeros(len(frame))\n", "        mask[np.where(frame.numbers == 6)[0]] = 1\n", "        frame.arrays['center_atoms_mask'] = mask\n", "\n", "    ranges[i] = (len(traj), len(traj) + len(frames)) #list of data ranges\n", "    conf_idx[i] = len(traj) # handy list to indicate the index of the first frame for each trajectory\n", "    traj = [*traj, *frames] # full list of frames, 50000 entries" ] }, { "cell_type": "markdown", "id": "3450f70f", "metadata": {}, "source": [ "After this call, all the MD frames of the 5 different runs sit in the object `traj`. 
\n", "`traj` is a list of ASE `Atoms` objects, which contain the structure definition, positions, velocities, energies etc.\n", "\n", "If you are unfamiliar with ASE, please explore the online documentation and experiment with it in the following cells.\n", "\n", "As you can see, we have 50,000 configurations, each labeled with an energy in electronvolts (eV)." ] }, { "cell_type": "markdown", "id": "fa829141", "metadata": {}, "source": [ "**Task for you**\n", "