MACE in Practice I

MACE in Practice I#

### only if you are in google colab

# import locale
# locale.getpreferredencoding = lambda: "UTF-8"

# ! pip uninstall torch torchaudio torchvision numpy -y
# ! uv pip install mace-torch data-tutorials torch==2.5.1 --system
# get_ipython().kernel.do_shutdown(restart=True)

# !pip install tblite  rdkit
# !pip install git+https://github.com/imagdau/aseMolec@main

from data_tutorials.data import get_data

get_data(
    url="https://gitlab.com/cam-ml/tutorials/-/raw/main/notebooks/day-4/data/",
    filename=["solvent_configs.xyz",  "solvent_liquid.xyz",  "solvent_molecs.xyz", "solvent_xtb.xyz"],
    folder="data",
)

In this tutorial, you will learn how to fit and test a MACE model (Message Passing Neural Network), which is a highly accurate and efficient MLIP (Machine Learnning Interatomic Potential). The training/testing techniques we show here, however, are broadly applicable to all MLIPs. You can independently learn about MACE by studying the original method paper. MACE was developed by unifying the Atomic Cluster Expansion (ACE) approach with the Neural Equivariant Interatomic Potentials (NequIP). The mathematical formalism which unifies these methods is explained in the accompaning paper. Another useful reference showcases the method’s performance on published benchmark datasets. The code implementation is publically available and here you can find the documentation.

Learning Objectives for today:#

Understanding the data: diverse configs, reference labels
Understanding MACE parameters: architecture and training
Fitting and testing MACE models
Ultimate goal: stable and accurate Molecular Dynamics

1. Understanding the data#

1.1 Diverse Molecular Conformations#

Understanding the data is a crucial part of fitting an MLIP. Most models will underperform at first, often because of insufficiently representative data. In this application, we will develop an MLIP for molecular liquids of carbonates. The data presented here is a subset from this work and comprises a mixture of 6 different types of molecules: cyclic carbonates (Vinylene carbonate VC, Ethylene carbonate EC, Propylene carbonate PC) and linear carbonates (Dimethyl carbonate DMC, Ethyl Methyl Carbonate EMC, Diethyl carbonate DEC). Mixtures of these molecules in various formulations are used as solvents in Li-ion battery electrolytes.

from rdkit import Chem
from rdkit.Chem import Draw

# choose your device here.
#device = 'cpu'
device =  'cuda' 

# SMILES strings for each molecule
sm_dict = {
    'VC': 'c1coc(=O)o1',
    'EC': 'C1COC(=O)O1',
    'PC': 'CC1COC(=O)O1',
    'DMC': 'COC(=O)OC',
    'EMC': 'CCOC(=O)OC',
    'DEC': 'CCOC(=O)OCC'
}

Draw.MolsToGridImage([Chem.MolFromSmiles(sm_dict[mol]) for mol in sm_dict], legends=list(sm_dict.keys()))

For this tutorial, we prepared in advance a collection of atomic configurations (small subset from this paper). Let’s understand the data! We start by loading the raw configurations with no labels (energy, forces). The atomic configurationsare stored in the extxyz format and can be accessed using ASE as shown below:

from ase.io import read, write
from weas_widget import WeasWidget

import numpy as np

db = read('data/solvent_configs.xyz', ':') #read in list of configs

print("Number of configs in database: ", len(db))
print("Number of atoms in each config: ", np.array([len(at) for at in db]))
print("Number of atoms in the smallest config: ", np.min([len(at) for at in db])) #test if database contains isolated atoms
print("Information stored in config.info: \n", db[10].info) #check info
print("Information stored in config.arrays: \n", db[10].arrays)

We can visualize the databse using the WEAS pyhon library, or locally we can use a standalone software like ovito to view the *.xyz files directly. You can launch the Desktop application for ovito from the Launcher and navigate to your working directory in the terminal, then simply run ovito <filename>.xyz.

import ipywidgets as widgets
from weas_widget import WeasWidget
from IPython.display import display, HTML


# Function to display a molecule
def display_molecule(atoms):
    viewer = WeasWidget()
    viewer.from_ase(atoms)
    display(viewer)

viewers = []
for atoms in db[:10]: #display the first 10 configurations
    viewer = WeasWidget()
    viewer.from_ase(atoms)
    viewer.avr.model_style = 0
    viewer.avr.show_hydrogen_bonds = False
    viewers.append(viewer)

left_column = widgets.VBox(viewers[::2])  # Even-indexed viewers
right_column = widgets.VBox(viewers[1::2])  # Odd-indexed viewers

display(widgets.HBox([left_column, right_column]))

At this point, each configuration is a collection of atoms: atomic number (Z) and positions (R), with no additional information. Let’s identify the molecules and label molecular clusters. This will make it easier to inspect the data set and, later, test the accuracy of the potential on describing inter-molecular interactions. Molecule identification is achieved using the wrap_molecs function from the aseMolec package, shown here for the first 100 frames db[:100].

from aseMolec import anaAtoms as aa

aa.wrap_molecs(db[:100], prog=False) #identify molecules and label molecular clusters, showcase: first 100 frames
# write('data/solvent_molecs.xyz', db) #save full result
print("Information stored in config.info: \n", db[10].info)
print("Information stored in config.arrays: \n", db[10].arrays)

Note the additional information for each atomic config: number of molecules Nmols, molecular composition Comp (e.g DEC(1):EC(1) means the config comprises a dimer with 1 DEC molecule and 1 EC molecule) and molecular ID molID. Running the code for the full 5000 configurations can be slow, let’s just load the final result (data/solvent_molecs.xyz) and inspect the distribution of configs by number of molecules:

from matplotlib import pyplot as plt

db = read('data/solvent_molecs.xyz', ':')
Nmols = np.array([at.info['Nmols'] for at in db]) #collect Nmols information across all data
plt.hist(Nmols, align='left', bins=[1,2,3,4,5,6,7], rwidth=0.8);
plt.xlabel('# Molecs');
plt.ylabel('# Configs comprising that # Molecs');

There are just under 1000 configs comprising of single molecules and more than 2000 dimers. The largest configs contain clusters of six molecules.

We can check the distribution of molecular compositions for each cluster size:

from aseMolec import extAtoms as ea
from collections import Counter

comp_dict = {} #create a dictionary of compositions for each cluster size
for Nmol in range(1,7):
    comp_dict[Nmol] = dict(Counter([at.info['Comp'] for at in ea.sel_by_info_val(db, 'Nmols', Nmol)]))

Nmol = 6 #show distribution of compositions for cluster size 6
plt.pie(comp_dict[Nmol].values(),
        labels=comp_dict[Nmol].keys(),
        explode=10/(25+np.array(list(comp_dict[Nmol].values()))),
        rotatelabels =True);

Task for you

The training set is quite diverse and it contains a good mix of compositions. Check the distribution for other cluster sizes: Nmol = 1, 2, 3, 4, 5. Find out if all isolated molecules are present and well sampled. We have six molecules, so there should be 6x7/2 dimers present, are all dimers sampled?

1.2 Labeling Data with XTB Values#

We convinced ourselves the training set is quite diverse, it samples many compositions and molecular cluster sizes. It is time to prepare the reference data (energies, forces) to train the model on. We will do this using the Semiempirical Tight Binding level of theory with XTB. This may be less accurate than other methods specialized for these systems, but it is fast and it will later allow us to test MLIP errors on-the-fly.

Notice the data set contains isolated molecules but no isolated atoms. MACE (and other MLIPs) fit to atomization energies (eV) which is total energy minus the energy of each atom in vacuum $(E^{0})$: $$ E^{\rm atm} = E^{\rm tot}-\sum_i^{N} E^{0} $$

In our specific example, all molecules comprise three chemical elements and we will need to compute $(E^{0})$ for each of them:

\[ E^{\rm atm} = E^{\rm tot}-\sum_i^{N_H} E^{H}_i-\sum_i^{N_C} E^{C}_i-\sum_i^{N_O} E^{O}_i \]

Let us add three frames containing Hydrogen H, Carbon C and Oxygen O to the dataset and label them as config_type=IsolatedAtom

from ase import Atoms

db = read('data/solvent_molecs.xyz', ':')
db = [Atoms('H'), Atoms('C'), Atoms('O')]+db #add isolated atoms to the database

for at in db[:3]:
    at.info['config_type'] = 'IsolatedAtom'

print("Number of configs in database: ", len(db))

We are now ready to compute the energy and forces with XTB:

from tqdm import tqdm
from tblite.ase import TBLite
xtb_calc = TBLite(method="GFN2-xTB",verbosity=0)

for at in tqdm(db[:15]): #showcase: first 15 frames
    at.calc = xtb_calc
    at.info['energy_xtb'] = at.get_potential_energy()
    at.arrays['forces_xtb'] = at.get_forces()
# write('data/solvent_xtb.xyz', db) #save full result

print("Information stored in config.info: \n", db[13].info) #check info
print("Information stored in config.arrays: \n", db[13].arrays)

The updated data contains one energy value for each config energy_xtb and the forces_xtb on each atom. Latest version of ASE does not support simple names such as energy and forces so we append _xtb. The entire computation takes about 25 mins for the 5003 configs. We have precomputed the data, so we can simply load the final result. Let’s check the $E^0$ values and atomization energies:

db = read('data/solvent_xtb.xyz', ':15')

print("E0s: \n", ea.get_E0(db, tag='_xtb'))
print("Total energy per config: \n", ea.get_prop(db, 'info', 'energy_xtb', peratom=False)[13])
print("Toal energy per atom: \n", ea.get_prop(db, 'info', 'energy_xtb', peratom=True)[13])
print("Atomization energy per config: \n", ea.get_prop(db, 'bind', prop='_xtb', peratom=False)[13])
print("Atomization energy per atom: \n", ea.get_prop(db, 'bind', prop='_xtb', peratom=True)[13])

Good! We get about -6 $\rm eV/atom$ which is largely dominated by the energy of the covalent bonds. Remember, the largest contribution to the total energy comes from $E^0$ and then from covalent bonds. The noncovalent interactions contribute a very small amount to the total energy, yet they are crucial for molecular dynamics.

2. Understanding MACE parameters#

2.1 Model parameters#

We’ll give a high-level explanation of the important parameters in MACE. We will discuss these in detail during the third tutorial and associated lectures. Consult the documentation for additional parameters.

–num_interactions: message-passing layers

Controls the number of message-passing layers in the model.

–hidden_irreps: number of messages and their symmetry

Determines the size of the model and its symmetry. For example: hidden_irreps='128x0e' means the model has 128 channels or paths, the output is invariant under rotation ($L_{\rm max}=0$) and even under inversion ('e'). For most applications, these settings will do well. hidden_irreps='64x0e + 64x1o' means the model has 64 channels and is equivariant under rotation ($L_{\rm max}=0$).

Alternatively, the model size can be adjusted using a pair of more user-friendly arguments:

–num_channels=32

–max_L=2

which, taken together achieve the same as –hidden_irreps=’32x0e + 32x1o + 32x2e’

In general, the accuracy of the model can be improved by using more layers, more channels or higher equivariances. This will result in more parameters and slower models.

–correlation: the order of the many-body expansion

\[ E_{i} = E^{(0)}_{i} + \sum_{j} E_{ij}^{(1)} + \sum_{jk} E_{ijk}^{(2)} + ... \]

The energy-expansion order that MACE induces at each layer. Choosing --correlation=3 will create basis functions of up to 4-body (ijkl) indices, for each layer. If the model has multiple layers, the effective correlation order is higher. For example, a two-layer MACE with --correlation=3 has an effective body order of 13.

–r_max: the cutoff radius

The cut-off applied to local environment in each layer. r_max=3.0 means atoms separated by a distance of more than 3.0 A do not directly communicate. When the model has multiple message-passing layers, atoms further than 3.0 A can still communicate through later messages if intermediate proxy atoms exist. The effective receptive field of the model is num_interactions x r_max.

–max_ell: angular resolution

The angular resolution describes how well the model can describe angles. This is controlled by l_max of the spherical harmonics basis (not to be confused with L_max). Larger values will result in more accurate but slower models. The default is l_max=3, appropriate in most cases.

Let’s train our first model:

import warnings
warnings.filterwarnings("ignore")

import os
os.makedirs("config/", exist_ok=True)

%%writefile config/config-01.yml

model: MACE
num_interactions: 2
num_channels: 32
max_L: 0
r_max: 4.0
max_ell: 2

!mace_run_train --config config/config-01.yml --seed 42

This won’t work, of course, we still need to supply the training data and a model name, at a minimum. In fact, let’s take a look at the remaining parameters which control the file management and training protocol.

2.2 Training and data management parameters#

Let’s start by splitting the data into a train and test set.

from ase.io import read, write

db = read('data/solvent_xtb.xyz', ':')
write('data/solvent_xtb_train_200.xyz', db[:203]) #first 200 configs plus the 3 E0s
write('data/solvent_xtb_test.xyz', db[-1000:]) #last 1000 configs

–name: the name of the model

This name will be used to form file names (model, log, checkpoints, results), so choose a distinct name for each experiment

–model_dir, –log_dir, –checkpoints_dir, –results_dir: directory paths

These are the directories where each type of file is saved. For simplicity, we will all files in the same directory.

–train_file: name of training data

These data configs are used to compute gradients and update model parameters.

–valid_file: name of validation data

An alternative way to choose the validation set is by using the --valid_fraction keyword. These data configs are used to estimate the model accuracy during training, but not for parameter optimization. The validation set also controls the stopping of the training. At each --eval_interval the model is tested on the validation set. The evaluation of these configs takes place in batches, which can be controlled by --valid_batch_size. If the accuracy of the model stops improving on the validation set for --patience number of epochs, the model will undergo early stopping.

–test_file: name of testing data

This set is entirely independent and only gets evaluated at the end of the training process to estimate the model accuracy on an independent set.

–E0s: isolated atom energies

Controls how E0s should be determined. The strongly recommended approach is to add these values to the training set with config_type=IsolatedAtom in atoms.info and set E0s="isolated". If these values are not available, MACE can estimate them by least square regression over the available data E0s="average" which can lead to unintended consequences depending on how representative the data is.

–energy_key, –forces_key the key where these values are stores

This key must coincide with the ase.Atoms.info[key]/ase.Atoms.arrays[key] where the energies and forces are stored in the ase.Atoms object.

–device computing device to use

Can be CPU (cpu), GPU (cuda) or Apple Silicon (mps). Here we will use cuda since the GPU will be significantly faster than the CPU.

–batch_size number of configs evaluated in one batch

Number of configs used to compute the gradients for each full update of the network parameters. This training strategy is called stochastic gradient descent because only a subset of the data (batch_size) is used to change the parameters at each update.

–max_num_epochs number of passes through the data

An epoch is completed when the entire training data has been used once in updating the weights batch by batch. A new epoch begins, and the process repeats.

–swa protocol for loss weights

During training you will notice energy errors are at first much higher than force errors, MACE implements a special protocol that increases the weight on the energy in the loss function (--swa_energy_weight) once the forces are sufficiently accurate. The starting epoch for this special protocol can be controlled by changing --start_swa.

–seed random number generator seed

Useful for preparing committee of models.

Now we are ready to fit our first MACE model:

3. Fitting and Testing MACE models#

3.1 Fitting the model#

%%writefile config/config-02.yml
    
model: "MACE" 
num_interactions: 2 
num_channels: 32 
max_L: 0 
correlation: 2 
r_max: 4.0 
max_ell: 2 
name: "mace01" 
model_dir: "MACE_models" 
log_dir: "MACE_models" 
checkpoints_dir: "MACE_models" 
results_dir: "MACE_models" 
train_file: "data/solvent_xtb_train_200.xyz" 
valid_fraction: 0.10 
test_file: "data/solvent_xtb_test.xyz" 
E0s: "average" 
energy_key: "energy_xtb" 
forces_key: "forces_xtb" 
batch_size: 10 
max_num_epochs: 50 
swa: True 
seed: 123

dev = f'device: {device}'
%store dev >>"config/config-02.yml"

!mace_run_train --config config/config-02.yml
# if you run on cpu pass --device=cpu to the command above

#remove checkpoints since they may cause errors on retraining a model with the same name but a different architecture
import glob
import os
for file in glob.glob("MACE_models/*_run-*.model"):
    os.remove(file)
for file in glob.glob("MACE_models/*.pt"):
    os.remove(file)

MACE will automatically plot the loss function, energy and force errors before and after the SWA. Let’s investigate below. The correlation plots correspond to the model indicated by a solid vertical line. Begining of the SWA is indicated by the dashed line.

from IPython.display import Image, display
display(Image("MACE_models/mace01_run-123_train_default_stage_one.png"))

from IPython.display import Image, display
display(Image("MACE_models/mace01_run-123_train_default_stage_two.png"))

Tasks for you

Monitor the memory and usage of your GPU using nvitop in the terminal. What changes when increasing the batch_size? How about when changing the model size (num_interactions, num_channels, max_L, correlation)? Notice the accuracy vs speed trade-off.
Here we trained on a very small subset of the data, repeat training for 400, 1000, 2000, 4000 data points, remember to change the name of the model accordingly. How does the learning curve (test error vs size of training test) look like?

We will now use the mace_eval_configs script to evaluate the trained model on both the train and test datasets. The script takes the arguments: --configs which specifies the file to evaluate, the path to the model in --model and the path to the output in --output.

3.2 Testing the model: simple RMSEs#

import warnings
warnings.filterwarnings("ignore")

os.makedirs("tests/mace01/", exist_ok=True)

#evaluate the train set
!mace_eval_configs \
    --configs="data/solvent_xtb_train_200.xyz" \
    --model="MACE_models/mace01_stagetwo_compiled.model" \
    --output="tests/mace01/solvent_train.xyz"

#evaluate the test set
!mace_eval_configs \
    --configs="data/solvent_xtb_test.xyz" \
    --model="MACE_models/mace01_stagetwo_compiled.model" \
    --output="tests/mace01/solvent_test.xyz"

We can compare MACE vs XTB accuracy on the train and test sets and for this we will use the aseMolec which implements some handy utilities for manipulating ase.Atoms and testing potentials, especially for molecular systems.

from aseMolec import pltProps as pp
from ase.io import read
import matplotlib.pyplot as plt
from aseMolec import extAtoms as ea
import numpy as np

def plot_RMSEs(db, labs):
    ea.rename_prop_tag(db, 'MACE_energy', 'energy_mace') #Backward compatibility
    ea.rename_prop_tag(db, 'MACE_forces', 'forces_mace') #Backward compatibility
    
    plt.figure(figsize=(9,6), dpi=100)
    plt.subplot(1,3,1)
    pp.plot_prop(ea.get_prop(db, 'bind', '_xtb', True).flatten(), \
                 ea.get_prop(db, 'bind', '_mace', True).flatten(), \
                 title=r'Energy $(\rm eV/atom)$ ', labs=labs, rel=False)
    plt.subplot(1,3,2)
    pp.plot_prop(ea.get_prop(db, 'info', 'energy_xtb', True).flatten(), \
                 ea.get_prop(db, 'info', 'energy_mace', True).flatten(), \
                 title=r'Energy $(\rm eV/atom)$ ', labs=labs, rel=False)
    plt.subplot(1,3,3)
    pp.plot_prop(np.concatenate(ea.get_prop(db, 'arrays', 'forces_xtb')).flatten(), \
                 np.concatenate(ea.get_prop(db, 'arrays', 'forces_mace')).flatten(), \
                 title=r'Forces $\rm (eV/\AA)$ ', labs=labs, rel=False)
    plt.tight_layout()
    return

train_data = read('tests/mace01/solvent_train.xyz', ':')
test_data = train_data[:3]+read('tests/mace01/solvent_test.xyz', ':') #append the E0s for computing atomization energy errors

plot_RMSEs(train_data, labs=['XTB', 'MACE'])
plot_RMSEs(test_data, labs=['XTB', 'MACE'])

These figures show correlation plots between XTB values and MACE predicted values for atomization energy per atom, the total energy per atom and forces. Do the RMSE values match the number printed at the end of the model training? These errors don’t look too bad, and this MACE is a small model with few parameters.

Tasks for you

Significantly better accuracies can be achieved when training on larger models with more data. How does your model trained on 4000 configs compare?

3.3 Testing on the Intra/Inter decomposition:#

As shown in this paper one of the challenges associated with modelling molecular systems has to do with the inter-molecular interactions. Molecular dynamics is primarily driven by these inter-molecular interactions, however they are relatively small in comparison to covalent interactions and prove difficult to capture with MLIPs. The paper introduces this protocol to decompose the force errors into [intra- and inter-] molecular RMSEs to gauge the quality of the potential separately on the two interaction scales. This approach can be summarized as follows:

Identify molecules (labeled j).
Within each molecule j sum over all atomic forces (labeled k) to obtain the translational component: $$F^{\rm trans}_j = \sum_{k \in j} f_{k}$$
Redistribute the molecular translational force onto individual atoms (labeled i) to obtain the atomic translational contributions: $$f^{\rm trans}_i = \frac{m_i}{M_j} F^{\rm trans}_j$$
Similarly, compute the torque on the entire molecule: $$T_j = \sum_{k \in j} f_{k} \times r_{k}$$
Compute the atomic rotational force contributions that give rise to the given molecular torque: $$f^{\rm rot}_i = m_i r_i \times (I_j^{\alpha \beta})^{-1} T_j$$
Finally compute the vibrational contribution as the difference: $$f^{\rm vib}_i = f_i - f^{\rm trans}_i - f^{\rm rot}_i$$
In this approach, the Inter is the sum of trans and rot, while the Intra is the vib component.

This force decomposition can be automatically obtained using the aseMolec package as shown below:

from aseMolec import pltProps as pp
from aseMolec import anaAtoms as aa

db1 = read('tests/mace01/solvent_test.xyz', ':')
ea.rename_prop_tag(db1, 'energy_xtb', 'energy') #Backward compatibility
ea.rename_prop_tag(db1, 'forces_xtb', 'forces') #Backward compatibility

db2 = read('tests/mace01/solvent_test.xyz', ':')
ea.rename_prop_tag(db2, 'MACE_energy', 'energy') #Backward compatibility
ea.rename_prop_tag(db2, 'MACE_forces', 'forces') #Backward compatibility

aa.extract_molecs(db1, intra_inter=True)
aa.extract_molecs(db2, intra_inter=True)

pp.plot_trans_rot_vib(db1, db2, labs=['XTB', 'MACE'])

Indeed, the translation and rotational part of the forces (related to inter-molecular interactions) is significantly harder to capture as evidenced by the larger errors. While the absolute RMSEs are smaller, the relative RMSEs are signifincatly larger for the inter-molecular components. Nevertheless, MACE errors ar significantly lower than other models on these tests.

Tasks for you

Once again, try the same test with your best model trained on 4000 configs, how does it comapare?

4. Molecular Dynamics with MACE#

4.1 Is the dynamics stable?#

Accuracy on fixed test sets is great, but molecular dynamics (MD) is the ultimate test, check out this paper for a more detail discussion. First, we care about stability, then accuracy: let’s check if MACE gives stable dynamics. We will start by implementing a simple function to run Langevin dynamics. We initialize the temperature at 300 K and remove all translations and rotations.

from ase.io import read, write
from ase import units
from ase.md.langevin import Langevin
from ase.md.velocitydistribution import Stationary, ZeroRotation, MaxwellBoltzmannDistribution

import random
import os
import time
import numpy as np
import pylab as pl
from IPython import display

def simpleMD(init_conf, temp, calc, fname, s, T):
    init_conf.set_calculator(calc)

    #initialize the temperature
    random.seed(701) #just making sure the MD failure is reproducible
    MaxwellBoltzmannDistribution(init_conf, temperature_K=300) #initialize temperature at 300
    Stationary(init_conf)
    ZeroRotation(init_conf)

    dyn = Langevin(init_conf, 1.0*units.fs, temperature_K=temp, friction=0.1) #drive system to desired temperature

    %matplotlib inline

    time_fs = []
    temperature = []
    energies = []

    #remove previously stored trajectory with the same name
    os.system('rm -rfv '+fname)

    fig, ax = pl.subplots(2, 1, figsize=(6,6), sharex='all', gridspec_kw={'hspace': 0, 'wspace': 0})

    def write_frame():
            dyn.atoms.write(fname, append=True)
            time_fs.append(dyn.get_time()/units.fs)
            temperature.append(dyn.atoms.get_temperature())
            energies.append(dyn.atoms.get_potential_energy()/len(dyn.atoms))

            ax[0].plot(np.array(time_fs), np.array(energies), color="b")
            ax[0].set_ylabel('E (eV/atom)')

            # plot the temperature of the system as subplots
            ax[1].plot(np.array(time_fs), temperature, color="r")
            ax[1].set_ylabel('T (K)')
            ax[1].set_xlabel('Time (fs)')

            display.clear_output(wait=True)
            display.display(pl.gcf())
            time.sleep(0.01)

    dyn.attach(write_frame, interval=s)
    t0 = time.time()
    dyn.run(T)
    t1 = time.time()
    print("MD finished in {0:.2f} minutes!".format((t1-t0)/60))

Now we can run MD with MACE and compare it to the XTB dynamics. Let’s try 2 picoseoncds at 1200 K, starting from a single molecule config:

 #let us start with a single molecule
init_conf = ea.sel_by_info_val(read('data/solvent_molecs.xyz',':'), 'Nmols', 1)[0].copy()
os.makedirs("moldyn/", exist_ok=True)

#we can use MACE as a calculator in ASE!
from mace.calculators import MACECalculator
mace_calc = MACECalculator(model_paths=['MACE_models/mace01_stagetwo_compiled.model'], device=device, default_dtype="float32")
# change to device='cpu' if you run on cpu
simpleMD(init_conf, temp=1200, calc=mace_calc, fname='moldyn/mace01_md.xyz', s=10, T=2000)

Let’s visualize the trajectory using nglviewer. Alternatively you can use ovito through the Launcher.

import nglview as nv
traj = read('moldyn/mace01_md.xyz', ':')
nv.show_asetraj(traj)

For reference, we can also run XTB dynamics from the same starting configuration.

# reinitialize the original config
init_conf = ea.sel_by_info_val(read('data/solvent_molecs.xyz',':'), 'Nmols', 1)[0].copy()

from tblite.ase import TBLite
xtb_calc = TBLite(method="GFN2-xTB",verbosity=0)

### running this can take a long time, try for yourself. we will provide the pregenerated trajectory
simpleMD(init_conf, temp=1200, calc=xtb_calc, fname='moldyn/xtb_md.xyz', s=10, T=2000)

MACE dynamics finished in 2 minutes, vs XTB in 14. This is the essence of MLIP: speeding up calculations that would otherwise take a long time to run. In this case, the speed-up is just one order of magnitude, but depending on the cost of the reference calculation, it can be many orders of magnitude for expensive Quantum Chemistry methods and large systems. Remember, the cost of MLIP is independent of the accuracy of the potential energy surface!

Let’s visualize the trajectory:

from ase.io import read, write
import nglview as nv

traj = read('moldyn/xtb_md.xyz', ':')
nv.show_asetraj(traj)

Remember, you can visualize the trajectory files .xyz with ovito in the Desktop opened through the Launcher.

Obtaining stable dynamics with so little training is a great result. Up until recently, most MLIPs would require a lot of training before MD was stable. MACE combines the lessons learned over 10-15 years in MLIP development, to achieve a smooth and regular potential energy surface, which minimizes the risk of unstable MD.

4.2 Is the dynamics accurate?#

Are the different dynamics sampling the correct distributions? Let us check the radial distribution functions (RDF). The aseMolec package provides functionality to do that:

from matplotlib import pyplot as plt
from aseMolec import anaAtoms as aa

tag = 'HO_intra' #choose one of 'HH_intra', 'HC_intra', 'HO_intra', 'CC_intra', 'CO_intra', 'OO_intra'

for f in ['xtb_md', 'mace01_md']:
    traj = read('moldyn/'+f+'.xyz', '50:') #ignore first 50 frames
    for at in traj:
        at.pbc = True #create a fake box for rdf compatibility
        at.cell = [100,100,100]
    rdf = aa.compute_rdfs_traj_avg(traj, rmax=5, nbins=50) #aseMolec provides functionality to compute RDFs
    plt.plot(rdf[1], rdf[0][tag], '.-', label=f, alpha=0.7, linewidth=3)

plt.legend();
plt.yticks([]);
plt.xlabel(r'R ($\rm \AA$)');
plt.ylabel('RDF '+tag);

The trajectories here are stable and also quite accurate. This is still a relatively simple task: we chose a single molecule at relatively small temperatures (for a molecule) and only ran for 2 picoseconds. In practice, given enough time and high enough temperature the initial models will fail.

Task for you

Try it yourself! Inspect other RDF pairs (C-O, H-C), how well are they reproduced?
Experiment with the starting configs, temperatures, simulation length, see if you can find problems with the potentials!

4.3 MD of a molecular liquid?#

The MLIP was trained on clusters, can we simulate the liquid molecular environment?

init_conf = read('data/solvent_liquid.xyz') #read a liquid config with periodic boundary conditions
init_conf.center()
simpleMD(init_conf, temp=500, calc=mace_calc, fname='moldyn/mace01_md_liquid.xyz', s=10, T=2000)

This XTB calculator is non-periodic, so this dynamics would not be possible without an MLIP! Check for yourself, by replacing the calculator with xtb. The system is much larger than the example before (12 molecules vs just one). Let’s view the trajectory:

import nglview as nv
traj = read('moldyn/mace01_md_liquid.xyz', ':')
view = nv.show_asetraj(traj)
view.add_representation('unitcell')
view

Transferability from clusters to the condensed phase environment is still an open research question. If this works, it implies that we might be able to learn highly accurate Quantum Chemistry PES on molecular clusters and make predictions (density, diffusivity) for the condensed phase! This is new science!

MACE in Practice I

Contents

MACE in Practice I#

Learning Objectives for today:#

1. Understanding the data#

1.1 Diverse Molecular Conformations#

1.2 Labeling Data with XTB Values#

2. Understanding MACE parameters#

2.1 Model parameters#

–num_interactions: message-passing layers

–hidden_irreps: number of messages and their symmetry

–correlation: the order of the many-body expansion

–r_max: the cutoff radius

–max_ell: angular resolution

2.2 Training and data management parameters#

–name: the name of the model

–model_dir, –log_dir, –checkpoints_dir, –results_dir: directory paths

–train_file: name of training data

–valid_file: name of validation data

–test_file: name of testing data

–E0s: isolated atom energies

–energy_key, –forces_key the key where these values are stores

–device computing device to use

–batch_size number of configs evaluated in one batch

–max_num_epochs number of passes through the data

–swa protocol for loss weights

–seed random number generator seed

3. Fitting and Testing MACE models#

3.1 Fitting the model#

3.2 Testing the model: simple RMSEs#

3.3 Testing on the Intra/Inter decomposition:#

4. Molecular Dynamics with MACE#

4.1 Is the dynamics stable?#

4.2 Is the dynamics accurate?#

4.3 MD of a molecular liquid?#