
Autoregressive LLMs for Crystal Structure Generation#

Welcome to this tutorial on using autoregressive large language models (LLMs) in materials chemistry.

In this notebook, we’ll explore how to generate inorganic crystal structures using CrystaLLM, an autoregressive model trained on millions of known crystal structures. By the end of this tutorial, you’ll be able to:

  • Understand the inputs and outputs of CrystaLLM

  • Run the model to generate new hypothetical crystal structures

  • Interpret and analyze generated outputs

This builds on your knowledge of:

  • Neural networks and generative models

  • Transformer architectures and language modelling

  • Basic inorganic crystal chemistry

What is CrystaLLM?#

CrystaLLM is a large language model trained to generate inorganic crystal structures in an autoregressive manner. It operates on a tokenized representation of crystal structures, learning the statistical patterns of known materials from databases such as the Materials Project and OQMD.

Key features:

  • Based on the transformer architecture

  • Learns from linearly encoded crystal structure sequences

  • Generates structures one token at a time, similar to how text is generated in traditional LLMs

  • Outputs can be decoded into CIF-like representations for further analysis

For more details, see our recent paper: CrystaLLM: Generative modeling of inorganic crystal structures with autoregressive large language models
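Before running anything, it helps to see what a "linearly encoded crystal structure sequence" looks like in practice. The sketch below is purely illustrative (the numbers and field order are made up, not a real model output): a structure is written out as plain CIF-style text, and the model, given the opening line as a prompt, predicts the rest one token at a time.

# Purely illustrative (not a real model output): the kind of CIF-style text
# CrystaLLM is trained on. Given the opening "data_..." line as a prompt,
# the model generates the remaining lines token by token.
example_prompt = "data_LiMnO2\n"

example_continuation = (
    "_cell_length_a 2.81\n"
    "_cell_length_b 5.75\n"
    "_cell_length_c 4.63\n"
    "_cell_angle_alpha 90.0\n"
    "_cell_angle_beta 90.0\n"
    "_cell_angle_gamma 90.0\n"
    "_symmetry_space_group_name_H-M Pmmn\n"
    "...\n"  # the atom-site loop and remaining fields would follow
)

print(example_prompt + example_continuation)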

# Work around a locale issue on Colab that can otherwise break subprocess calls
import locale
locale.getpreferredencoding = lambda: "UTF-8"

# Install the tutorial dependencies and CrystaLLM itself
! pip install "janus-core[all]" data-tutorials ase
! pip install git+https://github.com/lantunes/CrystaLLM.git
# Restart the kernel so the newly installed packages are picked up
get_ipython().kernel.do_shutdown(restart=True)
# Fetch the helper scripts used in this tutorial into a local bin/ folder
from data_tutorials.data import get_data

get_data(
    url="https://gitlab.com/cam-ml/tutorials/-/raw/main/notebooks/notebooks/05-generative/bin/",
    filename=["download.py", "make_prompt_file.py","sample.py"],
    folder="bin",
)

Obtain the pretrained model#

The pretrained CrystaLLM, as published in the paper above, is available to download from Zenodo. The bin/download.py helper script takes care of this.

Here we download the small model (~25M parameters), but you can also fetch the large model by passing crystallm_v1_large.tar.gz to the download script and extracting it with !tar xvf crystallm_v1_large.tar.gz. In addition, there are other models available for download, trained on different datasets; for the full list, see the config directory of the repo.

!python bin/download.py crystallm_v1_small.tar.gz
!tar xvf crystallm_v1_small.tar.gz
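If everything worked, the archive should have unpacked into a crystallm_v1_small/ directory (the same directory passed as out_dir in the sampling cell below). A quick check:

# List the unpacked model directory; we expect to find a trained model
# checkpoint (e.g. a .pt file) inside.
import os
print(os.listdir("crystallm_v1_small"))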

Generate a prompt#

CrystaLLM needs a prompt to start generating a CIF file. The prompt is simply the opening text of the file; at its simplest, it is just a chemical formula. We write the prompt into a .txt file that the sampling script will read later on. We could also add a space group using the --spacegroup option, for example:

python bin/make_prompt_file.py Na2Cl2 my_sg_prompt.txt --spacegroup P4/nmm
!python bin/make_prompt_file.py LiMnO2 my_prompt.txt
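It is worth inspecting the file the script just wrote; it should contain nothing more than the opening line of a CIF file (something like data_LiMnO2 followed by a newline):

# Print the raw contents of the prompt file written above
with open("my_prompt.txt") as f:
    print(repr(f.read()))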

Run CrystaLLM#

To randomly sample from a trained model and generate CIF files, use the bin/sample.py script. The sampling script expects the path to the folder containing the trained model checkpoint, as well as the prompt and other configuration options.

Click for supported configuration options and their default values
out_dir: str = "out"  # the path to the directory containing the trained model
start: str = "\n"  # the prompt; can also specify a file, use as: "FILE:prompt.txt"
num_samples: int = 2  # number of samples to draw
max_new_tokens: int = 3000  # number of tokens generated in each sample
temperature: float = 0.8  # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions
top_k: int = 10  # retain only the top_k most likely tokens, clamp others to have 0 probability
seed: int = 1337
device: str = "cuda"  # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.
dtype: str = "bfloat16"  # 'float32' or 'bfloat16' or 'float16'
compile: bool = False  # use PyTorch 2.0 to compile the model to be faster
target: str = "console"  # where the generated content will be sent; can also be 'file'

For example:

python bin/sample.py \
out_dir=out/my_model \
start=FILE:my_prompt.txt \
num_samples=2 \
top_k=10 \
max_new_tokens=3000 \
device=cuda

In the above example, the trained model checkpoint file exists in the out/my_model directory. The prompt is in a file located at my_prompt.txt. Alternatively, we could have placed the configuration options in a .yaml file, as is done for training, and specified its path using the --config command line option.
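As a hedged sketch of that .yaml alternative (the keys mirror the options listed earlier; the file name my_sample.yaml is an arbitrary choice for this example):

# Write a YAML config holding the sampling options, then point sample.py at it
config_text = """\
out_dir: crystallm_v1_small/
start: FILE:my_prompt.txt
num_samples: 2
top_k: 10
max_new_tokens: 3000
device: cuda
target: file
"""
with open("my_sample.yaml", "w") as f:
    f.write(config_text)

# then run: !python bin/sample.py --config=my_sample.yaml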

Instead of specifying a file containing the prompt, we could also have specified the prompt directly:

python bin/sample.py \
out_dir=out/my_model \
start=$'data_Na2Cl2\n' \
num_samples=2 \
top_k=10 \
max_new_tokens=3000 \
device=cuda

Assuming we’re in a bash environment, we use the $'string' syntax for the start argument, since we’d like to specify the \n (new line) character at the end of the prompt.

The generated CIF files are sent to the console by default. Include the target=file argument to save the generated CIF files locally. (Each file will be named sample_1.cif, sample_2.cif, etc.)

! python bin/sample.py \
out_dir=crystallm_v1_small/ \
start=FILE:my_prompt.txt \
num_samples=3 \
top_k=10 \
max_new_tokens=3000 \
device=cuda \
target=file
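Because target=file was set, the generated structures should now be on disk as sample_1.cif, sample_2.cif and sample_3.cif:

# List the CIF files produced by the sampling run above
import glob
print(sorted(glob.glob("sample_*.cif")))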

Visualise the results#

Use ASE to see what the generated structures look like.

import ase.io
from ase.visualize import view
from ase.build import make_supercell
import numpy as np

# Read one of the generated CIF files and inspect its composition and cell
structure = ase.io.read('sample_3.cif')
supercell = 3  # replicate the unit cell 3x3x3 for easier viewing
print('Formula: ', structure.symbols)
print('Unit cell: ', structure.cell)

# Build the supercell and display it with the interactive x3d viewer
view(make_supercell(structure, supercell * np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])), viewer='x3d')
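Beyond eyeballing the structure, we can check what symmetry the model actually produced. A minimal sketch using pymatgen (usually pulled in by the CrystaLLM install; pip install pymatgen if it is missing):

# Hedged sketch: detect the space group of a generated structure with pymatgen
from pymatgen.core import Structure
from pymatgen.symmetry.analyzer import SpacegroupAnalyzer

pmg_structure = Structure.from_file("sample_3.cif")
sga = SpacegroupAnalyzer(pmg_structure, symprec=0.1)
print("Detected space group:", sga.get_space_group_symbol())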

Exercises#

  • Try generating other structures

  • Try doing sample generation with spacegroup as well as composition

  • Load one of the other models, e.g. crystallm_carbon_24_small, which has been trained only on allotropes of carbon. How good is it at generating a perovskite structure?

  • Try out the large model - do the results look different to the small model?

  • You can try generation using Monte Carlo Tree Search to choose conditioned next tokens, see the documentation here. In principle this should lead to lower-energy generated structures. See how it affects your generations: you can use the MACE models from previous tutorials to calculate the energy of the generated structures (a starting-point sketch is given below).
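As a starting point for the last exercise, here is a minimal, hedged sketch of an energy evaluation, assuming the mace-torch package (installed with janus-core[all]) and its mace_mp foundation-model calculator are available in your environment:

# Hedged sketch: attach a MACE foundation-model calculator to a generated
# structure and compute its potential energy with ASE.
import ase.io
from mace.calculators import mace_mp  # assumes mace-torch is installed

atoms = ase.io.read("sample_1.cif")
atoms.calc = mace_mp(model="small", device="cuda")  # use device="cpu" if no GPU
energy = atoms.get_potential_energy()
print(f"MACE energy: {energy:.3f} eV ({energy / len(atoms):.3f} eV/atom)")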