{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "0_3Lyj01-U_5" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ddmms/camml-tutorials/blob/main/notebooks/05-generative/crystallm.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "ISybdfJV-U_7" }, "source": [ "# Autoregressive LLMs for Crystal Structure Generation\n", "Welcome to this tutorial on using autoregressive large language models (LLMs) in materials chemistry.\n", "\n", "In this notebook, we'll explore how to generate inorganic crystal structures using **CrystaLLM**, an autoregressive model trained on tens of thousands of known materials. By the end of this tutorial, you'll be able to:\n", "- Understand the inputs and outputs of CrystaLLM\n", "- Run the model to generate new hypothetical crystal structures\n", "- Interpret and analyze generated outputs\n", "\n", "This builds on your knowledge of:\n", "- Neural networks and generative models\n", "- Transformer architectures and language modelling\n", "- Basic inorganic crystal chemistry\n", "\n", "## What is CrystaLLM?\n", "\n", "CrystaLLM is a large language model trained to generate inorganic crystal structures in an autoregressive manner. It operates on a **tokenized representation** of crystal structures, learning the statistical patterns of known materials from databases such as the Materials Project and OQMD.\n", "\n", "Key features:\n", "- Based on the transformer architecture\n", "- Learns from linearly encoded crystal structure sequences\n", "- Generates structures one token at a time, similar to how text is generated in traditional LLMs\n", "- Outputs can be decoded into CIF-like representations for further analysis\n", "\n", "For more details, see our recent paper: [CrystaLLM: Generative modeling of inorganic crystal structures with autoregressive large language models](https://www.nature.com/articles/s41467-024-54639-7)" ] }, { "cell_type": "code", "source": [ "import locale\n", "locale.getpreferredencoding = lambda: \"UTF-8\"\n", "\n", "! pip install janus-core[all] data-tutorials ase\n", "! pip install git+https://github.com/lantunes/CrystaLLM.git\n", "get_ipython().kernel.do_shutdown(restart=True)" ], "metadata": { "id": "5B7ujidU_AFa" }, "execution_count": null, "outputs": [] }, { "cell_type": "code", "source": [ "from data_tutorials.data import get_data\n", "\n", "get_data(\n", " url=\"https://gitlab.com/cam-ml/tutorials/-/raw/main/notebooks/notebooks/05-generative/bin/\",\n", " filename=[\"download.py\", \"make_prompt_file.py\",\"sample.py\"],\n", " folder=\"bin\",\n", ")" ], "metadata": { "id": "dYFDi2wlAOsH" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "ynirg4i5rqCP" }, "source": [ "## Obtain the pretrained model\n", "\n", "The pretrained `CrystaLLM` as published in XXXX is available to download from Zenodo. There is a helpful `bin/download.py` script to help you with this.\n", "\n", "We download the small model (~25M parameters). But you can also access the larger model using `!tar xvf crystallm_v1_large.tar.gz`. In addition there are other models fror download, which are trained on different datasets, for the full list see the [config directory](https://github.com/lantunes/CrystaLLM/tree/main/config) of the repo." 
] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Kp0Tz03YoYH0" }, "outputs": [], "source": [ "!python bin/download.py crystallm_v1_small.tar.gz\n", "!tar xvf crystallm_v1_small.tar.gz" ] }, { "cell_type": "markdown", "metadata": { "id": "elxmxmFws-P4" }, "source": [ "## Generate a prompt\n", "\n", "`CrystaLLM` needs a prompt to start generating a file. This prompt is the opening text of the file. At its simplest we can just give a chemical formula. We put the prompt into a `.txt` file that will be read by the run script later on. We could also add a spacegroup using the `--spacegroup` option:\n", "\n", "```\n", "python bin/make_prompt_file.py Na2Cl2 my_sg_prompt.txt --spacegroup P4/nmm\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Mlhpz87SnoS7" }, "outputs": [], "source": [ "!python bin/make_prompt_file.py LiMnO2 my_prompt.txt" ] }, { "cell_type": "markdown", "metadata": { "id": "NGxnhp9xt-zi" }, "source": [ "## Run `CrystaLLM`\n", "\n", "\n", "To randomly sample from a trained model, and generate CIF files, use the `bin/sample.py` script. The sampling script\n", "expects the path to the folder containing the trained model checkpoint, as well as the prompt, and other configuration\n", "options.\n", "\n", "
\n", " Click for supported configuration options and their default values\n", "\n", " ```python\n", " out_dir: str = \"out\" # the path to the directory containing the trained model\n", " start: str = \"\\n\" # the prompt; can also specify a file, use as: \"FILE:prompt.txt\"\n", " num_samples: int = 2 # number of samples to draw\n", " max_new_tokens: int = 3000 # number of tokens generated in each sample\n", " temperature: float = 0.8 # 1.0 = no change, < 1.0 = less random, > 1.0 = more random, in predictions\n", " top_k: int = 10 # retain only the top_k most likely tokens, clamp others to have 0 probability\n", " seed: int = 1337\n", " device: str = \"cuda\" # examples: 'cpu', 'cuda', 'cuda:0', 'cuda:1', etc.\n", " dtype: str = \"bfloat16\" # 'float32' or 'bfloat16' or 'float16'\n", " compile: bool = False # use PyTorch 2.0 to compile the model to be faster\n", " target: str = \"console\" # where the generated content will be sent; can also be 'file'\n", " ```\n", "\n", "
\n", "\n", "For example:\n", "```shell\n", "python bin/sample.py \\\n", "out_dir=out/my_model \\\n", "start=FILE:my_prompt.txt \\\n", "num_samples=2 \\\n", "top_k=10 \\\n", "max_new_tokens=3000 \\\n", "device=cuda\n", "```\n", "In the above example, the trained model checkpoint file exists in the `out/my_model` directory. The prompt is\n", "in a file located at `my_prompt.txt`. Alternatively, we could also have placed the configuration options in a .yaml\n", "file, as we did for training, and specified its path using the `--config` command line option.\n", "\n", "Instead of specifying a file containing the prompt, we could also have specified the prompt directly:\n", "```shell\n", "python bin/sample.py \\\n", "out_dir=out/my_model \\\n", "start=$'data_Na2Cl2\\n' \\\n", "num_samples=2 \\\n", "top_k=10 \\\n", "max_new_tokens=3000 \\\n", "device=cuda\n", "```\n", "Assuming we're in a bash environment, we use the `$'string'` syntax for the `start` argument, since we'd like to\n", "specify the `\\n` (new line) character at the end of the prompt.\n", "\n", "The generated CIF files are sent to the console by default. Include the `target=file` argument to save the generated CIF\n", "files locally. (Each file will be named `sample_1.cif`, `sample_2.cif`, etc.)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CqznF8QAn0ed" }, "outputs": [], "source": [ "! python bin/sample.py \\\n", "out_dir=crystallm_v1_small/ \\\n", "start=FILE:my_prompt.txt \\\n", "num_samples=3 \\\n", "top_k=10 \\\n", "max_new_tokens=3000 \\\n", "device=cuda \\\n", "target=file" ] }, { "cell_type": "markdown", "metadata": { "id": "M9W2NJWItxXD" }, "source": [ "## Visualise the results\n", "\n", "Use ASE to see what our outputs looked like" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "LwjUfuuioM7-" }, "outputs": [], "source": [ "import ase.io\n", "from ase.visualize import view\n", "from ase.build import make_supercell\n", "import numpy as np\n", "\n", "structure = ase.io.read('sample_3.cif')\n", "supercell = 3\n", "print('Formula: ', structure.symbols)\n", "print('Unit cell: ', structure.cell)\n", "view(make_supercell(structure, supercell * np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])), viewer='x3d')" ] }, { "cell_type": "markdown", "metadata": { "id": "2BIjvbUGuulE" }, "source": [ "## Exercises\n", "\n", "* Try generating other structures\n", "* Try doing sample generation with spacegroup as well as composition\n", "* Load one of the other models - e.g. `crystallm_carbon_24_small` this has been trained only on allotropes of carbon. How good is this at generating a perovskite structure?\n", "* Try out the large model - do the results look different to the small model?\n", "* You can try to do generation using Monte Carlo Tree Search to choose conditioned next tokens, see [the documentation here](https://github.com/lantunes/CrystaLLM?tab=readme-ov-file#monte-carlo-tree-search-decoding) - in principle this should lead to lower energy genearated structures. See how it affects your generations - you can use the MACE models from previous tutorials to calculate the energy of the generated structures." ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }