{ "cells": [ { "cell_type": "markdown", "metadata": { "id": "pKtv5vIXBIM2" }, "source": [ "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ddmms/camml-tutorials/blob/main/notebooks/05-generative/diffusion-model.ipynb)" ] }, { "cell_type": "markdown", "metadata": { "id": "d_gVXHPmhJr6" }, "source": [ "# Crystal Generation using Diffusion Models\n", "\n", "In this tutorial, we'll explore how to generate inorganic crystal structures using **Chemeleon**, a diffusion-based generative model for materials discovery.\n", "\n", "Whereas autoregressive models like CrystaLLM generate structures one token at a time, diffusion models work by **iteratively denoising** a noisy latent representation to arrive at a valid crystal structure.\n", "\n", "We will:\n", "- Understand the intuition behind diffusion-based generation\n", "- Use a pretrained Chemeleon model to generate new materials\n", "- Visualize and interpret the generated crystal structures\n", "\n", "Reference: [Chemeleon GitHub](https://github.com/hspark1212/chemeleon)\n", "\n", "## Diffusion Models: A Brief Overview\n", "\n", "Diffusion models are a class of generative models inspired by physical processes of noise and denoising.\n", "\n", "**Forward process:** Gradually add noise to a structure until it becomes pure noise.\n", "\n", "**Reverse process (learned):** Train a neural network to remove noise step-by-step and reconstruct the original data.\n", "\n", "Advantages:\n", "- High-quality, diverse generations\n", "- Better suited for continuous and spatially structured data (like crystal coordinates)\n", "\n", "In Chemeleon, the diffusion model operates on a latent representation of crystal structures derived from atom types and fractional coordinates.\n", "\n", "## What is Chemeleon?\n", "\n", "Chemeleon is a **graph-based diffusion model** for crystal structure generation. It uses:\n", "- A crystal graph representation with node features (elements) and position embeddings\n", "- A denoising network trained on the Materials Project dataset\n", "- A *score-based diffusion process* in latent space to generate valid structures\n", "\n", "Key components:\n", "- Graph Neural Networks (GNNs) for encoding\n", "- Denoising Score Matching for training\n", "- Support for conditional generation (e.g. given composition)\n", "\n", "Paper: [ChemRxiv](https://chemrxiv.org/engage/chemrxiv/article-details/6728e27cf9980725cf118177)" ] }, { "cell_type": "markdown", "source": [ "execute next cell only if you run on google colab" ], "metadata": { "id": "lJkDkdjNJSlc" } }, { "cell_type": "code", "source": [ "import locale\n", "locale.getpreferredencoding = lambda: \"UTF-8\"\n", "\n", "! pip install janus-core[all] data-tutorials chemeleon ase\n", "get_ipython().kernel.do_shutdown(restart=True)" ], "metadata": { "id": "B2hb6RDqBLE-" }, "execution_count": null, "outputs": [] }, { "cell_type": "markdown", "metadata": { "id": "GUq3A2ryjOif" }, "source": [ "## Load up the required functions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "UNETLWFSfjyJ" }, "outputs": [], "source": [ "from chemeleon import Chemeleon\n", "from chemeleon.visualize import Visualizer\n", "from ase.io import write\n", "import os" ] }, { "cell_type": "markdown", "metadata": { "id": "1N0sV37GjUOJ" }, "source": [ "## Load the model\n", "\n", "We load up a pre-trained model. We start with the composition mode. This model only takes compositions as a prompt. Note that the download and loading takes some time (2 - 5 minutes)." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "_2DO72t-f-ZR" }, "outputs": [], "source": [ "%%time\n", "composition_model = Chemeleon.load_composition_model()" ] }, { "cell_type": "markdown", "metadata": { "id": "Ht7onr3vlDE-" }, "source": [ "## Generate Structures\n", "\n", "Here we generate just one sample, to run quickly. But you can increase this later. Since we are using the composition only model, it can only take elements as a prompt." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Pb3ifSUrf_zi" }, "outputs": [], "source": [ "# Set parameters\n", "n_samples = 2\n", "n_atoms = 8\n", "prompt = \"Li P S\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "MeJ28Gq0gJof" }, "outputs": [], "source": [ "%%time\n", "# Generate crystal structures\n", "atoms_list = composition_model.sample(prompt, n_atoms, n_samples)" ] }, { "cell_type": "markdown", "metadata": { "id": "nEzGGtBHBIM7" }, "source": [ "## Output Interpretation\n", "\n", "The output of Chemeleon is a valid 3D crystal structure, represented as a `pymatgen` Structure object.\n", "\n", "Each structure includes:\n", "- Lattice parameters\n", "- Atomic coordinates\n", "- Element types\n", "\n", "You can export to CIF using:\n", "```python\n", "structure.to(filename=\"generated_structure.cif\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "Th5BYZhAgyoo" }, "outputs": [], "source": [ "# Visualise\n", "visualizer = Visualizer(atoms_list)\n", "visualizer.view(index=0)\n", "\n", "from ase.io import write\n", "write(\"generated_structures.extxyz\",atoms_list)\n", "write(\"generated_structure_0.cif\",atoms_list[0])\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Y_uVlQbpBIM7" }, "source": [ "## Diffusion vs. Autoregressive Generation\n", "\n", "| Feature | Diffusion (Chemeleon) | Autoregressive (CrystaLLM) |\n", "|---------------------------|------------------------------|------------------------------|\n", "| Generation style | Denoising from noise | Token-by-token sampling |\n", "| Data representation | Continuous 3D + elements | Discrete sequence |\n", "| Conditioning support | Via latent / graph inputs | Via token prompts |\n", "| Output format | pymatgen Structure | CIF / token sequence |\n", "| Typical advantages | Spatial realism, diversity | Interpretability, simplicity |\n", "\n", "**Question for students**:\n", "- Which approach seems more flexible for structure-property targeting?\n", "- Can you imagine combining both approaches in a hybrid model?" ] }, { "cell_type": "markdown", "metadata": { "id": "TTu6sOBWnTjO" }, "source": [ "## Try generation with more conditions\n", "\n", "We can use the `general_text_model` which allows us to use natural language to impose more conditions." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "8aH-9Hmvnvqp" }, "outputs": [], "source": [ "# Load default model checkpoint (general text types)\n", "chemeleon = Chemeleon.load_general_text_model()" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "W8_1TZB3lmwn" }, "outputs": [], "source": [ "# Set parameters\n", "n_samples = 3\n", "n_atoms = 56\n", "text_inputs = \"A crystal structure of LiMn2O4 with cubic symmetry\"" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "CeEHhNxApC4c" }, "outputs": [], "source": [ "# Generate crystal structure\n", "atoms_list = chemeleon.sample(text_inputs, n_atoms, n_samples)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "B65QVYzCpDgp" }, "outputs": [], "source": [ "# Visualize the generated crystal structure\n", "visualizer = Visualizer(atoms_list)\n", "visualizer.view(index=0)" ] }, { "cell_type": "markdown", "metadata": { "id": "ObHOE2DGptFA" }, "source": [ "## View the generation trajectory\n", "\n", "We can visualise the diffusion Langevin dynamcis that is used to generate the final structure from the initial sampling. To achive this we return the trajectory from the genertation process.\n", "\n", "In the visulaisation note how the composition, the positions and the lattice are updated at the same time." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "OX0iPJm6phW_" }, "outputs": [], "source": [ "n_samples = 1\n", "n_atoms = 56\n", "text_inputs = \"A crystal structure of LiMn2O4 with cubic symmetry\"\n", "\n", "# Generate crystal structure with trajectory\n", "trajectory = chemeleon.sample(text_inputs, n_atoms, n_samples, return_trajectory=True)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "id": "JoEt8QQDp_N_" }, "outputs": [], "source": [ "# Visualize the trajectory\n", "idx = 0\n", "traj_0 = [t[idx] for t in trajectory][::10] + [trajectory[-1][idx]]\n", "visualizer = Visualizer(traj_0, resolution=15)\n", "visualizer.view_trajectory(duration=0.1)" ] }, { "cell_type": "markdown", "metadata": { "id": "9VAXnfFrbruB" }, "source": [ "## Exercise\n", "\n", "Use `Chemeleon` to generate some of the materials that you generated previously with `CrystaLLM`. Then export these structures and compare the formation energies of the two different generative models." ] } ], "metadata": { "accelerator": "GPU", "colab": { "gpuType": "T4", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "name": "python3" }, "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 0 }