CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM


1 Department of Computer Science, Princeton University, Princeton, NJ, USA
2 Center for Computational Biology, Flatiron Institute, New York, NY, USA
3 Center for Computational Mathematics, Flatiron Institute, New York, NY, USA
4 Department of Computer Science, University of British Columbia, Vancouver, BC, Canada

NeurIPS 2024 (Spotlight)
Dataset and Benchmark Track

Abstract

Cryo-electron microscopy (cryo-EM) is a powerful technique for determining high-resolution 3D biomolecular structures from imaging data. As this technique can capture dynamic biomolecular complexes, 3D reconstruction methods are increasingly being developed to resolve this intrinsic structural heterogeneity. However, the absence of standardized benchmarks with ground truth structures and validation metrics limits the advancement of the field. Here, we propose CryoBench, a suite of datasets, metrics, and performance benchmarks for heterogeneous reconstruction in cryo-EM. We propose five datasets representing different sources of heterogeneity and degrees of difficulty. These include conformational heterogeneity generated from simple motions and random configurations of antibody complexes and from tens of thousands of structures sampled from a molecular dynamics simulation. We also design datasets containing compositional heterogeneity from mixtures of ribosome assembly states and 100 common complexes present in cells. We then perform a comprehensive analysis of state-of-the-art heterogeneous reconstruction tools including neural and non-neural methods and their sensitivity to noise, and propose new metrics for quantitative comparison of methods. We hope that this benchmark will be a foundational resource for analyzing existing methods and new algorithmic development in both the cryo-EM and machine learning communities.

Overview

We present:
1️⃣ 5 datasets representing different sources of heterogeneity & difficulty levels.
2️⃣ Analysis of state-of-the-art reconstruction tools (neural & non-neural) & their noise sensitivity.
3️⃣ New metrics for method comparison.

Project overview (Figure 1 in the paper).

Datasets

CryoBench includes both datasets with simple motions that are easy to interpret for diagnostic purposes and challenging datasets intended to motivate new methods development.


Overview of datasets.

IgG-1D

IgG-1D is produced by rotating one of the domains of the IgG antibody complex 360 degrees, simulating a simple one-dimensional continuous circular motion.


Latent space and reconstructions of the benchmarked methods on the IgG-1D dataset.
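As a rough illustration of why the ground truth for IgG-1D is a circle rather than a line segment, here is a minimal sketch. It assumes the 100 ground truth conformations are evenly spaced in rotation angle, which this page does not state, and the function names are hypothetical. Each state is mapped to a point on the unit circle, and distances between states wrap around at 360 degrees.

```python
import math


def circular_states(n_states=100):
    """Return (angle_deg, (x, y)) pairs for evenly spaced rotation angles.

    Even spacing is an assumption made for illustration only.
    """
    step = 360.0 / n_states
    states = []
    for i in range(n_states):
        angle = i * step
        rad = math.radians(angle)
        states.append((angle, (math.cos(rad), math.sin(rad))))
    return states


def circular_distance_deg(a, b):
    """Shortest angular separation between two angles, with wrap-around."""
    diff = abs(a - b) % 360.0
    return min(diff, 360.0 - diff)


states = circular_states()
first_angle, last_angle = states[0][0], states[-1][0]
# The first and last states are neighbors on the circle even though
# their raw angles are far apart.
print(len(states), circular_distance_deg(first_angle, last_angle))
```

A method that recovers this motion well would be expected to place the first and last conformations next to each other in its latent space, which a purely linear ordering would not capture.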

IgG-RL

For IgG-RL, we generate random conformations for a disordered peptide linker connecting the Fab to the rest of the IgG complex. IgG-RL is a more challenging, complex motion intended to represent a realistic case of conformational heterogeneity (conf-het) in cryo-EM.


Latent space and reconstructions of the benchmarked methods on the IgG-RL dataset.

Spike-MD

In Spike-MD, we use a long-timescale molecular dynamics (MD) simulation to produce over 46k ground truth structures. We hope to motivate methods development connecting MD simulations with cryo-EM.


Latent space and reconstructions of the benchmarked methods on the Spike-MD dataset.

Ribosembly

Ribosembly provides a simple example of compositional heterogeneity (comp-het) using 16 ribosome assembly states as ground truth structures. These structures contain a common core that grows through the addition of proteins and ribosomal RNA.


Latent space and reconstructions of the benchmarked methods on the Ribosembly dataset.

Tomotwin-100

Finally, Tomotwin-100 is a challenging dataset for modeling compositional heterogeneity. It contains a mixture of 100 complexes commonly found inside cells (credit to the TomoTwin paper for curating these structures).


Latent space and reconstructions of the benchmarked methods on the Tomotwin-100 dataset.

Noisy Datasets

Cryo-EM images are characteristically extremely noisy! To test the robustness of different methods to noise, we also created the IgG-1D-noisier and IgG-1D-noisiest datasets.


Depiction of various noise levels for the IgG-1D datasets.
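To give a feel for the three SNR values listed for the IgG-1D images (0.01, 0.005, 0.001), the sketch below adds noise to a toy signal at a target SNR. It assumes SNR is defined as signal variance divided by noise variance and that the noise is zero-mean, independent Gaussian; this page does not spell out the definition, and the sketch does not reproduce the CryoBench image formation pipeline (projection, CTF, and so on). The function name `add_noise_at_snr` is hypothetical.

```python
import random
import statistics


def add_noise_at_snr(pixels, snr, seed=0):
    """Add zero-mean Gaussian noise so that var(signal) / var(noise) ~= snr.

    `pixels` is a flat list of clean pixel values. The SNR definition
    used here is an assumption, not taken from the dataset description.
    """
    signal_var = statistics.pvariance(pixels)
    noise_std = (signal_var / snr) ** 0.5
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, noise_std) for p in pixels]


# Toy "image": uniform random values standing in for a clean projection.
toy_rng = random.Random(1)
clean = [toy_rng.uniform(0.0, 1.0) for _ in range(20000)]

for snr in (0.01, 0.005, 0.001):
    noisy = add_noise_at_snr(clean, snr)
    noise = [n - c for n, c in zip(noisy, clean)]
    empirical = statistics.pvariance(clean) / statistics.pvariance(noise)
    print(f"target SNR {snr}: empirical SNR {empirical:.4f}")
```

Under this definition, an SNR of 0.001 means the noise variance is 1,000 times the signal variance, so the noise standard deviation is roughly 32 times that of the signal.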

For each of these datasets, we benchmarked 10 state-of-the-art heterogeneous reconstruction algorithms. See the supplementary information (SI) of the paper for a detailed analysis, and please let us know if you have any feedback.

Dataset Access & Usage

The datasets are available for download on Zenodo:

  1. Conf-het: https://zenodo.org/records/11629428
  2. Comp-het: https://zenodo.org/records/12528292
  3. Spike-MD: https://zenodo.org/records/12528784


1) Conformational Heterogeneity Dataset (IgG-1D / IgG-RL)

You can download and unzip IgG-1D.zip and/or IgG-RL.zip. As an example, we provide a brief outline of the directory structure of IgG-1D here. The pdbs directory includes the 100 ground truth atomic models (PDBs). The images directory contains one directory for each noise level (SNR 0.01, 0.005, 0.001); each of these contains one .mrcs file per conformation, and every .mrcs file contains 1,000 images. The init_mask directory has the backprojected volumes for all noise levels and the mask. The vols directory contains the 100 ground truth volumes obtained from the atomic models. Finally, the files combined_ctfs.pkl and combined_poses.pkl have the CTFs and poses for the 100,000 IgG-1D images (100 conformations × 1,000 images each).
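After unzipping, a quick sanity check of the layout described above might look like the following sketch. It uses only the directory and file names mentioned in the text (pdbs, images, vols, combined_ctfs.pkl, combined_poses.pkl); the root path "IgG-1D", the helper names, and the way noise-level subdirectories are discovered (by listing whatever is inside images) are assumptions, since the exact folder names are not given here.

```python
from pathlib import Path

IMAGES_PER_MRCS = 1000  # each .mrcs file holds 1,000 images per the description


def count_files(directory, pattern="*"):
    """Count regular files matching `pattern`; returns 0 if the directory is missing."""
    directory = Path(directory)
    if not directory.is_dir():
        return 0
    return sum(1 for p in directory.glob(pattern) if p.is_file())


def summarize_dataset(root):
    """Summarize an unzipped IgG-1D-style directory without reading image data."""
    root = Path(root)
    images_dir = root / "images"
    noise_levels = {}
    if images_dir.is_dir():
        # One subdirectory per noise level; names are discovered, not assumed.
        for level_dir in sorted(p for p in images_dir.iterdir() if p.is_dir()):
            n_stacks = count_files(level_dir, "*.mrcs")
            noise_levels[level_dir.name] = {
                "stacks": n_stacks,
                "images": n_stacks * IMAGES_PER_MRCS,
            }
    return {
        "pdbs": count_files(root / "pdbs"),
        "vols": count_files(root / "vols"),
        "noise_levels": noise_levels,
        "has_ctfs": (root / "combined_ctfs.pkl").is_file(),
        "has_poses": (root / "combined_poses.pkl").is_file(),
    }


# "IgG-1D" is a hypothetical path to the unzipped archive; if it does not
# exist, the summary simply reports zero counts.
print(summarize_dataset("IgG-1D"))
```

For a complete IgG-1D download, one would expect each noise level to report 100 stacks and 100,000 images, matching the number of entries described for the combined CTF and pose files.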


2) Compositional Heterogeneity Dataset (Ribosembly / Tomotwin-100)

You can download and unzip Ribosembly.zip and/or Tomotwin-100.zip. The Ribosembly dataset is organized by state, and the Tomotwin-100 dataset is organized by size.


3) Spike-MD Dataset

You can download and unzip Spike-MD.zip. You can also download the atomic models stitched into an MD trajectory, sampled_pdbs.xtc.

BibTeX

@article{jeon2024cryobench,
        title={CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM},
        author={Jeon, Minkyu and Raghu, Rishwanth and Astore, Miro and Woollard, Geoffrey and Feathers, Ryan and Kaz, Alkin and Hanson, Sonya M and Cossio, Pilar and Zhong, Ellen D},
        journal={arXiv preprint arXiv:2408.05526},
        year={2024}
}