Cryo-electron microscopy (cryo-EM) is a powerful technique for determining high-resolution 3D biomolecular structures from imaging data. As this technique can capture dynamic biomolecular complexes, 3D reconstruction methods are increasingly being developed to resolve this intrinsic structural heterogeneity. However, the absence of standardized benchmarks with ground truth structures and validation metrics limits the advancement of the field. Here, we propose CryoBench, a suite of datasets, metrics, and performance benchmarks for heterogeneous reconstruction in cryo-EM. We propose five datasets representing different sources of heterogeneity and degrees of difficulty. These include conformational heterogeneity generated from simple motions and random configurations of antibody complexes and from tens of thousands of structures sampled from a molecular dynamics simulation. We also design datasets containing compositional heterogeneity from mixtures of ribosome assembly states and 100 common complexes present in cells. We then perform a comprehensive analysis of state-of-the-art heterogeneous reconstruction tools including neural and non-neural methods and their sensitivity to noise, and propose new metrics for quantitative comparison of methods. We hope that this benchmark will be a foundational resource for analyzing existing methods and new algorithmic development in both the cryo-EM and machine learning communities.
We present:
1️⃣ 5 datasets representing different sources of heterogeneity & difficulty levels.
2️⃣ Analysis of state-of-the-art reconstruction tools (neural & non-neural) & their noise sensitivity.
3️⃣ New metrics for method comparison.
CryoBench datasets include both simple motions that are easy to interpret for diagnostic purposes and challenging cases that motivate new methods development.
IgG-1D is produced by rotating one of the domains of the IgG antibody complex 360 degrees, simulating a simple one-dimensional continuous circular motion.
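As a minimal sketch of the idea behind IgG-1D, the snippet below rotates a set of 3D coordinates about a fixed axis in equal angular steps to produce a circular trajectory. It is not the authors' generation code: the function names, the choice of rotation axis and origin, and the representation of a domain as a plain list of (x, y, z) points are all assumptions for illustration. The default of 100 states matches the 100 ground truth models listed for IgG-1D.

```python
import math

def rotate_about_axis(point, origin, axis, angle_deg):
    """Rotate `point` about the line through `origin` along `axis` (Rodrigues' formula)."""
    theta = math.radians(angle_deg)
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]                 # unit rotation axis
    v = [p - o for p, o in zip(point, origin)]   # point relative to the origin
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    c, s = math.cos(theta), math.sin(theta)
    rotated = [
        v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1 - c)
        for i in range(3)
    ]
    return [r + o for r, o in zip(rotated, origin)]

def circular_trajectory(domain_coords, origin, axis, n_states=100):
    """Return n_states copies of the domain, evenly spaced over a full 360-degree turn."""
    step = 360.0 / n_states  # hypothetical uniform spacing between states
    return [
        [rotate_about_axis(p, origin, axis, i * step) for p in domain_coords]
        for i in range(n_states)
    ]
```

Because the motion is circular, the state after the last one coincides with the first, so the conformational coordinate is periodic rather than a line segment.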
For IgG-RL, we generate random conformations of a disordered peptide linker connecting the Fab to the rest of the IgG complex. IgG-RL is a more challenging, complex motion intended to represent a realistic case of conformational heterogeneity (conf-het) in cryo-EM.
In Spike-MD, we use a long-timescale molecular dynamics (MD) simulation to produce over 46k ground truth structures. We hope to motivate methods development connecting MD simulations with cryo-EM.
Ribosembly provides a simple example of compositional heterogeneity (comp-het) using 16 ribosome assembly states as ground truth structures. These structures contain a common core that grows through the addition of proteins and ribosomal RNA.
Finally, Tomotwin-100 is a challenging dataset for modeling compositional heterogeneity. It contains a mixture of 100 complexes commonly found inside cells (h/t the TomoTwin paper for curating these structures).
Cryo-EM images are characteristically extremely noisy! To test the robustness of different methods to noise, we also created the IgG-1D-noisier and IgG-1D-noisiest datasets.
For each of these datasets, we benchmarked 10 state-of-the-art heterogeneous reconstruction algorithms. Check out the SI for a detailed analysis, and please let us know if you have any feedback.
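To give a feel for what these noise levels mean, here is a minimal sketch that adds white Gaussian noise to a clean image at a target signal-to-noise ratio. It assumes SNR is defined as the variance of the signal divided by the variance of the noise, and it treats the image as a flat list of pixel values; both are assumptions for illustration, and add_noise_at_snr is a hypothetical helper, not part of any CryoBench tooling. The SNR values in the comment are the ones listed for the IgG-1D image directories.

```python
import random
import statistics

def add_noise_at_snr(image, snr, rng=None):
    """Add zero-mean Gaussian noise so that var(signal) / var(noise) is about `snr`.

    `image` is a flat list of clean pixel values (an assumption of this sketch).
    """
    rng = rng or random.Random()
    signal_var = statistics.pvariance(image)
    noise_std = (signal_var / snr) ** 0.5  # assumed definition: SNR = signal var / noise var
    return [x + rng.gauss(0.0, noise_std) for x in image]

# SNR levels listed for the IgG-1D images: 0.01, 0.005, 0.001.
# At SNR 0.001 the noise variance is 1,000 times the signal variance.
```

Usage would look like `noisy = add_noise_at_snr(clean_pixels, snr=0.01, rng=random.Random(0))`, with a seeded generator for reproducibility.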
IgG-1D
The pdbs directory includes the 100 ground truth atomic models (PDBs). The images directory contains one directory for each noise level (SNR 0.01, 0.005, 0.001), where every directory contains one .mrcs file for each conformation, and every .mrcs file contains 1,000 images. The init_mask directory has the back-projected volumes for all noise levels and the mask. The vols directory contains 100 ground truth volumes obtained from the atomic models. Finally, the files combined_ctfs.pkl and combined_poses.pkl have the CTFs and poses for the 100,000 IgG-1D images.
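The layout described above can be sanity-checked after download with a short script. This is a sketch under stated assumptions: summarize_igg1d is a hypothetical helper, the .pdb and .mrc file extensions are guesses (only .mrcs and the two .pkl names come from the description), and the internal structure of the pickle files is not specified here, so the script only reports the type of object each one contains. Only unpickle files from a source you trust.

```python
import pickle
from pathlib import Path

N_CONFORMATIONS = 100    # ground truth models, per the description
IMAGES_PER_STACK = 1000  # images in each .mrcs file, per the description
EXPECTED_TOTAL_IMAGES = N_CONFORMATIONS * IMAGES_PER_STACK  # 100,000

def summarize_igg1d(root):
    """Count the files found under an IgG-1D download rooted at `root`."""
    root = Path(root)
    summary = {
        # File extensions below are assumptions, not taken from the dataset docs.
        "n_pdbs": len(list((root / "pdbs").glob("*.pdb"))),
        "n_vols": len(list((root / "vols").glob("*.mrc"))),
        # One sub-directory per noise level, each holding one .mrcs per conformation.
        "stacks_per_noise_level": {
            d.name: len(list(d.glob("*.mrcs")))
            for d in sorted((root / "images").iterdir())
            if d.is_dir()
        },
    }
    for name in ("combined_ctfs.pkl", "combined_poses.pkl"):
        with open(root / name, "rb") as f:
            # Record only the container type; the contents are format-specific.
            summary[name] = type(pickle.load(f)).__name__
    return summary
```

For a complete download, one would expect n_pdbs and n_vols to equal N_CONFORMATIONS and every noise-level directory to hold N_CONFORMATIONS stacks.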
Ribosembly dataset and by size for the Tomotwin-100 dataset.
@article{jeon2024cryobench,
title={CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM},
author={Jeon, Minkyu and Raghu, Rishwanth and Astore, Miro and Woollard, Geoffrey and Feathers, Ryan and Kaz, Alkin and Hanson, Sonya M and Cossio, Pilar and Zhong, Ellen D},
journal={arXiv preprint arXiv:2408.05526},
year={2024}
}