CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM


1 Department of Computer Science, Princeton University, Princeton, NJ, USA
2 Center for Computational Biology, Flatiron Institute, New York, NY, USA
3 Center for Computational Mathematics, Flatiron Institute, New York, NY, USA
4 Department of Computer Science, University of British Columbia, Vancouver, BC, Canada


Abstract

Cryo-electron microscopy (cryo-EM) is a powerful technique for determining high-resolution 3D biomolecular structures from imaging data. As this technique can capture dynamic biomolecular complexes, 3D reconstruction methods are increasingly being developed to resolve this intrinsic structural heterogeneity. However, the absence of standardized benchmarks with ground truth structures and validation metrics limits the advancement of the field. Here, we propose CryoBench, a suite of datasets, metrics, and performance benchmarks for heterogeneous reconstruction in cryo-EM. We propose five datasets representing different sources of heterogeneity and degrees of difficulty. These include conformational heterogeneity generated from simple motions and random configurations of antibody complexes and from tens of thousands of structures sampled from a molecular dynamics simulation. We also design datasets containing compositional heterogeneity from mixtures of ribosome assembly states and 100 common complexes present in cells. We then perform a comprehensive analysis of state-of-the-art heterogeneous reconstruction tools including neural and non-neural methods and their sensitivity to noise, and propose new metrics for quantitative comparison of methods. We hope that this benchmark will be a foundational resource for analyzing existing methods and new algorithmic development in both the cryo-EM and machine learning communities.

Overview

Project overview (Figure 1 in the paper).

Figure 1: Overview of CryoBench.

(a) In cryo-EM, each image X_i captures a molecule V_i projected at an unknown pose ϕ_i. A latent variable z_i models the conformational space V that describes the heterogeneity between molecules {V_i}. (b) CryoBench includes 5 synthetic datasets of varying difficulty, characterized by heterogeneity arising from conformational (i.e., shape) or compositional (i.e., identity) changes. (c) Methods can be grouped by whether they use a continuous latent variable z or a discrete latent variable π to model heterogeneity. Hidden variables assumed to be known are shown in gray. Volumes are represented as a neural field (NF), voxel array (VA), neural volume (NV), or tetrahedral mesh (TM). Generative models are colored blue for nonlinear neural methods, orange for linear generative models, pink for mixture models, and green for density-preserving motion models. (d) A variety of metrics are used to assess both latent inference and volume reconstruction quality.

Dataset Access & Usage

The datasets are available for download on Zenodo:

  1. Conf-het: https://zenodo.org/records/11629428
  2. Comp-het: https://zenodo.org/records/12528292
  3. Spike-MD: https://zenodo.org/records/12528784


1) Conformational Heterogeneity Dataset (IgG-1D / IgG-RL)

You can download and unzip IgG-1D.zip and/or IgG-RL.zip. As an example, we briefly outline the directory structure of IgG-1D here. The pdbs directory contains the 100 ground truth atomic models (PDBs). The images directory contains one subdirectory for each noise level (SNR 0.01, 0.005, 0.001); each subdirectory contains one .mrcs file per conformation, and each .mrcs file contains 1,000 images. The init_mask directory contains the backprojected volumes for all noise levels and the mask. The vols directory contains the 100 ground truth volumes obtained from the atomic models. Finally, the files combined_ctfs.pkl and combined_poses.pkl contain the CTFs and poses for the 100,000 IgG-1D images (100 conformations × 1,000 images each).
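As a quick sanity check after unzipping, the layout described above can be summarized with a short script. This is a minimal sketch, not part of the CryoBench tooling: the function name `summarize_igg1d` is hypothetical, the `.pdb` file extension is an assumption, and the names of the noise-level subdirectories are read from disk rather than assumed. The two .pkl files are only loaded to report their Python type, since their internal layout is not described here; if they hold NumPy arrays, NumPy must be installed for unpickling to succeed.

```python
import pickle
from pathlib import Path


def summarize_igg1d(root):
    """Summarize an unzipped IgG-1D directory without reading any image data."""
    root = Path(root)
    summary = {}

    # images/ holds one subdirectory per noise level; count the .mrcs stacks in each.
    summary["mrcs_per_noise_level"] = {
        d.name: len(list(d.glob("*.mrcs")))
        for d in sorted((root / "images").iterdir())
        if d.is_dir()
    }

    # Ground truth atomic models (the ".pdb" extension is an assumption).
    summary["n_pdbs"] = len(list((root / "pdbs").glob("*.pdb")))

    # Ground truth volumes: count every regular file, whatever its extension.
    summary["n_vols"] = sum(1 for p in (root / "vols").iterdir() if p.is_file())

    # Report only the Python type stored in each pickle; the layout is not assumed.
    for name in ("combined_ctfs.pkl", "combined_poses.pkl"):
        with open(root / name, "rb") as f:
            summary[name] = type(pickle.load(f)).__name__

    return summary
```

For IgG-1D, the description above implies 100 .mrcs stacks under each noise level, 100 atomic models, and 100 volumes; calling `summarize_igg1d("IgG-1D")` on the unzipped directory (path assumed) should make any mismatch easy to spot.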


2) Compositional Heterogeneity Dataset (Ribosembly / Tomotwin-100)

You can download and unzip Ribosembly.zip and/or Tomotwin-100.zip. The files are organized by state for the Ribosembly dataset and by size for the Tomotwin-100 dataset.


3) Spike-MD Dataset

You can download and unzip Spike-MD.zip. You can also download the atomic models stitched into an MD trajectory, sampled_pdbs.xtc.

BibTeX

@article{jeon2024cryobench,
        title={CryoBench: Diverse and challenging datasets for the heterogeneity problem in cryo-EM},
        author={Jeon, Minkyu and Raghu, Rishwanth and Astore, Miro and Woollard, Geoffrey and Feathers, Ryan and Kaz, Alkin and Hanson, Sonya M and Cossio, Pilar and Zhong, Ellen D},
        journal={arXiv preprint arXiv:2408.05526},
        year={2024}
}