Bootstrapping Dense 3D Segmentations from Sparse 2D Annotations

Arlo Sheridan
Salk Institute
Kristen M. Harris
UT Austin
Uri Manor
UCSD / Salk Institute

TL;DR

  • Dense 3D segmentations from sparse 2D annotations, no 3D labels needed.
  • 10 minutes of sparse 2D painting bootstraps dense 3D training data.
  • Validated across diverse imaging modalities and biological targets.
  • Bootstrapped 3D models offer a 1,000-fold reduction in manual annotation effort.
  • Open-source CLI and napari [1] plugin; runs on consumer hardware and scales to large volumes.

Motivation

Consider a fresh 3D microscopy volume with no annotations. The goal is a dense instance segmentation of some specific structure of interest: cells, organelles, neurites, or other objects. The natural first step is to try existing deep learning models.

Sometimes they work reasonably well, especially when the data resembles what the model was trained on. That is a good sign: it means the model can likely be fine-tuned. But often the data differs enough in imaging modality, tissue type, resolution, or staining that off-the-shelf models produce unusable results.

Segmentation foundation models (SAM [2], microSAM [3], CellSAM [4], Cellpose [5], etc.) can help in 2D, but they produce sparse segmentations, lack 3D consistency, and frequently miss the specific structures of interest. The same features that make a dataset scientifically interesting are exactly what generalist models tend to get wrong.

This leads to the same bottleneck every time: manual annotation of objects of interest to generate training data. Dense 3D annotation of even a small volume can take hundreds to thousands of expert hours. For example, dense annotation of hippocampal neuropil in a 180 μm³ EM volume required approximately 2,000 hours of expert effort [6].

We asked: how sparse can the annotations be?

We developed a 2D→3D method that generates dense 3D instance segmentations from sparse 2D annotations. Ten minutes of non-expert painting on a few 2D sections is sufficient to bootstrap 3D segmentations approaching the quality of models trained on dense expert ground-truth on matched evaluation volumes, a 1,000-fold reduction in annotation time. The approach works across diverse imaging modalities, biological targets, and segmentation tasks, and requires no domain-specific pretraining.

Results

Dense 3D segmentations from sparse 2D annotations

We validated the 2D→3D method across six publicly available datasets: HARRIS-15 [6], FIB-25 [7], CREMI-A/B/C [8], and EPI [9]. Each dataset contains two dense ground-truth volumes. We systematically varied annotation density from single objects to dense labels.

Across all datasets, ten minutes of non-expert sparse annotation on a single section was sufficient to generate a dense 3D segmentation comparable in quality to one based on expert annotation requiring 1,000× more time.

Fig. 1. Example sparse, non-expert annotations created in 30 minutes for HARRIS-15. Annotations were partitioned into three approximately 10-minute subsets to assess the effect of annotation time on segmentation quality.
Fig. 2. Orthogonal views comparing the 2D→3D result with 10 minutes of sparse annotation, the 3D Baseline result with all dense annotations, and ground-truth. Total time to segmentation includes human annotation time and machine time (training + inference + post-processing).

On HARRIS-15, segmentation errors against the dense baseline drop with increasing amounts of annotation, with diminishing returns at each additional order of magnitude of annotation time.

Fig. 3. Cost vs. quality on HARRIS-15. Δ-NVI sum (Normalized VOI sum relative to the 3D baseline) plotted against estimated annotation time. The all-objects condition requires 33,333× more annotation time than the 10-minute 2D condition for only a 0.07 improvement in NVI sum.
Fig. 4. Improvement per additional order of magnitude (OOM) of annotation time on HARRIS-15 relative to 10 min 2D. The all-objects condition requires 4.5 OOM more annotation time than the 10-minute 2D condition for only a 0.07 improvement in NVI sum.

In practical terms, the 2D→3D bootstrap path approaches the quality of direct dense annotation while saving 1–6 months of calendar time and $25k–$125k of expert annotation labor.

Fig. 5. 2D→3D bootstrapping shortens the path to high-quality training data on HARRIS-15. Three paths are compared. The Direct (upper bound) path annotates dense ground truth over 6 calendar months and trains a 3D model on it. The Direct (lower bound) path does the same over 1 calendar month. The Bootstrap path annotates sparsely (10 min 2D), trains a 2D→3D model, and generates pseudo-GT, then trains the same 3D model on that pseudo-GT. Bootstrapping saves 1–6 months of calendar time and $25k–$125k of human-annotator labor at $25/h.
Fig. 6. Deviation of normalized Variation of Information (VOI) sum across six datasets (total ground-truth path lengths in parentheses) relative to the 3D Baseline trained on all dense annotations. Approximate manual annotation times are marked for the HARRIS-15 dataset. Across all amounts and datasets, the VOI scores consistently approached those of the 3D baselines.
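
For reference, the split/merge terms behind the VOI sum can be computed with scikit-image. A minimal sketch follows; the normalized and Δ variants reported in the figures follow the paper's definitions, and the arrays here are stand-ins:

import numpy as np
from skimage.metrics import variation_of_information

gt = np.random.randint(0, 5, size=(8, 64, 64))    # stand-in ground-truth labels
seg = np.random.randint(0, 5, size=(8, 64, 64))   # stand-in predicted labels

# the two conditional-entropy terms of the VOI
h0, h1 = variation_of_information(gt, seg)
print("VOI sum:", h0 + h1)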

Generalization across imaging modalities and segmentation tasks

We tested the 2D→3D method on five additional publicly available datasets spanning distinct imaging modalities, biological structures, and segmentation tasks. Across all five, the pipeline produced visually coherent 3D segmentations from minimal annotation effort.

  • LICONN [10]: Expansion-microscopy volumes of mouse somatosensory cortex (effective voxel size ~10 × 10 × 20 nm, x × y × z). 10 minutes of SAM-assisted [2] sparse annotation on a few sections for boundary segmentation.
  • PRISM [11]: 18-channel light microscopy of mouse hippocampus (voxel size 35 × 35 × 80 nm, x × y × z). Ten sections of manual ground-truth annotation for boundary segmentation.
  • CREMI-C [8]: Serial-section TEM of Drosophila neuropil through an axon tract (voxel size 4 × 4 × 40 nm, x × y × z). Two sections of manual annotation for synaptic cleft segmentation.
  • MitoEM-H [12]: Multi-beam SEM of human cortex (voxel size 8 × 8 × 30 nm, x × y × z). Two sections of manual annotation for mitochondria instance segmentation.
  • Fluo-C2DL-Huh7 [13]: Live-cell laser confocal imaging of Huh7 hepatocarcinoma cells (pixel size 0.65 × 0.65 μm, x × y). Two frames of manual annotation for 2D+t cell segmentation.
Fig. 7. The 2D→3D method generalizes across imaging modalities and segmentation tasks. Five datasets spanning distinct imaging modalities, biological targets, and annotation strategies. Each row displays: a representative 2D image (col 1), the published proofread, manual, or ground-truth segmentation (col 2), the published 3D mesh (col 3), our 2D→3D segmentation result (col 4), and our 2D→3D mesh (col 5). Datasets: a LICONN, b PRISM, c CREMI-C, d MitoEM-H, e Fluo-C2DL-Huh7.

Comparison with existing tools

To contextualize the method against existing general-purpose segmentation tools, we compared it with Cellpose + uSegment3D [5,14] on the EPI dataset. Cellpose was applied to all 540 images of the EPI test volume, and uSegment3D was used to merge the 2D segmentations into a 3D consensus segmentation. For the 2D→3D method, we evaluated three levels of supervision: dense ground-truth labels, a single densely labelled section, and sparse SAM-generated labels on 3 images in 5 minutes of human time. The 2D→3D method with sparse SAM labels achieved a usable segmentation from 5 minutes of human effort, approximately two orders of magnitude less annotation effort in exchange for a modest decrease in segmentation accuracy.

Fig. 8. Comparison of 2D→3D segmentation with Cellpose + uSegment3D on the EPI dataset. a, Representative XY sections (top) and 3D mesh renderings (bottom) of the EPI test volume, comparing ground-truth labels with segmentations produced by different methods at varying levels of supervision. b, Quantitative comparison reporting the number of labelled voxels, labelled objects, labelled images, estimated human annotation time, and normalized VOI (sum) for each method.

Bootstrapping reduces total reconstruction cost

Unproofread 2D→3D segmentations serve as pseudo ground-truth to bootstrap dedicated 3D segmentation models. We evaluated whether segmentations from different amounts of sparse training data could bootstrap subsequent 3D models without any manual proofreading. Across all six datasets, bootstrapped 3D models trained on 2D→3D pseudo ground-truth approached the quality of dense 3D baselines at all sparsity levels.

Fig. 9. Deviation of the total number of split and merge errors to fix per skeleton path length (μm−1) across six datasets (total ground-truth path lengths in parentheses) relative to the non-bootstrapped 3D baseline. All values fall above zero, except for HARRIS-15 when there were 50 or more bootstrapped objects, where the 2D→3D method outperformed the non-bootstrapped baseline. This occurs because the manually annotated ground-truth covers only a cylindrical subregion in Volume 2, whereas the pseudo ground-truth segments the entire volume, providing more training labels.

Pseudo ground-truth from 10 minutes of non-expert annotation on a single section yielded bootstrapped segmentations requiring only approximately 2–3 additional edits per micron (in HARRIS-15) compared to those from dense expert annotations of entire volumes (1,000× more annotation time). Sparser approaches increase the subsequent proofreading burden, but the reduction in upstream annotation yields an order-of-magnitude improvement in total reconstruction time.

Cost-benefit analysis

We estimated the total reconstruction time as the sum of manual annotation time, machine computation time, and estimated proofreading time for HARRIS-15. Proofreading time was estimated at 0.1 minutes per split correction and 1, 3, or 10 minutes per merge correction, spanning the range of reported proofreading throughputs. All times are in Person Workdays (PWD) at 5 hours per day.

Annotation amount                        All objects   10 objects   10 min 2D
Total merges to fix false splits               3,052        3,589       4,850
Total splits to fix false merges               1,821        2,026       2,158
Total edits                                    4,964        5,723       7,156
Time (PWD)
  Manual annotation                            1,000           25        0.03
  Machine computation                            0.3          0.3         0.3
  Proofreading splits (0.1 min/edit)            0.61         0.67        1.62
  Proofreading merges (1 min/edit)               7.1          7.9         8.8
  Total (1 min/edit)                           1,008         33.9        10.7
  Proofreading merges (3 min/edit)              19.2         21.5        23.2
  Total (3 min/edit)                           1,020         47.5        25.1
  Proofreading merges (10 min/edit)             61.7         68.7        73.6
  Total (10 min/edit)                        1,062.6         94.7        75.5
Table 1. Bootstrapping cost-benefit analysis for HARRIS-15. Efficiency analysis comparing manual annotation time, machine time, estimated proofreading times, and total reconstruction times at different proofreading rates. Proofreading time per false split is set to 0.1 min/false split. All times are in Person Workdays (PWD) at 5 hours per day. Even at the most pessimistic proofreading rate (10 min/edit), 10 minutes of sparse annotation yields a 14× reduction in total reconstruction time compared to dense annotation.
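
For reference, a minimal sketch of this accounting (the rates follow the table; the edit counts in the snippet are placeholders, not the per-column values reported above):

MIN_PER_PWD = 5 * 60  # one Person Workday = 5 hours

def total_reconstruction_pwd(manual_pwd, machine_pwd, n_split_edits, n_merge_edits,
                             min_per_split=0.1, min_per_merge=1.0):
    # proofreading cost scales with the number of edits at the assumed per-edit rates
    proofreading_pwd = (n_split_edits * min_per_split
                        + n_merge_edits * min_per_merge) / MIN_PER_PWD
    return manual_pwd + machine_pwd + proofreading_pwd

# hypothetical example at the most pessimistic merge rate (10 min/edit)
print(total_reconstruction_pwd(0.03, 0.3, 2000, 2000, min_per_merge=10.0))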

How It Works

The pipeline operates in two stages. Both use standard U-Net [15] architectures and are lightweight enough to run on consumer hardware (<3 GB GPU memory).

Stage 1: Sparse 2D → Dense 2D

A user creates sparse 2D annotations on one or a few sections, either manually or using a foundation model (e.g., SAM [2]). A 2D U-Net is trained on these sparse labels to predict dense local shape descriptors (LSDs) [16] and affinities [17,18]. A masked loss restricts supervision to labeled regions, preventing the network from learning background features from unannotated instances. At inference, the 2D network is applied section-by-section to produce a stack of dense 2D predictions.
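
A minimal sketch of the masked loss, assuming PyTorch (tensor names, shapes, and the 6-channel 2D LSD target are illustrative, not the Bootstrapper API):

import torch

def masked_weighted_mse(pred, target, mask, weights=None):
    # pred, target: (B, C, H, W) network output and dense 2D LSD/affinity targets
    # mask:         (B, 1, H, W) 1 inside annotated regions, 0 elsewhere
    # weights:      optional per-pixel weights (e.g., for class balancing)
    err = (pred - target) ** 2 * mask              # zero out unannotated regions
    if weights is not None:
        err = err * weights
    return err.sum() / mask.sum().clamp(min=1.0)   # average over supervised pixels only

pred = torch.randn(1, 6, 128, 128, requires_grad=True)   # e.g., 6-channel 2D LSDs
target = torch.randn(1, 6, 128, 128)
mask = (torch.rand(1, 1, 128, 128) > 0.7).float()
masked_weighted_mse(pred, target, mask).backward()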

Fig. 10. Sparse 2D to dense 2D training. 2D images with sparse labels are used to train a 2D U-Net to learn dense 2D LSDs. Background regions of the target 2D LSDs, denoted by diagonal gray stripes, are masked out during loss computation. All networks compute a weighted mean-squared-error (MSE) loss between predictions and targets during training.

Synthetic 3D Label Generation

Synthetic 3D training data is generated using three distinct strategies to simulate diverse biological structures: morphological dilation from random seeds, section-wise dilation of speckled binary arrays, and watershed on Gaussian-filtered random peaks. A fourth ensemble strategy combines all three equally. These synthetic labels serve as the source of both inputs and targets for the Stage 2 network below and are generated on-the-fly during training.
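
A minimal NumPy/scikit-image sketch of the third strategy (watershed on Gaussian-filtered random peaks); parameters are illustrative, and the actual Gunpowder implementation lives in create_labels.py:

import numpy as np
from scipy.ndimage import gaussian_filter, label, maximum_filter
from skimage.segmentation import watershed

def synthetic_labels(shape=(32, 128, 128), sigma=4.0, seed=0):
    rng = np.random.default_rng(seed)
    field = gaussian_filter(rng.random(shape), sigma=sigma)   # smooth random field
    peaks = maximum_filter(field, size=9) == field            # local maxima as seeds
    markers, _ = label(peaks)
    return watershed(-field, markers)                         # flood to carve instances

labels = synthetic_labels()
print(labels.max(), "synthetic instances")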

Fig. 11. Generation of 3D synthetic data through a series of morphological operations and transformations applied to a random array of noise or foreground voxels. The Gunpowder implementation is available at create_labels.py.

Stage 2: Stacked 2D → 3D

A lightweight 3D U-Net is pre-trained on the synthetic 3D labels described above. It learns to map noisy stacked 2D LSDs to clean 3D affinities, with both inputs and targets simulated from the synthetic labels. Because the 3D network only ever sees stacked 2D predictions as input, it acquires z-axis connectivity priors entirely from the synthetic data, without requiring any 3D annotation from the target dataset.
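
A minimal NumPy sketch of deriving nearest-neighbour 3D affinity targets from a (synthetic) label volume; the input side, noisy per-section 2D LSDs computed from the same labels, is omitted here, and the production pipeline uses its own Gunpowder/LSD utilities:

import numpy as np

def seg_to_affinities(labels):
    # returns (3, D, H, W) affinities for the z, y, x nearest neighbours
    affs = np.zeros((3,) + labels.shape, dtype=np.float32)
    for axis in range(3):
        shifted = np.roll(labels, -1, axis=axis)
        same = ((labels == shifted) & (labels > 0)).astype(np.float32)
        border = [slice(None)] * 3
        border[axis] = slice(-1, None)
        same[tuple(border)] = 0                  # drop the wrap-around introduced by roll
        affs[axis] = same
    return affs

affs = seg_to_affinities(np.random.randint(0, 4, size=(8, 32, 32)))
print(affs.shape)   # (3, 8, 32, 32)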

Fig. 12. Stacked dense 2D to dense 3D training. A 3D U-Net is trained to map stacked 2D LSDs to 3D affinities. Both the inputs (stacked 2D LSDs) and the targets (3D affinities) are simulated on-the-fly from the synthetic 3D labels described above; training uses a weighted mean-squared-error (MSE) loss between predictions and targets.

Inference

At inference, the two trained U-Nets run sequentially on a target volume. The 2D U-Net is applied section-by-section to produce stacked 2D LSDs, which the 3D U-Net converts to 3D affinities. Seeded watershed on the Stage 2 affinities produces the final 3D instance segmentation.
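
A minimal sketch of this flow, assuming trained PyTorch models net_2d and net_3d (variable names, the seed heuristic, and the thresholds are illustrative simplifications of the Bootstrapper post-processing):

import numpy as np
import torch
from scipy.ndimage import label, maximum_filter
from skimage.segmentation import watershed

@torch.no_grad()
def run_2d_to_3d(volume, net_2d, net_3d):
    # Stage 1: section-by-section 2D LSD prediction
    lsds = [net_2d(torch.from_numpy(sec[None, None]).float()).squeeze(0)
            for sec in volume]                       # each: (C, H, W)
    stacked = torch.stack(lsds, dim=1)[None]         # (1, C, Z, H, W)

    # Stage 2: stacked 2D LSDs -> 3D affinities
    affs = net_3d(stacked).squeeze(0).numpy()        # (3, Z, H, W)

    # seeded watershed on a boundary map derived from the affinities
    boundary = 1.0 - affs.mean(axis=0)
    seeds = (maximum_filter(-boundary, size=5) == -boundary) & (boundary < 0.1)
    markers, _ = label(seeds)
    return watershed(boundary, markers, mask=boundary < 0.5)

# segmentation = run_2d_to_3d(raw_volume, net_2d, net_3d)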

Fig. 13. 2D→3D inference pipeline. Five example sections shown for simplicity. Sections from a 3D image volume are input slice-by-slice to the trained 2D U-Net to generate 2D LSD slices, from which the trained 3D U-Net infers 3D affinities. LSDs and affinities are RGB encoded for visualization. Seeded watershed generates 3D segmentation from the affinities.

Get Started

Napari Plugin

The napari-bootstrapper plugin provides an interactive graphical interface for applying the 2D→3D method to small volumes. Create sparse annotations directly in napari, generate dense volumetric segmentations, and perform basic proofreading and post-processing. Works seamlessly with foundation models and the napari plugin ecosystem.

pip install napari-bootstrapper
Napari-bootstrapper plugin: annotate, train, infer, and export dense 3D segmentations within a single GUI.

Bootstrapper CLI

The complete framework is open-source. It includes modules, scripts, and a command line interface for all components of the workflow: data preparation, 2D→3D models, LSD models, training pipelines, blockwise parallel inference and post-processing, evaluation, and error identification. Designed to scale to large volumes using lightweight distributed algorithms that do not require high-performance computing clusters.

pip install bootstrapper
# or
git clone https://github.com/ucsdmanorlab/bootstrapper.git
cd bootstrapper && pip install -e .


Takeaways

  • We present a general-purpose method for generating dense 3D segmentations from sparse 2D annotations. It works across diverse imaging modalities, biological targets, and segmentation tasks. No domain-specific pretraining or high-performance computing is required.
  • Specialist 3D models trained on dense ground-truth achieve the highest accuracy within their domain. This method does not replace them. It addresses the bottleneck that precedes them: generating the dense training data needed for both model development and evaluation.
  • In our experiments, bootstrapped 3D models trained on unproofread pseudo ground-truth approached the quality of models trained on dense expert annotations, with orders-of-magnitude less human effort.

Complementary to foundation models

The pipeline is complementary to segmentation foundation models. SAM, microSAM, or CellSAM outputs serve directly as sparse input labels for the 2D→3D framework. As 2D foundation models improve, so does every segmentation bootstrapped through them, with no changes to the pipeline itself.
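
For example, a minimal sketch of turning SAM's automatic masks on a single section into a sparse 2D label image for Stage 1 (assuming the segment-anything package; the checkpoint path and the all-zeros section are placeholders):

import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

section_rgb = np.zeros((512, 512, 3), dtype=np.uint8)          # replace with a real 2D section
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path
masks = SamAutomaticMaskGenerator(sam).generate(section_rgb)

labels = np.zeros(section_rgb.shape[:2], dtype=np.uint32)
for i, m in enumerate(sorted(masks, key=lambda m: m["area"], reverse=True), start=1):
    labels[m["segmentation"]] = i   # larger masks first, so nested masks keep their own id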

Toward autonomous bootstrapping

The bootstrapper CLI is fully scriptable. AI agents can independently orchestrate the entire bootstrapping loop: selecting annotations, training models, running inference, monitoring LSD and segmentation inconsistencies, and iterating. This paradigm, where agents explore multiple analysis paths in parallel and retain those that meet quality thresholds, could dramatically accelerate ground-truth generation across many volumes simultaneously.

Dense voxel ground-truth remains the gold standard for 3D microscopy segmentation. This method accelerates convergence to that standard by reducing the total cost of generating training data. For laboratories with limited resources, it transforms an intractable annotation problem into a manageable one.

Acknowledgements

U.M. is supported by NIA P30AG068635 (Nathan Shock Center), the David F. and Margaret T. Grohne Family Foundation, Core Grant application NCI CCSG (CA014195), NIDCD R01DC021075-01, NSF NeuroNex Award (2014862), the L.I.F.E. Foundation, and the CZI Imaging Scientist Award from the Chan Zuckerberg Initiative DAF. K.M.H. and V.V.T. are supported by NIH R01MH095980, NSF NeuroNex Technology Hub Award (1707356), NSF NeuroNex Award (2014862), and NSF NCS Award (2219864).

Special thanks to Patrick H. Parker for expert curation and proofreading of the HARRIS-15 dataset, and to all manual annotators. Computing resources provided by the Texas Advanced Computing Center (TACC) at UT Austin.

References

  1. napari contributors. napari: a multi-dimensional image viewer for Python. Zenodo (2019). doi:10.5281/zenodo.3555620.
  2. Kirillov, A. et al. Segment Anything. ICCV (2023).
  3. Archit, A. et al. Segment Anything for Microscopy. Nat. Methods (2025).
  4. Marks, M., Israel, U., Dilip, R. et al. CellSAM: a foundation model for cell segmentation. Nat. Methods 22, 2585–2593 (2025).
  5. Stringer, C. et al. Cellpose: a generalist algorithm for cellular segmentation. Nat. Methods 18, 100–106 (2021).
  6. Harris, K. M. et al. A resource from 3D electron microscopy of hippocampal neuropil for user training and tool development. Sci. Data 2, 150046 (2015).
  7. Takemura, S. et al. Synaptic circuits and their variations within different columns in the visual system of Drosophila. PNAS 112, 13711–13716 (2015).
  8. CREMI. cremi.org.
  9. Wolny, A. et al. Accurate and versatile 3D segmentation of plant tissues at cellular resolution. eLife 9, e57613 (2020).
  10. Tavakoli, M. R. et al. Light-microscopy-based connectomic reconstruction of mammalian brain tissue. Nature 642, 398–410 (2025).
  11. Park, S. Y. et al. Combinatorial protein barcodes enable self-correcting neuron tracing. bioRxiv (2025).
  12. Franco-Barranco, D. et al. Current Progress and Challenges in Large-Scale 3D Mitochondria Instance Segmentation. IEEE Trans. Med. Imaging 42, 3956–3971 (2023).
  13. Ruggieri, A. et al. Dynamic Oscillation of Translation and Stress Granule Formation Mark the Cellular Response to Virus Infection. Cell Host Microbe 12, 71–85 (2012).
  14. Zhou, F. Y. et al. Universal consensus 3D segmentation of cells from 2D segmented stacks. Nat. Methods (2025).
  15. Ronneberger, O., Fischer, P. & Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. MICCAI (2015).
  16. Sheridan, A. et al. Local shape descriptors for neuron segmentation. Nat. Methods 20, 295–303 (2023).
  17. Turaga, S. C. et al. Convolutional Networks Can Learn to Generate Affinity Graphs for Image Segmentation. Neural Computation 22, 511–538 (2010).
  18. Lee, K. et al. Superhuman Accuracy on the SNEMI3D Connectomics Challenge. arXiv:1706.00120 (2017).

Full reference list available in the paper.