// Medical Computer Vision

Trained a Gleason grading + tumor-segmentation model on full-resolution prostate WSI

We trained the Gleason grading + tumor-segmentation model - gland-level Gleason pattern classification plus pixel-level cancer segmentation on gigapixel H&E prostate biopsies - for a seed-stage medical imaging startup. The GPUs on their 8×A100 box were idling at <30% under an OpenSlide dataloader and CPU tile augmentation, so we rebuilt the data path to be GPU-native (Zarr-backed pyramidal tiles, cuCIM preprocessing and stain normalization on-device, Kornia augmentation in the same GPU memory). Single-epoch wallclock collapsed from ~9h to ~50m, GPU utilization went from sub-30% to >90%, and the iteration loop finally caught up with the pathologists' labeling cadence.

offices

Europe

size

Seed-stage

industry

Medical imaging AI - anonymized

revenue

-

// Outcomes

The numbers that matter

  • 9h → ~50m

    single-epoch wallclock

  • >90%

    GPU utilization (was <30%)

  • On-prem

    no PHI egress, hospital VLAN

01 · GPUs idle at sub-30% while the data loader does all the work

The Challenge

The customer is an early-stage medical imaging startup. The model they needed to ship was a single-head Gleason grading + segmentation network on full-resolution prostate biopsy WSI - gland-level Gleason pattern classification (3, 4, 5, plus benign tissue and high-grade PIN) and pixel-level tumor segmentation of the same regions, so a downstream pathologist gets both a primary/secondary pattern call and an annotated mask they can review in the digital pathology viewer. The clinical promise is real: prostate core-needle biopsies are high-volume, Gleason scoring drives the treatment decision, and inter-observer variability between pathologists on patterns 3 vs. 4 is well documented - a second-reader AI that surfaces equivocal regions with a localized mask has a credible path to clinical adoption. The problem was that the team's training loop wasn't keeping up with their own labeling throughput.

A single annotated WSI in their cohort is a pyramidal H&E slide at 40× scan magnification - roughly 100,000×100,000 pixels at the base level, 5–20 GB on disk per slide, with multiple cores per case and stain/scanner variation across the cohort. Their training pipeline pulled tiles through OpenSlide on every epoch, decoded JPEG-compressed tiles on the CPU, ran tile augmentations (random crops, rotations, color jitter, elastic deformations, stain perturbations) on the CPU, and then pushed the result to the GPU. The result on their 8×A100 box was textbook starvation: GPU utilization hovering around 25%, single-epoch wallclock around 9 hours on the curated training set, and an iteration loop where a single hyperparameter sweep took most of a week.
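
For contrast, here is a minimal sketch of the pattern that was starving the GPUs - slide re-opened and JPEG-decoded per sample, augmentation on the CPU, the GPU touched only at the very end. Class and variable names are illustrative, not the customer's actual code:

```python
# Illustrative sketch of the CPU-bound anti-pattern described above (not the
# customer's code): every sample re-opens the slide through OpenSlide, decodes
# a JPEG tile on the CPU, augments on the CPU, and only then reaches the GPU.
import openslide
from torch.utils.data import Dataset
from torchvision import transforms

cpu_augment = transforms.Compose([
    transforms.RandomCrop(512),
    transforms.RandomRotation(90),
    transforms.ColorJitter(brightness=0.1, saturation=0.1, hue=0.02),
    transforms.ToTensor(),
])

class SlideTileDataset(Dataset):
    def __init__(self, slide_paths, tile_coords):
        self.slide_paths = slide_paths        # one SVS / pyramidal TIFF per slide
        self.tile_coords = tile_coords        # (slide_index, x, y) per sample

    def __len__(self):
        return len(self.tile_coords)

    def __getitem__(self, i):
        slide_index, x, y = self.tile_coords[i]
        slide = openslide.OpenSlide(self.slide_paths[slide_index])       # re-opened on every access
        tile = slide.read_region((x, y), 0, (768, 768)).convert("RGB")   # CPU JPEG decode
        return cpu_augment(tile)              # CPU augmentation; the PCIe copy happens in the loop
```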

Two non-negotiables shaped the rebuild. PHI couldn't leave the hospital VLAN - WSI ingest, training data, and model artifacts all had to live inside the customer-controlled environment, with no cloud step in the loop. And the rebuild had to land on the same 8×A100 box the customer already owned; raising more capital to buy more GPUs because the data loader was inefficient wasn't a plan their seed round could afford.

02 · Move the entire data path to the GPU. Stop re-decoding tiles from disk.

Approach

Step 1: Zarr-backed pyramidal tile storage instead of OpenSlide re-reads.

We converted the curated cohort once, end-to-end, into a Zarr store with 2D chunks sized to the tile geometry the model trains on (512×512 at the working magnification by default, configurable per experiment), preserving the WSI pyramid so the model can sample at 5×/10×/20× context as needed. Compression with Blosc/Zstd kept the on-disk footprint comparable to the source SVS/pyramidal-TIFF while making random tile reads constant-time and parallel-safe. The Zarr layout sits on the customer's NVMe-backed shared storage; the dataloader maps chunks directly into pinned host memory and DMAs them into GPU memory without re-decoding JPEG-compressed tiles through OpenSlide on every epoch.
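
A minimal sketch of the one-time conversion, assuming an OpenSlide-readable source slide; the paths, chunk size, and compressor settings are illustrative rather than the exact production values:

```python
# One-time conversion sketch: base-level pixels from an OpenSlide-readable WSI
# written into a Zarr array whose chunks match the training tile geometry.
# The production script also writes the lower pyramid levels and slide metadata.
import numpy as np
import openslide
import zarr
from numcodecs import Blosc

TILE = 512
slide = openslide.OpenSlide("case_001_core_03.svs")        # example source slide
width, height = slide.dimensions

level0 = zarr.open(
    "cohort.zarr/case_001_core_03/level_0",
    mode="w",
    shape=(height, width, 3),
    chunks=(TILE, TILE, 3),                                 # one chunk per training tile
    dtype="uint8",
    compressor=Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE),
)

for y in range(0, height, TILE):
    for x in range(0, width, TILE):
        h, w = min(TILE, height - y), min(TILE, width - x)
        region = slide.read_region((x, y), 0, (w, h)).convert("RGB")
        level0[y:y + h, x:x + w, :] = np.asarray(region)
```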

Step 2: cuCIM for GPU-resident preprocessing and stain normalization.

cuCIM (NVIDIA's GPU-accelerated digital-pathology library) takes over the work that used to run on the CPU: tissue-mask computation from the thumbnail, foreground tile filtering, Macenko stain normalization against a reference H&E target, and per-scanner color standardization - all on the device, in CuPy/Torch-compatible memory. Tiles that come off Zarr go straight into cuCIM kernels and stay on the GPU. No CPU bounce, no PIL re-encode, no tensor copy back across PCIe.
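
A hedged sketch of the tissue-mask and foreground-filtering portion of that step, using cucim.skimage on CuPy arrays already resident on the device; thresholds and function names are illustrative, and the Macenko normalization kernel that follows it is omitted here:

```python
# Sketch of GPU-resident tissue masking + foreground tile filtering (illustrative
# thresholds; the Macenko stain-normalization step runs afterwards, also on-device).
import cupy as cp
from cucim.skimage.color import rgb2gray
from cucim.skimage.filters import threshold_otsu
from cucim.skimage.morphology import remove_small_objects

def tissue_mask_from_thumbnail(thumb_rgb: cp.ndarray) -> cp.ndarray:
    """thumb_rgb: (H, W, 3) uint8 slide thumbnail already on the GPU."""
    gray = rgb2gray(thumb_rgb.astype(cp.float32) / 255.0)    # luminance, computed on-device
    mask = gray < threshold_otsu(gray)                       # tissue is darker than glass background
    return remove_small_objects(mask, min_size=64)           # drop specks of debris

def keep_foreground_tiles(tile_coords, mask: cp.ndarray, scale: float, min_tissue: float = 0.25):
    """Keep base-level tile coords whose thumbnail footprint is at least `min_tissue` tissue."""
    kept = []
    for x, y, size in tile_coords:
        x0, y0, s = int(x * scale), int(y * scale), max(1, int(size * scale))
        if float(mask[y0:y0 + s, x0:x0 + s].mean()) >= min_tissue:
            kept.append((x, y, size))
    return kept
```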

Step 3: Kornia for tile augmentation in the same GPU memory.

Augmentation is where most CPU pipelines collapse: stain perturbations, elastic deformations, random affine transforms, and color jitter at WSI tile rates are expensive enough that even a 32-core CPU can't feed 8 A100s. Kornia runs the same family of augmentations natively on the GPU - affine, elastic, color, HED jitter, noise, and the random tile sampling itself - operating directly on the cuCIM output tensor. Augmentation is part of the training step's compute graph, not a separate process boundary the GPUs wait on.
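
A minimal sketch of a GPU-side augmentation graph in Kornia, applied jointly to tiles and segmentation masks; the transform set and parameters are illustrative, not the production configuration:

```python
# Minimal sketch of a GPU-side augmentation graph applied jointly to tiles and
# masks; parameters are illustrative rather than the production settings.
import torch
import kornia.augmentation as K

augment = K.AugmentationSequential(
    K.RandomHorizontalFlip(p=0.5),
    K.RandomVerticalFlip(p=0.5),
    K.RandomAffine(degrees=90, translate=(0.05, 0.05), scale=(0.9, 1.1), p=0.7),
    K.RandomElasticTransform(alpha=(1.0, 1.0), sigma=(32.0, 32.0), p=0.3),
    K.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1, hue=0.02, p=0.8),
    K.RandomGaussianNoise(mean=0.0, std=0.01, p=0.2),
    data_keys=["input", "mask"],          # geometric transforms are applied to the mask as well
).to("cuda")

tiles = torch.rand(8, 3, 512, 512, device="cuda")             # batch straight off Zarr + cuCIM
masks = torch.randint(0, 2, (8, 1, 512, 512), device="cuda").float()
tiles_aug, masks_aug = augment(tiles, masks)                  # stays on the GPU, inside the training step
```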

Step 4: Eval-gated, deterministic, reproducible.

The augmentation pipeline is seeded per-batch and the Zarr chunking is content-addressed, so any training run is exactly reproducible from the run config plus the dataset hash. Validation runs without augmentation through the same code path - same cuCIM stain normalization, same tile geometry - so the eval numbers measure the model, not a second preprocessing pipeline drifting away from training. Training metrics, sweeps, and validation per-cohort breakdowns all land in a Weights & Biases workspace the pathologists and ML team share.
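
A sketch of what that contract looks like in code, with placeholder names (normalize_stains stands in for the cuCIM path, augment for the Kornia graph above); in production the per-batch seed is derived from the run seed plus the batch index:

```python
# Sketch of the reproducibility contract: seed and dataset hash pinned in the
# run config, the same normalization path for train and val, augmentation only
# when training. Function names are illustrative placeholders.
import random
import numpy as np
import torch
import wandb

def seed_everything(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

config = {"seed": 1337, "tile": 512, "dataset_hash": "sha256:<cohort-hash>"}   # placeholder hash
seed_everything(config["seed"])
run = wandb.init(project="prostate-gleason", config=config)   # shared W&B workspace

def make_batch(tiles, masks, training: bool):
    tiles = normalize_stains(tiles)       # same cuCIM stain normalization for train and val
    if training:
        tiles, masks = augment(tiles, masks)   # Kornia graph from the previous step, train only
    return tiles, masks
```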

03 · GPU-bound training. Iteration loop in hours, not days.

Result

  • Single-epoch wallclock dropped from ~9 hours to ~50 minutes on the same 8×A100 box - about a 10× training-throughput lift, no new hardware.
  • GPU utilization moved from <30% (CPU-bound dataloader) to >90% sustained (GPU-bound on the actual model).
  • Random tile reads are constant-time off the Zarr store regardless of cohort size - the pipeline scales as the customer adds annotated slides, instead of slowing down.
  • Stain normalization and augmentation run in the same GPU memory as preprocessing and the model - zero PCIe round-trips per batch, no CPU augmentation worker pool to tune.
  • The whole pipeline runs inside the hospital VLAN - WSI ingest → Zarr build → training → eval, no PHI egress, no cloud dependency.
  • The training pipeline ships in the customer's repo with the Zarr build script, the cuCIM/Kornia transform graph, the W&B project, and a runbook the team can follow as new annotated slides come in.

The win wasn't a new model architecture or a clever loss - those weren't the bottleneck. The bottleneck was a CPU-bound dataloader and a redundant disk path masquerading as a training problem. Once the data path lives on the GPU, the team's iteration cadence catches up to the rate the pathologists are producing labels, and the actual model work - the part the medical results depend on - finally gets the compute it needed all along.

// Expert insight

Digital pathology teams burn enormous amounts of GPU time waiting on an OpenSlide dataloader and a tile-augmentation pipeline that never made it off the CPU. Zarr for storage, cuCIM for stain normalization and preprocessing, Kornia for augmentation - once the whole data path lives on the GPU, the iteration cadence finally matches the rate the pathologists are producing labels. That's the engagement.
Michał Pogoda-Rosikoń

Co-founder @ bards.ai

// Ready to ship?

Let's build something that delivers numbers like these.

Book a meeting