Multi-stage AI pipelines for medical image analysis can produce structurally implausible outputs (disconnected segmentation masks, lesions placed entirely outside the organ) that propagate to downstream analysis and scoring. We evaluate whether simple, deterministic quality control checks at multiple pipeline stages can catch such failures in a model-agnostic, interpretable, and low-cost manner.
We implemented deterministic quality gates (pure functions requiring no GPU and no learned parameters) at two stages of a prostate cancer detection pipeline: organ segmentation (8 gates) and lesion detection (3 gates). Thresholds were calibrated on 50 expert segmentations from PROMISE12 and then applied, without modification, to 1,500 PI-CAI cases across four segmentation models of widely varying quality.
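A minimal sketch of what such a deterministic gate can look like: a pure function on a binary segmentation mask, no GPU or learned parameters. The gate names, connectivity choice, and voxel threshold below are illustrative assumptions, not the pipeline's actual eight organ-segmentation gates.

```python
from collections import deque
import numpy as np

def connected_components(mask: np.ndarray) -> int:
    """Count 6-connected components in a 3D binary mask (pure BFS, no SciPy)."""
    visited = np.zeros_like(mask, dtype=bool)
    count = 0
    for seed in zip(*np.nonzero(mask)):
        if visited[seed]:
            continue
        count += 1
        queue = deque([seed])
        visited[seed] = True
        while queue:
            z, y, x = queue.popleft()
            for dz, dy, dx in ((1,0,0), (-1,0,0), (0,1,0),
                               (0,-1,0), (0,0,1), (0,0,-1)):
                n = (z + dz, y + dy, x + dx)
                if all(0 <= c < s for c, s in zip(n, mask.shape)) \
                        and mask[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
    return count

def organ_gate(mask: np.ndarray, min_voxels: int = 100) -> tuple[bool, str]:
    """Return (passed, reason). Rejects empty, implausibly small,
    or fragmented masks; threshold is a hypothetical calibrated value."""
    if mask.sum() == 0:
        return False, "empty mask"
    if mask.sum() < min_voxels:
        return False, "implausibly small volume"
    if connected_components(mask) > 1:
        return False, "disconnected components"
    return True, "ok"
```

Because the gate is a deterministic pure function of the mask, the same thresholds can be reused across models and datasets without retraining, which is the transfer property evaluated in the results below.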
To assess generality beyond prostate MRI, we repeated the evaluation on two additional domains: liver tumor segmentation (CT, N=131) and kidney tumor segmentation (CT, N=489).
Gates calibrated on 50 expert segmentations transferred without modification across all four models on 1,500 cases, with rejection rates scaling with model weakness, from 4.8% for the strongest model to 93% for the weakest.
Gates at different pipeline stages caught nearly disjoint sets of failures: for the strongest model, only 2 of 143 rejections overlapped between stages.
The same pattern held across prostate (MRI, N=1,500), liver (CT, N=131), and kidney (CT, N=489) with 0-2% cross-stage overlap in all three domains.
Gate filtering improved downstream lesion containment more than random case removal for two of three models (bootstrap test, 10,000 iterations).
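A bootstrap comparison of this kind can be sketched as follows. The per-case containment scores, the mean statistic, and the function name are hypothetical; the test asks how often removing the same number of cases at random does as well as gate filtering.

```python
import numpy as np

def bootstrap_vs_random(scores, rejected, n_iter=10_000, seed=0):
    """Compare gate filtering against random case removal.

    scores   : per-case downstream quality scores (e.g. lesion containment)
    rejected : indices of cases the gates rejected
    Returns (gate_mean, p), where p is the fraction of n_iter random
    removals of len(rejected) cases whose mean matches or beats gate_mean.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    keep = np.setdiff1d(np.arange(scores.size), rejected)
    gate_mean = scores[keep].mean()
    k = len(rejected)
    hits = 0
    for _ in range(n_iter):
        drop = rng.choice(scores.size, size=k, replace=False)
        if np.delete(scores, drop).mean() >= gate_mean:
            hits += 1
    return gate_mean, hits / n_iter
```

A small p indicates that gate filtering improved the downstream metric more than chance-level case removal, which is the form of the comparison reported above.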