Four of the Most-Cited Medical Imaging Datasets Fail Basic Provenance Checks. What That Means for Clinical AI Developers.
A new open standard called VIDS (Verified Imaging Dataset Standard) was published in April 2026 by Princeton Medical Systems. As part of the paper, the authors benchmarked four of the most heavily cited public medical imaging datasets — LIDC-IDRI, BraTS, CheXpert, and the Medical Segmentation Decathlon — against 22 dataset-compliance dimensions.
The headline numbers are uncomfortable.
- CheXpert: 20%
- LIDC-IDRI: 27%
- Medical Segmentation Decathlon: 30%
- BraTS: 39%
- Average: 29%
On annotation provenance specifically, the average drops to 8%.
This is one of the more consequential medical-AI data infrastructure papers we have seen so far in 2026, and the implications cut directly into how clinical AI developers, sponsors, and regulatory affairs teams should be thinking about training data going into 2027.
Disclosure: VIDS was developed by Princeton Medical Systems. ClinStacks is covering it because we believe the benchmark raises a broader data-governance issue for clinical AI developers, not because of any affiliation with the authors.
This article walks through what VIDS is, what the benchmark actually measured, why provenance is the largest gap, and what we recommend you do about it depending on whether you're acquiring datasets, publishing them, or submitting AI products to FDA, EMA, or notified bodies.
What VIDS Is — and What It Is Not
VIDS is an open specification, published April 19, 2026, that defines:
- A folder layout for multi-modality medical imaging datasets
- A file-naming convention (extends BIDS)
- A structured annotation provenance schema
- Mandatory quality documentation
- ML-readiness requirements (train/val/test splits with leakage prevention)
- 21 machine-enforceable validation rules packaged as a reference validator on PyPI
This is the part that matters: VIDS makes dataset quality assessable, not guaranteed. A dataset with thin provenance — say, an annotator ID but no credentials, dates, or QC review — can still pass validation. But the gap is now structured, machine-readable, and immediately visible to any reviewer examining the metadata.
That distinction is important. VIDS is not claiming to fix dataset quality. It is making dataset quality inspectable. Two very different things, and we'll come back to why that distinction matters for regulatory submissions.
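To make one item on that list concrete: the ML-readiness requirement asks for train/val/test splits with leakage prevention. Below is a minimal sketch of a subject-level leakage check. It is our illustration, not the reference validator's code; the function name, the split names, and the BIDS-style subject IDs are all assumptions.

```python
from collections import defaultdict
from typing import Dict, List


def find_split_leakage(splits: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Return every subject that appears in more than one split.

    The result maps each leaking subject ID to the sorted names of the
    splits that contain it. An empty dict means no subject-level leakage.
    """
    splits_by_subject = defaultdict(set)
    for split_name, subject_ids in splits.items():
        for subject_id in subject_ids:
            splits_by_subject[subject_id].add(split_name)
    return {
        subject_id: sorted(names)
        for subject_id, names in splits_by_subject.items()
        if len(names) > 1
    }


# Hypothetical split assignment: sub-002 was placed in both train and test.
example_splits = {
    "train": ["sub-001", "sub-002"],
    "val": ["sub-003"],
    "test": ["sub-002", "sub-004"],
}
leaks = find_split_leakage(example_splits)
```

Checking at the subject level rather than the image level matters because one patient often contributes several scans, and a model that sees any of them in training has effectively seen the patient.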
VIDS sits in a layer that existing standards do not occupy:
- **DICOM** standardizes image acquisition and transmission at the individual study level. It does not address how curated datasets should be organized for AI development.
- **BIDS** organizes neuroimaging research datasets with consistent naming. It does not enforce annotation provenance or quality documentation, and it is specific to neuroimaging.
- **COCO and Pascal VOC** define annotation schemas for natural images. They were never designed for volumetric medical imaging or for provenance.
- **Datasheets for Datasets, Data Cards, and RSNA ATLAS** document datasets at the metadata level. They do not enforce file structure or validate compliance automatically.
What the Benchmark Actually Measured
The 22 compliance dimensions span six categories. The breakdown is:
| Category | Dimensions | What it captures |
|---|---|---|
| Structure | 6 | Dataset marker, description, participant registry, README, subject and session hierarchy |
| Imaging | 3 | Standardized format, per-image metadata sidecars, consistent file naming |
| Annotation | 4 | Annotation directory layout, segmentation masks, per-annotation metadata sidecars, machine-readable label maps |
| Provenance | 5 | Annotator identity, annotator credentials, tool used, annotation date, QC review documentation |
| Quality | 2 | Inter-annotator agreement, quality summary |
| ML Readiness | 2 | Documented splits, split rationale |
Each dimension is scored as satisfied (1.0), partial (0.5), or absent (0.0). Partial credit was given when the information existed somewhere — for example, in a companion paper — but was not in machine-readable form within the dataset itself.
Here is the full per-category breakdown across the four datasets:
| Category | LIDC-IDRI | BraTS | CheXpert | MSD |
|---|---|---|---|---|
| Structure (6) | 1.5 | 2.0 | 1.5 | 1.5 |
| Imaging (3) | 1.0 | 2.0 | 1.0 | 2.0 |
| Annotation (4) | 1.5 | 2.0 | 1.0 | 2.0 |
| Provenance (5) | 1.0 | 0.5 | 0.0 | 0.0 |
| Quality (2) | 1.0 | 1.0 | 0.0 | 0.0 |
| ML Readiness (2) | 0.0 | 1.0 | 1.0 | 1.0 |
| Total (22) | 6.0 (27%) | 8.5 (39%) | 4.5 (20%) | 6.5 (30%) |
Two things to notice.
First, no dataset scores well on provenance. The highest is LIDC-IDRI at 1.0/5 (20%), and that is largely because LIDC published an extensive companion paper documenting its annotation protocol — which still counts only as partial credit, because the information is not in machine-readable form within the dataset files. CheXpert and MSD score 0/5 on provenance. Zero.
Second, the structure and imaging categories are partially covered everywhere. All four datasets have some directory organization and some imaging standardization. The gaps are concentrated in the categories that regulators care most about: provenance, quality documentation, and to a lesser extent ML readiness.
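The totals and category averages quoted in this article can be reproduced from the table with a few lines of Python. This is our own arithmetic on the published scores, not code from the paper; the variable and function names are ours.

```python
# Maximum attainable score per category (one point per dimension).
MAX_PER_CATEGORY = {
    "Structure": 6, "Imaging": 3, "Annotation": 4,
    "Provenance": 5, "Quality": 2, "ML Readiness": 2,
}

# Per-category scores transcribed from the benchmark table.
# Each dimension is scored 1.0 (satisfied), 0.5 (partial), or 0.0 (absent).
SCORES = {
    "LIDC-IDRI": {"Structure": 1.5, "Imaging": 1.0, "Annotation": 1.5,
                  "Provenance": 1.0, "Quality": 1.0, "ML Readiness": 0.0},
    "BraTS":     {"Structure": 2.0, "Imaging": 2.0, "Annotation": 2.0,
                  "Provenance": 0.5, "Quality": 1.0, "ML Readiness": 1.0},
    "CheXpert":  {"Structure": 1.5, "Imaging": 1.0, "Annotation": 1.0,
                  "Provenance": 0.0, "Quality": 0.0, "ML Readiness": 1.0},
    "MSD":       {"Structure": 1.5, "Imaging": 2.0, "Annotation": 2.0,
                  "Provenance": 0.0, "Quality": 0.0, "ML Readiness": 1.0},
}


def dataset_total(name: str) -> float:
    """Sum of a dataset's scores across all six categories (out of 22)."""
    return sum(SCORES[name].values())


def dataset_percent(name: str) -> float:
    """A dataset's total as a percentage of the 22 available points."""
    return 100 * dataset_total(name) / sum(MAX_PER_CATEGORY.values())


def category_average_percent(category: str) -> float:
    """Mean score for one category across the four datasets, as a percentage."""
    mean_score = sum(scores[category] for scores in SCORES.values()) / len(SCORES)
    return 100 * mean_score / MAX_PER_CATEGORY[category]
```

`category_average_percent("Provenance")` evaluates to 7.5, which the article reports as 8%, and `category_average_percent("Quality")` evaluates to 25.0.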
The 8% Provenance Number Is the One That Should Worry You
Annotation provenance is the chain of custody for the labels your AI model learned from. It is who annotated what, when, with what tool, with what credentials, and reviewed by whom under what process.
If your AI product was trained on a public dataset and your sponsor or a regulator asks "where did these labels come from?", annotation provenance is the answer. And on the four most-cited public datasets, the average provenance score is 0.375 out of 5 (1.0, 0.5, 0.0, and 0.0 across the four), which rounds to 8%.
Let's be specific about what is missing.
Annotator identity. LIDC-IDRI documents that four radiologists annotated each scan, but anonymizes individual reader identities. BraTS, CheXpert, and MSD provide no per-annotation reader identity at all. This means that for any model trained on these datasets, there is no way to construct a per-annotation chain of custody from the dataset files themselves.
Annotator credentials. None of the four datasets document board certification, years of experience, or specialty for individual annotators in machine-readable form. Where this information exists at all, it is in companion publications.
Annotation tool. None of the four datasets record, per annotation, which software and version produced the segmentation or label. This matters because tool version affects results — a known issue in segmentation reproducibility — and tool changes during a dataset's lifetime create silent drift in label characteristics.
Annotation date. Per-annotation timestamps are absent in all four datasets in machine-readable form. This makes it impossible to reconstruct, after the fact, whether a particular annotation was made early in a project (when consensus protocols may have been looser) or late (after refinement).
QC review. Per-annotation review status — who reviewed it, when, with what outcome — is not present in the dataset files of any of the four datasets in structured form. LIDC-IDRI describes its two-phase blinded review process in the companion publication, but the per-annotation review record is not extractable from the dataset.
These five elements are not regulatory exotica. They are the basic chain-of-custody facts that any clinical AI submission package will be asked to support. The fact that the most-cited public training datasets in the field provide structured access to roughly 8% of this information is the operating constraint every clinical AI developer is working under, whether or not they realize it.
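To make the five elements concrete, here is a sketch that checks a per-annotation metadata record for them. The field names are hypothetical, chosen by us for illustration; the actual VIDS sidecar schema is defined in the specification and may well differ.

```python
from typing import List

# ASSUMPTION: these keys are illustrative stand-ins for the five provenance
# elements discussed above, not the field names used by the VIDS schema.
PROVENANCE_FIELDS = (
    "annotator_id",           # who annotated
    "annotator_credentials",  # board certification, experience, specialty
    "annotation_tool",        # software and version
    "annotation_date",        # when the annotation was made
    "qc_review",              # who reviewed it, when, with what outcome
)


def provenance_gaps(record: dict) -> List[str]:
    """List the provenance fields that are missing or empty in a record."""
    return [field for field in PROVENANCE_FIELDS if not record.get(field)]


def provenance_coverage(record: dict) -> float:
    """Fraction of the five provenance fields that the record populates."""
    present = len(PROVENANCE_FIELDS) - len(provenance_gaps(record))
    return present / len(PROVENANCE_FIELDS)


# A "thin" record: an annotator ID and nothing else.
thin_record = {"annotator_id": "reader-03"}
missing = provenance_gaps(thin_record)
coverage = provenance_coverage(thin_record)
```

For the thin record, `coverage` is 0.2 and `missing` lists the other four fields. That is the kind of gap that stays invisible in prose documentation but is trivial to surface once the metadata is structured.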
The 25% Quality Documentation Gap
The second-largest gap is quality documentation: inter-annotator agreement scores and a structured quality summary. The average score across the four datasets is 25%, or 0.5 of 2 dimensions, and that average hides a sharp divide.
LIDC-IDRI and BraTS score 1.0/2 each. Both have inter-annotator data that exists somewhere — either computable from the raw annotations (LIDC) or published in challenge papers (BraTS) — but neither provides it as a structured file within the dataset itself.
CheXpert and MSD score 0/2. No quality documentation at all in either dataset.
For a sponsor or QA team trying to assess whether a public dataset is fit-for-purpose for a particular regulatory submission, this is the information they need first: how much do the annotators agree, and on what kinds of cases do they disagree? When that information isn't structured, the assessment becomes an unfunded research project rather than a checkable data point.
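One common way to quantify agreement between annotators on segmentation masks is the pairwise Dice coefficient. The sketch below is our illustration; the source does not say which agreement metric VIDS expects. Masks are represented as sets of voxel coordinates to keep the example dependency-free, and all names are ours.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Voxel = Tuple[int, int, int]


def dice(mask_a: Set[Voxel], mask_b: Set[Voxel]) -> float:
    """Dice coefficient: 2 * |A and B| / (|A| + |B|).

    Two empty masks are treated as perfect agreement (1.0).
    """
    if not mask_a and not mask_b:
        return 1.0
    return 2 * len(mask_a & mask_b) / (len(mask_a) + len(mask_b))


def pairwise_dice(masks: Dict[str, Set[Voxel]]) -> Dict[Tuple[str, str], float]:
    """Dice score for every unordered pair of annotators."""
    return {
        (first, second): dice(masks[first], masks[second])
        for first, second in combinations(sorted(masks), 2)
    }


def mean_pairwise_dice(masks: Dict[str, Set[Voxel]]) -> float:
    """Average Dice over all annotator pairs; needs at least two annotators."""
    scores = list(pairwise_dice(masks).values())
    if not scores:
        raise ValueError("need masks from at least two annotators")
    return sum(scores) / len(scores)


# Hypothetical toy masks from two readers that overlap on one of two voxels.
toy_masks = {
    "reader-01": {(0, 0, 0), (0, 0, 1)},
    "reader-02": {(0, 0, 1), (0, 0, 2)},
}
agreement = mean_pairwise_dice(toy_masks)
```

Storing the per-pair scores, not only the mean, is what lets a reviewer see which kinds of cases drive the disagreement.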
What This Means If You're Acquiring Datasets
If you are a sponsor, CRO, or pharma team acquiring imaging datasets — whether from public sources, internal repositories, or third-party vendors — the operating reality changes in three ways.
You now have a checkable acceptance criterion. VIDS validation produces a binary PASS/FAIL and a structured JSON report. This is exportable to procurement workflows. "Datasets delivered to us must pass VIDS Full profile" is a contractual clause that can be checked in seconds, not weeks. We would not be surprised to see this language appear in dataset procurement agreements over the next 12-18 months, particularly for datasets supporting regulatory submissions.
You can now audit the datasets you have. Most internal imaging dataset repositories have grown organically over years, with documentation conventions that drift across projects. Running VIDS validation across these repositories produces a structured map of where the gaps are. This is exactly the kind of audit that regulatory affairs teams have wanted to do but have not had a standard to execute against. The validator is open source, runs in seconds on 100-subject datasets, and produces machine-readable output.
You can now require provenance from vendors. "We will not accept datasets without VIDS Full profile compliance" is now a line you can hold. Before VIDS, the equivalent demand was a paragraph describing what you wanted, which left interpretation open. Now there is a specification and a validator. Compliance is no longer a matter of opinion.
We expect the biggest near-term operational shift from VIDS to be on this side: dataset acquirers using validation PASS as a precondition rather than a nice-to-have.
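As a sketch of what a validation gate might look like inside a procurement workflow, the snippet below turns a validator report into an accept/reject decision with reasons. The report layout used here (`profile`, `status`, and a `rules` list with `id` and `passed`) is invented for illustration. We have not reproduced the real validator's JSON schema, so check its actual output before adapting this.

```python
import json
from typing import List, Tuple


def acceptance_decision(report_text: str,
                        required_profile: str = "full") -> Tuple[bool, List[str]]:
    """Decide whether a delivery meets the contract, given a validator report.

    ASSUMPTION: the report keys ("profile", "status", "rules" with
    "id"/"passed") are hypothetical, not the real vids-validator schema.
    Returns (accepted, reasons); reasons is empty when accepted is True.
    """
    report = json.loads(report_text)
    reasons: List[str] = []

    profile = report.get("profile")
    if profile != required_profile:
        reasons.append(
            f"validated against profile {profile!r}, "
            f"contract requires {required_profile!r}"
        )
    if report.get("status") != "PASS":
        reasons.append("overall status is not PASS")
    for rule in report.get("rules", []):
        if not rule.get("passed", False):
            reasons.append(f"rule failed: {rule.get('id', 'unknown')}")

    return (len(reasons) == 0, reasons)


# Hypothetical report for a delivery that fails one rule.
EXAMPLE_REPORT = json.dumps({
    "profile": "full",
    "status": "FAIL",
    "rules": [
        {"id": "example-rule-a", "passed": True},
        {"id": "example-rule-b", "passed": False},
    ],
})
accepted, reasons = acceptance_decision(EXAMPLE_REPORT)
```

Returning the reasons alongside the verdict gives the vendor a concrete punch list instead of a bare rejection.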
What This Means If You're Publishing Datasets
If you are publishing imaging datasets — whether as an academic group, a consortium, a hospital health system releasing internal data, or a vendor delivering to a sponsor — the calculus changes too.
The bar moves. As VIDS compliance becomes a procurement criterion, datasets that do not pass validation will become harder to monetize, harder to publish in venues that expect dataset standards, and harder to use as the basis for regulatory submissions. This will not happen overnight. But once standards become embedded in procurement or submission workflows, they are difficult to unwind.
The migration path is bounded. The VIDS authors released a reference dataset called LIDC-Hybrid-100 — a 100-subject CT subset converted from LIDC-IDRI into a fully VIDS-compliant format with consensus segmentation masks from four-radiologist annotations. The conversion pipeline is documented in the paper. For a dataset of similar size, the conversion work is real but bounded — and the result is a 21/21 PASS on the Full profile. The key implication is that this is an engineering project, not an infinite undertaking. A well-resourced team can convert legacy datasets into VIDS-compliant form in weeks, not years.
There are two profiles for a reason. VIDS defines a POC profile (15 rules) for prototypes, internal research, and pilot deliveries, and a Full profile (21 rules) for production work, publications, and regulatory submissions. This is deliberate. The POC profile lowers the adoption bar for groups that cannot yet commit to inter-annotator agreement studies or formal split documentation. Most dataset publishers should aim for POC first, then graduate to Full as their internal processes mature. The two-profile structure is the kind of incremental adoption ramp that lets a standard actually get adopted.
What This Means If You're Submitting to Regulators
This is where the implications get sharpest.
EU AI Act, Article 10 requires data governance, quality, representativeness, and bias control for high-risk AI systems. Medical imaging AI products targeting EU markets fall squarely in scope. The VIDS authors explicitly map the 21 validation rules to these requirements: dataset_description.json provides provenance and traceability; participants.json enables demographic bias analysis; annotation sidecars document the annotation chain of custody; quality documentation provides inter-annotator agreement evidence; and the splits file prevents data leakage.
IMDRF Good Machine Learning Practice (GMLP) — authored by ten international regulators including FDA, Health Canada, and the MHRA — requires data quality, provenance, and annotation protocols. Same mapping applies.
FDA AI/ML SaMD Action Plan emphasizes traceability and lifecycle documentation. VIDS structured artifacts attach cleanly as documentation of data governance practices.
What VIDS does not do — and the authors are explicit about this — is certify regulatory compliance. A VIDS PASS does not equal an FDA clearance. It produces structured, machine-readable evidence that can be attached to a regulatory submission. That is a substantial step forward from the current default, which is unstructured prose in a 510(k) document referencing a dataset that no one outside the original group has audited.
The practical translation: if you are preparing an AI/ML imaging submission in the next 12-24 months, building VIDS-compliant artifacts now will reduce friction during the regulatory review cycle. It will not eliminate review questions. It will reduce the number of questions about data governance that arrive as preventable surprises.
We expect notified bodies to become increasingly receptive to VIDS-like structured dataset evidence, even if they do not endorse VIDS by name. We do not yet have public confirmation of this from any specific notified body, but the direction of regulatory expectations is clear.
What VIDS Does Not Do
VIDS is a delivery standard, not a quality guarantee. Anyone evaluating it for their own program should be clear about its limits.
- **It does not cover 4D temporal sequences.** Cardiac cine MRI, dynamic imaging, time-series studies — these are out of scope for v1.0. v2.0 may address this.
- **It does not cover non-imaging clinical data.** EHR data, genomics, and lab values are out of scope. Linkage to clinical metadata is on the roadmap but not in v1.0.
- **It does not certify quality.** A dataset can pass VIDS validation with thin provenance, low inter-annotator agreement, and demographic gaps. These will be visible in the structured metadata, but they will not block a PASS.
- **It does not yet have notified-body endorsement.** No regulator or notified body has formally endorsed VIDS as of May 2026. Adoption is expected to be community-driven first, regulatory acknowledgment second.
- **It does not yet support DICOM as a native storage format.** NIfTI is the canonical working format in v1.0. For teams whose imaging pipelines are DICOM-native end-to-end, this is a real engineering consideration. DICOM SEG-native support is under consideration for v2.0.
A Practical Path Forward
If we were standing up a new clinical AI program today, our recommended sequence would be:
Week 1 — Run the validator on what you already have. pip install vids-validator, then point it at every imaging dataset your team is using. Get the structured JSON report. This is your audit baseline. You will likely discover that your internal datasets fall in the same 20-39% range as the public datasets, and the gaps will cluster in the same places.
Week 2-4 — Decide where the gaps actually matter. Not every dataset needs Full profile compliance. Internal exploration data may be fine at POC. Regulatory-submission-bound data should target Full. Make this decision deliberately rather than uniformly.
Month 2-3 — Build VIDS-compliant ingest for new annotations. This is where the highest leverage lies. New annotations going forward should be captured in VIDS-compliant sidecars from day one. The marginal cost of doing this at the moment of annotation is small. The cost of retrofitting it later is large.
Month 3-6 — Plan retrospective conversions selectively. For datasets that will be used in a submission or licensed externally, plan a conversion to VIDS Full profile. Budget for inter-annotator agreement studies on datasets where this has not been done. Treat this as a finite, plannable engineering project, not an open-ended commitment.
Ongoing — Treat VIDS validation as a procurement gate. Datasets delivered by vendors should pass validation as a contractual precondition. Internal datasets going into model training should pass validation as an internal QA gate.
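For the Month 2-3 step, capturing provenance at the moment of annotation can be as simple as writing a small JSON file next to each mask when it is saved. The sketch below shows the idea. The field names, the sidecar naming rule, and the tool name are our assumptions for illustration and are not taken from the VIDS schema.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def sidecar_path_for(mask_path: Path) -> Path:
    """Place the JSON sidecar next to the mask, sharing its base name."""
    base_name = mask_path.name.split(".")[0]  # drops ".nii" or ".nii.gz"
    return mask_path.with_name(base_name + ".json")


def build_sidecar(annotator_id: str, credentials: str,
                  tool_name: str, tool_version: str,
                  when: Optional[datetime] = None) -> dict:
    """Assemble a provenance record at the moment of annotation.

    ASSUMPTION: these keys are hypothetical, not the VIDS field names.
    """
    timestamp = when or datetime.now(timezone.utc)
    return {
        "annotator_id": annotator_id,
        "annotator_credentials": credentials,
        "annotation_tool": {"name": tool_name, "version": tool_version},
        "annotation_date": timestamp.isoformat(),
        "qc_review": None,  # filled in later by the reviewer, not the annotator
    }


def write_sidecar(mask_path: Path, record: dict) -> Path:
    """Write the record as JSON beside the mask and return the sidecar path."""
    path = sidecar_path_for(mask_path)
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

Calling `build_sidecar` from the same code path that saves the mask means the tool version and timestamp are recorded by the software, not reconstructed from memory months later.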
This is not a transformation. It is a discipline. The standard exists, the validator runs in seconds, and the benchmark has surfaced where the systematic gaps are. The work is in adopting it.
What We Think
A few opinions to leave you with.
The 29% number is not an indictment of any specific dataset. LIDC-IDRI, BraTS, CheXpert, and MSD were assembled at a time when no standard required structured provenance. The authors of those datasets did remarkable work under the conventions of the era. The point of the benchmark is not to score them — it is to make visible the systematic gap that has been hiding in plain sight across the entire field for a decade.
The biggest practical impact will be in dataset procurement, not in dataset publishing. The fastest-moving constituency will be sponsors and CROs adding VIDS PASS as a contractual requirement, because they have the most concentrated incentive — they pay for the data, they take the regulatory risk on the downstream submission, and they have legal teams who can write the clause. Dataset publishers will follow.
Regulatory adoption will be slower than industry adoption. Notified bodies and FDA will not endorse VIDS by name in the next 12 months. But they will increasingly ask, in pre-submission meetings, "how do you know your training data is governed?" — and VIDS-compliant artifacts will be the cleanest answer available.
Provenance is the advantage. The category where the gap is widest — 8% across four major datasets — is exactly the category that is hardest to retrofit and most expensive to skip. Teams that build structured provenance into their pipelines from now forward will have a material advantage over teams that try to reconstruct it during a submission cycle three years from now.
Standards adoption is one-way. Once VIDS validation becomes a procurement criterion at one major sponsor, it becomes a market expectation. Once it becomes a market expectation, it becomes a default. Once it becomes a default, datasets that do not pass it become legacy assets. The window in which "we'll handle provenance later" is a viable strategy is closing.
The VIDS specification, validator, and reference dataset are all freely available:
- Specification: [vidsstandard.org](https://vidsstandard.org)
- Validator: `pip install vids-validator`
- Source: github.com/vids-standard/vids-standard
- Reference dataset (LIDC-Hybrid-100): doi.org/10.5281/zenodo.19582717
- Paper: arxiv.org/abs/2604.17525
This article is part of the ClinStacks AI Compliance series. For our companion piece on how the FDA evaluates AI/ML credibility in regulatory submissions, see The FDA's 7-Step AI Credibility Framework.