ClinStacks
Opinion

AI Endpoints Are the Weakest Link in Modern Clinical Trials


This is an opinion piece. We think it's a defensible one.

AI-powered endpoints are entering clinical trial protocols at an accelerating rate. Tumor volumetry from automated segmentation. Cardiac function measured by AI-assisted echocardiography. Retinal grading from deep learning models. Digital biomarkers from wearable sensors processed by machine learning pipelines.

Each of these sounds like progress. In many cases, it is. But the infrastructure supporting these endpoints — the training data, the validation evidence, the regulatory framework — has not kept pace with adoption. And when the endpoint is the thing your entire trial is powered around, that gap isn't theoretical. It's a protocol risk.

The training data problem is worse than you think

We recently analyzed Project Imaging-X, a survey of 1,000+ open-access medical imaging datasets. The findings are sobering for anyone relying on AI imaging endpoints.

The data landscape is radically skewed. Brain, lung, and retinal imaging dominate. Cardiac, musculoskeletal, and gastrointestinal imaging are significantly underrepresented. Classification and segmentation tasks have abundant training data. Detection, tracking, and temporal reasoning tasks do not.

What this means in practice: an AI model measuring tumor response in a lung CT trial is operating with a reasonable training data foundation. An AI model measuring cardiac function in an echocardiography trial is operating on comparatively thin ice. Both might perform well on curated benchmarks. Only one has the training data depth to generalize reliably across the diversity of sites, scanners, and patient populations in a multicenter trial.

Protocol teams rarely ask about training data composition when selecting AI endpoints. They should.

The validation gap is structural

Only a small fraction of FDA-cleared AI medical devices underwent prospective testing, and fewer still reported the demographics of their validation cohorts. This isn't a quality issue with individual manufacturers — it reflects a structural gap in how AI medical devices reach market.

The typical path: train on retrospective data, validate on a curated test set, clear through FDA, then deploy into clinical workflows that look nothing like the validation environment. Different scanners. Different patient populations. Different acquisition protocols. Different disease prevalence.

When that same AI device becomes the measurement tool for a clinical trial endpoint, every limitation of that validation path becomes an endpoint reliability question. Is the AI measuring what you think it's measuring, in the population you're studying, at the sites you're using?

Most protocols don't address this explicitly. The statistical analysis plan specifies the endpoint. It rarely specifies how the AI behind that endpoint was validated for the specific trial context.

Regulatory frameworks are catching up — slowly

The EMA's Reflection Paper on AI and the FDA's credibility framework both acknowledge that AI in clinical trials requires specific documentation and validation. The EMA is explicit: if an AI method hasn't been previously qualified, sponsors must provide full model architecture, training data documentation, and validation records.

But these frameworks are guidance, not requirements. And they arrived after hundreds of trials had already incorporated AI endpoints without this level of scrutiny. The result is a gap between what regulators expect going forward and what's already baked into running protocols.

Teams designing new protocols have an opportunity — and arguably an obligation — to get ahead of this. Building AI endpoint validation into the protocol design process, rather than treating it as an afterthought, is the difference between a defensible submission and a regulatory question you can't answer at review.

The compounding risk nobody models

Here's what concerns us most: these risks compound.

Narrow training data leads to models that underperform in underrepresented populations. Weak validation means that underperformance isn't caught before deployment. Regulatory ambiguity means nobody requires the checks that would catch it. And protocol teams, optimizing for speed and innovation, adopt AI endpoints without systematically assessing any of these layers.

Each layer in isolation seems manageable. Together, they create a systemic risk to endpoint reliability that most protocol teams aren't modeling.

This doesn't mean AI endpoints are bad. They represent genuine advances in measurement precision, objectivity, and efficiency. But adopting them without addressing the infrastructure gaps is like building a house on a foundation you haven't inspected.

What we think protocol teams should do

Interrogate training data composition before selecting an AI endpoint. Ask the vendor or model developer: what modalities, anatomies, and patient populations were represented in training? If your trial population or imaging protocol differs meaningfully from the training distribution, plan for additional site-specific validation.
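As a rough illustration of what "differs meaningfully" could look like in practice, the sketch below compares a trial's planned population mix against the demographic composition a vendor reports for its training data, and flags subgroups where the gap exceeds an arbitrary threshold. All category names, proportions, and the 15-point threshold are hypothetical placeholders, not a validated methodology.

```python
# Illustrative check: compare a trial's planned population mix against the
# demographic composition a vendor reports for its training data.
# All numbers and category names below are hypothetical placeholders.

def distribution_gap(training: dict, trial: dict) -> dict:
    """Absolute difference in proportion for each category in either dict."""
    return {k: abs(training.get(k, 0.0) - trial.get(k, 0.0))
            for k in set(training) | set(trial)}

training_mix = {"age_65_plus": 0.15, "female": 0.40, "non_white": 0.10}
trial_mix    = {"age_65_plus": 0.45, "female": 0.50, "non_white": 0.30}

gaps = distribution_gap(training_mix, trial_mix)
# Arbitrary 15-percentage-point threshold for flagging a subgroup.
flagged = {k: v for k, v in gaps.items() if v > 0.15}
print(flagged)  # subgroups where site-specific validation may be warranted
```

A real assessment would go well beyond demographics (scanner vendors, acquisition protocols, disease prevalence), but even a crude comparison like this forces the training-data conversation with the vendor.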

Include AI endpoint validation in the protocol, not just the statistical analysis plan. The imaging charter or endpoint specification should address how the AI model's performance will be verified in the specific trial context — not just reference the FDA clearance as sufficient.

Benchmark against conventional measurement. For endpoints where conventional methods exist (manual tumor measurement, expert echocardiographic reading), consider running AI and conventional in parallel for at least a subset of patients. This creates a safety net and generates evidence for regulatory submissions.
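One common way to summarize such a parallel comparison is a Bland-Altman-style agreement analysis: compute the per-patient difference between AI and conventional readings, then report the mean difference (bias) and 95% limits of agreement. The sketch below uses made-up tumor diameters in millimetres purely to show the arithmetic; it is not a substitute for a prespecified agreement analysis.

```python
# Minimal Bland-Altman-style agreement check between AI and conventional
# measurements collected in parallel on a patient subset.
# Values are hypothetical tumor diameters in millimetres.
from statistics import mean, stdev

ai_mm     = [21.0, 34.5, 18.2, 40.1, 27.3, 31.8]
manual_mm = [20.4, 35.2, 17.9, 41.0, 26.1, 32.5]

diffs = [a - m for a, m in zip(ai_mm, manual_mm)]
bias = mean(diffs)                           # systematic offset, AI vs manual
sd = stdev(diffs)                            # spread of the disagreement
loa = (bias - 1.96 * sd, bias + 1.96 * sd)   # 95% limits of agreement

print(f"bias={bias:.2f} mm, limits of agreement={loa[0]:.2f}..{loa[1]:.2f} mm")
```

Whether the resulting limits of agreement are acceptable is a clinical question the protocol should answer in advance, not a statistic to interpret after the fact.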

Track the regulatory landscape actively. EMA and FDA expectations are evolving. Protocols designed today will be reviewed in 2-3 years. Building to where regulators are heading — not where they were when the protocol was drafted — reduces amendment risk downstream.

The uncomfortable bottom line

AI endpoints are entering clinical trials because they offer real advantages: consistency, scalability, potentially higher sensitivity. We're not arguing against adoption.

We're arguing that adoption without infrastructure is risk without awareness. And in clinical trials, unrecognized risk doesn't stay theoretical — it surfaces as unexplainable variance, regulatory holds, and protocol amendments.

The teams that will succeed with AI endpoints are the ones who treat the AI not as a black box that produces a number, but as a measurement system that requires the same scrutiny as any other component of their trial design.

The ones who don't will learn the same lesson the hard way — at the cost of time, money, and patient access to therapies that might actually work.


This is a ClinStacks opinion piece. We welcome disagreement — especially from teams who have successfully navigated AI endpoint validation in multicenter trials. Contact us or respond on LinkedIn.

If you're designing a protocol with AI-powered endpoints, Protocol Risk Scan can help you validate your design choices against historical precedent before finalization.