ClinStacks
AI Compliance Guide #06

Where FDA and EMA Met: Reading the Ten Joint AI Principles as a Convergence Signal


On January 14, 2026, the FDA and the EMA jointly published Guiding Principles of Good AI Practice in Drug Development, a set of ten principles spanning the full medicines lifecycle. The document is six pages long, contains no requirements, and binds no one. It would be easy to file it under "regulatory throat-clearing" and move on.

We think that read misses what actually happened. For two years, anyone tracking AI in regulated drug development has been watching two parallel tracks: the FDA's risk-based credibility framework, which we covered in Guide 01, and the EMA's reflection paper on AI across the medicinal product lifecycle, which we covered in Guide 02. Those tracks used overlapping but distinct vocabulary, sat in different legal systems, and gave sponsors running programs on both sides of the Atlantic a real problem: build to one, and you might still have to re-document for the other.

The January 2026 principles are the moment those two tracks converged into a shared floor. That is the story. Not the ten bullet points — the convergence underneath them. This guide reads the principles as a forward indicator of where binding guidance hardens next, and argues that a sponsor who built honestly to the FDA credibility framework is already most of the way to satisfying what both agencies have now signalled they care about.

What was actually published, stated precisely

Precision matters here, because the most common error we have seen in the trade coverage is conflation. So, plainly:

The document is titled Guiding Principles of Good AI Practice in Drug Development. It was published jointly by the EMA and the FDA on January 14, 2026. It identifies ten principles. It covers AI used to generate or analyse evidence across the drug product lifecycle — nonclinical, clinical, post-marketing, and manufacturing. It is directed at medicine developers, marketing authorisation applicants, and authorisation holders. And it is explicitly non-binding: the agencies describe the principles as intended to "lay the foundation" for future good practice and guideline development, and as identifying areas where regulators, standards organisations, and other bodies could collaborate.

Two things this document is not, both worth saying out loud because the misreads are common:

It is not the FDA's January 2025 draft guidance, Considerations for the Use of Artificial Intelligence to Support Regulatory Decision-Making for Drug and Biological Products. That is a separate, US-only document, published a full year earlier, and it is the source of the seven-step credibility framework we dissected in Guide 01. The two are deeply related — we will spend most of this guide on exactly how — but they are different documents with different dates and different legal postures. If you see them treated as one item, treat that source with suspicion.

And it is not enforceable. There is no compliance deadline, no comment docket that produces a binding rule, no inspection criterion. EMA's own framing is that EU guideline development is already underway, building on the 2024 reflection paper, and that the principles will be supplemented over time by guidance reflecting applicable legal frameworks. The principles describe a destination, not a gate you pass through next quarter. (They also sit alongside, not inside, the EU AI Act — a separate body of law that will shape the EU picture independently. That interaction is the question most sponsors are carrying, and we take it up directly near the end.)

That non-binding status is not a reason to ignore the document. It is the reason to read it carefully. Non-binding principles published jointly by two of the world's most consequential drug regulators are the clearest public signal you will get about the shape of the binding requirements that follow. They are a map of the terrain before the roads are paved.

The spine: the credibility framework goes transatlantic

Here is the throughline. Three of the ten principles — taken together — are the FDA's credibility framework, restated in language EMA can also stand behind.

Recall the spine of the FDA credibility framework from Guide 01: you define the question of interest the AI model addresses; you define the context of use (the specific role and scope of the model in answering that question); you assess model risk, which is the combination of how much the model's output influences the decision and how serious the consequence of getting that decision wrong would be; and you then make the rigour of your credibility evidence commensurate with that risk. Low-influence, low-consequence uses get a light touch. High-influence, high-consequence uses get the full validation burden. The entire framework is a machine for proportioning effort to risk, anchored to a precisely defined context of use.

Now read three of the joint principles:

Principle 2, Risk-based approach: the development and use of AI follows a risk-based approach with proportionate validation, risk mitigation, and oversight based on the context of use and determined model risk.

Principle 4, Clear context of use: AI technologies have a well-defined context of use — the role and scope for why the technology is being used.

Principle 8, Risk-based performance assessment: risk-based performance assessments evaluate the complete system, including human-AI interactions, using fit-for-use data and metrics appropriate for the intended context of use.

That is not similar language. That is the same conceptual machinery. "Context of use," "model risk," "proportionate validation," "commensurate with risk" — these are the load-bearing terms of the FDA's 2025 framework, and they are now the load-bearing terms of a document EMA co-signed. The phrase "determined model risk" in Principle 2 is doing exactly the work that the model-influence-times-decision-consequence matrix does in the FDA guidance.

This is the single most useful thing a sponsor can take from the January 2026 publication. If your organization built its AI documentation discipline around the FDA credibility framework — defining context of use up front, grading model risk, and sizing your validation evidence to that grade — you did not build to a US-only standard that you now have to redo for Europe. You built to the conceptual core that both agencies have now publicly endorsed. The convergence works in your favour.

It helps to make this concrete, because "context of use drives risk drives validation" can sound like an abstraction until you walk a real case through it. Take an AI model used to stratify trial participants into low-risk and high-risk groups for a serious adverse reaction, so that low-risk participants can be sent home for outpatient monitoring rather than held for inpatient observation — the clinical example the FDA's own 2025 guidance uses. Walk it through the principles. The question of interest (Principle 4's prerequisite) is precise: which participants can be considered low enough risk that they do not need inpatient monitoring after dosing? The context of use (Principle 4) is the model's specific role and scope: the model's output is the sole determinant of which monitoring arm a participant enters. Now grade the risk (Principle 2): model influence is high, because nothing else moderates the model's call; decision consequence is high, because a misclassified high-risk participant could suffer a life-threatening reaction in a setting without proper treatment. High influence and high consequence yield high model risk — which, under Principle 8's risk-based performance assessment, demands the most stringent validation, the tightest performance acceptance criteria, and explicit evaluation of the human-AI system as a whole rather than the model in isolation.

Now change one fact. Suppose the model is not the sole determinant — suppose independent confirmatory testing runs in parallel and the model only flags cases for review. The context of use has changed, so the risk grade changes with it: model influence drops, because the model's output is no longer the last word, and the same model now sits at a lower risk tier requiring proportionately lighter validation. Same model, same accuracy, different context of use, different validation burden. That is the entire engine of the credibility framework in one comparison — and it is exactly what Principles 2, 4, and 8 encode. A sponsor who can run that walk-through fluently for their own AI uses is already operating in the conceptual world both agencies now share.
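The comparison above can be sketched in a few lines of Python. This is a minimal illustration, not the FDA's method: the three-level scale, the additive score, and the cut-offs are all assumptions made for the example, as are the names `Level`, `ContextOfUse`, and `model_risk`. The guidance describes the two axes and leaves the grading itself to the sponsor's judgement.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class ContextOfUse:
    question_of_interest: str
    model_role: str
    influence: Level    # how much the model output drives the decision
    consequence: Level  # how serious a wrong decision would be


def model_risk(cou: ContextOfUse) -> Level:
    """Combine the two axes into one tier (cut-offs are illustrative assumptions)."""
    score = cou.influence + cou.consequence  # ranges from 2 to 6
    if score >= 5:
        return Level.HIGH
    if score == 4:
        return Level.MEDIUM
    return Level.LOW


QUESTION = "Which participants can skip inpatient monitoring after dosing?"

# Same model, same question, two different roles for the model.
sole_determinant = ContextOfUse(
    QUESTION, "output alone assigns the monitoring arm", Level.HIGH, Level.HIGH
)
flag_for_review = ContextOfUse(
    QUESTION, "output flags cases; independent testing confirms", Level.LOW, Level.HIGH
)

print(model_risk(sole_determinant).name)  # HIGH
print(model_risk(flag_for_review).name)   # MEDIUM
```

Nothing about the model changes between the two records. Only the role recorded in the context of use changes, and the tier follows from it.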

It is worth being honest about the direction of travel. The shared vocabulary is recognizably the FDA's. The context-of-use and model-risk constructs trace back through the FDA's 2025 guidance to the ASME V&V40 credibility framework for computational models that the FDA explicitly adapted. EMA's 2024 reflection paper was more discursive, more oriented toward lifecycle and data governance than toward a single crisp risk-grading construct. What the joint principles do is take the FDA's crisp risk spine and set it inside EMA's broader lifecycle-and-governance frame. Both agencies gave something. But the part a sponsor builds to — the part that tells you how much validation is enough — is the FDA's contribution, now made transatlantic.

The lineage is clear enough to put in a table. The FDA credibility framework is built on three load-bearing concepts; each maps onto one or more of the joint principles, and together they account for the risk spine of the whole document.

FDA credibility framework concept | Where it appears in the joint principles
Question of interest / context of use: the specific role and scope of the model | Principle 4 (clear context of use)
Model risk: model influence on the decision × consequence of an error | Principle 2 (risk-based approach, "determined model risk")
Credibility commensurate with risk: validation sized to the risk grade | Principle 2 (proportionate validation) and Principle 8 (risk-based performance assessment)

If you have not read Guide 01, the table is the whole framework in miniature: define what the model is for, grade how much could go wrong if it is wrong, and size your validation to that grade. Every other principle in the document is either an input to that logic or a way of sustaining it over time. Read this way, the joint principles are not a new framework — they are two existing frameworks fused where they already agreed, with the FDA's risk-grading construct supplying the load-bearing structure and EMA's lifecycle-and-governance heritage supplying the rest.

What EMA pulls to the foreground

If the risk spine is the FDA's contribution, the emphasis on provenance, lifecycle, and communication is where EMA's heritage shows. Three principles carry most of that weight, and they are where many sponsors are weakest, precisely because they are less about a one-time validation event and more about sustained discipline.

Principle 6, Data governance and documentation. The principle asks that data source provenance, processing steps, and analytical decisions be documented in a detailed, traceable, and verifiable manner, in line with GxP requirements, with appropriate governance — including privacy and protection for sensitive data — maintained throughout the technology's lifecycle.

Read that slowly, because it is more demanding than it first appears. It is not asking only that your training data be good. It is asking that where the data came from, the processing steps applied to it, and the analytical decisions made along the way all be documented to a standard that is traceable and verifiable. That is a much higher bar than "we validated the model and it performed well." It is the difference between being able to assert that a model works and being able to reconstruct, after the fact, exactly what evidence and what choices produced a given output. In a GxP context, where the governing assumption is that what is not documented did not happen, that reconstructability is what survives an inspection.
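To make "reconstructable" less abstract, here is a minimal sketch of a lineage record that stores a content hash for the source data and for the output of every processing step, together with the parameters and the rationale for each analytical choice. The structure, the field names, and the toy dataset are all hypothetical. A real GxP system would also need access control, audit trails, and retention, none of which is shown.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def digest(payload: bytes) -> str:
    """SHA-256 fingerprint, so a later reviewer can confirm the bytes are unchanged."""
    return hashlib.sha256(payload).hexdigest()


@dataclass
class Step:
    name: str
    params: dict
    rationale: str      # the analytical decision, in words
    output_sha256: str


@dataclass
class LineageRecord:
    source_id: str
    source_sha256: str
    steps: list = field(default_factory=list)

    def add_step(self, name: str, params: dict, rationale: str, output: bytes) -> None:
        self.steps.append(Step(name, params, rationale, digest(output)))

    def final_sha256(self) -> str:
        return self.steps[-1].output_sha256 if self.steps else self.source_sha256


# Toy source data: one row is missing its ALT value.
raw = b"subject,alt\n001,42\n002,\n003,37\n"
record = LineageRecord("site-A-labs-export", digest(raw))

# Processing step: exclude rows with a missing ALT value, and record why.
lines = raw.decode().splitlines()
kept = [lines[0]] + [ln for ln in lines[1:] if ln.split(",")[1] != ""]
cleaned = ("\n".join(kept) + "\n").encode()
record.add_step(
    "drop_missing_alt",
    {"column": "alt"},
    "ALT is a required model input; rows without it are excluded, not imputed",
    cleaned,
)

# Reconstruction check: re-running the step must reproduce the recorded hash.
assert digest(cleaned) == record.final_sha256()
print(json.dumps(asdict(record), indent=2))
```

The design choice worth noting is that the rationale is captured at the moment the step runs. That is the part that cannot be recovered later from the data alone.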

Principle 9, Life cycle management. Risk-based quality management systems are implemented throughout the AI technology's lifecycle, with scheduled monitoring and periodic re-evaluation to ensure adequate performance — explicitly naming data drift as the failure mode to guard against.

This is the principle that most directly contradicts a "validate once, file it, move on" mentality. AI models are not static instruments. Their performance can degrade as the data they encounter in deployment diverges from the data they were trained on — data drift — and a model that was credible at lock can quietly stop being credible six months later without anyone touching a line of code. Principle 9 asks for the quality-management scaffolding to catch that: scheduled monitoring, periodic re-evaluation, and a defined process for when degradation triggers a response. For sponsors whose AI governance was built around a project milestone rather than an ongoing operation, this is the gap that will need the most structural work.

The reason this is the structural gap, rather than just one more requirement, is that drift is invisible by default. A traditional analytical method does not change its behavior when the population it is applied to shifts; a model can, and it does so silently. There is no error message when a model trained on one site's imaging characteristics begins underperforming on a new site's scanner, or when a model trained on a pre-pandemic population starts seeing a meaningfully different case mix. The output still looks like a confident prediction. Without monitoring designed specifically to detect degradation, the first signal that a model has drifted out of its validated envelope may be a discrepancy a regulator finds, not one the sponsor caught. That asymmetry — the failure mode produces no native alarm — is exactly why Principle 9 insists on scheduled monitoring and periodic re-evaluation rather than leaving it to be triggered by something going visibly wrong.

There is a deeper point embedded here about the relationship between the risk grade and the lifecycle obligation. The intensity of lifecycle management should itself be risk-based, per Principle 9's own language. A low-risk model in a stable, well-characterized context of use may warrant light monitoring on a relaxed cadence. A high-influence, high-consequence model operating in a context where the input distribution can shift — new sites, new populations, evolving standard of care — warrants tight, frequent monitoring with pre-specified re-evaluation triggers, because the same drift that would be a nuisance in a low-risk setting is a patient-safety exposure in a high-risk one. The lifecycle obligation, in other words, inherits the risk grade from the context-of-use analysis. It is not a separate axis; it is the risk spine extended through time.
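One way to give drift the alarm it lacks is to compare the distribution of a model input or score in deployment against the distribution seen at validation. The sketch below uses the population stability index (PSI), a common drift statistic. Choosing PSI is our assumption for illustration, since the principles do not name a metric. The bin edges and the sample data are invented, and any alert threshold would need to be pre-specified and justified against the risk grade.

```python
import math


def psi(reference, current, edges, eps=1e-6):
    """Population stability index between two samples over fixed bin edges.

    `edges` must be sorted ascending; values below the first edge fall in
    bin 0, values at or above the last edge fall in the final bin.
    """
    def proportions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1
        total = len(values)
        # Floor at eps so an empty bin does not produce log(0).
        return [max(c / total, eps) for c in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


edges = [0.25, 0.5, 0.75]

# Scores at validation: evenly spread across the four bins.
validation_scores = [0.1] * 25 + [0.3] * 25 + [0.6] * 25 + [0.9] * 25
# Scores at a hypothetical new site: mass has moved to the top bin.
new_site_scores = [0.1] * 10 + [0.3] * 15 + [0.6] * 25 + [0.9] * 50

print(round(psi(validation_scores, validation_scores, edges), 3))  # 0.0
print(round(psi(validation_scores, new_site_scores, edges), 3))    # about 0.36
```

A widely quoted rule of thumb treats a PSI above roughly 0.25 as a substantial shift, but that convention comes from other industries. A high-risk context of use may reasonably warrant a tighter trigger and a more frequent check than a low-risk one.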

Principle 10, Clear, essential information. Plain language is used to present clear, accessible, contextually relevant information to the intended audience — including users and patients — about the AI technology's context of use, performance, limitations, underlying data, updates, and interpretability or explainability.

This one is easy to skim past as boilerplate transparency language, but it has teeth in one specific respect: it names limitations and underlying data as things that must be communicated, not just headline performance. A communication that says "the model achieves 94% accuracy" and stops there does not satisfy the spirit of Principle 10. The principle wants the boundary conditions made legible — where the model is reliable, where it is not, what data it rests on, and what changed at the last update. That is a documentation and communication discipline that most performance reporting does not currently meet.
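A small worked example shows why a bare "94% accuracy" under-informs. The sketch below computes a Wilson score interval for a proportion. The sample sizes are invented, and the choice of the Wilson interval at roughly 95% confidence (z = 1.96) is our assumption, not something the principles prescribe.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# The same headline accuracy from two hypothetical test sets.
small = wilson_interval(94, 100)
large = wilson_interval(940, 1000)

print(tuple(round(x, 3) for x in small))
print(tuple(round(x, 3) for x in large))
```

With 100 test cases the interval runs from about 0.88 to 0.97. With 1,000 it narrows to about 0.92 to 0.95. The headline number is identical, and the evidence behind it is not, which is the kind of boundary condition Principle 10 asks to be made legible.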

It is worth being concrete about what satisfying these three principles actually looks like, because the gap between "we have good practices" and "we can demonstrate them on demand" is where programs get caught. For Principle 6, good looks like a data lineage record that lets someone who was not in the room reconstruct, for any AI-derived result in a submission, which source data fed the model, what transformations were applied, and which analytical choices were made — without relying on the memory of the analyst who built it. For Principle 9, good looks like a monitoring plan with defined performance metrics, a stated re-evaluation cadence tied to risk, pre-specified thresholds that trigger investigation, and a documented response pathway for when drift is detected — in place before deployment, not assembled after a regulator asks. For Principle 10, good looks like a model fact sheet that states context of use, performance with confidence intervals, explicit limitations and out-of-scope conditions, the data the model rests on, and a changelog of updates, written so an intended user can actually act on it. None of this is exotic. All of it takes time to stand up, and none of it can be convincingly fabricated after the fact — which is the whole point.
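The Principle 9 monitoring plan described above can be written down as data before deployment, which is what makes it pre-specified rather than assembled after the fact. In this sketch every number is a hypothetical placeholder, and the names `MonitoringPlan`, `PLANS`, and `response` are ours. Actual cadences and thresholds would have to be justified per context of use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringPlan:
    review_every_days: int    # re-evaluation cadence, tied to the risk tier
    investigate_below: float  # metric floor that triggers an investigation
    reevaluate_below: float   # metric floor that triggers full re-evaluation


# Placeholder values only: tighter cadence and thresholds as risk rises.
PLANS = {
    "low": MonitoringPlan(review_every_days=180, investigate_below=0.80, reevaluate_below=0.70),
    "medium": MonitoringPlan(review_every_days=90, investigate_below=0.85, reevaluate_below=0.80),
    "high": MonitoringPlan(review_every_days=30, investigate_below=0.90, reevaluate_below=0.85),
}


def response(risk_tier: str, observed_metric: float) -> str:
    """Return the pre-specified response for an observed performance metric."""
    plan = PLANS[risk_tier]
    if observed_metric < plan.reevaluate_below:
        return "re-evaluate"
    if observed_metric < plan.investigate_below:
        return "investigate"
    return "continue"


# The same observed performance leads to different responses by tier.
print(response("low", 0.88))   # continue
print(response("high", 0.88))  # investigate
print(response("high", 0.80))  # re-evaluate
```

The point of the structure is that the response pathway is decided before the metric is observed, so the plan can be shown to an inspector as it stood on the day of deployment.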

Rounding out the ten are the principles that set the ethical and structural frame. These get less attention than the risk spine and the governance emphasis, and rightly so — they are less likely to be where a program fails. But each carries a practical edge worth naming.

Principle 1, human-centric by design, asks that the development and use of AI align with ethical and human-centric values. The operational reading is that the human remains accountable for the decision the AI informs; the technology supports judgment rather than displacing it. In a high-model-influence context of use, this principle is in productive tension with the risk grading — the more a model determines a decision outright, the more weight Principle 1 places on the human oversight and accountability structures around it.

Principle 3, adherence to standards, asks that AI technologies adhere to relevant legal, ethical, technical, scientific, cybersecurity, and regulatory standards, including Good Practices (GxP). The inclusion of cybersecurity is easy to overlook and increasingly consequential: a model whose integrity can be compromised is a model whose outputs cannot be trusted, and that is a regulatory exposure, not just an IT concern. GxP adherence is the thread that ties AI governance back into the quality systems sponsors already operate, rather than treating AI as a parallel universe with its own rules.

Principle 5, multidisciplinary expertise, asks that expertise covering both the AI technology and its context of use be integrated throughout the lifecycle. The word "both" is the point. A team of excellent data scientists who do not understand the clinical or regulatory context will build a technically sound model for the wrong question; a team of domain experts without AI literacy will not know which questions the technology can actually answer or where it will silently fail. Principle 5 is a quiet argument against siloing AI development away from the people who own the context of use.

Principle 7, model design and development practices, asks that development follow best practices in model and system design and software engineering, leverage data that is fit-for-use, and attend to interpretability, explainability, and predictive performance — promoting transparency, reliability, generalisability, and robustness. This is the principle most familiar to anyone who has done serious model development, and the one most likely to be satisfied already by a competent team. Its regulatory significance is that it elevates ordinary engineering discipline — version control, reproducible pipelines, fit-for-use data, guarding against overfitting — from good hygiene to a stated expectation against which a program can be assessed.

Taken together, these four are the connective tissue that makes the risk spine and the governance emphasis sit inside a coherent whole. None is the reason to read the document, but a program that nails context of use and risk grading while neglecting any of these four will still have a gap it cannot easily defend.

Reading the ten as a system, not a checklist

A subtle point about how the principles relate to each other, because reading them as ten independent boxes to tick misses the design.

The principles are not parallel. They nest. Principle 4 (clear context of use) is logically prior to almost everything else — you cannot do a risk-based performance assessment (Principle 8) without first knowing the context of use you are assessing against, and you cannot grade model risk (Principle 2) without a defined role and scope for the model. Context of use is the keystone; pull it out and the risk-based principles have nothing to attach to.

Similarly, Principle 6 (data governance and documentation) and Principle 9 (lifecycle management) are not separate from the risk principles — they are how the risk principles get sustained over time. A risk-based performance assessment is a snapshot. Lifecycle management is what keeps that snapshot honest as the world changes. Data governance is what makes the snapshot reconstructable when someone asks, months later, how you arrived at it. The risk spine tells you how much rigour to apply; the governance and lifecycle principles tell you how to make that rigour durable and defensible rather than a one-time performance.

Seen this way, the ten principles describe a single posture: define what the AI is for, grade the risk that definition implies, size your evidence to that grade, document the lineage so the evidence is reconstructable, and monitor so the evidence stays true. That posture is coherent, it is recognizably the credibility framework with a governance-and-lifecycle wrapper, and it is the thing worth building toward — regardless of when, or in what binding form, it arrives.

This is also why some principles get more attention in this guide than others, and the imbalance is deliberate rather than an oversight. The principles are not equally load-bearing. Three of them (the risk-based approach, clear context of use, and risk-based performance assessment) define the logic everything else hangs on. Two more (data governance and lifecycle management) are where that logic is most often broken in practice, because they require sustained operation rather than a one-time deliverable. The remaining principles are real expectations, but they are the ones a competent program is most likely to satisfy already, and least likely to fail an inspection on. Weighting the analysis toward the principles that are either structurally central or operationally fragile is not treating the others as optional; it is spending the reader's attention where the risk actually concentrates.

The honest read for sponsors

So what should a sponsor actually do with a non-binding document? Three things, stated with appropriate humility about what we can and cannot know.

First, treat the convergence as a planning input, not a compliance event. There is nothing to comply with yet. But the probability that future binding guidance in both jurisdictions is organized around context of use, model risk, proportionate validation, traceable data documentation, and lifecycle monitoring is now substantially higher than it was before January 2026 — because two regulators have jointly committed to that frame in public. If you are making architecture decisions about how your organization documents and governs AI-derived evidence, building toward that frame is the low-regret choice. It is where the roads are most likely to be paved.

Second, the convergence reduces, though it does not eliminate, the risk of divergent US and EU documentation demands. We want to be careful here. The principles are aligned; the binding guidance that follows will be written inside two different legal systems, and EMA has been explicit that EU guidance will reflect applicable EU legal requirements, including new pharmaceutical legislation and the broader EU AI regulatory environment. So "build once, file everywhere" is too strong. But "build to a shared conceptual core and expect the jurisdiction-specific guidance to be variations on it rather than separate frameworks" is a reasonable working assumption, and a far better planning posture than preparing for two unrelated regimes.

Third — and this is the part that is genuinely actionable today — the two principles most sponsors are weakest on, data governance (Principle 6) and lifecycle management (Principle 9), are also the two that take the longest to build and cannot be retrofitted quickly. You can write a context-of-use statement in an afternoon. You cannot manufacture the provenance record for a dataset after the fact; either you captured the lineage as you went or you did not. You cannot retroactively monitor a model for drift over a period you were not monitoring it. These are the capabilities that have to be stood up before you need them, which means the rational response to a non-binding signal is to start building the slow capabilities now, while the binding requirements are still being drafted.

That last point is where the document earns its keep. The deadline-driven reading — "nothing is required, so nothing to do" — is exactly backwards for the parts that take the longest to build. The capabilities that require standing infrastructure — traceable data documentation captured as you go, monitoring that detects drift, quality-management scaffolding across the lifecycle — cannot be produced on demand and have to exist before the use that needs them. The per-use deliverables — a context-of-use statement, a risk grading, a model fact sheet — are fast to produce against an established backbone. So a rational program inverts the apparent urgency: it builds the slow, no-deadline capabilities first, precisely because they are slow. Reading the principles for their build-time rather than their binding status is how a sponsor turns a non-binding document into a work plan.

One caution against reading any of this as a counsel of maximalism: the risk-based thread is also permission. Not every AI use is a high-influence model determining patient management. A model supporting an internal efficiency that does not touch the reliability of study results sits well down the risk scale and warrants light treatment. The principles ask for effort proportioned to risk — which means accurate risk grading is also what protects a program from over-investing where it does not matter.

Where this sits, and what is still unsettled

The January 2026 principles are a foundation, and the agencies say so directly. The next phase is guideline development — already underway in the EU, building on the 2024 reflection paper — and the gradual translation of these principles into jurisdiction-specific guidance that does carry weight. The HMA-EMA group focused on AI has signalled continuing work on a shared glossary of AI terminology, which matters more than it sounds: a common vocabulary is the precondition for the two systems' guidance actually staying convergent rather than drifting apart in the details.

What remains genuinely unsettled: how the principles interact with the EU AI Act and the broader EU legal framework for high-risk AI; how the FDA's own draft guidance evolves as it moves toward final form; and whether the convergence visible at the level of principles survives contact with the binding-guidance drafting process in two different legal systems. These are open questions, and we will not pretend to know the answers. What we will say is that the direction is now public, joint, and specific enough to plan against.

The EU AI Act interaction deserves a closer look, because it is where the cleanest source of future US-EU divergence sits. The joint principles are a regulatory-science instrument: they describe how to establish that an AI model's output is credible enough to support a drug regulatory decision. The EU AI Act is horizontal legislation that classifies AI systems by risk category and imposes obligations accordingly, independent of the drug context. A given AI use in drug development could plausibly sit inside both frames at once — assessed for credibility under the medicines framework the joint principles foreshadow, and simultaneously classified and regulated under the AI Act. The FDA has no analogue to the AI Act; its AI governance lives inside the drug regulatory framework. So even with perfectly converged principles, an EU sponsor may carry obligations a US sponsor does not, arising not from divergent drug-AI guidance but from a separate body of EU law the US simply lacks. EMA has been explicit that EU guidance will reflect applicable EU legal requirements, including new pharmaceutical legislation — which is the agency telling you, in advance, that the EU-specific layer is coming. The convergence is real at the level of how you establish credibility. It is the surrounding legal scaffolding, not the credibility logic, where the two jurisdictions will most plausibly part ways.

That distinction is useful for planning. It means the conceptual core worth building toward — context of use, model risk, proportionate validation, traceable documentation, and lifecycle monitoring — is genuinely shared and low-regret. The jurisdiction-specific work, when it arrives, is more likely to be additive obligations layered on top of that shared core than a competing framework that forces you to rebuild it. You are not betting on which of two frameworks wins. You are building the part both agree on and leaving room for jurisdiction-specific additions.

For the practitioner, the takeaway is unglamorous and durable. The conceptual core — context of use, model risk, proportionate validation, traceable provenance, lifecycle monitoring — is the thing both agencies have endorsed and the thing worth building toward. The vocabulary will be refined, the binding guidance will arrive in pieces, and the jurisdictional details will differ. But the spine is stable, it is the spine we have been describing in this series since Guide 01, and it now has two regulators' names on it.

That is what happened in January. Not ten new rules. A public confirmation of where the rules are heading — and a year, more or less, to build for it before they arrive.