ClinStacks
AI Compliance Guide #04

21 CFR Part 11 in the Age of AI: Electronic Records, Audit Trails, and Model Versioning for Regulated Workflows

Most teams now running AI inside a regulated workflow have not been asked the Part 11 question yet. They will be.

We're seeing it begin to happen in pre-inspection readiness reviews and CRO QA walkthroughs: a sponsor has deployed an LLM-assisted MedDRA coder, or an AI-assisted lesion measurement tool, or a generative drafting layer on top of their CSR pipeline. The system works. People like it. It's quietly producing or modifying records that, under a strict read of 21 CFR Part 11, are GxP electronic records. And then someone asks the obvious question — which version of the model produced this output, and can you reproduce it? — and the room goes quiet.

This is the gap this guide is written to close.

21 CFR Part 11 was finalized in 1997. It was written for a world of CDMS form-based data entry, validated COTS software, and audit trails that logged a human user changing a field. The drafters could not have anticipated systems where the "operator" is a model, the model changes underneath you, the output is not deterministic, and the system boundary extends to a vendor's cloud you don't control. Yet sponsors and CROs are deploying exactly those systems into Part 11 contexts every week, often without a clear story for how the AI fits into the rule.

The good news: Part 11 doesn't actually need rewriting to accommodate AI. The bad news: the practical work of applying it — what to capture, how to version, what to validate — has to be invented at the implementation layer, because regulators have not yet supplied the detail. This guide is the playbook we use when advising teams through that work.

Why traditional Part 11 implementations stopped working when AI showed up

Before we get to AI specifically, a brief refresher — because most teams misremember what Part 11 actually requires.

The rule has three working sections that matter for our purposes. Section 11.10 lays out controls for closed systems: validation (11.10(a)), the ability to generate accurate and complete copies of records (11.10(b)), record protection and retention (11.10(c)), limited access to authorized individuals (11.10(d)), and — the one everyone fights over — "secure, computer-generated, time-stamped audit trails to independently record the date and time of operator entries and actions that create, modify, or delete electronic records" (11.10(e)). Section 11.30 layers on additional controls for open systems, principally encryption and digital signatures. And 11.3(b) defines what makes a system closed versus open: closed systems are "controlled by persons who are responsible for the content of electronic records that are on the system." Open systems are not.

That last definition is where AI bites hardest, and we'll return to it.

In 2003 FDA issued a Scope and Application guidance that narrowed the rule's reach considerably. The agency said it would exercise enforcement discretion on certain Part 11 requirements (validation, audit trail, copy generation, record retention) for legacy systems and for records not specifically required by predicate rules. This is the language sponsors lean on when they argue Part 11 is "softer than it looks." That's a real argument. It is not, however, a license to ignore Part 11 for AI systems generating regulated records — the predicate rule scope (GMP, GCP, GLP) still anchors the obligation.

FDA announced in January 2023 that it intended to modernize Part 11. Public comments ran through 2023 and 2024. As of this writing, no modernized rule has been issued. What has arrived is a constellation of adjacent guidance that reinterprets how Part 11 principles apply to current technology — the Computer Software Assurance (CSA) guidance finalized in September 2025, ICH E6(R3) finalized January 2025 and posted by FDA in September 2025, the January 2025 FDA draft guidance on AI in regulatory submissions, the August 2025 PCCP final guidance for AI-enabled devices, and the FDA/EMA Guiding Principles of Good AI Practice issued jointly in January 2026. The EU is moving faster in parallel, with the draft revision of GMP Annex 11 expected to finalize in mid-2026 and a new Annex 22 specifically addressing AI in GMP contexts.

The practical result is that Part 11 today reads almost identically to its 1997 form, but the interpretation layer around it has shifted dramatically. AI sits in the gap between the rule and the interpretation, and that gap is where audit findings get written.

The reason traditional Part 11 implementations stopped working when AI arrived isn't that the rule's language is wrong. It's that the implementation patterns built around the rule assume a set of system properties that AI no longer has.

The five Part 11 assumptions that AI breaks

Part 11 is built on five implicit assumptions. Each of them holds for a traditional CDMS or LIMS. Each of them breaks, in different ways, for an AI system.

Assumption one: records are stable artifacts. A row in an EDC database is a thing. It exists. You can point at it, hash it, store it, retrieve it. The output of an AI inference is not a thing in that sense — it's a transient product of a model, an input, configuration parameters, and (often) a random sample. If you don't capture all four of those at the moment of inference, you cannot reconstruct the output. The record exists only because you chose to write it down.

Assumption two: the system behaves deterministically. A traditional electronic record system behaves the same way today as it did yesterday given the same input. This is a precondition for validation: if you tested a function and it passed, you can rely on it. LLMs with temperature greater than zero do not behave this way. The same input can produce different outputs across invocations. Even at temperature zero, hosted model versions can shift underneath you between calls.

Assumption three: the system boundary is controllable. Closed system. You own the hardware, the software, the access list. With cloud AI APIs, you do not own the model. You don't control when the vendor updates the weights. You don't have access to the training data. You can't audit what changed. The system extends out of your control by definition.

Assumption four: configuration is static between releases. When you validate a CDMS, the configuration is fixed until the next change control. AI systems have configuration surfaces that change all the time and don't feel like configuration: a tweaked system prompt, an updated retrieval corpus, a new tool definition for an agent, a different sampling temperature. Each of these is a change that should trigger validation work. Most teams don't treat them as changes.

Assumption five: the actor in the audit trail is a person. 11.10(e) talks about "operator entries and actions." When the operator is a model, who is the actor? The user who initiated the call? The model itself? The vendor who trained it? Each interpretation has different implications for what the audit trail must capture, and Part 11 doesn't tell you which is correct.

These five assumptions don't just break individually. They compound. If your system is non-deterministic and the model changes and the configuration drifts and the actor is ambiguous, then "secure, computer-generated, time-stamped audit trails to independently record the date and time of operator entries and actions that create, modify, or delete electronic records" becomes a much harder design problem than it looks at first read.
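Assumption two is easy to see in miniature. The sketch below is a toy decoder step, not any vendor's implementation: the three-token vocabulary, the logit values, and the `sample_token` helper are all assumptions made for illustration. At temperature zero the choice is a plain argmax; above zero the same logits yield different tokens across draws.

```python
import math
import random


def sample_token(logits, temperature, rng):
    """Pick a token index from raw logits at a given sampling temperature."""
    if temperature == 0:
        # Greedy decoding: always the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.9, 0.5]   # illustrative values only
rng = random.Random(11)    # seeded here only so the sketch is repeatable

greedy_draws = {sample_token(logits, 0, rng) for _ in range(200)}
sampled_draws = {sample_token(logits, 1.0, rng) for _ in range(200)}

print("distinct tokens at temperature 0:", len(greedy_draws))
print("distinct tokens at temperature 1:", len(sampled_draws))
```

The point for validation is that a test that passed once at temperature 1 says little about the next call, which is why the sampled output itself has to be captured.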

Electronic records for AI: what actually counts

The first practical question is: when an AI system produces or modifies a regulated record, what is the record?

For a traditional system, the answer is obvious. The record is the field value, plus its metadata (timestamp, user, source). For an AI system, the record is irreducibly larger. We work with the following minimum capture set:

The input. Everything the model saw. For an LLM, that means the full prompt including any system message, any conversation history, any retrieved context (the actual chunks pulled from a RAG store, not just the references). For an imaging model, the exact DICOM file or pixel array, with hash. For a tabular model, the input frame. If the model saw it and used it to produce the output, it goes into the record.

The output. The exact text, classification, score, or measurement returned by the model. Stored verbatim, not paraphrased. If the output was processed before being stored (rounded, parsed, formatted), both the raw and the processed forms are captured.

The model identity. Not just "GPT-4" or "Claude": the exact version string. gpt-4o-2024-05-13 is a different system from gpt-4o-2024-08-06. claude-3-5-sonnet-20240620 is a different system from claude-3-5-sonnet-20241022. If you used a fine-tune, the fine-tune ID. If you used a model snapshot, the snapshot ID. If the model is self-hosted, the weights checksum.

The configuration. Temperature, top_p, top_k, max_tokens, response format, stop sequences, seed if applicable. The system prompt (if not already captured in input). For RAG, the embedding model version and the retrieval index identifier with timestamp. For agents, the full tool definition list.

The trace. For systems with intermediate steps — agents that call tools, retrieval pipelines that score and filter, ensemble systems that combine multiple model outputs — the intermediate state. Which tools were called, with what arguments, returning what. Which chunks were retrieved with what similarity scores. The reasoning trace if exposed by the model.

The provenance link. Which downstream regulated record this inference informed. Did this output go into a CRF field? Inform an SAE assessment? Get incorporated into a CSR draft? The link from inference event to downstream record is what makes the inference itself a Part 11 record.
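One way to hold the capture set above is a single immutable record with a content hash. This is a minimal sketch, not a prescribed schema: the `InferenceRecord` class, its field names, the `ecrf://` link format, and the example values are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace
from datetime import datetime, timezone


@dataclass(frozen=True)
class InferenceRecord:
    invocation_id: str
    input_payload: dict    # full prompt, history, retrieved chunks, or file hash
    raw_output: str        # verbatim model output
    processed_output: str  # what was stored after parsing or rounding
    model_id: str          # exact snapshot, fine-tune ID, or weights checksum
    configuration: dict    # temperature, top_p, max_tokens, seed, ...
    trace: list            # tool calls, retrieved chunks with scores
    provenance_link: str   # downstream regulated record this inference informed
    captured_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def content_hash(self) -> str:
        """SHA-256 over a canonical JSON form, so any field change is detectable."""
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Illustrative values only; none of these come from a real study.
record = InferenceRecord(
    invocation_id="inv-0001",
    input_payload={
        "system": "Suggest the best MedDRA Preferred Term.",
        "user": "upset stomach for two days after dosing",
    },
    raw_output="Suggested PT: Abdominal discomfort",
    processed_output="Abdominal discomfort",
    model_id="claude-3-5-sonnet-20241022",
    configuration={"temperature": 0, "max_tokens": 64},
    trace=[],
    provenance_link="ecrf://study-001/subject-014/ae-3",
    captured_at="2025-01-15T10:30:00+00:00",
)

# The same event attributed to a different snapshot is a different record.
other_snapshot = replace(record, model_id="claude-3-5-sonnet-20240620")
print(record.content_hash() != other_snapshot.content_hash())
```

Storing the hash next to the record gives a cheap integrity check later; it does not replace write-once storage.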

This is more data than most teams are currently storing. We see two failure modes routinely. The first: teams that log only the output ("the AI said 'mild adverse event'") and nothing else. That's a Part 11 violation waiting to be written up — you cannot reconstruct, and you cannot satisfy 11.10(b) on copy generation. The second: teams that log inputs and outputs but not configuration, so when the model version changes silently they can't tell what produced what.

The right rule of thumb is this: if a regulator asked you, six months from now, "reproduce this AI-generated output exactly," could you? Not "approximately." Not "produce a similar output." Exactly. If the answer is no, your record set is incomplete.

A concrete example clarifies. Consider an LLM-assisted MedDRA coding workflow. A user pastes a verbatim adverse event term ("upset stomach for two days after dosing") into an interface. The LLM is prompted to suggest the best MedDRA Preferred Term match. The user accepts, rejects, or overrides. The accepted PT goes into the eCRF.

A complete Part 11 record set for that single coding event captures: the verbatim term as submitted, the full prompt (system message + user message + any few-shot examples), the model ID with snapshot, the temperature and other sampling parameters, the embedding model version if MedDRA hierarchy retrieval was used, the retrieved candidate PTs with their similarity scores, the model's suggested PT with confidence indication, the user's action (accept/reject/override), the final PT entered, the user ID and timestamp for the human decision, and the link to the eCRF record updated.

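That list has eleven entries, and closer to fifteen distinct fields once composite entries (the sampling parameters, the user ID and timestamp) are split out, for what feels like a single coding event. Most current implementations capture three or four. The gap is the Part 11 risk.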
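A gap report for a coding event can be as simple as comparing what was logged against a required field list. The sketch below assumes hypothetical field names that mirror the record set described above; the two retrieval fields are only required when MedDRA hierarchy retrieval was actually used.

```python
REQUIRED_CODING_FIELDS = (
    "verbatim_term",
    "full_prompt",
    "model_id",
    "sampling_parameters",
    "embedding_model_version",
    "retrieved_candidates",
    "suggested_pt",
    "user_action",
    "final_pt",
    "user_id",
    "decision_timestamp",
    "ecrf_record_link",
)

# Only applicable when a retrieval step was part of the workflow.
RETRIEVAL_ONLY_FIELDS = {"embedding_model_version", "retrieved_candidates"}


def missing_fields(event: dict, retrieval_used: bool = True) -> list:
    """Return the required field names that the logged event does not carry."""
    required = [
        name
        for name in REQUIRED_CODING_FIELDS
        if retrieval_used or name not in RETRIEVAL_ONLY_FIELDS
    ]
    return [name for name in required if event.get(name) is None]


# A log entry of the kind we often see: three fields and nothing else.
typical_log = {
    "verbatim_term": "upset stomach for two days after dosing",
    "suggested_pt": "Abdominal discomfort",
    "final_pt": "Abdominal discomfort",
}

print(missing_fields(typical_log))
```

Run against a sample of production events, a check like this turns "we think we log enough" into a list of named gaps.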

Audit trails: the model dimension that's missing

11.10(e) requires audit trails to record "operator entries and actions that create, modify, or delete electronic records." For decades the design pattern has been straightforward — log every CRUD operation with user, timestamp, old value, new value. For AI systems, this pattern is necessary but not sufficient.

The reason is the model itself. When an AI generates an output that becomes part of a regulated record, the model is the actor — or at least an actor — in that creation. A Part 11 audit trail that records only the human user who initiated the inference, and not the model that produced the substantive content, is incomplete. It tells you who pressed the button. It does not tell you what was on the other side of the button.

We use a six-field extension to the traditional audit trail for any system involving AI inference. Beyond the standard {user, timestamp, action, record, old value, new value}, an AI-aware audit trail also captures:

Model invocation ID. A unique identifier for the specific inference event. This is the join key that links the audit log entry to the captured inference record (inputs, outputs, configuration, trace as described above). Without this ID, you have a record that says "AI changed this field" with no way to retrieve what AI did.

Model identity. The exact version string. This duplicates information stored in the inference record itself, but having it inline in the audit trail enables filtering and rapid scan — "show me all records produced by claude-3-5-sonnet-20240620 in this study."

Confidence or score indicator. If the model exposed any signal of its own uncertainty — log probability, retrieval similarity score, classifier confidence — capture it. This is what enables the reviewer or auditor to ask sensible questions about model behavior at the edges.

Human review state. Was the model output reviewed by a human before being committed to the record? If yes, who, when, and did they accept, modify, or override? The model-output-to-eCRF-field path with no human review is a different risk profile from the same path with reviewer sign-off, and the audit trail must reflect which one happened.

Reason for change. Part 11 does not explicitly require a reason for change, but FDA expects it as a Good Documentation Practice and EU Annex 11 effectively requires it. For AI-generated changes, the "reason" is a structured combination of model identity, prompt template, and confidence — not a free-text user explanation. The audit trail design has to accommodate that.

System boundary indicator. Was the inference performed by a model under your control (self-hosted, in your VPC) or by an external API (closed system versus open system, in Part 11 terms)? This matters for how you defend the rest of the audit trail to a regulator.
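The six-field extension can sit alongside the traditional columns in one entry type. This is a sketch under our own naming assumptions: `AuditEntry`, its field names, and the allowed values in the comments are hypothetical, not a standard schema.

```python
from dataclasses import dataclass
from typing import Optional

AI_FIELDS = (
    "model_invocation_id",
    "model_identity",
    "confidence",
    "human_review_state",
    "reason_for_change",
    "system_boundary",
)


@dataclass(frozen=True)
class AuditEntry:
    # Traditional audit trail fields.
    user: str
    timestamp: str
    action: str
    record: str
    old_value: Optional[str]
    new_value: Optional[str]
    # AI-aware extension.
    model_invocation_id: Optional[str] = None  # join key to the inference record
    model_identity: Optional[str] = None       # exact version string
    confidence: Optional[float] = None         # any uncertainty signal exposed
    human_review_state: Optional[str] = None   # e.g. accepted, modified, overridden
    reason_for_change: Optional[dict] = None   # structured, not free text
    system_boundary: Optional[str] = None      # "closed" or "open"

    def ai_fields_missing(self) -> list:
        """Names of AI extension fields left empty on this entry."""
        return [name for name in AI_FIELDS if getattr(self, name) is None]


# An entry that only records who pressed the button.
button_only = AuditEntry(
    user="user-4471",
    timestamp="2025-01-15T10:31:02+00:00",
    action="modify",
    record="ae-3.preferred_term",
    old_value=None,
    new_value="Abdominal discomfort",
)

print(button_only.ai_fields_missing())
```

Keeping the extension fields optional lets the same table hold purely human edits, while `ai_fields_missing` flags AI-originated entries that were logged the old way.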

The hardest part in practice is not any one of these six fields but the trace that the invocation ID points to. For a simple LLM call, the inference record can be stored as a JSON document and referenced from the audit trail entry. For an agentic system that calls four tools, retrieves from two indices, and chains three model calls before producing the final output, the inference record can be hundreds of kilobytes of structured trace. Teams either over-engineer (storing every intermediate state forever, blowing up storage cost) or under-engineer (storing only the final output and hoping nobody asks). The right middle is to store the full trace for any inference whose output reaches a regulated record, and a sampled trace for everything else.
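That retention rule is small enough to write down. The sketch below assumes a 5% sample rate and a hash-based sampler, both our own choices for illustration; hashing the invocation ID makes the decision repeatable, so the same event always gets the same answer.

```python
import hashlib


def trace_retention(
    invocation_id: str,
    reaches_regulated_record: bool,
    sample_rate: float = 0.05,  # assumed rate; set it from your own risk assessment
) -> str:
    """Decide whether to keep the full trace or only a summary."""
    if reaches_regulated_record:
        return "full"
    digest = hashlib.sha256(invocation_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison keeps the 0.0 and 1.0 edge cases exact.
    threshold = int(sample_rate * 2**64)
    return "full" if bucket < threshold else "summary"


print(trace_retention("inv-0001", reaches_regulated_record=True))
print(trace_retention("inv-0002", reaches_regulated_record=False))
```

The important design point is that the regulated-record branch never consults the sampler at all.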

Consider an imaging endpoint workflow. An AI tool measures the longest diameter of a target lesion. The radiologist reviews and accepts. The measurement goes into the database, ultimately feeding the RECIST 1.1 response assessment. A Part 11 audit trail for that measurement needs to record: which DICOM was analyzed (instance UID and hash), the model and version, the segmentation mask produced, the measurement computed, the radiologist's review action, any manual adjustment they made, the final value committed. If the radiologist accepted the AI value with no adjustment, that is itself a meaningful audit event — not the absence of one. "Confirmed AI measurement of 24.3 mm without modification" is the entry; "no human action" is what shows up in a traditional audit trail and what gets you a finding.
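In code, the difference is that the review step always emits an event, whether or not the value changed. A minimal sketch, with a hypothetical `review_event` helper and made-up reviewer ID:

```python
def review_event(ai_value_mm: float, final_value_mm: float, reviewer: str) -> dict:
    """Build an audit event for a radiologist's review of an AI measurement."""
    if final_value_mm == ai_value_mm:
        action = "confirmed"
        description = (
            f"Confirmed AI measurement of {ai_value_mm:.1f} mm without modification"
        )
    else:
        action = "adjusted"
        description = (
            f"Adjusted AI measurement from {ai_value_mm:.1f} mm "
            f"to {final_value_mm:.1f} mm"
        )
    return {
        "reviewer": reviewer,
        "action": action,
        "ai_value_mm": ai_value_mm,
        "final_value_mm": final_value_mm,
        "description": description,
    }


confirmed = review_event(24.3, 24.3, reviewer="rad-017")
adjusted = review_event(24.3, 25.1, reviewer="rad-017")
print(confirmed["description"])
print(adjusted["description"])
```

There is deliberately no branch that returns nothing: an accepted value still produces an entry attributed to the reviewer.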

Model versioning: what triggers a version event

This is where most current implementations are weakest, and it's where the Part 11 risk is most concentrated for AI systems.

The naive view of model versioning treats it as a one-dimensional problem: the vendor changes the model name, you note the new name, that's the version event. This view captures one of the eight change types listed below.

The full list of changes that should trigger a new system version (and therefore change control under Part 11) for an AI system includes:

Vendor checkpoint update. Same model family, new underlying weights. OpenAI ships gpt-4o-2024-08-06 after gpt-4o-2024-05-13. Anthropic ships claude-3-5-sonnet-20241022 after claude-3-5-sonnet-20240620. If you pinned to the older snapshot, you're protected until the deprecation date. If you used the unversioned model name, the change happens silently. Either way, when you adopt the new snapshot, that is a system change.

Fine-tune update. Continued training on your own data produces a new fine-tune ID. The old fine-tune and the new fine-tune are different systems, full stop.

System prompt change. The system prompt is not "configuration" in the casual sense — it is part of the program. A change to the system prompt can radically alter the model's outputs on the same input. This is the version event most often missed. Teams iterate on prompts in production with no change control. Then a reviewer asks why outputs from January look different from outputs from March, and there's no documented prompt history to refer to.

Retrieval corpus change. For any RAG-based system, the retrieval index is part of the model. If documents are added, removed, or re-embedded, the system has changed. The embedding model itself counts too — switching from text-embedding-3-small to text-embedding-3-large makes the same documents retrieve differently for the same query.

Tool definition change for agents. Agentic systems where the model can call tools (search, calculation, database queries) have the tool list as part of their behavior surface. Adding, removing, or modifying a tool definition is a system change.

Sampling parameter change. Going from temperature 0 to temperature 0.7 is a different system. Going from top_p=1 to top_p=0.9 is a different system. These changes feel small. They affect output distribution materially.

Pre- or post-processing logic change. Most production AI systems sit inside a pipeline that does input cleaning before the model and output parsing after. Changes to either are system changes.

Underlying API change. If the vendor changes the API surface — adds a new field to the response, changes how function calls are structured, modifies the error behavior — your system has effectively changed even if you didn't update anything yourself.

Each of these should trigger, at minimum, an entry in a model registry, a validation check appropriate to the risk level of the change, and a Part 11 change control record. In practice, the discipline most teams achieve is to log the first item (vendor checkpoint update) and ignore the rest. The audit risk from this is real and growing — regulators are starting to ask not just "is the model validated?" but "is the model that produced this specific record the same one that was validated?"

The cleanest implementation we've seen treats the AI system as a versioned unit composed of {model snapshot, system prompt, configuration, retrieval index version, tool list, pre/post-processing code}. The hash of that bundle is the system version. Any change to any component produces a new hash and a new version event. The model registry stores all versions ever deployed, with their validation status. The audit trail entries store the version hash, not just the model name. Reconstruction is straightforward because every component is recoverable from the bundle.
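A minimal sketch of that bundle hash, assuming the components can be serialized to JSON. The key names, the index identifier, the commit placeholder, and the in-memory `registry` dict are all hypothetical; a real registry would be a persistent, access-controlled store.

```python
import hashlib
import json


def system_version(bundle: dict) -> str:
    """Hash a canonical JSON form of the bundle; key order does not matter."""
    canonical = json.dumps(bundle, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


baseline = {
    "model_snapshot": "claude-3-5-sonnet-20241022",
    "system_prompt": "Suggest the best MedDRA Preferred Term for the verbatim term.",
    "configuration": {"temperature": 0, "top_p": 1, "max_tokens": 64},
    "retrieval_index_version": "meddra-index-001",  # hypothetical identifier
    "tool_list": [],
    "processing_code_commit": "0000000",            # placeholder commit ID
}

# Registry keyed by system version hash, holding validation status.
registry = {system_version(baseline): {"validation_status": "validated"}}


def validation_status(bundle: dict, registry: dict) -> str:
    entry = registry.get(system_version(bundle))
    return entry["validation_status"] if entry else "unregistered"


# A one-sentence prompt edit is enough to produce a new, unvalidated version.
edited = dict(
    baseline,
    system_prompt=baseline["system_prompt"] + " Answer with the PT only.",
)

print(validation_status(baseline, registry))
print(validation_status(edited, registry))
```

Because the audit trail stores the hash, "was this record produced by a validated system?" becomes a lookup instead of an investigation.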

The CSA guidance finalized in September 2025 helps here — it endorses a risk-based approach to validation effort, which means low-risk component changes (a prompt edit affecting a non-regulated user-facing message) don't require the same depth of validation as high-risk changes (a model snapshot update affecting MedDRA coding). But "risk-based" is not the same as "ignore." Every change still needs to be captured in the version history.

ALCOA+ when the system is non-deterministic

ALCOA+ (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available) is the data integrity framework FDA and other regulators apply to electronic records. Each letter has a specific meaning, and AI systems strain most of them in interesting ways.

Attributable. For AI, attribution must extend to the model invocation, not just the human user. "Generated by claude-3-5-sonnet-20241022 at the request of user_id 4471" is the form. Most current implementations stop at the user_id.

Legible. The output must be stored in a form a human can read. This sounds trivial, but for embedding-based pipelines we've seen records stored as vectors with no text reconstruction. If the audit trail says "similarity 0.94 to cluster 17," that is not legible. Store the text.

Contemporaneous. Logged at the moment of inference. Not reconstructed later from prompts and seeds. The inference happens once. The record is created at the same time, or it isn't a record.

Original. The exact output, not a paraphrase. If the LLM said "The patient experienced mild nausea on day 3," that string is the record. Storing "AE: mild nausea, day 3" is a summarization that has destroyed the original. Both can be stored — the original must be one of them.

Accurate. This is where AI breaks ALCOA most violently. "Accurate" assumes a ground truth the record corresponds to. Hallucinations are not accurate by definition. Probabilistic outputs are not "accurate" in the classical sense — they are samples from a distribution.

Our recommended reframing for AI systems is to shift the operative concept from accurate to verifiable. A verifiable record is one where a human, with the full captured record set (input, output, model identity, configuration, trace), can determine for themselves whether the output is acceptable for the intended use. The system does not assert its own accuracy. The record enables verification. This reframing is consistent with the spirit of ALCOA+ and consistent with the FDA AI credibility framework's emphasis on context-of-use validation rather than universal accuracy claims.

Complete. All the data, not just the output. This is the inference record set discussed earlier. Incomplete inference records fail this criterion.

Consistent. Records are stored in a stable format over time. For AI, this means committing to a captured record schema and versioning it explicitly when it changes.

Enduring. Retained for the required period. The challenge for AI is that retaining model weights is harder than retaining database rows. If your model is hosted externally, you cannot retain the weights at all — you can only retain the version string and rely on the vendor's snapshot availability. This is a key reason hosted-only AI deployments are structurally weaker for long-retention regulated records.

Available. Retrievable when asked. The audit trail's job is to make the inference record retrievable. If the audit trail entry points to an inference record that has been garbage-collected, available has failed.
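The Available criterion in particular lends itself to an automated check. The sketch below assumes audit entries are dicts carrying a hypothetical `model_invocation_id` key and that the inference store can be treated as a mapping from invocation ID to record.

```python
def dangling_references(audit_entries: list, inference_store: dict) -> list:
    """Invocation IDs cited by the audit trail with no stored inference record."""
    return [
        entry["model_invocation_id"]
        for entry in audit_entries
        if entry.get("model_invocation_id") is not None
        and entry["model_invocation_id"] not in inference_store
    ]


# Illustrative data: the second inference record has been garbage-collected.
audit_entries = [
    {"record": "ae-3.preferred_term", "model_invocation_id": "inv-0001"},
    {"record": "ae-4.preferred_term", "model_invocation_id": "inv-0002"},
    {"record": "ae-5.severity", "model_invocation_id": None},  # human-only edit
]
inference_store = {"inv-0001": {"raw_output": "Suggested PT: Abdominal discomfort"}}

print(dangling_references(audit_entries, inference_store))
```

A non-empty result means the audit trail is making a promise the storage layer can no longer keep.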

The honest conclusion is that strict ALCOA+ compliance for AI-generated records is achievable, but it requires more record-keeping infrastructure than most teams have built, and it requires the reframing of "accurate" toward "verifiable" to be operationally meaningful.

Hosted models and the open-system trap

This deserves its own section because it is the single most common Part 11 oversight we see in AI deployments.

11.3(b)(4) defines a closed system as "an environment in which system access is controlled by persons who are responsible for the content of electronic records that are on the system." 11.3(b)(9) defines an open system as one where access is not so controlled.

When you make an API call to OpenAI, Anthropic, Google, or any other hosted model provider, the model — a critical component of your system — runs in an environment controlled by a third party who is not responsible for the content of your records. The model could be updated. The infrastructure could change. The vendor could log your prompts. Access to the model is controlled by the vendor's identity and access controls, not yours.

That is the definition of an open system.

This has direct consequences under Part 11. Section 11.30 requires that for open systems, sponsors employ "procedures and controls designed to ensure the authenticity, integrity, and, as appropriate, the confidentiality of electronic records from the point of their creation to the point of their receipt." In practice, this means encryption in transit, encryption at rest, and digital signatures or equivalent integrity controls.

For hosted LLM APIs, the encryption-in-transit part is standard. The encryption-at-rest part depends on the vendor's data handling agreement. The integrity controls part is where most implementations are thin — there is typically no signed proof that the output you received is the output the model produced, nor that the model you invoked is the model the vendor claims it is.
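What you can control is integrity from the moment of receipt onward. The sketch below seals the received response with an HMAC using Python's standard `hmac` module. It is an assumption-laden illustration: the key shown is a placeholder that would live in a key management system, an HMAC is a symmetric integrity check and not a digital signature, and nothing here proves what the vendor's model actually produced.

```python
import hashlib
import hmac


def seal(payload: bytes, key: bytes) -> str:
    """Compute an HMAC-SHA256 tag over the response exactly as received."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, key: bytes, tag: str) -> bool:
    """Constant-time check that the stored payload still matches its tag."""
    return hmac.compare_digest(seal(payload, key), tag)


key = b"placeholder-key-from-your-kms"  # hypothetical; never hard-code real keys
received = b'{"model": "claude-3-5-sonnet-20241022", "output": "Abdominal discomfort"}'

tag = seal(received, key)
tampered = received.replace(b"discomfort", b"pain")

print(verify(received, key, tag))
print(verify(tampered, key, tag))
```

This narrows the integrity gap to the vendor side of the boundary, which is where the contractual controls discussed next have to carry the weight.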

There are three architectural paths to managing the open-system status of hosted AI:

Pinned versions and contractual controls. Pin every API call to a specific snapshot. Maintain a vendor agreement that includes data handling, audit rights, and minimum notice for model deprecation. This is the path most production deployments use today. It does not make the system closed under Part 11. It does make the open-system risk manageable.

Self-hosted models. Run the model on infrastructure you control. The model becomes part of a closed system in the Part 11 sense. The tradeoff is operational — you own model serving, scaling, monitoring, and updates. For most clinical operations use cases this is overkill. For high-risk applications where reconstruction over a long retention period is essential, it may be the right answer.

Hybrid batch capture. Use hosted models for inference but capture the full output and store it in a closed system before it touches any regulated record. The hosted model is "outside" the regulated boundary. The captured record is "inside." This is a useful pattern for workflows where regulated record creation can be decoupled in time from inference.
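The third path can be sketched as a write-once capture store that regulated records may only draw from. The `CaptureStore` class and its method names are hypothetical, and an in-memory dict stands in for what would be a validated, access-controlled store.

```python
import hashlib


class CaptureStore:
    """Closed-side store: hosted-model outputs land here before any record uses them."""

    def __init__(self):
        self._captures = {}

    def capture(self, capture_id: str, model_id: str, raw_output: str) -> None:
        if capture_id in self._captures:
            raise ValueError("capture IDs are write-once")
        self._captures[capture_id] = {
            "model_id": model_id,
            "raw_output": raw_output,
            "sha256": hashlib.sha256(raw_output.encode("utf-8")).hexdigest(),
        }

    def commit_to_record(self, capture_id: str, record: dict) -> dict:
        """Copy a captured output into a regulated record, keeping the link."""
        if capture_id not in self._captures:
            raise LookupError(f"no captured inference for {capture_id}")
        updated = dict(record)
        updated["value"] = self._captures[capture_id]["raw_output"]
        updated["source_capture"] = capture_id
        return updated


store = CaptureStore()
store.capture("cap-0001", "claude-3-5-sonnet-20241022", "Abdominal discomfort")
ecrf_field = store.commit_to_record("cap-0001", {"field": "ae-3.preferred_term"})
print(ecrf_field)
```

The only way content reaches a record is through `commit_to_record`, so an output that was never captured inside the boundary cannot be committed.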

Whichever path you choose, the Part 11 documentation has to be explicit about it. We've seen system validation reports that describe an AI tool as if it were a self-contained system without mentioning that inference is performed by a hosted vendor service. That documentation gap is the audit finding. The narrative the regulator wants is: this is an open system; here are the open-system controls; here is the vendor agreement; here is the pinned version; here is the change control process when the vendor updates.

The operational playbook

Here is everything above distilled into what to actually do this quarter, in priority order:

1. Build a model registry. Not a spreadsheet. A versioned, queryable record of every model deployment your organization has — including model identity, fine-tune ID, system prompt, configuration, retrieval components, tool definitions, validation status, and effective dates. Every inference event should be linkable to a specific registry entry. The registry is the spine of Part 11 compliance for AI; nothing else works without it.

2. Pin every API call. No deployment should be using unversioned model names in production. If your code says model="gpt-4-turbo" instead of model="gpt-4-turbo-2024-04-09", that line is a Part 11 risk. Fix it before anything else.

3. Capture the full record set per inference. Inputs, outputs, model identity, configuration, trace, provenance link. Stored in an immutable log with retention matching the predicate rule, which varies by jurisdiction and record type (for example, at least 25 years for the trial master file under the EU Clinical Trials Regulation). If you can't reconstruct an inference exactly from the stored record, the record is incomplete.

4. Treat configuration as code. System prompts, sampling parameters, tool definitions, and pre/post-processing logic live in version control with the same discipline as application code. Changes go through code review and change control. The hash of the configuration bundle is part of the system version.

5. Define your system boundary in writing. Document explicitly whether each AI component is operating in a closed system (self-hosted) or an open system (hosted API), what the Part 11 controls are for each side of the boundary, and how records cross from open to closed.

6. Validate the workflow, not the model alone. The CSA guidance from September 2025 reinforces this: validation effort should be proportional to the risk the function poses to product quality and patient safety. For an AI tool, the question is rarely "is the model accurate?" — it is "does the workflow surrounding the model produce reliable regulated records?" Validation evidence should cover the human-AI loop, the audit trail, the reconstruction capability, not just the model's standalone performance metrics. We covered the seven-step credibility approach in detail in our FDA AI Credibility Framework guide; validation effort under Part 11 should follow the same context-of-use logic.

7. Don't put unreviewed AI output behind an electronic signature. This is a separate Part 11 risk most teams miss. 11.50 and 11.70 govern signature manifestations and the linking of electronic signatures to the records they sign. If a user e-signs a record that contains AI-generated content they did not actually review, the signature integrity is in question: the user has attested to content they neither wrote nor checked. Build the workflow so that AI outputs require explicit human review before any e-signed record incorporates them.

8. Operationalize the audit log aggregation. AI workflows typically span multiple systems — the application, the model API, the vector store, the downstream record system. Each produces logs in a different format. The Part 11 audit trail is whatever your inspector can reconstruct from those logs, which means the aggregation pattern matters as much as the individual log captures. Many teams use workflow automation platforms like Make.com to ETL events from disparate sources into a unified audit repository — that's a reasonable starting architecture for small and mid-size deployments where building a custom aggregation layer is overkill.

9. Maintain a model-deprecation watchlist. Every hosted model snapshot you depend on has a deprecation date, announced or otherwise. Track them. Plan migration to new snapshots as a change control event, not a fire drill. When gpt-4-turbo-2024-04-09 was scheduled for deprecation, teams using it for regulated records had a known window to validate the successor snapshot. Teams using gpt-4-turbo unversioned discovered the switch had already happened.

10. Build the reconstruction drill into your readiness program. Pick a random AI-generated record from six months ago. Try to reproduce its exact output from your captured record set. If you can't, your record set is incomplete. Do this every quarter. The drill surfaces failures long before an inspector does.
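Two items on this list, pinning (item 2) and the reconstruction drill (item 10), are mechanical enough to automate. The sketch below rests on stated assumptions: the date-suffix pattern is a heuristic that fits the snapshot names used in this guide but not every vendor's naming scheme, and `generate` is a hypothetical callable wrapping whatever model client you use.

```python
import re

# Heuristic: a trailing date such as -2024-04-09 or -20241022 marks a snapshot.
SNAPSHOT_SUFFIX = re.compile(r"-\d{4}-?\d{2}-?\d{2}$")


def is_pinned(model_name: str) -> bool:
    """True when the model name ends in a dated snapshot suffix."""
    return bool(SNAPSHOT_SUFFIX.search(model_name))


def reconstruction_drill(record: dict, generate) -> dict:
    """Re-run a stored inference and report whether the output matches exactly."""
    reproduced = generate(
        record["model_id"], record["input_payload"], record["configuration"]
    )
    return {
        "invocation_id": record["invocation_id"],
        "pinned": is_pinned(record["model_id"]),
        "exact_match": reproduced == record["raw_output"],
    }


# Stand-in for a real model call, used only to exercise the drill.
def fake_generate(model_id, input_payload, configuration):
    return "Suggested PT: Abdominal discomfort"


stored = {
    "invocation_id": "inv-0001",
    "model_id": "claude-3-5-sonnet-20241022",
    "input_payload": {"user": "upset stomach for two days after dosing"},
    "configuration": {"temperature": 0},
    "raw_output": "Suggested PT: Abdominal discomfort",
}

print(is_pinned("gpt-4-turbo"), is_pinned("gpt-4-turbo-2024-04-09"))
print(reconstruction_drill(stored, fake_generate))
```

A failed `exact_match` is a finding to investigate, not necessarily a defect: it may point to a missing configuration field, a retired snapshot, or sampling that was never seeded.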

What FDA has and hasn't said

To be candid about the regulatory gap: FDA has not issued definitive guidance applying Part 11 specifically to AI/ML systems used in clinical operations. The closest the agency has come is the cluster of adjacent documents we've referenced — the CSA guidance, the AI credibility framework for drug submissions, the PCCP guidance for AI-enabled devices, the joint FDA/EMA Guiding Principles. Each of these addresses a piece of the picture. None of them resolves the core Part 11 questions about audit trails for non-deterministic systems, version control for prompt-based components, or the closed/open system status of hosted APIs.

ICH E6(R3), finalized in January 2025 and adopted by FDA in September 2025, comes closer than anything else. The R3 revision has a dedicated chapter on computerized systems requiring "a comprehensive system inventory detailing each system's purpose, validation status, access controls, security measures, interfaces, and management responsibilities" — language that maps cleanly onto a model registry, though it doesn't say AI explicitly. The R3 framework is risk-based and proportional, which gives teams room to design AI controls that fit the risk profile rather than over-applying legacy CDMS patterns.
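
One way to read that inventory language against a model registry is as a set of required fields per entry. The sketch below is our assumption, not something E6(R3) prescribes: the class name and field names are hypothetical, and `model_snapshot` and `prompt_version` are AI-specific additions that do not appear in the R3 wording.

```python
from dataclasses import dataclass, fields


@dataclass
class RegistryEntry:
    """One model registry entry, shaped after the R3 inventory wording.

    All field names are hypothetical; R3 lists the topics, not a schema.
    """
    system_name: str
    purpose: str = ""
    validation_status: str = ""
    access_controls: str = ""
    security_measures: str = ""
    interfaces: str = ""
    management_responsibility: str = ""
    # AI-specific additions (our assumption, not R3 language):
    model_snapshot: str = ""
    prompt_version: str = ""


def missing_fields(entry: RegistryEntry) -> list[str]:
    """Return the names of fields left blank, in declaration order."""
    return [f.name for f in fields(entry)
            if not str(getattr(entry, f.name)).strip()]
```

Running `missing_fields` over every registry entry before an inspection gives a quick list of inventory gaps to close.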

Europe is moving faster. The draft revision of GMP Annex 11, expected to be finalized in mid-2026, would require audit trails to be "always on and locked" and adds requirements around lifecycle traceability that map directly onto AI versioning concerns. The new draft Annex 22, which specifically addresses AI in GMP-regulated environments, introduces "human oversight and explainability" as explicit regulatory expectations. Once final, it would be the first time we've seen those concepts written into a binding GMP document.

The EU AI Act adds another layer, with most of its obligations scheduled to apply from August 2026. The Act's high-risk classification covers AI systems used in critical medical contexts, and its documentation and risk management requirements run in parallel to Part 11 without aligning perfectly. Sponsors operating in both jurisdictions are building a unified control framework that satisfies the strictest applicable rule on each dimension.

The honest summary is that no single regulation today gives you a complete answer for AI under Part 11. The teams that appear best positioned for the next several years of inspections are the ones building infrastructure now under uncertainty — comprehensive inference record capture, model registries with full configuration versioning, workflow-level validation evidence, explicit closed/open system documentation, and reconstruction capability as a tested property of the system. None of these controls are explicitly required by Part 11 as written today. All of them are consistent with the direction adjacent guidance is converging on.
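
To make "full configuration versioning" concrete, one option is to derive a version identifier from everything that shapes an output, so that a prompt edit cannot pass as a routine change. This is a minimal sketch under our own assumptions: the function names are hypothetical, and the date-suffix check is only a heuristic based on snapshot names like gpt-4-turbo-2024-04-09. Other vendors name snapshots differently, so the pattern would need adapting.

```python
import hashlib
import json
import re

# Heuristic (assumption): a pinned snapshot name ends in a YYYY-MM-DD stamp.
_DATE_SUFFIX = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_pinned(model_name: str) -> bool:
    """True if the model name looks like a dated snapshot, not a moving alias."""
    return bool(_DATE_SUFFIX.search(model_name))


def config_version(model_name: str, prompt: str, params: dict) -> str:
    """Derive a short, stable version ID from model, prompt, and parameters.

    Any change to the prompt text or to a parameter yields a new ID.
    Unpinned aliases are rejected, because the ID would not identify
    which underlying model actually ran.
    """
    if not is_pinned(model_name):
        raise ValueError(f"unpinned model alias: {model_name!r}")
    canonical = json.dumps(
        {"model": model_name, "prompt": prompt, "params": params},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Storing this identifier on every inference record ties the record back to a registry entry, and a new identifier appearing in production without a matching change control record is itself a signal worth investigating.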

The next eighteen months

Three things are likely to happen between now and the end of 2027.

First, the FDA Part 11 modernization rule will probably issue. It has been promised since January 2023 and the agency continues to signal that it is coming. When it does, expect language that explicitly addresses cloud, mobile, and — at least obliquely — AI/ML systems. The teams that have already built AI-aware Part 11 infrastructure will translate easily. The teams that haven't will face a remediation cycle.

Second, EU GMP Annex 22 will be in force across the EU and adjacent markets, and global sponsors will adopt its AI controls as the de facto standard rather than maintaining different rules in different jurisdictions. The revised Annex 11's "always on and locked" audit trail language will become the baseline expectation for inspection readiness.

Third, AI-specific findings are likely to appear in Part 11 enforcement at some point. The timing and the first named sponsor are not predictable from where we sit, and we are not in the business of predicting them. What is observable today is that the enforcement volume already supports such citations — 327 warning letters in the second half of 2025 alone — and that the failure modes this guide describes (inadequate model versioning, missing inference records, unreconstructable AI-generated regulated records) are exactly the kind of issue an inspector is now equipped to identify if they look for it.

The work of getting Part 11 right for AI is not waiting for any of this. It is happening now, in the implementation choices teams are making every week. The choice to pin or not pin. The choice to log a trace or not. The choice to treat a prompt change as a version event or as a routine edit. Those choices accumulate into either a defensible audit posture or an audit liability, and they're being made before the rules that will eventually govern them are written.

That's the actual job. The rule will catch up. The records that are being created today either will or will not be reconstructable when the rule arrives.


Related reading on ClinStacks:

  • [The FDA 7-Step AI Credibility Framework — A Practitioner's Guide](/compliance/fda-ai-credibility-framework) — Compliance Guide 01
  • AI Compliance landing page — Index of all compliance guides and frameworks (FDA, EMA, GAMP, EU AI Act)
  • FDA AI Medical Device Tracker — Live registry of FDA-authorized AI/ML devices, useful for cross-referencing precedent
  • Clinical Trial AI Stack Guides — Practitioner-tested tooling guides for LLM-assisted clinical workflows
  • ClinStacks Advisory — For teams who want a Part 11 readiness review of their current AI deployments before their next inspection
If you're building or operating AI inside a Part 11-regulated workflow and want a second pair of eyes on your inference record capture or your model registry design, that's the kind of conversation our advisory engagements are built for. The cheapest finding to fix is the one you find before the inspector does.

— The ClinStacks team