Software in Medical Devices, by MD101 Consulting



FDA Guidance on Artificial Intelligence enabled device software functions

In early January 2025, the FDA published a long-awaited guidance on Artificial Intelligence-Enabled Device Software Functions. This guidance contains the FDA's recommendations on how to document an AI-enabled device in marketing submissions.
At almost 70 pages long, this guidance is a monster!

Scope

The scope of this guidance is medical device functions powered by AI. Although any kind of AI technology could be envisioned, it focuses on AI based on Machine Learning (ML), which powers the vast majority of AI algorithms nowadays.

Even though this is not stated, we could add that, in terms of audience, it targets everybody in the FDA's personnel in charge of regulating medical devices. Reading the guidance, we can guess it circulated through many hands and departments of the FDA. Throughout the document, we find remarks or inserts about specific subjects (Q-Submissions, usability, cybersecurity...) showing that every department in charge of medical devices was consulted to write this guidance.
This is not so visible in other guidances, which makes it remarkable. This guidance probably required an amount of man-hours far above the average needed to write a "standard" guidance.

As a consequence, mirroring the involvement of FDA staff, the audience of this guidance is everyone dealing with the medical device lifecycle within the manufacturer's personnel: product, design, V&V, clinical, post-market, quality, regulatory...
To make it short: everybody involved in an AI-powered medical device, at any step of the lifecycle, should read this guidance.

AI/ML is revolutionary but it is not a revolution

I'm not trying to freeze the enthusiasm of AI aficionados with this subtitle! AI/ML is revolutionary, for it brings new possibilities to design algorithms and to deliver diagnostic / treatment functions that were totally out of reach with classical algorithms.
But, from a regulatory perspective, this is not a revolution. Most of what is described in this guidance was already applicable to medical device software with complex classical algorithms (called "deterministic" because their internals are easy to analyse, as opposed to ML algorithms, whose internals are difficult to analyse).

What makes the difference between a PACS workstation that supports the diagnosis of various diseases with complex classical algorithms, and another one with AI algorithms? What makes the difference between a diabetes management software predicting the insulin dose to inject, and an AI-powered one?
From a performance or clinical standpoint, the FDA expects the same level of scrutiny in the device marketing authorisation file. If we read this guidance with a device based on a complex yet classical algorithm in mind, most of its content remains relevant: device description, user interface and labelling, risk assessment, cybersecurity, and even data management (limited to V&V for classical algorithms).

What's new with AI

What's new with AI is the need to train the ML model with data: a data-driven design, as opposed to a software-driven design. The performance of an ML model relies heavily on the data used for training, tuning, verification, and validation. Thus, we find specific considerations in this guidance about ML model design, development, and above all ML model validation.
These considerations aim to ensure the delivery of an AI-powered device with an appropriate level of safety and performance, by addressing typical pitfalls of AI/ML algorithms:

  • Lack of transparency (opacity of models),
  • Introduction of biases,
  • Difficulty generalising,
  • Drift of performance once released.

The FDA simply aims to address these pitfalls by having manufacturers document their AI-powered devices following the recommendations described in this guidance.

Recommendations

As usual on this blog, we're not going to synthesise or paraphrase this guidance (some consultants excel at doing this, and they're being replaced by LLMs). We're going to spot a few striking points noticed while reading the guidance.

Vocabulary

A note on vocabulary is given in the introduction of the recommendations. It's quite important to align AI vocabulary with FDA definitions. In particular, the word "validation" can be misleading. In the medical device world, validation has a strict meaning, and we stick to it. It looks like it's up to data scientists to adapt their understanding to medical device vocabulary!

Device description

In all the elements listed in this section, we see the FDA's concern for transparency. But, coming back to classical algorithms, most of these elements were already required.
Some bullet points are new, like an explanation of how AI achieves the device's intended use. Likewise, some points seen as basic (or unimportant) for a classical algorithm become significant for AI, like the position of the AI-powered device in the clinical workflow. With data-driven design, such considerations about the workflow and the processing of data in target conditions can become of utmost importance.
A wrong description of the position of the AI-powered device in the clinical workflow can lead to a collapse of performance of the fielded ML model.

User-interface and labelling

Describing the user interface was already important for classical MDSW.
However, with ML models, on-the-spot detection by the user of a drift or degradation of performance can be seen as a mitigation of AI-specific risks. Hence the emphasis on the user interface in this guidance.

Likewise, labelling is important for any device. Here, specific considerations, mainly to address transparency, are discussed and recommended for the way a medical device incorporating AI is labelled.
Specific elements to document cover the ML model design, development, V&V, and post-market surveillance (ML model performance monitoring): a very comprehensive ML model description, to ensure transparency.

Risk assessment

Here, the FDA recommends addressing specific AI risks by making use of the AAMI CR 34971 guide. This guide can be seen as the state of the art, for the FDA doesn't add any other recommendations besides applying that document for AI risk management.
It's worth noting that the FDA insists on human factors risks specific to AI. A complex or overly opaque ML model may lead to misunderstandings or misinterpretations, wrong decisions, and eventually patient harm.

Note: AAMI CR 34971 should be replaced by ISO/TS 24971-2 in the near future. The publication of this TS is scheduled for August 2026.

Data management

This part is probably the most AI-specific. For classical algorithms, we rely on test datasets. For AI models, we rely on training, tuning, verification, and validation datasets. These datasets are at the centre of data-driven design (obviously).
The FDA guidance addresses the risk of wrong data leading to a wrong ML model: a biased model, a model that does not generalise, or data with confounding variables.

One major principle is the separation (the FDA uses the word "sequestration") of data: validation data shall never be seen by software development teams and model design teams. This principle was already present in older documents like the Good Machine Learning Practice co-authored by the FDA, Health Canada and the MHRA (itself being updated by the IMDRF Good Machine Learning Practice).
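As an illustration of this sequestration principle, here is a minimal sketch in Python (the function name and the 80/20 split are illustrative assumptions, not prescriptions from the guidance): once partitioned, the validation set stays locked away from the development teams.

```python
import random

def sequester_split(records, val_fraction=0.2, seed=42):
    """Partition records once, up front: the validation set is set
    aside ("sequestered") and never shown to model developers."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    validation = shuffled[:n_val]   # locked away until final validation
    development = shuffled[n_val:]  # available for training and tuning
    return development, validation

# The development pool can then be split again into training and tuning
# sets; the sequestered validation set is touched only once, at the end.
dev, val = sequester_split(list(range(100)))
```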

Classical software documentation effort is based on a simple principle: the higher the rigour of the software development process, the higher the software quality, and the lower the concentration of latent bugs in the released version.
Likewise, for ML models, the higher the rigour of the data management process, the better the ML model performance, and the better the ML model transparency.

It’s worth noting that the guidance references other guidances on clinical data management (broadly speaking), like the use of real-world data (RWD).
To sum up this long section, data management is a matter of having the right clinical data for training, tuning, verification, and validation. The way to obtain such relevant data is a rigorous data management process. This process shall ensure, to name a few recommendations, the establishment of reference standards (an "oracle" in ML language), an appropriate data annotation process, and data representativeness with regard to the device's intended use.

The FDA also insists on representativeness. To that end, it is not recommended to rely on RWD coming from a single site; having data from more than one site is warmly recommended.

All these recommendations point to clinical data acquired through a process similar to a clinical investigation with real patients, or a retrospective study with RWD (coming from real patients as well). Thus, documents like a data management plan, a study design, and a statistical sampling plan look compulsory.

Remark: some feedback from experience on FDA's expectations. Official guidances only recall the importance of having qualified experts and appropriate methods to establish ground truth. But the Agency also warmly recommends establishing ground truth with at least three clinical experts practising in the US. This recommendation on the number of experts isn't present in any guidance but is encountered in exchanges with FDA personnel (510(k) or QSub discussions...).
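To illustrate what a ground truth process with several experts can look like, here is a minimal majority-vote adjudication sketch (the three-expert panel, the labels, and the escalation rule are illustrative assumptions, not FDA requirements):

```python
from collections import Counter

def adjudicate(labels):
    """Derive a ground-truth label from independent expert annotations
    by majority vote; cases without a majority are flagged for a
    consensus review rather than labelled automatically."""
    if not labels:
        raise ValueError("at least one expert label is required")
    label, votes = Counter(labels).most_common(1)[0]
    if votes > len(labels) / 2:
        return label
    return None  # no majority: escalate to a consensus panel

# Three experts per case, as discussed in FDA exchanges:
case_annotations = [
    ["malignant", "malignant", "benign"],  # majority
    ["benign", "malignant", "atypical"],   # no majority, escalate
]
ground_truth = [adjudicate(a) for a in case_annotations]
```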

Model description and development

Like data management, model description is a way to ensure the transparency of ML models. Items required by the guidance can simply be put in a template document in your design procedure.
It's worth noting that model description isn't enough. The FDA also requires information about model development, training, and tuning. No standard exists yet on the model development process for medical devices (ISO/IEC 5338 gives some cues on the AI lifecycle process but is not specific to medical devices).

Validation

Considerations about ML model validation already apply to classical algorithms: performance validation, human factors validation, and clinical validation are known subjects.
For ML models, these considerations are all the more important.

Human factors

For human factors, the FDA stresses the safety risk caused by misunderstandings of the way ML model output data is presented to the end user. Once again, these considerations are already important for classical algorithms. Perhaps feedback from experience identified the understanding of ML model output as a frequently critical task, and pushed these considerations up the stack.
Interestingly, the FDA uses the language "human-device team". Manufacturers shall demonstrate that this team performs better than the human alone.
Probably a demonstration we're going to see more often, as devices gain in degree of autonomy.
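As a sketch of what such a demonstration could rely on, here is an exact McNemar test on the discordant pairs of a hypothetical reader study comparing the clinician alone to the human-AI team (the numbers and the choice of test are illustrative assumptions, not prescribed by the guidance):

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact McNemar test on discordant pairs: b = cases the clinician
    alone got right and the human-AI team got wrong, c = the reverse.
    Returns the two-sided p-value under the null of equal accuracy."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical reader study: 3 discordant pairs favour the clinician
# alone, 18 favour the human-AI team.
p = mcnemar_exact_p(3, 18)
```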

Performance validation

Once again, performance validation concerns come from feedback from experience: not only with medical devices, but also with ML models used in other industries.
In particular, the risk of unexpected performance collapse shall be addressed in validation protocols. Needless to say, the data management process also contributes to mitigating this risk. The FDA stresses the need for "quality, and quantity of data" to test the ML model.
Likewise, the FDA pinpoints the risk of overfitting.
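As a trivial illustration, an overfitting check in a validation protocol can boil down to comparing performance on training data against performance on the sequestered test set (the 5-point acceptance threshold below is an illustrative assumption, not an FDA figure):

```python
def overfitting_gap(train_accuracy, test_accuracy, max_gap=0.05):
    """Flag a model whose training performance exceeds its performance
    on unseen data by more than an acceptance threshold: a simple
    heuristic indicator of overfitting for a validation protocol."""
    gap = train_accuracy - test_accuracy
    return gap, gap > max_gap

# A model at 99% on training data but 88% on the sequestered test set
# shows an 11-point gap: a red flag for the validation report.
gap, flagged = overfitting_gap(0.99, 0.88)
```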

Study protocols and results

That section is almost a clinical investigation plan: either based on a retrospective study, when using historical data, or on a real investigation, when recruiting patients to collect data.
It usually takes a Clinical Investigation Plan (CIP), a State of the Art (SOTA), a Statistical Analysis Plan (SAP), and a Clinical Investigation Report (CIR) to answer FDA's expectations on study protocols and results.

ML model validation is all about clinical validation and statistics, as emphasised in appendix C. Be prepared to add the documents above to your De Novo or even 510(k) submission. A substantial equivalence table and pre-clinical tests on datasets known at design time won't be enough. Evidence of substantial equivalence for an ML model requires the documents listed above.
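To give a flavour of the statistics involved, here is a sketch of a 95% Wilson score interval for sensitivity, the kind of computation a Statistical Analysis Plan would specify (the study numbers are hypothetical, and the choice of the Wilson interval is an illustrative assumption):

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """95% Wilson score interval for a proportion such as sensitivity
    or specificity; the lower bound is typically compared against the
    acceptance criterion of the validation protocol."""
    if n == 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return centre - half, centre + half

# Hypothetical study: 180 true positives out of 200 diseased cases.
sens = 180 / 200
lo, hi = wilson_ci(180, 200)
```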

This is a bit like what the EU MDR requires in Technical Files (but with the least burdensome approach advocated by the FDA, not the most burdensome approach cherished by NBs - yes, I know, still some MDR bashing on this blog).

Device Performance Monitoring

This section addresses the risk of performance collapse due to RWD drift; more precisely, undesirable and unexpected data drift.
Performance monitoring recommendations are close to what we see in the PMS / PMCF plans (even closer to PMCF plans) required by the EU MDR.

The comparison ends here. PMS / PMCF plans are constrained by the regulatory requirements of MDR Articles 83 to 86 and Annex XIV on PMCF. FDA's view on performance monitoring is focused on ML model performance. It's up to the manufacturer to define the adequate measurements, indicators, or criteria required to appropriately monitor device performance.
FDA's recommendations could also be used on the EU side, to define the methods to collect data, and the rationale on data appropriateness, present in a PMCF plan. There's no MDCG guide on this essential characteristic of ML models up to now.
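To make this concrete, here is a sketch of one possible drift indicator, the population stability index (PSI), commonly used in other industries to monitor data drift (the bins, the baseline, and the 0.2 alert threshold are illustrative assumptions, not FDA recommendations):

```python
from math import log

def population_stability_index(expected, actual):
    """PSI between the training-time distribution of a feature (binned
    proportions) and its distribution in fielded data; values above
    ~0.2 are a common, informal trigger for investigating drift."""
    eps = 1e-6  # guard against empty bins
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * log(a / e)
    return psi

# Hypothetical monitoring check: bin proportions of an input feature
# at training time versus in fielded data.
baseline = [0.25, 0.25, 0.25, 0.25]
fielded = [0.10, 0.20, 0.30, 0.40]
psi = population_stability_index(baseline, fielded)
drift_alert = psi > 0.2
```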

Cybersecurity

At last, we end with cybersecurity (everybody at the FDA was consulted!).
The guidance quotes the possible safety problems that could arise from cyber threats. The likelihood of cyber risks depends on the exposure of the medical device, and its ML model, to the outside world.
While cyber threat scenarios can be very specific to ML models (e.g. data poisoning), the assessment of likelihood is similar to classical algorithms, based on the presence of SW / data assets and the ease of breaking into the system. But we could imagine a peculiar ML model containing assets of a nature and/or quantity unseen up to now, leading to new threats. A bit like LLMs being looted and duplicated by clever people, some ML models with a medical purpose could be subject to similar threats.
The menagerie of ML animals is vast enough to imagine new threats and scenarios.

Public Submission summary

Only one striking point: the FDA recommends (in fact, requires) publishing a lot about the ML model!
A transparency concern present at two levels:

  • Transparency for the FDA on the content of premarket submissions, with all the documentation required in the guidance,
  • Transparency for users, and other stakeholders in the supply chain, with the model card present in appendixes.

Still on transparency (though unrelated to the public submission summary), the FDA doesn't talk about methods to add interpretability to the ML model output itself, yet another way to contribute to transparency.

A word on QMS

The guidance also contains a section about the Quality System Regulation (QSR); in other words, the need to use existing QMS provisions, or adapt them, for the AI model and data lifecycle.
This section is quite vague and won't give you any tips on how to adapt your QMS to AI. It merely recalls existing requirements present in the QSR. An additional comment on QMS is present in the performance monitoring section, but it is still not practical enough.

ISO 42001 is supposed to be the standard for such a purpose, but it is not present in the FDA's recognized standards database. It's probably a symptom of the limitations of that standard. Built on a scheme similar to ISO 27001, it doesn't address (or doesn't address correctly) software design, verification and validation, or data management.


Conclusion

If you want to succeed in your next premarket submission of a device incorporating AI:

  • You shall follow the recommendations presented in this guidance. Take it seriously: put it in a tabular file and draw up an action plan for each paragraph,
  • You should (shall?) go through a QSub, to be sure that your ML model design and validation strategies meet FDA's expectations.



Note: everything you do on the FDA side may not be reusable on the EU side, in terms of:

  • Clinical validation strategy, EU MDR has its own requirements,
  • Clinical data. Obviously, there may be differences between US and EU populations.

