Artificial Intelligence in Medical Devices - Part 4: Risk Management
By Mitch on Friday, 17 October 2025, 14:03 - Processes
We continue this series of articles on Artificial Intelligence with the Risk Management Process.
Simply put, risk management for MDAI isn't dramatically different from risk management for classical MDSW. Yet, some interesting points of view can be found in the literature and standards, namely AAMI TIR 34971 and the FDA guidances on AI. The future ISO TS 24971-2 on AI-related risks should be published in 2026.
What doesn't change
Let's imagine we have a legacy device with a classical algorithm. We want to place on the market a new generation of this device with AI models.
The risk management process doesn't change:
- Risk management steps (risk analysis, evaluation, control, and so on) remain the same,
- Your risk management file templates remain the same.
The risk modelling doesn't change:
- Software risk stems from software failure, be it classical or AI,
- The combination of severity and probability of occurrence of harm, as well as your tables of probability and severity levels, remain the same,
- Hazardous situations (software failure, refined into several possibilities such as false positives and so on) and harms (specific to your device's intended use) remain the same.
That's not very exciting! Let's continue.
What changes: risks on data
With AI models, we have:
- New technology = new failure modes,
- New failure modes = new sequence of events.
Examples of risks related to data
- Classical algorithm: a value out of bounds may lead to a floating-point overflow or underflow, and the software crashes,
- AI model: a value out of bounds may lead to an irrelevant result; the software doesn't crash and still gives a result, possibly completely off.
- In both cases, a mitigation action could be to filter out-of-bounds input values (see the sketch below). However, for AI with many input parameters (features), defining "out of bounds" may be challenging.
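To make this concrete, here is a minimal sketch of such an input filter, assuming per-feature bounds are derived from the training data percentiles. The percentile choices and the data are placeholders, not a recommendation for any particular device.

```python
import numpy as np

def learn_feature_bounds(X_train, lower_q=0.5, upper_q=99.5):
    """Derive per-feature acceptance bounds from training data percentiles."""
    lower = np.percentile(X_train, lower_q, axis=0)
    upper = np.percentile(X_train, upper_q, axis=0)
    return lower, upper

def flag_out_of_bounds(x, lower, upper):
    """Return the indices of features falling outside the learned envelope."""
    x = np.asarray(x)
    return np.where((x < lower) | (x > upper))[0]

# Usage: check the input before the AI model is called, reject or warn otherwise.
X_train = np.random.default_rng(0).normal(size=(1000, 20))  # placeholder training features
lower, upper = learn_feature_bounds(X_train)
suspicious = flag_out_of_bounds(X_train[0] * 10, lower, upper)
if suspicious.size > 0:
    print(f"Input rejected: features {suspicious.tolist()} are outside the training envelope")
```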
Data-driven design
As we've seen in the last two articles, AI design is data-driven. Thus, lots of risks stem from poor data quality or bias in data. Quoting AAMI TIR 34971, here are some possible sources of risk:
- Quality of data:
  - Incorrect
  - Incomplete
  - Subjective
  - Inconsistent
  - Atypical
- Thus, it is necessary to define data specifications and data quality criteria,
- Bias in data:
  - Selection bias
  - Non-normality
  - Proxy variables
  - Confounding variables
  - Group attribution bias
  - Experimental / medical practice bias
- Thus, it is necessary to stratify, analyze, and rework the data to avoid such biases.
This is exactly what we've seen previously in the data management process, with the following steps (amongst others):
- Defining data specifications,
- Defining data quality criteria,
- Preparing data,
- Reviewing data.
Process-level risks
These problems of data quality and bias are primarily mitigated by a proper data management process established within your existing QMS. Practically speaking, managing data to avoid these problems is the daily work of data scientists and of people with a clinical or scientific background.
We could even assert that such risks are mitigated more by processes than by product requirements.
In classical software development, developers can introduce bugs into software code. This is mitigated by a proper software design process.
In AI model design, data scientists and subject matter experts can introduce problems in data used for AI training, validation and test. This is mitigated by a proper data management process.
Data-level risks
Some risks can be attached to data. Thus, they can be seen as data risks, instead of product risks.
Examples of data risks
In design:
- Data risk: limited representativeness of data
- Mitigation: collect multi-centric data; mono-centric data is acceptable with a rationale.
- Data risk: imbalanced class proportions in the data
- Mitigation: document the stratification of the data and sample at random within each stratum to build the cross-validation sets (stratified cross-validation); see the sketch after this list.
- Data Risk: Different disease prevalence in datasets collected from multiple centers.
- Mitigation: balancing the prevalence by either deleting true negatives or adding synthetic true positives.
- Data risk: inter-expert variability. For a given dataset, some experts may classify a data item in class A, whereas others may opt for class B.
- Mitigation: define a parameter quantifying agreement between experts (very poor, poor, good, very good) and set a threshold on the agreement level. When the agreement is below the threshold, assign a new class "undefined" to the data item.
- Data risk: bias in the representativeness of the patient population, for example data skewed towards elderly rather than paediatric patients for an intended use addressing a large population, while the consequences of failure are more severe for paediatric patients than for adults.
- Mitigation: ensure a minimum number of paediatric cases in the datasets, especially the AI test dataset.
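As an illustration of the stratification mitigation mentioned above, here is a minimal sketch using scikit-learn's StratifiedKFold, which preserves class proportions in every fold. The dataset, class imbalance, and number of folds are placeholders.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Placeholder dataset: 200 items, binary labels with an imbalanced positive class.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = rng.choice([0, 1], size=200, p=[0.85, 0.15])  # roughly 15% positives

# StratifiedKFold samples at random within each class (stratum), so every
# cross-validation split keeps the same class proportions as the whole dataset.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    pos_rate = y[val_idx].mean()
    print(f"Fold {fold}: {len(val_idx)} validation items, positive rate = {pos_rate:.2f}")
```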
In production / post-market:
- Data risk: data drift when the device is fed with real-world data.
- Mitigation: ensure performance monitoring is planned for the MDAI placed on the market (see the drift-monitoring sketch below).
- Mitigation: plan retraining of the MDAI to release a new version every year.
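As a sketch of what data-level drift monitoring could look like, here is one possible approach using a per-feature two-sample Kolmogorov-Smirnov test. The reference and production datasets, feature count, and significance level are all placeholders; a real post-market surveillance plan would also track clinically meaningful performance metrics.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, production, alpha=0.01):
    """Compare each feature's production distribution with the training-time
    reference using a two-sample KS test; return the drifting feature indices."""
    drifting = []
    for j in range(reference.shape[1]):
        stat, p_value = ks_2samp(reference[:, j], production[:, j])
        if p_value < alpha:
            drifting.append(j)
    return drifting

rng = np.random.default_rng(1)
reference = rng.normal(0.0, 1.0, size=(1000, 5))   # placeholder training-time data
production = rng.normal(0.5, 1.0, size=(300, 5))   # placeholder real-world data, shifted
print("Drifting features:", detect_drift(reference, production))
```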
Remark: all these risks and mitigations can be recorded in a new section named "Data" in your risk assessment matrix.
What changes: risks on user interfaces and man-machine interactions
Conceptually, the FDA requires manufacturers to assess whether human + AI is better than human alone. With a risk-based approach, we also have to determine whether we can identify new risks in the case of human + AI.
With AI models, we have:
- New sequences of events, with man-machine interactions. Users may react differently when they are confronted with an AI,
- Devices with AI may also have a higher degree of autonomy than classical devices. This may lead to cases where users can hardly control the device or even monitor its behavior.
Examples of risks related to UI
- Classical algorithm: the software algorithm is a rules engine integrated in a robot. It always triggers the same action / gesture for data within a predefined range / envelope,
- AI model: the AI model is an ML model integrated in a robot. It may trigger a similar action / gesture for data within a predefined range / envelope, but not exactly the same one, leading to rare losses of accuracy,
- Users confronted with the robot with AI may place too much trust in the AI and overlook this accuracy problem.
What changes: risks on generative AI
Some MDAI with generative models have recently emerged in the medical device world, like:
- Models to generate custom-made implants by reconstruction of anatomy from medical images.
- Bots built on LLMs, making use of a specific bibliographic repository (RAG), to generate medical advice from patient-specific data.
With generative AI models, we have:
- New technology = new failure modes, once again,
- New failure modes = new sequence of events.
Examples of risks related to generative AI:
AI model: the AI model generates custom-made implants by reconstruction from medical images.
- Risk: it may generate an anatomically wrong implant shape.
- Mitigation: the surgeon reviews and manually adjusts the shape.
- Risk: it may generate spikes on the shape.
- Mitigation: a classical algorithm "surfaces" the shape to remove spikes (see the sketch after this list).
- Risk: It may generate shapes without appropriate mechanical resistance.
- Mitigation: a probe is 3D printed to test the resistance.
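As an illustration of the spike-removal mitigation above, here is a minimal sketch of a classical, deterministic post-processing step applied to a generated surface height map. The median filter, threshold, and surface data are placeholder choices for illustration, not the actual algorithm of any device.

```python
import numpy as np
from scipy.ndimage import median_filter

def remove_spikes(surface, size=3, threshold_mm=2.0):
    """Replace points deviating from their local median by more than
    threshold_mm (a "spike") with that median value."""
    smoothed = median_filter(surface, size=size)
    spikes = np.abs(surface - smoothed) > threshold_mm
    cleaned = np.where(spikes, smoothed, surface)
    return cleaned, int(spikes.sum())

# Placeholder: a generated surface height map (in mm) with two artificial spikes.
rng = np.random.default_rng(7)
surface = rng.normal(10.0, 0.2, size=(50, 50))
surface[10, 10] += 8.0
surface[30, 40] -= 6.0
cleaned, n_spikes = remove_spikes(surface)
print(f"Removed {n_spikes} spike points")
```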
Remark: if a classical algorithm exists with the same intended use, we would find the same risks. There is no really new sequence of events in the examples above. What is new with generative AI algorithms is the range of possible outputs, never seen with classical algorithms.
AI model: the LLM-based AI model generates medical advice from patient-specific data:
- Risk: the model confabulates wrong advice 10% of the time.
- Mitigation: challenge the LLM result. A second LLM, from a different supplier, is asked whether this advice is right. If there is a discrepancy between the two results, the first LLM is rerun with a modified prompt ("are you sure, my little fancy tiny sweet LLM, that this is ...?").
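A minimal sketch of this cross-check logic is shown below. The two LLM calls are placeholders (no real supplier API is implied), and the retry / escalation policy is an assumption for illustration.

```python
from typing import Optional

def query_primary_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for the first supplier's LLM call")

def query_reviewer_llm(prompt: str) -> str:
    raise NotImplementedError("placeholder for the second, independent supplier's LLM call")

def generate_checked_advice(patient_context: str, max_retries: int = 1) -> Optional[str]:
    """Ask the primary LLM for advice, have a second LLM challenge it, and rerun
    the primary LLM with a modified prompt when the two disagree."""
    prompt = f"Give medical advice for the following context:\n{patient_context}"
    for _ in range(max_retries + 1):
        advice = query_primary_llm(prompt)
        verdict = query_reviewer_llm(
            f"Context:\n{patient_context}\n\nProposed advice:\n{advice}\n\n"
            "Is this advice correct? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            return advice
        # Discrepancy between the two LLMs: rerun the first one with a modified prompt.
        prompt = f"Are you sure? Reconsider your advice for:\n{patient_context}"
    return None  # escalate to a human reviewer instead of returning unchecked advice
```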
Remarks:
- This is really a new sequence of events. Classical algorithms, such as rules engines (sometimes also called AI, but of a more traditional kind), would not be subject to such a random failure rate.
- In this example, what is the effectiveness of the mitigation?
- Could both LLMs confabulate at the same time? Assuming independent failures, roughly 10% x 10% = 1%.
- Is a wrong result once every hundred advices acceptable? This depends on the intended use:
- Probably "yes" for a daily nutrition assistant,
- Probably "no" for a drug prescription assistant (what a bad idea!).
More classical risks and mitigation actions
As with classical algorithms, we can have some kind of overarching mitigation action when the AI model fails. This AI failure is simply a software failure at system level:
- Risk: AI system detection algorithm result is irrelevant
- Mitigation 1: the AI system outputs a detection result and a confidence index.
- Mitigation 2: a software supervisor controls the AI system output and discards it when the result is outside a clinically relevant range (see the sketch after this list).
- Risk: failure of an AI system embedded in a robot
- Mitigation 1: software; when the AI system outputs irrelevant data, a supervisor falls back to a less complex but more reliable classical algorithm.
- Mitigation 2: hardware; a circuit breaker in case safety is compromised by both the AI and the classical software.
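As an illustration of such a software supervisor, here is a minimal sketch combining an output range check, a confidence threshold, and a fallback to a classical algorithm. The clinical range, threshold, and fallback values are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class SupervisedResult:
    value: float
    source: str  # "ai" or "fallback"

def supervise(ai_output: float,
              confidence: float,
              clinical_range: Tuple[float, float],
              min_confidence: float,
              fallback: Callable[[], float]) -> SupervisedResult:
    """Discard the AI output when it is outside the clinically relevant range
    or below the confidence threshold, and fall back to a classical algorithm."""
    low, high = clinical_range
    if low <= ai_output <= high and confidence >= min_confidence:
        return SupervisedResult(value=ai_output, source="ai")
    return SupervisedResult(value=fallback(), source="fallback")

# Usage with placeholder values: a heart-rate estimate outside 30-220 bpm is discarded.
result = supervise(ai_output=450.0, confidence=0.97,
                   clinical_range=(30.0, 220.0), min_confidence=0.8,
                   fallback=lambda: 72.0)
print(result)  # SupervisedResult(value=72.0, source='fallback')
```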
Precedence of mitigation actions
ISO 14971 requires seeking mitigation actions in order of precedence, starting with inherent safety by design. Finding such a mitigation for an AI model can be challenging:
- A proper data quality assurance can be seen as inherent safety by design.
- We could object that this is a rather indirect kind of safety by design. Data quality assurance is supposed to ensure a better trained model, but there is no direct relationship between data quality assurance and the safe behavior of the MDAI, at least with current technologies.
- Conversely, filtering AI system outputs with a classical (and deterministic) algorithm is a kind of direct inherent safety by design.
- There is a noticeable difference in effectiveness between data quality assurance (effectiveness demonstrated only statistically, on AI test data) and the classical algorithm (effectiveness demonstrated by classical scripted testing methods).
A proper data management plan and proper MDAI development and V&V plans are necessary to ensure the release of a safe and effective device. These are the only obvious actions for inherent safety by design we have at hand. And this is unfortunately the inherent limitation of AI/ML technologies built on statistical models.
For protective measures, we are on a more well-trodden path:
- Human oversight is seen as a protective measure (e.g., in AAMI 34971).
- It is more realistic to classify such a mitigation action as a protective measure: inherent safety by design shouldn't require a human action.
- Another protective measure may be an explanation of the AI output. An LLM quoting its sources is a kind of protective measure: the user still has to verify the sources themselves.
For information to users, we are also on a well-trodden path, but with some peculiarities:
- One kind of information to users is describing in the IFU the boundaries within which safe use of the AI is possible.
- As we've seen previously, AI models may use a lot more parameters than classical algorithms. They can have behaviors that are difficult to anticipate when real-world data fall outside the parameter envelope. And there may be no way to systematically check whether the parameters are within a predefined envelope. Explaining how the device can be used and within which boundaries may be the only possible mitigation action.
Information to users also contributes to the requirement of AI model transparency. While we may object that transparency is a concern for any algorithm, it is exacerbated by AI models, validated within a given envelope of parameters.
Thus, information to users can almost be seen as a mandatory mitigation action. This is why, in its guidance on AI-enabled medical devices, the FDA requests manufacturers to document a model card.
Explainable AI can also be seen as a mitigation action. Is it inherent safety by design or a protective measure? Since the explanation is presented to the user, we are more in the case of a protective measure, a bit like a warning pop-up in classical software.
Conclusion
Risk management for AI in medical devices is not so far from what we know for classical software. At least in the method. Some new sequences of events are possible, coming from the specific characteristics of AI. The future standard ISO TS 24971-2 will contain a questionnaire on these characteristics, helping manufacturers in their identification of AI-related risks.
This future standard will be a significant update of the current AAMI 34971. We've seen above that this topic is still subject to discussion. The state of the art of risk management for MDAI is still being built.