Radiology experts develop practical framework for evaluating AI models before purchasing
Radiology experts are detailing how they developed a practical framework for evaluating artificial intelligence models before deploying the technology.
Real-world performance of radiology AI apps frequently diverges from promised results, “creating challenges in anticipating a model’s clinical value and impact.” Researchers with Stanford University and Rad Partners recently developed a structured, pre-deployment method for assessing radiology AI models, detailing their work Wednesday in the American Journal of Roentgenology.
“The presented method provides a practical framework for guiding radiology practices in evidence-based purchasing and deployment decisions when evaluating radiology AI models,” lead author David B. Larson, MD, MBA, a professor of radiology with Stanford, and colleagues wrote March 4.
Rad Partners conducted a pre-deployment evaluation of a single vendor’s portfolio of 13 AI models between 2022 and 2024 (Aidoc was a co-author of the study). A four-radiologist workgroup developed a list of attributes that contribute to the value of clinical AI, assigned a weight to each attribute and rated the models against them. Radiologist and AI performance was assessed for nearly 89,000 exams conducted across different clinical sites. The evaluation used both conventional metrics and augmented ones for “enhanced detection cases” (i.e., instances where AI detected something a radiologist missed). The workgroup then combined the task values with pooled AI performance to help predict each model’s value.
They identified three attributes they believe are most likely to contribute to the inherent value of an AI model: (1) the tediousness of the task, (2) the likelihood that the radiologist would overlook the finding, and (3) the clinical impact of a miss. Based on their system, five tasks were rated as having high inherent value, five were rated medium and two were labeled low value.
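As a rough illustration of how weighted attribute ratings could roll up into a high, medium or low category, here is a minimal Python sketch. The weights, the 1-to-5 rating scale, the cutoffs and the function names are all hypothetical; the article does not report the workgroup's actual values or scoring rule.

```python
# Hypothetical attribute weights (assumed to sum to 1); not taken from the study.
WEIGHTS = {
    "tediousness": 0.3,
    "miss_likelihood": 0.3,
    "clinical_impact": 0.4,
}


def inherent_value_score(ratings):
    """Weighted average of attribute ratings (each assumed to be on a 1-5 scale)."""
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * ratings[name] for name in WEIGHTS)
    return weighted_sum / total_weight


def value_category(score, high_cutoff=4.0, medium_cutoff=2.5):
    """Bucket a score into high / medium / low using illustrative cutoffs."""
    if score >= high_cutoff:
        return "high"
    if score >= medium_cutoff:
        return "medium"
    return "low"


# Made-up ratings for one clinical task, for illustration only.
example_task = {"tediousness": 4, "miss_likelihood": 5, "clinical_impact": 5}
score = inherent_value_score(example_task)
print(value_category(score))
```

A practice adapting this idea would substitute its own attributes, weights and cutoffs, which is the part of the process the workgroup settled by consensus.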
Across all tasks, radiologists mostly had a higher positive predictive value, while AI logged better sensitivity. AI models logged widely varying absolute (0.03% to 2.28%) and relative (4.5% to 60.5%) enhanced detection rates. Based on the framework, five AI models were predicted to have high value, five medium value and three low value. Radiologists were also surveyed after implementation. Larson and colleagues found that perceived value categories agreed between survey respondents and the workgroup’s prediction for 10 of the 12 tasks.
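For readers less familiar with the metrics, the short Python sketch below shows how they can be computed from raw counts. Sensitivity and positive predictive value follow their standard definitions. The two enhanced detection rates are an assumption about how the terms are used: the absolute rate is taken as enhanced detection cases divided by all exams, and the relative rate as enhanced detection cases divided by the findings radiologists detected on their own. The study may define them differently, and the counts in the example are invented.

```python
def sensitivity(true_pos, false_neg):
    """Share of actual positives that were flagged: TP / (TP + FN)."""
    return true_pos / (true_pos + false_neg)


def positive_predictive_value(true_pos, false_pos):
    """Share of flagged cases that were actually positive: TP / (TP + FP)."""
    return true_pos / (true_pos + false_pos)


def enhanced_detection_rates(enhanced_cases, radiologist_detected, total_exams):
    """Return (absolute, relative) enhanced detection rates.

    Assumed definitions, not confirmed by the article:
      absolute = AI-only detections / all exams
      relative = AI-only detections / findings the radiologist detected
    """
    absolute = enhanced_cases / total_exams
    relative = enhanced_cases / radiologist_detected
    return absolute, relative


# Invented counts for illustration: 10,000 exams, 200 radiologist-detected
# findings, 20 additional findings flagged only by the AI model.
absolute, relative = enhanced_detection_rates(20, 200, 10_000)
print(f"absolute: {absolute:.2%}, relative: {relative:.2%}")
```

Under these assumed definitions, a task with a rare finding can show a small absolute rate but a large relative one, which would be consistent with the wide ranges reported.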
“Our reported experience in using the method to evaluate FDA-cleared AI triage models for 12 clinical tasks could be useful for other radiology practices performing similar evaluations,” the authors noted. “The surveyed radiologists strongly agreed with the selected attributes (tediousness of the task, likelihood that the radiologist would miss the finding, potential patient impact if missed) as contributing to this inherent value,” they added later. “Because attributes are inherent to a particular clinical task, they should have broad application to all AI models for that task across clinical settings once they have been defined.”
Read more, including potential study limitations, in AJR.
