New study may provide model for assessing AI performance in real-world settings

A new study published Tuesday in Radiology [1] may provide a model for assessing AI performance in real-world settings.

The Personal Performance in Mammographic Screening, or PERFORMS, scheme was originally devised to assess human readers’ skills at interpreting breast images. But a team of researchers recently put the tool to the test, using it to assess artificial intelligence’s performance.

They discovered no difference between AI and 552 human readers in detecting cancer across 120 examinations. When the AI recall threshold was set to match average human reader performance (90% sensitivity and 76% specificity), the algorithm performed comparably, at 91% and 77%, respectively.

“It’s vital that imaging centers have a process in place to provide ongoing monitoring of AI once it becomes part of clinical practice,” Yan Chen, PhD, a professor of digital screening at the University of Nottingham, said in a Sept. 5 announcement from the Radiological Society of North America. “There are no other studies to date that have compared such a large number of human reader performance in routine quality assurance test sets to AI, so this study may provide a model for assessing AI performance in a real-world setting.”

For the retrospective study, Chen et al. used two PERFORMS test data sets, each consisting of 60 challenging cases. Human readers performed their interpretations between 2018 and 2021, while a commercially available algorithm from Lunit did so in 2022. The data sample included 161 normal breasts, 70 breasts with malignancies and nine with benign findings. PERFORMS, the authors noted, is a quality assurance assessment used by the U.K.'s National Health Service Breast Screening Program to evaluate readers.

Providers included in the study comprised 315 board-certified radiologists (57%) and 237 nonradiologist readers: 206 radiographers and 31 breast clinicians. All told, the sample represented about 68% of readers in the NHS Breast Screening Program, allowing for a “robust performance comparison,” Chen noted.

“The use of external quality assessment schemes like PERFORMS may provide a model for regularly assessing the performance of AI in a way similar to the monitoring of human readers, but further work is needed to ensure this assessment model could work for other AI algorithms, screening populations, and readers,” the study concluded.

Read much more, including potential study limitations and a corresponding editorial [2], at the links below. The analysis was funded by South Korea-based AI vendor Lunit Inc., which was not involved in producing the article.

Marty Stempniak

Marty Stempniak has covered healthcare since 2012, with his byline appearing in the American Hospital Association's member magazine, Modern Healthcare and McKnight's. Prior to that, he wrote about village government and local business for his hometown newspaper in Oak Park, Illinois. He won Peter Lisagor and Gold EXCEL awards in 2017 for his coverage of the opioid epidemic.
