Most imaging AI algorithms perform unimpressively in external validation exercises

A Johns Hopkins review of published diagnostic radiology algorithms tested on image data that wasn’t used in their development—“external” data—has found the vast majority fail to match the scores they notched with internal data.

The review’s authors use the finding to extol the value of external testing in evaluating algorithms’ generalizability and, in the process, improving the quality of future research into radiological AI.

Lead author Alice Yu, MD, senior author John Eng, MD, and co-author Bahram Mohajer, MD, MPH, analyzed the performance of 86 deep learning (DL) algorithms, drawn from 83 separate studies, that had been tested on external datasets. The team found that some 81% of the models (70 of 86) lost at least some diagnostic accuracy compared with their performance on internal datasets.

Around half (42 of 86, 49%) had at least a modest falloff, and nearly a quarter (21 of 86, 24%) dropped by a substantial degree.
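To make that comparison concrete: external validation here amounts to scoring the same trained model on a held-out internal test set and again on data from outside institutions, then noting the falloff in a metric such as AUC. The sketch below is illustrative only, not the authors' code; it assumes a binary classifier exposing a scikit-learn-style predict_proba and placeholder data arrays.

```python
# Illustrative sketch only (not the study's code): compare a fixed model's
# performance on an internal hold-out set versus an external dataset.
# `model` is assumed to expose a scikit-learn-style predict_proba();
# the X/y arrays are placeholders for image features and labels.
from sklearn.metrics import roc_auc_score

def internal_vs_external_auc(model, X_internal, y_internal, X_external, y_external):
    """Return (internal AUC, external AUC, falloff) for a binary classifier."""
    auc_internal = roc_auc_score(y_internal, model.predict_proba(X_internal)[:, 1])
    auc_external = roc_auc_score(y_external, model.predict_proba(X_external)[:, 1])
    return auc_internal, auc_external, auc_internal - auc_external
```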

Radiology: AI posted the report May 4, ahead of final review.

Citing prior literature reviews showing a relative dearth of external validation steps in imaging-based AI studies, Yu and colleagues suggest the problem largely owes to the difficulty of obtaining “an appropriate external dataset of medical images and lack of awareness of external validation’s importance in establishing clinical value.”

These challenges “may diminish as large public datasets become increasingly available,” the authors note, and as “major journals begin supporting guidelines that highlight the importance of performing external validation.”

The authors also discuss the intuitive expectation that models developed on large datasets should generalize more readily than those built on smaller ones, citing prior research bearing this out.

“In contrast, we did not find size or number of institutions in the development dataset to have a statistically significant impact on external performance, suggesting that other factors may be involved,” Yu et al. write.

Meanwhile, several studies in the present review showed DL algorithms performing better on external datasets than on internal ones.

“Such a result might be naively interpreted as evidence that some algorithms are highly generalizable, but such a conclusion should be questioned,” they write, before offering two possible causes of “misleadingly high” external performance for consideration.

First, the external dataset “might contain only images with heavily weighted features responsible for correct classification and not be representative of a realistic target population.”

Second:

The image data might contain information about the diagnosis that is unrelated to the disease process, such as a radiography marker or ‘burned in’ text in the images. … Interpretability techniques such as image embedding and activation maps can help identify [such] ‘data leakage.’ In the study with the most dramatic external performance increase in our review, the authors found that the external dataset, which was a publicly available breast ultrasound dataset, contained very straightforward examples and possibly only contained heavily weighted features.
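The paper itself does not include code, but an activation-map check of the kind the authors mention can be sketched with a Grad-CAM-style pass over a trained network. In the hypothetical PyTorch snippet below, the model, the hooked layer and the input image are all placeholders; the point is that if the resulting heat map concentrates on burned-in text or laterality markers rather than anatomy, that flags possible data leakage.

```python
import torch
import torch.nn.functional as F
from torchvision import models

# Stand-in for a trained diagnostic CNN; the weights and layer choice are placeholders.
model = models.resnet18(weights=None)
model.eval()

feats = {}

def save_features(module, inputs, output):
    feats["value"] = output          # keep the graph so gradients can be taken later

model.layer4.register_forward_hook(save_features)

x = torch.randn(1, 3, 224, 224)      # placeholder for a preprocessed radiograph
score = model(x)[0].max()            # logit of the top-scoring class

# Gradient of the score with respect to the last convolutional feature maps.
grads = torch.autograd.grad(score, feats["value"])[0]

# Grad-CAM: weight each feature map by its average gradient, sum, ReLU, upsample.
weights = grads.mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * feats["value"].detach()).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]

# If the hottest regions of `cam` sit over burned-in text or markers rather than
# the anatomy of interest, the model may be exploiting data leakage.
```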

More coverage of external validation:

Most AI research focused on radiology lacks external validation

O-RADS externally validated for differentiating between benign and malignant ovarian lesions

AI interprets imaging data as well as physicians—but there’s a catch

What Has Artificial Intelligence Done for Radiology Lately?

 

Reference:

Alice C. Yu, Bahram Mohajer, John Eng. “External Validation of Deep Learning Algorithms for Radiologic Diagnosis: A Systematic Review.” Radiology: Artificial Intelligence, May 4, 2022. DOI: https://doi.org/10.1148/ryai.210064

Dave Pearson

Dave P. has worked in journalism, marketing and public relations for more than 30 years, frequently concentrating on hospitals, healthcare technology and Catholic communications. He has also specialized in fundraising communications, ghostwriting for CEOs of local, national and global charities, nonprofits and foundations.
