AI model could help providers choose more appropriate imaging exams
Researchers have developed an artificial intelligence model they believe will help address imaging overutilization.
Similar to OpenAI’s ChatGPT, AMIR-GPT (Appropriate Medical Imaging Recommendations Generative Pre-trained Transformer) is a generative pre-trained transformer model; experts trained it on guideline-adherent clinical scenarios that would prompt a provider to order medical imaging. Researchers involved in its development are hopeful its use could guide providers through the process of determining whether patients need imaging and, if so, which exam is best suited for the clinical question at hand.
“Overutilization of medical imaging is not just a cost problem. It reflects a gap between the best available evidence and what happens in practice,” corresponding author Han Lyu, MD, an associate professor in the Department of Radiology, Beijing Friendship Hospital, and colleagues wrote in Intelligent Medicine. “Our goal was to explore whether a domain-specific AI model could help bridge that gap in a way that supports clinicians, not replaces them.”
AMIR-GPT was trained on more than 1,000 curated question-and-answer pairs drawn from 26 of the guidelines in the American College of Radiology (ACR) Appropriateness Criteria. The dataset covered common clinical scenarios and was split into two sets, one for training and another for testing. On the test set, the model’s performance was compared with that of GPT-4, GPT-3.5 and Gemini; responses were scored from 1 to 5, and a weighted Cohen’s kappa, a statistic that measures agreement between raters, was calculated on the scores.
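For readers unfamiliar with the statistic, the sketch below shows one common way a weighted Cohen’s kappa can be computed for two sets of 1-to-5 scores. It is an illustration only: the article does not say which weighting scheme the researchers used or how the raters were arranged, so the quadratic weighting, the function name weighted_kappa and the sample scores are all assumptions, not the study’s method or data.

```python
import math


def weighted_kappa(rater_a, rater_b, categories=(1, 2, 3, 4, 5), weighting="quadratic"):
    """Weighted Cohen's kappa for two raters scoring the same items.

    Hypothetical helper for illustration; the weighting scheme is an assumption.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")

    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed proportion of items for each (score by A, score by B) pair.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1.0 / n

    # Marginal score distributions for each rater.
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    power = 2 if weighting == "quadratic" else 1  # anything else is treated as linear
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            # Larger gaps between the two scores are penalized more heavily.
            w = (abs(i - j) / (k - 1)) ** power
            observed_disagreement += w * observed[i][j]
            expected_disagreement += w * row[i] * col[j]

    if expected_disagreement == 0:
        # Both raters gave every item the same single score; kappa is undefined.
        return math.nan
    return 1.0 - observed_disagreement / expected_disagreement


# Made-up scores from two hypothetical raters, not data from the study.
scores_rater_1 = [5, 4, 4, 3, 5, 2, 1, 3]
scores_rater_2 = [5, 4, 3, 3, 4, 2, 2, 3]
print(round(weighted_kappa(scores_rater_1, scores_rater_2), 3))
```

A value of 1 means the two raters agreed on every score, a value near 0 means agreement no better than chance, and negative values mean agreement worse than chance.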
Of the four models compared, AMIR-GPT yielded the best performance. For one-third of the questions, AMIR-GPT scored a perfect 5 out of 5; in comparison, GPT-4 achieved the same score for 16.7%, while GPT-3.5 and Gemini had perfect scores just 6.2% of the time.
AMIR-GPT also led the other models at scores of 4 and 3, though by a narrower margin. In the scenarios where it scored lower, the team observed omissions and deviations from standard ACR recommendations, which raises concerns about how it might perform in clinical settings, the group cautioned.
“The comprehensive analysis highlights that the AMIR-GPT model excels in providing accurate, professional, comprehensive, and detailed answers,” the authors noted. “For instance, the model accurately addresses questions about specific conditions, such as osteochondritis dissecans, and offers precise information. On the other hand, low-scoring responses reveal shortcomings such as inaccuracies, deviations from standard answers, incompleteness, and erroneous information.”
Overall, the team viewed their findings as positive; their results suggest that general-purpose models can be successfully fine-tuned to improve medical decision-making in the future.
“This is a step toward AI as a collaborative tool in medicine, but responsible integration requires broader datasets, stronger evaluation methods, and validation across diverse real-world settings before these systems can be trusted more widely,” Lyu said.
