Are LLMs ready for radiology? Research reveals how chatbots' performance has evolved over time
Although large language models have come a long way since they gained mainstream popularity in 2023, they still require significant oversight in medicine.
When OpenAI’s ChatGPT burst onto the scene in late November 2022, it quickly caught the attention of stakeholders throughout the healthcare industry. Many saw an opportunity to use the LLM, particularly in radiology. Since then, numerous tech giants have released their own LLMs, many of which have been tested in radiology settings for tasks such as generating reports, communicating important report findings, and translating medical jargon into text that patients can more easily understand.
Though LLMs have shown great potential in radiology, their clinical implementation has been hindered by inconsistent performance. Now, nearly three years after ChatGPT was released to the public, experts have sought to gauge how that performance has improved over time. This week in Academic Radiology, they shared their findings on how some of today’s most popular LLMs fare when prompted with radiology-related tasks.
“The rapid advancements in chatbots capabilities suggest a promising trajectory for AI in healthcare, which is reshaping how medical knowledge is shared to patients,” Arash Bedayat, MD, with the department of radiological sciences at the David Geffen School of Medicine, University of California, Los Angeles, and colleagues noted. “As these models continue to evolve, it becomes increasingly important to assess their accuracy and reliability in real-world applications such as delivering trustworthy medical information.”
The team repeated an analysis conducted two years prior using ChatGPT and Bard. This time, their work involved not two but five state-of-the-art, publicly available LLMs: GPT-o3-mini, Gemini, DeepSeek R1, Claude, and Perplexity. Each was prompted with 40 lung cancer-related questions that radiologists developed based on Lung-RADS and Fleischner Society guidelines. Three experts categorized the LLMs’ responses as correct, partially correct, incorrect, or no answer, and the responses produced by GPT-o3-mini (one of the more recent versions of ChatGPT) were compared with ChatGPT’s original performance.
GPT-o3-mini yielded the best results, producing accurate answers to over 75% of the prompts; Gemini followed closely behind, at 74.17% accuracy. DeepSeek R1 showed the highest agreement among raters, while GPT had the most instances of disagreement. Still, GPT outperformed both DeepSeek and Gemini at identifying intentionally incorrect Lung-RADS questions, further solidifying its place as the model best equipped, among those tested, to answer medical queries.
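For readers curious how accuracy and rater-agreement figures like those above can be derived from three raters' categorical labels, here is a minimal sketch. It is not the study's method: the function names, the choice to pool all individual ratings, and the toy data are assumptions made for illustration. Pooling would be consistent with the reported 74.17%, since 89 "correct" labels out of 120 ratings (40 questions times three raters) works out to about 74.17%, but that reading is an inference and is not stated in the source.

```python
# Illustrative only: hypothetical helpers and toy data, not the study's code or results.

def accuracy(ratings):
    """Share of all individual ratings (pooled across raters) labeled 'correct'."""
    flat = [label for per_question in ratings for label in per_question]
    return sum(label == "correct" for label in flat) / len(flat)

def full_agreement_rate(ratings):
    """Share of questions on which every rater chose the same label."""
    return sum(len(set(per_question)) == 1 for per_question in ratings) / len(ratings)

# Toy data: 4 questions, each rated by 3 raters using the article's four categories.
toy_ratings = [
    ("correct", "correct", "correct"),
    ("correct", "partially correct", "correct"),
    ("incorrect", "incorrect", "incorrect"),
    ("correct", "correct", "no answer"),
]

print(f"accuracy: {accuracy(toy_ratings):.2%}")
print(f"full agreement: {full_agreement_rate(toy_ratings):.2%}")
```

A published analysis would more likely report a chance-corrected agreement statistic; the simple share of unanimous questions is used here only to keep the sketch short.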
Though GPT and Gemini (referred to as Bard in the original LLM assessment two years ago) both yielded improved performance when compared to their older iterations, the LLMs still showed several signs of inconsistency, the group acknowledged.
“Our study provides a snapshot of the current state of AI-driven medical response generation for patients, but continued research will be necessary to fully appreciate the potential of chatbots in the healthcare settings,” the team wrote, adding that “...there continues to be a real danger of using chatbots as a means of disseminating complicated medical information for patients.”
