Popular LLMs confidently fail at answering image-based board questions
Multimodal large language models like OpenAI’s ChatGPT have come a long way since they garnered mainstream attention, but new data indicate they still have a long way to go before they can be fully trusted in medical settings.
Updated LLMs have been trained to process both text- and image-based questions, potentially making them more effective in radiology settings. However, the utility of these new features is only beginning to be evaluated, leaving many unanswered questions regarding their reliability.
“Recent advancements have led to the development of multimodal LLMs that can simultaneously process text and images, more closely reflecting real-world diagnostic workflows and training in radiology. These models have the potential to automate and refine the radiology workflow, extending from report generation to assistance in diagnostics,” Romi Noy Achiron, MD, with the department of radiology at Tel Aviv Sourasky Medical Center in Israel, and colleagues noted. “This application is only beginning to be explored in the context of vision-enabled large language models.”
Authors of a new analysis in Clinical Imaging sought to answer some of the questions about the performance and limitations of OpenAI’s newest versions of ChatGPT: ChatGPT-4v and ChatGPT-4o. Researchers put the LLMs’ potential to the test using image-based questions from national radiology board examinations. Each model was presented with 222 multiple-choice questions that had previously appeared on the 2020 and 2024 national radiology boards.
With scores of 59% (ChatGPT-4o) and 54% (ChatGPT-4v), neither model achieved a passing score. Both performed similarly on subspecialty analyses, though ChatGPT-4o’s performance showed greater variability. When provided with clinical information, the performance of both models improved significantly. Of note, the team observed an air of confidence in the models’ responses, noting they often provided plausible-sounding, albeit inaccurate, answers, a pattern commonly described as hallucination.
“Unlike traditional predictive models trained on curated datasets, ChatGPT is a probabilistic model trained on imperfect, unfiltered, diverse data. As a result, they may reproduce biased or inaccurate information with persuasive language, which currently limits the applicability of ChatGPT in medical [settings],” the authors cautioned. “This is a major concern in clinical applications, where misleading responses may go unchallenged.”
Though it can be reasonably assumed the models will continue to improve their medical accuracy as they are trained on additional data, the group cautions that “this is not yet the case.” As such, the models will continue to require significant oversight in clinical settings, the authors suggested.
