Latest version of ChatGPT AI passes radiology board exam
There has been a lot of discussion about the artificial intelligence chatbot ChatGPT and how it might be used in healthcare. In radiology, one of the most interesting developments is the latest version of ChatGPT (GPT-4) passing the written portion of a radiology-style board exam. The research was published today in the Radiological Society of North America's flagship journal Radiology.[1,2]
“The use of large language models like ChatGPT is exploding and is only going to increase,” said lead author Rajesh Bhayana, MD, an abdominal radiologist and technology lead at Toronto General Hospital in Toronto, Canada. “Our research provides insight into ChatGPT’s performance in a radiology context, highlighting the incredible potential of large language models, along with the current limitations that make it unreliable.”
He said the research is important for several reasons. First, large language models are now emerging everywhere, and ChatGPT in particular is being explored for applications in medical imaging and healthcare.
"Google released their version recently and it will be incorporated into Google searches. It is also being incorporated into Word, Office and Google Docs. Even EMR vendor Epic is integrating this into its interface, so it is definitely here and will change the way we interact with technology. And with that, radiologists need to know how it performs in our specialty to know how we can use it now. And also because there are so many downstream applications of this, including use in our reports and we need to know if it understands the context of our reports," Bhayana explained. "Knowing its limitations is also important."
He said if the technology can understand the true context of radiology-specific information, it might be able to generate, simplify and summarize reports in layman's terms for patients.
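To make that concrete, here is a minimal sketch of the kind of report-simplification application Bhayana describes, using the OpenAI Python SDK (v1.x). The model name, prompt, and sample impression are illustrative assumptions, not anything from the study, and any output would still need radiologist review given the accuracy concerns discussed below.

```python
# Minimal sketch: ask a GPT model to rephrase a radiology impression
# in plain language. Assumes the OpenAI Python SDK (v1.x) and an
# OPENAI_API_KEY set in the environment; the prompt and sample
# impression are illustrative, not taken from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

impression = (
    "Impression: 5 mm non-obstructing calculus in the lower pole of "
    "the left kidney. No hydronephrosis."
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder; any chat-capable model works here
    messages=[
        {
            "role": "system",
            "content": (
                "Rewrite radiology report impressions in plain, "
                "patient-friendly language. Do not add new findings."
            ),
        },
        {"role": "user", "content": impression},
    ],
)

print(response.choices[0].message.content)
```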
ChatGPT passes a radiology board exam with a score of 81%
"What we found is that it performs pretty well," Bhayana said. "Especially since it was a general model that was not fine-tuned for radiology."
Researchers set up a board-style exam with 150 written questions based on Canadian Royal College and American Board of Radiology exams. They excluded the image-interpretation portion of the exam because the AI cannot yet interpret images. They originally tested ChatGPT version 3.5, but GPT-4 came out in March 2023, just after they finished the study, so they wrote a second paper showing the improvement in performance between the two versions.
The researchers found that ChatGPT-3.5 answered 69% of questions correctly (104 of 150), near the passing grade of 70% used by the Royal College in Canada. The model performed relatively well on questions requiring lower-order thinking (84%, 51 of 61), but struggled with questions involving higher-order thinking (60%, 53 of 89). More specifically, it struggled with higher-order questions involving description of imaging findings (61%, 28 of 46), calculation and classification (25%, 2 of 8), and application of concepts (30%, 3 of 10). Its poor performance on higher-order-thinking questions was not surprising given its lack of radiology-specific pretraining, Bhayana said.
However, as they were just finishing the study, GPT-4 was released, which was supposed to perform much better with higher-order questions. So, the research team ran the experiment again with the new version, and it achieved a passing grade. GPT-4 answered 81% (121 of 150) of the same questions correctly, outperforming GPT-3.5 and exceeding the passing threshold of 70%. GPT-4 performed much better than GPT-3.5 on higher-order thinking questions (81%), more specifically those involving description of imaging findings (85%) and application of concepts (90%).
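For readers who want to sanity-check the arithmetic, the short script below recomputes the percentages from the question counts quoted above. Only the figures reported with explicit counts are included, and it assumes the 70% Royal College threshold applies to the overall score.

```python
# Recompute the reported exam percentages from the counts quoted in
# the article. Only figures given with explicit counts are included.
PASS_THRESHOLD = 70.0  # Royal College passing grade cited by the authors

results = {
    "GPT-3.5 overall":              (104, 150),
    "GPT-3.5 lower-order":          (51, 61),
    "GPT-3.5 higher-order":         (53, 89),
    "GPT-3.5 imaging descriptions": (28, 46),
    "GPT-3.5 calc/classification":  (2, 8),
    "GPT-3.5 concept application":  (3, 10),
    "GPT-4 overall":                (121, 150),
}

for label, (correct, total) in results.items():
    pct = 100 * correct / total
    line = f"{label}: {correct}/{total} = {pct:.0f}%"
    if total == 150:  # pass/fail applies only to the full exam
        line += " -- pass" if pct >= PASS_THRESHOLD else " -- below 70%"
    print(line)
```

Running it reproduces the rounded figures above: 69%, just under the threshold, for GPT-3.5, and 81%, a clear pass, for GPT-4.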
"GPT-4 passed the board with flying colors. It performed much, much better on the higher-order reasoning questions in a radiology context. And that is important because, the way these models work, it is all about understanding the context of language, and the fact that it can carry out these high-order reasoning tasks in the context of radiology is critical to be able to enable those downstream, more advanced imaging applications. These are the things needed to take the next step," he explained.
GPT-4 showed no improvement on lower-order thinking questions (80% vs. 84%) and answered 12 questions incorrectly that GPT-3.5 had answered correctly, raising questions about its reliability for information gathering.
Bhayana is not sure the technology is ready to take that next step, but it is clear ChatGPT is moving very rapidly in that direction.
The ability to interpret images is also a very important part of being a radiologist, and future versions of ChatGPT will incorporate this type of functionality. There are already more than 100 FDA-cleared AI algorithms that can interpret medical imaging, but each performs a one-off task, such as differentiating hemorrhagic from ischemic stroke, assessing breast density, or automatically detecting pulmonary embolism or aortic aneurysm. No single AI algorithm can do all of these things at once.
Downside of ChatGPT is its overconfidence in the wrong answers it creates
Both studies showed that ChatGPT used confident language consistently, even when it was incorrect. Bhayana said this is particularly dangerous if the model is relied on as a sole source of information, especially for novices who may not recognize confident but incorrect responses as inaccurate.
“We were initially surprised by ChatGPT’s accurate and confident answers to some challenging radiology questions, but then equally surprised by some very illogical and inaccurate assertions,” Bhayana said. “Of course, given how these models work, the inaccurate responses should not be particularly surprising.”
ChatGPT’s dangerous tendency to produce inaccurate responses, termed "hallucinations," is less frequent in GPT-4, but still limits usability in medical education and clinical practice.
"This is the biggest pitfall at present, and what is currently limiting the use of these things in real-life practice and education. So even if a response is inaccurate, if it is the best answer the AI has available to it, it will spit it out. And that will result in these hallucinations, where it generates an incorrect or irrelevant response very confidently. That can be very dangerous in this setting," he said.
Bhayana said a novice clinician who asks it a question and gets a bad response may take the answer as gospel truth, which could have a very negative impact on patient care.
"When I was playing around with ChatGPT, and even using it for this study, there were instances where even with things I knew inside and out, I would ask it a question and it would answer so confidently that I would start to question myself. Even as a specialist, I would have to ask myself if that is really correct, and I had to fact check myself. This will have to be overcome before this can be relied on solely," Bhayana explained.
Another limitation is that ChatGPT does not create a reference section to show where it found the information it provides. He said this issue appeared to be addressed in a beta version of Google's LLM Med-PaLM 2, which is designed specifically for medicine and provides a reference section. But Bhayana said it does not always work correctly, and he found it can give a list of references that do not actually exist.
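One pragmatic safeguard against fabricated citations is to verify each reference against an external registry before trusting it. The hypothetical helper below, an illustration rather than anything from the article, checks whether a cited DOI actually resolves in the public Crossref API; a DOI that does not exist is a strong signal of a hallucinated reference. Citations without DOIs would need a different check, such as a PubMed title search.

```python
# Hypothetical helper: check whether a DOI exists via the public
# Crossref REST API (https://api.crossref.org). A fabricated citation
# often carries a DOI that resolves to nothing.
# Requires the third-party `requests` package.
import requests

def doi_exists(doi: str) -> bool:
    """Return True if Crossref has a record for this DOI."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    return resp.status_code == 200

# Placeholder DOIs for illustration only.
for doi in ["10.1148/radiol.230582", "10.0000/definitely.not.real"]:
    print(doi, "->", "found" if doi_exists(doi) else "not found")
```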
"The AI does not really learn discreet facts; it does not think or know like a human does, but it spits out that text in a way that is almost indistinguishable from thinking or knowing. So it brings up a philosophical question of what it actually means to think or know. But I am not a programmer. So, if it walks like a duck and quacks like a duck, to me, it's a duck, where it is mimicking that knowledge," Bhayana said.
In the end, this type of AI is here to stay, and radiologists need to understand how it applies to their specialty because it will impact several areas of their jobs.
"This type of AI will be used to help create Microsoft PowerPoint presentations, to help prepare academic talks, and we are going to be using them in all the products that we are using in our practice. This includes accessing information in Epic in the future. Understanding how it works and its limitations and understanding the applications in the future is going to be important," he explained.