Large language models excel at simplifying radiology reports
Large language models such as ChatGPT excel at simplifying radiology report impressions, according to new research published Tuesday.
Traditionally, only the radiologist and referring provider have accessed these documents. But the rise of patient portals, telemedicine and regulations such as the 21st Century Cures Act are transforming patients’ relationship with their medical information, experts wrote in Radiology [1].
Scientists with the Yale School of Medicine investigated how four LLMs could improve the readability of radiology reports: OpenAI's GPT-3.5 and GPT-4, Google's Bard (now known as Gemini) and Microsoft Bing. All four got the job done, with differing degrees of effectiveness, the study concluded.
“Although the success of each large language model varied depending on the specific prompt wording, all four evaluated models simplified radiology report readability across all imaging and prompts tested,” Rushabh Doshi, an MD candidate at Yale, and co-authors wrote March 26. “Our study highlights how radiology reports, which are complex medical documents that implement language and style above the college graduate reading level, can be simplified by LLMs.”
For the study, researchers gathered 750 anonymized radiology reports from the Medical Information Mart for Intensive Care (MIMIC) database. The sample covered a range of anatomic regions and imaging modalities, including MRI, CT, ultrasound, X-ray and mammography. They used three different prompts to assess all four LLMs’ proficiency: (1) “simplify this radiology report,” (2) “I am a patient. Simplify this radiology report,” and (3) “simplify this radiology report at a 7th grade level.” Doshi and colleagues then gauged readability using the average of four different indexes.
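For readers curious how such an evaluation might be wired up, the sketch below pairs each prompt wording with a report and compares grade-level readability before and after simplification. It is an illustration only, not the authors’ code: the OpenAI Python client stands in for all four chatbots tested, and the four indexes shown (Flesch-Kincaid grade, Gunning fog, SMOG and automated readability index) are common grade-level measures assumed here, since the article does not name the study’s exact index set.

```python
# Minimal sketch of the study's general approach; not the authors' actual pipeline.
# Assumptions: the OpenAI client stands in for all four models, and the four
# readability indexes below are assumed (the article does not name them).
import textstat
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPTS = [
    "Simplify this radiology report:",
    "I am a patient. Simplify this radiology report:",
    "Simplify this radiology report at a 7th grade level:",
]

def average_grade_level(text: str) -> float:
    """Average four grade-level readability indexes (assumed set)."""
    scores = [
        textstat.flesch_kincaid_grade(text),
        textstat.gunning_fog(text),
        textstat.smog_index(text),
        textstat.automated_readability_index(text),
    ]
    return sum(scores) / len(scores)

def simplify(report: str, prompt: str, model: str = "gpt-4") -> str:
    """Send one report with one prompt wording to the model."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{prompt}\n\n{report}"}],
    )
    return response.choices[0].message.content

report = "..."  # placeholder for one anonymized report from the MIMIC sample
baseline = average_grade_level(report)
for prompt in PROMPTS:
    simplified = simplify(report, prompt)
    # Positive values mean the output reads at a lower grade level than the original.
    print(prompt, round(baseline - average_grade_level(simplified), 1))
```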
Across all three prompts, every model produced output that was easier to read than the original radiologist-dictated report. Bard and GPT-4, however, outperformed GPT-3.5 and Bing when given the straightforward request to simplify the report (prompt 1) and when the user identified as a patient (prompt 2). Both GPT models performed best overall when the user supplied added context, such as requesting a seventh grade reading level, and such prompt engineering improved simplification across all models.
“Patients may use publicly available LLMs at home to simplify their reports, or medical practices could adopt automatic simplification into their workflow,” the study concluded. “Our findings should not be viewed as an endorsement of any particular LLM, given the different advantages and disadvantages of each. Careful fine-tuning and customization for each LLM may ensure optimal simplification, while maintaining the clinical integrity of the reports.”
Read more about the results, including a corresponding editorial [2], at the links below.