Large language models offer potential for helping appeal denied radiology claims
Large language models could help reduce the burden of putting together appeals when imaging exams are denied by insurance, according to a new analysis in Academic Radiology.
LLMs have been touted as a potential solution for numerous duties in radiology workflows, including report generation, improving follow-up adherence and flagging relevant findings for referring providers. Now, experts are using LLMs to target a task known to create significant administrative burdens on staff—writing appeal letters to payers.
“In recent years, generative AI and large language models have been evaluated for multiple tasks in clinical medicine, including providing patient information, assisting with documentation, translating and summarizing, medical research and education, for example,” Colin J. McCarthy, MD, with Beth Israel Deaconess Medical Center, and colleagues noted. “Given the administrative burden associated with appealing denials from insurance companies, this study sought to determine if currently available large language models could generate accurate, valid and clinically meaningful appeals letters for coverage denials.”
Researchers recently tasked four established LLMs (Claude 3.5, Nova Pro, Llama-3.1–70B, and ChatGPT-4o) with generating appeals for simulated clinical scenarios. The team used zero-shot, few-shot, and retrieval-augmented generation techniques to prompt the models to create 12 appeal letters, which were assessed by four board-certified interventional radiologists. Readers, who were blinded to which model and technique were used to create each letter, scored the letters on content (accuracy, personalization, references), grammar and structure (readability, tone, persuasiveness), and usability; references cited by the models were also verified.
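For readers unfamiliar with the three prompting techniques named in the study, the sketch below shows how they typically differ. It is illustrative only and does not reproduce the authors' prompts or pipeline: the instruction text, the function names, and the word-overlap retriever are all assumptions made for this example, and no model is actually called.

```python
from typing import List, Tuple

# Hypothetical instruction; the study's real prompts are not given in this article.
INSTRUCTION = "Write an appeal letter to the insurer for the denied imaging exam described below."


def zero_shot_prompt(scenario: str) -> str:
    """Zero-shot: the instruction and the scenario only, with no examples."""
    return f"{INSTRUCTION}\n\nScenario: {scenario}\n\nLetter:"


def few_shot_prompt(scenario: str, examples: List[Tuple[str, str]]) -> str:
    """Few-shot: worked (scenario, letter) pairs are placed before the new scenario."""
    shots = "\n\n".join(f"Scenario: {s}\n\nLetter: {letter}" for s, letter in examples)
    return f"{INSTRUCTION}\n\n{shots}\n\nScenario: {scenario}\n\nLetter:"


def retrieve(scenario: str, documents: List[str], k: int = 1) -> List[str]:
    """Toy retriever (an assumption): rank documents by words shared with the scenario."""
    query = set(scenario.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(query & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def rag_prompt(scenario: str, documents: List[str], k: int = 1) -> str:
    """Retrieval-augmented: retrieved reference text is added to the prompt as context."""
    context = "\n".join(f"- {d}" for d in retrieve(scenario, documents, k))
    return f"{INSTRUCTION}\n\nReference material:\n{context}\n\nScenario: {scenario}\n\nLetter:"


if __name__ == "__main__":
    # Made-up scenario and knowledge base, for illustration only.
    scenario = "MRI of the lumbar spine denied as not medically necessary"
    knowledge_base = [
        "guideline on lumbar spine mri for back pain",
        "billing codes for chest radiograph",
    ]
    print(rag_prompt(scenario, knowledge_base))
```

The practical difference is what the model sees beyond the request itself: nothing extra (zero-shot), sample letters (few-shot), or source documents pulled from an external knowledge base (retrieval-augmented generation).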
Combined, the models yielded mean content and grammar scores of 3.9 ± 0.95 and 4.3 ± 0.9, respectively, out of a possible 5 points. However, reader agreement on these scores varied widely.
Though the models produced letters that were largely accurate, they also showed signs of hallucinations and cited fabricated references. For example, hallucinations were flagged in 16 out of the 48 letters, with the online model (ChatGPT-4o) being more prone to them than the offline models. What’s more, of the 44 references cited, 80% were fabricated in the letters produced by the offline models. In comparison, 29% of ChatGPT-4o’s letters contained fabricated references.
Overall, the readers indicated that the LLMs’ letters would serve as helpful templates in 73% of cases.
Though the group expressed optimism about how LLMs can reduce administrative burden, they cautioned that these models still require a significant amount of oversight.
“Large language models are becoming increasingly accessible to the public, with drag-and-drop interfaces for the addition of external knowledge bases, reducing or eliminating the need for coding,” the group noted. “Generative AI may reduce the administrative burden related to prior authorizations or denials, however, the outputs still require careful human review prior to submission.”
