Three well-known artificial intelligence (AI) systems (also called large language models or LLMs) missed the cut when asked to answer questions like those on rheumatology board certification exams, researchers said.
Responses to 40 such questions were 78% accurate from ChatGPT-4, 63% from Claude 3 Opus, and 53% from Gemini Advanced, according to Alí Duarte-García, MD, MS, of the Mayo Clinic in Rochester, Minnesota, and colleagues.
Many of the incorrect answers bordered on hallucinations -- blatantly false and with unclear source or rationale -- and some could cause "severe harm," the researchers reported.
"Non-expert users might find it difficult to detect LLM hallucinations," the group wrote. "Therefore, both patients and clinicians should be aware that LLMs can provide highly convincing but potentially harmful answers."
In the study, Duarte-García and colleagues used questions from the 2022 Continuous Assessment and Review Evaluation (CARE) question bank of the American College of Rheumatology. The corresponding correct responses in the bank served as the gold standard for judging the AI models' performance.
One example provided in the report involved a question about a hypothetical 59-year-old man with osteoarthritis and "concerns about osteoporosis," asking at what age he should first have a bone density test. He had no particular risk factors for osteoporosis other than his worries about it. The correct answer from CARE, based on recommendations from the Endocrine Society and the International Society of Clinical Densitometry, is age 70.
ChatGPT-4 and Claude 3 Opus both got it wrong, indicating that the man should have the scan at 65. Their responses looked like hallucinations. In ChatGPT-4's case, it noted that "current guidelines [for men] recommend starting at age 70 unless there are risk factors for osteoporosis," and that the man didn't have any. It nevertheless stated that "the most appropriate age to first measure this patient's bone mineral density would be... 65," without further explanation. Claude 3 Opus followed a similar path, citing published guidelines specifying 70 as the beginning age for men without osteoporosis risk factors and then recommending age 65 for this patient anyway.
Gemini Advanced (a Google product formerly called Bard) also knew about the published guidelines and, unlike the others, followed them in providing the answer of age 70.
The incorrect answers to this question would likely not hurt the patient very much -- the study defined "severe harm" as causing "[b]odily or psychological injury (including pain or disfigurement) that interferes significantly with the functional ability or quality of life." But two of Gemini Advanced's responses, and one each from the other two systems, met those criteria. (The report didn't say what they were.)
Moreover, "across all three LLMs, more than 70% of incorrect answers had the potential to cause harm" to some degree, the researchers noted. ChatGPT-4 missed nine of the 40 questions, Claude 3 Opus missed 15, and Gemini Advanced got 19 wrong. The latter failed to provide any answer to 11 questions; ChatGPT-4, in what might be another form of hallucination, couldn't come up with single answers to two questions, and so provided two responses to each.
Limitations of the study included the use of a single question bank, "which might not generalise to other sources and might not fully reflect real-world clinical scenarios," Duarte-García and colleagues wrote. In addition, the models were queried only through March 2024; they may be more accurate now, following subsequent updates to their algorithms and data sources.
Disclosures
No specific funding was reported for the study.
Duarte-García reported support from the Rheumatology Research Foundation, the Lupus Research Alliance, and the CDC. Authors declared they had no relevant relationships with commercial entities.
Primary Source
The Lancet Rheumatology
Flores-Gouyonnet J, et al "Performance of large language models in rheumatology board-like questions: Accuracy, quality, and safety" Lancet Rheumatol 2025; DOI: 10.1016/S2665-9913(24)00400-4.