When It Comes to Eye Care, AI Couldn't See Straight

— ChatGPT gave bogus and even potentially harmful answers to retinal disease questions, study shows

by Randy Dotinga, Contributing Writer, MedPage Today, August 4, 2023


In response to commonly asked patient questions, an artificial intelligence (AI) chatbot gave inappropriate and even potentially harmful medical advice about vitreoretinal disease, according to a cross-sectional study.

Two ophthalmologists determined that the popular chatbot ChatGPT accurately answered only eight of 52 questions about retinal health that were submitted in late January, reported Peter Y. Zhao, MD, of New England Eye Center at Tufts Medical Center in Boston, and colleagues.

When the researchers resubmitted the questions two weeks later, all 52 responses changed, 26 of them materially: accuracy materially improved in 30.8% and materially worsened in 19.2%, they wrote in a research letter in JAMA Ophthalmology.

In recent months, stunning advances in AI have sparked high-level debate about how to best use the technology while preventing it from launching a "Westworld"-style takeover of humanity. On the medical front, teams of clinicians have tested AI chatbots by peppering them with questions about healthcare.

The chatbots have performed fairly well in recent analyses. In a study on cardiac care, one responded appropriately to 21 of 25 questions on the prevention of cardiovascular disease; in another, one correctly answered 26 of 30 questions about "common oncology topics," with none of the wrong answers considered harmful. One chatbot has also been tested on making complex diagnoses.

However, the chatbot in the cardiac care study did make something up, which is known as "hallucinating" in AI circles, when it responded that the cholesterol-lowering drug inclisiran (Leqvio) is commercially unavailable. In fact, the FDA approved it in 2021 and it's readily available.

Study co-author Benjamin K. Young, MD, of Oregon Health & Science University in Portland, told MedPage Today that he and his co-authors "hypothesize[d] the inaccuracy rate was very high because retina is a small subspecialty. Therefore, it stands to reason that ChatGPT has fewer online resources to 'learn' from compared to something like heart disease."

The authors also noted that "hallucination generating factually inaccurate responses is a known issue with LLM [large language model]-based platforms but has the potential to cause patient harm in the domain of medical knowledge."

In this study, ChatGPT responded to a question about the treatment options for central serous chorioretinopathy, a condition often exacerbated by corticosteroid use, by advising the user to take corticosteroids.

"Steroids make the condition worse, but the chatbot said you should use steroids to make it better," Zhao told MedPage Today. "That was a complete 180, a completely wrong type of answer."

The chatbot also incorrectly included injection therapy and laser therapy as treatments for epiretinal membrane, though it correctly mentioned vitrectomy as an option.

Young pointed to another study in which researchers asked retinal disease-related questions of a newer version of ChatGPT and found that most responses were "consistently appropriate." While that study's methodology differed from Zhao and Young's, Young said it may be a sign that ChatGPT is getting better.

Zhao and colleagues used Google's "People Also Ask" subsection to make a list of commonly asked questions about vitreoretinal conditions and procedures, including macular degeneration, diabetic retinopathy, retinal vein occlusion, retinal tear or detachment, posterior vitreous detachment, vitreous hemorrhage, epiretinal membrane, macular hole, central serous chorioretinopathy, retina laser, retinal surgery, and intravitreal injection, as well as ocular symptoms that could be explained by vitreoretinal disease using the terms "floaters," "flashes," and "visual curtain."

The questions were initially posed to ChatGPT on Jan. 31, 2023. Since ChatGPT is continually updated, the researchers resubmitted the questions on Feb. 13.

Matthew DeCamp, MD, PhD, of the University of Colorado Anschutz Medical Campus in Aurora, told MedPage Today that studies like this have important limitations.

"This study required the entire answer to be accurate. But answers could be entirely accurate, or partly accurate, or completely inaccurate, and not all inaccuracies carry the same importance," said DeCamp, who was not involved in the research. "This study would have been stronger had the researchers also found a way to judge answers from real-life clinicians. Better yet, the researchers could have been blinded to whether an answer came from a real-life physician or a chatbot. There is a risk that the researchers' own biases could have influenced their judgment."

It may be impossible to know why chatbots are producing bad information, he noted, since some "are built on inexplicable models -- so-called 'black boxes' -- that do not or cannot cite actual sources of information."

Moving forward, he said chatbot developers "will need to know how clinicians and patients tend to ask questions, and to be attentive to that issue of differential impact -- the possibility that the chatbot answers questions differently to different people."

"The fact that answers may change over even a short time period for no clear reason is a real concern," he added.

As to whether "Dr. Google" is any better at providing accurate health information, DeCamp pointed to research that compared how Google and ChatGPT answered questions regarding dementia.

"Whereas Google was more current and transparent, [it] required users to sift through commercial information and advertising that could be hard to interpret," he said. "ChatGPT was more conversational in responses, which could help the user experience and hence understanding, but it did not include sources. Comparing chatbots against each other, against other online sources of information, and against humans are all going to be important."

Randy Dotinga is a freelance medical and science journalist based in San Diego.

Disclosures

Zhao reported no disclosures.

Young reported support from the NIH, the Malcolm M. Marquis, MD Endowed Fund for Innovation, and Research to Prevent Blindness.

DeCamp reported NIH grant funding to his institution to examine the use of AI-based prognostic algorithms in palliative care and from the Greenwall Foundation to examine how patients experience patient-facing chatbots in health systems.

Primary Source

JAMA Ophthalmology

Caranfa JT, et al "Accuracy of vitreoretinal disease information from an artificial intelligence chatbot" JAMA Ophthalmol 2023; DOI: 10.1001/jamaophthalmol.2023.3314.