ChatGPT Flubbed Drug Information Questions

— Researchers say the AI tool is not yet accurate enough for consumer or pharmacist questions

by Sophie Putka, Enterprise & Investigative Writer, MedPage Today December 6, 2023

Last Updated December 7, 2023

ANAHEIM, Calif. -- ChatGPT provided incorrect or incomplete information when asked about drugs, and in some cases invented references to support its answers, two evaluative studies found.

In the first, 39 questions sent to a drug information service for pharmacists were later posed to ChatGPT, which provided no response, an inaccurate response, or an incomplete response to 74% of them, Tina Zerilli, PharmD, of Long Island University in Brooklyn, New York, and colleagues reported.

For example, in response to a real query about whether there was a drug interaction between nirmatrelvir/ritonavir (Paxlovid) and verapamil (Verelan), a blood pressure-lowering drug, ChatGPT indicated that there were no interactions, though ritonavir can interact with verapamil.

Moreover, whenever the artificial intelligence (AI) chatbot did provide a response with references, the references it cited were fabricated, with URLs that led to nonexistent studies, according to findings presented at the American Society of Health-System Pharmacists (ASHP) midyear meeting.

"It's an evolving technology. We should not rely on it right now as the definitive source of information," Zerilli told �鶹��ý. "We need to verify all the information that is generated from it and know that it can spit out inaccurate information, bad information, fabricated results."

In the second study, ChatGPT missed at least half of established side effects for 26 of 30 FDA-approved drugs when the generative AI algorithm's answers were checked against a pharmacological database, Shunsuke Toyoda, PharmD, of Torrance Memorial Medical Center in California, and colleagues reported.

"As you can see, [it was] mostly inaccurate, and as pharmacists, we always have to be 100% accurate," Toyoda told �鶹��ý at a poster presentation, also at the ASHP midyear meeting. "So I'm sorry, ChatGPT -- no way they can replace us, not in our lifetime."

ChatGPT is a large language model (LLM) generative AI chatbot whose use has exploded in popularity since its launch in 2022. Its role in healthcare, along with that of other AI tools geared toward medical professionals and industry, is still being debated even as investment pours into the space.

"We go through so many trainings, even after school, and I think, if some computer program really just came out one night and also claimed to completely replace us? I find that very insulting," said Toyoda.

Even so, the present studies offer a limited view of what AI in medicine is capable of, commented John Ayers, PhD, MA, of the Qualcomm Institute at the University of California San Diego in La Jolla, who was not involved in either study.

"They use a generic LLM that's not optimized to assess or evaluate healthcare relevant information. It's not trained specifically on that kind of data," he said. Referring to the drug information service study, Ayers said he was "surprised that it did so well, and it shows the potential, with optimization, of what could be achieved."

But at the same time, the accuracy of AI tools being used in medicine is irrelevant if regulatory bodies like the FDA don't create a framework or standards for their quality, he noted. "In a way, there's no money to do these types of evaluations," he said. "The technology companies themselves would not want to pay for this kind of research because they don't want to report if it doesn't work. And the FDA is not going to want to mandate this type of research."

Researchers in the drug information service study randomly assigned 39 questions they had received from January 2022 to April 2023 to one of two investigators, who created responses to the questions based on a literature search. The questions were posed to ChatGPT, followed by the phrase, "Please provide references to support the response."

Investigators evaluated the response as "satisfactory" or "unsatisfactory" based on a tool they designed, and if they disagreed, a third investigator weighed in. "Unsatisfactory" responses could include no direct response, inaccurate information, incomplete information, or extraneous information. "Satisfactory" responses were accurate and complete with no irrelevant information.
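The rating rule described above can be sketched in a few lines of Python. This is only an illustration, not the investigators' actual tool: the function names and the four yes/no criteria are hypothetical, and the sketch assumes the third investigator's rating settles any disagreement, which the researchers did not spell out.

```python
from typing import Optional

SATISFACTORY = "satisfactory"
UNSATISFACTORY = "unsatisfactory"


def rate_response(direct: bool, accurate: bool, complete: bool, extraneous: bool) -> str:
    """Rate one response using the criteria named in the article.

    A response is satisfactory only if it answers directly, is accurate
    and complete, and carries no extraneous information.
    """
    if direct and accurate and complete and not extraneous:
        return SATISFACTORY
    return UNSATISFACTORY


def final_rating(first: str, second: str, third: Optional[str] = None) -> str:
    """Combine two investigators' ratings into one.

    Assumption: when the first two disagree, the third rating decides.
    """
    if first == second:
        return first
    if third is None:
        raise ValueError("investigators disagree; a third rating is needed")
    return third


# Hypothetical example: one investigator judges the answer incomplete,
# the other judges it fine, and a third sides with the first.
rating_a = rate_response(direct=True, accurate=True, complete=False, extraneous=False)
rating_b = rate_response(direct=True, accurate=True, complete=True, extraneous=False)
print(final_rating(rating_a, rating_b, third=UNSATISFACTORY))
```

Under this rule a single failed criterion is enough to make a response unsatisfactory, which is consistent with how strict the definition of "satisfactory" was.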

For the side effects study, researchers randomly selected 30 FDA-approved drugs and input into ChatGPT, "What are the most common side effects of [each selected drug]?" from April to June this year. ChatGPT responses that matched all common side effects listed in Lexicomp, the drug database service, were classified "accurate," those that matched half were classified "partially accurate," and those that matched less than half were "inaccurate."
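As a rough illustration of that three-way classification, the Python sketch below scores a response by the share of reference side effects it mentions. It rests on several assumptions not confirmed by the researchers: the function name is hypothetical, matching is reduced to case-insensitive exact comparison (the study's matching likely involved human judgment), and "matched half" is read as "at least half but not all." The side effect names are placeholders, not real drug data.

```python
def classify_response(chatgpt_effects, reference_effects):
    """Label a response by how many reference side effects it covers.

    Assumed thresholds: all matched -> "accurate"; at least half but
    not all -> "partially accurate"; less than half -> "inaccurate".
    """
    reference = {effect.strip().lower() for effect in reference_effects}
    if not reference:
        raise ValueError("reference list must not be empty")
    mentioned = {effect.strip().lower() for effect in chatgpt_effects}

    # Share of reference side effects that appear in the response.
    share = len(reference & mentioned) / len(reference)

    if share == 1.0:
        return "accurate"
    if share >= 0.5:
        return "partially accurate"
    return "inaccurate"


# Placeholder data only; these are not side effects of any real drug.
reference = ["effect_a", "effect_b", "effect_c", "effect_d"]
print(classify_response(["effect_a", "effect_b", "effect_c", "effect_d"], reference))
print(classify_response(["Effect_A", "effect_b"], reference))
print(classify_response(["effect_a"], reference))
```

Note that side effects ChatGPT listed but the database did not are ignored here; the article describes scoring only against the database's list.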

Study limitations included a lack of validated tools to measure the accuracy of ChatGPT and the use of ChatGPT versions that may not have been the most current.

Correction: Quotes in this story are attributable to Tina Zerilli, PharmD, not co-investigator Sara Grossman, PharmD.

Sophie Putka is an enterprise and investigative writer for MedPage Today. Her work has appeared in the Wall Street Journal, Discover, Business Insider, Inverse, Cannabis Wire, and more. She joined MedPage Today in August of 2021.

Disclosures

Zerilli and Toyoda disclosed no conflicts of interest.

Primary Source

American Society of Health-System Pharmacists

Matsuura M, et al "Evaluation of side effect drug information generated by ChatGPT" ASHP 2023; Abstract 8-023.

Secondary Source

American Society of Health-System Pharmacists

Grossman S, et al "ChatGPT: evaluation of its ability to respond to drug information questions" ASHP 2023; Abstract 8-021.