New Study Highlights Dangers of Patients Relying on ChatGPT for Treatment Recommendations

ChatGPT became an internet sensation soon after its November 2022 launch, reaching 100 million monthly active users by January 2023. People worldwide use the large language model (LLM) chatbot to generate answers and fulfill requests relating to every topic under the sun—including medical matters.

While ChatGPT can be a valuable tool for people to educate themselves about medical conditions, it is prone to generating inaccurate responses. For patients exploring treatment options following a cancer diagnosis, this is a troubling prospect.

Danielle S. Bitterman, MD, of the Department of Radiation Oncology at Dana-Farber Brigham Cancer Center, is a physician-scientist whose research focuses on natural language processing for cancer applications. She led a team of Brigham researchers who assessed how consistently ChatGPT provides cancer-treatment recommendations that align with National Comprehensive Cancer Network (NCCN) guidelines. Their paper, published in JAMA Oncology in August 2023, reports that ChatGPT provided an inappropriate (“non-concordant”) recommendation in over one-third of cases.

“We never would expect ChatGPT to be at the level of an oncologist in prescribing treatment; it’s just one resource of many that patients may use to begin educating themselves,” Dr. Bitterman says. “But while searching on Google gives you a whole list of results, ChatGPT just gives you one nice, summarized result. And it’s not so obvious that some of the sources that went into that answer may be higher in quality than others.

“Patients need to be aware that the medical advice they get with ChatGPT may be false. It’s not trained—and more importantly, it’s not clinically validated—to make these types of recommendations. In the end, patients should always speak with their doctor to learn more.”

A Focus on Breast, Prostate, and Lung Cancer

In their study, Dr. Bitterman and her colleagues focused on the three most common cancers: breast, prostate, and lung cancer. For each of 26 unique diagnosis descriptions, they created four slightly different prompts asking ChatGPT to recommend a treatment approach. A team of four board-certified oncologists assessed concordance of the chatbot output with NCCN guidelines.
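The study design described above amounts to crossing each diagnosis description with several prompt wordings. The sketch below illustrates that setup in Python; the template phrasings and diagnosis descriptions are placeholders invented for illustration, not the exact wordings used in the paper.

```python
# Hypothetical sketch of the prompt-construction step: each diagnosis
# description is combined with four slightly different prompt templates.
# (The study used 26 diagnosis descriptions; three placeholders are
# shown here. Template wordings are assumptions, not the paper's own.)

PROMPT_TEMPLATES = [
    "What is a recommended treatment for {dx}?",
    "How should {dx} be treated?",
    "What treatment approach would you recommend for a patient with {dx}?",
    "As a clinician, what is the standard treatment for {dx}?",
]

def build_prompts(diagnoses):
    """Return one prompt per (diagnosis, template) pair."""
    return [t.format(dx=dx) for dx in diagnoses for t in PROMPT_TEMPLATES]

# Placeholder diagnosis descriptions for illustration only.
diagnoses = [
    "stage I ER-positive breast cancer",
    "localized intermediate-risk prostate cancer",
    "stage IV non-small cell lung cancer",
]
prompts = build_prompts(diagnoses)
print(len(prompts))  # 3 diagnoses x 4 templates = 12 prompts
```

Each generated prompt would then be submitted to the chatbot, and every response scored against NCCN guidelines by the oncologist reviewers.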

The researchers found that 98% of all responses included at least one treatment approach that aligned with NCCN guidelines. However, 34% of these responses also included at least one non-concordant (i.e., incorrect or only partially correct) treatment recommendation.

The certainty with which ChatGPT replies to inquiries made it difficult to detect incorrect recommendations hidden among correct ones. “There have been some interesting new studies showing that humans are swayed when something is written very convincingly, even if the information is incorrect,” Dr. Bitterman says. “It’s not just patients who can be impacted by this bias—clinicians and other experts may be as well.”

‘Hallucinations’ Found in 12.5% of Cases

In 12.5% of cases, ChatGPT produced “hallucinations,” or treatment recommendations entirely absent from NCCN guidelines. These included recommendations of novel therapies or of curative therapies for non-curative cancers. The authors emphasize that this form of misinformation can incorrectly set patients’ expectations about treatment and even harm the clinician-patient relationship.

One of the more interesting findings of the study, Dr. Bitterman notes, was how dramatically slight differences in the prompts influenced the accuracy of treatment recommendations. This finding highlights both the promise and the challenges of interacting with LLMs.

“There is quite a bit of potential to improve results with better prompt engineering and doing things like grounding the models with the actual NCCN guidelines,” Dr. Bitterman says. “On the other hand, we need to make patients aware that at this point, you can’t completely rely on ChatGPT when you need medical information.”
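The "grounding" idea Dr. Bitterman mentions is often implemented by retrieving relevant guideline text and placing it in the prompt so the model answers from that text rather than from memory. The sketch below is a minimal, assumed illustration of that pattern; the guideline snippets and keyword-matching retrieval are invented placeholders, and a real system would retrieve passages from the actual NCCN documents.

```python
# Illustrative sketch of grounding a prompt with guideline text.
# GUIDELINE_SNIPPETS holds invented placeholder text, not real NCCN
# content; retrieval here is a naive keyword match for brevity.

GUIDELINE_SNIPPETS = {
    "breast": "[Placeholder excerpt from breast cancer treatment guidelines]",
    "prostate": "[Placeholder excerpt from prostate cancer treatment guidelines]",
    "lung": "[Placeholder excerpt from lung cancer treatment guidelines]",
}

def grounded_prompt(question: str) -> str:
    """Prepend any guideline snippet whose topic appears in the question."""
    context = "\n".join(
        text for topic, text in GUIDELINE_SNIPPETS.items()
        if topic in question.lower()
    )
    return (
        "Answer using ONLY the guideline text below. "
        "If the guidelines do not cover the question, say so.\n\n"
        f"Guidelines:\n{context}\n\nQuestion: {question}"
    )

print(grounded_prompt("What is the standard treatment for stage I breast cancer?"))
```

Instructing the model to answer only from the supplied text, and to admit when the text is silent, is one common way to reduce the kind of fabricated recommendations the study observed, though it does not eliminate them.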

Dr. Bitterman and her colleagues used GPT-3.5-turbo-0301, one of the largest models available when they conducted the study in February 2023, and the model class currently used in the open-access version of ChatGPT. (A newer version, GPT-4, is available only with the paid subscription.) They also used the 2021 NCCN guidelines, because GPT-3.5-turbo-0301 was developed using data up to September 2021.

“This work is a snapshot of a single model and time,” Dr. Bitterman says. “The difficulties in keeping up with these technologies speak to the need for benchmark datasets that would allow us to quickly evaluate technologies as they come out, especially if they’re going to be used in the clinic.

“Further complicating this is that medicine is constantly evolving. Even after we do these initial upfront evaluations, we need to research how to ensure that LLMs stay up to date with current standards of care. These technologies need to be accompanied by rigorous standards for evaluation and ongoing monitoring.”

Calling for Clinical Validation of LLMs

Dr. Bitterman stresses that LLMs like ChatGPT have the potential to democratize the sharing of medical knowledge among patients and providers. But first, she says, the safety and effectiveness of these models must be validated via clinical trials, just as drugs and medical devices are today.

Moving forward, Dr. Bitterman would like to see translational researchers like herself and clinicians working more closely with developers to ensure LLMs meet a high clinical standard. More immediately, education and transparency are two of her top priorities.

“There’s definitely a need for better training of clinicians on how to interact with LLMs and other forms of artificial intelligence,” she says. “More generally, AI literacy will soon go hand in hand with health literacy. We need to do a better job educating the public on how to use these models and understand whether you can trust them. In addition, people have a right to be aware of how their data is being used to develop and validate LLMs and the ramifications of their data being entered into these models.”