Large language models (LLMs) are a form of artificial intelligence (AI) with the potential to transform medicine. The past two years alone have seen everything from research on language model development and clinical applications of LLMs to pilots of LLMs that support clinical workflows.
The 2015 Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) initiative established minimum reporting standards for diagnostic and prognostic prediction model studies. TRIPOD+AI, a TRIPOD extension released in 2024, added best practices for predictive and prognostic AI research.
Neither of these guidelines, however, specifically addresses LLMs. To fill this gap, a team of investigators created TRIPOD-LLM, a tool that standardizes reporting guidelines for authors of studies focused on biomedical and healthcare LLMs. A statement announcing TRIPOD-LLM was published in Nature Medicine.
“TRIPOD+AI was being developed as LLMs were coming on the scene. Because of how quickly things were changing in the field, the authors explicitly excluded LLMs and generative AI,” says corresponding author Danielle S. Bitterman, MD, an AI researcher and radiation oncologist at Brigham and Women’s Hospital. “With TRIPOD-LLM, we’re accounting for the unique considerations and complications that LLMs present.”
“The novel complexities introduced by LLMs,” wrote the authors of the Nature Medicine statement, “include concerns regarding hallucinations, omissions, reliability, explainability, reproducibility, privacy, and biases being propagated downstream, which can adversely affect clinical decision-making and patient care.”
In addition, Dr. Bitterman notes that "the open-ended nature of generative AI and LLMs raises new considerations for research reporting."
Making LLM Research Reporting More Transparent
The rapid growth of biomedical and healthcare LLMs sparked conversations among the team behind the TRIPOD and TRIPOD+AI efforts, natural language processing researchers, and clinical researchers. These conversations coalesced into a group of about two dozen individuals united by a shared concern about the need for more transparency in the reporting of LLM research.
These individuals became the author team behind TRIPOD-LLM, which aims to “enhance reproducibility and research quality while also ensuring others can understand the results and implications of the research,” according to Dr. Bitterman.
Released in July 2024, TRIPOD-LLM provides a comprehensive checklist that users can complete online or via a downloadable PDF. The tool collects all information necessary for robust reporting of studies that are developing, tuning, prompt engineering, or evaluating an LLM, from title and abstract through results and discussion.
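For research teams that want to track their reporting progress programmatically rather than on paper, a checklist like this can be modeled as simple structured records. The sketch below is purely illustrative: the item names and sections are hypothetical examples, not the actual TRIPOD-LLM checklist items.

```python
from dataclasses import dataclass

# Hypothetical sketch of tracking reporting items for a manuscript.
# The items below are illustrative placeholders, not TRIPOD-LLM's wording.

@dataclass
class ChecklistItem:
    section: str          # e.g., "Title", "Methods", "Results"
    description: str      # what must be reported
    reported: bool = False
    location: str = ""    # where in the manuscript the item is addressed

def completion_rate(items):
    """Fraction of checklist items marked as reported."""
    return sum(item.reported for item in items) / len(items)

items = [
    ChecklistItem("Title", "Identify the study as developing or evaluating an LLM"),
    ChecklistItem("Methods", "Describe prompting or tuning procedures"),
    ChecklistItem("Results", "Report evaluation metrics and error analysis"),
]
items[0].reported = True
items[0].location = "p. 1"

print(f"{completion_rate(items):.0%} of items reported")
```

A structure like this lets a team see at a glance, from the start of a project, which reporting obligations remain open rather than discovering gaps at submission time.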
In her experience as a researcher, Dr. Bitterman has found that filling out checklists like TRIPOD-LLM can be arduous. Mindful that TRIPOD-LLM would add some reporting burden, she and her co-workers tried to make using it as easy as possible.
“Our hope is that when they start planning their research, users leverage it to understand what types of information they should be keeping track of, controlling for, and ultimately reporting to have a final research product that is as reproducible and transparent as possible,” she says. “It is also a resource for readers to assess the quality of LLM research.”
Living Document Will Evolve With the Field
Generative AI and LLMs are advancing rapidly. As a result, TRIPOD-LLM was conceived as a living document whose reporting recommendations would evolve as the field evolves.
“As more research comes out about these models, what we need to know in LLM-focused studies will likely change,” Dr. Bitterman says. “We’re going to solicit external comments and revisit the guidelines periodically based on the input we receive.”
TRIPOD-LLM is designed to help LLM researchers and developers, journal editors, and healthcare professionals navigate the LLM field. Dr. Bitterman expects researchers will comprise the largest share of users. Her long-term wish is that journal editors will ask researchers to submit the completed checklist alongside their research.
“As has been done with other TRIPOD guidelines, I hope that editors and peer reviewers will use it as a guide to ensure the manuscript they’re reviewing has reported all the information needed to accept it for publication,” she adds.
Dr. Bitterman notes that many of the same considerations that apply when evaluating LLMs and other forms of generative AI for research are also relevant when evaluating models before clinical implementation. Innovation, she says, must be balanced with safety.
"The details of that evaluation and monitoring at a larger scale are a research question that my lab is really interested in," she concludes. "How we support that process will be essential for the safe translation of these models."