April 24, 2024
Using 100 simulated cancer patient scenarios paired with questions, researchers evaluated the impact of using an LLM, GPT-4, to draft responses to patient questions.
Safety: 58% of LLM-generated responses were safe and usable without edits, although 7.7% posed safety risks if used unedited.
Reduced Workload: Clinicians reported subjective efficiency improvements with LLM assistance, which could reduce time spent on patient communications and potentially alleviate burnout.
The content of manual responses was significantly different from the content of LLM draft and LLM-assisted responses. LLM errors tended to arise not from incorrect biomedical factual knowledge, but from incorrect clinical gestalt and misidentification of the urgency of a situation.
We found pre-clinical evidence of anchoring based on LLM recommendations, raising the question: Is using an LLM to assist with documentation simple decision support, or will clinicians tend to adopt the reasoning of the LLM? Despite coming from a simulation study, these early findings provide a safety signal indicating a need to thoroughly evaluate LLMs in their intended clinical contexts, reflecting the precise task and level of human oversight. Moving forward, more transparency from EHR vendors and institutions about prompting methods is urgently needed for evaluations. LLM assistance is a promising avenue for reducing clinician workload, but it has implications that could have downstream effects on patient outcomes. This necessitates treating LLMs with the same evaluative rigor as any other software as a medical device.
Check out the articles in STAT and Forbes discussing our study!