Review finds potential uses, pitfalls for generative AI in medicine
Although large language models could be used in several areas of medicine, many notable challenges remain unresolved, limiting their implementation, the review authors said.
A new narrative review advises clinicians on the most promising uses of large language models (LLMs) in daily practice, as well as some potential pitfalls.
LLMs are artificial intelligence (AI) models trained on vast quantities of text data to generate humanlike output and are already being used in health care. Researchers at Stanford University provided a comprehensive overview of the training processes behind LLMs, their historical and current applications in medicine, descriptions of popular models, and important open research questions. The review was published Jan. 30 by Annals of Internal Medicine.
Models are only as accurate as the data sets used to train them, and LLMs are trained on data sets so large that human teams cannot manually check their quality, according to the review. The result is models trained on nebulous data, which may further decrease user trust in these algorithms, the authors wrote. Because the data sets' quality cannot be verified, training and testing data often overlap, leading to overestimates of model accuracy. The data used in training can also become outdated, and retraining a model on updated information is nontrivial, the review said.
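The overlap the authors describe is essentially data leakage: when a model is evaluated on examples it has already seen during training, its measured accuracy is inflated. A minimal sketch of that effect follows, using a small scikit-learn toy classifier rather than an LLM; all names and numbers here are illustrative assumptions, not drawn from the review.

```python
# Illustrative sketch (not from the review): evaluating on data that overlaps
# with the training set overstates accuracy compared with truly held-out data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic records standing in for training examples and labels.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Scoring on examples the model has already seen reports near-perfect accuracy...
print("accuracy on seen data:    ", accuracy_score(y_train, model.predict(X_train)))
# ...while held-out data reveals the more modest real-world performance.
print("accuracy on held-out data:", accuracy_score(y_test, model.predict(X_test)))
```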
“Of note, ChatGPT and many other LLMs are not trained on curated medical data sets but rather on a broad range of inputs, from news articles to literary works, that allow models to capture linguistic patterns and features,” the review stated. “Moreover, the models do not ‘understand’ the actual content and thus can generate completely fabricated responses. This can result in poor performance in domain-specific questions, including medical applications.”
Models also frequently amplify and reinforce structural biases found in their training data sets. They have been shown to promote practices that have long been scientifically refuted, including certain methods for estimating glomerular filtration rate; false assertions about race, muscle mass, and creatinine levels; negative sentiments about people with disabilities; and overrepresentation of gun violence, homelessness, and drug addiction among patients with mental illness. “In another scenario, LLMs were asked to provide analgesia choices for chest pain for White and Black patients, resulting in weaker analgesic recommendations for Black patients,” the review stated. To mitigate these risks, the authors suggest checks and balances, including always keeping a human in the loop and using AI tools to augment work tasks rather than replace them.
According to the review, there are several areas where LLMs could be used in medicine, such as administrative tasks, augmentation of clinician knowledge, medical education, and medical research. “Despite these opportunities, many notable challenges with LLMs remain unresolved, limiting the implementation of these models in medicine,” the authors wrote. “Issues affecting adoption include underlying biases in data sets, data quality and unpredictability of outputs, patient privacy, and ethical concerns. Physicians and other health care professionals must weigh potential opportunities with these existing limitations as they seek to incorporate LLMs into their practice of medicine.”