


ChatGPT in Medicine Part 2: Optimizing Large Language Models for Medicine


In my last blog post, I talked about ChatGPT and its potential uses in medicine. I drew on my experiences as a medical student and artificial intelligence (AI) researcher to discuss the possibilities and to demonstrate an example of how ChatGPT might hypothetically help a medical student learn about the potential causes of illness in a patient and generate a differential diagnosis. Today, I’d like to dive deeper into large language models (LLMs) that have been optimized specifically for medicine.

ChatGPT and other AI systems like it are generally trained on a large body of data; while OpenAI (the creator of ChatGPT) has not released the exact training set it used, we can assume it is a large, general corpus of text, articles and webpages. The model is then refined through reinforcement learning from human feedback (RLHF), a modification of typical machine learning training procedures. The model receives feedback on proposed answers and is told which answer a human “prefers,” enabling it to learn not only to give accurate answers but also to sound natural and humanlike. This protocol has also been shown to reduce the time needed to train a model by enabling rapid feedback. The result is a model that is broadly applicable and able to “chat” about a wide range of topics. However, the performance of any AI depends on its context and training, and for medical applications there are advantages to creating LLMs optimized specifically for use in health care.
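To make the human-feedback step more concrete: at its core is often a reward model trained on pairs of answers, where the answer a human preferred should receive the higher score. The sketch below is a minimal, hypothetical illustration of that pairwise preference objective in plain Python with NumPy; the feature vectors, the linear reward model and the update rule are simplified assumptions for illustration, not OpenAI’s actual training code.

```python
import numpy as np

# Minimal sketch of the pairwise preference objective used when training a
# reward model for reinforcement learning from human feedback (RLHF).
# Everything here (feature vectors, linear reward model, gradient step) is a
# simplified assumption for illustration, not a production procedure.

rng = np.random.default_rng(0)

# Toy "answers": each answer is represented by a small feature vector.
# In a real system these representations would come from the language model itself.
preferred_answers = rng.normal(loc=0.5, size=(100, 8))  # answers humans preferred
rejected_answers = rng.normal(loc=0.0, size=(100, 8))   # answers humans rejected

weights = np.zeros(8)  # a linear reward model: score = features @ weights

def preference_loss(w, chosen, rejected):
    """Pairwise loss: push the preferred answer's score above the rejected one's."""
    margin = chosen @ w - rejected @ w
    return np.mean(np.log1p(np.exp(-margin)))  # equivalent to -log sigmoid(margin)

learning_rate = 0.1
for step in range(200):
    margin = preferred_answers @ weights - rejected_answers @ weights
    # Gradient of -log sigmoid(margin) with respect to the weights.
    grad_coeff = -1.0 / (1.0 + np.exp(margin))
    grad = (grad_coeff[:, None] * (preferred_answers - rejected_answers)).mean(axis=0)
    weights -= learning_rate * grad

print("final preference loss:", preference_loss(weights, preferred_answers, rejected_answers))
```

In a full RLHF pipeline, a reward model along these lines is then used to fine-tune the language model itself, which is where the “reinforcement” comes in.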

For example, several LLMs have shown surprisingly good performance on common medical questions and medical licensure exams. Researchers found that an LLM trained on general data did not approach the performance of human clinicians, but after tailoring the training process to favor clinically relevant and scientifically accurate answers, some medical LLMs exceeded 90% accuracy. Moreover, using AI to flag low-confidence answers and route them to a clinician for manual review has shown potential to further improve performance while reducing clinician workload. Accuracy on curated question sets is an imperfect measure of how an AI would function in a real-world, real-time setting, although these question sets are also how we assess human clinicians, so strong AI performance on these tasks is an encouraging sign that underscores the value of training medical-specific LLMs.
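The idea of routing low-confidence answers to a clinician can be sketched as a simple triage rule. The example below is a hypothetical illustration, not a validated clinical workflow: the confidence scores, the threshold and the answer records are all assumed placeholders.

```python
from dataclasses import dataclass

# Hypothetical sketch of confidence-based triage: answers the model is unsure
# about are routed to a clinician instead of being returned automatically.
# The confidence values and threshold are illustrative assumptions.

@dataclass
class ModelAnswer:
    question: str
    answer: str
    confidence: float  # e.g., a calibrated probability from the model, 0.0-1.0

REVIEW_THRESHOLD = 0.80  # assumed cutoff; a real system would tune and validate this

def triage(answers: list[ModelAnswer]) -> tuple[list[ModelAnswer], list[ModelAnswer]]:
    """Split answers into those released automatically and those sent for clinician review."""
    auto_release = [a for a in answers if a.confidence >= REVIEW_THRESHOLD]
    needs_review = [a for a in answers if a.confidence < REVIEW_THRESHOLD]
    return auto_release, needs_review

if __name__ == "__main__":
    batch = [
        ModelAnswer("First-line therapy for uncomplicated hypertension?", "A thiazide diuretic ...", 0.93),
        ModelAnswer("Interpretation of this ambiguous lab panel?", "Possibly ...", 0.41),
    ]
    released, flagged = triage(batch)
    print(f"{len(released)} released automatically, {len(flagged)} flagged for clinician review")
```

In practice, the threshold itself would need careful calibration and clinical validation before any answer was released without review.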

These specialized LLMs are an important step toward incorporating AI into clinical practice. However, like all AI models, they raise important ethical and privacy concerns that are particularly acute in the setting of health care.

One important example is bias — a challenge with any AI system is that biases inherent to the training data can be learned and propagated by the AI. This problem is particularly acute in medicine, where there is plentiful evidence that biases along the lines of race, socioeconomic status, sexual orientation and other characteristics can cause unequal outcomes among patients. For example, Black women in the United States are three times more likely to die from a pregnancy-related cause than white women — a statistic that can be attributed to disparities in access to care, racial biases prevalent within the health care system, and broader societal inequities along racial and socioeconomic lines.

These disparities and biases likely manifest themselves in AI training data; for example, a model later tasked with predicting success for vaginal birth after cesarean (VBAC) could erroneously “learn” that Black mothers are less likely to deliver vaginally after a prior cesarean section. (This matters because a successful VBAC can reduce the risk of surgical complications and support bodily autonomy for mothers, though attempting one carries its own risks and must be carefully managed.) In reality, rates of VBAC success are confounded by the fact that Black women are less likely to be offered a trial of VBAC and by disparities in maternal health that negatively affect Black women. While awareness of these types of biases and the resulting disparities is growing among health care professionals, we are a long way from their eradication. If care is not taken in curating biomedical training data, LLMs could unintentionally perpetuate inequality.

Another concern with artificial intelligence in medicine is that models cannot “forget.” They can be continuously trained and updated, but unless the model is deleted and retrained from scratch (or at least reverted to a state before a specific piece of problematic data was introduced), it cannot be made to unlearn information. Full retraining is becoming less feasible as AI models grow larger and more expensive and time consuming to train; large language models can take months to years to train, at costs upward of several million dollars. This poses a problem in the case of biased data, as described above, but it is also a risk to patient privacy: if a patient withdraws consent for their data to be used after the model has already been trained on it, there is no clear path to removing that data from a large-scale model like an LLM.

Despite these concerns, LLMs and other AI technologies are here to stay in medicine. Models trained on specialized biomedical data have the potential to improve performance and provide tailored services, and they could become an integral part of medical practice in the future.

