This document outlines the rationale behind Glass Health’s strategy of having physicians create evidence-based, peer-reviewed clinical guidelines that a large language model can use as context to generate clinical plan drafts and accurate and trustworthy answers to clinical reference questions.
What is a large language model?
A large language model (LLM) is a computer program designed to process and generate text. More specifically, it is a deep neural network, a complex machine learning algorithm inspired by the architecture of the human brain, that is trained on text data to generate text outputs when provided with text input.
How is a large language model created?
Large language models are trained by exposing the model to vast amounts of text data, gradually adjusting its ability to predict text. At baseline, the model predicts text essentially at random. During training, the model is presented with snippets of text and tasked with predicting the next word in the sequence. When the model's predictions deviate from the actual words in the training data, it calculates the error and adjusts its parameters, the numerical values that dictate how input data is processed. This iterative process, involving many passes over the dataset, progressively refines the model's ability to predict text.
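To make this concrete, here is a minimal sketch of a single next-word-prediction training step, written in PyTorch. It is illustrative rather than production code; `model`, `optimizer`, and `batch` are placeholders for a neural network, an optimizer such as Adam, and a batch of tokenized text.

```python
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    # batch: tensor of token ids, shape (batch_size, sequence_length)
    inputs, targets = batch[:, :-1], batch[:, 1:]   # predict each next token
    logits = model(inputs)                          # (batch, seq, vocab_size)
    # Cross-entropy measures how far the predictions deviate from the data.
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()   # compute an error signal for every parameter
    optimizer.step()  # nudge the parameters to better predict the data
    return loss.item()
```

Repeating this step over millions of text snippets is what turns random prediction into fluent text generation.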
What happens when a large language model generates an output?
After a large language model (LLM) has been trained, it can be used for inference, the process by which a trained model makes predictions based on new, previously unseen data. When a user provides input, the model passes that input through the multiple layers of its neural network. Each layer applies parameters that were adjusted during training to transform the input and generate predictions for potential word sequences. The model performs real-time calculations to determine the probabilities of various word sequences given the user's specific input. The output provided to the user is thus a reflection of both the model's foundational training and its immediate computations on the current input.
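As a sketch of what inference looks like in code, the loop below generates text one token at a time, assuming a Hugging Face-style causal language model whose forward pass returns `logits` (scores over the vocabulary). At each step the model computes a probability for every candidate next token and samples one; no answer is looked up anywhere.

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=40, temperature=0.8):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(ids).logits[0, -1]          # scores for the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # sample probabilistically
        ids = torch.cat([ids, next_id.unsqueeze(0)], dim=1)
    return tokenizer.decode(ids[0])
```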
Does an LLM have knowledge?
The inference capability of an LLM to generate text based on input is especially effective for question-and-answer scenarios. As an illustration, if "What is the capital of the United States?" is provided as input, the LLM uses its parameters, adjusted through exposure to extensive text data, to guide its generation process and may produce an output like "The capital of the United States is Washington, DC". Importantly, the response isn't fetched from a specific knowledge database or encyclopedia; rather, it's predictively and probabilistically generated based on the model's trained parameters.
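One way to see this probabilistic behavior directly is to inspect the probabilities a trained model assigns to the next word. The snippet below, which uses the small, publicly available GPT-2 model from the Hugging Face `transformers` library purely for illustration, prints the most likely continuations of the capital question.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The capital of the United States is", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # scores for the very next token
probs = torch.softmax(logits, dim=-1)

# The answer emerges as the highest-probability continuation, not a lookup.
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(idx)!r}: {p.item():.3f}")
```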
Bringing us back to medicine, when we say that an LLM can pass the USMLE, we are really saying that when presented with a text input that is a USMLE question, an LLM can produce text output that reflects what we humans judge to be the right answer. Again, this output is generated probabilistically. It does not reflect information that the LLM “knows”; rather, it reflects an emergent ability, acquired by training on immense amounts of text data, to generate text that matches the right answers to USMLE questions.
What are the limitations of an LLM in the clinical environment?
An LLM's capacity to generate text probabilistically is the source of its great utility, enabling it to operate as a versatile tool and excel at many tasks involving natural language processing and text generation. However, the probabilistic nature of its text generation also presents a challenge, particularly in the clinical environment. When an LLM is asked a factual question, or tasked with generating text that involves factual information, it can produce a response that is factually incorrect, and it does so without any inherent understanding of the veracity of its responses. For example, an LLM asked to generate a research paper may produce fictitious references or citations: rather than pulling from a database of known sources, the model generates the text of the citation probabilistically based on its training data, and it can produce an author name or paper title that has no relevance to the text it is citing or that does not exist at all.

Additionally, and importantly, by training on datasets that represent much of the publicly available text on the internet, LLMs can mirror and perpetuate the biases reflected in that text. In clinical domains, where accuracy and equity are of critical importance, these limitations must be addressed.
How can we overcome the limitations of the LLM in the clinical environment?
There are several techniques that can begin to overcome an LLM's limitations in the clinical environment and increase the chance that its outputs are accurate, evidence-based, and trustworthy: namely, fine-tuning, reinforcement learning from human feedback, and retrieval-augmented generation.
Fine-tuning is a process in which a pre-trained model, like a large language model, is further trained, usually on a smaller, specialized dataset, to adapt it to a specific task or to refine its outputs. When it comes to addressing the limitations of large language models in generating truthful answers, fine-tuning can be used to adjust the model's behavior based on a curated dataset that emphasizes factual accuracy and domain-specific knowledge. This additional training phase allows the model to better align with the desired accuracy and specificity of the task at hand, thereby enhancing its reliability in producing correct and relevant responses. At Glass Health, fine-tuning might involve providing a model with a large number of sample inputs and sample outputs created by our team, with the goal of adjusting the model's parameters so that it generates text that better aligns with the sample outputs.
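As an illustration only, and not a description of Glass Health's actual pipeline, a single supervised fine-tuning step on one curated input/output pair might look like the following, again assuming a Hugging Face-style causal language model:

```python
def fine_tune_step(model, tokenizer, optimizer, example):
    # example: {"input": clinician-written prompt, "output": ideal response}
    text = example["input"] + "\n" + example["output"]
    ids = tokenizer(text, return_tensors="pt").input_ids
    outputs = model(ids, labels=ids)  # Hugging Face causal LMs compute the
                                      # next-token loss when labels are given
    outputs.loss.backward()           # shift parameters toward the curated output
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```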
Reinforcement Learning from Human Feedback (RLHF) is a methodology, closely related to fine-tuning, that is used to improve large language models. In this process, the model is asked to generate several responses to the same input. The response options are ranked by human AI trainers, and the highest-ranked outputs are then used to fine-tune the model's parameters and improve performance.
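A core ingredient of RLHF is a reward model trained on those human rankings. Below is a simplified sketch of the standard pairwise ranking loss; `reward_model` is a hypothetical function that assigns a scalar score to a tokenized response. In the full pipeline, the language model is subsequently optimized against this learned reward, typically with a reinforcement learning algorithm such as PPO.

```python
import torch.nn.functional as F

def reward_ranking_loss(reward_model, preferred_ids, rejected_ids):
    # Score the response the human trainers ranked higher
    # and the one they ranked lower.
    r_preferred = reward_model(preferred_ids)
    r_rejected = reward_model(rejected_ids)
    # Pairwise loss: push the preferred response's score above the other's.
    return -F.logsigmoid(r_preferred - r_rejected).mean()
```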