Gosta Labs publishes its first scientific results at ASCO 2024

The results represent some of the first insights worldwide into the capabilities of large language models to interpret ESMO and NCCN clinical practice guidelines.

May 30, 2024

At Gosta Labs, we are excited to announce the publication of our first scientific results in conjunction with the 2024 ASCO Annual Meeting. Our research investigated the use and performance of large language models that use in-context learning and retrieval-augmented generation for interpreting clinical practice guidelines from the European Society for Medical Oncology (ESMO) and the National Comprehensive Cancer Network (NCCN) for non-small-cell lung cancer (NSCLC) and small-cell lung cancer (SCLC).

The scientific abstract is now available on the 2024 ASCO Annual Meeting conference website. We are also excited to share more details about our initial research and its implications for the future in this article.

Background: Importance of the study and research questions

Recent advancements in artificial intelligence, particularly in the development of advanced large language models (LLMs), have opened new opportunities for enhancing patient care across multiple domains. These sophisticated AI systems have the potential to significantly improve clinical decision support by providing healthcare professionals with accurate and evidence-based information in real-time, thus enhancing diagnostic precision and optimizing treatment plans. Additionally, LLMs can assist in patient education by answering medical queries, thereby empowering patients with reliable information to manage their health more effectively.

However, despite their potential, the integration of LLMs into routine clinical practice has been hindered by challenges such as hallucinations. In the context of artificial intelligence, hallucinations refer to instances where the model generates information that is either incorrect or inaccurate. This issue has been a significant barrier to the adoption of LLMs in healthcare, as the reliability and accuracy of the information provided by these models are paramount.

Our primary research question was: Can in-context learning (ICL) and retrieval-augmented generation (RAG) improve large language model performance and reduce hallucinations when interpreting clinical guidelines? In our study, we explored a novel method that could enable improved use of LLMs in conjunction with validated clinical contexts, such as clinical guidelines, by combining in-context learning with retrieval-augmented generation.

Methods: The developed clinical guidelines RAG application and study design

To address our research questions, we developed a method combining in-context learning and retrieval-augmented generation (RAG) within Gosta Labs’ health AI platform. First, we curated the guidelines, including all relevant tables and diagrams, and converted them into a structured text format. This text was then split into documents and stored in a vector database designed for efficient retrieval.
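For illustration, the sketch below shows what such an ingestion step can look like: the curated guideline text is split into overlapping chunks, embedded, and kept in a simple in-memory index that stands in for the vector database. The file name, chunk size, overlap, and embedding model are illustrative assumptions, not the configuration used in the study.

```python
# Minimal ingestion sketch, assuming the OpenAI Python client (v1+) and an
# in-memory index in place of a dedicated vector database.
import numpy as np
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def split_into_chunks(text: str, chunk_size: int = 1500, overlap: int = 200) -> list[str]:
    """Split guideline text into overlapping character-based chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks


def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of text chunks with an OpenAI embedding model (illustrative choice)."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in response.data])


guideline_text = open("esmo_sclc_guideline.txt").read()  # hypothetical curated guideline file
chunks = split_into_chunks(guideline_text)
index = {"chunks": chunks, "vectors": embed(chunks)}  # stands in for the vector database
```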

Next, we employed OpenAI’s GPT-4 Turbo model (version gpt-4-1106-preview), which has a knowledge cutoff in April 2023 and was generally rated as the most advanced LLM at the time of the study, as the core large language model for our implementations. Finally, we developed an algorithm to find the most relevant documents from the vector database and create tasks for the LLM by combining the task prompts with the relevant documents.
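A minimal sketch of that retrieval and prompt-assembly step, building on the index above, could look as follows. The cosine-similarity ranking, the top-k cutoff, and the system prompt are illustrative assumptions rather than the study's actual algorithm.

```python
# Retrieval sketch: embed the question, rank guideline chunks by cosine
# similarity, and assemble the task prompt from the top-ranked excerpts.
def retrieve(index: dict, question: str, top_k: int = 5) -> list[str]:
    """Return the top_k guideline chunks most similar to the question."""
    q_vec = embed([question])[0]
    vectors = index["vectors"]
    scores = vectors @ q_vec / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q_vec))
    best = np.argsort(scores)[::-1][:top_k]
    return [index["chunks"][i] for i in best]


def answer_with_rag(index: dict, question: str) -> str:
    """Combine the task prompt with retrieved guideline excerpts and query the LLM."""
    context = "\n\n".join(retrieve(index, question))
    completion = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "Answer using only the guideline excerpts provided."},
            {"role": "user", "content": f"Guideline excerpts:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```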

A visualization of the developed RAG application for ESMO and NCCN lung cancer clinical guidelines is outlined below:

Two oncologists were included in the study group to evaluate the performance of the large language models. They did not participate in the development of the RAG application but acted as external evaluators. The oncologists generated relevant questions for the model and subsequently rated the responses. Overall, they created 11 questions related to SCLC and 13 questions concerning NSCLC, focusing on treatment recommendations and definitions. The questions are outlined in more detail in the results section.

We tested three different settings to gauge the LLM performance (a prompt-construction sketch follows the list):

  1. GPT-4 Turbo model with existing knowledge (GPT-4): Utilizing the model’s pre-existing knowledge base without additional context.
  2. ICL with Maximum Context Length (ICL-MC): Providing the model with the entire guidelines, or as much as fits within the maximum context length of 128,000 tokens, to incorporate guideline information as comprehensively as possible.
  3. ICL with RAG (ICL-RAG): Using a heuristic approach to include only the most relevant parts from the vector database, aiming to enhance the efficiency and relevance of the source materials.
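To make the distinction concrete, the sketch below shows one way the three settings can differ purely in the context supplied to the model; the prompt wording and token budgeting are simplified assumptions, not the study's actual prompts.

```python
# Sketch of how the three evaluated settings differ in the context supplied
# to the model; only the context construction changes, not the question.
def build_messages(question: str, setting: str, index: dict | None = None,
                   full_guidelines: str | None = None) -> list[dict]:
    if setting == "GPT-4":        # pre-existing model knowledge, no added context
        context = ""
    elif setting == "ICL-MC":     # entire guidelines, up to the 128,000-token window
        context = f"Guidelines:\n{full_guidelines}\n\n"
    elif setting == "ICL-RAG":    # only the retrieved, most relevant chunks
        context = "Guideline excerpts:\n" + "\n\n".join(retrieve(index, question)) + "\n\n"
    else:
        raise ValueError(f"Unknown setting: {setting}")
    return [{"role": "user", "content": f"{context}Question: {question}"}]
```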

For each setting, we generated specific question prompts tailored to the ESMO guidelines, the NCCN guidelines, and a combination of both. Two experienced oncologists reviewed a total of 216 responses generated across these different settings and assessed how well each response aligned with the official ESMO and NCCN guidelines. Each question series was blinded so that the oncologists did not know which setting had produced the responses they were evaluating. An evaluation questionnaire was developed with the following questions and choices (a sketch of the corresponding evaluation record follows the list):

  1. Did the system provide an answer to the question? (Yes / No)
  2. How accurately does the system's answer align with the guidelines from [ESMO/NCCN]? (Completely accurate / Partially accurate / Not accurate)
  3. In case of inaccuracies, what best describes the issue with the system's answer? (The answer lacks important information as per the guidelines / The answer contains incorrect or non-factual (hallucinated) information / The answer is entirely incorrect)
  4. Did the system's answer include accurate evidence? (Yes, the evidence was accurate / Partially, some evidence was accurate, but some was not / No, the evidence was not accurate)
  5. If the advice provided by the system were followed, would it lead to an off-label treatment? (Yes / No / Not applicable)
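As an illustration of how such blinded assessments might be captured for analysis, the sketch below models a single evaluation record. The field names and value sets are hypothetical and do not reflect the study's actual data schema.

```python
# Hypothetical record of one blinded oncologist evaluation; the fields mirror
# the questionnaire above, but the schema is illustrative, not the study's own.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evaluation:
    question_id: str
    setting: str                    # GPT-4 / ICL-MC / ICL-RAG, hidden during review
    guideline: str                  # ESMO / NCCN / combined
    answered: bool                  # Q1: did the system provide an answer?
    accuracy: str                   # Q2: completely / partially / not accurate
    inaccuracy_type: Optional[str]  # Q3: missing info / hallucinated / entirely incorrect
    evidence_accuracy: str          # Q4: accurate / partially accurate / not accurate
    off_label: str                  # Q5: yes / no / not applicable
```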

The study design is visualized below:

Results for the small-cell lung cancer (SCLC) related questions

The results of the oncologist evaluations for SCLC-related questions are shown in the table below:

For the ESMO guidelines on SCLC, both the RAG application and the pure ICL (using the whole guidelines as context) outperformed the GPT-4 Turbo model with its existing knowledge. Overall, the ICL-RAG and ICL-MC applications answered 90.9% (10/11) of the SCLC questions completely correctly. There was no consensus among the oncologists for one evaluation in the case of ICL-RAG, and ICL-MC had one partially accurate response. GPT-4 answered 72.7% (8/11) of the questions completely accurately, with partially accurate responses to 18.2% (2/11) of the questions. For one question, the oncologists did not reach a consensus on the model’s response.

For the NCCN guidelines on SCLC, the ICL-RAG application performed the best, answering 90.9% (10/11) of the questions completely accurately, with the oncologists not reaching a consensus on one evaluation. This was followed by GPT-4, which answered 81.8% (9/11) of the questions completely accurately, with no consensus for two evaluations. The ICL-MC application answered 45.5% (5/11) of the questions completely accurately, 9.1% (1/11) partially accurately, and there was no consensus among the oncologists for 45.5% (5/11) of the evaluations, likely indicating ambiguous responses to those questions by the model.

For the ESMO guidelines, the assessments of the questions without consensus ranged between partially and completely accurate. For the NCCN guidelines, the ICL-MC assessments ranged from not accurate to partially accurate for one question, and otherwise from partially to completely accurate. For ICL-RAG and GPT-4, the assessments ranged between partially and completely accurate.

Results for the non-small-cell lung cancer (NSCLC) related questions

The results of the oncologist evaluations for NSCLC-related questions are shown in the table below:

For the ESMO guidelines on NSCLC, the ICL-MC setting performed the best, providing completely correct responses to 76.9% (10/13) of the questions, with the oncologists not reaching a consensus on two questions. The ICL-RAG setting followed, with completely correct responses to 69.2% (9/13) of the questions, and no consensus on three questions. Both ICL-MC and ICL-RAG had one partially accurate response. The GPT-4 model provided completely accurate responses to 53.8% (7/13) of the questions, with one partially accurate response. For GPT-4, the oncologists did not reach a consensus on five questions, likely indicating more ambiguous responses to those questions.

For the NCCN guidelines on NSCLC, both ICL-RAG and GPT-4 answered 76.9% (10/13) of the questions completely accurately. For GPT-4, the oncologists did not reach a consensus on three questions. ICL-RAG had two partially accurate responses, and there was no consensus on one question. The ICL-MC setting answered 25% (3/12) of the questions accurately, with partially accurate responses to 33.3% (4/12) of the questions. Oncologists did not reach a consensus on 41.7% (5/12) of the evaluated questions. Additionally, one question had a missing evaluation, thus precluding the assessment of consensus for that response.

For the ESMO guidelines, when oncologists did not reach a consensus in evaluation, the assessments ranged between:

  • GPT-4 Turbo: In one evaluation between not accurate and completely accurate, otherwise between partially and completely accurate.
  • ICL-RAG: Between partially and completely accurate.
  • ICL-MC: Between partially and completely accurate.

For the NCCN guidelines, the differences in evaluation when consensus was not reached were as follows:

  • GPT-4 Turbo: Assessments ranged between partially and completely accurate.
  • ICL-RAG: Assessments ranged between partially and completely accurate.
  • ICL-MC: In two instances, between not accurate and partially accurate; otherwise, between partially and completely accurate.

Insights on non-factual or incorrect information (hallucinations) by the models

For responses containing inaccuracies, oncologists evaluated whether the model’s response included non-factual or incorrect information (hallucinations). The oncologists reached a consensus on one question where the response was hallucinated, which was GPT-4 Turbo’s response to the question regarding immunotherapy as a recommended treatment for extensive-stage SCLC. For ICL-RAG and ICL-MC, evaluators did not reach a consensus on any hallucinations in the model responses.

When responses were assessed individually, without requiring consensus (i.e., a response was counted as hallucinated if either oncologist identified it as such), the following results were obtained:

  • GPT-4 Turbo: 14.6% (7/48) of the responses included incorrect or non-factual information.
  • ICL-RAG: 8.3% (4/48) of the responses included incorrect or non-factual information.
  • ICL-MC: 12.5% (6/48) of the responses included incorrect or non-factual information.

Conclusions and future implications of our study

The findings from our study provide new insights into the performance of large language models in interpreting clinical guidelines for small-cell lung cancer (SCLC) and non-small-cell lung cancer (NSCLC). Our evaluation compared the GPT-4 Turbo model, In-Context Learning with Maximum Context (ICL-MC), and In-Context Learning with Retrieval-Augmented Generation (ICL-RAG).

GPT-4 Turbo: Performing well on American (NCCN), but more challenges with European (ESMO) clinical practice guidelines

The performance of the GPT-4 Turbo model in interpreting clinical guidelines for SCLC and NSCLC showed varied results. For the ESMO guidelines on SCLC, GPT-4 Turbo provided completely accurate responses to 72.7% of the questions, with partially accurate responses to 18.2%. For the NCCN guidelines on SCLC, GPT-4 Turbo performed better, answering 81.8% of the questions accurately, although oncologists did not reach a consensus on two questions. For NSCLC, the model provided accurate responses to 53.8% of the ESMO guidelines questions and 76.9% of the NCCN guidelines questions. Oncologists reached a consensus on hallucinations for one question, and when evaluated individually, hallucinations were identified in 14.6% of the responses, indicating the need for further improvements to enhance the model’s reliability in real-world clinical settings.

Interestingly, GPT-4 Turbo performed better on NCCN guidelines compared to ESMO guidelines, possibly due to a bias towards American-based training materials.

ICL-RAG: Consistent and stable results across guidelines

The ICL-RAG application demonstrated good performance across both SCLC and NSCLC guidelines. For the ESMO guidelines on SCLC, ICL-RAG answered 90.9% of the questions completely correctly, with the oncologists not reaching a consensus on one question. The performance was equally strong for the NCCN guidelines, with 90.9% accuracy and, again, no consensus on one question. For NSCLC, ICL-RAG answered 69.2% of the ESMO guidelines questions and 76.9% of the NCCN guidelines questions accurately.

Oncologists did not reach a consensus on any hallucination for the ICL-RAG application. However, when evaluated individually, hallucinations were identified in 8.3% of the responses, the smallest percentage among the approaches. Oncologists also reached consensus most often when evaluating ICL-RAG outputs, suggesting that the RAG-based approach provides clearer and more interpretable responses than the other methods.

ICL-MC: Good performance with shorter contexts, challenges with longer contexts

The ICL-MC setting showed promising results, particularly for the ESMO guidelines on SCLC, where it answered 90.9% of the questions accurately. However, the performance was lower for the NCCN guidelines, with only 45.5% accuracy and significant ambiguity indicated by the lack of consensus on 45.5% of the questions. For NSCLC, ICL-MC provided accurate responses to 76.9% of the ESMO guidelines questions but only 25% of the NCCN guidelines questions, with partially accurate responses to 33.3%. Oncologists did not reach a consensus on hallucinations; however, when evaluated individually, hallucinations were identified in 12.5% of the responses.

The ICL-MC approach performed especially well with the ESMO guidelines, which had a smaller token count compared to the NCCN guidelines. This indicates that as the context length increases, the models tend to forget crucial information, aligning with previous research findings on LLM behavior.

Future implications

The study’s findings underscore the potential of using advanced LLMs for interpreting clinical guidelines. However, the results also highlight the need for continuous improvement in reducing hallucinations and ensuring the accuracy of model responses. ICL techniques, particularly when combined with RAG, seem to offer advantages in providing accurate and contextually relevant information.

Future research should focus on refining these models to further reduce the occurrence of hallucinations and enhance their ability to interpret and apply clinical guidelines accurately. As seen in the differences between ESMO and NCCN interpretation, incorporating local and up-to-date guidelines is crucial for ensuring that LLMs can be effectively used across different healthcare systems and regulatory environments. Additionally, using RAG to mitigate the limitations of long context in ICL suggests a promising direction for optimizing model performance and cost-efficiency.

Overall, our results indicate that while current LLMs have strong potential, there is still a need for improvements to fully realize their capabilities in clinical settings. Continued collaboration between AI developers and clinical experts will be essential for achieving these advancements and improving patient care outcomes.