Dear Editor,
The Eastern Cooperative Oncology Group (ECOG) performance status (PS) scale is commonly used in oncology. It influences cancer treatment decisions and clinical trial recruitment. However, there can be significant inter-rater variability in ECOG-PS scoring, owing to the subjectivity of human scoring and innate cognitive biases.1,2 We propose that the Generative Pre-trained Transformer (GPT), a foundational large language model (LLM), can accurately and reliably score ECOG-PS.
We used 16 fictional scripts (Supplementary Table S1, Set 1) from studies by Datta et al. and Azam et al., both of which assessed the inter-rater reliability of ECOG-PS scoring by oncology health professionals.2,3 We used the OpenAI Application Programming Interface (API) to query the GPT-3.5-turbo model with the scripts. A standardised prompt ("I would like you to assume the role of an oncologist. You will review the patient case history that I give you and score the ECOG score") was used, with no further instructions provided. We queried GPT-3.5-turbo with the 16 scenarios sequentially; each scenario was reassessed over 50 iterations, and all outputs were captured and documented precisely. For the 12 scenarios from Datta et al., the most appropriate ECOG score for each vignette had been decided in the original study through consensus rating among the 3 oncologists who designed the vignettes, with any discrepancies resolved through feedback and discussion. For the 4 scenarios from Azam et al., the original study did not provide a gold-standard ECOG score, so the most appropriate score was determined by an author of the current study (RSYCT), a consultant oncologist. We performed an initial qualitative appraisal of the interpretability and validity of the responses, then quantitatively assessed their accuracy (percentage of correct ECOG-PS scores) and consistency (Fleiss' kappa4). GPT's scores were compared with human scores using the Mann-Whitney U test, with the cut-off for statistical significance set at P<0.001 after Bonferroni correction.
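For illustration, the querying and analysis workflow can be sketched in Python as follows. This is a minimal sketch rather than the authors' exact code; the vignette texts, human ratings and gold-standard score are hypothetical placeholders, and the openai, scipy and statsmodels packages are assumed.

```python
# Minimal sketch of the querying and analysis workflow (not the authors' exact code).
# Vignettes, human ratings and the gold-standard score below are hypothetical placeholders.
import re
from openai import OpenAI
from scipy.stats import mannwhitneyu
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = ("I would like you to assume the role of an oncologist. "
          "You will review the patient case history that I give you "
          "and score the ECOG score")

def score_vignette(vignette: str, n_iterations: int = 50) -> list[int]:
    """Query gpt-3.5-turbo repeatedly and extract the first ECOG score (0-5) in each reply."""
    scores = []
    for _ in range(n_iterations):
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "system", "content": PROMPT},
                      {"role": "user", "content": vignette}],
        ).choices[0].message.content
        match = re.search(r"ECOG[^0-5]*([0-5])", reply)
        if match:
            scores.append(int(match.group(1)))
    return scores

# Accuracy and GPT-vs-human comparison for a single (placeholder) vignette
gpt_scores = score_vignette("Fictional patient vignette text ...")
human_scores = [1, 2, 1, 1, 2]             # hypothetical human ratings of the same vignette
gold_standard = 1                          # consensus / expert ECOG score
accuracy = sum(s == gold_standard for s in gpt_scores) / len(gpt_scores)
u_stat, p_value = mannwhitneyu(gpt_scores, human_scores)

# Consistency across vignettes: rows = vignettes, columns = iterations
# (assumes an equal number of valid scores was extracted for each vignette)
all_scores = [score_vignette(v) for v in ["vignette 1 ...", "vignette 2 ..."]]
counts, _ = aggregate_raters(all_scores)   # per-vignette counts of each ECOG category
kappa = fleiss_kappa(counts)
```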
Since GPT's pretraining data comprised text corpora from before 2021, there is a possibility that the original vignettes were included in the pretraining data. We therefore also assessed GPT's responses to modified versions of the vignettes. We created 2 new sets of vignettes (Supplementary Table S1, Sets 2–3): 1 set paraphrasing the scripts and another modifying their numerical values. We ensured that the textual and numerical changes did not alter the overall ECOG-PS for either set.
Qualitatively, GPT yielded clear, comprehensible answers in response to all scripts. A single valid ECOG-PS was assigned for all the scenarios. Although not prompted to do so, GPT also justified the ECOG-PS assigned in all cases. All script responses were reviewed qualitatively and summarised (Supplementary Table S2).
Notably, GPT demonstrated the ability to distinguish between a patient's functional ability for activities of daily living and their ability to perform work activities, which is important in differentiating ECOG-PS scores of 1 and 2. Furthermore, some responses displayed a degree of inference about how a patient's symptoms might impact function, as shown in Supplementary Fig. S1, in which ascites and pedal oedema were inferred to contribute to limited mobility. Hallucination was observed in a few cases, including a few instances of misinterpretation of the prompt. For instance, in 1 case the patient's work activity of running a store was not considered, and she was assigned ECOG-PS 3 (Supplementary Fig. S2).
GPT scored ECOG-PS correctly more often than human raters (Table 1). The difference in performance was statistically significant in 7 scripts (44%), and 10 scripts (63%) had ≥90% correct scores. GPT had excellent consistency, with a Fleiss' kappa of 0.785 on Datta et al.'s scripts and 0.739 on Azam et al.'s. Conversely, human raters were less consistent, with a kappa of 0.167 on Datta et al.'s scripts (Azam et al. did not report a kappa value). There was no substantial difference between the GPT responses on the original scripts (Set 1) and the modified scripts (Sets 2–3) for most cases (Supplementary Tables S3–S4). Where performance did differ, there were more correct responses on the variant scripts; only patient 5 in Azam et al. showed a decline in performance on the variant scripts. The consistency of GPT on the variant scripts remained excellent (Fleiss' kappa: Set 2 Datta et al., 0.776; Set 3 Datta et al., 0.771; Set 2 Azam et al., 0.843; Set 3 Azam et al., 0.665).
To the authors' knowledge, this is the first study to examine whether GPT can be used to determine ECOG-PS and to compare its performance with that of human raters. Our results show that GPT performed ECOG-PS scoring more accurately and consistently than human raters on fictional patient scripts. GPT performed comparably to human raters on Set 1 in cases 2, 4, 7, 8, 9, 10 and 12, and had significantly more correct responses in cases 1, 3, 5, 6 and 11 than human raters with considerable experience in oncology (mean 4.4 ± 4.2 years, with >80% having at least 1 year of experience, in Datta et al.).
Apart from assigning ECOG-PS with greater accuracy, our results demonstrate that GPT can consistently determine ECOG-PS with high inter-rater reliability across 3 sets of fictional scripts (kappa >0.7). They also suggest a good baseline level of performance by GPT that could be improved with task-specific fine-tuning.
Important limitations of this study include the fact that GPT is a general-purpose language model that was not specifically trained for ECOG-PS scoring. At the time of writing, newer models such as GPT-4-turbo had not been released and were therefore not assessed in this study. The quality of information on ECOG in GPT's underlying pretraining data is not public and cannot be independently verified; however, the relatively good performance in this study suggests that GPT handles this area of medicine well. Another limitation is that ECOG scores are subjective assessments dependent on the experience and opinion of the rater. While we attempted to assign the most appropriate ECOG score through consensus scoring and the expert opinion of an experienced oncologist, these remain subjective measures of PS. Only zero-shot prompting was tested in this study; if necessary, LLM effectiveness may be enhanced through various prompting techniques and fine-tuning. Prompting techniques that could be explored in future work range from in-context learning with examples5 to more intricate methods such as chain-of-thought prompting,6 and MedPrompt, a technique that achieved performance on medical school examination questions comparable to LLMs pretrained on medical text.7 We also recognise that our study was performed on a small set of fictional patient scenarios. Future work should examine scoring on actual patient clinical notes with concurrent human evaluation (including clinicians across the spectrum of experience and practice settings) to better assess the capabilities of LLMs. With further refinements to LLM algorithms, training data and interpretability, LLM-determined clinical scoring may become even more consistent and could serve as a useful adjunct to counterbalance human bias in such scoring in the future.
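As a purely illustrative sketch of such future work (not part of this study), a few-shot, chain-of-thought prompt for the same task might be assembled as follows; the example vignette and reasoning text are hypothetical placeholders.

```python
# Hypothetical few-shot, chain-of-thought prompt (illustration only; this study used
# zero-shot prompting). The example vignette and reasoning are placeholder text.
FEW_SHOT_MESSAGES = [
    {"role": "system",
     "content": ("I would like you to assume the role of an oncologist. "
                 "You will review the patient case history that I give you and "
                 "score the ECOG score. Explain your reasoning step by step "
                 "before giving the final score.")},
    # A worked example provided as in-context learning
    {"role": "user",
     "content": ("Example vignette: a retired patient who is ambulatory and "
                 "self-caring but too fatigued for strenuous activity on most days.")},
    {"role": "assistant",
     "content": ("The patient is fully ambulatory and capable of self-care but is "
                 "restricted in physically strenuous activity, consistent with "
                 "ECOG-PS 1. Final ECOG score: 1.")},
]

def build_messages(vignette: str) -> list[dict]:
    """Append a new vignette to the few-shot examples before querying the model."""
    return FEW_SHOT_MESSAGES + [{"role": "user", "content": vignette}]
```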
Fig. S1. Example prompt and response from GPT
Fig. S2. A response with erroneous score by GPT
Table S1. Set 1 (original vignettes by Datta et al. and Azam et al.), Set 2 (vignettes with variation in language) and Set 3 (vignettes with variation in numeric values).
Table S2. Qualitative summary of GPT’s responses (compiled from Sets 1, 2 and 3).
Table S3. Comparison of GPT responses for Sets 1, 2 and 3 using the patient vignettes provided in Datta et al.
Table S4. Comparison of GPT responses for Sets 1, 2 and 3 using the patient vignettes provided in Azam et al.
This article was first published online on 13 September 2024 at annals.edu.sg.
REFERENCES
- Chow R, Bruera E, Temel JS, et al. Inter-rater reliability in performance status assessment among healthcare professionals: an updated systematic review and meta-analysis. Support Care Cancer 2020;28:2071-8.
- Datta SS, Ghosal N, Daruvala R, et al. How do clinicians rate patient’s performance status using the ECOG performance scale? A mixed-methods exploration of variability in decision-making in oncology. Ecancermedicalscience 2019;13:913.
- Azam F, Latif MF, Farooq A, et al. Performance Status Assessment by Using ECOG (Eastern Cooperative Oncology Group) Score for Cancer Patients by Oncology Healthcare Professionals. Case Rep Oncol 2019;12:728-36.
- Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics 1977;33:159-74.
- Brown TB, Mann B, Ryder N, et al. Language models are few-shot learners. Adv Neural Inf Process Syst 2020;33:1877-901.
- Wei J, Wang X, Schuurmans D, et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Adv Neural Inf Process Syst 2022;35:24824-37.
- Nori H, Lee YT, Zhang S, et al. Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine. https://arxiv.org/pdf/2311.16452. Accessed 10 March 2023.
The author(s) declare there are no affiliations with or involvement in any organisation or entity with any financial interest in the subject matter or materials discussed in this manuscript.
Dr Daniel Yan Zheng Lim, Department of Gastroenterology and Hepatology, Singapore General Hospital, Outram Road, Singapore 169608. Email: [email protected]