MedGPTEval: A Dataset and Benchmark to Evaluate the Responses of Large Language Models in Medicine
Jie Xu1, Lu Lu1, Sen Yang1, Bilin Liang1, Xinwei Peng1, Jiali Pang1, Jinru Ding1, Xiaoming Shi1, Lingrui Yang2, Huan Song3,4, Kang Li3,4, Xin Sun2, Shaoting Zhang1*
1Shanghai Artificial Intelligence Laboratory, Shanghai, China
2Clinical Research and Innovation Unit (CRIU), Xinhua Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China
3West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
4Med-X Center for Informatics, Sichuan University, Chengdu, China
Abstract
Background: Large language models (LLMs) have achieved great progress in natural language processing tasks and demonstrated the potential for use in clinical applications. Despite their capabilities, LLMs in the medical domain are prone to generating hallucinations (not fully reliable responses). Hallucinations in LLMs' responses create significant safety risks, potentially threatening patients' physical safety. Thus, to perceive and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and to establish a systematic evaluation framework.
Objective: We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical datasets in Chinese, and publicly available benchmarks.
Methods: First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, the candidate criteria were optimized using a Delphi method by 5 experts in medicine and engineering. Third, 3 clinical experts designed a set of medical datasets to interact with LLMs. Finally, benchmarking experiments were conducted on the datasets. The responses generated by chatbots based on LLMs were recorded for blind evaluations by 5 licensed medical experts. The obtained evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical datasets include 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT, by OpenAI; ERNIE Bot, by Baidu, Inc.; and Doctor PuJiang (Dr. PJ), by Shanghai Artificial Intelligence Laboratory.
Results: Dr. PJ outperformed ChatGPT and ERNIE Bot in the multiple-turn medical dialogue and case report scenarios. Dr. PJ also outperformed ChatGPT in the semantic consistency rate and the complete error rate, indicating better robustness. However, Dr. PJ had slightly lower scores in medical professional capabilities than ChatGPT in the multiple-turn dialogue scenario.
Conclusions: MedGPTEval provides comprehensive criteria for evaluating LLM-based chatbots in the medical domain, open-source datasets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr. PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Therefore, such an assessment system can be easily adopted by researchers in this community to augment the open-source datasets.
1 Introduction
The development of large language models (LLMs) has revolutionized natural language processing, raising significant interest in LLMs as a solution for complex tasks such as instruction execution and elaborate question answering in various domains (Sezgin et al., 2022). Among these domains, the medical field has received significant attention because of its actual demands. Recently, progress has been achieved in medical education (Anders, 2023), patient care management (Schulman et al., 2022), medical exams (Levine et al., 2023), and other medical applications.
Despite their capabilities, LLMs are prone to generating hallucinations (not fully reliable responses) (Lee et al., 2023; Hagendorff et al., 2022). Hallucinations in LLMs' responses create significant safety risks, potentially threatening patients' physical safety and leading to serious medical malpractice. Thus, to perceive and prevent this safety risk, we must conduct an exhaustive evaluation of LLMs in the medical domain and establish a systematic evaluation framework.
However, conducting an exhaustive evaluation of LLMs is nontrivial. First, LLMs lack robustness; that is, their performance is highly sensitive to prompts. White et al. (2023) showed that a meticulously crafted and thoroughly tested prompt could greatly improve performance and produce superior results. Thus, the robustness of LLMs must be evaluated through in-depth research. Second, the evaluation criteria for LLMs are critical. Recent evaluations have been based mainly on automatic metrics (Kung et al., 2023; Balas and Ing, 2023; Johnson et al., 2023) (e.g., BLEU, ROUGE, accuracy) in popular applications such as machine translation and text summarization. Despite their high efficiency, these automatic metrics are insufficient for employing LLMs in real-world medical scenarios. Other factors, such as the logical coherence of responses, social characteristics like tone, and the ability to understand contextual information, are also essential (Hagendorff et al., 2022; Hu et al., 2018; Liang and Li, 2021; Holmes et al., 2023; Chaves and Gerosa, 2021; Yang et al., 2023; Tan et al., 2023; West, 2023).
To conduct an exhaustive study, we developed a comprehensive assessment system, MedGPTEval, composed of criteria, medical datasets in Chinese, and publicly available benchmarks. First, 5 interdisciplinary experts in medicine and engineering summarized existing criteria based on a comprehensive literature review on the assessment of medical applications. Second, these candidate criteria were optimized using a Delphi method. The obtained evaluation criteria cover medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. Third, 3 clinical experts designed a set of medical datasets to interact with LLMs, including 27 medical dialogues and 7 case reports in Chinese. Finally, benchmarking experiments were conducted on the datasets. The responses generated by the LLMs were recorded for blind evaluations by 5 licensed, practicing medical experts.
In the benchmarking experiments, 3 LLM-based chatbots were selected for evaluation. 1) ChatGPT, an LLM created by OpenAI, has gained global popularity owing to its exceptional language capabilities (Anders, 2023). However, ChatGPT has not been specifically trained for the medical domain (King, 2023). 2) ERNIE Bot is an LLM developed by Baidu, Inc., a Chinese computer technology company. It has been trained primarily on Chinese text and predominantly supports the Chinese language for general purposes. 3) Doctor PuJiang (Dr. PJ) is an LLM created by the medical research group of the Shanghai Artificial Intelligence Laboratory. Dr. PJ has been trained on massive Chinese medical corpora and supports various application scenarios, such as diagnosis, triage, and medical question answering. Note that ChatGPT and ERNIE Bot are general-purpose conversational artificial intelligence (AI) systems, while Dr. PJ is an LLM fine-tuned specifically for medical use. To promote research on the evaluation of medical LLMs, we conducted benchmarking experiments on the proposed medical datasets in Chinese. Experimental results show that Dr. PJ outperformed ChatGPT and ERNIE Bot in both the multiple-turn medical dialogue scenario (13.95 vs. 13.41 vs. 12.56) and the case report scenario (10.14 vs. 8.71 vs. 8.00).
The scale of the dataset remains limited. We urge researchers in this community to join this open project via email 1. MedGPTEval is open to researchers, that is, people affiliated with a research organization (in academia or industry), as well as to people whose technical and professional expertise is relevant to the social aspects of the project.
The contribution of this work is twofold:
• By conducting a thorough study of LLMs employed in the medical context and collaborating with domain experts, we established comprehensive evaluation criteria to assess the medical responses of LLMs;
• Based on the criteria, we released a set of open-source datasets for the evaluation of medical responses in Chinese and conducted benchmark experiments on 3 chatbots, including ChatGPT.
2 Methods
2.1 Evaluation Criteria
The evaluation criteria for assessing the LLMs were summarized by a thorough literature review. Then, the evaluation criteria were optimized using the Delphi method (Côrte-Real et al., 2019). The general process involved sending the criteria to designated experts in the field as well as obtaining their opinions on linguistic embellishment, ambiguity, and readability. After making generalizations and corrections, we provided anonymous feedback to each expert. This cycle of seeking opinions, refining focus, and giving feedback was repeated until a unanimous consensus was reached. A team of 5 interdisciplinary experts in medicine and engineering collaborated to determine the final evaluation aspects, specific details, and scoring standards. All members of the team held doctoral degrees in their specialties, with titles of associate professor or above, including 2 clinical medicine specialists, 2 computer specialists, and 1 medical management specialist.
2.2 Medical Datasets in Chinese
To apply the evaluation criteria, 3 licensed medical experts with over 10 years of extensive clinical experience worked together to create a set of medical datasets in Chinese, comprising a multiple-turn dialogue dataset and a case report dataset. Each case report involves a single round of questioning and contains an elaborate medical record of the patient, including age, gender, medical history (personal and familial), symptoms, medication history, and other relevant information; in addition, the medical problem being consulted about had to be clearly described. In contrast, each multiple-turn dialogue was derived through an iterative process comprising 4 rounds: the initial round started with the patient's symptoms, followed by supplementary descriptions of medication, examination, or other symptom-related queries. The multiple-turn dialogue dataset required careful consideration to assess contextual relevance.
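To make the two formats concrete, the sketch below shows one hypothetical way a case report entry and a multiple-turn dialogue entry could be represented; the field names and placeholder values are illustrative assumptions, not the released MedGPTEval schema.

```python
# Illustrative record layouts for the two dataset types described above.
# These are assumptions for exposition only, not the released dataset schema.

case_report_entry = {
    "type": "case_report",                # single round of questioning
    "age": 0,                             # placeholder values
    "gender": "...",
    "personal_history": "...",
    "family_history": "...",
    "symptoms": "...",
    "medication_history": "...",
    "question": "...",                    # the clearly described medical problem
}

multiple_turn_entry = {
    "type": "multiple_turn_dialogue",     # 4 rounds in total
    "rounds": [
        "...",  # round 1: the patient's symptoms
        "...",  # rounds 2-4: supplementary descriptions of medication,
        "...",  #             examination, or other symptom-related queries
        "...",
    ],
}
```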
2.3 Benchmark
The LLMs' responses were recorded by an impartial programmer to ensure an unbiased evaluation. During the evaluation process, the identities of the LLMs behind the responses were concealed from a separate group of 5 clinical medical experts who were licensed practitioners. Then, the performances of the 3 LLMs (ChatGPT, ERNIE Bot, and Dr. PJ) were compared based on the assessment criteria outlined above and on the proposed medical datasets. Both datasets were used to evaluate the medical and social capabilities of the LLMs, while the multiple-turn dialogue dataset was additionally used to assess their contextual abilities. The maximum scores available to the LLMs on the multiple-turn dialogue dataset and the case report dataset were 16 and 13, respectively, where a higher score indicates superior performance. Furthermore, the computational robustness of the LLMs was assessed using extended datasets derived from the multiple-turn dialogue dataset. Lastly, a subset of the case reports was randomly selected and comprehensively reviewed by 5 medical experts. The benchmark assessment methods are summarized in Table 1.
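The per-indicator values reported in Section 3 appear to be averages over cases and expert ratings, summed into aspect subtotals and a total score. The sketch below is a minimal, hedged illustration of that aggregation; the function and variable names are assumptions, not the authors' evaluation code.

```python
# Minimal sketch of the score aggregation implied by Tables 3 and 4 (an assumption,
# not the released evaluation code): average each indicator over experts and cases,
# then sum the indicator means within each assessment aspect.
from statistics import mean

def indicator_means(ratings):
    """ratings: {indicator: [[per-expert scores] for each case]} -> {indicator: mean}."""
    return {ind: mean(mean(case) for case in cases) for ind, cases in ratings.items()}

def aspect_and_total_scores(means, aspects):
    """aspects: {aspect name: [indicator names]} -> (per-aspect subtotals, total score)."""
    subtotals = {a: sum(means[i] for i in inds) for a, inds in aspects.items()}
    return subtotals, sum(subtotals.values())
```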
3 Results
3.1 Comprehensive Assessment Criteria
The draft evaluation criteria for assessing the LLMs were summarized by a thorough literature review (Hagendorff et al., 2022; Wei et al., 2023; White et al., 2023; Hu et al., 2018; Liang and Li, 2021; Holmes et al., 2023; Tan et al., 2023; West, 2023; Chaves and Gerosa, 2021) from 4 aspects: medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness. All 5 interdisciplinary experts made suggestions for fine-tuning the assessment method, and they reached a consensus using the Delphi method to make it more scientifically rigorous and easier to read (Côrte-Real et al., 2019).
3.1.1 Medical Professional Capabilities
The medical professional capabilities of the LLMs' answers were evaluated using 6 indicators (White et al., 2023; Liang and Li, 2021; West, 2023): 1) Accuracy, requiring that there are no medical errors in the answers and that the answers do not provide any harmful information to patients; accuracy can also include the evaluation of safety; 2) Informativeness, where a 3-point Likert scale was used to evaluate the informativeness of the answers (0 – incomplete, 1 – adequate, 2 – comprehensive); 3) Expansiveness, meaning that the answers contain useful information besides the medical knowledge included in the question; 4) Logic, with a 3-point Likert scale (0 – the answer is irrelevant to the topic, 1 – off-topic, the answer does not directly address the topic but is still relevant, 2 – on-topic, the answer addresses the topic directly and positively); 5) Prohibitiveness, where the LLMs correctly identify medical vocabulary or prohibited vocabulary; 6) Sensitivity, ensuring that the LLMs' answers do not contain any politically sensitive expressions. Note that if the score for either accuracy or logic is 0, the score for the overall medical professional capabilities is set to 0.
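As a minimal sketch of how the hard-zero rule and the indicator maxima above could be applied (an illustrative assumption, not the authors' implementation):

```python
# Hedged sketch of the medical-professional-capabilities subtotal described above.
# Indicator maxima follow Table 2: accuracy 1, informativeness 2, expansiveness 1,
# logic 2, prohibitiveness 1, sensitivity 1 (maximum subtotal of 8). If accuracy
# or logic is scored 0, the whole subtotal is set to 0.
def medical_professional_score(accuracy, informativeness, expansiveness,
                               logic, prohibitiveness, sensitivity):
    if accuracy == 0 or logic == 0:
        return 0
    return (accuracy + informativeness + expansiveness
            + logic + prohibitiveness + sensitivity)
```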
3.1.2 Social Comprehensive Capabilities
Table 1: Summary of benchmark assessment.

| Datasets | Assessment aspects | Assessment approaches |
|---|---|---|
| Medical Dialogue | medical professional capabilities, social comprehensive capabilities, contextual capabilities | maximum score of 16 |
| Medical Dialogue | computational robustness | percentage |
| Case Report | medical professional capabilities, social comprehensive capabilities | maximum score of 13 |
| Case Report | computational robustness | percentage |
| Case Report | comprehensive review | comments |

We conducted an overall evaluation of the social comprehensive capabilities using 4 indicators (Hagendorff et al., 2022; Hu et al., 2018; Liang and Li, 2021; Chaves and Gerosa, 2021): 1) Comprehension, where a binary scale is used to evaluate the readability of the answers (0 – awkward-sounding: the answers are professional but not explanatory, 1 – understandable: intuitive and easy to understand); 2) Tone, which pertains to the appropriate use of mood/tone in the responses generated by the LLMs, including the use of mood particles, symbols, emotional rhythm, and emotional intensity; 3) Empathy, where the accuracy of the scenario analysis is considered, including emotional understanding and reasoning; 4) Social decorum, using a 3-point Likert scale to evaluate the social decorum (0 – rude: not matching any friendly social keywords or displaying malicious language attacks, 1 – general: matching 1 to 2 keywords, 2 – graceful: matching 3 or more keywords).
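As a minimal, hedged sketch of how the keyword-based social decorum scale above could be scored automatically (the study itself relies on expert judgment; the keyword lists are assumed to be manually curated):

```python
# Illustrative scoring of the social decorum indicator described above
# (an approximation for exposition, not the authors' procedure):
# 0 = rude (malicious language or no friendly keyword), 1 = 1-2 matches, 2 = 3+ matches.
def social_decorum_score(answer, friendly_keywords, malicious_keywords):
    if any(k in answer for k in malicious_keywords):
        return 0
    matches = sum(k in answer for k in friendly_keywords)
    if matches >= 3:
        return 2
    return 1 if matches >= 1 else 0
```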
3.1.3 Contextual Capabilities
Three indicators were used to assess the contextual capabilities (Wei et al., 2023; Holmes et al., 2023), and they were applied only to the multiple-turn dialogue dataset: 1) Repeated answer, meaning that no duplicate answers should appear in the responses generated by the LLMs; 2) Anaphora matching, which involves correctly identifying and matching the abbreviations or aliases of medical professional terms used in the dialogue; 3) Key information, where the LLMs can recognize and include all relevant information from the question in the response, particularly information that has been repeated 2 or more times across the questions. The content performance criteria used for scoring are outlined in Table 2.
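As a minimal sketch of how the "key information" indicator could be approximated automatically (an illustrative assumption; in the study this judgment is made by the expert reviewers, and the list of candidate terms is assumed to be curated by hand):

```python
# Hedged sketch of the "key information" check described above: terms that occur
# in 2 or more question turns should also appear in the model's final answer.
from collections import Counter

def key_information_covered(question_turns, answer, candidate_terms):
    """candidate_terms: assumed, manually curated medical terms to track."""
    counts = Counter(t for t in candidate_terms
                     for turn in question_turns if t in turn)
    repeated = [t for t, c in counts.items() if c >= 2]
    return all(t in answer for t in repeated)
```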
3.1.4 Computational Robustness
To evaluate the robustness of the LLMs, 5 extended datasets were created based on the first-round questions in the multiple-turn dialogue dataset described above. Specifically, the following strategies were employed to rephrase each original question and create 10 rephrased questions: 1) rephrasing the question sentence while maintaining the semantics (Dataset-A); 2) rephrasing the question sentence and changing the semantics (Dataset-B); 3) rephrasing the question sentence by introducing punctuation errors (Dataset-C); 4) rephrasing the question sentence by introducing grammatical errors (Dataset-D); 5) rephrasing the question sentence by introducing spelling errors (Dataset-E). Datasets A–E were used to evaluate the robustness of the LLMs in different common scenarios, which can be classified into 3 anomaly categories. Specifically, Dataset-A was used for the adversarial success rate (ASR); Dataset-B, for the noise success rate (NSR); and Datasets C–E, for the input error success rate (IESR).
For each dataset, the original and rephrased questions were inputted into the LLMs, and 3 metrics were calculated from the LLMs' answers as follows (Tan et al., 2023; West, 2023): 1) the semantic consistency rate $(R_{1})$, the proportion of answers that maintain the same semantics when a rephrased question is inputted; 2) the semantically inconsistent but medically sound rate $(R_{2})$, the proportion of answers whose semantics have changed but remain medically sound when a rephrased question is inputted; 3) the complete error rate $(R_{3})$, the proportion of answers whose semantics have changed and that contain a medical error when a rephrased question is inputted.
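The sketch below is a minimal, hedged illustration of computing $R_{1}$, $R_{2}$, and $R_{3}$ for one extended dataset, assuming each rephrased question's answer has already been given one of three expert labels; the label names and the function are illustrative assumptions, not the authors' tooling.

```python
# Hedged sketch of the robustness rates defined above, computed per extended dataset
# (Dataset-A for ASR, Dataset-B for NSR, Datasets C-E for IESR). Each answer to a
# rephrased question is assumed to carry an expert label: "consistent" (same
# semantics), "inconsistent_sound" (changed but medically sound), or "error"
# (changed with a medical error).
def robustness_rates(labels):
    n = len(labels)
    r1 = sum(l == "consistent" for l in labels) / n          # semantic consistency rate
    r2 = sum(l == "inconsistent_sound" for l in labels) / n  # inconsistent but sound
    r3 = sum(l == "error" for l in labels) / n                # complete error rate
    return r1, r2, r3
```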
3.2 Medical Datasets in Chinese
Two medical datasets in Chinese were created: medical multiple-turn dialogues and case reports. The datasets include a total of 34 cases, with 27 multiple-turn dialogues and 7 case reports. The datasets cover medical scenarios, questions, suspected diagnoses given by the LLMs, disease types, and classifications of the medical questions. The medical questions were sorted into 6 categories: clinical manifestations, treatment, ancillary tests, lifestyle habits, etiology, and prognosis. Most questions focused on patients' self-reported symptoms and their respective treatments. The datasets contain 14 types of diseases: systemic diseases, digestive system diseases, brain diseases, heart diseases, bone diseases, chest diseases, vascular diseases, eye diseases, uterine diseases, urinary system diseases, nasopharyngeal diseases, oral diseases, skin diseases, and accidental injuries. Some specific common diseases featured in the datasets are metabolic diseases like diabetes mellitus, gastrointestinal diseases such as gastritis and hyperacidity, and critical diseases like Parkinson's disease and heart failure.
3.3 Benchmarks Based on ChatGPT, ERNIE Bot, and Dr. PJ
3.3.1 Analysis of Results in 2 Medical Scenarios
As shown in Table 3, 3 assessment aspects were covered in the multiple-turn dialogue evaluation: medical professional capabilities, social comprehensive capabilities, and contextual capabilities. Table 3 shows the total scores of each assessment and the scores of the specific indicators. Dr. PJ outperformed ChatGPT and ERNIE Bot, with total scores of 13.95, 13.41, and 12.56, respectively. ChatGPT achieved a slightly higher score of 6.30 in medical professional capabilities, compared with 6.25 for Dr. PJ and 5.63 for ERNIE Bot. Although ChatGPT performed better in the assessment of medical professional capabilities, Dr. PJ had a higher score for accuracy, meaning that its answers were harmless and that it performed better in the evaluation of safety. As for social comprehensive capabilities, ChatGPT, ERNIE Bot, and Dr. PJ achieved scores of 4.26, 4.33, and 4.70, respectively. For contextual capabilities, Dr. PJ achieved a score of 3.00, while ChatGPT and ERNIE Bot achieved scores of 2.85 and 2.59, respectively.
As shown in Table 4, 2 assessment aspects were covered in the case report evaluation: medical professional capabilities and social comprehensive capabilities. Dr. PJ outperformed ChatGPT and ERNIE Bot, with total scores of 10.14, 8.71, and 8.00, respectively. As for medical professional capabilities, Dr. PJ achieved 6.86, higher than ChatGPT (6.43) and ERNIE Bot (5.71). Similarly, Dr. PJ had the highest score (1.00) for accuracy in the evaluation of medical professional capabilities. In addition, Dr. PJ had the same scores as ChatGPT for informativeness and expansiveness. As for social comprehensive capabilities, the scores for Dr. PJ, ChatGPT, and ERNIE Bot were 3.29, 2.29, and 2.29, respectively. Specific scores for each indicator can be found in Table 4.
3.3.2 Comprehensive Review of Detailed Case Reports
The comments of 5 medical experts on 2 case reports are shown in Figure 1. Overall, all 3 LLMs performed well in correctly understanding patients' questions: they could comprehend the questions asked by patients and respond with logical answers. However, Dr. PJ outperformed the others in terms of sociality. Additionally, Dr. PJ answered the questions in an orderly manner, with clear and intuitive serial numbers listed.
3.3.3 Computational Robustness Performance
The results in Table 5 show that Dr. PJ outperformed ChatGPT and ERNIE Bot in the semantic consistency rate, with higher ASR, NSR, and IESR. This indicates that Dr. PJ was the best at maintaining the same semantics in its answers when the questions were paraphrased. Furthermore, in the complete error rate category, both Dr. PJ and ERNIE Bot had lower error rates than ChatGPT, indicating that when a paraphrased question did change the semantics of the answer, they were less likely to introduce a medical error. Dr. PJ thus also had a low probability of medical errors.
4 Discussion
In this study, we introduced a set of comprehensive evaluation criteria for assessing LLMs’ performances in medical contexts, considering aspects such as medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness. We compared ChatGPT and ERNIE Bot with Dr. PJ in 2 medical scenarios: multi-turn dialogues and case reports. Experimental results show that Dr. PJ outperformed ChatGPT and ERNIE Bot in handling various forms of the same question in these 2 scenarios.
Recently, LLMs have achieved rapid advancements and demonstrated technical potential. However, only a few question-answering evaluation methods have been developed, mostly for nonmedical fields or focused on accuracy alone. Liu et al. (2023b) presented a research summary for ChatGPT/GPT-4 suggesting that there are several evaluation aspects to consider, such as engineering performance, scenario, user feedback, and negative impacts. Similarly, West et al. evaluated the accuracy of ChatGPT-3.5 and ChatGPT-4 in answering conceptual physics questions by assessing correctness, confidence, error type, and stability (West, 2023). Further, Tan et al. compared responses from 6 English and 2 multilingual datasets, totaling 190,000 cases, and discovered that ChatGPT outperformed similar models in most results but struggled with questions requiring numerical or time-based answers. However, that team's evaluation metrics, such as the minimal functionality test (MFT), invariance test (INV), and directional expectation test (DIR) (Tan et al., 2023), are primarily focused on model performance and stability.

Unlike general question-answering domains, medical datasets require a more comprehensive evaluation approach. It is essential not only to focus on the LLMs' performance but also to consider, from a medical professional's perspective, the physical and psychological state of the questioner, as well as of potential patients seeking medical assistance. As a result, we propose content evaluation criteria covering both medical and social capabilities. Likewise, in a recent publication comparing physicians' and LLMs' responses to patient questions, the researchers assessed the quality of information and the empathy of the responses on a 5-point scale (Ayers et al., 2023). Moreover, a recent study on radiation oncology physics showed that GPT-4 performed better in answering highly specialized radiation oncology physics questions after labeling; however, there were also results where human expertise won out, suggesting the importance of diverse expertise and contextual inference capabilities (Holmes et al., 2023). Accordingly, contextual capabilities are incorporated as a crucial component to evaluate the LLMs' contextual inference professionally and objectively.

We believe that the comprehensiveness of Chinese datasets is equally important. For example, our newly proposed medical datasets in Chinese include common and critical diseases from 14 different clinical departments. Furthermore, our open-source datasets can facilitate a fairer evaluation process and expedite the global assessment and advancement of LLMs applied to medical data in Chinese.
Table 2: Summary of evaluation aspects, indicators, criteria, and datasets.

| Evaluation aspects | Indicators | Datasets | Evaluation criteria | Score |
|---|---|---|---|---|
| Medical professional capabilities | Accuracy * | Both | No medical knowledge errors are present in the answer | 1 |
| | Informativeness | Both | Comprehensive: answers include additional information beyond the expectations | 2 |
| | Expansiveness | Both | Answers include content from aspects other than the medical knowledge included in the question | 1 |
| | Logic * | Both | On-topic: the answers address the topic directly and positively | 2 |
| | Prohibitiveness | Both | The model can correctly identify medical or prohibited terms | 1 |
| | Sensitivity | Both | There is no political sensitivity expressed in the answers of the LLMs | 1 |
| Social comprehensive capabilities | Comprehension | Both | Understandable: the answers are intuitive and easy to understand | 1 |
| | Tone | Both | The answers use correct modal particles and symbols | 1 |
| | Empathy | Both | The answers can accurately empathize with the patient | 1 |
| | Social decorum | Both | Appropriately: matching 3 or more keywords | 2 |
| Contextual capabilities | Repeated answer | Multiple-turn | The model has no duplicate answers | 1 |
| | Anaphora matching | Multiple-turn | The model can identify medical professional abbreviations and aliases | 1 |
| | Key information | Multiple-turn | The model can identify key information that appears 2 or more times | 1 |
* Highest priority. If the score of either of these items is 0, no further evaluation of the medical professional capabilities is conducted.
Table 3: The content performances of chatbots in medical scenarios on multiple-turn dialogues.

| Evaluation indicators | ChatGPT | ERNIE Bot | Dr. PJ |
|---|---|---|---|
| Total score | 13.41 | 12.56 | 13.95 |
| Medical professional capabilities | 6.30 | 5.63 | 6.25 |
| Accuracy | 0.91 | 0.79 | 0.94 |
| Informativeness | 1.40 | 1.22 | 1.31 |
| Expansiveness | 0.19 | 0.12 | 0.17 |
| Logic | 1.81 | 1.50 | 1.84 |
| Prohibitiveness | 1.00 | 1.00 | 1.00 |
| Sensitivity | 1.00 | 1.00 | 1.00 |
| Social comprehensive capabilities | 4.26 | 4.33 | 4.70 |
| Comprehension | 0.96 | 0.96 | 0.96 |
| Tone | 0.96 | 1.00 | 1.00 |
| Empathy | 0.70 | 0.70 | 0.85 |
| Social decorum | 1.63 | 1.67 | 1.89 |
| Contextual capabilities | 2.85 | 2.59 | 3.00 |
| Repeated answer | 0.96 | 0.81 | 1.00 |
| Anaphora matching | 0.96 | 0.85 | 1.00 |
| Key information | 0.93 | 0.93 | 1.00 |
Table 4: The content performances of chatbots in medical scenarios with the case report.

| Evaluation indicators | ChatGPT | ERNIE Bot | Dr. PJ |
|---|---|---|---|
| Total score | 8.71 | 8.00 | 10.14 |
| Medical professional capabilities | 6.43 | 5.71 | 6.86 |
| Accuracy | 0.86 | 0.71 | 1.00 |
| Informativeness | 1.43 | 1.14 | 1.43 |
| Expansiveness | 0.43 | 0.43 | 0.43 |
| Logic | 1.71 | 1.43 | 2.00 |
| Prohibitiveness | 1.00 | 1.00 | 1.00 |
| Sensitivity | 1.00 | 1.00 | 1.00 |
| Social comprehensive capabilities | 2.29 | 2.29 | 3.29 |
| Comprehension | 1.00 | 1.00 | 1.00 |
| Tone | 0.29 | 0.14 | 0.71 |
| Empathy | 0.00 | 0.14 | 0.29 |
| Social decorum | 1.00 | 1.00 | 1.29 |