[论文翻译]MEDITRON-70B: 大语言模型的医疗预训练扩展


原文地址:https://arxiv.org/pdf/2311.16079


MEDITRON-70B: Scaling Medical Pretraining for Large Language Models

MEDITRON-70B: 大语言模型的医疗预训练扩展

1EPFL 2Idiap Research Institute 3Open Assistant 4Yale {zeming.chen, antoine.bosselut}@epfl.ch

1EPFL 2Idiap研究所 3Open Assistant 4耶鲁大学 {zeming.chen, antoine.bosselut}@epfl.ch

Abstract

摘要

Large language models (LLMs) can potentially democratize access to medical knowledge. While many efforts have been made to harness and improve LLMs’ medical knowledge and reasoning capacities, the resulting models are either closed-source (e.g., PaLM, GPT-4) or limited in scale ($\leq$13B parameters), which restricts their abilities. In this work, we improve access to large-scale medical LLMs by releasing MEDITRON: a suite of open-source LLMs with 7B and 70B parameters adapted to the medical domain. MEDITRON builds on Llama-2 (through our adaptation of Nvidia’s Megatron-LM distributed trainer), and extends pretraining on a comprehensively curated medical corpus, including selected PubMed articles, abstracts, and internationally-recognized medical guidelines. Evaluations using four major medical benchmarks show significant performance gains over several state-of-the-art baselines before and after task-specific finetuning. Overall, MEDITRON achieves a $6%$ absolute performance gain over the best public baseline in its parameter class and $3%$ over the strongest baseline we finetuned from Llama-2. Compared to closed-source LLMs, MEDITRON-70B outperforms GPT-3.5 and Med-PaLM and is within $5%$ of GPT-4 and $10%$ of Med-PaLM-2. We release our code for curating the medical pretraining corpus and the MEDITRON model weights to drive open-source development of more capable medical LLMs.

大语言模型(LLMs)有望实现医学知识的民主化获取。尽管已有诸多努力试图利用和提升大语言模型的医学知识与推理能力,但现有成果要么是闭源的(如PaLM、GPT-4),要么规模有限(≤130亿参数),制约了其能力表现。本研究通过发布MEDITRON——一套适配医疗领域的70亿和700亿参数开源大语言模型套件,推动大规模医学大语言模型的普及应用。MEDITRON基于Llama-2(通过适配Nvidia的Megatron-LM分布式训练器),并扩展了经过系统整理的医学语料库预训练,包括精选的PubMed论文、摘要及国际公认医疗指南。在四大医学基准测试中,该模型在任务特定微调前后均展现出显著优于现有技术基线的性能表现。总体而言,MEDITRON在其参数级别中比最佳公开基线实现6%的绝对性能提升,比我们基于Llama-2微调的最强基线高出3%。与闭源大语言模型相比,MEDITRON-70B性能超越GPT-3.5和Med-PaLM,与GPT-4差距在5%以内,与Med-PaLM-2差距在10%以内。我们公开了医学预训练语料库的整理代码及MEDITRON模型权重,以推动更具能力的医学大语言模型的开源发展。


Figure 1: MEDITRON-70B’s performance on MedQA. MEDITRON-70B achieves an accuracy of $70.2%$ on USMLE-style questions in the MedQA (4 options) dataset.

图 1: MEDITRON-70B在MedQA上的表现
MEDITRON-70B在MedQA(4选项)数据集的USMLE风格问题上达到了$70.2%$的准确率。

Safety Advisory

安全公告

While MEDITRON is designed to encode medical knowledge from sources of high-quality evidence, it is not yet adapted to deliver this knowledge appropriately, safely, or within professional actionable constraints. We recommend against deploying MEDITRON in medical applications without extensive use-case alignment, as well as additional testing, specifically including randomized controlled trials in real-world practice settings.

虽然MEDITRON旨在从高质量证据来源中编码医学知识,但它尚未适应在专业可操作范围内恰当、安全地传递这些知识。我们建议不要在未经广泛用例对齐及额外测试(特别是包括真实世界实践环境中的随机对照试验)的情况下将MEDITRON部署于医疗应用。

1 Introduction

1 引言

Medicine is deeply rooted in knowledge, and recalling evidence is key to guiding standards in clinical decision-making. However, while ‘Evidence-based medicine’ (EBM) is now synonymous with quality care, it requires expertise that is not universally available. Thus, ensuring equitable access to standardized medical knowledge is an ongoing priority across all domains of medicine. Recent advances in large language models (LLMs) (Brown et al., 2020; Touvron et al., 2023a; Almazrouei et al., 2023; Touvron et al., 2023b; OpenAI, 2023b; Chowdhery et al., 2022) have the potential to revolutionize access to medical evidence. Today, the largest LLMs have tens or hundreds of billions of parameters (Bommasani et al., 2021; Hoffmann et al., 2022; Kaplan et al., 2020) and are trained on enormous pretraining corpora (Raffel et al., 2019; Gao et al., 2020; Together AI, 2023; Soldaini et al., 2023). This unprecedented scale has enabled emergent properties in LLMs that are core traits of human decision-making: step-by-step chain-of-thought reasoning, coherent communication, and contextual interpretation (Bubeck et al., 2023; Wei et al., 2023; Wang et al., 2023).

医学深深植根于知识,而循证是指导临床决策标准的关键。尽管"循证医学 (EBM)"已成为优质医疗的代名词,但其所需的专业知识并非普遍可得。因此,确保公平获取标准化医学知识始终是各医学领域的首要任务。大语言模型 (LLM) 的最新进展 (Brown et al., 2020; Touvron et al., 2023a; Almazrouei et al., 2023; Touvron et al., 2023b; OpenAI, 2023b; Chowdhery et al., 2022) 有望彻底改变医学证据的获取方式。当今最大的LLM拥有数百甚至数千亿参数 (Bommasani et al., 2021; Hoffmann et al., 2022; Kaplan et al., 2020),并基于海量预训练语料库进行训练 (Raffel et al., 2019; Gao et al., 2020; Together AI, 2023; Soldaini et al., 2023)。这种前所未有的规模使LLM具备了人类决策核心特征的涌现能力:逐步的思维链推理、连贯的沟通和情境化解读 (Bubeck et al., 2023; Wei et al., 2023; Wang et al., 2023)。

Until recently, LLMs have been developed and evaluated for generalist tasks, principally using data collected from diverse internet sources with varying levels of quality in terms of domain-specific evidence (Rozière et al., 2023). This approach, while generally very powerful, hampers task-specific performance, including in the medical domain. Several newer task-specific models, trained on more carefully curated datasets, have repeatedly outperformed generalist models (Wu et al., 2023b; Yue et al., 2023; Rozière et al., 2023; Azerbayev et al., 2023), revealing the potential of balancing quality with quantity with regard to pretraining data. A promising method for achieving this equilibrium is to use general-purpose LLMs and then continue training on more selective domain-specific data. These systems acquire a combination of both natural and domain-specific language understanding and generation skills (Gururangan et al., 2020). In the medical domain, this approach has only been reported for models below 13B parameters (Lee et al., 2020; Gu et al., 2021; Peng et al., 2023; Wu et al., 2023a). At larger scales (i.e., $\geq$70B parameters), prior studies have only explored the scope of instruction-tuning (M42-Health) or parameter-efficient finetuning (Toma et al., 2023).

直到最近,大语言模型(LLM)的开发与评估仍主要面向通用任务,其训练数据多采集自质量参差不齐的互联网多源信息(Rozière等,2023)。这种方法虽然普遍表现强劲,却制约了包括医疗领域在内的特定任务性能。多项研究表明,基于更精细筛选数据集训练的领域专用模型(Wu等,2023b;Yue等,2023;Rozière等,2023;Azerbayev等,2023)持续超越通用模型,揭示了预训练数据质量与数量平衡的重要性。实现这种平衡的有效途径是:先采用通用大语言模型,再使用精选领域数据进行持续训练。此类系统能同步掌握自然语言与专业领域的理解生成能力(Gururangan等,2020)。在医疗领域,现有研究仅报道过130亿参数以下的模型采用此方案(Lee等,2020;Gu等,2021;Peng等,2023;Wu等,2023a)。针对更大规模模型(即$\geq70\mathrm{B}$参数),先前研究仅探索过指令微调(M42-Health)或参数高效微调(Toma等,2023)的应用范畴。

In this work, we present MEDITRON-7B and 70B, a pair of generative LLMs for medical reasoning, adapted from Llama-2 (Touvron et al., 2023b) through continued pretraining on carefully curated high-quality medical data sources: PubMed Central (PMC) and PubMed open-access research papers (collected through the S2ORC corpus, Lo et al., 2020), PubMed abstracts (from non-open-access papers) in S2ORC, and a unique set of diverse medical guidelines from the internet, covering multiple countries, regions, hospitals, and international organizations. To enable training, we extend Nvidia’s Megatron-LM distributed training library to support the Llama-2 architecture.

在本工作中,我们提出了MEDITRON-7B和70B,这是一对用于医学推理的生成式大语言模型,通过对精选高质量医学数据源进行持续预训练,从Llama-2 (Touvron et al., 2023b) 适配而来。这些数据源包括:PubMed Central (PMC) 和PubMed开放获取研究论文(通过S2ORC语料库收集,Lo et al., 2020)、S2ORC中的PubMed摘要(来自非开放获取论文),以及从互联网收集的一套独特的多样化医学指南,涵盖多个国家、地区、医院和国际组织。为了支持训练,我们扩展了Nvidia的Megatron-LM分布式训练库以支持Llama-2架构。

We evaluate MEDITRON on four medical reasoning benchmarks using both in-context learning (providing examples during prompting, i.e., within the context window) and task-specific finetuning. The benchmarks comprise two medical examination question banks, MedQA (from the United States Medical Licensing Examination, Jin et al., 2020), and MedMCQA (a Multi-Subject Multi-Choice Dataset for the Medical domain, Pal et al., 2022), PubMedQA (biomedical question answering based on PubMed abstracts, Jin et al., 2019), and MMLU-Medical (a medically themed evaluation set from Massive Multitask Language understanding, Hendrycks et al., 2021a). Using in-context learning without fine-tuning, MEDITRON-7B outperforms several state-of-the-art baselines, showing a $10%$ average performance gain over PMC-Llama-7B (a similar LLM adapted from Llama, Touvron et al., 2023a, through continued pre training on PubMed Central papers), and a $5%$ average performance gain over the Llama-2-7B model. After finetuning on task-specific training data, MEDITRON’s performance also improves over other finetuned baselines at the same scale, achieving a $5%$ (7B) and a $2%$ (70B) average performance gain. Finally, finetuning MEDITRON-70B to support advanced prompting strategies such as chain-of-thought and self-consistency further improves over the best baseline by $3%$ and the best public baseline by $12%$ . Overall, MEDITRON achieves strong performance on medical reasoning benchmarks, matching or outperforming state-of-the-art baselines at the same scale.

我们在四个医学推理基准上评估了MEDITRON,采用了上下文学习(在提示时提供示例,即在上下文窗口内)和任务特定微调两种方法。这些基准包括两个医学考试题库:MedQA(来自美国医师执照考试,Jin等人,2020)和MedMCQA(医学领域多学科多选题数据集,Pal等人,2022),PubMedQA(基于PubMed摘要的生物医学问答,Jin等人,2019),以及MMLU-Medical(来自大规模多任务语言理解的医学主题评估集,Hendrycks等人,2021a)。在不进行微调的情况下使用上下文学习,MEDITRON-7B优于多个最先进的基线模型,平均性能比PMC-Llama-7B(一个通过PubMed Central论文持续预训练从Llama调整而来的类似大语言模型,Touvron等人,2023a)高出10%,比Llama-2-7B模型高出5%。在任务特定训练数据上进行微调后,MEDITRON的性能也优于同规模的其他微调基线模型,平均性能提升分别为5%(7B)和2%(70B)。最后,对MEDITRON-70B进行微调以支持思维链和自我一致性等高级提示策略,进一步将性能提升至比最佳基线高出3%,比最佳公开基线高出12%。总体而言,MEDITRON在医学推理基准上表现出色,匹配或超越了同规模的最先进基线模型。

In summary, we propose an optimized workflow to scale domain-specific pretraining for medical LLMs, incorporating knowledge-based data curation, continued pretraining via a distributed training pipeline, finetuning, few-shot in-context learning, and advanced inference methods such as chain-of-thought reasoning and self-consistency. We release the curated training corpus, the distributed training library, and the MEDITRON models (7B and 70B), with and without finetuning, to the public to ensure access for real-world evaluation and to facilitate similar efforts in other domains.

总结来说,我们提出了一种优化的工作流程,用于扩展医疗大语言模型 (LLM) 的领域特定预训练,该方法整合了基于知识的数据筛选、通过分布式训练流水线进行的持续预训练、微调、少样本上下文学习,以及先进的推理方法(如思维链推理和自我一致性)。我们向公众发布了经过筛选的训练语料库、分布式训练库2,以及经过微调和未经微调的MEDITRON模型(7B和70B)3,以确保实际评估的可及性,并促进其他领域的类似工作。

2 Medical Training Data

2 医疗训练数据

MEDITRON’s domain-adaptive pre-training corpus GAP-REPLAY combines 48.1B tokens from four datasets. Clinical Guidelines: a new dataset of 46K clinical practice guidelines from various healthcare-related sources; Paper Abstracts: openly available abstracts from 16.1M closed-access PubMed and PubMed Central papers; Medical Papers: full-text articles extracted from 5M publicly available PubMed and PubMed Central papers; and a Replay dataset: general domain data distilled to compose $1%$ of the entire corpus.

MEDITRON 的领域自适应预训练语料库 GAP-REPLAY 结合了来自四个数据集的 481 亿 Token:临床指南 (Clinical Guidelines):一个包含来自各类医疗相关来源的 4.6 万份临床实践指南的新数据集;论文摘要 (Paper Abstracts):从 1610 万篇受限访问的 PubMed 和 PubMed Central 论文中公开获取的摘要;医学论文 (Medical Papers):从 500 万篇公开可用的 PubMed 和 PubMed Central 论文中提取的全文;以及重放数据集 (Replay dataset):经过蒸馏处理的通用领域数据,占整个语料库的 1%。

2.1 Clinical Guidelines

2.1 临床指南

Clinical practice guidelines (CPGs) are rigorously researched frameworks designed to guide healthcare practitioners and patients in making evidence-based decisions regarding diagnosis, treatment, and management (Berg et al., 1997). They are compiled through a systematic process of collaborative consensus between experts to establish recommendations from the latest evidence on best practices that would maximize benefit in light of practical concerns such as available resources and context. As a super-synthesis of meta-analyses, they sit atop the ‘evidence pyramid’ and form the basis of actionable evidence-based practice (Burns et al., 2011). CPGs are produced at various geographic and organizational granularities, ranging from global to hospital-level initiatives directed by international professional medical associations to informal consortia, regional or national governmental bodies to individual NGOs and hospitals.

临床实践指南(CPG)是经过严格研究的框架,旨在指导医疗从业者和患者在诊断、治疗和管理方面做出循证决策(Berg et al., 1997)。这些指南通过专家协作共识的系统性流程编制而成,基于最新证据提出最佳实践建议,同时兼顾可用资源和实际环境等现实因素以实现效益最大化。作为元分析的高级综合产物,它们居于"证据金字塔"顶端,构成了可操作的循证实践基础(Burns et al., 2011)。临床实践指南的制定主体涵盖不同地域和组织层级,包括由国际专业医学协会主导的全球性项目或医院级倡议,非正式联盟,地区或国家政府机构,以及非政府组织和独立医院等。

Our GUIDELINES pre-training corpus comprises 46,469 guideline articles from 16 globally recognized sources for clinician and patient-directed guidance across high and low-resource settings, multiple medical domains (internal medicine, pediatrics, oncology, infectious disease, etc.), and various geographic granularities. The full list of sources used, along with the descriptive statistics of each source, can be found in Table 9. We publicly release a subset of 35,733 articles from the GUIDELINES corpus, extracted from 8 of the 16 sources that allow content redistribution, namely CCO, CDC, CMA, ICRC, NICE, SPOR, WHO and WikiDoc. For all 16 sources, we release our web scrapers and pre-processing code.

我们的GUIDELINES预训练语料库包含来自16个全球知名临床指南来源的46,469篇文章,涵盖高资源和低资源环境下的临床医生与患者指导,涉及多个医学领域(内科、儿科、肿瘤学、传染病等)及不同地理粒度。完整来源列表及各来源描述性统计见表9。我们公开释放了来自8个允许内容再分发来源(CCO、CDC、CMA、ICRC、NICE、SPOR、WHO和WikiDoc)的35,733篇文章子集,并提供全部16个来源的网络爬虫和预处理代码。

Collection and processing We employed pragmatic selection criteria, seeking CPGs that were: (1) open-access, (2) systematically formatted with homogenous textual structure (i.e., in a format in which automated processes could be deployed without excessive risk of misaligning textual sequences), (3) in the language predominantly represented by the pre-training corpus of Llama (i.e., English), and (4) covering a breadth of medical sub-domains, audiences (clinician, nurse, patient), and resource settings (high, low, and humanitarian response settings).

收集与处理
我们采用实用筛选标准,寻找符合以下条件的临床实践指南(CPG): (1) 开放获取; (2) 具有统一文本结构的系统化格式(即采用可部署自动化流程且文本序列错位风险较低的格式); (3) 语言与Llama预训练语料库主体一致(即英语); (4) 覆盖广泛的医学子领域、目标人群(临床医生、护士、患者)及资源环境(高、低资源及人道救援场景)。

After extracting the raw text from each source, we cleaned the data to exclude irrelevant or repetitive content that did not contribute to the textual content, such as URLs, references, figures, table delimiters, and ill-formatted characters. Additionally, the text was standardized to a unified format with indicated section headers, homogenous space separating paragraphs, and normalized lists. Finally, all samples were deduplicated using title matching, and articles that were too short or not in English were filtered out.

从每个来源提取原始文本后,我们对数据进行了清洗,排除了与文本内容无关或重复的内容,例如URL、参考文献、图表、表格分隔符以及格式错误的字符。此外,文本被标准化为统一格式,包括明确的章节标题、段落间统一间距以及规范化的列表。最后,通过标题匹配对所有样本进行去重,并过滤掉过短或非英文的文章。
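To make the cleaning and deduplication steps concrete, the sketch below illustrates the kind of operations described above (URL stripping, whitespace normalization, title-based deduplication, and a length filter). The function names, regexes, and the 1,000-character threshold are illustrative assumptions, not the released GUIDELINES preprocessing code.

```python
import re

def clean_guideline(text: str) -> str:
    """Illustrative cleaning pass: drop URLs and normalize spacing."""
    text = re.sub(r"https?://\S+", "", text)   # remove URLs
    text = re.sub(r"[ \t]+", " ", text)        # collapse horizontal whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)     # homogenous spacing between paragraphs
    return text.strip()

def deduplicate_by_title(articles: list[dict]) -> list[dict]:
    """Keep the first article seen for each normalized title (title-matching dedup)."""
    seen, kept = set(), []
    for article in articles:
        key = article["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            kept.append(article)
    return kept

def long_enough(text: str, min_chars: int = 1000) -> bool:
    """Length filter; the 1,000-character threshold is an assumed value.
    Language identification (English-only filtering) would be applied separately."""
    return len(text) >= min_chars
```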

| Dataset | Train samples | Validation samples | Train tokens | Validation tokens |
| --- | --- | --- | --- | --- |
| Clinical Guidelines | 41K | 2,284 (5%) | 107M | 6M (5%) |
| PubMed Abstracts | 15.7M | 487K (3%) | 5.48B | 170M (3%) |
| PubMed Papers | 4.9M | 142K (3%) | 40.7B | 1.23B (3%) |
| Experience Replay | 494K | 0 (0%) | 420M | 0 (0%) |
| Total | 21.1M | 631K | 46.7B | 1.4B |
| 数据集 | 训练样本数 | 验证样本数 | 训练Token数 | 验证Token数 |
| --- | --- | --- | --- | --- |
| Clinical Guidelines | 41K | 2,284 (5%) | 107M | 6M (5%) |
| PubMed Abstracts | 15.7M | 487K (3%) | 5.48B | 170M (3%) |
| PubMed Papers | 4.9M | 142K (3%) | 40.7B | 1.23B (3%) |
| Experience Replay | 494K | 0 (0%) | 420M | 0 (0%) |
| 总计 | 21.1M | 631K | 46.7B | 1.4B |

Table 1: GAP-Replay data mixture statistics. The size of both training and validation sets of the GAP-REPLAY pre-training mixture. For each set, we give the total number of samples and the total number of tokens belonging to each dataset. The portion of each dataset allocated to the validation set (relative to the training set) is given as a percentage.

表 1: GAP-Replay 数据混合统计。GAP-REPLAY 预训练混合中训练集和验证集的规模。对于每个集合,我们给出了样本总数以及属于每个数据集的 token 总数。每个数据集分配到验证集的比例 (相对于训练集) 以百分比形式给出。

Content The GUIDELINES corpus comprises a broad range of contexts. For instance, the geographic scope ranges from global (WHO) to national (CDC, NICE) and regional (Ontario, Melbourne) to institutional (ICRC, Mayo Clinic). The corpus also represents health care concerns from high- (Ontario, Melbourne), low- (WHO), and volatile- (ICRC) resource settings. GUIDELINES also contains a range of technical and conversational vocabulary with target audiences of clinicians or patients (or both), and is sometimes highly specialized within a theme (cancer, pediatrics, infectious disease). The peer review processes also ranged from UN bodies (WHO), institutional review boards (ICRC), and professional associations (AAFP) to publicly crowdsourced knowledge bases (WikiDoc).

内容
GUIDELINES语料库涵盖广泛的情境。例如,地理范围从全球(WHO)到国家(CDC、NICE)、区域(安大略、墨尔本)直至机构(ICRC、梅奥诊所)。该语料库还体现了高资源(安大略、墨尔本)、低资源(WHO)和资源波动(ICRC)环境下的医疗关切。GUIDELINES包含面向临床医生或患者(或两者)的技术性与对话性词汇组合,有时会高度聚焦特定主题(癌症、儿科、传染病)。同行评审流程也涵盖联合国机构(WHO)、机构审查委员会(ICRC)、专业协会(AAFP)及公开众包知识库(WikiDoc)等不同体系。

2.2 PubMed Papers & Abstracts

2.2 PubMed论文与摘要

Adapting a large language model to the health domain requires vast amounts of biomedical textual data. As the largest public corpus of medical research papers, PubMed was chosen to form the backbone of MEDITRON’s pre-training mix. From the Semantic Scholar Open Research Corpus (S2ORC) (Lo et al., 2020), which aggregates papers from hundreds of academic publishers and digital archives into a unified source, we collected 4.47M full-text papers from the PubMed Central Open Access Subset (National Library of Medicine, 2003–2023). We added 444,521 open-access full-text PubMed papers that are not found in the PubMed Central archive. Finally, we collected 16,209,047 PubMed and PubMed Central abstracts for which full-text versions are unavailable in S2ORC. The knowledge cutoff for all papers and abstracts in the corpus is August 2023.

要让大语言模型适应健康领域,需要海量的生物医学文本数据。作为最大的医学研究论文公开语料库,PubMed被选为MEDITRON预训练混合数据的核心来源。我们从Semantic Scholar开放研究语料库 (S2ORC) (Lo et al., 2020)(该库整合了数百家学术出版商和数字档案库的论文)中收集了PubMed Central开放获取子集(美国国家医学图书馆,2003-2023年)的447万篇全文论文。此外补充了444,521篇未收录在PubMed Central档案中的开放获取PubMed全文论文。最后收集了16,209,047篇S2ORC中无全文版本的PubMed和PubMed Central摘要。该语料库中所有论文和摘要的知识截止时间为2023年8月。

Pre-processing PubMed For all full-text articles, we removed the metadata information and references, namely the authors, bibliography, acknowledgments, tables, and figures, and kept only the main text of each paper. Using automatic annotations from S2ORC, we identified inline citations, section headers, figures, and tables within the text using special tokens to allow for higher flexibility in downstream tasks. To promote the use of accurate citations by the model, we formatted all in-text references with a methodology similar to the Galactica model (Taylor et al., 2022). We replaced the paper citations with the special token [BIB_REF] and formatted them with the referenced paper’s title, truncated to a maximum of 12 words, and the main author’s last name. Similarly, we wrapped in-text figure and table references with the special token [FIG_REF] and formatted them with the figure number and the truncated figure caption. Finally, we wrapped all mathematical formulas using the special token [FORMULA]. We additionally removed URLs and references and normalized whitespace between paragraphs. To promote hierarchical structure learning, we indicate section headers with ’#’ for main sections and ’##’ for subsections. We also prepend the paper title to the main body. We performed the same formatting procedure described above for both abstracts and full-text articles. We deduplicated articles and abstracts based on PubMed and PubMed Central IDs and filtered out non-English content. Additional details on our pre-processing procedure are given in Appendix B.2.

预处理PubMed数据
对于所有全文文章,我们移除了元数据信息和参考文献,包括作者、文献目录、致谢、表格和图表,仅保留每篇论文的正文部分。利用S2ORC的自动标注功能,我们通过特殊token识别了文中的内联引用、章节标题、图表引用,以便在下游任务中获得更高灵活性。为促使模型准确使用引用,我们采用与Galactica模型 (Taylor et al., 2022) 类似的方法格式化所有文中引用:将论文引用替换为特殊token [BIB_REF],并附上被引论文的标题(截断至最多12个词)和第一作者姓氏。同样地,我们用[FIG_REF]标记图表引用,并添加图表编号及截断的说明文字。所有数学公式均用[FORMULA]标记包裹。我们还移除了URL和参考文献,并规范化段落间空白。

为促进层次结构学习,我们使用"#"标记主章节,"##"标记子章节,并将论文标题添加在正文起始位置。摘要和全文文章均采用上述相同格式化流程。基于PubMed和PubMed Central ID进行去重,并过滤非英文内容。预处理流程的更多细节见附录B.2。
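As a rough illustration of the formatting scheme above, the sketch below prepends the paper title, marks sections with ’#’/’##’, and renders an in-text citation with the [BIB_REF] token, a title truncated to 12 words, and the main author’s last name. The exact delimiter layout around [BIB_REF] (including the closing marker) is an assumption for illustration; the released preprocessing code defines the authoritative format.

```python
def truncate_words(text: str, max_words: int = 12) -> str:
    """Truncate a cited title to at most 12 words, as described above."""
    return " ".join(text.split()[:max_words])

def format_citation(cited_title: str, main_author_last_name: str) -> str:
    # Replace an in-text citation with the [BIB_REF] special token plus the
    # truncated title and main author's last name; the closing [/BIB_REF]
    # delimiter is an assumption about the exact layout.
    return f"[BIB_REF]{truncate_words(cited_title)}, {main_author_last_name}[/BIB_REF]"

def format_paper(title: str, sections: list[tuple[int, str, str]]) -> str:
    """Prepend the paper title; mark main sections with '#' and subsections with '##'."""
    lines = [title]
    for level, header, body in sections:
        lines.append(("#" if level == 1 else "##") + " " + header)
        lines.append(body)
    return "\n\n".join(lines)
```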

2.3 Experience Replay

2.3 经验回放

Experience replay refers to the process of including data from old, previously seen tasks when training on new tasks. Distilling replay data into the training mixture has been shown to overcome catastrophic forgetting, a phenomenon where a model incorporating out-of-distribution data forgets its previous training (Sun et al., 2020b). To promote the retention of knowledge acquired by the pretrained Llama-2 model, we included general domain data in GAP-REPLAY, making up $1%$ of the mixture. We used a randomly selected subset of 420 million tokens from the RedPajama dataset, an open-access equivalent to the original Llama-2 pretraining corpus (Together AI, 2023). This dataset contains a mixture of the Falcon refined web corpus (Penedo et al., 2023), the StarCoder dataset (Li et al., 2023), and Wikipedia, ArXiv, books, and Stack Exchange.

经验回放 (Experience replay) 是指在训练新任务时加入旧任务数据的过程。研究表明,将回放数据蒸馏到训练混合集中能有效克服灾难性遗忘 (catastrophic forgetting) (Sun et al., 2020b),即模型在整合分布外数据时遗忘先前训练内容的现象。为保持预训练 Llama-2 模型获得的知识,我们在 GAP-REPLAY 中加入了占混合集 1% 的通用领域数据。这些数据来自 RedPajama 数据集 (Together AI, 2023) 中随机选取的 4.2 亿 token,该数据集是原始 Llama-2 预训练语料的开源等效版本,包含 Falcon 精炼网络语料 (Penedo et al., 2023)、StarCoder 数据集 (Li et al., 2023) 以及 Wikipedia、ArXiv、书籍和 Stack Exchange 的内容。
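A quick sanity check of the replay share, using the training-token counts from Table 1 (a back-of-the-envelope calculation, not code from the training pipeline):

```python
# Training tokens per dataset, from Table 1.
guidelines = 107e6
abstracts = 5.48e9
papers = 40.7e9
replay = 420e6

total = guidelines + abstracts + papers + replay      # ~46.7B training tokens
print(f"replay share = {replay / total:.1%}")         # ~0.9%, i.e. roughly 1% of GAP-REPLAY
```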

3 Engineering

3 工程

Training LLMs at scale presents an important engineering challenge. The large model parameter size and pretraining token count require a framework for large-scale distributed training that can harness the power of multiple GPUs across many computation nodes. To distribute the training within a cluster, we developed the Megatron-LLM distributed training library (Cano et al., 2023), which extends Nvidia’s Megatron-LM (Shoeybi et al., 2019; Narayanan et al., 2021) to support the training of three popular open-source LLMs that have recently been released: Llama, Falcon, and Llama-2. We use it to pretrain and finetune all MEDITRON models. The library supports several forms of complementary parallelism for distributed training, including Data Parallelism (DP – different GPUs process different subsets of the batches), Pipeline Parallelism (PP – different GPUs process different layers), and Tensor Parallelism (TP – different GPUs process different subtensors for matrix multiplication). The library also includes activation recomputation to reduce memory usage at the expense of increased computation time, sequence parallelism to further exploit the coordinate-wise independence of batch norm and dropout operations (see Korthikanti et al., 2022), fused operations, and other modern primitives to help increase training throughput.

大规模训练大语言模型是一项重要的工程挑战。庞大的模型参数量与预训练token数需要一套能够利用多计算节点上多个GPU能力的大规模分布式训练框架。为了在集群内分配训练任务,我们开发了Megatron-LLM分布式训练库(Cano等人,2023),该库扩展了Nvidia的Megatron-LM(Shoeybi等人,2019;Narayanan等人,2021),以支持训练近期发布的三种流行开源大语言模型:Llama、Falcon和Llama-2。我们使用该库对所有MEDITRON模型进行预训练和微调。该库支持多种互补的分布式训练并行方式,包括数据并行(DP——不同GPU处理批次的不同子集)、流水线并行(PP——不同GPU处理不同层)、张量并行(TP——不同GPU处理矩阵乘法的不同子张量)。该库还包括激活重计算以减少内存使用(以增加计算时间为代价)、序列并行以进一步利用批归一化和dropout操作的坐标独立性(参见(Korthikanti等人,2022))、融合操作以及其他现代原语,以帮助提高训练吞吐量。

Natively, Megatron-LM’s language modeling is oriented around a GPT-like architecture. We extended its functionalities to support the Llama (Touvron et al., 2023a), Llama-2 (Touvron et al., 2023b), and Falcon (Almazrouei et al., 2023) models. We integrate necessary new architecture features such as the rotary position embedding (Chen et al., 2023), grouped-query attention (Ainslie et al., 2023), the parallel attention/MLP in the transformer layer of Falcon-40B, and the unbinding of the word embedding and the next-token prediction classifier weights used in Llama. We also added support for Flash Attention (Dao et al., 2022) and Flash Attention-2 (Dao, 2023) for more efficient inference and long-context decoding.

原生情况下,Megatron-LM的语言建模围绕类GPT架构设计。我们扩展了其功能以支持Llama (Touvron et al., 2023a)、Llama-2 (Touvron et al., 2023b)和Falcon (Almazrouei et al., 2023)模型。我们整合了必要的新架构特性,包括旋转位置编码 (Chen et al., 2023)、分组查询注意力 (Ainslie et al., 2023)、Falcon-40B的Transformer层中并行注意力/MLP设计,以及Llama采用的词嵌入与下一Token预测分类器权重解耦技术。同时新增了对Flash Attention (Dao et al., 2022)和Flash Attention-2 (Dao, 2023)的支持,以实现更高效的推理和长上下文解码能力。
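For readers unfamiliar with these components, the snippet below is a minimal, self-contained re-implementation of rotary position embeddings in plain PyTorch (the "rotate-half" formulation used by Llama-style models). It is illustrative only and is not the fused kernel used in Megatron-LLM.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Minimal rotary position embedding (rotate-half formulation).

    x has shape (seq_len, n_heads, head_dim) with an even head_dim.
    """
    seq_len, _, head_dim = x.shape
    half = head_dim // 2
    inv_freq = 1.0 / (base ** (torch.arange(half, dtype=torch.float32) / half))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)      # (seq_len, half)
    cos = angles.cos()[:, None, :]                 # broadcast over heads
    sin = angles.sin()[:, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: rotate query vectors for a toy sequence.
q = torch.randn(16, 4, 64)      # (seq_len=16, heads=4, head_dim=64)
q_rot = apply_rope(q)
```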

Hardware The MEDITRON models are trained on an in-house cluster with 16 nodes, each with 8 Nvidia A100 80GB GPUs. The nodes are equipped with 2× AMD EPYC 7543 32-Core Processors and 512 GB of RAM. The large parameter size of the models requires distributed training across many GPUs and computation nodes, making network efficiency paramount. The 16 nodes used for training are connected via RDMA over Converged Ethernet. The 8 Nvidia A100 80GB GPUs in each node are connected by NVLink and NVSwitch with a single Nvidia ConnectX-6 DX network card. We expect the relatively low inter-node bandwidth to disadvantage forms of parallelism, such as pipeline parallelism, that rely upon communicating activation values across nodes.

硬件
MEDITRON模型在内部集群上进行训练,该集群包含16个节点,每个节点配备8块Nvidia A100 80GB GPU。节点配置为$2\times\mathrm{AMD}$ EPYC 7543 32核处理器和512GB内存。模型的大参数量需要跨多GPU和计算节点进行分布式训练,这使得网络效率至关重要。训练所用的16个节点通过基于融合以太网的RDMA互联,每个节点内的8块Nvidia A100 80GB GPU通过NVLink和NVSwitch连接,并配备单块Nvidia ConnectX-6 DX网卡。由于节点间带宽相对较低,这对依赖跨节点传输激活值的并行形式(如流水线并行)较为不利。

Model Parallelism Narayanan et al. (2021) prescribe that tensor parallelism equal to the number of GPUs per node should be used, which is 8 in our cluster. We empirically found this to be correct across every parallelization configuration considered and do not analyze it further. For our largest training run using a 70 billion parameter model, we use a pipeline parallelism (PP) factor of 8. With a total of 128 GPUs in our cluster, we get a data parallelism (DP) of 2 ($=128/\mathrm{TP}/\mathrm{PP}$). We use a micro-batch size of 2 and a global-batch size of 512. Although one would prefer larger batch sizes in general for greater pipeline parallelism, we observe negative impacts from a discretization problem: raising the micro-batch size from 2 to 4 simply requires too much memory, which must be compensated by less pipeline parallelism. We note that Narayanan et al. (2021, Figure 13) also shows that on a similar-sized problem with a similar number of GPUs, with (TP, PP) $\in$ {(2, 32), (4, 16), (8, 8), (16, 4), (32, 2)}, $\mathrm{TP}=\mathrm{PP}=8$ is also observed to deliver the highest per-GPU flops. Fundamentally, we do find that 3D model parallelism is necessary for the efficient training of models of this scale in the sense that TP, PP, and DP are all greater than one.

模型并行
Narayanan等人(2021)提出应使用与每个节点GPU数量相等的张量并行(tensor parallelism),在我们的集群中该值为8。我们通过实验发现这一设置在所考虑的所有并行配置中都是正确的,因此不再进一步分析。对于使用700亿参数模型的最大规模训练任务,我们采用8的流水线并行(pipeline parallelism)因子。在集群总计128个GPU的情况下,我们得到2的数据并行(data parallelism) $(=128/\mathrm{TP/PP})$。我们使用2的微批次(micro-batch)大小和512的全局批次(global-batch)大小。虽然通常人们倾向于使用更大的批次大小以获得更高的流水线并行效率,但我们观察到离散化问题带来的负面影响:将微批次大小从2提升到4会消耗过多内存,必须通过降低流水线并行度来补偿。值得注意的是,Narayanan等人(2021,图13)也表明,在GPU数量相近的类似规模问题上,当(TP, PP) $\in{(2,32)$,(4,16),$(8,8)$,(16,4),(32,2)}时,${\mathrm{TP}}={\mathrm{PP}}=8$的配置同样能实现最高的单GPU浮点运算性能。本质上,我们发现对于这种规模模型的高效训练,确实需要TP、PP和DP都大于1的三维模型并行。
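The batch accounting implied by these settings can be spelled out as follows. This is a small derivation; the gradient-accumulation count is not stated in the text and follows from the standard Megatron relation global_batch = micro_batch × DP × accumulation_steps.

```python
world_size = 16 * 8                 # 16 nodes x 8 A100 GPUs
tp, pp = 8, 8                       # tensor parallel (intra-node), pipeline parallel
dp = world_size // (tp * pp)        # data parallelism = 128 / 8 / 8 = 2

micro_batch, global_batch = 2, 512
grad_accum = global_batch // (micro_batch * dp)   # 512 / (2 * 2) = 128 micro-batches per step
print(dp, grad_accum)                             # -> 2 128
```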


Figure 2: MEDITRON. The complete pipeline for continued pre training, supervised finetuning, and evaluation of MEDITRON-7B and MEDITRON-70B.

图 2: MEDITRON。展示MEDITRON-7B和MEDITRON-70B持续预训练、监督微调及评估的完整流程。

4 Modeling

4 建模

4.1 Pretraining

4.1 预训练

To adapt the Llama-2 (Touvron et al., 2023b) language model to the medical domain, we start with the process of continued pretraining on the GAP-REPLAY data mixture we build in Section 2. This mixture contains papers from PubMed and PubMed Central (PMC), abstracts from PubMed, medical guidelines published and used by different regions, hospitals, and health organizations, as well as experience replay data (see Table 1).

为了使Llama-2 (Touvron et al., 2023b) 大语言模型适应医疗领域,我们首先在第2节构建的GAP-REPLAY混合数据集上进行持续预训练。该混合数据集包含来自PubMed和PubMed Central (PMC) 的论文、PubMed摘要、不同地区/医院/卫生组织发布和使用的医疗指南,以及经验回放数据 (参见表1)。

Training Details We adopt most pretraining settings and model architecture from the Llama-2 paper (Touvron et al., 2023b). For optimization, we use the AdamW optimizer with a cosine learning rate scheduler. For the model architecture, we inherit the standard transformer architecture, the use of RMSNorm, the SwiGLU activation function, and rotary positional embeddings directly from the implementation of Llama. We use grouped-query attention (GQA) introduced by Llama-2, and a context length of 2048 for the 7B model and 4096 for the 70B model.

训练细节
我们采用了Llama-2论文 (Touvron et al., 2023b) 中的大部分预训练设置和模型架构。在优化方面,我们使用AdamW优化器和余弦学习率调度器。模型架构上,我们直接继承了Llama的标准Transformer架构、RMSNorm、SwiGLU激活函数和旋转位置嵌入 (rotary positional embeddings)。对于7B模型采用2048上下文长度,70B模型采用4096上下文长度,并沿用了Llama-2提出的分组查询注意力机制 (group-query attention, GQA)。

For the pretraining run with Llama-2-70B, we achieve a throughput of 40,200 tokens/second. This amounts to $1.6884\times10^{16}$ bfloat16 flop/second and represents roughly $42.3%$ of the theoretical peak flops of 128 A100 GPUs, which is $128\times(312\times10^{12})=3.9936\times10^{16}$ flops. This is in line with existing runs of comparable size. For instance, Narayanan et al. (2021, Table 1) shows a model flops utilization (MFU) of $45%$ for a 76B parameter GPT-3, and Mangrulkar et al. (2023) gives an MFU of $45.5%$ on a Llama-2 finetuning task similar to ours.

在使用Llama-2-70B的预训练运行中,我们实现了每秒40,200个token的吞吐量。这相当于每秒$1.6884\times10^{16}$次bfloat16浮点运算(FLOP),约占128块A100 GPU理论峰值算力$128\times(312\times10^{12})=3.9936\times10^{16}$ FLOPs的$42.3%$。这与同类规模运行的现有结果一致。例如,Narayanan等人(2021, 表1)显示76B参数的GPT-3模型算力利用率(MFU)为$45%$,而Mangrulkar等人(2023)在类似我们任务的Llama-2微调工作中实现了$45.5%$的MFU。
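The reported utilization can be reproduced approximately with the common ~6·N flops-per-token rule of thumb for training (an approximation; the paper’s exact flop accounting may differ):

```python
params = 70e9                      # model parameters
tokens_per_s = 40_200              # reported throughput
achieved = 6 * params * tokens_per_s        # ~1.6884e16 bf16 flop/s
peak = 128 * 312e12                         # 128 A100 GPUs at 312 TFLOP/s (bf16) each
print(f"model flops utilization = {achieved / peak:.1%}")   # ~42.3%
```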

Hyperparameters and Tokenization The parameters for the AdamW optimizer are as follows: $\beta_{1}=0.9$, $\beta_{2}=0.95$, $\mathrm{eps}=10^{-5}$. The cosine learning rate schedule uses 2000 steps for warmup and decays the final learning rate to $10%$ of the maximum learning rate. We use $1.5\times10^{-4}$ as the learning rate for the 70B model and $3\times10^{-4}$ for the 7B and 13B models. The weight decay is set to 0.1, and the gradient clipping is set to 1.0. We inherit the tokenizer from Llama and use the byte-pair encoding (BPE) algorithm implemented with SentencePiece. The total vocabulary size is 32k tokens. Extra tokens are added to incorporate the new tokens we introduced for the pretraining data preprocessing. See Section 2.2 and Appendix B.2 for more details.

超参数与分词 (Tokenization)
AdamW优化器的参数如下:$\beta_{1}=0.9$、$\beta_{2}=0.95$、$\mathrm{eps}=10^{-5}$。余弦学习率调度使用2000步进行预热,并将最终学习率衰减至最大学习率的$10%$。70B模型的学习率设为$1.5\times10^{-4}$,7B和13B模型设为$3\times10^{-4}$。权重衰减设置为0.1,梯度裁剪设为1.0。我们继承了Llama的分词器 (Tokenizer),采用基于SentencePiece实现的字节对编码算法 (BPE)。总词汇量为$32k$个token。为适应预训练数据预处理引入的新token,额外添加了特殊token。详见第2.2节与附录B.2。
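A minimal version of this schedule (linear warmup over 2,000 steps, cosine decay to 10% of the peak learning rate) is sketched below; the total step count is a placeholder, since it is not restated in this section.

```python
import math

def lr_at_step(step: int, max_lr: float = 1.5e-4, warmup_steps: int = 2000,
               total_steps: int = 20_000, min_lr_ratio: float = 0.1) -> float:
    """Cosine schedule with linear warmup, decaying to 10% of the peak LR.

    max_lr and warmup_steps follow the text; total_steps is an assumed placeholder.
    """
    min_lr = min_lr_ratio * max_lr
    if step < warmup_steps:
        return max_lr * step / warmup_steps        # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```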

4.2 Supervised Finetuning

4.2 监督式微调

To evaluate the downstream performance of our MEDITRON models on common medical reasoning benchmarks, we individually finetune the pretrained model on each benchmark’s training set. For example, we finetune the model on the MedMCQA training set and evaluate it on the MedMCQA test set. Since MMLU does not have a training set, we evaluate the model finetuned on MedMCQA for out-of-distribution inference. For instruction finetuning, we manually write expressive and clear instructions for each training set. We list these instructions in Table 2.

为评估MEDITRON模型在常见医学推理基准上的下游性能,我们在每个基准训练集上分别对预训练模型进行微调。例如,在MedMCQA训练集上微调模型并在其测试集上评估。由于MMLU没有训练集,我们使用在MedMCQA上微调的模型进行分布外推理评估。对于指令微调,我们为每个训练集手动编写了表达清晰的任务指令,具体如 表2 所示。

Table 2: Medical task instructions. The instruction used for each benchmark for in-context learning and finetuning. Because MMLU-Medical does not provide training data, we evaluate MEDITRON models finetuned on MedMCQA on MMLU-Medical. Thus, the instruction for MMLU-Medical is identical to the one used for MedMCQA.

| Dataset | Instruction |
| --- | --- |
| MedQA | You are a medical doctor taking the US Medical Licensing Examination. You need to demonstrate your understanding of basic and clinical science, medical knowledge, and mechanisms underlying health, disease, patient care, and modes of therapy. Show your ability to apply the knowledge essential for medical practice. For the following multiple-choice question, select one correct answer from A to E. Base your answer on the current and standard practices referenced in medical guidelines. |
| PubMedQA | As an expert doctor in clinical science and medical knowledge, can you tell me if the following statement is correct? Answer yes, no, or maybe. |
| MedMCQA | You are a medical doctor answering real-world medical entrance exam questions. Based on your understanding of basic and clinical science, medical knowledge, and mechanisms underlying health and disease, patient care, and modes of therapy, answer the following multiple-choice question. Select one correct answer from A to D. Base your answer on the current and standard practices referenced in medical guidelines. |

表 2: 医疗任务指令。针对上下文学习和微调各基准所使用的指令。由于MMLU-Medical未提供训练数据,我们在该数据集上评估基于MedMCQA微调的MEDITRON模型,因此MMLU-Medical的指令与MedMCQA保持一致。

| 数据集 | 指令 |
| --- | --- |
| MedQA | 您是一名参加美国医师执照考试的医生。需要展示您对基础与临床医学、医学知识以及健康机制、疾病、患者护理和治疗模式的理解。请证明您具备应用医学实践必备知识的能力。对于以下选择题,从A到E中选择一个正确答案。作答依据请参考现行医疗指南中的标准实践。 |
| PubMedQA | 作为临床科学与医学知识领域的专家医生,请判断以下陈述是否正确?回答是、否或可能。 |
| MedMCQA | 您是一名医生,正在回答真实世界医学入学考试题目。基于您对基础与临床医学、医学知识以及健康机制、疾病、患者护理和治疗模式的理解,回答以下选择题。从A到D中选择一个正确答案。作答依据请参考现行医疗指南中的标准实践。 |

Implementation We follow OpenAI’s ChatML format (OpenAI, 2023a) to format the instruction data. ChatML documents consist of a series of messages, starting with a special token <|im_start|>, followed by the role of the messenger (i.e., the “user” or the “assistant”), a new line, and then the message itself. The message is then suffixed with a second special token: <|im_end|>. We adopt ChatML’s format for constructing the input prompt for the model. During training, we only compute the loss with respect to the response tokens (including <|im_start|> and <|im_end|>).

实现
我们遵循 OpenAI 的 ChatML 格式 (OpenAI, 2023a) 来格式化指令数据。ChatML 文档由一系列消息组成,以特殊 token <|im_start|> 开头,后跟消息发送者角色(即 “user” 或 “assistant”)、换行符及消息内容。消息末尾以第二个特殊 token <|im_end|> 结束。我们采用 ChatML 格式构建模型的输入提示。训练时,仅计算响应 token(包括 <|im_start|> 和 <|im_end|>)对应的损失。

When preprocessing the input data, we keep each document separate, insert pad tokens <PAD> at the end of each text, and mask out the loss on padding tokens. An example prompt format for task-specific finetuning on MedQA is as follows:

在对输入数据进行预处理时,我们将每个文档保持独立,并在每个文本末尾插入填充符<PAD>,同时屏蔽填充符上的损失。以下是针对MedQA任务进行特定微调的提示格式示例:

A finetuned MEDITRON model needs to predict (C) Cystic hygroma as the answer for this prompt.

经过微调的MEDITRON模型需要预测(C)囊性水瘤作为该提示的答案。
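A minimal sketch of the prompt assembly and response-only loss masking described above is given below. The way the instruction and question are concatenated into the user turn, and the helper names, are assumptions for illustration; only the <|im_start|>/<|im_end|> markers, the <PAD> padding, and the "compute loss only on response tokens" rule come from the text.

```python
IM_START, IM_END, PAD = "<|im_start|>", "<|im_end|>", "<PAD>"

def build_example(instruction: str, question: str, answer: str) -> str:
    """Assemble one ChatML-style training document (user turn, then assistant turn)."""
    user = f"{IM_START}user\n{instruction}\n{question}{IM_END}"
    assistant = f"{IM_START}assistant\n{answer}{IM_END}"
    return user + "\n" + assistant

def loss_mask(token_ids: list[int], response_start: int, pad_id: int) -> list[int]:
    """1 where the loss is computed (response tokens incl. markers), 0 on prompt/padding."""
    return [1 if i >= response_start and tok != pad_id else 0
            for i, tok in enumerate(token_ids)]
```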

Hyperparameters The finetuning process uses the AdamW optimizer, with $\beta_{1}=0.9$, $\beta_{2}=0.95$, and $\mathrm{eps}=1\times10^{-5}$. We use a cosine learning rate schedule with a $10%$ warmup ratio and decay the final learning rate down to $10%$ of the peak learning rate. Following Llama-2-chat (Touvron et al., 2023b), we use a learning rate of $2\times10^{-5}$, a weight decay of 0.1, and a batch size of 64. We finetune the model for 3 epochs for all the finetuning runs.

超参数
微调过程使用 AdamW 优化器,设置 $\beta_{1}=0.9$、$\beta_{2}=0.95$ 和 $\mathrm{eps}=1\times10^{-5}$。采用余弦学习率调度,包含 10% 的预热比例,并将最终学习率衰减至峰值学习率的 10%。参照 Llama-2-chat (Touvron et al., 2023b),设定学习率为 $2\times10^{-5}$,权重衰减为 0.1,批量大小为 64。所有微调实验均训练 3 个周期。

| Dataset | # Train Samples | # Test Samples | Format | # Choices |
| --- | --- | --- | --- | --- |
| MedQA | 10,178 | 1,273 | Question + Answer | 5 |
| MedQA-4-option† | – | 1,273 | Question + Answer | 4 |
| PubMedQA | 200,000 | 500 | Abstract + Question + Answer | 3 |
| MedMCQA | 159,669 | 4,183 | Question + Answer | 4 |
| MMLU-Medical | – | 1,862 | Question + Answer | 4 |
| 数据集 | 训练样本数 | 测试样本数 | 格式 | 选项数 |
| --- | --- | --- | --- | --- |
| MedQA | 10,178 | 1,273 | 问题+答案 | 5 |
| MedQA-4-option† | – | 1,273 | 问题+答案 | 4 |
| PubMedQA | 200,000 | 500 | 摘要+问题+答案 | 3 |
| MedMCQA | 159,669 | 4,183 | 问题+答案 | 4 |
| MMLU-Medical | – | 1,862 | 问题+答案 | 4 |

Table 3: Medical benchmark datasets. In this table, we summarize the major details of each benchmark we use to evaluate MEDITRON. We report the number of train and test questions, the format of the questions, and the number of choices for each benchmark. Note that all benchmarks are multiple-choice question-answering tasks. †For MedQA-4-option, we train on the 5-option variant and evaluate on the 4-option setting.

表 3: 医学基准数据集。本表汇总了用于评估 MEDITRON 的各基准数据集关键信息,包括训练/测试问题数量、问题格式及选项数目。注:所有基准均为多项选择题任务。†MedQA-4-option 使用5选项变体进行训练,在4选项设置下评估。

4.3 Inference

4.3 推理

We apply several different inference methods to elicit answers from the models resulting from continued pretraining or instruction tuning.

我们对持续预训练或指令调优得到的模型应用了多种不同的推理方法来获取答案。

Top Token Selection (Top-Token): For tasks with a single-label answer, such as Multiple-choice or Boolean QA, we follow the HELM implementation (Liang et al., 2023) of the Open LLM benchmark (Beeching et al., 2023). In particular, we rely on a text generation engine to generate the next token output and gather the probability from the model for each word in the vocabulary. We select the token with the maximum log probability as the model’s generated answer and then compare the model answer to the text of the expected answer to determine the accuracy. For models finetuned on the downstream task, we pass the question directly to the model as input. For the pretrained model, we perform in-context learning (Xie et al., 2022) and provide the model with few-shot demonstrations as part of the input. For both in-context learning and direct generation from a finetuned model, we append the instruction of each benchmark in front of the question for answer generation.

Top Token选择(Top-Token):针对单标签答案任务(如多选题或布尔问答),我们遵循Open LLM基准测试(Beeching等人,2023)的HELM实现方案(Liang等人,2023)。具体而言,我们依赖文本生成引擎输出下一个token,并收集模型对词汇表中每个单词的概率预测。选择具有最大对数概率的token作为模型生成答案,再将该答案与预期答案文本进行比对以确定准确率。对于在下游任务上微调过的模型,我们直接将问题作为输入传递给模型。对于预训练模型,我们采用上下文学习(Xie等人,2022)方法,在输入中提供少量示例演示。无论是上下文学习还是微调模型的直接生成,我们都会在每个基准测试的问题前附加相应指令来生成答案。
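As an illustration of this decoding rule, the sketch below takes the greedy next token from a Hugging Face-style causal LM and compares it to the expected answer text. It is a simplification under assumed APIs; the actual evaluation harness handles tokenization and answer matching more carefully.

```python
import torch

@torch.no_grad()
def top_token_answer(model, tokenizer, prompt: str) -> str:
    """Return the single highest-probability next token as the model's answer."""
    inputs = tokenizer(prompt, return_tensors="pt")
    next_token_logits = model(**inputs).logits[0, -1]   # distribution over the vocabulary
    pred_id = int(next_token_logits.argmax())           # token with maximum log-probability
    return tokenizer.decode([pred_id]).strip()

def is_correct(model, tokenizer, prompt: str, expected: str) -> bool:
    """Compare the generated answer token to the text of the expected answer."""
    return top_token_answer(model, tokenizer, prompt) == expected.strip()
```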

Chain-of-Thought (CoT): CoT, introduced by Wei et al. (2023), enables an LLM to condition its generation on its intermediate reasoning steps when answering multi-step problems, thereby augmenting the LLM’s reasoning ability on complex problems such as math word problems. We apply zero-shot CoT prompting to the models finetuned on medical data since we only finetune on zero-shot CoT training samples. In the case of zero-shot CoT, we add the phrase “Let’s think step-by-step” at the end of the question, following Kojima et al. (2023).

思维链 (CoT) :CoT由Wei等人 (2023) 提出,使大语言模型在回答多步骤问题时能够基于其中间推理步骤生成答案,从而增强大语言模型在数学应用题等复杂问题上的推理能力。由于我们仅在零样本CoT训练样本上进行微调,因此将零样本CoT提示应用于经过医学数据微调的模型。在零样本CoT场景下,我们按照Kojima等人 (2023) 的方法,在问题末尾添加"让我们逐步思考"这一短语。

Self-consistency CoT (SC-CoT): Wang et al. (2023) found that sampling multiple reasoning traces and answers from the model and selecting the final answer through majority voting can significantly improve large language model performance on multiple-choice question-answering benchmarks. We apply SC-CoT prompting using a decoding temperature of 0.8, sample 5 generations, extract the answer options from each generation, and use majority voting to select the final prediction.

自洽思维链 (SC-CoT): Wang 等人 (2023) 发现,通过从模型中采样多条推理轨迹和答案,并采用多数投票方式选择最终答案,可显著提升大语言模型在多项选择题基准测试中的表现。我们采用解码温度为 0.8 的 SC-CoT 提示策略,采样 5 次生成结果,从每次生成中提取答案选项,最终通过多数投票确定预测结果。
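A compact sketch of the voting step is shown below; the regex used to pull the answer letter out of a generation is an illustrative assumption, not the paper’s extraction rule.

```python
import re
from collections import Counter

def extract_option(generation: str) -> str | None:
    """Pull the last answer letter of the form '(A)'..'(E)' out of a reasoning trace."""
    matches = re.findall(r"\(([A-E])\)", generation)
    return matches[-1] if matches else None

def self_consistency_answer(generations: list[str]) -> str | None:
    """Majority vote over the answers extracted from the sampled reasoning traces."""
    votes = [a for g in generations if (a := extract_option(g)) is not None]
    return Counter(votes).most_common(1)[0][0] if votes else None
```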

5 Medical Benchmarks

5 医学基准测试

Following previous works on developing medical LLMs and evaluation methods (Wu et al., 2023a; Singhal et al., 2023a,b), we selected four commonly used medical benchmarks, which are MedQA, MedMCQA, PubMedQA, and MMLU-Medical.

遵循先前关于开发医疗大语言模型和评估方法的研究 (Wu et al., 2023a; Singhal et al., 2023a,b),我们选择了四个常用的医疗基准测试,分别是 MedQA、MedMCQA、PubMedQA 和 MMLU-Medical。

MedQA: The MedQA (Jin et al., 2020) dataset consists of questions in the style of the US Medical Licensing Exam (USMLE). MedQA is a challenging benchmark due to its combination of different medical knowledge (patient profile, disease symptoms, drug dosage requirements, etc.) that needs to be contextualized for the questions to be answered correctly. The training set consists of 10178 samples, and the test set has 1273 questions. MedQA was compiled with a choice of four (MedQA-4-option) or five possible answers, so we finetuned the models on the original 5-option dataset and tested it on both the 5- and 4-option questions (MedQA-4-option) to have comparable results with existing evaluations of medical LLMs. This dataset does not include any long explanatory answers, so to finetune a model for chain-of-thought reasoning, we used a training set of questions in the distribution of MedQA that provides human-written explanations.

MedQA: MedQA (Jin等人, 2020) 数据集包含美国医师执照考试(USMLE)风格的题目。由于需要结合不同医学知识(患者档案、疾病症状、药物剂量要求等)进行情境化推理才能正确作答,该数据集具有较高挑战性。训练集包含10178个样本,测试集含1273道题目。MedQA提供四选一(MedQA4-option)或五选一题型,我们在原始5选项数据集上微调模型,并在5选项和4选项(MedQA-4-option)题目上测试,以保持与现有医疗大语言模型评估的可比性。该数据集不含详细解释性答案,因此为了微调思维链推理模型,我们使用了MedQA分布中附带人工撰写解释的题目训练集。

Accuracy (↑)

| Model | MMLU-Medical | PubMedQA | MedMCQA | MedQA | MedQA-4-Option | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| MPT-7B | 23.5±0.93 | 43.9±21.9 | 32.1±0.91 | 22.5±0.59 | 27.6±1.57 | 29.9 |
| Falcon-7B | 26.1±0.51 | 52.8±44.2 | 27.3±1.53 | 19.6±1.86 | 25.3±1.63 | 30.2 |
| Llama-2-7B | 41.4±0.24 | 49.1±51.1 | 37.9±1.16 | 29.1±0.90 | 35.4±4.27 | 38.6 |
| PMC-Llama-7B | 26.2±1.27 | 57.0±20.6 | 27.4±5.91 | 21.6±0.32 | 27.8±0.86 | 32.0 |
| MEDITRON-7B | 42.3±2.37 | 69.3±15.1 | 36.3±1.38 | 28.7±0.81 | 37.4±3.27 | 42.8 |
| Llama-2-70B | 71.3±0.87 | 72.8±7.34 | 52.4±0.21 | 49.0±0.85 | 58.4±0.95 | 60.8 |
| MEDITRON-70B | 71.5±0.67 | 79.8±0.46 | 53.3±0.51 | 52.0±1.21 | 59.8±0.24 | 63.3 |
| 模型 | MMLU-Medical | PubMedQA | MedMCQA | MedQA | MedQA-4-Option | 平均 |
| --- | --- | --- | --- | --- | --- | --- |
| MPT-7B | 23.5±0.93 | 43.9±21.9 | 32.1±0.91 | 22.5±0.59 | 27.6±1.57 | 29.9 |
| Falcon-7B | 26.1±0.51 | 52.8±44.2 | 27.3±1.53 | 19.6±1.86 | 25.3±1.63 | 30.2 |
| Llama-2-7B | 41.4±0.24 | 49.1±51.1 | 37.9±1.16 | 29.1±0.90 | 35.4±4.27 | 38.6 |
| PMC-Llama-7B | 26.2±1.27 | 57.0±20.6 | 27.4±5.91 | 21.6±0.32 | 27.8±0.86 | 32.0 |
| MEDITRON-7B | 42.3±2.37 | 69.3±15.1 | 36.3±1.38 | 28.7±0.81 | 37.4±3.27 | 42.8 |
| Llama-2-70B | 71.3±0.87 | 72.8±7.34 | 52.4±0.21 | 49.0±0.85 | 58.4±0.95 | 60.8 |
| MEDITRON-70B | 71.5±0.67 | 79.8±0.46 | 53.3±0.51 | 52.0±1.21 | 59.8±0.24 | 63.3 |

Table 4: Few-shot Learning results of raw MEDITRON models against open-source pretrained baselines. This table shows the main few-shot learning results of MEDITRON on downstream medical tasks against other open-source pretrained models. Our models (MEDITRON-7B and MEDITRON70B) are continue-pretrained raw models with no additional supervised finetuning on task-specific training sets. For the 7B models, we apply 3-shot in-context learning with 3 demonstrations randomly sampled from each benchmark’s training set because the maximum context window size is limited to 2048 tokens. For the 70B models, we use 5-shot in-context learning. We report the average accuracy across three random seeds used for sampling random demonstrations.

表 4: MEDITRON原始模型的少样本学习结果与开源预训练基线的对比。本表展示了MEDITRON在医疗下游任务上与其他开源预训练模型的主要少样本学习结果。我们的模型(MEDITRON-7B和MEDITRON-70B)是持续预训练的原始模型,未在任务特定训练集上进行额外的监督微调。对于7B模型,由于最大上下文窗口尺寸限制为2048个token,我们采用3样本上下文学习,从每个基准训练集中随机抽取3个示例。对于70B模型,我们使用5样本上下文学习。报告结果为三个随机种子采样所得演示的平均准确率。

MedMCQA: The MedMCQA (Pal et al., 2022) dataset consists of more than 194k 4-option multiple-choice questions from the Indian medical entrance examinations (AIIMS/NEET). This dataset covers $2.4\mathrm{k}$ healthcare topics and 21 medical subjects. The training set contains $187\mathrm{k}$ samples, and the validation set has 4183 questions. Because the test set of MedMCQA does not provide the answer keys to the general public, we follow Wu et al. (2023a) and use the validation set to report evaluations. For hyper parameter tuning, we randomly split the training set into new train/validation splits. For both single-answer and chain-of-thought training data, we also remove all the samples with "None" as the explanation, resulting in 159,669 training samples.

MedMCQA: MedMCQA数据集 (Pal等人, 2022) 包含超过19.4万道来自印度医学入学考试 (AIIMS/NEET) 的四选一选择题。该数据集涵盖$2.4\mathrm{k}$个医疗主题和21个医学学科。训练集包含$187\mathrm{k}$个样本,验证集有4183道题目。由于MedMCQA的测试集未向公众提供答案,我们遵循Wu等人 (2023a) 的做法,使用验证集进行评测。为进行超参数调优,我们将训练集随机划分为新的训练/验证拆分。对于单答案和思维链训练数据,我们还移除了所有解释为"None"的样本,最终得到159,669个训练样本。

PubMedQA: The PubMedQA (Jin et al., 2019) dataset consists of 200k artificially created multiple-choice QA samples and 1k QA samples labeled by experts. Given a PubMed abstract as context and a question, the model needs to predict a yes, no, or maybe answer. We follow the reasoning-required evaluation setting where the model is given a question together with a PubMed abstract as context. Out of the 1k expert-labeled samples, we use the 500 test samples for evaluation following Singhal et al. (2023a)’s setting. Because the size of the other 500 training samples is relatively small, we use the 200k artificially labeled examples as the training data to finetune our models.

PubMedQA: PubMedQA (Jin等人, 2019) 数据集包含20万个人工生成的多选题问答样本和1千个专家标注的问答样本。给定一篇PubMed摘要作为上下文和一个问题,模型需要预测"是"、"否"或"可能"的答案。我们采用需要推理能力的评估设置,即模型会同时获得问题和PubMed摘要作为上下文。在这1千个专家标注样本中,按照Singhal等人 (2023a) 的设置,我们使用其中500个测试样本进行评估。由于另外500个训练样本规模较小,我们使用20万个人工标注样本作为训练数据来微调模型。

MMLU-Medical: The MMLU dataset (Hendrycks et al., 2021b) includes exam questions from 57 subjects (e.g., STEM, social sciences, etc.). Each MMLU subject contains four-option multiple-choice questions and their respective answers. We selected the nine subjects that are most relevant to medical and clinical knowledge: high school biology, college biology, college medicine, professional medicine, medical genetics, virology, clinical knowledge, nutrition, and anatomy, and we concatenate them into one medical-related benchmark: MMLU-Medical. The total number of questions in MMLU-Medical is 1862. Note that MMLU does not provide any training data. Therefore, we used MedMCQA’s training data (four answer options, the same as MMLU-Medical) to finetune our models and evaluate the generalization performance from MedMCQA to MMLU-Medical.

MMLU-Medical:MMLU数据集(Hendrycks等人,2021b)包含来自57个学科(如STEM、社会科学等)的考试题目。每个MMLU学科包含四选一选择题及其对应答案。我们选取了与医学和临床知识最相关的九个学科:高中生物学、大学生物学、大学医学、专业医学、医学遗传学、病毒学、临床知识、营养学和解剖学,并将它们合并为一个医学相关基准:MMLU-Medical。MMLU-Medical的总题量为1862道。需要注意的是,MMLU不提供任何训练数据。因此,我们使用MedMCQA的训练数据(四选一选项,与MMLU-Medical相同)对模型进行微调,并评估从MedMCQA到MMLU-Medical的泛化性能。

6 Main Results

6 主要结果

6.1 Pretrained Model Evaluation

6.1 预训练模型评估

Setup: For the benchmarks that provide publicly available training sets, i.e., PubMedQA (Jin et al., 2019), MedMCQA (Pal et al., 2022), and MedQA (Jin et al., 2020), we randomly sample few-shot demonstrations from the training data using three different random seeds (3-shot for 7B models and 5-shot for 70B models). We report the average accuracy across three random seeds. As baselines, we compare the raw MEDITRON models to other pretrained models. Our first baselines are the Llama-2 models (7B and 70B) without any continued pretraining, as this allows us to control for the effect of our continued pretraining. For MEDITRON-7B, we additionally run comparisons with PMC-Llama-7B (Wu et al., 2023a), a medical LLM adapted from Llama through continued pretraining on PubMed Central papers. We also select general-purpose pretrained models that perform well in open-source reasoning benchmarks as baselines, including Falcon-7B (Almazrouei et al., 2023) and MPT-7B (MosaicML NLP Team, 2023).

实验设置:对于提供公开训练集的基准测试(即PubMedQA (Jin et al., 2019)、MedMCQA (Pal et al., 2022)和MedQA (Jin et al., 2020)),我们使用三种不同随机种子从训练数据中随机抽取少样本示例(7B模型采用3样本,70B模型采用5样本)。最终报告三个随机种子的平均准确率。作为基线,我们将原始MEDITRON模型与其他预训练模型进行对比。第一组基线是未经继续预训练的Llama-2模型(7B和70B),这有助于控制继续预训练带来的影响。针对MEDITRON-7B,我们还与PMC-Llama-7B (Wu et al., 2023a)进行了对比——该医疗大语言模型通过对PubMed Central论文进行继续预训练从Llama适配而来。同时选取在开源推理基准中表现优异的通用预训练模型作为基线,包括Falcon-7B (Almazrouei et al., 2023)和MPT-7B (MosaicML NLP Team, 2023)。

Table 5: Main results of MEDITRON against open-source baselines. This table shows the main results of MEDITRON’s downstream medical task performance against other best-performing open-source medical models measured by accuracy. Our models (MEDITRON-7B and MEDITRON-70B), the Llama-2 models (7B and 70B), and PMC-Llama-7B are individually finetuned on the PubMedQA, MedMCQA, and MedQA training sets. The baselines with ∗, i.e., Mistral-7B (instruct version), Zephyr-7B-β, Med42-70B, and Clinical-Camel-70B, are instruction-tuned, so we do not perform further finetuning on the training sets and use the out-of-box model for inference. The inference modes consist of (1) top-token selection based on probability, (2) zero-shot chain-of-thought prompting, and (3) self-consistency chain-of-thought prompting (5 branches with 0.8 temperature). According to Tian et al. (2023), the passing score for humans on MedQA is 60.0.

Accuracy (↑)

| Model | MMLU-Medical | PubMedQA | MedMCQA | MedQA | MedQA-4-Option | Avg |
| --- | --- | --- | --- | --- | --- | --- |
| **Top Token Selection** | | | | | | |
| Mistral-7B* | 55.8 | 17.8 | 40.2 | 32.4 | 41.1 | 37.5 |
| Zephyr-7B-β* | 63.3 | 46.0 | 43.0 | 42.8 | 48.5 | 48.7 |
| PMC-Llama-7B | 59.7 | 59.2 | 57.6 | 42.4 | 49.2 | 53.6 |
| Llama-2-7B | 56.3 | 61.8 | 54.4 | 44.0 | 49.6 | 53.2 |
| MEDITRON-7B | 55.6 | 74.4 | 59.2 | 47.9 | 52.0 | 57.5 |
| Clinical-Camel-70B* | 65.7 | 67.0 | 46.7 | 50.8 | 56.8 | 57.4 |
| Med42-70B* | 74.5 | 61.2 | 59.2 | 59.1 | 63.9 | 63.6 |
| Llama-2-70B | 74.7 | 78.0 | 62.7 | 59.2 | 61.3 | 67.2 |
| MEDITRON-70B | 73.6 | 80.0 | 65.1 | 60.7 | 65.4 | 69.0 |
| **Chain-of-thought** | | | | | | |
| Llama-2-70B | 76.7 | 79.8 | 62.1 | 60.8 | 63.9 | 68.7 |
| MEDITRON-70B | 74.9 | 81.0 | 63.2 | 61.5 | 67.8 | 69.7 |
| **Self-consistency Chain-of-thought** | | | | | | |
| Llama-2-70B | 77.9 | 80.0 | 62.6 | 61.5 | 63.8 | 69.2 |
| MEDITRON-70B | 77.6 | 81.6 | 66.0 | 64.4 | 70.2 | 72.0 |

表 5: MEDITRON与开源基线模型的主要结果对比。本表展示了MEDITRON在下游医疗任务中与其他表现最佳的开源医疗模型(以准确率为衡量标准)的主要对比结果。我们的模型(MEDITRON-7B和MEDITRON-70B)、Llama-2模型(7B和70B)以及PMC-Llama-7B均在PubMedQA、MedMCQA和MedQA训练集上进行了单独微调。带∗标记的基线模型(Mistral-7B指令版、Zephyr-7B-β、Med42-70B和Clinical-Camel-70B)已进行指令微调,因此我们未在训练集上进一步微调,直接使用开箱即用模型进行推理。推理模式包括:(1) 基于概率的top-token选择,(2) 零样本思维链提示,(3) 自洽思维链提示(5个分支,温度系数0.8)。根据Tian等人(2023)的研究,人类在MedQA上的及格分数为60.0。

| 模型 | MMLU-Medical | PubMedQA | MedMCQA | MedQA | MedQA-4-Option | 平均 |
| --- | --- | --- | --- | --- | --- | --- |
| **Top Token选择** | | | | | | |
| Mistral-7B* | 55.8 | 17.8 | 40.2 | 32.4 | 41.1 | 37.5 |
| Zephyr-7B-β* | 63.3 | 46.0 | 43.0 | 42.8 | 48.5 | 48.7 |
| PMC-Llama-7B | 59.7 | 59.2 | 57.6 | 42.4 | 49.2 | 53.6 |
| Llama-2-7B | 56.3 | 61.8 | 54.4 | 44.0 | 49.6 | 53.2 |
| MEDITRON-7B | 55.6 | 74.4 | 59.2 | 47.9 | 52.0 | 57.5 |
| Clinical-Camel-70B* | 65.7 | 67.0 | 46.7 | 50.8 | 56.8 | 57.4 |
| Med42-70B* | 74.5 | 61.2 | 59.2 | 59.1 | 63.9 | 63.6 |
| Llama-2-70B | 74.7 | 78.0 | 62.7 | 59.2 | 61.3 | 67.2 |
| MEDITRON-70B | 73.6 | 80.0 | 65.1 | 60.7 | 65.4 | 69.0 |
| **思维链 (CoT)** | | | | | | |
| Llama-2-70B | 76.7 | 79.8 | 62.1 | 60.8 | 63.9 | 68.7 |
| MEDITRON-70B | 74.9 | 81.0 | 63.2 | 61.5 | 67.8 | 69.7 |
| **自洽思维链 (SC-CoT)** | | | | | | |
| Llama-2-70B | 77.9 | 80.0 | 62.6 | 61.5 | 63.8 | 69.2 |
| MEDITRON-70B | 77.6 | 81.6 | 66.0 | 64.4 | 70.2 | 72.0 |

Results: In Table 4, we observe that at the 7B scale, MEDITRON-7B with in-context learning outperforms other pretrained baselines. A potential reason for the improved performance is that MEDITRON-7B uses Llama-2 as a backbone model, which already achieves much higher average performance than other pretrained baselines. However, we show that continued pretraining on medical data brings additional benefits and further improves Llama-2’s performance on the medical benchmarks. In particular, MEDITRON-7B shows much higher performance on PubMedQA than the base model (a $20%$ increase). At the 70B scale, the performances of the base model Llama-2-70B and MEDITRON-70B increase significantly compared to the 7B models, with MEDITRON-70B outperforming the base model on all benchmarks. At the 7B scale, we observe that MEDITRON-7B does not perform as well as the base model on the most difficult benchmark, MedQA (though the difference is within the margin of error). However, at the 70B scale, MEDITRON-70B outperforms the base Llama-2 by $3%$. Overall, we show that MEDITRON models, particularly at the 70B scale, already demonstrate decent reasoning ability on medical tasks even before finetuning for a particular task. More specifically, for PubMedQA, the in-context learning performance $(79.8%)$ is only $0.2%$ behind the model finetuned on non-chain-of-thought PubMedQA training data $(80.0%)$.

结果:在表4中,我们观察到在70亿参数规模下,采用上下文学习的MEDITRON-7B超越了其他预训练基线模型。性能提升的潜在原因在于MEDITRON-7B以Llama-2作为主干模型,其基础性能已显著高于其他预训练基线。但研究表明,医疗数据的持续预训练能带来额外收益,可进一步提升Llama-2在医疗基准测试中的表现。具体而言,MEDITRON-7B在PubMedQA上的表现较基础模型提升达20%。在700亿参数规模下,基础模型Llama-2-70B和MEDITRON-70B的性能较70亿参数模型均有显著提升,其中MEDITRON-70B在所有基准测试中均优于基础模型。值得注意的是,在70亿参数规模下,MEDITRON-7B在最困难的MedQA基准上表现略逊于基础模型(尽管差异在误差范围内)。但在700亿参数规模下,MEDITRON-70B以3%的优势超越基础Llama-2模型。总体而言,MEDITRON系列模型(尤其是700亿参数版本)在未经特定任务微调前,已展现出优秀的医疗任务推理能力。具体到PubMedQA任务,其上下文学习性能(79.8%)与非思维链训练数据微调模型(80.0%)仅相差0.2%。

6.2 Finetuned Model Evaluation

6.2 微调模型评估

Setup: For the benchmarks that provide publicly available training sets, we conduct supervised finetuning individually on each training set and evaluate on the corresponding test sets. Both PubMedQA and MedMCQA provide reasoning traces (long answers or explanations) for chain-of-thought. For MedQA, which does not provide reasoning traces, we use a separate training set that provides a human-written explanation for each question. We train with the format where the answer is concatenated to the explanation. For MMLU-Medical (Hendrycks et al., 2021b), which does not contain a training set, we test the model trained on MedMCQA instead since both datasets have the four-option answer format (with A, B, C, D). For the MedQA-4-option test set, we directly evaluate the model trained on the MedQA training set with five options.

设置:对于提供公开训练集的基准测试,我们分别在每个训练集上进行监督微调,并在相应的测试集上评估。PubMedQA和MedMCQA均提供思维链的推理轨迹(长答案或解释)。对于未提供推理轨迹的MedQA,我们使用一个单独的训练集,其中每个问题都附有人工撰写的解释。我们采用将答案与解释拼接的格式进行训练。对于不含训练集的MMLU-Medical (Hendrycks et al., 2021b),由于该数据集与MedMCQA均采用四选项(A、B、C、D)的答案格式,我们改用经MedMCQA训练的模型进行测试。针对MedQA四选项测试集,我们直接使用经五选项MedQA训练集训练的模型进行评估。


Figure 3: Main results of MEDITRON against commercial LLMs. We compare MEDITRON-70B’s performance on four medical benchmarks (PubMedQA, MedMCQA, MedQA, MedQA-4-option) against commercial LLMs that have much larger parameter counts. We focus on GPT-3.5 (175B), GPT-4, Med-PaLM (540B), and Med-PaLM-2 (540B). The results of these commercial LLMs are directly taken from the associated papers (Nori et al., 2023; Singhal et al., 2023a,b). Note that MedPaLM does not report its performance on MedQA, and MedPaLM-2 does not report its performance on MedQA-4-option.

图 3: MEDITRON与商用大语言模型的主要对比结果。我们将MEDITRON-70B在四个医学基准测试(PubMedQA、MedMCQA、MedQA、MedQA-4-option)上的表现与参数量大得多的商用大语言模型进行对比,重点关注GPT-3.5(175B)、GPT-4、Med-PaLM(540B)和Med-PaLM-2(540B)。这些商用大语言模型的结果直接引自相关论文(Nori et al., 2023; Singhal et al., 2023a,b)。需要注意的是,MedPaLM未报告其在MedQA上的表现,MedPaLM-2未报告其在MedQA-4-option上的表现。

We evaluate MEDITRON models finetuned on each individual benchmark’s training set against Llama-2 (7B and 70B) and PMC-Llama-7B (also finetuned on each benchmark’s training sets). We then include 4 instruction-tuned models as public baselines: Mistral-7B-instruct (Jiang et al., 2023) and Zephyr-7B-β (Tunstall et al., 2023) as 7B-scale baselines, and Clinical-Camel-70B (Toma et al., 2023) and Med42-70B (M42-Health) as 70B-scale baselines. Clinical-Camel-70B is a Llama-2-70B variant tuned using QLoRA (Dettmers et al., 2023) on multi-turn dialogues transformed from conversations, clinical articles, and medical task data. Med42-70B is instruction-tuned on medical tasks, but the training details are not publicly released. We do not further finetune the public baselines on the task-specific training sets because they are already instruction-tuned. Finally, we compare MEDITRON-70B against commercial LLMs, including GPT-3.5 (Ouyang et al., 2022), GPT-4 (OpenAI, 2023b), Med-PaLM (Singhal et al., 2023a), and Med-PaLM-2 (Singhal et al., 2023b). These LLMs are pretrained or tuned on large-scale, high-quality, proprietary corpora and instruction data. They are also significantly larger than MEDITRON (i.e., 175B, 540B). Note that only the MEDITRON, Llama-2, and PMC-Llama-7B models are finetuned on the training sets. Because Med42 (M42-Health) and Clinical-Camel (Toma et al., 2023) have already been tuned on these datasets as part of their initial instruction-tuning, we exclude them from further supervised finetuning.

我们评估了在各基准训练集上微调后的MEDITRON模型,并与Llama-2(7B和70B版本)及PMC-Llama-7B(同样在各基准训练集上微调)进行对比。同时选取了4个指令微调模型作为公开基线:7B规模的Mistral-7B-instruct (Jiang et al., 2023)和Zephyr-7B β (Tunstall et al., 2023),以及70B规模的Clinical-Camel-70B (Toma et al., 2023)和Med42-70B (M42-Health)。其中,Clinical-Camel-70B是基于Llama2 70B通过QLoRA (Dettmers et al., 2023)对转化自对话、临床文献和医疗任务数据的多轮对话进行调优的变体;Med42-70B虽经医疗任务指令微调,但其训练细节未公开。由于这些基线模型已完成指令微调,我们未在任务特定训练集上对其进一步微调。最后,我们将MEDITRON-70B与商用大语言模型(包括GPT-3.5 (Ouyang et al., 2022)、GPT-4 (OpenAI, 2023b)、Med-PaLM (Singhal et al., 2023a)和Med-PaLM-2 (Singhal et al., 2023b))进行对比。这些大语言模型均基于大规模高质量专有语料库和指令数据预训练或调优,且参数量远超MEDITRON(分别达175B和540B)。需注意的是,仅MEDITRON、Llama-2和PMC-Llama-7B模型在训练集上进行了微调。由于Med42 (M42-Health)和Clinical-Camel (Toma et al., 2023)已在初始指令微调阶段使用过这些数据集,故未对其进行额外监督微调。

Results: We report the performance of MEDITRON and related baselines at both the 7B and 70B parameter scales. Table 5 shows all the performance measured in terms of accuracy (↑). At the 7B scale, we first compare with Llama-2-7B and PMC-Llama-7B, which are finetuned in the same manner as MEDITRON-7B. The results show that MEDITRON-7B outperforms these two baselines by an average of $4%$. Compared to the state-of-the-art instruction-tuned models Mistral (Jiang et al., 2023) and Zephyr-β (Tunstall et al., 2023), MEDITRON achieves significant performance gains on all benchmarks except MMLU-Medical, particularly on PubMedQA, with more than a $10%$ increase. Overall, MEDITRON-7B achieves the best PubMedQA performance with $74.4%$ accuracy, the best MedMCQA performance with $59.2%$ accuracy, and the best performance on both MedQA and MedQA-4-option with $47.9%$ and $52.0%$ accuracy, respectively. At the 70B scale, we compare with

结果:我们报告了MEDITRON及相关基线模型在70亿和700亿参数规模下的性能表现。表5展示了所有以准确率(↑)衡量的性能数据。在70亿参数规模下,我们首先与经过相同方式微调的Llama-2-7B和PMC-Llama-7B进行对比。结果显示MEDITRON-7B平均优于这两个基线模型4%。相较于当前最先进的指令调优模型Mistral (Jiang等人, 2023)和Zephyr β (Tunstall等人, 2023),MEDITRON在除MMLU-Medical外的所有基准测试中都取得了显著性能提升,尤其在PubMedQA上实现了超过10%的准确率提高。总体而言,MEDITRON-7B以74.4%的准确率取得PubMedQA最佳表现,以59.2%的准确率获得MedMCQA最佳成绩,同时在MedQA和MedQA-4-option上分别以47.9%和52.0%的准确率创下最优结果。


Figure 4: Training and validation loss during continued pre training of the MEDITRON-70B model. We report the training and validation loss of the 70B MEDITRON model across the number of processed tokens during the pre training run.

图 4: MEDITRON-70B模型持续预训练过程中的训练与验证损失。我们展示了70B MEDITRON模型在预训练期间处理的token数量对应的训练损失和验证损失。

At the 70B scale, we compare with Llama-2-70B (finetuned exactly like MEDITRON-70B) and two other medical LLMs, both of which are instruction-tuned for medical tasks from Llama-2-70B. On average, MEDITRON-70B improves over all three baseline models, with an $11.6%$ gain over Clinical-Camel-70B, a $5.4%$ performance gain over Med42-70B, and a $1.8%$ performance gain over the finetuned Llama-2-70B.

在700亿参数规模下,我们与Llama-2-70B(微调方式与MEDITRON-70B完全相同)以及另外两个医疗领域大语言模型(均基于Llama-2-70B进行医疗任务指令微调)进行对比。平均而言,MEDITRON-70B在所有三个基线模型上均有提升:相较Clinical-Camel-70B实现11.6%的性能提升,较Med42-70B提升5.4%,较微调版Llama-2-70B提升1.8%。

Next, we apply chain-of-thought (CoT) and self-consistency chain-of-thought (SC-CoT) prompting to investigate whether they can further improve our model's performance. CoT improves MEDITRON-70B's average performance by $0.7%$, and SC-CoT improves the performance by $3%$. Although the finetuned Llama-2-70B's performance also improves through CoT and SC-CoT, MEDITRON-70B maintains and extends its advantage by outperforming Llama-2 (by $1.9%$ with CoT and $2.8%$ with SC-CoT). Overall, with SC-CoT, MEDITRON-70B achieves the highest accuracy on average $(72.0%)$ and on all the benchmarks except MMLU-Medical ($81.6%$ on PubMedQA, $66.0%$ on MedMCQA, $64.4%$ on MedQA, and $70.2%$ on MedQA-4-option). Interestingly, MEDITRON-70B with all three inference modes surpasses the human passing score of 60.0 on MedQA (Tian et al., 2023).

接下来,我们应用思维链 (CoT) 和自洽思维链 (SC-CoT) 来探究它们是否能进一步提升模型性能。CoT 将 MEDITRON-70B 的平均性能提高了 $0.7%$,而 SC-CoT 则提升了 $3%$。尽管微调后的 Llama2-70B 通过 CoT 和 SC-CoT 也实现了性能提升,但 MEDITRON-70B 保持并扩大了其优势:使用 CoT 时领先 Llama-2 $1.9%$,使用 SC-CoT 时领先 $2.8%$。总体而言,采用 SC-CoT 的 MEDITRON-70B 取得了最高平均准确率 $(72.0%)$,并在除 MMLU-Medical 外的所有基准测试中领先 (PubMedQA $81.6%$、MedMCQA $66.0%$、MedQA $64.4%$、MedQA-4-option $70.2%$)。值得注意的是,MEDITRON-70B 的三种推理模式在 MedQA 上的表现均超过了人类及格分数 60.0 (Tian et al., 2023)。
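To make the SC-CoT inference mode concrete, below is a minimal sketch of how self-consistency decoding is commonly implemented: several reasoning chains are sampled at non-zero temperature, a final answer letter is extracted from each, and the majority answer is returned. The `model_generate` callable, the prompt wording, and the answer-extraction regex are illustrative assumptions rather than the exact pipeline used in the paper.

```python
import re
from collections import Counter

def sc_cot_answer(model_generate, question, n_samples=5, temperature=0.8):
    """Self-consistency CoT: sample several reasoning chains and majority-vote
    over the final answer letters. `model_generate` is any callable that maps
    (prompt, temperature) to a free-text completion."""
    prompt = (
        "Answer the following multiple-choice question. "
        "Think step by step, then end with 'The answer is X'.\n\n" + question
    )
    votes = []
    for _ in range(n_samples):
        completion = model_generate(prompt, temperature=temperature)
        # Extract the final answer letter (A-E); this regex is an assumption.
        match = re.search(r"answer is\s*\(?([A-E])\)?", completion, re.IGNORECASE)
        if match:
            votes.append(match.group(1).upper())
    # Majority vote over the sampled chains; return None if nothing was parsed.
    return Counter(votes).most_common(1)[0][0] if votes else None
```

Plain CoT corresponds to a single sampled chain (`n_samples=1`) with greedy decoding.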

MEDITRON vs. Commercial LLMs: We also compare MEDITRON's performance to commercial LLMs, which often have massive parameter counts ($>$100B). We focus on four popular LLMs: GPT-3.5 (i.e., text-davinci-003, (Ouyang et al., 2022)), GPT-4 (OpenAI, 2023b; Nori et al., 2023), Med-PaLM-540B (Singhal et al., 2023a), and Med-PaLM-2-540B (Singhal et al., 2023b). In Figure 3, we show that MEDITRON-70B outperforms GPT-3.5 on all benchmarks despite the latter having 175B parameters. On PubMedQA, MEDITRON-70B outperforms Med-PaLM and GPT-4, and its performance is only $0.2%$ behind the state-of-the-art model, Med-PaLM-2. On MedMCQA and MedQA (5-option and 4-option), MEDITRON-70B's performance falls between Med-PaLM and the state-of-the-art performance (GPT-4 and Med-PaLM-2). Overall, MEDITRON-70B's performance on medical reasoning tasks is competitive with commercial LLMs that have significantly larger parameter counts.

MEDITRON与商用大语言模型对比:我们还比较了MEDITRON与商用大语言模型的性能。这些模型通常具有庞大的参数量(>100B)。我们重点关注四种流行的大语言模型:GPT-3.5(即text-davinci-003,(Ouyang et al., 2022))、GPT-4 (OpenAI, 2023b; Nori et al., 2023)、Med-PaLM-540B (Singhal et al., 2023a)和Med-PaLM-2-540B (Singhal et al., 2023b)。在图3中,我们展示了尽管GPT-3.5拥有1750亿参数,MEDITRON-70B在所有基准测试中均优于该模型。在PubMedQA上,MEDITRON-70B超越了Med-PaLM和GPT-4,其性能仅落后当前最优模型Med-PaLM-2 0.2%。在MedMCQA和MedQA(5选项和4选项)测试中,MEDITRON-70B的性能介于Med-PaLM与最优表现(GPT-4和Med-PaLM-2)之间。总体而言,MEDITRON-70B在医学推理任务上的表现与参数量显著更大的商用大语言模型具有竞争力。

7 Analysis

7 分析

7.1 Impact of Continued Pre training

7.1 持续预训练的影响

During the continued pretraining process, we closely monitor the learning quality of the model. We report the language modeling losses on the training and validation sets in Figure 4; both losses decrease as the model consumes more tokens, indicating that the model learns effectively without overfitting. To monitor MEDITRON's downstream performance during the pretraining process, we also conduct intermediate evaluations on the 5k, 10k, and 15k iteration checkpoints. We evaluate each medical benchmark in a 5-shot in-context learning setting, providing five demonstrations randomly sampled from each benchmark's training data together with the associated instructions from Table 2. We use top-token generation as the inference method to obtain the model's prediction for each multiple-choice question-answer pair. Table 6 reports the in-context learning performance for these intermediate checkpoints. We observe that the intermediate performance fluctuates between checkpoints. However, the average performance grows consistently across iterations, and the final checkpoint achieves the best performance. We note that on certain individual datasets, the model's performance drops at intermediate checkpoints relative to the seed Llama-2 model before recovering at the final checkpoint, demonstrating the benefit of large-scale continued pretraining.

在持续预训练过程中,我们密切监控模型的学习质量。图4展示了训练和验证的语言建模损失,表明随着模型消耗更多token,两种损失均持续下降,模型能有效学习且未出现过拟合现象。为监测MEDITRON在预训练期间的下游表现,我们还对5k、10k和15k迭代检查点进行了中期评估。我们在5样本上下文学习设置下评估每个医疗基准测试,从各基准训练数据中随机抽取5个示例演示(含表2的关联指令),并采用top-token生成作为推理方法来获取模型对每个选择题答案对的预测。表6报告了这些中期检查点的上下文学习性能,结果显示不同检查点间的性能存在波动,但平均性能随迭代次数持续提升,最终检查点达到最佳表现。值得注意的是,在某些特定数据集上,中期检查点的性能相较初始Llama-2模型有所下降,但在最终检查点得以恢复,这印证了大规模持续预训练的价值。
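For illustration, the sketch below shows one way such a 5-shot, top-token evaluation can be run with a Hugging Face causal LM: the five demonstrations are concatenated into the prompt, and the option letter whose token receives the highest next-token logit is taken as the prediction. The checkpoint identifier, prompt template, and single-token option handling are assumptions for this sketch, not the paper's exact evaluation harness.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative checkpoint name; the intermediate pretraining checkpoints are not implied here.
MODEL_ID = "epfl-llm/meditron-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def five_shot_top_token(instruction, demos, question, options=("A", "B", "C", "D")):
    """Build a 5-shot prompt and return the option letter with the highest
    next-token logit (the 'top-token' inference described above)."""
    prompt = instruction + "\n\n"
    for demo_question, demo_answer in demos:  # five (question, answer letter) pairs from the training set
        prompt += f"Question: {demo_question}\nAnswer: {demo_answer}\n\n"
    prompt += f"Question: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    # Compare the logits of the (single-token) option letters only.
    option_ids = [tokenizer.encode(" " + o, add_special_tokens=False)[-1] for o in options]
    return options[int(torch.argmax(next_token_logits[option_ids]))]
```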

Table 6: In-context learning performance of intermediate MEDITRON-70B checkpoints. We monitor the pre training process through intermediate evaluations of the downstream tasks using in-context learning. Without any finetuning, we provide the model five demonstrations sampled from the training data as a part of the prompt and generate the model’s answer. The average performance increases consistently as the iteration number increases, though this varies across benchmarks. We report the average accuracy across three random seeds used for sampling random demonstrations.

Accuracy (↑)

| Iteration | #Tokens | MMLU-Medical | PubMedQA | MedMCQA | MedQA | MedQA-4-Option | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 (Llama-2) | 0B | 71.3±0.87 | 72.8±7.34 | 52.4±0.21 | 49.0±0.85 | 58.4±0.95 | 60.8 |
| 5,000 | 10B | 70.2±1.13 | 79.2±3.81 | 51.0±0.48 | 48.4±0.86 | 57.3±1.21 | 61.2 |
| 10,000 | 21B | 70.0±0.85 | 77.8±4.96 | 52.3±0.91 | 49.8±0.71 | 57.0±1.06 | 61.4 |
| 15,000 | 31B | 70.8±0.42 | 78.9±5.02 | 51.3±0.95 | 48.9±0.79 | 57.7±0.79 | 61.5 |
| 23,000 | 48B | 71.5±0.67 | 79.8±0.46 | 53.3±0.51 | 52.0±1.21 | 59.8±0.24 | 63.3 |

表 6: MEDITRON-70B 中间检查点的上下文学习性能。我们通过下游任务的中间评估来监控预训练过程。在没有任何微调的情况下,我们从训练数据中采样五个演示作为提示的一部分,并生成模型的答案。随着迭代次数的增加,平均性能持续提升,但不同基准测试之间存在差异。我们报告了用于随机演示采样的三个随机种子的平均准确率。

| 迭代次数 | Token数量 | MMLU-Medical | PubMedQA | MedMCQA | MedQA | MedQA-4-Option | 平均 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 (Llama-2) | 0B | 71.3±0.87 | 72.8±7.34 | 52.4±0.21 | 49.0±0.85 | 58.4±0.95 | 60.8 |
| 5,000 | 10B | 70.2±1.13 | 79.2±3.81 | 51.0±0.48 | 48.4±0.86 | 57.3±1.21 | 61.2 |
| 10,000 | 21B | 70.0±0.85 | 77.8±4.96 | 52.3±0.91 | 49.8±0.71 | 57.0±1.06 | 61.4 |
| 15,000 | 31B | 70.8±0.42 | 78.9±5.02 | 51.3±0.95 | 48.9±0.79 | 57.7±0.79 | 61.5 |
| 23,000 | 48B | 71.5±0.67 | 79.8±0.46 | 53.3±0.51 | 52.0±1.21 | 59.8±0.24 | 63.3 |

Table 7: Different data mixtures for continued pre training trial runs. In this table, we summarize the details of five different data mixtures we use for continued pre training trial runs.

| Name | #Tokens | Description |
| --- | --- | --- |
| PMC (2.2) | 39.2B | Only publicly accessible PubMed papers, taken directly from the PubMed Central portion of the S2ORC collection. |
| PMC + Replay (2.3) | 37.5B | Combines PMC with 400 million tokens sampled from the 1-trillion-token RedPajama training corpus for experience replay in the general domain. |
| PMC Upsampled (B.4) | 41.4B | Filters out animal studies, preprints, and retracted documents in PMC, and weighs each paper according to a set of predefined quality criteria such as publication type, recency, and number of citations. Higher-quality papers appear more frequently in the pretraining corpus. |
| PMC + Replay + Code (10B & 2B) (B.3) | 39.5B | Mixes PMC + Replay with 10B or 2B tokens of code data from the StarCoder training corpus. We create this mixture to study the impact of including code data in the pretraining corpus on the model's downstream reasoning performance. |
| GAP + Replay (2.1) | 46.8B | GAP contains PMC, PubMed abstracts, and medical guidelines, and is mixed with the 400 million replay tokens from RedPajama. This is the data mixture chosen for MEDITRON's continued pretraining. |

表 7: 持续预训练试验运行中使用的不同数据混合方案。本表总结了五种用于持续预训练试验运行的数据混合方案的详细信息。

| 名称 | Token数量 | 描述 |
| --- | --- | --- |
| PMC (2.2) | 39.2B | 仅包含来自S2ORC数据集中PubMed Central部分的公开可获取的PubMed论文。 |
| PMC + 回放 (2.3) | 37.5B | 将PMC数据与从RedPajama训练语料库(共1万亿token)中采样的4亿token相结合,用于通用领域的经验回放。 |
| PMC上采样 (B.4) | 41.4B | 过滤掉PMC中的动物研究、预印本和撤稿文献,并根据出版类型、时效性和引用次数等预定义质量标准对每篇论文进行加权,使高质量论文在预训练语料库中出现频率更高。 |
| PMC+回放+代码 (10B&2B) (B.3) | 39.5B | 将PMC+回放数据与来自StarCoder训练语料库的100亿或20亿token代码数据混合。创建此混合方案是为了研究在预训练语料库中加入代码数据对模型下游推理性能的影响。 |
| GAP+回放 (2.1) | 46.8B | GAP包含PMC、PubMed摘要和医学指南,并与来自RedPajama的4亿回放token混合。这是为MEDITRON持续预训练选择的数据混合方案。 |

7.2 Data Mixture Ablation

7.2 数据混合消融实验

Multiple prior works show that the content of pre training data can significantly impact the pre training and downstream performance of the model (Xie et al., 2023; Du et al., 2022; Penedo et al., 2023; Longpre et al., 2023). Thus, in this ablation study, we analyze the impact of different distributions of the training corpus on the model’s downstream medical reasoning ability. Based on prior assumptions, we conduct continued pre training of the Llama2-7B model on several data mixtures. The list of data mixtures and their details are shown in Table 7. We assess the downstream performance of the trial models by evaluating the finetuned models on the training sets of PubMedQA, MedMCQA, and MedQA. The setup for the supervised finetuning is the same as that described in Section 6.2. The results are displayed in Table 8, and all reported metrics are measured in terms of accuracy (↑). We now discuss the findings from the trial-run experiments.

多项先前研究表明,预训练数据的内容会显著影响模型的预训练及下游性能 [Xie et al., 2023; Du et al., 2022; Penedo et al., 2023; Longpre et al., 2023]。因此在本消融实验中,我们分析了不同训练语料分布对模型下游医学推理能力的影响。基于先验假设,我们在多种数据混合方案上对Llama2-7B模型进行持续预训练,具体数据混合方案如 表7 所示。我们通过在PubMedQA、MedMCQA和MedQA训练集上评估微调模型,来检验试验模型的下游性能。监督微调设置与第6.2节所述一致,结果展示于 表8 ,所有报告指标均采用准确率(↑)衡量。以下讨论试验运行的发现。

Replay tokens are beneficial for downstream performance. Experience replay with tokens from the general domain improves the model's performance on all benchmarks except MedMCQA. On average, PMC + Replay increases the performance by $1.6%$ compared to PMC results. We conclude that adding replay data to the training corpus for continued pretraining benefits the model's downstream performance. Based on this observation, we add the same 400M replay tokens to the final training data mixture (GAP + Replay) for our pretraining runs.

重放Token有利于下游性能表现。使用通用领域Token进行经验重放可提升模型在所有基准测试中的性能(MedMCQA除外)。平均而言,PMC + Replay相较PMC使性能提升了1.6%。我们得出结论:在训练语料库中添加重放数据以进行持续预训练,有助于提升模型的下游性能。基于这一观察,我们在最终预训练运行的混合训练数据(GAP + Replay)中额外加入了相同的4亿个重放Token。
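As a rough sketch of what such a replay mixture looks like in practice, the snippet below mixes a fixed budget of general-domain documents (here targeting roughly 400M tokens) into the medical corpus before shuffling for pretraining. The document containers and the token counter are placeholders; a real pipeline would count tokens with the model's tokenizer and write sharded binary files for the distributed trainer.

```python
import random

def build_replay_mixture(medical_docs, general_docs, replay_token_budget=400_000_000,
                         count_tokens=lambda doc: len(doc.split())):
    """Mix a fixed budget of general-domain 'replay' documents into the in-domain
    corpus (as in the PMC/GAP + Replay mixtures). `count_tokens` is a stand-in
    for a tokenizer-based counter; here it naively counts whitespace tokens."""
    random.shuffle(general_docs)
    replay, used = [], 0
    for doc in general_docs:
        if used >= replay_token_budget:
            break
        replay.append(doc)
        used += count_tokens(doc)
    mixture = medical_docs + replay
    random.shuffle(mixture)  # interleave the two domains before sharding
    return mixture
```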

Table 8: Performance comparison of different trial-runs on 7B models. We analyze which pre training data mixture yields the best performance on downstream medical benchmarks. For each data mixture, we first do continued pre training from the base Llama-2-7B model. Next, we finetune the pretrained model on individual medical tasks’ training sets and evaluate using their corresponding test sets. Note that for MMLU-Medical, we use the model finetuned on MedMCQA since both have 4 options. For inference, we select the token with the maximum log probability.

Accuracy (↑)

| Mixture | MMLU-Medical | PubMedQA | MedMCQA | MedQA | Avg |
| --- | --- | --- | --- | --- | --- |
| PMC-Llama-7B | 56.4 | 59.2 | 57.6 | 42.4 | 53.9 |
| Llama-2-7B | 53.7 | 61.8 | 54.4 | 44.0 | 53.5 |
| PMC | 55.6 | 62.8 | 54.5 | 45.4 | 54.6 |
| PMC + Replay | 56.4 | 63.2 | 58.1 | 46.9 | 56.2 |
| PMC Upsampled | 55.2 | 61.6 | 57.2 | 44.9 | 54.7 |
| PMC + Replay + Code (10B) | 55.8 | 58.0 | 47.2 | 35.1 | 49.0 |
| PMC + Replay + Code (2B) | 54.1 | 64.2 | 58.0 | 45.8 | 55.5 |
| GAP + Replay | 54.2 | 74.4 | 59.2 | 47.9 | 58.9 |

表 8: 不同试验方案在7B模型上的性能对比。我们分析了哪种预训练数据混合方案在下游医学基准测试中表现最佳。对于每种数据混合方案,我们首先基于Llama-2-7B基础模型进行持续预训练,然后在各医学任务的训练集上微调预训练模型,并使用相应测试集进行评估。注意对于MMLU-Medical任务,我们使用在MedMCQA上微调的模型(二者均为四选项任务)。推理时选择对数概率最高的token。

| 混合方案 | MMLU-Medical | PubMedQA | MedMCQA | MedQA | 平均 |
| --- | --- | --- | --- | --- | --- |
| PMC-Llama-7B | 56.4 | 59.2 | 57.6 | 42.4 | 53.9 |
| Llama-2-7B | 53.7 | 61.8 | 54.4 | 44.0 | 53.5 |
| PMC | 55.6 | 62.8 | 54.5 | 45.4 | 54.6 |
| PMC + 回放 | 56.4 | 63.2 | 58.1 | 46.9 | 56.2 |
| PMC上采样 | 55.2 | 61.6 | 57.2 | 44.9 | 54.7 |
| PMC + 回放 + 代码 (10B) | 55.8 | 58.0 | 47.2 | 35.1 | 49.0 |
| PMC + 回放 + 代码 (2B) | 54.1 | 64.2 | 58.0 | 45.8 | 55.5 |
| GAP + 回放 | 54.2 | 74.4 | 59.2 | 47.9 | 58.9 |

Upsampling the medical papers leads to weaker downstream performance. Comparing the upsampled version of PMC to the full PMC corpus, the model's performance on MedMCQA increases, but its performance on MedQA decreases, making this mixture weaker than PMC + Replay. Although it shows weaker performance, an upsampled version of PMC may have other potential benefits, such as allowing the model to generate content that is more clinic-ready or reducing the model's tendency to generate content that has not been tested on human subjects. However, within the scope of this preliminary analysis of data mixtures, we omit additional evaluations since they would require expert-level opinions that are hard to collect.

上采样医学论文会导致下游性能下降。将PMC的上采样版本与完整PMC语料库相比,模型在MedMCQA上的性能有所提升,但在MedQA上的表现却有所下降,使得这种混合效果弱于PMC + 回放。尽管表现较弱,PMC上采样版本可能具有其他潜在优势,例如使模型生成更符合临床要求的内容,或减少模型生成未经人体测试内容的倾向。然而,在此数据混合的初步分析范围内,我们省略了额外评估,因为这些评估需要难以收集的专家级意见。
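For intuition, the sketch below shows one way such quality-weighted upsampling could be implemented: each paper receives a sampling weight from simple heuristics over publication type, recency, and citation count, and papers are then sampled with replacement until a token budget is met. The specific scoring constants and field names are hypothetical illustrations, not the criteria actually used for the PMC Upsampled mixture.

```python
import random

def upsample_by_quality(papers, target_tokens):
    """Quality-weighted upsampling sketch: weight papers by heuristic quality
    signals and sample with replacement up to a token budget. Each paper is a
    dict with illustrative fields (publication_type, year, citations, num_tokens)."""
    def score(paper):
        weight = 1.0
        if paper.get("publication_type") in {"randomized controlled trial", "meta-analysis"}:
            weight *= 2.0  # favour higher levels of evidence (hypothetical factor)
        if paper.get("year", 0) >= 2015:
            weight *= 1.5  # favour recent work (hypothetical factor)
        weight *= 1.0 + min(paper.get("citations", 0), 100) / 100.0  # capped citation bonus
        return weight

    weights = [score(p) for p in papers]
    corpus, total = [], 0
    while total < target_tokens:
        chosen = random.choices(papers, weights=weights, k=1)[0]
        corpus.append(chosen)
        total += chosen.get("num_tokens", 0) or 1  # guard against zero-length entries
    return corpus
```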

Adding code does not improve the performance. There has been some speculation that training on code could improve the model's ability to perform reasoning tasks (Chen et al., 2021). However, at this model scale, we find that adding code decreases the overall performance on medical benchmarks, with the PMC + Replay mixture slightly outperforming the 2B-code addition $(+0.6%)$ and greatly outperforming the 10B-code addition by $5.7%$. Thus, in this setting, where no explicit reasoning (e.g., mathematical reasoning) is required from the model, we decide against using code in the final pretraining mixture.

添加代码不会提升性能。此前有推测认为代码训练可能增强模型执行推理任务的能力 (Chen et al., 2021) 。但在当前模型规模下,我们发现添加代码会降低医学基准测试的整体表现,其中PMC-Replay混合数据略优于2B-Code版本 $(+0.6%)$ ,并大幅领先10B-Code版本达 $5.7%$ 。因此,在模型无需执行显式推理(如数学推理)的场景下,我们最终决定不在预训练混合数据中使用代码。

GAP mixture is better than PubMed only. The GAP mixture adds PubMed abstracts and medical guidelines to the PMC corpus. Here, we compare GAP + Replay with PMC + Replay, the former outperforming the latter by $2.8%$ on average. This mixture leads to the best average performance and is chosen for MEDITRON's continued pretraining.

GAP混合策略优于仅使用PubMed。GAP混合策略将PubMed摘要和医疗指南添加到PMC语料库中。在此,我们比较GAP + 回放与PMC + 回放,前者平均表现优于后者2.8%。该混合策略实现了最佳平均性能,因此被选为MEDITRON持续预训练的方案。

8 Related Work

8 相关工作

Medical Large Language Models. Developing large language models in the medical domain and supporting biomedical and clinical tasks has been an ongoing effort. Early works on adapting pretrained language models to the medical domain focused on pretraining encoder-only models (e.g., BERT) with large-scale biomedical corpora such as PubMed Central articles and PubMed abstracts (Gu et al., 2021; Lee et al., 2020). Further approaches used links between documents (Yasunaga et al., 2022b) and knowledge graphs (Yasunaga et al., 2022a) to improve model performance. As large autoregressive generative models became more popular and delivered improved performance, decoder-only architectures such as GPT (Radford and Narasimhan, 2018) and Llama (Touvron et al., 2023a) were used to pretrain medical LLMs on medical-domain text data (Stanford CRFM; Wu et al., 2023a). With the recent trend of scaling up pretraining data size and model parameter counts, multiple studies explored the benefit of scaling up on medical tasks. GatorTronGPT (Peng et al., 2023) is a GPT-3-like (Brown et al., 2020) model with 20B parameters pretrained on 227B words of mixed clinical and English text. Clinical-Camel (Toma et al., 2023) is adapted from the Llama-2-70B (Touvron et al., 2023b) model using QLoRA (Dettmers et al., 2023) training on medical data. Singhal et al. (2023a) and Singhal et al. (2023b) study the medical reasoning ability of Flan-PaLM and PaLM-2, both with 540B parameters; PaLM-2 achieves state-of-the-art performance on the major medical benchmarks. Our work scales up full-parameter medical-domain pretraining to 70B parameters. Our evaluations show that our model outperforms previous pretrained language models and is competitive with Flan-PaLM and PaLM-2.

医疗大语言模型。在医疗领域开发大语言模型并支持生物医学和临床任务一直是一项持续的努力。早期将预训练语言模型适配到医疗领域的工作主要集中在使用大规模生物医学语料库(如PubMed Central文章和PubMed摘要)对仅编码器模型(例如BERT)进行预训练(Gu等人,2021;Lee等人,2020)。进一步的方法利用文档间链接(Yasunaga等人,2022b)和知识图谱(Yasunaga等人,2022a)来提升模型性能。随着大型自回归生成模型日益流行并展现出更优性能,仅解码器架构(如GPT(Radford和Narasimhan,2018)和Llama(Touvron等人,2023a)被用于在医疗领域文本数据上预训练医疗大语言模型(Stanford CRFM;Wu等人,2023a)。随着预训练数据规模和模型参数规模扩大的趋势,多项研究探讨了这种扩展对医疗任务的益处。GatorTron GPT(Peng等人,2023)是一个类似GPT-3(Brown等人,2020)的200亿参数模型,在2270亿单词的临床与通用英文混合文本上预训练。Clinical-Camel(Toma等人,2023)基于Llama-2-700亿参数模型(Touvron等人,2023b),采用QLoRA(Dettmers等人,2023)方法在医疗数据上微调。Singhal等人(2023a)和Singhal等人(2023b)研究了Flan-PaLM和PaLM-2(均为5400亿参数)的医疗推理能力,其中PaLM-2在主要医疗基准测试中达到最先进性能。我们的工作将全参数医疗领域预训练规模扩展至700亿参数,评估表明该模型优于先前预训练语言模型,并与Flan-PaLM和PaLM-2具有竞争力。

Continued Pretraining. Early studies on pretrained language models show that continued pretraining in a specific domain is beneficial for downstream task performance (Hoang et al., 2019; Alsentzer et al., 2019; Chakrabarty et al., 2019; Lee et al., 2020; Gu et al., 2021). Several studies found that continued pretraining of a language model on the unlabeled data of a given task improves the model's end-task performance (Howard and Ruder, 2018; Phang et al., 2019; Sun et al., 2020a). Gururangan et al. (2020) performed a comprehensive study exploring the benefit of continued pretraining on multiple domains for the BERT (Devlin et al., 2019) class of models and showed that a second phase of in-domain pretraining and adaptation to the task's unlabeled data improved performance on downstream domain-specific tasks. Additional benefits of continued pretraining include improved zero-shot and few-shot promptability (Wu et al., 2022). In the medical domain, the most similar work to ours is PMC-Llama (Wu et al., 2023a), which adapts the Llama model through continued pretraining on PubMed Central papers and medical textbooks. In contrast to prior works, MEDITRON studies the benefit of continued pretraining at the 70B scale and shows that expanding the domain-specific pretraining data brings significant performance gains on downstream tasks.

持续预训练。早期关于预训练语言模型的研究表明,在特定领域进行持续预训练有助于提升下游任务性能 [Hoang et al., 2019; Alsentzer et al., 2019; Chakrabarty et al., 2019; Lee et al., 2020; Gu et al., 2021]。多项研究发现,针对给定任务的未标注数据持续预训练语言模型可提升其最终任务表现 [Howard and Ruder, 2018; Phang et al., 2019; Sun et al., 2020a]。Gururangan et al. (2020) 通过全面研究证明了BERT类模型 [Devlin et al., 2019] 在多领域持续预训练的收益,表明领域内二次预训练及任务未标注数据适配能提升下游领域特定任务性能。持续预训练的额外优势还包括增强零样本和少样本提示能力 [Wu et al., 2022]。在医疗领域,与我们工作最相似的是PMC-Llama [Wu et al., 2023a],该研究通过对PubMed Central论文和医学教材持续预训练来适配Llama模型。与先前研究不同,MEDITRON探索了700亿参数规模的持续预训练收益,并证明扩展领域专用预训练数据能为下游任务带来显著性能提升。

9 Conclusion

9 结论

We release MEDITRON, a suite of domain-adapted medical LLMs that demonstrate high-level medical reasoning and improved domain-specific benchmark performance. Through continued pre training on carefully curated high-quality medical resources, including a novel set of clinical guidelines, MEDITRON shows improved performance over all the state-of-the-art baselines at matched scale on clinical reasoning benchmarks, coming within $10%$ performance of state-of-the-art commercial LLMs that are $8\times$ larger. Importantly, MEDITRON outperforms all open-source generalist and medical LLMs on all medical benchmarks. We make our models (at both 7B and 70B scale), tools required for curating the training corpus, and our distributed training library available as an open resource. This not only ensures access to real-world evaluation but also enables further fine-tuning and the development of instruction-based models, among other efforts. By providing these resources openly, we aim to help unlock the transformative potential of openly shared models in enhancing medical research, improving patient care, and fostering innovation across various health-related fields.

我们发布了MEDITRON,这是一套经过领域适配的医疗大语言模型,展现出高阶医学推理能力并提升了特定领域的基准性能。通过对精心筛选的高质量医疗资源(包括一组新颖的临床指南)进行持续预训练,MEDITRON在临床推理基准测试中以同等规模超越了所有最先进的基线模型,其性能与体积大8倍的最先进商用大语言模型差距仅在10%以内。值得注意的是,MEDITRON在所有医疗基准测试中都优于所有开源通用及医疗专用大语言模型。我们以开放资源形式提供了7B和70B规模的模型、训练语料库构建工具以及分布式训练库。这不仅确保了现实场景的评估可行性,还支持基于指令的模型微调与开发等后续工作。通过公开这些资源,我们旨在释放开放共享模型在促进医学研究、改善患者护理及推动健康相关领域创新方面的变革潜力。

Safety Advisory. While MEDITRON is designed to encode medical knowledge from sources of high-quality evidence, it is not yet adapted to deliver this knowledge appropriately, safely, or within professional actionable constraints. We recommend against deploying MEDITRON in medical applications without extensive use-case alignment, as well as additional testing, specifically including randomized controlled trials in real-world practice settings. While we do not view MEDITRON as being ready for real-world use in its current form, we release MEDITRON to the research community to promote work on the safety of language models in medical applications. Our work represents the largest open-source model adapted for the medical domain, trained on a large and diverse medical pre training corpus. We hope these resources will enable the research community to more comprehensively study large language models for the medical domain.

安全公告。虽然 MEDITRON 旨在从高质量证据来源中编码医学知识,但它尚未适应在专业可操作范围内恰当、安全地传递这些知识。我们建议不要在没有广泛用例对齐及额外测试(特别是包括真实世界实践环境中的随机对照试验)的情况下将 MEDITRON 部署于医疗应用。尽管我们认为 MEDITRON 目前的形式尚未准备好投入实际使用,但我们向研究社区发布 MEDITRON,以促进语言模型在医疗应用中的安全性研究。我们的工作代表了目前医疗领域最大的开源模型,其训练基于一个庞大且多样化的医学预训练语料库。我们希望这些资源能使研究社区更全面地研究适用于医疗领域的大语言模型。

Acknowledgements

致谢

We are extremely grateful to the EPFL Research Computing Platform Cluster team and the EPFL School of Computer and Communication Sciences for providing the computing resources for this project. We are especially grateful to Khadidja Malleck, Ed Bugnion, Jim Larus, Anna Fontcuberta i Morral, and Rüdiger Urbanke for their support in organizing the resources for this project. We also thank the IT team, Yoann Moulin and Emmanuel Jaep, for their technical support on the cluster, and Marcel Salathé, Jacques Fellay, and François Fleuret for providing feedback on earlier versions of this draft. We also thank Katie Link and Lewis Tunstall from Hugging Face for their support.

我们非常感谢EPFL研究计算平台集群团队和EPFL计算机与通信科学学院为本项目提供的计算资源。特别感谢Khadidja Malleck、Ed Bugnion、Jim Larus、Anna Fontcuberta i Morral以及Rüdiger Urbanke在组织项目资源方面的支持。同时感谢IT团队的Yoann Moulin和Emmanuel Jaep在集群上提供的技术支持,以及Marcel Salathé、Jacques Fellay和François Fleuret对本文稿早期版本的反馈。此外,我们还要感谢Hugging Face的Katie Link和Lewis Tunstall给予的帮助。

The availability of open-access clinical practice guidelines (CPG) was critical to this work, and we thank all the societies listed in Table 9. A broader representation of geography, medical specialties, and contexts (especially low-resource settings) could be achieved through more standardized CPG formatting practices to ensure reliable textual extraction (e.g., releasing .txt or .html versions with structured content). We encourage the CPG community to continue to make these documents available (open-access with permissive licenses for incorporation into large language models) and easily usable.

开放获取临床实践指南(CPG)的可用性对本研究至关重要,我们感谢表9中列出的所有学会。通过更标准化的CPG格式规范(例如发布具有结构化内容的.txt或.html版本)以确保可靠的文本提取,可以实现更广泛的地域、医学专业和场景(特别是资源匮乏地区)的代表性。我们鼓励CPG社区继续提供这些文档(采用允许纳入大语言模型的开放获取许可),并确保其易用性。

Kyle Matoba is supported by SNSF grant number FNS-188758 “CORTI". Amirkeivan Mohtashami is supported by SNSF grant number 200020 200342. Alexandre Sallinen is supported by the Science and Technology for Humanitarian Action Challenges (HAC) program from the Engineering for Humanitarian Action (EHA) initiative, a partnership between the ICRC, EPFL, and ETH Zurich. EHA initiatives are managed jointly by the ICRC, EPFL Essential Tech Centre, and ETH Zurich’s ETH4D. Antoine Bosselut gratefully acknowledges the support of the Swiss National Science Foundation (No. 215390), Innosuisse (PFFS-21-29), the EPFL Science Seed Fund, the EPFL Center for Imaging, Sony Group Corporation, and the Allen Institute for AI.

Kyle Matoba 获得瑞士国家科学基金会 (SNSF) 资助 (项目编号 FNS-188758 "CORTI")。Amirkeivan Mohtashami 获得瑞士国家科学基金会 (SNSF) 资助 (项目编号 200020 200342)。Alexandre Sallinen 获得人道主义行动工程 (EHA) 计划下属的"人道主义行动科技挑战"(HAC) 项目资助,该计划由红十字国际委员会 (ICRC)、洛桑联邦理工学院 (EPFL) 和苏黎世联邦理工学院 (ETH Zurich) 共同发起。EHA 计划由红十字国际委员会、EPFL Essential Tech 中心和苏黎世联邦理工学院 ETH4D 联合管理。Antoine Bosselut 衷心感谢瑞士国家科学基金会 (项目编号 215390)、瑞士创新署 (项目编号 PFFS-21-29)、EPFL 科学种子基金、EPFL 成像中心、索尼集团公司以及艾伦人工智能研究所的支持。

A Carbon Emissions

A 碳排放

Our training of the 70B model ran for 332 hours on 128 A100 GPUs, for 42,496 GPU-hours.

我们对70B模型的训练在128块A100 GPU上运行了332小时,总计消耗42,496 GPU小时。

The computation was performed on hardware located in Western Switzerland. Switzerland has a carbon efficiency of $0.016~\mathrm{kg\,CO_2/kWh}$. Our particular energy mix should be even superior to the national average.

计算在位于瑞士西部的硬件上进行。瑞士的碳效率为 $0.016~\mathrm{kg\,CO_2/kWh}$,我们的特定能源组合甚至应优于全国平均水平。

Each A100 has a TDP of 400W, giving

每块A100的热设计功耗(TDP)为400W,由此可得

$$
400\,\mathrm{W/GPU} \div 1000\,\mathrm{W/kW} \times 0.016\,\mathrm{kg\,CO_2/kWh} \times 332\,\mathrm{h} \times 128\,\mathrm{GPU} = 272\,\mathrm{kg\,CO_2}
$$

$$
400\,\mathrm{W/GPU} \div 1000\,\mathrm{W/kW} \times 0.016\,\mathrm{kg\,CO_2/kWh} \times 332\,\mathrm{h} \times 128\,\mathrm{GPU} = 272\,\mathrm{kg\,CO_2}
$$

emitted for the GPUs alone. Assuming an additional 2000W for node peripheries (CPU, RAM, fans, losses through the power supply, etc.), on top of the $8 \times 400\,\mathrm{W} = 3200\,\mathrm{W}$ drawn by the GPUs of each 8-GPU node, increases this by a factor of $(2000/3200+1)=1.625$, and a datacenter PUE of 1.1 gives an estimate of the total emissions for the computation of the 70B model of $272\times1.625\times1.1=486\,\mathrm{kg\,CO_2}$.

仅GPU部分的排放量。假设每个节点的外围设备(CPU、内存、风扇、电源损耗等)额外消耗2000W,而每个8卡节点的GPU功耗为 $8 \times 400\,\mathrm{W} = 3200\,\mathrm{W}$,排放量因此增加 $(2000/3200+1)=1.625$ 倍;数据中心PUE取1.1时,可估算70B模型计算的总排放量为 $272\times1.625\times1.1=486\,\mathrm{kg\,CO_2}$。
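The same arithmetic can be reproduced with a few lines of Python; the constants below are taken directly from the text above.

```python
# Carbon estimate for the 70B pretraining run, reproducing the arithmetic above.
gpu_power_kw   = 400 / 1000   # A100 TDP in kW
hours          = 332          # wall-clock training time
n_gpus         = 128
kg_co2_per_kwh = 0.016        # Swiss grid carbon efficiency

gpu_only = gpu_power_kw * hours * n_gpus * kg_co2_per_kwh   # ~272 kgCO2 for the GPUs alone
node_factor = 2000 / 3200 + 1                               # +2000 W peripheries per 3200 W of GPUs
pue = 1.1                                                   # datacenter power usage effectiveness
total = gpu_only * node_factor * pue                        # ~486 kgCO2 overall

print(f"GPUs only: {gpu_only:.0f} kgCO2, total: {total:.0f} kgCO2")  # GPUs only: 272 kgCO2, total: 486 kgCO2
```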

B Additional Details on Pre training Data

B 预训练数据的额外细节

B.1 Clinical Guideline Details

B.1 临床指南详情

Table 9 reports the details for each clinical guideline source that was used for the pre-training data mixture. To adhere to the copyright licenses granted by each source, we publicly release clean versions of all scraped articles for 8 out of 16 guideline sources, namely CCO, CDC, CMA, ICRC, NICE, SPOR, WHO, and WikiDoc. Additionally, we provide open access to our web scraping and pre-processing code for all the guideline sources.

表 9 报告了用于预训练数据混合的每个临床指南来源的详细信息。为遵守各来源授予的版权许可,我们公开了16个指南来源中8个的清洗后文章版本,包括CCO、CDC、CMA、ICRC、NICE、SPOR、WHO和WikiDoc。此外,我们还提供了所有指南来源的网络爬取和预处理代码的开放访问权限。

Table 9: GUIDELINES Corpus composition. For each clinical guideline source, we give the number of distinct documents, the approximate token count (in thousands) across all documents, the most common target audience, the country of origin, and whether we publicly release these articles.

| Source | Name | Articles | Tokens (K) | Audience | Country | Released |
| --- | --- | --- | --- | --- | --- | --- |
| AAFP | American Academy of Family Physicians | 50 | 16 | Doctor | USA | No |
| CCO | Cancer Care Ontario | 87 | 347 | Doctor | Canada | Yes |
| CDC | Center for Disease Control and Prevention | 621 | 11,596 | Both | USA | Yes |
| CMA | Canadian Medical Association | 431 | 2,985 | Doctor | Canada | Yes |
| CPS | Canadian Paediatric Society | 54 | 232 | Doctor | Canada | No |
| drugs.com | Drugs.com | 6,548 | 7,129 | Both | International | No |
| GC | Guideline Central | 1,029 | 1,753 | Doctor | Mix | No |
| ICRC | International Committee of the Red Cross | 49 | 2,109 | Doctor | International | Yes |
| IDSA | Infectious Diseases Society of America | 47 | 1,124 | Doctor | USA | No |
| MAGIC | Making GRADE The Irresistible Choice | 52 | 722 | Doctor | Mix | No |
| MayoClinic | Mayo Clinic | 1,100 | 3,851 | Patient | USA | No |
| NICE | National Institute for Health and Care Excellence | 1,656 | 14,039 | Doctor | UK | Yes |
| RCH | Royal Children's Hospital Melbourne | 384 | 712 | Doctor | Australia | No |
| SPOR | Strategy for Patient-Oriented Research | 217 | 1,921 | Doctor | Canada | Yes |
| WHO | World Health Organization | 223 | 5,480 | Both | International | Yes |
| WikiDoc | WikiDoc | 33,058 | 58,620 | Both | International | Yes |
| Total | | 46,649 | 112,716 | | | |

表 9: GUIDELINES 语料库构成。针对每个临床指南来源,我们列出了不同文档的数量、所有文档的近似 token 计数(以千为单位)、最常见的目标受众、来源国家以及是否公开这些文章。

| 来源 | 名称 | 文章数 | Token数 (K) | 受众 | 国家 | 是否公开 |
| --- | --- | --- | --- | --- | --- | --- |
| AAFP | American Academy of Family Physicians | 50 | 16 | Doctor | USA | No |
| CCO | Cancer Care Ontario | 87 | 347 | Doctor | Canada | Yes |
| CDC | Center for Disease Control and Prevention | 621 | 11,596 | Both | USA | Yes |
| CMA | Canadian Medical Association | 431 | 2,985 | Doctor | Canada | Yes |
| CPS | Canadian Paediatric Society | 54 | 232 | Doctor | Canada | No |
| drugs.com | Drugs.com | 6,548 | 7,129 | Both | International | No |
| GC | Guideline Central | 1,029 | 1,753 | Doctor | Mix | No |
| ICRC | International Committee of the Red Cross | 49 | 2,109 | Doctor | International | Yes |
| IDSA | Infectious Diseases Society of America | 47 | 1,124 | Doctor | USA | No |
| MAGIC | Making GRADE The Irresistible Choice | 52 | 722 | Doctor | Mix | No |
| MayoClinic | Mayo Clinic | 1,100 | 3,851 | Patient | USA | No |
| NICE | National Institute for Health and Care Excellence | 1,656 | 14,039 | Doctor | UK | Yes |
| RCH | Royal Children's Hospital Melbourne | 384 | 712 | Doctor | Australia | No |
| SPOR | Strategy for Patient-Oriented Research | 217 | 1,921 | Doctor | Canada | Yes |
| WHO | World Health Organization | 223 | 5,480 | Both | International | Yes |
| WikiDoc | WikiDoc | 33,058 | 58,620 | Both | International | Yes |
| 总计 | | 46,649 | 112,716 | | | |

B.2 PubMed Pre-Processing

B.2 PubMed 预处理

In this section, we provide additional details and examples of our pre-processing pipeline for PubMed full-text articles and abstracts.

在本节中,我们将提供PubMed全文文章和摘要预处理流程的更多细节与示例。

B.2.1 Bibliography references

B.2.1 参考文献

Each article starts with an authors section (a list of authors and their respective affiliations) and ends with a bibliography section (a list of papers and resources cited within the main text). As these segments follow a textual structure that deviates from the main body, we filter them out during pre-processing. This ensures that MEDITRON is not trained on patterns related to the authors and bibliography sections, which could otherwise impede its ability to generate human-like language for the main body of the articles.

每篇文章以作者部分(作者列表及其所属机构)开头,以参考文献部分(正文中引用的论文和资源列表)结尾。由于这些段落遵循与正文不同的文本结构,我们在预处理过程中将其过滤。这确保MEDITRON不会在作者和参考文献部分的模式上进行训练,否则可能会影响其为文章正文生成类人语言的能力。

In-text references to external resources constitute key pieces of information found in PubMed papers and abstracts. These references are crucial in substantiating claims through pertinent research and attributing credit to authors. However, most of these references are typically formatted using either reference numbers (linked to the bibliography section) or solely the primary author's last name and publication date. Without pre-processing the training data, a foundation model may learn to finish generated sentences with reference numbers that point to no resource in particular. To integrate these references into our corpus text, we use S2ORC annotations by replacing these in-text references with a short paper summary framed by the [BIB_REF] and [/BIB_REF] special tokens. The paper summary comprises the paper title (truncated to a maximum of 12 words) and the main author's last name.

文本中对外部资源的引用是PubMed论文和摘要中的关键信息。这些引用对于通过相关研究证实主张以及为作者署名至关重要。然而,大多数引用通常采用参考文献编号(链接至文献目录部分)或仅使用第一作者的姓氏和出版日期进行格式化。若不对训练数据进行预处理,基础模型可能会学会在生成的句子末尾添加不指向任何具体资源的引用编号。为了将这些引用整合到我们的语料库文本中,我们利用S2ORC注释,将这些文内引用替换为由[BIB_REF]和[/BIB_REF]特殊标记包围的简短论文摘要。该论文摘要包括论文标题(截断至最多12个单词)和主要作者的姓氏。

In-text bibliography references

文内参考文献引用

Format: [BIB_REF]Summarized paper title, Main author last name[/BIB_REF]

格式:[BIB_REF]论文标题缩写, 第一作者姓氏[/BIB_REF]

Raw: “... different behavior between them [7]. Its diagnosis is made by...” Processed: “... different behavior between them [BIB_REF]Cancer Incidence and Survival Trends by Subtype Using Data from the Surveillance..., Noone[/BIB_REF]. Its diagnosis is made by...”

原始文本:"... 它们之间的不同行为 [7]。其诊断依据是..." 处理后:"... 它们之间的不同行为 [BIB_REF]Cancer Incidence and Survival Trends by Subtype Using Data from the Surveillance..., Noone[/BIB_REF]。其诊断依据是..."
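A minimal sketch of this replacement step is given below. It assumes the bibliography has already been parsed into a mapping from reference number to title and first-author last name (a stand-in for the S2ORC annotations); the regex only covers simple numeric citations like `[7]`, and the example title is completed here for illustration only.

```python
import re

def replace_bib_refs(text, bib_entries):
    """Replace numbered in-text citations such as '[7]' with a short paper summary
    wrapped in [BIB_REF] ... [/BIB_REF]. `bib_entries` maps reference number ->
    {'title': ..., 'last_name': ...} (illustrative stand-in for S2ORC annotations)."""
    def summarize(entry):
        words = entry["title"].split()
        short_title = " ".join(words[:12]) + ("..." if len(words) > 12 else "")
        return f"[BIB_REF]{short_title}, {entry['last_name']}[/BIB_REF]"

    def repl(match):
        entry = bib_entries.get(int(match.group(1)))
        return summarize(entry) if entry else ""  # drop dangling reference numbers

    return re.sub(r"\[(\d+)\]", repl, text)

# Mirrors the raw/processed example above (full title guessed beyond the paper's ellipsis):
bib = {7: {"title": ("Cancer Incidence and Survival Trends by Subtype Using Data "
                     "from the Surveillance Epidemiology and End Results Program"),
           "last_name": "Noone"}}
print(replace_bib_refs("... different behavior between them [7]. Its diagnosis is made by ...", bib))
```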

B.2.2 Figures and Tables

B.2.2 图与表

MEDITRON is trained exclusively on textual data. Therefore, we exclude image-based figure content. However, figure captions remain a valuable source of information, which we retain in the final corpus and identify by wrapping in [FIG] and [/FIG] special tokens. The S2ORC annotation procedure relies on GROBID for table extraction,