LMFlow: An Extensible Toolkit for Finetuning and Inference of Large Foundation Models
Shizhe Diao♡∗, Rui Pan♡∗, Hanze Dong♡∗, KaShun Shum♡, Jipeng Zhang♡, Wei Xiong♠, Tong Zhang♠ ♡The Hong Kong University of Science and Technology ♠University of Illinois Urbana-Champaign {sdiaoaa, rpan, hdongaj}@ust.hk tozhang@illinois.edu
Abstract
Foundation models have demonstrated a great ability to achieve general human-level intelligence far beyond traditional approaches. As the technique keeps attracting attention from the AI community, an increasing number of foundation models are becoming publicly accessible. However, a significant shortcoming of most of these models lies in their performance on specialized-domain and task-specific applications, necessitating domain- and task-aware finetuning to develop effective scientific language models. As the number of available foundation models and specialized tasks keeps growing, the job of training scientific language models becomes highly nontrivial. In this paper, we initiate steps to tackle this issue. We introduce an extensible and lightweight toolkit, LMFlow, which aims to simplify the domain- and task-aware finetuning of general foundation models. LMFlow offers a complete finetuning workflow for a foundation model to support specialized training with limited computing resources. Furthermore, it supports continuous pretraining, instruction tuning, parameter-efficient finetuning, alignment tuning, inference acceleration, long context generalization, model customization, and even multimodal finetuning, along with carefully designed and extensible APIs. This toolkit has been thoroughly tested and is available at https://github.com/OptimalScale/LMFlow.
1 Introduction
Foundation models (FMs), and in particular large language models (LLMs), have demonstrated general abilities to perform different tasks beyond what was possible previously. While a number of pretrained large models, including GPT-J (Wang and Komatsuzaki, 2021), BLOOM (Scao et al., 2022), and LLaMA (Touvron et al., 2023a,b), are publicly available and have already been incorporated into the Hugging Face model repository (Huggingface, 2022), there is no publicly available toolkit that can be easily used to perform finetuning and inference for these different models. For specialized domains or tasks, it is necessary to further finetune such LLMs to achieve improved performance on those domains or tasks. The purpose of this package is to offer a simple-to-use and lightweight toolkit so that developers and researchers can perform efficient finetuning and inference of scientific language models with limited resources. The typical processes to train a scientific language model are shown in Figure 1, which include:
LMFlow enhances and streamlines the aforementioned fine-tuning procedures, enabling the efficient and effective training of a scientific language model. We focus on improving training speed. For example, it only takes one Nvidia 3090 GPU and five hours to train a medical LLaMA comparable to ChatGPT, based on a 7-billion-parameter LLaMA model. In addition to speed, we also aspire to achieve higher model performance. We used this framework to train medical LLaMA, a series of models with 7-billion, 13-billion, 33-billion, and
Table 1: Comparison with competing packages. Cont. PT: continuous pretraining. FT: finetuning. RLHF: reinforcement learning from human feedback. Deploy.: deployment. Adapt.: domain/task adaptation. Acc.: acceleration techniques for finetuning and inference. LC: long context generalization. VE: vocabulary extension. MM: multimodal training.
| | Cont. PT | FT | RLHF | Deploy. | Adapt. | Acc. | LC | VE | MM |
|---|---|---|---|---|---|---|---|---|---|
| Transformers (Wolf et al., 2020) | | | | | | | | | |
| Accelerate (Gugger et al., 2022) | | | | | | | | | |
| Deepspeed (Rasley et al., 2020) | | | | | | | | | |
| Trl (von Werra et al., 2020) | | | | | | | | | |
| LMFlow (ours) | | | | | | | | | |
65-billion parameters, on a single machine and have released the model weights for academic research. Using LMFlow, anyone can train their own scientific or personalized language models. Each person can choose the appropriate foundation model according to their available resources, for tasks such as question answering, companionship, and expert consultations in various domains. The larger the model and data size, the longer the training time and the better the results. Compared with existing packages, LMFlow encompasses a multitude of features that are absent in others, such as the support for long context generalization, as shown in Table 1. Most importantly, LMFlow stands out as a comprehensive, full-cycle foundation model adaptation toolkit. While other packages excel in specific areas like finetuning, they lack functionalities like RLHF and others. To our knowledge, LMFlow is the first to offer a complete pipeline that integrates all these processes. This holistic toolkit allows for more robust and adaptable language model training and inference, setting a new standard in the field of natural language processing.
2 Related Work
In recent years, the finetuning of large language models (LLMs) has gained significant attention, especially for scientific domain applications. The necessity of adapting these general-purpose models to specific domains or tasks has led to the development of various scientific language models. Lehman et al. (2023) conducted an extensive empirical analysis of the performance of various language models on clinical tasks and found that specialized clinical models, even smaller in size, significantly outperform larger general-domain models when finetuned on domain-specific data. This emphasizes the importance of domain specialization in achieving higher accuracy in safety-critical fields like healthcare. As a result, a series of scientific large models have emerged, including but not limited to language models for Science (Beltagy et al., 2019; Luu et al., 2021; Taylor et al., 2022), Mathematics (Yue et al., 2023; Yu et al., 2023; Gao et al., 2023), Physics (Nguyen et al., 2023; Zheng et al., 2023b; Perkowski et al., 2024), Chemistry and Materials Science (Cao et al., 2023; Shetty et al., 2023; Rubungo et al., 2023), Biology and Medicine (Lee et al., 2020; Zhang et al., 2023; Singhal et al., 2023; Wu et al., 2023; Han et al., 2023; Wang et al., 2023; Yang et al., 2024), and Information Retrieval (Lassance et al., 2023). We refer readers to a curated paper list of scientific language models, which includes a more comprehensive range of related work. Among these works, LMFlow has successfully helped in training AstroLLaMA-Chat (Perkowski et al., 2024) and MarineGPT (Zheng et al., 2023b). The Medical LLaMA trained in the medical domain within this paper also demonstrates the effectiveness of LMFlow. In summary, our proposed LMFlow offers a comprehensive toolkit for efficient and effective finetuning of foundation models across various specialized domains.
3 Toolkit Overview
3.1 System Design
An illustration of the LMFlow system design is shown in Figure 1. There are four stages for improving the performance of a publicly available foundation model. The first stage is domain adaptation, which involves modifying the model to better handle a specific domain by training it on that domain. The second stage is task adaptation, which involves adapting the model to perform a specific task, such as summarization, question answering, and translation. The third stage is instruction finetuning, which involves adjusting the model's parameters based on instructional question-answer pairs. The final stage is reinforcement learning with human feedback, which uses human feedback to further align the model with human preferences. LMFlow provides a complete finetuning workflow for these four stages, supporting specialized training of large language models with limited computing resources. In particular, LMFlow supports the key features described in the remainder of this section.
Figure 1: The system design of LMFlow. Starting from a publicly available foundation model, there are four possible stages including (1) domain adaptation, (2) task adaptation, (3) instruction finetuning, and (4) reinforcement learning with human feedback.
3.2 Installation
LMFlow has been fully tested on Linux OS (Ubuntu 20.04) and can be installed by executing the following commands.
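A typical installation follows the repository README: clone the source, create a Python environment, and install the package in editable mode. The commands below are indicative of the README at the time of writing and may change between releases; the Python version and the mpi4py step in particular should be checked against the current instructions.

```bash
git clone https://github.com/OptimalScale/LMFlow.git
cd LMFlow
conda create -n lmflow python=3.9 -y
conda activate lmflow
conda install mpi4py -y   # listed as a prerequisite in the README
pip install -e .
```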
3.3 Data Format
LMFlow accepts several .json files as input. Users can provide a list of .json files under a specified dataset directory. For example,
```
|- path_to_dataset
  |- data_1.json
  |- data_2.json
  |- another_data.json
  |- ...
```
Each json file shall have the following format (three instances with four keys for example),
14 | "KEY_ 4":"VALUE_ 2.4" | |
15 | ||
16 | { | |
17 | "KEY_ 1": | "VALUE_3.1" |
18 | "KEY_ 2" | "VALUE_3.2" |
19 | "KEY_3 | "VALUE_3.3 |
20 | "KEY_ 4": | "VALUE_3.4" |
21 | ||
22 | ||
23 |
where the TYPE indicates the dataset type and defines the set of keys {KEY_1, KEY_2, ...} and their corresponding interpretations. Two supported .json formats are detailed as follows.
TextOnly This is the most common dataset type, which only contains raw texts in each sample. This type of dataset can be used as the training set for text decoder models, or the input of decoder models / encoder-decoder models. Its format is as follows (three instances, for example),
```json
{
  "type": "text_only",
  "instances": [
    { "text": "SAMPLE_TEXT_1" },
    { "text": "SAMPLE_TEXT_2" },
    { "text": "SAMPLE_TEXT_3" }
  ]
}
```
Text2Text This is the dataset type mostly used for inference, which contains a pair of texts in each sample. This type of dataset can be used as the training set for text encoder-decoder models, or as question-answer pairs for evaluating model inference. Its format is as follows (three instances, for example),
```json
{
  "type": "text2text",
  "instances": [
    {
      "input": "SAMPLE_INPUT_1",
      "output": "SAMPLE_OUTPUT_1"
    },
    {
      "input": "SAMPLE_INPUT_2",
      "output": "SAMPLE_OUTPUT_2"
    },
    {
      "input": "SAMPLE_INPUT_3",
      "output": "SAMPLE_OUTPUT_3"
    }
  ]
}
```
3.4 Continuous Pretraining
The endeavor to bridge the divide between pretraining domains and downstream domains has led to the adoption of a prevalent approach known as continuous pretraining (Beltagy et al., 2019; Alsentzer et al., 2019; Huang et al., 2019; Lee et al., 2020), which involves ongoing pretraining on an extensive collection of unlabeled, domain-specific data. LMFlow supports continuous pretraining natively, which is an effective way to adapt LLMs to a specific domain. Users just need to collect a set of unlabeled data and prepare it in the TextOnly data format; the subsequent process is handled by autoregressive training.
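As a minimal illustration of this step, the snippet below packs a folder of raw .txt files into the TextOnly format described in Section 3.3. The directory and file names are placeholders; the resulting .json file is what would be placed in the dataset directory passed to LMFlow's finetuning entry point.

```python
import json
from pathlib import Path

def build_text_only_dataset(raw_dir: str, out_file: str) -> None:
    """Pack raw text files into LMFlow's TextOnly dataset format."""
    instances = []
    for path in sorted(Path(raw_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8").strip()
        if text:  # skip empty files
            instances.append({"text": text})

    dataset = {"type": "text_only", "instances": instances}
    with open(out_file, "w", encoding="utf-8") as f:
        json.dump(dataset, f, ensure_ascii=False, indent=2)

# Example usage (paths are hypothetical):
# build_text_only_dataset("raw_medical_corpus/", "path_to_dataset/data_1.json")
```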
3.5 Instruction Tuning
Instruction tuning (Sanh et al.; Wei et al.; Chung et al., 2022; Muennighoff et al., 2022; Wang et al., 2022), also called supervised finetuning, is an approach to enhance the performance of language models by training them to follow natural language instructions. This involves training the model on a small set of task-specific data, most of which is in prompt-answer format, including positive or negative examples, prompts, constraints, and other elements commonly present in human language. Instruction tuning enables LLMs to provide more accurate and relevant responses to user queries, making them more effective conversational agents.
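Each instruction-response pair can be stored directly in the Text2Text format from Section 3.3. The sketch below shows one way to assemble such a file; the example instructions and the output file name are illustrative only, not a template prescribed by LMFlow.

```python
import json

# A few illustrative instruction-response pairs.
pairs = [
    ("List three common symptoms of influenza.",
     "Common symptoms include fever, cough, and muscle aches."),
    ("Translate to French: 'Good morning.'",
     "Bonjour."),
]

dataset = {
    "type": "text2text",
    "instances": [{"input": q, "output": a} for q, a in pairs],
}

# Hypothetical output path inside the dataset directory.
with open("path_to_dataset/instructions.json", "w", encoding="utf-8") as f:
    json.dump(dataset, f, ensure_ascii=False, indent=2)
```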
3.6 RLHF as Finetuning
There is a growing need to explore alternative pretraining objectives that can guide LLMs to generate text that aligns with human preferences. By doing so, we can ensure that LLMs produce text that is more helpful, honest, and harmless for humans, which are called ‘HHH’ rules (Askell et al., 2021). Ouyang et al. (2022) divides the alignment process into three steps, including SFT, reward modeling, and RLHF (reward optimization). We have integrated all of these steps into our LMFlow framework. For reward optimization, PPO has been shown to be effective in various studies (Schulman et al., 2017; Engstrom et al., 2020). However, it relies on a trial-and-error approach through interaction with the environment, making it less stable and efficient than supervised learning (Choshen et al., 2019). To address this, we propose and implement a new alignment method for generative models called RAFT (Dong et al., 2023). RAFT utilizes a reward model to rank the output of the generative model, allowing us to continue training using supervised finetuning (SFT)-like techniques with the selected samples. This approach encourages the generative model to prioritize samples with higher rewards and offers significant computational advantages over PPO, resulting in substantial savings in memory and gradient computations. Moreover, due to the stability of SFT-like training, our approach demonstrates lower sample complexity and requires fewer learnable parameters, making it easily adaptable to any generative model. We believe our novel alignment algorithm represents a competitive and innovative approach that contributes to the well-behaved behavior of generative models.
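The core RAFT loop can be summarized in a few lines: sample several candidate responses per prompt, score them with the reward model, keep only the highest-reward candidates, and run an ordinary SFT update on the retained pairs. The sketch below is a schematic rendering of that idea rather than LMFlow's actual implementation; `generate`, `reward`, and `sft_update` stand in for the user's generation, reward-scoring, and supervised-finetuning routines.

```python
import random
from typing import Callable, List, Tuple

def raft_round(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],            # (prompt, k) -> k sampled responses
    reward: Callable[[str, str], float],                   # (prompt, response) -> scalar reward
    sft_update: Callable[[List[Tuple[str, str]]], None],   # one SFT step on (prompt, response) pairs
    k: int = 8,
    keep_ratio: float = 0.25,
    batch_size: int = 64,
) -> None:
    """One round of reward-ranked finetuning (RAFT), schematically."""
    batch = random.sample(prompts, min(batch_size, len(prompts)))

    selected: List[Tuple[str, str]] = []
    for prompt in batch:
        candidates = generate(prompt, k)                        # sample k responses
        scored = [(reward(prompt, r), r) for r in candidates]   # score each with the reward model
        scored.sort(key=lambda x: x[0], reverse=True)
        n_keep = max(1, int(k * keep_ratio))
        selected.extend((prompt, r) for _, r in scored[:n_keep])  # keep the best samples

    sft_update(selected)  # standard supervised finetuning on the selected samples
```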
Table 2: The performance on Massive Multitask Language Understanding (MMLU) benchmark. Bold represents the best among each dataset.
| MODEL | Anatomy | Clinical Knowledge | College Biology | College Medicine | Medical Genetics | Professional Medicine | Average |
|---|---|---|---|---|---|---|---|
| LLaMA 33B | 39.2 | 40.3 | 44.4 | 32.9 | 36.0 | 43.0 | 39.3 |
| Galactica 30B | 32.5 | 26.0 | 30.5 | 25.4 | 39.0 | 23.1 | 29.4 |
| Galactica 120B | 58.5 | 59.2 | 68.7 | 57.2 | 68.0 | 59.6 | 61.9 |
| OPT 175B | 28.9 | 21.9 | 30.6 | 35.0 | 27.9 | | |
| BLOOM 176B | 37.0 | 29.8 | 28.5 | 36.0 | 25.4 | | |
| Gopher 280B | 56.3 | 67.2 | 70.8 | 60.1 | 69.0 | 64.0 | 64.6 |
| GPT-3.5 | 56.3 | 69.8 | 72.2 | 61.3 | 70.0 | 70.2 | 66.6 |
| Task-tuned LLaMA 33B (LoRA) | 51.8 | 65.2 | 70.1 | 58.3 | 65.6 | 66.5 | 62.9 |
Table 3: The overall performance of task-tuned LLaMA models and the comparison with human and existing models on three medical datasets. PubMedQA and MedMCQA are evaluated on in-domain tests and MedQA-USMLE is evaluated on the out-of-domain test. Bold represents the best among each dataset.
| MODEL | PubMedQA (ID) | MedQA-USMLE (OOD) | MedMCQA (ID) | Average |
|---|---|---|---|---|
| Human (pass) | | 60.0 | 50.0 | |
| Human (expert) | 78.0 | 87.0 | 90.0 | 85.0 |
| InstructGPT-175B | 73.2 | 46.0 | 44.0 | 54.4 |
| ChatGPT | 63.9 | 57.0 | 44.7 | 55.2 |
| LLaMA-7B | 5.2 | 27.1 | 24.3 | 18.9 |
| LLaMA-33B | 1.8 | 43.4 | 30.3 | 25.2 |
| Task-tuned LLaMA-7B (full) | 75.1 | 44.5 | 49.9 | 56.5 |
| Task-tuned LLaMA-33B (LoRA) | 74.0 | 51.3 | 50.2 | 58.5 |
3.7 Efficient Tuning
LMFlow supports low-rank adaptation (LoRA) (Hu et al.) tuning based on the implementation of huggingface/peft (Mangrulkar et al., 2022). LoRA is an efficient tuning method that freezes the weights of the pretrained model and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, significantly reducing the number of trainable parameters. On top of that, LMFlow integrates QLoRA (Dettmers et al., 2023), allowing the training of even larger LLMs.
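The following sketch shows the general LoRA recipe with the huggingface/peft library that LMFlow builds on: it wraps a causal language model with trainable low-rank adapters while keeping the base weights frozen. The model id, target modules, and hyperparameters are illustrative choices, not LMFlow defaults.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = "meta-llama/Llama-2-7b-hf"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the decomposition matrices
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# The wrapped model can now be passed to any standard finetuning loop or Trainer.
```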
3.8 Inference
LMFlow provides an easy-to-use inference interface for LLMs, which supports parameter partitioning with the ZeRO-Offload strategy introduced by DeepSpeed (Ren et al., 2021). In LMFlow, the inference interface is provided by an Inferencer class, which contains two important inference methods: inference and stream_inference. The distinction lies in whether the output is printed word by word in real time. Speculative decoding is further supported by SpeculativeInferencer.
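To make the speculative-decoding idea concrete, the sketch below shows a simplified greedy variant: a small draft model proposes a few tokens, the target model verifies them in a single forward pass, and the longest agreeing prefix is accepted. This is an illustration of the technique only; it is not LMFlow's SpeculativeInferencer and omits sampling-based acceptance.

```python
import torch

@torch.no_grad()
def speculative_step(target, draft, input_ids, k=4):
    """One greedy speculative-decoding step for a batch of size 1.

    `target` and `draft` are Hugging Face causal LMs; `input_ids` has shape (1, seq_len)."""
    # 1) The cheap draft model proposes k tokens autoregressively.
    proposal = input_ids
    for _ in range(k):
        next_tok = draft(proposal).logits[:, -1, :].argmax(-1, keepdim=True)
        proposal = torch.cat([proposal, next_tok], dim=-1)

    # 2) The target model scores the context plus all proposed tokens at once.
    logits = target(proposal).logits
    preds = logits[:, input_ids.shape[1] - 1 : -1, :].argmax(-1)  # target's greedy choice at each proposed position
    drafted = proposal[:, input_ids.shape[1]:]

    # 3) Accept the longest prefix where draft and target agree, then append the
    #    target's own token at the first disagreement (or after k accepted tokens).
    n_accept = 0
    while n_accept < k and drafted[0, n_accept] == preds[0, n_accept]:
        n_accept += 1
    if n_accept < k:
        next_tok = preds[:, n_accept : n_accept + 1]
    else:
        next_tok = logits[:, -1, :].argmax(-1, keepdim=True)
    return torch.cat([input_ids, drafted[:, :n_accept], next_tok], dim=-1)
```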
4 API Documentation
Please refer to https://optimalscale.github.io/LMFlow/autoapi/index.html for details of the API documentation.
5 Results
In this section, we will provide experimental results and case studies of LMFlow in task tuning, instruction tuning, and alignment tuning.
5.1 Task Tuning
The aim of task tuning is to enhance a language model's proficiency in a specific field, such as the medical or financial domain, by imparting domain-specific information that allows it to better adapt to the target subject matter. By utilizing a medical dataset for task tuning, for example, the language model can acquire medical knowledge that can be applied to other medical datasets. To highlight the importance of this approach, we employed task tuning on LLaMA models in the medical domain and assessed their performance. The evaluations on three medical datasets revealed significant enhancements on both in-domain (PubMedQA (Jin et al., 2019), MedMCQA (Pal et al., 2022)) and out-of-domain (MedQA-USMLE (Jin et al., 2021)) datasets. The results are shown in Table 3. The LLaMA-33B (LoRA) performance is achieved with only about 16 hours of finetuning on the training splits of PubMedQA
Table 4: Performance on the Hugging Face Open LLM Leaderboard. We conduct the comparisons under the same setting as the Hugging Face Open LLM Leaderboard, which uses the EleutherAI Language Model Evaluation Harness (Gao et al., 2021). ARC-C, HellaSwag, MMLU, and TruthfulQA are evaluated with 25-shot, 10-shot, 5-shot, and 0-shot prompting, respectively, following the standard setting.
| MODEL | ARC-C | HellaSwag | MMLU | TruthfulQA | Average |
|---|---|---|---|---|---|
| 7B | | | | | |
| LLaMA-7B (Touvron et al., 2023a) | 46.6 | 75.6 | 34.2 | 34.1 | 47.6 |
| Baize-7B-v2 (Xu et al., 2023) | 44.5 | 73.3 | 35.6 | 40.8 | 48.6 |
| MPT-7B (Team, 2023) | 47.7 | 77.7 | 35.6 | 33.4 | 48.6 |
| Falcon-7B (Penedo et al., 2023) | 47.9 | 78.1 | 35.0 | 34.3 | 48.8 |
| Robin-7B-v2 | 49.4 | 74.6 | 39.8 | 43.0 | 51.7 |
| 13B | | | | | |
| Alpaca-13B (Taori et al., 2023) | 51.9 | 77.6 | 37.6 | 39.6 | 51.7 |
| LLaMA-13B (Touvron et al., 2023a) | 50.8 | 78.9 | 37.7 | 39.9 | 51.8 |
| Vicuna-13B (Zheng et al., 2023a) | 47.4 | 75.2 | 39.6 | 49.8 | 53.7 |
| Baize-13B-v2 (Xu et al., 2023) | 50.3 | 77.1 | 39.4 | 48.3 | 53.8 |
| Robin-13B-v2 | 56.5 | 80.4 | 48.8 | 50.8 | 59.1 |
| >30B | | | | | |
| LLaMA-33B (Touvron et al., 2023a) | 57.1 | 82.6 | 45.7 | 42.3 | 56.9 |
| LLaMA-65B (Touvron et al., 2023a) | 57.8 | 84.2 | 48.8 | 42.3 | 58.3 |
| Falcon-40B (Penedo et al., 2023) | 61.9 | 85.3 | 52.7 | 41.7 | 60.4 |
| Guanaco-65B-merged (Dettmers et al., 2023) | 60.2 | 84.6 | 52.7 | 51.3 | 62.2 |
| Falcon-40B-instruct (Penedo et al., 2023) | 61.6 | 84.4 | 54.1 | 52.5 | 63.2 |
| Robin-33B-v2 | 62.5 | 84.3 | 57.8 | 51.9 | 64.1 |
| Robin-65B-v2 | 61.9 | 84.6 | 62.6 | 51.8 | 65.2 |
Table 5: Results on HH-RLHF dataset. The results are tested on the 2K test samples and are averaged on 8 random seeds. The LLaMA-7B-SFT is the SFT-aligned model. Reward and PPL denote the mean reward and perplexity, respectively. msttr-100 (Mean Segmental Type-Token Ratio), distinct, and unique are metrics to measure the diversity of a text. Pred. Length is the average length of predictions.
| Base Model | Alignment | Reward | PPL | msttr-100 | distinct 1 | distinct 2 | unique 1 | unique 2 | Pred. Length |
|---|---|---|---|---|---|---|---|---|---|
| LLaMA-7B | | -0.435 | 4.781 | 0.579 | 0.032 | 0.258 | 7651 | 96071 | 119.9 |
| LLaMA-7B | SFT | 0.772 | 3.781 | 0.597 | 0.031 | 0.250 | 8198 | 110759 | 145.4 |
| LLaMA-7B-SFT | PPO | 2.077 | 4.156 | 0.597 | 0.033 | 0.262 | 7370 | 102437 | 127.8 |
| LLaMA-7B-SFT | RAFT | 2.294 | 4.031 | 0.611 | 0.032 | 0.258 | 8691 | 123576 | 156.2 |
and MedMCQA with a single 8×A100 server. Furthermore, we conducted experiments on Massive Multitask Language Understanding (MMLU) (Hendrycks et al., 2020) to further confirm the out-of-domain robustness of task tuning. The results are shown in Table 2.
5.2 Instruction Tuning
Following previous work in instruction tuning (Wang et al., 2022; Taori et al., 2023; Zheng et al., 2023a), we finetune the model with a combination of ShareGPT, GPT-4-LLM (Peng et al., 2023), and BELLE (Ji et al., 2023a,b). This data fusion takes the balance between Chinese and English data into consideration. Furthermore, we only sample a small subset from ShareGPT and BELLE instead of using the full data, which would require large computational resources. We call our instruction-tuned model Robin. We trained Robin-7B-v2, Robin-13B-v2, Robin-33B-v2, and Robin-65B-v2 based on the respective LLaMA base models. The delta weights of Robin are released at https://github.com/OptimalScale/LMFlow#model-zoo. In order to evaluate the models' instruction-following ability, we participate in the Hugging Face Open LLM Leaderboard. The performance is shown in Table 4. Specifically, we carried out in-depth finetuning on the entire LLaMA series, including 7B, 13B, 33B, and 65B, all of which achieved superior results. Robin-7B-v2 scored 51.7 on the Open LLM benchmark, and Robin-13B-v2 reached as high as 59.1, ranking sixth and surpassing many 33B models. The achievements of Robin-33B-v2 and Robin-65B-v2 are even more striking, with scores of 64.1 and 65.2 respectively, firmly securing the top positions.
5.3 Alignment Tuning
We conduct experiments on the HH-RLHF (Helpful and Harmless) dataset (Bai et al., 2022), which was collected for aligning models to human preferences. The performance is reported in Table 5. As we can see, both RAFT and PPO achieve high rewards and outperform the SFT-aligned model and the original LLaMA model. In comparison, RAFT achieves a better perplexity and tends to reply with more details, as its responses are usually longer. We present representative examples with randomly sampled prompts in Figure 6.
6 Conclusion
In conclusion, the LMFlow toolkit offers an extensible, lightweight, and easy-to-use solution for developers and researchers to perform efficient training of scientific language models with limited resources. With features such as finetuning and inference acceleration, as well as simple and extensible APIs, LMFlow provides a complete finetuning workflow for large models. Moreover, with the ability to customize training and achieve comparable or even better performance than ChatGPT, LMFlow represents a significant step forward in the development of large scientific models and their application to specialized tasks.
Acknowledgements
We thank the anonymous reviewers for their valuable suggestions and comments. Shizhe Diao and Rui Pan were supported by the Hong Kong Ph.D. Fellowship Scheme (HKPFS).
Broader Impact and Responsible Use
LMFlow is designed to offer substantial capabilities for scientific language model development. We urge researchers and developers to leverage LMFlow in real-world scenarios to drive positive societal change, such as conducting efficient, eco-friendly, and large-scale scientific language model development.
Despite these benefits, there is potential for misuse of LMFlow. It is particularly important that LMFlow is not used for creating customized models that could be harnessed for unethical purposes. We must also highlight that models trained with LMFlow do not offer absolute assurances regarding their dialogue functions. Users may encounter inaccuracies or biases in predictions. Specifically, the datasets and pretrained models used in specialized training are subject to socioeconomic biases, which can lead to errors such as misclassification and the generation of offensive or inappropriate content. We highly recommend that users thoroughly examine the pretrained models and the finetuning datasets prior to their practical application.
We are committed to the continuous improvement of LMFlow. Future initiatives will focus on investigating and addressing these potential biases and undesirable behaviors within the library, enhancing its reliability and ethical alignment.
References
Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly Available Clinical BERT Embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78.
Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. 2021. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861.
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A Pretrained Language Model for Scientific Text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3606–3611.
He Cao, Zijing Liu, Xingyu Lu, Yuan Yao, and Yu Li. 2023. InstructMol: Multi-modal integration for building a versatile and reliable molecular assistant in drug discovery. arXiv preprint arXiv:2311.16208.
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. 2023. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
Leshem Choshen, Lior Fox, Zohar Aizenbud, and Omri Abend. 2019. On the weaknesses of reinforcement learning for neural machine translation. arXiv preprint arXiv:1907.01752.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.