[论文翻译]DeepSeek-R1:通过强化学习提升大语言模型的推理能力


原文地址:https://u254848-88c6-e493554b.yza1.seetacloud.com:8443/miner/v2/analysis/pdf_md?filename=full.md&as_attachment=False&user_id=10&pdf=52d8ca3ac93e88cef9944e1fd03b0e04aec5954495a8250fb2fadf8fa20a4dad1738754180_2501.12948v1.pdf


DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-R1:通过强化学习提升大语言模型的推理能力

Abstract

摘要

We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.

我们推出了第一代推理模型,DeepSeek-R1-Zero 和 DeepSeek-R1。DeepSeek-R1-Zero 是一个通过大规模强化学习 (RL) 训练而成的模型,没有经过监督微调 (SFT) 作为初步步骤,展现了卓越的推理能力。通过 RL,DeepSeek-R1-Zero 自然衍生出许多强大且有趣的推理行为。然而,它也面临诸如可读性差和语言混合等挑战。为了解决这些问题并进一步提升推理性能,我们推出了 DeepSeek-R1,它在 RL 之前引入了多阶段训练和冷启动数据。DeepSeek-R1 在推理任务上达到了与 OpenAI-o1-1217 相当的性能。为了支持研究社区,我们开源了 DeepSeek-R1-Zero、DeepSeek-R1 以及基于 Qwen 和 Llama 从 DeepSeek-R1 蒸馏出的六个稠密模型 (1.5B, 7B, 8B, 14B, 32B, 70B)。


Figure 1 | Benchmark performance of DeepSeek-R1

图 1: DeepSeek-R1 的基准性能

Contents

目录

1 Introduction

1 引言

2 Approach

2 方法

3 Experiment

3 实验

4 Discussion

4 讨论

5 Conclusion, Limitations, and Future Work

5 结论、局限性与未来工作

A Contributions and Acknowledgments

A 贡献与致谢

1. Introduction

1. 引言

In recent years, Large Language Models (LLMs) have been undergoing rapid iteration and evolution (Anthropic, 2024; Google, 2024; OpenAI, 2024a), progressively diminishing the gap towards Artificial General Intelligence (AGI).

近年来,大语言模型 (LLMs) 经历了快速的迭代和演进 (Anthropic, 2024; Google, 2024; OpenAI, 2024a),逐步缩小了与通用人工智能 (AGI) 的差距。

Recently, post-training has emerged as an important component of the full training pipeline. It has been shown to enhance accuracy on reasoning tasks, align with social values, and adapt to user preferences, all while requiring relatively minimal computational resources compared to pre-training. In the context of reasoning capabilities, OpenAI's o1 (OpenAI, 2024b) series models were the first to introduce inference-time scaling by increasing the length of the Chain-of-Thought reasoning process. This approach has achieved significant improvements in various reasoning tasks, such as mathematics, coding, and scientific reasoning. However, the challenge of effective test-time scaling remains an open question for the research community. Several prior works have explored various approaches, including process-based reward models (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023), reinforcement learning (Kumar et al., 2024), and search algorithms such as Monte Carlo Tree Search and Beam Search (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024). However, none of these methods has achieved general reasoning performance comparable to OpenAI's o1 series models.

最近,后训练 (post-training) 已成为完整训练流程中的一个重要组成部分。研究表明,它能够提高推理任务的准确性,符合社会价值观,并适应用户偏好,同时相比预训练 (pre-training) 所需的计算资源相对较少。在推理能力方面,OpenAI 的 o1 系列模型 (OpenAI, 2024b) 首次通过增加思维链 (Chain-of-Thought) 推理过程的长度引入了推理时缩放。这种方法在数学、编码和科学推理等各种推理任务中取得了显著改进。然而,有效的测试时缩放仍然是研究界的一个开放性问题。之前的几项研究探索了各种方法,包括基于过程的奖励模型 (Lightman et al., 2023; Uesato et al., 2022; Wang et al., 2023)、强化学习 (Kumar et al., 2024) 以及蒙特卡洛树搜索和束搜索等搜索算法 (Feng et al., 2024; Trinh et al., 2024; Xin et al., 2024)。然而,这些方法都没有达到与 OpenAI 的 o1 系列模型相当的通用推理性能。

In this paper, we take the first step toward improving language model reasoning capabilities using pure reinforcement learning (RL). Our goal is to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure RL process. Specifically, we use DeepSeek-V3-Base as the base model and employ GRPO (Shao et al., 2024) as the RL framework to improve model performance in reasoning. During training, DeepSeek-R1-Zero naturally emerged with numerous powerful and interesting reasoning behaviors. After thousands of RL steps, DeepSeek-R1-Zero exhibits superior performance on reasoning benchmarks. For instance, the pass@1 score on AIME 2024 increases from $15.6\%$ to $71.0\%$, and with majority voting, the score further improves to $86.7\%$, matching the performance of OpenAI-o1-0912.

在本文中,我们迈出了使用纯强化学习 (RL) 提升语言模型推理能力的第一步。我们的目标是探索大语言模型在没有监督数据的情况下发展推理能力的潜力,重点关注通过纯 RL 过程实现的自我进化。具体而言,我们使用 DeepSeek-V3-Base 作为基础模型,并采用 GRPO (Shao et al., 2024) 作为 RL 框架,以提升模型在推理任务中的表现。在训练过程中,DeepSeek-R1-Zero 自然涌现出许多强大且有趣的推理行为。经过数千步的 RL 训练后,DeepSeek-R1-Zero 在推理基准测试中表现出色。例如,在 AIME 2024 上的 pass@1 分数从 $15.6\%$ 提升至 $71.0\%$,并且在多数投票下,分数进一步提高至 $86.7\%$,与 OpenAI-o1-0912 的表现相当。

However, DeepSeek-R1-Zero encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates a small amount of cold-start data and a multi-stage training pipeline. Specifically, we begin by collecting thousands of cold-start data to fine-tune the DeepSeek-V3-Base model. Following this, we perform reasoning-oriented RL like DeepSeek-R1-Zero. Upon nearing convergence in the RL process, we create new SFT data through rejection sampling on the RL checkpoint, combined with supervised data from DeepSeek-V3 in domains such as writing, factual QA, and self-cognition, and then retrain the DeepSeek-V3-Base model. After fine-tuning with the new data, the checkpoint undergoes an additional RL process, taking into account prompts from all scenarios. After these steps, we obtained a checkpoint referred to as DeepSeek-R1, which achieves performance on par with OpenAI-o1-1217.

然而,DeepSeek-R1-Zero 面临诸如可读性差和语言混合等挑战。为了解决这些问题并进一步提升推理性能,我们引入了 DeepSeek-R1,它结合了少量冷启动数据和多阶段训练流程。具体来说,我们首先收集了数千个冷启动数据来微调 DeepSeek-V3-Base 模型。随后,我们进行了类似于 DeepSeek-R1-Zero 的面向推理的强化学习 (RL)。在 RL 过程接近收敛时,我们通过 RL 检查点上的拒绝采样生成新的 SFT 数据,并结合来自 DeepSeek-V3 的监督数据,涵盖写作、事实问答和自我认知等领域,然后重新训练 DeepSeek-V3-Base 模型。使用新数据进行微调后,检查点会经历额外的 RL 过程,考虑到所有场景的提示。经过这些步骤,我们获得了称为 DeepSeek-R1 的检查点,其性能与 OpenAI-o1-1217 相当。

We further explore distillation from DeepSeek-R1 to smaller dense models. Using Qwen2.5- 32B (Qwen, 2024b) as the base model, direct distillation from DeepSeek-R1 outperforms applying RL on it. This demonstrates that the reasoning patterns discovered by larger base models are crucial for improving reasoning capabilities. We open-source the distilled Qwen and Llama (Dubey et al., 2024) series. Notably, our distilled 14B model outperforms state-of-the-art open-source QwQ-32B-Preview (Qwen, 2024a) by a large margin, and the distilled 32B and 70B models set a new record on the reasoning benchmarks among dense models.

我们进一步探索了从 DeepSeek-R1 到更小的密集模型的蒸馏过程。以 Qwen2.5-32B (Qwen, 2024b) 为基础模型,直接从 DeepSeek-R1 进行蒸馏的表现优于在其上应用强化学习 (RL)。这表明较大基础模型发现的推理模式对于提升推理能力至关重要。我们开源了蒸馏后的 Qwen 和 Llama (Dubey et al., 2024) 系列。值得注意的是,我们蒸馏的 14B 模型大幅超越了目前最先进的开源模型 QwQ-32B-Preview (Qwen, 2024a),而蒸馏的 32B 和 70B 模型在密集模型的推理基准上创造了新的记录。

1.1. Contributions

1.1. 贡献

Post-Training: Large-Scale Reinforcement Learning on the Base Model

后训练:在基础模型上进行大规模强化学习

Distillation: Smaller Models Can Be Powerful Too

蒸馏:小模型也能强大

· We demonstrate that the reasoning patterns of larger models can be distilled into smaller models, resulting in better performance compared to the reasoning patterns discovered through RL on small models. The open-source DeepSeek-R1, as well as its API, will benefit the research community to distill better smaller models in the future.
· Using the reasoning data generated by DeepSeek-R1, we fine-tuned several dense models that are widely used in the research community. The evaluation results demonstrate that the distilled smaller dense models perform exceptionally well on benchmarks. DeepSeek-R1-Distill-Qwen-7B achieves $55.5\%$ on AIME 2024, surpassing QwQ-32B-Preview. Additionally, DeepSeek-R1-Distill-Qwen-32B scores $72.6\%$ on AIME 2024, $94.3\%$ on MATH-500, and $57.2\%$ on LiveCodeBench. These results significantly outperform previous open-source models and are comparable to o1-mini. We open-source distilled 1.5B, 7B, 8B, 14B, 32B, and 70B checkpoints based on Qwen2.5 and Llama3 series to the community.

· 我们展示了可以将较大模型的推理模式蒸馏到较小模型中,与通过强化学习在小模型上发现的推理模式相比,性能更优。开源的 DeepSeek-R1 及其 API 将有助于研究社区在未来蒸馏出更好的小模型。
· 使用 DeepSeek-R1 生成的推理数据,我们对研究社区中广泛使用的多个密集模型进行了微调。评估结果表明,蒸馏后的小型密集模型在基准测试中表现优异。DeepSeek-R1-Distill-Qwen-7B 在 AIME 2024 上达到 55.5%,超过了 QwQ-32B-Preview。此外,DeepSeek-R1-Distill-Qwen-32B 在 AIME 2024 上得分为 72.6%,在 MATH-500 上得分为 94.3%,在 LiveCodeBench 上得分为 57.2%。这些结果显著优于之前的开源模型,并与 o1-mini 相当。我们向社区开源了基于 Qwen2.5 和 Llama3 系列的 1.5B、7B、8B、14B、32B 和 70B 检查点。

1.2. Summary of Evaluation Results

1.2. 评估结果总结

· Others: DeepSeek-R1 also excels in a wide range of tasks, including creative writing, general question answering, editing, summarization, and more. It achieves an impressive length-controlled win-rate of $87.6\%$ on AlpacaEval 2.0 and a win-rate of $92.3\%$ on Arena-Hard, showcasing its strong ability to intelligently handle non-exam-oriented queries. Additionally, DeepSeek-R1 demonstrates outstanding performance on tasks requiring long-context understanding, substantially outperforming DeepSeek-V3 on long-context benchmarks.

· 其他:DeepSeek-R1 还在创意写作、通用问答、编辑、摘要等多种任务中表现出色。它在 AlpacaEval 2.0 上实现了 $87.6\%$ 的长度控制胜率,在 Arena-Hard 上实现了 $92.3\%$ 的胜率,展示了其智能处理非应试类查询的强大能力。此外,DeepSeek-R1 在需要长上下文理解的任务中表现出色,在长上下文基准测试中显著优于 DeepSeek-V3。

2. Approach

2. 方法

2.1. Overview

2.1. 概述

Previous work has heavily relied on large amounts of supervised data to enhance model performance. In this study, we demonstrate that reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, performance can be further enhanced with the inclusion of a small amount of cold-start data. In the following sections, we present: (1) DeepSeek-R1-Zero, which applies RL directly to the base model without any SFT data; (2) DeepSeek-R1, which applies RL starting from a checkpoint fine-tuned with thousands of long Chain-of-Thought (CoT) examples; and (3) distillation of the reasoning capability from DeepSeek-R1 to small dense models.

先前的工作严重依赖大量监督数据来提升模型性能。在本研究中,我们证明通过大规模强化学习 (Reinforcement Learning, RL) 可以显著提高推理能力,即使在不使用监督微调 (Supervised Fine-Tuning, SFT) 作为冷启动的情况下。此外,通过加入少量冷启动数据,性能可以进一步增强。在接下来的部分中,我们将介绍:(1) DeepSeek-R1-Zero,它直接将强化学习应用于基础模型,不使用任何 SFT 数据;(2) DeepSeek-R1,它从经过数千个长链思维 (Chain-of-Thought, CoT) 示例微调的检查点开始应用强化学习;(3) 将 DeepSeek-R1 的推理能力蒸馏到小型密集模型中。

2.2. DeepSeek-R1-Zero: Reinforcement Learning on the Base Model

2.2. DeepSeek-R1-Zero: 基础模型的强化学习

Reinforcement learning has demonstrated significant effectiveness in reasoning tasks, as evidenced by our previous works (Shao et al., 2024; Wang et al., 2023). However, these works heavily depended on supervised data, which are time-intensive to gather. In this section, we explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure reinforcement learning process. We start with a brief overview of our RL algorithm, followed by the presentation of some exciting results, and hope this provides the community with valuable insights.

强化学习在推理任务中已展现出显著的有效性,这在我们之前的工作中得到了证明 (Shao et al., 2024; Wang et al., 2023)。然而,这些工作严重依赖于监督数据,而收集这些数据非常耗时。在本节中,我们探索了大语言模型在没有任何监督数据的情况下发展推理能力的潜力,重点关注它们通过纯粹的强化学习过程进行自我进化。我们首先简要概述我们的强化学习算法,随后展示一些令人振奋的结果,并希望这能为社区提供有价值的见解。

2.2.1. Reinforcement Learning Algorithm

2.2.1. 强化学习算法

Group Relative Policy Optimization In order to save the training costs of RL, we adopt Group Relative Policy Optimization (GRPO) (Shao et al., 2024), which foregoes the critic model that is typically the same size as the policy model, and estimates the baseline from group scores instead. Specifically, for each question $q$, GRPO samples a group of outputs $\{o_{1},o_{2},\cdots,o_{G}\}$ from the old policy $\pi_{\theta_{old}}$ and then optimizes the policy model $\pi_{\theta}$ by maximizing the following objective:

群组相对策略优化 (Group Relative Policy Optimization, GRPO) 为了节省强化学习的训练成本,我们采用 GRPO (Shao et al., 2024),它放弃了通常与策略模型规模相同的评论家 (critic) 模型,转而利用组内得分来估计基线。具体来说,对于每个问题 $q$,GRPO 从旧策略 $\pi_{\theta_{old}}$ 中采样一组输出 $\{o_{1},o_{2},\cdots,o_{G}\}$,然后通过最大化以下目标来优化策略模型 $\pi_{\theta}$:

$$
\mathcal{J}_{GRPO}(\theta)=\mathbb{E}\left[q\sim P(Q),\{o_{i}\}_{i=1}^{G}\sim\pi_{\theta_{old}}(O|q)\right]
\frac{1}{G}\sum_{i=1}^{G}\left(\min\left(\frac{\pi_{\theta}(o_{i}|q)}{\pi_{\theta_{old}}(o_{i}|q)}A_{i},\ \operatorname{clip}\left(\frac{\pi_{\theta}(o_{i}|q)}{\pi_{\theta_{old}}(o_{i}|q)},1-\varepsilon,1+\varepsilon\right)A_{i}\right)-\beta\,\mathbb{D}_{KL}\left(\pi_{\theta}\,\|\,\pi_{ref}\right)\right),
$$

$$
\mathbb{D}_{KL}\left(\pi_{\theta}\,\|\,\pi_{ref}\right)=\frac{\pi_{ref}(o_{i}|q)}{\pi_{\theta}(o_{i}|q)}-\log\frac{\pi_{ref}(o_{i}|q)}{\pi_{\theta}(o_{i}|q)}-1,
$$


where $\varepsilon$ and $\beta$ are hyper-parameters, and $A_{i}$ is the advantage, computed using a group of rewards $\{r_{1},r_{2},\dots,r_{G}\}$ corresponding to the outputs within each group:

其中,$\varepsilon$ 和 $\beta$ 是超参数,$A_{i}$ 是优势,通过每个组内输出对应的一组奖励 $\{r_{1},r_{2},\dots,r_{G}\}$ 计算得出:

$$
A_{i}=\frac{r_{i}-\operatorname{mean}(\{r_{1},r_{2},\cdots,r_{G}\})}{\operatorname{std}(\{r_{1},r_{2},\cdots,r_{G}\})}.
$$
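为帮助理解上式,下面给出一个极简的示意代码(原文并未提供实现,此处仅为基于上式的假设性草图):先由组内奖励计算相对优势 $A_{i}$,再构造带裁剪与 KL 惩罚的 GRPO 目标。其中的张量形状、函数名和默认超参数均为示例假设;为保持简洁,这里直接以序列级对数概率作为输入。

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, eps=0.2, beta=0.04):
    """GRPO 目标的示意实现(假设性草图,非官方代码)。

    logp_new / logp_old / logp_ref: 形状为 [G] 的张量,分别是当前策略、
        旧策略和参考策略对组内 G 个输出 o_i 的对数概率 log pi(o_i|q)。
    rewards: 形状为 [G] 的组内奖励 {r_1, ..., r_G}。
    """
    # 组相对优势:A_i = (r_i - mean(r)) / std(r)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # 重要性采样比率 pi_theta / pi_theta_old 及其裁剪版本
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)

    # KL 惩罚项:pi_ref/pi_theta - log(pi_ref/pi_theta) - 1(逐样本估计)
    log_ratio_ref = logp_ref - logp_new
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # 原目标是最大化,训练时取负号作为损失最小化
    return -(surrogate - beta * kl).mean()
```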

2.2.2. Reward Modeling

2.2.2. 奖励建模

The reward is the source of the training signal, which decides the optimization direction of RL. To train DeepSeek-R1-Zero, we adopt a rule-based reward system that mainly consists of two types of rewards:

奖励是训练信号的来源,决定了RL的优化方向。为了训练DeepSeek-R1-Zero,我们采用了一个基于规则的奖励系统,主要包括两种类型的奖励:

We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

在开发 DeepSeek-R1-Zero 时,我们没有应用结果或过程神经奖励模型,因为我们发现神经奖励模型可能在大规模强化学习过程中出现奖励黑客问题,而且重新训练奖励模型需要额外的训练资源,并使整个训练流程复杂化。
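原文此处对两类规则奖励的细目未包含在本节摘录中(按原论文,它们分别是准确性奖励与格式奖励)。下面给出一个假设性的示意函数,说明这类规则奖励为何不需要神经奖励模型:答案比对与格式检查都是确定性规则。其中的标签形式、分值与正则表达式均为示例假设,并非官方实现。

```python
import re

def rule_based_reward(output: str, ground_truth: str) -> float:
    """规则奖励的示意:格式检查 + 最终答案比对(假设性实现)。"""
    reward = 0.0

    # 格式检查:输出是否先给出推理过程、再给出最终答案
    # (此处假设使用 <think>...</think><answer>...</answer> 标签)
    m = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", output, re.S)
    if m:
        reward += 0.1  # 格式分值仅为示例

        # 准确性检查:最终答案与标准答案做确定性比对(例如数学题的结果)
        if m.group(1).strip() == ground_truth.strip():
            reward += 1.0

    return reward
```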

2.2.3. Training Template

2.2.3. 训练模板

To train DeepSeek-R1-Zero, we begin by designing a straightforward template that guides the base model to adhere to our specified instructions. As depicted in Table 1, this template requires DeepSeek-R1-Zero to first produce a reasoning process, followed by the final answer. We intentionally limit our constraints to this structural format, avoiding any content-specific biases (such as mandating reflective reasoning or promoting particular problem-solving strategies) to ensure that we can accurately observe the model's natural progression during the RL process.

为了训练 DeepSeek-R1-Zero,我们首先设计了一个简单的模板,指导基础模型遵循我们指定的指令。如表 1 所示,该模板要求 DeepSeek-R1-Zero 先生成推理过程,然后给出最终答案。我们有意将约束限制在这种结构格式上,避免任何特定内容的偏见——例如强制进行反思性推理或推广特定的问题解决策略——以确保我们能够准确观察模型在强化学习过程中的自然进展。
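表 1 的具体模板文本未包含在本摘录中。下面给出一个假设性的模板草图,仅用于说明这种"只约束输出结构(先推理、后答案),不约束推理内容"的提示方式;标签与措辞均为示例假设,未必与原文表 1 完全一致。

```python
# 假设性的训练模板草图(表 1 原文未在本摘录中给出)
PROMPT_TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks "
    "about the reasoning process and then provides the final answer. "
    "The reasoning process and answer are enclosed within <think> </think> "
    "and <answer> </answer> tags, respectively.\n"
    "User: {question}\n"
    "Assistant:"
)

def build_prompt(question: str) -> str:
    """将问题填入结构化模板,除格式外不施加任何内容约束。"""
    return PROMPT_TEMPLATE.format(question=question)
```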

2.2.4. Performance, Self-evolution Process and Aha Moment of DeepSeek-R1-Zero

2.2.4. DeepSeek-R1-Zero 的性能、自我进化过程与顿悟时刻

Performance of DeepSeek-R1-Zero Figure 2 depicts the performance trajectory of DeepSeek-R1-Zero on the AIME 2024 benchmark throughout the RL training process. As illustrated, DeepSeek-R1-Zero demonstrates a steady and consistent enhancement in performance as the RL training advances. Notably, the average pass@1 score on AIME 2024 shows a significant increase, jumping from an initial $15.6\%$ to an impressive $71.0\%$, reaching performance levels comparable to OpenAI-o1-0912. This significant improvement highlights the efficacy of our RL algorithm in optimizing the model's performance over time.

DeepSeek-R1-Zero 的性能表现 图 2 展示了 DeepSeek-R1-Zero 在整个 RL 训练过程中于 AIME 2024 基准测试上的性能轨迹。如图所示,随着 RL 训练的推进,DeepSeek-R1-Zero 表现出持续且稳定的性能提升。值得注意的是,AIME 2024 上的平均 pass@1 分数显著增加,从最初的 $15.6\%$ 跃升至令人印象深刻的 $71.0\%$,达到了与 OpenAI-o1-0912 相当的性能水平。这一显著提升突显了我们的 RL 算法在优化模型性能方面的有效性。

Table 2 provides a comparative analysis between DeepSeek-R1-Zero and OpenAI's o1-0912 models across a variety of reasoning-related benchmarks. The findings reveal that RL empowers

表 2 对 DeepSeek-R1-Zero 与 OpenAI 的 o1-0912 模型在多种推理相关基准上进行了对比分析。结果表明,强化学习使得

| 模型 | AIME 2024 (pass@1) | AIME 2024 (cons@64) | MATH-500 (pass@1) | GPQA Diamond (pass@1) | LiveCodeBench (pass@1) | CodeForces (rating) |
| --- | --- | --- | --- | --- | --- | --- |
| OpenAI-o1-mini | 63.6 | 80.0 | 90.0 | 60.0 | 53.8 | 1820 |
| OpenAI-o1-0912 | 74.4 | 83.3 | 94.8 | 77.3 | 63.4 | 1843 |
| DeepSeek-R1-Zero | 71.0 | 86.7 | 95.9 | 73.3 | 50.0 | 1444 |

Table 2 | Comparison of DeepSeek-R1-Zero and OpenAI o1 models on reasoning-related benchmarks.

表 2: DeepSeek-R1-Zero 和 OpenAI o1 模型在推理相关基准测试中的对比


Figure 2 | AIME accuracy of DeepSeek-R1-Zero during training. For each question, we sample 16 responses and calculate the overall average accuracy to ensure a stable evaluation.

图 2 | DeepSeek-R1-Zero 训练期间的 AIME 准确率。对于每个问题,我们采样 16 个响应并计算整体平均准确率,以确保评估的稳定性。

DeepSeek-R1-Zero to attain robust reasoning capabilities without the need for any supervised fine-tuning data. This is a noteworthy achievement, as it underscores the model's ability to learn and generalize effectively through RL alone. Additionally, the performance of DeepSeek-R1-Zero can be further augmented through the application of majority voting. For example, when majority voting is employed on the AIME benchmark, DeepSeek-R1-Zero's performance escalates from $71.0\%$ to $86.7\%$, thereby exceeding the performance of OpenAI-o1-0912. The ability of DeepSeek-R1-Zero to achieve such competitive performance, both with and without majority voting, highlights its strong foundational capabilities and its potential for further advancements in reasoning tasks.

DeepSeek-R1-Zero 无需任何监督微调数据即可获得强大的推理能力。这是一个值得注意的成就,因为它强调了模型仅通过强化学习 (RL) 就能有效学习和泛化的能力。此外,通过应用多数投票 (majority voting),DeepSeek-R1-Zero 的性能可以进一步增强。例如,在 AIME 基准测试中应用多数投票时,DeepSeek-R1-Zero 的性能从 $71.0\%$ 提升至 $86.7\%$,从而超过了 OpenAI-o1-0912 的表现。DeepSeek-R1-Zero 无论是否使用多数投票,都能取得如此具有竞争力的性能,这凸显了其强大的基础能力及其在推理任务中进一步发展的潜力。
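这里的多数投票(即表 2 中的 cons@64)可以用如下示意代码表达:对同一问题采样多条回答,取出现次数最多的最终答案作为预测。函数名与归一化方式均为示例假设。

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """对同一问题的多条采样回答做多数投票(示意实现)。

    answers: 对同一道题采样得到的最终答案列表,例如 64 条。
    """
    # 简单归一化,避免空格、大小写差异把同一答案拆成多个候选
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

# 用法示意:cons@64 即对 64 条采样答案投票后再与标准答案比对
# prediction = majority_vote(sampled_answers)
# correct = (prediction == gold_answer.strip().lower())
```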

Self-evolution Process of DeepSeek-R1-Zero The self-evolution process of DeepSeek-R1-Zero is a fascinating demonstration of how RL can drive a model to improve its reasoning capabilities autonomously. By initiating RL directly from the base model, we can closely monitor the model's progression without the influence of the supervised fine-tuning stage. This approach provides a clear view of how the model evolves over time, particularly in terms of its ability to handle complex reasoning tasks.

DeepSeek-R1-Zero 的自我进化过程 DeepSeek-R1-Zero 的自我进化过程生动地展示了强化学习如何驱动模型自主提升其推理能力。通过直接从基础模型启动强化学习,我们可以在不受监督微调阶段影响的情况下密切监测模型的进展。这种方法清晰地展示了模型如何随时间演化,尤其是在处理复杂推理任务的能力方面。

As depicted in Figure 3, the thinking time of DeepSeek-R1-Zero shows consistent improvement throughout the training process. This improvement is not the result of external adjustments but rather an intrinsic development within the model. DeepSeek-R1-Zero naturally acquires the ability to solve increasingly complex reasoning tasks by leveraging extended test-time computation. This computation ranges from generating hundreds to thousands of reasoning tokens, allowing the model to explore and refine its thought processes in greater depth.

如图 3 所示,DeepSeek-R1-Zero 的思考时间在整个训练过程中持续提升。这种提升并非来自外部调整,而是模型内在的发展。DeepSeek-R1-Zero 通过利用延长的测试时间计算,自然地获得了解决日益复杂推理任务的能力。这种计算范围从生成数百到数千个推理 Token,使模型能够更深入地探索和完善其思考过程。


Figure 3 | The average response length of DeepSeek-R1-Zero on the training set during the RL process. DeepSeek-R1-Zero naturally learns to solve reasoning tasks with more thinking time.

图 3: DeepSeek-R1-Zero 在 RL 过程中训练集上的平均响应长度。DeepSeek-R1-Zero 自然地学会了通过更多的思考时间来解决推理任务。

One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection, where the model revisits and reevaluates its previous steps, and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model's interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero's reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.

这种自我进化最显著的一个方面是,随着测试时计算量的增加,出现了复杂的行为。例如,模型会进行反思(reflection)——重新审视和评估其先前的步骤——并自发地探索解决问题的替代方法。这些行为并非显式编程,而是模型与强化学习环境交互的结果。这种自发的发展显著增强了 DeepSeek-R1-Zero 的推理能力,使其能够更高效、更准确地处理更具挑战性的任务。

Aha Moment of DeepSeek-R1-Zero A particularly intriguing phenomenon observed during the training of DeepSeek-R1-Zero is the occurrence of an "aha moment". This moment, as illustrated in Table 3, occurs in an intermediate version of the model. During this phase, DeepSeek-R1-Zero learns to allocate more thinking time to a problem by reevaluating its initial approach. This behavior is not only a testament to the model's growing reasoning abilities but also a captivating example of how reinforcement learning can lead to unexpected and sophisticated outcomes.

DeepSeek-R1-Zero 的顿悟时刻 在 DeepSeek-R1-Zero 的训练过程中观察到的一个特别有趣的现象是"顿悟时刻 (aha moment)"的出现。如表 3 所示,这一时刻出现在模型的某个中间版本中。在这一阶段,DeepSeek-R1-Zero 学会了通过重新评估初始方法来为一个问题分配更多的思考时间。这种行为不仅证明了模型推理能力的不断增长,也是强化学习能够带来意想不到的复杂结果的一个引人入胜的例子。

This moment is not only an "aha moment" for the model but also for the researchers observing its behavior. It underscores the power and beauty of reinforcement learning: rather than explicitly teaching the model on how to solve a problem, we simply provide it with the right incentives, and it autonomously develops advanced problem-solving strategies. The "aha moment" serves as a powerful reminder of the potential of RL to unlock new levels of intelligence in artificial systems, paving the way for more autonomous and adaptive models in the future.

这一刻不仅是模型的“顿悟时刻”,也是观察其行为的研究人员的“顿悟时刻”。它突显了强化学习的力量与美感:我们无需明确教导模型如何解决问题,只需提供正确的激励,它便能自主发展出高级的问题解决策略。这一“顿悟时刻”有力地提醒了强化学习在解锁人工智能系统新层次智能方面的潜力,为未来更自主、更自适应模型的开发铺平了道路。


Table 3 | An interesting "aha moment" of an intermediate version of DeepSeek-R1-Zero. The model learns to rethink using an anthropomorphic tone. This is also an aha moment for us, allowing us to witness the power and beauty of reinforcement learning.

表 3: DeepSeek-R1-Zero 中间版本的一个有趣的“顿悟时刻”。模型学会了以拟人化的语气重新思考。这也是我们的一个顿悟时刻,让我们见证了强化学习的力量和美妙。

Drawback of DeepSeek-R1-Zero Although DeepSeek-R1-Zero exhibits strong reasoning capabilities and autonomously develops unexpected and powerful reasoning behaviors, it faces several issues. For instance, DeepSeek-R1-Zero struggles with challenges like poor readability and language mixing. To make reasoning processes more readable and share them with the open community, we explore DeepSeek-R1, a method that utilizes RL with human-friendly cold-start data.

DeepSeek-R1-Zero 的缺点 尽管 DeepSeek-R1-Zero 展现出强大的推理能力,并自主发展出意想不到且强大的推理行为,但它仍面临一些问题。例如,DeepSeek-R1-Zero 在可读性差和语言混杂等方面存在挑战。为了使推理过程更具可读性并与开放社区分享,我们探索了 DeepSeek-R1,这是一种利用对人类友好的冷启动数据进行强化学习的方法。

2.3. DeepSeek-R1: Reinforcement Learning with Cold Start

2.3. DeepSeek-R1: 冷启动下的强化学习

Inspired by the promising results of DeepSeek-R1-Zero, two natural questions arise: 1) Can reasoning performance be further improved or convergence accelerated by incorporating a small amount of high-quality data as a cold start? 2) How can we train a user-friendly model that not only produces clear and coherent Chains of Thought (CoT) but also demonstrates strong general capabilities? To address these questions, we design a pipeline to train DeepSeek-R1. The pipeline consists of four stages, outlined as follows.

受 DeepSeek-R1-Zero 的优异结果启发,两个自然问题随之而来:1) 通过引入少量高质量数据作为冷启动,能否进一步提升推理性能或加速收敛?2) 如何训练一个用户友好的模型,使其不仅能生成清晰连贯的思维链 (Chains of Thought, CoT),还能展示强大的通用能力?为回答这些问题,我们设计了一个训练 DeepSeek-R1 的流程。该流程包括以下四个阶段。

2.3.1. Cold Start

2.3.1. 冷启动

Unlike DeepSeek-R1-Zero, to prevent the early unstable cold start phase of RL training from the base model, for DeepSeek-R1 we construct and collect a small amount of long CoT data to fine-tune the model as the initial RL actor. To collect such data, we have explored several approaches: using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1- Zero outputs in a readable format, and refining the results through post-processing by human annotators.

与 DeepSeek-R1-Zero 不同,为了防止从基础模型开始的强化学习 (RL) 训练早期的冷启动阶段不稳定,我们为 DeepSeek-R1 构建并收集了少量长链思维 (CoT) 数据来微调模型,作为初始的 RL 智能体。为了收集这些数据,我们探索了多种方法:使用少样本提示并以长链思维为例,直接提示模型生成带有反思和验证的详细答案,以可读格式收集 DeepSeek-R1-Zero 的输出,并通过人工标注员的后处理进行结果优化。

In this work, we collect thousands of cold-start data to fine-tune the DeepSeek-V3-Base as the starting point for RL. Compared to DeepSeek-R1-Zero, the advantages of cold-start data include:

在本工作中,我们收集了数千条冷启动数据来微调 DeepSeek-V3-Base,作为强化学习的起点。与 DeepSeek-R1-Zero 相比,冷启动数据的优势包括:

2.3.2. Reasoning-oriented Reinforcement Learning

2.3.2. 面向推理的强化学习

After fine-tuning DeepSeek-V3-Base on the cold start data, we apply the same large-scale reinforcement learning training process as employed in DeepSeek-R1-Zero. This phase focuses on enhancing the model's reasoning capabilities, particularly in reasoning-intensive tasks such as coding, mathematics, science, and logic reasoning, which involve well-defined problems with clear solutions. During the training process, we observe that CoT often exhibits language mixing, particularly when RL prompts involve multiple languages. To mitigate the issue of language mixing, we introduce a language consistency reward during RL training, which is calculated as the proportion of target language words in the CoT. Although ablation experiments show that such alignment results in a slight degradation in the model's performance, this reward aligns with human preferences, making it more readable. Finally, we combine the accuracy of reasoning tasks and the reward for language consistency by directly summing them to form the final reward. We then apply RL training on the fine-tuned model until it achieves convergence on reasoning tasks.

在对冷启动数据进行 DeepSeek-V3-Base 微调后,我们采用了与 DeepSeek-R1-Zero 相同的大规模强化学习训练流程。这一阶段主要增强模型的推理能力,特别是在编码、数学、科学和逻辑推理等推理密集型任务中,这些任务涉及具有明确解决方案的清晰问题。在训练过程中,我们观察到 CoT 经常出现语言混合现象,尤其是在 RL 提示涉及多种语言时。为了缓解语言混合问题,我们在 RL 训练中引入了语言一致性奖励,该奖励通过计算 CoT 中目标语言词汇的比例来确定。尽管消融实验表明,这种对齐会导致模型性能略有下降,但该奖励符合人类偏好,使其更具可读性。最后,我们将推理任务的准确性与语言一致性奖励直接相加,形成最终奖励。然后,我们对微调后的模型进行 RL 训练,直到其在推理任务上达到收敛。
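下面用一个假设性的草图说明"目标语言词占比"这一语言一致性奖励的计算方式,以及与推理准确性奖励直接相加得到最终奖励;其中按 Unicode 区间区分中英文词的做法仅为示例,并非原文实现。

```python
import re

def language_consistency_reward(cot: str, target_lang: str = "zh") -> float:
    """语言一致性奖励示意:CoT 中目标语言词所占比例(假设性实现)。"""
    tokens = re.findall(r"[\u4e00-\u9fff]+|[A-Za-z]+", cot)
    if not tokens:
        return 0.0
    if target_lang == "zh":
        target = [t for t in tokens if re.match(r"[\u4e00-\u9fff]", t)]
    else:
        target = [t for t in tokens if re.match(r"[A-Za-z]", t)]
    return len(target) / len(tokens)

def final_reward(accuracy_reward: float, cot: str, target_lang: str) -> float:
    # 按原文描述:将推理准确性奖励与语言一致性奖励直接相加作为最终奖励
    return accuracy_reward + language_consistency_reward(cot, target_lang)
```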

2.3.3. Rejection Sampling and Supervised Fine-Tuning

2.3.3. 拒绝采样和监督微调

When reasoning-oriented RL converges, we utilize the resulting checkpoint to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model's capabilities in writing, role-playing, and other general-purpose tasks. Specifically, we generate the data and fine-tune the model as described below.

当面向推理的强化学习收敛时,我们利用生成的检查点收集 SFT(监督微调)数据以用于下一轮。与主要关注推理的初始冷启动数据不同,此阶段引入了来自其他领域的数据,以增强模型在写作、角色扮演和其他通用任务中的能力。具体来说,我们按照以下方式生成数据并微调模型。

Reasoning data We curate reasoning prompts and generate reasoning trajectories by performing rejection sampling from the checkpoint of the above RL training. In the previous stage, we only included data that could be evaluated using rule-based rewards. However, in this stage, we expand the dataset by incorporating additional data, some of which use a generative reward model by feeding the ground-truth and model predictions into DeepSeek-V3 for judgment. Additionally, because the model output is sometimes chaotic and difficult to read, we have filtered out chains of thought with mixed languages, long paragraphs, and code blocks. For each prompt, we sample multiple responses and retain only the correct ones. In total, we collect about 600k reasoning-related training samples.

推理数据 我们从上述强化学习训练的检查点中进行拒绝采样,整理推理提示并生成推理轨迹。在前一阶段,我们仅包含可以使用基于规则的奖励进行评估的数据。然而,在这一阶段,我们通过纳入更多数据来扩展数据集,其中一些数据通过将真实值和模型预测输入DeepSeek-V3进行判断来使用生成式奖励模型。此外,由于模型输出有时混乱且难以阅读,我们过滤掉了混合语言的思维链、长段落和代码块。对于每个提示,我们采样多个响应并仅保留正确的响应。我们总共收集了约60万条与推理相关的训练样本。
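这一拒绝采样流程可以用如下假设性草图概括:对每个提示采样多条回答,仅保留被判定为正确的,并过滤可读性差的思维链;其中的采样函数、判定函数与过滤阈值均为示例假设。

```python
def curate_reasoning_sft_data(prompts, sample_fn, is_correct_fn,
                              num_samples=16, max_chars=20000):
    """用拒绝采样构造推理 SFT 数据的示意流程(假设性实现,非官方代码)。

    sample_fn(prompt, n): 从 RL 检查点为某个提示采样 n 条回答。
    is_correct_fn(prompt, response): 判定回答是否正确(规则比对,或按原文
        将标准答案与模型预测交给 DeepSeek-V3 作为生成式奖励模型判定)。
    """
    dataset = []
    for prompt in prompts:
        for response in sample_fn(prompt, num_samples):
            # 过滤超长输出和含代码块的思维链;语言混杂的过滤可复用
            # 前文的语言一致性判断,此处从略
            if "```" in response or len(response) > max_chars:
                continue
            if not is_correct_fn(prompt, response):
                continue
            dataset.append({"prompt": prompt, "response": response})
    return dataset
```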

Non-Reasoning data For non-reasoning data, such as writing, factual QA, self-cognition, and translation, we adopt the DeepSeek-V3 pipeline and reuse portions of the SFT dataset of DeepSeek-V3. For certain non-reasoning tasks, we call DeepSeek-V3 to generate a potential chain-of-thought before answering the question by prompting. However, for simpler queries, such as "hello", we do not provide a CoT in response. In the end, we collected a total of approximately 200k training samples that are unrelated to reasoning.

非推理数据:对于非推理数据,如写作、事实问答、自我认知和翻译,我们采用 DeepSeek-V3 的流程,并复用 DeepSeek-V3 的部分 SFT 数据集。对于某些非推理任务,我们通过提示调用 DeepSeek-V3 生成潜在的思维链,然后再回答问题。然而,对于较简单的查询,例如 "hello",我们不会提供 CoT 作为回应。最终,我们收集了大约 20 万个与推理无关的训练样本。

We fine-tune DeepSeek-V3-Base for two epochs using the above curated dataset of about 800k samples.

我们使用上述约80万样本的精选数据集对DeepSeek-V3-Base进行了两个epoch的微调。

2.3.4. Reinforcement Learning for all Scenarios

2.3.4. 全场景的强化学习

To further align the model with human preferences, we implement a secondary reinforcement learning stage aimed at improving the model's helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, we train the model using a combination of reward signals and diverse prompt distributions. For reasoning data, we adhere to the methodology outlined in DeepSeek-R1-Zero, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains. For general data, we resort to reward models to capture human preferences in complex and nuanced scenarios. We build upon the DeepSeek-V3 pipeline and adopt a similar distribution of preference pairs and training prompts. For helpfulness, we focus exclusively on the final summary, ensuring that the assessment emphasizes the utility and relevance of the response to the user while minimizing interference with the underlying reasoning process. For harmlessness, we evaluate the entire response of the model, including both the reasoning process and the summary, to identify and mitigate any potential risks, biases, or harmful content that may arise during the generation process. Ultimately, the integration of reward signals and diverse data distributions enables us to train a model that excels in reasoning while prioritizing helpfulness and harmlessness.

为了进一步使模型与人类偏好对齐,我们实施了第二阶段的强化学习,旨在提高模型的有用性和无害性,同时提升其推理能力。具体而言,我们结合奖励信号和多样化的提示分布来训练模型。对于推理数据,我们遵循DeepSeek-R1-Zero中概述的方法,利用基于规则的奖励来指导数学、代码和逻辑推理领域的学习过程。对于通用数据,我们使用奖励模型来捕捉人类在复杂和微妙场景中的偏好。我们在DeepSeek-V3流程的基础上,采用了类似的偏好对和训练提示分布。对于有用性,我们仅关注最终总结,确保评估强调响应对用户的实用性和相关性,同时最小化对底层推理过程的干扰。对于无害性,我们评估模型的整个响应,包括推理过程和总结,以识别和缓解生成过程中可能出现的任何潜在风险、偏见或有害内容。最终,通过整合奖励信号和多样化的数据分布,我们能够训练出一个在推理方面表现出色,同时优先考虑有用性和无害性的模型。

2.4. Distillation: Empower Small Models with Reasoning Capability

2.4. 蒸馏:赋能小模型具备推理能力

To equip more efficient smaller models with reasoning capabilities like DeepSeek-R1, we directly fine-tuned open-source models like Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) using the 800k samples curated with DeepSeek-R1, as detailed in $\S2.3.3$. Our findings indicate that this straightforward distillation method significantly enhances the reasoning abilities of smaller models. The base models we use here are Qwen2.5-Math-1.5B, Qwen2.5-Math-7B, Qwen2.5-14B, Qwen2.5-32B, Llama-3.1-8B, and Llama-3.3-70B-Instruct. We select Llama-3.3 because its reasoning capability is slightly better than that of Llama-3.1.

为了让更高效的小模型具备类似 DeepSeek-R1 的推理能力,我们直接使用 DeepSeek-R1 筛选的 80 万样本对开源模型 Qwen (Qwen, 2024b) 和 Llama (AI@Meta, 2024) 进行了微调,详见 $\S2.3.3$。我们的研究表明,这种简单的蒸馏方法显著提升了小模型的推理能力。我们使用的基础模型包括 Qwen2.5-Math-1.5B、Qwen2.5-Math-7B、Qwen2.5-14B、Qwen2.5-32B、Llama-3.1-8B 和 Llama-3.3-70B-Instruct。我们选择 Llama-3.3 是因为其推理能力略优于 Llama-3.1。

For distilled models, we apply only SFT and do not include an RL stage, even though incorporating RL could substantially boost model performance. Our primary goal here is to demonstrate the effectiveness of the distillation technique, leaving the exploration of the RL stage to the broader research community.

对于蒸馏模型,我们仅应用SFT(监督微调),不包括RL(强化学习)阶段,尽管加入RL可能会显著提升模型性能。我们的主要目标是展示蒸馏技术的有效性,将RL阶段的探索留给更广泛的研究社区。
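这里的蒸馏就是标准的 SFT:用 DeepSeek-R1 生成的约 80 万条样本,对较小的基座模型做两个 epoch 的监督微调。下面给出单条样本损失计算的示意(假设使用 Hugging Face transformers;掩码方式与模型名均为示例,原文未给出实现细节)。

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def sft_step(model, tokenizer, prompt: str, response: str, device: str = "cuda"):
    """蒸馏即 SFT:只在教师(DeepSeek-R1)生成的回答部分计算交叉熵(示意)。"""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response + tokenizer.eos_token,
                         return_tensors="pt").input_ids

    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # 提示部分不参与损失(近似对齐)

    outputs = model(input_ids=full_ids.to(device), labels=labels.to(device))
    return outputs.loss  # 标准的下一个 Token 交叉熵

# 用法示意(模型名仅为示例):
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B")
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-14B").to("cuda")
# loss = sft_step(model, tokenizer, prompt, response); loss.backward()
```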

3. Experiment

3. 实验

Benchmarks We evaluate models on MMLU (Hendrycks et al., 2020), MMLU-Redux (Gema et al., 2024), MMLU-Pro (Wang et al., 2024), C-Eval (Huang et al., 2023), and CMMLU (Li et al., 2023), IFEval (Zhou et al., 2023), FRAMES (Krishna et al., 2024), GPQA Diamond (Rein et al., 2023), SimpleQA (OpenAI, 2024c), C-SimpleQA (He et al., 2024), SWE-Bench Verified (OpenAI,

我们在 MMLU (Hendrycks et al., 2020)、MMLU-Redux (Gema et al., 2024)、MMLU-Pro (Wang et al., 2024)、C-Eval (Huang et al., 2023)、CMMLU (Li et al., 2023)、IFEval (Zhou et al., 2023)、FRAMES (Krishna et al., 2024)、GPQA Diamond (Rein et al., 2023)、SimpleQA (OpenAI, 2024c)、C-SimpleQA (He et al., 2024)、SWE-Bench Verified (OpenAI,

2024d), Aider, LiveCodeBench (Jain et al., 2024) (2024-08 to 2025-01), Codeforces, Chinese National High School Mathematics Olympiad (CNMO 2024), and American Invitational Mathematics Examination 2024 (AIME 2024) (MAA, 2024). In addition to standard benchmarks, we also evaluate our models on open-ended generation tasks using LLMs as judges. Specifically, we adhere to the original configurations of AlpacaEval 2.0 (Dubois et al., 2024) and Arena-Hard (Li et al., 2024), which leverage GPT-4-Turbo-1106 as the judge for pairwise comparisons. Here, we only feed the final summary to evaluation to avoid the length bias. For distilled models, we report representative results on AIME 2024, MATH-500, GPQA Diamond, Codeforces, and LiveCodeBench.

2024d)、Aider、LiveCodeBench (Jain et al., 2024)(2024-08 至 2025-01)、Codeforces、中国高中数学奥林匹克 (CNMO 2024) 以及 2024 年美国数学邀请赛 (AIME 2024) (MAA, 2024) 上评估模型。除了标准基准测试外,我们还使用大语言模型作为评判者,在开放式生成任务上评估我们的模型。具体来说,我们遵循 AlpacaEval 2.0 (Dubois et al., 2024) 和 Arena-Hard (Li et al., 2024) 的原始配置,它们使用 GPT-4-Turbo-1106 作为成对比较的评判者。在这里,我们仅将最终摘要提供给评估,以避免长度偏差。对于蒸馏模型,我们报告了 AIME 2024、MATH-500、GPQA Diamond、Codeforces 和 LiveCodeBench 上的代表性结果。

Evaluation Prompts Following the setup in DeepSeek-V3, standard benchmarks such as MMLU, DROP, GPQA Diamond, and SimpleQA are evaluated using prompts from the simple-evals framework. For MMLU-Redux, we adopt the Zero-Eval prompt format (Lin, 2024) in a zero-shot setting. In terms of MMLU-Pro, C-Eval and CLUE-WSC, since the original prompts are few-shot, we slightly modify the prompt to the zero-shot setting, as the CoT in few-shot prompts may hurt the performance of DeepSeek-R1. Other datasets follow their original evaluation protocols with default prompts provided by their creators. For code and math benchmarks, the HumanEval-Mul dataset covers eight mainstream programming languages (Python, Java, C++, C#, JavaScript, TypeScript, PHP, and Bash). Model performance on LiveCodeBench is evaluated using CoT format, with data collected between August 2024 and January 2025. The Codeforces dataset is evaluated using problems from 10 Div.2 contests along with expert-crafted test cases, after which the expected ratings and percentages of competitors are calculated. SWE-Bench Verified results are obtained via the agentless framework (Xia et al., 2024). Aider-related benchmarks are measured using a "diff" format. DeepSeek-R1 outputs are capped at a maximum of 32,768 tokens for each benchmark.

评估提示

遵循 DeepSeek-V3 的设置,MMLU、DROP、GPQA Diamond 和 SimpleQA 等标准基准使用 simple-evals 框架中的提示进行评估。对于 MMLU-Redux,我们在零样本设置中采用 Zero-Eval 提示格式 (Lin, 2024)。对于 MMLU-Pro、C-Eval 和 CLUE-WSC,由于原始提示是少样本的,我们将其略微修改为零样本设置,因为少样本提示中的 CoT 可能会损害 DeepSeek-R1 的性能。其他数据集遵循其原始评估协议,并使用其创建者提供的默认提示。对于代码和数学基准,HumanEval-Mul 数据集涵盖了八种主流编程语言 (Python、Java、C++、C#、JavaScript、TypeScript、PHP 和 Bash)。模型在 LiveCodeBench 上的性能使用 CoT 格式进行评估,数据收集于 2024 年 8 月至 2025 年 1 月之间。Codeforces 数据集使用 10 场 Div.2 竞赛中的问题以及专家编写的测试用例进行评估,之后计算预期评分和参赛者百分比。SWE-Bench Verified 的结果通过 agentless 框架 (Xia et al., 2024) 获得。Aider 相关基准使用 "diff" 格式进行测量。DeepSeek-R1 在每个基准上的输出上限为 32,768 个 Token。

Baselines We conduct comprehensive evaluations against several strong baselines, including DeepSeek-V3, Claude-Sonnet-3.5-1022, GPT-4o-0513, OpenAI-o1-mini, and OpenAI-o1-1217. Since accessing the OpenAI-o1-1217 API is challenging in mainland China, we report its performance based on official reports. For distilled models, we also compare the open-source model QwQ-32B-Preview (Qwen, 2024a).

基线 我们对多个强大的基线进行了全面评估,包括 DeepSeek-V3、Claude-Sonnet-3.5-1022、GPT-4o-0513、OpenAI-o1-mini 和 OpenAI-o1-1217。由于在中国大陆访问 OpenAI-o1-1217 API 较为困难,我们根据官方报告来报告其性能。对于蒸馏模型,我们还比较了开源模型 QwQ-32B-Preview (Qwen, 2024a)。

Evaluation Setup We set the maximum generation length to 32,768 tokens for the models. We found that using greedy decoding to evaluate long-output reasoning models results in higher repetition rates and significant variability across different checkpoints. Therefore, we default to pass@$k$ evaluation (Chen et al., 2021) and report pass@1 using a non-zero temperature. Specifically, we use a sampling temperature of 0.6 and a top-$p$ value of 0.95 to generate $k$ responses (typically between 4 and 64, depending on