[论文翻译]PIKE-RAG:专业知识与推理增强生成


原文地址:https://arxiv.org/pdf/2501.11551v2


PIKE-RAG: sPecIalized KnowledgE and Rationale Augmented Generation

PIKE-RAG:专业知识与推理增强生成

Abstract

摘要

Despite notable advancements in Retrieval-Augmented Generation (RAG) systems that expand large language model (LLM) capabilities through external retrieval, these systems often struggle to meet the complex and diverse needs of real-world industrial applications. Reliance on retrieval alone proves insufficient for extracting deep, domain-specific knowledge and performing logical reasoning over specialized corpora. To address this, we introduce sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG), focusing on extracting, understanding, and applying specialized knowledge, while constructing coherent rationale to incrementally steer LLMs toward accurate responses. Recognizing the diverse challenges of industrial tasks, we introduce a new paradigm that classifies tasks based on their complexity in knowledge extraction and application, allowing for a systematic evaluation of RAG systems’ problem-solving capabilities. This strategic approach offers a roadmap for the phased development and enhancement of RAG systems, tailored to meet the evolving demands of industrial applications. Furthermore, we propose knowledge atomizing and knowledge-aware task decomposition to effectively extract multifaceted knowledge from data chunks and to iteratively construct the rationale based on the original query and the accumulated knowledge, respectively, showcasing exceptional performance across various benchmarks. The code is publicly available at https://github.com/microsoft/PIKE-RAG.

尽管检索增强生成 (RAG) 系统通过外部检索扩展了大语言模型 (LLM) 的能力,并取得了显著的进展,但这些系统往往难以满足现实工业应用中复杂多样的需求。仅依赖检索在从专业语料库中提取深度领域知识并进行逻辑推理方面显得不足。为了解决这一问题,我们引入了专业化知识与推理增强生成 (PIKE-RAG),专注于提取、理解和应用专业化知识,同时构建连贯的推理,逐步引导大语言模型生成准确的响应。认识到工业任务的多样化挑战,我们引入了一种新范式,根据任务在知识提取和应用中的复杂性进行分类,从而系统评估 RAG 系统的解决问题能力。这一策略为 RAG 系统的阶段性开发和增强提供了路线图,以满足工业应用不断变化的需求。此外,我们提出了知识原子化和知识感知任务分解,以有效从数据块中提取多方面的知识,并分别基于原始查询和累积知识迭代构建推理,在各种基准测试中展示了卓越的性能。代码公开在 https://github.com/microsoft/PIKE-RAG

1 Introduction

1 引言

Large Language Models (LLMs) have revolutionized the field of natural language processing by demonstrating the capability to generate coherent and contextually relevant text. These advanced models are trained on expansive corpora, equipping them with the versatility to execute a diverse spectrum of linguistic tasks, ranging from text completion to translation and summarization [5, 9, 50, 6]. Despite their broad capabilities, LLMs exhibit pronounced limitations when tasked with specialized queries in professional domains [38, 54], a demand that is particularly acute in industrial applications. This primarily stems from the scarcity of domain-specific training material and a limited grasp of specialized knowledge and rationale within these domains. As a result, LLMs may produce responses that are not only potentially erroneous but also lack the detail and precision required for expert-level engagement [11]. Beyond these limitations on domain-specific tasks, another striking issue with LLMs is the phenomenon known as "hallucination", where the model generates information that is not grounded in reality or factual data [10, 57]. Moreover, the knowledge base of LLMs, being static and crystallized at the point of their last update, introduces temporal stasis [13]. Further compounding these challenges is the issue of long-context comprehension [37]. Existing LLMs struggle to maintain an understanding of task definitions across long contexts, and their performance tends to deteriorate significantly when confronted with more complex and demanding tasks.

大语言模型 (LLMs) 通过展示生成连贯且上下文相关文本的能力,彻底改变了自然语言处理领域。这些先进的模型在广泛的语料库上进行训练,使其具备执行多种语言任务的多功能性,从文本补全到翻译和摘要 [5, 9, 50, 6]。尽管它们具备广泛的能力,但在处理专业领域的特定查询时,LLMs 表现出明显的局限性 [38, 54],这种需求在工业应用中尤为迫切。这主要源于领域特定训练材料的稀缺以及对这些领域专业知识和原理的有限理解。因此,LLMs 可能会生成不仅可能错误,而且缺乏专家级参与所需的细节和精度的响应 [11]。除了在领域特定任务中的局限性外,LLMs 另一个显著问题是所谓的“幻觉”现象,即模型生成的信息不基于现实或事实数据 [10, 57]。此外,LLMs 的知识库在其最后一次更新时是静态和固化的,这引入了时间停滞 [13]。进一步加剧这些挑战的是长上下文理解问题 [37]。现有的 LLMs 在长上下文中难以保持对任务定义的理解,当面对更复杂和要求更高的任务时,其性能往往会显著下降。

To address the inherent limitations of LLMs, Retrieval-Augmented Generation (RAG) [35] has been proposed, which merges the generative capabilities of LLMs with a retrieval mechanism, allowing the incorporation of relevant external information to anchor the generated text in factual data. This integrated strategy improves both the accuracy and reliability of the generated content, providing a promising pathway for the practical deployment of LLMs in industrial applications. However, current RAG methods remain heavily reliant on text retrieval and the comprehension capabilities of LLMs, with little attention paid to extracting, understanding, and utilizing knowledge from diverse source data. In industrial applications requiring expertise, such as specialized knowledge and problem-solving rationale, existing RAG approaches, primarily designed for research benchmarks, demonstrate significant limitations. There is a lack of clarity regarding the challenges that RAG encounters in industrial applications. Gaining a comprehensive insight into these challenges is crucial for the development of RAG algorithms. Therefore, we summarize the main challenges as follows.

为了解决大语言模型(LLM)的固有局限性,检索增强生成(Retrieval-Augmented Generation,RAG)[35] 被提出,它将大语言模型的生成能力与检索机制相结合,允许引入相关的外部信息,使生成的文本基于事实数据。这种综合策略提高了生成内容的准确性和可靠性,为大语言模型在工业应用中的实际部署提供了一条有前景的路径。然而,当前的 RAG 方法仍然严重依赖文本检索和大语言模型的理解能力,缺乏对从多样化的源数据中提取、理解和利用知识的关注。在需要专业知识的工业应用中,如专业知识和问题解决逻辑,现有的主要为研究基准设计的 RAG 方法表现出明显的局限性。对于 RAG 在工业应用中遇到的挑战,目前缺乏清晰的认识。全面了解这些挑战对于 RAG 算法的发展至关重要。因此,我们将主要挑战总结如下。

• Knowledge source diversity: RAG systems are constructed upon a diverse corpus of source documents collected over many years from various domains, encompassing a wide range of file formats like scanned images, digital text files, and web data, sometimes accompanied by specialized databases. In contrast, widely-used datasets [28, 60, 51] typically feature pre-segmented, simplified corpora that do not capture the complexity of real-world data. Existing methods designed for these benchmarks struggle to efficiently extract specialized knowledge and uncover underlying rationales from diverse sources, particularly in industrial applications. For example, an LED product datasheet typically comprises specifications such as performance characteristics presented in complex tables, electrical properties depicted in charts, and installation instructions illustrated with figures. Addressing queries related to such non-textual knowledge presents significant challenges for existing RAG approaches.

• 知识来源多样性:RAG系统构建于多年收集的多样化源文档语料库之上,这些文档来自各个领域,涵盖扫描图像、数字文本文件和网络数据等多种文件格式,有时还伴随专门的数据库。相比之下,广泛使用的数据集 [28, 60, 51] 通常以预先分割、简化的语料库为特征,无法捕捉现实世界数据的复杂性。为这些基准设计的方法难以从多样化来源中高效提取专门知识并揭示潜在原理,尤其是在工业应用中。例如,LED产品数据表通常包含性能特征等规格,以复杂表格呈现,电气特性以图表展示,安装说明则通过图示说明。处理与非文本知识相关的查询对现有RAG方法提出了重大挑战。

• Domain specialization deficit: In industrial applications, RAG systems are expected to leverage the specialized knowledge and rationale of professional fields. However, this specialized knowledge is characterized by domain-specific terminologies, expertise, and distinctive logical frameworks that are integral to its functioning. RAG approaches built on common-knowledge-centric datasets demonstrate unsatisfactory performance when applied to professional fields, as LLMs exhibit deficiencies in extracting, understanding, and organizing domain-specific knowledge and rationale [38]. For example, in the field of semiconductor design, research relies heavily on a deep understanding of underlying physical properties. When LLMs are utilized to extract and organize the specialized knowledge and rationale from research documents, they often fail to properly capture essential physical principles and achieve a comprehensive understanding due to their inherent limitations. Consequently, RAG systems frequently produce incomplete or inaccurate interpretations of critical problem elements and generate responses that lack proper rationale grounded in physical principles. Moreover, assessing the quality of professional content generation poses a significant challenge. This issue not only impedes the development and optimization of RAG algorithms but also complicates their practical deployment across various industrial applications.

• 领域专业化不足:在工业应用中,RAG 被期望利用专业领域的专业知识和原理。然而,这些专业知识具有领域特定的术语、专业知识和独特的逻辑框架,这些是其功能不可或缺的组成部分。基于以常识为中心的数据集构建的 RAG 方法在应用于专业领域时表现出不令人满意的性能,因为大语言模型在提取、理解和组织领域特定知识和原理方面存在不足 [38]。例如,在半导体设计领域,研究严重依赖于对基础物理特性的深入理解。当利用大语言模型从研究文档中提取和组织专业知识和原理时,由于其固有的局限性,它们往往无法正确捕捉基本的物理原理并实现全面的理解。因此,RAG 系统经常对关键问题要素产生不完整或不准确的解释,并生成缺乏基于物理原理的适当推理的响应。此外,评估专业内容生成的质量也构成了重大挑战。这个问题不仅阻碍了 RAG 算法的开发和优化,还使其在各种工业应用中的实际部署变得复杂。

• One-size-fits-all: Various RAG application scenarios, although based on a similar framework, present different challenges that require diverse capabilities, particularly for extracting, understanding, and organizing domain-specific knowledge and rationale. The complexity and focus of questions vary across these scenarios, and within a single scenario, the difficulty can also differ. For example, in rule-based query scenarios, such as determining the legal conditions for mailing items, RAG systems primarily focus on retrieving relevant factual rules by bridging the semantic gap between the query and the rules. In multihop query scenarios, such as comparing products across multiple aspects, RAG systems emphasize extracting information from diverse sources and performing multihop reasoning to arrive at accurate answers. Most existing RAG approaches [62] adopt a one-size-fits-all strategy, failing to account for the varying complexities and specific demands both within and across scenarios. This results in solutions that do not meet the comprehensive accuracy standards required for practical applications, thereby limiting the development and integration of RAG systems in real-world environments.

• 通用解决方案:尽管基于相似的框架,各种 RAG 应用场景面临不同的挑战,需要多样化的能力,特别是在提取、理解和组织领域特定知识和逻辑方面。这些场景中的问题复杂性和重点各不相同,甚至在单个场景中,难度也可能有所差异。例如,在基于规则的查询场景中,如确定邮寄物品的法律条件,RAG 系统主要通过弥合查询与规则之间的语义差距来检索相关的事实规则。在多跳查询场景中,如从多个方面比较产品,RAG 系统则强调从不同来源提取信息并进行多跳推理,以得出准确的答案。大多数现有的 RAG 方法 [62] 采用了一种通用解决方案,未能考虑到场景内外的不同复杂性和特定需求。这导致解决方案无法满足实际应用所需的全面准确性标准,从而限制了 RAG 系统在现实环境中的发展和集成。

We believe that the key to addressing these challenges lies in advancing beyond traditional retrieval augmentation, by effectively extracting, understanding, and applying specialized knowledge, and developing appropriate reasoning logic tailored to the specific tasks and the knowledge involved. We refer to this approach as sPecIalized Knowledge and Rationale Augmentation. Given that various tasks require diverse capabilities, particularly for extracting, understanding, and organizing domain-specific knowledge and rationale, we summarize and categorize the questions commonly encountered into four types with respect to their difficulty: factual questions, linkable-reasoning questions, predictive questions, and creative questions. Accordingly, we propose a classification of RAG system capability levels, aligned with the system’s ability to solve these different types of problems. This classification serves as a guideline for systematically advancing the system’s capabilities in a controllable and measurable manner.

我们相信,解决这些挑战的关键在于超越传统的检索增强,通过有效提取、理解和应用专业知识,并开发适合特定任务和相关知识的推理逻辑。我们将这种方法称为专业知识与推理增强 (sPecIalized Knowledge and Rationale Augmentation) 。鉴于各种任务需要不同的能力,特别是在提取、理解和组织领域特定知识和推理方面,我们总结并分类了常见问题,根据其难度分为四种类型:事实性问题、可链接推理问题、预测性问题和创造性问题。相应地,我们提出了RAG系统能力水平的分类,与系统解决这些不同类型问题的能力相匹配。该分类为系统在可控和可衡量的方式下系统地提升能力提供了指导。

Furthermore, we propose the sPecIalized KnowledgE and Rationale Augmented Generation (PIKE-RAG) framework, which not only supports phased system development and deployment, demonstrating excellent versatility, but also enhances capabilities by effectively leveraging specialized knowledge and rationale. Within this framework, knowledge extraction components are employed to extract specialized knowledge from diverse source data, laying a robust foundation for knowledge-based retrieval and reasoning. Additionally, a task decomposer is utilized to dynamically manage the routing of retrieval and reasoning operations, creating specialized rationale based on available knowledge. PIKE-RAG enables a phased exploration of RAG capabilities, which facilitates the progressive refinement of RAG algorithms and the staged implementation of RAG applications. For each development phase, the RAG framework and its modules are tailored to address specific challenges. For example, in the knowledge base construction phase, a multi-layer heterogeneous graph is employed to effectively represent the relationships between various components of the data, enhancing knowledge organization and integration. The RAG system designed for factual questions introduces multi-granularity retrieval, allowing for multi-layer, multi-granularity retrieval across a heterogeneous knowledge graph to improve factual retrieval accuracy. In the advanced RAG system, aimed at addressing complex queries, knowledge atomizing is introduced to fully explore the intrinsic knowledge within data chunks, while knowledge-aware task decomposition manages the retrieval and organization of multiple pieces of atomic knowledge to construct a coherent rationale.

此外,我们提出了一种专门化知识与推理增强生成(PIKE-RAG)框架,该框架不仅支持分阶段的系统开发与部署,展现出极佳的通用性,还能通过有效利用专门化知识和推理来增强能力。在该框架中,知识提取组件用于从多样化的源数据中提取专门化知识,为基于知识的检索和推理奠定了坚实的基础。此外,任务分解器被用来动态管理检索与推理操作的路由,基于可用知识生成专门的推理。PIKE-RAG 支持对 RAG 能力的分阶段探索,从而促进 RAG 算法的逐步优化和 RAG 应用的分阶段实施。在每个发展阶段,RAG 框架及其模块都会根据特定挑战进行调整。例如,在知识库构建阶段,采用多层异构图来有效表示数据各组件之间的关系,增强知识的组织与整合。针对事实性问题设计的 RAG 系统引入了多粒度检索,允许在异质知识图谱上进行多层、多粒度的检索,以提高事实检索的准确性。在高级 RAG 系统中,为了解决复杂查询,引入了知识原子化,以充分挖掘数据块中的内在知识,而知识感知的任务分解则管理多个原子知识的检索与组织,以构建连贯的推理。
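The interplay of knowledge atomizing and knowledge-aware task decomposition described above can be sketched as a small loop. This is a hedged approximation: atomizing is stubbed by splitting chunks into single-sentence "atoms" (a real system would use an LLM to extract self-contained statements), retrieval is a toy word-overlap match, and `next_subquery` is a hypothetical decomposer that proposes the next retrieval target given the knowledge accumulated so far.

```python
def atomize(chunks):
    """Stub knowledge atomizing: one 'atom' per sentence of each chunk."""
    atoms = []
    for c in chunks:
        atoms.extend(s.strip() + "." for s in c.split(".") if s.strip())
    return atoms

def retrieve_atom(subquery, atoms):
    """Toy retrieval: the atom sharing the most words with the sub-query."""
    q = set(subquery.lower().split())
    return max(atoms, key=lambda a: len(q & set(a.lower().split())))

def solve(question, chunks, next_subquery, max_hops=3):
    """Knowledge-aware decomposition loop: each hop conditions on the
    knowledge gathered so far; the collected atoms form the rationale."""
    atoms, knowledge = atomize(chunks), []
    for _ in range(max_hops):
        sub_q = next_subquery(question, knowledge)
        if sub_q is None:  # decomposer judges the rationale complete
            break
        knowledge.append(retrieve_atom(sub_q, atoms))
    return knowledge

# Hypothetical corpus and decomposer for a two-hop question.
chunks = ["Drug X is manufactured by Company Y. Drug X treats anemia.",
          "Company Y was founded in 1998."]

def next_subquery(question, knowledge):
    if not knowledge:
        return "Who manufactures Drug X"
    if len(knowledge) == 1 and "Company Y" in knowledge[0]:
        return "When was Company Y founded"
    return None

rationale = solve("When was the maker of Drug X founded?", chunks, next_subquery)
```

The key design point, in contrast to fixed query rewriting, is that the decomposer sees the accumulated knowledge at every hop, so the second sub-query can mention an entity ("Company Y") that only the first hop surfaced.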

Extensive experiments are conducted to evaluate the performance of the proposed PIKE-RAG framework on both open-domain and legal benchmarks, and experimental results demonstrate the effectiveness of PIKE-RAG. Our framework and staged development strategy could further advance the current research and application of RAG in industrial contexts. In summary, the contributions of this work are as follows:

我们进行了大量实验,以评估所提出的 PIKE-RAG 框架在开放领域和法律基准上的性能,实验结果证明了 PIKE-RAG 的有效性。我们的框架和分阶段开发策略可以进一步推动 RAG 在工业环境中的研究和应用。综上所述,本工作的贡献如下:

2 Related work

2 相关工作

2.1 RAG

2.1 RAG

Retrieval-Augmented Generation (RAG) has emerged as a promising solution that effectively incorporates external knowledge to enhance response generation. Initially, retrieval-augmented techniques were introduced to improve the performance of pre-trained language models on knowledge-intensive tasks [35, 29, 12]. With the booming of Large Language Models [5, 9, 50, 6], most research in the RAG paradigm has shifted towards a framework that first retrieves pertinent information from external data sources and subsequently integrates it into the context of the query prompt as supplementary knowledge for contextually relevant generation [46]. Following this framework, the naive RAG research paradigm [25] converts raw data into uniform plain text and segments it into smaller chunks, which are encoded into vector space for query-based retrieval. The top-k relevant chunks are used to expand the context of the prompt for generation. To enhance the retrieval quality of naive RAG, advanced RAG approaches implement specific enhancements across the pre-retrieval, retrieval, and post-retrieval processes, including query optimization [39, 63], multi-granularity chunking [16, 65], mixed retrieval, and chunk re-ranking.

检索增强生成 (Retrieval-Augmented Generation, RAG) 作为一种有前景的解决方案,能够有效整合外部知识以增强响应生成。最初,检索增强技术被引入以提高预训练语言模型在知识密集型任务上的表现 [35, 29, 12]。随着大语言模型 (Large Language Models) 的蓬勃发展 [5, 9, 50, 6],RAG 范式中的大多数研究转向了一种框架,该框架首先从外部数据源检索相关信息,随后将其整合到查询提示的上下文中,作为上下文相关生成的补充知识 [46]。遵循这一框架,朴素的 RAG 研究范式 [25] 将原始数据转换为统一的纯文本并将其分割成较小的块,这些块被编码到向量空间中以进行基于查询的检索。前 k 个相关块用于扩展提示的上下文以进行生成。为了提高朴素 RAG 的检索质量,先进的 RAG 方法在预检索、检索和后检索过程中实施了特定的增强,包括查询优化 [39, 63]、多粒度分块 [16, 65]、混合检索和块重排序。
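The naive RAG pipeline above (chunk, embed, retrieve top-k, expand the prompt) can be sketched end to end. This is a minimal illustration, not a real system: the embedding is a toy bag-of-words vector standing in for a dense encoder, and the corpus sentences are hypothetical.

```python
from collections import Counter
import math

def chunk(text, size=8):
    """Segment plain text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(v * b[t] for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the top-k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, contexts):
    """Expand the prompt context with the retrieved chunks."""
    ctx = "\n".join(f"- {c}" for c in contexts)
    return f"Context:\n{ctx}\n\nQuestion: {query}\nAnswer:"

# Hypothetical corpus; a real pipeline would ingest parsed documents.
corpus = ("Biosimilar A was approved in 2019 for oncology use. "
          "Biosimilar B was approved in 2021. "
          "Installation of the LED module requires a heat sink.")
chunks = chunk(corpus)
prompt = build_prompt("When was Biosimilar B approved?",
                      retrieve("When was Biosimilar B approved?", chunks))
```

The advanced enhancements mentioned in the text (query optimization, re-ranking, multi-granularity chunking) would each slot into one of these stages.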

Beyond the aforementioned RAG paradigms, numerous sophisticated enhancements to RAG pipelines and system modules have been introduced within modular RAG systems [26], aiming to improve system capability and versatility. These advancements have enabled the processing of a wider variety of source data, facilitating the transformation of raw information into structured data and, ultimately, into valuable knowledge [56, 20]. Furthermore, the indexing and retrieval modules have been refined with multi-granularity and multi-architecture approaches [58, 65]. Various pre-retrieval [24, 64] and post-retrieval [18, 30] functions have been proposed to enhance both retrieval effectiveness and the quality of sequential generation. It has been recognized that naive RAG systems are insufficient to tackle complex tasks such as summarization [27] and multi-hop reasoning [51, 28]. Consequently, most recent research focuses on developing advanced coordination schemes that leverage existing modules to collaboratively address these challenges. ITER-RETGEN [48] and DSP [33] employ retrieve-read iteration, leveraging the generated response as the context for the next round of retrieval. FLARE [31] proposes a confidence-based active retrieval mechanism that dynamically adjusts the query with respect to low-confidence tokens in the regenerated sentences. These loop-based RAG pipelines progressively converge towards the correct answer and provide enhanced flexibility for RAG systems in addressing diverse requirements.

除了上述的 RAG 范式外,模块化 RAG 系统中引入了许多复杂的 RAG 管道和系统模块增强 [26],旨在提高系统能力和多功能性。这些进步使得处理更多种类的源数据成为可能,促进了原始信息向结构化数据以及最终向有价值知识的转化 [56, 20]。此外,索引和检索模块通过多粒度和多架构方法进行了优化 [58, 65]。提出了各种检索前 [24, 64] 和检索后 [18, 30] 功能,以增强检索效果和序列生成的质量。人们认识到,简单的 RAG 系统不足以应对诸如摘要 [27] 和多跳推理 [51, 28] 等复杂任务。因此,最近的研究主要集中在开发高级协调方案,利用现有模块协同应对这些挑战。ITERRETGEN [48] 和 DSP [33] 采用检索-读取迭代,利用生成响应作为下一轮检索的上下文。FLARE [31] 提出了一种基于置信度的主动检索机制,根据再生句子中低置信度的 Token 动态调整查询。这些基于循环的 RAG 管道逐步收敛到正确答案,并为 RAG 系统提供了应对多样化需求的增强灵活性。
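The retrieve-read iteration described above can be sketched as a short loop in which each round's generation is folded back into the next round's retrieval query. The `demo_retrieve` and `demo_generate` stubs below are hypothetical table-driven stand-ins for a real retriever and LLM call; only the control flow reflects the loop-based scheme.

```python
def iterative_rag(query, retrieve, generate, rounds=2):
    """Retrieve-read iteration: the previous answer augments the next
    retrieval query, letting later rounds reach facts the first-round
    query could not surface."""
    answer = ""
    for _ in range(rounds):
        context = retrieve(f"{query} {answer}".strip())
        answer = generate(query, context)
    return answer

# Stub knowledge: hop one surfaces an entity ("Company Y") that the
# second-round retrieval needs in order to reach the final fact.
def demo_retrieve(q):
    if "Drug X" in q and "Company Y" not in q:
        return ["Drug X is manufactured by Company Y."]
    return ["Company Y was founded in 1998."]

def demo_generate(q, ctx):
    return " ".join(ctx)

result = iterative_rag("Who makes Drug X and when were they founded?",
                       demo_retrieve, demo_generate)
```

A confidence-based variant in the spirit of FLARE would, instead of iterating a fixed number of rounds, re-trigger retrieval only when the generator emits low-confidence tokens.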

2.2 Knowledge bases for RAG

2.2 用于RAG的知识库

In naive RAG approaches, source data is converted to plain text and chunked for retrieval. However, as RAG applications expand and the demand for diversity grows, plain text-based retrieval becomes insufficient for several reasons: (1) textual information is generally redundant and noisy, leading to decreased retrieval quality; (2) complex problems require the integration of multiple data sources, and plain text alone cannot adequately represent the intricate relationships between objects. As a result, researchers are exploring diverse data sources to enrich the corpus, incorporating search engines [59, 53], databases [55, 41, 47], knowledge graphs [49, 56], and multimodal corpora [17, 15]. Concurrently, there is an emphasis on developing efficient knowledge representations for the corpus to enhance knowledge retrieval. A graph is regarded as a powerful knowledge representation because of its capacity to intuitively model complex relationships. GraphRAG [20] combines knowledge graph generation and query-focused summarization with RAG to address both local and global questions. HOLMES [42] constructs hyper-relational KGs and prunes them into distilled graphs, which serve as input to LLMs for multi-hop question answering. However, the construction of knowledge graphs is extremely resource-intensive, and the associated costs scale up with the size of the corpus.

在简单的 RAG 方法中,源数据被转换为纯文本并分块以便检索。然而,随着 RAG 应用的扩展和对多样性的需求增加,基于纯文本的检索在多个方面变得不足:(1) 文本信息通常冗余且嘈杂,导致检索质量下降;(2) 复杂问题需要整合多个数据源,而纯文本无法充分表示对象之间的复杂关系。因此,研究人员正在探索多样化的数据源以丰富语料库,包括搜索引擎 [59, 53]、数据库 [55, 41, 47]、知识图谱 [49, 56] 和多模态语料库 [17, 15]。同时,重点在于开发高效的语料库知识表示以增强知识检索。图因其能够直观地建模复杂关系而被视为一种强大的知识表示。GraphRAG [20] 将知识图谱生成和以查询为中心的摘要与 RAG 结合,以解决局部和全局问题。HOLMES [42] 构建超关系知识图谱并将其修剪为精简图,作为大语言模型的多跳问答输入。然而,知识图谱的构建极其耗费资源,且相关成本随着语料库的规模而增加。
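To make the contrast with plain-text retrieval concrete, graph-based retrieval can be sketched as a bounded neighborhood expansion: entities mentioned in the query seed a walk over a triple store, and the facts collected along the way become the generation context. The triples and entity names below are hypothetical; real systems like GraphRAG additionally summarize communities of the graph.

```python
def expand(seed_entities, triples, hops=2):
    """Collect facts reachable from the seed entities within `hops` edges."""
    index = {}
    for head, rel, tail in triples:
        index.setdefault(head, []).append((rel, tail))
    frontier, facts = set(seed_entities), []
    for _ in range(hops):
        nxt = set()
        for e in frontier:
            for rel, tail in index.get(e, []):
                facts.append((e, rel, tail))
                nxt.add(tail)
        frontier = nxt
    return facts

# Hypothetical knowledge graph as (head, relation, tail) triples.
triples = [
    ("Drug X", "manufactured_by", "Company Y"),
    ("Company Y", "founded_in", "1998"),
    ("Drug X", "treats", "anemia"),
]
facts = expand(["Drug X"], triples, hops=2)
```

Note that the two-hop fact ("Company Y", "founded_in", "1998") is reachable here even though no single text chunk links "Drug X" to "1998", which is precisely what keyword or embedding retrieval over flat chunks struggles with.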

2.3 Multi-hop QA

2.3 多跳问答

Multi-hop Question Answering (MHQA) [60] involves answering questions that require reasoning over multiple pieces of information, often scattered across different documents or paragraphs. This task presents unique challenges, as it necessitates not only retrieving relevant information but also effectively combining and reasoning over the retrieved pieces to arrive at a correct answer. Traditional graph-based methods in MHQA solve the problem by building graphs and performing inference with graph neural networks (GNNs) to predict answers [44, 21]. With the advent of LLMs, recent graph-based methods [36, 42] have evolved to construct knowledge graphs for retrieval and generate responses through LLMs. Another branch of methods dynamically converts multi-hop questions into a series of sub-queries by generating subsequent questions based on the answers to previous ones [52, 33, 23]. The sub-queries guide the sequential retrieval, and the retrieved results are in turn used to improve reasoning. Treating MHQA as a supervised problem, Self-RAG [61] trains an LM to learn to retrieve, generate, and critique text passages, and beam-retrieval [7] models the multi-hop retrieval process in an end-to-end manner by jointly optimizing an encoder and classification heads across all hops. Self-Ask [43] improves CoT by explicitly asking itself follow-up questions before answering the initial question. This method enables the automatic decomposition of questions and can be seamlessly integrated with retrieval mechanisms to tackle multi-hop question answering.

多跳问答 (MHQA) [60] 涉及回答需要跨多个信息片段进行推理的问题,这些信息通常分散在不同的文档或段落中。该任务提出了独特的挑战,因为它不仅需要检索相关信息,还需要有效地组合并推理检索到的信息片段,以得出正确答案。传统的基于图的方法在 MHQA 中通过构建图并在图神经网络 (GNN) 上进行推理来预测答案 [44, 21]。随着大语言模型的出现,最近的基于图的方法 [36, 42] 已经演变为构建知识图谱进行检索,并通过大语言模型生成响应。另一类方法通过根据先前问题的答案生成后续问题,将多跳问题动态转换为一系列子查询 [52, 33, 23]。子查询指导顺序检索,检索到的结果反过来用于改进推理。将 MHQA 视为监督问题,Self-RAG [61] 训练一个语言模型来学习检索、生成和评论文本段落,而 beam-retrieval [7] 通过在所有跳数上联合优化编码器和分类头,以端到端的方式对多跳检索过程进行建模。Self-Ask [43] 通过在回答初始问题之前明确询问自己后续问题来改进思维链 (CoT)。这种方法能够自动分解问题,并可以无缝集成检索机制以解决多跳问答问题。
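The Self-Ask pattern mentioned above is essentially a prompting scaffold. In a real system the follow-up questions and intermediate answers come from the LLM, with a retriever plugged in to answer each follow-up; here a hypothetical table of follow-ups stands in so that the scaffold itself is visible.

```python
def self_ask(question, follow_ups, final_answer):
    """Render the Self-Ask scaffold as a single prompt/trace string."""
    lines = [f"Question: {question}",
             "Are follow up questions needed here: Yes."]
    for fq, fa in follow_ups:
        lines.append(f"Follow up: {fq}")
        lines.append(f"Intermediate answer: {fa}")
    lines.append(f"So the final answer is: {final_answer}")
    return "\n".join(lines)

# Hypothetical two-hop decomposition of a compositional question.
trace = self_ask(
    "When was the company that makes Drug X founded?",
    [("Who makes Drug X?", "Company Y"),
     ("When was Company Y founded?", "1998")],
    "1998",
)
```

The explicit "Follow up:" / "Intermediate answer:" markers are what make the decomposition easy to intercept: a retrieval call can be substituted for the model at every "Intermediate answer:" slot.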

3 Problem formulation

3 问题表述

Existing research mainly concentrates on algorithmic enhancements to improve the performance of RAG systems. However, there is limited effort in providing a comprehensive and systematic discussion of the RAG framework. In this work, we conceptualize the RAG framework from three key perspectives: knowledge base, task classification, and system development. We assert that the knowledge base serves as the fundamental cornerstone of RAG, underpinning all retrieval and generation processes. Furthermore, we recognize that RAG tasks can vary significantly in complexity and difficulty, depending on the required generation capabilities and the availability of supporting corpora. By categorizing tasks according to their difficulty levels, we classify RAG systems into distinct levels based on their problem-solving capabilities across the different types of questions.

现有研究主要集中在算法增强上,以提高 RAG 系统的性能。然而,关于 RAG 框架的全面系统性讨论却较为有限。在本研究中,我们从三个关键视角对 RAG 框架进行了概念化:知识库、任务分类和系统开发。我们主张知识库是 RAG 的基石,支撑着所有的检索与生成过程。此外,我们认识到 RAG 任务的复杂度和难度可以根据所需的生成能力及辅助语料的可用性而有显著不同。通过按难度等级划分任务,我们依据 RAG 系统在解决不同类型问题上的能力,将其划分为不同的级别。

3.1 Knowledge base

3.1 知识库

In industrial applications, specialized knowledge primarily originates from years of accumulated data within specific fields such as manufacturing, energy, and logistics. For example, in the pharmaceutical industry, data sources include extensive research and development documentation, as well as drug application files amassed over many years. These sources are not only diverse in file formats, but also encompass a significant amount of multi-modal content such as tables, charts, and figures, which are also crucial for problem-solving. Furthermore, there are often functional connections between files within a specialized domain, such as hyperlinks, references, and relational database links, which explicitly or implicitly reflect the logical organization of knowledge within the professional field. Currently, existing datasets provide pre-segmented corpora and do not account for the complexities encountered in real-world applications, such as the integration of multi-format data and the maintenance of referential relationships between documents. Therefore, the construction of a comprehensive knowledge base is foundational for Retrieval-Augmented Generation (RAG) in the industrial field. As the architecture and quality of the knowledge base directly influence the retrieval methods and their performance, we propose structuring the knowledge base as a multi-layer heterogeneous graph, denoted as $G$, with corresponding nodes and edges represented by $(V, E)$. The graph nodes can include documents, sections, chunks, figures, tables, and customized nodes from distilled knowledge. The edges signify the relationships among these nodes, encapsulating the interconnections and dependencies within the graph. This multi-layer heterogeneous graph encompasses three distinct layers: the information resource layer $G_{i}$, the corpus layer $G_{c}$, and the distilled knowledge layer $G_{dk}$. Each layer corresponds to different stages of information processing, representing varying levels of granularity and abstraction in knowledge.

在工业应用中,专业知识主要来源于制造业、能源和物流等特定领域多年积累的数据。例如,在制药行业,数据来源包括广泛的研发文档以及多年积累的药物申请文件。这些来源不仅在文件格式上多样化,还包含大量多模态内容,如表格、图表和图形,这些内容对于解决问题也至关重要。此外,专业领域内的文件之间通常存在功能性联系,如超链接、引用和关系数据库链接,这些联系明确或隐含地反映了专业领域内知识的逻辑组织。目前,现有数据集提供了预分段的语料库,但并未考虑实际应用中遇到的复杂性,例如多格式数据的整合和文档之间引用关系的维护。因此,构建一个全面的知识库是工业领域检索增强生成(RAG)的基础。由于知识库的结构和质量直接影响检索方法及其性能,我们建议将知识库构建为多层异构图,记为 $G$,相应的节点和边表示为 $(V, E)$。图节点可以包括文档、章节、块、图形、表格以及从提炼知识中定制的节点。边表示这些节点之间的关系,封装了图中的相互联系和依赖关系。这种多层异构图包含三个不同的层次:信息资源层 $G_{i}$、语料层 $G_{c}$ 和提炼知识层 $G_{dk}$。每一层对应信息处理的不同阶段,代表知识的不同粒度和抽象层次。
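A minimal data-structure sketch of the multi-layer heterogeneous graph $G = (V, E)$ described above, with nodes assigned to the information resource, corpus, and distilled knowledge layers. The node kinds, relation names, and example contents are illustrative assumptions, not the paper's exact schema.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    layer: str       # "resource" | "corpus" | "distilled"
    kind: str        # e.g. document, section, chunk, table, figure
    content: str = ""

class HeteroGraph:
    """Typed nodes plus labeled directed edges (src, relation, dst)."""
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, src, relation, dst):
        self.edges.append((src, relation, dst))

    def neighbors(self, node_id, relation=None):
        return [dst for src, rel, dst in self.edges
                if src == node_id and (relation is None or rel == relation)]

# One path through the three layers: document -> section -> chunk -> atom.
g = HeteroGraph()
g.add_node(Node("doc1", "resource", "document"))
g.add_node(Node("sec1", "corpus", "section"))
g.add_node(Node("chunk1", "corpus", "chunk", "Biosimilar A approved in 2019."))
g.add_node(Node("fact1", "distilled", "atomic_knowledge", "approval(A, 2019)"))
g.add_edge("doc1", "contains", "sec1")
g.add_edge("sec1", "contains", "chunk1")
g.add_edge("chunk1", "distills_to", "fact1")
```

Because edges carry relation labels, a retriever can traverse at the granularity the query demands, e.g. stop at chunks for factual lookup or follow `distills_to` edges down to atomic knowledge.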

3.2 Task classification

3.2 任务分类

Contemporary RAG frameworks frequently overlook the intricate difficulty and logistical demands inherent to diverse tasks, typically employing a one-size-fits-all methodology. However, even with comprehensive knowledge retrieval, current RAG systems are insufficient to handle tasks of varying difficulty with equal effectiveness. Therefore, it is essential to categorize tasks and analyze the typical strategies for overcoming the challenges inherent to each category. The difficulty of a task is closely associated with several critical factors.

当代 RAG 框架常常忽视不同任务固有的复杂性和后勤需求,通常采用一刀切的方法。然而,即使有全面的知识检索,当前的 RAG 系统也无法同样有效地处理不同难度的任务。因此,必须对任务进行分类,并分析克服每类任务固有挑战的典型策略。任务的难度与几个关键因素密切相关。

Figure 1: Illustrative examples of distinct question types

图 1: 不同问题类型的示例

• Effectiveness of Knowledge Utilization: The sophistication involved in applying the extracted knowledge to formulate responses, including synthesizing, organizing, and generating insights or predictions.

• 知识利用的有效性:应用提取的知识来制定回应的复杂性,包括综合、组织以及生成见解或预测。

In categorizing real-world RAG tasks within industries, we focus on the processes of knowledge extraction, understanding, organization, and utilization to provide structured and insightful responses. Taking the aforementioned factors into account, we identify four distinct classes of questions that address a broad spectrum of demands. The first type, Factual Questions, involves extracting specific, explicit information directly from the corpus, relying on retrieval mechanisms to identify the relevant facts. Linkable-Reasoning Questions demand a deeper level of knowledge integration, often requiring multi-step reasoning and linking across multiple sources. Predictive Questions extend beyond the available data, requiring inductive reasoning and structuring of retrieved facts into analyzable forms, such as time series, for future-oriented predictions. Finally, Creative Questions engage domain-specific logic and creative problem-solving, encouraging the generation of innovative solutions by synthesizing knowledge and identifying patterns or influencing factors. This categorization, driven by varying levels of reasoning and knowledge management, ensures a comprehensive approach to addressing industry-specific queries.

在对行业中的现实世界 RAG 任务进行分类时,我们重点关注知识提取、理解、组织和利用的过程,以提供结构化和有洞察力的响应。考虑到上述因素,我们确定了四类不同的问题,以满足广泛的需求。第一类,事实性问题 (Factual Questions) ,涉及直接从语料库中提取特定的、明确的信息,依靠检索机制来识别相关事实。可链接推理问题 (Linkable-Reasoning Questions) 需要更深层次的知识整合,通常需要多步推理和跨多个来源的链接。预测性问题 (Predictive Questions) 超出了现有数据的范围,需要归纳推理并将检索到的事实构建为可分析的形式,例如时间序列,以进行面向未来的预测。最后,创造性问题 (Creative Questions) 涉及特定领域的逻辑和创造性问题解决,通过综合知识并识别模式或影响因素,鼓励生成创新解决方案。这种由不同层次的推理和知识管理驱动的分类,确保了解决行业特定问题的全面方法。
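The four categories above suggest a routing layer: each question type maps to a different retrieval-and-reasoning pipeline. The handlers below are placeholders; a deployed system would plug in the corresponding modules (fact retrieval, multi-step decomposition, aggregation and forecasting, pattern mining), and the classification itself could be produced by an LLM or a trained classifier.

```python
# Hypothetical dispatch table from question type to pipeline stub.
HANDLERS = {
    "factual":            lambda q: f"retrieve fact for: {q}",
    "linkable_reasoning": lambda q: f"decompose and multi-hop: {q}",
    "predictive":         lambda q: f"aggregate data and forecast: {q}",
    "creative":           lambda q: f"mine patterns and propose: {q}",
}

def route(question, qtype):
    """Send a question to the pipeline matching its category."""
    if qtype not in HANDLERS:
        raise ValueError(f"unknown question type: {qtype}")
    return HANDLERS[qtype](question)
```

This also mirrors the capability levels defined later (Table 1): an L1 system implements only the `"factual"` handler, and each subsequent level adds the next entry.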

The criteria defining each category are elaborated in the following sections, with representative examples for each provided in Figure 1. For each question type, we also present the associated support data and the expected reasoning processes to illustrate the differences between these categories. These inquiries are formulated by experts in pharmaceutical applications, based on the data released by the FDA.

定义每个类别的标准将在以下章节中详细阐述,并在图 1 中提供了每个类别的代表性示例。对于每种问题类型,我们还提供了相关的支持数据和预期的推理过程,以说明这些类别之间的差异。这些查询是由药物应用专家根据 FDA 发布的数据制定的。

Factual Questions These questions seek specific, concrete pieces of information explicitly presented in the original corpus. The referenced text can be processed within the context of a conversation in LLMs. As shown in Figure 1, this class of questions can be effectively answered if the relevant fact is successfully retrieved.

事实性问题:这类问题寻求原始语料库中明确呈现的具体、确切的信息。所引用的文本可以在大语言模型的对话上下文中处理。如图 1 所示,只要成功检索到相关事实,此类问题就能得到有效回答。

Linkable-Reasoning Questions Answering these questions necessitates gathering pertinent information from diverse sources and/or executing multi-step reasoning. The answers may be implicitly distributed across multiple texts. Due to variations in the linking and reasoning processes, we further divide this category into four subcategories: bridging questions, comparative questions, quantitative questions, and summarizing questions. Examples of each subcategory are illustrated in Figure 1. Specifically, bridging questions involve sequentially bridging multiple entities to derive the answer. Quantitative questions require statistical analysis based on the retrieved data. Comparative questions focus on comparing specified attributes of two entities. Summarizing questions require condensing or synthesizing information from multiple sources or large volumes of text into a concise, coherent summary, and they often involve integrating key points, identifying main themes, or drawing conclusions based on the aggregated content. Summarizing questions may combine elements of other question types, such as bridging, comparative, or quantitative questions, as they frequently require the extraction and integration of diverse pieces of information to generate a comprehensive and meaningful summary. Given these questions require multi-step retrieval and reasoning, it is crucial to establish a reasonable operation route for answer-seeking in interaction with the knowledge base.

可链接推理问题:回答这些问题需要从不同来源收集相关信息或执行多步推理。答案可能隐含分布在多个文本中。由于链接和推理过程的差异,我们进一步将此类问题分为四个子类别:桥接问题、比较问题、定量问题和总结问题。每个子类别的示例如图 1 所示。具体来说,桥接问题涉及依次桥接多个实体以得出答案。定量问题需要基于检索到的数据进行统计分析。比较问题侧重于比较两个实体的指定属性。总结问题需要将多个来源或大量文本中的信息浓缩或综合为一个简洁、连贯的摘要,并且通常涉及整合关键点、识别主题或基于聚合内容得出结论。总结问题可能结合了其他问题类型的元素,例如桥接、比较或定量问题,因为它们经常需要提取和整合不同的信息片段以生成全面且有意义的摘要。鉴于这些问题需要多步检索和推理,在与知识库的交互中建立合理的答案寻找操作路径至关重要。

Predictive Questions For this type of question, the answers are not directly available in the original text and may not be purely factual, necessitating inductive reasoning and prediction based on existing facts. To harness the predictive capabilities of LLMs or other external prediction tools, it is essential to gather and organize relevant knowledge to generate structured data for further analysis. For instance, as illustrated in Figure 1, all biosimilar products with their approval dates are retrieved, and the total number of approvals for each year is calculated and organized into year-indexed time series data for prediction purposes. Furthermore, it is important to note that the correct answer to a predictive question may not be unique, reflecting the inherent uncertainty and variability in predictive tasks.

预测性问题:对于这类问题,答案在原始文本中并不直接存在,且可能并非纯粹的事实,需要基于现有事实进行归纳推理和预测。为了利用大语言模型或其他外部预测工具的预测能力,必须收集并组织相关知识,生成结构化数据以供进一步分析。例如,如图 1 所示,检索所有带有批准日期的生物类似药产品,计算每年的批准总数,并将其组织为以年份为索引的时间序列数据用于预测。此外,值得注意的是,预测性问题的正确答案可能并不唯一,这反映了预测任务固有的不确定性和可变性。
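The predictive-question workflow, retrieving facts, organizing them into a year-indexed time series, then extrapolating, can be sketched with hypothetical retrieved facts. The product names and counts below are invented for illustration, and a simple least-squares trend stands in for whatever forecasting tool a real system would call.

```python
from collections import Counter

# Hypothetical facts retrieved from the corpus: (product, approval year).
approvals = [("Prod-A", 2019), ("Prod-B", 2019), ("Prod-C", 2020),
             ("Prod-D", 2021), ("Prod-E", 2021), ("Prod-F", 2021)]

counts = Counter(year for _, year in approvals)   # year-indexed time series
years = sorted(counts)

# Least-squares linear trend, extrapolated one year past the data.
n = len(years)
x_mean = sum(years) / n
y_mean = sum(counts[y] for y in years) / n
slope = (sum((y - x_mean) * (counts[y] - y_mean) for y in years)
         / sum((y - x_mean) ** 2 for y in years))
intercept = y_mean - slope * x_mean
forecast_2022 = slope * 2022 + intercept
```

Note that the structuring step (building `counts`) is itself a retrieval-and-aggregation task, which is why predictive questions subsume the linkable-reasoning capability, and, as the text stresses, the extrapolated value is one plausible answer rather than a unique correct one.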

Creative Questions One significant demand of RAG is to mine valuable domain-specific logic from professional knowledge bases and introduce novel perspectives that can innovate and advance existing solutions. Addressing creative questions necessitates creative thinking based on the availability of factual information and an understanding of the underlying principles and rules. As illustrated in the example, it is essential to organize the extracted information to highlight key stages and their duration, and then identify common patterns and influential factors. Subsequently, solutions are developed with the objective of evaluating potential outcomes and stimulating fresh ideas. The goal of these responses is to inspire experts to generate innovative ideas, rather than to provide ready-to-implement solutions.

创造性问题
RAG(Retrieval-Augmented Generation)的一个重要需求是从专业知识库中挖掘有价值的领域特定逻辑,并引入能够创新和推进现有解决方案的新颖视角。解决创造性问题需要基于事实信息的可用性以及对基本原理和规则的理解进行创造性思考。如示例所示,必须组织提取的信息以突出关键阶段及其持续时间,然后识别常见模式和有影响力的因素。随后,开发的解决方案旨在评估潜在结果并激发新想法。这些回答的目标是激发专家产生创新想法,而不是提供现成的解决方案。

It is crucial to recognize that the classification of a question may shift with changes in the knowledge base. Questions Q1, Q2, and Q3 in Figure 1, although seemingly similar, are categorized differently depending on the availability of information and the logical steps required to derive an answer. For instance, Q1 is classified as a factual question because it can be directly answered using a table that concisely lists all biosimilar products along with their respective approval dates, providing sufficient explicit information. In contrast, Q2, which inquires about the total count of interchangeable biosimilar products, cannot be resolved by directly referencing a single explicit source. To answer Q2, one must identify all the products meeting the specified criteria and subsequently calculate the total, necessitating an additional step of statistical aggregation. Therefore, Q2 is categorized as a linkable-reasoning question due to the need for intermediate processing. Finally, Q3 poses a challenge because the answer does not explicitly exist within the knowledge base. Addressing

关键是要认识到,问题的分类可能会随着知识库的变化而改变。图1中的问题Q1、Q2和Q3,虽然看似相似,但根据信息的可用性和推导答案所需的逻辑步骤,它们被归类为不同类型。例如,Q1被归类为事实性问题,因为它可以通过直接参考一张表格来回答,该表格简明地列出了所有生物类似物产品及其各自的批准日期,提供了足够的显性信息。相比之下,Q2询问的是可互换生物类似物产品的总数,无法通过直接引用单一的显性来源来解决。要回答Q2,必须识别所有符合指定标准的产品,随后计算总数,这需要额外的统计聚合步骤。因此,Q2被归类为可链接推理问题,因为它需要中间处理。最后,Q3提出了一个挑战,因为答案并未在知识库中明确存在。解决

Table 1: Level definition based on RAG system’s capability

表 1: 基于 RAG 系统能力的等级定义

等级 | 系统能力描述
L1 | L1 系统旨在为事实性问题提供准确可靠的答案,确保基础信息检索的坚实基础。
L2 | L2 系统扩展其功能,包括对事实性问题和可链接推理问题的准确可靠响应,支持更复杂的多步检索和推理任务。
L3 | L3 系统进一步增强其能力,通过加入对预测性问题提供合理预测的能力,同时在回答事实性问题和可链接推理问题时保持准确性和可靠性。
L4 | L4 系统能够为创造性问题提出合理规划或解决方案。此外,它保留了对预测性问题提供合理预测的能力,同时还能为事实性问题和可链接推理问题提供准确可靠的答案。

Q3 requires gathering relevant data, organizing it to infer hidden patterns, and making predictions based on these inferred rules. As a result, Q3 is categorized as a predictive question, indicating the requirement to extrapolate beyond the existing data to forecast potential outcomes or trends.

Q3需要收集相关数据,组织数据以推断隐藏的模式,并根据这些推断的规则进行预测。因此,Q3被归类为预测性问题,表明需要从现有数据中推断出潜在的结果或趋势。

3.3 RAG system level

3.3 RAG系统层级

In industrial RAG systems, inquiries encompass a broad spectrum of difficulties and are approached from diverse perspectives. Although RAG systems can leverage the general question-answering (QA) abilities of LLMs, their limited comprehension of expert-level knowledge often leads to inconsistent response quality across questions of varying complexities. In response to this status quo, we propose categorizing RAG systems into four distinct levels based on their problem-solving capabilities across the four classes of questions outlined in the previous subsection. This stratified approach facilitates the phased development of RAG systems, allowing capabilities to be incrementally enhanced through iterative module refinement and algorithmic optimization. Our framework is strategically designed to provide a standardized, objective methodology for developing RAG systems that effectively meet the specialized needs of various industry scenarios. The definitions of RAG systems at different levels are presented in Table 1. It highlights the systems’ capabilities to handle increasingly complex queries, demonstrating the evolution from simple information retrieval to advanced predictive and creative problem-solving. Each level represents a step towards more sophisticated interactions with knowledge bases, requiring the RAG systems to demonstrate higher levels of understanding, reasoning, and innovation.

在工业 RAG 系统中,查询涵盖了广泛的难度范围,并从不同的角度进行处理。尽管 RAG 系统可以利用大语言模型的通用问答(QA)能力,但它们对专家级知识的有限理解往往导致对不同复杂度问题的回答质量不一致。针对这一现状,我们提议根据 RAG 系统在前一小节中列出的四类问题上的解决能力,将其分为四个不同的层次。这种分层方法有助于 RAG 系统的分阶段开发,通过迭代模块优化和算法改进逐步增强能力。我们的框架旨在为开发 RAG 系统提供一种标准化、客观的方法论,以有效满足各种行业场景的专业需求。不同层次 RAG 系统的定义如表 1 所示。它展示了系统处理日益复杂查询的能力,从简单的信息检索演变为高级的预测性和创造性问题解决。每个层次都代表了与知识库更复杂的交互,要求 RAG 系统展现出更高层次的理解、推理和创新能力。

More specifically, at the foundational level, RAG systems respond to factual questions with answers that are directly extractable from provided texts. Advancing to the second level, RAG systems are equipped to handle complex questions involving linkage and reasoning. These queries necessitate the synthesis of information from disparate sources or multi-step reasoning processes. The RAG system could address a variety of composite questions, including bridging questions that necessitate a sequence of logical reasoning, comparative questions demanding parallel analysis, and summarizing questions that involve condensing information into comprehensive responses. At the third level, the systems are intricately designed to tackle predictive questions whose answers are not immediately discernible from the original text. Finally, RAG systems at the fourth level demonstrate the capacity for creative problem-solving, utilizing a solid factual base to foster novel concepts or strategies. While these systems may not offer ready-to-implement solutions, they play a crucial role in stimulating expert creativity to advance fields such as analytics or treatment design.

更具体地说,在基础层面上,RAG 系统能够从提供的文本中直接提取答案来回应事实性问题。进阶到第二层面,RAG 系统具备处理涉及链接和推理的复杂问题的能力。这些查询需要从不同来源综合信息或多步骤的推理过程。RAG 能够应对多种复合问题,包括需要一系列逻辑推理的桥梁问题、要求并行分析的比较问题,以及涉及将信息浓缩为全面回答的总结性问题。在第三层面,这些系统被精心设计以应对预测性问题,这些问题的答案无法直接从原始文本中识别。最后,处于第四层面的 RAG 系统展现了创造性解决问题的能力,利用坚实的事实基础来孕育新的概念或策略。虽然这些系统可能不提供即插即用的解决方案,但它们在激发专家创造力以推动分析或治疗设计等领域的发展中扮演着至关重要的角色。

4 Methodology

4 方法论

4.1 Framework

4.1 框架

Based on the formulation of RAG systems in terms of knowledge base, task classification, and system-level division, we propose a versatile and expandable RAG framework. Within this framework, the progression in levels of RAG systems can be achieved by adjusting submodules within the main modules. The overview of our framework is depicted in Figure 2. The framework primarily consists of several fundamental modules, including file parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, knowledge-centric reasoning, and task decomposition and coordination. In this framework, domain-specific documents of diverse formats are processed by the file parsing module to convert the files into machine-readable formats, and file units are generated to build up the graph in the information source layer. The knowledge extraction module chunks the text and generates corpus and knowledge units to construct the graphs in the corpus layer and distilled knowledge layer. The heterogeneous graph thus established is utilized as the knowledge base for retrieval. Extracted knowledge is stored in multiple structured formats, and the knowledge retrieval module employs a hybrid retrieval strategy to access relevant information. Note that the knowledge base not only serves as the source of knowledge gathering but also benefits from a feedback loop, where the organized and verified knowledge is regarded as feedback to refine and improve the knowledge base.

基于知识库、任务分类和系统层级划分的 RAG 系统表述,我们提出了一个多功能且可扩展的 RAG 框架。在该框架中,RAG 系统的层级演进可以通过调整主模块中的子模块来实现。图 2 展示了我们框架的概览。该框架主要由几个基础模块组成,包括文件解析、知识提取、知识存储、知识检索、知识组织、以知识为中心的推理以及任务分解与协调。在该框架中,不同格式的领域特定文档通过文件解析模块处理,将文件转换为机器可读格式,并生成文件单元以构建信息源层的图。知识提取模块对文本进行分块并生成语料库和知识单元,以构建语料库层和精炼知识层的图。建立的异构图被用作检索的知识库。提取的知识以多种结构化格式存储,知识检索模块采用混合检索策略来访问相关信息。需要注意的是,知识库不仅是知识收集的来源,还受益于反馈循环,其中组织和验证的知识被视为反馈,以改进和完善知识库。


Figure 2: Overview of the PIKE-RAG framework, comprising several key components: file parsing, knowledge extraction, knowledge storage, knowledge retrieval, knowledge organization, task decomposition and coordination, and knowledge-centric reasoning. Each component can be tailored to meet the evolving demands of system capability.


图 2: PIKE-RAG 框架概览,包含几个关键组件:文件解析、知识提取、知识存储、知识检索、知识组织、任务分解与协调,以及以知识为中心的推理。每个组件都可以根据系统能力的演进需求进行定制。

As highlighted in the task classification examples, questions of different classes require distinct rationale routing for answer-seeking, influenced by multiple factors such as the availability of relevant information, the complexity of knowledge extraction, and the sophistication of reasoning. It is challenging to address these questions in a single retrieval and generation pass. To tackle this, we propose an iterative retrieval-generation mechanism supervised by task decomposition and coordination. This iterative mechanism enables the gradual collection of relevant information and progressive reasoning over incremental context, ensuring a more accurate and comprehensive response. More specifically, the questions in industrial applications are fed into the task decomposition module to produce a preliminary decomposition scheme. This scheme outlines the retrieval steps, reasoning steps, and other necessary operations. Following these instructions, the knowledge retrieval module retrieves relevant information, which is then passed to the knowledge organization module for processing and organization. The organized knowledge is used to perform knowledge-centric reasoning, yielding an intermediate answer. With the updated relevant information and intermediate answer, the task decomposition module regenerates an updated scheme for the next iteration. This design boasts excellent adaptability, allowing us to tackle problems of varying difficulties and perspectives by adjusting the modules and iterative mechanisms.

如任务分类示例所示,不同类别的问题需要不同的推理路径来寻找答案,这受到多种因素的影响,例如相关信息的可用性、知识提取的复杂性以及推理的复杂度。在单次检索和生成过程中解决这些问题具有挑战性。为此,我们提出了一种由任务分解和协调监督的迭代检索-生成机制。这种迭代机制能够逐步收集相关信息,并在增量上下文的基础上进行渐进式推理,从而确保回答更加准确和全面。具体而言,工业应用中的问题被输入到任务分解模块中,生成初步的分解方案。该方案概述了检索步骤、推理步骤以及其他必要操作。遵循这些指令,知识检索模块检索相关信息,然后将其传递给知识组织模块进行处理和组织。组织好的知识用于执行以知识为中心的推理,生成中间答案。随着更新的相关信息和中间答案,任务分解模块重新生成更新的方案以进行下一次迭代。这种设计具有出色的适应性,允许我们通过调整模块和迭代机制来解决不同难度和视角的问题。

Table 2: Proposed frameworks for different system levels. To address the challenges faced at each level, we propose customized frameworks based on the framework illustrated in Figure 2. The following abbreviations are used: "PA" for file parsing, "KE" for knowledge extraction, "RT" for knowledge retrieval, "KO" for knowledge organization, and "KR" for knowledge-centric reasoning.

表 2: 针对不同系统层次的定制框架。为了解决每个层次面临的挑战,我们基于图 2 中展示的框架提出了定制化的框架。以下缩写用于表示:“PA”表示文件解析、“KE”表示知识提取、“RT”表示知识检索、“KO”表示知识组织、“KR”表示以知识为中心的推理。

等级 | 挑战 | 建议框架
L0 | · 由于源文档格式多样,知识提取面临挑战,需要复杂的文件解析技术。· 从原始异构数据构建高质量知识库在知识组织和集成方面引入了显著的复杂性。 | PA、KE
L1 | · 由于不恰当的分块破坏了语义连贯性,阻碍了知识的理解和提取,使准确检索复杂化。· 知识检索受到嵌入模型在对齐专业术语和别名方面的限制,降低了系统的精确度。 | PA、KE、RT、KO、KR
L2 | · 有效的知识提取和利用至关重要,因为分块后的文本通常包含相关和不相关信息,确保检索高质量数据对于准确生成至关重要。· 任务的理解和分解及其背后的逻辑往往忽略了支持数据的可用性,严重依赖大语言模型能力。 | PA、KR、Task Decomp. & Coord.
L3 | · 这一级别的挑战集中在知识收集和组织上,这对于支持预测推理至关重要。· 大语言模型在应用专业推理逻辑方面存在局限性,限制了其在预测任务中的有效性。 | PA、Task Decomp. & Coord.
L4 | · 困难在于从复杂的知识库中提取连贯的逻辑推理,其中多个因素之间的相互依赖性可能导致非唯一解。· 创造性问题的开放性使得对推理和知识合成过程的评估复杂化,难以定量评估答案质量。 | PA、RT、Multi-agent Plan.、Task Decomp. & Coord.

4.2 Phased system development

4.2 分阶段系统开发

We have categorized RAG systems into four distinct levels based on their problem-solving capabilities across the four classes of questions, as outlined in Table 1. Recognizing the pivotal role of knowledge base generation in RAG systems, we designate the construction of the knowledge base as the L0 stage of system development. The challenges faced by RAG systems vary across different levels. We analyze these challenges for each level and propose corresponding frameworks in Table 2. This stratified approach facilitates the phased development of RAG systems, enabling incremental enhancement of capabilities through iterative module refinement and algorithmic optimization.

我们根据RAG系统在四类问题中的解决能力,将其分为四个不同的层次,如表1所示。鉴于知识库生成在RAG系统中的关键作用,我们将知识库的构建定义为系统开发的L0阶段。RAG系统在不同层次面临的挑战各不相同。我们在表2中分析了每个层次的挑战,并提出了相应的框架。这种分层方法有助于RAG系统的分阶段开发,通过迭代模块优化和算法改进,逐步提升系统能力。

We observe that from L0 to L4, higher-level systems can inherit modules from lower levels and add new modules to enhance system capabilities. For instance, compared to an L1 system, an L2 system not only introduces a task decomposition and coordination module to leverage iterative retrieval-generation routing but also incorporates more advanced knowledge extraction modules, such as distilled knowledge generation, indicated in dark green in Figure 2. In the L3 system, the growing emphasis on predictive questioning necessitates enhanced requirements for knowledge organization and reasoning. Consequently, the knowledge organization module introduces additional submodules for knowledge structuring and knowledge induction, indicated in dark orange. Similarly, the knowledge-centric reasoning module has been expanded to include a forecasting submodule, highlighted in dark purple. In the L4 system, extracting complex rationale from an established knowledge base is highly challenging. To address this, we introduce multi-agent planning module to activate reasoning from diverse perspectives.

我们观察到,从 L0 到 L4,更高级的系统可以从低层级继承模块,并添加新模块以增强系统能力。例如,与 L1 系统相比,L2 系统不仅引入了任务分解和协调模块以利用迭代检索-生成路由,还引入了更高级的知识提取模块,如蒸馏知识生成,如图 2 中的深绿色所示。在 L3 系统中,对预测性问题的日益重视对知识组织和推理提出了更高的要求。因此,知识组织模块引入了额外的子模块,用于知识结构化和知识归纳,如深橙色所示。同样,以知识为中心的推理模块也扩展了预测子模块,如深紫色所示。在 L4 系统中,从已建立的知识库中提取复杂原理具有高度挑战性。为了解决这个问题,我们引入了多智能体规划模块,以激活从不同角度进行的推理。


Figure 3: Multi-layer heterogeneous graph as the knowledge base. The graph comprises three distinct layers: information resource layer, corpus layer and distilled knowledge layer.

图 3: 多层异构图作为知识库。该图包含三个不同的层次:信息资源层、语料库层和提炼知识层。

5 Detailed Implementation

5 详细实现

In this section, we delve into the implementation specifics of each module within our proposed versatile and expandable RAG framework. By elucidating the details at each level, we aim to provide a comprehensive understanding of how the framework operates and how its modularity and expandability are achieved. The subsections that follow will cover the file parsing, knowledge extraction, knowledge storage, knowledge-centric reasoning, and task decomposition and coordination modules, providing insights into their individual functionalities and interactions.

在本节中,我们将深入探讨我们提出的多功能且可扩展的 RAG 框架中每个模块的实现细节。通过阐明每个层次的细节,我们的目标是提供对该框架如何运作以及如何实现其模块化和可扩展性的全面理解。接下来的小节将涵盖文件解析、知识提取、知识存储、以知识为中心的推理以及任务分解和协调模块,深入探讨它们各自的功能和交互。

5.1 Level-0: Knowledge Base Construction

5.1 Level-0: 知识库构建

The foundational stage of the proposed RAG systems, designated as the L0 system, focuses on the construction of a robust and comprehensive knowledge base. This stage is critical for enabling effective knowledge retrieval in subsequent levels. The primary objective of the L0 system is to process and structure domain-specific documents, transforming them into a machine-readable format and organizing the extracted knowledge into a heterogeneous graph. This graph serves as the backbone for all higher-level reasoning and retrieval tasks. The L0 system encompasses several key modules: file parsing, knowledge extraction, and knowledge storage. Each of these modules plays a crucial role in ensuring that the knowledge base is both extensive and accurately reflects the underlying information contained within the source documents.

所提出的 RAG 系统的基础阶段被指定为 L0 系统,专注于构建一个健壮且全面的知识库。这一阶段对于在后续层级实现有效的知识检索至关重要。L0 系统的主要目标是处理和结构化特定领域的文档,将其转换为机器可读的格式,并将提取的知识组织成异构图。该图作为所有更高级别推理和检索任务的基础。L0 系统包含几个关键模块:文件解析、知识提取和知识存储。这些模块中的每一个都在确保知识库既广泛又准确反映源文档中包含的底层信息方面发挥着关键作用。

5.1.1 File parsing

5.1.1 文件解析

The ability to effectively parse and read various types of files is a critical component in the development of RAG systems that rely on diverse data sources. Frameworks such as LangChain3 provide a comprehensive suite of tools for natural language processing (NLP), including modules for parsing and extracting information from unstructured text documents. Its file reader capabilities are designed to handle a wide range of file formats, ensuring that data from heterogeneous sources can be seamlessly integrated into the system. Additionally, several deep learning-based tools [2, 3] and commercial cloud APIs [1, 4] have been developed to conduct robust Optical Character Recognition (OCR) and accurate table extraction, enabling the conversion of scanned documents and images into structured, machine-readable text. Given that domain-specific files often encompass sophisticated tables, charts, and figures, text-based conversion may lead to information loss and disrupt the inherent logical structure. Therefore, we propose conducting layout analysis for these files and preserving multi-modal elements such as charts and figures. The layout information can aid the chunking operation, maintaining the completeness of chunked text, while figures and charts can be described

有效解析和读取各类文件的能力是开发依赖多样化数据源的 RAG 系统的关键组成部分。诸如 LangChain3 等框架提供了一套全面的自然语言处理 (NLP) 工具,包括用于从非结构化文本文档中解析和提取信息的模块。其文件读取功能设计用于处理广泛的文件格式,确保来自异构源的数据能够无缝集成到系统中。此外,还开发了几种基于深度学习的工具 [2, 3] 和商用云 API [1, 4],以进行稳健的光学字符识别 (OCR) 和准确的表格提取,从而将扫描文档和图像转换为结构化的、机器可读的文本。鉴于特定领域的文件通常包含复杂的表格、图表和图形,基于文本的转换可能导致信息丢失并破坏固有的逻辑结构。因此,我们建议对这些文件进行布局分析,并保留图表和图形等多模态元素。布局信息可以帮助分块操作,保持分块文本的完整性,同时图表和图形可以被描述

Figure 4: The process of distilling knowledge from corpus text. The corpus text are processed to extract knowledge units following customized extraction patterns. These knowledge units are then organized to structured knowledge in the distilled knowledge layer, which may take the form of knowledge graphs, atomic knowledge, tabular knowledge, and other induced knowledge.

图 4: 从语料文本中提炼知识的过程。语料文本经过处理,按照定制的提取模式抽取知识单元。这些知识单元随后被组织成结构化知识,存储在提炼的知识层中,其形式可能包括知识图谱、原子知识、表格知识以及其他归纳知识。

by Vision-Language Models (VLMs) to assist in knowledge retrieval. This approach ensures that the integrity and richness of the original documents are retained, enhancing the efficacy of RAG systems.

通过视觉语言模型(VLMs)辅助知识检索。这种方法确保了原始文档的完整性和丰富性,提升了RAG系统的效率。

5.1.2 Knowledge Organization

5.1.2 知识组织

The proposed knowledge base is structured as a multi-layer heterogeneous graph, representing different levels of information granularity and abstraction. The graph captures relationships between various components of the data (e.g., documents, sections, chunks, figures, and tables) and organizes them into nodes and edges, reflecting their interconnections and dependencies. As depicted in Figure 3, this multi-layer structure, encompassing the information resource layer, corpus layer, and distilled knowledge layer, enables both semantic understanding and rationale-based retrieval for downstream tasks.

所提出的知识库被构建为一个多层异构图,表示不同层次的信息粒度和抽象。该图捕获了数据各个组成部分之间的关系(例如文档、章节、片段、图表和表格),并将它们组织成节点和边,反映它们的相互联系和依赖关系。如图 3 所示,这种多层结构包括信息资源层、语料库层和精炼知识层,能够为下游任务实现语义理解和基于推理的检索。

Information Resource Layer: This layer captures the diverse information sources, treating them as source nodes with edges that denote referential relationships among them. This structure aids in cross-referencing and contextualizing the knowledge, establishing a foundation for reasoning that depends on multiple sources.

信息资源层:该层捕捉多样化的信息源,将其视为源节点,并通过边表示它们之间的引用关系。这种结构有助于交叉引用和知识的情境化,为依赖多源的推理奠定了基础。

Corpus Layer: This layer organizes the parsed information into sections and chunks while preserving the document’s original hierarchical structure. Multi-modal content such as tables and figures is summarized by LLMs and integrated as chunk nodes, ensuring that multi-modal knowledge is available for retrieval. This layer enables knowledge extraction with varying levels of granularity, allowing for accurate semantic chunking and retrieval across diverse content types.

语料库层:该层将解析后的信息组织成章节和块,同时保留文档的原始层次结构。多模态内容(如表格和图表)由大语言模型进行总结,并作为块节点集成,确保多模态知识可用于检索。该层支持不同粒度的知识提取,允许跨多种内容类型进行准确的语义分块和检索。

Distilled Knowledge Layer: The corpus is further distilled into structured forms of knowledge (e.g., knowledge graphs, atomic knowledge, and tabular knowledge). This process, driven by techniques like Named Entity Recognition (NER) [19] and relationship extraction [40], ensures that the distilled knowledge captures key logical relationships and entities, supporting advanced reasoning processes. By organizing this structured knowledge in a distilled layer, we enhance the system’s ability to reason and synthesize based on deeper domain-specific knowledge. The knowledge distillation process is depicted in Figure 4. Below are the detailed distillation processes for typical knowledge forms.

蒸馏知识层:语料库进一步被提炼为结构化的知识形式(如知识图谱、原子知识和表格知识)。这一过程由命名实体识别(NER)[19] 和关系抽取 [40] 等技术驱动,确保蒸馏后的知识捕捉到关键的逻辑关系和实体,支持高级推理过程。通过将这些结构化知识组织在蒸馏层中,我们增强了系统基于更深层次的领域知识进行推理和综合的能力。知识蒸馏过程如图 4 所示。以下是典型知识形式的详细蒸馏过程。
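上述三层结构可以用一个极简的图数据结构草图示意。其中的节点命名、字段与示例内容均为假设性示意(仅 Zarxio 于 2015 年获批为公开事实),并非论文给出的具体实现:

```python
# 多层异构图的极简数据结构草图(字段与命名均为假设)
graph = {"nodes": {}, "edges": []}

def add_node(node_id, layer, payload):
    # layer 取值对应三层:source(信息源层)、corpus(语料层)、distilled(蒸馏知识层)
    graph["nodes"][node_id] = {"layer": layer, "payload": payload}

def add_edge(src, dst, relation):
    graph["edges"].append((src, dst, relation))

add_node("doc-1", "source", {"title": "FDA biosimilar report"})
add_node("chunk-1", "corpus", {"text": "Zarxio was approved in 2015."})
add_node("kg-1", "distilled", {"triple": ("Zarxio", "approved_in", "2015")})
add_edge("doc-1", "chunk-1", "contains")      # 文档包含分块
add_edge("chunk-1", "kg-1", "distilled_to")   # 分块蒸馏出知识单元
```

跨层的 `contains` 与 `distilled_to` 边即对应图 3 中各层节点之间的相互联系和依赖关系。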


Figure 5: Illustration of enhanced chunking with recurrent text splitting.

图 5: 使用循环文本分割进行增强分块的示意图。

5.2 Level-1: Factual Question focused RAG System

5.2 第一层:基于事实问题的 RAG 系统

Building upon the L0 system, the L1 system introduces knowledge retrieval and knowledge organization to realize its retrieval and generation capabilities. The primary challenges at this level are semantic alignment and chunking. The abundance of professional terminology and aliases can affect the accuracy of chunk retrieval, and unreasonable chunking can disrupt semantic coherence and introduce noise interference. To mitigate these issues, the L1 system incorporates more sophisticated query analysis techniques and basic knowledge extraction modules. The architecture is expanded to include components that facilitate task decomposition, coordination, and initial stages of knowledge organization (KO), ensuring that the system can manage more complex queries effectively.

在 L0 系统的基础上,L1 系统引入了知识检索和知识组织,以实现其检索和生成能力。这一层级的主要挑战是语义对齐和分块。专业术语和别名的丰富性可能影响分块检索的准确性,而不合理的分块则会破坏语义连贯性并引入噪声干扰。为了解决这些问题,L1 系统采用了更复杂的查询分析技术和基本知识提取模块。架构扩展了任务分解、协调和知识组织 (KO) 初期的组件,确保系统能够有效处理更复杂的查询。


Figure 6: Overview of the L1 RAG framework. The squares (■) indicate the enhanced chunking and auto-tagging sub-modules in the knowledge extraction module.

图 6: L1 RAG 框架概览。方块 (■) 表示知识提取模块中的增强分块与自动标记子模块。

5.2.1 Enhanced chunking

5.2.1 增强分块

Chunking involves breaking down a large corpus of text into smaller, more manageable segments. The primary chunking strategies commonly utilized in RAG systems include fixed-size chunking, semantic chunking, and hybrid chunking. Chunking is essential for improving both the efficiency and accuracy of the retrieval process, which consequently affects the overall performance of RAG models in multiple dimensions. In our system, each chunk serves dual purposes: (i) it becomes a unit of information that is vectorized and stored in a database for retrieval, and (ii) it acts as a source for further knowledge extraction and information summarization. Improper chunking not only fails to ensure that text vectors encapsulate the necessary semantic information, but also hinders knowledge extraction based on complete context. For instance, in the context of laws and regulations, a fixed-size chunking approach is prone to destroying text semantics and omitting key conditions, thereby affecting the quality and accuracy of subsequent knowledge extraction.

分块 (Chunking) 涉及将大量文本分解为更小、更易管理的片段。RAG 系统中常用的主要分块策略包括固定大小分块、语义分块和混合分块。分块对于提高检索过程的效率和准确性至关重要,这进而影响 RAG 模型在多方面的整体性能。在我们的系统中,每个分块都承担双重作用:(i) 它成为向量化并存储在数据库中以便检索的信息单元,(ii) 它充当进一步知识提取和信息汇总的源。不恰当的分块不仅无法确保文本向量包含必要的语义信息,还会阻碍基于完整上下文的知识提取。例如,在法律法规的背景下,固定大小分块方法容易破坏文本语义并遗漏关键条件,从而影响后续知识提取的质量和准确性。

We propose a text split algorithm to enhance existing chunking methods by breaking down large text documents into smaller, manageable chunks while preserving context and enabling effective summary generation for each chunk. The chunking process is illustrated in Figure 5. Given a source text, the algorithm iteratively splits the text into chunks. During the first iteration, it generates a forward summary of the initial chunk, providing context for generating summaries of subsequent chunks and maintaining a coherent narrative across splits. Each chunk is summarized using a predefined prompt template that incorporates both the forward summary and the current chunk. This summary is then stored alongside the chunk. The algorithm adjusts the text by removing the processed chunk and updating the forward summary with the summary of the current chunk, preparing for the next iteration. This process continues until the entire text is split and summarized. Additionally, the algorithm can dynamically adjust chunk sizes based on the content and structure of the text.

我们提出了一种文本分割算法,旨在增强现有的分块方法,通过将大型文本文档分解为更小、可管理的块,同时保留上下文并为每个块生成有效的摘要。分块过程如图 5 所示。给定源文本,该算法会迭代地将文本分割成块。在第一次迭代中,它生成初始块的前向摘要,为后续块的摘要生成提供上下文,并在分割中保持连贯的叙述。每个块都使用预定义的提示模板进行摘要,该模板结合了前向摘要和当前块的内容。然后,该摘要与块一起存储。算法通过移除已处理的块并使用当前块的摘要更新前向摘要来调整文本,为下一次迭代做好准备。此过程持续进行,直到整个文本被分割并摘要完毕。此外,该算法可以根据文本的内容和结构动态调整块的大小。
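上述带"前向摘要"的递归分割流程可以概括为如下草图。其中 `summarize` 为假设的占位函数,实际应由大语言模型按预定义提示模板实现;为简洁起见,此处未包含动态块大小调整:

```python
# 带前向摘要的递归文本分割草图(示意实现,非论文官方代码)

def split_with_summary(text, chunk_size, summarize):
    chunks, forward_summary = [], ""
    while text:
        # 取出当前块,剩余文本留待下一轮迭代
        chunk, text = text[:chunk_size], text[chunk_size:]
        # 结合前向摘要与当前块生成摘要,保持跨块叙述的连贯性
        summary = summarize(forward_summary, chunk)
        chunks.append({"text": chunk, "summary": summary})
        forward_summary = summary  # 用当前块摘要更新前向摘要
    return chunks
```

调用时只需传入一个摘要函数,例如由 LLM 封装的 `summarize(forward_summary, chunk)`。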


Figure 7: Illustration of the auto-tagging module.

图 7: 自动标注模块的示意图。

5.2.2 Auto-tagging

5.2.2 自动标签

In domain-specific RAG scenarios, the corpus is typically characterized by formal, professional, and rigorously expressed content, whereas the questions posed are often articulated in plain, easily understandable colloquial language. For instance, in medical question-answering (medQA) tasks [32], symptoms of diseases described in the questions are generally phrased in simple, conversational terms. In contrast, the corresponding medical knowledge within the corpus is often expressed using specialized professional terminology. This discrepancy introduces a domain gap that adversely affects the accuracy of chunk retrieval, especially given the limitations of the embedding models employed for this purpose.

在特定领域的 RAG 场景中,语料库通常以正式、专业且严谨表达的内容为特征,而提出的问题则往往以简单易懂的口语化语言表达。例如,在医疗问答 (medQA) 任务 [32] 中,问题中描述的疾病症状通常以简单的对话术语表达。相比之下,语料库中的相应医学知识则往往使用专业的术语表达。这种差异引入了领域差距,对块检索的准确性产生了不利影响,特别是在用于此目的的嵌入模型存在限制的情况下。

To address the domain gap issue, we propose an auto-tagging module designed to minimize the disparity between the source documents and the queries. This module preprocesses the corpus to extract a comprehensive collection of domain-specific tags or to establish tag mapping rules. Prior to the retrieval process, tags are extracted from the query and then mapped to the corpus domain using the preprocessed tag collection or tag pair collection. This tag-based domain adaptation can be employed for query rewriting or keyword retrieval within sequential information retrieval frameworks, thereby enhancing both the recall and precision of the retrieval process.

为解决领域差距问题,我们提出了一个自动标注模块,旨在最小化源文档与查询之间的差异。该模块对语料库进行预处理,以提取全面的领域特定标签或建立标签映射规则。在检索过程之前,从查询中提取标签,然后使用预处理的标签集合或标签对集合将其映射到语料库领域。这种基于标签的领域适应可以用于序列信息检索框架中的查询重写或关键词检索,从而提高检索过程的召回率和精度。

Specifically, we leverage the capabilities of the LLMs to identify key factors within the corpus chunks, summarize these factors, and generalize them into category names, which we refer to as "tag classes." We generate semantic tag extraction prompts based on these tag classes to facilitate accurate tag extraction. In scenarios where only the corpus is available, LLMs are employed with meticulously designed prompts to extract semantic tags from the corpus, thereby forming a comprehensive corpus tag collection. When practical QA samples are available, semantic tag extraction is performed on both the queries and the corresponding retrieved answer chunks. Using the tag sets extracted from the chunks and queries, LLMs are utilized to map cross-domain semantic tags and generate a tag pair collection. After establishing both the corpus tag collection and the tag pair collection, tags can be extracted from the query, and the corresponding mapped tags can be identified within the collections. These mapped tags are then used to enhance subsequent information retrieval processes, improving both recall and precision. This workflow leverages the advanced understanding and contextual capabilities of LLMs for domain adaptation.

具体而言,我们利用大语言模型的能力,在语料块中识别关键因素,总结这些因素,并将其泛化为类别名称,我们称之为“标签类”。基于这些标签类,我们生成语义标签提取提示,以促进准确的标签提取。在仅有语料可用的情况下,大语言模型通过精心设计的提示从语料中提取语义标签,从而形成一个全面的语料标签集合。当有实际的问答样本时,对查询和相应的检索答案块都进行语义标签提取。利用从语料块和查询中提取的标签集,大语言模型被用于映射跨域语义标签,并生成标签对集合。在建立语料标签集合和标签对集合后,可以从查询中提取标签,并在集合中识别相应的映射标签。这些映射标签随后用于增强后续的信息检索过程,提高召回率和精确率。该工作流程利用了大语言模型的先进理解和上下文能力,进行领域适应。
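基于标签对集合进行查询改写的流程可以用如下草图示意。此处的标签对集合为手写示例,实际应由大语言模型对语料与问答样本预处理得到:

```python
# 基于标签映射的查询改写草图(示意实现)。
# 标签对集合:口语化标签 -> 语料域专业术语标签(手写示例,实际由 LLM 预处理生成)
tag_pairs = {
    "肚子疼": "腹痛",
    "发烧": "发热",
}

def rewrite_query(query, tag_pairs):
    # 将查询中的口语化标签替换为语料域的专业标签,缩小领域差距
    for colloquial, professional in tag_pairs.items():
        query = query.replace(colloquial, professional)
    return query

rewritten = rewrite_query("孩子肚子疼还发烧怎么办", tag_pairs)  # -> "孩子腹痛还发热怎么办"
```

改写后的查询再进入后续的分块检索流程,从而同时提升召回率与精确率。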


Figure 8: Overview of multi-layer, multi-granularity retrieval over a heterogeneous graph

图 8: 异构图上的多层、多粒度检索概览

5.2.3 Multi-Granularity Retrieval

5.2.3 多粒度检索

The L1 system is designed to enable multi-layer, multi-granularity retrieval across a heterogeneous knowledge graph, which was constructed in the L0 system. Each layer of the graph (e.g., information source layer, corpus layer, distilled knowledge layer) represents knowledge at different levels of abstraction and granularity, allowing the system to explore and retrieve relevant information at various scales. For example, queries can be mapped to entire documents (information source layer) or specific chunks of text (corpus layer), ensuring that knowledge can be retrieved at the appropriate level for a given task. To support this, similarity scores between queries and graph nodes are computed to measure the alignment between the query and the retrieved knowledge. These scores are then propagated through the layers of the graph, allowing the system to aggregate information from multiple levels. This multi-layer propagation ensures that retrieval can be fine-tuned based on both the broader context (e.g., entire documents) and finer details (e.g., specific chunks or distilled knowledge). The final similarity score is generated through a combination of aggregation and propagation, ensuring that knowledge extraction and utilization are optimized for both precision and efficiency in factual question answering. The retrieval process can be iterative, refining the results based on sub-queries generated through task decomposition, further enhancing the system’s ability to generate accurate and contextually relevant answers.

L1 系统旨在实现跨异构知识图谱的多层次、多粒度检索,该图谱在 L0 系统中构建。图谱的每一层(例如信息源层、语料层、提炼知识层)代表了不同抽象层次和粒度的知识,使系统能够在不同尺度上探索和检索相关信息。例如,查询可以映射到整个文档(信息源层)或特定的文本块(语料层),确保知识可以根据给定任务在适当的层次上被检索。为了支持这一点,系统计算查询与图谱节点之间的相似度分数,以衡量查询与检索知识之间的匹配程度。这些分数随后通过图谱的各层传播,使系统能够从多个层次聚合信息。这种多层传播确保检索可以根据更广泛的上下文(例如整个文档)和更精细的细节(例如特定块或提炼知识)进行微调。最终相似度分数通过聚合和传播的组合生成,确保知识提取和利用在事实性问答中的精度和效率都得到优化。检索过程可以是迭代的,基于任务分解生成的子查询优化结果,进一步增强系统生成准确且上下文相关答案的能力。

The overview of multi-layer, multi-granularity retrieval is depicted in Figure 8. For each layer of the graph, both queries $Q$ and graph nodes are transformed into high-dimensional vector embeddings for similarity evaluation. We denote the similarity evaluation operation as $g()$. Here, $I$, $C$, and $D$ indicate the node sets in the information source layer, corpus layer, and distilled knowledge layer, respectively. The propagation and aggregation operations are represented by the function $f()$. The final chunk similarity score $S$ is obtained by aggregating the scores from the other layers and nodes.

图 8 展示了多层多粒度检索的概览。对于图的每一层,查询 $Q$ 和图节点都被转换为高维向量嵌入以进行相似性评估。我们将相似性评估操作表示为 $g()$ 。这里,$I,C$ 和 $D$ 分别表示信息源层、语料层和蒸馏知识层中的节点集。传播和聚合操作由函数 $f()$ 表示。最终的块相似性得分 $S$ 是通过聚合其他层和节点的得分得到的。
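论文未给出 $g()$ 与 $f()$ 的具体形式。作为示意,下面以余弦相似度作为 $g()$、以各层得分的加权求和作为 $f()$(权重为假设值,实际的传播与聚合策略依具体系统而定):

```python
import math

def g(q, v):
    # 相似度评估 g():余弦相似度
    dot = sum(a * b for a, b in zip(q, v))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def f(scores, weights=(0.2, 0.5, 0.3)):
    # 传播/聚合 f():对各层得分加权求和(权重为假设值)
    return sum(w * s for w, s in zip(weights, scores))

# 查询向量与三层(I、C、D)各取一个代表节点的嵌入(二维示意)
q = [1.0, 0.0]
layer_nodes = {"I": [1.0, 0.0], "C": [0.6, 0.8], "D": [0.0, 1.0]}
S = f([g(q, layer_nodes[k]) for k in ("I", "C", "D")])
```

真实系统中,每层通常包含多个节点,聚合前还需在层内先对节点得分做最大值或求和等归并。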

5.3 Level-2: Linkable and Reasoning Question focused RAG System

5.3 第二层级:可链接且专注于推理问题的 RAG 系统

The core functionality of the L2 system lies in its ability to efficiently retrieve multiple sources of relevant information and perform complex reasoning based on it. To facilitate this, the L2 system integrates an advanced knowledge extraction module that comprehensively identifies and extracts pertinent information. Furthermore, a task decomposition and coordination module is implemented to break down intricate tasks into smaller, manageable sub-tasks, thereby enhancing the system’s efficiency in handling them. The proposed framework of L2 RAG system is illustrated in Figure 9.

L2 系统的核心功能在于其能够高效检索多个相关信息来源并基于此进行复杂推理。为了实现这一点,L2 系统集成了一个高级的知识提取模块,全面识别并提取相关信息。此外,系统还实现了一个任务分解与协调模块,将复杂任务拆分为更小、更易管理的子任务,从而提升系统处理这些任务的效率。L2 RAG 系统的框架如图 9 所示。

Chunked text contains multifaceted information, increasing the complexity of retrieval. Recent studies have focused on extracting triple knowledge units from chunked text and constructing knowledge graphs to facilitate efficient information retrieval [20, 42]. However, the construction of knowledge graphs is costly, and the inherent knowledge may not always be fully explored. To better present the knowledge embedded in the documents, we propose atomizing the original documents in the Knowledge Extraction phase, a process we refer to as Knowledge Atomizing. Besides, industrial tasks often necessitate multiple pieces of knowledge, implicitly requiring the capability to decompose the original question into several sequential or parallel atomic questions. We refer to this operation as

分块文本包含多方面的信息,增加了检索的复杂性。最近的研究集中在从分块文本中提取三元组知识单元并构建知识图谱,以促进高效的信息检索 [20, 42]。然而,知识图谱的构建成本较高,且内在知识可能并不总是被完全挖掘。为了更好地呈现文档中嵌入的知识,我们提出在知识提取阶段对原始文档进行原子化处理,这一过程我们称之为知识原子化。此外,工业任务通常需要多条知识,这隐含地要求具备将原始问题分解为多个顺序或并行的原子问题的能力。我们将此操作称为

Task Decomposition. By combining the extracted atomic knowledge with the original chunks, we construct an atomic hierarchical knowledge base. Each time we decompose a task, the hierarchical knowledge base provides insights into the available knowledge, enabling knowledge-aware task decomposition.

任务分解。通过将提取的原子知识与原始块结合,我们构建了一个原子层次知识库。每次分解任务时,层次知识库都能提供可用知识的洞察,从而实现知识感知的任务分解。

5.3.1 Knowledge Atomizing

5.3.1 知识原子化

We believe that a single document chunk often encompasses multiple pieces of knowledge. Typically, the information necessary to address a specific task represents only a subset of that knowledge. Therefore, consolidating these pieces within a single chunk, as traditionally done in information retrieval, may not facilitate the efficient retrieval of the precise information required. To align the granularity of knowledge with the queries generated during task solving, we propose a method called knowledge atomizing. This approach leverages the context understanding and content generation capabilities of LLMs to automatically tag atomic knowledge pieces within each document chunk. Note that these chunks could be segments of an original reference document, description chunks generated for tables, images, or videos, or summary chunks of entire sections, chapters, or even documents.

我们相信单个文档块通常包含多个知识片段。通常,解决特定任务所需的信息只是整个知识的一部分。因此,像传统信息检索那样将这些片段整合在一个块中,可能无法有效检索到所需的精确信息。为了使知识的粒度与任务解决过程中生成的查询相匹配,我们提出了一种称为知识原子化的方法。该方法利用大语言模型的上下文理解和内容生成能力,自动标记每个文档块中的原子知识片段。请注意,这些块可以是原始参考文档的片段、为表格、图像、视频生成的描述块,甚至是整个章节或文档的摘要块。

Atomic knowledge can be presented in various forms. Instead of utilizing declarative sentences or subject-relationship-object tuples, we propose using questions as knowledge indexes to further bridge the gap between the stored knowledge and incoming queries. Unlike the semantic tagging process, in the knowledge atomizing process we input the document chunk to the LLM as context and ask it to generate as many relevant questions as possible that can be answered by the given chunk. These generated atomic questions are saved as atomic question tags together with the given chunk. An example of knowledge atomizing is demonstrated in Figure 10(c), where the atomic questions encapsulate various aspects of the knowledge contained within the chunk. A hierarchical knowledge base can accommodate queries of varying granularity. Figure 11 illustrates the retrieval process from an atomic knowledge base comprising chunks and atomic questions. Queries can directly retrieve reference chunks as usual. Additionally, since each chunk is tagged with multiple atomic questions, an atomic query can be used to locate relevant atomic questions, which then lead to the associated reference chunks.

原子知识的呈现方式可以多种多样。我们提出使用问题作为知识索引,而不是利用陈述句或主语-关系-宾语元组,以进一步缩小存储知识与查询之间的差距。与语义标记过程不同,在知识原子化过程中,我们将文档块输入到大语言模型 (LLM) 作为上下文,要求它生成尽可能多的可以由给定块回答的相关问题。这些生成的原子问题与给定块一起保存为原子问题标签。图 10(c) 展示了知识原子化的一个例子,其中原子问题涵盖了块内包含知识的各个方面。分层知识库可以适应不同粒度的查询。图 11 展示了从包含块和原子问题的原子知识库中检索的过程。查询可以像往常一样直接检索参考块。此外,由于每个块都标记了多个原子问题,因此可以使用原子查询来定位相关的原子问题,然后找到相关的参考块。
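The atomizing step above can be sketched as follows. The prompt wording and the `llm` callable interface are hypothetical stand-ins for the actual LLM call:

```python
def atomize_chunk(chunk_text, llm):
    """Knowledge atomizing: ask the LLM to list as many questions as
    possible that the chunk can answer; return them as atomic question tags.
    `llm` is any callable mapping a prompt string to a response string."""
    prompt = (
        "Given the following passage, list every question it can answer, "
        "one per line:\n\n" + chunk_text
    )
    response = llm(prompt)
    return [line.strip() for line in response.splitlines() if line.strip()]

def build_atomic_kb(chunks, llm):
    """Atomic hierarchical KB: each atomic question maps back to its
    source chunk, so an atomic query can be routed question -> chunk."""
    kb = []  # list of (atomic_question, source_chunk) pairs
    for chunk in chunks:
        for question in atomize_chunk(chunk, llm):
            kb.append((question, chunk))
    return kb
```

Storing (question, chunk) pairs rather than bare triples keeps the index in the same form as the queries issued during task solving, which is the gap-bridging point made above.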

5.3.2 Knowledge-Aware Task Decomposition

5.3.2 知识感知的任务分解

For a specific task, multiple decomposition strategies might be applicable. Consider Q2 in Figure 1 as an example. The two-step analytical reasoning process depicted may be effective if an interchangeable biosimilar products list is available. However, if only a general list of biosimilar products exists, with attributes dispersed throughout multiple documents, a different decomposition strategy may be necessary: (1) Retrieve the biosimilar product list; (2) Determine whether each product is interchangeable; (3) Count the total number of interchangeable products. The critical factor in selecting the most effective decomposition approach lies in understanding the contents of the specialized knowledge base. Motivated by this, we design the Knowledge-Aware Task Decomposition workflow, which is illustrated in Figure 10(a). The complete algorithm for task solving using Knowledge-Aware Task Decomposition is presented in Algorithm 1.

对于特定任务,可能会有多种分解策略适用。以图 1 中的 Q2 为例,如果存在可互换的生物类似物产品列表,则所展示的两步分析推理过程可能有效。然而,如果仅存在一个通用的生物类似物产品列表,且属性分散在多个文档中,则可能需要采用不同的分解策略:(1) 检索生物类似物产品列表;(2) 确定每个产品是否可互换;(3) 计算可互换产品的总数。选择最有效分解方法的关键在于理解专业知识库的内容。基于此,我们设计了知识感知任务分解工作流程,如图 10(a) 所示。使用知识感知任务分解进行任务求解的完整算法如算法 1 所示。

The reference context $\mathcal{C}_{t}$ is initialized as an empty set, and the original question is denoted by $q$. As illustrated in the for-loop starting at line 2 of the algorithm, in the $t$-th iteration, we use an LLM, denoted by $\mathcal{LLM}$, to generate query proposals potentially useful for task completion, denoted as $\hat{q}_{i}^{t}$.

参考上下文 $\mathcal{C}_{t}$ 初始化为空集,原始问题记为 $q$。如算法第 2 行开始的 for 循环所示,在第 $t$ 次迭代中,我们使用一个表示为 $\mathcal{LLM}$ 的大语言模型来生成可能对任务完成有用的查询提议,记为 $\hat{q}_{i}^{t}$。


Figure 10: The illustration of knowledge atomizing and knowledge-aware task decomposition: (a) Workflow of task solving with knowledge-aware task decomposition, (b) Workflow of knowledge atomizing, (c) Example of knowledge atomizing, (d) RAG case with knowledge atomizing and knowledge-aware task decomposition.

图 10: 知识原子化和知识感知任务分解的示意图:(a) 使用知识感知任务分解的任务解决流程,(b) 知识原子化的流程,(c) 知识原子化的示例,(d) 使用知识原子化和知识感知任务分解的 RAG 案例。

In this step, the chosen reference chunks $\mathcal{C}_{t}$ are provided as context to avoid generating proposals linked to already known knowledge. These proposals are then utilized as atomic queries to determine if relevant knowledge exists within the knowledge base, denoted as $\mathcal{KB}$. For each atomic question proposal, we retrieve its relevant atomic question candidates along with their source chunks $\{(q_{ij}^{t}, c_{ij}^{t})\}$ from $\mathcal{KB}$. Any similarity metric $\mathrm{sim}$ can be used to retrieve atomic questions. In our experiments, we use the cosine similarity of the corresponding embeddings to retrieve the top $K$ atomic questions whose similarity to a proposed atomic question is greater than or equal to a given threshold $\delta$. Given the original question $q$, the accumulated context $\mathcal{C}_{t}$, and the list of retrieved atomic questions $q_{ij}^{t}$, $\mathcal{LLM}$ selects the most useful atomic question $q^{t}$ from $q_{ij}^{t}$ and retrieves the relevant chunk $c^{t}$. This retrieved chunk is aggregated into the reference context $\mathcal{C}_{t}$ for the next round of decomposition. Knowledge-aware decomposition can iterate up to $N$ times, where $N$ is a hyperparameter set to control computational cost. The iteration process can be terminated early if there are no high-quality question proposals, no highly relevant atomic candidates retrieved, no suitable atomic knowledge selections, or if the $\mathcal{LLM}$ determines that the acquired knowledge is sufficient to complete the task. Finally, the accumulated context $\mathcal{C}_{t}$ is utilized to generate the answer $\hat{a}$ for the given question $q$ in line 14.

在这一步中,选定的参考块 $\mathcal{C}_{t}$ 被作为上下文提供,以避免生成与已知知识相关的提案。这些提案随后被用作原子查询,以确定记为 $\mathcal{KB}$ 的知识库中是否存在相关知识。对于每个原子问题提案,我们从知识库中检索其相关的原子问题候选及其源块 $\{(q_{ij}^{t}, c_{ij}^{t})\}$。我们可以使用任何相似度度量 $\mathrm{sim}$ 来检索原子问题。在我们的实验中,我们使用其对应嵌入的余弦相似度来检索所有前 $K$ 个原子问题,前提是它们与提案的原子问题的相似度大于或等于给定阈值 $\delta$。通过原始问题 $q$、累积上下文 $\mathcal{C}_{t}$ 以及检索到的原子问题列表 $q_{ij}^{t}$,$\mathcal{LLM}$ 从 $q_{ij}^{t}$ 中选择最有用的原子问题 $q^{t}$ 并检索相关块 $c^{t}$。这个检索到的块被聚合到参考上下文 $\mathcal{C}_{t}$ 中,用于下一轮分解。知识感知分解最多可以迭代 $N$ 次,其中 $N$ 是一个用于控制计算成本的超参数。如果没有高质量问题提案、没有检索到高度相关的原子候选、没有合适的原子知识选择,或者 $\mathcal{LLM}$ 确定获得的知识足以完成任务,迭代过程可以提前终止。最后,累积上下文 $\mathcal{C}_{t}$ 被用于在第 14 行为给定问题 $q$ 生成答案 $\hat{a}$。
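The iterative loop described above (Algorithm 1) can be summarized as follows. The `propose`, `select`, and `answer` callables stand in for the three $\mathcal{LLM}$ calls, and `sim` for the similarity metric; all of these interfaces are assumptions for illustration, not the paper's exact implementation:

```python
def solve_with_kad(q, kb, sim, propose, select, answer,
                   N=5, K=3, delta=0.5):
    """Sketch of task solving with knowledge-aware decomposition.
    kb: list of (atomic_question, chunk) pairs; sim: (q1, q2) -> score;
    propose/select/answer stand in for the LLM calls in Algorithm 1."""
    context = []  # accumulated reference context C_t
    for _ in range(N):
        proposals = propose(q, context)  # atomic query proposals q_hat_i^t
        if not proposals:                # no high-quality proposals: stop
            break
        # Retrieve top-K atomic candidates with similarity >= delta.
        candidates = []
        for p in proposals:
            scored = [(sim(p, aq), aq, c) for aq, c in kb]
            scored = [t for t in scored if t[0] >= delta]
            candidates += sorted(scored, reverse=True)[:K]
        if not candidates:               # nothing relevant retrieved: stop
            break
        # LLM selects the most useful atomic question q^t (or None if
        # it judges the accumulated knowledge sufficient).
        choice = select(q, context, [aq for _, aq, _ in candidates])
        if choice is None:
            break
        chunk = next(c for _, aq, c in candidates if aq == choice)
        if chunk not in context:
            context.append(chunk)        # aggregate c^t into C_t
    return answer(q, context)            # generate a_hat from q and C_t
```

The early-exit branches mirror the termination conditions listed above: no proposals, no relevant candidates, or a sufficiency judgment by the LLM.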


Figure 11: Retrieval process from an atomic knowledge base. It supports two retrieval paths: (a) using queries to directly retrieve chunks as usual; (b) locating atomic nodes first then achieving the associated chunks.

图 11: 从原子知识库中检索的过程。它支持两种检索路径:(a) 使用查询直接检索块;(b) 先定位原子节点,然后获取相关的块。
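The two retrieval paths in Figure 11 can be sketched as one function; the merging policy and thresholds here are illustrative assumptions:

```python
def retrieve(query, chunks, kb, sim, top_k=2, delta=0.5):
    """Two retrieval paths from an atomic knowledge base (cf. Figure 11):
    (a) query -> chunks directly; (b) query -> atomic questions -> chunks.
    kb: list of (atomic_question, source_chunk); sim: (text, text) -> score."""
    # Path (a): direct chunk retrieval by similarity, as usual.
    direct = sorted(chunks, key=lambda c: sim(query, c), reverse=True)[:top_k]
    # Path (b): locate matching atomic question nodes first, then follow
    # them to their associated source chunks.
    via_atoms = [c for aq, c in kb if sim(query, aq) >= delta]
    # Merge both paths, preserving order and dropping duplicates.
    merged = []
    for c in direct + via_atoms:
        if c not in merged:
            merged.append(c)
    return merged
```

Path (b) is what lets a fine-grained atomic query reach a chunk whose overall text matches the query poorly but which answers exactly that question.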

Algorithm 1 Task Solving with Knowledge-Aware Decomposition

算法 1: 基于知识感知分解的任务求解

5.3.3 Knowledge-Aware Task Decomposer Training

5.3.3 知识感知任务分解器训练

It is worth mentioning that knowledge-aware decomposition can be implemented as a learnable component by training an atomic query proposer. This trained proposer can then directly suggest atomic queries $q^{t}$ during inference, which means lines 3 to 5 in Algorithm 1 can be replaced by a single call to the learned proposer, thereby reducing both inference time and computational cost. To train the knowledge-aware decomposer, we collect data about the rationale behind each step by sampling context and creating diverse interaction trajectories. With this data collected, we train a decomposer that incorporates domain-specific rationale into the task decomposition and result-seeking process.

值得一提的是,知识感知分解可以成为一个可学习的组件。经过训练后,这个提议器可以直接在推理过程中建议原子查询 $q^{t}$,这意味着算法 1 中的第 3 至 5 行可以被对该学习到的提议器的单次调用所取代,从而减少推理时间和计算成本。为了训练知识感知分解器,我们通过采样上下文并创建多样化的交互轨迹来收集每个步骤背后的原理数据。收集到这些数据后,我们训练一个分解器,使其能够将特定领域的原理融入任务分解和结果寻找的过程中。

The data collection process, as depicted in Figure 12 and Algo. 2, implements a sophisticated dual-dictionary system for managing and tracking information. Our system utilizes two primary data structures: dictionary $\boldsymbol{S}$ for maintaining comprehensive score records, and dictionary $\mathcal{V}$ for systematically tracking visit frequencies of candidate chunks. During the initialization phase, we establish baseline values by setting all scores to zero and initializing visit counters to one, creating a foundation for dynamic updates throughout the subsequent processing stages.

如图 12 和 Algo. 2 所示,数据收集过程实现了一个复杂的双字典系统,用于管理和跟踪信息。我们的系统利用了两种主要的数据结构:字典 $\boldsymbol{S}$ 用于维护全面的评分记录,字典 $\mathcal{V}$ 用于系统性地跟踪候选块的访问频率。在初始化阶段,我们通过将所有评分设置为零并将访问计数器初始化为一来建立基线值,为后续处理阶段的动态更新奠定基础。

In each iteration of our decomposition process, the system retrieves the top $K^{\prime}$ chunks most relevant to the current atomic question. These chunks must satisfy our similarity threshold criterion (specifically, similarity exceeding $\delta^{\prime}$, where $\delta^{\prime}<\delta$), with $K^{\prime}$ intentionally configured to be larger than $K$ to ensure comprehensive coverage. Following this initial retrieval, we select the data chunks corresponding to the top $K$ most relevant retrieved atomic pairs and integrate them into the context. Retrieved chunks that do not make it into the top $K$ selection are incorporated into $\boldsymbol{S}$, and their scores are updated based on the computed relevance metrics.

在我们的分解过程的每次迭代中,系统会针对与当前原子问题相关性最高的前 $K^{\prime}$ 个块执行检索操作。这些块必须满足我们的相似性阈值标准(具体来说,相似性超过 $\delta^{\prime}$,其中 $\delta^{\prime}<\delta$),并且 $K^{\prime}$ 有意配置为大于 $K$ 以确保全面覆盖。在初始检索之后,我们选择对应于前 $K$ 个最相关的原子检索对的数据块并将其集成到上下文中。对于那些未被选入前 $K$ 的检索块,我们将其纳入 $\boldsymbol{S}$ 中,并根据计算出的相关性指标更新它们的分数。
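A minimal sketch of this bookkeeping follows, assuming $\boldsymbol{S}$ accumulates relevance scores of non-selected chunks and $\mathcal{V}$ tracks visit counts; the exact update rule is an assumption for illustration:

```python
def init_records(candidate_ids):
    """Dual-dictionary bookkeeping: S holds cumulative relevance scores
    (initialized to zero), V holds visit counts (initialized to one)."""
    S = {cid: 0.0 for cid in candidate_ids}
    V = {cid: 1 for cid in candidate_ids}
    return S, V

def update_after_retrieval(S, V, scored_chunks, K):
    """Keep the top-K chunks for the context; fold the remaining retrieved
    chunks into S so they stay available for later context sampling.
    scored_chunks: list of (relevance_score, chunk_id)."""
    ranked = sorted(scored_chunks, key=lambda t: t[0], reverse=True)
    selected = [cid for _, cid in ranked[:K]]
    for score, cid in ranked[K:]:
        S[cid] = S.get(cid, 0.0) + score  # accumulate relevance evidence
        V.setdefault(cid, 1)
    return selected
```

Chunks that repeatedly score just below the cutoff thus build up credit in `S`, making them natural candidates for the exploration step described next.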


Figure 12: Data collection process for decomposer training, comprising four main components: a) sampling data chunks from the context sampling pool to serve as the reference context for question decomposition, b) saving the generated atomic query proposals, c) after retrieval and selection, saving the chosen atomic query proposals as part of the reasoning trajectories, d) evaluating the answer to generate a score.

图 12: 分解器训练的数据收集过程,包含四个主要组成部分:a) 从上下文采样池中抽取数据块作为问题分解的参考上下文,b) 保存生成的原子查询建议,c) 在检索和选择后,将选定的原子查询建议保存为推理轨迹的一部分,d) 评估答案以生成分数。


Figure 13: An example of context sampling and an illustration of decomposer training with collected data.

图 13: 上下文采样的示例以及使用收集数据进行的分解器训练示意图。

To ensure comprehensive exploration of the solution space, we implement a sampling mechanism that selects additional chunks from $\boldsymbol{S}$ when available, incorporating them into the reference context. Our implementation leverages the Upper Confidence Bound (UCB) [8] algorithm for context sampling, balancing exploitation against exploration. The exploitation component manifests through the retriever-selected chunks, focusing on options with the currently highest estimated rewards to optimize immediate performance gains. Conversely, the exploration aspect is fulfilled through context sampling from $\boldsymbol{S}$, enabling the systematic investigation of less-certain options to accumulate valuable data and potentially uncover superior long-term alternatives.

为确保全面探索解决方案空间,我们实施了一种先进的采样机制,当$\boldsymbol{S}$中有可用数据时,智能地选择额外数据块,并将其无缝整合到参考上下文中。我们的实现利用了置信区间上界 [8] (UCB) 算法进行上下文采样,在利用和探索之间建立了一种平衡的方法。利用部分通过检索器选择的数据块体现,专注于当前估计奖励最高的选项,以优化即时性能提升。相反,探索部分通过从 $\boldsymbol{S}$ 中进行上下文采样来实现,使得能够系统地研究不确定性较高的选项,以积累有价值的数据,并可能发现更优的长期替代方案。
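The UCB trade-off can be sketched as follows, assuming $\boldsymbol{S}$ stores accumulated scores and $\mathcal{V}$ visit counts as initialized above; the exploration constant `c` is an illustrative choice, not a value from the paper:

```python
from math import log, sqrt

def ucb_sample(S, V, t, c=1.4):
    """UCB context sampling: pick the chunk whose upper confidence bound
    (mean accumulated score + exploration bonus) is highest at step t.
    Rarely visited chunks get a larger bonus, driving exploration."""
    def ucb(cid):
        return S[cid] / V[cid] + c * sqrt(log(t) / V[cid])
    best = max(S, key=ucb)
    V[best] += 1  # record the visit so its bonus shrinks next time
    return best
```

With equal visit counts the highest-scoring chunk wins (exploitation); as its count grows, the shrinking bonus eventually lets under-visited chunks be sampled (exploration).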

This strategy serves a dual purpose: it facilitates the generation of diverse and comprehensive atomic query proposals, and it enables systematic exploration of multiple potential reasoning pathways. Through this approach, we progressively work toward deriving optimal final answers while maintaining a balance between immediate performance optimization and the long-term discovery of potentially superior solutions.

这一精心设计的策略具有双重目的:它不仅有助于生成多样且全面的原子查询提案,还能够系统性地探索多种潜在的推理路径。通过这种复杂的方法,我们逐步朝着得出最佳最终答案的方向努力,同时在即时性能优化与长期发现潜在更优解决方案之间保持平衡。

We record atomic proposals (AP), interactive trajectories, and answer scores to support decomposer training. For each specialized domain, interactive trajectories featuring distinct reasoning paths are gathered for decomposer training. This allows us to use the answer score as a supervised signal to train the decomposer. The decomposer training process is depicted in Figure 13.