GENERATE RATHER THAN RETRIEVE: LARGE LANGUAGE MODELS ARE STRONG CONTEXT GENERATORS

ABSTRACT

Knowledge-intensive tasks, such as open-domain question answering (QA), require access to a large amount of world or domain knowledge. A common approach for knowledge-intensive tasks is to employ a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from an external corpus such as Wikipedia and then predicts an answer conditioned on the retrieved documents. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators. We call our method generate-then-read (GENREAD), which first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer. Furthermore, we propose a novel clustering-based prompting method that selects distinct prompts in order to generate diverse documents that cover different perspectives, leading to better recall over acceptable answers. We conduct extensive experiments on three different knowledge-intensive tasks: open-domain QA, fact checking, and dialogue systems. Notably, GENREAD achieves 71.6 and 54.4 exact match scores on TriviaQA and WebQ, significantly outperforming the state-of-the-art retrieve-then-read pipeline DPR-FiD by $+4.0$ and $+3.9$, without retrieving any documents from any external knowledge source. Lastly, we demonstrate that model performance can be further improved by combining retrieval and generation. Our code and generated documents can be found at https://github.com/wyu97/GenRead.

1 INTRODUCTION

Knowledge-intensive tasks, such as open-domain question answering (QA) and fact checking, require access to a large amount of world or domain knowledge (Petroni et al., 2021). These tasks are challenging even for humans without access to an external knowledge source such as Wikipedia. Most existing methods for knowledge-intensive tasks employ a retrieve-then-read pipeline that first retrieves a handful of relevant contextual documents from Wikipedia and then conditions the prediction of the answer on these documents along with the question (Karpukhin et al., 2020; Lewis et al., 2020; Izacard & Grave, 2021). Nevertheless, these methods suffer from three main drawbacks. First, candidate documents for retrieval are chunked (e.g., 100 words) and fixed, so the retrieved documents might contain noisy information that is irrelevant to the question. Second, the representations of questions and documents are typically obtained independently in modern two-tower dense retrieval models (Karpukhin et al., 2020), so only shallow interactions between them are captured (Khattab et al., 2021). Third, document retrieval over a large corpus requires the retriever model to first encode all candidate documents and store a representation for each document. These two operations limit the parameter count of dense retrievers and the size of their embedding vectors, so dense retrievers cannot benefit from the world knowledge or deduction capabilities of large language models (Levine et al., 2022).

In this paper, we propose to leverage large language models, such as InstructGPT (Ouyang et al., 2022), to directly generate contextual documents for a given question, instead of retrieving relevant documents from an external corpus such as Wikipedia. Our approach has two main advantages. First, we show that generated contextual documents contain the correct answer more often than the top retrieved documents. We believe this is because large language models generate contextual documents by performing deep token-level cross-attention between all the question and document contents, resulting in generated documents that are more specific to the question than retrieved documents. Second, we show that our approach significantly outperforms directly generating answers from large language models, despite not incorporating any new external information. This is mainly because the task of generating document-level contexts is close to the objective of causal language modeling pre-training, so the world knowledge stored in the model parameters can be better utilized.

We show, on multiple datasets, that generated documents are more likely to contain correct answers than the top retrieved documents. Notably, in dense retrieval methods, as more documents are retrieved, the recall of documents containing the correct answer increases (Karpukhin et al., 2020). However, the recall performance does not scale as well with generated documents because even with sampling methods, generated documents tend to contain duplicate information. In order to improve the recall performance of generated documents, we propose a novel clustering-based prompt method. We synthesize a prompt with in-context demonstrations of question-document pairs sampled from diverse clusters. These prompts result in generated documents that cover different perspectives of the question and improve the scaling of performance as more documents are generated per question.

In contrast to the retrieve-then-read pipeline, our method is essentially a generate-then-read pipeline. Specifically, it first prompts a large language model to generate contextual documents based on a given question, and then reads the generated documents to produce the final answer. The reader can be a large model (e.g., InstructGPT (Ouyang et al., 2022)) used under a zero-shot setting, or a small one (e.g., FiD (Izacard & Grave, 2021)) fine-tuned with generated documents on the training split of the target dataset. We evaluate our proposed method on three different knowledge-intensive tasks and demonstrate its effectiveness in both zero-shot and supervised settings.

Overall, our main contributions can be summarized as follows:

1. We propose a novel generate-then-read pipeline for solving knowledge-intensive tasks, i.e., replacing the process of retrieving documents from Wikipedia or searching for related documents on Google by prompting a large language model to generate relevant contextual documents.
2. We propose a novel clustering-based prompting approach to generate multiple diverse contextual documents, which increases the likelihood of covering the correct answer. We demonstrate that this approach can significantly improve performance on end QA and other downstream tasks.
3. We conduct extensive experiments with three knowledge-intensive NLP tasks under both zero-shot and supervised settings. Notably, our method can match or even outperform retrieve-then-read pipeline methods, without retrieving any documents from any external knowledge source.

2 RELATED WORK

2.1 KNOWLEDGE-INTENSIVE NLP VIA RETRIEVE-THEN-READ PIPELINE.

Mainstream methods for solving knowledge-intensive NLP tasks employ a retrieve-then-read model pipeline. Given a question, this model first leverages a retriever over a large evidence corpus (e.g., Wikipedia) to fetch a set of relevant documents that may contain the answer. A reader is then used to peruse the retrieved documents and predict an answer. Recent follow-up work has mainly focused on improving the retriever (Karpukhin et al., 2020; Qu et al., 2021; Sachan et al., 2022) or the reader (Izacard & Grave, 2021; Cheng et al., 2021; Yu et al., 2022), or on training the system end-to-end (Lewis et al., 2020; Singh et al., 2021). Early retrieval methods mainly employed sparse retrievers, such as BM25 (Chen et al., 2017). Recently, ORQA (Lee et al., 2019) and DPR (Karpukhin et al., 2020) have revolutionized the field by utilizing dense contextualized vectors for document indexing, leading to superior performance over traditional approaches. We propose an alternative approach which forgoes retrieval, instead extracting the knowledge from the model parameters of a large language model. We show that our approach can be combined with dense retrievers to outperform both methods used independently. Our method can also be combined with any reader mechanism, allowing generated context documents to be plugged into any current knowledge-intensive NLP pipeline.

2.2 GENERATOR AS RETRIEVER FOR OBTAINING CONTEXTUAL DOCUMENTS.

Recent works have investigated using auto-regressive language models to generate identifier strings for documents, as an intermediate target for retrievals, such as entity names (De Cao et al., 2020) or distinctive n-grams that can be mapped to full passages (Bevilacqua et al., 2022). However, one needs to create the identifiers, hence the structure was not thoroughly evaluated on a large-scale benchmark (Bevilacqua et al., 2022). Other works have demonstrated that the knowledge stored in the parameters of pre-trained language models could be “retrieved” to some extent by directly generating text (Petroni et al., 2019; Roberts et al., 2020). However, the previous work only used generation for query expansion (Mao et al., 2021), which did not exploit the potential of directly generating contextual documents for open-domain questions. Different from the above approaches that aimed to train a generator model to produce contextual document identifiers (which is still using the original Wikipedia text) or provide data augmentation to retrievers, our work directly generates contextual documents for given questions.

2.3 NLP MODELS ENHANCED BY LARGE LANGUAGE MODEL OUTPUTS.

A line of recent work has shown that relevant knowledge can be elicited from large language models, especially for domains that lack appropriate knowledge bases with sufficient coverage (Liu et al., 2022b; Fang et al., 2022). For example, Liu et al. (2022b) proposed leveraging GPT-3 to generate relevant contexts, then providing the contexts as additional input when answering a commonsense question. Another line of work focused on prompting a large language model to generate a series of intermediate reasoning steps, often referred to as chain-of-thought (Wei et al., 2022b; Kojima et al., 2022; Li et al., 2022). The prompt consists of an instruction (e.g., Let’s think step by step!), a few demonstrations that are fixed for each task, and a new-question placeholder. The demonstrations are human-written, and each consists of a question in the style of the task and a series of intermediate reasoning steps that are helpful for answering the question. Our work does not require any human annotation, but adds to this line of work on leveraging model-generated text to guide further generations. In our case, we apply this approach to knowledge-intensive tasks, which have not been explored by previous work.

3 PROPOSED METHOD

In this section, we present details of our proposed generate-then-read (GENREAD) pipeline for solving various knowledge-intensive tasks. Specifically, it first prompts a large language model to generate contextual documents with respect to a given query, then reads the generated documents to predict the final answer. The reader can either be a large model (e.g., InstructGPT) used for the zero-shot setting, or a small one (e.g., FiD) fine-tuned with generated documents on the training split of the target dataset. We introduce the zero-shot setting in §3.1 and the supervised setting in §3.2.

3.1 ZERO-SHOT SETTING

Under the zero-shot setting, there is no training data: neither questions nor contextual documents. When tested on the open-domain QA task, most existing large language models directly encode the given question and predict the answer (Brown et al., 2020; Du et al., 2022; Chowdhery et al., 2022). Specifically, the question $q$, together with some text prompt, is input to the model, which then generates the answer according to $p(a|q,\theta)$, where $\theta$ represents the pre-trained model parameters. In practice, the maximum a posteriori (MAP) estimate is taken as the final answer, i.e., $\hat{a}=\arg\max_{a}p(a|q,\theta)$. However, directly asking large language models to output answers in this way often leads to poor performance, as it leaves a considerable amount of additional world knowledge unexploited (Levine et al., 2022). In contrast, the zero-shot retrieve-then-read pipeline first uses an off-the-shelf retriever to fetch relevant documents from an external knowledge source such as Wikipedia, then asks the large language model to read the documents and predict the answer.

In this work, we improve the performance by introducing an auxiliary generated document variable $d$, and extend the model to the form $p(a|q)=\sum_{i}p(a|d_{i},q)\,p(d_{i}|q)$. In practice, we cannot sum over all possible documents $d$. Therefore, the most common approach is to compute the MAP estimate $\hat{d}=\arg\max_{d}p(d|q)$ using beam search, and then to approximate the sum over $d$ with this single value. We refer to this two-step approach as a generate-then-read pipeline.
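
As a rough illustration of the difference between the full sum over documents and the single-document approximation, the following toy sketch compares the two. The document names, answer strings, and all probabilities are invented for illustration and do not come from the paper.

```python
# Toy comparison of p(a|q) = sum_i p(a|d_i, q) * p(d_i|q) with the
# single-document (MAP) approximation. All numbers are made up.

def marginal_answer_probs(doc_probs, answer_probs_given_doc):
    """Sum p(a|d_i, q) * p(d_i|q) over every candidate document."""
    totals = {}
    for doc, p_doc in doc_probs.items():
        for answer, p_ans in answer_probs_given_doc[doc].items():
            totals[answer] = totals.get(answer, 0.0) + p_ans * p_doc
    return totals


def map_answer_probs(doc_probs, answer_probs_given_doc):
    """Replace the sum with the single most probable document."""
    best_doc = max(doc_probs, key=doc_probs.get)
    return dict(answer_probs_given_doc[best_doc])


# Hypothetical p(d_i|q) for three candidate documents.
doc_probs = {"d1": 0.5, "d2": 0.3, "d3": 0.2}
# Hypothetical p(a|d_i, q) for each document.
answer_probs_given_doc = {
    "d1": {"Paris": 0.9, "Lyon": 0.1},
    "d2": {"Paris": 0.6, "Lyon": 0.4},
    "d3": {"Paris": 0.2, "Lyon": 0.8},
}

exact = marginal_answer_probs(doc_probs, answer_probs_given_doc)
approx = map_answer_probs(doc_probs, answer_probs_given_doc)
print(exact, approx)
```

In this toy case both variants pick the same answer, but the approximation ignores whatever the lower-ranked documents would have contributed, which is one motivation for generating several diverse documents later in the paper.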

Figure 1: Overall framework of the clustering-based prompting method. It uses distinct question-document pairs sampled from each embedding cluster as in-context demonstrations to prompt a large language model to generate diverse documents, then reads the documents to predict an answer.

STEP 1: GENERATE. In this step, we first prompt a large language model (e.g., InstructGPT (Ouyang et al., 2022)) to generate documents based on the given question. For example, the input to the language model could be “Generate a background document to answer the given question. {question placeholder}”. Any decoding strategy (e.g., greedy decoding, beam search) can be used, but we used greedy decoding throughout the zero-shot experiments for simplicity and reproducibility.

STEP 2: READ. In the second step, we use the generated document $\hat{d}$ along with the input question to produce the final answer from the large language model. This is the same setting as “zero-shot” reading comprehension, which has been widely studied in existing work (Brown et al., 2020; Lazaridou et al., 2022). We choose appropriate prompts from P3 (Bach et al., 2022), such as “Refer to the passage below and answer the following question. Passage: {background placeholder} Question: {question placeholder}”. Finally, the language model is fed the prompted text to generate an answer.
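
The two steps can be sketched as a pair of calls to the same model. This is a minimal sketch, not the authors' implementation: the two prompt templates are the examples quoted in the text, while `generate_then_read`, the `llm` callable, and the `fake_llm` stand-in are hypothetical names introduced here so the example runs without any model API.

```python
from typing import Callable

# Prompt templates taken from the examples quoted in the text.
GENERATE_PROMPT = (
    "Generate a background document to answer the given question. {question}"
)
READ_PROMPT = (
    "Refer to the passage below and answer the following question. "
    "Passage: {background} Question: {question}"
)


def generate_then_read(question: str, llm: Callable[[str], str]) -> str:
    """Step 1: generate a background document. Step 2: read it and answer."""
    document = llm(GENERATE_PROMPT.format(question=question))
    return llm(READ_PROMPT.format(background=document, question=question))


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, for demonstration only."""
    if prompt.startswith("Generate a background document"):
        return "Canberra is the capital city of Australia."
    return "Canberra"


answer = generate_then_read("What is the capital of Australia?", fake_llm)
print(answer)
```

In a real setting, `llm` would wrap a deterministic (for example, greedy) decoding call to the chosen language model, matching the decoding choice described in Step 1.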

3.2 SUPERVISED SETTING

Although large language models demonstrate impressive zero-shot abilities, their performance still lags behind that of the supervised setting. Therefore, we also explore how the documents generated by large language models can benefit the supervised setting. As directly fine-tuning large language models on downstream datasets could be prohibitively expensive, we leverage a small reader model such as FiD to peruse the generated documents under the supervised setting.

Under the supervised setting, scaling the number of retrieved documents can lead to better performance (Karpukhin et al., 2020; Izacard & Grave, 2021). This is mainly because retrieving more documents can cover more relevant information and knowledge, i.e., it yields a higher recall score. Nevertheless, asking a large language model to generate multiple high-quality contextual documents is a challenging task. Dense retrieval methods can fetch multiple documents covering different perspectives of the answer. Compared to dense retrievers, simply prompting a large language model to generate multiple contextual documents often leads to low knowledge coverage, since the contents generated by multiple decoding passes from the same input tend to be similar. Sampling-based decoding methods, such as nucleus sampling$^{1}$ (Holtzman et al., 2020), can diversify the generation process to some extent, but the knowledge content of the generated texts still tends to be highly repetitive when they are used to generate documents for a given question. We therefore propose two solutions, diverse human prompts and clustering-based prompts, which are elaborated on in this section.
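
For readers unfamiliar with nucleus (top-p) sampling, the sketch below shows the general idea for a single decoding step: keep the smallest set of highest-probability tokens whose cumulative probability reaches a threshold, renormalize, and sample from that set. The token distribution and the threshold are invented for illustration, and the function names are hypothetical.

```python
import random


def nucleus_filter(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize the kept probabilities."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def nucleus_sample(probs, top_p, rng):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, top_p)
    tokens = list(filtered)
    weights = [filtered[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution for one decoding step.
probs = {"Paris": 0.6, "Lyon": 0.25, "Nice": 0.1, "Lille": 0.05}
nucleus = nucleus_filter(probs, top_p=0.8)
token = nucleus_sample(probs, top_p=0.8, rng=random.Random(0))
print(nucleus, token)
```

Because every sample is still drawn from the same prompt-conditioned distribution, repeated runs tend to surface the same high-probability content, which is consistent with the repetition issue described above.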

3.2.1 DIVERSE HUMAN PROMPTS

To avoid similar token distributions under a single prompt, we ask human annotators to provide different prompts so that the generated documents are diverse. This method is simple, but can effectively vary the token distribution during generation. In our experiments, we found empirically that this method improves retrieval performance (Figure 2). However, it suffers from two drawbacks. On the one hand, it requires human annotators to write different prompts, which cannot be easily generalized to different knowledge-intensive tasks. On the other hand, different large language models might be sensitive to different prompt words, so a set of prompts that works well for one model might not work for another.

3.2.2 CLUSTERING-BASED PROMPTS

To increase knowledge coverage in generated documents, we propose a novel clustering-based prompt method. It first clusters the representations of a set of documents into $K$ classes ($K=2$ in Figure 1), where the number of classes is equal to the number of documents that need to be generated in the end. Next, it randomly selects $n$ question-document pairs ($n=5$ in Figure 1) from each cluster. Lastly, the $n$ question-document pairs from each cluster are presented to a large language model as in-context demonstrations for generating a document for a given question. In this way, the large language model is conditioned on a different distribution of examples for each cluster, resulting in generated documents that cover different perspectives. We show this in Figure 1 and describe the details of each step below.

STEP 1: GET ONE INITIAL DOCUMENT PER QUESTION. Similar to the zero-shot setting, we first ask a large language model to generate one contextual document $d$ for each question $q\in\mathcal{Q}$, where $\mathcal{Q}$ is the set of questions in the training split. Alternatively, we can use an unsupervised retriever (e.g., BM25) to obtain a document from Wikipedia. We now have a set of question-document pairs $\{q_{i},d_{i}\}_{i=1}^{|\mathcal{Q}|}$.

STEP 2: ENCODE EACH DOCUMENT, DO K-MEANS CLUSTERING. We then use a large language model (i.e., GPT-3) to encode each question-document pair, i.e., $\mathbf{e}_{i}=\text{GPT-3}([q_{i},d_{i}])$, resulting in a 12,288-dimensional vector per document. Then, we use K-means to cluster all embedding vectors $\{\mathbf{e}_{i}\}_{i=1}^{|\mathcal{Q}|}$ into $K$ sets, so each question-document pair is assigned a unique cluster id $c\in\{1,\ldots,K\}$. We vary the value of $K$ in the experiments, as illustrated in Figure 2.
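
The clustering step can be sketched with a plain implementation of Lloyd's K-means algorithm. This is a simplified illustration under stated assumptions: the 2-dimensional vectors stand in for the 12,288-dimensional GPT-3 embeddings, the first $K$ vectors are used as initial centroids to keep the example deterministic (practical implementations usually use random or k-means++ initialization), and cluster ids start at 0 rather than 1.

```python
def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster id (0..k-1) per vector.

    Assumption for this sketch: the first k vectors seed the centroids.
    """
    centroids = [list(v) for v in vectors[:k]]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, v in enumerate(vectors):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments


# Toy 2-d "embeddings" standing in for the vectors e_i of question-document pairs.
embeddings = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
]
cluster_ids = kmeans(embeddings, k=2)
print(cluster_ids)
```

Each question-document pair ends up with exactly one cluster id, which is what the sampling step that follows relies on.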

STEP 3: SAMPLE AND GENERATE $K$ DOCUMENTS. Lastly, we sample $n$ question-document pairs from each cluster $c$, denoted as $\{q_{c1},d_{c1};q_{c2},d_{c2};\ldots;q_{cn},d_{cn}\}$, in which $n$ is a hyperparameter$^{2}$. Then, the $n$ sampled question-document pairs from the same cluster serve as in-context demonstrations for the large language model to generate a contextual document. For example, the input to the large language model could be “{$q_{c1}$ placeholder} {$d_{c1}$ placeholder} ... {$q_{cn}$ placeholder} {$d_{cn}$ placeholder} {input question placeholder}”. By enumerating the sampled documents in these $K$ clusters, we finally obtain $K$ generated documents. By conditioning on different sampled in-context demonstrations collected from different clusters, the large language model is biased toward different perspectives. Although these different perspectives exist only in a latent manner, we show empirically that the method works well in practice by comparing it with sampling methods, diverse human prompts (Figure 2 and Table 2), and randomly sampling $n$ pairs from the entire dataset (Table 11).
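
The prompt construction in this step can be sketched as follows. The function name `build_cluster_prompts`, the whitespace separator between demonstrations, and the toy pairs are assumptions made for illustration; the text only specifies that each prompt lists $n$ question-document pairs from one cluster followed by the input question.

```python
import random
from collections import defaultdict


def build_cluster_prompts(pairs, cluster_ids, question, n, seed=0):
    """Return one prompt per cluster, each holding up to n in-context
    demonstrations sampled from that cluster, followed by the question."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for pair, cid in zip(pairs, cluster_ids):
        by_cluster[cid].append(pair)

    prompts = []
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        demos = rng.sample(members, min(n, len(members)))
        parts = [f"{q} {d}" for q, d in demos]
        parts.append(question)  # the new question goes last
        prompts.append(" ".join(parts))
    return prompts


# Hypothetical question-document pairs with their cluster ids.
pairs = [("Q1?", "D1."), ("Q2?", "D2."), ("Q3?", "D3."), ("Q4?", "D4.")]
cluster_ids = [0, 0, 1, 1]
prompts = build_cluster_prompts(pairs, cluster_ids, "New question?", n=2)
for prompt in prompts:
    print(prompt)
```

Feeding each of the $K$ prompts to the language model once would then yield the $K$ generated documents for the input question.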

4 EXPERIMENTS

In this section, we conduct comprehensive experiments on three knowledge-intensive NLP tasks: open-domain QA (NQ (Kwiatkowski et al., 2019), TriviaQA (Joshi et al., 2017) and WebQ (Berant et al., 2013)), fact checking (FEVER (Thorne et al., 2018) and FM2 (Eisenschlos et al., 2021)) and open-domain dialogue systems (WoW (Dinan et al., 2019)). More detailed dataset information can be found in Appendix A.1. To evaluate model performance on open-domain QA, we use the exact match (EM) score (Zhu et al., 2021). An answer is considered correct if and only if its normalized form has a match in the acceptable answer list. We also employ Recall@K (R@K) as an intermediate evaluation metric, measured as the percentage of top-K retrieved or generated documents that contain the answer. This metric is commonly used in the evaluations of previous work (Karpukhin et al., 2020; Izacard & Grave, 2020; Sachan et al., 2022). For the other knowledge-intensive tasks, we follow the KILT benchmark (Petroni et al., 2021) and use accuracy (ACC) for fact checking and F1 / Rouge-L (R-L) scores for open-domain dialogue systems.
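
A minimal sketch of the two QA metrics is given below. Two assumptions are made that the text does not spell out: normalization follows the common SQuAD-style recipe (lowercasing, stripping punctuation and English articles, collapsing whitespace), and Recall@K is computed per question, counting a hit when any of the top-K documents contains an acceptable answer, as is common in dense-retrieval evaluations. The function names are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, acceptable_answers):
    """True if the normalized prediction equals any normalized acceptable answer."""
    return normalize(prediction) in {normalize(a) for a in acceptable_answers}


def recall_at_k(docs_per_question, answers_per_question, k):
    """Percentage of questions whose top-k documents contain an acceptable answer."""
    hits = 0
    for docs, answers in zip(docs_per_question, answers_per_question):
        top_k = [normalize(d) for d in docs[:k]]
        if any(normalize(a) in d for a in answers for d in top_k):
            hits += 1
    return 100.0 * hits / len(docs_per_question)


# Toy data: two questions, two ranked documents each.
docs = [
    ["Paris is the capital of France.", "France is in Europe."],
    ["Germany is in Europe.", "Berlin is the capital of Germany."],
]
answers = [["Paris"], ["Berlin"]]
print(exact_match("The Eiffel Tower!", ["Eiffel Tower"]))
print(recall_at_k(docs, answers, k=1), recall_at_k(docs, answers, k=2))
```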

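| Models | NQ | TriviaQA | WebQ | FEVER | FM2 | WoW (F1) | WoW (R-L) |
|---|---|---|---|---|---|---|---|
| | Open-domain QA | Open-domain QA | Open-domain QA | Fact Checking | Fact Checking | Dialogue System | Dialogue System |
| *with retriever, AND directly trained on these datasets | | | | | | | |
| DPR + InstructGPT | 29.1 | 53.8 | 20.2 | 79.8 | 65.9 | 15.4 | 13.7 |
| *with retriever, BUT NOT trained on these datasets | | | | | | | |
| BM25 + InstructGPT | 19.7 | 52.2 | 15.8 | 78.7 | 65.2 | 15.7 | 13.7 |
| Contriever + InstructGPT | 18.0 | 51.3 | 16.6 | 80.4 | 66.6 | 15.5 | 14.0 |
| Google + InstructGPT | 28.8 | 58.8 | 20.4 | 82.9 | 66.0 | 14.8 | 13.2 |
| *without retriever, and not using external documents | | | | | | | |
| Previous SoTA methods | 24.7$^{1}$ | 56.7$^{2}$ | 19.0$^{1}$ | - | - | - | - |
| InstructGPT (no docs.) | 20.9 | 57.5 | 18.6 | 77.6 | 59.4 | 15.4 | 13.8 |
| GENREAD (InstructGPT) | 28.0 | 59.0 | 24.6 | 80.4 | 65.5 | 15.8 | 14.2 |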

Table 1: Zero-shot open-domain QA performance. Our proposed GENREAD with the InstructGPT reader (named GENREAD (InstructGPT)) significantly outperforms the original InstructGPT, achieving new state-of-the-art performance on three open-domain QA benchmarks (previous SoTA: $^{1}$GLaM (Du et al., 2022), $^{2}$FLAN (Wei et al., 2021)) under this setting without using any external document. GENREAD achieves comparable or even better performance than zero-shot retrieve-then-read models that use a retriever or search engine to first obtain contextual documents. To ensure reproducibility, we use greedy search in decoding. All prompts used are shown in §B.1. Note: the numbers were fixed in v2 by adding the average performance of different prompts; see details in Table 20.

4.1 ZERO-SHOT SETTING EXPERIMENTS

We first compare our proposed GENREAD approach with various large language models proposed in recent years, including GPT-3 (Brown et al., 2020), Gopher (Rae et al., 2021), FLAN (Wei et al., 2021), GLaM (Du et al., 2022), Chinchilla (Hoffmann et al., 2022), PaLM (Chowdhery et al., 2022) and InstructGPT (Ouyang et al., 2022). Due to space limitations, we report only the best performance on each dataset in Table 1, in the row labeled "Previous SoTA methods". The corresponding model parameters and performance are listed in Table 9 in the Appendix. All of these baseline methods use the same input format, i.e., [prompt words; question].

GENREAD is based on InstructGPT with 175B parameters. In order to fully evaluate the effectiveness of our proposed method, we also compare with InstructGPT augmented with retrieved documents from Wikipedia or Google search. The baseline methods (1) BM25 / Contriever + InstructGPT, (2) Google + InstructGPT, and (3) DPR + InstructGPT have the same input format as GENREAD, i.e., [prompt words; contextual document; question]. BM25 is a traditional sparse retrieval method. Contriever (Izacard et al., 2022a) is a state-of-the-art unsupervised dense retrieval model. DPR (Karpukhin et al., 2020) is a supervised dense retrieval model directly trained on the NQ, TriviaQA and WebQ datasets. We note that comparing with the above three methods is challenging because our method relies only on the large language model itself, without using any external corpus.

GENREAD基于拥有1750亿参数的InstructGPT。为了全面评估我们提出方法的有效性，我们还对比了通过维基百科或谷歌搜索获取检索文档增强的InstructGPT。基线方法包括：(1) BM25/Contriever $+$ InstructGPT；(2) 谷歌搜索 $+$ InstructGPT；(3) DPR $+$ InstructGPT，这些方法与我们的GENREAD采用相同的输入格式，即[提示词；上下文文档；问题]。BM25是传统的稀疏检索方法。Contriever (Izacard等，2022a) 是最先进的无监督密集检索模型。DPR (Karpukhin等，2020) 是直接在NQ、TriviaQA和WebQ数据集上训练的监督式密集检索模型。需要指出的是，与上述三种方法对比具有挑战性，因为我们的方法仅依赖大语言模型自身，未使用任何外部语料库。
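The two input formats above, and the zero-shot generate-then-read flow they support, can be sketched as follows. This is only an illustration under stated assumptions: the prompt wording, the function names, and the `llm` callable (any function mapping a prompt string to a completion string) are hypothetical and are not the prompts or API used in the paper.

```python
from typing import Callable

# Hypothetical prompt wording, used here only for illustration.
GENERATE_PROMPT = "Generate a background document to answer the given question."
READ_PROMPT = "Refer to the passage below and answer the following question."


def closed_book_input(prompt_words: str, question: str) -> str:
    """[prompt words; question]: the format used by the LLM-only baselines."""
    return f"{prompt_words}\n\nQuestion: {question}\nAnswer:"


def contextual_input(prompt_words: str, document: str, question: str) -> str:
    """[prompt words; contextual document; question]: shared by the
    retrieval-augmented baselines and GENREAD."""
    return f"{prompt_words}\n\nPassage: {document}\n\nQuestion: {question}\nAnswer:"


def generate_then_read(question: str, llm: Callable[[str], str]) -> str:
    """Zero-shot sketch: the same LLM first writes a document, then reads it."""
    document = llm(f"{GENERATE_PROMPT}\n\nQuestion: {question}\nDocument:")
    return llm(contextual_input(READ_PROMPT, document, question)).strip()
```

Swapping the first `llm` call for a retriever that returns a passage gives the retrieve-then-read baselines with an unchanged reading step, which is why the two settings are directly comparable.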

4.1.1 EXPERIMENTAL RESULTS

4.1.1 实验结果

In the experiments, we use InstructGPT as our backbone model. As shown in Table 1, compared with state-of-the-art large language models, our proposed GENREAD with the InstructGPT reader improves its performance by generating contextual documents and conditioning on the generated documents, even though no new data is introduced and the generator and reader have the exact same parameters. Specifically, GENREAD can improve the EM score by $+6.9$ on three open-domain QA benchmarks, compared to the original InstructGPT. We make a similar observation on fact checking and open-domain dialogue systems. Our proposed GENREAD consistently outperforms the baseline InstructGPT model without retrieving any contextual documents.

在实验中，我们使用Instruct GPT作为主干模型。如表1所示，与最先进的大语言模型相比，我们提出的GENREAD方法通过生成上下文文档并以生成文档为条件，即使没有引入新数据且生成器与阅读器参数完全相同，仍能提升Instruct GPT阅读器的性能。具体而言，在三个开放域QA基准测试中，GENREAD的EM分数比原始Instruct GPT提高了$+6.9$。在事实核查和开放域对话系统任务中，我们也观察到类似现象。提出的GENREAD方法在不检索任何上下文文档的情况下，始终优于基线Instruct GPT模型。

To further validate the effectiveness of GENREAD, we compare against zero-shot retrieve-then-read pipeline models, which first use a retrieval model or the Google search engine to get a relevant contextual document, then use InstructGPT to read the text and produce the final answer. As shown in Table 1, GENREAD achieves on-par performance with zero-shot retrieve-then-read pipeline models on the NQ and FM2 datasets, and outperforms them on all other benchmarks. The knowledge learned by large language models can be retrieved via autoregressive text generation. Without seeing any examples from these datasets, GENREAD can outperform pipelines that use a supervised retrieval model (i.e., DPR) to recover relevant contextual documents.

为进一步验证GENREAD的有效性，我们将其与零样本检索-阅读流程模型进行对比。该流程首先使用检索模型或Google搜索引擎获取相关上下文文档，再通过InstructGPT阅读文本生成最终答案。如表1所示，GENREAD在NQ和FM2数据集上与零样本检索-阅读流程模型表现相当，在其他所有基准测试中均优于后者。大语言模型通过自回归文本生成即可提取其习得的知识。在未接触这些数据集任何样本的情况下，GENREAD的表现优于使用监督式检索模型（如DPR）获取相关上下文文档的方案。

Figure 2: Recall@K on test sets, measured as the percentage of top-K documents that contain the answer. Our proposed clustering-based prompting method outperforms DPR and Google search, as well as two other variants of using LLMs to generate documents. Exact numbers are reported in Table 6.

图 2: 测试集上的召回率 $\ @\mathrm{K}$，衡量包含答案的前K篇文档的百分比。我们提出的基于聚类的提示方法优于DPR和Google搜索，也优于使用大语言模型生成文档的两种变体。具体数值见表6。

4.2 SUPERVISED SETTING EXPERIMENTS

4.2 监督式设置实验

We compare our proposed GENREAD with retrieve-then-read models, including DPR (Karpukhin et al., 2020), RAG (Lewis et al., 2020), and FiD (Izacard & Grave, 2021). In addition, we compared with obtaining relevant documents from the internet using the Google search engine.

我们将提出的GENREAD与检索-阅读(retrieve-then-read)模型进行比较，包括DPR (Karpukhin等人，2020)、RAG (Lewis等人，2020)和FiD (Izacard & Grave，2021)。此外，我们还与使用Google搜索引擎从互联网获取相关文档的方法进行了对比。

4.2.1 EXPERIMENTAL SETUP

4.2.1 实验设置

For our proposed method, we replace the retriever with a large language model to directly generate contextual documents. In the experiments, we use InstructGPT (Ouyang et al., 2022). After contextual documents are retrieved or generated, we employ FiD readers with 770M parameters (i.e., FiD-l) and 3B parameters (i.e., FiD-xl) that are fine-tuned on the training split of the target datasets. We note that we only use 10 documents during reading, for the following reasons.

在我们提出的方法中，我们用一个大型语言模型替代了检索器，直接生成上下文文档。实验中我们使用了 InstructGPT (Ouyang et al., 2022)。当上下文文档被检索或生成后，我们采用了参数规模为 7.7 亿的 FiD 阅读器模型 (即 FiD-l) 和 30 亿参数的模型 (即 FiD-xl)，这些模型在目标数据集的训练集上进行了微调。需要说明的是，在阅读阶段我们仅使用 10 篇文档，原因如下。

Why do we choose to use only 10 documents instead of 100 when reading?

为什么我们在阅读时选择只用10份文档而非100份？

As noted in Section 6.2 of DPR (Karpukhin et al., 2020) and Figure 3 of FiD (Izacard & Grave, 2021), increasing the number of documents can lead to better model performance, with state-of-the-art results achieved when using 100 documents. However, there are two major drawbacks to using 100 documents during the reading step. First, the operation is very expensive, leading to a significant increase in memory consumption and training time. As reported by Izacard & Grave (2021), the training process requires 64 Tesla V100 32GB GPUs running for around one day. Second, generating documents with a large language model is slow and expensive, so using only 10 documents is a significant cost saving in our method. Therefore, in our experiments, we choose to use 10 documents during the reading process. When using FiD-770M (i.e., FiD-large), the training process can be easily performed even on a single Tesla V100 32GB GPU. Meanwhile, when only using 10 documents, we can also increase the size of the FiD model from 770M to 3B, which takes about the same amount of GPU memory as using 100 documents with a 770M model, but at the same time significantly shortens the training time. We note that training the T5-3B model needs a bigger cluster, such as 8 Tesla V100 or A100 GPUs.

如DPR (Karpukhin等，2020) 第6.2节和FiD (Izacard & Grave，2021) 图3所示，增加文档数量能提升模型性能，使用100份文档时可达到最优水平。但阅读阶段使用100份文档存在两大缺陷：首先，该操作成本极高，会显著增加内存消耗和训练时长。据Izacard & Grave (2021) 报告，训练过程需要64块Tesla V100 32GB显卡运行约一整天；其次，通过大语言模型生成文档速度慢且成本高，因此本方法仅使用10份文档可大幅节省成本。故实验中我们选择在阅读阶段使用10份文档。使用FiD-770M (即FiD-large) 时，训练过程甚至可在单块Tesla V100 32GB显卡上轻松完成。同时，仅使用10份文档还能将FiD模型规模从770M扩大到3B，其显存占用与770M模型处理100份文档时相当，但能显著缩短训练时间。需注意的是，训练T5-3B模型需要更大规模的集群，例如8块Tesla V100或A100显卡。

4.2.2 EXPERIMENTAL RESULTS ON OPEN-DOMAIN QA

4.2.2 开放域问答实验结果

We first use Recall@K to compare the retrieval accuracy of different models. As shown in Figure 2, GENREAD significantly outperforms DPR and Google search when fewer than 10 documents are retrieved or generated. Among the GENREAD variants, including nucleus sampling, human-written prompts, and clustering-based prompts, clustering-based prompts achieve the best performance. At the same time, we notice that the slope of the language model's curve inevitably decreases as the number of generated documents increases. On one hand, this is due to the similarity of token distributions when large language models generate multiple documents. On the other hand, due to the shallow interaction characteristics of the dense retrieval model itself, the retrieved documents might not be completely relevant to the given question, so the increase in recall might come from false positive documents, as also mentioned by Sachan et al. (2022).

我们首先使用召回率 $\ @\mathrm{K}$ 来比较不同模型的检索准确率。如图 2 所示，在检索或生成文档数量少于 10 篇时，GENREAD 明显优于 DPR 和谷歌搜索。与不同 GENREAD 变体（包括核采样、人工编写提示和基于聚类的提示）相比，基于聚类的提示表现最佳。同时，我们注意到语言模型不可避免地存在曲线斜率随生成文档数量增加而下降的问题。一方面，这是由于大语言模型生成多篇文档时 token 分布的相似性所致；另一方面，由于稠密检索模型本身的浅层交互特性，检索到的文档可能与给定问题不完全相关，因此召回率的提升可能来自误判文档，正如 Sachan 等人 (2022) 所提到的。
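As a rough illustration of the clustering-based prompt variant compared above, the sketch below builds one prompt per cluster by drawing in-context demonstrations from that cluster only, so that each prompt steers the generator toward a different region of the training data. It assumes the cluster assignments of the (question, document) pairs have already been computed elsewhere (for example with K-means over embeddings); the function name, the prompt layout, and `demos_per_cluster` are hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (question, document)


def build_cluster_prompts(
    pairs: Sequence[Pair],
    cluster_ids: Sequence[int],
    question: str,
    demos_per_cluster: int = 5,
    seed: int = 0,
) -> Dict[int, str]:
    """Return one prompt per cluster, each with demonstrations from that cluster."""
    rng = random.Random(seed)
    by_cluster: Dict[int, List[Pair]] = defaultdict(list)
    for pair, cid in zip(pairs, cluster_ids):
        by_cluster[cid].append(pair)

    prompts: Dict[int, str] = {}
    for cid in sorted(by_cluster):
        members = by_cluster[cid]
        demos = rng.sample(members, min(demos_per_cluster, len(members)))
        blocks = [f"Question: {q}\nDocument: {d}" for q, d in demos]
        # The target question comes last, with the document left for the LLM to write.
        blocks.append(f"Question: {question}\nDocument:")
        prompts[cid] = "\n\n".join(blocks)
    return prompts
```

Sending each of the returned prompts to the generator once yields as many documents as there are clusters, in contrast to sampling the same prompt repeatedly.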

Table 2: Supervised open-domain QA performance. By only using generated documents from InstructGPT, our GENREAD with FiD reader (named GENREAD (FiD)) can achieve better performance than baseline methods on TriviaQA and WebQ. Through our detailed analysis of NQ, we found the performance gap is mainly due to the temporality issue, which will be elaborated in §A.7.

Models	# reader parameters	# documents	TriviaQA open test	WebQ open test	NQ open test	Avg.
*baselines with retrieving from Wikipedia; all numbers reported by existing papers
DPR (Karpukhin et al., 2020)	110M	100	56.8	41.1	41.5	46.5
RAG (Lewis et al., 2020)	400M	10	56.1	45.2	44.5	48.6
FiD (Izacard & Grave, 2021)	770M	100	67.6	50.5	51.4	56.5
*baselines with retrieving from Wikipedia or Google; all numbers from our experiments
FiD-l (DPR, Wikipedia)	770M	10	61.9	48.1	46.7	52.2
FiD-xl (DPR, Wikipedia)	3B	10	66.3	50.8	50.1	55.7
FiD-xl (Google search)	3B	10	70.1	53.6	45.0	56.2
*our proposed method by leveraging a large language model to generate documents
GENREAD (FiD-l) (sampling)	770M	10	67.8	51.5	40.3	53.2
GENREAD (FiD-l) (clustering)	770M	10	70.2	53.3	43.5	55.6
GENREAD (FiD-xl) (sampling)	3B	10	69.6	52.6	42.6	54.9
GENREAD (FiD-xl) (clustering)	3B	10	71.6	54.4	45.6	57.1
merge retrieved documents with generated documents			74.3	56.2	54.0	61.5

表 2: 监督式开放域问答性能。仅使用 InstructGPT 生成的文档时，我们搭载 FiD 阅读器的 GENREAD (命名为 GENREAD (FiD)) 在 TriviaQA 和 WebQ 上的表现优于基线方法。通过对 NQ 的详细分析，我们发现性能差距主要源于时效性问题，具体将在 §A.7 中阐述。

模型	阅读器参数量	文档数	TriviaQA 开放测试	WebQ 开放测试	NQ	平均
*基线方法使用维基百科检索；所有数据来自现有论文
DPR (Karpukhin et al., 2020)	110M	100	56.8	41.1	41.5	46.5
RAG (Lewis et al., 2020)	400M	10	56.1	45.2	44.5	48.6
FiD (Izacard & Grave, 2021)	770M	100	67.6	50.5	51.4	56.5
*基线方法使用维基百科或谷歌检索；所有数据来自我们的实验
FiD-l (DPR, Wikipedia)	770M	10	61.9	48.1	46.7	52.2
FiD-xl (DPR, Wikipedia)	3B	10	66.3	50.8	50.1	55.7
FiD-xl (Google search)	3B	10	70.1	53.6	45.0	56.2
*我们提出的方法：利用大语言模型生成文档
GENREAD (FiD-l) (采样)	770M	10	67.8	51.5	40.3	53.2
GENREAD (FiD-l) (聚类)	770M	10	70.2	53.3	43.5	55.6
GENREAD (FiD-xl) (采样)	3B	10	69.6	52.6	42.6	54.9
GENREAD (FiD-xl) (聚类)	3B	10	71.6	54.4	45.6	57.1
融合检索文档与生成文档	-	-	74.3	56.2	54.0	61.5

As shown in Table 2, the FiD model performs the best among all baseline models. Using FiD-xl with only 10 documents achieves performance comparable to using FiD-l with 100 documents; the average gap is less than 1% on the three benchmarks. Compared with both closed-book models and Wikipedia-based retrieve-then-read pipelines, our proposed GENREAD can achieve state-of-the-art performance on TriviaQA and WebQ. Furthermore, compared with using sampling methods to generate documents, the clustering-based prompt method improves the EM score by $+2.2$ on average. This indicates that the clustering-based prompt method effectively increases the knowledge coverage of generated documents and also leads to better downstream QA performance. We also show that GENREAD can outperform Google search on all benchmarks. On NQ, both our method and Google search perform worse than DPR, mainly due to the significant portion of time-dependent questions in the dataset, which is described in the following analysis.

如表 2 所示，我们首先可以观察到 FiD 模型在所有基线模型中表现最佳。仅使用 10 篇文档的 FiD-xl 就能达到与使用 100 篇文档的 FiD-l 相当的性能，三个基准测试上的平均差距小于 $1%$。与闭卷模型和基于维基百科的检索-阅读流程相比，我们提出的 GENREAD 能够实现最先进的性能。此外，与使用采样方法生成文档相比，基于聚类的提示方法平均可将 EM 分数提高 $+2.2$。这表明基于聚类的提示方法有效增加了生成文档的知识覆盖范围，同时也带来了更好的下游问答性能。我们还展示了 GENREAD 在所有基准测试上都能超越 Google 搜索。观察到我们的方法和 Google 搜索的表现都不如 DPR，主要是由于数据集中存在大量时间敏感性问题，具体分析如下。
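For readers unfamiliar with the exact match (EM) score reported in Table 2, the sketch below shows one common way to compute it. It assumes the SQuAD-style answer normalization (lowercasing, removing punctuation and English articles, collapsing whitespace) that is widely used in open-domain QA evaluation; this section does not spell out the normalization actually used, so the details and function names here are assumptions.

```python
import re
import string
from typing import Sequence

_PUNCTUATION = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Sequence[str]) -> bool:
    """A prediction is correct if it equals any acceptable answer after normalization."""
    pred = normalize_answer(prediction)
    return any(pred == normalize_answer(gold) for gold in gold_answers)


def em_score(predictions: Sequence[str], gold: Sequence[Sequence[str]]) -> float:
    """Percentage of questions whose prediction exactly matches a gold answer."""
    correct = sum(exact_match(p, g) for p, g in zip(predictions, gold))
    return 100.0 * correct / len(predictions)
```

Because a prediction only needs to match one of the acceptable answers, datasets with many aliases per question (such as TriviaQA) are more forgiving under this metric than datasets with a single label.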

4.2.3 EXPERIMENTAL RESULTS ON OTHER TASKS

4.2.3 其他任务的实验结果

We report the experimental results in Table 3. Under the supervised setting, GENREAD achieves on-par performance on the fact checking task and superior performance on the dialogue system task, indicating that large language models can be seen as strong knowledge generators.

我们在表3中展示了实验结果。在有监督设置下，GENREAD在事实核查任务上达到相当性能，在对话系统任务上表现更优，这表明大语言模型可视为强大的知识生成器。

Models	FEVER Acc.	FM2 Acc.	WoW F1 / R-L
RAG (Lewis et al., 2020)	86.3	71.1	13.1 / 11.6
FiD (Izacard & Grave, 2021)	90.2	77.6	17.5 / 16.1
GENREAD (FiD-xl) (sampling)	89.0	76.3	18.9 / 16.7
GENREAD (FiD-xl) (clustering)	89.6	77.8	19.1 / 16.8
merge two source docs.	91.8	78.9	20.1 / 17.9

Table 3: Supervised performance on fact checking (FEVER and FM2) and open-domain dialogue system (WoW).

模型	FEVER 准确率	FM2 准确率	WoW F1 / R-L
RAG (Lewis et al., 2020)	86.3	71.1	13.1 / 11.6
FiD (Izacard & Grave, 2021)	90.2	77.6	17.5 / 16.1
GENREAD (FiD-xl) (采样)	89.0	76.3	18.9 / 16.7
GENREAD (FiD-xl) (聚类)	89.6	77.8	19.1 / 16.8
合并两个源文档	91.8	78.9	20.1 / 17.9

表 3: 事实核查 (FEVER 和 FM2) 和开放域对话系统 (WoW) 的监督性能。

The main reason that GENREAD performs worse than the dense retriever for fact checking is that the task provides sufficient semantic information to reach strong performance on this binary decision task. So, there is a smaller semantic gap between the given factual statement and contextual documents than between question and document pairs in open-domain QA, which makes it an easier retrieval setting for modern dense retrieval methods that are mainly based on vector similarity.

GENREAD 在事实核查任务上表现不如密集检索器的主要原因在于，该任务为这项二元决策任务提供了足够的语义信息以实现强劲性能。因此，给定的事实陈述与上下文文档之间的语义差距，比开放域问答中的问题-文档对更小，这对主要基于向量相似度的现代密集检索方法而言是更简单的检索场景。

4.3 OBSERVATIONS AND EXPERIMENTAL ANALYSIS

4.3 观察与实验分析

4.3.1 COMPLEMENTARITY OF GENERATED AND RETRIEVED DOCUMENTS

4.3.1 生成文档与检索文档的互补性

Generated documents can be combined with retrieved documents to outperform both. Even with a very large number of retrieved documents, including a few samples of generated knowledge leads to large improvements. As shown in Table 2, merging retrieved documents with generated documents can achieve state-of-the-art performance compared to all baseline methods listed in the table. Specifically, it yields an average improvement of $+5.7$ on three open-domain QA benchmarks compared to DPR alone, and an average improvement of $+4.4$ compared to the large language model alone.

生成的文档可以与检索到的文档结合，表现优于单独使用两者。即使检索到的文档数量非常多，加入少量生成知识的样本也能带来显著提升。如表2所示，将检索文档与生成文档合并可以实现优于表中所有基线方法的最先进性能。具体而言，在三个开放域问答基准测试中，相比单独使用DPR平均提升$+5.7$，相比单独使用大语言模型平均提升$+4.4$。

Figure 3: Combining DPR retrieved documents and large language model (LLM) generated documents can achieve significantly better performance than using DPR retrieved documents only. For a fair comparison, instead of adding LLM generated documents to the model, we replace 10 documents retrieved by DPR with 10 documents generated by LLM so the total number of documents is the same. In this experiment, we use FiD-l (i.e., FiD-large) as the reader model because when the documents scale to more than 20, FiD-xl (i.e., FiD-3B) causes out-of-memory issues on A100 GPUs.

图 3: 结合 DPR 检索文档与大语言模型 (LLM) 生成文档，能获得比仅使用 DPR 检索文档显著更优的性能。为确保公平对比，我们并非向模型追加 LLM 生成文档，而是用 10 篇 LLM 生成文档替换 DPR 检索的 10 篇文档，以保持文档总数不变。本实验采用 FiD-l (即 FiD-large) 作为阅读器模型，因为当文档量超过 20 篇时，FiD-xl (即 FiD-3B) 会在 A100 GPU 上引发内存溢出问题。
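The fixed-budget merging used for the fair comparison in Figure 3 (swapping 10 retrieved documents for 10 generated ones so the reader sees the same total) might look like the sketch below. It assumes the retrieved list is ranked best-first, so the lowest-ranked retrieved documents are the ones dropped, and it places generated documents after the retrieved ones; neither choice is stated in the text, and the function name is hypothetical.

```python
from typing import List, Sequence


def merge_documents(
    retrieved: Sequence[str],
    generated: Sequence[str],
    total: int,
    num_generated: int = 10,
) -> List[str]:
    """Keep `total` documents overall, replacing the tail of the retrieved
    list with up to `num_generated` generated documents."""
    if num_generated > total:
        raise ValueError("num_generated cannot exceed the total document budget")
    kept_generated = list(generated[:num_generated])
    # Assumption: `retrieved` is ranked best-first, so truncation drops the weakest hits.
    kept_retrieved = list(retrieved[: total - len(kept_generated)])
    return kept_retrieved + kept_generated
```

Holding the total constant separates the effect of adding generated knowledge from the effect of simply giving the reader more input.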

4.3.2 COVERAGE ANALYSIS OVER ALL POSSIBLE ANSWERS

4.3.2 全可能答案覆盖分析

The improvement in open-domain QA performance is due to the fact that correct answers are included more frequently in the generated text than in the retrieved documents. Recall@K is the most commonly used metric in existing works to measure retrieval performance; it computes the percentage of top-K retrieved or generated documents that contain any possible answer at least once. However, as many questions have multiple correct answers, Recall@K cannot fully reflect the diversity of generated or retrieved documents. On average, each question has 2.39 correct answers in WebQ, 1.79 in NQ, and 14.02 (including all entity aliases) in TriviaQA. NQ and WebQ do not include alias names in the labels.

开放域问答性能的提升源于正确答案更频繁地出现在生成文本中。召回率 $\ @\mathrm{K}$ 是现有工作中最常用的检索性能衡量指标，它计算前K个检索或生成文档中至少包含一个可能答案的百分比。然而，由于许多问题存在多个正确答案，召回率 $\ @\mathrm{K}$ 无法完全反映生成或检索文档的多样性。WebQ数据集中每个问题平均包含2.39个正确答案，NQ数据集为1.79个，TriviaQA数据集则达到14.02个（含所有实体别名）。NQ和WebQ的标注中未包含别名信息。
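A minimal sketch of Recall@K is given below. It assumes the usual per-question reading of the metric: a question counts as a hit if at least one of its top-K documents contains any acceptable answer, and the score is the percentage of questions that are hits. The case-insensitive substring test in `has_answer` is a simplification chosen for illustration, not the paper's matching procedure.

```python
from typing import Sequence


def has_answer(document: str, answers: Sequence[str]) -> bool:
    """Simplified check: does any acceptable answer string occur in the document?"""
    text = document.lower()
    return any(answer.lower() in text for answer in answers)


def recall_at_k(
    docs_per_question: Sequence[Sequence[str]],
    answers_per_question: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Percentage of questions whose top-k documents contain at least one answer."""
    hits = 0
    for docs, answers in zip(docs_per_question, answers_per_question):
        if any(has_answer(doc, answers) for doc in docs[:k]):
            hits += 1
    return 100.0 * hits / len(docs_per_question)
```

Note that a question with ten acceptable answers and a question with one are scored identically once a single answer is found, which is the limitation the paragraph above points out.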

In this section, we also report the answer coverage of different models in Table 4. Answer coverage measures the percentage of all possible answers that are contained in the documents. Coverage analysis showed that generated text tends to have lower coverage than retrieved documents because generated documents tend to have little diversity compared to retrieved documents. To improve coverage, we propose GENREAD with clustering, where we include examples in the prompt from different clusters of the training data to elicit more diverse generations.

在本节中，我们还在表4中展示了不同模型的答案覆盖率性能。答案覆盖率衡量文档中包含的答案数量占所有可能答案的百分比。覆盖率分析表明，生成文本的覆盖率往往低于检索文档，因为生成文档往往比检索文档的多样性更低。为了提高覆盖率，我们提出了带聚类功能的GENREAD方法，即在提示中包含来自训练数据不同聚类的示例，以激发更多样化的生成内容。

Documents obtained by ↓	NQ	TriviaQA (w. alias)	TriviaQA (w/o alias)	WebQ
BM25 (Robertson et al., 2009)	48.4	17.1	63.8	41.2
Google search engine	57.9	18.9	72.0	54.2
DPR (Karpukhin et al., 2020)	67.9	17.9	67.3	58.8
GENREAD (nucleus sampling)	56.6	19.6	74.5	59.8
GENREAD (10 human prompts)	57.4	20.1	74.8	61.1
GENREAD (clustering prompts)	61.7	20.4	76.5	62.1

Table 4: Answer coverage (%) over 10 retrieved or generated documents. Case studies are provided in Tables 16-19 in Appendix.

获取文档方式 ↓	NQ	TriviaQA (含别名)	TriviaQA (不含别名)	WebQ
BM25 (Robertson et al., 2009)	48.4	17.1	63.8	41.2
Google 搜索引擎	57.9	18.9	72.0	54.2
DPR (Karpukhin et al., 2020)	67.9	17.9	67.3	58.8
GENREAD (nucleus sampling)	56.6	19.6	74.5	59.8
GENREAD (10 human prompts)	57.4	20.1	74.8	61.1
GENREAD (clustering prompts)	61.7	20.4	76.5	62.1

表 4: 答案覆盖率 (%)，基于10篇检索或生成的文档。案例研究见附录中的表 16-19。
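Answer coverage, as defined in this section (the share of a question's possible answers that appear somewhere in its documents), can be sketched as below. Two details are assumptions rather than the paper's stated procedure: answers are matched by case-insensitive substring search, and the per-question ratios are averaged with equal weight per question.

```python
from typing import Sequence


def answer_coverage(
    docs_per_question: Sequence[Sequence[str]],
    answers_per_question: Sequence[Sequence[str]],
) -> float:
    """Average percentage of each question's answers found in its documents."""
    ratios = []
    for docs, answers in zip(docs_per_question, answers_per_question):
        joined = " ".join(docs).lower()
        found = sum(1 for answer in answers if answer.lower() in joined)
        ratios.append(found / len(answers))
    return 100.0 * sum(ratios) / len(ratios)
```

Unlike Recall@K, this score keeps rising as the document set mentions more of the distinct acceptable answers, so it rewards diversity across the generated or retrieved documents.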

4.4 READABILITY ANALYSIS OF RETRIEVED AND GENERATED DOCUMENTS

4.4 检索与生成文档的可读性分析

After manually comparing some documents retrieved by DPR with documents generated by InstructGPT, we observe that the readability of documents differs even when they contain the correct answer string. In other words, documents containing answers might also contain noisy information that is irrelevant to the question, which could affect both the model and human reading.

在我们手动对比了DPR检索的部分文档和InstructGPT生成的文档后，发现当这些文档包含正确答案字符串时，它们的可读性存在差异。也就是说，包含答案的文档可能同时包含与问题无关的噪声信息，这可能会影响模型和人类阅读。

Documents obtained by ↓	NQ	TriviaQA	WebQ
DPR (Karpukhin et al., 2020)	63.1	80.2	63.3
GENREAD (nucleus sampling)	58.7	83.7	63.8
GENREAD (clustering prompts)	64.0	86.8	66.7

Table 5: Readability study on retrieved documents and generated documents. See detailed analysis in §4.4.

获取方式 ↓	NQ	TriviaQA	WebQ
DPR (Karpukhin et al., 2020)	63.1	80.2	63.3
GENREAD (nucleus sampling)	58.7	83.7	63.8
GENREAD (clustering prompts)	64.0	86.8	66.7

表 5: 检索文档与生成文档的可读性研究。详细分析见 §4.4。

In order to further validate the readability of retrieved and generated documents, we extracted a subset of examples from the NQ, TriviaQA and WebQ datasets in which both the retrieved and the generated documents contain the correct answer. As shown in Table 5, when both retrieved and generated documents contain the correct answer, the FiD reader produces more correct answers when reading the documents generated by large language models (e.g., InstructGPT).

为了进一步验证检索文档和生成文档的可读性，我们从NQ、TriviaQA和WebQ数据集中提取了部分数据样本，这些样本中的检索文档和生成文档都包含正确答案。如表5所示，当检索文档和生成文档都包含正确答案时，FiD阅读器在读取大语言模型（如InstructGPT）生成的文档时能产生更多正确答案。
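The subset construction behind this readability study can be sketched as follows: keep only the examples where both document sets contain an answer, then compare how often the reader is correct on each source. The `Example` fields, the substring test, and the lowercase equality check are hypothetical simplifications; the reader predictions are assumed to have been produced beforehand.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Example:
    answers: List[str]
    retrieved_docs: List[str]
    generated_docs: List[str]
    pred_from_retrieved: str  # reader output when given the retrieved documents
    pred_from_generated: str  # reader output when given the generated documents


def contains_answer(docs: Sequence[str], answers: Sequence[str]) -> bool:
    text = " ".join(docs).lower()
    return any(answer.lower() in text for answer in answers)


def is_correct(prediction: str, answers: Sequence[str]) -> bool:
    return prediction.strip().lower() in {answer.lower() for answer in answers}


def readability_study(examples: Sequence[Example]) -> Tuple[float, float]:
    """Return (accuracy on retrieved docs, accuracy on generated docs), in percent,
    over the examples where both document sets contain an answer."""
    subset = [
        ex for ex in examples
        if contains_answer(ex.retrieved_docs, ex.answers)
        and contains_answer(ex.generated_docs, ex.answers)
    ]
    if not subset:
        raise ValueError("no example has the answer in both document sets")
    n = len(subset)
    retrieved_acc = 100.0 * sum(is_correct(ex.pred_from_retrieved, ex.answers) for ex in subset) / n
    generated_acc = 100.0 * sum(is_correct(ex.pred_from_generated, ex.answers) for ex in subset) / n
    return retrieved_acc, generated_acc
```

Restricting to this subset removes recall as a factor, so any remaining gap reflects how easily the reader can extract the answer from each kind of document.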

We also provide some case studies in Tables 16-19. For example, in Table 18, the question is “What city was Zeus the patron god of?”. The first document retrieved by DPR is “Like the other Panhellenic Games, the ancient Olympic Games were a religious festival, held at the sanctuary of Zeus at Olympia.”. Although it contains the correct answer, it is hard to infer the answer “Olympia” from it. On the contrary, InstructGPT generates the document “Zeus was the patron god of the city of Olympia, which was located in the northwestern Peloponnese region of Greece. Olympia was the site of the Olympic Games, held every four years in honor of Zeus.”, which is much easier to read.

我们还在表16-19中提供了一些案例研究。例如在表18中，问题为"What city was Zeus the patron god of?"。DPR检索到的首份文档是"Like the other Panhellenic Games, the ancient Olympic Games were a religious festival, held at the sanctuary of Zeus at Olympia."，虽然包含正确答案，但很难从中推断出"Olympia"这个答案。相反，InstructGPT生成的文档"Zeus was the patron god of the city of Olympia, which was located in the northwestern Peloponnese region of Greece. Olympia was the site of the Olympic Games, held every four years in honor of Zeus."更易于理解。

5 EPILOGUE

5 尾声

CONCLUSION. In this paper, we present a novel perspective for solving knowledge-intensive tasks by replacing dense retrieval models with large language model generators. We call it generate-then-read, which first prompts a large language model to generate contextual documents, then reads the generated documents to infer the final answer. Notably, without retrieving any documents, it reaches 71.6 and 54.4 exact match scores on TriviaQA and WebQ, significantly outperforming the current retrieve-then-read model DPR-FiD; it also performs strongly on two other knowledge-intensive tasks.

结论。本文提出了一种解决知识密集型任务的新视角，即用大语言模型生成器替代密集检索模型。我们称之为"生成后阅读"(generate-then-read)方法，该方法首先提示大语言模型生成上下文文档，然后阅读生成的文档来推断最终答案。值得注意的是，在未检索任何文档的情况下，该方法在TriviaQA和WebQ上分别达到71.6和54.4的精确匹配分数，显著优于当前检索-阅读模型DPR-FiD，在其他两项知识密集型任务中同样表现优异。

LIMITATION AND FUTURE WORK. Despite the strong performance on the presented datasets, our approach is limited in its ability to update its knowledge state and adapt to new domains. A major feature of retrieve-then-read is the ability to swap in new documents when new information is learned, such as temporally more recent documents, or to add documents from a new domain to quickly adapt to a new downstream task. Our approach relies on a large language model to contain all this knowledge, and adding new knowledge would likely require some retraining. Future work will explore how to efficiently incorporate new knowledge into our generate-then-read method. Besides, generated documents might suffer from hallucination errors, resulting in incorrect predictions. We present a case study in Table 15. Combining our method with recent approaches (Creswell & Shanahan, 2022) to boost generative faithfulness is also a direction worthy of future research.

局限性与未来工作。尽管在现有数据集上表现优异，但我们的方法在更新知识状态和适应新领域方面存在局限。检索-阅读(retrieve-then-read)方法的核心优势在于能够动态替换新文档（例如时效性更强的文献或跨领域文档）以快速适应下游任务，而我们的生成-阅读(generate-then-read)方案依赖大语言模型内化所有知识，新增知识可能需要重新训练。未来工作将探索如何高效整合新知识。此外，生成文档可能存在幻觉错误（如表15案例所示），结合近期提升生成可信度的研究(Creswell & Shanahan, 2022)也是值得探索的方向。

ETHICS STATEMENT

伦理声明

Large language models have a wide range of beneficial applications for society, but they also have potentially harmful applications. Previous work has shown various forms of bias, such as racial and gender bias, in large language models like GPT-3, even after explicit efforts to reduce toxic language (Chan, 2022). The importance of addressing these societal harms is acknowledged by OpenAI themselves in their 2020 paper introducing GPT-3 (Brown et al., 2020), which stated “we focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 ... and issues of bias, fairness, and representation within models like GPT-3.” on page 34.

大语言模型对社会具有广泛的有益应用，但也存在潜在的有害应用。先前研究表明，即便经过明确减少有害语言的努力，像GPT-3这样的大语言模型仍存在多种形式的偏见，例如种族和性别偏见 (Chan, 2022)。OpenAI在2020年介绍GPT-3的论文中承认了解决这些社会危害的重要性 (Brown et al., 2020)，该论文在第34页指出："我们重点关注两个主要问题：像GPT-3这样的语言模型可能被蓄意滥用...以及像GPT-3这样的模型内部的偏见、公平性和代表性等问题。"

The goal of this paper is to utilize knowledge stored in the parameters of large language models to answer open-domain questions and solve knowledge-intensive tasks. Unlike retrieve-then-read where an external corpus can be curated to be trustworthy, the use of a model to generate contextual documents may further permeate existing biases in common models. First, our work shows that generated documents suffer from challenges of stale information from outdated documents used for training. Second, we show that generated documents tend to be less diverse, potentially biasing answers towards more common entities and terms from the training data. Finally, we conducted experiments on only three large language models. It is possible that some of our conclusions or observations may not necessarily hold for other models trained with different data or objectives.

本文旨在利用大语言模型参数中存储的知识来回答开放域问题并解决知识密集型任务。与基于检索-阅读模式 (retrieve-then-read) 的方法不同（该方法可通过筛选外部语料库确保可信度），使用模型生成上下文文档可能会放大现有常见模型中的偏见。首先，我们的研究表明生成文档存在训练数据过时导致信息陈旧的问题。其次，生成文档往往多样性不足，可能使答案偏向训练数据中更常见的实体和术语。最后，我们仅针对三种大语言模型进行了实验，部分结论或观察可能不适用于采用不同训练数据或目标的其他模型。

Regarding ethical solutions, future work includes (i) further exploring potential bias and intentional or unintentional harm that may result from using generated contextual documents; (ii) better aligning language models with user intent to generate less biased contents and fewer fabricated facts.

关于伦理解决方案，未来工作包括：(i) 进一步探究使用生成式上下文文档可能导致的潜在偏见及有意/无意的伤害；(ii) 更好地使语言模型与用户意图对齐，以生成偏见更少的内容和更少的虚构事实。

ACKNOWLEDGEMENTS

致谢

This work was supported in part by NSF IIS-2119531, IIS-2137396, IIS-2142827, CCF-1901059, and ONR N00014-22-1-2507. Wenhao is supported in part by the Bloomberg Data Science Ph.D. Fellowship.

本研究部分受到美国国家科学基金会(NSF)项目IIS-2119531、IIS-2137396、IIS-2142827、CCF-1901059以及海军研究办公室(ONR)项目N00014-22-1-2507的资助。Wenhao还获得了彭博数据科学博士奖学金的部分支持。

A APPENDIX

A 附录

Datasets	Splits	Train	Valid	Test	Test labels
TriviaQA (Joshi et al., 2017)	open domain	78,785	8,837	11,313	public
TriviaQA (Joshi et al., 2017)	wikipedia split			7,993	public
WebQ (Berant et al., 2013)	open domain	3,478	300	2,032	public
NQ (Kwiatkowski et al., 2019)	open domain	79,168	8,757	3,610	public
FEVER (Thorne et al., 2018)	kilt challenge	104,966	10,444	10,100	hidden
FM2 (Eisenschlos et al., 2021)	official split	10,149	1,169	1,380	public
WoW (Dinan et al., 2019)	kilt challenge	63,734	3,054	2,944	hidden

Table 6: Datasets splits and statistics. For FEVER and WoW, labels in the test are hidden, so the model performance should be evaluated at https://ai.facebook.com/tools/kilt/.

数据集	划分方式	训练集	验证集	测试集	测试标签
TriviaQA (Joshi et al., 2017)	开放域	78,785	8,837	11,313	公开
TriviaQA (Joshi et al., 2017)	维基百科划分			7,993	公开
WebQ (Berant et al., 2013)	开放域	3,478	300	2,032	公开
NQ (Kwiatkowski et al., 2019)	开放域	79,168	8,757	3,610	公开
FEVER (Thorne et al., 2018)	KILT挑战赛	104,966	10,444	10,100	隐藏
FM2 (Eisenschlos et al., 2021)	官方划分	10,149	1,169	1,380	公开
WoW (Dinan et al., 2019)	KILT挑战赛	63,734	3,054	2,944	隐藏

表 6: 数据集划分与统计信息。对于FEVER和WoW数据集，测试集标签为隐藏状态，模型性能评估请访问 https://ai.facebook.com/tools/kilt/。

A.1 DATASETS AND SPLITS

A.1 数据集与划分

– TRIVIAQA (TQA) (Joshi et al., 2017) contains a set of trivia questions with answers that were originally scraped from trivia and quiz-league websites.

– TRIVIAQA (TQA) (Joshi et al., 2017) 包含一组从问答网站和智力竞赛联盟网站抓取的 trivia 问题及其答案。

– WEB QUESTIONS (WebQ) (Berant et al., 2013) consists of questions selected using Google Suggest API, where the answers are entities in Freebase.

WEB QUESTIONS (WebQ) (Berant et al., 2013) 包含通过Google Suggest API筛选的问题，其答案为Freebase中的实体。

– NATURAL QUESTIONS (NQ) (Kwiatkowski et al., 2019) were mined from real Google search queries and the answers are spans in Wikipedia articles identified by human annotators.

NATURAL QUESTIONS (NQ) (Kwiatkowski等人，2019) 是从真实的Google搜索查询中挖掘的，答案是由人类标注者在维基百科文章中标注的文本片段。

We use the same train / dev / test splits for the open-domain QA setting as Izacard & Grave (2021) and Karpukhin et al. (2020). For TriviaQA, GPT-3 / GLaM / PaLM (Brown et al., 2020; Du et al., 2022; Chowdhery et al., 2022) evaluate on the Wikipedia dev set of 7,993 examples, so we ran an additional evaluation on that dev set in order to compare with their performance.

我们采用了与Izacard & Grave (2021)和Karpukhin等人(2020)相同的开放域问答任务训练/开发/测试集划分方案。对于TriviaQA数据集，GPT-3/GLaM/PaLM(Brown等人，2020；Du等人，2022；Chowdhery等人，2022)在包含7,993个样本的维基百科开发集上进行评估，因此我们额外在该开发集上进行了评测以对比模型性能。

– FEVER (Thorne et al., 2018) is one of the largest datasets for fact checking, which requires retrieving evidence from an external corpus to determine whether a statement is supported or refuted.

FEVER (Thorne et al., 2018) 是事实核查领域最大的数据集之一，需要通过从外部语料库检索证据来判断某个陈述是被支持还是被反驳。

– FOOL ME TWICE (FM2) (Eisenschlos et al., 2021) is a challenging fact checking dataset collected by gamification. Players write challenging claims either entailed or refuted by evidence from Wikipedia. They are then tasked to spot the refuted claim among a group.

– FOOL ME TWICE (FM2) (Eisenschlos et al., 2021) 是一个通过游戏化方式收集的具有挑战性的事实核查数据集。玩家根据维基百科的证据撰写可被证实或反驳的复杂声明，随后需在一组声明中识别出被反驳的声明。

– WIZARD OF WIKIPEDIA (WoW) (Dinan et al., 2019) is an open-domain dialogue task for training agents that can converse knowledgeably about open-domain topics. One speaker in the conversation must ground their utterances in a specific knowledge sentence from a Wikipedia page.

– WIZARD OF WIKIPEDIA (WoW) (Dinan et al., 2019) 是一个开放领域对话任务，旨在训练能够就开放领域话题进行知识性对话的智能体。对话中的一方必须将其发言基于维基百科页面中的特定知识句子。

We use the same train / dev / test splits in KILT challenge (Petroni et al., 2021) for the FEVER and WoW datasets. Their test labels are hidden, so the performance can only be evaluated through https://ai.facebook.com/tools/kilt. For FM2, we use its official dataset splits.

我们在FEVER和WoW数据集上采用了与KILT挑战赛 (Petroni et al., 2021) 相同的训练/开发/测试集划分。这些数据集的测试标签未公开，性能评估需通过https://ai.facebook.com/tools/kilt完成。对于FM2数据集，我们使用其官方划分方案。

A.2 IMPLEMENTATION DETAILS

A.2 实现细节

We use T5-770M (Raffel et al., 2020) and T5-3B as our backbone models to implement FiD (Izacard & Grave, 2021). We use AdamW as the optimizer, with 2,000 warm-up steps. We set the dropout probability to 0.1 and the weight decay to 0.01. We use one A100 GPU for T5-770M with a batch size of 16. We use 8 A100 GPUs for T5-3B with a per-GPU batch size of 2, giving a total batch size of 16. We searched different learning rates, ranging from 5e-6 to 4e-5, and found that 3e-5 to 6e-5 performed the best under the T5-3B setting and 5e-5 to 1e-4 performed the best under the T5-770M setting. More individual implementation details are listed in Table 7.

我们采用T5-770M (Raffel等人，2020) 和T5-3B作为骨干模型来实现FiD (Izacard & Grave，2021)。优化器选用AdamW，预热步数为2000。dropout概率设为0.1，权重衰减设为0.01。运行T5-770M时使用单张A100显卡，批次大小设为16；运行T5-3B时使用8张A100显卡，每GPU批次设为2，总批次大小保持为16。我们测试了5e-6到$4e{-}5$范围内的不同学习率，发现T5-3B环境下3e-5至6e-5效果最佳，T5-770M环境下5e-5至1e-4表现最优。更多具体实现细节参见表7。
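The reader training settings described above can be collected into a small configuration object, sketched below. The class and field names are hypothetical, the peak learning rates are single values picked from the ranges reported as working best, and the linear warm-up followed by a constant rate is an assumption, since the text states only the number of warm-up steps and not the schedule shape.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FiDTrainingConfig:
    peak_lr: float
    per_gpu_batch_size: int
    num_gpus: int
    warmup_steps: int = 2_000
    dropout: float = 0.1
    weight_decay: float = 0.01  # used with the AdamW optimizer

    @property
    def total_batch_size(self) -> int:
        return self.per_gpu_batch_size * self.num_gpus

    def lr_at(self, step: int) -> float:
        """Assumed schedule: linear warm-up to peak_lr, then constant."""
        if step < self.warmup_steps:
            return self.peak_lr * step / self.warmup_steps
        return self.peak_lr


# Values taken from the prose above: 8 GPUs x 2 per GPU for T5-3B, one GPU x 16 for T5-770M.
FID_XL = FiDTrainingConfig(peak_lr=5e-5, per_gpu_batch_size=2, num_gpus=8)
FID_L = FiDTrainingConfig(peak_lr=1e-4, per_gpu_batch_size=16, num_gpus=1)
```

Both configurations end up with the same total batch size of 16; the larger model reaches it by spreading a small per-GPU batch across more devices.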

We implement other baseline methods by using repositories:

我们通过以下代码库实现了其他基线方法：

– BM25: https://github.com/castorini/pyserini
– DPR: https://github.com/facebookresearch/DPR
– Contriever: https://github.com/facebookresearch/contriever

BM25: https://github.com/castorini/pyserini
DPR: https://github.com/facebookresearch/DPR
Contriever: https://github.com/facebookresearch/contriever

Settings / Datasets	NQ	TriviaQA	WebQ	FEVER	FM2	WoW
GENREAD (FiD-l)
Peak learning rate	1e-4	1e-4	1e-4	1e-4	1e-4	5e-5
Total batch size	64	64	64	64	64	16
Total training steps	15,000	10,000	10,000	10,000	10,000	20,000
Best validation steps	6,000	500	8,500	5,000	6,000	20,000
Validation performance	43.27	69.47	60.33	88.97	73.57	18.60
Best validation → test	43.50	70.22	53.33	87.25	74.21	18.49
GENREAD (FiD-xl)
Peak learning rate	5e-5	6e-5	3e-5	5e-5	5e-5	3e-5
Total batch size	16	16	16	16	16	8
Total training steps	20,000	15,000	15,000	15,000	15,000	20,000
Best validation steps	14,000	8,500	11,500	10,000	6,000	16,500

Table 7: Hyperparameter settings and validation performance for open-domain QA (numbers reported in Table 2), fact checking and dialogue system (numbers reported in Table 3). The upper part numbers are from GENREAD (FiD-l) and the lower part numbers are from GENREAD (FiD-xl).

设置 / 数据集	NQ	TriviaQA	WebQ	FEVER	FM2	WoW
GENREAD (FiD-l)
峰值学习率	1e-4	1e-4	1e-4	1e-4	1e-4	5e-5
总批次大小	64	64	64	64	64	16
总训练步数	15,000	10,000	10,000	10,000	10,000	20,000
最佳验证步数	6,000	500	8,500	5,000	6,000	20,000
验证性能	43.27	69.47	60.33	88.97	73.57	18.60
最佳验证 → 测试	43.50	70.22	53.33	87.25	74.21	18.49
GENREAD (FiD-xl)
峰值学习率	5e-5	6e-5	3e-5	5e-5	5e-5	3e-5
总批次大小	16	16	16	16	16	8
总训练步数	20,000	15,000	15,000	15,000	15,000	20,000
最佳验证步数	14,000	8,500	11,500	10,000	6,000	16,500

表 7: 开放域问答 (表 2 中报告的数字)、事实核查和对话系统 (表 3 中报告的数字) 的超参数设置和验证性能。上半部分数字来自 GENREAD (FiD-l)，下半部分数字来自 GENREAD (FiD-xl)。

A.3 REPRODUCIBILITY VIA OPEN SOURCE LARGE LANGUAGE MODELS

A.3 通过开源大语言模型实现可复现性

We note that reproducing experiments on the OpenAI API, though publicly available, costs money. For this reason, we further add an evaluation on two open-source large language models, OPT (Zhang et al., 2022) and Codex (OpenAI, 2022). As shown in Table 8, OPT performed worse than InstructGPT, but still achieved comparable performance with DPR; OpenAI Codex achieved the best performance on both TriviaQA and WebQ.

我们注意到，虽然在OpenAI API上复现实验是公开可用的，但会产生费用。为此，我们进一步评估了两个开源大语言模型OPT (Zhang et al., 2022) 和Codex (OpenAI, 2022)。如表8所示，OPT的表现逊于InstructGPT，但仍与DPR相当；OpenAI Codex在TriviaQA和WebQ上均取得了最佳性能。

Documents obtained by ↓	TriviaQA	WebQ
DPR (Karpukhin et al., 2020)	66.3	50.8
OPT (Zhang et al., 2022)	62.1	51.8
InstructGPT (Ouyang et al., 2022)	71.3	54.5
Codex (OpenAI, 2022)	72.6	55.4

Table 8: Exact match (EM) score with using DPR and different open-source large language models such as OPT and Codex to generate contextual documents.

由以下方法获取的文档 ↓	TriviaQA	WebQ
DPR (Karpukhin et al., 2020)	66.3	50.8
OPT (Zhang et al., 2022)	62.1	51.8
InstructGPT (Ouyang et al., 2022)	71.3	54.5
Codex (OpenAI, 2022)	72.6	55.4

表 8: 使用DPR和不同开源大语言模型(如OPT和Codex)生成上下文文档时的精确匹配(EM)分数。

A.4 SCALING WITH NUMBER OF LARGE LANGUAGE MODEL PARAMETERS

A.4 大语言模型参数规模的扩展

Figure 4 shows how performance scales with the number of InstructGPT generator parameters, covering Ada-150M, Babbage-1.3B, Curie-6.7B, and Davinci-175B. For both FiD and our GENREAD, we use FiD-xl with 10 input documents, either retrieved from Wikipedia or generated by InstructGPT. Performance on both TriviaQA and WebQ continues to improve as the number of generator parameters increases, as does the slope. Only with the largest InstructGPT model does GENREAD outperform DPR-FiD. This indicates that using a large language model to generate contextual documents is an “emergent ability” of scaling, one that is absent in smaller models and present only in larger language models (Wei et al., 2022a).

图 4 展示了 InstructGPT 生成器参数对性能的影响，包括 Ada-150M、Babbage-1.3B、Curie-6.7B 和 Davinci-175B。我们注意到，对于 FiD 和我们的 GENREAD，我们使用了 FiD-xl 并输入 10 篇文档，这些文档要么是从维基百科检索的，要么是由 InstructGPT 生成的。随着生成器模型参数的增加，TriviaQA 和 WebQ 的性能持续提升，斜率也随之增大。只有在使用最大规模的 InstructGPT 时，GENREAD 才能超越 DPR-FiD。这表明利用大语言模型生成上下文文档是一种随着规模增长而"涌现的能力" (Wei et al., 2022a)，这种能力在较小模型中不存在，仅存在于较大的语言模型中。

Figure 4: Model performance with different sizes of InstructGPT as context generators.

图 4: 不同规模Instruct GPT作为上下文生成器的模型性能。

A.5 ADDITIONAL NUMBERS FOR TABLES IN THE MAIN PAPER

A.5 主论文中表格的补充数据

– Table 9 contains additional evaluation results for Table 1. It reports zero-shot open-domain QA performance, compared to recent large language models.

表 9 包含表 1 的额外评估结果，展示了与近期大语言模型相比的零样本开放域问答性能。

– Figure 5 contains an additional retrieval performance evaluation for Figure 3, covering the experiments on combining DPR-retrieved documents and large language model generated documents.

图 5: 包含针对图 3 实验的额外检索性能评估，该实验结合了 DPR 检索文档与大语言模型生成文档。

– Table 10 contains additional retrieval performance, evaluated by Recall@K, of baselines and different GENREAD variants. Some numbers in the table overlap with those in Figure 2.

表 10: 包含基线方法和不同GENREAD变体通过召回率 $\ @\mathrm{K}$ 评估的额外检索性能。表中部分数据与图 2 存在重叠。

Table 9: Additional numbers for Table 1. Zero-shot open-domain QA performance, compared to recent large language models. None of the models in the table leverages an external corpus for document retrieval. Compared to InstructGPT, our proposed GENREAD improves the EM score by $+6.9$ on average. GENREAD achieves state-of-the-art performance on the open test sets.

Models	# total parameters	NQ (open test)	TriviaQA (open test)	TriviaQA (wiki split)	WebQ (open test)
GPT-3 (Brown et al., 2020)	175B	14.6	49.2	64.3	14.4
Gopher (Rae et al., 2021)	280B	10.1	43.5	52.8	-
FLAN (Wei et al., 2021)	137B	20.7	56.7	68.1	-
GLaM (Du et al., 2022)	64B	21.5	-	68.0	19.0
Chinchilla (Hoffmann et al., 2022)	70B	16.6	55.4	67.0	-
PaLM (Chowdhery et al., 2022)	540B	21.2	-	76.9	10.9
InstructGPT (Ouyang et al., 2022)	175B	19.5	57.4	68.5	19.9
GENREAD (InstructGPT)	175B	28.2	59.3	70.3	24.8

表 9: 表 1 的补充数据。零样本开放域问答性能对比，与近期大语言模型的比较。表中所有模型均未使用外部语料库进行文档检索。相比 InstructGPT，我们提出的 GENREAD 平均可将 EM 分数提升 $+6.9$。GENREAD 在开放测试集上达到最先进性能。


Figure 5: Additional retrieval performance evaluation for Figure 3 of experiments on combining DPR retrieved documents and large language model generated documents. Merging documents from two sources achieved significantly better performance than using DPR retrieved documents only.

图 5: 对图 3 实验的补充检索性能评估，结合 DPR 检索文档与大语言模型生成文档的效果。合并两种来源的文档比仅使用 DPR 检索文档获得了显著更好的性能。

Models	TriviaQA			WebQ			NQ		
	R@1	R@10	R@20	R@1	R@10	R@20	R@1	R@10	R@20
BM25 (Robertson et al., 2009)	46.2	71.7	76.4	19.1	51.8	62.6	22.8	55.6	63.9
Contriever (Izacard et al., 2022a)	34.0	67.9	74.3	18.2	55.7	65.7	18.8	54.8	65.1
DPR (Karpukhin et al., 2020)	53.2	75.3	79.0	45.4	70.5	75.2	44.6	74.5	79.5
Google Search engine API	50.0	78.8	-	40.0	65.6	-	35.5	67.5	-
GENREAD (nucleus, p=.95)	65.1	81.6	83.8	49.5	71.4	74.4	40.1	66.2	70.6
GENREAD (10 human prompts)	65.5	81.8	-	50.8	72.7	-	40.5	66.9	-
GENREAD (clustering prompts)	69.6	82.9	85.1	54.5	73.3	75.4	48.0	70.9	74.5

Table 10: Retrieval performance, evaluated by Recall@K, of baselines and different GENREAD variants. Some numbers in the table overlap with those in Figure 2. The table aims to show the performance of more methods and to provide accurate recall numbers for future research comparisons.

模型	R@1	TriviaQA		WebQ			NQ
		R@10	R@20	R@1	R@10	R@20	R@1	R@10
BM25 (Robertson et al., 2009)	46.2	71.7	76.4	19.1	51.8	62.6	22.8	55.6
Contriever (Izacard et al.,2022a)	34.0	67.9	74.3	18.2	55.7	65.7	18.8	54.8
DPR (Karpukhin et al., 2020)	53.2	75.3	79.0	45.4	70.5	75.2	44.6	74.5
Google Search engine API	50.0	78.8		40.0	65.6		35.5	67.5
GENREAD (nucleus, p=.95)	65.1	81.6	83.8	49.5	71.4	74.4	40.1	66.2
GENREAD (10 human prompts)	65.5	81.8		50.8	72.7		40.5	66.9
GENREAD (clustering prompts)	69.6	82.9	85.1	54.5	73.3	75.4	48.0	70.9

表 10: 基准方法和不同GENREAD变体通过召回率 $\ @\mathrm{K}$ 评估的检索性能。表中部分数据与图2存在重叠。本表旨在展示更多方法的性能表现，并为后续研究比较提供精确的召回率数值。
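As a rough illustration of how a Recall@K number such as those in Table 10 can be computed, the sketch below counts a question as a hit when at least one of its top-K documents contains an acceptable answer. The case-insensitive substring test and the names `has_answer` and `recall_at_k` are assumptions made for this example; the exact matching procedure behind the reported numbers is not specified in this section.

```python
from typing import Sequence


def has_answer(document: str, answers: Sequence[str]) -> bool:
    """Return True if any acceptable answer occurs in the document.

    Assumption: a case-insensitive substring test stands in for whatever
    answer-matching procedure was actually used.
    """
    text = document.lower()
    return any(answer.lower() in text for answer in answers)


def recall_at_k(
    ranked_docs_per_question: Sequence[Sequence[str]],
    answers_per_question: Sequence[Sequence[str]],
    k: int,
) -> float:
    """Percentage of questions whose top-k documents contain an answer."""
    hits = sum(
        any(has_answer(doc, answers) for doc in docs[:k])
        for docs, answers in zip(ranked_docs_per_question, answers_per_question)
    )
    return 100.0 * hits / len(answers_per_question)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the paper's experiments.
    docs = [
        ["Mozart wrote operas.", "He also wrote chamber music."],
        ["Monsanto is a company."],
    ]
    answers = [["chamber music"], ["seed"]]
    print(recall_at_k(docs, answers, k=1))
    print(recall_at_k(docs, answers, k=2))
```

The same function applies to retrieved and generated documents alike, since it only inspects the document text.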

	TriviaQA		WebQ		NQ
	R@10	EM	R@10	EM	R@10	EM
Sample 5 documents from entire data	81.5	70.9	72.5	53.3	69.2	44.2
Sample 5 documents from each cluster	82.7	71.8	73.3	54.4	70.6	45.3

Table 11: Ablation study on the strategy of sampling documents as in-context demonstrations.

	TriviaQA		WebQ		NQ
	R@10	EM	R@10	EM	R@10	EM
从全部数据中采样5份文档	81.5	70.9	72.5	53.3	69.2	44.2
从每个聚类中采样5份文档	82.7	71.8	73.3	54.4	70.6	45.3

表11: 上下文示例文档采样策略的消融研究。

A.6 DISCUSSION ON INFERENCE COST OF DPR AND INSTRUCTGPT

A.6 关于 DPR 和 INSTRUCT GP T 推理成本的讨论

We now compare the costs of using DPR and InstructGPT to retrieve or generate contextual documents. We consider DPR using the BERT-base (Devlin et al., 2019) version with 110M parameters and InstructGPT using its largest version with 175B parameters. For simplicity, we use the FLOPs-per-token estimates for Transformer-based language models introduced by Kaplan et al. (2020). It should be noted that FLOPs are not a direct measure of real-world computing costs, as latency, power consumption, and other costs can vary widely based on other factors (Liu et al., 2022a).

我们现在比较使用DPR和Instruct GPT检索或生成上下文文档的成本。我们考虑使用具有1.1亿参数的BERT-base版本(Devlin et al., 2019)的DPR，以及使用具有1750亿参数的最大版本Instruct GPT。为简化起见，我们采用Kaplan等人(2020)提出的基于Transformer的语言模型的每token FLOPs估算值。需要注意的是，FLOPs并非现实世界计算成本的直接衡量标准，因为延迟、功耗和其他成本可能因其他因素而有很大差异(Liu et al., 2022a)。

For the DPR model, all Wikipedia documents (around 21M) only need to be encoded once. Therefore, as the number of input questions increases, the amortized cost per question gradually decreases. For a fair comparison, we first use DPR to encode all 21M Wikipedia documents once. Encoding all Wikipedia documents requires 110e6 (BERT-base parameters) $\times 21e6$ (total number of documents) $\times 100$ (tokens per document) $=2.3e17$ FLOPs. Once the embeddings of all candidate documents have been produced, retrieving documents for a given question requires 110e6 (BERT-base parameters) $\times 20$ (tokens per question) $+ 21e6$ (total number of documents) $\times (768+768-1)=3.2e10$ FLOPs.

对于DPR模型，所有维基百科文档(约2100万篇)只需编码一次。因此，随着输入问题数量的增加，边际计算成本会逐渐降低。为公平比较，我们首先使用DPR对所有2100万篇维基百科文档进行一次编码。编码全部维基百科文档需要110e6(BERT-base参数量) $\times21e6$(文档总数) $\times100$(每文档token数) $=2.3e17$ FLOPs。当所有候选文档的嵌入向量生成后，为给定问题检索文档需要110e6(BERT-base参数量) $\times20$(每问题token数) $+21e6$(文档总数) $\times(768+768-1)=3.2e10$ FLOPs。

For InstructGPT, generating 10 documents for a given question requires 175e9 (InstructGPT parameters) $\times 10$ (number of documents) $\times 55$ (generated tokens per document) $=9.6e13$ FLOPs.

对于 Instruct GPT，生成 10 份针对给定问题的文档需要消耗 175e9 (Instruct GPT 参数量) $\times10$ (文档数量) $\times~55$ (每文档生成 token 数) $=9.6e13$ FLOPs 的计算量。

Therefore, the total cost $Y_{\mathrm{DPR-cost}}$ of retrieving 10 documents with DPR, as a function of the number of input questions $X$, is $Y_{\mathrm{DPR-cost}}=3.2e10X+2.3e17$. Similarly, the total cost $Y_{\mathrm{GPT3-cost}}$ of generating 10 documents with InstructGPT, as a function of the number of input questions $X$, is $Y_{\mathrm{GPT3-cost}}=9.6e13X$. Setting $Y_{\mathrm{DPR-cost}}=Y_{\mathrm{GPT3-cost}}$ gives $X\approx2473$. In conclusion, if the number of input questions is less than 2473, the total cost of InstructGPT is lower than that of DPR; if the number of input questions is greater than 2473, the total cost of InstructGPT exceeds that of DPR.

因此，使用DPR检索10份文档的总成本$Y_{\mathrm{DPR-cost}}$与输入问题数量$X$的关系式为：$Y_{\mathrm{DPR-cost}}=3.2e10X+2.3e17$。此外，使用Instruct GPT生成10份文档的总成本$Y_{\mathrm{GPT3-cost}}$与输入问题数量$X$的关系式为：$Y_{\mathrm{GPT3-cost}}=9.6e13X$。当$Y_{\mathrm{DPR-cost}}=Y_{\mathrm{GPT3-cost}}$时，$X\approx2473$。综上所述，若输入问题数量少于2473，Instruct GPT的总成本低于DPR；若输入问题数量超过2473，Instruct GPT的总成本将高于DPR。
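The cost comparison above can be replayed numerically. The following is a minimal sketch that plugs in the constants exactly as stated in this section; the function and variable names are illustrative and not part of the paper. Evaluated without intermediate rounding, it gives a crossover of roughly 2,400 questions, the same order as the reported $X\approx2473$; the small gap may come from rounding or from slightly different constants than the ones quoted here.

```python
# Constants as stated in Appendix A.6; all names here are illustrative.
BERT_BASE_PARAMS = 110e6          # DPR encoder (BERT-base) parameters
NUM_DOCUMENTS = 21e6              # Wikipedia documents to index
TOKENS_PER_DOCUMENT = 100
TOKENS_PER_QUESTION = 20
EMBEDDING_DIM = 768               # size of each document/question embedding
INSTRUCTGPT_PARAMS = 175e9
DOCS_PER_QUESTION = 10
GENERATED_TOKENS_PER_DOC = 55


def dpr_indexing_flops() -> float:
    """One-time cost of encoding every Wikipedia document."""
    return BERT_BASE_PARAMS * NUM_DOCUMENTS * TOKENS_PER_DOCUMENT


def dpr_per_question_flops() -> float:
    """Encode one question, then score it against every document embedding."""
    encode = BERT_BASE_PARAMS * TOKENS_PER_QUESTION
    search = NUM_DOCUMENTS * (EMBEDDING_DIM + EMBEDDING_DIM - 1)
    return encode + search


def generation_per_question_flops() -> float:
    """Cost of generating all documents for one question with InstructGPT."""
    return INSTRUCTGPT_PARAMS * DOCS_PER_QUESTION * GENERATED_TOKENS_PER_DOC


def break_even_questions() -> float:
    """Number of questions X at which the two total costs are equal."""
    return dpr_indexing_flops() / (
        generation_per_question_flops() - dpr_per_question_flops()
    )


if __name__ == "__main__":
    print(f"DPR indexing:        {dpr_indexing_flops():.2e} FLOPs")
    print(f"DPR per question:    {dpr_per_question_flops():.2e} FLOPs")
    print(f"Generation per q.:   {generation_per_question_flops():.2e} FLOPs")
    print(f"Break-even X:        {break_even_questions():.0f} questions")
```

As the text notes, FLOPs are only a proxy for real-world cost, so the crossover point should be read as an order-of-magnitude estimate.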

A.7 ERROR ANALYSIS AND CASE STUDIES ON THE NQ DATASET

A.7 NQ数据集上的错误分析与案例研究

As stated in Zhang & Choi (2021), NQ contains a significant proportion, roughly 16.5%, of questions that have time-dependent answers. Similarly, Izacard et al. (2022b) observed that using the latest version of Wikipedia (12/2021) could lead to a drop of 4.4 in EM score, compared to the Wikipedia version (12/2018) from which the NQ questions were created. We provide case studies in Table 13.

如Zhang & Choi (2021)所述，NQ数据集中有约16.5%的问题具有时效敏感性答案。Izacard et al. (2022b)同样发现，相较于NQ问题构建时采用的维基百科版本(12/2018)，使用最新维基百科版本(12/2021)会导致EM分数下降4.4个百分点。我们在附录表13中提供了具体案例分析。

We conducted a case study of 100 examples from the NQ dataset. The results are shown in Table 12. Among these 100 examples, we found that 29 have data collection and annotation mistakes, mainly the temporal question issue (13/29) and the incomplete answer issue (16/29). A typical time-dependent question provides no specific time condition. For example, “Who won the MVP for the National League?” could have different answers in different years: in 2017, the MVP was Giancarlo Stanton, and in 2018, the MVP was Christian Yelich. Besides, some answer labels provided in the NQ dataset are not complete. For example, person names in the NQ dataset usually consist of first, middle, and last names, but most names in the generated documents are first and last names only. For the question “who played lionel in as time goes by?”, the labeled answer is “Geoffrey Dyson Palmer”. DPR-FiD produces “Geoffrey Dyson Palmer” but GENREAD produces “Geoffrey Palmer”, both of which should be considered correct. More examples are provided in Table 14.

我们对NQ数据集中的100个示例进行了案例研究，结果如表12所示。在这100个示例中，我们发现29个示例存在数据收集和标注错误，主要包括时间相关问题（13/29）和不完整答案问题（16/29）。典型的时间依赖性问题是没有提供具体的时间条件。例如，"谁赢得了国家联盟的MVP？"在不同年份可能有不同的答案。2017年的MVP是Giancarlo Stanton，而2018年的MVP是Christian Yelich。此外，NQ数据集中提供的一些答案标签并不完整。例如，NQ数据集中的人名通常由名、中间名和姓组成，但生成文档中的大多数名字只有名和姓。对于问题"在《As Time Goes By》中扮演Lionel的是谁？"，标注的答案是"Geoffrey Dyson Palmer"。DPR-FiD生成的是"Geoffrey Dyson Palmer"，而GENREAD生成的是"Geoffrey Palmer"，两者都应被视为正确。更多示例见表14。
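To make the incomplete answer issue concrete, the sketch below shows why an exact match (EM) metric rejects “Geoffrey Palmer” when the only label is “Geoffrey Dyson Palmer”. The normalization used here (lowercasing, stripping punctuation and English articles, collapsing whitespace) is a common convention in open-domain QA evaluation and is an assumption of this example; it is not taken from the paper's evaluation code.

```python
import re
import string
from typing import Iterable


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: Iterable[str]) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    normalized_golds = {normalize_answer(gold) for gold in gold_answers}
    return normalize_answer(prediction) in normalized_golds


if __name__ == "__main__":
    labels = ["Geoffrey Dyson Palmer"]
    print(exact_match("Geoffrey Dyson Palmer", labels))  # full name matches the label
    print(exact_match("Geoffrey Palmer", labels))        # missing middle name does not
```

Under this metric the second prediction is scored as wrong even though a human reader would accept it, which is the effect described above.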

Besides, GENREAD produced correct answers for 49 questions. Among the remaining 22 incorrect predictions, 12 could be classified as retrieval errors (i.e., step-I errors) and 10 as reading errors (i.e., step-II errors). In all cases of retrieval errors, none of the generated documents contains the correct answer. In all cases of reading errors, at least one generated document contains the correct answer, but the reader model failed to infer the correct answer from the documents.

此外，GENREAD为49个问题提供了正确答案。在22个错误预测中，12个可归类为检索错误（即步骤I错误），12个为阅读错误（即步骤II错误）。在所有检索错误案例中，生成的文档均未包含正确答案。而在所有阅读错误案例中，至少有一份生成文档包含正确答案，但阅读模型未能从文档中推断出正确结果。
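The distinction between step-I and step-II errors can be expressed as a simple decision rule. The sketch below is a hypothetical helper, not the procedure used for the case study (which was done by manual inspection as far as this section indicates); it approximates "a document contains the correct answer" with a case-insensitive substring test, which will misjudge paraphrased answers.

```python
from typing import Sequence


def classify_prediction(
    prediction: str,
    documents: Sequence[str],
    gold_answers: Sequence[str],
) -> str:
    """Label a prediction as correct, a reading error, or a retrieval error.

    Assumption: answer presence is approximated by case-insensitive
    substring matching against each generated document.
    """
    golds = [gold.lower() for gold in gold_answers]
    if prediction.lower().strip() in golds:
        return "correct"
    answer_in_documents = any(
        gold in document.lower() for document in documents for gold in golds
    )
    if answer_in_documents:
        # The evidence was available, but the reader did not use it.
        return "reading error (step II)"
    # No document mentions an acceptable answer.
    return "retrieval error (step I)"


if __name__ == "__main__":
    generated = [
        "In the first episode of Stranger Things, the character Will Byers "
        "dies. He is killed by Demogorgon, a monster from the Upside Down."
    ]
    print(classify_prediction("Will Byers", generated, ["Benny Hammond"]))
```

With the Stranger Things example from Table 12, the generated document never mentions the labeled answer, so the rule assigns the error to the first (generation) step.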

Table 12: Case study on 100 GENREAD predictions in the NQ dataset. Among 100 examples, there are 49 correct predictions, i.e., $\mathrm{EM}=49\%$. We further categorized the 51 incorrect predictions of our GENREAD, including errors caused by data collection and annotation, and errors caused by model prediction. In addition, we provide more case studies in Tables 13-15 (Table 13 for the temporal question issue; Table 14 for the incomplete answer issue; Table 15 for the hallucination issue).

Good Q&A (71%)	Correct prediction (49%)		- Query: Who got the first Nobel Prize in Physics? - Document: The first Nobel Prize in Physics was awarded in 1901 to Wilhelm Conrad Rontgen for his discovery of the remarkable rays subsequently named after him. - Predicted answer: Wilhelm Conrad Rontgen - Correct answer: Wilhelm Conrad Rontgen
	Wrong retrieval (12%)	Hallucinations (8%)	- Query: Who died in the first episode of Stranger Things? - Document: In the first episode of Stranger Things, the character Will Byers dies. He is killed by Demogorgon, a monster from the Upside Down. - Predicted answer: Will Byers - Correct answer: Benny Hammond
	Wrong retrieval (12%)	No hit answers (4%)	- Query: When was coffee first made into a drink? - Document: The history of coffee goes back to the 10th century, with coffee trees native to Ethiopia. The earliest substantiated evidence of either coffee drinking or knowledge of coffee tree is from sixth century AD in Ethiopia. - Predicted answer: the 10th century - Correct answer: the 15th century
	Wrong reading (10%)		- Query: When is the fourth movie of the Divergent series coming out? - Document: The fourth movie in the Divergent series was originally scheduled to be released in June 2017, but was delayed indefinitely. - Predicted answer: June 2017 - Correct answer: never made
Bad Q&A (29%)	Temporal questions (13%)		- Query: Who won the MVP for the National League? - Document: In 2017, the National League MVP was won by Giancarlo Stanton of the Miami Marlins. In 2018, the National League MVP was won by Christian Yelich of the Milwaukee Brewers. - Predicted answer: Christian Yelich - Correct answer: Giancarlo Stanton
Bad Q&A (29%)	Incomplete answers (16%)		- Query: Where do the greasers live in the Outsiders? - Document: The Outsiders is a novel by S.E. Hinton. It is about a gang of greasers in Oklahoma in the 1960s. The National League MVP was won by Christian Yelich of the Milwaukee Brewers. - Predicted answer: Oklahoma - Correct answer: Tulsa, Oklahoma

表 12: NQ数据集中100个GENREAD预测的案例分析。在100个示例中，有49个预测正确，即$\mathrm{EM}=49%$。我们进一步对GENREAD的51个错误预测进行了分类，包括数据收集和标注导致的错误，以及模型预测导致的错误。此外，我们在表13-15中提供了更多案例分析（表13针对时间性问题，表14针对不完整答案问题，表15针对幻觉问题）。

优质问答 (71%)	正确预测 (49%)	-	- 查询: 谁获得了第一个诺贝尔物理学奖？- 文档: 第一个诺贝尔物理学奖于1901年授予Wilhelm Conrad Rontgen，以表彰他发现了后来以他名字命名的非凡射线。- 预测答案: Wilhelm Conrad Rontgen- 正确答案: Wilhelm Conrad Rontgen
	错误检索 (12%)	幻觉 (8%)	- 查询: 谁在《怪奇物语》第一集中死亡？- 文档: 在《怪奇物语》第一集中，角色Will Byers死亡。他被来自颠倒世界的怪物Demogorgon杀死。- 预测答案: Will Byers- 正确答案: Benny Hammond
		无命中答案 (4%)	- 查询: 咖啡首次被制成饮料是什么时候？- 文档: 咖啡的历史可以追溯到10世纪，咖啡树原产于埃塞俄比亚。关于饮用咖啡或了解咖啡树的最早确凿证据来自公元6世纪的埃塞俄比亚。- 预测答案: the 10th century- 正确答案: the 15th century
	错误阅读 (10%)	-	- 查询: 《分歧者》系列第四部电影何时上映？- 文档: 《分歧者》系列第四部电影原定于2017年6月上映，但被无限期推迟。- 预测答案: June 2017- 正确答案: never made- 查询: 谁赢得了国家联盟MVP？
劣质问答 (29%)	时间性问题 (13%)	-	- 文档: 2017年，国家联盟MVP由迈阿密马林鱼队的Giancarlo Stanton获得。2018年，国家联盟MVP由密尔沃基酿酒人队的Christian Yelich获得。- 预测答案: Christian Yelich- 正确答案: Giancarlo Stanton- 查询: 《局外人》中greasers住在哪里？
	不完整答案 (16%)	-	- 文档: 《局外人》是S.E. Hinton的小说。讲述了1960年代俄克拉荷马州一群greaser帮派的故事。国家联盟MVP由密尔沃基酿酒人队的Christian Yelich获得。- 预测答案: Oklahoma- 正确答案: Tulsa, Oklahoma

Table 13: Case studies of temporality issues in the NQ dataset. All these questions are drawn from Zhang & Choi (2021), which contains a subset of NQ data examples with time-dependent questions.

Original question	NQ labels	Correct labels
Q: When is the last time the philadelphia won the superbowl?	Super Bowl LII; 2017	2018; February 4, 2018
DPR: 2017 ✗; Google search: 2018 ✓; GENREAD: February 4, 2018
Q: Who has the most big ten championships in football?	Michigan	Ohio State
DPR: Michigan ✗; Google search: Ohio State ✓; GENREAD: Ohio State ✓
Q: Who has the most super bowls in nfl history?	Pittsburgh Steelers	Pittsburgh Steelers; New England Patriots
DPR: Pittsburgh Steelers ✓; Google search: New England Patriots ✓; GENREAD: New England Patriots ✓
Q: How many casinos are in atlantic city new jersey?	11; eleven	9; nine
DPR: eleven ✗; Google search: nine ✓; GENREAD: nine ✓
Q: When did the us not go to the olympics?	1980	1980; 1984
DPR: 1980 ✓; Google search: 1980 and 1984 ✓; GENREAD: 1984 ✓
Q: Largest cities in the world by population?	Beijing	Tokyo
DPR: Beijing ✗; Google search: Tokyo ✓; GENREAD: Tokyo ✓
Q: Who has most followers on instagram in world?	Selena Gomez	Cristiano Ronaldo
DPR: Instagram ✗; Google search: Cristiano Ronaldo ✓; GENREAD: Cristiano Ronaldo ✓
Q: Who is the no. 1 ranked tennis player in the world?	Rafael Nadal	Novak Djokovic

表 13: NQ数据集时序性问题案例研究。所有问题均引自Zhang & Choi (2021)，该研究包含NQ数据集中时间敏感性问题的子集示例。

原始问题	NQ标注	正确答案
Q: 费城队最近一次赢得超级碗是什么时候？\| 超级碗LIl；2017年 \| 2018年；2018年2月4日 DPR:2017x；Google搜索:2018V；GENREAD:2018年2月4日
Q: 谁赢得过最多的十大联盟橄榄球锦标赛？ DPR: 密歇根大学 X；Google搜索: 俄亥俄州立大学 V；GENREAD: 俄亥俄州立大学 V	密歇根大学	俄亥俄州立大学
Q: NFL历史上谁赢得过最多的超级碗？	匹兹堡钢人队	匹兹堡钢人队；新英格兰爱国者队
DPR: 匹兹堡钢人队 V；Google搜索: 新英格兰爱国者队 V；GENREAD: 新英格兰爱国者队 V
Q: 新泽西州大西洋城有多少家赌场？ DPR:11家X；Google搜索:9家V；GENREAD:9家V	11；十一家	9；九家
Q: 美国何时未参加奥运会？ DPR:1980年V；Google搜索:1980年和1984年V；GENREAD:1984年√	1980年	1980年；1984年
Q: 世界上人口最多的城市是？ DPR: 北京 X；Google搜索: 东京 V；GENREAD: 东京 V	北京	东京
Q: Instagram全球粉丝最多的人是谁？	赛琳娜·戈麦斯	克里斯蒂亚诺·罗纳尔多
DPR:Instagramx；Google搜索:克里斯蒂亚诺·罗纳尔多 V；GENREAD:克里斯蒂亚诺·罗纳尔多 V Q: 当前世界排名第一的网球选手是谁？	拉斐尔·纳达尔	诺瓦克·德约科维奇

Original question	DPR-FiD predictions	GENREAD predictions
Q: Who played lionel in as time goes by?	Geoffrey Dyson Palmer	Geoffrey Palmer
DPR retrieved documents: Geoffrey Dyson Palmer, (born 4 June 1927) is an English actor known for his roles in British television sitcoms playing Jimmy Anderson in “The Fall and Rise of Reginald Perrin”, Ben Parkinson in “Butterflies” and Lionel Hardcastle in “As Time Goes By”. His film appearances include “A Fish Called Wanda”, “The Madness of King George”, “Mrs. Brown”, and “Tomorrow Never Dies”.
GPT generated documents: As Time Goes By is a British sitcom that aired on BBC One from 1992 to 2005. The show starred Geoffrey Palmer and Judi Dench as Lionel and Jean Pargetter, a middle-aged couple who reunite after many years apart. Lionel was played by Palmer, who was also a writer on the show.
Explanation: The labeled answer is “Geoffrey Dyson Palmer”, however, “Geoffrey Palmer” is also correct.
Q: How many cracker barrels in the united states?	645	over 630
Explanation: The labeled answer is “639” or “over 600”, so “over 630” is also a reasonable answer.
Q: Where do the greasers live in the outsiders?	Tulsa, Oklahoma	Oklahoma
DPR retrieved documents: The movie received mostly positive reviews from critics, and performed well at the box office, grossing 33 million on a 10 million budget. In Tulsa, Oklahoma, greasers are a gang of tough, low-income working-class teens. They include Ponyboy Curtis and his two older brothers, Sodapop and Darrel, as well as Johnny.
GPT generated documents: The Outsiders is a novel by S.E. Hinton. It is about a gang of greasers in Oklahoma in the 1960s. The greasers are from the poor side of town and constantly in trouble with the law.
Explanation: The labeled answer is “Tulsa, Oklahoma”, but “Oklahoma” is also a correct answer.
Q: Where are unipolar neurons found in spinal cord?	the granule region

Table 14: Case studies of the incomplete answers issue of the NQ dataset. Since the labels in NQ are spans identified from Wikipedia passages, it is easier for DPR-FiD to predict correct answers.

原始问题	DPR-FiD预测	GENREAD预测
Q: 谁在《As Time Goes By》中扮演Lionel？DPR检索到的文档：Geoffrey Dyson Palmer（1927年6月4日出生）是一位英国演员，因在英国电视情景喜剧中扮演的角色而闻名，包括《The Fall and Rise of Reginald Perrin》中的Jimmy Anderson、《Butterflies》中的Ben Parkinson以及《As Time Goes By》中的Lionel Hardcastle。他的电影作品包括《A Fish Called Wanda》、《The Madness of King George》、《Mrs. Brown》和《Tomorrow Never Dies》。GPT生成的文档：《As Time Goes By》是一部英国情景喜剧，于1992年至2005年在BBC One播出。该剧由Geoffrey Palmer和Judi Dench主演，分别饰演Lionel和Jean Pargetter，一对中年夫妇	Geoffrey Dyson Palmer	Geoffrey Palmer说明：标注答案为“Geoffrey Dyson Palmer”，但“Geoffrey Palmer”也是正确的。
他们在多年分离后重逢。Lionel由Palmer扮演，他也是该剧的编剧之一。Q: 美国有多少家Cracker Barrel餐厅？	645说明：标注答案为“639”或“超过600”，因此“超过630”也是一个合理的答案。	超过630
Q: 在《The Outsiders》中，greasers住在哪里？在票房上表现良好，以1000万美元的预算获得了3300万美元的收入。在俄克拉荷马州的塔尔萨，greasers是一群坚强、低收入的工作阶层青少年。他们包括Ponyboy Curtis和他的两个哥哥Sodapop和Darrel，以及Johnny。	俄克拉荷马州塔尔萨说明：标注答案为“Tulsa, Oklahoma”，但“Oklahoma”也是一个正确答案。	俄克拉荷马州DPR检索到的文档：这部电影获得了评论家的大部分正面评价，并在票房上表现良好
GPT生成的文档：《The Outsiders》是S.E. Hinton的一部小说。它讲述了20世纪60年代俄克拉荷马州一群greasers的故事。这些greasers来自城镇的贫困地区，经常与法律发生冲突。Q: 单极神经元在脊髓的哪个区域被发现？	颗粒区

Question: Who wrote the first declaration of human rights?
Answer: Cyrus Cylinder which was written by George Mason in 1776.
Retrieved document: John Peters Humphrey, OC (April 30, 1905 – March 14, 1995) was a Canadian legal scholar, jurist, and human rights advocate. He is most famous as the author of the first draft of the Universal Declaration of Human Rights. ... (Wikipedia page: John Peters Humphrey)
Golden Wikipedia evidence: ... government, a reading prominently advanced by Shah Mohammed Reza Pahlavi, in a 1967 book, The White Revolution of Iran. (Wikipedia page: Cyrus Cylinder)
Hallucination: In the GPT generated documents, the first declaration of human rights is taken to be the Virginia Declaration of Rights, which is factually incorrect. This error might occur because the Wikipedia page of George Mason contains similar descriptions, causing the model to confuse their meanings.
Model predictions: DPR-FiD: John Peters Humphrey; GENREAD: George Mason

Question: Who died in the first episode of stranger things
Answer: Benny or Barb
Generated document: In the first episode of Stranger Things, the character Will Byers dies. He is killed by the Demogorgon, a monster from the Upside Down.
Retrieved document: One area of focus from these fans was the character of Barb, the nerdy friend and classmate of Nancy who is taken and killed by the monster early in the season. According to actress Shannon Purser, Barb wasn't supposed to be a big deal; and the Duffer Brothers had not gone into great detail about the character since the focus was on finding Will. ... (Wikipedia page: Stranger Things)
Golden Wikipedia evidence: The owner, Benny, learns from a tattoo of "O11" on her arm that her name is Eleven. Brenner, monitoring the phone lines, sends agents to the diner after Benny calls social services. The agents kill Benny, but Eleven manages to escape using telekinetic abilities. Joyce's phone short circuits after receiving a mysterious phone call that she believes is from Will. While searching for Will in the woods, Mike, Dustin and Lucas come across Eleven. (Wikipedia page: Stranger Things)
Hallucination: In the GPT generated documents, Will Byers is said to be dead, which is incorrect. Will Byers's death in season 1 is faked, and he is still alive at the end of the season.
Model predictions: DPR-FiD: Barb; GENREAD: Will

Question: Who played the judge in good will hunting?
Answer: James P. Flynn
Generated document: The character of the Judge in Good Will Hunting was played by actor Robert Sean Leonard. Leonard is best known for his role as Dr. James Wilson in the television series House.
Retrieved document: Janitor could show such facility with matrices applied this way. Good Will

| 问题: 第一份人权宣言是谁写的？答案: 居鲁士圆柱，由乔治·梅森于1776年撰写。检索文档: John Peters Humphrey (1905年4月30日 - 1995年3月14日) 是加拿大法学家、人权倡导者，以起草《世界人权宣言》初稿闻名... (维基百科页面: John Peters Humphrey) 黄金维基证据: 伊朗政府1967年著作《伊朗白色革命》中突出宣传了居鲁士圆柱的解读。(维基百科页面: 居鲁士圆柱) 幻觉: GPT生成文档误将《弗吉尼亚权利宣言》当作首份人权宣言，该错误可能源于乔治·梅森维基页面的相似描述导致模型混淆。模型预测: DPR-FiD John Peters Humphrey; GENREAD 乔治·梅森 |
| 问题: 《怪奇物语》第一集谁死了？答案: Benny或Barb 生成文档: 《怪奇物语》首集中Will Byers被逆世界怪物Demogorgon杀死。检索文档: 粉丝重点关注角色Barb——Nancy的书呆子好友，在本季早期被怪物杀害。演员Shannon Purser表示Barb本非重要角色...(维基百科页面: 怪奇物语) 黄金维基证据: 餐馆老板Benny通过"O11"纹身得知她叫Eleven。Brenner监听电话线，在Benny呼叫社会服务后派特工杀死Benny，但Eleven用心灵传动能力逃脱...(维基百科页面: 怪奇物语) 幻觉: GPT生成文档错误宣称Will Byers死亡，实际该角色第一季为假死，季末仍存活。模型预测: DPR-FiD Barb; GENREAD Will |
| 问题: 《心灵捕手》中法官扮演者是谁？答案: James P. Flynn 生成文档: 该角色由Robert Sean Leonard饰演，他因《豪斯医生》中Dr. James Wilson角色知名。检索文档: 看门人展现矩阵应用能力。《心灵捕手》 |

Query: What type of music did Mozart composed? Answers: opera, classical music, art song, chamber music, ballet

查询：莫扎特创作了什么类型的音乐？答案：歌剧、古典音乐、艺术歌曲、室内乐、芭蕾舞剧

DPR #1: The following is a list of surviving works that Mozart composed for performance at gatherings of Masons. The story and music of his opera “The Magic Flute” is also considered to have strong Masonic influences. The following is a partial list of family members, patrons, and colleagues who were Masons. Mozart’s grandfather Johann Georg, a bookbinder, was raised among the extended Mozart family in Augsburg, in the house of Johann’s own grandfather David Mozart. David and his children were distinguished architects and master ... (Wikipedia id: 10988168 – “Mozart and Freemasonry”)

DPR #1: 以下是莫扎特为共济会聚会创作并流传至今的作品列表。其歌剧《魔笛》的剧情与音乐也被认为深受共济会影响。以下为莫扎特家族成员、赞助人及共济会同僚的部分名单：莫扎特的祖父约翰·格奥尔格（装订匠）在奥格斯堡的莫扎特家族大宅中长大，该宅邸由其曾祖父大卫·莫扎特建造。大卫及其子女是杰出的建筑师与工匠...(维基百科ID: 10988168–"莫扎特与共济会")

DPR #2: Progressively, and in large part at the hands of Mozart himself, the contrapuntal complexities of the late Baroque emerged once more, moderated and disciplined by new forms, and adapted to a new aesthetic and social milieu. Mozart was a versatile composer, and wrote in every major genre, including symphony, opera, the solo concerto, chamber music including string quartet and string quintet, and the piano sonata. These forms were not new, but Mozart advanced their technical sophistication and emotional reach. He almost single-handedly developed and popularized the Classical piano concerto. He wrote a great deal ... (Wikipedia id: 451968 – “Wolfgang Amadeus Mozart”)

DPR #2: 在莫扎特本人的主导下，巴洛克晚期复杂的对位技法逐渐复兴，通过新形式的调和与规训，适应了新的美学和社会环境。莫扎特作为多产作曲家，涉猎所有主要音乐体裁：交响曲、歌剧、独奏协奏曲、包含弦乐四重奏与弦乐五重奏的室内乐，以及钢琴奏鸣曲。这些形式并非首创，但莫扎特提升了它们的技术深度与情感表现力。他几乎以一己之力发展并普及了古典钢琴协奏曲。他还创作了大量...（维基百科ID：451968——《沃尔夫冈·阿马德乌斯·莫扎特》）

Among top-10 retrieved documents: opera: 2 times; chamber music: 1 time

检索排名前10的文档中：歌剧(opera)出现2次，室内乐(chamber music)出现1次

GPT nucleus #1: Mozart was a prolific composer and wrote in many different genres. He composed symphonies, operas, concertos, chamber music, and more. Many of his works are considered some of the greatest ever written in those genres.

GPT核心#1：莫扎特是一位多产作曲家，创作涵盖多种不同体裁。他谱写了交响曲、歌剧、协奏曲、室内乐等作品，其中许多被公认为相关体裁中最伟大的杰作。

GPT nucleus #2: Mozart composed a variety of music, including operas, symphonies, chamber music, concertos, and sonatas. His early works were influenced by the music of Haydn and Bach, and his later works show the influence of the French and Italian composers of his time. Mozart’s music is characterized by its beauty, elegance, and precision, and he is considered one of the greatest composers of all time.

GPT核心#2：莫扎特创作了多种音乐形式，包括歌剧、交响乐、室内乐、协奏曲和奏鸣曲。其早期作品受到海顿与巴赫的影响，后期作品则展现出同时代法国和意大利作曲家的风格印记。莫扎特的音乐以优美、典雅和精准著称，被公认为史上最伟大的作曲家之一。

GPT nucleus #3: Mozart composed a wide variety of music, including operas, symphonies, concertos, solo pieces, and chamber music. He also wrote a number of religious works, including masses and motets.

GPT核心3：莫扎特创作了种类繁多的音乐作品，包括歌剧、交响曲、协奏曲、独奏曲和室内乐。他还谱写了许多宗教音乐作品，包括弥撒曲和经文歌。

Among top-10 generated documents: chamber music: 9 times

在前10名生成文档中：室内乐 (chamber music)：9次

GPT cluster #1: Mozart found work in Vienna as a piano teacher and composer for opera and other works. There he met a fellow Mason, Joseph Haydn, who was to become an important mentor. Mozart married Constanze Weber in 1782; their marriage is thought to have been happy, in contrast to his earlier unhappy arranged marriage to his cousin, Maria Anna “Nannerl” Mozart. Mozart composed more than 600 works, including works widely.

GPT集群#1: 莫扎特在维也纳以钢琴教师和歌剧及其他作品作曲家的身份谋生。在那里他遇到了同为共济会成员的约瑟夫·海顿，后者成为他重要的导师。1782年莫扎特与康斯坦泽·韦伯结婚；这段婚姻被认为很幸福，与他早年和表妹玛丽亚·安娜·"南内尔"·莫扎特不愉快的包办婚姻形成鲜明对比。莫扎特创作了600多部作品，包括广受欢迎的歌剧《费加罗的婚礼》《唐璜》和《魔笛》，以及《安魂曲》等杰作。他的作品涵盖交响乐、协奏曲、室内乐和钢琴独奏曲等多种体裁。

GPT cluster #2: Mozart was a prolific composer, and wrote in a wide variety of genres. He composed over 600 works, including operas, symphonies, concertos, solo works, and chamber music. His operas include The Marriage of Figaro, Don Giovanni, and Cosi fan tutte. His symphonies include the well-known Symphony No. 40 in G minor, and his concertos include the Piano Concerto No. 21 in C major, known as the Elvira Madigan concerto.

GPT集群#2：莫扎特是一位多产的作曲家，创作了多种类型的作品。他谱写了600多部作品，包括歌剧、交响曲、协奏曲、独奏作品和室内乐。其歌剧代表作有《费加罗的婚礼》《唐璜》《女人皆如此》；交响曲中以G小调第40号交响曲最为著名；协奏曲中则以C大调第21号钢琴协奏曲（因电影《艾尔维拉·麦迪根》配乐而被称为"艾尔维拉协奏曲"）最具代表性。

GPT cluster #3: Mozart was a prolific composer who wrote music in a wide variety of genres. He is best known for his work in the classical genre, which includes such works as his Symphony No. 40 in G minor and his piano concertos. He also wrote operas, including The Marriage of Figaro and The Magic Flute. In addition to his work in classical music, Mozart also composed works in the popular genre, such as his Eine kleine Nachtmusik.

GPT集群#3：莫扎特是一位多产作曲家，创作了涵盖多种流派的音乐作品。他最著名的成就在于古典乐领域，代表作包括《G小调第40号交响曲》和众多钢琴协奏曲。他还创作了《费加罗的婚礼》《魔笛》等歌剧作品。除古典音乐外，莫扎特也涉足流行音乐创作，例如著名的小夜曲《Eine kleine Nachtmusik》。

Among top-10 generated documents: opera: 2 times; chamber music: 7 times; classical music: 1 time

生成文档排名前十中：歌剧：2次室内乐：7次古典音乐：1次

Table 16: Case study of retrieved documents and generated documents from language models (1/4).

表 16: 语言模型检索文档与生成文档的案例研究 (1/4)。

Query: What does Monsanto own? Answers: agricultural chemicals, seed, agriculture, chemical industry

查询：孟山都拥有哪些业务？
回答：农用化学品、种子、农业、化工行业

DPR #1: In September 2016, Monsanto agreed to be acquired by Bayer for US\$66 billion. In an effort to receive regulatory clearance for the deal, Bayer announced the sale of significant portions of its current agriculture businesses, including its seed and herbicide businesses, to BASF. The deal was approved by the European Union on March 21, 2018, and approved in the United States on May 29, 2018. The sale closed on June 7, 2018; Bayer announced its intent to discontinue the Monsanto name, with the combined company operating solely under the Bayer brand. ... (Wikipedia id: 935126 – “Monsanto”)

DPR #1：2016年9月，孟山都(Monsanto)同意被拜耳(Bayer)以660亿美元收购。为获得交易监管许可，拜耳宣布将旗下包括种子和除草剂业务在内的主要农业部门出售给巴斯夫(BASF)。该交易于2018年3月21日获欧盟批准，2018年5月29日获美国批准。2018年6月7日交易完成；拜耳宣布将停用孟山都品牌，合并后的公司仅保留拜耳品牌...(维基百科ID:935126–"Monsanto")

DPR #2: Behind it all is of course an evil corporation: Monsanto. GMO conspiracy theories GMO conspiracy theories are conspiracy theories related to the production and sale of genetically modified crops and genetically modified food (also referred to as genetically modified organisms or “GMOs”). These conspiracy theories include claims that agribusinesses, especially Monsanto, have suppressed data showing that GMOs cause harm, deliberately cause food shortages to promote the use of GM food, or have co-opted government agencies ... (Wikipedia id: 19304897 – “GMO conspiracy theories”)

DPR #2: 幕后黑手当然是一家邪恶公司: 孟山都。
转基因阴谋论
转基因阴谋论是与转基因作物及转基因食品(也称转基因生物或"GMO")生产销售相关的阴谋论。这些阴谋论声称农业企业(特别是孟山都)压制了显示转基因有害的数据、故意制造粮食短缺以推广转基因食品、或收买了政府机构...(维基百科ID: 19304897 – "转基因阴谋论")

Among top-10 retrieved documents: agriculture: 1 time; seed: 2 times

在前10篇检索到的文档中：农业(agriculture)出现1次，种子(seed)出现2次

GPT nucleus #1: Monsanto is an American agricultural biotechnology corporation founded in 1901. It is headquartered in Creve Coeur, Missouri. Monsanto is a leading producer of genetically engineered seed and Roundup, a glyphosate-based herbicide.

GPT核心观点1: Monsanto是一家成立于1901年的美国农业生物技术公司，总部位于密苏里州克雷弗科尔。该公司是转基因种子和草甘膦除草剂农达(Roundup)的领先生产商。

GPT nucleus #2: Monsanto is a multinational agrochemical and agricultural biotechnology corporation headquartered in Creve Coeur, Missouri. It is one of the world’s leading producers of Roundup, a glyphosate-based herbicide. The company also manufactures genetically engineered seed and Roundup Ready crops, including corn, soybeans, and cotton. As of 2018, Monsanto is the world’s largest producer of glyphosate. In 2018, Bayer acquired Monsanto for \$63 billion.

GPT核心#2：孟山都(Monsanto)是一家跨国农业化工与农业生物技术公司，总部位于密苏里州克雷沃克尔。该公司是全球草甘膦除草剂"农达"(Roundup)的主要生产商之一，同时生产转基因种子及抗农达作物(包括玉米、大豆和棉花)。截至2018年，孟山都已成为全球最大草甘膦生产商。2018年，拜耳(Bayer)以630亿美元完成对孟山都的收购。

Among top-10 generated documents: seed: 6 times


GPT cluster #1: Monsanto Company is an American multinational agrochemical and agricultural biotechnology corporation headquartered in Creve Coeur, St. Louis, Missouri. It is a leading producer of genetically engineered (GE) seed and of the herbicide glyphosate, which it markets under the Roundup brand. Founded in 1901 by John Francis Queeny, Monsanto spent most of the 20th century as a chemical company, becoming one of the world’s leading manufacturers of plastics, resins, and other chemicals.

GPT cluster #2: Monsanto is a biotechnology and agriculture company that specializes in genetically modified organisms (GMOs). The company owns a number of patents on GMOs, and it produces a range of products that contain GMOs, including herbicides, pesticides, and seeds.


GPT cluster #3: Monsanto is a multinational agricultural biotechnology corporation. It is one of the world’s leading producers of genetically engineered seed and Roundup, a glyphosate-based herbicide. The company also manufactures other agricultural chemicals, such as insecticides and fungicides.


Among top-10 generated documents: agriculture: 2 times; seed: 5 times; agricultural chemicals: 1 time

Query: What city was Zeus the patron god of? Answers: Olympia, Elis


DPR #1: Many were specific only to a particular deity or city-state. For example, the festival of Lykaia was celebrated in Arcadia in Greece, which was dedicated to the pastoral god Pan. Like the other Panhellenic Games, the ancient Olympic Games were a religious festival, held at the sanctuary of Zeus at Olympia. Other festivals centred on Greek theatre, of which the Dionysia in Athens was the most important. More typical festivals featured a procession, large sacrifices and a feast to eat the offerings, and many included entertainments and customs such as visiting friends, wearing fancy dress and unusual behaviour in the ... (Wikipedia id: 1945872 – “Ancient Greek religion”)

DPR #2: Theatre of Dionysus The Theatre of Dionysus Eleuthereus is a major theatre in Athens, considered to be the world’s first theatre, built at the foot of the Athenian Acropolis. Dedicated to Dionysus, the god of plays and wine (among other things), the theatre could seat as many as 17,000 people with excellent acoustics, making it an ideal location for ancient Athens’ biggest theatrical celebration, the Dionysia. It was the first theatre ever built, cut into the southern cliff face of the Acropolis, and supposedly the birthplace of Greek tragedy. The remains of a restored and redesigned Roman version can still ... (Wikipedia id: 3732515 – “Theatre of Dionysus”)

Among top-10 retrieved documents: Olympia: 1 time


Among top-10 generated documents: Olympia: 6 times


GPT cluster #1: Zeus was the patron god of the city of Olympia, which was located in the northwestern Peloponnese region of Greece. Olympia was the site of the Olympic Games, which were held every four years in honor of Zeus.


GPT cluster #2: Zeus was the patron god of the city of Olympia. Olympia was a sanctuary of ancient Greece in Elis on the Peloponnesos peninsula, in south-western Greece. The sanctuary was known as the Panhellenic Sanctuary of Zeus, and the altar of Zeus was located in the Temple of Zeus. The Olympic Games were held in Olympia every four years from 776 BC to 393 AD.

GPT cluster #3: Zeus was the patron god of the city of Olympia.

Among top-10 generated documents: Olympia: 9 times; Elis: 1 time

Query: What is the collection of the districts to the east of the Jordan River? Answers: Jordan, Golan Heights


DPR #1: Water politics in the Jordan River basin Water politics in the Jordan River basin refers to political issues of water within the Jordan River drainage basin, including competing claims and water usage, and issues of riparian rights of surface water along transnational rivers, as well as the availability and usage of ground water. Water resources in the region are scarce, and these issues directly affect the five political subdivisions (Israel, the West Bank, Lebanon, Syria and Jordan) located within and bordering the basin, which were created since the collapse, during World War I, of the former single controlling entity, the ... (Wikipedia id: 12822413 – “Water politics in the Jordan River basin”)

DPR #2: Jordan’s western border follows the bottom of the rift. Although an earthquake-prone region, no severe shocks had been recorded for several centuries. By far the greatest part of the East Bank is desert, displaying the land forms and other features associated with great aridity. Most of this land is part of the Syrian Desert and northern Arabian Desert. There are broad expanses of sand and dunes, particularly in the south and southeast, together with salt flats. Occasional jumbles of sandstone hills or low mountains support only meager and stunted vegetation that ... (Wikipedia id: 210052 – “Geography of Jordan”)


Among top-10 retrieved documents: Jordan: 8 times


GPT nucleus #1: The collection of the districts to the east of the Jordan River is known as the West Bank. The West Bank is a landlocked territory situated in Western Asia. It is bordered by Jordan to the east, Egypt to the southwest, and Israel to the west and north.


GPT nucleus #2: The collection of the districts to the east of the Jordan River is the West Bank. The West Bank is a landlocked territory near the Mediterranean coast of Western Asia, bordered by Lebanon to the north, Syria to the northeast, Jordan to the east, and Egypt to the southwest. It is the smaller of two political entities that comprise the Palestinian territories, the other being the Gaza Strip.


Among top-10 generated documents: Jordan: 10 times


GPT cluster #1: The collection of the districts to the east of the Jordan River is known as the West Bank. The West Bank is a landlocked territory situated in Western Asia. It is bordered by Jordan to the east, Egypt to the southwest, and Israel to the west and north.


GPT cluster #2: The Jordan River is a major river in the Middle East, the source of which is in the northern part of Israel. The river flows southward through Israel and then Jordan, emptying into the Dead Sea. East of the river is the collection of districts known as the East Bank.


GPT cluster #3: There is no single answer to this question as the east bank of the Jordan River is home to a number of different districts and regions, each with its own unique history, culture, and customs. However, some of the more well-known districts on the east bank include the West Bank, the Gaza Strip, and the Golan Heights.


Among top-10 generated documents: Jordan: 10 times; Golan Heights: 2 times
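The "Among top-10 ... documents" lines in the case studies above report how often each acceptable answer appears in the top-10 retrieved or generated documents. A minimal sketch of such a count is shown below. It assumes case-insensitive substring matching, which this appendix does not specify, and the function name `count_answer_hits` and the toy documents are illustrative only.

```python
def count_answer_hits(documents, answers, top_k=10):
    """Count how many of the first `top_k` documents mention each answer.

    Assumption: an answer "appears" in a document if it occurs as a
    case-insensitive substring. The matching rule used for the case
    studies may differ.
    """
    hits = {}
    for answer in answers:
        needle = answer.lower()
        hits[answer] = sum(needle in doc.lower() for doc in documents[:top_k])
    return hits


# Toy documents written for this example (not taken from any corpus).
docs = [
    "Zeus was the patron god of the city of Olympia.",
    "Olympia was a sanctuary of ancient Greece in Elis.",
    "The Theatre of Dionysus is a major theatre in Athens.",
]
print(count_answer_hits(docs, ["Olympia", "Elis"]))
# {'Olympia': 2, 'Elis': 1}
```

Substring matching is deliberately loose: under this rule a document that only mentions the "Jordan River" would also count as a hit for the answer "Jordan".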

B PROMPTS CHOICES


B.1 ZERO-SHOT LEARNING PROMPTS (FOR TABLE 1)


B.1.1 PROMPTS FOR “INSTRUCTGPT (NO DOCS.)”

We observed that the prompt (i.e., “Q: {query}\n\nA:”) used in the GPT-3 paper (Brown et al., 2020) performs poorly on the text-davinci-002 version. Therefore, we experimented with multiple prompts and found that the following two prompts work best on open-domain QA datasets.

– (1) “{query}\n\nThe answer is” (no space between {query} and \n)

– (2) “{query}\n\n The answer is” (performance reported in Table 1)

For fact checking and dialogue system, we used the following prompts.

– Fact Checking: “{claim}\n\nIs the claim true or false?”

– Open-domain Dialogue System: “{query}\n\n”
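The zero-shot prompts above are plain string templates. A minimal sketch of how they could be filled in is given below. The task keys and the helper name `build_no_docs_prompt` are hypothetical, and only prompt (1) is shown for QA because prompt (2) differs from it only in whitespace.

```python
# Templates copied from the prompts listed above; the dictionary keys
# ("qa", "fact_checking", "dialogue") are labels invented for this sketch.
NO_DOCS_TEMPLATES = {
    "qa": "{text}\n\nThe answer is",
    "fact_checking": "{text}\n\nIs the claim true or false?",
    "dialogue": "{text}\n\n",
}


def build_no_docs_prompt(task, text):
    """Fill the zero-shot (no documents) template for the given task."""
    if task not in NO_DOCS_TEMPLATES:
        raise ValueError(f"unknown task: {task!r}")
    return NO_DOCS_TEMPLATES[task].format(text=text)


prompt = build_no_docs_prompt("qa", "What city was Zeus the patron god of?")
print(repr(prompt))
# 'What city was Zeus the patron god of?\n\nThe answer is'
```

The resulting string would then be sent to the language model as-is; the model call itself is left out because it depends on the client library in use.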

B.1.2 PROMPTS FOR BACKGROUND GENERATION (STEP-1)


– Open-domain Question Answering: “Generate a background document from Wikipedia to answer the given question.\n\n{query}\n\n”

– Fact Checking: “Generate a background document from Wikipedia to support or refute the statement.\n\nStatement: {claim}\n\n”

– Open-domain Dialogue System: “Generate a background document from Wikipedia to answer the given question.\n\n{utterance}\n\n”
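Step 1 of generate-then-read amounts to filling one of the templates above and asking the language model to continue it. The sketch below illustrates this under stated assumptions: the helper names are hypothetical, and `complete` stands for any user-supplied function that maps a prompt string to generated text (no specific model API is implied).

```python
# Templates copied from the step-1 prompts above; keys are labels
# invented for this sketch.
BACKGROUND_TEMPLATES = {
    "qa": (
        "Generate a background document from Wikipedia to answer "
        "the given question.\n\n{text}\n\n"
    ),
    "fact_checking": (
        "Generate a background document from Wikipedia to support or "
        "refute the statement.\n\nStatement: {text}\n\n"
    ),
    "dialogue": (
        "Generate a background document from Wikipedia to answer "
        "the given question.\n\n{text}\n\n"
    ),
}


def build_background_prompt(task, text):
    """Fill the step-1 (background generation) template for a task."""
    return BACKGROUND_TEMPLATES[task].format(text=text)


def generate_background(task, text, complete):
    """Generate one background document.

    `complete` is a hypothetical callable (prompt -> generated text)
    standing in for a call to a large language model.
    """
    return complete(build_background_prompt(task, text)).strip()


def fake_complete(prompt):
    # Stand-in for a real model call, used only to make the sketch runnable.
    return " Monsanto is an agricultural biotechnology corporation. "


doc = generate_background("qa", "What does Monsanto own?", fake_complete)
print(doc)
# Monsanto is an agricultural biotechnology corporation.
```

Passing the model call in as a function keeps the prompt construction testable without network access.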

B.1.3 PROMPTS FOR READING COMPREHENSION (STEP-2)


We collected prompts from P3 (Bach et al., 2022), which includes over 2,000 open-source prompts for roughly 170 datasets. For zero-shot QA, we experimented with three different reading comprehension prompts and report the performance of each prompt in Table 20.

B.2 HUMAN PROMPT ANNOTATIONS (FOR SECTION 3.2.1)


To obtain prompts that lead large language models to generate better contextual documents, we asked 30 students in the computer science department to write different prompts. We first constructed a small validation set of 200 examples by combining 50 random question-answer pairs from NQ, 100 random pairs from TriviaQA, and 50 random pairs from WebQ. Whenever an annotator wrote down a prompt, our system immediately evaluated it on the validation set and returned the performance to the annotator. The annotator could then modify the prompt until the recall performance reached a threshold, which was set to 50 in our experiments. In the end, we obtained 29 prompts from the human annotators, because two of them were identical. We used the top-10 prompts (shown in Table 21 and Table 22) in the human prompt setting, as described in §3.2.1.
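The annotation loop described above can be sketched as follows. This is an illustration under assumptions, not the system used in the experiments: recall is taken to be the percentage of validation questions whose generated document contains an acceptable answer (case-insensitive substring match), and `generate_document`, `prompt_recall`, and `select_top_prompts` are hypothetical names.

```python
RECALL_THRESHOLD = 50.0  # threshold stated in the text


def prompt_recall(prompt, validation_set, generate_document):
    """Percentage of questions whose generated document contains an answer.

    `generate_document(prompt, question)` is a hypothetical callable that
    wraps a language model call and returns the generated document.
    """
    hits = 0
    for question, answers in validation_set:
        document = generate_document(prompt, question).lower()
        if any(answer.lower() in document for answer in answers):
            hits += 1
    return 100.0 * hits / len(validation_set)


def select_top_prompts(scored_prompts, k=10):
    """Drop duplicate prompts, then keep the k with the highest recall."""
    best = {}
    for prompt, score in scored_prompts:
        best[prompt] = max(score, best.get(prompt, score))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Toy validation set and a fake generator, for illustration only.
validation_set = [
    ("What city was Zeus the patron god of?", ["Olympia", "Elis"]),
    ("Who acquired Monsanto in 2018?", ["Bayer"]),
]
toy_documents = {
    "What city was Zeus the patron god of?": "Zeus was the patron god of Olympia.",
    "Who acquired Monsanto in 2018?": "In 2018, Bayer acquired Monsanto.",
}


def fake_generate(prompt, question):
    # Pretend that only prompts mentioning Wikipedia yield useful documents.
    return toy_documents[question] if "Wikipedia" in prompt else ""


good = "Generate a background document from Wikipedia to answer the given question."
weak = "Provide a background document for the given question."
scores = [(p, prompt_recall(p, validation_set, fake_generate)) for p in (good, weak, good)]
print(select_top_prompts(scores, k=1))
print(scores[0][1] >= RECALL_THRESHOLD)
```

In the described setup an annotator would see the recall value after each submission and revise the prompt until it reaches the threshold; the deduplication step mirrors the reduction from 30 submitted prompts to 29 distinct ones.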

Models	NQ (1)	NQ (2)	NQ (3)	NQ avg.	TriviaQA (1)	TriviaQA (2)	TriviaQA (3)	TriviaQA avg.	WebQ (1)	WebQ (2)	WebQ (3)	WebQ avg.
*with retriever, AND directly trained on these datasets
DPR + InstructGPT*
*with retriever, BUT NOT trained on these datasets
BM25 + InstructGPT	20.1	18.4	20.5	19.7	54.2	49.0	53.4	52.2	14.9	16.6	16.0	15.8
Contriever + InstructGPT	18.3	16.5	19.1	18.0	53.1	48.5	52.4	51.3	14.9	18.2	16.8	16.6
Google + InstructGPT	29.1	29.3	27.8	28.8	60.3	57.5	58.7	58.8	19.5	21.8	19.9	20.4
*without retriever, and not using external documents

Table 20: Zero-shot QA performance under different prompts. The prompts are listed in §B.1.

No.	Prompts	Validation
#1	Generate a background document from Wikipedia to answer the given question.	66.0
#2	Provide a background document from Wikipedia to answer the given question.	65.0
#3	Generate a background document from web to answer the given question.	64.0
#4	Generate a Wikipedia document to support the given question.	63.5
#5	Provide a background document for the given question.	63.0
#6	Prepare a background document to support the given question.	63.0
#7	To support the given question, prepare a background document.	62.5
#8	Create a background document that supports the given question.	61.5
#9	Retrieve a document from Wikipedia to answer the given question.	60.5
#10	Retrieve a Wikipedia article to address the posed question.	59.5

Table 21: Top-10 human prompts, evaluated on merged validation set of NQ, TriviaQA and WebQ.

Prompt No.	Validation	NQ	WebQ	TriviaQA	Avg.
#1 (Generate ...)	66.0	45.9	51.9	68.7	55.5
#2 (Provide ...)	65.0	43.9	51.0	68.3	54.4
#3 (Generate ...)	64.0	44.0	50.6	67.7	54.2
#4 (Generate ...)	63.5	43.2	51.2	67.5	54.0
#5 (Provide ...)	63.0	43.6	50.3	67.9	54.0
#6 (Prepare ...)	63.0	43.5	50.5	67.7	54.0
#7 (To support ...)	62.5	43.5	50.3	67.5	53.8
#8 (Create ...)	61.5	42.7	50.2	66.8	53.3
#9 (Retrieve ...)	60.5	41.6	49.0	68.2	53.0
#10 (Retrieve ...)	59.5	40.7	49.5	67.7	52.7

Table 22: Performance on NQ, TriviaQA and WebQ test sets of top-10 human prompts.

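As a quick sanity check on Table 22, the Avg. column can be recomputed as the mean of the three test-set scores. The sketch below does this from the rounded values printed in the table; the recomputed means agree with the reported averages to within about 0.1, which would be consistent with the reported averages having been computed from unrounded scores (an assumption, since the table does not say). Variable and function names are illustrative.

```python
# (prompt, validation, NQ, WebQ, TriviaQA, reported Avg.) copied from Table 22.
TABLE_22 = [
    ("#1", 66.0, 45.9, 51.9, 68.7, 55.5),
    ("#2", 65.0, 43.9, 51.0, 68.3, 54.4),
    ("#3", 64.0, 44.0, 50.6, 67.7, 54.2),
    ("#4", 63.5, 43.2, 51.2, 67.5, 54.0),
    ("#5", 63.0, 43.6, 50.3, 67.9, 54.0),
    ("#6", 63.0, 43.5, 50.5, 67.7, 54.0),
    ("#7", 62.5, 43.5, 50.3, 67.5, 53.8),
    ("#8", 61.5, 42.7, 50.2, 66.8, 53.3),
    ("#9", 60.5, 41.6, 49.0, 68.2, 53.0),
    ("#10", 59.5, 40.7, 49.5, 67.7, 52.7),
]


def mean_test_score(row):
    """Mean of the NQ, WebQ and TriviaQA test scores of one table row."""
    _, _, nq, webq, triviaqa, _ = row
    return (nq + webq + triviaqa) / 3


# Absolute gap between the recomputed mean and the reported Avg. per prompt.
gaps = {row[0]: abs(mean_test_score(row) - row[5]) for row in TABLE_22}
max_gap = max(gaps.values())

# Which prompt is best on the validation set, and which on the test average?
best_by_validation = max(TABLE_22, key=lambda row: row[1])[0]
best_by_average = max(TABLE_22, key=lambda row: row[5])[0]

print(f"largest gap: {max_gap:.2f}")
print(best_by_validation, best_by_average)
```

In this table the prompt with the highest validation score (#1) is also the one with the highest reported test average.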

Original paper: https://arxiv.org/pdf/2209.10063