FactSpotter: Evaluating the Factual Faithfulness of Graph-to-Text Generation
Abstract
Graph-to-text (G2T) generation takes a graph as input and aims to generate a fluent and faithful textual representation of the information in the graph. The task has many applications, such as dialogue generation and question answering. In this work, we investigate to what extent the G2T generation problem is solved for previously studied datasets, and how proposed metrics perform when comparing generated texts. To help address their limitations, we propose a new metric that correctly identifies factual faithfulness, i.e., given a triple (subject, predicate, object), it decides if the triple is present in a generated text. We show that our metric FactSpotter achieves the highest correlation with human annotations on data correctness, data coverage, and relevance. In addition, FactSpotter can be used as a plug-in feature to improve the factual faithfulness of existing models. Finally, we investigate if existing G2T datasets are still challenging for state-of-the-art models. Our code is available online: https://github.com/guihuzhang/FactSpotter.
1 Introduction
Graph-to-text (G2T) generation is an important task in natural language generation, as it renders graphs, and in particular knowledge graphs, accessible to non-technical users in downstream applications such as question answering (Gu et al., 2021; Romero and Razniewski, 2020), knowledge-grounded dialogue generation (Zhou et al., 2018), and document summarization (Fan et al., 2019). In recent years, several datasets (Gardent et al., 2017; Nan et al., 2021) and methods have been proposed for G2T generation (Ke et al., 2021; Ribeiro et al., 2021), in addition to G2T competitions (Shimorina et al., 2018; Castro Ferreira et al., 2020). Evaluating text generation is a challenging task in itself (Celikyilmaz et al., 2020); moreover, in the context of G2T generation, we are concerned not only with the fluency of the generated text, but also with its faithfulness to the input graph. While recent models such as T5 and GPT are very fluent, they have been criticized for their factual accuracy, a problem commonly referred to as hallucination (Ji et al., 2023; Liu et al., 2022). Hallucinations are a serious drawback in G2T, where the generated text should only contain facts mentioned in the input graph.

In this work, we focus on measuring and improving the factual accuracy of G2T generative models. More precisely, our contributions are as follows: i) we introduce a novel metric, FactSpotter, for detecting whether G2T generations are faithful to the facts present in the input graph; ii) we show how FactSpotter can be used in the inference step of any G2T model to improve its generations; iii) we analyze the difficulty of existing G2T datasets and determine which are (resp., are no longer) challenging for state-of-the-art models. FactSpotter can be extended to other data-to-text tasks via methods for transforming a relational dataset into RDF, such as the R2RML language.
2 Related Work
2.1 Graph-to-text (G2T) generation
In (Ribeiro et al., 2021), the authors investigate the potential of pretrained language models (PLMs) on the G2T task. They consider two Transformer-based models (Vaswani et al., 2017) with an encoder-decoder structure: T5 (Raffel et al., 2020) and BART (Lewis et al., 2020). The models receive as input a linearized (serialized) version of the input graph, in which the tags $\langle H\rangle$, $\langle R\rangle$, and $\langle T\rangle$ are added before the head entity, the relation, and the tail entity of a triple, respectively.
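As a concrete illustration, here is a minimal sketch of such a linearization (the function name is ours; the tag strings follow the description above):

```python
def linearize_graph(triples):
    """Serialize (head, relation, tail) triples with <H>, <R>, <T> tags,
    following the input format described in (Ribeiro et al., 2021)."""
    parts = []
    for head, relation, tail in triples:
        parts += ["<H>", head, "<R>", relation, "<T>", tail]
    return " ".join(parts)

# A single DBpedia fact becomes one flat model input:
print(linearize_graph([("The Myth of Sisyphus", "author", "Albert Camus")]))
# -> <H> The Myth of Sisyphus <R> author <T> Albert Camus
```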
The potential of large language models is further investigated in (Keymanesh et al., 2022) on the DART dataset (Nan et al., 2021), a data-to-text dataset constructed from tables and available in a triple-to-sentence format. The dataset is built from a combination of manual and automatic techniques. The authors empirically evaluate the GPT2 model (Radford et al., 2019) and the T5 model on the dataset by varying the amount of supervision a model receives. They also investigate the potential of adding predicate descriptors in the prompt and of re-ranking generations. In a small-scale human evaluation, they find that their best model, T5 large, outperforms the reference text regarding hallucinations and fluency, underlining that existing datasets suffer from poor human annotations, which we also observe and discuss in Section 7.
The authors of (Ke et al., 2021) propose modifying the Transformer model by adding extra attention and pooling layers to improve G2T generation. In addition, the model's pretraining has three steps, given the input graph and the expected output text: 1) reconstructing the text sequence given the complete subgraph, 2) predicting the masked entities and relations in the corrupted subgraph, given the complete text, and 3) aligning the embedding vectors of the knowledge graph and the text.
Most state-of-the-art G2T models are based on Transformers, and they can generally generate fluent texts related to the given graphs. Although various baselines design neural networks that encode both global and local information (Ribeiro et al., 2020), they cannot guarantee that the generated texts are factually faithful to the given graphs. It is also unclear whether current G2T datasets are still challenging.
2.2 Evaluation metrics for text generation
Alongside the significant improvements of G2T generation models and, more generally, of language models, new metrics to assess the quality of generations have been proposed (Celikyilmaz et al., 2020; Sai et al., 2022). G2T generation belongs to the broader field of natural language generation, which includes tasks such as machine translation, automatic summarization, and question answering. Each task has specific requirements, which might entail using some metrics over others. For example, in machine translation, the translation should match the ground-truth text as closely as possible, while in dialogue or summarization, adding or removing some information is acceptable or even desirable.
Evaluation metrics can be split into three categories: human-centric metrics, untrained automatic metrics, and machine-learned metrics. Human evaluation is the most important of these. It consists of asking users to evaluate the quality of a text along a specific dimension, such as fluency or correctness. Unfortunately, human evaluation sometimes suffers from low inter-annotator agreement (Celikyilmaz et al., 2020; Belz and Reiter, 2006), as different people might have different notions of what makes a text fluent or correct, or the instructions they receive for annotation might lack sufficient clarity. Also, human annotation is time- and money-consuming; it can become a bottleneck in the iterative process of improving a model. Hence the need for automatic metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), METEOR (Banerjee and Lavie, 2005), BERTScore (Zhang et al., 2019), BLEURT (Sellam et al., 2020), BARTScore (Yuan et al., 2021), and InfoLM (Colombo et al., 2022), among others. Overlap measures based on $n$-grams, such as BLEU, ROUGE, and METEOR, have been widely used in the literature, while recently proposed metrics based on word embeddings, such as BERTScore, BLEURT, and BARTScore, are gaining traction. The embedding-based measures have been shown to correlate better with human evaluation than the $n$-gram metrics. In addition, some metrics, such as BLEURT, have been trained on human annotations. As these automatic metrics are based on distances or similarities to ground-truth texts, they rely on the quality of the annotated sentences.
Apart from the works mentioned above, a few prior studies assessed factual faithfulness in graph-to-text generation. In (Faille et al., 2021), the authors introduce a metric for verifying whether entities in input graphs are represented in the generated texts. However, this work does not evaluate the quality of the predicates in the generated texts, which is a much more difficult task. In (Rebuffel et al., 2021), a question generation (QG) and question answering (QA) framework is employed to evaluate the quality of the generated text by determining if each question posed by the QG module can be answered by the QA module. We believe our contribution can further advance the state of the art, as: i) FactSpotter requires significantly fewer computational resources; ii) FactSpotter is self-supervised, thus it does not require data beyond that of the G2T model; iii) FactSpotter can be plugged into a G2T model to improve its generation.
3 Problem Statement
A knowledge graph (KG) consists of facts (or data triples) of the form ⟨subject, relation, object/literal⟩ and/or an ontology describing the properties that hold between classes and/or relations.
Graph-to-text. A G2T tool takes as input a graph and outputs a textual representation of it. G2T inputs are often subgraphs of larger real-world graphs, such as knowledge graphs. The textual representation should be fluent and should contain all the facts present in the input graph. For example, given a subgraph consisting of the single DBpedia (Auer et al., 2007) fact ⟨The Myth of Sisyphus, author, Albert Camus⟩, we would like to generate the sentence "The Myth of Sisyphus was written by Albert Camus". This work primarily focuses on creating textual representations of KG subgraphs.
Factual faithfulness. The following human evaluation criteria have been proposed to assess the factual quality of a generated text (Castro Ferreira et al., 2020): data correctness, data coverage, and relevance. Given a graph $G$ and a machine-generated text $T$, the generated text is characterized by:
- Data coverage or recall: are all the descriptions present in the graph $G$ included in the text $T$? i) Covered predicates: does $T$ contain all the predicates from $G$? ii) Covered entities: does $T$ contain all the entities from $G$?
- Relevance or precision: i) Relevant predicates: does $T$ only contain relevant predicates? ii) Relevant entities: does $T$ only contain relevant entities?
- Correctness: are predicates correctly mentioned and adequately introduced in the data?
Research questions. We focus on the following three research questions:
RQ1 What metric would better correlate with the factual faithfulness of generated text?
RQ2 Can we improve factual faithfulness of G2T?
RQ3 Is the G2T task solved on existing datasets?
4 FactSpotter: An explainable metric for factual faithfulness
In this section, we introduce a new metric for factual faithfulness. A good metric should be interpretable; that is, given as input a fact $f$ of the form ⟨subject, predicate, object⟩ and a text $T$, it assigns a score between 0 and 1, where a score close to 0 signifies that the text does not correctly represent the fact, and a score close to 1 rewards a factually faithful textual representation of $f$ in $T$. Such a metric can be used to compare different generation systems and, in addition, to assess a single system on a given dataset, to determine how close the system is to representing the input graphs correctly.
The intuition behind our score is the following. We train a model to perform a task simpler than G2T generation: only detecting whether facts exist in a sentence and whether they are well expressed. This simpler model can then be used as a plug-in feature by any existing G2T model, to help it perform the more complex task of language generation.
Our metric, FactSpotter, is trained as a binary classifier. Given as input a fact ⟨subject, predicate, object⟩ and a sentence, it should predict 1 if the fact is well expressed in the sentence, and 0 otherwise. Thus, FactSpotter is inherently interpretable. As classification models, we leverage recent large language models, capable of detecting semantic closeness even when different words, e.g., synonyms, are used. This approach is similar to the one taken to compute metrics such as BERTScore and BARTScore. Given an input G2T dataset $D$, with a training (train) set, a development (dev) set, and a test set, we create the training set as follows:
Positive samples. Given an instance of the training set of the form (graph $G$, ground-truth text $T$), for each fact (triple) in $G$, we create a positive sample of the form (fact $f$, ground-truth text $T$).
Negative samples. Given an instance of the training set of the form (graph $G$, ground-truth text $T$), for each fact $f\in G$, we create negative samples as follows: i) Type I: we perturb the fact $f$: we change its predicate, or an entity (subject or object), or both, while the ground-truth text $T$ remains unchanged. ii) Type II: we perturb the ground-truth text $T$: we drop one or both entities related to $f$ from the text, or drop the $n$-grams most similar to the predicate of $f$, or we apply several modifications simultaneously, keeping the fact unchanged.
For example, given the fact ⟨The Myth of Sisyphus, author, Albert Camus⟩ and its associated text "The Myth of Sisyphus was written by Albert Camus", a Type I negative sample alters the fact (⟨The Myth of Sisyphus, author, Simone de Beauvoir⟩, "The Myth of Sisyphus was written by Albert Camus"), while Type II yields the sample (⟨The Myth of Sisyphus, author, Albert Camus⟩, "The Myth of Sisyphus was written"). We associate probabilities with each perturbation and control the generation such that for each positive sample, we only generate one negative sample. To allow our classifier to learn from different negative samples and to avoid overfitting (Chen et al., 2020), for each training epoch we use a newly generated set of negative samples. The development set is built in the same way. Through evaluation on a fixed test set (Appendix A.4.1), we find that the model which best detects factual faithfulness is the one with a 90% probability of perturbing the fact and a 10% probability of perturbing the ground truth.
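A minimal sketch of this sampling scheme, under simplifying assumptions (the 90%/10% split is the best configuration reported above; the Type II perturbation shown only drops an entity mention, and all helper names are ours):

```python
import random

def make_negative(fact, text, other_entities, other_predicates, p_fact=0.9):
    """Derive one negative sample from a positive (fact, text) pair.
    Type I (prob. p_fact): perturb the fact, keep the text unchanged.
    Type II (otherwise):   perturb the text, keep the fact unchanged."""
    subj, pred, obj = fact
    if random.random() < p_fact:  # Type I: corrupt predicate, entity, or both
        what = random.choice(["predicate", "entity", "both"])
        if what in ("predicate", "both"):
            pred = random.choice(other_predicates)
        if what in ("entity", "both"):
            obj = random.choice(other_entities)
        return (subj, pred, obj), text
    # Type II: drop an entity mention from the ground-truth text
    return fact, text.replace(obj, "").strip()

fact = ("The Myth of Sisyphus", "author", "Albert Camus")
text = "The Myth of Sisyphus was written by Albert Camus"
print(make_negative(fact, text, ["Simone de Beauvoir"], ["publisher"]))
```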
Above, we have described FactSpotter as a (trained) classifier. To use it as a score (metric), we take the output of the model after the softmax layer. The final score of a generated text $T$ given an input graph $G$ is the average over the scores of each pair (fact $f\in G$, generated text $T$).
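Assuming a HuggingFace-style binary sequence classifier, the score could be computed along these lines (a sketch; `classifier` and `tokenizer` stand for the trained FactSpotter model and its tokenizer, and class 1 is the positive class):

```python
import torch

def factspotter_score(classifier, tokenizer, facts, text):
    """Average, over all facts of the input graph, of the post-softmax
    probability that each fact is well expressed in the generated text."""
    scores = []
    for subject, predicate, obj in facts:
        inputs = tokenizer(f"{subject} {predicate} {obj}", text,
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = classifier(**inputs).logits
        scores.append(torch.softmax(logits, dim=-1)[0, 1].item())
    return sum(scores) / len(scores)
```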
Parameters and performance of FactSpotter. As we aim to add our metric, FactSpotter, to the inference step of graph-to-text generation, we prefer small language models. Hence, we select the small Electra model (Clark et al., 2020). We experimented with other small models, such as DistilBERT and DistilRoBERTa, but did not observe an improvement. We train our classifier for 16 epochs, with a learning rate of $5\cdot10^{-5}$ and the AdamW optimizer. We describe in Appendix A.4.1 how we chose the percentages of negative samples for FactSpotter.
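For reference, a hedged sketch of the fine-tuning setup (hyperparameters as stated above; the HuggingFace model identifier and the single toy batch are illustrative assumptions):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/electra-small-discriminator")
model = AutoModelForSequenceClassification.from_pretrained(
    "google/electra-small-discriminator", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# One toy optimization step; in the full setup this loop runs for 16 epochs,
# with a freshly generated set of negative samples in every epoch (Section 4).
batch = tokenizer(["The Myth of Sisyphus author Albert Camus"],
                  ["The Myth of Sisyphus was written by Albert Camus"],
                  return_tensors="pt", padding=True, truncation=True)
loss = model(**batch, labels=torch.tensor([1])).loss  # 1 = fact expressed
loss.backward()
optimizer.step()
```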
The performance of FactSpotter on the test splits of multiple datasets is detailed in Table 1, with accuracy and F1 score. We also report the numbers of true positives/negatives (TP/TN) and false positives/negatives (FP/FN).
Table 1: Performance of FactSpotter on test splits.

Dataset | Accuracy | F1 | TP/TN | FP/FN |
---|---|---|---|---|
GrailQA | 96.85 | 96.85 | 2272/2643 | 64/96 |
SimpleQ. | 95.41 | 95.41 | 10260/10438 | 419/575 |
DART | 97.62 | 97.62 | 26613/26352 | 607/684 |
WebNLG17 | 99.10 | 99.09 | 8594/8575 | 93/63 |
WebNLG20 | 95.21 | 95.21 | 9637/9853 | 317/663 |
5 Evaluating graph-to-text generation
We investigate our first research question (RQ1): which metric would better correlate with the factual faithfulness of the generated text? For this question, we compare FactSpotter with BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), BERTScore (Zhang et al., 2019), BLEURT (Sellam et al., 2020), and BARTScore (Yuan et al., 2021). The only metric that is not normalized is BARTScore. This metric has a variant specifically adapted for factual faithfulness, BARTScore-faithful. Further details on these metrics are provided in Appendix A.1.
We calculate the system-level correlation between automatic metrics and human annotations. Given $S$ systems under evaluation, for a certain dimension, e.g., fluency, we compute the correlation between the system-level automatic metric scores $[M(S_{1}),\dots,M(S_{S})]$ and the corresponding system-level human scores $[H(S_{1}),\dots,H(S_{S})]$, where $M(S_{i})$ is the score of the automatic metric on the texts from system $S_{i}$, and $H(S_{i})$ is the score of the human annotation on the same result. Similarly to (Colombo et al., 2022), we compute three correlation metrics: Pearson correlation $(r)$, Spearman correlation $(\rho)$, and Kendall's Tau $(\tau)$. To test if a metric $M_{1}$ has a higher correlation with human annotations than $M_{2}$, we use the bootstrapping technique proposed in (Wilcox, 2016), described in Appendix A.2. We also report the sentence-level correlation in Appendix A.5. In addition to automatic measures, we report the correlation between one annotator and the average score of the remaining annotators, which should be an upper bound on the correlation we can obtain using automatic measures.
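A sketch of the system-level computation with SciPy (the score arrays are illustrative):

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# One aggregated score per system under evaluation, for one dimension.
metric_scores = [0.91, 0.87, 0.95, 0.80]  # [M(S_1), ..., M(S_S)]
human_scores = [0.88, 0.85, 0.97, 0.78]   # [H(S_1), ..., H(S_S)]

r, p_r = pearsonr(metric_scores, human_scores)
rho, p_rho = spearmanr(metric_scores, human_scores)
tau, p_tau = kendalltau(metric_scores, human_scores)
print(f"r={r:.2f} (p={p_r:.3f}), rho={rho:.2f}, tau={tau:.2f}")
```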
WebNLG 2017. In the WebNLG 2017 challenge (Shimorina et al., 2018), the organizers annotated 9 submissions on semantic adequacy (the text correctly represents the meaning in the data), text structure (defined as for WebNLG 2020 below; referred to as grammar in the original paper), and fluency (also defined below). This annotation covers 223 samples.
WebNLG 2020. After the WebNLG 2020 Challenge (Castro Ferreira et al., 2020), the organizers annotated the 16 participating systems on data correctness (the predicates found in the data are correctly mentioned together with their subject and object), data coverage (the text includes descriptions of all predicates presented in the data), and relevance (the text describes only those predicates with related subjects and objects which are in the data), in addition to text structure (the text is grammatical, well structured, and written in good English) and fluency (the text progresses naturally, forms a coherent whole, and is easy to understand). 178 generations of each system were annotated.
Table 2: Correlation at the system level with human judgment on correctness, data coverage, relevance, fluency, and text structure for the 2020 WebNLG task. In this table and the ones below, BERTF1 stands for BERTScore-F1, BARTS for BARTScore, BARTS-F for BARTScore-faithful, and FactS for FactSpotter. We highlight the best result and mark it with an asterisk when it is statistically significantly larger than any other metric (excluding the human-to-human correlation). FactSpotter performs best on correctness, data coverage, and relevance. We report a correlation value only if its p-value satisfies $p<0.05$; blank cells correspond to non-significant correlations.
Metric | Corr. r | Corr. ρ | Corr. τ | Cov. r | Cov. ρ | Cov. τ | Rel. r | Rel. ρ | Rel. τ | Flu. r | Flu. ρ | Flu. τ | Str. r | Str. ρ | Str. τ |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Correctness | 1.0 | 1.0 | 1.0 | 0.96 | 0.81 | 0.66 | 0.97 | 0.81 | 0.66 | 0.80 | 0.77 | 0.60 | 0.79 | 0.76 | 0.59 |
Data coverage | | | | 1.0 | 1.0 | 1.0 | 0.93 | 0.80 | 0.64 | 0.71 | 0.56 | 0.43 | 0.69 | 0.55 | 0.42 |
Relevance | | | | | | | 1.0 | 1.0 | 1.0 | 0.76 | 0.63 | 0.48 | 0.76 | 0.62 | 0.47 |
Fluency | | | | | | | | | | 1.0 | 1.0 | 1.0 | 0.98 | 0.97 | 0.91 |
Text structure | | | | | | | | | | 0.80 | 0.65 | 0.93 | 1.0 | 1.0 | 1.0 |
Human / BLEU | 0.96 | 0.59 | 0.64 | 0.48 | 0.53 | 0.83 | 0.53 | 0.68 | 0.40 | 0.96 | 0.56 | 0.74 | 0.59 | 0.95 | 0.93 |
METEOR | 0.72 | 0.75 | 0.60 | 0.65 | 0.58 | 0.44 | 0.70 | 0.60 | 0.45 | 0.50 | 0.87 | 0.88 | 0.84 | 0.68 | 0.86 |
BERTF1 | 0.77 | 0.60 | 0.64 | 0.89 | 0.74 | 0.86 | 0.88 | 0.72 | | | | | | | |
BLEURT | 0.83 | 0.93 | 0.82 | 0.67 | 0.74 | 0.86 | 0.58 | 0.65 | 0.43 | 0.50 | 0.81 | 0.91 | 0.65 | 0.50 | 0.55 |
BARTS | 0.90 | 0.83 | 0.67 | 0.86 | 0.71 | 0.53 | 0.88 | 0.69 | 0.71 | 0.56 | 0.77 | 0.92 | 0.81 | 0.78 | 0.63 |
BARTS-F | 0.67 | 0.54 | 0.41 | 0.68 | 0.61 | 0.46 | 0.68 | 0.59 | 0.45 | 0.51 | 0.52 | | | | |
FactS | 0.94 | 0.80 | 0.64 | 0.91 | 0.87 | 0.71 | 0.96* | 0.79 | 0.64 | 0.74 | 0.59 | -0.45 | 0.72 | 0.59 | 0.45 |
Table 3: Correlation at the system level with human judgment on semantic adequacy, grammar, and fluency, for the 2017 WebNLG dataset.
Metric | Sem.Adeq. r | Sem.Adeq. ρ | Sem.Adeq. τ | T.Struct. r | T.Struct. ρ | T.Struct. τ | Fluency r | Fluency ρ | Fluency τ |
---|---|---|---|---|---|---|---|---|---|
Sem.Adeq. | 1.0 | 0.73 | 1.0 | 0.71 | 0.98 | | | | |
T.Struct. | | | | 1.0 | 1.0 | 0.95 | 0.88 | | |
Fluency | | | | | | | 1.0 | 1.0 | 1.0 |
Human | 0.99 | 0.98 | 0.71 | 0.94 | 0.98 | 0.92 | 0.84 | 0.98 | 0.89 |
BLEU | 0.76 | 0.56 | 0.86 | 0.71 | 0.57 | 0.83 | 0.72 | 0.57 | |
METEOR | 0.86 | 0.83 | 0.67 | 0.85 | 0.70 | 0.57 | 0.80 | 0.70 | 0.56 |
BERTF1 | 0.70 | 0.78 | 0.63 | 0.71 | 0.70 | 0.57 | 0.69 | 0.70 | 0.57 |
BLEURT | 0.90 | 0.88 | 0.72 | 0.84 | 0.69 | 0.56 | 0.79 | 0.68 | 0.56 |
BARTS | 0.90 | 0.87 | 0.78 | 0.71 | 0.68 | | | | |
FactS | 0.97* | 0.93 | 0.85 | 0.67 | | | | | |
Results. On WebNLG 2020, Table 2 shows that FactSpotter has the best performance on factual faithfulness, significantly improving on relevance. BLEU, METEOR, BERTScore, and BLEURT reach similar fluency and text structure scores. For the results on WebNLG 2017 in Table 3, FactSpotter has the highest performance on semantic adequacy, the only dimension related to factual faithfulness. For text structure and fluency, BLEURT obtains the best results, although they are not statistically significant. Overall, previous metrics are better on text structure and fluency, which are generally considered no longer a bottleneck for large language models. FactSpotter is the best-suited metric for factual faithfulness.
6 Improving the factual faithfulness of graph-to-text generation
In this section, we investigate the answer to our second research question (RQ2): can we improve graph-to-text generation on factual faithfulness? For this, we first explain how to improve the inference step of any G2T model using FactSpotter, and then we present the results of this technique on state-of-the-art models for G2T generation.
6.1 Improving models’ factual faithfulness
Let $\mathcal{M}$ be a neural network G2T (seq2seq) model that, given an input sequence $x=x_{1},x_{2},\dots,x_{M}$, produces an output sequence $y=y_{1},y_{2},\dots,y_{N}$, where $x_{i}\in V_{x}$, $y_{i}\in V_{y}$, and $V_{x}$, $V_{y}$ are the vocabularies from which items in the input and output sequences are drawn, respectively. In the inference step, the model generates the sequence $y$ that maximizes:
$$
P(y|x)=\prod_{i=1}^{N}P(y_{i}|y_{<i},x)
$$
In practice, for computational efficiency, the logs of the probabilities are typically used in beam search. We use the following method to improve factual faithfulness in G2T inference without retraining the model.
Given: i) a graph-to-text generation model $\mathcal{M}$, ii) our factual faithfulness classifier, i.e., FactSpotter, and iii) a subgraph $G$ composed of $F$ facts, we encourage factual generations by modifying the prediction step as follows:
$$
\log(P^{f}(y_{i}|y_{<i},x)) = \lambda\sum_{j=1}^{F}\left(1-P_{fact_{j}}(y_{<i-1})\right)\log P_{fact_{j}}(y_{<i}) + \log(P(y_{i}|y_{<i},x))
$$
where: i) $P^{f}(y_{i}|y_{<i},x)$ is the probability of generating token $y_{i}$ given the factual classifier; ii) $P_{fact_{j}}(y_{<i-1})$ is the probability, computed by FactSpotter, that fact $j$ is correctly represented in the previously generated tokens $y_{0},\dots,y_{i-1}$; iii) $P(y_{i}|y_{<i},x)$ is the probability of generating the next token based on the previous $i-1$ tokens.
In our equation, a fact $j$ is encouraged only if, according to $P_{fact_{j}}(y_{<i-1})$, it has not yet been observed in the text generated up to step $i$. Then, adding token $y_{i}$ would increase the probability $P_{fact_{j}}(y_{<i})$ that the text includes fact $j$. When $P_{fact_{j}}(y_{<i-1})$ is small, the equation encourages the selection of words belonging to fact $j$. As $P_{fact_{j}}(y_{<i-1})$ grows large and $1-P_{fact_{j}}(y_{<i-1})$ tends to 0, we can consider that fact $j$ already appears in the text, and words that satisfy fact $j$ are no longer encouraged. The weight $\lambda$ controls the influence of FactSpotter on the prediction of the following words. A high $\lambda$ might yield a factual text, but not necessarily a fluent one. We generate tokens until we have generated a text $S=y_{0},\dots,y_{k}$ for which $\forall j\in F, P_{fact_{j}}(S)>0.5$, i.e., the probability that each fact $j$ in $G$ is verbalized in the text $S$ is over 0.5, the standard positive threshold.
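A minimal sketch of this re-scoring inside beam search, under our own naming (we assume `prefix_fact_probs[j]` holds $P_{fact_{j}}(y_{<i-1})$ for the current prefix and `cand_fact_probs[c, j]` holds $P_{fact_{j}}(y_{<i})$ after appending candidate token $c$, both obtained from FactSpotter; the default weight is the one later fixed in Section 6.2):

```python
import torch

def factual_log_probs(cand_log_probs, prefix_fact_probs, cand_fact_probs,
                      lam=0.15):
    """Re-score beam candidates following Equation 1.

    cand_log_probs:    [C] log P(y_i | y_<i, x) for C candidate tokens
    prefix_fact_probs: [F] P_fact_j of the current prefix, per fact j
    cand_fact_probs:   [C, F] P_fact_j if the candidate token is appended
    """
    # (1 - P_fact_j(y_<i-1)) weights facts not yet expressed in the prefix;
    # log P_fact_j(y_<i) rewards candidates that move a fact into the text.
    bonus = ((1.0 - prefix_fact_probs) * torch.log(cand_fact_probs)).sum(-1)
    return cand_log_probs + lam * bonus
```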
6.2 Models
We consider for our evaluation state-of-the-art models for G2T generation: PLM (Ribeiro et al., 2021) and JointGT (Ke et al., 2021). The former investigates how a standard seq2seq model can perform on G2T, given a carefully constructed representation of a graph. The latter proposes a more complex neural network, with built-in attention layers specialized for graph-structured inputs. Both are initialized with the pretrained weights of a language model. As the original authors did, we use T5 (Raffel et al., 2020) for both G2T models. For simplicity, we refer to the first model as T5 and to the second as JointGT. We refer to the models modified as explained in Section 6.1 as FactT5 and FactJointGT. For T5 and FactT5, we initialize the weights with T5 small and T5 base. For JointGT and FactJointGT, we initialize the weights with a pretrained T5 base model provided by the authors of JointGT. For the fine-tuning step (each model is fine-tuned on the training split of the datasets), we train the small models with a learning rate of $10^{-4}$ and a batch size of 32, while the base models are trained with a learning rate of $5\cdot10^{-5}$ and a batch size of 16. We use a beam size of 5 and the AdamW optimizer for both model sizes. For Equation 1, we fix the weight $\lambda=0.15$ (parameter tuning in Appendix A.4.2).
6.3 Datasets
To evaluate G2T performance, we need datasets of (graph, text) pairs. The graphs can be directly extracted from an underlying knowledge graph or adapted to have richer semantics, such as query graphs (Yih et al., 2015). The text associated with the graph should be a sentence or a paragraph verbalizing all the information contained in the subgraph.
Several datasets have been proposed in the literature for the G2T task, such as WebNLG (Gardent et al., 2017). Many question-answering (QA, in short) datasets are also in the form (graph, text). Since question-answering datasets can be very large and cover many KG predicates (Gu et al., 2021), we also leverage such datasets. To ensure that FactSpotter has never encountered the test data, it is trained exclusively on the training set and evaluated on the validation split.
Simple Questions (Bordes et al., 2015) is a QA dataset built on Freebase (Bollacker et al., 2008). The dataset contains 108K (triple, question) pairs, where the question corresponds to the subject and predicate of the triple, and the answer is the object of the triple. For example, given the triple (Gulliver's Travels, book/written_work/author, Dean Swift), the question is "Who wrote Gulliver's Travels?", with the answer Dean Swift. The dataset covers 76 domains, 741 classes, 89K entities, and 2K relations. A Freebase domain is a general area of knowledge such as business, politics, economics, etc. We created our own split for this dataset, where the test set is zero-shot: its predicates are not seen during training. FactSpotter can be trained to correctly classify whether a question refers to a triple, even if the object or subject is missing from the question, as we replace the entity with its type.
GrailQA (Gu et al., 2021) is also a QA dataset over Freebase, created using human annotation. The original dataset contains 64K (triple, question) pairs; however, the test set is not released, as the authors have created a QA challenge, hence we use the development set as a test set. The remaining data (training and development) consists of 51K pairs. The authors propose three levels of generalization and split the development and test sets as follows. 50% of the pairs come from held-out domains (Freebase assigns each entity and predicate a domain, such as music, sports, etc.) not covered in training: this is the zero-shot split. 25% of the pairs correspond to graphs where the combination of ontology items (classes and predicates) was not covered in training: this is the compositional split. Finally, the remaining 25% are randomly sampled from the same distribution as the training dataset: the i.i.d. split. The i.i.d. and compositional subsets contain only ontology items (classes and predicates) covered in training. For the zero-shot subset, five domains are held out for validation.
WebNLG (Gardent et al., 2017; Castro Ferreira et al., 2020) is a text generation dataset on DBpedia, created via human annotation. The dataset consists of (graph, paragraph) pairs, where the graph is a set of DBpedia facts and the paragraph consists of one or several sentences that describe the graph. We use the 2017 (2.1) and 2020 (3.0) versions of the dataset. The 2017 version contains 42K graph-text pairs and has two splits, the standard version and the constrained version. In the constrained version, the test set does not contain any triple occurring in train/dev. In this work, we considered only the constrained split, as it is more challenging. The WebNLG 2020 dataset has 40K pairs, comprising the 10 categories previously seen and used in WebNLG 2017, as well as 5 categories that were unseen in WebNLG 2017 and are now incorporated into the seen data of WebNLG 2020. It also introduces a new category of data: company.
DART (Nan et al., 2021) is a data-to-text dataset based on Wikipedia tables. Since the tables are represented as (subject, predicate, object) triples, it also suits our evaluation. Besides creating table-to-text annotations, the authors also use existing datasets: the QA dataset WikiSQL (Zhong et al., 2017), the cleaned E2E (Dušek et al., 2019) (entity-to-entity relations in the restaurant domain), and the original release of the WebNLG dataset for the 2017 challenge (Shimorina et al., 2018). The authors align the predicates such that predicates with the same meaning have the same representation. The dataset has 82K instances. We excluded the WikiSQL split, as it was generated automatically and, after manual verification, we observed many low-quality ground-truth texts.
6.4 Evaluation
In this section, we consider (RQ2): Can we improve G2T generation on factual faithfulness?
Table 4 shows the results on the Simple Questions dataset. We generally obtain a high FactSpotter score, indicating that models are already good at relaying factual information. We can improve factual faithfulness with F-T5 and FGT without significant compromise on other metrics, implying that the fluency of the texts is maintained.
Model | BLEU | METEOR | BERTF1 | BLEURT | BARTS | FactS |
---|---|---|---|---|---|---|
T5S | 37.97 | 36.50 | 93.43 | 67.85 | -2.45 | 96.80 |
F-T5S | 37.85 | 36.06 | 93.42 | 67.86 | -2.45 | 98.17 |
T5B | 38.73 | 36.43 | 93.61 | 68.48 | -2.43 | 95.09 |
F-T5B | 38.73 | 36.42 | 93.56 | 68.42 | -2.43 | 97.14 |
JGT-T5 | 39.35 | 36.82 | 93.65 | 68.50 | -2.42 | 95.40 |
FGT-T5 | 39.24 | 36.78 | 93.64 | 68.48 | -2.42 | 97.25 |
Table 4: Results on G2T on Simple Questions. Here and below, T5S stands for T5 small, T5B for T5 base, F-T5S for FactT5 small, F-T5B for FactT5 base, JGT-T5 for JointGT-T5, and FGT-T5 for FactJointGT-T5.
Table 5: Results on G2T on WebNLG 2017 Const.
Model | BLEU | METEOR | BERTF1 | BLEURT | BARTS | FactS |
---|---|---|---|---|---|---|
T5S | 66.24 | 47.80 | 96.72 | 73.16 | -1.41 | 98.67 |
F-T5S | 66.27 | 47.89 | 96.73 | 73.21 | -1.41 | 99.25 |
T5B | 67.04 | 48.35 | 96.81 | 73.22 | -1.40 | 99.44 |
F-T5B | 67.04 | 48.22 | 96.80 | 73.26 | -1.40 | 99.71 |
JGT-T5 | 67.08 | 48.34 | 96.76 | 73.44 | -1.39 | 99.09 |
FGT-T5 | 66.89 | 48.19 | 96.84 | 73.42 | -1.39 | 99.67 |
Table 6: Results on G2T on the WebNLG 2020 dataset.
Model | BLEU | METEOR | BERTF1 | BLEURT | BARTS | FactS |
---|---|---|---|---|---|---|
T5S | 52.30 | 40.82 | 93.43 | 65.80 | -1.75 | 90.75 |
F-T5S | 52.44 | 41.02 | 93.45 | 65.92 | -1.74 | 93.45 |
T5B | 54.29 | 41.66 | 93.65 | 66.43 | -1.69 | 93.60 |
F-T5B | 54.72 | 41.70 | 93.61 | 66.46 | -1.69 | 95.14 |
JGT-T5 | 54.23 | 41.49 | 93.47 | 66.23 | -1.72 | 91.26 |
FGT-T5 | 54.45 | 41.52 | 93.49 | 66.31 | -1.72 | 93.16 |
Table 7: Results on G2T on the DART dataset.
Model | BLEU | METEOR | BERTF1 | BLEURT | BARTS | FactS |
---|---|---|---|---|---|---|
T5S | 46.22 | 39.96 | 94.69 | 66.62 | -2.03 | 95.47 |
F-T5S | 46.31 | 40.07 | 94.74 | 66.66 | -2.02 | 97.29 |
T5B | 48.47 | 40.74 | 95.04 | 67.49 | -1.97 | 96.65 |
F-T5B | 48.37 | 40.72 | 95.05 | 67.43 | -1.97 | 97.60 |
JGT-T5 | 47.51 | 40.43 | 94.92 | 67.33 | -2.01 | 95.86 |
FGT-T5 | 47.39 | 40.32 | 94.92 | 67.26 | -2.00 | 97.25 |
Table 8: Results on G2T on the GrailQA dataset.
Model/Split | BLEU | METEOR | BERTF1 | BLEURT | BARTS | FactS |
---|---|---|---|---|---|---|
IID ||||||
T5S | 44.51 | 41.80 | 93.23 | 69.53 | -2.37 | 97.98 |
F-T5S | 44.64 | 41.88 | 93.25 | 69.63 | -2.36 | 98.47 |
T5B | 45.95 | 42.71 | 93.50 | 70.66 | -2.29 | 99.43 |
F-T5B | 46.10 | 42.67 | 93.52 | 70.73 | -2.29 | 99.50 |
JGT-T5 | 43.68 | 41.65 | 93.21 | 69.41 | -2.37 | 98.62 |
FGT-T5 | 43.61 | 41.65 | 93.19 | 69.41 | -2.37 | 99.12 |
Zero | ||||||
T5S | 30.30 | 36.91 | 91.74 | 62.87 | -2.74 | 93.27 |
F-T5S | 30.30 | 36.90 | 91.75 | 63.01 | -2.73 | 94.60 |
T5B | 32.20 | 37.35 | 91.92 | 63.84 | -2.72 | 94.77 |
F-T5B | 32.39 | 37.39 | 91.92 | 63.94 | -2.71 | 95.61 |
JGT-T5 | 32.94 | 37.69 | 92.02 | 64.18 | -2.68 | 94.15 |
FGT-T5 | 32.46 | 37.55 | 91.93 | 64.00 | -2.68 | 94.95 |
Comp. | ||||||
T5S | 30.38 | 35.32 | 92.09 | 63.99 | -2.74 | 94.94 |
F-T5S | 30.14 | 35.21 | 92.08 | 63.72 | -2.75 | 96.58 |
T5B | 31.75 | 35.64 | 92.24 | 63.99 | -2.72 | 94.84 |
F-T5B | 31.66 | 35.72 | 92.24 | 64.10 | -2.71 | 96.53 |
JGT-T5 | 31.46 | 36.08 | 92.39 | 64.92 | -2.67 | 95.26 |
FGT-T5 | 31.25 | 36.21 | 92.43 | 65.12 | -2.65 | 97.10 |
Table 5 shows the highest FactSpotter score across all datasets, which means that we observe the most factual generations on WebNLG 2017, with F-T5 and FGT having slightly higher scores.
In Table 6, the FactSpotter scores are lower on the WebNLG 2020 test split, although we achieve scores comparable to WebNLG 2017 on its validation split. This discrepancy may be attributed to the difference in distribution between the test and training splits of WebNLG 2020. F-T5B achieves a higher FactSpotter score than T5B without compromising fluency. We observe the same trends for the DART dataset in Table 7.
In Table 8, all the metrics are higher for the IID split of GrailQA; in particular, FactSpotter can reach 99.5%, hence the models learn to reproduce triples seen in training. For the zero-shot and compositional splits, larger models are better, and our factual inference improves the FactSpotter score. We illustrate some improved samples of the zero-shot and compositional splits in Appendix A.3, and in Appendix A.6 we investigate the impact of varying numbers of triples in the input subgraphs on the quality of the generated text.
To validate that generations indeed improve when using FactSpotter in inference, we select the best FGT-T5 model and analyze the top 20 phrases where the FactSpotter score improved the most compared to the JGT-T5 generations.
For the Simple Questions dataset, 15 generations are more factual and 5 are less factual. For the zero-shot split of GrailQA, 14 generations are more factual. For its compositional split, 13 generations are improved, but 5 are less factual. For the IID split, 4 generations are improved; the others are all rephrased texts. Only 7 samples of IID improve by more than 0.01 on FactSpotter, so this split is not challenging. For the DART dataset, 6 texts are more factual. DART contains samples whose ground-truth sentences do not match their graphs, so FactSpotter trained on DART has false positives. For the WebNLG 2017 dataset, 11 generations are more factual; the others are rephrased texts. Only 16 WebNLG 2017 generations improve by more than 0.1 on FactSpotter, whose baseline is very high. For WebNLG 2020, 12 generations are more factual and 3 are rephrased texts. 5 generations in WebNLG 2020 have a higher FactSpotter score than the baseline generations, but they are still not factual enough.
The cases where a generation becomes less factual are a consequence of the accuracy of FactSpotter, which we present in Section 4. Given that our metric does not correlate strongly with fluency, we perform a second analysis on generations to observe whether there is a decrease in fluency. To answer this question, we study the top 20 sentences for which BLEURT decreased the most in comparison with the original generated question. We do not observe an obvious decrease in fluency on any dataset; the decrease in BLEURT score is due to several other factors: BLEURT has difficulties identifying rephrased sentences; in a few cases the factual faithfulness decreased; and in the remaining cases the generations are more faithful to the input graph than the ground-truth sentences, yet BLEURT cannot identify it. Hence, we can conclude that adding FactSpotter as a plug-in in generation can improve G2T generations on factual faithfulness without affecting fluency.
7 Remaining challenges in the G2T task
Finally, we consider (RQ3): is the G2T task solved on existing datasets? We observed high FactSpotter scores on the models' performance in Section 6. We use FactSpotter to carry out a more detailed analysis: we investigate the percentage of generations in each dataset that had at least one fact considered missing by FactSpotter. A fact is considered missing if the score of the pair (fact, generated sentence) is lower than 0.5. We obtain the following statistics: 1.94% of texts miss a fact in Simple Questions; 7.27% of texts miss at least one fact in DART; 5.79% of WebNLG 2017 texts miss at least one fact, and 12.64% for WebNLG 2020; for GrailQA, we have 5.8% for the zero-shot split, 4.36% for the compositional split, and 1.13% for the IID split. According to these observations, WebNLG 2020 is the most challenging dataset, followed by DART, the zero-shot split of GrailQA, and the WebNLG 2017 dataset. In Appendix A.7, taking GrailQA and WebNLG 2017 as examples, we analyze the difficulty of G2T on datasets from different knowledge graphs, by looking into how often predicates and entity names are rephrased or expressed exactly as in the input graph.
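The statistic itself is straightforward to compute once per-fact scores are available; a sketch (the nested score lists are illustrative):

```python
def pct_with_missing_fact(fact_scores_per_generation, threshold=0.5):
    """Percentage of generations with at least one fact whose
    (fact, generated sentence) score is below the positive threshold."""
    missing = sum(1 for scores in fact_scores_per_generation
                  if any(s < threshold for s in scores))
    return 100.0 * missing / len(fact_scores_per_generation)

# One score list per generation, one score per input fact.
print(pct_with_missing_fact([[0.98, 0.42], [0.91], [0.77, 0.88]]))  # 33.33...
```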
We perform a second evaluation, this time by manually analyzing the output of the models. We consider the worst 20 sentences according to BLEURT and to FactSpotter, hence 40 examples per dataset or split. On Simple Questions, the generations are fluent; however, 22/40 have an incorrect predicate. For GrailQA, in the IID split the predicates are correctly generated, but the models still have difficulties generating some entity names (16/40). For the zero-shot split, generations suffer from wrong entities and predicates (22/40). The compositional split has several samples with a wrong ground truth (6/40 of the worst generations) and 19 out of 40 incorrect generations. For DART, 24 generations are not correct. On WebNLG 2017, among the worst 40 generations, only two might benefit from improved fluency, while in many examples the generated sentence was more fluent than the ground truth (14/40). Regarding correctness, only 2 out of 40 generations had a missing triple, while two generations incorrectly used a predicate. On WebNLG 2020, only one instance exhibits room for improvement in fluency, but 24 instances either omit factual information or contain incorrect facts. Among the 20 outputs with the lowest BLEURT scores, 9 are rephrased texts with correctly expressed facts. In contrast, among the 20 outputs with the lowest FactSpotter scores, only 4 instances fall into this category.
Based on our manual annotations, we observed that models are able to produce correct generations. However, when the generation is rephrased with respect to the ground-truth sentence, metrics like BLEURT, which measure whether two sentences are equivalent, struggle to assign high scores. We recall that BLEURT, a normalized metric, gives a score of 1 to equivalent sentences. On the WebNLG 2017 dataset, our metric assigns a very high score to the models, while the highest average BLEURT score is 73.44. The BLEURT scores of the generations vary from 0.46 to 0.99; more than 50% of the test set generations score less than 0.80. We sampled 40 generations with a BLEURT score below 0.8 and note that 35 generations are correct rephrasings of the ground truth, while 2 of the 35 are better than the ground truth. Hence, we observe that the BLEURT score cannot be used to determine whether we have achieved good performance on a dataset; it can only be used to compare different models. This issue has also been pointed out by its authors.
FactSpotter answers whether a fact is present in a text; it does not have to address the much harder task of deciding whether two sentences are equivalent. Besides being more reliable because it solves a simpler task, it is also more interpretable, as we can inspect the exact triples that are classified as negative, instead of doing a complete comparison between a subgraph and a sentence, or between two sentences. This is especially useful for long input graphs and long generations.
8 Conclusion
In this work, we have presented a new metric for measuring factual faithfulness in G2T, FactS potter. This metric can be trained in a self supervised fashion, using the same annotations as a G2T model. We have shown it achieves the highest correlation with humans on factual faithfulness and it can be used as a plug-in feature for G2T models. Finally, we have used our metric to analyze the difficulty of existing datasets. We have observed that models perform very well on these datasets, hence new datasets should be proposed. Such datasets could cover more difficult input graphs, for example triple from tables. In addition, through the initiative of governments to release tabular data related to public interest5, tools trained to express in natural language the content of tables could be used as user friendly interfaces for citizens.
Acknowledgment. This work was performed using HPC resources from GENCI-IDRIS (Grant 2023-AD011014244). The authors were partially funded by the ANR-20-CHIA-0015 project.
9 Limitations
Our work has the following limitations:
• FactSpotter cannot be used to determine the precise nature of the error in a generated sentence. It was trained to predict whether a fact is present in the text, not whether the sentence has a wrong predicate, subject, or object. This problem could be addressed by a second classification step that detects whether predicates or entities are incorrectly verbalized in the text, which would make FactSpotter more interpretable.
• The input of FactSpotter is the concatenation of a fact $f$, represented as a triple, and a natural language text $T$; i.e., its input format is limited. The advantage of this format is that it makes it easy to construct both positive and negative samples for self-supervised training (see the sketch after this list). However, it is difficult to use FactSpotter to check factual faithfulness on other text generation tasks, since high-quality structured knowledge graphs are hard to obtain. In the future, we will investigate the use of open information extraction models (Upadhyay et al., 2023) for extracting facts from sentences.
• Although the accuracy and F1 score of our classification model on the test splits of the various datasets in Table 1 are high, some false positive and false negative samples remain. Hence, FactSpotter generally reflects factual faithfulness, but it might still be biased on some hard samples, especially when predicates in knowledge graphs are distant from their natural language verbalizations in the vector space of language models.
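As an illustration of the second point above, positive pairs can be read directly from the dataset, while negatives can be obtained by corrupting one element of the triple. The sketch below shows one such construction; the 50/50 corruption split and in-dataset sampling are assumptions, not necessarily the scheme used in our implementation:

```python
import random

def make_training_pairs(examples, rng=random.Random(0)):
    """Build (fact, text, label) pairs for self-supervised training.

    examples: list of ((subject, predicate, object), reference_text).
    Negatives replace the predicate or the subject with one sampled
    from another example; the corruption scheme is illustrative, and
    in practice accidental matches with the original should be skipped.
    """
    predicates = [t[1] for t, _ in examples]
    entities = [t[0] for t, _ in examples] + [t[2] for t, _ in examples]
    pairs = []
    for (subj, pred, obj), text in examples:
        pairs.append(((subj, pred, obj), text, 1))        # positive: gold fact
        if rng.random() < 0.5:
            negative = (subj, rng.choice(predicates), obj)  # corrupt predicate
        else:
            negative = (rng.choice(entities), pred, obj)    # corrupt subject
        pairs.append((negative, text, 0))                 # negative: corrupted fact
    return pairs
```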
References

Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. DBpedia: A nucleus for a web of open data. In The Semantic Web: 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007. Proceedings, pages 722–735. Springer.

Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.

Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In 11th Conference of the European Chapter of the Association for Computational Linguistics, pages 313–320, Trento, Italy. Association for Computational Linguistics.

Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: A collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pages 1247–1250, New York, NY, USA. Association for Computing Machinery.

Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075.

Thiago Castro Ferreira, Claire Gardent, Nikolai Ilinykh, Chris van der Lee, Simon Mille, Diego Moussallem, and Anastasia Shimorina. 2020. The 2020 bilingual, bi-directional WebNLG+ shared task: Overview and evaluation results (WebNLG+ 2020). In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), pages 55–76, Dublin, Ireland (Virtual). Association for Computational Linguistics.

Asli Celikyilmaz, Elizabeth Clark, and Jianfeng Gao. 2020. Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799.

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR.

Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555.

Pierre Jean A. Colombo, Chloé Clavel, and Pablo Piantanida. 2022. InfoLM: A new metric to evaluate summarization & data2text generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 10554–10562.

Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, Ming-Wei Chang, Dipanjan Das, and William Cohen. 2019. Handling divergent reference texts when evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4884–4895, Florence, Italy. Association for Computational Linguistics.

Ondrej Dusek, David M. Howcroft, and Verena Rieser. 2019. Semantic noise matters for neural natural language generation. In Proceedings of the 12th International Conference on Natural Language Generation, pages 421–426, Tokyo, Japan. Association for Computational Linguistics.

James D. Evans. 1996. Straightforward statistics for the behavioral sciences. Thomson Brooks/Cole Publishing Co.

Juliette Faille, Albert Gatt, and Claire Gardent. 2021. Entity-based semantic adequacy for data-to-text generation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1530–1540, Punta Cana, Dominican Republic. Association for Computational Linguistics.

Angela Fan, Claire Gardent, Chloé Braud, and Antoine Bordes. 2019. Using local knowledge graph construction to scale Seq2Seq models to multi-document inputs. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4186–4196, Hong Kong, China. Association for Computational Linguistics.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for NLG micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188, Vancouver, Canada. Association for Computational Linguistics.

Yu Gu, Sue Kase, Michelle Vanni, Brian Sadler, Percy Liang, Xifeng Yan, and Yu Su. 2021. Beyond i.i.d.: Three levels of generalization for question answering on knowledge bases. In Proceedings of the Web Conference 2021, WWW '21, pages 3477–3488, New York, NY, USA. Association for Computing Machinery.

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).