[论文翻译]注意力引导的生成对抗网络在无监督图像到图像转换中的应用


原文地址:https://arxiv.org/pdf/1903.12296v3


Attention-Guided Generative Adversarial Networks for Unsupervised Image-to-Image Translation

注意力引导的生成对抗网络在无监督图像到图像转换中的应用

Abstract—The state-of-the-art approaches in Generative Adversarial Networks (GANs) are able to learn a mapping function from one image domain to another with unpaired image data. However, these methods often produce artifacts and are only able to convert low-level information, failing to transfer the high-level semantic parts of images. The main reason is that generators lack the ability to detect the most discriminative semantic parts of images, which thus results in generated images of low quality. To handle this limitation, in this paper we propose a novel Attention-Guided Generative Adversarial Network (AGGAN), which can detect the most discriminative semantic object and minimize changes to the unwanted parts for semantic manipulation problems without using extra data and models. The attention-guided generators in AGGAN are able to produce attention masks via a built-in attention mechanism, and then fuse the input image with the attention mask to obtain a high-quality target image. Moreover, we propose a novel attention-guided discriminator which only considers attended regions. The proposed AGGAN is trained in an end-to-end fashion with an adversarial loss, cycle-consistency loss, pixel loss and attention loss. Both qualitative and quantitative results demonstrate that our approach generates sharper and more accurate images than existing models. The code is available at https://github.com/Ha0Tang/AttentionGAN.

摘要—生成对抗网络(GAN)的先进方法能够利用不成对的图像数据学习从一个图像域到另一个图像域的映射函数。然而,这些方法常会产生伪影,且仅能转换低层次信息,无法迁移图像的高层语义部分。主要原因在于生成器缺乏检测图像最具判别性语义区域的能力,导致生成图像质量低下。为解决这一局限,本文提出新型注意力引导生成对抗网络(AGGAN),可在不使用额外数据和模型的情况下,检测最具判别性的语义对象并最小化非目标区域的改变。AGGAN中的注意力引导生成器通过内置注意力机制生成注意力掩码,随后将输入图像与注意力掩码融合以获取高质量目标图像。此外,我们提出仅关注注意力区域的注意力引导判别器。所提出的AGGAN采用端到端训练方式,结合对抗损失、循环一致性损失、像素损失和注意力损失。定性与定量结果表明,相较于现有模型,本方法能生成更清晰准确的图像。代码详见https://github.com/Ha0Tang/Attention_GAN

Index Terms—GANs, Image-to-Image Translation, Attention

索引术语—GANs, 图像到图像转换, 注意力机制

I. INTRODUCTION

I. 引言

Recently, Generative Adversarial Networks (GANs) [8] have received considerable attention across many communities, e.g., computer vision, natural language processing, audio and video processing. GANs are generative models that are particularly designed for image generation tasks. Recent works in computer vision, image processing and computer graphics have produced powerful translation systems in supervised settings such as Pix2pix [11], where image pairs are required. However, paired training data are usually difficult and expensive to obtain. In particular, the input-output pairs for image tasks such as artistic stylization can be even more difficult to acquire since the desired output is quite complex, typically requiring artistic authoring. To tackle this problem, CycleGAN [47], DualGAN [43] and DiscoGAN [13] provide an insight, in which the models can learn the mapping from one image domain to another with unpaired image data.

近来,生成对抗网络 (Generative Adversarial Networks, GANs) [8] 在计算机视觉、自然语言处理、音视频处理等多个领域获得了广泛关注。GANs 是专为图像生成任务设计的生成模型。近期在计算机视觉、图像处理与计算机图形学领域的研究已开发出强大的监督式转换系统(如 Pix2pix [11]),但这类系统需要成对的图像数据。然而配对训练数据通常难以获取且成本高昂,尤其是艺术风格化等图像任务所需的输入-输出配对数据更难获取——这类任务的理想输出通常复杂度高,需经过艺术化创作。针对该问题,CycleGAN [47]、DualGAN [43] 和 DiscoGAN [13] 提出了一种创新思路:模型可通过非配对图像数据学习不同图像域之间的映射关系。

Despite these efforts, image-to-image translation, e.g., converting a neutral expression to a happy expression, remains a challenging problem due to the fact that facial expression changes are non-linear, unaligned and vary conditioned on the appearance of the face. Moreover, most previous models change unwanted objects during the translation stage and can also be easily affected by background changes. In order to address these limitations, Liang et al. propose ContrastGAN [18], which uses object mask annotations from each dataset. ContrastGAN first crops a part of the image according to the mask, then translates it, and finally pastes it back. Promising results have been obtained; however, it is hard to collect training data with object masks. More importantly, one has to assume that the object shape does not change after applying the semantic modification. Another option is to train an extra model to detect the object masks and fit them into the generated image patches [6], [12]. In this case, we need to increase the number of parameters of the network, which consequently increases the training complexity both in time and space.

尽管已有这些努力,图像到图像的转换(例如将中性表情转换为快乐表情)仍然是一个具有挑战性的问题,因为面部表情的变化是非线性、未对齐的,并且会因面部外观的不同而变化。此外,大多数先前模型在转换阶段会改变不需要的对象,并且容易受到背景变化的影响。为了解决这些限制,Liang等人提出了ContrastGAN [18],该方法利用了每个数据集中的对象掩码标注。在ContrastGAN中,首先根据掩码裁剪图像的一部分,然后进行转换,最后将其粘贴回原处。虽然取得了不错的结果,但收集带有对象掩码的训练数据非常困难。更重要的是,我们必须假设在应用语义修改后对象的形状不应改变。另一种选择是训练一个额外的模型来检测对象掩码并将其适配到生成的图像块中 [6][12]。这种情况下,我们需要增加网络的参数数量,从而在时间和空间上增加了训练的复杂性。

To overcome the aforementioned issues, in this paper we propose a novel Attention-Guided Generative Adversarial Network (AGGAN) for the image translation problem without using extra data and models. The proposed AGGAN comprises two generators and two discriminators, which is similar to CycleGAN [47]. Fig. 1 illustrates the differences between previous representative works and the proposed AGGAN. The two attention-guided generators in the proposed AGGAN have built-in attention modules, which can disentangle the discriminative semantic object from the unwanted parts by producing an attention mask and a content mask. Then we fuse the input image with new patches produced through the attention mask to obtain high-quality results. We also constrain the generators with pixel-wise and cycle-consistency loss functions, which force the generators to reduce changes. Moreover, we propose two novel attention-guided discriminators which aim to consider only the attended regions. The proposed AGGAN is trained in an end-to-end fashion, and can produce the attention mask, content mask and targeted image at the same time. Experimental results on four publicly available datasets demonstrate that the proposed AGGAN is able to produce higher-quality images compared with state-of-the-art methods.

为克服上述问题,本文提出一种新颖的注意力引导生成对抗网络(AGGAN)来解决图像翻译问题,无需额外数据与模型。该AGGAN由两个生成器和两个判别器组成,结构类似CycleGAN [47]。图1展示了先前代表性工作与所提AGGAN的差异。AGGAN中两个注意力引导生成器内置注意力模块,可通过生成注意力掩膜和内容掩膜来分离判别性语义对象与非目标区域。随后将输入图像与通过注意力掩膜生成的新图像块融合以获得高质量结果。我们采用像素级和循环一致性损失函数约束生成器,迫使生成器减少变化。此外,我们提出两种新型注意力引导判别器,仅聚焦于注意力区域。所提AGGAN采用端到端训练方式,可同步生成注意力掩膜、内容掩膜和目标图像。在四个公开数据集上的实验表明,相比现有最优方法,AGGAN能生成更高质量的图像。

The contributions of this paper are summarized as follows:
• We propose a novel Attention-Guided Generative Adversarial Network (AGGAN) for unsupervised image-to-image translation.
• We propose a novel generator architecture with a built-in attention mechanism, which can detect the most discriminative semantic part of images in different domains.

本文的贡献总结如下:

  • 我们提出了一种新颖的注意力引导生成对抗网络(AGGAN),用于无监督图像到图像转换。
  • 我们提出了一种内置注意力机制的新型生成器架构,能够检测不同域中图像最具判别力的语义部分。


Fig. 1: Comparison of previous frameworks, e.g., CycleGAN [47], DualGAN [43] and DiscoGAN [13] (Left), and the proposed AGGAN (Right). The contribution of AGGAN is that the proposed generators can produce the attention masks ($M_{x}$ and $M_{y}$) via the built-in attention module; the produced attention mask and content mask are then mixed with the input image to obtain the targeted image. Moreover, we also propose two attention-guided discriminators, $D_{XA}$ and $D_{YA}$, which aim to consider only the attended regions. Finally, for better optimizing the proposed AGGAN, we employ pixel loss, cycle-consistency loss and attention loss.

图 1: 现有框架(如CycleGAN [47]、DualGAN [43]和DiscoGAN [13])(左)与提出的AGGAN(右)对比。AGGAN的创新在于:所提出的生成器可通过内置注意力模块生成注意力掩码$M_{x}$和$M_{y}$,随后将生成的注意力掩码与内容掩码混合输入图像以获得目标图像。此外,我们还提出了两个注意力引导判别器$D_{X A}$、$D_{Y A}$,其仅关注注意力区域。最后,为更好地优化AGGAN,我们采用了像素损失、循环一致性损失和注意力损失。

• We propose a novel attention-guided discriminator which only considers the attended regions. Moreover, the proposed attention-guided generator and discriminator can easily be applied to other GAN models.
• Extensive results demonstrate that the proposed AGGAN can generate sharper faces with clearer details and more realistic expressions compared with baseline models.

  • 我们提出了一种新型注意力引导判别器,仅关注被注意的区域。此外,所提出的注意力引导生成器和判别器可轻松应用于其他GAN模型。
  • 大量实验结果表明,与基线模型相比,所提出的AGGAN能生成轮廓更清晰、细节更分明、表情更逼真的人脸图像。

II. RELATED WORK

II. 相关工作

Generative Adversarial Networks (GANs) [8] are powerful generative models, which have achieved impressive results on different computer vision tasks, e.g., image generation [5], [27], [9], image editing [34], [35] and image inpainting [17], [10]. In order to generate meaningful images that meet user requirements, the Conditional GAN (CGAN) [26] was proposed, where the conditioning information is employed to guide the image generation process. The conditioning information can be discrete labels [28], text [22], [31], object keypoints [32], human skeletons [39] or reference images [11]. CGANs using a reference image as conditional information have tackled a lot of problems, e.g., text-to-image translation [22], image-to-image translation [11] and video-to-video translation [42].

生成对抗网络 (GANs) [8] 是强大的生成模型,在不同计算机视觉任务中取得了令人瞩目的成果,例如图像生成 [5]、[27]、[9],图像编辑 [34]、[35] 以及图像修复 [17]、[10]。为了生成符合用户需求的有意义图像,研究者提出了条件生成对抗网络 (CGAN) [26],通过引入条件信息来指导图像生成过程。这些条件信息可以是离散标签 [28]、文本 [22]、[31]、物体关键点 [32]、人体骨骼 [39] 或参考图像 [11]。以参考图像作为条件信息的 CGAN 已成功解决诸多问题,例如文本到图像转换 [22]、图像到图像转换 [11] 以及视频到视频转换 [42]。

Image-to-Image Translation models learn a translation function using CNNs. Pix2pix [11] is a conditional framework using a CGAN to learn a mapping function from input to output images. Similar ideas have also been applied to many other tasks, such as generating photographs from sketches [33] or vice versa [38]. However, most tasks in the real world suffer from the constraint of having few or no paired input-output samples available. To overcome this limitation, the unpaired image-to-image translation task has been proposed. Different from the prior works, the unpaired image translation task tries to learn the mapping function without the requirement of paired training data. Specifically, CycleGAN [47] learns the mappings between two image domains (i.e., from a source domain $X$ to a target domain $Y$) instead of between paired images. Apart from CycleGAN, many other GAN variants have been proposed to tackle the cross-domain problem. For example, to learn a common representation across domains, CoupledGAN [19] uses a weight-sharing strategy. The work of [37] utilizes certain shared content features between input and output even though they may differ in style. Kim et al. [13] propose a method based on GANs that learns to discover relations between different domains. A model which can learn object transfiguration from two unpaired sets of images is presented in [46]. Tang et al. [41] propose $\mathbf{G}^{2}\mathbf{GAN}$, which is a robust and scalable approach allowing unpaired image-to-image translation among multiple domains. However, those models can be easily affected by unwanted content and cannot focus on the most discriminative semantic parts of images during the translation stage.

图像到图像翻译模型利用卷积神经网络(CNN)学习转换函数。Pix2pix[11]是采用条件生成对抗网络(CGAN)的框架,用于学习从输入图像到输出图像的映射函数。类似思想也被应用于许多其他任务,例如从草图生成照片[33]或逆向操作[38]。然而现实世界中大多数任务都面临配对输入-输出样本稀缺或缺失的约束。为突破这一限制,研究者提出了非配对图像翻译任务。

与先前工作不同,非配对图像翻译任务旨在无需配对训练数据的情况下学习映射函数。具体而言,CycleGAN[47]学习的是两个图像域(即源域$X$到目标域$Y$)之间的映射,而非配对图像间的映射。除CycleGAN外,还有许多其他GAN变体被提出以解决跨域问题。例如CoupledGAN[19]采用权重共享策略来学习跨域通用表示;文献[37]利用输入输出间特定的共享内容特征(尽管风格可能不同);Kim等人[13]提出基于GAN的跨域关系发现方法;文献[46]展示了从两组非配对图像学习物体变形的模型;Tang等人[41]提出的$\mathbf{G}^{2}\mathbf{GAN}$是支持多域非配对图像翻译的鲁棒可扩展方案。但这些模型易受无关内容干扰,在翻译阶段难以聚焦最具判别性的图像语义部分。

Attention-Guided Image-to-Image Translation. In order to fix the aforementioned limitations, Liang et al. propose ContrastGAN [18], which uses the object mask annotations from each dataset as extra input data. In this method, one has to assume that after applying semantic changes the object shape does not change. Another method is to train a separate segmentation or attention model and fit it into the system. For instance, Mejjati et al. [25] propose an attention mechanism that is jointly trained with the generators and discriminators. Chen et al. propose AttentionGAN [6], which uses an extra attention network to generate attention maps, so that major attention can be paid to objects of interest. Kastaniotis et al. [12] present ATAGAN, which uses a teacher network to produce attention maps. Zhang et al. [45] propose the Self-Attention Generative Adversarial Network (SAGAN) for the image generation task. Qian et al. [30] employ a recurrent network to generate visual attention first and then transform a raindrop-degraded image into a clean one. Tang et al. [40] propose a novel Multi-Channel Attention Selection GAN for the challenging cross-view image translation task. Sun et al. [36] generate a facial mask by using an FCN [21] for face attribute manipulation.

注意力引导的图像到图像翻译。为克服上述局限性,Liang等人提出对比生成对抗网络(Contrast GAN)[18],该方法将各数据集中的物体掩码标注作为额外输入数据。此方法需假设语义变化后物体形状保持不变。另一种方案是训练独立的分割或注意力模型并集成至系统,例如Mejjati等人[25]提出与生成器和判别器联合训练的注意力机制。Chen团队开发的注意力生成对抗网络(Attention GAN)[6]通过额外注意力网络生成注意力图,使系统能聚焦关键物体。Kastaniotis等人[12]提出的ATAGAN采用教师网络生成注意力图。Zhang等人[45]为图像生成任务设计了自注意力生成对抗网络(SAGAN)。Qian等人[30]先通过循环网络生成视觉注意力,再将雨滴退化图像复原。Tang等人[40]针对跨视角图像翻译这一挑战性任务,提出多通道注意力选择生成对抗网络。Sun等人[36]则利用全卷积网络(FCN)[21]生成面部掩码以实现人脸属性编辑。

All these aforementioned methods employ extra networks or data to obtain attention masks, which increases the number of parameters, the training time and the storage space of the whole system. In this work, we propose the Attention-Guided Generative Adversarial Network (AGGAN), which can produce attention masks by the generators themselves. For this purpose, we embed an attention method into the vanilla generator, which means that we do not need any extra models to obtain the attention masks of the objects of interest.

上述所有方法都采用额外网络或数据来获取注意力掩码,这增加了整个系统的参数量、训练时间和存储空间。本文提出注意力引导生成对抗网络(AGGAN),其生成器可直接生成注意力掩码。为此,我们在基础生成器中嵌入注意力机制,这意味着无需任何额外模型即可获取目标物体的注意力掩码。


Fig. 2: The framework of the proposed AGGAN. Because of the space limitation, we only show one mapping in this figure, i.e., $x{\rightarrow}[M_{y},R_{y},G_{y}]{\rightarrow}C_{x}{\approx}x$. We also have the other mapping, i.e., $y\rightarrow[M_{x},R_{x},G_{x}]\rightarrow C_{y}\approx y$. The attention-guided generators have a built-in attention mechanism, which can detect the most discriminative part of images. After that, we mix the input image, the content mask and the attention mask to synthesize the targeted image. Moreover, to distinguish only the most discriminative content, we also propose an attention-guided discriminator $D_{YA}$. Note that our system does not require supervision, i.e., no pairs of images of the same person with different expressions.

图 2: 提出的 AGGAN 框架。由于篇幅限制,本图仅展示一个映射关系,即 $x{\rightarrow}[M_{y},R_{y},G_{y}]{\rightarrow}C_{x}{\approx}x$ 。另一个映射关系为 $y\rightarrow[M_{x},R_{x},G_{x}]\rightarrow C_{y}\approx y$ 。注意力引导生成器内置注意力机制,可检测图像中最具判别力的区域。随后我们将输入图像、内容掩码和注意力掩码混合以合成目标图像。此外,为仅区分最具判别力的内容,我们还提出了注意力引导判别器 $D_{Y A}$ 。需注意本系统无需监督信号,即不需要同一人物不同表情的成对图像。
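The caption does not spell out how $D_{YA}$ restricts itself to the attended regions; one simple reading, consistent with "distinguishing only the most discriminative content", is to mask the discriminator input with the attention mask before scoring it. The sketch below assumes this masking interpretation and a generic PatchGAN-style backbone; it is a hypothetical illustration, not the paper's exact design.

```python
import torch.nn as nn

class AttentionGuidedDiscriminator(nn.Module):
    """Hypothetical attention-guided discriminator: scores only attended regions."""

    def __init__(self, img_channels=3):
        super().__init__()
        # A small PatchGAN-style stack of strided convolutions.
        self.net = nn.Sequential(
            nn.Conv2d(img_channels, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, kernel_size=4, padding=1),  # patch-wise real/fake scores
        )

    def forward(self, image, attention_mask):
        # Suppress non-attended pixels so the score depends only on attended content.
        return self.net(image * attention_mask)
```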

III. METHOD

III. 方法

We first start with the attention-guided generator and discriminator of the proposed Attention-Guided Generative Adversarial Network (AGGAN), and then introduce the loss functions used for better optimization of the model. Finally, we present the implementation details of the whole model, including the network architecture and the training procedure.

我们首先从所提出的注意力引导生成对抗网络(AGGAN)的注意力引导生成器和判别器开始,然后介绍用于更好优化模型的损失函数。最后展示整个模型的实现细节,包括网络架构和训练流程。
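As a brief orientation before the individual terms are introduced, the overall training objective combines the four losses named in the abstract and in Fig. 1; a hedged sketch (the trade-off weights $\lambda_{cyc}$, $\lambda_{pixel}$ and $\lambda_{att}$ are placeholders, not values taken from the paper) is

$$\mathcal{L}_{AGGAN}=\mathcal{L}_{GAN}+\lambda_{cyc}\mathcal{L}_{cycle}+\lambda_{pixel}\mathcal{L}_{pixel}+\lambda_{att}\mathcal{L}_{attention}.$$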

A. Attention-Guided Generator

A. 注意力引导生成器

GANs [8] are composed of two competing modules, i.e., the generator $G_{X\rightarrow Y}$ and the discriminator $D_{Y}$ (where $X$ and $Y$ denote two different image domains), which are iteratively trained against each other in a two-player minimax manner. More formally, let $x_{i}\in X$ and $y_{j}\in Y$ denote the training images in the source and target image domains, respectively (for simplicity, we usually omit the subscripts $i$ and $j$). Most current image translation models, e.g., CycleGAN [47] and DualGAN [43], include two mappings $G_{X\rightarrow Y}{:}x{\rightarrow}G_{y}$ and $G_{Y\rightarrow X}{:}y\rightarrow G_{x}$, and two corresponding adversarial discriminators $D_{X}$ and $D_{Y}$. The generator $G_{X\rightarrow Y}$ maps $x$ from the source domain to the generated image $G_{y}$ in the target domain $Y$ and tries to fool the discriminator $D_{Y}$, whilst $D_{Y}$ focuses on improving itself in order to tell whether a sample is a generated sample or a real data sample. The same holds for $G_{Y\rightarrow X}$ and $D_{X}$.

GANs [8] 由两个相互竞争的模块组成,即生成器 $G_{X\rightarrow Y}$ 和判别器 $D_{Y}$ (其中 $X$ 和 $Y$ 表示两个不同的图像域),它们以双人极小极大的方式迭代训练并相互对抗。更正式地说,令 $x_{i}\in X$ 和 $y_{j}\in Y$ 分别表示源图像域和目标图像域中的训练图像 (为简化起见,通常省略下标 $i$ 和 $j$)。对于当前大多数图像转换模型 (如 CycleGAN [47] 和 DualGAN [43]),它们包含两个映射 $G_{X\rightarrow Y}{:}x{\rightarrow}G_{y}$ 和 $G_{Y\rightarrow X}{:}y\rightarrow G_{x}$,以及两个对应的对抗判别器 $D_{X}$ 和 $D_{Y}$。生成器 $G_{X\rightarrow Y}$ 将源域的 $x$ 映射到目标域 $Y$ 的生成图像 $G_{y}$,并试图欺骗判别器 $D_{Y}$,而 $D_{Y}$ 则专注于提升自身能力以判断样本是生成样本还是真实数据样本。$G_{Y\rightarrow X}$ 和 $D_{X}$ 的情况与之类似。
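For reference, the adversarial objective behind this two-player game can be written in the standard cross-entropy form used by CycleGAN-style models; the exact variant used in AGGAN (e.g., a least-squares formulation) may differ, so the following is a sketch:

$$\mathcal{L}_{GAN}(G_{X\rightarrow Y},D_{Y})=\mathbb{E}_{y\sim Y}[\log D_{Y}(y)]+\mathbb{E}_{x\sim X}[\log(1-D_{Y}(G_{X\rightarrow Y}(x)))],$$

where $G_{X\rightarrow Y}$ tries to minimize this objective while $D_{Y}$ tries to maximize it; the symmetric term for $G_{Y\rightarrow X}$ and $D_{X}$ is defined analogously.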

For the proposed AGGAN, we instead intend to learn two mappings between domains $X$ and $Y$ via two generators with built-in attention mechanisms, i.e., $G_{X\rightarrow Y}{:}x{\rightarrow}[M_{y},R_{y},G_{y}]$ and $G_{Y\rightarrow X}{:}y\rightarrow[M_{x},R_{x},G_{x}]$, where $M_{x}$ and $M_{y}$ are the attention masks of images $x$ and $y$, respectively; $R_{x}$ and $R_{y}$ are the content masks of images $x$ and $y$, respectively; and $G_{x}$ and $G_{y}$ are the generated images. The attention masks $M_{x}$ and $M_{y}$ define a per-pixel intensity specifying to what extent each pixel of the content masks $R_{x}$ and $R_{y}$ will contribute to the final rendered image. In this way, the generator does not need to render static elements, and can focus exclusively on the pixels defining the facial movements, leading to sharper and more realistic synthetic images. After that, we fuse the input image $x$ with the generated attention mask $M_{y}$ and the content mask $R_{y}$ to obtain the targeted image $G_{y}$. Through this, we can disentangle the most discriminative semantic object from the unwanted parts of the image. In Fig. 2, the attention-guided generators focus only on those regions of the image that are responsible for generating the novel expression, such as the eyes and mouth, and keep the rest of the image, such as hair, glasses and clothes, untouched. A higher intensity in the attention mask means a larger contribution to changing the expression.

而对于提出的AGGAN,我们旨在通过两个内置注意力机制的生成器学习域$X$和$Y$之间的映射关系,即$G_{X\rightarrow Y};x{\rightarrow}[M_{y},R_{y},G_{y}]$和$G_{Y\rightarrow X};y\rightarrow[M_{x},R_{x},G_{x}]$。其中$M_{x}$和$M_{y}$分别是图像$x$和$y$的注意力掩码;$R_{x}$和$R_{y}$是图像$x$和$y$的内容掩码;$G_{x}$和$G_{y}$是生成的图像。注意力掩码$M_{x}$和$M_{y}$定义了每个像素的强度,指定内容掩码$R_{x}$和$R_{y}$的每个像素对最终渲染图像的贡献程度。通过这种方式,生成器无需渲染静态元素,可以专注于定义面部运动的像素,从而生成更清晰、更逼真的合成图像。之后,我们融合输入图像$x$、生成的注意力掩码$M_{y}$和内容掩码$R_{y}$,得到目标图像$G_{y}$。通过这种方式,我们可以分离出最具判别性的语义对象和图像中不需要的部分。在图2中,注意力引导的生成器仅聚焦于负责生成新表情的图像区域(如眼睛和嘴巴),而保持图像的其余部分(如头发、眼镜、衣服)不变。注意力掩码中强度越高,表示对改变表情的贡献越大。
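To make the fusion step concrete, the following is a minimal PyTorch-style sketch of how a built-in attention head could produce $R_{y}$ and $M_{y}$ from the generator's last feature map and blend them with the input image. The layer sizes, names (e.g., `AttentionHead`, `feat_channels`) and the blending rule $G_{y}=R_{y}\odot M_{y}+x\odot(1-M_{y})$ are illustrative assumptions consistent with the description above, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class AttentionHead(nn.Module):
    """Hypothetical built-in attention head: renders a content mask R_y and an
    attention mask M_y from generator features, then blends them with the input x."""

    def __init__(self, feat_channels=64, img_channels=3):
        super().__init__()
        # One branch renders the content mask, the other predicts the attention mask.
        self.content = nn.Conv2d(feat_channels, img_channels, kernel_size=7, padding=3)
        self.attention = nn.Conv2d(feat_channels, 1, kernel_size=7, padding=3)

    def forward(self, features, x):
        R_y = torch.tanh(self.content(features))       # content mask in [-1, 1]
        M_y = torch.sigmoid(self.attention(features))  # per-pixel weights in [0, 1]
        # Attended pixels are taken from the content mask; the rest copies the input.
        G_y = R_y * M_y + x * (1.0 - M_y)
        return G_y, M_y, R_y
```

Under this reading, a pixel with $M_{y}$ close to 1 is rendered entirely by the generator, while a pixel with $M_{y}$ close to 0 is copied unchanged from the input, which matches the behavior described above for static regions such as hair and clothes.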

To focus on the discriminative semantic parts in two different domains, we specifically design two generators with built-in attention mechanisms. By using this mecha