[Paper Translation] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning


Original paper: http://arxiv.org/abs/1602.07261


> Collection of Chinese-English paper translations: https://aiqianji.com/blog/articles

Authors

Christian Szegedy
Google Inc.
1600 Amphitheatre Pkwy, Mountain View, CA
Sergey Ioffe
Vincent Vanhoucke
Alex Alemi

Abstract

Very deep convolutional networks have been central to the largest advances in image recognition performance in recent years. One example is the Inception architecture, which has been shown to achieve very good performance at relatively low computational cost. Recently, the introduction of residual connections in conjunction with a more traditional architecture has yielded state-of-the-art performance in the 2015 ILSVRC challenge; its performance was similar to the latest generation Inception-v3 network. This raises the question of whether there is any benefit in combining the Inception architecture with residual connections.
Here we give clear empirical evidence that training with residual connections accelerates the training of Inception networks significantly. There is also some evidence of residual Inception networks outperforming similarly expensive Inception networks without residual connections by a thin margin.
We also present several new streamlined architectures for both residual and non-residual Inception networks. These variations improve the single-frame recognition performance on the ILSVRC 2012 classification task significantly.
We further demonstrate how proper activation scaling stabilizes the training of very wide residual Inception networks. With an ensemble of three residual networks and one Inception-v4, we achieve 3.08% top-5 error on the test set of the ImageNet classification (CLS) challenge.

1. Introduction

Since the 2012 ImageNet competition [11] winning entry by Krizhevsky et al [8], their network “AlexNet” has been successfully applied to a larger variety of computer vision tasks, for example to object-detection [4], segmentation [10], human pose estimation [17], video classification [7], object tracking [18], and superresolution [3]. These examples are but a few of all the applications to which deep convolutional networks have been very successfully applied ever since.

In this work we study the combination of the two most recent ideas: Residual connections introduced by He et al. in [5] and the latest revised version of the Inception architecture [15]. In [5], it is argued that residual connections are of inherent importance for training very deep architectures. Since Inception networks tend to be very deep, it is natural to replace the filter concatenation stage of the Inception architecture with residual connections. This would allow Inception to reap all the benefits of the residual approach while retaining its computational efficiency.

Besides a straightforward integration, we have also studied whether Inception itself can be made more efficient by making it deeper and wider. For that purpose, we designed a new version named Inception-v4 which has a more uniform simplified architecture and more inception modules than Inception-v3. Historically, Inception-v3 had inherited a lot of the baggage of the earlier incarnations. The technical constraints chiefly came from the need for partitioning the model for distributed training using DistBelief [2]. Now, after migrating our training setup to TensorFlow [1] these constraints have been lifted, which allowed us to simplify the architecture significantly. The details of that simplified architecture are described in Section 3.

In this report, we will compare the two pure Inception variants, Inception-v3 and v4, with similarly expensive hybrid Inception-ResNet versions. Admittedly, those models were picked in a somewhat ad hoc manner with the main constraint being that the parameters and computational complexity of the models should be somewhat similar to the cost of the non-residual models. In fact we have tested bigger and wider Inception-ResNet variants and they performed very similarly on the ImageNet classification challenge [11] dataset.

The last experiment reported here is an evaluation of an ensemble of all the best performing models presented here. As it was apparent that both Inception-v4 and Inception-ResNet-v2 performed similarly well, exceeding state-of-the-art single-frame performance on the ImageNet validation dataset, we wanted to see how a combination of those pushes the state of the art on this well studied dataset. Surprisingly, we found that gains on the single-frame performance do not translate into similarly large gains on ensembled performance. Nonetheless, it still allows us to report 3.1% top-5 error on the validation set with four models ensembled, setting a new state of the art, to the best of our knowledge.

In the last section, we study some of the classification failures and conclude that the ensemble still has not reached the label noise of the annotations on this dataset and there is still room for improvement for the predictions.

2. Related Work

Convolutional networks have become popular in large scale image recognition tasks after Krizhevsky et al. [8]. Some of the next important milestones were Network-in-network [9] by Lin et al., VGGNet [12] by Simonyan et al. and GoogLeNet (Inception-v1) [14] by Szegedy et al.

Residual connections were introduced by He et al. in [5], in which they give convincing theoretical and practical evidence for the advantages of utilizing additive merging of signals both for image recognition and especially for object detection. The authors argue that residual connections are inherently necessary for training very deep convolutional models. Our findings do not seem to support this view, at least for image recognition. However, it might require more measurement points with deeper architectures to understand the true extent of the beneficial aspects offered by residual connections. In the experimental section we demonstrate that it is not very difficult to train competitive very deep networks without utilizing residual connections. However, the use of residual connections seems to improve the training speed greatly, which alone is a great argument for their use.

Figure 1: Residual connections as introduced in He et al. [5].

Figure 2: Optimized version of ResNet connections by [5] to shield computation.
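To make Figures 1 and 2 concrete, here is a minimal sketch of both residual connection variants in TensorFlow/Keras. The filter counts and layer choices are illustrative assumptions, not values from [5]; both blocks assume the input already has `filters` channels so the addition is well-defined.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    """Plain residual connection (Figure 1): output = relu(x + F(x))."""
    y = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, 3, padding="same", activation=None)(y)
    return layers.Activation("relu")(layers.Add()([x, y]))

def bottleneck_residual_block(x, filters, bottleneck):
    """Optimized variant (Figure 2): a 1x1 convolution shrinks the depth
    before the 3x3 convolution and a second 1x1 restores it, which cuts
    the computation of the inner convolution."""
    y = layers.Conv2D(bottleneck, 1, activation="relu")(x)
    y = layers.Conv2D(bottleneck, 3, padding="same", activation="relu")(y)
    y = layers.Conv2D(filters, 1, activation=None)(y)
    return layers.Activation("relu")(layers.Add()([x, y]))
```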

The Inception deep convolutional architecture was introduced in [14] and was called GoogLeNet or Inception-v1 in our exposition. Later the Inception architecture was refined in various ways, first by the introduction of batch normalization [6] (Inception-v2) by Ioffe et al. Later the architecture was improved by additional factorization ideas in the third iteration [15] which will be referred to as Inception-v3 in this report.

3. Architectural Choices

3.1 Pure Inception Blocks

Our older Inception models used to be trained in a partitioned manner, where each replica was partitioned into multiple sub-networks in order to be able to fit the whole model in memory. However, the Inception architecture is highly tunable, meaning that there are a lot of possible changes to the number of filters in the various layers that do not affect the quality of the fully trained network. In order to optimize the training speed, we used to tune the layer sizes carefully in order to balance the computation between the various model sub-networks. In contrast, with the introduction of TensorFlow our most recent models can be trained without partitioning the replicas. This is enabled in part by recent optimizations of the memory used by backpropagation, achieved by carefully considering what tensors are needed for gradient computation and structuring the computation to reduce the number of such tensors.

Historically, we have been relatively conservative about changing the architectural choices and restricted our experiments to varying isolated network components while keeping the rest of the network stable. Not simplifying earlier choices resulted in networks that looked more complicated than they needed to be. In our newer experiments, for Inception-v4 we decided to shed this unnecessary baggage and made uniform choices for the Inception blocks for each grid size. Please refer to Figure 9 for the large scale structure of the Inception-v4 network and Figures 3, 4, 5, 6, 7 and 8 for the detailed structure of its components. All the convolutions not marked with “V” in the figures are same-padded, meaning that their output grid matches the size of their input. Convolutions marked with “V” are valid-padded, meaning that the input patch of each unit is fully contained in the previous layer and the grid size of the output activation map is reduced accordingly.
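To make the padding convention concrete, a small sketch (the 35×35×192 input shape is only an example):

```python
import tensorflow as tf

x = tf.random.normal([1, 35, 35, 192])  # [batch, height, width, channels]

same_conv = tf.keras.layers.Conv2D(192, 3, padding="same")    # unmarked in the figures
valid_conv = tf.keras.layers.Conv2D(192, 3, padding="valid")  # marked with "V"

print(same_conv(x).shape)   # (1, 35, 35, 192): output grid matches the input
print(valid_conv(x).shape)  # (1, 33, 33, 192): grid shrinks by kernel_size - 1
```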

3.2 Residual Inception Blocks

For the residual versions of the Inception networks, we use cheaper Inception blocks than the original Inception. Each Inception block is followed by a filter-expansion layer (1×1 convolution without activation) which is used for scaling up the dimensionality of the filter bank before the addition, to match the depth of the input. This is needed to compensate for the dimensionality reduction induced by the Inception block.

We tried several versions of the residual version of Inception. Only two of them are detailed here. The first one, “Inception-ResNet-v1”, roughly matches the computational cost of Inception-v3, while “Inception-ResNet-v2” matches the raw cost of the newly introduced Inception-v4 network. See Figure 15 for the large scale structure of both variants. (However, the step time of Inception-v4 proved to be significantly slower in practice, probably due to the larger number of layers.)

Another small technical difference between our residual and non-residual Inception variants is that in the case of Inception-ResNet, we used batch-normalization only on top of the traditional layers, but not on top of the summations. It is reasonable to expect that a thorough use of batch-normalization should be advantageous, but we wanted to keep each model replica trainable on a single GPU. It turned out that the memory footprint of layers with large activation size was consuming a disproportionate amount of GPU memory. By omitting the batch-normalization on top of those layers, we were able to increase the overall number of Inception blocks substantially. We hope that with better utilization of computing resources, making this trade-off will become unnecessary.
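Putting the two paragraphs above together, here is a minimal sketch of an Inception-ResNet style block: batch-normalization on top of the “traditional” convolutional layers, a linear 1×1 filter-expansion restoring the input depth, and no batch-normalization on top of the summation. The branch widths are illustrative assumptions, not the paper's exact filter counts.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn(x, filters, kernel_size):
    """A "traditional" layer: convolution with batch normalization on top."""
    x = layers.Conv2D(filters, kernel_size, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.Activation("relu")(x)

def inception_resnet_block(x):
    in_depth = x.shape[-1]
    # A cheap Inception block: two small parallel branches.
    branch0 = conv_bn(x, 32, 1)
    branch1 = conv_bn(x, 32, 1)
    branch1 = conv_bn(branch1, 32, 3)
    mixed = layers.Concatenate()([branch0, branch1])
    # Filter expansion: 1x1 convolution without activation (and without
    # batch norm), scaling the filter bank back up to the input depth.
    expanded = layers.Conv2D(in_depth, 1, activation=None)(mixed)
    # Residual addition; no batch normalization on top of the summation.
    return layers.Activation("relu")(layers.Add()([x, expanded]))
```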

Figure 3: The schema for the stem of the pure Inception-v4 and Inception-ResNet-v2 networks. This is the input part of those networks. Cf. Figures 9 and 15.

Figure 4: The schema for the 35×35 grid modules of the pure Inception-v4 network. This is the Inception-A block of Figure 9.

Figure 5: The schema for the 17×17 grid modules of the pure Inception-v4 network. This is the Inception-B block of Figure 9.

Figure 6: The schema for the 8×8 grid modules of the pure Inception-v4 network. This is the Inception-C block of Figure 9.

Figure 7: The schema for the 35×35 to 17×17 reduction module. Different variants of this block (with various numbers of filters) are used in Figures 9 and 15, in each of the new Inception(-v4, -ResNet-v1, -ResNet-v2) variants presented in this paper. The k, l, m, n numbers represent filter bank sizes which can be looked up in Table 1.

Figure 8: The schema for the 17×17 to 8×8 grid-reduction module. This is the reduction module used by the pure Inception-v4 network in Figure 9.

Figure 9: The overall schema of the Inception-v4 network. Please refer to Figures 3, 4, 5, 6, 7 and 8 for the detailed structure of its components.

Figure 10: The schema for the 35×35 grid (Inception-ResNet-A) module of the Inception-ResNet-v1 network.

Figure 11: The schema for the 17×17 grid (Inception-ResNet-B) module of the Inception-ResNet-v1 network.

Figure 12: The “Reduction-B” 17×17 to 8×8 grid-reduction module. This module is used by the smaller Inception-ResNet-v1 network in Figure 15.

Figure 13: The schema for the 8×8 grid (Inception-ResNet-C) module of the Inception-ResNet-v1 network.

Figure 14: The stem of the Inception-ResNet-v1 network.

Figure 15: Schema for the Inception-ResNet-v1 and Inception-ResNet-v2 networks. This schema applies to both networks but the underlying components differ. Inception-ResNet-v1 uses the blocks as described in Figures 14, 10, 7, 11, 12 and 13. Inception-ResNet-v2 uses the blocks as described in Figures 3, 16, 7, 17, 18 and 19. The output sizes in the diagram refer to the activation vector tensor shapes of Inception-ResNet-v1.
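The large-scale schema of Figure 15 corresponds roughly to the following skeleton (a sketch: the block arguments stand for the per-variant modules from the figures above, and the dropout rate follows the keep-probability of 0.8 shown in the figure):

```python
import tensorflow as tf
from tensorflow.keras import layers

def inception_resnet(inputs, stem, block_a, reduction_a, block_b,
                     reduction_b, block_c, num_classes=1000):
    """Skeleton of Figure 15; v1 and v2 differ only in the block internals."""
    x = stem(inputs)
    for _ in range(5):        # 5 x Inception-ResNet-A (35x35 grid)
        x = block_a(x)
    x = reduction_a(x)        # 35x35 -> 17x17
    for _ in range(10):       # 10 x Inception-ResNet-B (17x17 grid)
        x = block_b(x)
    x = reduction_b(x)        # 17x17 -> 8x8
    for _ in range(5):        # 5 x Inception-ResNet-C (8x8 grid)
        x = block_c(x)
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.2)(x)  # drop rate 0.2 == keep probability 0.8
    return layers.Dense(num_classes, activation="softmax")(x)
```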

Figure 16: The schema for 35×35 grid (Inception-ResNet-A) module of the Inception-ResNet-v2 network.
Figure 17: The schema for the 17×17 grid (Inception-ResNet-B) module of the Inception-ResNet-v2 network.

Figure 18: The schema for the 17×17 to 8×8 grid-reduction module. This is the Reduction-B module used by the wider Inception-ResNet-v2 network in Figure 15.

Figure 19: The schema for the 8×8 grid (Inception-ResNet-C) module of the Inception-ResNet-v2 network.

| Network | k | l | m | n |
| --- | --- | --- | --- | --- |
| Inception-v4 | 192 | 224 | 256 | 384 |
| Inception-ResNet-v1 | 192 | 192 | 256 | 384 |
| Inception-ResNet-v2 | 256 | 256 | 384 | 384 |

Table 1: The number of filters of the Reduction-A module for the three Inception variants presented in this paper. The four numbers in each row parametrize the four convolutions of Figure 7.
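Combining Table 1 with the three-branch structure of Figure 7 (a stride-2 max-pooling, a stride-2 3×3 convolution with n filters, and a 1×1(k) → 3×3(l) → stride-2 3×3(m) chain), a sketch of the parameterized Reduction-A module:

```python
import tensorflow as tf
from tensorflow.keras import layers

# (k, l, m, n) filter bank sizes from Table 1.
REDUCTION_A_FILTERS = {
    "Inception-v4":        (192, 224, 256, 384),
    "Inception-ResNet-v1": (192, 192, 256, 384),
    "Inception-ResNet-v2": (256, 256, 384, 384),
}

def reduction_a(x, k, l, m, n):
    """35x35 -> 17x17 Reduction-A module; the strided convolutions and the
    pooling are valid-padded ("V" in the figures)."""
    pool = layers.MaxPooling2D(3, strides=2, padding="valid")(x)
    conv = layers.Conv2D(n, 3, strides=2, padding="valid", activation="relu")(x)
    chain = layers.Conv2D(k, 1, activation="relu")(x)
    chain = layers.Conv2D(l, 3, padding="same", activation="relu")(chain)
    chain = layers.Conv2D(m, 3, strides=2, padding="valid", activation="relu")(chain)
    return layers.Concatenate()([pool, conv, chain])

# e.g. for Inception-v4: x = reduction_a(x, *REDUCTION_A_FILTERS["Inception-v4"])
```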

3.3 Scaling of the Residuals

We also found that if the number of filters exceeded 1000, the residual variants started to exhibit instabilities and the network just “died” early in training, meaning that the last layer before the average pooling started to produce only zeros after a few tens of thousands of iterations. This could not be prevented, neither by lowering the learning rate nor by adding an extra batch-normalization to this layer.

We found that scaling down the residuals before adding them to the previous layer's activations seemed to stabilize the training. In general we picked scaling factors between 0.1 and 0.3 to scale the residuals before they are added to the accumulated layer activations (cf. Figure 20).

A similar instability was observed by He et al. in [5] in the case of very deep residual networks; they suggested a two-phase training where the first “warm-up” phase is done with a very low learning rate, followed by a second phase with a high learning rate. We found that if the number of filters is very high, then even a very low (0.00001) learning rate is not sufficient to cope with the instabilities, and training with a high learning rate had a chance to destroy its effects. We found it much more reliable to just scale the residuals.

Even where the scaling was not strictly necessary, it never seemed to harm the final accuracy, but it helped to stabilize the training.


Figure 20: The general schema for scaling combined Inception-ResNet modules. We expect that the same idea is useful in the general ResNet case, where an arbitrary subnetwork is used instead of the Inception block. The scaling block just scales the last linear activations by a suitable constant, typically around 0.1.
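A minimal sketch of the scaling scheme of Figure 20, where `inception_fn` stands for any residual subnetwork whose output depth matches its input:

```python
import tensorflow as tf
from tensorflow.keras import layers

def scaled_residual_block(x, inception_fn, scale=0.1):
    """Scale down the residuals before adding them to the accumulated
    activations (cf. Figure 20); factors of 0.1-0.3 stabilized training."""
    residual = inception_fn(x)  # last activations of the branch are linear
    scaled = layers.Lambda(lambda t: t * scale)(residual)
    return layers.Activation("relu")(layers.Add()([x, scaled]))
```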

4. Training Methodology

We have trained our networks with stochastic gradient descent, utilizing the TensorFlow [1] distributed machine learning system with 20 replicas, each running on an NVidia Kepler GPU. Our earlier experiments used momentum [13] with a decay of 0.9, while our best models were achieved using RMSProp [16] with a decay of 0.9 and ϵ = 1.0. We used a learning rate of 0.045, decayed every two epochs using an exponential rate of 0.94. Model evaluations are performed using a running average of the parameters computed over time.
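The stated hyper-parameters map onto roughly the following optimizer configuration (a sketch in present-day TF/Keras rather than the paper's distributed setup; `steps_per_epoch` is a placeholder that depends on batch size and data sharding):

```python
import tensorflow as tf

steps_per_epoch = 10000  # placeholder: depends on dataset size and batch size

# Learning rate 0.045, decayed every two epochs with exponential rate 0.94.
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.045,
    decay_steps=2 * steps_per_epoch,
    decay_rate=0.94,
    staircase=True,
)

# RMSProp with decay 0.9 and epsilon 1.0, as used for the best models.
optimizer = tf.keras.optimizers.RMSprop(
    learning_rate=lr_schedule, rho=0.9, epsilon=1.0
)

# Evaluation uses a running average of the parameters; in TF this can be
# kept with an exponential moving average of the weights.
```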


Figure 21: Top-1 error evolution during training of pure Inception-v3 vs a residual network of similar computational cost. The evaluation is measured on a single crop on the non-blacklist images of the ILSVRC-2012 validation set. The residual model trained much faster, but reached slightly worse final accuracy than the traditional Inception-v3.

5. Experimental Results

First we observe the top-1 and top-5 validation-error evolution of the four variants during training. After the experiment was conducted, we found that our continuous evaluation had been conducted on a subset of the validation set which omitted about 1700 blacklisted entities due to poor bounding boxes. It turned out that this omission should only have been performed for the CLSLOC benchmark, and it yields somewhat incomparable (more optimistic) numbers when compared to other reports, including some earlier reports by our team. The difference is about 0.3% for top-1 error and about 0.15% for the top-5 error. However, since the differences are consistent, we think the comparison between the curves is a fair one.

On the other hand, we have rerun our multi-crop and ensemble results on the complete validation set consisting of 50,000 images. The final ensemble result was also evaluated on the test set and sent to the ILSVRC test server for validation, to verify that our tuning did not result in over-fitting. We would like to stress that this final validation was done only once and we have submitted our results only twice in the last year: once for the BN-Inception paper and later during the ILSVRC-2015 CLSLOC competition, so we believe that the test-set numbers constitute a true estimate of the generalization capabilities of our model.
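The multi-crop and ensemble numbers below amount to averaging the predicted class probabilities over crops and over models; a minimal sketch (the `models` callables and crop arrays are placeholders):

```python
import numpy as np

def ensemble_top5_error(models, crop_batches, labels):
    """Estimate top-5 error by averaging softmax outputs over models and crops.

    models:       callables mapping [num_crops, H, W, 3] -> [num_crops, 1000]
    crop_batches: one array of crops per validation image
    labels:       ground-truth class index per image
    """
    errors = 0
    for crops, label in zip(crop_batches, labels):
        probs = np.mean([model(crops) for model in models], axis=0)  # over models
        probs = probs.mean(axis=0)                                   # over crops
        errors += int(label not in np.argsort(probs)[-5:])
    return errors / len(labels)
```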


Figure 22: Top-5 error evolution during training of pure Inception-v3 vs a residual Inception of similar computational cost. The evaluation is measured on a single crop on the non-blacklist images of the ILSVRC-2012 validation set. The residual version trained much faster and reached slightly better final recall on the validation set.


Figure 23: Top-1 error evolution during training of pure Inception-v4 vs a residual Inception of similar computational cost. The evaluation is measured on a single crop on the non-blacklist images of the ILSVRC-2012 validation set. The residual version trained much faster and reached slightly better final accuracy than the traditional Inception-v4.


Figure 24: Top-5 error evolution during training of pure Inception-v4 vs a residual Inception of similar computational cost. The evaluation is measured on a single crop on the non-blacklist images of the ILSVRC-2012 validation set. The residual version trained faster and reached slightly better final recall on the validation set.


Figure 25: Top-5 error evolution of all four models (single model, single crop), showing the improvement due to larger model size. Although the residual version converges faster, the final accuracy seems to mainly depend on the model size.

Figure 26: Top-1 error evolution of all four models (single model, single crop). This paints a similar picture as the top-5 evaluation.

Finally, we present some comparisons, between various versions of Inception and Inception-ResNet. The models Inception-v3 and Inception-v4 are deep convolutional networks not utilizing residual connections while Inception-ResNet-v1 and Inception-ResNet-v2 are Inception style networks that utilize residual connections instead of filter concatenation.

| Network | Top-1 Error | Top-5 Error |
| --- | --- | --- |
| BN-Inception [6] | 25.2% | 7.8% |
| Inception-v3 [15] | 21.2% | 5.6% |
| Inception-ResNet-v1 | 21.3% | 5.5% |
| Inception-v4 | 20.0% | 5.0% |
| Inception-ResNet-v2 | 19.9% | 4.9% |

Table 2: Single crop - single model experimental results. Reported on the non-blacklisted subset of the validation set of ILSVRC 2012.

| Network | Crops | Top-1 Error | Top-5 Error |
| --- | --- | --- | --- |
| ResNet-151 [5] | 10 | 21.4% | 5.7% |
| Inception-v3 [15] | 12 | 19.8% | 4.6% |
| Inception-ResNet-v1 | 12 | 19.8% | 4.6% |
| Inception-v4 | 12 | 18.7% | 4.2% |
| Inception-ResNet-v2 | 12 | 18.7% | 4.1% |

Table 3: 10/12 crop evaluations - single model experimental results. Reported on all 50,000 images of the validation set of ILSVRC 2012.

| Network | Crops | Top-1 Error | Top-5 Error |
| --- | --- | --- | --- |
| ResNet-151 [5] | dense | 19.4% | 4.5% |
| Inception-v3 [15] | 144 | 18.9% | 4.3% |
| Inception-ResNet-v1 | 144 | 18.8% | 4.3% |
| Inception-v4 | 144 | 17.7% | 3.8% |
| Inception-ResNet-v2 | 144 | 17.8% | 3.7% |

Table 4: 144 crop evaluations - single model experimental results. Reported on all 50,000 images of the validation set of ILSVRC 2012.

| Network | Models | Top-1 Error | Top-5 Error |
| --- | --- | --- | --- |
| ResNet-151 [5] | 6 | – | 3.6% |
| Inception-v3 [15] | 4 | 17.3% | 3.6% |
| Inception-v4 + 3× Inception-ResNet-v2 | 4 | 16.5% | 3.1% |

Table 5: Ensemble results with 144 crops/dense evaluation. Reported on all 50,000 images of the validation set of ILSVRC 2012. For Inception-v4(+Residual), the ensemble consists of one pure Inception-v4 and three Inception-ResNet-v2 models and was evaluated both on the validation and on the test set. The test-set performance was 3.08% top-5 error, verifying that we do not over-fit on the validation set.

6. Conclusions

We have presented three new network architectures in detail:

  • Inception-ResNet-v1: a hybrid Inception version that has a similar computational cost to Inception-v3 from [15].
  • Inception-ResNet-v2: a costlier hybrid Inception version with significantly improved recognition performance.
  • Inception-v4: a pure Inception variant without residual connections, with roughly the same recognition performance as Inception-ResNet-v2.

We studied how the introduction of residual connections leads to dramatically improved training speed for the Inception architecture. Our latest models (with and without residual connections) also outperform all our previous networks, simply by virtue of the increased model size.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
  • [2] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le, et al. Large scale distributed deep networks. In Advances in Neural Information Processing Systems, pages 1223–1231, 2012.
  • [3] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In Computer Vision–ECCV 2014, pages 184–199. Springer, 2014.
  • [4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [5] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
  • [6] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pages 448–456, 2015.
  • [7] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei. Large-scale video classification with convolutional neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1725–1732. IEEE, 2014.
  • [8] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [9] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
  • [10] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
  • [11] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. 2014.
  • [12] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [13] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pages 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.
  • [14] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
  • [15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
  • [16] T. Tieleman and G. Hinton. Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012. Accessed: 2015-11-05.
  • [17] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1653–1660. IEEE, 2014.
  • [18] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In Advances in Neural Information Processing Systems, pages 809–817, 2013.