
EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention


Title

Original:

EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention

Analysis:

  • Method name: EfficientViT
  • Subject: Vision Transformer
  • Role (the main capability that modifies the subject): Memory Efficient
  • Technique: Cascaded Group Attention
  • Preposition: with

Several prepositions commonly used in titles are worth distinguishing: towards, by, for, with, (based) on, in.

  • towards: oriented toward a goal. The method achieves the stated function to some degree, but not necessarily completely.
  • by: the stated goal is achieved by means of some technique or approach.
  • for: expresses purpose. The method serves this purpose to some degree, but not necessarily completely.
  • with: expresses accompaniment, i.e., the method comes with a certain property or capability.
  • (based) on: indicates the method is built on a certain technique.
  • in: locates the work in a certain place, domain, or setting.

Abstract

Original:

Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory inefficient operations, especially the tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module feeding attention heads with different splits of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while getting 40.4% and 45.2% higher throughput on Nvidia V100 GPU and Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% superior accuracy, while running 5.8×/3.7× faster on the GPU/CPU, and 7.4× faster when converted to ONNX format. Code and models are available at here.

Analysis:

Vision transformers have shown great success due to their high model capabilities.

  • Introduces the broad background. Note the use of the present perfect tense.

However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications.

  • States the contradiction. A relative clause further explains its consequences.

In this paper, we propose a family of high-speed vision transformers named EfficientViT.

  • Goes straight into the paper itself: in this paper, we propose …

We find that the speed of existing transformer models is commonly bounded by memory inefficient operations, especially the tensor reshaping and element-wise functions in MHSA.

  • We find that … suffers from a problem, especially … (this sentence can also come before the method): a problem rooted in the broad background.

Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication.

  • Therefore, to address this problem, we design …, using the … approach, which serves to …

Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy.

  • Moreover, we discover a … phenomenon (problem) that leads to …: a problem found during the experiments.

To address this, we present a cascaded group attention module feeding attention heads with different splits of the full feature, which not only saves computation cost but also improves attention diversity.

  • To solve this problem, we present a new method that not only … but also …

Comprehensive experiments demonstrate EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy.

  • Comprehensive experiments verify the effectiveness of the method.

For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while getting 40.4% and 45.2% higher throughput on Nvidia V100 GPU and Intel Xeon CPU, respectively.

  • Presents the experimental results.

Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% superior accuracy, while running 5.8×/3.7× faster on the GPU/CPU, and 7.4× faster when converted to ONNX format.

  • Presents comparative results.

Code and models are available at here.

  • The authors actually release their code. I could cry; much appreciated.

Introduction

Vision Transformers (ViTs) have taken computer vision domain by storm due to their high model capabilities and superior performance [18, 44, 69]. However, the constantly improved accuracy comes at the cost of increasing model sizes and computation overhead. For example, SwinV2 [43] uses 3.0B parameters, while V-MoE [62] taking 14.7B parameters, to achieve state-of-the-art performance on ImageNet [17]. Such large model sizes and the accompanying heavy computational costs make these models unsuitable for applications with real-time requirements [40, 78, 86].

Analysis: expands on the first and second sentences of the abstract; introduces the broad background.

There are several recent works designing light and efficient vision transformer models [9,19,29,49,50,56,79,81]. Unfortunately, most of these methods aim to reduce model parameters or Flops, which are indirect metrics for speed and do not reflect the actual inference throughput of models. For example, MobileViT-XS [50] using 700M Flops runs much slower than DeiT-T [69] with 1,220M Flops on an Nvidia V100 GPU. Although these methods have achieved good performance with fewer Flops or parameters, many of them do not show significant wall-clock speedup against standard isomorphic or hierarchical transformers, e.g., DeiT[69] and Swin [44], and have not gained wide adoption.

Analysis: surveys related work and exposes the shortcomings of existing methods in this area.

To address this issue, in this paper, we explore how to go faster with vision transformers, seeking to find principles for designing efficient transformer architectures. Based on the prevailing vision transformers DeiT [69] and Swin[44], we systematically analyze three main factors that affect model inference speed, including memory access, computation redundancy, and parameter usage. In particular, we find that the speed of transformer models is commonly memory-bound. In other words, memory accessing delay prohibits the full utilization of the computing power in GPU/CPUs [21, 32, 72], leading to a critically negative impact on the runtime speed of transformers [15, 31]. The most memory-inefficient operations are the frequent tensor reshaping and element-wise functions in multi-head self-attention (MHSA). We observe that through an appropriate adjustment of the ratio between MHSA and FFN (feedforward network) layers, the memory access time can be reduced significantly without compromising the performance. Moreover, we find that some attention heads tend to learn similar linear projections, resulting in redundancy in attention maps. The analysis shows that explicitly decomposing the computation of each head by feeding them with diverse features can mitigate this issue while improving computation efficiency. In addition, the parameter allocation in different modules is often overlooked by existing lightweight models, as they mainly follow the configurations in standard transformer models [44,69]. To improve parameter efficiency, we use structured pruning [45] to identify the most important network components, and summarize empirical guidance of parameter reallocation for model acceleration.

Analysis: addresses the problem in the broad background, states the paper's main viewpoints, cites the referenced work, and introduces the problems uncovered during the experiments.
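The memory-bound observation above is easy to probe on your own hardware. Below is a minimal sketch (my own illustration, not the authors' profiling code) that runs the PyTorch profiler on a standard nn.MultiheadAttention layer; the transpose, reshape, and softmax rows in the resulting table correspond to the memory-inefficient operations the paper refers to, while the matmul rows are the compute-bound part.

```python
# Illustrative only: profile a vanilla MHSA layer to see how much time goes to
# reshaping and element-wise ops versus matrix multiplications.
import torch
from torch.profiler import profile, ProfilerActivity

mhsa = torch.nn.MultiheadAttention(embed_dim=256, num_heads=8, batch_first=True)
x = torch.randn(64, 196, 256)  # (batch, tokens, channels), e.g. 14x14 patches

with profile(activities=[ProfilerActivity.CPU]) as prof:
    with torch.no_grad():
        for _ in range(10):
            mhsa(x, x, x)

# Transpose/reshape/softmax rows are memory-bound; matmul rows are compute-bound.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=15))
```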

Based upon the analysis and findings, we propose a new family of memory efficient transformer models named EfficientViT. Specifically, we design a new block with a sandwich layout to build up the model. The sandwich layout block applies a single memory-bound MHSA layer between FFN layers. It reduces the time cost caused by memory-bound operations in MHSA, and applies more FFN layers to allow communication between different channels, which is more memory efficient. Then, we propose a new cascaded group attention (CGA) module to improve computation efficiency. The core idea is to enhance the diversity of the features fed into the attention heads. In contrast to prior self-attention using the same feature for all heads, CGA feeds each head with different input splits and cascades the output features across heads. This module not only reduces the computation redundancy in multi-head attention, but also elevates model capacity by increasing network depth. Last but not least, we redistribute parameters through expanding the channel width of critical network components such as value projections, while shrinking the ones with lower importance like hidden dimensions in FFNs. This reallocation finally promotes model parameter efficiency.

Analysis: presents the solutions to the problems uncovered in the experiments.
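To make the two design ideas more concrete, here is a minimal PyTorch sketch of a sandwich-layout block with cascaded group attention. It is my own simplified reading of the description above, not the official implementation: class and argument names (SandwichBlock, CascadedGroupAttention, dim, heads, ffn_ratio) are assumptions, and details such as depthwise convolutions for token interaction, normalization, and the attention tricks of the real EfficientViT are omitted.

```python
# A simplified sketch of the sandwich layout and cascaded group attention (CGA).
# Not the authors' code; names and hyperparameters here are illustrative.
import torch
import torch.nn as nn


class CascadedGroupAttention(nn.Module):
    """Each head attends over its own channel split of the input; the output of
    head i is added to the split fed to head i+1 (the cascade)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        d = dim // heads                                     # per-head split width
        self.qkvs = nn.ModuleList(nn.Linear(d, 3 * d) for _ in range(heads))
        self.proj = nn.Linear(dim, dim)                      # fuse concatenated heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: (B, N, dim)
        splits = x.chunk(self.heads, dim=-1)                 # a different split per head
        outs, carry = [], 0
        for i, qkv in enumerate(self.qkvs):
            feat = splits[i] + carry                         # cascade previous head's output
            q, k, v = qkv(feat).chunk(3, dim=-1)
            attn = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
            carry = attn.softmax(dim=-1) @ v
            outs.append(carry)
        return self.proj(torch.cat(outs, dim=-1))


class SandwichBlock(nn.Module):
    """Sandwich layout: a single (memory-bound) attention layer between FFNs."""

    def __init__(self, dim: int, heads: int = 4, ffn_ratio: int = 2):
        super().__init__()
        def ffn():
            return nn.Sequential(nn.Linear(dim, ffn_ratio * dim), nn.ReLU(),
                                 nn.Linear(ffn_ratio * dim, dim))
        self.ffn1, self.attn, self.ffn2 = ffn(), CascadedGroupAttention(dim, heads), ffn()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.ffn1(x)
        x = x + self.attn(x)     # the only attention layer in the block
        return x + self.ffn2(x)


if __name__ == "__main__":
    block = SandwichBlock(dim=192, heads=4)
    print(block(torch.randn(2, 196, 192)).shape)             # torch.Size([2, 196, 192])
```

The cascade is the `splits[i] + carry` line: each head receives its own channel split plus the output of the previous head, which is what diversifies the attention maps while keeping the per-head dimension small.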

Experiments demonstrate that our models achieve clear improvements over existing efficient CNN and ViT models in terms of both speed and accuracy, as shown in Fig. 1. For instance, our EfficientViT-M5 gets 77.1% top-1 accuracy on ImageNet with throughput of 10,621 images/s on an Nvidia V100 GPU and 56.8 images/s on an Intel Xeon E5-2690 v4 CPU @ 2.60GHz, outperforming MobileNetV3-Large [26] by 1.9% in accuracy, 40.4% in GPU inference speed, and 45.2% in CPU speed. Moreover, EfficientViT-M2 gets 70.8% accuracy, surpassing MobileViT-XXS [50] by 1.8%, while running 5.8×/3.7× faster on the GPU/CPU, and 7.4× faster when converted to ONNX [3] format. When deployed on the mobile chipset, i.e., Apple A13 Bionic chip in iPhone 11, EfficientViT-M2 model runs 2.3× faster than MobileViT-XXS [50] using the CoreML [1].

Analysis: presents the strong experimental results.

In summary, the contributions of this work are two-fold:

  • We present a systematic analysis on the factors that affect the inference speed of vision transformers, deriving a set of guidelines for efficient model design.
  • We design a new family of vision transformer models, which strike a good trade-off between efficiency and accuracy. The models also demonstrate good transfer ability on a variety of downstream tasks.

Analysis: states the contributions of this work.

Method

Omitted.

Experiments

Implementation Details

Mix active and passive voice. When something is awkward to express in the passive, prefer using "we".

Results on ImageNet

We compare EfficientViT with prevailing efficient CNN and ViT models on ImageNet [17], and report the results in Tab. 2 and Fig. 1. The results show that, in most cases, our EfficientViT achieves the best accuracy and speed trade-off across different evaluation settings.

Comparisons with efficient CNNs. We first compare EfficientViT with vanilla CNN models, such as MobileNets [26, 63] and EfficientNet [67]. Specifically, compared to MobileNetV2 1.0× [63], EfficientViT-M3 obtains 1.4% better top-1 accuracy, while running at 2.5× and 3.0× faster speed on V100 GPU and Intel CPU, respectively. Compared to the state-of-the-art MobileNetV3-Large [26], EfficientViT-M5 achieves 1.9% higher accuracy yet runs much faster, e.g., 40.5% faster on the V100 GPU and 45.2% faster on the Intel CPU but is 11.5% slower as ONNX models. This may be because reshaping is slower in the ONNX implementation, which is inevitable in computing self-attention. Moreover, EfficientViT-M5 achieves comparable accuracy with the searched model EfficientNet-B0 [67], while running 2.3×/1.9× faster on the V100 GPU/Intel CPU, and 2.1× faster as ONNX models. Although our model uses more parameters, it reduces memory-inefficient operations that affect the inference speed and achieves higher throughput.

Analysis: minimize the problems, maximize the results.
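The ONNX numbers in this comparison come from exporting the model and timing it with an ONNX runtime. As a purely hypothetical illustration (the file name, the SandwichBlock stand-in from the earlier sketch, and the timing loop are my own, not the authors' benchmark scripts), such a measurement could look roughly like this:

```python
# Rough illustration of measuring ONNX throughput; not the paper's benchmark code.
import time

import onnxruntime as ort
import torch

model = SandwichBlock(dim=192, heads=4).eval()    # stand-in from the sketch above
dummy = torch.randn(1, 196, 192)
torch.onnx.export(model, dummy, "block.onnx", input_names=["x"], output_names=["y"])

sess = ort.InferenceSession("block.onnx", providers=["CPUExecutionProvider"])
feed = {"x": dummy.numpy()}
sess.run(None, feed)                               # warm-up run
start = time.perf_counter()
for _ in range(100):
    sess.run(None, feed)
print(f"throughput: {100 / (time.perf_counter() - start):.1f} runs/s")
```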

Transfer Learning Results

Omitted.

Ablation Study

Omitted.

Conclusion

In this paper, we have presented a systematic analysis on the factors that affect the inference speed of vision transformers, and proposed a new family of fast vision transformers with memory-efficient operations and cascaded group attention, named EfficientViT. Extensive experiments have demonstrated the efficacy and high speed of EfficientViT, and also show its superiority on various downstream benchmarks.

Analysis: summarizes the paper's viewpoints, conclusions, and results.

Limitations. One limitation of EfficientViT is that, despite its high inference speed, the model size is slightly larger compared to state-of-the-art efficient CNN [26] due to the extra FFNs in the introduced sandwich layout. Besides, our models are designed manually based on the derived guidelines on building efficient vision transformers. In future work, we are interested in reducing the model size and incorporating automatic search techniques to further enhance the model capacity and efficiency.

Analysis: states the paper's limitations while minimizing blame and downplaying the issues, which are deferred to future work.


Author: JiJunhao
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY 4.0. Please credit JiJunhao when reposting!