There are currently three mainstream families of vision architectures: CNNs, attention-based models, and MLP-Mixer. The latter two are Transformer- and token-based architectures proposed by Google. A token is the vector obtained by splitting an image into patches and linearly projecting each patch. How to design the token-mixing operation is the key question for the various MLP-based architectures. MLP-Mixer can in fact be viewed as a simplified form of attention, using MLPs to perform the mixing that attention would otherwise provide.
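To make the token-mixing vs. channel-mixing distinction concrete, here is a minimal sketch in plain Python (the shapes, weight matrices, and function names are illustrative, not from any particular library): token-mixing applies an MLP *across tokens* for each channel, while channel-mixing applies an ordinary per-token MLP across channels.

```python
# Illustrative sketch of MLP-Mixer style mixing; x is a tokens x channels
# matrix (a list of token vectors), weights are plain nested lists.

def transpose(m):
    return [list(row) for row in zip(*m)]

def mlp(vec, w1, w2):
    # Two-layer MLP with ReLU: relu(vec @ w1) @ w2.
    h = [max(0.0, sum(v * w for v, w in zip(vec, col))) for col in transpose(w1)]
    return [sum(hi * w for hi, w in zip(h, col)) for col in transpose(w2)]

def token_mix(x, w1, w2):
    # Token-mixing: transpose to channels x tokens, run the MLP along
    # the token axis for each channel, then transpose back.
    xt = transpose(x)
    mixed = [mlp(row, w1, w2) for row in xt]
    return transpose(mixed)

def channel_mix(x, w1, w2):
    # Channel-mixing: a standard per-token MLP over the channel axis.
    return [mlp(row, w1, w2) for row in x]
```

With identity weights and non-negative inputs both operations reduce to the identity, which makes the data flow easy to check; in the real MLP-Mixer these blocks are interleaved with layer normalization and residual connections.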
[Paper Reading] End-to-End Object Detection with Transformers
[Paper Reading] Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention
[Paper Reading] Efficient Transformers: A Survey
[Paper Reading] An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Vision Transformer
[Paper Reading] A Survey on Visual Transformer
[Paper Reading] DeepViT: Towards Deeper Vision Transformer
[Paper Reading] ViViT: A Video Vision Transformer
[Paper Reading] MLP-Mixer: An all-MLP Architecture for Vision
[Paper Reading] Pay Attention to MLPs
[Paper Reading] A Survey of Transformers
[Paper Reading] AS-MLP: An Axial Shifted MLP Architecture for Vision
[Paper Reading] CycleMLP: A MLP-Like Architecture for Dense Prediction
[Paper Reading] ConvMLP: Hierarchical Convolutional MLPs for Vision
[Paper Reading] A Survey of Visual Transformers
[Paper Reading] Attention Mechanisms in Computer Vision: A Survey