|
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
View a PDF of the paper titled An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, by Alexey Dosovitskiy and 10 other authors
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Abstract: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited.
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
This paper introduces a Transformer-based image recognition model built entirely from Transformer layers (multi-head self-attention + point-wise MLP), without any standard convolution layers.
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
... beats state of the art on multiple image recognition benchmarks. In particular, the best model reaches the accuracy of 88.55% on ImageNet, 90.72% on ImageNet-ReaL, ...
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Therefore, in large-scale image recognition, classic ResNet-like architectures are still state of the art (Mahajan et al., 2018; Xie et al., 2020; Kolesnikov et al., 2020). Inspired by the Transformer scaling successes in NLP, we experiment with applying a standard Transformer directly to images, with the fewest possible modifications.
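The "16x16 words" in the title refers to cutting an image into fixed-size patches that a Transformer can read as a token sequence. The following is a minimal pure-Python sketch of that splitting step only, not the authors' implementation. The function name `split_into_patches`, the single-channel nested-list image, and the 224 x 224 input size are assumptions chosen for illustration.

```python
from typing import List

# Assumption: a grayscale image stored as an H x W nested list of floats.
Image = List[List[float]]


def split_into_patches(image: Image, patch: int = 16) -> List[List[float]]:
    """Cut an image into non-overlapping patch x patch tiles.

    Each tile is flattened row by row into one "word" of patch * patch values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):        # walk tiles top to bottom
        for left in range(0, width, patch):    # and left to right
            tile = [image[top + dy][left + dx]
                    for dy in range(patch) for dx in range(patch)]
            tokens.append(tile)
    return tokens


# Hypothetical 224 x 224 input; each pixel value encodes its own position.
image = [[float(y * 224 + x) for x in range(224)] for y in range(224)]
tokens = split_into_patches(image)
print(len(tokens), len(tokens[0]))  # 196 256
```

A 224 x 224 image yields 14 x 14 = 196 tokens of 256 values each. In the paper, each flattened patch is then linearly projected and combined with a position embedding before entering the Transformer; those steps are omitted here.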
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A new Transformer-based hybrid network is proposed that uses Transformers to capture long-range dependencies and CNNs to extract local information, obtaining a much better trade-off between accuracy and efficiency than previous CNN-based and Transformer-based models.
- arXiv.org e-Print archive
Explores the application of Transformer models to image recognition, achieving competitive results compared to convolutional networks on various benchmarks
|
|
|