PyTorch Classification

Vision Transformer

What is the Vision Transformer?

The Vision Transformer (ViT) applies the transformer encoder architecture, popularized in natural language processing by models such as BERT, to images. Each input image is split into fixed-size patches, and each patch is flattened and linearly embedded. A learnable [CLS] token is prepended to the sequence of patch embeddings, position embeddings are added, and the resulting sequence is fed to the transformer encoder. To classify the image, the encoder output at the [CLS] position is passed to a classification head.
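The pipeline described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the reference implementation: the class name `TinyViT` and every size (image size, patch size, embedding width, depth, number of heads, number of classes) are assumptions chosen to keep the example small, and the encoder uses PyTorch's stock `nn.TransformerEncoder` with default settings rather than the exact block layout of the published model.

```python
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    """Illustrative ViT-style classifier; all sizes are assumed, not from the source."""

    def __init__(self, image_size=32, patch_size=8, channels=3,
                 dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        assert image_size % patch_size == 0, "image must divide evenly into patches"
        self.patch_size = patch_size
        num_patches = (image_size // patch_size) ** 2
        patch_dim = channels * patch_size * patch_size

        # Linear embedding applied to each flattened patch.
        self.patch_embed = nn.Linear(patch_dim, dim)
        # Learnable [CLS] token and position embeddings (one per token, incl. [CLS]).
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)

        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):
        b, c, _, _ = images.shape
        p = self.patch_size
        # Cut non-overlapping p x p patches: (B, C, H, W) -> (B, C, H/p, W/p, p, p)
        patches = images.unfold(2, p, p).unfold(3, p, p)
        # Flatten each patch: -> (B, num_patches, C * p * p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)

        tokens = self.patch_embed(patches)
        cls = self.cls_token.expand(b, -1, -1)
        # Prepend [CLS], then add position embeddings to every token.
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        # Classify from the encoder output at the [CLS] position.
        return self.head(encoded[:, 0])


model = TinyViT()
logits = model(torch.randn(2, 3, 32, 32))  # batch of 2 random 32x32 RGB images
print(logits.shape)  # torch.Size([2, 10])
```

With the assumed 32x32 input and 8x8 patches, each image becomes 16 patch tokens plus one [CLS] token, so the encoder sees a sequence of length 17.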

Vision Transformer Architecture

Vision Transformer Performance

Applied to image classification, transformers achieve state-of-the-art performance on a variety of datasets and rival traditional convolutional neural networks, particularly when pre-trained on large datasets.
ViT Performance
Images courtesy of Google Research