Blog
YOLOS - You Only Look At One Sequence
Unlike previous CNN-based YOLO models, the YOLOS backbone is a plain Transformer encoder, much like the first Vision Transformer (ViT) for image classification.
YOLOS splits an image into patches to form "patch tokens", which take the place of the traditional wordpiece tokens in NLP. One hundred learnable "detection tokens" are appended to the patch-token sequence; each one acts as a placeholder for a potential detection.
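The tokenization step above can be sketched in NumPy: split the image into non-overlapping patches, linearly project each flattened patch into an embedding, then append the detection tokens. The sizes below (224×224 input, 16×16 patches, 192-dim embeddings) are illustrative ViT-style choices, not the exact YOLOS configuration.

```python
import numpy as np

# Illustrative sizes, not the exact YOLOS config
H, W, C = 224, 224, 3      # input image
P = 16                     # patch size
D = 192                    # embedding dimension

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Split the image into non-overlapping P x P patches and flatten each one
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(-1, P * P * C)     # (196, 768)

# Linear projection of each flattened patch -> a "patch token"
W_embed = rng.standard_normal((P * P * C, D))
patch_tokens = patches @ W_embed             # (196, D)

# 100 learnable detection tokens appended to the sequence
det_tokens = rng.standard_normal((100, D))
sequence = np.concatenate([patch_tokens, det_tokens], axis=0)
print(sequence.shape)  # (296, 192)
```

The full sequence of 196 patch tokens plus 100 detection tokens is then fed through the Transformer encoder, and the outputs at the detection-token positions are decoded into boxes and class scores.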
Compared to other CNN-based YOLO models, YOLOS benefits from the rising tide of transformers in computer vision, and it performs inference without non-maximum suppression (NMS), a tedious post-processing step that makes deploying other YOLO models difficult and slow.
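To see what YOLOS avoids, here is a minimal sketch of the NMS step that CNN-based YOLO models rely on: greedily keep the highest-scoring box and discard any remaining box whose IoU with it exceeds a threshold. This is a simplified illustration, not code from any YOLO implementation.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one [x1, y1, x2, y2] box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(scores)[::-1]   # highest score first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Drop boxes overlapping the kept box above the threshold
        order = rest[iou(boxes[i], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2] -- the second box is suppressed
```

In YOLOS, each detection token is matched one-to-one to a ground-truth object during training (as in DETR's bipartite matching), so duplicate predictions are penalized directly and no such suppression pass is needed at inference time.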
| Model | Pre-train Epochs | ViT (DeiT) Weight / Log | Fine-tune Epochs | Eval Size | YOLOS Checkpoint / Log | AP @ COCO val |
|---|---|---|---|---|---|---|
| YOLOS-Ti | 300 | FB | 300 | 512 | Baidu Drive, Google Drive / Log | 28.7 |
| YOLOS-S | 200 | Baidu Drive, Google Drive / Log | 150 | 800 | Baidu Drive, Google Drive / Log | 36.1 |
| YOLOS-S | 300 | FB | 150 | 800 | Baidu Drive, Google Drive / Log | 36.1 |
| YOLOS-S (dWr) | 300 | Baidu Drive, Google Drive / Log | 150 | 800 | Baidu Drive, Google Drive / Log | 37.6 |
| YOLOS-B | 1000 | FB | 150 | 800 | Baidu Drive, Google Drive / Log | 42.0 |