YotoR-You Only Transform One Representation

José Ignacio Díaz Villa, Patricio Loncomilla, Javier Ruiz-del-Solar

TL;DR

YotoR proposes a hybrid detector that fuses Swin Transformer backbones with YoloR heads to achieve real-time object detection with improved accuracy. Four variants (TP4, TP5, BP4, BB4) are evaluated on COCO. TP5 and BP4 outperform YoloR P6 and Swin-based detectors on val2017 and test-dev, with faster inference than the Swin-based detectors, though some tradeoffs remain. The work demonstrates the viability of transformer-based backbones paired with YOLO-style heads and highlights potential for scaling and broader image-task applications. It also outlines future directions, including larger model variants and deeper exploration of implicit knowledge and multi-modal capabilities.

Abstract

This paper introduces YotoR (You Only Transform One Representation), a novel deep learning model for object detection that combines Swin Transformers and YoloR architectures. Transformers, a revolutionary technology in natural language processing, have also significantly impacted computer vision, offering the potential to enhance accuracy and computational efficiency. YotoR combines the robust Swin Transformer backbone with the YoloR neck and head. In our experiments, YotoR models TP5 and BP4 consistently outperform YoloR P6 and Swin Transformers in various evaluations, delivering improved object detection performance and faster inference speeds than Swin Transformer models. These results highlight the potential for further model combinations and improvements in real-time object detection with Transformers. The paper concludes by emphasizing the broader implications of YotoR, including its potential to enhance transformer-based models for image-related tasks.

Paper Structure (17 sections, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Diagrams of the different approaches to a multi-task network as proposed by YoloR.
  • Figure 2: Diagram of the Swin Transformer T backbone.
  • Figure 3: Architecture of YotoR BP4.
  • Figure 4: Comparison between the time and mAP of each model in COCO val2017.
  • Figure 5: Left: Images from val2017 and test-dev. Right: Predictions from YotoR BP4.