YotoR-You Only Transform One Representation
José Ignacio Díaz Villa, Patricio Loncomilla, Javier Ruiz-del-Solar
TL;DR
YotoR is a hybrid detector that pairs a Swin Transformer backbone with the YoloR neck and head, targeting real-time object detection with improved accuracy. Four variants (TP4, TP5, BP4, BB4) are evaluated on COCO. TP5 and BP4 outperform YoloR P6 and Swin Transformer detectors in AP on val2017 and test-dev and run faster than the Swin Transformer models, though some trade-offs remain. The work demonstrates the viability of pairing transformer-based backbones with YOLO-style heads and highlights the potential for scaling and for broader image-task applications. It also outlines future directions, including larger model variants and deeper exploration of implicit knowledge and multi-modal capabilities.
Abstract
This paper introduces YotoR (You Only Transform One Representation), a novel deep learning model for object detection that combines the Swin Transformer and YoloR architectures. Transformers, a revolutionary technology in natural language processing, have also significantly impacted computer vision, offering the potential to enhance accuracy and computational efficiency. Specifically, YotoR pairs the robust Swin Transformer backbone with the YoloR neck and head. In our experiments, the YotoR models TP5 and BP4 consistently outperform YoloR P6 and Swin Transformers in various evaluations, delivering improved object detection performance and faster inference speeds than Swin Transformer models. These results highlight the potential for further model combinations and improvements in real-time object detection with Transformers. The paper concludes by emphasizing the broader implications of YotoR, including its potential to enhance transformer-based models for image-related tasks.
