Designing High-Performing Networks for Multi-Scale Computer Vision
Cédric Picron
TL;DR
This work investigates high-performance network designs for multi-scale computer vision, introducing a dedicated neck (TPN) and task-specific heads (FQDet, FQDetV2, EffSeg) to address scale variation more efficiently than backbone-centric approaches. By balancing communication-based processing and self-processing in the neck, reintroducing anchors and static top-k matching in query-based detectors, and applying Structure-Preserving Sparsity for fine-grained segmentation, the thesis demonstrates improved accuracy and faster convergence on COCO benchmarks while keeping computation cost competitive. Key findings show that allocating more compute to the neck yields tangible gains and that anchor-informed, top-k-based matching accelerates training and improves localization; EffSeg delivers strong segmentation performance with substantial FLOP and memory savings. Collectively, these designs advance multi-scale computer vision by shifting emphasis toward neck and task-head innovations, with broad implications for object detection and segmentation in real-world, resource-constrained settings.
Abstract
Since the emergence of deep learning, the computer vision field has flourished, with models improving at a rapid pace on increasingly complex tasks. We distinguish three main ways to improve a computer vision model: (1) improving the data aspect, for example by training on a larger, more diverse dataset, (2) improving the training aspect, for example by designing a better optimizer, and (3) improving the network architecture (or network for short). In this thesis, we chose the last of these, i.e. improving the network designs of computer vision models. More specifically, we investigate new network designs for multi-scale computer vision tasks, which are tasks that require making predictions about concepts at different scales. The goal of these new network designs is to outperform existing baseline designs from the literature. Specific care is taken to ensure the comparisons are fair, by guaranteeing that the different network designs are trained and evaluated with the same settings. Code is publicly available at https://github.com/CedricPicron/DetSeg.
