CF-DETR: Coarse-to-Fine Transformer for Real-Time Object Detection

Woojin Shin, Donghwa Kang, Byeongyun Park, Brent Byunghoon Kang, Jinkyu Lee, Hyeongboo Baek

TL;DR

CF-DETR addresses the problem of running multiple DETR tasks in real time for autonomous driving by combining a coarse-to-fine Transformer architecture with a dedicated NPFP** scheduler. It uses four mechanisms (coarse-to-fine inference, selective region refinement, multi-level batching, and batch-enabled scheduling) to dynamically adjust patch granularity and attention scope while preserving safety-critical deadlines. Key contributions are the CF-DETR architecture, the NPFP** scheduling framework, and an evaluation showing improved critical and overall mAP with competitive throughput, including a practical emergency braking case. The result is a Transformer-aware approach that meets real-time constraints for safety-critical AV perception without relying on additional sensors.

Abstract

Detection Transformers (DETR) are increasingly adopted in autonomous vehicle (AV) perception systems due to their superior accuracy over convolutional networks. However, concurrently executing multiple DETR tasks presents significant challenges in meeting firm real-time deadlines (R1) and high accuracy requirements (R2), particularly for safety-critical objects, while navigating the inherent latency-accuracy trade-off under resource constraints. Existing real-time DNN scheduling approaches often treat models generically, failing to leverage Transformer-specific properties for efficient resource allocation. To address these challenges, we propose CF-DETR, an integrated system featuring a novel coarse-to-fine Transformer architecture and a dedicated real-time scheduling framework NPFP**. CF-DETR employs three key strategies (A1: coarse-to-fine inference, A2: selective fine inference, A3: multi-level batch inference) that exploit Transformer properties to dynamically adjust patch granularity and attention scope based on object criticality, aiming to satisfy R2. The NPFP** scheduling framework (A4) orchestrates these adaptive mechanisms A1-A3. It partitions each DETR task into a safety-critical coarse subtask for guaranteed critical object detection within its deadline (ensuring R1), and an optional fine subtask for enhanced overall accuracy (R2), while managing individual and batched execution. Our extensive evaluations on server, GPU-enabled embedded platforms, and actual AV platforms demonstrate that CF-DETR, under an NPFP** policy, successfully meets strict timing guarantees for critical operations and achieves significantly higher overall and critical object detection accuracy compared to existing baselines across diverse AV workloads.

Paper Structure

This paper contains 18 sections, 3 theorems, 4 equations, 9 figures, 3 tables, 3 algorithms.

Key Result

Lemma 1

A task set $\tau$ scheduled by NPFP$^\textsf{C}$ is schedulable if every $\tau_i \in \tau$ satisfies Eq. (eq:rta1), where the worst-case response time $R_i$ is the smallest positive value found by iterating Eq. (eq:rta2) until convergence, i.e., $R_i(x+1)=R_i(x)$, starting with $R_i(0) = C_i^S + B_i$. Here, $B_i = \max ( \{0\} \cup \{ C_j^S \mid \tau_j \in \texttt{LP}(\tau_i) \} )$ denotes the maxi
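
The fixed-point iteration described in Lemma 1 can be illustrated with a short sketch. The starting value $R_i(0) = C_i^S + B_i$ and the blocking term $B_i$ follow the lemma text; the recurrence itself (Eq. eq:rta2) and the schedulability condition (Eq. eq:rta1) are not reproduced in this summary, so the code below substitutes the classical fixed-priority recurrence with a lower-priority blocking term and the condition $R_i \le D_i$. These substitutions, along with the `Task` fields and function names, are assumptions made for illustration and may differ from the paper's exact formulation.

```python
import math
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    period: int      # T_i (assumed; not defined in the summary)
    deadline: int    # D_i (assumed; not defined in the summary)
    wcet: int        # C_i^S, worst-case execution time of the coarse subtask
    priority: int    # smaller value = higher priority (assumed convention)


def blocking(task, tasks):
    """B_i = max({0} U {C_j^S | tau_j in LP(tau_i)}), as stated in Lemma 1."""
    lower = [t.wcet for t in tasks if t.priority > task.priority]
    return max([0] + lower)


def response_time(task, tasks, max_iter=1000):
    """Iterate an ASSUMED stand-in recurrence until R_i(x+1) == R_i(x).

    Returns the converged response time, or None if the value exceeds
    the deadline or does not converge within max_iter steps.
    """
    b = blocking(task, tasks)
    higher = [t for t in tasks if t.priority < task.priority]
    r = task.wcet + b  # R_i(0) = C_i^S + B_i, as stated in Lemma 1
    for _ in range(max_iter):
        interference = sum(math.ceil(r / t.period) * t.wcet for t in higher)
        nxt = task.wcet + b + interference
        if nxt == r:
            return r
        if nxt > task.deadline:
            return None
        r = nxt
    return None


def schedulable(tasks):
    """Assumed form of the test: every task converges with R_i <= D_i."""
    results = [response_time(t, tasks) for t in tasks]
    return all(r is not None and r <= t.deadline for r, t in zip(results, tasks))


tasks = [
    Task("front_camera", period=10, deadline=10, wcet=2, priority=0),
    Task("side_camera", period=20, deadline=20, wcet=4, priority=1),
    Task("rear_camera", period=50, deadline=50, wcet=6, priority=2),
]
for t in tasks:
    print(t.name, response_time(t, tasks))
```

In this hypothetical task set, the highest-priority task is delayed only by blocking from the longest lower-priority coarse subtask, while the lowest-priority task has no blocking and is delayed only by higher-priority interference.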

Figures (9)

  • Figure 1: Overview of the DETR pipeline for object detection. The input image is divided into patches (a), which are encoded and processed via self-attention in the Transformer encoder (b). Learned object queries in the Transformer decoder interact with encoded patches through cross-attention (c), producing bounding boxes, class labels, and confidence scores (d).
  • Figure 2: Accuracy and latency analysis of DETR (DINO model) with varying patch granularity on the KITTI dataset (evaluated on Jetson Orin). (a) Trade-off between accuracy (averaged across all images) and latency with different numbers of patches. (b) Accuracy (averaged for each object-size group) comparison by object size, showing coarse patches are sufficient for large objects but significantly reduce accuracy for smaller objects.
  • Figure 3: System overview of CF-DETR: Periodic frames from multiple tasks arrive and queue. The scheduler then forms initial processing batches (❶). Inputs undergo coarse batch inference using a low-res grid, yielding initial detected boxes through concurrent DETR inferences in a single batch (❷). Frames are classified by confidence as easy (high-confidence) or hard (ambiguous) (❸); this determines whether fine subtasks are needed. Easy frames use coarse results directly. Hard frames generate fine subtasks, which are queued for refinement. The scheduler determines their execution timing (❹). For each fine subtask, hard regions are identified and subdivided into finer patches, defining its workload (❺). The scheduler selects a dynamic fine batch configuration from active subtasks, maximizing throughput within deadlines (❻). Fine subtasks use patch-level batch inference to refine detections in uncertain regions (❼). Finally, refined outputs merge with high-confidence coarse detections for the final results (❽).
  • Figure 4: Validation of CF-DETR's adaptive inference strategies A1--A3 using the DINO model on Jetson Orin with the KITTI dataset: (a) sample coarse inference output; (b) hardness identification by query confidence analysis; (c) selective fine inference efficiency, showing patch count versus object coverage; and (d) multi-level batch inference latency advantages for fine-stage processing.
  • Figure 5: Four types of coarse subtask batching scenarios ($\mathcal{B}^S$) illustrating potential impacts on task executions under the NPFP$^{**}$ framework, critical for schedulability analysis.
  • ...and 4 more figures
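
The easy/hard classification and hard-region selection described in the Figure 3 caption (steps ❸, ❺, and ❽) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the confidence thresholds `low` and `high`, the `Detection` type, the function names, and the rule that maps an ambiguous box to the coarse grid cells it overlaps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple    # (x_min, y_min, x_max, y_max) in pixels
    score: float  # detection confidence in [0, 1]


def split_by_confidence(detections, low=0.3, high=0.7):
    """Separate confident detections from ambiguous ones.

    The thresholds are hypothetical; scores below `low` are dropped.
    """
    confident = [d for d in detections if d.score >= high]
    ambiguous = [d for d in detections if low <= d.score < high]
    return confident, ambiguous


def is_hard_frame(ambiguous):
    """A frame needs a fine subtask only if something is ambiguous."""
    return len(ambiguous) > 0


def hard_cells(ambiguous, patch_size):
    """Coarse grid cells (row, col) overlapped by any ambiguous box.

    These cells are the candidates to subdivide into finer patches.
    """
    cells = set()
    for d in ambiguous:
        x0, y0, x1, y1 = d.box
        first_row, last_row = int(y0 // patch_size), int((y1 - 1) // patch_size)
        first_col, last_col = int(x0 // patch_size), int((x1 - 1) // patch_size)
        for row in range(first_row, last_row + 1):
            for col in range(first_col, last_col + 1):
                cells.add((row, col))
    return cells


def merge(confident, refined):
    """Final result: confident coarse detections plus refined detections."""
    return confident + refined


coarse_output = [
    Detection((0, 0, 64, 64), 0.9),     # large, confident object
    Detection((70, 10, 130, 60), 0.5),  # ambiguous object
    Detection((0, 0, 10, 10), 0.1),     # below `low`, dropped
]
confident, ambiguous = split_by_confidence(coarse_output)
print(is_hard_frame(ambiguous), sorted(hard_cells(ambiguous, patch_size=64)))
```

Restricting refinement to the overlapped cells keeps the fine-stage workload proportional to the uncertain area rather than to the whole frame, which is the intuition behind selective fine inference.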

Theorems & Definitions (3)

  • Lemma 1: NPFP schedulability test [BTW95, YBB10, ChBr17a]
  • Lemma 2
  • Lemma 3