CR-QAT: Curriculum Relational Quantization-Aware Training for Open-Vocabulary Object Detection

Jinyeong Park, Donghwa Kang, Brent ByungHoon Kang, Hyeongboo Baek, Jibum Kim

TL;DR

This work proposes curriculum relational quantization-aware training (CR-QAT), an integrated framework combining stage-by-stage optimization with relational knowledge distillation, and demonstrates that CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings.

Abstract

Open-vocabulary object detection (OVOD) enables novel category detection via vision-language alignment, but massive model sizes hinder deployment on resource-constrained devices. While quantization offers practical compression, we reveal that naive extreme low-bit (e.g., 4-bit) quantization severely degrades fine-grained vision-language alignment and distorts inter-region relational structures. To address this, we propose curriculum relational quantization-aware training (CR-QAT), an integrated framework combining stage-by-stage optimization with relational knowledge distillation. Within CR-QAT, curriculum QAT (CQAT) mitigates error accumulation by partitioning the model for progressive quantization, ensuring stable optimization via error isolation. Concurrently, text-centric relational KD (TRKD) is applied to task-relevant modules. By constructing text-anchored pairwise similarity matrices, TRKD comprehensively transfers the teacher's multi-dimensional relational knowledge. Experiments on LVIS and COCO zero-shot benchmarks demonstrate that CR-QAT consistently outperforms existing QAT baselines under aggressive low-bit settings, achieving relative AP improvements of up to 38.9% and 40.9%, respectively.
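The text-anchored pairwise similarity matrix behind TRKD can be illustrated with a minimal NumPy sketch. It rests on several assumptions that the abstract does not confirm: the text embedding is stacked with the region embeddings of one category, all rows are L2-normalized, teacher and student share the same text embedding, and the two matrices are compared with a mean squared error. The function names `unified_similarity` and `relational_loss` are hypothetical, and the paper's actual loss may differ.

```python
import numpy as np


def unified_similarity(text_emb, region_embs):
    """Cosine similarity matrix over [text anchor; regions] (assumed construction).

    Row/column 0 holds region-text similarities; the remaining block holds
    region-region similarities.
    """
    z = np.vstack([text_emb[None, :], region_embs])      # (n + 1, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)      # L2-normalize each row
    return z @ z.T                                        # (n + 1, n + 1)


def relational_loss(text_emb, teacher_regions, student_regions):
    """Mean squared difference between student and teacher matrices (assumed loss)."""
    s_teacher = unified_similarity(text_emb, teacher_regions)
    s_student = unified_similarity(text_emb, student_regions)
    return float(np.mean((s_student - s_teacher) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=8)                              # one category's text embedding
    teacher = rng.normal(size=(4, 8))                      # 4 regions from the FP32 teacher
    student = teacher + 0.3 * rng.normal(size=(4, 8))      # stand-in for quantization noise
    print(relational_loss(text, teacher, student))
```

Because the text anchor occupies the first row and column, a single matrix carries both the region-text alignment and the inter-region structure, so one loss term penalizes distortion in either.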

Paper Structure (29 sections, 10 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Impact of 4-bit quantization on YOLO-World with the Objects365v2 dataset. (a) Confidence scores derived from the similarity between each region embedding and the text embedding of category "Lamp" ($\bar{P}$: mean over positive regions). (b) Pairwise cosine similarity matrix among positive region embeddings within the same category ($r$: Pearson correlation with FP32). (c) Quantitative comparison of embedding distortion relative to FP32. The horizontal and vertical axes measure the MAE of inter-region relational structure and region-text alignment, respectively. Proximity to the origin indicates less distortion. Our method simultaneously minimizes both, most closely approaching FP32.
  • Figure 2: Overview of the proposed CR-QAT framework. (Red) blocks denote quantized and learnable modules, and (blue) blocks denote full-precision and frozen modules. (a) Stage 1: the backbone ($M_1$) is quantized with $\mathcal{L}_{\text{feat}}$ supervision from the full-precision teacher, while the neck-head remains frozen for error isolation. (b) Stage 2: the neck-head ($M_2$) is additionally quantized, supervised by both $\mathcal{L}_{\text{feat}}$ and $\mathcal{L}_{\text{TRKD}}$. (c) Feature distillation aligns the student's multi-scale backbone features $f_1^S(x)$ to those of the teacher $f_1^T(x)$. (d) TRKD groups region embeddings by text query $\mathbf{t}_c$ and constructs a unified pairwise similarity matrix $\mathbf{S}_c$ to transfer region-text and region-region relationships.
  • Figure 3: Correlation between embedding-level and confidence-level inter-region relation distortion relative to FP32. Each point represents an (image, category) group with $\geq$10 anchors. (Large markers) denote the mean over all groups. ($\rho$) denotes the Spearman correlation.
  • Figure 4: Qualitative comparison on YOLO-World-L (4-4-8, Ch-T-H). (Top) Detection results. (Bottom) Inter-region similarity heatmap of average pairwise cosine similarity among same-class anchors. QAT distorts both detection and similarity patterns of FP32, whereas CR-QAT restores them.
  • Figure 5: Effect of curriculum stages. 3-stage uses 1/3 data per stage. LVIS miniVal AP is reported.
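
The inter-region distortion measures named in the Figure 1 caption (pairwise cosine similarity, MAE relative to FP32, and Pearson correlation $r$ with FP32) can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: the function names are hypothetical, and comparing only the off-diagonal upper triangle is an assumption made to skip the trivially equal diagonal.

```python
import numpy as np


def cosine_matrix(embs):
    """Pairwise cosine similarity among region embeddings, shape (n, n)."""
    z = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return z @ z.T


def relation_distortion(fp32_embs, quant_embs):
    """Return (MAE, Pearson r) between the two similarity matrices.

    Only entries above the diagonal are compared (an assumption), so at least
    three regions are needed for the correlation to be defined.
    """
    a = cosine_matrix(fp32_embs)
    b = cosine_matrix(quant_embs)
    iu = np.triu_indices(a.shape[0], k=1)                  # strictly upper triangle
    mae = float(np.mean(np.abs(a[iu] - b[iu])))
    r = float(np.corrcoef(a[iu], b[iu])[0, 1])
    return mae, r


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fp32 = rng.normal(size=(6, 16))                        # 6 positive regions, FP32 model
    quant = fp32 + 0.5 * rng.normal(size=(6, 16))          # stand-in for a 4-bit model
    print(relation_distortion(fp32, quant))
```

An MAE near zero together with $r$ near 1 would indicate that the quantized model preserves the FP32 relational structure.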