
Knowledge Distillation via Query Selection for Detection Transformer

Yi Liu, Luting Wang, Zongheng Tang, Yue Liao, Yifan Sun, Lijun Zhang, Si Liu

TL;DR

A novel Group Query Selection strategy is introduced, which diverges from traditional query selection in DETR distillation by segmenting queries based on their Generalized Intersection over Union with ground truth objects, thereby uncovering valuable hard-negative queries for distillation.

Abstract

Transformers have revolutionized the object detection landscape by introducing DETRs, acclaimed for their simplicity and efficacy. Despite their advantages, the substantial size of these models poses significant challenges for practical deployment, particularly in resource-constrained environments. This paper addresses the challenge of compressing DETR by leveraging knowledge distillation, a technique that holds promise for maintaining model performance while reducing size. A critical aspect of DETRs' performance is their reliance on queries to interpret object representations accurately. Traditional distillation methods often focus exclusively on positive queries, identified through bipartite matching, neglecting the rich information present in hard-negative queries. Our visual analysis indicates that hard-negative queries, which focus on foreground elements, are crucial for enhancing distillation outcomes. To this end, we introduce a novel Group Query Selection strategy, which diverges from traditional query selection in DETR distillation by segmenting queries based on their Generalized Intersection over Union (GIoU) with ground truth objects, thereby uncovering valuable hard-negative queries for distillation. Furthermore, we present the Knowledge Distillation via Query Selection for DETR (QSKD) framework, which incorporates Attention-Guided Feature Distillation (AGFD) and Local Alignment Prediction Distillation (LAPD). These components optimize the distillation process by focusing on the most informative aspects of the teacher model's intermediate features and output. Our comprehensive experimental evaluation on the MS-COCO dataset demonstrates the effectiveness of our approach, significantly improving average precision (AP) across various DETR architectures without incurring substantial computational costs. Specifically, the AP of Conditional DETR ResNet-18 increased from 35.8 to 39.9.

Paper Structure (26 sections, 8 equations, 4 figures, 8 tables)


Figures (4)

  • Figure 1: Visualization of the attentive regions (the first row) and the predicted boxes (the second row) of N different queries. Each of these queries is assigned to the same ground truth object (the bed) because its GIoU with the bed is the highest among all objects. The queries are then sorted by their GIoU with the bed. The first column features the positive prediction from bipartite matching, while the second column shows the top $10$ negative predictions with the highest GIoU. The last two columns present the $10$ predictions with the lowest GIoU. Based on the hypothesis that queries attending to the foreground regions contain valuable information for distillation, we propose to select the queries with higher GIoU while discarding the ones with lower GIoU.
  • Figure 2: The framework of our Query Selection Knowledge Distillation (QSKD): (a) Group Query Selector: First, positive queries are identified through bipartite matching with ground truth boxes. Next, we rank all negative queries by their GIoU with ground truth boxes and select the hard-negative queries whose GIoU exceeds the threshold. These are then combined with the positive queries to form the final selection. (b) Our distillation architecture, which uses the Group Query Selector (GQS), includes Attention-Guided Feature Distillation (AGFD) and Local Alignment Prediction Distillation (LAPD). Both the teacher and student models are Detection Transformers. For simplicity, the supervision used in the original training phase is omitted.
  • Figure 3: We visualize image features from detectors with no encoder, $1$ encoder layer, and $6$ encoder layers. The visualization highlights a transition in activation patterns when a single encoder layer is added, while the difference between $1$ and $6$ layers is less pronounced.
  • Figure 4: Visualization of the attentive regions of different queries. The first column features the positive prediction from bipartite matching, while the second and third columns show predictions randomly selected from the top 20 negative predictions with the highest GIoU. The last two columns present predictions randomly selected from the 15 predictions with the lowest GIoU.
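
The Group Query Selector described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `giou` and `select_queries`, the `(x1, y1, x2, y2)` box format, and the default threshold of 0.5 are all assumptions made for the example, and the positive indices are taken as given rather than computed by bipartite matching.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def giou(a: Box, b: Box) -> float:
    """Generalized IoU of two axis-aligned boxes, a value in [-1, 1]."""
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    if union <= 0.0 or hull <= 0.0:
        return 0.0  # degenerate boxes: fall back to a neutral score
    return inter / union - (hull - union) / hull


def select_queries(
    pred_boxes: Sequence[Box],
    gt_boxes: Sequence[Box],
    positive_idx: Sequence[int],
    threshold: float = 0.5,  # hypothetical value, not taken from the paper
) -> Tuple[List[int], List[int]]:
    """Return (positive indices, hard-negative indices ranked by GIoU)."""
    positives = sorted(set(positive_idx))
    positive_set = set(positives)
    scored = []
    for q, box in enumerate(pred_boxes):
        if q in positive_set:
            continue
        # Score each negative query by its best GIoU over all ground truth boxes.
        best = max((giou(box, gt) for gt in gt_boxes), default=-1.0)
        if best > threshold:
            scored.append((best, q))
    scored.sort(reverse=True)  # highest GIoU first
    return positives, [q for _, q in scored]


if __name__ == "__main__":
    gts = [(0.0, 0.0, 10.0, 10.0)]
    preds = [
        (0.0, 0.0, 10.0, 10.0),    # matched positive
        (1.0, 1.0, 10.0, 10.0),    # overlaps the object heavily
        (20.0, 20.0, 30.0, 30.0),  # far from the object (background)
        (0.0, 0.0, 10.0, 8.0),     # overlaps the object heavily
    ]
    print(select_queries(preds, gts, positive_idx=[0]))
```

In this toy input, the two boxes that overlap the object heavily are kept as hard negatives, while the distant box is discarded. The selected set passed on to distillation would be the union of the two returned lists.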