YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information

Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao

TL;DR

The paper tackles information loss in deep networks by introducing Programmable Gradient Information (PGI) and Generalized Efficient Layer Aggregation Network (GELAN), forming the YOLOv9 detector. PGI uses an auxiliary reversible branch and multi-level auxiliary information to deliver reliable gradients without extra inference cost, enabling effective training for lightweight models. GELAN generalizes ELAN with flexible computational blocks, achieving strong parameter efficiency and speed. On MS COCO with train-from-scratch training, YOLOv9 demonstrates Pareto-optimal performance across models, outperforming state-of-the-art detectors while using fewer parameters and FLOPs. These innovations offer practical improvements for real-time object detection across devices while maintaining high accuracy.
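To make the term "Pareto-optimal" concrete: a model is on the Pareto front if no other model is both at least as small and at least as accurate, and strictly better on one of the two. The sketch below is only an illustration of that definition. The `Model` type, the function names, and every number in `candidates` are placeholders invented for this example; they are not results from the paper.

```python
from __future__ import annotations

from typing import NamedTuple


class Model(NamedTuple):
    name: str
    params_m: float  # parameter count in millions (lower is better)
    ap: float        # detection accuracy, e.g. COCO AP (higher is better)


def dominates(a: Model, b: Model) -> bool:
    """True if `a` is no worse than `b` on both axes and strictly better on one."""
    no_worse = a.params_m <= b.params_m and a.ap >= b.ap
    strictly_better = a.params_m < b.params_m or a.ap > b.ap
    return no_worse and strictly_better


def pareto_front(models: list[Model]) -> list[Model]:
    """Return the models that no other model dominates, sorted by size."""
    front = [m for m in models if not any(dominates(o, m) for o in models)]
    return sorted(front, key=lambda m: m.params_m)


# Placeholder numbers for illustration only; NOT results from the paper.
candidates = [
    Model("A", 5.0, 40.0),
    Model("B", 10.0, 45.0),
    Model("C", 12.0, 44.0),  # dominated by B (larger and less accurate)
    Model("D", 30.0, 50.0),
    Model("E", 30.0, 48.0),  # dominated by D (same size, less accurate)
]
front_names = [m.name for m in pareto_front(candidates)]
print(front_names)
```

The same check extends to FLOPs or latency by adding another "lower is better" field to `dominates`.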

Abstract

Today's deep learning methods focus on how to design the most appropriate objective functions so that the prediction results of the model can be closest to the ground truth. Meanwhile, an appropriate architecture that can facilitate acquisition of enough information for prediction has to be designed. Existing methods ignore the fact that when input data undergoes layer-by-layer feature extraction and spatial transformation, a large amount of information will be lost. This paper delves into the important issues of data loss when data is transmitted through deep networks, namely information bottleneck and reversible functions. We propose the concept of programmable gradient information (PGI) to cope with the various changes required by deep networks to achieve multiple objectives. PGI can provide complete input information for the target task to calculate the objective function, so that reliable gradient information can be obtained to update network weights. In addition, a new lightweight network architecture, Generalized Efficient Layer Aggregation Network (GELAN), is designed based on gradient path planning. GELAN's architecture confirms that PGI has gained superior results on lightweight models. We verified the proposed GELAN and PGI on object detection with the MS COCO dataset. The results show that GELAN uses only conventional convolution operators to achieve better parameter utilization than the state-of-the-art methods developed based on depth-wise convolution. PGI can be used for a variety of models from lightweight to large. It can be used to obtain complete information, so that train-from-scratch models can achieve better results than state-of-the-art models pre-trained using large datasets. The comparison results are shown in Figure 1. The source code is available at: https://github.com/WongKinYiu/yolov9.
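The two notions the abstract names, the information bottleneck and reversible functions, can be stated compactly. The block below is a sketch in notation assumed for this note rather than quoted from the abstract: $X$ is the input, $f_\theta$ and $g_\phi$ are two consecutive transformations in the network, $r_\psi$ is a transformation with inverse $v_\zeta$, and $I(\cdot,\cdot)$ denotes mutual information. The first line is the standard data processing inequality: each further transformation can only keep or lose information about $X$. The second line says that an invertible transformation loses none.

```latex
% Information bottleneck (data processing inequality)
I(X, X) \;\ge\; I\bigl(X, f_\theta(X)\bigr) \;\ge\; I\bigl(X, g_\phi(f_\theta(X))\bigr)

% Reversible function: r_\psi has an inverse v_\zeta
X = v_\zeta\bigl(r_\psi(X)\bigr)
\;\Longrightarrow\;
I(X, X) = I\bigl(X, r_\psi(X)\bigr) = I\bigl(X, v_\zeta(r_\psi(X))\bigr)
```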

Paper Structure

This paper contains 25 sections, 6 equations, 7 figures, and 11 tables.

Figures (7)

  • Figure 1: Comparisons of the real-time object detectors on the MS COCO dataset. The GELAN and PGI-based object detection method surpasses all previous train-from-scratch methods in terms of object detection performance. In terms of accuracy, the new method outperforms RT DETR [lv2023detrs] pre-trained with a large dataset, and it also outperforms the depth-wise convolution-based design YOLO MS [chen2023yolo] in terms of parameter utilization.
  • Figure 2: Visualization of the output feature maps of different network architectures with randomly initialized weights: (a) input image, (b) PlainNet, (c) ResNet, (d) CSPNet, and (e) proposed GELAN. From the figure, we can see that in different architectures, the information provided to the objective function to calculate the loss is lost to varying degrees, and our architecture can retain the most complete information and provide the most reliable gradient information for calculating the objective function.
  • Figure 3: PGI and related network architectures and methods. (a) Path Aggregation Network (PAN) [liu2018path], (b) Reversible Columns (RevCol) [cai2022reversible], (c) conventional deep supervision, and (d) our proposed Programmable Gradient Information (PGI). PGI is mainly composed of three components: (1) main branch: the architecture used for inference, (2) auxiliary reversible branch: generates reliable gradients to supply the main branch for backward transmission, and (3) multi-level auxiliary information: controls the main branch's learning of plannable multi-level semantic information.
  • Figure 4: The architecture of GELAN: (a) CSPNet [wang2020cspnet], (b) ELAN [wang2023designing], and (c) proposed GELAN. We imitate CSPNet and extend ELAN into GELAN, which can support any computational blocks.
  • Figure 5: Comparison of state-of-the-art real-time object detectors. The methods participating in the comparison all use ImageNet pre-trained weights, including RT DETR [lv2023detrs], RTMDet [lyu2022rtmdet], and PP-YOLOE [xu2022pp], etc. The YOLOv9 that uses the train-from-scratch method clearly surpasses the performance of other methods.
  • ...and 2 more figures
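The Figure 4 caption describes GELAN as an ELAN-style aggregation that accepts any computational block. A minimal sketch of that wiring follows, under stated assumptions: it is not the authors' implementation, it uses a flat list of per-channel values as a toy stand-in for a feature map, and the names `gelan_style_block`, `Features`, and `Block` are hypothetical. The idea shown is only the data flow: split the channels, route one part through a chain of pluggable blocks, concatenate every intermediate output, then apply a transition.

```python
from typing import Callable, List, Sequence

Features = List[float]  # toy stand-in for a feature map: one value per channel
Block = Callable[[Features], Features]  # any computational block (assumption)


def gelan_style_block(
    x: Features, blocks: Sequence[Block], transition: Block
) -> Features:
    """Split, run a chain of arbitrary blocks, aggregate all outputs, transition."""
    half = len(x) // 2
    kept, routed = x[:half], x[half:]  # channel split
    outputs = [kept, routed]
    current = routed
    for block in blocks:  # each block is pluggable
        current = block(current)
        outputs.append(current)  # keep every intermediate result
    merged = [v for part in outputs for v in part]  # channel concatenation
    return transition(merged)  # in a real network, e.g. a 1x1 convolution


# Hypothetical blocks for illustration; a real model would use conv-based blocks.
def double(f: Features) -> Features:
    return [2.0 * v for v in f]


def add_one(f: Features) -> Features:
    return [v + 1.0 for v in f]


def identity(f: Features) -> Features:
    return list(f)


result = gelan_style_block([1.0, 2.0, 3.0, 4.0], [double, add_one], identity)
print(result)
```

Swapping `double` and `add_one` for other callables changes the computation without touching the aggregation logic, which is the generalization the caption points to.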