Beyond Skip Connections: Top-Down Modulation for Object Detection

Abhinav Shrivastava, Rahul Sukthankar, Jitendra Malik, Abhinav Gupta

TL;DR

This paper addresses the challenge of detecting objects that require fine-grained details alongside strong contextual cues. It introduces Top-Down Modulation (TDM), an end-to-end network that adds a top-down contextual pathway with lateral connections to a base bottom-up ConvNet, integrated into Faster R-CNN and evaluated on COCO. With VGG16, ResNet101, and InceptionResNetv2 backbones, TDM reaches 28.6, 35.2, and 37.3 AP respectively on COCO test-dev, improving over the Faster R-CNN baselines, with pronounced gains for small objects and localization accuracy. The approach is architecture-agnostic, does not rely on multi-scale processing or iterative box refinement, and offers a principled mechanism to combine high-level context with low-level detail for robust object detection.

Abstract

In recent years, we have seen tremendous progress in the field of object detection. Most of the recent improvements have been achieved by targeting deeper feedforward networks. However, many hard object categories such as bottle, remote, etc. require representation of fine details and not just coarse, semantic representations. But most of these fine details are lost in the early convolutional layers. What we need is a way to incorporate finer details from lower layers into the detection architecture. Skip connections have been proposed to combine high-level and low-level features, but we argue that selecting the right features from low-level layers requires top-down contextual information. Inspired by the human visual pathway, in this paper we propose top-down modulations as a way to incorporate fine details into the detection framework. Our approach supplements the standard bottom-up, feedforward ConvNet with a top-down modulation (TDM) network, connected using lateral connections. These connections are responsible for the modulation of lower layer filters, and the top-down network handles the selection and integration of contextual information and low-level features. The proposed TDM architecture provides a significant boost on the COCO test-dev benchmark, achieving 28.6 AP for VGG16, 35.2 AP for ResNet101, and 37.3 AP for the InceptionResNetv2 network, without any bells and whistles (e.g., multi-scale, iterative box refinement, etc.).
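To make the data flow described in the abstract concrete, here is a minimal NumPy sketch of a single top-down step: a lateral connection transforms a bottom-up feature map, the coarser top-down features are upsampled to match, and the two are combined into a new top-down feature map. It is an illustration under stated assumptions, not the authors' implementation: the function names (`conv1x1`, `upsample2x`, `tdm_step`), the use of 1x1 convolutions, nearest-neighbour upsampling, concatenation as the merge operation, and all channel sizes are hypothetical choices made for brevity.

```python
import numpy as np


def conv1x1(x, w):
    """1x1 convolution (channel mixing) followed by ReLU.

    x has shape (C_in, H, W) and w has shape (C_out, C_in).
    """
    return np.maximum(np.einsum("oc,chw->ohw", w, x), 0.0)


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of the two spatial dimensions."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def tdm_step(c_low, t_high, w_lat, w_top, w_out):
    """One hypothetical top-down modulation step.

    c_low:  bottom-up features at the finer resolution, shape (C, H, W)
    t_high: top-down features at the coarser resolution, shape (C', H/2, W/2)
    """
    lateral = conv1x1(c_low, w_lat)   # lateral module on bottom-up features
    top = conv1x1(t_high, w_top)      # transform the top-down context
    if top.shape[1:] != lateral.shape[1:]:
        top = upsample2x(top)         # match the finer spatial resolution
    merged = np.concatenate([lateral, top], axis=0)  # assumed merge: concat
    return conv1x1(merged, w_out)     # new top-down features at fine scale


# Illustrative shapes only; these sizes are assumptions, not from the paper.
rng = np.random.default_rng(0)
c_low = rng.standard_normal((64, 16, 16))
t_high = rng.standard_normal((128, 8, 8))
w_lat = rng.standard_normal((32, 64)) * 0.1
w_top = rng.standard_normal((32, 128)) * 0.1
w_out = rng.standard_normal((48, 64)) * 0.1

t_low = tdm_step(c_low, t_high, w_lat, w_top, w_out)
print(t_low.shape)  # (48, 16, 16)
```

The output keeps the spatial resolution of the lower layer while mixing in information from the coarser, more semantic layer. In a detector, such a step could be repeated down the network, with the final top-down feature map fed to the detection head.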

Paper Structure

This paper contains 25 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Detecting objects such as the bottle or remote shown above requires low-level finer details as well as high-level contextual information. In this paper, we propose a top-down modulation (TDM) network, which can be used with any bottom-up, feedforward ConvNet. We show that the features learnt by our approach lead to significantly improved object detection.
  • Figure 2: An example of a Top-Down Modulation (TDM) network, which is integrated with the bottom-up network through lateral connections. $\mathbf{C}_i$ are bottom-up, feedforward feature blocks, and $\mathbf{L}_i$ are the lateral modules, which transform low-level features for the top-down contextual pathway. Finally, $\mathbf{T}_{j,i}$ represent the flow of top-down information from index $j$ to $i$. Individual components are explained in Figures 3 and 4.
  • Figure 3: The basic building blocks of the Top-Down Modulation Network (detailed in Section \ref{sec:arch_details}).
  • Figure 4: An example with details of top-down modules and lateral connections. Please see Section \ref{sec:arch_details} for details of the architecture.
  • Figure 5: Improvement in AP over the Faster R-CNN baseline. Base networks: (left) VGG16, (middle) ResNet101, and (right) InceptionResNetv2. Improved performance for almost all categories emphasizes the effectiveness of Top-Down Modulation for object detection. (Best viewed digitally.)
  • ...and 1 more figure