Image Captioning via Dynamic Path Customization

Yiwei Ma, Jiayi Ji, Xiaoshuai Sun, Yiyi Zhou, Xiaopeng Hong, Yongjian Wu, Rongrong Ji

TL;DR

This paper introduces DTNet, a dynamic transformer for image captioning that customizes computation per input by routing data through a set of spatial and channel modeling cells. A dedicated Spatial-Channel Joint Router (SCJR) jointly models spatial and channel information to produce adaptive path weights, enabling inputs to follow different, task-tailored paths. The approach achieves state-of-the-art results on MS-COCO (Karpathy split and online server) and generalizes to Flickr8K, Flickr30K, and VQA without heavy feature ensembles. The work demonstrates that input-dependent routing can significantly improve caption discriminability and quality while maintaining efficient parameter usage.

Abstract

This paper explores a novel dynamic network for vision and language tasks, in which the inference structure is customized on the fly for each input. Most previous state-of-the-art approaches are static, hand-crafted networks, which not only rely heavily on expert knowledge but also ignore the semantic diversity of input samples, resulting in suboptimal performance. To address these issues, we propose a novel Dynamic Transformer Network (DTNet) for image captioning, which dynamically assigns customized paths to different samples, leading to discriminative yet accurate captions. Specifically, to build a rich routing space and improve routing efficiency, we introduce five types of basic cells and group them into two separate routing spaces according to their operating domains, i.e., spatial and channel. We then design a Spatial-Channel Joint Router (SCJR), which gives the model the ability to customize paths based on both the spatial and channel information of the input sample. To validate the effectiveness of the proposed DTNet, we conduct extensive experiments on the MS-COCO dataset and achieve new state-of-the-art performance on both the Karpathy split and the online test server.
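
The idea of input-dependent routing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the two toy cells, the mean-pooled descriptors, and the parameter names `w_spatial` and `w_channel` are all assumptions chosen for illustration. It only shows how a router that looks at both a spatial and a channel summary of the input might produce soft path weights that differ from sample to sample.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def path_weights(features, w_spatial, w_channel):
    """Toy joint router (hypothetical, not the paper's SCJR).

    features:  N grid vectors, each with C channels.
    w_spatial: K x N matrix, one row of scores per cell (assumed parameter).
    w_channel: K x C matrix, one row of scores per cell (assumed parameter).
    """
    n, c = len(features), len(features[0])
    # Spatial summary: one value per grid position (mean over channels).
    spatial = [sum(row) / c for row in features]
    # Channel summary: one value per channel (mean over grid positions).
    channel = [sum(row[j] for row in features) / n for j in range(c)]
    scores = [dot(ws, spatial) + dot(wc, channel)
              for ws, wc in zip(w_spatial, w_channel)]
    return softmax(scores)


def identity_cell(features):
    """Toy cell: leaves features unchanged."""
    return [list(row) for row in features]


def global_mean_cell(features):
    """Toy cell: replaces every position with the mean over all positions."""
    n, c = len(features), len(features[0])
    mean = [sum(row[j] for row in features) / n for j in range(c)]
    return [list(mean) for _ in range(n)]


def route(features, cells, w_spatial, w_channel):
    """Mix the cell outputs using the input-dependent path weights."""
    weights = path_weights(features, w_spatial, w_channel)
    outputs = [cell(features) for cell in cells]
    n, c = len(features), len(features[0])
    mixed = [[sum(w * out[i][j] for w, out in zip(weights, outputs))
              for j in range(c)] for i in range(n)]
    return mixed, weights


# Two different inputs receive different path weights from the same router.
cells = [identity_cell, global_mean_cell]
w_spatial = [[1.0, -1.0], [-1.0, 1.0]]
w_channel = [[0.5, 0.0], [0.0, 0.5]]
sample_a = [[1.0, 0.0], [0.0, 1.0]]
sample_b = [[2.0, 2.0], [0.0, 0.0]]
mixed_a, weights_a = route(sample_a, cells, w_spatial, w_channel)
mixed_b, weights_b = route(sample_b, cells, w_spatial, w_channel)
```

In this sketch the routing is soft: every cell is evaluated and the outputs are blended, so the "path" is a weighting rather than a hard selection. Whether DTNet uses soft or hard routing, and how its two routing spaces are composed, is not stated in this section.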

Paper Structure (39 sections, 15 equations, 9 figures, 15 tables)

Figures (9)

  • Figure 1: Illustration of Vanilla Transformer (static) and our DTNet (dynamic). Circles of different colors represent different cells, and arrows of different colors represent data flows of different input samples. Note that orange and green circles are for spatial and channel operations, respectively. In this example, the static model (a) tends to generate the same sentence for similar images, while the dynamic network (b) can generate informative captions through dynamic routing. More examples are shown in Fig. 5.
  • Figure 2: The framework of the proposed Dynamic Transformer Network (DTNet) for image captioning. The visual features are extracted following [jiang2020defense]. Next, stacked dynamic encoder layers encode the visual features with input-dependent architectures, which are determined by our proposed Spatial-Channel Joint Router (SCJR). Finally, the features from the encoder are fed into the decoder to generate captions word by word. Residual connections in the encoder are omitted for simplicity. Best viewed in color.
  • Figure 3: The detailed architectures of different cells in the spatial and channel routing space. BatchNorm is omitted for simplicity.
  • Figure 4: Receptive field illustration of different cells. (a) Global Modeling Cell, (b) Local Modeling Cell, (c) Axial Modeling Cell. The dark blue grid is the query grid, the light blue area is the receptive field, and the remaining white area lies outside the receptive field.
  • Figure 5: Examples of captions generated by Transformer [vaswani2017attention], $M^2$ Transformer [cornia2020meshed], RSTNet [zhang2021rstnet], and DTNet. "GT" is short for "Ground Truth".
  • ...and 4 more figures
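
The three receptive fields named in the Figure 4 caption can be sketched as sets of grid positions visible from a query position. This is an illustrative reading of the caption only: the function name `receptive_field`, the `radius` parameter, and the interpretation of "axial" as the query's row and column are assumptions, and the cells' actual window sizes are not given in this section.

```python
def receptive_field(kind, query, height, width, radius=1):
    """Return the set of (row, col) positions visible from `query`.

    kind:   "global", "local", or "axial".
    radius: half-width of the local window (hypothetical parameter).
    """
    q_row, q_col = query
    visible = set()
    for r in range(height):
        for c in range(width):
            if kind == "global":
                # Every grid position is visible.
                seen = True
            elif kind == "local":
                # Square window centered on the query.
                seen = abs(r - q_row) <= radius and abs(c - q_col) <= radius
            elif kind == "axial":
                # Assumed reading: same row or same column as the query.
                seen = r == q_row or c == q_col
            else:
                raise ValueError(f"unknown kind: {kind}")
            if seen:
                visible.add((r, c))
    return visible


# Receptive field sizes for a query at the center of a 5 x 5 grid.
sizes = {kind: len(receptive_field(kind, (2, 2), 5, 5))
         for kind in ("global", "local", "axial")}
```

Under these assumptions, the global field covers the whole grid, while the local and axial fields each cover a different subset around the query.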