Progressive Language-guided Visual Learning for Multi-Task Visual Grounding

Jingchao Wang, Hong Wang, Wenlong Zhang, Kunhua Ji, Dingjiang Huang, Yefeng Zheng

TL;DR

PLVL is a Progressive Language-guided Visual Learning framework for multi-task visual grounding. It both mines the inherent feature representation of the visual modality and progressively injects language information to help learn language-related visual features.

Abstract

Multi-task visual grounding (MTVG) includes two sub-tasks: Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES). Existing representative approaches generally follow a pipeline with three core stages: independent feature extraction for the visual and linguistic modalities, a cross-modal interaction module, and independent prediction heads for the different sub-tasks. Despite achieving remarkable performance, this line of research has two limitations: 1) the linguistic content is not fully injected into the entire visual backbone to make visual feature extraction more effective, and an extra cross-modal interaction module is required; 2) the relationship between the REC and RES tasks is not effectively exploited to support collaborative prediction for more accurate output. To address these problems, we propose a Progressive Language-guided Visual Learning framework for multi-task visual grounding, called PLVL, which not only finely mines the inherent feature representation of the visual modality itself but also progressively injects language information to help learn language-related visual features. In this manner, PLVL needs no additional cross-modal fusion module while still fully introducing language guidance. Furthermore, our analysis shows that the localization center for REC helps, to some extent, to identify the object region to be segmented for RES. Inspired by this observation, we design a multi-task head that makes collaborative predictions for the two sub-tasks. Extensive experiments on several benchmark datasets show that PLVL clearly outperforms representative methods on both REC and RES. Code: https://github.com/jcwang0602/PLVL
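To make the idea of injecting language into the visual backbone without a separate fusion module more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: it assumes that injection can be approximated by letting visual tokens attend over the concatenation of visual and language tokens inside an ordinary attention layer. The function name `language_guided_attention`, the weight matrices, and all shapes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def language_guided_attention(visual, language, w_q, w_k, w_v):
    """Hypothetical language-guided attention layer (illustrative assumption).

    visual:   (Nv, d) visual tokens
    language: (Nl, d) language tokens
    Returns the updated visual tokens (Nv, d) and the attention weights
    (Nv, Nv + Nl).
    """
    # Keys and values come from both modalities, so each visual token can
    # read language information inside the backbone layer itself.
    context = np.concatenate([visual, language], axis=0)  # (Nv + Nl, d)
    q = visual @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    # Residual connection keeps the original visual features.
    return visual + weights @ v, weights


rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(16, d))    # e.g. 16 image patches
language = rng.normal(size=(5, d))   # e.g. 5 word tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

out, weights = language_guided_attention(visual, language, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (16, 8) (16, 21)
```

Stacking several such layers would inject language guidance at multiple depths, which is one plausible reading of "progressive"; the actual local and global block designs are described in the paper's figures, not here.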

Paper Structure

This paper contains 15 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Comparison of different pipelines for multi-task visual grounding: (a) Visual and language features are extracted separately, and cross-modal fusion is then performed; (b) Additional modules (marked in orange) are inserted after the original network layers to inject the language features into the visual backbone; (c) Our progressive language-guided visual learning framework with a collaborative multi-task head, which directly adjusts the original network layers to progressively introduce language guidance.
  • Figure 2: Flowchart of the proposed Progressive Language-guided Visual Learning (PLVL) framework, which consists of three parts: a linguistic backbone, a language-guided visual backbone, and a collaborative multi-task head. The detailed structures of the local block, the global block, and the multi-task head are shown in Fig. 3(a), Fig. 3(b), and Fig. 4, respectively.
  • Figure 3: The structures of the local block and the global block.
  • Figure 4: The structure of the collaborative multi-task head.
  • Figure 5: Qualitative results on RefCOCO. From left to right: the input image, the ground truth for REC and RES, the predictions of PLVL, and the score map of the REC sub-task.
  • ...and 2 more figures