DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution

Yuzhong Zhao, Feng Liu, Yue Liu, Mingxiang Liao, Chen Gong, Qixiang Ye, Fang Wan

TL;DR

DynRefer tackles the challenge of region-level multimodal tasks requiring precise region-language descriptions by introducing dynamic resolution through nested views around the referred region. It trains a stochastic vision-language alignment that maps a region representation $x_v$ to language via decoders $D_{tag}$, $D_{rtc}$, and $D_{llm}$, while enforcing alignment with multi-resolution inputs. At inference, it performs selective multimodal referring using task priors or image priors, guided by a greedy view-selection strategy based on information gain measured by a perceptual hash, ultimately achieving state-of-the-art results on OVAD, COCO region recognition, Visual Genome dense captioning, and RefCOCOg with a compact 4.2B-parameter model. These results demonstrate the practicality and effectiveness of dynamic resolution for unifying diverse region-level multimodal tasks, reducing unnecessary computation while improving alignment with human preferences.

Abstract

One fundamental task of multimodal models is to translate referred image regions into human-preferred language descriptions. Existing methods, however, ignore the resolution adaptability needs of different tasks, which hinders them from finding precise language descriptions. In this study, we propose DynRefer, an approach that pursues high-accuracy region-level referring by mimicking the resolution adaptability of human visual cognition. During training, DynRefer stochastically aligns language descriptions of multimodal tasks with images of multiple resolutions, which are constructed by nesting a set of random views around the referred region. During inference, DynRefer performs selectively multimodal referring by sampling proper region representations for tasks from the nested views based on image and task priors. This allows the visual information used for referring to better match human preferences, thereby improving the representational adaptability of region-level multimodal models. Experiments show that DynRefer brings mutual improvement across a broad range of tasks, including region-level captioning, open-vocabulary region recognition, and attribute detection. Furthermore, DynRefer achieves state-of-the-art results on multiple region-level multimodal tasks using a single model. Code is available at https://github.com/callsys/DynRefer.
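
The nested random views mentioned in the abstract can be illustrated with a minimal sketch. It assumes each view is a box linearly interpolated between the referred region (coefficient $t=0$) and the full image ($t=1$), and that the first view is always the region itself. The function names `interpolate_box` and `sample_nested_views`, the box format `(x1, y1, x2, y2)`, and the uniform sampling of $t$ are illustrative assumptions, not the released DynRefer code.

```python
import random


def interpolate_box(region, image_size, t):
    """Return a box between the region (t=0) and the full image (t=1).

    Assumption: linear interpolation of the box corners; boxes are (x1, y1, x2, y2).
    """
    x1, y1, x2, y2 = region
    width, height = image_size
    return (
        (1 - t) * x1,               # left edge moves toward 0
        (1 - t) * y1,               # top edge moves toward 0
        (1 - t) * x2 + t * width,   # right edge moves toward the image width
        (1 - t) * y2 + t * height,  # bottom edge moves toward the image height
    )


def sample_nested_views(region, image_size, n, rng):
    """Sample n nested boxes: the region itself plus n-1 random enlargements."""
    # Sorting the coefficients makes every box contain the previous one.
    ts = [0.0] + sorted(rng.random() for _ in range(n - 1))
    return ts, [interpolate_box(region, image_size, t) for t in ts]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
ts, views = sample_nested_views((40, 30, 120, 90), (200, 150), n=3, rng=rng)
for t, box in zip(ts, views):
    print(round(t, 2), [round(v, 1) for v in box])
```

In a full pipeline each box would then be cropped from the image and resized to a common input size, so that larger views carry more context at a lower effective resolution of the region.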

Paper Structure (19 sections, 1 equation, 15 figures, 11 tables)

Figures (15)

  • Figure 1: Left: Illustration of our DynRefer approach, which dynamically determines proper region views for each task through stochastic vision-language alignment and selectively multimodal referring. Right: Performance comparison on region-level multimodal tasks.
  • Figure 2: Diagram of the proposed DynRefer. The "dynamic" capability is achieved through a stochastic vision-language alignment procedure during training (upper) and a selectively multimodal referring procedure during inference (lower). During training, the input image is cropped and resized to multiple views surrounding the referred region. The views are then randomly sampled to simulate an image with stochastic resolution. The sampled views are used to train a Refer Module (upper). During inference, the views are sampled based on task and image priors to meet the task requirements and human preference (lower).
  • Figure 3: Architecture of the proposed refer module. It comprises a stochastic multi-view embedding module and multimodal decoders ($D_{*}$). $n$ nested views are encoded as a region representation $x_v$ by the stochastic multi-view embedding module (left). The region representation $x_v$ is decoded by multimodal decoders, and then aligned to language descriptions of multimodal tasks (right).
  • Figure 4: Performance of a double-view ($n=2$) DynRefer model on region-level multimodal tasks (e.g., open-vocabulary attribute detection on OVAD, region recognition on COCO, dense captioning on VG-COCO, and region-level captioning on VG) under interpolation coefficients $\mathbf{t} = [t_1, t_2] \in [0, 1]^2$. The first view is a fixed one ($t_1=0$) and the second is randomly selected or fixed.
  • Figure 5: Visualization of selected views using image prior.
  • ...and 10 more figures
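
The greedy, image-prior view selection described in the TL;DR and pictured in Figure 5 can be sketched roughly as follows. This is a sketch under stated assumptions, not the authors' procedure: it uses a simple average hash as the perceptual hash, treats the Hamming distance to the closest already-selected view as a stand-in for information gain, and always starts from view 0 (the region itself). The names `average_hash`, `hamming`, and `greedy_select` are hypothetical, and views are assumed to be small grayscale grids of equal size.

```python
def average_hash(gray):
    """Average hash: one bit per pixel, set when the pixel exceeds the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]


def hamming(a, b):
    """Number of positions where two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_select(views, k):
    """Pick up to k view indices, starting from view 0, by hash novelty.

    Assumption: the gain of a candidate is its Hamming distance to the
    closest view that has already been selected.
    """
    hashes = [average_hash(v) for v in views]
    selected = [0]
    while len(selected) < min(k, len(views)):
        remaining = [i for i in range(len(views)) if i not in selected]
        best = max(
            remaining,
            key=lambda i: min(hamming(hashes[i], hashes[j]) for j in selected),
        )
        selected.append(best)
    return selected


# Toy 2x2 "views": the second is a near copy of the first, the third differs.
region = [[0, 0], [255, 255]]
near_copy = [[10, 5], [250, 240]]
different = [[255, 0], [255, 0]]
chosen = greedy_select([region, near_copy, different], k=2)
print(chosen)
```

With these toy inputs the near copy hashes identically to the region, so the selection skips it in favor of the view that adds new structure. Real views would first be resized to a small fixed grid (for example 8x8) before hashing.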