Superpixel Semantics Representation and Pre-training for Vision-Language Task

Siyu Zhang, Yeming Chen, Yaoru Sun, Fang Wang, Jun Yang, Lizhi Bai, Shangce Gao

TL;DR

This work tackles the semantic gap in vision-language pre-training by introducing superpixels as robust visual primitives and modeling their relations with a Multiscale Difference Graph Convolutional Network (MDGCN). It presents a differentiable superpixel representation, a multiscale graph construction, and a center-difference graph convolution to capture inter-node gradients, followed by a bottom-up, multi-level fusion of pixel- and superpixel-level features. The approach yields strong, cross-task improvements across VQA, visual reasoning, visual entailment, and image-text retrieval benchmarks, demonstrating enhanced cross-modal alignment with reduced noise. Overall, integrating superpixel-level topology with pixel-level detail provides scalable, robust VL representations for pre-training pipelines.

Abstract

The key to integrating vision-language tasks is establishing a good alignment strategy. Recently, visual semantic representation has achieved fine-grained visual understanding by dividing images into grids or patches. However, coarse-grained semantic interactions in image space should not be ignored, because neglecting them hinders the extraction of complex contextual semantic relations at scene boundaries. This paper proposes superpixels as comprehensive and robust visual primitives, which mine coarse-grained semantic interactions by clustering perceptually similar pixels and speed up the subsequent processing of primitives. To capture superpixel-level semantic features, we propose a Multiscale Difference Graph Convolutional Network (MDGCN), which parses the entire image as a fine-to-coarse visual hierarchy. To reason about actual semantic relations, we reduce potential noise interference by aggregating difference information between adjacent graph nodes. Finally, we propose a bottom-up multi-level fusion rule that avoids deviations in understanding by mining complementary spatial information at different levels. Experiments show that the proposed method effectively promotes the learning of multiple downstream tasks. Encouragingly, our method outperforms previous methods on all metrics. Our code will be released upon publication.
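The idea of aggregating difference information between adjacent graph nodes can be illustrated with a minimal NumPy sketch. This is not the paper's exact layer: the function name `difference_graph_conv`, the row normalisation, and the single linear projection `weight` are assumptions made for illustration. Each node aggregates the weighted differences $x_j - x_i$ to its neighbours instead of the raw neighbour features $x_j$, so a region with uniform features produces a zero response.

```python
import numpy as np


def difference_graph_conv(x, adj, weight):
    """Hypothetical difference-style graph convolution (illustrative only).

    x:      (N, F) node features, one row per superpixel node.
    adj:    (N, N) non-negative edge weights between nodes.
    weight: (F, F_out) linear projection applied after aggregation.
    """
    x = np.asarray(x, dtype=float)
    adj = np.asarray(adj, dtype=float)

    # Row-normalise so that each node averages over its neighbours.
    deg = adj.sum(axis=1, keepdims=True)
    norm_adj = adj / np.maximum(deg, 1e-12)

    # sum_j a_ij * (x_j - x_i) = (A x)_i - (sum_j a_ij) * x_i
    row_sum = norm_adj.sum(axis=1, keepdims=True)
    diff = norm_adj @ x - row_sum * x

    return diff @ np.asarray(weight, dtype=float)


# Two connected nodes with scalar features 1 and 3.
x = np.array([[1.0], [3.0]])
adj = np.array([[0.0, 1.0], [1.0, 0.0]])
out = difference_graph_conv(x, adj, np.eye(1))
```

In this toy case the first node receives the difference 3 - 1 and the second receives 1 - 3, which is the gradient-like signal between neighbouring nodes that the abstract refers to.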

Paper Structure (15 sections, 20 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: The framework of our scheme to integrate pixel- and superpixel-based complementary information for the VL alignment task. Note that BN denotes batch normalized input features, and “modal concat” means multi-modal concatenation.
  • Figure 2: Illustration of multi-scale superpixel graph construction. (a) and (b) are large-scale and small-scale superpixel maps, respectively. Note that superpixel region $1$ in (a) is constructed by merging adjacent superpixels (regions 2 to 5) in (b). (c) and (d) are the graph nodes generated from (a) and (b), respectively.
  • Figure 3: Comparison of superpixel-level vanilla and difference graph convolutional layers. (a) Vanilla graph convolutional layer. (b) Difference graph convolutional layer. The minus signs denote the difference operation between the central node $v_i$ (red node) and an adjacent node $v_j$ (blue node) in the blue sampling area at the superpixel level. The yellow arrows denote the corresponding gradient features and their directions. The blue dashed and solid lines represent weak (or zero) and strong edge weights, respectively.
  • Figure 4: Description of visual information at different spatial scales. (a) The scale space built from convex surfaces $\Psi$, which can be quantized as a multi-level tree. (b) The three-level parsing architecture, where the red dashed lines indicate where Tree-LSTM is executed.
  • Figure 5: Performance evaluation of the model in three states under different VL tasks. Here, “W” and “O” represent the model with and without GCN, respectively. “W+D” means that the model performs a difference convolution operation.
  • ...and 2 more figures
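
The multi-scale graph construction described in the Figure 2 caption (coarse superpixels formed by merging adjacent fine superpixels, with one graph node per region) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: the helper names `superpixel_adjacency` and `merge_superpixels`, the 4-connected neighbourhood, and the lookup-table form of the merge are all hypothetical choices for illustration.

```python
import numpy as np


def superpixel_adjacency(labels):
    """Build a (K, K) 0/1 adjacency matrix from an (H, W) label map.

    Two superpixels are linked when they share a horizontal or
    vertical pixel border (4-connectivity, assumed for this sketch).
    Labels are assumed to be integers in 0..K-1.
    """
    labels = np.asarray(labels)
    k = int(labels.max()) + 1
    adj = np.zeros((k, k), dtype=np.int64)

    # Pairs of horizontally and vertically neighbouring pixels.
    pairs = [
        (labels[:, :-1], labels[:, 1:]),
        (labels[:-1, :], labels[1:, :]),
    ]
    for a, b in pairs:
        mask = a != b  # keep only pixel pairs that cross a region border
        adj[a[mask], b[mask]] = 1
        adj[b[mask], a[mask]] = 1
    return adj


def merge_superpixels(labels, fine_to_coarse):
    """Relabel a fine map using a lookup table: fine id -> coarse id."""
    return np.asarray(fine_to_coarse)[np.asarray(labels)]


# Small-scale map with four superpixels laid out as quadrants.
fine = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [2, 2, 3, 3],
    [2, 2, 3, 3],
])
fine_adj = superpixel_adjacency(fine)

# Large-scale map: merge the two top regions and the two bottom regions.
coarse = merge_superpixels(fine, [0, 0, 1, 1])
coarse_adj = superpixel_adjacency(coarse)
```

Each scale yields its own node set and adjacency matrix, so a graph layer can be applied per scale before features are fused across levels.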