Semantic Graph Consistency: Going Beyond Patches for Regularizing Self-Supervised Vision Transformers

Chaitanya Devaguptapu, Sumukh Aithal, Shrinivas Ramasubramanian, Moyuru Yamada, Manohar Kaul

TL;DR

A novel Semantic Graph Consistency (SGC) module regularizes ViT-based SSL methods by making effective use of patch tokens, yielding a 5-10% increase in linear-evaluation performance when labeled data is limited.

Abstract

Self-supervised learning (SSL) with vision transformers (ViTs) has proven effective for representation learning, as demonstrated by its strong performance on various downstream tasks. Despite these successes, existing ViT-based SSL architectures do not fully exploit the ViT backbone, particularly its patch tokens. In this paper, we introduce a novel Semantic Graph Consistency (SGC) module to regularize ViT-based SSL methods and leverage patch tokens effectively. We reconceptualize images as graphs, with image patches as nodes, and infuse relational inductive biases into the SSL framework through explicit message passing with Graph Neural Networks. Our SGC loss acts as a regularizer: it uses the underexploited patch tokens of ViTs to construct a graph and enforces consistency between graph features across multiple views of an image. Extensive experiments on various datasets, including ImageNet, RESISC, and Food-101, show that our approach significantly improves the quality of learned representations, resulting in a 5-10% increase in performance when limited labeled data is used for linear evaluation. These experiments, coupled with a comprehensive set of ablations, demonstrate the promise of our approach in various settings.

Paper Structure

This paper contains 33 sections, 7 equations, 2 figures, and 11 tables.

Figures (2)

  • Figure 1: Conceptual representation of the proposed Semantic Graph Consistency (SGC). Unlike traditional methods, which emphasize class token representations and discard patch tokens for contrastive learning, SGC leverages patch tokens to accentuate relational and semantic information. SGC constructs a graph from the patch tokens and imposes graph-level consistency, thus significantly enhancing representational quality within the contrastive learning framework.
  • Figure 2: Overview of the proposed Semantic Graph Consistency for contrastive learning with a Vision Transformer backbone. (EMA denotes exponential moving average.)
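
The idea summarized in the abstract and in Figure 1 (patch tokens become graph nodes, messages are passed between connected nodes, and the pooled graph features of two views are pulled together) can be illustrated with a small sketch. This is not the authors' implementation. Several choices here are assumptions made only for illustration: edges come from cosine-similarity k-nearest neighbors, the message-passing step is a parameter-free neighborhood average rather than a learned GNN layer, pooling is a plain mean, and the consistency term is one minus cosine similarity. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_graph(tokens, k):
    """For each patch token, list the indices of its k most similar other tokens.

    Assumption: k-NN over cosine similarity stands in for the graph construction.
    """
    neighbors = []
    for i, token in enumerate(tokens):
        scored = [(cosine(token, other), j)
                  for j, other in enumerate(tokens) if j != i]
        scored.sort(reverse=True)
        neighbors.append([j for _, j in scored[:k]])
    return neighbors


def message_pass(tokens, neighbors):
    """One message-passing step: average each node with its neighbors.

    Assumption: a parameter-free stand-in for a learned GNN layer.
    """
    updated = []
    for i, token in enumerate(tokens):
        group = [token] + [tokens[j] for j in neighbors[i]]
        updated.append([sum(column) / len(group) for column in zip(*group)])
    return updated


def graph_feature(tokens, k=2):
    """Build the graph, run one message-passing step, and mean-pool the nodes."""
    nodes = message_pass(tokens, knn_graph(tokens, k))
    return [sum(column) / len(nodes) for column in zip(*nodes)]


def sgc_style_loss(tokens_a, tokens_b, k=2):
    """Consistency penalty between the graph features of two views.

    Assumption: 1 - cosine similarity, so 0 means the two features are aligned.
    """
    return 1.0 - cosine(graph_feature(tokens_a, k), graph_feature(tokens_b, k))


# Toy patch tokens for two augmented views of one image (4 patches, 2 dimensions).
view_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
view_b = [[0.8, 0.2], [1.0, 0.1], [0.2, 0.9], [0.0, 1.0]]
loss = sgc_style_loss(view_a, view_b, k=1)
```

In a training setup of the kind the abstract describes, a term like this would be added to the base SSL objective as a regularizer. How the term is weighted and the exact form of the loss are not given in this excerpt, so they are left out of the sketch.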