Learning to Rank Patches for Unbiased Image Redundancy Reduction

Yang Luo, Zhineng Chen, Peng Zhou, Zuxuan Wu, Xieping Gao, Yu-Gang Jiang

TL;DR

LTRP tackles unbiased image redundancy reduction by formulating patch informativeness as a self-supervised, patch-level ranking problem. It leverages a pre-trained MAE at a high masking ratio to generate pseudo semantic-density scores for visible patches, then trains a listwise ranking model to order patches by informativeness without supervisory labels. The approach achieves competitive single-label accuracy, superior performance on unseen categories in multi-label/segmentation tasks, and substantial efficiency gains for Vision Transformers. Practically, LTRP reduces redundancy without introducing categorical inductive bias, enabling more robust downstream performance and faster inference/training for large ViT models.
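
The summary says only that the ranking model is trained with a listwise objective; the exact loss is not given here. As a rough illustration, the sketch below uses a ListNet-style top-one cross-entropy, which is one common listwise loss and is an assumption rather than the authors' confirmed choice. The function names `softmax` and `listnet_loss` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listnet_loss(predicted: Sequence[float], pseudo: Sequence[float]) -> float:
    """Cross-entropy between the top-one distributions of the pseudo scores
    (targets) and the predicted scores. ASSUMPTION: ListNet-style loss used
    as a stand-in for the unspecified listwise objective."""
    target = softmax(pseudo)
    pred = softmax(predicted)
    return -sum(t * math.log(p) for t, p in zip(target, pred))


# Pseudo scores for three visible patches (illustrative numbers only).
pseudo = [2.5, 0.16, 0.01]
agreeing = [3.0, 1.0, 0.0]   # same order as the pseudo scores
reversed_ = [0.0, 1.0, 3.0]  # opposite order

loss_agreeing = listnet_loss(agreeing, pseudo)
loss_reversed = listnet_loss(reversed_, pseudo)
```

Predictions that order the patches as the pseudo scores do receive a lower loss than predictions in the opposite order, which is the behavior a listwise ranking objective is meant to provide.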

Abstract

Images suffer from heavy spatial redundancy because pixels in neighboring regions are spatially correlated. Existing approaches try to overcome this limitation by discarding less meaningful image regions. However, current leading methods rely on supervisory signals, which may compel models to preserve content that aligns with labeled categories and discard content belonging to unlabeled categories. This categorical inductive bias makes these methods less effective in real-world scenarios. To address this issue, we propose a self-supervised framework for image redundancy reduction called Learning to Rank Patches (LTRP). We observe that the image reconstruction of masked image modeling models is sensitive to the removal of visible patches when the masking ratio is high (e.g., 90%). Building on this observation, we implement LTRP in two steps: inferring the semantic density score of each patch by quantifying the variation between reconstructions with and without this patch, and learning to rank the patches using these pseudo scores. The entire process is self-supervised and therefore avoids the categorical inductive bias. We design extensive experiments on different datasets and tasks. The results demonstrate that LTRP outperforms both supervised and other self-supervised methods owing to its fair assessment of image content.
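
The first step described above (scoring each visible patch by how much the reconstruction changes when that patch is removed) can be sketched as follows. This is a minimal illustration under stated assumptions: `reconstruct` is a hypothetical callable standing in for the frozen pre-trained MAE, `toy_reconstruct` is a toy replacement with made-up weights, and the mean squared difference is an assumed distance, since the abstract does not say how the variation is quantified.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def mean_squared_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """ASSUMPTION: a simple distance between two reconstructions."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pseudo_scores(
    visible: Sequence[int],
    reconstruct: Callable[[Sequence[int]], Vector],
) -> Dict[int, float]:
    """Score each visible patch by the change its removal causes relative to
    the anchor reconstruction built from all visible patches."""
    anchor = reconstruct(visible)
    scores: Dict[int, float] = {}
    for idx in visible:
        # Remove exactly one patch; all other visible patches stay in place.
        reduced = [p for p in visible if p != idx]
        scores[idx] = mean_squared_diff(anchor, reconstruct(reduced))
    return scores


def toy_reconstruct(visible: Sequence[int]) -> Vector:
    """Toy stand-in for the MAE (hypothetical): each patch contributes a fixed
    weight to a two-value 'reconstruction'."""
    weights = {0: 0.1, 1: 2.0, 2: 0.5}
    total = sum(weights[p] for p in visible)
    return [total, total / 2.0]


scores = pseudo_scores([0, 1, 2], toy_reconstruct)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # → [1, 2, 0]
```

In this toy setting, patch 1 contributes most to the reconstruction, so removing it causes the largest shift and it ranks first. With a real MAE, the scores would serve as pseudo labels for the ranking model.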

Paper Structure (16 sections, 3 equations, 8 figures, 2 tables)

This paper contains 16 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Redundancy reduction (upper): given an image, LTRP selects informative patches regardless of whether they belong to categories already learned, whereas for supervised methods (from left to right: GFNet wang2020glance, Grad-CAM selvaraju2017grad using ViT, EViT liang2022not) the preserved patches mostly lie on the learned category. Patch removal (bottom): an image and its reconstructions using MAE he2022masked with different sets of visible patches; in columns 3-5, the red patch is removed from the visible set. The resulting reconstructions show different levels of semantic shift, e.g., a missing dog tail or missing eyes.
  • Figure 2: LTRP training (left): given an image, LTRP randomly selects a set of visible patches using a high masking ratio (e.g., 90%). A pre-trained MAE (parameters frozen) is applied to obtain its reconstruction, i.e., the anchor image. Then, the visible patches (red) are removed one by one with replacement, each removal yielding a new reconstruction and its semantic density score w.r.t. the anchor image. The scores are treated as pseudo labels to train the ranking model using learning to rank. Once trained, the ranking model is kept and the MAE is discarded. LTRP inference (right): the top-k patches are selected using the trained ranking model and fed into downstream tasks.
  • Figure 3: Normalized score maps generated by our ranking model.
  • Figure 4: Classification results on a single-label dataset (upper, ImageNet-1K) and a multi-label dataset (bottom, MS-COCO) at different kr values. All methods differ only in the patch selection step. Supervised and self-supervised methods are depicted in different colors. LTRP is marked with annotated spots.
  • Figure 5: Experimental results on object detection and semantic segmentation datasets. For each dataset, we exclude the categories that overlap with ImageNet-1K (learned) and compute the metrics only on the remaining (unseen) categories, which none of the methods has been explicitly told to learn. The results illustrate the merit of LTRP in unbiased image redundancy reduction.
  • ...and 3 more figures
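
The inference step in the Figure 2 caption (select the top-k patches with the trained ranking model and pass them downstream) can be sketched as below. This is an illustrative sketch only: `select_top_k` and its `keep_ratio` parameter are hypothetical names, the scores are made-up numbers standing in for ranking-model outputs, and rounding k up and returning indices in their original order are assumptions.

```python
import math
from typing import List, Sequence


def select_top_k(scores: Sequence[float], keep_ratio: float) -> List[int]:
    """Return the indices of the highest-scoring patches.

    ASSUMPTIONS: k is derived from a keep ratio by rounding up, and the kept
    indices are returned in their original (spatial) order.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Made-up ranking-model outputs for four patches.
patch_scores = [0.1, 0.9, 0.3, 0.7]
kept = select_top_k(patch_scores, keep_ratio=0.5)
print(kept)  # → [1, 3]
```

Only the kept patch indices would then be gathered and fed to the downstream model, which is where the efficiency gain for Vision Transformers comes from: fewer input tokens.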