Improving Multi-label Recognition using Class Co-Occurrence Probabilities

Samyak Rawlekar, Shubhang Bhatnagar, Vishnuvardhan Pogunulu Srinivasulu, Narendra Ahuja

TL;DR

This work tackles multi-label recognition under limited labeled data by exploiting object co-occurrence statistics. It introduces a two-stage approach: first, VLM-driven, prompt-based logits provide initial evidence; second, a Graph Convolutional Network refines these logits using a conditional probability prior $A$ derived from training co-occurrences, where $a_{mn} = c_{mn}/c_{mm}$. Training employs Reweighted Asymmetric Loss (RASL) to address long-tailed class distributions. Empirical results on four benchmarks in the low-data regime show consistent, substantial improvements over state-of-the-art methods, particularly for difficult-to-recognize classes, validating the value of inter-class dependencies for MLR.
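
The prior $a_{mn} = c_{mn}/c_{mm}$ can be estimated directly from multi-hot training labels. The sketch below assumes that $c_{mn}$ counts the training images containing both class $m$ and class $n$, so that $c_{mm}$ is the number of images containing class $m$ and $a_{mn}$ approximates $P(n \mid m)$. The summary does not define these counts explicitly, so this reading, the function name, and the toy labels are illustrative assumptions.

```python
import numpy as np

def conditional_probability_matrix(labels: np.ndarray) -> np.ndarray:
    """Estimate a[m, n] = c[m, n] / c[m, m] from a (num_images, num_classes) 0/1 array."""
    labels = labels.astype(np.float64)
    # c[m, n]: images containing both class m and class n (assumed definition).
    # The diagonal c[m, m] is then the number of images containing class m.
    c = labels.T @ labels
    class_counts = np.diag(c)[:, None]
    # Divide each row by its class count; rows of classes that never occur stay zero.
    a = np.zeros_like(c)
    np.divide(c, class_counts, out=a, where=class_counts > 0)
    return a

# Toy label matrix: 4 images, 3 classes (hypothetical data).
labels = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 0, 1],
])
A = conditional_probability_matrix(labels)
print(A)
```

Note that the matrix is generally asymmetric: in the toy data, class 1 never appears without class 0, so `A[1, 0]` is 1, while `A[0, 1]` is only 2/3.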

Abstract

Multi-label Recognition (MLR) involves identifying multiple objects within an image. To address the additional complexity of this problem, recent works have leveraged vision-language models (VLMs) trained on large image-text datasets. These methods learn an independent classifier for each object (class), overlooking correlations in their occurrences. Such co-occurrences can be captured from the training data as conditional probabilities between pairs of classes. We propose a framework that extends the independent classifiers by incorporating this co-occurrence information for object pairs. We use a Graph Convolutional Network (GCN) to enforce the conditional probabilities between classes by refining the initial estimates derived from image and text sources obtained using VLMs. We validate our method on four MLR datasets, where our approach outperforms all state-of-the-art methods.
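
To make the refinement step concrete, here is a minimal sketch of how a GCN could propagate initial class logits over a conditional-probability graph. It is not the authors' architecture: the two-layer depth, hidden width, residual connection, random weights, and the choice to propagate with the matrix itself rather than its transpose are all assumptions made for illustration.

```python
import numpy as np

def gcn_refine(logits, adjacency, w1, w2):
    """Refine per-class logits with two graph-convolution layers plus a residual.

    logits: (num_classes,) initial evidence, e.g. from a VLM.
    adjacency: (num_classes, num_classes) conditional-probability prior.
    w1: (1, hidden) and w2: (hidden, 1) are learnable weights (hypothetical shapes).
    """
    h = logits[:, None]                       # one scalar feature per class node
    h = np.maximum(adjacency @ h @ w1, 0.0)   # layer 1: propagate, project, ReLU
    h = adjacency @ h @ w2                    # layer 2: propagate, project back to a scalar
    return logits + h[:, 0]                   # residual keeps the initial evidence

# Hypothetical 3-class prior and initial logits.
A = np.array([
    [1.0, 2.0 / 3.0, 1.0 / 3.0],
    [1.0, 1.0, 0.5],
    [0.5, 0.5, 1.0],
])
initial_logits = np.array([2.0, -0.5, 0.1])

rng = np.random.default_rng(0)
w1 = rng.normal(scale=0.1, size=(1, 4))
w2 = rng.normal(scale=0.1, size=(4, 1))
refined_logits = gcn_refine(initial_logits, A, w1, w2)
print(refined_logits)
```

In a trained model the weights would be learned jointly with the rest of the pipeline; with a zero second-layer weight the function returns the initial logits unchanged, which shows that the residual form can only add a correction on top of the VLM evidence.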

Paper Structure (22 sections, 6 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Method Overview: Given an image with multiple objects, we extract image features and text features from the subimages using a vision-language model (CLIP). An image-text feature aggregation module (Sec. "Initial Logits Estimation") combines these features to identify all classes present in the image as the union of the classes present in the subimages, giving an initial set of image-level class logits. These logits are passed to a GCN that uses conditional probabilities between classes to refine the initial predictions (Sec. "Refinement Using Conditional Probability Prior"). We train this framework with a Reweighted Asymmetric Loss (RASL), a weighted version of ASL that reweights each class's loss contribution to address class imbalance in the training data.
  • Figure 2: Improvement in average precision ($\Delta$AP) of a class obtained by refining VLM-based initial logits to incorporate the information provided by conditional probabilities, shown as a function of the mean conditional probability of most co-occurring three classes.
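
The horizontal axis of Figure 2 can be computed from the conditional-probability matrix alone. The sketch below assumes that the "most co-occurring three classes" for a class are the three other classes with the largest conditional probability in that class's row; the caption does not state the ranking rule or the conditioning direction, so both, along with the function name and the toy matrix, are assumptions.

```python
import numpy as np

def mean_top_k_conditional(a: np.ndarray, k: int = 3) -> np.ndarray:
    """For each class (row), average its k largest off-diagonal conditional probabilities."""
    off_diag = a.astype(np.float64).copy()
    np.fill_diagonal(off_diag, -np.inf)        # exclude the class itself from the ranking
    top_k = np.sort(off_diag, axis=1)[:, -k:]  # k largest entries per row (needs > k classes)
    return top_k.mean(axis=1)

# Hypothetical 5-class conditional-probability matrix.
A = np.array([
    [1.0, 0.6, 0.3, 0.9, 0.0],
    [0.2, 1.0, 0.4, 0.1, 0.3],
    [0.5, 0.5, 1.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0],
    [0.8, 0.1, 0.7, 0.9, 1.0],
])
x_axis = mean_top_k_conditional(A, k=3)
print(x_axis)
```

Pairing each value with that class's measured $\Delta$AP would give one point per class in a plot like Figure 2.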