Democratizing Fine-grained Visual Recognition with Large Language Models

Mingxuan Liu, Subhankar Roy, Wenjing Li, Zhun Zhong, Nicu Sebe, Elisa Ricci

TL;DR

This work proposes Fine-grained Semantic Category Reasoning (FineR), which internally leverages the world knowledge of large language models (LLMs) as a proxy to reason about fine-grained category names.

Abstract

Identifying subordinate-level categories from images is a longstanding task in computer vision, referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications, since an average layperson does not excel at differentiating species of birds or mushrooms because of the subtle differences among them. A major bottleneck in developing FGVR systems is the need for high-quality paired expert annotations. To circumvent the need for expert knowledge, we propose Fine-grained Semantic Category Reasoning (FineR), which internally leverages the world knowledge of large language models (LLMs) as a proxy to reason about fine-grained category names. In detail, to bridge the modality gap between images and LLMs, we extract part-level visual attributes from images as text and feed that information to an LLM. Based on the visual attributes and its internal world knowledge, the LLM reasons about the subordinate-level category names. Our training-free FineR outperforms several state-of-the-art FGVR and language-and-vision assistant models, and shows promise for working in the wild and in new domains where gathering expert annotations is arduous.
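
The attribute-to-text step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the prompt wording, the example attributes, and the `fake_llm` stand-in (which returns a canned reply instead of calling a real model) are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


def attributes_to_text(attributes: Dict[str, str]) -> str:
    """Serialize part-level attribute/value pairs into one text description."""
    return "; ".join(f"{part}: {value}" for part, value in attributes.items())


def build_prompt(super_category: str, attributes: Dict[str, str]) -> str:
    """Compose a prompt asking an LLM for candidate fine-grained names.

    The wording is hypothetical; the source does not give the actual prompt.
    """
    return (
        f"An image shows a {super_category} with these visual attributes: "
        f"{attributes_to_text(attributes)}. "
        "List the most likely fine-grained category names, one per line."
    )


def reason_category_names(
    super_category: str,
    attributes: Dict[str, str],
    llm: Callable[[str], str],
) -> List[str]:
    """Query a caller-supplied LLM and parse one candidate name per line."""
    reply = llm(build_prompt(super_category, attributes))
    return [line.strip() for line in reply.splitlines() if line.strip()]


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned answer for illustration.
    return "Northern Cardinal\nSummer Tanager\n"


# Example attributes are made up; in practice they would come from a VQA model.
names = reason_category_names(
    "bird",
    {"wing color": "red", "beak shape": "short and conical", "crest": "present"},
    fake_llm,
)
print(names)
```

Passing the LLM as a plain callable keeps the sketch independent of any particular model or API; a real system would replace `fake_llm` with an actual model call.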

Paper Structure (40 sections, 9 equations, 24 figures, 9 tables)

Figures (24)

  • Figure 1: An overview of our proposed fine-grained visual recognition (FGVR) pipeline. Left: Given a few unlabelled images, we exploit visual question answering (VQA) and a large language model (LLM) to reason about subordinate-level category names without requiring expert knowledge. Right: At inference, we use the reasoned concepts to carry out FGVR via zero-shot semantic classification with a vision-language model (VLM).
  • Figure 2: Comparing our proposed FineR with the state-of-the-art visual question answering models: BLIP-2 (li2023blip), LLaVA (liu2023visual), LENS (berrios2023towards), and MiniGPT-4 (zhu2023minigpt).
  • Figure 3: The pipeline of the proposed Fine-grained Semantic Category Reasoning (FineR) system.
  • Figure 4: Comparison with the learning-based methods. cACC is averaged over five datasets.
  • Figure 5: Human study results. Averages computed across 30 participants are reported.
  • ...and 19 more figures
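
The inference step in the Figure 1 caption (zero-shot semantic classification against the reasoned category names) can be sketched as a nearest-name lookup in a shared embedding space. This is a hedged illustration only: the use of cosine similarity, the function names, and the toy three-dimensional embeddings are assumptions, and a real system would obtain the embeddings from a VLM's image and text encoders.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(
    image_embedding: Sequence[float],
    name_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Return the category name whose text embedding is closest to the image."""
    return max(
        name_embeddings,
        key=lambda name: cosine(image_embedding, name_embeddings[name]),
    )


# Toy embeddings, made up for illustration; not outputs of any real model.
toy_name_embeddings = {
    "Northern Cardinal": [0.9, 0.1, 0.0],
    "Summer Tanager": [0.1, 0.9, 0.0],
}
toy_image_embedding = [0.8, 0.2, 0.1]

prediction = zero_shot_classify(toy_image_embedding, toy_name_embeddings)
print(prediction)
```

Because the candidate names are plain strings, the vocabulary can be swapped without retraining, which is the sense in which the classification is zero-shot.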