Democratizing Fine-grained Visual Recognition with Large Language Models

Mingxuan Liu; Subhankar Roy; Wenjing Li; Zhun Zhong; Nicu Sebe; Elisa Ricci

TL;DR

This work proposes Fine-grained Semantic Category Reasoning (FineR), which uses the world knowledge of large language models (LLMs) as a proxy for expert knowledge to reason about fine-grained category names.

Abstract

Identifying subordinate-level categories from images is a longstanding task in computer vision and is referred to as fine-grained visual recognition (FGVR). It has tremendous significance in real-world applications, since an average layperson does not excel at differentiating species of birds or mushrooms due to the subtle differences among them. A major bottleneck in developing FGVR systems is the need for high-quality paired expert annotations. To circumvent the need for expert knowledge, we propose Fine-grained Semantic Category Reasoning (FineR), which leverages the world knowledge of large language models (LLMs) as a proxy to reason about fine-grained category names. In detail, to bridge the modality gap between images and the LLM, we extract part-level visual attributes from images as text and feed that information to an LLM. Based on the visual attributes and its internal world knowledge, the LLM reasons about the subordinate-level category names. Our training-free FineR outperforms several state-of-the-art FGVR and language-and-vision assistant models, and shows promise for working in the wild and in new domains where gathering expert annotations is arduous.
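The step of passing part-level attributes to an LLM as text can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the prompt wording, and the reply format are all assumptions made for illustration, and the call to an actual LLM is left out.

```python
import re
from typing import Dict, List


def build_reasoning_prompt(super_category: str,
                           attributes: Dict[str, str],
                           num_candidates: int = 3) -> str:
    """Turn part-level attribute descriptions into a text prompt for an LLM.

    The prompt wording is hypothetical; the paper's own prompts may differ.
    """
    # One line per attribute, sorted so the prompt is deterministic.
    lines = [f"- {part}: {desc}" for part, desc in sorted(attributes.items())]
    return (
        f"An image shows a {super_category} with these visual attributes:\n"
        + "\n".join(lines)
        + f"\nList the {num_candidates} most likely fine-grained "
        + f"{super_category} category names."
    )


def parse_candidates(llm_reply: str) -> List[str]:
    """Split a newline- or comma-separated reply into unique, cleaned names."""
    names: List[str] = []
    for chunk in re.split(r"[\n,]", llm_reply):
        # Drop a leading bullet ("-", "*") or list number ("1.", "2)").
        name = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", chunk).strip()
        if name and name not in names:
            names.append(name)
    return names


# Attribute descriptions would come from a VQA model; these are made up.
prompt = build_reasoning_prompt(
    "bird", {"wing color": "blue", "beak shape": "short"}
)
candidates = parse_candidates("1. Blue Jay\n2. Steller's Jay, Blue Jay")
```

Here `prompt` would be sent to whichever LLM is available, and `candidates` shows how a free-form reply might be reduced to a list of candidate category names.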

Paper Structure (40 sections, 9 equations, 24 figures, 9 tables)

Introduction
Methodology
Preliminaries
FineR: Fine-grained Semantic Category Reasoning system
Translating Useful Visual Information from Visual to Textual Modality
Fine-grained Semantic Category Reasoning
Multi-modal Classifier Construction
Inference
Experiments
Benchmarking on Fine-grained Datasets
Benchmarking on the Novel Pokemon Dataset
Ablation Study
Related Work
Conclusion
Acknowledgments
...and 25 more sections

Figures (24)

Figure 1: An overview of our proposed fine-grained visual recognition (FGVR) pipeline. Left: Given a few unlabelled images, we exploit visual question answering (VQA) and a large language model (LLM) to reason about subordinate-level category names without requiring expert knowledge. Right: At inference, we utilize the reasoned concepts to carry out FGVR via zero-shot semantic classification with a vision-language model (VLM).
Figure 2: Comparing our proposed FineR with the state-of-the-art visual question answering models: BLIP-2 (Li et al., 2023), LLaVA (Liu et al., 2023), LENS (Berrios et al., 2023), and MiniGPT-4 (Zhu et al., 2023).
Figure 3: The pipeline of the proposed Fine-grained Semantic Category Reasoning (FineR) system.
Figure 4: Comparison with the learning-based methods. cACC is averaged over five datasets.
Figure 5: Human study results. Averages computed across 30 participants are reported.
...and 19 more figures
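
The inference step described in the Figure 1 caption (zero-shot semantic classification over the reasoned category names) can be sketched as a nearest-neighbour lookup in a shared embedding space. This is only an illustration under stated assumptions: the embeddings below are toy vectors rather than outputs of a real VLM, and `cosine` and `classify` are hypothetical helper names, not part of the paper.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(image_emb: Sequence[float],
             class_embs: Dict[str, Sequence[float]]) -> str:
    """Return the category name whose text embedding is closest to the image."""
    return max(class_embs, key=lambda name: cosine(image_emb, class_embs[name]))


# Toy stand-ins (assumption): in practice a VLM's text encoder would embed the
# reasoned category names and its image encoder would embed the query image.
class_embs = {
    "Blue Jay": [0.9, 0.1, 0.0],
    "Steller's Jay": [0.1, 0.9, 0.0],
}
image_emb = [0.8, 0.2, 0.1]

predicted = classify(image_emb, class_embs)
```

Because the label set is just a dictionary of names and vectors, newly reasoned category names can be added without any training, which is the property the training-free setting relies on.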
