LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content

Qihao Zhao, Yalun Dai, Hao Li, Wei Hu, Fan Zhang, Jun Liu

TL;DR

The paper tackles long-tail recognition by generating diverse tail-class data through a generative-content pipeline that leverages LMMs and LLMs, then fine-tunes a CLIP-based model using BalanceMix to integrate generated and original data. A key innovation is the iterative evaluation module, which uses CLIP feedback to refine tail descriptions and regenerate higher-quality images, guided by class-specific feature templates. The approach achieves state-of-the-art performance on ImageNet-LT, Places-LT, and iNaturalist 2018, with ablations confirming the value of iterative refinement and BalanceMix. This work demonstrates the practical potential of combining large multimodal models with principled data mixing for robust long-tail recognition in vision tasks.

Abstract

Long-tail recognition is challenging because it requires the model to learn good representations from tail categories and address imbalances across all categories. In this paper, we propose a novel generative and fine-tuning framework, LTGC, to handle long-tail recognition via leveraging generated content. Firstly, inspired by the rich implicit knowledge in large-scale models (e.g., large language models, LLMs), LTGC leverages the power of these models to parse and reason over the original tail data to produce diverse tail-class content. We then propose several novel designs for LTGC to ensure the quality of the generated data and to efficiently fine-tune the model using both the generated and original data. The visualization demonstrates the effectiveness of the generation module in LTGC, which produces accurate and diverse tail data. Additionally, the experimental results demonstrate that our LTGC outperforms existing state-of-the-art methods on popular long-tailed benchmarks.

Paper Structure (15 sections, 4 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Top: In this example, ChatGPT serves as the LMM. For the Trogon Rufus category, asking ChatGPT "What species is in the picture?" did not yield the expected answer because the question is too complex. Middle and bottom: In contrast, when given simpler prompts such as "Please describe the image." or "Please describe the distinctive features of Trogon Rufus.", ChatGPT answered accurately.
  • Figure 2: Overall framework of LTGC. LTGC first employs LMMs to analyze the existing tail data and obtain a list of existing tail-class descriptions. It then inputs this list into LLMs to identify features absent from the tail classes and employs the T2I model to generate diverse images. In addition, the self-reflection and iterative evaluation modules ensure the diversity and quality of the tail data. Finally, LTGC employs the BalanceMix module to fine-tune CLIP's visual encoder with the extended and original data.
  • Figure 3: Example of the instruction for LMMs. When both images from tail classes and textual templates are input into LMMs, textual descriptions corresponding to the images can be obtained. By repeatedly performing this operation on the training data, we convert abstract image descriptions into concrete textual descriptions. Finally, we acquire the current textual descriptions list corresponding to each class.
  • Figure 4: Example of the instruction for LLMs. LTGC inputs the existing textual descriptions list to LLMs, which continually extends it with new distinctive features and scene information. During multiple iterations, LTGC generates a new extended textual descriptions list for each class.
  • Figure 5: Illustration of the proposed iterative evaluation module. This module detects lower-quality images using the similarity score $\mathcal{S}$ computed between each image and its corresponding class feature template. The textual descriptions corresponding to lower-quality images are then fed back into LLMs for refinement. Finally, the refined textual descriptions are passed to the T2I model for regeneration.
  • ...and 2 more figures
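
The filtering step described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes the similarity score $\mathcal{S}$ is a cosine similarity between an image embedding and a class feature template embedding; the text above does not give the exact definition, so this is an assumption. The function names, the toy 2-D embeddings, and the threshold of 0.5 are hypothetical and chosen only for illustration. In practice the embeddings would come from CLIP encoders.

```python
import numpy as np

def cosine_scores(image_feats, template_feat):
    """Cosine similarity between each image embedding and one class template.

    Assumption: the similarity score S is a cosine similarity; the source
    text does not spell out its exact form.
    """
    image_feats = np.asarray(image_feats, dtype=float)
    template_feat = np.asarray(template_feat, dtype=float)
    # Normalize rows of the image matrix and the template vector to unit length.
    img_unit = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    tpl_unit = template_feat / np.linalg.norm(template_feat)
    return img_unit @ tpl_unit

def split_by_quality(image_feats, template_feat, threshold=0.5):
    """Return indices to keep, indices to send back for refinement, and scores.

    The threshold value is hypothetical, not taken from the paper.
    """
    scores = cosine_scores(image_feats, template_feat)
    keep = np.flatnonzero(scores >= threshold)
    refine = np.flatnonzero(scores < threshold)
    return keep, refine, scores

# Toy embeddings standing in for encoder outputs (hypothetical values).
template = np.array([1.0, 0.0])
feats = np.array([[2.0, 0.0],   # same direction as the template
                  [0.0, 3.0],   # orthogonal to the template
                  [1.0, 1.0]])  # 45 degrees from the template
keep, refine, scores = split_by_quality(feats, template, threshold=0.5)
```

In this sketch, the descriptions behind the images indexed by `refine` would be the ones returned to the LLM for refinement and then regenerated by the T2I model, while the images indexed by `keep` would join the extended training set.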