Logo-VGR: Visual Grounded Reasoning for Open-world Logo Recognition

Zichen Liang, Jingjing Fei, Jie Wang, Zheming Yang, Changqing Li, Pei Wu, Minghui Qiu, Fei Yang, Xialei Liu

TL;DR

This work formulates open-world logo recognition as a comparison task and introduces Logo-VGR, a two-stage, domain-adaptive multimodal reasoning framework. Stage 1 reinforces logo perception through a domain-specific detection task, while Stage 2 guides robust reasoning via Logo-Guided Visual Grounded Reasoning with coordinate clues and IoU-based rewards, supplemented by a Cognitive Trajectory Reward judged by an LLM. Empirical results show Logo-VGR significantly improves generalization to unseen brands, particularly in OOD settings, outperforming strong baselines by up to ~14 points in OOD accuracy. The approach demonstrates the value of domain-aware reasoning and grounded evidence in open-world commercial scenarios, with practical impact on intelligent product moderation.
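
The IoU-based reward mentioned above can be illustrated with a minimal sketch. It assumes boxes are given as `(x1, y1, x2, y2)` corner coordinates and that the reward is simply the IoU between a predicted logo box and a reference box. The function names `iou` and `grounding_reward` are hypothetical, and the exact reward shaping used by Logo-VGR is not specified in this summary.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - inter

    return inter / union if union > 0 else 0.0


def grounding_reward(predicted: Box, reference: Box) -> float:
    """Hypothetical reward: the raw IoU, a value in [0, 1]."""
    return iou(predicted, reference)
```

A reward of this form gives partial credit for boxes that overlap the logo only loosely, which is one plausible reason to prefer it over a binary hit-or-miss signal.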

Abstract

Recent advances in multimodal large language models (MLLMs) have been primarily evaluated on general-purpose benchmarks, while their applications in domain-specific scenarios, such as intelligent product moderation, remain underexplored. To address this gap, we introduce a benchmark for open-world logo recognition, a core challenge in product moderation. Traditional logo recognition methods rely on memorizing representations of tens of thousands of brands, which is impractical in real-world settings. In contrast, our proposed method, Logo-VGR, generalizes to large-scale brand recognition with supervision from only a small subset of brands. Specifically, we reformulate logo recognition as a comparison-based task, requiring the model to match product images with candidate logos rather than directly generating brand labels. We further observe that existing models tend to overfit by memorizing brand distributions instead of learning robust multimodal reasoning, which results in poor performance on unseen brands. To overcome this limitation, Logo-VGR introduces a new paradigm of domain-specific multimodal reasoning: Logo Perception Grounding injects domain knowledge, and Logo-Guided Visual Grounded Reasoning enhances the model's reasoning capability. Experimental results show that Logo-VGR outperforms strong baselines by nearly 10 points in OOD settings, demonstrating superior generalization.
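
To make the comparison-based formulation concrete, the sketch below builds one multiple-choice sample from a product image and a pool of brands. It is an illustration under stated assumptions, not the benchmark's actual construction: the `ComparisonSample` type, the `build_sample` helper, the default of four candidates, and the choice to always include the true brand among the candidates are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComparisonSample:
    product_image: str           # path or ID of the product image (assumed)
    candidate_brands: List[str]  # brands whose logos are shown as candidates
    answer_index: int            # position of the true brand in the list


def build_sample(
    product_image: str,
    true_brand: str,
    brand_pool: List[str],
    num_candidates: int = 4,
    rng: Optional[random.Random] = None,
) -> ComparisonSample:
    """Pair a product image with the true brand plus random distractors."""
    rng = rng or random.Random(0)

    # Distractors are drawn from the pool (assumed to hold unique names).
    distractors = [b for b in brand_pool if b != true_brand]
    if len(distractors) < num_candidates - 1:
        raise ValueError("brand pool too small for the requested candidates")

    candidates = rng.sample(distractors, num_candidates - 1) + [true_brand]
    rng.shuffle(candidates)  # avoid a fixed answer position

    return ComparisonSample(
        product_image=product_image,
        candidate_brands=candidates,
        answer_index=candidates.index(true_brand),
    )
```

Because the model only has to return an index into the candidate list, the same sample format works for brands that never appeared during training, which is what allows evaluation on OOD brands.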

Paper Structure

This paper contains 25 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Accuracy of different methods on the ID and OOD benchmarks. Here, zero-shot refers to the Qwen2.5-VL-3B baseline. With SFT, the model improves its performance on ID data but generalizes worse. In contrast, Logo-VGR leverages process supervision to encourage correct reasoning, thereby achieving stronger generalization.
  • Figure 2: Overview of the Logo Recognition Benchmark. To prevent the model from memorizing brand information, we reformulate the original memory-based task into a comparison-based one, where the model is required to compare the features of product images with those of candidate logos to produce the answer.
  • Figure 3: Statistics of brand distribution. A small number of top brands account for the majority of occurrences. We categorize the top brands as ID brands and the remaining brands as OOD brands.
  • Figure 4: Illustration of the Logo-VGR framework. Logo-VGR is a two-stage, domain-adaptive multimodal reasoning pipeline. In the first stage, domain knowledge is enhanced via logo detection to improve low-level logo perception. In the second stage, Logo-Guided Visual Grounded Reasoning is introduced to prevent shortcut reasoning based on logo-style memorization and to guide the model toward a more principled and generalizable multimodal reasoning paradigm.
  • Figure 5: Visualization result of Logo-VGR. The red boxes indicate the predicted coordinates generated by the model. Inference is performed by comparing the logo features extracted from the product image (shown in green) with those of the candidate logos (shown in blue).
  • ...and 4 more figures