LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension
Xianglong Shi, Silin Cheng, Sirui Zhao, Yunhan Jiang, Enhong Chen, Yang Liu, Sebastien Ourselin
TL;DR
This work tackles Weakly-Supervised Generalized Referring Expression Comprehension (WGREC), where an expression may refer to zero, one, or multiple objects and only image–text supervision is available. The authors introduce LIHE, a two-stage framework. The first stage, Referential Decoupling, uses VLM-driven, prompt-based decomposition to produce target-specific sub-expressions. The second stage, Referent Grounding, localizes each sub-expression with a novel hybrid Euclidean–hyperbolic similarity (HEMix). HEMix combines Euclidean precision with hyperbolic hierarchy modeling to prevent semantic collapse while preserving fine-grained distinctions, yielding strong results on gRefCOCO and Ref-ZOM and improving standard REC benchmarks. The approach demonstrates the value of integrating structured geometric priors into vision–language grounding and provides a credible weakly supervised baseline for generalized referring expression tasks. The code is publicly available. One limitation is the reliance on VLMs with relatively slow inference, which suggests using LIHE as a teacher model that generates pseudo-labels for smaller, faster student models.
Abstract
Existing Weakly-Supervised Referring Expression Comprehension (WREC) methods, while effective, are fundamentally limited by a one-to-one mapping assumption, hindering their ability to handle expressions corresponding to zero or multiple targets in realistic scenarios. To bridge this gap, we introduce the Weakly-Supervised Generalized Referring Expression Comprehension task (WGREC), a more practical paradigm that handles expressions with variable numbers of referents. However, extending WREC to WGREC presents two fundamental challenges: supervisory signal ambiguity, where weak image-level supervision is insufficient for training a model to infer the correct number and identity of referents, and semantic representation collapse, where standard Euclidean similarity forces hierarchically related concepts into non-discriminative clusters, blurring categorical boundaries. To tackle these challenges, we propose a novel WGREC framework named Linguistic Instance-Split Hyperbolic-Euclidean (LIHE), which operates in two stages. The first stage, Referential Decoupling, predicts the number of target objects and decomposes the complex expression into simpler sub-expressions. The second stage, Referent Grounding, then localizes these sub-expressions using HEMix, our innovative hybrid similarity module that combines the precise alignment capabilities of Euclidean proximity with the hierarchical modeling strengths of hyperbolic geometry. This hybrid approach effectively prevents semantic collapse while preserving fine-grained distinctions between related concepts. Extensive experiments demonstrate that LIHE establishes the first effective weakly supervised baseline for WGREC on gRefCOCO and Ref-ZOM, while HEMix achieves consistent improvements on standard REC benchmarks, improving IoU@0.5 by up to 2.5%. The code is available at https://anonymous.4open.science/r/LIHE.
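
The abstract does not give the exact form of HEMix, so the following is only an illustrative sketch of how a Euclidean term and a hyperbolic term might be mixed into one similarity score. Everything in it is an assumption made for illustration rather than the authors' implementation: the function names, the use of cosine similarity as the Euclidean term, the Poincaré-ball model with curvature parameter `c`, and the fixed mixing weight `alpha`.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine_similarity(u, v, eps=1e-12):
    # Euclidean-side term (assumption: cosine similarity of raw embeddings).
    return dot(u, v) / max(norm(u) * norm(v), eps)

def expmap0(v, c=1.0, eps=1e-12):
    # Exponential map at the origin: sends a Euclidean vector into the
    # Poincare ball of curvature -c (result has norm < 1/sqrt(c)).
    n = norm(v)
    if n < eps:
        return [0.0 for _ in v]
    scale = math.tanh(math.sqrt(c) * n) / (math.sqrt(c) * n)
    return [scale * a for a in v]

def poincare_distance(x, y, c=1.0, eps=1e-12):
    # Geodesic distance between two points inside the Poincare ball.
    diff2 = sum((a - b) ** 2 for a, b in zip(x, y))
    denom = max((1 - c * dot(x, x)) * (1 - c * dot(y, y)), eps)
    return math.acosh(1 + 2 * c * diff2 / denom) / math.sqrt(c)

def hybrid_similarity(u, v, alpha=0.5, c=1.0):
    # Hypothetical mix: alpha weights the Euclidean term, (1 - alpha) the
    # hyperbolic term (negated distance, so larger means more similar).
    euclid = cosine_similarity(u, v)
    hyper = -poincare_distance(expmap0(u, c), expmap0(v, c), c)
    return alpha * euclid + (1 - alpha) * hyper

if __name__ == "__main__":
    text = [0.3, 0.4]          # toy sub-expression embedding
    region_a = [0.3, 0.5]      # toy region embedding close to the text
    region_b = [-0.4, 0.1]     # toy region embedding far from the text
    print(hybrid_similarity(text, region_a))
    print(hybrid_similarity(text, region_b))
```

In this toy setup, a region embedding close to the text embedding scores higher than a distant one. In a real system the weight and curvature would more plausibly be learned or tuned, and the embeddings would come from the visual and language encoders.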
