LIHE: Linguistic Instance-Split Hyperbolic-Euclidean Framework for Generalized Weakly-Supervised Referring Expression Comprehension

Xianglong Shi, Silin Cheng, Sirui Zhao, Yunhan Jiang, Enhong Chen, Yang Liu, Sebastien Ourselin

TL;DR

This work tackles WGREC, where expressions may refer to zero, one, or multiple objects using only image–text supervision. The authors introduce LIHE, a two-stage framework: Referential Decoupling uses a VLM-driven prompt-based decomposition to produce target-specific sub-expressions, followed by Referent Grounding that localizes each sub-expression with a novel hybrid Euclidean–hyperbolic similarity (HEMix). HEMix leverages Euclidean precision and hyperbolic hierarchy to prevent semantic collapse while preserving fine-grained distinctions, yielding strong results on gRefCOCO and Ref-ZOM and improving standard REC benchmarks. The approach demonstrates the value of integrating structured geometric priors into vision–language grounding and provides a credible weakly supervised baseline for generalized referring expression tasks, with code available publicly. Limitations include reliance on VLMs with relatively slow inference, suggesting LIHE as a teacher model for pseudo-label generation in smaller, faster students.

Abstract

Existing Weakly-Supervised Referring Expression Comprehension (WREC) methods, while effective, are fundamentally limited by a one-to-one mapping assumption, hindering their ability to handle expressions corresponding to zero or multiple targets in realistic scenarios. To bridge this gap, we introduce the Weakly-Supervised Generalized Referring Expression Comprehension task (WGREC), a more practical paradigm that handles expressions with variable numbers of referents. However, extending WREC to WGREC presents two fundamental challenges: supervisory signal ambiguity, where weak image-level supervision is insufficient for training a model to infer the correct number and identity of referents, and semantic representation collapse, where standard Euclidean similarity forces hierarchically-related concepts into non-discriminative clusters, blurring categorical boundaries. To tackle these challenges, we propose a novel WGREC framework named Linguistic Instance-Split Hyperbolic-Euclidean (LIHE), which operates in two stages. The first stage, Referential Decoupling, predicts the number of target objects and decomposes the complex expression into simpler sub-expressions. The second stage, Referent Grounding, then localizes these sub-expressions using HEMix, our innovative hybrid similarity module that synergistically combines the precise alignment capabilities of Euclidean proximity with the hierarchical modeling strengths of hyperbolic geometry. This hybrid approach effectively prevents semantic collapse while preserving fine-grained distinctions between related concepts. Extensive experiments demonstrate LIHE establishes the first effective weakly supervised WGREC baseline on gRefCOCO and Ref-ZOM, while HEMix achieves consistent improvements on standard REC benchmarks, improving IoU@0.5 by up to 2.5%. The code is available at https://anonymous.4open.science/r/LIHE.
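
As a rough illustration of what a hybrid Euclidean-hyperbolic similarity can look like, the sketch below mixes cosine similarity with a negative geodesic distance in the Lorentz model. It is not the paper's HEMix module: the function names, the exponential-map lifting, the unit curvature, and the mixing weight `alpha` are all assumptions made for illustration.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map_origin(v):
    """Lift a Euclidean vector onto the unit hyperboloid (curvature -1 assumed)."""
    norm = math.sqrt(sum(a * a for a in v))
    if norm == 0.0:
        return [1.0] + [0.0] * len(v)
    scale = math.sinh(norm) / norm
    return [math.cosh(norm)] + [scale * a for a in v]

def cosine_similarity(u, v):
    """Standard cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def hyperbolic_similarity(u, v):
    """Negative geodesic distance between the lifted points (closer = larger)."""
    inner = lorentz_inner(exp_map_origin(u), exp_map_origin(v))
    # Clamp to 1.0 so rounding error cannot push acosh outside its domain.
    return -math.acosh(max(1.0, -inner))

def hybrid_similarity(u, v, alpha=0.5):
    """Convex mix of the two scores; alpha is a hypothetical mixing weight."""
    return alpha * cosine_similarity(u, v) + (1.0 - alpha) * hyperbolic_similarity(u, v)
```

With `alpha=1.0` the score reduces to plain cosine similarity, and with `alpha=0.0` it depends only on hyperbolic distance, so the weight controls how much each geometry contributes.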

Paper Structure

This paper contains 43 sections, 4 theorems, 32 equations, 10 figures, 13 tables.

Key Result

Proposition 1

Let $\sigma^{2}_{\mathrm{E}}\!=\!\operatorname{Var}[\varepsilon_{\mathrm{E}}]$, $\sigma^{2}_{\mathrm{H}}\!=\!\operatorname{Var}[\varepsilon_{\mathrm{H}}]$ and $\rho\!=\!\operatorname{Corr}[\varepsilon_{\mathrm{E}},\varepsilon_{\mathrm{H}}]$. If $\rho<1$, the mean‑squared error of the hybrid estimator attains its minimum at $\alpha^{\star}=\frac{(\sigma_{\mathrm{E}}^{2}+\rho\sigma_{\mathrm{E}}\sigma
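
The statement above is cut off in this listing, so its closed form is not reproduced here. The sketch below only illustrates the general idea of variance reduction numerically. It assumes the hybrid estimator is `alpha * s_E + (1 - alpha) * s_H` with zero-mean errors, which may differ from the paper's convention, and the function names and example values are hypothetical.

```python
import math

def hybrid_mse(alpha, var_e, var_h, rho):
    """MSE of alpha*s_E + (1-alpha)*s_H when both errors have zero mean."""
    cov = rho * math.sqrt(var_e * var_h)
    return (alpha ** 2) * var_e + ((1.0 - alpha) ** 2) * var_h \
        + 2.0 * alpha * (1.0 - alpha) * cov

def best_alpha_on_grid(var_e, var_h, rho, steps=1000):
    """Grid search over alpha in [0, 1]; returns the weight with the lowest MSE."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda a: hybrid_mse(a, var_e, var_h, rho))

# Example values (assumed): Euclidean error variance 1.0, hyperbolic 2.0,
# and a correlation of 0.3 between the two error terms.
alpha_star = best_alpha_on_grid(1.0, 2.0, 0.3)
mixed = hybrid_mse(alpha_star, 1.0, 2.0, 0.3)
print(alpha_star, mixed)
```

For these example values the best weight lies strictly between 0 and 1, and the mixed error is lower than that of either estimator alone, which is the behavior a variance-reduction result of this kind describes.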

Figures (10)

  • Figure 1: Limitations of current WREC methods. The ground truth is denoted by red bounding boxes, whereas green bounding boxes denote the predictions. Current WREC methods always select only the best anchor as output, failing to handle No-target and Multi-target cases (e.g., no red bounding box and two red bounding boxes).
  • Figure 2: The overall framework of LIHE. (a) Referential Decoupling: a VLM decomposes the referring expression into distinct short phrases, one for each target. (b) Referent Grounding: each phrase is processed by a textual encoder, and the image by a visual encoder. The model then filters out low-value anchors and returns the best-matching one for bounding box prediction. The referent grounding stage is weakly supervised by the anchor-based contrastive loss.
  • Figure 3: A simple illustration of (a) flat Euclidean space and (b) the hyperbolic Lorentz manifold in 3-dimensional space [li2024hyperbolic]. In Euclidean space, all nodes occupy a single, undifferentiated hierarchy, so parent and child entities share the same geometric scale. In contrast, the negative curvature of hyperbolic space naturally organizes nodes into concentric hierarchies: parent nodes reside closer to the manifold’s apex, while their children are pushed farther outward, and different children at the same level are repelled from one another.
  • Figure 4: Successful cases (green background) and failure cases (red background). The ground truth is denoted by red bounding boxes, whereas green bounding boxes denote the predictions.
  • Figure 5: We randomly select 100 samples from the dataset and visualize their similarity scores. From left to right: (1) the similarity scores of top-scoring anchors (dark blue dots), (2) the similarity scores of second-best anchors (light blue squares), and (3) an overlaid view combining both. The dashed vertical segments connect the top and second-best scores for each sample, illustrating that second-best anchors in many cases have higher similarity than top anchors in other samples.
  • ...and 5 more figures
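
The grounding step described in the captions for Figures 1 and 2 can be sketched as follows: each sub-expression selects its own best-matching anchor, so an expression split into zero, one, or several phrases yields zero, one, or several boxes. This is a simplified sketch rather than the authors' code. The function name, the score layout (one row of anchor scores per sub-expression), and the box tuples are assumptions, and the VLM-based decoupling stage and the anchor filtering step are not modeled.

```python
def ground_sub_expressions(similarity_rows, anchor_boxes):
    """Return one box per sub-expression by taking its highest-scoring anchor.

    similarity_rows: list of score lists, one list per sub-expression.
    anchor_boxes: list of (x1, y1, x2, y2) tuples, one per anchor.
    """
    predictions = []
    for scores in similarity_rows:
        # Index of the anchor with the highest similarity for this phrase.
        best = max(range(len(scores)), key=scores.__getitem__)
        predictions.append(anchor_boxes[best])
    return predictions

# Hypothetical anchors and scores for an expression split into two phrases.
boxes = [(0, 0, 10, 10), (20, 20, 40, 40), (5, 5, 15, 15)]
multi_target = ground_sub_expressions([[0.1, 0.9, 0.2], [0.7, 0.3, 0.1]], boxes)
no_target = ground_sub_expressions([], boxes)
```

Selecting the maximum per phrase avoids relying on one score threshold shared across samples. The Figure 5 caption suggests such a threshold would be unreliable, since second-best scores in some samples exceed top scores in others.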

Theorems & Definitions (7)

  • Proposition 1: Variance reduction
  • Proposition 2: Hyperboloid Membership
  • proof
  • Proposition 3: Monotonicity of the Lorentzian Inner Product
  • proof
  • Proposition 4: Bias--Variance Reduction
  • proof
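
The proposition statements are not reproduced in this listing, so the following sketch only checks two standard facts about the Lorentz model that the titles of Propositions 2 and 3 point to. It assumes unit negative curvature, and all function names are hypothetical. A point lies on the hyperboloid when its Lorentzian self-inner product equals -1, and geodesic distance grows as the Lorentzian inner product between two points decreases.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum of the remaining products."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift_to_hyperboloid(v):
    """Attach the time component x0 = sqrt(1 + ||v||^2) to a spatial vector."""
    return [math.sqrt(1.0 + sum(a * a for a in v))] + list(v)

def on_hyperboloid(x, tol=1e-9):
    """Membership test for the upper sheet: <x, x>_L = -1 and x0 > 0."""
    return x[0] > 0 and abs(lorentz_inner(x, x) + 1.0) < tol

def geodesic_distance(x, y):
    """Distance arccosh(-<x, y>_L), clamped to guard against rounding error."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

origin = lift_to_hyperboloid([0.0, 0.0])
near = lift_to_hyperboloid([1.0, 0.0])
far = lift_to_hyperboloid([2.0, 0.0])
```

Here `far` has a smaller (more negative) inner product with `origin` than `near` does, and correspondingly a larger geodesic distance.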