Enhancing Lightweight Vision Language Models through Group Competitive Learning for Socially Compliant Navigation

Xinyu Zhang, Atsushi Konno, Toshihiko Yamasaki, Ling Xiao

Abstract

Social robot navigation requires a sophisticated integration of scene semantics and human social norms. Scaling up Vision Language Models (VLMs) generally improves reasoning and decision-making capabilities for socially compliant navigation. However, increased model size incurs substantial computational overhead, limiting suitability for real-time robotic deployment. Conversely, lightweight VLMs enable efficient inference but often exhibit weaker reasoning and decision-making performance in socially complex environments. Achieving both strong reasoning ability and efficiency remains an open challenge. To bridge this gap, we propose Group Competitive Learning (GCL), a strategy designed to amplify the capabilities of lightweight VLMs. Our strategy introduces the Group Competitive Objective (GCO) to harmonize global semantics with distributional regularization, alongside Asymmetric Group Optimization (AGO) to explore the upper limits of model performance. Empirical evaluations on social navigation benchmarks demonstrate that GCL significantly elevates VLM performance. Specifically, GCL enables the Qwen2.5-VL-3B learner and the Qwen3-VL-4B guide to achieve F1 scores of 0.968 and 0.914, respectively, representing 40% and 12% improvements over vanilla supervised fine-tuning (SFT). Notably, under vanilla SFT, the 3B model initially trails the 8B model (F1: 0.692 vs. 0.755). With GCL, however, the 3B model outperforms the 8B baseline model by 28%. These results suggest that GCL provides an effective solution for achieving both high accuracy and computational efficiency in real-world deployment.
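
The percentages quoted for the 3B learner can be checked from the F1 scores given in the abstract. The short sketch below assumes the improvements are relative gains, (new - baseline) / baseline; the helper name `relative_improvement` is illustrative and not taken from the paper. The guide's 12% figure is not recomputed because its SFT baseline is not stated here.

```python
def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, as a fraction of the baseline."""
    return (new - baseline) / baseline


# F1 scores quoted in the abstract.
f1_3b_sft = 0.692  # Qwen2.5-VL-3B under vanilla SFT
f1_8b_sft = 0.755  # 8B model under vanilla SFT
f1_3b_gcl = 0.968  # Qwen2.5-VL-3B learner trained with GCL

gain_over_own_sft = relative_improvement(f1_3b_gcl, f1_3b_sft)
gain_over_8b_sft = relative_improvement(f1_3b_gcl, f1_8b_sft)

print(f"3B GCL vs. 3B SFT: {gain_over_own_sft:.1%}")  # 39.9%, reported as 40%
print(f"3B GCL vs. 8B SFT: {gain_over_8b_sft:.1%}")   # 28.2%, reported as 28%
```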

Paper Structure (21 sections, 17 equations, 6 figures, 3 tables)

This paper contains 21 sections, 17 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Overview of GCL. The left panel illustrates example inputs to the GCL strategy: texts and images from the social navigation dataset. The middle panel shows the components of the GCL strategy: the Group Competitive Objective, which aligns VLMs at both the semantic and the distributional token level, and Asymmetric Group Optimization, which explores the optimization dynamics and performance boundaries of GCL. The right panel shows the socially compliant instructions that GCL outputs.
  • Figure 2: Ablation Study of Loss Components in GCO. Red and blue bars represent the learner (Qwen2.5-VL-3B) and guide (Qwen3-VL-4B) models, respectively, with bold values indicating peak performance. The left plot presents the ablation of the GSL weight, where both models achieve optimal results at 0.5. The right plot presents the ablation of the DRL weight, where a value of 0.4 yields the highest performance for both models.
  • Figure 3: Impact of Capability Gaps on Asymmetric Learning Rate Dynamics. Action-F1 is evaluated against the learning rate ratio ($r = \eta_{learner} / \eta_{guide}$), with $\eta_{guide}$ fixed at $5 \times 10^{-6}$. Shaded regions indicate the Optimization Best Spot. The large-gap panel shows that the group with a large capability gap tolerates aggressive learner updates ($1.0 \le r \le 3.0$), enabling effective feature space alignment. In contrast, the small-gap panel reveals that minimal initial gaps confine the optimal window to $1.0 \le r \le 2.0$. These results confirm that GCL necessitates performance-based asymmetric learning rate scheduling.
  • Figure 4: Optimal Asymmetric Temperature Gradients in VLMs Group. Heatmaps of Action-F1 scores for the Qwen2.5-VL-3B learner (left) and the Qwen3-VL-4B guide (right) across temperature grids. Symmetric settings ($\tau_{learner}=\tau_{guide}$) result in suboptimal consensus or collapse (e.g., $0.720$ at $\tau=3.0$). The global peak (learner: 0.968) occurs at $(\tau_{learner}=3.0, \tau_{guide}=2.0)$, highlighting the necessity of asymmetric thermal gradients to enhance learner plasticity through guide discriminative signaling. (Details for the small-gap group are provided in the Supplementary Material.)
  • Figure 5: Visualization Comparison. SFT models (yellow/green) miss critical constraints and produce incorrect actions, while GCL models (red/blue) generate socially compliant instructions aligned with GT.
  • ...and 1 more figures
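
The asymmetric settings described in Figures 3 and 4 can be illustrated with a minimal sketch. It assumes that the learner's learning rate is obtained as $r \cdot \eta_{guide}$ and that each temperature enters through a standard temperature-scaled softmax; the captions do not state how the temperatures are used inside the GCL loss, so that part is an assumption. The function names and the example logits are hypothetical.

```python
import math

ETA_GUIDE = 5e-6  # guide learning rate, fixed as stated in the Figure 3 caption


def learner_lr(ratio: float, eta_guide: float = ETA_GUIDE) -> float:
    """Learner learning rate implied by r = eta_learner / eta_guide."""
    return ratio * eta_guide


def softmax_with_temperature(logits: list[float], tau: float) -> list[float]:
    """Temperature-scaled softmax; a larger tau gives a flatter distribution."""
    scaled = [z / tau for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Upper end of the small-gap window reported in Figure 3 (r = 2.0).
lr_learner = learner_lr(2.0)

# Peak setting reported in Figure 4: tau_learner = 3.0, tau_guide = 2.0.
logits = [2.0, 1.0, 0.0]  # hypothetical token logits, for illustration only
p_learner = softmax_with_temperature(logits, tau=3.0)
p_guide = softmax_with_temperature(logits, tau=2.0)

print(f"learner lr: {lr_learner:.1e}")
print("learner distribution:", [round(p, 3) for p in p_learner])
print("guide distribution:  ", [round(p, 3) for p in p_guide])
```

Under this assumption, the guide's lower temperature keeps its distribution sharper than the learner's on the same logits, which is one way to read the caption's "guide discriminative signaling".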