MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

Liu Liu, Alexandra Kudaeva, Marco Cipriano, Fatimeh Al Ghannam, Freya Tan, Gerard de Melo, Andres Sevtsuk

TL;DR

This paper defines social group region detection, a semantically rich grounding task in urban scenes, and proposes MINGLE, a three-stage pipeline that combines robust person detection, depth-aware VLM reasoning for pairwise social affiliation, and a lightweight aggregation method to localize socially connected groups. The authors release a 100K-image street-view dataset with both individual and group annotations to enable large-scale study of relational vision in real-world environments. Through fine-tuning a vision-language model on pairwise relations and a greedy clustering step, MINGLE outperforms zero-shot VLM baselines on both pairwise and region-level evaluations, while depth and distance cues help filter unlikely pairs and reduce computation. The work advances semantic grounding of social interactions in urban computing, with implications for urban planning, AR/VR, robotics, and privacy-aware surveillance; it also provides a valuable dataset for future research in relational perception and open-vocabulary grounding in complex scenes.
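The pair-filtering idea mentioned above (using depth and distance cues to discard unlikely pairs before any VLM call) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `candidate_pairs`, the input layout, the normalization by mean box height, and both threshold values are assumptions made for the example.

```python
from itertools import combinations
from math import hypot


def candidate_pairs(people, max_dist_ratio=2.0, max_depth_gap=0.25):
    """Return index pairs worth sending to a pairwise classifier.

    people: list of dicts with
        'box'   -> (x1, y1, x2, y2) in pixels
        'depth' -> relative depth in [0, 1] (assumed output of a depth estimator)
    Both thresholds are hypothetical values chosen for illustration.
    """
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(people), 2):
        ax1, ay1, ax2, ay2 = a["box"]
        bx1, by1, bx2, by2 = b["box"]

        # Distance between box centres, in pixels.
        dist_px = hypot((ax1 + ax2) / 2 - (bx1 + bx2) / 2,
                        (ay1 + ay2) / 2 - (by1 + by2) / 2)

        # Normalise by mean box height so the threshold scales with apparent size.
        mean_height = max(((ay2 - ay1) + (by2 - by1)) / 2, 1e-6)
        dist_ratio = dist_px / mean_height

        depth_gap = abs(a["depth"] - b["depth"])

        # Keep the pair only if the two people are close in the image AND in depth.
        if dist_ratio <= max_dist_ratio and depth_gap <= max_depth_gap:
            pairs.append((i, j))
    return pairs


people = [
    {"box": (0, 0, 10, 20), "depth": 0.30},
    {"box": (12, 0, 22, 20), "depth": 0.32},
    {"box": (200, 0, 210, 20), "depth": 0.31},  # far away in the image
    {"box": (14, 0, 24, 20), "depth": 0.90},    # close in the image, far in depth
]
kept = candidate_pairs(people)
```

With four detections there are six possible pairs; under these assumed thresholds only one survives, which is the kind of reduction in classifier calls the summary alludes to.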

Abstract

Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement, which are semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
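To make stage (3) more concrete, here is a minimal sketch of one way pairwise affiliation decisions could be aggregated into group regions. It is an assumption-laden illustration rather than the paper's algorithm: it merges affiliated pairs with a union-find structure and takes the enclosing box of each resulting group, and the function name `group_regions` and its input format are hypothetical.

```python
def group_regions(boxes, affiliated_pairs):
    """Merge affiliated people into groups and return one enclosing box per group.

    boxes: list of (x1, y1, x2, y2) person boxes from a detector.
    affiliated_pairs: iterable of (i, j) index pairs judged socially affiliated.
    Connected-component merging is an assumed stand-in for the aggregation step.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Union the two components of every affiliated pair.
    for i, j in affiliated_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Collect member indices per component root.
    members = {}
    for idx in range(len(boxes)):
        members.setdefault(find(idx), []).append(idx)

    regions = []
    for idxs in members.values():
        if len(idxs) < 2:
            continue  # a single person is not a group
        x1s, y1s, x2s, y2s = zip(*(boxes[i] for i in idxs))
        regions.append({
            "members": idxs,
            "box": (min(x1s), min(y1s), max(x2s), max(y2s)),
        })
    return regions


boxes = [(0, 0, 10, 20), (12, 0, 22, 20), (100, 0, 110, 20), (20, 5, 30, 25)]
regions = group_regions(boxes, [(0, 1), (1, 3)])
```

In this example persons 0, 1, and 3 are linked transitively through the pairs (0, 1) and (1, 3), so they form one region, while person 2 has no affiliation and yields no region.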

Paper Structure

This paper contains 26 sections, 7 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Comparison of our method with zero-shot object detection and standard VLM-based methods. Our pipeline enables localized detection of socially interacting human groups in complex urban scenes.
  • Figure 2: The result of open-vocabulary detection (OVD) for social group detection.
  • Figure 3: Illustration of the three-stage pipeline for detecting semantically complex social interaction regions.
  • Figure 4: Descriptive statistics pertaining to our Social Group Region dataset.