UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment

Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi

TL;DR

UrbanAlign aligns frozen vision-language models (VLMs) with human preferences in urban perception without any model training. It mines interpretable semantic dimensions, extracts scores for them through a multi-agent scoring chain, and applies a locally adaptive geometric calibration. The method works in a hybrid space that combines CLIP features with mid-level semantic scores, and maps this space to human judgments with locally weighted ridge regression (LWRR). End-to-end dimension optimization selects the category-specific bottleneck dimensions that maximize calibrated accuracy while keeping per-dimension interpretability. On Place Pulse 2.0, the framework reaches 72.2% accuracy across six categories, +15.1 pp over the best supervised baseline. Because no model weights are modified, the approach reduces reliance on data-intensive training and GPU compute, offering a scalable, interpretable path to domain-specific alignment with practical potential for urban planning and other pairwise-preference domains.

Abstract

Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages: concept mining, multi-agent structured scoring, and geometric calibration, unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally-weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($\kappa=0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
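
The calibration stage named in the abstract can be illustrated with a minimal sketch of locally weighted ridge regression. This is not the authors' implementation: the Gaussian kernel, the Euclidean distance, the unpenalized intercept, and the names `lwrr_predict`, `bandwidth`, and `ridge` are all assumptions made for illustration. For each query point, reference points are weighted by proximity and a small ridge-regularized linear model is fitted on those weights.

```python
import numpy as np


def lwrr_predict(X_ref, y_ref, x_query, bandwidth=1.0, ridge=1e-3):
    """Predict a target for x_query with locally weighted ridge regression.

    Hypothetical helper: kernel choice and defaults are assumptions,
    not values taken from the paper.
    """
    X_ref = np.asarray(X_ref, dtype=float)
    y_ref = np.asarray(y_ref, dtype=float)
    x_query = np.asarray(x_query, dtype=float)

    # Gaussian kernel weights: reference points near the query count more.
    sq_dist = np.sum((X_ref - x_query) ** 2, axis=1)
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))

    # Append a constant column so the local model has an intercept.
    A = np.hstack([X_ref, np.ones((X_ref.shape[0], 1))])

    # Ridge penalty on the slopes only; the intercept is left unpenalized.
    penalty = ridge * np.eye(A.shape[1])
    penalty[-1, -1] = 0.0

    # Solve the weighted normal equations (A^T W A + penalty) beta = A^T W y.
    AtW = A.T * weights
    beta = np.linalg.solve(AtW @ A + penalty, AtW @ y_ref)
    return float(np.append(x_query, 1.0) @ beta)


# Toy usage on synthetic pairwise data (assumed setup): each row is a
# left-minus-right feature difference, and the label is +1 or -1.
rng = np.random.default_rng(0)
diffs = rng.normal(size=(50, 4))
labels = np.sign(diffs[:, 0])
score = lwrr_predict(diffs, labels, diffs[0], bandwidth=2.0, ridge=0.1)
predicted_winner = "left" if score > 0 else "right"
```

Fitting a separate local model per query is what makes the calibration locally adaptive: a small `bandwidth` follows nearby human ratings closely, while a large one approaches a single global ridge fit.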

Paper Structure (49 sections, 10 equations, 4 figures, 18 tables, 1 algorithm)

Figures (4)

  • Figure 1: Conceptual comparison of three approaches to urban perception. (a) End-to-end / zero-shot VLM methods map directly from pixels to abstract judgments with no interpretable intermediates. (b) Low-level objective methods (segmentation-based regression) extract hand-crafted features but lack semantic depth (56.2% avg. accuracy). (c) UrbanAlign routes predictions through mid-level semantic dimensions discovered by the VLM (e.g., Façade Quality, Pavement Integrity), achieving 72.2% average accuracy with actionable, per-dimension interpretability.
  • Figure 2: Overview of the UrbanAlign framework. Stage 1 discovers interpretable semantic dimensions from high/low-consensus exemplars via a VLM. Stage 2 distils VLM knowledge into continuous dimension scores via the Observer-Debater-Judge chain and fuses them with CLIP embeddings into hybrid vectors ($\alpha{=}0.3$). Stage 3 calibrates predictions by locally-weighted ridge regression (LWRR) on the hybrid differential manifold, with $R^2$ as an interpretability audit. Data isolation: $\mathcal{D}_{\mathrm{ref}} \cap \mathcal{D}_{\mathrm{pool}} = \varnothing$.
  • Figure 3: Distribution of image pairs across vote thresholds (${\geq}1$ to ${\geq}5$ votes) for all six perception categories. The $y$-axis is log-scaled. Our consensus filter retains pairs with $N{\geq}3$ votes.
  • Figure 4: Individual image appearance frequency distribution across thresholds (${\geq}1$ to ${\geq}20$ appearances) for all six categories. The heavy-tailed distribution indicates a core set of frequently compared images that anchor TrueSkill rating estimates.
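
The data preparation described in the captions of Figures 2 and 3 can be sketched as follows. The captions give $\alpha = 0.3$ and the $N \geq 3$ vote filter but do not say how $\alpha$ enters the fusion, so this sketch assumes that $\alpha$ scales an L2-normalized CLIP block and $1 - \alpha$ scales an L2-normalized semantic-score block before concatenation. The function names and the `votes` field are hypothetical.

```python
import numpy as np

ALPHA = 0.3  # value from the Figure 2 caption; its exact role is assumed here


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v


def hybrid_vector(clip_emb, dim_scores, alpha=ALPHA):
    """Concatenate weighted CLIP and semantic-score blocks (assumed fusion rule)."""
    return np.concatenate([
        alpha * l2_normalize(clip_emb),
        (1.0 - alpha) * l2_normalize(dim_scores),
    ])


def pair_difference(h_left, h_right):
    """Differential feature for an image pair: left hybrid minus right hybrid."""
    return np.asarray(h_left, dtype=float) - np.asarray(h_right, dtype=float)


def consensus_filter(pairs, min_votes=3):
    """Keep only pairs with at least min_votes human votes (N >= 3 in Figure 3)."""
    return [p for p in pairs if p["votes"] >= min_votes]


# Toy usage with made-up numbers.
h_a = hybrid_vector([3.0, 4.0], [0.0, 2.0])
h_b = hybrid_vector([1.0, 0.0], [1.0, 0.0])
diff = pair_difference(h_a, h_b)
kept = consensus_filter([{"votes": 1}, {"votes": 3}, {"votes": 5}])
```

Under this assumed rule, the difference vectors of the retained pairs would form the inputs to the calibration stage, with the human preference as the target.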