UrbanAlign: Post-hoc Semantic Calibration for VLM-Human Preference Alignment
Yecheng Zhang, Rong Zhao, Zhizhou Sha, Yong Li, Lei Wang, Ce Hou, Wen Ji, Hao Huang, Yunshan Wan, Jian Yu, Junhao Xia, Yuru Zhang, Chunlei Shi
TL;DR
UrbanAlign shows that frozen vision-language models can be aligned to human preferences in urban perception without any model training, by mining interpretable semantic dimensions, distilling features through a multi-agent scoring chain, and applying a locally adaptive geometric calibration. The method builds a hybrid space that combines CLIP features with mid-level semantic scores and calibrates it against human judgments using locally weighted ridge regression. End-to-end dimension optimization selects the category-specific concept bottlenecks that maximize calibrated accuracy, yielding substantial gains over supervised baselines while preserving per-dimension interpretability. The approach reduces reliance on data-intensive training and GPU compute, offering a scalable, interpretable path to domain-specific alignment. Validation on Place Pulse 2.0 demonstrates strong performance and practical potential for urban planning, and the pipeline is applicable to other pairwise-preference domains.
Abstract
Aligning vision-language model (VLM) outputs with human preferences in domain-specific tasks typically requires fine-tuning or reinforcement learning, both of which demand labelled data and GPU compute. We show that for subjective perception tasks, this alignment can be achieved without any model training: VLMs are already strong concept extractors but poor decision calibrators, and the gap can be closed externally. We propose a training-free post-hoc concept-bottleneck pipeline consisting of three tightly coupled stages (concept mining, multi-agent structured scoring, and geometric calibration), unified by an end-to-end dimension optimization loop. Interpretable evaluation dimensions are mined from a handful of human annotations; an Observer-Debater-Judge chain extracts robust continuous concept scores from a frozen VLM; and locally weighted ridge regression on a hybrid visual-semantic manifold calibrates these scores against human ratings. Applied to urban perception as UrbanAlign, the framework achieves 72.2% accuracy ($\kappa=0.45$) on Place Pulse 2.0 across six categories, outperforming the best supervised baseline by +15.1 pp and uncalibrated VLM scoring by +16.3 pp, with full dimension-level interpretability and zero model-weight modification.
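
To make the calibration stage concrete, the following is a minimal sketch of generic locally weighted ridge regression, not the authors' implementation. It assumes a Gaussian kernel over Euclidean distance, an unpenalised intercept, and feature rows that stand in for hybrid visual-semantic vectors. The function name `lwrr_predict` and the parameters `bandwidth` and `ridge` are hypothetical and do not come from the paper.

```python
import numpy as np

def lwrr_predict(X_train, y_train, x_query, bandwidth=1.0, ridge=1e-2):
    """Predict a target at x_query with locally weighted ridge regression.

    Assumptions (illustrative only): Gaussian kernel weights, Euclidean
    distance, and an intercept term that is excluded from the ridge penalty.
    """
    # Kernel weight of each training point, based on its distance to the query.
    d2 = np.sum((X_train - x_query) ** 2, axis=1)
    w = np.exp(-d2 / (2.0 * bandwidth ** 2))

    # Prepend a constant column so the local model has an intercept.
    Xb = np.hstack([np.ones((X_train.shape[0], 1)), X_train])
    xq = np.concatenate([[1.0], x_query])

    # Weighted normal equations: (X^T W X + ridge * I) beta = X^T W y.
    XtW = Xb.T * w  # scales column i of Xb.T by weight w[i]
    reg = ridge * np.eye(Xb.shape[1])
    reg[0, 0] = 0.0  # leave the intercept unpenalised
    beta = np.linalg.solve(XtW @ Xb + reg, XtW @ y_train)

    return float(xq @ beta)
```

Because a separate ridge model is fitted around each query point, the mapping from concept scores to human ratings can vary across regions of the feature space, which is one way to read "locally adaptive" calibration. The local coefficients `beta` also stay inspectable per dimension.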
