Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs

Ruichuan An; Kai Zeng; Ming Lu; Sihan Yang; Renrui Zhang; Huitong Ji; Hao Liang; Wentao Zhang

TL;DR

This work tackles the personalization of Vision-Language Models under data scarcity by introducing Concept-as-Tree (CaT), a controllable synthetic data framework that represents each concept as a tree and uses that tree to guide diffusion models in generating labeled positive and negative samples. A Perturbation-based Concept-Specific (PCS) score then filters the generated samples, keeping those that emphasize concept-specific information. Across multiple datasets (MC-LLaVA, Yo'LLaVA, MyVLM), CaT with PCS filtering yields consistent improvements on recognition, VQA, and captioning tasks, often approaching or surpassing baselines trained on real data, and it remains effective in data-scarce regimes. The approach supports multi-concept personalization via forests. The authors point to broader concept coverage as a future direction and discuss limitations such as potential biases and privacy considerations.

Abstract

Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks, and there is growing interest in improving their personalization capabilities. To better integrate user-provided concepts into VLMs, many methods fine-tune these models on positive and negative samples. However, user-provided positive samples are scarce and retrieved negative samples are often of low quality, which poses challenges for existing techniques. To reveal the relationship between samples and model performance, we systematically investigate how the quantity and diversity of positive and negative samples (both easy and hard) affect VLM personalization tasks. Based on this analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure. This representation enables the generation of positive and negative samples with varying difficulty and diversity, and it extends easily to multi-concept scenarios. Combined with a well-designed data filtering strategy that ensures the quality of the generated data, the CaT framework constitutes a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess how effectively the pipeline alleviates the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the capabilities of VLMs across personalization benchmarks. To the best of our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code will be released.
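
To make the tree idea concrete, the following is a minimal illustrative sketch and not the released implementation. It assumes a three-level tree (root concept, dimension nodes, attribute leaves) and assumes that positives are sampled from the concept's own tree, hard negatives from a copy with one dimension altered, and easy negatives from an unrelated tree. The names `ConceptNode`, `build_tree`, `with_dimension`, and `sample_prompt`, along with the example dimensions, are hypothetical; the resulting prompts would be passed to a text-to-image model, which is not shown here.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    """Hypothetical tree node: root = concept, children = dimensions, leaves = attribute values."""
    name: str
    children: list[ConceptNode] = field(default_factory=list)


def build_tree(concept: str, dimensions: dict[str, list[str]]) -> ConceptNode:
    """Build a three-level tree from a mapping of dimension name -> attribute values."""
    root = ConceptNode(concept)
    for dim, values in dimensions.items():
        root.children.append(ConceptNode(dim, [ConceptNode(v) for v in values]))
    return root


def with_dimension(tree: ConceptNode, dim: str, values: list[str]) -> ConceptNode:
    """Return a copy of `tree` in which dimension `dim` has new leaf values.

    Assumption: altering one dimension while keeping the rest yields a
    'hard' negative, i.e. something close to the concept but not the concept.
    """
    dims = {d.name: [leaf.name for leaf in d.children] for d in tree.children}
    if dim not in dims:
        raise KeyError(dim)
    dims[dim] = list(values)
    return build_tree(tree.name, dims)


def sample_prompt(tree: ConceptNode, rng: random.Random) -> str:
    """Pick one leaf per dimension and join the choices into a text prompt."""
    parts = [rng.choice(dim.children).name for dim in tree.children]
    return f"a photo of {tree.name}, " + ", ".join(parts)


# Illustrative trees; every attribute value here is made up for the example.
concept = build_tree(
    "a corgi dog",
    {
        "appearance": ["tan and white fur"],
        "pose": ["sitting", "running"],
        "background": ["on a beach", "in a park"],
    },
)
hard_negative = with_dimension(concept, "appearance", ["black fur", "grey fur"])
easy_negative = build_tree("a red bicycle", {"background": ["on a street", "in a garage"]})

rng = random.Random(0)
samples = {
    "positive": [sample_prompt(concept, rng) for _ in range(3)],
    "hard_negative": [sample_prompt(hard_negative, rng) for _ in range(3)],
    "easy_negative": [sample_prompt(easy_negative, rng) for _ in range(3)],
}
```

In this sketch, diversity is controlled by the number of leaves under each dimension and difficulty by how far the altered tree departs from the original. A multi-concept scenario could be handled by keeping a list of such trees, one per concept.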
