Concept-as-Tree: A Controllable Synthetic Data Framework Makes Stronger Personalized VLMs

Ruichuan An, Kai Zeng, Ming Lu, Sihan Yang, Renrui Zhang, Huitong Ji, Hao Liang, Wentao Zhang

TL;DR

This work tackles personalization in Vision-Language Models under data scarcity by introducing Concept-as-Tree (CaT), a controllable synthetic data framework that represents concepts as tree-structured graphs and generates labeled positives and negatives via diffusion models guided by the tree. A Perturbation-based Concept-Specific (PCS) score filters generated samples to emphasize concept-specific information, enabling high-quality data selection. Across multiple datasets (MC-LLaVA, Yo'LLaVA, MyVLM), CaT with PCS filtering yields consistent improvements in recognition, VQA, and captioning tasks, often approaching or surpassing baselines that use real data, and retaining effectiveness under data-scarce regimes. The approach supports multi-concept personalization via forests and highlights future directions toward broader concept coverage, while discussing limitations such as potential biases and privacy considerations.

Abstract

Vision-Language Models (VLMs) have demonstrated exceptional performance in various multi-modal tasks. Recently, there has been increasing interest in improving the personalization capabilities of VLMs. To better integrate user-provided concepts into VLMs, many methods use positive and negative samples to fine-tune these models. However, the scarcity of user-provided positive samples and the low quality of retrieved negative samples pose challenges for existing techniques. To reveal the relationship between samples and model performance, we systematically investigate how the amount and diversity of positive and negative samples (easy and hard) affect VLM personalization tasks. Based on this analysis, we introduce Concept-as-Tree (CaT), which represents a concept as a tree structure, thereby enabling the generation of positive and negative samples with varying difficulty and diversity; it can be easily extended to multi-concept scenarios. With a well-designed data filtering strategy, our CaT framework ensures the quality of the generated data, constituting a powerful pipeline. We perform thorough experiments with various VLM personalization baselines to assess the effectiveness of the pipeline in alleviating the lack of positive samples and the low quality of negative samples. Our results demonstrate that CaT equipped with the proposed data filter significantly enhances the capabilities of VLMs across personalization benchmarks. To the best of our knowledge, this work is the first controllable synthetic data pipeline for VLM personalization. The code will be released.

Paper Structure

This paper contains 19 sections, 4 equations, 25 figures, 10 tables.

Figures (25)

  • Figure 1: The performance of Yo'LLaVA on the Yo'LLaVA dataset. (Left) Various tasks show a noticeable decrease in performance when the number of positive samples is limited. (Right) As the number of retrieved negative samples increases, the performance of various tasks does not improve consistently. This might be due to uncertainty in the image retrieval process, leading to low-quality negative samples. More results on different datasets can be found in the Appendix.
  • Figure 2: Overview of the unified and controllable data synthesis pipeline and personalized model training. After a systematic analysis of positive and negative samples, we use an LLM and a VLM to construct the concept tree and edit it to achieve controllable generation. We then propose a well-designed data selection module and a new metric named the PCS score to ensure the quality of synthetic data. The resulting high-quality data can be used to enhance test-time fine-tuning methods.
  • Figure 3: Exploring the role of positive and negative samples in personalization and their demand for diversity. (Left) We evaluate the effect of halving positive and negative samples. Positive samples generally improve performance, while easy and hard negatives show task-specific benefits. (Middle & Right) With fixed sample counts, we vary the diversity of easy and hard negatives. Hard negatives are more sensitive to diversity, and excessive diversity can degrade performance. Diversity scores are computed by clustering retrieved negatives via K-means and measuring distances to cluster centroids. The first row shows results on the Yo'LLaVA dataset; the second row shows results on the MC-LLaVA dataset.
  • Figure 4: Concept tree synthesis and editing operations for sample generation. (a) A concept tree $T_C$ is constructed using the CaT framework and reference images. Positive samples are synthesized with a fine-tuned diffusion model guided by the root node $R(C)$. Easy negatives are generated by changing the class of $R(C)$, while diverse hard negatives are created by applying three editing operations to the original tree, using a model that has not been fine-tuned. (b) Multiple concept trees can also be merged to support multi-concept personalization data generation. (c) The three operations are visualized: adding a "mood" dimension alters emotions; removing the "object" dimension simplifies the scene; modifying a dimension changes behavior.
  • Figure 5: PCS-based filtering and visualization of high-quality image selection. (a) We apply patch shuffling to synthetic images and extract features $F_o$, $F_d$, and $F_r$ from the original, disturbed, and reference images. Image similarities $S_o$ and $S_d$ are computed to select images with high PCS scores. (b) Images rich in CS information are shown in green and those rich in CA information are shown in red. Cosine similarity cannot distinguish between the two types, but the PCS score effectively filters out unqualified ones.
  • ...and 20 more figures
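
The tree edits described in the Figure 4 caption (changing the root class for easy negatives; adding, removing, or modifying a dimension for hard negatives) can be illustrated with a minimal Python sketch. The `ConceptTree` class, its method names, and the prompt format are hypothetical and chosen only for illustration; the paper's actual data structure and prompts are not given in this section and may differ.

```python
from __future__ import annotations

from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class ConceptTree:
    """Hypothetical concept tree: a root class plus named dimensions."""
    root: str
    dimensions: dict[str, list[str]] = field(default_factory=dict)

    def with_root(self, new_root: str) -> ConceptTree:
        """Easy negative: keep the dimensions but change the root class."""
        edited = deepcopy(self)
        edited.root = new_root
        return edited

    def add_dimension(self, name: str, values: list[str]) -> ConceptTree:
        """Hard negative edit: add a new dimension (e.g. "mood")."""
        edited = deepcopy(self)
        edited.dimensions[name] = list(values)
        return edited

    def remove_dimension(self, name: str) -> ConceptTree:
        """Hard negative edit: drop a dimension (e.g. "object")."""
        edited = deepcopy(self)
        edited.dimensions.pop(name, None)
        return edited

    def modify_dimension(self, name: str, values: list[str]) -> ConceptTree:
        """Hard negative edit: replace the values of an existing dimension."""
        if name not in self.dimensions:
            raise KeyError(name)
        edited = deepcopy(self)
        edited.dimensions[name] = list(values)
        return edited

    def to_prompt(self) -> str:
        """Assumed prompt format: root class, then the first value of each dimension."""
        parts = [f"a photo of a {self.root}"]
        for name, values in self.dimensions.items():
            if values:
                parts.append(f"{name}: {values[0]}")
        return ", ".join(parts)


tree = ConceptTree("dog", {"behavior": ["running"], "object": ["ball"]})
easy_negative = tree.with_root("cat")
hard_negatives = [
    tree.add_dimension("mood", ["sad"]),
    tree.remove_dimension("object"),
    tree.modify_dimension("behavior", ["sleeping"]),
]
for t in [tree, easy_negative, *hard_negatives]:
    print(t.to_prompt())
```

Each operation returns an edited copy, so the original tree stays available as the source of positive-sample prompts while the edited copies drive negative-sample generation.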
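
The perturbation step in the Figure 5 caption (patch shuffling, then comparing original and disturbed features against a reference) can be sketched as follows. This is an illustration under stated assumptions: images are represented as plain 2D grids of numbers, `encode` is a placeholder for whatever image encoder produces $F_o$, $F_d$, and $F_r$, and all function names are hypothetical. The caption does not state how $S_o$ and $S_d$ are combined into the PCS score, so the sketch stops at the two similarities.

```python
import math
import random


def patch_shuffle(image, patch, seed=0):
    """Split a 2D grid into patch x patch tiles and randomly permute the tiles."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    cols = width // patch
    order = list(range((height // patch) * cols))
    random.Random(seed).shuffle(order)
    out = [[None] * width for _ in range(height)]
    for dst, src in enumerate(order):
        dst_r, dst_c = divmod(dst, cols)
        src_r, src_c = divmod(src, cols)
        for i in range(patch):
            for j in range(patch):
                out[dst_r * patch + i][dst_c * patch + j] = (
                    image[src_r * patch + i][src_c * patch + j]
                )
    return out


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarities(image, reference, encode, patch, seed=0):
    """Return (S_o, S_d): similarity of the original and the shuffled image to the reference."""
    f_o = encode(image)                              # features of the original image
    f_d = encode(patch_shuffle(image, patch, seed))  # features of the disturbed image
    f_r = encode(reference)                          # features of the reference image
    return cosine_similarity(f_o, f_r), cosine_similarity(f_d, f_r)


def flatten(grid):
    """Toy stand-in for an image encoder (assumption): flatten pixels to a vector."""
    return [float(v) for row in grid for v in row]


image = [[r * 4 + c for c in range(4)] for r in range(4)]
s_o, s_d = similarities(image, image, flatten, patch=2)
print(s_o, s_d)
```

In practice `encode` would be a pretrained vision encoder rather than the toy `flatten` used here; the point of the sketch is only the data flow from a synthetic image to the pair of similarities.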