T2I-ConBench: Text-to-Image Benchmark for Continual Post-training

Zhehao Huang, Yuhang Liu, Yixin Lou, Zhengbao He, Mingzhen He, Wenxing Zhou, Tao Li, Kehan Li, Zeyi Huang, Xiaolin Huang

TL;DR

T2I-ConBench introduces a unified benchmark for continual post-training of text-to-image diffusion models, targeting item customization and domain enhancement. It couples a formal task definition with a diverse data curation strategy and an automated, multi-axis evaluation pipeline that measures retention of pretrained capability, downstream performance, forgetting, and cross-task compositional generalization. Through three realistic task sequences and ten baselines, the study shows that no single method excels across all criteria and cross-task generalization remains a major open challenge, even with oracle joint training. The benchmark provides datasets, code, and evaluation tools to enable fair comparisons and accelerate research toward more robust continual post-training approaches for T2I diffusion models in real-world applications.

Abstract

Continual post-training adapts a single text-to-image diffusion model to learn new tasks without incurring the cost of separate models, but naive post-training causes forgetting of pretrained knowledge and undermines zero-shot compositionality. We observe that the absence of a standardized evaluation protocol hampers related research for continual post-training. To address this, we introduce T2I-ConBench, a unified benchmark for continual post-training of text-to-image models. T2I-ConBench focuses on two practical scenarios, item customization and domain enhancement, and analyzes four dimensions: (1) retention of generality, (2) target-task performance, (3) catastrophic forgetting, and (4) cross-task generalization. It combines automated metrics, human-preference modeling, and vision-language QA for comprehensive assessment. We benchmark ten representative methods across three realistic task sequences and find that no approach excels on all fronts. Even joint "oracle" training does not succeed for every task, and cross-task generalization remains unsolved. We release all datasets, code, and evaluation tools to accelerate research in continual post-training for text-to-image models.
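As an illustration of the "catastrophic forgetting" dimension, the sketch below computes a forgetting score in the form commonly used in the continual learning literature: for each earlier task, the best score reached before the final training stage minus the score after the final stage. This is a minimal sketch under stated assumptions, not the benchmark's own metric definition. The function names, the `scores` matrix layout, and the example numbers are all hypothetical.

```python
from typing import List


def average_forgetting(scores: List[List[float]]) -> float:
    """Average drop on earlier tasks after the full task sequence.

    Assumed layout (hypothetical, not taken from the benchmark):
    scores[i][j] is the score on task j measured after training on
    task i, where higher is better.
    """
    num_tasks = len(scores)
    if num_tasks < 2:
        return 0.0  # nothing can be forgotten with a single task
    final = scores[-1]
    drops = []
    for j in range(num_tasks - 1):
        # Best score on task j from the stage that learned it up to the
        # stage just before the last one.
        best_earlier = max(scores[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - final[j])
    return sum(drops) / len(drops)


def final_average(scores: List[List[float]]) -> float:
    """Mean score over all tasks after the last training stage."""
    return sum(scores[-1]) / len(scores[-1])


# Made-up numbers for a three-task sequence, for illustration only.
example_scores = [
    [0.9, 0.1, 0.0],
    [0.7, 0.8, 0.1],
    [0.6, 0.5, 0.9],
]
forgetting = average_forgetting(example_scores)
overall = final_average(example_scores)
```

A positive forgetting value means performance on earlier tasks dropped after later training stages. Reporting it next to the final average keeps a method from looking strong on new tasks while silently losing old ones.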

Paper Structure

This paper contains 25 sections, 6 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Overview of T2I-ConBench. Our benchmark consists of four components: (1) challenging continual post‑training task sequences, (2) the curation of diverse item and domain datasets, (3) an automated evaluation pipeline, and (4) comprehensive metrics to fully assess each continual learning method’s ability to update knowledge, resist forgetting, and generalize across tasks.
  • Figure 2: Body pose distribution.
  • Figure 3: Evaluation pipeline of cross-task generalization.
  • Figure 4: Overview of the continual post‑training baselines evaluated in this work, encompassing rehearsal‑based, regularization‑based, and parameter‑isolation methods (sparse fine‑tuning and low‑rank adaptation). These baselines are described in the baselines section and the corresponding appendix.
  • Figure A1: Evaluation pipeline for measuring the similarity of unique personalized items with VQA in item customization tasks.
  • ...and 5 more figures