Text2Data: Low-Resource Data Generation with Textual Control

Shiyu Wang, Yihao Feng, Tian Lan, Ning Yu, Yu Bai, Ran Xu, Huan Wang, Caiming Xiong, Silvio Savarese

TL;DR

Text2Data addresses the challenge of controllable text-to-data generation in low-resource domains by decoupling distribution learning from controllable finetuning. It first learns the data distribution from unlabeled data with an unconditional diffusion model, then finetunes on text-labeled data under a constraint-based objective that mitigates catastrophic forgetting, formalized via a lexicographic optimization and supported by a generalization bound. The approach yields superior controllability across molecules, motions, and time series while maintaining competitive generation quality, demonstrated on QM9, HumanML3D/AMASS, and stock-time-series datasets. This two-stage method reduces dependence on large labeled datasets and can be adapted to other generative models, offering a practical path for text-conditioned generation in scarce-label scenarios.

Abstract

Natural language serves as a common and straightforward signal for humans to interact seamlessly with machines. Recognizing the importance of this interface, the machine learning community is investing considerable effort in generating data that is semantically coherent with textual instructions. While strides have been made in text-to-data generation spanning image editing, audio synthesis, video creation, and beyond, low-resource areas characterized by expensive annotations or complex data structures, such as molecules, motion dynamics, and time series, often lack textual labels. This deficiency impedes supervised learning, thereby constraining the application of advanced generative models for text-to-data tasks. In response to these challenges in the low-resource scenario, we propose Text2Data, a novel approach that utilizes unlabeled data to understand the underlying data distribution through an unsupervised diffusion model. Subsequently, it undergoes controllable finetuning via a novel constraint optimization-based learning objective that ensures controllability and effectively counteracts catastrophic forgetting. Comprehensive experiments demonstrate that Text2Data is able to achieve enhanced performance regarding controllability across various modalities, including molecules, motions and time series, when compared to existing baselines.
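
The constrained finetuning idea described above can be illustrated with a toy sketch. This is not the authors' implementation: simple quadratic functions stand in for the diffusion losses, the constraint is enforced with a basic primal-dual (Lagrangian) update chosen here for illustration, and every name (`l1`, `l2`, `xi`, `rho`, `plain_finetune`, `constrained_finetune`) is hypothetical. The sketch minimizes a "conditional" loss while keeping an "unconditional" loss within a slack `rho` of its pretrained value, and compares that with unconstrained finetuning.

```python
import numpy as np

# Toy stand-ins (assumptions, not from the paper): quadratic losses replace
# the diffusion losses. l1 plays the role of the unconditional loss that
# encodes the learned data distribution; l2 plays the role of the
# text-conditional loss optimised during finetuning.
A = np.array([0.0, 0.0])  # minimiser of the toy unconditional loss
B = np.array([2.0, 0.0])  # minimiser of the toy conditional loss


def l1(theta):
    return float(np.sum((theta - A) ** 2))


def grad_l1(theta):
    return 2.0 * (theta - A)


def l2(theta):
    return float(np.sum((theta - B) ** 2))


def grad_l2(theta):
    return 2.0 * (theta - B)


def plain_finetune(theta0, lr=0.05, steps=2000):
    """Unconstrained finetuning: gradient descent on l2 only."""
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        theta -= lr * grad_l2(theta)
    return theta


def constrained_finetune(theta0, xi, lr=0.05, lam_lr=0.5, steps=2000):
    """Minimise l2(theta) subject to l1(theta) <= xi (primal-dual sketch)."""
    theta = np.array(theta0, dtype=float)
    lam = 0.0  # Lagrange multiplier for the constraint l1(theta) <= xi
    for _ in range(steps):
        # Primal step: descend on l2 + lam * l1.
        theta -= lr * (grad_l2(theta) + lam * grad_l1(theta))
        # Dual step: raise lam while the constraint is violated, never below 0.
        lam = max(0.0, lam + lam_lr * (l1(theta) - xi))
    return theta, lam


# "Stage 1": pretend pretraining reached the minimiser of the unconditional loss.
theta_pre = A.copy()
rho = 0.25                 # hypothetical slack on the constraint
xi = l1(theta_pre) + rho   # allowed level of the unconditional loss

# "Stage 2": finetune with and without the constraint.
theta_u = plain_finetune(theta_pre)
theta_c, lam = constrained_finetune(theta_pre, xi)

print("unconstrained: l1 =", l1(theta_u), " l2 =", l2(theta_u))
print("constrained:   l1 =", l1(theta_c), " l2 =", l2(theta_c))
```

In this toy setting, unconstrained finetuning drifts all the way to the minimiser of `l2` and the unconditional loss grows well beyond its pretrained value, which mirrors catastrophic forgetting. The constrained variant still lowers `l2` but stops at the boundary where `l1` reaches `xi`.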

Paper Structure (31 sections, 2 theorems, 39 equations, 4 figures, 8 tables, 1 algorithm)

Key Result

Theorem 4.2

For every $\theta$ and $t$, assume ${\bm{\epsilon}}_\theta({\mathbf{x}}^{(t)}, t)$ and ${\bm{\epsilon}}_\theta({\mathbf{x}}^{(t)}, {\mathbf{c}}_i, t)$ are sub-Gaussian random variables with mean 0 and variance $\sigma^2$, and $\Theta$ is finite. Let $\Theta^*=\{\theta:{\mathcal{L}}_1'(\theta)\le\xi\ where $\epsilon=\epsilon_N + \epsilon_{N_p}$, $\epsilon_N = \sqrt{C\tilde{\sigma}^2}\cdot\sqrt{\fra

Figures (4)

  • Figure 1: Overview of Text2Data. The model first leverages unlabeled data (blue module) to learn the overall data distribution, which yields the optimal set of model parameters $\Theta$. The model is then finetuned on labeled data (red module) under constrained optimization, which yields the optimal set of parameters $\Theta\cap\Theta'$, where $\Theta'$ is the optimal set of parameters obtained when the model is finetuned without the constraint.
  • Figure 2: Controllability on the molecule dataset for different proportions of paired data. The green solid line corresponds to Text2Data; the two dashed lines are baselines (blue: EDM, orange: EDM-finetune). Properties of generated molecules are predicted by the classifier $\phi_c$, and MAE is computed between these predicted properties and the intended properties. Lower MAE indicates better performance.
  • Figure 3: Visualization of generated molecules when the polarizability increases from "very low" to "very high".
  • Figure 4: t-SNE visualization of time series data generated by Text2Data, DiffTS-finetune, and DiffTS. Red denotes ground truth, and blue denotes generated data.

Theorems & Definitions (5)

  • Definition 4.1: Sub-Gaussian random variable
  • Theorem 4.2
  • Definition C.1: Sub-exponential random variable
  • Lemma C.2
  • proof
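
For reference, the sub-Gaussian assumption used in Theorem 4.2 can be read against the standard textbook definition below. This is the common moment-generating-function form together with the tail bound it implies; the paper's Definition 4.1 is not reproduced on this page and may be stated in a different but equivalent form, so the notation here ($X$, $\mu$, $\lambda$, $t$) is an assumption.

```latex
% Standard textbook definition (assumed; not copied from the paper).
A random variable $X$ with mean $\mu = \mathbb{E}[X]$ is sub-Gaussian with
variance proxy $\sigma^2$ if
\[
  \mathbb{E}\bigl[\exp\bigl(\lambda (X - \mu)\bigr)\bigr]
  \le \exp\!\left(\frac{\sigma^2 \lambda^2}{2}\right)
  \quad \text{for all } \lambda \in \mathbb{R}.
\]
% Consequence via the Chernoff bound: Gaussian-like tails.
\[
  \Pr\bigl(|X - \mu| \ge t\bigr)
  \le 2 \exp\!\left(-\frac{t^2}{2\sigma^2}\right)
  \quad \text{for all } t \ge 0.
\]
```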