SYNTHONY: A Stress-Aware, Intent-Conditioned Agent for Deep Tabular Generative Models Selection

Hochan Son, Xiaofeng Lin, Jason Ni, Guang Cheng

Abstract

Deep generative models for tabular data (GANs, diffusion models, and LLM-based generators) behave very differently from one dataset to the next; the best-performing synthesizer family depends strongly on distributional stressors such as long-tailed marginals, high-cardinality categorical columns, Zipfian imbalance, and small-sample regimes. This brittleness makes practical deployment challenging, especially when users must balance the competing objectives of fidelity, privacy, and utility. We study intent-conditioned tabular synthesis selection: given a dataset and a user intent expressed as a preference over evaluation metrics, the goal is to select a synthesizer that minimizes regret relative to an intent-specific oracle. We propose stress profiling, a synthesis-specific meta-feature representation that quantifies dataset difficulty along four interpretable stress dimensions, and integrate it into SYNTHONY, a selection framework that matches stress profiles against a calibrated capability registry of synthesizer families. Across a benchmark of 7 datasets, 10 synthesizers, and 3 intents, we demonstrate that stress-based meta-features are highly predictive of synthesizer performance: a $k$NN selector using these features achieves strong Top-1 selection accuracy, substantially outperforming zero-shot LLM selectors and random baselines. We analyze the gap between meta-feature-based and capability-based selection, identify the hand-crafted capability registry as the primary bottleneck, and motivate learned capability representations as a direction for future work.

Paper Structure

This paper contains 77 sections, 6 equations, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: SYNTHONY system architecture. Solid arrows denote the real-time recommendation pipeline; dashed arrows denote the offline calibration loop (Section \ref{sec:calibration}). Data enters on the left through profiling, passes through scoring in the center, and returns on the right through the API.
  • Figure 2: Rule-based scoring pipeline. The match function $m_j$ implements a decay curve: 1.0 (exact match), 0.7 (one level below), 0.4 (two levels below), 0.0 (otherwise). Scale factors $\alpha_{i,j}$ are applied when an intent is specified. When scale factors are provided, the hard-problem path is bypassed so that the optimizer has full control.
  • Figure 3: API architecture with SQLite persistence. Double arrows indicate read/write paths. Uploaded files are stored on disk; profiles and recommendations are cached in SQLite for session-based retrieval.
  • Figure 4: MCP server architecture. The server exposes SYNTHONY's capabilities through three MCP primitives: Tools (executable functions), Resources (read-only data), and Prompts (guided workflows). Communication uses JSON-RPC 2.0 over stdio.
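
The decay curve described in the Figure 2 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's implementation: the names `match_score` and `weighted_score`, the integer encoding of stress and capability levels, the treatment of a capability at or above the required level as an exact match, and the aggregation as a scaled sum over dimensions are all hypothetical. Only the four decay values (1.0, 0.7, 0.4, 0.0) and the existence of per-dimension scale factors come from the caption.

```python
from typing import Mapping, Optional

# Decay values taken from the Figure 2 caption, keyed by how many
# levels the capability falls below the required stress level.
DECAY = {0: 1.0, 1: 0.7, 2: 0.4}


def match_score(required_level: int, capability_level: int) -> float:
    """Hypothetical match function m_j over integer-coded levels.

    Assumption: a capability at or above the required level counts
    as an exact match (gap 0). Gaps of three or more score 0.0.
    """
    gap = max(required_level - capability_level, 0)
    return DECAY.get(gap, 0.0)


def weighted_score(
    required: Mapping[str, int],
    capability: Mapping[str, int],
    scale: Optional[Mapping[str, float]] = None,
) -> float:
    """Sum of per-dimension match scores, optionally scaled.

    Assumption: the overall score is a plain sum of alpha * m_j;
    the caption does not specify the aggregation. Missing scale
    factors default to 1.0, missing capabilities to level 0.
    """
    total = 0.0
    for dim, req in required.items():
        alpha = 1.0 if scale is None else scale.get(dim, 1.0)
        total += alpha * match_score(req, capability.get(dim, 0))
    return total
```

For example, under these assumptions a synthesizer whose capability sits one level below the required stress level on one dimension and meets it on another would score 0.7 + 1.0 without scale factors; doubling the scale factor on the first dimension would raise its contribution to 1.4.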