Task Expansion and Cross Refinement for Open-World Conditional Modeling

Shreyas Bhat Brahmavar, Qiyang Liu, Yang Li, Junier Oliva

Abstract

Open-world conditional modeling (OCM) requires a single model to answer arbitrary conditional queries across heterogeneous datasets, where observed variables and targets vary and arise from a vast open-ended task universe. Because any finite collection of real-world datasets covers only a small fraction of this space, we propose Task Expansion and Cross Refinement (TEXR), a semi-supervised framework that enlarges effective task coverage through structured synthesis and refinement of semantic data contexts. TEXR first generates diverse uninstantiated dataset schemas and weakly instantiates them via structured probabilistic generators guided by large language models. It then performs cross-model refinement by training on disjoint data partitions and revising synthetic values across splits to reduce confirmation bias and improve pseudo-value quality. The refined synthetic datasets are aggregated with real data to train a unified conditional model. Across heterogeneous tabular benchmarks, TEXR consistently improves zero-, few-, and many-shot performance for multiple OCM backbones, demonstrating that structured task expansion and cross refinement enhance open-world conditional modeling.
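
The cross-refinement idea in the abstract (train on disjoint partitions, then revise synthetic values across splits) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `fit_lookup` and `cross_refine` are hypothetical, and the toy majority-vote lookup stands in for the OCM backbones that the paper actually trains. The point is only that each row's value is revised by a model that never saw that row during training.

```python
from collections import Counter, defaultdict


def fit_lookup(rows):
    """Toy stand-in for a conditional model (hypothetical, not the paper's).

    Learns the majority target value for each observed feature value.
    """
    counts = defaultdict(Counter)
    for x, y in rows:
        counts[x][y] += 1
    return {x: c.most_common(1)[0][0] for x, c in counts.items()}


def cross_refine(rows):
    """Revise targets in each partition with a model fit on the other one."""
    # Two disjoint partitions of the weakly instantiated rows.
    part_a, part_b = rows[0::2], rows[1::2]
    model_a, model_b = fit_lookup(part_a), fit_lookup(part_b)
    # Each row is relabeled by the model that did NOT train on it;
    # the original value is kept when that model has never seen the feature.
    refined_a = [(x, model_b.get(x, y)) for x, y in part_a]
    refined_b = [(x, model_a.get(x, y)) for x, y in part_b]
    return refined_a + refined_b


# One noisy pseudo-value (the 0) among otherwise consistent rows.
noisy = [("a", 1), ("a", 1), ("a", 1), ("a", 1), ("a", 0), ("a", 1)]
refined = cross_refine(noisy)
```

Because a model never revises the rows it was fit on, its own labeling errors are not fed back to it, which is the intuition behind reducing confirmation bias.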

Paper Structure (39 sections, 1 theorem, 3 equations, 10 figures, 1 table)

Key Result

Proposition 3.1

Open-world conditional model training on arbitrary inputs and targets is equivalent to training based on context-conditioned joint modeling.

Figures (10)

  • Figure 1: The task universe for open-world conditional modeling is vast, and any finite collection of real-world datasets (gray) captures only a small fraction of it. We propose synthetic data generation and refinement techniques to expand task coverage (orange).
  • Figure 2: TEXR provides an initial collection of weakly instantiated datasets that are partitioned and then refined with cross-model labeling (which refines values on synthetic data not seen during training). Finally, revised synthetic and real datasets are consolidated to train a full open-world conditioning model.
  • Figure 3: Our initial collection of synthetic datasets stems from an LLM-guided process that first generates hypothetical data contexts and then applies a weak-labeling scheme, which generates corresponding Bayesian networks and conditional probability tables to produce instance values.
  • Figure 4: Five-shot F1 comparison across OCM backbones and data expansion strategies: synthetic generator (color), refinement (hatch), real-data-only (gray), and TEXR (black).
  • Figure 5: Zero-shot (ZSL) to five-shot (5SL) F1 comparison across data expansion strategies and OCM backbones. (TP-BERTa and TabSTAR are unable to perform ZSL.)
  • ...and 5 more figures
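
The weak-labeling step named in the Figure 3 caption (a Bayesian network plus conditional probability tables producing instance values) can be sketched as ancestral sampling. This is an illustrative sketch, not the paper's generator: the two-variable schema, its probabilities, and the names `PARENTS`, `CPTS`, `sample_row`, and `sample_dataset` are all hypothetical, and in TEXR the structure and tables would come from an LLM-guided process rather than being written by hand.

```python
import random

# Hypothetical schema: "smoker" has no parents, "cough" depends on "smoker".
PARENTS = {"smoker": [], "cough": ["smoker"]}

# Conditional probability tables, keyed by the tuple of parent values.
# The numbers are made up for illustration.
CPTS = {
    "smoker": {(): {"yes": 0.3, "no": 0.7}},
    "cough": {
        ("yes",): {"yes": 0.6, "no": 0.4},
        ("no",): {"yes": 0.1, "no": 0.9},
    },
}

# Variables listed so that every parent precedes its children.
ORDER = ["smoker", "cough"]


def sample_row(rng):
    """Draw one instance by sampling each variable given its parents."""
    row = {}
    for var in ORDER:
        key = tuple(row[p] for p in PARENTS[var])
        dist = CPTS[var][key]
        values = list(dist)
        weights = [dist[v] for v in values]
        row[var] = rng.choices(values, weights=weights)[0]
    return row


def sample_dataset(n, seed=0):
    """Draw n weakly instantiated rows with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [sample_row(rng) for _ in range(n)]


rows = sample_dataset(200)
```

Sampling in topological order guarantees that each variable's parents already have values when its table is looked up.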

Theorems & Definitions (1)

  • Proposition 3.1