
Symbolic Representation for Any-to-Any Generative Tasks

Jiaqi Chen, Xiaoye Zhu, Yue Wang, Tianyang Liu, Xinhui Chen, Ying Chen, Chak Tou Leong, Yifei Ke, Joseph Liu, Yiwen Yuan, Julian McAuley, Li-jia Li

TL;DR

The paper tackles the challenge of enabling any-to-any generative tasks across modalities without task-specific training. It introduces A-Language, a symbolic representation that decomposes tasks into functions, parameters, and topology, and a training-free LM-driven inference engine to map natural language instructions to executable symbolic workflows. Empirical results on 120 real-world tasks and ComfyBench show competitive performance with state-of-the-art unified models, while delivering enhanced editability, interpretability, and efficiency. The work argues for the value of explicit symbolic task representations as a cost-effective, extensible foundation for advancing cross-modal generative AI, with robust topology construction and iterative refinement as key enablers of practical deployment.

Abstract

We propose a symbolic generative task description language and a corresponding inference engine capable of representing arbitrary multimodal tasks as structured symbolic flows. Unlike conventional generative models that rely on large-scale training and implicit neural representations to learn cross-modal mappings, often at high computational cost and with limited flexibility, our framework introduces an explicit symbolic representation comprising three core primitives: functions, parameters, and topological logic. Leveraging a pre-trained language model, our inference engine maps natural language instructions directly to symbolic workflows in a training-free manner. Our framework successfully performs over 12 diverse multimodal generative tasks, demonstrating strong performance and flexibility without the need for task-specific tuning. Experiments show that our method not only matches or outperforms existing state-of-the-art unified models in content quality, but also offers greater efficiency, editability, and interruptibility. We believe that symbolic task representations provide a cost-effective and extensible foundation for advancing the capabilities of generative AI.
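The three primitives named above (functions, parameters, and topological logic) can be illustrated with a small data structure that is executed in dependency order. The sketch below is not the authors' A-Language implementation. `Node`, `SymbolicFlow`, and `REGISTRY` are hypothetical names, and the registered "functions" are string-building stubs that stand in for real generative models. It shows how an explicit flow can be run like a program and then edited by changing a single parameter.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical function registry: stubs standing in for generative models.
# Each stub returns a descriptive string so the data flow is visible.
REGISTRY: Dict[str, Callable[..., Any]] = {
    "text_to_image": lambda prompt, steps: f"image({prompt!r}, steps={steps})",
    "image_to_video": lambda image, fps: f"video({image}, fps={fps})",
}

@dataclass
class Node:
    func: str               # function primitive: key into REGISTRY
    params: Dict[str, Any]  # parameter primitive: static keyword arguments

@dataclass
class SymbolicFlow:
    nodes: Dict[str, Node] = field(default_factory=dict)
    # Topology primitive: (source node, destination node, destination argument).
    edges: List[Tuple[str, str, str]] = field(default_factory=list)

    def run(self) -> Dict[str, Any]:
        # Map each node to its predecessors so graphlib can order execution.
        deps = {name: set() for name in self.nodes}
        for src, dst, _ in self.edges:
            deps[dst].add(src)
        results: Dict[str, Any] = {}
        for name in TopologicalSorter(deps).static_order():
            node = self.nodes[name]
            kwargs = dict(node.params)
            # Feed upstream outputs into this node's named inputs.
            for src, dst, arg in self.edges:
                if dst == name:
                    kwargs[arg] = results[src]
            results[name] = REGISTRY[node.func](**kwargs)
        return results

# Example: a text-to-video task decomposed into two chained functions.
flow = SymbolicFlow(
    nodes={
        "img": Node("text_to_image", {"prompt": "a cat on a sofa", "steps": 20}),
        "vid": Node("image_to_video", {"fps": 8}),
    },
    edges=[("img", "vid", "image")],
)
print(flow.run()["vid"])  # video(image('a cat on a sofa', steps=20), fps=8)

# Editing the flow: change one parameter and rerun, with no retraining.
flow.nodes["vid"].params["fps"] = 24
print(flow.run()["vid"])  # video(image('a cat on a sofa', steps=20), fps=24)
```

Keeping the topology as explicit edges, rather than as an implicit neural mapping, is what makes the final edit a one-line change in this sketch.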

Paper Structure

This paper contains 32 sections, 9 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: A symbolic representation for Any-to-Any generative tasks. (a) We develop a training-free inference engine that transforms natural language task descriptions into an executable symbolic flow comprising functions, parameters, and the topology. (b) The symbolic flow allows generative tasks to be executed as programs. The example task is the one described in the first sentence of the Introduction. (c) Both functions and parameters can be easily modified to customize the generation process and the output style.
  • Figure 2: The Any-to-Any generative model. Our model handles any-to-any generative tasks across various modalities, including text, images, videos, audio, and 3D content. It supports flexible transformations such as converting images to videos, generating 3D models from images, or synthesizing audio from textual prompts. Formally, any-to-any generative tasks refer to generating outputs in any desired modality from inputs in any other modality, all guided by natural language instructions [tang2024any].
  • Figure 3: Syntax comparison. We implement our symbolic representation using three different styles of domain-specific languages (DSLs). (a) The declarative syntax registers all components into the workflow. (b) The dataflow syntax emphasizes the direction of data flow. (c) The pseudo-natural syntax mimics human language expression.
  • Figure 4: Inferring a symbolic flow with a pre-trained language model (LM). Beginning with (a) a natural language task description together with key functions and parameters, we use the LM to infer (b) a comprehensive set of functions and parameters. We then combine (a) and (b) to deduce (c) the topology. If compilation or execution fails, all information is aggregated for further refinement (see the section on topology construction).
  • Figure 5: Demonstration of the inference and execution. The inference framework translates a natural language task description into an executable symbolic representation. This symbolic representation is then compiled and executed through a workflow executor to perform the desired transformation. See appendix for details.
  • ...and 4 more figures
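The infer, compile, and refine cycle described in the Figure 4 caption can be sketched as a loop that validates a proposed topology and feeds any error back to the proposer. This is a simplified illustration under stated assumptions, not the paper's inference engine: `compile_topology`, `infer_flow`, and `fake_propose` are hypothetical names, the language model is replaced by a scripted stub, and functions and topology are proposed in a single step rather than in the separate stages (b) and (c) shown in the figure. The "compiler" here only checks that edges reference known nodes and that the graph is acyclic.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Callable, List, Optional, Tuple

Edge = Tuple[str, str]  # (source node, destination node)

def compile_topology(nodes: List[str], edges: List[Edge]) -> Optional[str]:
    """Return an error message, or None if the topology is valid."""
    known = set(nodes)
    for src, dst in edges:
        if src not in known or dst not in known:
            return f"edge {src}->{dst} references an unknown node"
    deps = {n: set() for n in nodes}
    for src, dst in edges:
        deps[dst].add(src)
    try:
        list(TopologicalSorter(deps).static_order())  # raises on cycles
    except CycleError as err:
        return f"cycle detected: {err.args[1]}"
    return None

def infer_flow(
    task: str,
    propose: Callable[[str, List[str]], Tuple[List[str], List[Edge]]],
    max_rounds: int = 3,
) -> Tuple[List[str], List[Edge]]:
    """Ask the proposer for a flow; on failure, retry with aggregated feedback."""
    feedback: List[str] = []
    for _ in range(max_rounds):
        nodes, edges = propose(task, feedback)
        error = compile_topology(nodes, edges)
        if error is None:
            return nodes, edges
        feedback.append(error)  # accumulate errors for the next attempt
    raise RuntimeError(f"no valid flow after {max_rounds} rounds: {feedback}")

def fake_propose(task: str, feedback: List[str]) -> Tuple[List[str], List[Edge]]:
    """Scripted stand-in for an LM: first draft has a cycle, second is fixed."""
    nodes = ["load_image", "image_to_video", "add_audio"]
    if not feedback:
        return nodes, [("load_image", "image_to_video"),
                       ("image_to_video", "add_audio"),
                       ("add_audio", "image_to_video")]
    return nodes, [("load_image", "image_to_video"),
                   ("image_to_video", "add_audio")]

nodes, edges = infer_flow("animate this photo and add music", fake_propose)
print(edges)
```

In a real system the proposer would be a prompted language model and compilation would also check argument types; the loop structure is the part this sketch is meant to convey.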