Ontology-Guided Diffusion for Zero-Shot Visual Sim2Real Transfer

Mohamed Youssef, Mayar Elfares, Anna-Maria Meer, Matteo Bortoletto, Andreas Bulling

Abstract

Bridging the simulation-to-reality (sim2real) gap remains challenging as labelled real-world data is scarce. Existing diffusion-based approaches rely on unstructured prompts or statistical alignment, which do not capture the structured factors that make images look real. We introduce Ontology-Guided Diffusion (OGD), a neuro-symbolic zero-shot sim2real image translation framework that represents realism as structured knowledge. OGD decomposes realism into an ontology of interpretable traits -- such as lighting and material properties -- and encodes their relationships in a knowledge graph. From a synthetic image, OGD infers trait activations and uses a graph neural network to produce a global embedding. In parallel, a symbolic planner uses the ontology traits to compute a consistent sequence of visual edits needed to narrow the realism gap. The graph embedding conditions a pretrained instruction-guided diffusion model via cross-attention, while the planned edits are converted into a structured instruction prompt. Across benchmarks, our graph-based embeddings better distinguish real from synthetic imagery than baselines, and OGD outperforms state-of-the-art diffusion methods in sim2real image translation. Overall, OGD shows that explicitly encoding realism structure enables interpretable, data-efficient, and generalisable zero-shot sim2real transfer.
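
To make the step from trait activations to a graph embedding concrete, the following is a minimal sketch, not the authors' implementation. It assumes scalar trait activations, a hand-written signed edge list, and a fixed (untrained) message-passing rule in place of a learned graph neural network. The trait names, the edges, and the functions `propagate` and `readout` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical realism traits; the paper's actual ontology is not listed here.
TRAITS = ["soft_shadows", "flat_lighting", "surface_roughness"]

# Hypothetical signed edges (a, b, sign): +1.0 supportive, -1.0 opposing.
EDGES = [
    ("soft_shadows", "flat_lighting", -1.0),
    ("surface_roughness", "soft_shadows", +1.0),
]


def propagate(activations, edges, steps=2, alpha=0.5):
    """Mix each trait with the signed mean of its neighbours.

    This is a fixed update rule standing in for a trained GNN layer.
    Edges are treated as undirected for simplicity (an assumption).
    """
    state = dict(activations)
    for _ in range(steps):
        incoming = {t: 0.0 for t in state}
        degree = {t: 0 for t in state}
        for a, b, sign in edges:
            incoming[b] += sign * state[a]
            incoming[a] += sign * state[b]
            degree[a] += 1
            degree[b] += 1
        state = {
            t: math.tanh(
                (1.0 - alpha) * state[t]
                + alpha * (incoming[t] / degree[t] if degree[t] else 0.0)
            )
            for t in state
        }
    return state


def readout(state, order):
    """Collect node states in a fixed order to form one global vector."""
    return [state[t] for t in order]


# Example trait activations, as might be inferred from a synthetic image.
acts = {"soft_shadows": 0.1, "flat_lighting": 0.9, "surface_roughness": 0.2}
embedding = readout(propagate(acts, EDGES), TRAITS)
```

In this sketch an opposing edge pulls a trait's value down when its neighbour is active, which is the intuition behind signed relationships in the knowledge graph. A learned model would replace the fixed rule with trainable weights and vector-valued node states.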

Paper Structure

This paper contains 40 sections, 13 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Ontology-Guided Diffusion (OGD) explicitly models visual realism to bridge the sim2real gap. Unlike current instruction-guided diffusion that relies on unstructured prompts, OGD decomposes realism into structured traits organised in a knowledge graph that models causal relationships, and uses a symbolic PDDL planner to generate coherent editing actions. Conditioning the diffusion model on this structured guidance produces more realistic and consistent translations from synthetic to real images, enabling interpretable and data-efficient sim2real transfer.
  • Figure 2: Overview of the proposed ontology-guided diffusion framework. A synthetic image is first mapped to realism trait probabilities using supervised MLP heads trained on frozen CLIP features. Traits are propagated through a static realism knowledge graph using a GNN to obtain node-level realism embeddings. Differences between synthetic and target realism states are converted into symbolic transformation plans via PDDL. The symbolic plan and graph embeddings jointly condition a diffusion-based image editing model.
  • Figure 3: Qualitative sim-to-real results.
  • Figure 4: Visualization of part of the realism knowledge graph generated using Neo4j Browser. Nodes correspond to realism traits, while signed edges encode supportive (positive) and opposing (negative) relationships derived from graphics and perception literature.
  • Figure 5: Qualitative sim-to-real results.
  • ...and 1 more figure
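
The planning step described in the Figure 1 and Figure 2 captions, where differences between the synthetic and target realism states become an ordered set of edits, can be sketched as follows. This is an illustrative simplification under stated assumptions, not the paper's PDDL formulation: traits are Boolean, ordering constraints are a hand-written dependency table, and a topological sort stands in for a symbolic planner. The trait names, the `ACTIONS` and `REQUIRES` tables, and the functions `plan_edits` and `to_prompt` are all hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical edit instructions keyed by (trait, desired value).
ACTIONS = {
    ("flat_lighting", False): "replace flat lighting with directional lighting",
    ("soft_shadows", True): "add soft contact shadows",
    ("surface_roughness", True): "add fine surface roughness to materials",
}

# Hypothetical ordering constraints: trait -> traits that must be edited first.
REQUIRES = {"soft_shadows": {"flat_lighting"}}


def plan_edits(current, target):
    """Return edit instructions for every trait that differs from the target.

    Edits are ordered so that prerequisite traits are handled first.
    Traits missing from `current` are assumed inactive (False).
    """
    gap = [t for t in target if current.get(t, False) != target[t]]
    deps = {t: REQUIRES.get(t, set()) & set(gap) for t in gap}
    order = TopologicalSorter(deps).static_order()
    return [ACTIONS[(t, target[t])] for t in order]


def to_prompt(plan):
    """Join the ordered edits into one structured instruction string."""
    return ", then ".join(plan)


current = {"flat_lighting": True, "soft_shadows": False, "surface_roughness": True}
target = {"flat_lighting": False, "soft_shadows": True, "surface_roughness": True}
plan = plan_edits(current, target)
prompt = to_prompt(plan)
```

A full PDDL planner would instead search over actions with explicit preconditions and effects. The dependency table here only captures the ordering aspect of that process.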