OPRO: Orthogonal Panel-Relative Operators for Panel-Aware In-Context Image Generation

Sanghyeon Lee, Minwoo Lee, Euijin Shin, Kangyeol Kim, Seunghwan Choi, Jaegul Choo

Abstract

We introduce a parameter-efficient adaptation method for panel-aware in-context image generation with pre-trained diffusion transformers. The key idea is to compose learnable, panel-specific orthogonal operators onto the backbone's frozen positional encodings. This design provides two desirable properties: (1) isometry, which preserves the geometry of internal features, and (2) same-panel invariance, which maintains the model's pre-trained intra-panel synthesis behavior. Through controlled experiments, we demonstrate that the effectiveness of our adaptation method is not tied to a specific positional encoding design but generalizes across diverse positional encoding regimes. By enabling effective panel-relative conditioning, the proposed method consistently improves in-context image-based instructional editing pipelines, including state-of-the-art approaches.


Paper Structure

This paper contains 35 sections, 3 theorems, 37 equations, 9 figures, 8 tables.

Key Result

Proposition 1

For all tokens $i,j$, the OPRO transformation preserves the norms of the position-aware query and key vectors:

$$\|U_{p(i)}\,\tilde{q}_i\| = \|\tilde{q}_i\|, \qquad \|U_{p(j)}\,\tilde{k}_j\| = \|\tilde{k}_j\|.$$
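This property follows directly from the orthogonality of the panel operators. A minimal NumPy sketch (using a random orthogonal matrix from a QR decomposition as an illustrative stand-in for a learned $U_p$) checks both the norm preservation of Proposition 1 and the dot-product preservation underlying same-panel invariance:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # head dimension (illustrative)

# Random orthogonal operator via QR decomposition: a stand-in for a
# learned panel-specific operator U_p (hypothetical values).
U_p, _ = np.linalg.qr(rng.standard_normal((d, d)))

q = rng.standard_normal(d)  # position-aware query
k = rng.standard_normal(d)  # position-aware key

# Isometry (Proposition 1): orthogonal operators preserve norms.
assert np.isclose(np.linalg.norm(U_p @ q), np.linalg.norm(q))
assert np.isclose(np.linalg.norm(U_p @ k), np.linalg.norm(k))

# When query and key share the same panel operator, their attention
# score (dot product) is unchanged: (U_p q) . (U_p k) = q . k.
assert np.isclose((U_p @ q) @ (U_p @ k), q @ k)
```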

Figures (9)

  • Figure 1: Two positional regimes in tiled ICG and the role of OPRO in panelized attention. (a) Global-canvas encoding: Inpainting-based DiTs treat the tiled layout as a single image on a single global coordinate grid, so different panels become disjoint regions of a unified canvas. (b) Per-panel encoding: T2I-based methods encode each panel in its own local frame and then fuse context features into target generation through attention, reusing the same coordinate range across panels. (c) In panelized attention, diagonal blocks are intra-panel and off-diagonal blocks are inter-panel. OPRO preserves the intra-panel blocks while modulating the inter-panel blocks.
  • Figure 2: Overview of OPRO for tiled-panel in-context image generation. The proposed framework partitions a tiled canvas into $P$ panels and processes them as a single token sequence. Within each attention layer of a backbone, OPRO modulates the position-aware queries ($\tilde{q}_i$) and keys ($\tilde{k}_j$) via panel-specific orthogonal operators ($U_{p(i)}$ and $U_{p(j)}$). This adaptation explicitly guides cross-panel interactions while preserving the original same-panel attention geometry. An example generated image is provided on the right.
  • Figure 3: Two-stage compositional reasoning. Stage 1 (single-panel pretext): classify the sum of two arrow orientations modulo $360^\circ$ (8-way) with distractors. Stage 2 (grid reasoning): on an $n{\times}n$ grid, each row provides context examples and a held-out query; the row-wise rule is either rotation by $k\!\cdot\!45^\circ$ or vertical mirror symmetry.
  • Figure 4: OPRO's impact on parameter efficiency. Validation accuracy (%) plotted against the number of trainable adapter parameters (M) for 3×3 panels.
  • Figure 5: Comparison with inpainting-based ICG baselines on the MagicBrush [zhang2023magicbrush] test set. Following ICEdit [zhang2025context], a diptych prompt is used: “A diptych with two side-by-side images … but { $\mathcal{P}$ }.” Red dotted boxes highlight regions where the baselines fail to preserve context from the input image, resulting in incorrect or incomplete edits.
  • ...and 4 more figures
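The panelized attention structure of Figure 1(c) can be made concrete with a small NumPy sketch. Assuming each token's position-aware query and key are modulated by its panel's orthogonal operator (random stand-ins below; sizes are illustrative), the diagonal (intra-panel) score blocks are left unchanged while the off-diagonal (inter-panel) blocks are modulated:

```python
import numpy as np

rng = np.random.default_rng(1)
P, n, d = 2, 3, 8  # panels, tokens per panel, head dim (illustrative)

# Position-aware queries/keys for a tiled sequence of P*n tokens.
Q = rng.standard_normal((P * n, d))
K = rng.standard_normal((P * n, d))
panel = np.repeat(np.arange(P), n)  # panel index of each token

# One orthogonal operator per panel (random stand-ins via QR).
U = np.stack([np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(P)])

scores = Q @ K.T  # baseline attention scores

# Apply each token's panel operator to its query and key.
Qm = np.einsum("tij,tj->ti", U[panel], Q)
Km = np.einsum("tij,tj->ti", U[panel], K)
scores_mod = Qm @ Km.T

same = panel[:, None] == panel[None, :]
# Intra-panel blocks are preserved; inter-panel blocks change.
assert np.allclose(scores_mod[same], scores[same])
assert not np.allclose(scores_mod[~same], scores[~same])
```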

Theorems & Definitions (5)

  • Proposition 1: Isometry
  • Proof
  • Proposition 2: Same-Panel Invariance
  • Proof
  • Proposition 3: Zero initialization identity mapping with non-degenerate gradient
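The paper's exact parameterization for Proposition 3 is not shown on this page; one standard construction with both stated properties is the Cayley map of a skew-symmetric matrix, sketched below (the function name `cayley` and the dimensions are assumptions for illustration). A zero-initialized unconstrained parameter yields the identity operator, and any parameter value yields an exactly orthogonal operator:

```python
import numpy as np

d = 6
A = np.zeros((d, d))  # unconstrained trainable parameter, zero-initialized

def cayley(A):
    """Orthogonal operator from an unconstrained matrix via the Cayley map.

    S = A - A^T is skew-symmetric, so U = (I - S)^{-1} (I + S) is orthogonal.
    """
    S = A - A.T
    I = np.eye(A.shape[0])
    return np.linalg.solve(I - S, I + S)

U = cayley(A)
assert np.allclose(U, np.eye(d))  # zero init -> identity mapping

# Perturbed parameters still yield an orthogonal operator.
rng = np.random.default_rng(2)
U2 = cayley(0.1 * rng.standard_normal((d, d)))
assert np.allclose(U2.T @ U2, np.eye(d))
```

Because the map is differentiable in `A` with a non-vanishing Jacobian at zero, gradients are non-degenerate at initialization, matching the property the proposition names.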