Spanning the Visual Analogy Space with a Weight Basis of LoRAs

Hila Manor, Rinon Gal, Haggai Maron, Tomer Michaeli, Gal Chechik

TL;DR

The paper tackles the limited generalization of single-LoRA adapters in visual analogy editing by introducing LoRWeB, a framework that learns a basis of LoRA adapters and a lightweight encoder to dynamically mix them per input triplet. By encoding the analogy triplet and querying a learnable key set, LoRWeB constructs a task-specific, mixed LoRA injected into a diffusion-based editing backbone, enabling flexible transformations unseen during training. Across extensive experiments, LoRWeB achieves state-of-the-art results and better generalization to unseen transformations, outpacing single-LoRA baselines and demonstrating strong preservation of original content while applying complex edits. The work highlights the promise of LoRA-basis decompositions for flexible, parameter-efficient visual manipulation and suggests broad potential for applying task-specific adapter bases beyond visual analogies.

Abstract

Visual analogy learning enables image manipulation through demonstration rather than textual description, allowing users to specify complex transformations that are difficult to articulate in words. Given a triplet $\{\mathbf{a}, \mathbf{a}', \mathbf{b}\}$, the goal is to generate $\mathbf{b}'$ such that $\mathbf{a} : \mathbf{a}' :: \mathbf{b} : \mathbf{b}'$. Recent methods adapt text-to-image models to this task using a single Low-Rank Adaptation (LoRA) module, but they face a fundamental limitation: attempting to capture the diverse space of visual transformations within a fixed adaptation module constrains generalization capabilities. Inspired by recent work showing that LoRAs in constrained domains span meaningful, interpolatable semantic spaces, we propose LoRWeB, a novel approach that specializes the model for each analogy task at inference time through dynamic composition of learned transformation primitives (informally, choosing a point in a "space of LoRAs"). We introduce two key components: (1) a learnable basis of LoRA modules, to span the space of different visual transformations, and (2) a lightweight encoder that dynamically selects and weights these basis LoRAs based on the input analogy pair. Comprehensive evaluations demonstrate that our approach achieves state-of-the-art performance and significantly improves generalization to unseen visual transformations. Our findings suggest that LoRA basis decompositions are a promising direction for flexible visual manipulation. Code and data are available at https://research.nvidia.com/labs/par/lorweb
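The two components named in the abstract, a basis of LoRA modules and an encoder that weights them, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the softmax over query-key dot products, the temperature, the mixing rule $\Delta W = \sum_i c_i B_i A_i$, and all names and shapes (`mixing_coefficients`, `mix_lora_basis`, `keys`, `downs`, `ups`) are assumptions chosen for illustration, and the random `query` merely stands in for the encoded analogy triplet.

```python
import numpy as np


def mixing_coefficients(query, keys, temperature=1.0):
    """Turn query-key similarities into coefficients that sum to 1.

    Assumption: dot-product similarity followed by a softmax. The source
    only says that similarity to learned keys determines the coefficients.
    """
    logits = keys @ query / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def mix_lora_basis(coeffs, downs, ups):
    """Return the mixed weight update sum_i coeffs[i] * ups[i] @ downs[i].

    downs has shape (n_basis, rank, d_in); ups has shape (n_basis, d_out, rank).
    Assumption: the basis is mixed at the level of the full low-rank products.
    """
    return np.einsum("n,nor,nri->oi", coeffs, ups, downs)


# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
n_basis, rank, d_in, d_out, d_key = 4, 2, 8, 6, 16
keys = rng.normal(size=(n_basis, d_key))         # one key per basis LoRA
downs = rng.normal(size=(n_basis, rank, d_in))   # "A" matrices of the basis
ups = rng.normal(size=(n_basis, d_out, rank))    # "B" matrices of the basis
query = rng.normal(size=d_key)                   # stand-in for the encoded triplet

coeffs = mixing_coefficients(query, keys)
delta_w = mix_lora_basis(coeffs, downs, ups)     # added to a frozen weight matrix
```

Under this mixing rule the result is itself a low-rank update of rank at most `n_basis * rank`, so it can be applied to the backbone like any single LoRA.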

Paper Structure (34 sections, 4 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: LoRWeB. We present a novel method for analogy-based editing, based on learnable mixing of low-rank adapters. Given a prompt and an image triplet $\{{\mathbf{a}},{\mathbf{a}}',{\mathbf{b}}\}$ that visually describe a desired transformation, LoRWeB dynamically constructs a single LoRA from a learnable basis of LoRA modules, and produces an editing result ${\mathbf{b}}'$ that applies the same analogy for the new image.
  • Figure 2: LoRWeB Overview. We first encode ${\mathbf{a}}$ and ${\mathbf{a}}'$, which describe a visual transformation (e.g. adding a hat to the man), and ${\mathbf{b}}$, which should be edited analogously (e.g. adding a hat to the woman), using CLIP and a small learned projection module. The similarity between the encoded vector and a set of learned keys determines the linear coefficients for combining the learned LoRAs into a single, mixed LoRA. This mixed LoRA is injected into a conditional flow model (e.g. Flux.1-Kontext). Next, we build a $2\times2$ composite image from $\{{\mathbf{a}},{\mathbf{a}}',{\mathbf{b}}\}$. The conditional flow model receives this composite image as its input, along with a guiding edit prompt, and produces a composite image with the edited result ${\mathbf{b}}'$ in the bottom-right quadrant.
  • Figure 3: LoRWeB visual analogy results. Using a LoRA Basis allows LoRWeB to generalize to a wide variety of new analogy tasks, from adding objects to transferring specific styles or makeup or copying pose changes. Please zoom in for more details.
  • Figure 4: Comparisons with baseline methods on unseen tasks. Our approach generalizes across more diverse tasks, and better maintains the visual details of both the subject and the analogy.
  • Figure 5: Quantitative comparisons. (left) Accuracy of the applied edit and preservation of ${\mathbf{b}}$ in ${\mathbf{b}}'$, as judged by Gemma-3. Top-right is better. (right) CLIP directional similarity and LPIPS between ${\mathbf{b}}'$ and ${\mathbf{b}}$. Bottom-right is better. Our method pushes the Pareto front of the trade-off between edit accuracy and preservation, achieving higher edit accuracy while strongly preserving the input image.
  • ...and 4 more figures
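
The $2\times2$ composite described in the Figure 2 caption can be sketched as follows. Only the position of ${\mathbf{b}}'$ (bottom-right) is stated in the caption. The placement of ${\mathbf{a}}$, ${\mathbf{a}}'$, and ${\mathbf{b}}$ in the other quadrants, the zero-filled placeholder, and the helper names `build_composite` and `extract_result` are assumptions made for illustration.

```python
import numpy as np


def build_composite(a, a_prime, b, fill_value=0):
    """Tile three equally sized (H, W, C) images into a (2H, 2W, C) grid.

    Assumed layout: a top-left, a' top-right, b bottom-left, and a blank
    bottom-right quadrant where the model is expected to write b'.
    """
    if not (a.shape == a_prime.shape == b.shape):
        raise ValueError("all three images must share the same shape")
    blank = np.full_like(b, fill_value)  # placeholder quadrant (assumption)
    top = np.concatenate([a, a_prime], axis=1)
    bottom = np.concatenate([b, blank], axis=1)
    return np.concatenate([top, bottom], axis=0)


def extract_result(composite):
    """Crop the bottom-right quadrant, where the edited result b' is placed."""
    half_h, half_w = composite.shape[0] // 2, composite.shape[1] // 2
    return composite[half_h:, half_w:]


# Toy constant-valued "images" so each quadrant is easy to identify.
a = np.full((4, 6, 3), 10, dtype=np.uint8)
a_prime = np.full((4, 6, 3), 20, dtype=np.uint8)
b = np.full((4, 6, 3), 30, dtype=np.uint8)

composite = build_composite(a, a_prime, b)
result_quadrant = extract_result(composite)
```

In an actual pipeline the composite would be passed to the editing model together with the prompt, and `extract_result` would be applied to the model's output rather than to its input.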