SpaceJAM: a Lightweight and Regularization-free Method for Fast Joint Alignment of Images

Nir Barel; Ron Shapira Weber; Nir Mualem; Shahaf E. Finder; Oren Freifeld

TL;DR

SpaceJAM tackles unsupervised joint alignment (JA) without atlas maintenance or regularization, using a lightweight 16K-parameter model that trains and runs an order of magnitude faster than prior methods. It achieves this with a novel inverse-compositional loss built on a Lie-algebra-parameterized sequence of small, invertible warps within a shared latent feature space, plus a compact preprocessing pipeline (PCA + autoencoder) that produces the $U_i$ representations. Key contributions include the Lie-group-aware IC-STN architecture, a curriculum that transitions from $\mathrm{SE}(2)$ to full homographies, and an effective flip-handling mechanism. Together these yield competitive PCK@0.10 scores on SPair-71K and CUB with substantial speedups and a reduced parameter count. The approach enables efficient, robust JA suitable for weakly supervised settings and broad applicability, with code and models made publicly available.
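To make the Lie-algebra parameterization concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes an 8-parameter vector mapped to a traceless 3x3 matrix (one possible basis; the paper's own may differ) and a point-mapping convention in which the pairwise warp is the matrix product of the source warp with the inverse of the target warp. The helper names `expm`, `theta_to_algebra`, `warp_matrix`, and `pairwise_matrix` are hypothetical.

```python
import numpy as np

def expm(A, squarings=8, terms=12):
    """Matrix exponential via scaling-and-squaring with a truncated Taylor series."""
    A = np.asarray(A, dtype=float) / (2 ** squarings)
    result = np.eye(A.shape[0])
    term = np.eye(A.shape[0])
    for k in range(1, terms + 1):
        term = term @ A / k          # A^k / k!
        result = result + term
    for _ in range(squarings):       # undo the scaling: exp(A) = exp(A/2^s)^(2^s)
        result = result @ result
    return result

def theta_to_algebra(theta):
    """Map 8 parameters to a traceless 3x3 matrix (assumed basis, for illustration)."""
    t = np.asarray(theta, dtype=float)
    return np.array([
        [t[0], t[1], t[2]],
        [t[3], t[4], t[5]],
        [t[6], t[7], -(t[0] + t[4])],
    ])

def warp_matrix(theta):
    """Invertible 3x3 warp; its inverse is obtained simply by negating theta."""
    return expm(theta_to_algebra(theta))

def pairwise_matrix(theta_j, theta_i):
    """Source-to-target warp: go to the shared space, then apply the target's inverse."""
    return warp_matrix(theta_j) @ warp_matrix(-np.asarray(theta_i, dtype=float))
```

Because the warp is the exponential of a Lie-algebra element, it is invertible by construction and its inverse costs nothing extra to compute, which is what makes an inverse-compositional formulation convenient.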

Abstract

The unsupervised task of Joint Alignment (JA) of images is beset by challenges such as high complexity, geometric distortions, and convergence to poor local or even global optima. Although Vision Transformers (ViT) have recently provided valuable features for JA, they fall short of fully addressing these issues. Consequently, researchers frequently depend on expensive models and numerous regularization terms, resulting in long training times and challenging hyperparameter tuning. We introduce the Spatial Joint Alignment Model (SpaceJAM), a novel approach that addresses the JA task with efficiency and simplicity. SpaceJAM leverages a compact architecture with only 16K trainable parameters and uniquely operates without the need for regularization or atlas maintenance. Evaluations on SPair-71K and CUB datasets demonstrate that SpaceJAM matches the alignment capabilities of existing methods while significantly reducing computational demands and achieving at least a 10x speedup. SpaceJAM sets a new standard for rapid and effective image alignment, making the process more accessible and efficient. Our code is available at: https://bgu-cs-vil.github.io/SpaceJAM/.

Paper Structure (32 sections, 10 equations, 9 figures, 7 tables)

Introduction
Related Work
Classical JA methods
Deep learning
Semantic correspondence through self-supervision
JA using DINO features
In summary
Background: Typical Challenges in Joint Alignment
JA Approach #1: Alignment to a Shared Atlas
JA Approach #2: Congealing
Method
Preprocessing
Alignment Network
Lie Algebras, Lie groups, and the Matrix Exponential
Loss Function
...and 17 more sections

Figures (9)

Figure 1: SpaceJAM joint alignment. Our framework jointly aligns a set of images of an object category in only a few minutes. Top-to-bottom: 1) input images; 2) learned low-dimensional representations; 3) aligned features; 4) aligned images. The last column depicts the average representation (atlas) obtained after training.
Figure 2: Framework overview. Given a set of images, $(I_i)_{i=1}^N$, together with their DINO-ViT representations and coarse masks, $(V_i, M_i)_{i=1}^N$, SpaceJAM learns an inverse-compositional pairwise alignment between each image pair and, consequently, features, $(U_i)_{i=1}^N$, in a shared semantic space, where they are warped (according to learned warping parameters, $(\boldsymbol{\theta}_i)_{i=1}^N$) to produce their aligned versions, $(U_i\circ T^{\boldsymbol{\theta}_i})_{i=1}^N$. Pairwise alignment of $I_j$ to a target image $I_i$ is achieved by warping to the shared space and then warping the result by the inverse transformation of the target image, yielding $I_j\circ T^{\boldsymbol{\theta}_j}\circ T^{-\boldsymbol{\theta}_i}$.
Figure 3: Global alignment vs. dense warping, evaluated on several classes of the SPair-71K dataset (Min et al., 2019). A source image ($1^{\mathrm{st}}$ row) is mapped to the target image ($2^{\mathrm{nd}}$ row) via either dense warping ($3^{\mathrm{rd}}$ row), achieved by ASIC (Gupta et al., ICCV 2023; as presented in their paper), or a parametric alignment by SpaceJAM (ours, $4^{\mathrm{th}}$ row). Dense mapping is prone to producing incoherent results (please zoom in to better see that effect) and relies heavily on regularization. In comparison, our regularization-free method produces geometrically coherent results and is also much faster.
Figure 4: Pairwise and joint alignment using SpaceJAM. The input images ($1^{\mathrm{st}}$ row) are overlaid with the learned features ($2^{\mathrm{nd}}$ row), which are used to predict the warping parameters. The $3^{\mathrm{rd}}$ row shows source-to-target alignment under severe conditions, and the $4^{\mathrm{th}}$ row shows the jointly aligned features and the category atlas (right column).
Figure 5: Lie-algebraic curriculum learning. The notation $\mathrm{SE}(2)$ between epochs $(0,100)$ indicates that during that interval, training is restricted to $\mathrm{SE}(2)$. At epoch 100, more transformation parameters are "released" to allow for affine transformations.
...and 4 more figures
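
The curriculum in Figure 5 can be pictured as a mask over Lie-algebra coefficients that is relaxed as training progresses. The sketch below is illustrative only: the coefficient ordering, the names `STAGE_MASKS`, `stage_for_epoch`, and `apply_curriculum`, and the `homography_epoch=200` default are assumptions; only the switch from $\mathrm{SE}(2)$ to affine at epoch 100 comes from the caption.

```python
import numpy as np

# Assumed coefficient ordering (hypothetical):
# [tx, ty, rotation, scale, stretch, shear, projective_1, projective_2]
STAGE_MASKS = {
    "se2":        np.array([1, 1, 1, 0, 0, 0, 0, 0], dtype=float),
    "affine":     np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=float),
    "homography": np.array([1, 1, 1, 1, 1, 1, 1, 1], dtype=float),
}

def stage_for_epoch(epoch, affine_epoch=100, homography_epoch=200):
    """Return the transformation family allowed at a given epoch.

    affine_epoch=100 follows the Figure 5 caption; homography_epoch is a
    placeholder value chosen for illustration.
    """
    if epoch < affine_epoch:
        return "se2"
    if epoch < homography_epoch:
        return "affine"
    return "homography"

def apply_curriculum(theta, epoch, **schedule):
    """Zero out the coefficients that have not yet been released."""
    mask = STAGE_MASKS[stage_for_epoch(epoch, **schedule)]
    return np.asarray(theta, dtype=float) * mask
```

Masking in the Lie algebra keeps every intermediate warp inside the intended subgroup, so the early epochs can only rotate and translate before more expressive deformations become available.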
