AQuaMaM: An Autoregressive, Quaternion Manifold Model for Rapidly Estimating Complex SO(3) Distributions
Michael A. Alcorn
TL;DR
AQuaMaM tackles the challenge of learning complex, multimodal distributions on the rotation group $\text{SO}(3)$ by proposing an autoregressive quaternion-language model implemented on a Transformer. By representing rotations through projected unit quaternions and partitioning each component into mixtures of uniform bins, the model enforces the unit-norm constraint and enables exact likelihoods in a single forward pass, avoiding the slow IPDF inference that requires many MLP evaluations. The approach reframes density learning as a three-token language modeling task, yielding a cubic scaling in the number of bins and enabling parameter-efficient, fast predictions. Empirical results on toy and die datasets show that AQuaMaM achieves higher log-likelihoods, more accurate sampling aligned with the true distributions, and substantially higher throughput (e.g., 52× faster on a single GPU) compared to IPDF, with strong generalization to multi-modal pose uncertainty. Overall, AQuaMaM offers a scalable, precise method for rapid, high-fidelity $\text{SO}(3)$ distribution estimation with broad applicability to 3D pose reasoning and related domains.
Abstract
Accurately modeling complex, multimodal distributions for rotations in three dimensions, i.e., the SO(3) group, is challenging due to the curvature of the rotation manifold. The recently described implicit-PDF (IPDF) is a simple, elegant, and effective approach for learning arbitrary distributions on SO(3) up to a given precision. However, inference with IPDF requires $N$ forward passes through the network's final multilayer perceptron (where $N$ places an upper bound on the likelihood that can be calculated by the model), which is prohibitively slow for those without the computational resources necessary to parallelize the queries. In this paper, I introduce AQuaMaM, a neural network capable of both learning complex distributions on the rotation manifold and calculating exact likelihoods for query rotations in a single forward pass. Specifically, AQuaMaM autoregressively models the projected components of unit quaternions as mixtures of uniform distributions that partition their geometrically restricted domain of values. When trained on an "infinite" toy dataset with ambiguous viewpoints, AQuaMaM rapidly converges to a sampling distribution closely matching the true data distribution. In contrast, the sampling distribution for IPDF dramatically diverges from the true data distribution, despite IPDF approaching its theoretical minimum evaluation loss during training. When trained on a constructed dataset of 500,000 renders of a die in different rotations, AQuaMaM reaches a test log-likelihood 14% higher than IPDF. Further, compared to IPDF, AQuaMaM uses 24% fewer parameters, has a prediction throughput 52$\times$ faster on a single GPU, and converges in a similar amount of time during training.
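The idea of a mixture of uniform distributions over a geometrically restricted domain can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: the first three quaternion components are modeled in the order $q_x, q_y, q_z$; each component's bins tile $[-1, 1]$ evenly and are clipped to the range left over by the unit-norm constraint; and `logits_fn` is a hypothetical stand-in for the network that returns one logit per bin given the earlier components. All function names are illustrative. The sketch returns a density over the projected components only and does not convert it to a density on the rotation manifold.

```python
import numpy as np

def component_bounds(prev_components):
    """Range of the next component allowed by the unit-norm constraint."""
    remaining = 1.0 - sum(c * c for c in prev_components)
    r = np.sqrt(max(0.0, remaining))
    return -r, r

def mixture_of_uniforms_density(value, logits, lo, hi):
    """Density of `value` under uniform bins on [-1, 1] clipped to [lo, hi]."""
    if not (lo <= value <= hi) or hi <= lo:
        return 0.0
    logits = np.asarray(logits, dtype=float)
    n_bins = len(logits)
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    # Clip every bin to the valid range; bins fully outside get zero width.
    left = np.clip(edges[:-1], lo, hi)
    right = np.clip(edges[1:], lo, hi)
    widths = right - left
    valid = widths > 0
    # Softmax over the valid bins only (invalid bins get probability zero).
    masked = np.where(valid, logits, -np.inf)
    probs = np.exp(masked - masked[valid].max())
    probs /= probs.sum()
    # Locate the bin containing `value`.
    idx = int(np.searchsorted(edges, value, side="right")) - 1
    idx = min(max(idx, 0), n_bins - 1)
    if not valid[idx]:
        return 0.0
    # A uniform component contributes (bin probability) / (clipped bin width).
    return float(probs[idx] / widths[idx])

def projected_density(q_xyz, logits_fn):
    """Autoregressive density over (q_x, q_y, q_z); `logits_fn` is hypothetical."""
    density = 1.0
    for i in range(3):
        prev = q_xyz[:i]
        lo, hi = component_bounds(prev)
        density *= mixture_of_uniforms_density(q_xyz[i], logits_fn(prev), lo, hi)
    return density
```

With four bins and equal logits, for example `projected_density([0.0, 0.0, 0.0], lambda prev: np.zeros(4))`, each component contributes a density of 0.5, so the product is 0.125. Because each factor needs only the logits for one component, all three can come from a single pass of an autoregressive network, which is the property the abstract contrasts with IPDF's $N$ queries.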
