Repeat and Concatenate: 2D to 3D Image Translation with 3D to 3D Generative Modeling
Abril Corona-Figueroa, Hubert P. H. Shum, Chris G. Willcocks
TL;DR
This work tackles 2D X-ray to 3D CT-like reconstruction under limited data by reframing the task as a 3D-to-3D generative problem. It preserves 2D information by repeating and concatenating $N$ views into a high-channel volume and uses a Swin UNETR-based mapper together with neural optimal transport, regularized by the de-biased Sinkhorn divergence, to maintain fidelity to the inputs without heavy latent encoding. The approach demonstrates strong cross-view correlation, achieving competitive reconstructions when trained on a single dataset and generalizing to six datasets, including out-of-distribution samples, with fast convergence (~$2{,}000$ iterations, ~28 hours). This method is fast, data-efficient, and robust to view variations, offering practical potential for clinical CT reconstruction while acknowledging remaining blur due to intrinsic uncertainty, which could be mitigated by iterative alignment or diffusion-based refinement in future work.
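For context, the de-biased Sinkhorn divergence mentioned above is commonly defined in the optimal-transport literature (the exact form used in the paper may differ) as $S_\varepsilon(\alpha, \beta) = \mathrm{OT}_\varepsilon(\alpha, \beta) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\alpha, \alpha) - \tfrac{1}{2}\,\mathrm{OT}_\varepsilon(\beta, \beta)$, where $\mathrm{OT}_\varepsilon$ is the entropy-regularized optimal transport cost with regularization strength $\varepsilon$. Subtracting the two self-transport terms removes the entropic bias, so that $S_\varepsilon(\alpha, \alpha) = 0$ and the divergence behaves as a proper discrepancy between distributions.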
Abstract
This paper investigates a 2D to 3D image translation method with a straightforward technique, enabling correlated 2D X-ray to 3D CT-like reconstruction. We observe that existing approaches, which integrate information across multiple 2D views in the latent space, lose valuable signal information during latent encoding. Instead, we simply repeat and concatenate the 2D views into higher-channel 3D volumes and approach the 3D reconstruction challenge as a direct 3D to 3D generative modeling problem, sidestepping several complex modeling issues. This method enables the reconstructed 3D volume to retain valuable information from the 2D inputs, which are passed between channel states in a Swin UNETR backbone. Our approach applies neural optimal transport, which is fast and stable to train, effectively integrating signal information across multiple views without the requirement for precise alignment; it produces non-collapsed reconstructions that are highly faithful to the 2D views, even after limited training. We demonstrate correlated results, both qualitatively and quantitatively, having trained our model on a single dataset and evaluated its generalization ability across six datasets, including out-of-distribution samples.
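The repeat-and-concatenate step described above can be sketched in a few lines. This is an illustrative NumPy reconstruction under assumed shapes (each of the $N$ views is $H \times W$, repeated along a new depth axis and stacked as channels), not the authors' exact code:

```python
import numpy as np

def repeat_and_concatenate(views, depth):
    """Turn N 2D views (each H x W) into a high-channel 3D volume.

    Each 2D view is repeated `depth` times along a new axis to form a
    (depth, H, W) volume, and the N volumes are then stacked along a
    leading channel axis, giving shape (N, depth, H, W). Function name
    and axis conventions are illustrative assumptions.
    """
    vols = [np.repeat(v[np.newaxis, ...], depth, axis=0) for v in views]
    return np.stack(vols, axis=0)

# Example: two 128x128 X-ray views -> a (2, 128, 128, 128) input volume
views = [np.random.rand(128, 128).astype(np.float32) for _ in range(2)]
volume = repeat_and_concatenate(views, depth=128)
print(volume.shape)  # (2, 128, 128, 128)
```

Because every depth slice of a channel is an exact copy of the corresponding 2D view, no signal is discarded before the 3D-to-3D mapper sees the input, in contrast to approaches that compress the views into a latent code first.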
