OPa-Ma: Text Guided Mamba for 360-degree Image Out-painting

Penglei Gao, Kai Yao, Tiandi Ye, Steven Wang, Yuan Yao, Xiaofeng Wang

TL;DR

OPa-Ma tackles 360-degree image out-painting from narrow-field inputs by integrating a time-variant State-Space Model (Mamba) with text guidance. The Visual-textual Consistency Refiner (VCR) and Global-local Mamba Adapter (GMA) provide refined, omni-visual and text-conditioned cues to a latent diffusion backbone, enabling long-range spatial continuity and semantic coherence. Extensive experiments on indoor and outdoor Laval HDR datasets demonstrate state-of-the-art FID, LPIPS, SC, and IS, validating the method under text-only, NFoV-only, and combined conditions. The approach offers a scalable, memory-efficient alternative to transformer-based methods, with practical impact for accessible 360-degree content creation from standard NFoV imagery.

Abstract

In this paper, we tackle the recently popular task of generating 360-degree images from conventional narrow field of view (NFoV) images, such as those taken with a single camera or cellphone. The task is to predict reasonable and consistent surroundings from the NFoV image. Existing methods for feature extraction and fusion, often built on transformer-based architectures, incur substantial memory usage and computational expense. They also struggle to maintain visual continuity across the entire 360-degree image, which can cause inconsistent texture and style generation. To address these issues, we propose a novel text-guided out-painting framework equipped with a State-Space Model called Mamba, exploiting its long-sequence modelling and spatial continuity. Furthermore, incorporating textual information is an effective strategy for guiding image generation, enriching the process with detailed context and increasing diversity. However, efficiently extracting textual features and integrating them with image attributes is a significant challenge for 360-degree image out-painting. To address this, we develop two modules: the Visual-textual Consistency Refiner (VCR) and the Global-local Mamba Adapter (GMA). VCR enhances contextual richness by fusing the modified text features with the image features, while GMA provides adaptive state-selective conditions by capturing the information flow from global to local representations. In extensive experiments on two widely used 360-degree image datasets, covering indoor and outdoor settings, our method achieves state-of-the-art performance.
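
To make the "state-selective" idea in the abstract more concrete, the following is a minimal sketch of a selective state-space recurrence of the kind Mamba is generally described as using: the step size and the input/output projections depend on the current input, so the hidden state can keep or forget information per step. It is not the authors' implementation. The function name `selective_scan`, the scalar weights `w_delta`, `w_B`, `w_C`, and the single-channel setup are all assumptions made for illustration.

```python
import math


def selective_scan(xs, A, w_delta, w_B, w_C):
    """Toy selective state-space recurrence over a 1D sequence.

    xs       : list of floats, the input sequence
    A        : list of negative floats, diagonal state matrix (length N)
    w_delta  : float, hypothetical weight that produces the step size from x_t
    w_B, w_C : lists of floats (length N), hypothetical weights that produce
               the input-dependent projections B_t and C_t
    """
    n = len(A)
    h = [0.0] * n          # hidden state, one entry per state dimension
    ys = []
    for x in xs:
        # Input-dependent step size; softplus keeps it positive.
        delta = math.log1p(math.exp(w_delta * x))
        # Input-dependent projections (the "selective" part).
        B = [w * x for w in w_B]
        C = [w * x for w in w_C]
        for i in range(n):
            a_bar = math.exp(delta * A[i])  # zero-order-hold discretization of A
            b_bar = delta * B[i]            # simplified discretization of B
            h[i] = a_bar * h[i] + b_bar * x
        # Read out the state through C_t.
        ys.append(sum(C[i] * h[i] for i in range(n)))
    return ys


if __name__ == "__main__":
    print(selective_scan([1.0, 2.0, 3.0], [-1.0, -2.0], 1.0, [1.0, 1.0], [1.0, 1.0]))
```

The recurrence costs time linear in the sequence length and keeps only a fixed-size state, which is the usual argument for the memory advantage over attention that the abstract refers to.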

Paper Structure

This paper contains 17 sections, 8 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Overview of the framework. (a) The three different representations of a 360-degree image. (b) The proposed Visual-textual Consistency Refiner (VCR) adjusts the image-text condition and provides a better conditional context by comparing the semantic information in the input image with that in the text condition. For example, it weakens the concept of 'a building' in the text condition if the input image already contains a building (red dashed box). (c) The proposed Global-local Mamba Adapter (GMA) connects the information flow from the global state to local adaptation. The Mamba block it is equipped with learns spatial continuity through bidirectional scanning of a single image.
  • Figure 2: Mamba block for 1D sequential tasks and 2D visual tasks.
  • Figure 3: Architecture of the proposed OPa-Ma. The left part is the overall generation scheme, which generates panorama images iteratively with OPa-Ma Diffusion. The right part shows OPa-Ma Diffusion in detail. VCR and GMA provide a better condition for the Denoising U-Net from two perspectives. Stacked 1D Mamba blocks are used to obtain the modified image-text features extracted by the pre-trained CLIP model. We use a re-weight mechanism to achieve consistency refining. In the GMA module, the NFoV input image and the omni-visual condition are each processed by a shared 2D Mamba. The global-local Mamba controls the information flow and extracts the local features of the cube images one by one, using the state-selective characteristics of Mamba. There are four scale blocks in GMA, and each one outputs a targeted condition for the encoder of the Denoising U-Net.
  • Figure 4: Visual results on indoor and outdoor settings with both NFoV image and text guidance.
  • Figure 5: Visual results with only text guidance on indoor and outdoor settings.
  • ...and 3 more figures
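
The captions for Figures 1 and 2 mention bidirectional scanning of a single image. A rough sketch of that idea, under stated assumptions, is shown below: the 2D grid is flattened in row-major order, scanned once forward and once backward with a causal recurrence, and the two results are summed so that every position receives context from both directions. The fixed-decay recurrence stands in for a real Mamba block, and the names `causal_scan` and `bidirectional_scan`, the `decay` parameter, and the choice to merge by summation are hypothetical, not taken from the paper.

```python
def causal_scan(seq, decay=0.5):
    """Simple causal recurrence h_t = decay * h_{t-1} + x_t (stand-in for a Mamba block)."""
    h = 0.0
    out = []
    for x in seq:
        h = decay * h + x
        out.append(h)
    return out


def bidirectional_scan(grid, decay=0.5):
    """Scan a 2D grid in row-major order in both directions and sum the results."""
    rows, cols = len(grid), len(grid[0])
    flat = [v for row in grid for v in row]            # flatten to a 1D sequence
    fwd = causal_scan(flat, decay)                     # first pixel -> last pixel
    bwd = causal_scan(flat[::-1], decay)[::-1]         # last pixel -> first pixel
    merged = [f + b for f, b in zip(fwd, bwd)]         # assumed merge: elementwise sum
    return [merged[r * cols:(r + 1) * cols] for r in range(rows)]


if __name__ == "__main__":
    # A single bright pixel in the last position still influences the first one.
    print(bidirectional_scan([[0.0, 0.0], [0.0, 1.0]]))
    # → [[0.125, 0.25], [0.5, 2.0]]
```

With a forward scan alone, the top-left output in this example would be 0; the backward pass is what lets earlier positions see later pixels, which matters for a panorama where content must stay consistent in every direction.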