Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models

Md Azim Khan, Aryya Gangopadhyay, Jianwu Wang, Robert F. Erbacher

TL;DR

The paper tackles robustness and efficiency challenges of vision-language processing in real-world, noisy environments by integrating frequency-domain representations with Low-Rank Adaptation (LoRA) in a vision-language framework. A parallel frequency-domain branch applies the Fourier transform $\mathcal{F}$ and its inverse $\mathcal{F}^{-1}$ around a low-rank update $\Delta W = B A$, yielding a combined output $h' = W x + \mathcal{F}^{-1}(\alpha B A \mathcal{F}(x))$ while preserving the pretrained spatial weights $W$. The authors validate the approach on image captioning and VQA using CLIP ViT-L/14 and SigLIP, with a two-stage training pipeline (COCO for captions, Phi-2 for VQA), and assess performance under Gaussian noise. Results show that DFT+LoRA improves BLEU-4 by ~6%, CIDEr by ~4%, and VQA accuracy by ~4% on average, with pronounced benefits in high-noise scenarios and on real-time UGV imagery. Qualitative analyses demonstrate more detailed, contextually grounded outputs than baselines like LLaVA 7B, highlighting the method's practical impact for robust, real-time multimodal perception in robotics. The work suggests a viable path toward robust edge-enabled multimodal systems, with future directions including edge deployment and multimodal extensions such as thermal imaging and LiDAR for disaster-response settings.
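The combined output $h' = W x + \mathcal{F}^{-1}(\alpha B A \mathcal{F}(x))$ can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `dft_lora_forward`, the 1-D FFT along the feature axis, the matrix shapes, and the choice to keep only the real part of the inverse transform are all assumptions made for illustration.

```python
import numpy as np

def dft_lora_forward(x, W, A, B, alpha=1.0):
    """Sketch of h' = W x + F^{-1}(alpha * B A F(x)) for a batch of row vectors.

    Assumed (hypothetical) shapes: x (batch, d_in), W (d_out, d_in),
    A (r, d_in), B (d_out, r), with rank r much smaller than d_in and d_out.
    """
    spatial = x @ W.T                         # frozen pretrained spatial path: W x
    x_freq = np.fft.fft(x, axis=-1)           # F(x), 1-D DFT over the feature axis
    low_rank = (x_freq @ A.T) @ B.T           # B A F(x), computed without forming B A
    # Assumption: discard the imaginary part after the inverse DFT.
    delta = np.fft.ifft(alpha * low_rank, axis=-1).real
    return spatial + delta

rng = np.random.default_rng(0)
d_in, d_out, r = 16, 16, 4
x = rng.standard_normal((2, d_in))
W = rng.standard_normal((d_out, d_in))
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))                      # zero-initialised B, as in standard LoRA
h = dft_lora_forward(x, W, A, B, alpha=2.0)   # equals x @ W.T while B is all zeros
```

With `B` initialised to zero, the frequency branch contributes nothing, so the layer starts out identical to the pretrained one and only `A` and `B` need to be trained.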

Abstract

Situational awareness applications rely heavily on real-time processing of visual and textual data to provide actionable insights. Vision-language models (VLMs) have become essential tools for interpreting complex environments by connecting visual inputs with natural language descriptions. However, these models often face computational challenges, especially when required to perform efficiently in real-world environments. This research presents a novel VLM framework that leverages frequency-domain transformations and low-rank adaptation (LoRA) to enhance feature extraction, scalability, and efficiency. Unlike traditional VLMs, which rely solely on spatial-domain representations, our approach incorporates Discrete Fourier Transform (DFT)-based low-rank features while retaining pretrained spatial weights, enabling robust performance in noisy or low-visibility scenarios. We evaluated the proposed model on caption generation and Visual Question Answering (VQA) tasks using benchmark datasets with varying levels of Gaussian noise. Quantitative results demonstrate that our model achieves evaluation metrics comparable to state-of-the-art VLMs, such as CLIP ViT-L/14 and SigLIP. Qualitative analysis further reveals that our model provides more detailed and contextually relevant responses, particularly for real-world images captured by a RealSense camera mounted on an Unmanned Ground Vehicle (UGV).
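The noise-robustness evaluation mentioned above relies on corrupting inputs with Gaussian noise at several levels. A minimal sketch of such a corruption step follows; the helper `add_gaussian_noise`, the `[0, 1]` pixel range, and the specific `levels` are hypothetical and not taken from the paper.

```python
import numpy as np

def add_gaussian_noise(image, sigma, rng=None):
    """Add zero-mean Gaussian noise with standard deviation sigma.

    Assumes `image` is a float array scaled to [0, 1]; the result is
    clipped back into that range.
    """
    rng = np.random.default_rng() if rng is None else rng
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 1.0)

image = np.full((4, 4, 3), 0.5)               # placeholder mid-grey RGB image
levels = [0.0, 0.1, 0.3]                      # hypothetical noise levels
noisy_versions = {
    s: add_gaussian_noise(image, s, np.random.default_rng(0)) for s in levels
}
```

Each noisy copy would then be passed through the model, and the metric (for example VQA accuracy) recorded per noise level.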

Paper Structure

This paper contains 9 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Proposed Architecture
  • Figure 2: Framework of the proposed Vision-Language Model (VLM) integrating low-rank DFT features with text embeddings via a projection network. The model employs a two-stage training process: Stage 1 for caption generation and Stage 2 for Visual Question Answering (VQA). LoRA adaptation is applied to enhance task-specific performance.
  • Figure 3: VQA accuracy under varying noise levels
  • Figure 4: Model performance as the rank varies