ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks

Yang Liu, Xiaomin Yu, Gongyu Zhang, Zhen Zhu, Christos Bergeles, Prokar Dasgupta, Alejandro Granados, Sebastien Ourselin

TL;DR

Building on prior noise injection methods for reducing the modality gap, Adaptive ranged cosine Similarity injected noise (ArcSin) introduces an adaptive noise scale that gives text features more variability while preserving the integrity of the original text feature.

Abstract

"A data scientist is tasked with developing a low-cost surgical VQA system for a 2-month workshop. Due to data sensitivity, she collects 50 hours of surgical video from a hospital, requiring two months for privacy approvals. Privacy restrictions prevent uploading data to platforms like ChatGPT, so she assembles one annotator and a medical expert to manually create QA pairs. This process takes three weeks and costs over $10,000. The trained model provides accurate responses within the limited data scope but lacks broader generalizability, and the project takes 3 months to complete." To ease the challenges presented in this scenario, we replace the image input with text for vision-language training. Inspired by prior noise injection methods to reduce modality gaps, we introduce Adaptive ranged cosine Similarity injected noise (ArcSin). First, we introduce an adaptive noise scale that gives the text features more variability while preserving the original text feature's integrity. Second, we employ a similarity pool strategy that expands the domain generalization potential by broadening the overall noise scale. Together, the two strategies broaden the scope of the original domain while safeguarding content integrity. Our empirical results show that models trained this way on text closely rival those trained on images. Specifically, our method improves over the previous state of the art by 1.9 and 1.1 CIDEr points on S-Cap and M-Cap, respectively. We also observe accuracy increases of 0.5 percentage points (pp), 1.4 pp, and 1.4 pp on VQA, VQA-E, and VE, respectively, narrowing the gap to image-trained model benchmarks.
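
The abstract does not spell out the noise formulas, so the following is only a minimal sketch of the general idea: add noise to a text feature, but shrink it whenever it would pull the cosine similarity to the original feature below a threshold, and draw that threshold from a small pool. The function names, the Gaussian noise model, the `sigma` value, and the pool values `(0.7, 0.8, 0.9)` are all assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def inject_bounded_noise(text_feat, sim_threshold, sigma=0.1, rng=None):
    """Add Gaussian noise to a feature vector, shrinking the noise when needed
    so that cosine similarity with the original stays >= sim_threshold.
    Assumes 0 < sim_threshold < 1 and a non-zero feature vector."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(text_feat, dtype=np.float64)
    x_norm = np.linalg.norm(x)
    noise = rng.normal(0.0, sigma, size=x.shape)

    # Split the noise into a part parallel to x and a part orthogonal to it.
    x_hat = x / x_norm
    parallel = noise @ x_hat
    orthogonal = np.linalg.norm(noise - parallel * x_hat)

    # cos >= threshold is the same as staying inside a cone of half-angle alpha.
    tan_alpha = np.tan(np.arccos(sim_threshold))
    denom = orthogonal - parallel * tan_alpha
    # denom <= 0 means the full noise already keeps the vector inside the cone.
    scale = 1.0 if denom <= 0 else min(1.0, x_norm * tan_alpha / denom)
    return x + scale * noise

def inject_with_pool(text_feat, pool=(0.7, 0.8, 0.9), sigma=0.1, rng=None):
    """Pick a similarity threshold at random from a (hypothetical) pool,
    then inject noise bounded by that threshold."""
    rng = np.random.default_rng() if rng is None else rng
    threshold = float(rng.choice(pool))
    return inject_bounded_noise(text_feat, threshold, sigma=sigma, rng=rng)

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

In this sketch, a lower threshold in the pool permits a wider cone around the original feature, which is one way to read "broadening the overall noise scale"; the paper's own definitions should be consulted for the actual method.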

Paper Structure

This paper contains 14 sections, 2 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: ArcSin surpasses the state-of-the-art method CLOSE (Gu et al., 2022) on the Single Image Captioning (S-Cap) task when using various CLIP models (Radford et al., 2021) as contrastive backbones for image-text feature alignment and keeping the language model fixed (T5-base; Raffel et al., 2020). The graph plots the number of model parameters against the corresponding CIDEr (Vedantam et al., 2015) scores.
  • Figure 2: The ArcSin architecture. In the training phase, textual descriptions are encoded into feature vectors using a pre-trained text encoder. These vectors are then augmented with dynamically injected noise and aligned with vision-language tasks through task-specific prompts within a cross-modal model such as T5 (Raffel et al., 2020). During inference, images are encoded into feature vectors by an image encoder, and these replace the text-derived features, allowing the text-trained model to perform visual tasks.
  • Figure 3: Visualization of feature alignment in cosine-similarity-controlled multimodal data processing. (a) Relationship between text feature values and the corresponding deviations between text and image features. We randomly selected 5 text-image feature pairs encoded by CLIP (ViT-B/32) (Radford et al., 2021) and plotted the resulting $512 \times 5$ points, with text feature values on the horizontal axis and the corresponding deviations in the image features on the vertical axis. The deviations vary across different magnitudes of the text features. (b) Permissible value deviations under a predefined similarity threshold in standard 2D space. Left: A vector (in black) with vertical component $y_0$. Keeping the cosine similarity between the noise-augmented vectors (in blue and yellow) and the original vector above a given threshold is equivalent to confining the rotation within an angle $\alpha$. The allowable positive and negative deviations are denoted by $\delta^+(y_0)$ and $\delta^-(y_0)$, respectively. Right: The relationship between $y_0$ and its potential deviations $\delta(y_0)$.
  • Figure 4: Qualitative comparisons with the state-of-the-art method CLOSE (Gu et al., 2022).
  • Figure 5: Cross-style captioning cases. Our ArcSin model, trained on M-Cap, demonstrates robust captioning ability on 'bird' images with distinct visual styles, randomly sourced from the web.
  • ...and 3 more figures
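
The 2D picture in Figure 3(b) can be illustrated numerically. The sketch below assumes, beyond what the caption states, that the rotation preserves the vector's length `r` and that the vector lies in the right half-plane, so its angle above the horizontal axis is `asin(y0 / r)`. The function name `allowed_deviations` and its parameters are hypothetical.

```python
import math

def allowed_deviations(y0, alpha, r=1.0):
    """Return (delta_plus, delta_minus): how far the vertical component y0 of a
    2D vector of length r can move up or down when the vector is rotated by at
    most alpha radians in either direction."""
    theta = math.asin(y0 / r)                  # angle above the horizontal axis
    up = min(theta + alpha, math.pi / 2)       # y can never exceed r
    down = max(theta - alpha, -math.pi / 2)    # y can never fall below -r
    delta_plus = r * math.sin(up) - y0
    delta_minus = y0 - r * math.sin(down)
    return delta_plus, delta_minus

# Deviations for a few vertical components under a 30-degree rotation budget.
for y0 in (0.0, 0.5, 1.0):
    print(y0, allowed_deviations(y0, math.pi / 6))
```

Under these assumptions the permissible deviation depends on $y_0$ itself: a component near zero can move by roughly the same amount in both directions, while a component near the vector's full length has almost no room to grow. This is consistent with the caption's point that $\delta(y_0)$ varies with $y_0$.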