Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation

Sen Fang; Sizhou Chen; Yalin Feng; Xiaofeng Zhang; Teik Toe Teoh

TL;DR

This study explores the feasibility of the first Langue2Gloss model and proposes DS-Net, a Result Filter module, and a novel SP-Loss function to strengthen the adaptability of gloss with text/audio and to overcome the efficiency and instability issues in multimodal training.

Abstract

This paper presents BGTAI, an approach that simplifies multimodal understanding by using gloss-based annotation as an intermediate step in aligning text and audio with images. Text and audio inputs carry dynamic temporal factors, including various predicate adjectives that influence the meaning of the entire sentence, whereas images present static scenes. Representing text and audio as gloss notations that omit complex semantic nuances can therefore potentially yield a better alignment with images. This study explores the feasibility of this idea: we first propose the first Langue2Gloss model and then integrate it into the multimodal model UniBriVL for joint training. To strengthen the adaptability of gloss with text/audio and to overcome the efficiency and instability issues in multimodal training, we propose DS-Net (Data-Pair Selection Network), a Result Filter module, and a novel SP-Loss function. Our approach outperforms previous multimodal models in the main experiments, demonstrating its efficacy in enhancing multimodal representations and improving compatibility among text, audio, visual, and any sequence modalities.

Paper Structure (17 sections, 6 equations, 3 figures, 5 tables)

This paper contains 17 sections, 6 equations, 3 figures, 5 tables.

Introduction
Proposed Method
Motivation
Our components
Langue2Gloss
DS-Net
Result Filter
SP-Loss
Experiments
Experimental Setup
Evaluation for Performance
Baseline Comparison
Training Efficiency Study
Ablation Study
Zero-Shot Comparison with Previous Works
...and 2 more sections

Figures (3)

Figure 1: (a) BGTAI production pipeline and (b) structure drawing. The drawing shows a structure that unifies all representation possibilities.
Figure 2: Training efficiency evaluation. We tested the performance of models under different settings on the ESC-50 dataset, sampling performance at different points in training, measured in epochs.
Figure 3: Sound-to-image generation. Our model synthesizes natural scene images from sound. It is trained solely on paired audio-visual data, eliminating the need for labels or language supervision. The model is controllable in two ways: by manipulating the input waveforms (left) and by controlling the generated images in the latent space of diffusion models [Rombach_2022_CVPR] (right), where $I_{j}$ denotes an image embedding. This provides enhanced flexibility and control over the model's outputs.
