Grammar Induction from Visual, Speech and Text

Yu Zhao, Hao Fei, Shengqiong Wu, Meishan Zhang, Min Zhang, Tat-seng Chua

TL;DR

This work defines unsupervised visual-audio-text grammar induction (VAT-GI) and presents VaTiora, a multimodal inside-outside recursive autoencoder that jointly leverages text, vision, and acoustic cues to induce constituency trees. It introduces a textless VAT-GI setting and the SpokenStory dataset to stress-test generalization and grounding across modalities, achieving state-of-the-art results on VAT-GI benchmarks. The framework combines modal-specific feature extraction, cross-modal attention, and a reconstruction-plus-contrastive training objective to align representations and improve parsing accuracy. A new SCF1 metric is proposed for textless evaluation, and extensive analyses show the contributions of each modality, the importance of object detection granularity, and the challenges in long constituents and noisy signals. The authors provide open resources to facilitate follow-up research in multimodal grammar induction and cross-modal structure learning.

Abstract

Grammar induction could benefit from rich heterogeneous signals, such as text, vision, and acoustics, since features from distinct modalities serve complementary roles. With this intuition, this work introduces a novel unsupervised visual-audio-text grammar induction task (named VAT-GI), which induces constituency grammar trees from parallel image, text, and speech inputs. Inspired by the fact that language grammar natively exists beyond text, we argue that text need not be the predominant modality in grammar induction. We therefore further introduce a textless setting of VAT-GI, wherein the task relies solely on visual and auditory inputs. To approach the task, we propose a visual-audio-text inside-outside recursive autoencoder (VaTiora) framework, which leverages rich modal-specific and complementary features for effective grammar parsing. In addition, a more challenging benchmark dataset is constructed to assess the generalization ability of VAT-GI systems. Experiments on two benchmark datasets demonstrate that the proposed VaTiora system is more effective at incorporating the various multimodal signals and achieves new state-of-the-art performance on VAT-GI.

Paper Structure

This paper contains 21 sections, 23 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Unsupervised grammar induction with vision, audio and text modality sources, each of which contributes complementarily to the task.
  • Figure 1: Summary of all the features used in VaTiora.
  • Figure 2: Illustration of clip alignment in SCF1.
  • Figure 3: In our VaTiora framework, the feature extraction module first constructs rich modal-specific features from the input image, text and speech. RAPT: robust algorithm for pitch tracking; VAD: voice activity detection. Then the inside-outside recursive autoencoder fuses the various features and performs grammar induction.
  • Figure 4: Illustration of pair feature and voice activity feature.
  • ...and 8 more figures