Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking Across Diverse Vocabularies

Sai Koneru, Matthias Huck, Miriam Exel, Jan Niehues

TL;DR

This work proposes a novel zero-shot ensembling strategy that allows for the integration of different models during the decoding phase without the need for additional training, and demonstrates the effectiveness of this method in machine translation scenarios.

Abstract

Recent advancements in NLP have resulted in models with specialized strengths, such as processing multimodal inputs or excelling in specific domains. However, real-world tasks, like multimodal translation, often require a combination of these strengths, such as handling both translation and image processing. While individual translation and vision models are powerful, they typically lack the ability to perform both tasks in a single system. Combining these models poses challenges, particularly due to differences in their vocabularies, which limit the effectiveness of traditional ensemble methods to post-generation techniques like N-best list re-ranking. In this work, we propose a novel zero-shot ensembling strategy that allows for the integration of different models during the decoding phase without the need for additional training. Our approach re-ranks beams during decoding by combining scores at the word level, using heuristics to predict when a word is completed. We demonstrate the effectiveness of this method in machine translation scenarios, showing that it enables the generation of translations that are both speech- and image-aware while also improving overall translation quality. We will release the code upon paper acceptance.
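
The word-level re-ranking idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the SentencePiece-style word-boundary marker, the "next token starts a new word" completion heuristic, the `score_b` callable standing in for the second model, and the linear interpolation with weight `alpha` are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

WORD_START = "\u2581"  # SentencePiece-style word-boundary marker (assumption)
EOS = "</s>"           # end-of-sentence token (assumption)

Beam = Tuple[List[str], float, str]  # (subword tokens, model-A log-prob, likely next token)


def detokenize(tokens: List[str]) -> str:
    """Join subword tokens into plain text that any other model can re-tokenize."""
    return "".join(tokens).replace(WORD_START, " ").strip()


def last_word_complete(next_token: str) -> bool:
    """Heuristic: the last word is finished if the next token starts a new word or ends the sentence."""
    return next_token == EOS or next_token.startswith(WORD_START)


def completed_prefix(tokens: List[str], next_token: str) -> List[str]:
    """Drop a trailing unfinished word so that only whole words are re-scored."""
    if last_word_complete(next_token):
        return tokens
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i].startswith(WORD_START):
            return tokens[:i]
    return []


def rerank(beams: List[Beam], score_b: Callable[[str], float], alpha: float = 0.5) -> List[List[str]]:
    """Re-order beams by interpolating model A's score with model B's score on completed words."""
    scored = []
    for tokens, logprob_a, next_token in beams:
        text = detokenize(completed_prefix(tokens, next_token))
        combined = alpha * logprob_a + (1 - alpha) * score_b(text)
        scored.append((combined, tokens))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [tokens for _, tokens in scored]


# Toy usage: a hypothetical second model that prefers the feminine form "tombée".
beams = [
    ([WORD_START + "elle", WORD_START + "est", WORD_START + "tomb", "é"], -1.0, EOS),
    ([WORD_START + "elle", WORD_START + "est", WORD_START + "tomb", "ée"], -1.2, EOS),
]
toy_score_b = lambda text: 0.0 if text.endswith("tombée") else -2.0
best = rerank(beams, toy_score_b)[0]
print(detokenize(best))
```

Because the hypotheses are exchanged as plain text rather than token IDs, the two models never need to share a vocabulary; each one scores the same words with its own tokenizer.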

Paper Structure (21 sections, 2 figures, 5 tables)

Figures (2)

  • Figure 1: The source sentence to be translated is ambiguous because the translation of the word "fell" can be either masculine ("tombé") or feminine ("tombée"), depending on the speaker's gender. Seamless-Large V2 [barrault2023seamless] utilizes audio cues to correctly determine the gender form but struggles to accurately translate the name "Mrs Ples" using audio alone. In contrast, the text translation model Madlad-400-10b-mt [kudugunta2024madlad] relies on the gold transcript to correctly translate the name but fails to resolve the gender ambiguity. By combining both models using our approach, the translation correctly captures both the gender form and the named entity.
  • Figure 2: Grid Search on $\alpha$ with Madlad and Seamless Bal on the MuST-C development set with N-best lists from different generators and rankers.
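
A grid search over $\alpha$ of the kind named in the Figure 2 caption could look like the sketch below. The interpolation form (`alpha` weighting the generator and `1 - alpha` weighting the ranker), the `metric` callable, and all function names are hypothetical; the caption does not specify them.

```python
from typing import Callable, List, Sequence, Tuple

Hypothesis = Tuple[str, float, float]  # (text, generator score, ranker score)


def rerank_nbest(nbest: List[Hypothesis], alpha: float) -> str:
    """Pick the hypothesis with the highest interpolated score (assumed form)."""
    return max(nbest, key=lambda h: alpha * h[1] + (1 - alpha) * h[2])[0]


def grid_search_alpha(
    dev_nbest: List[List[Hypothesis]],
    references: List[str],
    metric: Callable[[List[str], List[str]], float],
    alphas: Sequence[float],
) -> Tuple[float, float]:
    """Return the alpha with the best development-set metric, plus that metric value."""
    best_alpha, best_quality = alphas[0], float("-inf")
    for alpha in alphas:
        outputs = [rerank_nbest(nbest, alpha) for nbest in dev_nbest]
        quality = metric(outputs, references)
        if quality > best_quality:
            best_alpha, best_quality = alpha, quality
    return best_alpha, best_quality


# Toy usage with exact match standing in for a real translation metric.
def exact_match(outputs: List[str], refs: List[str]) -> float:
    return sum(o == r for o, r in zip(outputs, refs)) / len(refs)


toy_dev = [[("a", -1.0, -5.0), ("b", -2.0, -1.0)]]
toy_refs = ["b"]
best_alpha, best_quality = grid_search_alpha(toy_dev, toy_refs, exact_match, [0.0, 0.5, 1.0])
print(best_alpha, best_quality)
```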