SUMART: SUMmARizing Translation from Wordy to Concise Expression

Naoto Nishida, Jun Rekimoto

TL;DR

SUMART tackles the cognitive and temporal burden of real-time translated subtitles by compressing verbose utterances on-site and fine-tuning a translation model on pairs of original ASR transcripts and their compressed translations. It introduces a two-path system architecture (real-time transmission and training data collection) and an AR subtitle prototype to validate practicality. The authors estimate per-utterance time savings of about 5.04 seconds under plausible compression, and the approach aims to support rapid information consumption in speeches, lectures, podcasts, and conferences. The work contributes a practical framework for data collection, on-site compression, and AR-enabled concise translation, with implications for foreign-language viewing and live interaction.
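The exact equation behind the 5.04-second figure is not reproduced on this page. The sketch below shows one plausible way a per-utterance saving could be estimated, assuming reading time grows linearly with subtitle length at a fixed reading speed. The function names, the linear model, and the example numbers are all hypothetical and are not taken from the paper.

```python
# Hypothetical estimate: reading time is assumed proportional to subtitle length.

def reading_time_s(num_chars: float, chars_per_second: float) -> float:
    """Seconds needed to read a subtitle of num_chars characters."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    return num_chars / chars_per_second


def time_saved_s(original_chars: int, compression_ratio: float,
                 chars_per_second: float) -> float:
    """Seconds saved per utterance when a subtitle is shortened to
    compression_ratio * original_chars characters (0 < ratio <= 1)."""
    if not 0 < compression_ratio <= 1:
        raise ValueError("compression_ratio must be in (0, 1]")
    compressed_chars = original_chars * compression_ratio
    return (reading_time_s(original_chars, chars_per_second)
            - reading_time_s(compressed_chars, chars_per_second))


if __name__ == "__main__":
    # Illustrative inputs only: a 200-character subtitle halved, read at 15 chars/s.
    print(round(time_saved_s(200, 0.5, 15.0), 2))  # prints 6.67
```

Under this linear assumption the saving is simply (1 - ratio) times the original reading time, so any reported figure depends heavily on the assumed reading speed and compression ratio.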

Abstract

We propose SUMART, a method for summarizing and compressing verbose subtitle translations. SUMART is designed to aid comprehension of translated captions (e.g., in interlingual conversations conducted via subtitle translation, or when watching movies with foreign-language audio and translated captions). SUMART is intended for users who want a quick, big-picture understanding of conversations, audio, video content, and speech in a foreign language. During training data collection, when a speaker makes a verbose statement, SUMART employs a large language model on-site to compress the subtitles. This compressed data is then stored in a database for fine-tuning. Later, SUMART uses pairs of uncompressed ASR results and compressed translations to fine-tune the translation model so that it generates more concise translations. In practical use, SUMART relies on this fine-tuned model to produce concise translation results. As a practical application, we also developed an application that supports conversations through subtitle translation in augmented reality (AR) spaces. As a pilot study, we conducted qualitative surveys with a SUMART prototype and a survey of the summarization model used in SUMART. We envision that this system is most effective where users need to consume a large amount of information quickly (e.g., speeches, lectures, podcasts, and Q&A sessions at conferences).
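The abstract describes two paths: a collection path, where an LLM compresses each verbose utterance on-site and the (uncompressed ASR, compressed translation) pair is logged, and a deployment path, where a fine-tuned model produces concise output directly. The minimal sketch below illustrates that data flow under stated assumptions. `toy_compress` is a hypothetical placeholder for the on-site LLM call (it only drops filler words and does not translate), and `PairStore`, `collect`, `serve`, and the JSONL field names `source`/`target` are illustrative names, not the authors' implementation.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for LLM compression + translation: it only drops fillers.
FILLERS = {"um", "uh", "basically", "actually"}


def toy_compress(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower().strip(",.") not in FILLERS)


@dataclass
class TrainingPair:
    source_asr: str         # uncompressed ASR transcript
    target_compressed: str  # compressed translation produced on-site


@dataclass
class PairStore:
    """In-memory stand-in for the fine-tuning database (hypothetical)."""
    pairs: List[TrainingPair] = field(default_factory=list)

    def add(self, pair: TrainingPair) -> None:
        self.pairs.append(pair)

    def to_jsonl(self) -> str:
        # One JSON object per line, a common layout for fine-tuning data.
        return "\n".join(
            json.dumps({"source": p.source_asr, "target": p.target_compressed},
                       ensure_ascii=False)
            for p in self.pairs)


def collect(asr_text: str, compress_translate: Callable[[str], str],
            store: PairStore) -> str:
    """Collection path: compress on-site, log the pair, return the subtitle."""
    compressed = compress_translate(asr_text)
    store.add(TrainingPair(asr_text, compressed))
    return compressed


def serve(asr_text: str, concise_translator: Callable[[str], str]) -> str:
    """Deployment path: the fine-tuned model maps ASR text to concise output."""
    return concise_translator(asr_text)


if __name__ == "__main__":
    store = PairStore()
    print(collect("Um so basically the results are actually quite good",
                  toy_compress, store))  # prints "so the results are quite good"
    print(store.to_jsonl())
```

Keeping collection and serving as separate functions mirrors the two-path split: the expensive LLM call is needed only while gathering data, and once a model is fine-tuned on the logged pairs, `serve` can swap in that model without changing the rest of the pipeline.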

Paper Structure

This paper contains 8 sections, 1 equation, 1 figure, 1 table.

Figures (1)

  • Figure 1: Examples of people using SUMART and user interface of SUMART prototype.