
Calliope: A TTS-based Narrated E-book Creator Ensuring Exact Synchronization, Privacy, and Layout Fidelity

Hugo L. Hammer, Vajira Thambawita, Pål Halvorsen

TL;DR

Calliope addresses the lack of open-source tools for turning standard text e-books into EPUB 3 Media Overlay (MO) narrated e-books with exact text-audio synchronization. It generates narration with offline, state-of-the-art open-source TTS models (XTTS-v2 and Chatterbox) and captures audio timestamps directly during synthesis, so that text highlighting stays exactly aligned with the audio while the original typography and embedded media are preserved. Because the framework operates entirely offline, it mitigates privacy and copyright concerns and avoids API costs. It also compares favorably against forced-alignment approaches, which introduce timing drift. The result is an MIT-licensed pipeline that produces fully synchronized, accessible EPUB 3 MO files suitable for readers and accessibility tools, with GUI enhancement and mobile deployment identified as future work.

Abstract

A narrated e-book combines synchronized audio with digital text, highlighting the currently spoken word or sentence during playback. This format supports early literacy and assists individuals with reading challenges, while also allowing general readers to seamlessly switch between reading and listening. With the emergence of natural-sounding neural Text-to-Speech (TTS) technology, several commercial services have been developed to leverage this technology for converting standard text e-books into high-quality narrated e-books. However, no open-source solutions currently exist to perform this task. In this paper, we present Calliope, an open-source framework designed to fill this gap. Our method leverages state-of-the-art open-source TTS to convert a text e-book into a narrated e-book in the EPUB 3 Media Overlay format. The method offers several innovative steps: audio timestamps are captured directly during TTS, ensuring exact synchronization between narration and text highlighting; the publisher's original typography, styling, and embedded media are strictly preserved; and the entire pipeline operates offline. This offline capability eliminates recurring API costs, mitigates privacy concerns, and avoids copyright compliance issues associated with cloud-based services. The framework currently supports the state-of-the-art open-source TTS systems XTTS-v2 and Chatterbox. A potential alternative approach involves first generating narration via TTS and subsequently synchronizing it with the text using forced alignment. However, while our method ensures exact synchronization, our experiments show that forced alignment introduces drift between the audio and text highlighting that is significant enough to degrade the reading experience. Source code and usage instructions are available at https://github.com/hugohammer/TTS-Narrated-Ebook-Creator.git.
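
To illustrate why capturing timestamps during synthesis can give exact synchronization, here is a minimal Python sketch. It is not the authors' implementation. It assumes each sentence is synthesized separately, so that its duration is known from the generated audio, and that the clips are concatenated into one audio file per chapter. The function names, file names, and fragment ids are hypothetical. The clip boundaries are cumulative sums of the durations, written into an EPUB 3 Media Overlay (SMIL) document.

```python
from xml.sax.saxutils import quoteattr


def clip_boundaries(durations):
    """Turn per-sentence audio durations (seconds) into (begin, end) pairs.

    Assumption: the sentence clips are concatenated in order with no gaps,
    so each clip begins exactly where the previous one ends.
    """
    bounds, t = [], 0.0
    for d in durations:
        bounds.append((t, t + d))
        t += d
    return bounds


def build_smil(text_href, audio_href, fragment_ids, durations):
    """Build a Media Overlay document pairing text fragments with audio clips.

    fragment_ids are the id attributes of the sentence elements in the XHTML
    file (hypothetical here); durations are the matching clip lengths.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<smil xmlns="http://www.w3.org/ns/SMIL" version="3.0">',
        "  <body>",
    ]
    pairs = zip(fragment_ids, clip_boundaries(durations))
    for frag, (begin, end) in pairs:
        text_src = quoteattr(text_href + "#" + frag)
        audio_src = quoteattr(audio_href)
        lines += [
            "    <par>",
            "      <text src=" + text_src + "/>",
            "      <audio src=" + audio_src
            + f' clipBegin="{begin:.3f}s" clipEnd="{end:.3f}s"/>',
            "    </par>",
        ]
    lines += ["  </body>", "</smil>"]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical chapter with two sentences lasting 1.5 s and 2.0 s.
    print(build_smil("ch1.xhtml", "ch1.wav", ["s1", "s2"], [1.5, 2.0]))
```

Because the boundaries come from the durations of the generated audio itself, no alignment model has to estimate them afterwards, which is the step where forced alignment can introduce drift.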

Paper Structure

This paper contains 27 sections, 6 equations, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Example of an EPUB 3 Media Overlay file created using the proposed method. The figure demonstrates active text highlighting synchronized with audio playback, while the publisher's original layout and styling are preserved.
  • Figure 2: Overview of the three phases of the Calliope methodology.
  • Figure 3: Histograms of the drift distributions for the forced alignment methods. Note that the scale on the $x$ axis differs between the panels. The green and red areas show where the drift is acceptable and where it may affect the reading experience, respectively. Our method is exact, with zero drift.