
VedicTHG: Symbolic Vedic Computation for Low-Resource Talking-Head Generation in Educational Avatars

Vineet Kumar Rakesh, Ahana Bhattacharjee, Soumya Mazumdar, Tapas Samanta, Hemendra Kumar Pandey, Amitabha Das, Sarbajit Pal

TL;DR

This work addresses the GPU-heavy, data-hungry nature of current talking-head generation (THG) by proposing Symbolic Vedic Computation, a deterministic, CPU-oriented pipeline. It converts speech into a time-aligned phoneme stream $\mathcal{P}$, maps phonemes to a compact viseme set $\mathbb{V}$, and generates smooth mouth trajectories $\mathbf{y}(t)$ using the Vedic-inspired blend $\mathbf{y}(t) = (1-\alpha)\mathbf{a} + \alpha\mathbf{c} + \lambda\alpha(1-\alpha)(\mathbf{a} \odot \mathbf{c})$, where $\odot$ is the element-wise product; the cross term vanishes at the endpoints, so the blend reduces to $\mathbf{a}$ at $\alpha = 0$ and to $\mathbf{c}$ at $\alpha = 1$. A lightweight 2D ROI renderer performs landmark-based mouth ROI localization, mouth-bank compositing, and head-motion stabilization to achieve real-time synthesis on commodity CPUs. The approach is evaluated with a reproducible CPU-focused protocol that reports synchronization accuracy within $\pm 40$ ms, temporal stability, and identity preservation, and it is benchmarked against CPU-feasible baselines such as Wav2Lip. Results indicate acceptable lip-sync quality with substantially lower computational load and latency, enabling practical educational avatars in low-resource or offline environments. The work emphasizes interpretable, rule-based animation that could be extended to additional viseme controls and languages, offering a viable alternative to heavy neural THG pipelines for classroom use. The math and algorithms are stated as explicit symbolic rules with arithmetic-inspired blending, which supports transparency and deployability in offline educational settings.
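
The blending rule above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the summary does not define the symbols, so the code assumes $\mathbf{a}$ and $\mathbf{c}$ are mouth-parameter vectors for the outgoing and incoming visemes, $\alpha \in [0, 1]$ is the transition progress, and $\lambda$ is a scalar gain on the cross term. The function name `vedic_blend` and the default `lam` value are hypothetical.

```python
from typing import List, Sequence


def vedic_blend(a: Sequence[float], c: Sequence[float],
                alpha: float, lam: float = 0.5) -> List[float]:
    """Evaluate y = (1 - alpha) * a + alpha * c + lam * alpha * (1 - alpha) * (a ⊙ c).

    Assumption: a and c are equal-length viseme parameter vectors and
    alpha is the transition progress in [0, 1]. The default lam is arbitrary.
    """
    if len(a) != len(c):
        raise ValueError("a and c must have the same length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # The cross-term weight is zero at both endpoints and peaks at alpha = 0.5.
    cross = lam * alpha * (1.0 - alpha)
    return [(1.0 - alpha) * ai + alpha * ci + cross * ai * ci
            for ai, ci in zip(a, c)]


# Sample a short transition between two hypothetical viseme vectors.
start, end = [0.2, 0.8], [0.6, 0.4]
trajectory = [vedic_blend(start, end, k / 4, lam=1.0) for k in range(5)]
```

Because the cross term carries the factor $\alpha(1-\alpha)$, the first and last samples of `trajectory` equal `start` and `end`, and $\lambda$ only shapes the path in between.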

Abstract

Talking-head avatars are increasingly adopted in educational technology to deliver content with social presence and improved engagement. However, many recent talking-head generation (THG) methods rely on GPU-centric neural rendering, large training sets, or high-capacity diffusion models, which limits deployment in offline or resource-constrained learning environments. This paper describes Symbolic Vedic Computation, a deterministic, CPU-oriented THG framework that converts speech to a time-aligned phoneme stream, maps phonemes to a compact viseme inventory, and produces smooth viseme trajectories through symbolic coarticulation inspired by the Vedic sutra Urdhva Tiryakbhyam. A lightweight 2D renderer performs region-of-interest (ROI) warping and mouth compositing with stabilization to support real-time synthesis on commodity CPUs. Experiments report synchronization accuracy, temporal stability, and identity consistency under CPU-only execution, alongside benchmarks against representative CPU-feasible baselines. Results indicate that acceptable lip-sync quality can be achieved while substantially reducing computational load and latency, supporting practical educational avatars on low-end hardware. GitHub: https://vineetkumarrakesh.github.io/vedicthg
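
To make the phoneme-to-viseme step concrete, the following is a minimal sketch of a table-driven mapping over a time-aligned phoneme stream. It is not the authors' inventory: the `PHONEME_TO_VISEME` table, the viseme names, the `Segment` type, and the rule of merging adjacent segments that share a viseme are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical, deliberately tiny table (ARPAbet-style labels -> viseme names).
# The paper's actual viseme inventory is not given in this summary.
PHONEME_TO_VISEME: Dict[str, str] = {
    "P": "closed", "B": "closed", "M": "closed",
    "F": "lip_teeth", "V": "lip_teeth",
    "AA": "open", "AE": "open",
    "UW": "rounded", "OW": "rounded",
    "SIL": "rest",
}


@dataclass
class Segment:
    label: str    # phoneme or viseme label
    start: float  # start time in seconds
    end: float    # end time in seconds


def phonemes_to_visemes(phonemes: List[Segment],
                        default: str = "rest") -> List[Segment]:
    """Map each timed phoneme to a viseme and merge contiguous repeats."""
    visemes: List[Segment] = []
    for p in phonemes:
        v = PHONEME_TO_VISEME.get(p.label, default)  # unknown labels fall back
        prev = visemes[-1] if visemes else None
        if prev and prev.label == v and abs(prev.end - p.start) < 1e-9:
            prev.end = p.end  # extend the previous viseme instead of repeating it
        else:
            visemes.append(Segment(v, p.start, p.end))
    return visemes


stream = [Segment("M", 0.00, 0.08), Segment("B", 0.08, 0.15),
          Segment("AA", 0.15, 0.30)]
schedule = phonemes_to_visemes(stream)
```

A lookup table keeps the step deterministic and inspectable, which matches the rule-based framing of the abstract; the resulting `schedule` is the kind of timed viseme sequence a blending stage would consume.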

Paper Structure

This paper contains 11 sections, 9 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Inference-time block diagram of the proposed talking-head generation pipeline. The controller coordinates the audio stream processing, including preprocessing, timing alignment, and Vedic phoneme-to-viseme mapping, with the visual stream, where the computed controls are applied to a facial template for real-time rendering.
  • Figure 2: Qualitative results using the same identity and audio, with frames extracted at matched phoneme timestamps.
  • Figure 3: CDF of absolute scheduling error $|\Delta t|$ between phoneme boundary timestamps and generated viseme schedule timestamps (internal alignment metric).
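
The metric plotted in Figure 3 can be sketched as follows. This is an illustrative reading of the caption, not the authors' evaluation code: it assumes phoneme boundary timestamps and viseme schedule timestamps are paired one-to-one by index and expressed in milliseconds. The function names are hypothetical; the 40 ms threshold echoes the $\pm 40$ ms tolerance stated in the TL;DR.

```python
from bisect import bisect_right
from typing import List, Sequence


def abs_scheduling_errors(boundaries_ms: Sequence[float],
                          schedule_ms: Sequence[float]) -> List[float]:
    """Return sorted |Δt| values for index-paired timestamps (assumed pairing)."""
    if len(boundaries_ms) != len(schedule_ms):
        raise ValueError("timestamp lists must have the same length")
    return sorted(abs(b - s) for b, s in zip(boundaries_ms, schedule_ms))


def empirical_cdf(sorted_errors: Sequence[float], threshold_ms: float) -> float:
    """Fraction of errors that are <= threshold_ms (empirical CDF value)."""
    if not sorted_errors:
        raise ValueError("need at least one error sample")
    # bisect_right counts the entries <= threshold in a sorted sequence.
    return bisect_right(sorted_errors, threshold_ms) / len(sorted_errors)


# Made-up timestamps, for illustration only.
errors = abs_scheduling_errors([0, 100, 200, 300], [10, 95, 260, 300])
within_tolerance = empirical_cdf(errors, 40)
```

Evaluating `empirical_cdf` over a range of thresholds yields the curve; the single value at 40 ms is the share of boundaries scheduled within the stated tolerance.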