TalkLoRA: Low-Rank Adaptation for Speech-Driven Animation

Jack Saunders, Vinay Namboodiri

TL;DR

TalkLoRA uses Low-Rank Adaptation to effectively and efficiently adapt to new speaking styles, even with limited data, and utilises a chunking strategy to reduce the complexity of the underlying transformer, allowing for long sentences at inference time.

Abstract

Speech-driven facial animation is important for many applications including TV, film, video games, telecommunication and AR/VR. Recently, transformers have been shown to be extremely effective for this task. However, we identify two issues with the existing transformer-based models. Firstly, they are difficult to adapt to new personalised speaking styles and secondly, they are slow to run for long sentences due to the quadratic complexity of the transformer. We propose TalkLoRA to address both of these issues. TalkLoRA uses Low-Rank Adaptation to effectively and efficiently adapt to new speaking styles, even with limited data. It does this by training an adaptor with a small number of parameters for each subject. We also utilise a chunking strategy to reduce the complexity of the underlying transformer, allowing for long sentences at inference time. TalkLoRA can be applied to any transformer-based speech-driven animation method. We perform extensive experiments to show that TalkLoRA achieves state-of-the-art style adaptation and that it allows for an order-of-complexity reduction in inference times without sacrificing quality. We also investigate and provide insights into the hyperparameter selection for LoRA fine-tuning of speech-driven facial animation models.
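
To make the "adaptor with a small number of parameters" concrete, the following is a minimal pure-Python sketch of a generic LoRA-style linear layer, not the authors' implementation. It assumes the standard LoRA formulation, in which a frozen weight W is augmented with a trainable low-rank update scaled by alpha / rank. The class name `LoRALinear`, the helper `matvec`, and the `alpha` and `seed` parameters are hypothetical, and which transformer layers TalkLoRA actually adapts is not stated in this section.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update (illustrative only)."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, kept frozen
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts with small random values and B with zeros, so the adapted
        # layer initially reproduces the base layer exactly.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def num_adapter_params(self):
        """Number of per-subject parameters: rank * (d_in + d_out)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

Under these assumptions, only `A` and `B` would be trained and stored for each new subject. A 16x16 layer with rank 4 has 128 adapter parameters compared with 256 in the full weight, and the saving grows with layer width.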

Paper Structure (17 sections, 1 equation, 5 figures, 1 table)

Figures (5)

  • Figure 1: We present TalkLoRA, a method for improving any transformer-based speech-driven animation model. We use Low Rank Adaptation to effectively and efficiently adapt to new identities and chunking to improve inference speed, with no loss of quality.
  • Figure 2: The chunking process is used to limit the context window of the transformers. We split incoming audio into overlapping chunks of size K+2P and process these in parallel. The padding is then removed and the results concatenated.
  • Figure 3: Qualitative results of our method showing a sentence on one of the train subjects. We compare our adaptation method on both Imitator and Faceformer and show improvements over their respective adaptation methods.
  • Figure 4: Graphs for determining the values of chunk size (K) and padding size (P) for chunking. (a) shows the effect of chunk size (K) on the inference time and the validation loss for a validation subject. Chunks that are too small take a long time due to the padding and also give poor quality. We find a sweet spot for time savings and quality at around 1-3 second chunks. (b) shows the effect of overlap size (P) in chunking. We show the y-position of two lip vertices over time. It can be seen that a 0.2s overlap in chunking allows for outputs that are close to the un-chunked base model.
  • Figure 5: The effect of rank on lip $L_2$ loss across random training subsets. A rank of $\approx 4$ yields the best results.
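
The chunking process described in Figure 2 can be sketched roughly as follows. This is a simplified illustration, not the authors' code. It assumes one output frame per input frame, zero-valued padding at the sequence boundaries, and a `model` callable that maps a chunk of K+2P frames to K+2P outputs. The function names `chunk_sequence`, `merge_chunks` and `chunked_inference` and the `pad_value` parameter are hypothetical.

```python
def chunk_sequence(frames, K, P, pad_value=0.0):
    """Split frames into overlapping chunks of size K + 2P (stride K)."""
    T = len(frames)
    n_chunks = -(-T // K)  # ceiling division
    # Pad P frames at the start, and enough at the end to complete the last chunk.
    tail = n_chunks * K - T + P
    padded = [pad_value] * P + list(frames) + [pad_value] * tail
    return [padded[i * K: i * K + K + 2 * P] for i in range(n_chunks)]


def merge_chunks(outputs, K, P, T):
    """Drop the P padding frames on each side of every chunk and concatenate."""
    merged = []
    for out in outputs:
        merged.extend(out[P:P + K])
    return merged[:T]  # trim the frames added to complete the last chunk


def chunked_inference(frames, model, K, P):
    """Run model on each chunk independently, then stitch the results together."""
    chunks = chunk_sequence(frames, K, P)
    # The chunks do not depend on each other, so a real implementation
    # could stack them into a batch and process them in parallel.
    outputs = [model(chunk) for chunk in chunks]
    return merge_chunks(outputs, K, P, len(frames))
```

In this sketch each chunk has a fixed length of K+2P, so the cost of attention within a chunk no longer grows with the sentence length T. The number of chunks grows linearly with T, which is where the reduction from quadratic complexity would come from.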