Animating the Past: Reconstruct Trilobite via Video Generation

Xiaoran Wu, Zien Huang, Chonghan Yu

TL;DR

Qualitative and quantitative experiments show that the automatic T2V prompt learning method introduced here generates trilobite videos with significantly higher visual realism than strong baselines, promising to boost both scientific understanding and public engagement.

Abstract

Paleontology, the study of past life, fundamentally relies on fossils to reconstruct ancient ecosystems and understand evolutionary dynamics. Trilobites, an important group of extinct marine arthropods, offer valuable insights into Paleozoic environments through their well-preserved fossil record. Reconstructing trilobite behaviour from static fossils could set new standards for dynamic reconstructions in scientific research and education. Despite this potential, current computational methods for the purpose, such as text-to-video (T2V) generation, face significant challenges, including maintaining visual realism and consistency, which hinder their application in scientific contexts. To overcome these obstacles, we introduce an automatic T2V prompt learning method. Within this framework, prompts for a fine-tuned video generation model are generated by a large language model, which is trained using rewards that quantify the visual realism and smoothness of the generated video. Both the fine-tuning of the video generation model and the reward calculations make use of a collected dataset of 9,088 Eoredlichia intermedia fossil images, a species that serves as a common representative of the visual details of all classes of trilobites. Qualitative and quantitative experiments show that our method generates trilobite videos with significantly higher visual realism than strong baselines, promising to boost both scientific understanding and public engagement.

Paper Structure

This paper contains 12 sections, 12 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Qualitative comparison of videos generated by four different models: Pika [pika], Runway [gen2], AnimateDiff [guo2023animatediff], and ours. Our model significantly outperforms the others in generating trilobites with detailed, accurate morphology, realistic texturing, and appropriate environmental interactions. The prompting images in the first and second rows are courtesy of [el2024rapid] and [exampleImage3], respectively.
  • Figure 2: Preference optimization of Script Writer contributes to the visual realism of the generated trilobites. Highlighted text marks the changes in the prompts after preference optimization. Script Writer learns to improve the quality of the generated videos by adding more description of trilobite morphological details. The match score shown is the inverse of the distance to the most similar reference image ($1/\min_{r\in\mathcal{R}}D(x,r)$), and it improves significantly after optimization. The first prompting image is courtesy of [el2024rapid]. Note: the reference images are used to enhance the visual details of the generated content. We show the reference image with the highest match score, as calculated with the ORB detector. The connecting lines join points with similar local image features; they do not mean that the trilobite in the video matches the trilobite in the real fossil image.
  • Figure 3: A qualitative comparison before and after Script Writer preference optimization, focusing on the smoothness and continuity of the video. Script Writer learns to add prompt text that affects the smoothness of the resulting video. The prompting image is courtesy of [exampleImage1].
  • Figure 4: Another qualitative comparison before and after Script Writer preference optimization, again focusing on video smoothness and continuity. Script Writer learns to use words that indicate degree and process to make the resulting video smoother.
  • Figure 5: Quantitative comparison: FID between adjacent frames. The x-axis is the frame ID, ranging from 0 to 100, which follows the sequence of frames in the video. The y-axis is the FID score, where a lower score indicates greater visual similarity and consistency between frames. The results show that Script Writer preference optimization effectively improves the smoothness of the generated video.
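
The match score in the Figure 2 caption, $1/\min_{r\in\mathcal{R}}D(x,r)$, can be illustrated with a minimal sketch. The caption does not define the distance $D$, so the version below is an assumption: each image is represented by a list of binary descriptors (the kind an ORB detector produces), and $D$ is taken to be the mean Hamming distance from each frame descriptor to its nearest reference descriptor. The function names `hamming`, `image_distance`, and `match_score` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Descriptors = Sequence[bytes]  # one bytes object per keypoint descriptor


def hamming(a: bytes, b: bytes) -> int:
    """Count differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def image_distance(frame: Descriptors, reference: Descriptors) -> float:
    """Assumed D(x, r): mean nearest-neighbour Hamming distance.

    This choice of D is an illustration only; the source does not specify it.
    """
    if not frame or not reference:
        return float("inf")  # nothing to match against
    nearest = [min(hamming(d, e) for e in reference) for d in frame]
    return sum(nearest) / len(nearest)


def match_score(frame: Descriptors, references: List[Descriptors]) -> float:
    """Return 1 / min_r D(x, r) over the reference set."""
    if not references:
        raise ValueError("at least one reference image is required")
    d_min = min(image_distance(frame, r) for r in references)
    if d_min == 0:
        return float("inf")  # identical descriptors: score is unbounded
    return 1.0 / d_min


# Toy one-byte descriptors standing in for real 32-byte ORB descriptors.
frame = [b"\x0f", b"\xf0"]
ref_a = [b"\x0e", b"\xf1"]  # one bit away from each frame descriptor
ref_b = [b"\x00"]           # four bits away from each frame descriptor
score = match_score(frame, [ref_a, ref_b])
```

Under this assumed distance, the score rises as the closest reference image gets closer, which matches the caption's reading that a higher match score means the generated frame shares more local features with some real fossil image. In practice the descriptors would come from an ORB implementation rather than being written by hand.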