Generative Timelines for Instructed Visual Assembly

Alejandro Pardo, Jui-Hsien Wang, Bernard Ghanem, Josef Sivic, Bryan Russell, Fabian Caba Heilbron

TL;DR

We develop a large multimodal language model that processes visual content, compactly represents timelines, and accurately interprets timeline editing instructions. It substantially outperforms established baseline models in executing complex assembly instructions across various real-world inspired scenarios.

Abstract

The objective of this work is to manipulate visual timelines (e.g., a video) through natural language instructions, making complex timeline editing tasks accessible to non-expert users and, potentially, to users with disabilities. We call this task instructed visual assembly. The task is challenging because it requires (i) identifying relevant visual content in the input timeline and retrieving relevant visual content from a given input (video) collection, (ii) understanding the input natural language instruction, and (iii) performing the desired edits on the input visual timeline to produce an output timeline. To address these challenges, we propose the Timeline Assembler, a generative model trained to perform instructed visual assembly tasks. The contributions of this work are three-fold. First, we develop a large multimodal language model designed to process visual content, compactly represent timelines, and accurately interpret timeline editing instructions. Second, we introduce a novel method for automatically generating datasets for visual assembly tasks, enabling efficient training of our model without the need for human-labeled data. Third, we validate our approach by creating two novel datasets for image and video assembly, demonstrating that the Timeline Assembler substantially outperforms established baseline models, including the recent GPT-4o, in accurately executing complex assembly instructions across various real-world inspired scenarios.

Paper Structure

This paper contains 31 sections, 13 figures, 11 tables, 2 algorithms.

Figures (13)

  • Figure 1: Instructed Visual Assembly. Given a visual collection, an input timeline, and an assembly instruction, our model (called the Timeline Assembler) performs the instructed assembly task and generates an output timeline with the desired edits. The collection comprises various media elements, such as video clips or images. The timeline is a sequential arrangement of these elements.
  • Figure 2: Timeline Assembler Architecture. We design a multimodal architecture to execute visual assembly instructions to generate visual timelines. Our model takes as inputs: a collection of images/videos $C$, a timeline $S$, and an assembly instruction $q$. Each image/video in the collection is represented with a unique identifier token $\mathbf{x}^{i}_{k}$ (color-coded) and a visual token $\mathbf{x}^{i}_{v}$, forming the sequence $\mathbf{X}_C$. The input timeline is represented with the sequence of tokens $\mathbf{X}_S$, which comprises the list of identifier tokens of the images/videos in the timeline. The assembly instruction is tokenized into $\mathbf{X}_q$. Given the input tokens, the task of the Large Language Model is to generate output timeline tokens $\mathbf{X}_{\tilde{S}}$, which are reconstructed into the output timeline $\tilde{S}$.
  • Figure 3: Training the Timeline Assembler. The Collection Tokenizer is composed of a frozen visual encoder $g(\cdot)$, a mapping function $\mathcal{H}$ that generates an identifier token for each input visual asset (image/video) in the collection, and a learnable projection layer $h_{\gamma}(\cdot)$ that maps visual embeddings into visual tokens aligned with the LLM. We keep the LLM $f_{\theta}(\cdot)$ mostly frozen except for a lightweight set of learnable LoRA adapters (Hu et al., 2021).
  • Figure 4: Multi-Length Timeline Assembler. Unlike GPT-4o, whose performance degrades for timeline lengths above 5, our Timeline Assembler performs consistently across different timeline lengths.
  • Figure : Creating Visual Assembly Task Datasets
  • ...and 8 more figures
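
The input and output representation described in the Figure 2 caption can be illustrated with a small sketch. It is not the authors' implementation: the identifier format `<id_k>`, the `[vis:...]` text stand-in for a visual token, the prompt layout, and all function names are assumptions made for illustration, and the "generated" output string is written by hand rather than produced by a model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Asset:
    identifier: str   # hypothetical identifier token, e.g. "<id_0>"
    description: str  # text stand-in for the visual token of this asset


def build_collection(descriptions: List[str]) -> Dict[str, Asset]:
    """Give every asset a unique identifier token (the role of the mapping H)."""
    return {
        f"<id_{i}>": Asset(f"<id_{i}>", desc)
        for i, desc in enumerate(descriptions)
    }


def build_prompt(collection: Dict[str, Asset],
                 timeline: List[str],
                 instruction: str) -> str:
    """Concatenate collection (X_C), timeline (X_S) and instruction (X_q)."""
    x_c = " ".join(f"{a.identifier} [vis:{a.description}]"
                   for a in collection.values())
    x_s = " ".join(timeline)  # the timeline is only a list of identifier tokens
    return (f"Collection: {x_c}\n"
            f"Timeline: {x_s}\n"
            f"Instruction: {instruction}\n"
            f"Output:")


def decode_timeline(output_tokens: str,
                    collection: Dict[str, Asset]) -> List[Asset]:
    """Map generated identifier tokens back to assets, dropping unknown tokens."""
    return [collection[t] for t in output_tokens.split() if t in collection]


collection = build_collection(["dog", "cat", "city", "beach"])
prompt = build_prompt(collection, ["<id_0>", "<id_1>"],
                      "Replace the second clip with the beach clip.")

# Hand-written stand-in for the identifier tokens a model might generate.
generated = "<id_0> <id_3>"
output_timeline = decode_timeline(generated, collection)
print([a.description for a in output_timeline])
```

Because the timeline is expressed purely through identifier tokens, the output stays short regardless of how large each visual asset is, which is one plausible reading of "compactly represent timelines" in the abstract.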