COLT: Enhancing Video Large Language Models with Continual Tool Usage

Yuyang Liu, Meng Cao, Xinyuan Shi, Xiaodan Liang

TL;DR

COLT tackles continual tool usage in video LLMs by introducing a learnable tool codebook that stores tool prompts and activates tools via cosine similarity to user instructions, enabling automatic tool invocation as a streaming tool set evolves. It couples a vision encoder, a codebook of prompts, and a query encoder to condition LLMs on selected tools, across a three-stage training pipeline with a straight-through gradient estimator to handle non-differentiable tool selection. The approach yields state-of-the-art results on zero-shot video QA, MVBench, and the proposed VideoToolBench, while demonstrating strong continual-learning performance against baselines like L2P, DualPrompt, and CODA-Prompt. These results indicate COLT’s potential to enable robust, tool-enabled video understanding in open-source LLMs for real-world, evolving tool ecosystems.

Abstract

The success of Large Language Models (LLMs) has significantly propelled research on video understanding. To harvest the benefits of well-trained expert models (i.e., tools), video LLMs prioritize the exploration of tool usage capabilities. Existing methods either prompt closed-source LLMs or employ the instruction tuning paradigm for tool-use fine-tuning. These methods, however, assume an established repository of fixed tools and struggle to generalize to real-world environments where tool data is perpetually evolving and streaming in. To this end, we propose to enhance open-source video LLMs with COntinuaL Tool usage (termed COLT), which automatically acquires tool-use ability in a successive tool stream without suffering 'catastrophic forgetting' of previously learned tools. Specifically, COLT incorporates a learnable tool codebook as a tool-specific memory system. Relevant tools are then dynamically selected based on the similarity between the user instruction and the tool features within the codebook. To unleash the tool usage potential of video LLMs, we collect a video-centric tool-use instruction tuning dataset, VideoToolBench. Extensive experiments on both previous video LLM benchmarks and the tool-use-specific VideoToolBench dataset demonstrate the state-of-the-art performance of our proposed COLT.
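
The TL;DR mentions a straight-through gradient estimator for the non-differentiable tool selection, but this page does not give its form. As a rough sketch only, one common generic formulation is shown below; it is an assumption and not necessarily the paper's exact equation. Here $s_i$ is the cosine similarity between the query feature $\mathbf{h}_q$ and the $i$-th tool prompt $\mathbf{p}_i$, $\tau$ is a hypothetical temperature, and $\operatorname{sg}(\cdot)$ denotes stop-gradient.

```latex
s_i = \frac{\mathbf{h}_q^{\top} \mathbf{p}_i}{\lVert \mathbf{h}_q \rVert \, \lVert \mathbf{p}_i \rVert},
\qquad
\mathbf{w}^{\text{soft}} = \operatorname{softmax}(\mathbf{s} / \tau),
\qquad
w^{\text{hard}}_i = \mathbb{1}\left[\, i \in \operatorname{top}\text{-}K(\mathbf{s}) \,\right]

\tilde{\mathbf{w}} = \mathbf{w}^{\text{hard}} + \mathbf{w}^{\text{soft}} - \operatorname{sg}\left(\mathbf{w}^{\text{soft}}\right)
```

Under this formulation the forward pass sees the hard selection, since $\tilde{\mathbf{w}} = \mathbf{w}^{\text{hard}}$ in value, while the backward pass uses the gradient of $\mathbf{w}^{\text{soft}}$ with respect to $\mathbf{s}$, so the codebook prompts still receive a training signal.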

Paper Structure

This paper contains 18 sections, 8 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Our proposed COLT continually learns to invoke tools from tool-stream data without catastrophic forgetting. Benefiting from tool usage, COLT (a) is adept at dynamic content understanding and (b) supports flexible generation compared to existing methods [lin2023video; li2023mvbench]. The incorrect parts of responses are marked in red.
  • Figure 2: (a) Agent-based methods bootstrap closed-source LLMs via delicately designed prompts; (b) Instruction tuning with fixed tool-use dataset; (c) Instruction tuning with stream tool-use dataset (Ours); (d) Average tool calling accuracy on VideoTool vs. learned tools. Sequential training denotes training on a sequence of tasks independently.
  • Figure 3: An overview of COLT. Stage 1 aligns the visual and textual modalities by training the linear projector ${f}_{\boldsymbol{\delta}}$ on its own. In stages 2 and 3, the prompt within the tool codebook $\mathbf{P}$ is adaptively selected according to its cosine similarity with the query feature $\mathbf{H}_{\mathbf{q}}$.
  • Figure 4: Illustration of the tool selection mechanism based on cosine similarity. Given the query embedding $\mathbf{h}_q$ and a tool codebook $\{\mathbf{p}_1, \dots, \mathbf{p}_N\}$, we first compute the cosine similarities between $\mathbf{h}_q$ and each tool prompt $\mathbf{p}_i$. The most relevant tools are then selected (e.g., via top-$K$) and concatenated with the visual/text embeddings before being fed into the large language model.
  • Figure 5: Qualitative results on MVBench.
  • ...and 3 more figures
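
The selection mechanism described in the Figure 4 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: each tool prompt is simplified to a single vector, the function names (`cosine_similarity`, `select_tools`, `build_llm_input`) are hypothetical, and the concatenation order of prompts, visual tokens, and text tokens is assumed rather than taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_tools(
    query: Sequence[float], codebook: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) for the k prompts most similar to the query embedding."""
    scores = [cosine_similarity(query, prompt) for prompt in codebook]
    ranked = sorted(range(len(codebook)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


def build_llm_input(selected_prompts, visual_tokens, text_tokens) -> list:
    """Concatenate selected prompts with visual and text embeddings.

    The ordering (prompts first) is an assumption made for this sketch.
    """
    return list(selected_prompts) + list(visual_tokens) + list(text_tokens)


# Toy codebook with three 2-D tool prompts and one query embedding.
codebook = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.1]

top = select_tools(query, codebook, k=2)
selected = [codebook[i] for i, _ in top]
llm_input = build_llm_input(selected, visual_tokens=[[0.5, 0.5]], text_tokens=[[0.2, 0.8]])
```

In this toy example the query points mostly along the first axis, so the first and third prompts rank highest. A real system would operate on batched tensors and, as the figure caption notes, feed the concatenated sequence into the large language model.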