Pegasus-v1 Technical Report

Raehyuk Jung, Hyojun Go, Jaehyuk Yi, Jiho Jang, Daniel Kim, Jay Suh, Aiden Lee, Cooper Han, Jae Lee, Jeff Kim, Jin-Young Kim, Junwan Kim, Kyle Park, Lucas Lee, Mars Ha, Minjoon Seo, Abraham Jo, Ed Park, Hassan Kianinejad, SJ Kim, Tony Moon, Wade Jeong, Andrei Popescu, Esther Kim, EK Yoon, Genie Heo, Henry Choi, Jenna Kang, Kevin Han, Noah Seo, Sunny Nguyen, Ryan Won, Yeonhoo Park, Anthony Giuliani, Dave Chung, Hans Yoon, James Le, Jenny Ahn, June Lee, Maninder Saini, Meredith Sanders, Soyoung Lee, Sue Kim, Travis Couture

TL;DR

Pegasus-1 tackles the challenge of understanding complex, multimodal video data by integrating visual, audio, and temporal information through a tripartite architecture (Video Encoder, Video-Language Alignment, and a Large Language Model) and a two-phase training regime. It achieves state-of-the-art results across key benchmarks for video conversation, zero-shot video QA, and video summarization, outperforming both open-source and proprietary baselines as of early 2024. The paper provides extensive qualitative analyses of capabilities—such as real-world knowledge, 3D spatial understanding, and temporal reasoning—while candidly detailing limitations like maximum video duration, hallucinations, safety concerns, and the absence of chat functionality. Together, these contributions underscore Pegasus-1’s potential for robust, interactive video-language understanding and offer a clear roadmap for extending length, reliability, and user interactivity in future work.
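The three-component data flow named above (Video Encoder, Video-Language Alignment, Large Language Model) can be illustrated with a toy sketch. This is only an illustration of how the stages chain together, not the report's implementation: every function name, the concatenation-based fusion, the average-pooling alignment step, and the `num_tokens` parameter are assumptions made for this example, and the language model is a placeholder that returns a formatted string.

```python
from typing import List, Sequence

Vector = List[float]


def video_encoder(frames: Sequence[Sequence[float]],
                  audio: Sequence[Sequence[float]]) -> List[Vector]:
    """Hypothetical stand-in for the Video Encoder.

    Fuses per-timestep visual and audio features by simple concatenation
    (an assumption; the report does not describe the fusion method here).
    """
    if len(frames) != len(audio):
        raise ValueError("frames and audio must cover the same timesteps")
    return [list(f) + list(a) for f, a in zip(frames, audio)]


def video_language_alignment(embeddings: List[Vector],
                             num_tokens: int) -> List[Vector]:
    """Hypothetical stand-in for the Video-Language Alignment Model.

    Averages contiguous chunks of embeddings so that at most `num_tokens`
    vectors are handed to the language model (an illustrative choice only).
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if not embeddings:
        return []
    chunk = -(-len(embeddings) // num_tokens)  # ceiling division
    tokens: List[Vector] = []
    for start in range(0, len(embeddings), chunk):
        group = embeddings[start:start + chunk]
        dim = len(group[0])
        tokens.append([sum(v[i] for v in group) / len(group)
                       for i in range(dim)])
    return tokens


def language_model(video_tokens: List[Vector], prompt: str) -> str:
    """Placeholder for the Large Language Model: no real generation."""
    return f"[{len(video_tokens)} video tokens] response to: {prompt}"


def describe_video(frames: Sequence[Sequence[float]],
                   audio: Sequence[Sequence[float]],
                   prompt: str,
                   num_tokens: int = 2) -> str:
    """Chain the three stages: encode, align, then generate text."""
    embeddings = video_encoder(frames, audio)
    video_tokens = video_language_alignment(embeddings, num_tokens)
    return language_model(video_tokens, prompt)
```

Calling `describe_video(frames, audio, "What happens?")` with four timesteps of features passes four fused embeddings through the alignment step and hands two pooled vectors to the placeholder language model. The point of the sketch is the ordering of the stages; the real components are learned models.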

Abstract

This technical report introduces Pegasus-1, a multimodal language model specialized in video content understanding and interaction through natural language. Pegasus-1 is designed to address the unique challenges posed by video data, such as interpreting spatiotemporal information, to offer nuanced comprehension of videos of various lengths. The report gives an overview of Pegasus-1's architecture, its training strategies, and its performance on benchmarks for video conversation, zero-shot video question answering, and video summarization. We also explore qualitative characteristics of Pegasus-1, demonstrating its capabilities as well as its limitations, to give readers a balanced view of its current state and future direction.

Paper Structure (32 sections, 19 figures, 4 tables)

Figures (19)

  • Figure 1: Architectural Overview of Pegasus-1: Pegasus-1 comprises three main components: 1) the Video Encoder Model for generating multimodal embeddings from visual and audio inputs, 2) the Video-Language Alignment Model for synchronizing video and text representations, and 3) the Large Language Model for generating contextually relevant textual output.
  • Figure 2: The video, captured in the Tuileries Garden, lacks any verbal narration, allowing Pegasus-1 to deduce the location purely through visual cues. Impressively, it identifies key visual elements such as manicured hedges, ornamental trees, and the distinctive facade of a grand building—critical factors in pinpointing the exact setting. This capability showcases Pegasus-1's proficiency in analyzing and interpreting essential visual information to arrive at accurate conclusions.
  • Figure 3: The input video showcases various landscapes across Kyoto. Pegasus's response meticulously orders scenes to reflect their chronological appearance in the video, detailing Kyoto's landmarks in sync with their sequence. It intelligently infers the filming season as autumn, deduced from the presence of autumn foliage. Adhering precisely to the inquiry, Pegasus concludes by accurately providing Kyoto's coordinates, demonstrating its adeptness in both recognizing visual patterns and extracting relevant contextual information from the imagery presented.
  • Figure 4: The video showcases a Bugatti Chiron performing high-skill maneuvers, including producing smoke from its tires. Pegasus accurately identifies the exact car model, showcasing its broad knowledge of the real world. This highlights Pegasus's ability to recognize and interpret specific details from visual inputs.
  • Figure 5: The video features a brief gameplay moment from "Legend of Zelda: Breath of the Wild." Unlike the Gemini models, Pegasus provides a detailed visual analysis, such as noting the enemy's health bar, pinpointing the scene's location on a wooden platform, and mentioning the electric blue light during an attack. Additionally, Pegasus accurately identifies the game's title, showcasing its ability to interpret and convey comprehensive visual details along with correct contextual understanding.
  • ...and 14 more figures