HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task

Yu Tian; Tianqi Shao; Tsukasa Demizu; Xuyang Wu; Hsin-Tai Wu

TL;DR

HPE-CogVLM is a novel framework that improves head pose estimation (HPE) accuracy by leveraging the object detection grounding capability of a vision language model, CogVLM. It achieves a 31.5% reduction in Mean Absolute Error over the current state-of-the-art CNN model, 6DRepNet, in cross-dataset evaluation.
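To make the headline metric concrete, here is a minimal sketch of how a Mean Absolute Error over yaw, pitch, and roll, and a relative reduction between two models, could be computed. The function names are hypothetical, the wraparound handling of angles is an assumption rather than something stated here, and the numbers in the usage note are toy values, not results from the paper.

```python
def angular_error(pred_deg, true_deg):
    """Absolute difference between two angles in degrees.

    Assumption: errors wrap around at 360 degrees, so 350 vs. 10 counts as 20.
    """
    diff = abs(pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)


def mean_absolute_error(preds, labels):
    """Average angular error over all (yaw, pitch, roll) triples."""
    errors = [
        angular_error(p, t)
        for pred, label in zip(preds, labels)
        for p, t in zip(pred, label)
    ]
    return sum(errors) / len(errors)


def relative_reduction(baseline_mae, new_mae):
    """Percentage by which new_mae is lower than baseline_mae."""
    return (baseline_mae - new_mae) / baseline_mae * 100.0
```

For example, with toy values, `relative_reduction(10.0, 5.0)` gives a 50% reduction; the 31.5% figure above is the same ratio applied to the MAEs of 6DRepNet and HPE-CogVLM.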

Abstract

Head pose estimation (HPE) requires a sophisticated understanding of 3D spatial relationships to generate precise yaw, pitch, and roll angles. Previous HPE models, primarily CNN-based, rely on cropped close-up human head images as inputs and often lack robustness in real-world scenarios. Vision Language Models (VLMs) can analyze entire images while focusing on specific objects through their attention mechanisms. In this paper, we propose a novel framework to improve HPE accuracy by leveraging the object detection grounding capability of a VLM, referred to as CogVLM. We empirically find that direct LoRA fine-tuning of this VLM for the HPE task fails to achieve desirable HPE accuracy, while some model merging methods can improve accuracy but frequently produce blended, invalid response formats and struggle to handle the object detection and HPE tasks simultaneously. To integrate HPE capability into CogVLM effectively, we develop a novel LoRA layer-based model merging method. This merging approach applies a high cosine similarity threshold and a winner-takes-all layer selection strategy, aligning attention to the HPE task while preserving the original object detection knowledge. It resolves the issue of blended, invalid response formats and improves accuracy. Results show that our HPE-CogVLM achieves a 31.5% reduction in Mean Absolute Error over the current state-of-the-art CNN model, 6DRepNet, in cross-dataset evaluation. Furthermore, HPE-CogVLM outperforms both directly LoRA fine-tuned and task arithmetic-based merged VLMs across all HPE metrics.
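The layer-based merging idea can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that a high cosine similarity threshold and a winner-takes-all layer selection are used. The sketch assumes that each layer is a flat list of weights, that a layer whose two versions are highly similar is taken whole from the HPE-oriented model, and that any other layer keeps the original grounding weights. The threshold value `0.95` and all names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two flat weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_layers(grounding, hpe, threshold=0.95):
    """Winner-takes-all merge: each layer comes entirely from one model.

    Assumption (not confirmed by the abstract): a layer at or above the
    similarity threshold is taken from the HPE-oriented model; any other
    layer keeps the original grounding weights. Nothing is averaged.
    """
    merged, sources = {}, {}
    for name, g_weights in grounding.items():
        h_weights = hpe[name]
        if cosine_similarity(g_weights, h_weights) >= threshold:
            merged[name], sources[name] = list(h_weights), "hpe"
        else:
            merged[name], sources[name] = list(g_weights), "grounding"
    return merged, sources


# Toy weights for two hypothetical layers.
grounding = {"layer_a": [1.0, 0.0], "layer_b": [1.0, 0.0]}
hpe = {"layer_a": [0.9, 0.1], "layer_b": [0.0, 1.0]}
merged, sources = merge_layers(grounding, hpe)
```

Because every layer is copied whole from a single model, no layer blends weights from both sources, in contrast to arithmetic averaging of parameters.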

Paper Structure (26 sections, 1 equation, 4 figures, 6 tables)

Introduction
Related Work
Head Pose Estimation (HPE)
Grounding in Vision Language Models
Model Merging in LLMs
Catastrophic Forgetting Problem
HPE-CogVLM Framework
Stage 1: Pre-training of the Original Grounding CogVLM on Weak Label Data
Stage 2: Supervised Fine-tuning of the Weak Label CogVLM on Task-specific (HPE) Data
Stage 3: Layer-based Merging between Original Grounding CogVLM and HPE-oriented CogVLM
Stage 4: Continual Fine-tuning of Layer-based Merging CogVLM on Mixture Data
Stage 5: Evaluation of HPE-CogVLM on Test Data
Experiments Setup
HPE Task Prompt Design
Datasets
...and 11 more sections

Figures (4)

Figure 1: Examples of CogVLM and HPE-CogVLM. (a) shows the original grounding CogVLM identifying objects based on prompts, a foundational skill useful for the HPE task. (b) visualizes the head orientation predicted by our HPE-CogVLM on an image from the CMU Panoptic dataset, using Euler angles. The head pose labels are depicted as pitch (red axis), roll (green axis), and yaw (blue axis) angles, each indicated in its respective direction.
Figure 2: The framework for integrating the HPE task into the original grounding CogVLM. The diagram illustrates our multi-stage integration process, including dataset usage, the designed prompts, and the model merging strategy.
Figure 3: Model performance under rehearsal ratios of 10% and 25%. (a) shows the MAE results of the VLMs under each rehearsal ratio. (b) shows the RefCOCO test BBox accuracy of the VLMs under each rehearsal ratio.
Figure 4: Cross-attention maps generated in response to our custom prompts. The left image shows the attention map associated with the prompt "What is the head yaw pitch roll inside the bounding box [[335,179,445,332]]?" (BBox for the person on the left), and the right image corresponds to the prompt "What is the head yaw pitch roll inside the bounding box [[775,105,893,261]]?" (BBox for the person on the right).
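
The prompt format quoted in the Figure 4 caption can be reproduced with a small helper. This is a sketch only: the function name is hypothetical, and the coordinate order and scaling of the bounding box are not stated in this excerpt, so the four values are passed through unchanged.

```python
def build_hpe_prompt(bbox):
    """Format an HPE query for one head, given four integer box coordinates.

    Assumption: the box is already expressed in whatever coordinate
    convention the model expects; no scaling or reordering is applied.
    """
    c1, c2, c3, c4 = bbox
    return (
        "What is the head yaw pitch roll inside the bounding box "
        f"[[{c1},{c2},{c3},{c4}]]?"
    )


# The two boxes quoted in the Figure 4 caption.
left_prompt = build_hpe_prompt((335, 179, 445, 332))
right_prompt = build_hpe_prompt((775, 105, 893, 261))
```

Issuing one prompt per bounding box lets the same full image be queried separately for each person, which is the setting the attention maps in Figure 4 illustrate.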
