DB3 Team's Solution For Meta KDD Cup' 25

Yikuan Xia, Jiazun Chen, Yirui Zhan, Suifeng Zhao, Weipeng Jiang, Chaorui Zhang, Wei Han, Bo Bai, Jun Gao

TL;DR

This work tackles ego-centric, multi-modal, multi-turn QA on the CRAG-MM benchmark by combining domain-adaptive retrieval pipelines with a unified hallucination-control strategy. The retrieval side uses specialized modules: image KG grounding via Grounding-DINO, text-based web retrieval with merged query rewrites, and multi-turn context handling. The hallucination-control side uses multi-stage refusal training (SFT, DPO, GRPO) complemented by checkpoint ensembling. The approach placed 2nd in Tasks 1 and 2 and 1st in Task 3, securing the grand prize for ego-centric queries, which reflects robust handling of first-person perspectives. Remaining limitations are fine-grained entity recognition and OCR integration, which suggest future work on targeted recognition models and optimized OCR-VLM pipelines.

Abstract

This paper presents the db3 team's winning solution for the Meta CRAG-MM Challenge 2025 at KDD Cup'25. Addressing the challenge's unique multi-modal, multi-turn question answering benchmark (CRAG-MM), we developed a comprehensive framework that integrates tailored retrieval pipelines for different tasks with a unified LLM-tuning approach for hallucination control. Our solution features (1) domain-specific retrieval pipelines handling image-indexed knowledge graphs, web sources, and multi-turn conversations; and (2) advanced refusal training using SFT, DPO, and RL. The system achieved 2nd place in Task 1, 2nd place in Task 2, and 1st place in Task 3, securing the grand prize for excellence in ego-centric queries through superior handling of first-person perspective challenges.

Paper Structure

This paper contains 15 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Pipeline for Different Tasks. A domain adapter first classifies the query domain and routes the query to the matching pipeline, e.g., a math tool for math problems, retrieval over entity chunks for plants, and query rewriting followed by retrieval from web chunks for all other domains. A reranker then scores all retrieved text chunks, each trained hallucination-control adapter produces an answer, and these answers are ensembled into the final output answer.
  • Figure 2: Confusion Matrix of Domain Prediction
  • Figure 3: Grounding in the Query Image + Rerank through Image
  • Figure 4: Retrieving and Reranking through Text.
  • Figure 5: Retrieval Pipeline for Text-indexed Web Source.
  • ...and 2 more figures
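
The control flow described in the Figure 1 caption (route by domain, retrieve, rerank, answer with several adapters, ensemble) can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: the `route`, `ensemble`, and `answer_query` helpers, the `"web"` fallback key, and the refusal string are illustrative choices. In particular, the source does not say how the answers are ensembled; strict majority voting with a refusal fallback is an assumption made only for this example.

```python
from collections import Counter
from typing import Callable, Dict, List

# Assumed refusal string; the paper's actual refusal wording is not given here.
REFUSAL = "I don't know"

Retriever = Callable[[str], List[str]]
Reranker = Callable[[str, List[str]], List[str]]
Adapter = Callable[[str, List[str]], str]


def route(domain: str, pipelines: Dict[str, Retriever]) -> Retriever:
    """Pick the retrieval pipeline for a predicted domain.

    Domains without a dedicated pipeline fall back to the "web" pipeline,
    mirroring "all other domains" in the caption.
    """
    return pipelines.get(domain, pipelines["web"])


def ensemble(answers: List[str]) -> str:
    """Assumed ensembling rule: strict majority vote, otherwise refuse."""
    if not answers:
        return REFUSAL
    normalized = [a.strip().lower() for a in answers]
    top, count = Counter(normalized).most_common(1)[0]
    if count * 2 <= len(answers):
        # No answer is backed by more than half of the adapters.
        return REFUSAL
    # Return the first answer matching the winner, with its original casing.
    for a in answers:
        if a.strip().lower() == top:
            return a.strip()
    return REFUSAL


def answer_query(
    query: str,
    domain: str,
    pipelines: Dict[str, Retriever],
    rerank: Reranker,
    adapters: List[Adapter],
) -> str:
    """Route, retrieve, rerank, collect one answer per adapter, then ensemble."""
    chunks = route(domain, pipelines)(query)
    top_chunks = rerank(query, chunks)
    candidates = [adapter(query, top_chunks) for adapter in adapters]
    return ensemble(candidates)
```

With stub retrievers and adapters plugged in, `answer_query` returns the answer shared by more than half of the adapters and falls back to the refusal string when they disagree, which is one simple way to trade coverage for fewer hallucinated answers.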