Disentangling Fine-Tuning from Pre-Training in Visual Captioning with Hybrid Markov Logic

Monika Shah, Somdeb Sarkhel, Deepak Venugopal

TL;DR

The paper tackles the challenge of disentangling knowledge acquired during fine-tuning from knowledge acquired during pretraining in visual captioning. It introduces a Hybrid Markov Logic Network (HMLN) that combines symbolic predicates extracted from captions with real-valued CLIP-based visual terms, and defines two inference tasks: abductive MAP inference, which assesses how likely a caption is given an image, and back-tracing, which uses sampling and mixed-integer linear programming (MILP) to find training examples that explain the caption's dependence on fine-tuning data. Experiments on MSCOCO across five captioning architectures, including BLIP2 (a visual large language model), show that LLM-based systems leverage broad pretraining knowledge, making their captions harder to trace to fine-tuning data, while non-LLM baselines align more closely with the fine-tuning distribution. The work provides a principled method for knowledge attribution in multimodal systems and suggests directions for diagnosing and improving the reliability of captioning under strong pretraining signals.

Abstract

Multimodal systems have highly complex processing pipelines and are pretrained over large datasets before being fine-tuned for specific tasks such as visual captioning. However, it becomes hard to disentangle what the model learns during the fine-tuning process from what it already knows due to its pretraining. In this work, we learn a probabilistic model using Hybrid Markov Logic Networks (HMLNs) over the training examples by relating symbolic knowledge (extracted from the caption) with visual features (extracted from the image). For a generated caption, we quantify the influence of training examples based on the HMLN distribution using probabilistic inference. We evaluate two types of inference procedures on the MSCOCO dataset for different types of captioning models. Our results show that for BLIP2 (a model that uses an LLM), fine-tuning may have a smaller influence on the knowledge the model has acquired, since it may have more general knowledge to perform visual captioning than models that do not use an LLM.

Paper Structure

This paper contains 18 sections, 14 equations, 6 figures, 1 table, 3 algorithms.

Figures (6)

  • Figure 1: Illustrative example where M2 is pretrained with an LLM while M1 is not. For M1, the source of its knowledge in generating the caption can be traced in the training examples while for M2 it is harder since it may have more general knowledge from pretraining.
  • Figure 2: The x-axis shows the average log MAP objective value computed across test images. The y-axis shows the ranking of a model where the rank is computed based on the average rank across 5 standard captioning metrics ($R=1$ is the highest ranked model).
  • Figure 3: Illustrative examples for MAP values of captions for BLIP2 and non-VLLM models.
  • Figure 4: Comparing responses (on a Likert scale of 1 to 5) from AMT users across models for explaining their generated captions using training examples. Higher values indicate that the users more strongly agreed with the explanations.
  • Figure 5: Illustrative contrastive training examples for the generated captions. The caption shown in each case was generated for the first image; the second image is the highest-probability training example and the third is the lowest-probability training example.
  • ...and 1 more figure
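
The model ranking described in the Figure 2 caption (average rank across 5 captioning metrics, with $R=1$ as the best model) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the model names, the metric names `metric_1` to `metric_5`, and all scores are hypothetical placeholders, higher scores are assumed to be better for every metric, and ties are not handled.

```python
from statistics import mean


def final_ranking(scores):
    """Map each model to its final rank R (1 = best) from per-metric scores."""
    models = list(scores)
    metrics = list(next(iter(scores.values())))
    per_metric_ranks = {m: [] for m in models}
    for metric in metrics:
        # Assumption: a higher score is better for every metric.
        ordered = sorted(models, key=lambda m: scores[m][metric], reverse=True)
        for rank, model in enumerate(ordered, start=1):
            per_metric_ranks[model].append(rank)
    # Average each model's rank over all metrics.
    avg_rank = {m: mean(r) for m, r in per_metric_ranks.items()}
    # A lower average rank yields a better (smaller) final rank R.
    ordered = sorted(models, key=lambda m: avg_rank[m])
    return {model: r for r, model in enumerate(ordered, start=1)}


# Hypothetical scores for illustration only (not results from the paper).
scores = {
    "model_a": {"metric_1": 0.9, "metric_2": 0.8, "metric_3": 0.7,
                "metric_4": 0.9, "metric_5": 0.6},
    "model_b": {"metric_1": 0.5, "metric_2": 0.9, "metric_3": 0.6,
                "metric_4": 0.4, "metric_5": 0.5},
    "model_c": {"metric_1": 0.3, "metric_2": 0.2, "metric_3": 0.1,
                "metric_4": 0.3, "metric_5": 0.2},
}
ranking = final_ranking(scores)
print(ranking)
```

Here `model_a` wins four of the five hypothetical metrics, so its average rank is the lowest and it receives $R=1$.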

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4