Personalized Scientific Figure Caption Generation: An Empirical Study on Author-Specific Writing Style Transfer

Jaeyoung Kim, Jongho Lee, Hongjun Choi, Sion Jang

TL;DR

The paper tackles personalized figure caption generation by leveraging author-specific profile data from the same paper, addressing the trade-off between stylistic mimicry and caption informativeness. It introduces a two-stage pipeline: a caption-quality evaluator $f_{\text{quality}}$ filters training data and a multimodal caption generator $g_{\text{caption}}$ is fine-tuned with author profiles $(F, P, M, O)$ and related figures. Empirical results show that richer profile context yields higher BLEU/ROUGE scores and that there is a measurable tension between personalization and quality, motivating a quality-aware training paradigm that jointly predicts caption quality and enables inference-time control via Predicted-Q versus Forced-Q6. The work demonstrates competitive performance relative to larger models and highlights practical considerations for deploying caption automation systems that preserve author voice while maintaining high-quality scientific communication.
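
The first stage of the pipeline described above (filtering training data with a quality evaluator) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Example` record, the `filter_by_quality` helper, the threshold, and the `toy_quality` scorer are all hypothetical. The only detail taken from the summary is that quality scores top out at 6; the lower bound of 1 is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One training example (hypothetical layout)."""
    figure_path: str        # F: the figure image
    paragraphs: List[str]   # P: explanatory paragraphs
    mentions: List[str]     # M: textual mentions of the figure
    ocr_text: str           # O: OCR-extracted text
    caption: str            # C: author-written caption


def filter_by_quality(
    examples: List[Example],
    f_quality: Callable[[Example], int],
    min_score: int,
) -> List[Example]:
    """Keep only examples whose caption is scored at or above min_score."""
    return [ex for ex in examples if f_quality(ex) >= min_score]


def toy_quality(ex: Example) -> int:
    """Stand-in scorer (assumption): longer captions score higher, clipped to 1..6.

    The paper uses a fine-tuned model as the evaluator; this heuristic only
    exists to make the sketch runnable.
    """
    return max(1, min(6, len(ex.caption.split()) // 3))


short = Example("fig1.png", ["..."], ["Fig. 1"], "acc", "Results.")
long_ = Example(
    "fig2.png", ["..."], ["Fig. 2"], "acc steps",
    "Validation accuracy of the proposed model versus training steps "
    "for three learning rates on the held out split",
)
kept = filter_by_quality([short, long_], toy_quality, min_score=4)
```

In a real setup, `f_quality` would wrap the trained evaluator, and the surviving examples would be used to fine-tune the caption generator.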

Abstract

We study personalized figure caption generation using author profile data from scientific papers. Our experiments demonstrate that rich author profile data, combined with relevant metadata, can significantly improve the personalization performance of multimodal large language models. However, we also reveal a fundamental trade-off between matching author style and maintaining caption quality. Our findings offer valuable insights and future directions for developing practical caption automation systems that balance both objectives. This work was conducted as part of the 3rd SciCap challenge.

Paper Structure

This paper contains 7 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Multimodal LLM architecture for the personalized figure caption generation task. The model receives two types of inputs: (1) target figure information including the figure itself (F), explanatory paragraphs (P), textual mentions (M), and OCR-extracted text (O), and (2) optional profile data from N related figures in the same paper (where N can be 0), each containing corresponding figures, paragraphs, mentions, OCR text, and captions (C).
  • Figure 2: Confusion matrix showing the agreement between fine-tuned Qwen-2.5-VL-3B and GPT-4.1.
  • Figure 3: Caption quality scores for Gemini-2.5-flash and fine-tuned Qwen-2.5-VL-7B. Both models generate lower-quality captions than the author-written ones.
  • Figure 4: Quality-aware model architecture for personalized caption generation. (a) Training phase: The model predicts both the quality score (Q) and caption (C) from profile data and target figure information, learning to associate quality levels with caption characteristics. (b) Inference phase: The model can be conditioned on a specified quality score (e.g., Q=6 for maximum quality) to control the quality-personalization trade-off during generation. Profile data includes figures (F), paragraphs (P), mentions (M), OCR text (O), and captions (C) from related figures in the same paper.
  • Figure 5: The prompt used for caption quality assessment in our experiments.
  • ...and 1 more figure
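
The input layout of Figure 1 and the quality conditioning of Figure 4 can be sketched together as text assembly. This is a hedged illustration only: the `FigureInfo` record, the bracketed section headers, and the `Quality: ... / Caption: ...` format are assumptions made for the example, since the captions above do not specify the actual prompt format. The image itself is represented by an identifier string.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FigureInfo:
    figure_id: str                 # stands in for the figure image F
    paragraphs: List[str]          # P
    mentions: List[str]            # M
    ocr_text: str                  # O
    caption: Optional[str] = None  # C, available for profile figures only


def render(info: FigureInfo, header: str) -> str:
    """Serialize one figure's fields as a text block (hypothetical format)."""
    lines = [
        f"[{header}] figure={info.figure_id}",
        "P: " + " ".join(info.paragraphs),
        "M: " + " ".join(info.mentions),
        "O: " + info.ocr_text,
    ]
    if info.caption is not None:
        lines.append("C: " + info.caption)
    return "\n".join(lines)


def build_context(target: FigureInfo, profile: List[FigureInfo]) -> str:
    """Profile blocks from N related figures (N may be 0), then the target."""
    blocks = [render(p, f"Profile {i + 1}") for i, p in enumerate(profile)]
    blocks.append(render(target, "Target"))
    return "\n\n".join(blocks)


def training_target(quality: int, caption: str) -> str:
    """Training phase: the model learns to emit the score, then the caption."""
    return f"Quality: {quality}\nCaption: {caption}"


def decoding_prefix(forced_quality: Optional[int] = None) -> str:
    """Inference phase: leave the score to the model, or force it (e.g. 6)."""
    if forced_quality is None:
        return "Quality:"  # model predicts the score itself
    return f"Quality: {forced_quality}\nCaption:"


target = FigureInfo("fig3", ["We compare two models."], ["Figure 3 shows"], "score")
related = FigureInfo("fig2", ["Agreement is high."], ["see Figure 2"], "GPT-4.1",
                     caption="Confusion matrix of evaluator agreement.")
prompt = build_context(target, [related]) + "\n\n" + decoding_prefix(6)
```

Calling `decoding_prefix()` with no argument leaves the score to the model, while `decoding_prefix(6)` pins it to the maximum quality level mentioned in the Figure 4 caption.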