PostDoc: Generating Poster from a Long Multimodal Document Using Deep Submodular Optimization

Vijay Jaisankar, Sambaran Bandyopadhyay, Kalp Vyas, Varre Chaitanya, Shwetha Somasundaram

TL;DR

PostDoc addresses automatic poster generation from long multimodal documents by jointly optimizing content selection and design. It formulates a novel deep submodular function to capture coverage, diversity, and cross-modal alignment among text and images, and trains its weights via a hinge loss with alternating optimization. The pipeline paraphrases the selected content using GPT-3.5-turbo and generates a poster template with font, color, and layout decisions (including a heuristic layout) tuned to the content. Automated and human evaluations on MSMO and NJU-Fudan datasets show PostDoc outperforms baselines in textual coverage and poster aesthetics while offering faster inference and cost efficiency. Limitations include handling non-natural images and structured elements, with future work proposing fine-tuned vision-language models on such content.

Abstract

A poster made from a long input document can be seen as a one-page, easy-to-read multimodal (text and images) summary presented on a well-designed template. Automatically transforming a long document into a poster is a challenging but little-studied task. It involves summarizing the content of the input document, followed by template generation and harmonization. In this work, we propose a novel deep submodular function that can be trained on ground-truth summaries to extract multimodal content from the document and that explicitly ensures good coverage, diversity, and alignment of text and images. We then use an LLM-based paraphraser and generate a template whose design aspects are conditioned on the input content. We show the merits of our approach through extensive automated and human evaluations.
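
To make the content-selection idea concrete, here is a minimal sketch of greedy selection under a toy submodular objective. It is not the paper's function: the objective (a sum of square roots over feature dimensions), the item names, the feature vectors, and the helpers `coverage_score` and `greedy_select` are all assumptions chosen for illustration. It only shows how diminishing returns push a selector toward covering different aspects of a document rather than picking redundant items.

```python
import math
from typing import Dict, List, Set


def coverage_score(selected: Set[str], features: Dict[str, List[float]]) -> float:
    """Toy objective (an assumption, not the paper's function).

    For each feature dimension, sum the non-negative mass of the selected
    items and take the square root. A concave function of a non-negative
    modular sum is monotone submodular, so the total is as well.
    """
    if not selected:
        return 0.0
    dims = len(next(iter(features.values())))
    return sum(
        math.sqrt(sum(features[item][d] for item in selected))
        for d in range(dims)
    )


def greedy_select(features: Dict[str, List[float]], budget: int) -> List[str]:
    """Pick up to `budget` items, each time adding the largest marginal gain."""
    selected: Set[str] = set()
    order: List[str] = []
    for _ in range(min(budget, len(features))):
        base = coverage_score(selected, features)
        best_item, best_gain = None, 0.0
        for item in sorted(features):  # sorted for deterministic tie-breaking
            if item in selected:
                continue
            gain = coverage_score(selected | {item}, features) - base
            if gain > best_gain:
                best_item, best_gain = item, gain
        if best_item is None:  # no remaining item improves the objective
            break
        selected.add(best_item)
        order.append(best_item)
    return order


# Hypothetical items: two sentences about the same aspect and one image
# about a different aspect. Dimensions stand in for "aspects" of the document.
toy_features = {
    "sent_a": [4.0, 0.0],
    "sent_b": [3.0, 0.0],
    "img_c": [0.0, 1.0],
}
picked = greedy_select(toy_features, budget=2)
print(picked)
```

In this toy data, `sent_b` scores higher than `img_c` on its own, but once `sent_a` is selected its marginal gain shrinks, so the second pick is the image that covers the other dimension.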

Paper Structure

This paper contains 34 sections, 1 theorem, 33 equations, 3 figures, and 8 tables.

Key Result

Theorem 2.1

The set function $f$ in Equation (eq:dsf) is a monotone submodular function.
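
For reference, the two properties named in the theorem have the following standard definitions. The ground set $V$, the subsets $A$ and $B$, and the element $v$ are notation assumed here, not taken from this page; only $f$ comes from the theorem statement.

```latex
% Standard definitions for a set function f : 2^V -> R.
% V, A, B, and v are assumed notation for illustration.
\begin{align*}
\text{Monotone:} \quad & f(A) \le f(B)
  && \text{for all } A \subseteq B \subseteq V, \\
\text{Submodular:} \quad & f(A \cup \{v\}) - f(A) \ge f(B \cup \{v\}) - f(B)
  && \text{for all } A \subseteq B \subseteq V,\ v \in V \setminus B.
\end{align*}
```

The second inequality is the diminishing-returns property: adding an element to a smaller set helps at least as much as adding it to a larger one.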

Figures (3)

  • Figure 1: Block Diagram of PostDoc
  • Figure 2: A sample poster generated by PostDoc for a research paper
  • Figure 3: A sample layout generated by this method ($N_I$ = 4, $N_T$ = 5)

Theorems & Definitions (2)

  • Theorem 2.1
  • proof