Science-T2I: Addressing Scientific Illusions in Image Synthesis

Jialuo Li, Wenhao Chai, Xingyu Fu, Haiyang Xu, Saining Xie

TL;DR

SciScore is an end-to-end reward model that refines the assessment of generated images based on scientific knowledge by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Applying the proposed fine-tuning method to FLUX yields a performance enhancement exceeding 50% on SciScore.

Abstract

We present a novel approach to integrating scientific knowledge into generative models, enhancing their realism and consistency in image synthesis. First, we introduce Science-T2I, an expert-annotated adversarial dataset comprising 20k adversarial image pairs with 9k prompts, covering a wide range of distinct scientific knowledge categories. Leveraging Science-T2I, we present SciScore, an end-to-end reward model that refines the assessment of generated images based on scientific knowledge by augmenting both the scientific comprehension and visual capabilities of a pre-trained CLIP model. Additionally, based on SciScore, we propose a two-stage training framework, comprising a supervised fine-tuning phase and a masked online fine-tuning phase, to incorporate scientific knowledge into existing generative models. Through comprehensive experiments, we demonstrate the effectiveness of our framework in establishing new standards for evaluating the scientific realism of generated content. Specifically, SciScore attains performance comparable to the human level, demonstrating a 5% improvement, similar to evaluations conducted by experienced human evaluators. Furthermore, by applying our proposed fine-tuning method to FLUX, we achieve a performance enhancement exceeding 50% on SciScore.
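As a rough illustration of how a CLIP-style reward model can express a preference between two candidate images, the sketch below scores each image by its cosine similarity to the prompt embedding and converts the two scores into probabilities with a softmax. This is an assumption about the general form of such a score, not the paper's implementation: the function names, the `temperature` value, and the toy embeddings are all hypothetical, and real embeddings would come from the fine-tuned CLIP encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def preference_probabilities(text_emb, image_emb_a, image_emb_b, temperature=0.07):
    """Softmax over temperature-scaled text-image similarities (hypothetical form).

    Returns (p_a, p_b), the probability that image A or image B matches
    the prompt better. The temperature is an illustrative assumption.
    """
    score_a = cosine_similarity(text_emb, image_emb_a) / temperature
    score_b = cosine_similarity(text_emb, image_emb_b) / temperature
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(score_a, score_b)
    exp_a = math.exp(score_a - m)
    exp_b = math.exp(score_b - m)
    total = exp_a + exp_b
    return exp_a / total, exp_b / total


# Toy embeddings: image A is aligned with the prompt, image B is orthogonal to it.
p_a, p_b = preference_probabilities([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In this toy case the aligned image receives nearly all of the probability mass, which is the behavior a preference-based reward needs in order to rank a scientifically plausible image above an implausible one.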

Paper Structure

This paper contains 90 sections, 37 equations, 21 figures, 15 tables.

Figures (21)

  • Figure 1: Comparison between GPT-4o and SciScore. Given a prompt (in grey) requiring scientific knowledge, the FLUX model generates imaginary images (lower row) that are far from reality (upper row). Moreover, LMMs like GPT-4o fail to identify the realistic image, whereas our end-to-end reward model SciScore succeeds. Note that the prompts shown here are summaries of the real prompts, used for illustration purposes.
  • Figure 2: Data statistics. (Left) Science-T2I is organized into three primary scientific fields: Chemistry, Biology, and Physics. Each field is divided into specific categories, with the numbers indicating the volume of implicit prompts collected for each category. (Right) Word cloud of the structured prompts in Science-T2I.
  • Figure 3: Data curation pipeline. For each task, GPT-4o first generates structured templates that capture the scientific principles while allowing for variability in objects or substances. These templates are used to create implicit prompts, which GPT-4o then expands into explicit and superficial prompts, ultimately guiding the synthesis of corresponding explicit and superficial images.
  • Figure 4: Online fine-tuning pipeline. For each prompt, two images are generated to compute the SciScore preference metric. Simultaneously, GroundingDINO extracts segmentation masks from these images based on the prompts, which are then used to block gradient propagation in the corresponding spatial regions.
  • Figure 5: Performance of SciScore in ST and CT.
  • ...and 16 more figures
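
The gradient blocking described in the Figure 4 caption can be pictured with a minimal sketch: a per-pixel gradient map is zeroed wherever a binary mask marks a region as blocked. This only illustrates the general idea. The function name `mask_gradients` is hypothetical, and the convention that a truthy mask entry means "blocked" is an assumption, since the caption does not specify how the masks are encoded or applied inside the training loop.

```python
def mask_gradients(grad, blocked):
    """Zero out gradient entries wherever the mask marks a blocked region.

    grad:    2D list of per-pixel gradient values.
    blocked: 2D list of the same shape; a truthy entry means the gradient
             at that position is blocked (an assumed convention).
    """
    same_rows = len(grad) == len(blocked)
    if not same_rows or any(len(g) != len(b) for g, b in zip(grad, blocked)):
        raise ValueError("grad and blocked must have the same shape")
    return [
        [0.0 if is_blocked else value for value, is_blocked in zip(grad_row, mask_row)]
        for grad_row, mask_row in zip(grad, blocked)
    ]


# Toy 2x2 example: the diagonal positions are blocked.
toy_grad = [[1.0, 2.0], [3.0, 4.0]]
toy_mask = [[1, 0], [0, 1]]
masked = mask_gradients(toy_grad, toy_mask)
```

In an automatic-differentiation framework the same effect is usually obtained by multiplying the gradient (or the differentiable part of the image) by a mask tensor rather than by looping over pixels.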