Banana100: Breaking NR-IQA Metrics by 100 Iterative Image Replications with Nano Banana Pro

Kenan Tang, Praveen Arunshankar, Andong Hua, Anthony Yang, Yao Qin

Abstract

The multi-step, iterative image editing capabilities of multi-modal agentic systems have transformed digital content creation. Although the latest image editing models faithfully follow instructions and generate high-quality images in single-turn edits, we identify a critical weakness in multi-turn editing: the iterative degradation of image quality. As images are repeatedly edited, minor artifacts accumulate, rapidly leading to severe visible noise and a failure to follow simple editing instructions. To systematically study these failures, we introduce Banana100, a comprehensive dataset of 28,000 degraded images generated through 100 iterative editing steps, covering diverse textures and image content. Alarmingly, image quality evaluators fail to detect the degradation. Among 21 popular no-reference image quality assessment (NR-IQA) metrics, none consistently assigns lower scores to heavily degraded images than to clean ones. These dual failures of generators and evaluators may threaten the stability of future model training and the safety of deployed agentic systems, if the low-quality synthetic data generated by multi-turn edits escapes quality filters. We release the full code and data to facilitate the development of more robust models, helping to mitigate the fragility of multi-modal agentic systems.
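To make the replication protocol concrete, the following is a minimal Python sketch of the iterative loop described above. The edit_image function is a hypothetical stand-in for a call to an image-editing model such as Nano Banana Pro; the paper's actual prompting and API details may differ, so here the placeholder simply copies the image to keep the loop runnable without API access.

from PIL import Image

def edit_image(image: Image.Image, instruction: str) -> Image.Image:
    # Placeholder: a real system would send `image` and `instruction`
    # to the editing model and return the generated image.
    return image.copy()

def iterative_replication(initial: Image.Image, steps: int = 100) -> list[Image.Image]:
    """Feed each output back as the next input, keeping every intermediate."""
    history = [initial]
    current = initial
    for _ in range(steps):
        current = edit_image(current, "Replicate this image exactly.")
        history.append(current)
    return history  # steps + 1 images, including the clean original

Because every intermediate image is retained, a single run over one initial image yields the full degradation trajectory that the NR-IQA metrics are later scored against.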


Paper Structure

This paper contains 25 sections, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: Iteratively replicating an image with Nano Banana Pro severely degrades it, yet BRISQUE assigns better (lower) scores to images of worse quality. BRISQUE is a no-reference image quality assessment (NR-IQA) metric that has been widely used to assess the quality of AI-generated images. This counter-intuitive failure is pervasive across diverse NR-IQA metrics and image textures.
  • Figure 2: The reasoning summary from Nano Banana Pro is organized into clear-cut generation and evaluation sections. The bold text shows section titles, copied verbatim from the reasoning summary returned by the Nano Banana Pro API. In this example, the first two sections are dedicated to image generation, whereas the last two are dedicated to the evaluation of a generated image.
  • Figure 3: A summary of the failure modes of instruction following, categorized into sub-object level (blue), object level (yellow), and image level (green). The images have been cropped and zoomed for visual clarity. As the failures were consistent across different runs and editing steps, we do not report the exact run index and step index for each image here. See the section "Analysis of Instruction Following Failures" for details.
  • Figure 4: We normalized NR-IQA scores (BRISQUE as an example) and calculated the difference across steps to quantify the score trend. See the section "NR-IQA Methods Fail to Quantify Degradation" for details. The three $\Delta$ values can also be found at the intersection of the second row (Dongpo) and the second column in each heatmap of the IQA results for Nano Banana Pro (Figure 5).
  • Figure 5: All 21 NR-IQA metrics fail to identify the degradation, assigning higher normalized scores to images of worse quality. The heatmap shows the gap between the normalized score of an image from a later editing step (5, 10, or 20) and that of the image from the first step. The normalization maps each NR-IQA metric to a common [0, 100] scale, with higher scores corresponding to better image quality. Positive gaps indicate failures and are marked in blue in the heatmap. Because the 12 initial images span diverse texture types, each NR-IQA metric fails on a different subset of initial images. A sketch of this normalization and gap computation appears after this list.
  • ...and 5 more figures
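To make the score-normalization and gap computation behind Figures 4 and 5 concrete, below is a minimal Python sketch. The min-max normalization over the observed scores and the synthetic BRISQUE-style numbers are assumptions for illustration; the paper's exact normalization may differ, but the convention matches the caption: every metric is mapped to [0, 100] with higher scores meaning better quality, and a positive gap between a later editing step and the first step signals a failure.

def normalize(scores: list[float], lower_is_better: bool) -> list[float]:
    """Min-max scale raw metric scores to [0, 100], higher = better quality."""
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard against a constant score sequence
    norm = [100.0 * (s - lo) / span for s in scores]
    if lower_is_better:       # e.g., BRISQUE: lower raw score = better quality
        norm = [100.0 - n for n in norm]
    return norm

def score_gap(scores_by_step: list[float], k: int, lower_is_better: bool) -> float:
    """Gap between the normalized score at editing step k and step 1."""
    norm = normalize(scores_by_step, lower_is_better)
    return norm[k - 1] - norm[0]  # > 0: the metric prefers the degraded image

# Example with made-up BRISQUE-style raw scores over 20 editing steps,
# drifting downward (i.e., BRISQUE "improving") as quality degrades:
raw = [18.0, 19.5, 17.2, 16.8, 15.9, 16.1, 15.0, 14.7, 14.2, 13.8,
       13.5, 13.1, 12.9, 12.4, 12.0, 11.8, 11.5, 11.2, 10.9, 10.5]
for k in (5, 10, 20):
    print(k, round(score_gap(raw, k, lower_is_better=True), 1))

On this synthetic trajectory, all three gaps are positive, reproducing the blue (failure) cells that the heatmaps in Figure 5 report for the real metric scores.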