FT-ARM: Fine-Tuned Agentic Reflection Multimodal Language Model for Pressure Ulcer Severity Classification with Reasoning

Reza Saadati Fard, Emmanuel Agu, Palawat Busaranuvong, Deepak Kumar, Shefalika Gautam, Bengisu Tulu, Diane Strong, Lorraine Loretz

TL;DR

This work tackles the challenge of accurately staging pressure ulcers (Stages I–IV) from images while ensuring interpretability. It introduces FT-ARM, a fine-tuned multimodal large language model with an agentic reflection mechanism that iteratively reasons over visual cues and clinical knowledge to refine predictions, and uses LoRA for efficient domain adaptation. On the PIID benchmark, FT-ARM achieves 85.2% accuracy and 0.85 F1, surpassing strong CNN, ViT, and prompting-based MLLM baselines, and it provides clinically grounded natural-language rationales. The study demonstrates the value of combining targeted fine-tuning with self-reflective reasoning for reliable, interpretable AI in wound care, under live-inference conditions that better reflect real-world deployment and clinical trust.

Abstract

Pressure ulcers (PUs) are a serious and prevalent healthcare concern. Accurate classification of PU severity (Stages I–IV) is essential for proper treatment but remains challenging due to subtle visual distinctions and subjective interpretation, leading to variability among clinicians. Prior AI-based approaches using Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) achieved promising accuracy but offered limited interpretability. We present FT-ARM (Fine-Tuned Agentic Reflection Multimodal model), a fine-tuned multimodal large language model (MLLM) with an agentic self-reflection mechanism for pressure ulcer severity classification. Inspired by clinician-style diagnostic reassessment, FT-ARM iteratively refines its predictions by reasoning over visual features and encoded clinical knowledge from text, enhancing both accuracy and consistency. On the publicly available Pressure Injury Image Dataset (PIID), FT-ARM, fine-tuned from LLaMA 3.2 90B, achieved 85% accuracy in classifying PU Stages I–IV, surpassing prior CNN-based models by 4%. Unlike earlier CNN/ViT studies that relied solely on offline evaluations, FT-ARM is designed and tested for live inference, reflecting real-time deployment conditions. Furthermore, it produces clinically grounded natural-language explanations, improving interpretability and trust. By integrating fine-tuning and reflective reasoning across multimodal inputs, FT-ARM advances the reliability, transparency, and clinical applicability of automated wound assessment systems, addressing the critical need for consistent and explainable PU staging to support improved patient care.

Paper Structure

This paper contains 27 sections, 5 equations, 18 figures, 9 tables, 1 algorithm.

Figures (18)

  • Figure 1: Visualization of the anatomy of pressure ulcer Stages I–IV, as defined by the National Pressure Injury Advisory Panel (NPIAP) [edsberg2016revised]. Stage I involves skin erythema without tissue loss; Stage II presents partial-thickness skin loss with exposure of the dermis; Stage III shows full-thickness tissue loss extending into subcutaneous fat; and Stage IV indicates extensive damage reaching muscle or bone [barghouthi2023systematic].
  • Figure 2: Envisioned usage scenario for FT-ARM. A nurse captures a wound photo and optional clinical notes using a smartphone app (a), which sends them to FT-ARM running in the cloud (b–c), and receives a predicted pressure ulcer stage with decision rationale and explanations (d).
  • Figure 3: Overview of a typical Multimodal LLM (MLLM) architecture [yin2024survey]. The left side shows a standard processing pipeline in which text and image inputs are embedded via a tokenizer and modality encoder, respectively. A connector module then aligns the image embeddings with text tokens before feeding them into a unified LLM for response generation. The right side illustrates two common connector types: (a) Projection-based connectors (e.g., MLPs), which transform visual embeddings into token space; and (b) Fusion-based connectors, which integrate image features directly within the LLM via multi-head attention.
  • Figure 4: Example of FT-ARM input and output structure. The input consists of a wound image, a task-specific prompt, and an optional caregiver note. While caregiver notes are supported by the system, they were not used during fine-tuning or evaluation in this study. FT-ARM generates a structured output that includes both the predicted PU stage and corresponding explanatory rationale. This example illustrates a Stage III prediction from visual and contextual features.
  • Figure 5: FT-ARM architecture for PU staging. A wound image and optional clinical note are fed into a fine-tuned Generator LLM, which produces an initial stage prediction and a corresponding explanatory rationale. This output is then reviewed by a Critique LLM that provides feedback through a self-reflection loop, enabling the system to revise its answer. The final output includes both a pressure ulcer stage classification and an interpretable rationale. This iterative structure enhances both predictive reliability and clinical transparency.
  • ...and 13 more figures
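
The generator and critique loop described in the Figure 5 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Draft`, `Critique`, `reflect_and_stage`, and `max_rounds`, the accept/feedback protocol, and the stopping rule are all hypothetical, and the two toy functions merely stand in for the fine-tuned Generator LLM and the Critique LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Draft:
    stage: str       # predicted pressure ulcer stage, e.g. "III"
    rationale: str   # natural-language explanation for the prediction


@dataclass
class Critique:
    accept: bool     # True if the critic has no objection to the draft
    feedback: str    # text handed back to the generator when not accepted


# Hypothetical interfaces: (image, note, feedback) -> Draft, (image, draft) -> Critique
Generator = Callable[[bytes, Optional[str], Optional[str]], Draft]
Critic = Callable[[bytes, Draft], Critique]


def reflect_and_stage(image: bytes, generate: Generator, critique: Critic,
                      note: Optional[str] = None, max_rounds: int = 3) -> Draft:
    """Draft a prediction, then revise it until the critic accepts or rounds run out."""
    draft = generate(image, note, None)          # initial prediction, no feedback yet
    for _ in range(max_rounds):                  # assumed cap on reflection rounds
        review = critique(image, draft)
        if review.accept:
            break
        draft = generate(image, note, review.feedback)
    return draft


# Toy stand-ins so the sketch runs without any model; they do not inspect the image.
def toy_generator(image: bytes, note: Optional[str], feedback: Optional[str]) -> Draft:
    if feedback is None:
        return Draft("II", "Partial-thickness skin loss is visible.")
    return Draft("III", "Revised after feedback: " + feedback)


def toy_critic(image: bytes, draft: Draft) -> Critique:
    if draft.stage == "III":
        return Critique(True, "")
    return Critique(False, "Subcutaneous fat appears exposed; reconsider the stage.")


final = reflect_and_stage(b"placeholder-image-bytes", toy_generator, toy_critic)
print(final.stage, "-", final.rationale)
```

Passing the generator and critic as plain callables keeps the loop independent of any particular model or serving API; in a real system each would wrap a model call, and the critic's acceptance criterion would need to be defined by the method itself.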