
A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)

Daniel Sonntag, Michael Barz, Thiago Gouvêa

TL;DR

The paper outlines the No-IDLE initiative, a prototype for interactive deep learning that foregrounds human-in-the-loop design, multimodal interaction, and explainability to extend DL reach to non-experts. It centers on an interactive photo book use case to study combined NLP, MMI, ML, and HCI components, including gaze-driven feedback, entity-aware captioning, and mixed-initiative learning. Key contributions include a detailed blueprint for an end-to-end interactive DL system, methodologies for incremental model updates via explanatory feedback, and a VR-enabled evaluation plan to measure usability, learning efficiency, and user experience. The work aims to create a scalable testbed that informs broader AI deployment in domains like healthcare and manufacturing, with future directions involving integration with large language models such as ChatGPT.

Abstract

This DFKI technical report presents the anatomy of the No-IDLE prototype system (funded by the German Federal Ministry of Education and Research), which not only provides basic and fundamental research in interactive machine learning, but also reveals deeper insights into users' behaviours, needs, and goals. Machine learning and deep learning should become accessible to millions of end users. No-IDLE's goals and scientific challenges centre around the desire to increase the reach of interactive deep learning solutions for non-experts in machine learning. One of the key innovations described in this technical report is a methodology for interactive machine learning combined with multimodal interaction, which will become central when we start interacting with semi-intelligent machines in the upcoming era of neural networks and large language models.

Paper Structure

This paper contains 11 sections, 8 figures, and 1 table.

Figures (8)

  • Figure 1: We plan to combine several modules based on deep learning models to create photo book pages from natural language input. These modules include, for instance, image retrieval, image captioning, and person recognition.
  • Figure 2: The user can provide multimodal feedback to the photo book tool to alter the created content. For instance, we plan to jointly interpret the user's gaze signal and spoken utterances to improve person recognition. An example is shown in Figure 3.
  • Figure 3: Example of a multimodal user input to our photo book application (based on an existing demo setup). The user provides corrective feedback in natural language by saying "This is Sarah, not Mary". The system uses the user's gaze to resolve the face that was referred to and uses the new information to update the underlying deep learning models, as depicted in Figure 2 (a minimal sketch of this corrective-feedback loop follows the figure list below).
  • Figure 4: Visualisation of the virtual reality scenario. Images and the photo book are presented in an immersive virtual environment. Through multimodal interaction (pointing, eye/gaze tracking, natural speech), the user engages with the system and provides corrective feedback by saying "This is Sarah, not Mary". The system uses implicit and explicit pointing or gaze to resolve the face that was referred to and uses the new information to update the underlying deep learning models, as depicted in Figure 2. In addition to the multimodal setup depicted in Figure 3, VR tracking provides detailed spatial tracking information that will be included in the data analysis.
  • Figure 5: High-level overview of our proposed method in the workflow of the Ophthalmo-AI project (BMBF). Given a retinal image, our models will generate three types of predictions (DR grade, lesion region, visual explanation) simultaneously. Ophthalmologists can observe the predictions and provide feedback for model fine-tuning.
  • ...and 3 more figures
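
Figures 2-4 describe a corrective-feedback loop in which gaze resolves which face the user referred to, a spoken utterance supplies the correct label, and the underlying model is updated incrementally rather than retrained from scratch. The following is a minimal, hypothetical sketch of that loop: PersonGallery, apply_correction, the toy utterance parser, and the nearest-centroid recogniser are illustrative assumptions, not the No-IDLE implementation, which operates on deep face embeddings and full multimodal fusion.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class PersonGallery:
    """Nearest-centroid gallery of face embeddings, one centroid per person."""
    centroids: dict = field(default_factory=dict)  # name -> (mean_embedding, count)

    def identify(self, embedding: np.ndarray) -> str:
        """Return the closest known person by cosine similarity."""
        best_name, best_sim = "unknown", -1.0
        for name, (mean, _) in self.centroids.items():
            sim = float(embedding @ mean) / (np.linalg.norm(embedding) * np.linalg.norm(mean))
            if sim > best_sim:
                best_name, best_sim = name, sim
        return best_name

    def update(self, name: str, embedding: np.ndarray) -> None:
        """Incremental (running-mean) update instead of full retraining."""
        mean, count = self.centroids.get(name, (np.zeros_like(embedding), 0))
        self.centroids[name] = ((mean * count + embedding) / (count + 1), count + 1)


def apply_correction(gallery, face_embeddings, gazed_face_id, utterance):
    """Fuse gaze and speech: 'This is Sarah, not Mary' relabels the gazed face."""
    # Toy parser for the corrective pattern "This is <new name>, not <old name>".
    tokens = utterance.replace(",", "").split()
    new_name = tokens[tokens.index("is") + 1]
    # The gaze tracker tells us which detected face the utterance refers to.
    gallery.update(new_name, face_embeddings[gazed_face_id])


# Usage: the system misidentifies the gazed face, the user corrects it verbally.
gallery = PersonGallery()
gallery.update("Mary", np.array([1.0, 0.0]))        # prior knowledge
faces = {"face_07": np.array([0.9, 0.1])}           # embeddings from a face detector
print(gallery.identify(faces["face_07"]))            # -> "Mary" (initial, wrong)
apply_correction(gallery, faces, "face_07", "This is Sarah, not Mary")
print(gallery.identify(faces["face_07"]))            # -> "Sarah" (after the update)
```

The point of the sketch is the update strategy: a running mean per person means a single correction takes effect immediately without a gradient step. A real interactive deep learning system could instead fine-tune the embedding network or a classifier head on the accumulated corrections, as the report's incremental-update methodology suggests.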