Testing the Depth of ChatGPT's Comprehension via Cross-Modal Tasks Based on ASCII-Art: GPT3.5's Abilities in Regard to Recognizing and Generating ASCII-Art Are Not Totally Lacking

David Bayani

TL;DR

This work probes GPT3.5's cross-modal capabilities by using ASCII-art directly as input for recognition and generation tasks, without first converting the visuals into natural language. Through two recognition tracks (diagram-like ASCII-art and human-drawn depictions) and a suite of generation tasks (verbatim, translation, noise, size, rotation), the study demonstrates that GPT3.5 has nontrivial visual-spatial competencies, albeit with limitations and variability across tasks. The findings indicate partial invariance to transforms and partial semantic grounding of object parts, with performance often influenced by prompt wording and possibly by memorization. Overall, the results suggest GPT3.5 possesses unexpected, though not human-level, cross-modal abilities, highlighting both the promise and the constraints of text-only models in handling graphical content.

Abstract

Over the eight months since its release, ChatGPT and its underlying model, GPT3.5, have garnered massive attention due to their potent mix of capability and accessibility. While a niche industry of papers has emerged examining the scope of capabilities these models possess, the information fed to and extracted from these networks has been either natural language text or stylized, code-like language. Drawing inspiration from the prowess we expect a truly human-level intelligent agent to have across multiple signal modalities, in this work we examine GPT3.5's aptitude for visual tasks, where the inputs feature content provided as ASCII-art without overt distillation into a lingual summary. We conduct experiments analyzing the model's performance on image recognition tasks after various transforms typical in visual settings, trials investigating knowledge of image parts, and tasks covering image generation.
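
To make the kinds of transforms mentioned above concrete, the following is a minimal Python sketch of translation, rotation, and noise applied to a grid of ASCII characters. It is not the procedure used in the paper: the function names, the choice of noise characters, and the rule of adding noise only to blank cells are all assumptions made for illustration.

```python
import random


def pad(art):
    """Split ASCII-art into rows padded with spaces to a common width."""
    rows = art.split("\n")
    width = max(len(r) for r in rows)
    return [r.ljust(width) for r in rows]


def translate(art, dx, dy):
    """Shift the art right by dx columns and down by dy rows (both >= 0)."""
    rows = pad(art)
    width = len(rows[0]) + dx
    shifted = [" " * width] * dy + [" " * dx + r for r in rows]
    return "\n".join(shifted)


def rotate_cw(art):
    """Rotate the character grid 90 degrees clockwise.

    Only cell positions move; individual glyphs are not redrawn.
    """
    rows = pad(art)
    return "\n".join("".join(col) for col in zip(*rows[::-1]))


def add_noise(art, rate, seed=0, noise_chars=".,'`"):
    """Replace each blank cell with a noise character with probability `rate`.

    The noise alphabet and the blank-cells-only rule are hypothetical choices.
    """
    rng = random.Random(seed)
    out = []
    for row in pad(art):
        out.append("".join(
            rng.choice(noise_chars) if ch == " " and rng.random() < rate else ch
            for ch in row
        ))
    return "\n".join(out)
```

For instance, `rotate_cw("ab\ncd")` yields the two rows `ca` and `db`, and `translate("##\n##", 1, 1)` places the 2x2 block one column right and one row down inside a 3x3 grid. A recognition experiment in this spirit would compare a model's answers on the original and the transformed strings.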

Paper Structure (32 sections, 2 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 1: One of the examples we provide as part of the prompt to GPT3.5 in the experiments of \ref{subsec:exps.recog.HumanDrawings}. The labels starting with "EX_" help reduce any chance of ambiguity as to the role the information plays in the prompt. The number six in the tags "EX_CHOICE_FOR_6" and "EXPECTED_ANSWER_TO_6_FOR_EX" indicates the sub-question of the prompt that the choices and the answer, respectively, belong to. See the example prompt at \ref{apdx:fig.humanASCIIArtPrompt1}.
  • Figure 2: Prompts used in the ASCII-art generation experiments of \ref{sec.exp.gen.queries}. Bolded, bracketed text of larger size indicates either places where the preamble from \ref{fig:exp.gen.queries.preambleForSeveralPrompts} should be substituted in, or the place where ASCII-art for an instance of the query would be placed.
  • Figure 3: An example of interesting alterations to the reference image produced by GPT3.5 during our verbatim generation trials. Notice that the boxes C, L, n, m, and Z are one row longer in the results returned by the network than in the reference provided.
  • Figure 4: An example of a middle-grade result from the translation trials, this one leaning toward the better end of the quality spectrum.
  • Figure 5: An example of a middle-grade result from the translation trials, this one leaning toward the worse end of the quality spectrum.
  • ...and 11 more figures