IEEE VIS 2024 Content: An Empirical Evaluation of the GPT-4 Multimodal Language Model on Visualization Literacy Tasks

Alexander Bendeck - Georgia Institute of Technology, Atlanta, United States

 John Stasko - Georgia Institute of Technology, Atlanta, United States

 Room: Bayshore I + II + III

2024-10-18T13:18:00Z

Exemplar figure caption: Large vision-language models like GPT-4V are extremely powerful, but we have little understanding of their visualization literacy capabilities. We conduct an empirical evaluation of the GPT-4V model on four tasks from the visualization literature related to visualization literacy: (1) the Visualization Literacy Assessment Test (VLAT); (2) a chart question answering dataset; (3) a set of questions about deceptive visualization design choices; and (4) a set of questions about visualizations with misaligned titles. We also release all materials and code to support future research.

Keywords

Visualization Literacy, Large Language Models, Natural Language

Abstract

Large Language Models (LLMs) such as GPT-4 that support multimodal input (i.e., prompts containing images in addition to text) have immense potential to advance visualization research. However, many questions remain about the visual capabilities of such models, including how well they can read and interpret visually represented data. In our work, we address this question by evaluating the GPT-4 multimodal LLM using a suite of task sets meant to assess the model's visualization literacy. The task sets are based on existing work in the visualization community addressing both automated chart question answering and human visualization literacy across multiple settings. Our assessment finds that GPT-4 can perform tasks such as recognizing trends and extreme values, and also demonstrates some understanding of visualization design best practices. By contrast, GPT-4 struggles with simple value retrieval when not provided with the original dataset, lacks the ability to reliably distinguish between colors in charts, and occasionally suffers from hallucination and inconsistency. We conclude by reflecting on the model's strengths and weaknesses as well as the potential utility of models like GPT-4 for future visualization research. We also release all code, stimuli, and results for the task sets at the following link: https://doi.org/10.17605/OSF.IO/F39J6