IEEE VIS 2024

LLM Comparator: Interactive Analysis of Side-by-Side Evaluation of Large Language Models

Minsuk Kahng - Google, Atlanta, United States

Ian Tenney - Google Research, Seattle, United States

Mahima Pushkarna - Google Research, Cambridge, United States

Michael Xieyang Liu - Google Research, Pittsburgh, United States

James Wexler - Google Research, Cambridge, United States

Emily Reif - Google, Cambridge, United States

Krystal Kallarackal - Google Research, Mountain View, United States

Minsuk Chang - Google Research, Seattle, United States

Michael Terry - Google, Cambridge, United States

Lucas Dixon - Google, Paris, France

Room: Bayshore V

Session time: 2024-10-18T12:42:00Z
Exemplar figure: LLM Comparator is a visual analytics tool consisting of multiple views: an interactive table that displays individual prompts and model responses, and a visualization summary comprising multiple panels, including score distribution, metrics by prompt category, rationale clusters, n-grams, and custom functions.
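To make the figure description concrete, the sketch below shows the kind of per-example record that a side-by-side table and its summary panels could be driven by. The field names are hypothetical illustrations, not the tool's actual data schema.

```python
# Hypothetical sketch (not LLM Comparator's actual schema): one record per
# prompt, holding both models' responses plus the judge's verdict, which an
# interactive table and slice-based summary panels could visualize.
from dataclasses import dataclass, field


@dataclass
class SideBySideExample:
    prompt: str                 # input text sent to both models
    category: str               # prompt category, used for slicing metrics
    response_a: str             # response from model A
    response_b: str             # response from model B
    score: float                # judge score; sign indicates which model is preferred
    rationales: list[str] = field(default_factory=list)  # judge explanations, clusterable into themes
```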
Keywords

Visual analytics, large language models, model evaluation, responsible AI, machine learning interpretability.

Abstract

Evaluating large language models (LLMs) presents unique challenges. While automatic side-by-side evaluation, also known as LLM-as-a-judge, has become a promising solution, model developers and researchers face difficulties with scalability and interpretability when analyzing these evaluation outcomes. To address these challenges, we introduce LLM Comparator, a new visual analytics tool designed for side-by-side evaluations of LLMs. This tool provides analytical workflows that help users understand when and why one LLM outperforms or underperforms another, and how their responses differ. Through close collaboration with practitioners developing LLMs at Google, we have iteratively designed, developed, and refined the tool. Qualitative feedback from these users highlights that the tool facilitates in-depth analysis of individual examples while enabling users to visually overview and flexibly slice data. This empowers users to identify undesirable patterns, formulate hypotheses about model behavior, and gain insights for model improvement. LLM Comparator has been integrated into Google's LLM evaluation platforms and open-sourced.
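To illustrate the automatic side-by-side (LLM-as-a-judge) workflow the abstract refers to, here is a minimal sketch that scores each example with a hypothetical `judge` callable and aggregates results by prompt category. It shows the general idea only; it is not LLM Comparator's implementation, and the judge here is a stand-in.

```python
# Minimal sketch of automatic side-by-side evaluation (LLM-as-a-judge),
# assuming a hypothetical `judge(prompt, response_a, response_b)` callable
# that returns a score in [-1, 1], where negative favors model A and
# positive favors model B. Illustration only, not the tool's actual API.
from collections import defaultdict
from statistics import mean
from typing import Callable, Iterable


def evaluate_side_by_side(
    examples: Iterable[dict],
    judge: Callable[[str, str, str], float],
) -> dict[str, float]:
    """Returns the mean judge score per prompt category."""
    scores_by_category: dict[str, list[float]] = defaultdict(list)
    for ex in examples:
        score = judge(ex["prompt"], ex["response_a"], ex["response_b"])
        scores_by_category[ex["category"]].append(score)
    return {cat: mean(scores) for cat, scores in scores_by_category.items()}


if __name__ == "__main__":
    # Toy stand-in judge that simply prefers the longer response.
    toy_judge = lambda p, a, b: 1.0 if len(b) > len(a) else -1.0
    data = [
        {"prompt": "Summarize the article.", "category": "summarization",
         "response_a": "Short.", "response_b": "A longer, more detailed summary."},
    ]
    print(evaluate_side_by_side(data, toy_judge))
```

In practice, a tool like the one described would let users drill from such per-category aggregates back down to the individual examples and judge rationales behind them.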