IEEE VIS 2024 Content: Ferry: Toward Better Understanding of Input/Output Space for Data Wrangling Scripts

Zhongsu Luo - Zhejiang University, Hangzhou, China

Kai Xiong - Zhejiang University, Hangzhou, China

Jiajun Zhu - Zhejiang University, Hangzhou, Zhejiang, China

Ran Chen - Zhejiang University, Hangzhou, China

Xinhuan Shu - Newcastle University, Newcastle Upon Tyne, United Kingdom

Di Weng - Zhejiang University, Ningbo, China

Yingcai Wu - Zhejiang University, Hangzhou, China

Room: Bayshore V

2024-10-16T17:57:00Z
Exemplar figure, described by caption below
The user interface of Ferry. Ferry is an interactive system that uses a constraint-based approach to help data workers understand the input/output space of data wrangling scripts. It conveys this space through constraint icons and constraint tags, combined with sample data. Ferry also detects conflicts between requirements and scripts, which supports efficient script reuse and debugging.
Keywords

Data wrangling, Visual analytics, Constraints, Program understanding

Abstract

Understanding the input and output of data wrangling scripts is crucial for tasks such as debugging code and onboarding new data. However, existing research on script understanding focuses primarily on revealing the process of data transformations and cannot analyze their potential scope, i.e., the space of script inputs and outputs. Constructing the input/output space during script analysis is also challenging, because wrangling scripts can be semantically complex and diverse, and the associations between data objects are intricate. To help data workers understand the input and output space of wrangling scripts, we summarize ten types of constraints that express table space and build a mapping between data transformations and these constraints to guide the construction of the input/output space for individual transformations. We then propose a constraint generation model that integrates table constraints across multiple transformations. Based on this model, we develop Ferry, an interactive system that extracts and visualizes the data constraints describing the input and output space of data wrangling scripts, enabling users to grasp the high-level semantics of complex scripts and locate the origins of faulty data transformations. In addition, Ferry provides example input and output data to help users interpret the extracted constraints and to check and resolve conflicts between these constraints and any uploaded dataset. Ferry’s effectiveness and usability are evaluated through two usage scenarios and two case studies, which cover understanding, debugging, and checking both single and multiple scripts, with and without executable data. Furthermore, an illustrative application is presented to demonstrate Ferry’s flexibility.
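
To make the idea of checking an uploaded dataset against input-space constraints more concrete, here is a minimal sketch in Python. It is not Ferry's implementation and does not reproduce any of the ten constraint types from the paper: the `Constraint` class, the `find_conflicts` function, the "must be numeric" rule, and the sample rows are all hypothetical, chosen only to illustrate what a conflict between a constraint and a dataset might look like.

```python
from dataclasses import dataclass
from typing import Any, Callable

Row = dict[str, Any]


@dataclass(frozen=True)
class Constraint:
    """Hypothetical per-column table constraint (not one of Ferry's actual types)."""
    column: str
    description: str
    holds: Callable[[Any], bool]


def find_conflicts(rows: list[Row], constraints: list[Constraint]) -> list[tuple[int, str]]:
    """Return (row index, message) pairs for every row that violates a constraint."""
    conflicts = []
    for i, row in enumerate(rows):
        for c in constraints:
            if c.column not in row:
                conflicts.append((i, f"missing column '{c.column}'"))
            elif not c.holds(row[c.column]):
                conflicts.append((i, f"{c.column}: {c.description}"))
    return conflicts


def is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


# Assumed for illustration: a script step that filters on a numeric "age"
# column implies that the input table has a numeric "age" column.
input_constraints = [Constraint("age", "must be numeric", is_number)]

# Hypothetical uploaded dataset with one bad value and one missing column.
uploaded = [
    {"name": "Ann", "age": 34},
    {"name": "Bo", "age": "n/a"},
    {"name": "Cy"},
]

conflicts = find_conflicts(uploaded, input_constraints)
for index, message in conflicts:
    print(f"row {index}: {message}")
```

In this sketch the first row satisfies the constraint, while the second and third rows are reported as conflicts. A real system would derive such constraints from the script itself rather than have them written by hand.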