Explainability Perspectives on a Vision Transformer: From Global Architecture to Single Neuron
Anne Marx - ETH Zurich, Zürich, Switzerland
Yumi Kim - ETH Zurich, Zürich, Switzerland
Luca Sichi - ETH Zürich, Zürich, Switzerland
Diego Arapovic - ETH Zürich, Zürich, Switzerland
Javier Sanguino Bautiste - ETH Zürich, Zürich, Switzerland
Rita Sevastjanova - ETH Zürich, Zürich, Switzerland
Mennatallah El-Assady - ETH Zürich, Zürich, Switzerland
Room: Bayshore I
2024-10-13T12:30:00Z
Abstract
Transformers, initially designed for Natural Language Processing, have emerged as a strong alternative to Convolutional Neural Networks in Computer Vision. However, their interpretability remains challenging. We overcome the limitations of earlier studies by offering interactive components that engage the user in exploring the Vision Transformer (ViT). Furthermore, we offer several complementary explainability methods so that the insights they provide can be challenged.

Key contributions include:
- Interactive analysis of the ViT architecture and explainability methods.
- Identifying critical information from input images used for classification.
- Investigating neuron activations at various depths to understand learned features.
- Introducing an innovative adaptation of activation maximization for attention scores to trace attention head focus across network layers.
- Highlighting the limitations of each method through occlusion-based interaction.

We find that ViTs tend to generalize well by relying on a broad set of object features and contexts seen in the input image. Furthermore, the focus of neurons and attention heads shifts to more complex patterns at deeper layers. We also acknowledge that no single explainability method suffices to understand the decision-making process of transformers. Our blog post offers readers an engaging and multifaceted interpretation of the ViT by combining interactivity with key research questions.