Google's Tensor Processing Unit: Understanding a State-of-the-Art AI Accelerator

90 minutes
1 Speaker
Presentation

Domain

This tutorial falls at the intersection of computer architecture and artificial intelligence.

Prerequisites

No prerequisites. Open to all.

Key Words

Tensor Processing Unit (TPU), Computer Architecture, Artificial Intelligence, Hardware Accelerator, Systolic Array

Abstract

Computing systems have fueled the growth of artificial intelligence (AI). Improvements in AI algorithms have gone hand-in-hand with improvements in hardware accelerators. Our ability to train increasingly complex AI models and achieve low-power, real-time inference depends on the capabilities of computing systems. In recent years, the metrics used for optimizing and evaluating AI algorithms have diversified: along with accuracy, there is increasing emphasis on metrics such as energy efficiency and model size. Given this, researchers working on AI can no longer afford to ignore the computing system. Instead, knowledge of the potential and limitations of computing systems can provide invaluable guidance in designing the most efficient and accurate algorithms. This tutorial seeks to arouse curiosity about, and even an interest in, AI accelerators, using the example of one of the most popular commercial accelerators, viz., Google's TPU. We first present the basics, viz., the systolic array architecture for matrix multiplication. Then, we dive into the architecture of the TPU, its salient features, its comparison with CPU and GPU architectures, and evaluation results on AI workloads. We finally provide a view of the evolution of Google's TPU over six versions to see how the TPU has been transformed to address the needs of changing AI workloads. This tutorial sits at the intersection of deep learning algorithms, computer architecture, and chip design, and is thus expected to benefit a broad range of learners.
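To give a flavor of the systolic-array idea covered in the tutorial, below is a minimal Python sketch (not part of the tutorial materials) of an output-stationary systolic array computing C = A x B. Each grid cell accumulates one output element while operands stream through the array, skewed by one cycle per row and column so that matching operands meet at the right cell. The function name and loop structure are illustrative assumptions, and this is a behavioral simulation rather than a cycle-accurate hardware model.

```python
import numpy as np

def systolic_matmul(A, B):
    """Behavioral sketch of an output-stationary systolic array for C = A @ B.

    Cell (i, j) accumulates C[i, j]. Rows of A stream in from the left and
    columns of B stream in from the top, each skewed by one cycle per
    row/column so the operands for a given partial product arrive together.
    """
    n, k = A.shape
    k2, m = B.shape
    assert k == k2, "inner dimensions must match"
    C = np.zeros((n, m))
    # Enough cycles for the last skewed operands to reach cell (n-1, m-1).
    for t in range(k + n + m - 2):
        for i in range(n):
            for j in range(m):
                # The skew (i + j) models the wavefront propagating
                # diagonally through the grid of multiply-accumulate cells.
                s = t - i - j
                if 0 <= s < k:
                    C[i, j] += A[i, s] * B[s, j]
    return C

# Sanity check against numpy's matrix multiply.
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```

The appeal of this organization, and the reason the TPU builds on it, is that each cell only ever talks to its immediate neighbors: operands are reused as they flow across the grid, so no per-cycle fetches from a shared memory are needed.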
