Milestone: Google unveils TPU v4

The Google I/O developer conference was cancelled last year due to the pandemic, but it made a strong return this year in an online format. From a Google campus empty of developers, Google CEO Sundar Pichai announced a series of new technologies, including Project Starline, a striking holographic video-chat technology that gives users a sense of “spatial teleportation,” and TPU v4, Google’s latest generation of AI chips.

“This is the fastest system we’ve ever deployed at Google, and it’s a historic milestone for us,” Pichai said.

The most powerful TPU yet: over 2x the performance and 10x the interconnect bandwidth

According to Google, at the same 64-chip scale and before counting any software-driven gains, TPU v4 delivers an average performance improvement of 2.7x over the previous-generation TPU v3.

In real-world deployments, TPU v4 chips are mainly used in Pods, with 4,096 TPU v4 chips per TPU v4 Pod. Thanks to Google’s unique interconnect technology, which can fuse thousands of independent processors into a single system with roughly 10x the interconnect bandwidth of any other networking technology at this scale, each TPU v4 Pod can reach 1 exaFLOP of compute, that is, 10^18 floating-point operations per second. That is roughly twice the performance of the world’s fastest supercomputer, Fugaku.

“If 10 million people were using their laptops at the same time, all of that computing power added together would reach just about 1 exaFLOP. Before, to reach 1 exaFLOP you might have needed to build a custom supercomputer,” Pichai said.
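Those round numbers are easy to sanity-check. A back-of-the-envelope calculation, using only the figures quoted above, implies about 244 TFLOPS per TPU v4 chip and about 100 GFLOPS per laptop, both plausible orders of magnitude:

```python
# Back-of-the-envelope check of the exaFLOP figures quoted above.
# Uses only the round numbers from the article; sustained real-world
# throughput will differ.

POD_FLOPS = 1e18        # 1 exaFLOP per TPU v4 Pod, per Google
CHIPS_PER_POD = 4096    # TPU v4 chips per Pod

per_chip_tflops = POD_FLOPS / CHIPS_PER_POD / 1e12
print(f"Implied per-chip peak: {per_chip_tflops:.0f} TFLOPS")      # ~244 TFLOPS

LAPTOPS = 10_000_000    # Pichai's 10 million laptops
per_laptop_gflops = POD_FLOPS / LAPTOPS / 1e9
print(f"Implied per-laptop rate: {per_laptop_gflops:.0f} GFLOPS")  # ~100 GFLOPS
```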

This year’s MLPerf results show that TPU v4’s power should not be underestimated. In the image-classification training test on the ImageNet dataset (to at least 75.90% accuracy), 256 TPU v4 chips finished the task in 1.82 minutes, nearly matching the times posted by 768 Nvidia A100 GPUs paired with 192 AMD Epyc 7742 cores (1.06 minutes) and by 512 of Huawei’s AI-optimized Ascend910 chips paired with 128 Intel Xeon Platinum 8168 cores (1.56 minutes).
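One way to read those results is in “chip-minutes,” accelerator count multiplied by wall-clock time, which roughly captures how much silicon-time each submission consumed. The sketch below uses only the accelerator counts and times quoted above and ignores the host CPUs, so it is an illustration rather than a fair benchmark:

```python
# Rough "chip-minutes" view of the ImageNet results above: accelerator
# count times wall-clock minutes. Host CPUs and per-chip differences
# are ignored, so this is an illustration, not a fair comparison.

results = {
    "256x Google TPU v4":     (256, 1.82),
    "768x Nvidia A100":       (768, 1.06),
    "512x Huawei Ascend910":  (512, 1.56),
}

for system, (chips, minutes) in results.items():
    print(f"{system:<24} {chips * minutes:6.0f} chip-minutes")
```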

TPU v4 also scored well when tasked with training the Transformer-based BERT reading-comprehension model on a large Wikipedia corpus. Training with 256 TPU v4 chips took 1.82 minutes, more than a minute slower than the 0.39 minutes achieved with 4,096 TPU v3 chips. Meanwhile, reaching a training time of 0.81 minutes on Nvidia hardware required 2,048 A100 cards and 512 AMD Epyc 7742 CPU cores.
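The BERT numbers also show how sublinear strong scaling is at this size: 16x the chips buys only about a 4.7x speedup. A minimal check, using just the two data points above (and noting that the article attributes the larger run to TPU v3, so the chips are not identical):

```python
# Strong-scaling check for the BERT results above: 4,096 chips is 16x
# the 256-chip run, but the speedup is far smaller. Uses only the two
# times quoted in the article (and the chip generations differ).

small_chips, small_minutes = 256, 1.82
large_chips, large_minutes = 4096, 0.39

speedup = small_minutes / large_minutes    # ~4.7x faster
scale = large_chips / small_chips          # 16x the chips
efficiency = speedup / scale               # ~29% scaling efficiency

print(f"{speedup:.1f}x speedup from {scale:.0f}x the chips "
      f"({efficiency:.0%} efficiency)")
```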

At the I/O conference, Google also showed concrete AI systems that can take advantage of TPU v4, including MUM (Multitask Unified Model), which can process web pages, images, and other data simultaneously, and LaMDA, which is built for dialogue. The former is said to be 1,000 times more powerful than the reading-comprehension model BERT and is well suited to powering search engines that help users find the information they want more efficiently, while the latter can carry on free-flowing conversation with humans.

TPU v4 is not for sale; it will soon be deployed in Google’s data centers, and about 90% of TPU v4 Pods will run on green energy. Google also said TPU v4 will be made available to Google Cloud customers later this year.
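For context on what “available to Google Cloud customers” tends to look like in practice, here is a minimal, hypothetical sketch of how a Cloud TPU user might confirm that TPU devices are visible and run a computation on them with JAX, Google’s TPU-native Python framework. This is an assumption about typical usage, not anything announced at I/O:

```python
# Hypothetical sketch: checking for TPU devices and running a matmul
# from a Cloud TPU VM with JAX. Assumes JAX is installed with TPU
# support and the VM is attached to a TPU runtime; illustrative only.
import jax
import jax.numpy as jnp

print(jax.devices())    # e.g. a list of TpuDevice entries on a TPU host

x = jnp.ones((1024, 1024))
y = jnp.dot(x, x)       # dispatched to the TPU when one is available
print(y.shape, float(y[0, 0]))   # (1024, 1024) 1024.0
```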

Google developed its own TPU, iterating through four generations in five years

Google announced its first in-house custom AI chip in 2016, setting it apart from the CPU-plus-GPU combination that was then the most common architecture for training and deploying AI models, and demonstrating that GPUs are not the only option for training and inference.

Google’s first-generation TPU was built on a 28 nm process, consumed about 40 W, and was suitable only for deep-learning inference. It powered machine-learning models behind Google Search, Translate, and other products, as well as AlphaGo.

In May 2017, Google released TPU v2, which handles both training and inference, delivering 180 TFLOPS of floating-point compute alongside improved memory bandwidth: 30x the performance of the CPUs of the day on AI workloads, and 15x that of contemporary GPUs. World Go champion Ke Jie, beaten by an AlphaGo running on just 4 TPU v2 chips, experienced that leap most directly.

In May 2018, Google released the third-generation TPU, with twice the performance of its predecessor: 420 TFLOPS of floating-point compute and 128 GB of high-bandwidth memory.

Going by its cadence of one iteration per year, Google should have launched a fourth-generation TPU in 2019. Instead, at that year’s I/O conference, it launched Pods built from the second- and third-generation TPUs, which can be configured with more than 1,000 TPU chips and greatly cut the time needed to train complex models.

In the history of AI chip development, the Google TPU stands out as a rare innovation in on-chip memory and programmability, breaking the GPU’s “monopoly” and opening up a new competitive landscape for AI chips in the cloud.

Where that landscape goes from here remains an open question, but the Google TPU has already given us a small part of the answer.