Elon Musk played to the crowd at AI Day with a focus on the humanoid robot Optimus. But while Optimus could have a huge impact on our lives and society if it ever reaches mass production at Musk’s proposed $20,000 price, another part of the presentation will have more immediate results: the status report on the Dojo supercomputer. It could change the world much faster than a bipedal bot.
The first thing to point out is that Tesla is a software company that happens to build hardware to match its software. As the driving force behind the emergence of the “software-defined vehicle,” Tesla pioneered the integration of computing systems and connectivity into cars, which reduced costs, improved features and made over-the-air updates far easier. Rapid software development is where Tesla beats the competition most decisively, even though the company leads in other areas as well.
The most important emerging capability for vehicles is autonomy, and autonomy is primarily a software problem. Tesla’s FSD (Full Self-Driving) beta has been met with controversy as a massive experiment that makes the general public its guinea pigs. But just as human beings can’t learn to drive without some practice on real roads, self-driving systems need to experience real-world situations to develop strategies for handling them. Companies developing autonomous driving systems can speed up this process by creating realistic world simulations and testing models against them. But for FSD to work, it must ultimately be tested against the chaos of real-world driving, with its strategies refined to cope with that chaos.
This is where Dojo comes in. Tesla already uses a massive supercomputer powered by NVIDIA GPUs to process its FSD data and build better models. It consists of 5,760 NVIDIA A100 graphics cards installed in 720 nodes of eight GPUs each, and it delivers 1.8 exaflops, placing it among the fastest supercomputers in the world. One of the tasks this system performs is “auto-labeling,” which tags raw sensor data so that it can become part of a decision-making system. Although a self-driving car must do some recognition on the fly, most of its behavior comes from fitting sensor data into pre-processed models of the world with predetermined actions for specific situations. Just as human beings learn to recognize road conditions from past experience and react accordingly, an autonomous vehicle relies on the collective driving experience fed into its AI model to decide how to drive.
Dojo promises to massively accelerate how quickly these models improve. During AI Day, Tesla claimed that just four Dojo cabinets can do the same auto-labeling job as 4,000 GPUs spread across 72 racks. The company made similar performance promises for other categories of work related to training the models needed for autonomous driving. Tesla will deploy Dojo in clusters called “exapods,” each consisting of 10 cabinets, and plans to have seven of those exapods in its Palo Alto data center. With each exapod capable of 1.1 exaflops, that would be nearly 8 exaflops of processing power focused primarily on training AI models for Tesla’s autonomous vehicles (and, presumably, the Optimus robot).
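These headline figures are easy to sanity-check: the sketch below simply multiplies the counts Tesla stated at AI Day (the variable names are mine, not Tesla’s):

```python
# Back-of-the-envelope check of Tesla's stated Dojo exapod plans.
CABINETS_PER_EXAPOD = 10   # stated at AI Day 2022
EXAFLOPS_PER_EXAPOD = 1.1  # stated at AI Day 2022
PLANNED_EXAPODS = 7        # planned for the Palo Alto data center

total_cabinets = PLANNED_EXAPODS * CABINETS_PER_EXAPOD
total_exaflops = PLANNED_EXAPODS * EXAFLOPS_PER_EXAPOD

print(f"{total_cabinets} cabinets, ~{total_exaflops:.1f} exaflops")
# → 70 cabinets, ~7.7 exaflops
```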
The way Dojo works is very different from either CPU-based or GPU-based supercomputers. Dojo is made up of “tiles,” which take a different approach than conventional CPUs or GPUs. A CPU incorporates multiple processing cores on a chip, each capable of executing complex software operations at high frequency. However, current CPU designs combine at most 64 cores per chip, with a node typically offering a maximum of two CPUs and 128 cores. A CPU-based supercomputer aggregates many of these nodes into one system. Frontier, the world’s fastest supercomputer, which came online this year, has 9,408 nodes and 602,112 CPU cores.
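As a quick check, Frontier’s CPU-core total follows directly from its node count, since each Frontier node carries a single 64-core AMD EPYC processor (figures from Frontier’s public specifications; the arithmetic below is just a sanity check):

```python
# Frontier's published CPU-core total derived from its node count.
NODES = 9408              # Frontier compute nodes
CPU_CORES_PER_NODE = 64   # one 64-core AMD EPYC CPU per node

print(NODES * CPU_CORES_PER_NODE)  # → 602112
```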
Modern GPUs have many, many more cores each. The recently released NVIDIA GeForce RTX 4090 has 16,384 cores, and the A100s in Tesla’s current GPU-based supercomputer have 6,912 each, but each of those cores performs much simpler operations, albeit very quickly. This is why GPUs have found favor with artificial intelligence and machine learning applications, such as those involved in building self-driving models. The typical maximum is eight GPUs per node. In total, Tesla’s GPU-based supercomputer cluster has nearly 40 million GPU cores.
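That “nearly 40 million” figure follows from the cluster layout described above (counts as reported by Tesla; variable names are mine):

```python
# Total GPU cores in Tesla's A100 cluster, from the reported layout.
NODES = 720            # GPU nodes in the cluster
GPUS_PER_NODE = 8      # A100s per node
CORES_PER_A100 = 6912  # CUDA cores per NVIDIA A100

total_gpus = NODES * GPUS_PER_NODE
total_cores = total_gpus * CORES_PER_A100

print(total_gpus)   # → 5760
print(total_cores)  # → 39813120  (≈ 40 million)
```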
Dojo is different. Instead of aggregating many smaller chips, the D1 is one large chip with 354 cores aimed specifically at AI and ML workloads. Twenty-five D1 chips are packaged together into a “training tile,” six tiles form a system tray, and two trays fit in each cabinet, giving a cabinet 106,200 cores and a 10-cabinet exapod just over a million. A CPU-based supercomputer packs fewer cores into the same space, and a GPU-based one many more. But because Dojo is specifically optimized for AI and ML processing, it’s orders of magnitude faster than either for the same data center footprint.
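Using the figures Tesla gave at AI Day 2022 and Hot Chips 34, the Dojo hierarchy multiplies out as follows (a back-of-the-envelope sketch; names are mine):

```python
# Dojo's hardware hierarchy, per Tesla's AI Day 2022 / Hot Chips 34 figures.
CORES_PER_D1 = 354       # cores on one D1 chip
D1_PER_TILE = 25         # D1 chips per training tile
TILES_PER_TRAY = 6       # training tiles per system tray
TRAYS_PER_CABINET = 2    # trays per cabinet
CABINETS_PER_EXAPOD = 10 # cabinets per exapod

cores_per_cabinet = (CORES_PER_D1 * D1_PER_TILE
                     * TILES_PER_TRAY * TRAYS_PER_CABINET)
cores_per_exapod = cores_per_cabinet * CABINETS_PER_EXAPOD

print(cores_per_cabinet)  # → 106200
print(cores_per_exapod)   # → 1062000
```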
Tesla plans to bring the first Dojo exapod online in Q1 2023, but hasn’t said when the other six will arrive. When this level of processing becomes fully available, focused on crunching through the training of Tesla’s FSD models, it will massively accelerate the development of autonomous vehicles. There are now 160,000 Tesla drivers participating in the FSD beta, collecting real-world driving data. With Dojo exapods using that data to build new models, and improved builds rolling out to those 160,000 users, a virtuous circle will develop; more testers are likely to be recruited, further accelerating development.
That’s why Dojo, not Optimus, was the really big news from Tesla’s AI Day 2022. At the previous AI Day in 2021, Tesla only talked about the D1’s specs and showed off early silicon. Things have come a long way since. Of course, you should always take big announcements from Elon Musk with a grain of salt, but assuming Dojo does come online next year, expect much faster iterations of the Tesla FSD beta, much faster improvements, and a shorter time to market for autonomous vehicles than you might previously have expected.