Nvidia is promoting its own high-end performance in major AI and machine learning benchmarks, as if some floodgate has popped open on companies touting performance metrics for their own hardware. (That isn't literally true, but we've been seeing a lot of artificial intelligence and machine learning data cross our desks of late.) According to Nvidia, it's hit some major milestones, including:
- A single V100 Tensor Core GPU achieves 1,075 images/second when training ResNet-50, a 4x performance increase compared with the previous generation Pascal GPU;
- A single DGX-1 server powered by eight Tensor Core V100s achieves 7,850 images/second, almost 2x the 4,200 images/second from a year ago on the same system;
- A single AWS P3 cloud instance powered by eight Tensor Core V100s can train ResNet-50 in less than three hours, 3x faster than a TPU instance.
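The single-GPU and eight-GPU figures above also imply a multi-GPU scaling efficiency, which is easy to check with simple arithmetic. A minimal sketch using Nvidia's published numbers:

```python
# Scaling check for Nvidia's published ResNet-50 training throughput.
single_v100 = 1075     # images/sec, one V100 Tensor Core GPU
dgx1_8x_v100 = 7850    # images/sec, eight V100s in a DGX-1

ideal_8x = single_v100 * 8               # throughput if scaling were perfectly linear
scaling_eff = dgx1_8x_v100 / ideal_8x    # fraction of ideal linear scaling achieved

print(f"Ideal 8-GPU throughput: {ideal_8x} images/sec")
print(f"Scaling efficiency: {scaling_eff:.1%}")
```

At roughly 91 percent of linear scaling, the DGX-1 numbers are strong; the shortfall versus a perfect 8x comes from inter-GPU communication and synchronization overhead.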
Nvidia is also talking up Volta as a potential replacement for ASICs, which would otherwise provide superior functionality in a limited set of use cases. It's not clear (and I genuinely mean that) how such claims should be interpreted. Nvidia notes: "For instance, each Tesla V100 Tensor Core GPU delivers 125 teraflops of performance for deep learning compared to 45 teraflops by a Google TPU chip. Four TPU chips in a 'Cloud TPU' deliver 180 teraflops of performance; by comparison, four V100 chips deliver 500 teraflops of performance." It also refers to a project by fast.ai to optimize image classification on the CIFAR-10 dataset using Volta that turned in best-in-class overall performance, beating all other competitors.
There are problems, however, with relying on FLOPS to measure performance. FLOPS is calculated by a simple mathematical equation: cores × clock speed × operations per clock.
In GPUs, this works out to GPU cores * clock * two instructions per clock (one multiply, one accumulate) = X rating in TFLOPS. This intrinsically assumes the GPU is executing a multiply and an accumulate on every core simultaneously. The assumption lets us generate comparative metrics quickly with a constant formula, but there's a huge loophole: if GPU #1 typically achieves only 50 percent of its theoretical peak FLOPS, it can be outperformed by GPU #2, which might have a much lower theoretical maximum but still exceed GPU #1's real-world performance if it runs more efficiently. The same caveat applies to any comparison between two different solutions, GPU or otherwise.
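That loophole can be made concrete with a toy comparison. The sketch below uses the text's formula (cores × clock × 2 ops per clock) with illustrative inputs: the core counts and clocks are Volta/Pascal-like round numbers, and the 50 percent and 90 percent utilization figures are hypothetical, not measured data.

```python
def theoretical_tflops(cores, clock_ghz, ops_per_clock=2):
    """Peak TFLOPS: cores * clock * 2 (one multiply + one accumulate per clock)."""
    return cores * clock_ghz * ops_per_clock / 1000.0

# GPU #1: higher theoretical peak, but only 50% utilization in practice.
gpu1_peak = theoretical_tflops(cores=5120, clock_ghz=1.5)
gpu1_real = gpu1_peak * 0.50

# GPU #2: lower theoretical peak, but 90% utilization (hypothetical figure).
gpu2_peak = theoretical_tflops(cores=3584, clock_ghz=1.5)
gpu2_real = gpu2_peak * 0.90

print(f"GPU #1: {gpu1_peak:.2f} peak TFLOPS -> {gpu1_real:.2f} achieved")
print(f"GPU #2: {gpu2_peak:.2f} peak TFLOPS -> {gpu2_real:.2f} achieved")
# GPU #2 delivers more real-world throughput despite the lower peak rating.
```

The spec sheet says GPU #1 wins; the achieved numbers say otherwise. That is why peak-FLOPS comparisons between architectures should be treated with caution.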
The recent reports on Google’s cloud TPU being more efficient than Volta, for example, were derived from the ResNet-50 tests. The results Nvidia is referring to use the CIFAR-10 data set. The Dawnbench team records no results for TPUs in this test, and fast.ai’s blog post on the topic may explain why this is:
Google’s TPU instances (now in beta) may also be a good approach, as the results of this competition show, but be aware that the only way to use TPUs is if you accept lock-in to all of:
- Google’s hardware (TPU)
- Google’s software (TensorFlow)
- Google’s cloud platform (GCP)
Problematically, there is no ability to code directly for the TPU, which severely limits algorithmic creativity (which, as we have seen, is the most important part of performance). Given the limited neural network and algorithm support on TPU (e.g. no support for recurrent neural nets, which are vital for many applications, including Google’s own language translation systems), this limits both what problems you can solve, and how you can solve them.
As hardware and software continue to evolve, we’ll see how these restrictions and capabilities evolve along with them. It’s absolutely clear that Volta is a heavy-hitter in the AI/ML market as a whole, with excellent performance and the flexibility to handle many different kinds of tasks. How this will change as more custom hardware comes online and next-generation solutions debut is still unclear.