New research out of IBM Research’s lab in Almaden, California, nearly two decades in the making, has the potential to drastically shift how we can efficiently scale up powerful AI hardware systems.
Since the birth of the semiconductor industry, computer chips have primarily followed the same basic structure, in which the processing units and the memory holding the information to be processed are kept separate. While this structure has allowed for simpler designs that have scaled well over the decades, it has created what’s called the von Neumann bottleneck, where it takes time and energy to continually shuffle data back and forth between memory, processing, and any other devices within a chip.
The work by IBM Research’s Dharmendra Modha and his colleagues aims to change this, taking inspiration from how the brain computes. Over the last eight years, Modha has been working on a new type of digital AI chip for neural inference, which he calls NorthPole. It’s an extension of TrueNorth, the last brain-inspired chip that Modha worked on prior to 2014.
In tests on the popular ResNet-50 image recognition and YOLOv4 object detection models, the new prototype device has demonstrated higher energy efficiency, higher space efficiency, and lower latency than any other chip currently on the market, and is roughly 4,000 times faster than TrueNorth.
The first promising set of results from NorthPole chips was just published in Science. NorthPole is a breakthrough in chip architecture that delivers massive improvements in energy, space, and time efficiencies, according to Modha. Using the ResNet-50 model as a benchmark, NorthPole is considerably more efficient than common 12-nm GPUs and 14-nm CPUs. (NorthPole itself is built on a 12-nm node process.)
In both cases, NorthPole is 25 times more energy efficient, measured as the number of frames interpreted per joule of energy consumed. NorthPole also outperformed in latency, as well as in the space required to compute, measured as frames interpreted per second per billion transistors. According to Modha, on ResNet-50, NorthPole outperforms all major prevalent architectures, even those that use more advanced technology processes, such as a GPU implemented using a 4 nm process.
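The two efficiency metrics above are simple ratios, and a short sketch may make them concrete. The function names and the input numbers below are illustrative assumptions, not measurements from the paper; the only fact relied on is that one watt is one joule per second.

```python
def frames_per_joule(frames_per_second: float, power_watts: float) -> float:
    """Energy efficiency: frames processed per joule of energy.

    Power in watts is joules per second, so dividing throughput by
    power cancels the seconds and leaves frames per joule.
    """
    return frames_per_second / power_watts


def frames_per_second_per_billion_transistors(
    frames_per_second: float, transistor_count: float
) -> float:
    """Space efficiency: throughput normalized by transistor count in billions."""
    return frames_per_second / (transistor_count / 1e9)


# Hypothetical inputs chosen only to show the arithmetic (not real chip data).
example_fps = 1000.0
example_power_w = 50.0
example_transistors = 20e9

energy_eff = frames_per_joule(example_fps, example_power_w)
space_eff = frames_per_second_per_billion_transistors(example_fps, example_transistors)
print(energy_eff, space_eff)
```

Normalizing by transistor count rather than die area lets chips built on different process nodes be compared on the same footing, which is presumably why the comparison is framed that way.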
One of the biggest differences with NorthPole is that all of the memory for the device is on the chip itself, rather than connected separately. Without that von Neumann bottleneck, the chip can carry out AI inferencing considerably faster than other chips already on the market. NorthPole was fabricated with a 12-nm node process, and contains 22 billion transistors in 800 square millimeters. It has 256 cores and can perform 2,048 operations per core per cycle at 8-bit precision, with potential to double and quadruple the number of operations with 4-bit and 2-bit precision, respectively.
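The figures in this paragraph imply some chip-wide totals that are easy to work out. The sketch below uses only the numbers quoted above (256 cores, 2,048 operations per core per cycle at 8-bit precision, 22 billion transistors, 800 square millimeters); the function and constant names are made up for illustration, and no clock frequency is assumed, so the results are per cycle rather than per second.

```python
# Figures quoted in the article.
CORES = 256
OPS_PER_CORE_PER_CYCLE_8BIT = 2048
TRANSISTORS = 22_000_000_000
DIE_AREA_MM2 = 800


def ops_per_cycle(bits: int) -> int:
    """Chip-wide operations per cycle at 8-, 4-, or 2-bit precision.

    Per the article, throughput doubles at 4-bit and quadruples at 2-bit.
    """
    if bits not in (8, 4, 2):
        raise ValueError("precision must be 8, 4, or 2 bits")
    scale = 8 // bits  # 1x at 8-bit, 2x at 4-bit, 4x at 2-bit
    return CORES * OPS_PER_CORE_PER_CYCLE_8BIT * scale


def transistors_per_mm2() -> float:
    """Average transistor density across the die."""
    return TRANSISTORS / DIE_AREA_MM2


for bits in (8, 4, 2):
    print(f"{bits}-bit: {ops_per_cycle(bits):,} operations per cycle")
print(f"{transistors_per_mm2():,.0f} transistors per square millimeter")
```

At 8-bit precision this works out to 524,288 operations per cycle across the chip, and to an average density of 27.5 million transistors per square millimeter.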
The NorthPole chip on a PCIe card.
“Architecturally, NorthPole blurs the boundary between compute and memory,” Modha said. “At the level of individual cores, NorthPole appears as memory-near-compute, and from outside the chip, at the level of input-output, it appears as an active memory.”
This makes NorthPole easy to integrate in systems and significantly reduces load on the host machine.
NorthPole’s biggest advantage is also a constraint: it can only easily pull from the memory it has onboard. All of the speedups that are possible on the chip would be undercut if it had to access information from another place. Via an approach called scale-out, NorthPole can actually support larger neural networks by breaking them down into smaller sub-networks that fit within NorthPole’s model memory, and connecting these sub-networks together on multiple NorthPole chips.
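The article does not describe how the scale-out partitioning is actually done, so the following is only a toy sketch of the general idea: walk through a network's layers in order and group consecutive layers into sub-networks that each fit in one chip's memory. The function name, the greedy strategy, the layer sizes, and the per-chip capacity are all assumptions made for illustration.

```python
from typing import List


def partition_layers(layer_sizes: List[int], chip_capacity: int) -> List[List[int]]:
    """Greedily group consecutive layers so each group fits on one chip.

    layer_sizes   -- memory footprint of each layer (hypothetical units)
    chip_capacity -- on-chip memory available per chip (hypothetical, same units)
    Returns a list of groups, each a list of layer indices assigned to one chip.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    used = 0
    for index, size in enumerate(layer_sizes):
        if size > chip_capacity:
            raise ValueError(f"layer {index} does not fit on a single chip")
        # Start a new chip when adding this layer would overflow the current one.
        if current and used + size > chip_capacity:
            groups.append(current)
            current, used = [], 0
        current.append(index)
        used += size
    if current:
        groups.append(current)
    return groups


# Made-up layer sizes and capacity, purely to show the grouping.
assignment = partition_layers([40, 60, 30, 80, 20, 50], chip_capacity=100)
print(assignment)  # [[0, 1], [2], [3, 4], [5]]
```

A real mapping would also have to account for activations passed between chips and for balancing compute across them; this sketch only captures the constraint that each sub-network must fit in on-chip memory.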
So while there is ample memory on a NorthPole (or collectively on a set of NorthPoles) for many of the models that would be useful for specific applications, this chip is not meant to be a jack of all trades.
“We can’t run GPT-4 on this, but we could serve many of the models enterprises need,” Modha said. “And, of course, NorthPole is only for inferencing.”
This efficiency means that the device also doesn’t need bulky liquid-cooling systems to run (fans and heat sinks are more than enough), so it could be deployed in some rather small spaces.
While research into the NorthPole chip is still ongoing, its structure lends itself to emerging AI use cases, as well as more well-established ones.
In testing, the NorthPole team focused primarily on computer vision-related uses, in part because funding for the project came from the U.S. Department of Defense. Some of the primary applications under consideration were detection, image segmentation, and video classification. But it was also tested in other arenas, such as natural language processing (on the encoder-only BERT model) and speech recognition (on the DeepSpeech2 model). The team is currently exploring mapping decoder-only large language models to NorthPole scale-out systems.
NorthPole could potentially be the sort of device that’s needed to move autonomous vehicles from machines that require set maps and routes to operate on a small scale, to ones that can think and react to the rare edge-case situations that make navigating in the real world so challenging even for proficient human drivers.
This is just the start of the work for Modha on NorthPole. The current state of the art for CPUs is 3 nm, and IBM itself is already years into research on 2 nm nodes. That means there are a handful of generations of chip process technology that NorthPole could be implemented on, in addition to fundamental architectural innovations, to keep finding efficiency and performance gains.
Dharmendra S. Modha et al. (2023) “Neural inference at the frontier of energy, space, and time.” Science 382, 329-335 doi: 10.1126/science.adh1174