By Karu Sankaralingam, Jude Shavlik, and Greg Wright
Artificial Intelligence (AI) and Machine Learning (ML) are continually evolving, and chip design must evolve with them in order to enable and support the Age of AI.
Recent developments in neural networks
The deep learning field is changing quickly and radically. Categorizing images (Is it an artichoke? A hammer? A zebra?) via Convolutional Neural Networks led the first half-decade of exploding interest in deep learning, starting with the impressive accuracy gains of 2012’s AlexNet. Excitement quickly centered on more ambitious methods that find multiple objects within images (cars, bikes, pedestrians, stop signs, etc.), such as 2014’s R-CNN and 2016’s SSD algorithms.
Interestingly, as the tasks became more challenging, much of the computation moved outside of traditional neural-network functions, including SSD’s complex search for a coherent set of multiple ‘bounding boxes’ out of over 100,000 candidates produced by the neural network.
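To give a feel for this kind of non-neural work, here is a minimal sketch of greedy non-maximum suppression, a common post-processing step in object detectors such as SSD. It is an illustration under our own assumptions, not SSD's actual implementation: the function names, the `(x1, y1, x2, y2)` box format, and the 0.5 overlap threshold are all hypothetical choices.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes that survive, highest score first.

    A box is kept only if it does not overlap any already-kept box
    by more than iou_threshold (an assumed, illustrative value).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Data-dependent branching: how much work is done here depends
        # on how many boxes have survived so far.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


# Two heavily overlapping candidates and one separate candidate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = greedy_nms(boxes, scores)
```

Note that the loop is full of sorting, comparisons, and branches whose outcome depends on the data. None of it is a dense matrix multiplication, which is why such steps map poorly onto hardware built only for that.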
Language-based tasks have been garnering more excitement over the last few years. A major workhorse in the emergence of deep learning has been the Long Short-Term Memory (LSTM) model, especially in temporal and sequential tasks like speech recognition (2015’s Deep Speech 2) and machine translation from one ‘natural’ language, such as English, to another, such as Japanese (2016’s Neural Machine Translation, NMT). As with recent image-processing approaches, a substantial portion of the computation in language-based tasks lies outside of traditional neural-network components. For instance, NMT’s computation time is dominated by decoding the deep learner’s numeric predictions into the closest matching words in the target language (e.g., Japanese).
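As a rough illustration of why decoding can be costly, the sketch below maps each numeric prediction to the vocabulary word whose vector scores highest against it. This is a deliberate simplification and not how NMT is actually implemented: production systems typically use a softmax over a very large vocabulary together with beam search. The tiny three-word vocabulary, its vectors, and the function names are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def closest_word(prediction, vocab_vectors):
    """Return the word whose vector has the largest dot product with prediction.

    Every word in the vocabulary is scored, so the cost per output word
    grows with vocabulary size times vector length.
    """
    return max(vocab_vectors, key=lambda w: dot(prediction, vocab_vectors[w]))


# Hypothetical three-word target vocabulary with made-up vectors.
vocab_vectors = {
    "neko": (1.0, 0.0, 0.0),
    "inu": (0.0, 1.0, 0.0),
    "tori": (0.0, 0.0, 1.0),
}

# Made-up network outputs, one vector per output position.
predictions = [(0.9, 0.1, 0.0), (0.2, 0.1, 0.7)]
decoded = [closest_word(p, vocab_vectors) for p in predictions]
```

With a realistic vocabulary of tens of thousands of words, this scoring step is repeated for every output position, which suggests how it can come to dominate run time.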
Another way that natural interaction with computers has become more ambitious is the increasing attention to long-duration dialogs in chatbots, virtual assistants (Siri, Alexa, etc.), and conversational user interfaces, such as those for automated customer interactions. Most recently, LSTMs are being supplanted by transformer and attention-based methods; 2019’s BERT, which uses no LSTMs for text processing, is currently the hottest method in ML.
Most of ML assumes examples (observations or the final output) are described by a fixed number of features. Sentences in English and other natural languages of course vary greatly in length. Usually this variation is addressed, in an arguably unappealing brute-force manner, by enforcing a maximum sentence length and padding short sentences with a ‘null’ word so they reach this maximum length, though there are some impressive exceptions to this, such as the above-mentioned NMT.
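The brute-force padding scheme can be sketched in a few lines. The `<null>` token, the maximum length of 4, and the function name are assumptions made for illustration only.

```python
PAD = "<null>"  # hypothetical padding token


def pad_or_truncate(tokens, max_len, pad=PAD):
    """Force a token list to exactly max_len entries.

    Longer sentences are cut off; shorter ones are filled with pad.
    """
    clipped = tokens[:max_len]
    return clipped + [pad] * (max_len - len(clipped))


sentences = [
    ["the", "zebra", "runs"],
    ["stop"],
    ["a", "b", "c", "d", "e"],
]
batch = [pad_or_truncate(s, max_len=4) for s in sentences]

# Count how many slots in the fixed-size batch carry no information.
wasted = sum(row.count(PAD) for row in batch)
```

In this toy batch, 4 of the 12 slots hold only padding and one sentence loses its last word, which is the sense in which the approach is brute force: the hardware still does work on the padded slots.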
One last emerging direction we wish to mention is the class of machine-learning approaches, such as graph-based neural networks, that allow for examples of arbitrary size. One appeal of such approaches is that large amounts of world knowledge can be captured in graphs; for example, people, places, and things can be represented as nodes in the graph, and relations (motherOf, worksFor, receivedDegreeFrom, etc.) can be represented as links between nodes. Because these graphs have irregular shapes and sizes that vary across training examples, graph neural networks involve calculations that stress the capabilities of traditional neural-network platforms.
Impact on program properties
These innovations in ML are happening rapidly and are creating large business value. At the same time, they are creating immense needs and opportunities for the other portions of the physical infrastructure and software stack that run these algorithms.
First, different algorithms end up having different program properties. We use simple, informal terminology to sort program behavior into three classes:
- Nicely shaped computation and embarrassingly parallel: for example, millions of 3x3 convolutions happening in parallel.
- Nicely shaped computation but limited parallelism: for example, the vector-matrix computation of a single LSTM node in a chain of LSTM nodes.
- Irregularly shaped computation: for example, the aggregation phase in a graph-convolution network.
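The three classes can be made concrete with toy code. The sketch below is only illustrative: the function names are hypothetical, and a scalar recurrence stands in for the vector-matrix product of a real LSTM node.

```python
def conv3x3_at(image, kernel, r, c):
    """One 3x3 convolution output at position (r, c)."""
    return sum(image[r + i][c + j] * kernel[i][j]
               for i in range(3) for j in range(3))


def conv3x3(image, kernel):
    # Class 1: nicely shaped and embarrassingly parallel. Every output
    # position is independent, so all of them could run at once.
    rows, cols = len(image) - 2, len(image[0]) - 2
    return [[conv3x3_at(image, kernel, r, c) for c in range(cols)]
            for r in range(rows)]


def recurrent_chain(inputs, weight, state=0.0):
    # Class 2: nicely shaped but limited parallelism. Each step does the
    # same regular work, yet step t cannot start before step t-1 ends.
    for x in inputs:
        state = weight * state + x
    return state


def aggregate_neighbors(features, neighbors):
    # Class 3: irregularly shaped. The work per node depends on how many
    # neighbors it has, which differs from node to node and graph to graph.
    return {node: sum(features[m] for m in nbrs)
            for node, nbrs in neighbors.items()}


image = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
conv_out = conv3x3(image, kernel)

chain_out = recurrent_chain([1.0, 2.0, 3.0], weight=0.5)

features = {"a": 1.0, "b": 2.0, "c": 3.0}
neighbors = {"a": ["b", "c"], "b": ["a"], "c": []}
agg_out = aggregate_neighbors(features, neighbors)
```

The convolution loop nest has a fixed shape and no dependencies between iterations, the recurrence has a fixed shape but a strict ordering, and the aggregation has neither a fixed shape nor a predictable memory access pattern.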
As algorithms have evolved, the share of their time spent in each of these three classes has changed. Even for a given algorithm, the balance shifts depending on whether an application is doing learning or prediction (possibly on a large batch of examples simultaneously). Finally, as the algorithms get more sophisticated and ambitious, their nuances create further subtlety within what we are coarsely describing as nicely shaped and irregularly shaped, and in the levels of parallelism. Recently, a SIGARCH blog entry titled “Deep Learning: It’s Not All About Recognizing Cats and Dogs” showed that, based on measured data on Facebook’s ML workloads, less than 30% of computation cycles were spent on image processing. Its authors also introduce a more sophisticated and hardware-oriented characterization of application needs.
Impact on hardware
These program properties hint at the diversity that compilers, run-time engines, and hardware need to support. Merely supporting matrix multiplication globally mapped on a chip will simply not cut it anymore.
While GPUs are the state of the art today, hardware solutions that provide the four properties outlined below are strongly positioned to overtake GPUs in the coming years:
1. User-friendly: First, the hardware and software stack must be user-friendly. It must be able to execute the languages and frameworks that data scientists work in, such as TensorFlow and PyTorch. GPUs are the state of the art here, and new hardware solutions need to provide software capability that is at least as good as a GPU's.
2. Durable, time-scalable, reprogrammable: Second, future hardware architectures should provide durability as a first-order primitive, meaning that the solution is scalable in time, or future-proof, in order to deliver high performance across algorithm experimentation, development, and deployment. They need to be reprogrammable and able to support a diverse set of workloads, even ones to which they are not well matched. Coding and optimizing algorithms for GPUs is a specialist skill, and GPUs offer limited accessibility to the data scientist for novel approaches. Giving data scientists easier ways to manipulate the hardware and extract efficiency from it would make alternative architectures attractive for adoption.
3. Dynamic: Third, the workload consolidation and elasticity needs of a production deployment make support for dynamism a necessary first-class primitive. Thus, the hardware and software stacks must naturally support virtualization, migration, and other aspects of hyperscale deployment.
4. High-performance and efficient: Fourth and finally, the solutions must be at least competitive on performance and power efficiency.
A clean-slate chip design
A new, clean-slate chip design is needed for the Age of AI. Existing solutions such as CPUs and GPUs are simply not up to the task anymore. New chips will need to be user-friendly, durable, reprogrammable, time-scalable, dynamic, and efficient in order to keep up with the rigorous demands of neural networks and their more sophisticated algorithms.