Compilability is a First Order Need for AI Hardware

By Karu Sankaralingam

Computer architecture in the 1990s was the era of microarchitecture, where deep application knowledge and the role of the compiler were largely inconsequential. Applications today are changing at a rapid pace, yet, paradoxically, proposed architectures are again becoming more specialized. Over the past decade, starting around 2015, Artificial Intelligence (AI) and Machine Learning (ML) have exploded in popularity and exemplify this rapid evolution in how quickly new types of AI algorithms are proposed and evaluated. Whereas vision tasks (convolution-based image classification) were initially dominant, we now see a wide variety of models with very different compute and memory-throughput demands (segmentation, NLP, NMT, recommendations, etc.).

GPUs are dominant, but not optimal

Today, the GPU dominates as the primary platform for AI and machine learning work, but not because GPUs are in any way optimal. While many other commercial chips have been developed, they are almost entirely limited to object-classification tasks. Meanwhile, AI and the world are moving to video and time-based data that only NVIDIA supports well, strengthening its stranglehold.

In academia, one research path is to build “clean-slate” architectures, and many papers published in top architecture conferences take this approach. However, these works ignore a key to success: compilability. Compilability and software maturity are the reasons why the GPU, five years into the AI revolution, still dominates the field.

Compilability refers to how easy it is to mechanically take user code and produce an executable program from it. Orthogonally, software engineering is the task of taking a workload or use case and writing that user code in the first place.
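
To make the distinction concrete, here is a minimal, hypothetical sketch in PyTorch (the framework, the TinyClassifier model, and the use of torch.compile are our illustrative assumptions, not part of the argument): writing the model is software engineering; whether a toolchain can mechanically lower that code to an efficient executable for a given chip is compilability.

```python
import torch
import torch.nn as nn

# "Software engineering": a practitioner turns a use case into user code.
class TinyClassifier(nn.Module):  # hypothetical example model
    def __init__(self, in_dim=784, hidden=256, classes=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, classes),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier()
x = torch.randn(32, 784)

# "Compilability": can a toolchain mechanically turn this user code into an
# efficient executable for the target hardware, with no hand-written kernels?
# On GPUs this step is routine (shown here with PyTorch 2.x graph capture);
# on much proposed accelerator hardware it remains the hard, open problem.
compiled_model = torch.compile(model)
y = compiled_model(x)
print(y.shape)  # torch.Size([32, 10])
```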

In industry and the startup world, there are many examples of purpose-built hardware solutions that target CNNs alone, or particular types of CNNs, with ad-hoc software development environments; these quickly become unusable under the relentless pace of algorithm change.

In the AI research world, another subtle storm is brewing. The mainstream development stack (i.e., NVIDIA GPUs and their software ecosystem) supports certain types of algorithms well, which creates a barrier to exploring other algorithms, unsupported by the hardware, that might be orders of magnitude faster or far more accurate. An analysis from researchers at Google Brain showed that “Tensorflow and PyTorch must copy, rearrange, and materialize to memory two orders of magnitude more data than necessary to run capsule networks.” It seems we have been forced into the streetlight effect.
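
As an illustration of the kind of algorithm the quote refers to, below is a minimal sketch of capsule-network dynamic routing in PyTorch (the tensor shapes and iteration count are our assumptions, not taken from the cited analysis). Each step in the routing loop produces a full intermediate tensor that the framework materializes to memory, which hints at the kind of overhead the Google Brain researchers measured.

```python
import torch

def squash(s, eps=1e-8):
    # Capsule non-linearity: scale each vector to a length in (0, 1).
    norm2 = (s * s).sum(dim=-1, keepdim=True)
    return (norm2 / (1.0 + norm2)) * s / torch.sqrt(norm2 + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: prediction vectors, shape (batch, in_caps, out_caps, out_dim).
    batch, in_caps, out_caps, _ = u_hat.shape
    b = torch.zeros(batch, in_caps, out_caps, device=u_hat.device)  # routing logits
    for _ in range(num_iters):
        c = torch.softmax(b, dim=2)                    # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)       # weighted vote per output capsule
        v = squash(s)                                  # output capsule vectors
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)   # agreement update
        # Every line above creates a fresh intermediate tensor that the
        # framework materializes to memory before the next op can run.
    return v

# Hypothetical sizes, chosen only to make the sketch runnable.
u_hat = torch.randn(8, 1152, 10, 16)
v = dynamic_routing(u_hat)
print(v.shape)  # torch.Size([8, 10, 16])
```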

Chip architects need to move out of the “Valley of Irrelevance”

Our qualitative diagram, shown above, illustrates the argument. We need to consider new architectures that go beyond the GPU’s performance capability while offering at least as good compilability. Landing in the “Valley of Irrelevance” leaves a design hopelessly behind the incumbent GPUs. Superior compilability would be attractive even if performance were worse, because it would let developers explore and deploy more efficient models.

It is tempting to think that hand-optimization and high-performance libraries are sufficient. But compilability cannot be side-stepped by a brute-force, chip-gurus-write-assembly-code approach (also known as the ‘graduate student algorithm’). The rate of model growth and the diversity of models mean that hand-tuning simply cannot keep pace with where applications are headed: a model may be developed, deployed, and displaced by a next-generation model before its hand-optimization is complete.

We arrive at the following claims. First, given the state of AI applications and AI hardware, an opportunity exists for new AI hardware to democratize AI. Second, architects need to move out of this valley of irrelevant architectures, where compilability is so low that the designs are unusable in practical applications. And third, in domains like AI, only hardware conceived with compilability from the ground up can succeed.