Fundamental Behaviors for Programmable Software Acceleration

Executive Summary: Conventional wisdom states that programmable processors such as CPUs are too power-hungry for practical deployment in data-centers to solve AI and big-data problems. Instead, purpose-built chips are required for problems with large computational needs. However, we show that four fundamental behaviors are universal across these AI and big-data applications. Further, these behaviors can be formally defined, can be extracted by a compiler, and can be organized into a programmable design that runs as efficiently as a custom chip. In this white paper we describe the motivation, define the behaviors and the importance of their roles, and describe a new architecture that leverages these findings to create a single flexible design that can replace the traditional approach of creating specialized processors.

Background and Conventional Wisdom

Performance improvements from general-purpose processors have proved elusive in recent years. By general purpose processors, we refer to the chips that are primarily used in data-centers and laptops. This has led to a surge of interest in architectures serving narrowly defined applications in the hope of re-establishing system improvements for at least some of the most significant application domains. The conventional approach is to design and build custom chips for a specific problem, or so-called “domain-specific accelerators” (DSAs). Such engines are capable of performing computations in a particular domain with high performance and energy efficiency. DSAs have been developed for machine learning, cryptography, XML processing, regular expression matching, H.264 decoding, and many others. In the best cases, they can achieve 10X to 1,000X performance and energy improvements over general purpose processors for a given application.

The Obsolescence Problem

To gain these efficiency improvements DSAs sacrifice programmability - a high price to pay. First, this makes DSAs prone to rapid obsoletion: the domains and algorithms which require specialization are constantly evolving due to scientific progress and changing user needs. Second, the needs of the relevant domains typically migrate across different device types (server, mobile, wearable, etc). Supporting each device type by creating fundamentally new designs costs both design and validation time. Finally, most business deployments must run several different workloads to support their business objectives, thereby requiring a jumbled integration of many such custom designs, each of which is prone to obsoletion. For example, in the financial services it is common to use deep-learning for analyzing time series data, while also using regression trees for other data-types. In retail services and e-commerce (think e-bay, Target, Walmart), image recognition and recommendation systems are workhorse algorithms, while big-data processing is used in the background for inventory management.
The timeframe required to design (12 months), physically manufacture (4 months for technology nodes like 7nm), and qualify a chip for widespread deployment (3 months), does not align with supporting algorithms that change every few months. The paradigm of building custom chips to target specific domains is fundamentally broken as a business model.
The result is a decision between two unsustainable and inefficient decisions: use CPUs and run at untenable low efficiencies, or use custom-chips that risk rapid obsolescence given the current pace of algorithmic change.

Our Discovery

Our analysis of this problem was fueled by two fundamental questions, each a corollary to the other. First, what do custom-chips and DSAs do that makes them so much more efficient than programmable CPUs? Second, what do CPUs do that allows them to be programmable? By asking and answering these two questions, we arrived at several surprising breakthroughs.
  • Breakthrough 1 : DSAs and custom-chips are unnecessarily specialized. Regardless of what problem they are solving, across the board, their designs reuse the same set of fundamental ideas.
  • Breakthrough 2: These fundamental ideas can be formally defined as program behaviors and can be paired with corresponding efficient hardware implementation.
  • Breakthrough 3: These particular behaviors can be succinctly and formally defined, allowing compiler tools, computer architects, and chip designers to build upon them and create solutions that are programmable, yet as efficient as custom-designed chips.

The Behaviors

The general principle of concurrency is an overarching design approach that allows an algorithm to be doing many things simultaneously to achieve the final outcome faster than doing them sequentially. Using concurrency, an application may be deconstructed to perform multiple behaviors simultaneously. Concurrency is further exploited in chip architectures to perform work in parallel by favoring design structures with low overhead. We evaluated a variety of DSAs to analyze the architectural decisions made to optimize for concurrency and identify the behaviors emphasized by each. Universally, we observed the same four behaviors utilized across various applications in various domains:
  • Data Gathering: Collecting the input data for computation including the storage of those values, the movement from one part of the algorithm to another, and the reuse of that data across different parts of the application.
  • Computational dataflow or computation: Primitive data operations (e.g., arithmetic, comparison, etc.) to produce algorithm output. Machine learning algorithms may exhibit significant concurrency and regularity such as independent and repetitive computations on large data volumes, or may have long sequences of dependent computations.
  • Synchronization: Sequencing data flow among multiple concurrent algorithm stages. Synchronization is the opposite of concurrency; synchronization ensures dependent algorithmic steps are correctly executed.
  • Control: Overall algorithm management including initialization, checking termination and convergence, and handling boundary cases or error conditions. Control behavior is often irregular in comparison to computations, and may become a bottleneck when a system is only optimized for regularity.

Observing The Behaviors and Using Them

The behaviors can be efficiently implemented in hardware and software. Through careful analysis we have identified how each behavior interacts with the others, and understand how each contributes to the overall productivity of the processor for a given workload.
After analyzing several custom chips, it becomes apparent that they are consistently using these behaviors. Our key breakthrough is in discovering that they are being used in a way that is not particularly application-specific. The chart below illustrates the designs of four different DSAs. We have simply colored the chip-organizations to display how the different behaviors are implemented. This demonstrates the universality of the behaviors.
graph
We can also look at individual designs and quantify how much each behavior contributes to the processor’s total workload. In the graphic below, the Y-Axis represents the proportion of the computation devoted to each of the four given behaviors. The analysis is performed by isolating individual behaviors in a specialized chip by turning them on or off in a simulation environment. Three different processor designs are presented for this analysis: image processing, deep-learning, and database processing. We observe that despite substantially different architectural designs, each of these processors have substantially similar profiles in terms of the proportional workload devoted to each behavior.
graph

Creating a New Paradigm - Composable Behaviors

This breakthrough finding implies that a universal design should be able to deliver optimized performance for specialized use cases. We have utilized that knowledge to design a new form of a chip with a software stack and compiler that enables these four behaviors while eliminating this “unnecessary specialization”. It would not be sufficient to simply combine many specialized chip designs in an additive manner as that would create an excessively complex design while not addressing the flexibility required to handle changing algorithms. Instead we have created and built a new chip execution paradigm called Composable Behavior Execution, where each behavior is the basic primitive. In this design, we implement behaviors as the fundamental blocks and create a compiler that deconstructs each program into the various behaviors. This concept is similar to how CPU’s utilize an instruction as the basic primitive and use compilers to deconstruct all programs into various instructions. Behaviors are the new instruction set for AI and big data applications.

Further reading

We encourage you to look deeper by looking at our award-winning research publication: Domain Specialization is Generally Unnecessary for Accelerators. This breakthrough was recognized with a premier research award, IEEE Micro Top Picks Selection, for redefining what is an accelerator.