Background and Conventional Wisdom
Performance improvements from general-purpose processors have proved elusive in recent years. By general-purpose processors, we refer to the chips primarily used in data centers and laptops. This stagnation has led to a surge of interest in architectures serving narrowly defined applications, in the hope of re-establishing system improvements for at least some of the most significant application domains. The conventional approach is to design and build a custom chip for a specific problem, a so-called “domain-specific accelerator” (DSA). Such engines perform computations in a particular domain with high performance and energy efficiency. DSAs have been developed for machine learning, cryptography, XML processing, regular expression matching, H.264 decoding, and many other applications. In the best cases, they achieve 10X to 1,000X performance and energy improvements over general-purpose processors for a given application.
The Obsolescence Problem
To gain these efficiency improvements, DSAs sacrifice programmability, a high price to pay. First, this makes DSAs prone to rapid obsolescence: the domains and algorithms that require specialization are constantly evolving due to scientific progress and changing user needs. Second, the needs of the relevant domains typically migrate across different device types (server, mobile, wearable, etc.). Supporting each device type with a fundamentally new design costs both design and validation time. Finally, most business deployments must run several different workloads to support their business objectives, thereby requiring a jumbled integration of many such custom designs, each of which is prone to obsolescence. For example, in financial services it is common to use deep learning for analyzing time-series data while using regression trees for other data types. In retail and e-commerce (think eBay, Target, Walmart), image recognition and recommendation systems are workhorse algorithms, while big-data processing runs in the background for inventory management.
The timeframe required to design a chip (12 months), physically manufacture it (4 months for technology nodes like 7nm), and qualify it for widespread deployment (3 months) does not align with supporting algorithms that change every few months. The paradigm of building custom chips to target specific domains is fundamentally broken as a business model.
The result is a choice between two unsustainable and inefficient options: use CPUs and run at untenably low efficiency, or use custom chips that risk rapid obsolescence given the current pace of algorithmic change.
Our analysis of this problem was fueled by two fundamental questions, each a corollary of the other. First, what do custom chips and DSAs do that makes them so much more efficient than programmable CPUs? Second, what do CPUs do that allows them to be programmable? By asking and answering these two questions, we arrived at several surprising breakthroughs.
- Breakthrough 1: DSAs and custom chips are unnecessarily specialized. Regardless of the problem they solve, their designs reuse the same set of fundamental ideas across the board.
- Breakthrough 2: These fundamental ideas can be formally defined as program behaviors and paired with corresponding efficient hardware implementations.
- Breakthrough 3: These particular behaviors can be succinctly and formally defined, allowing compiler tools, computer architects, and chip designers to build upon them and create solutions that are programmable, yet as efficient as custom-designed chips.
The general principle of concurrency is an overarching design approach that allows an algorithm to do many things simultaneously, achieving the final outcome faster than doing them sequentially. Using concurrency, an application may be deconstructed into multiple behaviors performed simultaneously. Concurrency is further exploited in chip architectures to perform work in parallel by favoring design structures with low overhead. We evaluated a variety of DSAs to analyze the architectural decisions made to optimize for concurrency and to identify the behaviors emphasized by each. Universally, we observed the same four behaviors across various applications in various domains:
- Data Gathering: Collecting the input data for computation, including the storage of those values, their movement from one part of the algorithm to another, and their reuse across different parts of the application.
- Computational dataflow or computation: Primitive data operations (e.g., arithmetic, comparison, etc.) to produce algorithm output. Machine learning algorithms may exhibit significant concurrency and regularity such as independent and repetitive computations on large data volumes, or may have long sequences of dependent computations.
- Synchronization: Sequencing data flow among multiple concurrent algorithm stages. Synchronization is the counterpart of concurrency: it ensures that dependent algorithmic steps execute in the correct order.
- Control: Overall algorithm management including initialization, checking termination and convergence, and handling boundary cases or error conditions. Control behavior is often irregular in comparison to computations, and may become a bottleneck when a system is only optimized for regularity.
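The four behaviors above can be spotted even in everyday parallel code. The sketch below is a hypothetical illustration (the function and its chunked partitioning scheme are our own, not drawn from any DSA design) that annotates where each behavior appears in a simple concurrent dot product:

```python
from concurrent.futures import ThreadPoolExecutor

def dot_product(a, b, workers=4):
    """Dot product annotated with the four behaviors (illustrative only)."""
    # Control: initialization checks and boundary cases.
    if len(a) != len(b):
        raise ValueError("vectors must be the same length")
    if not a:
        return 0

    # Data gathering: collect and partition the inputs into chunks,
    # staging the values each concurrent stage will consume.
    chunk = max(1, len(a) // workers)
    slices = [(a[i:i + chunk], b[i:i + chunk])
              for i in range(0, len(a), chunk)]

    # Computational dataflow: independent, repetitive multiply-accumulate
    # work over each chunk -- the regular, concurrency-friendly part.
    def partial_sum(pair):
        xs, ys = pair
        return sum(x * y for x, y in zip(xs, ys))

    # Synchronization: map() delivers every concurrent partial result
    # before the dependent final reduction combines them.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_sum, slices))
    return sum(partials)
```

A DSA hardens each of these behaviors into dedicated structures (address generators, datapaths, synchronization fabrics, control sequencers); the point of the sketch is only to show that the four behaviors are separable concerns even in software.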
Observing The Behaviors and Using Them
The behaviors can be efficiently implemented in hardware and software. Through careful analysis, we have identified how each behavior interacts with the others and how each contributes to the overall productivity of the processor for a given workload.
Our analysis of several custom chips makes it apparent that they consistently use these behaviors. Our key breakthrough is the discovery that the behaviors are used in ways that are not particularly application-specific. The chart below illustrates the designs of four different DSAs: we have simply colored the chip organizations to show how the different behaviors are implemented, demonstrating the universality of the behaviors.