Composable Dataflow Execution

Executive Summary: Computational Dataflow is a program behavior that requires a relatively easy-to-implement hardware organization. The behavior formally defines the flow of values between different operations in a program. Historically, when dataflow has been implemented as a hardware architecture the resulting systems have invariably fallen short of their promise to deliver high-performance. We observe that these programs almost always intertwine the dataflow behavior with program change-in-control behavior, and that this combination has hobbled past efforts. We assert that the simple, yet novel idea of embedding control into a dataflow chip’s organization results in improvements that surpass the long-held performance and efficiency barriers.

 

Background and Conventional Wisdom

Historically and in the present, many dataflow machines have been proposed and explored. These proposals promise to achieve a certain level of computinge efficiency. They are cast as a solution to the inefficiency of von Neumann-computing: executing one instruction at a time. Historically these designs have failed to provide such performance efficiency, which has resulted in the domination of conventional, general-purpose processors. These dataflow designs use a technique called out-of-order execution, which identifies what instructions can execute when, using circuits in hardware. This approach consumes extreme amounts of power and transistors, making the CPUs exorbitantly large and expensive. Recently, dataflow architectures have generated renewed interest with hopes of addressing some of the compute challenges in machine learning.

Problem

We observe that past challenges of dataflow remain. Unless those are addressed, these architectures are unlikely to provide efficiency benefits for machine learning, similar to the challenges with other domains. The particular deficiency of dataflow is that control speculation is limited by the difficulty of implementing dataflow-based squashing. Additionally, the latency cost of explicit data communication can be prohibitive. Finally, the compilation challenges have proven hard to surmount. The massive scale and complexity of (application-level) graphs and richness of operators in modern deep-learning, makes the compiler problem excessively hard.

Our Discovery

What is unexplored so far is the fine-grained interleaving of explicit-dataflow with a control-processor – i.e. the theoretical and practical limits of being able to switch with low cost between an explicit-dataflow hardware/ISA and a control processor’s Von Neumann ISA. Figure 1(a) shows a logical view of such a hybrid architecture, and Figure 1(b) shows the capability of this architecture to exploit fine-grain (thousands to millions of instructions) application phases.
graph
Figure 1. Exploiting dynamic program behavior. (a) The logical view of a heterogeneous architecture, where Von Neumann and dataflow are substrates that are integrated to the same cache hierarchy. (b) The architecture preference changes over time during application execution.
When framed this way, we made some important and surprising discoveries and also raised interesting questions. Are the benefits of fine-grained interleaving of execution models significant enough? How might one build a practical and small-footprint dataflow engine capable of serving as an offload engine? Which types of control-cores can substantially benefit? Why are certain program region types suitable for explicit dataflow execution?
The potential benefits of an ideal hybrid architecture (ideal dataflow + four-wide OOO) are shown in Figure 2(a). Above each bar is the percentage of execution time in dataflow mode.
Figure 2(b) shows the overall energy and performance trends for three different control processors and how integrating dataflow into them improves their performance.
graph
Figure 2. Potential of ideal explicit dataflow specialization: (a) heterogeneous ideal dataflow performance and (b) overall tradeoffs. The “ideal” dataflow processor is constrained only by the program’s control and data dependencies, but it is non-control-speculative. For its energy model, only functional units and caches are considered.
To further understand when and why explicit dataflow can provide benefits, we consider the program space along with two dimensions: control regularity and memory regularity. Figure 3 shows our view on how different programs in this space can be executed by other architectures more efficiently than with an out-of-order processor.
The characterization shows the design-space is rich, and building a chip that supports only the dataflow behavior will be insufficient. Instead dataflow must be composable and part of a comprehensive behavior specialization architecture. The principles described above show a paradigm of behavior specialization in which different specialized hardware substrates target codes with certain program behaviors. This paradigm helps merge the concepts of specialization and general-purpose computing by targeting a wide spectrum of program behaviors with a customized chip’s efficiency.
control
Figure 3. Architectures’ effectiveness based on application characteristics of memory and control regularity. Regular program regions (1) are most suited to vector processors, and latency-bound regions (3) are suited to simple processors. In addition to (2), an explicit-dataflow processor is especially apt for regions with high ILP (3) or significant unpredictable control (4).

Further reading

We encourage you to look deeper by looking at our award-winning research publication. This breakthrough was recognized with multiple premier research awards, IEEE Micro Top Picks Selection and a Communication of ACM Research Highlights.