Background and Conventional Wisdom
Dataflow machines have been proposed and explored for decades, with recurring promises of greater computing efficiency. They are cast as a solution to the inefficiency of von Neumann computing, which executes one instruction at a time. Historically, these designs failed to deliver the promised efficiency, and conventional general-purpose processors came to dominate. Those conventional processors instead adopted out-of-order execution, a restricted form of dataflow that uses hardware circuits to dynamically identify which instructions can execute and when. This approach consumes large amounts of power and transistor area, making the CPUs exorbitantly large and expensive. Recently, dataflow architectures have generated renewed interest with hopes of addressing some of the compute challenges in machine learning.
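The dataflow firing rule described above can be illustrated with a small sketch (hypothetical and simplified, not any machine's actual scheduler): each instruction fires as soon as all of its producers have completed, rather than in program order, so independent instructions execute in the same cycle.

```python
# Illustrative sketch of dataflow scheduling: an instruction fires as soon
# as all of its input operands are ready, not in program order. Assumes
# unit latency and unlimited functional units (both simplifications).

def dataflow_schedule(deps):
    """deps: {instr: set of producer instrs}. Returns the cycle in which
    each instruction fires under eager dataflow firing."""
    cycle = {}
    remaining = set(deps)
    t = 0
    while remaining:
        # Ready = every producer has already fired in an earlier cycle.
        ready = {i for i in remaining if all(p in cycle for p in deps[i])}
        for i in ready:
            cycle[i] = t
        remaining -= ready
        t += 1
    return cycle

# A diamond dependence graph: b and c both depend on a; d depends on both.
deps = {"a": set(), "b": {"a"}, "c": {"a"}, "d": {"b", "c"}}
print(dataflow_schedule(deps))  # b and c fire in the same cycle
```

A sequential machine would need four issue slots for these four instructions; the dataflow firing rule completes them in three cycles because b and c are independent.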
We observe that the past challenges of dataflow remain. Unless they are addressed, these architectures are unlikely to provide efficiency benefits for machine learning, just as they failed to in other domains. The particular deficiency of dataflow is that control speculation is limited by the difficulty of implementing dataflow-based squashing. Additionally, the latency cost of explicit data communication can be prohibitive. Finally, the compilation challenges have proven hard to surmount: the massive scale and complexity of (application-level) graphs and the richness of operators in modern deep learning make the compiler problem excessively hard.
What remains unexplored is the fine-grained interleaving of explicit dataflow with a control processor, that is, the theoretical and practical limits of switching at low cost between an explicit-dataflow hardware/ISA and a control processor's von Neumann ISA. Figure 1(a) shows a logical view of such a hybrid architecture, and Figure 1(b) shows the capability of this architecture to exploit fine-grained (thousands to millions of instructions) application phases.
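The fine-grained interleaving described above can be sketched as a simple cost model (entirely hypothetical; the switch penalty and cycle counts are illustrative assumptions, not measured values): a controller chooses, phase by phase, whether to offload to the dataflow substrate, and offloading pays only when a phase's dataflow savings exceed the cost of transferring execution between substrates.

```python
# Hypothetical cost-model sketch for fine-grained substrate interleaving.
# A phase is offloaded to the dataflow engine only if its savings beat the
# fixed penalty for migrating live state between substrates.

SWITCH_COST = 50  # cycles to switch substrates (assumed, illustrative)

def choose_substrates(phases):
    """phases: list of (cycles_on_ooo, cycles_on_dataflow) tuples.
    Greedy choice per phase, charging SWITCH_COST whenever the chosen
    substrate differs from the one the previous phase ran on."""
    plan, prev = [], "ooo"
    for ooo, df in phases:
        cost_df = df + (SWITCH_COST if prev != "dataflow" else 0)
        cost_ooo = ooo + (SWITCH_COST if prev == "dataflow" else 0)
        choice = "dataflow" if cost_df < cost_ooo else "ooo"
        plan.append(choice)
        prev = choice
    return plan

# Long, regular phases get offloaded; a short phase stays on the core.
print(choose_substrates([(1000, 400), (60, 40), (30, 80)]))
```

The point of the sketch is that switching must be cheap relative to phase length: with phases of only thousands of instructions, even a modest per-switch cost would erase the benefit, which is why low-cost switching is central to the hybrid design.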
Figure 1. Exploiting dynamic program behavior. (a) The logical view of a heterogeneous architecture, where von Neumann and dataflow substrates are integrated into the same cache hierarchy. (b) The architecture preference changes over time during application execution.
Framing the problem this way led us to important and surprising discoveries, and raised interesting questions: Are the benefits of fine-grained interleaving of execution models significant enough? How might one build a practical, small-footprint dataflow engine capable of serving as an offload engine? Which types of control cores can substantially benefit? Why are certain program region types suitable for explicit dataflow execution?
The potential benefits of an ideal hybrid architecture (ideal dataflow plus a four-wide out-of-order core) are shown in Figure 2(a). Above each bar is the percentage of execution time spent in dataflow mode.
Figure 2(b) shows the overall energy and performance trends for three different control processors and how integrating dataflow into them improves their performance.