Executive Summary: Spatial architectures provide energy-efficient computation, but require effective scheduling algorithms. Existing heuristic-based approaches offer low compiler/architect productivity, little optimality insight, and low architectural portability.
We develop a spatial-scheduling framework by utilizing constraint-solving theory, and find that architecture primitives and scheduler responsibilities can be related through five abstractions:
computation placement, data routing, event timing, resource utilization, and the optimization objective. We encode these responsibilities as 20 Integer Linear Programming constraints, and create schedulers for three architectures. Our results show a general declarative approach is implementable, practical, and can outperform specialized schedulers.
Background and Conventional Wisdom
Hardware specialization has emerged as an important way to sustain microprocessor performance improvements to address transistor energy efficiency challenges and general purpose processing’s inefficiencies. The fundamental insight of many specialization techniques is to “map” large regions of computation to the hardware, breaking away from instruction-by-instruction pipelined execution and instead adopting a spatial architecture paradigm. The focus of architecture-specific schedulers has typically been on developing polynomial-time algorithms that approximate the optimal solution using knowledge of the architecture.
While heuristic-based approaches are popular and effective, they have three problems:
- i) poor compiler developer/architect productivity since new algorithms, heuristics, and implementations are required for each architecture,
- ii) lack of insight on the optimality of solution,
- iii) sandboxing of heuristics to specific architectures. In short, the problem of spatial architecture scheduling has been solved numerous times since the early 1980s in ad-hoc ways targeted at one hardware architecture. There has not been a solution that effectively solves the problem in a general way.
We have developed an elegant, mathematically-based general formulation that works for any spatial architecture - as long as the hardware can be expressed as a graph. It provides a general solution to a problem previously solved numerous times in ad-hoc specialized ways that match or outperform the specialized approaches, despite its generality.
By unifying the experience of architecture-specific spatial schedulers, the principal of architectural generality, and the mathematical power of integer linear programming, we provide a constraint-centric formulation that provides high developer productivity, provable properties on results, and true architectural generality. It achieves architectural generality by first abstracting the scheduling problem in terms of five scheduling responsibilities, and then developing constraint-theory formulations of them. Specifically, these abstractions are: i) placement of computation on the hardware substrate, ii) routing of data on the substrate to reflect the computation semantics, iii) managing the timing of events in the hardware, iv) managing utilization to orchestrate concurrent usage of hardware resources, and v) forming optimization objectives to model performance goals.
The table below shows how these responsibilities can get mapped across three very different architectures.
|#||Architecture Feature||Scheduler responsibility||TRIPS||DySER||PLUG|
|Placement of computation||Homogeneous compute units||Heterogeneous compute units||Homogeneous compute units|
|2||Network HW organization||Routing of data||2-D grid, dimension-order routing||2-D grid, unconstrained routing||2-D multi-network grid, dimension-order routing|
|3||HW timing and
|Manage timing of events||Data-flow execution and dynamic network arbitration||Data-flow execution and conflict-free network assignment + flow control||Hybrid data-flow and in-order execution with static compute and network timing|
|4||Concurrent HW usage
within a block
|Manage utilization||8-slots per compute-unit, regtile, data-tile||No concurrent usage; dedicated compute units, switches, links||32 slots per compute-unit; bundled links and multicast communication|
|Concurrent HW usage
|Concurrent execution of different blocks||Concurrent usage across blocks with pipelined execution||Pipelined execution across different tiles|
|Naturally enforced by ILP constraints||Any assignment legal||Throughput||Throughput|
|ILP objective formulation||Throughput and Latency||Latency & Latency Mismatch||Latency|
Our contribution can be expressed simply: we demonstrate a general spatial architecture scheduler can be constructed through these five scheduling abstractions and show they can be expressed using constraint-solving theory, specifically integer linear programming.
We show that standard modeling languages and solvers allow a succinct implementation (general formulation is less than 50 lines in GAMS), provide solutions in tractable run-times, and produce mappings that are competitive with or significantly better than those of specialized schedulers considering three diverse architectures. Our formal contribution is a set of 20 equations, as presented in the full paper. The table below summarizes how a novel chip design can be supported by our compiler.
|Architecture||Description||ILP Modeling and scheduler responsibility||ILP Constraints|
|Compute HW organization|
|16 titles, 6 routers per tile||Each tile is 7 nodes in H, one in N, and six in R.||Gen. ILP framework|
|4 mem-banks per tile||Handled with utilization||Gen. ILP framework|
|32 cores per tile||Handled with utilization||Gen. ILP framework|
|Network HW organization|
|2D nearest neighbor mesh||Node n connected to r; r connected to 4 neighbors||Gen. ILP framework|
|Dimension order routing||I(L,L) configure for dimension-order routing||Gen. ILP framework|
|Multicast messages deliver values at every intermediate node on path||Enforce mapping of multicast messages to link to computation node||Eq. 25|
|HW timing and synchronization|
|Code-scheduling of network send instructions||Variables for send timing and receive time||Δ(Ε);Γ(Ε)|
|Send instructions scheduled to avoid n/w conflicts||Variables for delaying send/receive timing by padding with no-ops||Eq. 26, 27a-c, 28|
|Concurrent HW usage||4 mem-banks per tile||Manage utilization (MAXN = 4)||Gen. ILP framework|
|Dedicated network per message||Manage utilization (MAXL = 1)||Gen. ILP framework|
|Mutually exclusive activation of nodes in G||Concept of vertex bundles and utilization refined||Eq. 29, 30, 31|
|Code-length limitations (maximum is 32) handled for all code on title||Manage utilization and combine with vertex bundles||Eq. 32, 33, 34|