Universal Compiler

Executive Summary: Spatial architectures provide energy-efficient computation, but require effective scheduling algorithms. Existing heuristic-based approaches suffer from low compiler/architect productivity, provide little insight into optimality, and port poorly across architectures.
We develop a spatial-scheduling framework by utilizing constraint-solving theory, and find that architecture primitives and scheduler responsibilities can be related through five abstractions:
computation placement, data routing, event timing, resource utilization, and the optimization objective. We encode these responsibilities as 20 Integer Linear Programming constraints, and create schedulers for three architectures. Our results show a general declarative approach is implementable, practical, and can outperform specialized schedulers.

Background and Conventional Wisdom

Hardware specialization has emerged as an important way to sustain microprocessor performance improvements in the face of transistor energy-efficiency challenges and the inefficiency of general-purpose processing. The fundamental insight of many specialization techniques is to “map” large regions of computation to the hardware, breaking away from instruction-by-instruction pipelined execution and instead adopting a spatial architecture paradigm. The focus of architecture-specific schedulers has typically been on developing polynomial-time algorithms that approximate the optimal solution using knowledge of the architecture.

Problem

While heuristic-based approaches are popular and effective, they have three problems:
  • i) poor compiler-developer/architect productivity, since new algorithms, heuristics, and implementations are required for each architecture;
  • ii) little insight into the optimality of the solutions produced;
  • iii) sandboxing of heuristics to specific architectures.
In short, the spatial-architecture scheduling problem has been solved numerous times since the early 1980s, in ad-hoc ways each targeted at one hardware architecture. No solution has effectively addressed the problem in a general way.

Our Discovery

We have developed an elegant, mathematically-based general formulation that works for any spatial architecture, as long as the hardware can be expressed as a graph. It provides a general solution to a problem previously solved numerous times in ad-hoc, specialized ways, and despite its generality it matches or outperforms the specialized approaches.
By unifying the experience of architecture-specific spatial schedulers, the principle of architectural generality, and the mathematical power of integer linear programming, we provide a constraint-centric formulation that delivers high developer productivity, provable properties on results, and true architectural generality. It achieves this generality by first abstracting the scheduling problem into five scheduling responsibilities and then developing constraint-theory formulations of them. Specifically, these abstractions are: i) placement of computation on the hardware substrate; ii) routing of data on the substrate to reflect the computation semantics; iii) managing the timing of events in the hardware; iv) managing utilization to orchestrate concurrent usage of hardware resources; and v) forming optimization objectives to model performance goals.
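To make responsibilities (i) placement and (iv) utilization concrete, the sketch below encodes them as the 0/1 decisions an ILP would make: M[v][n] = 1 iff computation v sits on hardware node n, with "each computation on exactly one node" and "at most MAXN computations per node". The names (comps, nodes, MAXN) are illustrative, not the paper's notation, and a real scheduler hands these constraints to an ILP solver (the paper uses GAMS); here they are brute-forced to keep the sketch dependency-free.

```python
# Illustrative sketch of the placement + utilization responsibilities.
# A real implementation emits these as ILP constraints; we enumerate
# assignments directly since the toy instance is tiny.
from itertools import product

comps = ["add", "mul", "ld"]   # computation-graph vertices (invented)
nodes = ["n0", "n1"]           # hardware compute nodes (invented)
MAXN = 2                       # utilization bound: computations per node

def feasible(placement):
    # utilization constraint: no node hosts more than MAXN computations
    return all(sum(1 for v in comps if placement[v] == n) <= MAXN
               for n in nodes)

# each assignment places every computation on exactly one node by construction
solutions = [dict(zip(comps, assign))
             for assign in product(nodes, repeat=len(comps))
             if feasible(dict(zip(comps, assign)))]
print(len(solutions), "feasible placements")
```

Of the 2^3 = 8 possible assignments, only the two that pile all three computations onto one node violate the utilization bound, leaving six feasible placements; the ILP objective (responsibility v) would then select among them.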
The table below shows how these responsibilities map across three very different architectures.

| # | Architecture feature | Scheduler responsibility | TRIPS | DySER | PLUG |
|---|---|---|---|---|---|
| 1 | Compute HW organization | Placement of computation | Homogeneous compute units | Heterogeneous compute units | Homogeneous compute units |
| 2 | Network HW organization | Routing of data | 2-D grid, dimension-order routing | 2-D grid, unconstrained routing | 2-D multi-network grid, dimension-order routing |
| 3 | HW timing and synchronization | Manage timing of events | Data-flow execution and dynamic network arbitration | Data-flow execution and conflict-free network assignment + flow control | Hybrid data-flow and in-order execution with static compute and network timing |
| 4 | Concurrent HW usage within a block | Manage utilization | 8 slots per compute unit; reg-tile, data-tile | No concurrent usage; dedicated compute units, switches, links | 32 slots per compute unit; bundled links and multicast communication |
| 4 | Concurrent HW usage across blocks | Manage utilization | Concurrent execution of different blocks | Concurrent usage across blocks with pipelined execution | Pipelined execution across different tiles |
| 5 | Performance goal (architecturally mandated) | Naturally enforced by ILP constraints | Any assignment legal | Throughput | Throughput |
| 5 | Performance goal (high efficiency) | ILP objective formulation | Throughput and latency | Latency & latency mismatch | Latency |
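Responsibility (iii), managing event timing, and the latency objectives of responsibility (v) take a simple form: for every data edge (u, v) routed with latency lat(u, v), the constraint T[v] >= T[u] + lat(u, v) must hold, and a latency objective minimizes the largest T. The sketch below (invented dataflow graph and latencies, not from the paper) computes the tightest such timing for a fixed placement and routing, which reduces to a longest path; the full ILP handles the general case where routing and timing are decided jointly.

```python
# Hedged sketch: event-timing constraints T[v] >= T[u] + lat(u, v) with
# the latency objective min max_v T[v]. Once placement and routing are
# fixed, the tightest feasible T is a longest path over the dataflow graph.
edges = {("ld", "add"): 2, ("ld", "mul"): 2, ("add", "st"): 1, ("mul", "st"): 3}
order = ["ld", "add", "mul", "st"]  # a topological order of the graph

T = {}
for v in order:
    # earliest time all of v's inputs have arrived (0 if v has no inputs)
    T[v] = max((T[u] + lat for (u, w), lat in edges.items() if w == v), default=0)

print(T)  # T["st"] is the block's end-to-end latency
```

With these made-up latencies, the critical path runs ld → mul → st, giving a block latency of 5; a latency-mismatch objective (as in the DySER row above) would additionally penalize the gap between the add and mul paths arriving at st.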
Our contribution can be expressed simply: we demonstrate that a general spatial-architecture scheduler can be constructed from these five scheduling abstractions, and we show that they can be expressed using constraint-solving theory, specifically integer linear programming.
We show that standard modeling languages and solvers allow a succinct implementation (the general formulation is fewer than 50 lines of GAMS), provide solutions in tractable run-times, and produce mappings that are competitive with, or significantly better than, those of specialized schedulers across three diverse architectures. Our formal contribution is a set of 20 equations, presented in the full paper. The table below summarizes how a novel chip design can be supported by our compiler.
| Architecture description | ILP modeling and scheduler responsibility | ILP constraints |
|---|---|---|
| Compute HW organization | | |
| 16 tiles, 6 routers per tile | Each tile is 7 nodes in H: one in N and six in R | Gen. ILP framework |
| 4 mem-banks per tile | Handled with utilization | Gen. ILP framework |
| 32 cores per tile | Handled with utilization | Gen. ILP framework |
| Network HW organization | | |
| 2-D nearest-neighbor mesh | Node n connected to r; r connected to 4 neighbors | Gen. ILP framework |
| Dimension-order routing | I(L,L) configured for dimension-order routing | Gen. ILP framework |
| Multicast messages deliver values at every intermediate node on path | Enforce mapping of multicast messages' links to computation nodes | Eq. 25 |
| HW timing and synchronization | | |
| Code-scheduling of network send instructions | Variables for send timing and receive time | Δ(E); Γ(E) |
| Send instructions scheduled to avoid network conflicts | Variables for delaying send/receive timing by padding with no-ops | Eq. 26, 27a-c, 28 |
| Concurrent HW usage | | |
| 4 mem-banks per tile | Manage utilization (MAXN = 4) | Gen. ILP framework |
| Dedicated network per message | Manage utilization (MAXL = 1) | Gen. ILP framework |
| Mutually exclusive activation of nodes in G | Concept of vertex bundles; utilization refined | Eq. 29, 30, 31 |
| Code-length limit (maximum 32) for all code on a tile | Manage utilization, combined with vertex bundles | Eq. 32, 33, 34 |
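The link-utilization constraint above (MAXL = 1, a dedicated network per message) illustrates how utilization generalizes from compute nodes to links: the number of messages routed over any link must not exceed MAXL, so two routes sharing a link is an ILP infeasibility. The sketch below checks that condition directly for a pair of invented routes; node and message names are made up for illustration.

```python
# Illustrative check of the link-utilization responsibility: count how
# many routed messages occupy each link and flag links exceeding MAXL.
from collections import Counter

MAXL = 1  # dedicated network per message, as in the table above
routes = {
    "msg0": [("n0", "n1"), ("n1", "n2")],
    "msg1": [("n3", "n1"), ("n1", "n2")],  # shares link (n1, n2) with msg0
}

use = Counter(link for path in routes.values() for link in path)
violations = [link for link, count in use.items() if count > MAXL]
print(violations)
```

In the ILP, the scheduler would not detect this conflict after the fact but rule it out up front, forcing the solver to pick routes whose per-link usage stays within MAXL.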
The general, declarative approach allows schedulers to be specified, implemented, and evaluated rapidly. In practical terms, we compile code and create object files in seconds, whereas the same task requires hours or days of computation for FPGAs.

Further reading

We encourage you to look deeper at our award-winning research publication. This breakthrough was recognized with multiple premier research honors: a PLDI Distinguished Paper Award and a Communications of the ACM Research Highlights nomination by SIGPLAN.