By Karu Sankaralingam
We are in the midst of an automation revolution that is disrupting many industries, from financial services and agriculture to healthcare and retail. This revolution is driven by data and by AI algorithms that change rapidly, as often as every few months. Unfortunately, existing solutions like CPUs (which lack the computational horsepower to scale) and GPUs (which are optimized for today’s common functions but lose efficiency when programmed for other workloads) do not have what it takes to keep up with the demands of these changing algorithms. The compute platform that runs these algorithms and powers the automation revolution must provide not just programmability, but also time-scalability.
Programmability is the characteristic that enables engineers to develop applications for a diverse array of computing needs across many industries using the same platform.
Time-scalability is the ability of a platform to efficiently execute software that was developed years after the platform itself was developed. It is what allows today’s smartphones to run apps, and to support software ecosystems, that were not even imagined when smartphone chips first appeared.
The old chip paradigms are out
The three established chip paradigms (Von Neumann, ASIC, and Dataflow) have well-known limitations that make them ill-suited to the demands of today’s applications and markets. Von Neumann designs are extremely general purpose but too inefficient. ASICs are too rigid for many fast-moving domains. Dataflow straddles this tradeoff, but becomes cripplingly inefficient when “irregular” control, communication, and synchronization appear even sparingly in an application.
All three impose an undue burden on the hardware designer, the software developer, or the compiler developer (in some cases all three!), and in so doing miss the forest for the trees.
Composable Behavior Execution: A new chip paradigm
SimpleMachines’ Composable Behavior Execution presents a new chip paradigm that provides both efficient execution and adaptability to rapid changes in software needs. This new chip paradigm takes a holistic view of algorithms, software, compilers, and hardware, and acknowledges that behaviors transcend and cut through all of these layers.
Composable Behavior Execution is a clean-slate design that breaks away from a decades-old approach. Conventional CPUs execute one "instruction" or line of code at a time, and the chip has no knowledge of the data or of that instruction's role in the program as a whole. SimpleMachines’ chip instead directly manipulates and understands program properties: data size and shape, and whole-program size and shape. With this global information, our software stack transforms the chip's storage and execution mechanisms on the fly to match the application's data and computation patterns, achieving the same effect as having a custom chip built for that application. These ideas came out of 64 person-years of research, 6 PhDs, 7 best-paper awards, 13 patents, and a further 20 invention disclosures.
How it works
SimpleMachines’ compiler and chip hardware are based on four fundamental behaviors that are universal and central to many algorithms: 1) operand communication, 2) synchronization, 3) computation, and 4) control. Our compiler can take any program at the TensorFlow/ONNX/PyTorch graph level and deconstruct it into these four behaviors. Our chip implementation directly implements these four behaviors, creating an engine that runs as efficiently as a customized chip.
Our accompanying proprietary and patented chip design implements these four behaviors as first-order primitives. This gives us a platform that is completely under software control, yet runs each application at the efficiency a fully customized chip would achieve for it.
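To make the idea of deconstructing a graph-level program into four behaviors more concrete, here is a minimal sketch in Python. It is purely illustrative and is not SimpleMachines’ compiler: the op kinds, the `OP_BEHAVIOR` mapping, and the `decompose` function are all hypothetical names made up for this example, and a real compiler would work on an actual TensorFlow/ONNX/PyTorch graph rather than a list of tuples.

```python
from enum import Enum


class Behavior(Enum):
    """The four behaviors named in the article."""
    OPERAND_COMMUNICATION = "operand communication"
    SYNCHRONIZATION = "synchronization"
    COMPUTATION = "computation"
    CONTROL = "control"


# Hypothetical mapping from toy op kinds to behaviors (an assumption made
# for illustration; the article does not specify any such table).
OP_BEHAVIOR = {
    "load": Behavior.OPERAND_COMMUNICATION,
    "store": Behavior.OPERAND_COMMUNICATION,
    "matmul": Behavior.COMPUTATION,
    "relu": Behavior.COMPUTATION,
    "barrier": Behavior.SYNCHRONIZATION,
    "loop": Behavior.CONTROL,
}


def decompose(graph):
    """Group the named ops of a toy graph by the behavior they exercise.

    `graph` is a list of (op_name, op_kind) pairs in program order.
    """
    groups = {behavior: [] for behavior in Behavior}
    for op_name, op_kind in graph:
        groups[OP_BEHAVIOR[op_kind]].append(op_name)
    return groups


# A toy one-layer network expressed as (name, kind) pairs.
toy_graph = [
    ("batch_loop", "loop"),
    ("load_weights", "load"),
    ("load_inputs", "load"),
    ("fc1", "matmul"),
    ("act1", "relu"),
    ("wait_all", "barrier"),
    ("write_out", "store"),
]

behaviors = decompose(toy_graph)
for behavior, ops in behaviors.items():
    print(f"{behavior.value}: {ops}")
```

Even in this toy form, the grouping shows why the balance between behaviors differs from one application to the next: a different graph would put more or fewer ops into each of the four buckets.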
A typical programmable chip has a pipeline (fetch, decode, execute, memory, and writeback) that incurs enormous overheads. A customized chip simply implements one algorithm and gives up flexibility. Our Composable Behavior engine architecture instead implements the four behaviors as coarse-grained blocks on chip, interconnecting them and allowing them to interact with each other. The program’s machine code representation describes the dependencies (and concurrency) among these four aspects of program execution for each coarse-grained phase of the program. For any given application, the relative balance and interaction between these behaviors changes, and is controlled and orchestrated by our dynamic run-time engine.
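As a rough illustration of what it means for a program representation to describe dependencies and concurrency between behaviors within a phase, the sketch below models one phase as a small dependency graph and groups its tasks into waves that could run concurrently. This is an assumption-laden toy, not the actual machine code format or run-time engine: the task names, the `phase` dictionary, and the `concurrency_waves` helper are hypothetical, and only Python's standard `graphlib` module is used.

```python
from graphlib import TopologicalSorter


def concurrency_waves(deps):
    """Return lists of tasks; tasks in the same list have no mutual
    dependencies and could, in principle, proceed concurrently.

    `deps` maps each task to the set of tasks it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# One hypothetical coarse-grained phase: control kicks off two independent
# operand streams, computation needs both, and synchronization closes it.
phase = {
    "control": set(),
    "comm_weights": {"control"},
    "comm_activations": {"control"},
    "compute": {"comm_weights", "comm_activations"},
    "sync": {"compute"},
}

waves = concurrency_waves(phase)
for step, tasks in enumerate(waves):
    print(step, tasks)
```

Here the two operand-communication tasks land in the same wave because nothing orders them relative to each other, which is the kind of concurrency information a per-phase dependency description makes explicit.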
Our software stack automatically performs, on the fly, the design, implementation, and synthesis tasks that take an ASIC designer months for each custom chip. This software solution is possible because of advances in machine learning, mathematical optimization (particularly integer linear programming), compiler technology, and chip architecture.
Efficiency, programmability, and time-scalability
SimpleMachines’ platform provides computational efficiency together with programmability and time-scalability. It is based on a technology breakthrough that allows our compiler to decompose an algorithm into its fundamental behaviors and provides specific hardware acceleration for each behavior.