The future of computing performance lies in software and hardware parallelism. Simply put, application programs must be expressed by splitting work into numerous computations that execute on separate processors, which communicate only occasionally or, better yet, never communicate at all.
Moore’s law in software abstraction
For much of the last 30 years, programmers have not needed to rewrite their software in order for it to get faster. Software was expected to run faster on the next generation of hardware as a result of the performance gained by shrinking transistors and squeezing more of them onto a piece of silicon. Hence, programmers focused their attention on designing and building new applications that ran correctly on existing hardware but, in anticipation of the next generation of faster hardware, were often too compute-intensive to be effective. The demand for next-generation hardware has been significantly defined by software pressure [1].
The sequential-programming model evolved in that ecosystem as well. To create more capable software, developers relied heavily on high-level sequential programming languages and software abstraction, reusing component software and libraries for common tasks.
Moore’s law helped to propel the advancement of sequential-language abstractions, since increasing processor speed covered their costs. For example, early sequential computers were programmed in assembly-language statements, which have a 1:1 mapping to the instructions the computer executes. Backus and his colleagues recognized that assembly-language programming was difficult, and in 1957 they released the first implementation of Fortran for the IBM 704 computer. The IBM programmers wrote in Fortran, and a compiler then translated the Fortran into the computer’s assembly language. The team made these claims:
- Programs will contain fewer errors.
- It will take less time to write a correct program in Fortran.
- The performance of the program will be comparable to that of assembly language.
Understanding the “memory wall”
On top of the programming changes required to move from single-core to multi-core processors, software developers are finding it ever more challenging to deal with the growing gap between processor performance and memory-system performance, often referred to as the “memory wall.”
The memory wall reflects a continuing shift in the balance between the costs of computational and memory operations, and it adds to the complexity of achieving high performance. Effective parallel computation requires coordinating computations and data: the system must collocate each computation with its data and with the memory in which that data is stored. While processors have achieved great performance gains from technical advancements, main-memory bandwidth, access energy, and latency have scaled at a lower rate for years. The result is a gap between processor performance and memory performance that has become more and more significant. On-chip cache memory bridges the gap partially, but even a cache with a state-of-the-art algorithm for predicting the next operands needed by the processor cannot close it effectively. The aggregate rate of computation on a single chip will continue to outpace improvements in main-memory capacity and performance. Thus, the advent of chip multiprocessors means that the bandwidth gap will almost certainly continue to widen.
To keep memory from severely limiting system power and performance, applications must have locality, and the amount of that locality must increase. To boost locality, it is imperative to design software in a manner that reduces (1) communication between processors and (2) data transfer between memory and processors. In other words, move less data, and move it over shorter distances.
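As a rough illustration of that principle, the sketch below splits one computation into independent pieces so that each worker receives only a two-integer description of its slice and sends back a single number. The function names (`split_range`, `local_sum_of_squares`, `parallel_sum_of_squares`) are hypothetical and chosen for this example only. It uses Python's standard thread pool purely to show the structure; threads in CPython do not provide true CPU parallelism, so this is a sketch of the communication pattern and not a performance recipe.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(n, workers):
    """Split range(n) into `workers` contiguous (start, stop) pairs."""
    step, extra = divmod(n, workers)
    bounds, start = [], 0
    for i in range(workers):
        # The first `extra` slices get one additional element each.
        stop = start + step + (1 if i < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def local_sum_of_squares(bounds):
    """Work only on this worker's own slice and return one number."""
    start, stop = bounds
    total = 0
    for x in range(start, stop):
        total += x * x
    return total


def parallel_sum_of_squares(n, workers=4):
    """Sum x*x for x in range(n) using independent workers.

    Each worker is handed two integers and hands back one, so the
    workers never communicate with each other while computing.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(local_sum_of_squares, split_range(n, workers)))
    # The only combining step: add up one partial result per worker.
    return sum(partials)
```

The design choice to note is what crosses the boundary between workers: a description of the work goes in and one partial result comes out, while the bulk of the data is produced and consumed where the computation runs.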
Mind the Gap
With the benefits of Moore’s law winding down, the burden of increasing application performance should, ideally, be shared by software developers and hardware designers. But emerging technologies may blur those responsibilities with an interesting hybrid solution. Recent work by companies like SimpleMachines has produced new software and new hardware that work in concert to close this gap, without the original software developer needing to be aware of the underlying transformation that takes place and without the hardware designer needing to know what application will be running. These technologies rest on two breakthroughs: (1) a new compiler design that focuses on understanding application intent rather than individual instructions and then converts the algorithm into a series of processes, each optimized to reduce data transfer, and (2) a chip architecture that dynamically changes the computation flow to address the locality issue based on the algorithm's needs. Together, these breakthroughs offer a radically new way to efficiently solve a problem that would otherwise be extremely time- and capital-intensive to address, and they do so in a way that is designed to hold up as later hardware advances would otherwise continue to widen that gap.