NextGenLog: #ALGORITHMS: "Variable Clock Speeds Multi-Core Development"

Friday, April 13, 2012

#ALGORITHMS: "Variable Clock Speeds Multi-Core Development"

Developing new multi-core processors has been hampered by simulation techniques that don't work well with the divide-and-conquer strategies adopted by parallel processing algorithms using many cores simultaneously. Now MIT claims to have solved these problems with its Arete system that varies the ratio between real clock cycles and simulated clock cycles in a field-programmable-gate array that speeds development while not sacrificing accuracy. R. Colin Johnson

Arete makes use of an Xilinx field-programmable gate array, or FPGA (center). Photo: Thomas.L/flickr

Here is what MIT says about Arete:Most computer chips today have anywhere from four to 10 separate cores, or processing units, which can work in parallel, increasing the chips’ efficiency. But the chips of the future are likely to have hundreds or even thousands of cores.

For chip designers, predicting how these massively multicore chips will behave is no easy task. Software simulations work up to a point, but more accurate simulations typically require hardware models — programmable chips that can be reconfigured to mimic the behavior of multicore chips.

At the IEEE International Symposium on Performance Analysis of Systems and Software earlier this month, researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) presented a new method for improving the efficiency of hardware simulations of multicore chips. Unlike competing methods, it guarantees that the simulator won’t go into “deadlock” — a state in which cores get stuck waiting for each other to relinquish system resources, such as memory. The method should also make it easier for designers to develop simulations and for outside observers to understand what those simulations are intended to do.

Hardware simulations of multicore chips typically use devices called field-programmable gate arrays, or FPGAs. An FPGA is a chip with an array of simple circuits and memory cells that can be hooked together in novel configurations after the chip has left the factory. The chips sold by some small-market manufacturers are, in fact, specially configured FPGAs.

Chip architects using FPGAs to test multicore-chip designs, however, must simulate the complex circuitry found in general-purpose microprocessors. One way to do that is to hook together a lot of the FPGA’s simple circuits, but that consumes so many of them so rapidly that the simulator ends up modeling only a small portion of the whole chip design. The other approach is to simulate the complex circuits’ behavior in stages — using a partial circuit but spending, say, eight clock cycles on a calculation that, in a real chip, would take only one. Traditionally, however, that’s meant slowing down the whole simulation, to allow eight real clock cycles per one simulated cycle.

For a simulation system they’ve dubbed Arete, graduate students Asif Khan and Muralidaran Vijayaraghavan; their adviser, Arvind, the Charles W. and Jennifer C. Johnson Professor of Electrical Engineering and Computer Science; and Silas Boyd-Wickizer, a CSAIL graduate student in the Parallel and Distributed Operating Systems Group, adopted the second approach, but they developed a circuit design that allows the ratio between real clock cycles and simulated cycles to fluctuate as needed. That allows for faster simulations and more economical use of the FPGA’s circuitry.

Every logic circuit has some number of input wires and some number of output wires, and the CSAIL researchers associate a little bit of memory with each such wire. Data coming in on a wire is stored in memory until all the operations that require it have been performed; data going out on a wire is stored in memory until the data going out on the other wires has been computed, too. Once all the outputs have been determined, the input data is erased, signaling the completion of one simulated clock cycle. Depending on the complexity of the calculation the circuit was performing, the simulated clock cycle could correspond to one real clock cycle, or eight, or something in between.

The CSAIL researchers argue that its easier for outside observers — and even for chip designers themselves — to understand what a simulation is intended to do. The researchers’ high-level language, which they dubbed StructuralSpec, builds on the BlueSpec hardware design language that Arvind’s group helped develop in the late 1990s and early 2000s. The StructuralSpec user gives a high-level specification of a multicore model, and software spits out the code that implements that model on an FPGA. Where a typical, hand-coded hardware model might have about 30,000 lines of code, Khan says, a similar model implemented on StructuralSpec might have only 8,000 lines of code.

Kattamuri Ekanadham, a chip researcher at IBM’s T. J. Watson Laboratory, is currently building his own implementation of the MIT researchers’ simulator.
Further Reading