Objective:
Architectural rigidity restricts a multiprocessor to a given application domain, demanding a redesign for every application. In this research, we propose to merge a set of heterogeneous architectural diversities into a single architectural template. The concept is to introduce application-specific customization via a single-cycle, area-efficient run-time reconfiguration. The method proposed is independent of the type of the individual processors.
Reconfigurable Multiprocessor Template:
A fixed architecture often leaves resources unused or inefficiently used, depending on the application being executed. This lowers resource efficiency, raises power consumption and degrades overall system performance. To balance application-domain customization against retaining the base instruction-set architecture of a processor, a generalized template is proposed. The template comprises four processors organized in a cluster. For scalability, such clusters can be interconnected via a network-on-chip. Reconfigurable connectivity is added as an architectural enhancement to facilitate inter-processor communication. Multiplexers and associated control structures are added between stages of the instruction pipeline to allow interchangeable connectivity between processors. This interchangeability via multiplexing can be defined as a reconfigurable inter-pipeline interconnect. This feature extends the usability of a fixed collection of processors to applications with varying resource requirements and degrees of parallelism. The building blocks within a processor are treated as distributed resources, accessible by all (or a subset of) processors. This modular and reconfigurable architecture has been implemented on our QuadroCore multiprocessor, but can be extended to any multiprocessor organization to improve resource efficiency, tuned to application requirements.
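The inter-pipeline interconnect can be pictured with a small behavioral model. The sketch below is illustrative only: the class name `InterPipelineInterconnect`, the per-processor select list, and the string payloads are assumptions made for this example, not the QuadroCore implementation.

```python
from dataclasses import dataclass, field
from typing import List

NUM_CORES = 4  # the template clusters four processors


@dataclass
class InterPipelineInterconnect:
    """Hypothetical model of one multiplexer layer between two pipeline stages.

    select[i] names the processor whose upstream-stage output is routed
    to the downstream stage of processor i.
    """
    select: List[int] = field(default_factory=lambda: list(range(NUM_CORES)))

    def route(self, upstream_outputs: List[str]) -> List[str]:
        # Each downstream stage reads exactly one upstream output via its mux.
        return [upstream_outputs[src] for src in self.select]


# Default wiring: every processor keeps its own pipeline (identity routing).
mux = InterPipelineInterconnect()
fetched = ["i0", "i1", "i2", "i3"]
independent = mux.route(fetched)

# Reconfigured wiring: all downstream stages read processor 0's output.
mux.select = [0, 0, 0, 0]
broadcast = mux.route(fetched)
```

Changing only the select values changes which processor feeds which stage; the stages themselves are untouched, which is the sense in which building blocks act as shared resources.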
Reconfiguration Design Space
To adapt the architecture to the application, the type of inter-processor synchronization and communication, and the granularity of parallelism, can be altered within the template. These variations define the reconfiguration design space. Further optimizations, such as varying the word length or sharing building blocks from processors that are unused or faulty, enhance resource efficiency (in terms of power savings and area utilization). All these alterations are introduced via a single-cycle, area-efficient run-time reconfiguration.
Our RISC-based multiprocessor, QuadroCore, has an automated compiler-driven design flow that determines the best set of configurations (from a known set of configurations) for the application using standard program analysis techniques. The reconfiguration scheme has a very low overhead in both time and area, so the best-suited configuration can be used without a significant impact on execution time or operating frequency.
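As a rough illustration of choosing from a known set of configurations, the sketch below picks the cheapest configuration per program region and reports where the choice changes. It is not the QuadroCore design flow: the function name `select_configurations`, the configuration labels, and the cycle estimates are all hypothetical, and a real flow would derive its estimates from program analysis.

```python
def select_configurations(estimates):
    """Pick one configuration per program region.

    estimates: list of {config_name: estimated_cycles} dicts, one per region
               (hypothetical input; assumed to come from program analysis).
    Returns (chosen, boundaries): the chosen configuration for each region and
    the indices of regions whose configuration differs from the previous one.
    """
    chosen = [min(region, key=region.get) for region in estimates]
    boundaries = [i for i in range(1, len(chosen)) if chosen[i] != chosen[i - 1]]
    return chosen, boundaries


# Made-up estimates for three regions of some application.
regions = [
    {"independent": 900, "simd": 400, "register_bank": 950},
    {"independent": 700, "simd": 350, "register_bank": 720},
    {"independent": 800, "simd": 820, "register_bank": 500},
]
chosen, boundaries = select_configurations(regions)
```

In this made-up case the first two regions share a configuration, so only one switch is needed, at the start of the third region.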
Figure 1 shows a few sample configurations feasible in the QuadroCore. The register bank configuration allows the first processor to access the register files of the remaining processors to address high register pressure in the application. A SIMD configuration broadcasts the same instruction stream to all the processors, for cases where all the processors perform the same set of operations on different data streams. Variable word-length ALUs are configured by merging two or more neighbouring ALUs, which speeds up the application. All these configurations directly result in power savings, as the unused resources are power-gated. These configurations together encompass the reconfiguration design space.
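Two of these configurations can be sketched behaviorally. The code below assumes that merging neighbouring ALUs means chaining their carries so that two 32-bit slices handle one 64-bit operand; the source does not specify the merging mechanism, so this is one plausible reading. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def alu32_add(a, b, carry_in=0):
    """One 32-bit ALU slice: returns (sum, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def merged_add64(a, b):
    """Two neighbouring 32-bit slices chained through the carry (assumed scheme).

    The result wraps modulo 2**64, as a 64-bit adder would.
    """
    lo, carry = alu32_add(a & MASK32, b & MASK32)
    hi, _ = alu32_add(a >> 32, b >> 32, carry)
    return (hi << 32) | lo


def simd_broadcast(instruction, data_streams):
    """SIMD configuration: one instruction applied to each processor's own data."""
    return [instruction(x) for x in data_streams]


wide_sum = merged_add64(0xFFFFFFFF, 1)            # carry crosses the slice boundary
lanes = simd_broadcast(lambda x: x + 1, [1, 2, 3, 4])
```

Without merging, a 64-bit addition on a 32-bit core takes an add followed by an add-with-carry; chaining two slices performs both halves in one operation, which is where a speedup would come from under this assumption.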
Instruction Stream as the Configuration Stream
A special reconfiguration instruction executed at run time connects the building blocks within the multiprocessor cluster. Depending on the configuration demanded by the application, these instructions are executed at boundaries between regions where a change in resource requirements is anticipated. The reconfiguration instruction itself acts as the configuration data that defines the functionality of the reconfigurable interconnects between the intermediate stages of the instruction pipeline. Hence, this methodology needs no explicit reconfiguration controller. Nor does it need a separate configuration memory space, since the configuration data is embedded within the instruction stream.
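The idea of carrying configuration data in the instruction stream can be sketched as follows. The encoding here is invented for illustration: the opcode value `RECONF_OPCODE`, the 6-bit opcode field, and the four 2-bit select fields are assumptions and are not taken from the NCore or QuadroCore instruction set.

```python
RECONF_OPCODE = 0x3F  # hypothetical opcode value, not from the NCore ISA


def encode_reconf(selects):
    """Pack four 2-bit mux selects into a hypothetical 32-bit instruction word."""
    word = RECONF_OPCODE << 26
    for core, src in enumerate(selects):
        word |= (src & 0x3) << (2 * core)
    return word


def decode_reconf(word):
    """Return the mux selects if `word` is a reconfiguration instruction, else None."""
    if (word >> 26) & 0x3F != RECONF_OPCODE:
        return None  # ordinary instruction: interconnect is left unchanged
    return [(word >> (2 * core)) & 0x3 for core in range(4)]


def run(stream, selects):
    """Walk an instruction stream, updating the interconnect state in-line."""
    for word in stream:
        new_selects = decode_reconf(word)
        if new_selects is not None:
            selects = new_selects  # takes effect as the instruction is decoded
    return selects


final = run([0, encode_reconf([0, 0, 0, 0]), 0], [0, 1, 2, 3])
```

Because the select values sit in the instruction word, decoding the instruction is all that is needed to apply them; the model has no separate controller or configuration store.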
Performance Analysis and Future Work
The QuadroCore architecture comprises four existing 32-bit RISC processors called NCore, mapped onto UMC's 90 nm standard cells. Without altering the base instruction set, additional instructions were introduced that define the connectivity between the processor building blocks. The area overhead due to reconfigurability was about 8%, and the loss in operating frequency was about 3%.


