!!Conference or Journal where the paper was published
2010 18th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines

!!Reviewer
João Paulo Pizani Flor

!!Summary
!!!Motivation/Problem
* A significant class of users would benefit from being able to describe data-parallel programs in a regular programming language.
* They would further benefit from being able to synthesize these descriptions into an FPGA fabric.
* Current systems and toolchains for data-parallel programming usually require the user to learn a new language.

!!!Solution
We argue that, for a large class of data-parallel tasks, an existing and widespread programming language like C++, together with regular compilers, is enough to describe data-parallel computations. Furthermore, these descriptions can also be synthesized into an FPGA fabric.

We have implemented an EDSL (Embedded Domain-Specific Language) on top of C++ which targets:
* GPGPU (currently through the DirectX9 API)
* X64 multicore processors
* FPGA circuits

:: {img fileId="317" width="500" rel="box[g]"} ::

* The GPGPU and X64 targets work "on-line": the code is compiled and sent to the target just-in-time.
* In contrast, the FPGA target is currently offline, because it depends on non-standardized toolchains.

The core of the EDSL is the set of parallel array types, in the namespace ParallelArrays. They represent data structures over which parallel operations can be performed.

Below is the complete source code necessary to add two floating-point arrays element-wise, using an FPGA circuit instead of a regular CPU.
The code can easily be adapted to other targets:

{CODE(colors="c",ln="0",wiki="0",rtl="0",ishtml="0")}
using namespace ParallelArrays;
using namespace MicrosoftTargets;
using namespace std;

int main()
{
    // Select the FPGA target; other targets are chosen the same way.
    Target &tgtFPGA = CreateFPGATarget("adder", Virtex5);

    const int size = 5;
    float f1[size] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    float f2[size] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f};

    // Wrap the input data in parallel arrays.
    FPA x = FPA(f1, size);
    FPA y = FPA(f2, size);

    // Builds an expression tree; nothing is computed yet.
    FPA z = x + y;

    // Evaluation of the expression happens here.
    float resultArray[size];
    tgtFPGA.ToArray(z, resultArray, size);
    return 0;
}
{CODE}

!!!Implementation/Why does it work?
* Parallel array objects differ fundamentally from normal objects:
** Their contents can be stored in main memory, in GPU memory or in the BRAMs of an FPGA.
* Overloaded operators and static functions over parallel arrays build an expression tree, which is evaluated and compiled at runtime.

:: {img fileId="318" width="400" rel="box[g]"} ::

* The evaluation of this expression tree happens at the call to the ToArray method of the target.
** tgtFPGA.ToArray(resultingParallelArray, addressToStoreResults, sizeOfResults);
* The MicrosoftTargets library takes care of abstracting details like communication with GPUs, synchronization and translation to VHDL.

!!!Closing remarks and questions left unanswered
* Is it suitable for use in embedded systems?
** Does it work together with toolchains for several processor architectures?
* Very good prospects:
** Accelerator or Accelerator-like approaches could be used to program systems like Intel Stellarton and AMD Fusion.

!!!Interesting references
* Guy E. Blelloch, Jonathan C. Hardwick, Siddhartha Chatterjee, Jay Sipelstein, and Marco Zagha. 1993. Implementation of a portable nested data-parallel language. In Proceedings of the fourth ACM SIGPLAN symposium on Principles and practice of parallel programming (PPOPP '93). ACM, New York, NY, USA, 102-111. DOI=10.1145/155332.155343 http://doi.acm.org/10.1145/155332.155343
* D. Tarditi, S. Puri, J. Oglesby, “Accelerator: using data parallelism to program GPUs for general-purpose uses,” ASPLOS 2006.
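
Returning to the implementation notes above: the idea that overloaded operators only build an expression tree, and that the tree is evaluated when ToArray is called, can be illustrated with a small self-contained sketch. This is not the Accelerator library and makes no claim about how it is implemented. The names Node, LazyArray, CpuTarget, evalAt and addExample are hypothetical, and the "target" here is just a sequential CPU loop standing in for a GPU, multicore or FPGA back end.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical names throughout; this is NOT the Accelerator API.
// One node of the expression tree: either a leaf holding data,
// or a binary operation over two sub-expressions.
struct Node {
    enum class Kind { Leaf, Add, Mul } kind;
    std::vector<float> data;               // used by Leaf only
    std::shared_ptr<const Node> lhs, rhs;  // used by Add and Mul only
};

// A handle to an expression. Operators build new nodes and compute nothing.
class LazyArray {
public:
    LazyArray(const float* values, std::size_t n)
        : node_(std::make_shared<Node>(Node{
              Node::Kind::Leaf, std::vector<float>(values, values + n),
              nullptr, nullptr})) {}

    friend LazyArray operator+(const LazyArray& a, const LazyArray& b) {
        return LazyArray(Node::Kind::Add, a, b);
    }
    friend LazyArray operator*(const LazyArray& a, const LazyArray& b) {
        return LazyArray(Node::Kind::Mul, a, b);
    }

    const Node& root() const { return *node_; }

private:
    LazyArray(Node::Kind k, const LazyArray& a, const LazyArray& b)
        : node_(std::make_shared<Node>(Node{k, {}, a.node_, b.node_})) {}

    std::shared_ptr<const Node> node_;
};

// Recursively evaluate element i of the expression rooted at n.
inline float evalAt(const Node& n, std::size_t i) {
    switch (n.kind) {
        case Node::Kind::Leaf:
            assert(i < n.data.size());
            return n.data[i];
        case Node::Kind::Add:
            return evalAt(*n.lhs, i) + evalAt(*n.rhs, i);
        case Node::Kind::Mul:
            return evalAt(*n.lhs, i) * evalAt(*n.rhs, i);
    }
    return 0.0f;  // unreachable; keeps compilers quiet
}

// Stand-in target: evaluation is deferred until ToArray is called.
struct CpuTarget {
    void ToArray(const LazyArray& expr, float* out, std::size_t n) const {
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = evalAt(expr.root(), i);
        }
    }
};

// Mirrors the element-wise addition example from the summary.
inline std::vector<float> addExample() {
    const float f1[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
    const float f2[5] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f};
    LazyArray x(f1, 5), y(f2, 5);
    LazyArray z = x + y;  // tree only, no arithmetic yet
    std::vector<float> result(5);
    CpuTarget{}.ToArray(z, result.data(), result.size());
    return result;
}
```

A real back end would presumably walk the same kind of tree to emit shader code, vectorized machine code or a hardware description rather than interpreting it element by element; the sketch only shows why the arithmetic can be postponed until the target is known.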