|
Jean-Marc Morel from Bull explained that new machines will be parallel machines, and that this will require major changes in the way software is developed. Enhanced computing power will no longer come from a single faster core (processor unit) but from using several cores. Because many domains depend heavily on efficient technical and scientific computing systems, parallel computing is becoming increasingly important for European society and industry.
To meet new scientific and engineering challenges, the HPC community mainly needs:
- faster and more accurate computational mathematics
- new parallel programming models
- methods and tools to detect and extract parallelism (thread extraction)
- enriched directives to express parallelism
- easy-to-use and scalable performance-analysis tools
- powerful correctness checkers and debuggers
- optimized libraries for multi-thread / multi-core, and optimization tools
- enhanced thread management, job scheduling, ...
Denis Barthou from the Université de Versailles Saint Quentin en Yvelines, France presented the project's work packages:
- WP1 addresses project management, guidance and dissemination
- WP2 is about evolving methods and tools to facilitate the development of parallel applications and to make them more efficient
- WP3 provides advanced tools to check, debug, measure, analyse, and optimise parallel applications
- WP4 will extend the Linux Operating System
A host of partners are engaged in evolving the design and programming models for parallel applications. The multi-level parallel programming and execution model for MPSoC is handled by CEA-LIST / LaSTRE. Hybrid parallel programming, which converts OpenMP programmes into hybrid programmes, is being addressed by INT. Application restructuring and code optimization are done by CAPS and UVSQ. The adaptation of message-passing APIs to MPSoC environments is performed by UAB. The method and tool to design on-chip interconnection networks is handled by UAB and Robotiker.
ParMA wants to provide parallel programming tools that apply to any application domain and any parallel architecture, and to integrate them. Among them is the MPI correctness checker MARMOT from HLRS, which is to support the full MPI-2 standard (parallel I/O, one-sided communication, C++). Tracing facilities such as the Open Trace Format and Peruse (an MPI extension) will be used, as well as a performance-analysis toolset for large-scale applications consisting of KOJAK and SCALASCA from JSC and the Vampir framework from TU Dresden (ZIH) and GWT. The Distributed Debugging Tool (DDT) has to deal with large applications.
The operating system work will push support for new multi-core architectures: improving shared and cache memory management to minimize buffer copies and optimize bandwidth; improving cache coherency for better latency; and monitoring the timer and synchronising daemons to reduce disturbances.
One of the project goals is to demonstrate the technology in various application domains, such as:
- Metal forming simulation with INDEED by gns
- Casting process simulation with MAGMASOFT by MAGMA
- 3D-combustion simulation with RECOM-AIOLOS by RECOM
- HPC benchmark by BULL
- Avionics: tracking algorithm and controlled simulation of system by DASSAULT
- Software Defined Radio by Indra
- VR for manufacturing processes by CEA List
All tools and most applications have been installed on the common Bull HPC cluster. Application developers have experimented with the various tools. Test and performance tools have evolved and are starting to be integrated, e.g. KOJAK with Vampir and Marmot with DDT. The code optimization tool (MAQAO) is being adapted for Xeon.
Performance analysis has been carried out for all applications to measure the speed-up on a multi-core architecture.
Exploiting the power of multi-core is not straightforward. With the level of parallelism implemented in a classical HPC application, the speed-up is often below the ideal. Experience with a hybrid application shows that a parallel efficiency of 94% can be obtained only with a particular combination of OpenMP threads and MPI processes.
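To illustrate what searching for such a combination involves, here is a minimal Python sketch. It lists the ways a fixed number of cores can be split between MPI processes and OpenMP threads and computes parallel efficiency as speed-up divided by core count. The function names are hypothetical and not part of any ParMA tool, and the run times must come from real measurements; none are assumed here.

```python
def hybrid_layouts(cores):
    """Return every (mpi_processes, omp_threads) pair that uses all cores."""
    return [(p, cores // p) for p in range(1, cores + 1) if cores % p == 0]


def parallel_efficiency(serial_time, parallel_time, cores):
    """Speed-up divided by core count; 1.0 would be ideal scaling."""
    return (serial_time / parallel_time) / cores


def best_layout(serial_time, measured, cores):
    """Pick the layout with the highest efficiency from measured run times.

    `measured` maps (mpi_processes, omp_threads) to wall-clock time and must
    come from real runs; no timings are assumed in this sketch.
    """
    return max(
        measured,
        key=lambda layout: parallel_efficiency(serial_time, measured[layout], cores),
    )


print(hybrid_layouts(8))  # [(1, 8), (2, 4), (4, 2), (8, 1)]
```

Each layout would be run and timed on the target machine, and `best_layout` then selects the split with the highest efficiency.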
Detailed profiling and analysis using the performance tools are required. The parallelization strategies must be improved, and in some cases new algorithms are required.
The team has already learned some important lessons in the process. CEA-LIST conducted research to increase the performance of collision detection using a Bounding Volume Hierarchy (BVH), a complex tree traversal problem because the individual tasks are heterogeneous in compute time. They developed a task calibration scheme that produces a set of homogeneous tasks matching the number of available cores, which limits the synchronisation overhead. Among their findings was that Depth First Search (DFS) can be optimized by exploiting the parallelism discovered by first traversing the tree in a weight-directed manner.
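One way to picture weight-directed task calibration is the Python sketch below. It is not CEA-LIST's actual scheme: the `Node` class, the `weight` estimate and the splitting rule are assumptions made for illustration. The idea shown is to split the heaviest subtree repeatedly until there are at least as many independent tasks as cores, so that each core can then run a plain depth-first traversal on a task of comparable size.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class Node:
    weight: int                      # hypothetical cost estimate for this subtree
    children: list = field(default_factory=list)


def calibrate_tasks(root, cores):
    """Split the heaviest subtree until there are at least `cores` tasks.

    Returns a list of subtree roots, each an independent traversal task.
    The cost of an internal node itself is ignored in this sketch.
    """
    # heapq is a min-heap, so weights are negated to pop the heaviest first.
    # The counter breaks ties so Node objects are never compared directly.
    counter = 0
    heap = [(-root.weight, counter, root)]
    leaves = []                      # tasks that cannot be split any further
    while heap and len(heap) + len(leaves) < cores:
        _, _, node = heapq.heappop(heap)
        if not node.children:
            leaves.append(node)
            continue
        for child in node.children:
            counter += 1
            heapq.heappush(heap, (-child.weight, counter, child))
    return leaves + [node for _, _, node in heap]
```

Because a split replaces one task with all of its children, the result can exceed the core count slightly, and a small tree may yield fewer tasks than cores.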
The performance-analysis tools received several major extensions:
- KOJAK was rewritten to benefit from features of the SCALASCA project, which turned it into a more portable, robust, efficient, and scalable measurement system.
- Support for MPI I/O tracing, fork/exec calls, Lustre performance counters, ... has been implemented in Vampir, thus enabling a combined performance analysis of applications and file systems.
- A trace converter makes it possible to leverage both KOJAK and Vampir facilities.
- UNITE is being developed.
Parallelism and scalability are governed by Amdahl's Law. The ParMA project team showed how sequential operations limit the scalability of a parallel process as follows:
- Speed-up = Total Serial/Total Parallel = 100/40 = 2.5 with 4 parallel processors
- Max speed-up < 100/20 = 5 with an infinite number of parallel processors, without taking system overheads into account
To obtain better scaling, the serial portion of the application must be optimized.
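The arithmetic above can be checked with a short Python sketch. The split into 20 units of serial work and 80 units of parallelisable work is inferred from the figures quoted (100/40 with 4 processors, and a limit of 100/20); the function names are illustrative only.

```python
def amdahl_speedup(serial_time, parallel_time, processors):
    """Speed-up when only the parallel part of the work is divided across processors."""
    total_serial = serial_time + parallel_time
    total_parallel = serial_time + parallel_time / processors
    return total_serial / total_parallel


def amdahl_limit(serial_time, parallel_time):
    """Upper bound on the speed-up as the processor count grows without limit."""
    return (serial_time + parallel_time) / serial_time


# Inferred split: 100 time units in total, 20 of them sequential.
print(amdahl_speedup(20, 80, 4))  # 2.5
print(amdahl_limit(20, 80))       # 5.0
```

Halving the serial part to 10 units raises the limit to 11 in this model (110/10 with the same 100 parallelisable units), which is why the serial portion is the target for optimization.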
The ParMA objectives are to identify real performance bottlenecks and to find how to transform applications to remove them. Multi-level optimization is required, covering the parallelism between nodes, cores, and threads, as well as instruction-level parallelism (ILP).
Good performance results depend on a well-balanced combination of optimizations at these different levels.
Different approaches require different tools: for collecting execution traces, for profiling, and for static performance evaluation. Different levels of parallelism must also be addressed, such as MPI/OpenMP, multi-threaded execution, and single-thread performance tuning.
The objectives of MAQAO are to help identify performance bottlenecks for a single thread, collect static information from the assembly code, collect value profiling information, estimate potential performance gain, and find opportunities for specialization.
The project team will collect static information from the assembly code by detecting inefficient code patterns and evaluating static performance for inner loops, and will collect value profiling information on function parameters, prefetch values, and loop trip count distribution.
It is necessary to estimate the potential performance gain for each inner loop and for each function hotpath and to find opportunities for specialization with regard to the loop trip counts and function parameters.
Furthermore, ParMA will propose performance tuning hints, source code modifications, compiler flags and pragmas, and specialization opportunities. So far the project has been limited to a single loop, but for nested loops the partners propose hierarchical optimization.
In exploring combinations of optimizations, the partners will use pragmas to specify the optimization space to explore for a given code fragment and will apply source-to-source transformations: unroll, interchange, tiling, ...
When combining this with the static analysis of MAQAO, they will find the best-performing variants of code fragments by using MAQAO's static performance evaluation to filter out the worst variants.
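The explore-then-filter idea can be sketched in Python as follows. This is only an illustration: the parameter names, the candidate values and the cost function are made up, and the sketch does not reflect MAQAO's real interface or the pragma syntax used in the project.

```python
from itertools import product


def enumerate_variants(space):
    """Expand {'unroll': [1, 2], 'tile': [16, 32]} into one dict per combination."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*(space[n] for n in names))]


def keep_best(variants, static_cost, keep):
    """Filter out the worst variants using a static cost estimate (lower is better)."""
    return sorted(variants, key=static_cost)[:keep]


# Hypothetical optimization space for one code fragment.
space = {"unroll": [1, 2, 4], "tile": [16, 32]}


# Placeholder cost model, invented for this example; a real flow would call
# a static performance evaluator here instead.
def toy_cost(variant):
    return abs(variant["unroll"] - 2) + abs(variant["tile"] - 32)


survivors = keep_best(enumerate_variants(space), toy_cost, keep=2)
print(survivors)
```

Only the surviving variants would then need to be compiled and timed, which keeps the search affordable as the optimization space grows.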
For library code generation and linear algebra, the performance is comparable to vendor libraries, using only source-to-source optimisation and the compiler. For large application performance tuning, the results are a single-thread speed-up of 2.24x on the Itanium2 platform for the molecular simulation code GIBBS, and a speed-up of 40% for the Lattice QCD ETMC code, a large floating-point code.
In moving from a single thread to multiple threads, MAQAO and a hierarchical approach are used to generate multiple code variants to choose from and to provide a performance estimate for each variant. The partners want to choose the best variant for scalability, one that satisfies thread affinity constraints and addresses communication issues depending on the application context.
Future work consists of refining parallel programming techniques, finalizing the integration of the performance-analysis, correctness-checking, and debugging tools into a framework, and defining a methodology for multi-level performance analysis that integrates KOJAK and Vampir with MAQAO.
More information is available at http://www.parma-itea2.org |