Automatic parallelization tool
Tool to convert sequential code to parallel
From Wikipedia, the free encyclopedia
An automatic parallelization tool is a computer program that aids in the automatic parallelization of existing sequential (single-threaded) code into parallel (multithreaded or vectorized) code. It aims to let already written software be reused with the performance benefits of parallelization, reducing the amount of rewriting required.
In the past, parallel hardware was found only in high-end machines or achieved by means of distributed computing, but with the advent of graphics processing units (GPUs) and multi-core central processing units (CPUs) in consumer devices it has become widespread in low-end computers as well. Hence, it has become desirable to automate the process of converting older, single-threaded applications to exploit parallel hardware. Furthermore, automatic parallelization tools let developers keep writing applications in a single-threaded manner while still benefiting from parallelization. The conversion must handle issues such as synchronization and deadlock avoidance, which do not arise in single-threaded computing.
Need for automatic parallelization
Earlier methods provided solutions for languages such as Fortran and C, but they were limited: they parallelized only specific constructs, such as loops or designated sections of code. Identifying opportunities for parallelization is a critical step in generating a multithreaded application, and this need is partly addressed by tools that analyze code to exploit parallelism, using either compile-time or run-time methods. Some parallelizing compilers build these methods in, but the user must identify the parallelizable code and mark it with special language constructs; the compiler recognizes these constructs and analyzes the marked code for parallelization. Because many tools parallelize only special forms of code, such as loops, a fully automatic tool for converting sequential code to parallel code is needed.[1]
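As a concrete illustration of such marking, a loop can be annotated with an OpenMP construct so that a supporting compiler generates the multithreaded version (a minimal sketch; the function name is illustrative):

```c
/* Illustrative example of manually marked code: the user identifies
   the loop as parallelizable and annotates it with an OpenMP
   construct; the compiler then generates the multithreaded version.
   Without OpenMP support the pragma is ignored and the loop runs
   sequentially, producing the same result. */
void saxpy(float alpha, const float *x, float *y, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Each iteration here touches only `x[i]` and `y[i]`, so the iterations are independent and the annotation is safe.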
General procedure of parallelization
1. The process starts with identifying code sections that offer opportunities for parallelism. This task is often difficult because the programmer parallelizing the code did not originally write it, or is new to the application domain. Thus, although this first stage of the parallelization process seems easy, it may not be.
2. The next stage is to shortlist, from the identified sections, those that can truly be parallelized. This stage is the most important and difficult, since it involves extensive analysis; this is especially true for C and C++ code, where pointers are involved and are hard to analyze. Special techniques such as pointer alias analysis and function side-effect analysis are needed to determine whether a section of code depends on any other code. The more dependencies there are among the identified code sections, the smaller the opportunity for parallelization.
3. The next stage is removing dependencies, where possible, by transforming the code. The code is changed such that its function, and hence its output, is preserved, but any dependency on other code sections or instructions is removed.
4. The last stage is generating the parallel code. This code is functionally equivalent to the original sequential code, but contains added constructs or code sections which, when executed, create multiple threads or processes.
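The stages above can be sketched on a small example. In the sequential version below, a shared temporary creates a false (storage) dependency between iterations; privatizing it (stage 3) removes the dependency, after which the loop can be marked for parallel execution (stage 4). The function names are illustrative, and without OpenMP support the pragma is simply ignored:

```c
/* Stages 2-3: the shared temporary 't' makes every iteration write
   the same storage location, a false dependency that blocks
   parallelization. */
void scale_seq(const double *a, double *b, int n)
{
    double t;
    for (int i = 0; i < n; i++) {
        t = 2.0 * a[i];      /* every iteration reuses the same 't' */
        b[i] = t + 1.0;
    }
}

/* Stage 4: generated parallel version; 't' is now private to each
   iteration, so the iterations are independent and the loop can be
   marked for concurrent execution. */
void scale_par(const double *a, double *b, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        double t = 2.0 * a[i];   /* privatized temporary */
        b[i] = t + 1.0;
    }
}
```

Both versions compute the same output; only the dependency structure changes.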
Automatic parallelization method
Main article: Automatic parallelization
Scan
This is the first stage, in which the scanner reads the input source files to identify all static and external uses. Each line in the file is checked against predefined patterns and split into tokens. The tokens are stored in a file that is later used by the grammar engine, which matches patterns of tokens against predefined rules to identify variables, loops, control statements, functions, etc., in the code.
Analyze
The analyzer identifies sections of code that can be executed concurrently, using the static data information provided by the scanner-parser. It first finds all the functions that are independent of each other and marks them as individual tasks, and then determines which tasks have dependencies on one another.
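A minimal sketch of what such dependence analysis distinguishes (illustrative functions, not the output of any particular tool):

```c
/* No loop-carried dependence: each b[i] reads only a[i], so every
   iteration is independent and the loop is a parallelization
   candidate. */
void independent(const int *a, int *b, int n)
{
    for (int i = 0; i < n; i++)
        b[i] = a[i] + 1;
}

/* Flow (true) dependence: iteration i reads the value written by
   iteration i-1, so the iterations must execute in order and the
   analyzer must reject this loop for parallelization. */
void dependent(int *a, int n)
{
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1;
}
```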
Schedule
The scheduler lists all the tasks and their dependencies on each other in terms of execution and start times. It produces a schedule that is optimal with respect to the number of processors to be used or the total execution time of the application.
Code generation
The scheduler generates a list of all the tasks and details of the cores on which they will execute, along with their execution times. The code generator inserts special constructs in the code that are read during execution by the scheduler. These constructs instruct the scheduler on which core a given task will execute, along with its start and end times.
Parallelization tools
There are many automatic parallelizing tools for Fortran, C, C++, and several other languages.
YUCCA
YUCCA is a sequential-to-parallel automatic code conversion tool developed by KPIT Technologies Ltd., Pune. It takes as input C source code, which may span multiple source and header files, and outputs transformed multithreaded parallel code using POSIX Threads (pthreads) functions and OpenMP constructs. The YUCCA tool performs task- and loop-level parallelization.
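The pthreads output of such a tool can be sketched as follows: the original loop body is outlined into a thread function and the iteration space is split statically across worker threads. This is a hand-written illustration of the general pattern, not actual YUCCA output; all names are hypothetical:

```c
#include <pthread.h>
#include <stddef.h>

#define NT 4   /* number of worker threads (illustrative) */

struct chunk { const int *in; int *out; int lo, hi; };

/* The original loop body, outlined into a thread function that
   processes one contiguous chunk of the iteration space. */
static void *worker(void *arg)
{
    struct chunk *c = arg;
    for (int i = c->lo; i < c->hi; i++)
        c->out[i] = c->in[i] * c->in[i];
    return NULL;
}

/* Replacement for the original sequential loop: partition the
   iterations into NT static blocks, run them concurrently, and
   wait for all threads to finish. */
void square_parallel(const int *in, int *out, int n)
{
    pthread_t tid[NT];
    struct chunk ck[NT];
    for (int t = 0; t < NT; t++) {
        ck[t].in = in;
        ck[t].out = out;
        ck[t].lo = t * n / NT;          /* static block partition */
        ck[t].hi = (t + 1) * n / NT;
        pthread_create(&tid[t], NULL, worker, &ck[t]);
    }
    for (int t = 0; t < NT; t++)
        pthread_join(tid[t], NULL);
}
```

The join at the end reproduces the sequential loop's completion point, so code after the call sees the fully computed array.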
Par4All
Par4All is an automatic parallelizing and optimizing compiler (workbench) for C and Fortran sequential programs. This source-to-source compiler adapts existing applications to various hardware targets such as multicore systems, high-performance computers, and GPUs. It creates new source code, allowing the original source code of the application to remain unchanged.
Cetus
Cetus is a compiler infrastructure for the source-to-source transformation of software programs written in C. It is developed at Purdue University and written in Java. Cetus provides the basic infrastructure for writing automatic parallelization tools or compilers. The basic parallelization techniques implemented are privatization, reduction variable recognition, and induction variable substitution.
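The latter two techniques can be sketched in C (illustrative functions, not Cetus output):

```c
#include <stddef.h>

/* Reduction recognition: the repeated '+=' into 'sum' looks like a
   loop-carried dependence, but a parallelizer recognizes the pattern
   and emits a reduction clause instead of serializing the loop.
   Without OpenMP support the pragma is ignored and the loop still
   computes the same result sequentially. */
double dot(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

/* Induction variable substitution: an index 'k' stepped by 2 each
   iteration would serialize the loop; replacing it with the closed
   form 2*i makes every iteration independent. */
void pack_even(const int *src, int *dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[2 * i];   /* was: dst[i] = src[k]; k += 2; */
}
```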
A new graphical user interface (GUI) was added in February 2013, followed by speedup calculations and graph display in May 2013. Also in May 2013, a Cetus remote server in a client–server model was added, letting users optionally transform C code through the server; this is useful when Cetus runs on a non-Linux platform. An experimental Hubzero version of Cetus was implemented in the same month, allowing Cetus to run in a web browser.
Pluto
Pluto (stylized as PLUTO) is an automatic parallelization tool based on the polyhedral model, a representation of programs that makes it convenient to perform high-level transformations such as loop-nest optimization and loop parallelization. Pluto transforms C programs from source to source for coarse-grained parallelism and data locality simultaneously. The core transformation framework mainly works by finding affine transformations for efficient tiling and fusion, but is not limited to those. OpenMP parallel code for multicore processors can be generated automatically from sequential C program sections.
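Loop tiling, one of the transformations such a framework derives, can be sketched by hand on a matrix transpose (an illustrative example, not actual Pluto output; the tile size is arbitrary):

```c
#define T 32   /* tile size (illustrative) */

/* Loop tiling (blocking): the iteration space of an n x n transpose
   is partitioned into T x T tiles so that each tile's data stays in
   cache, and distinct tiles can in principle run in parallel. The
   guards 'i < n' and 'j < n' handle the partial tiles at the edges. */
void transpose_tiled(const double *a, double *b, int n)
{
    for (int ii = 0; ii < n; ii += T)
        for (int jj = 0; jj < n; jj += T)
            for (int i = ii; i < ii + T && i < n; i++)
                for (int j = jj; j < jj + T && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}
```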
Polaris
The Polaris compiler takes a Fortran 77 program as input, transforms it so that it runs efficiently on a parallel computer, and outputs this program version in one of several possible parallel Fortran dialects. Polaris performs its transformations in several compilation passes. In addition to many commonly known passes, Polaris includes advanced capabilities such as array privatization, data dependence testing, induction variable recognition, interprocedural analysis, and symbolic program analysis.
Intel C++ Compiler
The auto-parallelization feature of the Intel C++ Compiler automatically translates serial portions of the input program into semantically equivalent multithreaded code. It determines which loops are good work-sharing candidates, performs the data-flow analysis needed to verify correct parallel execution, and partitions the data for threaded code generation, as would otherwise be done manually when programming with OpenMP directives. Both OpenMP and auto-parallelized applications obtain their performance gains from shared memory on multiprocessor systems.
Intel Advisor
The Intel Advisor 2017 is a vectorization optimization and thread prototyping tool. It integrates several steps into its workflow to search for parallel sites, enable users to mark loops for vectorization and threading, check loop-carried dependencies and memory access patterns for marked loops, and insert pragmas for vectorization and threading.
AutoPar
AutoPar is a tool that can automatically insert OpenMP pragmas into serial C/C++ input code. For input programs with existing OpenMP directives, the tool double-checks their correctness when the corresponding option is enabled. Compared with conventional tools, AutoPar can incorporate user knowledge (semantics) to discover more parallelization opportunities.
iPat/OMP
This tool provides users with the assistance needed for OpenMP parallelizing of a sequential program. This tool is implemented as a set of functions on the Emacs editor. All the activities related to program parallelizing, such as selecting a target portion of the program, invoking an assistance command, and modifying the program based on the assistance information shown by the tool, can be handled in the source program editor environment.[2]
Vienna Fortran Compiler (VFC)
The Vienna Fortran Compiler is a source-to-source parallelization system for High Performance Fortran (HPF+, an optimized version of HPF) that addresses the needs of irregular applications.
SUIF
SUIF (Stanford University Intermediate Format) is a free infrastructure designed to support collaborative research in optimizing and parallelizing compilers. SUIF is a fully functional compiler that takes both Fortran and C as input languages. The parallelized code is output as a single program, multiple data (SPMD) parallel C version of the program that can be compiled by native C compilers on a variety of architectures.[3]
Omni OpenMP Compiler
The Omni OpenMP Compiler translates C and Fortran programs with OpenMP pragmas into C code suitable for compiling with a native compiler linked with the Omni OpenMP runtime library. It performs parallelization of for loops.
Timing-Architects Optimizer
The Timing-Architects Optimizer uses a simulation-based approach to improve task allocation and task parallelization on multi-core systems. Using simulation-based performance and real-time analysis, different task allocation alternatives are benchmarked against each other, taking dependencies and processor-platform-specific effects into account. TA Optimizer is used in embedded-system engineering.
TRACO
TRACO uses the iteration space slicing and free schedule frameworks. Its core is based on Presburger arithmetic and the transitive closure operation, and loop dependencies are represented with relations. TRACO uses the Omega Calculator, CLooG, and ISL libraries, along with the Petit dependence analyser. The compiler extracts better locality with fine- and coarse-grained parallelism for C/C++ applications. It is developed by a team at the West Pomeranian University of Technology (Bielecki, Palkowski, Klimek, and other authors); see http://traco.sourceforge.net.
SequenceL
SequenceL is a general-purpose functional programming language and auto-parallelizing tool set whose main design objectives are performance on multi-core processor hardware, ease of programming, platform portability/optimization, and code clarity and readability. Its main advantage is that it can be used to write straightforward code that automatically exploits all available processing power, without programmers needing to identify parallelism, specify vectorization, avoid race conditions, or deal with the other challenges of manual directive-based approaches such as OpenMP.
Programs written in SequenceL can be compiled to multithreaded code that runs in parallel, with no explicit indications from a programmer of how or what to parallelize. As of 2015, versions of the SequenceL compiler generate parallel code in C++ and OpenCL, which allows it to work with most popular programming languages, including C, C++, C#, Fortran, Java, and Python. A platform-specific runtime manages the threads safely, automatically providing parallel performance according to the number of cores available.
OMP2MPI
OMP2MPI[4] automatically generates MPI source code from OpenMP code, allowing programs to exploit non-shared-memory architectures such as clusters or network-on-chip-based (NoC-based) multiprocessor systems-on-chip (MPSoC). The generated code can be further optimized by an expert who wants to achieve better results.
OMP2HMPP
OMP2HMPP[5] is a tool that automatically translates high-level C source code with OpenMP directives into HMPP. The generated version rarely differs from a hand-coded HMPP version, and provides a substantial speedup, close to 113%, which can later be improved with hand-coded CUDA.
emmtrix Parallel Studio
emmtrix Parallel Studio is a source-to-source parallelization tool combined with an interactive GUI, developed by emmtrix Technologies GmbH. It takes C, MATLAB, Simulink, Scilab or Xcos source code as input and generates parallel C code as output. It relies on static scheduling and a message-passing application programming interface (API) for the parallel program. The whole parallelization process is controlled and visualized in an interactive GUI, enabling parallelization decisions by the end user. It targets embedded multicore architectures combined with GPU and field-programmable gate array (FPGA) accelerators.
CLAW Compiler
The CLAW Compiler translates Fortran programs with claw pragmas into Fortran code suitable for a specific supercomputer target augmented with OpenMP or OpenACC pragmas.
PaSH
PaSH is a parallelizing compiler for Unix shell scripts.[6]