Before we learn how to use the DSU cluster, let’s review some basics of parallel computing.
In sequential programming, processes are executed consecutively on a single processor. This implies that a process is not started until the preceding process has finished its execution.
In parallel programming, multiple processes can be executed simultaneously using multiple computing resources to solve a computational problem.
The computing resources may include a single computer with multiple processors or cores, several computers connected by a network, or a combination of both.
Parallel computing offers several advantages:
- Allocating more resources to a task can reduce its execution time.
- Parallel clusters can be built from cheap, commodity hardware.
- Very large and/or complex problems are impractical to solve on a single machine because of its limited computational power, memory, and storage.
- Many independent tasks can be executed simultaneously on multiple computing resources.
- Compute resources over the internet (i.e., the cloud) can be used when local resources are scarce.
In a shared memory system, all processors have access to a pool of shared memory.
In a distributed memory system, each processor has its own local memory; processors exchange data by passing messages over the interconnection network. Message passing is an effective communication mechanism between processors in a distributed memory architecture.
Some applications need more computational resources than a single computer can provide. There are two ways of overcoming this limitation.
- Scaling up (vertical scaling): installing more memory or processing power in a single node.
- Scaling out (horizontal scaling): networking multiple nodes together so that they work as a single logical unit.
A computer cluster is a horizontally scaled architecture, although each node in the cluster can also be scaled up vertically. The networked computers act as a single powerful machine. A computer cluster provides high performance, scalability, and resilience to the failure of individual nodes.
A computer cluster follows a master-worker architecture. One node in the cluster is the master node, and all other nodes are called worker nodes or compute nodes. The master node distributes data among the worker nodes and assigns tasks to them. It also monitors for worker failures and, if a worker fails, reassigns its task to another worker.
An important aspect of the computer cluster is that multiple users should be able to use the resources concurrently. To do that, we need a resource manager and a job scheduler.
The Simple Linux Utility for Resource Management (SLURM) is an open-source resource-management and job-scheduling system for submitting, executing, monitoring, and managing jobs on Linux clusters.
SLURM provides three key functions:
- It allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time.
- It provides a framework for starting, executing, and monitoring work (typically parallel jobs) on the allocated nodes.
- It arbitrates contention for resources by managing a queue of pending work.
The DSU Cluster uses SLURM as the resource manager and job scheduler.
More information about SLURM and its commands can be found on the official Slurm website.
The preferred way to run a job on the DSU cluster is to set up a job script. This script requests cluster resources and runs your program when the resources become available; if they are not available, the job is queued until they are.
To properly configure a job script, you will need to know the general script format, the commands you wish to use, how to request the resources required for the job to run, and possibly some of the Slurm environment variables.
The following is a list of frequently used Slurm commands.

| Command | Description |
| --- | --- |
| `sbatch` | Submits a job script into the system for execution (queued) |
| `scancel` | Cancels a pending or running job |
| `scontrol` | Views and modifies the Slurm configuration and state |
| `sinfo` | Displays the state of partitions and nodes |
| `squeue` | Displays the state of jobs |
The following is a list of commonly requested resources and the Slurm options used to request them.

| Option | Description |
| --- | --- |
| `--ntasks` (`-n`) | Number of tasks (processes) to be created for the job |
| `--mem` | Real memory required per node |
| `--mem-per-cpu` | Real memory required per allocated processor |
| `--output` | Path for the standard output file |
| `--error` | Path for the standard error file |
Below is a list of some environment variables that are defined by Slurm when a job is launched into execution.

| Variable | Description |
| --- | --- |
| `$SLURM_JOB_ID` | ID of the job allocation |
| `$SLURM_SUBMIT_DIR` | Directory from which the job was submitted |
| `$SLURM_JOB_NODELIST` | List of nodes allocated to the job |
| `$SLURM_NTASKS` | Total number of tasks in the job |
In this section, we demonstrate how to use the DSU cluster to run serial and parallel programs. The following example illustrates how parallel processing differs from sequential processing.
Numerical integration is used to approximate the area under the curve of a function in situations where the integral cannot be solved analytically. The value of pi is given by the integral

π = ∫₀¹ 4 / (1 + x²) dx

so the area under f(x) = 4 / (1 + x²) between 0 and 1 equals π. Here, we divide that area into 4 rectangles and sum their areas to get an approximate value for pi. As the number of rectangles increases, the approximation becomes more accurate.
Let’s solve this problem sequentially. In this paradigm, the area of one rectangle is calculated at a time: once it is computed, the area of the next rectangle is calculated, and so on. At the end, all the areas are summed to obtain the approximate value of pi. The C program for this problem is as follows.
Compiling this C Program:
dsu@master ~$ gcc -o pi_serial pi_serial.c
The usual way to run this program:
dsu@master ~$ ./pi_serial
Let’s run this program using batch mode. To do that, we have to prepare a batch script.
The batch script (test_slurm.sh) is as follows.
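A minimal sketch of such a batch script, assuming the output file names slurm.out and slurm.error used later in this section:

```shell
#!/bin/bash
#SBATCH --job-name=pi_serial    # job name shown in the queue
#SBATCH --ntasks=1              # a serial program needs a single task
#SBATCH --output=slurm.out      # standard output file
#SBATCH --error=slurm.error     # standard error file

# run the serial program from the directory the job was submitted from
cd $SLURM_SUBMIT_DIR
./pi_serial
```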
Let’s submit the batch script. As it is a serial program, it will use only one processor.
dsu@master ~$ sbatch test_slurm.sh
Once a job is submitted, Slurm generates a job ID and displays it on the terminal.
The statuses of the currently running jobs can be viewed using the `sview` command (or, from the terminal, `squeue`).
We also can view the full details of a currently running job using “scontrol show job [job_id]”.
Once the job is completed, we can view the output using “cat slurm.out”.
If the program fails, the error log can be viewed using “cat slurm.error”.
Let’s code the above problem using parallel programming. Earlier, a single processor calculated the area of each rectangle one at a time. In parallel processing, multiple processors are used to solve the problem: each processor calculates the areas of its share of the rectangles concurrently. Execution time decreases as more processors are used.
The Message Passing Interface (MPI) is a library specification for creating parallel applications.
In parallel computing, multiple nodes work on different parts of a computing problem simultaneously. The challenge is to synchronize the actions of each node, exchange data between nodes and provide command and control over the nodes. The message passing interface provides a standard suite of functions for these tasks. The primary purpose of MPI is distributed memory parallelism.
There are several implementations of the MPI standard.
Currently, DSU supports OpenMPI implementation of the MPI. OpenMPI supports C, C++, F77, and F90.
Let’s run the parallel version of the “calculate pi” example using OpenMPI with C.
The source code is as follows (cpi.c).
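The following is a minimal, illustrative MPI version of the same computation, not the original cpi.c listing (the rectangle count and the round-robin distribution of work are assumptions):

```c
#include <stdio.h>
#include <mpi.h>

/* Each rank sums the areas of its share of rectangles under
 * f(x) = 4 / (1 + x^2) on [0, 1]; MPI_Reduce combines the
 * partial sums on rank 0. */
int main(int argc, char *argv[])
{
    long n = 1000000;               /* total number of rectangles */
    int rank, size;
    double width, local_sum = 0.0, total_sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    width = 1.0 / n;
    /* round-robin: rank r handles rectangles r, r+size, r+2*size, ... */
    for (long i = rank; i < n; i += size) {
        double x = (i + 0.5) * width;   /* midpoint of rectangle i */
        local_sum += 4.0 / (1.0 + x * x);
    }

    /* sum the partial results on rank 0 */
    MPI_Reduce(&local_sum, &total_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("Approximation of pi: %.10f\n", total_sum * width);

    MPI_Finalize();
    return 0;
}
```

The result is independent of the number of processes; only the work per process changes.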
Let’s compile this parallel program using mpicc. `mpicc` is a wrapper script that compiles MPI programs written in C.
dsu@master ~$ mpicc -o cpi cpi.c
Let’s modify the slurm script (test_slurm.sh) to run this program on our cluster using 5 processors.
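A sketch of the modified script, assuming the same output file names as before and 5 tasks:

```shell
#!/bin/bash
#SBATCH --job-name=cpi
#SBATCH --ntasks=5              # request 5 processors (tasks)
#SBATCH --output=slurm.out
#SBATCH --error=slurm.error

cd $SLURM_SUBMIT_DIR
# --mca btl_tcp_if_include bypasses the university proxy server
mpirun --mca btl_tcp_if_include 10.40.62.0/16 ./cpi
```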
`mpirun` is the OpenMPI launcher that executes serial and parallel jobs.
--mca btl_tcp_if_include 10.40.62.0/16 is a parameter that is used to bypass our university proxy server.
The progress and output of the program can be viewed as we discussed earlier.
As we mentioned earlier, OpenMPI supports C, C++, and Fortran. In order to do parallel programming in Python, we use another library called MPI for Python (MPI4Py). It provides bindings of the MPI standard for the Python programming language.
Besides MPI4Py, there are several modules available for parallel programming in Python. DSU supports only MPI4Py.
Let’s parallelize the “calculate pi” problem using MPI4Py (calculate_pi.py).
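A minimal MPI4Py sketch of this computation is shown below (an illustrative version of calculate_pi.py; the rectangle count and the round-robin work distribution are assumptions):

```python
from mpi4py import MPI

# Each rank sums the areas of its share of rectangles under
# f(x) = 4 / (1 + x^2) on [0, 1]; comm.reduce combines the
# partial sums on rank 0.
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

n = 1_000_000            # total number of rectangles
width = 1.0 / n

# round-robin: rank r handles rectangles r, r + size, r + 2*size, ...
local_sum = sum(4.0 / (1.0 + ((i + 0.5) * width) ** 2)
                for i in range(rank, n, size))

total_sum = comm.reduce(local_sum, op=MPI.SUM, root=0)

if rank == 0:
    print(f"Approximation of pi: {total_sum * width:.10f}")
```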
We do not need to compile this Python code as Python is an interpreted language.
The slurm script to run this program is as follows.
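A sketch of that script, assuming the same resource requests and output paths as the C example:

```shell
#!/bin/bash
#SBATCH --job-name=calculate_pi
#SBATCH --ntasks=5              # request 5 processors (tasks)
#SBATCH --output=slurm.out
#SBATCH --error=slurm.error

cd $SLURM_SUBMIT_DIR
# --mca btl_tcp_if_include bypasses the university proxy server
mpirun --mca btl_tcp_if_include 10.40.62.0/16 python calculate_pi.py
```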
There are several packages available for parallel programming in R. Some of them are listed below.
| Shared Memory | Distributed Memory |
| --- | --- |
| parallel, Simple Network of Workstations (snow), foreach, gputools | pbdR, Rmpi, RHadoop, RHIPE, snow via Rmpi |
We will be discussing snow and Rmpi in the following examples.
Rmpi is a CRAN package that provides an R interface to MPI. snow is used to parallelize R programs using multiple cores within a machine; however, snow alone cannot be used to distribute tasks across multiple nodes. More commonly, snow is used together with Rmpi for distributed computing across the cluster.
Below is the Rmpi version of the “calculate pi” example (calculate_pi.R).
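A minimal Rmpi sketch of this computation (an illustrative version of calculate_pi.R; the function names and rectangle count are assumptions):

```r
library(Rmpi)

# Spawn 5 slaves; together with the master this uses 6 processors.
mpi.spawn.Rslaves(nslaves = 5)

n <- 1000000          # total number of rectangles
width <- 1 / n

# Run on each slave: sum the areas of its share of rectangles
# under f(x) = 4 / (1 + x^2) on [0, 1] (round-robin by slave rank).
partial.sum <- function(n, width) {
  rank <- mpi.comm.rank()         # slave ranks are 1..nslaves
  size <- mpi.comm.size() - 1     # number of slaves
  i <- seq(rank, n, by = size)    # this slave's rectangles
  x <- (i - 0.5) * width          # midpoints
  sum(4 / (1 + x^2))
}

# send the function and data to the slaves, run it, and combine
mpi.bcast.Robj2slave(partial.sum)
mpi.bcast.Robj2slave(n)
mpi.bcast.Robj2slave(width)
results <- mpi.remote.exec(partial.sum(n, width))
cat("Approximation of pi:", sum(unlist(results)) * width, "\n")

mpi.close.Rslaves()
mpi.quit()
```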
The above program uses 6 processors (a master and 5 slaves). A batch script to submit this job is as follows.
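A sketch of such a script; note that an Rmpi program is typically launched as a single process, since the master spawns the slaves itself (the job name and paths are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=calculate_pi_R
#SBATCH --ntasks=6              # a master and 5 slaves
#SBATCH --output=slurm.out
#SBATCH --error=slurm.error

cd $SLURM_SUBMIT_DIR
# launch one process; the master spawns the slaves via Rmpi
mpirun --mca btl_tcp_if_include 10.40.62.0/16 -np 1 Rscript calculate_pi.R
```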
The following code explains how to solve the same problem using snow via Rmpi in a distributed fashion using multiple nodes (cal_pi.R).
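A sketch of such a program (an illustrative version of cal_pi.R; the chunking scheme and names are assumptions):

```r
library(snow)

# Create an MPI cluster of 5 workers (snow uses Rmpi underneath,
# so the workers can be spread across multiple nodes).
cl <- makeCluster(5, type = "MPI")

n <- 1000000          # total number of rectangles
width <- 1 / n

# each worker sums the rectangle areas for one chunk of indices
chunk.sum <- function(indices, width) {
  x <- (indices - 0.5) * width    # midpoints of the rectangles
  sum(4 / (1 + x^2))
}

# split the indices 1..n into 5 chunks, one per worker
chunks <- splitIndices(n, 5)
partials <- clusterApply(cl, chunks, chunk.sum, width)

cat("Approximation of pi:", sum(unlist(partials)) * width, "\n")

stopCluster(cl)
```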
A simple batch script to run the above problem is as follows.
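A sketch of that script, following the same pattern as the Rmpi job (one launched process that spawns the MPI workers; names and paths are illustrative):

```shell
#!/bin/bash
#SBATCH --job-name=cal_pi_R
#SBATCH --ntasks=6              # 1 master process plus 5 MPI workers
#SBATCH --output=slurm.out
#SBATCH --error=slurm.error

cd $SLURM_SUBMIT_DIR
mpirun --mca btl_tcp_if_include 10.40.62.0/16 -np 1 Rscript cal_pi.R
```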
More details about the topics we covered can be found on the following websites.
(Lectures 38 – 41)