[Published in Open Source For You (OSFY) magazine, May 2014 edition.]
This article guides readers through the installation of GNU Unified Parallel C, which is designed for high performance computing on large scale parallel machines.
GNU Unified Parallel C is an extension to the GNU C compiler (GCC) that supports the compilation and execution of Unified Parallel C (UPC) programs. UPC uses the Partitioned Global Address Space (PGAS) programming model. The current version of the UPC specification is 1.2, and a 1.3 draft specification is available. GNU UPC is released under the GPL license, while the UPC specification is released under the new BSD license. To install it on Fedora, you first need to install the gupc repository:
$ sudo yum install http://www.gccupc.org/pub/pkg/rpms/gupc-fedora-18-1.noarch.rpm
You can then install the gupc RPM using the following command:
$ sudo yum install gupc-gcc-upc
The installation directory is /usr/local/gupc. You will also require the development packages for numactl (a library for tuning applications on Non-Uniform Memory Access machines):
$ sudo yum install numactl-devel numactl-libs
To add the installation directory to your environment, install the environment-modules package:
$ sudo yum install environment-modules
You can then load the gupc module with:
# module load gupc-x86_64
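To confirm that the compiler is now available in your environment, you can query its version (the output will vary with the installed release):
# gupc --version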
Consider the following simple ‘hello world’ example:
#include <stdio.h>
int main()
{
printf("Hello World\n");
return 0;
}
You can compile it using:
# gupc hello.c -o hello
Then run it with:
# ./hello -fupc-threads-5
Hello World
Hello World
Hello World
Hello World
Hello World
The argument -fupc-threads-N specifies the number of threads to be run. The program can also be executed using:
# ./hello -n 5
The gupc compiler provides a number of compile-time and run-time options. The ‘-v’ option produces verbose output of the compilation steps, and also prints information about GNU UPC itself. An example of such output is shown below:
# gupc hello.c -o hello -v
Driving: gupc -x upc hello.c -o hello -v -fupc-link
Using built-in specs.
COLLECT_GCC=gupc
COLLECT_LTO_WRAPPER=/usr/local/gupc/libexec/gcc/x86_64-redhat-linux/4.8.0/lto-wrapper
Target: x86_64-redhat-linux
Configured with: ...
Thread model: posix
gcc version 4.8.0 20130311 (GNU UPC 4.8.0-3) (GCC)
COLLECT_GCC_OPTIONS='-o' 'hello' '-v' '-fupc-link' '-mtune=generic' '-march=x86-64'
...
GNU UPC (GCC) version 4.8.0 20130311 (GNU UPC 4.8.0-3) (x86_64-redhat-linux)
compiled by GNU C version 4.8.0 20130311 (GNU UPC 4.8.0-3),
GMP version 5.0.5, MPFR version 3.1.1, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
...
#include "..." search starts here:
#include <...> search starts here:
/usr/local/gupc/lib/gcc/x86_64-redhat-linux/4.8.0/include
/usr/local/include
/usr/local/gupc/include
/usr/include
End of search list.
GNU UPC (GCC) version 4.8.0 20130311 (GNU UPC 4.8.0-3) (x86_64-redhat-linux)
compiled by GNU C version 4.8.0 20130311 (GNU UPC 4.8.0-3),
GMP version 5.0.5, MPFR version 3.1.1, MPC version 0.9
GGC heuristics: --param ggc-min-expand=100 --param ggc-min-heapsize=131072
Compiler executable checksum: 9db6d080c84dee663b5eb4965bf5012f
COLLECT_GCC_OPTIONS='-o' 'hello' '-v' '-fupc-link' '-mtune=generic' '-march=x86-64'
as -v --64 -o /tmp/cccSYlmb.o /tmp/ccTdo4Ku.s
...
COLLECT_GCC_OPTIONS='-o' 'hello' '-v' '-fupc-link' '-mtune=generic' '-march=x86-64'
...
The -g option generates debug information. To output debugging symbol information in DWARF-2 (Debugging With Attributed Record Formats), use the -dwarf-2-upc option. This can be used with GDB-UPC, a GNU debugger that supports UPC.
The -fupc-debug option also includes filenames and line numbers in the generated output.
The optimization levels are similar to the ones supported by GCC: ’-O0’, ’-O1’, ’-O2’, and ’-O3’.
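For instance, the earlier ‘hello world’ program could be compiled with debug information, or with optimizations enabled, using invocations along the following lines:
# gupc -g hello.c -o hello
# gupc -O2 hello.c -o hello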
Variables that are shared among threads are declared using the ‘shared’ keyword. Examples include:
shared int i;
shared int a[THREADS];
shared char *p;
‘THREADS’ is a reserved keyword that represents the number of threads the program will run with at run-time. Consider a simple vector addition example:
#include <upc_relaxed.h>
#include <stdio.h>
shared int a[THREADS];
shared int b[THREADS];
shared int vsum[THREADS];
int
main()
{
    int i;

    /* Initialization */
    for (i = 0; i < THREADS; i++) {
        a[i] = i + 1;       /* a[] = {1, 2, 3, 4, 5}; */
        b[i] = THREADS - i; /* b[] = {5, 4, 3, 2, 1}; */
    }

    /* Computation */
    for (i = 0; i < THREADS; i++)
        if (MYTHREAD == i % THREADS)
            vsum[i] = a[i] + b[i];

    upc_barrier;

    /* Output */
    if (MYTHREAD == 0) {
        for (i = 0; i < THREADS; i++)
            printf("%d ", vsum[i]);
    }
    return 0;
}
‘MYTHREAD’ identifies the thread that is currently running. upc_barrier is a blocking synchronization primitive that ensures that all threads complete before proceeding further. Only one thread is required to print the output, and thread 0 is used for this purpose. The program can be compiled and executed using:
# gupc vector_addition.c -o vector_addition
# ./vector_addition -n 5
6 6 6 6 6
The computation loop in the above code can be simplified with the upc_forall statement:
#include <upc_relaxed.h>
#include <stdio.h>
shared int a[THREADS];
shared int b[THREADS];
shared int vsum[THREADS];
int
main()
{
    int i;

    /* Initialization */
    for (i = 0; i < THREADS; i++) {
        a[i] = i + 1;       /* a[] = {1, 2, 3, 4, 5}; */
        b[i] = THREADS - i; /* b[] = {5, 4, 3, 2, 1}; */
    }

    /* Computation */
    upc_forall(i = 0; i < THREADS; i++; i)
        vsum[i] = a[i] + b[i];

    upc_barrier;

    if (MYTHREAD == 0) {
        for (i = 0; i < THREADS; i++)
            printf("%d ", vsum[i]);
    }
    return 0;
}
The upc_forall construct is similar to a for loop, except that it accepts a fourth parameter, the affinity field, which indicates the thread on which each iteration runs. It can be an integer expression, which is evaluated as integer % THREADS, or it can be the address of a shared object, in which case the iteration is executed by the thread that has affinity to that address. The program can be compiled and tested with:
# gupc upc_vector_addition.c -o upc_vector_addition
# ./upc_vector_addition -n 5
6 6 6 6 6
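As noted above, the affinity field can also be an address. The following variant of the same program (a sketch, differing only in the computation loop) uses the address of each shared element, so that iteration i is executed by the thread that has affinity to vsum[i]:
#include <upc_relaxed.h>
#include <stdio.h>

shared int a[THREADS];
shared int b[THREADS];
shared int vsum[THREADS];

int
main()
{
    int i;

    /* Initialization */
    for (i = 0; i < THREADS; i++) {
        a[i] = i + 1;
        b[i] = THREADS - i;
    }

    /* Computation: &vsum[i] assigns iteration i to the thread
       that has affinity to vsum[i] */
    upc_forall(i = 0; i < THREADS; i++; &vsum[i])
        vsum[i] = a[i] + b[i];

    upc_barrier;
    if (MYTHREAD == 0)
        for (i = 0; i < THREADS; i++)
            printf("%d ", vsum[i]);
    return 0;
}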
The same example can also be implemented using shared pointers:
#include <upc_relaxed.h>
#include <stdio.h>
shared int a[THREADS];
shared int b[THREADS];
shared int vsum[THREADS];
int
main()
{
    int i;
    shared int *p1, *p2;

    p1 = a;
    p2 = b;

    /* Initialization */
    for (i = 0; i < THREADS; i++) {
        *(p1 + i) = i + 1;       /* a[] = {1, 2, 3, 4, 5}; */
        *(p2 + i) = THREADS - i; /* b[] = {5, 4, 3, 2, 1}; */
    }

    /* Computation */
    upc_forall(i = 0; i < THREADS; i++, p1++, p2++; i)
        vsum[i] = *p1 + *p2;

    upc_barrier;

    if (MYTHREAD == 0)
        for (i = 0; i < THREADS; i++)
            printf("%d ", vsum[i]);
    return 0;
}
# gupc pointer_vector_addition.c -o pointer_vector_addition
# ./pointer_vector_addition -n 5
6 6 6 6 6
Memory can also be allocated dynamically. The upc_all_alloc function allocates global memory that is shared among threads; it is a collective function, i.e., it must be invoked by every thread. The upc_global_alloc function is non-collective; if it is called by multiple threads, each calling thread allocates a distinct region of the shared address space. The upc_alloc function allocates shared memory that has affinity to the calling thread. Their respective declarations are as follows:
shared void *upc_all_alloc (size_t nblocks, size_t nbytes);
shared void *upc_global_alloc (size_t nblocks, size_t nbytes);
shared void *upc_alloc (size_t nbytes);
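A minimal sketch of dynamic allocation with upc_all_alloc is shown below; it also uses upc_free, from the UPC specification, to release the memory, and otherwise mirrors the structure of the earlier examples:
#include <upc_relaxed.h>
#include <stdio.h>

int
main()
{
    shared int *data;
    int i;

    /* Collectively allocate THREADS blocks of sizeof(int) bytes,
       giving one element with affinity to each thread */
    data = (shared int *) upc_all_alloc(THREADS, sizeof(int));

    /* Each thread initializes the element it has affinity to */
    data[MYTHREAD] = MYTHREAD + 1;
    upc_barrier;

    if (MYTHREAD == 0) {
        for (i = 0; i < THREADS; i++)
            printf("%d ", data[i]);
        upc_free(data);  /* release the shared allocation */
    }
    return 0;
}
When run with five threads, this prints the numbers 1 to 5.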
To protect access to shared data, you can use locks. The upc_lock function blocks until the lock is acquired, upc_lock_attempt tries to acquire the lock and returns immediately (with a non-zero value if it succeeds), and upc_unlock releases the lock:
void upc_lock (upc_lock_t *l)
int upc_lock_attempt (upc_lock_t *l)
void upc_unlock(upc_lock_t *l)
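A minimal sketch of lock usage is shown below. It additionally uses upc_all_lock_alloc and upc_lock_free, from the UPC specification, to create and release the lock, and protects updates to a shared counter:
#include <upc_relaxed.h>
#include <stdio.h>

shared int counter;   /* shared counter, has affinity to thread 0 */
upc_lock_t *lock;     /* each thread holds a private pointer to the same lock */

int
main()
{
    /* Collectively allocate a single lock shared by all threads */
    lock = upc_all_lock_alloc();

    if (MYTHREAD == 0)
        counter = 0;
    upc_barrier;

    /* Each thread increments the counter inside the critical section */
    upc_lock(lock);
    counter = counter + 1;
    upc_unlock(lock);

    upc_barrier;
    if (MYTHREAD == 0) {
        printf("counter = %d\n", counter);
        upc_lock_free(lock);
    }
    return 0;
}
With five threads, the final value of counter is 5.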
There are two types of barriers for synchronizing threads. The upc_barrier construct is blocking. The split-phase barrier uses the upc_notify (non-blocking) and upc_wait (blocking) constructs: upc_wait completes only after all threads have issued the matching upc_notify. For example:
#include <upc_relaxed.h>
#include <stdio.h>
int
main()
{
    int i;

    for (i = 0; i < THREADS; i++) {
        upc_notify;
        if (i == MYTHREAD)
            printf("Thread: %d\n", MYTHREAD);
        upc_wait;
    }
    return 0;
}
The corresponding output is shown below:
# gupc count.c -o count
# ./count -n 5
Thread: 0
Thread: 1
Thread: 2
Thread: 3
Thread: 4
You can refer to the GUPC user guide for more information.