UI Neon Cluster and the Intel Phi Co-Processor

Introduction

The Intel Xeon Phi co-processor is based on Intel's Many Integrated Core (MIC) architecture and is designed to run highly parallel applications. The Neon cluster has 29 Intel Xeon Phi 5110P co-processors; each has 60 cores and 8GB of memory, and runs at 1.053GHz.

qlogin

Log in to a UI node with an Intel Phi co-processor:

    $ qlogin -l phi

The default queue is the UI queue (i.e., the above command is the same as qlogin -l phi -q UI). By default, all slots in the node are reserved. You can also try the all.q:

    $ qlogin -l phi -q all.q

The all.q may evict you if a user initiates a higher-priority job. The UI queue will not evict you, but you may not be able to reserve a node on it. If you don't need a Phi, just omit the -l phi flag.

If a node is unavailable, try our LT node (the LT node contains a Phi). To log in to the LT node, type

    $ qlogin -q LT

This will reserve all slots in the LT node and will lock others out. If possible, reserve fewer than 16 slots so that others may still qlogin into the LT node. For example, to reserve 4 slots type

    $ qlogin -q LT -pe smp 4

If you only reserve 4 slots, do not let your program use more than 4 slots (you will disturb others working on the node).

Compile Variables

To use the Phi, you must set up the compile variables. Add the following at the end of the .bash_profile file in your home directory:

    # set up environment for Intel compiler
    source /opt/intel/bin/compilervars.sh intel64 lp64

Log out, then log back in for the changes to take effect. The compile variables will then be set up automatically every time you log in.

Intel MKL (Math Kernel Library)

The key to getting fast speeds on the MIC is Intel's highly tuned MKL library. The library is compatible with C/C++/Fortran, contains the expected linear algebra, vector math, and statistics functions (BLAS, LAPACK, etc.), and is highly vectorized for speed. MKL also contains various random number generators (RNGs) and many other routines. See the MKL Reference Manual for details.
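As a quick taste of the API, here is a minimal sketch (not one of the files from this page) that calls vsAdd, one of MKL's vectorized math functions, to compute an element-wise vector sum:

    // Minimal MKL example: element-wise vector addition with vsAdd.
    #include <iostream>
    #include "mkl.h"
    using namespace std;

    int main()
    {
        const int n = 8;
        float a[n], b[n], c[n];
        for (int i = 0; i < n; i++) { a[i] = (float) i; b[i] = 2.0f * i; }
        vsAdd(n, a, b, c);   // c[i] = a[i] + b[i], vectorized by MKL
        for (int i = 0; i < n; i++) cout << c[i] << " ";
        cout << endl;
    }

Compile it with icpc -mkl, as in the makefiles below.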

Automatic Offload to the MIC

Files used in this section: sgemm.cpp, Makefile, sgemm.sh

The Intel compiler can automatically offload computations to the Phi. The following C++ program initializes two matrices, A and B, takes the product, and stores the result in matrix C (all matrices are NxN). It is simple to automatically offload to the MIC; just include the mkl.h header file and call an MKL function (here, we call cblas_sgemm, the single-precision general matrix-matrix product). If it is computationally advantageous, the computations will automatically be offloaded to the MIC, executed, and the result returned to the host.

Syntax: cblas_sgemm computes C = alpha*A*B + beta*C, where A, B, and C are matrices and alpha and beta are scaling constants. The matrices are stored in row-major order as one-dimensional arrays. (Note that matrices of this size are far too large for the stack, so they are allocated on the heap with mkl_malloc.)

    // Automatic Offload SGEMM
    #include <cstdlib>
    #include <iostream>
    #include "omp.h"
    #include "mkl.h"
    using namespace std;

    int main()
    {
        // matrix dimensions = NxN
        //int N = 2560;
        //int N = 5120;
        //int N = 7680;
        int N = 10240;

        // scaling factors
        float alpha = 1.0, beta = 0.0;

        // matrices (heap-allocated, 64-byte aligned)
        float *A = (float *) mkl_malloc(sizeof(float)*N*N, 64);
        float *B = (float *) mkl_malloc(sizeof(float)*N*N, 64);
        float *C = (float *) mkl_malloc(sizeof(float)*N*N, 64);

        // initialize the matrices
        for (int i = 0; i < N; i++) {
            for (int j = 0; j < N; j++) {
                A[i*N+j] = (float) i+j;
                B[i*N+j] = (float) i-j;
                C[i*N+j] = 0.0;
            }
        }

        cout << "MIC devices present: " << mkl_mic_get_device_count() << "\n";

        cout << "Warm-up...";
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, alpha, A, N, B, N, beta, C, N);
        cout << "Done\n";

        int nIter = 5, nOmpThr;
        #pragma omp parallel
        nOmpThr = omp_get_num_threads();

        double aveTime = 0., minTime = 1e6, maxTime = 0.;
        for (int i = 0; i < nIter; i++) {
            double startTime = dsecnd();
            cout << "Performing multiplication " << i << endl;
            cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        N, N, N, alpha, A, N, B, N, beta, C, N);
            double endTime = dsecnd();
            double runtime = endTime - startTime;
            maxTime = (maxTime > runtime) ? maxTime : runtime;
            minTime = (minTime < runtime) ? minTime : runtime;
            aveTime += runtime;
        }
        aveTime /= nIter;

        cout << "matrix size: " << N << endl;
        cout << "nThreads: " << nOmpThr << endl;
        cout << "nIter: " << nIter << endl;
        cout << "maxRT: " << maxTime << endl;
        cout << "minRT: " << minTime << endl;
        cout << "aveRT: " << aveTime << endl;
        cout << "aveGflop/S: " << 2e-9*N*N*N/aveTime << endl;

        mkl_free(A);
        mkl_free(B);
        mkl_free(C);
    }
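To make the cblas_sgemm calling convention concrete, here is a minimal 2x2 illustration (not part of the files above); the three N arguments following each matrix in the program above are, as here, the leading dimensions (the row length in row-major storage):

    // 2x2 example: C = 1.0*A*B + 0.0*C in row-major storage.
    #include <iostream>
    #include "mkl.h"
    using namespace std;

    int main()
    {
        float A[4] = { 1, 2,
                       3, 4 };
        float B[4] = { 5, 6,
                       7, 8 };
        float C[4] = { 0, 0,
                       0, 0 };
        // arguments: M = N = K = 2, alpha = 1, leading dimensions = 2, beta = 0
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 2, 1.0f, A, 2, B, 2, 0.0f, C, 2);
        for (int i = 0; i < 4; i++) cout << C[i] << " ";   // prints: 19 22 43 50
        cout << endl;
    }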

Compiling

It is easiest to use makefiles to do the compiling. Save the following to a file called Makefile in your working directory (note that the command lines must begin with a tab character):

    all: sgemm

    sgemm:
            icpc -mkl -O3 -openmp -Wno-unknown-pragmas -std=c++0x -vec-report3 sgemm.cpp -o sgemm

    clean:
            rm sgemm

To compile, just type

    $ make -B

This will create an executable file called sgemm. Note that we are using the Intel C++ compiler icpc; the GCC compiler will not work with the MIC (as far as I know). Usually make would recognize when you made changes to sgemm.cpp and would re-compile when you simply executed make (instead of make -B). The cluster file system seems to confuse make's timestamp checks for some reason; always issue make -B to re-compile your project (you could also run make clean followed by make).

Submit Script

Although a program can be directly executed, it is best to create a submit script (this will be helpful should you wish to submit your program using qsub). The nice thing about the submit script is that you can easily change the number of threads on the node, specify whether the Phi co-processor is to be used, and so on. You can also change the number of threads on the MIC (however, it seems best to let the runtime figure this out automatically). The submit script, sgemm.sh, is shown below.

    #!/bin/sh
    #$ -l phi -pe smp 16 -cwd

    export MIC_ENV_PREFIX=MIC

    # number of threads on node
    export OMP_NUM_THREADS=1

    # enable automatic MIC offload for MKL
    # 0=do not offload, 1=offload to MIC
    export MKL_MIC_ENABLE=1

    # number of threads on MIC
    # if not set, MIC chooses number of threads
    #export MIC_OMP_NUM_THREADS=240

    export MIC_KMP_AFFINITY=balanced
    #export MIC_KMP_AFFINITY=compact
    #export MIC_KMP_AFFINITY=scatter

    ./sgemm

To run the program, simply execute the script:

    $ sh ./sgemm.sh

The second line of the submit script is read when using qsub. Here, I'm specifying that I want a node with a Phi and that I want to use 16 slots in the node; this line is ignored if qsub is not used.
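Incidentally, the MKL_MIC_ENABLE switch can also be flipped from inside a program via MKL's automatic-offload service functions. A minimal sketch, assuming the mkl_mic_* service functions are present in the MKL version installed on Neon:

    // Programmatic alternative to 'export MKL_MIC_ENABLE=1' in sgemm.sh.
    #include <iostream>
    #include "mkl.h"
    using namespace std;

    int main()
    {
        mkl_mic_enable();    // same effect as MKL_MIC_ENABLE=1
        cout << "MIC devices present: " << mkl_mic_get_device_count() << endl;
        // ... MKL calls made here are eligible for automatic offload ...
        mkl_mic_disable();   // turn automatic offload back off
    }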

Results

    [mbognar@neon-p-compute-6-22 sgemm]$ sh ./sgemm.sh
    MIC devices present: 1
    Warm-up...Done
    Performing multiplication 0
    Performing multiplication 1
    Performing multiplication 2
    Performing multiplication 3
    Performing multiplication 4
    matrix size: 10240
    nThreads: 1
    nIter: 5
    maxRT: 2.29494
    minRT: 2.26511
    aveRT: 2.28032
    aveGflop/S: 941.747

The Gflop/S is highly dependent on the matrix size NxN: using N=7680 yields 868.9 Gflop/S, while using N=8000 yields 616.1 Gflop/S.

Multiple Slots & Phi

To use, for example, 16 slots on the node and the MIC, simply set

    export OMP_NUM_THREADS=16

in the sgemm.sh file. MKL will now run 16 threads on the node and utilize the Phi co-processor. Output:

    [mbognar@neon-p-compute-6-22 sgemm]$ sh ./sgemm.sh
    MIC devices present: 1
    Warm-up...Done
    Performing multiplication 0
    Performing multiplication 1
    Performing multiplication 2
    Performing multiplication 3
    Performing multiplication 4
    matrix size: 10240
    nThreads: 16
    nIter: 5
    maxRT: 1.6534
    minRT: 1.63618
    aveRT: 1.64063
    aveGflop/S: 1308.94

This can result in a nice speed-up over using the Phi alone.
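If you want to double-check how many host threads OpenMP actually picked up from OMP_NUM_THREADS, a one-line query (a small sketch, not one of the files from this page) does the trick:

    // Report the host thread count implied by OMP_NUM_THREADS.
    #include <iostream>
    #include "omp.h"
    using namespace std;

    int main()
    {
        cout << "host threads available: " << omp_get_max_threads() << endl;
    }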

MIC_KMP_AFFINITY

Affinity specifies how the threads are distributed among the MIC cores, and it can result in notable speed differences. Be sure to tinker with MIC_KMP_AFFINITY in the sgemm.sh file to get the fastest speed.

    export MIC_KMP_AFFINITY=balanced   -- aveGflop/S: 942.452
    export MIC_KMP_AFFINITY=compact    -- aveGflop/S: 945.187
    export MIC_KMP_AFFINITY=scatter    -- aveGflop/S: 697.695

Explicit Offload to the MIC

Files used in this section: offload.cpp, Makefile, offload.sh

It is possible to explicitly offload code to the MIC using the #pragma offload target(mic) directive. The following program, offload.cpp, computes the sum 1+2+...+100 on both the MIC and the CPU.

    #include <iostream>
    #include "mkl.h"
    using namespace std;

    int main()
    {
        const int n = 100;
        int sum_cpu = 0, sum_mic = 0;

        // compute the sum on the MIC; inout copies sum_mic to the MIC and back
        #pragma offload target(mic) inout(sum_mic)
        {
            for (int i = 1; i <= n; i++)
                sum_mic += i;
        }
        cout << "sum_mic = " << sum_mic << endl;

        // compute the sum on the CPU
        for (int i = 1; i <= n; i++)
            sum_cpu += i;
        cout << "sum_cpu = " << sum_cpu << endl;
    }

Save the following to Makefile (recipe lines must begin with a tab character).

    all: offload

    # compile
    offload:
            icpc -mkl -O3 -openmp -Wno-unknown-pragmas -std=c++0x -vec-report3 offload.cpp -o offload

    # clean all files
    clean:
            rm offload

To compile offload.cpp, just type make -B. The submit file is called offload.sh.

    #!/bin/sh
    #$ -l phi -pe smp 16 -cwd

    export MIC_ENV_PREFIX=MIC

    # number of threads on node
    export OMP_NUM_THREADS=1

    # enable automatic MIC offload for MKL
    # 0=do not offload, 1=offload to MIC
    export MKL_MIC_ENABLE=1

    # number of threads on MIC
    # if not set, MIC chooses number of threads
    #export MIC_OMP_NUM_THREADS=240

    export MIC_KMP_AFFINITY=balanced
    #export MIC_KMP_AFFINITY=compact
    #export MIC_KMP_AFFINITY=scatter

    ./offload

Running the code yields

    [mbognar@neon-kp-mm-compute-4-3 phi]$ sh offload.sh
    sum_mic = 5050
    sum_cpu = 5050
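The offloaded region can itself be multi-threaded with OpenMP so that the computation uses the MIC's many cores. A sketch of this (my own variation, not one of the files from this page):

    // Sketch: explicit offload combined with OpenMP threads on the MIC.
    #include <iostream>
    #include "omp.h"
    using namespace std;

    int main()
    {
        const int n = 1000000;
        double sum = 0.0;
        #pragma offload target(mic) inout(sum)
        {
            // each MIC thread accumulates a partial sum; reduction combines them
            #pragma omp parallel for reduction(+:sum)
            for (int i = 1; i <= n; i++)
                sum += (double) i;
        }
        cout << "sum = " << sum << endl;   // n(n+1)/2 = 500000500000
    }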

Random Number Generation on the MIC

Files used in this section: rng.cpp, Makefile, rng.sh

The following C++ program shows how to use the MKL RNG on the MIC. There is no need to seed individual threads; all threads draw from the same RNG stream. I have set the stream seed to the clock; just set seed to a fixed integer for a reproducible seed. The syntax for the MKL RNG is anything but convenient (the same goes for the Vector Statistical Library, VSL). rng.cpp is shown below:

    #include <iostream>
    #include <sys/time.h>
    #include "mkl.h"
    using namespace std;

    int main()
    {
        const int n = 100000;
        float r[n];
        VSLStreamStatePtr stream;

        // set seed to clock
        timeval tim;
        gettimeofday(&tim, NULL);
        int seed = tim.tv_sec;

        // initialize RNG on MIC
        #pragma offload target(mic) in(seed) nocopy(stream)
        vslNewStream( &stream, VSL_BRNG_MT2203, seed );

        // draw n Uniform(0,1) variates on the MIC
        #pragma offload target(mic) in(n) out(r) nocopy(stream)
        {
            vsRngUniform( VSL_RNG_METHOD_UNIFORM_STD, stream, n, r, 0.0, 1.0 );
        }
        cout << "vsRngUniform:\n";
        for (int i = 0; i < 5; i++)
            cout << r[i] << endl;
        cout << endl;

        // draw n N(0,1) variates on the MIC
        #pragma offload target(mic) in(n) out(r) nocopy(stream)
        {
            vsRngGaussian( VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2, stream, n, r, 0.0, 1.0 );
        }
        cout << "vsRngGaussian:\n";
        for (int i = 0; i < 5; i++)
            cout << r[i] << endl;
        cout << endl;

        // free the stream on the MIC
        #pragma offload target(mic) nocopy(stream)
        vslDeleteStream( &stream );
    }

The program produces the following output.

    [mbognar@neon-p-compute-6-22 phi]$ sh ./rng.sh
    vsRngUniform:
    0.219141
    0.37993
    0.212526
    0.14261
    0.669456

    vsRngGaussian:
    -0.901821
    1.49422
    1.07613
    -0.507491
    -0.0826771
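If you want to test the VSL syntax without a MIC, the same calls run on the host. A minimal host-only sketch (not one of the files from this page):

    // Host-only VSL example with a fixed seed (reproducible draws).
    #include <iostream>
    #include "mkl.h"
    using namespace std;

    int main()
    {
        const int n = 5;
        float r[n];
        VSLStreamStatePtr stream;
        vslNewStream( &stream, VSL_BRNG_MT2203, 12345 );   // fixed seed
        vsRngUniform( VSL_RNG_METHOD_UNIFORM_STD, stream, n, r, 0.0, 1.0 );
        for (int i = 0; i < n; i++)
            cout << r[i] << endl;
        vslDeleteStream( &stream );
    }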

qsub

Once you are done writing your program, you should submit it using qsub on a login-node if it will take a while to run (you will see username@neon-login-0-1 or username@neon-login-0-2 at the command prompt when you are on a login-node). The second line of your submit script specifies the options passed to qsub. For example,

    #!/bin/sh
    #$ -l phi -pe smp 16 -cwd
    ...

specifies to run your code on the UI queue (the default) using a node with a Phi and reserving 16 slots (-cwd specifies that the program's output is to be placed in the current working directory; it is placed in your home directory if not specified). To run on the LT node with 4 slots, use

    #!/bin/sh
    #$ -q LT -pe smp 4 -cwd
    ...

in your submit script. To submit your code to the queue, type

    $ qsub ./my-submit-script.sh

in your working directory. To view the status of your program, type

    $ qstat -u username

To kill your program, type

    $ qdel job-ID

where job-ID is the job number of your program (which is printed by the qstat command).

Grad Students: Good Stuff to Know

Invest some of your time now to learn about the following topics. It will save you a ton of time later.

Other Resources