The Intel Xeon Phi co-processor is based on Intel's MIC (Many Integrated Core) architecture and is designed to run highly parallel applications. The Neon cluster has 29 Intel Xeon Phi 5110P co-processors; each has 60 cores, 8GB of memory, and runs at 1.053GHz.
Login to a UI node with an Intel Phi co-processor
$ qlogin -l phi
The default queue is the UI queue (i.e. the above command is the same as qlogin -l phi -q UI). By default, all slots in the node are reserved. You can also try the all.q:
$ qlogin -l phi -q all.q
The all.q may evict you if a user initiates a higher-priority job. The UI queue will not evict your job, but you may not be able to reserve a node on it. If you don't need a Phi, just omit the -l phi flag.
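To see which queues currently have free slots before choosing one, the standard Grid Engine cluster-queue summary may help (exact column names vary by Grid Engine version):
$ qstat -g c
This prints one line per queue with counts of used, available, and total slots.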
If a node is unavailable, try our LT node (which also contains a Phi). To log in to the LT node, type
$ qlogin -q LT
This will reserve all slots in the LT node and will lock others out. If possible, reserve fewer than 16 slots so that others may still qlogin into the LT node. For example, to reserve 4 slots type
$ qlogin -q LT -pe smp 4
If you only reserve 4 slots, do not let your program use more than 4 slots (you will disturb another person working on the node).
To use the Phi, you must set up the Intel compiler environment variables. Add the following at the end of the .bash_profile file in your home directory:
# set up environment for Intel compiler
source /opt/intel/bin/compilervars.sh intel64 lp64
Log out, then log back in for the changes to take effect. The compiler environment variables will now be set up automatically every time you log in.
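To verify that the environment is set up, you can ask the Intel compiler for its version; this should print the compiler banner rather than "command not found":
$ icpc -V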
The key to getting fast speeds on the MIC is Intel's highly tuned MKL library. The library is compatible with C/C++/Fortran; contains the expected linear algebra, vector math, and statistics functions (BLAS, LAPACK, etc.); and is highly vectorized for speed. MKL also contains various random number generators (RNGs) and much more; see the MKL Reference Manual for details.
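As a quick taste of the API before the larger examples below, here is a minimal sketch (a hypothetical vml.cpp, not one of the files used below) that evaluates exp over a vector with the VML vdExp function and computes a dot product with the BLAS cblas_ddot function; it compiles the same way as the examples that follow:
// minimal MKL sketch: VML vector math + BLAS level-1 (hypothetical example)
#include <iostream>
#include <mkl.h>
using namespace std;

int main()
{
    const int n = 4;
    double x[n] = {0.0, 1.0, 2.0, 3.0};
    double y[n];
    // VML: y[i] = exp(x[i]) for all i in one vectorized call
    vdExp(n, x, y);
    for (int i = 0; i < n; i++)
        cout << "exp(" << x[i] << ") = " << y[i] << endl;
    // BLAS level-1: dot product of x and y (stride 1 in both)
    cout << "x.y = " << cblas_ddot(n, x, 1, y, 1) << endl;
}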
Files used in this section: sgemm.cpp, Makefile, sgemm.sh
The Intel compiler can automatically offload computations to the Phi. The following C++ program initializes two matrices, A and B, takes their product, and stores the result in matrix C (all matrices are NxN). Automatic offload to the MIC is simple: include the mkl.h header file and call an MKL function (here, the cblas_sgemm single-precision general matrix-matrix product). If it is computationally advantageous, the computation is automatically offloaded to the MIC, executed, and the result returned to the host. Syntax: cblas_sgemm computes C = alpha*A*B + beta*C, where A, B, and C are matrices and alpha and beta are scaling constants. The matrices are stored as vectors in row-major format; for example, with N = 2 the matrix with rows (a, b) and (c, d) is stored as {a, b, c, d}, so element (i, j) sits at index i*N+j.
// Automatic Offload SGEMM
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <mkl.h>
using namespace std;

int main()
{
    // matrix dimensions = NxN
    //int N = 2560;
    //int N = 5120;
    //int N = 7680;
    int N = 10240;
    // scaling factors
    float alpha = 1.0, beta = 0.0;
    // matrices (heap-allocated: they are far too large for the stack;
    // 64-byte alignment helps vectorization)
    float *A = (float *) mkl_malloc(N*N*sizeof(float), 64);
    float *B = (float *) mkl_malloc(N*N*sizeof(float), 64);
    float *C = (float *) mkl_malloc(N*N*sizeof(float), 64);
    // initialize the matrices (row-major storage)
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            A[i*N+j] = (float) (i+j);
            B[i*N+j] = (float) (i-j);
            C[i*N+j] = 0.0;
        }
    }
    cout << "MIC devices present: "
         << mkl_mic_get_device_count() << "\n";
    // the first call pays one-time initialization costs,
    // so exclude it from the timings
    cout << "Warm-up...";
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N, alpha, A, N, B, N, beta, C, N);
    cout << "Done\n";
    int nIter = 5, nOmpThr;
    #pragma omp parallel
    nOmpThr = omp_get_num_threads();
    double aveTime = 0., minTime = 1e6, maxTime = 0.;
    for (int i = 0; i < nIter; i++) {
        cout << "Performing multiplication " << i << endl;
        double startTime = dsecnd();
        cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    N, N, N, alpha, A, N, B, N, beta, C, N);
        double endTime = dsecnd();
        double runtime = endTime - startTime;
        maxTime = (maxTime > runtime) ? maxTime : runtime;
        minTime = (minTime < runtime) ? minTime : runtime;
        aveTime += runtime;
    }
    aveTime /= nIter;
    cout << "matrix size: " << N << endl;
    cout << "nThreads: " << nOmpThr << endl;
    cout << "nIter: " << nIter << endl;
    cout << "maxRT: " << maxTime << endl;
    cout << "minRT: " << minTime << endl;
    cout << "aveRT: " << aveTime << endl;
    // each multiplication performs 2*N^3 floating-point operations
    cout << "aveGflop/S: " << 2e-9*N*N*N/aveTime << endl;
    mkl_free(A); mkl_free(B); mkl_free(C);
}
It is easiest to use makefiles to do the compiling. Save the following to a file called Makefile in your working directory.
all: sgemm

# compile (recipe lines must be indented with a tab)
sgemm:
	icpc -mkl -O3 -openmp -Wno-unknown-pragmas -std=c++0x -vec-report3 sgemm.cpp -o sgemm

# clean all files
clean:
	rm sgemm
To compile, just type
$ make -B
This will create an executable file called sgemm. Note that we are using the Intel C++ compiler icpc; the GCC compiler will not work with the MIC (as far as I know). Usually make would recognize when you changed sgemm.cpp and would re-compile when you simply executed make (instead of make -B). The cluster file system seems to confuse this check for some reason; always issue make -B to re-compile your project (or run make clean followed by make).
Although a program can be executed directly, it is best to create a submit script (this will be helpful should you wish to submit your program using qsub). The nice thing about the submit script is that you can easily change the number of threads on the node, specify whether the Phi co-processor is to be used, and so on. You can also change the number of threads on the MIC (however, it seems best to let the MKL runtime figure this out automatically). The submit script, sgemm.sh, is shown below.
#!/bin/sh
#$ -l phi -pe smp 16 -cwd
export MIC_ENV_PREFIX=MIC
#number of threads on node
export OMP_NUM_THREADS=1
#enable automatic MIC offload for MKL
#0=do not offload, 1=offload to MIC
export MKL_MIC_ENABLE=1
#number of threads on MIC
#if not set, MIC chooses number of threads
#export MIC_OMP_NUM_THREADS=240
export MIC_KMP_AFFINITY=balanced
#export MIC_KMP_AFFINITY=compact
#export MIC_KMP_AFFINITY=scatter
./sgemm
To run the program, simply execute the script
$ sh ./sgemm.sh
The second line of the submit script is read only when using qsub. Here, I'm specifying that I want a node with a Phi and that I want to use 16 slots on the node; this line is ignored when qsub is not used.
[mbognar@neon-p-compute-6-22 sgemm]$ sh ./sgemm.sh
MIC devices present: 1
Warm-up...Done
Performing multiplication 0
Performing multiplication 1
Performing multiplication 2
Performing multiplication 3
Performing multiplication 4
matrix size: 10240
nThreads: 1
nIter: 5
maxRT: 2.29494
minRT: 2.26511
aveRT: 2.28032
aveGflop/S: 941.747
The Gflop/S rate is highly dependent on the matrix size N. Using N=7680 yields 868.9 Gflop/S, while N=8000 yields 616.1 Gflop/S.
To use, for example, 16 slots on the node along with the MIC, simply set
export OMP_NUM_THREADS=16
in the sgemm.sh file. MKL will then run 16 threads on the node and also utilize the Phi co-processor. Output:
[mbognar@neon-p-compute-6-22 sgemm]$ sh ./sgemm.sh
MIC devices present: 1
Warm-up...Done
Performing multiplication 0
Performing multiplication 1
Performing multiplication 2
Performing multiplication 3
Performing multiplication 4
matrix size: 10240
nThreads: 16
nIter: 5
maxRT: 1.6534
minRT: 1.63618
aveRT: 1.64063
aveGflop/S: 1308.94
This can result in a nice speed-up over using the Phi alone.
Affinity specifies how the threads are distributed among the MIC's cores, and it can result in notable speed differences. Roughly, compact packs consecutive threads onto the same core, scatter spreads consecutive threads across cores as evenly as possible, and balanced spreads threads evenly across cores while keeping consecutive threads close together. Be sure to tinker with MIC_KMP_AFFINITY in the sgemm.sh file to get the fastest speed.
export MIC_KMP_AFFINITY=balanced -- aveGflop/S: 942.452
export MIC_KMP_AFFINITY=compact -- aveGflop/S: 945.187
export MIC_KMP_AFFINITY=scatter -- aveGflop/S: 697.695
Files used in this section: offload.cpp, Makefile, offload.sh
It is possible to explicitly offload code to the MIC using the #pragma offload target(mic) directive. The following program, offload.cpp, computes the sum 1+2+...+100 on both the MIC and CPU.
#include "iostream"
#include "mkl.h"
using namespace std;
int main()
{
const int n = 100;
int sum_cpu = 0, sum_mic = 0;
#pragma offload target(mic) \
in(sum_mic) out(sum_mic)
{
for(int i=1; i<=n; i++)
sum_mic += i;
}
cout << "sum_mic = " << sum_mic << endl;
for(int i=1; i<=n; i++)
sum_cpu += i;
cout << "sum_cpu = " << sum_cpu << endl;
}
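For heap-allocated arrays, the offload pragma needs an explicit length clause so the runtime knows how much data to move between the host and the MIC. A minimal sketch (a hypothetical square.cpp, not one of the files used here):
// explicit offload with array transfer clauses (hypothetical sketch)
#include <iostream>
using namespace std;

int main()
{
    const int n = 1000;
    float *a = new float[n];
    float *b = new float[n];
    for (int i = 0; i < n; i++)
        a[i] = (float) i;
    // in(a:length(n)) copies a to the MIC before the block runs;
    // out(b:length(n)) copies b back to the host when it finishes
    #pragma offload target(mic) in(a:length(n)) out(b:length(n))
    {
        for (int i = 0; i < n; i++)
            b[i] = a[i]*a[i];
    }
    cout << "b[10] = " << b[10] << endl;   // prints 100
    delete[] a;
    delete[] b;
}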
Save the following to Makefile.
all: offload

# compile (recipe lines must be indented with a tab)
offload:
	icpc -mkl -O3 -openmp -Wno-unknown-pragmas -std=c++0x -vec-report3 offload.cpp -o offload

# clean all files
clean:
	rm offload
To compile offload.cpp, just type make -B.
The submit file is called offload.sh.
#!/bin/sh
#$ -l phi -pe smp 16 -cwd
export MIC_ENV_PREFIX=MIC
#number of threads on node
export OMP_NUM_THREADS=1
#enable automatic MIC offload for MKL
#0=do not offload, 1=offload to MIC
export MKL_MIC_ENABLE=1
#number of threads on MIC
#if not set, MIC chooses number of threads
#export MIC_OMP_NUM_THREADS=240
export MIC_KMP_AFFINITY=balanced
#export MIC_KMP_AFFINITY=compact
#export MIC_KMP_AFFINITY=scatter
./offload
Running the code yields
[mbognar@neon-kp-mm-compute-4-3 phi]$ sh offload.sh
sum_mic = 5050
sum_cpu = 5050
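To confirm that the block really executed on the co-processor, you can set the OFFLOAD_REPORT environment variable (levels 1-3) before running; the offload runtime then prints timing and data-transfer details for each offload region:
$ OFFLOAD_REPORT=2 sh offload.sh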
Files used in this section: rng.cpp, Makefile, rng.sh
The following C++ program shows how to use the MKL RNG on the MIC. There is no need to seed the individual threads; all threads draw from the same RNG stream. I have set the stream seed to the clock; set seed to a fixed integer for reproducible results. The syntax for the MKL RNG is anything but convenient (the same goes for the rest of the Vector Statistical Library, VSL). rng.cpp is shown below:
#include "iostream"
#include "sys/time.h"
#include "mkl.h"
using namespace std;
int main()
{
const int n = 100000;
float r[n];
VSLStreamStatePtr stream;
// set seed to clock
timeval tim;
gettimeofday(&tim, NULL);
int seed = tim.tv_sec;
// initialize RNG on MIC
#pragma offload target(mic) in(seed) nocopy(stream)
vslNewStream( &stream, VSL_BRNG_MT2203, seed );
#pragma offload target(mic) \
in(n) out(r) nocopy(stream)
{
vsRngUniform( VSL_RNG_METHOD_UNIFORM_STD,
stream, n, r , 0.0, 1.0 );
}
cout << "vsRngUniform:\n";
for(int i=0; i<5; i++) {
cout << r[i] << endl;
}
cout << endl;
#pragma offload target(mic) \
in(n) out(r) nocopy(stream)
{
vsRngGaussian( VSL_RNG_METHOD_GAUSSIAN_BOXMULLER2,
stream, n, r, 0.0, 1.0 );
}
cout << "vsRngGaussian:\n";
for(int i=0; i<5; i++) {
cout << r[i] << endl;
}
cout << endl;
}
The program produces the following output.
[mbognar@neon-p-compute-6-22 phi]$ sh ./rng.sh
vsRngUniform:
0.219141
0.37993
0.212526
0.14261
0.669456
vsRngGaussian:
-0.901821
1.49422
1.07613
-0.507491
-0.0826771
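The example above shares one stream across everything it offloads. If you instead want one independent stream per OpenMP thread, the MT2203 basic generator provides a large family of independent generators selected via VSL_BRNG_MT2203 + t. A minimal host-side sketch (hypothetical, not one of the files used here; the same calls work inside an offload region):
// one independent MT2203 stream per OpenMP thread (hypothetical sketch)
#include <iostream>
#include <omp.h>
#include <mkl.h>
using namespace std;

int main()
{
    const int n = 5, seed = 12345;
    int nthr;
    #pragma omp parallel
    nthr = omp_get_num_threads();
    VSLStreamStatePtr *streams = new VSLStreamStatePtr[nthr];
    // VSL_BRNG_MT2203+t selects the t-th member of a family of
    // independent Mersenne Twister generators
    for (int t = 0; t < nthr; t++)
        vslNewStream(&streams[t], VSL_BRNG_MT2203 + t, seed);
    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        float r[n];
        vsRngUniform(VSL_RNG_METHOD_UNIFORM_STD, streams[t], n, r, 0.0, 1.0);
        #pragma omp critical
        cout << "thread " << t << ": r[0] = " << r[0] << endl;
    }
    for (int t = 0; t < nthr; t++)
        vslDeleteStream(&streams[t]);
    delete[] streams;
}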
Once you are done writing your program, submit it using qsub from a login node if it will take a while to run (you will see username@neon-login-0-1 or username@neon-login-0-2 at the command prompt when you are on a login node). The second line of your submit script specifies the options passed to qsub. For example,
#!/bin/sh
#$ -l phi -pe smp 16 -cwd
...
specifies that your code is to run on the UI queue (the default) using a node with a Phi and reserving 16 slots (-cwd specifies that the program's output is placed in the current working directory; otherwise it is placed in your home directory). To run on the LT node with 4 slots, use
#!/bin/sh
#$ -q LT -pe smp 4 -cwd
...
in your submit script. To submit your code to the queue, type
$ qsub ./my-submit-script.sh
in your working directory. To view the status of your program, type
$ qstat -u username
To kill your program, type
$ qdel job-ID
where job-ID is the job number of your program (which is printed by the qstat command).
Invest some of your time now to learn about the following topics. It will save you a ton of time later.