
Programming Introduction

The systems at the John C. Slater Lab support several styles of programming:

  • serial programs: on SUN desktops and IBM cluster nodes 
  • shared memory programs: on SUN E5000 server and POWER3 IBM SP nodes 
  • message passing programs: on a group of SUN desktops, IBM cluster nodes or IBM SP nodes. 
This brief tutorial introduces the commands and flags to get started. The on-line documentation should be consulted for more details. 

Examples are given for Fortran 95 programs, but all options are the same for C and C++ programs; the corresponding compiler names are found by replacing f90 with c or C. 
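For example, the same build looks like this in each language (a sketch; the hello source files are hypothetical): 
xlf90 -o hello hello.f
xlc -o hello hello.c
xlC -o hello hello.C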

All aspects of parallel programming are covered in detail in the class offered by Erik Deumens. See http://www.qtp.ufl.edu/~deumens/adv-prog.html for details. 

Serial programs

The IBM Fortran compiler is called xlf. Aliases are xlf77 and xlf90. 
xlf90 -o serial serial.f
will compile the program serial.f and produce an executable called serial. 

The IBM C compiler is called xlc and the C++ compiler is called xlC. All options discussed below hold equally for all languages. The SUN SPARC compilers are called f77, f90, f95, cc, and CC. For Solaris on Intel the Portland Group compilers pgf77 and pgf90 are available. 

You can link with the BLAS, LAPACK, and FFTW libraries (SUN and IBM) and the ESSL libraries (IBM only) by specifying the link flags -lblas, -llapack, -lfftw, and -lessl. 
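For example, a program calling LAPACK and BLAS routines could be linked like this (a sketch; solve.f is a hypothetical source file): 
xlf90 -o solve solve.f -llapack -lblas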

You can gain maximal optimization for each architecture with the following flags; an example command is shown after the list. 

  • SPARC: -O4 -fast
  • POWER: -O5 -qarch=pwr -qtune=pwr -qhot
  • POWER2: -O5 -qarch=pwr2 -qtune=pwr2 -qhot
  • POWER2SC: -O5 -qarch=p2sc -qtune=p2sc -qhot
  • POWER3: -O5 -qarch=pwr3 -qtune=pwr3 -qhot
  • POWER4: -O5 -qarch=pwr4 -qtune=pwr4 -qhot
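For example, to build the serial example fully optimized for the POWER3 SP nodes: 
xlf90 -O5 -qarch=pwr3 -qtune=pwr3 -qhot -o serial serial.f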

NOTE It is possible within the IBM POWER architecture to do cross compilations. The -qarch flag specifies the instruction set the compiler is allowed to use. Some sub-architectures have extra instructions; as a result, binaries using such an instruction will crash with illegal instruction errors when run on sub-architectures that do not have it. The hierarchy is as follows:

POWER pwr ----> POWER2 pwr2 ----> POWER2SC p2sc
  |
  +-----------> POWERPC ppc
  |
  +-----------> POWER3 pwr3 ----> POWER4 pwr4
The -qtune flag specifies which timings and cache sizes to use during optimization. The value of that flag can make the binary run faster on some sub-architectures, but it will never cause a crash. As a result it is OK to compile with
-O5 -qarch=pwr -qtune=pwr4
This binary will always run, and run fastest on POWER4 machines like the p690. But it will not run as fast as when you use
-O5 -qarch=pwr4 -qtune=pwr4
because this last set is allowed to use an extra instruction that is very fast for floating point operations (FMA, fused multiply-add). NOTE If you compile on a machine with AIX 5, the binary will not run on a machine with AIX 4. However, if you compile on AIX 4, the binary will run on AIX 5.

The libraries are optimized for different architectures and should be used with these optimizations by specifying the names as follows 

  • POWER: -lessl
  • POWER2: -lesslp2
  • POWER3: -lesslsmp
Any optimization level higher than -O3 is very aggressive and may alter the numerical results; you should test these levels for accuracy. 
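A minimal way to test this is to compile the same small program at -O3 and again at -O5 and compare the printed results. The sketch below (acctest.f is a hypothetical file) sums a series whose rounding is sensitive to evaluation order: 

program acctest
  integer, parameter :: n = 1000000
  real*8 :: s
  integer :: i
  s = 0.d0
  ! the summation order, and hence the rounding, may change
  ! when the optimizer rearranges this loop
  do i=1,n
     s = s + 1.d0/dfloat(i)
  end do
  print *,' sum =',s
end program acctest

If the two binaries print sums that differ beyond the last few digits, the optimizer has changed the arithmetic. 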

OpenMP programs

Consider a simple program using OpenMP directives for achieving parallel execution on shared memory machines. (The source is reproduced at the end of this section). 

It can be compiled for the IBM with 
xlf90_r -qsmp=omp -o openmp openmp.f
Then setting the environment variable with the appropriate shell command (example in Korn shell) 
export OMP_NUM_THREADS=4
or, for XLF prior to XLF 7.1, 
export XLSMPOPTS="parthds=4"
will run the application on 4 CPUs if the system has that many CPUs. 

To compile it for the SUN SPARC the command is 
f90 -openmp -o openmp openmp.f
and you set 
export PARALLEL=4
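
On either platform you can verify that the setting took effect with a small test program that asks the OpenMP runtime how many threads it is using (a sketch; it assumes the omp_lib module shipped with OpenMP-aware compilers): 

program threads
  use omp_lib
  integer :: nt
  nt = 0
  !$OMP parallel
  !$OMP master
  nt = omp_get_num_threads()   ! number of threads in this parallel region
  !$OMP end master
  !$OMP end parallel
  print *,' running with ',nt,' threads'
end program threads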

Optimization and linking in libraries is the same as for serial programs. 

To link to the POSIX threads library from C or C++ use the link flag -lpthread and specify the "reentrant" versions of the compilers, xlc_r and xlC_r, on the IBM. 
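For example (a sketch; pthread.c is a hypothetical source file): 
xlc_r -o pthread pthread.c -lpthread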

The example also shows how to time parallel programs: use real time (wall clock time) instead of CPU time. 

program openmp
  integer, parameter :: n = 100, q = 100
  real*8, dimension(n,n) :: a,b,c
  integer :: i,j,k,p
  integer :: ic,ir
  real*8 :: time0 = 0.d0, time1 = 0.d0
  real*8 :: rtc, time10 = 0.d0, time11 = 0.d0
  do j=1,n
     do i=1,n
        a(i,j) = float(i)*float(j)
        b(i,j) = sqrt(float(i+j))
        c(i,j) = 0.d0
     end do
  end do
  call system_clock(count=ic,count_rate=ir)
  if (ir /= 0) then
     time0 = dfloat(ic)/dfloat(ir)
  else
     print *,'No time available.'
  end if
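  ! rtc() is an IBM XLF service routine returning wall-clock time in seconds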
  time10 = rtc()
  !$OMP parallel do shared (a,b,c) private (i,j,k,p)
  do j=1,n
     do p=1,q
     do i=1,n
        do k=1,n
           c(i,j) = c(i,j) + a(i,k)*b(k,j)   ! accumulate c = a*b
        end do
     end do
     end do
  end do
  !$OMP end parallel do
  call system_clock(count=ic,count_rate=ir)
  time1 = dfloat(ic)/dfloat(ir)
  time11 = rtc()
  print *,' Time ',time1-time0
  print *,' Time ',time11-time10
  print *,' Speed ',dfloat(2*n*n*n*q)/ &
           (time11-time10)/dfloat(1000000),' Mflops'
  stop
end program openmp
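
To build and run this example on a 4-CPU IBM node (a sketch combining the commands above): 
xlf90_r -qsmp=omp -o openmp openmp.f
export OMP_NUM_THREADS=4
./openmp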
    

MPI programs

Consider a simple program that uses the Message Passing Interface (MPI) standard to achieve parallel execution on a system of compute nodes connected by a communication device. (The source is reproduced at the end of this section.) 

It can be compiled on the IBM SP with 
mpxlf -o mpi mpi.f
It can then be executed by using the poe command or by running the program directly. Options for controlling the execution are passed as flags or as environment variables. 

The most important thing to do is create a file with hostnames. The default is host.list. You can specify another name in the variable MP_HOSTFILE or with the flag -hostfile. 
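A host.list file simply names one node per line, with a node repeated once for every task it should run (the node names below are hypothetical): 
node01.qtp.ufl.edu
node02.qtp.ufl.edu
node03.qtp.ufl.edu
node04.qtp.ufl.edu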

The number of tasks is set in MP_PROCS or with the flag -procs. 

The use of the fast switch on the SP can happen in two modes, called IP mode and User Space (us) mode. User Space mode gives the best performance. The mode can be specified in MP_EUILIB=us or ip, or with -euilib us or ip. 

The execution is then started with 
poe ./mpi -procs 4 -euilib us
or with 
./mpi -procs 4 -euilib us
The result is the same. 

For C and C++ programs, change mpxlf into mpxlc or mpxlC. 

To compile mixed mode programs using both MPI and OpenMP or POSIX Threads, use mpxlf_r, mpxlc_r and mpxlC_r. 
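For example, a mixed MPI/OpenMP code could be compiled like this (a sketch; hybrid.f is a hypothetical source file): 
mpxlf_r -qsmp=omp -o hybrid hybrid.f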

On the SUN system the public domain MPI implementation MPICH is installed. The compilation is done with 
mpif77 -o mpi mpi.f
and the execution is started with 
mpirun -machinefile host.list -np 4 ./mpi

program mpi
  integer ntasks, myid, ierr
  include 'mpif.h'
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  print *,' nr tasks =',ntasks,' my id =', myid
  call MPI_FINALIZE(ierr)
  stop
end program mpi
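
Once this runs, a natural next step is to make the tasks cooperate. The sketch below (mpisum.f is a hypothetical file; compile with mpxlf90 so the free-form continuation line is accepted) sums one value from every task onto task 0 with MPI_REDUCE: 

program mpisum
  include 'mpif.h'
  integer :: ntasks, myid, ierr
  integer :: myval, total
  call MPI_INIT(ierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, ntasks, ierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
  myval = myid + 1
  ! add the values 1..ntasks from all tasks and place the sum on task 0
  call MPI_REDUCE(myval, total, 1, MPI_INTEGER, MPI_SUM, 0, &
                  MPI_COMM_WORLD, ierr)
  if (myid == 0) print *,' total =',total,' expected =',ntasks*(ntasks+1)/2
  call MPI_FINALIZE(ierr)
end program mpisum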
    
