Implementation

The GPU code is incorporated into the modular FLASH4 environment.

After installation, three additional modules are found in the FLASH directory tree.

  • GPU_BH_Tree
  • GpuIncludes
  • KernelCodes

The actual GPU-accelerated Barnes-Hut tree code resides in the GPU_BH_Tree directory, which is embedded in the Grid module. The GpuIncludes and KernelCodes modules hold wrapper functions and interface modules for the CUDA code. They also provide an environment for future GPU modules and the Makefiles needed by the FLASH4 setup procedure. Because the GPU code needs special compilers, the FLASH setup script must be told that GPU code is used. For this purpose, a -gpu argument was added to the setup script. It tells the setup script to use the CUDA compilers and to include a separate linking step in the make procedure. The respective compiler flags are defined in the Makefile.h file.



Makefile.h

Since the GPU code is written in CUDA C, the nvcc compiler is required as an additional compiler for the FLASH setup procedure. Moreover, the Barnes-Hut module uses relocatable device code and therefore needs a separate linking step.

To reflect your system, you must modify the following 13 additional macros defined in Makefile.h:


LIB_CUDA: the CUDA library macro
CUCOMP: the CUDA compiler (nvcc)
GPU_CPPC: the name of the C++ compiler used by nvcc
CU_FLAGS_OPT: the CUDA compilation flags to produce an optimized executable
CU_FLAGS_DEBUG: the CUDA compilation flags to produce an executable for debugging
CU_FLAGS_TEST: the CUDA compilation flags to produce an executable for testing
GPU_CFLAGS_OPT: the C++ compilation flags passed through nvcc to produce an optimized executable
GPU_CFLAGS_DEBUG: the C++ compilation flags passed through nvcc to produce an executable for debugging
GPU_CFLAGS_TEST: the C++ compilation flags passed through nvcc to produce an executable for testing
GPU_LINK: the CUDA compiler used for linking (nvcc)
GPU_L_FLAGS_OPT: the CUDA linker flags used with the OPT compilation flags
GPU_L_FLAGS_DEBUG: the CUDA linker flags used with the DEBUG compilation flags
GPU_L_FLAGS_TEST: the CUDA linker flags used with the TEST compilation flags
			
            

For example, here is how you might modify the macros defined in Makefile.h for the nvcc and Intel compilers. First, we set the appropriate compilers. Note that we use the MPI wrapper for the icpc compiler. We strongly recommend using the MPI wrapper for your C++ compiler.

			
CUCOMP   = nvcc
GPU_CPPC = mpiicpc
	         
            
Now we define the three sets of compiler flags described earlier: the "OPT" set for normal, fully optimized code, the "DEBUG" set for debugging FLASH, and the "TEST" set for regression testing. These three sets are selected with the -auto, -debug, and -test flags to setup, respectively.
			
CU_FLAGS_OPT  =  -c -rdc=true -maxrregcount 32 -O3\
                  --generate-code arch=compute_30,code=sm_35\
                  --generate-code arch=compute_30,code=sm_30\
                  --generate-code arch=compute_20,code=sm_20\
                  --generate-code arch=compute_20,code=sm_21\
                  -ccbin=$(GPU_CPPC) -Xcompiler $(GPU_CFLAGS_OPT)
                                  
CU_FLAGS_DEBUG = -c -g -G -lineinfo -v -rdc=true -maxrregcount 32\
                   --ptxas-options=-v -res-usage -O0\
                   --generate-code arch=compute_30,code=sm_35\
                   --generate-code arch=compute_30,code=sm_30\
                   --generate-code arch=compute_20,code=sm_20\
                   --generate-code arch=compute_20,code=sm_21\
                   -ccbin=$(GPU_CPPC) -Xcompiler $(GPU_CFLAGS_DEBUG)\
                   -D GPU_DEBUG
                                  
CU_FLAGS_TEST = -c -g -G -lineinfo -v -rdc=true -maxrregcount 32\
                 --ptxas-options=-v -res-usage -O0\
                 --generate-code arch=compute_30,code=sm_35\
                 --generate-code arch=compute_30,code=sm_30\
                 --generate-code arch=compute_20,code=sm_20\
                 --generate-code arch=compute_20,code=sm_21\
                 -ccbin=$(GPU_CPPC) -Xcompiler $(GPU_CFLAGS_TEST)\
                 -D GPU_DEBUG -D TRACE
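
The architecture targets are repeated in every flag set above. One way to keep them in sync is to collect them in a helper macro. The following is only a sketch: GPU_ARCH is a hypothetical name chosen for illustration and is not one of the macros expected by the setup script.

```makefile
# Hypothetical helper macro (not defined by FLASH): list the GPU targets once.
GPU_ARCH = --generate-code arch=compute_30,code=sm_35 \
           --generate-code arch=compute_30,code=sm_30 \
           --generate-code arch=compute_20,code=sm_20 \
           --generate-code arch=compute_20,code=sm_21

# Same content as the CU_FLAGS_OPT definition above, with the targets factored out.
CU_FLAGS_OPT = -c -rdc=true -maxrregcount 32 -O3 $(GPU_ARCH) \
               -ccbin=$(GPU_CPPC) -Xcompiler $(GPU_CFLAGS_OPT)
```

The same macro could be reused in the DEBUG and TEST sets and in the linker flags, so that adding or dropping an architecture requires a single change.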
				 
            
            
For the nvcc compiler flags, we must set the -rdc=true switch to compile relocatable device code. Note that the respective "OPT", "DEBUG", and "TEST" flags for the GPU_CPPC compiler are passed with the -Xcompiler flag. For this reason, we define the GPU_CPPC flags already enclosed in quotation marks.
            
GPU_CFLAGS_OPT   = "-O3 -xHost -inline-level=2"
GPU_CFLAGS_DEBUG = "-g -traceback -debug all -mcmodel=large -D GPU_DEBUG"
GPU_CFLAGS_TEST  = "-g -traceback -mcmodel=large -debug all -D TRACE -D GPU_DEBUG"
			
            
Next come the linker, the linker flags, and the library macro. We use the -dlink and -rdc=true switches to link the device code.

LIB_CUDA =  -L/opt/cuda-7.0.28/lib64 -L/opt/cuda-7.0.28/extras/CUPTI/lib64/ -lcudart -lcuda -lstdc++


GPU_LINK = nvcc 
GPU_L_FLAGS_OPT = -O3\
                  --generate-code arch=compute_30,code=sm_35\
                  --generate-code arch=compute_30,code=sm_30\
                  --generate-code arch=compute_20,code=sm_20\
                  --generate-code arch=compute_20,code=sm_21\
                  -dlink -maxrregcount 32 -rdc=true\
                  -ccbin=$(GPU_CPPC) -Xcompiler $(GPU_CFLAGS_OPT) -o

GPU_L_FLAGS_DEBUG = -g -lineinfo -v -O0 -dlink\
                     --generate-code arch=compute_30,code=sm_35\
                     --generate-code arch=compute_30,code=sm_30\
                     --generate-code arch=compute_20,code=sm_20\
                     --generate-code arch=compute_20,code=sm_21\
                     -rdc=true -maxrregcount 32 --ptxas-options=-v\
                     -ccbin=$(GPU_CPPC) -Xcompiler $(GPU_CFLAGS_DEBUG) -o

GPU_L_FLAGS_TEST = -g -lineinfo -v -O0 -dlink\
                    --generate-code arch=compute_30,code=sm_35\
                    --generate-code arch=compute_30,code=sm_30\
                    --generate-code arch=compute_20,code=sm_20\
                    --generate-code arch=compute_20,code=sm_21\
                    -rdc=true -maxrregcount 32 --ptxas-options=-v\
                    -ccbin=$(GPU_CPPC) -Xcompiler $(GPU_CFLAGS_TEST) -o
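
The CUDA installation prefix used in LIB_CUDA is specific to our system and will differ on yours. A minimal sketch of how the path could be isolated, assuming a helper macro CUDA_PATH (a hypothetical name chosen here, not one required by the setup script):

```makefile
# Hypothetical helper macro: root of the local CUDA toolkit installation.
CUDA_PATH = /opt/cuda-7.0.28

# Same libraries as in the LIB_CUDA example above, expressed relative to CUDA_PATH.
LIB_CUDA = -L$(CUDA_PATH)/lib64 -L$(CUDA_PATH)/extras/CUPTI/lib64 \
           -lcudart -lcuda -lstdc++
```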

With these macros, the automatically generated Makefile calls the nvcc compiler to compile the GPU code and to link all device code into a single object file, which is then linked into the final executable during the normal linking step.
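
To illustrate how the macros fit together, the three build steps can be sketched as make rules. This is not the Makefile produced by the setup script, which may be organized differently. The object lists, the file names, and the target name are hypothetical, and we assume that LINK and LFLAGS_OPT are the regular FLASH linker macros and that LFLAGS_OPT ends with -o, as the GPU linker flags do.

```makefile
# Sketch only: the Makefile generated by the FLASH setup script may differ.
# Assumes Makefile.h has been included and the OPT flag sets are selected.
# GPU_OBJS, GPU_DLINK_OBJ, HOST_OBJS and all file names are hypothetical.
GPU_OBJS      = kernel_a.o kernel_b.o
GPU_DLINK_OBJ = gpu_dlink.o
HOST_OBJS     = main.o

# Step 1: compile each CUDA source into a relocatable object
# (-c and -rdc=true are part of CU_FLAGS_OPT).
%.o: %.cu
	$(CUCOMP) $(CU_FLAGS_OPT) $< -o $@

# Step 2: device-link all relocatable objects into a single object file.
# GPU_L_FLAGS_OPT ends with -o, so the output name must come first.
$(GPU_DLINK_OBJ): $(GPU_OBJS)
	$(GPU_LINK) $(GPU_L_FLAGS_OPT) $@ $^

# Step 3: normal host link. Both the relocatable objects and the device-link
# object are passed to the host linker, together with the CUDA libraries.
# Assumption: LINK and LFLAGS_OPT are the regular FLASH linker macros and
# LFLAGS_OPT also ends with -o.
flash4: $(HOST_OBJS) $(GPU_OBJS) $(GPU_DLINK_OBJ)
	$(LINK) $(LFLAGS_OPT) $@ $^ $(LIB_CUDA)
```

The device-link object produced in step 2 does not replace the relocatable objects; both must appear in the final link, which is why step 3 lists them together.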