During the install, the MacLaurin sphere unit test setup is copied to the FLASH directory. This setup tests the speed and the parameters of the GPU-BH tree code. Remember to pass the -gpu setup argument when you use the GPU code. In the corresponding flash.par file, three entries steer the runtime and the accuracy of the GPU-BH tree code:

  • grv_bh_gpu_TreeLimAngle
  • grv_bh_gpu_PinnedCells
  • grv_bh_gpu_PinnedMPI

The "grv_bh_gpu_TreeLimAngle" parameter describes the opening angle used in the Barnes-Hut algorithm for the force evaluation and the calculation of the essential nodes. Valid values are between 0.0 and 1.0 with a default value of '0.5'. Here, a lower value means higher accuracy but also higher computational cost. For a parameter value of 0.5, the typical error is around 1%.
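As a rough illustration of what the opening angle controls, the following Python sketch applies the classic Barnes-Hut acceptance test: a node of size s seen from a distance d is treated as a single mass when s/d is below the limit angle, and is opened otherwise. The function name and the exact form of the test are assumptions made for illustration; the criterion implemented in the GPU-BH code may differ in detail.

```python
def node_is_accepted(node_size, distance, lim_angle=0.5):
    """Classic Barnes-Hut test: accept the node if it appears small enough.

    node_size : edge length of the tree node
    distance  : distance from the evaluation point to the node's mass centre
    lim_angle : opening angle (plays the role of grv_bh_gpu_TreeLimAngle)
    """
    if distance <= 0.0:
        return False  # the point lies at the node centre, so the node is opened
    return node_size / distance < lim_angle


# A node of size 1.0 seen from a distance of 4.0 gives a ratio of 0.25.
print(node_is_accepted(1.0, 4.0, lim_angle=0.5))  # accepted as a single mass
print(node_is_accepted(1.0, 4.0, lim_angle=0.2))  # opened, children are visited
```

With a smaller limit angle, fewer nodes pass the test, so more nodes are opened. This is why accuracy and cost rise together.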

The "grv_bh_gpu_PinnedCells" parameter controls how the cell data is stored in host memory. Valid values are '.true.' and '.false.' with the default value '.false.'. In the default setting, the cell data values are stored in 'normal' pageable memory. With a value of '.true.', the cell data is stored in page-locked (pinned) memory. Pinned memory cannot be moved by the OS, and large amounts of pinned memory may reduce the overall system performance. Moreover, the allocation of pinned memory may fail depending on the available resources. On the other hand, a higher bandwidth between the host and the device is possible when using pinned memory.

The "grv_bh_gpu_PinnedMPI" parameter controls how the communication buffer is stored in host memory. Valid values are '.true.' and '.false.' with the default value '.false.'. Here, pinned memory can be exploited to achieve a higher MPI bandwidth.
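For reference, a flash.par fragment that sets the three parameters explicitly to their documented default values might look as follows. This is only a sketch and assumes the usual 'name = value' syntax of flash.par with '#' comments; the rest of the file stays unchanged.

```
# GPU-BH tree parameters, set to the documented defaults
grv_bh_gpu_TreeLimAngle = 0.5
grv_bh_gpu_PinnedCells  = .false.
grv_bh_gpu_PinnedMPI    = .false.
```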

In the default setting, the MacLaurin sphere setup runs with a maximum and minimum refinement level of 4, resulting in initially 512 leaf blocks. For the GPU-Tree code, a total of 4 * 262144 double precision cell data values are stored. Assuming enough memory, this setup can run with a single process utilizing one GPU device.
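The numbers above can be reproduced with a short calculation. The sketch below assumes a single root block that is refined as an octree and blocks of 8 x 8 x 8 cells; both are assumptions that happen to match the 512 leaf blocks and 262144 cells quoted in the text, not values read from the setup itself.

```python
refinement_level = 4        # from the text: min and max refinement level
cells_per_block = 8 ** 3    # assumption: 8 x 8 x 8 cells per block
values_per_cell = 4         # from the text: 4 * 262144 values
bytes_per_double = 8

# Assumption: one root block, each refinement step splits a block into 8.
leaf_blocks = 8 ** (refinement_level - 1)
cells = leaf_blocks * cells_per_block
total_bytes = values_per_cell * cells * bytes_per_double

print(leaf_blocks)            # 512
print(cells)                  # 262144
print(total_bytes / 2 ** 20)  # 8.0 (MiB of cell data)
```

Under these assumptions the cell data itself occupies about 8 MiB; the tree and the communication buffers need additional memory that is not estimated here.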

During the simulation, the minimum and maximum percentage error values are printed for each process. The following shows an example output of the accuracy test run with 6 MPI processes and one GPU device:

 Proc           3  Maximum potential error =   6.434942767625307E-003  SUCCESS! 
 Proc           3  Minimum potential error =   1.176866425422478E-008  SUCCESS! 
 Proc           1  Maximum potential error =   6.434942767625836E-003  SUCCESS! 
 Proc           1  Minimum potential error =   1.176863970477797E-008  SUCCESS! 
 Proc           2  Maximum potential error =   6.434942767625131E-003  SUCCESS! 
 Proc           2  Minimum potential error =   1.176863946291150E-008  SUCCESS! 
 Proc           4  Maximum potential error =   6.434942767626011E-003  SUCCESS! 
 Proc           4  Minimum potential error =   1.176866098902742E-008  SUCCESS! 
 Proc           0  Maximum potential error =   6.434942767629360E-003  SUCCESS! 
 Proc           0  Minimum potential error =   1.176863837451238E-008  SUCCESS!
 Proc           5  Maximum potential error =   6.434942767621079E-003  SUCCESS! 
 Proc           5  Minimum potential error =   1.176866086809419E-008  SUCCESS!

The "grv_bh_gpu_TreeLimAngle" parameter can be adjusted to achieve a higher or lower accuracy. Here, we achieved a maximum potential error below 1% with an opening angle of 0.5.
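When the test is run with many processes, it can be convenient to collect the printed error values automatically. The following Python sketch parses lines of the form shown in the example output and reports the largest maximum error. The function name and the threshold of 0.01 are assumptions for illustration; the threshold treats the printed values as fractions, so adjust it if they are meant as percentages.

```python
import re

# Matches lines such as:
#  Proc  3  Maximum potential error =   6.434942767625307E-003  SUCCESS!
LINE = re.compile(
    r"Proc\s+(\d+)\s+(Maximum|Minimum) potential error =\s+(\S+)"
)


def parse_errors(text):
    """Return {proc: {"Maximum": value, "Minimum": value}} from the output."""
    result = {}
    for match in LINE.finditer(text):
        proc, kind, value = match.groups()
        result.setdefault(int(proc), {})[kind] = float(value)
    return result


sample = """\
 Proc           3  Maximum potential error =   6.434942767625307E-003  SUCCESS!
 Proc           3  Minimum potential error =   1.176866425422478E-008  SUCCESS!
 Proc           1  Maximum potential error =   6.434942767625836E-003  SUCCESS!
 Proc           1  Minimum potential error =   1.176863970477797E-008  SUCCESS!
"""

errors = parse_errors(sample)
worst = max(entry["Maximum"] for entry in errors.values())
threshold = 0.01  # assumption: values are fractions, 0.01 corresponds to 1%
print(sorted(errors))     # process ranks found in the output
print(worst < threshold)  # whether every process stays below the threshold
```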

The runtime and efficiency of the GPU-BH code can be checked in the log file created for each FLASH simulation. Besides the runtimes of the major routines printed in the performance summary section of the log file, a separate section is printed that covers only memory access and kernel runtimes.

The following shows an example of this section generated for pageable MPI buffer memory and pageable cell data memory:

 =================================GPU TIMERS===================================
 --------------------------- EVOLUTION Calc. Potential  -----------------------
       accounting unit    min/proc (s)   max/proc (s)    avg/proc (s)
                  Init       0.0001       0.0002       0.0002
             Treebuild       0.0049       0.0078       0.0063
              Treewalk       0.2020       0.2395       0.2235
              Treesumm       0.0010       0.0028       0.0020
     Memory Alloc/Free       0.0019       0.0021       0.0020
           Memory HtoD       0.0032       0.0063       0.0047
           Memory DtoH       0.0005       0.0006       0.0005
       accounting unit        min GB/s       max GB/s        avg GB/s
           Memory HtoD       1.9225       2.8374       2.4285
           Memory DtoH       1.1932       1.3650       1.2855
 --------------------------- EVOLUTION Calc. Essentials -----------------------
       accounting unit    min/proc (s)   max/proc (s)    avg/proc (s)
                  Init       0.0001       0.0001       0.0001
             Treebuild       0.0034       0.0046       0.0040
               Treesum       0.0008       0.0008       0.0008
              Treefind       0.0053       0.0056       0.0054
     Memory Alloc/Free       0.0025       0.0029       0.0028
           Memory HtoD       0.0016       0.0017       0.0016
           Memory DtoH       0.0036       0.0037       0.0037
       accounting unit        min GB/s       max GB/s        avg GB/s
           Memory HtoD       1.6555       1.7607       1.6986
           Memory DtoH       1.6845       1.9853       1.8401

For pinned memory, the bandwidth entries for HtoD (Host to Device) and DtoH (Device to Host) can be higher.
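To compare a run with pageable memory against a run with pinned memory, the GPU TIMERS section of both log files can be read into a small table. The following Python sketch does this for the layout shown in the example above; the function name and the dictionary layout are assumptions for illustration, and the parser relies only on the section separators, the 'accounting unit' header lines and the three numeric columns.

```python
import re

# A data row: a name (may contain spaces) followed by three decimal numbers.
ROW = re.compile(r"^\s*(.+?)\s+(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)\s*$")


def parse_gpu_timers(text):
    """Return {(section, unit, name): (min, max, avg)} for the timer block."""
    table = {}
    section, unit = None, None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("---"):
            section = stripped.strip("- ")  # e.g. "EVOLUTION Calc. Potential"
        elif stripped.startswith("accounting unit"):
            unit = "GB/s" if "GB/s" in stripped else "s"
        else:
            row = ROW.match(line)
            if row:
                values = tuple(float(v) for v in row.groups()[1:])
                table[(section, unit, row.group(1))] = values
    return table


sample = """\
 --------------------------- EVOLUTION Calc. Potential  -----------------------
       accounting unit    min/proc (s)   max/proc (s)    avg/proc (s)
              Treewalk       0.2020       0.2395       0.2235
           Memory HtoD       0.0032       0.0063       0.0047
       accounting unit        min GB/s       max GB/s        avg GB/s
           Memory HtoD       1.9225       2.8374       2.4285
"""

table = parse_gpu_timers(sample)
key = ("EVOLUTION Calc. Potential", "GB/s", "Memory HtoD")
print(table[key][2])  # average HtoD bandwidth in GB/s
```

Running the same function on the log of a pinned-memory run and comparing the 'GB/s' entries shows whether the higher bandwidth is actually reached on a given system.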