SCC Nodes: scc-ea1..ea4, scc-eb1..eb4, scc-ec1..ec4, scc-fa1..fa4, scc-fb1..fb4, scc-fc1..fc4
Below is the output of pgaccelinfo run on the SCC E and F nodes.
CUDA Device Number reports the GPU device number. Each E or F node has 3 GPUs, with device numbers 0, 1, and 2. Here is a Fortran example that associates each of 3 OpenMP threads (i.e., CPU cores) with a specific GPU device:
call omp_set_num_threads(3)   ! one thread per GPU; compile code with -mp to turn on OpenMP
!$omp parallel private(i)
i = omp_get_thread_num()      ! thread number: 0, 1, or 2
call acc_set_device_num(i, acc_device_nvidia)   ! attach this thread to GPU number i
!$omp end parallel
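For reference, the fragment above can be expanded into a complete program. This is a minimal sketch, not a tested SCC recipe. It assumes the compiler provides the standard `omp_lib` and `openacc` modules. The program name `assign_gpus` and the variables `ngpus` and `tid` are illustrative. Instead of hard-coding 3, it asks the runtime how many NVIDIA devices are visible.

```fortran
program assign_gpus
  use omp_lib   ! OpenMP runtime routines
  use openacc   ! OpenACC runtime routines (assumes an OpenACC-capable compiler)
  implicit none
  integer :: ngpus, tid

  ! Ask the runtime how many NVIDIA devices are visible on this node.
  ngpus = acc_get_num_devices(acc_device_nvidia)
  if (ngpus < 1) then
     print *, 'No NVIDIA device found'
     stop
  end if

  ! Use one OpenMP thread per GPU.
  call omp_set_num_threads(ngpus)

  !$omp parallel private(tid)
  tid = omp_get_thread_num()                       ! 0 .. ngpus-1
  call acc_set_device_num(tid, acc_device_nvidia)  ! attach this thread to GPU number tid
  print *, 'thread', tid, 'uses device', acc_get_device_num(acc_device_nvidia)
  !$omp end parallel
end program assign_gpus
```

Compile with -mp together with the accelerator option reported by pgaccelinfo (for example -ta=nvidia,cc20). The exact flags needed to enable OpenACC depend on the compiler version.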
============================================================
CUDA Driver Version: 4020
NVRM version: NVIDIA UNIX x86_64 Kernel Module 295.71 Thu Aug 2 19:22:08 PDT 2012
CUDA Device Number: 0
Device Name: Tesla M2050
Device Revision Number: 2.0
Global Memory Size: 2817982464
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: exclusive
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1546 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 87731 microseconds
Current free memory: 2748571648
Upload time (4MB): 1782 microseconds (1417 ms pinned)
Download time: 1523 microseconds (1307 ms pinned)
Upload bandwidth: 2353 MB/sec (2959 MB/sec pinned)
Download bandwidth: 2753 MB/sec (3209 MB/sec pinned)
PGI Compiler Option: -ta=nvidia,cc20
CUDA Device Number: 1
Device Name: Tesla M2050
Device Revision Number: 2.0
Global Memory Size: 2817982464
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: exclusive
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1546 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 87731 microseconds
Current free memory: 2748817408
Upload time (4MB): 1770 microseconds (1425 ms pinned)
Download time: 1532 microseconds (1312 ms pinned)
Upload bandwidth: 2369 MB/sec (2943 MB/sec pinned)
Download bandwidth: 2737 MB/sec (3196 MB/sec pinned)
PGI Compiler Option: -ta=nvidia,cc20
CUDA Device Number: 2
Device Name: Tesla M2050
Device Revision Number: 2.0
Global Memory Size: 2817982464
Number of Multiprocessors: 14
Number of Cores: 448
Concurrent Copy and Execution: Yes
Total Constant Memory: 65536
Total Shared Memory per Block: 49152
Registers per Block: 32768
Warp Size: 32
Maximum Threads per Block: 1024
Maximum Block Dimensions: 1024, 1024, 64
Maximum Grid Dimensions: 65535 x 65535 x 65535
Maximum Memory Pitch: 2147483647B
Texture Alignment: 512B
Clock Rate: 1147 MHz
Execution Timeout: No
Integrated Device: No
Can Map Host Memory: Yes
Compute Mode: exclusive
Concurrent Kernels: Yes
ECC Enabled: Yes
Memory Clock Rate: 1546 MHz
Memory Bus Width: 384 bits
L2 Cache Size: 786432 bytes
Max Threads Per SMP: 1536
Async Engines: 2
Unified Addressing: Yes
Initialization time: 87731 microseconds
Current free memory: 2748817408
Upload time (4MB): 1789 microseconds (1421 ms pinned)
Download time: 1533 microseconds (1307 ms pinned)
Upload bandwidth: 2344 MB/sec (2951 MB/sec pinned)
Download bandwidth: 2736 MB/sec (3209 MB/sec pinned)
PGI Compiler Option: -ta=nvidia,cc2
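
The transfer figures in the listing can be cross-checked with simple arithmetic: bandwidth is the number of bytes moved divided by the transfer time. The sketch below assumes that "4MB" means 4 * 1024 * 1024 bytes and uses the device 0 upload times from the listing. The program and variable names are illustrative.

```fortran
program check_bandwidth
  implicit none
  ! Assumption: "4MB" means 4 * 1024 * 1024 bytes.
  real, parameter :: nbytes = 4.0 * 1024.0 * 1024.0
  ! Device 0 upload times copied from the listing above.
  real, parameter :: t_pageable = 1782.0
  real, parameter :: t_pinned   = 1417.0

  ! Bytes per microsecond is numerically equal to 10**6 bytes per second.
  print '(a, f8.1)', 'pageable upload MB/sec: ', nbytes / t_pageable
  print '(a, f8.1)', 'pinned upload MB/sec:   ', nbytes / t_pinned
end program check_bandwidth
```

Under that assumption the results are roughly 2353.7 and 2960.0 MB/sec, in line with the reported 2353 and 2959 MB/sec if the tool truncates to whole numbers. The match also suggests that the pinned times, although labeled "ms", are in microseconds like the pageable times.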
