NVIDIA Tesla K40 “Atlas” GPU Accelerator (Kepler GK110b) Up Close

NVIDIA’s latest Tesla accelerator is the most powerful GPU available at the time of writing. With 2,880 CUDA cores and 12GB of GDDR5 memory, it wins in practically every* performance test you’ll see. Like the “Kepler” K20 GPUs, the Tesla K40 supports NVIDIA’s SMX architecture, Dynamic Parallelism and Hyper-Q (CUDA compute capability 3.5). It also introduces professional-level GPU Boost, which raises clock speeds to extract as much performance as your code can draw within the GPU’s 235W power envelope.

Maximum GPU Memory and Compute Performance: Tesla K40 GPU Accelerator

Integrated in Microway NumberSmasher GPU Servers and GPU Clusters

Photograph of the new NVIDIA Tesla "Atlas" K40 "Kepler" GPU Accelerator

Specifications

  • 2880 CUDA GPU cores (GK110b)
  • 4.29 TFLOPS single-precision; 1.43 TFLOPS double-precision
  • 12GB GDDR5 memory
  • Memory bandwidth up to 288 GB/s
  • PCI-E x16 Gen3 interface to system
  • GPU Boost for increased clock speeds
  • Supports Dynamic Parallelism and Hyper-Q features
  • Active and Passive heatsinks available for installation in workstations and specially-designed GPU servers
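
The headline figures in this list can be reproduced from numbers reported elsewhere in this article: 15 multiprocessors with 192 CUDA cores each, a 745 MHz default graphics clock, a 384-bit memory bus and a 3004 MHz memory clock. The sketch below is a back-of-the-envelope check rather than NVIDIA's official method. It assumes one fused multiply-add (two FLOPs) per unit per cycle, 64 double-precision units per SMX (a figure from NVIDIA's published GK110 description, not from this article), and two data transfers per reported memory clock.

```python
# Back-of-the-envelope peak figures for the Tesla K40.
# Inputs come from the deviceQuery and nvidia-smi output in this article,
# except DP_UNITS_PER_SMX, which is an assumption taken from NVIDIA's
# published GK110 architecture description.

SMX_COUNT = 15            # "(15) Multiprocessors"
SP_CORES_PER_SMX = 192    # "(192) CUDA Cores/MP"
DP_UNITS_PER_SMX = 64     # assumption: not reported by deviceQuery
BASE_CLOCK_MHZ = 745      # default applications clock
MEM_CLOCK_MHZ = 3004      # "Memory Clock rate"
BUS_WIDTH_BITS = 384      # "Memory Bus Width"
FLOPS_PER_FMA = 2         # one fused multiply-add counts as two FLOPs


def peak_tflops(units_per_smx: int, clock_mhz: float) -> float:
    """Peak throughput in TFLOPS, assuming one FMA per unit per cycle."""
    flops_per_second = SMX_COUNT * units_per_smx * FLOPS_PER_FMA * clock_mhz * 1e6
    return flops_per_second / 1e12


def peak_bandwidth_gbs(mem_clock_mhz: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s, assuming two transfers per clock."""
    transfers_per_second = mem_clock_mhz * 1e6 * 2
    return transfers_per_second * (bus_width_bits / 8) / 1e9


single = peak_tflops(SP_CORES_PER_SMX, BASE_CLOCK_MHZ)
double = peak_tflops(DP_UNITS_PER_SMX, BASE_CLOCK_MHZ)
bandwidth = peak_bandwidth_gbs(MEM_CLOCK_MHZ, BUS_WIDTH_BITS)

print(f"single precision: {single:.2f} TFLOPS")
print(f"double precision: {double:.2f} TFLOPS")
print(f"memory bandwidth: {bandwidth:.1f} GB/s")
```

Under these assumptions the script gives about 4.29 TFLOPS, 1.43 TFLOPS and 288.4 GB/s, which is consistent with the specifications listed above.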

The new GPU also uses PCI-E 3.0 to transfer over 10 gigabytes per second between the host (CPUs) and the devices (GPUs):

[root@node3 tests]# ./gpu_bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla K40m
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10038.7

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     10046.7

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)        Bandwidth(MB/s)
   33554432                     202665.0

Result = PASS
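
To put the roughly 10 GB/s result in context, it can be compared with the theoretical rate of a PCI-E 3.0 x16 link. This is a minimal sketch, assuming 8 GT/s per lane with 128b/130b encoding (from the PCI-E 3.0 specification) and assuming that "MB" in the tool output means 2^20 bytes. If the tool uses 10^6 bytes instead, pass `bytes_per_mb=10**6` and the percentages come out a few points lower.

```python
# Compare measured host<->device bandwidth with the theoretical
# PCI-E 3.0 x16 rate. Link parameters are from the PCI-E 3.0 spec;
# the meaning of "MB" in the tool output is an assumption.

GIGATRANSFERS_PER_LANE = 8e9      # PCI-E 3.0: 8 GT/s per lane
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line encoding
LANES = 16


def pcie3_peak_bytes_per_second(lanes: int = LANES) -> float:
    """Theoretical one-direction rate, before packet/protocol overhead."""
    bits_per_second = GIGATRANSFERS_PER_LANE * ENCODING_EFFICIENCY * lanes
    return bits_per_second / 8


def link_efficiency(measured_mb_per_s: float, bytes_per_mb: int = 2**20) -> float:
    """Fraction of the theoretical link rate actually achieved."""
    return measured_mb_per_s * bytes_per_mb / pcie3_peak_bytes_per_second()


peak = pcie3_peak_bytes_per_second()
print(f"theoretical peak: {peak / 1e9:.2f} GB/s")
for label, mb_per_s in [("host to device", 10038.7), ("device to host", 10046.7)]:
    print(f"{label}: {link_efficiency(mb_per_s):.0%} of peak")
```

With either interpretation of "MB", the measured transfers reach roughly two-thirds of the theoretical 15.75 GB/s. The remainder is plausibly packet and protocol overhead, which this sketch does not model.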

Technical Details

Here is the full list of capabilities reported by NVIDIA’s SMI tool (nvidia-smi). Memory error detection and correction (ECC) is supported on all components of the Tesla GPU. Notice that GPU Boost allows the top CUDA core clock frequency to be set to 745 MHz (the default), 810 MHz or 875 MHz:

[root@node3 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Nov 11 21:42:13 2013
Driver Version                      : 325.15

Attached GPUs                       : 3
GPU 0000:02:00.0
    Product Name                    : Tesla K40m
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Disabled
    Accounting Mode Buffer Size     : 128
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 032391304xxxx
    GPU UUID                        : GPU-3964f3ae-5ee0-2afc-5d93-9f1edd2axxxx
    VBIOS Version                   : 80.80.24.00.06
    Inforom Version
        Image Version               : 2081.0202.01.04
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x02
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x102310DE
        Bus Id                      : 0000:02:00.0
        Sub System Id               : 0x097E10DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
    Fan Speed                       : N/A
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    Memory Usage
        Total                       : 11519 MB
        Used                        : 69 MB
        Free                        : 11450 MB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : 0
                L1 Cache            : 0
                L2 Cache            : 0
                Texture Memory      : 0
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        Gpu                         : 26 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 19.49 W
        Power Limit                 : 235.00 W
        Default Power Limit         : 235.00 W
        Enforced Power Limit        : 235.00 W
        Min Power Limit             : 150.00 W
        Max Power Limit             : 235.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 324 MHz
    Applications Clocks
        Graphics                    : 745 MHz
        Memory                      : 3004 MHz
    Default Applications Clocks
        Graphics                    : 745 MHz
        Memory                      : 3004 MHz
    Max Clocks
        Graphics                    : 875 MHz
        SM                          : 875 MHz
        Memory                      : 3004 MHz
    Compute Processes               : None

[root@node3 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Mon Nov 11 21:42:45 2013
Driver Version                      : 325.15

Attached GPUs                       : 3
GPU 0000:02:00.0
    Supported Clocks
        Memory                      : 3004 MHz
            Graphics                : 875 MHz
            Graphics                : 810 MHz
            Graphics                : 745 MHz
            Graphics                : 666 MHz
        Memory                      : 324 MHz
            Graphics                : 324 MHz
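
The supported-clocks listing can be turned into an applications-clock setting programmatically. The sketch below parses text in the format shown above and builds, but does not run, an `nvidia-smi -ac` command for the fastest memory/graphics pair. The helper names are hypothetical. Setting clocks normally requires root privileges, and it is worth checking `nvidia-smi --help` on your driver version before relying on the flags.

```python
import re
from typing import Dict, List

# Sample text in the format printed by
# `nvidia-smi -q -d SUPPORTED_CLOCKS` (copied from the output above).
SAMPLE = """\
    Supported Clocks
        Memory                      : 3004 MHz
            Graphics                : 875 MHz
            Graphics                : 810 MHz
            Graphics                : 745 MHz
            Graphics                : 666 MHz
        Memory                      : 324 MHz
            Graphics                : 324 MHz
"""

CLOCK_LINE = re.compile(r"^\s*(Memory|Graphics)\s*:\s*(\d+)\s*MHz\s*$")


def parse_supported_clocks(text: str) -> Dict[int, List[int]]:
    """Map each memory clock (MHz) to the graphics clocks listed under it."""
    clocks: Dict[int, List[int]] = {}
    current_memory = None
    for line in text.splitlines():
        match = CLOCK_LINE.match(line)
        if not match:
            continue  # skip headers and anything that is not a clock line
        kind, mhz = match.group(1), int(match.group(2))
        if kind == "Memory":
            current_memory = mhz
            clocks[current_memory] = []
        elif current_memory is not None:
            clocks[current_memory].append(mhz)
    return clocks


def fastest_clock_command(clocks: Dict[int, List[int]], gpu_index: int = 0) -> List[str]:
    """Build the nvidia-smi argument list for the highest supported pair."""
    memory = max(clocks)
    graphics = max(clocks[memory])
    return ["nvidia-smi", "-i", str(gpu_index), "-ac", f"{memory},{graphics}"]


supported = parse_supported_clocks(SAMPLE)
command = fastest_clock_command(supported)
print(" ".join(command))
```

For the listing above this prints `nvidia-smi -i 0 -ac 3004,875`. Returning the command as a list rather than executing it keeps the sketch safe to run on a machine without a GPU.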

NVIDIA deviceQuery on Tesla K40

The output below, from the CUDA 5.5 SDK samples, shows additional details of the architecture and capabilities of the Tesla K40 GPU accelerators.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40m"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 12288 MBytes (12884705280 bytes)
  (15) Multiprocessors, (192) CUDA Cores/MP:     2880 CUDA Cores
  GPU Clock rate:                                876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.5, NumDevs = 1, Device0 = Tesla K40m
Result = PASS
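
Note that deviceQuery reports 12288 MB of global memory with ECC disabled, while the nvidia-smi listing earlier shows 11519 MB total with ECC enabled. The gap is consistent with ECC reserving one sixteenth (6.25%) of the GDDR5 for check bits. That fraction is an assumption based on NVIDIA's general guidance for Kepler-generation Tesla boards, not something stated in either output. A quick check:

```python
# Check whether the two reported memory sizes differ by the assumed
# ECC overhead. ECC_FRACTION is an assumption, not a value read
# from either tool.

TOTAL_MB_ECC_OFF = 12288   # deviceQuery, "Device has ECC support: Disabled"
TOTAL_MB_ECC_ON = 11519    # nvidia-smi, "Ecc Mode ... Current: Enabled"
ECC_FRACTION = 1 / 16      # assumed share of memory reserved for ECC


def usable_with_ecc(total_mb: int, ecc_fraction: float = ECC_FRACTION) -> float:
    """Memory left for applications once ECC storage is set aside."""
    return total_mb * (1 - ecc_fraction)


expected = usable_with_ecc(TOTAL_MB_ECC_OFF)
observed_fraction = 1 - TOTAL_MB_ECC_ON / TOTAL_MB_ECC_OFF

print(f"expected with ECC on: {expected:.0f} MB")
print(f"observed reduction:   {observed_fraction:.2%}")
```

The expected figure of 11520 MB is within 1 MB of what nvidia-smi reports, so the assumed overhead fits the two listings well.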

*Caveat on Tesla K40 performance: users with very specific, memory-intensive workloads that rely on single-precision floating-point and/or integer math may be better served by the NVIDIA Tesla K10 GPU Accelerator with 8GB GDDR5 memory. Please speak with one of our GPU experts.

Additional Tesla K40 Information

To learn more about the differences between the Tesla K40 and other versions of the Tesla product line, please review our In-Depth Comparison of NVIDIA Tesla “Kepler” GPU Accelerators.

Eliot Eshelman

About Eliot Eshelman

My interests span from astrophysics to bacteriophages; high-performance computers to small spherical magnets. I've been an avid Linux geek (with a focus on HPC) for more than a decade. I work as Microway's Vice President of Strategic Accounts and HPC Initiatives.

One Response to NVIDIA Tesla K40 “Atlas” GPU Accelerator (Kepler GK110b) Up Close

  1. Eliot Eshelman says:

    Note that GPU data transfer speeds depend on both the memory clock and the graphics clock. A Tesla K40 GPU with GPU Boost at full speed can achieve a 10% to 15% transfer speed improvement:

    [root@node4 tests]# bandwidthTest --memory=pinned --device=0
    [CUDA Bandwidth Test] - Starting...
    Running on...

    Device 0: Tesla K40m
    Quick Mode

    Host to Device Bandwidth, 1 Device(s)
    PINNED Memory Transfers
      Transfer Size (Bytes)        Bandwidth(MB/s)
      33554432                     11296.7

    Device to Host Bandwidth, 1 Device(s)
    PINNED Memory Transfers
      Transfer Size (Bytes)        Bandwidth(MB/s)
      33554432                     11795.1

    Device to Device Bandwidth, 1 Device(s)
    PINNED Memory Transfers
      Transfer Size (Bytes)        Bandwidth(MB/s)
      33554432                     229481.5

    Result = PASS
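
The size of the improvement can be checked by comparing these figures with the run in the main article. This is a small sketch, assuming the article's run used the default 745 MHz applications clock. The two runs were taken on different nodes (node3 and node4), so the percentages are indicative only.

```python
# Percentage gain of the GPU Boost run (this comment) over the
# run in the main article. Values are in MB/s as printed by the tool.

DEFAULT_RUN = {
    "host to device": 10038.7,
    "device to host": 10046.7,
    "device to device": 202665.0,
}
BOOST_RUN = {
    "host to device": 11296.7,
    "device to host": 11795.1,
    "device to device": 229481.5,
}


def percent_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


gains = {name: percent_gain(DEFAULT_RUN[name], BOOST_RUN[name]) for name in DEFAULT_RUN}
for name, gain in gains.items():
    print(f"{name}: +{gain:.1f}%")
```

For these two runs the gains work out to about 12.5% (host to device), 17.4% (device to host) and 13.2% (device to device).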
