NVIDIA Tesla M40 24GB GPU Accelerator (Maxwell GM200) Up Close

Eliot Eshelman

·

April 1, 2016

NVIDIA has announced a new version of their popular Tesla M40 GPU – one with 24GB of high-speed GDDR5 memory. The name hasn’t really changed – the new GPU is named NVIDIA Tesla M40 24GB. If you are curious about the original version with less memory, we have a detailed examination of the original M40 GPU.

As support for GPUs grows – particularly in the exploding fields of Machine Learning and Deep Learning – there has been increasing need for large quantities of GPU memory. The Tesla M40 24GB provides the most memory available to date in a single-GPU Tesla card. The remaining specifications of the new M40 match that of the original: 7 TFLOPS of single-precision floating point performance.

The Tesla M40 continues to be the only high-performance Tesla compute GPU based upon the “Maxwell” architecture. “Maxwell” provides excellent performance per watt, as evidenced by the fact that this GPU provides 7 TFLOPS within a 250W power envelope.

Maximum single-GPU memory and performance: Tesla M40 24GB GPU

Available in Microway NumberSmasher GPU Servers and GPU Clusters

Specifications

3072 CUDA GPU cores (GM200)
7.0 TFLOPS single; 0.21 TFLOPS double-precision
24GB GDDR5 memory
Memory bandwidth up to 288 GB/s
PCI-E x16 Gen3 interface to system
Dynamic GPU Boost for optimal clock speeds
Passive heatsink design for installation in qualified GPU servers

Technical Details

The nvidia-smi status report shown below reflects the capabilities of the new M40 24GB GPU:

[root@node4 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Fri May 20 15:35:26 2016
Driver Version                      : 361.28

Attached GPUs                       : 4
GPU 0000:84:00.0
    Product Name                    : Tesla M40
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : xxxxxxxxxxxxx
    GPU UUID                        : GPU-dbacebc6-3878-d72d-ebe9-87fb50xxxxxx
    Minor Number                    : 3
    VBIOS Version                   : 84.00.56.00.03
    MultiGPU Board                  : No
    Board ID                        : 0xXXXX
    Inforom Version
        Image Version               : G600.xxxx.xx.xx
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x84
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0xXXXXXXXX
        Bus Id                      : 0000:84:00.0
        Sub System Id               : 0x117110DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 3
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : N/A
    Performance State               : P0
    Clocks Throttle Reasons
        Idle                        : Not Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Sync Boost                  : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 23039 MiB
        Used                        : 23009 MiB
        Free                        : 30 MiB
    BAR1 Memory Usage
        Total                       : 32768 MiB
        Used                        : 4 MiB
        Free                        : 32764 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 99 %
        Memory                      : 100 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 51 C
        GPU Shutdown Temp           : 92 C
        GPU Slowdown Temp           : 89 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 124.63 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 180.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 1113 MHz
        SM                          : 1113 MHz
        Memory                      : 3004 MHz
        Video                       : 1025 MHz
    Applications Clocks
        Graphics                    : 1114 MHz
        Memory                      : 3004 MHz
    Default Applications Clocks
        Graphics                    : 947 MHz
        Memory                      : 3004 MHz
    Max Clocks
        Graphics                    : 1114 MHz
        SM                          : 1114 MHz
        Memory                      : 3004 MHz
        Video                       : 1024 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

NVIDIA deviceQuery on Tesla M40 24GB

The output below, from the CUDA 7.5 SDK samples, shows the output of the architecture and capabilities of the Tesla M40 24GB GPU accelerators.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla M40"
  CUDA Driver Version / Runtime Version          8.0 / 7.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 23040 MBytes (24159059968 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1112 MHz (1.11 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla M40
Result = PASS

Additional Tesla M40 24GB Information

To learn more about the differences between the Tesla M40 24GB and other versions of the Tesla product line, please review our “Kepler” and “Maxwell” Tesla GPU knowledge center articles:

To learn more about GPU-accelerated servers and clusters which provide the Tesla M40 24GB, please see our NVIDIA GPU technology page. Although we are able to provide the M40 in tower workstation systems, the design of the heatsink does not allow for quiet workstations.

This post was last updated on 2016-06-23

Implementing NVIDIA AI Blueprint

Microway Achieves DGX SuperPOD Specialization Partner Status with NVIDIA

DGX A100 review: Throughput and Hardware Summary