NVIDIA Tesla M40 12GB GPU Accelerator (Maxwell GM200) Up Close

With the release of the Tesla M40, NVIDIA continues to diversify its professional compute GPU lineup. Designed specifically for Deep Learning applications, the M40 provides 7 TFLOPS of single-precision floating point performance and 12GB of high-speed GDDR5 memory. It is well suited to the popular Deep Learning software frameworks and may also find its way into other industries where single-precision arithmetic is sufficient.

The Tesla M40 is also notable for being the first Tesla GPU to be based upon NVIDIA’s “Maxwell” GPU architecture. “Maxwell” provides excellent performance per watt, as evidenced by the fact that this GPU provides 7 TFLOPS within a 250W power envelope.

Maximum single-GPU performance: Tesla M40 12GB GPU

Available in Microway NumberSmasher GPU Servers and GPU Clusters

Photo of the NVIDIA Tesla M40 12GB GPU Accelerator

Specifications

  • 3072 CUDA GPU cores (GM200)
  • 7.0 TFLOPS single; 0.21 TFLOPS double-precision
  • 12GB GDDR5 memory
  • Memory bandwidth up to 288 GB/s
  • PCI-E x16 Gen3 interface to system
  • Dynamic GPU Boost for optimal clock speeds
  • Passive heatsink design for installation in qualified GPU servers
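The headline figures in this list can be roughly reproduced from the clock rates and bus width reported by the tools later in this article. The sketch below is only a back-of-the-envelope check: the factor of 2 FLOPs per core per cycle (one fused multiply-add) and the two transfers per reported memory clock are assumptions of the sketch, not values stated in this article.

```python
# Sanity-check the headline Tesla M40 numbers from figures quoted in this article.
# Assumptions of this sketch (not stated in the article): each CUDA core retires
# one fused multiply-add (2 FLOPs) per cycle, and GDDR5 moves two transfers per
# reported memory clock.

CUDA_CORES = 3072        # spec list
MAX_CLOCK_MHZ = 1114     # highest supported graphics clock in the nvidia-smi output
BOARD_POWER_W = 250      # power envelope
MEM_CLOCK_MHZ = 3004     # memory clock in the nvidia-smi output
MEM_BUS_BITS = 384       # memory bus width in the deviceQuery output


def peak_sp_tflops(cores, clock_mhz, flops_per_cycle=2):
    """Peak single-precision throughput in TFLOPS."""
    return cores * clock_mhz * 1e6 * flops_per_cycle / 1e12


def peak_mem_bandwidth_gbs(clock_mhz, bus_bits, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (10^9 bytes per second)."""
    return clock_mhz * 1e6 * transfers_per_clock * (bus_bits / 8) / 1e9


if __name__ == "__main__":
    tflops = peak_sp_tflops(CUDA_CORES, MAX_CLOCK_MHZ)
    bandwidth = peak_mem_bandwidth_gbs(MEM_CLOCK_MHZ, MEM_BUS_BITS)
    print(f"Peak single precision: {tflops:.2f} TFLOPS")
    print(f"Efficiency:            {tflops * 1000 / BOARD_POWER_W:.1f} GFLOPS per watt")
    print(f"Peak memory bandwidth: {bandwidth:.1f} GB/s")
```

Under these assumptions the results land just under 7 TFLOPS and just over 288 GB/s, in line with the rounded figures in the list above.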

As with other modern Tesla GPUs, you should expect the M40 to saturate a PCI-E 3.0 x16 link, achieving roughly 12GB/sec of data transfers between the system and each GPU:

[root@node6 ~]# gpu_bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla M40
 Quick Mode

 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12108.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			12870.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)	Bandwidth(MB/s)
   33554432			210331.7

Result = PASS
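To put the measured transfer rates in context, the sketch below parses output in the format shown above and compares the host/device figures against the theoretical rate of a PCI-E 3.0 x16 link. The function names are hypothetical, and the comparison assumes 8 GT/s per lane with 128b/130b encoding and that the tool's "MB/s" means 10^6 bytes per second; treat the percentages as approximate.

```python
import re

# Trimmed copy of the bandwidth test output shown above.
SAMPLE = """
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432                 12108.0

 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432                 12870.2

 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432                 210331.7
"""


def parse_bandwidth_test(text):
    """Return {direction: MB/s} for each section of the output."""
    results = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"\s*(.+?) Bandwidth, \d+ Device\(s\)", line)
        if header:
            current = header.group(1)
            continue
        # Data rows are "<transfer size> <bandwidth>"
        row = re.match(r"\s*(\d+)\s+([\d.]+)\s*$", line)
        if row and current:
            results[current] = float(row.group(2))
    return results


def pcie3_peak_gbs(lanes):
    """Theoretical PCI-E 3.0 rate in GB/s: 8 GT/s per lane, 128b/130b encoding."""
    return 8.0 * (128 / 130) * lanes / 8


if __name__ == "__main__":
    peak = pcie3_peak_gbs(16)
    for direction, mbs in parse_bandwidth_test(SAMPLE).items():
        if direction == "Device to Device":
            continue  # on-card copies do not cross the PCI-E bus
        print(f"{direction}: {mbs / 1000:.1f} GB/s, {mbs / 1000 / peak:.0%} of x16 peak")
```

The gap between the measured and theoretical rates is expected, since the theoretical figure does not account for packet and protocol overhead on the link.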

Technical Details

Below is the full status reported by NVIDIA’s SMI tool. Memory error detection and correction (ECC) is enabled on this GPU; in the output below, error counters are reported for device memory, while the register file and caches show N/A. Notice that the M40 supports a wide range of operating frequencies:

[root@node6 ~]# nvidia-smi -a -i 0

==============NVSMI LOG==============

Timestamp                           : Wed Feb 10 10:30:31 2016
Driver Version                      : 352.79

Attached GPUs                       : 4
GPU 0000:84:00.0
    Product Name                    : Tesla M40
    Product Brand                   : Tesla
    Display Mode                    : Disabled
    Display Active                  : Disabled
    Persistence Mode                : Enabled
    Accounting Mode                 : Enabled
    Accounting Mode Buffer Size     : 1920
    Driver Model
        Current                     : N/A
        Pending                     : N/A
    Serial Number                   : 0320116xxxxxx
    GPU UUID                        : GPU-dbacebc6-3878-d72d-ebe9-87fb50xxxxxx
    Minor Number                    : 3
    VBIOS Version                   : 84.00.48.00.01
    MultiGPU Board                  : No
    Board ID                        : 0x8400
    Inforom Version
        Image Version               : G600.0202.02.01
        OEM Object                  : 1.1
        ECC Object                  : 3.0
        Power Management Object     : N/A
    GPU Operation Mode
        Current                     : N/A
        Pending                     : N/A
    PCI
        Bus                         : 0x84
        Device                      : 0x00
        Domain                      : 0x0000
        Device Id                   : 0x17FD10DE
        Bus Id                      : 0000:84:00.0
        Sub System Id               : 0x117110DE
        GPU Link Info
            PCIe Generation
                Max                 : 3
                Current             : 1
            Link Width
                Max                 : 16x
                Current             : 16x
        Bridge Chip
            Type                    : N/A
            Firmware                : N/A
        Replays since reset         : 0
        Tx Throughput               : 0 KB/s
        Rx Throughput               : 0 KB/s
    Fan Speed                       : 0 %
    Performance State               : P8
    Clocks Throttle Reasons
        Idle                        : Active
        Applications Clocks Setting : Not Active
        SW Power Cap                : Not Active
        HW Slowdown                 : Not Active
        Unknown                     : Not Active
    FB Memory Usage
        Total                       : 11519 MiB
        Used                        : 55 MiB
        Free                        : 11464 MiB
    BAR1 Memory Usage
        Total                       : 16384 MiB
        Used                        : 2 MiB
        Free                        : 16382 MiB
    Compute Mode                    : Default
    Utilization
        Gpu                         : 0 %
        Memory                      : 0 %
        Encoder                     : 0 %
        Decoder                     : 0 %
    Ecc Mode
        Current                     : Enabled
        Pending                     : Enabled
    ECC Errors
        Volatile
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
        Aggregate
            Single Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
            Double Bit            
                Device Memory       : 0
                Register File       : N/A
                L1 Cache            : N/A
                L2 Cache            : N/A
                Texture Memory      : N/A
                Total               : 0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No
    Temperature
        GPU Current Temp            : 25 C
        GPU Shutdown Temp           : 92 C
        GPU Slowdown Temp           : 89 C
    Power Readings
        Power Management            : Supported
        Power Draw                  : 17.24 W
        Power Limit                 : 250.00 W
        Default Power Limit         : 250.00 W
        Enforced Power Limit        : 250.00 W
        Min Power Limit             : 180.00 W
        Max Power Limit             : 250.00 W
    Clocks
        Graphics                    : 324 MHz
        SM                          : 324 MHz
        Memory                      : 405 MHz
    Applications Clocks
        Graphics                    : 1114 MHz
        Memory                      : 3004 MHz
    Default Applications Clocks
        Graphics                    : 947 MHz
        Memory                      : 3004 MHz
    Max Clocks
        Graphics                    : 1113 MHz
        SM                          : 1113 MHz
        Memory                      : 3004 MHz
    Clock Policy
        Auto Boost                  : On
        Auto Boost Default          : On
    Processes                       : None

[root@node6 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0

==============NVSMI LOG==============

Timestamp                           : Wed Feb 10 10:31:16 2016
Driver Version                      : 352.79

Attached GPUs                       : 4
GPU 0000:84:00.0
    Supported Clocks
        Memory                      : 3004 MHz
            Graphics                : 1114 MHz
            Graphics                : 1088 MHz
            Graphics                : 1063 MHz
            Graphics                : 1038 MHz
            Graphics                : 1013 MHz
            Graphics                : 987 MHz
            Graphics                : 962 MHz
            Graphics                : 949 MHz
            Graphics                : 924 MHz
            Graphics                : 899 MHz
            Graphics                : 873 MHz
            Graphics                : 848 MHz
            Graphics                : 823 MHz
            Graphics                : 797 MHz
            Graphics                : 772 MHz
            Graphics                : 747 MHz
            Graphics                : 721 MHz
            Graphics                : 696 MHz
            Graphics                : 671 MHz
            Graphics                : 645 MHz
            Graphics                : 620 MHz
            Graphics                : 595 MHz
            Graphics                : 557 MHz
            Graphics                : 532 MHz
        Memory                      : 405 MHz
            Graphics                : 324 MHz
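When scripting against a node, it can be handy to turn the supported-clocks listing into a data structure. The following is a minimal sketch that parses text in the format shown above into a mapping from each memory clock to its graphics clocks; the function name is hypothetical, and the sample is an abbreviated copy of the listing rather than fresh output.

```python
import re

# Abbreviated copy of the SUPPORTED_CLOCKS listing shown above.
SAMPLE = """
    Supported Clocks
        Memory                      : 3004 MHz
            Graphics                : 1114 MHz
            Graphics                : 1088 MHz
            Graphics                : 532 MHz
        Memory                      : 405 MHz
            Graphics                : 324 MHz
"""


def parse_supported_clocks(text):
    """Return {memory_mhz: [graphics_mhz, ...]} in the order listed."""
    clocks = {}
    memory = None
    for line in text.splitlines():
        match = re.match(r"\s*(Memory|Graphics)\s*:\s*(\d+)\s*MHz", line)
        if not match:
            continue
        kind, mhz = match.group(1), int(match.group(2))
        if kind == "Memory":
            # Each Memory line starts a new group of graphics clocks
            memory = mhz
            clocks[memory] = []
        elif memory is not None:
            clocks[memory].append(mhz)
    return clocks


if __name__ == "__main__":
    table = parse_supported_clocks(SAMPLE)
    top_memory = max(table)
    top_graphics = max(table[top_memory])
    print(f"Highest pair: memory {top_memory} MHz, graphics {top_graphics} MHz")
```

A pair selected this way is what `nvidia-smi -ac <memory,graphics>` expects when setting applications clocks, which typically requires administrative privileges.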

NVIDIA deviceQuery on Tesla M40

The output below, from the CUDA 7.5 SDK samples, shows additional details of the architecture and capabilities of the Tesla M40 GPU accelerators.

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla M40"
  CUDA Driver Version / Runtime Version          7.5 / 7.5
  CUDA Capability Major/Minor version number:    5.2
  Total amount of global memory:                 11520 MBytes (12079464448 bytes)
  (24) Multiprocessors, (128) CUDA Cores/MP:     3072 CUDA Cores
  GPU Max Clock rate:                            1112 MHz (1.11 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
  Maximum Layered 1D Texture Size, (num) layers  1D=(16384), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(16384, 16384), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Enabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 132 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla M40
Result = PASS
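Several useful quantities follow directly from the deviceQuery values above. The sketch below derives a few of them; the names are hypothetical, and attributing the difference between the nominal 12GB and the reported global memory to ECC being enabled is an assumption of this sketch rather than something the output states.

```python
# Values copied from the deviceQuery output above.
SM_COUNT = 24
CORES_PER_SM = 128
MAX_THREADS_PER_SM = 2048
WARP_SIZE = 32
GLOBAL_MEM_BYTES = 12079464448
L2_CACHE_BYTES = 3145728
NOMINAL_MEM_MIB = 12 * 1024   # assumption: "12GB" means 12 GiB of GDDR5 on the board


def derived_limits():
    """Compute quantities implied by the reported device properties."""
    global_mem_mib = GLOBAL_MEM_BYTES / 2**20
    return {
        "cuda_cores": SM_COUNT * CORES_PER_SM,
        "warps_per_sm": MAX_THREADS_PER_SM // WARP_SIZE,
        "max_resident_threads": SM_COUNT * MAX_THREADS_PER_SM,
        "l2_cache_mib": L2_CACHE_BYTES / 2**20,
        "global_mem_mib": global_mem_mib,
        # Share of the nominal capacity not visible as global memory
        "reserved_fraction": 1 - global_mem_mib / NOMINAL_MEM_MIB,
    }


if __name__ == "__main__":
    for name, value in derived_limits().items():
        print(f"{name:22s} {value}")
```

The reserved share comes out close to 6.25%, which would be consistent with ECC check bits being stored in the GDDR5 itself.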

Additional Tesla M40 12GB Information

To learn more about the differences between the Tesla M40 12GB and other GPUs in the Tesla product line, please review our “Kepler” and “Maxwell” Tesla GPU knowledge center articles.

To learn more about GPU-accelerated servers and clusters which provide the Tesla M40, please see our NVIDIA GPU technology page. Although we are able to provide the M40 in tower workstation systems, its passive heatsink depends on chassis airflow, so the design does not allow for quiet workstations.

Photo of the NVIDIA Tesla M40 12GB GPU Accelerator showing the PCI-Express connector

Eliot Eshelman

About Eliot Eshelman

My interests span from astrophysics to bacteriophages; high-performance computers to small spherical magnets. I've been an avid Linux geek (with a focus on HPC) for more than a decade. I work as Microway's Vice President of Strategic Accounts and HPC Initiatives.