With the release of the Tesla M40, NVIDIA continues to diversify its professional compute GPU lineup. Designed specifically for Deep Learning applications, the M40 provides 7 TFLOPS of single-precision floating point performance and 12GB of high-speed GDDR5 memory. It works well with the popular Deep Learning software frameworks and may also find its way into other industries whose workloads need only single-precision accuracy.
The Tesla M40 is also notable for being the first Tesla GPU to be based upon NVIDIA’s “Maxwell” GPU architecture. “Maxwell” provides excellent performance per watt, as evidenced by the fact that this GPU provides 7 TFLOPS within a 250W power envelope.
Maximum single-GPU performance: Tesla M40 12GB GPU
Available in Microway NumberSmasher GPU Servers and GPU Clusters
Specifications
- 3072 CUDA GPU cores (GM200)
- 7.0 TFLOPS single; 0.21 TFLOPS double-precision
- 12GB GDDR5 memory
- Memory bandwidth up to 288 GB/s
- PCI-E x16 Gen3 interface to system
- Dynamic GPU Boost for optimal clock speeds
- Passive heatsink design for installation in qualified GPU servers
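The headline figures in this list can be roughly reproduced from the clock and bus values reported by the tools later in this article. The short Python sketch below is only a back-of-the-envelope check: it assumes one fused multiply-add (2 FLOPs) per CUDA core per clock, and that the GDDR5 data rate is twice the memory clock reported by nvidia-smi. The variable names are illustrative and do not come from any NVIDIA tool.

```python
# Back-of-the-envelope check of the spec list (assumptions noted inline).
CUDA_CORES = 3072               # from the spec list
BOOST_CLOCK_HZ = 1114e6         # highest graphics clock listed by nvidia-smi
FLOPS_PER_CORE_PER_CLOCK = 2    # assumption: one fused multiply-add per clock

MEM_CLOCK_HZ = 3004e6           # memory clock reported by nvidia-smi
TRANSFERS_PER_CLOCK = 2         # assumption: data rate is 2x the reported clock
BUS_WIDTH_BITS = 384            # memory bus width reported by deviceQuery
BOARD_POWER_W = 250             # power limit reported by nvidia-smi

# Peak single-precision throughput in TFLOPS.
peak_sp_tflops = CUDA_CORES * FLOPS_PER_CORE_PER_CLOCK * BOOST_CLOCK_HZ / 1e12

# Peak memory bandwidth in GB/s (bits -> bytes, then to 10^9 bytes).
mem_bw_gbs = MEM_CLOCK_HZ * TRANSFERS_PER_CLOCK * BUS_WIDTH_BITS / 8 / 1e9

# Single-precision efficiency at the full power limit.
gflops_per_watt = peak_sp_tflops * 1000 / BOARD_POWER_W

print(f"Peak single precision: {peak_sp_tflops:.2f} TFLOPS")
print(f"Peak memory bandwidth: {mem_bw_gbs:.1f} GB/s")
print(f"Efficiency: {gflops_per_watt:.1f} GFLOPS per watt")
```

Under these assumptions the computed peak lands slightly below the rounded 7 TFLOPS figure, and the memory bandwidth agrees with the 288 GB/s in the list.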
As with all other modern Tesla GPUs, the M40 should be able to saturate the PCI-E 3.0 x16 bus, achieving ~12GB/sec of data transfer between the system and each GPU:
[root@node6 ~]# gpu_bandwidthTest --memory=pinned --device=0
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla M40
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12108.0
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12870.2
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 210331.7
Result = PASS
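To put these measurements in context, the sketch below compares them with theoretical limits. It assumes that PCI-E 3.0 runs at 8 GT/s per lane with 128b/130b encoding, that the tool's "MB/s" means 10^6 bytes per second (it may instead mean 2^20, which would raise the percentages slightly), and that the 288 GB/s memory bandwidth from the spec list is the right ceiling for the device-to-device copy. The dictionary keys are illustrative names, not fields produced by the tool.

```python
# Compare measured bandwidthTest results with theoretical limits.
# Assumption: "MB/s" in the tool output means 10^6 bytes per second.
PCIE3_TRANSFERS_PER_S = 8e9     # 8 GT/s per lane
PCIE3_LANES = 16
PCIE3_ENCODING = 128 / 130      # 128b/130b line encoding

# Theoretical PCI-E 3.0 x16 rate in bytes per second, per direction.
pcie_limit = PCIE3_TRANSFERS_PER_S * PCIE3_LANES * PCIE3_ENCODING / 8

# Peak device memory bandwidth from the spec list, in bytes per second.
memory_limit = 288e9

# Measured values copied from the output above, in MB/s.
measured_mbs = {
    "host_to_device": 12108.0,
    "device_to_host": 12870.2,
    "device_to_device": 210331.7,
}

efficiency = {
    "host_to_device": measured_mbs["host_to_device"] * 1e6 / pcie_limit,
    "device_to_host": measured_mbs["device_to_host"] * 1e6 / pcie_limit,
    "device_to_device": measured_mbs["device_to_device"] * 1e6 / memory_limit,
}

for name, fraction in efficiency.items():
    print(f"{name}: {fraction:.0%} of theoretical peak")
```

Under these assumptions the host transfers reach roughly 77% and 82% of the raw PCI-E rate. Protocol overhead accounts for much of the remaining gap, so figures in this range are generally regarded as a saturated link.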
Technical Details
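Below is the full status reported by NVIDIA’s SMI tool. Memory error detection and correction (ECC) is enabled on this GPU; note that the output reports ECC error counters for device memory, while the register file, L1/L2 caches, and texture memory are listed as N/A. Notice also that the M40 supports a wide range of operating frequencies: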
[root@node6 ~]# nvidia-smi -a -i 0
==============NVSMI LOG==============
Timestamp : Wed Feb 10 10:30:31 2016
Driver Version : 352.79
Attached GPUs : 4
GPU 0000:84:00.0
Product Name : Tesla M40
Product Brand : Tesla
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Enabled
Accounting Mode Buffer Size : 1920
Driver Model
Current : N/A
Pending : N/A
Serial Number : 0320116xxxxxx
GPU UUID : GPU-dbacebc6-3878-d72d-ebe9-87fb50xxxxxx
Minor Number : 3
VBIOS Version : 84.00.48.00.01
MultiGPU Board : No
Board ID : 0x8400
Inforom Version
Image Version : G600.0202.02.01
OEM Object : 1.1
ECC Object : 3.0
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
PCI
Bus : 0x84
Device : 0x00
Domain : 0x0000
Device Id : 0x17FD10DE
Bus Id : 0000:84:00.0
Sub System Id : 0x117110DE
GPU Link Info
PCIe Generation
Max : 3
Current : 1
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays since reset : 0
Tx Throughput : 0 KB/s
Rx Throughput : 0 KB/s
Fan Speed : 0 %
Performance State : P8
Clocks Throttle Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
Unknown : Not Active
FB Memory Usage
Total : 11519 MiB
Used : 55 MiB
Free : 11464 MiB
BAR1 Memory Usage
Total : 16384 MiB
Used : 2 MiB
Free : 16382 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Ecc Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
Single Bit
Device Memory : 0
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : 0
Aggregate
Single Bit
Device Memory : 0
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : 0
Double Bit
Device Memory : 0
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Total : 0
Retired Pages
Single Bit ECC : 0
Double Bit ECC : 0
Pending : No
Temperature
GPU Current Temp : 25 C
GPU Shutdown Temp : 92 C
GPU Slowdown Temp : 89 C
Power Readings
Power Management : Supported
Power Draw : 17.24 W
Power Limit : 250.00 W
Default Power Limit : 250.00 W
Enforced Power Limit : 250.00 W
Min Power Limit : 180.00 W
Max Power Limit : 250.00 W
Clocks
Graphics : 324 MHz
SM : 324 MHz
Memory : 405 MHz
Applications Clocks
Graphics : 1114 MHz
Memory : 3004 MHz
Default Applications Clocks
Graphics : 947 MHz
Memory : 3004 MHz
Max Clocks
Graphics : 1113 MHz
SM : 1113 MHz
Memory : 3004 MHz
Clock Policy
Auto Boost : On
Auto Boost Default : On
Processes : None
[root@node6 ~]# nvidia-smi -q -d SUPPORTED_CLOCKS -i 0
==============NVSMI LOG==============
Timestamp : Wed Feb 10 10:31:16 2016
Driver Version : 352.79
Attached GPUs : 4
GPU 0000:84:00.0
Supported Clocks
Memory : 3004 MHz
Graphics : 1114 MHz
Graphics : 1088 MHz
Graphics : 1063 MHz
Graphics : 1038 MHz
Graphics : 1013 MHz
Graphics : 987 MHz
Graphics : 962 MHz
Graphics : 949 MHz
Graphics : 924 MHz
Graphics : 899 MHz
Graphics : 873 MHz
Graphics : 848 MHz
Graphics : 823 MHz
Graphics : 797 MHz
Graphics : 772 MHz
Graphics : 747 MHz
Graphics : 721 MHz
Graphics : 696 MHz
Graphics : 671 MHz
Graphics : 645 MHz
Graphics : 620 MHz
Graphics : 595 MHz
Graphics : 557 MHz
Graphics : 532 MHz
Memory : 405 MHz
Graphics : 324 MHz
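The supported clock pairs listed above are the values accepted by nvidia-smi's applications clocks option (-ac, given as memory clock then graphics clock). As a minimal sketch, the Python below parses text in the format shown above and builds, without running it, the command line that would request the highest listed pair. The function names and the shortened sample text are illustrative assumptions; changing applications clocks typically requires administrator privileges.

```python
import re

def parse_supported_clocks(text):
    """Map each memory clock (MHz) to the graphics clocks (MHz) listed under it."""
    clocks = {}
    current_mem = None
    for line in text.splitlines():
        match = re.match(r"\s*(Memory|Graphics)\s*:\s*(\d+)\s*MHz", line)
        if not match:
            continue  # skip headers such as "Supported Clocks"
        kind, mhz = match.group(1), int(match.group(2))
        if kind == "Memory":
            current_mem = mhz
            clocks[current_mem] = []
        elif current_mem is not None:
            clocks[current_mem].append(mhz)
    return clocks

def applications_clocks_command(clocks, gpu_index=0):
    """Build (but do not run) an nvidia-smi command for the highest clock pair."""
    mem = max(clocks)
    graphics = max(clocks[mem])
    return ["nvidia-smi", "-i", str(gpu_index), "-ac", f"{mem},{graphics}"]

# Shortened sample in the same format as the output above.
sample = """Supported Clocks
Memory : 3004 MHz
Graphics : 1114 MHz
Graphics : 1088 MHz
Graphics : 532 MHz
Memory : 405 MHz
Graphics : 324 MHz
"""

clocks = parse_supported_clocks(sample)
command = applications_clocks_command(clocks)
print(" ".join(command))
```

The returned list could be passed to subprocess.run on a machine that has the driver installed; it is only printed here so that the sketch runs anywhere.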
NVIDIA deviceQuery on Tesla M40
The output below, from the CUDA 7.5 SDK samples, shows additional details of the architecture and capabilities of the Tesla M40 GPU accelerators.
deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "Tesla M40"
CUDA Driver Version / Runtime Version 7.5 / 7.5
CUDA Capability Major/Minor version number: 5.2
Total amount of global memory: 11520 MBytes (12079464448 bytes)
(24) Multiprocessors, (128) CUDA Cores/MP: 3072 CUDA Cores
GPU Max Clock rate: 1112 MHz (1.11 GHz)
Memory Clock rate: 3004 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 3145728 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536, 65536), 3D=(4096, 4096, 4096)
Maximum Layered 1D Texture Size, (num) layers 1D=(16384), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(16384, 16384), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Enabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 132 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 7.5, CUDA Runtime Version = 7.5, NumDevs = 1, Device0 = Tesla M40
Result = PASS
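A few derived quantities help when reading the deviceQuery output. The sketch below uses only values printed above. The constant names are illustrative, and the comparison against a nominal 12 GiB is an assumption about what "12GB" means on the spec sheet.

```python
# Values copied from the deviceQuery output above.
TOTAL_GLOBAL_MEM_BYTES = 12079464448
MULTIPROCESSORS = 24
CORES_PER_MULTIPROCESSOR = 128
MAX_THREADS_PER_MULTIPROCESSOR = 2048
WARP_SIZE = 32

# Total CUDA cores, matching the 3072 in the spec list.
cuda_cores = MULTIPROCESSORS * CORES_PER_MULTIPROCESSOR

# Warps that can be resident on one multiprocessor at a time.
warps_per_multiprocessor = MAX_THREADS_PER_MULTIPROCESSOR // WARP_SIZE

# Threads that can be resident on the whole GPU at a time.
resident_threads = MULTIPROCESSORS * MAX_THREADS_PER_MULTIPROCESSOR

# Usable memory in MiB: deviceQuery rounds this to 11520,
# while nvidia-smi reports it as 11519 MiB.
usable_mib = TOTAL_GLOBAL_MEM_BYTES / 2**20

# Fraction of a nominal 12 GiB (assumption) that is usable.
usable_fraction = usable_mib / (12 * 1024)

print(f"CUDA cores: {cuda_cores}")
print(f"Warps per multiprocessor: {warps_per_multiprocessor}")
print(f"Resident threads: {resident_threads}")
print(f"Usable memory: {usable_mib:.3f} MiB ({usable_fraction:.2%} of 12 GiB)")
```

The usable fraction comes out very close to 93.75%. That is consistent with ECC reserving about 6.25% of GDDR5 capacity on Tesla boards, although the article itself does not say where the remainder goes.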
Additional Tesla M40 12GB Information
To learn more about the differences between the Tesla M40 12GB and other versions of the Tesla product line, please review our “Kepler” and “Maxwell” Tesla GPU knowledge center articles:
- In-Depth Comparison of NVIDIA Tesla “Kepler” GPU Accelerators
- In-Depth Comparison of NVIDIA Tesla “Maxwell” GPU Accelerators
To learn more about GPU-accelerated servers and clusters that include the Tesla M40, please see our NVIDIA GPU technology page. Although we are able to provide the M40 in tower workstation systems, its passive heatsink design does not allow for quiet workstations.


