Revision for “Check for memory errors on NVIDIA GPUs” created on February 14, 2019 @ 13:10:35
Check for memory errors on NVIDIA GPUs
Professional NVIDIA GPUs (the Tesla and Quadro products) are equipped with error-correcting code (ECC) memory, which allows the system to detect when memory errors occur. Smaller "single-bit" errors are transparently corrected. Larger "double-bit" memory errors will cause applications to crash, but are at least detected (GPUs without ECC memory would continue operating on the corrupted data).
There are conditions under which GPU events are reported to the Linux kernel, in which case you will see such errors in the <a href="https://www.microway.com/knowledge-center-articles/high-level-linux-troubleshooting/" rel="noopener" target="_blank">system logs</a>. However, the GPUs themselves will also store the type and date of the event.
<em>It’s important to note that not all ECC errors are due to hardware failures. Stray cosmic rays are known to cause bit flips. For this reason, memory is not considered "bad" when a single error occurs (or even when several errors occur). <strong>If you have a device reporting tens or hundreds of double-bit errors, please contact Microway tech support for review.</strong> You may also wish to review the <a href="https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html" rel="noopener" target="_blank">NVIDIA documentation</a>.</em>
To review the current health of the GPUs in a system, use the nvidia-smi utility:
Timestamp : Thu Feb 14 10:58:34 2019
Attached GPUs : 4
If the above report indicates that memory pages have been retired, then you may wish to see additional details (including when the pages were retired). If nvidia-smi reports <code>Pending: Yes</code>, then memory errors have occurred since the driver was last loaded (typically the last reboot), and the affected pages are waiting to be retired. In either case, older page retirements may also be on record.
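The check described above can also be scripted. The following is a minimal sketch, not a definitive tool: it assumes the report comes from <code>nvidia-smi -q -d PAGE_RETIREMENT</code> and that each GPU section contains lines shaped like <code>Single Bit ECC : 0</code>, <code>Double Bit ECC : 0</code> and <code>Pending : No</code>. Field labels can differ between driver versions, and the function names are hypothetical.

```python
import re


def parse_page_retirement(report: str) -> dict:
    """Parse a page-retirement text report into {bus_id: {field: value}}.

    Assumes the layout of `nvidia-smi -q -d PAGE_RETIREMENT`; labels may
    vary between driver versions.
    """
    gpus = {}
    current = None
    for line in report.splitlines():
        stripped = line.strip()
        # A line such as "GPU 00000000:18:00.0" opens a new per-GPU section.
        header = re.match(r"^GPU\s+([0-9A-Fa-f:.]+)$", stripped)
        if header:
            current = gpus.setdefault(header.group(1), {})
            continue
        # Inside a GPU section, keep every "label : value" line.
        if current is not None and ":" in stripped:
            key, _, value = stripped.partition(":")
            current[key.strip()] = value.strip()
    return gpus


def gpus_needing_attention(gpus: dict) -> list:
    """Return bus IDs with retired pages or a pending retirement."""
    flagged = []
    for bus_id, fields in gpus.items():
        # Any label starting with "Pending" and set to "Yes" counts.
        pending = any(
            k.startswith("Pending") and v == "Yes" for k, v in fields.items()
        )
        # Counters such as "Double Bit ECC" may read "N/A" on some GPUs.
        retired = any(
            k.endswith("ECC") and v.isdigit() and int(v) > 0
            for k, v in fields.items()
        )
        if pending or retired:
            flagged.append(bus_id)
    return flagged
```

To use it, capture the report text (for example with <code>subprocess.run(["nvidia-smi", "-q", "-d", "PAGE_RETIREMENT"], capture_output=True, text=True).stdout</code>) and pass it to <code>parse_page_retirement</code>. Keeping the parsing separate from the command invocation lets the logic be exercised on saved reports from machines without a GPU.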
To review a complete listing of the GPU memory pages which have been retired (including the unique ID of each GPU), run:
gpu_uuid, retired_pages.address, retired_pages.cause
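A listing with the header shown above can be grouped per GPU with a short script. This is a sketch under stated assumptions: that the listing is CSV produced by something like <code>nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv</code>, that fields are separated by a comma and a space, and that the cause column contains text such as <code>Double Bit ECC</code>. The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict


def parse_retired_pages(csv_text: str) -> dict:
    """Group retired pages by GPU UUID as {uuid: [(address, cause), ...]}."""
    # skipinitialspace handles the ", " separator used in the header row.
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    pages = defaultdict(list)
    for row in reader:
        uuid = (row.get("gpu_uuid") or "").strip()
        if not uuid:
            continue  # skip malformed or empty rows
        address = (row.get("retired_pages.address") or "").strip()
        cause = (row.get("retired_pages.cause") or "").strip()
        pages[uuid].append((address, cause))
    return dict(pages)


def double_bit_counts(pages: dict) -> dict:
    """Count retirements per GPU whose cause mentions 'Double Bit'."""
    return {
        uuid: sum(1 for _, cause in entries if "Double Bit" in cause)
        for uuid, entries in pages.items()
    }
```

The per-GPU double-bit count is the figure most relevant to the support guidance earlier in this article, since isolated single-bit events are expected from time to time.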
A different type of output must be selected in order to read the timestamps of page retirements. This output is in XML format and takes a bit more effort to parse. In short, try running a report such as the one shown below:
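As an alternative to reading the XML by eye, the timestamps can be extracted with a small parser. This sketch assumes the XML comes from <code>nvidia-smi -q -x</code>, that each GPU is a <code>gpu</code> element with an <code>id</code> attribute, and that each retired page is recorded as a <code>retired_page_address</code> element followed by a <code>retired_page_timestamp</code> element. Those tag names are assumptions that should be checked against the output of your driver version, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def retirement_timestamps(xml_text: str) -> dict:
    """Return {gpu_id: [(page_address, timestamp), ...]} from XML output."""
    root = ET.fromstring(xml_text)
    result = {}
    for gpu in root.iter("gpu"):
        entries = []
        address = None
        # Walk the GPU subtree in document order and pair each address
        # with the timestamp element that follows it.
        for elem in gpu.iter():
            if elem.tag == "retired_page_address":
                address = (elem.text or "").strip()
            elif elem.tag == "retired_page_timestamp" and address is not None:
                entries.append((address, (elem.text or "").strip()))
                address = None
        result[gpu.get("id", "unknown")] = entries
    return result
```

Pairing by document order, instead of relying on a specific wrapper element around each page, keeps the sketch tolerant of small differences in how the page list is nested.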