Check for memory errors on NVIDIA GPUs

Eliot Eshelman

February 14, 2019

Professional NVIDIA GPUs (the Tesla and Quadro products) are equipped with error-correcting code (ECC) memory, which allows the system to detect when memory errors occur. Smaller “single-bit” errors are transparently corrected. Larger “double-bit” memory errors will cause applications to crash, but are at least detected (GPUs without ECC memory would continue operating on the corrupted data).

There are conditions under which GPU events are reported to the Linux kernel, in which case you will see such errors in the system logs. However, the GPUs themselves will also store the type and date of the event.
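When GPU events do reach the kernel, the NVIDIA driver writes them to the kernel log as "Xid" messages prefixed with NVRM. The sketch below is one way to pull those lines out of a log stream. The function name filter_xid_events is a hypothetical helper, not part of any NVIDIA tool, and where the kernel log lives (dmesg, journalctl, or a file under /var/log) depends on the distribution.

```shell
# Hypothetical helper: reads kernel log text on stdin and keeps only the
# lines that the NVIDIA driver logged as Xid events.
filter_xid_events() {
    grep 'NVRM: Xid'
}

# Usage on a live system (which source applies depends on the distribution):
#   dmesg | filter_xid_events
#   journalctl -k | filter_xid_events
```

An empty result only means no Xid events are present in that log. The counters stored on the GPU itself should still be checked with nvidia-smi.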

It’s important to note that not all ECC errors are due to hardware failures. Stray cosmic rays are known to cause bit flips. For this reason, memory is not considered “bad” when a single error occurs (or even when several errors occur). If you have a device reporting tens or hundreds of double-bit errors, please contact Microway tech support for review. You may also wish to review the NVIDIA documentation.

To review the current health of the GPUs in a system, use the nvidia-smi utility:

[root@node7 ~]# nvidia-smi -q -d PAGE_RETIREMENT

==============NVSMI LOG==============

Timestamp                           : Thu Feb 14 10:58:34 2019
Driver Version                      : 410.48

Attached GPUs                       : 4
GPU 00000000:18:00.0
    Retired Pages
        Single Bit ECC              : 0
        Double Bit ECC              : 0
        Pending                     : No

GPU 00000000:3B:00.0
    Retired Pages
        Single Bit ECC              : 15
        Double Bit ECC              : 0
        Pending                     : No

Of the two GPUs shown above, one has no retired pages and the other has 15 pages retired due to single-bit ECC errors (that card is still functional and in operation).
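On a node with several GPUs, it can help to reduce that report to just the cards that need attention. The following is a minimal sketch, assuming the report layout matches the sample above (a "GPU <bus id>" line followed by the Single Bit ECC, Double Bit ECC and Pending lines). The function name flag_retired_gpus is hypothetical, and other driver versions may label these fields differently.

```shell
# Hypothetical helper: reads "nvidia-smi -q -d PAGE_RETIREMENT" text on stdin
# and prints one line per GPU that has retired pages or a pending retirement.
flag_retired_gpus() {
    awk '
        # Print the GPU collected so far, but only if it has something to report.
        function report() {
            if (gpu != "" && (sbe + 0 > 0 || dbe + 0 > 0 || pending == "Yes"))
                printf "%s single_bit=%d double_bit=%d pending=%s\n", gpu, sbe, dbe, pending
        }
        # A new "GPU <bus id>" line closes the previous GPU and starts the next.
        /^GPU [0-9A-Fa-f]+:/ { report(); gpu = $2; sbe = 0; dbe = 0; pending = "No" }
        /Single Bit ECC/     { sbe = $NF }
        /Double Bit ECC/     { dbe = $NF }
        /Pending/            { pending = $NF }
        END { report() }
    '
}

# Usage on a live system:
#   nvidia-smi -q -d PAGE_RETIREMENT | flag_retired_gpus
```

Fed the sample report above, this prints a single line for GPU 00000000:3B:00.0 and nothing for the healthy card.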

If the above report indicates that memory pages have been retired, then you may wish to see additional details (including when the pages were retired). If nvidia-smi reports Pending: Yes, then memory errors have occurred since the last time the system rebooted, and the affected pages are waiting to be retired at the next reboot. In either case, older page retirements may also have taken place.

To review a complete listing of the GPU memory pages which have been retired (including the unique ID of each GPU), run:

[root@node7 ~]# nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv

gpu_uuid, retired_pages.address, retired_pages.cause
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c05e, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005ca0d, Single Bit ECC
GPU-9fa5168d-97bf-98aa-33b9-45329682f627, 0x000000000005c72e, Single Bit ECC
...
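When many pages are listed, a tally per GPU and cause can be easier to scan than the raw list. The sketch below assumes the CSV layout shown above (three fields separated by a comma and a space, with a header row). The function name count_retired_pages is a hypothetical helper.

```shell
# Hypothetical helper: reads the CSV produced by --query-retired-pages on
# stdin and prints "<count> <gpu_uuid>, <cause>" for each GPU/cause pair.
count_retired_pages() {
    awk -F', ' '
        $1 == "gpu_uuid" { next }               # skip the CSV header row
        NF >= 3 { count[$1 ", " $3]++ }         # ignore blank or truncated lines
        END { for (key in count) print count[key], key }
    ' | sort
}

# Usage on a live system:
#   nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv | count_retired_pages
```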

A different type of output must be selected in order to read the timestamps of page retirements. The output is in XML format and may require a bit more effort to parse. In short, try running a report such as the one shown below (the -i 1 option limits the report to the GPU with index 1):

[root@node7 ~]# nvidia-smi -i 1 -q -x | grep -i -A1 retired_page_addr

<retired_page_address>0x000000000005c05e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005ca0d</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:25 2017</retired_page_timestamp>
--
<retired_page_address>0x000000000005c72e</retired_page_address>
<retired_page_timestamp>Mon Dec 18 06:25:31 2017</retired_page_timestamp>
...
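To get one line per retired page, the address and timestamp elements can be paired up. This is a rough sketch that treats the XML as plain text rather than using a real XML parser. It assumes each retired_page_address element sits on its own line and is followed by its retired_page_timestamp, as in the output above. The function name list_retirement_times is hypothetical.

```shell
# Hypothetical helper: reads "nvidia-smi -q -x" XML on stdin and prints
# "<address> <timestamp>" for each retired page.
list_retirement_times() {
    # Splitting on "<" and ">" leaves the element's text in field 3.
    awk -F'[<>]' '
        /<retired_page_address>/   { addr = $3 }
        /<retired_page_timestamp>/ { print addr, $3 }
    '
}

# Usage on a live system:
#   nvidia-smi -i 1 -q -x | list_retirement_times
```

Pages retired within seconds of each other, such as the three shown above, then appear on adjacent lines with near-identical timestamps.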
