High-Level Linux Troubleshooting

microway · July 25, 2013

Whether you’re working on a cluster, a server or a workstation, most installations of Linux are similar. When something goes wrong, you need to determine the exact issue before you can get it resolved. This article provides a top-level overview of Linux troubleshooting.

Linux Kernel Messages

The Linux kernel is often aware of issues as they occur. If you suspect you’re facing a hardware issue or serious software issue (crashes/segfaults), the kernel can probably provide more information.

To see the most recent messages, run:
dmesg | tail -n50

To find older messages, read through the log file /var/log/messages (on some systems /var/log/kern.log). The Linux kernel prints many messages during normal operation (especially during the boot process), so don’t assume everything you see is a serious error.
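
When an older log is too long to read by hand, a keyword filter can narrow it down. The sketch below is one possible approach rather than a complete check: scan_kernel_log is a hypothetical helper name, and the keyword list is an assumption you should adjust for your system. Read every match in context, since normal operation can also produce lines containing these words.

```shell
# Hypothetical helper: print log lines that contain common problem keywords.
# The keyword list is an assumption, not a complete list of kernel errors.
# Argument: path to a log file (defaults to /var/log/messages).
scan_kernel_log() {
    log_file="${1:-/var/log/messages}"
    grep -iE 'error|fail|segfault|panic|oops' "$log_file"
}

# Example usage (reading the log may require root privileges):
#   scan_kernel_log /var/log/messages | tail -n50
#   scan_kernel_log /var/log/kern.log | tail -n50
```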

Memory Errors

If your dmesg output contains messages similar to the examples below, your system is encountering errors when accessing memory. Because modern system components are closely integrated, such an error may be caused by several different types of hardware failure. Please send the dmesg output to our support team.

sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010091
TSC 0 ADDR 10877e640 MISC 21420c8c86 PROCESSOR 0:206d6 TIME 1369016551 SOCKET 0 APIC 0
EDAC MC0: CE row 0, channel 0, label "CPU_SrcID#0_Channel#0_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0091 (ch=1), addr = 0x108778e40 => socket=0, Channel=0(mask=1), rank=0

kernel:[Hardware Error]: CPU:56 MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c55c00080080a13
kernel:[Hardware Error]:     MC4_ADDR: 0x000000720157c6f0
kernel:[Hardware Error]: Northbridge Error (node 7): DRAM ECC error detected on the NB.
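
To pull messages like the examples above out of a long dmesg output, you could filter on the strings they contain. This is a minimal sketch: the function names are hypothetical, and the patterns are taken only from the sample messages shown here, so other hardware or kernel versions may word their messages differently.

```shell
# Read kernel messages on stdin; print only lines that resemble the
# memory / machine-check examples above. Patterns are assumptions based
# on those samples and will not catch every possible wording.
filter_memory_errors() {
    grep -E 'Machine Check Exception|MCE|EDAC|\[Hardware Error\]|ECC error'
}

# Read kernel messages on stdin; count how often each EDAC label
# (for example a DIMM slot name) appears.
count_edac_labels() {
    grep -o 'label "[^"]*"' | sort | uniq -c
}

# Example usage:
#   dmesg | filter_memory_errors > memory-errors.txt
#   dmesg | count_edac_labels
```

A label that shows up repeatedly may help narrow down which DIMM is involved, but because several kinds of hardware failure can cause these errors, the full dmesg output is still what support needs.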

NVIDIA GPU Errors

Kernel messages that contain the terms NVRM or Xid indicate that an event occurred on an NVIDIA GPU. Not every such message is fatal, so please contact Microway support for additional review. Consult NVIDIA documentation for the full list of Xid errors. Some examples of higher-priority issues are shown below.

NVRM: GPU at 0000:83:00: GPU-722f9c93-9a7f-08e3-6cc2-a5d8e3331e7f
NVRM: Xid (PCI:0000:83:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU

NVRM: Xid (PCI:0000:83:00): 79, GPU has fallen off the bus.
NVRM: GPU at 0000:83:00.0 has fallen off the bus.
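
Since the Xid number is what you look up in the NVIDIA documentation, it can help to extract it together with the PCI address of the GPU. The sketch below assumes messages follow the "NVRM: Xid (PCI:address): number, text" layout of the examples above; list_xid_events is a hypothetical helper name.

```shell
# Read kernel messages on stdin; for each Xid event print
# "<PCI address> <Xid number>". Assumes the message layout shown
# in the examples above; lines in other layouts are skipped.
list_xid_events() {
    sed -n 's/.*NVRM: Xid (PCI:\([^)]*\)): \([0-9][0-9]*\),.*/\1 \2/p'
}

# Example usage:
#   dmesg | list_xid_events | sort | uniq -c
```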

Software RAID Errors

The dmesg output below shows an example message for a system with a degraded software RAID array. This occurs when one of the drives in the array fails, and the failed drive will need to be replaced. Please send a copy of the file /proc/mdstat to our support team.

[2010086.462608] md/raid1:md1: Disk failure on sdb1, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[2010086.474910] RAID1 conf printout:
[2010086.474914]  --- wd:1 rd:2
[2010086.474917]  disk 0, wo:1, o:0, dev:sdb1
[2010086.474919]  disk 1, wo:0, o:1, dev:sda1
[2010086.480441] RAID1 conf printout:
[2010086.480444]  --- wd:1 rd:2
[2010086.480447]  disk 1, wo:0, o:1, dev:sda1
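
You can also check /proc/mdstat yourself before sending it. In that file each array reports its members with a status string such as [UU], where an underscore (for example [_U]) marks a missing or failed member. The sketch below lists arrays with such a gap; list_degraded_arrays is a hypothetical helper name, and it assumes the usual two-line layout where the status string follows the line that begins with the array name.

```shell
# Print the names of md arrays whose status string (e.g. [_U]) shows a
# missing member. Reads mdstat-formatted text from the given file
# (default: /proc/mdstat). Assumes the status line follows the "mdN :" line.
list_degraded_arrays() {
    awk '/^md/ { name = $1 }
         /\[[U_]*_[U_]*\]/ { print name }' "${1:-/proc/mdstat}"
}

# Example usage:
#   list_degraded_arrays
#   cp /proc/mdstat mdstat-$(hostname).txt   # copy to send to support
```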

Application Errors

If your scientific code is not working properly but you can find no system errors or messages, Linux and the hardware are probably working fine. It is likely that your code, your compiler, or one of the scientific/math libraries has a bug. In some cases it is simply a compatibility issue: recompiling with a different compiler or library may fix it (e.g., OpenMPI instead of MVAPICH2, or the Intel compiler instead of the GNU compiler).

System Hangs/Crashes

Many different conditions can be described as a “system hang”, and there are a variety of possible causes. Please refer to what to do when your system hangs.

No Linux Kernel Messages; System Reboots/Powers Off

If your system is rebooting or powering off with no warning, Linux will not be able to log the cause. You should verify that both your power and cooling are sufficient. The room should be roughly 74°F – systems that overheat will automatically power themselves off.
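
As a rough spot check on cooling, some systems expose component temperatures under /sys/class/thermal. The sketch below assumes your kernel provides thermal_zone* entries there, with values in thousandths of a degree Celsius; many servers report temperatures only through the BMC, in which case this prints nothing. print_zone_temps is a hypothetical helper name, and these are component temperatures, not the room temperature.

```shell
# Print "<zone name> <temperature in whole degrees C>" for each thermal
# zone found under the given directory (default: /sys/class/thermal).
# Assumes each zone's "temp" file holds millidegrees Celsius.
print_zone_temps() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        millic=$(cat "$zone/temp")
        echo "$(basename "$zone") $((millic / 1000))"
    done
}

# Example usage:
#   print_zone_temps
```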

If power and cooling are reliable, then the most likely explanation is a hardware issue. Our support team can help you track down the issue.
