<em>Whether you’re working on a cluster, a server or a workstation, most installations of Linux are similar. When something goes wrong, you need to determine the exact issue before you can get it resolved. This article provides a top-level overview of Linux troubleshooting.</em>
<h2>Linux Kernel Messages</h2>
The Linux kernel is often aware of issues as they occur. If you suspect you’re facing a hardware issue or serious software issue (crashes/segfaults), the kernel can probably provide more information.
To see the most recent messages, run:
<code>dmesg | tail -n50</code>
To find older messages, read through the log file <code>/var/log/messages</code> (on some systems <code>/var/log/kern.log</code>). The Linux kernel prints many messages during normal operation (especially during the boot process), so don’t assume everything you see is a serious error.
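Because the log file location differs between distributions, a small helper can pick the first readable file and filter it. The following is only a sketch: the function names <code>first_readable</code> and <code>recent_kernel_problems</code> and the keyword list are illustrative assumptions, not part of any standard tool.

```shell
# Hypothetical helper: print the first readable file among the arguments.
first_readable() {
    for candidate in "$@"; do
        if [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: show the last 50 lines of a log file that mention
# common problem keywords. The keyword list is an assumption; tune it.
recent_kernel_problems() {
    grep -iE 'error|fail|fault|panic' "$1" | tail -n 50
}

# Example (reading system logs may require root):
#   log=$(first_readable /var/log/messages /var/log/kern.log) && recent_kernel_problems "$log"
```

On systemd-based systems, <code>journalctl -k</code> should show the same kernel messages, provided the journal is available on your installation. A keyword match is only a starting point, since many matching lines are harmless.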
If your dmesg output contains messages similar to the examples below, your system is encountering errors when accessing memory. Because modern system components are closely integrated, such an error may be caused by several different types of hardware failure. Please send the dmesg output to our support team.
<pre>sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010091
TSC 0 ADDR 10877e640 MISC 21420c8c86 PROCESSOR 0:206d6 TIME 1369016551 SOCKET 0 APIC 0
EDAC MC0: CE row 0, channel 0, label "CPU_SrcID#0_Channel#0_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0091 (ch=1), addr = 0x108778e40 => socket=0, Channel=0(mask=1), rank=0</pre>
<pre>kernel:[Hardware Error]: CPU:56 MC4_STATUS[-|CE|MiscV|-|AddrV|-|-|CECC]: 0x9c55c00080080a13
kernel:[Hardware Error]: MC4_ADDR: 0x000000720157c6f0
kernel:[Hardware Error]: Northbridge Error (node 7): DRAM ECC error detected on the NB.</pre>
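To pull only these kinds of lines out of a long dmesg listing, a filter along the following lines may help. It is a minimal sketch: the function name <code>find_memory_errors</code> is hypothetical, and the patterns are taken from the sample messages above rather than from an exhaustive list.

```shell
# Hypothetical helper: keep only lines that look like memory or
# machine-check reports. The patterns come from the sample messages in
# this article and are not a complete list.
find_memory_errors() {
    grep -E 'Machine Check Exception|MCE MEMORY ERROR|EDAC MC[0-9]+|\[Hardware Error\]|ECC error'
}

# Example: save matching lines to include with a support request.
#   dmesg | find_memory_errors > memory-errors.txt
```

The full dmesg output is still the most useful thing to send to support. The filtered file just makes it quicker to see whether memory errors are present at all.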
<h3>NVIDIA GPU Errors</h3>
Kernel messages that contain the terms <code>NVRM</code> or <code>Xid</code> indicate that some type of event occurred on an NVIDIA GPU. Not all such messages are fatal, so please contact Microway support to have them reviewed. Consult <a href="https://docs.nvidia.com/deploy/xid-errors/index.html" rel="noopener" target="_blank">NVIDIA documentation</a> for the full list of Xid errors. Some examples of higher-priority issues are shown below.
<pre>NVRM: GPU at 0000:83:00: GPU-722f9c93-9a7f-08e3-6cc2-a5d8e3331e7f
NVRM: Xid (PCI:0000:83:00): 48, An uncorrectable double bit error (DBE) has been detected on GPU</pre>
<pre>NVRM: Xid (PCI:0000:83:00): 79, GPU has fallen off the bus.
NVRM: GPU at 0000:83:00.0 has fallen off the bus.</pre>
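When a system has several GPUs, it can help to summarize which device reported which Xid code before looking the codes up. The sketch below assumes the message layout shown in the examples above; the function name <code>list_xid_events</code> is hypothetical.

```shell
# Hypothetical helper: read kernel messages on stdin and print
# "PCI-address Xid-code" for every NVRM Xid line. Assumes the
# "NVRM: Xid (PCI:...): NN, ..." layout shown in this article.
list_xid_events() {
    sed -nE 's/.*NVRM: Xid \(PCI:([0-9A-Fa-f:.]+)\): ([0-9]+),.*/\1 \2/p'
}

# Example: count how often each GPU reported each Xid code.
#   dmesg | list_xid_events | sort | uniq -c
```

The numeric code in the second column is what you would look up in the NVIDIA Xid list.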
<h3>Software RAID Errors</h3>
The dmesg output below shows an example message for a system with a degraded software RAID. This occurs when one of the hard drives fails, and the failed drive will need to be replaced. Please send a copy of the file <code>/proc/mdstat</code> to our support team.
<pre>[2010086.462608] md/raid1:md1: Disk failure on sdb1, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[2010086.474910] RAID1 conf printout:
[2010086.474914] --- wd:1 rd:2
[2010086.474917] disk 0, wo:1, o:0, dev:sdb1
[2010086.474919] disk 1, wo:0, o:1, dev:sda1
[2010086.480441] RAID1 conf printout:
[2010086.480444] --- wd:1 rd:2
[2010086.480447] disk 1, wo:0, o:1, dev:sda1</pre>
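A degraded array can also be spotted directly in <code>/proc/mdstat</code>, where each array has a status map such as <code>[UU]</code> and an underscore marks a missing or failed member (for example <code>[_U]</code>). The following is a rough sketch that lists arrays with such a map; the function name <code>degraded_md_arrays</code> is hypothetical.

```shell
# Hypothetical helper: print the name of each md array whose status map
# (e.g. [_U]) contains "_", which marks a missing or failed member.
# Reads /proc/mdstat by default; pass another path to check a saved copy.
degraded_md_arrays() {
    awk '
        /^md[^ ]* *:/      { name = $1 }   # remember the array being described
        /\[[U_]*_[U_]*\]/  { print name }  # status map with at least one "_"
    ' "${1:-/proc/mdstat}"
}

# Example:
#   degraded_md_arrays
```

If the function prints nothing, no array in the file is currently reporting a missing member. It does not replace sending the full <code>/proc/mdstat</code> contents to support.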
If your scientific code is not working properly but you can find no system errors or messages, Linux and the hardware are probably working fine. It is more likely that your code, your compiler, or one of the scientific/math libraries has a bug. In some cases it is simply a compatibility issue, and recompiling with a different compiler or library may fix it (e.g., OpenMPI instead of MVAPICH2, or the Intel compiler instead of the GNU compiler).
Many different conditions can be described as a "system hang", and such behavior has a variety of possible causes. Please refer to <a href="https://www.microway.com/knowledge-center-articles/what-to-do-when-your-system-hangs/" target="_blank">what to do when your system hangs</a>.
<h2>No Linux Kernel Messages; System Reboots/Powers Off</h2>
If your system is rebooting or powering off with no warning, Linux will not be able to log the cause. You should verify that both your power and cooling are sufficient. Keep the room at roughly 74°F (about 23°C); systems that overheat will automatically power themselves off.
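One way to get a rough picture of cooling while the system is still running is to read the kernel's thermal zones, which report temperatures in millidegrees Celsius under <code>/sys/class/thermal</code>. This is only a sketch: not every system exposes thermal zones, the function name <code>print_thermal_zones</code> is hypothetical, and the optional directory argument is an assumption added so the function can be pointed at a test directory.

```shell
# Hypothetical helper: print each thermal zone with its temperature in
# degrees Celsius. The kernel reports the value in millidegrees Celsius.
# The optional argument (default /sys/class/thermal) is for testing only.
print_thermal_zones() {
    root="${1:-/sys/class/thermal}"
    for zone in "$root"/thermal_zone*; do
        # Skip zones without a readable temperature (or an unmatched glob).
        [ -r "$zone/temp" ] || continue
        millideg=$(cat "$zone/temp")
        kind=$(cat "$zone/type" 2>/dev/null || echo unknown)
        printf '%s %s %d.%d C\n' "$(basename "$zone")" "$kind" \
            "$((millideg / 1000))" "$((millideg % 1000 / 100))"
    done
}

# Example:
#   print_thermal_zones
```

On servers with a management controller, the hardware event log (for example via <code>ipmitool sel list</code>, if IPMI is available on your system) may also record over-temperature or power events that Linux itself could not log.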
If power and cooling are reliable, then the most likely explanation is a hardware issue. Our support team can help you track down the issue.