Linux Kernel Messages
The Linux kernel is often aware of issues as they occur. If you suspect you’re facing a hardware issue or serious software issue (crashes/segfaults), the kernel can probably provide more information.
To see the most recent messages, run:
dmesg | tail -n50
To find older messages, read through the log file /var/log/messages. The Linux kernel prints many messages during normal operation (especially during the boot process), so don’t assume everything you see is a serious error.
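Because most kernel messages are routine, it can help to narrow the output to lines that commonly indicate trouble. The following is a minimal sketch: the function name filter_kernel_problems and its keyword list are illustrative assumptions, not an exhaustive or official set of error patterns.

```shell
# Hypothetical helper: reduce kernel log text on stdin to lines that often
# indicate hardware or serious software problems.
# The keyword list is an assumption for illustration; extend it as needed.
# The "mce" keyword is a loose substring match and may catch unrelated lines.
filter_kernel_problems() {
    grep -i -E 'machine check|mce|edac|i/o error|disk failure|segfault|out of memory|call trace'
}
```

Typical usage would be "dmesg | filter_kernel_problems" for recent messages, or "filter_kernel_problems < /var/log/messages" for older ones. A line that matches is only a starting point for investigation, and an empty result does not prove the system is healthy.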
If your dmesg output contains messages similar to the example below, your system is encountering errors when accessing memory. Because modern system components are closely integrated, such an error may be caused by several different types of hardware failure. Please send the dmesg output to our support team.
sbridge: HANDLING MCE MEMORY ERROR
CPU 0: Machine Check Exception: 0 Bank 5: 8c00004000010091
TSC 0 ADDR 10877e640 MISC 21420c8c86
PROCESSOR 0:206d6 TIME 1369016551 SOCKET 0 APIC 0
EDAC MC0: CE row 0, channel 0, label "CPU_SrcID#0_Channel#0_DIMM#0": 1 Unknown error(s): memory read on FATAL area : cpu=0 Err=0001:0091 (ch=1), addr = 0x108778e40 => socket=0, Channel=0(mask=1), rank=0
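The EDAC line in the example above reports a corrected error (CE) on memory controller MC0. When an EDAC driver is loaded, the kernel also keeps running totals of corrected and uncorrected errors under /sys/devices/system/edac/mc. The sketch below prints those counters; the function name edac_report is hypothetical, and the directory will be empty or absent on systems without EDAC support.

```shell
# Hypothetical helper: print corrected (ce_count) and uncorrected (ue_count)
# memory error totals for each EDAC memory controller.
# The optional first argument overrides the sysfs directory.
edac_report() {
    base="${1:-/sys/devices/system/edac/mc}"
    for mc in "$base"/mc*; do
        # Skip if no controllers exist (the glob stays unexpanded) or the counter is unreadable.
        [ -r "$mc/ce_count" ] || continue
        printf '%s: corrected=%s uncorrected=%s\n' \
            "${mc##*/}" \
            "$(cat "$mc/ce_count")" \
            "$(cat "$mc/ue_count" 2>/dev/null)"
    done
}
```

Running "edac_report" with no arguments reads the live counters. A count that keeps rising is worth including alongside the dmesg output when you contact support.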
Software RAID Errors
The dmesg output below shows an example message for a system with a degraded software RAID. An array becomes degraded when one of its hard drives fails, and the failed drive will need to be replaced. Please send a copy of the file /proc/mdstat to our support team.
[2010086.462608] md/raid1:md1: Disk failure on sdb1, disabling device.
md/raid1:md1: Operation continuing on 1 devices.
[2010086.474910] RAID1 conf printout:
[2010086.474914] --- wd:1 rd:2
[2010086.474917] disk 0, wo:1, o:0, dev:sdb1
[2010086.474919] disk 1, wo:0, o:1, dev:sda1
[2010086.480441] RAID1 conf printout:
[2010086.480444] --- wd:1 rd:2
[2010086.480447] disk 1, wo:0, o:1, dev:sda1
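A degraded array can also be spotted directly in /proc/mdstat, where each array's status line ends with a bracketed map of its members: "U" for a working device and "_" for a missing or failed one (for example, [_U]). The sketch below lists arrays whose map contains an underscore. The function name degraded_arrays is hypothetical, and the parsing assumes the usual two-line layout of /proc/mdstat.

```shell
# Hypothetical helper: print the name of every md array whose member map
# (e.g. [_U]) shows at least one missing or failed device.
# The optional first argument overrides the file to read.
degraded_arrays() {
    mdstat="${1:-/proc/mdstat}"
    awk '
        # Remember the array named on the most recent "mdX : ..." line.
        /^md[^ ]* :/ { name = $1 }
        # A status line ending in a map containing "_" marks that array as degraded.
        /\[[U_]*_[U_]*\][ \t]*$/ { if (name != "") print name; name = "" }
    ' "$mdstat"
}
```

To capture the file for support, "cat /proc/mdstat > mdstat.txt" is sufficient. If the mdadm tool is installed, "mdadm --detail /dev/md1" (using the array name from the example above) typically gives a more detailed per-device view.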
If your scientific code is not working properly but you can find no system errors or messages, Linux and the hardware are most likely working fine. The more probable cause is a bug in your code, your compiler, or one of the scientific/math libraries. Some cases are simply compatibility issues: recompiling with a different compiler or library may fix the problem (e.g., OpenMPI instead of MVAPICH2, or the Intel compiler instead of the GNU compiler).
Many different conditions can be described as a “system hang”, and such behavior has a variety of possible causes. Please refer to what to do when your system hangs.
No Linux Kernel Messages; System Reboots/Powers Off
If your system is rebooting or powering off with no warning, Linux will not be able to log the cause. You should verify that both your power and cooling are sufficient. The room should be roughly 74°F (about 23°C); systems that overheat will automatically power themselves off.
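Since nothing is logged at the moment of an unexpected power-off, one option is to check temperatures while the system is still running under load. The sketch below reads the kernel's thermal zones under /sys/class/thermal, which report values in millidegrees Celsius. The function name thermal_report is hypothetical, and many servers expose no thermal zones there at all, in which case the function prints nothing and a tool such as sensors (from lm-sensors) or the system's BMC is a better source.

```shell
# Hypothetical helper: print the current temperature of each kernel thermal
# zone in whole degrees Celsius (sysfs reports millidegrees).
# The optional first argument overrides the sysfs directory.
thermal_report() {
    base="${1:-/sys/class/thermal}"
    for zone in "$base"/thermal_zone*; do
        [ -r "$zone/temp" ] || continue
        # Some zones return a read error; skip those rather than fail.
        milli=$(cat "$zone/temp" 2>/dev/null) || continue
        printf '%s (%s): %d C\n' \
            "${zone##*/}" \
            "$(cat "$zone/type" 2>/dev/null)" \
            "$((milli / 1000))"
    done
}
```

Running "thermal_report" periodically during a heavy job gives a rough picture of whether temperatures climb before the system shuts down.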
If power and cooling are reliable, then the most likely explanation is a hardware issue. Our support team can help you track down the issue.