
Yesterday I was asked by a friend who is currently working on an autonomous driving platform, running ROS on Ubuntu Linux, how to understand what caused the system to hang: she told me that the computer controlling the platform suddenly had a stroke (which luckily had no adverse consequences, but could have been way worse) and that she seriously wanted to find out what happened.

Unfortunately, there is no simple answer: while definitely a great feat of engineering, the Linux kernel is not infallible. Being so close to the hardware means having a complex architecture exposed to all sorts of dangers: the kernel is responsible for an incredible number of critical tasks that could fail and create all sorts of problems. Moreover, there is a myriad of problems that could be caused by misbehaving hardware devices.

Presented here is a synthesis of the most useful log files, and of the many ways the provided tools can be exploited to really understand what kind of bad trip the system was having while trying to kill you.

The log files

Logs are kept in /var/log; here are some notable examples:

Name      Description
boot.log  A log that records everything that happened during the boot sequence
dmesg     A log file used by the eponymous tool
kern.log  A specific log for kernel events
syslog    A global system log that records virtually everything

Let’s try and analyze all of them.

boot.log

The several rotated copies of boot.log kept here are an exact record of the boot console messages, plus a timestamp marking each boot. They're useful for determining the root cause of a boot problem, be it a delay or a hang before reaching the multi-user target (in case we're running systemd) or the login prompt.

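If the machine has been rebooted since the incident, the evidence will be in one of the rotated copies. Here's a minimal sketch for pulling out the interesting lines, assuming Ubuntu's default rotation scheme (boot.log, boot.log.1 and so on):

$ ls /var/log/boot.log*                                # one file per rotation
$ sudo grep -E 'FAILED|TIME|DEPEND' /var/log/boot.log.1  # units that failed, timed out or had dependency problems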

Here’s an excerpt from my boot.log.1:

$ sudo head -n 20 boot.log.1
[sudo] password for quartz: 
------------ Wed Nov 11 15:30:34 CET 2020 ------------
[  OK  ] Started Show Plymouth Boot Screen.
[  OK  ] Started Forward Password Requests to Plymouth Directory Watch.
[  OK  ] Reached target Local Encrypted Volumes.
[  OK  ] Found device TOSHIBA_THNSNH128GCST 1.
         Starting File System Check on /dev/disk/by-uuid/DFB7-381A...
[  OK  ] Listening on Load/Save RF Kill Switch Status /dev/rfkill Watch.
         Starting Load/Save RF Kill Switch Status...
[  OK  ] Started File System Check Daemon to report status.
[  OK  ] Finished File System Check on /dev/disk/by-uuid/DFB7-381A.
         Mounting /boot/efi...
[  OK  ] Mounted /boot/efi.
[  OK  ] Reached target Local File Systems.
         Starting Load AppArmor profiles...
         Starting Set console font and keymap...
         Starting Tell Plymouth To Write Out Runtime Data...
         Starting QEMU KVM preparation - module, ksm, hugepages...
         Starting Create Volatile Files and Directories...

It should be fairly readable, albeit a little hard to search through. It shouldn’t be very long though, so you should be fine with a quick glance.
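On systemd machines the journal records the same story with more context, and it can reach back to previous boots; a quick sketch, assuming persistent journaling is enabled (the default on recent Ubuntu releases):

$ journalctl --list-boots    # one entry per recorded boot
$ journalctl -b -1 -p err    # messages of priority err and worse from the previous boot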

dmesg

dmesg prints the kernel ring buffer or, in layman's terms, a memory region in which the kernel keeps track of its activity: since the very beginning of the boot sequence, just out of the bootloader, it records and timestamps everything the kernel and its affiliates do.

The message format

First of all, let’s take a look at the structure of the messages:

$ dmesg | head -n 10
[    0.000000] microcode: microcode updated early to revision 0x2f, date = 2019-02-17
[    0.000000] Linux version 5.4.74-rt41 (quartz@melchior) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #1 SMP PREEMPT_RT Tue Nov 3 21:55:23 CET 2020
[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-5.4.74-rt41 root=UUID=567d8513-a279-4f63-b486-af9d2897a3be ro quiet splash vt.handoff=7
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000]   Hygon HygonGenuine
[    0.000000]   Centaur CentaurHauls
[    0.000000]   zhaoxin   Shanghai  
[    0.000000] Disabled fast string operations
$ dmesg | tail -n 10
[17885.157559] ata1.00: ACPI cmd ef/10:03:00:00:00:a0 (SET FEATURES) filtered out
[17885.158069] ata1.00: configured for UDMA/100
[17885.255281] usb 1-1.3: reset full-speed USB device number 3 using ehci-pci
[17885.701050] psmouse serio1: synaptics: queried max coordinates: x [..5472], y [..4448]
[17885.891382] OOM killer enabled.
[17885.891424] ACPI: \_SB_.PCI0.LPC_.EC__.BAT1: docking
[17885.900415] ACPI: \_SB_.PCI0.LPC_.EC__.BAT1: Unable to dock!
[17885.891392] Restarting tasks ... done.
[17886.607287] PM: suspend exit
[17886.858980] e1000e: enp0s25 NIC Link is Down

As you can see, the messages are timestamped, but by default dmesg prints the timestamps as seconds since boot. By leveraging some of the flags, we can print them out however we want:

$ dmesg -T | tail -n 1
[mer nov 11 20:52:06 2020] IPv6: ADDRCONF(NETDEV_CHANGE): wlp3s0: link becomes ready

By using the -T flag, for example, we can ask for a human-readable time/date instead of the incremental timestamp. Note that the date might not be accurate if the system has been suspended, since the clock behind these timestamps doesn't account for time spent asleep, so keep that in mind when looking at dmesg output on portable computers or home workstations.
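dmesg also understands a few other time formats via --time-format (see dmesg(1)); the ISO format, for instance, is handy when you need timestamps other tools can parse, though the same suspend-related caveat applies:

$ dmesg --time-format=iso | tail -n 1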

Filtering output by type

dmesg is a treasure trove of information, but the output might look a bit intimidating at first, and it's a pain to search through: you might be looking for a specific kind of message but be unable to grep for it, because there are no recurrent keywords. Once again, dmesg comes to the rescue: the -f, --facility and -l, --level flags, with one of the following parameters (from dmesg --help), might help you look for specific things:

$ dmesg --help | tail -n 21
Supported log facilities:
    kern - kernel messages
    user - random user-level messages
    mail - mail system
  daemon - system daemons
    auth - security/authorization messages
  syslog - messages generated internally by syslogd
     lpr - line printer subsystem
    news - network news subsystem

Supported log levels (priorities):
   emerg - system is unusable
   alert - action must be taken immediately
    crit - critical conditions
     err - error conditions
    warn - warning conditions
  notice - normal but significant condition
    info - informational
   debug - debug-level messages

For more details see dmesg(1).

For example, we might display the latest available error message this way:

$ dmesg -l err | tail -n 1
[17885.900415] ACPI: \_SB_.PCI0.LPC_.EC__.BAT1: Unable to dock!

…and this tells you that I’ve undocked my ThinkPad 😀.
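Both flags also accept comma-separated lists, so facilities and levels can be mixed and matched; a couple of sketches:

$ dmesg -l err,warn        # errors and warnings together
$ dmesg -f daemon -l err   # error messages from system daemons only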

Other combinations

dmesg is an extremely powerful and flexible tool: please take a cursory look at its --help flag and manual page, as I'm sure you'll be able to find a way to do whatever you need.
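For instance, two combinations I find myself reaching for often (both straight from dmesg(1)):

$ dmesg -H   # human-friendly output: pager, colors, relative timestamps
$ dmesg -w   # follow mode: wait for new messages, like tail -f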

kern.log

While dmesg handles kernel logs dynamically, it's mostly useful for analyzing the current session. For past kernel messages it might be worth looking at kern.log, whose output format is also much more civilized than dmesg's default. Here's an excerpt from kern.log:

$ tail -n 5 kern.log
Nov 11 20:55:43 melchior kernel: [19306.237788] 002: wlp3s0: associated
Nov 11 20:55:43 melchior kernel: [19306.389998] 003: IPv6: ADDRCONF(NETDEV_CHANGE): wlp3s0: link becomes ready
Nov 11 23:36:00 melchior kernel: [28922.398944] 000: audit: type=1326 audit(1605134160.384:35): auid=1000 uid=1000 gid=1000 ses=3 subj==snap.snap-store.ubuntu-software (enforce) pid=1864 comm="snap-store" exe="/snap/snap-store/481/usr/bin/snap-store" sig=0 arch=c000003e syscall=93 compat=0 ip=0x7f06966c6527 code=0x50000
Nov 12 00:00:44 melchior kernel: [30406.721556] 000: audit: type=1400 audit(1605135644.648:36): apparmor="DENIED" operation="capable" profile="/usr/sbin/cups-browsed" pid=14896 comm="cups-browsed" capability=23  capname="sys_nice"
Nov 12 00:04:33 melchior kernel: [30635.426733] 002: perf: interrupt took too long (6497 > 6461), lowering kernel.perf_event_max_sample_rate to 30750

As you can see, the messages are quite similar to those found in dmesg: in my opinion it's still worth taking a good look at kern.log after searching through dmesg; it can't hurt to try, and you might find some more detailed information.
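Older sessions end up in the rotated copies (kern.log.1, then gzipped into kern.log.2.gz and so on), and zgrep reads the compressed ones directly. Here's a sketch of a search for hang-related keywords, assuming the default logrotate setup:

$ zgrep -iE 'oom|panic|hung task' /var/log/kern.log*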

syslog

This file contains a different kind of log, mostly userland-related, that might shed some light on misbehaving applications. System control applications like to spam it with mostly routine messages: you can search it for userland events, but be careful, there's a lot of noise in there. Here's an example:

$ tail -n 5 syslog
Nov 12 02:17:54 melchior whoopsie[1098]: [02:17:54] The default IPv4 route is: /org/freedesktop/NetworkManager/ActiveConnection/2
Nov 12 02:17:54 melchior whoopsie[1098]: [02:17:54] Not a paid data plan: /org/freedesktop/NetworkManager/ActiveConnection/2
Nov 12 02:17:54 melchior whoopsie[1098]: [02:17:54] Found usable connection: /org/freedesktop/NetworkManager/ActiveConnection/2
Nov 12 02:17:56 melchior whoopsie[1098]: [02:17:56] online
Nov 12 02:18:05 melchior systemd[1]: NetworkManager-dispatcher.service: Succeeded.
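Since every line carries the name of the process that produced it, filtering by process is a quick way to cut through the noise; for example, to silence whoopsie (the chatty one in the excerpt above) or to keep only what PID 1 has to say:

$ grep -v 'whoopsie' /var/log/syslog | tail -n 5
$ grep 'systemd\[1\]:' /var/log/syslog | tail -n 5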

Wrapping up

As always, the Linux operating system offers advanced facilities for users, administrators and programmers that allow for fine-grained tuning and control of the system. Looking at logs won't stop the system from hanging in the first place, but it's a very useful way to debug problematic situations.

Sources

https://www.plesk.com/blog/featured/linux-logs-explained/ makes a great attempt at explaining this same, very interesting topic.