We have a server application running on Ubuntu 22.04. The application logs are pretty spammy:
    journalctl -S -1hour | wc --lines
    121349
    journalctl -S -1hour | wc --bytes
    32382836
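To put those two counts in perspective, here is a rough back-of-the-envelope calculation. It assumes the last hour is representative of normal load, and the variable names are only for illustration.

```shell
#!/bin/sh
# Figures copied from the journalctl counts above (last hour of logs).
lines=121349
bytes=32382836

# Integer averages over a 3600-second window (shell arithmetic truncates).
lines_per_sec=$((lines / 3600))
bytes_per_sec=$((bytes / 3600))
avg_line_bytes=$((bytes / lines))

echo "lines per second: $lines_per_sec"
echo "bytes per second: $bytes_per_sec"
echo "average bytes per line: $avg_line_bytes"
```

That works out to roughly 33 lines and about 9 KB per second, at around 266 bytes per line.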
Once in a while a server will become completely unresponsive for several minutes. The CPU will go to 100% and our applications are so unresponsive that other agents stop reporting any metrics.
When the incident is over, we notice that our application didn't crash, but it does log a bunch of timeout errors because it wasn't able to do anything for the last few minutes. After that, it just continues working.
I found this in /var/log/syslog
    Feb 28 17:44:29 ip-10-11-0-205 kernel: [81118.252411] systemd[1]: Started ntp-systemd-netif.service.
    Feb 28 17:44:53 ip-10-11-0-205 kernel: [81142.367869] systemd[1]: ntp-systemd-netif.service: Deactivated successfully.
    Feb 28 17:45:13 ip-10-11-0-205 kernel: [81162.387816] systemd[1]: systemd-journald.service: State 'stop-watchdog' timed out. Killing.
    Feb 28 17:45:14 ip-10-11-0-205 kernel: [81162.731840] systemd[1]: systemd-journald.service: Killing process 117 (systemd-journal) with signal SIGKILL.
    Feb 28 17:45:17 ip-10-11-0-205 kernel: [81165.657264] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=9/KILL
    Feb 28 17:45:17 ip-10-11-0-205 kernel: [81165.696972] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
    Feb 28 17:45:18 ip-10-11-0-205 kernel: [81166.651435] systemd[1]: systemd-journald.service: Consumed 2min 5.531s CPU time.
    Feb 28 17:45:25 ip-10-11-0-205 kernel: [81174.079307] systemd[1]: systemd-journald.service: Scheduled restart job, restart counter is at 1.
    Feb 28 17:45:25 ip-10-11-0-205 kernel: [81174.085466] systemd[1]: Stopped Journal Service.
    Feb 28 17:45:25 ip-10-11-0-205 kernel: [81174.200482] systemd[1]: systemd-journald.service: Consumed 2min 5.531s CPU time.
    Feb 28 17:45:29 ip-10-11-0-205 kernel: [81177.822409] systemd[1]: Starting Journal Service...
    Feb 28 17:45:35 ip-10-11-0-205 kernel: [81183.821550] systemd-journald[240259]: File /var/log/journal/ec212477ed3f3049adade2e820950984/system.journal corrupted or uncleanly shut down, renaming and replacing.
    Feb 28 17:45:38 ip-10-11-0-205 kernel: [81187.226010] systemd[1]: Started Journal Service.
So it sounds like systemd-journald is causing the issue.
Questions:
- Is this just taking a long time to process my logs?
- Why does it need to hog the whole host?
- If so, are there settings I can change so this doesn't go crazy (e.g. trimming logs more frequently)?
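To make the last question concrete, this is the kind of change I have in mind. It is an untested sketch: the file name is hypothetical and the values are guesses, not tuned recommendations. `SystemMaxUse`, `RateLimitIntervalSec` and `RateLimitBurst` are journald.conf options; the burst value here is simply lower than the default of 10000 messages per 30s.

```shell
#!/bin/sh
# Write a candidate journald drop-in to a temporary file for review.
# The file name and all values are illustrative assumptions.
dropin="${TMPDIR:-/tmp}/journald-limits.conf"

cat > "$dropin" <<'EOF'
[Journal]
# Cap the total disk space used by persistent journal files.
SystemMaxUse=1G
# Per-service rate limit: at most RateLimitBurst messages per interval.
RateLimitIntervalSec=30s
RateLimitBurst=5000
EOF

# Print the candidate file so it can be inspected before use.
cat "$dropin"
```

If this is the right direction, I would copy the file into /etc/systemd/journald.conf.d/ and restart systemd-journald, but I don't know whether limits like these would actually prevent the stalls described above.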