I have been fighting a new server build where the CPU locks up after about a day of running. I thought it was a bad install, so I reinstalled Ubuntu 22.04 LTS, but I still have the same issue. This could be a ZFS issue, but I wanted to ask here first because it could also be an interaction between Ubuntu or the kernel and my CPU. The only way out is a system reset, and the logs don't seem to have much information either.
Roughly every 24 hours (give or take a few hours), I lose my SSH session to the server and it is no longer reachable over the network. Below is an example of what I see on the local console.
    NMI watchdog: Watchdog detected hard lockup on cpu 15
    rcu: rcu_sched kthread timer wakeup didn't happen for 805965 jiffies! g9450769 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x200
    rcu: #Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior
    rcu: INFO: rcusched detected expedited stalls on CPUs/task: { 2-... 14-...} 554795 jiffies s: 1301 root: 0x4204
    watchdog: BUG: soft lockup - CPU#2 stuck for 2552s! [kworker/2:59748]
    watchdog: BUG: soft lockup - CPU#14 stuck for 2917s! [systemd:1]
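To make the console output easier to compare between lockups, here is a rough Python sketch that pulls the CPU numbers and stall durations out of the messages above. The function names are my own, and the jiffies conversion assumes a kernel tick rate of 250 Hz, which I have not confirmed on this machine (it should be listed as CONFIG_HZ in the kernel config under /boot).

```python
import re

# Console lines copied from the lockup shown above.
CONSOLE_LINES = [
    "NMI watchdog: Watchdog detected hard lockup on cpu 15",
    "watchdog: BUG: soft lockup - CPU#2 stuck for 2552s! [kworker/2:59748]",
    "watchdog: BUG: soft lockup - CPU#14 stuck for 2917s! [systemd:1]",
]

HARD_RE = re.compile(r"hard lockup on cpu (\d+)", re.IGNORECASE)
SOFT_RE = re.compile(r"soft lockup - CPU#(\d+) stuck for (\d+)s! \[(.+?)\]")


def summarize_lockups(lines):
    """Return (hard_cpus, soft), where soft maps cpu -> (seconds, task)."""
    hard_cpus = []
    soft = {}
    for line in lines:
        match = HARD_RE.search(line)
        if match:
            hard_cpus.append(int(match.group(1)))
            continue
        match = SOFT_RE.search(line)
        if match:
            cpu = int(match.group(1))
            secs = int(match.group(2))
            task = match.group(3)
            # Keep the longest stall reported for each CPU.
            if cpu not in soft or secs > soft[cpu][0]:
                soft[cpu] = (secs, task)
    return hard_cpus, soft


def jiffies_to_seconds(jiffies, hz=250):
    """Convert jiffies to seconds. ASSUMPTION: hz=250; verify CONFIG_HZ."""
    return jiffies / hz


if __name__ == "__main__":
    hard, soft = summarize_lockups(CONSOLE_LINES)
    print("hard lockup CPUs:", hard)
    for cpu, (secs, task) in sorted(soft.items()):
        print(f"CPU {cpu}: stuck {secs}s ({secs / 60:.1f} min) in {task}")
    print("rcu_sched wakeup overdue (s):", jiffies_to_seconds(805965))
```

If the same CPU numbers show up after every lockup, that would point more toward a specific core than toward ZFS, but that is only a guess on my part.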
Currently, all I have on the system beyond base Ubuntu 22.04 is a Zabbix agent, OpenSSH, and OpenZFS. The reason I suspect ZFS is that I am in the middle of resilvering a 16TB replacement drive into the pool, and that may be causing the problem.
Looking at the Zabbix data for CPU and memory right before the agent stops checking in, CPU utilization is at 5% and memory usage is at 30%. Swap is 8GB and is 100% free.
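Since Zabbix only polls periodically and the system logs are mostly empty after a reset, one thing I am considering is a small local heartbeat that writes load and memory usage to disk and forces each line out with fsync, so the last entry before the hang survives. This is only a sketch: the log path, the interval, and the function names are hypothetical, and it reads the standard /proc/loadavg and /proc/meminfo files.

```python
import os
import time


def mem_used_percent(meminfo_text):
    """Percent of RAM in use, computed from /proc/meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])  # values are in kB
    total = fields["MemTotal"]
    available = fields["MemAvailable"]
    return 100.0 * (total - available) / total


def heartbeat_line(loadavg_text, meminfo_text, now=None):
    """Build one timestamped log line from the two /proc snapshots."""
    now = time.time() if now is None else now
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    load1 = loadavg_text.split()[0]  # 1-minute load average
    used = mem_used_percent(meminfo_text)
    return f"{stamp} load1={load1} mem_used={used:.1f}%"


def append_durably(path, line):
    """Append one line and fsync so it reaches disk before a possible hang."""
    with open(path, "a") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())


def run(path="/var/tmp/heartbeat.log", interval=10):
    """Loop forever. The path and interval are hypothetical defaults."""
    while True:
        with open("/proc/loadavg") as f:
            load = f.read()
        with open("/proc/meminfo") as f:
            mem = f.read()
        append_durably(path, heartbeat_line(load, mem))
        time.sleep(interval)
```

Calling run() from a terminal or a simple service would leave a file whose last timestamp shows roughly when the machine stopped responding.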
So far I have tried the following:
- Reinstalled Ubuntu 22.04 LTS
- Reset the BIOS to defaults
- Reseated CPU
- Reseated RAM
- Checked System Logs
Hardware specs:
- CPU: Ryzen 7 1800X
- Memory: 32GB DDR4 non-ECC 2133MHz
- PCI cards: Quadro P4000 GPU, LSI HBA, Intel 4x 1Gb NIC
- Disks: 4x 16TB HDDs in a ZFS pool, 2 mirrored SSDs for boot, 1 SSD in the ZFS pool as L2ARC
- OS version: Ubuntu 22.04.4 LTS