I know the Linux expertise in ML is strong and could use some guidance on a Debian WordPress VM issue. Here's some context on the VM that is having performance issues (almost daily now) during the same time window:
-
This is a Bitnami WordPress image with NGINX, deployed on an e2-micro instance in Google Compute Engine, with MariaDB as its database. The OS is Debian (kernel package version 5.10.205-2). The VM was deployed in the last month or two to move the site to a more current version of the OS. We saw all of the problems described below on the former instance of the site, before it was migrated. The only migration work done was a backup and restore of the images and the database to a new VM with the latest Bitnami WordPress image, and the problem has persisted even after that.
-
Most mornings (with few exceptions), at around 5 AM CST, disk IO spikes, which drives CPU IO wait up and sends swap usage through the roof for at least 1 to 2 hours. Queue length also climbs sharply. If you visit the site during this window, it eventually returns an NGINX 504 Gateway Timeout error. The only fix we have found is to log in to the VM via SSH and either reboot it or run the Bitnami service restart command (sudo /opt/bitnami/ctlscript.sh restart). Either one returns the VM to a working, responsive state until the same time window the next day. There seems to be no real difference between the reboot and the services restart in keeping the problem at bay, other than that a reboot might also prevent it the following day (but not always). All Bitnami WordPress services appear to be running during the problem window (nothing seems to be failing).
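A minimal sketch of how the symptoms above could be recorded over time, so the exact start of the spike can be lined up with other logs. It assumes a Linux /proc filesystem and Python 3 on the VM; the function names (parse_meminfo, parse_cpu_times, iowait_percent, log_samples) are illustrative only and are not part of any Bitnami tooling.

```python
import time


def parse_meminfo(text):
    """Return /proc/meminfo fields as a dict of ints (kB for most fields)."""
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def parse_cpu_times(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat as a list of ints."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            return [int(p) for p in parts[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before, after):
    """Share of elapsed CPU time spent in iowait between two samples."""
    # Field order: user nice system idle iowait irq softirq steal ...
    # Only the first eight are summed; guest time is already counted in user.
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total > 0 else 0.0


def _read(path):
    with open(path) as fh:
        return fh.read()


def log_samples(interval=60, count=180):
    """Print one line per interval with iowait, swap in use and available RAM."""
    prev = parse_cpu_times(_read("/proc/stat"))
    for _ in range(count):
        time.sleep(interval)
        cur = parse_cpu_times(_read("/proc/stat"))
        mem = parse_meminfo(_read("/proc/meminfo"))
        swap_used_mb = (mem["SwapTotal"] - mem["SwapFree"]) / 1024
        avail_mb = mem["MemAvailable"] / 1024
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} iowait={iowait_percent(prev, cur):.1f}% "
              f"swap_used={swap_used_mb:.0f}MB available={avail_mb:.0f}MB",
              flush=True)
        prev = cur
```

Calling log_samples(interval=60, count=180) shortly before 5 AM, with output redirected to a file, would cover a three-hour window at one line per minute. The interval and count are arbitrary starting values.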
-
If you run iotop during the problem window, the mariadb service shows up as the top culprit, as shown below, which seems to indicate that something is hitting the database really hard during this time. Outside of the problem window you don't see IO spikes: the mariadb service may jump to 30% IO for a second and then drop off the top of the list. During the problem window, mariadb and several php-fpm processes are reliably at the top.
-
I've also noticed during the problem window systemd-journal-flush.service shows loaded and failed when you run systemctl. That seems to make sense during a period of high IO and a lot of swap with high CPU IO wait, but I would love insight from others.
-
This site is a website for a podcast and hosts the main feed for the show. The WordPress database itself is tiny (like 10 MB) but has several hundred posts in it. The only other data really stored in WordPress would be small PNG files that get used for featured images. We have at least 3 GB free on the VM's disk at this point.
-
From a plugin standpoint in WordPress, we have everything disabled except Akismet, Jetpack, Blubrry PowerPress, and UpdraftPlus. We thought it might be UpdraftPlus, but every backup completes in 10 to 11 seconds with no errors, and the backups run at about 6:30 PM CST. We confirmed that by looking at the timestamps of files inside the VM's OS, and we even tried deactivating UpdraftPlus to see if the problem went away (it did not).
-
Outside of the problem window the VM has plenty of free memory when you run free -m and is using little or no swap. The site works great outside of the specific time window.
-
We looked at cron jobs, and nothing seems out of the ordinary. It feels like there is some kind of scheduled task for the database specifically that is causing the problem, but I do not know how to pinpoint it or what queries are being run against the database. I tried installing sar to get some details but apparently have too much rust on my Linux chops (which were minimal) since the days of building and administering Elastix PBXs. There do not seem to be any scheduled scripts, etc. from looking at wp-config.php either.
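One way the running queries might be captured during the problem window is to poll MariaDB's SHOW FULL PROCESSLIST and keep only the statements that have been executing for a while. The sketch below is a rough illustration under several assumptions: the client binary is called mysql (newer MariaDB releases also ship it as mariadb), credentials come from a ~/.my.cnf file, and the helper names are made up for this example.

```python
import subprocess


def parse_processlist(batch_output):
    """Turn tab-separated SHOW FULL PROCESSLIST output into a list of dicts."""
    lines = [ln for ln in batch_output.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("\t")
    return [dict(zip(header, ln.split("\t"))) for ln in lines[1:]]


def active_queries(rows, min_seconds=1):
    """Keep rows that have been running a statement for at least min_seconds."""
    busy = []
    for row in rows:
        info = row.get("Info", "NULL")      # batch mode prints SQL NULL as "NULL"
        seconds = row.get("Time", "0")
        if info != "NULL" and seconds.isdigit() and int(seconds) >= min_seconds:
            busy.append(row)
    # Longest-running statements first.
    return sorted(busy, key=lambda r: int(r["Time"]), reverse=True)


def fetch_processlist(client="mysql"):
    """Run the client in batch mode; assumes credentials are in ~/.my.cnf."""
    result = subprocess.run(
        [client, "--batch", "-e", "SHOW FULL PROCESSLIST"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def report(client="mysql"):
    """Print the currently running statements, slowest first."""
    for row in active_queries(parse_processlist(fetch_processlist(client))):
        print(f'{row["Time"]:>6}s  {row.get("db", "?")}  {row["Info"]}')
```

Running report() every 30 seconds or so around 5 AM should show which statements the php-fpm workers are waiting on. The polling query itself is filtered out because its Time value is 0.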
Has anyone here seen an issue like this? If so, does it make any sense why the database would be hit with so much IO during a specific time period like this? By the way, this site isn't getting a crazy amount of traffic either. I looked at Jetpack stats, and it gets anywhere from 5 to 20 or 30 visits in a day. Any guidance is greatly appreciated.
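Jetpack stats generally do not count crawlers or podcast feed fetches, so the NGINX access log may give a fuller picture of traffic in the problem window. Below is a rough sketch that counts requests per hour and per user agent. It assumes the log uses NGINX's default "combined" format and lives somewhere like /opt/bitnami/nginx/logs/access.log (that path is an assumption and should be checked on the VM); the function names are illustrative.

```python
import re
from collections import Counter
from datetime import datetime

# NGINX "combined" format:
# ip - user [time] "request" status bytes "referer" "user agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict for one log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    entry = m.groupdict()
    entry["time"] = datetime.strptime(entry["ts"], "%d/%b/%Y:%H:%M:%S %z")
    return entry


def summarize(lines):
    """Count requests per hour (in the log's own timezone) and per user agent."""
    per_hour = Counter()
    per_agent = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        per_hour[entry["time"].strftime("%Y-%m-%d %H:00")] += 1
        per_agent[entry["agent"]] += 1
    return per_hour, per_agent


def print_summary(path, top=10):
    """Print hourly request counts and the busiest user agents for one log file."""
    with open(path, errors="replace") as fh:
        per_hour, per_agent = summarize(fh)
    for hour, count in sorted(per_hour.items()):
        print(f"{hour}  {count}")
    for agent, count in per_agent.most_common(top):
        print(f"{count:>7}  {agent}")
```

If the VM logs in UTC, 5 AM CST corresponds to the 11:00 hour in the output. A large jump in that hour, or a single user agent dominating it, would point at external requests rather than a scheduled database task.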
I'll also add that looking at dashboards in Google Compute Engine confirms the time window of issue. The database process seems to show up as top usage of CPU and memory during the problem window.