NGINX Just Stop Working
-
I have NGINX running on an Ubuntu 20.04 instance. It runs nothing but NGINX and is hosted on a Hyper-V server. Everything is running with all updates applied.
The issue is that NGINX will randomly just stop routing requests. Websites and services go offline. The NGINX logs (/var/log/nginx) and the syslog don't show any errors, but when I check whether the NGINX service is running, it shows as stopped. Restarting the NGINX service doesn't always fix the issue, but rebooting the Ubuntu server works every time. No other change is required, just a reboot.
It's proxying for only a dozen sites and services and traffic is not that high. Looking at resource utilization doesn't indicate there are problems there.
I'm already running auto reboots every night but these random stops continue to happen (before someone asks, no, the issues are not correlated with the reboot schedule). Before I enable debug logging, I thought I'd reach out here to see if anyone else had experienced this before and how you might have fixed it. Should I be looking elsewhere for details on what might be causing this?
-
What do the logs say leading up to it stopping?
-
What does error.log say?
Are you running certbot?
-
Sorry, running out the door for a client. I'll grab the logs and post the contents this weekend.
I am running certbot for Let's Encrypt.
-
@NashBrydges Since nginx is running, this should return ok, but you might want to try a
nginx -t
-
@Obsolesce Yes, all packages are up to date.
-
Here is the only entry in the NGINX error log for the last time NGINX stopped.
2021/01/08 22:34:03 [error] 847#847: *195 access forbidden by rule, client: 195.154.63.222, server: plextrack.jpslconsulting.ca, request: "GET / HTTP/1.1", host: "plextrack.jpslconsulting.ca"
The Let's Encrypt log shows no activity immediately before the outage.
Syslog also shows no errors. It has entries from 3AM to 3:155AM and 9:59PM to 10:02PM on the day of the last incident; however, the outage occurred between 7:06PM and 10:00PM, so the only related entries in this log are from the time the outage was discovered and Ubuntu was restarted.
-
I also ran the NGINX test and all looks good.
-
certbot.timer failing?
https://stackoverflow.com/a/52967898
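Something along these lines should show whether the timer is firing and whether the renewal runs are failing. It's only a sketch, and it assumes the stock unit names from Ubuntu's certbot package (certbot.timer and certbot.service), which I haven't confirmed for your install.

```shell
# Assumed unit names (Ubuntu's certbot package defaults); adjust if yours differ.
TIMER=certbot.timer
SERVICE=certbot.service

if command -v systemctl >/dev/null 2>&1; then
    # When did the timer last fire, and when is it due next?
    systemctl list-timers --all "$TIMER" || true
    # Did the last renewal run exit cleanly?
    systemctl status "$SERVICE" --no-pager || true
    # What did the most recent renewal runs write to the journal?
    journalctl -u "$SERVICE" -n 50 --no-pager || true
else
    echo "systemctl not found; this host does not appear to use systemd"
fi
```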
-
@black3dynamite I'm not seeing any evidence of this failing in the letsencrypt.log file, syslog, or nginx logs (both access and error). Would those logs be elsewhere? Obviously I don't want to have to manually renew certs.
-
@NashBrydges letsencrypt.log is the only one I'm aware of. Actually, are you using systemd to renew your certs, or a cron job?
-
@black3dynamite systemd...
-
@NashBrydges said in NGINX Just Stop Working:
I also ran the NGINX test and all looks good.
If the config weren't good, it wouldn't even start up.
-
@NashBrydges said in NGINX Just Stop Working:
Here is the only entry in the NGINX error log for the last time NGINX stopped.
The error log is where it records HTTP errors, not Nginx software errors.
-
@scottalanmiller Well at this point I'm looking at any log that has "error" in the name. Lol
-
@NashBrydges said in NGINX Just Stop Working:
@scottalanmiller Well at this point I'm looking at any log that has "error" in the name. Lol
This should show you what there is for Nginx itself....
grep nginx /var/log/messages
-
@scottalanmiller said in NGINX Just Stop Working:
grep nginx /var/log/messages
/var/log/messages
Does not exist.
-
@NashBrydges said in NGINX Just Stop Working:
@scottalanmiller said in NGINX Just Stop Working:
grep nginx /var/log/messages
/var/log/messages
Does not exist.
Oh sorry, use Ubuntu's log. That's RHEL's.
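On Ubuntu the equivalent file should be /var/log/syslog, so roughly this. The 50-line cap is arbitrary, just to keep the output readable.

```shell
# Ubuntu's counterpart to RHEL's /var/log/messages.
LOG=/var/log/syslog

if [ -r "$LOG" ]; then
    # Show the most recent 50 lines that mention nginx.
    grep nginx "$LOG" | tail -n 50
else
    echo "$LOG is missing or not readable (try again with sudo)"
fi
```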
-
@scottalanmiller said in NGINX Just Stop Working:
@NashBrydges said in NGINX Just Stop Working:
Here is the only entry in the NGINX error log for the last time NGINX stopped.
The error log is where it records HTTP errors, not Nginx software errors.
Which is useful in a case I've seen, where the service had been started by other means and the log showed all addresses were already in use.
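If that's what is happening here, listing what holds the listening ports would confirm it. A rough sketch, assuming nginx is supposed to be on 80 and 443 (the thread doesn't actually say); filter_ports is just a helper name I made up.

```shell
# Assumption: nginx is meant to listen on 80 and 443.
PORTS='80|443'

# Keep only lines that contain ":PORT" followed by whitespace.
filter_ports() {
    grep -E ":($1)[[:space:]]"
}

if command -v ss >/dev/null 2>&1; then
    # -l listening sockets, -t TCP, -n numeric ports, -p owning process
    # (run as root to see processes owned by other users)
    ss -ltnp | filter_ports "$PORTS" || echo "nothing is listening on ports matching $PORTS"
fi
```

If the owning process turns out not to be the nginx started by systemd, that would explain why a service restart doesn't help but a reboot does.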
-
@scottalanmiller said in NGINX Just Stop Working:
@NashBrydges said in NGINX Just Stop Working:
@scottalanmiller said in NGINX Just Stop Working:
grep nginx /var/log/messages
/var/log/messages
Does not exist.
Oh sorry, use Ubuntu's log. That's RHEL's.
I thought the modern distros these days were
journalctl -f
or
journalctl | grep blah | less
-
@dafyre said in NGINX Just Stop Working:
I thought the modern distros these days were journalctl -f or journalctl|grep blah|less
Just another way to see the same thing.
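One thing journalctl does make easier is filtering by unit and by time instead of by text. A sketch, assuming the unit is called nginx.service and guessing at a window around the reported outage; both values are placeholders to adjust.

```shell
# Assumptions: unit name as shipped by Ubuntu's nginx package, and a
# time window guessed from the outage times mentioned in this thread.
UNIT=nginx.service
SINCE='2021-01-08 19:00'
UNTIL='2021-01-08 22:05'

if command -v journalctl >/dev/null 2>&1; then
    # Everything the unit logged in that window (start, stop, failures).
    journalctl -u "$UNIT" --since "$SINCE" --until "$UNTIL" --no-pager || true
    # Boots the journal knows about, to line up against the outage.
    journalctl --list-boots --no-pager || true
fi
```

This only reaches back past a reboot if the journal is being kept on disk.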
-
@scottalanmiller Yeah, those log messages are in the syslog file in Ubuntu. The only log entries in syslog are related to when I rebooted the server. The grep output for the nginx-related entries is too large for me to post here, but I just reviewed every line and found no errors.
-
@NashBrydges said in NGINX Just Stop Working:
@scottalanmiller Yeah, those log messages are in the syslog file in Ubuntu. The only log entries in syslog are related to when I rebooted the server. The grep output for the nginx-related entries is too large for me to post here, but I just reviewed every line and found no errors.
Not necessarily looking for errors, just information as to what might have happened.
-
@scottalanmiller The output is too large to upload here, so I uploaded a text file to Box and am sharing the link here. This is a simple txt file.
-
@NashBrydges said in NGINX Just Stop Working:
@scottalanmiller The output is too large to upload here, so I uploaded a text file to Box and am sharing the link here. This is a simple txt file.
Well, one major issue I see is that the hostname of the box appears to be nginx. That essentially makes the log unsearchable, because every line carries the hostname and you are trying to diagnose the nginx service. Start by giving the system a more appropriate name, and that alone will clear up the logs so that you can search for services.
Rule of thumb: hostnames should be meaningful and carry information that you can't put elsewhere, but should not overlap with things like service names.
Name systems based on things like purpose, not code. By our standard here, if nginx is a proxy, the system would be named something that ends in rproxy to denote its role. If it was an Nginx-based web server, it would be named lemp, and so forth.
That's why you have a huge log that you can't sort through instead of a tiny one that's super easy to sort through.
-
@NashBrydges are you sure that you are on the right box? The nginx process has literally no logs, at all. The service has never logged that it started, stopped, ran or anything else.
Also, LXD is running on this system. It appears that this is a hypervisor host rather than a web server.
-
You can still search the logs, you just have to account for the name.
So instead of a standard search...
grep nginx mylog
You have to put the name in twice, once for the hostname and once for the process...
grep "nginx nginx" mylog
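To show the difference, here is a small self-contained demo. The three log lines are made up for illustration (they are not from the uploaded file) and only mimic the syslog layout of timestamp, hostname, then process.

```shell
# Made-up sample lines in syslog layout: timestamp, hostname, process[pid]: message
log=$(mktemp)
cat > "$log" <<'EOF'
Jan  8 21:59:41 nginx systemd[1]: placeholder message from systemd
Jan  8 21:59:42 nginx snapd[612]: placeholder message from snapd
Jan  8 22:00:03 nginx nginx[847]: placeholder message from the nginx process
EOF

# A plain search matches every line, because the hostname is on all of them.
all=$(grep -c nginx "$log")
# Hostname followed by process name matches only the nginx process.
svc=$(grep -c "nginx nginx" "$log")

echo "plain search: $all lines, hostname + process: $svc lines"
rm -f "$log"
```

With this sample the plain search hits all three lines and the doubled search hits one.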
-
If there's no indication of anything in any of the logs, at this point it's likely a more efficient use of time to migrate to a new VM. Maybe debug logging could shed some light on it, like you said, provided it fails again while logging is on, or maybe it won't help at all. Who knows. But since the server can be rebooted all the time, it doesn't sound like a LoB service, so I'd just bring up a new VM and migrate as spare time allows rather than spend more time on troubleshooting. Before you swap the traffic over, you can run tests on the new VM to see whether the same issue exists there, too.
-
@NashBrydges said in NGINX Just Stop Working:
It runs nothing but NGINX and is hosted on a Hyper-V server.
The logs that you showed don't agree here. They show it running other things and not Nginx.