Tracking Down an Apache & WordPress Memory Leak



  • Okay, so I am on the trail of a memory leak in a totally unmaintained WordPress site that has, brace yourself, seventy-six plugins installed! The site was working, more or less, until recently. I was only brought in once the memory leak was in full effect. The people who got the site into this state had no clue what they were doing, and aren't available to give us any history.

    Updates were really not that bad, and everything is fully updated now. We moved servers from CentOS 6 to Ubuntu 19.04. I've removed the plugins that I was sure could be pulled out. I've replaced lesser-known caches with Breeze, which we use everywhere. I've added WordFence and got it scanning, but the site is so large that it is struggling to make it through. So far the scan is coming back clean, so it does not seem like it is going to be the source of any information.

    I've scoured the logs and we are essentially not getting any errors at this point. We were before, but I have fixed those. Now it is just something with the site. Basically, as requests come in, Apache spins up new processes and appears to never use the old ones again. Then, of course, it uses up a crazy amount of memory and crashes. It will either crash because it runs out of memory, or it runs out of processes and freezes. Which of the two happens varies predictably based on the Apache settings.

    The issue appears to be that a worker process does not "release". Once a request is using one, it appears that the process just hangs, sort of. And instead of reusing the process, Apache just makes a new one.

    I've been digging for hours. Hoping someone has some good guidance on where to start tracking this down.
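
    One way to watch the climb is to sample how many Apache processes exist and how much resident memory they hold in total. This is a minimal sketch, assuming the processes are named apache2 (the Ubuntu default) and a procps-style ps; the summarize_rss helper is a made-up name for illustration.

    ```shell
    # Read one RSS value (in KiB) per line on stdin and print the
    # process count and the combined resident memory in MiB.
    summarize_rss() {
        awk '{ total += $1; n++ } END { printf "%d processes, %.0f MiB\n", n, total / 1024 }'
    }

    # Usage (assumes the process name apache2):
    #   ps -C apache2 -o rss= | summarize_rss
    # Wrap it in `watch -n 5` or a sleep loop to see the trend over time.
    ```

    If the count only ever goes up between restarts, that matches the "never reused" behavior described above.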



  • /etc/apache2/mods-enabled/mpm_prefork.conf

    <IfModule mpm_prefork_module>
            StartServers               5
            MinSpareServers            5
            MaxSpareServers           10
            MaxRequestWorkers        500
            ServerLimit              500
            MaxConnectionsPerChild    10
    </IfModule>
    

    The MPM Prefork settings are terrible and really just set to give me time to track down the issues. The system is not very busy and does not need anywhere near 500 RequestWorkers. It should be fine with twenty or so. But since they never release, I need a lot just to get enough time to do something before it runs out. I keep lowering the MaxConnectionsPerChild to try to see if it will ever get low enough to be able to reuse existing RequestWorkers, but so far, it doesn't matter.

    This was all at defaults when the issue first appeared, so the problem cannot be in this configuration. These are just breathing-room settings; they are neither a cause nor a fix. No settings here will ultimately matter, because I can only take the MaxRequestWorkers setting so high before I run out of memory. So any number here is just a band-aid.
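
    The memory ceiling on MaxRequestWorkers is simple arithmetic: the memory you can spare for Apache divided by the resident size of one child. A rough sketch follows; the 8192 MiB and 80 MiB figures are made-up placeholders rather than measurements from this server, and max_workers is a hypothetical helper name.

    ```shell
    # max_workers AVAILABLE_MIB PER_CHILD_MIB
    # Integer estimate of how many prefork children fit in memory.
    max_workers() {
        echo $(( $1 / $2 ))
    }

    # Hypothetical numbers: 8192 MiB free for Apache, 80 MiB per child.
    max_workers 8192 80   # prints 102
    ```

    Whatever the real figures are, once every worker is stuck the limit is reached no matter how high it is set, which is why raising it only buys time.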



  • Anytime that I restart Apache, everything is just fine. A restart brings memory down to 1GB in use or thereabouts. The site is super responsive, and all is well. But the process count starts climbing again immediately, memory balloons, and in twenty minutes we are locked up again. Obviously just restarting Apache every ~20 minutes will keep things going, but that's pretty awful.



  • @scottalanmiller

    Not sure why it happens, but if you use memcache, that will reduce the load on the DB, PHP, and Apache, so that might help.

    Restore the Apache config to default. Is PHP running as an Apache module (mod_php)? If so, take it out; perhaps the leak is in PHP.



  • We got it. Had to open the nginx logs and noticed too many "posts" in the error log. Dug in and it was three IP ranges overseas, all hitting us with a "post timeout attack." It was a light DDoS where sessions were being opened and held, causing nginx to wait on a timeout. This caused the Apache process count to just increment forever. Once we blocked those ranges, the Apache process count started to drop for the first time, and memory started to release. And the continuous flood of nginx error logs ceased.

    If you are looking at nginx error logs, this is what you look for: upstream timed out (110: Connection timed out) while reading response header from upstream, client:

    You can use this command to collect the offending IP addresses:

    grep "upstream timed out" error.log | cut -d' ' -f20
    

    Then use your firewall to shut them down. We are all good now! Woot.
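
    The grep command above prints one address per matching line and relies on the IP being the twentieth space-separated field. To see which addresses dominate, it can help to count them. A minimal sketch, assuming the standard nginx error log layout where the address follows "client: " and ends at a comma; rank_timeout_clients is a made-up helper name, and the log path in the usage comment is an assumption.

    ```shell
    # Print each client IP seen on "upstream timed out" lines,
    # prefixed by its hit count, most frequent first.
    rank_timeout_clients() {
        grep "upstream timed out" "$1" \
            | sed -n 's/.*client: \([^,]*\),.*/\1/p' \
            | sort | uniq -c | sort -rn
    }

    # Usage (path is an assumption; adjust to your setup):
    #   rank_timeout_clients /var/log/nginx/error.log | head -20
    ```

    Addresses that cluster in the same few ranges at the top of that list are the ones to look at blocking.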