GitLab in Massive Data Loss with No Working Backups



  • The much-derided GitLab, which made the astonishing move to leave the cloud for physical servers just a few short months ago, today had to admit that their new physical systems did not have working backups.

    The document concludes with the following:

    So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.

    The world doesn't contain enough faces and palms to even begin to offer a reaction to that sentence. Or, perhaps, to summarise: the startup candidly details its mistakes as follows:

    • LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
    • Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
    • SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
    • Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
    • The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
    • The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
    • Our backups to S3 apparently don’t work either: the bucket is empty
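
    Notably, two of those failures (the missing data/PG_VERSION file and the few-bytes dump files) are exactly the kind of thing a trivial post-backup sanity check would have caught on the very first run. As a rough sketch only -- the paths, the size threshold, and the function names here are assumptions for illustration, not GitLab's actual setup -- such a check might look like:

```python
"""Illustrative backup sanity check. Paths, threshold, and names are
assumptions for the sketch, not GitLab's real configuration."""
import os
import subprocess

MIN_DUMP_BYTES = 1_048_576  # a dump of only a few bytes is a failed backup


def server_major_version(pgdata):
    """Read the cluster's on-disk version. If data/PG_VERSION is missing,
    omnibus reportedly falls back to the 9.2 binaries silently."""
    path = os.path.join(pgdata, "PG_VERSION")
    if not os.path.exists(path):
        raise RuntimeError(f"{path} missing -- pg_dump would default to 9.2")
    with open(path) as f:
        return f.read().strip()


def verify_dump(dump_path, expected_version):
    """Fail loudly if the dump is suspiciously small or the wrong binary ran."""
    size = os.path.getsize(dump_path)
    if size < MIN_DUMP_BYTES:
        raise RuntimeError(f"dump is only {size} bytes -- backup is broken")
    out = subprocess.run(["pg_dump", "--version"],
                         capture_output=True, text=True, check=True)
    if expected_version not in out.stdout:
        raise RuntimeError(f"version mismatch: {out.stdout.strip()}")
```

    Running something like this after every scheduled backup, and especially after the very first one, would have flagged both the empty dump files and the 9.2/9.6 binary mismatch immediately.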

    0_1485952731917_Screenshot from 2017-02-01 13-35-29.png



  • Likely no one working on the IT side of the world will be surprised that this happened after GitLab cockily snubbed the entire IT industry, claiming that their developers were better than the world's top IT minds and had no need for best practices. These people are clearly incompetent. How could a company whose primary value is backing up the storage that they host have misconfigured five different backup mechanisms and never noticed that literally nothing was being backed up? These aren't corrupt backups; this is "we didn't even take backups." No one set them up and checked the very first run to make sure that they worked. At all.



  • wow...



  • It took only something like six weeks for this "we know IT better than the IT industry" attitude to blow up in their faces. It's like they weren't even trying. I feel bad for them, sort of, but they really, really were asking for this. Hubris has no place in IT. We need to be humble, because this job is hard and dangerous, and following the guidance and best practices of those who have researched this stuff matters, a lot.



  • I'm wondering what Mr. Hubris Pablo Carranza is going to try to say about this after his statements in December. We all thought he was crazy then. How did GitLab's CEO not stop this before it ever left the "I have this crazy idea" stage?



  • OMG, did anyone check out Pablo's LinkedIn profile? GitLab, who hired him to head infrastructure, was literally his first IT job, ever (that he lists). He's had some pretty impressive programming experience, but that's a different field, as we well know. How he went from programming to "head of infrastructure" with literally no experience is no one's fault except the people who hired him. No wonder he doesn't know the basics; when would he have had time to learn them? Why would anyone bring in a head of infrastructure who is an L0 on his first job in the field? He started putting GitLab on the path to destruction more or less the moment he was hired, exactly as you'd expect from a newbie who has no idea what he doesn't know. Sounds like GitLab hired totally wrong and put someone in a position that he could not possibly or reasonably have been prepared to handle.



  • Surprisingly, one of Pablo's side projects is a KVM backup tool.



  • It is pretty funny that he doesn't have any experience for the Infrastructure Lead position he was hired for...



  • @DustinB3403 said in GitLab in Massive Data Loss with No Working Backups:

    It is pretty funny that he doesn't have any experience for the Infrastructure Lead position he was hired for...

    But, but.... he has a college degree!



  • This is worth discussing, as this is a key factor, we assume, in how GitLab got to where they were. Pablo, it would seem, is probably a great guy who was thrown in way, way over his head. Imagine if Pablo were to pop into SW and start posting ideas about how he was going to leave the cloud, eschew virtualization, go all physical, and use scripts that his team would write for backups rather than tried and true products.

    We'd say he was a crazy noob who isn't special, that he needs to stop reinventing the wheel, to use common and best practices and common sense, and to learn something about IT before doing nutty things in the belief that the good general decision making of IT doesn't apply to him like it does to everyone else. We'd not be surprised at all to learn that he had just taken not only a new job but his very first IT job, and was only a few months into it (even today, long after the switch to physical is complete, Pablo has been in IT for less than eleven months; he hasn't even finished his first year yet). We'd laugh about noobs and crazy ideas, talk about the kinds of people who show up on SW with fake titles given out to sound good (so that even first-day newbies tend to get senior management titles), and very strongly correct his crazy ideas.

    This is just a very public company with a lot to lose that let this happen. Here is PC's LI profile so that we can understand the background. Notice that his last few years were spent just managing change management, not even anything very technical. He's got what looks like a decent programming background, but literally nothing that would suggest he was ready for any kind of IT role with oversight. His previous roles don't appear to include anything IT related, nor anything that would obviously signal the kind of business experience necessary to understand how IT technical decisions fit into the bigger picture. What did GitLab look at to even consider him for this role? If I were interviewing him in IT, his resume says L1 helpdesk, or maybe entry-level system administrator, to "see if he would have the necessary operational mindset" before committing to keeping him on. Nothing in his experience points to an operational mindset, the top skill for infrastructure roles, so how did he get on someone's radar in the first place?

    My guess, and this is only a guess, is that there is an endemic issue with management at GitLab. Maybe someone sees IT as a joke and thinks programmers are just high-end IT? Maybe someone in management is just very foolish and doesn't think that they, as a hosting company, need good IT? Maybe they are just a normal, clueless SMB? Maybe a VC forced a nephew on them as part of a funding deal? Who knows, but a root cause analysis would be a good idea for GitLab to do. Yes, Pablo did some insanely reckless things that even newbies on SW would be mocked for mentioning they wanted to do, and he should be held accountable for that. But there is no reasonable way that Pablo even got an interview for such a position, or the power to do this kind of damage, without something being very, very wrong much higher up in the organization.

    0_1485954600653_Screenshot from 2017-02-01 13-57-25.png

    0_1485954613299_Screenshot from 2017-02-01 13-57-57.png

    0_1485954625918_Screenshot from 2017-02-01 13-58-13.png



  • Given the time and scope of a move like this, the fact that Pablo only began the job in February, the (very likely wrong) assumption that Pablo initiated the change rather than just speaking on its behalf, and the time until the switch over... it seems likely that the project of tearing the infrastructure apart in an epic rip-and-replace disaster must have started pretty much the moment he joined the company. A pretty bold move for a newbie in the first days of a new career. Imagine being hired fresh into automotive engineering at Ford with no experience and, on day one, announcing that wheels were legacy and that all new cars would use tank treads and skis in the front instead.



  • 0_1485955643971_Screenshot from 2017-02-01 14-26-01.png

    Too easy to make a joke here. Are they hiring because they clearly forgot to hire IT people? Or is it just funny that they are so adamant about hiring when, likely, they will be out of business in a few days?





  • @scottalanmiller said in GitLab in Massive Data Loss with No Working Backups:

    0_1485955643971_Screenshot from 2017-02-01 14-26-01.png

    Too easy to make a joke here. Are they hiring because they clearly forgot to hire IT people? Or is it just funny that they are so adamant about hiring when, likely, they will be out of business in a few days?

    So long as that contract is guaranteed pay... regardless of time worked / business viability.



  • Looks like recovery is on track at roughly 5% per hour.

    0_1485956894907_Screenshot from 2017-02-01 14-47-38.png



  • "That's one of the things we want to start doing or looking at".

    In reference to creating backups...



  • @DustinB3403 said in GitLab in Massive Data Loss with No Working Backups:

    "That's one of the things we want to start doing or looking at".

    In reference to creating backups...

    Is that a real quote? LOL



  • @scottalanmiller said in GitLab in Massive Data Loss with No Working Backups:

    @DustinB3403 said in GitLab in Massive Data Loss with No Working Backups:

    "That's one of the things we want to start doing or looking at".

    In reference to creating backups...

    Is that a real quote? LOL

    Yeah. not even joking.



  • "Could use time machine right?" ummm wtf...



  • If ever there was an RGE*, this would be it.

    *Resume Generating Event