Burned by Eschewing Best Practices


  • Service Provider

    It happens all of the time, but in the heat of the moment people generally don't want to talk about how they felt a best practice didn't apply to them, skipped it, and now have a disaster. It's common to ignore discussions around improvement using the excuse that things need to be fixed now. Of course they do, but we need to learn from mistakes as well or we will just keep repeating the pattern.

    So I wanted to start collecting examples of where this happens, to show how things that sound good can go very wrong when established good practices are ignored.


  • Service Provider

    We have no idea who made the bad decisions here, likely not the person now facing the problems, but here is the thread: http://community.spiceworks.com/topic/1184421-offline-blade-physical-servers-to-esxi

    Things that were done wrong that all sounded great but led to a hosed environment:

    • Blades instead of normal servers
    • Treated "single chassis" as redundant or magic, ignoring that it was a very simple single point of failure
    • Did not virtualize
    • No backups

    The first two issues left them with no working Active Directory because all AD DCs were in the same blade chassis that failed. There is no easy recovery method offline because they didn't virtualize.



  • The "IT Director" at a small cable company I worked for in Florida told me he was trying to "move away" from virtualization. He was complaining because they were "getting slow."


  • Service Provider

    @johnhooks said:

    The "IT Director" at a small cable company I worked for in Florida told me he was trying to "move away" from virtualization. He was complaining because they were "getting slow."

    I bet he failed to show any ROI on that move!



  • @scottalanmiller said:

    @johnhooks said:

    The "IT Director" at a small cable company I worked for in Florida told me he was trying to "move away" from virtualization. He was complaining because they were "getting slow."

    I bet he failed to show any ROI on that move!

    Ha, he's a disaster. This was the only business in the world with guaranteed income and they still screwed it up. They had a contract with about 6 small HOAs in the area. They were the only cable company allowed to provide cable, internet, and phone (unless it was something like Dish). They bragged about their FTTH setup, except they had horribly antiquated equipment that would break if you looked at it sideways.

    They were upgrading their Minerva on-demand system. Instead of migrating all of the data, they (he) just wiped it clean and installed the new system. So at 2:00 in the afternoon the VOD system went down, all of the movies were gone, and if you were in the middle of a movie, too bad. So now the VOD content had to be downloaded again from their providers, BUT all of the old content that wouldn't be released again was gone. So all of the Game of Thrones episodes that had been released and weren't being released again were gone.

    They also literally ran their company off of an Access "database" that was designed by some lady who must have thought Access was some fancy spreadsheet application. The database only met the first normal form, and that's because it's pretty much impossible to not meet it with a relational database. All of the notes for each address (they kept all of the information on houses as well since they did the installs and had some strange way of doing it) were kept in one memo field on each record. They printed out the actual form in Access as work orders and kept paper copies. So when that giant memo clob was corrupted (which happened a good bit for many records) the only data we had was a printed snapshot of that memo field.
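    For contrast, here is a minimal sketch of what keeping the notes as separate rows could look like, instead of one giant memo field per address. The table names, column names, and sample rows are all hypothetical, and it uses SQLite rather than Access purely so the example is runnable; it is not a description of how that company's system was actually laid out.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Hypothetical schema: one row per address, one row per note.
conn.executescript("""
CREATE TABLE address (
    id     INTEGER PRIMARY KEY,
    street TEXT NOT NULL
);
CREATE TABLE note (
    id         INTEGER PRIMARY KEY,
    address_id INTEGER NOT NULL REFERENCES address(id),
    created_at TEXT NOT NULL,
    body       TEXT NOT NULL
);
""")

# Made-up sample data for illustration only.
addr_id = conn.execute(
    "INSERT INTO address (street) VALUES (?)", ("123 Example St",)
).lastrowid
conn.executemany(
    "INSERT INTO note (address_id, created_at, body) VALUES (?, ?, ?)",
    [
        (addr_id, "2015-01-05", "Installed modem"),
        (addr_id, "2015-03-12", "Replaced set-top box"),
    ],
)
conn.commit()

# Each note is its own row, so damage to one note does not take the
# whole history for that address with it.
notes = conn.execute(
    "SELECT created_at, body FROM note WHERE address_id = ? ORDER BY created_at",
    (addr_id,),
).fetchall()
print(notes)
```

    The point of the design is simply that a corrupted note is one lost row rather than the entire record history for a house.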



  • @johnhooks said:

    @scottalanmiller said:

    @johnhooks said:

    The "IT Director" at a small cable company I worked for in Florida told me he was trying to "move away" from virtualization. He was complaining because they were "getting slow."

    I bet he failed to show any ROI on that move!

    Ha, he's a disaster. This was the only business in the world with guaranteed income and they still screwed it up. They had a contract with about 6 small HOAs in the area. They were the only cable company allowed to provide cable, internet, and phone (unless it was something like Dish). They bragged about their FTTH setup, except they had horribly antiquated equipment that would break if you looked at it sideways.

    They were upgrading their Minerva on-demand system. Instead of migrating all of the data, they (he) just wiped it clean and installed the new system. So at 2:00 in the afternoon the VOD system went down, all of the movies were gone, and if you were in the middle of a movie, too bad. So now the VOD content had to be downloaded again from their providers, BUT all of the old content that wouldn't be released again was gone. So all of the Game of Thrones episodes that had been released and weren't being released again were gone.

    They also literally ran their company off of an Access "database" that was designed by some lady who must have thought Access was some fancy spreadsheet application. The database only met the first normal form, and that's because it's pretty much impossible to not meet it with a relational database. All of the notes for each address (they kept all of the information on houses as well since they did the installs and had some strange way of doing it) were kept in one memo field on each record. They printed out the actual form in Access as work orders and kept paper copies. So when that giant memo clob was corrupted (which happened a good bit for many records) the only data we had was a printed snapshot of that memo field.

    ^^^ None of this is any surprise after you said "Florida". Down there, a stack of DVD backups in the microwave counted as "secure data disposal, in alignment with our contractual obligations and state laws". Not even joking. Didn't even have a dedicated "data disposal" microwave... just the one in the break room.



  • Oh, I have a good one. This ingrained the practice of asking one simple question, "What has changed recently?", and the practice of reevaluating the situation to avoid going "down the rabbit hole."

    This was many years ago. I was working as an on-site IT tech going to various homes & businesses throughout the day. I arrived on-site and fixed their issues. For good measure I went ahead and ran updates and cleaned out temp files. Lo and behold, the computer booted to a black screen with a blinking cursor. I did EVERYTHING, and I mean EVERYTHING, to attempt to repair the boot: boot sector repair, chkdsk /r, plus various other fixes. I was still green in some ways and did not stop and step back to reevaluate the situation. I spent 2-3 hours attempting to fix the boot on this lady's laptop with no success. Eventually I figured it out. The hard drive had been upgraded recently to a 320GB. Unfortunately, the laptop only had 28- or 32-bit LBA (forget which one is applicable) and the previous tech had installed this large drive without paying attention to this limitation. So once the Windows update wrote data above the 137GB mark, the drive was not bootable. I partitioned the drive into 120GB partitions and defragged and voila! It started booting again.


  • Service Provider

    That used to catch a lot of people.



  • Creating separate Local Administrator accounts for End Users

    So in this topic, a sole IT Administrator is stuck on creating or granting administrative access for his developers, because they occasionally need to restart a service, or run something with administrative privileges.

    Rather than taking the advice offered by the community, he's settled on creating separate accounts for everyone in the domain, that will have local administrative rights.

    Yet he fails to see why this is a very bad idea. (Even my own suggestion of granting a few users local administrative rights via GPO is something I would recommend against.) There are so many possible cases where abuse and damage will occur, and this administrator will not be able to do anything about it (at the local PC level).



  • @Brains said:

    The hard drive had been upgraded recently to a 320GB. Unfortunately, the laptop only had 28- or 32-bit LBA (forget which one is applicable) and the previous tech had installed this large drive without paying attention to this limitation. So once the Windows update wrote data above the 137GB mark, the drive was not bootable. I partitioned the drive into 120GB partitions and defragged and voila! It started booting again.

    Say what? So the machine was booting from a partition that was larger than 137 GB to begin with, and didn't have a problem until you had more than 137 GB of data on it? That's a new one on me.

    Nice catch.



  • @Dashrender Yeah, it's really crazy. It's because the HDD wrote data above the 137GB mark, which it could not address.


  • Service Provider

    @Brains said:

    @Dashrender Yeah, it's really crazy. It's because the HDD wrote data above the 137GB mark, which it could not address.

    Basically the filesystem went out into no-man's land and corrupted.
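    For anyone wondering where the 137GB figure comes from, it lines up with the 28-bit LBA limit, not a 32-bit one. A quick sketch of the arithmetic, assuming the classic 512-byte sector size (the function names are just made up for this example):

```python
SECTOR_BYTES = 512  # assumption: classic 512-byte sectors


def lba_limit_bytes(address_bits, sector_bytes=SECTOR_BYTES):
    """Largest capacity reachable with the given number of LBA address bits."""
    return (2 ** address_bits) * sector_bytes


def is_addressable(byte_offset, address_bits=28):
    """True if a byte offset on the disk falls below the LBA limit."""
    return byte_offset < lba_limit_bytes(address_bits)


limit_28 = lba_limit_bytes(28)
limit_32 = lba_limit_bytes(32)

print(limit_28)            # 137438953472 bytes, the "137GB mark"
print(limit_32)            # 2199023255552 bytes, about 2.2 TB
print(is_addressable(120 * 10**9))  # True: inside a 120GB partition
print(is_addressable(200 * 10**9))  # False: past the 28-bit limit
```

    A 32-bit limit would be around 2.2 TB, so a 320GB drive would not have hit it. Everything below the 28-bit limit stays reachable, which fits the machine behaving until something was written past that point.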


  • Service Provider

    Another example. Minor, but again, best practices make life easy.

    Beyond not understanding how some basics work (like thinking that the iLO needed an OS to work, which would defeat its purpose; and getting the name of the hardware wrong) he decided that he didn't want to virtualize and is trying to get modern Windows Server 2012 R2 onto a rather old HP ProLiant DL360 G5. The hardware is perfectly viable but far older than HP is going to support for a modern OS with drivers. Had he virtualized, as would have been sensible, he would have never even known that there was a driver issue. His entire issue exists only because he avoided a simple best practice and then didn't think of switching to the best practice once he hit the roadblock caused by avoiding it.



  • Here is a future case of Burned by Eschewing Best Practices in the making, Spiceworks Link

    So far, from the article, the OP is looking for help with a renovated virtualization project, yet already has an IVPD on XenServer and wants to build a new IVPD on Hyper-V.

    And "budget is not a concern" (which really should be an article all on its own!)



  • Not sure why? Can you summarise the IVPD article to explain why his solution is likely to see him burned? It looks like a waste of money, but if money is no object then I don't see why it's not a valid solution offering very high levels of uptime and reliability.



  • @Carnival-Boy said:

    Not sure why? Can you summarise the IVPD article to explain why his solution is likely to see him burned? It looks like a waste of money, but if money is no object then I don't see why it's not a valid solution offering very high levels of uptime and reliability.

    lol, while it can provide very high levels of uptime and reliability, a two-server, software-synced solution should provide even higher levels of both.



  • IVPD is generally a 3-2-1 system but it can scale as well.

    Just because the scale increases doesn't change the design of the system, which is dependent on everything working to provide reliability.

    If a SAN, switch, or server fails, or any combination of problems occurs, the environment will be gimped: effectively crippled to the point where recoverability options will be extremely limited until all services are restored.

    Adding more equipment adds complexity, and with more complexity, more experience is required to support and maintain the systems. When something goes wrong, if an expert is not on-site (or immediately available), downtime and the cost of said downtime can skyrocket.

    A much simpler approach (at least for what was "seen" in this topic) is a smaller footprint, with equipment that is capable of running independently of the other systems, so that everything can run on either system.



  • @Carnival-Boy said:

    Not sure why? Can you summarise the IVPD article to explain why his solution is likely to see him burned? It looks like a waste of money, but if money is no object then I don't see why it's not a valid solution offering very high levels of uptime and reliability.

    IVPD - Inverted Pyramid of Doom. Sure, you have two servers, but they rely on a single storage subsystem. This is great and makes some things simpler in the long run... until you realize that hardware SANs are no more resilient than traditional servers. In fact, they are traditional servers with some storage appliance software installed on top, then sold for a massive profit. If you have dual SANs that makes it easier, but you also have to look at the storage network in and of itself: do you also have dual switches, dual connections to both hosts, etc.? This is a massive overhead in both budget and specialized skills.

    In reality, all of that work and money really doesn't make you any more reliable than having two physical hosts that each have enough capacity to host the other's VMs when something goes wrong. The two-host setup is even more reliable in most cases. To get even more reliable, you could also forgo VM-level replication and do everything at the application level.



  • Two SANs offer a high degree of redundancy. I'm not sure where 3-2-1 fits in with that? He doesn't have a SPOF, does he? He has redundant switches, redundant controllers, redundant SANs.

    Sure there is complexity in there. But there's complexity with DAGs and file syncing as well.



  • @Carnival-Boy said:

    Two SANs offer a high degree of redundancy. I'm not sure where 3-2-1 fits in with that? He doesn't have a SPOF, does he? He has redundant switches, redundant controllers, redundant SANs.

    Sure there is complexity in there. But there's complexity with DAGs and file syncing as well.

    Sure they do, but at what cost? Is the cost worth it when you could get the same functionality for tens of thousands of dollars less?

    I've run into the "dual controllers are totally going to save the world" salesperson before, and one of my former managers bought into it. Turns out the controllers share the same backplane and interaction with the drives. So when one controller locked up due to a firmware issue, the other controller wouldn't take over. Everything went offline for ~8 hours until they overnighted us a new controller and we updated the firmware on both of them.



  • Yeah, but cost isn't an issue as money is no object.



  • @Carnival-Boy said:

    Yeah, but cost isn't an issue as money is no object.

    Any business that says that is just trying to fail! The whole point of a for-profit company is to make money, and they should be doing so with smart spending.



  • I'm not disagreeing, but if the OP says money is no object then you should treat that as fact. Maybe he has a magic money tree. Or is forced to spend a certain budget regardless of whether he needs it or not. Who knows, that's not the point. The point I'm trying to understand is why dual SANs and dual switches equals this pyramid of doom thing.



  • It simply creates a larger pyramid, with more parts, which makes the entire system way more complex to troubleshoot and fix should something happen.

    It doesn't force the system to be less reliable when compared to the standard 3-2-1 model, as you are in fact creating a level of redundancy by implementing a 2nd SAN to back up the first.

    But it's just wasteful in most cases.



  • @DustinB3403 said:

    It simply creates a larger pyramid, with more parts, which makes the entire system way more complex to troubleshoot and fix should something happen.

    It doesn't force the system to be less reliable when compared to the standard 3-2-1 model, as you are in fact creating a level of redundancy by implementing a 2nd SAN to back up the first.

    But it's just wasteful in most cases.

    Basically this. You aren't any more reliable than the dual host scenario and you've introduced several more layers of potential failure to your system.

    There is a point where this makes sense... but not at 6 servers and two physical hosts. I'm not sure where the tipping point is but probably at the hundreds of virtual servers mark.



  • @coliver said:

    You aren't any more reliable than the dual host scenario and you've introduced several more layers of potential failure to your system.

    You are actually mathematically substantially less reliable with that setup, at least from a hardware failure perspective.
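    A rough sketch of the math behind that, using the usual textbook model: layers that everything depends on multiply together (series), while a redundant pair is only down when both members are down (parallel). Every availability figure below is a made-up assumption for illustration, not a measurement, and the model assumes failures are independent.

```python
from math import prod


def parallel(*avail):
    """Redundant set: down only when every member is down at once."""
    return 1 - prod(1 - a for a in avail)


def series(*avail):
    """Dependency chain: up only when every layer is up."""
    return prod(avail)


# Hypothetical per-device availabilities, chosen only to show the shape.
HOST = 0.99
SWITCH = 0.999
SAN = 0.99

# Two hosts with local storage, each able to run everything.
two_hosts = parallel(HOST, HOST)

# Classic 3-2-1: two hosts, two switches, one SAN underneath it all.
ipod_321 = series(parallel(HOST, HOST), parallel(SWITCH, SWITCH), SAN)

# Same stack with the SAN doubled up.
dual_san = series(
    parallel(HOST, HOST), parallel(SWITCH, SWITCH), parallel(SAN, SAN)
)

for name, value in [
    ("two hosts", two_hosts),
    ("3-2-1", ipod_321),
    ("dual SAN", dual_san),
]:
    print(f"{name}: {value:.6f}")
```

    Under these assumptions the stacked designs cannot beat the two-host figure, because the same host pair gets multiplied by extra layers that are each at most 1. How big the gap is depends entirely on the numbers plugged in, and the model only covers hardware; it says nothing about software failure modes in either design.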



  • You got any facts to back that up? I find it extremely difficult to evaluate reliability. Anyway, you can't just judge it from a hardware failure perspective, since we're comparing hardware redundancy versus software redundancy (e.g. DAGs, file syncing). Both are complicated. Both require expertise to administer and both are risky.


  • Service Provider

    @Carnival-Boy said:

    I'm not disagreeing, but if the OP says money is no object then you should treat that as fact.

    I don't agree. Knowing someone is wrong, confused or doesn't understand something is exactly when they need help most, not the least. Tons and tons of what we do in IT is recognizing when people don't know what they need to know and helping them. In a case like this where we know they have to be wrong and don't understand what they are doing, should we really help them hurt themselves?

    I totally get that this goes against my "always give people the benefit of the doubt" theory about never hurting the innocent to protect the guilty, but this is a case where money is never no object. It's simply not true, and it means someone desperately needs help and doesn't understand that they don't know.


  • Service Provider

    @Carnival-Boy said:

    Or is forced to spend a certain budget regardless of whether he needs it or not. Who knows, that's not the point.

    That would actually make money the ONLY object. Budget would be the whole concern, not just part of it.



  • Well, ok, but the OP isn't actually on ML so it's a moot point. What I'm really interested in is what the problem is with his solution (ignoring the financial cost) and why it is one of your inverted pyramid thingies. I'm not arguing, I just don't understand and want to learn.

