Burned by Eschewing Best Practices
-
It happens all of the time, but in the heat of the moment people generally don't want to talk about how they felt a best practice didn't apply to them, skipped it, and now have a disaster on their hands. It's common to shut down any discussion of improvement with the excuse that things need to be fixed right now. Of course they do, but we need to learn from the mistakes as well or we will just keep repeating the pattern.
So I wanted to start collecting examples of where this happens, to show how decisions that sound good can go very wrong when established good practices are ignored.
-
We have no idea who made the bad decisions here (likely not the person now facing the problems), but here is the thread: http://community.spiceworks.com/topic/1184421-offline-blade-physical-servers-to-esxi
Things that were done wrong that all sounded great but led to a hosed environment:
- Blades instead of normal servers
- Treated the "single chassis" as redundant or magic, ignoring that it was a very simple single point of failure
- Did not virtualize
- No backups
The first two issues left them with no working Active Directory because every AD DC was inside the single blade chassis that failed, and because they didn't virtualize there is no easy offline recovery path.
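To put rough numbers on how little "redundant DCs" buy you when they all share one chassis, here is a minimal back-of-the-envelope sketch. The availability figures are assumed purely for illustration, not taken from the thread or any vendor:

```python
# Assumed, illustrative availability figures; the point is the structure, not the numbers.
chassis = 0.999   # availability of the single blade chassis (shared power, backplane, management)
dc = 0.995        # availability of any one DC blade or host

# Two DCs in the SAME chassis: the whole directory still depends on that one chassis.
same_chassis = chassis * (1 - (1 - dc) ** 2)

# Two DCs on INDEPENDENT hosts: AD survives if either host is up.
independent = 1 - (1 - chassis * dc) ** 2

print(f"Both DCs in one chassis:  {same_chassis:.5f}")   # capped by the chassis availability itself
print(f"DCs on independent hosts: {independent:.5f}")    # noticeably higher
```

The extra DC inside the chassis only protects against a blade dying; the chassis itself still caps the whole directory service.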
-
The "IT Director" at a small cable company I worked for in Florida told me he was trying to "move away" from virtualization. He was complaining because they were "getting slow."
-
@johnhooks said:
The "IT Director" at a small cable company I worked for in Florida told me he was trying to "move away" from virtualization. He was complaining because they were "getting slow."
I bet he failed to show any ROI on that move!
-
@scottalanmiller said:
@johnhooks said:
The "IT Director" at a small cable company I worked for in Florida told me he was trying to "move away" from virtualization. He was complaining because they were "getting slow."
I bet he failed to show any ROI on that move!
Ha, he's a disaster. This was the only business in the world with guaranteed income and they still screwed it up. They had a contract with about six small HOAs in the area. They were the only cable company allowed to provide cable, internet, and phone (unless it was something like Dish). They bragged about their FTTH setup, except they had horribly antiquated equipment that would break if you looked at it sideways.
They were upgrading their Minerva on-demand system. Instead of migrating all of the data, they (he) just wiped it clean and installed the new system. So at 2:00 in the afternoon the VOD system went down, all of the movies were gone, and if you were in the middle of a movie, too bad. Now the VOD content had to be downloaded again from their providers, BUT all of the old content that would never be released again was simply gone. So all of the Game of Thrones episodes that had been released and weren't going to be released again were just gone.
They also literally ran their company off of an Access "database" designed by some lady who must have thought Access was a fancy spreadsheet application. The database only met first normal form, and that's because it's pretty much impossible not to meet it with a relational database. All of the notes for each address (they kept all of the information on the houses as well, since they did the installs and had some strange way of doing it) were kept in one memo field on each record. They printed the actual Access form out as work orders and kept paper copies. So when that giant memo clob got corrupted (which happened a good bit, for many records), the only data we had was a printed snapshot of that memo field.
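For anyone who hasn't fought with a design like that, here is a minimal sketch of the normalization that was skipped, using SQLite as a stand-in and made-up table and column names: one note per row instead of one ever-growing memo blob per address, so a single corrupted value no longer wipes out the whole history for a record.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE address (
        address_id INTEGER PRIMARY KEY,
        street     TEXT NOT NULL
    );
    CREATE TABLE address_note (             -- one note per row, not one blob per address
        note_id    INTEGER PRIMARY KEY,
        address_id INTEGER NOT NULL REFERENCES address(address_id),
        noted_at   TEXT NOT NULL,
        note       TEXT NOT NULL
    );
""")
con.execute("INSERT INTO address (address_id, street) VALUES (1, '123 Palm Ave')")
con.executemany(
    "INSERT INTO address_note (address_id, noted_at, note) VALUES (1, ?, ?)",
    [("2015-06-01", "Install completed, drop from pole #14"),
     ("2015-08-12", "Replaced set-top box")],
)
# Losing one note row is an annoyance; losing the single memo field was the whole history.
for row in con.execute(
    "SELECT noted_at, note FROM address_note WHERE address_id = 1 ORDER BY noted_at"
):
    print(row)
```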
-
@johnhooks said:
@scottalanmiller said:
@johnhooks said:
The "IT Director" at a small cable company I worked for in Florida told me he was trying to "move away" from virtualization. He was complaining because they were "getting slow."
I bet he failed to show any ROI on that move!
Ha, he's a disaster. This was the only business in the world with guaranteed income and they still screwed it up. They had a contract with about six small HOAs in the area. They were the only cable company allowed to provide cable, internet, and phone (unless it was something like Dish). They bragged about their FTTH setup, except they had horribly antiquated equipment that would break if you looked at it sideways.
They were upgrading their Minerva on-demand system. Instead of migrating all of the data, they (he) just wiped it clean and installed the new system. So at 2:00 in the afternoon the VOD system went down, all of the movies were gone, and if you were in the middle of a movie, too bad. Now the VOD content had to be downloaded again from their providers, BUT all of the old content that would never be released again was simply gone. So all of the Game of Thrones episodes that had been released and weren't going to be released again were just gone.
They also literally ran their company off of an Access "database" designed by some lady who must have thought Access was a fancy spreadsheet application. The database only met first normal form, and that's because it's pretty much impossible not to meet it with a relational database. All of the notes for each address (they kept all of the information on the houses as well, since they did the installs and had some strange way of doing it) were kept in one memo field on each record. They printed the actual Access form out as work orders and kept paper copies. So when that giant memo clob got corrupted (which happened a good bit, for many records), the only data we had was a printed snapshot of that memo field.
^^^ None of this is any surprise after you said "Florida". Down there, a stack of DVD backups in the microwave counted as "secure data disposal, in alignment with our contractual obligations and state laws". Not even joking. Didn't even have a dedicated "data disposal" microwave... just the one in the break room.
-
Oh, I have a good one. This one ingrained the practice of asking one simple question, "What has changed recently?", and the habit of stepping back to reevaluate the situation instead of going down the rabbit hole.
This was many years ago. I was working as an on-site IT tech, going to various homes and businesses throughout the day. I arrived on-site and fixed their issues, and as a good measure I went ahead and ran updates and cleaned out the temp files. Lo and behold, the computer then booted to a black screen with a blinking cursor. I did EVERYTHING, and I mean EVERYTHING, to attempt to repair the boot: boot sector repair, chkdsk /r, plus various other fixes. I was still green in some ways and did not stop and step back to reevaluate the situation, so I spent 2-3 hours attempting to fix the boot on this lady's laptop with no success. Eventually I figured it out. The hard drive had recently been upgraded to a 320 GB drive, but the laptop only supported 28-bit LBA, and the previous tech had installed the larger drive without paying attention to that limitation. So once Windows Update wrote data above the 137 GB mark, the drive was no longer bootable. I partitioned the drive into 120 GB partitions, defragged, and voila! It started booting again.
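The arithmetic behind that 137 GB wall is simple enough to show in a couple of lines; this is just the 28-bit LBA limit worked out, nothing specific to that laptop:

```python
# 28-bit LBA can address 2**28 sectors of 512 bytes each; anything the
# filesystem writes past that point is unreachable to the old controller/BIOS.
sectors = 2 ** 28          # addressable sectors with 28-bit LBA
sector_size = 512          # bytes per sector on drives of that era
limit = sectors * sector_size
print(f"{limit} bytes ~= {limit / 10**9:.1f} GB ~= {limit / 2**30:.1f} GiB")
# -> 137438953472 bytes ~= 137.4 GB ~= 128.0 GiB
```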
-
That used to catch a lot of people.
-
Creating separate Local Administrator accounts for End Users
So in this topic, a sole IT administrator is stuck on how to grant administrative access to his developers, because they occasionally need to restart a service or run something with administrative privileges.
Rather than taking the advice offered by the community, he's settled on creating a separate account for everyone in the domain, each with local administrative rights.
Yet he fails to see why this is a very bad idea. Even my fallback suggestion of granting a few users local administrative rights via GPO is something I recommend against, because there are so many ways that abuse and damage can occur which this administrator will be unable to do anything about (at the local PC level).
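If nothing else, an environment like that needs a way to see who actually holds local admin rights. Below is a minimal audit sketch, assuming Windows hosts, English command output, and the built-in `net localgroup` command; the approved-member list is hypothetical, just to illustrate how quickly ad-hoc admin accounts can be spotted:

```python
import subprocess

# Hypothetical approved admins; adjust for the real environment.
APPROVED = {"Administrator", "CONTOSO\\Domain Admins"}

def local_admins():
    """Return members of the local Administrators group as reported by `net localgroup`."""
    out = subprocess.run(
        ["net", "localgroup", "Administrators"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    members, collecting = [], False
    for raw in out:
        line = raw.strip()
        if line and set(line) == {"-"}:          # dashed separator precedes the member list
            collecting = True
            continue
        if line.startswith("The command completed"):
            break
        if collecting and line:
            members.append(line)
    return members

if __name__ == "__main__":
    for member in local_admins():
        if member not in APPROVED:
            print(f"Unexpected local admin: {member}")
```

Obviously controlling membership centrally (for example with a Restricted Groups GPO), or simply not handing out local admin at all, is the real fix; a script like this only makes the sprawl visible.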
-
@Brains said:
The hard drive had recently been upgraded to a 320 GB drive, but the laptop only supported 28-bit LBA, and the previous tech had installed the larger drive without paying attention to that limitation. So once Windows Update wrote data above the 137 GB mark, the drive was no longer bootable. I partitioned the drive into 120 GB partitions, defragged, and voila! It started booting again.
Say what? So the machine was booting from a partition that was larger than 137 GB to begin with, and didn't have a problem until you had more than 137 GB of data on it? That's a new one on me.
Nice catch.
-
@Dashrender Yeah, it's really crazy. It's because data was written above the 137 GB mark, which the system could not address.
-
@Brains said:
@Dashrender Yeah, it's really crazy. It's because data was written above the 137 GB mark, which the system could not address.
Basically the filesystem wandered out into no-man's land and got corrupted.
-
Another example. Minor, but again, best practices make life easy.
Beyond not understanding how some of the basics work (like thinking that the iLO needed an OS to function, which would defeat its purpose, and getting the name of the hardware wrong), he decided that he didn't want to virtualize and is trying to get modern Windows Server 2012 R2 onto a rather old HP ProLiant DL360 G5. The hardware is perfectly viable but far older than anything HP is going to support with drivers for a modern OS. Had he virtualized, as would have been sensible, he would never even have known that there was a driver issue. His entire problem exists only because he avoided a simple best practice and then didn't think of switching to it once he hit the roadblock caused by avoiding it.
-
Here is a future case of Burned by Eschewing Best Practices in the making: Spiceworks Link
So far, from the thread, the OP is looking for help rebuilding his virtualization environment, yet already has an IVPD on XenServer and wants to build a new IVPD on Hyper-V.
And "budget is not a concern" (which really should be an article all on its own!).
-
Not sure why? Can you summarise the IVPD article to explain why his solution is likely to see him burned? It looks like a waste of money, but if money is no object then I don't see why it's not a valid solution offering very high levels of uptime and reliability.
-
@Carnival-Boy said:
Not sure why? Can you summarise the IVPD article to explain why his solution is likely to see him burned? It looks like a waste of money, but if money is no object then I don't see why it's not a valid solution offering very high levels of uptime and reliability.
lol, while it can provide very high levels of uptime and reliability, a two-server, software-synced solution should provide even higher levels of both.
-
IVPD is generally a 3-2-1 system, but it can scale up as well. Increasing the scale doesn't change the design of the system, which depends on everything working in order to provide reliability.
If a SAN, a switch, a server, or any combination of them has problems, the environment is gimped: effectively crippled to the point where recovery options are extremely limited until all services are restored.
Adding more equipment adds more complexity, and with more complexity comes the need for more experience to support and maintain the systems. When something goes wrong and an expert is not on-site (or immediately available), downtime and the cost of that downtime can skyrocket.
A much simpler approach (at least for what was "seen" in this topic) is a smaller footprint, with equipment that can run independently of the other systems and where everything can run on either host.
-
@Carnival-Boy said:
Not sure why? Can you summarise the IVPD article to explain why his solution is likely to see him burned? It looks like a waste of money, but if money is no object then I don't see why it's not a valid solution offering very high levels of uptime and reliability.
IVPD - Inverted Pyramid of Doom. Sure, you have two servers, but they rely on a single storage subsystem. This is great and makes some things simpler in the long run... until you realize that hardware SANs are no more resilient than traditional servers; in fact, they are traditional servers with some storage appliance software installed on top, then sold at a massive markup. Having dual SANs helps, but then you also have to look at the storage network itself: do you also have dual switches, dual connections to both hosts, etc.? That is a massive overhead in both budget and specialized skills.
In reality, all of that work and money doesn't make you any more reliable than having two physical hosts that each have enough capacity to run the other's VMs when something goes wrong; in most cases the two-host approach is actually more reliable. To take it even further, you could forgo VM-level replication and do everything at the application level.
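A rough sketch of that comparison, with availability numbers assumed purely for illustration (they are not measurements of anyone's actual gear): in the inverted pyramid the "redundant" hosts still sit in series behind the switch and the SAN, while two self-contained replicating hosts only go down together.

```python
host = 0.999     # assumed availability of one server
san = 0.999      # assumed availability of the shared SAN (it is also just a server)
switch = 0.9995  # assumed availability of the storage switch

# Inverted pyramid: either host will do, but both depend on the switch and the SAN.
ipod = (1 - (1 - host) ** 2) * switch * san

# Two independent hosts, each able to run the full workload: either one survives alone.
two_hosts = 1 - (1 - host) ** 2

print(f"IPOD (2 hosts -> 1 switch -> 1 SAN): {ipod:.6f}")
print(f"Two replicating hosts:               {two_hosts:.6f}")
```

With those assumed numbers the pyramid lands around 99.85% while the two plain hosts sit near 99.9999%; the exact figures don't matter, the dependency chain does.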
-
Two SANs offer a high degree of redundancy. I'm not sure where 3-2-1 fits in with that? He doesn't have a SPOF, does he? He has redundant switches, redundant controllers, redundant SANs.
Sure, there is complexity in there, but there's complexity with DAGs and file syncing as well.
-
@Carnival-Boy said:
Two SANs offer a high degree of redundancy. I'm not sure where 3-2-1 fits in with that? He doesn't have a SPOF, does he? He has redundant switches, redundant controllers, redundant SANs.
Sure, there is complexity in there, but there's complexity with DAGs and file syncing as well.
Sure they do, but at what cost? Is it worth it when you could get the same functionality for tens of thousands of dollars less?
I've run into the "dual controllers are totally going to save the world" salesperson before, and one of my former managers bought into it. It turns out the controllers shared the same backplane and the same path to the drives, so when one controller locked up due to a firmware issue the other controller wouldn't take over. Everything went offline for about eight hours until they overnighted us a new controller and we updated the firmware on both of them.
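That story is the classic common-mode failure trap, and a tiny sketch with made-up probabilities shows why the brochure math doesn't hold: redundancy only multiplies out failures that are independent, while a shared firmware bug or backplane fault hits both controllers at once.

```python
p_hw = 0.01   # assumed chance of a random hardware fault in one controller
p_fw = 0.01   # assumed chance of a firmware bug affecting the code both controllers share

# Truly independent controllers: both must fail at the same time.
independent_pair = p_hw * p_hw

# Shared firmware/backplane: the common-mode fault takes out the pair all by itself.
shared_pair = p_fw + (1 - p_fw) * p_hw * p_hw

print(f"Independent pair failure chance: {independent_pair:.4%}")
print(f"Shared-firmware pair failure:    {shared_pair:.4%}")
```

With these illustrative numbers the "redundant" pair is roughly a hundred times more likely to fail than the independent-failure math suggests, which is exactly what that locked-up controller demonstrated.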