Are SMBs focused on the wrong things to improve reliability, up time and redundancy?
-
Is it just me, or do SMB IT folks often get picky about the wrong things when talking about reliability/uptime and redundancy issues? I've seen a few things recently that lead me to believe this, and one of them is this: http://community.spiceworks.com/topic/826298-are-refurbished-hard-drives-worth-it?
On the surface, being extremely concerned about a hard drive failure looks like a good thing. But hard drive failure is going to happen at some point, and it's not that big a deal. I mean, what does a single failed hard drive cause? In a server, generally just a bit of performance loss while the RAID is rebuilding.
On a desktop, nothing (data files) should be stored locally, so loss is minimal. I've always kept images for each department, so all you have to do is take a few minutes to image the machine, then activate the software if needed and join it to the domain. Very little is lost besides their Outlook NK2 file (okay, I know that's not used anymore, but you know what I mean).
Isn't a motherboard failure a much bigger deal than a hard drive failure, and something we can't really plan for? Hard drive failure risk is so easy to mitigate; why would we focus on that as the single point of failure?
What are your thoughts?
-
I can agree with this. However, motherboard failures are rarer, and as long as you have support contracts and good hardware refresh policies, they shouldn't be a huge deal. In reality, you contact the vendor, have a new computer by the next business day, image it, and get it back to working order. But SMBs don't live in a perfect world filled with support contracts and documentation. We don't live in the world inside @scottalanmiller's head. We live in reality. So we deal with things as they are.
-
That being said, I don't disagree with your points.
-
@thanksajdotcom said:
I can agree with this. However, motherboard failures are rarer, and as long as you have support contracts and good hardware refresh policies, they shouldn't be a huge deal. In reality, you contact the vendor and have a new computer by the next business day.
I've had more motherboard failures that I can't do anything about than hard drive failures. With a motherboard, I'm reliant on somebody else to ship a replacement out the next day. With hard drives, I can keep plenty of spares ready to go on-site, so it's a non-issue, and even if I were to run out I can get more really fast.
If it's a desktop, I have many spares pre-imaged with a generic "general" image ready to deploy at any time, so failure of a hard drive or motherboard is a non-issue. With servers, failure of a hard drive is a non-issue too: I can just replace it and lose no data. The RAID card failing would be a much, much bigger issue than the drives failing. Hard drive failure is nothing to me; I plan for it.
-
Spot on. With technologies like RAID 10, drive array failure approaches an impossibility. With studies putting MTTF (Mean Time To Failure, that is, the average time before the very first failure) at greater than 160,000 array years (possibly much greater; the study capped out without observing a single failure), worrying about array failure is pretty silly unless you start going to parity arrays and consumer disks.
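To see why RAID 10 array loss is so rare, here's a back-of-the-envelope sketch: the array only dies if a drive's mirror partner also fails inside the rebuild window. The 5% annual failure rate, 10-hour rebuild, and 8-drive array are illustrative assumptions, not figures from the study mentioned above.

```python
# Back-of-the-envelope RAID 10 array-failure estimate.
# All inputs are illustrative assumptions, not measured values.
drive_afr = 0.05          # annual failure rate per drive (assumed)
rebuild_hours = 10        # time to resilver a failed mirror (assumed)
pairs = 4                 # RAID 10 with 8 drives = 4 mirrored pairs

hours_per_year = 24 * 365
# Probability the surviving partner also dies during the rebuild window
p_partner_dies = drive_afr * rebuild_hours / hours_per_year
# Expected first-drive failures per year across the whole array
expected_failures = drive_afr * pairs * 2
# Approximate probability of losing the array in a given year
p_array_loss_per_year = expected_failures * p_partner_dies
print(f"~{p_array_loss_per_year:.2e} chance of array loss per year")
print(f"~{1 / p_array_loss_per_year:,.0f} array-years between losses")
```

Even with these pessimistic consumer-grade inputs, the estimate comes out in the tens of thousands of array-years, which is why the measured enterprise figure of 160,000+ array years is believable.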
But nearly everything else is likely to fail on you: fans, mobos, memory, backplanes, cables, etc. In my experience, which includes close to 100,000 server years personally, it is memory first and foremost that constitutes server risk, with motherboards second. Most everything else, like fans and power supplies, is redundant even in pretty entry-level machines and dies so seldom as to be practically ignored as a risk.
-
@thanksajdotcom said:
I can agree with this. However, motherboard failures are rarer, and as long as you have support contracts and good hardware refresh policies, they shouldn't be a huge deal.
Except that a support contract on drives means that you will, effectively, never lose storage. Ever. You could have a thousand SMB IT pros and not expect a single one of them, if they do things sensibly, to ever have a server go down because of storage loss. But among a thousand you'd get many, many career motherboard failures. Any motherboard "event" results in downtime on a normal server. No normal storage issue does.
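A rough expected-event count makes the contrast concrete. The career length, fleet size, and motherboard failure rate below are illustrative assumptions; only the 160,000 array-year MTTF comes from the study cited earlier.

```python
# Expected array losses vs. motherboard failures across a population
# of admins. Career length, fleet size, and motherboard MTTF are
# illustrative assumptions, not measured figures.
admins = 1000
career_years = 30          # assumed career length
servers_per_admin = 5      # assumed fleet size per admin
array_mttf = 160_000       # array-years, from the cited study
mobo_mttf = 50             # server-years per motherboard failure (assumed)

server_years = admins * career_years * servers_per_admin
expected_array_losses = server_years / array_mttf
expected_mobo_failures = server_years / mobo_mttf
print(f"{expected_array_losses:.1f} expected array losses")
print(f"{expected_mobo_failures:.0f} expected motherboard failures")
```

Under these assumptions the whole population sees under one array loss in 150,000 server-years, against thousands of motherboard failures, which is the asymmetry being described.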
-
That being said, enterprise-class servers have hot-swappable motherboards and memory. So these concerns really only apply to the SMB, which is what we are discussing here, but it is worth noting that if you were running an HP SuperDome or an Oracle M9000, everything except the chassis can be replaced (and, arguably, even that!), all with zero downtime.
-
@thecreativeone91 said:
The RAID card failing would be a much, much bigger issue than the drives failing. Hard drive failure is nothing to me; I plan for it.
Long before you get to redundant motherboards and memory, you can get redundant storage controllers and make even that a non-issue. You aren't going to get that in your DL380 or R720 systems, but one or two small steps up from there you can get those features. I believe Oracle offers them in the M5000 and up. Different vendors at different price points, of course.
-
@thanksajdotcom said:
We don't live in the world inside @scottalanmiller's head. We live in reality. So we deal with things as they are.
But not because they can't, only because they choose not to. Which is exactly what the thread is about. SMBs could live in practical, reliable, cost effective worlds. But they choose to blow loads of money or time or energy worrying about impractical things and not putting the effort where it needs to be: on the things that are actually likely to be a problem.
When discussing bad decision making, you can't use bad decision making as an excuse, it is circular reasoning. Yes, SMBs make bad decisions, granted. But the thread is asking.... "is this one of those places?" And the answer is "yes, it certainly is."
Are there ways, through nothing but better thinking and decision making, that SMBs could improve these things and get better reliability for less money? Yes, there are.
-
@scottalanmiller said:
But not because they can't, only because they choose not to. Which is exactly what the thread is about. SMBs could live in practical, reliable, cost effective worlds. But they choose to blow loads of money or time or energy worrying about impractical things and not putting the effort where it needs to be: on the things that are actually likely to be a problem.
Case in point of wasting money http://community.spiceworks.com/topic/828072-new-procurve-2920-48g-w-10gbe-connections-i-m-baffled?
-
@thecreativeone91 good example. The guy has lost absolutely all perspective. Completely unreasonable. I think of this as being "weird." Why would he jump to the conclusion that he needs networking gear far in excess of anyone, anywhere for a completely normal application? CAD isn't trivial, but it generally sits only on the high side of normal for workstation-by-workstation data access. It needs far more than accounting, but it is hardly a "demanding" application. I've supported big CAD shops, and it needs consideration, but not the kind of consideration that would require an expert in anything. You just can't run it all off of, you know, a single RAID 1 or something silly like that. Just bare-bones, entry-level knowledge of storage.
But he gets it into his head that this one, completely out of nowhere, aspect of his system needs to be taken to a degree that, quite literally, I've never seen before. I have never once seen someone try to deliver bonded 10GigE to the desktop.
I'm unsure how to describe this phenomenon. I refer to it as "being weird," but there must be a more meaningful way to label it. It's almost like attempting to be overly clever while disregarding all common knowledge (and sense). He seems to be convinced he is a special case (he is not) and that he's come up with some cleverly unique solution (he has not), and it never occurred to him to try the obvious things that everyone would naturally do. He's completely ignored the fact that he himself immediately pointed out a massive bottleneck and described exactly the performance problems one would expect from his setup.
And then, even once everyone clearly sees what is wrong and can tell that he is crazy, he sticks to his "I'm a special case" idea and thinks that somehow moving a tiny 1GB file requires 20Gb/s throughput, even when attached to a small, legacy spinning RAID 1 array!!
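The arithmetic makes the absurdity obvious: the spinning RAID 1, not the network, is the bottleneck. The throughput figures below are typical ballpark numbers, not measurements from the linked thread.

```python
# Time to move a 1 GB file when limited by the disk vs. the network.
# Throughput figures are ballpark assumptions for illustration.
file_mb = 1000.0       # 1 GB file
disk_mb_s = 150        # sequential read of a 7200rpm RAID 1 (assumed)
gige_mb_s = 125        # 1 Gb/s link ~= 125 MB/s
ten_gige_mb_s = 1250   # 10 Gb/s link ~= 1250 MB/s

print(f"disk-limited:  {file_mb / disk_mb_s:.1f} s")
print(f"1GbE-limited:  {file_mb / gige_mb_s:.1f} s")
print(f"10GbE-limited: {file_mb / ten_gige_mb_s:.1f} s")
```

Even a single 10GbE link already outruns the array roughly eightfold, so bonding a second one buys exactly nothing: the transfer finishes in the same disk-limited handful of seconds either way.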
-
@scottalanmiller said:
That being said, enterprise-class servers have hot-swappable motherboards and memory. So these concerns really only apply to the SMB, which is what we are discussing here, but it is worth noting that if you were running an HP SuperDome or an Oracle M9000, everything except the chassis can be replaced (and, arguably, even that!), all with zero downtime.
Hot swappable motherboards? I've never heard of such a thing...
-
@thanksajdotcom Pretty much anything in the midrange and mainframe classes has had this for decades. It's been standard for a long time, as motherboards used to be much flakier than they are today. In a mainframe, you have always needed to be able to replace absolutely any component while the system was still running. So rather than the traditional, monolithic motherboards you see in commodity servers, you have CPU and memory modules that attach hot-swappably to a massive backplane.
-
This isn't just used to replace failed motherboards, but often to allow systems to be upgraded to a newer generation of processors while the system is still running.
-
@scottalanmiller said:
This isn't just used to replace failed motherboards but often to allow systems to be upgraded to a newer generation of processors while the system is still running.
Damn...
-
The gap between the enterprise and commodity server worlds is still pretty shocking. Whether it's the number of threads, cores, or processors available in a system, total memory size, or the ability to hot swap everything from a fan to an entire chassis, those features just aren't widely available in commodity servers.
Once you move to Power, SPARC, and Itanium, everything changes.