The Inverted Pyramid of Doom Challenge

StorageNinja

@scottalanmiller said in The Inverted Pyramid of Doom Challenge:

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

The other issue with Scale (and this is no offense to them) or anyone in the roll your own hypervisor game right now is there are a number of verticle applications that the business can not avoid, that REQUIRE specific hypervisors, or storage be certified. To get this certification you have to spend months working with their support staff to validate capabilities (performance, availability, predictability being one that can strike a lot of hybrid or HCI players out) as well as commit to cross engineering support with them.

This is always tough and is certainly a challenge to any product. It would be interesting to see a survey of just how often this becomes an issue and how it is addressed in different scenarios. From my perspective, and few companies can do this, it's a good way to vet potential products. Any software vendor that needs to know what is "under the hood" isn't ready for production at all. They might need to specify IOPS or resiliency or whatever, sure. But caring about the RAID level used, that it is RAID or RAIN, what hypervisor is underneath the OS that they are given - those are immediate show stoppers, no vendor with those kinds of artificial excuses to not provide support are show the door. Management should never even know that the company exists as they are not viable options and not prepared to support their products. Whether it is because they are incompetent, looking for kickbacks or just making any excuse to not provide support does not matter, it's something that a business should not be relying on for production.

This right here makes no sense to me. You are ok with recommending infrastructure that can ONLY be procured from a single vendor for all expansions and has zero cost control of support renewal spikes, hardware purchasing, software purchasing (Proprietary hypervisor only sold with hardware), but you can't buy a piece of software that can run on THOUSANDS of different hardware configurations and more than 1 bare metal platform?

In Medicine for EMR's Caché effectively controls the database market for anyone with multiple branches with 300+ beds and one of every service. (Yes there is All Scripts that runs on MS SQL, and no it doesn't scale and is only used for clinics and the smallest hospitals as Philip will tell you). If you tell the chief of medicine you will only offer him tools that will not scale to his needs you will (and should) get fired. There are people who try to break out from the stronghold they have (MD Anderson who has a system that is nothing more than a collection of PDF's) but its awful, and doctors actively choose not to work in these hospitals because the IT systems are so painful to use (You can't actually look up what medicines someone is on, you have to click through random PDF's attached to them or find a nurse). The retraining costs that would be required to fulfill an IT mandate of "can run on any OS or hypervisors" to migrate this platform (or many major EMR's) are staggered. IT doesn't have this much power even in the enterprise. Sometimes the business driver for a platform outweighs the loss of stack control, or conformity of infrastructure (I"m sure the HFT IT guys had this drilled into their head a long time ago). This is partly the reason many people still quietly have a HP-UX, or AS400 in the corner still churning their ERP..

I agree that you should strongly avoid lock in where possible, but when 99% of the other community is running it on X and your the 1% that makes support calls a lot more difficult, not just because they are not familiar with your toolset. A lot of these products have cross engineering escalation directly into the platforms they have certified. We have lock-in on databases for most application stacks (and live with it no matter how many damn yachts we buy Larry). The key things are:

Know the costs going in. Don't act surprised when you buy software for 1 million and discover you need 500K worth of complementary products and hardware to deploy it.
Know what parts you can swap if they fail to deliver. (Hardware, support, OS, Database, Hypervisor) and be comfortable with any with reduced choice, or no choice. Different people may need different levels of support for each.
Also know what your options for the platform for hosted or OPEX non-hardware offerings are (IE can I replicate to a multi-tenant DR platform to make DR a lowered Opex).

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

I agree that you should strongly avoid lock in where possible, but when 99% of the other community is running it on X and your the 1% that makes support calls a lot more difficult, not just because they are not familiar with your toolset.

If your application isn't working, why are you looking at my hardware? I've never once, ever, seen a company that needed to call their EMR vendor to get their storage working, or their hypervisor. What scenario do you picture this happening in? What's the use case where your application vendor is being asked to support your infrastructure stack? And, where does it end?

There are only four enterprise hypervisors in any case, so if you are vendor that demands this level of integration, you need only support the four. Sure, someone new might come along, but this is a really silly limition to my thinking. It's no business of an application maker what platform is delivering the system, only that it is delivered. If that makes their job more complicated, you have other issues. If they even ask to see your underlying system, you have issues.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

The retraining costs that would be required to fulfill an IT mandate of "can run on any OS or hypervisors" to migrate this platform (or many major EMR's) are staggered.

Well, no. The cost is literally zero. In fact, it takes cost and effort to not support it. OS and hypervisors are totally different here. Writing for an OS takes work, because that's your application deployment target. That's where you need to target the OS in question. The hypervisor is no business to an application writer. That's below the OS, on the other side of your interface. There is zero effort, literally zero, for the application team.

So what I see isn't a lack of effort, it's throwing in additional effort to try to place blame elsewhere for problems that aren't there. I've run application development teams, if your team is this clueless, I'm terrified to tell the business that I let you even in the door, let alone deployed your products.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

IT doesn't have this much power even in the enterprise.

It does in finance, that's for sure. Any business that takes supportability and viability into account would never have IT not in a veto position here. IT may not pick the products, but it's a minimal level of business competence for them to be able to veto things that are not supportable (likely by the vendor), viable or secure.

In healthcare where cost effective, stable, supportable and secure don't matter, sure. But that's not the business world, either. You don't run quality IT in that field, not good business practices. It's its own thing and often decisions are made for reasons very, very different than "what's good for making money or supporting healthcare." I've been told flat out by healthcare management that "lowering cost, making money or providing better healthcare" were of zero interest to them because they were non-profit and the patients were not their customers.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

This is partly the reason many people still quietly have a HP-UX, or AS400 in the corner still churning their ERP..

That "can" happen, but I never see those companies. What I find are always companies that lack the skills, resources or business acumen to do an application migration or to plan for one and get stuck cycle after cycle deploying something far too expensive because they did not develop the skills, acquire the skills or prioritize the planning to protect themselves - all business failings of bad management and not a "reason to strategize around this process."

StorageNinja

@scottalanmiller said in The Inverted Pyramid of Doom Challenge:

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

I agree that you should strongly avoid lock in where possible, but when 99% of the other community is running it on X and your the 1% that makes support calls a lot more difficult, not just because they are not familiar with your toolset.

If your application isn't working, why are you looking at my hardware? I've never once, ever, seen a company that needed to call their EMR vendor to get their storage working, or their hypervisor. What scenario do you picture this happening in? What's the use case where your application vendor is being asked to support your infrastructure stack? And, where does it end?

Because performance and availability problems come from the bottom up not the top down. SQL has storage as a dependency, storage doesn't have SQL as a dependency, and everything rolls downhill...

IF I'm running EPIC and want to understand why a KPI was missed and if there was a correlation to something in the infrastructure stack I can overlay the syslog of the entire stack, as well as the SNMP/SMI-S, API and hypervisor performance stats (application stats) the EUC environment stats (Hyperspace either on Citrix or View) and see EXACTLY what caused that query to run slow. There are tools for this. These tools though are not simple to build and if they don't have full API access to the storage, or hypervisor, with full documentation and this stuff built (including log clarification) its an expensive opportunity to migrate this to a new stack, and something they want to be restrictive on.

SAP HANA is a incredible pain in the ass to tune and setup, and depending on the underlying disk engine may have different block sizing or other best practices. This is one of the times where things like NUMA affinity can actually make or break an experience. Getting their support to understand a platform enough to help customers tune this, their PSO assist in deployments, and their support identify known problems with the partner ecosystem means they are incredibly restrictive (Hence they ares still defining HCI requirements).

It costs the software vendors money in support to deal with non-known platforms (Even if they don't suck). The difference between two vendors arguing and pointing fingers, and two vendors collaborating at the engineering level to understand and solve problems together is massive in customer experience (and the cost required).

The amount of effort that goes into things like this, and reference architectures is honestly staggering and humbling (these people are a lot smarter than me). I used to assume it was just vendors being cranky, but seeing the effort required of a multi-billion dollar.

At the end of the day EPIC, SAP and other ERP platforms will take the blame (not Scale, or Netapp, or EMC) if the platform delivers an awful experience (Doctors or CFO's will just remember that stuff was slow and broke a lot) so being fussy about what you choose to support is STRONGLY in their business interests and is balanced against offering enough choice that they do not inflate their costs too high. Its a balancing act.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

Because performance and availability problems come from the bottom up not the top down. SQL has storage as a dependency, storage doesn't have SQL as a dependency, and everything rolls downhill...

That doesn't make sense, though. Applications care that they have enough CPU, memory, IOPS, bandwidth, etc. That's it. They don't care how it is delivered, only that it is available when needed. This would be, again, a failing of any application team and any IT team if they look to the application for issues involving not providing enough resources for performance.

If your point here is that incompetent IT departments tend to buy unsupportable, crappy software.. sure. No one is denying that people don't do their jobs well. But that doesn't mean we should recommend doing things poorly just because lots of people aren't good at their jobs.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

SAP HANA is a incredible pain in the ass to tune and setup, and depending on the underlying disk engine may have different block sizing or other best practices. This is one of the times where things like NUMA affinity can actually make or break an experience. Getting their support to understand a platform enough to help customers tune this, their PSO assist in deployments, and their support identify known problems with the partner ecosystem means they are incredibly restrictive (Hence they ares still defining HCI requirements).

Right, so at this point you are talking about outsourcing your IT department as this isn't application work, this is infrastructure work. So this is turning into a completely different discussion. Now you are hiring an external IT consulting group that doesn't know the platform(s) that you might be running. That's a totally different discussion.

But what we are talking about here is needing an application vendor to do underlying IT work for them. It's a different animal. It does happen and there is nothing wrong with outsourcing IT work, obviously, I'm a huge proponent of that. But there is no need to get it from the application vendor, that some do might make sense in some cases, that application teams demand that they also be your IT department is a problem unless you are committed to them delivering their platform as an appliance and it should be treated that way in that case. Nothing wrong with that per se and a lot of places do just that.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

At the end of the day EPIC, SAP and other ERP platforms will take the blame (not Scale, or Netapp, or EMC) if the platform delivers an awful experience (Doctors or CFO's will just remember that stuff was slow and broke a lot) so being fussy about what you choose to support is STRONGLY in their business interests and is balanced against offering enough choice that they do not inflate their costs too high. Its a balancing act.

Yes, I totally understand, vendors that target irrational, emotion, incompetent businesses have an interest in doing things that are not in the interest of those customers. As Scott Adams' defines it, the stupid rich. You make your best money by overcharging for bad products and marketing to those that aren't smart enough to figure out how they are getting screwed. I don't blame the vendors for making money, I blame customers for buying into it.

If our goal is to make money off of the businesses, we do one thing. If we are IT and our job is to make good decisions and protect the business from predatory vendors, we do another.

StorageNinja

@scottalanmiller said in The Inverted Pyramid of Doom Challenge:

The retraining costs that would be required to fulfill an IT mandate of "can run on any OS or hypervisors" to migrate this platform (or many major EMR's) are staggered.

Well, no. The cost is literally zero. In fact, it takes cost and effort to not support it. OS and hypervisors are totally different here. Writing for an OS takes work, because that's your application deployment target. That's where you need to target the OS in question. The hypervisor is no business to an application writer. That's below the OS, on the other side of your interface. There is zero effort, literally zero, for the application team.
So what I see isn't a lack of effort, it's throwing in additional effort to try to place blame elsewhere for problems that aren't there. I've run application development teams, if your team is this clueless, I'm terrified to tell the business that I let you even in the door, let alone deployed your products.

Applications can certainly care what the hypervisor/storage is and integrate with it.

If your leveraging the underlying platform for writable clones, for test/dev QA workflows (See this a lot with Oracle DB applications and in oil gas where the application might even directly call Netapp API's and manage the array for this stuff).

Backups - (Some Hypervisors have changed block tracking so a backup takes minutes, others don't meaning a full can take hours). BTW I hear Hyper-V is getting this in 2016 (Veeam had to write their own for their platform).

VDI - always leverages so many hypervisor based integrations that the experience can be wildly different. Citrix and View both need 3D Video Cards to do certain things and being locked into a platform that has limited support for that can be a problem.

Security - Guest introspection support to hit compliance needs, Micro-segmentation requirements (EPIC has drop in templates for NSX, Possibly HCI at some point). If you want actual micro-semtnagation and inspection on containers there isn't anything on the market that competes with Photon yet. At some point there may be ACI templates but that will require network hardware lock in (Nexus 9K) and that's even crazier (Applications defining what switch you can buy!).

Monitoring - app owners want full stack monitoring and this is a gap in a lot of products. Here's an example.

Applications caring about hypervisor and hardware will become more pronounced when Apache's Pass and Hurley hits the market and applications start being developed to access Byte addressable persistent memory, and FPGA co-processors. I don't expect all 4 to have equal support for this on day one, and the persistent memory stuff is going to be game changing (and also a return to the old days, as technology proves itself to be cyclical once again!).

scottalanmiller

One of the major problems with the "application vendors define all the rules" approach is that they end up years or decades behind. And they tend to work very hard to define things that get them kick backs. How many applications being deployed today still demand that you use RAID 5 on non-SSD disks (for "performance" they say), don't allow virtualization at all, require that you buy SQL Server Enterprise (even for a five user business), require full failover at the application level, only allow certain generation processors and more.... all for a PHP web app that would work even better with a free MariaDB instance, could have been deployed on Linux, would have worked perfectly on SSDs in RAID1 and had no need for failover because the downtime wasn't important?

We see this constantly. Application side requirements mean that we are at the mercy of someone who isn't an IT shop to define IT. In what other case would we ever allow a nearly random third party to define our company infrastructure? Does your palette maker get to tell you what trucks to buy? Does your CPA get to tell you what management structure to use or how to hire the right people? Does your toilet paper supplier get to demand what brand of light bulbs you put in the bathrooms or what faucets you install? Does your soda vending machine guy get to choose the brand of PCs the receptionist uses?

Of course not. You'd never do business with a vendor that did that in any other case. It is your business to determine your own infrastructure needs. IT is part of your operations.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

VDI - always leverages so many hypervisor based integrations that the experience can be wildly different. Citrix and View both need 3D Video Cards to do certain things and being locked into a platform that has limited support for that can be a problem.

VDI is not an app, though. It works directly on the hypervisor, so is a different thing.

StorageNinja

@scottalanmiller said in The Inverted Pyramid of Doom Challenge:

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

SAP HANA is a incredible pain in the ass to tune and setup, and depending on the underlying disk engine may have different block sizing or other best practices. This is one of the times where things like NUMA affinity can actually make or break an experience. Getting their support to understand a platform enough to help customers tune this, their PSO assist in deployments, and their support identify known problems with the partner ecosystem means they are incredibly restrictive (Hence they ares still defining HCI requirements).

Right, so at this point you are talking about outsourcing your IT department as this isn't application work, this is infrastructure work. So this is turning into a completely different discussion. Now you are hiring an external IT consulting group that doesn't know the platform(s) that you might be running. That's a totally different discussion.

But what we are talking about here is needing an application vendor to do underlying IT work for them. It's a different animal. It does happen and there is nothing wrong with outsourcing IT work, obviously, I'm a huge proponent of that. But there is no need to get it from the application vendor, that some do might make sense in some cases, that application teams demand that they also be your IT department is a problem unless you are committed to them delivering their platform as an appliance and it should be treated that way in that case. Nothing wrong with that per se and a lot of places do just that.

This is largely SAP's support model today. They have validated appliance or tightly controlled RA's. Honestly making in memory database work is such a niche skill set I don't blame them for demanding to take that on for two reasons.

Its weird work most in house IT will not know.
They can charge a ton of money.
No one complains because if your using HANA for what its intended the 2 million you spend on it is a joke vs the benefit it brings to your org.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

Backups - (Some Hypervisors have changed block tracking so a backup takes minutes, others don't meaning a full can take hours). BTW I hear Hyper-V is getting this in 2016 (Veeam had to write their own for their platform).

Sure... but what does the application layer care? Either the application takes care of its own backups and doesn't care what the hypervisor does, or it relies on IT to handle backups and it isn't any of their concern either.

Again, this is an application vendor or programmer trying to get involved in IT decisions, processes and designs. Do you let the company that makes your sofa determine how big your fireplace has to be because "they want to ensure that you are cozy?"

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

This is largely SAP's support model today. They have validated appliance or tightly controlled RA's. Honestly making in memory database work is such a niche skill set I don't blame them for demanding to take that on for two reasons.

Its weird work most in house IT will not know.

They can charge a ton of money.

No one complains because if your using HANA for what its intended the 2 million you spend on it is a joke vs the benefit it brings to your org.

And it helps to explain why SAP failure rates are through the roof. They are the standard example used in nearly all studies for where projects just totally fall flat. SAP is famous for not delivering working systems. When they do pull it off, they are very good. But very often, they don't. Dealing with one right now, actually, and SAP has been failing big time, even in a very small deployment.

But it's also very important to note that when we are talking about systems getting this big, how often does it have any value to merging this onto other infrastructure, anyway? If I was getting SAP and spending millions to acquire it, why bother with trying to make all other systems fit onto the "leftovers" of the SAP one?

The argument for "why someone like SAP needs to determine all kinds of things under the hood" equally causes problems that other vendors might say "well SAP messed this up so we won't support it." So it very strongly says that if you are going to accept a vendor that needs this kind of control to try to make money and avoid blame, that you need to isolate your different vendor's systems so that they can all make these independent choices.

StorageNinja

@scottalanmiller said in The Inverted Pyramid of Doom Challenge:

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

Because performance and availability problems come from the bottom up not the top down. SQL has storage as a dependency, storage doesn't have SQL as a dependency, and everything rolls downhill...

That doesn't make sense, though. Applications care that they have enough CPU, memory, IOPS, bandwidth, etc. That's it. They don't care how it is delivered, only that it is available when needed. This would be, again, a failing of any application team and any IT team if they look to the application for issues involving not providing enough resources for performance.

If your point here is that incompetent IT departments tend to buy unsupportable, crappy software.. sure. No one is denying that people don't do their jobs well. But that doesn't mean we should recommend doing things poorly just because lots of people aren't good at their jobs.

Most IT departments (Even enterprises) are not skilled (or skilled well) at troubleshooting infrastructure (Especially beats like ERP that might have a dozen interdependent systems) without assistance. Most ERP vendors know this and so rather than let the customers deploy a database for 20K users on a Hyper-V host with a 3 Disk RAID 5 (and then the project be written off as a failure and their name be damaged) take this choice away.

For the 5 years I consulted "why is this slow" was one of the most common engagements. 9/10 of the time I was chasing some crazy application issue it had nothing to do with the application. Generally it was staring people in the face, had a giant RED alarm, and was fairly obvious (Disk latency isn't supposed to be 1200ms, and NL-SAS drives shouldn't be used for DB's in 5 billion dollar companies Yo?). Assuming that vendors are crossing a line by assuming internal IT doesn't understand what it will take to deliver their applications is CRITICAL to being a successful application vendor. I've seen users, IT and C-suite trash applications that worked fine, but the infrastructure was all wrong....

This is part of a huge reason for many vendors pushing for SaaS offerings, or OPEX offerings. If you don't bundle high levels of support that can extend beyond the application your risking your revenue. Much like why Scale (and other highly successful HCI vendors) try to own support of the ENTIRE stack. If they didn't own support of the hypervisor people would do awful, awful things then blame them.

Its not fair, but its the reality we live in...

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

Most IT departments (Even enterprises) are not skilled (or skilled well) at troubleshooting infrastructure (Especially beats like ERP that might have a dozen interdependent systems) without assistance. Most ERP vendors know this and so rather than let the customers deploy a database for 20K users on a Hyper-V host with a 3 Disk RAID 5 (and then the project be written off as a failure and their name be damaged) take this choice away.

Most ERP vendors are not skilled at this either, though. SAP continuously fails at this. Their competitors are often worse. Some actually do the opposite and specifically require three disk RAID 5 rather than do this to avoid it.

Sure, most IT departments are bad. But again we are going down the "assumption of bad decisions" bad. The one that says "we should make bad decisions, because we make bad decisions, so we start recommending bad things." It's not good logic to say "people are often dumb, so we assume you are and make products based on that." That might be good logic for making money (and why vendors always recommend it) but it's not a good idea to do business with those vendors.

Basically how I read this is "in these cases, your vendor is building their products based around you not being competent and that you will make bad decisions." That's great, that just, to me, repeats by original point that IT should rule out those vendors.

StorageNinja

@scottalanmiller said in The Inverted Pyramid of Doom Challenge:

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

Backups - (Some Hypervisors have changed block tracking so a backup takes minutes, others don't meaning a full can take hours). BTW I hear Hyper-V is getting this in 2016 (Veeam had to write their own for their platform).

Sure... but what does the application layer care? Either the application takes care of its own backups and doesn't care what the hypervisor does, or it relies on IT to handle backups and it isn't any of their concern either.

Again, this is an application vendor or programmer trying to get involved in IT decisions, processes and designs. Do you let the company that makes your sofa determine how big your fireplace has to be because "they want to ensure that you are cozy?"

Application owners have RPO/RTO's and they expect the infrastructure people to often take care of that. (When I have a 5TB OLTP database, in guest options generally fail to deliver somehow).

If I buy a couch or desk that's massive for a tiny apartment I could see the sales guy asking how big my door way is to make sure they can deliver it. Otherwise I'll be saying "GALLERY FURNITURE SUCKS THEY SELL COUCHES THAT DON"T WORK". This is what users, application owners, and infrastructure people do today. Vendors MUST protect their name. I'm not saying these whiners make any sense, but people do this.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

For the 5 years I consulted "why is this slow" was one of the most common engagements. 9/10 of the time I was chasing some crazy application issue it had nothing to do with the application. Generally it was staring people in the face, had a giant RED alarm, and was fairly obvious (Disk latency isn't supposed to be 1200ms, and NL-SAS drives shouldn't be used for DB's in 5 billion dollar companies Yo?). Assuming that vendors are crossing a line by assuming internal IT doesn't understand what it will take to deliver their applications is CRITICAL to being a successful application vendor. I've seen users, IT and C-suite trash applications that worked fine, but the infrastructure was all wrong....

And that's why external IT consulting was brought in. Not a random application vendor.

scottalanmiller

@John-Nicholson said in The Inverted Pyramid of Doom Challenge:

Application owners have RPO/RTO's and they expect the infrastructure people to often take care of that. (When I have a 5TB OLTP database, in guest options generally fail to deliver somehow).

Yup, and if they outside IT to the application vendor, that SLA isn't owned by the actual IT department but by someone who came in, put in something new and ran away. If the application owners need a reliable RPO/RTO, they need to work with IT, not work against them.