    ZFS Based Storage for Medium VMWare Workload

    • donaldlandruD
      donaldlandru
      last edited by scottalanmiller

      Ok, so a little background. The storage situation at my organization is the weakest link in our network. Currently we have a single HP MSA P2000 with 12 spindles (7200 rpm) serving two separate ESXi clusters. We have a 2 node cluster for our operations (Exchange, AD, SharePoint Foundation, and other miscellaneous applications) and a 3 node cluster for development machines. Development is our core business; in simple terms we do SI work for Oracle Retail applications, which includes custom development. Some in the organization argue this data may be even more important than the aforementioned operations systems; thankfully, IMO, my boss (the CEO) disagrees with that opinion. Also, when I presented this same information (rolled up better to speak CEO), my boss's response was to go with whatever I think is the better solution. The company really does stand behind me in what I suggest; I just don't want to add additional risk.

      It is not uncommon for us to max out the disk I/O on 12 spindles sharing the load of almost 150 virtual machines, and everyone is on board that something needs to change.

      Here is what the business cares about in the solution: a reliable solution that provides the necessary resources for the development environments to operate effectively (read: we do not do performance testing in-house because, by its very nature, it is very much a "your mileage may vary" exercise that depends on your deployment situation).

      In addition to the business requirements, I have added my own requirements that my boss agrees with and blesses.

      1. Operations and Development must be on separate storage devices
      2. Storage systems must be built of business class hardware (no RED drives -- although I would allow this in a future Veeam backup storage target)
      3. Must be expandable to accommodate future growth

      Requirements for development storage

      • 9+ TiB of usable storage
      • Support a minimum of 1100 random IOPS (what our current system is peaking at)
      • Disks must be in some kind of array (ZFS, RAID, mdadm, etc.)

      Proposed solutions:

      #1 a.k.a the safe option
      HP StoreVirtual 4530 with 12 TB (7.2k) spindles in RAID6 -- this is our vendor recommendation. This is an HP renew quote with 3 years 5x9 support next-day on-site for ~$15,000

      Pros
      Can purchase support
      Single-vendor -- "one throat to choke"
      Integrated solution
      Cons
      Less performance than solution #2 out of the box
      More expensive to upgrade later (additional shelves and drives at HP prices)
      All used hardware

      #2 ZFS Solution ~$10,000
      24 spindle 900 GB (7.2k SAS) in 12 mirrored vdevs
      Based on Supermicro SC216E16 chassis
      X9SRH-7F Motherboard
      Intel E5-1620v2 CPU
      64 GB of RAM
      No L2ARC or ZIL planned
      Dual 10gig NICs

      Pros
      Better performance out of the box (twice the spindle count)
      Non-vendor specific parts means upgrades require less investment

      Cons
      Self-supported
      I am the support contract 😕
      Multiple vendors and suppliers to acquire parts
      Combination of new and used hardware (the chassis) to get this price point
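A rough sizing check of the 24-spindle mirrored layout in option #2 against the development requirements (9+ TiB usable, 1100 random IOPS) can be sketched as below. This is a back-of-the-envelope sketch only: the per-spindle IOPS figure is an assumed rule of thumb for 7.2k drives, not a measurement, and ZFS metadata overhead, free-space headroom, and ARC caching are ignored.

```python
# Back-of-the-envelope sizing for 24 x 900 GB drives in 12 two-way mirrors.
DRIVES = 24
DRIVE_GB = 900            # decimal gigabytes, as drives are marketed
MIRROR_WIDTH = 2
IOPS_PER_SPINDLE = 75     # ASSUMPTION: rule-of-thumb for a 7.2k drive, not measured

vdevs = DRIVES // MIRROR_WIDTH

# One drive's worth of capacity per mirror vdev, converted to binary TiB.
usable_tib = vdevs * DRIVE_GB * 10**9 / 2**40

# Reads can be served by either side of a mirror; writes must hit both sides.
read_iops = DRIVES * IOPS_PER_SPINDLE
write_iops = vdevs * IOPS_PER_SPINDLE

print(f"vdevs: {vdevs}")
print(f"usable capacity: {usable_tib:.2f} TiB (before filesystem overhead)")
print(f"estimated random read IOPS:  {read_iops}")
print(f"estimated random write IOPS: {write_iops}")
```

Under these assumptions the raw capacity clears the 9 TiB target with little headroom, and whether the pool clears 1100 random IOPS depends on the read/write mix, so the real workload ratio is worth measuring before committing.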

      Alright, tear me apart, tell me I am wrong, or provide any other useful feedback. The biggest concerns I have exist in both platforms (drives fail, controllers fail, data goes bad, etc) and have to be mitigated either way. That is what we have backups for -- in my opinion the HP gets me the following things:

      1. The "ability" to purchase a support contract
      2. Next-day on-site of a tech or parts if needed

      With the $4000 saved from not buying the HP support contract I can buy a duplicate Supermicro system, and a couple extra hard drives, and have the same level of protection.

      Note: this is my first time posting an actual give me feedback topic, I tried to include all information I felt was relevant. If more is needed I can provide.

      • scottalanmillerS
        scottalanmiller
        last edited by

        Before I dive into it, what is the need around ZFS? It sounds like you are leading with the solution, rather than the goal, which will not lead us in the direction of a best answer. We should step back and think at the goal level and determine what it is that we want to accomplish. Maybe ZFS will be the answer, but what if it isn't? Leading with the answer and looking for the question isn't the best way to design a solution.

        • scottalanmillerS
          scottalanmiller @donaldlandru
          last edited by

          @donaldlandru said:

          We have a 2 node cluster for our operations (Exchange, AD, SharePoint Foundation, and other miscellaneous applications) and a 3 node cluster for development machines.

          So a two node cluster and a three node cluster. This seems straightforward.... no external storage at all. The rule of thumb of external storage is that it should not be considered until you are above four nodes in a single cluster and even then, not normally until much larger. What is the purpose of having external storage at all?

          • scottalanmillerS
            scottalanmiller
            last edited by

            Another question: what is the purpose of the clusters? Currently you have an inverted pyramid of doom, not the best design as you know. But this implies that there are no needs around high availability. In fact, it means that you are currently below "standard availability" and this should mean that dropping out of clusters and going to stand-alone servers would itself be an improvement. What is the reason for having clusters at all given that reliability hasn't been a factor thus far?

            • donaldlandruD
              donaldlandru @scottalanmiller
              last edited by

              @scottalanmiller said:

              Before I dive into it, what is the need around ZFS? It sounds like you are leading with the solution, rather than the goal, which will not lead us in the direction of a best answer. We should step back and think at the goal level and determine what it is that we want to accomplish. Maybe ZFS will be the answer, but what if it isn't? Leading with the answer and looking for the question isn't the best way to design a solution.

              In a sense I am, but only because, outside of the MSA and Windows-based storage, this is what I am most familiar with. If we don't go with a vendor-supported solution, this would require the least effort to support. That doesn't make it the right answer, just the one I am most comfortable putting my name next to.

              • scottalanmillerS
                scottalanmiller @donaldlandru
                last edited by

                @donaldlandru said:

                1. Operations and Development must be on separate storage devices

                Mostly makes sense. This heavily suggests that the local storage options will be best then as you lose the only real potential leverage for having external storage which was tiny bits of cost savings that might have arisen by having five servers share one storage unit. Without that, really hard to come up with a way to have external storage. It was essentially impossible even with five.

                • scottalanmillerS
                  scottalanmiller @donaldlandru
                  last edited by

                  @donaldlandru said:

                  1. Storage systems must be built of business class hardware (no RED drives -- although I would allow this in a future Veeam backup storage target)

                  What's the reason for this? Red drives are just as reliable, or meaningfully so, as any other drive type in certain scenarios. I'm not saying that Red is going to be right or make any sense, but as a requirement this doesn't match the concept of a business goal. This is another "solution looking for a problem." Red drives are perfectly viable for the most enterprise of applications, when they fit the bill.

                  Even for a SAM-SD, which by definition is all about being enterprise storage, WD Red are perfectly acceptable. The idea that consumer drives are risky is purely one tied to the use of already more risky parity arrays. The same factors that would make you classify WD Red as "non-business class" would classify RAID 6 the same way. So the rule would exclude both or neither, depending on how it is applied, but not one without the other.

                  • donaldlandruD
                    donaldlandru @scottalanmiller
                    last edited by

                    @scottalanmiller said:

                    @donaldlandru said:

                    We have a 2 node cluster for our operations (Exchange, AD, SharePoint Foundation, and other miscellaneous applications) and a 3 node cluster for development machines.

                    So a two node cluster and a three node cluster. This seems straightforward.... no external storage at all. The rule of thumb of external storage is that it should not be considered until you are above four nodes in a single cluster and even then, not normally until much larger. What is the purpose of having external storage at all?

                    This setup was implemented when I first started four years ago. We used a third-party consultant and they designed this as the solution for the operations cluster. There were initial plans to do something different for the development cluster, but due to the cost of the SAN (which may or may not have been needed) it was then value-engineered by the people leading the project, with little regard for my input, as I was the new guy.

                    My initial plan was to build a four-node cluster with shared storage without the ops/dev silos. The ops (2 node) cluster is licensed with VMware Essentials Plus and the dev cluster is licensed with VMware Essentials. I do rely on vMotion and DRS in the ops cluster for better utilizing resources and doing maintenance.

                    VMotion is of little use to me in the dev cluster as these machines (RAM: 288GB, 64GB, 16GB) don't have enough resources to host everything should a node drop so it is mainly licensed for the backup API access
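To make the failover point concrete, here is a small sketch of how much RAM is left in the dev cluster after losing any one node. The RAM sizes come from the post; everything else (the function name, the idea of comparing by percentage) is illustrative, and actual VM memory demand is not known from the thread.

```python
# RAM per dev node in GB, as listed in the post.
node_ram_gb = [288, 64, 16]


def surviving_ram_after_loss(nodes):
    """Return the RAM (GB) left in the cluster for each single-node failure."""
    total = sum(nodes)
    return [total - lost for lost in nodes]


total_ram = sum(node_ram_gb)
for lost, left in zip(node_ram_gb, surviving_ram_after_loss(node_ram_gb)):
    share = left / total_ram
    print(f"lose the {lost} GB node -> {left} GB left ({share:.1%} of cluster RAM)")
```

Losing the largest node leaves well under a quarter of the cluster's memory, which is why failover of the full workload is not realistic with this hardware mix.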

                    • scottalanmillerS
                      scottalanmiller @donaldlandru
                      last edited by

                      @donaldlandru said:

                      1. Must be expandable to accommodate future growth

                      Expandability often costs a ton today and delivers very little value "tomorrow." Is this truly an important business goal? It is very often cheaper to do the right thing for today and the immediate future and evaluate again in one, two or five years - whenever factors have changed and you are in a position to make a new decision. Planning for expansion introduces unnecessary risk to the project.

                      • scottalanmillerS
                        scottalanmiller @donaldlandru
                        last edited by

                        @donaldlandru said:

                        VMotion is of little use to me in the dev cluster as these machines (RAM: 288GB, 64GB, 16GB) don't have enough resources to host everything should a node drop so it is mainly licensed for the backup API access

                        This tells us two things:

                        • VMware is the wrong platform for you almost certainly. You are paying a premium to get less than you would get for free elsewhere.
                        • There is no reason for a cluster or external storage as even the most minimal features of it are being skipped.
                        • scottalanmillerS
                          scottalanmiller
                          last edited by

                          By dropping VMware vSphere Essentials you are looking at a roughly $1200 savings right away. Both Hyper-V and XenServer will do what you need absolutely free.

                          • scottalanmillerS
                            scottalanmiller
                            last edited by

                            That $1200 number was based off of Essentials. Just saw that you have Essentials Plus. What is that for? Eliminating that will save you many thousands of dollars! This just went from a "little win" to a major one!

                            • scottalanmillerS
                              scottalanmiller @donaldlandru
                              last edited by

                              @donaldlandru said:

                              I do rely on vMotion and DRS in the ops cluster for better utilizing resources and doing maintenance.

                              Better to be fast and cheap than to be slow, expensive and have to balance. Easier to throw "speed" at the problem than to do live balancing if that is all that you are getting out of it.

                              Maintenance should be trivial, what planned outages are you avoiding that warrant the heavier risk of unplanned ones?

                              • scottalanmillerS
                                scottalanmiller @donaldlandru
                                last edited by

                                @donaldlandru said:

                                Requirements for development storage

                                • 9+ TiB of usable storage
                                • Support a minimum of 1100 random IOPS (what our current system is peaking at)

                                If split between five nodes, that's a minimal number. My eight year old desktop has 100,000 IOPS! This is less than 250 IOPS per machine, you can often hit that with a small RAID 1 pair in each box! And 10TB is just 2TB per box. This isn't a big problem to tackle when you break it down. Actually pretty moderate needs.
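The per-node breakdown above is simple division; a minimal sketch, assuming the load is spread evenly over all five hosts (two ops plus three dev) and rounding capacity up to 10 TB as in the reply:

```python
# Figures from the thread: peak IOPS, rounded-up capacity, and host count.
TOTAL_IOPS = 1100
TOTAL_TB = 10
NODES = 5   # 2 ops hosts + 3 dev hosts

# ASSUMPTION: load and data are spread evenly across hosts.
iops_per_node = TOTAL_IOPS / NODES
tb_per_node = TOTAL_TB / NODES

print(f"{iops_per_node:.0f} IOPS per node")
print(f"{tb_per_node:.0f} TB per node")
```

In practice the split would not be even (the dev hosts differ widely in size), so the per-node numbers are a floor for the average, not a sizing for the largest host.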

                                • donaldlandruD
                                  donaldlandru @scottalanmiller
                                  last edited by

                                  @scottalanmiller said:

                                  That $1200 number was based off of Essentials. Just saw that you have Essentials Plus. What is that for? Eliminating that will save you many thousands of dollars! This just went from a "little win" to a major one!

                                  Essentials Plus is there to allow us to use vMotion on the operations cluster. While it would likely be cheaper in the long run to acquire MS Server Datacenter licensing and build redundant services, this was the approved solution for moving VMs back and forth for node maintenance / upgrades.

                                  The ops layout is
                                  2x AD DC (one hosts DHCP server)
                                  1x SQL server for SharePoint
                                  1x SharePoint foundation
                                  1x Exchange server
                                  1x File Server (hosts a bunch of other services because of no additional server licenses)
                                  handful of other CentOS servers for monitoring, help desk, internal web server

                                  The ops cluster could likely be decommissioned, and what few services remain could be co-located on the dev environment, if I could only convince the owners to go with Office 365.

                                  • scottalanmillerS
                                    scottalanmiller @donaldlandru
                                    last edited by

                                    @donaldlandru said:

                                    #1 a.k.a the safe option
                                    HP StoreVirtual 4530 with 12 TB (7.2k) spindles in RAID6 -- this is our vendor recommendation. This is an HP renew quote with 3 years 5x9 support next-day on-site for ~$15,000

                                    http://www8.hp.com/us/en/products/disk-storage/product-detail.html?oid=6255484

                                    Other than being able to blame a vendor for losing data or uptime rather than being on the hook yourself, what makes this safe? Looking at it architecturally, I would call it reckless to the business as it is an inverted pyramid of doom. The unit is nothing but a normal server on which everything rests. How do you handle it failing? How do you do maintenance if you can't bring it down? And it is just RAID 6, which is fine, but no aspect of this makes it very safe.

                                    Having a vendor to blame is nice, but the vendor is only responsible for the product, not the system architectural design. Outages caused by this would still be your throat, not HP's. It's not that it is a bad unit, I just don't see how it could be used appropriately in this kind of a setup.

                                    • scottalanmillerS
                                      scottalanmiller @donaldlandru
                                      last edited by

                                      @donaldlandru said:

                                      The biggest concerns I have exist in both platforms (drives fail, controllers fail, data goes bad, etc) and have to be mitigated either way. That is what we have backups for -- in my opinion the HP gets me the following things:

                                      This is where you really have to look carefully. You have this big risk (and cost) that you know this does not mitigate. But having local drives with stand alone servers would partially mitigate this and local drives with replication would mitigate this better than nearly any possible approach. So you appear to have options that are faster, cheaper and potentially easier that also solve the biggest problem.

                                      • scottalanmillerS
                                        scottalanmiller @donaldlandru
                                        last edited by

                                        @donaldlandru said:

                                        24 spindle 900 GB (7.2k SAS) in 12 mirrored vdevs

                                        That's RAID 01, you never want that. You want 12 mirrors in a stripe for RAID 10.

                                        Understanding RAID 10 and RAID 01.
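The practical difference between the two layouts shows up after the first drive failure. The sketch below uses the textbook definitions (RAID 10 as a stripe of mirror pairs, RAID 01 as a mirror of two stripes) and assumes that a RAID 01 stripe goes offline entirely when any one of its drives fails and that the next failure is equally likely to hit any surviving drive. The function names are illustrative.

```python
from fractions import Fraction


def second_failure_risk_raid10(mirror_pairs: int) -> Fraction:
    """Chance that a second drive failure destroys a stripe of mirror pairs.

    Only the partner of the already-failed drive is fatal.
    """
    surviving = 2 * mirror_pairs - 1
    return Fraction(1, surviving)


def second_failure_risk_raid01(drives_per_stripe: int) -> Fraction:
    """Chance that a second drive failure destroys a mirror of two stripes.

    The first failure takes its whole stripe offline, so any drive in the
    other stripe is fatal.
    """
    surviving = 2 * drives_per_stripe - 1
    return Fraction(drives_per_stripe, surviving)


raid10 = second_failure_risk_raid10(12)
raid01 = second_failure_risk_raid01(12)
print(f"RAID 10, 24 drives: {raid10} ({float(raid10):.1%})")
print(f"RAID 01, 24 drives: {raid01} ({float(raid01):.1%})")
```

With 24 drives, the exposure after one failure is 1 in 23 for the stripe of mirrors versus 12 in 23 for the mirror of stripes, under the assumptions stated above.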

                                        • donaldlandruD
                                          donaldlandru
                                          last edited by donaldlandru

                                          Ok.. your feedback is actually showing something I have been afraid of: I have severe tunnel vision in servicing the current solution.
                                          Doing a quick inventory as to why I am trying to do that:

                                          1. We have the investment in this. As another recent thread here discussed, once an SMB gets heavily invested one way it is hard to switch. To be honest, I am not sure how I could convince them to at this point. This actually seems like an opportunity for a great learning experience
                                          2. Training of supporting resources -- I have a counterpart in our off-shore office who is just getting up to speed on how VMware works -- to me this will be even harder to change
                                          3. I have been using VMware for 4 years at the office and at home, so I am comfortable with it. This reason should also make the list as to why I should change it.

                                          One limiting factor I see right now is our current chassis are 1U with 2-4 drive bays which would hamper a local storage deployment.

                                          Edit -- Stepping back and thinking, the lack of drive bays is not a valid limiting factor, as I could easily add SAS and do DAS storage on these nodes.

                                          • donaldlandruD
                                            donaldlandru @scottalanmiller
                                            last edited by

                                            @scottalanmiller said:

                                            @donaldlandru said:

                                            24 spindle 900 GB (7.2k SAS) in 12 mirrored vdevs

                                            That's RAID 01, you never want that. You want 12 mirrors in a stripe for RAID 10.

                                            Understanding RAID 10 and RAID 01.

                                            This was modeled after the way TrueNAS (commercial version of FreeNAS) quoted us.
