ZFS Based Storage for Medium VMWare Workload
- 
 @donaldlandru said: Out of the hardware in our datacenter I have had the one MSA controller fail, a P420 in the HP DL360p G8 and a PERC in the Dell 2950, all inside the four years I have been here. To me this shows no better level of reliability than the other. Both of the controller failures in the blades caused downtime to the organization, the failure in the MSA did not. Those are crazy high failure rates for all of those. PERCs I have not measured in large quantity but SmartArrays I have, by the thousands, and the failure rates are minuscule, a fraction of the failure rates of memory sticks, for example. 
- 
 @Dashrender said: Because of the volatility of your dev environment, I wonder if using a SAM-SD for central storage would be best. What happens if the entire storage array is down? Can you live for a day or two without it on the dev environment? What are you planning for backups on it? What is your RTO and RPO? His proposed ZFS-based storage option is a SAM-SD, just in case anyone missed that. 
- 
 @Dashrender said: Because of the volatility of your dev environment, I wonder if using a SAM-SD for central storage would be best. What happens if the entire storage array is down? Can you live for a day or two without it on the dev environment? What are you planning for backups on it? What is your RTO and RPO? Your operations systems - I like the two node synced approach, if you even really need that, but you already have the two servers. That is pretty much where this all started: do I need to fork out the money to HP, or is the other way good enough? In operations the RTO/RPO is 24 hours. We carry our HP care pack on the MSA. Everything is backed up by Veeam every few hours throughout the day and replicated offsite. We have physical access to the offsite location in case of datacenter failure for faster recovery. For the development environments, up to six months ago there was no backup at all, as the thought was this could be rebuilt from scratch. This was until I outlined the effort it would take to bring everything back -- roughly 6 months. Now the RPO is one week with an RTO of 72 hours. 
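
 To make the RPO/RTO figures above concrete, here is a minimal Python sketch. The class and function names are hypothetical, and the backup interval and restore time passed in are placeholder assumptions, not numbers from this thread. Only the objectives themselves (24 hours for operations, one week RPO and 72 hours RTO for development) come from the post above.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    rpo_hours: float  # maximum tolerable data loss, in hours
    rto_hours: float  # maximum tolerable time to restore service, in hours


# Objectives as stated in the thread.
OPERATIONS = RecoveryObjectives(rpo_hours=24, rto_hours=24)
DEVELOPMENT = RecoveryObjectives(rpo_hours=7 * 24, rto_hours=72)


def meets_objectives(objectives: RecoveryObjectives,
                     backup_interval_hours: float,
                     estimated_restore_hours: float) -> bool:
    """Return True if a backup schedule and restore estimate fit the objectives.

    Worst-case data loss is roughly the gap between two backups, so the
    interval is compared against the RPO and the restore time against the RTO.
    """
    meets_rpo = backup_interval_hours <= objectives.rpo_hours
    meets_rto = estimated_restore_hours <= objectives.rto_hours
    return meets_rpo and meets_rto
```

 For example, `meets_objectives(DEVELOPMENT, backup_interval_hours=24, estimated_restore_hours=48)` checks an assumed nightly backup with an assumed two-day restore against the development objectives.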
- 
 @scottalanmiller said: @Dashrender said: Because of the volatility of your dev environment, I wonder if using a SAM-SD for central storage would be best. What happens if the entire storage array is down? Can you live for a day or two without it on the dev environment? What are you planning for backups on it? What is your RTO and RPO? His proposed ZFS-based storage option is a SAM-SD, just in case anyone missed that. You're right it is, but for the dev environment it might be all that he needs with a good backup solution. He's currently hamstrung by his old servers - two of which are slated to be replaced in the next year or so. Perhaps he should do nothing until it's time to replace those boxes. 
- 
 @Dashrender said: @scottalanmiller said: @Dashrender said: Because of the volatility of your dev environment, I wonder if using a SAM-SD for central storage would be best. What happens if the entire storage array is down? Can you live for a day or two without it on the dev environment? What are you planning for backups on it? What is your RTO and RPO? His proposed ZFS-based storage option is a SAM-SD, just in case anyone missed that. You're right it is, but for the dev environment it might be all that he needs with a good backup solution. He's currently hamstrung by his old servers - two of which are slated to be replaced in the next year or so. Perhaps he should do nothing until it's time to replace those boxes. I can't do nothing, I do not have enough storage to host a new client that starts soon. I have to do something there. I am not opposed to overall architecture changes in a refresh cycle, but in the meantime -- I have a budget and need disk. 
- 
 That all supports that HA is total overkill. HA is for when ten minutes is too long. Not for when "we can be down for an hour or two in a disaster." 
- 
 @donaldlandru said: Here is what the business cares about in the solution: Reliable solution that provides necessary resources for the development environments to operate effectively (read: we do not do performance testing in-house as, by its very nature, it is very much a "your mileage may vary" depending on your deployment situation). In addition to the business requirements, I have added my own requirements that my boss agrees with and blesses.
- Operations and Development must be on separate storage devices
- Storage systems must be built of business class hardware (no RED drives -- although I would allow this in a future Veeam backup storage target)
- Must be expandable to accommodate future growth
 Requirements for development storage
- 9+ TiB of usable storage
- Support a minimum of 1100 random IOPS (what our current system is peaking at)
- Disks must be in some kind of array (ZFS, RAID, mdadm, etc.)
 Back to the original requirements list. HA and FT are not listed as needed for the development environment. This conversation went sideways when we started digging into the operations side (where there should be HA) and I have a weak point, the storage. 
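
 As a rough way to test a candidate disk layout against the development requirements above (9+ TiB usable, 1100 random IOPS), here is a minimal Python sketch. It assumes a RAID 10 style layout (striped mirrors), which is only one of the options the requirements allow. The per-drive IOPS figure and the read/write mix are inputs you would have to measure or estimate yourself; nothing in this thread specifies them. The class and function names are hypothetical.

```python
from dataclasses import dataclass

TB = 10**12   # decimal terabyte, the unit drives are sold in
TIB = 2**40   # binary tebibyte, the unit the requirement is stated in


@dataclass
class Raid10Array:
    drives: int        # total drive count (even, since drives are mirrored in pairs)
    drive_tb: float    # per-drive capacity in decimal TB
    drive_iops: float  # ASSUMED random IOPS per drive, not from the thread

    def usable_tib(self) -> float:
        # Every drive is mirrored, so half the raw capacity is usable.
        return (self.drives / 2) * self.drive_tb * TB / TIB

    def random_iops(self, read_fraction: float) -> float:
        # A read costs one back-end operation; a write costs two
        # (one per side of the mirror).
        raw = self.drives * self.drive_iops
        write_fraction = 1.0 - read_fraction
        return raw / (read_fraction + 2.0 * write_fraction)


def meets_dev_requirements(array: Raid10Array,
                           read_fraction: float = 0.7,  # ASSUMED workload mix
                           min_tib: float = 9.0,
                           min_iops: float = 1100.0) -> bool:
    return (array.usable_tib() >= min_tib
            and array.random_iops(read_fraction) >= min_iops)
```

 A call such as `meets_dev_requirements(Raid10Array(drives=12, drive_tb=4.0, drive_iops=75))` shows how a layout can clear the capacity bar while missing the IOPS bar, so both numbers need checking together. This ignores caching (ARC, L2ARC, CacheCade), which can change real-world results considerably.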
- 
 @donaldlandru said: Back to the original requirements list. HA and FT are not listed as needed for the development environment. This conversation went sideways when we started digging into the operations side (where there should be HA) and I have a weak point, the storage. Okay, so we are looking exclusively at the non-production side? But production completely lacks HA today. It should be a different thread, but your "actions" say you don't need HA in production even if you feel that you do. Either what you have today isn't good enough and has to be replaced there, or HA isn't needed since you've happily been without it for so long. This can't be overlooked - you are stuck with either falling short of a need or not being clear on the needs for production. 
- 
 For dev, why do anything except replace the nodes with a single node that can handle the load? Cheap, simple, easy. 
- 
 The cost of external storage for the compute nodes is a huge percentage of the cost of just replacing the whole thing, right? If you could spend $14K on an MSA for them, you should be able to spend around $16K, I'm guessing, to get a single node with more CPU and more RAM than you have between the two nodes currently while getting a storage system that is bigger and likely orders of magnitude faster. 
- 
 @scottalanmiller said: @donaldlandru said: Back to the original requirements list. HA and FT are not listed as needed for the development environment. This conversation went sideways when we started digging into the operations side (where there should be HA) and I have a weak point, the storage. Okay, so we are looking exclusively at the non-production side? But production completely lacks HA today. It should be a different thread, but your "actions" say you don't need HA in production even if you feel that you do. Either what you have today isn't good enough and has to be replaced there, or HA isn't needed since you've happily been without it for so long. This can't be overlooked - you are stuck with either falling short of a need or not being clear on the needs for production. Ahh -- there is the detail I missed. Just re-read my post and that doesn't make this clear. Yes, the discussion was supposed to pertain to the non-production side. My apologies. I agree we do lack true HA in the production side as there is a single weak link (one storage array). The solution here depends on our move to Office 365, as that would take most of the operations load off of the network and change the requirements completely. We have quasi-HA with the current solution, but now based on new enlightenment I would agree it is not fully HA. 
- 
 Curiosity got the better of me, so I went to xByte to see... You can build a nice SAM-SD based on a Dell R720 from xByte for around 10k ... But that included 256GB of RAM and 8 x 1.2 TB SAS drives (they don't have any larger drives listed on their web site)... and 3 Year Warranty... (I have a PDF (https://beta.wellston.biz/xByte SAM-SD.pdf) of how I configured it if anybody wants to see)... 
- 
 @donaldlandru said: Ahh -- there is the detail I missed. Just re-read my post and that doesn't make this clear. Yes, the discussion was supposed to pertain to the non-production side. My apologies. LOL, a rather sizeable detail. I think we've been focused almost entirely on the operations cluster in our discussion and/or putting the two together to assess needs as a whole - which is worth considering, is there actually a good reason that they are independent to this level?
- 
 @dafyre said: Curiosity got the better of me, so I went to xByte to see... You can build a nice SAM-SD based on a Dell R720 from xByte for around 10k ... But that included 256GB of RAM and 8 x 1.2 TB SAS drives (they don't have any larger drives listed on their web site)... and 3 Year Warranty... (I have a PDF (https://beta.wellston.biz/xByte SAM-SD.pdf) of how I configured it if anybody wants to see)... Yup, using xByte and the PowerEdge R720xd (did you do the 720 or the 720xd?) you can get quite a monster of a server. We have a reference PowerEdge R720xd at the NTG Labs for this. Only 128GB of RAM, though. With the 720xd you can do 12x LFF drives plus two SSDs in CacheCade. Sure, you are going to spend a little more for that than what you quoted, but not tons more and that is a 50% leap in drive capacity and an insane leap in potential IOPS with the CacheCade included.
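
 The drive-count arithmetic in this exchange can be sketched in a few lines of Python. The function names are hypothetical, and reading the "50% leap" as the jump from 8 to 12 drive bays is an interpretation, not something the post spells out. The sketch also converts the quoted 8 x 1.2 TB configuration from decimal TB to TiB, since the development requirement earlier in the thread is stated in TiB.

```python
TB = 10**12   # decimal terabyte, the unit drives are sold in
TIB = 2**40   # binary tebibyte, the unit the 9+ TiB requirement uses


def raw_tib(drive_count: int, drive_tb: float) -> float:
    """Raw capacity in TiB, before any RAID or ZFS redundancy is subtracted."""
    return drive_count * drive_tb * TB / TIB


def bay_increase_percent(old_bays: int, new_bays: int) -> float:
    """Percentage increase in drive count going from old_bays to new_bays."""
    return (new_bays - old_bays) / old_bays * 100.0


quoted_raw = raw_tib(8, 1.2)               # the 8 x 1.2 TB R720 build quoted above
bay_gain = bay_increase_percent(8, 12)     # R720 (8 drives) vs R720xd (12 LFF)
print(f"8 x 1.2 TB raw: {quoted_raw:.2f} TiB")
print(f"8 -> 12 drives: +{bay_gain:.0f}%")
```

 The raw figure for the quoted build comes out below 9 TiB even before mirroring or parity, which is worth keeping in mind when comparing it with the 9+ TiB usable requirement. The LFF drive size for the 720xd is not given in the thread, so it is left as a parameter rather than guessed.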
- 
 @scottalanmiller said: The cost of external storage for the compute nodes is a huge percentage of the cost of just replacing the whole thing, right? If you could spend $14K on an MSA for them, you should be able to spend around $16K, I'm guessing, to get a single node with more CPU and more RAM than you have between the two nodes currently while getting a storage system that is bigger and likely orders of magnitude faster. HP DL360p Gen 8 with 2 Intel E5-2640 and 384GB RAM cost us roughly $13k each -- this is without local drives. On our current large compute node I am only 20% utilized on CPU and 50% utilized on RAM (at peak). I am, however, out of storage, which I can add for as cheap as $5k with RED drives or $10k with Seagate SAS drives. The $13k does not include VMWare licensing, and it is obviously much debated whether I even need it; however, since I am decommissioning 4 CPUs when we upgrade, I still have available licenses. 
- 
 @donaldlandru said: I agree we do lack true HA in the production side as there is a single weak link (one storage array), the solution here depends on our move to Office 365 as that would take most of the operations load off of the network and change the requirements completely. Good deal. We use O365, it is mostly great. 
- 
 @donaldlandru said: Which I can add for as cheap as $5k with RED drives or $10k with Seagate SAS drives. WD makes RE and Red drives. Don't call them RED, it is hard to tell if you are meaning to say RE or Red. The Red Pro and SE drives fall between the Red and the RE drives in the lineup. Red and RE drives are not related. RE comes in SAS, Red is SATA only. 
- 
 @scottalanmiller said: @donaldlandru said: Ahh -- there is the detail I missed. Just re-read my post and that doesn't make this clear. Yes, the discussion was supposed to pertain to the non-production side. My apologies. LOL, a rather sizeable detail. I think we've been focused almost entirely on the operations cluster in our discussion and/or putting the two together to assess needs as a whole - which is worth considering, is there actually a good reason that they are independent to this level? LOL -- it's all in the details. Is there a :sheepish: emoji??? Nope. As to them being separate, this was a design consideration outside of my control, as I was hired in mid-process. I believe the thought was to have a separate pane of glass. I would much rather have a three node cluster in this case holding both roles, but what I have is what I have. If I bring up that the operations nodes only have 1 CPU each and only 64GB of memory, I just cringe and this goes in a third direction. 
- 
 Used to have emojis, they broke. 
- 
 @scottalanmiller That was definitely the R720, not the XD... I get to go back and do it again in a little bit. 



