How To Replace a Failed Drive in Hardware RAID

  • Assumptions before we begin:

    • You have hardware RAID (SmartArray, LSI, Adaptec, PERC, MegaRAID, etc.)
    • You have hot swap (every enterprise server offers hot swap, but in some cases it is possible to configure one without it.)

    What we have here is called Blind Hot Swap. It is not intrinsic to hardware RAID, but because all manufacturers implement hardware RAID this way, we get to treat the two as connected. This makes our lives very easy. This how-to applies to essentially all standard servers configured normally.

    Once a drive has failed this is what we do, and this is all that we do:

    1. Identify the failed drive, normally from a light indicator on the front of the drive slot
    2. Get a replacement drive
    3. Remove the failed drive
    4. Insert the replacement drive into the same slot
    5. Wait for the lights to tell you that all is healthy

    At no point should we need access to the hypervisor or operating system. We need nothing except the replacement parts and access to the server itself. In large enterprises this process is often performed by datacenter staff rather than by systems administrators, as this is purely a hardware task and requires no IT knowledge or interaction. The system identifies what is wrong and handles all of the repair on its own.

    Absolutely do not power down a system while it has a failed drive. This puts undue stress on the RAID array and increases risk.

    A server that is repairing (resilvering) its array can be used as normal; it should operate normally, only more slowly. However, while the server is in use the RAID array will not resilver at optimum speed. If you wish to speed up the repair process, reduce the workload on the RAID array as much as possible.
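
    To get a rough feel for how much workload matters, here is a minimal sketch in Python that estimates how long a rebuild takes at a given sustained rate. The function name, the 4 TB drive size, and the 150 MB/s and 50 MB/s rates are all hypothetical numbers picked for illustration, not measurements from any controller. It also assumes the controller rewrites the whole drive at a steady rate, which real arrays will not do exactly.

```python
def estimate_rebuild_hours(capacity_tb: float, rebuild_mb_per_s: float) -> float:
    """Hours needed to rewrite a full drive at a steady rebuild rate.

    Both inputs are illustrative assumptions; real rebuild rates vary
    with the controller, the RAID level, and the workload on the array.
    """
    if capacity_tb <= 0 or rebuild_mb_per_s <= 0:
        raise ValueError("capacity and rebuild rate must be positive")
    capacity_mb = capacity_tb * 1_000_000  # decimal TB to MB
    seconds = capacity_mb / rebuild_mb_per_s
    return seconds / 3600


# Hypothetical 4 TB drive: a mostly idle array versus a busy one.
idle_hours = estimate_rebuild_hours(4, 150)
busy_hours = estimate_rebuild_hours(4, 50)
print(f"idle: {idle_hours:.1f} h, busy: {busy_hours:.1f} h")
# prints "idle: 7.4 h, busy: 22.2 h"
```

    With these made-up rates, the busy array spends three times as long in a degraded state, which is the reason to lighten the load during the repair if you can.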

  • i have done this process for one server ya