Building a Basic DRBD Cluster on OpenSuse 12.2
-
One of the best features of Linux is the DRBD network RAID system. DRBD is now part of the Linux kernel, so all recent Linux distributions ship with it natively. Some distributions, like RHEL 6, are just old enough not to include it yet, but any subsequent releases will have it. The platforms using it most heavily today are OpenSuse and Ubuntu. Since later articles will move on to other aspects of high availability clustering which, to the best of my knowledge, are not working on the last several releases of Ubuntu (notably 12.04 and 12.10), we will base our project on OpenSuse 12.2. Suse has long been the “go to” Linux distribution for those most interested in storage. The Suse community has embraced advanced storage technologies, such as a broad selection of alternative filesystems, since its earliest days: it was the original pioneer of the Reiser filesystem projects and is now the first major distribution to include BtrFS, widely expected to be the “next big thing” for storage on the Linux platform.
The use of DRBD for network RAID 1 (shown as R(1) or R[1] in Network RAID Notation, depending on configuration) is the building block for many things. It provides a reliable block device base on which we can build our clusters. DRBD pioneered block replication and was the inspiration for FreeBSD’s HAST system. DRBD is currently at 8.4, but OpenSuse 12.2 ships 8.3, as does Ubuntu 12.10, so we will stick with the included version as it is mature and robust.
Before beginning, we need to install the DRBD packages and, if you, like me, start your OpenSuse installs from a bare minimum and build them up as packages are needed, remove the minimal_base pattern as well.
zypper -n remove patterns-openSUSE-minimal_base
zypper -n install drbd
At this stage we will assume that you already have the block device that you wish to add to your storage cluster ready. This could be an entire disk drive (such as /dev/sdb), a single partition of a block device (such as /dev/sdb3), a logical volume, a software RAID device (like /dev/md2), a hardware RAID logical drive or just about anything else you can imagine. In our example we will work with /dev/sdb, leveraging an entire drive, as this is very simple to see, is a common scenario when working with VirtualBox in a lab, and it is easy to extrapolate how to substitute a different device.
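As a quick sanity check before handing the device to DRBD, something along these lines (assuming /dev/sdb as in the example) confirms that the disk is present and does not already carry a filesystem:

# show the device, its size and whether anything on it is mounted
lsblk /dev/sdb
# blkid prints nothing if there is no recognizable filesystem signature on the device
blkid /dev/sdb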
One thing that can be tricky with a DRBD setup, or any type of cluster, is that to avoid working with raw IP addresses for everything you end up putting peer names into the /etc/hosts file to protect yourself against DNS errors. DNS resolution can cause latency, hiccups or outright outages that you really can’t risk with a cluster, but using straight IP addresses everywhere is too crusty, so the traditional answer is to rely on the hosts file instead. So let’s put our two cluster nodes into our hosts file to make this easy on ourselves. Our primary host is going to be “bert” and our secondary host is going to be “ernie”. We’ll also assume that we are using 10.0.0.1 and 10.0.0.2 as our nodes’ network RAID addresses (the addresses with which they will talk to each other for block transfers; these don’t necessarily have to be used for any other purpose.)
Note on Node Primacy: DRBD clusters do not have a primary and a secondary host role in a persistent state. At any given moment either node may be the primary or the secondary, but neither physical node is permanently assigned such a role. The DRBD pair acts as a single unit of storage and will flip roles based on external factors. We only think of one as primary and the other as secondary for the act of cluster creation, during which the first node must be created as primary by the simple nature of being the only node at the time of creation. Once the secondary node has joined the cluster and become fully synchronized with the primary, the two are “one” in their own minds and hopefully in ours. Once the cluster is created, refrain from thinking of one unit as being the “main” one and the other as a backup. This is a peering relationship and both nodes are equal. We only care which node is acting as primary at any given moment for the purposes of troubleshooting or maintenance.
echo "10.0.0.1 bert" >> /etc/hosts echo "10.0.0.2 ernie" >> /etc/hosts
Now we are ready to actually create the configuration file for DRBD. This is a very basic configuration file but will serve our purposes for setting up basic DRBD replication. Input the following as the contents of /etc/drbd.conf:
global {
  usage-count no;
}
common {
  syncer {
    rate 33M;
  }
}
resource r0 {
  protocol C;
  startup {
    wfc-timeout 15;
    degr-wfc-timeout 60;
  }
  net {
    cram-hmac-alg sha1;
    shared-secret "secretphrase";
  }
  on bert {
    device /dev/drbd0;
    disk /dev/sdb;
    address 10.0.0.1:7788;
    meta-disk internal;
  }
  on ernie {
    device /dev/drbd0;
    disk /dev/sdb;
    address 10.0.0.2:7788;
    meta-disk internal;
  }
}
A couple of notes about the above configuration file:
- Syncer Rate: This is the background synchronization rate that DRBD uses to make the two replicated block devices consistent. The rule of thumb is to set this to no more than one third of your available synchronization bandwidth (leaving two thirds for foreground synchronization) to keep from overly impacting write performance when the system brings itself back into sync. You can alter this by hand when necessary on a live system, so this is just the default rate that DRBD uses unless you tell it otherwise (see the first sketch after this list.) So if you know that write performance doesn’t matter overnight and you want to catch up on a large resync, you could crank this to 95% of the link for off-hours and let it sync up as quickly as possible. This unit is in bytes, not bits, so be careful. The 33M value that I used here is appropriate for a dedicated GigE connection, which is pretty typical of DRBD installations.
- Protocol: There are three replication protocols available in DRBD – A, B and C. C is the safest and the most common to use; it is fully synchronous (a write is not confirmed on the master node until the slave node has confirmed that it is up to date.) That safety comes with a performance penalty. Synchronous network RAID 1 would be written R(1). Protocols A and B are asynchronous replication, so would be written R[1]. They lower write latency on the master by not requiring the slave node to confirm a successful write before the master confirms the write back to the operating system. This means that there is a brief moment when the two systems are in an inconsistent state. Protocol B is called “memory synchronous” or “semi-synchronous” because it confirms that the peer (slave) node has received the replication data but does not wait for it to commit the write to disk before continuing. This is nearly always safe, especially if the two systems are highly isolated from each other; a joint power outage would be the primary concern here. Protocol A is fully asynchronous and introduces no delay for local writes because it does not wait for the remote node in any way, so replication data may still be sitting in a send queue at the moment of a system failure.
- Port: In this example we are using the default replication port of 7788, but you may choose any port that you like. Each resource requires its own port number, so it is typical to simply increment this number for each additional resource. Remember to open the firewall for whatever port(s) you intend to use for replication (see the second sketch after this list.)
- Resource Name: In this example I named our DRBD resource “r0”. This is very common and was introduced by Linbit as a standard naming convention. Naming after function is also common: a replicated storage cluster used for NFS might be named “nfs” instead of “r0” to make its purpose clear. I generally do this outside of examples.
- Block Device Name: It is accepted practice to name the resultant block device of a DRBD cluster “/dev/drbd#” where # starts at zero and increments with each additional resource created. While you can name the device whatever you like, it would be bad form and very confusing to deviate from this practice.
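For the syncer rate note above, a rough sketch of a temporary, on-the-fly override on a running resource using the DRBD 8.3 tools (the 110M figure is only an illustration; pick a value that suits your own link):

# temporarily raise the background resync rate for the r0 device
drbdsetup /dev/drbd0 syncer -r 110M
# when the off-hours window is over, drop back to the rate defined in /etc/drbd.conf
drbdadm adjust r0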
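For the port note above, a minimal firewall sketch using plain iptables (OpenSuse normally manages its firewall through SuSEfirewall2, so treat this as an illustration and adapt it to your firewall of choice; use the peer’s address, so 10.0.0.2 on bert and 10.0.0.1 on ernie):

# allow inbound DRBD replication traffic from the peer on TCP port 7788
iptables -A INPUT -p tcp -s 10.0.0.2 --dport 7788 -j ACCEPT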
Now that we have our configuration file, we need to manually create the metadata that our resource needs in order to run. If all goes well we can simply run this command and we are almost ready to begin:
drbdadm create-md r0
Once that completes we can start the DRBD service itself, which will read our configuration file and begin bringing up our block replication system.
service drbd start
Now because our cluster has not existed previously, the system will feel that it is in an inconsistent state. We need to tell the system that it is the master and that nothing else matters (become master without prejudice.) So we will execute the following command (caution, never execute this on an existing cluster unless you know exactly what you are doing!):
drbdadm -- --overwrite-data-of-peer primary r0
If we check our DRBD status, we should see that we are running as the primary node of a one-node cluster (not very exciting.)
service drbd status
cat /proc/drbd
At this point we have enough to begin using our storage locally. Let’s create a filesystem and mount it at /data:
mkfs.ext4 /dev/drbd0
mkdir /data
mount /dev/drbd0 /data
cd /data; touch test_file_on_drbd
Now all that we need to do is set the DRBD service to start automatically at system boot and add an fstab entry so that our newly created filesystem mounts at boot time, and our first node is complete.
chkconfig drbd on
echo "/dev/drbd0 /data ext4 defaults 0 0" >> /etc/fstab
If we reboot we should, if all goes well, see that DRBD has started and that /data has mounted and is now a normal filesystem that we can use like any other. Now it is time to configure our second node. This will go even faster. In fact, we simply go back to the top and run the same process on ernie that we did on bert, except that on ernie we stop as soon as DRBD has been started (in particular, do not run the overwrite-data-of-peer command and do not create or mount a filesystem there.) Then run:
drbdadm secondary r0
chkconfig drbd on
Once we do that, checking our status commands should show that synchronization has begun, with a steady stream of data being sent over from bert. Depending on your connection, your syncer rate and the speed and size of your disk array, this might take anywhere from a few minutes to weeks. You can see the progress as it goes, though.
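A simple way to watch it is just to poll the status (purely a convenience; any interval works):

# refresh the DRBD status every two seconds while the initial sync runs
watch -n2 cat /proc/drbd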
Once it is done, you should see bert running as primary and ernie running as secondary. There is no mechanism to flip this in our current setup, so while primacy is in the eye of the beholder, in this odd case, bert truly is permanently the primary unless you manually switch them.
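If you ever do want to flip the roles by hand, the general shape is something like this (a sketch only; it assumes nothing on bert is using /data at the time):

# on bert: stop using the filesystem and step down to secondary
umount /data
drbdadm secondary r0

# on ernie: take over as primary and mount the replicated filesystem
drbdadm primary r0
mkdir -p /data
mount /dev/drbd0 /data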
Why would we use this type of cluster? In this case it is almost certainly going to be used as a building block for more clustering, but not necessarily. This can be a great way of reliably protecting data that you just want to be able to recover “identically” later. Let’s say that ernie dies. Replacement is easy: there is zero impact to bert, you just replace ernie and over time it syncs back up. If bert dies you have choices: switch to running off of ernie, or keep ernie as-is, replace bert and let ernie feed all of the data back to it. A block-identical replacement system: useful, but hardly what we normally think of as a cluster.
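As a sketch of what replacing a dead ernie would look like (it simply repeats the steps from this article on the fresh node; no data ever has to be copied by hand):

# on the replacement ernie: install DRBD, restore the hosts entries and drop in the same /etc/drbd.conf
zypper -n install drbd
# initialize metadata on the new backing device and join the cluster
drbdadm create-md r0
service drbd start
drbdadm secondary r0
chkconfig drbd on
# bert then streams every block back over as a normal background resync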
Today’s article was just to introduce DRBD, get our feet wet and create a starting point from which to build a high availability cluster, with HA starting at the storage layer and going all the way up to the application. As we go further, the power of DRBD will become evident.
Originally posted in 2012 on my Linux blog at: http://web.archive.org/web/20140822224153/http://www.scottalanmiller.com/linux/2013/01/28/building-a-basic-drbd-cluster-on-opensuse-12-2/
-
What is the purpose of 'zypper -remove patterns-openSUSE-minimal_base'
I know what it does, but is it necessary? Does this minimal_base package prevent you from installing certain packages at a later date? Or you just making things nice n neat? -
@momurda said in Building a Basic DRBD Cluster on OpenSuse 12.2:
What is the purpose of 'zypper -remove patterns-openSUSE-minimal_base'
I know what it does, but is it necessary? Does this minimal_base package prevent you from installing certain packages at a later date? Or you just making things nice n neat?

It's been quite some time, but if I remember correctly it interfered with some packages that we needed, as the "minimal" pattern blocked adding a bunch of stuff.