HA Cluster with Linux Containers based on Heartbeat, Pacemaker, DRBD and LXC

Please note that this article or category either refers to older software or hardware components or is no longer maintained for other reasons. This page is no longer updated and remains available in the archive purely for reference purposes.

The following article describes how to set up a two-node HA (high availability) cluster with

lightweight virtualization (Linux containers, LXC),
data replication (DRBD),
cluster management (Pacemaker, Heartbeat),
logical volume management (LVM),
and a graphical user interface (LCMC).

As a result you will get a very resource- and cost-efficient shared-nothing cluster solution based completely on Open Source. Ubuntu 12.04 LTS is used as operating system.

The cluster is operated in active-active mode in the sense that resources are running on both nodes, but without sharing them through a cluster filesystem. This utilizes both servers and does not degrade one of them to a hot-standby-only system. However, keep in mind that one single server must still be able to run all resources that normally run on both servers. For example, the RAM used on both servers together must not exceed what one single server can offer.
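
The RAM rule above can be checked with a few lines of shell. The following is only a minimal sketch: the helper name fits_on_one_node is hypothetical, all values are given in MB, and the memory needed by the host system itself is not taken into account.

```shell
# Hypothetical helper (not part of any package): check whether the memory
# limits of all containers (in MB) still fit into the RAM of one node (in MB).
# Usage: fits_on_one_node NODE_RAM_MB LIMIT_MB [LIMIT_MB ...]
fits_on_one_node() {
    node_ram_mb=$1
    shift
    total=0
    for limit in "$@"; do
        total=$((total + limit))
    done
    echo "containers need ${total} MB, one node offers ${node_ram_mb} MB"
    # Succeed only if a single node can carry every container on its own.
    [ "$total" -le "$node_ram_mb" ]
}

# Example: 8 GB per node, three containers limited to 2048, 2048 and 512 MB
# fits_on_one_node 8192 2048 2048 512
```

The function returns a non-zero exit status as soon as the sum of all limits exceeds the RAM of one node, so it can be used in a simple "if" statement.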

The presentation from LinuxCon Europe 2012 gives further details about this topic: Event-News: LinuxCon Europe 2012

Disclaimer: This HOWTO is intended for advanced Linux users that are able to use the command line. Also have a look at https://wiki.ubuntu.com/LxcSecurity to get information about the security issues the use of Linux containers may have at the moment.

Hardware

Thomas Krenn 1U Intel Single-CPU CSE512 Server
8 GB RAM
2x 1000 GB SATA II WD Raid Edition IV 3,5"
Full Remote Management (KVM over LAN, IPMI 2.0)

OS Installation

The Ubuntu 12.04 Server 64-bit installation is done from a USB stick prepared with "usb-creator-gtk". Ensure that you boot the USB stick in UEFI mode if you want to install Ubuntu in UEFI boot mode. Linux Software RAID (md) is used for RAID1 functionality.

On top of the RAID1 device the following partitioning scheme is established:

/boot/efi 512 MB (not mirrored with RAID1)
/ 50 GB
/var 10 GB
swap 8 GB
drbd1 465 GB (prepare a software RAID device, don't set a mount point)
drbd2 465 GB (prepare a software RAID device, don't set a mount point)

Here is the resulting Disk layout as shown by "lsblk":

# lsblk 
NAME                                MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
sda                                   8:0    0 931.5G  0 disk  
├─sda1                                8:1    0 488.3M  0 part  /boot/efi
├─sda2                                8:2    0  46.6G  0 part  
│ └─md0                               9:0    0  46.6G  0 raid1 /
├─sda3                                8:3    0   9.3G  0 part  
│ └─md1                               9:1    0   9.3G  0 raid1 /var
├─sda4                                8:4    0   7.5G  0 part  
│ └─md2                               9:2    0   7.5G  0 raid1 [SWAP]
├─sda5                                8:5    0 433.1G  0 part  
│ └─md3                               9:3    0 433.1G  0 raid1 
└─sda6                                8:6    0 433.1G  0 part  
  └─md4                               9:4    0 433.1G  0 raid1 
sdb                                   8:16   0 931.5G  0 disk  
├─sdb1                                8:17   0 488.3M  0 part  
├─sdb2                                8:18   0  46.6G  0 part  
│ └─md0                               9:0    0  46.6G  0 raid1 /
├─sdb3                                8:19   0   9.3G  0 part  
│ └─md1                               9:1    0   9.3G  0 raid1 /var
├─sdb4                                8:20   0   7.5G  0 part  
│ └─md2                               9:2    0   7.5G  0 raid1 [SWAP]
├─sdb5                                8:21   0 433.1G  0 part  
│ └─md3                               9:3    0 433.1G  0 raid1 
└─sdb6                                8:22   0 433.1G  0 part  
  └─md4                               9:4    0 433.1G  0 raid1 
drbd1                               147:1    0 433.1G  0 disk  
drbd2                               147:2    0 433.1G  1 disk
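
Before continuing, it can be worth confirming that all md arrays shown above are fully mirrored. The following sketch reads the contents of /proc/mdstat on standard input and prints every array whose status field contains a missing member ("_"). The function name degraded_md_arrays is an assumption made for this example.

```shell
# Hypothetical helper: list md arrays that are running degraded.
# Expects the contents of /proc/mdstat on standard input.
degraded_md_arrays() {
    awk '
        # Remember the array name from lines like "md0 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # The status field looks like [UU] (healthy) or [U_] (degraded)
        $NF ~ /^\[[U_]+\]$/ && $NF ~ /_/ { print name }
    '
}

# Example:
# degraded_md_arrays < /proc/mdstat
```

An empty output means that no array reports a missing disk.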

EFI bootloader on second disk

Since the UEFI boot partition does not support software RAID, we need to install the UEFI bootloader on the second disk (/dev/sdb) manually.

node1 + node2: mount | grep sda1
node1 + node2: umount /boot/efi
node1 + node2: mkfs.vfat /dev/sdb1
node1 + node2: parted /dev/sdb set 1 boot on
node1 + node2: mount /dev/sdb1 /boot/efi
node1 + node2: grub-install --bootloader-id ubuntu2 /dev/sdb

We recommend testing a boot from the second disk. To do so, boot into the BIOS and select "ubuntu2" there. The system should boot up normally.

Since we occasionally had software RAID boot errors, we increased the GRUB boot timeout from 2 to 5 seconds (/etc/default/grub).

Moreover, os-prober problems like those described under "System updates on the cluster nodes" can be avoided by adding "GRUB_DISABLE_OS_PROBER=true" to /etc/default/grub.
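
The two GRUB settings mentioned above could be applied with a small helper like the one below. This is a sketch under assumptions: the function name set_grub_option is hypothetical, and it simply drops an existing assignment of the key and appends the new one. Changes to /etc/default/grub only take effect after running update-grub.

```shell
# Hypothetical helper: set KEY=VALUE in a shell-style config file such as
# /etc/default/grub, replacing an existing assignment of KEY if present.
# Usage: set_grub_option FILE KEY VALUE
set_grub_option() {
    file=$1 key=$2 value=$3
    tmp=$(mktemp) || return 1
    # Keep every line except an existing assignment of KEY ...
    grep -v "^${key}=" "$file" > "$tmp"
    # ... and append the new assignment at the end.
    echo "${key}=${value}" >> "$tmp"
    # Write back with cat so that owner and permissions of FILE are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Example (run as root on node1 and node2):
# set_grub_option /etc/default/grub GRUB_TIMEOUT 5
# set_grub_option /etc/default/grub GRUB_DISABLE_OS_PROBER true
# update-grub
```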

Please also add the option "nobootwait" to /etc/fstab for the /boot/efi entry. This ensures that the system automatically boots when the first drive is missing.
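
As an illustration of the fstab change, the following sketch appends "nobootwait" to the options of the /boot/efi entry. It reads an fstab on standard input and writes the result to standard output, so the result can be reviewed before replacing /etc/fstab. The function name add_nobootwait is hypothetical, and the whitespace of the modified line is normalized to single spaces.

```shell
# Hypothetical helper: add the "nobootwait" mount option to the /boot/efi
# entry of an fstab read from standard input. Comment lines and entries
# that already carry the option are left untouched.
add_nobootwait() {
    awk '
        $1 !~ /^#/ && $2 == "/boot/efi" && $4 !~ /(^|,)nobootwait(,|$)/ {
            $4 = $4 ",nobootwait"
        }
        { print }
    '
}

# Example:
# add_nobootwait < /etc/fstab > /tmp/fstab.new
# diff /etc/fstab /tmp/fstab.new
```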

System installation and configuration

node1 + node2: apt-get install lxc lvm2 vim postfix mailutils screen

We use eth0 as our external interface. Therefore we need to create a bridge br0 on top of eth0 that provides network connectivity for the Linux containers. This is configured in /etc/network/interfaces:

# The primary network interface
auto br0
iface br0 inet static
        bridge_ports eth0
        bridge_fd 0
        bridge_maxwait 0
	address 203.0.113.10
	netmask 255.255.255.0
	network 203.0.113.0
	broadcast 203.0.113.255
	gateway 203.0.113.1
	dns-nameservers 203.0.113.1 8.8.8.8

The network interface "eth1" is used as internal interface for DRBD replication and Heartbeat communication. We assign 192.168.255.1/24 to node1 and 192.168.255.2/24 to node2.
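
The article does not show the eth1 configuration itself. A stanza for /etc/network/interfaces could look like the output of the following sketch, which takes the node number (1 or 2) as its argument. The function name print_eth1_stanza is hypothetical, and the netmask is derived from the /24 prefix given above.

```shell
# Hypothetical helper: print an /etc/network/interfaces stanza for the
# internal interface eth1 of node1 (argument 1) or node2 (argument 2).
print_eth1_stanza() {
    node_number=$1
    cat <<EOF
# Internal interface for DRBD replication and Heartbeat communication
auto eth1
iface eth1 inet static
        address 192.168.255.${node_number}
        netmask 255.255.255.0
EOF
}

# Example (on node2, as root):
# print_eth1_stanza 2 >> /etc/network/interfaces
```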

After that we launch the Linux Cluster Management Console (LCMC). You can download it from http://lcmc.sourceforge.net/ or use the web start version from there.

With the help of LCMC we install and configure Pacemaker, Heartbeat and DRBD.

The following additional steps are done after configuration:

Add a delay for Heartbeat:

node1 + node2: echo "sleep 20" > /etc/default/heartbeat

Modify the LVM configuration:
- see http://www.drbd.org/users-guide/s-lvm-drbd-as-pv.html
- Please reboot both nodes now, to ensure the LVM filter is working.
Create DRBD devices drbd1 and drbd2 with LCMC
Create LVM volume groups on the commandline

node1: drbdadm primary r1
node1: pvcreate /dev/drbd1
node1: vgcreate vg_node1 /dev/drbd1
node1: vgchange -a n vg_node1
node1: drbdadm secondary r1

node2: drbdadm primary r2
node2: pvcreate /dev/drbd2
node2: vgcreate vg_node2 /dev/drbd2
node2: vgchange -a n vg_node2
node2: drbdadm secondary r2
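
To verify that the LVM filter from the step above is effective, one can check that LVM only reports physical volumes on DRBD devices. The sketch below reads PV names on standard input, for example from "pvs --noheadings -o pv_name", and prints every name that is not a /dev/drbdN device. The function name non_drbd_pvs is an assumption for this example. Note that on a node where the DRBD resource is secondary, no PV is listed at all.

```shell
# Hypothetical helper: print all physical volumes that are not DRBD devices.
# Exit status is 0 if every PV is a /dev/drbdN device, 1 otherwise.
non_drbd_pvs() {
    awk '
        BEGIN { bad = 0 }
        NF && $1 !~ /^\/dev\/drbd[0-9]+$/ { print $1; bad = 1 }
        END { exit bad }
    '
}

# Example:
# pvs --noheadings -o pv_name | non_drbd_pvs && echo "only DRBD PVs visible"
```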

Create DRBD (linbit:drbd) Pacemaker resources in LCMC
Install the latest lxc resource agent for Pacemaker on node1 and node2:

node1 + node2: cd /usr/lib/ocf/resource.d/heartbeat
node1 + node2: mv lxc lxc.orig
node1 + node2: wget https://raw.github.com/ClusterLabs/resource-agents/master/heartbeat/lxc
node1 + node2: chmod +x lxc
# set the lxc package on hold until the bug is fixed
node1 + node2: echo lxc "hold" | dpkg --set-selections

Install the "screen" package from Debian wheezy to fix a race condition (see [1]):

node1 + node2: wget http://ftp.de.debian.org/debian/pool/main/s/screen/screen_4.1.0~20120320gitdb59704-5_amd64.deb
node1 + node2: dpkg -i screen_4.1.0~20120320gitdb59704-5_amd64.deb
# set the screen package on hold until the bug is fixed
node1 + node2: echo screen "hold" | dpkg --set-selections

Increase the default timeouts in the "/usr/lib/ocf/resource.d/heartbeat/lxc" resource agent: set the "start" and "stop" actions to "60" seconds and the monitor "timeout" to "30" seconds. This makes unintended restarts and switchovers of containers less likely.

<actions>
<action name="start"        timeout="60" />
<action name="stop"         timeout="60" />
<action name="monitor"      timeout="30" interval="60" depth="0"/>
<action name="validate-all" timeout="20" />
<action name="meta-data"    timeout="5" />
</actions>
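
After editing the resource agent, the values can be read back to make sure the change is in place on both nodes. The following sketch extracts the timeout of one action from a file containing lines like the ones above. The function name action_timeout is hypothetical.

```shell
# Hypothetical helper: print the timeout configured for ACTION in the
# metadata section of a resource agent file.
# Usage: action_timeout FILE ACTION
action_timeout() {
    agent_file=$1 action=$2
    sed -n "s/.*<action name=\"${action}\".*timeout=\"\([0-9]*\)\".*/\1/p" "$agent_file"
}

# Example:
# action_timeout /usr/lib/ocf/resource.d/heartbeat/lxc start
# action_timeout /usr/lib/ocf/resource.d/heartbeat/lxc monitor
```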

Set the resource-agents package on hold to preserve changes:

node1 + node2: echo resource-agents "hold" | dpkg --set-selections

Enable logging to /var/log/messages
- Edit /etc/rsyslog.d/50-default.conf for that

Use the LCMC to set "Stonith enabled" to "false" and "No Quorum Policy" to "ignore" since we are using resource-level fencing (see [2])
Click on Advanced in the Pacemaker configuration and set "start-failure-is-fatal" to false. This ensures that a failure in starting one single container does not move around all other containers.
Disable the DRBD init script, Pacemaker should take care of DRBD.

node1 + node2: update-rc.d -f drbd remove

Enable the dopd (drbd-peer-outdater) daemon (see [3])
Dopd needs to be able to execute drbdsetup and drbdmeta with root rights. Therefore the setuid bit has to be set for both files. Please remove the executable bit for others!

  chgrp haclient /sbin/drbdsetup
  chmod o-x /sbin/drbdsetup
  chmod u+s /sbin/drbdsetup

  chgrp haclient /sbin/drbdmeta
  chmod o-x /sbin/drbdmeta
  chmod u+s /sbin/drbdmeta

Moreover, the dopd helper script needs to be able to read the DRBD configuration. Please note: if you change the DRBD configuration with LCMC, you have to set the file rights for /etc/drbd.conf manually again afterwards!

  chgrp haclient /etc/drbd.conf
  chmod g+r /etc/drbd.conf
  chgrp haclient -R /etc/drbd.d
  chmod g+r -R /etc/drbd.d
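
Because these permissions are easy to lose, a quick check can help. The sketch below validates a "MODE GROUP" string as printed by GNU "stat -c '%A %G'": the setuid bit must be set, others must not have the executable bit, and the group must be haclient. The function name check_dopd_binary is hypothetical.

```shell
# Hypothetical helper: validate mode and group of drbdsetup or drbdmeta.
# Usage: check_dopd_binary "MODE GROUP", e.g. "-rwsr-xr-- haclient"
check_dopd_binary() {
    mode=${1%% *}
    group=${1##* }
    case $mode in
        # 4th character "s": setuid set, 10th character "-": no exec for others
        ???s?????-) ;;
        *) echo "unexpected mode: $mode"; return 1 ;;
    esac
    [ "$group" = "haclient" ] || { echo "unexpected group: $group"; return 1; }
    echo "ok"
}

# Example:
# check_dopd_binary "$(stat -c '%A %G' /sbin/drbdsetup)"
# check_dopd_binary "$(stat -c '%A %G' /sbin/drbdmeta)"
```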

lxc-ps included in Ubuntu 12.04 does not work with the lxc resource agent. Use a newer lxc-ps version based on bash.

node1 + node2: cd /usr/bin/
node1 + node2: mv lxc-ps lxc-ps.perl
node1 + node2: wget "http://lxc.git.sourceforge.net/git/gitweb.cgi?p=lxc/lxc;a=blob_plain;f=src/lxc/lxc-ps.in;h=a9923f06b921844ff3d16046418e6cc665e03e1d;hb=8edcbf336673d13bb944f817c9974298a77b7860" -O lxc-ps
node1 + node2: chmod +x lxc-ps

Create the container configuration disk space

Two mount points are created on node1 and node2 as LXC configuration space for the containers of node1 and node2. This space only contains the container configuration, not the actual root filesystems of the containers. The root filesystems are created as logical volumes in the volume groups vg_node1 and vg_node2.

node1 + node2: mkdir /lxc1
node1 + node2: mkdir /lxc2

node1: lvcreate -n lv_node1 -L 1G vg_node1
node1: mkfs.ext4 /dev/vg_node1/lv_node1
node1: vgchange -a n vg_node1

node2: lvcreate -n lv_node2 -L 1G vg_node2
node2: mkfs.ext4 /dev/vg_node2/lv_node2
node2: vgchange -a n vg_node2

Create LVM (ocf:heartbeat:LVM) Pacemaker resources in LCMC
With LCMC, create an ocf:heartbeat:Filesystem resource for /dev/vg_node1/lv_node1 on node1 with the mount point "/lxc1", and one for /dev/vg_node2/lv_node2 on node2 with the mount point "/lxc2"

Create a Linux container

The following container is created on node1. That means it is running during normal operations on node1 but can be switched over to node2 together with all other containers running in the same volume group.

node1: lxc-create -n test -t ubuntu -B lvm --lvname test --vgname vg_node1 --fstype ext4 --fssize 1GB
node1: mv /var/lib/lxc/test /lxc1/

Change the network setup in /lxc1/test/config to bind the container to the "br0" device:

lxc.network.type=veth
lxc.network.link=br0
lxc.network.flags=up

Now create a Pacemaker lxc resource for the created container in LCMC. Make it dependent on the Filesystem resource created previously.
The container should start up immediately after that.
On node1 you can now connect to the Ubuntu container using "lxc-console -n test" with the username "ubuntu" and password "ubuntu" (please change the password after the login). The sudo command allows you to change to the root user.
Now you can configure the network inside the container (/etc/network/interfaces)

auto eth0
iface eth0 inet static
	address 203.0.113.21
	netmask 255.255.255.0
	gateway 203.0.113.1
	dns-nameservers 203.0.113.1 8.8.8.8

You can exit from the lxc-console with [Ctrl]+[A] [Q]
Reboot the container ("reboot" command) and reconnect with "lxc-console -n test"
Check if the network setup is working correctly.

Setting resource limits for the container is also highly recommended. Here is a very simple example that limits RAM to 2 GB and swap to 1 GB.

## Limits
lxc.cgroup.cpu.shares                  = 768
lxc.cgroup.memory.limit_in_bytes       = 2048M
lxc.cgroup.memory.memsw.limit_in_bytes = 3072M

Please reboot the container after setting the resource limits.
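
Note that the memsw value is the limit for RAM and swap together, which is why 2 GB of RAM plus 1 GB of swap result in 3072M above. The following sketch prints both lines for a given RAM and swap size in MB. The function name lxc_memory_limits is hypothetical.

```shell
# Hypothetical helper: print the two memory limit lines for an LXC config.
# memsw.limit_in_bytes covers RAM plus swap, so both values are added.
# Usage: lxc_memory_limits RAM_MB SWAP_MB
lxc_memory_limits() {
    ram_mb=$1 swap_mb=$2
    echo "lxc.cgroup.memory.limit_in_bytes       = ${ram_mb}M"
    echo "lxc.cgroup.memory.memsw.limit_in_bytes = $((ram_mb + swap_mb))M"
}

# Example: 2 GB RAM and 1 GB swap
# lxc_memory_limits 2048 1024
```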

Migrate OpenVZ container to Linux containers (LXC)

This sample was tested with a Debian Squeeze container.

Check the disk requirement of the OpenVZ container

openvz: du -sh /vz/private/<containerid>

Create an empty linux container with your desired size:

node1: lxc-create -n www -B lvm --lvname www --vgname vg_node1 --fstype ext4 --fssize 4GB

Move the container to the configuration space of its node (/lxcX, i.e. /lxc1 for node1)

node1: mv /var/lib/lxc/www /lxcX/

Transfer the configured memory limitations from OpenVZ to the Linux container's LXC configuration file.
Create a default configuration. You may base that on an existing one from the same Linux distribution.
If your OpenVZ container was running 32-bit before, you have to insert the "lxc.arch = x86" parameter in the configuration
Here is an example configuration:

lxc.network.type=veth
lxc.network.link=br0
lxc.network.flags=up

## Container
lxc.utsname                             = www
lxc.tty                                 = 4
lxc.pts                                 = 1024
lxc.arch                                = x86

## Capabilities
lxc.cap.drop                            = sys_admin

## Devices
lxc.cgroup.devices.deny                 = a
# /dev/null
lxc.cgroup.devices.allow                = c 1:3 rwm
# /dev/zero
lxc.cgroup.devices.allow                = c 1:5 rwm
# /dev/tty[1-4] consoles
lxc.cgroup.devices.allow                = c 5:1 rwm
lxc.cgroup.devices.allow                = c 5:0 rwm
lxc.cgroup.devices.allow                = c 4:0 rwm
lxc.cgroup.devices.allow                = c 4:1 rwm
# /dev/{,u}random
lxc.cgroup.devices.allow                = c 1:9 rwm
lxc.cgroup.devices.allow                = c 1:8 rwm
lxc.cgroup.devices.allow                = c 136:* rwm
lxc.cgroup.devices.allow                = c 5:2 rwm
# /dev/rtc
lxc.cgroup.devices.allow                = c 254:0 rwm

## Limits
lxc.cgroup.cpu.shares                  = 512
#lxc.cgroup.cpuset.cpus                 = 0
lxc.cgroup.memory.limit_in_bytes       = 512M
lxc.cgroup.memory.memsw.limit_in_bytes = 768M

## Filesystem
lxc.mount.entry                         = proc proc proc nodev,noexec,nosuid 0 0
lxc.mount.entry                         = sysfs sys sysfs defaults,ro 0 0
lxc.rootfs = /dev/vg_XX/www
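
When transferring the memory limits, keep in mind that OpenVZ counts memory parameters such as privvmpages in pages, while the LXC configuration above uses megabytes. The following sketch converts a page count to MB, assuming the usual page size of 4 KiB on x86. The function name openvz_pages_to_mb is hypothetical, and the special "unlimited" value of OpenVZ is not handled.

```shell
# Hypothetical helper: convert an OpenVZ page count into whole megabytes,
# assuming a page size of 4 KiB. Not suitable for the "unlimited" value.
# Usage: openvz_pages_to_mb PAGES
openvz_pages_to_mb() {
    pages=$1
    echo $((pages * 4 / 1024))
}

# Example: use the result as "<value>M" for lxc.cgroup.memory.limit_in_bytes
# openvz_pages_to_mb 131072
```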

Mount the newly created logical volume to your desired location

node1: mkdir /mnt/lxcmigrate
node1: mount /dev/vg_X/www /mnt/lxcmigrate/

Copy the running OpenVZ container with rsync

node1: rsync -a --numeric-ids -e 'ssh' openvz:/vz/private/<containerid>/ /mnt/lxcmigrate/

Stop the OpenVZ container and make a final synchronisation to the linux container

openvz: vzctl stop <containerid>
node1: rsync -a --delete --numeric-ids -e 'ssh' openvz:/vz/private/<containerid>/ /mnt/lxcmigrate/

Disable autoboot of OpenVZ container

openvz: vzctl set <containerid> --onboot no --save

Add the following tty statements to /etc/inittab:

0:2345:respawn:/sbin/getty 38400 console
1:2345:respawn:/sbin/getty 38400 tty1
2:23:respawn:/sbin/getty 38400 tty2
3:23:respawn:/sbin/getty 38400 tty3
p0::powerfail:/sbin/init 0

Modify /etc/network/interfaces, replace venet with eth0
Set a root password

node1: cd /mnt/lxcmigrate
node1: chroot .
node1: passwd
node1: exit

Unmount /mnt/lxcmigrate
Start the Linux container manually for a test drive

node1: lxc-start -n www -f /lxcX/www/config

Log in to the container
Optional: grab the randomly generated MAC address and insert it into the LXC configuration (lxc.network.hwaddr)
Execute the following statement to ensure that the container shuts down cleanly:

www: kill -PWR 1

If that works out you are ready to create the Pacemaker lxc resource configuration in LCMC.

Best practices

System updates on the cluster nodes

UPDATE: Please add "GRUB_DISABLE_OS_PROBER=true" to /etc/default/grub to avoid problems like those described below.

If you are going to install a kernel update on a cluster node, it's recommended to force a switchover of all cluster resources from that node to the other. This can be done via LCMC.

As soon as all cluster resources are running on the other node, the updates can be installed as usual. E.g. in Ubuntu just execute apt-get update; apt-get dist-upgrade to install the latest kernel. Reboot the system after installing the new kernel and switch the resources back afterwards if you like.

Don't forget to remove the migration constraint after the switchover!
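
A forgotten migration constraint can be spotted in the cluster configuration. The sketch below assumes that the switchover created location constraints named "cli-prefer-<resource>" or "cli-standby-<resource>", as crm_resource does when moving a resource; whether LCMC uses the same names in your version is an assumption you should verify. It reads the output of "crm configure show" on standard input. The function name leftover_migration_constraints is hypothetical.

```shell
# Hypothetical helper: print the IDs of location constraints that look like
# leftovers of a manual migration (cli-prefer-* or cli-standby-*).
leftover_migration_constraints() {
    awk '$1 == "location" && $2 ~ /^cli-(prefer|standby)-/ { print $2 }'
}

# Example:
# crm configure show | leftover_migration_constraints
```

An empty output means that no such constraint was found.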

If updates are installed on a cluster node while cluster resources are running on it, unexpected things can happen. For example, the switchover or failover can be blocked by logical volumes that are still open. The following process, for instance, can hold a snapshot open.

root@node1:/home/user# ps -eaf
...
root      1335     1  0 21:41 ?        00:00:00 grub-mount /dev/mapper/vg_node1-testsnap /var/lib/os-prober/mount

This can cause problems during switchover or failover. Often these problems can be resolved with a reboot of the system.
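
To find such processes before a switchover, the output of "ps -eaf" can be filtered for grub-mount. The following sketch prints the PID and the device of every match. The function name grub_mount_holders is hypothetical, and it assumes the column layout of "ps -eaf" shown above.

```shell
# Hypothetical helper: print "PID DEVICE" for every grub-mount process found
# in "ps -eaf" output read from standard input (command is the 8th column).
grub_mount_holders() {
    awk '$8 ~ /(^|\/)grub-mount$/ { print $2, $9 }'
}

# Example:
# ps -eaf | grub_mount_holders
```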

Networking issue due to MAC address changes on container reboots

If you experience network connectivity problems in one of your containers, cached ARP entries in one of your networking devices (e.g. firewall, router, ...) can be the reason. Each time a container is started, it is assigned a new random MAC address for its veth device. Juniper firewalls running JunOS, for example, normally have a default ARP aging timer of 20 minutes. This can lead to downtimes of the containers.

Three possible solutions are:

Decrease the ARP caching timeout on your networking device
Set a fixed MAC address for the LXC container with the option "lxc.network.hwaddr" in the LXC configuration file. You can start each container once, grab the randomized MAC address from the container (ip a s) and insert it into the configuration file.
Send a gratuitous ARP packet as soon as the network interface is brought up in the container. In Ubuntu/Debian you have to add the "post-up" line to /etc/network/interfaces:

auto eth0
iface eth0 inet static
	address 203.0.113.10
	netmask 255.255.255.0
	gateway 203.0.113.1
	# dns-* options are implemented by the resolvconf package, if installed
	dns-nameservers 203.0.113.1 8.8.8.8
	post-up /usr/sbin/arping -c 4 -I eth0 -q 203.0.113.10
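
For the second solution, the MAC address can be extracted from the "ip a s" output and turned into a configuration line directly. The following is a minimal sketch: the function name hwaddr_config_line is hypothetical, and it uses the first "link/ether" entry it finds, so it is best fed with the output for eth0 only.

```shell
# Hypothetical helper: read "ip a s" output on standard input and print an
# lxc.network.hwaddr line for the first Ethernet MAC address found.
hwaddr_config_line() {
    awk '$1 == "link/ether" { print "lxc.network.hwaddr=" $2; exit }'
}

# Example (inside the container):
# ip a s eth0 | hwaddr_config_line
```

The printed line can then be added to the network section of the container's LXC configuration file on the node.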

Device Mapper error when activating LVM volume group with snapshots

If you are using snapshots in the Pacemaker-managed volume group, the LVM resource agent can hang during a switchover or failover. The following error message is reported:

May 18 22:02:43 node1 lrmd: [4413]: info: rsc:res_LVM_2 start[41] (pid 12127)
May 18 22:02:43 node1 LVM[12127]: INFO: Activating volume group vg_node2
May 18 22:02:43 node1 LVM[12127]: INFO: Reading all physical volumes. This may take a while... Found volume group "vg_node2" using metadata type lvm2 Found volume group "vg_node1" using metadata type lvm2
May 18 22:02:43 node1 kernel: [170067.325149] device-mapper: table: 252:12: snapshot: Snapshot cow pairing for exception table handover failed
May 18 22:02:43 node1 kernel: [170067.325911] device-mapper: ioctl: error adding target to table
May 18 22:02:45 node1 LVM[12127]: ERROR: device-mapper: reload ioctl failed: Invalid argument 17 logical volume(s) in volume group "vg_node2" now active
May 18 22:02:45 node1 lrmd: [4413]: info: operation start[41] on res_LVM_2 for client 4416: pid 12127 exited with return code 1

To manually fix the issue, a cleanup must be done on the LVM resource in Pacemaker.

I found the following mailing list conversation: http://www.redhat.com/archives/linux-lvm/2012-February/msg00043.html This led me to the assumption it must be related to udev.

I found out that commenting out the udev rules in /lib/udev/rules.d/85-lvm2.rules resolves the issue.

# This file causes block devices with LVM signatures to be automatically
# added to their volume group.
# See udev(8) for syntax

#SUBSYSTEM=="block", ACTION=="add|change", ENV{ID_FS_TYPE}=="lvm*|LVM*", \
#	RUN+="watershed sh -c '/sbin/lvm vgscan; /sbin/lvm vgchange -a y'"

To keep the change, I recommend putting the lvm2 package on hold:

sudo apt-mark hold lvm2

Creating Debian Wheezy Container with Ubuntu 12.04

export SUITE=wheezy
lxc-create -t debian ...
Manually install rsyslog afterwards
Please check if /etc/apt/sources.list contains the Security Mirror
- deb http://security.debian.org wheezy/updates main contrib

Creating Debian Jessie Container with Ubuntu 12.04

export SUITE=jessie
cd /usr/share/debootstrap/scripts/
ln -s sid jessie
lxc-create -t debian ...
Manually install rsyslog afterwards
Please check if /etc/apt/sources.list contains the Security Mirror
- deb http://security.debian.org jessie/updates main contrib
see below for "Troubles with systemd" to remove systemd

Creating Ubuntu Trusty Container with Ubuntu 12.04

see help text from:

lxc-create -t ubuntu -h

lxc-create -t ubuntu ... -- -r trusty

LVM Resource Agent Timeout

If you have many logical volumes and snapshots, activating an LVM volume group can take a few minutes. The default start/stop timeout in the LVM resource agent is only 60 seconds. I recommend increasing the start and stop timeout from 60 to 240 seconds. This solved an issue I had with an HA cluster.

Sep 28 20:34:53 node2 crmd: [2889]: info: do_lrm_rsc_op: Performing key=55:1934:0:a21f2f58-6fa6-487e-82aa-0597971bf516 op=res_LVM_1_start_0 )
Sep 28 20:34:53 node2 lrmd: [2886]: info: rsc:res_LVM_1 start[45] (pid 12395)
Sep 28 20:34:53 node2 LVM[12395]: INFO: Activating volume group vg_node1
Sep 28 20:34:53 node2 LVM[12395]: INFO: Reading all physical volumes. This may take a while... Found volume group "vg_node2" using metadata type lvm2 Found volume group "vg_node1" using metadata type lvm2
Sep 28 20:35:53 node2 lrmd: [2886]: WARN: res_LVM_1:start process (PID 12395) timed out (try 1).  Killing with signal SIGTERM (15).
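
To find out which operations have run into such a timeout, the syslog can be searched for the lrmd warning shown above. The following sketch prints the "resource:operation" part of every matching line. The function name timed_out_operations is hypothetical, and the pattern assumes the log format of the example above.

```shell
# Hypothetical helper: read syslog lines on standard input and print
# "resource:operation" for every lrmd timeout warning.
timed_out_operations() {
    sed -n 's/.* lrmd: .*WARN: \([^ ]*\) process (PID [0-9]*) timed out.*/\1/p'
}

# Example:
# timed_out_operations < /var/log/messages
```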

Extending Container Filesystem

Since the container's filesystem is used on top of a logical volume, it can be easily extended using LVM.

The following example adds 8GB for a container called "www".

# stopping container
crm(live)resource# stop res_lxc_www

# extending logical volume
lvextend -L+8G /dev/vg_node2/www

# checking filesystem (/lost+found must be created) 
e2fsck -f /dev/vg_node2/www

# resizing ext4 filesystem
resize2fs -p /dev/vg_node2/www

# starting container
crm(live)resource# start res_lxc_www

# logging into container to verify disk space with "df -h"
lxc-console -n www

Troubles with systemd

If you are using Ubuntu 12.04 together with LXC, systemd can break your container. See [4].

We experienced that problem with a Debian Jessie container that was accidentally upgraded to systemd during an apt-get dist-upgrade procedure. After that upgrade, the container was unable to boot. lxc-ps showed only one process called "systemd", and the container did not boot any further.

To revert to System-V startup you can follow these steps:

Ensure that the container is stopped
- crm resource stop res_lxc_www
- lxc-info --name www
mount the Container Filesystem
- mount /dev/vg_node1/www /mnt/
chroot /mnt/
apt-get install sysvinit-core
exit
cd /; umount /mnt/
crm resource start res_lxc_www
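
Whether a stopped container would boot with systemd can be checked on its mounted root filesystem. The sketch below assumes the Debian layout in which /sbin/init is a symbolic link to the systemd binary once systemd has taken over as init. The function name init_is_systemd is hypothetical.

```shell
# Hypothetical helper: succeed if /sbin/init below ROOTFS is a symbolic link
# that points to systemd. A regular file is treated as "not systemd".
# Usage: init_is_systemd ROOTFS
init_is_systemd() {
    rootfs=$1
    target=$(readlink "$rootfs/sbin/init") || return 1
    case $target in
        *systemd*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example (container stopped, filesystem mounted on /mnt):
# init_is_systemd /mnt && echo "container would boot with systemd"
```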

Enable tmpfs mounts inside the container

Replace the following line in the lxc container config

lxc.cap.drop                            = sys_admin

with

lxc.cap.drop                            = sys_module mac_admin mac_override sys_time

Author: Christoph Mitasch

Christoph Mitasch works in the Web Operations & Knowledge Transfer team at Thomas-Krenn. He is responsible for the maintenance and further development of the webshop infrastructure. After an internship at IBM Linz, he finished his diploma studies in "Computer- and Media-Security" at FH Hagenberg. He lives near Linz and, besides work, is an enthusiastic marathon runner and juggler, holding various world records.