A Citrix XenServer host that is managed by Apache CloudStack will reboot itself, taking down every virtual machine running on it, if its primary storage has a problem that persists for a period of time. In general this is a good protection method: if the primary storage is no longer available, it can cause problems for the virtual machines. Windows VMs may hit a blue screen of death (BSOD), while Linux VMs may enter a read-only file system state to protect against data loss. However, you may have multiple NFS mount points acting as primary storage, and you may not want every VM on the XenServer host to power off with the host when only one of them has a problem. It is possible to work around this by modifying the heartbeat script that CloudStack creates on the XenServer host.
When a XenServer host reboots due to a storage accessibility problem, you may see a message similar to the one below logged in /var/log/messages.
Aug 20 13:05:40 xenserver1 heartbeat: Potential problem with /var/run/sr-mount/211467a3-961c-5c1e-3fc7-56cab4f2cf23/hb-b192a6cc-3456-6788-3457-8749b5677dc3: not reachable since 133 seconds, rebooting system!
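To check whether an earlier reboot was triggered this way, a search along the following lines may help. This is only a sketch: the function name find_heartbeat_reboots is made up for illustration, and the pattern is derived from the single sample message above, so other versions may word the message differently.

```shell
# find_heartbeat_reboots [LOGFILE]   (hypothetical helper name)
# Prints every line in LOGFILE that looks like the heartbeat reboot
# message shown above. Defaults to /var/log/messages.
find_heartbeat_reboots() {
    log_file="${1:-/var/log/messages}"
    # Pattern assumed from the sample message: the "heartbeat:" tag
    # followed later on the same line by "rebooting system".
    grep 'heartbeat: .*rebooting system' "$log_file"
}

# Example call on the host (not executed here):
#   find_heartbeat_reboots /var/log/messages
```

The path in each matching line shows which storage mount the heartbeat could not reach.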
In the example above, a temporary network issue prevented the XenServer host from accessing the NFS mount point. We do not want the XenServer host to reboot and its virtual machines to shut down over a temporary network issue, so we are going to manually edit the heartbeat script.
- On the XenServer host, edit the /opt/xensource/bin/xenheartbeat.sh file and comment out the two instances of “reboot -f”, as this is the command that initiates the reboot.
- Find the PID of the heartbeat script by running “pidof -x xenheartbeat.sh”.
- Restart the script by killing that process with “kill PID”, replacing PID with the number returned by the previous step.
- Reconnect to the host through the Apache CloudStack user interface.
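The first three steps above can be sketched as shell functions. This is a minimal sketch, not a tested procedure: the function names comment_out_reboot and restart_heartbeat are made up for illustration, and the sed pattern assumes each “reboot -f” sits at the start of its own line (after optional indentation), which may not hold for every version of the script.

```shell
# Path taken from the steps above; adjust if your installation differs.
HEARTBEAT_SCRIPT="/opt/xensource/bin/xenheartbeat.sh"

# comment_out_reboot [FILE]   (hypothetical helper name)
# Saves the pre-edit file as FILE.bak, then prefixes every line that
# starts with "reboot -f" (ignoring leading whitespace) with "# ".
# Lines that are already commented out are left untouched.
# Note: running it again overwrites FILE.bak with the edited version.
comment_out_reboot() {
    target="${1:-$HEARTBEAT_SCRIPT}"
    [ -f "$target" ] || { echo "not found: $target" >&2; return 1; }
    sed -i.bak 's/^\([[:space:]]*\)reboot -f/\1# reboot -f/' "$target"
}

# restart_heartbeat   (hypothetical helper name)
# Finds the running heartbeat script with pidof and kills it.
restart_heartbeat() {
    pids="$(pidof -x xenheartbeat.sh)" || {
        echo "heartbeat script not running" >&2
        return 1
    }
    # pidof may print several PIDs separated by spaces,
    # so $pids is deliberately left unquoted.
    kill $pids
}

# Intended order on the host (not executed here):
#   comment_out_reboot && restart_heartbeat
# followed by reconnecting the host from the CloudStack user interface.
```

After editing, it is worth reading the changed lines and running “bash -n /opt/xensource/bin/xenheartbeat.sh” to confirm the script still parses, since commenting out the only command inside an if block would leave it syntactically invalid.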
From my investigation, the /opt/xensource/bin/xenheartbeat.sh script file only exists on XenServer hosts that are managed by Apache CloudStack. The standalone XenServer hosts I checked did not have this file, so this will only apply to hosts managed through CloudStack.
It is also important to note that the file may be modified by a XenServer update, so it should be checked after a major upgrade and modified again if required.
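One way to perform that check after an upgrade is to count the “reboot -f” lines that are still active. The sketch below uses a made-up function name, count_active_reboots, and assumes the same layout as before: each “reboot -f” at the start of its own line, with commented-out copies starting with “#”.

```shell
# count_active_reboots [FILE]   (hypothetical helper name)
# Prints how many lines in FILE start with an uncommented "reboot -f".
# A result of 0 suggests the workaround is still in place; anything
# higher suggests the script has been restored or replaced.
count_active_reboots() {
    script_path="${1:-/opt/xensource/bin/xenheartbeat.sh}"
    [ -f "$script_path" ] || { echo "not found: $script_path" >&2; return 2; }
    # grep -c prints the number of matching lines; it exits non-zero
    # when that number is 0, so the exit status is ignored here.
    grep -c '^[[:space:]]*reboot -f' "$script_path" || true
}

# Example call on the host (not executed here):
#   count_active_reboots
```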
There is currently an open, unresolved bug report on this issue. Hopefully the heartbeat script will become more robust in the future, as rebooting a hypervisor host that has running virtual machines on it should not be taken lightly. By default, even virtual machines that are not using the problem storage will be shut down, because they run on the XenServer host that detected the storage problem, which is hardly ideal. Removing the reboot lines from the script is not an ideal solution in a production environment if the storage becomes unavailable for a long period of time, but it may be used as a workaround if you are able to detect and resolve the issue quickly.