XenServer Virtual Machines Stop or Pause During Migration

Recently I had a whole lot of problems migrating virtual machines running on XenServer 6.2 and 6.0.2. Sometimes the migration would fail and the virtual machine would stop or pause, resulting in downtime. Here is how the problem was investigated and fixed.

It all started when I attempted the simple task of applying some newly released patches to a XenServer 6.2 pool. The pool had 8 hosts, and from the pool master I ran the usual ‘xe patch-upload file-name=update.xsupdate’ to upload the update to the pool, which succeeded.

The problems began when trying to apply the patches with ‘xe patch-pool-apply uuid=UUID’.
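For reference, the basic sequence looks roughly like the below; ‘xe patch-upload’ prints the UUID of the uploaded patch, which is then passed to ‘xe patch-pool-apply’ (the UUID shown is the one for this particular patch):

[root@XenServer1 ~]# xe patch-upload file-name=update.xsupdate
f735af04-069f-4cf0-b80a-795c178c94a8
[root@XenServer1 ~]# xe patch-pool-apply uuid=f735af04-069f-4cf0-b80a-795c178c94a8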

My first thought on this failing was that perhaps one of the XenServer hosts did not have enough disk space. This has happened to me before, as the root file system is currently only 4 GB (which is very small by today’s standards), so it fills up quite easily when uploading updates.
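Checking free space on the root file system of each host is a quick df; the figures below are illustrative only:

[root@XenServer1 ~]# df -h /
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             4.0G  2.1G  1.7G  56% /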

After checking and confirming that all XenServer hosts had adequate space, I proceeded to run the patch-pool-apply command again; this time it failed with a different error.

The patch apply failed. Please see attached output.
output: 
Backup files already present - aborting. If this is invalid, please remove /opt/xensource/patch-backup/f735af04-069f-4cf0-b80a-795c178c94a8/ and retry.

Basically there were leftover files from my first patch-pool-apply run, so I simply removed the directory noted in the error from all 8 XenServer hosts in the pool with ‘rm -rf /opt/xensource/patch-backup/f735af04-069f-4cf0-b80a-795c178c94a8/’.

I went through this a couple more times before determining that patch-pool-apply was not going to succeed. My thinking was that one XenServer host in the pool was currently down and out of action due to unrelated problems, so perhaps the pool-wide apply was trying to patch that host but could not contact it.

Instead of trying to apply to the whole pool at once, I tried one XenServer host at a time. I ran ‘xe patch-apply’ and specified the host-uuid of the pool master that I was logged into, as the pool master must be patched first. Despite only trying to patch one server, I got the below error.

[root@XenServer1 ~]# xe patch-apply host-uuid=7e999fa8-6d2e-4db3-8fd5-7308680430cc uuid=f735af04-069f-4cf0-b80a-795c178c94a8
The patch apply failed. Please see attached output.
output: Preparing... ##################################################
xen-hypervisor ##################################################
Preparing... ##################################################
xen-tools ##################################################
Preparing... ##################################################
xapi-core ##################################################
Preparing... ##################################################
vhd-tool ##################################################
Preparing... ##################################################
xapi-xe ##################################################
Preparing... ##################################################
xapi-networkd ##################################################
Preparing... ##################################################
Stopping XCP RRDD plugin xcp-rrdd-gpumon: [ OK ]
perf-tools-rrdd-gpumon ##################################################
Starting XCP RRDD plugin xcp-rrdd-gpumon: [ OK ]
Preparing... ##################################################
Stopping XCP RRDD plugin xcp-rrdd-iostat: [ OK ]
Stopping XCP RRDD plugin xcp-rrdd-squeezed: [ OK ]
Stopping XCP RRDD plugin xcp-rrdd-xenpm: [ OK ]
perf-tools-rrdd-plugins ##################################################
Starting XCP RRDD plugin xcp-rrdd-iostat: [ OK ]
Starting XCP RRDD plugin xcp-rrdd-squeezed: [ OK ]
Starting XCP RRDD plugin xcp-rrdd-xenpm: [ OK ]
Preparing... ##################################################
sm ##################################################
Preparing... ##################################################
blktap ##################################################
Preparing... ##################################################
xen-device-model ##################################################
error: unpacking of archive failed on file /usr/lib/xen/bin/qemu-dm: cpio: rename failed - Inappropriate ioctl for device

So it was all going fine until it hit this /usr/lib/xen/bin/qemu-dm file.

Investigating further, I found that for some reason the file had been set to immutable, as shown by the below command.

[root@XenServer1 ~]# lsattr /usr/lib/xen/bin/qemu-dm
----i-------- /usr/lib/xen/bin/qemu-dm

At the time I was not sure why this was immutable. After confirming with Citrix support that it shouldn’t be, I removed the flag with ‘chattr -i /usr/lib/xen/bin/qemu-dm’ and was then able to successfully complete the ‘xe patch-pool-apply’.
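For completeness, removing the flag and checking it again afterwards looks like this:

[root@XenServer1 ~]# chattr -i /usr/lib/xen/bin/qemu-dm
[root@XenServer1 ~]# lsattr /usr/lib/xen/bin/qemu-dm
------------- /usr/lib/xen/bin/qemu-dm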

Once the patches were applied, the XenServer hosts needed to be rebooted. Before performing the reboot I placed the pool master into maintenance mode so that all of its virtual machines were automatically migrated off, and this completed fine. Once the pool master was back up, I started on the second XenServer host in the pool and began migrating its virtual machines across to the pool master, which was now running the applied updates.
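For anyone doing this from the CLI rather than XenCenter, entering maintenance mode is roughly equivalent to disabling the host and then evacuating it, something along the lines of the below (using the pool master’s host UUID from earlier):

[root@XenServer1 ~]# xe host-disable uuid=7e999fa8-6d2e-4db3-8fd5-7308680430cc
[root@XenServer1 ~]# xe host-evacuate uuid=7e999fa8-6d2e-4db3-8fd5-7308680430cc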

This is where the real trouble began. Less than half of the virtual machines migrated successfully; the rest took a few minutes before finally entering the “Paused” state. I noticed that virtual machines on XenServer 6.2 got stuck in the paused state, while virtual machines running on XenServer 6.0.2 would simply stop. The stopped virtual machines could be started back up, however the paused ones could not be resumed. When trying to select resume in XenCenter I got the below error:

There were no servers available to start VM 'VMName'

When trying to migrate the VM to another host in the pool, the below message was shown and the migrate option was greyed out for all XenServer hosts:

Object has been deleted.VDI:OpaqueRef:Null

After discussing with Citrix support, the advice was that the VMs had effectively already stopped, so the only way to proceed was to force a stop (which clears the running memory) and boot the servers back up again. This only seemed to happen when migrating VMs from a XenServer host that was still pending a reboot for the patches I had applied, to one that had already been rebooted with the updates applied. I was starting to think there was perhaps a bug in the update; this was not the case.
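For reference, a force stop from the CLI is something along the lines of the below, using the VM name from the earlier error as an example:

[root@XenServer1 ~]# xe vm-shutdown force=true vm=VMName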

Interestingly, I noticed that after a VM had failed to migrate, become stuck in the paused state, been force stopped and then powered back on, it would migrate flawlessly and no longer stop or pause during migration.

I tested migrating VMs in other pools to see if they had the same problems, and they did. I noticed that this was primarily happening with Windows servers, as well as some running Ubuntu; Debian and Red Hat instances did not seem to have the problem.

After investigating the XenServer logs, we found the below information.

These logs are from the originating XenServer host, the one the virtual machine was running on and being migrated from.

24065 ?        SLsl 00:02:18   0   0  0.5  0.1 ffffff poll_schedule_timeout      \_ /usr/lib/xen/bin/qemu-dm.orig -d 169 -m 2048 -boot dc -serial pty -vcpus 2 -vncunused -k en-us -usb -usbdevice tablet -net nic,vlan=0,macaddr=02:00:00:de:00:01,model=e1000 -net tap,vlan=0,bridge=xapi145,ifname=tap169.0 -acpi -loadvm /var/lib/xen/qemu-resume.169 -videoram 4 -M xenfv -monitor pty -vnc 127.0.0.1:1

These logs are from the receiving host, the one the virtual machine was being migrated to.

Mar 28 20:32:51 iaascom9 xenopsd: [debug|iaascom9|9|Async.VM.pool_migrate R:b0a716029c0f|xenops] Device.Dm.start domid=7 args: [-d 7 -m 2048 -boot dc -serial pty -vcpus 2 -videoram 4 -vncunused -k en-us -vnc 127.0.0.1:1 -usb -usbdevice tablet -net nic,vlan=0,macaddr=02:00:00:de:00:01,model=rtl8139 -net tap,vlan=0,bridge=xapi145,ifname=tap7.0 -acpi -loadvm /var/lib/xen/qemu-resume.7 -monitor pty]

Note the differences:

Before: "-net nic,vlan=0,macaddr=02:00:00:de:00:01,model=e1000"
After: "-net nic,vlan=0,macaddr=02:00:00:de:00:01,model=rtl8139".

The below error is then observed:

Mar 28 20:32:51 XenServer1 fe: qemu-dm-7[28908]: Unknown savevm section or instance 'e1000' 0
Mar 28 20:32:51 XenServer1 fe: qemu-dm-7[28908]: Error -22 while loading savevm file '/var/lib/xen/qemu-resume.7'
Mar 28 20:32:51 XenServer1 fe: qemu-dm-7[28908]: Linking /var/lib/xen/qemu-resume.7 -> /var/lib/xen/qemu-resume.7-broken

Basically the receiving qemu-dm has no e1000 device configured, so it does not recognise the e1000 section in the saved state and cannot resume the VM.

Based on this, it looked like the /usr/lib/xen/bin/qemu-dm file was the problem, the same file that had previously been immutable and blocking the update. After some further investigation of the problem, we found this article: http://www.netservers.co.uk/articles/open-source-howtos/citrix_e1000_gigabit

After reading that and checking /usr/lib/xen/bin/, I found that there were indeed both qemu-dm and qemu-dm.orig files, so it seems this hack had previously been applied in the environment. I later found out that this had been done intentionally long ago, as the emulated RTL driver is slow, being limited to 100 Mbit; this was causing problems when performing bare metal backup restores, since during that process the guest virtual machine is not running the required PV drivers.
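For anyone unfamiliar with it, the hack generally works by renaming the real device model binary to qemu-dm.orig and dropping in a small wrapper script as qemu-dm that swaps the NIC model before handing over. The below is only a rough sketch of the idea, not the exact script from that article:

#!/bin/bash
# Rough sketch of the e1000 wrapper hack: rewrite the emulated NIC model
# from rtl8139 to e1000, then hand all arguments to the real binary.
exec /usr/lib/xen/bin/qemu-dm.orig $(echo "$@" | sed 's/rtl8139/e1000/g')

This also explains why the process listing from the originating host earlier shows qemu-dm.orig running with model=e1000.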

So the /usr/lib/xen/bin/qemu-dm file had been modified at some stage, and this was what broke virtual machine migration in XenServer after the updates were applied. By now the file was a binary rather than a bash script. I ran an md5sum on qemu-dm in a XenServer pool that had the VM migration problem and in a different pool that did not, and the hash was the same, so both environments had the same binary file. My assumption is that the patching replaced the hacked qemu-dm with the correct version; given that the update process had tried to modify this file, as noted previously when it was set to immutable, I believe this is a safe assumption.
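The comparison itself was just a case of running the below on a host in each pool and comparing the output:

[root@XenServer1 ~]# md5sum /usr/lib/xen/bin/qemu-dm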

Qemu-dm is executed when a VM starts, so once a VM is up and running with the hacked file in place, the only way to fix it is to remove the hack and then stop and start the VM. As I had confirmed that qemu-dm now appeared correct after patching, I needed to shut down a VM that was listed as running with e1000 and see if it would come back up with rtl8139.

There is an easy way to check whether you have any VMs running with e1000 or rtl8139, as shown below.

[root@XenServer1 ~]# ps -ef | grep qemu
65576     3114  3113  0 Apr20 ?        00:24:55 qemu-dm-41 -d 41 -m 4096 -boot dc -serial pty -vcpus 4 -videoram 4 -vncunused -k en-us -vnc 127.0.0.1:1 -usb -usbdevice tablet -net nic,vlan=0,macaddr=02:00:6a:90:00:04,model=rtl8139 -net tap,vlan=0,bridge=xapi7,ifname=tap41.0 -acpi -loadvm /var/lib/xen/qemu-resume.41 -monitor pty

Note the model part, as this is important; the above example lists “model=rtl8139”, which is good. I tested migrating a few virtual machines that showed rtl8139 and they all migrated fine. I then found some that listed the model as e1000; upon trying to migrate these to a patched and rebooted XenServer host, they paused and had to be force stopped. After starting them back up and checking with the ‘ps’ command again, I confirmed they were now correctly running with rtl8139.
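The -d number in each qemu-dm command line is the domain ID, so once you spot a process with model=e1000 you can map it back to a VM name and owner. Assuming dom-id filtering on vm-list works as I recall, something like the below should do it for the example process above:

[root@XenServer1 ~]# xe vm-list dom-id=41 params=name-label,uuid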

At this point I had only rebooted a couple of XenServer hosts in the pool to apply the updates; I had stopped the upgrade process once I found that VM migration was failing, since a problem this large needs to be investigated before proceeding. As I still had some XenServer hosts that were patched but pending reboot, I tried migrating one of the e1000 VMs to one of these and it worked fine. This confirmed that the patching had indeed fixed qemu-dm and replaced the hacked version, but that the fix did not take effect until the XenServer host was rebooted, which makes sense as a reboot is required to complete XenServer patching. The failures only took place when migrating from a patched but pending reboot XenServer host to a patched and rebooted XenServer host.

In order to finish the upgrade of the remaining XenServer hosts that had reboots pending, I had to contact all users with running e1000 virtual machines and schedule a time to stop each VM and start it back up on a fully patched XenServer host, ensuring it came back up with rtl8139 rather than e1000. After being started, a stopped VM generally came back up on an upgraded host automatically, as a more fully patched host is a better candidate for it to start on. Slowly I was able to get all virtual machines off the remaining XenServer hosts and finally complete their reboots, allowing them to come back up with the correct qemu-dm file in place. After this, migration worked perfectly.
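To double check where a VM ended up after being started back up, something like the below can be used (again with VMName as an example); the resident-on UUID can then be matched against the hosts that had already been patched and rebooted:

[root@XenServer1 ~]# xe vm-list name-label=VMName params=resident-on,power-state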

Conclusion

Hopefully, now that this has been resolved, future updates and patching will be a much smoother process, as VM migration is working correctly again.

While it may have been possible to simply reapply the qemu-dm hack so that the e1000 servers could migrate, this is unsupported by Citrix, so it was better to remove it from the environment entirely and remain in a supported state going forward.
