After rebooting a virtual machine running CentOS through Apache CloudStack, it appeared to be running; however, it had failed to boot. The status in CloudStack showed the virtual machine as running, but the console did not load any content and connections to it failed. Checking the virtual machine directly through XenCenter made it clear that it was not actually running.
In XenCenter the virtual machine had the red stop icon on it and was definitely stopped. Performing a reboot through CloudStack did nothing, and stopping the instance through CloudStack resulted in the virtual machine being removed from XenCenter as expected. When starting it back up again and watching XenCenter, it did appear to power on for a couple of seconds, as shown by the green play icon, but it quickly went back to the stopped state.
There were no errors in either CloudStack or XenCenter, suggesting the problem was within the virtual machine itself. I also checked the XenServer logs over SSH and found nothing logged for this strange behaviour, further confirming that the issue was within the VM. Once XenServer powers on the virtual machine and it shows as running, the operating system is responsible for the rest of the boot process; if that process fails, the VM can power off without leaving any clear logs behind.
How can you investigate what’s going on inside a virtual machine that does not start up properly? I simply created another Linux-based virtual machine, detached the disk from the original virtual machine in XenCenter, attached it to the new virtual machine, and mounted it within the OS of the new VM. Once you have that disk mounted you can check logs, make changes, and perform basic investigation into why the boot process is not working.
A similar but lengthier process can be done entirely through CloudStack: take a snapshot of the volume, create a volume from that snapshot, create a new Linux instance, then attach the new volume to the new instance. You will then be able to mount the volume within the virtual machine and perform any required changes, unmount it when complete, and detach it from the virtual machine. Finally, create a new template from this volume and deploy an instance to replace the one that fails to boot.
Trying a file system check
I chose the first method, working outside of CloudStack, as it would be faster. After attaching the disk to my new virtual machine I first tried running a file system check on it, in case something was corrupt. I’ve hit this problem before, and a file system check was all that was needed to fix it quickly.
In this particular case, however, the file system check did not resolve the problem. After attaching the disk, but before mounting it, I ran “e2fsck -fy /dev/xvdb1” against the root partition on the attached disk, as shown in the “fdisk -l” output. Once the file system check completes, detach the disk from the test virtual machine, attach it back to the original virtual machine and try to start it up.
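If you want to rehearse the check safely before touching a real disk, e2fsck can be pointed at a scratch filesystem image instead of a partition. A minimal sketch, with the image path purely illustrative (on the real system the target was the unmounted /dev/xvdb1):

```shell
# Create a small scratch ext4 filesystem image, standing in for the
# unmounted root partition (e.g. /dev/xvdb1 on the real system):
dd if=/dev/zero of=/tmp/scratch.img bs=1M count=8 status=none
mkfs.ext4 -q -F /tmp/scratch.img

# -f forces a full check even if the filesystem looks clean,
# -y automatically answers yes to any repair prompts:
e2fsck -fy /tmp/scratch.img
```

e2fsck exits non-zero when it finds (or fails to fix) errors, so the exit code is worth checking in scripts as well as reading the output.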
Selecting which kernel to use during boot
As you may know, /boot/grub/grub.conf maintains boot configuration for a few recent versions of the Linux kernel. I had a look in here and it was set to boot kernel version 2.6.32-504.8.1.el6.x86_64, as this was the first entry listed and “default=0” selects the first entry (entries are zero-indexed). With the disk of the virtual machine that would not start attached and mounted on my test virtual machine, I edited /mnt/boot/grub/grub.conf and changed “default=0” to “default=1”. This boots the second kernel listed, which will typically be the one that was in use prior to the most recent update.
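As a sketch of that edit, here is a minimal grub.conf in the same shape (the second kernel title is an illustrative older version, not from the incident) with the one-line change applied via sed against a local copy; on the real system the file being edited was /mnt/boot/grub/grub.conf:

```shell
# Write an example grub.conf resembling the one on the attached disk;
# the first kernel is the one that failed to boot, the second entry
# is an illustrative older kernel:
cat > /tmp/grub.conf <<'EOF'
default=0
timeout=5
title CentOS (2.6.32-504.8.1.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-504.8.1.el6.x86_64 ro root=/dev/xvda1
title CentOS (2.6.32-431.23.3.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-431.23.3.el6.x86_64 ro root=/dev/xvda1
EOF

# Entries are zero-indexed, so default=1 selects the second (older)
# kernel on the next boot:
sed -i 's/^default=0$/default=1/' /tmp/grub.conf
grep '^default=' /tmp/grub.conf   # prints: default=1
```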
After editing the grub.conf file I unmounted the disk, detached it from the test VM, attached it to the original VM and powered the original VM on; this time it booted correctly. So essentially there was something wrong with that kernel. I did see in /var/log/yum.log that it had been installed months earlier, but no reboot had been performed since, so the newer kernel had never been tested or used. This could have happened if, for example, the kernel installation had been interrupted.
I uninstalled this version of the kernel via yum and performed a “yum update” to move to the most recent kernel available. After this I set the grub.conf file back to “default=0” so that the new kernel would be used on the next boot, and performed a reboot. The virtual machine came back up on the latest kernel with no problems. You can view the running kernel version with “uname -r”, and it should match the default entry defined in grub.conf.
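That final verification step can be sketched as:

```shell
# Print the version of the kernel the system is currently running;
# after the reboot this should match the default (first) kernel
# entry in /boot/grub/grub.conf:
uname -r
```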
There was nothing logged in /var/log pointing to the kernel being the problem, as far as I could see. I suspect this was simply because no disk had yet been mounted: the kernel did not even load properly, so there was nowhere to write any logs out to. The boot attempt also failed too quickly for anything pointing to the problem to be displayed in the console.
In the past I have resolved a Linux virtual machine in XenServer failing to boot by performing a file system check; in this instance, however, that was not the problem. The symptoms here were also a bit different, as no errors were logged in CloudStack, XenCenter, or XenServer.
The problem here was specifically the kernel version the virtual machine was trying to boot with. By attaching its disk to another server and mounting it, we gained the ability to edit the contents of the disk and change the /boot/grub/grub.conf configuration file to select a different default kernel. Once all modifications were complete, the disk was unmounted, detached from the test VM and attached back to the original VM, which then booted correctly.