Windows Server Maintenance Checklist

Posted by Jarrod Farncomb on January 14, 2015

Server maintenance needs to be performed regularly to ensure that your server continues to run with minimal problems. While many maintenance tasks are now automated within the Windows operating system, there are still things that need to be checked and monitored regularly to keep Windows running optimally. Below are the steps that should be taken to maintain your servers.

Updates

Windows updates have been installed within the last month.
Keeping your server up to date is one of the most important maintenance tasks. Before applying updates, confirm that you have a recent backup, or a snapshot if you are working with a virtual machine, so that you can revert if the updates cause any unexpected problems.

If you are applying updates to a production server, aim to test them on a test server first. This allows you to confirm that the updates will not break your server and will be compatible with any other packages or software that you may be running. If possible, make use of a WSUS server, which allows you to approve and control the updates that go to particular server groups in your environment.

You can apply updates through Windows Update, which is accessible through Start > Control Panel > System and Security > Windows Update, or by typing Windows Update into the Start search box. From here you can check for and install updates, and select whether they should download and install automatically, which is enabled by default and is the recommended option. It may not always be possible to have production servers rebooting whenever there are updates, even out of hours, so you may need to disable automatic installation and schedule updates and reboots as required. If you have many servers you can configure Windows updates through group policy instead of locally on each server; how this best works for you will depend on your environment.

Other applications have been updated within the last week.
Web applications such as WordPress, Drupal and Joomla need to be updated frequently. These sorts of applications act as a gateway to your server, as they are usually more accessible than direct server access and allow public access in from the Internet. Many web applications also have third party plugins installed, which can be written by anyone and may contain security vulnerabilities in code that has not been audited.

As such it is critical to update these sorts of applications very frequently. These content management systems are not managed through Windows updates as they are standalone pieces of software. They may have update functions built in, or you may need to manually download and apply updates from an online source. If you’re unsure, contact the application provider for further assistance.

Security

Server access reviewed within the last 6 months.
To increase security you should review who has access to your server. In any given organization you may have staff who have left but still have accounts with access; these should be removed or disabled. There may be local accounts on the server, or domain accounts in Active Directory if your server is a member of a domain, with varying degrees of access, such as an administrator who should no longer be granted such permissions. Another group to check is the Remote Desktop Users group, as membership allows a user to connect remotely. Review these regularly to avoid a possible security breach. Logon activity can be reviewed through the security log in Event Viewer on each server, or on the domain controller in a domain environment. You can also check the members of important groups such as Administrators, Domain Admins and Remote Desktop Users.
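As a rough illustration of the review described above, the sketch below compares the members of a group against a list of current staff and reports anyone who is in the group but not on the staff list. The account names and both lists are made up for the example; in practice the group members would come from the group itself and the staff list from HR or a directory export.

```python
def stale_members(group_members, active_staff):
    """Return group members that are not in the list of active staff, sorted."""
    # Compare case-insensitively, since account names are not case sensitive.
    active = {name.lower() for name in active_staff}
    return sorted(m for m in group_members if m.lower() not in active)


# Hypothetical data for illustration only.
remote_desktop_users = ["alice", "bob", "Carol", "svc-backup"]
current_staff = ["Alice", "carol"]

for account in stale_members(remote_desktop_users, current_staff):
    print(f"Review access for: {account}")
```

Service accounts will show up in the output as well, which is usually what you want: they should be reviewed too, even if they end up staying.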

Firewall rules reviewed in the last 6-12 months.
Firewall rules should also be reviewed from time to time to ensure that you are only allowing required inbound and outbound traffic. Requirements for a server change over time: as applications are installed and removed, the ports that it is listening on may change, potentially introducing vulnerabilities, so it is important to restrict this traffic correctly. Ideally firewall changes should go through a change request process, which allows you to easily review who put a particular change in place and their reasons for doing so.

Windows operating systems come with Windows Firewall installed and running by default. In the default configuration only inbound traffic is restricted, while all outbound traffic is allowed. You can test which ports are responsive from another, external server by using telnet to connect to a specific port. You can also enable auditing so that denied traffic is logged and can be reviewed.
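The telnet test can also be scripted. The following is a minimal sketch, run from another machine, that attempts a TCP connection to each port and reports any port that answers but is not on your list of allowed ports. The function names and the allowed list are assumptions for the example, and the host should be given as an IP address, because a name that does not resolve raises an exception instead of returning False.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def unexpected_open_ports(host, ports_to_probe, allowed_ports):
    """Return probed ports that accept connections but are not allowed."""
    return [
        port
        for port in ports_to_probe
        if port not in allowed_ports and port_is_open(host, port)
    ]
```

For example, unexpected_open_ports("203.0.113.10", [21, 80, 443, 3389], allowed_ports={80, 443}) would return whichever of ports 21 and 3389 accept a connection. A port that does not answer may be closed or filtered; this test cannot tell the two apart.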

Confirm that users must change passwords.
User passwords should be configured to expire after a period of time; common periods are anywhere between 30 and 90 days. This means that a password is only valid for a set amount of time before the user is forced to change it. This increases security because if an account is compromised, the attacker’s access will not be maintained through that account once the password changes. It’s also worth checking that no users have been set to have their password never expire.

If your accounts are in Active Directory, this can be set centrally through group policy. Otherwise you can set it on a per account basis on the server itself through the Local Users and Groups console. However, this is not as scalable as using Active Directory because you need to implement the changes on all of your servers individually, which will take time and be harder to manage.
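The password checks above can be expressed as a small script. This is only a sketch: the account records, field names and the 90 day limit are assumptions for the example, and in a real environment the data would come from an export of your local or directory accounts.

```python
from datetime import date


def password_findings(accounts, today, max_age_days=90):
    """Return (name, reason) pairs for accounts that break the password policy."""
    findings = []
    for acct in accounts:
        if acct["never_expires"]:
            findings.append((acct["name"], "password set to never expire"))
            continue
        age = (today - acct["password_last_set"]).days
        if age > max_age_days:
            findings.append((acct["name"], f"password is {age} days old"))
    return findings


# Hypothetical account records for illustration only.
accounts = [
    {"name": "alice", "password_last_set": date(2015, 1, 2), "never_expires": False},
    {"name": "bob", "password_last_set": date(2014, 6, 1), "never_expires": False},
    {"name": "svc-web", "password_last_set": date(2013, 3, 9), "never_expires": True},
]

for name, reason in password_findings(accounts, today=date(2015, 1, 14)):
    print(f"{name}: {reason}")
```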

Backups

Backups and restores have been tested and confirmed to be working.
It is important to back up your servers in case of data loss. It is equally important to test that your backups work and that you can successfully complete a restore. Check that your backups are completing on a daily or weekly basis. Most backup software can notify you if a backup task fails, and any failure should be investigated and repaired as soon as possible.

It is a good idea to perform a test restore every few months or so to ensure that your backups are working as intended. This may sound time consuming but it is an important exercise. There are countless stories of backups appearing to work until all the data is lost; only then do people realize that they are not actually able to restore the data from backup.
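One way to make a test restore more than a visual check is to compare checksums of the original files against the restored copies. The sketch below assumes you have restored a folder to a scratch location; the function names are made up for the example. Files that changed on the server after the backup was taken will naturally be reported as different, so it works best against data that was static in the meantime.

```python
import hashlib
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir, restored_dir):
    """Compare every file under source_dir with its copy in restored_dir.

    Returns a list of (relative_path, problem) tuples. An empty list means
    every source file was restored with identical content.
    """
    source_dir, restored_dir = Path(source_dir), Path(restored_dir)
    problems = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file():
            problems.append((str(rel), "missing from restore"))
        elif file_digest(src) != file_digest(restored):
            problems.append((str(rel), "content differs"))
    return problems
```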

You can back up locally to the same server, but this is not recommended as you can lose the backup data if there is a problem at the server level. Preferably, back up to an external location, either on your network or out on the Internet; this could be your own server or a cloud storage solution like Amazon S3. If backing up to the Internet it is important to consider the sensitivity of your data. For instance, you may want to encrypt the data before uploading it to a third party location to keep it secure.

A simple and useful option is to enable shadow copies, which allow you to easily revert files. Like the local option, though, this is not ideal on its own, and an offsite backup of important data should still be maintained.

Monitoring

Monitoring has been checked and confirmed as working correctly.
If your server is used in production you most likely have it monitored for various services. It is important to confirm that this monitoring is working as intended and reporting correctly, so that you know you will be alerted if there are any issues. Incorrect firewall rules may disrupt monitoring, or your server may now be performing different roles than when the monitoring was originally set up, and so may need to be monitored for additional services. You can monitor services and ports internally from your network by using something like a Nagios server, or with an external service such as Pingdom. An external service is ideal for testing external facing services such as public websites, as the health check traffic comes in over the Internet just like a normal user’s, making it a good test.

Resource usage has been checked in the last month.
Resource usage is typically checked as a monitoring activity. It is, however, good practice to review long term monitoring data in order to spot resource increases or trends which may indicate that you need to upgrade a component of your server so that it can cope with the increased load. The details will depend on your monitoring solution, but you should be able to monitor CPU usage, free disk space, free physical memory and other variables against set thresholds; if these start to trigger more often you will know to investigate further. Through Windows you can also define performance monitors to track various resources over time and present them as a graph.
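To show what looking for a trend can mean in practice, the sketch below fits a straight line through disk usage samples and estimates how many days remain until the disk is full. The sample format and function name are assumptions for the example, and a straight line is a deliberately simple model; real growth is often uneven, so treat the result as a prompt to investigate and not as a forecast.

```python
def days_until_full(samples, capacity_gb):
    """Estimate days until a disk fills, from (day_number, used_gb) samples.

    Fits a least-squares line through the samples and extrapolates it to
    capacity_gb. Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = sum(day for day, _ in samples) / n
    mean_y = sum(used for _, used in samples) / n
    sxx = sum((day - mean_x) ** 2 for day, _ in samples)
    if sxx == 0:
        raise ValueError("samples must cover more than one day")
    sxy = sum((day - mean_x) * (used - mean_y) for day, used in samples)
    slope = sxy / sxx  # growth in GB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    last_day = max(day for day, _ in samples)
    full_day = (capacity_gb - intercept) / slope
    # Never report a negative number of days if the line is already past capacity.
    return max(0.0, full_day - last_day)
```

The same approach works for any resource that is sampled regularly, such as memory in use or the size of a database.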

Hardware errors have been checked in the last week.
Critical hardware problems will likely show up in your monitoring and be obvious, as the server may stop working correctly. You can potentially avoid this scenario by monitoring your system for hardware errors, which may give you advance warning that a piece of hardware is having problems and should be replaced before it fails. In Windows the best place to observe such events is Event Viewer, typically under Windows Logs > System; look for warnings and critical events.

File system maintenance

Unused applications have been removed in the last month.
You can both save disk space and reduce your attack surface by removing old and unused applications from your server. This hardens it, as there is less code available for an attacker to make use of. Checking for unused applications and removing them if they are not required should be done on a monthly basis. You can view the currently installed applications through Start > Control Panel > Programs > Programs and Features, or by searching for Programs and Features in Start. It will display a list of everything installed, when it was installed and the size used. If you find anything suspicious that should not be there, remove it and immediately investigate how it got there.

Disk integrity checked in the last month.
The traditional hard drive typically has the most moving parts of any component in a server, meaning it has the potential to have the most problems, so it should be checked often. With the ‘chkdsk’ command you can scan the hard drive and check for a number of problems. To run it graphically, open Computer, right click the drive and select Properties > Tools > Check now; this scans the drive and can repair errors. Otherwise you can run the command from the command prompt.

NTFS introduced an online self-healing feature in Windows Server 2008, resulting in chkdsk not being needed as often, and since Windows Server 2012 corruption can be scanned for and fixed online with no down time. In versions of Windows Server before 2012 the volume needs to be taken offline in order to repair it, though you can run a scan without the repair option online at any time. If you are running a server operating system older than 2012 and the chkdsk scan detects any faults, you will need to schedule down time so that the repair can run on the next boot.

Other general tasks

Event logs and statistics are being monitored daily or weekly.
Windows reports important events, which can be viewed through Event Viewer. These events should be checked at least weekly for warnings and critical issues. By default a server logs its events locally, which means that if you have a large number of servers you need to log into each one to check its events, which is not very efficient. To make this task easier you can set up forwarded events, where you choose one server to collect the events from all other servers, giving you one central location to store and view events via Event Viewer.
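If you export events to a CSV file, a short script can summarise them so that repeated warnings stand out. The sketch below is illustrative only: it assumes the export has Level and Source columns, and the sample rows are made up, so adjust the column names to match your own export.

```python
import csv
import io
from collections import Counter

LEVELS_OF_INTEREST = {"Critical", "Error", "Warning"}


def summarise_events(csv_text, levels=LEVELS_OF_INTEREST):
    """Count events per (level, source) for the levels worth reviewing."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["Level"] in levels:
            counts[(row["Level"], row["Source"])] += 1
    return counts


# Made-up sample rows; the column layout is an assumption.
sample = """Level,Date and Time,Source,Event ID
Information,2015-01-12 09:00:01,Service Control Manager,7036
Warning,2015-01-12 09:14:37,disk,51
Warning,2015-01-13 02:40:10,disk,51
Critical,2015-01-13 03:02:55,Kernel-Power,41
"""

# Most frequent combinations are printed first.
for (level, source), count in summarise_events(sample).most_common():
    print(f"{level:<9} {source:<15} {count}")
```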

Regular scans are being run on a weekly/monthly basis.
In order to stay secure it is important to scan your server for malicious content. Windows Defender is a first line of defence which comes installed with some versions of Windows, but ideally you should be using a more comprehensive antivirus/anti-malware solution and scanning regularly, at least on a weekly basis. Schedule the scan for a period of low resource usage so that it will not interfere with normal operations, for instance at 4am on Saturday if your servers are at their peak during the week. Don’t forget to ensure that your software also updates automatically every day or week so that it can detect the latest threats.

Check server reliability every month.
Windows Server comes with a reliability monitor that lets you view overall system stability and details about events that impact server reliability. It provides a stability index from 1 to 10 over a period of time to give you an idea of how reliably your server is running. When there are program crashes or other problems the index drops; the longer the server runs with few problems, the higher the index rises.

It’s an easy way to gain a quick overview of critical events that have happened on the server over time, which can help you connect a problem to the events that may have started it. For instance, a lot of critical events appearing after new software was installed would point you towards investigating that software further.
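The intervals used throughout this checklist can be tracked with a few lines of code. The sketch below is one possible approach, not a prescribed tool: the task names are shortened from the checklist items, months are approximated as 30 days, and where the text gives a range (such as 6-12 months) the shorter end is used. The last_done dates are supplied by you.

```python
from datetime import date, timedelta

# Intervals taken from the checklist items; months approximated in days.
INTERVALS = {
    "Windows updates": timedelta(days=30),
    "Application updates": timedelta(days=7),
    "Server access review": timedelta(days=182),
    "Firewall rule review": timedelta(days=182),
    "Resource usage check": timedelta(days=30),
    "Hardware error check": timedelta(days=7),
    "Unused application removal": timedelta(days=30),
    "Disk integrity check": timedelta(days=30),
    "Reliability check": timedelta(days=30),
}


def overdue_tasks(last_done, today, intervals=INTERVALS):
    """Return task names whose interval has elapsed, or that were never done."""
    overdue = []
    for task, interval in intervals.items():
        done = last_done.get(task)
        if done is None or today - done > interval:
            overdue.append(task)
    return overdue


# Hypothetical completion dates for illustration only.
last_done = {
    "Windows updates": date(2015, 1, 1),
    "Application updates": date(2015, 1, 1),
}

for task in overdue_tasks(last_done, today=date(2015, 1, 14)):
    print(f"Overdue: {task}")
```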

