How I patch 450+ Windows servers each month with minimal risk

This is how I think you can patch a large Windows Server estate on a regular basis, balancing the needs of InfoSec and a risk-averse corporation.

Background

In my wanderings round the internet, I've seen articles suggesting you should turn off Windows Update because of the latest patch releases (e.g. "Turn off Windows Update, temporarily", "Windows Patch Tuesday - Do this one thing before updating" or "How to Turn off Automatic Updates in Windows"). In my short career so far, I've worked in 2 companies managing the Windows Server estates (my previous employer was an ISP, and currently I work for a medium-sized financial organisation). The IT systems had many similarities but also many differences. For example, the Windows servers I managed at the ISP were purely for serving the internal IT infrastructure, whereas my current role has servers which provide customer-facing services. As such, the risk appetite was noticeably different between the two, but neither had "turn it off" as an option.

Focusing on my current role, the team I'm in are responsible for managing the ~450 (and always increasing) Windows servers. The operating systems they run are all from the Server product line and range from Windows Server 2008 all the way through to Windows Server 2016. Like most organisations, we have production servers & equivalent test/development servers. 90%+ of the servers are virtual instances with a handful still being physical. Like most corporate environments, our servers are paid-up members of the pets club as opposed to the cattle one, so we need to actively maintain them with ongoing security & critical bug fixes.

Clearly with more than a handful of servers to patch regularly, we're not going to be able to do that manually, so we automate the process as much as possible. As with so much in IT, there is more than 1 way to get the job done - but this is the way it's currently done, and it's been tweaked along the way as improvements are identified. I think it's a good example of how Windows servers can be patched every month: it strikes a good balance between patching as soon as possible and leaving enough time for any bad patches to surface in the wild, which reduces the risk. I'm quite proud of the process I helped create and I wanted to blog about it.

Windows Server Update Services (WSUS)

According to Microsoft:-

A WSUS server provides features that you can use to manage and distribute updates through a management console. A WSUS server can also be the update source for other WSUS servers within the organization. The WSUS server that acts as an update source is called an upstream server. In a WSUS implementation, at least one WSUS server on your network must be able to connect to Microsoft Update to get available update information.

While not perfect, I find WSUS to be a very functional tool considering it's a free download/role of Windows Server. Microsoft release their updates via the Windows Update system out on the public internet, and by default a Windows OS (Desktop & Server variants) will contact the Windows Update servers directly to get the latest updates. Updates are typically released on the second Tuesday of the month (a.k.a. Patch Tuesday).

In a corporate environment the typical (free) solution is to use WSUS which is a role you install on a local server you control. WSUS pulls all the updates down from Microsoft for the products and categories you need and stores them locally on the WSUS server. Tweak some settings on the managed servers and they'll talk just to your WSUS server for updates. This has the following benefits:-

  • Reduces your internet bandwidth because an update is only downloaded once to the WSUS server, then distributed internally to the other servers (and there can be some sizable updates in a given month)
  • Increases security as you can cut off internet access for your servers because they no longer need to contact the public update servers
  • Gives you an administrative view of the patch status of all your servers
  • Gives you control over what patches come down. Only want Server 2012 R2 & Office 2010 security patches? No problem, tick those boxes in the WSUS settings and those are all the updates you'll sync.
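The "tweak some settings" step usually comes down to a few registry values that the Windows Update client reads (the same ones the WSUS Group Policy settings write). Below is a minimal sketch for a server that cannot receive Group Policy. The server URL is a hypothetical placeholder, the function names are made up for illustration, and 8530 is only the usual default WSUS HTTP port, so treat it as an outline to adapt and not as a finished script.

```powershell
# Hypothetical WSUS URL - replace with your own server name and port.
$WsusUrl = 'http://wsus.example.internal:8530'

function Get-WsusClientPolicy {
    param([Parameter(Mandatory)][string]$ServerUrl)

    $wuKey = 'HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate'

    # One object per registry value the Windows Update client looks at.
    [pscustomobject]@{ Path = $wuKey;      Name = 'WUServer';       Type = 'String'; Value = $ServerUrl }
    [pscustomobject]@{ Path = $wuKey;      Name = 'WUStatusServer'; Type = 'String'; Value = $ServerUrl }
    [pscustomobject]@{ Path = "$wuKey\AU"; Name = 'UseWUServer';    Type = 'DWord';  Value = 1 }
}

function Set-WsusClientPolicy {
    param([Parameter(Mandatory)][object[]]$Policy)

    foreach ($item in $Policy) {
        # Create the policy key if it does not exist yet.
        if (-not (Test-Path -Path $item.Path)) {
            New-Item -Path $item.Path -Force | Out-Null
        }
        New-ItemProperty -Path $item.Path -Name $item.Name `
            -PropertyType $item.Type -Value $item.Value -Force | Out-Null
    }
}

# Preview the values first; nothing is written by this line.
$policy = @(Get-WsusClientPolicy -ServerUrl $WsusUrl)
$policy | Format-Table -AutoSize

# To apply, run elevated on the Windows server being configured:
# Set-WsusClientPolicy -Policy $policy
```

Keeping the list of values separate from the code that writes them makes it easy to review what will change before touching the registry.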

[Image: WSUS products]
[Image: WSUS classifications]

Machines which are looking to WSUS can be grouped together in different ways. For example, you can put them all into 1 big group, or split them into multiple groups, and a hierarchical structure can be set up too. It's up to you and your needs. Updates need to be approved before WSUS clients will apply them, and updates can be approved for some groups and not others.

The WSUS clients (in our case all our clients are the other Windows Servers in our organisation) need configuring to talk to the WSUS server, and to automatically patch at a set time. The built-in update client on Windows machines can only be configured to update at a time and day of week, but not on a particular day of the month. So we do this differently. We use Rob Dunn's Windows Update Agent force script, which forces the Windows Update client to re-check for updates, then downloads, installs and reboots if required. We run the script via a Scheduled Task set to run at a specific day and time. Using this script we have a great amount of control over when updates get applied. As most of our servers are AD joined, we use Group Policy Objects (GPOs) linked to the OUs where the server objects sit to configure the update client settings and create the scheduled task. The non-domain joined servers are manually configured to achieve the same result. As GPOs are linked to OUs, changing when a server is patched is a simple case of moving a server's AD computer object between OUs (our OU structure is very simple in that all servers live in 1 OU. If it was more complex, we could use GPO filtering and have the computer objects be members of an AD group which grants permission to read and apply the correct WSUS GPO).
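For a server that has to be configured by hand, one way to get a "last Monday of the month at 02:00" trigger is the monthly schedule type of schtasks.exe. This is only a sketch of that idea: the task name and the command path are hypothetical placeholders, and the helper function is invented for illustration.

```powershell
function Get-PatchTaskArguments {
    param(
        [Parameter(Mandatory)][string]$TaskName,
        [Parameter(Mandatory)][ValidateSet('MON','TUE','WED','THU','FRI','SAT','SUN')][string]$Day,
        [Parameter(Mandatory)][ValidatePattern('^\d{2}:\d{2}$')][string]$StartTime,
        [Parameter(Mandatory)][string]$Command
    )

    # /SC MONTHLY with /MO LAST and /D <day> means "the last <day> of every month".
    # /RU SYSTEM runs the task as Local System; /F overwrites an existing task.
    @('/Create', '/TN', $TaskName,
      '/SC', 'MONTHLY', '/MO', 'LAST', '/D', $Day,
      '/ST', $StartTime, '/RU', 'SYSTEM',
      '/TR', $Command, '/F')
}

# Hypothetical task name and wrapper script path, for illustration only.
$taskArgs = Get-PatchTaskArguments -TaskName 'Patching-Y1' -Day MON `
    -StartTime '02:00' -Command 'C:\Scripts\Invoke-Patching.cmd'

# Show the command line that would be run.
'schtasks.exe ' + ($taskArgs -join ' ')

# To create the task, run elevated on the target server:
# & schtasks.exe @taskArgs
```

Building the argument list separately means it can be printed and checked before anything is registered on the server.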

For service availability and to spread the risk, we split our servers into multiple groups and patch those groups at different days and times depending on system requirements. The groups we currently have are:-

Group   Day                             Time
X1      Last Monday of the month        02:00
X1b     Last Monday of the month        04:30
X2      Last Wednesday of the month     02:00
X2b     Last Wednesday of the month     04:30
X3      Last Friday of the month        02:00
X3b     Last Friday of the month        04:30
Y1      Last Monday of the month        02:00
Y1b     Last Monday of the month        04:30
Y2      Last Wednesday of the month     02:00
Y2b     Last Wednesday of the month     04:30
Y3      Last Friday of the month        02:00
Y3b     Last Friday of the month        04:30

Any server hosting a dev/test/UAT system would be in an X group, while any server hosting a live system would be in a Y group. As all X groups come before Y groups, we ensure that patches roll out to non-live servers first. Additionally, for a given system, the dev/test/UAT server(s) would be in the same numbered group as the live ones.
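To see which calendar dates a "last Monday/Wednesday/Friday of the month" schedule lands on, the dates can be worked out with a few lines of PowerShell. This is an illustrative helper (the function name is made up), handy for planning change windows, and is not part of the patching mechanism itself.

```powershell
function Get-LastWeekdayOfMonth {
    param(
        [Parameter(Mandatory)][int]$Year,
        [Parameter(Mandatory)][ValidateRange(1, 12)][int]$Month,
        [Parameter(Mandatory)][System.DayOfWeek]$DayOfWeek
    )

    # Start on the last day of the month and walk backwards
    # until the requested weekday is reached.
    $date = [datetime]::new($Year, $Month, [datetime]::DaysInMonth($Year, $Month))
    while ($date.DayOfWeek -ne $DayOfWeek) {
        $date = $date.AddDays(-1)
    }
    $date
}

# Patch nights for January 2019.
foreach ($day in 'Monday', 'Wednesday', 'Friday') {
    $night = Get-LastWeekdayOfMonth -Year 2019 -Month 1 -DayOfWeek $day
    '{0,-9} {1:yyyy-MM-dd}' -f $day, $night
}
# Monday    2019-01-28
# Wednesday 2019-01-30
# Friday    2019-01-25
```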

Most of our servers are split arbitrarily between the numbered groups, but there are some cases where we place servers in specific groups:-

  • Clusters of servers (e.g. a web farm with 2 or 3 nodes) would be split across groups to ensure service availability (i.e. server1 in X1, server2 in X2 means the cluster always has a node to run the services on even during server reboots)
  • When 2 servers are dependent on each other and server1 has to be up before server2 comes up, we would put server1 in say X1, and server2 in X1b. This means we can ensure that both servers are patched on the same night, but that one has been done before the other

Having the times of these groups as they are gives us a good window between release and first application in which Microsoft can identify any issues in the wild and pull the affected patches back. We've only had to pull 1 patch that Microsoft didn't, which was KB4088875 with its known issue "Static IP address settings are lost after you apply this update". As we run vSphere for our virtualization platform, this affected us and our first patching group was badly affected. As soon as the issue was identified we declined the update in WSUS so no other servers got it. As the issue was later fixed by Microsoft, it got applied in the following month's cycle - nothing needed to be fully turned off! ;)
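Declining an update can be done in the WSUS console, but it can also be scripted. The sketch below shows one possible approach and is not necessarily how it was done at the time. It assumes the WSUS API exposes each update's KB numbers through a KnowledgebaseArticles collection of plain number strings (without the "KB" prefix). The Select-UpdateByKb helper is a made-up name, and the WSUS calls are left commented out because they only work on a server with the UpdateServices module.

```powershell
function Select-UpdateByKb {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]$Update,
        [Parameter(Mandatory)][string]$KbNumber
    )
    process {
        # Pass through only the updates that reference the given KB article.
        if ($Update.KnowledgebaseArticles -contains $KbNumber) {
            $Update
        }
    }
}

# On the WSUS server, review the matches before declining anything:
# $wsus    = Get-WsusServer
# $matches = @($wsus.GetUpdates() | Select-UpdateByKb -KbNumber '4088875')
# $matches | Select-Object Title, IsDeclined
# $matches | ForEach-Object { $_.Decline() }
```

Listing the matching titles first is a cheap safeguard, because one KB number can cover several updates (one per OS version, for example).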

WSUS synchronisation with Microsoft

As previously mentioned, WSUS synchronises the products and categories of updates you want with Microsoft Update. Usually this is set to automatically sync on a daily basis, which for a long time is what we did.

We recently remembered that updates don't always come down to WSUS on Patch Tuesday. This came to light when our DBAs noticed that a SQL update had gone to some test/dev servers and not others. This wasn't ideal. We needed to ensure that once the first X group is patched, no new updates synchronise from Microsoft. Once the last Y group has completed, synchronisation can start again.

This was achieved by turning off automatic synchronisation and instead running a PowerShell script on the WSUS server every day. The script performs a date check and if the current day is in the first or last week of the month then the script ends. If it's not, it initiates a WSUS synchronisation using this command:-

(Get-WsusServer).GetSubscription().StartSynchronization()
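A minimal sketch of what such a date check might look like is below. It is an illustration only and not the actual script: it assumes "first week" means the 1st to the 7th and "last week" means the final 7 days of the month, and the Test-SyncAllowed function name is made up.

```powershell
function Test-SyncAllowed {
    param([datetime]$Date = (Get-Date))

    $daysInMonth = [datetime]::DaysInMonth($Date.Year, $Date.Month)

    # Assumed definitions: first week = days 1-7, last week = final 7 days.
    $inFirstWeek = $Date.Day -le 7
    $inLastWeek  = $Date.Day -gt ($daysInMonth - 7)

    -not ($inFirstWeek -or $inLastWeek)
}

if (Test-SyncAllowed) {
    # Get-WsusServer only exists where the UpdateServices module is installed,
    # so the sync is skipped quietly on any other machine.
    if (Get-Command -Name Get-WsusServer -ErrorAction SilentlyContinue) {
        (Get-WsusServer).GetSubscription().StartSynchronization()
    }
}
```

Taking the date as a parameter makes the check easy to try against specific days, for example `Test-SyncAllowed -Date ([datetime]::new(2019, 1, 28))`, before trusting it in a daily scheduled task.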

Pros/Cons

To summarise, I feel that the way we do things right now has the following benefits:-

  • Servers are patched every month, and are always patched before the next Patch Tuesday
  • The patch windows give us a good amount of time between patch release and patch deployment. This gives time for buggy patches to be pulled by Microsoft or ourselves if required.
  • Test/dev servers are always patched before live ones, so we get to test the updates on our systems before they go onto live. As test servers are patched on the same day of the week as the equivalent live ones, there's always a gap of 1 week between them, which gives a good amount of time for problems to surface
  • WSUS manual sync script means when we start patching no more updates are retrieved from Microsoft, so all the groups will get the same updates
  • We have the flexibility to split out farms of servers over 3 patching windows to maintain availability (e.g. a 3 node file cluster has one node patched on Monday, another on Wednesday & the other one on Friday).
  • We can patch servers which need to be patched before others and do this on the same night. E.g. server1 has to be up when server2 is patched and rebooted. This is done by having server1 in the 02:00 group and server2 in the 04:30 group.

But, there are currently a couple of glaring omissions from our setup that we need to address going forward. These are:-

  • There is no easy, automated mechanism to push out an out-of-band/emergency patch. When Microsoft release out-of-band updates, if our InfoSec requirements mandate they be applied ASAP then it's a very time-consuming task to achieve.

  • There could be too long a wait for an in-band update to be deployed. Depending on how the calendar falls, the worst case scenario is that there can be up to 2 weeks 3 days between Microsoft releasing the patch and it being applied to the first servers (i.e. in a 31 day month when Patch Tuesday is on the 8th - such as January 2019 - the first X group would not go on until Friday 25th - 17 days later!).

    [Image: longest time between patch release and application]
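The January 2019 figure can be checked with a little date arithmetic. The sketch below finds Patch Tuesday (the second Tuesday of the month) and counts the days to the 25th. The function name is made up for illustration, and the 25th is taken straight from the example above and not calculated from the group schedule.

```powershell
function Get-PatchTuesday {
    param(
        [Parameter(Mandatory)][int]$Year,
        [Parameter(Mandatory)][ValidateRange(1, 12)][int]$Month
    )

    # Find the first Tuesday of the month, then add a week for the second.
    $date = [datetime]::new($Year, $Month, 1)
    while ($date.DayOfWeek -ne [System.DayOfWeek]::Tuesday) {
        $date = $date.AddDays(1)
    }
    $date.AddDays(7)
}

$release         = Get-PatchTuesday -Year 2019 -Month 1
$firstPatchNight = [datetime]::new(2019, 1, 25)
$gapDays         = ($firstPatchNight - $release).Days

'{0:yyyy-MM-dd} -> {1:yyyy-MM-dd}: {2} days' -f $release, $firstPatchNight, $gapDays
# 2019-01-08 -> 2019-01-25: 17 days
```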

Conclusion

Our patching process started out in a pretty bad way - mostly ad-hoc manual patching and a lot of servers not patched. But over the course of 6-9 months, organisational trust grew that patching and the processes were robust enough to bring more and more servers into this automated mechanism without bringing systems to their knees. It's been running as described above for some time now. Minor tweaks (such as the PowerShell script initiated WSUS synchronisation) have been made as the need arose, and the process will continue to improve and change as business & industry requirements develop over time.