Running maintenance on your 24/7 SBC or VDI farm

You’ve finally got your SBC or VDI environment up and running. It runs smoothly, the major bugs and issues have been solved, and users are working happily. That’s usually the easy part. Now comes the part where you actually have to maintain that environment. You’ll have to install Windows hotfixes, Office hotfixes and patches for your SBC/VDI vendor software, update existing applications, and probably also install new applications that the business has requested.

So you’ll need a method of deploying all those things, preferably with the least amount of downtime. A lot of companies out there run 24/7. They have a workforce around the globe and a completely centralized SBC/VDI environment, so the environment is used in different time zones and needs to be up and running around the clock. Requesting any downtime is a pain in the ass. But you still need to perform changes on a (very) frequent basis, and you can’t do that while users are working. Nowadays people don’t accept having to wait weeks for an application. Besides that, you need to install Windows hotfixes every month if you want to keep up with the hotfix cycle.

Performing maintenance

So you need to perform changes on an environment that can’t have downtime. How do you do it? You need something that performs maintenance on part of your environment while the other part makes sure users can still work normally. That does require that your environment has enough capacity, and obviously it needs redundancy on core components like your broker.

LoginAM has a maintenance plugin which is an intelligent process to perform maintenance without interrupting the workforce. The maintenance plugin employs a couple of tactics to get this job done:

  1. There are a few methods of performing maintenance (I’ll be focusing on the first option in this post):
    1. We can perform maintenance by dividing machines into 2 groups and processing them sequentially (servers inside a group are processed in parallel). This is also called 50/50 and is by far the most used method.
    2. We can perform maintenance on the servers 1 by 1.
    3. We can perform maintenance randomly. Same as 1 by 1 but the order is random.
    4. We can perform maintenance on all of the servers at once.
  2. Putting the server in drain mode, so the machine doesn’t accept new users but does accept re-connections to existing sessions. Do this for a prolonged period of time and the number of users (if SBC) will slowly decrease to the point that you can reboot the server without having to log off users.
  3. Running on a schedule, either daily or weekly. It’s simple enough to work out a more complex schedule on your own: starting maintenance is just a PowerShell command in our module, so you can create any schedule using scheduled tasks.
  4. The ability to import custom pre- and post-maintenance scripts. The pre-maintenance script is intended for you to gracefully stop any services that might be running on the machine if needed. The post-maintenance script can be used to verify the functionality of the machine after maintenance has performed all changes. I’ll write a more extensive blogpost about this later on (done! find it here)
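
To make the four maintenance methods from the first point concrete, here is a minimal Python sketch of how a list of servers could be divided into batches for each of them. This is not LoginAM code: the function name `plan_batches`, the mode strings and the way an odd number of servers is split are all assumptions made for illustration.

```python
import random


def plan_batches(servers, mode, seed=None):
    """Return a list of batches for the given maintenance mode.

    Batches are processed one after another; servers inside the same
    batch are processed in parallel. All names here are hypothetical.
    """
    servers = list(servers)
    if mode == "50/50":
        # Two groups; with an odd count the first group gets the extra server
        # (an assumption, the post does not say how odd counts are handled).
        half = (len(servers) + 1) // 2
        return [batch for batch in (servers[:half], servers[half:]) if batch]
    if mode == "one-by-one":
        return [[server] for server in servers]
    if mode == "random":
        # Same as one-by-one, but in a random order.
        shuffled = servers[:]
        random.Random(seed).shuffle(shuffled)
        return [[server] for server in shuffled]
    if mode == "all-at-once":
        return [servers] if servers else []
    raise ValueError(f"unknown maintenance mode: {mode!r}")
```

With four servers, the 50/50 mode gives two batches of two, one-by-one gives four batches of one, and all-at-once gives a single batch of four. Only the 50/50 and one-by-one style modes keep part of the farm available while the rest is being maintained.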

If you combine these features with the fact that LoginAM gives you the ability to automate virtually any application or change, you get a fully automated change process without having to be a BOFH to your users.

Login AM maintenance plugin in depth

Performing maintenance isn’t as simple as putting all the above features together. The process itself is more complex, and LoginAM also provides you with additional options to make your life easier or to fine-tune the process. I’ll only be discussing the 50/50 maintenance mode. First let’s discuss the 2 additional features the maintenance plugin offers you.

  1. You can exclude specific servers (comma separated list). You might have a server that’s malfunctioning or that’s providing services for a VIP user, or you might have any other reason to skip maintenance.
  2. You can configure Login AM to send you a report about the environment. This consists of an overview of servers with the status of maintenance and if they have encountered any errors.

To get a better understanding of how maintenance works I’ve created a workflow (see below).

LoginAM Maintenance plugin workflow.

 

Let’s run through the workflow for some additional explanation. An important thing to note here is that this is run from the client machines (from a LoginAM perspective). The LoginAM server is only used as a central point for information exchange on some occasions.

  • Maintenance starts through a scheduled task. So if you don’t find the scheduling options we provide flexible enough, you can always create the scheduled task yourself (using a LoginAM package of course).
  • First up is a cache update. Maintenance might have been disabled since the last cache update, so we want to get the latest changes from the server.
  • Now we get 3 different checks. The first 2 are pretty self-explanatory: if maintenance has been disabled or the server has been excluded, the machine aborts maintenance.
  • The 3rd check works in combination with the next task. For whatever reason the scheduled task might be late to the party. So we check if the other servers in this collection have already started maintenance. If so, the train has already started moving and this server also aborts maintenance.
  • Then we wait for 5 minutes, allowing all servers to process the previous task.
  • Once all servers have started maintenance they create a list of servers (this is done on the Login AM server).
  • With this list of servers the individual servers can determine which group they are in, so they know whether they need to process maintenance first or second.
  • Then we come to the maintenance process per group
    • First up is a check if this is the first (group A) or the second (group B) group. If it’s the second group, they check if the first group has processed maintenance correctly. If more than 20% have failed, the second group will abort maintenance. This is a fail-safe built in to prevent too many machines from failing maintenance and subsequently not functioning properly anymore.
    • Then, if the logon drain option is enabled, the server is set to drain mode and waits a configurable number of minutes for users to log off.
    • If users are still using the server after that time, they’re sent 3 separate messages requesting them to log off.
    • Finally users are forcibly logged off.
    • Next up is the deployment reboot (with the custom pre & post maintenance scripts which I’ll discuss in a separate post).
    • If the deployment reboot is done all machines in the group wait for each other to be ready before giving the go for the second group.
  • Then the second group processes the sub-tasks mentioned above
  • Finally maintenance is finished.
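
Two decisions in the workflow above lend themselves to a small sketch: how a server works out its group from the shared server list, and how group B decides whether group A did well enough to continue. The Python below is an illustration only, not how LoginAM implements it. The function names are hypothetical, and sorting the list alphabetically and putting the first half in group A is an assumption; the post only states that each server derives its group from the list and that more than 20% failures aborts the second group.

```python
def assign_group(server, started_servers):
    """Return "A" or "B" for a server, given the list of servers that started.

    Assumption: every server sorts the same list the same way, so all of
    them reach the same split without talking to each other.
    """
    ordered = sorted(started_servers)
    half = (len(ordered) + 1) // 2  # group A gets the extra server if odd
    return "A" if ordered.index(server) < half else "B"


def group_b_may_start(group_a_results, max_failure_percent=20):
    """Decide whether group B continues after group A has finished.

    group_a_results maps server name -> True if maintenance succeeded.
    Group B aborts only if MORE than max_failure_percent of group A failed.
    Integer arithmetic avoids floating-point surprises at exactly 20%.
    """
    total = len(group_a_results)
    failed = sum(1 for ok in group_a_results.values() if not ok)
    return failed * 100 <= max_failure_percent * total
```

For example, with five servers in group A, one failure (exactly 20%) still lets group B proceed, while two failures (40%) makes group B abort so that at least half of the farm stays untouched and usable.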

To give you an example of how this is configured, imagine the following scenario:

  • A company has 3 offices which all work in the same SBC/VDI environment located in a single datacenter
  • They have Amsterdam (UTC+2), Shanghai (UTC+8) and San Jose (UTC-7).
  • We’re assuming they all have the same number of workers.
  • They want to have daily maintenance on all machines in the SBC/VDI environment
  • Preferably nobody should be forcibly logged off during their local office hours.
  • We’ll be using the 50/50 maintenance method

So, taking UTC as the standard, the 8-to-5 office hours are:

  • 00:00 until 08:00 for Shanghai
  • 06:00 until 15:00 for Amsterdam
  • 15:00 until 23:00 for San Jose

So Shanghai and Amsterdam have a small overlap, which will be the busiest time on the platform. You definitely don’t want maintenance during that time. Apart from that, average utilization should be at 1/3 of the total number of workers we have. So, in my opinion, the best moment to start maintenance is just after the busiest part of the day, which is 06:00 until 08:00 UTC. We can therefore start maintenance at 08:00 UTC, when only Amsterdam should be working (10:00 local time). From that point onward we have 22 hours in total before 2 offices are working at the same time again, and by then you want the environment to be fully back up and running.

Since we don’t want users to be logged off forcefully during their local office hours we’ll set a drain time of 7 hours. Also take into account some processing time for the tasks before the drain time starts (say 15 minutes). So after we start maintenance at 08:00 users won’t be logged off before 17:15 UTC (only on the first group of servers, since we use 50/50 mode). That means the Amsterdam office is well past its office hours, and since we set drain mode the San Jose office will be logged into the group of servers that will be processing maintenance second.

So at 17:15 the first group of machines will reboot and start processing all changes. Depending on the number of changes they’ll be back up and running somewhere between 30 minutes and, say, 2 hours later at most for all changes that night.

Obviously you don’t want to plan an Office upgrade, a major batch of Windows patches and whatever else together. So make sure you don’t schedule too much at the same time, both for the sake of the time it will take and for the sake of not changing everything at once, which makes troubleshooting in case of errors that much more complicated.

So at the earliest machines will be back online at 17:45 and at the latest 19:15. Somewhere between those times the second group of machines will start maintenance. So they’ll start drain mode and wait until either 00:45 or 02:45 before starting maintenance and processing changes, again for a minimum of 30 minutes and a maximum of 2 hours. So at the earliest they’ll be done around 01:15 and at the latest 04:45, which is still over an hour before the Amsterdam office starts its local office hours again.

Getting a good schedule for this while keeping your workforce in mind can be a complicated task, but it really pays off to have a daily or weekly maintenance window. This is also something to be communicated/discussed with your business. It really is a business decision whether they want to be able to get new applications and changes in on a daily basis or on a weekly basis. Also keep in mind that being on a daily maintenance schedule gives you the option to quickly deploy any emergency changes, for example a critical zero-day security fix from Microsoft or a change for a business-critical application which started malfunctioning after something was upgraded on the application backend.

They probably won’t be thrilled about the thought of having to log off, but once you explain the how and why they usually understand. Also keep in mind that if they need to log off (which they’ll be asked 3 times to do), they can immediately log back in on the other set of servers. If you’ve got your SBC/VDI environment managed correctly they should be able to continue working without any problems.

In conclusion

As you can see, the whole process becomes quite complicated. We run as many tasks as possible from the individual servers, without interaction with other components. Minimizing these dependencies makes for a very stable maintenance process, which is very important whenever you’re automating the entire change process.

An important thing to note is that Login AM has support for a DTAP chain or DTAP street (wiki). This enables you to fully test any changes you make, making sure the entire automated maintenance process works without a glitch in the production environment.

 
