One of the most important safety nets in IT Operations is contingency. Every migration needs a rollback plan in the event that things don’t quite go the way you’d expect, and with a limited timeline to implement a change, or in some cases, a complete migration, the rollback process is one that is an essential component. Without a plan to revert all changes back to their previous state, your migration is destined for failure from the outset. No matter how confident you are (I’ve yet to meet a project manager who doesn’t build in redundancy or rollback in one form or another) there is always going to be something you’ve missed, or a change that produces undesirable results.
It is this seemingly innocent change that can have a domino effect on your migration - unless you have access to a replica environment, the result of the change cannot be realistically predicted. Admittedly, it’s a simple enough process to clone virtual machines to test against, but that’s of no consequence if your change relates to those conducted at hardware level. A classic example of this is a firewall migration. Whilst it would be possible to test policies to ensure their functionality meets the requirement of the business, confirming VPN links for example isn’t so straightforward - especially when you need to rely on external vendors to complete their piece of the puzzle before you can continue. Unless you’re deploying technology into a greenfield site, you do not have the luxury of testing a VPN into a production network during business hours. Based on this, you have a couple of choices
- You perform all testing off hours by switching equipment for the replacement, and perform end to end testing. Once you are satisfied everything works as it should, you put everything back the way you found it, then schedule a date for the migration.
- You configure the firewall using a separate subnet, VLAN, and other associated networking elements meaning the two environments run symmetrically
But which path is the right one ? Good question. There’s no hard and fast rule to which option you go for - although option 2 is more suited to a phased migration approach whilst option 1 is more aligned to “big bang” - in other words, moving everything at the same time. Option 2 is good for testing, but may not reflect reality as you are not targeting the same configuration. As a side note, I’ve often seen situations where residual configuration from option 1 has been left behind, meaning you either land up with a conflict of sorts, or black hole routing.
Making use of a rollback
This is where the rollback plan bridges the gap. If you find yourself in a situation where you either run out of time, or cannot continue owing to physical, logical or external constraints, then you would need to invoke your rollback plan. It’s important to note at this stage that part of the project plan should include a point where the progress is reviewed and assessed, and if necessary, the rollback is executed. My personal preference is within around 40% of the allocated time window - all relevant personnel should reconvene and provide status updates around their areas of responsibility, and give a synopsis of any issues - and be fully prepared to elaborate on these if the need arises. If the responsible manager feels that the project is at risk of overrunning it’s started time frame, or cannot be completed within that window, he or she needs to exercise authority to invoke the rollback plan. When setting the review interval, you should also consider the amount of time required to revert all changes and perform regression testing.
Rollback provides the ideal opportunity to put everything back how it was before you started on your journey - but it does depend on two major factors. Firstly, you need to allocate a suitable time period for the rollback to be completed within. Secondly, unless you have a list of changes that were made to hardware - inclusive of configuration, patching, and a myriad of others, how can you be sure that you’ve covered everything ?
Time after time I see the same problem - something gets missed, and turns out to be fundamental on Monday morning when the changes haven’t been cross checked.
So what should a contingency plan consist of ?
One surefire way to ensure that configurations are preserved prior to making changes is to create backups of running configs - 2 minutes now can save you 2 days of troubleshooting when you can’t remember which change caused your issue. For virtual machines, this is typically a snapshot that can be restored later should the need arise. A word to the wise though - don’t leave the machine running on snapshot for too long as this can rapidly deplete storage space. It’s not a simple process to recover a crashed VM that has run out of disk space.
Keep version and change control records up to date - particularly during the migration. Any change that could negatively impact the remainder of the project should be examined and evaluated, and if necessary, removed from the scope of works (provided this is a feasible step - sometimes negating a process is enough to make a project fail)
Document each step. I can’t stress the importance of this enough. I understand that we all want to get things done in a timely manner, but will you realistically remember all the changes you made in the order they were implemented ?
Use differential tools to examine and easily highlight changes between two configurations. There are a number of free tools on the internet that do this. If you’re using a Windows environment, a personal favourite of mine is WinMerge. Using a diff tool can separate the wood from the trees quickly, and provides a simple overview of changes - very useful in the small hours, I can assure you.
Working on a switch or firewall ? Learn how to use the CLI. This is often superior in terms of power and usually contains commands that are not available from the GUI. Using this approach, it’s perfectly feasible to bulk load configuration, and also back it out using the same mechanism.
What if your rollback plan doesn’t work ? Unfortunately, there is absolutely no way to simulate a rollback during project planning, and this is often made worse by many changes being made at once to multiple systems. It’s not that the rollback doesn’t work - it’s usually always a case of settings being reverted before they should be. In most cases, this has the knock on effect of denying yourself access to a system - and it’s always in a place where there are no local support personnel to assist - at least, not immediately. For every migration I have completed over my career, I’ve always ensured that there is an alternative route to reach a remote device should the primary path become inaccessible. For firewalls, this can be a blessing - particularly as they usually permit access on the public interfaces.
However, delete a route inadvertently and you are toast - you lose access to the firewall full stop - get out of that one. What would I do in a situation like this where the firewall is located in Asia for example, and you are in London ? Again - contingency. You can’t remove a route on a firewall if it was created automatically by the system. In this case, a VLAN or directly connected interface will create it’s own dynamic route, and should still be available. If dealing with a remote firewall, my suggestion here would be Out Of Band Management (OOBM), but not a device connected directly to the firewall itself, as this presents a security risk if not configured properly. A personal preference is a locally connected laptop in the remote location that uses either independent WiFi or a 3G / Mifi presence. Before the migration starts, establish a WebEx or GoToMeeting session (don’t forget to disable UAC here as that can shoot you in the foot), and arrange for a network cable to be plugged into switch fabric, or directly. Direct is better if you can spare the interface, as it removes potential routing issues. Just configure the NIC on the remote machine with an address in the same subnet add the interface you’re connected to, and you’re golden.
I’ve used the above as a get out of jail free card on several occasions, and I can assure you it works.
So what are the takeaways here ?
The most important aspect is to be ready with a response - effectively a “plan b” when things go wrong. Simple planning in advance can save you having to book a flight, or foot the expense of a local IT support firm with no prior knowledge of your network - there’s the security aspect as well; you’d need to provide the password for the device which immediately invokes a change once the remediation is complete. In summary
- Thoroughly plan each migration and allow time for contingency steps. You may not need them, and if you don’t, then you effectively gain time that could be used elsewhere.
- Have an alternative way of reaching a remote device, and ensure necessary third party vendors are going to be available during your maintenance window should this be necessary.
- Take regular config backups of all devices. You don’t necessarily need an expensive tool for this - I actually designed a method to make this work using Linux, a TFTP server, and a custom bash script - let me know if you’d like a copy 🙂
- Regularly analyse (automated diff) configuration changes between configurations. Any changes that are undocumented or previously approved are a cause for alarm and should be investigated
- Ensure that you have adequate documentation, and steps necessary to recover systems in the event of failure
Any thoughts or questions ? Let me know !