I sat on several calls today — on a day off, no less. Sigh.
We shouldn’t really even need to be making a case for automated deployments, but unfortunately I find myself (still) doing so regularly.
Today was to be spent studying for an AWS exam and doing some other productive things. Like most other responsible professionals, I don’t mind hopping on a quick call or two on a day off — in fact, I like it. It makes Monday (today is Friday) a little easier.
But today was not to be spent studying.
The first call (2 hours) was spent trying to understand why a web deployment to a production server had gone awry and was returning 502 errors. Most articles on 502 errors begin by explaining that the error is hard to track down. No amount of checking the Application log seemed to help.
Dev and QA work without issue. Production appears to match, inasmuch as I’m allowed to see. These issues occur often enough that I left several days of wiggle room prior to the release. So, luckily, we have 5 days to figure things out.
On the second call, a developer on my team and I watched the SysAdmin deploy several SSIS packages. There are not only packages involved, but of course T-SQL powering those packages, shares that need to be set up, SQL Jobs, etc.
The Admin was doing a remarkable job with excellent attention to detail. He was also logging everything he was doing with screenshots and notes.
Still, on one occasion I asked him to check one of the shares he had created: he had added permissions to the share, but not to the folder. A missed step.
I asked him to check several additional things. Each of those was done perfectly.
A few minutes later, I remarked to the developer that one of the new Integrations wasn’t appearing on our dashboard. The script to create the database entry was missing, so the developer created it on the fly and amended our instructions. The Integration still didn’t appear. The amended script had a copy-paste error in it and still hadn’t created the correct database entry for the new integration.
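A missing or mistyped database entry like this is the sort of thing an automated post-deployment check could catch before anyone looks at a dashboard. Here is a minimal sketch of the idea. It assumes a hypothetical `integrations` table and made-up integration names, and it uses an in-memory SQLite database as a stand-in for our real SQL Server, so treat it as an illustration rather than our actual script.

```python
import sqlite3

# Hypothetical integration names; the real ones would come from the release notes.
EXPECTED_INTEGRATIONS = {"payments_export", "ledger_import"}


def make_demo_db():
    """Create an in-memory database with a hypothetical integrations table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE integrations (name TEXT PRIMARY KEY)")
    return conn


def register_integration(conn, name):
    """Idempotent insert: running it twice still leaves exactly one row."""
    conn.execute("INSERT OR IGNORE INTO integrations (name) VALUES (?)", (name,))


def missing_integrations(conn, expected):
    """Return the expected integrations that have no database entry."""
    rows = conn.execute("SELECT name FROM integrations").fetchall()
    return set(expected) - {name for (name,) in rows}


if __name__ == "__main__":
    conn = make_demo_db()
    register_integration(conn, "payments_export")
    register_integration(conn, "ledger_improt")  # simulated copy-paste error
    print("Missing after deployment:", missing_integrations(conn, EXPECTED_INTEGRATIONS))
```

The point is less the SQL than the habit: the deployment states what should exist afterwards, and a script compares that against what actually exists.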
We follow a fairly typical deployment pattern.
- It works on the developer’s machine.
- It works in a development environment.
- It works in a QA environment (servers, databases, etc.) that the developer cannot access. It’s deployed to QA by the same SysAdmin who deploys to Production. A tester tests in both DEV and QA. Automated tests run where appropriate.
- It gets deployed to Production.
The second call, observing the Production deployment, took over an hour.
The problem is not the people, at least not these two people working on the deployment. They’re both remarkably competent professionals. They have great attention to detail. They were both calm, reserved and professional the entire time.
The problem is with the process and the inability or unwillingness to move away from this archaic process. What’s worse is the stress that accompanies these deployments. One tiny mistake can cause an entire process to fail.
For us, this can be a failure to transmit financial information in a timely or correct fashion.
What’s needed, of course, is the ability to perform hundreds or even thousands of test deployments against identical environments. Ideally, that means standing up identical environments, running an automated deployment and automated test scripts, tearing the whole thing down, and going again. It also means inserting planned failures into the process to observe a successful rollback and notification.
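To make that loop concrete, here is a rough sketch of deploy, fail on purpose, roll back, notify, and repeat. Everything in it is an assumption for illustration: the step names are loosely based on the pieces mentioned earlier, the `Environment` class is a stand-in for real disposable infrastructure, and `notify` is just a callback where an email or chat alert would go.

```python
from dataclasses import dataclass, field

# Hypothetical deployment steps, loosely modelled on the pieces described above.
STEPS = ["tsql_scripts", "ssis_packages", "shares", "sql_jobs"]


class DeploymentError(Exception):
    """Raised when a deployment step fails."""


@dataclass
class Environment:
    """Stand-in for a disposable environment; tracks which steps are applied."""
    deployed: list = field(default_factory=list)


def run_trial(steps, fail_at=None, notify=print):
    """Deploy into a fresh environment; on failure, roll back and notify.

    Returns True if every step was applied, False if a rollback happened.
    """
    env = Environment()  # a fresh environment for every trial
    try:
        for step in steps:
            if step == fail_at:
                # Planned failure, injected to exercise the rollback path.
                raise DeploymentError(f"planned failure at {step}")
            env.deployed.append(step)
    except DeploymentError as exc:
        # Undo the applied steps in reverse order.
        while env.deployed:
            env.deployed.pop()
        notify(f"rolled back: {exc}")
        return False
    return True  # the environment is simply discarded afterwards (teardown)


if __name__ == "__main__":
    # One clean run, then one run per possible failure point.
    print("clean run succeeded:", run_trial(STEPS))
    for step in STEPS:
        run_trial(STEPS, fail_at=step)
```

Because each trial starts from a fresh environment and throws it away afterwards, running it a thousand times costs machine time rather than people's days off.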
Not entirely by coincidence, I had a slide deck open on another machine from a CloudFormation presentation given at a Chicago AWS Group meeting that I was unable to attend.
Of course, CloudFormation and SSIS deployments aren’t really the same animal, but conceptually, the ability to automate infrastructure and the ability to automate application deployment aren’t terribly different.
I’ve been pushing the issue lately. Today’s deployment took a combined team time of 10 hours. Most pieces deployed fine in the end, but the single piece that didn’t means that what did get deployed can’t yet be used.
The team is hundreds of hours into the effort to modernize this process. More modern security, a better UI, regression testing over 800 files (all we had) and coordination for the rollout are all potentially delayed. That’s to say nothing of the time we all should have spent doing more productive things.
There is one thing holding us up from modernizing and that is an inability or unwillingness to improve this process.
But not for much longer.