Anatomy of a Rolling Deploy

I’ve yet to see a discussion of rolling deploys that covers all the edge cases I’ve encountered. Given the importance of a solid deploy process, I’ve compiled the lessons of my experience so far in the hope that others may learn them the easy way.

Definition: A rolling deploy or rolling update is the act of releasing a new version of your application to a cluster of instances without an outage, by updating only one instance at a time. See also Blue Green Deployment. Today we’ll assume your application is a web service.

Consider the obvious deploy process that most of us start with.

  1. Stop app
  2. Push out new code (e.g. push to instance git repo, upload war file, etc.)
  3. Start app

If your app is in active use, you will soon tire of the momentary outages this causes. Fortunately, your app runs on multiple nodes behind a load balancer (right?). Enter the idealized rolling deploy.

The Fairy Tale

Once upon a time there lived a load balancer.

For each instance, the charming prince will:

  1. De-activate instance in load balancer
  2. Stop app
  3. Push out new code
  4. Start app
  5. Reactivate instance in load balancer

Simple. In fact, so simple that it almost certainly won’t work.

The Real World

Here are some problems you are likely to encounter if you implement the above as-is.

It disrupts requests in progress

You must wait some period of time after traffic has been diverted. Consider using a command like netstat to determine when in-flight requests are finished, rather than waiting the full timeout period every time. You might script it like this.
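
A minimal sketch of connection draining, assuming the app listens on a known port and the `ss` utility is available (the port number and timeouts here are illustrative; parsing netstat output works similarly):

```python
import subprocess
import time

APP_PORT = 8080  # assumed app port; adjust for your service


def established_count(ss_output: str, port: int) -> int:
    """Count ESTAB connections whose local address ends in :port."""
    count = 0
    for line in ss_output.splitlines():
        fields = line.split()
        if (len(fields) >= 5 and fields[0] == "ESTAB"
                and fields[3].endswith(f":{port}")):
            count += 1
    return count


def wait_for_drain(port: int, timeout: float = 60, interval: float = 2) -> bool:
    """Poll until no established connections remain on the port, or give up."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        # `ss -Htn`: TCP sockets, numeric ports, no header line
        out = subprocess.run(["ss", "-Htn"],
                             capture_output=True, text=True).stdout
        if established_count(out, port) == 0:
            return True
        time.sleep(interval)
    return False
```

A successful `wait_for_drain(APP_PORT)` then gates the “stop app” step; a False return means you fell back to the full timeout.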

Current users are logged off

If your app is user facing with server-side sessions, they will typically be lost. Depending on platform there are different ways to deal with this, such as session replication and session persistence.

It sets off your alarms

If you have alerting on the status of individual nodes, it should be disabled during this process, or you will get paged every time someone does a deploy. Monitoring of the load-balanced URL can be left in place.

It might remove the only app from service

You should check the number of healthy nodes before proceeding. More precisely, if N nodes are required to handle current load make sure N+1 are running before taking one out.
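
A minimal sketch of that guard, assuming you can determine each node’s health by some means (the status-check mechanics are up to you):

```python
def healthy_count(node_statuses: dict) -> int:
    """node_statuses maps node name -> health boolean, however determined."""
    return sum(1 for healthy in node_statuses.values() if healthy)


def safe_to_remove(healthy_nodes: int, required_for_load: int) -> bool:
    """If N nodes are needed for current load, insist on N+1 healthy
    nodes before taking one out of service."""
    return healthy_nodes >= required_for_load + 1
```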

It might put an app in service before it’s ready

You must wait for the app to start hosting.

It might put a broken app in service

You must do any relevant status checks or smoke tests before a node is re-enabled.
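
Both checks reduce to polling until a condition holds. A sketch, assuming the app exposes a status URL that answers 200 once it is healthy (the URL and timeouts are illustrative):

```python
import time
import urllib.error
import urllib.request


def wait_until(check, timeout: float = 120, interval: float = 3) -> bool:
    """Poll check() until it returns True or the timeout expires."""
    deadline = time.time() + timeout
    while True:
        if check():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)


def status_ok(url: str) -> bool:
    """True when the status page answers 200; any error counts as not ready."""
    try:
        return urllib.request.urlopen(url, timeout=5).getcode() == 200
    except (urllib.error.URLError, OSError):
        return False
```

Usage might look like `wait_until(lambda: status_ok("http://node1:8080/status"))`, with a hypothetical node URL.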

The Hardened Process

I use this list as a blueprint. Not every step is required for every environment but they all should be considered.

For each instance:

  1. Verify this is not the last remaining node in service
  2. Replicate active user sessions to other instances (if applicable)
  3. Deactivate instance in load balancer
  4. Deactivate instance in monitoring
  5. Wait for active HTTP connections on the app port to finish (AKA connection drain)
  6. Stop app (and wait for app to stop)
  7. Push out new code
  8. Start app
  9. Wait for app to start hosting
  10. Test app (query status page, run system tests)
  11. Run any initialization tasks (populate cache, replicate sessions)
  12. Reactivate instance in monitoring
  13. Reactivate instance in load balancer
  14. Notify interested parties that a deploy has taken place
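
The steps above can be sketched as a small driver that runs an ordered list of steps against one instance at a time; the step implementations (load balancer calls, monitoring API, file pushes) are yours to fill in:

```python
def rolling_deploy(instances, steps, can_proceed):
    """Apply each (name, fn) step, in order, to one instance at a time.

    `steps` is an ordered list of (step_name, callable) pairs.
    `can_proceed(instance)` is the step-1 guard: it should verify that
    enough healthy nodes remain before this one leaves service.
    Returns a log of (instance, step_name) pairs in execution order.
    """
    log = []
    for inst in instances:
        if not can_proceed(inst):
            raise RuntimeError(f"not enough healthy nodes to remove {inst}")
        for name, fn in steps:
            log.append((inst, name))
            fn(inst)
    return log
```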

Yet More Things to Consider

Repeatable on Failure

A failed deploy may leave the world in an odd state: some nodes in service, some out. Your deploy should be able to recover from a failed state when re-run, with no other intervention. To ensure robust automation, ask of each step, “What happens when this step is interrupted or fails? Will a later run be able to handle that state?” If yes, then your process can be called re-entrant (see also idempotent).
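
In practice that means each step checks the current state before acting. A sketch with a stand-in load balancer class (a real one would wrap whatever LB API you have):

```python
class FakeLoadBalancer:
    """Stand-in for your load balancer API; tracks which nodes are active."""

    def __init__(self, nodes):
        self.active = set(nodes)

    def is_active(self, node):
        return node in self.active

    def deactivate(self, node):
        self.active.discard(node)


def deactivate_step(lb, node):
    """Re-entrant: re-running after a failed deploy is a harmless no-op
    rather than an error, because the step checks state first."""
    if lb.is_active(node):
        lb.deactivate(node)
```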

Status Pages

This deserves its own manifesto, but I’ll give the basics here. Automation is greatly aided by a consistent, machine-readable way to determine app status. For instance, you might make all of your apps answer “/status” with a JSON packet indicating the health of the service, including the status of dependencies such as database connections and other diagnostic information. Include all the detail you like in the body (I like to dump the system environment variables), but make the HTTP code indicate overall status – e.g. 200 for OK, 500 for something wrong. You can then point your load balancers at the status URL and they won’t have to parse anything.

If you make status page generation part of your internal shared library, you can be sure all apps will respond in a consistent way. Dropwizard ships with this out of the box attached to the “/healthcheck” path. Whatever the URL, consider blocking it from outside traffic if your app is web facing.
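
A sketch of the core mapping, assuming each app aggregates its dependency checks into a dict; the HTTP plumbing around it is whatever your framework provides:

```python
import json


def status_response(checks: dict) -> tuple:
    """Map dependency checks to (http_code, json_body).

    `checks` maps a dependency name (e.g. "database") to a boolean health
    result. The HTTP code carries the overall verdict, so load balancers
    never need to parse the body.
    """
    healthy = all(checks.values())
    body = json.dumps({"healthy": healthy, "checks": checks})
    return (200 if healthy else 500, body)
```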

Advice for the Enterprise

Maintenance windows

If your organization is a tad larger than startup size, someone may be performing maintenance on a server while someone on another team attempts to deploy the service in question. In this case it’s important that the “out of service for deploy” state is distinct from the “out of service for manual maintenance” state. That way the deploy can avoid putting nodes back in service while they are being worked on.

Alternative to touching load balancers directly

Depending on your architecture, it may be problematic for the deploy scripts to access your load balancers, either for security reasons or because the list of load balancers changes periodically. In this case, that coupling can be undone by adding a feature to the status page. By performing some special action, such as creating a specially named file on the target machine, the deploy can cause the status page to report “down” with an explanation. Then, wait the appropriate period of time for the load balancers to notice the new state (usually around 15 seconds).
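
A sketch of how the status page might honor such flag files, with hypothetical paths. Keeping the deploy flag distinct from a maintenance flag also serves the maintenance-window concern above:

```python
import pathlib

# Hypothetical flag locations; pick paths that suit your hosts.
DEPLOY_FLAG = pathlib.Path("/var/run/myapp/deploying")
MAINT_FLAG = pathlib.Path("/var/run/myapp/maintenance")


def down_reason(deploy_flag=DEPLOY_FLAG, maint_flag=MAINT_FLAG):
    """Reason the status page should report 'down', or None if healthy.

    Maintenance wins over deploy, so an automated deploy never re-enables
    a node an operator took out of service by hand.
    """
    if maint_flag.exists():
        return "out of service for manual maintenance"
    if deploy_flag.exists():
        return "out of service for deploy"
    return None
```

The deploy script simply touches the flag file, waits for the load balancers to notice, and removes it after the smoke tests pass.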

Miss Anything?

Feel free to comment with other rolling deploy edge cases you’ve seen. Thanks!

Acknowledgements: My automation methodology was greatly influenced by past co-workers including Rebecca Case, Mike Finney, Rick Buford, Dan Watt, Jason McIntosh, Gary Brown and the Data Center 2.0 Team at Carfax.

Originally published November 10th, 2014 for GetSet Learning.