Recently I did some research on software releases and how huge successful companies manage to release their software without causing system failures and poor user experience.
I found this question answered by a former Facebook release engineer: “How do big companies like Facebook, Google manage software releases without causing system outages and poor user experience?” In addition to this great answer, there is also an interview at Ars Technica: “Exclusive: a behind-the-scenes look at Facebook release engineering”.
Since the very beginning, Facebook has followed the principle of zero downtime and no service interruption. How do they accomplish this, even now that they are big?
Pushing releases in multiple phases
Facebook pushes its releases in waves. The initial phase, named “latest”, always contains the latest code changes (hence the name). All engineers are connected to the “latest” staging system; they gather in an IRC channel when the initial push happens and watch logs and error messages meticulously. If the build proves to be okay, it is pushed to some servers in production (phase p1). Again, all developers concerned with the release gather in the IRC channel to watch the release and track KPI changes, log messages and errors. The next stage, phase p2, covers roughly 5% of all live servers and is again thoroughly monitored. Finally, once phase p2 also proves to be good, the deployment takes place on all servers. Deployment at Facebook means copying a 1.5 GB binary to all servers, which is done with a customized BitTorrent distribution system.
What if an error occurs? Well, the developers are on IRC and are held accountable for fixing their piece of code. If you crash it, you repair it!
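To make the wave idea concrete, here is a minimal sketch of a phased push that stops as soon as a phase looks unhealthy. It is not Facebook's tooling: the function name `phased_rollout`, the `deploy` and `is_healthy` callbacks, and the 1% share for p1 are all assumptions for illustration (the articles only say "some servers" for p1 and roughly 5% for p2).

```python
from typing import Callable, List, Sequence, Tuple

# Phase names follow the articles. The p1 share (1%) is an assumption;
# p2 is "roughly 5%", and the last phase covers every server.
PHASES = [("p1", 1), ("p2", 5), ("full", 100)]


def phased_rollout(
    servers: Sequence[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[Sequence[str]], bool],
    phases: Sequence[Tuple[str, int]] = PHASES,
) -> Tuple[List[str], List[str]]:
    """Push to a growing share of servers, stopping at the first bad phase.

    Returns (names of phases that passed, servers that received the build).
    """
    deployed = 0
    passed: List[str] = []
    for name, percent in phases:
        # Ceiling division on integers, with at least one server per phase.
        target = max(1, (len(servers) * percent + 99) // 100)
        target = min(target, len(servers))
        for server in servers[deployed:target]:
            deploy(server)
        deployed = max(deployed, target)
        # In the articles this check is people watching logs and KPIs on IRC.
        if not is_healthy(servers[:deployed]):
            break
        passed.append(name)
    return passed, list(servers[:deployed])
```

With 200 servers this pushes to 2, then 10, then all 200, and a failed health check after p2 leaves the remaining 190 servers on the old build.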
Multiple versions of code running simultaneously
Executing code in Facebook’s server environment automatically means running multiple versions of the code simultaneously, because a phased push always leaves some servers on the old build while others already run the new one. Designing for this takes extra effort. The hardest part is migrating database schemas from one version to the next.
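The articles do not describe how Facebook handles such migrations, so the following is only a sketch of one common approach: during the transition, new code writes both the old and the new columns and reads either shape. The column names (`name`, `first_name`, `last_name`) and both helper functions are hypothetical.

```python
from typing import Dict


def write_user(first_name: str, last_name: str) -> Dict[str, str]:
    """New-version write: fill the new columns and keep the old one populated.

    Old code that only knows the `name` column keeps working on these rows.
    """
    return {
        "first_name": first_name,
        "last_name": last_name,
        "name": f"{first_name} {last_name}".strip(),
    }


def read_display_name(row: Dict[str, str]) -> str:
    """Read a row written by either code version."""
    if "first_name" in row or "last_name" in row:
        parts = [row.get("first_name", ""), row.get("last_name", "")]
        return " ".join(p for p in parts if p)
    # Row written by the old version: only the combined column exists.
    return row.get("name", "")
```

Only once no server runs the old version any more can the old column be dropped in a later release.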
Features with on-off toggles
Facebook uses a tool named “Gatekeeper” to allow real-time on/off switching and throttling of features. Only a few code changes need to be introduced, and Facebook operations can then control the traffic and which features are available. Code in such environments needs to be highly decoupled, with no dependencies between features …
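A toggle with throttling can be sketched in a few lines. This is not Gatekeeper's real interface, which the articles do not show; the class `FeatureGate`, its methods and the hash-based bucketing are assumptions chosen to illustrate the idea of switching a feature on for a percentage of users.

```python
import hashlib
from typing import Dict


class FeatureGate:
    """Hypothetical in-memory feature switch with percentage throttling."""

    def __init__(self) -> None:
        self._rollout: Dict[str, int] = {}  # feature name -> percent (0..100)

    def set_rollout(self, feature: str, percent: int) -> None:
        """Operations call this to open, throttle or close a feature."""
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[feature] = percent

    def is_enabled(self, feature: str, user_id: int) -> bool:
        """Application code calls this around the feature's code path."""
        percent = self._rollout.get(feature, 0)  # unknown features are off
        # Hash feature and user together so each user lands in a stable
        # bucket (0..99) that differs from feature to feature.
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % 100
        return bucket < percent
```

Because the bucket is derived from a hash rather than from a random draw, a user who sees the feature at 10% still sees it at 50%, and setting the rollout back to 0 switches it off for everyone without a new deployment.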
Versioned static resources across the web tier
All the web servers in Facebook’s server farm are able to serve the static content of every version currently deployed. This means that all servers are equipped with all resources before phase p1 deployment starts, so a request for an old or a new asset can land on any server. This allows the whole web tier to remain stateless.
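One way to picture this is a store in which every asset is addressed by release version, so that pages rendered by old and new code can both be served from the same machine. The sketch below is an illustration only: the class `StaticStore`, the `/static/<version>/<path>` URL layout and the release names are assumptions, not details from the articles.

```python
from typing import Dict, Tuple


class StaticStore:
    """Hypothetical per-server store holding the assets of several releases."""

    def __init__(self) -> None:
        self._files: Dict[Tuple[str, str], bytes] = {}

    def publish(self, version: str, files: Dict[str, bytes]) -> None:
        """Copy a release's assets to this server before any code is pushed."""
        for path, content in files.items():
            self._files[(version, path)] = content

    def url(self, version: str, path: str) -> str:
        """URL that a page rendered by `version` embeds for an asset."""
        return f"/static/{version}/{path}"

    def serve(self, url: str) -> bytes:
        """Serve an asset for whichever version the URL names."""
        prefix, version, path = url.lstrip("/").split("/", 2)
        if prefix != "static":
            raise KeyError(url)
        return self._files[(version, path)]


store = StaticStore()
store.publish("r41", {"js/app.js": b"old bundle"})
store.publish("r42", {"js/app.js": b"new bundle"})
```

Since the version is part of the URL, the server needs no session or routing state to decide which copy to return.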
If you condense what’s said in the articles, it comes down to these points:
- Automate everything!
- Test early, test often!
- Hold developers responsible and let them fix live errors.
- Each release has an owner, and all stakeholder teams are involved.
- The product is designed to be rolled back. From the beginning.
- The product is designed to execute multiple versions at the same time.
- Run multiple environments!
- Deploy incrementally!