Software releases without damages and poor user experience @ Facebook

Recently I did some research on software releases and how huge successful companies manage to release their software without causing system failures and poor user experience.

I found this question being answered by a former Facebook release engineer: “How do big companies like Facebook, Google manage software releases without causing system outages and poor user experience?“. In addition to this great answer, there is also an interview at arstechnica “Exclusive: a behind-the-scenes look at Facebook release engineering“.

Facebook follows since the very beginning the principle of zero downtime and no service interruption. How do they accomplish this – even now, when they’re big?

Pushing releases in multiple phases

Facebook pushes their releases in waves. The initial phase, named “latest” always contains the latest code changes (hence the name). All engineers are connected to the “latest” staging system, gather in an IRC channel when the initial push happens and watch logs and error messages meticulously. If the build proves to be okay a push to some servers in production happens (p1-phase). Again, all developers concerned with the release collect in the IRC-channel to watch the release and gather KPI changes, log messages and errors. The next stage, phase p2, includes roughly 5% of all live server systems – again thoroughly monitored. Finally, when phase p2 proves to be good again, the deployment takes place on all server systems. Deployment @ Facebook means copying a 1,5GB binary to all servers – done with a customized bit torrent distribution system.

If an error occurs? Well, the developers are on IRC and held accountable to fix their piece of code. If you crash it, you repair it!

Multiple versions of code running simultaneously

Executing code in Facebooks’ server environment means automatically running multiple versions of code simultaneously. This means an extra effort to address this principle. Hardest is to migrate database schemes from one to another.

Features with on-off toggles

Facebook utilizes a tool named “Gatekeeper” to allow real-time on/off switching and throttling of features. Only few code changes need to be introduced and Facebook operations can control the traffic and which features are available. Code in such environments need to be highly decoupled – no dependencies between features …

Versioned static resources across the web tier

All the web servers in Facebooks server farm are able to serve all static content of all versions being deployed. This means that all servers are equipped with all resources prior to phase p1 deployment. This allows the whole web tier to remain stateless.

If you condense down what’s said in the articles it comes to these points:

  1. Automate everything!
  2. Test early, test often!
  3. Hold developers responsible and let them fix live errors.
  4. Each release has an owner involved from all stakeholder teams.
  5. The product is designed to be rolled back. From the beginning.
  6. The product is designed to execute multiple versions at the same time.
  7. Run multiple environments!
  8. Deploy incremental!

Principles & rules – basement for a good engineering culture @ Google

Principles and rules seem to be a good foundation for a good (or great) engineering culture. In a recent question on Quora on “How do Google, Facebook, Apple and Dropbox maintain excellent code quality at scale?” there was an interesting link towards the engineering culture established at Google.

According to the source Google established – and since then follows the mentioned principles below.

1. All developers work out of a ~single source depot; shared infrastructure!
2. A developer can fix bugs anywhere in the source tree.
3. Building a product takes 3 commands (“get, config, make”)
4. Uniform coding style guidelines across company
5. Code reviews mandatory for all checkins
6. Pervasive unit testing, written by developers
7. Unit tests run continuously, email sent on failure
8. Powerful tools, shared company-wide
9. Rapid project cycles; developers change projects often; 20% time
10. Peer-driven review process; flat management structure
11. Transparency into projects, code, process, ideas, etc.
12. Dozens of offices around world => hire best people regardless of location

In the Quora answers it becomes obvious … huge code bases are maintainable only when a culture of ownership and pride is established. The first step, however, is obviously to establish a set of rules – the basement for the engineering culture.

Seeding, growing, harvesting!

Agile defined.

How is Agile defined? What does it mean?

In recent discussion within my company and also in discussion with other people, I recognized that people use the term Agile a lot. It is also obvious that people partyl have a thoroughly different understanding of the meaning of Agile.

When reading about the topic in the internet, following some people on twitter and reading books about agility and similar topics it becomes apparent that Agile has arrived at mass-movement. It is no longer well-understood and sharply defined. The term is more or less a buzzword. Think of “SOA”, “Test Driven Development”, “High Availability” or “Big Data”. All of them arrived at the buzzword-level. Anyway, that’s hard to change.

Agile defined

Looking up Agile at a dictionary it returns:

agile, adjective

  1. quick and well-coordinated in movement
  2. active; lively: an agile person.
  3. marked by an ability to think quickly; mentally acute or aware

So, we’re talking about an adjective – a closer description of the state of something. Agile doesn’t stand on its own. It refers to something. Interesting. But what does it refer?

Agile in organizations – defined

Agile is not a framework

In the context of agile organizations the term very commonly gets confused with other agile things. A lot of people refer to their organization as agile since they introduced SCRUM or Kanban as their software development frameworks. These people confuse Agile with frameworks with comparable core values and motivations. In SCRUM the core values are commitment, openness, focus, respect and courage. However, SCRUM as a framework focusses only on a certain aspect of the overall Agile movement.

Agile is not a methodology

Furthermore, some people think Agile is a methodology. If you follow well-known process steps, applying allways the same pattern to certain situations you apply a certain methodology to arrive at a goal. But Agile is not a collection of best-practices, a rule-set and you’re fine.

Agile is not a goal

Others look at their organization with the sole ambition to become Agile. But there is no state you can arrive at and claim – “Now I’m Agile”. Agile is the path, not the goal.

So what is a definition for Agile in organization context?

Agile defined

Read the great blog post from Jeff Patton about Agile development is more culture than process. Also a great source of insights is the Agile Manifesto.

Project Manager? No need in SCRUM any more …

Project Manager? Not known in SCRUM books – hence no need for it?!?!

SCRUM in books transports the idea of “You don’t need project managers any more”. Why is it so? Well, the roles in SCRUM books – product owner, SCRUM master, team – explicitly exclude a project manager.

The project manager’s work is shared across various roles:

  • The product owner manages the backlog, prioritizes the stories and creates release plans.
  • The SCRUM master “enforces” the mechanics of the process, organizes meetings and moderates them.
  • The team takes over the detailed planning sessions and estimations.

When we introduced SCRUM we really enjoyed getting rid of all all the structures and limitations of the project work. We arrived 100% in “agile”. During our SCRUM setup days we decided to move our “project managers” into “SCRUM master”. Now with their new very limited scope of “not breaking” the new process rules. All the other duties – gone!

The Break – Working on Epics

The new modell went all well until … we started our first epic. In waterfall days the epic was called project.

After some time SCRUM works very well organizing the software development. Mainly because the process SCRUM isn’t that hard to apply and everybody likes it – because it’s agile and simple. But the overarching business planing – looking beyond the daily SCRUM work … we lost it almost entirely.

Who does the prioritization work on which project – ahhh epic – we should work next? Is it the product owner? The VP product? The CEO? The whole management team? Who takes care of the resource conflicts arising automatically if you have scarce resources? Who detects those conflicts and escalates them? How do we deal with delays?

Quite some questions. A lot of companies switching from traditional – thoroughly controlled processes to the agile modell encounter comparable issues, we learned.

How did we solve it? Well, we started to recognize that SCRUM is only one piece of the jigsaw puzzle. Furthermore, you need to work on your agile organization. Not only product development and product management are customer of your SCRUM teams. A lot of topics – owned e.g. by online marketing or customer care center – need attention as well.

In the end, when multiple departments within our organization are involved in work we started to include project managers again. They have an inter-department communication and coordination role. So, for us project managers did not disappear entirely, they showed up again in the context of strategic projects with a slightly different role.

Lessons learned?

  • SCRUM is good to organize the product development part of your agile organization
  • There is more than just SCRUM to become truly agile
  • In an agile organization you better have roles like project managers