Reaching goals – lots of micro steps actually make the goal!

At Facebook’s developer conference f8 in 2014, Edwin Smith of the High-Performance Server Infrastructure team shared some insights on HHVM, the PHP runtime project built around performance (27:37 onwards). In his talk he also described how the team almost failed to reach a very ambitious goal but finally managed it … with 1% micro steps. They actually overachieved.

What happened? By October 2012 the team had spent nearly 2 years of development time creating a virtual machine / just-in-time compiler to boost Facebook’s execution performance. As early as April 2012 they had realized that the newly created project was 3 times slower than the current execution environment, and the plan was to go live at the end of 2012. In October 2012 the team realized that continuing with the working model they had followed so far would not allow them to reach their goal.

So the need to improve execution performance by a factor of 3 or more (ambitious goal) met a hard go-live deadline (time box).

At that point, the team stopped working the way they had before and changed to a drastically different model.

New work model to achieve performance goals

They changed from a project working model to a kanban-like working model and started focusing on micro-steps. Each of these steps shouldn’t take longer than a day or two. If the result was measurable and positive, great. If not, the team simply documented the effort and moved on (“furiously iterate”).

The backlog of ideas for HHVM performance improvements

Before starting work on the final period from October to December, the team held a brainstorming session to fill up their backlog. Each of these micro 1% performance improvement steps was documented. The backlog was organized along two axes: impact from left to right, with the least impact on the right, and effort from top to bottom, with the least effort at the top. Ideally, all steps would be located at the top left (low effort but high impact). Those, however, were already covered.
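One possible way to order such a backlog can be sketched in a few lines of Python. The idea names and the impact and effort estimates below are invented for illustration and are not taken from the talk; sorting by impact first and effort second is just one reading of the two-axis board.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Idea:
    name: str              # hypothetical idea name
    impact_percent: float  # estimated performance gain in percent (assumed)
    effort_days: float     # estimated effort in days (assumed)


def prioritize(backlog):
    """Order ideas like the board: highest impact first, least effort breaks ties."""
    return sorted(backlog, key=lambda idea: (-idea.impact_percent, idea.effort_days))


backlog = [
    Idea("inline small functions", impact_percent=1.0, effort_days=2.0),
    Idea("cache type lookups", impact_percent=1.5, effort_days=1.0),
    Idea("tune allocator", impact_percent=1.0, effort_days=1.0),
]

for idea in prioritize(backlog):
    print(f"{idea.name}: ~{idea.impact_percent}% for {idea.effort_days} day(s)")
```

Ranking by the ratio of impact to effort would be an equally reasonable choice; the point is only that every idea carries both estimates so the cheapest high-impact steps surface first.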

Tasks done during HHVM performance tuning period

The team documented the finished tasks with positive and no / negative impact on the board as well. A great learning experience.

The impact was validated with a fine-grained measuring tool that allowed the team to identify even the smallest performance improvements.

Facebook HHVM result

The result of the effort is amazing. By focusing on these micro-steps, the team reached their goal and even went beyond it.

The team has kept this working model ever since. They have periods of hard and focused work. They pick a goal and divide the path towards this goal into micro steps. They work for a short time on one of these steps and use metrics (validation) to decide whether to pivot (learning: wrong direction) or to persevere (learning: right direction). When the goal is reached, the team does further fine-tuning on the achievements, or goes on vacation. Afterwards, they continue with another iteration.


Facebook and their mobile release process

The talk “Hacker Way: Releasing and Optimizing Mobile Apps for the World”, given by Chuck Rossi at Facebook’s f8 conference in 2014, describes how Facebook changed its organizational structure to reflect the importance of mobile for Facebook’s future. Chuck heads the company’s release team and is responsible for all releases.

Impact of Mobile strategy on organization

Before re-prioritizing everything within Facebook and focusing on mobile, the development team was organized mainly around channels:

Development Organization of Facebook before moving towards mobile

This distribution of developers actually led to heavy prioritization problems. The different product teams focused on Desktop Web prioritized their topics and came up with a numbered list of items. Those prioritizations were then handed over to the platform experts, who faced the problem of seeing the #1 priority item of the “Messages team” competing with the #1 priority item of, e.g., the “Events team”.

Facebook overcame this organizational issue by organizing their development differently:

Development Organization of Facebook after moving towards mobile

Now, the Facebook engineering team has product and platform experts mixed, working on features across all platforms.

Software Releases at Facebook

Facebook has some simple rules – simple but set in stone:

    A release cannot be postponed. If a feature isn’t ready in time, it will not make it into this release.
    Facebook is data driven. KPIs are watched thoroughly after a release. If they don’t develop as expected, a change needs to happen (e.g. fix forward or modification).
    Since the releases are already scheduled, there is always the next release. If you can’t get your feature in today, it will be part of the release tomorrow. This relaxes the overall organization and takes away a lot of the pain experienced when the next release is months away.
    The release team is responsible for delivering a stable product. When the team picks the finished items (30 to 300 for a daily release), they carefully take the stories into the release candidate. The selection is described as “subjective”. They follow a simple rule when building the release package: “If in doubt, there is no doubt”.

Facebook releases their web platform following a plan:

Facebook’s desktop web platform release plan

On Sunday at 6 p.m. the release team tags the next release branch, directly from trunk. The release branch is stabilized until Tuesday, 4 p.m., and then released as a big release including 4000 to 6000 changes, i.e. 1 week of development. From Monday through Friday, Facebook does two releases a day. These are cherry-picked changes, around 30 to 300 per release.

For mobile the plan obviously differs a bit:

Facebook’s native web platform release plan

On mobile the overall release principle is actually the same as described above. The development cycle is 4 weeks: on the day the previous release is shipped to the various app stores, the next release candidate is taken from master. The candidate then spends 3.5 weeks in stabilization. Each candidate includes a further 100-120 cherry picks taken during this stabilization period. When stabilization is over, the release candidate is tested and not touched any more.


Software releases without damages and poor user experience @ Facebook

Recently I did some research on software releases and how huge successful companies manage to release their software without causing system failures and poor user experience.

I found this question answered by a former Facebook release engineer: “How do big companies like Facebook, Google manage software releases without causing system outages and poor user experience?”. In addition to this great answer, there is also an interview at Ars Technica: “Exclusive: a behind-the-scenes look at Facebook release engineering”.

From the very beginning, Facebook has followed the principle of zero downtime and no service interruption. How do they accomplish this, even now that they’re big?

Pushing releases in multiple phases

Facebook pushes their releases in waves. The initial phase, named “latest”, always contains the latest code changes (hence the name). All engineers are connected to the “latest” staging system, gather in an IRC channel when the initial push happens and watch logs and error messages meticulously. If the build proves to be okay, a push to some servers in production happens (phase p1). Again, all developers concerned with the release gather in the IRC channel to watch the release and follow KPI changes, log messages and errors. The next stage, phase p2, includes roughly 5% of all live server systems, again thoroughly monitored. Finally, when phase p2 also proves to be good, the deployment takes place on all server systems. Deployment at Facebook means copying a 1.5 GB binary to all servers, done with a customized BitTorrent distribution system.
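The wave logic can be illustrated with a small Python sketch. It is not Facebook’s tooling: the phase names p1 and p2 and the roughly 5% share come from the description above, while the 1% share for p1, the `deploy` callback and the `is_healthy` callback (standing in for engineers watching logs and KPIs) are assumptions made for this example.

```python
# Phase names follow the article. The p1 share is an assumption
# ("some servers"); p2 is the "roughly 5%" mentioned in the text.
PHASES = [("p1", 1), ("p2", 5), ("full", 100)]  # (name, percent of servers)


def rollout(servers, deploy, is_healthy):
    """Deploy in waves and stop at the first phase that looks unhealthy.

    `deploy` and `is_healthy` are hypothetical callbacks, not a real API.
    Returns the name of the last phase that completed, or None.
    """
    done = 0
    completed = None
    for name, percent in PHASES:
        # Integer ceiling of len(servers) * percent / 100, at least one server.
        target = max(1, (len(servers) * percent + 99) // 100)
        for server in servers[done:target]:
            deploy(server)
        done = max(done, target)
        if not is_healthy(servers[:done]):
            return completed  # halt: later phases never receive the build
        completed = name
    return completed


servers = [f"web{i:03d}" for i in range(200)]
deployed = []
last_good = rollout(servers, deployed.append, lambda batch: True)
print(last_good, len(deployed))
```

Because each phase only touches servers that earlier phases did not, a failed health check limits the damage to the share of servers reached so far.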

What if an error occurs? Well, the developers are on IRC and are held accountable for fixing their piece of code. If you crash it, you repair it!

Multiple versions of code running simultaneously

Executing code in Facebook’s server environment automatically means running multiple versions of the code simultaneously. Supporting this takes extra effort. The hardest part is migrating database schemas from one version to another.

Features with on-off toggles

Facebook uses a tool named “Gatekeeper” to allow real-time on/off switching and throttling of features. Only a few code changes need to be introduced, and Facebook operations can control the traffic and which features are available. Code in such environments needs to be highly decoupled, with no dependencies between features …
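To make the idea of switching and throttling concrete, here is a toy feature gate in Python. It is only a sketch of the general technique and does not reflect Gatekeeper’s real interface; the class name, method names and the feature name `new_composer` are all made up for this example.

```python
import hashlib


class FeatureGate:
    """Toy on/off switch with percentage throttling (not Gatekeeper's real API)."""

    def __init__(self):
        self._rollout = {}  # feature name -> percentage of users (0..100)

    def set_rollout(self, feature, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        self._rollout[feature] = percent

    def is_enabled(self, feature, user_id):
        percent = self._rollout.get(feature, 0)  # unknown features stay off
        digest = hashlib.sha256(f"{feature}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket 0..99
        return bucket < percent


gate = FeatureGate()
gate.set_rollout("new_composer", 10)  # throttle to roughly 10% of users
if gate.is_enabled("new_composer", user_id=42):
    print("render new composer")
else:
    print("render old composer")
```

Hashing the feature name together with the user id keeps each user’s answer stable between requests, so raising the percentage only adds users and setting it to 0 switches the feature off without a deployment.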

Versioned static resources across the web tier

All the web servers in Facebook’s server farm are able to serve all static content of all versions being deployed. This means that all servers are equipped with all resources prior to the phase p1 deployment. This allows the whole web tier to remain stateless.
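A minimal Python sketch shows why this keeps the web tier stateless. The `/static/<version>/<path>` URL layout, the version labels and the in-memory store are assumptions for illustration; the articles do not describe how Facebook actually lays out its static resources.

```python
class StaticStore:
    """In-memory stand-in for the static files present on every web server."""

    def __init__(self):
        self._files = {}  # (version, path) -> content

    def publish(self, version, path, content):
        self._files[(version, path)] = content

    def serve(self, url):
        # The URL layout "/static/<version>/<path>" is an assumption.
        prefix, version, path = url.lstrip("/").split("/", 2)
        if prefix != "static":
            raise KeyError(url)
        return self._files[(version, path)]


def asset_url(version, path):
    """Build the URL a page rendered by code version `version` would reference."""
    return f"/static/{version}/{path}"


store = StaticStore()
store.publish("v41", "js/app.js", b"old bundle")
store.publish("v42", "js/app.js", b"new bundle")

# A page rendered by old code and a page rendered by new code can both be
# completed by the same server, whichever one the request lands on.
print(store.serve(asset_url("v41", "js/app.js")))
print(store.serve(asset_url("v42", "js/app.js")))
```

Since the version travels in the URL, no server has to remember which code version rendered a given page.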

If you condense what’s said in the articles, it comes down to these points:

  1. Automate everything!
  2. Test early, test often!
  3. Hold developers responsible and let them fix live errors.
  4. Each release has an owner involved from all stakeholder teams.
  5. The product is designed to be rolled back. From the beginning.
  6. The product is designed to execute multiple versions at the same time.
  7. Run multiple environments!
  8. Deploy incrementally!