Website performance talk: Delivering The Goods In Under 1000ms

Paul Irish (@paul_irish) gave a really good keynot presentation during Front End Ops Conference 2014 in SF. The title: “Delivering The Goods In Under 1000ms“.

He focuses on investigating the key question “page size vs. number of requests” what has a bigger impact on website performance.

“… latency is the performance bottleneck for HTTP and well … most of the web”

Aggressive, but good goals to achieve:

Deliver a fast mobile web page load

  • Show the above-the-fold content in under 1 second
  • Serve above-the-fold content, including critical path CSS, in first 14kb of response

Maximum of 200ms server response time

Speed index under 1000

More to read about the Speed Index.

Website performance – best practice to improve

Website performance comes in various flavours – but where to start with improvements? How to improve the performance? What are best practices to follow?

Tony Gentilcore (@tonygentilcore) talks in a blogpost about “An engineer’s guide to optimization“. Tony basically identified 5 steps to follow.

Step 1: Identify the metric. 

Identify a scenario worth being optimized – means it moves a business metric. If – after all thinking and crunching – you’re not able to identify a scenario with a clear relationship between the optimization and any business metric you should look for more pressing problems first – and revisit the performance issue later.

Step 2: Measure the metric.

After you’ve identified the metric – establish a repeatable benchmark of this scenario / metric. Include this metric measurement into your continuous integration / delivery pipelines and watch out for regressions. First, start with synthetic benchmarks and later include the real world (Real User Monitoring).

Step 3: Identify the theoretical optimum of your metric.

Think about your scenario and create a best-best-best-case. What would be the benefit, the performance maximum to gain out of the scenario. Given everything works really well, what would be the top-performance figure?

Step 4: Approach your optimium.

Identify the bottlenecks preventing you to reach the optimum. Work on these bottlenecks – start with the biggest impact first. Don’t stop optimizing until you reach a point where you have to invest more than you benefit.

Step 5: Profit from your achievements!

Web frontend performance – distilled

Web frontend performance – distilled

Web performance used to be (in the good old server-only / server-rendering days) mainly dominated by the performance of your webservers delivering the dynamic content to the browser. Well, this changed quite a lot with application-like web frontends. Their main promise is to replace these annoying request/response pauses with one longer waiting period in the beginning of the session – but then have light-speed for subsequent requests.

Here are some really good links I just discovered today – they all deal with various aspects of frontend web-performance. Let’s start.

Comparing MV* frameworks? There is a great project – named TodoMVC – that compares various frontend-frameworks – amongst them are Backbone.js, AngularJS, Ember.js, KnockoutJS, Dojo, YUI, Agility.js, Knockback.js, CanJS, Maria, Polymer, React, Mithril, Ampersand, Flight, Vue.js, MarionetteJS, Vanilla JS, jQuery and a lot more.

Performance impact comparison by the filament group. A good effort on research was spent on the topic “Research: Performance Impact of Popular JavaScript MVC Frameworks” – focusing on e.g. Angular.js, Backbone.js and Ember.js. Performance testing was done with the previous mentioned implementation of TodoMVC. The raw data is accessible as well. Most interesting are the results (measuring avg. first render time):

Mobile 3G connection on Nexus 5

  • Ember averages about 5 seconds
  • Angular averages about 4 seconds
  • Backbone averages about 1 second

PC via LAN

  • Ember averages about 1.17 seconds
  • Angular averages about 0.88 seconds
  • Backbone averages about 0.29 seconds

Practical hints to accelerate responsive designed websites. In his post “How we make RWD sites load fast as heck” Scott Jehl (@scottjehl) gives some pracitcal hints on what to focus on:

  • Page wight isn’t the only measure; focus on perceived performance
  • Shortening the critical path
  • Going async
  • Inlining code
  • Shooting for 14kb
  • Taking advantage of caches
  • Using grunt in the deploy pipe

Angular 1.x and architecture problems. Another interesting blog article by Peter-Paul Koch (@ppk) focusses explicitly on Angular. “The problem with Angular” talks about severe performance problems with Angular 1.x versions. In his blog he notes

” … Angular 2.0, which would be a radical departure from 1.x. Angular users would have to re-code their sites in order to be able to deploy the new version …”

Wow. That’s interesting and a good indication for serious architecture issues with Angular 1.x …

Thought-leading companies and performance. Good articles / blog posts from leading companies on page speed performance:

Realistic case study on agile development at large scale

A Practical Approach to Large-Scale Agile Development” by Gary Gruver, Mike Young and Pat Fulghum is a real-world example on how scaling of agile software development really works in a huge software producing organisation.

A Practical Approach to Large-Scale Agile Development

The authors describe in an easy-to-read language the journey of the HP firmware organisation starting in 2008 and taking around 3 years with a clear goal in mind: “10x developer productivity improvement”.

In the beginning they were stuck with a waterfall planning process with a huge planning organisation and not being able to move in software development as fast as business expected. A quick summary of activities showed that in the beginning the organization was spending 25% of developer time in planning sessions to plan the next years’ releases. Only 5% was spent on innovation. Nowadays, after acomplishing the 10x goal, 40% of developer’s time is spent on innovation.

The book highlights the relevance of a right mix of agile technologies with a good approach in software architecture and organisational measures to form a successful team of people striving for common goals. A fascinating read!

Most striving is the unemotional view on agile and how to apply it. They purposefully decided not to have self-organizing teams. So, agile is broken? Can’t be applied in such an environment? Not at all! The authors give good reason for not applying all agile patterns from the books – and it is working.

Facebook and their mobile release process

The talk “Hacker Way: Releasing and Optimizing Mobile Apps for the World” given by Chuck Rossi @Facebook’s f8 conference in 2014 describes how Facebook turned its organization structure to reflect the importance of mobile for Facebook’s future. Chuck heads the company’s release team and is responsible for all releases.

Impact of Mobile strategy on organization

Before re-prioritizing everything within Facebook and focusing on mobile the development team was organized mainly around channels:

Development Organization of Facebook before moving towards mobile

This developer distribution led actually to heavy prioritization problems. The different product teams with focus on Desktop Web did prioritize their topics coming up with a numbered list of items. Those prioritzations were then handed over to the platform experts. They had the problem of seing number #1 priority item of the “Messages team” competing with number #1 priority item of e.g. the “Events team”.

Facebook came over this organization issue by organizing their development differently:Development Organization of Facebook after moving towards mobile

Now, the facebook engineering team has product and platfom experts mixed working on features across all platforms.

Software Releases at Facebook

Facebook has some simple rules – simple but made of stone:

  1. WE SHIP ON TIME
    A
    release can not be postponed. If a feature can’t make it it will not make it into this release.
  2. MAKE USERS NO WORSE OFF
    Facebook is data driven. KPI’s are watched thorougly after a release. If they don’t develop as expected, a change needs to happen (e.g. fix forward or modification).
  3. THERE’S ALWAYS THE NEXT ONE
    Since the releases are already dated there is always the next release. If you can’t get your feature in today, it will be part of the release tomorrow. This relaxes the overall organization and takes away a lot of the pain experienced when the next release is month away.
  4. RETREAT TO SAFETY
    The release team is responsible for delivering a stable product. When the team actually picks the ready developed items (30 to 300 on a daily release) they carefully take the stories into the release candidate. It’s described as “subjective”. They follow a simple rule when building the release package: “If in doubt, there is no doubt”.

Facebook releases their web platform following a plan:Facebooks desktop web plattform release plan

Sunday, 6 p.m. the release team tags the next release branch. That’s done directly from the trunk. The release branch is stabilized until Tuesday, 4 p.m. and then released as a big release including 4000 to 6000 changes – 1 week of development. On Monday, Tuesday, Wednesday, Thursday, Friday, Facebook does two releases a day. These are cherry-picked changes – around 30 to 300 each release.

For Mobile the plan differs obviously a bit:Facebooks native web plattform release plan

On mobile the overall release principle is actually the same as described above. The development cycle is 4 weeks – on the day the previous release gets shipped to the various app stores, the next release candidate is taken from the master. The candidate is then 3,5 weeks into stabilization. Each candidate includes further 100-120 cherry picks taken during this 3 weeks stabilization period. When stabilization is over, the Release Candidate is tested and not touched any more.

 

Best decision ever – skip the architect

The architect role is sold as outstanding important in product development efforts – especially when IT is involved. But I’ve learned some lessons.

The secret.de case – skip the architect

Roughly 3 years ago we started a new business – a high class and exclusive casual dating site focusing exclusively at women. The technical decision was soon done by picking Ruby on Rails v3 as web framework and mongoDB as persistence layer. It was a radical shift away from our current technology stack – pure java and postreSQL.

When we started with Sprint 0 we hired external people to support us. One person acted as the Ruby on Rails trainer for 1 week – to get our people up to speed. At the time we started we had 1 skilled Ruby on Rails person focusing on frontend development and one not-so-experienced person with Ruby on Rails focusing on backend development. The remaining team were skilled java developers. The other external people were one Rails nerd and an architect. During the sprints, it turned out that the architect didn’t have any clue about Rails nor pragmatic architecture. He started to document our project with ARC42 templates … so, we decided to put him aside soon. Leaving the team – without lead. No architect, no direction, no guidance – no hope?

Not at all. What happened? The team started to accept the fact that there’s no over-brain available. No-one making decisions for them. No-one giving direction. And, magically, they took over the ownership for the overall project. Each and every design decision was discussed within the team. Planning II got a total new meaning to the team. Sure, quite some mistakes were made – but most of them due to non-experience with the new technology stack. The Rails nerd was out-phased as soon as the platform went live – after 3,5 month of development.

So, in self-empowered teams there is no need for an explicit architect role. Naturally in team configurations there are more experienced people and less experienced people. A good team will distribute the overall responsibility for good architecture work over all heads. Everybody will carry a piece fitting their experience and willingness to contribute. Plus – you will not run into knowledge distribution problems. Everybody is involved. Knowledge is flowing. New persons can be introduced without a lot effort. The teams credo “fix it if it breaks” led to a low-maintenance and up-to-date system. So, I’ve a fast, fast, fast running application and a real high-performing team.

For me – in the end – the best decision ever was to skip the architect.

Rewriting products – why you should keep your fingers off!

Rewriting products … ways to go

At a point in time of your product it might turn out that the maintenance effort starts to increase, the people working with your product start asking questions like “Why does it take that long to achieve XYZ?”, the frontend / GUI doesn’t look that great anymore. Data points start to aggregate towards the obvious solution: Rewriting everything from scratch.

There are multiple posts out there in the community telling you why this is a really, really bad idea.

For me personally the biggest point not rewriting an existing product from scratch is the iceberg of undiscovered processes and dependencies. In a web company, the product actually forms the processes and hence forms the organization. It dictates how e-mail marketing should be done, how editors interact, how landing pages are optimized, how performance marketing is done, how accounting is done, and a lot more. So, in essence it’s the heart of your whole organization. You need to have good reasons to change this! Really good reasons!

This specific post on onstartups.com by Dan Milstein talks about “How To Survive a Ground-Up Rewrite Without Losing Your Sanity” which should be named “Why an Incremental Product Rewrite is superior to an Entire Rewrite”.

Why is the overall approach so tricky?

  • The business value of the rewrite project needs to be crystal clear. The project is doomed to fail if business value is stated as generic promises to “speed up development”, “make developers happy”, “have a new, fancy front-end”, “reduce complexity” and so on.
    Be precise!
    Work with your product team to really nail down the core essence WHY you need to approach the rewrite project. Work out a tangible list of value propositions with clear benefits to the business. Only if you have them nailed down, start your project.
  • The whole project “incremental rewrite” and / or “entire rewrite” takes ages longer than anticipated.
    Why this?

    • Data Migration turns out to be an enormous huge task because the meaning of your data is not 100% clear, it’s historically grown and code and data start melting together and create edge cases of meaning, the overall migration task is lot more complicated than anticipated.
    • Scope of the product is given. In a green-field project you usually start with a minimum viable product (including features A+B+C+D). When launch date approaches you usually start with a fair portion of feature A.
      In a rewrite project, the scope is determined by the existing product. Everybody expects the new product be superior to the old product. So, in essence you have to deliver A+B+C+D.
      Biggest problem starting the rewrite project is that you simply don’t know all features … The ingredients for a long, long running project.

How could it be done?

  • Work in increments. Ask yourself or your stakeholders after each increment “What were the business benefit of the project if I stopped it right now?”
    Don’t work towards a big-bang release. Always be prepared to pivot from your original delivery plan.
  • Be prepared to stop at any time. During the project a lot of learning will be generated. The learning, however, might lead to decisions that force the project to either alter the direction by 180 degree – or even to stop it at all.
    So, work in these increments providing most value to the business and be prepared to change steps in your plan.
  • Data Migration? Dual-Write-Layer. Always! Use a dual-write-layer in any cases when doing data migration. It allows for a fallback solution and prevents inconsistencies in your database. Furthermore, a rolling migration is possible after all – and it can take weeks instead of minutes. Nobody will realize that you’re migrating.
  • Kellan Elliott-McCrea, CTO @etsy.com, recommends utilizing a concept named “Shrink Ray“:
    “We have a pattern we call shrink ray. It’s a graph of how much the old system is still in place. Most of these run as cron jobs that grep the codebase for a key signature. Sometimes usage is from wire monitoring of a component. Sometimes there are leaderboards. There is always a party when it goes to zero. A big party.”
  • Engineer the migration scripts to excellence. The scripts need to be idempotent (re-run save) and should identify false data in the original data. If they do it proves that these scripts and the people working on them have really understood what they should be doing.

 

AOL.com architecture description – software stack & processes

Today, twitter (thank you) delivered to me one of the latest articles from HighScalability.com – an article on AOL.com’s architecture.

It’s really great reading and seeing that not only latest and greatest technologies have to be used to create a really good and stable software stack with well-thought-through processes and a great culture expressing proud and self-identification with the architecture.

Notably, AOL.com works with a distributed team of ~25 people on the 5th re-incarnation of the overall architecture. They re-implemented the whole architecture 5 times already! The latest instance is around 6 years old.

Read it – enjoy it! It unveils also some great insights on delivery and management of the software stack.

Recommendations for building high traffic web software

The really great blog highscalability.com contains a post from Ashwanth Fernando on “22 Recommendations for Building Effective High Traffic Web Software“.

He shares insights from his work experience. Some of them seem to be very Oracle biased – but others are really down to earth and worth considering! The blog post is really worth reading. My favorites are:

  • Consider using a distributed caching framework
  • Consider splitting your web application into services
  • Do not use session stickiness
  • Do terminate SSL on the reverse proxy

Why my favorites? Well, some of them turned out to be really bad design decisions within one of our products …

 

 

Page Load time – how to get it into the organization?

Page load time is crucial. Business and technology people as individuals start understanding the importance of this topic almost immediately and are willing to support any effort to get fast pages out of your service.

But how can you foster a culture of performance and make people aware of the importance of this single important topic – amongst hundred other important topics?

That was one of the challenges early 2013. Management and myself were convinced that 2013 one of our key focus topics is around web performance. t4t_optimizedThe birthday of “T4T”. The acronym stands for …

  • Two – Deliver any web page within 2 seconds to our customers.
  • 4 – Deliver any mobile web page within 4 seconds to our customers over 3G.
  • Two hundred – Any request over the REST API is answered below 200 milliseconds.

So, early 2013 we started T4T as an initiative to bring our page load times down to good values. To measure the page load time we experimented with two tools: Compuware’s Gomez APM tool and New Relic’s APM tool. Gomez was used initially for our Java based platform and New Relic for our Ruby on Rails platform. But we were able to measure and track-down some really nasty code segments (i.e. blocking threads in Java or 900 database requests in Ruby where 2 finally did the same job).

How did we get the idea of T4T into the organization? Any gathering of people with presentation character was used to hammer the message of web performance to the people. Any insight on the importance, any tip, hint, workshop, conference, article, blog post, presentation, anything was shared with the team. Furthermore, T4T was physically visible everywhere in the product development department:

T4T_closeup

THE LOGO – visible … everywhere … creepy!

T4T_Corner

T4T logo and information on page load and web performance at the relax area for software developers and product owners …

T4T_VPOffice

T4T at the VP office door.

For me, especially the endless talking about the topic, raising the importance, questioning of e.g. JPG picture sizes, special topic discussions on CSS sprites vs. standalone images or the usage of web-fonts for navigation elements helped a lot to raise the curiosity of people. Furthermore, giving them some room and time for research work helped a lot.

What did we achieve? Well, one of our platforms – based on Ruby on Rails started with page load time of 2,96s in January 2013. End 2013, the platform was at an impressive 2,15s page load time. In the same time, the amount of page views increased by factor 1,5!

Loadtime_secret_2013

Page Load time over the year 2013

During the same time period, the App server response time dropped from 365ms to 275ms end of year – this time doubling the amount of requests in the same time.

Response_time_secret_2013

App server response time over the year 2013

Most interesting, we had one single release with a simple reshuffling of our external tags. Some of them now load asynchronously – or even after the onLoad() event. This helped us drop the page load time from around 2,5s to 2,1s – 400ms saved!

Impact_of_one_event_secret_adtags_after_onload

Impact of one single release and the move of adtags after the onLoad() event.

So, my takeaways on how to foster such a performance culture?

  1. You need a tangible, easy to grasp goal!
  2. Talk about the topic and the goal. Actually, never stop talking about this specific goal.
  3. Make the goal visible to anybody involved – use a logo.
  4. Measure your success.
  5. Celebrate success!
  6. Be patient. It took us 12 month …