Rewriting products – why you should keep your hands off!

Rewriting products … ways to go

At some point in your product's life it might turn out that the maintenance effort starts to increase, the people working with your product start asking questions like “Why does it take so long to achieve XYZ?”, and the frontend / GUI doesn’t look that great anymore. The data points start to aggregate towards one seemingly obvious solution: rewriting everything from scratch.

There are multiple posts out there in the community telling you why this is a really, really bad idea.

For me personally, the biggest argument against rewriting an existing product from scratch is the iceberg of undiscovered processes and dependencies. In a web company, the product actually forms the processes and hence forms the organization. It dictates how e-mail marketing is done, how editors interact, how landing pages are optimized, how performance marketing is done, how accounting is done, and a lot more. So, in essence, it’s the heart of your whole organization. You need good reasons to change this! Really good reasons!

This specific post on onstartups.com by Dan Milstein talks about “How To Survive a Ground-Up Rewrite Without Losing Your Sanity” – it could just as well be titled “Why an Incremental Product Rewrite is Superior to an Entire Rewrite”.

Why is the overall approach so tricky?

  • The business value of the rewrite project needs to be crystal clear. The project is doomed to fail if the business value is stated as generic promises to “speed up development”, “make developers happy”, “have a new, fancy front-end”, “reduce complexity” and so on.
    Be precise!
    Work with your product team to really nail down the core essence of WHY you need to approach the rewrite project. Work out a tangible list of value propositions with clear benefits to the business. Only when you have them nailed down should you start your project.
  • The whole project – incremental rewrite and / or entire rewrite – takes far longer than anticipated.
    Why is that?

    • Data migration turns out to be an enormous task: the meaning of your data is not 100% clear, it has grown historically, and code and data have melted together into edge cases of meaning. The overall migration task is a lot more complicated than anticipated.
    • The scope of the product is given. In a green-field project you usually plan a minimum viable product (including features A+B+C+D); when the launch date approaches, you typically ship with a fair portion of feature A.
      In a rewrite project, the scope is determined by the existing product. Everybody expects the new product to be superior to the old one. So, in essence, you have to deliver A+B+C+D.
      The biggest problem when starting the rewrite project is that you simply don’t know all the features … The ingredients for a long, long running project.

How could it be done?

  • Work in increments. Ask yourself or your stakeholders after each increment: “What would the business benefit of the project be if I stopped it right now?”
    Don’t work towards a big-bang release. Always be prepared to pivot from your original delivery plan.
  • Be prepared to stop at any time. During the project a lot of learning will be generated. That learning, however, might lead to decisions that force the project to either alter its direction by 180 degrees – or even to stop entirely.
    So, work in the increments providing the most value to the business and be prepared to change steps in your plan.
  • Data migration? Dual-write layer. Always! Use a dual-write layer in any case when doing data migration. It allows for a fallback solution and prevents inconsistencies in your database. Furthermore, a rolling migration becomes possible – and it can take weeks instead of minutes. Nobody will notice that you’re migrating. (A minimal sketch follows after this list.)
  • Kellan Elliott-McCrea, CTO @etsy.com, recommends utilizing a concept named “Shrink Ray” (a tiny example of which also follows below):
    “We have a pattern we call shrink ray. It’s a graph of how much the old system is still in place. Most of these run as cron jobs that grep the codebase for a key signature. Sometimes usage is from wire monitoring of a component. Sometimes there are leaderboards. There is always a party when it goes to zero. A big party.”
  • Engineer the migration scripts to excellence. The scripts need to be idempotent (safe to re-run) and should identify false data in the original dataset. If they do, it proves that these scripts – and the people working on them – have really understood what they should be doing. (See the last sketch below.)
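
A dual-write layer is simple in principle. Here’s a minimal sketch of the idea – the store names and interfaces are hypothetical, not taken from any specific product:

```python
class DualWriteRepository:
    """Minimal dual-write sketch (illustrative; the stores are any objects
    with get/put). The old store stays authoritative until the migration
    is verified and traffic is cut over."""

    def __init__(self, old_store, new_store, on_mismatch=print):
        self.old = old_store
        self.new = new_store
        self.on_mismatch = on_mismatch

    def save(self, key, record):
        self.old.put(key, record)        # authoritative write first
        try:
            self.new.put(key, record)    # best-effort shadow write
        except Exception as exc:
            # Never fail the request because the new store hiccuped;
            # log it and let the rolling migration re-sync later.
            self.on_mismatch(f"shadow write failed for {key}: {exc}")

    def load(self, key):
        record = self.old.get(key)       # old store is the source of truth
        shadow = self.new.get(key)
        if shadow is not None and shadow != record:
            # Divergence check: surfaces inconsistencies long before
            # the actual cut-over, while a fallback still exists.
            self.on_mismatch(f"stores diverge for {key}")
        return record
```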
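
The shrink-ray measurement itself can be tiny. A hypothetical cron job in this spirit – the signature string is a placeholder:

```python
import pathlib

# A signature that only the legacy system contains – a placeholder here.
LEGACY_SIGNATURE = "LegacyOrderService"

# Count the remaining occurrences across the codebase; graph the number
# over time and throw a party when it reaches zero.
count = 0
for path in pathlib.Path("src").rglob("*.py"):
    count += path.read_text(errors="ignore").count(LEGACY_SIGNATURE)

print(f"{LEGACY_SIGNATURE}: {count} occurrences left")
```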
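
And the idempotency of migration scripts is cheap to build in from the start. A sketch of the pattern – the schema, markers and checks are illustrative assumptions:

```python
def migrate(old_db, new_db, report):
    """Re-run-safe migration pass: skip what's done, flag what's broken."""
    for row in old_db.fetch_unmigrated():    # e.g. WHERE migrated_at IS NULL
        problems = validate(row)
        if problems:
            report(row.id, problems)         # flag false data instead of guessing
            continue
        new_db.upsert(transform(row))        # upsert makes re-runs harmless
        old_db.mark_migrated(row.id)         # the marker completes the idempotency

def validate(row):
    """Identify 'false' legacy data – the checks here are only examples."""
    problems = []
    if row.created_at is None:
        problems.append("missing creation date")
    return problems

def transform(row):
    """Map the legacy shape onto the new schema (illustrative fields)."""
    return {"id": row.id, "created_at": row.created_at}
```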


AOL.com architecture description – software stack & processes

Today, Twitter (thank you!) delivered one of the latest articles from HighScalability.com to me – an article on AOL.com’s architecture.

It’s a really great read, showing that you don’t need the latest and greatest technologies to create a really good and stable software stack with well-thought-through processes and a culture of pride in and identification with the architecture.

Notably, AOL.com works with a distributed team of ~25 people on the fifth incarnation of the overall architecture. They have re-implemented the whole architecture five times already! The latest instance is around six years old.

Read it – enjoy it! It also unveils some great insights on delivery and management of the software stack.

Recommendations for building high traffic web software

The really great blog highscalability.com contains a post from Ashwanth Fernando on “22 Recommendations for Building Effective High Traffic Web Software“.

He shares insights from his work experience. Some of them seem very Oracle-biased – but others are really down-to-earth and worth considering! The blog post is well worth reading. My favorites are:

  • Consider using a distributed caching framework (see the sketch below)
  • Consider splitting your web application into services
  • Do not use session stickiness
  • Do terminate SSL on the reverse proxy

Why these favorites? Well, violating some of them turned out to be really bad design decisions within one of our products …
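
To make the first favorite concrete, here’s a minimal cache-aside sketch against a distributed memcached cluster using pymemcache – the hostnames are placeholders and the library choice is mine, not the article’s:

```python
import json
from pymemcache.client.hash import HashClient

# HashClient spreads keys across nodes via consistent hashing, so the
# cache scales horizontally and no single node holds all the hot keys.
cache = HashClient([("cache-1.internal", 11211), ("cache-2.internal", 11211)])

def get_product(product_id: int) -> dict:
    key = f"product:{product_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)                    # cache hit
    product = load_product_from_db(product_id)       # miss -> hit the database
    cache.set(key, json.dumps(product), expire=300)  # cache for 5 minutes
    return product

def load_product_from_db(product_id: int) -> dict:
    return {"id": product_id, "name": "example"}     # stand-in for a real query
```

The same cache-aside pattern works with Redis or any other distributed cache; the point is that the key space is spread across nodes instead of living on one box.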


Page Load time – how to get it into the organization?

Page load time is crucial. Business and technology people alike understand the importance of this topic almost immediately and are willing to support any effort to get fast pages out of your service.

But how can you foster a culture of performance and make people aware of the importance of this one important topic – amongst a hundred other important topics?

That was one of our challenges in early 2013. Management and I were convinced that web performance had to be one of our key focus topics for 2013. That was the birth of “T4T”. The acronym stands for …

  • Two – Deliver any web page within 2 seconds to our customers.
  • 4 – Deliver any mobile web page within 4 seconds to our customers over 3G.
  • Two hundred – Any request over the REST API is answered below 200 milliseconds.

So, in early 2013 we started T4T as an initiative to bring our page load times down to good values. To measure page load time we experimented with two tools: Compuware’s Gomez APM tool and New Relic’s APM tool. Gomez was used initially for our Java-based platform and New Relic for our Ruby on Rails platform. With them, we were able to measure and track down some really nasty code segments (e.g. blocking threads in Java, or 900 database requests in Ruby where 2 did the same job).
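
We relied on Gomez and New Relic for the real measurements, but the third T4T goal is easy to spot-check with a few lines of Python. A purely illustrative probe – the endpoint URL is a placeholder:

```python
import time
import urllib.request

# Hypothetical smoke check for the "Two hundred" goal: REST calls must
# answer in under 200 ms. The URL below is a placeholder, not a real API.
API_URL = "https://api.example.com/v1/profiles/42"
BUDGET_MS = 200

start = time.perf_counter()
with urllib.request.urlopen(API_URL, timeout=5) as response:
    response.read()
elapsed_ms = (time.perf_counter() - start) * 1000

print(f"{API_URL} answered in {elapsed_ms:.0f} ms")
if elapsed_ms > BUDGET_MS:
    print(f"T4T violation: budget is {BUDGET_MS} ms")
```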

How did we get the idea of T4T into the organization? Any gathering with a presentation slot was used to hammer home the message of web performance. Any insight on its importance – any tip, hint, workshop, conference, article, blog post, presentation, anything – was shared with the team. Furthermore, T4T was physically visible everywhere in the product development department:

[Image: T4T_closeup]

THE LOGO – visible … everywhere … creepy!

[Image: T4T_Corner]

T4T logo and information on page load and web performance at the relaxation area for software developers and product owners …

[Image: T4T_VPOffice]

T4T at the VP office door.

For me, it was especially the endless talking about the topic – raising its importance, questioning e.g. JPG picture sizes, special-topic discussions on CSS sprites vs. standalone images or the usage of web fonts for navigation elements – that helped a lot to raise people’s curiosity. Furthermore, giving them some room and time for research work paid off as well.

What did we achieve? Well, one of our platforms – based on Ruby on Rails – started with a page load time of 2.96s in January 2013. By the end of 2013, the platform was at an impressive 2.15s page load time. Over the same period, the number of page views increased by a factor of 1.5!

[Image: Loadtime_secret_2013]

Page Load time over the year 2013

During the same time period, the app server response time dropped from 365ms to 275ms by the end of the year – while the number of requests doubled.

[Image: Response_time_secret_2013]

App server response time over the year 2013

Most interestingly, we had one single release with a simple reshuffling of our external tags. Some of them now load asynchronously – or even after the onLoad() event. This alone dropped the page load time from around 2.5s to 2.1s – 400ms saved!

[Image: Impact_of_one_event_secret_adtags_after_onload]

Impact of one single release and the move of adtags after the onLoad() event.

So, my takeaways on how to foster such a performance culture?

  1. You need a tangible, easy to grasp goal!
  2. Talk about the topic and the goal. Actually, never stop talking about this specific goal.
  3. Make the goal visible to anybody involved – use a logo.
  4. Measure your success.
  5. Celebrate success!
  6. Be patient. It took us 12 months …

Page Load time is crucial for a web service. Why?

For a technical person, page load time feels inherently important. It’s a natural tendency, almost an instinct, to make everything perform as well as possible. Unfortunately, from a business perspective that instinct alone is not a driver that impresses people or makes those responsible for revenue re-think the importance of page load time.

Why might it be of interest for business people, too?

There is a tight coupling between revenue and page load time in e-commerce businesses:

  • The infographic “How Loading Time Affects Your Bottom Line” shows slower page response times resulting in increased abandonment rates.
  • A Forrester study shows that 14% of online shoppers go to another site if they have to wait for a page to load – 23% will stop shopping.
  • Shopzilla redesigned their site to load 5 seconds faster, resulting in a 10% revenue increase.
  • Bing reported from a trial that a 2-second slowdown of page load time reduced revenue per user by 4.3%.
  • Amazon measured a 1% sales decrease for every 100 milliseconds lost in page speed.
  • Google reported a 20% revenue decrease for every 500 milliseconds of page performance loss.
  • The Mozilla Corporation behind Firefox reduced the page load time of their download pages by 2.2 seconds, resulting in 60 million additional downloads per year.

Why is it important for people? Why should web sites simply be fast?

There is this article “Our Need For Web Speed: It’s about neuroscience, not entitlement” from Radware / Strangeloop. It gives deeper insights into human nature and why it is important to run fast websites – a really good motivation for technical and non-technical people alike to think about the nature of performance. On Web Performance Today, there is also the infographic “This is your brain on a slow website”, which picks up some of the article’s arguments in visual form.

As early as 2010, Jakob Nielsen wrote in “Website Response Times” about the impact of slow web pages on humans, giving good reasons why we should definitely try to avoid this bad user experience.

Also a good source of information to get an impression of the current state of the industry: “Ecommerce Page Speed & Web Performance“.

Furthermore, there is this article on “How Facebook satisfied a need for speed“, in which Robert Johnson, director of engineering, explains how Facebook doubled their speed.

Finally, there is the poster “Visualizing Web Performance” by Strangeloop.

Page load time is critical – how to make your site run fast?

Page load time is critical. A lot of people highlight the importance of fast web sites. Amongst them are Steve Souders, Patrick Meenan, Tammy Everts, Stoyan Stefanov and others.

What can you do to make your site run fast? There are tons of pages, blogs, hints, tips, tricks and other stuff around on the web. Here’s my favorite collection:

At FriendScout24, we follow this idea of fast web pages as well. I talk in another post in greater detail about our goals, our achievements and how we actually did it.

Talk on Continuous Delivery at CodeCentric event in Hamburg

In late 2013, the general manager of CodeCentric in Munich asked me to give a presentation on our view of / achievements in / experience with Continuous Delivery. The first of a series of events took place in Hamburg on the 26th of November in a nice location.

I put the presentation on SlideShare to make it accessible to others as well. The presentation is titled “Continuous Delivery? Nett oder nötig?” (German for “Nice-to-have or necessary?”).

The presentation covers our goals and why we decided to introduce continuous delivery as our way of delivering software to the business, shares some of the experience we gained, and tells something about the challenges we faced in transforming our architecture to fit the new delivery approach, the tools involved, and more.

In case you have any questions, don’t hesitate to get back to me on michael (at) agile-minds.com.

10 rules that prevent a web site from scaling

On highscalability.com there is a great post on the things that will keep your web site from ever scaling: “The 10 Deadly Sins Against Scalability“. The post points to Sean Hull, who tweets and writes quite frequently on scalability topics (surprise, surprise).

Sean Hull wrote in his blog about “5 things toxic to scalability” (2011) and “Five More Things Deadly to Scalability” (2013) – both definitely worth reading on high scalability and its common pitfalls.

Book by Martin L. Abbott and Michael T. Fisher on scalability rules.

In the context of this topic, Sean also recommends a book: “Scalability Rules for managers and startups”.

Very good reading to avoid all the high-scalability pitfalls right from the beginning!


Software releases without outages and poor user experience @ Facebook

Recently I did some research on software releases and how huge, successful companies manage to release their software without causing system failures and poor user experience.

I found this question answered by a former Facebook release engineer: “How do big companies like Facebook, Google manage software releases without causing system outages and poor user experience?“. In addition to that great answer, there is also an interview at arstechnica: “Exclusive: a behind-the-scenes look at Facebook release engineering“.

Since the very beginning, Facebook has followed the principle of zero downtime and no service interruption. How do they accomplish this – even now that they’re big?

Pushing releases in multiple phases

Facebook pushes their releases in waves. The initial phase, named “latest”, always contains the latest code changes (hence the name). All engineers are connected to the “latest” staging system; they gather in an IRC channel when the initial push happens and watch logs and error messages meticulously. If the build proves to be okay, a push to some servers in production happens (the p1 phase). Again, all developers concerned with the release gather in the IRC channel to watch the release and track KPI changes, log messages and errors. The next stage, phase p2, includes roughly 5% of all live server systems – again thoroughly monitored. Finally, when phase p2 proves to be good as well, the deployment takes place on all server systems. Deployment @ Facebook means copying a 1.5GB binary to all servers – done with a customized BitTorrent distribution system.

If an error occurs? Well, the developers are on IRC and are held accountable for fixing their piece of code. If you crash it, you repair it!
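
The wave pattern is easy to express in code. A hypothetical driver of such a staged push – phases, fractions and checks are illustrative, not Facebook’s actual tooling:

```python
import random

# After the "latest" staging check, a build rolls through p1 (a handful
# of boxes), p2 (~5% of the fleet) and finally everyone.
PHASES = [("p1", 0.001), ("p2", 0.05), ("full", 1.0)]

def deploy(build: str, servers: list) -> bool:
    for phase, fraction in PHASES:
        targets = servers[: max(1, int(len(servers) * fraction))]
        push(build, targets)
        if not healthy(targets):
            # This is where the engineers watching logs and KPIs in IRC
            # would stop the push; the error's author fixes it first.
            rollback(targets)
            return False
        print(f"phase {phase} ok on {len(targets)} servers")
    return True

def push(build, targets):
    pass  # stand-in for the BitTorrent-style binary distribution

def rollback(targets):
    pass  # redeploy the previous binary

def healthy(targets) -> bool:
    return random.random() > 0.01  # stand-in for real KPI / log checks
```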

Multiple versions of code running simultaneously

Executing code in Facebook’s server environment automatically means running multiple versions of the code simultaneously – addressing this principle takes extra engineering effort. The hardest part is migrating database schemas from one version to the next.
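
A common way to make schema migrations survive two concurrently running code versions is the expand-contract pattern – the following is my sketch of that general technique, not a description of Facebook’s tooling:

```python
import sqlite3

# Old and new code versions run side by side, so every intermediate
# schema must satisfy both of them.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")

# Step 1 ("expand"): add the new column; old code simply ignores it.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Step 2: new code writes both columns; a backfill covers existing rows.
db.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Step 3 ("contract"): only after the last old-version server is gone
# is the obsolete column dropped (requires SQLite >= 3.35).
# db.execute("ALTER TABLE users DROP COLUMN fullname")
```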

Features with on-off toggles

Facebook utilizes a tool named “Gatekeeper” that allows real-time on/off switching and throttling of features. Only a few code changes need to be introduced, and Facebook operations can control the traffic and decide which features are available. Code in such an environment needs to be highly decoupled – no dependencies between features …
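
A toy version of such a feature gate – purely illustrative; the real Gatekeeper is far richer and is driven by a runtime configuration service rather than a hard-coded dict:

```python
import hashlib

# feature -> percent of users enabled; in production this table would be
# changeable at runtime by operations, without a deploy.
FLAGS = {"new_timeline": 5.0}

def is_enabled(feature: str, user_id: str) -> bool:
    percent = FLAGS.get(feature, 0.0)
    # Hash the user into a stable bucket in [0, 100) so the same user
    # consistently sees the same variant while the rollout is throttled.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000 / 100.0
    return bucket < percent

if is_enabled("new_timeline", "user-42"):
    pass  # render the new feature
else:
    pass  # fall back to the old code path
```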

Versioned static resources across the web tier

All the web servers in Facebook’s server farm are able to serve the static content of all versions being deployed. This means that all servers are equipped with all resources prior to the phase p1 deployment. This allows the whole web tier to remain stateless.
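
Content-hashed asset names are a common way to implement this. A small sketch of the idea (my illustration, not Facebook’s build pipeline):

```python
import hashlib
import shutil
from pathlib import Path

def publish(asset: Path, out_dir: Path) -> str:
    """Publish an asset under a name derived from its content hash, so
    any server can keep every version around and serve whichever one a
    given code version references."""
    digest = hashlib.sha1(asset.read_bytes()).hexdigest()[:12]
    versioned = f"{asset.stem}.{digest}{asset.suffix}"  # e.g. app.3f2a9c1b77de.css
    out_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(asset, out_dir / versioned)
    return versioned  # the deployed code version embeds this exact name
```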

If you condense what’s said in the articles, it comes down to these points:

  1. Automate everything!
  2. Test early, test often!
  3. Hold developers responsible and let them fix live errors.
  4. Each release has an owner involved from all stakeholder teams.
  5. The product is designed to be rolled back. From the beginning.
  6. The product is designed to execute multiple versions at the same time.
  7. Run multiple environments!
  8. Deploy incrementally!

Principles & rules – basement for a good engineering culture @ Google

Principles and rules seem to be a good foundation for a good (or great) engineering culture. In a recent question on Quora, “How do Google, Facebook, Apple and Dropbox maintain excellent code quality at scale?”, there was an interesting link about the engineering culture established at Google.

According to the source, Google established – and has followed ever since – the principles listed below.

1. All developers work out of a ~single source depot; shared infrastructure!
2. A developer can fix bugs anywhere in the source tree.
3. Building a product takes 3 commands (“get, config, make”)
4. Uniform coding style guidelines across company
5. Code reviews mandatory for all checkins
6. Pervasive unit testing, written by developers
7. Unit tests run continuously, email sent on failure
8. Powerful tools, shared company-wide
9. Rapid project cycles; developers change projects often; 20% time
10. Peer-driven review process; flat management structure
11. Transparency into projects, code, process, ideas, etc.
12. Dozens of offices around world => hire best people regardless of location

In the Quora answers it becomes obvious that huge code bases are maintainable only when a culture of ownership and pride is established. The first step, however, is obviously to establish a set of rules – the foundation for the engineering culture.

Seeding, growing, harvesting!