Web Performance Escalation (#WPE)

In 2007, Steve Souders authored a book called High Performance Web Sites, which ignited an entire movement in web engineering called Web Performance Optimization (#WPO). WPO practices and techniques focus on optimizing client-side performance for modern web applications, establishing a set of rules and guidelines for the design and engineering of a web application.
     
Now that I'm jumping back into performance consulting, the interesting thing for me is learning as many of these newer techniques and tools as I can. My recent work at Shunra gave me more respect for the impact of network limitations on client-side performance, so wouldn't you know it: less than a week into a consulting project, I'm already hunting down the root cause of a client-side browser performance issue.

On this project, the performance testing team reported that the average response time for a critical transaction, measured with HP LoadRunner, had grown to just over 3 seconds of page-complete time. The team immediately got into a huddle to consider whether we could give a GO or NO-GO on this release, under significant pressure to defend the business requirements for conversion (e-commerce) and end-user experience. A NO-GO on this release would be a serious show-stopper.

Here's a rundown of the observations during the escalation:
  • We escalated a NO-GO vote on the release because the response time on the cart checkout process was unacceptable; it was slow enough for us to raise the issue with the teams.
  • The testing data was taken from LoadRunner's web virtual user, which measures at the transport layer and isn't accurate for a rich web presentation layer that includes asynchronous calls.
  • The root cause of the performance issue was the in-line loading of resources (JavaScript and CSS), which we diagnosed and cross-checked in Chrome, FireBug, and Charles Proxy. The problem was worst on IE7, which was known to have limitations (see the loading sketch after this list).
  • The investigation revealed that we only needed to debug the client-side performance for this release, but that any load-related scalability issues would only make things worse; single-user baselines don't include multi-user latency or bottlenecks.
  • We learned the real response time measurement was not "page load complete" but a point in the script where a load mask (a grey transparent cover with an animated spinner) was lifted so the end-user could continue with their shopping purchase and checkout.
  • The checkout business process was extremely important to the business (no surprise), so even a slight 500 ms increase in the end-user flow through the cart was something the stakeholders would want to know about.
  • Development already knew much of the technical root cause for this issue, but they were relying on the performance team to give an official, accurate measurement from testing in the performance environment.
  • Once we checked the website analytics, we noticed that fewer than 7% of the end-users were still on IE7, and even fewer of them were on slow routes to the website. It was a very anticlimactic realization.
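
Since in-line loading was the root cause here, it's worth illustrating the general fix. The sketch below is not the client's actual code: the file paths and function names are hypothetical, and it uses modern browser APIs (Promise, the async script attribute) that IE7 itself wouldn't support, so treat it as a minimal sketch of deferring non-critical resources rather than a drop-in solution.

    // Minimal sketch (hypothetical paths and names): defer non-critical JavaScript
    // and CSS instead of loading them in-line, so they stop blocking the render
    // of the checkout page.

    function loadDeferredScript(src: string): Promise<void> {
      return new Promise((resolve, reject) => {
        const el = document.createElement("script");
        el.src = src;
        el.async = true;                                   // don't block HTML parsing
        el.onload = () => resolve();
        el.onerror = () => reject(new Error(`Failed to load ${src}`));
        document.head.appendChild(el);
      });
    }

    function loadDeferredStylesheet(href: string): void {
      const link = document.createElement("link");
      link.rel = "stylesheet";
      link.href = href;
      document.head.appendChild(link);
    }

    // Once the critical content has been parsed, pull in the heavier extras.
    document.addEventListener("DOMContentLoaded", () => {
      loadDeferredStylesheet("/static/checkout-extras.css"); // hypothetical path
      loadDeferredScript("/static/checkout-widgets.js")      // hypothetical path
        .catch((err) => console.warn(err));
    });
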
That's what we explored, discovered, and learned through the five days of escalation with the teams. There were three different groups (development, test, and business) all digging into this issue, so a little miscommunication and some chaos were to be expected. But we found there were also some discrepancies in the testing tools and environment configurations. Here are a few lessons learned and observed:
  • Don't use a multi-user load testing tool to validate a single-user client-side bottleneck. Better yet, combine load and UI performance tools in your approach.
  • The load testing engineer should also have been using LoadRunner's Ajax TruClient virtual user to accurately measure the client-side performance in Firefox (which was also the developers' standard browser).
  • Always have your testing completed and test results compiled prior to escalating a major NO-GO vote on a release. Do your homework; it will pay off.
  • Remember to analyze and calculate risk according to real-world evidence, like the percentage of users on an old browser or a slow link.
  • Performance results from "lower environments" used by development can lead to inconclusive evidence about real-world performance. Be aware of these inaccuracies and factor them in.
  • To improve the breakdown of client response-time measurements in LoadRunner and Gomez, we need to measure all the key points in the rendering of the client experience (see the sketch after this list).
  • Get savvy with client-side performance tools like FireBug and Charles Proxy; otherwise your developers will know more about performance than you do.
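
On that last measurement point: the milestone that mattered here was the moment the load mask was lifted, not the browser's page-load event. The sketch below is a modern-browser illustration of how that milestone could be exposed so client-side measurement scripts can pick it up; the element id and mark name are assumptions, and the User Timing API postdates the IE7-era tooling in this story.

    // Minimal sketch (assumed element id and mark name): record the moment the
    // load mask is lifted with the User Timing API, so "checkout usable" can be
    // measured alongside the usual page-load events.

    function hideLoadMask(): void {
      const mask = document.getElementById("load-mask");    // assumed id for the grey overlay
      if (mask) {
        mask.style.display = "none";                        // shopper can now continue checkout
      }

      // Drop a named mark at the real end-user milestone.
      performance.mark("checkout-mask-lifted");

      const [entry] = performance.getEntriesByName("checkout-mask-lifted");
      if (entry) {
        console.log(`Checkout usable ${entry.startTime.toFixed(0)} ms after navigation start`);
      }
    }
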
