May 12, 2020 // By Mark Swanson
Every day we are adjusting how we interact with each other and conduct business during the COVID-19 pandemic. Businesses across industries are learning to operate in new ways, and their websites and applications are being stressed by unforeseen spikes in usage and patterns that differ sharply from normal. Reports of government systems crashing, and of online commerce sites failing or performing poorly, have been common in the news lately. Then there are the COVID-19 tracker sites everyone has been hitting; I know that I have received a “503 Service Unavailable” message on occasion. These spikes underscore the importance of conducting regular performance load testing against the systems that support our ability to provide goods and services, and to support the actions of our employees.
Some industries already have planned spikes in load and test for these conditions.
- Online retailers may load test their systems to plan for Black Friday or Cyber Monday.
- Power and utility companies may test their systems to handle web traffic driven by outage reporting.
- Brokerages and trading companies may load test their systems to prepare for spikes driven by event-driven market conditions.
Although there may be spikes in usage, the spikes are expected. These expected usage patterns allow us to plan and develop use cases and scenarios for expected load conditions. In performance load testing we seek to understand the performance characteristics of software systems under a predefined load, based on expected usage patterns over a defined period. However, before an organization can plan for or adapt to rapidly changing usage patterns, it needs to ask these questions:
What is expected and normal?
Organizations may have mature metrics collection around system resource utilization, but many do not have application performance or utilization metrics. Let’s say you have a web application that supports ecommerce activities. Are you able to recognize the following?
- Number of concurrent users of the system (average number per minute, hour, or day)
- Type and distribution of user actions (what actions are users performing, and how often?)
- Location distribution of your users
- Peak periods of utilization
- Transaction or request response time or throughput
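Once you have raw request records, the metrics above fall out of a few simple aggregations. Below is a minimal sketch; the record fields (user ID, action, hour, duration) and sample values are illustrative assumptions — substitute whatever your logs or monitoring actually capture.

```python
# Sketch: deriving the utilization metrics above from raw request records.
# The tuple layout (user_id, action, hour_of_day, duration_ms) is an
# assumption for illustration, not a real data source.
from collections import Counter
from statistics import mean

requests = [
    ("u1", "browse",   9, 120),
    ("u2", "browse",   9, 180),
    ("u1", "checkout", 9, 450),
    ("u3", "browse",  13,  95),
    ("u2", "checkout", 13, 510),
    ("u3", "browse",  20, 300),
]

# Concurrency proxy: distinct users seen per hour
users_per_hour = {}
for user, _, hour, _ in requests:
    users_per_hour.setdefault(hour, set()).add(user)
print({h: len(u) for h, u in users_per_hour.items()})

# Type and distribution of user actions
action_mix = Counter(action for _, action, _, _ in requests)
print(action_mix.most_common())

# Peak period of utilization
peak_hour = max(users_per_hour, key=lambda h: len(users_per_hour[h]))
print("peak hour:", peak_hour)

# Response-time summary
durations = [d for *_, d in requests]
print("mean ms:", round(mean(durations)))
```

The same aggregations scale to millions of records with a real data pipeline; the point is that each bullet above maps to a concrete, computable number.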
There are many strategies for gathering this type of information. Here are just a few:
- Leveraging an Application Performance Monitoring system (APM):
- If you are fortunate enough to have an APM solution implemented, you already have many of these metrics. APM solutions can provide user session data, infrastructure monitoring, business metrics (bounce rate and conversion goals), user experience monitoring (UEM), and much more.
- Leverage web analytics data:
- If you have an externally facing application, you may already be using third party analytics solutions like Google Analytics to gather marketing data about your website. This information can be invaluable to help construct real life scenarios for how your application is being used.
- Utilize web and application logs:
- Your systems’ logging data can provide valuable insight into how your application is being used. While it is often difficult to fully reverse engineer user scenarios from log data, it is a place to start. Your organization may also already have a centralized log management (CLM) solution that lets you mine log data more effectively.
- Utilize your application data:
- Your data is key. You should already have data around business transactions stored in your databases and this data can help construct usage patterns across a given time horizon.
- Speak with the key business stakeholders and organization experts:
- Your organization may be filled with experts that understand how your organization’s data will be leveraged with the development of a new application, new features or functionality, or changing use patterns for an existing application. Don’t underestimate the value of your people.
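As a concrete example of the log-mining strategy above, even a short script over web server access logs surfaces which actions dominate and where errors cluster. The sketch below assumes Common Log Format lines; adjust the regular expression to whatever format your server actually writes.

```python
# Sketch: mining web access logs (Common Log Format assumed) for usage
# patterns. The sample lines are fabricated for illustration.
import re
from collections import Counter

log_lines = [
    '10.0.0.1 - - [12/May/2020:09:01:07 +0000] "GET /catalog HTTP/1.1" 200 5120',
    '10.0.0.2 - - [12/May/2020:09:01:09 +0000] "POST /cart HTTP/1.1" 200 310',
    '10.0.0.3 - - [12/May/2020:09:01:10 +0000] "GET /catalog HTTP/1.1" 200 4800',
    '10.0.0.1 - - [12/May/2020:09:01:12 +0000] "POST /checkout HTTP/1.1" 503 0',
]

pattern = re.compile(r'"(?P<method>\w+) (?P<path>\S+)[^"]*" (?P<status>\d{3})')

hits = Counter()
errors = Counter()
for line in log_lines:
    m = pattern.search(line)
    if not m:
        continue
    hits[m.group("path")] += 1
    if m.group("status").startswith("5"):
        errors[m.group("path")] += 1

print(hits.most_common())  # which actions dominate
print(errors)              # where the 5xx responses cluster
```

Request counts per path give you the action distribution for a test scenario, and the 5xx clusters hint at which user journeys are already under stress.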
Does your application have a performance baseline?
Once you can construct expected utilization patterns, you are ready to define performance use cases and scenarios for performance load testing and establish a performance baseline. A baseline is a starting point that can be used for future comparisons.
Your use cases are step by step user actions for a given user in the system. Based on your analysis you may discover that you have many different types of user journeys within your application. The distribution of these user journeys is what goes into a test scenario.
- Use Case: Step by step actions of a given user (or simulated) visit
- Scenario: Number and distribution of use cases (and users) across a given time horizon and location
The use cases are used to build your load testing scripts, which can then be incorporated into a load testing scenario and run with a load testing tool. The results of this test become your baseline for expected performance, and getting that initial baseline is a good place to start.
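A baseline only pays off when later runs are compared against it automatically. The sketch below is one minimal way to do that with stdlib Python; the sample response times and the 20% regression tolerance are illustrative assumptions, not recommended values.

```python
# Sketch: turning one load-test run's response times into a baseline,
# then flagging later runs that regress against it. Sample data and the
# tolerance value are assumptions for illustration.
from statistics import median, quantiles

response_times_ms = [110, 125, 98, 140, 132, 105, 450, 118, 122, 101]

baseline = {
    "median_ms": median(response_times_ms),
    "p95_ms": quantiles(response_times_ms, n=100)[94],  # 95th percentile
}
print(baseline)

def regressed(new_p95: float, baseline: dict, tolerance: float = 1.2) -> bool:
    """Flag a later run whose p95 drifts more than 20% above the baseline."""
    return new_p95 > baseline["p95_ms"] * tolerance
```

Checking each release's run with `regressed(...)` turns the baseline from a one-time report into a guardrail.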
Do you understand the limits of your current architecture?
Now that you know the expected usage and expected performance of your application under load, it is time to expand your testing. With load testing we seek to understand performance under expected load conditions; we also need to know how far we can stress the application before it no longer meets our performance requirements. That is where performance stress testing is valuable. With stress testing, we seek to understand how much load can be applied to a system before it no longer meets its requirements, or is no longer available at all. There are a number of strategies for stress testing an application, but a good starting point is gradually increasing the number of concurrent users in your load test scenarios until the system no longer responds within requirements or becomes unavailable. It is important to get a full view of application and system resource utilization metrics to identify these limits, which may be imposed by available system resources or by serialization of resources in terms of queueing or locking. This is where full stack application and infrastructure monitoring is invaluable.
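The step-load approach described above can be sketched in a few lines. In the toy model below, `make_request` is an assumption standing in for real HTTP calls issued by your load-testing tool, and its latency formula is fabricated so the example is self-contained; the structure (ramp concurrency, measure p95, stop at the breaking point) is the part that carries over.

```python
# Sketch: step-load stress test that ramps concurrent simulated users
# until the p95 response time exceeds a target. make_request and its
# latency model are assumptions standing in for real HTTP traffic.
import concurrent.futures
import random
from statistics import quantiles

P95_TARGET_MS = 300.0  # illustrative requirement, not a recommendation

def make_request(concurrency: int) -> float:
    # Toy latency model: response time grows with contention.
    return 80.0 + concurrency * 4.0 + random.uniform(0, 20)

def run_step(concurrency: int) -> float:
    """Issue 5 requests per simulated user and return the p95 latency."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        times = list(pool.map(make_request, [concurrency] * concurrency * 5))
    return quantiles(times, n=100)[94]

for users in (10, 20, 40, 80):
    p95 = run_step(users)
    print(f"{users:>3} users -> p95 {p95:.0f} ms")
    if p95 > P95_TARGET_MS:
        print(f"limit reached near {users} concurrent users")
        break
```

In a real test the step sizes would be coarser at first and finer near the limit, and you would correlate each step with resource-utilization metrics to see *why* the system stops keeping up.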
How easily does your application scale to increasing utilization requirements?
Knowing the system limits is one thing, but how can you adjust before your system reaches its limits? The ability of an application to scale is affected by many factors including the efficiency of application and database code, caching, and application infrastructure (just to name a few). Application infrastructure can scale horizontally (adding more servers to distribute the load) or vertically by increasing the size of underlying servers or instances supporting the application. Many cloud platforms make it easy to dynamically scale the infrastructure through auto-scaling, but this is rarely seamless. Creating load test scenarios to stress the underlying infrastructure and test the scalability of the hardware and how auto-scaling responds in the real world should be part of your load testing strategy.
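One way to reason about how far horizontal scaling can take you before testing it is Gunther's Universal Scalability Law, which models throughput as a function of node count with contention and coherency penalties. The coefficients below are illustrative assumptions; in practice you would fit them to your own load-test measurements.

```python
# Sketch: Universal Scalability Law as a back-of-the-envelope model for
# horizontal scaling. sigma (contention) and kappa (coherency) values are
# illustrative assumptions, normally fitted from load-test data.
def usl_throughput(n: int, lam: float = 1000.0,
                   sigma: float = 0.05, kappa: float = 0.001) -> float:
    """Requests/sec at n nodes, given lam req/s on a single node."""
    return lam * n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

for n in (1, 2, 4, 8, 16, 32, 64):
    print(f"{n:>2} nodes -> {usl_throughput(n):.0f} req/s")
```

The model makes the key point of this section concrete: adding nodes yields sub-linear gains, and past some node count throughput actually falls, which is exactly why auto-scaling behavior needs to be load tested rather than assumed.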
If your organization can answer these questions, you are in a much better position to respond to unforeseen usage patterns like those we are seeing with the COVID-19 pandemic. Furthermore, you can begin to expand your load testing strategy by incorporating additional performance stress testing edge cases and by making regular performance load testing part of your application release cycle.