May 3, 2015 // By Andy Tinkham
Earlier this year, the astronomers of the International Earth Rotation and Reference Systems Service in Paris announced that 2015 will include a leap second. On June 30th, the clocks will change from 23:59:59 UTC to 23:59:60 and then to 00:00:00 to keep our atomic clocks in alignment with the Earth’s rotation. This will be the 26th leap second added since the system was first implemented in 1972.
The last time we added a leap second (back in 2012), several high-profile websites crashed or had other problems due to this extra second. StumbleUpon, Foursquare, LinkedIn, Reddit, Mozilla and Yelp all encountered problems, as did the Amadeus airline reservation system used by airlines in Australia. These problems were traced back to a bug in the Linux operating system that caused the system to deadlock. The bug in Linux has long since been fixed, and people are generally more aware of the potential risk as a result of 2012, but some software may still be vulnerable. How can you determine whether yours is?
The first step is to understand how your software uses time. Many systems probably will not even notice the leap second – they don’t query the time every second and may never see the 60-second mark. However, time is one of those insidious things in computing. Computers themselves are so time-dependent (even their CPU speeds are measured in cycles per second), and time is so readily available to programmers, that it gets used for many things we might not consider: seeding random number generators, giving files unique names, and a number of other purposes. Short of taking on a long-term, detailed code analysis, we may not uncover all the places our systems use time.
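One quick way to start probing how a piece of code treats the extra second is to hand it a timestamp with a seconds value of 60 and see what happens. The sketch below is only an illustration using Python's standard library, not a complete leap second test. The names `LEAP_STAMP`, `parse_with_time_module` and `datetime_accepts_second_60` are made up for this example, and your own system's parsing code would take the place of the library calls.

```python
import time
from datetime import datetime

# Hypothetical sample input: the leap second announced for June 30, 2015.
LEAP_STAMP = "2015-06-30 23:59:60"
FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_with_time_module(stamp):
    """Parse with time.strptime, whose %S directive allows seconds above 59."""
    return time.strptime(stamp, FORMAT)


def datetime_accepts_second_60():
    """Return True if datetime can hold second 60, False if it rejects it."""
    try:
        datetime(2015, 6, 30, 23, 59, 60)
    except ValueError:
        return False
    return True


parsed = parse_with_time_module(LEAP_STAMP)
print("time.strptime tm_sec:", parsed.tm_sec)
print("datetime accepts second 60:", datetime_accepts_second_60())
```

Two time APIs in the same standard library disagree about whether 23:59:60 is a valid time, which is exactly the kind of inconsistency that can hide in code passing timestamps between components.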
We need to consider another factor in our analysis – the risk that something bad will happen when that second ticks over. Some pieces of software are absolutely critical; others can be down for a bit with minimal impact. A personal blog that gets a single-digit number of hits per day, for example, would have a low risk of impact. A system processing large amounts of data 24 hours a day would likely be at much higher risk. We need to look at our systems piece by piece, asking ourselves what a failure in that piece would look like, how we’d know if such a failure occurred, and what the impact of that failure would be. Then we can focus our efforts on the areas most likely to have serious problems, addressing the less severe or less failure-prone areas later as time allows.
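The piece-by-piece assessment described above can be sketched as a simple ranking. This is a minimal illustration that assumes each piece is scored from 1 to 5 for likelihood of failure and for impact, with risk taken as the product of the two. That scoring scheme is a common convention rather than the only option, and the component names and scores here are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    likelihood: int  # assumed scale: 1 (unlikely to fail) to 5 (likely to fail)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk(self):
        # Simple risk score: likelihood multiplied by impact.
        return self.likelihood * self.impact


def rank_by_risk(components):
    """Return the components ordered from highest to lowest risk score."""
    return sorted(components, key=lambda c: c.risk, reverse=True)


# Hypothetical scores for illustration only.
components = [
    Component("personal blog", likelihood=1, impact=1),
    Component("24-hour data pipeline", likelihood=4, impact=5),
    Component("nightly report job", likelihood=2, impact=3),
]

for component in rank_by_risk(components):
    print(f"{component.name}: risk {component.risk}")
```

Testing effort would then start at the top of the ranked list and work down as time allows.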
This reflects, at its core, both sides of risk-based testing. First, we’re using risk as a lens to identify potential test cases. For the leap second, we’re focusing on a specific set of risks – those associated with our system working with time and those associated with critical functions of our systems. Using this lens opens up additional test cases that we might not see if we looked at the system through a different lens (such as requirements). This gives us a broader understanding of what we could do to test our system. Then, recognizing that we never have time to run all the tests we could choose, we use risk as a test prioritization factor, addressing the areas most likely to exhibit problems or to have the biggest impacts on our systems and users if a problem did occur. In this way, we can introduce flexibility into our testing, allowing us to streamline our actions, pay attention to the value of the actions we choose, and maintain our team’s velocity while minimizing our liability when rare events like leap seconds occur.