Pass/Fail: A Terrible Way to Test
{This is a post I wrote for the public tech blog at one of the companies I have worked for, OnLive, a cloud gaming service. It's aged pretty well, and the company is no longer around to stop me, so I'm publishing it here!}
Software QA is full of pragmatic choices and best guesses, and the ubiquity of pass/fail tests doesn't help matters. How can we make automated test results clearer, more reliable, and more useful?
In many companies, you can find two main areas in which testing takes place:
- Unit testing of modules and executables at check-in time, sometimes as part of a continuous integration process, and
- Integration and system testing, performed on a test instance or a controlled subset of your whole site or system.
Both categories can benefit from being more mindful about the data behind the pass/fail result. Let’s take a few example cases:
a) Verifying that a new user can complete the signup process. A frequently-changing UI might make this a better job for a QA engineer evaluating that part of the site by hand, but if the UI is stable it can be automated with common test automation tools.
b) Validating a geoip function that returns the ISO 3166-1 country code associated with a given IP address. Maybe a case for “Monte Carlo” testing, where random values are fed in as IP addresses.
Now, how can we improve these test cases from their humdrum pass/fail lives? By taking a page from the manufacturing QA textbook, and making an effort to record measurements instead of pass/fail.
In the case of the signup test, how about measuring response times? Get the login page. How long until it is completely loaded? Enter a bad username and password, and click ‘Login’. How long does the error message take to appear?
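As a concrete sketch of what that could look like, here is one way to time those two steps with Selenium WebDriver in Python. The URL and the element IDs (“username”, “password”, “login”, “error-message”) are hypothetical placeholders for whatever your own login page uses, and your automation tool of choice will have its own equivalents.

```python
# Minimal timing sketch with Selenium WebDriver. The URL and element IDs
# below are hypothetical placeholders for your own login page.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()

# How long until the login page is completely loaded?
start = time.monotonic()
driver.get("https://example.com/login")   # placeholder URL
WebDriverWait(driver, 30).until(
    lambda d: d.execute_script("return document.readyState") == "complete"
)
page_load_seconds = time.monotonic() - start

# Enter a bad username and password, click 'Login', and time the error message.
driver.find_element(By.ID, "username").send_keys("nosuchuser")
driver.find_element(By.ID, "password").send_keys("wrongpassword")
start = time.monotonic()
driver.find_element(By.ID, "login").click()
WebDriverWait(driver, 30).until(
    EC.visibility_of_element_located((By.ID, "error-message"))
)
error_message_seconds = time.monotonic() - start

print("page load:", page_load_seconds, "error message:", error_message_seconds)
driver.quit()
```

Those two numbers, recorded on every run, are the raw material for the trending discussed below.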
What about the geoip function? Give it a thousand random IP addresses, salted with 25% intentionally invalid (but still random) addresses. What were the mean and median response times of the function? What percentage of requests were timeouts? What percentage were detected as invalid IP addresses? Was that number higher (or lower!) than it should be?
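A rough Python sketch of that run follows. The geoip_country_code() function here is only a stub standing in for whatever lookup you are actually testing, and the conventions that None means “invalid address” and that a timeout raises TimeoutError are assumptions to adapt to your real API.

```python
# Monte Carlo sketch: 1000 random IPs, 25% of them deliberately invalid.
import random
import statistics
import time

def geoip_country_code(ip):
    # Stub standing in for the real function under test; swap in your geoip lookup.
    parts = ip.split(".")
    if len(parts) != 4 or not all(p.isdigit() and int(p) <= 255 for p in parts):
        return None          # assumed convention: None means "invalid address"
    return "US"

def random_valid_ip():
    return ".".join(str(random.randint(0, 255)) for _ in range(4))

def random_invalid_ip():
    # Deliberately malformed: the first octet is out of range.
    return "{}.{}.{}.{}".format(random.randint(256, 999), random.randint(0, 255),
                                random.randint(0, 255), random.randint(0, 255))

inputs = [random_valid_ip() for _ in range(750)] + [random_invalid_ip() for _ in range(250)]
random.shuffle(inputs)

timings, timeouts, invalid_detected = [], 0, 0
for ip in inputs:
    start = time.monotonic()
    try:
        result = geoip_country_code(ip)
    except TimeoutError:     # however your real lookup signals a timeout
        timeouts += 1
        continue
    timings.append(time.monotonic() - start)
    if result is None:
        invalid_detected += 1

print("mean response time:  ", statistics.mean(timings))
print("median response time:", statistics.median(timings))
print("timeouts:         {:.1%}".format(timeouts / len(inputs)))
print("detected invalid: {:.1%} (we salted in 25%)".format(invalid_detected / len(inputs)))
```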
In addition, if you have a test database or message queue involved, you can see what happens if you increase or decrease traffic or delays. You could tie in tcpreplay (http://tcpreplay.synfin.net/) to generate traffic, use wanem (http://wanem.sourceforge.net/) to simulate network conditions, or write a script to artificially load your database.
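That last option can be as simple as a few worker threads hammering a scratch table while your normal tests run. Here is a minimal sketch; sqlite3 is used only to keep the example self-contained, and in practice you would point the workers at your real test database.

```python
# Artificial database load: a few threads inserting junk rows for a fixed duration.
import sqlite3
import threading
import time

DB_PATH = "loadtest.db"      # hypothetical scratch database

def setup():
    con = sqlite3.connect(DB_PATH)
    con.execute("CREATE TABLE IF NOT EXISTS noise (ts REAL, payload TEXT)")
    con.commit()
    con.close()

def hammer(duration_seconds=10, delay=0.01):
    # Each worker inserts rows in a loop until the deadline; shrink 'delay' for more load.
    con = sqlite3.connect(DB_PATH)
    deadline = time.monotonic() + duration_seconds
    while time.monotonic() < deadline:
        con.execute("INSERT INTO noise VALUES (?, ?)", (time.time(), "x" * 256))
        con.commit()
        time.sleep(delay)
    con.close()

setup()
workers = [threading.Thread(target=hammer) for _ in range(4)]
for w in workers:
    w.start()
# ... run the tests you want to measure here, while the load is applied ...
for w in workers:
    w.join()
```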
Now you have a wonderful thing. Instead of a bunch of pass/fail tests, you have a set of numbers.
In the field of manufacturing quality, there is a notion of “control” of your process. You can take numeric measurements such as plastic durometer readings, circuit impedance, or switch actuations until failure, and look at them over time. You can see from one manufacturing run to the next how those numbers vary, and make educated decisions about whether a problem needs fixing, or just mitigation, and how urgently.
See Wikipedia’s Control Chart entry (http://en.wikipedia.org/wiki/Control_chart) for more detail, but in general you can graph these results over time and set alerts if they stray more than 1 or 2 standard deviations from the past data set. If there are questions about how much deviation should be allowed, you can investigate and get feedback from users or management.
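The alerting part needs very little machinery. A minimal sketch, assuming your test runs already record their measurements somewhere you can read back as a list (the numbers here are made up):

```python
# Control-chart style check: flag the latest measurement if it strays more than
# 2 standard deviations from the historical data set. The numbers are made up.
import statistics

history = [1.21, 1.19, 1.25, 1.18, 1.22, 1.20, 1.24, 1.23]   # past page-load times, seconds
latest = 1.41                                                # today's run

mean = statistics.mean(history)
stdev = statistics.stdev(history)

if abs(latest - mean) > 2 * stdev:
    print("ALERT: {:.2f}s is more than 2 sigma from the historical mean of {:.2f}s".format(latest, mean))
else:
    print("Within control limits.")
```

Graphing that same history over time gives you the control chart itself; the check above is just the alert threshold.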
You can set absolute limits that the numbers should never go above or below. You can perform arithmetic and examine the results: if an operation queries three separate services, then the sum of their response times should be examined in addition to the individual results, as in the sketch below.
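For example, checks on three hypothetical services (the names, timings, and limits here are all made up) might look like:

```python
# Per-service limits plus a limit on the combined response time.
# The service names, measured times, and limits are hypothetical.
service_times = {"auth": 0.12, "profile": 0.31, "billing": 0.27}   # seconds
per_service_limit = 0.5
total_limit = 0.8

for name, seconds in service_times.items():
    assert seconds <= per_service_limit, "{} exceeded its limit: {:.2f}s".format(name, seconds)

total = sum(service_times.values())
assert total <= total_limit, "combined response time too high: {:.2f}s".format(total)
print("total across all three services: {:.2f}s".format(total))
```

Trending measured results won’t solve all your QA problems, but it can certainly be a powerful tool in improving your software!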