Friday, August 22, 2014

Let the CI server take care of these Sporadics

In my last post on Sporadics (http://bit.ly/sporadics-1) I came to the conclusion that Sporadics do matter. Sporadics litter your test suites with failures. Any test suite with failures requires attention. Someone needs to have a closer look to figure out what went wrong.

If test suites usually succeed, any test failure signals an issue that was just introduced, either by the very change under test or, in the case of nightly integration runs, by a limited number of changes. In either case the failure is worth the work one has to invest in analyzing it.

With Sporadics creating a constant buzz of test failures, test suites fail with a varying number of failures. Every one of them, every time, requires someone to have a closer look, and chances are that only the “known” issues occurred. If this happens over and over again, less and less attention will be paid and more and more “unknown” issues will go unnoticed. With an increasing number of Sporadics it becomes easier for a severe failure to slip through. The safety net ceases to be of any worth.

Sporadics are not only dangerous but expensive and frustrating too. So, someone or something has to deal with them. Who or what could this possibly be?

In agile methodology we want to automate as much as possible, if not everything. So what about automating the detection of Sporadics?

It seems feasible, doesn't it? We have monotonous and repetitive work which in fact is just plain pattern matching: does the failure look like one we already have on our list? Just compare test output of any kind (log files, backtraces and so on) with the known patterns and mark the failure as a known issue, maybe even with a unique identifier attached to it. A rather simple script could do this.
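A minimal sketch of such a script could look like the following Python snippet. The catalogue of known Sporadics, the identifiers and the regular expressions are made up for illustration; a real setup would feed them from whatever store the team keeps its known failures in.

```python
import re
from typing import Optional

# Hypothetical catalogue of known Sporadics: an identifier plus a regular
# expression expected to show up in the failure output (log excerpt, backtrace, ...).
KNOWN_SPORADICS = {
    "SPOR-17": re.compile(r"ConnectionResetError.*payment-service", re.DOTALL),
    "SPOR-23": re.compile(r"TimeoutError: lock not acquired within \d+s"),
}

def classify_failure(failure_output: str) -> Optional[str]:
    """Return the identifier of a matching known Sporadic, or None for a new failure."""
    for sporadic_id, pattern in KNOWN_SPORADICS.items():
        if pattern.search(failure_output):
            return sporadic_id
    return None
```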

And the best place to run such a script would be the CI server, where all the changes for mainline are built and tested anyway. Just add it to the nightly builds of mainline too. With that we would have a nice automated detection of Sporadics and would always know whether or not a change introduces new bugs we really have to take care of instantly.

Well, sounds promising. No more of these stupid tasks of scanning test results. Just have a look at the CI server output and you instantly know whether the change looks good or not.

Done. Continuous integration and automation did the job.

Really?

Three questions arise - at least:
  1. How precise should the pattern matching be?
  2. What if we find a known Sporadic in a test run? What action should be taken?
  3. How would a new test failure be marked as a Sporadic, and how would a Sporadic be removed from the list?
Suppose the productive code or the test code changes and the known Sporadic slightly moves its location in the code, or the lines written to the log change and no longer fit the stored log content of the known Sporadic. How flexible should the pattern matcher be? If it's too tight it might report known Sporadics as new ones, thus lengthening the list of known Sporadics unduly. If it's too loose it might report new failures as known ones, which would increase the risk of severe issues slipping through.
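One way to loosen the matching without giving it up entirely is to mask the volatile parts of the output before comparing and then use a similarity threshold instead of an exact match. The following sketch (function names and threshold value chosen purely for illustration) shows the idea; the threshold is exactly the knob between “too tight” and “too loose”.

```python
import re
from difflib import SequenceMatcher

def normalize(output: str) -> str:
    """Mask volatile details (line numbers, timestamps, addresses) so a known
    Sporadic still matches after small, unrelated code changes."""
    output = re.sub(r"line \d+", "line N", output)
    output = re.sub(r"\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}", "<timestamp>", output)
    output = re.sub(r"0x[0-9a-fA-F]+", "<addr>", output)
    return output

def looks_like_known_sporadic(known_output: str, new_output: str,
                              threshold: float = 0.9) -> bool:
    """Fuzzy comparison of two failure outputs; tuning the threshold trades
    missed known Sporadics against misclassified new failures."""
    ratio = SequenceMatcher(None, normalize(known_output), normalize(new_output)).ratio()
    return ratio >= threshold
```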

Only these few thoughts show that a Sporadics detector itself would be quite a sophisticated piece of code, which brings its own risks that would need to be taken care of. And don't forget the amount of work required to develop and maintain it.

Suppose a Sporadics detector is in place and works quite well. What happens if it detects known Sporadics in a test run? Should it just report them and mark the test run as a success if there are no other failures? Should it try to rerun the Sporadics to make them a success in the second run? Or in the third run? Or …?

How far would you go to make the tests green? And what if the Sporadic wouldn't turn green this time? Would it be a failure then, although you know it's a Sporadic? Probably one would do without rerunning Sporadics and just rely on the detection: if a known Sporadic has been identified, accept it as it is and remove it from the test result.
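Under that policy the detector's job reduces to splitting the failures of a run and deciding the overall verdict. A minimal sketch, assuming a classify function like the one above:

```python
def evaluate_run(failures, classify):
    """Split failures into known Sporadics and genuine new ones; the run is
    reported green only if every failure matched a known Sporadic."""
    known, new = [], []
    for failure in failures:
        sporadic_id = classify(failure)
        (known if sporadic_id else new).append(failure)
    return {"green": not new, "known_sporadics": known, "new_failures": new}
```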

Suppose there is a test failure and it is not a known Sporadic. Should it be taken as a failure that someone has to investigate immediately? Or would you try a rerun of all failures to check whether or not they fail again?

In the first scenario some developer or QA guy would have to investigate the failure. If it turns out to be a Sporadic, then this guy would have to add it to the database of known Sporadics for the Sporadics detector to find it the next time it occurs.

In the second scenario the Sporadics detector would do the job of rerunning and guessing. If the test does not fail again in a number of reruns, it will be considered a Sporadic and added to the database. If it keeps failing, the test run will be marked as a failure.
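A sketch of that second scenario, with run_test and add_to_database as assumed hooks into the test runner and the Sporadics database:

```python
def triage_unknown_failure(test_id, run_test, add_to_database, max_reruns=3):
    """Rerun a failed test a few times. If it ever passes, record it as a new
    Sporadic; if it keeps failing, treat it as a genuine failure."""
    for _ in range(max_reruns):
        if run_test(test_id):          # True means the rerun passed
            add_to_database(test_id)   # remember it as a known Sporadic
            return "sporadic"
    return "failure"
```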

So managing the list of known Sporadics will introduce either some extra work for a developer or QA person, or, in the case of automation, extra run time on each and every test run, because every test failure will be treated as a possible Sporadic. Thus test runs will be prolonged, turnaround times will increase and the throughput of the CI server will decrease. Which in turn means extra costs.

Suppose one is willing to pay these costs to catch those Sporadics. What is the tradeoff? With this automation of Sporadics detection we take some load off the developers. They no longer have to analyze all these Sporadics manually. Which gives them more time to add new features. Surely everyone would like that!

Really?

We were talking about attitude towards Sporadics. No-one takes them seriously. They keep on piling up. Does the automation of Sporadics detection help to change this behavior? I don't think so. It's the complete opposite. Sporadics drift even further out of focus. Developers no longer need to care for them. They are no longer a pain in the … for them. One does not need to be a fortuneteller to expect the number of Sporadics to keep on increasing.

To be honest: if the corpus of tests is in bad condition and there are more failing test runs than successful ones, such a mechanism (as cheap as possible) might help to classify the test failures and produce a list of Sporadics one could work on. But don't the developers already have such lists of their own Sporadics? They did use them for their manual checks. They are written down in some issue tracker, wiki, Excel sheet, text document or on plain post-it notes on some whiteboard. All these solutions are by far cheaper than the automation and automated central management of Sporadics. Last but not least, these solutions keep the developers in the loop. They get to feel the pain of Sporadics every day.

Something else would be required to solve the Sporadics issue. Something that really puts pressure on the developers to solve it. But that's another story.



The opinions expressed in this blog are my own views and not those of SAP
