Thursday, August 28, 2014

Where do these Sporadics come from?

In part 2 of this series I concluded that automating the detection of intermittent, random or non-deterministic tests (aka Sporadics) comes with unforeseeable extra costs. It may serve to monitor Sporadics and might yield information for ranking them, to decide which should be solved first. But if the effort stops there, one has basically invested in managing the status quo instead of changing it for the better.

Before proposing a possible solution to the Sporadics problem I need to elaborate more on how a test becomes a Sporadic. Understanding this will give hints toward the solution to be established.

The basic assumption is that all tests were successful when they were first added to their test suites. Otherwise we would not be talking about Sporadics …

Suppose there is a test which ran successfully for a fair amount of time. Weeks or even months. Everything was fine until the day it turned red for the very first time. A test failure occurred. Nothing special. What happened back then? Someone might have investigated the issue. After some thorough work they came to one or more of these conclusions:

  1. The test failed because of a bug in the test, the bug got fixed
  2. The test failed because of a bug in the productive code, the bug got fixed
  3. The test ran successfully again when the complete test run was repeated
  4. The test ran successfully when repeated in isolation, so there supposedly was no issue with the test or the code under test itself
  5. The test failure does not relate to the change made to the productive code at all. Strange, but well (imagine a shrug when reading)
  6. There was a filer outage and the test wasn’t able to read or write a file, or there was some other issue with a part of the infrastructure the test was using
  7. You name it

(I will not go into detail about inherently fragile tests, some of which could be a great source of Sporadics. Nor will I elaborate on possible root causes that would make a test intermittent. That is not the point of this post. I will save it for later ones.)

We are not talking about the first two items in this list. These are the good cases where the safety net worked and the required actions were taken.

Nor are we talking about the occurrences where a real root cause analysis was made and the problem was found and fixed. This probably happens more often than not, especially if finding 5 was not accompanied by finding 3 or 4, but not every time, as the number of Sporadics in a corpus of tests will tell you.

Item number 6 would be an easy catch: infrastructure issue. The guys maintaining it fixed the issue, or the issue was a temporary one. The test itself, however, was fine all along. No-one has to do anything about it. Surely it will be green again as long as the infrastructure issue does not reappear.

Items 3 and 4 tend to be soothing enough for the guy investigating the issue that no further action follows. Looks good now. So, just merge into mainline. Must have been some sort of hiccup.
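
Why a green rerun is so soothing, and so misleading, can be put into numbers. What follows is only a minimal sketch, assuming a test that fails independently in a fixed fraction of its runs. The 2% rate and the function names are made up for illustration; they are not measured data.

```python
def rerun_passes(failure_rate: float) -> float:
    """Probability that one independent rerun of the test is green."""
    return 1.0 - failure_rate


def fails_at_least_once(failure_rate: float, runs: int) -> float:
    """Probability of at least one red run among `runs` independent runs."""
    return 1.0 - (1.0 - failure_rate) ** runs


if __name__ == "__main__":
    rate = 0.02  # hypothetical: the test fails in 2% of its runs
    print(f"single rerun is green:   {rerun_passes(rate):.1%}")
    print(f"red again in 100 runs:   {fails_at_least_once(rate, 100):.1%}")
```

Under these assumed numbers a single rerun is almost always green, yet the same test will very likely turn red again somewhere within the next hundred runs. The green rerun says little about whether the problem is gone.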

Item number 5 took the most time to investigate. It looks strange, but a rerun standalone and as part of its test suite succeeded. Somehow it leaves a bad taste in the mouth. But hey, didn’t it succeed all the time? And now it does again. Let’s leave it alone and do some important work, what d’you think?

While item 6 is a bad signal in itself, items 3, 4 and 5 are the ones that could break the neck of the company. If we are lucky they “only” signal issues with the tests themselves. Some hidden dependency that appears in some strange situations just on days of certain signs of the zodiac while the moon is in a particular position … you get the idea. If we are not that lucky they are only the tip of the iceberg, and there is some non-deterministic behavior in our productive code which might lead to loss of data or some other hazardous event when in production use at a customer’s site. Maybe you only have a hard time analyzing it, working long hours and at weekends. Or you might be facing a PR disaster or, even worse, a substantial claim for damages.

Experience shows that test failures like items 3 through 5 are shrugged off more easily the more often they occur. These failures somehow are not taken as seriously as they should be. Quite often there is an argument based on pure statistics or on issue tracker records for the code under test, showing that it is not worth the time one would have to spend to find the root cause or, if the root cause is known, to fix it. I have personally witnessed such a line of thought more than once. However, tests that failed this way once will fail a second time, a third time and over again. And there you are. Now you have a Sporadic. A known issue. An item on a list or a record in a database. And one by one they creep in.

Why could this happen? In an ideal world a developer would follow the Continuous Integration principle and thus would be eager to get rid of any test failure in mainline builds at any time. Since we are not living in such an ideal world, things are a bit different. There are test failures that won't be fixed for the reasons listed above. It would be too easy to blame developers for not caring.

Developers find themselves confronted with various, concurrent and sometimes even contradictory requirements. Features always come first because the company gets paid for them. Maintenance is important too. Not to forget quality. Or some refactorings. Issues with sporadically failing tests for features delivered long ago (several weeks or months) tend to get lost in this. They just don’t reach the level of urgency they would need to get the necessary attention.

In part 1 I was complaining about the attitude developers show towards the Sporadics. This attitude is influenced by the whole environment developers find themselves in. So just introducing yet another tool will not help. There is also the social aspect of this. Or as Steve Sether pointed out in his comment on my second post of this series:

"Since the problem is essentially a social problem, I think we should look towards social science for guidance.  It's been experimentally verified that people discount any potential badness that happens in the future.  So, bad thing in the future is much better (in people’s minds) than bad thing in the near present."

So let's explore the possible solution next time around. The teasing will have an end then ...

The opinions expressed in this blog are my own views and not those of SAP