Saturday, September 6, 2014

Tune up the Speakers – Let Sporadics Scream at you

In my last post on this topic I pointed out that the main reason Sporadics creep into your test base is the lack of urgency to prevent them from doing so. These days developers find themselves confronted with competing requirements, and fixing Sporadics is one of the least pressing ones. Making them more pressing is therefore required to move them up the queue. We also talked about the shift in mindset that is required to pay more attention to so-called Sporadics, to get them analyzed and out of the test base. Just putting more pressure on them from the technical side will not help. In the first post of this series I complained about the it’s-just-a-sporadic attitude, which contributes to the lack of urgency. This has to go too.

So, what can be done? The first step is to make the Sporadics ring in your ears each time they show up.

Actually there are two solutions, an informal one and a tool-supported one, that share a common idea. The first works fine for smaller teams or projects; the second may be required for larger ones. Both solutions work with these principles in mind:

“There are no Sporadics. There are failing tests only”

Make people understand this. There are no second-class failures that can somehow be ignored or given less attention. Every test failure indicates an issue to be solved, as pointed out earlier. No statistics needed. No guessing about known issues required. The second principle is borrowed from “Continuous Delivery” [1]:

“Do not check in on a broken build”

When the build and test pipeline breaks, something is wrong with your product. It does not make sense to build on sandy ground. The reason for the previous failure might be a bug in the application. The basic assumption is that the responsible developers investigate the failure and eventually provide a fix. Only after this has been done is the next check-in allowed. As the authors of [1] point out, breaking this rule prolongs the time until the build is green again, because analyzing a moving target is much harder. In addition, people get used to seeing broken builds, which leads either to indifference or to a big effort to clean up the build every now and then, which nobody really wants to do.

Smaller teams or projects can actively live both principles without any additional processes or tools, provided people accept them as the way they want to work. I’ve seen this work in real life. As long as the people involved share these principles and do not cease to enforce them, they will be perfectly fine. But this may turn out to be hard work, and sometimes the discipline erodes. Once the window is broken, it is hard to fix. This gets even harder in big projects with many teams involved. This is why I came up with the proposal below. It is basically about how to make a large organization react properly and instantly to broken builds, not about how a certain kind of Sporadic failure can be fixed to make the test more stable. I will come to that later, though, because I think the posts on the net lack one important discussion: is the intermittent test really the test you should have, or would another testing strategy for the feature under test help avoid the issues that usually make tests unstable?

Proposal: Green Mainline Policy


When gentlemen’s agreements do not help to keep the “Do not check in on a broken build” principle alive, it might be time to establish a tool-supported process that enforces the desired behavior. I will call this a “Green Mainline Policy” or “Green Branch Policy”, for it is not only applicable to the mainline. (In the following I will only refer to mainline for the sake of readability.)

The objectives of the proposal are:
Reduce or eliminate the occurrence of Sporadics
Make fixing a broken build the highest priority task
Must work for large projects

So, what’s the proposal then? Basically it’s quite simple. Suppose a continuous build and test run starts after a change has been submitted to mainline. If this run succeeds, everything is fine, obviously. If it fails, the CI server will
close mainline for any further submissions except those from the person/team responsible for fixing the issue
open a ticket in the issue tracker
send a mail to the team/person responsible for the broken test

When the fix has been provided, it is accepted for submission to mainline and a new build and test run starts. Only if that run succeeds are the restrictions lifted and everything is back to normal.
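
The policy described above can be sketched as a small state machine. This is only an illustration under stated assumptions, not the implementation used in any real CI server: the class GreenMainlinePolicy, its methods may_submit and on_build_finished, and the tickets and mails lists (stand-ins for an issue tracker and a mail gateway) are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class MainlineState(Enum):
    OPEN = "open"
    CLOSED = "closed"


@dataclass
class GreenMainlinePolicy:
    """Hypothetical in-memory model of the policy; not a real CI server API."""

    state: MainlineState = MainlineState.OPEN
    fixer: Optional[str] = None  # person/team allowed to submit while closed
    tickets: List[str] = field(default_factory=list)  # stand-in for an issue tracker
    mails: List[str] = field(default_factory=list)  # stand-in for a mail gateway

    def may_submit(self, submitter: str) -> bool:
        """Anyone may submit while mainline is open; only the fixer while closed."""
        return self.state is MainlineState.OPEN or submitter == self.fixer

    def on_build_finished(self, succeeded: bool, owner: str, failed_test: str = "") -> None:
        """React to a finished build and test run on mainline."""
        if succeeded:
            # A green run lifts all restrictions.
            self.state = MainlineState.OPEN
            self.fixer = None
            return
        if self.state is MainlineState.OPEN:
            # First red run: close mainline, open a ticket, notify the owner.
            self.state = MainlineState.CLOSED
            self.fixer = owner
            self.tickets.append(f"Mainline broken by failing test: {failed_test}")
            self.mails.append(f"To {owner}: please fix {failed_test}")
        # If mainline is already closed, it simply stays closed (no duplicate ticket).


policy = GreenMainlinePolicy()
policy.on_build_finished(succeeded=False, owner="team-a", failed_test="test_login")
print(policy.may_submit("team-b"))  # False: mainline is closed for everyone else
print(policy.may_submit("team-a"))  # True: the fixer may still submit
policy.on_build_finished(succeeded=True, owner="team-a")
print(policy.may_submit("team-b"))  # True: the green run reopened mainline
```

In this sketch a failed verification run leaves mainline closed and does not open a second ticket. That is a design choice of the example and not something the proposal prescribes.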

Instead of letting the CI server handle the policy, QA people or mainline owners could execute it manually. This depends on how business is usually done. The main point is to really close mainline for further changes. No one will be able to bring any change, except the fix, near mainline until the issue has been fixed. With that, the “Do not check in on a broken build” principle no longer depends only on people behaving properly but is enforced from outside the team, and is thus much harder to undermine.

Reality check. There is a project I know of which adopted this “Green Mainline Policy”. Although there was some resentment in the beginning, the policy proved to be of value. Not only do Sporadics get analyzed and (very often) fixed quickly; teams also adopted this policy for their team codelines as a rule to follow without process enforcement. The perceived quality has improved: quality feels better than before, when everyone was bothered by constantly broken mainline builds. Now quality issues become evident and the improvement in quality is visible to everyone: fewer and fewer broken mainline builds.

Invitation to further discussion


However, it was quite easy to type this down, and it might be tempting to leave it like this. But that wouldn’t be the right thing to do, for the proposal raises a lot of questions which need to be answered. I will come to them in a second. First I would like to discuss what this simple approach can do.

It puts a whole lot of urgency on fixing the issue which broke the build. As it stands, mainline is only reopened when a fix arrives. The issue, be it a regression or any other (d)effect, will not show up anymore, provided the fix has been a thorough one and does not, for example, just increase a timeout that proved to be too short. The lack of urgency for Sporadics is gone. There couldn’t possibly be more urgency than a stopped mainline produces. There are no excuses anymore for not fixing the issue. But does this change the attitude we talked about earlier? Probably not. It’s more an extrinsic motivation (one might call it pressure) than an intrinsic one to avoid or at least quickly remove Sporadics.

There are other questions popping up:
How long would mainline be closed if fixing the issue would take a day, or two, or even longer?
How would you verify the fix? Running the whole build and test pipeline again? Just verifying the failing test?
What if the failure could not be reproduced?

Much depends on the very nature of the failure. There are so many possible sources of failure in a test that a more differentiated answer is needed. It’s plain to see that the approach might not remain this simple. I will try to elaborate on this next time around.


Read also:
Part 1: Sporadics don't matter
Part 2: Let the CI server take care of these Sporadics

[1] Humble, Jez, and David Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation, Addison-Wesley, 6th printing 2012, p.66


The opinions expressed in this blog are my own views and not those of SAP
