Monday, October 27, 2014

On Coaching Agile: Killer Questions - What do you want to test?

When coaching teams or individuals in agile techniques (TDD, refactoring, clean code, CI, you name it), I've found that certain questions yield interesting effects. My absolute favorite is this one:

What do you want to test?

Meaning: what feature or behavior do you want to test?

One of the key takeaways of a training on agile techniques is the knowledge about and the application of the testing pyramid. Many teams I have trained had a test structure resembling an upside-down pyramid, a trapezoid standing on its short edge, or even a big letter T. In a training they learn about TDD and unit tests and are eager to write them for their own code too. But what seems rather easy in the training, with a small and well-designed sample application, tends to become overwhelmingly difficult back in their own code base, which is usually more complex and more ridden with legacy code issues.

Nevertheless they start out and want to write the first unit tests themselves. And here is where the trouble starts. How would you do that? Most of them lack experience with what a good unit test looks like. Which unit should you begin with? And which tests would make sense? If you have never tried to write unit tests for your production code, these questions are tough. Everything is so much more complicated than in end-to-end testing, where there is no need to understand the inner structure of your code and where design issues do not prevent you from testing.

The following is condensed from several sessions with developers trying to put a piece of their code under test. No single developer behaved exactly this way; I have taken the liberty of exaggerating the situation for educational purposes. But please be aware that every part of it is taken from real situations:

So, when I pair (actually: triple) with my developers, I ask them before writing a single line of code:

What do you want to test?

Most developers have no good answer to this question. They are able to explain the implementation down to its tiniest detail, but they fail to answer this rather simple question. That is why asking it is usually the beginning of a long discussion in which they tell me what actually happens in their code, which things get done and how. This is when I guide the discussion towards the structure of the code and the building blocks it consists of. We try to identify the parts that do not make heavy use of components outside the team's control; to make things a little easier, we begin with the components that depend only on components the team controls. Most of the time this initial discussion already brings about a better understanding of their own code and architecture, which helps a lot in the remainder of the session. Besides, as a coach I get to understand the code at a relatively high level of abstraction, which keeps me from getting lost in too much detail. I gain enough knowledge to understand the features and what tests there should be, and with that I'm able to guide them further down the road.

Once the component to put under test has been identified, I ask the question again:

What do you want to test?

Many times this is when I get the tour through the component and what it does one more time. Now I try to find a nice unit to start testing with. Sometimes they pick a unit to test themselves, and sometimes not. Whatever they pick, I try to start with it. It is important that they learn to decide for themselves and that they take the lead as much as possible. It is not my show. It is not about how I would proceed. My role is that of a facilitator lending a helping hand. When there is a unit to test, I ask one more time

What do you want to test?

Now it's about finding the method to test. Often the unit picked has one method that stands for what the whole unit is about; everything else is just private helper methods. Now I try to nail them down to a certain behavior of this unit, a certain method to test. Usually I get the same tour through algorithms and underlying layers again. And here it is important to get down to what a unit test is all about: testing a certain behavior, where a defined input produces an expected output. This is what they really have to understand here:

It is about this behavior. That's the reason we write unit tests: to secure a certain behavior of the unit.

It is not about testing all paths and all combinations. It is about testing a behavior that is directly derived from the acceptance criteria of a user story. Any code that is not covered by such acceptance criteria is useless, for it implements stuff no one ever asked for. When developers write their tests with the implementation in mind, they might get high code coverage, but what we really need is high feature coverage. By asking the question

What (feature/behavior) do you want to test?

one always stays in the outside-in perspective, which helps to keep the acceptance criteria in mind. With that, writing down the expected results of tests becomes pretty easy, for everything follows directly from the acceptance criteria.
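
To make this concrete, here is a minimal sketch of such a behavior-focused test in Python. The ShoppingCart class and the acceptance criteria it encodes (an empty cart totals zero; adding an item raises the total by its price) are invented for illustration:

    import unittest

    class ShoppingCart:
        """Hypothetical unit under test."""

        def __init__(self):
            self._items = []

        def add_item(self, name, price):
            self._items.append((name, price))

        def total(self):
            return sum(price for _, price in self._items)

    class ShoppingCartTest(unittest.TestCase):
        # Each test secures one behavior derived from an acceptance
        # criterion; the test names read like the criteria themselves.

        def test_empty_cart_totals_zero(self):
            self.assertEqual(0, ShoppingCart().total())

        def test_adding_an_item_raises_the_total_by_its_price(self):
            cart = ShoppingCart()
            cart.add_item("book", 25)
            self.assertEqual(25, cart.total())

    if __name__ == "__main__":
        unittest.main()

Note that nothing in these tests depends on how total() is implemented; they would survive any refactoring that keeps the behavior intact.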

So, to get to the point of which tests have to be written, I ask my question very frequently. It has proved useful not to let them go on the whole tour again, although they always try. Early interruption is required. But be careful: because this repetition easily becomes very annoying, it is wise to guide people through additional questions pointing them to the expected behavior or to the obvious negative tests that are needed. If the procedure does not seem to come to an end, you should propose something by asking: what if we try this and that? In the end we want to write unit tests. We don't want to drive the developers crazy.

Especially when we talk about legacy code, it is more about covering the existing code with negative tests, for these are usually difficult, if not impossible, to write as end-to-end tests. This adds value to the existing test base and yields a fast, positive return on investment by disclosing possible fatal errors that would cause the application to crash. There has never been a session without at least one of these. In most cases these tests are quite easy to write, and the bugs they disclose are often easy to fix. With that the developers have a successful first unit test session. In later sessions the very same technique can be used to bring about the positive tests as well.
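
A negative test of that kind might look like the following sketch. The parse_order function is a made-up stand-in for any legacy unit; the test pins down that malformed input must fail in a controlled way, and if the unit instead dies with some unexpected error, the test fails and has disclosed exactly the kind of crash mentioned above:

    import unittest

    def parse_order(raw):
        """Hypothetical legacy unit: parses an order line like 'book:2'."""
        name, quantity = raw.split(":")
        return name.strip(), int(quantity)

    class ParseOrderNegativeTest(unittest.TestCase):
        # Negative tests feed in the garbage that end-to-end tests
        # rarely cover and define what should happen instead of a crash.

        def test_line_without_separator_is_rejected(self):
            with self.assertRaises(ValueError):
                parse_order("no separator here")

        def test_non_numeric_quantity_is_rejected(self):
            with self.assertRaises(ValueError):
                parse_order("book:many")

    if __name__ == "__main__":
        unittest.main()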


A checklist to summarize:


Pro:
  • focuses on the feature/behavior instead of algorithms and lines of code
    • opens up your mind for new approaches
    • maybe the existing implementation could be improved?
  • forces you to divide the feature into portions to be tested
    • unveils the structure of the feature at hand from an outside-in point of view
    • uncovers possible negative test opportunities instead of focusing on the happy path
  • annoys and breaks indifference
  • like any question, forces the developers to come up with their own answers
    • the solution will be better accepted
Con:
  • easily becomes too annoying and causes resistance
    • guidance through hints or additional hint-like questions is required

Read also:

On Sporadics - How to deal with intermittent tests in continuous delivery


The opinions expressed in this blog are my own views and not those of SAP

Thursday, October 23, 2014

Interlude: A practical experience with quarantine

Sometimes one develops nice ideas or puts down some thoughts, and then reality kicks in and what was a nice theory comes under pressure from the events of daily life. Same for me. Recently I was talking about Sporadics and the attitude of shrugging them off. I was talking about people and the need to put them first and processes second. And I was talking about a quarantine to give people a chance to tackle difficult-to-solve issues without the need to close mainline. I also was talking about my ambivalent feelings towards a quarantine. Just recently these ambivalent feelings proved themselves valid, as a quarantine turned into an ever-growing bowl of bad apples and regressions entered mainline faster than one could count to five. What happened?

A quarantine has been set up with these rules:
  • any regression would be a possible reason to close mainline
  • if the regression turns out to be 
    • trivial then mainline will be closed until the fix arrives and proves to remove the regression
    • non-trivial and a fix would take some time and it is
      • a reproducible regression in a third party component then it will be quarantined for two weeks and mainline remains open
      • a reproducible regression in the application itself then it will be quarantined for two days and mainline remains open
    • a sporadic regression and
      • it showed up the first time then it will be observed as a known regression - no fix will be requested and mainline remains open
      • it showed up the second time then it will close mainline and an analysis will be requested; if the analysis shows the fix would be
        • trivial then mainline remains closed until the fix arrived and proved to remove the regression
        • non-trivial then it will be quarantined and mainline will be opened again
  • a quarantined regression that is not fixed before its time limit expires will close mainline
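
Just to get a feeling for how much decision logic hides in these few bullets, here is a rough sketch of the rule set as Python code. The names and the structure are my own invention, not an actual tool we ran, and note that the sporadic/non-trivial branch doesn't even say which time limit applies:

    from dataclasses import dataclass
    from enum import Enum

    class Verdict(Enum):
        CLOSE_MAINLINE = "close mainline until a proven fix arrives"
        QUARANTINE_TWO_WEEKS = "quarantine for two weeks, mainline stays open"
        QUARANTINE_TWO_DAYS = "quarantine for two days, mainline stays open"
        OBSERVE = "record as known regression, mainline stays open"

    @dataclass
    class Regression:
        is_sporadic: bool
        occurrences: int
        fix_is_trivial: bool
        in_third_party_component: bool

    def judge(regression):
        """Hypothetical encoding of the original quarantine rules."""
        if regression.is_sporadic:
            if regression.occurrences < 2:
                return Verdict.OBSERVE  # first sighting: just watch it
            # second sighting: mainline closes, analysis decides the rest
            if regression.fix_is_trivial:
                return Verdict.CLOSE_MAINLINE
            return Verdict.QUARANTINE_TWO_DAYS  # which limit? the rules don't say
        if regression.fix_is_trivial:
            return Verdict.CLOSE_MAINLINE
        if regression.in_third_party_component:
            return Verdict.QUARANTINE_TWO_WEEKS
        return Verdict.QUARANTINE_TWO_DAYS

And this sketch doesn't even cover the last bullet, the expiry of a quarantined regression, which needs its own bookkeeping.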

It turned out to be a hell of a job to enforce this <sarcasm>simple</sarcasm> set of rules and to convince people that removing regressions is their primary task and that disabling the test that unveils them is no solution, unless the test itself turns out to be buggy. That's why I need to touch the topic of quarantine once again.

The original idea would work if certain assumptions held true:
  • "... quarantine would serve as a to-do list of open issues to be worked on with highest priority"
  • quarantine rules will be strictly applied

If one of them does not hold, the whole concept is broken. I had further expressed my concerns about a quarantine:

"If there is a culture where these things are always overruled by something seemingly more important then a quarantine will just be another issue tracker with loads of issues never to be handled or like the to-do lists on dozens of desks which never will be worked on to the very end. It's just another container of things one should try to do when there is some time left."

And exactly this is what happened. It made me think: either the concept of quarantine is a dead end, or the way it was implemented was plain wrong. What are my observations?

First, the set of rules is non-trivial. It takes some time to understand which cases there are and whether a regression closes mainline, gets quarantined, or is merely observed. The rule set is hard to remember, both for the people who enforce it and for the people who need to fix regressions.

Second, if you consider a green mainline the goal to achieve, and if you accept that red pipeline runs require a thorough analysis of what went wrong, then the design of this quarantine has its flaws. It allows regressions to enter mainline in the case of a Sporadic occurring for the first time. Any deterministic regression is handled by closing mainline or at least being put under quarantine, while non-deterministic regressions can enter mainline.

Third, there was no tool to put regressions into quarantine and to report on them with every pipeline run so that newly failing tests would show up easily. Instead, all test failures were present and required analysis. Many pipeline runs were analyzed that contained only quarantined test failures, causing a lot of wasted work.

Fourth, the rule set wasn't enforced completely. Regressions that reached their maximum time in quarantine did not cause mainline to be closed. Instead they started littering pipeline runs, and together with the missing quarantine reporting it became even harder to find newly failing tests, and thus the regressions a single change introduced.

Obviously this quarantine has been a failure both in concept and in execution. It needs another configuration to make it work: something very simple and straightforward, so that everyone can memorize it and a tool can enforce it. The latter is the basic shift in my mind. Last time I argued against an automatism and for a human factor:
"Instead of having an uncompromisable automatism which would just close mainline until a proven fix arrived a human would be able to weigh different aspects against each other and to decide not to close mainline even if a Sporadic popped up"
And this was my fifth observation: it seems that, at least in an environment where Sporadics are shrugged off and features are rated higher than bug or regression fixes, this human factor invites people to debate the rules and to ask for exceptions, just to be able to do this or that very quickly. A human is tempted to grant these exceptions and in special cases does grant them. As experience shows, in many cases not sticking to the rules turns out to be the cause of a lot of trouble. In the case at hand, what was supposed to be a shortcut and a speedup turned out to be a mess and a slowdown. Additional regressions were introduced.

With that experience in mind, I try to recall what quarantine was all about. Which goals was it expected to achieve?

First of all, the baseline is the idea of continuous delivery, where release candidates get produced all the time. If a pipeline run fails, that run does not yield a release candidate. Continuous delivery allows for frequent releases of small increments of the application, thus getting fast feedback on new features instead of waiting for yearly or half-yearly releases. The increment to be shipped has to be free of regressions; otherwise it wouldn't be a release candidate, for the pipeline producing it would have been stopped when the error popped up.

So, if a quarantine gets introduced, it must make sure
  • the software to be shipped is free of regressions and
  • the time between two release candidates is as short as possible to allow for small increments

If a regression gets up to two weeks in quarantine, experience shows that it will stay there for pretty much exactly that time. Considering this, quarantining a regression for two weeks is not feasible, because it stretches the time between two release candidates to at least that amount of time. If within those two weeks one more regression of this sort gets quarantined, the time between two release candidates grows even longer. One regression every two weeks would make it impossible to produce a release candidate at all, although one regression every two weeks would be a pretty good rate. What's more, the increment would become quite large.

If a quarantine of two weeks is not possible, then waiting for fixes of third-party components is not possible either, unless the third-party component releases at a high frequency. Hence the application has to mitigate issues in any third-party library by not using the buggy feature at all, which may mean working around it or skipping the application's own feature if it depends on the very feature of the third-party component that turned out to be buggy. If this becomes the default mitigation in the application, a quarantine of two weeks isn't needed anymore. Whether the regression is located in the application or in a third-party component, the fix has to be made in the application itself. The team developing the application regains control over the application, which in itself is a nice thing to achieve.

So, the allowed time in quarantine needs to be short.

Another flaw in the above quarantine rule set is the loophole it leaves for regressions to enter mainline: Sporadics need to show up twice before they close mainline and thereby become candidates for quarantine. This needs to be dealt with. I'd propose to quarantine any regression, no matter whether it is deterministic or not. There would be two days to get it fixed. If the fix does not arrive in due time, the regression leaves quarantine and mainline is closed. No test protocols need to be scanned for known but not yet quarantined issues; the quarantine tool reports quarantined regressions properly, and any new test failure shows up directly. While mainline is closed, no other changes can be pushed to it. This is an optimistic approach, for it is based on the assumption that regressions get fixed with highest priority.

The new quarantine rule set would look like this:
  • any regression will be quarantined for up to two days
  • if the fix arrives within this time and proves to be valid, the failing test leaves quarantine
  • if the fix does not arrive in time, the failing test leaves quarantine and causes mainline to be closed
  • if the fix arrives later and proves to be valid, mainline is opened again

That's it. Simple and straightforward. No distinction between types of regressions. If required, an upper limit on the number of quarantined regressions could be added; once it is reached, mainline would be closed as well. But then the rules for reopening mainline get a bit more complex, for there are two conditions to consider. That's why I prefer not to have the upper limit in the rule set. The two-day limit is tight enough to put pressure on the resolution of regressions.

This rule set is simple enough for a tool to implement, thus removing the human factor from the decision whether or not to close mainline. The human factor should instead be used to decide on the proper limit for how long regressions may stay in quarantine, for this seems to depend on the current situation an application's development finds itself in. So in the end the CI server could take care of the regressions by enforcing a quarantine, although I wasn't fond of this idea earlier. But it would do so based on a simple rule set that does not require a huge and sophisticated tool, pipeline runs would still be reported as failures, and no release candidate would be produced as long as there are quarantined regressions.
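
As a sketch of how little tooling this actually needs, here is one possible shape of such a quarantine in Python. All names and the two-day limit are assumptions of mine rather than a description of an existing tool:

    from datetime import datetime, timedelta

    QUARANTINE_LIMIT = timedelta(days=2)  # the one knob a human still decides on

    class Quarantine:
        """Hypothetical minimal quarantine a CI server could enforce."""

        def __init__(self):
            self._entries = {}  # failing test name -> time of quarantining

        def add(self, test_name, now=None):
            # any regression gets quarantined, deterministic or sporadic
            self._entries[test_name] = now or datetime.now()

        def report_fix(self, test_name):
            # a proven fix removes the failing test from quarantine
            self._entries.pop(test_name, None)

        def mainline_open(self, now=None):
            # evaluated with every pipeline run: mainline is closed exactly
            # as long as at least one quarantined regression is overdue
            now = now or datetime.now()
            return all(now - since <= QUARANTINE_LIMIT
                       for since in self._entries.values())

As long as any entries remain, the pipeline runs would still be reported as failures and no release candidate would be produced; the quarantine only decides whether mainline stays open for further changes.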

I will report whether this approach works in reality.


Read also:

Part 1: Sporadics don't matter
Part 2: Let the CI server take care of these Sporadics
Part 3: Where do these Sporadics come from
Part 4: Tune up the speakers - Let Sporadics scream at you
Part 5: Process vs. People - Or: Processes won't fix your bugs
Part 6: To Quarantine or Not To Quarantine Contagious Tests


The opinions expressed in this blog are my own views and not those of SAP