My 2Cents on Agile: Continuous Integration

Thursday, April 7, 2016

Approaches to Distributed Development of a Software Product

Note: This article will be published as a series of installments. See the installment history at the end of the article to track changes.

Introduction

Recently I started thinking about a proper setup for a product developed by distributed teams. As it happens, the use of git was a prerequisite. When working with a Distributed Version Control System (DVCS) like git, teams are faced with the task of figuring out how to organize the source code they contribute to a larger system. As git is a powerful tool with loads of features and more than one way to do things, it offers both simple and rather complex solutions. In this post I want to explore some major approaches and compare them with each other. I admit that I am biased by concepts like Continuous Integration (CI) and Continuous Delivery (CD), which may influence my conclusions.

I consider this an experiment and will define an example product to set some constraints for the exploration. Some findings may be restricted to this setup, others may be more generally valid. However, none of the conclusions claims to be universal. The approaches investigated are taken from daily life. Every single one of them has crossed my path, and I consider all of them worth looking at. Any idea, however far-fetched, deserves at least an explanation of why it wouldn't be a good idea.

Sample Product

Let's assume a product of considerable size, say 1M LoC. Let's further assume the product consists of a number of large components, say 5-10, which themselves may be made of smaller components. The team setup follows the top level component structure by and large, although an individual may see the need for changes in several components. The development team consists of 50-500 people actually touching code. Finally, the product ships as a whole. There are no independent releases or patches of parts of it.

None of the components is intended to be reused by other products. Components are a reflection of the current architecture of the product.

The current dependency structure of our product looks like:

Components A, B and E are top level components forming the collection of services the product consists of. Component F forms a UI framework that components A and B plug into when it is present. F does not care about any components plugging into it. Components BA and BB are backends to A and B respectively. Component E is an extension to B.

Components are not necessarily identical with libraries, archives or any other sort of distributable artifact. Basically they are a logical structuring of the source code.

The product has a maintenance obligation with respect to already released versions.

What approaches could be applied to organize the source code of the given product? These are the ones that come to my mind:

Component Repositories
Topic Branch
Feature Branch
Trunk Based Development

Approaches

When exploring the different approaches I will try to shed some light on a bunch of questions that are far too often not considered. These questions touch several aspects of the software development life cycle. Think of questions like:

How do we get access to the component we depend on to make use of it in our component?
How do we make sure we get information about public API changes soon enough for us to incorporate them?
Should there be orchestrated schedules for component releases?
How do we handle splits/merges of components?
How do new top level components come to life?
How do top level components cease to exist?
How often do components ship new versions?
How do we make sure there will only be one version of each component used inside the product? Or how do we make sure components A and B are developed against the same version of component F?
When will integration testing be done?
How will it be done?
How does a component test itself in the context of the product? Or in the subcontext of its dependency tree?
How is the product being assembled?
Which component versions should be used?
How do component versions get managed?

The list may not be complete. It already holds some tough issues, though. What would be the answers when we work with separate Component Repositories?

Component Repositories

Our development team values highly decoupled components which interact via public APIs only. Any use of non-public APIs is prohibited. The developers understand the temptation of using non-public APIs for the sake of re-use and want to avoid it by hiding the sources of their components from other components as much as possible.

Our development team learned that a git repository for a large product worked on by many developers tends to become large, which increases the time to clone and fetch. I've seen such repositories exceed 4GB.

Idea: A small repository containing only one top level component worked on by 5-10 people seems to be a fair trade. The team could work in isolation. Their repository would not be littered with source code they do not own or are never likely to touch. Things are simple when it comes to developing their component.

Development

For a developer working on component F, which has no dependencies or only dependencies on 3rd party components, life would be rather easy in this world. Only the things one has to deal with directly are present. One could build the whole component including its tests in a quite focused way. As developers are free to add or remove sub components of their component, they would not feel much of a downside.

Not all of the teams are happy with that setup, though. While the team providing the top level component F is quite happy with this approach, the teams depending on them are not. Why is that?

In order to plug into component F, components A and B need to know about and have access to the currently valid public API, at least to be able to mock the dependency away in their tests and to use the right calls in their production code. However this is done in the particular language the product is built with, there has to be some sort of communication. Interfaces have to be provided either as files or as API documentation. Depending on the language these files have to be present during the build. At the latest when running integration tests of component A or B with component F, a real component F would be required.

This would add an obligation to every top level component team's responsibilities: there has to be some way of releasing the component that other teams could rely on for their development. They have to maintain a release schedule, and they have to actually release the component and make it available for the other teams to consume. Usually there will be a stable and a development version available. These versions could be used by components A and B for their development.

Integration

As long as component F publishes new versions on a regular basis there will be some sort of "continuous" integration available. Components A and B could make use of the latest version of component F and report bugs found in component F or fix their own usage of F accordingly. Depending on the release cycle of F the length of the feedback loop would stretch from rather short to pretty long. During the development phase this may not be a problem, but when the release date closes in it will almost certainly turn into an issue.

Real continuous integration would be hard to achieve. Even if component F published release candidates with every pipeline run, they would have to be verified by the dependent components before they could turn into released versions. Thus component F depends on the pipelines of each component up the dependency tree to verify successful usage of F, and of any component that uses F, and so forth. The verification pipeline for F will become pretty long, and if bugs are found it would have to start all over again.

What's more, if component A uses the development version of F to stay close to the newest features of F, it relies on these development versions actually being released before the release date of the product, as no development versions of F will be shipped with the released version of the product.

Another complication would be the possible divergence of the versions of F used by components A and B. To make sure they are actually using the very same version of F, there needs to be some governance enforcing this constraint.

Integration Testing

When teams are focused on developing their components they tend to consider any usage of their component the business of someone else. The integration of component A or B with component F will probably be tested, but the product as a whole will not. Who would be responsible for performing this assembly task with all its required testing?

The product in question is an assembly of exactly the right versions of all its components. Thus the product would be represented by a bill of materials (BOM) only. The product assembly would pull all the named versions of the components and perform the required packaging. What about the testing then? There would have to be a team that takes care of this assembly and the integration testing to make sure the BOM holds a valid and working combination of component versions. The assembly pipeline would have to run the integration tests of all components and eventually would have to provide additional integration tests on product level. This team would not develop any production code at all, which bears the risk of them not knowing about the features implemented. Dedicated communication would be required to make sure the assembly (or testing) team knows what to test for.

Another risk would be product level tests breaking due to changes in components. As the product level tests run in the product assembly pipeline, no component pipeline will run them and thus none will get feedback from them. It is the same as component A running integration tests with component F, which could find bugs in F outside the pipeline of F. At any of these points there will be feedback someone would have to communicate to the team of the component depended on. This feedback would have to trickle down the dependency tree with all the communication that comes along with that.

It would be best if the component could test itself in the context of the product within its own pipeline. To do that it would have to get access to the current BOM describing the product and to the product level tests. In order to run the product based tests it would have to build the product based on this BOM and replace itself with the component version under test. Component A's build and test process suddenly needs to know about the product and its assembly thus duplicating knowledge.

Another way would be to trigger the product assembly pipeline by replacing the version of component A in the BOM with the latest release candidate of A. If the product assembly pipeline succeeds the release candidate could be considered verified, it could be released and the BOM of the product could be changed accordingly. In this case the knowledge would not be duplicated but we would need a feedback loop from the product assembly pipeline back to the pipeline of component A. In order to get close to continuous integration any pipeline run of component A would include and wait for a pipeline run of the product assembly.
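
The BOM substitution described above can be sketched in a few lines. This is only an illustration under stated assumptions: the BOM is taken to be a plain mapping from component name to version, and the names `candidate_bom` and `verify_candidate`, the version strings and the `run_product_tests` callback (a stand-in for the product assembly pipeline) are hypothetical.

```python
from typing import Callable, Dict

Bom = Dict[str, str]  # component name -> version (assumed representation)


def candidate_bom(bom: Bom, component: str, candidate_version: str) -> Bom:
    """Return a copy of the BOM with one component swapped for its release candidate."""
    if component not in bom:
        raise KeyError(f"{component} is not part of the product BOM")
    updated = dict(bom)  # never mutate the released BOM
    updated[component] = candidate_version
    return updated


def verify_candidate(bom: Bom, component: str, candidate_version: str,
                     run_product_tests: Callable[[Bom], bool]) -> Bom:
    """Promote the candidate into the BOM only if the product level tests pass."""
    trial = candidate_bom(bom, component, candidate_version)
    return trial if run_product_tests(trial) else bom


released = {"A": "1.4.0", "B": "2.1.0", "F": "3.0.0"}
# Stand-in for the product assembly pipeline: here it simply accepts every BOM.
promoted = verify_candidate(released, "A", "1.5.0-rc1", lambda bom: True)
print(promoted)
```

The feedback loop mentioned above corresponds to the return value: the pipeline of component A only learns whether its candidate was accepted once the product assembly run has finished.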

Release

As said before, the product is represented as a bill of materials (BOM) containing the proper components and their versions. At this point uniqueness of component versions could be enforced, e.g. the version of component F to be used. Following the rule of one repository per top level component, the product as the topmost component will reside in its own repository along with the product level tests.

Releasing would include collecting all released versions of the components making up the product and running the product level tests, as there would be no other tests available. If the versions of, let's say, component B and component F do not fit together because component B was using a different version of F in its integration testing, there would be the risk that the product level tests do not discover this mismatch, as they will not redo the testing done at component B's level. To avoid this, means of mitigation would have to be provided, such as making a component's integration tests available to the components using it, or giving a component access to the BOM of the product so that it uses the proper versions of all components it depends on.
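
A version check of this kind could look roughly like the following sketch. It assumes that every component records which dependency versions it ran its integration tests against; the function name `find_version_mismatches`, the data layout and the version numbers are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

Bom = Dict[str, str]  # component name -> version (assumed representation)
Mismatch = Tuple[str, str, str, Optional[str]]


def find_version_mismatches(bom: Bom, tested_against: Dict[str, Bom]) -> List[Mismatch]:
    """List (component, dependency, tested version, BOM version) for every mismatch."""
    mismatches: List[Mismatch] = []
    for component, deps in sorted(tested_against.items()):
        for dependency, tested_version in sorted(deps.items()):
            bom_version = bom.get(dependency)  # None if the BOM lacks the dependency
            if bom_version != tested_version:
                mismatches.append((component, dependency, tested_version, bom_version))
    return mismatches


bom = {"A": "1.4.0", "B": "2.1.0", "F": "3.0.0"}
# What each component claims to have used in its own integration tests.
tested_against = {"A": {"F": "3.0.0"}, "B": {"F": "2.9.0"}}
for component, dep, tested, shipped in find_version_mismatches(bom, tested_against):
    print(f"{component} tested against {dep} {tested}, but the BOM ships {dep} {shipped}")
```

In this example only component B is reported, because it tested against a version of F that differs from the one named in the BOM.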

This would introduce yet another set of communications required to mitigate issues induced by the general approach of Component Repositories.

Refactoring

As long as refactoring takes place within the boundaries of a top-level component things are fine. When it comes to structural refactorings of the product, i.e. introduction of new top-level components, removal of top-level components, it becomes cumbersome.

If there will be a new service C we could just create a new component repository and start working on C, adding it to the integration like the other components.

If an existing component ceases to exist in a newer product version, we could not just get rid of it as long as the maintenance obligation exists. There will be a legacy component around in a repository no one works on full-time any more. This usually causes the component to rot, for no one will take responsibility for it. It is just too far out of sight.

Component Repositories make it hard to factor out new components. Consider a part of component A that would be useful for component B as well. How would B get hold of it? The cost of introducing a new component repository for the new reusable component would be quite high. So copying the code and adding it to component B's repository seems reasonable, especially as long as one is still wondering whether this part really is reusable by B. If it is reusable and someone really wants to avoid the code duplication and open up a new component repository, who would be responsible for it? The newly found component would not be a top-level component, so there would be no dedicated team apart from the one for component A. Would this team be responsible for two repositories now?

Summary

The Component Repositories approach has its advantages when it comes to the development of a leaf component. As soon as interaction with other components due to dependencies is involved, things get messy. Components suddenly need release management and version governance to make sure every component is on the same page. The product assembly part especially will become a matter of discussion, for no component development team will take responsibility for this integration level. A product assembly team would have to deal with that and would have to take care of product level testing itself.

Communication would be key in this approach. Whether it is done by introducing additional automation to connect component repository pipelines with each other or by human interaction, it adds complexity and the "one has to think of it" sort of things which tend not to be thought of.

As long as a component is not a real deliverable in its own right, i.e. one that is used outside of the product or patched individually, I would consider this approach not practicable.

Approach	Component Repositories
Development (leaf component)
Development (non-leaf component)
Integration
Release
Refactoring
Organizational Complexity

Conclusion

As I've considered only one approach so far, the only conclusion I can offer is that I would not like to go for component repositories, although I do not know a better alternative for now.

Change History

This is the first installment.

Thursday, October 23, 2014

Interlude: A practical experience with quarantine

Sometimes one develops nice ideas or puts down some thoughts, and then reality kicks in and what has been a nice theory comes under pressure from the events of daily life. Same for me. Recently I was talking about Sporadics and the attitude of shrugging them off. I was talking about people and the need to put them first and processes second. And I was talking about a quarantine to give people a chance to tackle difficult issues without the need to close mainline. I was also talking about my ambivalent feelings towards a quarantine. Just recently these ambivalent feelings proved valid as a quarantine turned into an ever growing bowl of bad apples and regressions entered mainline faster than one could count to five. What happened?

A quarantine has been set up with these rules:

any regression would be a possible reason to close mainline
if the regression turns out to be
    trivial, then mainline will be closed until the fix arrives and proves to remove the regression
    non-trivial, so that a fix would take some time, and it is
        a reproducible regression in a third party component, then it will be quarantined for two weeks and mainline remains open
        a reproducible regression in the application itself, then it will be quarantined for two days and mainline remains open
    a sporadic regression and
        it showed up the first time, then it will be observed as a known regression - no fix will be requested and mainline remains open
        it showed up the second time, then it will close mainline and an analysis will be requested; if the analysis shows the fix would be
            trivial, then mainline remains closed until the fix arrives and proves to remove the regression
            non-trivial, then it will be quarantined and mainline will be opened again
a quarantined regression that is not fixed before its time limit is exceeded will close mainline
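
Written down as a decision function, the rule set looks roughly like the sketch below. It reflects one reading of the rules above (the sporadic case is treated as a branch of its own next to trivial and non-trivial), and all names such as `Regression`, `Action` and `decide` are invented for this illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    CLOSE_MAINLINE = "close mainline until the fix is proven"
    QUARANTINE_TWO_WEEKS = "quarantine for two weeks, mainline stays open"
    QUARANTINE_TWO_DAYS = "quarantine for two days, mainline stays open"
    OBSERVE = "observe as known regression, mainline stays open"
    QUARANTINE = "quarantine and reopen mainline"


@dataclass
class Regression:
    sporadic: bool
    trivial_fix: bool
    third_party: bool = False
    occurrences: int = 1
    quarantine_expired: bool = False


def decide(r: Regression) -> Action:
    """Map a regression to the action the original rule set asks for."""
    if r.quarantine_expired:
        return Action.CLOSE_MAINLINE
    if r.sporadic:
        if r.occurrences < 2:
            return Action.OBSERVE
        # Second occurrence: mainline closes for the analysis, whose result
        # decides whether it stays closed or the test gets quarantined.
        return Action.CLOSE_MAINLINE if r.trivial_fix else Action.QUARANTINE
    if r.trivial_fix:
        return Action.CLOSE_MAINLINE
    return Action.QUARANTINE_TWO_WEEKS if r.third_party else Action.QUARANTINE_TWO_DAYS


print(decide(Regression(sporadic=True, trivial_fix=False)).value)
```

Even in this compact form there are five inputs and five possible outcomes to keep in mind.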

It turned out to be a hell of a job to enforce this <sarcasm>simple</sarcasm> set of rules and to convince people to consider the removal of regressions their primary task, and that disabling the test that unveils them isn't the solution unless the test itself turns out to be buggy. That's why I need to touch on the topic of quarantine once again.

The original idea would work if certain assumptions were true:

"... quarantine would serve as a to-do list of open issues to be worked on with highest priority"
quarantine rules will be strictly applied

If one of them does not hold the whole concept would be broken. I further expressed my concerns about a quarantine:

"If there is a culture where these things are always overruled by something seemingly more important then a quarantine will just be another issue tracker with loads of issues never to be handled or like the to-do lists on dozens of desks which never will be worked on to the very end. It's just another container of things one should try to do when there is some time left."

And exactly this is what happened. And it made me think. Either the concept of quarantine was a dead end or the way it has been incorporated was plain wrong. What are my observations?

First, the set of rules is non-trivial. It takes some time to understand which cases are possible and whether a regression would close mainline, be quarantined or just be observed. This rule set is hard to remember for the people who have to enforce it as well as for the people who need to fix regressions.

Second, if you consider a green mainline the goal to achieve and if you accept the fact that red pipeline runs require a thorough analysis of what went wrong then the design of this quarantine has its flaws. It would allow for red pipeline runs to enter mainline in the case of a Sporadic occurring the first time. Any deterministic regression gets handled by closing mainline or at least putting it under quarantine while non-deterministic regressions could enter mainline.

Third, there has been no tool to put regressions in quarantine and to report on them with every pipeline run so that newly failing tests show up easily. Instead all test failures were present and required analysis. Many pipeline runs were analyzed that contained only quarantined test failures, causing a lot of wasted work.

Fourth, the rule set wasn't enforced completely. Regressions that reached their maximum time in quarantine did not cause mainline to be closed. Instead they started littering pipeline runs, and along with the missing quarantine reporting it became even harder to find newly failing tests and thus the regressions a single change introduced.
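
The kind of reporting that was missing (see the third observation) does not need much. A minimal sketch, assuming the failed tests of a pipeline run and the quarantined tests are both available as plain test names; `split_failures` and the test names are hypothetical:

```python
from typing import Iterable, List, Set, Tuple


def split_failures(failed: Iterable[str], quarantined: Set[str]) -> Tuple[List[str], List[str]]:
    """Separate failures that need attention from those already in quarantine."""
    failed_set = set(failed)
    new_failures = sorted(failed_set - quarantined)
    known_failures = sorted(failed_set & quarantined)
    return new_failures, known_failures


quarantined = {"test_login_timeout", "test_export_pdf"}
failed_in_run = ["test_export_pdf", "test_cart_total"]
new, known = split_failures(failed_in_run, quarantined)
print("needs analysis:", new)
print("quarantined:", known)
```

A run that yields an empty "needs analysis" list contains only known failures and does not have to be analyzed again.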

Obviously this quarantine has been a failure both in concept and in execution. There needs to be another configuration of it to make it work. There needs to be something very simple and straightforward so that everyone can memorize it and a tool can enforce it. And the latter would be the basic shift in my mind. Last time I was talking about not having an automatism but adding a human factor:

"Instead of having an uncompromisable automatism which would just close mainline until a proven fix arrived a human would be able to weigh different aspects against each other and to decide not to close mainline even if a Sporadic popped up"

And this was my fifth observation. It seems that, at least in an environment where Sporadics are being shrugged off and features are rated higher than bug or regression fixes, this human factor invites people to discuss the rules and to ask for exceptions from them just to be able to do this and that very quickly. A human would be tempted to grant these exceptions and in special cases does grant them. As experience shows, in many cases not sticking to the rules turns out to be the cause of a lot of trouble. In the case at hand what was supposed to be a shortcut and a speedup turned out to be a mess and a slowdown. Additional regressions were introduced.

With that experience in mind I try to remember what quarantine was all about. Which goals were expected to be achieved?

First of all, the baseline of it all is the idea of continuous delivery, where release candidates get produced all the time. If a certain pipeline run fails, this run does not yield a release candidate. Continuous delivery allows for frequent releases of small increments of the application, thus getting fast feedback on any new features instead of waiting for yearly or half-yearly releases. The increment to be shipped has to be free of regressions. Otherwise it wouldn't be a release candidate, for the pipeline producing it would have been stopped when the error popped up.

So, if a quarantine gets introduced it must make sure

the software to be shipped is free of regressions and
the time between two release candidates is as short as possible to allow for small increments

If a regression gets up to two weeks in quarantine, then experience shows that it will stay there for pretty much exactly this time. Considering this, it would not be possible to quarantine a regression for two weeks because this would stretch the time between two release candidates to at least this amount of time. If within these two weeks one more regression of this sort gets quarantined, the time between two release candidates would grow even longer. One regression every two weeks would make it impossible to produce a release candidate at all, although one regression every two weeks would be a pretty good rate. What's more, the increment would be quite large.

If a quarantine of two weeks is not possible, then waiting for fixes of 3rd party components is not possible any more either, unless the 3rd party component comes with releases at a high frequency. Hence the application has to mitigate issues in any 3rd party library by not using the buggy feature at all, which may mean working around it or skipping its own feature if that depends on the very feature of the 3rd party component which turned out to be buggy. If this becomes the default mitigation in the application, a quarantine of two weeks isn't needed anymore. Independent of the regression being located in the application or in a 3rd party component, the fix has to be made in the application itself. The team developing the application would regain control over the application, which in itself would be a nice thing to achieve.

So, allowed time in quarantine needs to be short.
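
The arithmetic behind this can be checked with a small sketch. It assumes, as described above, that every regression stays in quarantine for the full time limit and that no release candidate is produced while the quarantine is non-empty; the function name `blocked_intervals` and the arrival days are made up.

```python
from typing import List, Tuple


def blocked_intervals(arrival_days: List[int], quarantine_days: int) -> List[Tuple[int, int]]:
    """Merge the [arrival, arrival + quarantine_days) spans during which quarantine is non-empty."""
    merged: List[Tuple[int, int]] = []
    for start in sorted(arrival_days):
        end = start + quarantine_days
        if merged and start <= merged[-1][1]:
            # Overlaps or directly follows the previous span: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


arrivals = [0, 14, 28, 42]  # one regression every two weeks
print(blocked_intervals(arrivals, 14))  # two week limit
print(blocked_intervals(arrivals, 2))   # two day limit
```

With a two week limit the four spans merge into one blocked period from day 0 to day 56, so there is no point at which a release candidate could be produced. With a two day limit the same regressions block only four short windows.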

Another flaw in the above quarantine rule set has been the loophole for regressions to enter mainline. Sporadics need to show up twice before they cause mainline to be closed and with that open up the possibility to quarantine the regression; this needs to be dealt with. I'd propose to quarantine any regression, no matter whether it is deterministic or not. There would be two days to get it fixed. If the fix does not arrive in due time, the regression will leave quarantine and mainline will be closed. No test protocols need to be scanned for known but not yet quarantined issues. The quarantine tool would report quarantined regressions properly and any new test failure would show up directly. While mainline is closed no other changes could be pushed to it. This would be an optimistic approach, for it is based on the assumption that regressions get fixed with highest priority.

The new quarantine rule set would look like this:

any regression will be quarantined for up to two days
if the fix arrives within this time and proves to be valid, then the failing test will leave quarantine
if the fix does not arrive, then the failing test leaves quarantine and causes mainline to be closed
if the fix then arrives and proves to be valid, mainline will be opened again

That's it. Simple and straightforward. No distinction between types of regressions. If required, an upper limit for the number of regressions in quarantine could be added. If this limit has been reached, mainline would be closed as well. But the rules to open mainline would get a bit more complex, for there would be two conditions to consider. That's why I would prefer not to have the upper limit of quarantined regressions in the rule set. The two day limit is tight enough to put pressure on the resolution of regressions.

This rule set is simple enough that a tool could implement it, thus removing the human factor from the decision whether or not to close mainline. The human factor should be used to decide upon the proper limits for regressions to stay in quarantine, for this seems to depend on the current situation an application development finds itself in. So in the end the CI server could take care of the regressions by enforcing a quarantine, although I wasn't fond of this idea earlier. But it would do so based on a simple rule set which does not require a huge and sophisticated tool, and the pipeline runs would still be reported as failures and no release candidate would be produced as long as there are quarantined regressions.
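
To illustrate how little such a tool would have to do, here is a minimal sketch of the new rule set. It is not taken from any existing CI server; the class `Quarantine`, its method names and the test name are assumptions made for this example, and a "proven fix" is simply reported from outside.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Set


@dataclass
class Quarantine:
    limit: timedelta = timedelta(days=2)
    entered: Dict[str, datetime] = field(default_factory=dict)  # quarantined tests
    overdue: Set[str] = field(default_factory=set)  # left quarantine without a fix

    def report_regression(self, test: str, now: datetime) -> None:
        # Every regression is quarantined, deterministic or not.
        if test not in self.overdue:
            self.entered.setdefault(test, now)

    def report_fix(self, test: str) -> None:
        # A proven fix removes the test from quarantine or unblocks mainline.
        self.entered.pop(test, None)
        self.overdue.discard(test)

    def tick(self, now: datetime) -> None:
        # Tests beyond the time limit leave quarantine and close mainline.
        for test, since in list(self.entered.items()):
            if now - since > self.limit:
                del self.entered[test]
                self.overdue.add(test)

    def mainline_open(self) -> bool:
        return not self.overdue

    def release_candidate_allowed(self) -> bool:
        return not self.entered and not self.overdue


q = Quarantine()
start = datetime(2014, 10, 23, 9, 0)
q.report_regression("test_checkout", start)
q.tick(start + timedelta(days=1))
print(q.mainline_open(), q.release_candidate_allowed())
q.tick(start + timedelta(days=3))
print(q.mainline_open())
q.report_fix("test_checkout")
print(q.mainline_open(), q.release_candidate_allowed())
```

The only value a human has to choose is `limit`, which matches the idea of keeping the human factor to deciding the time limit.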

I will report whether this approach works in reality.

Read also:

Part 1: Sporadics don't matter
Part 2: Let the CI server take care of these Sporadics
Part 3: Where do these Sporadics come from
Part 4: Tune up the speakers - Let Sporadics scream at you
Part 5: Process vs. People - Or: Processes won't fix you bugs
Part 6: To Quarantine or Not To Quarantine Contagious Tests

The opinions expressed in this blog are my own views and not those of SAP

Sunday, September 28, 2014

To Quarantine or Not To Quarantine Contagious Tests

Something is rotten in (the state of) your code base and spreads like a disease. Where there was only one, soon enough there will be many. They are contagious as hell - sporadically failing tests in already secured code. Regressions, that is.

In my last post I extended the Green Mainline Policy by a human factor which might help to bring the Green Mainline Policy forward in environments that are ridden with bugs and regressions and which want to find a way out of it. With this human touch comes the freedom of choice. Instead of having an uncompromisable automatism which would just close mainline until a proven fix arrived a human would be able to weigh different aspects against each other and to decide not to close mainline even if a Sporadic popped up. I also talked about the fact that in cases like this at least an initial analysis would be requested which would be the base for the decision. What could this decision look like? What options does the mainline owner have?

Well obviously mainline could be closed until a fix arrives. This would work if the cause of the test failure could be fixed easily. Mainline wouldn't be closed for long.

The second option would be to accept the fact that a fix would take fairly long, too long to keep mainline closed all that time and thus hinder other developers or teams from bringing new changes to it. Acceptance of the fact alone wouldn't be enough, for the test still fails and will do so with every new build pipeline run. Every subsequent run will or may show this error, depending on it being a constant regression or a Sporadic. Either way this test failure will make it hard to assess the run quickly, for in case the run fails it is not obvious whether it fails due to the "known" issue or a not yet known one. A closer analysis would be required. In this situation there needs to be a mechanism that ensures the fix comes as quickly as possible while it makes sure the "known" issue does not shadow other failures. The mechanism has to make sure that a failure in a run really is a failure that needs attention paid to it.

The mechanism is known as Quarantine. I'm not the first to talk about a quarantine. Martin Fowler did cover this briefly in [1] and there are CI servers around that support a quarantine out of the box and I'm sure there are plenty of articles on this topic around. So, what's my take on this?

My feelings towards a quarantine are ambivalent. On the one hand I'd welcome it as a means to separate intermittent tests from the ones that provide valuable feedback. On the other hand I suspect a quarantine would easily turn into a collection of failing tests that just keeps growing without anyone taking any action on them. Because of that I would prefer a quarantine to be guarded with strict rules in order for it to be of constant value.

The basic idea of Green Mainline Policy is that code in mainline is and stays regression free, for it represents the state of the code that has already been shipped to customers or will be shipped with the next release. Keeping it from showing regressions, including and especially sporadic ones, is the major task. Usually this would be done while mainline is closed. If this is not possible, a regression could be quarantined in very special cases.

I'd see these four reasons: The regression is caused by

a bug in a test that would need to undergo a major refactoring to fix it, which takes some time
a bug in the application code itself which would need to undergo a major refactoring to fix it, which takes some time
a bug in an external component for which the fix takes some time to arrive, if it arrives at all
a hard to analyze sporadically failing test which would need to undergo thorough analysis before it could be classified as one of the first three cases

In these cases quarantine would serve as a to-do list of open issues to be worked on with highest priority. Any deterministic regression should fall in one of the first three categories.

However, quarantine will always be the broken window. It holds the failing tests that are not supposed to be there at all. What once were useful tests that guarded the code base have turned into a useless mess with no other feedback than the mere fact that they have become of no value. They cannot be trusted anymore.

Once a quarantine exists there will be the question: Why not add just one more intermittent test? Just one. Really! That is the crucial question of Green Mainline Policy, for once you have opened the door for one, others might slip in as well. In my opinion a strict rule set and its strict application to each and every regression without any exception will be the only way to keep this under control. Otherwise you would start to hide failing tests in plain sight.

But it is not sufficient to have a rule set for getting into quarantine. You would also need a clear rule for getting out of it. The first way out of quarantine would be a fix. When the fix arrives the previously failing test will be removed from quarantine. For deterministic regressions it's a matter of priorities. If fixing them is highest priority the corresponding tests will leave quarantine soon enough. It gets interesting if no fix is available, which could have several reasons:

the priorities for analyzing and fixing complicated issues are not high enough
fixing really takes long for a major issue has been detected which requires extended efforts to get fixed
the analysis of a sporadically failing test takes very long or seems not feasible at all
the external component provider does not come up with the required fixes

The first of these reasons could be mitigated by introducing a timeout or a maximum size limit for the quarantine. If the timeout is reached, the respective test moves out of quarantine and, if it is still failing, mainline will be closed. The second one appears to be a candidate for the maximum size limit. If the limit is reached, the next test to be moved into quarantine would close mainline, for at least one other test has to move out of quarantine first, which by definition only succeeds if a fix is available. These two mechanisms would work as a priority manager. They would raise the priority of fixes automatically.
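To make the two guards a bit more tangible, here is a minimal Python sketch of a quarantine with a timeout and a size limit. Everything in it is an assumption made for illustration: the class name `Quarantine`, the helper `enforce_timeouts`, and the default values of 5 tests and 14 days do not come from any real CI system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Quarantine:
    """Hypothetical quarantine guarded by a timeout and a size limit."""
    max_size: int = 5                        # assumed value, pick your own
    timeout: timedelta = timedelta(days=14)  # assumed value, pick your own
    entries: dict = field(default_factory=dict)  # test name -> time of entry

    def admit(self, test: str, now: datetime) -> bool:
        """Quarantine a test. False means the limit is hit: close mainline."""
        if test in self.entries:
            return True
        if len(self.entries) >= self.max_size:
            return False
        self.entries[test] = now
        return True

    def expired(self, now: datetime) -> list:
        """Names of tests that have stayed in quarantine for the full timeout."""
        return sorted(name for name, since in self.entries.items()
                      if now - since >= self.timeout)

    def release(self, test: str) -> None:
        """Take a test out of quarantine (it was fixed or it timed out)."""
        self.entries.pop(test, None)


def enforce_timeouts(quarantine: Quarantine, now: datetime, still_failing: set) -> bool:
    """Release all expired tests. Returns True if mainline has to be closed,
    which is the case when a released test is still failing."""
    must_close = False
    for name in quarantine.expired(now):
        quarantine.release(name)
        if name in still_failing:
            must_close = True
    return must_close
```

A nightly job could call `enforce_timeouts` once per run, and the CI server could call `admit` whenever someone asks to quarantine a test. Either call can end with mainline being closed, which is exactly the automatic rise in priority described above.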

And what about the other two reasons? The Sporadic and the issue with the external component? It is not predictable when or if a fix will be available in due time. How long would you wait for a fix? It does not make any sense to wait forever. Release date will come. Would you ship software with known regressions or issues with an external component? Probably not. So, they have to be dealt with. If quarantine is not empty at release date then the product would have to be shipped with limitations. Timeout or maximum size do not help with that. At some point in time, late enough to remain optimistic and early enough to mitigate the bugs that will not get fixed, you need to decide whether or not to wait for the fixes any longer. Then the time has come to apply other strategies. In the end you need to decide between:

finding a workaround to avoid the bug (if possible)
disabling the feature if no workaround exists - one couldn't use it anyway (deliver with limitation, might be solved by a later patch)
going back to an earlier version of the 3rd party component (if possible)

It might prove difficult to hit the best point in time for this decision. And here is the bad news about a quarantine: It moves the need to handle regressions into the future. It might be a not-so-far-away future but it still has the consequence that regressions do not get handled when they occur. When you're striving for continuous delivery then this would break your neck, for you would have to ship a product with known regressions or, if you are not prepared to do that, you wouldn't ship for a long time. Neither of which complies with the continuous delivery idea at all.

So, what's my conclusion then? I'm convinced that a quarantine could be a useful tool. But it depends. The tool itself will not fix any bugs for you. We've talked about this earlier. If there is a culture where regressions and Sporadics are considered evil and highest priority then this tool could help cleaning them up by making them visible. If there is a culture where these things are always overruled by something seemingly more important then a quarantine will just be another issue tracker with loads of issues never to be handled, or like the to-do lists on dozens of desks which never will be worked on to the very end. It's just another container of things one should try to do when there is some time left. It's the people again and the culture they live up to that make the difference.

Me personally I'd rather not have a quarantine. It's too many rules to know, to follow, and to think about. But I understand that sometimes you would need it to reach the level of maturity not to need it anymore.

Sunday, September 14, 2014

Process vs people – Or: Processes won't fix you bugs

In the last post I proposed a technical solution, a process to raise the urgency of failing tests in mainline production. By closing mainline for changes when the build is broken the “Green Mainline Policy” is able to put so much pressure on developers that

“There are no excuses for not fixing the issue anymore. But does this change the attitude we talked about earlier? Probably not.”

It is obvious that this proposal, so far, could work as a technical emergency procedure only. Nothing has been said about or done to reduce the occurrence of Sporadics in the first place. Thus up to now there is only a mitigation strategy for the second order problem of Sporadics showing up in mainline. This second order problem could be attacked by a technical solution. In software development we are very good at technical solutions. We do not embark on stupid tasks; we would rather write a script to do them for us. And it is even one of the agile principles to automate as much as possible to get rid of recurring secondary tasks that would keep us from coding new features. What we are usually not so good at is the non-technical part of it all. Attitude is non-technical. And changing attitudes is not scriptable. Changing the “Sporadics don't matter” attitude or the habit of ignoring them would be to tackle the root cause of the issue. As usual attacking the root cause is much more complicated than alleviating symptoms. (A nice blog on this could be found at http://www.false-summits.com/?tag=sopd)

What is missing to make the “Green Mainline Policy” a holistic approach that tackles the root cause?

To present you with the typical agile coach reply: It depends.

Suppose a project that suffers heavily from Sporadics. Introducing the “Green Mainline Policy” and enforcing it in this strict manner would only add to the multitude of pressures from all directions developers are faced with. The developers have to be part of the equation to make this policy a success. So, it is time to think about the people involved. We want them to understand Sporadics are a mess and need to be fixed quickly. And we want them to prevent those Sporadics from showing up in the first place. We want them to change their attitude, their usual behavioral pattern to make this policy a success.

And this is where it starts to get tricky. There is no simple answer to that, because we are talking about people. People are different. People are special. When dealing with people one has to deal with gurus, techies, quality-guys, process-lovers and process-haters, people fond of this and people fond of that (preferably the very opposite). There are people that want to change the world and people that want to make a living. All of them have their reasons. And all of them are valid and respectable. When we are about to enforce new policies or processes we need to understand that people act in (at least) two roles: the Person and the Developer. The change in processes affects the Developer, who has to adapt to new policies or processes. As a person I might not like these policies. As a developer I would have the obligation to stick to them, otherwise my job may be at risk.

Bringing a “Green Mainline Policy” to a project that is riddled with Sporadics means bringing about a transformation that takes time. We want to change attitudes and habits of experienced professionals, some of whom may have been on the job for a few years, others for a dozen or even more. They are used to what they always did, what they learned back then or in between. Every single person needs to be picked up from where they are in terms of knowledge and in terms of the way they usually work.

There might be a number of people that would welcome the policy for it supports the way they always wanted to work and probably did in the past. But it is likely that the majority of people will have reservations. My experience as an agile coach shows that most of them are born out of lack of knowledge how one could do otherwise. Many times in a training when telling people about test isolation, testing pyramid and things alike I heard sentences like these:

“Why didn't anyone tell me this before?”

“Wow! It's awesome!”

What I've seen in my life as an agile coach is a huge amount of end-to-end tests and a lack if not absence of any other type of test. I consider this the main source for Sporadics. In my opinion the agile techniques like test isolation and the knowledge about the testing pyramid as well as the SOLID principles would be a foundation for developers to build on. These techniques give them the instruments to stabilize their tests, to improve the code they write, and to get rid of the superfluous end-to-end tests. The improved code will be better testable which would bring about even more stable and quicker tests.

Teaching would be one aspect. Coaching is the other. My experience shows that a training is a nice place to learn about the techniques and even to practice them, but there always will be the real code that does not support the newly learned techniques just like that. It proved to be a good idea to have experts around that understand the techniques and that know the code base, for they are able to support the developers in bringing the new knowledge to the old code base. Only if developers get support for an individual case in their own code will they gain the insights they really need. This support will help to counter the:

“This might work in area XYZ but in my area things are different. It won't work at all.”

Don't ask me how often I heard that one. But I did not tell this to make jokes about it. Someone who said this needs support to find a way how the agile techniques may work for her. When such an obstacle could be overcome this would convince people the most.

So, I'd like to extend the “Green Mainline Policy”:

Invest in the people by teaching them agile techniques
Make sure to provide ongoing coaching resources and support in individual cases by experts that know the code base and the “environment” as well as the agile techniques
Invest in cleaning up the code, make it more testable
Invest in cleaning up the tests, stabilize them

It is obvious that these things will take time. That's why I would propose to go for a more human handling of the “Green Mainline Policy” emergency process. There should be owners of the mainline that function as gatekeepers. They would decide when to close mainline for changes. When mainline breaks they would have a look at the failure and classify it. If it qualifies for instant fixing then this will be requested and mainline will be closed. Sometimes the fix may not come that easily; then at least a root cause analysis should be requested, and mainline will be opened when this analysis has proved there is no easy fix. And so on. It depends on the situation you find yourself in. Sometimes even the strict CI server solution would work fine for you right from the start. In all other cases the reins need to be tightened in accordance with the progress in education and code cleanup. Things need to be improved incrementally. Any small step will make it better than before.

Yes, this is no easy solution. It will take time to implement. But it took quite a while to make the mess. Cleaning up needs its time too. Next time I will have a closer look at the types of Sporadics. Knowing these types will pave the way to propose solutions how to avoid them.

Saturday, September 6, 2014

Tune up the Speakers – Let Sporadics Scream at you

In my last post on this topic I pointed out that the main reason for Sporadics to creep into your test base would be the lack of urgency to prevent them from doing so. These days developers find themselves confronted with competing requirements and fixing Sporadics is one of the least pressing ones. Thus making them more pressing would be required to move them up the queue. And we’ve talked about a mind shift that would be required to pay more attention to so-called Sporadics, to get them analyzed and out of the test base. Just putting more pressure on them from a technical side would not help. In my first post of this series I was complaining about the it’s-just-a-sporadic attitude which contributes to the decrease of urgency. This has to go too.

So, what could be done? The first step would be to make the Sporadics ring in your ear each time they would show up.

Actually there are two solutions, an informal one and a technically supported one, sharing a common thought. One works fine for smaller teams or projects, the other may be required for larger teams or projects. Both solutions work with these principles in mind:

“There are no Sporadics. There are failing tests only”

Make people understand this. There are no second class failures that could be somehow ignored or paid less attention to. Every test failure indicates an issue to be solved as pointed out earlier. No statistics needed. No guessing on known issues required. The second principle is one borrowed from “Continuous Delivery” [1]:

“Do not check in on a broken build”

When the build and test pipeline breaks something is wrong with your product. It does not make sense to move on sandy ground. The reason for the previous failure might be a bug in the application. The basic assumption is that the responsible developers would investigate the failure and provide a fix eventually. Only after this has been done is the next check-in allowed. As the authors of [1] pointed out, breaking this rule would prolong the time until the build gets green again, for analysis on a moving target is much harder to do. And second, people would get used to seeing broken builds, which would yield either ignorance or a big effort to clean up the build every now and then, which nobody really wants to do.

Both these principles could actively be lived by smaller teams or projects without the use of any additional processes or tools when people accept them as the way they want to work. I’ve seen this working in real life. As long as the people involved share these principles and do not cease to enforce them they would be perfectly fine with this. But this may turn out to be hard work and sometimes the discipline erodes. If the window gets broken then it will be hard to fix it. This gets even harder when talking about big projects with many teams involved. This is why I came up with this proposal of mine. It is basically about how to make a large organization react properly and instantly to broken builds, not about how a certain kind of Sporadic failure could be fixed to make the test more stable. I will come to this later, though. For I do think all the posts on the net lack one important discussion: Is the intermittent test itself really the test you should have, or would another testing strategy for the feature under test help to avoid the issues that usually make tests unstable?

Proposal: Green Mainline Policy

When gentlemen's agreements do not help to keep the “Do not check in on a broken build” principle alive, then it might be time to establish a tool supported process that enforces the desired behavior. I will call this a “Green Mainline Policy” or “Green Branch Policy” for it is not only applicable to the mainline. (In the following I will only refer to mainline for the sake of readability.)

The objectives of the proposal are:
• Reduce or eliminate the occurrence of Sporadics
• Make fixing a broken build the highest priority task
• Must work for large projects

So, what’s the proposal then? Basically it’s quite simple. Suppose there is a continuous build and test after a change has been submitted to mainline. If this run succeeds everything is fine, obviously. If it fails the CI server will
• close mainline for any further submits except for the person/team responsible to fix the issue
• open a ticket in the issue tracker
• send a mail to the team/person responsible for the broken test
When the fix has been provided it will be accepted for submission to mainline and a new build and test run starts. Only if it succeeds will the restrictions be lifted and everything is back to normal.
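The steps above can be sketched as a tiny state machine in Python. This is only an illustration under assumptions of my own: the class `Mainline`, its methods and the plain `log` list (which stands in for real issue tracker and mail calls) are hypothetical and not the API of any CI server.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Mainline:
    """Hypothetical gate in front of mainline, driven by CI results."""
    fixer: Optional[str] = None              # who may submit while closed; None = open
    log: list = field(default_factory=list)  # stands in for tracker and mail calls

    def may_submit(self, author: str) -> bool:
        """Everyone may submit to an open mainline, only the fixer to a closed one."""
        return self.fixer is None or author == self.fixer

    def on_ci_result(self, passed: bool, owner: Optional[str] = None) -> None:
        """React to a finished build and test run.
        `owner` is the person or team responsible for the broken test."""
        if passed:
            if self.fixer is not None:
                self.log.append("reopen mainline")
            self.fixer = None
            return
        if self.fixer is None:
            # First red run: close mainline, open a ticket, send a mail.
            self.fixer = owner
            self.log.append(f"close mainline, only {owner} may submit")
            self.log.append(f"open ticket for {owner}")
            self.log.append(f"mail {owner}")
        # A red run on a closed mainline changes nothing: it stays closed.
```

A pre-submit hook would ask `may_submit` before accepting a change, and the CI server would report every finished run through `on_ci_result`. Note that a fix attempt that fails again keeps mainline closed, which matches the rule that only a successful run lifts the restrictions.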

Instead of letting the CI server handle the policy there could be QA people or mainline owners who would execute this policy manually. This depends on how business is done usually. The main point is to really close mainline for further changes. No-one will be able to bring any changes – except the fix - near mainline anymore until the issue got fixed. With that the “Do not check in on a broken build” does not only depend on people behaving properly but will be enforced from outside of the team and thus will be much harder to undermine.

Reality check. There is a project I know of which incorporated this “Green Mainline Policy”. Although there was some resentment in the beginning the policy proved to be of value. Not only do Sporadics get analyzed and (very often) fixed quickly, teams also adopted this policy for their team codelines as a rule to follow without process enforcement. There has been an improved quality chill factor. Quality felt better than before when everyone was bothered by ever broken mainline builds. Now quality issues become evident and the improvement of quality is visible to everyone: Fewer and fewer broken mainline builds.

Invitation to further discussion

However, it was quite easy to type this down and it might be tempting to leave it like this. But this wouldn’t be the right thing to do for the proposal does raise a lot of questions which need to be answered. I will come to this in a second. First I would like to discuss what this simple approach could do.

It puts a whole lot of urgency on the fixing of the issue which broke the build. As it is, mainline would only be opened when a fix arrives. This issue, be it a regression or any other (d)effect, will not show up anymore, given the fix has been a thorough one and does not, for example, just increase a timeout that proved to be too short. The lack of urgency for Sporadics would be gone. There couldn’t possibly be more urgency than a stopped mainline production would produce. There are no excuses for not fixing the issue anymore. But does this change the attitude we talked about earlier? Probably not. It’s more like an extrinsic motivation – one might call it pressure – than an intrinsic one to avoid or at least quickly remove Sporadics.

There are other questions popping up:
• How long would mainline be closed if fixing the issue would take a day, or two, or even longer?
• How would you verify the fix? Running the whole build and test pipeline again? Just verifying the failing test?
• What if the failure could not be reproduced?

Much depends on the very nature of the failure. There are so many possible sources of failure in a test that one would need to give a more differentiated answer. It’s plain to see that the approach might not remain this simplistic. I will try to elaborate on this next time around.

Read also:
Part 1: Sporadics don't matter
Part 2: Let the CI server take care of these Sporadics

Part 3: Where do these Sporadics come from

[1] Humble, Jez, and David Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation, Addison-Wesley, 6th printing 2012, p.66

The opinions expressed in this blog are my own views and not those of SAP

Thursday, August 28, 2014

Where do these Sporadics come from?

In part2 of this series I concluded that automating the detection of intermittent, random or non-deterministic tests (aka Sporadics) comes with unforeseeable extra costs. It may serve as a monitoring of Sporadics and might yield information to rank the Sporadics to decide which should be solved first. But if the efforts stop there then one has basically invested in managing status quo instead of changing it for the better.

Before proposing a possible solution to the Sporadics problem I need to elaborate more on how a test becomes a Sporadic. Understanding this will lend hints to the solution to be established.

The basic assumption is that all tests were successful when they were first added to their test suites. Otherwise Sporadics would not be the topic to talk about …

Suppose there is a test which ran successfully for a fair amount of time. Weeks or even months. Everything was fine until the day it turned red for the very first time. A test failure occurred. Nothing special. What happened back then? Someone might have investigated the issue. After some thorough work one came to one or more of these conclusions:

1. The test failed because of a bug in the test, the bug got fixed
2. The test failed because of a bug in the productive code, the bug got fixed
3. The test ran successfully again when the complete test run was repeated
4. The test ran successfully when repeated in separation, so there supposedly was no issue with the test or the code under test itself
5. The test failure does not relate to the change made to the productive code at all. Strange, but well (imagine a shrug, when reading)
6. There has been some filer outage and the test wasn’t able to read or write a file, or there has been any other issue with some part of the infrastructure the test was using
7. You name it

(I will not go into detail about inherent fragile tests some of which could be a great source of Sporadics. Nor will I elaborate on possible root causes that would make a test intermittent. This is not the point in this post. I will save this for later ones.)

We are not talking about the first two items in this list. These are the good cases where the safety net worked and the required actions have been taken.

We are not talking about the occurrences when a real root cause analysis has been made, the problem has been found and fixed. These things happen probably more often than not, especially if finding 5 was not accompanied by finding 3 or 4, but not every time as the number of Sporadics in a corpus of tests will tell you.

Item number 6 would be an easy catch: Infrastructure issue. The guys maintaining it fixed the issue or the issue has been a temporary one. However the test has been good all the time. No-one has to do anything about it. Sure it will be green again when the infrastructure issue does not reappear.

Items 3 and 4 tend to be soothing enough for the guy investigating this issue that no other actions followed. Looks good now. So, just merge into mainline. Must have been some hiccup sort of.

Item number 5 consumed the most time investigating. It looks strange, but a rerun standalone and as part of its test suite succeeded. Somehow it leaves a bad taste in the mouth. But hey, didn’t it succeed all the time? And now it does again. Let’s leave it alone and do some important work, what d’you think?

While item 6 is a bad signal itself, items 3, 4, 5 are the ones that could break the neck of the company. If we are lucky they “only” signal issues with the tests themselves. Some hidden dependency that appears in some strange situations just on days of certain signs of the zodiac while the moon is in a particular position … you’ve got the idea. If we are not that lucky they were only representing the tip of the iceberg and there has been some non-deterministic behavior in our productive code which might lead to loss of data or any other hazardous event when in production use at a customer’s site. Maybe you only have a hard time analyzing it working long hours and at weekends or you might be facing a PR disaster or even worse a substantial claim for damages.

Experience shows that test failures like items 3 through 5 are the more easily shrugged off the more often they occur. These failures somehow are not taken as seriously as they should be. Quite often there is a line of argument including pure statistics or issue tracker records for the code under test, showing that it is not worth the time one would have to spend to find the root cause or, if the root cause is known, to fix it. Me personally, I witnessed such a line of thought more than once. However, tests that failed this way for a first time will start to fail a second, a third time and over again. And there you are. Now you have a Sporadic. A known issue. An item on a list or a record in a database. And one by one they creep in.

Why could this happen? In an ideal world a developer would follow the Continuous Integration principle and thus would be eager to get rid of any test failure in mainline builds at any time. For we are not living in such an ideal world, things are a bit different. There are test failures that won't be fixed for the reasons listed above. It would be too easy to blame developers for not caring.

Developers find themselves confronted with various, competing and sometimes even contradicting requirements. Features always come first, for the company gets paid for them. Maintenance is important too. Not to forget quality. Or some refactorings. Issues with sporadically failing tests for features delivered long ago (several weeks or months) tend to get lost in this. They just don’t reach the level of urgency they would require to get the necessary attention.

In part1 I was complaining about the attitude developers show towards the Sporadics. This attitude gets influenced by the whole environment developers find themselves in. So just introducing yet another tool will not help. There is also the social aspect of this. Or as +Steve Sether pointed out in his comment on my second post of this series:

"Since the problem is essentially a social problem, I think we should look towards social science for guidance. It's been experimentally verified that people discount any potential badness that happens in the future. So, bad thing in the future is much better (in people’s minds) than bad thing in the near present."

So let's explore the possible solution next time around. The teasing will have an end then ...

Friday, August 22, 2014

Let the CI server take care of these Sporadics

In my last post on Sporadics (http://bit.ly/sporadics-1) I came to the conclusion that Sporadics do matter. Sporadics litter your test suites with failures. Any test suite with failures requires attention. Someone needs to have a closer look to figure out what went wrong.

If test suites usually succeed, any test failure would signal an issue that was just introduced by the very change or in case of nightly integration runs by a limited number of changes. In any case it is worth the work one has to invest.

With Sporadics creating a constant buzz of test failures there are test suites that fail with a varying number of failures. Any of them any time requires someone to have a closer look and chances are that only the “known” issues occurred. If this happens over and over again less and less attention will be paid and more and more “unknown” issues will go unnoticed. With an increasing number of Sporadics it becomes easier for a severe failure to slip through. The safety net will cease to be of any worth.

Sporadics are not only dangerous but expensive and frustrating too. So, someone or something has to deal with them. Who or what could this possibly be?

In agile methodology we want to automate as much as possible if not everything. So what about automation of sporadic detection?

It seems to be feasible, doesn't it? We have monotonous and repetitive work which in fact is just plain pattern matching: Does the failure look like one we already have in our list? Just compare test output of any kind like log files, backtraces and so on with the known pattern and mark the failure as known issue maybe even with a unique identifier attached to it. A rather simple script could do this.

And the best place to run such a script would be the CI server where all the changes for mainline are being built and tested anyway. And just add it to the nightly builds of mainline too. With that we would have a nice automated detection of sporadics and would always know whether or not a change introduces new bugs we really have to care for instantly.

Well, sounds promising. No more of these stupid tasks to scan test results. Just have a look at the CI server output and you instantly know whether the change looks good or not.

Done. Continuous integration and automation did the job.

Really?

Three questions arise - at least:

How precise should the pattern matching work?
What if we find a known Sporadic in a test run? What should be the action to be taken?
How would a new test failure be marked as Sporadic and how would a Sporadic be removed from the list?

Suppose the productive code or test code changes and the known Sporadic slightly moves its location in the code, or the lines written to the log change and no longer match the stored log content of the known Sporadic. How flexible should the pattern matcher be? If it's too tight it might report known Sporadics as new ones, thus prolonging the list of known Sporadics unduly. If it's too loose it might report new failures as known ones, which would increase the risk of severe issues slipping through.
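The tight versus loose dilemma can be shown with a few lines of Python. This is a deliberately naive sketch, not a recommendation: it compares raw log text with `difflib.SequenceMatcher` from the standard library, and the function name `matches_known`, the identifier `SPOR-1`, the log lines and the thresholds are all made up for illustration.

```python
from difflib import SequenceMatcher


def matches_known(failure_log, known_logs, threshold):
    """Return the id of the most similar known Sporadic, or None if even the
    best candidate stays below the similarity threshold (0.0 to 1.0)."""
    best_id, best_score = None, 0.0
    for sporadic_id, known_log in known_logs.items():
        score = SequenceMatcher(None, failure_log, known_log).ratio()
        if score > best_score:
            best_id, best_score = sporadic_id, score
    return best_id if best_score >= threshold else None


# Hypothetical database of known Sporadics: id -> stored log line.
known = {"SPOR-1": "TimeoutError in test_login at line 120"}

# The same Sporadic after the code moved by a few lines.
moved = "TimeoutError in test_login at line 124"
# A genuinely new and severe failure.
new_bug = "AssertionError: data lost in test_export"

too_tight = matches_known(moved, known, threshold=1.0)    # None: reported as new
reasonable = matches_known(moved, known, threshold=0.9)   # "SPOR-1"
too_loose = matches_known(new_bug, known, threshold=0.1)  # "SPOR-1": bug slips through
```

With an exact match required, the moved Sporadic gets reported as a new one. With a very low threshold, the data loss gets filed under the known timeout. Where the right threshold lies depends on the logs at hand, which is part of the maintenance cost of such a detector.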

Only these few thoughts show that a Sporadics detector itself would be a quite sophisticated piece of code which brings its own risks that would need to be taken care of. And don't forget the amount of work required to develop and maintain it.

Suppose a Sporadics detector is in place and works quite well. What happens if it detects known Sporadics in a test run? Should it just report them and mark the test run a success if there are no other failures? Should it try to rerun the Sporadics to make them a success in the second run? Or in the third run? Or …?

How far would you go to make the tests green? And what if the Sporadic wouldn't turn green this time? Would it be a failure then? Although you know it's a Sporadic? Probably one would do without rerunning Sporadics. Just rely on the detection. If a known Sporadic has been identified then accept it as it is and remove it from the test result.

Suppose there is a test failure and it is not a known Sporadic. Should it be taken as a failure someone has to investigate immediately? Or would you try a rerun of all failures to check whether or not they will fail again?

In the first scenario some developer or QA guy would have to investigate the failure. If it turns out to be a Sporadic then this guy would have to add it to the database of known Sporadics for the Sporadic detector to find it the next time it occurs.

In the second scenario the Sporadic detector would do the job of rerunning and guessing. If it does not fail in a number of reruns again, it will be considered a Sporadic and will be added to the database. If not, the test run will be a failure.

So managing the list of known Sporadics will either introduce some extra work for a developer or QA person or, in case of automation, add extra run time to each and every test run, for every test failure will be treated as a possible Sporadic. Thus test runs will be prolonged, turnaround times will increase and the throughput of the CI server will decrease. Which in turn means extra costs.
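The rerun-and-guess scenario and its cost can be sketched in a few lines of Python. The function name `classify_by_rerun` and the default of three reruns are assumptions, and so is the rule that a single passing rerun is enough to call the failure a Sporadic; a detector could just as well require all reruns to pass.

```python
def classify_by_rerun(run_test, max_reruns=3):
    """Rerun a test that has just failed once.

    run_test is a callable that returns True when the test passes.
    Returns a (label, extra_runs) pair, where extra_runs is the number of
    additional test executions this classification has cost.
    """
    for attempt in range(1, max_reruns + 1):
        if run_test():
            # Passed on a rerun: assumed to be a Sporadic, add it to the database.
            return "sporadic", attempt
    # Failed every time: treat it as a real failure of the test run.
    return "failure", max_reruns
```

The second element of the result is the point here: a real regression always pays the full `max_reruns` executions before it is reported, and that happens for every failing test in every run.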

Suppose one is willing to bear these costs to catch those Sporadics. What is the tradeoff? With this automation of Sporadic detection we take some load off the developers. They no longer have to analyze all these Sporadics manually. Which gives them more time to add new features. Sure one would like this one!

Really?

We were talking about attitude towards Sporadics. No-one takes them seriously. They keep on piling up. Does the automation of Sporadic detection help in changing this behavior? I don't think so. It's the complete opposite. Sporadics are getting even more out of focus. Developers do not need to care for them anymore. They are no longer a pain in the … for them. One does not need to be a fortuneteller to expect the number of Sporadics to be further increasing.

To be honest, if the corpus of tests is in bad condition and there are more failing test runs than successful ones, such a mechanism (as cheap as possible) might help to classify the test failures and produce a list of Sporadics one could work on. But then, don't the developers already have lists of their own Sporadics? They used them for their manual checks. They are written down in some issue tracker, wiki, Excel sheet, text document or on plain post-it notes on some whiteboard. All these solutions are far cheaper than the automation and automated central management of Sporadics. Last but not least, these solutions keep the developers in the loop. They feel the pain of Sporadics every day.

Something else would be required to solve the Sporadics issue. Something that really puts pressure on the developers to solve it. But that's another story.

The opinions expressed in this blog are my own views and not those of SAP

Friday, August 15, 2014

Sporadics don't matter

In my daily work I quite frequently stumble over the phenomenon of - as we call it - Sporadics. Sporadics are tests that fail every now and then for a bunch of reasons. Other great blog posts call them intermittent test failures (http://spin.atomicobject.com/2012/04/27/intermittent-test-failures/) or non-deterministic failing tests (http://martinfowler.com/articles/nonDeterminism.html).

The phenomenon bothers me a lot these days. It’s not so much the Sporadic itself, though, but the attitude towards it. Every time a Sporadic occurs I hear sentences like:

“This is just a Sporadic you can merge into mainline anyway”

“You can’t get rid of Sporadics, you have to deal with them statistically”

“It’s too expensive to fix them”

“It’s infrastructure anyway”

“There are too many of them to tackle them. Just ignore.”

In general people (aka developers) tend to consider them a minor issue or a matter of fate you can’t cope with anyway. That leads to the impression that Sporadics just exist but do not tell you anything about the quality of your product the way “real” test failures do. Real test failures signal a broken feature that has to be fixed, or a changed one that makes the tests fail because expectations changed. One could fix that. One could even write a ticket for it and resolve it in due time. Everything is nice and comparatively easy when it comes to “real” test failures.

However, over the last years I have come to the impression that the “real” test failures, which always get a lot of attention, are not the ones I should be afraid of, and that the Sporadics are the real ticking bomb. How come?

As said before, for some reason developers tend to avoid analyzing Sporadics. Probably this hasn’t always been the case but comes from experience. Often enough the root cause of the failure turned out to be some problem with the test infrastructure or a bug in the test itself. So no real problem with the application was found. Somehow this notion has settled in: Sporadic failure means infrastructure or test issue, means no bug, means ignore the red test and merge into mainline.

But what happens underneath? Is this all there is? What about the application under test showing non-deterministic behavior?

These things happen (own experience). And they are what gives me the creeps. Non-deterministic behavior of the application yields strange behavior at the customer site which is very hard to analyze and thus very expensive to fix. And who’s to blame? If Sporadics are considered unimportant or even irrelevant, no one will fix them. They keep piling up. And somewhere in this pile these little bombs will hide. No one will stumble over these issues. They just keep creeping into the mainline by and by.

That’s why I consider Sporadics the most important test failures out there. If a Sporadic occurs, it has to be analyzed quickly. If you are lucky it really is an infrastructure or test design issue. But nevertheless go and fix it. Keep the Sporadics away from your test suites. Otherwise the information the tests could provide you with will decrease and at some point in time vanish into nothing.

In this series of blog posts I want to explore the phenomenon of Sporadics. I want to find out what they really are and what to do about them. Each Sporadic tells a story about your test environment, your test design and your application design. Each Sporadic failure serves as a signal to take action. Let’s see what one could do about them. I hope you will join me on this journey.

