Introduction
Just recently I found myself thinking about a proper setup for a product developed by distributed teams. As it happens, the use of git was a prerequisite. When working with a Distributed Version Control System (DVCS) like git, teams face the task of figuring out how to organize the source code they contribute to a larger system. As git is a powerful tool with loads of features and several ways to do most things, it offers both simple and rather complex solutions. In this post I want to explore some major approaches and compare them with each other. I admit that I am biased by concepts like Continuous Integration (CI) and Continuous Delivery (CD), which may influence my conclusions.
I consider this an experiment and will define an example product to set some constraints for the exploration. Some findings may be restricted to this setup, others may be more generally valid. However, none of the conclusions claims to be universal. The approaches investigated are taken from daily life. Every single one of them has crossed my path and I consider all of them worth looking at. Even an idea that seems far-fetched deserves at least an explanation of why it wouldn't be a good idea.
Sample Product
Let's assume a product of considerable size, say 1M LoC. Let's further assume the product consists of a number of large components, say 5-10, which themselves may be made of smaller components. The team setup follows the top level component structure by and large, although an individual may see the need for changes in several components. The development team consists of 50-500 people actually touching code. Finally, the product ships as one. There are no independent releases or patches of parts of it.
None of the components is intended to be reused by other products. Components are a reflection of the current architecture of the product.
The current dependency structure of our product looks as follows:
Components A, B and E are top level components forming the collection of services the product consists of. Component F forms a UI framework that components A and B plug into when it is present. F does not care about any components plugging into it. Components BA and BB are backends to A and B respectively. Component E is an extension to B.
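The structure described above can also be written down as data, which makes it easier to reason about in the sections that follow. This is only a sketch: the direction of each edge is my reading of the description (A and B depend on F and on their backends, E depends on B), and the names `DEPENDS_ON` and `build_order` are made up for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Assumed reading of the description: each key depends on the components in its set.
DEPENDS_ON = {
    "F": set(),         # UI framework, knows nothing about its plug-ins
    "BA": set(),        # backend to A
    "BB": set(),        # backend to B
    "A": {"F", "BA"},   # plugs into F, uses its backend BA
    "B": {"F", "BB"},   # plugs into F, uses its backend BB
    "E": {"B"},         # extension to B
}


def build_order(depends_on):
    """Return one valid order in which every component follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    print(build_order(DEPENDS_ON))
```

Any order returned this way lists F, BA and BB before A and B, and B before E. That is also the order in which changes have to travel through the product when each component lives in its own repository.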
Components are not necessarily identical with libraries, archives or any other sort of distributable artifact. Basically they are a logical structuring of the source code.
The product has a maintenance obligation with respect to already released versions.
What approaches could be applied to organize the source code of the given product? These are the ones that come to mind:
- Component Repositories
- Topic Branch
- Feature Branch
- Trunk Based Development
Approaches
When exploring the different approaches I will try to shed some light on a bunch of questions that far too often are not considered. These questions touch several aspects of the software development life cycle. Think of questions like:
- How do we get access to the component we depend on to make use of it in our component?
- How do we make sure we get information about public API changes soon enough for us to incorporate them?
- Should there be orchestrated schedules for component releases?
- How do we handle splits/merges of components?
- How do new top level components come to life?
- How do top level components cease to exist?
- How often do components ship new versions?
- How do we make sure there will only be one version of each component used inside the product? Or how do we make sure components A and B are developed against the same version of component F?
- When will integration testing be done?
- How will it be done?
- How does a component test itself in the context of the product? Or in the subcontext of its dependency tree?
- How is the product being assembled?
- Which component versions should be used?
- How do component versions get managed?
The list may not be complete. It already holds some tough issues, though. What would be the answers when we work with separate Component Repositories?
Component Repositories
Our development team values highly decoupled components which interact via public APIs only. Any use of non-public APIs is prohibited. The developers understand the temptation to use non-public APIs for the sake of re-use and want to avoid it by hiding the sources of their components from other components as much as possible.
Our development team has learned that a git repository for a large product worked on by many developers tends to become large, which increases the time needed to clone and fetch. I've seen such repositories exceed 4 GB.
Idea: A small repository containing only one top level component, worked on by 5-10 people, seems to be a fair trade. The team could work in isolation. Their repository would not be littered with source code they do not own or are likely never to touch. Things are simple when it comes to developing their component.
Development
For a developer working on component F, which has no dependencies or only dependencies on 3rd party code, life would be rather easy in this world. Only the things one has to deal with directly are present. One could build the whole component, including its tests, in a focused way. As developers are free to add or remove sub components of their component, they would not feel much of a downside.
Not all of the teams are happy with that setup, though. While the team providing the top level component F is quite happy with this approach, the teams depending on them are not. Why is that?
In order to plug into component F, components A and B need to know about and have access to the currently valid public API, at least to be able to mock the dependency away in their tests and to use the right calls in their production code. However this is done in the particular language the product is built with, there has to be some sort of communication. Interfaces have to be provided either as files or as API documentation. Depending on the language, these files have to be present during the build. At the latest when running integration tests of component A or B with component F, a real component F would be required.
This adds an obligation to the responsibilities of every top level component team: there has to be some way of releasing the component that other teams can rely on for their development. The team has to maintain a release schedule, and it has to actually release the component and make it available for the other teams to consume. Usually there will be a stable version and a development version available. These versions could be used by components A and B for their development.
Integration
As long as component F publishes new versions on a regular basis, some sort of "continuous" integration will be available. Components A and B could make use of the latest version of F and report bugs found in F or fix their own usage of F accordingly. Depending on the release cycle of F, the feedback loop would stretch from rather short to pretty long. During the development phase this may not be a problem, but when the release date closes in it almost certainly will turn into an issue.
Real continuous integration would be hard to achieve. Even if component F publishes a release candidate with every pipeline run, each candidate would have to be verified by the dependent components before it could turn into a released version. Thus component F depends on the pipelines of every component up the dependency tree to verify successful usage of F, of any component that uses F, and so forth. The verification pipeline for F will become pretty long, and if bugs are found it has to start all over again.
What's more, if component A uses the development version of F to stay close to the newest features of F, it relies on these development versions actually being released before the release date of the product, as no development version of F will be shipped with the released version of the product.
Another complication would be the possible divergence of the versions of F used by components A and B. To make sure they are actually using the very same version of F, there needs to be some governance enforcing this constraint.
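To get a feeling for how long that verification chain becomes, one can compute every component that directly or transitively uses F. The sketch below assumes my reading of the sample product's dependencies (A and B use F, E uses B); `DEPENDS_ON` and `dependents_of` are illustrative names, not part of any tool.

```python
from collections import deque

# Assumed dependency edges of the sample product: key depends on the listed components.
DEPENDS_ON = {
    "F": set(),
    "BA": set(),
    "BB": set(),
    "A": {"F", "BA"},
    "B": {"F", "BB"},
    "E": {"B"},
}


def dependents_of(component, depends_on):
    """Collect every component that directly or transitively depends on `component`."""
    # Invert the edges so we can walk from a component to its users.
    used_by = {name: set() for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            used_by[dep].add(name)

    seen = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for user in used_by[current]:
            if user not in seen:
                seen.add(user)
                queue.append(user)
    return seen


if __name__ == "__main__":
    # Pipelines that would have to pass before a release candidate of F is verified.
    print(sorted(dependents_of("F", DEPENDS_ON)))  # ['A', 'B', 'E']
```

Even in this small example a release candidate of F waits for three other pipelines, one of which (E) does not even use F directly.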
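Such governance could start as a simple automated check. The following is a minimal sketch, assuming every component repository publishes the versions it develops against in some machine readable form; the `DECLARED` data and all version numbers are invented for illustration.

```python
from collections import defaultdict

# Hypothetical data: the dependency versions each component develops against.
DECLARED = {
    "A": {"F": "2.3.0", "BA": "1.0.1"},
    "B": {"F": "2.4.0-dev", "BB": "0.9.0"},
    "E": {"B": "3.1.0"},
}


def find_divergence(declared):
    """Return the dependencies that are consumed in more than one version."""
    versions = defaultdict(dict)  # dependency -> {version: [consumers]}
    for consumer, deps in declared.items():
        for dep, version in deps.items():
            versions[dep].setdefault(version, []).append(consumer)
    return {dep: by_version for dep, by_version in versions.items() if len(by_version) > 1}


if __name__ == "__main__":
    print(find_divergence(DECLARED))
    # {'F': {'2.3.0': ['A'], '2.4.0-dev': ['B']}}
```

The check itself is trivial. The hard part is organizational: someone has to own it, run it, and decide which of the diverging teams has to move.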
Integration Testing
When teams are focused on developing their components, they tend to consider any usage of their component the business of someone else. The integration of component A or B with component F will probably be tested, but the product as a whole will not. Who would be responsible for performing this assembly task with all its required testing?
The product in question is an assembly of just the right versions of all its components. Thus the product would be represented by a bill of materials (BOM) only. The product assembly would pull in all the named versions of the components and perform the required packaging. What about the testing then? There would have to be a team that takes care of this assembly and of the integration testing to make sure the BOM holds a valid and working combination of component versions. The assembly pipeline would have to run the integration tests of all components and might eventually have to provide additional integration tests on product level. This team would not develop any production code at all, which bears the risk of them not knowing about the features implemented. Dedicated communication would be required to make sure the assembly (or testing) team knows what to test for.
Another risk would be product level tests breaking due to changes in components. As the product level tests run in the product assembly pipeline, no component pipeline will run them and thus none will get feedback from them. It is the same as component A running integration tests with component F, which could find bugs in F outside the pipeline of F. At any of these points there will be feedback that someone has to communicate to the component depended upon. This feedback would have to trickle down the dependency tree with all the communication that comes along with that.
It would be best if the component could test itself in the context of the product within its own pipeline. To do that it would have to get access to the current BOM describing the product and to the product level tests. In order to run the product based tests it would have to build the product based on this BOM and replace itself with the component version under test. Component A's build and test process suddenly needs to know about the product and its assembly thus duplicating knowledge.
Another way would be to trigger the product assembly pipeline by replacing the version of component A in the BOM with the latest release candidate of A. If the product assembly pipeline succeeds, the release candidate could be considered verified and be released, and the BOM of the product could be changed accordingly. In this case the knowledge would not be duplicated, but we would need a feedback loop from the product assembly pipeline back to the pipeline of component A. In order to get close to continuous integration, any pipeline run of component A would include and wait for a pipeline run of the product assembly.
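The substitution step described here can be sketched in a few lines. This is an illustration under assumptions: the BOM is modelled as a plain mapping from component name to version, and `run_product_pipeline` is a hypothetical callable standing in for the whole product assembly pipeline, returning `True` on success.

```python
def candidate_bom(bom, component, candidate_version):
    """Return a copy of the BOM with one component replaced by its release candidate."""
    if component not in bom:
        raise KeyError(f"{component} is not part of the product")
    trial = dict(bom)  # never modify the released BOM in place
    trial[component] = candidate_version
    return trial


def verify_candidate(bom, component, candidate_version, run_product_pipeline):
    """Promote the candidate into the BOM only if the product pipeline succeeds.

    `run_product_pipeline` is a hypothetical stand-in for assembling the
    product from the trial BOM and running the product level tests.
    """
    trial = candidate_bom(bom, component, candidate_version)
    if run_product_pipeline(trial):
        return trial  # candidate verified, this becomes the new BOM
    return bom  # candidate rejected, the BOM stays as it was


if __name__ == "__main__":
    # All versions are invented for illustration.
    bom = {"A": "1.4.0", "B": "3.1.0", "F": "2.3.0"}
    print(verify_candidate(bom, "A", "1.5.0-rc1", lambda trial: True))
```

The code hides the expensive part: the call to `run_product_pipeline` is exactly the wait for a full product assembly run that every pipeline run of component A would have to include.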
Release
As said before, the product is represented as a bill of materials (BOM) containing the proper components and their versions. At this point uniqueness of component versions could be enforced, e.g. the version of component F to be used. Following the rule of one repository per top level component, the product as the topmost component will reside in its own repository along with the product level tests.
Releasing would include collecting all released versions of the components making up the product and running the product level tests, as there would be no other tests available. If the versions of, let's say, component B and component F do not fit together because component B was using a different version of F in its integration testing, there is a risk that the product level tests would not discover this mismatch, as they will not redo the testing done at component B's level. To avoid this, means of mitigation would have to be provided, for example: a component's integration tests are made available to the components using it, or a component gets hold of the BOM of the product to make sure it uses the proper versions of all components it depends on.
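The second mitigation, checking a component's integration setup against the product BOM, might look like the sketch below. It assumes that each component records which dependency versions its integration tests ran against; `BOM`, `TESTED_AGAINST` and all version numbers are hypothetical.

```python
# Hypothetical product BOM: exactly one version per component.
BOM = {"A": "1.4.0", "B": "3.1.0", "E": "0.7.0", "F": "2.3.0"}

# Hypothetical record of the dependency versions each component tested against.
TESTED_AGAINST = {
    "A": {"F": "2.3.0"},
    "B": {"F": "2.2.0"},
    "E": {"B": "3.1.0"},
}


def bom_mismatches(bom, tested_against):
    """List (component, dependency, tested version, BOM version) for every mismatch."""
    problems = []
    for component, deps in sorted(tested_against.items()):
        for dep, version in sorted(deps.items()):
            if bom.get(dep) != version:
                problems.append((component, dep, version, bom.get(dep)))
    return problems


if __name__ == "__main__":
    print(bom_mismatches(BOM, TESTED_AGAINST))
    # [('B', 'F', '2.2.0', '2.3.0')]
```

Here B was verified against a version of F that will not ship, which is precisely the gap the product level tests are unlikely to catch.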
This would introduce yet another set of communications required to mitigate issues induced by the general approach of Component Repositories.
Refactoring
As long as refactoring takes place within the boundaries of a top-level component, things are fine. When it comes to structural refactorings of the product, i.e. the introduction or removal of top-level components, it becomes cumbersome.
If there is to be a new service C, we could just create a new component repository and start working on C, adding it to the integration like the other components.
If an existing component ceases to exist in a newer product version, we cannot simply get rid of it as long as the maintenance obligation exists. There will be a legacy component around in a repository no one works on full-time any more. This usually causes the component to rot, for no one will take responsibility for it. It is just too far out of sight.
Component Repositories make it hard to factor out new components. Consider a part of component A that would be useful for component B as well. How would B get hold of it? The cost of introducing a new component repository for the new reusable component would be quite high. So copying the code and adding it to component B's repository seems reasonable, especially as long as one is still wondering whether this part really is reusable by B. If it is reusable, and if someone really wants to avoid the code duplication and opens up a new component repository, who would be responsible for it? The newly found component would not be a top-level component, so there would be no dedicated team apart from the one for component A. Would this team be responsible for two repositories now?
Summary
The Component Repositories approach has its advantages when it comes to the development of a leaf component. As soon as interaction with other components due to dependencies is involved, things get messy. Components suddenly need release management and version governance to make sure every component is on the same page. The product assembly part in particular will become a matter of discussion, for no component development team will take responsibility for this integration level. A product assembly team would have to deal with that and would have to take care of product level testing itself.
Communication would be key in this approach. Whether it is done by introducing additional automation to connect the component repository pipelines with each other or by human interaction, it adds complexity and the "one has to think of it" sort of things which tend not to be thought of.
Unless the component is a real deliverable in its own right, i.e. it is used outside of the product or patched individually, I would consider this approach impracticable.
Approach | Component Repositories |
Development (leaf component) | |
Development (non-leaf component) | |
Integration | |
Release | |
Refactoring | |
Organizational Complexity | |
Conclusion
As I have considered only one approach so far, the only conclusion I can offer is that I would not like to go for Component Repositories, even without knowing a better alternative yet.
Change History
This is the first installment.