Continuous delivery best practices
This section provides details about continuous delivery best practices.
The practices in this section are vendor-neutral. To read case studies or
opinionated implementations with specific tools, take a look at the
Community section, where you can also find additional resources.
How to use this guide
If you are new to continuous delivery practices and want to understand their
benefits and the prerequisites for starting your journey, read the
Where to Start section.
If you have some familiarity with continuous delivery, but need help figuring
out what to prioritize, read about assessment tools to help you identify
areas to focus on.
The rest of the subsections in this guide provide information about key areas of
continuous delivery.
Some practices depend on others, while others span the entire software
lifecycle. For example, best practices for continuous integration depend on
version control. Security best practices are most effective when applied across
the entire software supply chain. Best practices also involve collaboration
across functional teams.
1 - Where to Start?
An introduction to Continuous Delivery
1.1 - Understanding the Problem
Learn about the fundamental challenges that Continuous Delivery addresses
Software development at scale is an activity within the broader context of Product Commercialization.
Our goal is not to build software, but to build a product that leverages software or machine learning to solve the problem that the product addresses.
The problems of product commercialization drive the approach that we use to optimize software delivery. Let’s start by understanding these problems.
It is often assumed that product development looks like this:
- Have a brilliant idea for a product
- Build the finished product
- Sell the product
In practice, the reality is that your idea is actually an assumption about what might represent a product, but at that stage, you have no evidence to show that anyone wants to buy it.
You could follow the flow above, but if you are going to invest, say, $10M in the process of developing your product, you are effectively gambling that $10M on the chance that there will be enough customers to return your investment and make a profit.
Best practice in product commercialization looks more like this:
- Have 1,000 ideas
- Evaluate and rank each idea for potential merit to find the best
- Take this idea and find a handful of customers who have the problem it solves
- Work iteratively with your customers to build a series of experiments that test the usefulness of your idea at the minimum viable scale
- If your customers get excited and start offering to pay for the product, invest in scaling to a full implementation; otherwise
- Stop. Pick the next idea from your list and go back to step 3
This ensures that you are only investing heavily in activities that have strong evidence of product-market fit.
The trouble is, even with this process, fewer than one in ten product ideas will succeed in the marketplace. It is therefore very important to run this process as lean as possible, so that the cost of your learning experiments is manageable within the investment reserves available to you.
What does this mean from a software development perspective?
Well, in the first model, you would expect to receive a full set of requirements up front and could carefully optimize a design to take account of everything that the product needs to be able to do. Then this could be decomposed into features and given to teams to implement against specifications.
In the lean model, all you have to start is a set of assumptions about what a customer’s problem might be. In this case, the role of the development team is actually to help design and build experiments that validate (or invalidate) each assumption, in conjunction with real customers.
In the first model, you expect to spend a year or so making mistakes and learning about the problem domain in private, before a big public launch of a polished application.
In the lean model, you have to do your learning in front of the customer, regularly delivering new experimental features into live production environments.
This learning takes the form of an iterative loop:
- Propose a hypothesis about what feature might be of value to the customer
- Design and develop an experiment to test this feature in the customer’s environment
- Run the experiment with the customer and evaluate
- Adjust the experiment and re-run; or
- Move to the next experiment; or
- Shut down the product
Of course, you can expect to have a finite budget to invest in learning experiments, and your customers will have a finite attention span, so there is a hard limit on how many times you can iterate through this loop before failing out. As a result, it is crucial to optimize this loop so that you can afford as many iterations as possible to maximize your chance of discovering a feature set that demonstrates product-market fit.
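The budget-bounded loop above can be sketched as a simple search. This is an illustrative sketch only: the outcome labels ("validated", "invalidated", anything else meaning "adjust and re-run") and the budget model are assumptions, not a prescribed implementation.

```python
# A minimal sketch of the learning loop under a finite experiment budget.
# The outcome labels and the budget model are illustrative assumptions.
def run_learning_loop(hypotheses, budget, run_experiment):
    for hypothesis in hypotheses:
        while budget > 0:
            budget -= 1
            outcome = run_experiment(hypothesis)
            if outcome == "validated":
                return hypothesis          # invest in scaling this feature
            if outcome == "invalidated":
                break                      # move to the next experiment
            # any other outcome: adjust the experiment and re-run
        if budget == 0:
            return None                    # out of runway: shut down
    return None                            # no hypothesis survived
```

The key property is visible in the structure: the cheaper each iteration, the more hypotheses you can afford to test before the budget runs out.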
This means that we can derive two critical metrics that will help us stay focused on optimizing our product commercialization process at all times:
Deployment Frequency
Your ability to execute the above loop is constrained by the frequency at which you can deploy your product into production. If it takes six months to move a feature into production, you will run out of money and customers long before you get to run more than that one experiment. Even if it takes you only a week to deploy, you probably won’t be able to run enough experiments to discover the right product.
This implies the need to optimize the entire delivery process to be able to move features into production on a cadence of days to hours.
Lead Time
Understanding the time it takes from the point of identifying a new hypothesis you wish to test, to the point at which an experiment to test it has been deployed in production, helps to define the granularity of the experiments that you are able to run. If it is going to take a team three months to implement an experiment, consider reducing the scope of the experiment, or splitting the problem into smaller components that can be developed in parallel by multiple feature teams.
Optimizing the delivery process to consistently reduce lead times increases the capacity to run more experimental iterations within your available runway.
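Both of these metrics can be computed directly from delivery records. A minimal sketch, assuming a hypothetical log of when each hypothesis was identified and when the experiment testing it reached production:

```python
# Hedged sketch: deriving deployment frequency and lead time from a
# hypothetical log of (hypothesis_identified, deployed_to_production) pairs.
from datetime import datetime, timedelta

# Illustrative delivery records.
records = [
    (datetime(2024, 1, 1), datetime(2024, 1, 4)),
    (datetime(2024, 1, 3), datetime(2024, 1, 9)),
    (datetime(2024, 1, 8), datetime(2024, 1, 12)),
]

deploy_dates = sorted(deployed for _, deployed in records)
window_days = (deploy_dates[-1] - deploy_dates[0]).days or 1
deployment_frequency = len(deploy_dates) / window_days          # deploys per day

lead_times = [deployed - identified for identified, deployed in records]
avg_lead_time = sum(lead_times, timedelta()) / len(lead_times)  # mean lead time

print(f"{deployment_frequency:.2f} deploys/day, mean lead time {avg_lead_time}")
```

In practice these timestamps would come from your issue tracker and deployment pipeline rather than a hand-written list; the value of the metrics lies in watching their trend, not any single reading.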
Working in this iterative fashion requires a fundamental conceptual shift across delivery teams. Instead of assuming that “the product must eventually be perfect”, teams must assume that failures will happen regularly and, as a result, become very good at fixing things fast.
This should not be interpreted to mean ‘quality is lower’. Realistically, the idea of building products that don’t break is a naive oversimplification. Instead, we must consider that things can break; we should anticipate the ways in which they might be expected to break and mitigate the risk associated with breakages.
An unanticipated breakage that takes a product down for a week can kill a business. A temporary outage of one feature of a product is merely a minor annoyance, and a failed feature deployment that is safely rolled back after failing smoke testing will probably go unnoticed.
This allows us to derive two more key metrics:
Time to Restore
Optimizing the time between identifying an incident and recovering from the incident mitigates the risk exposure during the failure. Typically, this encourages us to decompose our designs into small footprint components that are easily testable and have limited impact in the case of failure. Where mistakes are made, limiting the scope of the impact of those mistakes reduces risk and the time needed to recover.
Change Failure Rate
Tracking the percentage of changes that lead to an incident in production gives us a proxy metric for quality. This should in turn lead to appropriate root cause analysis. If many of the changes we make to our product are causing issues, do we properly understand the problem our product is designed to address? Does the product development team understand the product they are building? Is our delivery process prone to errors?
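Both of these metrics fall out of a simple change log. A hedged sketch, assuming an illustrative record format in which failed changes carry detection and restoration timestamps:

```python
# Hedged sketch: deriving Time to Restore and Change Failure Rate from a
# list of production changes. The record format here is illustrative.
from datetime import datetime, timedelta

changes = [
    {"id": "c1", "failed": False},
    {"id": "c2", "failed": True,
     "detected": datetime(2024, 3, 1, 10, 0), "restored": datetime(2024, 3, 1, 10, 45)},
    {"id": "c3", "failed": False},
    {"id": "c4", "failed": True,
     "detected": datetime(2024, 3, 5, 9, 0), "restored": datetime(2024, 3, 5, 11, 15)},
]

failures = [c for c in changes if c["failed"]]
change_failure_rate = len(failures) / len(changes)

restore_times = [c["restored"] - c["detected"] for c in failures]
mean_time_to_restore = sum(restore_times, timedelta()) / len(restore_times)

print(f"CFR {change_failure_rate:.0%}, MTTR {mean_time_to_restore}")
```

A rising change failure rate is the trigger for the root cause analysis described above; the metric itself only tells you that something is wrong, not what.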
Why this focus on metrics, though?
Well, this is another major challenge for technology product development.
In traditional engineering disciplines, such as civil engineering or aerospace, you are building something physical. If, every day, your product under development looks more like a bridge or an airplane, everyone on the team can understand that you are heading in the right direction and get a feel for progress. If your airplane starts to look more like an elephant, you can stop and reassess your activities to ensure you are on track.
Now imagine that you are building an invisible aircraft out of thousands of invisible components.
You can’t see the components being built, or the product being assembled from them. You have a team of hundreds of people working on the product and are relying entirely upon blind faith that they all know what they are doing and are all building compatible parts because the first time you will be able to evidence that will be on the maiden flight with the customer…
Surely, nobody in their right minds would buy an aircraft from a company that had an engineering process that looked like that?
Of course, the reality is that this is exactly how a lot of organizations approach the development of software products, which goes a long way toward explaining why 70% of software projects fail and 20% of those fail badly enough to threaten the viability of the parent company.
A core component of successful software delivery lies in making progress visible within an intrinsically invisible domain.
Small teams tend to do this naturally using communication. Many adopt basic Agile rituals that involve daily stand-ups to communicate issues, progress and blockers. At a small scale, this is viable. However, the number of possible communication paths scales with n² − n (where n is the number of people on the team). As a result, once you get beyond about six people on a team, they are spending the majority of their time on communication activities rather than developing software.
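The growth in communication paths is easy to check numerically:

```python
# Quick numeric check of how directed communication paths grow with team
# size: each of n people can communicate with each of the other n - 1.
def communication_paths(n: int) -> int:
    return n * (n - 1)  # equivalently n**2 - n

for n in (2, 6, 12, 50):
    print(n, communication_paths(n))  # 2, 30, 132 and 2450 paths respectively
```

Doubling a six-person team does not double the coordination load; it roughly quadruples it.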
The implication of this is that for larger products, it is insufficient to follow our natural instincts or to apply basic engineering project management techniques from other disciplines. We need to apply a methodology specifically designed to mitigate the risks associated with working with intangible assets.
Before we look at the details of such a methodology, we should first consider the value of a software asset from a commercial, rather than a purely technical perspective. The software that we create can be considered to be a tool designed to solve a specific real world problem and the value of that tool relates to the benefit generated for the customer by the resolution of their problem.
This is important because it is rare that software has intrinsic, long term value. Software exists as a transient solution to contemporary issues in the outside world. Where a commodity like gold has a fluctuating but effectively perpetual value in the marketplace, the value of software is entirely dependent upon external factors, such as the continued existence of the problem that a given program is designed to solve, and the fact that this particular software still represents the best available solution to this problem.
Because software is implicitly impacted by external factors, it should be considered a perishable asset. The problem your software is designed to solve changes slowly over time. The market your software competes in changes quite rapidly. The physical hardware and operating system that your software runs on evolve continuously. The legislative and compliance environment affecting your software changes too, as do the associated security and data protection challenges.
It is therefore critical to understand that software rots if left untouched for any period of time.
If we think about software on a project-centric basis, we make the error of considering it as a linear process from design to release and then try to optimize our process to get over the ‘finishing line’ faster.
To properly manage software assets, they must be considered in the context of an ongoing and iterative asset management loop, where ‘release’ is merely a regular, repeating stage in the lifecycle of the asset.
It is worth reiterating here, the value of software is based upon what it can actually do for customers, not what we believe it can do. If that value can shift over time due to external factors, and the asset itself is effectively invisible and therefore cannot be observed directly, we need some mechanism by which we can reliably assess the ongoing value of the asset that we own.
This problem is fractal in nature, since it recurs at multiple levels within our application. A software product is typically made up of multiple components or features which, when operating at scale, are nearly always developed and maintained by different teams. Those components are themselves built upon other components, often from teams outside your organization, and so on, down to the microcode in the processors you are running on.
Each of those components should be assumed to be partially complete. They achieve some of their design goals, but have some level of associated technical debt that has not been paid off because it is not economically viable to do so completely, or because you are doing something with it that the owners haven’t got around to testing themselves yet.
So, everything in your asset is built on stuff that is potentially a bit broken or about to break due to factors outside your awareness.
There is no ‘permanent solution’ to this problem that can be applied once and forgotten about. The only viable approach is to repeatedly rebuild your application from all of its components to ensure that it can still be built, and to repeatedly test the asset to measure that it is still meeting the factors that give it value in the marketplace. In the next section, we will look in more detail at best practice methods for achieving this effectively and reliably.
1.2 - Common mistakes
“How to be really good at Continuous Delivery AND completely destroy your company”
Before we go deeper, it is very important that we flag a common anti-pattern that can take your Continuous Delivery experience off into a catastrophic direction.
We have introduced the idea of DevOps as a process that helps you optimize delivery based upon metrics. Pick the wrong metrics, however, and you will get very efficient at being busy doing all the wrong things until you run out of resources and are shut down.
It is very tempting to align all your metrics against the delivery of features. Everyone in the organization will happily join in this activity because features are arbitrary lists of work that are easy to come up with, easy to measure progress against and usually represent things that are directly tied to milestones and bonuses. Worse, it is really, really easy to build processes that automate the end to end pipeline of delivering features and measuring their delivery. You will feel good doing it. Everyone will be busy. Stuff will happen and features will get delivered. Things will be happening faster and more efficiently than they were before. Bonuses and incentives will abound.
Then, there will be a reality adjustment. Your customers still aren’t happy. Nobody is buying. In fact, the instrumentation shows that nobody is really even using the features you shipped. Growth tanks and everyone starts saying that “Continuous Delivery doesn’t work”.
Here’s the error: Features are outputs. For Continuous Delivery to work as intended, your metrics must be based upon Outcomes, not Outputs.
The benefits of Continuous Delivery are not derived from the optimization of the engineering process within your organization. The purpose of Continuous Delivery is to optimize the delivery of those outcomes that are most important to your customers.
The bulk of the technical details of Continuous Delivery implementation can appear to sit within the engineering function, and DevOps transformations are often driven from that team. However, Continuous Delivery will only work for your organization if it is adopted as a core component of a product commercialization strategy that aligns all activities across shareholders, management, marketing, sales, engineering and operations.
The activities flowing through your Continuous Delivery pipeline should either be experiments to validate a hypothesis about customer needs, or the delivery of features that represent a previously validated customer need. All of this must be driven by direct customer interaction.
In order to successfully implement Continuous Delivery, your organization must have a structure that sets out strategic outcomes and empowers teams to discover customer needs and take solutions to them to market. This implies a level of comfort with uncertainty and a trust in delivery teams to do what is best for the customer in the context of the strategy.
In a classical business, ‘strategy’ is little more than a plan of work which is formed annually and which spells out the set of features to be delivered in the following year. All spending and bonuses are structured against this plan, leaving little room to change anything without ‘failing’ on the plan.
If an engineering team attempts to implement Continuous Delivery unilaterally within this structure, they will find themselves railroaded into using it to implement planned features by the rest of the organization. Furthermore, they will have no power to release into production because go-live processes will remain defined by other areas of the business that are still operating against classical incentives and timescales.
Broad support for organizational change at board level is necessary for a successful Continuous Delivery implementation.
1.3 - Going slower to go faster
Understanding what makes the most difference when optimizing for success
For a single person, working in isolation on a single problem, the fastest route is a straight line from point A to point B.
The classic approach to trying to speed up a team is to make linear improvements to each team member’s activities, cutting corners until they are doing the minimum possible activity to get to the desired outcome. Taken to extremes, this often results in neglecting essential non-functional requirements such as security, privacy or resilience. It is, however, an anti-pattern for more subtle and pernicious reasons: inner loops and superlinear scaling factors.
When trying to complete a task in a hurry, we tend to use a brute-force, manual approach to achieving tasks. Need to set up an environment? “Oh, I’ll just google it and cut and paste some commands.” Need to test something? “I can check that by exercising the UI.”
All these tasks take time to do accurately, but in our heads we budget for them as one-off blockers to progress that we have to push through. Unfortunately, in our heads, we are also assuming a straight-line, ‘happy path’ to a working solution. In practice, we are actually always iterating towards an outcome and end up having to repeat those tasks more times than we budgeted for.
Under these circumstances, we notice that we are slipping and start to get sloppy at repeating the boring, manual tasks that we hadn’t mentally budgeted for. This introduces a new class of errors that we also hadn’t factored into our expectations, throwing us further off track.
If these manual tasks fall inside an inner loop of our iterative build and release process, they are capable of adding considerable overhead and risk to the process. Identifying these up front and budgeting for automating them on day one pays back the investment as soon as things stray into unanticipated territory.
The benefits of this are easy enough for individuals to visualize. The next problem, however, is more subtle and more impactful.
As we have discussed previously, working with ‘invisible’ artifacts like software systems means that teams have to manage the problem of communicating what is needed, what is being done and what technical debt remains. On a team of n people, the number of possible communication paths goes up with n² − n, and this quadratic scaling factor can rapidly wipe out all productivity.
Similarly, if everyone on the team takes a unique approach to solving any given problem, the amount of effort required for the team to understand how something works and how to modify it also goes up with n².
As your product grows, customer numbers and transaction volumes scale up rapidly as well.
As a result, behaviors that worked just fine on small teams with small problems suddenly and unexpectedly become completely unmanageable after quite small increases in scale.
To operate successfully at scale, it is critical to mitigate the impact of complexity wherever possible. Implementing consistent, repeatable processes that apply everywhere gives you one constrained set of complex activities that must be learned by everyone, but which then apply to all future work. This adds a small linear overhead to every activity, but removes the risk of complexity that scales quadratically with team and product size.
The implication of this is that there is a need to adopt a unified release cycle for all assets. It helps to think about your release process as a machine that bakes versions of your product. Every asset that makes up your product is either a raw material that is an input to the machine, or it is something that is cooked up by the machine as an intermediate product. The baking process has a series of steps that are applied to the ingredients in order to create a perfect final product, along with quality control stages that reject spoiled batches.
If you have ingredients that don’t fit into the machine, or required process steps that the machine doesn’t know about, you cannot expect to create a consistent and high-quality final product.
It is worth stressing this point. If your intention is to build a product which is a software system that produces some value for customers, you should have the expectation that you will also need to own a second system, which is the machine that assembles your product. These systems are related, but orthogonal. They both have a cost of ownership and will require ongoing maintenance, support and improvement. The profitability of your product is related to the effectiveness of the machine that assembles it, as is your ability to evolve future products to identify product-market fit.
So, what are the key features of a machine that manufactures your product?
Let’s start by looking at the ingredients that you are putting into the machine. These will typically comprise the source code that your development teams have created, third party dependencies that they have selected and data sets that the business has aggregated. These are people-centric operations that carry with them an increased risk of human error and a set of associated assumptions that may or may not align to the outcomes required for your final product.
Any component at this level will have been created by an individual or small team and it should be assumed that this work was undertaken with minimal understanding of how the component will interact with the rest of your product.
This introduces some key questions that your build system must be able to validate:
- Can the component be built in your official environment?
- Does it behave the way the developer expected?
- Does it behave the way that other developers, who are customers of its service, expected?
- Does it align to functional and non-functional requirements for the product?
Let’s touch briefly on the importance of consistent environments. The most commonly heard justification within development teams is probably “Well, it works on my machine!”
The environment in which you build and test your code represents another set of dependencies that must be managed if you are to maintain consistency across your final product. It is hard to do this effectively if your environments are physical computers, since every developer’s laptop may be configured differently and these may vary significantly from the hardware used in the build environment, the staging environment and production.
Virtualization and containerization make it much easier to have a standard definition for a build environment and a runtime environment that can be used consistently across the lifecycle of the component being maintained. We will discuss this in further detail later, but your build system will require a mechanism by which to configure an appropriately defined environment in which to create and validate your source components.
To build a component from source, we need to collate all of the dependencies that this component relies upon, perform any configuration necessary for the target environment, and, in the case of compiled languages, perform the compilation itself.
This brings us to one of the harder problems in computing: dependency management. A given version of your code expects a particular version of every library, shared component, external service or operating system feature upon which it relies. All of these components have their own lifecycles, most of which will be outside your direct control.
If you do not explicitly state which version of a dependency is required in order to build your code, then your system will cease to be buildable over time as external changes to the latest version of a library introduce unanticipated incompatibilities.
If you explicitly state which version of a dependency is required in order to build your code, then your system will be pinned in time, meaning that it will drift further and further behind the functionality provided by maintained dependencies. This will include security patches and support for new operating system versions, for example.
Furthermore, depending upon the way in which your application is deployed, many of these dependencies may be shared across multiple components in your system, or between your system and other associated applications in the same environment. This can lead to runtime compatibility nightmares for customers. You should also consider that you may need to be able to run multiple versions of your own components in parallel in production, which again introduces the risk of incompatibility between shared libraries or services. For good reason, this is generally known as ‘dependency hell’ and can easily destroy a product through unanticipated delays, errors and poor customer experience.
The implication of this is that you must employ a mechanism to allow controlled dependency management across all the components of your product, and a process to mandate continuous, incremental updates that track the lifecycles of your dependencies; otherwise your product will succumb to ‘bit rot’ as it drifts further and further behind your customers’ environments.
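The trade-off between pinning and drifting can be made concrete with a toy version policy. This is an illustrative sketch, not the behavior of any particular package manager; the ‘compatible’ rule mimics the common “same major.minor, at least this patch level” convention:

```python
# Hedged sketch of the pinning trade-off: an exact pin never drifts but never
# picks up patches; a "compatible release" range tracks security patches
# while rejecting potentially breaking upgrades. Version strings and the
# policy below are illustrative.
def parse(version: str) -> tuple[int, ...]:
    return tuple(int(part) for part in version.split("."))

def satisfies(available: str, pin: str, policy: str) -> bool:
    a, p = parse(available), parse(pin)
    if policy == "exact":        # e.g. library==1.4.2
        return a == p
    if policy == "compatible":   # e.g. library~=1.4.2: >=1.4.2 but <1.5
        return a[:2] == p[:2] and a >= p
    raise ValueError(policy)

print(satisfies("1.4.7", "1.4.2", "exact"))       # patched release rejected
print(satisfies("1.4.7", "1.4.2", "compatible"))  # patched release accepted
print(satisfies("1.5.0", "1.4.2", "compatible"))  # breaking minor rejected
```

Whatever policy you choose, the essential point above stands: the policy must be explicit, applied uniformly, and paired with scheduled updates of the pins themselves.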
Decomposing your product into loosely-coupled services that can be deployed in independent containers communicating through published APIs provides maximal control over runtime dependency issues.
The remainder of our questions bring us to the topic of testing.
To verify that a component behaves the way its developer expected, our build system should be able to run a set of tests provided by the developer. This should be considered as a ‘fail fast’ way to discover if the code is not behaving the way that it was when it was developed and represents a form of regression testing against the impact of future modifications to the codebase in question. Note however that both the code and the tests incorporate the same set of assumptions made by the original developer, so are insufficient to prove the correctness of the component in the context of the product.
At this stage, it is recommended to perform an analysis of the code to establish various metrics of quality and conformance to internal standards. This can take the form of automated code quality analysis, security analysis, privacy analysis, dependency scanning, license scanning, etc., as well as automated enforcement of manual peer review processes.
To verify that a component behaves the way that other developers, who are consuming services provided by the component, expect, we must integrate these components together and test the behavior of the system against their combined assumptions. The purpose of this testing is to ensure that components meet their declared contract of behavior, and to highlight areas where this contract is insufficiently precise to enable effective decoupling between teams.
The final set of testing validates whether the assembled product does what is expected of it. This typically involves creating an environment that is a reasonable facsimile of your production environment and testing the end-to-end capabilities against the requirements.
Collectively, these activities are known as Continuous Integration when performed automatically as a process that is triggered by code being committed to product source control repositories. This is a topic that we shall return to in more detail in later chapters.
These activities provide a unified picture of asset status, technical debt and degradation over time of intellectual property. For the majority of a product team, they will be the only mechanism by which the team has visibility of progress and as such, it should be possible for any member of the team to initiate tests and observe the results, regardless of their technical abilities.
Ultimately, however, your teams must take direct responsibility for quality, security and privacy. Adopting a standard, traceable peer review process for all code changes provides multiple benefits. Many eyes on a problem helps to catch errors, but also becomes an integral part of the communication loop within the team that helps others to understand what everyone is working on and how the solution fits together to meet requirements. Done correctly and with sensitivity, it also becomes an effective mechanism for mentoring and lifting up the less experienced members of a team to effective levels of productivity and quality.
Your build system should provide full traceability so that you can be confident that the source that was subjected to static analysis and peer review is the same code that is passing through the build process, and is free from tampering.
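One simple form of that traceability is comparing content digests recorded at each stage. A hedged sketch (real systems typically rely on VCS commit hashes and signed attestations; the source snippet and variable names here are illustrative):

```python
# Hedged sketch: asserting that the source which passed review is the same
# source entering the build, by comparing content digests recorded at each
# stage. The source snippet below is illustrative.
import hashlib

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

reviewed_source = b"def handler(event): return 'ok'\n"
reviewed_digest = digest(reviewed_source)   # recorded when peer review passed

build_input = b"def handler(event): return 'ok'\n"
assert digest(build_input) == reviewed_digest, "source changed since review"
print("build input matches reviewed source")
```

Because the digest is derived from the content itself, any tampering between review and build changes the digest and fails the check.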
The desired output from this stage in manufacturing your product is to have built assets that can be deployed. Previously, this would have been executables, packages or distribution archives ready for people to deploy, but under DevOps we are typically looking at creating containerized environments containing the product, pre-configured.
This should be an automated process that leverages environment specific configuration information held in the source repository with the code that describes the desired target environment for deployment. This information is used to create a container image which holds your product executables, default configuration, data and all dependencies.
At this stage in the process, it is appropriate to apply automated infrastructure hardening and penetration testing against your container image to ensure a known security profile.
The image may then be published to an image repository, where it represents a versioned asset that is ready for deployment. This repository facilitates efficient re-use of assets and provides a number of benefits, including:
- Being able to deploy known identical container images to test and production environments alike
- Simplifying the re-use of containerized service instances across multiple products
- Making it easy to spin up independent instances for new customers in isolated production environments
- Enabling management of multiple versions of a product across multiple environments, including being able to rapidly roll back to a known good version after a failed deployment.
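The rollback benefit can be sketched as a repository of immutable versioned assets plus per-environment pointers (product names, tags and digests here are illustrative):

```python
# Hedged sketch: an image repository as a store of immutable versioned
# assets, with per-environment pointers that make rollback a re-point
# operation rather than a rebuild. Names, tags and digests are illustrative.
repository = {
    "product:1.2.0": "sha256:aaa",
    "product:1.3.0": "sha256:bbb",
    "product:1.4.0": "sha256:ccc",
}

deployed = {"staging": "product:1.4.0", "production": "product:1.4.0"}
history = {"production": ["product:1.2.0", "product:1.3.0", "product:1.4.0"]}

def resolve(env: str) -> str:
    # Environments deploy by digest, so test and production run identical images.
    return repository[deployed[env]]

def roll_back(env: str) -> str:
    # Drop the failed version and re-point at the last known good one.
    history[env].pop()
    deployed[env] = history[env][-1]
    return deployed[env]

print(roll_back("production"))  # production returns to product:1.3.0
```

Because every tag in the repository is immutable, rolling back requires no rebuild: the known good image is redeployed exactly as it was originally tested.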
As an aside to the process, it is advisable to set up scheduled builds of your product, purely to act as automated status validation of your asset health. This gives confidence against ‘bit rot’ due to unanticipated external factors, such as changes in your dependency tree or build environment that introduce build failures. In parallel with this, you must plan regular maintenance activities to update your codebase to reflect changes in external dependencies over time, so that your asset does not become stale and leave you unable to react rapidly to emergency events such as zero-day exploits that must be patched immediately.
Having a machine that can build your product is only half the answer, however. Much of the risk sits within the process of deploying the product into production and this must also be automated in order to successfully mitigate the main problems in this space.
The first issue is ‘what to deploy?’. You have an asset repository filling up with things that are theoretically deployable, but there is a subtle problem. Depending upon the way you measure and reward your development teams, the builds coming into your repository may be viable units of code that pass tests, but which don’t represent customer-ready features that are safe to turn on. In the majority of new teams, it is fair to expect that this will be the default state of affairs, but this is a classic anti-pattern.
To realize many of the benefits of DevOps, it is essential that the images that are getting into your asset repository are production-ready, not just ‘done my bit’ ready from a developer’s perspective. You need to create a culture in which the definition of done switches from ‘I finished hacking on the code’ to ‘All code, configuration and infrastructure is tested and the feature is running in prod’.
It’s not always easy to get to this point, especially with complex features and large teams, but there is a work-around that allows you to kick the problem down the road somewhat: the adoption of feature switching, or ‘feature flags’. Using this approach, you wrap all code associated with your new feature in conditional statements that can be enabled or disabled at runtime using configuration. You test your code with the statements on and off to ensure that both scenarios are safe, and create an asset that can be deployed in either state. Under these circumstances, you are protected from situations where the code is dependent upon another service that has slipped on its delivery date, since your asset is still production-ready with the new feature turned off. You are also protected from unanticipated failures in the new feature, since you can turn it off in production, or can perform comparative testing in production by only turning the feature on for some customers.
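A minimal sketch of feature switching, with purely illustrative flag and function names, might look like this:

```python
# Minimal feature-flag sketch: flags are read from configuration at
# runtime, so the same asset can be deployed with the feature on or off.
# All names here are illustrative, not from any specific flag library.

FLAGS = {"new_checkout": False}  # default: new feature disabled


def set_flag(name, enabled):
    """Runtime configuration change, e.g. enabling for some customers."""
    FLAGS[name] = enabled


def checkout(cart_total):
    if FLAGS.get("new_checkout", False):
        # New code path, shipped dark and enabled per customer.
        return round(cart_total * 0.9, 2)  # stand-in for new logic
    # Old, proven code path remains the default.
    return cart_total
```

Because both paths are tested, the asset is production-ready regardless of the flag state, which is the property the text above relies on.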
Given this, the next issue is ‘when to deploy?’. At some point, you need to make a go / no go decision against the deployment of a new asset. This should be based upon a consistent, audited, deployment process that is automated as much as possible.
In safety-critical environments, such as aircraft or operating theaters, checklists are used to ensure that the right actions are taken in any given scenario, especially when there is pressure to respond urgently to immediate problems. Flight crews and theater staff are drilled in the use of checklists to minimize the chances of skipping essential activities when distracted by circumstances. Your product build system must automate as many of these checklist activities as possible to ensure that key actions happen each and every time you make a release.
This is also a good place to automate your regulatory compliance tasks so that you can always associate mandated compliance activities with an audited release.
The deployment decision should be managed under role-based access control, with only nominated individuals being authorized to initiate a deployment. Remember that if someone manages to breach the system that builds your product, they may be able to inject malicious code into your asset repository by manipulating your tests, so you must take precautions to ensure that there are clear reasons for new code passing into production.
This brings us to the ‘what, specifically?’ of deployment. The assets in your repository are typically re-usable services that are bundled with default configurations that have been used for testing but now you need a concrete instance of this service, in a given environment, against a specific set of other service instances, for a specific customer or application. Your deployment process must therefore include an appropriate set of configuration overrides that will define the specific instance that is created. This is where you must deal with the feature switching can that you kicked down the road earlier.
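The layering of default and instance-specific configuration can be sketched as follows; the keys, environments, and merge strategy are illustrative assumptions, not a prescription:

```python
# Sketch of instance configuration: defaults baked into the image are
# layered with environment- and customer-specific overrides at deploy
# time. Later layers win; nested dictionaries are merged recursively.


def merge_config(defaults, *overrides):
    result = dict(defaults)
    for layer in overrides:
        for key, value in layer.items():
            if isinstance(value, dict) and isinstance(result.get(key), dict):
                result[key] = merge_config(result[key], value)
            else:
                result[key] = value
    return result


# All keys and values below are invented for illustration.
defaults = {"db": {"host": "localhost", "pool": 5},
            "features": {"new_checkout": False}}
prod_env = {"db": {"host": "db.prod.internal"}}
customer = {"features": {"new_checkout": True}}

instance_config = merge_config(defaults, prod_env, customer)
```

The final dictionary defines the concrete service instance, including the feature-flag state deferred from earlier in the process.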
Finally, we get to the ‘how?’ of deployment. Ideally, you want a deployment process that is untouched by human hands, so that you can guarantee a predictable and repeatable process with known outcomes. You can use procedural scripts to deploy your asset, but generally it is better to use a declarative approach, such as GitOps, where you maintain a versioned description of how you would like your production environment to look and changes committed to this description trigger the system to do whatever is necessary to bring the production environment into line with the desired state.
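A toy reconciliation loop in the spirit of this declarative approach might look like the following, with the desired state, live state, and action names all invented for illustration:

```python
# Toy GitOps-style reconciliation: the desired state is a versioned
# description (service name -> version), and the controller converges
# the live environment toward it, rather than running imperative
# deploy scripts.


def reconcile(desired, live):
    """Return the actions needed to bring `live` in line with `desired`."""
    actions = []
    for name, version in desired.items():
        if live.get(name) != version:
            actions.append(("deploy", name, version))
    for name in live:
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply(actions, live):
    for action in actions:
        if action[0] == "deploy":
            _, name, version = action
            live[name] = version
        else:
            live.pop(action[1], None)
    return live


desired_state = {"web": "1.4.2", "api": "2.0.0"}
live_state = {"web": "1.4.1", "worker": "0.9.0"}
live_state = apply(reconcile(desired_state, live_state), live_state)
```

Committing a change to the desired-state description is what triggers the convergence; no human touches the environment directly.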
Remember the ‘pets vs cattle’ model of environments. If you have a problem with infrastructure, it is far better to kill the instance and create a fresh one automatically than to try to tinker with it manually to make it healthy again.
As part of this process, you will want to have automated validation of successful deployment, and automated recovery to the last known good state in the event of a failure.
You are trying to create a culture and a mechanism within which small units of functionality are incrementally deployed to production as frequently as possible, with the minimum of human input. This is Continuous Deployment.
Tying all this together, your goal is to build a product delivery engine that enables Continuous Delivery of features supporting your product discovery goals, at a cadence aligned to the metrics discussed earlier, thus maximizing your chances of commercial success within the constraints of your available runway and capacity.
In subsequent sections, we will dive more deeply into the specifics of each of these challenges.
1.4 - AI & Machine Learning
MLOps: Models are assets too
Many products include machine learning as a technology component, and the process of managing machine learning in production is usually referred to as MLOps. There are, however, wildly differing views as to what this means in practice in this nascent field.
A common misunderstanding is to treat machine learning as an independent and isolated discipline with tools optimized purely for the convenience of data science teams delivering stand-alone models. This is problematic because it takes us back to development patterns from the pre-DevOps era, where teams work in isolation, with a partial view of the problem space, and throw assets over the fence to other teams, downstream, to deploy and own.
In reality, the machine learning component of a product represents around 5-10% of the effort required to take that product to market, scale and maintain it across its lifespan. What is important is managing the product as a whole, not the models or any other specific class of technology included within the product. MLOps should therefore be seen as the practice of integrating data science capabilities into your DevOps approach and enabling machine learning assets to be managed in exactly the same way as the rest of the assets that make up your product.
This implies extending the ‘machine that builds your product’ to enable it to build your machine learning assets at the same time. This turns out to have significant advantages over the manual approach common in data science teams.
Firstly, your data science assets must be versioned. This includes relatively familiar components like training scripts and trained models, but requires that you also extend your versioning capability to reference explicit versions of training and test data sets, which otherwise tend to get treated as ephemeral buckets of operational data that never have the same state twice.
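One hedged sketch of treating a data set as a versioned asset is to derive an explicit version identifier from its content, so a training run can pin the exact data it used (the records and hashing scheme here are illustrative):

```python
# Sketch of pinning a training data set to an explicit version by
# content hash, so the exact data used for a training run can be
# referenced from version control like any other asset.
import hashlib
import json


def dataset_version(records):
    """Deterministic, order-independent digest over a set of records."""
    canonical = json.dumps(sorted(records, key=json.dumps), sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


train_v1 = [{"x": 1, "y": 0}, {"x": 2, "y": 1}]
train_v2 = train_v1 + [{"x": 3, "y": 1}]  # adding data changes the version
```

Because the identifier is derived from content, the same data always yields the same version, which is exactly the property an ‘ephemeral bucket of operational data’ lacks.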
Your training process should be automated, driven by training scripts that are themselves managed assets with automated acceptance tests. Keep in mind that the models you are producing are not optimal blocks of code that can be debugged, but rather approximations to a desired outcome that may be considered fit for purpose if they meet a set of predefined criteria for their loss function. Useful models are therefore discovered through training, rather than crafted through introspection, and the quality of your models will represent a trade-off between the data available for training, the techniques applied, the tuning undertaken to hyper-parameters, and the resources available for continued training to discover better model instances.
Many of these factors may be optimized through automation as part of your build system. If your build system creates the infrastructure necessary to execute a training run dynamically, on the fly, and evaluates the quality of the resultant model, you can expand your search space and tune your hyper-parameters by executing multiple trainings in parallel and selecting from the pool of models created.
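The parallel-training idea can be sketched with a toy stand-in for the training job; the loss function and hyper-parameter here are invented purely to show the select-from-a-pool pattern:

```python
# Sketch of hyper-parameter search as the build system might run it:
# launch trainings in parallel, score each resulting model, and promote
# the best from the pool. The "training" is a toy stand-in function.
from concurrent.futures import ThreadPoolExecutor


def train(learning_rate):
    # Toy stand-in: pretend the loss is minimized near lr = 0.1.
    loss = abs(learning_rate - 0.1)
    return {"lr": learning_rate, "loss": loss}


def search(candidate_lrs):
    with ThreadPoolExecutor() as pool:
        models = list(pool.map(train, candidate_lrs))
    return min(models, key=lambda m: m["loss"])


best = search([0.001, 0.01, 0.1, 0.5])
```

In a real system each `train` call would provision training infrastructure on the fly; the selection logic, however, has exactly this shape.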
A big part of successfully managing machine learning assets lies in having the ability to optimize your utilization of expensive processing hardware resources, both during training and operational inferencing. Manually managing clusters of VMs with GPU or TPU resources attached rapidly becomes untenable, meaning that you can accrue large costs for tying up expensive resources that aren’t actually being utilized for productive work. Your build system needs to be able to allocate resources to jobs in a predictable fashion, constraining your maximum spend against defined budgets, warning of excessive usage and enabling you to prioritize certain tasks over others where resources are constrained in availability.
It is important to be aware that beyond trivial examples that can be run in memory on a single computing device, much of machine learning sits in the domain of complex, high-performance, distributed computing. The challenge is to decompose a problem such that petabytes of training data can be usefully sharded into small enough chunks to be usefully processed by hardware with only gigabytes of RAM, distributed into parallel operations that are independent enough to significantly reduce the elapsed time of a training. Moving that much data across thousands of processing nodes in a way that ensures that the right data is on the right node at the optimal time is a conceptual problem that humans are poorly suited to optimizing and the cost of errors can be easily multiplied by orders of magnitude.
Consideration should be given to the forward and reverse paths in this product lifecycle. Your build process should seek to optimize the training and deployment of versioned models into production environments, but also to enable a clear audit trail, so that for any given model in production, it is possible to follow its journey in reverse so that the impact of incidents in production can be mitigated at minimal cost.
On the forward cycle, there are additional requirements for testing machine learning assets, which should be automated as far as possible. Models are typically decision-making systems that must be subject to bias detection and fairness validation, with specific ethics checks to ensure that the behavior of the model conforms to corporate values.
In some cases, it will be a legal requirement that the model is provable or explainable, such that a retrospective investigation could understand why the model made a given decision. In these cases, it should be expected that the evolutionary lifecycle of the model will include the need to be able to back-track through the training process from an incident in production, triggering retraining and regression testing to ensure that mistakes are corrected in subsequent releases.
Models also require security and privacy evaluation prior to release. This should take the form of adversarial testing where the model is subjected to manipulated input data with the intent of forcing a predictable decision or revealing personally identifiable training data in the output. Note that there is always a trade-off between explainability and privacy in machine learning applications, so this class of testing is extremely important.
The build system must be able to appropriately manage the synchronization of release of models and the conventional services that consume them, in production. There is always a problem of coupling between model instances and the services that host and consume inference operations. It should be expected that multiple versions of a given model may be deployed in parallel, in production, so all associated services must be versioned and managed appropriately.
Note that in some geographic regions, it is possible for customers to withdraw the right to use data that may comprise part of the training set for production models. This can trigger the need to flush this data from your training set, and to retrain and redeploy any models that have previously consumed this data. If you cannot do this automatically, there is a risk that this may be used as a denial of service attack vector upon your business, forcing you into cycles of expensive manual retraining and redeployment or exposing you to litigation for violations of privacy legislation.
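A simple sketch of the bookkeeping this requires, assuming you record which data set versions fed each production model (all names below are invented), might look like:

```python
# Sketch of handling a data-withdrawal request: given the lineage of
# which data set versions fed each production model, find the models
# that must be retrained once the withdrawn records are flushed.


def models_to_retrain(model_lineage, withdrawn_dataset_versions):
    withdrawn = set(withdrawn_dataset_versions)
    return sorted(
        model for model, datasets in model_lineage.items()
        if withdrawn.intersection(datasets)
    )


# Illustrative lineage: model name -> data set versions used in training.
lineage = {
    "churn-v3": ["ds-2023-01", "ds-2023-06"],
    "churn-v4": ["ds-2023-06", "ds-2024-01"],
    "fraud-v1": ["ds-2022-09"],
}
```

Without this lineage recorded automatically, each withdrawal request becomes a manual investigation, which is the cost the paragraph above warns about.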
2 - Team culture
Why culture is the most important factor in successful adoption
“Any organization that designs a system (defined broadly) will produce a design whose structure is a copy of the organization’s communication structure”
— Melvin E. Conway
Consider a traditional organization. There will most likely be a hierarchy of decision-making with a small group disseminating strategic decisions to a range of teams or departments for implementation. Each department will have a single function, such as sales, marketing, legal, development, testing, operations etc. and all will compete with each other annually for budget and headcount. Within each department, staff will compete against each other for career advancement opportunities within the department.
Annually, strategic goals are set for the business, outputs are defined for each department, budgets granted and bonuses set against the successful implementation of these outputs.
This model has been around for centuries, so what could possibly be wrong with it?
Well, it operates based upon the assumption that markets change linearly over very long timescales so you can rely upon a few percent of growth per year based upon an annual cadence of strategic input.
This assumption is however invalid in an exponentially changing, technology-driven economy where massive disruption of existing business models can happen in months and true ‘Digital’ businesses leverage machine learning to change business models on a minute-by-minute basis, tracking the behavior of markets in real time.
Under these circumstances, the culture of a traditional organization turns against itself because structure, metrics and incentives are all inappropriate. Staff will resist reacting to unanticipated market changes because their bonuses are tied to annual strategic outputs. Departments will compete rather than collaborate because of their budget structure. Teams will become increasingly risk-averse because change takes them further from the comfortable predictability of the expected small, linear future.
Now consider a high-growth, technology business. This will typically be a much flatter structure where there is a product-centric mission and individuals are rewarded through the increasing value of their shareholding. Cross-functional teams are structured based upon product line or customer need.
Strategic goals are set in the form of outcomes, objectives and key results, with clearly defined metrics that can be collected and evaluated daily or weekly. Goals are updated iteratively based upon the results of experiments run in live environments with customers. Resources are allocated based upon a spread of longer-term ‘bets’ and shorter term results against objectives.
In many cases, decision-making is now offloaded to machine learning models trained to react to changing market behaviors.
So, how does this impact the delivery process?
Well, in a traditional business, the product lifecycle starts with a strategic guess about a customer need. A team then creates a detailed specification for a product to fulfill that need. This specification goes through legal approval and sales and marketing teams start work on promoting the product to customers whilst a development team works on building the product to specification. The output of this development is then tested by another team and passed to operations for deployment to production. It may be a year or more before a customer starts to interact with the product and you discover whether there is any demand for it.
The end result is a very long feedback loop, with the product passing between many silos inside the organization, with each team expected to come up to speed from cold before progressing the delivery. If the original guess proves to be wrong, or the market changes significantly during development, the cost can be fatally high.
The approach in a high growth business is very different. A product manager will make a hypothesis about a potential customer need, based upon conversations with actual or potential customers. They define a minimal experiment to validate this hypothesis with one or more test customers, create a set of metrics by which the outcome will be evaluated and raise funds for running the experiment for a defined time period. A cross-functional team then does all of the work necessary to put this experiment live in a customer-facing environment, collects usage data according to the previously identified metrics and tracks progress daily. This process may evolve into a sequence of iterative experiments that demonstrate product-market fit, or result in a rapid decision to abandon this feature in favor of another approach. Only products demonstrating strong product-market fit are funded to be scaled out.
Under this approach, feedback is very rapid and the organization is able to learn very quickly from the trial and error process whilst minimizing the downside risk. This is particularly important in cutting edge fields where there is no prior experience or best known method to build from.
How is this relevant?
Continuous Delivery is a methodology designed to optimize the delivery process within high growth organizations. It is the answer to the questions “How do we minimize the time it takes to go from a hypothesis about a need to testing and validating that hypothesis in production?” and “How do we make mistakes safely and cheaply so that we can have the fastest possible learning experience without blowing up the business?”
If you are a traditional business in a mature and stable market, you likely have no need for Continuous Delivery and could end up spending a lot of effort implementing processes that yield minimal benefits within your organization.
If you are a tech start-up, you must follow the iterative product commercialization approach if you are to grow fast enough to successfully exit. Continuous Delivery is going to be very beneficial to you and you have many advantages that will help you to implement it. You have a team that is incentivized to build a product that customers love because the value of their equity depends upon it. Your team is not big enough to have developed departmental silos that can’t easily be broken up and restructured to align to the needs of your customers. Your priority is to work out what is the cheapest set of experiments that can be run to demonstrate the existence or lack of product-market fit within the available runway, so this encourages a culture of rapid experimentation with early adopters.
If you are a traditional business, undertaking a ‘digital transformation’, you face the steepest challenge in implementing Continuous Delivery successfully. You may be encountering disruptive competition from digital businesses entering your market and recognize that they are responding to customer needs more rapidly than you because you are only able to deliver change once or twice per year.
It may not be clear to everyone in your business that this means that the clock is now ticking down on a window of time in which you must respond before a competitor takes your most valuable customers. To be able to compete, you must transition to being better than your competitors in running the product commercialization strategy that they are using against you, but this means a major change to your corporate culture.
It is essential to start with communicating the required changes in culture before attempting to implement Continuous Delivery so that you have the correct incentives and policies in place before implementation. If you do not do this, you will typically face a scenario where ‘corporate antibodies’ will react to sabotage the change. Experience shows that DevOps is not a capability that can be acquired and ‘bolted-on’ to an existing delivery team. To succeed in this, it is necessary to break down all of the silos in the organization and create cross-functional product teams that follow a DevOps methodology and are aligned to your customer needs and empowered to deliver. This means not just breaking down departmental boundaries, but also restructuring the way you measure performance and incentivize productivity.
Looking more closely at potential problems, a traditional business often follows a ‘command and control’ model of central decision making with teams following orders in narrowly defined roles. This can create a culture of “only following orders”, “more than my job’s worth”, “our paymasters say we do X, so we do X” and “not my problem”. This can lead to bizarre situations where everyone knows that there are huge gaps in a solution but nobody owns the work to fill the gaps and nobody dares inform the leadership team.
For a high growth product commercialization strategy to work, it is necessary for all teams to be directly empowered to do whatever is necessary (within the bounds of ethics and brand values) to deliver a product that customers want. To do this, it is necessary to create a culture of extreme personal responsibility, where everyone on the team understands that it is expected that they will ensure that all aspects of the delivery are appropriately addressed and communicate any gaps or risks in a timely manner.
To succeed in this, you must first move visibly away from a ‘blame’ culture. Leveraging people’s mistakes as a tool for career advancement is a common strategy in traditional businesses. This behavior will kill your transformation stone dead because it represents the opposite of the desired mindset.
The fastest way to learn how to solve a problem well is to make a bunch of mistakes that give you valuable information about the true nature of the problem, survive the mistakes and then leverage the learning to build a better solution. To do this well, you will need to make a large number of small mistakes whilst avoiding making any fatal ones. In order for that to happen, your team must feel safe to acknowledge errors and communicate the learnings that come with them widely.
This level of psychological safety can be created with a culture of frequent communication that flags risks and blockers as a daily activity, where ‘shooting the messenger’ is clearly rejected as inappropriate and where teams pull together to recover from challenges then hold retrospective investigations to ensure that processes are modified to avoid the same problem causing impact in the future.
It is itself a mistake to try to build a team of ‘perfect’ developers who always produce flawless code. Anywhere that there is a human in the loop, you must plan for a certain number of errors to creep in. Expecting perfection actually creates a culture of risk-aversity and prevents your team from having sufficient learning experiences to make breakthroughs. Instead, you must build psychological safety by a combination of automated verification and peer review to catch problems during the release process.
Similarly, if the Continuous Delivery process is being adhered to correctly, and reviews and tests are passing, team members must be empowered to put changes into production. A classic cause of failure is to have a ‘continuous’ process within engineering, but then require paper sign-off from multiple departments before deliveries can go live.
Much of the literature on Continuous Delivery focuses upon engineering activities, but it is important to realize that, much like fish being unaware of water, these texts are usually written from the perspective of engineers who are swimming in an ecosystem of product commercialization which has become second nature. For other organizations, however, your job is first to fill the pool, so that the engineering parts of the journey can flow smoothly.
3 - Version Control
Organize and manage changes to your software code, configuration, and other related versionable content.
- Why Version Control
- Types of Version Control
Version control is the practice of tracking and managing changes to the files that make up a software system. These files can include source code, build information, configuration files, or any other file necessary for assembling or running a software system.
Why Version Control
The primary purpose of version control is to track changes to files in your system over time in order to ensure the best-practice principles of reproducibility and traceability of your software.
Reproducibility is knowing you can go back in history to any moment in your software’s development life cycle and have all the information necessary to recreate your software as it existed at a particular point in time.
Traceability means that for any existing deployment of a software system, all of the necessary information about that system’s provenance is known. Having traceability allows you to compare what is in place to what should be in place.
Practically speaking, without the use of version control, any kind of legitimate automation to enable continuous integration or continuous deployment is impossible.
Types of Version Control
There are numerous types of version control, though only two are truly appropriate for production software.
Centralized System: All files are managed by a single service, including any history of changes, and the service tracks status of in-progress efforts. Access management lists of contributors are also controlled by the service. Examples of centralized version control systems include Subversion or Perforce.
Distributed System: While there is usually a primary version of a given project, development against that project is done by making a complete copy, or clone, of the whole project into a new environment. This includes the full history of changes up to that point. Each copy, or clone, of a project can also maintain its own list of contributors independently. Examples of distributed version control systems include Mercurial and Git.
4 - Continuous integration
Best practices for continuous integration
- Why it matters
- Associated DORA capabilities
- Key stakeholders
- Best practices
- Relationship to other practices
Why Continuous Integration
Continuous Integration ensures that coding changes are merged, compiled/linked, packaged and registered frequently, avoiding broken builds and ensuring a clean release candidate is available continuously.
Continuous Integration, the CI in CI/CD, is the practice of combining code changes frequently, where each change is verified on check-in.
Examples of verifications:
- Code scanning
- Building and packaging
Description and Scope
Minimizing broken builds due to incompatible coding changes is the purpose of the continuous integration process. Historically, project teams would have a ‘sync-up’ process, which generally meant checking in all of your coding updates and hoping that the build runs. An unsung hero called the Build Manager created the build script, which would include merging and pulling all of the coding updates based on ‘tags’ and then getting the build to run.
This ‘sync-up’ step was performed once a week, every two weeks or monthly. It was not unusual for the build manager to put in a solid 8-12 hours to get the build working. The resulting build was often referred to as a ‘clean’ build which created an available release candidate. This process meant you would only have a release candidate to pass to testing on a very low frequency basis, which in turn slowed down the entire application lifecycle process. Builds were always the bottleneck.
Continuous integration changed the way the build (merge, compile/link, package and register) step was implemented. By triggering the process on a check-in of code to the versioning system, continuous integration quickly identified if a developer broke the build when they introduced new code. In essence, the process of merging, compiling and linking code on a high frequency basis allows for the continuous integration of coding changes as soon as possible. If the build breaks due to a coding update, the build is easier to fix with smaller incremental changes, versus the ‘sync-up’ method where dozens of possible coding changes impacted the build leaving it to the build manager to sort out all of the problems - a tedious and onerous process. And more frequent builds meant that testing got more frequent updates, truly the beginning of ‘agile’ or incremental software updates.
The process of triggering the ‘build’ is sometimes referred to as the continuous build step of the CI process. It is in this step that all of the software configuration management is performed, which includes determining what code to pull, what libraries to use, and what parameters must be passed to the compilers and linkers. The ‘build’ step of CI is triggered by a source code ‘check-in’ event. It then executes a workflow process to run the build script and create, package, and register the new binaries, thereby producing a new release candidate.
As CI matured, so did the process around the central theme. CI workflows were created by developers to not only run the build script, but also deploy the new binaries to a development endpoint followed by running automated testing. In addition, code and library management steps were added to the build improving the quality and security of code, and ensuring the usage of the correct transitive dependencies, open source licenses, and security scans were done during the creation of the release candidate.
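The check-in-triggered flow described above can be sketched as a toy pipeline; the step names and stub functions are illustrative only:

```python
# Toy continuous-integration trigger: a check-in event runs the whole
# pipeline (merge, compile, test, package), and a failing step marks
# the build broken immediately, identifying exactly where it failed.


def run_pipeline(steps):
    completed = []
    for name, step in steps:
        if not step():
            return {"status": "broken", "failed_at": name,
                    "completed": completed}
        completed.append(name)
    return {"status": "clean", "completed": completed}


def on_checkin(commit, steps):
    """Entry point invoked by the version control check-in event."""
    result = run_pipeline(steps)
    result["commit"] = commit
    return result


# Stub steps standing in for real build tooling.
steps = [
    ("merge", lambda: True),
    ("compile", lambda: True),
    ("test", lambda: True),
    ("package", lambda: True),
]
```

Because each check-in runs the full flow, a break is attributed to one small change rather than a week of accumulated updates.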
In your build step, it is important to understand what code is being pulled from version control to be assembled into a discrete deliverable. For this reason, there should be clear best practices defined for managing branches.
Build by Branch - for every build, a branch is referenced. The branch name is passed to the build step to determine what to pull for the compile/link.
Build by Tag - a Tag has been applied to all objects in the repository and the build pulls the code based on the Tag for the compile step. The Tag is a collection of objects that relate together.
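A hedged sketch of how a build step might resolve its source reference under these two conventions (the repository contents are mocked in memory, and all names are invented):

```python
# Illustrative resolver for the two branching conventions above: a
# build request names either a branch or a tag, and the build step
# checks out exactly that reference.


def resolve_ref(request, repo):
    if "tag" in request:
        kind, name = "tags", request["tag"]
    else:
        kind, name = "branches", request["branch"]
    if name not in repo[kind]:
        raise KeyError(f"unknown ref: {name}")
    return repo[kind][name]  # the commit the build will compile from


# Mock repository: ref name -> commit identifier.
repo = {
    "branches": {"main": "c0ffee1"},
    "tags": {"release-1.4": "deadbee"},
}
```

Whichever convention you adopt, the build step must be told an unambiguous reference so that the pulled source is fully determined.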
Build Best Practices
Build Work Products
Regardless of what type of build is executing, it should produce 3 basic outputs.
- A build should create not only the binaries, but also the application package (MSI, zip, rpm, or container image), named according to a version schema that relates back to the versioning Tag.
- A full Bill of Material report should be required at minimum for all production releases. BOM reports are key to debugging issues if needed. A BOM report should show:
- All source code included in the build.
- All libraries or packages, internal and external used in the link.
- All compile/link parameters used to define the binaires.
- Licensing of external components and transitive dependencies.
- Every build should include a Difference report. A Difference report shows what changed between any two builds. This should be used for approving updates before a release to testing or production environments. A Difference report should be generated based on a comparison of two BOM reports. Difference reports can be pulled from the version repository, but may be incomplete as objects such as third party libraries are not pulled from a version repository.
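The comparison described above can be sketched in a few lines. This is a minimal illustration, not a complete BOM tool; the component names and versions are invented for the example, and a real BOM would also carry licenses and build parameters as listed above.

```python
def diff_boms(old_bom, new_bom):
    """Compare two BOM dicts ({component: version}) and report changes."""
    added = {c: v for c, v in new_bom.items() if c not in old_bom}
    removed = {c: v for c, v in old_bom.items() if c not in new_bom}
    changed = {c: (old_bom[c], new_bom[c])
               for c in old_bom.keys() & new_bom.keys()
               if old_bom[c] != new_bom[c]}
    return {"added": added, "removed": removed, "changed": changed}

# Illustrative BOMs from two consecutive builds
bom_41 = {"libssl": "1.1.1", "app-core": "2.3.0", "left-pad": "1.0.0"}
bom_42 = {"libssl": "3.0.2", "app-core": "2.3.1", "json-utils": "0.9.0"}

report = diff_boms(bom_41, bom_42)
```

A report like this gives an approver a concrete answer to "what exactly changed?" before promoting a release candidate.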
Do not use wildcard includes
When defining where code, packages, and libraries are to be found in the build process, do not use wildcard includes. Instead, list by name every file that needs to be compiled or linked into the binary objects. While this may seem like a lot of extra work, it is essential to ensure that only approved objects end up in the resulting binary. If you use wildcard includes, non-approved objects will almost certainly be delivered with your binary, which can be a hidden security risk. It also means your binary includes unnecessary objects, which can substantially increase its size.
Know Your Build Parameters
Build parameters determine how the resulting binaries are created. For example, the use of debug flags allows the binaries to include debug symbols that can be exploited and should not be deployed to production environments.
The CI build should include publishing the binaries to an appropriate binary repository. The binary repository should be used to persist the binaries and share them with the continuous deployment step of the Continuous Delivery pipeline. These binaries should be tagged with the corresponding version control 'tag.'
Docker Specific Best Practices
A multi-stage Docker build is the process of moving your CI steps into a container build. It is called 'multi-stage' because it aggregates all of the needed steps into a single Docker build of the image. For example, stage one runs the compile/link steps to create the artifacts; stage two copies the artifacts into a runtime image, and stage one is discarded. The benefit of a multi-stage Docker build is an airtight environment in which the process is executed: the container holds all objects, with no possibility of an external change. Notably, the multi-stage approach lets you move your entire CI build process into a single container build. Best practices related to each of those steps should still be followed. This option minimizes external updates to the build step, making it a best-practice candidate.
5 - Continuous deployment
Best practices for continuous deployment
Continuous Delivery and Deployment
Continuous Integration ensures that commits to a code base are validated in an automated way. This led to the concept of Continuous Delivery, where the build output is published as an artifact (or set of artifacts) that can be deployed via a manual process. Continuous deployment takes this even further: those artifacts can be deployed automatically because of the attestations the software pipeline provides about their correctness. Put another way, once developers have automated the creation and validation of build artifacts any time code is updated, the logical extension is to automate the deployment of those updates as well.
As developers and testing teams became more efficient at producing release candidates, production teams were asked to move the new updates forward to end users. However, the production teams had different deployment requirements and often used 'operations' tooling to perform releases. The scripts that drove deployments for development and testing were not accepted by the teams managing the production environments. This began a culture shift. We saw the rise of 'Site Reliability' engineers, individuals who work at an operational level but are assigned to development teams. This started a conversation about automating the continuous deployment step of the DevOps pipeline, and shifted the focus from continuous integration to a repeatable continuous deployment step integrated into the continuous delivery orchestration. To support what the operational side of the house needed, it became apparent that automated tooling specific to deployments was required. In particular, solutions serving the auditability and change management of production endpoints were required to build a DevOps pipeline that truly served the needs of both sides of the equation. The deployment automation category was born.
Continuous deployment is an approach where working software is released to users automatically on every commit. The process is repeatable and auditable.
Description and Scope
The need to automate deployments grew out of the continuous integration movement. Developers automated deployments from their CI workflows using a simple deployment script to update their development environments for unit testing. Initially the scripts were just a copy command. As the industry evolved, the need to recycle web servers and tweak environment configurations were added to the scripts. The deployment step began to become more and more complicated and critical. Testing teams became more dependent on developers to perform testing releases. In many ways, this need evolved a simple CI workflow into a Continuous Delivery workflow, automating the update to testing upon a successful unit test in development. Now one workflow called another workflow and we began the journey into continuous delivery.
Once the unit testing was complete, the need to push the update to testing and production drove the evolution of automating deployments to include broader management of the deployment process with the goal of deployment repeatability across all stages. While continuous deployments had been embraced by developers and testers, production teams were not willing to accept updates on a high frequency basis. Operation teams, with the goal of maintaining a stable production environment, have a culture of being risk averse. In addition, the deployment needs of production are consistently different from the needs of development and testing. Creating a single platform for managing deployments across the lifecycle pipeline became the goal of the continuous deployment movement.
Continuous deployments can be viewed in two ways, a push process or a pull process. A push solution updates environments upon a call from the continuous delivery orchestration engine. A pull solution, such as GitOps, manages deployments based on a ‘state’ defined by configuration data in a deployment file stored in an ‘environment’ repository. An operator running in an environment monitors the state by referencing the deployment file and coordinates the updates. In either case, a new update ‘event’ triggers the Continuous Delivery process to perform an action. That action can push a deployment, or create a pull request to update a deployment file to a repository. The outcome is the same, a consistent repeatable deployment process is achieved.
The deployment process must be repeatable across all stages of the pipeline. To achieve repeatability, values that are specific to an environment should be separated from the deployment tasks. This allows the logic of the deployment to remain consistent, while the values change according to the endpoint.
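One way to picture this separation is a single deployment template rendered with per-environment values. This is a simplified sketch, not a real deployment tool; the hostnames, replica counts, and artifact name are invented for illustration.

```python
import string

# One deployment task definition, reused unchanged across every stage.
DEPLOY_TEMPLATE = string.Template(
    "deploy ${artifact} to ${host} with replicas=${replicas}"
)

# Environment-specific values, kept apart from the deployment logic.
ENVIRONMENTS = {
    "dev":  {"host": "dev.example.internal",  "replicas": "1"},
    "prod": {"host": "prod.example.internal", "replicas": "6"},
}

def render_deployment(env, artifact):
    """Merge environment values into the shared template."""
    values = dict(ENVIRONMENTS[env], artifact=artifact)
    return DEPLOY_TEMPLATE.substitute(values)

cmd = render_deployment("prod", "shop-api:2.3.1")
```

Because only the value table varies between stages, the same deployment logic that was exercised in development and testing is the logic that runs in production.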
Automation to Reduce One-Off Scripting
Continuous deployment requires the ability to scale quickly. This means that the reliance on deployment scripts can impede scaling of your release process. To avoid the reliance on scripts, the process should include a set of reusable tasks, components and functions that can define a templated approach to deployments.
A logical view of your endpoints - their use, ownership, and capabilities - is essential for defining your release landscape and creating a reference for automated deployments. Reporting on environment configurations is required to surface the differences between any two environments, a process needed for debugging when a deployment does not perform as expected relative to a previous environment.
Approval and Approval Gates
Depending on your specific vertical, approvals of releases to testing and production environments may be required. Highly regulated markets require a separation of duties, which means restricting access to certain stages in the application lifecycle such as testing and production. If you are highly regulated, your release strategy should include a method of notifying approvers and recording their approval when a new release moves to a particular location.
Release Coordination and Auditing
Tracking and coordinating activities across both automated and manual steps in the deployment process is needed for a clear understanding of what occurred. In addition, all activities, manual or automated, should include an audit log showing who performed an update, and when and where it occurred. This level of information can be used for resolving an incident, and serves the purposes of audit teams in highly regulated industry segments.
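A minimal sketch of the who/when/where record described above (the actor and environment names are illustrative; a production system would write to an append-only, tamper-evident store rather than an in-memory list):

```python
from datetime import datetime, timezone

audit_log = []

def record_deployment(who, where, artifact, manual=False):
    """Append an audit entry capturing who, when, where, and what."""
    entry = {
        "who": who,
        "when": datetime.now(timezone.utc).isoformat(),
        "where": where,
        "artifact": artifact,
        "manual": manual,
    }
    audit_log.append(entry)
    return entry

entry = record_deployment("release-bot", "prod-cluster-1", "shop-api:2.3.1")
```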
The location of any artifact deployed to any location in an environment should be recorded. Understanding what is running in any environment is essential for maintaining a high level of service and quality. The inventory tracking should allow for viewing and comparing from the point of view of an artifact to all locations where the artifact is installed. From the environment view, the tracking should show all artifacts across all applications that are deployed to the environment.
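The two views of inventory tracking mentioned above can be expressed as simple lookups over deployment records. This is a toy sketch with invented artifact and environment names, intended only to show the artifact-centric and environment-centric queries side by side.

```python
# Deployment inventory: (artifact, environment) records.
inventory = [
    ("shop-api:2.3.1", "dev"),
    ("shop-api:2.3.1", "prod"),
    ("billing:1.0.0",  "prod"),
]

def environments_for(artifact):
    """Artifact view: every environment where this artifact is installed."""
    return sorted(env for art, env in inventory if art == artifact)

def artifacts_in(environment):
    """Environment view: every artifact deployed to this environment."""
    return sorted(art for art, env in inventory if env == environment)
```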
Calendar and Scheduling
For larger enterprises where approvals and policies are needed for a release to occur, a calendar and scheduling process should be used. The calendar should be defined based on the environment and allow for collaboration between development teams, testing teams and production teams showing when a release is scheduled or requested.
The continuous deployment process should be free from manual changes. This requires all release metadata and logic to be maintained in an immutable state, so that the deployment can be re-executed with the assurance that no manual touches occurred.
Canary deployments, blue/green deployments and rolling blue/green deployments are common methods of ‘testing’ a release to mainly production environments. The continuous deployment process should support the various deployment models required by production teams.
Push Vs. Pull
In reality, all deployments are ‘push’ deployments. Even in a GitOps methodology, a push drives a pull request. In GitOps a deployment is initiated by committing a deployment definition (.yaml file) to an environment repository. All other best practices should be applied even to a Pull GitOps process.
Automation of the deployment may require specific guardrails depending on the environment. Policies should be defined to allow the automation process to incorporate standard ‘rules’ around any specific deployment that align with the organizational culture.
6 - Continuous testing
Automated testing helps you to consistently check your software and find problems faster.
Why Continuous Testing
Find issues fast
Continuous testing gives you the fastest possible feedback on the status of your application enhancements. Test feedback goes directly to those in the best position to remediate any issues found - the developers. The longer an issue exists in the code (or design) of a given work item, the more costly it becomes to resolve. Finding issues as soon as they arise avoids unnecessary rework while maintaining a high quality standard.
Minimize Slow, Manual Testing
Shift-left testing avoids the need for long, drawn-out, and error-prone manual test processes at the end of the development cycle. Manual testing can instead focus on exploratory aspects, increasing the testable knowledge of the system being developed and applying the newly gained knowledge to further automated test development.
Reduce Business Risk
Stakeholders' expectations must be continuously met. Finding gaps in the system or feature late in the development cycle can cause undue setbacks. Continuous testing identifies these gaps as actionable items that can be prioritized early in the development cycle.
Improved Test Effectiveness/Quality
Continuously using automated tests bolsters the need for high quality, high value tests. Unnecessarily long feedback loops can occur where too many low value tests are being run as part of continuous testing. Removing the low value tests or improving the effectiveness of tests is commonplace in continuous testing due to the high return on investment. This ultimately leads to faster development cycles with a higher quality standard.
Developer On-boarding/System Documentation
It is often the case that key knowledge about how a product or feature works may largely reside within individuals, making them possible bottlenecks or points of failure for fixing bugs or making additions in a timely manner. A strong and reliable automated test system ensures that any developer has the ability to safely make changes in code they may not have a strong background in, acting as a safety net to ensure incompatible changes are not introduced. It also acts as a form of developer documentation for how the system actually works in a validatable way.
Getting fast feedback on the impact of changes throughout the software delivery lifecycle is the key to building quality into software. Testing is not something that we should only start once a feature is “dev complete.”
Because testing is so essential, we should be continuously performing all types of testing as an integral part of the development process.
Not only does this help teams build (and learn how to build) high-quality software faster; DORA’s research shows that it also improves software stability, reduces team burnout, and lowers deployment pain.
All types of automated testing (unit, integration, acceptance, etc.) should be run against every commit to version control to give developers fast feedback on their changes. Developers should be able to run automated tests as much as possible on their workstations to triage and fix defects. Testers should be performing exploratory testing continuously against the latest builds to come out of the CI stage.
Allow testers to work alongside developers throughout the software development and delivery process. Note that “tester” is a role, not necessarily a full-time job. Perform manual test activities such as exploratory testing and usability testing both as a way to validate existing automation and also identify opportunities to extend test coverage. Many products, especially those with a complex graphical interface, will not be able to achieve 100% automated test coverage, so it’s important to have strong domain expertise in testing strategies.
Testing in Production
Being able to successfully and safely test in production requires a significant amount of automation and a firm understanding of best practices.
The best simulation of the production environment is the production environment itself. There are several reasons why a "staging" environment is not always a good substitute:
- The size of the staging cluster often differs; sometimes it is a single machine standing in for a cluster.
- Configuration options (load balancers, databases, queues, etc.) for every service may differ from production, for example because of cost.
- Monitoring for the "staging" environment is often lacking. Even where monitoring exists, several staging monitoring signals can be completely inaccurate, given that one is monitoring a "different" environment than the production environment.
Techniques for testing safely in production include:
- During release: canary releasing, feature flagging, monitoring
- Post-release: profiling, distributed tracing, chaos testing, A/B testing
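As one illustration of the canary technique, a deterministic hash of a user identifier can route a small, stable slice of traffic to the new release. This is a sketch only; the function name and the way traffic is bucketed are assumptions, not a standard API.

```python
import hashlib

def canary_bucket(user_id, canary_percent):
    """Deterministically route a stable slice of users to the canary build."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 100 // 256   # stable value in 0..99 per user
    return "canary" if bucket < canary_percent else "stable"
```

Because the bucket is derived from a hash rather than a random draw, the same user always sees the same version, which keeps monitoring signals for the canary population consistent.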
Continuous Testing Landscape
7 - Managing the Software Supply Chain
Best practices for managing your software supply chain.
When we produce a software product, that product does not stand in isolation. We must always consider all of the dependencies that we inherit from other providers, and our own role as a dependency to our customers, and their customers in turn.
If we consider the flow of dependencies across this landscape as a 'software supply chain', it becomes easier to recognize the fractal nature of the problem space, where similar challenges repeat at different scales throughout the supply chain. By applying consistent responses to those challenges, we can strip away unnecessary complexity and improve the overall quality and reliability of our systems.
We look at some common, repeating patterns in the sections below.
7.1 - Licensing
Best practices for managing intellectual property in the software supply chain.
Software is a form of intellectual property and, in most jurisdictions, the creator of a piece of software is granted various legal rights with respect to the use and replication of this software by others. Typically, the assertion of these rights will be communicated by issuing a software license to consumers of this product.
Any individual or organization violating the terms of a license may be liable to prosecution, hence it is critically important to understand, for every dependency in your solution, that you have the right to legally use that component for the purposes that you intend.
In the case of commercial software, you are required to pay for a license, the price of which may be linked to the number of intended users, the number of instances, or the number of CPUs used to execute it. This payment may be a one-off fee, a recurring charge, or relative to usage. This implies the need to regularly confirm that licenses are still valid.
In the case of Free, Open Source software, the creator has chosen to grant a license without asking a fee. This does not mean that they are not asserting their rights, however, and it is normal for this class of software to come with attached and very specific constraints upon the type of use which is permissible. Again, you must be sure that you have permission to use all of your inherited dependencies in the way that you intend for your product.

Some licenses contain 'copyleft' restrictions that are viral in nature. These generally grant permission to use, modify and redistribute a component, but only upon the condition that any derived works inherit the same conditions, and/or contribute the source for their modifications back to the community. It is critical to understand the terms granted by every dependency, since you may be inheriting a legal obligation that conflicts with your intended usage or commercial model. Legal precedent now exists to show that this obligation applies at a contractual level when consuming Open Source software and is not merely an expression of copyright.
Some software falls into the class of Public Domain, where the creator has waived, or lost, all their rights, and this software may be used freely without consequence. It is not, however, safe to assume that software distributed without a license is in the public domain.
To manage this challenge safely, it is necessary to maintain a Software Bill of Materials (SBOM) that lists all of the components that make up your product, including all inherited dependencies. For each line item in this SBOM, it is essential to identify the type of license involved, the nature of its terms, and any proof of transaction for commercial products. Whilst it is possible to do this by hand, it is an extremely labor-intensive operation that a legal team will usually insist upon prior to every deployment. As such, it is really only viable to use automated scanning as part of your build pipeline.

SBOM components that contain machine-readable licenses greatly simplify this process. The alternative is brute-force text scanning to try to identify licenses within each component and parse each license for sensitive terms, which can be expected to produce multiple false positives and negatives. A practical approach is therefore to utilize a component library service that maintains a history of component-library relationships, manually curated at an organization-wide level, to mitigate the effort involved in tracking this information. This also facilitates the use of policy-based rules for license management.
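A policy-based license check over an SBOM might look like the following sketch. The policy table, component names, and license outcomes are all invented for illustration; a real pipeline would consume a standard SBOM format and an organization-approved policy source.

```python
# Organization-wide policy table: license identifier -> allowed?
LICENSE_POLICY = {
    "MIT": True,
    "Apache-2.0": True,
    "BSD-3-Clause": True,
    "GPL-3.0-only": False,   # illustrative: copyleft terms conflict with this product's model
    None: False,             # no license detected: never assume public domain
}

def license_violations(sbom):
    """Return SBOM entries whose license is missing or disallowed by policy."""
    return [item["name"] for item in sbom
            if not LICENSE_POLICY.get(item.get("license"), False)]

sbom = [
    {"name": "fastjson", "license": "MIT"},
    {"name": "cryptolib", "license": "GPL-3.0-only"},
    {"name": "orphaned-util"},   # no license information found
]
violations = license_violations(sbom)
```

Note that an unknown license falls through to "not allowed": failing closed is the safer default when the pipeline cannot classify a component.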
Now let’s look at the downstream implications. Your product will need to present a license to your customers (and that license may need to inherit clauses from your dependencies). You should provide with your software an easily understood description of the license, the full legal statement of the detail of the license and a machine-readable version of the license such as those maintained by the OSI or Creative Commons. You should also publish the SBOM for your product where your license permits, to simplify recursive dependency management.
If your product is commercial in nature, you will find that licensing can be a costly and resource-intensive overhead. A key element for start-ups in particular is to do as much as possible to automate the on-boarding process for your product, so that your customers can easily integrate your product into their process without manual intervention on your part. This applies equally to paid software modules, or Software-as-a-Service offerings. Simplifying your sales process into something that aligns with your customer’s build pipelines will increase conversion rate and stickiness to an extent that can outweigh a more aggressive, tiered pricing approach requiring a dedicated sales team.
7.2 - Lifecycle
Best practices for managing the lifecycle of your software supply chain.
We often make the mistake of thinking about software development as a linear process with a start and an end: gather some requirements, write some code, ship it, move onto something else. In reality, we are building an asset and the value of that asset is intrinsically linked to its ongoing viability across the lifecycle of the product. We must always think in terms of a circular process of iterative improvement and maintenance as part of our Continuous Delivery process. Software left untouched, starts to rot. The business environment changes, the global marketplace changes, regulations change, new market segments open, so requirements change and our software must be updated in response.
Now consider this in the context of the software supply chain. The same is true of all of your dependencies, and of your customer’s products in turn. Every entry in your SBOM is a product with a lifecycle and your product is an entry in someone else’s SBOM. As a result, as soon as you release a version of your product, it is becoming stale because the dependencies it relies upon are changing. Many of these changes will just represent minor bug fixes or new features that don’t impact you directly, but many will be security fixes for active vulnerabilities, or breaking changes to APIs you rely upon, caused by changing demands upon these products.
First and foremost, you must always ensure that you use explicitly declared versions of every dependency, so that your SBOM remains consistent for a given release. You must also version each component and API in your product so that you are able to introduce change over time without impacting your customers negatively. If you reference a dependency by ‘latest’ rather than an explicit version, you may experience random breaking changes in production if those dependencies are dynamically loaded.
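A build pipeline can enforce the explicit-version rule by rejecting floating references before the build starts. This sketch assumes a simple name-to-version manifest and treats only exact semantic versions as pinned; the dependency names and version strings are illustrative.

```python
import re

# A version counts as "pinned" here only if it is an exact semantic
# version, not a floating tag like "latest" or a range like "^3.0.0".
PINNED = re.compile(r"^\d+\.\d+\.\d+$")

def unpinned_dependencies(manifest):
    """Return dependencies whose declared version is not an exact pin."""
    return sorted(name for name, version in manifest.items()
                  if not PINNED.match(version))

manifest = {
    "web-framework": "4.2.1",
    "json-parser": "latest",      # floating: may change under you
    "tls-library": "^3.0.0",      # range: resolution can drift between builds
}
risky = unpinned_dependencies(manifest)
```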
Now consider the forces at work across the supply chain. All of your dependencies are either slowly evolving in response to external stimuli, or they have been abandoned and are becoming stale. You must have a strategy and roadmap for updating your system to align to the changes coming from upstream, which involves maintaining your SBOM, enhancing your code to adapt to breaking changes and running your build pipelines regularly to check for unanticipated external factors that break the system.
Beyond this, you must also maintain your own code in response to changing requirements and urgent events such as security issues.
All this has an impact upon your customers. You are free to publish a new release at any time, but your customers may not be able to change their environments at the same pace that you can. In the ideal world, it would be Continuous Delivery all the way down, with everyone updating every few hours in response to changing pressures. In practice, however, there is significant latency across the supply chain and this can lead to tension in the system, and even fracture under some circumstances. If your customers are slower than you to adapt to change then you may need to be able to support multiple versions of your product in production simultaneously, but this quickly increases your overheads and leaves you potentially exposed when change is triggered by security concerns. If you are slower to adapt than your dependencies, you may find that support for key components is withdrawn before you are ready and you end up in a state where you can no longer build your product without major remedial actions.
It is essential to recognize this pattern of waves of change that ripple up and down the supply chain and invest in your Continuous Delivery process to mitigate the impact upon your part of the process.
7.3 - Regulatory Compliance
Best practices for ensuring compliance in your software supply chain.
In many cases, you may be operating in an environment that requires that you conform to predefined standards of compliance to corporate and/or government regulation.
In a pattern similar to that discussed in licensing, you will therefore have to inspect all inherited dependencies in your SBOM to confirm that you are compliant. An example of this would be the International Traffic in Arms regulations (ITAR) which restrict and control the export of defense and military related technologies and covers certain classes of software asset that may not be exported to some countries. In a similar manner, it is therefore necessary to include a step in the build pipeline that scans for non-compliant articles. This usually requires a combination of text searches and centrally managed, policy-driven lookup tables.
Compliance always implies external audit, so your pipeline steps must produce audited output that can be preserved for the lifetime of the product and presented to internal and external audiences for validation. It should be assumed that this data may at any point become required as part of a legal defense, so forensic cleanliness should be observed at all times, generating immutable records with appropriate levels of tamper-proofing and non-repudiation built in.
7.4 - Security
Best practices for securing your software supply chain.
Any software assets that your product depends upon can potentially be used to introduce vulnerabilities into your product, or to spread those vulnerabilities to your customers. It is critical therefore, to take appropriate action to mitigate this risk, otherwise you may potentially be held liable for damages caused to your customers as a result of a subsequent breach, or incur irreparable reputational damage to your product or business.
You must be able to identify all software assets that your product depends upon, and be able to establish the ongoing provenance of each of these assets.
You must have systems in place to allow you to identify and immediately act upon tampering attempts within your software supply chain.
You must be able to secure the infrastructure that you use to create your product, detect and act upon attempts to subvert your build process by attackers.
You should be able to provide strong assurances of the integrity of your build process as provenance to your customers.
The following are examples of potential attack vectors which must be secured against:
Individuals may attempt to deliberately introduce vulnerabilities directly into your source code. In the Open Source space, this may be in the form of adversarial pull requests containing malicious code hidden within feature submissions or bug fixes. In commercial software, this may take the form of insider attacks or revenge from disgruntled employees. Multiple peer reviews of code submissions should be used to mitigate this risk, with careful consideration for social engineering that undermines this people-process, such as sock-puppet review accounts.
Attackers may seek to inject code into your source by compromising your source code repository. This may be in the form of attacks upon outsourced service providers, permitting access to your codebase, or supply chain attacks upon the integrity of the software assets used to implement self-hosted source code repositories. You should be auditing all changes to the codebase and correlating these against known work-stream activities in order to identify and act upon unexpected changes.
By subverting your build infrastructure, it is possible for attackers to substitute alternate source code during your build process. Mitigating against this requires a chain of provenance that can be established independent from an individual build step to enable tampering to be detected. The security of your build process should receive similar investment to the security of your end product as one depends entirely upon the other.
A bait-and-switch attack can involve the introduction of an initially innocuous dependency which is then later replaced with a malicious payload, upstream of your build process. Under these circumstances, a one-off code review and dependency evaluation is insufficient to detect the attack. Instead, it is necessary to validate the provenance of the dependency during each build so that the substitution may be found as soon as it occurs.
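Per-build provenance validation can be as simple as comparing each dependency against the digest recorded when it was originally reviewed. This is a minimal sketch; the filename and digest store are invented, and real systems typically use signed attestations rather than a bare checksum table.

```python
import hashlib

# Digests recorded when each dependency was first reviewed and approved.
APPROVED_DIGESTS = {
    "logging-helper-1.4.0.tar.gz":
        hashlib.sha256(b"reviewed-contents").hexdigest(),
}

def verify_dependency(filename, contents):
    """Fail the build if a dependency no longer matches its approved digest."""
    actual = hashlib.sha256(contents).hexdigest()
    expected = APPROVED_DIGESTS.get(filename)
    if expected is None:
        raise RuntimeError(f"{filename}: no recorded provenance")
    if actual != expected:
        raise RuntimeError(f"{filename}: digest mismatch, possible substitution")
    return True
```

Because the check runs on every build rather than once at review time, a later substitution of the payload is caught as soon as it occurs.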
Similarly, it can be possible to substitute assets that live in image repositories, post-build, but which are subsequently utilized for deployments. Thus, the item deployed is not the one built and tested. Mitigation requires the extension of your provenance system into your production deployment process, or to your customer’s build and deployment process. The same attack can apply to third party mirror sites for common dependencies used dynamically during deployment.
Denial of service attacks can be employed by attackers by removing key dependencies from your upstream supply chain, or overloading network resources that provide access to these components. This risk can be mitigated by caching copies of these assets locally, but be aware that this brings with it all of the challenges associated with maintaining the validity of cached data, which can include other security vulnerabilities such as failure to detect and react in a timely manner to new vulnerabilities reported against existing, cached dependencies (zero-day CVEs etc.), or the use of cache-poisoning to inject vulnerabilities.
Static and dynamic security analysis tools can be used as part of the build process but it is important to recognize that these also form part of your software supply chain dependencies and are themselves also potentially vulnerable to attack. Meta-analysis currently shows that no single security analysis tool on the market provides comprehensive identification of all currently known vulnerabilities, so it is essential to adopt a defense-in-depth strategy, leveraging multiple, parallel solutions to maximize your chances of early identification of problems.
Be aware that corporate and state-level firewall solutions may deliberately substitute or inject changes into assets which are downloaded through these routes and subsequently consumed by build processes. The same may apply to the software assets that you supply to your customers.
It is particularly important to consider the security of infrastructure used to train ML models: it is now practical to inject vulnerabilities into models that are mathematically hard to detect with analytical testing, but that can be enabled remotely by feeding specific data into production systems through the front door.
Appropriate implementation of security best practice usually requires some degree of separation of concerns. If we consider the three aspects of ‘maintaining the product codebase’, ‘maintaining the build and release governance process’ and ‘maintaining the audit trail’, it is clear that anyone who has access to all three is empowered to subvert the process without detection. As such, it is advisable to limit the number of people who are required to have full visibility and access to all of these aspects of your continuous delivery process, and to ensure that additional audit processes exist to monitor changes that span all areas.
Consider the Supply-chain Levels for Software Artifacts (SLSA) specification for supply chain security, which may be found at slsa.dev.
8 - Configuration management
Automating configuration changes for consistency, auditability, and reliability
Why Configuration Management
Configuration management maintains a record of the state of software and runtime environments, including changes over time. It tracks the many attributes and conditions of how software is assembled and how environments are defined, and records updates so that systems can be consistently maintained through a history of state changes.
Configuration management is the task of tracking, reporting and controlling changes in software and IT infrastructures.
Description and Scope
Configuration management has a wide scope tying together changes in software and infrastructure updates, managing external packages, and tracking microservices across clusters.
Configuration management addresses changes to both software and infrastructure. In both traditional and cloud native environments, understanding how artifacts are assembled and how infrastructures are defined provides critical information to site reliability engineers and operations teams for maintaining required service levels.
Concepts in configuration management began as Software Configuration Management (SCM), or 'change management.' In traditional development, SCM is performed at build time during the CI build step. Version control tools are used to manage SCM, tracking source code changes with revision history and driving what is included in the software build step. It is at this point that decisions are made about which versions of source code, transitive dependencies, packages, and libraries are used to create an artifact. Changes are tracked via bill of materials reports, difference reports, and version control revision history.
The use of library management solutions for scanning code and managing transitive dependencies became an essential configuration management step as developers began incorporating more external open source libraries into their software. Library management is used to track the configuration of these external libraries (packages), perform security management and license compliance of open source libraries, and pull any dependencies needed to compile code. Even in a cloud native architecture, library management is critical for building compliant container images.
Similarly, infrastructure configuration management tracks changes applied to servers, VMs, and clusters. The IT Infrastructure Library (ITIL) movement drove the adoption of centralizing IT data in a Configuration Management Database (CMDB), a central store of information about the overall IT environment, such as the components used to deliver and run IT services, along with asset relationships. The adoption of CMDBs was impeded by the need to statically define configuration updates and changes, which quickly made the CMDB out of date. This led to the creation of the 'service catalog,' which performs service discovery and 'collects' information, rather than relying on manual collection after a change has been made. However, service catalogs generally track production environments, not development and test.
Configuration management became a reactive step to updates instead of proactive. A better method was required. This led to the ‘configuration as code’ and ‘state’ management movement. Tools like Chef and Puppet were created to define a particular environment ‘state’ driven by a ‘manifest.’ The ‘manifest’ defined the ‘state’ and all corresponding environments were automatically updated to that state. Changes in the configuration could be tracked by the differences in the manifest. In essence, this is the core of the GitOps practice where a deployment .yaml serves as the manifest and the GitOps operator manages the ‘state.’ The manifest is configuration as code.
Microservices add a new layer of configuration challenges. Because each microservice is a small function deployed independently, the traditional CI practice of compiling and linking code into a single static binary is lost. For this reason, new methods are required for tracking microservices and their usage, ownership, relationships, inventory, key values, and versions across clusters.
Lessons from the past instruct us to begin automating the configuration management and change tracking of the full IT stack. Adding automated configuration management into the CD pipeline will become increasingly important as companies decompose applications into hundreds of independent functions running across dozens of clusters.
Core to configuration management is understanding the relationships between objects. The ability to easily see relationships between artifacts, objects and environments is essential. Dependency management enables the view of the relationships from an artifact to a version of an application, from application version to the artifacts that compose it and from an environment to the application versions that run in the environment.
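A simple way to support both directions of that view is to record one side of the relationship and derive the other by inverting it. The application and artifact names below are hypothetical; a real system would populate these relationships from its build and deployment metadata.

```python
from collections import defaultdict

# Hypothetical relationships: application version -> artifacts composing it.
APP_ARTIFACTS = {
    "shop-1.4": ["auth-lib-2.0", "cart-svc-0.9"],
    "shop-1.5": ["auth-lib-2.1", "cart-svc-0.9"],
}

def artifact_to_apps(relationships: dict) -> dict:
    """Invert app->artifacts so we can answer the reverse question:
    which application versions consume a given artifact?"""
    index = defaultdict(list)
    for app, artifacts in relationships.items():
        for artifact in artifacts:
            index[artifact].append(app)
    return dict(index)
```

With both views available, answering "which applications are affected if this artifact changes?" becomes a single lookup rather than a search.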
Your software build or container image build process must also resolve transitive dependencies. This step is needed to identify vulnerabilities, license usage, and policy conformance in the open source libraries on which your software depends.
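Resolving the transitive closure amounts to walking the direct-dependency graph until no new packages appear. The sketch below shows the traversal in isolation, with a hypothetical in-memory graph standing in for the metadata a real library management tool would fetch.

```python
def transitive_dependencies(package: str, graph: dict) -> set:
    """Enumerate every transitive dependency of a package.

    graph maps each package to its direct dependencies. The returned
    set is the full set that must be scanned for vulnerabilities and
    license compliance, not just the direct dependencies.
    """
    seen, stack = set(), [package]
    while stack:
        for dep in graph.get(stack.pop(), []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen
```

Note that the traversal tolerates shared and cyclic dependencies because each package is visited at most once.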
Understanding the impact a single source code change has on the consuming applications is critical in understanding the risk level of an update. In a cloud native environment, the impact of a microservice update is only exposed at run-time. The ability to view the impact or blast radius prior to a release will minimize the runtime incidents in both monolithic and microservice architectures. In addition, when changing the infrastructure, a clear understanding of the potential risk of an update is needed.
Variables relate to how the binaries, containers and runtime environments are configured. Whether you are running in a traditional server based environment or cloud native environment, environment variables need to be managed. The same artifact or application can execute differently depending on these variables. Key value pairs and configuration maps should be versioned based on the environment. This data should be stored in a central location such as a catalog or configuration server. Comparison of key values between environments is needed for debugging purposes.
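The comparison of key values between environments can be as simple as a keyed diff that reports values that differ or exist on only one side. The environment names and keys below are illustrative only.

```python
def compare_configs(env_a: dict, env_b: dict) -> dict:
    """Report configuration keys that differ between two environments.

    Returns key -> (value_in_a, value_in_b); a missing key appears
    as None. Keys with identical values are omitted, leaving only
    the drift that matters when debugging.
    """
    diff = {}
    for key in sorted(env_a.keys() | env_b.keys()):
        a, b = env_a.get(key), env_b.get(key)
        if a != b:
            diff[key] = (a, b)
    return diff
```

Run against, say, staging and production, this kind of report answers "why does it work there but not here?" before anyone starts reading logs.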
Centralized Configuration Data
Configuration data should be stored and managed in a central location. Querying the centralized configuration metadata provides critical insight into what has occurred, or what may occur, supporting impact analysis and exposing potential problems before an update. As machine learning techniques mature, a centralized data store also has the potential to help predict the risk of a change.
Configuration reporting should include comparison reports, relationship reports, and configuration changes with an audit log. The audit log should show details such as when, why, what, and who drove changes over time. Difference, or 'comparison,' reports are needed to expose precisely what was updated and how two environments or two application versions differ. This level of detail is essential for maintaining site reliability and minimizing incidents.
Domain Driven Design (DDD) and Service-oriented Architecture (SOA)
While not immediately apparent, there are particular configuration requirements for a service-oriented architecture (SOA) that can significantly simplify the change management process. Minimizing the sprawl of services across environments reduces the complexity of tracking changes and improves the overall reuse of services. Domain-Driven Design (DDD) is a method for defining an SOA and is critical for organizing reusable services. DDD should be incorporated into the configuration management strategy to reduce redundancy and improve quality.
9 - Assess your current state
Assess the maturity of your continuous delivery in your organization.
9.1 - Value stream mapping
An approach to identifying waste and bottlenecks in current processes so that you can work to improve the way your teams work.
Value stream mapping is a tool for documenting end-to-end process across an
organization and identifying what needs to change.
9.2 - DevOps capabilities
An approach to assessing your organization against the technical, process, measurement, and cultural capabilities that research shows drive software delivery performance.
The DevOps Research and Assessment (DORA) program has identified a set of capabilities that predict software delivery and organizational performance. Assessing your teams against these capabilities can help you identify which practices to prioritize.
9.3 - Software supply chain security
An approach to assessing the security of your software supply chain and identifying where it is vulnerable to attack.
Software we all build and use has many dependencies, including internal
dependencies, third-party vendor software, and open source software. And each
dependency has its own chain of dependencies.
We also need to trust the software tools and infrastructure we use to
develop, build, store, and run software. And we need to trust that members of
our teams are following secure practices.
As a result, there are many places where a software supply chain can be
vulnerable to attacks.
Supply-chain Levels for Software Artifacts (SLSA) is a framework for assessing
and improving the security and integrity of your software supply chain. It
maps best practices to requirements for each defined level of security maturity.
10 - Domain-specific practices
Practices specific to a domain or industry.
- Why it matters
- Associated DORA capabilities
- Key stakeholders
- Best practices
- Relationship to other practices