Where to Start?
- 1: Understanding the Problem
- 2: Common mistakes
- 3: Going slower to go faster
- 4: AI & Machine Learning
1 - Understanding the Problem
Software development at scale is an activity within the broader context of Product Commercialization.
Our goal is not to build software, but to build a product that leverages software or machine learning to solve the problem that the product addresses.
The problems of product commercialization drive the approach that we use to optimize software delivery. Let’s start by understanding these problems.
It is often assumed that product development looks like this:
- Have a brilliant idea for a product
- Build the finished product
- Sell the product
- Profit!
In practice, your idea is actually an assumption about what might represent a product, and at that stage you have no evidence to show that anyone wants to buy it.
You could follow the flow above, but if you are going to invest, say, $10M in the process of developing your product, you are effectively gambling that $10M on the chance that there will be enough customers to return your investment and make a profit.
Best practice in product commercialization looks more like this:
- Have 1,000 ideas
- Evaluate and rank each idea for potential merit to find the best
- Take this idea and find a handful of customers who have the problem it solves
- Work iteratively with your customers to build a series of experiments that test the usefulness of your idea at the minimum viable scale
- If your customers get excited and start offering to pay for the product, invest in scaling to a full implementation; otherwise
- Stop. Pick the next idea from your list and go back to step 3
This ensures that you are only investing heavily in activities that have strong evidence of product-market fit.
The trouble is, even with this process, less than one in ten product ideas will succeed in the marketplace, so it is very important to optimize this process to run as lean as possible so that the cost of running your learning experiments is manageable within the reserves of investment available to you.
What does this mean from a software development perspective?
Well, in the first model, you would expect to receive a full set of requirements up front and could carefully optimize a design to take account of everything that the product needs to be able to do. Then this could be decomposed into features and given to teams to implement against specifications.
In the lean model, all you have to start is a set of assumptions about what a customer’s problem might be. In this case, the role of the development team is actually to help design and build experiments that validate (or invalidate) each assumption, in conjunction with real customers.
In the first model, you expect to spend a year or so making mistakes and learning about the problem domain in private, before a big public launch of a polished application.
In the lean model, you have to do your learning in front of the customer, regularly delivering new experimental features into live production environments.
This learning takes the form of an iterative loop:
- Propose a hypothesis about what feature might be of value to the customer
- Design and develop an experiment to test this feature in the customer’s environment
- Run the experiment with the customer and evaluate
- Adjust the experiment and re-run; or
- Move to the next experiment; or
- Shut down the product
Of course, you can expect to have a finite budget to invest in learning experiments, and your customers will have a finite attention span, so there is a hard limit on how many times you can iterate through this loop before failing out. As a result, it is crucial to optimize this loop so that you can afford as many iterations as possible to maximize your chance of discovering a feature set that demonstrates product-market fit.
This means that we can derive two critical metrics that will assist us in staying focused upon optimizing our product commercialization process at all times:
Deployment Frequency
Your ability to execute the above loop is constrained by the frequency at which you can deploy your product into production. If it takes six months to move a feature into production, you will run out of money and customers long before you get to run more than that one experiment. If it takes you a week to deploy, you probably won’t be able to run enough experiments to discover the right product.
This implies the need to optimize the entire delivery process to be able to move features into production on a cadence of days to hours.
Lead Time
Understanding the time it takes from the point of identifying a new hypothesis you wish to test, to the point at which an experiment to test it has been deployed in production, helps to define the granularity of the experiments which you are able to run. If it is going to take a team three months to implement an experiment, then consideration should be given to reducing the scope of the experiment, or splitting the problem into smaller components that can be developed in parallel by multiple feature teams.
Optimizing the delivery process to consistently reduce lead times increases the capacity to run more experimental iterations within your available runway.
Working in this iterative fashion requires a fundamental conceptual shift across delivery teams. Instead of assuming that “the product must be perfect, eventually”, it is necessary to assume that failures will happen regularly and, as a result, to become very good at fixing things fast.
This should not be interpreted to mean ‘quality is lower’. Realistically, the idea of building products that don’t break is a naive oversimplification. Instead, we must consider that things can break; we should anticipate the ways in which they might be expected to break and mitigate the risk associated with breakages.
An unanticipated breakage that takes a product down for a week can kill a business. A temporary outage of one feature of a product is merely a minor annoyance, and a failed feature deployment that is safely rolled back after failing smoke tests will probably go unnoticed.
This allows us to derive two more key metrics:
Time to Restore
Optimizing the time between identifying an incident and recovering from the incident mitigates the risk exposure during the failure. Typically, this encourages us to decompose our designs into small footprint components that are easily testable and have limited impact in the case of failure. Where mistakes are made, limiting the scope of the impact of those mistakes reduces risk and the time needed to recover.
Change Failure Rate
Tracking the percentage of changes that lead to an incident in production gives us a proxy metric for quality. This should in turn lead to appropriate root cause analysis. If many of the changes we make to our product are causing issues, do we properly understand the problem our product is designed to address? Does the product development team understand the product they are building? Is our delivery process prone to errors?
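As a concrete (and deliberately simplified) illustration, the sketch below shows how these four metrics could be computed from basic deployment and incident records. The record structure and field names are assumptions made for the example, not a prescribed schema, and it assumes at least one deployment and one incident in the reporting window.

```python
# Minimal sketch: computing the four delivery metrics from simple deployment
# and incident records. Field names are illustrative, not a prescribed schema.
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Deployment:
    commit_time: datetime      # when the change was committed
    deploy_time: datetime      # when the change reached production
    caused_incident: bool      # did this change trigger a production incident?

@dataclass
class Incident:
    detected: datetime
    restored: datetime

def delivery_metrics(deployments: list[Deployment],
                     incidents: list[Incident],
                     window_days: int = 30) -> dict:
    # Assumes non-empty lists; a real implementation would handle empty windows.
    lead_times = [d.deploy_time - d.commit_time for d in deployments]
    restore_times = [i.restored - i.detected for i in incidents]
    return {
        "deployment_frequency_per_day": len(deployments) / window_days,
        "mean_lead_time_hours": mean(t.total_seconds() / 3600 for t in lead_times),
        "mean_time_to_restore_hours": mean(t.total_seconds() / 3600 for t in restore_times),
        "change_failure_rate": sum(d.caused_incident for d in deployments) / len(deployments),
    }
```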
Why this focus on metrics, though?
Well, this is another major challenge for technology product development. In traditional engineering disciplines, such as civil engineering or aerospace, you are building something physical. If, every day, your product under development looks more like a bridge or an airplane, everyone on the team can understand that you are heading in the right direction and get a feel for progress. If your airplane starts to look more like an elephant, you can stop and reassess your activities to ensure you are on track.
Now imagine that you are building an invisible aircraft out of thousands of invisible components.
You can’t see the components being built, or the product being assembled from them. You have a team of hundreds of people working on the product and are relying entirely upon blind faith that they all know what they are doing and are all building compatible parts because the first time you will be able to evidence that will be on the maiden flight with the customer…
Surely, nobody in their right minds would buy an aircraft from a company that had an engineering process that looked like that?
Of course, the reality is that this is exactly how a lot of organizations approach the development of software products and goes a long way to explaining why 70% of software projects fail and 20% of those fail badly enough to threaten the viability of the parent company.
A core component of successful software delivery lies in making progress visible within an intrinsically invisible domain.
Small teams tend to do this naturally using communication. Many adopt basic Agile rituals that involve daily stand-ups to communicate issues, progress and blockers. At a small scale, this is viable. However, the number of possible communication paths scales with n² - n (where n is the number of people on the team). As a result, once you get beyond about six people on a team, they are spending the majority of their time on communication activities rather than developing software.
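To make that scaling concrete, here is a quick back-of-envelope illustration (plain Python, purely for the arithmetic) of how the number of directed communication paths grows with team size.

```python
# Directed communication paths between n team members: n^2 - n, i.e. n(n - 1).
def communication_paths(n: int) -> int:
    return n * (n - 1)

for n in (3, 6, 12, 24):
    print(f"{n} people -> {communication_paths(n)} paths")  # 6, 30, 132, 552
```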
The implication of this is that for larger products, it is insufficient to follow our natural instincts or to apply basic engineering project management techniques from other disciplines. We need to apply a methodology specifically designed to mitigate the risks associated with working with intangible assets.
Before we look at the details of such a methodology, we should first consider the value of a software asset from a commercial, rather than a purely technical perspective. The software that we create can be considered to be a tool designed to solve a specific real world problem and the value of that tool relates to the benefit generated for the customer by the resolution of their problem.
This is important because it is rare that software has intrinsic, long term value. Software exists as a transient solution to contemporary issues in the outside world. Where a commodity like gold has a fluctuating but effectively perpetual value in the marketplace, the value of software is entirely dependent upon external factors, such as the continued existence of the problem that a given program is designed to solve, and the fact that this particular software still represents the best available solution to this problem.
Because software is implicitly impacted by external factors, it should be considered a perishable asset. The problem your software is designed to solve changes slowly over time. The market your software competes in changes quite rapidly. The physical hardware and the operating system that your software runs on evolve continuously. The legislative and compliance environment affecting your software changes too, as do the associated security and data protection challenges.
It is critical therefore to understand that software rots, if left untouched for any period of time.
If we think about software on a project-centric basis, we make the error of considering it as a linear process from design to release and then try to optimize our process to get over the ‘finishing line’ faster.
To properly manage software assets, they must be considered in the context of an ongoing and iterative asset management loop, where ‘release’ is merely a regular, repeating stage in the lifecycle of the asset.
It is worth reiterating here, the value of software is based upon what it can actually do for customers, not what we believe it can do. If that value can shift over time due to external factors, and the asset itself is effectively invisible and therefore cannot be observed directly, we need some mechanism by which we can reliably assess the ongoing value of the asset that we own.
This problem is fractal in nature, since it recurs at multiple levels within our application. A software product is typically made up of multiple components or features, which, when operating at scale, are nearly always developed and maintained by different teams. Those components are themselves built upon other components, often from teams outside your organization, and so on, down to the microcode in the processors you are running on.
Each of those components should be assumed to be partially complete. They achieve some of their design goals, but have some level of associated technical debt that has not been paid off because it is not economically viable to do so completely, or because you are doing something with it that the owners haven’t got around to testing themselves yet.
So, everything in your asset is built on stuff that is potentially a bit broken or about to break due to factors outside your awareness.
There is no ‘permanent solution’ to this problem that can be applied once and forgotten about. The only viable approach is to repeatedly rebuild your application from all of its components to ensure that it can still be built, and to repeatedly test the asset to measure that it is still meeting the factors that give it value in the marketplace. In the next section, we will look in more detail at best practice methods for achieving this effectively and reliably.
2 - Common mistakes
Before we go deeper, it is very important that we flag a common anti-pattern that can take your Continuous Delivery experience off into a catastrophic direction.
We have introduced the idea of DevOps as a process that helps you optimize delivery based upon metrics. Pick the wrong metrics, however, and you will get very efficient at being busy doing all the wrong things until you run out of resources and are shut down.
It is very tempting to align all your metrics against the delivery of features. Everyone in the organization will happily join in this activity because features are arbitrary lists of work that are easy to come up with, easy to measure progress against and usually represent things that are directly tied to milestones and bonuses. Worse, it is really, really easy to build processes that automate the end to end pipeline of delivering features and measuring their delivery. You will feel good doing it. Everyone will be busy. Stuff will happen and features will get delivered. Things will be happening faster and more efficiently than they were before. Bonuses and incentives will abound.
Then, there will be a reality adjustment. Your customers still aren’t happy. Nobody is buying. In fact, the instrumentation shows that nobody is really even using the features you shipped. Growth tanks and everyone starts saying that “Continuous Delivery doesn’t work”.
Here’s the error: Features are outputs. For Continuous Delivery to work as intended, your metrics must be based upon Outcomes, not Outputs.
The benefits of Continuous Delivery are not derived from the optimization of the engineering process within your organization. The purpose of Continuous Delivery is to optimize the delivery of those outcomes that are most important to your customers.
The bulk of the technical details of Continuous Delivery implementation can appear to sit within the engineering function and DevOps transformations are often driven from that team, but Continuous Delivery will only work for your organization if it is adopted as a core component of a product commercialization strategy that aligns all activities across shareholders, management, marketing, sales, engineering and operations.
The activities flowing through your Continuous Delivery pipeline should either be experiments to validate a hypothesis about customer needs, or the delivery of features that represent a previously validated customer need. All of this must be driven by direct customer interaction.
In order to successfully implement Continuous Delivery, your organization must have a structure that sets out strategic outcomes and empowers teams to discover customer needs and take solutions to them to market. This implies a level of comfort with uncertainty and a trust in delivery teams to do what is best for the customer in the context of the strategy.
In a classical business, ‘strategy’ is little more than a plan of work which is formed annually and which spells out the set of features to be delivered in the following year. All spending and bonuses are structured against this plan, leaving little room to change anything without ‘failing’ on the plan.
If an engineering team attempts to implement Continuous Delivery unilaterally within this structure, they will find themselves railroaded into using it to implement planned features by the rest of the organization. Furthermore, they will have no power to release into production because go-live processes will remain defined by other areas of the business that are still operating against classical incentives and timescales.
Broad support for organizational change at board level is necessary for a successful Continuous Delivery implementation.
3 - Going slower to go faster
For a single person, working in isolation on a single problem, the fastest route is a straight line from point A to point B.
The classic approach to trying to speed up a team is to make linear improvements to each team member’s activities, cutting corners until they are doing the minimum possible activity to get to the desired outcome. Taken to extremes, this often results in neglecting essential non-functional requirements such as security, privacy or resilience. This is, however, an anti-pattern for more subtle and pernicious reasons - inner loops and nonlinear scaling factors…
When trying to complete a task in a hurry, we tend to use a brute-force, manual approach to achieving tasks. Need to set up an environment? “Oh, I’ll just google it and cut and paste some commands.” Need to test something? “I can check that by exercising the UI.”
All these tasks take time to do accurately, but in our heads we budget for them as one-off blockers to progress that we have to push through. Unfortunately, in our heads, we are also assuming a straight line, ‘happy path’ to a working solution. In practice, we actually are always iterating towards an outcome and end up having to repeat those tasks more times than we budgeted for.
Under these circumstances, we notice that we are slipping and start to get sloppy at repeating the boring, manual tasks that we hadn’t mentally budgeted for. This introduces a new class of errors that we also hadn’t factored into our expectations, throwing us further off track.
If these manual tasks fall inside an inner loop of our iterative build and release process, they are capable of adding considerable overhead and risk to the process. Identifying these up front and budgeting for automating them on day one pays back the investment as soon as things stray into unanticipated territory.
The benefits of this are easy enough for individuals to visualize. The next problem, however, is more subtle and more impactful.
As we have discussed previously, working with ‘invisible’ artifacts like software systems means that teams have to manage the problem of communicating what is needed, what is being done and what technical debt remains. On a team of ‘n’ people, the number of possible communication paths goes up with n² - n, and this quadratic scaling factor can rapidly wipe out all productivity.
Similarly, if everyone on the team takes a unique approach to solving any given problem, the amount of effort required for the team to understand how something works and how to modify it also goes up with n².
As your product grows, numbers of customers and volumes of transactions also scale exponentially.
As a result, behaviors that worked just fine on small teams with small problems suddenly and unexpectedly become completely unmanageable with quite small increases to scale.
To operate successfully at scale, it is critical to mitigate the impact of complexity wherever possible. Implementing consistent, repeatable processes that apply everywhere is a path to having one constrained set of complex activities that must be learned by everyone, but which apply to all future activities. This adds an incremental linear overhead to every activity, but reduces the risk of exponentially scaling complexity.
The implication of this is that there is a need to adopt a unified release cycle for all assets. It helps to think about your release process as a machine that bakes versions of your product. Every asset that makes up your product is either a raw material that is an input to the machine, or it is something that is cooked up by the machine as an intermediate product. The baking process has a series of steps that are applied to the ingredients in order to create a perfect final product, along with quality control stages that reject spoiled batches.
If you have ingredients that don’t fit into the machine, or required process steps that the machine doesn’t know about, you cannot expect to create a consistent and high-quality final product.
It is worth stressing this point. If your intention is to build a product which is a software system that produces some value for customers, you should have the expectation that you will also need to own a second system, which is the machine that assembles your product. These systems are related, but orthogonal. They both have a cost of ownership and will require ongoing maintenance, support and improvement. The profitability of your product is related to the effectiveness of the machine that assembles it, as is your ability to evolve future products to identify product-market fit.
So, what are the key features of a machine that manufactures your product?
Let’s start by looking at the ingredients that you are putting into the machine. These will typically comprise the source code that your development teams have created, third party dependencies that they have selected and data sets that the business has aggregated. These are people-centric operations that carry with them an increased risk of human error and a set of associated assumptions that may or may not align to the outcomes required for your final product.
Any component at this level will have been created by an individual or small team and it should be assumed that this work was undertaken with minimal understanding of how the component will interact with the rest of your product.
This introduces some key questions that your build system must be able to validate:
- Can the component be built in your official environment?
- Does it behave the way the developer expected?
- Does it behave the way that other developers, who are customers of its service, expected?
- Does it align to functional and non-functional requirements for the product?
Let’s touch briefly on the importance of consistent environments. The most commonly heard justification within development teams is probably “Well, it works on my machine!”
The environment in which you build and test your code represents another set of dependencies that must be managed if you are to maintain consistency across your final product. It is hard to do this effectively if your environments are physical computers, since every developer’s laptop may be configured differently and these may vary significantly from the hardware used in the build environment, the staging environment and production.
Virtualization and containerization make it much easier to have a standard definition for a build environment and a runtime environment that can be used consistently across the lifecycle of the component being maintained. We will discuss this in further detail later, but your build system will require a mechanism by which to configure an appropriately defined environment in which to create and validate your source components.
To build a component from source, we need to collate all of the dependencies that this component relies upon, perform any configuration necessary for the target environment, and, in the case of compiled languages, perform the compilation itself.
This brings us to one of the harder problems in computing, dependency management. A given version of your code expects a particular version of every library, shared component, external service or operating system feature upon which it relies. All of these components have their own lifecycles, most of which will be outside your direct control.
If you do not explicitly state which version of a dependency is required in order to build your code, then your system will cease to be buildable over time as external changes to the latest version of a library introduce unanticipated incompatibilities.
If you explicitly state which version of a dependency is required in order to build your code, then your system will be pinned in time, meaning that it will drift further and further behind the functionality provided by maintained dependencies. This will include security patches and support for new operating system versions, for example.
Furthermore, depending upon the way in which your application is deployed, many of these dependencies may be shared across multiple components in your system, or between your system and other associated applications in the same environment. This can lead to runtime compatibility nightmares for customers. You should also consider that you may need to be able to run multiple versions of your own components in parallel in production, which again introduces the risk of incompatibility between shared libraries or services. For good reason, this is generally known as ‘dependency hell’ and can easily destroy a product through unanticipated delays, errors and poor customer experience.
The implication of this is that you must employ a mechanism to allow controlled dependency management across all the components of your product, and a process to mandate continuous, incremental updates to track the lifecycles of your dependencies, or your product will succumb to ‘bit rot’ as it drifts further and further behind your customers’ environments.
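As a rough illustration of the ‘continuous, incremental updates’ part, a sketch like the one below could compare your pinned versions against the latest available releases and flag drift. The package names and version numbers are purely hypothetical; a real implementation would read them from your lockfile and package index.

```python
# Minimal sketch: flag pinned dependencies that have drifted behind the latest
# available release, so regular update work can be scheduled before 'bit rot'
# sets in. The pinned/latest data is illustrative only.
pinned = {"libfoo": "1.4.2", "libbar": "2.0.0", "libbaz": "0.9.1"}   # hypothetical lockfile
latest = {"libfoo": "1.6.0", "libbar": "2.0.0", "libbaz": "1.2.0"}   # hypothetical index data

def parse(version: str) -> tuple[int, ...]:
    # Naive semantic-version parsing, good enough for the sketch.
    return tuple(int(part) for part in version.split("."))

drifted = {name: (pinned[name], latest[name])
           for name in pinned
           if parse(latest[name]) > parse(pinned[name])}

for name, (have, want) in drifted.items():
    print(f"{name}: pinned at {have}, latest is {want} - schedule an update")
```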
Decomposing your product into loosely-coupled services that can be deployed in independent containers communicating through published APIs provides maximal control over runtime dependency issues.
The remainder of our questions bring us to the topic of testing.
To verify that a component behaves the way its developer expected, our build system should be able to run a set of tests provided by the developer. This should be considered as a ‘fail fast’ way to discover if the code is not behaving the way that it was when it was developed and represents a form of regression testing against the impact of future modifications to the codebase in question. Note however that both the code and the tests incorporate the same set of assumptions made by the original developer, so are insufficient to prove the correctness of the component in the context of the product.
At this stage, it is recommended to perform an analysis of the code to establish various metrics of quality and conformance to internal standards. This can take the form of automated code quality analysis, security analysis, privacy analysis, dependency scanning, license scanning and so on, as well as automated enforcement of manual peer review processes.
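A minimal sketch of this fail-fast idea is shown below: a sequence of validation stages that stops at the first failure. The specific tools invoked are placeholders chosen for illustration; substitute whatever your build system actually runs.

```python
# Minimal sketch of a fail-fast validation stage runner: each stage is a shell
# command; the pipeline stops at the first failure. The commands listed are
# placeholder tool choices, not a prescribed toolchain.
import subprocess
import sys

STAGES = [
    ("unit tests", ["pytest", "--maxfail=1"]),
    ("lint", ["flake8", "."]),
    ("dependency audit", ["pip-audit"]),   # placeholder scanner
]

def run_pipeline() -> None:
    for name, command in STAGES:
        print(f"--- {name} ---")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"Stage '{name}' failed; stopping the pipeline.")
            sys.exit(result.returncode)
    print("All validation stages passed.")

if __name__ == "__main__":
    run_pipeline()
```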
To verify that a component behaves the way that other developers, who are consuming services provided by the component, expect, we must integrate these components together and test the behavior of the system against their combined assumptions. The purpose of this testing is to ensure that components meet their declared contract of behavior, and to highlight areas where this contract is insufficiently precise to enable effective decoupling between teams.
The final set of testing validates whether the assembled product does what is expected of it. This typically involves creating an environment that is a reasonable facsimile of your production environment and testing the end-to-end capabilities against the requirements.
Collectively, these activities are known as Continuous Integration when performed automatically as a process that is triggered by code being committed to product source control repositories. This is a topic that we shall return to in more detail in later chapters.
These activities provide a unified picture of asset status, technical debt and degradation over time of intellectual property. For the majority of a product team, they will be the only mechanism by which the team has visibility of progress and as such, it should be possible for any member of the team to initiate tests and observe the results, regardless of their technical abilities.
Ultimately, however, your teams must take direct responsibility for quality, security and privacy. Adopting a standard, traceable peer review process for all code changes provides multiple benefits. Many eyes on a problem helps to catch errors, but also becomes an integral part of the communication loop within the team that helps others to understand what everyone is working on and how the solution fits together to meet requirements. Done correctly and with sensitivity, it also becomes an effective mechanism for mentoring and lifting up the less experienced members of a team to effective levels of productivity and quality.
Your build system should provide full traceability so that you can be confident that the source that was subjected to static analysis and peer review is the same code that is passing through the build process, and is free from tampering.
The desired output from this stage in manufacturing your product is to have built assets that can be deployed. Previously, this would have been executables, packages or distribution archives ready for people to deploy, but under DevOps we are typically looking at creating containerized environments containing the product, pre-configured.
This should be an automated process that leverages environment specific configuration information held in the source repository with the code that describes the desired target environment for deployment. This information is used to create a container image which holds your product executables, default configuration, data and all dependencies.
At this stage in the process, it is appropriate to apply automated infrastructure hardening and penetration testing against your container image to ensure a known security profile.
The image may then be published to an image repository, where it represents a versioned asset that is ready for deployment. This repository facilitates efficient re-use of assets and provides a number of benefits, including:
- Being able to deploy known identical container images to test and production environments alike
- Simplifying the re-use of containerized service instances across multiple products
- Making it easy to spin up independent instances for new customers in isolated production environments
- Enabling management of multiple versions of a product across multiple environments, including being able to rapidly roll back to a known good version after a failed deployment
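By way of illustration, publishing a versioned image could be as simple as the sketch below, which drives the Docker CLI from a script. The registry and image names are hypothetical, and in practice the version would come from your build system rather than a hard-coded value.

```python
# Minimal sketch: build a container image, tag it with an immutable version,
# and push it to an image repository using the Docker CLI. Registry and image
# names are hypothetical placeholders.
import subprocess

REGISTRY = "registry.example.com/myproduct"   # hypothetical registry path
VERSION = "1.4.0"                             # normally supplied by the build system

def publish_image(version: str) -> None:
    image = f"{REGISTRY}/web:{version}"
    subprocess.run(["docker", "build", "-t", image, "."], check=True)
    subprocess.run(["docker", "push", image], check=True)
    print(f"Published {image}")

if __name__ == "__main__":
    publish_image(VERSION)
```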
As an aside to the process, it is advisable to set up scheduled builds of your product, purely to act as automated status validation of your asset health. This gives confidence against ‘bit rot’ due to unanticipated external factors such as changes in your dependency tree or build environment that introduce build failures. In parallel with this, you must plan regular maintenance activities to update your codebase to reflect changes in external dependencies over time, so that your asset does not become stale and leave you unable to react rapidly to emergency events such as zero day exploits that must be patched immediately.
Having a machine that can build your product is only half the answer, however. Much of the risk sits within the process of deploying the product into production and this must also be automated in order to successfully mitigate the main problems in this space.
The first issue is ‘what to deploy?’. You have an asset repository filling up with things that are theoretically deployable, but there is a subtle problem. Depending upon the way you measure and reward your development teams, the builds coming into your repository may be viable units of code that pass tests, but which don’t represent customer-ready features that are safe to turn on. In the majority of new teams, it is fair to expect that this will be the default state of affairs, but this is a classic anti-pattern.
To realize many of the benefits of DevOps, it is essential that the images that are getting into your asset repository are production-ready, not just ‘done my bit’ ready from a developer’s perspective. You need to create a culture in which the definition of done switches from ‘I finished hacking on the code’ to ‘All code, configuration and infrastructure is tested and the feature is running in prod’.
It’s not always easy to get to this point, especially with complex features and large teams, but there is a work-around that allows you to kick the problem down the road somewhat: the adoption of feature switching, or ‘feature flags’. Using this approach, you wrap all code associated with your new feature in conditional statements that can be enabled or disabled at runtime using configuration. You test your code with the flags on and off to ensure that both scenarios are safe, and create an asset that can be deployed in either state. This protects you from situations where the code depends on another service that has slipped its delivery date, since your asset is still production-ready with the new feature turned off. You are also protected from unanticipated failures in the new feature, since you can turn it off in production, or perform comparative testing in production by turning the feature on only for some customers.
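A minimal sketch of feature switching might look like the following. The flag store and function names are hypothetical; a real system would read flags from configuration, environment variables or a flag service rather than an in-memory dict, and would typically support per-customer overrides.

```python
# Minimal sketch of runtime feature switching: new behaviour is wrapped in a
# flag that can be toggled per environment (or per customer) via configuration.
# The flag store here is a plain dict purely for illustration.
FEATURE_FLAGS = {
    "new_checkout_flow": False,   # hypothetical flag, shipped dark by default
}

def is_enabled(flag: str, customer_id: str | None = None) -> bool:
    # A real implementation might also consult per-customer overrides here.
    return FEATURE_FLAGS.get(flag, False)

def checkout(cart, customer_id: str):
    if is_enabled("new_checkout_flow", customer_id):
        return new_checkout(cart)      # experimental path, safe to deploy disabled
    return legacy_checkout(cart)       # existing, validated path

def new_checkout(cart): ...            # placeholder for the new implementation
def legacy_checkout(cart): ...         # placeholder for the current implementation
```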
Given this, the next issue is ‘when to deploy?’. At some point, you need to make a go / no go decision against the deployment of a new asset. This should be based upon a consistent, audited, deployment process that is automated as much as possible.
In safety-critical environments, such as aircraft or operating theaters, checklists are used to ensure that the right actions are taken in any given scenario, especially when there is pressure to respond urgently to immediate problems. Flight crews and theater staff are drilled in the use of checklists to minimize the chances of skipping essential activities when distracted by circumstances. Your product build system must automate as many of these checklist activities as possible to ensure that key actions happen each and every time you make a release.
This is also a good place to automate your regulatory compliance tasks so that you can always associate mandated compliance activities with an audited release.
The deployment decision should be managed under role-based access control, with only nominated individuals being authorized to initiate a deployment. Remember that if someone manages to breach the system that builds your product, they may be able to inject malicious code into your asset repository by manipulating your tests, so you must take precautions to ensure that there are clear reasons for new code passing into production.
This brings us to the ‘what, specifically?’ of deployment. The assets in your repository are typically re-usable services that are bundled with default configurations that have been used for testing but now you need a concrete instance of this service, in a given environment, against a specific set of other service instances, for a specific customer or application. Your deployment process must therefore include an appropriate set of configuration overrides that will define the specific instance that is created. This is where you must deal with the feature switching can that you kicked down the road earlier.
Finally, we get to the ‘how?’ of deployment. Ideally, you want a deployment process that is untouched by human hands, so that you can guarantee a predictable and repeatable process with known outcomes. You can use procedural scripts to deploy your asset, but it is generally better to use a declarative approach, such as GitOps, where you maintain a versioned description of how you would like your production environment to look, and changes committed to this description trigger the system to do whatever is necessary to bring the production environment into line with the desired state. Remember the ‘pets vs cattle’ model of environments: if you have a problem with infrastructure, it is far, far better to kill the instance and create a fresh one automatically than to try to tinker with it manually to make it healthy again.
As part of this process, you will want to have automated validation of successful deployment, and automated recovery to the last known good state in the event of a failure.
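The sketch below illustrates the declarative idea in miniature: compare the desired state with the observed state, converge, and fall back to the last known good version if validation fails. The deploy and health-check functions are placeholders standing in for your real tooling, not an actual GitOps implementation.

```python
# Minimal sketch of declarative, GitOps-style reconciliation: the desired state
# is a versioned description; the reconciler converges the environment towards
# it and rolls back to the last known good state if validation fails.

desired_state = {"service": "web", "version": "1.4.0", "replicas": 3}   # from versioned config
observed_state = {"service": "web", "version": "1.3.2", "replicas": 3}  # reported by the platform

def deploy(state: dict) -> None:
    # Placeholder for the real deployment mechanism.
    print(f"Deploying {state['service']} {state['version']} with {state['replicas']} replicas")

def healthy(state: dict) -> bool:
    # Placeholder for smoke tests / health checks after deployment.
    return True

def reconcile(desired: dict, observed: dict, last_known_good: dict) -> dict:
    if desired == observed:
        return observed                      # nothing to do
    deploy(desired)
    if healthy(desired):
        return desired                       # becomes the new last known good
    print("Validation failed - rolling back")
    deploy(last_known_good)
    return last_known_good

observed_state = reconcile(desired_state, observed_state, last_known_good=observed_state)
```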
You are trying to create a culture and a mechanism within which small units of functionality are incrementally deployed to production as frequently as possible, with the minimum of human input. This is Continuous Deployment.
Tying all this together, your goal is to build a product delivery engine that enables Continuous Delivery of features supporting your product discovery goals, at a cadence aligned to the metrics discussed earlier, thus maximizing your chances of commercial success within the constraints of your available runway and capacity.
In subsequent sections, we will dive more deeply into the specifics of each of these challenges.
4 - AI & Machine Learning
Many products include machine learning as a technology component, and the process of managing machine learning in production is usually referred to as MLOps; however, there are wildly differing views as to what this means in practice in this nascent field.
A common misunderstanding is to treat machine learning as an independent and isolated discipline with tools optimized purely for the convenience of data science teams delivering stand-alone models. This is problematic because it takes us back to development patterns from the pre-DevOps era, where teams work in isolation, with a partial view of the problem space, and throw assets over the fence to other teams, downstream, to deploy and own.
In reality, the machine learning component of a product represents around 5-10% of the effort required to take that product to market, scale and maintain it across its lifespan. What is important is managing the product as a whole, not the models or any other specific class of technology included within the product. MLOps should therefore be seen as the practice of integrating data science capabilities into your DevOps approach and enabling machine learning assets to be managed in exactly the same way as the rest of the assets that make up your product.
This implies extending the ‘machine that builds your product’ to enable it to build your machine learning assets at the same time. This turns out to have significant advantages over the manual approach common in data science teams.
Firstly, your data science assets must be versioned. This includes relatively familiar components like training scripts and trained models, but requires that you also extend your versioning capability to reference explicit versions of training and test data sets, which otherwise tend to get treated as ephemeral buckets of operational data that never have the same state twice.
Your training process should be automated, driven by training scripts that are themselves managed assets with automated acceptance tests. Keep in mind that the models you are producing are not optimal blocks of code that can be debugged; rather, they are approximations to a desired outcome that may be considered fit for purpose if they meet a set of predefined criteria for their loss function. Useful models are therefore discovered through training, rather than crafted through introspection, and the quality of your models will represent a trade-off between the data available for training, the techniques applied, the tuning undertaken to hyper-parameters, and the resources available for continued training to discover better model instances.
Many of these factors may be optimized through automation as part of your build system. If your build system creates the infrastructure necessary to execute a training run dynamically, on the fly, and evaluates the quality of the resultant model, you can expand your search space and tune your hyper-parameters by executing multiple trainings in parallel and selecting from the pool of models created.
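As an illustration, and with the actual training job reduced to a stand-in, the search-and-select idea might be sketched like this: several configurations are trained in parallel, only candidates meeting a predefined acceptance threshold are kept, and the best is promoted. The threshold, search space and train() function are all assumptions for the example.

```python
# Minimal sketch: run several training configurations in parallel, keep only
# models that meet a predefined acceptance threshold on validation loss, and
# pick the best. train() is a stand-in for a real training job.
from concurrent.futures import ProcessPoolExecutor
import random

ACCEPTANCE_THRESHOLD = 0.25   # hypothetical maximum acceptable validation loss

def train(config: dict) -> dict:
    # Stand-in: a real job would provision infrastructure, train, and evaluate.
    loss = random.uniform(0.1, 0.5)            # placeholder for a measured validation loss
    return {"config": config, "val_loss": loss}

if __name__ == "__main__":
    search_space = [{"learning_rate": lr, "batch_size": bs}
                    for lr in (1e-2, 1e-3, 1e-4) for bs in (32, 128)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(train, search_space))
    accepted = [r for r in results if r["val_loss"] <= ACCEPTANCE_THRESHOLD]
    if accepted:
        best = min(accepted, key=lambda r: r["val_loss"])
        print("Promote:", best)
    else:
        print("No candidate met the acceptance criteria; adjust the search space.")
```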
A big part of successfully managing machine learning assets lies in having the ability to optimize your utilization of expensive processing hardware resources, both during training and operational inferencing. Manually managing clusters of VMs with GPU or TPU resources attached rapidly becomes untenable, meaning that you can accrue large costs for tying up expensive resources that aren’t actually being utilized for productive work. Your build system needs to be able to allocate resources to jobs in a predictable fashion, constraining your maximum spend against defined budgets, warning of excessive usage and enabling you to prioritize certain tasks over others where resources are constrained in availability.
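One simplified way to picture the budgeting aspect is a guard like the sketch below, which admits the highest-priority jobs first and defers anything that would exceed the remaining budget. The cost figures, priorities and job names are illustrative only.

```python
# Minimal sketch: a budget guard for expensive training jobs. Jobs carry an
# estimated cost and a priority; higher-priority jobs are admitted first and
# any job that would exceed the remaining budget is deferred.
from dataclasses import dataclass

@dataclass
class TrainingJob:
    name: str
    estimated_cost_usd: float
    priority: int            # higher = more important

def schedule(jobs: list[TrainingJob], budget_usd: float) -> list[TrainingJob]:
    admitted, remaining = [], budget_usd
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        if job.estimated_cost_usd <= remaining:
            admitted.append(job)
            remaining -= job.estimated_cost_usd
        else:
            print(f"Deferring '{job.name}': would exceed remaining budget (${remaining:.0f}).")
    return admitted

jobs = [TrainingJob("baseline-retrain", 400, priority=3),
        TrainingJob("hyperparameter-sweep", 2500, priority=2),
        TrainingJob("exploratory-architecture", 1800, priority=1)]
print([j.name for j in schedule(jobs, budget_usd=3000)])
```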
It is important to be aware that beyond trivial examples that can be run in memory on a single computing device, much of machine learning sits in the domain of complex, high-performance, distributed computing. The challenge is to decompose a problem such that petabytes of training data can be sharded into chunks small enough to be processed by hardware with only gigabytes of RAM, distributed into parallel operations that are independent enough to significantly reduce the elapsed time of a training run. Moving that much data across thousands of processing nodes in a way that ensures that the right data is on the right node at the optimal time is a problem that humans are poorly suited to optimizing, and the cost of errors can easily be multiplied by orders of magnitude.
Consideration should be given to the forward and reverse paths in this product lifecycle. Your build process should seek to optimize the training and deployment of versioned models into production environments, but also to enable a clear audit trail, so that for any given model in production, it is possible to follow its journey in reverse so that the impact of incidents in production can be mitigated at minimal cost.
On the forward cycle, there are additional requirements for testing machine learning assets, which should be automated as far as possible. Models are typically decision-making systems that must be subject to bias detection and fairness validation, with specific ethics checks to ensure that the behavior of the model conforms to corporate values.
In some cases, it will be a legal requirement that the model is provable or explainable, such that a retrospective investigation could understand why the model made a given decision. In these cases, it should be expected that the evolutionary lifecycle of the model will include the need to be able to back-track through the training process from an incident in production, triggering retraining and regression testing to ensure that mistakes are corrected in subsequent releases.
Models also require security and privacy evaluation prior to release. This should take the form of adversarial testing where the model is subjected to manipulated input data with the intent of forcing a predictable decision or revealing personally identifiable training data in the output. Note that there is always a trade-off between explainability and privacy in machine learning applications, so this class of testing is extremely important.
The build system must be able to appropriately manage the synchronization of release of models and the conventional services that consume them, in production. There is always a problem of coupling between model instances and the services that host and consume inference operations. It should be expected that multiple versions of a given model may be deployed in parallel, in production, so all associated services must be versioned and managed appropriately.
Note that in some geographic regions, it is possible for customers to withdraw the right to use data that may comprise part of the training set for production models. This can trigger the need to flush this data from your training set, and to retrain and redeploy any models that have previously consumed this data. If you cannot do this automatically, there is a risk that this may be used as a denial of service attack vector upon your business, forcing you into cycles of expensive manual retraining and redeployment or exposing you to litigation for violations of privacy legislation.
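A minimal sketch of the lineage needed to support this might look like the following: each deployed model records the dataset versions it consumed, so an erasure request can automatically identify the models that must be retrained and redeployed. The model names, dataset version identifiers and lineage records are placeholders for illustration.

```python
# Minimal sketch: track which training-set versions each deployed model consumed,
# so a data-erasure request can automatically identify the models that must be
# retrained and redeployed. The lineage records are illustrative placeholders.
MODEL_LINEAGE = {
    "churn-model:3.2.0":   {"customers-2023-10", "events-2023-10"},
    "pricing-model:1.1.4": {"transactions-2023-09"},
    "churn-model:3.1.0":   {"customers-2023-07", "events-2023-07"},
}

def models_affected_by_erasure(dataset_versions_containing_user: set[str]) -> list[str]:
    # A model is affected if it was trained on any dataset version holding the user's data.
    return [model for model, datasets in MODEL_LINEAGE.items()
            if datasets & dataset_versions_containing_user]

# Example: an erasure request touches records present in this dataset version.
for model in models_affected_by_erasure({"customers-2023-10"}):
    print(f"{model}: purge data, retrain, and redeploy")
```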