With many companies still unsure about what exactly DevOps is, how can you assess whether your company should care about DevOps, let alone make a significant investment in it?
For starters, if you recognise some of the following problems, the chances are that adopting DevOps principles and practices will improve your organisational performance:
- Deploying new releases is painful and more often than not results in system outages.
- So much time is spent fighting fires that new projects are always behind schedule.
- Innovation is so slow that you are consistently losing ground to your competitors.
- Customers frequently report the same bugs in your client-facing systems.
- Fault finding is a dark art, and fixing one problem typically creates new problems.
So What Is DevOps?
DevOps is not a product, a tool, a standard or a process.
DevOps is teams working together to release changes to customers quickly but safely.
DevOps is a collection of principles and practices that you put into action by having product owners, developers, QA and IT Ops work as teammates rather than adversaries, using a common set of tools and techniques to deliver low-risk releases to customers rapidly.
How Do I Get Started?
- You define "Why does our organisation need to adopt DevOps principles and practices?".
- You appoint a project sponsor who can secure budget and drive organisational change.
- You sell the journey to the organisation.
- You start small.
- You pick a project.
- You create a cross-functional team.
- You create reusable tools and infrastructure.
- You promote the principle of self-service, on-demand access to infrastructure, data, tests, telemetry and so on.
- You demonstrate success.
- You incrementally transform your organisation so that all projects benefit from DevOps.
What Is Involved In a DevOps Transformation?
Visibility
In the excellent book The DevOps Handbook, a value stream is defined as "the process required to convert a business hypothesis into a technology enabled service that delivers value to the customer". Within a value stream, several team members are responsible for performing tasks at different stages.
The first task is to identify exactly who must work together in order for this value stream to deliver value to the customer. The typical members are:
- Product owner
- Development team
- QA
- IT Ops
- Infosec
- Release manager
Now you can map the work that must be done and the interactions that must occur between members of the value stream in order to deliver value to the customer. Your goals are:
- Identify which steps in the value stream are likely to be bottlenecks that will hinder you from maximising the pace, and minimising the risk, at which you deliver value to your customers.
- Demonstrate to each member of the value stream the effort that other members must invest in order for value to be delivered to the customer.
The value stream map should be annotated with key metrics for each stage:
- %C/A - percent complete/accurate, as judged by the immediate downstream member in the value stream.
- LT - lead time, the total time spent at a step in the value stream, including non-productive wait time.
- VA - value-added time, during which productive work is being performed.
Now that you have your "As Is" value stream map, the next step is to consider how, at each step, you can improve quality (measured by %C/A), decrease wait time (measured by LT) and improve efficiency (measured by VA). This will yield your "To Be" value stream map and provide metrics against which you can judge the success of your DevOps transformation.
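To make these metrics concrete, here is a minimal sketch, in Python, of how the "As Is" figures for a value stream might be captured and summarised. The stage names and numbers are purely hypothetical, and the rolled %C/A shown is simply the product of each stage's %C/A, a common way to estimate the chance of a work item flowing through with no rework.

```python
# A minimal sketch of capturing value stream metrics per stage.
# Stage names and figures are hypothetical examples, not real data.

stages = [
    # (stage, pct_complete_accurate, lead_time_hours, value_added_hours)
    ("Design",  0.80, 40, 8),
    ("Develop", 0.70, 80, 30),
    ("Test",    0.90, 24, 6),
    ("Deploy",  0.95, 16, 1),
]

total_lt = sum(lt for _, _, lt, _ in stages)
total_va = sum(va for _, _, _, va in stages)

# Rolled %C/A: the chance a work item passes every stage without rework.
rolled_ca = 1.0
for _, ca, _, _ in stages:
    rolled_ca *= ca

print(f"Total lead time: {total_lt} h")
print(f"Value-added time: {total_va} h")
print(f"Flow efficiency (VA/LT): {total_va / total_lt:.0%}")
print(f"Rolled %C/A: {rolled_ca:.0%}")
```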
Having secured this high-level visibility of the work that must be done to deliver value to your customers, it is time to create visibility over the physical tasks that members of the value stream are working on. For maximum benefit, all work items for both developers and IT Ops should be recorded in a common backlog to give a complete picture of the work required. This promotes cross-functional understanding and helps with task prioritisation.
Taking this a step further, publishing this backlog prominently on large screens can be a highly effective way to share information with all members of the value stream, maximising the flow of information and therefore the pace at which value can be delivered to customers.
Standardisation
If you want low-risk releases, you must standardise your dev, QA and production environments.
If you want your development team to work efficiently, you must provide them with standardised infrastructure that works.
In days gone by it was not uncommon for developers to begin a new project by installing a load of new software on their workstation, spending a few days of trial and error trying to get all of the software to run without error and then finally starting to cut some code.
That approach does not scale and will cause release headaches. You are almost guaranteed that every developer environment runs different software versions, different configurations and different workarounds to fix up the inevitable installation problems.
The result is a volatile IT landscape that causes unpredictability at release time because code has not been tested in an environment that matches the QA and Production environments.
Standardise your infrastructure by turning it into code and versioning it. Server configurations can be handled by tools such as Kubernetes, Docker, Puppet, Chef and Ansible, which allow teams to create code-based configuration profiles that can be used, self-service and on demand, by all team members.
That means developers can spin up exactly the same environments, using exactly the same tools that IT Ops use to manage the QA and Production environments.
Now your releases are lower risk because unknown differences have been eliminated and your developers can work more efficiently because they no longer need to worry about configuring systems that can contain many different software packages.
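As a small illustration of "self-service, on-demand" environments defined in code, the sketch below uses the Docker SDK for Python to start a service from a pinned image version, so that a developer, the deployment pipeline and IT Ops all get exactly the same environment. The image name, tag, port and environment variable are hypothetical placeholders; in practice the same idea is usually expressed in the configuration files of whichever tool your team has standardised on.

```python
# Minimal sketch: spin up a standardised, versioned environment on demand.
# Requires the "docker" Python package and a local Docker daemon.
# The image name, tag, port and environment variable are hypothetical placeholders.

import docker

client = docker.from_env()

container = client.containers.run(
    "example-org/web-app:1.4.2",   # pinned version: same image in dev, QA and production
    detach=True,
    ports={"8080/tcp": 8080},
    environment={"APP_ENV": "dev"},
)

print(f"Started {container.name} ({container.short_id})")
```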
Deployment Pipeline
Your deployment pipeline will bring consistency, pace and robustness to your release processes by automating steps that can be time consuming or prone to human error.
Having standardised your infrastructure and made it available on demand, your deployment pipeline is now able to spin up all of the needed infrastructure in response to a developer checking code in to the Git repository.
Once the infrastructure is live, the most recent copy of the code base can be deployed to it and, if needed, a relevant data set can be used to populate any data stores.
Now, before deploying to production, you can confirm that your software runs without errors in a production-like environment.
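Your pipeline will normally be defined in your CI/CD tool of choice, but conceptually it is just an ordered, automated sequence of steps that fails fast. The sketch below illustrates that idea in plain Python; the step commands and script names are hypothetical placeholders.

```python
# Conceptual sketch of a deployment pipeline: a fixed, automated sequence of
# steps triggered by a code check-in. Commands and script names are
# hypothetical placeholders; a real pipeline would live in your CI/CD tool.

import subprocess
import sys

PIPELINE = [
    ("Provision infrastructure", ["./provision_env.sh", "staging"]),
    ("Deploy latest build",      ["./deploy.sh", "staging"]),
    ("Load test data",           ["./load_fixtures.sh", "staging"]),
    ("Run automated tests",      ["pytest", "tests/"]),
]

for name, command in PIPELINE:
    print(f"==> {name}")
    result = subprocess.run(command)
    if result.returncode != 0:
        # Fail fast: stop the pipeline and keep the broken build out of production.
        sys.exit(f"Step failed: {name}")

print("All steps passed - release candidate is ready for production.")
```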
Automated Tests
A key part of verifying that your software will run as expected in a production-like environment is having a suite of automated tests that can be run on demand to verify that your code behaves as expected. It is important that your suite of tests is effective and you should be careful to avoid these traps:
- Tests that produce false positives, alerting the team to errors that are not actually present
- Tests that hinder productivity because they add no real value yet add creation and maintenance overhead
It takes time, and therefore money, to create automated tests. However, each time you need to release an updated version of your software the automated tests repay you: they guard against buggy code being introduced into production, they prevent regression bugs, and they catch errors caused by unchanged code being invoked in a new and invalid way.
With sufficient automation, testing no longer waits until the end of the process. Rather, it happens as early and as often as possible.
- Unit tests - executed by developers on their workstations to test individual methods and classes in isolation. Other services are typically "mocked out" to remove dependencies and to ensure that tests complete quickly. Unit tests should be executed frequently, so any slow-running tests will impair productivity and risk developers not running tests as often as they should (see the sketch after this list).
- Acceptance tests - tests written from the business perspective to verify that the software as a whole behaves as the business intends. These tests are typically written in the "Given-When-Then" style, which can also serve as an excellent source of system documentation. They are typically slower running and are executed after a developer commits code, orchestrated by the deployment pipeline.
- Integration tests - these tests ensure that your application will interact as expected with other production applications and services, rather than calling stubbed out services. For example, does your ERP software correctly update your ecommerce system with price and stock updates when they are altered in the ERP?
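To make the first two levels concrete, here is a minimal sketch in the style of pytest: a unit test that mocks out a collaborating service, and the same behaviour expressed as a "Given-When-Then" acceptance check. The PriceCalculator class and its discount service are hypothetical examples rather than code from any particular system.

```python
# Minimal pytest-style sketch. The classes under test are hypothetical examples.
from unittest.mock import Mock


class PriceCalculator:
    """Hypothetical example: applies a discount looked up from a service."""

    def __init__(self, discount_service):
        self.discount_service = discount_service

    def total(self, amount, customer_id):
        discount = self.discount_service.discount_for(customer_id)
        return round(amount * (1 - discount), 2)


def test_unit_total_applies_discount():
    # Unit test: the real discount service is mocked out so the test runs
    # quickly and in isolation from other systems.
    discount_service = Mock()
    discount_service.discount_for.return_value = 0.10

    calculator = PriceCalculator(discount_service)

    assert calculator.total(100.0, customer_id=42) == 90.0


def test_acceptance_loyal_customer_gets_discount():
    # Acceptance-style check expressed as Given-When-Then.
    # Given a customer entitled to a 10% discount
    discount_service = Mock()
    discount_service.discount_for.return_value = 0.10
    calculator = PriceCalculator(discount_service)

    # When they are charged for a 100.00 order
    total = calculator.total(100.0, customer_id=42)

    # Then they pay 90.00
    assert total == 90.0
```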
As a rule of thumb, the earlier errors are caught, the easier they are to fix, because the developer still retains the appropriate context and fewer other code changes have been introduced to complicate fault finding. Therefore, whenever an integration test uncovers a fault, your team should consider how it could have been picked up by an acceptance test or a unit test, and whenever an acceptance test uncovers a fault, your team should consider how it could have been picked up by a unit test. This maximises productivity and confirms that the test suite should be treated as something that is always evolving.
Telemetry
When your systems go down, you are losing money. If your engineers must log on to individual servers and inspect individual application log files, they are being starved of information, because they have no easy way to inspect the relationships between different applications at a particular point in time.
A better approach is to gather telemetry from all of your systems, including those in your deployment pipeline, and store it in a single place that supports analysis of data from different systems at a particular moment in time. This can quickly reveal every system that could be causing, or is involved in, a serious outage.
That level of insight will minimise the amount of time it takes to identify faults and outline the necessary corrective action.
What can be even more powerful is the predictive nature of such telemetry, which could indicate that load is spiking under heavy demand or that certain servers are looking unhealthy and should be taken out of rotation. This proactive action can prevent problems before they are even noticeable.
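One practical prerequisite is that each system emits telemetry in a structured form that a central store can ingest and correlate by time. The sketch below shows one way to do that with Python's standard logging module, emitting JSON events tagged with a service name and timestamp; the service name and events are hypothetical, and the stream handler would be swapped for whatever shipper feeds your central log store.

```python
# Minimal sketch: emit structured (JSON) telemetry so events from many
# services can be shipped to one central store and correlated by time.
# The service name and log messages are hypothetical examples.

import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    def format(self, record):
        event = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "service": "checkout-api",          # hypothetical service name
            "level": record.levelname,
            "message": record.getMessage(),
        }
        return json.dumps(event)


handler = logging.StreamHandler()               # swap for a shipper to your log store
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("telemetry")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

logger.info("order placed")
logger.warning("payment gateway latency above threshold")
```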
Quality
When your teams don't have time to ship high-quality code, they create extra work for someone else in the value stream, who must pay the price of the imperfections. It is then typical for process overhead to be added to catch these quality problems, which hinders the objective of delivering value to customers quickly and efficiently.
By contrast, when high-quality work is the norm and it is backed up by an effective, automated test suite, developers can take ownership of releasing their own code into the production environment, which maximises the pace at which value is delivered to customers and minimises the overhead required to achieve it.
Learning
Things go wrong, things will break. Rather than deny that fact, the goals for your organisation should be:
- Rapid recovery.
- Taking every learning opportunity to ensure that your organisational performance is not stifled by the same issue surfacing time and again.
- Amplifying the learning by sharing it beyond the immediate team and across the organisation.
The first step towards achieving that goal is to create an environment in which team members are comfortable owning mistakes rather than hiding them. If your response to a system outage is to decree from on high that more process and more manual checks must be added to prevent such errors, you will stifle your team, discourage information sharing and, in all likelihood, promote lower-quality work and higher-risk deployments, because the team will assume that catching errors is the job of the process rather than part of their daily work.
On the other hand, if you give your team the safe space to own the problem, fix the problem and upgrade the deployment pipeline to catch such problems in the future, you will keep your team motivated and able to ship updates rapidly and with low risk, because they will recognise that they are responsible for ensuring quality and will not hide behind the process.
A highly effective learning tool is to create failures in a controlled way so that the team can practice responding to them and fixing them. For example, you might believe that you have a resilient system, but you don't really know that to be the case until you either experience a failure or simulate a failure.
It is far better to simulate a failure, such as shutting down a chunk of your hardware, and identify a weakness that can be rectified in a controlled fashion, than to experience a real failure, probably on the busiest day of the year, which must be handled as an emergency.
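A controlled failure drill can be as simple as stopping one of your own services and checking that alerts fire and the system recovers. The sketch below uses the Docker SDK for Python to stop a randomly chosen container from a hypothetical list; something like this should only ever be run against an environment you are prepared to break.

```python
# Minimal sketch of a controlled failure drill: stop one container at random
# and observe whether monitoring alerts fire and the system recovers.
# Requires the "docker" Python package; the container names are hypothetical.

import random

import docker

CANDIDATES = ["web-app-1", "web-app-2", "worker-1"]   # hypothetical container names

client = docker.from_env()
victim_name = random.choice(CANDIDATES)

victim = client.containers.get(victim_name)
print(f"Simulating failure: stopping {victim_name}")
victim.stop()

# Follow-up (manual or automated): check dashboards, confirm alerts fired,
# confirm traffic failed over, then restart the container.
```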
DevOps Transformation Challenges
Applying DevOps principles and practices to your organisation is likely to be a significant transformation programme in its own right. You will ask people to work in different ways, learn new skills, adopt new tools and integrate with new team members.
To begin with, these will be barriers to progress. Things will take longer because on day one you have a blank canvas, nothing to reuse and therefore nothing to save your team time. This often creates inertia, resistance and reluctance.
It is for this reason that you don't strive to change overnight. Instead, you establish a DevOps capability incrementally, starting with a defined project scope and an enthusiastic, cross-functional team who can secure successes, deliver value and begin to cross-pollinate: demonstrating the value of this new way of working to colleagues, sharing knowledge and encouraging reuse of the tools and infrastructure that have proven to deliver greater efficiency, quality and maintainability.