Before we get directly into Total Cost of Ownership questions, I'd like to give a little background on how I approach this topic. I'm a student of the Theory of Constraints. I'm no expert, but I have a working knowledge of the concepts and how to apply them in software.
Theory of Constraints (ToC)
The Theory of Constraints is a deep topic. If you're not familiar with it, I encourage you to read "The Goal," "The Phoenix Project," and "The Unicorn Project" as primers. These are all novels that do a fantastic job of bringing abstract ideas into concrete reality in a way that's easy to grasp.
The aspect of the ToC I want to focus on now is the attitude toward inventory and operating expense. When you invest in inventory, you are committing funds that are “frozen” until the end-product is sold. Inventory and Operating Expense detract from the realization of value, i.e., profit. Many managers focus their effort on reducing inventory and operating expense as a way to increase profit. There’s no intrinsic problem with this approach, but it does have some limitations.
First, you need some inventory and some operating expense in order to produce value. This means that the theoretical limit to how much you can reduce inventory and operating expense approaches but can never reach zero. At some point, you will have done all you can.
In ToC, while you are encouraged to reduce inventory and operating expense where it makes sense, this is less important than increasing throughput. If you can produce more quality product faster while incurring a minor increase in inventory and operating expense, it's worth doing. ToC'ers are careful to remind you, though, that local optimizations (e.g., optimizing just one step in the production process) are irrelevant. What matters is that you can move value through the entire value stream and realize that value as quickly as possible.
ToC in Software
In software engineering, inventory is your backlog. The realization of value is when the software is used. Everything in between is operating expense. The golden metric in software engineering is lead time–the time it takes to deliver a feature from the moment it’s started.
Aside: I have found it helpful to track the delivery time from the moment it's requested (ordered) as well as the time from the moment an engineer starts working on the story. This helps separate engineering bottlenecks from project management bottlenecks.
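To illustrate the aside above, here is a minimal PowerShell sketch of computing both lead-time measurements. The export file and its column names (Id, OrderedDate, StartedDate, DeployedDate) are hypothetical; substitute whatever your work-item tracker provides.
# Sketch: compute both lead-time measurements per work item.
# Assumes a hypothetical export (work-items.csv) with Id, OrderedDate, StartedDate, DeployedDate columns.
$items = Import-Csv .\work-items.csv
foreach ($item in $items) {
    [pscustomobject]@{
        Id                = $item.Id
        # From request to delivery: includes project-management queue time.
        TotalLeadTimeDays = ([datetime]$item.DeployedDate - [datetime]$item.OrderedDate).TotalDays
        # From start of engineering work to delivery: isolates engineering time.
        EngLeadTimeDays   = ([datetime]$item.DeployedDate - [datetime]$item.StartedDate).TotalDays
    }
}
If the two numbers diverge sharply, the bottleneck is in front of engineering, not inside it.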
The activity of software engineering is aimed at delivering value through features. Repairing defects does not add value. A defect is rework on something that has already been paid for, so the repair effort is a net loss to feature delivery: it consumes valuable resources (developer time) without adding new value (features).
Three Approaches to Functional Quality
In software delivery, the biggest bottleneck is usually in the testing phase. As it stands, it’s also the phase that most often gets cut. The result is low-quality systems.
In software construction, there are only three approaches to functional quality.
- Production “Testing”. Unfortunately, I’ve worked for some companies that do this. They have no QA and no internal quality gates or metrics. They throw their stuff out there and let the users find the bugs. Even some “Agile” shops do this since it’s easier to teach people how to move post-it notes across whiteboards than it is to teach them how to engineer well.
- Manual Testing. This is much more common. In the worst case, developers write code and pass it through "works on my machine" certification. In the best case, companies hire testers who are integrated with the team and who maintain written test cases that they walk through for each release.
- Automated Testing. This approach is much less common than I would like. In this model, developers write testing programs along with the code they are developing. These tests are run every time changes are committed to check for regressions. When defects slip through, the fixes are captured with additional automated tests so that they don't recur.
If you are testing in production, you don’t care about quality. Your users will likely care and you are not likely to keep them. Almost everyone understands that this is not an ideal way to proceed. Most people rely on manual testing. Some have some supplemental automated testing. Few have fully reliable automated test suites.
Manual Testing
Many companies rely mostly on manual testing. In a purist's world, all test cases are executed for every release. Since manual testing–even for small systems–is necessarily time-consuming, most companies do some version of targeted manual testing, focusing on the features that changed. Of course, defects still slip through, often in the places that weren't tested because the test cases weren't considered relevant to the change. What I want to bring your attention to here is not the impact on quality but on lead time.
In this model, when the dev work is done (it's "dev-complete"), it gets handed off to QA personnel for manual testing. This person may or may not be on the same team, but that's irrelevant for our purposes. They have to get a test environment, set up the software, and march through their manual test cases. This cannot be done in seconds or minutes. In the best-case scenario, it takes hours. In reality, it's usually days. If failures are found, the work is sent back to engineering and the process is repeated.
Due to the need to occasionally deploy emergency fixes, there has to be some defined alternative approach to getting changes out that is faster and has fewer quality gates. Many companies require management and/or compliance approval to use these non-standard processes. Hotfixes themselves have been known to cause outages due to unforeseen consequences of the change that would normally be caught by QA.
Automated Testing, Continuous Integration, and Continuous Deployment
In contrast to the manual testing approach, automated testing facilitates rapid deployment. The majority of use-cases are covered by test programs that run on every change. The goal is to define the testing pipeline in such a way that passing it is a good enough indicator of quality that the release should not be held up.
In this model, branches are short-lived and made ready to release as quickly as possible. The test cases are executed by a machine which takes orders of magnitude less time than a human being. Once the changes have passed the automated quality gates, they are immediately deployed to production, realizing the value for the business.
When done well, this process takes minutes. Even with human approval requirements, I’ve had lead times of less than an hour to get changes released to production.
The capabilities that these processes enable are enormous. Lead times go way down which means higher feature throughput for our engineering teams. We are able to respond to production events more quickly which increases agility not only for our engineering teams but also for our businesses. We have fewer defects which means even more time to dedicate to features.
DevOps
Many engineers think of DevOps as automating deployments. That’s certainly part of it, but not all. DevOps is about integrating your ops and dev teams along the vertical slices. Software construction should be heavily influenced by operational concerns. If the software is not running, then we are not realizing value from it. Again, the ToC mindset is helpful here.
Software construction should include proper attention to logging, telemetry, architecture, security, resiliency, and tracing. Automating the deployments allows for quickly fine-tuning these concerns based on the team’s experience running the service in production.
Deployment automation is a good first step and helps with feature-delivery lead times right away. Let’s think about some other common sources of production service failure:
- Running out of disk space.
- Passwords changed.
- Network difficulties.
- Overloaded CPU.
- Memory overload.
- Etc…
A good DevOps/SRE solution would monitor for these (and other) situations and alert engineers before they take down the service. At worst, the alerts would contain detailed information about the problem and what to do to address it. This reduces downtime for the service and allows you to restore service faster in the case of an outage. From a ToC perspective, both outcomes increase the time you are realizing value from the software.
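As a minimal sketch of the idea (not any particular monitoring product), a scheduled check for the first failure source in the list above might look like this. The 10% threshold and the "ServiceMonitor" event source are assumptions.
# Sketch: warn before a disk fills up and takes the service down.
# The 10% threshold is illustrative; assumes the "ServiceMonitor" event source
# was registered beforehand (e.g., with New-EventLog) and that a real alerting
# channel watches the event log.
$threshold = 0.10
Get-CimInstance Win32_LogicalDisk -Filter "DriveType=3" | ForEach-Object {
    $freeRatio = $_.FreeSpace / $_.Size
    if ($freeRatio -lt $threshold) {
        Write-EventLog -LogName Application -Source "ServiceMonitor" `
            -EntryType Warning -EventId 1001 `
            -Message ("Drive {0} is down to {1:P0} free space." -f $_.DeviceID, $freeRatio)
    }
}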
So Why Are Modern Engineering Practices Still Relatively Rare in our Industry?
I’ve been trying to answer this question for 17 years. I think I finally have a handle on it.
Remember that it's common to attempt to increase profit by reducing operating expense. Automated testing and deployment require a fair amount of expertise and a not-small amount of time to set up and do well. They are not often regarded as "features" even though the capability of rapid, confident change certainly is. These efforts begin as a significant increase in operating expense, especially if they're being introduced into a brown-field project for the first time.
Aside: It can be hard to convince managers that we should spend time cleaning up technical debt. It's harder to convince them later that failing to clean up technical debt is the reason it takes so long to change the text in an email template. Managers want the ability to change software quickly, but they don't always understand the technical requirements to do that. Treating lead time like a first-class feature and treating defects as demerits to productivity can help create a common language between stakeholders about where it's important to spend engineering time. If you can measure lead time, you can show your team getting more responsive to requests and delivering faster.
The cost of getting started with modern engineering practices is even bigger than it first appears. It is not possible to build fast, reliable automated tests without learning a range of new software engineering principles, patterns, and practices. These include but are not limited to Test Driven Development, Continuous Integration, Continuous Deployment, design patterns, architectural patterns, observability patterns, etc. Many software engineers and managers alike balk at this challenge, not seeing what lies on the other side. Most engineers will slow down when learning how to practice these things well since the patterns are unfamiliar and the pull of old habits is strong. Many will declare automated testing a waste of time since it doesn't work well with what they've always done. The idea that they may have to change the way they develop is alien to them and not seriously considered. The promise is increased productivity, but the initial reality is the opposite–a near work-stoppage. This is true unless you are working with engineers who've already climbed these learning curves.
Engineers will describe it as “this takes too long.” Managers will be frustrated by the delays to their features. In business terms, this is seen as increased operating expense and lower throughput–the opposite of what we want. We are inclined as an industry to abandon the effort. We feel justified in doing so based on the initial evidence.
This is a mistake.
All of these costs are mitigated enormously if these efforts are done at the beginning of the project. Very often companies will create mountains of technical debt in the name of “moving fast.” These companies will pay an enormous cost when it’s time to harden their software engineering and delivery chops. The irony is that the point of modern engineering practices is to facilitate going fast, so this argument should be viewed skeptically. There are cases when this tradeoff is warranted, to be sure, but it is my opinion that this is less often than is commonly believed.
Getting Through the Learning Dip
We must remember that we’re not as unique as we think we are. Learning new ways of operating is hard. We can look to the experience of other enterprises to remind us why we’re doing this. It’s clear from the data that companies that embrace modern engineering practices dramatically reduce their lead times and the total cost of ownership of their software assets. If we want to compete with them we must be willing to climb this initial learning curve.
The frustration and anxiety we feel when we take on these challenges is so normal it has a name: “The Learning Dip.” We must recognize that this is where we are and keep going! It’s important not to abandon the effort. For those who like to be “data driven,” tracking lead time will be helpful. For project managers, treating defects as a negative to productivity will also help drive the right attention to quality. Again, time spent fixing bugs is time not spent building out new features. Defects as a percentage of your backlog is something you can measure and show to indicate progress to your stakeholders.
I once managed a team that did one release every 5-6 weeks. After investing heavily in this learning, we were able to release three times in one week. It was a big moment for us and represented enormous progress, but the goal was to be able to release on-demand. We celebrated, but we were not satisfied. The overall health of our service began climbing rapidly according to metrics chosen by our business stakeholders. More than one of the engineers told me later that “I will never go back to working any other way.” They haven’t.
As engineers, even the most experienced people must be willing to adopt a learner’s stance (or “growth mindset”). We must change our design habits to enable automated testing and delivery. We must learn to care about the operational experience of our software and about getting our features into production as fast as possible without defects. Any regular friction we encounter during the testing and deployment process should be met with aggressive action to fix and/or automate away the pain.
As managers, we must set the expectation that our engineers will learn and practice all of the modern software engineering techniques. This includes TDD, CI, CD, DevOps, and SRE concepts. We must make time for them to do so and protect that time.
If we are concerned about the initial impact to our timelines, we can hire engineers who already have this expertise to help guide the effort. It is not necessary that every engineer has the expertise already, but it is necessary that those who have it can teach it to the others and that those who don't are actively engaged in learning. This will dramatically reduce time spent in The Learning Dip in the early stages of rewiring how our teams think about their solutions. If we can't afford to hire FTEs for this role, perhaps we can find budget to hire experienced consultants to work with us and get us through the slump.
Conclusion
Modern Engineering Practices do represent a significant initial expense for teams just learning how to employ them. However, this initial expense enables a force-multiplier effect on feature delivery. In other words, it's true that these techniques cost more–at least initially. It's also true that they reduce the TCO of your software assets over the long term. They speed up your engineering teams' and business' ability to react to the marketplace. A little more expense up front will save you a lot more down the road. As Uncle Bob says, "the only way to go fast is to go well."
Go well and be awesome.
Avoiding Common Pitfalls When Getting Started With DevOps
If you’re in the planning or early development stages of implementing CI/CD for the first time, this post might help you.
DevOps is all the rage. It’s the new fad in tech! Years ago we were saying we should rely less on manual testing and fold testing into our engineering process. Now we are saying we should rely less on manual deployments and fold deployments and operational support into our engineering process. This all sounds lovely to me!
Having been a part of this effort toward automating more and more of our engineering process for the bulk of my career, I’ve had the opportunity to see CI/CD initiatives go awry. Strangely, it’s not self-evident how to setup a CI/CD pipeline well. It’s almost as if translating theory into practice is where the work is.
There are several inter-related subject-areas that need to be aligned to make a CI/CD pipeline successful. They are:
- Source Control
- Branching Strategy
- Automated Build
- Automated Testing
- Build Artifacts
- Deployment Automation
- Environment Segregation
Let’s talk about each of them in turn.
Source Control
Your code is in source control, right? I hate to ask, but I'm surprised by how often I have encountered code that is not in source control. A common answer I get is "yes, except for these 25 scripts we use to perform this or that task." That's a "no." All of your code needs to be in source control. If you're not sure where to put those scripts, create a /scripts folder in your repo and put them there. Get them in there, track changes, and make sure everyone is using the same version.
It's customary for the repo structure to look something like this:
/src
/build
/docs
/scripts
/tests
README.md
LICENSE.md
I also encourage you to consider adopting a "1 repo, 1 root project, 1 releasable component" standard for separating your repositories. Releasable components should be independently releasable and have a lifecycle separate from other releasable components.
Branching Strategy
You should use a known, well-documented branching strategy. The goal of a branching strategy is to make sure everyone knows how code is supposed to flow through your source control system from initial development to the production release. There are three common choices:
- Feature Branch
- Git Flow
- Commit to Master
Some purists will argue that Continuous Integration isn’t happening unless you’re doing Commit to Master. I don’t agree with this. My take is that as long as the team is actively and often merging to the same branch, then the goal of Continuous Integration is being met.
Automated Build
Regardless of what programming language you are using, you need an automated build. When your build is automated, your build scripts become a living document that removes any doubt about what is required to build your software. You will need an automated build system such as Jenkins, Azure DevOps, or Octopus Deploy. You need a separate server that knows how to run your build scripts and produce a build artifact. It should also programmatically execute any quality gates you may have, such as credentials scanning or automated testing. Ideally, any scripts required to build your application should be in your repo under the /build folder. Having your build scripts in source control has the additional advantage that you can use and test them locally.
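As a sketch of what "build scripts in the repo" can look like, here is a hypothetical /build/build.ps1 for a .NET project. The folder layout and the particular gate steps are assumptions, not a prescription.
# Hypothetical /build/build.ps1: one entry point that both the CI server and developers call.
param([string]$Configuration = "Release")
$ErrorActionPreference = "Stop"

# Run a native command and fail loudly on a non-zero exit code.
function Invoke-Step([string]$Name, [scriptblock]$Step) {
    Write-Host "== $Name =="
    & $Step
    if ($LASTEXITCODE -ne 0) { throw "$Name failed with exit code $LASTEXITCODE" }
}

Invoke-Step "restore" { dotnet restore ./src }
Invoke-Step "build"   { dotnet build ./src --configuration $Configuration --no-restore }
Invoke-Step "test"    { dotnet test ./tests --configuration $Configuration }   # quality gate
Invoke-Step "pack"    { dotnet pack ./src --configuration $Configuration --output ./artifacts }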
Automated Testing
Once you can successfully build your software consistently on an external server (external from your development workstation), you should add some quality gates to your delivery pipeline. The first, easiest, and least-expensive quality gate should be unit tests. If you have not embraced Test Driven Development, do so. If your Continuous Integration server supports it, have it verify that your software builds and passes your automated tests at the pre-commit stage. This will prevent commits from making it into your repo if they don’t meet minimum standards. If your CI server does not support this feature, make sure repairing any failed builds or failed automated testing is understood to be the #1 priority of the team should they go red.
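To make the unit-test gate concrete in PowerShell terms, a minimal Pester test might look like the following. Get-Greeting is a hypothetical function under test; the file name follows the usual Pester convention.
# Hypothetical tests/Greeting.Tests.ps1; the CI server runs Invoke-Pester
# on every commit and fails the build on any failed assertion.
Describe "Get-Greeting" {
    It "greets the user by name" {
        Get-Greeting -Name "Ada" | Should -Be "Hello, Ada!"
    }
    It "rejects an empty name" {
        { Get-Greeting -Name "" } | Should -Throw
    }
}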
Build Artifacts
Once the software builds successfully and passes the initial quality gates, your build should produce an artifact. Examples of build artifacts include NuGet packages, Maven packages, zip files, RPM files, or any other standard, recognized package format.
Build artifacts should have the following characteristics:
- Completeness. The build artifact should contain everything necessary to deploy the software. Even if only a single component changed, the artifact should be treated like it is being deployed to a fresh environment.
- Environment Agnosticism. The build artifact should not contain any information specific to any environment in which it is to be deployed. This includes URLs, connection strings, IP addresses, environment names, or anything else that is only valid in a single environment. I'll write more about this in Environment Segregation.
- Versioned. The build artifact should carry its version number. Most standard package formats include the version in the package filename. Some carry it as metadata within the package. Follow whatever conventions are normally used for your package management solution. If it's possible to stamp the files contained in the package with the version as well (e.g., .NET assemblies), do so. If you're using a zip file, include the version in the zip filename. If you are releasing a library, follow Semantic Versioning. If not, consider versioning your application using release date information (e.g., for a release started on August 15th, 2018, consider setting the version number to 2018.8.15 or 1808); a minimal sketch of this follows the list.
- Singleton. Build artifacts should be built only once. This ensures that the artifact you deploy to your test environment will be the artifact that you tested when you go to production.
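Here is the promised sketch of date-based versioning, stamping the version into a zip artifact's filename. The application name and paths are illustrative.
# Sketch: derive a date-based version (e.g., 2018.8.15) and stamp it on the artifact.
# "MyApp" and the paths are hypothetical.
$d = Get-Date
$version = "{0}.{1}.{2}" -f $d.Year, $d.Month, $d.Day
Compress-Archive -Path .\publish\* -DestinationPath ".\artifacts\MyApp.$version.zip"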
Deployment Automation
Your deployment process should be fully automated. Ideally, your deployment automation tools will simply execute scripts they find in your repo. This is ideal because it allows you to version and branch your deployment process along with your code. If you instead build your release scripts inside your release automation tool, you will run into integration errors when you need to modify your deployment automation for different branches independently.
The output of your build process is a build artifact. This build artifact is the input to your deployment automation along with configuration data appropriate to the environment you are deploying to.
Taking the time to script your deployment has the same benefits as scripting your build–it creates a living document detailing exactly how your software must be deployed. Your scripts should assume a clean machine with minimal dependencies pre-installed and should be re-runnable without error.
Take advantage of the fact that you are versioning your build artifact. If you are deploying a website to IIS, create a new physical directory matching the package and version name. After extracting the files to this new directory, repoint the virtual directory to the new location. This makes reverting to the previous version easy should it be necessary, as all of the files for the previous version are still on the machine. The same trick can be accomplished on Unix-y systems using symlinks.
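A minimal sketch of that trick using the WebAdministration module; the site name, paths, and version are assumptions.
# Sketch: extract the versioned artifact to its own folder, then repoint the site.
# "MyApp", the paths, and the version are hypothetical; assumes the
# WebAdministration module is available and the shell is elevated.
Import-Module WebAdministration
$version = "2018.8.15"
$target  = "D:\sites\MyApp\$version"
Expand-Archive -Path ".\artifacts\MyApp.$version.zip" -DestinationPath $target
Set-ItemProperty "IIS:\Sites\MyApp" -Name physicalPath -Value $target
# Rolling back is just repointing physicalPath at the previous version's folder.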
Lastly, your deployment automation scripts are code. Like any other code, they should be stored in source control and tested.
Environment Segregation
I’ve written that you should avoid including any environment-specific configuration in your build artifact (and by extension, in source control), and I’ve said that you should fully automate your deployment process. The configuration data for the target environment should be defined in your deployment automation tooling.
The goal here is to use the same deployment automation regardless of which environment you are deploying to. That means there should be no special steps for special environments.
Most deployment automation tools support some sort of variable substitution for config files. This allows you to keep the config files in source control with defined placeholders where the environment-specific configuration would be. At deployment time, the deployment automation tools will replace the tokens in the config files with values that are meaningful for that environment.
If variable substitution is not an option, consider maintaining a parameter-driven build script that writes out all your config files from scratch. In this case your config files will not be in source control at all but your scripts will know how to generate them.
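Either way, the underlying mechanism is simple token replacement. Here is a minimal sketch of doing it by hand; the #{...} token syntax, file names, and values are all assumptions.
# Sketch: fill environment-specific values into a tokenized config template.
# The token syntax (#{name}), file names, and values below are illustrative.
$values = @{
    "database.connection-string" = "Server=sql01;Database=heroes;"
    "notification.email"         = "ops@example.com"
}
$config = Get-Content .\Web.template.config -Raw
foreach ($key in $values.Keys) {
    $config = $config.Replace("#{$key}", $values[$key])
}
Set-Content -Path .\Web.config -Value $config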
The end-result of all of this is that you should be able to select any version of your build, point it to the environment of your choice, click “deploy,” and have a working piece of software.
Epilogue
The above is not a complete picture of everything you need to consider when moving towards DevOps. I did not cover concepts such as post-deployment testing, logging & monitoring, security, password & certificate rotation, controlling access to production, or any number of other related topics. I did however cover things you should consider when getting started in CI/CD. I’ve seen many teams attempt to embrace DevOps and create toil for themselves because they didn’t understand the material I’ve covered here. Following this advice should save you the effort of making these mistakes and give you breathing room to make new ones :).
1. DevOps is hard
It might not seem like it, but DevOps is hard. A few years ago I thought to myself that it can’t be that difficult since installing an individual application isn’t that difficult. I was wrong in part because…
2. Security is hard
Production is scary. I'd rather not have access when possible. On the other hand, the tools that we use will definitely need access to production since it's kind of the reason they exist. This means we have to have very tight control over who has access to the credentials that the tools run under. We work to limit our own day-to-day accounts so that their access is limited as well.
As a developer I didn’t think much about Security. I pretty much just stuffed an AD Group in a config file somewhere when I was told to and I was done. As a DevOps engineer I have had (and will continue) to learn a lot more about security and its organization even though I don’t manage security for my organization. Security impacts deployments at every level so you will have to learn about security infrastructure in order to make safe and practical recommendations to your security administration group.
3. You are not Netflix (unless you are)
Our organization got excited about DevOps tools after seeing some compelling presentations by Netflix at QCon San Francisco. Netflix has the need for highly scaled web servers which fully embrace the “cattle vs. pets” philosophy because they have millions of concurrent users of a publicly facing service.
We are not Netflix. We have 50+ internal applications with usage rates measured in the tens. They're important to us–they run our business–but our problems are not the same ones Netflix faces. The tools that Netflix uses are designed to solve problems Netflix has. That doesn't necessarily make them a good fit for our needs. We lost a lot of time and effort trying to make Netflix solutions fit our problems.
4. Windows vs. Linux matters when choosing your tools.
There are basically 5 possibilities when it comes to your server topology:
- Windows Only
- Linux Only
- Windows Dominant
- Linux Dominant
- Heterogeneous
If you are managing a homogeneous ecosystem then it’s imperative that you use tools that natively support that system. Don’t try to use Linux tools to manage Windows and vice versa. If you do, you’re gonna have a bad time. If you are primarily deploying to Windows you should look at tools like Octopus Deploy or Build Master. If you’re managing a Linux ecosystem look into Chef, Puppet, or even Docker.
If you're managing a mixed ecosystem where one OS is dominant, you should still use tools designed to support the dominant system. It may be worth the effort to see if your existing tools can also manage the subdominant system. In our case it wasn't worth the effort, so we have instead moved toward an "appliance" model for our Linux servers. What this means is that instead of managing a bunch of code to deploy RabbitMQ to Linux, we're creating VM images for the Rabbit installation which we can hydrate at will. We have far fewer resources who know how to administer Linux, so this model works better for us.
5. DevOps tools are in their infancy
DevOps tools are optimized for the problems their creators were facing. There are many more problems in the DevOps space than any of the dominant tools are capable of managing on their own.
For example, Chef wants to deploy a machine. It's not primarily concerned with applications. The Chef model is to declare the state of the machine and then let Chef decide how to bring the machine to that state. This approach optimizes for horizontally scaling hundreds or thousands of identical nodes with very few commands. Awesome!
In our organization we see the world in terms of Applications–not machines. Our whole way of thinking about deployment is different than the way Chef looks at it. This isn’t a deficiency in Chef or in the way we look at the world, but when we started using Chef we weren’t aware of how fundamental that difference in perspective would actually be.
Because Chef looks at the world in terms of nodes, it has no built-in (or even recommended) solution for artifact and version management. We had to build that. We had to build solutions for managing cookbook versions, publishing artifact and cookbook versions into targeted environments, and forwarding changes to production to antecedent environments.
If you’re using Octopus (we’re migrating from Chef to Octopus) and looking at the world in terms of applications, you will have problems when you need to spin up new environments and whole machines with many applications pre-installed. Either way, you will have to build other tools to glue the off-the-shelf tools together.
(Aside: Though I am not personally a fan of Chef, I have heard of people using Chef to deploy their infrastructure and using Octopus to deploy applications.)
6. DevOps “best practices” are in their infancy
Chef likes to advertise "use Chef however you want! We're flexible!" Great… except Chef is complicated and I would like some guidance on how to use it! This isn't so much a problem with Chef, though–DevOps in general is a very young field, so we don't have the wealth of shared experience from which to draw generalized lessons. To the extent that there is guidance, it's basically cribbed from software engineering best practices and doesn't always apply well.
Here are some of mine:
- Have a canary environment that rebuilds all machines and redeploys all software on a regularly scheduled basis. Use this environment to detect problems in your deployment tool chain early.
- Every developer should have an individual environment of their own to test deployments.
- Every team should have at least one environment for testing and/or UAT.
- Avoid “Standard Failures.” These are errors that occur often and either do not have a known solution or have a manual workaround. Identify the root cause of errors and address them. Incorporate manual workaround solutions into your automated solutions.
- Where possible, embed some sort of "health check" into your applications that you can invoke to have the application check its own configuration (see the sketch after this list).
- Identify rollback strategies for your applications.
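On the health-check point, here is a minimal sketch of the idea. The database host, port, and URL are hypothetical; real checks would come from the application's own configuration.
# Sketch: a self-check the application (or its deploy script) can invoke to
# verify its own configuration. The host, port, and URL are hypothetical.
function Test-AppHealth {
    [pscustomobject]@{
        Check = "database"
        Ok    = Test-NetConnection -ComputerName "sql01" -Port 1433 -InformationLevel Quiet
    }
    [pscustomobject]@{
        Check = "downstream-api"
        Ok    = $(try { (Invoke-WebRequest "http://api.internal/health" -UseBasicParsing).StatusCode -eq 200 } catch { $false })
    }
}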
7. Developers will have to learn infrastructure
If you come from a development background you will have to learn about security, networking, hardware, virtual hardware, etc. This is the domain you are working in now. I'm still at the beginning of this process myself, but I'm starting to see the scale of how much I still have to learn. For example, if you're deploying to the cloud you'll have to learn the inner workings of your chosen cloud infrastructure.
8. Ops will have to learn development patterns and practices.
If you come from an Ops background you will have to learn Software Engineering patterns and practices. You are graduating from someone who writes the occasional script to someone who manages code. Writing some code that only has to be run once is easy. Writing code that has to work again and again and again as well as tolerate change is much, much harder. As the number of people, environments, and machines grow software engineering skills will become more and more important.
9. Don’t automate a bad process.
Consider this: Chef doesn't provide a built-in way to define which artifacts should be deployed to which environments. To that end we built an "application versions" cookbook which contains a list of all applications, their version, and their artifact location. In order to start work, a team must:
- Take a branch of the application versions cookbook.
- Edit the version/artifact information.
- Upload the cookbook to Chef.
- Commit and push the changes back to GitHub.
- Clone the chef-repo.
- Edit the affected environment to use the new version of the application versions cookbook.
- Commit and push chef-repo.
- Upload the edited environment to Chef.
Does that sound like a good idea to you? It doesn't to me–but it's necessary if you're going to use a Chef cookbook as the source for environment application versions. Before you go and wrap some automation around this to make it "easier," let's challenge the basic assumption: should we maybe just store application versions by environment elsewhere? A JSON file on a network share would be easier than this.
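To make that alternative concrete, here is a sketch of what the simpler approach could look like. The share path, file shape, and application names are assumptions.
# Sketch: application versions per environment as plain data instead of a cookbook.
# Hypothetical versions.json:
#   { "uat-heroes": { "heroes-web": "1.4.2", "heroes-api": "1.4.0" } }
$versions = Get-Content "\\share\deploy\versions.json" -Raw | ConvertFrom-Json
$versions."uat-heroes"."heroes-web"   # => 1.4.2
# Bumping a version is a one-line edit and a save; no branch, upload, clone, or push required.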
When you automate a process (even to make it “easier”) you’re pouring a certain amount of lime over it. Be careful.
10. “Infrastructure as Code!” is not always a good idea.
Code != Artifacts != Configuration. The daily work of DevOps breaks down into basically three disciplines: code, configuration, and artifact management. A change to one of these should not necessitate a change to the others. That means that code, configuration, and artifacts should not live together in GitHub.
Use a package manager for your artifacts. If you don't know where to look, check out Artifactory. It's a versatile artifact repository that supports many different kinds of package managers. Its API even understands version numbers and will let you identify and retrieve the "latest" version of your artifact. Let your CI server publish artifacts to your package manager and make it the canonical source for artifact retrieval.
Configuration should not be managed like code. Configuration data is any data required by applications to run. Examples are things like DNS addresses, email addresses for notifications, database connection strings, API endpoints, etc. Configuration data is just data about environments. Unlike code, it does not need to be branched. It should be stored in some central repository and accessed directly by the deployment code.
The code that you use to execute your deployments is most emphatically and in every possible way code. This means it should be tested, stored in source control, subject to your company’s chosen branching strategies, built by a CI server, etc..
The "infrastructure as code" idea is a really great idea, but it applies only to the procedure of deploying hardware and software. It does not fit well with the metadata that describes which hardware and software should be deployed. Don't use "infrastructure as code" as an excuse to push square pegs into round holes.
As part of our Octopus Deploy migration effort we are writing a PowerShell module that we use to automatically bootstrap the Tentacle installation into Octopus. This involves maintaining metadata about machines and environments outside of Octopus. The reason we need this capability is to adhere to the "cattle vs. pets" approach to hardware. We want to be able to destroy and recreate our machines at will and have them show up again in Octopus ready to receive deployments.
Our initial implementation cycled through one machine at a time, installing Tentacle, registering it with Octopus (with the same security certificate so that Octopus recognizes it as the same machine), then moving on to the next machine. This is fine for small environments with few machines, but not awesome for larger environments with many machines. If it takes 2 minutes to install Tentacle and I have 30 machines, I'm waiting an hour to be able to use the environment. With this problem in mind, I decided to figure out how we could parallelize the bootstrapping of machines in our PowerShell module.
Start-Job
Start-Job is one of a family of PowerShell functions created to support asynchrony. Other related functions are Get-Job, Wait-Job, Receive-Job, and Remove-Job. In its most basic form, Start-Job accepts a script block as a parameter and executes it in a background process.
# executes "dir" in a background process.
$job = Start-Job -ScriptBlock { dir }
The job object returned by Start-Job gives you useful information such as the job id, name, and current state. You can run Get-Job to get a list of running jobs, Wait-Job to wait on one or more jobs to complete, Receive-Job to get the output of each job, and Remove-Job to delist jobs in the current PowerShell session.
Complexity
If that's all there was to it, I wouldn't be writing this blog post. I'd just tweet the link to the Start-Job MSDN page and call it done. My scenario is that I need to bootstrap machines using code defined in my PowerShell module, but run those commands in a background process. I also need to collate and log the output of those processes as well as report on the success/failure of each job.
When you call Start-Job in PowerShell, it creates a new session in which currently loaded modules are not automatically loaded. If your module is in the $PsModulePath, you're probably okay. However, there is a difference between the version of the module I'm currently working on and testing vs. the one I have on my machine for normal use.
Start-Job has an additional parameter for a script block used to initialize the new PowerShell session prior to executing your background process. The difficulty is that while you can pass arguments to the background process script block, you cannot pass arguments to the initialization script. Here's how you make it all work.
Setup Code
# Store the working module path in an environment variable so that the new powershell session can locate the correct version of the module.
# The environment variable will not persist beyond the current powershell session so we don't have to worry about polluting our machine state.
$env:OctobootModulePath = (get-module Octoboot).Path
$init = {
# When initializing the new session, use the -Force parameter in case a different version of the module is already loaded by a profile.
import-module $env:OctobootModulePath -Force
}
# create a parameterized script block
$scriptBlock = {
Param(
$computerName,
$environment,
$roles,
$userName,
$password,
$apiKey
)
Install-Tentacle -computer $computerName `
-environment $environment `
-roles $roles `
-userName $userName -password $password `
-apiKey $apiKey
}
# I like to use an -Async switch on the controlling function. Debugging issues is easier in a synchronous context than in an async context. Making the async functionality optional is a win.
if ($async) {
$job = Start-Job `
-ScriptBlock $scriptBlock `
-InitializationScript $init `
-Name "Install Tentacle on $($computerName)" `
-ArgumentList @(
$computerName, $environment, $roles,
$userName, $password,
$apiKey) -Debug:$debug
} else {
Install-Tentacle -computer $computerName `
-environment $environment `
-roles $roles `
-userName $userName -password $password `
-apiKey $apiKey
}
The above code is in a loop in the controlling powershell function. After I’ve kicked off all of the jobs I’m going to execute, I just need to wait on them to finish and collect their results.
Finalization Code
if ($async) {
$jobs = get-job
$jobs | Wait-Job | Receive-Job
$jobs | foreach {
$job = $_
write-host "$($job.Id) - $($job.Name) - $($job.State)"
}
$jobs | remove-job
}
Since each individual job is now running in parallel, bootstrapping large environments doesn’t take much longer than bootstrapping smaller ones. The end result is that hour is now reduced to a few minutes.
DevOps is a relatively new space in the software engineering world. There are a smattering of tools to aid in the automation of application deployments, but precious little guidance with respect to patterns and practices for using the new tools. As a guy who loves leaning on principles, I find this lack of attention to best practices leaves me feeling a bit uncomfortable. Since I'm leading a migration to Octopus Deploy, I thought I would share some of the decisions we've made.
This series of posts is an attempt to start a conversation about best practices. I want to be clear: We have not been applying these ideas long enough to know what all of the ramifications are. Your mileage may vary.
Posts in this series
1. Environments
2. Roles
3. Variables & Variable Sets
Variables & Variable Sets
Octopus Deploy allows you to modify your application’s configuration through the use of variables. You can define variables at the project level, or share variable values between projects through variable sets. If you have relatively little sharing of variables between projects you will likely prefer to create variables at the project level. My team manages over 50 different applications. Many of them are web services designed to support SOA. The net impact is that we have a lot of shared variables and for this reason we define variables exclusively through variable sets. This saves us time hunting for where a given variable is defined.
We use 2 kinds of variable sets:
1. Global
2. Role-based
Global variable sets define values that might be required across the company irrespective of any particular application, or that are more easily managed together. For example, we wish to capture metadata about environments. Octopus itself does not have a facility for tagging environments with arbitrary metadata. To satisfy this goal we created a variable set called "environment" in which we create variables to indicate values such as "owner" and "abbr". We use these values to compose the values of other variables such as DNS addresses or email addresses.
We also have some environments for which we do not create DNS addresses for the sites. In these environments we need to install web applications with alternate ports. We keep a variable set to define the port numbers we use for web applications in these environments, since they must be unique across the web server.
The number of global variable sets should be as small as possible.
Role-based variable sets are variables defined for the specific roles they target. If we have a role called heroes-iis, we will also have a variable set called heroes-iis. Since we create roles on a per-deployed-application basis, this helps us keep roles, projects, and variable sets linked. If heroes-iis has web service endpoints, this variable set may be included in some other project that depends on those endpoints.
Naming Conventions
It is important to have naming conventions for your variables. I highly recommend prefixing all variables in a variable set with the name of the variable set to avoid potential naming collisions. For example, if I have a variable set called heroes-iis, it will have variables with names like:
- heroes-iis.application-pool.name
- heroes-iis.application-pool.password
Define a Standard Structure for Similar Variable Sets
Once you get the rhythm of installing applications with Octopus, you will discover that similar kinds of applications have similar variable definition needs. You can save yourself a lot of time and Chrome tabs by establishing a variable set template that you use when creating a variable set for each kind of application you deploy. Here is our variable set template for web applications being deployed to IIS:
Variable Set Name | Segment | Field | Variable Name | Notes |
---|---|---|---|---|
name-iis | application-pool | name | name-iis.application-pool.name | The name of the application pool. |
 | | username | name-iis.application-pool.username | The username the application pool runs under. |
 | | password | name-iis.application-pool.password | The password the application pool runs under. |
 | | host | name.host | This corresponds to the site name as registered in IIS. It does not include the protocol (http://, https://). It should be blank if the site is being deployed into an environment without a DNS entry. |
 | | site-name | name.site-name | This will be just the name of the web application in environments that do not have a DNS entry. If the environment has a DNS entry, it should resolve the host property. |
 | | site-root | name.site-root | This is the URL root for the site. It should include the protocol (http://, https://) as well as the port and any additional routing information. |
 | endpoints | endpoint-name | name.endpoints.endpoint-name | A web service may expose one or more endpoints. These should have unique names. Their values should be defined with reference to the host and port variables. |
 | connection-strings | cs-name | name.connection-strings.cs-name | The name of the connection string in the config file. |
Scope
Octopus Deploy allows you to scope variables by environment, role, or channel (as of 3.3). The scoping rules are as follows:
(environment1 OR environment2 OR ...) AND (role1 OR role2 OR ...) AND (channel1 OR channel2 OR ...)
I recommend that you scope variable values as broadly as possible. Use composed variable values where you can to minimize the number of variable values you have to maintain. For example:
heroes-iis.connection-strings.heroes-db => "Server=#{environment.sql-server.url}; Database=#{heroes-db.database-name};"
heroes-db.database-name => HEROES_#{environment.name}
environment.sql-server.url => http://sql-server.#{environment.name}.com
By using a composed variable value I don't need to scope the connection string variable itself. Instead, I can confine scoping to environment.name and satisfy the resolution of all of the descendant variables. This minimizes the number of variables I have to actively maintain as new environments are created.
Roles
When you add machines into Octopus, you must specify environments and roles for that machine. For our purposes, environments were pretty easy to define. Roles however took some work. Here are the kinds of roles we defined.
Operating Systems
Example: windows, linux
This is pretty easy. We started with Linux and Windows for this type of role. I can see a day when we may need to additionally specify ubuntu-14 or 2k8-r2. In the meantime, YAGNI.
Environment Types
Example: dev, uat, integration, staging, prod, support
Our environment naming convention for developer environments is dev-{first initial}{last name}. For UAT environments it's uat-{team}. There is only one each of the integration, staging, production, and support environments. There are certain variables that are defined consistently across all dev environments but may differ in uat environments. For this reason we apply the environment type as a role across all machines in the relevant environments.
Commands
Example: hero-db.migrator
This is a standalone role. There will only be one machine in each environment that has this role. Its purpose is to execute commands on some resource in the environment that should not be run multiple times or concurrently. A good example of this is an Entity Framework database migration. We choose one machine in each environment from which database migrations can be run.
Applications
Example: webapp-iis, topshelf-service
Each deployable application has its own role. Not every application gets installed on every machine in an environment. We use the -iis suffix for applications installed into IIS, regardless of whether they're sites or web services. We use the -service suffix for Windows Services. We do this because we sometimes have a family of applications that share the same name but target a different kind of application.
Our Default Lifecycle
Before I begin, I should give you some background on our development ecosystem. Our Octopus Lifecycle looks like this:
dev => uat => integration => staging => prod => support
The Environments
Name | Convention | Purpose | Notes |
---|---|---|---|
dev | dev-{first initial}{last name} | The primary purpose of these environments is to test the deployment tooling itself. | We have 15 or so individual developer environments. Each developer gets their own environment with 2 servers (1 Linux, 1 Windows) and all of our 60 or so proprietary applications installed to it. |
uat | uat-{team} | These environments are used by teams to test their work. | We have 10 or so User Acceptance Testing environments. These are a little bit more fleshed out in terms of hardware. There are multiple web servers behind load balancers. The machines are beefier. These environments are usually owned by a single team, though they may sometimes be shared. |
integration | N/A | Dress rehearsal by Development for releases | The integration environment is much closer to production. When multiple teams are releasing their software during the same release window, integration gives us a rehearsal environment to make sure all of the work done by the various teams will work well together. |
staging | N/A | Dress rehearsal by Support for releases | Staging is exactly like integration except that it is not owned by the Development department. We have a team of people who are responsible for executing releases. This is their environment to verify that the steps development gave them will work. |
prod | N/A | Business Use | Prod is not managed by the deployment engineering team. We build the button that pushes to prod, but we do not push it. |
support | N/A | Rehearsal environment for support solutions | Support is a post-production environment that mirrors production. It allows support personnel to test and verify support tasks in a non-prod environment prior to running them in production. |