Why We’re Migrating From Chef to Octopus
A couple of months ago I started leading a team that was using Chef to manage our deployment infrastructure. We are now migrating from Chef to Octopus Deploy. I thought it might be interesting to explain why. What follows is my opinion and my best understanding of the facts. As always, do your own thinking. 🙂
Chef takes the DevOps mantra “infrastructure as code” to heart. Everything created for consumption by Chef is code. This is a strength in many ways, but sometimes a weakness.
Chef consists of a server which contains metadata about your infrastructure, “cookbooks” which contain instructions for configuring machines and applications, and a client application which does the work of truing up a machine to its representation in the Chef Server.
In Chef, the target of work is a node. When
chef_client runs it’s goal is to make the node it’s running on look like Chef Server expects it to. It has little awareness of applications except insofar as you layer that into your cookbook design.
chef_client pulls the data it needs from the chef server in order to do its work. This means that in order to update a machine you have to have access to that machine as well as rights to run
chef_client. This also means that it takes a bit more effort to be able to redeploy a single application to the node in the event something goes wrong. If you are running single-purpose nodes that’s probably not an issue for you.
It bears mentioning that Chef was designed for Linux first. Windows is an afterthought and it shows. There was no built-in support for common Windows constructs such as IIS and Windows Services at the time of our implementation. (My understanding from a coworker is that Chef now has built-in support for windows services.)
Chef has an api, but it’s nearly impossible to access using standard tools due to a convoluted authentication model. Chef as a CLI called
knife which wraps up most api behavior. It should be stated that you cannot initiate activities through the api–that is all done by running
chef_client on the target machines.
knife can query the api for the machines you want and initiate
chef_client runs for you however.
The developer experience with Chef is all about editing cookbooks. A “cookbook” is a collection of “recipes” that can be executed by
chef_client on the target node. Cookbook development involves using a ruby dsl that transpiles to raw ruby on the target node during a process called “convergence.” You edit the cookbook, push the modified cookbook to the chef server, then run
chef_client on the node. On the surface this seems simple. It’s not.
If you are staying “in the rails” (rimshots for the ruby joke 🙂 ) with chef and using their built in functions (called “resources”) to do your work, you will mostly be okay. If you are interested in testing your work you will start to feel pain. Chef provides the ability to do “local” testing–by which they mean spinning up VM’s using Vagrant over VirtualBox. “Local” in this context doesn’t mean you get to see the transpiled ruby before it executes, or step into a debugger. It just means that the VM you’re testing on happens to be running on your developer machine.
There are tools or automated testing with Chef cookbooks. We haven’t really learned to use them very well so I can’t personally comment on their quality. My coworker had this to say:
There is a commonly used unit testing framework for Chef, called ChefSpec (which is built upon RSpec). I have issues with ChefSpec, so most of my Chef unit tests are done in pure RSpec. However, ChefSpec is not officially supported by Chef.
There is chef-zero, which allows you to spin up an in memory mock Chef Server, which will allow you to start/pause/stop a mock node convergence, and will allow you to examine attributes at different steps of resolution. Test-Kitchen is the integration testing framework, which spins up actual VMs and uses Vagrant as a driver (it may use Docker now? Not sure), then fully converges that VM as a node. So the Chef ecosystem has the ability to do lots of testing at various levels. I feel it is inaccurate to state that testing is poor in Chef. What is accurate to state is that Chef is Linux first, Windows afterthought. So all the testing goodness in the Chef ecosystem was either unavailable to us as a Windows shop, or it was extremely difficult to use. … we knew that lack of testing was a weakness in our Chef code. We really tried to make it happen, but the hurdles were insurmountable with the resources available.
You can customize chef behavior by writing “resources” and “providers.” This process is not straightforward or easy to understand and frankly I still find it confusing. When writing new code for chef you can use straight ruby, but consuming standard ruby gems is problematic because of the convergence process. It’s not clear when the ruby code will run. Sometimes it runs during convergence. Sometimes it runs at when a recipe is called. Keeping it all straight and making it work as expected requires some mental and technical gymnastics. We had numerous errors caused by ruby code not executing when we thought it would.
There is very little guidance about the correct way to use Chef. In fact, Opscode seems to pride itself on this fact. They exclaim on twitter and at conferences “You can use Chef anyway you want!” Great–but how should I use Chef? This is not a trivial question as there are at least 3 kinds of cookbooks in Chef development. There are cookbooks that define data, define utilities or “resources”, and deploy applications. There are 6 levels of overrides for variables. Which data should you put in application cookbooks vs. environment cookbooks? When should you use overrides and how?
Chef has elected to rely on the community to provide cookbooks for most things. Some vendors have created officially supported cookbooks for their products, but most have not. Who can blame them? There’s no clear winner in the DevOps space yet, so why spend the time and money to support a tool that may fizz out? This means that most community cookbooks are of dubious quality. They are generally designed to solve the particular problem the author was facing when they were trying to install something. They may or may not set things up in a recommended fashion. They make assumptions about the OS, version, security, and logging strategies that are by no means universal. This left us in the position of having to read and fully understand the cookbook before deciding whether or not to use it. More often, we just ended up writing our own customized version of the same cookbook.
It’s clear to us that Chef is not a good fit for our organization. We deploy 50+ applications to 40+ environments. Because chef provisions a node and not an application if there’s a problem in any one application the entire deployment is considered a failure. We were never able to get above 80% successful deployments across our environments. Many of the reasons for this failure were non-technical and have nothing to do with the tool, but the complexity of the tool did not help us. Finding expert-level help with Chef on Windows was equally problematic. Even at the conferences we attended, the Windows Chef implementors were like the un-cool kids sitting at a table by themselves in the corner not quite excited by the “ra ra ra” of the pep rally.
If I were going to use Chef I would use it in an ecosystem that had fewer applications and more servers. Let’s say I had a web service that I needed to horizontally scale across 1000 Linux nodes. Chef would be a great fit for this scenario.
My co-worker says:
Chef would be a good choice if
a. You use Linux
b. You have a simple deployment topology, which can be supported by Systems Engineers who can do a little scripting (i.e., use only the built in resources).
c. Or you have a large DevOps engineering team that has enough throughput to dissect the bowels of Chef.
Chef is designed for a setup where there are thousands of nodes, each node is responsible for doing 1 thing, and it is cheap to kill and provision nodes. Because we use Windows, we have hundreds of nodes [and] each node is responsible for many things. It is expensive to kill and provision nodes. Windows is moving towards the cattle not pets paradigm, but it’s not there yet. Chef is designed to work with setups that are already at the cattle stage.
Octopus kind of ignores the “infrastructure as code” idea and tries to give you a button-click experience for deployments.
Octopus Deploy is composed of a web server and a windows service. The web server stores metadata about your hardware and software ecosystem. The windows service (called Tentacle) is installed on each machine that Octopus will manage and performs the work locally. Octopus was developed api-first which means that all of the functionality you invoke through the Web UI is available through the api. The api is secured using standard models which means it is easy to consume using your favorite tools.
Instructions are packaged up by the server and pushed to the Tentacle agents where they are executed. As long as the Tentacle service account has the requisite permissions, anyone with permission to deploy through Octopus Server can deploy to the target machines even if they don’t otherwise have access to those machines.
Whereas Chef is node-centric, Octopus is application-centric. This is a strength in that it’s easy to deploy or redeploy a single application to a node that may have many applications on it. This is a weakness in that it’s harder to deploy all of the applications required for a particular machine (more button clicks). This weakness will be mitigated by the elastic environment features of the soon-to-be-released 3.4 version.
Octopus separates artifact management from release process management. You can modify your release process without ever editing your deployment artifacts, and vice versa. Deployment artifacts are a first class citizen in Octopus. You will have to layer in your own artifact management into Chef.
Octopus is written in C# and runs on Windows. The advantage over Chef here is that we get native Active Directory integration for security purposes. In the Octopus world developers publish their applications as NuGet packages which are already understood by most .NET developers. The nuget package defintion lives with the application source code so no external mapping of deployment code to the application is required. Further, it’s obvious which bits go in the NuGet package and which bits will live on the Octopus Server.
Octopus uses one of several variable substitution techniques to manage the config files. Since Octopus was written with Windows deployments in mind, it contains built in processes for deploying IIS applications and Windows Services. If further customization is needed, developers can write custom scripts in Powershell as either a Step Template or a script included in the nuget package.
Unlike Chef, you can test the scripts you write locally with the caveat that you will have to provide any variables they require to execute. Since it’s just Powershell you can use any of the existing tools and debuggers written for Powershell. The lack of a transpilation step means that the code you wrote is the code that will run.
Like Chef, Octopus has not “won” the DevOps space. My impression from the community is that the percentage of shops working in the DevOps space at all is fairly low so this isn’t a mark against either tool. However, low adoption means that there are fewer people sharing their lessons learned with the world. If you go looking for Octopus best practices, you won’t find much more than my blog series. Most “best practice” articles and videos about DevOps consist of little more than advice to automate all the things.
On the other hand, the way Octopus itself is structured steers you toward some best practices. Some of these ideas could be layered onto a Chef implementation just as easily, but Octopus is opinionated in such a way that it steers you in the right direction. It’s nice to have Octopus tell us “this is the blessed path to accomplish your goal.”
Octopus distinguishes variables from processes. A Release Process is a set of logic that results in the successful deployment of an application to a machine in an environment. It is informed by variables, which can be scoped to environments and roles (and channels as of version 3.3). Variables can be managed at a project level or in shared libraries called Variable Sets.
Custom logical steps can be defined in powershell, stored and versioned in Octopus, and used as part of the Release Process. These are called Step Templates. Step Templates are usually a lighter weight maintenance effort than a cookbook since they’re just powershell and often targeted at very small units of functionality that can be chained together without much overhead.
In Octopus, a Release is created which binds a deployment artifact to a set of variables and instructions. This release is then promoted through a series of environments defined in a lifecycle. There is no binding of artifacts to cookbooks in Chef unless you write that yourself.
My impression is that the Octopus community is smaller than the Chef community. There is only one sanctioned mechanism for the community to share code–a github repo which Octopus itself reviews before accepting pull requests. The fact that Octopus reviews contributions gives me the impression that the quality of the community code is better than what we found in Chef. Having gone through the review process with them for a new step template we created we know that they are actually looking at the code. On the other hand, since Octopus targets the Windows ecosystem and has built-in support for most things Windows the need for community help isn’t as great.
When we were first learning Octopus, our first deploy from time to install was about 30 minutes. With Chef there were days of training and hours spent tweaking cookbooks until we got our first deployment. Octopus is simply easier for us to understand. We have a small team of 3 dedicated developers to manage 40+ environments that must be deployed with 50+ applications on demand. In addition this team manages Artifactory, TeamCity, and glue-code that ties these systems together. Our goal is to bring our successful deployment rate up to 95+% and not need babysit deployments to production.
For our needs Octopus is the clear winner. We are a .NET shop deploying C# applications to the Windows ecosystem. Octopus is a C# application that runs on Windows. Octopus’ understanding of our problem space is far superior to Chef. The tools they provide are high quality, easy to use, and work with very little configuration. There are gaps to fill, but the fact that Octopus is written API-first gives me confidence that we can easily fill them.