1. DevOps is hard
It might not seem like it, but DevOps is hard. A few years ago I thought to myself that it can’t be that difficult since installing an individual application isn’t that difficult. I was wrong in part because…
2. Security is hard
Production is scary. I’d rather not have access when possible. On the other hand, the tools that we use will definitely need access to production since it’s kind of the reason they exist. This means we have to have very tight control over who has access to the credentials that the tools run under. We also restrict our own day-to-day accounts so that they carry only the access they need.
As a developer I didn’t think much about security. I pretty much just stuffed an AD Group in a config file somewhere when I was told to and I was done. As a DevOps engineer I have had to learn a lot more about security and how it is organized (and I will continue to), even though I don’t manage security for my organization. Security impacts deployments at every level, so you will have to learn about security infrastructure in order to make safe and practical recommendations to your security administration group.
3. You are not Netflix (unless you are)
Our organization got excited about DevOps tools after seeing some compelling presentations by Netflix at QCon San Francisco. Netflix has the need for highly scaled web servers which fully embrace the “cattle vs. pets” philosophy because they have millions of concurrent users of a publicly facing service.
We are not Netflix. We have 50+ internal applications with usage rates measured in the tens. They’re important to us (they run our business), but our problems are not the same ones Netflix faces. The tools that Netflix uses are designed to solve problems Netflix has. That doesn’t necessarily make them a good fit for our needs. We lost a lot of time and effort trying to make Netflix solutions fit our problems.
4. Windows vs. Linux matters when choosing your tools.
There are basically four possibilities when it comes to your server topology:
- Windows Only
- Linux Only
- Windows Dominant
- Linux Dominant
If you are managing a homogeneous ecosystem then it’s imperative that you use tools that natively support that system. Don’t try to use Linux tools to manage Windows and vice versa. If you do, you’re gonna have a bad time. If you are primarily deploying to Windows you should look at tools like Octopus Deploy or BuildMaster. If you’re managing a Linux ecosystem look into Chef, Puppet, or even Docker.
If you’re managing a mixed ecosystem where one OS is dominant, you should still use tools designed to support the dominant system. It may be worth the effort to see if your existing tools can also manage the subdominant system. In our case it’s not worth the effort, so we have instead moved toward an “appliance” model for our Linux servers. What this means is that instead of managing a bunch of code to deploy RabbitMQ to Linux, we create VM images for the Rabbit installation which we can hydrate at will. We have far fewer people who know how to administer Linux, so this model works better for us.
5. DevOps tools are in their infancy
DevOps tools are optimized for the problems their creators were facing. There are many more problems in the DevOps space than any of the dominant tools are capable of managing on their own.
For example, Chef wants to deploy a machine. It’s not primarily concerned about applications. The Chef model is to declare the state of the machine and then let Chef decide how to bring the machine to that state. This approach optimizes for horizontally scaling hundreds or thousands of identical nodes with very few commands. Awesome!
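To make the "declare the state, let the tool converge" idea concrete, here is a minimal sketch in Python. It is not Chef code and does not use any Chef API. The `Node` class, the `converge` function, and the package names and versions are all hypothetical, chosen only to show the shape of the model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a machine's current state."""
    packages: dict = field(default_factory=dict)  # package name -> version


def converge(node: Node, desired: dict) -> list:
    """Bring the node to the desired state and return the actions taken.

    The caller describes the end state, not the steps. Anything that
    already matches is skipped, so a second run does nothing.
    """
    actions = []
    for name, version in sorted(desired.items()):
        current = node.packages.get(name)
        if current == version:
            continue  # already in the desired state
        verb = "install" if current is None else "upgrade"
        node.packages[name] = version
        actions.append(f"{verb} {name} {version}")
    return actions


# Illustrative data only: one package is out of date, one is missing.
desired_state = {"nginx": "1.24.0", "rabbitmq": "3.12.1"}
node = Node(packages={"nginx": "1.22.0"})

first_run = converge(node, desired_state)
second_run = converge(node, desired_state)
print(first_run)   # actions needed to reach the declared state
print(second_run)  # empty list: the node already matches
```

The same declaration can be applied to one node or a thousand, which is why this model scales so well horizontally. Notice that nothing in it knows what an "application" is.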
In our organization we see the world in terms of Applications–not machines. Our whole way of thinking about deployment is different than the way Chef looks at it. This isn’t a deficiency in Chef or in the way we look at the world, but when we started using Chef we weren’t aware of how fundamental that difference in perspective would actually be.
Because Chef looks at the world in terms of nodes, it has no built-in (or even recommended) solution for artifact and version management. We had to build that. We had to build solutions for managing cookbook versions, publishing artifact and cookbook versions into targeted environments, and forwarding changes to production to antecedent environments.
If you’re using Octopus (we’re migrating from Chef to Octopus) and looking at the world in terms of applications, you will have problems when you need to spin up new environments and whole machines with many applications pre-installed. Either way, you will have to build other tools to glue the off-the-shelf tools together.
(Aside: Though I am not personally a fan of Chef, I have heard of people using Chef to deploy their infrastructure and using Octopus to deploy applications.)
6. DevOps “best practices” are in their infancy
Chef likes to advertise “use Chef however you want! We’re flexible!” Great…. except Chef is complicated and I would like some guidance on how to use it! This isn’t so much a problem with Chef though–DevOps in general is a very young field so we don’t have the wealth of shared experience from which to draw generalized lessons. To the extent that there is guidance it’s basically cribbed from Software Engineering best practices and doesn’t always apply well.
Here are some of mine:
- Have a canary environment that rebuilds all machines and redeploys all software on a regularly scheduled basis. Use this environment to detect problems in your deployment tool chain early.
- Every developer should have an individual environment of their own to test deployments.
- Every team should have at least one environment for testing and/or UAT.
- Avoid “Standard Failures.” These are errors that occur often and either do not have a known solution or have a manual workaround. Identify the root cause of errors and address them. Incorporate manual workaround solutions into your automated solutions.
- Where possible, embed some sort of “health check” into your applications that you can invoke to have the application check its own configuration.
- Identify rollback strategies for your applications.
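As an illustration of the "health check" item above, here is a rough sketch in Python of an application validating its own configuration. The setting names (`DATABASE_URL`, `NOTIFY_EMAIL`) and the rules are assumptions made up for this example. A real check would cover whatever your application actually depends on, and would probably try to connect to those dependencies as well.

```python
from urllib.parse import urlparse

# Hypothetical settings this example application needs in order to run.
REQUIRED_SETTINGS = ("DATABASE_URL", "NOTIFY_EMAIL")


def check_config(settings: dict) -> dict:
    """Return a report; an empty 'problems' list means the config looks sane."""
    problems = []
    for key in REQUIRED_SETTINGS:
        if not settings.get(key):
            problems.append(f"missing setting: {key}")

    db_url = settings.get("DATABASE_URL", "")
    if db_url and not urlparse(db_url).hostname:
        problems.append("DATABASE_URL has no host")

    email = settings.get("NOTIFY_EMAIL", "")
    if email and "@" not in email:
        problems.append("NOTIFY_EMAIL is not an email address")

    return {"healthy": not problems, "problems": problems}


good = check_config({
    "DATABASE_URL": "postgres://db.internal:5432/app",
    "NOTIFY_EMAIL": "ops@example.com",
})
bad = check_config({"DATABASE_URL": "not-a-url"})
print(good)
print(bad)
```

If the deployment tool calls something like this right after installing the application, a misconfigured environment fails the deployment instead of failing the first user.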
7. Developers will have to learn infrastructure
If you come from a development background you will have to learn about security, networking, hardware, virtual hardware, etc. This is the domain you are working in now. I’m still at the beginning of this process myself, but I’m starting to see just how much I still have to learn. For example, if you’re deploying to the cloud you’ll have to learn the inner workings of your chosen cloud infrastructure.
8. Ops will have to learn development patterns and practices.
If you come from an Ops background you will have to learn Software Engineering patterns and practices. You are graduating from someone who writes the occasional script to someone who manages code. Writing some code that only has to be run once is easy. Writing code that has to work again and again and again as well as tolerate change is much, much harder. As the number of people, environments, and machines grows, software engineering skills will become more and more important.
9. Don’t automate a bad process.
Consider this: Chef doesn’t provide a built-in way to define which artifacts should be deployed to which environments. To that end we built an “application versions” cookbook which contains a list of all applications, their versions, and their artifact locations. In order to start work a team must:
- Take a branch of the application versions cookbook.
- Edit the versions/artifact information.
- Upload the cookbook to Chef.
- Commit and push the changes back to GitHub.
- Clone the chef-repo.
- Edit the affected environment to use the new version of the application versions cookbook.
- Commit and push chef-repo.
- Upload the edited environment to Chef.
Does that sound like a good idea to you? It doesn’t to me, but it’s necessary if you’re going to use a Chef Cookbook as a source for environment application versions. Before you go and wrap some automation around this to make it “easier,” let’s challenge the basic assumption: should we maybe just store application versions by environment elsewhere? A JSON file on a network share would be easier than this.
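To show how little is needed for the JSON alternative, here is a minimal sketch in Python. The file layout, the environment names, and the application names and versions are all invented for illustration. The JSON is embedded as a string so the example runs on its own; with a real file on a share you would use `json.load` on the opened file instead.

```python
import json

# Hypothetical contents of a versions file: environment -> application -> version.
VERSIONS_JSON = """
{
  "dev":  {"billing-api": "2.4.0", "reports": "1.9.3"},
  "prod": {"billing-api": "2.3.1", "reports": "1.9.3"}
}
"""


def version_for(versions: dict, environment: str, application: str) -> str:
    """Look up which version of an application belongs in an environment."""
    try:
        return versions[environment][application]
    except KeyError as missing:
        raise LookupError(
            f"no version recorded for {application!r} in {environment!r}"
        ) from missing


versions = json.loads(VERSIONS_JSON)
print(version_for(versions, "prod", "billing-api"))
```

Promoting a version to another environment is then a one-line edit to one file, compared with the eight steps listed above.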
When you automate a process (even to make it “easier”) you’re pouring a certain amount of lime over it. Be careful.
10. “Infrastructure as Code!” is not always a good idea.
Code != Artifacts != Configuration. The daily work of DevOps breaks down into basically three disciplines: Code, Configuration, and Artifact management. A change to one of these should not necessitate a change to the others. That means that Code, Configuration, and Artifacts should not live together in GitHub.
Use a Package Manager for your artifacts. If you don’t know where to look, check out Artifactory. It’s a versatile artifact repository that supports many different kinds of package managers. Its API even understands version numbers and will let you identify and retrieve the “latest” version of your artifact. Let your CI server publish artifacts to your package manager and make it the canonical source for artifact retrieval.
Configuration should not be managed like code. Configuration data is any data required by applications to run. Examples are things like DNS addresses, email addresses for notifications, database connection strings, API endpoints, etc. Configuration data is just data about environments. Unlike code it does not need to be branched. It should be stored in some central repository and accessed directly by the deployment code.
The code that you use to execute your deployments is most emphatically and in every possible way code. This means it should be tested, stored in source control, subject to your company’s chosen branching strategies, built by a CI server, etc.
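Why does it matter that the repository "understands version numbers"? Here is a small sketch in Python of the problem it saves you from. This is not the Artifactory API. `parse_version` and `latest` are hypothetical helpers, and they assume plain dotted numeric versions with no pre-release suffixes.

```python
def parse_version(text: str) -> tuple:
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def latest(versions: list) -> str:
    """Pick the highest version by numeric comparison, not string order."""
    return max(versions, key=parse_version)


available = ["1.9.0", "1.10.0", "1.2.5"]  # illustrative data only

naive_pick = max(available)     # plain string comparison
correct_pick = latest(available)
print(naive_pick, correct_pick)
```

Plain string comparison ranks "1.9.0" above "1.10.0", which is exactly the kind of bug you do not want to discover during a deployment. A package manager that resolves "latest" for you takes that logic out of your deployment code.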
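Here is one way that separation could look, sketched in Python. The `CONFIG_STORE` dictionary stands in for whatever central repository you use, and every host name, address, and key in it is made up for the example. The point is only that the deployment code is the same for every environment while the data varies.

```python
from string import Template

# Stand-in for a central configuration store: environment -> settings.
# All values here are invented for illustration.
CONFIG_STORE = {
    "dev": {
        "db_host": "db.dev.example.internal",
        "notify_email": "dev-team@example.com",
    },
    "prod": {
        "db_host": "db.prod.example.internal",
        "notify_email": "ops@example.com",
    },
}

# The template belongs to the deployment code and does not change per environment.
APP_SETTINGS = Template("db_host=$db_host\nnotify_email=$notify_email\n")


def render_settings(environment: str) -> str:
    """Fill the settings template with the data for one environment."""
    return APP_SETTINGS.substitute(CONFIG_STORE[environment])


print(render_settings("prod"))
```

Changing the production notification address is then a data change in one place. No branch, no merge, and no new build of the deployment code.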
The “infrastructure as code” idea is a really great idea, but it applies only to the procedure of deploying hardware and software. It does not fit well with the metadata that describes which hardware and software should be deployed. Don’t use “infrastructure as code” as an excuse to push square pegs into round holes.