Would you intentionally create a service that arbitrarily (within constraints) shuts down other things in your environment? I wouldn’t, but Netflix did with its “Chaos Monkey.” Why? Why would someone do this?
DevOps hype is extensive and you must be careful not to be swept away by its exuberance. The solution, however, is not to be a reactionary, but to look at the thing, understand it, and see what you can learn from it.
So, what can network engineers learn from DevOps? How can we apply DevOps to do things better? Is there something real and valuable here—or is it all one big Chaos Monkey?
Here are seven DevOps practices that I think are potentially valuable. This list is by no means exhaustive.
#1: Optimize the whole.
I am sure many of you have seen conflicts between different parts of IT (developers versus sys admins) or between different parts of a company (marketing versus operations). Many of these conflicts are caused by different parts of a company optimizing for their own domain at the expense of the business as a whole.
At root, this is probably a management failure—a failure to decide what is important and a failure to create an environment where different groups are working towards common goals.
Strive to be more aware of the various IT systems in your organization and the business results they are intended to achieve. Work to increase your understanding of how your work affects others and of their workflow dependencies on you.
#2: Restrict your work in process.
Having 50 projects going on simultaneously is not a good idea. Strictly control how much work in process you have.
Too much work in process causes a lot of problems both for individuals (the cost of mentally shifting from one project to another) and for teams (delays waiting for someone else).
Is there a lower bound on this? Yes, but it depends on several factors. How big is your project? What is the nature of the tasks you do (how long, how complex)? How engaged are you in this task versus other tasks?
#3: Reduce your batch size.
Basically, do smaller changes more frequently.
I am sure that many of you have been bitten by large, “big-bang” projects (my term for projects that have a large cutover at a point in time). Smaller changes are easier to plan, easier to test, and easier to roll back.
Also, doing changes more frequently can increase your proficiency at them. You will likely be better at something that you do every week compared to something you do every quarter.
There are some big caveats here, however. A lot of the benefit of reduced batch size comes from automation. How much of your process can be automated? Are you actively working on automating it? Does your organization have reasonable practices and policies with respect to changes? If all network changes must be done in a Sunday change window from 2 a.m. to 4 a.m., then I am not going to do this very frequently.
The lower bound on batch size is, again, contextual. If you could automate every part of the process (QA testing, deploying the change, validating the change, monitoring for issues, and potentially rolling back the change), then you could make changes very frequently.
The batch size lower bound also depends on your organization—how important is downtime minimization versus new feature deployment? How rapidly is your organization changing? What resources do you have available for making changes and for automating processes?
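To make the automation side of this concrete, here is a minimal sketch of what such a change workflow might look like. The helper functions (run_qa_tests, deploy_change, validate_change, rollback_change) are hypothetical placeholders for whatever tooling you actually have; this is a shape, not an implementation.

```python
# A minimal sketch of an automated change workflow: QA test, deploy,
# validate, and roll back on failure. The helper functions below are
# hypothetical placeholders for whatever tooling you actually use.

def run_qa_tests(change: dict) -> bool:
    # e.g., apply the change in a lab or virtual topology and run test cases
    print(f"QA testing: {change['description']}")
    return True

def deploy_change(change: dict) -> None:
    # e.g., push the configuration with your automation tool of choice
    print(f"Deploying: {change['description']}")

def validate_change(change: dict) -> bool:
    # e.g., check routing tables, interface state, reachability probes
    print(f"Validating: {change['description']}")
    return True

def rollback_change(change: dict) -> None:
    # small, well-understood changes are the easiest to undo
    print(f"Rolling back: {change['description']}")

def apply_small_change(change: dict) -> bool:
    """Push one small change; succeed only if it deploys and validates."""
    if not run_qa_tests(change):
        return False
    deploy_change(change)
    if validate_change(change):
        return True
    rollback_change(change)
    return False

if __name__ == "__main__":
    apply_small_change({"description": "add VLAN 42 to access switches"})
```

The smaller each change is, the simpler every one of those steps becomes, which is exactly where the batch-size argument and the automation argument reinforce each other.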
#4: Reduce variation.
Variation can be good in some contexts, but in the network, variation introduces unexpected errors and unexpected behaviors.
Whether you manage dozens, hundreds, or thousands of network devices, how much of your configuration can be standardized? Can you standardize the OS version? Can you minimize the number of models that you use? Can you minimize the number of vendors?
Variation increases network complexity, testing complexity, and the complexity of automation tools. It also increases the knowledge that engineers must possess.
Obviously, there are cost and functional trade-offs here, but reducing variation should at least be considered.
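One way to make variation visible is simply to measure it. The sketch below tallies OS versions and hardware models across a small, invented inventory; in practice the inventory would come from your own source of truth (a CMDB, a YAML file, or similar).

```python
# A minimal sketch: quantify variation across an inventory by counting
# distinct OS versions and hardware models. The inventory below is a
# made-up example; in practice it would come from your source of truth.
from collections import Counter

inventory = [
    {"name": "core-sw-01", "model": "model-A", "os": "16.9.4"},
    {"name": "core-sw-02", "model": "model-A", "os": "16.9.4"},
    {"name": "edge-rtr-01", "model": "model-B", "os": "17.3.1"},
    {"name": "edge-rtr-02", "model": "model-B", "os": "16.12.5"},
]

os_versions = Counter(device["os"] for device in inventory)
models = Counter(device["model"] for device in inventory)

print("OS versions in use:", dict(os_versions))
print("Hardware models in use:", dict(models))
# Fewer distinct entries in each counter means less variation to test,
# document, and automate against.
```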
#5: Automate as much as possible.
Automation reduces variation. If you do things manually, you will have significantly more variation in your network across time. Automation also makes it practical to increase how frequently you make changes and thus reduce the size of each individual change.
While there is an upfront price to pay for automation (the time required to build it), automation can save meaningful amounts of time and eliminate drudgery in the long run.
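As one hedged illustration, a library such as Netmiko (one of several options) can push the same standardized snippet to a list of devices so the change lands identically everywhere. The device details and NTP servers below are invented for the example.

```python
# A minimal sketch using Netmiko (assumed installed: pip install netmiko)
# to push the same standardized snippet to a list of devices. Hostnames,
# credentials, and the NTP servers are placeholders for this example.
from netmiko import ConnectHandler

devices = [
    {"device_type": "cisco_ios", "host": "10.0.0.1",
     "username": "admin", "password": "changeme"},
    {"device_type": "cisco_ios", "host": "10.0.0.2",
     "username": "admin", "password": "changeme"},
]

ntp_config = [
    "ntp server 192.0.2.10",
    "ntp server 192.0.2.11",
]

for device in devices:
    conn = ConnectHandler(**device)            # open an SSH session
    output = conn.send_config_set(ntp_config)  # apply the snippet
    print(f"{device['host']}:\n{output}")
    conn.disconnect()
```

Every device gets exactly the same lines in exactly the same way, which is the point: the script, not an engineer's memory, carries the standard.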
#6: Treat infrastructure as code.
Your configurations should be in a source code repository, and they should be automatically revision controlled. Your automation tooling, monitoring configurations, and QA scripts should also probably be revision controlled.
In the long run, network engineers might want to employ methods used in software deployment. Can we stage our changes on a central server? Can we automatically test our changes in a QA environment prior to deployment? Can we deploy our changes to production programmatically? Can we automatically roll back if we encounter issues? Do we have a feedback loop confirming that our changes are operating correctly?
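Here is a minimal sketch of one piece of that pipeline, assuming the intended configuration is rendered from a Jinja2 template kept in the same repository as everything else: render the template, then diff it against the running configuration before anything is deployed. The template text, variables, and hard-coded running config are stand-ins for the example.

```python
# A minimal sketch of config-as-code: render a device config from a
# Jinja2 template kept under revision control, then diff it against the
# running config before deciding whether to deploy. The template text,
# variables, and running config below are invented for this example.
import difflib
from jinja2 import Template

template_text = """\
hostname {{ hostname }}
ntp server {{ ntp_server }}
"""

intended = Template(template_text).render(
    hostname="edge-rtr-01", ntp_server="192.0.2.10"
)

# In practice this would be fetched from the device; hard-coded here.
running = "hostname edge-rtr-01\nntp server 192.0.2.99\n"

diff = difflib.unified_diff(
    running.splitlines(), intended.splitlines(),
    fromfile="running", tofile="intended", lineterm="",
)
print("\n".join(diff))
# An empty diff means the device already matches the repository;
# otherwise the diff is exactly the change you would review and deploy.
```

The repository becomes the statement of intent, and the network either matches it or shows up as a diff.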
In this regard, an overlay-underlay network model could be essential. If we can create a relatively stable underlay network and on top of this have a virtual environment where network devices are created/changed/destroyed programmatically, then this could be a very powerful combination. It could allow us to achieve the above goal of using a software-like deployment model.
#7: Uptime and change are not necessarily in conflict.
This brings us back to Netflix’s Chaos Monkey.
I am still coming to grips with this and definitely need more data, but in some contexts, uptime and the number of changes are not inversely related.
In other words, you might be able to design a system where you change things frequently and rapidly incorporate what you learn back into the system. Automate those lessons, and over time you actually have a more robust system with less aggregate downtime. You get a system that is both very reliable and frequently changing.
This is what Netflix did – they intentionally introduced failure into their system so that they could rapidly learn from it.
I am not advocating doing the same, but I think there is an interesting idea here that we can potentially learn from.