Make no mistake, abstraction is a good thing. And abstraction with rich application programming interfaces (APIs) is a great thing. Software abstraction supports the goal of all mechanization: to reduce human drudgery and foster engineers’ creativity. But that same facility always comes with a downside rarely apparent in the early stages of administration: sprawl.
Virtualization largely eliminated barriers to spawning new servers and we now must regularly prune our clusters lest we end up with blinking bezels of un-provisionable resources. Flash and hybrid storage are great assets in our eternal quest for more input/output operations per second (IOPS), but undisciplined pool design does not a happy data center make. And when we look at software defined networking (SDN), network functions virtualization (NFV), on-premises platform-as-a-service (PaaS), hyperconverged systems, etc., why would we expect these systems to be somehow immune? They aren’t, and now wrangling software defined sprawl is one of your jobs.
The Age of Infinite Networks
Perhaps you’ve monitored traffic paths inside the networks of a SaaS provider and noticed traffic route changes happen daily or even hourly. And it makes sense. These operators have considerable Dev resources and developed proprietary SDN approaches long ago. Their services follow the sun and adapt to demand dynamically using a combination of abstraction and software actuation. It allows them to make configuration changes faster and more accurately than any team of humans could (for example, see the SolarWinds diagram of Yahoo’s internal network).
But now with SDN, software defined storage (SDS), and more in your datacenter, routing, security policies, application delivery optimization, and storage architecture may be just as dynamic. At one time, you kept your network streamlined around a few robust links that, for the most part, stayed out of trouble when left alone. If, however, you can wield intricate multi-homed networks with the click of a button, then you should; and for all the same reasons the big operators do.
There’s just one catch. Because it’s easy, it’s likely that resources ultimately will be orphaned, duplicated, over-provisioned, or just plain lost. If you’ve ever monitored on-premises NetFlow, AWS virtual private cloud (VPC) Flow Logs, or Microsoft Azure Operations Management Suite (OMS) agent Wire Logs, you might have seen a few percent of traffic “leaking” past expected routes. With manually configured networks, this is often traced to branch office local internet/VPN links, failover routes that are active when they should be in standby, border gateway protocol (BGP) classless inter-domain routing (CIDR) shenanigans, or random, screwball firewall policies.
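To put a number on that “few percent,” a minimal sketch follows. It assumes flow records have already been exported (from NetFlow, VPC Flow Logs, or similar) and reduced to a destination address and a byte count. The record layout, the `EXPECTED_NETWORKS` ranges, and the function names are all hypothetical, and checking destinations against a list of ranges is only a rough stand-in for real route analysis.

```python
import ipaddress

# Hypothetical destination ranges that traffic is *supposed* to use.
# Replace these with the ranges from your own routing plan.
EXPECTED_NETWORKS = [
    ipaddress.ip_network(n) for n in ("10.0.0.0/16", "192.168.10.0/24")
]


def is_expected(dst_ip: str) -> bool:
    """Return True if the destination falls inside an expected range."""
    addr = ipaddress.ip_address(dst_ip)
    return any(addr in net for net in EXPECTED_NETWORKS)


def leak_percent(flows) -> float:
    """Percentage of bytes sent to destinations outside the expected ranges."""
    total = sum(f["bytes"] for f in flows)
    if total == 0:
        return 0.0
    leaked = sum(f["bytes"] for f in flows if not is_expected(f["dst"]))
    return 100.0 * leaked / total


# Illustrative records only; real exports carry many more fields.
SAMPLE_FLOWS = [
    {"dst": "10.0.4.20", "bytes": 9000},
    {"dst": "192.168.10.7", "bytes": 700},
    {"dst": "203.0.113.9", "bytes": 300},  # outside every expected range
]

if __name__ == "__main__":
    print(f"{leak_percent(SAMPLE_FLOWS):.1f}% of bytes left the expected routes")
```

Run on a schedule, a figure like this turns a vague suspicion into a trend line you can watch as controllers reshape the network.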
If the number of links is relatively static and small enough to maintain a complete mental picture, remediation can be straightforward. But if your SDN framework is doing what it should be, VMware vRealize is cracking the reins of NSX or System Center Virtual Machine Manager (SCVMM) is shuffling ACI contracts. Controllers are modifying network access, permissions, addresses, and routes regularly without your review. That also means that for the most part, it’s happening out of view.
Software-defined storage (SDS) is often even more prone to sprawl. Ideally, well-tuned policies shift storage workloads to achieve maximum performance while minimizing resources. Hybrid controllers optimize volumes’ individual files, assigning them to different physical media based on usage analysis. But unlike outage—networks’ occasionally forgivable sin—storage failure (data loss) is grievous. The result is a tendency to create conservative policies that, when in doubt, may leave files in place to be cleaned up later. Storage sprawl is a safer byproduct when letting policies execute autonomously.
Sprawl As a Service
Software defined infrastructure sprawl is worst where it is compound. Orchestrators with dominion over networks, storage, and other resources on-premises are actually the smallest problem. A top-level controller script that in turn calls APIs in legacy virtualization systems can spew VMs, pools, and virtual MAC addresses (vMACs) with little awareness of resource consumption. But at least on-premises engineers still maintain full administrative control, and manual, if tedious, sprawl monitoring is possible with existing tools.
Hybrid IT and cloud, on the other hand, too often take sprawl to a whole new level. Any IT manager who has ever been surprised with a potentially career-limiting service invoice can attest to the dark side of software defined sprawl. Reporting or even discovery to determine how to terminate idle or orphaned instances can be difficult. “Is the finance team’s monthly analytics process running on i-5of78b0zg9vps9fro or was it i-09eomj1mnex8ex6yi? Oh wait, that one is human resources (HR), finance is on i-299rjpen9ruwahzuo. Maybe.”
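One way to take the “Maybe” out of that conversation is to require ownership tags and audit for them. The sketch below assumes an instance inventory has already been pulled from your provider into simple records. The `Instance` class, the required tag names, the idle threshold, and the sample tags and CPU figures are all hypothetical and shown only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    instance_id: str
    tags: dict = field(default_factory=dict)
    avg_cpu_percent: float = 0.0  # e.g. average over the last 30 days


REQUIRED_TAGS = ("owner", "workload")  # assumed tagging policy
IDLE_CPU_THRESHOLD = 2.0               # assumed cutoff, in percent


def audit(instances):
    """Map instance ID to the reasons it deserves a second look."""
    findings = {}
    for inst in instances:
        reasons = []
        missing = [t for t in REQUIRED_TAGS if not inst.tags.get(t)]
        if missing:
            reasons.append("missing tags: " + ", ".join(missing))
        if inst.avg_cpu_percent < IDLE_CPU_THRESHOLD:
            reasons.append("idle")
        if reasons:
            findings[inst.instance_id] = reasons
    return findings


# Sample inventory; tags and CPU numbers are invented for the example.
INVENTORY = [
    Instance("i-5of78b0zg9vps9fro", {}, 41.0),
    Instance("i-09eomj1mnex8ex6yi", {"owner": "hr", "workload": "payroll"}, 0.4),
    Instance("i-299rjpen9ruwahzuo",
             {"owner": "finance", "workload": "monthly-analytics"}, 37.5),
]

if __name__ == "__main__":
    for instance_id, reasons in audit(INVENTORY).items():
        print(instance_id, "->", "; ".join(reasons))
```

Untagged instances are the orphan candidates, and tagged but idle ones are the candidates for a conversation with their owner before the next invoice arrives.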
The trouble comes when admins rush to take advantage of the amazing capabilities of software defined infrastructure, but don’t have experience with distributed/delegated transaction tracing, configuration action logging, and more. If an organization embraces DevOps with a capital DEV, there’s some relief. Continuous integration/build/delivery tools make it easier to ensure traceability across abstraction layers, but simply using Chef or Puppet alone is not enough.
What You Can Do
- Document – Yes, the D-word. First, admit that you don’t like it, agree that no one likes to do it, and then do it, anyway. The systems that manage individual control points do a great job of organizing elements within their purview. They’re also terrible for visualizing end-to-end relationships that pass through multiple system layers. Spend time in Visio, diagram your integrated SDx systems architecture, and then splurge on poster-sized prints at Office Depot. Troubleshooting conversations will be more efficient and management will feel you’re in control.
- Reports, not just alerts – Responsive IT management demands smart alert definitions that speed recovery. Proactive IT management relies on useful reports that keep you ahead of problems. Read your monitoring and management systems’ manuals, and build reports that identify capacity exhaustion, highlight error-prone operational areas, etc. If you can include regular detailed cost reporting, especially for hybrid IT and cloud, even better.
- DevOps All the Things – If you can’t say with some confidence that you could at this moment recreate 75 percent of your SDN-managed systems programmatically, stop reading now and get busy figuring out how to get there. Check your config artifacts into source control and make sure everyone knows how to get them back out. Read The Phoenix Project. At least try a bit of Agile. It only hurts a little at the beginning, and you don’t have to adopt all of it. Imagine how many nights and weekends you could get back if you made the bulk, or maybe even all, of your production changes during office hours. Really, that’s possible! You also get the bonus of a more accurate, sprawl-resistant change framework to boot.
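The 75 percent question above can be turned into a recurring report. As a rough sketch, assume you can list the resource names defined in source control and the resource names actually discovered in your environment. The function name, the naming scheme, and the sample data below are hypothetical.

```python
def reproducibility_report(declared, discovered):
    """Compare resources defined in source control with those found running."""
    declared, discovered = set(declared), set(discovered)
    managed = discovered & declared   # running and reproducible from source
    orphaned = discovered - declared  # running but defined nowhere: sprawl
    missing = declared - discovered   # defined but not currently deployed
    coverage = 100.0 * len(managed) / len(discovered) if discovered else 100.0
    return {
        "coverage_percent": coverage,
        "orphaned": sorted(orphaned),
        "missing": sorted(missing),
    }


# Illustrative names only.
DECLARED = ["net/web-tier", "net/db-tier", "vol/finance-01", "vm/build-agent"]
DISCOVERED = ["net/web-tier", "net/db-tier", "vol/finance-01", "vol/tmp-copy-3"]

if __name__ == "__main__":
    report = reproducibility_report(DECLARED, DISCOVERED)
    print(f"{report['coverage_percent']:.0f}% of discovered resources "
          "are defined in source control")
    print("Orphaned:", report["orphaned"])
    print("Missing:", report["missing"])
```

Anything in the orphaned list was created outside your change framework, which makes it the first place to look for sprawl.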
Deja Vu All Over Again
While it’s annoying when sprawl pops up like crabgrass in yet another area of IT, you can keep it in check. In fact, we can trim software defined infrastructure using nearly the same approach used for app or virtualization sprawl, plus a few new tools. Also, remember this is a transitional problem as IT moves from the command line through first generation SDx, and eventually to PaaS infrastructure. PaaS includes intelligence and tight integration that manages resources with advanced tools to keep sprawl to a minimum. Of course, that may not help much with container and serverless sprawl, but that’s a whole other story.