Since the mid-1990s, the costs of compute, storage, and memory have been steadily plummeting. The commoditization revolution in compute (“x86 everywhere”), combined with the commercialization of distributed systems programming models, has taken huge advantage of this economic situation.
At the same time, we are processing more data, from more sources, for more purposes than ever before. Although multi-core and multi-socket compute platforms are commonplace – leveraging parallelism at the processor to try to handle the hockey-stick growth of data processing needs – even these approaches are insufficient to deliver the processing power needed to accomplish enterprise data mining efforts. Hence the scramble to build distributed systems solutions for our applications.
This scramble to build distributed systems solutions has had a few key consequences:
- Applications are more distributed across networks. An application is considered “distributed” as soon as it has any non-trivial east/west dependencies. Once this happens, the network is an implicit part of the application. It’s no longer sufficient to have the fastest algorithms in your applications, or to have the highest-RPM disk drives in your servers. Now network latency and throughput have a critical impact on the performance of distributed applications.
- Operating distributed applications changes network operations. In the past, networks were segmented into layers, zones, and areas. Applications existed primarily within one of those segments. Now, as a result of their increasingly distributed nature, modern applications operate across boundaries and layers. In many cases, our modern transient and elastic applications deploy, work, and decommission non-deterministically. Their increasing scale and ephemerality and requirements for performance, continuity, and on-demand capacity, require comprehensive, real-time feedback and automation.
- These conditions are now leading to a third and very important consequence. The traditional ways of operating networks are unsustainable for today’s applications. This is because the traditional network operations model is all about monitoring and managing networks, not operating environments for applications.
The traditional approach to network operations attempts to manage the various pieces of the system in isolation – with information stove-piped across device health, operational statistics, traffic characteristics, configuration changes, and security. Actions are separated from sources of feedback.
The point of integration is the brains of the network operator.
The traditional model presupposes that the network engineer has prior knowledge of the “good” state of the network, and that any variation outside some pre-defined parameters is “bad.” It also presupposes that he or she can access and synthesize all of the required information and react quickly and correctly even as the system scales and its rate of change increases. Combined with overly heavy change management processes, manual configurations, and paper-based documentation, the status quo is unsustainable.
In order to understand what is necessary to build a model that will scale and is sustainable, it’s important to understand what network operations really mean in the context of today’s applications and their requirements.
Operating the Network Is More Than Just Monitoring It
For as long as we’ve been deploying IP networks in data center environments, we’ve known that we have to have some visibility into the operational stability of the network. Simple Network Management Protocol (SNMP) emerged as the universal standard for this functionality. This is despite some of SNMP’s well-documented flaws, including its verbosity, resource inefficiency, reliance on ASN.1 (obscure, poor tool support, early security vulnerabilities), the unreliable nature of UDP datagrams used to deliver potentially critical trap notifications, and the difficulty of scaling a centralized collector.
As with any centralized polling technology, SNMP also introduces challenges by burdening the infrastructure with repeated requests, centralized data hygiene, and increasing storage requirements – challenges that reach tipping points in large-scale or high-performance environments.
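The scaling pressure of centralized polling can be seen with simple arithmetic: each polled object costs the collector one request/response round trip per interval, so load grows with the product of device count and objects per device. The sketch below uses invented, illustrative numbers (not figures from the text):

```python
# Back-of-the-envelope model of centralized polling load. All numbers
# here are illustrative assumptions, not measurements from the text.

def polls_per_second(devices: int, oids_per_device: int, interval_s: int) -> float:
    """Requests per second a central collector must issue to poll
    every OID on every device once per interval."""
    return devices * oids_per_device / interval_s

# A modest fabric: 500 devices, 200 OIDs each, polled every 30 seconds.
modest = polls_per_second(500, 200, 30)

# Scale the same design 10x: the collector's request load grows
# linearly, and per-poll storage accumulates without bound.
large = polls_per_second(5000, 200, 30)

print(round(modest, 1))  # 3333.3 requests/sec
print(round(large, 1))   # 33333.3 requests/sec
```

Even before storage and data-hygiene costs are counted, the request volume alone pushes a single collector toward a tipping point as the environment grows.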
Finally, despite efforts to provide a unified data model (as with the IETF proposed and ratified standards on MIB definitions), SNMP suffers from heterogeneity: vendor differentiation means that, though most vendors support some part of the standard enterprise MIB collections, each also provides proprietary MIBs of its own.
However, even if we fixed all the problems with SNMP and polling, the fact remains that monitoring a network isn’t the same as operating a network.
Further, networks exist to power applications. Traditional network monitoring tools typically do not incorporate “application awareness.” Network operations involve more than passive observation. They provide a continuous feedback loop similar to the “Observe, Orient, Decide, Act” (OODA) process described by Colonel John Boyd, where the input to the next operational iteration is the output of the last.
Network operations are active and learn over time how to prevent (or correct) issues more effectively. Monitoring simply provides input as point-in-time observations of the network’s state. Network operations correlate application metrics with network metrics for a holistic view of the network’s state and the state of the applications connected to the network.
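The OODA-style loop described above can be sketched in a few lines. All of the names here (observe, orient, decide, act, link_util) are illustrative placeholders, not a real API; the point is that each iteration's output becomes the next iteration's input:

```python
# Minimal sketch of an Observe-Orient-Decide-Act feedback loop.
# Function and metric names are invented for illustration only.

def observe(state):
    # Gather a point-in-time observation of the network's state.
    return dict(state)

def orient(observation, history):
    # Correlate the new observation with what was learned previously.
    history.append(observation)
    return observation

def decide(oriented):
    # Choose a corrective action when a metric drifts out of bounds.
    return "reduce_load" if oriented["link_util"] > 0.8 else "no_op"

def act(state, action):
    # Apply the action; its effect is the input to the next iteration.
    if action == "reduce_load":
        state["link_util"] *= 0.5
    return state

state, history = {"link_util": 0.9}, []
for _ in range(3):
    action = decide(orient(observe(state), history))
    state = act(state, action)

print(round(state["link_util"], 3))  # 0.45: corrected once, then stable
```

Monitoring alone would only produce the `observe` step; it is the closed loop through `decide` and `act`, informed by accumulated history, that makes this an operational process rather than passive observation.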
Operating the Network Is More Than Just Managing It
“Network management” sometimes sounds like a dirty phrase, and there are good reasons for that. Managing a network usually involves some of the least desirable practices in IT: manual, human-performed configuration editing; brain- or paper-based documentation; and calendar- or clock-based change coordination, just to name a few.
DevOps-style tools (like Ansible, Chef, Puppet, or Salt) can help with provisioning, configuration templating, and pushing configuration changes, but they don’t have any concept of ongoing feedback or visibility. Adding APIs to network operating systems solves the problem of using heavy-weight and error-prone processes (CLI scraping, among others) to effect configuration. But it doesn’t give us any better idea of the impact of the configuration on the network as a whole.
Network operations, on the other hand, are aware of the network’s purpose, and can guide engineers in achieving that purpose. Network operations provide better tools for describing architecture, intent, and behavior. They also translate these high-level concepts into the low-level language required by the devices in the network. Network operations use a continuous feedback loop to determine what change to drive (perhaps using automation to drive that change).
While automation tools focus on removing the opportunity for error in the how (how changes are applied), network operations address the opportunities for error in the what (what changes are applied). Further, network operations provide continuously updated models and simulations of the running network that help address the bigger problem of coordinating or orchestrating change in such a way that the network as an overall system – a system that includes the applications connected to the network – is improved.
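One way to make the "what" concrete is to test a proposed change against a model of the running network before applying it. The sketch below is a deliberately simplified assumption of such a model: the topology is an adjacency map, the proposed change is taking a link out of service, and the check asks whether the network remains one connected system afterward. Device names are invented:

```python
# Hedged sketch: validate the *what* of a change against a topology
# model before any automation applies the *how*. Names are illustrative.

from collections import deque

def connected(adjacency, down_link=None):
    """Breadth-first search: is every node still reachable if
    down_link (a pair of node names) is taken out of service?"""
    nodes = set(adjacency)
    drop = {down_link, tuple(reversed(down_link))} if down_link else set()
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for peer in adjacency[node]:
            if (node, peer) in drop or peer in seen:
                continue
            seen.add(peer)
            queue.append(peer)
    return seen == nodes

# Illustrative topology: a three-switch triangle plus one leaf.
topology = {
    "sw1": ["sw2", "sw3"],
    "sw2": ["sw1", "sw3"],
    "sw3": ["sw1", "sw2", "leaf1"],
    "leaf1": ["sw3"],
}

print(connected(topology, ("sw1", "sw2")))   # True: redundancy absorbs it
print(connected(topology, ("sw3", "leaf1"))) # False: leaf1 is stranded
```

A real continuous-state model would of course simulate far more than reachability, but the shape is the same: the system-level consequence of a change is evaluated against the model, and only safe changes are handed to the automation layer.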
Treating the Network as a Distributed System
“It’s not a networking problem; it’s a systems problem.” — Martin Casado
Since the advent of dynamic routing protocols (at least as early as the 1980s), the network has been a distributed system. The applications we run are increasingly deployed as distributed systems. Why would we suppose that the techniques and strategies we’ve used to manage our distributed systems applications are unfit for improving the state of network operations? We cannot rely on a hodge-podge of disjointed, off-the-shelf tools. We need to start operating our networks as if they were a distributed system.
Network operation takes a systems-oriented approach. A systems-oriented approach focuses on the functionality of the network as a dynamic, responsive entity (a system). Network operations act holistically upon the network as well as on the devices that comprise the network – recognizing that the behavior of a network is best understood in the context of the relationships, interconnections, and interdependencies among the parts that comprise that network.
This approach is not only concerned with the health or performance of a single element. It is concerned with the behaviors and functionality present in the system, and the extent to which these are in compliance with the wishes of the network operators.
In order to take a systems-oriented approach to network operations, though, there are several core problems that must be addressed.
- The heterogeneity problem: No large network (in fact, almost no small network) today is homogenized on a single vendor, and it cannot be operated based on a single type of data collection. A systems-oriented approach to network operations must collect data from multiple vendors in a way that is amenable to comparison, sorting, and other algorithmic approaches to solving problems.
- The unification problem: Network operations need to leverage the various types of information the network stores and generates, such as configuration, traffic sampling, control and data plane, compute, storage and application data, policy, inventory and other information to complete a picture of the network and understand its operations over time.
- The continuous-state model problem: The traditional approach suffers from looking at problems in the small view, focusing narrowly on the individual elements that compose the state of the network rather than the state as a system. A systems-oriented approach to network operations must develop reactive, constantly updated state models (simulations) of the network in order to make broader predictions about the precursors of outages.
- The adaptive problem: Today’s hardware, protocols, and software are tomorrow’s legacy cruft. The data models and the state models must be flexible enough to keep up with the constantly evolving landscape of software and hardware in the data center.
- The scale problem: Traditional network operations tools do not collect and process information efficiently enough to scale to modern requirements. New methods are required to gather, normalize, stream, and process network data.
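The heterogeneity and unification problems above both reduce to a normalization step: vendor-specific telemetry must be translated into one common schema before any algorithmic comparison is possible. A minimal sketch, with entirely invented vendor names and field names, might look like this:

```python
# Sketch of cross-vendor normalization. The vendors, field names, and
# schema here are hypothetical, chosen only to illustrate the shape.

VENDOR_MAPPINGS = {
    "vendor_a": {"ifName": "interface", "ifInOctets": "rx_bytes"},
    "vendor_b": {"port": "interface", "rx-counter": "rx_bytes"},
}

def normalize(vendor, record):
    """Translate one vendor-native record into the common schema."""
    mapping = VENDOR_MAPPINGS[vendor]
    return {common: record[native] for native, common in mapping.items()}

samples = [
    ("vendor_a", {"ifName": "eth0", "ifInOctets": 1200}),
    ("vendor_b", {"port": "xe-0/0/1", "rx-counter": 3400}),
]

unified = [normalize(vendor, rec) for vendor, rec in samples]

# Once normalized, cross-vendor questions become ordinary operations:
busiest = max(unified, key=lambda r: r["rx_bytes"])
print(busiest["interface"])  # xe-0/0/1
```

Everything downstream – sorting, comparison, the continuous-state models described above – depends on this kind of unified representation existing first.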
All useful network feedback abstractions (security posture, network health, traffic distribution, etc.) and advanced analytics approaches (machine learning, visualization, etc.) are derivative of solving these core problems. By treating the network as a distributed system (because that’s what it is!), a systems-oriented approach to network operations can apply many of the same strategies used to manage distributed applications to the network – solving longstanding problems in network operations and addressing the scale, speed, and complexity of modern environments.
In today’s data centers, network operations comprise more than network monitoring, more than network provisioning or network automation, and more than network management. They are a superset of all these areas, based on holistic information, incorporating automation, and working in conjunction with the applications the network supports. Network operations apply distributed systems approaches and techniques to solve the challenges faced by piecemeal solutions that only address a part of the overall picture.