You’ve done it! You convinced management to adopt SDN, and you’re buying switches from Pica8, IBM, and Dell, controllers from Big Switch, and applications from the Big Switch ecosystem. You are so on top of things that you have even implemented your own custom traffic engineering application. Excellent! You rolled it all out, and for the last month everything was great, until you started experiencing performance issues in the network. Utilization is great, but performance seems slow. So let’s start troubleshooting the network and applications. Where do I start? OK, which tool do I use? Hmmmm, in the past, I always called my main hardware vendor for support. I’ll call tech support for help … but now, who do I call?
The value proposition offered by SDN proponents is simplified management and the ability to plug in any vendor’s hardware for forwarding services. The network is no longer about the proprietary hardware; it’s now about the flexibility of software and accelerating innovation. Has this truly lowered operating costs, or is it all hype? In a September blog post, “Does SDN Make my Network Management Look Fat”, John Strassner explored the challenges of network management. Beyond network management, troubleshooting and support services are two additional complex challenges.
One of the more distinctive problems created by the current definition of SDN, and demonstrated by the current SDN state of the art, is the lack of standards and the relative youth of the various SDN offerings. Which version of OpenFlow does your switch support? What vendor-specific attributes have been added? Which optional attributes are available on which switches? Are the controllers interoperable? What features have been implemented on top of the controller? Will this work only in greenfield networks, or in hybrid networks as well? With this complexity alone, troubleshooting performance and transient problems will be very challenging. Now, what tools are available to help you?
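One way to make these questions concrete is to record what each switch claims to support and compute what the whole fleet has in common. The following is only a minimal sketch: the switch names, version strings, and feature labels (`opt_a`, `opt_b`, `opt_c`) are hypothetical placeholders, not values taken from any vendor data sheet or from the OpenFlow specification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SwitchProfile:
    """Hypothetical record of what one switch claims to support."""
    name: str
    openflow_versions: frozenset
    optional_features: frozenset = field(default_factory=frozenset)


def common_baseline(profiles):
    """Return the OpenFlow versions and optional features shared by every switch."""
    if not profiles:
        return frozenset(), frozenset()
    versions = frozenset.intersection(*(p.openflow_versions for p in profiles))
    features = frozenset.intersection(*(p.optional_features for p in profiles))
    return versions, features


def missing_features(profiles, required):
    """Map each switch name to the required features it does not offer."""
    required = frozenset(required)
    gaps = {}
    for p in profiles:
        missing = required - p.optional_features
        if missing:
            gaps[p.name] = sorted(missing)
    return gaps


# Illustrative inventory; every value here is an assumption.
fleet = [
    SwitchProfile("edge-a", frozenset({"1.0", "1.3"}), frozenset({"opt_a", "opt_b"})),
    SwitchProfile("edge-b", frozenset({"1.0"}), frozenset({"opt_a"})),
    SwitchProfile("core-c", frozenset({"1.0", "1.3"}), frozenset({"opt_a", "opt_b", "opt_c"})),
]

versions, features = common_baseline(fleet)
gaps = missing_features(fleet, {"opt_b"})
```

Here an application that needs `opt_b` cannot be deployed fleet-wide, because `edge-b` does not offer it. An inventory like this only captures what the devices advertise; it says nothing about whether the advertised features behave the same way on each device.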
Today, network administrators have a wealth of tools and technologies, developed over decades, that focus explicitly on analyzing network performance, troubleshooting and diagnosing network outages, and providing solutions for remediation. Administrators looking to stand on the bleeding edge of SDN adoption will have to be prepared to roll their own solutions and/or spend extra time finding and adapting tools from the myriad SDN solutions sprouting like weeds.
This creates several challenges:
- There are no standards for controllers and northbound (NB) APIs, so it will be left to administrators to find, and most likely adapt, offerings to fit the environment they have.
- Due to the lack of standards for controllers and NB APIs, there is a high likelihood that the tools you most need do not yet exist, and so they will have to be built in-house.
- While the CapEx budget may be reduced by the “cheaper commodity hardware”, the OpEx for developers will most certainly grow. Does your department have the required budget? Can you afford OpEx that may be equal to or larger than your CapEx reduction?
- More importantly, as these new tools are developed and integrated, can you afford the extra time and delay induced by having to wait for new and/or adapted tools to be created, then tested, and then integrated with your network? What about the risk to the stability of your network services?
These points bring us to the next major challenge for operations: how services are supported. One of the biggest potential values of the SDN movement is the speed of innovation and adoption of network services. Whether or not you believe in the abstractions and functionality offered by OpenFlow, the trend among hardware vendors to provide open APIs benefits end developers, who now have mechanisms to adapt networks to their specific needs. Oh, but be careful what you wish for!
While working at Cisco, I had many opportunities to meet with enterprise companies from around the world. One of our most powerful capabilities was a feature set called the Embedded Event Manager (EEM), which now plays a prominent role in the Cisco ONE offering. EEM was the first major differentiator for device programmability offered by Cisco. It is a powerful tool that enables operators to create custom solutions to identify and correct problems with network services; it can also be used to augment feature sets provided by Cisco with additional customer- and application-specific functionality. However, such power can work both ways. The number one concern of customers was not bugs in EEM; rather, it was this: if their developers wrote a set of custom scripts, what percentage of the functionality in those scripts would Cisco support? This concern stems from the difference between the programming models for applications and for networks. In the application world, every tool and IDE vendor will of course support application development. This is typically not true in the networking world, and it is where I worry about the ability of the SDN initiative to make real breakthroughs. Until application developers can treat the network as “just another set of resources and services”, real innovation in network applications will remain slow and difficult.
Is anyone prepared to truly support broad adoption of custom features in production networks? Part of the problem is the level of abstraction at which these features are developed. It’s one thing to use EEM to fire off simple Tcl scripts for pre-defined IOS events – this is simple and straightforward (since nothing in IOS is changing), but very powerful (since your program can now provide custom behavior triggered by IOS events). However, it is quite a different matter to change the behavior of IOS, or of entities that IOS directly depends on or uses. While problems with the former are easy to detect, problems with the latter can vary widely with the complexity of the operations being performed. An additional challenge is “how do you synchronize changes made by network-integrated applications with changes made by other tools on other network devices?”
Are tools available to diagnose custom scripts or programs across multi-vendor hardware and software solutions? Are the tools able to recognize that different commands from different management applications applied to different devices from different vendors have the same (or similar) effect? If not, how do you build end-to-end services? How do you build and troubleshoot vendor-agnostic applications? Even if the NB API is equivalent, is the underlying behavior the same? Take the EEM example: even though the same function is available in IOS, IOS XR, and NX-OS, the behaviors of that function are not always equivalent. A script written for a Cat6k will behave differently on a Nexus7k or GSR12k. If three operating systems from one company cannot interoperate, how will an equivalent application work across multi-vendor solutions? As an administrator, whom do you call in this environment about the applications you wish to create and use?
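To illustrate what “recognizing the same effect” would require, here is a minimal sketch of a translation table that maps vendor-specific commands onto a vendor-neutral description of their effect. Everything in it is an assumption made for illustration: the vendor names, the command strings (which are invented and are not real CLI syntax), and the effect tuples.

```python
# Hypothetical translation table: (vendor, command) -> vendor-neutral effect.
# Vendor names and command strings are invented placeholders, not real CLI syntax.
EFFECTS = {
    ("vendor_a", "port 7 disable"): ("port_down", "7"),
    ("vendor_b", "interface eth7 shutdown"): ("port_down", "7"),
    ("vendor_b", "interface eth7 mtu 9000"): ("set_mtu", "7", "9000"),
}


def normalize(vendor, command):
    """Return the vendor-neutral effect of a command, or None if it is unknown."""
    # Collapse case and repeated whitespace so trivially different spellings match.
    key = (vendor, " ".join(command.lower().split()))
    return EFFECTS.get(key)


def same_effect(change_a, change_b):
    """True only when both (vendor, command) pairs are known and map to one effect."""
    effect_a = normalize(*change_a)
    effect_b = normalize(*change_b)
    return effect_a is not None and effect_a == effect_b


# Two differently worded changes that the table treats as equivalent.
equivalent = same_effect(
    ("vendor_a", "Port 7  disable"),
    ("vendor_b", "interface eth7 shutdown"),
)
```

Even this toy version shows the support problem: someone has to build and maintain the table for every vendor and software release, and a matching entry only asserts that two commands are intended to do the same thing, not that the underlying behavior is actually equivalent.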
Many vendors are touting the release or availability of custom APIs. Often, the easiest step is to get the infrastructure in place to offer APIs to control different features. It is far more challenging to create tools to catch and validate changes to devices; it is even harder to do the same for the business processes and support environments that are used to manage the devices that provide the resources and services delivered by the network. As long as we are dreaming, we might as well throw in unified interfaces and APIs for compute, storage, and networking. This is the set of tools required to truly enable the adoption of custom network-aware applications. The impact on a business of a faulty application that manages the network can be far larger than that of a faulty application confined to a server or server farm, as network outages can adversely impact all applications, as well as the resources and services that those applications need.
Some vendors are starting to offer developer support services. They view this as new service revenue, and revenue growth is something they understand. But technical support organizations have historically been rooted in “box thinking”, an environment with no history of dealing with high-complexity, low-volume problems. In these organizations, developer support (application creation) and network operations (running applications and the network) are different groups. A call that comes in for network problems will typically be routed to a team that knows how to troubleshoot and debug “Standard Configurations”.
In order to really help customers, certification processes will need to be put in place to help certify the “safety” of adding custom scripts, programs, and applications to the network. But one of the promises of the SDN vision is commodity hardware and software inside the network device, leading to a vendor-agnostic platform, right? So which vendors provide support and certification for multi-vendor software and hardware networks?
In the end, who will be responsible for your SDN network? For now, it appears that this will fall on the already overburdened and under-staffed IT shops. Are you ready?