Industry data points to a strong uptake of live network functions virtualization (NFV) deployments in 2016. This is not yet a full transformation of service provider networks, but we will see many targeted deployments that address real business needs, such as end-of-life replacements, and new technology rollouts, such as VoLTE, that will bypass the classical deployment model from the start.
This trend puts NFV operations squarely into the focus of service providers and their vendors. NFV brings a lot of advantages through its software-based technology and virtualization. As a basic example, we no longer need to ship physical boxes and bring them through customs for international deployments – at least not as often. NFV also enables a much higher degree of remote operation and operational automation.
On the other hand, NFV doesn’t come without its own challenges. By design, NFV – as described by ETSI – has a multilayer architecture (see Figure 1). At the level of resources, there is the hardware layer with servers, storage, and networks; there is the virtualization layer that creates protected environments; and there are the virtual network functions as well as higher-level network services and applications that run inside these virtual environments.
Above these layers are resource management systems, a service orchestration and assurance layer, and the business support layer. In addition, these layers should be open to multiple vendors, where each layer could be contributed by a different vendor or, more likely, solutions from multiple vendors will come together on the same layers. All of these layers and management systems need to provide and consume mutually agreed-upon resources and services – no small feat in a world where few of these interfaces are fully standardized.
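One way to picture such an agreed-upon interface is a simple publish/subscribe contract, where each layer pushes abstracted notifications upward instead of exposing its internals. The event schema below is a hypothetical illustration, not an ETSI-defined interface:

```python
# Sketch: a layer exposes an agreed-upon notification interface upward,
# so higher layers need not model the lower layer's inner workings.
# The event fields ("type", "severity") are illustrative assumptions.
from typing import Callable

class Layer:
    def __init__(self, name: str):
        self.name = name
        self.subscribers: list[Callable[[dict], None]] = []

    def subscribe(self, callback: Callable[[dict], None]):
        self.subscribers.append(callback)

    def notify(self, event: dict):
        # Publish only abstracted state (e.g. "capacity degraded"),
        # not hardware-specific detail.
        for cb in self.subscribers:
            cb({"source": self.name, **event})

virtualization = Layer("virtualization")
received = []
virtualization.subscribe(received.append)
virtualization.notify({"type": "capacity-degraded", "severity": "major"})
print(received[0]["type"])  # capacity-degraded
```

The point of the abstraction is that the orchestration layer above reacts to "capacity degraded" the same way regardless of which vendor's hardware raised it.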
NFV environments are also much more dynamic than classical physical network solutions. As a case in point, consider a customer requesting a different uplink network design, that is, a change in the physical and virtualization layers. And here is the issue: what is the impact of this change on the VNFs running on top? In principle, the whole solution would need to be retested across hardware, virtualization, and application, because the failover behavior may be affected.
From the mindset of delivering a packaged solution, that sounds like the right thing to do. But from the mindset of NFV and cloud, this causes concerns if changes deep in the infrastructure could cause applications to break. It would also restrict the scalability: It may be OK to retest one application, but if there are dozens of applications running on the infrastructure, such a change may turn into a major endeavor.
A major challenge pertains to service assurance in a multi-layer NFV design. What happens when something goes wrong on a lower layer? Are the higher layers appropriately notified? Do the higher layers need to understand the inner workings of the hardware and virtualization layers? Should applications and their management systems build up a complete model of the infrastructure they are running on?
In the event of a failure, many alarms will be generated, the majority of which are secondary alarms caused by one or a few primary failures. At the same time, there may also be too few alarms, because higher layers may not be notified of relevant events at lower layers. For example, a noisy neighbor driving a server's CPU to saturation does not necessarily show up as high CPU load inside another virtual machine on that server, if that virtual machine itself consumes few CPU cycles.
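The core of suppressing secondary alarms is correlating them against a topology model: an alarm is likely a primary failure only if no resource it depends on is itself alarmed. A minimal sketch, assuming a hypothetical dependency map and alarm set:

```python
# Sketch: suppress secondary alarms by walking a dependency graph.
# The topology model and resource names are illustrative assumptions,
# not a standard NFV interface.

# Which resource each resource depends on (child -> parent).
DEPENDS_ON = {
    "vnf-a": "vm-1",
    "vm-1": "server-7",
    "vm-2": "server-7",
}

def root_causes(alarmed: set) -> set:
    """Return only the alarms whose resource has no alarmed
    ancestor, i.e. the likely primary failures."""
    primaries = set()
    for res in alarmed:
        parent = DEPENDS_ON.get(res)
        while parent is not None and parent not in alarmed:
            parent = DEPENDS_ON.get(parent)
        if parent is None:  # no alarmed ancestor found
            primaries.add(res)
    return primaries

# A failed server typically raises alarms on everything above it:
print(root_causes({"server-7", "vm-1", "vnf-a"}))  # {'server-7'}
```

With this kind of correlation, an operator sees one actionable alarm for the failed server instead of a flood of VM- and VNF-level symptoms.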
When a failure occurs, a multi-step response should ensue, most of it without the need for human involvement. First, high availability mechanisms should kick in, relying on redundancy models, such as active/active load sharing, where the remaining resources take on the load of any failed resource. In a second step, short-term or long-term repair actions can be triggered.
For example, a hardware or software resource can be restarted as a first-level repair action; or a resource, such as a server or disk, can be taken out of service until it is replaced at a scheduled maintenance interval or the whole rack assembly is replaced. Despite a high level of automation, human operators always need to be able to see the status of the NFV resources through powerful visualization tools such as virtual-to-physical mappings and application resource status displays.
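The staged response can be sketched as a small escalation policy. The resource model, restart limit, and function names below are hypothetical, chosen only to illustrate the failover-restart-quarantine sequence:

```python
# Sketch of a staged, automated repair escalation: redundancy absorbs
# the load, a restart is attempted, and a persistently failing
# resource is quarantined until the maintenance window.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    restart_attempts: int = 0
    in_service: bool = True

MAX_RESTARTS = 2  # illustrative threshold

def handle_failure(res: Resource, log: list):
    # Step 1: high availability has already kicked in; the remaining
    # active/active peers take over the load. Record the event.
    log.append(f"failover: load redistributed away from {res.name}")
    # Step 2: first-level repair -- restart the resource.
    if res.restart_attempts < MAX_RESTARTS:
        res.restart_attempts += 1
        log.append(f"restart #{res.restart_attempts} of {res.name}")
    else:
        # Step 3: take it out of service until scheduled maintenance.
        res.in_service = False
        log.append(f"{res.name} taken out of service; "
                   "replacement scheduled for maintenance window")

log = []
server = Resource("server-7")
for _ in range(3):            # three failures in a row
    handle_failure(server, log)
print(server.in_service)      # False once restarts are exhausted
```

Only the quarantine step at the end needs a human, and even that can wait for the next scheduled maintenance visit.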
Service levels also need to be assured during system updates, and we need in-service software upgrades. Unlike in NFV proofs-of-concept, we cannot simply shut down the system and install a new software version from scratch – the system needs to be prepared to keep applications and services operational while the infrastructure and the management systems are incrementally upgraded.
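The usual pattern for this is a rolling upgrade: drain and upgrade one host at a time, refusing to proceed if doing so would drop capacity below what the service needs. The host model and capacity rule below are illustrative assumptions, not a specific product's mechanism:

```python
# Sketch of a rolling in-service upgrade: hosts are drained and
# upgraded one at a time so the service never loses more than one
# host's worth of capacity.

def rolling_upgrade(hosts, min_active, upgrade):
    """Upgrade each host in turn, refusing to drain a host if that
    would drop the number of active hosts below min_active."""
    for host in hosts:
        active = [h for h in hosts if h["active"]]
        if len(active) - 1 < min_active:
            raise RuntimeError("not enough redundancy to upgrade in service")
        host["active"] = False   # drain: migrate VNFs away
        upgrade(host)            # install the new version
        host["active"] = True    # bring back into rotation

hosts = [{"name": f"host-{i}", "version": "1.0", "active": True}
         for i in range(4)]

def install_v2(host):
    host["version"] = "2.0"

rolling_upgrade(hosts, min_active=3, upgrade=install_v2)
print({h["version"] for h in hosts})  # {'2.0'}
```

The same incremental pattern applies one layer up: the management systems themselves are upgraded node by node while their peers keep serving.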
Service providers should look out for solutions to the operational challenges of running an NFV-based network. Being able to run a virtualized network function on an x86 server is not enough. A key requirement is a layered architecture with well-defined interfaces that deliver the right abstractions between the layers in the form of messages and notifications. With this kind of architecture, we can turn NFV operational challenges into opportunities, taking advantage of NFV's own strengths, such as automation and the software nature of the virtual resources. Open communities – such as ETSI, OpenStack, and OPNFV – have laid the groundwork, but these efforts need to intensify to accelerate the success of NFV.