The rules of telecom infrastructure are changing, but fundamental requirements like high availability cannot be compromised. The starting point has traditionally been anchored around big-iron building blocks (the kind built for roughly 1 million subscribers), each one a five-9s package of reliable and redundant software, hardware, power, and cooling capacity: hefty, high-quality building blocks for a highly scalable, high-quality, and government-regulated network service.
Taking it one step further, in traditional networks we have historically connected these functional systems in distribution centers using telco-grade routing systems, which are also five 9s, roughly doubling the probability of failure. The traditional response has been to double the number of systems in a distribution center, and since the network leading to the distribution center, which is also five 9s, can fail as well, we double the capacity again in each center so that a buddy system of distribution centers can back each other up.
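To see why chaining five-9s elements erodes availability, and why doubling up restores it, here is a quick back-of-the-envelope sketch; the availability figures are illustrative assumptions, not numbers from any specific deployment.

```python
# Illustrative arithmetic only: why chaining five-9s elements roughly doubles
# the chance of failure, and why a redundant pair wins it back.

node = 0.99999  # availability of one five-9s element (assumption)

# Two elements in series (e.g., a functional system reached through a telco router):
series = node * node
print(f"series unavailability : {1 - series:.7f}  (~2x a single element)")

# Doubling the systems: a redundant pair where either element can carry the load:
redundant_pair = 1 - (1 - node) ** 2
print(f"redundant pair        : {redundant_pair:.10f} availability")
```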
These are the types of principles that have kept billions of customers reasonably satisfied globally, at least as far as reliability goes: digital telephony in the ’80s, dialup in the ’90s, and DSL and mobile subscribers in the new millennium. However, these are the same principles that keep telecom systems at 20 percent utilization and 20 years behind in terms of software engineering, stretching major rollouts over decades, while the Googles, Facebooks, Netflixes, and Amazons of the world are cashing in on connected services.
For this reason alone, we are seeing SDN and NFV grow in interest in the telco market, causing us to fundamentally reconsider our basic assumptions. Since five-9s reliability for mass consumer services is still a must, regulated or not, as any lower bar scales neither financially nor operationally, the question becomes: How do we achieve the same level of reliability on virtualized networks while keeping costs low, increasing utilization and elasticity, and delivering features to market faster?
Sizing Up NFV
One of the key considerations in choosing an NFV strategy is size. Just by choosing NFV, we have already decided to separate the network function from the physical network node or junction, the base principle that enables virtualization. However, there are two different ways we can go about that function-junction-separation shift.
One obvious method is for the vendors that used to supply the large, reliable, functional big iron to supply the same systems as Big VNFs: clustered software on standard hardware. In this scenario, each of these Big VNF systems needs to be able to scale individually and recover from hardware and network failures in the underlying compute clusters. This can be a more difficult task than just running a packaged embedded system, as there are different NICs, PC boards, and virtual machine operating systems to consider, as well as a far less deterministic inner-VNF interprocess communication (IPC) network to account for. That said, this approach can be done and will achieve the targeted high availability and capacity using the new NFV framework.
Unfortunately, given that this is only a first step, and not far from the legacy approach, not much will fundamentally improve. Dependency on a few large vendors is not likely to decrease, nor are costs, and feature velocity, dynamic programmability, and utilization are not likely to increase either.
While we should certainly keep the above solution in mind, as it will be needed for the transition, let us examine other options that were not available in the past, approaches that surfaced with the new NFV paradigm.
Let’s consider, for example, enterprise-class VNF software that runs as a single VM on a few CPU cores and can likely handle only a fraction (1/1000) of the capacity of a telco system. This type of NFV component can be made by numerous independent software vendors (ISVs), easing the vendor-dependency issue. However, such components are likely only three-9s reliable. Developing functionality at a fraction of the capacity and reliability of a large telco system is far easier and faster: the old 80/20 rule. Too bad that individually these are not carrier-class, but is that really the case for the overall system? Let’s do the numbers!
Three-9s reliability means it is OK for a component to be up 99.9 percent of the year, or down just under nine hours a year. That is not an acceptable number for a telco-grade solution, as too many components would be failing too frequently. However, since each VNF component is much smaller, we only need to allocate an extra 1/1000 of capacity to each VNF to absorb one failure, 2/1000 to absorb two failures, and so on, in an all-active setting that rebalances the affected 1,000 users using built-in self-healing procedures. For these procedures to achieve five 9s overall, each rebalancing of 1,000 users must complete within the per-failure downtime budget: roughly five minutes a year divided by the expected failures per instance times the expected number of function instances serving each subscriber. That is plenty of time to rebalance just a fraction of the subscribers. What’s more exciting is that this approach can be delivered at lower cost, produces higher utilization, and allows for a far better multivendor mix and increased feature velocity for service programmability.
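To make that arithmetic concrete, here is a small back-of-the-envelope sketch of the budget calculation; the MTTR and the number of function instances per subscriber are assumptions chosen for illustration, not figures from the text.

```python
# Back-of-the-envelope check: how quickly must self-healing rebalance the
# ~1,000 affected users after each small-VNF failure for the overall service
# to stay within a five-9s budget? The MTTR and the number of function
# instances per subscriber below are illustrative assumptions.

HOURS_PER_YEAR = 365 * 24  # 8,760

# Five-9s service target: at most 0.001 percent downtime per subscriber per year.
five_nines_budget_sec = (1 - 0.99999) * HOURS_PER_YEAR * 3600  # ~315 seconds

# One small (three-9s) VNF instance is down ~8.8 hours a year in total.
instance_availability = 0.999
assumed_mttr_hours = 4.0  # assumption
failures_per_instance_per_year = (
    (1 - instance_availability) * HOURS_PER_YEAR / assumed_mttr_hours
)  # ~2.2 failures per year

# Assumed number of VNF instances in each subscriber's service chain.
functions_per_subscriber = 5  # assumption

# Per-failure budget implied by the formula in the text:
# yearly budget divided by (failures per instance x functions per subscriber).
per_failure_budget_sec = five_nines_budget_sec / (
    failures_per_instance_per_year * functions_per_subscriber
)

print(f"Yearly five-9s budget : {five_nines_budget_sec:.0f} seconds")
print(f"Failures per instance : {failures_per_instance_per_year:.1f} per year")
print(f"Rebalance budget      : {per_failure_budget_sec:.0f} seconds per failure")
```

Even with these conservative assumptions, the rebalancing budget per failure comes out in the tens of seconds, which is ample for re-steering a thousand subscribers.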
Where the Magic Is
So what’s the catch? There isn’t one, but there is a hidden assumption.
In the “Big” VNF paradigm, we require the NFV vendor to bring the application state to wherever the subscriber traffic ends up. This clustering is an integral part of the one software system that absorbs and hides the underlying fluidity of software components on COTS hardware. In the “Small” VNF paradigm, however, we assume the existence of “magical” virtual networking that brings each flow to where the application state happens to be, and that hides the thousands of multivendor “Small” VNF VMs from subscribers and from each other. This is not a trivial task in terms of in-line orchestration and context management: it needs to handle scaling, balancing, chaining, affinity, and high availability. This capability is only possible now thanks to SDN technology that separates control from forwarding, a job that subnet routers cannot do. In addition, distributed flow steering at scale is only feasible now thanks to the standardization of network virtualization overlays (NVOs), which can mobilize the needed mapping just in time for SDN to act upon it.
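As a rough illustration of what that mapping layer must do, here is a toy sketch, purely hypothetical and not any vendor's controller or a real SDN API, of a steering table that pins each subscriber flow to the instance holding its state and re-pins only the affected flows when an instance fails.

```python
# Toy illustration of the flow-steering mapping an SDN/NVO layer must maintain:
# steer each flow to the instance that already holds its state (affinity), and
# re-map only the affected flows when an instance fails (self-healing).
import hashlib

class FlowSteering:
    def __init__(self, instances):
        self.instances = sorted(instances)  # healthy "Small" VNF instances
        self.pinned = {}                    # flow id -> instance (state affinity)

    def _hash(self, flow_id):
        return int(hashlib.sha256(flow_id.encode()).hexdigest(), 16)

    def lookup(self, flow_id):
        """Return the instance that should receive this flow, pinning it on first use."""
        if flow_id not in self.pinned:
            self.pinned[flow_id] = self.instances[self._hash(flow_id) % len(self.instances)]
        return self.pinned[flow_id]

    def fail(self, instance):
        """Rebalance only the flows that were pinned to the failed instance."""
        self.instances.remove(instance)
        for flow_id, inst in list(self.pinned.items()):
            if inst == instance:
                del self.pinned[flow_id]    # next lookup re-pins to a survivor

steering = FlowSteering([f"vnf-{i}" for i in range(1000)])
first = steering.lookup("subscriber-42")    # pinned to some instance
steering.fail(first)                        # simulate that instance failing
print(first, "->", steering.lookup("subscriber-42"))  # re-pinned to a survivor
```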
The general conclusion is that, in principle, because of SDN and network virtualization, we can adopt a small, agile, component-based approach to NFV. Indeed, we already see these small VNFs in many IP functions such as filters, firewalls, transcoders, and analytics collectors. We also see the “Big” VNF approach, mostly in trials that target legacy 3GPP functionality, but as the market evolves, the long-run cost-benefit numbers point to “Small” VNFs and the micro-partitioning of these functions.