Network function virtualization (NFV) is clearly on the rise, with an increasing number of production deployments across carriers worldwide. Operators are looking to create nimble, software-led topologies that can deliver services on-demand and reduce operational costs. From a data center performance standpoint, there’s a problem: Traditional IT virtualization approaches that have worked for cloud and enterprise data centers can’t cost-effectively support the I/O-centric and latency-sensitive workloads that carriers require.
NFV, as the name suggests, involves abstracting specific network functions from the underlying hardware. Where a stack was once siloed on a proprietary piece of hardware, virtual functions are created in software and can run on x86 servers in a data center. Workloads can be shifted around as needed, and network resources are spun up on-demand for whatever workload requests them. This fluid, just-in-time approach to provisioning services has significant upside in the carrier world, where over-provisioned pools of resources have always been the norm, and where hardware-tied infrastructure has historically made “service agility” an oxymoron. But there’s a bugbear ruining this rosy future-view: data center performance concerns.
Unique Carrier NFV Data Center Requirements
In the traditional, virtualized IT data center, servers mostly need to communicate with each other. Carriers, by contrast, must support packet processing and forwarding for real-time services, in and out of the data center and from one location to another, and that makes the requirements for carrier NFV very different from the enterprise. Carriers deal with “five-nines” reliability and uptime, offer strict SLAs for business services, and deliver real-time services like voice and video that can’t tolerate jitter, latency, or packet loss. New NFV-based services like bandwidth on-demand and virtual firewalls also can’t tolerate delays; a cyber-attack, for instance, needs to be detected in milliseconds. In other words, the NFV infrastructure needs to be rock-solid and fast.
“In the IT world, the architecture is built with the assumption that things will become congested, and when they do, there’s a mechanism that provides a failover quick enough so that it doesn’t impact workloads,” said Cliff Grossner, senior research director and advisor, Cloud & Data Center Research Practice Technology, Media & Telecommunications, at IHS Markit. “Whereas, in the carrier NFV scenario, that architecture is designed to avoid failure and latency altogether. If there’s a server or software outage, and the packet plane slows down or fails, they have to retransmit all of those failed packets, and that leads to network congestion and services not working.”
Virtualization’s Performance Penalty
Unfortunately, virtualization carries a high performance penalty, because virtual network functions (VNFs) consume considerable resources within the virtual machines’ virtual CPUs. The sheer amount of processing required to support so many instances of vCPU, vMemory, vStorage and so on slows down the proceedings considerably. Small packet sizes, which make up the majority of what carriers put through their networks, also put far more pressure on the I/O system’s ability to sustain line-rate throughput. Packets for real-time voice traffic, for instance, are only around 100 bytes long; at a 10 Gbps line rate, that works out to roughly 10 million packets per second that the I/O path has to keep up with.
In other words, legacy carrier hardware was dumb but efficient; virtual infrastructure in contrast is smart but slow.
The obvious way to address this is to throw more processing power at the problem, but a cost-benefit analysis comes into play. VMs take up capacity on the servers, meaning that an escalating requirement for compute capacity translates into a need for more x86 servers, more storage, more data center space to build or rent, and more people to manage it all. At a certain point, the cost of the computing and storage infrastructure can outstrip the initial operational savings that abstraction provides.
To minimize that cost, the idea is to make those virtual machines (VMs) as dense as possible. Many users also implement containerization to squeeze even more computing workloads onto a single server. It sounds good on the surface, but the greater the density, the greater the potential becomes for taking a performance hit thanks to congestion and the heavy amount of processing going on.
“Carrier data center infrastructure must take into account scalability and the cost of that infrastructure – which means achieving the highest container and VM workload density you can,” said Joe Skorupa, vice president and distinguished analyst, Data Center Convergence, at Gartner. “But the reality is, the virtual switches (vSwitches) and actually the whole software stack have never been optimized for I/O performance.”
Intel, in one of its technology briefs, admits as much: “Without careful deployment configurations, virtualization-based solutions may also tend toward non-determinism: one can ask for something to be done, and it will be done; but one cannot generally say when it will be done. It can impact concepts like NFV, where virtualized network functions running on industry-standard, high-volume servers are expected to replace proprietary hardware.”
There are, however, various technologies that can be applied to mitigate this issue.
Accelerating Into NFV
One way to address the performance problem is CPU pinning, which involves running a specific VM's virtual CPU (vCPU) on a specific physical CPU (pCPU) in a specific host. Coupling the vCPU with the hardware minimizes processing time, and it takes advantage of the fact that remnants of a process that previously ran on a given processor, such as data in the cache memory, may still be resident in that processor's state. Scheduling the process to execute on the same processor improves its performance by reducing performance-degrading events such as cache misses.
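To make the mechanics concrete, here is a minimal sketch of what pinning looks like at the Linux level, using the standard sched_setaffinity() call; a hypervisor does the equivalent for each vCPU thread it pins. The choice of core 3 is an arbitrary example, not a recommendation.

```c
/* Minimal CPU-pinning sketch (Linux). A hypervisor performs the
 * equivalent for each vCPU thread; core 3 is an arbitrary example. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(3, &mask);  /* allow this process to run on pCPU 3 only */

    /* Pid 0 means "the calling process". Once pinned, the scheduler
     * keeps the workload on the same core, so cache and TLB state
     * stay warm across scheduling decisions. */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return EXIT_FAILURE;
    }
    printf("pinned to pCPU 3\n");
    return EXIT_SUCCESS;
}
```

In practice, operators usually express pinning through the virtualization layer rather than calling the API directly, for example via libvirt's vcpupin setting or OpenStack's dedicated CPU policy.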
However, there’s a caveat.
“This creates a direct connection from the virtual machine to the NIC card, eliminating the vSwitch, and that removes performance inefficiencies that arise from the abstraction,” said Grossner. “But the downside is that when you do that, you can’t really migrate virtual machines. In some instances you don’t care, but a large reason you go to a virtual environment is to easily migrate apps and functions from one place to another.”
Intelligent network adapters and network interface cards (NICs) with acceleration also enable carriers to offload functions from the vCPU. These functions can be implemented on the card itself via ever-more-capable systems on a chip (SoCs), which leaves capacity on the stack for other things.
“Packet processing is becoming more important, so we’re seeing more field-programmable gate arrays (FPGAs) — specialized programmable circuitry on the NIC that can be coded to deal with protocols, firewalls, and other types of processing off of the CPU cores,” said Grossner.
Carriers will of course pay more for a programmable NIC, which translates into significantly higher cost per port, but the tradeoff is the ability to offload functions from very expensive CPUs.
“You have to weigh the extra cost for the card against paying for more processing power to see if it’s the right investment for your use case,” Grossner said. “But we expect carriers to invest heavily in programmable NICs that have higher ASPs than the adapters typically purchased by enterprises.”
Using Open vSwitch (OVS) with the Data Plane Development Kit (DPDK) is another main path to better I/O performance, freeing up CPU resources. DPDK optimizes the packet receive operation, eliminating a number of interrupts, context switches, and buffer copies in the Linux network stack to achieve a several-fold improvement in packet performance. OVS, in turn, takes advantage of the DPDK libraries to bypass the hypervisor kernel and boost packet performance.
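For a sense of what that looks like in code, below is a condensed sketch of a DPDK poll-mode receive loop, modeled loosely on DPDK's basic forwarding sample. Port 0, the queue and pool sizes, and the omission of most error handling are all simplifying assumptions.

```c
/* Condensed DPDK poll-mode receive sketch, loosely following the
 * project's basic forwarding sample. Assumes one DPDK-bound port. */
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define RING_SIZE  1024
#define NUM_MBUFS  8191
#define CACHE_SIZE 250
#define BURST_SIZE 32

int main(int argc, char *argv[])
{
    struct rte_eth_conf port_conf = {0};
    uint16_t port = 0;  /* assumption: first DPDK-bound port */

    if (rte_eal_init(argc, argv) < 0)
        rte_exit(EXIT_FAILURE, "EAL init failed\n");

    struct rte_mempool *pool = rte_pktmbuf_pool_create("MBUF_POOL",
        NUM_MBUFS, CACHE_SIZE, 0, RTE_MBUF_DEFAULT_BUF_SIZE,
        rte_socket_id());

    /* One RX and one TX queue are enough to show the receive path. */
    rte_eth_dev_configure(port, 1, 1, &port_conf);
    rte_eth_rx_queue_setup(port, 0, RING_SIZE,
        rte_eth_dev_socket_id(port), NULL, pool);
    rte_eth_tx_queue_setup(port, 0, RING_SIZE,
        rte_eth_dev_socket_id(port), NULL);
    rte_eth_dev_start(port);

    /* The poll loop: no interrupts, no context switches, no copies
     * through the kernel stack. Packets land directly in user space. */
    for (;;) {
        struct rte_mbuf *bufs[BURST_SIZE];
        uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);
        for (uint16_t i = 0; i < nb; i++)
            rte_pktmbuf_free(bufs[i]);  /* real code would process here */
    }
}
```

Because the loop polls the NIC directly from user space, the interrupt- and copy-heavy kernel path is bypassed entirely, which is where the several-fold packet-rate improvement comes from; the tradeoff is that the polling core runs at 100 percent regardless of load.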
Another option is the use of accelerators, which can be built on top of a NIC to allow the entire vSwitch, or significant portions of vSwitch or distributed virtual router (DVR) operations, to be offloaded to that NIC.
“You can gain a 10X NFV performance bump using these accelerators,” explained Kevin Deierling, vice president of marketing at Mellanox, describing one usage scenario: “In a firewall case, you may have a rule that says I’ve got a DDoS attack, so I need to blackhole these packets. And if you think about trying to do that in an NFV app in the data center, there are millions of packets per second running through there. Once it starts dropping malicious packets, the CPU is doing its job, but it’s not serving the real requests that are coming in, and services are disrupted. With intelligent network cards, you don’t pollute the CPU with this—the card will drop the malicious packets before it gets to the CPU, as part of the data-path operation in the network. It can look at packets, make a decision, and take action in the network before it hits the CPU.”
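Mechanically, this kind of in-NIC drop rule is the sort of thing DPDK's generic rte_flow API lets an application request. The sketch below, using a hypothetical port and attacker address, shows the shape of such a rule; whether it actually executes in NIC hardware depends on the adapter and driver.

```c
/* Sketch of pushing a "blackhole this source" rule toward the NIC via
 * DPDK's rte_flow API. Port and attacker address are hypothetical;
 * whether the rule runs in hardware depends on the adapter. */
#include <stdint.h>
#include <rte_flow.h>

struct rte_flow *blackhole_source(uint16_t port_id, uint32_t attacker_be)
{
    struct rte_flow_attr attr = { .ingress = 1 };
    struct rte_flow_error err;

    /* Match: any IPv4 packet whose source is the attacker
     * (address given in network byte order). */
    struct rte_flow_item_ipv4 spec = { .hdr.src_addr = attacker_be };
    struct rte_flow_item_ipv4 mask = { .hdr.src_addr = 0xffffffff };
    struct rte_flow_item pattern[] = {
        { .type = RTE_FLOW_ITEM_TYPE_ETH },
        { .type = RTE_FLOW_ITEM_TYPE_IPV4, .spec = &spec, .mask = &mask },
        { .type = RTE_FLOW_ITEM_TYPE_END },
    };

    /* Action: drop, ideally in the NIC, before a core ever sees it. */
    struct rte_flow_action actions[] = {
        { .type = RTE_FLOW_ACTION_TYPE_DROP },
        { .type = RTE_FLOW_ACTION_TYPE_END },
    };

    /* Validate first: if the NIC can't take the rule, the application
     * can fall back to dropping in software instead. */
    if (rte_flow_validate(port_id, &attr, pattern, actions, &err) != 0)
        return NULL;
    return rte_flow_create(port_id, &attr, pattern, actions, &err);
}
```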
Skorupa noted that improving vSwitch performance by a factor of 10 would result in a 90 percent reduction in overall server Capex, which would fundamentally change the economics of NFV. But he added that for the foreseeable future, carriers will be using a hodgepodge of approaches to resolve the data center performance issue.
“Optimization, capital costs, and performance need to be looked at not in a single dimension. There’s not just one way to do this—there are lots of different workloads in a PoP,” Skorupa said. “There’s routing, secure web gateways, session border controllers, firewalls, 3G and 4G, WAN optimization—and to have a universal infrastructure for all of that will be a challenge. Thanks to the latest chips from Intel and AMD, you might plug in accelerator cards and program DSPs for a number of functions, and you might have a flexible stack over there, then another for routing and optimization, where acceleration just doesn’t matter. And if you have spare capacity in one stack, you might allow it in the short term to be used by other parts of your infrastructure. You have to think about how to achieve maximum leverage with the assets that you have, how to get every bit of efficiency and optimization, and how to turn stuff up and down as needed.”