“Software-defined” has been the darling of the infrastructure technology industry in recent years. Software-defined network, software-defined storage, software-defined data center, software-defined security -- the list goes on and on. There is a good reason behind the software-defined everything (SDx) frenzy. Take networking as an example. Until recently, most leading networking vendors focused on speeds and feeds to keep up with exponential traffic growth and evolving service needs, but they made few revolutionary changes in network programmability, policy-based provisioning, and automation. Software-defined networking (SDN) fills that gap, promising to automate network and service provisioning and to deliver significant network operational savings.

As software and virtualization take the spotlight, a side effect is the demotion of hardware to a second-class citizen in data centers. Hardware is viewed as a target for commoditization, and software, even the virtualization layer, strives to be hardware independent. This is a dangerous slippery slope. If we look at the history of computing and data centers, many software breakthroughs were really enabled by advances in hardware. Take server virtualization as an example. In its early phase, the number of virtual machines (VMs) that a physical host could support was underwhelming, until breakthroughs significantly increased the number of cores and threads a CPU could provide. Hypervisors are actually NOT hardware independent: they are smart enough to identify a CPU that supports hyper-threading, and they schedule VMs based on the number of physical sockets, physical cores per socket, and logical processors.
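
To make that concrete, here is a minimal sketch, not from the original text, of how one might read the topology a hypervisor scheduler considers, assuming the Linux sysfs CPU topology layout (/sys/devices/system/cpu/cpuN/topology/):

    /* Count sockets, physical cores, and logical processors from Linux sysfs.
     * A sketch only: it assumes contiguous online CPUs and caps at 1024 CPUs. */
    #include <stdio.h>

    static int read_topo(const char *fmt, int cpu)
    {
        char path[128];
        FILE *f;
        int v = -1;

        snprintf(path, sizeof(path), fmt, cpu);
        f = fopen(path, "r");
        if (!f)
            return -1;
        fscanf(f, "%d", &v);
        fclose(f);
        return v;
    }

    int main(void)
    {
        long seen_core[1024];                 /* unique (socket, core) pairs */
        int seen_pkg[1024];                   /* unique physical_package_ids */
        int n_pkg = 0, n_core = 0, n_logical = 0;

        for (int cpu = 0; cpu < 1024; cpu++) {
            int pkg = read_topo("/sys/devices/system/cpu/cpu%d/topology/physical_package_id", cpu);
            int core = read_topo("/sys/devices/system/cpu/cpu%d/topology/core_id", cpu);
            if (pkg < 0 || core < 0)
                break;                        /* no more online CPUs */
            n_logical++;

            int found = 0;
            for (int i = 0; i < n_pkg; i++)
                if (seen_pkg[i] == pkg) { found = 1; break; }
            if (!found)
                seen_pkg[n_pkg++] = pkg;

            long key = ((long)pkg << 16) | core;
            found = 0;
            for (int i = 0; i < n_core; i++)
                if (seen_core[i] == key) { found = 1; break; }
            if (!found)
                seen_core[n_core++] = key;
        }

        printf("sockets=%d physical_cores=%d logical_processors=%d\n",
               n_pkg, n_core, n_logical);
        printf("hyper-threading %s\n",
               n_logical > n_core ? "appears to be enabled" : "not detected");
        return 0;
    }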

The best practice for building the most efficient virtualized data centers is really not for software to be hardware independent, but for software and hardware to work together, with software taking full advantage of the innovations and enhancements that hardware brings. The burden now falls on virtual machine managers, hypervisors, and most recently containers to implement the right abstractions so that the best hardware innovations can shine.

Another example is related to a technology called Single Root I/O Virtualization (SR-IOV). This technology has been around for a while, and it solves a key performance issue: packet I/O degrades when network functions that used to run on special-purpose hardware appliances make their way onto high-volume servers and storage devices. I have previously discussed the three aspects of packet I/O performance:

  1. Raw throughput of the interface
  2. Packet throughput
  3. Virtualized (or Containerized) packet throughput

Ultimately it is the third aspect that matters to your virtualized or containerized applications.

In a virtualized server, packet forwarding is done through a virtual switch or router, the most popular one being Open vSwitch (OVS). OVS oftentimes resides in the OS or hypervisor kernel, where it can require multiple copies, interrupts, and context switches to send or receive a packet, resulting in sub-par throughput and latency. We have started to see claims that OVS reaches close to 10 Gbit/s throughput, but the next question you should really ask is: how about small-packet performance, which is key to some virtualized network functions such as VoIP? And, perhaps more importantly, do you really want to consume your expensive CPU and memory resources simply running the virtual switch rather than the actual application being virtualized?

With the Data Plane Development Kit (DPDK), the problem is alleviated, but packet processing, including inner packet checksums, encapsulation and decapsulation, and so on, is still done by the CPU, resulting in CPU overhead and non-deterministic latency -- you never know when the core running an application linked with the DPDK library will be busy with other CPU-intensive tasks, such as deep packet inspection on a received packet.

SR-IOV virtualizes the PCIe device, the network interface card (NIC) in this case, into multiple virtual functions (VFs), and when VMs are bound to these VFs, a portion of the routine packet I/O operations can be offloaded to the NIC. The most basic offloads include stateless transport acceleration, but more advanced NICs support virtualization gateways and even an embedded OVS switch, which performs virtual packet switching in hardware -- resulting in deterministic latency and near bare-metal performance.
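
For illustration, here is a minimal sketch of the DPDK poll-mode receive path mentioned above. It is only a skeleton under the usual DPDK assumptions, a NIC port bound to a DPDK-compatible driver and huge pages configured, and API details can vary slightly across DPDK releases:

    /* Minimal DPDK receive loop: the NIC is polled from userspace, so there are
     * no interrupts, no kernel context switches, and no per-packet copies into
     * the kernel. Error handling is trimmed for brevity. */
    #include <stdio.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>
    #include <rte_lcore.h>
    #include <rte_mbuf.h>

    #define RX_RING_SIZE 1024
    #define NUM_MBUFS    8191
    #define MBUF_CACHE   250
    #define BURST_SIZE   32

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0) {
            printf("EAL init failed\n");
            return 1;
        }

        uint16_t port = 0;                    /* first port bound to DPDK */
        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "MBUF_POOL", NUM_MBUFS, MBUF_CACHE, 0,
            RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

        struct rte_eth_conf conf = { 0 };     /* default port configuration */
        rte_eth_dev_configure(port, 1, 1, &conf);
        rte_eth_rx_queue_setup(port, 0, RX_RING_SIZE, rte_socket_id(), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, RX_RING_SIZE, rte_socket_id(), NULL);
        rte_eth_dev_start(port);

        for (;;) {
            struct rte_mbuf *bufs[BURST_SIZE];
            uint16_t nb = rte_eth_rx_burst(port, 0, bufs, BURST_SIZE);

            for (uint16_t i = 0; i < nb; i++)
                rte_pktmbuf_free(bufs[i]);    /* a real app would process here */
        }
        return 0;
    }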

Sounds great, doesn’t it? You can’t help asking, what’s the catch here? There are two main pushbacks against SR-IOV adoption.

The first is that hardware independence is lost: the VMs run NIC-dependent drivers, and the number of VMs on a host is limited by the number of virtual functions. Because of this hardware dependence, the argument goes, VMs are not portable and upgrades can cause downtime.

First of all, the claim that VMs are not portable with SR-IOV is just not true. Microsoft supports live migration for VMs using SR-IOV in Hyper-V, not only between two hosts that both support SR-IOV, but also from a host that supports SR-IOV to one that does not. Here is what Microsoft says in one of its blogs on how Hyper-V live-migrates VMs using SR-IOV:

Live migrating a VM with SR-IOV NICs does work, as each SR-IOV NIC is “shadowed” by an ordinary VM Bus NIC. So if you migrate a VM to a host that doesn’t have SR-IOV NICs or where there are no more free VFs, the traffic simply continues over ordinary synthetic links.

Some hypervisor folks worry that they will lose control, and that the sky will fall if they don’t handle everything in the hypervisor itself. But ultimately the work needs to be done, and done by the most capable entity. Offloading has been used forever in the networking world. If you ask networking professionals whether their routing and forwarding devices forward packets in software on route processors or route engines with general-purpose CPUs, the answer is almost invariably no. The heavy lifting of moving packets around is offloaded to so-called line cards, with specialized hardware that handles line-rate packet forwarding at low and deterministic latency. Better yet, it is software that defines how hardware forwards packets. Software has full flexibility to implement control-path policies while hardware performs the data-path processing -- ultimate harmony achieved. The virtualization layer only needs to query hardware for advanced capabilities, define a set of APIs common to all hardware component vendors, and let them provide their own plug-ins for the framework.
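
DPDK's generic ethdev API is one concrete instance of this pattern: the application asks for capabilities through a common interface, and whichever vendor PMD plug-in is loaded answers for its hardware. A minimal sketch, noting that the offload flag names follow recent DPDK releases and differ slightly in older ones:

    /* Query each DPDK port for its driver and hardware offload capabilities
     * through the vendor-neutral ethdev API. */
    #include <stdio.h>
    #include <rte_eal.h>
    #include <rte_ethdev.h>

    int main(int argc, char **argv)
    {
        if (rte_eal_init(argc, argv) < 0) {
            printf("EAL init failed\n");
            return 1;
        }

        uint16_t port;
        RTE_ETH_FOREACH_DEV(port) {
            struct rte_eth_dev_info info;

            if (rte_eth_dev_info_get(port, &info) != 0)
                continue;

            printf("port %u: driver=%s max_rx_queues=%u\n",
                   port, info.driver_name, info.max_rx_queues);

            /* The same generic query works no matter which vendor PMD answers. */
            if (info.rx_offload_capa & RTE_ETH_RX_OFFLOAD_IPV4_CKSUM)
                printf("  RX IPv4 checksum offload supported\n");
            if (info.tx_offload_capa & RTE_ETH_TX_OFFLOAD_TCP_TSO)
                printf("  TX TCP segmentation offload (TSO) supported\n");
        }
        return 0;
    }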

This is exactly what DPDK is doing. DPDK is hardware dependent, initially working only on Intel Architecture (IA) CPUs. Leveraging its data plane enhancements requires applications to link against a CPU-specific DPDK library, which in turn loads NIC-specific poll mode driver (PMD) plug-ins. This has not been a huge concern because IA CPUs have a large market share, but other options like OpenPOWER CPUs and low-power alternatives like ARM are inching in, and they are expected to provide their own DPDK support, specific to their CPU architectures but exposing the same interface. On the NIC side, many major NIC vendors also support SR-IOV, so we are seeing a perfect storm forming. SDN and NFV might just push things over the edge, and we will see more hypervisors supporting live migration and the other performance advantages enabled by SR-IOV.

Last but not least, there is the concern about hitting the SR-IOV VF limit. Vendors currently support 128 VFs in their current generation of 40G NICs, and this number will go up to 256 for next-generation NICs. I have yet to meet a customer who deploys over 256 VMs on a single host in a production environment, period.
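
If you want to see what your own NICs expose, the Linux sysfs SR-IOV attributes report the number of VFs a physical function supports and how many are currently enabled. A minimal sketch, using a hypothetical interface name eth0:

    /* Read the SR-IOV VF counts for a (hypothetical) interface named eth0. */
    #include <stdio.h>

    static int read_int(const char *path)
    {
        FILE *f = fopen(path, "r");
        int v = -1;

        if (f) {
            fscanf(f, "%d", &v);
            fclose(f);
        }
        return v;
    }

    int main(void)
    {
        int total = read_int("/sys/class/net/eth0/device/sriov_totalvfs");
        int inuse = read_int("/sys/class/net/eth0/device/sriov_numvfs");

        if (total < 0)
            printf("eth0 does not expose SR-IOV (or the attribute is unreadable)\n");
        else
            printf("eth0: %d VFs supported by hardware, %d currently enabled\n",
                   total, inuse);
        return 0;
    }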

The second pushback is regarding PCIe resource depletion. With SR-IOV, intra-host VM-to-VM traffic has to hairpin through the NIC over the PCIe bus, consuming PCIe bandwidth and potentially leaving less headroom for inter-host traffic. Therefore, bandwidth constraints may actually favor a software-switching approach like OVS over SR-IOV.

This is truly an interesting point, and indeed there is a tradeoff between OVS and SR-IOV. Ultimately it is a question of whether you want to deplete your PCIe bandwidth with SR-IOV or your CPU resources with OVS. The answer will depend on your workload and infrastructure setup, but if your intra-host traffic rate is so high that it saturates your PCIe bus, I bet the same amount of traffic would render your CPUs completely useless as well. Ultimately, it would be ideal to have the freedom to switch intra-host traffic using either OVS or SR-IOV with an embedded OVS, depending on resource availability. That is not possible today, but hey, that is the gap new innovations can fill.