Large cloud service providers such as Amazon, Google, Baidu, and Tencent have reinvented the way IT services are delivered, with capabilities that go beyond sheer size to include speed and agility. That’s put traditional carriers on notice: John Donovan, chief strategy officer and group president of AT&T Technology and Operations, for instance, said last year that AT&T wants to be the “most aggressive IT company in the world.” He noted that in a world where over-the-top (OTT) offerings have become commonplace, application and services development can no longer be defined by legacy processes.
“People that were suppliers are now competitors,” he said. “People that were competitors are now partners in areas such as open source development. The way the whole industry worked is changing. Problems once went to standards bodies and standards bodies got everyone’s feedback and were carefully considered, and we managed and we voted and put committees together and then we produced work product. That doesn’t play well when you’re playing against the web scale players that don’t operate behind those rules.”
To position for the future and these new competitive realities, operators like AT&T have embraced carrier NFV and distributed computing to support application delivery, along with machine learning and big data analytics to manage the infrastructure. This in turn has necessitated an evolution in the data center to ensure performance for large data sets and IO-intensive workloads.
Ordinary cloud-based enterprise workloads tend to be compute-intensive, but not necessarily IO-intensive or “chatty.” Communications applications delivered using NFV and the cloud therefore bring very different requirements for the networking fabric, a reality that is magnified by developments such as streaming analytics. As carriers embrace NFV, they are gathering large amounts of data in real time about both the virtual and physical layers of the network.
“We are reaching a point where you can look for any byte, action, capture, rewrite—and it’s remarkably sophisticated stuff,” explained Peter Jarich, chief analyst at Global Data. “It’s possible to timestamp every packet going through the network end-to-end. There’s continuous, real-time monitoring of buffering in switches.”
Levers for Performance: Bandwidth Speeds
With the increased focus on massive data transfers and near-instantaneous data delivery, two things determine the performance of the network: raw throughput and latency.
To address the former, high-capacity servers are redefining the architecture and economic model inside data centers. Investment is being made to increase server connection speeds from 10 Gb/s to 25 Gb/s or greater, which gives organizations a substantial increase in bandwidth and message rates. A 25 Gb/s link, for instance, provides 2.5X the bandwidth of a 10 Gb/s link at a cost increase of just 1.3X to 1.5X.
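The economics behind those multiples can be checked with a line of arithmetic. The sketch below is only an illustration: it uses the 2.5X bandwidth and 1.3X to 1.5X cost figures quoted above, and the function name is a hypothetical helper, not part of any vendor tool.

```python
def relative_cost_per_gbps(bandwidth_multiple: float, cost_multiple: float) -> float:
    """Cost per unit of bandwidth relative to the 10 Gb/s baseline (baseline = 1.0)."""
    return cost_multiple / bandwidth_multiple


BANDWIDTH_MULTIPLE = 25 / 10  # 25 Gb/s versus 10 Gb/s

# Cost multiples are the low and high ends of the range cited in the article.
for cost_multiple in (1.3, 1.5):
    ratio = relative_cost_per_gbps(BANDWIDTH_MULTIPLE, cost_multiple)
    print(f"{cost_multiple}X cost -> {ratio:.2f}X cost per Gb/s")
```

Under those figures, each Gb/s of capacity costs roughly 0.52X to 0.60X what it did at 10 Gb/s, which is the falling price per bit the analysts describe.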
“Economically speaking, this is significant, and fabrics are being built out now with 25 Gb/s and 50 Gb/s service, which translate into network connections at 50 Gb/s and 100 Gb/s, and they’re working to get to speeds beyond that,” said Joe Skorupa, vice president and distinguished analyst for the Data Center Convergence and Data Center practice at Gartner. “The cost of optics is coming down as well. Whereas in previous fiber generations, a 40 Gb/s connection involved four lanes of 10 Gb/s each, today’s 50 Gb/s connection consists of two links of 25, and they’re working to get that down to a single link. And that’s a big economic difference. Bottom line, you have speeds going up, and the price per bit coming down as both switch port and optics costs fall.”
Capacity is an important factor for performance when it comes to data center interconnection as well. For instance, mobile edge computing is becoming a trend as carriers look to use machine learning and artificial intelligence to manage networks that have more distributed nodes.
“Historically you had one EPC node per every couple hundred base stations—so there would be a dozen sites in the evolved packet core in the U.S. traditionally,” Jarich noted. “Now, you have potentially thousands of sites, and we end up with a different paradigm because you need more bandwidth to link them to the core. The argument is that if we’re going to implement a lot of the analytics and control at the edge of the network in data centers, we have to provision those quickly and be able to turn them up and shut them down easily. So the trend is an investment in transport capacity, bulked up as an exhaust valve.”
Architecting for Reduced Latency
As for latency, the internal data center topology is changing as well. The aging three-tier design is being replaced by a leaf-spine design, which was pioneered specifically to accommodate data-intensive industries such as telecom as their data centers evolve.
A traditional three-tier model consists of core routers, aggregation routers, and access switches that are interconnected by redundant pathways. However, one primary route is designated, and the backup paths are activated only in the event of an outage.
In a next-generation configuration, there are two tiers: leaf and spine. The leaf layer consists of access switches that connect to devices such as servers, firewalls, load balancers, and edge routers. The spine layer consists of backbone switches that perform routing. Every leaf switch is connected to every spine switch, forming a mesh in which all devices are a predictable number of hops from one another and traffic between them experiences a predictable, consistent amount of latency. Dynamic routing allows the best path to be determined and adjusted as network conditions change.
“This lowers the cost dramatically and provides for huge amounts of cross-sectional bandwidth,” said Skorupa. “So, it’s not a big deal to move around large amounts of data. Soon, getting 50 Gb/s to the server will be pretty easy.”
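The predictability of the leaf-spine mesh can be illustrated with a small model. The sketch below is an assumption-laden toy, not a description of any operator's fabric: the switch counts (six leaves, four spines) and all names are invented for illustration. It wires every leaf to every spine, then uses a breadth-first search to measure the distance between each pair of leaf switches.

```python
from collections import deque
from itertools import combinations


def build_leaf_spine(num_leaves, num_spines):
    """Return (adjacency, leaves, spines) with every leaf linked to every spine."""
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    spines = [f"spine{j}" for j in range(num_spines)]
    adjacency = {node: set() for node in leaves + spines}
    for leaf in leaves:
        for spine in spines:
            adjacency[leaf].add(spine)
            adjacency[spine].add(leaf)
    return adjacency, leaves, spines


def hops(adjacency, src, dst):
    """Breadth-first search: number of links on the shortest path from src to dst."""
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError(f"no path from {src} to {dst}")


# Hypothetical fabric size, chosen only for the example.
adjacency, leaves, spines = build_leaf_spine(num_leaves=6, num_spines=4)
leaf_pairs = list(combinations(leaves, 2))

# Every leaf pair should be the same distance apart (leaf -> spine -> leaf).
distances = {hops(adjacency, a, b) for a, b in leaf_pairs}
# Each shared spine is one equal-length path between two leaves.
equal_cost_paths = {len(adjacency[a] & adjacency[b]) for a, b in leaf_pairs}

print("distinct leaf-to-leaf hop counts:", distances)
print("equal-length paths per leaf pair:", equal_cost_paths)
```

In this model every pair of leaves is exactly two links apart, and each pair has as many equal-length paths as there are spines. That multiplicity of active paths is what separates the design from a three-tier network with a single designated primary route, and it is the source of the cross-sectional bandwidth Skorupa mentions.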
Those high-speed connections matter, even for small amounts of data. “When you have a workload passing around small message sizes but is latency-sensitive, it’s a big win to go from 10 Gb/s to 25 Gb/s links,” Skorupa explained. “The network might only be busy 5 percent of the time, but to get a bit in and off the wire takes 2.5 times as long on a 10 Gb/s connection. Moving to 25 Gb/s means that it takes less time to get messages from one node to another, and that matters. All of these things make these new apps easier to do.”
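The effect Skorupa describes is serialization delay: the time needed to clock a frame's bits onto the wire, which depends only on frame size and link speed. The sketch below works through the arithmetic. The frame sizes (a 64-byte minimum Ethernet frame and a 1,500-byte payload-sized frame) are assumptions chosen for illustration, and the calculation ignores preamble, inter-frame gap, and propagation delay.

```python
def serialization_delay_ns(frame_bytes: int, link_gbps: float) -> float:
    """Time to place a frame on the wire, in nanoseconds.

    1 Gb/s is one bit per nanosecond, so bits / Gb/s gives nanoseconds directly.
    """
    return frame_bytes * 8 / link_gbps


# Illustrative frame sizes; real traffic mixes will differ.
for frame_bytes in (64, 1500):
    slow = serialization_delay_ns(frame_bytes, 10)
    fast = serialization_delay_ns(frame_bytes, 25)
    print(f"{frame_bytes:>5} bytes: {slow:8.2f} ns at 10 Gb/s, "
          f"{fast:8.2f} ns at 25 Gb/s, ratio {slow / fast:.1f}")
```

Regardless of frame size, a frame takes 2.5 times as long to serialize at 10 Gb/s as at 25 Gb/s. For a latency-sensitive application exchanging many small messages, that per-message saving applies even when the link is mostly idle.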
Offloading from the vSwitch
The other issue affecting latency for distributed and virtualized applications is the capability—or lack thereof—of the virtual switch (vSwitch).
“When you’re looking at services like WAN optimization and virtual firewalls, you’re pushing a lot of bits in and out of the box—that’s an issue for vSwitch performance,” Skorupa said. “Options for addressing this include making the vSwitch better by offloading some of the switching functions into a network interface card (NIC), or buying lots more servers, which has a negative effect on the ROI for NFV.”
An offload strategy decreases the amount of work the CPU needs to do. Moving to a technology stack that supports advanced kernel bypass techniques on a fast underlying network yields a significantly higher data transfer rate at lower latency.
Ancillary technologies help this along. Messaging acceleration software, for example, can enhance application performance by decreasing overall latency and minimizing server CPU workloads. The Data Plane Development Kit (DPDK), meanwhile, provides a programming framework that optimizes the data path between applications and the NIC. This lets applications process packets faster, which is especially beneficial for those that handle a substantial amount of Ethernet packet processing or high message rates, such as virtualized network functions.
“Carriers are more and more interested in embracing the cloud—especially as they’re looking at renovating networks for 5G and putting workloads up there,” said Jarich. “But when you go to distributed computing, real-world networking requirements change, and everyone’s trying to gain some consensus on that. The value propositions in moving to distributed computing are undeniable: lower latency, backhaul savings, manageability. So operators will come around to it.”