CALIENT Technologies is well known as the leader in Optical Circuit Switching technology. In this featured interview with Atiq Raza, Chairman and CEO of CALIENT, we learn about the new challenges driving the future of data center design, how the growing popularity of the Pod architecture simplifies operations management, and the role Optical Circuit Switching plays in improving operational efficiency.
SDxCentral: Software-defined networking (SDN) is driving a lot of innovation in the data center, but we’re certainly not quite there yet. What do you see as the key applications and challenges driving the next generation of data center design? And what types of designs are you seeing?
Raza: What we’re seeing is a new generation of very large data centers that in many ways mimic the stages that I saw processor and computer design traverse in prior roles at AMD and RMI. These new data centers are not simple collections of collocated servers. Hardware and software must work together harmoniously to efficiently deliver high levels of performance, and this can only be achieved by a holistic approach in which the data center is designed and operated as a single massive warehouse-scale computer.
Efficient utilization of data center compute resources is one of the major challenges in these facilities. This is first and foremost because compute resources represent more than 85 percent of the total capital spending on the electronic equipment in a Web-scale data center, where the total spend may be counted in billions. Any compute resource that is sitting idle or under-utilized in such a data center represents waste and cost on a massive scale. With general industry estimates suggesting that data centers use only 10 to 20 percent of their available computing cycles on average, there is clearly a great deal of room for improvement.
In terms of new designs, we’re seeing a general movement toward flexible, replicated Pod architectures with very fast, flat leaf-spine network connectivity. Networks in the larger cloud facilities are quickly moving to 100G with very low levels of oversubscription. 100G optics are dropping in cost very rapidly, with new CWDM4 and CLR4 data center-reach transceivers coming online very soon, so 100G in the data center is real.
Why do you think the Pod architecture is becoming popular? What key requirements are driving this module-based approach?
The Pod architecture originates in computer architecture, where the analogy is the worker thread. Multiple worker threads exist, and a task can be assigned to any thread – it doesn’t matter which one, as long as it knows how to process the task. With the virtualization of hardware resources and the increase in scale to warehouse-scale computers, the unit of an application becomes the equivalent of the thread, and the individual processor complex becomes a Pod of compute resources. The Pod itself is virtually partitioned within its physical limits and within a single network domain to enable efficient clustering.
The Pod architecture is popular largely because of the practicalities of scaling very large facilities, where scale requires a template for replication. New compute resources can be brought online in a modular fashion, and the Pod also simplifies maintenance activities. The underlying assumption of this architecture is that any application can run on any Pod, so the data center infrastructure team has a much simpler operations-management task. Simplified operations are absolutely fundamental to supporting the massive scale we see in Web-scale data centers.
What are the challenges to this Pod-based approach? What have you seen customers struggling with?
One of the big challenges is optimizing the use of compute resources to support the variability in demands by different services and applications because estimated demand, allocated resources, and actual demand may be quite different. Capacity planning and resource allocation are based upon estimation, with built-in over-dimensioning to support worst-case workload demands.
As workloads grow or shrink over time, there has to be enough compute resource within the Pod to accommodate the change. This is because sharing a workload across more than one Pod means splitting the workload across different L2/3 domains, leading to workload fragmentation, which can degrade performance or even break applications.
This is the reason for over-dimensioning (or, conversely, under-utilizing) Pod resources, but that over-dimensioning is simply massive wasted capex. For example, if Pods are 20 percent utilized on average and 85 percent of the cost of the data center electronics is in the Pods, then roughly 70 percent of the data center investment is sitting idle – really quite shocking.
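As a rough back-of-the-envelope check, using the illustrative figures above (not measured data), the idle share works out as follows:

```python
# Illustrative arithmetic only, using the figures quoted above.
pod_share_of_capex = 0.85   # fraction of data center electronics capex that sits in the Pods
pod_utilization = 0.20      # average fraction of Pod compute actually in use

idle_fraction = pod_share_of_capex * (1 - pod_utilization)
print(f"Idle share of total electronics investment: {idle_fraction:.0%}")  # ~68%, i.e. roughly 70 percent
```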
Google is something of an exception to the rule – claiming up to 80 percent utilization on some of its Web-indexing servers, and utilization rates in the 30 to 50 percent range for other servers, but this is by no means the norm for other data centers.
One current industry practice for improving utilization in data centers is load-balancing – simply picking up a workload and moving it to a different Pod where resources are available – but this requires policies and significant planning. In many cases the round-trip latency through multiple packet-switch hops, which amounts to several microseconds, rules out clustering, so any compute aggregation or resource balancing beyond Pod boundaries is not feasible.
How can the programmability of an SDN-approach like CALIENT LightConnect™ Fabric help? Why is this not possible with other existing solutions?
Imagine if compute resources could be easily reassigned between Pods at Layer 1, allowing workloads to grow without splitting domain boundaries. The application workloads in most cloud data centers contain the analytics to enable preemptive reassignment of resources, which may be needed as often as a few times a second. CALIENT’s LightConnect Fabric allows Pod resources to be reassigned at the optical layer in response to the needs of workloads, and it does so with a new virtual Pod, or V-Pod, architecture.
The LightConnect Fabric is a flat optical circuit-switched (OCS) network layer that sits between the TOR or T1 switches and the aggregation L2/3 switches in the Pod. It spans multiple Pods to allow sharing of resources among Pods across the data center.
Let’s say that you have a data center architecture that is built up of multiple Pods, each containing rows of racks. As compute demand grows within a Pod, it can be reconfigured to borrow resources from other Pods. Depending on the granularity of the LightConnect Fabric, it can borrow rows of racks or individual racks by switching them in. Because it’s happening at Layer 1, you’re creating a larger compute capability within the Pod and allowing workloads to expand without moving them or splitting them across domain boundaries. Conversely, a Pod can “give up” compute resources to another Pod if it has available capacity. This is the V-Pod concept.
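To make the mechanics concrete, here is a minimal sketch of the V-Pod idea – the Pod, rack, and function names are illustrative assumptions, not CALIENT software – showing a Pod that needs capacity borrowing a rack from a donor Pod by re-pointing the rack’s optical cross-connect:

```python
# Illustrative sketch of V-Pod rack borrowing at Layer 1 (names and API are assumptions).
from dataclasses import dataclass, field

@dataclass
class Pod:
    name: str
    racks: set = field(default_factory=set)   # rack IDs currently cross-connected to this Pod
    demand: int = 0                            # racks of compute the current workload needs

    @property
    def spare(self):
        return len(self.racks) - self.demand

class FakeOCS:
    """Stand-in for an optical circuit switch; only prints the port remap."""
    def connect(self, rack_port, pod_uplink):
        print(f"OCS: cross-connect rack {rack_port} -> {pod_uplink} aggregation layer")

def borrow_rack(ocs, needy: Pod, donor: Pod):
    """Move one rack from donor to needy by reconfiguring its optical cross-connect."""
    if donor.spare <= 0:
        raise RuntimeError(f"{donor.name} has no spare racks to give up")
    rack = donor.racks.pop()
    # At Layer 1 this is just a port remap in the OCS; no packet-layer reconfiguration.
    ocs.connect(rack_port=rack, pod_uplink=needy.name)
    needy.racks.add(rack)
    return rack

pod_a = Pod("Pod-A", racks={"r1", "r2"}, demand=3)        # needs one more rack
pod_b = Pod("Pod-B", racks={"r7", "r8", "r9"}, demand=1)  # has spare capacity
borrow_rack(FakeOCS(), needy=pod_a, donor=pod_b)
```

The point the sketch captures is that only the Layer 1 cross-connect changes: the borrowed rack joins the needy Pod’s existing L2/3 domain, so the workload never has to move or fragment.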
What’s really powerful about the approach is that it’s no longer necessary to over-build every Pod with resources to handle worst-case anticipated demand. Instead, you can over-build by a smaller, averaged amount across the entire data center and reallocate that capacity on demand between Pods. That translates directly to lower compute resource capex.
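A simple, hypothetical calculation illustrates the capex effect of pooling headroom across Pods rather than over-building each Pod for its own worst case (all numbers below are made up for illustration):

```python
# Hypothetical numbers to illustrate pooled vs. per-Pod over-provisioning.
pods = 10
avg_demand_per_pod = 20      # racks, average workload per Pod
peak_demand_per_pod = 40     # racks, worst-case workload per Pod

# Without V-Pods: every Pod must be built for its own worst case.
per_pod_build = pods * peak_demand_per_pod                  # 400 racks

# With V-Pods: build for the average plus a shared, facility-wide headroom,
# on the assumption (for illustration) that peaks in different Pods rarely coincide.
shared_headroom = 60
pooled_build = pods * avg_demand_per_pod + shared_headroom  # 260 racks

print(f"Per-Pod worst-case build: {per_pod_build} racks")
print(f"Pooled (V-Pod) build:     {pooled_build} racks")
print(f"Capex reduction:          {1 - pooled_build / per_pod_build:.0%}")  # 35%
```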
With existing Layer 2/3 solutions, very few programmable options exist for resource allocation at either the application or the hypervisor level. As a result, there is no really efficient solution for cluster-wide or data-center-wide reconfiguration, other than operator-driven interventions and re-cabling, which are time-intensive and highly error-prone.
Can you give an example of a real-world use case in which the OCS approach brought about significant benefits?
The LightConnect Fabric V-Pod approach to resource optimization is in the early stage of adoption within major cloud data centers where the potential savings are enormous. One of the largest Web-scale data center operators is actively deploying CALIENT’s OCS technology into its production networks, and we have several others in advanced stages of business development. The nature of these business relationships is confidential so I’m not able to share details, but the use cases are very aligned with compute resource optimization, as I’ve described earlier.
In addition to V-Pod-based resource allocations, we’re seeing several other very practical use cases that benefit from frequent network reconfiguration. These include index/database synchronizations, database resiliency across availability zones, content delivery and edge caching, and storage replication, backup and disaster recovery.
Facebook has built a modular architecture with non-blocking, any-to-any connectivity between compute resources across all Pods, but it doesn’t solve the problem of resource fragmentation when a workload expands over time and is shuffled across Pods for elasticity.
I also question whether non-blocking connectivity between every server in the data center is really needed, because application resource allocations tend to be localized within their respective clusters. By contrast, a LightConnect Fabric solution offers full agility in allocating or interconnecting resources, and at lower cost than a fully non-blocking L2/3 fabric.
Large numbers of spine switches, as we see in the fully scaled-out Facebook architecture, also increase the potential for ECMP hash conflicts in the presence of multiple east-west elephant flows. This can lead to hot-spot buildup on some links and increased latency, reducing overall network performance and throughput.
Offloading elephant flows to a separate LightConnect Fabric, by contrast, is a very elegant and cost-effective solution that frees the Layer 2/3 network to deliver the less-persistent traffic flows and avoids the buffer overloads that can occur in all-packet solutions.
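As a rough illustration of the idea – the traffic figures, threshold, and classification logic below are assumptions, not a description of any production system – a controller might steer persistent elephant flows onto dedicated OCS circuits while leaving short-lived mice on the ECMP-balanced packet fabric:

```python
# Illustrative sketch: persistent elephant flows are diverted to OCS circuits,
# while short-lived mice stay on the hash-balanced L2/3 uplinks.
# Flow data, threshold, and logic are assumptions for illustration only.
import zlib

uplinks = 4  # number of ECMP uplinks toward the spine layer

flows = [
    # (src, dst, bytes_per_sec) -- hypothetical east-west flows
    ("10.0.1.5:4431", "10.0.9.7:9000", 9_000_000_000),   # elephant
    ("10.0.2.8:5110", "10.0.9.7:9000", 8_500_000_000),   # elephant
    ("10.0.3.2:6001", "10.0.8.1:8080", 40_000),          # mouse
    ("10.0.4.9:7722", "10.0.7.3:8080", 55_000),          # mouse
]

ELEPHANT_THRESHOLD = 1_000_000_000  # bytes/sec; illustrative cutoff

def ecmp_uplink(src, dst):
    """Static hash of the flow identifier onto one of the L2/3 uplinks.
    With a fixed hash, two elephants can land on the same uplink and create a hot spot."""
    return zlib.crc32(f"{src}->{dst}".encode()) % uplinks

for src, dst, rate in flows:
    if rate >= ELEPHANT_THRESHOLD:
        # Hypothetical action: give the persistent flow its own optical circuit.
        print(f"{src} -> {dst}: elephant, offload to a LightConnect OCS circuit")
    else:
        print(f"{src} -> {dst}: mouse, stays on ECMP uplink {ecmp_uplink(src, dst)}")
```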
How would a customer go about adopting the LightConnect Fabric V-Pod approach? Is this only relevant to greenfield deployments? How can a customer get started?
While greenfield deployments are the easiest to implement, existing facilities can definitely adopt the LightConnect Fabric in their architectures. The first step is to identify which resources (racks of servers, storage, etc.) can be shared and to connect these to the LightConnect Fabric. This is a very scalable approach because it can grow from a single OCS to as many as are needed over time.
The Fabric is managed by the LightConnect Fabric Manager. This can be deployed as a stand-alone solution, allowing data center operators to make changes to the fabric as needed, or it can be integrated into existing data center resource management systems via a variety of open interfaces (TL1, REST, SNMP, RPC, or WebGUI).
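As an illustration of how an orchestration system might drive the REST interface, here is a hypothetical call – the address, endpoint, and payload fields are assumptions for the sketch, not CALIENT’s published API:

```python
# Hypothetical REST call to a fabric manager; the URL, endpoint, and payload
# fields are illustrative assumptions, not CALIENT's published API.
import requests

FABRIC_MANAGER = "https://fabric-manager.example.net"  # placeholder address

def set_cross_connect(in_port: str, out_port: str) -> None:
    """Ask the fabric manager to create an optical cross-connect between two ports."""
    resp = requests.post(
        f"{FABRIC_MANAGER}/api/cross-connects",          # hypothetical endpoint
        json={"input_port": in_port, "output_port": out_port},
        timeout=10,
    )
    resp.raise_for_status()

# Example: connect a borrowed rack's uplink into another Pod's aggregation switch.
set_cross_connect("rack-r7-uplink", "pod-a-agg-port-12")
```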