Facebook is moving to a new data center fabric that's sleeker and more elegant than the clusters in use today. The architecture is explained in detail in a blog post by Facebook's Alexey Andreyev, published Friday.
Facebook's new fabric, apparently named just "data center fabric," follows the core-and-pod model that's become popular for hyperscale deployments. Google uses it, and Big Switch Networks, as part of its mission to bring hyperscale techniques to data centers of more down-to-earth sizes, references the core-and-pod approach. The new fabric is already running in Facebook's new Altoona, Iowa, data center.
Consisting of hundreds of servers, clusters were Facebook’s answer to networking bandwidth issues, the idea being to pool large chunks of compute resources into a space that one large switch could handle.
But that idea has hit the wall, Andreyev writes in the blog post. It requires big switches that just keep getting bigger. In the case of white boxes, the biggest systems tend to be slow to get the latest interface speeds, Andreyev writes. One alternative would be to buy OEMs' systems, but those run on proprietary software, which would hamper Facebook's ability to develop its own configuration tools (as noted below).
Hence, the data center fabric, which is less a "fabric" and more a network for the entire data center. Clusters, which were arranged in groups of four (three plus a redundant cluster), are replaced with pods: smaller groups of 48 racks apiece, with each top-of-rack switch connected to four fabric switches.
These fabric switches are simpler than the big cluster switches and give Facebook more purchasing options, Andreyev writes.
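As a rough sketch of that pod wiring, each pod can be modeled as 48 top-of-rack switches that all uplink to the same four fabric switches. The device names and data structure here are hypothetical, not Facebook's tooling:

```python
# Hypothetical model of the pod wiring described in the article: 48 top-of-rack
# (ToR) switches, each uplinked to the same four fabric switches.
TORS_PER_POD = 48
FABRIC_SWITCHES_PER_POD = 4

def build_pod(pod_id: int) -> dict:
    """Map each ToR switch in a pod to its four fabric-switch uplinks."""
    fabric_switches = [f"pod{pod_id}-fsw{i}" for i in range(1, FABRIC_SWITCHES_PER_POD + 1)]
    return {f"pod{pod_id}-tor{r}": list(fabric_switches) for r in range(1, TORS_PER_POD + 1)}

pod = build_pod(1)
assert len(pod) == 48 and all(len(uplinks) == 4 for uplinks in pod.values())
print(f"{len(pod)} racks, {len(set(sum(pod.values(), [])))} fabric switches")
```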
The whole concoction connects to the outside world through edge pods, each capable of sending up to 7.68 Tb/s out. That bandwidth can go to the Internet or to the 100-Gb/s-capable pipes that connect Facebook's data centers.
The entire fabric — that is, everything from the top-of-rack switches to the edge pods’ uplinks to the outside world — is Layer 3, running on a minimal implementation of the BGP4 protocol. The fabric runs a dual IPv4/IPv6 stack.
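The article doesn't say how Facebook numbers or addresses its BGP speakers, so the following is only an illustration of what an all-Layer-3, dual-stack, BGP4-routed fabric implies: each switch runs eBGP sessions to its uplinks over both IPv4 and IPv6. The ASNs, addresses, and the FRR-style text being emitted are assumptions, not Facebook's configuration:

```python
# Illustrative sketch only: a top-of-rack switch peering over eBGP with its
# fabric-switch uplinks, dual-stack IPv4/IPv6. The ASNs, addresses, and the
# emitted FRR-style syntax are assumptions, not Facebook's configuration.
def tor_bgp_config(tor_asn: int, fabric_peers: list) -> str:
    """Render a minimal BGP config for one ToR switch from its peer list."""
    lines = [f"router bgp {tor_asn}"]
    for v4, v6, peer_asn in fabric_peers:
        lines.append(f" neighbor {v4} remote-as {peer_asn}")   # IPv4 session
        lines.append(f" neighbor {v6} remote-as {peer_asn}")   # IPv6 session
    lines.append(" address-family ipv6 unicast")
    for _, v6, _ in fabric_peers:
        lines.append(f"  neighbor {v6} activate")
    lines.append(" exit-address-family")
    return "\n".join(lines)

print(tor_bgp_config(65101, [
    ("10.1.0.1", "2001:db8:1::1", 65001),
    ("10.1.0.5", "2001:db8:1::5", 65002),
]))
```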
Unlike the cluster setup, the new fabric is not oversubscribed. In fact, inside the fabric switches, the bandwidth of every downlink to a top-of-rack switch is matched by equivalent bandwidth reserved on the fabric’s uplink side.
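A quick back-of-the-envelope check shows what "not oversubscribed" means for one fabric switch. The port counts and 40 Gb/s link speed here are assumptions for illustration, since the article gives no per-link figures:

```python
# Hypothetical figures: a fabric switch with 48 ToR-facing ports and 48
# spine-facing ports, all at 40 Gb/s (the article does not give link speeds).
downlink_gbps = 48 * 40     # bandwidth toward the top-of-rack switches
uplink_gbps = 48 * 40       # bandwidth reserved toward the fabric spine
print(downlink_gbps, uplink_gbps, downlink_gbps / uplink_gbps)
# 1920 1920 1.0 -> a 1:1 ratio, i.e. no oversubscription
```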
Configuration of the network is done at the fabric level rather than the device level. Facebook created its own tools, capable of automatically deploying configurations on any topology. The tools can also automatically discover devices that have been added to the fabric.
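Facebook hasn't published those tools, but the fabric-level approach (one declarative topology driving every device's configuration, with newly discovered switches folded in automatically) could look roughly like this sketch; all names and structures here are hypothetical:

```python
# Hypothetical sketch: one declarative topology is the source of truth, and
# every device's configuration is derived from its place in that topology.
topology = {
    "pod1-tor1": {"role": "tor",    "uplinks": ["pod1-fsw1", "pod1-fsw2", "pod1-fsw3", "pod1-fsw4"]},
    "pod1-fsw1": {"role": "fabric", "uplinks": ["spine-plane1"]},
}

def render_config(name: str, node: dict) -> str:
    """Derive a per-device config purely from its topology entry."""
    return f"{name}: role={node['role']}; peers=[{', '.join(node['uplinks'])}]"

def on_device_discovered(name: str, node: dict) -> str:
    """A newly cabled switch is added to the topology and configured the same way."""
    topology[name] = node
    return render_config(name, node)

for name, node in topology.items():
    print(render_config(name, node))
print(on_device_discovered("pod1-tor2", {"role": "tor", "uplinks": ["pod1-fsw1", "pod1-fsw2"]}))
```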
In classic hyperscale fashion, the fabric is designed to just route around faults. If a particular component crashes, the fabric adjusts accordingly, leaving the corpse of the server or switch to get replaced sometime later.
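That "route around it, bury it later" behavior is what equal-cost multipath routing over BGP provides: when a switch dies, its routes are withdrawn and it simply disappears from the set of usable next hops. The sketch below illustrates the idea with hypothetical device names; it is not Facebook's implementation:

```python
# Rough illustration of routing around a failed switch with equal-cost
# multipath: flows are hashed across the healthy uplinks, and a dead device
# simply drops out of the set until it is replaced.
import hashlib

healthy_uplinks = {"pod1-fsw1", "pod1-fsw2", "pod1-fsw3", "pod1-fsw4"}

def pick_next_hop(flow_id: str, healthy: set) -> str:
    """Hash the flow onto one of the currently healthy uplinks."""
    hops = sorted(healthy)
    digest = int(hashlib.sha256(flow_id.encode()).hexdigest(), 16)
    return hops[digest % len(hops)]

flow = "10.0.0.7->10.0.64.9:443"
print(pick_next_hop(flow, healthy_uplinks))
healthy_uplinks.discard("pod1-fsw3")          # a fabric switch dies; its routes are withdrawn
print(pick_next_hop(flow, healthy_uplinks))   # the flow rehashes onto a surviving switch
```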
The point of all this, in addition to curbing potential bandwidth crises, was to create a more modular architecture. Pieces of the fabric can be scaled or upgraded at their own pace, from the servers to the fabric spine to the inter-data-center connections.
For data centers other than Altoona, that’s how Facebook intends to implement this design: modularly. The new fabric can be added to one of Facebook’s cluster-based data centers one chunk at a time, Andreyev writes.