I don’t understand why nobody is willing to step over the line and deploy a massively scalable, Layer-2-only data center network solution. As I have noted previously, your data center is not The Internet, and it may not even be an internet. It has end stations that stick layer-3 IP headers on everything and see the world through IP address resolution, but all the benefits derived from layer-3 routing solutions can be reproduced by another implementation that is not based on the existing routing stack. At its north/south border there needs to be one or more gateways to the external routed universe, but from east to west it can speak whatever works.

All of the announced and shipping commercial Ethernet fabric solutions are proprietary right now. In general they are proprietary variations on TRILL, which looks like an L2 black box but has a thick L3 (IS-IS) filling inside. The autonomous-systems model is more of a liability than an asset in data center networks, yet it is familiar and comfortable, and thus scary for veteran operators to move away from. TRILL has the scaling limitations of a LAN with the slow fault handling of a WAN. It is not the pure L2 solution I’m talking about, and it pays a price for clinging to layer-3 topology advertisements.

Solutions fitting under the “whatever works” heading could be orders of magnitude less complex than, say, OSPF (consult your local mathematician). Complexity drives enormous TCO (see your local ops research professional or philosopher) and should only be tolerated where some benefit has enough payback to offset the cost. On technical merit, this solution should happen.
So the reason nobody does this is that IP addresses were never designed to move around and the TCAMs on commercial switching silicon only hold 64K rules? Wrong. As I’ve validated in discussions with several of the world’s largest data center operators, we can expect that edge processing will increasingly be done in the server rather than in the ToR. Some plans in this area are confidential, but it is a matter of public record that Nicira (OVS, Open vSwitch), 6Wind, and Infinetics have solutions for edge processing in the server, and Igor Gashinsky at Yahoo has publicly stated his belief in such an approach, saying it is attractive partly because it sidesteps the TCAM-size problem by doing fast software lookups into arbitrarily large tables instead. And there are commercial solutions and nascent standardization efforts to provide floodless IP address resolution in distributed switching trees (the IETF ARMD working group).
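To make the TCAM argument concrete, here is a minimal sketch (my own illustration, not any vendor’s code) of why an in-server edge switch escapes the 64K limit: its forwarding state is just an ordinary in-memory table, bounded by server RAM rather than by TCAM capacity.

```python
# Minimal sketch of an in-server edge forwarding table: a plain dict in RAM.
# Unlike a 64K-entry TCAM, it is limited only by server memory, and lookups
# stay O(1) no matter how large the table grows. All names are illustrative.

class EdgeForwardingTable:
    def __init__(self):
        self.entries = {}  # (tenant, dest_ip) -> (remote_edge, dest_mac)

    def learn(self, tenant, dest_ip, remote_edge, dest_mac):
        self.entries[(tenant, dest_ip)] = (remote_edge, dest_mac)

    def lookup(self, tenant, dest_ip):
        # On a miss, ask the directory/controller instead of flooding.
        return self.entries.get((tenant, dest_ip))

fib = EdgeForwardingTable()
# A million entries is nothing for commodity server RAM.
for i in range(1_000_000):
    ip = f"10.{i >> 16 & 255}.{i >> 8 & 255}.{i & 255}"
    mac = f"02:00:00:{i >> 16 & 255:02x}:{i >> 8 & 255:02x}:{i & 255:02x}"
    fib.learn("tenant-a", ip, remote_edge=f"edge-{i % 5000}", dest_mac=mac)
print(fib.lookup("tenant-a", "10.0.0.7"))  # ('edge-7', '02:00:00:00:00:07')
```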
Now then, if the current state of the art is capable of building a much lower-complexity, all-L2 network, it stands to reason that the networking industry would be creating such solutions and data center operators would be deploying them. So maybe it’s impossible?
Perhaps the theoretical and academic building blocks for such a solution don’t exist. Absolutely wrong. There are research papers, simulations, models, prototypes and non-commercial deployments of massive scale L2 network features including:
- IP mobility
- Tens of thousands of switches
- Millions of end stations
- Floodless address resolution
- Good multipath utilization
- High-performance multicast implementation
- Clever load-balancing algorithms
- Integrated firewall/policy enforcement
This research has been published but not commercialized, and it typically concludes with something like “We have demonstrated a simple, novel and easily-implementable approach for significantly boosting the scalability of Ethernet, which has a working prototype switch firmware implementation” (from MOOSE, referenced below). Where are the commercial versions?
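To show how small the core idea behind “floodless address resolution” really is, here is a toy sketch, my own illustration loosely in the spirit of the VL2/Portland directory-service designs rather than their actual code: the edge intercepts ARP and answers it from a directory, so nothing is ever broadcast across the fabric, and host mobility becomes a directory update.

```python
# Sketch of directory-based (floodless) address resolution, loosely in the
# spirit of the VL2 / Portland research designs. An edge switch intercepts
# ARP requests and answers them from a directory service instead of flooding
# the fabric. Illustrative only.

class Directory:
    """Maps (tenant, IP) to the MAC and current location of the host."""
    def __init__(self):
        self.entries = {}

    def register(self, tenant, ip, mac, location):
        # Called when a VM boots or migrates; mobility is just an update here.
        self.entries[(tenant, ip)] = (mac, location)

    def resolve(self, tenant, ip):
        return self.entries.get((tenant, ip))


class EdgeSwitch:
    def __init__(self, directory):
        self.directory = directory

    def handle_arp_request(self, tenant, target_ip):
        hit = self.directory.resolve(tenant, target_ip)
        if hit is None:
            return None            # unknown host: drop, never flood
        mac, location = hit
        return {"op": "arp-reply", "mac": mac, "deliver_via": location}


directory = Directory()
directory.register("tenant-a", "10.0.0.7", "02:00:00:00:00:07", "tor-12/port-3")
edge = EdgeSwitch(directory)
print(edge.handle_arp_request("tenant-a", "10.0.0.7"))
```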
Nothing that captures the benefits of these projects has been commercially offered even though there are companies that collectively have all the pieces. Why have they never been combined into a single killer solution and offered to data center operators? I would love to hear from you if you know of commercial offerings I’ve overlooked.
All through their stealth mode it was abundantly clear to me what the geniuses who invented SDN and OpenFlow would be doing at Nicira. They would, of course, be building a commercial controller that fulfilled the Scott Shenker vision of the abstracted, high-level development platform that the OpenFlow controller could become, and then they would use OpenFlow to implement a massively scalable L2 network: the killer app for the biggest data center operators. Wrong and wrong. I wasn’t even close. Along the way they decided that instead of selling their controller as “The Platform” to an ecosystem of developers, they would bake it into their own solutions. And the killer app they have chosen is a layer-2-over-layer-3 solution that stuffs all the cost and complexity of a legacy, hierarchical, routed L3 cloud underneath their overlay network. They have done this even though their solution includes a hypervisor switch that neatly implements exactly the kind of encapsulation technology one would need to implement a layer-2 solution instead.

Guido Appenzeller from Big Switch Networks, who fell off the same branch of the Stanford SDN tree as the Nicira leadership, recently reminded me that in the early days Nicira was positioned as a fabric switch company, and their early OpenFlow/SDN papers talked specifically about enabling this. So what changed? It wouldn’t be shocking to learn that they headed out to do exactly what I anticipated and “pivoted” when the biggest data center operators told them that they are only now learning to trust layer-3 ECMP technology to introduce multipath “fabrics” into their networks. Nicira’s migration from programmable hardware switch to programmable vSwitch was likely also a recognition that OpenFlow support on mature hardware platforms won’t be widely available in time for the current data center spending blitz. The already-started (2012-2014) DC server refresh will deploy 10G server interfaces and edge switches, but conservative operators are choosing L3 with ECMP as the way to increase their fabric-ness.

Moving from uni-path to multipath in the DC network will radically increase bisection bandwidth and improve network performance, but it is dangerous and scary because it steps away from the known and familiar. Network engineers must answer the question “How the heck do you debug packet loss when there are multiple paths from A to B and path selection is by non-obvious algorithms?” One might argue that the overlay solution providers could bring out pure-L2 solutions later, after folks get comfortable with the smaller step to L2-over-L3, but the benefits of all-L2 are seriously diminished if you’ve already paid the CAPEX for the L3 solution (although the TCO hit of continuing to run it might argue for a switchover). The question here isn’t really what kind of overlay you prefer (which is where Nicira fits in), but rather what multipath technology in the physical network your overlay rides on. Nicira has its own protocol proposal, STT, that it can run over an L3 cloud. VMware’s contribution to overlay protocols, VXLAN, is similarly an L2-over-L3 technology, as is Microsoft’s NV-GRE, and they all assume that the best fabric for transporting overlay packets is built on standard dynamic routing protocols. The combined VMware/Nicira doesn’t get us any closer to a next-generation L2 solution in the physical network.
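For readers who haven’t looked inside these overlays, here is a toy sketch (mine, not VMware’s or Nicira’s code) of what “L2-over-L3” means on the wire in the VXLAN case: an 8-byte VXLAN header carrying a 24-bit tenant VNI is prepended to the tenant’s original Ethernet frame, and the whole thing rides across the routed fabric inside an ordinary UDP/IP packet.

```python
# Minimal sketch of what an L2-over-L3 overlay does on the wire: a VXLAN-style
# header (flags, reserved bits, 24-bit VNI, reserved byte) is prepended to the
# tenant's original Ethernet frame; the host's normal IP stack then wraps the
# result in UDP/IP for transit across the routed fabric. Illustrative only --
# real encapsulation happens in the hypervisor vSwitch, not in Python.

import struct

VXLAN_FLAG_VALID_VNI = 0x08  # the "I" bit: the VNI field is valid

def vxlan_encapsulate(inner_ethernet_frame: bytes, vni: int) -> bytes:
    """Return the 8-byte VXLAN header followed by the inner frame."""
    header = struct.pack("!BBHI",
                         VXLAN_FLAG_VALID_VNI,    # flags
                         0, 0,                    # 24 reserved bits
                         (vni & 0xFFFFFF) << 8)   # 24-bit VNI + 8 reserved bits
    return header + inner_ethernet_frame

# A made-up inner frame: dst MAC, src MAC, ethertype, payload.
inner = (bytes.fromhex("020000000007") + bytes.fromhex("020000000001")
         + b"\x08\x00" + b"tenant payload")
packet = vxlan_encapsulate(inner, vni=5001)
print(len(packet), packet[:8].hex())  # 8-byte VXLAN header in front of the frame
```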
All of the major multipath solutions are based on statistically load balancing traffic across the N available paths at each hop. Who’s providing the far superior implementation that intelligently assigns traffic to paths based on policy or workload specifics? ECMP is a step in the right direction, but its theoretical basis is “divide the traffic randomly and hope that on average you get what you want.” We can do better.
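To be concrete about what “divide it randomly” means, here is a toy sketch of typical ECMP behavior, my own illustration rather than any vendor’s hash function: the flow’s 5-tuple is hashed and the result selects one of the equal-cost next hops, so a flow sticks to one path, but which path it lands on is effectively arbitrary.

```python
# Sketch of how ECMP picks a path: hash the flow's 5-tuple and take it modulo
# the number of equal-cost next hops. Every packet of a flow stays on one path,
# but which path a flow gets is effectively random -- which is both why it
# load-balances "on average" and why debugging packet loss across N paths is
# so unpleasant. Real switches use vendor-specific hardware hash functions.

import hashlib

def ecmp_next_hop(src_ip, dst_ip, proto, src_port, dst_port, next_hops):
    five_tuple = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    digest = hashlib.sha256(five_tuple).digest()
    index = int.from_bytes(digest[:4], "big") % len(next_hops)
    return next_hops[index]

paths = ["spine-1", "spine-2", "spine-3", "spine-4"]
# Two flows between the same pair of servers can land on different spines,
# and nothing about the choice is visible in the packets themselves.
print(ecmp_next_hop("10.0.0.7", "10.0.9.9", "tcp", 33001, 80, paths))
print(ecmp_next_hop("10.0.0.7", "10.0.9.9", "tcp", 33002, 80, paths))
```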
A brief rant about how easy it should be for anyone to whip together a massively scalable, flat L2 fabric solution: as mentioned previously, some day SDN developers will have a controller platform available to them with powerful tools and rich foundation libraries and APIs. If someone would fulfill this vision by “finishing the controller,” it would be easy to throw this together at crazy low cost (both CPlane and Big Switch Networks appear serious about trying to build a business on finishing the controller-as-platform). One could build an app that takes a cup of Portland, a splash of VL2, and a dash of multi-tenant, floodless, mobile address resolution, mixes it all with distributed-computing library sauce, and runs it on a flock of commodity 10G pizza boxes. If the controller software doesn’t cost a fortune (more on this in a future post, if Matt Palmer doesn’t beat me to the punch), this solution would blow away L2-over-L3 at a fraction of the cost.
So I can only conclude that there is a form of layer-ism being perpetrated by collusion within the DC operator community. They are all discriminating against poor layer 2 just because of its layer number. The network OEMs aren’t going to build this solution if the DC operators aren’t begging for it. Won’t someone save us? Where is the commercial L2 product for hyperscale data centers?