Network virtualization has come a long way. NSX has played a key role in redefining and modernizing networking in the datacenter. Providing an optimal routing path for traffic has been one of the topmost priorities of network architects. Thanks to NSX distributed routing, routing between different subnets on an ESXi hypervisor can be done in the kernel, so traffic never has to leave the hypervisor.
With NSX-T, we take a step further and extend this network functionality to a multi-hypervisor and multi-cloud environment.
NSX-T is a platform that provides Network and Security virtualization for a plethora of compute nodes such as ESXi, KVM, Bare Metal, Public Clouds and Containers.
This blog series will primarily focus on NSX-T routing, routing components, packet walk between the VMs sitting in same/different hypervisors, connectivity to physical infrastructure and multi-tenant routing. Let’s start with a quick reference to NSX-T architecture.
NSX-T has a built-in separation of the Management plane (NSX-T Manager), Control plane (Controllers), and Data plane (hypervisors, containers, etc.). I highly recommend going through the NSX-T whitepaper for detailed information on the architecture and the components and functionality of each plane.
- NSX-T Manager is decoupled from vCenter and is designed to run across all these heterogeneous platforms.
- NSX-T Manager and controllers can be deployed in a VM form factor on either ESXi or KVM.
- In order to provide networking to different types of compute nodes, NSX-T uses a hostswitch. This hostswitch is instantiated as a variant of the VMware virtual switch on ESXi-based endpoints and as Open vSwitch (OVS) on KVM-based endpoints.
- Data Plane stretches across a variety of compute nodes: ESXi, KVM, Containers, and NSX-T edge nodes (on/off ramp to physical infrastructure).
- Each of the compute nodes is a transport node and will have a TEP (Tunnel End Point). Depending upon the teaming policy, a host could have one or more TEPs.
- NSX-T uses GENEVE as the underlying overlay protocol for these TEPs to carry Layer 2 information across Layer 3. GENEVE provides complete flexibility to insert metadata as TLV (Type, Length, Value) fields, which can be used for new features; one example of this metadata is the VNI (Virtual Network Identifier). We recommend an MTU of 1600 to account for the encapsulation header. More details on GENEVE can be found in the following IETF draft (a sketch of the fixed header follows this list): https://datatracker.ietf.org/doc/draft-ietf-nvo3-geneve/
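To make that header concrete, here is a minimal Python sketch of the 8-byte fixed GENEVE header, following the field layout in the draft above; the function name and example values are mine, and this is illustrative rather than NSX code.

```python
import struct

ETH_P_TEB = 0x6558  # protocol type: Transparent Ethernet Bridging (inner L2 frame)

def geneve_header(vni: int, opt_len_words: int = 0, critical: bool = False) -> bytes:
    """Pack the fixed GENEVE header: Ver/OptLen, flags, protocol type, 24-bit VNI."""
    ver_optlen = (0 << 6) | (opt_len_words & 0x3F)  # version 0; options length in 4-byte words
    flags = 0x40 if critical else 0x00              # C bit set if critical options are present
    vni_word = (vni & 0xFFFFFF) << 8                # VNI in the top 24 bits; low byte reserved
    return struct.pack("!BBHI", ver_optlen, flags, ETH_P_TEB, vni_word)

# The TLV metadata mentioned above would follow these 8 fixed bytes.
assert len(geneve_header(21386)) == 8
```

The outer Ethernet, IP, and UDP headers plus these 8 bytes (and any TLV options) are the encapsulation overhead that the 1600-byte MTU recommendation budgets for.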
Before diving deep into routing, let me define a few key terms; a small sketch tying them together follows the list.
- Logical Switch – a broadcast domain which can span multiple compute hypervisors. VMs in the same subnet connect to the same logical switch.
- Distributed Router (DR) – runs as a kernel module in the hypervisor. Since it’s a kernel module, it can span across compute and edge nodes.
- Services Router (SR) – takes care of centralized functions like NAT, DHCP, LB, etc. and runs on an Edge node.
- Downlink – interface connecting to a logical switch.
- Uplink – interface connecting to the physical infrastructure/physical router.
- RouterLink – interface connecting two logical routers.
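To tie these terms together, here is a small data model; it is purely my own illustration (including the hypothetical router name), not an NSX-T API or schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class LogicalSwitch:
    name: str
    vni: int  # overlay segment ID carried in the GENEVE header

@dataclass
class Interface:
    kind: str  # "downlink" | "uplink" | "routerlink"
    ip: str

@dataclass
class LogicalRouter:
    name: str
    has_dr: bool = True    # DR: kernel module, instantiated on every transport node
    has_sr: bool = False   # SR: centralized services (NAT, DHCP, LB) on an Edge node
    interfaces: List[Interface] = field(default_factory=list)

# A router with a downlink per segment, like the one used later in this post
# (the name "tenant-lr" is hypothetical).
lr = LogicalRouter(name="tenant-lr",
                   interfaces=[Interface("downlink", "172.16.10.1")])
```

The point to carry forward is the DR/SR split: the distributed piece lives in every hypervisor’s kernel, while the centralized piece lives on an Edge node.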
Edge nodes are appliances with a pool of capacity to run the centralized services and act as an on/off ramp to the physical infrastructure. You can think of an Edge node as an empty container which hosts one or multiple logical routers to provide centralized services and connectivity to physical routers. An Edge node is a transport node just like a compute node, and it also has a TEP IP to terminate overlay tunnels.
They are available in two form factors: VM and Bare Metal (both leveraging Intel’s DPDK technology). You can’t mix and match the VM form factor with Bare Metal; the Edge nodes have to be all one or all the other.
Moving on, let’s also get familiar with the topology that I will use throughout this blog series.
I have two hypervisors in the above topology, one ESXi and one KVM. Both hypervisors are prepared for NSX and have been assigned a TEP (Tunnel End Point) IP: 192.168.140.151 for the ESXi host and 192.168.150.152 for the KVM host. These hosts have L3 connectivity between them via the transport network.
I created three logical switches via NSX Manager and connected a VM to each of the switches. I have also created a logical router, which is connected to all the logical switches and acts as the gateway for each subnet.
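For reference, the topology can be summarized as plain data. This is my own summary, not NSX configuration; the /24 masks and the .1 gateway addresses for App-LS and DB-LS are assumptions by analogy with Web-LS.

```python
# Lab topology used throughout this series (illustrative summary).
topology = {
    "transport_nodes": {
        "esxi-host": {"tep_ip": "192.168.140.151", "vms": ["Web VM1", "App VM1"]},
        "kvm-host":  {"tep_ip": "192.168.150.152", "vms": ["DB VM1"]},
    },
    "logical_switches": {
        "Web-LS": {"subnet": "172.16.10.0/24", "gateway": "172.16.10.1"},
        "App-LS": {"subnet": "172.16.20.0/24", "gateway": "172.16.20.1"},
        "DB-LS":  {"subnet": "172.16.30.0/24", "gateway": "172.16.30.1"},
    },
}
```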
Before we look at the routing table, packet walk, etc., let’s look at how the configuration appears in NSX Manager. Here is the switching configuration, showing the three logical switches.
Following is the routing configuration that I have done on the NSX Manager.
Let’s validate that on both hosts. Following is the output from ESXi showing the Logical switches and router.
Following is the output from KVM host showing the Logical switches and router.
Another important concept here is that, as soon as a VM comes up and connects to a logical switch, the host’s TEP registers the VM’s MAC with the NSX Controller. The following output from the NSX Controller shows that the MAC addresses of Web VM1, App VM1, and DB VM1 have been reported by their respective TEPs.
The NSX Controller publishes this MAC/TEP association to the compute hosts, with the delivery mechanism depending upon the type of host.
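A toy model of this control-plane behavior, with made-up MAC addresses (the real ones are in the controller output above), might look like the following.

```python
from typing import Dict, Optional, Tuple

# (VNI, VM MAC) -> TEP IP, as reported by hosts and published by the controller.
mac_tep_table: Dict[Tuple[int, str], str] = {}

def report_mac(vni: int, vm_mac: str, tep_ip: str) -> None:
    """A TEP registers a locally attached VM's MAC when the VM comes up."""
    mac_tep_table[(vni, vm_mac)] = tep_ip

def lookup_remote_tep(vni: int, dst_mac: str) -> Optional[str]:
    """Data-plane question a host can now answer: which TEP hosts this MAC?"""
    return mac_tep_table.get((vni, dst_mac))

# e.g. DB VM1 reported by the KVM host's TEP (VNI 21386 appears in the walk below;
# the MAC here is a placeholder)
report_mac(21386, "aa:bb:cc:00:00:03", "192.168.150.152")
assert lookup_remote_tep(21386, "aa:bb:cc:00:00:03") == "192.168.150.152"
```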
Now, we will look at the communication between VMs on the same hypervisor.
We have Web VM1 and App VM1 hosted on the same ESXi hypervisor. Since we are discussing communication between VMs on the same host, I am showing just the relevant part of the topology below.
Following is how traffic would go from Web VM1 to App VM1.
- Web VM1 (172.16.10.11) sends traffic to its gateway, 172.16.10.1, as the destination (172.16.20.11) is in a different subnet. This traffic traverses Web-LS and arrives at the downlink interface of the local DR on the ESXi host.
- A routing lookup happens on the ESXi distributed router (DR); the 172.16.20.0 subnet is a connected route, so the packet gets routed and is put on App-LS. (A toy version of this lookup is sketched after this list.)
- A destination MAC lookup for App VM1’s address is needed to forward the frame. Since App VM1 is hosted on the same ESXi host, the lookup finds a local MAC entry, as highlighted in the diagram above.
- The L2 header is rewritten with App VM1’s MAC and the packet is sent to App VM1.
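The lookup in step 2 amounts to a longest-prefix match against the DR’s routing table. Here is a toy version, assuming /24 masks on all three segments.

```python
import ipaddress

# Connected routes of the DR in this topology (masks assumed to be /24).
dr_routes = {
    ipaddress.ip_network("172.16.10.0/24"): "Web-LS",
    ipaddress.ip_network("172.16.20.0/24"): "App-LS",
    ipaddress.ip_network("172.16.30.0/24"): "DB-LS",
}

def dr_lookup(dst_ip: str):
    """Return the egress logical switch for a destination IP via longest-prefix match."""
    dst = ipaddress.ip_address(dst_ip)
    matches = [net for net in dr_routes if dst in net]
    return dr_routes[max(matches, key=lambda n: n.prefixlen)] if matches else None

# Web VM1 -> App VM1 resolves to the connected App-LS segment, so the packet
# is routed in the kernel and never leaves the ESXi host.
assert dr_lookup("172.16.20.11") == "App-LS"
```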
Please note that the packet didn’t have to leave the hypervisor to get routed; this routing happened in the kernel. Now that we understand the communication between two VMs (in different subnets) on the same hypervisor, let’s take a look at the packet walk from Web VM1 (172.16.10.11) on ESXi to DB VM1 (172.16.30.11) hosted on KVM.
- Web VM1 (172.16.10.11) sends traffic to its gateway, 172.16.10.1, as the destination (172.16.30.11) is in a different subnet. This traffic traverses Web-LS and arrives at the downlink interface of the local DR on the ESXi host.
- A routing lookup happens on the ESXi distributed router (DR). The packet gets routed and is put on DB-LS. The following output shows the DR on the ESXi host and its routing table.
- A destination MAC lookup for DB VM1’s address is needed to forward the frame. The lookup shows that DB VM1’s MAC is learnt via the remote TEP 192.168.150.152; again, this MAC/TEP association table was published to the hosts by the NSX Controller.
- The ESXi TEP encapsulates the packet in GENEVE and sends it to the remote TEP with an outer Src IP of 192.168.140.151 and Dst IP of 192.168.150.152. (A schematic of this encap/decap exchange follows this list.)
- The packet is received at the remote KVM TEP 192.168.150.152, where the VNI (21386) is matched. A MAC lookup is done, the encapsulation header is removed, and the packet is delivered to DB VM1.
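Putting the walk together, here is a schematic of what the two TEPs do; it is a toy model rather than the real wire format, with field values taken from the steps above and placeholder bytes for the inner frame.

```python
from typing import Optional, Set

def encapsulate(inner_frame: bytes, vni: int, src_tep: str, dst_tep: str) -> dict:
    """Source TEP: wrap the routed L2 frame in outer IP/UDP/GENEVE (toy model)."""
    return {
        "outer_ip": {"src": src_tep, "dst": dst_tep},
        "outer_udp_dport": 6081,  # GENEVE's registered UDP port
        "vni": vni,
        "payload": inner_frame,
    }

def decapsulate(pkt: dict, local_vnis: Set[int]) -> Optional[bytes]:
    """Receiving TEP: match the VNI, strip the encapsulation, hand off the frame."""
    if pkt["vni"] not in local_vnis:
        return None
    return pkt["payload"]

# Web VM1 -> DB VM1, end to end:
pkt = encapsulate(b"<inner frame for DB VM1>", vni=21386,
                  src_tep="192.168.140.151", dst_tep="192.168.150.152")
assert decapsulate(pkt, local_vnis={21386}) == b"<inner frame for DB VM1>"
```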
A quick traceflow validates the above packet walk.
This concludes the routing components part of this blog. In the next blog of this series, I will discuss multitenant routing and connectivity to physical infrastructure.