vCD Cross VDC with Cross VC NSX design Failure & High Availability Scenarios Deep Dive


 

In this blog post I will dive deep into the vCD NSX design that Daniel and I are proposing, focusing on the life of a packet during failures and showcasing the high availability that the design provides.

This is a continuation of the previous blog post, where we dove into the use cases and the high-level design of cross-VC NSX and cross-VDC vCloud Director instances.

 

Scenario 1: All components are up and running

 

Pepsi workloads use their respective Tenant_Pepsi UDLR as their default gateway.

 

East/West Traffic:

Pepsi workloads communicating at L2 East/West, whether in the same site or across sites, use the stretched L2 logical switch and reach each other via host VTEP encapsulation. For L3 communication between workloads, the Tenant UDLR does the routing on the source host, and the packet is then encapsulated by the source host's VTEP and sent to the VTEP of the host running the destination workload.
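The East/West decision above can be sketched in a few lines of Python. This is only an illustrative model, not NSX code: the function name, the step descriptions, the IP addresses and the /24 prefix length are all assumptions made for the example.

```python
import ipaddress


def east_west_steps(src_ip, dst_ip, prefix_len=24):
    """Return the forwarding steps between two workloads of the same tenant.

    Hypothetical model: workloads sit on /24 logical switches (an assumption).
    """
    src_net = ipaddress.ip_interface(f"{src_ip}/{prefix_len}").network
    dst_net = ipaddress.ip_interface(f"{dst_ip}/{prefix_len}").network

    steps = []
    if src_net != dst_net:
        # L3 case: the Tenant UDLR instance on the source host routes first.
        steps.append("route on Tenant UDLR (source host)")
    # Both cases: the source host VTEP encapsulates, the destination host
    # VTEP decapsulates, regardless of which site the hosts are in.
    steps.append("encapsulate on source host VTEP")
    steps.append("decapsulate on destination host VTEP")
    return steps


if __name__ == "__main__":
    print(east_west_steps("10.0.1.10", "10.0.1.20"))  # same logical switch (L2)
    print(east_west_steps("10.0.1.10", "10.0.2.20"))  # different subnets (L3)
```

The point of the sketch is that the only difference between the L2 and the L3 case is the extra routing step on the source host; the VTEP-to-VTEP transport is the same in both.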

North/South Traffic:

Traffic originating from Pepsi workloads, whether they live on ORGVDC site 1 or ORGVDC site 2, has to egress through the Active ORGVDC-A_Pepsi Edge Services Gateway. The Tenant UDLR has a BGP neighborship with both the Active Tenant ESG and the Standby ESG, with a higher weight for the Active Tenant-ESG.

From the Active Tenant-ESG, traffic egresses to the Provider ECMP ESGs (E1 to E4 Green), because the Provider UDLR has local egress (Active/Active) configured at the Provider NSX layer.

The Provider Primary UDLR Control VM peers over BGP with ECMP E1 to E4 Green on site 1 with weight 60, and with ECMP E5 to E8 Green on site 2 with weight 30.

The Provider Secondary UDLR Control VM peers over BGP with ECMP E1 to E4 Blue on site 2 with weight 60, and with ECMP E5 to E8 Blue on site 1 with weight 30.

This is shown in the diagram below for Tenant Pepsi, which has an Active Tenant-ESG on site 1 and a passive Tenant-ESG on site 2.

scenario1a

 

Coke, on the other hand, has an Active Tenant ESG on site 2 and a passive Tenant ESG on site 1. Hence its traffic egresses from the site 2 Provider ESGs (E1 to E4 Blue), again because local egress is configured at the Provider NSX layer.

Scenario1b

 

Note that thanks to the active/passive mode at the Tenant layer, we preserve the stateful services that the Tenant ESG provides, such as NAT, firewalling, load balancing, DHCP, etc.

Always remember that the UDLR Control VM is not in the data path, as the UDLR is an instance present on every host.

Moreover, notice how we distributed traffic on both sites leveraging maximum resources.

Scenario 2: We lose the Tenant’s Active ESG

We lose the Active ESG on site 1 for Pepsi.

Scenario2

 

BGP weight kicks in: the previously “Standby” ESG becomes active, so traffic egresses from the Tenant-ESG on site 2 (OrgVDC-B-Pepsi) and then from E1 to E4 Blue.

The same scenario applies, mirrored, to Tenant-Coke if Coke loses its Active ESG on site 2.

Scenario2b
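The tenant-level failover in this scenario can be modeled with a small Python sketch. It is a simplification that assumes the Provider layer is healthy on both sites. The ESG names come from the text, but the weight values 200 and 100 are made up for illustration (the design only requires that the Active ESG has the higher weight), and the class and function names are hypothetical.

```python
from dataclasses import dataclass

# Local egress at the Provider layer: which ECMP group serves each site.
SITE_EGRESS = {1: "E1-E4 Green", 2: "E1-E4 Blue"}


@dataclass
class TenantESG:
    name: str
    site: int
    weight: int      # illustrative value; only the ordering matters
    up: bool = True


def tenant_egress(esgs):
    """Pick the reachable Tenant ESG with the highest BGP weight.

    Returns (esg_name, provider_ecmp_group), or None if no ESG is up.
    """
    alive = [esg for esg in esgs if esg.up]
    if not alive:
        return None
    best = max(alive, key=lambda esg: esg.weight)
    return best.name, SITE_EGRESS[best.site]


if __name__ == "__main__":
    pepsi = [
        TenantESG("ORGVDC-A_Pepsi", site=1, weight=200),  # Active
        TenantESG("OrgVDC-B-Pepsi", site=2, weight=100),  # Standby
    ]
    print(tenant_egress(pepsi))   # Scenario 1: egress via site 1
    pepsi[0].up = False           # Scenario 2: lose the Active ESG
    print(tenant_egress(pepsi))   # egress moves to site 2
```

Swapping the sites and weights of the two entries gives the Coke case.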

 

Scenario 3: We lose the upstream Physical Switches on Site-1

 

Scenario 3a

 

If I lose my upstream physical switches on site 1, or more precisely the default route they originate, traffic will still egress from the Active Tenant ESG on site 1, but the Provider UDLR will send it out through E5 to E8 Green on site 2. The BGP weights kick in because the default route is now only coming from those Edges on site 2. Hence internet access is provided through site 2.

Coke traffic (Coke has its Active Tenant-ESG on site 2) keeps egressing normally from E1 to E4 Blue on site 2 (refer to Scenario 1).

 

The same happens if I lose internet on site 2: traffic for Active Tenant-ESGs on site 2 will egress from E5 to E8 Blue, while Active Tenant ESGs on site 1 will still egress from E1 to E4 Green.

Scenario3b
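The Provider-layer behaviour in Scenario 3 comes down to "use the highest-weight ECMP group that still advertises a default route". The Python sketch below models that rule using the weights 60 and 30 from Scenario 1. The data layout and function name are assumptions for illustration, and the model simply treats every ECMP group on a site that lost its upstream as no longer advertising the default route.

```python
# (group name, site where the ECMP ESGs live, BGP weight seen by the UDLR)
GROUPS = {
    "Green": [("E1-E4 Green", 1, 60), ("E5-E8 Green", 2, 30)],
    "Blue": [("E1-E4 Blue", 2, 60), ("E5-E8 Blue", 1, 30)],
}

# Tenants with their Active ESG on site 1 use the Green groups, site 2 uses Blue.
COLOR_FOR_SITE = {1: "Green", 2: "Blue"}


def egress_group(active_esg_site, failed_upstream_sites=frozenset()):
    """Return the ECMP group used for North/South egress, or None.

    Groups on a site whose upstream is down are assumed to have lost the
    default route, so the lower-weight group on the other site takes over.
    """
    color = COLOR_FOR_SITE[active_esg_site]
    usable = [
        (name, weight)
        for name, site, weight in GROUPS[color]
        if site not in failed_upstream_sites
    ]
    if not usable:
        return None
    return max(usable, key=lambda item: item[1])[0]


if __name__ == "__main__":
    # Upstream lost on site 1: Pepsi (Active ESG on site 1) moves to site 2,
    # Coke (Active ESG on site 2) is unaffected.
    print(egress_group(1, {1}))
    print(egress_group(2, {1}))
    # Upstream lost on site 2: the mirrored case.
    print(egress_group(2, {2}))
    print(egress_group(1, {2}))
```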

 

 

In summary, this is an optional design that a cloud provider can adopt.

It showcases resiliency and high availability while leveraging ECMP to egress north to the physical infrastructure. It's also a great way to detect a northbound failure and traffic-engineer around it so that traffic egresses from the healthy site.