vCD Cross VDC with Cross VC NSX design Failure & High Availability Scenarios Deep Dive


 

In this blog I will deep dive into the vCD NSX design that Daniel and I are proposing, focusing on the packet life during failures and showcasing the high availability that we get.

This is a continuation of the blog post where we deep dived into the use cases and the high-level design of cross-VC NSX and cross-VDC vCloud Director instances.

 

Scenario 1: All components are up and running

 

Pepsi workloads will have their respective Tenant_Pepsi UDLR as their Default gateway.

 

East/West Traffic:

Pepsi workloads that communicate L2 East/West, whether in the same site or across sites, will use the stretched L2 logical switch and communicate successfully via host VTEP encapsulation. If it's an L3 communication between workloads, the Tenant UDLR will do the routing on the source host and then encapsulate the packet again via the VTEP towards the destination workload's host VTEP.

North/South Traffic:

Traffic originating from Pepsi workloads, whether they live on ORGVDC site 1 and/or ORGVDC site 2, will have to egress from the Active ORGVDC-A_Pepsi Edge Services Gateway. The Tenant UDLR has a BGP neighborship with both the Active Tenant ESG and the Standby Tenant ESG, with a higher weight for the Active Tenant ESG.

From the Active Tenant ESG, traffic will egress to the Provider ECMP ESGs (E1 to E4 Green) because the Provider UDLR has local egress (Active/Active) configured in the Provider NSX layer.

The Provider Primary UDLR Control VM peers over BGP with ECMP E1 to E4 Green on site 1 with weight 60, and with ECMP E5 to E8 Green on site 2 with weight 30.

The Provider Secondary UDLR Control VM peers over BGP with ECMP E1 to E4 Blue on site 2 with weight 60, and with ECMP E5 to E8 Blue on site 1 with weight 30.

This is demonstrated in the diagram below for Tenant Pepsi, which has an active Tenant-ESG on site 1 and a passive Tenant-ESG on site 2.

scenario1a

 

Coke, on the other hand, has an active Tenant ESG on site 2 and a passive Tenant ESG on site 1. Hence traffic will egress from the Site 2 Provider ESGs (E1 to E4 Blue), again because local egress is configured at the Provider NSX layer.

Scenario1b

 

Note that thanks to the active/passive mode on the Tenant layer, we maintain the stateful services that the Tenant ESG provides, such as NAT, firewalling, load balancing, DHCP, etc.

Always remember that UDLR control VM is not in the data path as the UDLR is an instance present on every host.

Moreover, notice how we distributed traffic on both sites leveraging maximum resources.

Scenario 2 : We lose the Tenant’s Active ESG

We lose the Active ESG on site 1 for Pepsi.

Scenario2

 

BGP weight kicks in: the previously “Standby” ESG now becomes active, so traffic will egress from the Tenant-ESG on Site 2 (OrgVDC-B-Pepsi) and then from E1 to E4 Blue.

The same scenario applies to Tenant-Coke if Coke loses its Active ESG on Site 2.

Scenario2b

 

Scenario 3 : We lose the upstream Physical Switches on Site-1

 

Scenario 3a

 

If I lose my physical switches upstream, or technically the default originate, traffic will still egress from the Active Tenant ESG on Site 1, and the Provider UDLR will send the traffic to egress from E5 to E8 Green on site 2, because with BGP weights kicking in the default originate is now coming from those Edges on site 2. Hence the internet will be accessible via site 2.

Coke traffic (Coke has its active Tenant-ESG on site 2) egresses normally from E1 to E4 Blue on site 2 (refer to Scenario 1).

 

The same will happen if I lose Internet on site 2: traffic for active Tenant-ESGs on site 2 will now egress from E5 to E8 Blue, while active Tenant ESGs on site 1 will still egress from E1 to E4 Green.

Scenario3b
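To make the weight behaviour in the scenarios above a bit more concrete, here is a minimal Python sketch of the selection logic as I understand it: among the peers that still advertise a default route, the one with the highest BGP weight wins. The `Peer` class, the `advertising_default` flag and the `preferred_egress` function are hypothetical names used only for illustration; this is a simplified model, not NSX code or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    name: str                         # label for a group of ECMP edges
    site: int
    weight: int                       # BGP weight; higher is preferred
    advertising_default: bool = True  # False once the default route is lost


def preferred_egress(peers):
    """Return the peers that still advertise a default route and carry the highest weight."""
    alive = [p for p in peers if p.advertising_default]
    if not alive:
        return []
    best = max(p.weight for p in alive)
    return [p for p in alive if p.weight == best]


# Provider Primary UDLR view (Green edges), using the weights from Scenario 1.
green = [
    Peer("E1-E4 Green", site=1, weight=60),
    Peer("E5-E8 Green", site=2, weight=30),
]
normal = preferred_egress(green)

# Scenario 3: upstream switches on site 1 fail, so the site 1 edges
# no longer advertise the default route (modelled here as a simple flag).
failed = [Peer(p.name, p.site, p.weight, advertising_default=(p.site != 1)) for p in green]
after_failure = preferred_egress(failed)

print([p.name for p in normal])
print([p.name for p in after_failure])
```

The same "highest weight among healthy peers" idea applies at the Tenant layer in Scenario 2, where the Standby Tenant ESG takes over once the Active one is gone.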

 

 

In summary, this design is an optional design that a Cloud Provider can adopt.

It showcases resiliency and high availability while leveraging ECMP to egress north to the physical infrastructure. It's also a great way to detect a northbound failure and traffic engineer a workaround so that traffic egresses from the healthy site.

 

 

What is the Virtual Any Cloud Network?

The Virtual Any Cloud Network Vision

 

 


The network of the future is software-defined. A Virtual Cloud Network, built on VMware NSX technology, is a software layer from on-prem data center to any cloud, any branch and any infrastructure.

VMware NSX is the foundation of the Virtual Cloud Network, which consists of: NSX Data Center, NSX Cloud, AppDefense, NSX SD-WAN by VeloCloud and NSX Hybrid Connect.

 

In this blog, I will be focusing on the NSX Cloud piece of it to demonstrate its key features and benefits.

 

NSX Cloud

 

NSX Cloud is a secure, consistent foundation for managing the networking and security of your workloads living on literally any cloud, which includes your on-prem private cloud and/or the hyperscale public clouds (AWS, Azure and GCP).

It solves challenges Cloud Architects face when dealing with public clouds, which include lack of real-time visibility into cloud traffic flows, security policy consistency across hybrid clouds, compliance with enterprise security policies and leveraging existing operational tools.

It also adds a new line of revenue for Cloud Service Providers where NSX Cloud can be utilized to offer managed security services for their tenants’ workloads living on any cloud.

 

 

So What Are The Benefits of NSX Cloud in The Virtual Cloud Network?

1. Consistent Security For Apps Running On Any Cloud:

The main benefit of NSX Cloud, in my opinion, is the ability to enforce your firewall security rules from your on-prem NSX Manager on workloads living on-premises and/or on Azure and/or AWS and/or Google Cloud Platform, using the same firewall rule applied across all clouds.

NSX Cloud brings networking and security capabilities to endpoints across multiple clouds. By integrating with NSX Data Center (deployed on-premises), it enables networking and security management across clouds and data center sites using the same NSX manager deployed on-prem.

Security policy is automatically applied and enforced based on instance attributes and user-defined tags. Policies automatically follow instances when they are moved within and across clouds. Dynamic policies can be based on:

  • VM attributes in the cloud, e.g. VNET ID, Region, Name, Resource Group, etc.
  • Customer defined Cloud resource tags

 

NSX_Cloud_usecase1

 

Example:

A 3-tier application (Web VMs / App VMs/ DB VMs) consists of :

  • Multiple Web servers spread across multiple hyperscale clouds: AWS/Azure/GCP. All of these servers will be tagged with their respective native cloud tag: “Web”
  • App server is a Kubernetes container living On-Premises and will be tagged with “App”
  • DB Server which consists of multiple VMs living On-Premises and will be tagged with “DB”

All of the above 3-tier app workloads will also be tagged with a unique tag that differentiates the application (in this case FE/APP/DB of website X). We will use this tag in the apply-to field of the DFW rules.

Web servers will need to communicate with App servers living on AWS on port TCP 8443.

App containers will need to communicate with the DB servers on MySQL port TCP 3306.

Any source should be able to communicate with the Web servers on HTTPS TCP 443.

Three firewall rules will be needed on the NSX Manager living on-prem to enforce the above security posture, using security tag constructs across ALL of the leveraged clouds.
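As a rough illustration of why tag-based rules are cloud-agnostic, the three rules above can be modelled in a few lines of Python. This is only a sketch: the `Rule` class, the `is_allowed` function and the `WebsiteX` tag name are hypothetical, and I am assuming a default-deny policy for anything the three rules do not match. It is not how NSX Manager stores or evaluates rules.

```python
from dataclasses import dataclass

ANY = "ANY"


@dataclass(frozen=True)
class Rule:
    source_tag: str
    dest_tag: str
    port: int  # TCP destination port


# The three rules from the example, keyed on tags instead of IP addresses.
RULES = [
    Rule(ANY, "Web", 443),
    Rule("Web", "App", 8443),
    Rule("App", "DB", 3306),
]


def is_allowed(source_tags, dest_tags, port, rules=RULES):
    """Allow the flow if any rule matches; everything else is dropped (assumed default deny)."""
    for rule in rules:
        src_ok = rule.source_tag == ANY or rule.source_tag in source_tags
        if src_ok and rule.dest_tag in dest_tags and rule.port == port:
            return True
    return False


# Workloads carry the same tags regardless of which cloud they run on.
web_on_azure = {"Web", "WebsiteX"}
app_container = {"App", "WebsiteX"}
db_vm = {"DB", "WebsiteX"}

print(is_allowed(set(), web_on_azure, 443))            # any source to Web on 443
print(is_allowed(web_on_azure, app_container, 8443))   # Web to App on 8443
print(is_allowed(web_on_azure, db_vm, 3306))           # Web to DB is not covered by any rule
```

Because the rules only look at tags, nothing in them changes when a Web server moves from one cloud to another.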

usecase1

 

2. Manageability and Visibility Across All Public Clouds

 

From the NSX Cloud Service Manager “CSM”, you will have the option to add all your Public Cloud accounts (AWS/Azure/GCP).

Once you add all your public cloud accounts, you will be able to view all of the Accounts, Regions and VPCs/VNETs you own, along with their respective workloads, from the same user interface. Think of it as a single pane of glass across all accounts across public clouds.

NSX_CLOUD_Visibility

This will simplify the manageability of your workloads and save a lot of time in troubleshooting and capacity planning.

 

usecase2

3. Real Time Operation Monitoring With DFW Syslog and L3SPAN in Public Clouds

 

From an operational perspective, you want to make sure all the security groups you have are active and on. Since the firewall and security policies are now managed via the NSX Manager, we have the option to use syslog, with logs that look very similar to the logs created by the hypervisor.

You can collect the data and send it to your syslog collector, say Splunk, and run data analysis off of it.

This will provide statistics on who the VMs are talking to, which ports are permitted/denied, etc.
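The kind of analysis I have in mind can be sketched in Python as below. I am deliberately not reproducing the DFW log format here: the sketch assumes your collector has already parsed each log entry into a hypothetical `(source VM, destination VM, TCP port, action)` tuple, and the VM names and action labels are made up for illustration.

```python
from collections import Counter


def summarize(records):
    """Count flows per (source, destination) pair and per (port, action)."""
    talkers = Counter()
    ports = Counter()
    for src, dst, port, action in records:
        talkers[(src, dst)] += 1
        ports[(port, action)] += 1
    return talkers, ports


# Hypothetical, already-parsed records: (source VM, destination VM, TCP port, action).
records = [
    ("web-01", "app-01", 8443, "PASS"),
    ("web-01", "app-01", 8443, "PASS"),
    ("web-02", "db-01", 3306, "DROP"),
]

talkers, ports = summarize(records)
print(talkers.most_common(1))   # the busiest source/destination pair
print(ports[(3306, "DROP")])    # how many flows to port 3306 were dropped
```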

 

usecase3

 

 

Real-time operations visibility can also be complemented by using L3SPAN in public clouds, per the flow below:

  • Enable NSX L3 SPAN/Mirror on a per Logical port or Logical switch basis using Port-mirroring Switching Profile
  • NSX Agent running inside the VM captures the traffic.
  • Mirrored traffic forwarded to IP destination (collector) within VPC using GRE

 

usecase3_2

 

4. Full Security Compliance with the Quarantine Option

 

You will have the option to enforce full policy compliance: VMs that are created in the public cloud and are not managed by NSX will be quarantined.

This is done via multi-layered security, which provides two independent security perimeters: primary security through the NSX firewall, and a second layer of security through the native public cloud security groups, which are utilized by NSX.

Note: This is optional. You can still have a dev environment that has both NSX-managed and non-managed workloads without enforcing the quarantine.

 

usecase4

 

Note: As of the publishing date of this blog, the NSX-T 2.2 GA version fully supports workloads on-prem and Azure.

The upcoming NSX-T Cloud versions will be supporting AWS and other public Clouds.

In Summary

 

The Virtual Any Cloud Network is the next generation of cloud networking. Cloud Admins can now utilize NSX Cloud to have full visibility into cloud traffic flows, while Security Admins enforce consistent security policies across all clouds.

 

Cloud Service Providers can utilize NSX Cloud to offer managed services for their tenant workloads anywhere, creating a new line of revenue.

 

In my next post, I will discuss the CSM in further detail and deep dive into the architecture of NSX Cloud.

 

Comparing Centralized Firewalls to NSX Distributed Firewall DFW – The apples to oranges comparison

Solution Architects, and often Security Engineers, design data centers so that they achieve the highest level of security with the highest performance possible. Often firewalls are installed and configured to protect workloads from unauthorized access and to comply with security policies. VMware introduced the NSX distributed firewall concept, which changed the centralized mindset and raised the firewall component to a completely different level.

Although comparing the architecture and capabilities of centralized and distributed firewalls is like comparing apples to oranges, Architects and Network Admins often request such a comparison to try to visualize the new mindset VMware NSX DFW brought into the game.

In the next series of blogs I will show you how NSX DFW compares to traditional centralized firewalls (the apples to oranges comparison). I will also share the best practices for achieving line-rate performance/throughput when using NSX DFW, along with the results of the performance tests.

So how do Centralized and Distributed Firewalls compare?

Traditionally, Firewalls were centralized and are typically physical boxes that process the packets and take the “allow/drop” decisions based on pre-configured rules. Traffic will be typically hair-pinned to those Firewall boxes when being processed.

The VMware NSX Distributed Firewall, often called DFW, introduced a new concept by distributing the firewall capability across all compute hypervisors, without the need to make the traffic exit to another hop for the allow/drop decision.

Traditional FWs often need the packets being filtered to pass through the firewall box itself. Hence for large data centers, firewall throughput is a key concern with respect to bottlenecks in data processing. Scaling a centralized firewall is often challenging whenever the data center traffic exceeds the box's limit. Network/Security Admins will need to purchase additional firewalls to cascade with the existing ones, or often a rip and replace will be needed to accommodate the new throughput demands.

NSX DFW changes the concept of the centralized firewall and introduces a new perspective on the architectural design of firewalls. With NSX DFW, the security team can protect the workload at the virtual machine's vNIC level. Because rules are processed at the vNIC, the decision to allow or drop packets sourced from DFW-protected VMs is taken even before the packet exits the hypervisor the VM lives on.

Picture1

Traditional FW technologies are fixed based on the initial purchase of technology (e.g. a 40 Gbps FW)

Compared to…

NSX, which scales based on the number of ESXi hosts that already exist in your environment running the VM workloads

Therefore, when we talk about scaling –

  • Traditional FW technologies will require a rip/replace or physical upgrade to mitigate any performance bottlenecks/hairpinning along with potential architecture redesign
  • Compared to VMware NSX, which linearly adds performance as we add ESXi hosts to scale VM workloads… not to mention that the ESXi hosts already exist in your data center (lower CAPEX)
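The scaling argument is simple arithmetic, sketched below in Python. It uses the 40 Gbps appliance example from this post and the approximate 20 Gbps per-host line rate (2 x 10 Gbps pNICs) quoted in the summary table. The function names are hypothetical, and perfectly linear scaling is an idealized assumption: the result is aggregate capacity across hosts, not the throughput of any single flow.

```python
CENTRAL_FW_GBPS = 40      # fixed appliance example used in this post
PER_HOST_DFW_GBPS = 20    # approximate line rate per host with 2 x 10 Gbps pNICs


def dfw_aggregate_gbps(hosts, per_host_gbps=PER_HOST_DFW_GBPS):
    """Aggregate DFW capacity under an idealized linear-scaling assumption."""
    return hosts * per_host_gbps


def hosts_to_match(central_gbps=CENTRAL_FW_GBPS, per_host_gbps=PER_HOST_DFW_GBPS):
    """Smallest host count whose aggregate DFW capacity reaches the appliance figure."""
    return -(-central_gbps // per_host_gbps)  # ceiling division


for hosts in (2, 8, 32):
    # hosts, aggregate DFW Gbps, fixed appliance Gbps
    print(hosts, dfw_aggregate_gbps(hosts), CENTRAL_FW_GBPS)
```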

 

NSX performance scales as we add ESXi hosts

What is the most powerful differentiator? 

One of the most powerful features of NSX DFW, in my opinion, is the ability to create firewall rules based on static and dynamic membership criteria. NSX introduces the security group construct, an extremely efficient tool for implementing security policies or firewall rules based on the groups you define. Security groups can be leveraged to create either static or dynamic inclusion-based rules.

Static inclusion provides the capability to manually include particular objects in the security groups, such as a specific Cluster, Logical Switch, vApp, Data Center, IP Sets, Active Directory group, MAC Sets, Security tag, vNIC, VM, Resource Pool and DVS Port Group.


Dynamic Inclusion would include Computer OS name, Computer Name, VM name, Security tag and Entity.

RDecker-3

For instance, you can create a firewall rule that allows HTTPS access to all VMs that have the word “web” in their VM name. Or perhaps create firewall rules based on security tags, where a tag can be associated with a specific tenant's workloads in the Service Provider world.
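To show the idea behind dynamic inclusion, here is a small Python sketch in which group membership is computed from criteria rather than maintained by hand. The `VM` class, the `dynamic_members` function, the VM names and the tag names are all hypothetical, and the case-insensitive "contains" match is my own simplification rather than a statement about how NSX evaluates its criteria.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    tags: set = field(default_factory=set)


def dynamic_members(vms, name_contains=None, has_tag=None):
    """Return VMs matching every criterion that is set (case-insensitive name match)."""
    members = []
    for vm in vms:
        if name_contains is not None and name_contains.lower() not in vm.name.lower():
            continue
        if has_tag is not None and has_tag not in vm.tags:
            continue
        members.append(vm)
    return members


inventory = [
    VM("pepsi-web-01", {"Tenant_Pepsi"}),
    VM("pepsi-db-01", {"Tenant_Pepsi"}),
    VM("coke-WEB-01", {"Tenant_Coke"}),
]

web_group = dynamic_members(inventory, name_contains="web")
pepsi_group = dynamic_members(inventory, has_tag="Tenant_Pepsi")

# A VM created later joins the group on the next evaluation, with no rule change.
inventory.append(VM("pepsi-web-02", {"Tenant_Pepsi"}))
print([vm.name for vm in dynamic_members(inventory, name_contains="web")])
```

A firewall rule that references the group keeps working as VMs come and go, which is the point of dynamic membership.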

 

Of course, the configured FW rules move with the VM as it vMotions across NSX-prepared hosts!

 

In Summary:

 

 

| Traditional FW Technologies | VMware NSX DFW |
| --- | --- |
| CLI-centric FW model | Distributed at hypervisor level |
| Hair-pinning | Mitigation of hair-pinning due to kernel-decision processing vs the centralized model |
| Static configuration | Dynamic, API-based configuration |
| IP address-based rules | Static and dynamic firewall constructs, which include VM name, VC objects and identity-based rules |
| Fixed throughput per appliance (e.g. 40 Gbps) | Line rate: ~20 Gbps per host (with 2 x 10 Gbps pNICs); ~80 Gbps per host (with 2 x 40 Gbps NICs and MTU 8900) |
| Lack of visibility into encapsulated traffic | Full visibility into encapsulated traffic |

 

 

 

In my next blogs, I will show you the tests run against NSX DFW throughput and the best practices for achieving line-rate performance.

 

 

 Big thank you to my peer Daniel Paluszek for motivating me to start blogging and for giving me feedback on this post. You can follow his amazing blog here