8 Best Practices to Achieve Line-Rate Performance in NSX

Every architect and administrator wants maximum throughput, and with it optimum performance, from their overlay and underlay networks.

Line-rate throughput is usually what architects and admins aim for their workloads to reach. In this post, I share the best practices that architects and admins can follow to achieve maximum throughput and, with it, optimum performance.

 

So what is line-rate?

Line rate is the actual speed at which bits are sent onto the wire (the physical-layer gross bit rate).

Line-rate, or wire-speed, means that the workloads (in our case, virtual machines) should be able to push traffic at the link speed of their ESXi hosts' physical NICs.

It's important to note that the maximum achievable throughput is limited by the hypervisor's physical NIC throughput along with the VM vNIC throughput (E1000/VMXNET3), no matter how much throughput your underlay devices (switches/firewalls) support.
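To make the point above concrete, here is a minimal sketch that treats the path as a chain where the slowest element sets the ceiling. The function name and the numbers are hypothetical and only illustrate the reasoning; they are not measured values.

```python
def max_vm_throughput_gbps(pnic_gbps, vnic_gbps, underlay_gbps):
    """Return the throughput ceiling for a VM: the slowest element on the path wins."""
    return min(pnic_gbps, vnic_gbps, underlay_gbps)


# Hypothetical example: a 100 Gbps underlay does not help a host with a 10 Gbps pNIC.
ceiling = max_vm_throughput_gbps(pnic_gbps=10, vnic_gbps=10, underlay_gbps=100)
print(f"Throughput ceiling: {ceiling} Gbps")
```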

 

Best Practices To Achieve Line-rate:

 

Best Practice 1: Enable RSS on ESXi hosts (prior to ESXi 6.5)

RSS (Receive Side Scaling) looks at the outer packet headers to make queuing decisions. For traffic going between just two nodes, the only thing that differs in the outer headers is the source port, so the resulting queue distribution is not really optimal.

Note: Rx Filters, available since ESXi 6.5 as a replacement for RSS, look at the inner packet headers. Hence, queuing is a lot more balanced.

As with RSS, the NIC needs to support Rx Filters in hardware and have a driver that supports them for this to work. If available, Rx Filters are enabled by default. VMware is working on having Rx Filters listed in the I/O compatibility guide.
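The difference between hashing outer and inner headers can be sketched with a toy example. This is only an illustration: real NICs typically use a Toeplitz hash, while the sketch below swaps in CRC32, and the addresses, ports, and queue count are made up. It shows that two different VM flows whose encapsulated packets carry identical outer headers always land on the same queue, while hashing the inner headers gives each flow its own hash input.

```python
import zlib

NUM_QUEUES = 4      # hypothetical number of receive queues
VXLAN_PORT = 4789   # IANA-assigned VXLAN UDP port


def pick_queue(header_fields, num_queues):
    """Map a tuple of header fields to a queue index (toy CRC32 hash, not Toeplitz)."""
    key = "|".join(str(f) for f in header_fields).encode()
    return zlib.crc32(key) % num_queues


# Outer headers between the same two hosts: only the source port can vary.
# Here both flows happen to carry the same outer source port.
outer_a = ("10.0.0.1", "10.0.0.2", 50001, VXLAN_PORT)
outer_b = ("10.0.0.1", "10.0.0.2", 50001, VXLAN_PORT)

# Inner headers of the two VM flows are different.
inner_a = ("172.16.1.10", "172.16.2.20", 33000, 443)
inner_b = ("172.16.1.11", "172.16.2.21", 41000, 80)

outer_queues = (pick_queue(outer_a, NUM_QUEUES), pick_queue(outer_b, NUM_QUEUES))
inner_queues = (pick_queue(inner_a, NUM_QUEUES), pick_queue(inner_b, NUM_QUEUES))

print("queues from outer headers:", outer_queues)
print("queues from inner headers:", inner_queues)
```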

 

[Figures: without RSS; with RSS]

Best Practice 2: Enable TSO

 

Using TSO (TCP Segmentation Offload) on the physical and virtual machine NICs improves the performance of ESX/ESXi hosts by reducing the CPU overhead for TCP/IP network operations. The host then has more CPU cycles available to run applications.

If TSO is enabled on the transmission path, the NIC divides larger data chunks into TCP segments, whereas if TSO is disabled, the CPU performs the segmentation for TCP/IP.

TSO is enabled on the physical NIC. If you are using NSX, make sure to purchase NICs that are capable of VXLAN TSO offload.
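As a rough illustration of the work that TSO moves from the CPU to the NIC, the sketch below splits one large send buffer into MSS-sized segments. The function name and the 1460-byte MSS (a 1500-byte MTU minus 40 bytes of IPv4 and TCP headers) are assumptions for the example; a real NIC also builds the headers for every segment.

```python
def segment(payload, mss):
    """Split a large payload into chunks of at most `mss` bytes, as TSO does in hardware."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


# Hypothetical example: a 4000-byte send buffer and an MSS of 1460 bytes.
segments = segment(b"x" * 4000, 1460)
print([len(s) for s in segments])
```

Without TSO, the host CPU runs the equivalent of this loop (plus header generation) for every large send.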


 

Best Practice 3: Enable LRO

 

With LRO (Large Receive Offload) enabled, incoming network packets are reassembled into larger buffers, and the resulting larger but fewer packets are passed to the network stack of the host or virtual machine. The CPU has to process fewer packets when LRO is enabled, which reduces its utilization for networking.

LRO is enabled on the physical NIC. If you are using NSX, make sure to purchase NICs that are capable of VXLAN LRO offload.
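LRO is roughly the reverse of segmentation. The sketch below is a simplified model that assumes all segments belong to a single in-order TCP flow, and the buffer limit is a made-up value. It merges consecutive segments into larger buffers so that the stack sees fewer, bigger units.

```python
def coalesce(segments, max_buffer_bytes):
    """Merge consecutive in-order segments into buffers of at most `max_buffer_bytes`."""
    buffers = []
    current = b""
    for seg in segments:
        # Flush the current buffer if adding this segment would exceed the limit.
        if current and len(current) + len(seg) > max_buffer_bytes:
            buffers.append(current)
            current = b""
        current += seg
    if current:
        buffers.append(current)
    return buffers


# Hypothetical example: ten 1460-byte segments, buffers capped at three segments each.
incoming = [b"a" * 1460 for _ in range(10)]
merged = coalesce(incoming, max_buffer_bytes=3 * 1460)
print(f"{len(incoming)} segments -> {len(merged)} buffers")
```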

 


 

Best Practice 4: Use multiple 40 Gbps NICs with multiple PCIe buses

The more NIC bandwidth you have, the fewer bottlenecks you create. Having multiple PCIe buses helps with higher maximum system bus throughput, lower I/O pin count and a smaller physical footprint, resulting in better performance.

Best Practice 5: Use MTU 9000 in the underlay

 

To achieve maximum throughput (whether on traditional VLAN or VXLAN), having the underlay support 9K MTU jumbo frames makes a huge difference. This is most beneficial when the VM itself is configured with a corresponding MTU of 8900.
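The arithmetic behind the 8900/9000 pairing can be sketched as follows. The sketch assumes VXLAN over IPv4 with 50 bytes of encapsulation overhead and TCP/IPv4 headers without options; the function names are hypothetical.

```python
# Bytes added on top of the VM's IP packet by VXLAN encapsulation over IPv4:
# inner Ethernet 14 + VXLAN 8 + UDP 8 + outer IPv4 20.
VXLAN_OVERHEAD_BYTES = 50
TCP_IPV4_HEADER_BYTES = 40  # IPv4 20 + TCP 20, no options


def fits_underlay(vm_mtu, underlay_mtu):
    """True if a full-size VM packet still fits the underlay MTU once encapsulated."""
    return vm_mtu + VXLAN_OVERHEAD_BYTES <= underlay_mtu


def packets_needed(payload_bytes, vm_mtu):
    """Number of full-size TCP packets needed to carry `payload_bytes` of application data."""
    per_packet = vm_mtu - TCP_IPV4_HEADER_BYTES
    return (payload_bytes + per_packet - 1) // per_packet


print(fits_underlay(8900, 9000))        # VM MTU 8900 on a 9000-byte underlay
print(packets_needed(1_000_000, 1500))  # standard frames
print(packets_needed(1_000_000, 8900))  # jumbo frames
```

Fewer packets for the same amount of data means fewer per-packet operations for both the hypervisor and the VM.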

 

Best Practice 6: Purchase >= 128 GB physical Memory per host

 

This best practice is useful if you have the NSX Distributed Firewall (DFW) configured. NSX DFW uses six memory heaps for vSIP (VMware Internetworking Service Insertion Platform), and each of those heaps has more room to grow when more physical memory is available to the host.

Note below that each heap serves a specific part of the DFW functionality.

  • Heap 1 : Filters
  • Heap 2 : States
  • Heap 3 : Rules & Address Sets
  • Heap 4 : Discovered IPs
  • Heap 5 : Drop flows
  • Heap 6 : Attributes


 

Best Practice 7: Follow NSX maximum Guidelines

Staying within the maximum tested guidelines is always good practice.

VMware now publishes these guidelines publicly, and you can find them in the document below:

NSX_Configuration_Maximums_64.pdf

 

 

Best Practice 8: Compatibility Matrix

 

Make sure to check the VMware compatibility matrix for supported NICs:

https://www.vmware.com/resources/compatibility/search.php?deviceCategory=io

The driver and firmware versions should be on the latest release, and the driver/firmware combination (the recipe) should match.

You can also use that matrix to pick NICs that support the VXLAN offload features (TSO/LRO).

 

In Summary:

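Here is a summary of the best practices to follow in order to achieve line-rate performance within a vSphere environment running NSX:

  • Enable RSS on ESXi hosts (prior to ESXi 6.5)
  • Enable TSO
  • Enable LRO
  • Use multiple 40 Gbps NICs with multiple PCIe buses
  • Use MTU 9000 in the underlay
  • Purchase >= 128 GB of physical memory per host
  • Follow the NSX maximum guidelines
  • Check the compatibility matrix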

Comparing Centralized Firewalls to NSX Distributed Firewall DFW – The apples to oranges comparison

Solution architects, and often security engineers, design data centers to achieve the highest level of security with the highest performance possible. Firewalls are typically installed and configured to protect workloads from unauthorized access and to comply with security policies. VMware introduced the NSX distributed firewall concept, which changed the centralized mindset and raised the firewall component to a completely different level.

Although comparing the architecture and capabilities of centralized and distributed firewalls is like comparing apples to oranges, architects and network admins often request such a comparison to try to visualize the new mindset VMware NSX DFW brought into the game.

In the next series of posts I will show you how NSX DFW compares to traditional centralized firewalls (the apples-to-oranges comparison). I will also share the best practices for achieving line-rate performance/throughput when using NSX DFW, along with the results of the performance tests.

So how do Centralized and Distributed Firewalls compare?

Traditionally, firewalls are centralized: typically physical boxes that process packets and make the “allow/drop” decisions based on pre-configured rules. Traffic is typically hair-pinned to those firewall boxes for processing.

VMware NSX Distributed Firewall, often called DFW, introduced a new concept by distributing the firewall capability across all compute hypervisors, so traffic does not need to exit to another hop for the allow/drop decision.

Traditional firewalls need the packets being filtered to pass through the firewall box itself. Hence, for large data centers, firewall throughput is a key concern with respect to bottlenecks in data processing. Scaling a centralized firewall is often challenging whenever the data center traffic exceeds the box’s limit. Network/security admins need to purchase additional firewalls to cascade with the existing ones, or often a rip-and-replace is needed to accommodate the new throughput demands.

NSX DFW changes the concept of the centralized firewall and introduces a new perspective on the architectural design of firewalls. With NSX DFW, the security team can protect the workload at the virtual machine’s vNIC level. Because rules are processed at the vNIC, the decision to allow or drop packets sourced from DFW-protected VMs is made even before the packet exits the hypervisor the VM lives on.

The throughput of a traditional firewall is fixed by the initial purchase (e.g. a 40 Gbps firewall), whereas NSX DFW scales with the number of ESXi hosts that already exist in your environment running the VM workloads.

Therefore, when we talk about scaling:

  • Traditional FW technologies require a rip-and-replace or a physical upgrade to mitigate performance bottlenecks and hair-pinning, along with a potential architecture redesign.
  • VMware NSX adds performance linearly as we add ESXi hosts to scale VM workloads. The ESXi hosts also already exist in your data center (lower CAPEX).
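The scaling argument can be put into numbers with a small sketch. It uses an idealized linear model that assumes traffic is spread evenly across hosts; the function names are hypothetical, and the 20 Gbps per host figure corresponds to a host with 2 x 10 Gbps pNICs.

```python
import math


def aggregate_dfw_gbps(num_hosts, per_host_gbps):
    """Idealized aggregate DFW throughput: every host contributes its own line rate."""
    return num_hosts * per_host_gbps


def hosts_to_match(appliance_gbps, per_host_gbps):
    """Number of hosts whose combined line rate reaches a fixed appliance throughput."""
    return math.ceil(appliance_gbps / per_host_gbps)


print(aggregate_dfw_gbps(num_hosts=10, per_host_gbps=20))
print(hosts_to_match(appliance_gbps=40, per_host_gbps=20))
```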

 


What is the most powerful differentiator? 

One of the most powerful features of NSX DFW, in my opinion, is the ability to create firewall rules based on static and dynamic membership criteria. NSX introduces the security group construct, an extremely efficient tool for implementing security policies or firewall rules based on the groups you define. Security groups can be used to create rules based on either static or dynamic inclusion.

Static inclusion lets you manually include particular objects in a security group, such as a specific Cluster, Logical Switch, vApp, Data Center, IP Sets, Active Directory group, MAC Sets, Security tag, vNIC, VM, Resource Pool and DVS Port Group.

Dynamic inclusion criteria include Computer OS name, Computer Name, VM name, Security tag and Entity.

For instance, you can create a firewall rule that allows HTTPS access to all VMs that have the word “web” in their VM name. Or you can create firewall rules based on security tags, where a tag is associated with a specific tenant's workloads in the service provider world.
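Conceptually, dynamic inclusion behaves like a filter that is re-evaluated against the inventory. The sketch below is not the NSX API: the `VM` class, the `dynamic_members` function, and the inventory are all hypothetical and only model the idea of matching on VM name or security tag.

```python
from dataclasses import dataclass, field


@dataclass
class VM:
    name: str
    tags: set = field(default_factory=set)


def dynamic_members(vms, name_contains=None, has_tag=None):
    """Return the names of VMs that match every criterion that was given."""
    members = []
    for vm in vms:
        if name_contains is not None and name_contains.lower() not in vm.name.lower():
            continue
        if has_tag is not None and has_tag not in vm.tags:
            continue
        members.append(vm.name)
    return members


# Hypothetical inventory with tenant tags.
inventory = [
    VM("web-01", {"tenant-a"}),
    VM("db-01", {"tenant-a"}),
    VM("prod-web-02", {"tenant-b"}),
]

print(dynamic_members(inventory, name_contains="web"))
print(dynamic_members(inventory, has_tag="tenant-a"))
```

A new VM named, say, "web-03" would match the first group as soon as it appears in the inventory, without anyone editing the rule.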

 

Of course, the firewall rules you configure move with the VM as it vMotions across NSX-prepared hosts!

 

In Summary:

 

 

Traditional FW Technologies | VMware NSX DFW
CLI-centric FW model | Distributed at hypervisor level
Hair-pinning | Mitigation of hair-pinning due to kernel-decision processing vs the centralized model
Static configuration | Dynamic, API-based configuration
IP address-based rules | Static and dynamic firewall constructs, which include VM Name, VC Objects and identity-based rules
Fixed throughput per appliance (e.g. 40 Gbps) | Line rate: ~20 Gbps per host (with 2 x 10 Gbps pNICs); ~80 Gbps per host (with 2 x 40 Gbps NICs and MTU 8900)
Lack of visibility with encapsulated traffic | Full visibility into encapsulated traffic

 

 

 

In my next posts, I will show you the tests run against NSX DFW throughput and the best practices for achieving line-rate performance.

 

 

A big thank you to my peer Daniel Paluszek for motivating me to start blogging and for giving me feedback on this post. You can follow his amazing blog here