Haven’t updated my blog for quite a long time!

I’m not just lazy. I’m super lazy. :-)
https://cdn.shopify.com/s/files/1/0267/4223/products/Superlazy-clean_compact.jpg?v=1510686239
from: https://www.teeturtle.com/products/super-lazy

Linus Torvalds said, “Talk is cheap. Show me the code.” So this weekend, while waiting for my son at his after-school math club, I read the AWS Elastic Network Adapter (ENA) driver code. I learned some details of AWS ENA that I haven’t found in other articles.

ENA is AWS’s Elastic Network Adapter. It is part of the Nitro system, which AWS announced in 2017. ENA provides the network function for both bare-metal hosts and VMs on AWS; in effect, it moves network virtualization for the AWS cloud into hardware. The ENA driver can be downloaded from GitHub. According to the code there, AWS has open-sourced drivers for the Linux kernel, FreeBSD, and DPDK (user space), and the Linux driver has been upstreamed. But that doesn’t mean ENA only supports these platforms; it should also support Windows and iPXE. (Check the host_info_os_type in the code.)
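
For example, the OS type enum in ena_admin_defs.h lists more than the three open-sourced platforms. The values below are reproduced from memory from the version I read, so treat the exact numbering as approximate (newer releases add more entries):

/* Sketch of the OS type enum from ena_admin_defs.h; the driver reports
 * one of these to the device in the host_info structure. */
enum ena_admin_os_type {
	ENA_ADMIN_OS_LINUX   = 1,
	ENA_ADMIN_OS_WIN     = 2,
	ENA_ADMIN_OS_DPDK    = 3,
	ENA_ADMIN_OS_FREEBSD = 4,
	ENA_ADMIN_OS_IPXE    = 5,
};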

In the init of this PCIe device, the module init function ena_init() creates a single-threaded workqueue, which the driver later uses to defer work such as the device reset task. The driver module is named “ena”. There are four PCIe device IDs (defined in ena_pci_tbl). These IDs are interesting: they are all variations of “EC2”! :-)

#define PCI_DEV_ID_ENA_PF      0x0ec2 
#define PCI_DEV_ID_ENA_LLQ_PF  0x1ec2 
#define PCI_DEV_ID_ENA_VF      0xec20 
#define PCI_DEV_ID_ENA_LLQ_VF  0xec21

These IDs show the SR-IOV Physical Function (PF) and Virtual Function (VF) support of ENA devices. SR-IOV is one of the key technologies that AWS network virtualization relies on: it lets VMs bypass the kernel/user-space networking software, operate the NIC hardware directly, and thereby increase network performance dramatically. But SR-IOV also has limitations; for example, it is difficult to live-migrate VMs that use SR-IOV. ENA’s pci_driver struct, ena_pci_driver, also defines ena_sriov_configure as the SR-IOV configuration callback.
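
Putting these pieces together, the registration code looks roughly like the sketch below. This is a simplified paraphrase of ena_netdev.c, reproduced from memory, so field order and error handling are abbreviated:

static struct workqueue_struct *ena_wq;

static struct pci_driver ena_pci_driver = {
	.name            = DRV_MODULE_NAME,     /* "ena" */
	.id_table        = ena_pci_tbl,         /* the four "EC2" device IDs above */
	.probe           = ena_probe,
	.remove          = ena_remove,
	.sriov_configure = ena_sriov_configure, /* enable/disable VFs via sysfs */
};

static int __init ena_init(void)
{
	/* single-threaded workqueue; the reset task is queued on it later */
	ena_wq = create_singlethread_workqueue(DRV_MODULE_NAME);
	if (!ena_wq) {
		pr_err("%s: Failed to create workqueue\n", DRV_MODULE_NAME);
		return -ENOMEM;
	}

	return pci_register_driver(&ena_pci_driver);
}
module_init(ena_init);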

Another feature that can be noticed from the IDs is “LLQ”, which stands for Low Latency Queue. Some ENA devices support this operation mode, which “saves several more microseconds”. The kernel’s Documentation/networking/ena.txt describes it as follows:

The ENA driver supports two Queue Operation modes for Tx SQs:

  • Regular mode: In this mode the Tx SQs reside in the host’s memory. The ENA device fetches the ENA Tx descriptors and packet data from host memory.
  • Low Latency Queue (LLQ) mode or “push-mode”: In this mode the driver pushes the transmit descriptors and the first 128 bytes of the packet directly to the ENA device memory space. The rest of the packet payload is fetched by the device. For this operation mode, the driver uses a dedicated PCI device memory BAR, which is mapped with write-combine capability.

As the function comment describes, “ena_probe() initializes an adapter identified by a pci_dev structure. The OS initialization, configuring of the adapter private structure, and a hardware reset occur.” During initialization, the ENA device exposes standard PCI config registers and device-specific memory-mapped (MMIO) registers to the host CPU for hardware configuration. For each VM or host OS, a pair of queues, the Admin Queue (AQ) and the Admin Completion Queue (ACQ), is created for further hardware configuration after the device is initialized (see the sketch after the command list below).

The following admin queue commands are supported:

  • Create I/O submission queue
  • Create I/O completion queue
  • Destroy I/O submission queue
  • Destroy I/O completion queue
  • Get feature
  • Set feature
  • Configure AENQ
  • Get statistics
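
As a concrete example of how these commands are used, changing the MTU goes through a “Set feature” admin command. The sketch below mirrors ena_change_mtu() in ena_netdev.c, simplified and reproduced from memory, so treat the exact names as approximate:

static int ena_change_mtu(struct net_device *netdev, int new_mtu)
{
	struct ena_adapter *adapter = netdev_priv(netdev);
	int ret;

	/* ena_com_set_dev_mtu() builds a Set-Feature(MTU) command, posts it
	 * to the Admin Queue (AQ) and waits for the completion on the
	 * Admin Completion Queue (ACQ) */
	ret = ena_com_set_dev_mtu(adapter->ena_dev, new_mtu);
	if (ret) {
		netif_err(adapter, drv, netdev, "Failed to set MTU to %d\n", new_mtu);
		return ret;
	}

	netdev->mtu = new_mtu;
	return 0;
}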

Besides this, the ENA device has another mechanism, the Asynchronous Event Notification Queue (AENQ), to report device status. The driver registers three AENQ handlers (see the sketch after the list):

  • Link change: report link up/down
  • Notification: update parameters from hardware, such as admin_completion_tx_timeout, mmio_read_timeout, missed_tx_completion_count_threshold_to_reset, missing_tx_completion_timeout, netdev_wd_timeout, etc.
  • Keep alive: get keep-alive jiffies and RX drop counters
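
The handler table is registered when the admin queue is created. The struct below is a sketch based on ena_netdev.c, reproduced from memory; entries are indexed by the AENQ group:

static struct ena_aenq_handlers aenq_handlers = {
	.handlers = {
		[ENA_ADMIN_LINK_CHANGE]  = ena_update_on_link_change,
		[ENA_ADMIN_NOTIFICATION] = ena_notification,
		[ENA_ADMIN_KEEP_ALIVE]   = ena_keep_alive_wd,
	},
	/* any group the driver does not handle lands here */
	.unimplemented_handler = unimplemented_aenq_handler,
};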

ENA sets up a timer service to check for missing keep-alives, the admin queue state (admin_com_state), missing TX completions, empty RX rings, etc. If something goes wrong, the ENA driver resets the device. Reset reasons are defined in ena_regs_reset_reason_types.
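
A rough sketch of that watchdog, paraphrasing ena_timer_service() in ena_netdev.c (helper names reproduced from memory):

static void ena_timer_service(struct timer_list *t)
{
	struct ena_adapter *adapter = from_timer(adapter, t, timer_service);

	check_for_missing_keep_alive(adapter);
	check_for_admin_com_state(adapter);
	check_for_missing_completions(adapter);
	check_for_empty_rx_ring(adapter);

	/* if any check flagged a problem, queue the reset task on the
	 * workqueue created in ena_init() */
	if (unlikely(test_and_clear_bit(ENA_FLAG_TRIGGER_RESET, &adapter->flags)))
		queue_work(ena_wq, &adapter->reset_task);

	/* the service re-arms itself roughly once a second */
	mod_timer(&adapter->timer_service, round_jiffies(jiffies + HZ));
}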

From https://www.youtube.com/watch?v=RS5HS41s5YQ

For the data path, as shown in the diagram above, there is one pair of TX/RX submission queues per vCPU. Each submission queue (SQ) has a completion queue (CQ) associated with it. The maximum number of I/O queues is 128, but the actual number, computed in ena_calc_io_queue_num(), is the minimum of io_sq_num, io_cq_num, the number of online CPUs, and the number of MSI-X vectors minus 1 (one IRQ is reserved for management); a sketch of this calculation follows the list below. This submission queue and completion queue architecture has various benefits, as described in the kernel’s Documentation/networking/ena.txt:

  • Reduced CPU/thread/process contention on a given Ethernet interface.
  • Cache miss rate on completion is reduced, particularly for data cache lines that hold the sk_buff structures.
  • Increased process-level parallelism when handling received packets.
  • Increased data cache hit rate, by steering kernel processing of packets to the CPU, where the application thread consuming the packet is running.
  • In hardware interrupt re-direction.
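
A condensed sketch of the queue-count calculation, paraphrasing ena_calc_io_queue_num() (variable names are approximate):

io_queue_num = min_t(int, num_online_cpus(), ENA_MAX_NUM_IO_QUEUES); /* cap at 128 */
io_queue_num = min_t(int, io_queue_num, io_sq_num);                  /* SQs the device offers */
io_queue_num = min_t(int, io_queue_num, io_cq_num);                  /* CQs the device offers */
/* one MSI-X vector is reserved for management; the rest serve I/O queues */
io_queue_num = min_t(int, io_queue_num, pci_msix_vec_count(pdev) - 1);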

The ENA driver uses the NAPI interface as its packet-processing mechanism, and ena_io_poll() is the poll function of ENA’s NAPI. In ena_io_poll(), the driver “cleans” the TX and RX rings: ena_clean_tx_irq() fetches completion descriptors from the TX completion queue (CQ), and ena_clean_rx_irq() fetches incoming packet descriptors from the RX queue.
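
A hedged outline of the poll routine, with interrupt moderation and error handling omitted (the container struct and the TX budget derivation follow the upstream driver as far as I remember):

static int ena_io_poll(struct napi_struct *napi, int budget)
{
	struct ena_napi *ena_napi = container_of(napi, struct ena_napi, napi);
	struct ena_ring *tx_ring = ena_napi->tx_ring;
	struct ena_ring *rx_ring = ena_napi->rx_ring;
	int tx_budget = tx_ring->ring_size / 4; /* ENA_TX_POLL_BUDGET_DIVIDER */
	int tx_work_done, rx_work_done;

	/* reap TX completion descriptors and free the transmitted skbs */
	tx_work_done = ena_clean_tx_irq(tx_ring, tx_budget);

	/* fetch RX descriptors, build skbs and pass them up via NAPI/GRO */
	rx_work_done = ena_clean_rx_irq(rx_ring, napi, budget);

	/* if neither budget was exhausted, finish polling and unmask the IRQ */
	if (tx_work_done < tx_budget && rx_work_done < budget) {
		napi_complete_done(napi, rx_work_done);
		ena_unmask_interrupt(tx_ring, rx_ring);
	}

	return rx_work_done;
}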

Overall, the architecture of the ENA driver is similar to other popular NICs in the industry. ENA was developed by Annapurna Labs, which Amazon acquired in 2015. AWS has said that eventually most (or all) instances will use the Nitro hypervisor. Other big cloud vendors, such as Microsoft Azure, also have their own hardware implementations of network virtualization, and semiconductor companies such as Mellanox and Broadcom have released multiple SmartNIC solutions.

Hardware always matters in the cloud.