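My earlier network performance tests were all run on a layer-3 network. Recently, while re-testing TDocker's network performance on a large layer-2 network, I found that the physical host performed worse than a container: inside a container the test reached 600k+, while the physical machine only managed 450k+. Both are far below the expected 1M+.
Because the large layer-2 setup introduces a VLAN device (needed because the Linux bridge does not support VLANs), my first suspicion was the VLAN network device.
A look with perf showed that a spin lock in dev_queue_xmit was consuming a large share of the CPU, more than 70%.
On a 3.10.x kernel, however, the problem does not appear.
In the perf output for the 3.10.x kernel, the kernel spin lock overhead is small. The call path in the latter profile also shows that the spin lock is mainly taken when the sk_buff is passed from the VLAN device down to the physical NIC, not when it is passed from the protocol stack to the VLAN device. So on CentOS 6.5 (2.6.32-431) the problem appears to lie mainly in the VLAN device.
Start with dev_queue_xmit, which is the entry point from the protocol stack into the underlying network device.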
//net/core/dev.c
int dev_queue_xmit(struct sk_buff *skb)
{
	struct net_device *dev = skb->dev;
	struct netdev_queue *txq;
	struct Qdisc *q;
	...
	txq = netdev_pick_tx(dev, skb);
	q = rcu_dereference(txq->qdisc);
	trace_net_dev_queue(skb);
	if (q->enqueue) { /// a VLAN device has no qdisc queue (see noqueue_qdisc), so this branch is skipped
		rc = __dev_xmit_skb(skb, q, dev, txq);
		goto out;
	}
	/* The device has no queue. Common case for software devices:
	   loopback, all the sorts of tunnels...
	   Really, it is unlikely that netif_tx_lock protection is necessary
	   here. (f.e. loopback and IP tunnels are clean ignoring statistics
	   counters.)
	   However, it is possible, that they rely on protection
	   made by us here.
	   Check this and shot the lock. It is not prone from deadlocks.
	   Either shot noqueue qdisc, it is even simpler 8)
	 */
	if (dev->flags & IFF_UP) {
		int cpu = smp_processor_id(); /* ok because BHs are off */
		if (txq->xmit_lock_owner != cpu) {
			HARD_TX_LOCK(dev, txq, cpu);
			if (!netif_tx_queue_stopped(txq)) {
				rc = NET_XMIT_SUCCESS;
				if (!dev_hard_start_xmit(skb, dev, txq)) {
					HARD_TX_UNLOCK(dev, txq);
					goto out;
				}
			}
			HARD_TX_UNLOCK(dev, txq);
		}
	}
	rc = -ENETDOWN;
	rcu_read_unlock_bh();
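As the code shows, before handing the sk_buff to the device driver the kernel tries to take the queue's xmit_lock, so that multiple CPUs on an SMP system cannot enter the driver's transmit path at the same time. Some devices, software devices in particular, either do their own locking or need none, which makes this xmit_lock redundant for them. The kernel therefore provides NETIF_F_LLTX: a driver that handles locking itself sets this feature flag, and dev_queue_xmit then skips taking xmit_lock.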
#define HARD_TX_LOCK(dev, txq, cpu) {			\
	if ((dev->features & NETIF_F_LLTX) == 0) {	\
		__netif_tx_lock(txq, cpu);		\
	}						\
}
static inline void __netif_tx_lock(struct netdev_queue *txq, int cpu)
{
	spin_lock(&txq->_xmit_lock);
	txq->xmit_lock_owner = cpu;
}
As the code above shows, when a network device sets NETIF_F_LLTX the kernel does not take xmit_lock.
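In the CentOS 6.5 (2.6.32-431) kernel, however, VLAN devices do not set NETIF_F_LLTX. Because a VLAN device has only one queue, every transmitting CPU contends for the same xmit_lock, which pushes sys CPU above 70%.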
static int vlan_dev_init(struct net_device *dev)
{
	struct net_device *real_dev = vlan_dev_info(dev)->real_dev;
	...
	/* IFF_BROADCAST|IFF_MULTICAST; ??? */
	dev->flags = real_dev->flags & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI);
	dev->iflink = real_dev->ifindex;
	dev->state = (real_dev->state & ((1<<__LINK_STATE_NOCARRIER) |
					 (1<<__LINK_STATE_DORMANT))) |
		     (1<<__LINK_STATE_PRESENT);
	dev->features |= real_dev->features & real_dev->vlan_features;
	...
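On the 3.10.x kernel a VLAN device also has only one queue, so why is there no performance problem there?
The answer is that the 3.10.x kernel sets NETIF_F_LLTX on VLAN devices, so even with a single queue there is no xmit_lock overhead.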
static int vlan_dev_init(struct net_device *dev)
{
	struct net_device *real_dev = vlan_dev_priv(dev)->real_dev;
	...
	/* IFF_BROADCAST|IFF_MULTICAST; ??? */
	dev->flags = real_dev->flags & ~(IFF_UP | IFF_PROMISC | IFF_ALLMULTI |
					 IFF_MASTER | IFF_SLAVE);
	dev->iflink = real_dev->ifindex;
	dev->state = (real_dev->state & ((1<<__LINK_STATE_NOCARRIER) |
					 (1<<__LINK_STATE_DORMANT))) |
		     (1<<__LINK_STATE_PRESENT);
	dev->hw_features = NETIF_F_ALL_CSUM | NETIF_F_SG |
			   NETIF_F_FRAGLIST | NETIF_F_ALL_TSO |
			   NETIF_F_HIGHDMA | NETIF_F_SCTP_CSUM |
			   NETIF_F_ALL_FCOE;
	dev->features |= real_dev->vlan_features | NETIF_F_LLTX;
In general, a network device's features can be inspected with ethtool -k:
# ethtool -k eth1.11
Features for eth1.11:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
rx-vlan-offload: off
tx-vlan-offload: off
ntuple-filters: off
receive-hashing: off
On CentOS 6.5 (2.6.32-431), the feature mask can be read from /sys/class/net/${ethX}/features:
# cat /sys/class/net/eth1.11/features
0x114833
--------------------------
0x114833 = 1 0001 0100 1000 0011 0011 (binary)

flag                 bit   value      set in 0x114833
NETIF_F_SG           0     0x1        yes
NETIF_F_IP_CSUM      1     0x2        yes
NETIF_F_IPV6_CSUM    4     0x10       yes
NETIF_F_HIGHDMA      5     0x20       yes
NETIF_F_GSO          11    0x800      yes
NETIF_F_LLTX         12    0x1000     no
NETIF_F_GRO          14    0x4000     yes
NETIF_F_TSO          16    0x10000    yes
NETIF_F_TSO6         20    0x100000   yes
Bit 12 is clear, so the CentOS 6.5 kernel does not set the NETIF_F_LLTX flag on VLAN devices.
On the 3.10.x kernel, /sys/class/net/${ethX}/features no longer exists, but the kernel supports the ETHTOOL_GFEATURES command (2.6.32-431 does not), and ethtool uses it to fetch a device's features:
//net/core/ethtool.c
int dev_ethtool(struct net *net, struct ifreq *ifr)
{
	...
	case ETHTOOL_GFEATURES:
		rc = ethtool_get_features(dev, useraddr);
		break;
	...
# ./ethtool -k eth1.11 | grep tx-lockless
tx-lockless: on [fixed]
# ./ethtool -k eth1 | grep tx-lockless
tx-lockless: off [fixed]
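This confirms that the 3.10.x kernel does set the NETIF_F_LLTX flag on VLAN devices. The ethtool code that chooses between ETHTOOL_GFEATURES and sysfs is: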
//ethtool-3.5
static struct feature_state *
get_features(struct cmd_context *ctx, const struct feature_defs *defs)
{
	...
	if (defs->n_features) { /// the kernel supports ETHTOOL_GFEATURES
		state->features.cmd = ETHTOOL_GFEATURES;
		state->features.size = FEATURE_BITS_TO_BLOCKS(defs->n_features);
		err = send_ioctl(ctx, &state->features);
		if (err)
			perror("Cannot get device generic features");
		else
			allfail = 0;
	} else {
		/* We should have got VLAN tag offload flags through
		 * ETHTOOL_GFLAGS. However, prior to Linux 2.6.37
		 * they were not exposed in this way - and since VLAN
		 * tag offload was defined and implemented by many
		 * drivers, we shouldn't assume they are off.
		 * Instead, since these feature flag values were
		 * stable, read them from sysfs.
		 */
		char buf[20]; /// read the features from /sys/class/net/%s/features
		if (get_netdev_attr(ctx, "features", buf, sizeof(buf)) > 0)
			state->off_flags |=
				strtoul(buf, NULL, 0) &
				(ETH_FLAG_RXVLAN | ETH_FLAG_TXVLAN);
	}
static int get_netdev_attr(struct cmd_context *ctx, const char *name,
			   char *buf, size_t buf_len)
{
#ifdef TEST_ETHTOOL
	errno = ENOENT;
	return -1;
#else
	char path[40 + IFNAMSIZ];
	ssize_t len;
	int fd;
	len = snprintf(path, sizeof(path), "/sys/class/net/%s/%s",
		       ctx->devname, name);
	assert(len < sizeof(path));
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return fd;
	len = read(fd, buf, buf_len - 1);
	if (len >= 0)
		buf[len] = 0;
	close(fd);
	return len;
#endif
}