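(Source: 333)
1. The sk_buff structure
The sk_buff structure is central to the Linux networking code.
sk_buff, the socket buffer, carries data between the layers of the Linux network subsystem and acts as its "nerve center".
When sending a packet, the kernel's network module must build an sk_buff containing the packet to transmit and pass it to the next layer down. Each layer adds its own protocol header to the sk_buff until it is handed to the network device for transmission. Likewise, when receiving a packet, the network device takes the data from the physical medium, must convert it into an sk_buff, and passes it up. Each layer strips its own protocol header until the data reaches the user.
The sk_buff structure is shown in the figure below (defined in include/linux/skbuff.h):
sk_buff is defined as follows: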
/**
 *  struct sk_buff - socket buffer
 *  @next: Next buffer in list
 *  @prev: Previous buffer in list
 *  @tstamp: Time we arrived
 *  @sk: Socket we are owned by
 *  @dev: Device we arrived on/are leaving by
 *  @cb: Control buffer. Free for use by every layer. Put private vars here
 *  @_skb_refdst: destination entry (with norefcount bit)
 *  @sp: the security path, used for xfrm
 *  @len: Length of actual data
 *  @data_len: Data length
 *  @mac_len: Length of link layer header
 *  @hdr_len: writable header length of cloned skb
 *  @csum: Checksum (must include start/offset pair)
 *  @csum_start: Offset from skb->head where checksumming should start
 *  @csum_offset: Offset from csum_start where checksum should be stored
 *  @priority: Packet queueing priority
 *  @local_df: allow local fragmentation
 *  @cloned: Head may be cloned (check refcnt to be sure)
 *  @ip_summed: Driver fed us an IP checksum
 *  @nohdr: Payload reference only, must not modify header
 *  @nfctinfo: Relationship of this skb to the connection
 *  @pkt_type: Packet class
 *  @fclone: skbuff clone status
 *  @ipvs_property: skbuff is owned by ipvs
 *  @peeked: this packet has been seen already, so stats have been
 *      done for it, don't do them again
 *  @nf_trace: netfilter packet trace flag
 *  @protocol: Packet protocol from driver
 *  @destructor: Destruct function
 *  @nfct: Associated connection, if any
 *  @nfct_reasm: netfilter conntrack re-assembly pointer
 *  @nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 *  @skb_iif: ifindex of device we arrived on
 *  @tc_index: Traffic control index
 *  @tc_verd: traffic control verdict
 *  @rxhash: the packet hash computed on receive
 *  @queue_mapping: Queue mapping for multiqueue devices
 *  @ndisc_nodetype: router type (from link layer)
 *  @ooo_okay: allow the mapping of a socket to a queue to be changed
 *  @l4_rxhash: indicate rxhash is a canonical 4-tuple hash over transport
 *      ports.
 *  @wifi_acked_valid: wifi_acked was set
 *  @wifi_acked: whether frame was acked on wifi or not
 *  @no_fcs: Request NIC to treat last 4 bytes as Ethernet FCS
 *  @dma_cookie: a cookie to one of several possible DMA operations
 *      done by skb DMA functions
 *  @secmark: security marking
 *  @mark: Generic packet mark
 *  @dropcount: total number of sk_receive_queue overflows
 *  @vlan_tci: vlan tag control information
 *  @transport_header: Transport layer header
 *  @network_header: Network layer header
 *  @mac_header: Link layer header
 *  @tail: Tail pointer
 *  @end: End pointer
 *  @head: Head of buffer
 *  @data: Data head pointer
 *  @truesize: Buffer size
 *  @users: User count - see {datagram,tcp}.c
 */
struct sk_buff {
    /* These two members must be first. */
    struct sk_buff      *next;
    struct sk_buff      *prev;

    ktime_t             tstamp;

    struct sock         *sk;
    struct net_device   *dev;

    /*
     * This is the control buffer. It is free to use for every
     * layer. Please put your private variables there. If you
     * want to keep them across layers you have to do a skb_clone()
     * first. This is owned by whoever has the skb queued ATM.
     */
    char                cb[48] __aligned(8);

    unsigned long       _skb_refdst;
#ifdef CONFIG_XFRM
    struct sec_path     *sp;
#endif
    unsigned int        len,
                        data_len;
    __u16               mac_len,
                        hdr_len;
    union {
        __wsum          csum;
        struct {
            __u16       csum_start;
            __u16       csum_offset;
        };
    };
    __u32               priority;
    kmemcheck_bitfield_begin(flags1);
    __u8                local_df:1,
                        cloned:1,
                        ip_summed:2,
                        nohdr:1,
                        nfctinfo:3;
    __u8                pkt_type:3,
                        fclone:2,
                        ipvs_property:1,
                        peeked:1,
                        nf_trace:1;
    kmemcheck_bitfield_end(flags1);
    __be16              protocol;

    void                (*destructor)(struct sk_buff *skb);
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
    struct nf_conntrack *nfct;
#endif
#ifdef NET_SKBUFF_NF_DEFRAG_NEEDED
    struct sk_buff      *nfct_reasm;
#endif
#ifdef CONFIG_BRIDGE_NETFILTER
    struct nf_bridge_info *nf_bridge;
#endif

    int                 skb_iif;

    __u32               rxhash;

    __u16               vlan_tci;

#ifdef CONFIG_NET_SCHED
    __u16               tc_index;   /* traffic control index */
#ifdef CONFIG_NET_CLS_ACT
    __u16               tc_verd;    /* traffic control verdict */
#endif
#endif

    __u16               queue_mapping;
    kmemcheck_bitfield_begin(flags2);
#ifdef CONFIG_IPV6_NDISC_NODETYPE
    __u8                ndisc_nodetype:2;
#endif
    __u8                ooo_okay:1;
    __u8                l4_rxhash:1;
    __u8                wifi_acked_valid:1;
    __u8                wifi_acked:1;
    __u8                no_fcs:1;
    /* 9/11 bit hole (depending on ndisc_nodetype presence) */
    kmemcheck_bitfield_end(flags2);

#ifdef CONFIG_NET_DMA
    dma_cookie_t        dma_cookie;
#endif
#ifdef CONFIG_NETWORK_SECMARK
    __u32               secmark;
#endif
    union {
        __u32           mark;
        __u32           dropcount;
        __u32           avail_size;
    };

    sk_buff_data_t      transport_header;
    sk_buff_data_t      network_header;
    sk_buff_data_t      mac_header;
    /* These elements must be at the end, see alloc_skb() for details. */
    sk_buff_data_t      tail;
    sk_buff_data_t      end;
    unsigned char       *head,
                        *data;
    unsigned int        truesize;
    atomic_t            users;
};
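The main members of sk_buff are as follows:
1.1 Protocol headers of each layer:
- transport_header : the transport-layer header, e.g. a TCP, UDP, ICMP or IGMP header
- network_header : the network-layer header, e.g. an IP, IPv6 or ARP header
- mac_header : the link-layer header.
- sk_buff_data_t is an unsigned char pointer by default; when NET_SKBUFF_DATA_USES_OFFSET is defined it is an unsigned int offset instead: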
#ifdef NET_SKBUFF_DATA_USES_OFFSET
typedef unsigned int sk_buff_data_t;
#else
typedef unsigned char *sk_buff_data_t;
#endif
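1.2 Data buffer pointers: head, data, tail, end
- head : points to the start of the memory allocated for the network data buffer. Once the sk_buff and its data buffer have been allocated, this pointer stays fixed.
- data : points to the start of the valid data for the current protocol layer.
The valid data differs from layer to layer:
a. For the transport layer, it consists of the user data and the transport-layer header.
b. For the network layer, it consists of the user data, the transport-layer header and the network-layer header.
c. For the data link layer, it consists of the user data, the transport-layer header, the network-layer header and the link-layer header.
The data pointer therefore moves as ownership of the sk_buff passes from one protocol layer to the next.
- tail : points to the end of the valid data for the current protocol layer. It is the counterpart of data.
- end : points to the end of the allocated network data buffer. It is the counterpart of head and, like head, stays fixed once the sk_buff has been allocated.
The relationship between head, data, tail and end is shown in the figure below: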
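The same relationship can also be written down as code. The following is a minimal user-space sketch, not kernel code: struct mini_skb and the mini_* helpers are hypothetical names invented for illustration, and the model assumes a purely linear buffer. mini_headroom() and mini_tailroom() mirror what the kernel helpers skb_headroom() and skb_tailroom() compute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical user-space stand-in for the buffer pointers of sk_buff. */
struct mini_skb {
    unsigned char *head;  /* start of the allocated buffer, fixed */
    unsigned char *data;  /* start of the current layer's valid data */
    unsigned char *tail;  /* end of the current layer's valid data */
    unsigned char *end;   /* end of the allocated buffer, fixed */
    unsigned int len;     /* valid bytes (tail - data for a linear buffer) */
};

/* Allocate an empty buffer: no headroom, `size` bytes of tailroom. */
static int mini_skb_init(struct mini_skb *skb, size_t size)
{
    assert(size > 0);
    skb->head = malloc(size);
    if (!skb->head)
        return -1;
    skb->data = skb->head;
    skb->tail = skb->head;
    skb->end = skb->head + size;
    skb->len = 0;
    return 0;
}

/* Free space in front of the valid data. */
static size_t mini_headroom(const struct mini_skb *skb)
{
    return (size_t)(skb->data - skb->head);
}

/* Free space behind the valid data. */
static size_t mini_tailroom(const struct mini_skb *skb)
{
    return (size_t)(skb->end - skb->tail);
}

/* Invariant: head <= data <= tail <= end, and len == tail - data. */
static int mini_skb_valid(const struct mini_skb *skb)
{
    return skb->head <= skb->data && skb->data <= skb->tail &&
           skb->tail <= skb->end &&
           skb->len == (unsigned int)(skb->tail - skb->data);
}
```

Every pointer operation described in section 2.3 keeps this invariant intact; only data, tail and len ever move.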
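1.3 Length information: len, data_len, truesize
- len : the length of the packet's valid data, including protocol headers and payload.
- data_len : the part of len that is stored outside the linear area (in paged fragments or a fragment list). It is 0 for a purely linear buffer.
- truesize : the total memory consumed by the buffer, i.e. the data buffer plus the sk_buff structure itself, not just sizeof(struct sk_buff). __alloc_skb() below sets it with SKB_TRUESIZE(size).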
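How len and data_len relate can be shown with a small user-space sketch. struct mini_lengths and both helpers are hypothetical names. They mirror what the kernel helpers skb_headlen() and skb_is_nonlinear() compute from the same two fields.

```c
#include <assert.h>

/* Hypothetical stand-in holding only the two length fields. */
struct mini_lengths {
    unsigned int len;       /* all valid bytes: linear + fragments */
    unsigned int data_len;  /* bytes held in fragments only */
};

/* Bytes reachable directly through the data pointer (the linear part). */
static unsigned int mini_headlen(const struct mini_lengths *l)
{
    assert(l->data_len <= l->len);
    return l->len - l->data_len;
}

/* A buffer is non-linear as soon as any data lives in fragments. */
static int mini_is_nonlinear(const struct mini_lengths *l)
{
    return l->data_len != 0;
}
```

For example, a 1500-byte packet with 1000 bytes in fragments has a 500-byte linear part.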
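1.4 Packet type
- pkt_type : the packet class. The driver is responsible for setting it to one of:
PACKET_HOST: the packet is addressed to this host.
PACKET_OTHERHOST: the packet is addressed to some other host.
PACKET_BROADCAST: a broadcast packet.
PACKET_MULTICAST: a multicast packet.
An Ethernet driver does not need to set pkt_type explicitly, because eth_type_trans() does that for it.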
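The classification itself depends only on the destination MAC address. Below is a user-space sketch of that decision, assuming 6-byte Ethernet addresses. The MINI_* constants and mini_classify() are hypothetical names, and the real eth_type_trans() does more than this (it also determines the protocol and pulls the Ethernet header).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical mirror of the four packet classes described above. */
enum mini_pkt_type {
    MINI_PACKET_HOST,
    MINI_PACKET_BROADCAST,
    MINI_PACKET_MULTICAST,
    MINI_PACKET_OTHERHOST
};

/* Classify a frame by comparing its destination MAC with the device MAC. */
static enum mini_pkt_type mini_classify(const unsigned char dest[6],
                                        const unsigned char dev_addr[6])
{
    static const unsigned char bcast[6] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff
    };

    assert(dest != NULL && dev_addr != NULL);

    /* The lowest bit of the first byte marks a group address. */
    if (dest[0] & 0x01) {
        if (memcmp(dest, bcast, 6) == 0)
            return MINI_PACKET_BROADCAST;
        return MINI_PACKET_MULTICAST;
    }
    /* Unicast: either ours or somebody else's. */
    if (memcmp(dest, dev_addr, 6) != 0)
        return MINI_PACKET_OTHERHOST;
    return MINI_PACKET_HOST;
}
```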
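2. Socket buffer operations
2.1 Allocating a socket buffer
struct sk_buff *alloc_skb(unsigned int size, gfp_t priority);
alloc_skb() allocates a socket buffer (the sk_buff itself) together with a data buffer.
- size : the size of the data buffer
- priority : the memory allocation priority (a GFP mask such as GFP_KERNEL or GFP_ATOMIC)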
static inline struct sk_buff *alloc_skb(unsigned int size,
                                        gfp_t priority)
{
    return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
}
/**
 *  __alloc_skb - allocate a network buffer
 *  @size: size to allocate
 *  @gfp_mask: allocation mask
 *  @fclone: allocate from fclone cache instead of head cache
 *      and allocate a cloned (child) skb
 *  @node: numa node to allocate memory on
 *
 *  Allocate a new &sk_buff. The returned buffer has no headroom and a
 *  tail room of size bytes. The object has a reference count of one.
 *  The return is the buffer. On a failure the return is %NULL.
 *
 *  Buffers may only be allocated from interrupts using a @gfp_mask of
 *  %GFP_ATOMIC.
 */
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
                            int fclone, int node)
{
    struct kmem_cache *cache;
    struct skb_shared_info *shinfo;
    struct sk_buff *skb;
    u8 *data;

    cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

    /* Get the HEAD */
    skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
    if (!skb)
        goto out;
    prefetchw(skb);

    /* We do our best to align skb_shared_info on a separate cache
     * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
     * aligned memory blocks, unless SLUB/SLAB debug is enabled.
     * Both skb->head and skb_shared_info are cache line aligned.
     */
    size = SKB_DATA_ALIGN(size);
    size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
    data = kmalloc_node_track_caller(size, gfp_mask, node);
    if (unlikely(ZERO_OR_NULL_PTR(data)))
        goto nodata;
    /* kmalloc(size) might give us more room than requested.
     * Put skb_shared_info exactly at the end of allocated zone,
     * to allow max possible filling before reallocation.
     */
    size = SKB_WITH_OVERHEAD(ksize(data));
    prefetchw(data + size);

    /*
     * Only clear those fields we need to clear, not those that we will
     * actually initialise below. Hence, don't put any more fields after
     * the tail pointer in struct sk_buff!
     */
    memset(skb, 0, offsetof(struct sk_buff, tail));
    /* Account for allocated memory : skb + skb->head */
    skb->truesize = SKB_TRUESIZE(size);
    atomic_set(&skb->users, 1);
    skb->head = data;
    skb->data = data;
    skb_reset_tail_pointer(skb);
    skb->end = skb->tail + size;
#ifdef NET_SKBUFF_DATA_USES_OFFSET
    skb->mac_header = ~0U;
#endif

    /* make sure we initialize shinfo sequentially */
    shinfo = skb_shinfo(skb);
    memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
    atomic_set(&shinfo->dataref, 1);
    kmemcheck_annotate_variable(shinfo->destructor_arg);

    if (fclone) {
        struct sk_buff *child = skb + 1;
        atomic_t *fclone_ref = (atomic_t *) (child + 1);

        kmemcheck_annotate_bitfield(child, flags1);
        kmemcheck_annotate_bitfield(child, flags2);
        skb->fclone = SKB_FCLONE_ORIG;
        atomic_set(fclone_ref, 1);

        child->fclone = SKB_FCLONE_UNAVAILABLE;
    }
out:
    return skb;
nodata:
    kmem_cache_free(cache, skb);
    skb = NULL;
    goto out;
}
EXPORT_SYMBOL(__alloc_skb);
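struct sk_buff *dev_alloc_skb(unsigned int length);
dev_alloc_skb() calls alloc_skb() above with GFP_ATOMIC priority
and reserves headroom between skb->head and skb->data (16 bytes according to the original description; the code below reserves NET_SKB_PAD bytes).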
/**
 *  dev_alloc_skb - allocate an skbuff for receiving
 *  @length: length to allocate
 *
 *  Allocate a new &sk_buff and assign it a usage count of one. The
 *  buffer has unspecified headroom built in. Users should allocate
 *  the headroom they think they need without accounting for the
 *  built in space. The built in space is used for optimisations.
 *
 *  %NULL is returned if there is no free memory. Although this function
 *  allocates memory it can be called from an interrupt.
 */
struct sk_buff *dev_alloc_skb(unsigned int length)
{
    /*
     * There is more code here than it seems:
     * __dev_alloc_skb is an inline
     */
    return __dev_alloc_skb(length, GFP_ATOMIC);
}
EXPORT_SYMBOL(dev_alloc_skb);

/**
 *  __dev_alloc_skb - allocate an skbuff for receiving
 *  @length: length to allocate
 *  @gfp_mask: get_free_pages mask, passed to alloc_skb
 *
 *  Allocate a new &sk_buff and assign it a usage count of one. The
 *  buffer has unspecified headroom built in. Users should allocate
 *  the headroom they think they need without accounting for the
 *  built in space. The built in space is used for optimisations.
 *
 *  %NULL is returned if there is no free memory.
 */
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
                                              gfp_t gfp_mask)
{
    struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
    if (likely(skb))
        skb_reserve(skb, NET_SKB_PAD);
    return skb;
}
2.2 Freeing a socket buffer
void kfree_skb(struct sk_buff *skb);
/**
 *  kfree_skb - free an sk_buff
 *  @skb: buffer to free
 *
 *  Drop a reference to the buffer and free it if the usage count has
 *  hit zero.
 */
void kfree_skb(struct sk_buff *skb)
{
    if (unlikely(!skb))
        return;
    if (likely(atomic_read(&skb->users) == 1))
        smp_rmb();
    else if (likely(!atomic_dec_and_test(&skb->users)))
        return;
    trace_kfree_skb(skb, __builtin_return_address(0));
    __kfree_skb(skb);
}
EXPORT_SYMBOL(kfree_skb);
- kfree_skb() is meant for use inside the kernel core. Network device drivers should use dev_kfree_skb(), dev_kfree_skb_irq() or dev_kfree_skb_any() instead.
void dev_kfree_skb(struct sk_buff *skb);
- dev_kfree_skb() is for use outside interrupt context.
#define dev_kfree_skb(a) consume_skb(a)
/**
 *  consume_skb - free an skbuff
 *  @skb: buffer to free
 *
 *  Drop a ref to the buffer and free it if the usage count has hit zero
 *  Functions identically to kfree_skb, but kfree_skb assumes that the frame
 *  is being dropped after a failure and notes that
 */
void consume_skb(struct sk_buff *skb)
{
    if (unlikely(!skb))
        return;
    if (likely(atomic_read(&skb->users) == 1))
        smp_rmb();
    else if (likely(!atomic_dec_and_test(&skb->users)))
        return;
    trace_consume_skb(skb);
    __kfree_skb(skb);
}
EXPORT_SYMBOL(consume_skb);
void dev_kfree_skb_irq(struct sk_buff *skb);
- dev_kfree_skb_irq() is for use in interrupt context. It does not free the buffer itself but queues it on the per-CPU completion queue and raises NET_TX_SOFTIRQ.
void dev_kfree_skb_irq(struct sk_buff *skb)
{
    if (atomic_dec_and_test(&skb->users)) {
        struct softnet_data *sd;
        unsigned long flags;

        local_irq_save(flags);
        sd = &__get_cpu_var(softnet_data);
        skb->next = sd->completion_queue;
        sd->completion_queue = skb;
        raise_softirq_irqoff(NET_TX_SOFTIRQ);
        local_irq_restore(flags);
    }
}
EXPORT_SYMBOL(dev_kfree_skb_irq);
void dev_kfree_skb_any(struct sk_buff *skb);
- dev_kfree_skb_any() can be used in both interrupt and non-interrupt context.
void dev_kfree_skb_any(struct sk_buff *skb)
{
    if (in_irq() || irqs_disabled())
        dev_kfree_skb_irq(skb);
    else
        dev_kfree_skb(skb);
}
EXPORT_SYMBOL(dev_kfree_skb_any);
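All of these free functions share one idea: the buffer is released only when the last reference is dropped. The user-space sketch below illustrates that reference counting under simplifying assumptions. It uses a plain int where the kernel uses an atomic_t, and struct mini_skb, mini_alloc(), mini_get() and mini_free() are hypothetical names (mini_get() plays the role of the kernel's skb_get()).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in: a reference count plus the data buffer. */
struct mini_skb {
    int users;            /* plain int here; the kernel uses atomic_t */
    unsigned char *head;
};

/* Allocate descriptor and data buffer; the caller holds one reference. */
static struct mini_skb *mini_alloc(size_t size)
{
    struct mini_skb *skb = malloc(sizeof(*skb));

    assert(size > 0);
    if (!skb)
        return NULL;
    skb->head = malloc(size);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->users = 1;
    return skb;
}

/* Take an additional reference. */
static struct mini_skb *mini_get(struct mini_skb *skb)
{
    skb->users++;
    return skb;
}

/* Drop one reference. Returns 1 if the memory was released, else 0. */
static int mini_free(struct mini_skb *skb)
{
    if (!skb)
        return 0;
    if (--skb->users > 0)
        return 0;
    free(skb->head);
    free(skb);
    return 1;
}
```

As in kfree_skb(), passing NULL is harmless, and a buffer with two holders survives the first free.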
2.3 Moving the pointers
The pointer operations on a Linux socket buffer are put, push, pull and reserve.
2.3.1 The put operation
unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
skb_put() moves the tail pointer down (toward end) by len bytes, increases skb->len by the same amount, and returns the value tail had before the move, i.e. a pointer to the first byte of the newly added area.
It is used to append data at the tail of the buffer.
/**
 *  skb_put - add data to a buffer
 *  @skb: buffer to use
 *  @len: amount of data to add
 *
 *  This function extends the used data area of the buffer. If this would
 *  exceed the total buffer size the kernel will panic. A pointer to the
 *  first byte of the extra data is returned.
 */
unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
{
    unsigned char *tmp = skb_tail_pointer(skb);
    SKB_LINEAR_ASSERT(skb);
    skb->tail += len;
    skb->len  += len;
    if (unlikely(skb->tail > skb->end))
        skb_over_panic(skb, len, __builtin_return_address(0));
    return tmp;
}
EXPORT_SYMBOL(skb_put);

static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
{
    return skb->tail;
}
unsigned char *__skb_put(struct sk_buff *skb, unsigned int len);
__skb_put() differs from skb_put() in that skb_put() checks whether the new tail runs past the end of the buffer (and panics if it does), whereas __skb_put() performs no such check.
static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
{
    unsigned char *tmp = skb_tail_pointer(skb);
    SKB_LINEAR_ASSERT(skb);
    skb->tail += len;
    skb->len  += len;
    return tmp;
}
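The put operation can be tried out in user space with a small model. This is a sketch under stated assumptions: struct mini_skb, mini_init() and mini_put() are hypothetical names, the buffer is linear, and an assert() stands in for skb_over_panic(). Unlike the kernel code, the bound is checked before the pointer moves, because forming an out-of-range pointer is undefined in portable C.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical user-space stand-in for the buffer pointers of sk_buff. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

/* Wrap a caller-provided buffer: empty, no headroom. */
static void mini_init(struct mini_skb *skb, unsigned char *buf,
                      unsigned int size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Extend the valid data at the tail; return where the new bytes start. */
static unsigned char *mini_put(struct mini_skb *skb, unsigned int len)
{
    unsigned char *old_tail = skb->tail;

    /* Stand-in for skb_over_panic(): refuse to run past end. */
    assert(len <= (unsigned int)(skb->end - skb->tail));
    skb->tail += len;
    skb->len += len;
    return old_tail;
}
```

A typical use is memcpy(mini_put(&skb, n), src, n): the returned pointer is exactly where the appended bytes belong.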
2.3.2 The push operation
unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
skb_push() moves the data pointer up (toward head) by len bytes, i.e. it adds data at the start of the buffer, so skb->len is increased as well.
/**
 *  skb_push - add data to the start of a buffer
 *  @skb: buffer to use
 *  @len: amount of data to add
 *
 *  This function extends the used data area of the buffer at the buffer
 *  start. If this would exceed the total buffer headroom the kernel will
 *  panic. A pointer to the first byte of the extra data is returned.
 */
unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
{
    skb->data -= len;
    skb->len  += len;
    if (unlikely(skb->data<skb->head))
        skb_under_panic(skb, len, __builtin_return_address(0));
    return skb->data;
}
EXPORT_SYMBOL(skb_push);
unsigned char *__skb_push(struct sk_buff *skb, unsigned int len);
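static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
{
    skb->data -= len;
    skb->len  += len;
    return skb->data;
}
__skb_push() differs from skb_push() in the same way that __skb_put() differs from skb_put(): it skips the bounds check.
The push operation adds space for packet data at the head of the buffer, while the put operation adds space at the tail.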
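A user-space sketch of push follows, assuming the buffer was set up with some headroom beforehand. struct mini_skb, mini_init() and mini_push() are hypothetical names, and an assert() stands in for skb_under_panic().

```c
#include <assert.h>

/* Hypothetical user-space stand-in for the buffer pointers of sk_buff. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

/* Wrap a buffer and leave `headroom` free bytes in front of the data. */
static void mini_init(struct mini_skb *skb, unsigned char *buf,
                      unsigned int size, unsigned int headroom)
{
    assert(headroom <= size);
    skb->head = buf;
    skb->data = skb->tail = buf + headroom;
    skb->end = buf + size;
    skb->len = 0;
}

/* Extend the valid data at the front; return the new start of data. */
static unsigned char *mini_push(struct mini_skb *skb, unsigned int len)
{
    /* Stand-in for skb_under_panic(): refuse to move in front of head. */
    assert(len <= (unsigned int)(skb->data - skb->head));
    skb->data -= len;
    skb->len += len;
    return skb->data;
}
```

This is how a layer on the transmit path prepends its header: the returned pointer is where that header is written.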
2.3.3 The pull operation
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
skb_pull() moves the data pointer down (toward tail) by len bytes and reduces skb->len accordingly. It is the inverse of skb_push().
It is mainly used when a lower-layer protocol hands a packet to the layer above, so that data points at the upper layer's header.
/**
 *  skb_pull - remove data from the start of a buffer
 *  @skb: buffer to use
 *  @len: amount of data to remove
 *
 *  This function removes data from the start of a buffer, returning
 *  the memory to the headroom. A pointer to the next data in the buffer
 *  is returned. Once the data has been pulled future pushes will overwrite
 *  the old data.
 */
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
{
    return skb_pull_inline(skb, len);
}
EXPORT_SYMBOL(skb_pull);

static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
{
    return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
}

static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
    skb->len -= len;
    BUG_ON(skb->len < skb->data_len);
    return skb->data += len;
}
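The behaviour of pull, including the NULL return when more is requested than the buffer holds, can be sketched in user space. struct mini_skb, mini_init() and mini_pull() are hypothetical names, and the model tracks only data and len for a linear buffer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in: only the fields pull needs. */
struct mini_skb {
    unsigned char *data;  /* start of the current layer's valid data */
    unsigned int len;     /* number of valid bytes */
};

/* Treat `len` bytes at `buf` as a received packet. */
static void mini_init(struct mini_skb *skb, unsigned char *buf,
                      unsigned int len)
{
    assert(buf != NULL);
    skb->data = buf;
    skb->len = len;
}

/* Strip `len` bytes from the front; NULL (and no change) if too short. */
static unsigned char *mini_pull(struct mini_skb *skb, unsigned int len)
{
    if (len > skb->len)
        return NULL;
    skb->len -= len;
    skb->data += len;
    return skb->data;
}
```

The stripped bytes are not erased; they simply become headroom again, which is why mac_header can still point at them later.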
2.3.4 The reserve operation
void skb_reserve(struct sk_buff *skb, int len);
skb_reserve() moves the data pointer and the tail pointer down together.
It is used to set aside len bytes of headroom at the front of an empty buffer.
/**
 *  skb_reserve - adjust headroom
 *  @skb: buffer to alter
 *  @len: bytes to move
 *
 *  Increase the headroom of an empty &sk_buff by reducing the tail
 *  room. This is only allowed for an empty buffer.
 */
static inline void skb_reserve(struct sk_buff *skb, int len)
{
    skb->data += len;
    skb->tail += len;
}
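A user-space sketch of reserve makes the trade-off visible: headroom grows by exactly the amount that tailroom shrinks, and len stays 0. struct mini_skb, mini_init() and mini_reserve() are hypothetical names. The asserts encode the "empty buffer only" rule, which the kernel function itself does not check.

```c
#include <assert.h>

/* Hypothetical user-space stand-in for the buffer pointers of sk_buff. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned int len;
};

/* Wrap a caller-provided buffer: empty, no headroom. */
static void mini_init(struct mini_skb *skb, unsigned char *buf,
                      unsigned int size)
{
    skb->head = skb->data = skb->tail = buf;
    skb->end = buf + size;
    skb->len = 0;
}

/* Move data and tail together to open up `len` bytes of headroom. */
static void mini_reserve(struct mini_skb *skb, unsigned int len)
{
    assert(skb->len == 0);  /* only meaningful on an empty buffer */
    assert(len <= (unsigned int)(skb->end - skb->tail));
    skb->data += len;
    skb->tail += len;
}
```

Reserving first and putting afterwards is the pattern __dev_alloc_skb() sets up for drivers with NET_SKB_PAD.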
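3. Example
The following walks through how Linux receives a UDP packet, to illustrate the operations on sk_buff.
Most of this work is done inside the kernel. A driver only has to handle the part that concerns the data link layer.
Suppose the NIC receives a UDP packet. Linux processes it as follows:
3.1 After the NIC receives a UDP packet, the driver creates an sk_buff and a data buffer, copies all of the received data into the space that data points to, and points skb->mac_header at data.
At this point the valid data begins with the Ethernet header, i.e. the link-layer header.
Example code:
// Allocate a new socket buffer and data buffer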
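The original listing is not reproduced here, so the following is only a user-space sketch of step 3.1, not the author's driver code. It models the sequence a driver would perform with dev_alloc_skb(), skb_reserve() and skb_put(): allocate, reserve some headroom, copy the frame in, and record mac_header. struct mini_skb, mini_rx(), mini_free() and the MINI_PAD value are all hypothetical. A real Ethernet driver would typically go on to call eth_type_trans() and hand the buffer to the stack with netif_rx().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MINI_PAD 2  /* hypothetical headroom, chosen only for the sketch */

/* Hypothetical user-space stand-in for the fields used in step 3.1. */
struct mini_skb {
    unsigned char *head, *data, *tail, *end;
    unsigned char *mac_header;
    unsigned int len;
};

/* Build a buffer holding one received frame. */
static struct mini_skb *mini_rx(const unsigned char *frame,
                                unsigned int frame_len)
{
    struct mini_skb *skb;

    assert(frame != NULL && frame_len > 0);
    skb = malloc(sizeof(*skb));
    if (!skb)
        return NULL;
    /* Allocate the data buffer with room for the headroom. */
    skb->head = malloc(frame_len + MINI_PAD);
    if (!skb->head) {
        free(skb);
        return NULL;
    }
    skb->end = skb->head + frame_len + MINI_PAD;
    /* "reserve": data and tail both skip the headroom. */
    skb->data = skb->tail = skb->head + MINI_PAD;
    /* "put": copy the frame in and extend tail and len. */
    memcpy(skb->tail, frame, frame_len);
    skb->tail += frame_len;
    skb->len = frame_len;
    /* The valid data starts with the Ethernet header. */
    skb->mac_header = skb->data;
    return skb;
}

static void mini_free(struct mini_skb *skb)
{
    if (skb) {
        free(skb->head);
        free(skb);
    }
}
```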
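3.2 The data link layer strips the Ethernet header by calling skb_pull() and passes the packet up to the IP network layer.
During stripping, data moves down by the length of an Ethernet header, sizeof(struct ethhdr), and len is reduced by sizeof(struct ethhdr).
The valid data now begins with the IP header: skb->network_header points to data, i.e. the IP header, while skb->mac_header still points to the Ethernet header, i.e. the link-layer header.
This is shown in the figure below:
3.3 The network layer strips the IP header with skb_pull() and passes the packet up to the UDP transport layer.
During stripping, data moves down by the length of the IP header, which is sizeof(struct iphdr) when the header carries no options, and len is reduced by the same amount.
The valid data now begins with the UDP header: skb->transport_header points to data, i.e. the UDP header,
while skb->network_header still points to the IP header and skb->mac_header still points to the link-layer header.
This is shown in the figure below:
3.4 When the application calls recv() to receive the data, copying into the application buffer starts at skb->data + sizeof(struct udphdr).
So the UDP header is never actually stripped.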
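Steps 3.1 to 3.4 can be replayed in user space with fixed header sizes. The sketch below is an illustration under assumptions, not the kernel's receive path: the header lengths are hard-coded (14-byte Ethernet, 20-byte IP without options, 8-byte UDP), no header field is parsed or validated, and struct mini_skb, mini_pull() and mini_udp_receive() are hypothetical names.

```c
#include <assert.h>
#include <string.h>

enum { MINI_ETH_HLEN = 14, MINI_IP_HLEN = 20, MINI_UDP_HLEN = 8 };

/* Hypothetical stand-in for the fields touched on the receive path. */
struct mini_skb {
    unsigned char *data;
    unsigned int len;
    unsigned char *mac_header, *network_header, *transport_header;
};

/* Strip `len` bytes from the front; NULL if the packet is too short. */
static unsigned char *mini_pull(struct mini_skb *skb, unsigned int len)
{
    if (len > skb->len)
        return NULL;
    skb->len -= len;
    skb->data += len;
    return skb->data;
}

/* Returns the number of payload bytes copied to `user`, or -1. */
static int mini_udp_receive(struct mini_skb *skb, unsigned char *frame,
                            unsigned int frame_len,
                            unsigned char *user, unsigned int user_len)
{
    unsigned int n;

    assert(frame != NULL && user != NULL);
    /* 3.1: valid data starts at the Ethernet header. */
    skb->data = frame;
    skb->len = frame_len;
    skb->mac_header = skb->data;
    /* 3.2: strip the Ethernet header; data now points at the IP header. */
    if (!mini_pull(skb, MINI_ETH_HLEN))
        return -1;
    skb->network_header = skb->data;
    /* 3.3: strip the IP header; data now points at the UDP header. */
    if (!mini_pull(skb, MINI_IP_HLEN))
        return -1;
    skb->transport_header = skb->data;
    /* 3.4: copy from data + UDP header length; the header is not pulled. */
    if (skb->len < MINI_UDP_HLEN)
        return -1;
    n = skb->len - MINI_UDP_HLEN;
    if (n > user_len)
        n = user_len;
    memcpy(user, skb->data + MINI_UDP_HLEN, n);
    return (int)n;
}
```

After the call, data still equals transport_header and len still includes the 8 UDP header bytes, which matches the observation that the UDP header is never stripped.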