2014/12/08

sk_buff: definition and operations (network packets)

(Source: 333)

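1. The sk_buff structure
sk_buff, the socket buffer, is one of the most important structures in the Linux network subsystem. It carries data between the layers of the stack and acts as the "nerve center" of the subsystem.
When a packet is sent, the kernel's network code must build an sk_buff that contains the data to transmit and pass it to the next layer down. Each layer adds its own protocol header to the sk_buff until it is handed to the network device for transmission. Likewise, when a packet is received, the network device takes the data from the physical medium, converts it into an sk_buff, and passes it up the stack. Each layer strips its own protocol header until the payload is delivered to the user.
The sk_buff structure is defined in include/linux/skbuff.h.
[Figure: the sk_buff structure]

sk_buff is defined as follows: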

/**
 *	struct sk_buff - socket buffer
 *	@next: Next buffer in list
 *	@prev: Previous buffer in list
 *	@tstamp: Time we arrived
 *	@sk: Socket we are owned by
 *	@dev: Device we arrived on/are leaving by
 *	@cb: Control buffer. Free for use by every layer. Put private vars here
 *	@_skb_refdst: destination entry (with norefcount bit)
 *	@sp: the security path, used for xfrm
 *	@len: Length of actual data
 *	@data_len: Data length
 *	@mac_len: Length of link layer header
 *	@hdr_len: writable header length of cloned skb
 *	@csum: Checksum (must include start/offset pair)
 *	@csum_start: Offset from skb->head where checksumming should start
 *	@csum_offset: Offset from csum_start where checksum should be stored
 *	@priority: Packet queueing priority
 *	@local_df: allow local fragmentation
 *	@cloned: Head may be cloned (check refcnt to be sure)
 *	@ip_summed: Driver fed us an IP checksum
 *	@nohdr: Payload reference only, must not modify header
 *	@nfctinfo: Relationship of this skb to the connection
 *	@pkt_type: Packet class
 *	@fclone: skbuff clone status
 *	@ipvs_property: skbuff is owned by ipvs
 *	@peeked: this packet has been seen already, so stats have been
 *		done for it, don't do them again
 *	@nf_trace: netfilter packet trace flag
 *	@protocol: Packet protocol from driver
 *	@destructor: Destruct function
 *	@nfct: Associated connection, if any
 *	@nfct_reasm: netfilter conntrack re-assembly pointer
 *	@nf_bridge: Saved data about a bridged frame - see br_netfilter.c
 *	@skb_iif: ifindex of device we arrived on
 *	@tc_index: Traffic control index
 *	@tc_verd: traffic control verdict
 *	@rxhash: the packet hash computed on receive
 *	@queue_mapping: Queue mapping for multiqueue devices
 *	@ndisc_nodetype: router type (from link layer)
 *	@ooo_okay: allow the mapping of a socket to a queue to be changed
 *	@l4_rxhash: indicate rxhash is a canonical 4-tuple hash over transport
 *		ports.
 *	@wifi_acked_valid: wifi_acked was set
 *	@wifi_acked: whether frame was acked on wifi or not
 *	@no_fcs:  Request NIC to treat last 4 bytes as Ethernet FCS
 *	@dma_cookie: a cookie to one of several possible DMA operations
 *		done by skb DMA functions
 *	@secmark: security marking
 *	@mark: Generic packet mark
 *	@dropcount: total number of sk_receive_queue overflows
 *	@vlan_tci: vlan tag control information
 *	@transport_header: Transport layer header
 *	@network_header: Network layer header
 *	@mac_header: Link layer header
 *	@tail: Tail pointer
 *	@end: End pointer
 *	@head: Head of buffer
 *	@data: Data head pointer
 *	@truesize: Buffer size
 *	@users: User count - see {datagram,tcp}.c
 */
struct sk_buff {
	/* These two members must be first. */
	struct sk_buff		*next;
	struct sk_buff		*prev;

	ktime_t			tstamp;

	struct sock		*sk;
	struct net_device	*dev;

	/*
	 * This is the control buffer. It is free to use for every
	 * layer. Please put your private variables there. If you
	 * want to keep them across layers you have to do a skb_clone()
	 * first. This is owned by whoever has the skb queued ATM.
	 */
	char			cb[48] __aligned(8);

	unsigned long		_skb_refdst;
#ifdef CONFIG_XFRM
	struct	sec_path	*sp;
#endif
	unsigned int		len,
				data_len;
	__u16			mac_len,
				hdr_len;
	union {
		__wsum		csum;
		struct {
			__u16	csum_start;
			__u16	csum_offset;
		};
	};
	__u32			priority;
	kmemcheck_bitfield_begin(flags1);
	__u8			local_df:1,
				cloned:1,
				ip_summed:2,
				nohdr:1,
				nfctinfo:3;
	__u8			pkt_type:3,
				fclone:2,
				ipvs_property:1,
				peeked:1,
				nf_trace:1;
	kmemcheck_bitfield_end(flags1);
	__be16			protocol;

	void			(*destructor)(struct sk_buff *skb);
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
	struct nf_conntrack	*nfct;
#endif
#ifdef NET_SKBUFF_NF_DEFRAG_NEEDED
	struct sk_buff		*nfct_reasm;
#endif
#ifdef CONFIG_BRIDGE_NETFILTER
	struct nf_bridge_info	*nf_bridge;
#endif

	int			skb_iif;

	__u32			rxhash;

	__u16			vlan_tci;

#ifdef CONFIG_NET_SCHED
	__u16			tc_index;	/* traffic control index */
#ifdef CONFIG_NET_CLS_ACT
	__u16			tc_verd;	/* traffic control verdict */
#endif
#endif

	__u16			queue_mapping;
	kmemcheck_bitfield_begin(flags2);
#ifdef CONFIG_IPV6_NDISC_NODETYPE
	__u8			ndisc_nodetype:2;
#endif
	__u8			ooo_okay:1;
	__u8			l4_rxhash:1;
	__u8			wifi_acked_valid:1;
	__u8			wifi_acked:1;
	__u8			no_fcs:1;
	/* 9/11 bit hole (depending on ndisc_nodetype presence) */
	kmemcheck_bitfield_end(flags2);

#ifdef CONFIG_NET_DMA
	dma_cookie_t		dma_cookie;
#endif
#ifdef CONFIG_NETWORK_SECMARK
	__u32			secmark;
#endif
	union {
		__u32		mark;
		__u32		dropcount;
		__u32		avail_size;
	};

	sk_buff_data_t		transport_header;
	sk_buff_data_t		network_header;
	sk_buff_data_t		mac_header;
	/* These elements must be at the end, see alloc_skb() for details.  */
	sk_buff_data_t		tail;
	sk_buff_data_t		end;
	unsigned char		*head,
				*data;
	unsigned int		truesize;
	atomic_t		users;
};

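The main members of sk_buff are:
1.1 Protocol headers of each layer:
- transport_header: the transport layer header, such as a TCP, UDP, ICMP, or IGMP header
- network_header: the network layer header, such as an IP, IPv6, or ARP header
- mac_header: the link layer header
- sk_buff_data_t is either an unsigned char pointer or, when NET_SKBUFF_DATA_USES_OFFSET is defined, an unsigned int offset from skb->head: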

#ifdef NET_SKBUFF_DATA_USES_OFFSET
typedef unsigned int sk_buff_data_t;
#else
typedef unsigned char *sk_buff_data_t;
#endif
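
As a rough illustration of what the two typedef variants mean, here is a minimal userspace sketch. The names model_skb, model_data_t, and MODEL_USES_OFFSET are made up for this example and are not kernel identifiers. The sketch assumes that, in offset mode, a field such as tail stores a byte count from head that helper functions turn back into an address.

```c
#include <assert.h>

/* Hypothetical switch standing in for NET_SKBUFF_DATA_USES_OFFSET. */
#define MODEL_USES_OFFSET 1

#if MODEL_USES_OFFSET
typedef unsigned int model_data_t;	/* offset from head */
#else
typedef unsigned char *model_data_t;	/* direct pointer */
#endif

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct model_skb {
	unsigned char *head;
	unsigned char *data;
	model_data_t tail;
};

/* Resolve tail to a real address in either representation. */
static unsigned char *model_tail_pointer(const struct model_skb *skb)
{
#if MODEL_USES_OFFSET
	return skb->head + skb->tail;
#else
	return skb->tail;
#endif
}

/* Make tail refer to the same byte as data (an empty buffer). */
static void model_reset_tail_pointer(struct model_skb *skb)
{
	assert(skb->data >= skb->head);
#if MODEL_USES_OFFSET
	skb->tail = (model_data_t)(skb->data - skb->head);
#else
	skb->tail = skb->data;
#endif
}
```

Code that always goes through helpers of this kind, rather than reading the field directly, keeps working whichever variant is compiled in.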

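1.2 Data buffer pointers: head, data, tail, end
- head: points to the start of the memory allocated for the packet data buffer. Once the sk_buff and its data buffer have been allocated, this pointer stays fixed.
- data: points to the start of the valid data for the protocol layer that currently owns the buffer.
What counts as valid data differs from layer to layer:
a. For the transport layer, valid data is the user data plus the transport header.
b. For the network layer, valid data is the user data, the transport header, and the network header.
c. For the data link layer, valid data is the user data, the transport header, the network header, and the link layer header.
The data pointer therefore moves as the sk_buff passes from one protocol layer to another.
- tail: points to the end of the valid data for the current protocol layer. It pairs with data.
- end: points to the end of the allocated data buffer. It pairs with head. Like head, end stays fixed once the sk_buff has been allocated.
The four pointers always satisfy head <= data <= tail <= end.
[Figure: relationship between head, data, tail, and end]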
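
The relationship between the four pointers can be sketched in userspace C. The mini_skb type and the mini_* helpers below are hypothetical names for this illustration, not kernel code. They only assume the ordering head <= data <= tail <= end described above.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the four buffer pointers of struct sk_buff. */
struct mini_skb {
	unsigned char *head;	/* start of the allocated buffer */
	unsigned char *data;	/* start of valid data */
	unsigned char *tail;	/* end of valid data */
	unsigned char *end;	/* end of the allocated buffer */
};

/* The ordering every well-formed buffer must satisfy. */
static int mini_is_valid(const struct mini_skb *skb)
{
	return skb->head <= skb->data && skb->data <= skb->tail &&
	       skb->tail <= skb->end;
}

/* Free bytes in front of the valid data. */
static size_t mini_headroom(const struct mini_skb *skb)
{
	assert(mini_is_valid(skb));
	return (size_t)(skb->data - skb->head);
}

/* Bytes of valid data between data and tail. */
static size_t mini_datalen(const struct mini_skb *skb)
{
	assert(mini_is_valid(skb));
	return (size_t)(skb->tail - skb->data);
}

/* Free bytes behind the valid data. */
static size_t mini_tailroom(const struct mini_skb *skb)
{
	assert(mini_is_valid(skb));
	return (size_t)(skb->end - skb->tail);
}
```

For a 100-byte buffer with data at offset 20 and tail at offset 60, these helpers report 20 bytes of headroom, 40 bytes of data, and 40 bytes of tailroom.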

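1.3 Length information: len, data_len, truesize
- len: the length of the valid data in the packet, including protocol headers and payload.
- data_len: the length of the data held in fragments, that is, outside the linear buffer.
- truesize: the total memory charged to the buffer. It covers the data buffer plus the sk_buff structure itself (it is set from SKB_TRUESIZE(size) in __alloc_skb()), so it is larger than sizeof(struct sk_buff).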
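
A small sketch of how len and data_len relate, assuming (as the field descriptions suggest) that the bytes stored in the linear buffer are len minus data_len. The type len_model and both helpers are hypothetical names used only here.

```c
#include <assert.h>

/* Only the two length fields discussed above (hypothetical model type). */
struct len_model {
	unsigned int len;	/* all valid data: linear part + fragments */
	unsigned int data_len;	/* data held in fragments only */
};

/* Bytes that sit in the linear buffer between data and tail. */
static unsigned int model_headlen(const struct len_model *m)
{
	assert(m->data_len <= m->len);
	return m->len - m->data_len;
}

/* Non-zero when part of the packet lives outside the linear buffer. */
static int model_is_nonlinear(const struct len_model *m)
{
	return m->data_len != 0;
}
```

For example, a 1500-byte packet with 1000 bytes in fragments has 500 bytes in its linear buffer, while a fully linear 60-byte packet has data_len equal to 0.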
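1.4 Packet type
- pkt_type: the class of the packet. Setting it is the driver's responsibility, and it takes one of these values:
PACKET_HOST: the packet is addressed to this host.
PACKET_OTHERHOST: the packet is addressed to another host.
PACKET_BROADCAST: a broadcast packet.
PACKET_MULTICAST: a multicast packet.
An Ethernet driver does not have to modify pkt_type explicitly, because eth_type_trans() does that work.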
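
The following is a simplified sketch of how a destination MAC address could be mapped to the four packet classes. It is an illustration in the spirit of what eth_type_trans() decides, not a copy of the kernel function. The enum values and model_classify() are hypothetical names, and the real PACKET_* constants may have different numeric values.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for the PACKET_* classes. */
enum model_pkt_type {
	MODEL_PKT_HOST,
	MODEL_PKT_BROADCAST,
	MODEL_PKT_MULTICAST,
	MODEL_PKT_OTHERHOST
};

/* Classify a frame by its 6-byte destination address. */
static enum model_pkt_type model_classify(const unsigned char dest[6],
					  const unsigned char dev_addr[6])
{
	static const unsigned char bcast[6] = {
		0xff, 0xff, 0xff, 0xff, 0xff, 0xff
	};

	assert(dest && dev_addr);
	if (dest[0] & 0x01) {
		/* Group bit set: broadcast is the all-ones special case. */
		if (memcmp(dest, bcast, 6) == 0)
			return MODEL_PKT_BROADCAST;
		return MODEL_PKT_MULTICAST;
	}
	/* Unicast: ours only if it matches the device address. */
	if (memcmp(dest, dev_addr, 6) != 0)
		return MODEL_PKT_OTHERHOST;
	return MODEL_PKT_HOST;
}
```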
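2. Socket buffer operations
2.1 Allocating a socket buffer
struct sk_buff *alloc_skb(unsigned int size, gfp_t priority);
alloc_skb() allocates a socket buffer (the sk_buff itself) and a data buffer.
- size: the size of the data buffer
- priority: the memory allocation priority (a GFP mask such as GFP_KERNEL or GFP_ATOMIC)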

static inline struct sk_buff *alloc_skb(unsigned int size,
					gfp_t priority)
{
	return __alloc_skb(size, priority, 0, NUMA_NO_NODE);
}

/**
 *	__alloc_skb	-	allocate a network buffer
 *	@size: size to allocate
 *	@gfp_mask: allocation mask
 *	@fclone: allocate from fclone cache instead of head cache
 *		and allocate a cloned (child) skb
 *	@node: numa node to allocate memory on
 *
 *	Allocate a new &sk_buff. The returned buffer has no headroom and a
 *	tail room of size bytes. The object has a reference count of one.
 *	The return is the buffer. On a failure the return is %NULL.
 *
 *	Buffers may only be allocated from interrupts using a @gfp_mask of
 *	%GFP_ATOMIC.
 */
struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
			    int fclone, int node)
{
	struct kmem_cache *cache;
	struct skb_shared_info *shinfo;
	struct sk_buff *skb;
	u8 *data;

	cache = fclone ? skbuff_fclone_cache : skbuff_head_cache;

	/* Get the HEAD */
	skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
	if (!skb)
		goto out;
	prefetchw(skb);

	/* We do our best to align skb_shared_info on a separate cache
	 * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
	 * aligned memory blocks, unless SLUB/SLAB debug is enabled.
	 * Both skb->head and skb_shared_info are cache line aligned.
	 */
	size = SKB_DATA_ALIGN(size);
	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
	data = kmalloc_node_track_caller(size, gfp_mask, node);
	if (unlikely(ZERO_OR_NULL_PTR(data)))
		goto nodata;
	/* kmalloc(size) might give us more room than requested.
	 * Put skb_shared_info exactly at the end of allocated zone,
	 * to allow max possible filling before reallocation.
	 */
	size = SKB_WITH_OVERHEAD(ksize(data));
	prefetchw(data + size);

	/*
	 * Only clear those fields we need to clear, not those that we will
	 * actually initialise below. Hence, don't put any more fields after
	 * the tail pointer in struct sk_buff!
	 */
	memset(skb, 0, offsetof(struct sk_buff, tail));
	/* Account for allocated memory : skb + skb->head */
	skb->truesize = SKB_TRUESIZE(size);
	atomic_set(&skb->users, 1);
	skb->head = data;
	skb->data = data;
	skb_reset_tail_pointer(skb);
	skb->end = skb->tail + size;
#ifdef NET_SKBUFF_DATA_USES_OFFSET
	skb->mac_header = ~0U;
#endif

	/* make sure we initialize shinfo sequentially */
	shinfo = skb_shinfo(skb);
	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
	atomic_set(&shinfo->dataref, 1);
	kmemcheck_annotate_variable(shinfo->destructor_arg);

	if (fclone) {
		struct sk_buff *child = skb + 1;
		atomic_t *fclone_ref = (atomic_t *) (child + 1);

		kmemcheck_annotate_bitfield(child, flags1);
		kmemcheck_annotate_bitfield(child, flags2);
		skb->fclone = SKB_FCLONE_ORIG;
		atomic_set(fclone_ref, 1);

		child->fclone = SKB_FCLONE_UNAVAILABLE;
	}
out:
	return skb;
nodata:
	kmem_cache_free(cache, skb);
	skb = NULL;
	goto out;
}
EXPORT_SYMBOL(__alloc_skb);

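struct sk_buff *dev_alloc_skb(unsigned int length);
dev_alloc_skb() calls alloc_skb() above with GFP_ATOMIC priority
and reserves NET_SKB_PAD bytes of headroom between skb->head and skb->data (older kernels hard-coded 16 bytes here).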

/**
 *	dev_alloc_skb - allocate an skbuff for receiving
 *	@length: length to allocate
 *
 *	Allocate a new &sk_buff and assign it a usage count of one. The
 *	buffer has unspecified headroom built in. Users should allocate
 *	the headroom they think they need without accounting for the
 *	built in space. The built in space is used for optimisations.
 *
 *	%NULL is returned if there is no free memory. Although this function
 *	allocates memory it can be called from an interrupt.
 */
struct sk_buff *dev_alloc_skb(unsigned int length)
{
	/*
	 * There is more code here than it seems:
	 * __dev_alloc_skb is an inline
	 */
	return __dev_alloc_skb(length, GFP_ATOMIC);
}
EXPORT_SYMBOL(dev_alloc_skb);

/**
 *	__dev_alloc_skb - allocate an skbuff for receiving
 *	@length: length to allocate
 *	@gfp_mask: get_free_pages mask, passed to alloc_skb
 *
 *	Allocate a new &sk_buff and assign it a usage count of one. The
 *	buffer has unspecified headroom built in. Users should allocate
 *	the headroom they think they need without accounting for the
 *	built in space. The built in space is used for optimisations.
 *
 *	%NULL is returned if there is no free memory.
 */
static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
					      gfp_t gfp_mask)
{
	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
	if (likely(skb))
		skb_reserve(skb, NET_SKB_PAD);
	return skb;
}

2.2 Freeing a socket buffer
void kfree_skb(struct sk_buff *skb);

/**
 *	kfree_skb - free an sk_buff
 *	@skb: buffer to free
 *
 *	Drop a reference to the buffer and free it if the usage count has
 *	hit zero.
 */
void kfree_skb(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return;
	if (likely(atomic_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!atomic_dec_and_test(&skb->users)))
		return;
	trace_kfree_skb(skb, __builtin_return_address(0));
	__kfree_skb(skb);
}
EXPORT_SYMBOL(kfree_skb);

- kfree_skb() is normally used inside the kernel's networking core. By convention, network device drivers use dev_kfree_skb(), dev_kfree_skb_irq(), or dev_kfree_skb_any() instead.
void dev_kfree_skb(struct sk_buff *skb);
- dev_kfree_skb() is for non-interrupt context.

#define dev_kfree_skb(a)	consume_skb(a)

/**
 *	consume_skb - free an skbuff
 *	@skb: buffer to free
 *
 *	Drop a ref to the buffer and free it if the usage count has hit zero
 *	Functions identically to kfree_skb, but kfree_skb assumes that the frame
 *	is being dropped after a failure and notes that
 */
void consume_skb(struct sk_buff *skb)
{
	if (unlikely(!skb))
		return;
	if (likely(atomic_read(&skb->users) == 1))
		smp_rmb();
	else if (likely(!atomic_dec_and_test(&skb->users)))
		return;
	trace_consume_skb(skb);
	__kfree_skb(skb);
}
EXPORT_SYMBOL(consume_skb);

void dev_kfree_skb_irq(struct sk_buff *skb);
- dev_kfree_skb_irq() is for interrupt context.

void dev_kfree_skb_irq(struct sk_buff *skb)
{
	if (atomic_dec_and_test(&skb->users)) {
		struct softnet_data *sd;
		unsigned long flags;

		local_irq_save(flags);
		sd = &__get_cpu_var(softnet_data);
		skb->next = sd->completion_queue;
		sd->completion_queue = skb;
		raise_softirq_irqoff(NET_TX_SOFTIRQ);
		local_irq_restore(flags);
	}
}
EXPORT_SYMBOL(dev_kfree_skb_irq);

void dev_kfree_skb_any(struct sk_buff *skb);
- dev_kfree_skb_any() can be used in both interrupt and non-interrupt context.

void dev_kfree_skb_any(struct sk_buff *skb)
{
	if (in_irq() || irqs_disabled())
		dev_kfree_skb_irq(skb);
	else
		dev_kfree_skb(skb);
}
EXPORT_SYMBOL(dev_kfree_skb_any);
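
All of the free functions above share one idea: drop a reference and release the memory only when the user count reaches zero. A minimal userspace sketch of that idea follows. The ref_buf type and its functions are hypothetical, and a plain int stands in for the kernel's atomic_t, so this model is not safe for concurrent use.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical reference-counted buffer. */
struct ref_buf {
	int users;		/* plain int here; the kernel uses atomic_t */
	unsigned char *data;
};

/* Allocate a buffer that starts with one user, like a fresh sk_buff. */
static struct ref_buf *ref_buf_alloc(size_t size)
{
	struct ref_buf *b = malloc(sizeof(*b));

	if (!b)
		return NULL;
	b->data = malloc(size);
	if (!b->data) {
		free(b);
		return NULL;
	}
	b->users = 1;
	return b;
}

/* Take an extra reference. */
static void ref_buf_get(struct ref_buf *b)
{
	b->users++;
}

/* Drop one reference. Returns 1 if the memory was released, else 0. */
static int ref_buf_free(struct ref_buf *b)
{
	if (!b)
		return 0;	/* freeing NULL is a no-op, as in kfree_skb() */
	assert(b->users > 0);
	if (--b->users > 0)
		return 0;	/* someone else still holds the buffer */
	free(b->data);
	free(b);
	return 1;
}
```

With two holders, the first ref_buf_free() call only decrements the count and the second one actually releases the memory.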

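2.3 Moving the pointers
The pointer operations on a Linux socket buffer are put, push, pull, and reserve.
2.3.1 put
unsigned char *skb_put(struct sk_buff *skb, unsigned int len);
skb_put() moves the tail pointer down (toward end) by len, increases skb->len by len, and returns the value skb->tail had before the move.
It is used to append data at the tail of the buffer.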

/**
 *	skb_put - add data to a buffer
 *	@skb: buffer to use
 *	@len: amount of data to add
 *
 *	This function extends the used data area of the buffer. If this would
 *	exceed the total buffer size the kernel will panic. A pointer to the
 *	first byte of the extra data is returned.
 */
unsigned char *skb_put(struct sk_buff *skb, unsigned int len)
{
	unsigned char *tmp = skb_tail_pointer(skb);
	SKB_LINEAR_ASSERT(skb);
	skb->tail += len;
	skb->len  += len;
	if (unlikely(skb->tail > skb->end))
		skb_over_panic(skb, len, __builtin_return_address(0));
	return tmp;
}
EXPORT_SYMBOL(skb_put);

static inline unsigned char *skb_tail_pointer(const struct sk_buff *skb)
{
	return skb->tail;
}

unsigned char *__skb_put(struct sk_buff *skb, unsigned int len);
__skb_put() differs from skb_put() in that skb_put() checks that the new tail does not pass skb->end (and panics if it does), whereas __skb_put() performs no check.

static inline unsigned char *__skb_put(struct sk_buff *skb, unsigned int len)
{
	unsigned char *tmp = skb_tail_pointer(skb);
	SKB_LINEAR_ASSERT(skb);
	skb->tail += len;
	skb->len  += len;
	return tmp;
}
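
To make the effect of put concrete, here is a userspace sketch with plain pointers. The mini_skb type and the mini_* functions are hypothetical names, and the model uses assert() where the kernel would panic.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the buffer-related fields of struct sk_buff. */
struct mini_skb {
	unsigned char *head, *data, *tail, *end;
	unsigned int len;
};

/* Start with an empty buffer: no headroom, everything is tailroom. */
static void mini_init(struct mini_skb *skb, unsigned char *buf,
		      unsigned int size)
{
	skb->head = skb->data = skb->tail = buf;
	skb->end = buf + size;
	skb->len = 0;
}

/* Like skb_put(): grow the data area at the tail, return the old tail. */
static unsigned char *mini_put(struct mini_skb *skb, unsigned int len)
{
	unsigned char *old_tail = skb->tail;

	/* skb_put() panics on overflow; the model just asserts. */
	assert(len <= (unsigned int)(skb->end - skb->tail));
	skb->tail += len;
	skb->len += len;
	return old_tail;
}

/* Append n bytes by copying them to the address that put returns. */
static void mini_append(struct mini_skb *skb, const void *src,
			unsigned int n)
{
	memcpy(mini_put(skb, n), src, n);
}
```

Because put returns the old tail, the returned pointer is exactly where the newly added bytes should be written.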

2.3.2 push
unsigned char *skb_push(struct sk_buff *skb, unsigned int len);
skb_push() moves the data pointer up (toward head) by len. In other words it adds data at the start of the buffer, so skb->len increases as well.

/**
 *	skb_push - add data to the start of a buffer
 *	@skb: buffer to use
 *	@len: amount of data to add
 *
 *	This function extends the used data area of the buffer at the buffer
 *	start. If this would exceed the total buffer headroom the kernel will
 *	panic. A pointer to the first byte of the extra data is returned.
 */
unsigned char *skb_push(struct sk_buff *skb, unsigned int len)
{
	skb->data -= len;
	skb->len  += len;
	if (unlikely(skb->data < skb->head))
		skb_under_panic(skb, len, __builtin_return_address(0));
	return skb->data;
}
EXPORT_SYMBOL(skb_push);

unsigned char *__skb_push(struct sk_buff *skb, unsigned int len);

static inline unsigned char *__skb_push(struct sk_buff *skb, unsigned int len)
{
	skb->data -= len;
	skb->len  += len;
	return skb->data;
}

__skb_push() differs from skb_push() in the same way that __skb_put() differs from skb_put(): it skips the bounds check.
push makes room for packet data at the head of the buffer, whereas put makes room for packet data at the tail.
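
The send path, where each layer prepends its header, can be sketched with push in userspace. Everything named mini_* below is hypothetical. The header sizes 8, 20, and 14 are the usual UDP, IPv4 (without options), and Ethernet header lengths, used here only as example numbers.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the buffer-related fields of struct sk_buff. */
struct mini_skb {
	unsigned char *head, *data, *tail, *end;
	unsigned int len;
};

/* Empty buffer with `headroom` bytes kept free in front of data. */
static void mini_init(struct mini_skb *skb, unsigned char *buf,
		      unsigned int size, unsigned int headroom)
{
	assert(headroom <= size);
	skb->head = buf;
	skb->data = skb->tail = buf + headroom;
	skb->end = buf + size;
	skb->len = 0;
}

/* Append payload at the tail (same idea as skb_put()). */
static unsigned char *mini_put(struct mini_skb *skb, unsigned int len)
{
	unsigned char *old_tail = skb->tail;

	assert(len <= (unsigned int)(skb->end - skb->tail));
	skb->tail += len;
	skb->len += len;
	return old_tail;
}

/* Like skb_push(): claim len bytes of headroom, return the new data. */
static unsigned char *mini_push(struct mini_skb *skb, unsigned int len)
{
	/* skb_push() panics on underflow; the model just asserts. */
	assert(len <= (unsigned int)(skb->data - skb->head));
	skb->data -= len;
	skb->len += len;
	return skb->data;
}
```

Starting with 42 bytes of headroom and a 4-byte payload, pushing 8, then 20, then 14 bytes walks data back to head and leaves len at 46, with the payload untouched at its original address.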

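2.3.3 pull
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len);
skb_pull() moves the data pointer down by len and decreases skb->len. It is the inverse of skb_push().
It is mainly used when a lower protocol layer hands a packet to the layer above, so that data points at the upper layer's header.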

/**
 *	skb_pull - remove data from the start of a buffer
 *	@skb: buffer to use
 *	@len: amount of data to remove
 *
 *	This function removes data from the start of a buffer, returning
 *	the memory to the headroom. A pointer to the next data in the buffer
 *	is returned. Once the data has been pulled future pushes will overwrite
 *	the old data.
 */
unsigned char *skb_pull(struct sk_buff *skb, unsigned int len)
{
	return skb_pull_inline(skb, len);
}
EXPORT_SYMBOL(skb_pull);

static inline unsigned char *skb_pull_inline(struct sk_buff *skb, unsigned int len)
{
	return unlikely(len > skb->len) ? NULL : __skb_pull(skb, len);
}

static inline unsigned char *__skb_pull(struct sk_buff *skb, unsigned int len)
{
	skb->len -= len;
	BUG_ON(skb->len < skb->data_len);
	return skb->data += len;
}
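
A userspace sketch of pull, including the case where the caller asks for more bytes than the buffer holds. The mini_skb type and mini_pull() are hypothetical names. The model covers linear buffers only and ignores data_len.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the buffer-related fields of struct sk_buff. */
struct mini_skb {
	unsigned char *head, *data, *tail, *end;
	unsigned int len;
};

/*
 * Like skb_pull(): drop len bytes from the front of the data area.
 * Returns the new data pointer, or NULL (leaving the buffer untouched)
 * when the buffer holds fewer than len bytes.
 */
static unsigned char *mini_pull(struct mini_skb *skb, unsigned int len)
{
	if (len > skb->len)
		return NULL;
	skb->len -= len;
	skb->data += len;
	assert(skb->data <= skb->tail);
	return skb->data;
}
```

The pulled bytes are not erased. They simply become headroom, which is why a header pointer saved before the pull remains usable afterwards.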

2.3.4 reserve
void skb_reserve(struct sk_buff *skb, int len);
skb_reserve() moves the data and tail pointers down together by len.
It is used to reserve len bytes of headroom at the start of an empty buffer.

/**
 *	skb_reserve - adjust headroom
 *	@skb: buffer to alter
 *	@len: bytes to move
 *
 *	Increase the headroom of an empty &sk_buff by reducing the tail
 *	room. This is only allowed for an empty buffer.
 */
static inline void skb_reserve(struct sk_buff *skb, int len)
{
	skb->data += len;
	skb->tail += len;
}
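
One common use of reserve, sketched in userspace: a driver reserves 2 bytes before copying in an Ethernet frame so that the IP header, which follows the 14-byte Ethernet header, lands on a 4-byte boundary relative to the start of the buffer. The mini_* names are hypothetical, and the asserts stand in for rules the kernel function leaves to its callers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the buffer-related fields of struct sk_buff. */
struct mini_skb {
	unsigned char *head, *data, *tail, *end;
	unsigned int len;
};

/* Empty buffer with no headroom yet. */
static void mini_init(struct mini_skb *skb, unsigned char *buf,
		      unsigned int size)
{
	skb->head = skb->data = skb->tail = buf;
	skb->end = buf + size;
	skb->len = 0;
}

/* Like skb_reserve(): shift data and tail together to create headroom. */
static void mini_reserve(struct mini_skb *skb, unsigned int len)
{
	/* Only meaningful on an empty buffer, as the kernel comment says. */
	assert(skb->data == skb->tail);
	assert(len <= (unsigned int)(skb->end - skb->tail));
	skb->data += len;
	skb->tail += len;
}

/* Offset of the IP header from head once a 14-byte header is in place. */
static size_t mini_ip_offset(const struct mini_skb *skb)
{
	return (size_t)(skb->data - skb->head) + 14;
}
```

After mini_reserve(&skb, 2), the IP header offset is 16, a multiple of 4, and len is still 0 because reserve adds no data.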

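3. Example
The way Linux receives a UDP packet illustrates how an sk_buff is manipulated.
Most of this work is done by the kernel. The driver only has to handle the data link layer part.
Suppose the network card receives a UDP packet. Linux processes it as follows:

3.1 After the card receives the UDP packet, the driver creates an sk_buff and a data buffer, copies all of the received data into the space that data points to, and sets skb->mac_header to data.
At this point the valid data starts with the Ethernet header, that is, the link layer header.
Example code:
// allocate a new socket buffer and data buffer

[Figure: the buffer after the driver has filled the sk_buff]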

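3.2 The data link layer calls skb_pull() to strip the Ethernet header and passes the packet to the IP network layer.
During stripping, the data pointer moves down by the length of one Ethernet header, sizeof(struct ethhdr), and len is reduced by sizeof(struct ethhdr).
The valid data now starts with the IP header. skb->network_header points to data, the IP header, while skb->mac_header still points to the Ethernet header, the link layer header.
[Figure: the buffer after the Ethernet header has been stripped]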

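3.3 The network layer calls skb_pull() to strip the IP header and passes the packet to the UDP transport layer.
During stripping, the data pointer moves down by the length of one IP header, sizeof(struct iphdr) when the header carries no options, and len is reduced by the same amount.
The valid data now starts with the UDP header, and skb->transport_header points to data, the UDP header.
skb->network_header still points to the IP header, and skb->mac_header still points to the link layer header.
[Figure: the buffer after the IP header has been stripped]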

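3.4 When the application calls recv() to receive the data, copying into the application buffer starts at skb->data + sizeof(struct udphdr).
As this shows, the UDP header is never actually stripped.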
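
Steps 3.1 to 3.4 can be replayed in a small userspace model. This is a sketch of the pointer movements only, not of real driver or protocol code. All mini_* and MODEL_* names are hypothetical, the frame is assumed to be a well-formed Ethernet + IPv4 (no options) + UDP frame, and the header fields use plain pointers.

```c
#include <assert.h>
#include <string.h>

#define MODEL_ETH_HLEN	14	/* Ethernet header */
#define MODEL_IP_HLEN	20	/* IPv4 header without options */
#define MODEL_UDP_HLEN	8	/* UDP header */

/* Hypothetical stand-in for the fields of struct sk_buff used here. */
struct mini_skb {
	unsigned char *head, *data, *tail, *end;
	unsigned int len;
	unsigned char *mac_header, *network_header, *transport_header;
};

/* Append len bytes at the tail and return where they start. */
static unsigned char *mini_put(struct mini_skb *skb, unsigned int len)
{
	unsigned char *old_tail = skb->tail;

	assert(len <= (unsigned int)(skb->end - skb->tail));
	skb->tail += len;
	skb->len += len;
	return old_tail;
}

/* Drop len bytes from the front of the data area. */
static unsigned char *mini_pull(struct mini_skb *skb, unsigned int len)
{
	assert(len <= skb->len);
	skb->len -= len;
	skb->data += len;
	return skb->data;
}

/* Walk one received frame up the stack; returns the UDP payload. */
static const unsigned char *mini_udp_receive(struct mini_skb *skb,
					     unsigned char *buf,
					     unsigned int size,
					     const unsigned char *frame,
					     unsigned int frame_len)
{
	assert(frame_len >= MODEL_ETH_HLEN + MODEL_IP_HLEN + MODEL_UDP_HLEN);
	skb->head = skb->data = skb->tail = buf;
	skb->end = buf + size;
	skb->len = 0;

	/* 3.1: copy the frame in; data points at the Ethernet header. */
	memcpy(mini_put(skb, frame_len), frame, frame_len);
	skb->mac_header = skb->data;

	/* 3.2: strip the Ethernet header; data points at the IP header. */
	mini_pull(skb, MODEL_ETH_HLEN);
	skb->network_header = skb->data;

	/* 3.3: strip the IP header; data points at the UDP header. */
	mini_pull(skb, MODEL_IP_HLEN);
	skb->transport_header = skb->data;

	/* 3.4: the payload is read past the UDP header; data stays put. */
	return skb->data + MODEL_UDP_HLEN;
}
```

After the walk, all three header pointers still refer to their headers inside the same buffer, and data rests on the UDP header, matching the observation in 3.4.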
