内存学习(1)

从do_page_fault说起

Posted by 大狗 on September 16, 2020

内存学习(1)

前言

内存这块我很早之前看过,但是一直没怎么做记录。这次姑且记录下这次的阅读的书籍分别为:《Linux内核0.11完全注释》,《Linux内核源代码情景分析》,《ATCP内存实现》(这个是我司的代码管理)。啊,对了,这次我同时在看SICP的东西,希望顺便打打基础。当然,纸上得来终觉浅,绝知此事要躬行,还是从已经分配好的内存看起吧,从用户态的开始看。

絮叨几句

简单来说linux2.6的内存管理方式并不能说麻烦,来来去去东西就那么多,但是有很多东西属于前置知识,也就是说你知道它是为了解决什么问题才可能理解为什么会有这种解决方式,单纯的看代码只会觉得仿佛是看天书。我第一次看《深入详解linux内核》只感觉是天书,毕竟毕业好多年,操作系统的东西差不多都还给老师了,后来为了学习就买了一本《操作系统真象还原》然后哆哆嗦嗦看完了,再看《深入理解linux内核》才大致明白是什么回事,可是内核庞大,经常前面的东西裹挟着后面的内容,所以经常会看看忽然看不懂,暂且释怀,总会明白,无非多看几次。

内存管理小前言

linux2.6的内存管理,总结来说分为下面两块

  • 虚拟内存管理
  • 物理内存管理 这其中,虚拟内存管理分为
  • 应用层虚拟内存管理(0-3GB的内存管理)
  • kernel虚拟内存管理(3-4GB的内存管理)
  • 虚拟内存的初始化 这里面无论是0-3GB还是3-4GB都是虚拟内存的概念,每个程序都认为自己有四GB的空间,但是实际使用的时候会根据需求进行调页。虚拟内存初始化讲道理和前两种分类并不一致,但是又没多少其他合适的地方放,所以姑且放到一起 物理内存管理层面分为
  • 物理内存的分配管理
  • 物理内存的回收和置换
  • 物理内存管理方式之slab管理器 严格来说,slab管理器实际上属于物理内存的分配管理,但是由于其管理方式非常常见也很重要所以单独拿出来了。下面我会一边捋思路,一边记录多种数据结构。因为这两者都互相影响,所以虚拟内存和物理内存的管理会穿插着来。

从页框说起–基本的物理内存管理

大部分书籍说起内存管理,都先说什么线性地址,可是每个人都知道什么是线性地址,也不需要我多少。我们从页框(Page Frame)说起,假设我们有个12KB的内存,我们认为一个4k是一个页框,第一个4k,我们称为第一页,第二个4k我们称为第二页,第三个4k我们称为第三页,简单而言每4k大小的物理内存我们称为一页。为了管理每个页框,我们为每个页框生成一个管理单元,该数据结构称为“页描述符”(以后我们page也指一页,也指对应的页描述符),数据结构写在下面了。

unsigned long flags用于表明page目前正处在哪种类型,用该标志来表示该页框的状态,常见的标志为

	PG_locked			//页被锁定,比方说在做I/O操作,这几次对内存的梳理我们都不会涉及到
	PG_referenced		//该页刚刚被访问过,我们等到“物理内存的回收和置换”再讨论
	PG_active			//该页在活动或者非活动链表里,我们等到“物理内存的回收和置换”再讨论
	PG_reserved			//该页给内存保留使用,或者无法使用
	PG_reclaim			//该页将要写入内存
	PG_swapcache		//该页属于兑换高速缓存
	PG_private			//该页的private字段有用,是有特殊含义的(比方说伙伴系统用这个记录位阶)

atomic_t _count字段说明该页框当前有有几个进程使用,比方说多个fork出来的进程同样拥有一个页框 atomic_t _mapcount 字段说明该页框当中有多少个页表项,如果一个没有为-1

struct address_space *mapping 如果该PAGE属于匿名区,那么mapping用于指向某个还没涉及到的数据结构,从而找到页表项。如果属于页高速缓存

unsigned long index

struct list_head lru 非常好玩的东西,只要是个list_head的链表都能挂,当然最常见的是挂在物理内存伙伴算法和活动链表和非活动链表上。

剩下的几个结构要么涉及不到,要么我暂时不想理。略过了。

	struct page {
	unsigned long flags;		/* Atomic flags, some possibly
					 * updated asynchronously */
	atomic_t _count;		/* Usage count, see below. */
	union {
		atomic_t _mapcount;	/* Count of ptes mapped in mms,
					 * to show when page is mapped
					 * & limit reverse map searches.
					 */
		struct {		/* SLUB */
			u16 inuse;
			u16 objects;
		};
	};
	union {
	    struct {
		unsigned long private;		/* Mapping-private opaque data:
					 	 * usually used for buffer_heads
						 * if PagePrivate set; used for
						 * swp_entry_t if PageSwapCache;
						 * indicates order in the buddy
						 * system if PG_buddy is set.
						 */
		struct address_space *mapping;	/* If low bit clear, points to
						 * inode address_space, or NULL.
						 * If page mapped as anonymous
						 * memory, low bit is set, and
						 * it points to anon_vma object:
						 * see PAGE_MAPPING_ANON below.
						 */
	    };
#if USE_SPLIT_PTLOCKS
	    spinlock_t ptl;
#endif
	    struct kmem_cache *slab;	/* SLUB: Pointer to slab */
	    struct page *first_page;	/* Compound tail pages */
	};
	union {
		pgoff_t index;		/* Our offset within mapping. */
		void *freelist;		/* SLUB: freelist req. slab lock */
	};
	struct list_head lru;		/* Pageout list, eg. active_list
					 * protected by zone->lru_lock !
					 */
	/*
	 * On machines where all RAM is mapped into kernel address space,
	 * we can simply calculate the virtual address. On machines with
	 * highmem some memory is mapped into kernel virtual memory
	 * dynamically, so we need a place to store that address.
	 * Note that this field could be 16 bits on x86 ... ;)
	 *
	 * Architectures with slow multiplication can define
	 * WANT_PAGE_VIRTUAL in asm/page.h
	 */
#if defined(WANT_PAGE_VIRTUAL)
	void *virtual;			/* Kernel virtual address (NULL if
					   not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef CONFIG_WANT_PAGE_DEBUG_FLAGS
	unsigned long debug_flags;	/* Use atomic bitops on this */
#endif

#ifdef CONFIG_KMEMCHECK
	/*
	 * kmemcheck wants to track the status of each byte in a page; this
	 * is a pointer to such a status block. NULL if not tracked.
	 */
	void *shadow;
#endif
};

数据结构列在上面了,这样一个结构大概40字节,用来表示4k字节,也就是说用1的内存来表示大小为100的内存,似乎还可以?好,我们继续向下说。管理那么多的物理内存,不可能都临时管理,肯定都是初始化好的,所以操作系统中会有一个全局的page数组,称为mem_map数组。平时页表或者页目录项里面存储的也是页帧号Page Frame Number。

对于X86系统,我们人为的将内存划成几个管理区:

  • ZONE_DMA包含低于16MB的内存页框
  • ZONE_NORMAL包含高于16MB且低于896MB的内存页框
  • ZONE_HIGHMEEM包含高于896MB的内存

让我们直接看内核里面的ZONE_TYPE的意思:

  • ZONE_DMA是给设备用的,因为不能直接对所有的可寻址地址做DMA操作,”that are not able to do DMA to all of addressable memory”
  • ZONE_DMA32是给X86_64用的,因为有设备只能对32bit地址下做DMA
  • ZONE_NORMAL真正的可寻址范围
  • ZONE_HIGHMEM是高端地址,不过X86_64是不需要用的
  • ZONE_MOVABLE 这部分建议直接看英文原版https://lwn.net/Articles/224829/,不会出现理解错误

我在公司的电脑,就三个ZONE,ZONE_DMA,ZONE_DMA32和ZONE_NORMAL

enum zone_type {
#ifdef CONFIG_ZONE_DMA
	/*
	 * ZONE_DMA is used when there are devices that are not able
	 * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
	 * carve out the portion of memory that is needed for these devices.
	 * The range is arch specific.
	 *
	 * Some examples
	 *
	 * Architecture		Limit
	 * ---------------------------
	 * parisc, ia64, sparc	<4G
	 * s390, powerpc	<2G
	 * arm			Various
	 * alpha		Unlimited or 0-16MB.
	 *
	 * i386, x86_64 and multiple other arches
	 * 			<16M.
	 */
	ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
	/*
	 * x86_64 needs two ZONE_DMAs because it supports devices that are
	 * only able to do DMA to the lower 16M but also 32 bit devices that
	 * can only do DMA areas below 4G.
	 */
	ZONE_DMA32,
#endif
	/*
	 * Normal addressable memory is in ZONE_NORMAL. DMA operations can be
	 * performed on pages in ZONE_NORMAL if the DMA devices support
	 * transfers to all addressable memory.
	 */
	ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
	/*
	 * A memory area that is only addressable by the kernel through
	 * mapping portions into its own address space. This is for example
	 * used by i386 to allow the kernel to address the memory beyond
	 * 900MB. The kernel will set up special mappings (page
	 * table entries on i386) for each page that the kernel needs to
	 * access.
	 */
	ZONE_HIGHMEM,
#endif
	ZONE_MOVABLE,
#ifdef CONFIG_ZONE_DEVICE
	ZONE_DEVICE,
#endif
	__MAX_NR_ZONES

};

有一个GFP_ZONE_TABLEZONES_SHIFT实际上就是多少位的移位的意思。默认是从ZONE_NORMAL开始计算方法如下:

#define GFP_ZONE_TABLE ( \
	(ZONE_NORMAL << 0 * GFP_ZONES_SHIFT)				       \
	| (OPT_ZONE_DMA << ___GFP_DMA * GFP_ZONES_SHIFT)		       \
	| (OPT_ZONE_HIGHMEM << ___GFP_HIGHMEM * GFP_ZONES_SHIFT)	       \
	| (OPT_ZONE_DMA32 << ___GFP_DMA32 * GFP_ZONES_SHIFT)		       \
	| (ZONE_NORMAL << ___GFP_MOVABLE * GFP_ZONES_SHIFT)		       \
	| (OPT_ZONE_DMA << (___GFP_MOVABLE | ___GFP_DMA) * GFP_ZONES_SHIFT)    \
	| (ZONE_MOVABLE << (___GFP_MOVABLE | ___GFP_HIGHMEM) * GFP_ZONES_SHIFT)\
	| (OPT_ZONE_DMA32 << (___GFP_MOVABLE | ___GFP_DMA32) * GFP_ZONES_SHIFT)\
)
#if MAX_NR_ZONES < 2
#define ZONES_SHIFT 0
#elif MAX_NR_ZONES <= 2
#define ZONES_SHIFT 1
#elif MAX_NR_ZONES <= 4
#define ZONES_SHIFT 2
#elif MAX_NR_ZONES <= 8
#define ZONES_SHIFT 3

外部碎片或者说迁移类型的问题,看这个页面https://pingcap.com/blog-cn/linux-kernel-vs-memory-fragmentation-1/和https://article.itxueyuan.com/pWkBe页面。此外,为了对抗碎片化的问题,把一些文章贴进来:

lwn 发布时间 标题
2004-09-08 Kswapd and high-order allocations
2004-05-10 Active memory defragmentation
2005-02-01 Yet another approach to memory fragmentation
2005-11-02 Fragmentation avoidance
2005-11-08 More on fragmentation avoidance
2006-11-28 Avoiding - and fixing - memory fragmentation
2010-01-06 Memory compaction
2014-03-26 Memory compaction issues
2015-07-14 Making kernel pages movable
2016-04-23 CMA and compaction
2016-05-10 make direct compaction more deterministic
2017-03-21 Proactive compaction
2018-10-31 Fragmentation avoidance improvements
2020-04-21 Proactive compaction for the kernel

对于物理内存的回收,释放交换等等内容,都是针对管理区而言的。我们看看每个管理区的结构

	struct zone {
	/* Fields commonly accessed by the page allocator */

	/* zone watermarks, access with *_wmark_pages(zone) macros */
	unsigned long watermark[NR_WMARK];

	/*
	 * When free pages are below this point, additional steps are taken
	 * when reading the number of free pages to avoid per-cpu counter
	 * drift allowing watermarks to be breached
	 */
	unsigned long percpu_drift_mark;

	/*
	 * We don't know if the memory that we're going to allocate will be freeable
	 * or/and it will be released eventually, so to avoid totally wasting several
	 * GB of ram we must reserve some of the lower zone memory (otherwise we risk
	 * to run OOM on the lower zones despite there's tons of freeable ram
	 * on the higher zones). This array is recalculated at runtime if the
	 * sysctl_lowmem_reserve_ratio sysctl changes.
	 */
	unsigned long		lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
	int node;
	/*
	 * zone reclaim becomes active if more unmapped pages exist.
	 */
	unsigned long		min_unmapped_pages;
	unsigned long		min_slab_pages;
#endif
	struct per_cpu_pageset __percpu *pageset;
	/*
	 * free areas of different sizes
	 */
	spinlock_t		lock;
	int                     all_unreclaimable; /* All pages pinned */
#ifdef CONFIG_MEMORY_HOTPLUG
	/* see spanned/present_pages for more description */
	seqlock_t		span_seqlock;
#endif
	struct free_area	free_area[MAX_ORDER];

#ifndef CONFIG_SPARSEMEM
	/*
	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
	 * In SPARSEMEM, this map is stored in struct mem_section
	 */
	unsigned long		*pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

#ifdef CONFIG_COMPACTION
	/*
	 * On compaction failure, 1<<compact_defer_shift compactions
	 * are skipped before trying again. The number attempted since
	 * last failure is tracked with compact_considered.
	 */
	unsigned int		compact_considered;
	unsigned int		compact_defer_shift;
#endif

	ZONE_PADDING(_pad1_)

	/* Fields commonly accessed by the page reclaim scanner */
	spinlock_t		lru_lock;	
	struct zone_lru {
		struct list_head list;
	} lru[NR_LRU_LISTS];

	struct zone_reclaim_stat reclaim_stat;

	unsigned long		pages_scanned;	   /* since last reclaim */
	unsigned long		flags;		   /* zone flags, see below */

	/* Zone statistics */
	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];

	/*
	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
	 * this zone's LRU.  Maintained by the pageout code.
	 */
	unsigned int inactive_ratio;


	ZONE_PADDING(_pad2_)
	/* Rarely used or read-mostly fields */

	/*
	 * wait_table		-- the array holding the hash table
	 * wait_table_hash_nr_entries	-- the size of the hash table array
	 * wait_table_bits	-- wait_table_size == (1 << wait_table_bits)
	 *
	 * The purpose of all these is to keep track of the people
	 * waiting for a page to become available and make them
	 * runnable again when possible. The trouble is that this
	 * consumes a lot of space, especially when so few things
	 * wait on pages at a given time. So instead of using
	 * per-page waitqueues, we use a waitqueue hash table.
	 *
	 * The bucket discipline is to sleep on the same queue when
	 * colliding and wake all in that wait queue when removing.
	 * When something wakes, it must check to be sure its page is
	 * truly available, a la thundering herd. The cost of a
	 * collision is great, but given the expected load of the
	 * table, they should be so rare as to be outweighed by the
	 * benefits from the saved space.
	 *
	 * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
	 * primary users of these fields, and in mm/page_alloc.c
	 * free_area_init_core() performs the initialization of them.
	 */
	wait_queue_head_t	* wait_table;
	unsigned long		wait_table_hash_nr_entries;
	unsigned long		wait_table_bits;

	/*
	 * Discontig memory support fields.
	 */
	struct pglist_data	*zone_pgdat;
	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
	unsigned long		zone_start_pfn;

	/*
	 * zone_start_pfn, spanned_pages and present_pages are all
	 * protected by span_seqlock.  It is a seqlock because it has
	 * to be read outside of zone->lock, and it is done in the main
	 * allocator path.  But, it is written quite infrequently.
	 *
	 * The lock is declared along with zone->lock because it is
	 * frequently read in proximity to zone->lock.  It's good to
	 * give them a chance of being in the same cacheline.
	 */
	unsigned long		spanned_pages;	/* total size, including holes */
	unsigned long		present_pages;	/* amount of memory (excluding holes) */

	/*
	 * rarely used fields:
	 */
	const char		*name;
} ____cacheline_internodealigned_in_smp;

这些内存管理区统统划为一个节点。

上面这些代码都只是偏向于理解,并不难,还是得RTFSC。

结尾的闲言碎语

写到这里差不多就可以结束了,就不多说了。 狗头的赞赏码.jpg