Memory Study (4) - Understanding Linux Virtual Memory Management
I went through this code in two kernel versions, linux 2.6.34 and linux 5.0; they are largely the same. The books I leafed through are 《深入理解Linux虚拟内存管理》 (the Chinese edition of Mel Gorman's Understanding the Linux Virtual Memory Manager), 笨叔's 《奔跑吧linux内核》, and all sorts of blogs. Out of respect for Mel Gorman, the content below stays as close as possible to Understanding the Linux Virtual Memory Manager. Everything I write here targets the 5.0 kernel code. One caveat: before trying to understand Linux virtual memory management, you need some grounding in computer systems first.
Chapter 1: Introduction
Why is Linux virtual memory so hard to understand? From my own point of view:
- The code changes enormously between versions and new features keep being added, while the Chinese-language material on the internet is too scattered to learn from
- There is an endless stream of black magic and tricks; if you already know them they feel obvious, and if you don't, nothing makes sense
- Many kernel features are self-contained designs; without reading the mailing list you have no idea how a given mechanism was argued into existence.
So this blog is partly a record of my own learning, and partly the hope that someone with the same questions will happen upon it and find answers. That said, I recommend even more the page below and its accompanying video, which goes straight into the guts of the subject; it is very good:
https://hackmd.io/@sysprog/linux-memory?type=view
Chapter 2: Describing Physical Memory
Linux runs on many architectures, so it needs an architecture-independent way to describe memory. This chapter focuses on nodes and zones, and on how nodes and zones are initialized. On a NUMA system, every node is linked onto a NULL-terminated list called pgdat_list, each node pointing to the next via pg_data_t->node_next (this is how Gorman's book describes it; 5.0-era kernels reach nodes through the node_data[] array and for_each_online_pgdat() instead). Each node is further divided into struct zone instances, i.e. zones. The zone we usually care about is ZONE_NORMAL; addresses in this zone are called linearly mapped addresses (in practice just the physical address plus a constant offset). For 32-bit x86, which can address at most 4G, the zones are:
- ZONE_DMA: the first 16MB, used by many drivers
- ZONE_NORMAL: the 16MB-896MB range, mapped directly into the 4th GB (the kernel's gigabyte) of the 4G x86 virtual address space. Why 896MB? Because 896MB is the amount that is directly (linearly) mapped; if the whole gigabyte were linearly mapped, the entire 4th GB would be taken up by the linear mapping, leaving no room to map anything else (high memory, for example).
- ZONE_HIGHMEM: the address space above 896MB; on x86_64 this zone is essentially no longer needed.
For x86_64, which is what today's machines typically run, the zones defined in the source are listed below; the comments explain why each zone exists:
enum zone_type {
#ifdef CONFIG_ZONE_DMA
ZONE_DMA, // if a device cannot DMA to ZONE_NORMAL addresses, carve out this separate zone for it; the exact range is architecture-specific, assume 16MB here
#endif
#ifdef CONFIG_ZONE_DMA32
ZONE_DMA32, // on x86_64 two DMA zones are needed because some devices can only DMA into the first 16MB, while others can reach any 32-bit (4G) address
#endif
ZONE_NORMAL, // essentially this is already the linearly mapped region
#ifdef CONFIG_HIGHMEM
ZONE_HIGHMEM, // high memory: the kernel must map these pages into its own address space before accessing them; in practice only 32-bit x86 still uses it, for addresses above ~896MB
#endif
ZONE_MOVABLE, // an interesting zone: it does not correspond to a distinct physical range, it is a pseudo zone used for movable allocations (anti-fragmentation, memory hotplug)
#ifdef CONFIG_ZONE_DEVICE
ZONE_DEVICE,
#endif
__MAX_NR_ZONES
};
On (32-bit) x86, many kernel operations can only be performed on ZONE_NORMAL, so when the kernel needs to touch a high-memory address it usually has to map it first into a mapping window inside the kernel's last 1GB. System memory is divided into fixed-size blocks called physical page frames; each page frame is described by a struct page, and all of these structs live in a global mem_map array, which is normally stored at the start of ZONE_NORMAL.
On x86_64, you could say that all physical memory now sits in the linear mapping region; the various kernel regions just map it differently (see the sketch below).
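To make "linear mapping" concrete, here is a minimal sketch of the arithmetic, assuming x86_64 with 4-level paging and no KASLR; the constant and helper names are mine, the real kernel helpers are __va()/__pa():

/* Sketch only: for addresses in the direct map, converting between a
 * physical address and its kernel virtual address is pure arithmetic.
 * DIRECT_MAP_BASE below is the default x86_64 direct-map start
 * (page_offset_base) without KASLR; real code must use __va()/__pa(). */
#define DIRECT_MAP_BASE 0xffff888000000000UL

static inline void *phys_to_virt_sketch(unsigned long phys)
{
	return (void *)(phys + DIRECT_MAP_BASE);
}

static inline unsigned long virt_to_phys_sketch(const void *virt)
{
	return (unsigned long)virt - DIRECT_MAP_BASE;
}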
2.1 Nodes
The full pglist_data structure is pasted below; the exact meaning of each field you will have to look up yourself. The main points to note are the differences between linux 2.6 and linux 5.0:
- lru_lock: in 5.0 it was moved into the node, so LRU locking is now managed per node.
- lruvec: likewise moved into the node in 5.0.
- node_mem_map is a flat-memory-model concept; the various Linux memory models (FLATMEM, SPARSEMEM, ...) are a topic for a separate article.
typedef struct pglist_data {
struct zone node_zones[MAX_NR_ZONES];
struct zonelist node_zonelists[MAX_ZONELISTS];
int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
struct page *node_mem_map;
#ifdef CONFIG_PAGE_EXTENSION
struct page_ext *node_page_ext;
#endif
#endif
#if defined(CONFIG_MEMORY_HOTPLUG) || defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
/*
* Must be held any time you expect node_start_pfn,
* node_present_pages, node_spanned_pages or nr_zones to stay constant.
*
* pgdat_resize_lock() and pgdat_resize_unlock() are provided to
* manipulate node_size_lock without checking for CONFIG_MEMORY_HOTPLUG
* or CONFIG_DEFERRED_STRUCT_PAGE_INIT.
*
* Nests above zone->lock and zone->span_seqlock
*/
spinlock_t node_size_lock;
#endif
unsigned long node_start_pfn;
unsigned long node_present_pages; /* total number of physical pages */
unsigned long node_spanned_pages; /* total size of physical page
range, including holes */
int node_id;
wait_queue_head_t kswapd_wait; // wait queue added for kswapd; every node has its own kswapdN thread
wait_queue_head_t pfmemalloc_wait;
struct task_struct *kswapd; /* Protected by
mem_hotplug_begin/end() */
int kswapd_order;
enum zone_type kswapd_classzone_idx;
int kswapd_failures; /* Number of 'reclaimed == 0' runs */
#ifdef CONFIG_COMPACTION
int kcompactd_max_order;
enum zone_type kcompactd_classzone_idx;
wait_queue_head_t kcompactd_wait;
struct task_struct *kcompactd;
#endif
/*
* This is a per-node reserve of pages that are not available
* to userspace allocations.
*/
unsigned long totalreserve_pages;
#ifdef CONFIG_NUMA
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
unsigned long min_slab_pages;
#endif /* CONFIG_NUMA */
/* Write-intensive fields used by page reclaim */
ZONE_PADDING(_pad1_)
spinlock_t lru_lock;
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/*
* If memory initialisation on large machines is deferred then this
* is the first PFN that needs to be initialised.
*/
unsigned long first_deferred_pfn;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
spinlock_t split_queue_lock;
struct list_head split_queue;
unsigned long split_queue_len;
#endif
/* Fields commonly accessed by the page reclaim scanner */
struct lruvec lruvec;
unsigned long flags;
ZONE_PADDING(_pad2_)
/* Per-node vmstats */
struct per_cpu_nodestat __percpu *per_cpu_nodestats;
atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
} pg_data_t;
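As a quick illustration of how per-node data is reached, here is a minimal, illustrative module that walks every online node and prints a few of the fields above. for_each_online_pgdat() is the real kernel helper; the module itself is only a sketch, and as out-of-tree code the symbols it relies on may not all be exported.

#include <linux/module.h>
#include <linux/mmzone.h>

/* Walk all online NUMA nodes and dump a few pg_data_t fields. */
static int __init dump_nodes_init(void)
{
	pg_data_t *pgdat;

	for_each_online_pgdat(pgdat)
		pr_info("node %d: start_pfn=%lu present=%lu spanned=%lu\n",
			pgdat->node_id, pgdat->node_start_pfn,
			pgdat->node_present_pages, pgdat->node_spanned_pages);
	return 0;
}

static void __exit dump_nodes_exit(void)
{
}

module_init(dump_nodes_init);
module_exit(dump_nodes_exit);
MODULE_LICENSE("GPL");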
In protected mode, IA-32 can use only 4GB of physical memory; with PAE it can address up to 64GB.
2.2 Zones
How do we tell which zone an allocation is served from? See the article below.
https://www.codenong.com/cs109638783/
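The short version is that the zone modifier bits in the gfp mask select the highest zone the allocation may come from; the real mapping is done by gfp_zone() through a packed lookup table (GFP_ZONE_TABLE). A simplified, illustrative version of that decision, assuming the corresponding CONFIG_ZONE_* options are enabled:

#include <linux/gfp.h>

/* Simplified stand-in for gfp_zone(): pick the highest zone the caller
 * allows, based on the zone modifier bits in the gfp mask. The real
 * function uses a packed table and handles corner cases this ignores. */
static enum zone_type pick_zone_sketch(gfp_t flags)
{
	if (flags & __GFP_DMA)
		return ZONE_DMA;
	if (flags & __GFP_DMA32)
		return ZONE_DMA32;
	if ((flags & __GFP_MOVABLE) && (flags & __GFP_HIGHMEM))
		return ZONE_MOVABLE;	/* movable allocations may land here */
	if (flags & __GFP_HIGHMEM)
		return ZONE_HIGHMEM;	/* meaningful on 32-bit only */
	return ZONE_NORMAL;
}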
The zone structure and the purpose of its fields are as follows:
struct zone {
/* Read-mostly fields */
/* zone watermarks, access with *_wmark_pages(zone) macros */
unsigned long _watermark[NR_WMARK]; // zone watermarks, indexed by WMARK_MIN, WMARK_LOW and WMARK_HIGH
unsigned long watermark_boost;
unsigned long nr_reserved_highatomic;
/*
* We don't know if the memory that we're going to allocate will be
* freeable or/and it will be released eventually, so to avoid totally
* wasting several GB of ram we must reserve some of the lower zone
* memory (otherwise we risk to run OOM on the lower zones despite
* there being tons of freeable ram on the higher zones). This array is
* recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
* changes.
*/
long lowmem_reserve[MAX_NR_ZONES];
#ifdef CONFIG_NUMA
int node;
#endif
struct pglist_data *zone_pgdat; // back-pointer to the node this zone belongs to
struct per_cpu_pageset __percpu *pageset;
#ifndef CONFIG_SPARSEMEM
/*
* Flags for a pageblock_nr_pages block. See pageblock-flags.h.
* In SPARSEMEM, this map is stored in struct mem_section
*/
unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
unsigned long zone_start_pfn;
/*
* spanned_pages is the total pages spanned by the zone, including
* holes, which is calculated as:
* spanned_pages = zone_end_pfn - zone_start_pfn;
*
* present_pages is physical pages existing within the zone, which
* is calculated as:
* present_pages = spanned_pages - absent_pages(pages in holes);
*
* managed_pages is present pages managed by the buddy system, which
* is calculated as (reserved_pages includes pages allocated by the
* bootmem allocator):
* managed_pages = present_pages - reserved_pages;
*
* So present_pages may be used by memory hotplug or memory power
* management logic to figure out unmanaged pages by checking
* (present_pages - managed_pages). And managed_pages should be used
* by page allocator and vm scanner to calculate all kinds of watermarks
* and thresholds.
*
* Locking rules:
*
* zone_start_pfn and spanned_pages are protected by span_seqlock.
* It is a seqlock because it has to be read outside of zone->lock,
* and it is done in the main allocator path. But, it is written
* quite infrequently.
*
* The span_seq lock is declared along with zone->lock because it is
* frequently read in proximity to zone->lock. It's good to
* give them a chance of being in the same cacheline.
*
* Write access to present_pages at runtime should be protected by
* mem_hotplug_begin/end(). Any reader who can't tolerant drift of
* present_pages should get_online_mems() to get a stable value.
*/
atomic_long_t managed_pages;
unsigned long spanned_pages;
unsigned long present_pages;
const char *name;
#ifdef CONFIG_MEMORY_ISOLATION
/*
* Number of isolated pageblock. It is used to solve incorrect
* freepage counting problem due to racy retrieving migratetype
* of pageblock. Protected by zone->lock.
*/
unsigned long nr_isolate_pageblock;
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
/* see spanned/present_pages for more description */
seqlock_t span_seqlock;
#endif
int initialized;
/* Write-intensive fields used from the page allocator */
ZONE_PADDING(_pad1_)
/* free areas of different sizes */
struct free_area free_area[MAX_ORDER]; // buddy free lists, one per allocation order; each order is further split by migrate type (unmovable, movable, reclaimable, ...)
/* zone flags, see below */
unsigned long flags;
/* Primarily protects free_area */
spinlock_t lock; // primarily protects free_area
/* Write-intensive fields used by compaction and vmstats. */
ZONE_PADDING(_pad2_)
/*
* When free pages are below this point, additional steps are taken
* when reading the number of free pages to avoid per-cpu counter
* drift allowing watermarks to be breached
*/
unsigned long percpu_drift_mark;
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/* pfn where compaction free scanner should start */
unsigned long compact_cached_free_pfn; // used by memory compaction
/* pfn where async and sync compaction migration scanner should start */
unsigned long compact_cached_migrate_pfn[2];
#endif
#ifdef CONFIG_COMPACTION
/*
* On compaction failure, 1<<compact_defer_shift compactions
* are skipped before trying again. The number attempted since
* last failure is tracked with compact_considered.
*/
unsigned int compact_considered;
unsigned int compact_defer_shift;
int compact_order_failed;
#endif
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
/* Set to true when the PG_migrate_skip bits should be cleared */
bool compact_blockskip_flush;
#endif
bool contiguous;
ZONE_PADDING(_pad3_)
/* Zone statistics */
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS]; // per-zone statistics, kept as atomic counters
atomic_long_t vm_numa_stat[NR_VM_NUMA_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;
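To make the _watermark and lowmem_reserve fields concrete, here is a simplified, illustrative version of the check the page allocator performs before handing out pages from a zone. The real logic lives in __zone_watermark_ok() in mm/page_alloc.c and handles much more (the highatomic reserve, ALLOC_* flags, order-0 fast paths); the function name below is a hypothetical stand-in.

#include <linux/mmzone.h>

/* Simplified sketch of a zone watermark check. 'mark' is typically one
 * of the zone's watermarks (e.g. min_wmark_pages(zone)); classzone_idx
 * is the highest zone the allocation could have used, which selects how
 * much of this zone must stay reserved for lower-zone allocations. */
static bool zone_watermark_ok_sketch(struct zone *z, unsigned int order,
				     unsigned long mark, int classzone_idx,
				     unsigned long free_pages)
{
	/* Keep the lowmem reserve for allocations that could have gone
	 * to a higher zone but fell back to this one. */
	unsigned long min = mark + z->lowmem_reserve[classzone_idx];

	/* Roughly: after giving out 2^order pages there must still be
	 * at least 'min' free pages left in the zone. */
	return free_pages >= min + (1UL << order);
}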
2.3 Physical Pages
Each struct page corresponds to one physical page frame; depending on what the frame is used for, a different member of the unions below is active. What page tables actually store are PFNs (page frame numbers): the frame number of the frame containing physical address X is simply X >> PAGE_SHIFT. A small sketch of this arithmetic follows, and after it the 5.0 definition of struct page.
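A minimal sketch of the PFN arithmetic, assuming the flat memory model so that the global mem_map array can be indexed directly; the helper names are mine, the real kernel macros are page_to_pfn()/pfn_to_page():

#include <linux/mm.h>

/* Physical address -> page frame number: drop the offset-within-page
 * bits. PAGE_SHIFT is 12 for 4K pages. */
static inline unsigned long phys_to_pfn_sketch(unsigned long phys)
{
	return phys >> PAGE_SHIFT;
}

/* Under FLATMEM every frame's struct page sits at a fixed index in the
 * global mem_map array (SPARSEMEM goes through mem_section instead). */
static inline struct page *pfn_to_page_flat_sketch(unsigned long pfn)
{
	return mem_map + (pfn - ARCH_PFN_OFFSET);
}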
struct page {
unsigned long flags; /* Atomic flags, some possibly updated asynchronously;
* the high bits still encode the zone/node/section this page belongs to (see page_zone()) */
/*
* Five words (20/40 bytes) are available in this union.
* WARNING: bit 0 of the first word is used for PageTail(). That
* means the other users of this union MUST NOT use the bit to
* avoid collision and false-positive PageTail().
*/
union {
struct { /* Page cache and anonymous pages */
/**
* @lru: Pageout list, eg. active_list protected by
* zone_lru_lock. Sometimes used as a generic list
* by the page owner.
*/
struct list_head lru;
/* See page-flags.h for PAGE_MAPPING_FLAGS */
struct address_space *mapping;
pgoff_t index; /* Our offset within mapping. */
/**
* @private: Mapping-private opaque data.
* Usually used for buffer_heads if PagePrivate.
* Used for swp_entry_t if PageSwapCache.
* Indicates order in the buddy system if PageBuddy.
*/
unsigned long private;
};
struct { /* slab, slob and slub */
union {
struct list_head slab_list; /* uses lru */
struct { /* Partial pages */
struct page *next;
#ifdef CONFIG_64BIT
int pages; /* Nr of pages left */
int pobjects; /* Approximate count */
#else
short int pages;
short int pobjects;
#endif
};
};
struct kmem_cache *slab_cache; /* not slob */
/* Double-word boundary */
void *freelist; /* first free object */
union {
void *s_mem; /* slab: first object */
unsigned long counters; /* SLUB */
struct { /* SLUB */
unsigned inuse:16;
unsigned objects:15;
unsigned frozen:1;
};
};
};
struct { /* Tail pages of compound page */
unsigned long compound_head; /* Bit zero is set */
/* First tail page only */
unsigned char compound_dtor;
unsigned char compound_order;
atomic_t compound_mapcount;
};
struct { /* Second tail page of compound page */
unsigned long _compound_pad_1; /* compound_head */
unsigned long _compound_pad_2;
struct list_head deferred_list;
};
struct { /* Page table pages */
unsigned long _pt_pad_1; /* compound_head */
pgtable_t pmd_huge_pte; /* protected by page->ptl */
unsigned long _pt_pad_2; /* mapping */
union {
struct mm_struct *pt_mm; /* x86 pgds only */
atomic_t pt_frag_refcount; /* powerpc */
};
#if ALLOC_SPLIT_PTLOCKS
spinlock_t *ptl;
#else
spinlock_t ptl;
#endif
};
struct { /* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
unsigned long hmm_data;
unsigned long _zd_pad_1; /* uses mapping */
};
/** @rcu_head: You can use this to free a page by RCU. */
struct rcu_head rcu_head;
};
union { /* This union is 4 bytes in size. */
/*
* If the page can be mapped to userspace, encodes the number
* of times this page is referenced by a page table.
*/
atomic_t _mapcount;
/*
* If the page is neither PageSlab nor mappable to userspace,
* the value stored here may help determine what this page
* is used for. See page-flags.h for a list of page types
* which are currently stored here.
*/
unsigned int page_type;
unsigned int active; /* SLAB */
int units; /* SLOB */
};
/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
atomic_t _refcount;
#ifdef CONFIG_MEMCG
struct mem_cgroup *mem_cgroup;
#endif
/*
* On machines where all RAM is mapped into kernel address space,
* we can simply calculate the virtual address. On machines with
* highmem some memory is mapped into kernel virtual memory
* dynamically, so we need a place to store that address.
* Note that this field could be 16 bits on x86 ... ;)
*
* Architectures with slow multiplication can define
* WANT_PAGE_VIRTUAL in asm/page.h
*/
#if defined(WANT_PAGE_VIRTUAL)
void *virtual; /* Kernel virtual address (NULL if
not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
int _last_cpupid;
#endif
} _struct_page_alignment;
Chapter 3: Page Table Management
__startup_64
Early initialization entry point on 64-bit machines (sets up the early boot page tables).
Chapter 4: Memory Allocation and Management
On 64-bit there is no longer any notion of high memory, so why do we still need vmalloc? (Short answer: vmalloc returns virtually contiguous memory backed by physically scattered pages, so large allocations do not depend on finding a large contiguous physical block.)
Physical memory allocation
Virtual memory allocation
vmalloc
brk
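This chapter is still only an outline; as a placeholder for the vmalloc entry above, here is a minimal, illustrative module. vmalloc()/vfree() are the real interfaces, everything else is example scaffolding.

#include <linux/module.h>
#include <linux/vmalloc.h>

static void *buf;

static int __init vmalloc_demo_init(void)
{
	/* 4 MiB of virtually contiguous memory; the underlying physical
	 * pages may be scattered, which is exactly why vmalloc still
	 * matters on 64-bit. */
	buf = vmalloc(4 * 1024 * 1024);
	if (!buf)
		return -ENOMEM;
	pr_info("vmalloc_demo: allocated at %p\n", buf);
	return 0;
}

static void __exit vmalloc_demo_exit(void)
{
	vfree(buf);
}

module_init(vmalloc_demo_init);
module_exit(vmalloc_demo_exit);
MODULE_LICENSE("GPL");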
Chapter 6: Page Cache and Block Cache
Several different write-back schemes:
- Dedicated threads, for example the pdflush threads (per-bdi flusher threads in newer kernels), wake up periodically and write dirty cached data back
- The second mode of operation of those threads: when the amount of modified data in the cache grows significantly, writeback is triggered
- Userspace has system calls such as sync to make the system flush dirty data on demand
Two different structures are used for the block cache.
page_cache_alloc
Allocates a fresh page that is about to be added to the page cache.
add_to_page_cache
Adds a page to the page cache.
find_get_page
Looks up a page in the page cache and takes a reference on it; the cache is kept in a radix tree (an XArray since 4.20), so lookups by (mapping, offset) are cheap.
Waiting on a page: a page that is being written back carries the PG_writeback flag bit, and callers can sleep until it clears.
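A small sketch tying the three functions above together: the classic "look it up, otherwise allocate and insert" pattern of a read path. The helper name is mine and the error handling is reduced to the minimum; real code normally goes through find_or_create_page()/pagecache_get_page() instead.

#include <linux/pagemap.h>
#include <linux/mm.h>

/* Illustrative only: return the page caching offset 'index' of
 * 'mapping', inserting a freshly allocated page if it is not cached
 * yet. On the insert path the page comes back locked. */
static struct page *get_cached_page_sketch(struct address_space *mapping,
					   pgoff_t index)
{
	struct page *page;

	page = find_get_page(mapping, index);	/* XArray lookup, takes a ref */
	if (page)
		return page;

	page = page_cache_alloc(mapping);	/* fresh page for the cache */
	if (!page)
		return NULL;

	/* The _lru variant also parks the page in the per-CPU pagevec so
	 * that it ends up on the inactive LRU list (see chapter 7). */
	if (add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
		put_page(page);
		return NULL;
	}
	return page;	/* locked; caller fills it and calls unlock_page() */
}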
Chapter 7: Swap Management
Page reclaim cannot wait until memory has actually run out. kswapd is woken whenever free memory drops to the low watermark; if it falls further to min, direct reclaim kicks in; once free memory climbs back to high, reclaim stops.
Newly cached pages do not go onto the lru_list right away either; they sit in a per-CPU pagevec first.
So when does memory reclaim actually happen?
- Synchronous page reclaim: when an allocation fails, the allocator simply calls into page frame reclaiming on the spot. For example, if alloc_page_buffers fails to allocate page buffers it calls free_more_memory to reclaim memory, and if __alloc_pages fails it calls try_to_free_pages directly. (free_more_memory itself also ends up in try_to_free_pages.)
- Suspend to disk (hibernation): hibernating needs a lot of memory, so shrink_all_memory is called to reclaim a requested number of page frames.
- kswapd: kswapd is a kernel thread dedicated to page reclaim.
- The slab allocator periodically queues work on system_wq, so cache_reap runs periodically to reclaim unused memory held in slabs.
What is the reclaim policy, i.e. which page frames can be reclaimed?
- Page frames that cannot be reclaimed: free pages (already on the free lists, so the reclaim machinery does not need to bother with them), reserved pages (PG_reserved set, e.g. kernel text and data segments), page frames dynamically allocated by the kernel, page frames backing the kernel stacks of user processes, temporarily locked page frames (PG_locked set, e.g. while disk I/O is in flight), and mlocked page frames (pages of VMAs carrying VM_LOCKED).
- Page frames that can be swapped out to disk (swappable): anonymous pages in user space (process heap and stack pages) and tmpfs page frames.
- Page frames that can be synchronized to disk (syncable): user-space file-mapped pages, page-cache page frames (whose contents correspond to some file on disk), block-device buffer caches, and page frames in disk caches such as the inode cache.
- Page frames that can be freed directly: unused page frames held by the various in-kernel caches (e.g. the slab allocator) and unused dentry-cache entries.
Beyond that, there are some generally agreed rules:
- Avoid modifying page tables where possible. When reclaiming unused kernel caches we can free pages directly without touching any page table entry, whereas reclaiming a user-space page usually means marking the corresponding PTE invalid.
- Unless a page has been pinned with mlock, every page frame mapped by user space should be reclaimable.
- If a page frame is shared by several processes, all the PTE entries mapping it must be cleared before the page can be reclaimed.
- Do not reclaim page frames that were used (accessed) recently; in other words, prefer reclaiming page frames that have not been accessed for a while.
- Prefer reclaiming page frames that can be reclaimed without disk I/O.
To reduce the number of operations on active_list and inactive_list (frequent operations on them cause lock contention), an LRU cache was inserted in front of them: a per-CPU pagevec buffers pages and moves them onto the LRU lists in batches (see the sketch below).
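A minimal sketch of the batching idea. The struct and function names here are made up for the example; in the kernel this is lru_cache_add() filling a per-CPU struct pagevec (PAGEVEC_SIZE entries) and draining it with __pagevec_lru_add().

#include <linux/mm.h>
#include <linux/percpu.h>

#define BATCH_SIZE 15			/* PAGEVEC_SIZE is 15 in 5.0 */

struct lru_batch {
	unsigned int nr;
	struct page *pages[BATCH_SIZE];
};

static DEFINE_PER_CPU(struct lru_batch, lru_add_batch);

static void drain_lru_batch(struct lru_batch *b)
{
	/* Hypothetical helper: take pgdat->lru_lock once, add all b->nr
	 * pages to the inactive LRU list, then drop the lock. */
}

static void lru_add_sketch(struct page *page)
{
	struct lru_batch *b = &get_cpu_var(lru_add_batch);

	b->pages[b->nr++] = page;
	if (b->nr == BATCH_SIZE) {
		/* Only now is the LRU lock taken, once per batch. */
		drain_lru_batch(b);
		b->nr = 0;
	}
	put_cpu_var(lru_add_batch);
}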
Swap areas
A swap area is created by the user-space tool mkswap; activating it requires the swapon system call (sys_swapon).
Swap management changed somewhat in linux 5.0; a number of fields were added:
struct swap_info_struct {
unsigned long flags; /* SWP_USED etc: see above; in-use and other flag bits */
signed short prio; /* swap priority of this type */
struct plist_node list; /* entry in swap_active_head */
signed char type; /* strange name for an index */
unsigned int max; /* extent of the swap_map: number of pages in this swap area, counting not only usable slots but also slots that are bad or used for management purposes */
unsigned char *swap_map; /* vmalloc'ed array of usage counts; one entry per swap slot, each entry a reference count */
struct swap_cluster_info *cluster_info; /* cluster info. Only for SSD */
struct swap_cluster_list free_clusters; /* free clusters list */
unsigned int lowest_bit; /* index of first free in swap_map; together with highest_bit below this bounds the search for free slots and speeds it up */
unsigned int highest_bit; /* index of last free in swap_map */
unsigned int pages; /* total of usable pages of swap (number of usable slots) */
unsigned int inuse_pages; /* number of those currently in use */
unsigned int cluster_next; /* likely index for next allocation */
unsigned int cluster_nr; /* countdown to next cluster search */
struct percpu_cluster __percpu *percpu_cluster; /* per cpu's swap location */
struct swap_extent *curr_swap_extent; /* the extent currently being operated on; there may be several extents chained together, so this finds the right one directly */
struct swap_extent first_swap_extent; /* first extent of the list; a file-backed swap area is not necessarily contiguous on disk, so it is described by a list of extents */
struct block_device *bdev; /* swap device, or bdev of swap file: the block_device of the underlying block device holding the file or partition */
struct file *swap_file; /* seldom referenced */
unsigned int old_block_size; /* seldom referenced */
#ifdef CONFIG_FRONTSWAP
unsigned long *frontswap_map; /* frontswap in-use, one bit per page */
atomic_t frontswap_pages; /* frontswap pages in-use counter */
#endif
spinlock_t lock; /*
* protect map scan related fields like
* swap_map, lowest_bit, highest_bit,
* inuse_pages, cluster_next,
* cluster_nr, lowest_alloc,
* highest_alloc, free/discard cluster
* list. other fields are only changed
* at swapon/swapoff, so are protected
* by swap_lock. changing flags need
* hold this lock and swap_lock. If
* both locks need hold, hold swap_lock
* first.
*/
spinlock_t cont_lock; /*
* protect swap count continuation page
* list.
*/
struct work_struct discard_work; /* discard worker */
struct swap_cluster_list discard_clusters; /* discard clusters list */
struct plist_node avail_lists[0]; /*
* entries in swap_avail_heads, one
* entry per node.
* Must be last as the number of the
* array is nr_node_ids, which is not
* a fixed value so have to allocate
* dynamically.
* And it has to be an array so that
* plist_for_each_* can work.
*/
};
The first slot of a swap area is used to store the swap header, which describes the characteristics of the swap area.
The swap cache
What is the swap cache? It exists so that, when a page shared by several processes is swapped in, the kernel can tell whether the page is already present in memory.
- On swap-out: the page first goes into the swap cache and is written to the swap area from there.
- On swap-in: the swapped-in page stays in the swap cache until every process that needs it has requested it from the swap area and learned the page's new location in memory.
The structure used for a swap entry is swp_entry_t:
typedef struct {
unsigned long val;
} swp_entry_t;
Inside val, the low bits (roughly the low 60 on a 64-bit kernel) hold the offset of the slot within the swap area, while the top 5 bits identify the swap area (the type). The code defines helpers that extract the type and offset directly:
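A sketch of those helpers, following the shape of the kernel's swp_type()/swp_offset()/swp_entry() in include/linux/swapops.h. The names carry a _sketch suffix and the shift constant is an assumption; the exact value differs slightly between kernel versions.

#include <linux/kernel.h>
#include <linux/swap.h>

/* Top MAX_SWAPFILES_SHIFT (= 5) bits of val select the swap area,
 * the remaining low bits are the slot offset inside it. */
#define SWP_TYPE_SHIFT_SKETCH	(BITS_PER_LONG - MAX_SWAPFILES_SHIFT)
#define SWP_OFFSET_MASK_SKETCH	((1UL << SWP_TYPE_SHIFT_SKETCH) - 1)

static inline unsigned swp_type_sketch(swp_entry_t entry)
{
	return entry.val >> SWP_TYPE_SHIFT_SKETCH;
}

static inline pgoff_t swp_offset_sketch(swp_entry_t entry)
{
	return entry.val & SWP_OFFSET_MASK_SKETCH;
}

/* And the inverse: pack (type, offset) back into an entry. */
static inline swp_entry_t swp_entry_sketch(unsigned long type, pgoff_t offset)
{
	swp_entry_t ret;

	ret.val = (type << SWP_TYPE_SHIFT_SKETCH) |
		  (offset & SWP_OFFSET_MASK_SKETCH);
	return ret;
}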
Closing
Sigh, this is a bit awkward.