Memory Study Notes (10)

Starting from do_page_fault

Posted by 大狗 on September 16, 2020

Preface

I went through the memory subsystem quite a while ago but never took many notes, so this time I will write things down as I read. The references are: 《Linux内核0.11完全注释》, 《Linux内核源代码情景分析》, and the ATCP memory implementation (code maintained at my company). Oh, and I am reading SICP at the same time, hoping to shore up some fundamentals along the way. Of course, what you learn on paper always feels shallow; to really understand something you have to work through it yourself, so let us start from memory that has already been set up, beginning on the user-space side. One more remark: locking, whether interrupts are enabled, and permission checks at each point will each be analyzed in three separate articles.

Scenarios that trigger a page fault

When does a page fault occur? Split it by where the faulting linear address lies:

  • The faulting linear address is in kernel space. If the access came from user mode, it is an illegal access and the process is killed with a signal; if it came from kernel mode, it is either a fault in the vmalloc/direct-mapping area that can be fixed up from the master kernel page tables, or a genuine kernel bug.
  • The faulting linear address is in user space. If the access came from user mode, either the page is simply NOT present yet or the stack needs to be expanded; if it came from kernel mode, it is either a kernel bug or the kernel touching user memory on behalf of a system call, which may likewise end up calling expand_stack.
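To make these cases concrete, here is a small stand-alone sketch (plain user-space C, not kernel code) of how the x86 page-fault error code encodes them. The PF_* bit values match the constants used in the kernel function quoted later; the decode() helper itself is purely illustrative.

/*
 * Decode the x86 page-fault error code. The PF_* bit positions are the
 * architectural ones; decode() is just a toy pretty-printer.
 */
#include <stdio.h>

#define PF_PROT  (1 << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE (1 << 1)  /* the access was a write                       */
#define PF_USER  (1 << 2)  /* the fault happened in user mode (CPL 3)      */
#define PF_RSVD  (1 << 3)  /* a reserved page-table bit was found set      */
#define PF_INSTR (1 << 4)  /* the fault was on an instruction fetch        */

static void decode(unsigned long error_code)
{
	printf("%-12s %-5s access from %s mode%s\n",
	       (error_code & PF_PROT)  ? "protection"  : "not-present",
	       (error_code & PF_WRITE) ? "write" : "read",
	       (error_code & PF_USER)  ? "user"  : "kernel",
	       (error_code & PF_RSVD)  ? ", reserved bit set" : "");
}

int main(void)
{
	decode(PF_USER | PF_WRITE);            /* user write to a missing page    */
	decode(0);                             /* kernel read of a missing page   */
	decode(PF_USER | PF_PROT | PF_WRITE);  /* e.g. copy-on-write in user mode */
	return 0;
}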

Starting with do_page_fault

Let us start with the page-fault exception itself, assuming we hit the stack-growth case: no physical page has been set up yet for the address being accessed, so the access raises a page fault and the CPU enters do_page_fault, with the hardware error code and pt_regs holding the saved register state. Below we walk through the code step by step.

  • First obtain the descriptor of the process that was running when we trapped into the kernel. The current macro boils down to get_current(), which returns current_thread_info()->task. We are running on the current process's kernel stack; thread_info sits at the bottom of that stack, which is two pages long and grows downward, so it can be located just by masking the stack pointer (see the sketch right after this list). From there we get the process's mm_struct, its memory-management structure, and the faulting address is read from the CR2 register. I skip kmemcheck_active and kmemcheck_hide; they are hooks of the kmemcheck memory checker, which I have not read yet.
  • fault_in_kernel_space checks whether the faulting linear address lies in kernel space; looking at the implementation, it simply tests whether the address is above the user/kernel boundary (PAGE_OFFSET). If the fault did happen above that boundary, the error code is checked for PF_RSVD | PF_USER | PF_PROT: if any of those bits is set, then user mode touched a kernel address or the kernel hit a hole/reserved area, and the fault is reported and the offender terminated with a signal. Otherwise it is a legitimate kernel reference, and vmalloc_fault copies the missing entries for the direct-mapping or vmalloc area from the master kernel page table (init_mm.pgd) into the current page tables. spurious_fault then handles faults caused by a stale TLB entry (lazy TLB flushing): eagerly updating every CPU's TLB on each permission change is unrealistic and expensive, since flushing another CPU's TLB is a painful operation.
  • Now we know the faulting linear address is not a kernel one. Check whether the fault came from user mode, tag the error code with PF_USER if so, and re-enable interrupts.
  • If error_code & PF_RSVD is set, the CPU found reserved bits set in a page-table entry, which means the page tables themselves are corrupted; pgtable_bad handles it.
  • When the fault happens we should not already hold mmap_sem, so down_read_trylock() is attempted first. If it fails and the fault came from kernel mode with no fixup entry in the exception table, it is treated as a kernel error and handled via bad_area_nosemaphore; otherwise we simply block in down_read().
  • If the code gets this far, this is a page fault on a user-space address, and we can now check whether it is a stack case.
  • With the semaphore held, find_vma looks up the vma covering address. If there is none, no region can possibly contain the address, so this is an out-of-bounds access and bad_area is reported. If a vma is found but starts above the address, it is only acceptable when the vma is a downward-growing stack (VM_GROWSDOWN). For a user-mode fault, if address + 65536 + 32 * sizeof(unsigned long) < regs->sp, the access lands far below the stack pointer, which is always a bug (the cushion only exists so that instructions like enter and pusha keep working), and bad_area is reported. Otherwise this is a legitimate push below the current stack, so expand_stack is called to grow the stack downward (stacks grow toward lower addresses); we assume it succeeds.
  • Check whether the access that caused the fault is compatible with the vma's access rights (access_error).
  • Call handle_mm_fault to actually allocate the physical page.
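Before diving into the real do_page_fault below, here is a minimal user-space sketch of the trick behind current mentioned in the first bullet, assuming the classic x86_32 layout in which thread_info sits at the bottom of a THREAD_SIZE-aligned, two-page kernel stack: masking any stack address down to that boundary lands on thread_info. The structures and the fake stack are made up for illustration.

/*
 * Sketch: recover a thread_info-like structure from a pointer into an
 * 8 KiB-aligned "kernel stack", the way the old current_thread_info()
 * did on x86_32. Illustration only, not the kernel implementation.
 */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define THREAD_SIZE (2 * 4096)                    /* two pages */

struct task_struct { int pid; };
struct thread_info { struct task_struct *task; };

static struct thread_info *current_thread_info_sketch(uintptr_t sp)
{
	/* Clear the low bits: drop down to the THREAD_SIZE boundary. */
	return (struct thread_info *)(sp & ~((uintptr_t)THREAD_SIZE - 1));
}

int main(void)
{
	/* Fake kernel stack: THREAD_SIZE bytes, THREAD_SIZE-aligned. */
	void *stack = aligned_alloc(THREAD_SIZE, THREAD_SIZE);
	struct task_struct task = { .pid = 1234 };
	struct thread_info *ti;

	if (!stack)
		return 1;
	ti = stack;                                   /* thread_info at the bottom */
	ti->task = &task;

	/* Any address inside the stack masks back to the same thread_info. */
	uintptr_t fake_sp = (uintptr_t)stack + THREAD_SIZE - 64;
	printf("recovered pid = %d\n",
	       current_thread_info_sketch(fake_sp)->task->pid);   /* prints 1234 */

	free(stack);
	return 0;
}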
dotraplinkage void __kprobes
do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	struct vm_area_struct *vma;
	struct task_struct *tsk;
	unsigned long address;
	struct mm_struct *mm;
	int write;
	int fault;

	tsk = current;
	mm = tsk->mm;

	/* Get the faulting address: */
	address = read_cr2();

	/*
	 * Detect and handle instructions that would cause a page fault for
	 * both a tracked kernel page and a userspace page.
	 */
	if (kmemcheck_active(regs))
		kmemcheck_hide(regs);
	prefetchw(&mm->mmap_sem);

	if (unlikely(kmmio_fault(regs, address)))
		return;

	/*
	 * We fault-in kernel-space virtual memory on-demand. The
	 * 'reference' page table is init_mm.pgd.
	 *
	 * NOTE! We MUST NOT take any locks for this case. We may
	 * be in an interrupt or a critical region, and should
	 * only copy the information from the master page table,
	 * nothing more.
	 *
	 * This verifies that the fault happens in kernel space
	 * (error_code & 4) == 0, and that the fault was not a
	 * protection error (error_code & 9) == 0.
	 */
	if (unlikely(fault_in_kernel_space(address))) {
		if (!(error_code & (PF_RSVD | PF_USER | PF_PROT))) {
			if (vmalloc_fault(address) >= 0)
				return;

			if (kmemcheck_fault(regs, address, error_code))
				return;
		}

		/* Can handle a stale RO->RW TLB: */
		if (spurious_fault(error_code, address))
			return;

		/* kprobes don't want to hook the spurious faults: */
		if (notify_page_fault(regs))
			return;
		/*
		 * Don't take the mm semaphore here. If we fixup a prefetch
		 * fault we could otherwise deadlock:
		 */
		bad_area_nosemaphore(regs, error_code, address);

		return;
	}

	/* kprobes don't want to hook the spurious faults: */
	if (unlikely(notify_page_fault(regs)))
		return;
	/*
	 * It's safe to allow irq's after cr2 has been saved and the
	 * vmalloc fault has been handled.
	 *
	 * User-mode registers count as a user access even for any
	 * potential system fault or CPU buglet:
	 */
	if (user_mode_vm(regs)) {
		local_irq_enable();
		error_code |= PF_USER;
	} else {
		if (regs->flags & X86_EFLAGS_IF)
			local_irq_enable();
	}

	if (unlikely(error_code & PF_RSVD))
		pgtable_bad(regs, error_code, address);

	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);

	/*
	 * If we're in an interrupt, have no user context or are running
	 * in an atomic region then we must not take the fault:
	 */
	if (unlikely(in_atomic() || !mm)) {
		bad_area_nosemaphore(regs, error_code, address);
		return;
	}

	/*
	 * When running in the kernel we expect faults to occur only to
	 * addresses in user space.  All other faults represent errors in
	 * the kernel and should generate an OOPS.  Unfortunately, in the
	 * case of an erroneous fault occurring in a code path which already
	 * holds mmap_sem we will deadlock attempting to validate the fault
	 * against the address space.  Luckily the kernel only validly
	 * references user space from well defined areas of code, which are
	 * listed in the exceptions table.
	 *
	 * As the vast majority of faults will be valid we will only perform
	 * the source reference check when there is a possibility of a
	 * deadlock. Attempt to lock the address space, if we cannot we then
	 * validate the source. If this is invalid we can skip the address
	 * space check, thus avoiding the deadlock:
	 */
	if (unlikely(!down_read_trylock(&mm->mmap_sem))) {
		if ((error_code & PF_USER) == 0 &&
		    !search_exception_tables(regs->ip)) {
			bad_area_nosemaphore(regs, error_code, address);
			return;
		}
		down_read(&mm->mmap_sem);
	} else {
		/*
		 * The above down_read_trylock() might have succeeded in
		 * which case we'll have missed the might_sleep() from
		 * down_read():
		 */
		might_sleep();
	}

	vma = find_vma(mm, address);
	if (unlikely(!vma)) {
		bad_area(regs, error_code, address);
		return;
	}
	if (likely(vma->vm_start <= address))
		goto good_area;
	if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
		bad_area(regs, error_code, address);
		return;
	}
	if (error_code & PF_USER) {
		/*
		 * Accessing the stack below %sp is always a bug.
		 * The large cushion allows instructions like enter
		 * and pusha to work. ("enter $65535, $31" pushes
		 * 32 pointers and then decrements %sp by 65535.)
		 */
		if (unlikely(address + 65536 + 32 * sizeof(unsigned long) < regs->sp)) {
			bad_area(regs, error_code, address);
			return;
		}
	}
	if (unlikely(expand_stack(vma, address))) {
		bad_area(regs, error_code, address);
		return;
	}

	/*
	 * Ok, we have a good vm_area for this memory access, so
	 * we can handle it..
	 */
good_area:
	write = error_code & PF_WRITE;

	if (unlikely(access_error(error_code, write, vma))) {
		bad_area_access_error(regs, error_code, address);
		return;
	}

	/*
	 * If for any reason at all we couldn't handle the fault,
	 * make sure we exit gracefully rather than endlessly redo
	 * the fault:
	 */
	fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);

	if (unlikely(fault & VM_FAULT_ERROR)) {
		mm_fault_error(regs, error_code, address, fault);
		return;
	}

	if (fault & VM_FAULT_MAJOR) {
		tsk->maj_flt++;
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ, 1, 0,
				     regs, address);
	} else {
		tsk->min_flt++;
		perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, 0,
				     regs, address);
	}

	check_v8086_mode(regs, address, tsk);

	up_read(&mm->mmap_sem);
}

A look at expand_stack

expand_stack ends up in expand_downwards; the flow is:

  • First check whether the vma already has an anon_vma; if not, anon_vma_prepare allocates one and sets up the anonymous mapping. The vma must have an anon_vma it can be associated with so that the stack region can be locked properly. Note that page frames assigned to a linear region are not released until the region itself is deleted.
  • security_file_mmap is a hook into the LSM security modules; what exactly it checks I have not looked into yet...
  • Lock the root of vma->anon_vma. The anonymous mappings form a list hanging off vma->anon_vma->root; if two children of the same parent manipulated that list at the same time it would be corrupted, so the spinlock is taken to serialize operations on it.
  • With the spinlock held, check again whether the address already falls inside the vma, because another expander may have raced with us and lowered vm_start in the meantime. The arithmetic performed when it has not is sketched right after this list.
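The interesting part of the function below is that arithmetic once the checks pass, so here is a quick numeric sketch with made-up addresses for a hypothetical stack vma just under 0xc0000000: grow is the number of pages added at the low end, and vm_pgoff is moved back by the same amount so that the offsets of the existing pages stay unchanged.

/*
 * Numeric sketch of expand_downwards' growth arithmetic. All addresses are
 * hypothetical; PAGE_SHIFT/PAGE_MASK match the usual 4 KiB page size.
 */
#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_MASK  (~((1UL << PAGE_SHIFT) - 1))

int main(void)
{
	unsigned long vm_start = 0xbffeb000UL;   /* current bottom of the stack vma */
	unsigned long vm_end   = 0xc0000000UL;   /* top of the stack vma            */
	unsigned long address  = 0xbffe9f40UL;   /* faulting address below vm_start */

	address &= PAGE_MASK;                                      /* round down to a page  */
	unsigned long size = vm_end - address;                     /* new total size of vma */
	unsigned long grow = (vm_start - address) >> PAGE_SHIFT;   /* pages added           */

	printf("new size = %#lx bytes, grow = %lu pages\n", size, grow);
	/* On success the kernel then does: vm_start = address; vm_pgoff -= grow; */
	return 0;
}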
	static int expand_downwards(struct vm_area_struct *vma,
				   unsigned long address)
{
	int error;

	/*
	 * We must make sure the anon_vma is allocated
	 * so that the anon_vma locking is not a noop.
	 */
	if (unlikely(anon_vma_prepare(vma)))
		return -ENOMEM;

	address &= PAGE_MASK;
	error = security_file_mmap(NULL, 0, 0, 0, address, 1);
	if (error)
		return error;

	vma_lock_anon_vma(vma);

	/*
	 * vma->vm_start/vm_end cannot change under us because the caller
	 * is required to hold the mmap_sem in read mode.  We need the
	 * anon_vma lock to serialize against concurrent expand_stacks.
	 */

	/* Somebody else might have raced and expanded it already */
	if (address < vma->vm_start) {
		unsigned long size, grow;

		size = vma->vm_end - address;
		grow = (vma->vm_start - address) >> PAGE_SHIFT;

		error = -ENOMEM;
		if (grow <= vma->vm_pgoff) {
			error = acct_stack_growth(vma, size, grow);
			if (!error) {
				vma->vm_start = address;
				vma->vm_pgoff -= grow;
				perf_event_mmap(vma);
			}
		}
	}
	vma_unlock_anon_vma(vma);
	khugepaged_enter_vma_merge(vma);
	return error;
}

A look at handle_mm_fault

  • __set_current_state() sets the current process's state to TASK_RUNNING, and then the PGFAULT event counter is incremented.
  • First compute pgd_offset and check whether the entry is populated; the same check-then-allocate pattern is used for the pud and pmd (and for the pte below). One point worth dwelling on: we are running in the kernel, and the addresses handled here point at page upper/middle directory tables, so do those tables live in high or low memory? On x86_32 without PAE there is nothing to discuss, the pud and pmd levels are folded away and the pud_t is effectively the pgd entry (see the address-split sketch right after this list). For the general case look at pmd = pmd_alloc(mm, pud, address): the call chain is __pmd_alloc ==> pmd_alloc_one(mm, address) ==> get_zeroed_page(GFP_KERNEL|__GFP_REPEAT) ==> alloc_pages, and the value finally returned is page_address(page). A table used as a page directory or middle directory would be reachable through the direct mapping if it sits in low memory, and would need a permanent kernel mapping if it sat in high memory; since no such mapping is created on this path, the returned address must be a low-memory one. In other words, page directories and page middle directories always live in low memory.
  • Then comes the pte. pte_offset_map returns a pte_t * pointing into the page-table page; on 32-bit kernels that page may itself live in high memory, in which case pte_offset_map maps it with kmap_atomic, a temporary kernel mapping, so the linear address we get for the pte can be a high-memory address. And then we reach the main event, handle_pte_fault.
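Since the two bullets above lean on how a linear address indexes the page tables, here is a tiny sketch of the x86_32 non-PAE split, where the pud and pmd levels are folded into the pgd: 10 bits of pgd index, 10 bits of pte index, 12 bits of page offset. The address value is made up.

/*
 * Split a 32-bit linear address into pgd index / pte index / page offset,
 * as on x86_32 without PAE (two hardware levels). Illustration only.
 */
#include <stdio.h>

#define PAGE_SHIFT   12
#define PGDIR_SHIFT  22
#define PTRS_PER_PGD 1024
#define PTRS_PER_PTE 1024

int main(void)
{
	unsigned long address = 0xbffe9f40UL;   /* hypothetical faulting address */

	unsigned long pgd_index = (address >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1);
	unsigned long pte_index = (address >> PAGE_SHIFT)  & (PTRS_PER_PTE - 1);
	unsigned long offset    = address & ((1UL << PAGE_SHIFT) - 1);

	printf("pgd index %lu, pte index %lu, page offset %#lx\n",
	       pgd_index, pte_index, offset);
	return 0;
}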
	int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, unsigned int flags)
{
	pgd_t *pgd;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	__set_current_state(TASK_RUNNING);

	count_vm_event(PGFAULT);

	/* do counter updates before entering really critical section. */
	check_sync_rss_stat(current);

	if (unlikely(is_vm_hugetlb_page(vma)))
		return hugetlb_fault(mm, vma, address, flags);

	pgd = pgd_offset(mm, address);
	pud = pud_alloc(mm, pgd, address);
	if (!pud)
		return VM_FAULT_OOM;
	pmd = pmd_alloc(mm, pud, address);
	if (!pmd)
		return VM_FAULT_OOM;
	if (pmd_none(*pmd) && transparent_hugepage_enabled(vma)) {
		if (!vma->vm_ops)
			return do_huge_pmd_anonymous_page(mm, vma, address,
							  pmd, flags);
	} else {
		pmd_t orig_pmd = *pmd;
		barrier();
		if (pmd_trans_huge(orig_pmd)) {
			if (flags & FAULT_FLAG_WRITE &&
			    !pmd_write(orig_pmd) &&
			    !pmd_trans_splitting(orig_pmd))
				return do_huge_pmd_wp_page(mm, vma, address,
							   pmd, orig_pmd);
			return 0;
		}
	}

	/*
	 * Use __pte_alloc instead of pte_alloc_map, because we can't
	 * run pte_offset_map on the pmd, if an huge pmd could
	 * materialize from under us from a different thread.
	 */
	if (unlikely(pmd_none(*pmd)) && __pte_alloc(mm, vma, pmd, address))
		return VM_FAULT_OOM;
	/* if an huge pmd materialized from under us just retry later */
	if (unlikely(pmd_trans_huge(*pmd)))
		return 0;
	/*
	 * A regular pmd is established and it can't morph into a huge pmd
	 * from under us anymore at this point because we hold the mmap_sem
	 * read mode and khugepaged takes it in write mode. So now it's
	 * safe to run pte_offset_map().
	 */
	pte = pte_offset_map(pmd, address);

	return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

The handle_pte_fault function

This function is the real core: it handles copy-on-write, file-backed pages, swapped-out pages and several other cases; think of it as the general contractor that farms the work out. As noted in the previous step, we already hold a temporary kernel mapping of the page table, i.e. a usable linear address for the pte (even if the underlying page sits in high memory). What remains is to put a concrete physical page behind that pte, which raises a couple of questions: how do we know how much memory to allocate, and once pages are taken off the buddy allocator, how are they managed afterwards?

  • Look at the pte to see whether the page is present in memory. If it is not, then either it never existed, or it was swapped out, or it lives in a file.
  • If the entry is completely empty (pte_none), the page has never existed, so we either allocate fresh memory or set up the mapping for a file. A file mapping has vma->vm_ops != NULL, which is how the two cases are told apart: with no file mapping, do_anonymous_page is called; with one, do_linear_fault handles it.
  • If the entry is non-empty but the page is not present, a pte_file entry means a non-linear file mapping and goes to do_nonlinear_fault; otherwise the page was swapped out, and do_swap_page looks it up in the swap cache or reads it back from the swap area.
  • With the non-present cases covered, look at the last part of the code. If the page is present in physical memory and the fault flags include a write, this can only be a write-protection fault, i.e. copy-on-write; the entry is checked with pte_write to decide whether do_wp_page is needed. If none of the branches above fired, the entry is marked dirty (for a write) and marked recently accessed with pte_mkyoung.
	int handle_pte_fault(struct mm_struct *mm,
		     struct vm_area_struct *vma, unsigned long address,
		     pte_t *pte, pmd_t *pmd, unsigned int flags)
{
	pte_t entry;
	spinlock_t *ptl;

	entry = *pte;
	if (!pte_present(entry)) {
		if (pte_none(entry)) {
			if (vma->vm_ops) {
				if (likely(vma->vm_ops->fault))
					return do_linear_fault(mm, vma, address,
						pte, pmd, flags, entry);
			}
			return do_anonymous_page(mm, vma, address,
						 pte, pmd, flags);
		}
		if (pte_file(entry))
			return do_nonlinear_fault(mm, vma, address,
					pte, pmd, flags, entry);
		return do_swap_page(mm, vma, address,
					pte, pmd, flags, entry);
	}

	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	if (unlikely(!pte_same(*pte, entry)))
		goto unlock;
	if (flags & FAULT_FLAG_WRITE) {
		if (!pte_write(entry))
			return do_wp_page(mm, vma, address,
					pte, pmd, ptl, entry);
		entry = pte_mkdirty(entry);
	}
	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vma, address, pte, entry, flags & FAULT_FLAG_WRITE)) {
		update_mmu_cache(vma, address, pte);
	} else {
		/*
		 * This is needed only for protection faults but the arch code
		 * is not yet telling us if this is a protection fault or not.
		 * This still avoids useless tlb flushes for .text page faults
		 * with threads.
		 */
		if (flags & FAULT_FLAG_WRITE)
			flush_tlb_fix_spurious_fault(vma, address);
	}
unlock:
	pte_unmap_unlock(pte, ptl);
	return 0;
}

do_anonymous_page

This function is where the real allocation happens:

  • First pte_unmap is called to drop the temporary kernel (high-memory) mapping of the page table.

  • For a write fault, alloc_zeroed_user_highpage_movable is called to try to allocate a page, preferably from high memory. It is essentially __alloc_zeroed_user_highpage, and after several layers of static inline wrappers the work ends up in __alloc_pages_nodemask. Note that only a single page is allocated here.

	static int do_anonymous_page(struct mm_struct *mm, struct vm_area_struct *vma,
		unsigned long address, pte_t *page_table, pmd_t *pmd,
		unsigned int flags)
{
	struct page *page;
	spinlock_t *ptl;
	pte_t entry;

	pte_unmap(page_table);

	/* Check if we need to add a guard page to the stack */
	if (check_stack_guard_page(vma, address) < 0)
		return VM_FAULT_SIGBUS;

	/* Use the zero-page for reads */
	if (!(flags & FAULT_FLAG_WRITE)) {
		entry = pte_mkspecial(pfn_pte(my_zero_pfn(address),
						vma->vm_page_prot));
		page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
		if (!pte_none(*page_table))
			goto unlock;
		goto setpte;
	}

	/* Allocate our own private page. */
	if (unlikely(anon_vma_prepare(vma)))
		goto oom;
	page = alloc_zeroed_user_highpage_movable(vma, address);
	if (!page)
		goto oom;
	__SetPageUptodate(page);

	if (mem_cgroup_newpage_charge(page, mm, GFP_KERNEL))
		goto oom_free_page;

	entry = mk_pte(page, vma->vm_page_prot);
	if (vma->vm_flags & VM_WRITE)
		entry = pte_mkwrite(pte_mkdirty(entry));

	page_table = pte_offset_map_lock(mm, pmd, address, &ptl);
	if (!pte_none(*page_table))
		goto release;

	inc_mm_counter_fast(mm, MM_ANONPAGES);
	page_add_new_anon_rmap(page, vma, address);
setpte:
	set_pte_at(mm, address, page_table, entry);

	/* No need to invalidate - it was non-present before */
	update_mmu_cache(vma, address, page_table);
unlock:
	pte_unmap_unlock(page_table, ptl);
	return 0;
release:
	mem_cgroup_uncharge_page(page);
	page_cache_release(page);
	goto unlock;
oom_free_page:
	page_cache_release(page);
oom:
	return VM_FAULT_OOM;
}

The __alloc_pages_nodemask function

The analysis here leaves out details such as fault injection and cpuset configurations that restrict a process to particular CPUs or nodes.

  • Call first_zones_zonelist to pick the preferred zone to allocate from (the fallback idea is sketched right after this list).
  • Call get_page_from_freelist for the first allocation attempt; if it succeeds, the struct page at the head of the allocated block is returned.
  • If get_page_from_freelist comes back empty-handed, __alloc_pages_slowpath is called. That is by far the slowest path, since it may have to wake kswapd and reclaim memory; we leave it aside here.
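As a rough mental model for the zonelist that first_zones_zonelist and the scan in get_page_from_freelist walk, here is a toy sketch assuming a simple x86_32-style node with DMA, Normal and HighMem zones: an allocation whose gfp_mask allows high memory starts at ZONE_HIGHMEM and falls back toward lower zones. The predicate and the loop are illustrative, not the kernel's data structures.

/*
 * Toy zonelist fallback: try the highest zone the request may use, then
 * fall back toward ZONE_DMA. zone_has_free_pages() is a stand-in.
 */
#include <stdio.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, MAX_NR_ZONES };

static const char *zone_name[MAX_NR_ZONES] = { "DMA", "Normal", "HighMem" };

/* Pretend only ZONE_NORMAL currently has enough free pages. */
static int zone_has_free_pages(int z)
{
	return z == ZONE_NORMAL;
}

int main(void)
{
	int high_zoneidx = ZONE_HIGHMEM;   /* e.g. a highmem-capable user allocation */
	int z;

	for (z = high_zoneidx; z >= ZONE_DMA; z--) {
		printf("trying zone %s\n", zone_name[z]);
		if (zone_has_free_pages(z)) {
			printf("allocating from zone %s\n", zone_name[z]);
			return 0;
		}
	}
	printf("no zone could satisfy the request\n");
	return 1;
}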
	struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
			struct zonelist *zonelist, nodemask_t *nodemask)
{
	enum zone_type high_zoneidx = gfp_zone(gfp_mask);
	struct zone *preferred_zone;
	struct page *page;
	int migratetype = allocflags_to_migratetype(gfp_mask);

	gfp_mask &= gfp_allowed_mask;

	lockdep_trace_alloc(gfp_mask);

	might_sleep_if(gfp_mask & __GFP_WAIT);

	if (should_fail_alloc_page(gfp_mask, order))
		return NULL;

	/*
	 * Check the zones suitable for the gfp_mask contain at least one
	 * valid zone. It's possible to have an empty zonelist as a result
	 * of GFP_THISNODE and a memoryless node
	 */
	if (unlikely(!zonelist->_zonerefs->zone))
		return NULL;

	get_mems_allowed();
	/* The preferred zone is used for statistics later */
	first_zones_zonelist(zonelist, high_zoneidx,
				nodemask ? : &cpuset_current_mems_allowed,
				&preferred_zone);
	if (!preferred_zone) {
		put_mems_allowed();
		return NULL;
	}

	/* First allocation attempt */
	page = get_page_from_freelist(gfp_mask|__GFP_HARDWALL, nodemask, order,
			zonelist, high_zoneidx, ALLOC_WMARK_LOW|ALLOC_CPUSET,
			preferred_zone, migratetype);
	if (unlikely(!page))
		page = __alloc_pages_slowpath(gfp_mask, order,
				zonelist, high_zoneidx, nodemask,
				preferred_zone, migratetype);
	put_mems_allowed();

	trace_mm_page_alloc(page, order, gfp_mask, migratetype);
	return page;
}

The get_page_from_freelist function

Again we only consider the case without cpusets and without NUMA.

  • Check whether the zone watermark is satisfied (zone_watermark_ok); the idea behind that check is sketched right after this list.
  • If it is, call buffered_rmqueue to take the pages from that zone.
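Before reading the scan below, here is a simplified sketch of the idea behind the watermark test: after the allocation the zone must keep at least mark free pages plus a reserve held back for lower zones, and for each order below the requested one the free pages left in the higher orders must stay above a progressively halved mark. It follows the shape of the kernel's check but is a simplification with made-up numbers, not the exact code.

/*
 * Simplified zone_watermark_ok-style check. free_area[o] counts free
 * blocks of order o; all the numbers in main() are invented.
 */
#include <stdio.h>

#define MAX_ORDER 11

struct fake_zone {
	unsigned long free_pages;             /* total free pages in the zone   */
	unsigned long free_area[MAX_ORDER];   /* free blocks per order          */
	unsigned long lowmem_reserve;         /* pages reserved for lower zones */
};

static int watermark_ok(const struct fake_zone *z, int order, unsigned long mark)
{
	long min  = (long)mark;
	long free = (long)z->free_pages - ((1L << order) - 1);
	int o;

	if (free <= min + (long)z->lowmem_reserve)
		return 0;

	for (o = 0; o < order; o++) {
		/* Blocks of order o cannot serve a request of a higher order. */
		free -= (long)(z->free_area[o] << o);
		min >>= 1;          /* demand fewer, but larger, free blocks left */
		if (free <= min)
			return 0;
	}
	return 1;
}

int main(void)
{
	struct fake_zone zone = {
		.free_pages     = 992,
		.free_area      = { 256, 128, 64, 16, 4, 1 },
		.lowmem_reserve = 32,
	};

	printf("order 0: %s\n", watermark_ok(&zone, 0, 128) ? "ok" : "below watermark");
	printf("order 3: %s\n", watermark_ok(&zone, 3, 128) ? "ok" : "below watermark");
	return 0;
}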
	static struct page *
get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask, unsigned int order,
		struct zonelist *zonelist, int high_zoneidx, int alloc_flags,
		struct zone *preferred_zone, int migratetype)
{
	struct zoneref *z;
	struct page *page = NULL;
	int classzone_idx;
	struct zone *zone;
	nodemask_t *allowednodes = NULL;/* zonelist_cache approximation */
	int zlc_active = 0;		/* set if using zonelist_cache */
	int did_zlc_setup = 0;		/* just call zlc_setup() one time */

	classzone_idx = zone_idx(preferred_zone);
zonelist_scan:
	/*
	 * Scan zonelist, looking for a zone with enough free.
	 * See also cpuset_zone_allowed() comment in kernel/cpuset.c.
	 */
	for_each_zone_zonelist_nodemask(zone, z, zonelist,
						high_zoneidx, nodemask) {
		if (NUMA_BUILD && zlc_active &&
			!zlc_zone_worth_trying(zonelist, z, allowednodes))
				continue;
		if ((alloc_flags & ALLOC_CPUSET) &&
			!cpuset_zone_allowed_softwall(zone, gfp_mask))
				goto try_next_zone;

		BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
		if (!(alloc_flags & ALLOC_NO_WATERMARKS)) {
			unsigned long mark;
			int ret;

			mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
			if (zone_watermark_ok(zone, order, mark,
				    classzone_idx, alloc_flags))
				goto try_this_zone;

			if (zone_reclaim_mode == 0)
				goto this_zone_full;

			ret = zone_reclaim(zone, gfp_mask, order);
			switch (ret) {
			case ZONE_RECLAIM_NOSCAN:
				/* did not scan */
				goto try_next_zone;
			case ZONE_RECLAIM_FULL:
				/* scanned but unreclaimable */
				goto this_zone_full;
			default:
				/* did we reclaim enough */
				if (!zone_watermark_ok(zone, order, mark,
						classzone_idx, alloc_flags))
					goto this_zone_full;
			}
		}

try_this_zone:
		page = buffered_rmqueue(preferred_zone, zone, order,
						gfp_mask, migratetype);
		if (page)
			break;
this_zone_full:
		if (NUMA_BUILD)
			zlc_mark_zone_full(zonelist, z);
try_next_zone:
		if (NUMA_BUILD && !did_zlc_setup && nr_online_nodes > 1) {
			/*
			 * we do zlc_setup after the first zone is tried but only
			 * if there are multiple nodes make it worthwhile
			 */
			allowednodes = zlc_setup(zonelist, alloc_flags);
			zlc_active = 1;
			did_zlc_setup = 1;
		}
	}

	if (unlikely(NUMA_BUILD && page == NULL && zlc_active)) {
		/* Disable zlc cache for second zonelist scan */
		zlc_active = 0;
		goto zonelist_scan;
	}
	return page;
}

The buffered_rmqueue function

  • First decide whether to take the page from the CPU's cold or hot cache; the hot/cold test uses the !! idiom, which just turns the flag test into 0 or 1 (see the sketch right after this list).
  • If order is 0, only a single page is wanted, so it can come straight from the per-CPU cache, i.e. the per_cpu_pages list inside each per_cpu_pageset. If that list is empty, rmqueue_bulk refills it from the buddy allocator. Hot pages sit at the head of the list and cold pages at the tail.
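Two small details called out above, sketched in isolation: the !! idiom that collapses a flag test to exactly 0 or 1, and the meaning of order, an order-n request being 2^n contiguous pages, which is why the order-0 case really is the single-page case served from the per-CPU lists. The __GFP_COLD_SKETCH bit below is a placeholder, not the real flag value.

/*
 * Sketch of the !!(gfp_flags & __GFP_COLD) idiom and of what "order" means.
 */
#include <stdio.h>

#define __GFP_COLD_SKETCH (1u << 8)   /* placeholder bit, not the real value */

int main(void)
{
	unsigned int gfp_flags = __GFP_COLD_SKETCH | 0x7;   /* made-up request */
	int order;

	/* Without !! the test yields the raw bit value; !! normalizes it to 1. */
	printf("raw  = %u\n",  gfp_flags & __GFP_COLD_SKETCH);
	printf("cold = %u\n", !!(gfp_flags & __GFP_COLD_SKETCH));

	/* An order-n allocation hands out 2^n contiguous pages. */
	for (order = 0; order <= 3; order++)
		printf("order %d -> %d pages\n", order, 1 << order);

	return 0;
}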
	static inline
struct page *buffered_rmqueue(struct zone *preferred_zone,
			struct zone *zone, int order, gfp_t gfp_flags,
			int migratetype)
{
	unsigned long flags;
	struct page *page;
	int cold = !!(gfp_flags & __GFP_COLD);

again:
	if (likely(order == 0)) {
		struct per_cpu_pages *pcp;
		struct list_head *list;

		local_irq_save(flags);
		pcp = &this_cpu_ptr(zone->pageset)->pcp;
		list = &pcp->lists[migratetype];
		if (list_empty(list)) {
			pcp->count += rmqueue_bulk(zone, 0,
					pcp->batch, list,
					migratetype, cold);
			if (unlikely(list_empty(list)))
				goto failed;
		}

		if (cold)
			page = list_entry(list->prev, struct page, lru);
		else
			page = list_entry(list->next, struct page, lru);

		list_del(&page->lru);
		pcp->count--;
	} else {
		if (unlikely(gfp_flags & __GFP_NOFAIL)) {
			/*
			 * __GFP_NOFAIL is not to be used in new code.
			 *
			 * All __GFP_NOFAIL callers should be fixed so that they
			 * properly detect and handle allocation failures.
			 *
			 * We most definitely don't want callers attempting to
			 * allocate greater than order-1 page units with
			 * __GFP_NOFAIL.
			 */
			WARN_ON_ONCE(order > 1);
		}
		spin_lock_irqsave(&zone->lock, flags);
		page = __rmqueue(zone, order, migratetype);
		spin_unlock(&zone->lock);
		if (!page)
			goto failed;
		__mod_zone_page_state(zone, NR_FREE_PAGES, -(1 << order));
	}

	__count_zone_vm_events(PGALLOC, zone, 1 << order);
	zone_statistics(preferred_zone, zone, gfp_flags);
	local_irq_restore(flags);

	VM_BUG_ON(bad_range(zone, page));
	if (prep_new_page(page, order, gfp_flags))
		goto again;
	return page;

failed:
	local_irq_restore(flags);
	return NULL;
}

Closing remarks

That is about all for this post, so I will stop here.