kernel: backport fixes for realtek r8152

diff --git a/target/linux/generic/backport-5.15/020-v6.1-01-mm-x86-arm64-add-arch_has_hw_pte_young.patch b/target/linux/generic/backport-5.15/020-v6.1-01-mm-x86-arm64-add-arch_has_hw_pte_young.patch

index 48bcaf3e3ea7001c5d5ade303010a9086f96dea9..73acadd804c0bfd7baf4b52ebde88ef333ea819d 100644
--- a/target/linux/generic/backport-5.15/020-v6.1-01-mm-x86-arm64-add-arch_has_hw_pte_young.patch
+++ b/target/linux/generic/backport-5.15/020-v6.1-01-mm-x86-arm64-add-arch_has_hw_pte_young.patch
@@ -1,104 +1,360 @@
-From a8e6015d9534f39abc08e6804566af059e498a60 Mon Sep 17 00:00:00 2001
+From a4103262b01a1b8704b37c01c7c813df91b7b119 Mon Sep 17 00:00:00 2001
  From: Yu Zhao <yuzhao@google.com>
-Date: Wed, 4 Aug 2021 01:31:34 -0600
-Subject: [PATCH 01/10] mm: x86, arm64: add arch_has_hw_pte_young()
+Date: Sun, 18 Sep 2022 01:59:58 -0600
+Subject: [PATCH 01/29] mm: x86, arm64: add arch_has_hw_pte_young()
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
  
-Some architectures automatically set the accessed bit in PTEs, e.g.,
-x86 and arm64 v8.2. On architectures that do not have this capability,
-clearing the accessed bit in a PTE triggers a page fault following the
-TLB miss of this PTE.
+Patch series "Multi-Gen LRU Framework", v14.
  
-Being aware of this capability can help make better decisions, i.e.,
-whether to limit the size of each batch of PTEs and the burst of
-batches when clearing the accessed bit.
+What's new
+==========
+1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
+   Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
+2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
+   machines. The old direct reclaim backoff, which tries to enforce a
+   minimum fairness among all eligible memcgs, over-swapped by about
+   (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
+   pulls the plug on swapping once the target is met, trades some
+   fairness for curtailed latency:
+   https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
+3. Fixed minor build warnings and conflicts. More comments and nits.
  
+TLDR
+====
+The current page reclaim is too expensive in terms of CPU usage and it
+often makes poor choices about what to evict. This patchset offers an
+alternative solution that is performant, versatile and
+straightforward.
+
+Patchset overview
+=================
+The design and implementation overview is in patch 14:
+https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
+
+01. mm: x86, arm64: add arch_has_hw_pte_young()
+02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+Take advantage of hardware features when trying to clear the accessed
+bit in many PTEs.
+
+03. mm/vmscan.c: refactor shrink_node()
+04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
+    its sole caller"
+Minor refactors to improve readability for the following patches.
+
+05. mm: multi-gen LRU: groundwork
+Adds the basic data structure and the functions that insert pages to
+and remove pages from the multi-gen LRU (MGLRU) lists.
+
+06. mm: multi-gen LRU: minimal implementation
+A minimal implementation without optimizations.
+
+07. mm: multi-gen LRU: exploit locality in rmap
+Exploits spatial locality to improve efficiency when using the rmap.
+
+08. mm: multi-gen LRU: support page table walks
+Further exploits spatial locality by optionally scanning page tables.
+
+09. mm: multi-gen LRU: optimize multiple memcgs
+Optimizes the overall performance for multiple memcgs running mixed
+types of workloads.
+
+10. mm: multi-gen LRU: kill switch
+Adds a kill switch to enable or disable MGLRU at runtime.
+
+11. mm: multi-gen LRU: thrashing prevention
+12. mm: multi-gen LRU: debugfs interface
+Provide userspace with features like thrashing prevention, working set
+estimation and proactive reclaim.
+
+13. mm: multi-gen LRU: admin guide
+14. mm: multi-gen LRU: design doc
+Add an admin guide and a design doc.
+
+Benchmark results
+=================
+Independent lab results
+-----------------------
+Based on the popularity of searches [01] and the memory usage in
+Google's public cloud, the most popular open-source memory-hungry
+applications, in alphabetical order, are:
+      Apache Cassandra      Memcached
+      Apache Hadoop         MongoDB
+      Apache Spark          PostgreSQL
+      MariaDB (MySQL)       Redis
+
+An independent lab evaluated MGLRU with the most widely used benchmark
+suites for the above applications. They posted 960 data points along
+with kernel metrics and perf profiles collected over more than 500
+hours of total benchmark time. Their final reports show that, with 95%
+confidence intervals (CIs), the above applications all performed
+significantly better for at least part of their benchmark matrices.
+
+On 5.14:
+1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
+   less wall time to sort three billion random integers, respectively,
+   under the medium- and the high-concurrency conditions, when
+   overcommitting memory. There were no statistically significant
+   changes in wall time for the rest of the benchmark matrix.
+2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
+   more transactions per minute (TPM), respectively, under the medium-
+   and the high-concurrency conditions, when overcommitting memory.
+   There were no statistically significant changes in TPM for the rest
+   of the benchmark matrix.
+3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
+   and [21.59, 30.02]% more operations per second (OPS), respectively,
+   for sequential access, random access and Gaussian (distribution)
+   access, when THP=always; 95% CIs [13.85, 15.97]% and
+   [23.94, 29.92]% more OPS, respectively, for random access and
+   Gaussian access, when THP=never. There were no statistically
+   significant changes in OPS for the rest of the benchmark matrix.
+4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
+   [2.16, 3.55]% more operations per second (OPS), respectively, for
+   exponential (distribution) access, random access and Zipfian
+   (distribution) access, when underutilizing memory; 95% CIs
+   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
+   respectively, for exponential access, random access and Zipfian
+   access, when overcommitting memory.
+
+On 5.15:
+5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
+   and [4.11, 7.50]% more operations per second (OPS), respectively,
+   for exponential (distribution) access, random access and Zipfian
+   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
+   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
+   exponential access, random access and Zipfian access, when swap was
+   on.
+6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
+   less average wall time to finish twelve parallel TeraSort jobs,
+   respectively, under the medium- and the high-concurrency
+   conditions, when swap was on. There were no statistically
+   significant changes in average wall time for the rest of the
+   benchmark matrix.
+7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
+   minute (TPM) under the high-concurrency condition, when swap was
+   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
+   respectively, under the medium- and the high-concurrency
+   conditions, when swap was on. There were no statistically
+   significant changes in TPM for the rest of the benchmark matrix.
+8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
+   [11.47, 19.36]% more total operations per second (OPS),
+   respectively, for sequential access, random access and Gaussian
+   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
+   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
+   for sequential access, random access and Gaussian access, when
+   THP=never.
+
+Our lab results
+---------------
+To supplement the above results, we ran the following benchmark suites
+on 5.16-rc7 and found no regressions [10].
+      fs_fio_bench_hdd_mq      pft
+      fs_lmbench               pgsql-hammerdb
+      fs_parallelio            redis
+      fs_postmark              stream
+      hackbench                sysbenchthread
+      kernbench                tpcc_spark
+      memcached                unixbench
+      multichase               vm-scalability
+      mutilate                 will-it-scale
+      nginx
+
+[01] https://trends.google.com
+[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
+[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
+[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
+[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
+[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
+[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
+[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
+[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
+[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
+
+Real-world applications
+=======================
+Third-party testimonials
+------------------------
+Konstantin reported [11]:
+   I have Archlinux with 8G RAM + zswap + swap. While developing, I
+   have lots of apps opened such as multiple LSP-servers for different
+   langs, chats, two browsers, etc... Usually, my system gets quickly
+   to a point of SWAP-storms, where I have to kill LSP-servers,
+   restart browsers to free memory, etc, otherwise the system lags
+   heavily and is barely usable.
+
+   1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
+   patchset, and I started up by opening lots of apps to create memory
+   pressure, and worked for a day like this. Till now I had not a
+   single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
+   getting to the point of 3G in SWAP before without a single
+   SWAP-storm.
+
+Vaibhav from IBM reported [12]:
+   In a synthetic MongoDB Benchmark, seeing an average of ~19%
+   throughput improvement on POWER10(Radix MMU + 64K Page Size) with
+   MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
+   three different request distributions, namely, Exponential, Uniform
+   and Zipfian.
+
+Shuang from U of Rochester reported [13]:
+   With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
+   and [9.26, 10.36]% higher throughput, respectively, for random
+   access, Zipfian (distribution) access and Gaussian (distribution)
+   access, when the average number of jobs per CPU is 1; 95% CIs
+   [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
+   throughput, respectively, for random access, Zipfian access and
+   Gaussian access, when the average number of jobs per CPU is 2.
+
+Daniel from Michigan Tech reported [14]:
+   With Memcached allocating ~100GB of byte-addressable Optane,
+   performance improvement in terms of throughput (measured as queries
+   per second) was about 10% for a series of workloads.
+
+Large-scale deployments
+-----------------------
+We've rolled out MGLRU to tens of millions of ChromeOS users and
+about a million Android users. Google's fleetwide profiling [15] shows
+an overall 40% decrease in kswapd CPU usage, in addition to
+improvements in other UX metrics, e.g., an 85% decrease in the number
+of low-memory kills at the 75th percentile and an 18% decrease in
+app launch time at the 50th percentile.
+
+The downstream kernels that have been using MGLRU include:
+1. Android [16]
+2. Arch Linux Zen [17]
+3. Armbian [18]
+4. ChromeOS [19]
+5. Liquorix [20]
+6. OpenWrt [21]
+7. post-factum [22]
+8. XanMod [23]
+
+[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
+[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
+[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
+[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
+[15] https://dl.acm.org/doi/10.1145/2749469.2750392
+[16] https://android.com
+[17] https://archlinux.org
+[18] https://armbian.com
+[19] https://chromium.org
+[20] https://liquorix.net
+[21] https://openwrt.org
+[22] https://codeberg.org/pf-kernel
+[23] https://xanmod.org
+
+Summary
+=======
+The facts are:
+1. The independent lab results and the real-world applications
+   indicate substantial improvements; there are no known regressions.
+2. Thrashing prevention, working set estimation and proactive reclaim
+   work out of the box; there are no equivalent solutions.
+3. There is a lot of new code; no smaller changes have been
+   demonstrated to have similar effects.
+
+Our options, accordingly, are:
+1. Given the amount of evidence, the reported improvements will likely
+   materialize for a wide range of workloads.
+2. Gauging the interest from the past discussions, the new features
+   will likely be put to use for both personal computers and data
+   centers.
+3. Based on Google's track record, the new code will likely be well
+   maintained in the long term. It'd be more difficult if not
+   impossible to achieve similar effects with other approaches.
+
+This patch (of 14):
+
+Some architectures automatically set the accessed bit in PTEs, e.g., x86
+and arm64 v8.2.  On architectures that do not have this capability,
+clearing the accessed bit in a PTE usually triggers a page fault following
+the TLB miss of this PTE (to emulate the accessed bit).
+
+Being aware of this capability can help make better decisions, e.g.,
+whether to spread the work out over a period of time to reduce bursty page
+faults when trying to clear the accessed bit in many PTEs.
+
+Note that theoretically this capability can be unreliable, e.g.,
+hotplugged CPUs might be different from builtin ones.  Therefore it should
+not be used in architecture-independent code that involves correctness,
+e.g., to determine whether TLB flushes are required (in combination with
+the accessed bit).
+
+Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
+Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
  Signed-off-by: Yu Zhao <yuzhao@google.com>
-Change-Id: Ib49b44fb56df3333a2ff1fcc496fb1980b976e7a
+Reviewed-by: Barry Song <baohua@kernel.org>
+Acked-by: Brian Geffon <bgeffon@google.com>
+Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
+Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
+Acked-by: Steven Barrett <steven@liquorix.net>
+Acked-by: Suleiman Souhlal <suleiman@google.com>
+Acked-by: Will Deacon <will@kernel.org>
+Tested-by: Daniel Byrne <djbyrne@mtu.edu>
+Tested-by: Donald Carr <d@chaos-reins.com>
+Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
+Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
+Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
+Tested-by: Sofia Trinh <sofia.trinh@edi.works>
+Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
+Cc: Andi Kleen <ak@linux.intel.com>
+Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
+Cc: Catalin Marinas <catalin.marinas@arm.com>
+Cc: Dave Hansen <dave.hansen@linux.intel.com>
+Cc: Hillf Danton <hdanton@sina.com>
+Cc: Jens Axboe <axboe@kernel.dk>
+Cc: Johannes Weiner <hannes@cmpxchg.org>
+Cc: Jonathan Corbet <corbet@lwn.net>
+Cc: Linus Torvalds <torvalds@linux-foundation.org>
+Cc: linux-arm-kernel@lists.infradead.org
+Cc: Matthew Wilcox <willy@infradead.org>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Michael Larabel <Michael@MichaelLarabel.com>
+Cc: Michal Hocko <mhocko@kernel.org>
+Cc: Mike Rapoport <rppt@kernel.org>
+Cc: Peter Zijlstra <peterz@infradead.org>
+Cc: Tejun Heo <tj@kernel.org>
+Cc: Vlastimil Babka <vbabka@suse.cz>
+Cc: Miaohe Lin <linmiaohe@huawei.com>
+Cc: Mike Rapoport <rppt@linux.ibm.com>
+Cc: Qi Zheng <zhengqi.arch@bytedance.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  ---
- arch/arm64/include/asm/cpufeature.h |  5 +++++
- arch/arm64/include/asm/pgtable.h    | 13 ++++++++-----
- arch/arm64/kernel/cpufeature.c      | 10 ++++++++++
- arch/arm64/tools/cpucaps            |  1 +
- arch/x86/include/asm/pgtable.h      |  6 +++---
- include/linux/pgtable.h             | 13 +++++++++++++
- mm/memory.c                         | 14 +-------------
- 7 files changed, 41 insertions(+), 21 deletions(-)
-
---- a/arch/arm64/include/asm/cpufeature.h
-+++ b/arch/arm64/include/asm/cpufeature.h
-@@ -808,6 +808,11 @@ static inline bool system_supports_tlb_r
-               cpus_have_const_cap(ARM64_HAS_TLB_RANGE);
- }
- 
-+static inline bool system_has_hw_af(void)
-+{
-+      return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && cpus_have_const_cap(ARM64_HW_AF);
-+}
-+
- extern int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
- 
- static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
+ arch/arm64/include/asm/pgtable.h | 14 ++------------
+ arch/x86/include/asm/pgtable.h   |  6 +++---
+ include/linux/pgtable.h          | 13 +++++++++++++
+ mm/memory.c                      | 14 +-------------
+ 4 files changed, 19 insertions(+), 28 deletions(-)
+
  --- a/arch/arm64/include/asm/pgtable.h
  +++ b/arch/arm64/include/asm/pgtable.h
-@@ -999,13 +999,16 @@ static inline void update_mmu_cache(stru
+@@ -999,23 +999,13 @@ static inline void update_mmu_cache(stru
    * page after fork() + CoW for pfn mappings. We don't always have a
    * hardware-managed access flag on arm64.
    */
  -static inline bool arch_faults_on_old_pte(void)
-+static inline bool arch_has_hw_pte_young(bool local)
- {
+-{
  -      WARN_ON(preemptible());
-+      if (local) {
-+              WARN_ON(preemptible());
-+              return cpu_has_hw_af();
-+      }
- 
+-
  -      return !cpu_has_hw_af();
-+      return system_has_hw_af();
- }
+-}
  -#define arch_faults_on_old_pte                arch_faults_on_old_pte
-+#define arch_has_hw_pte_young         arch_has_hw_pte_young
++#define arch_has_hw_pte_young         cpu_has_hw_af
   
   /*
    * Experimentally, it's cheap to set the access flag in hardware and we
-@@ -1013,7 +1016,7 @@ static inline bool arch_faults_on_old_pt
+  * benefit from prefaulting mappings as 'old' to start with.
    */
- static inline bool arch_wants_old_prefaulted_pte(void)
- {
+-static inline bool arch_wants_old_prefaulted_pte(void)
+-{
  -      return !arch_faults_on_old_pte();
-+      return arch_has_hw_pte_young(true);
- }
- #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
+-}
+-#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
++#define arch_wants_old_prefaulted_pte cpu_has_hw_af
+ 
+ #endif /* !__ASSEMBLY__ */
   
---- a/arch/arm64/kernel/cpufeature.c
-+++ b/arch/arm64/kernel/cpufeature.c
-@@ -2187,6 +2187,16 @@ static const struct arm64_cpu_capabiliti
-               .matches = has_hw_dbm,
-               .cpu_enable = cpu_enable_hw_dbm,
-       },
-+      {
-+              .desc = "Hardware update of the Access flag",
-+              .type = ARM64_CPUCAP_SYSTEM_FEATURE,
-+              .capability = ARM64_HW_AF,
-+              .sys_reg = SYS_ID_AA64MMFR1_EL1,
-+              .sign = FTR_UNSIGNED,
-+              .field_pos = ID_AA64MMFR1_HADBS_SHIFT,
-+              .min_field_value = 1,
-+              .matches = has_cpuid_feature,
-+      },
- #endif
-       {
-               .desc = "CRC32 instructions",
---- a/arch/arm64/tools/cpucaps
-+++ b/arch/arm64/tools/cpucaps
-@@ -35,6 +35,7 @@ HAS_STAGE2_FWB
- HAS_SYSREG_GIC_CPUIF
- HAS_TLB_RANGE
- HAS_VIRT_HOST_EXTN
-+HW_AF
- HW_DBM
- KVM_PROTECTED_MODE
- MISMATCHED_CACHE_TYPE
  --- a/arch/x86/include/asm/pgtable.h
  +++ b/arch/x86/include/asm/pgtable.h
  @@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_c
@@ -108,7 +364,7 @@ Change-Id: Ib49b44fb56df3333a2ff1fcc496fb1980b976e7a
  -#define arch_faults_on_old_pte arch_faults_on_old_pte
  -static inline bool arch_faults_on_old_pte(void)
  +#define arch_has_hw_pte_young arch_has_hw_pte_young
-+static inline bool arch_has_hw_pte_young(bool local)
++static inline bool arch_has_hw_pte_young(void)
   {
  -      return false;
  +      return true;
@@ -123,12 +379,12 @@ Change-Id: Ib49b44fb56df3333a2ff1fcc496fb1980b976e7a
   
  +#ifndef arch_has_hw_pte_young
  +/*
-+ * Return whether the accessed bit is supported by the local CPU or all CPUs.
++ * Return whether the accessed bit is supported on the local CPU.
  + *
-+ * Those arches which have hw access flag feature need to implement their own
-+ * helper. By default, "false" means pagefault will be hit on old pte.
++ * This stub assumes accessing through an old PTE triggers a page fault.
++ * Architectures that automatically set the access bit should overwrite it.
  + */
-+static inline bool arch_has_hw_pte_young(bool local)
++static inline bool arch_has_hw_pte_young(void)
  +{
  +      return false;
  +}
@@ -158,12 +414,12 @@ Change-Id: Ib49b44fb56df3333a2ff1fcc496fb1980b976e7a
   #ifndef arch_wants_old_prefaulted_pte
   static inline bool arch_wants_old_prefaulted_pte(void)
   {
-@@ -2782,7 +2770,7 @@ static inline bool cow_user_page(struct
+@@ -2791,7 +2779,7 @@ static inline int cow_user_page(struct p
          * On architectures with software "accessed" bits, we would
          * take a double page fault, so mark it accessed here.
          */
  -      if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
-+      if (!arch_has_hw_pte_young(true) && !pte_young(vmf->orig_pte)) {
++      if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
                 pte_t entry;
   
                 vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);