From a4103262b01a1b8704b37c01c7c813df91b7b119 Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@google.com>
Date: Sun, 18 Sep 2022 01:59:58 -0600
Subject: [PATCH 01/29] mm: x86, arm64: add arch_has_hw_pte_young()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Patch series "Multi-Gen LRU Framework", v14.

What's new
==========
1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
   Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
   machines. The old direct reclaim backoff, which tries to enforce a
   minimum fairness among all eligible memcgs, over-swapped by about
   (total_mem>>DEF_PRIORITY)-nr_to_reclaim; see the sketch after this
   list. The new backoff, which pulls the plug on swapping once the
   target is met, trades some fairness for curtailed latency:
   https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
3. Fixed minor build warnings and conflicts; addressed more comments
   and nits.

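As a back-of-the-envelope illustration of the over-swap bound in item 2,
consider the following standalone C sketch. The 4 TB machine size and the
32-page reclaim target are assumed example values, not numbers from this
series; DEF_PRIORITY is 12, as defined in include/linux/mmzone.h.

  #include <stdio.h>

  #define DEF_PRIORITY 12 /* as defined in include/linux/mmzone.h */

  int main(void)
  {
          /* Assumed example values: a 4 TB machine with 4 KiB pages. */
          unsigned long long total_mem = 4ULL << 40;
          unsigned long long nr_to_reclaim = 32ULL << 12;  /* 32 pages */
          unsigned long long over_swap =
                  (total_mem >> DEF_PRIORITY) - nr_to_reclaim;

          /* Prints ~1023 MiB: about a gigabyte beyond the target. */
          printf("worst-case over-swap: ~%llu MiB\n", over_swap >> 20);
          return 0;
  }

On a machine that size, the old backoff could thus swap out roughly a
gigabyte more than requested, which is where the long-tailed direct
reclaim latency came from.
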
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/

01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.

03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
    its sole caller"
Minor refactors to improve readability for the following patches.

05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.

06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.

07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.

08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.

09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.

10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.

11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.

13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
   Apache Cassandra      Memcached
   Apache Hadoop         MongoDB
   Apache Spark          PostgreSQL
   MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
   less wall time to sort three billion random integers, respectively,
   under the medium- and the high-concurrency conditions, when
   overcommitting memory. There were no statistically significant
   changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
   more transactions per minute (TPM), respectively, under the medium-
   and the high-concurrency conditions, when overcommitting memory.
   There were no statistically significant changes in TPM for the rest
   of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
   and [21.59, 30.02]% more operations per second (OPS), respectively,
   for sequential access, random access and Gaussian (distribution)
   access, when THP=always; 95% CIs [13.85, 15.97]% and
   [23.94, 29.92]% more OPS, respectively, for random access and
   Gaussian access, when THP=never. There were no statistically
   significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
   [2.16, 3.55]% more operations per second (OPS), respectively, for
   exponential (distribution) access, random access and Zipfian
   (distribution) access, when underutilizing memory; 95% CIs
   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
   respectively, for exponential access, random access and Zipfian
   access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
   and [4.11, 7.50]% more operations per second (OPS), respectively,
   for exponential (distribution) access, random access and Zipfian
   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
   exponential access, random access and Zipfian access, when swap was
   on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
   less average wall time to finish twelve parallel TeraSort jobs,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in average wall time for the rest of the
   benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
   minute (TPM) under the high-concurrency condition, when swap was
   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
   [11.47, 19.36]% more total operations per second (OPS),
   respectively, for sequential access, random access and Gaussian
   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
   for sequential access, random access and Gaussian access, when
   THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
   fs_fio_bench_hdd_mq      pft
   fs_lmbench               pgsql-hammerdb
   fs_parallelio            redis
   fs_postmark              stream
   hackbench                sysbenchthread
   kernbench                tpcc_spark
   memcached                unixbench
   multichase               vm-scalability
   mutilate                 will-it-scale
   nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/

Real-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
   I have Archlinux with 8G RAM + zswap + swap. While developing, I
   have lots of apps opened such as multiple LSP-servers for different
   langs, chats, two browsers, etc... Usually, my system gets quickly
   to a point of SWAP-storms, where I have to kill LSP-servers,
   restart browsers to free memory, etc, otherwise the system lags
   heavily and is barely usable.

   1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
   patchset, and I started up by opening lots of apps to create memory
   pressure, and worked for a day like this. Till now I had not a
   single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
   getting to the point of 3G in SWAP before without a single
   SWAP-storm.

Vaibhav from IBM reported [12]:
   In a synthetic MongoDB Benchmark, seeing an average of ~19%
   throughput improvement on POWER10 (Radix MMU + 64K Page Size) with
   MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
   three different request distributions, namely, Exponential, Uniform
   and Zipfian.

Shuang from U of Rochester reported [13]:
   With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
   and [9.26, 10.36]% higher throughput, respectively, for random
   access, Zipfian (distribution) access and Gaussian (distribution)
   access, when the average number of jobs per CPU is 1; 95% CIs
   [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
   throughput, respectively, for random access, Zipfian access and
   Gaussian access, when the average number of jobs per CPU is 2.

Daniel from Michigan Tech reported [14]:
   With Memcached allocating ~100GB of byte-addressable Optane,
   performance improvement in terms of throughput (measured as queries
   per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of ChromeOS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.

The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. ChromeOS [19]
5. Liquorix [20]
6. OpenWrt [21]
7. post-factum [22]
8. XanMod [23]

[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://openwrt.org
[22] https://codeberg.org/pf-kernel
[23] https://xanmod.org

Summary
=======
The facts are:
1. The independent lab results and the real-world applications
   indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
   work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have demonstrated
   similar effects.

Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
   materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
   will likely be put to use for both personal computers and data
   centers.
3. Based on Google's track record, the new code will likely be well
   maintained in the long term. It'd be more difficult, if not
   impossible, to achieve similar effects with other approaches.

This patch (of 14):

Some architectures automatically set the accessed bit in PTEs, e.g., x86
and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault following
the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty page
faults when trying to clear the accessed bit in many PTEs.

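One such decision is whether to clear the accessed bit in a single bulk
pass or to spread the clearing out. The following self-contained C sketch
only illustrates that decision: arch_has_hw_pte_young() is stubbed here
rather than taken from the kernel, and the two clear_young_*() helpers
are hypothetical stand-ins invented for this example.

  #include <stdbool.h>
  #include <stdio.h>

  /* Stub for the helper this patch adds; real code would use the
     kernel's arch_has_hw_pte_young(). */
  static bool arch_has_hw_pte_young(void) { return true; }

  /* Hypothetical helpers, assumed for this sketch only. */
  static void clear_young_in_one_pass(void)
  {
          printf("bulk pass: hardware maintains the accessed bit\n");
  }

  static void clear_young_spread_over_time(void)
  {
          printf("incremental passes: avoid bursty minor faults\n");
  }

  static void age_many_ptes(void)
  {
          if (arch_has_hw_pte_young())
                  /* Clearing the accessed bit costs no extra faults. */
                  clear_young_in_one_pass();
          else
                  /* Old PTEs fault on their next access; spread the
                     work out over a period of time. */
                  clear_young_spread_over_time();
  }

  int main(void)
  {
          age_many_ptes();
          return 0;
  }

As the next paragraph notes, this is a performance hint only:
architecture-independent code must not rely on it for correctness, e.g.,
for TLB flush decisions.
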
Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it should
not be used in architecture-independent code that involves correctness,
e.g., to determine whether TLB flushes are required (in combination with
the accessed bit).

Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/arm64/include/asm/pgtable.h | 14 ++------------
 arch/x86/include/asm/pgtable.h   |  6 +++---
 include/linux/pgtable.h          | 13 +++++++++++++
 mm/memory.c                      | 14 +-------------
 4 files changed, 19 insertions(+), 28 deletions(-)

--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -999,23 +999,13 @@ static inline void update_mmu_cache(stru
  * page after fork() + CoW for pfn mappings. We don't always have a
  * hardware-managed access flag on arm64.
  */
-static inline bool arch_faults_on_old_pte(void)
-{
-	WARN_ON(preemptible());
-
-	return !cpu_has_hw_af();
-}
-#define arch_faults_on_old_pte		arch_faults_on_old_pte
+#define arch_has_hw_pte_young		cpu_has_hw_af
 
 /*
  * Experimentally, it's cheap to set the access flag in hardware and we
  * benefit from prefaulting mappings as 'old' to start with.
  */
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
-	return !arch_faults_on_old_pte();
-}
-#define arch_wants_old_prefaulted_pte	arch_wants_old_prefaulted_pte
+#define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
 #endif /* !__ASSEMBLY__ */
 
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_c
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(void)
 {
-	return false;
+	return true;
 }
 
 #endif /* __ASSEMBLY__ */
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -259,6 +259,19 @@ static inline int pmdp_clear_flush_young
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef arch_has_hw_pte_young
+/*
+ * Return whether the accessed bit is supported on the local CPU.
+ *
+ * This stub assumes accessing through an old PTE triggers a page fault.
+ * Architectures that automatically set the access bit should overwrite it.
+ */
+static inline bool arch_has_hw_pte_young(void)
+{
+	return false;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -121,18 +121,6 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
-#ifndef arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
-{
-	/*
-	 * Those arches which don't have hw access flag feature need to
-	 * implement their own helper. By default, "true" means pagefault
-	 * will be hit on old pte.
-	 */
-	return true;
-}
-#endif
-
 #ifndef arch_wants_old_prefaulted_pte
 static inline bool arch_wants_old_prefaulted_pte(void)
 {
@@ -2782,7 +2770,7 @@ static inline bool cow_user_page(struct
 	 * On architectures with software "accessed" bits, we would
 	 * take a double page fault, so mark it accessed here.
 	 */
-	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+	if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
 		pte_t entry;
 
 		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);