-From a8e6015d9534f39abc08e6804566af059e498a60 Mon Sep 17 00:00:00 2001
+From a4103262b01a1b8704b37c01c7c813df91b7b119 Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@google.com>
-Date: Wed, 4 Aug 2021 01:31:34 -0600
-Subject: [PATCH 01/10] mm: x86, arm64: add arch_has_hw_pte_young()
+Date: Sun, 18 Sep 2022 01:59:58 -0600
+Subject: [PATCH 01/29] mm: x86, arm64: add arch_has_hw_pte_young()
+MIME-Version: 1.0
+Content-Type: text/plain; charset=UTF-8
+Content-Transfer-Encoding: 8bit
-Some architectures automatically set the accessed bit in PTEs, e.g.,
-x86 and arm64 v8.2. On architectures that do not have this capability,
-clearing the accessed bit in a PTE triggers a page fault following the
-TLB miss of this PTE.
+Patch series "Multi-Gen LRU Framework", v14.
-Being aware of this capability can help make better decisions, i.e.,
-whether to limit the size of each batch of PTEs and the burst of
-batches when clearing the accessed bit.
+What's new
+==========
+1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
+ Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
+2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
+ machines. The old direct reclaim backoff, which tries to enforce a
+ minimum fairness among all eligible memcgs, over-swapped by about
+ (total_mem>>DEF_PRIORITY)-nr_to_reclaim. The new backoff, which
+ pulls the plug on swapping once the target is met, trades some
+ fairness for curtailed latency:
+ https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
+3. Fixed minor build warnings and conflicts. More comments and nits.
+TLDR
+====
+The current page reclaim is too expensive in terms of CPU usage and it
+often makes poor choices about what to evict. This patchset offers an
+alternative solution that is performant, versatile and
+straightforward.
+
+Patchset overview
+=================
+The design and implementation overview is in patch 14:
+https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/
+
+01. mm: x86, arm64: add arch_has_hw_pte_young()
+02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
+Take advantage of hardware features when trying to clear the accessed
+bit in many PTEs.
+
+03. mm/vmscan.c: refactor shrink_node()
+04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
+ its sole caller"
+Minor refactors to improve readability for the following patches.
+
+05. mm: multi-gen LRU: groundwork
+Adds the basic data structure and the functions that insert pages to
+and remove pages from the multi-gen LRU (MGLRU) lists.
+
+06. mm: multi-gen LRU: minimal implementation
+A minimal implementation without optimizations.
+
+07. mm: multi-gen LRU: exploit locality in rmap
+Exploits spatial locality to improve efficiency when using the rmap.
+
+08. mm: multi-gen LRU: support page table walks
+Further exploits spatial locality by optionally scanning page tables.
+
+09. mm: multi-gen LRU: optimize multiple memcgs
+Optimizes the overall performance for multiple memcgs running mixed
+types of workloads.
+
+10. mm: multi-gen LRU: kill switch
+Adds a kill switch to enable or disable MGLRU at runtime.
+
+11. mm: multi-gen LRU: thrashing prevention
+12. mm: multi-gen LRU: debugfs interface
+Provide userspace with features like thrashing prevention, working set
+estimation and proactive reclaim.
+
+13. mm: multi-gen LRU: admin guide
+14. mm: multi-gen LRU: design doc
+Add an admin guide and a design doc.
+
+Benchmark results
+=================
+Independent lab results
+-----------------------
+Based on the popularity of searches [01] and the memory usage in
+Google's public cloud, the most popular open-source memory-hungry
+applications, in alphabetical order, are:
+ Apache Cassandra Memcached
+ Apache Hadoop MongoDB
+ Apache Spark PostgreSQL
+ MariaDB (MySQL) Redis
+
+An independent lab evaluated MGLRU with the most widely used benchmark
+suites for the above applications. They posted 960 data points along
+with kernel metrics and perf profiles collected over more than 500
+hours of total benchmark time. Their final reports show that, with 95%
+confidence intervals (CIs), the above applications all performed
+significantly better for at least part of their benchmark matrices.
+
+On 5.14:
+1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
+ less wall time to sort three billion random integers, respectively,
+ under the medium- and the high-concurrency conditions, when
+ overcommitting memory. There were no statistically significant
+ changes in wall time for the rest of the benchmark matrix.
+2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
+ more transactions per minute (TPM), respectively, under the medium-
+ and the high-concurrency conditions, when overcommitting memory.
+ There were no statistically significant changes in TPM for the rest
+ of the benchmark matrix.
+3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
+ and [21.59, 30.02]% more operations per second (OPS), respectively,
+ for sequential access, random access and Gaussian (distribution)
+ access, when THP=always; 95% CIs [13.85, 15.97]% and
+ [23.94, 29.92]% more OPS, respectively, for random access and
+ Gaussian access, when THP=never. There were no statistically
+ significant changes in OPS for the rest of the benchmark matrix.
+4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
+ [2.16, 3.55]% more operations per second (OPS), respectively, for
+ exponential (distribution) access, random access and Zipfian
+ (distribution) access, when underutilizing memory; 95% CIs
+ [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
+ respectively, for exponential access, random access and Zipfian
+ access, when overcommitting memory.
+
+On 5.15:
+5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
+ and [4.11, 7.50]% more operations per second (OPS), respectively,
+ for exponential (distribution) access, random access and Zipfian
+ (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
+ [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
+ exponential access, random access and Zipfian access, when swap was
+ on.
+6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
+ less average wall time to finish twelve parallel TeraSort jobs,
+ respectively, under the medium- and the high-concurrency
+ conditions, when swap was on. There were no statistically
+ significant changes in average wall time for the rest of the
+ benchmark matrix.
+7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
+ minute (TPM) under the high-concurrency condition, when swap was
+ off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
+ respectively, under the medium- and the high-concurrency
+ conditions, when swap was on. There were no statistically
+ significant changes in TPM for the rest of the benchmark matrix.
+8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
+ [11.47, 19.36]% more total operations per second (OPS),
+ respectively, for sequential access, random access and Gaussian
+ (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
+ [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
+ for sequential access, random access and Gaussian access, when
+ THP=never.
+
+Our lab results
+---------------
+To supplement the above results, we ran the following benchmark suites
+on 5.16-rc7 and found no regressions [10].
+ fs_fio_bench_hdd_mq pft
+ fs_lmbench pgsql-hammerdb
+ fs_parallelio redis
+ fs_postmark stream
+ hackbench sysbenchthread
+ kernbench tpcc_spark
+ memcached unixbench
+ multichase vm-scalability
+ mutilate will-it-scale
+ nginx
+
+[01] https://trends.google.com
+[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
+[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
+[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
+[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
+[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
+[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
+[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
+[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
+[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/
+
+Real-world applications
+=======================
+Third-party testimonials
+------------------------
+Konstantin reported [11]:
+ I have Archlinux with 8G RAM + zswap + swap. While developing, I
+ have lots of apps opened such as multiple LSP-servers for different
+ langs, chats, two browsers, etc... Usually, my system gets quickly
+ to a point of SWAP-storms, where I have to kill LSP-servers,
+ restart browsers to free memory, etc, otherwise the system lags
+ heavily and is barely usable.
+
+ 1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
+ patchset, and I started up by opening lots of apps to create memory
+ pressure, and worked for a day like this. Till now I had not a
+ single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
+ getting to the point of 3G in SWAP before without a single
+ SWAP-storm.
+
+Vaibhav from IBM reported [12]:
+ In a synthetic MongoDB Benchmark, seeing an average of ~19%
+ throughput improvement on POWER10(Radix MMU + 64K Page Size) with
+ MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
+ three different request distributions, namely, Exponential, Uniform
+ and Zipfian.
+
+Shuang from U of Rochester reported [13]:
+ With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
+ and [9.26, 10.36]% higher throughput, respectively, for random
+ access, Zipfian (distribution) access and Gaussian (distribution)
+ access, when the average number of jobs per CPU is 1; 95% CIs
+ [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
+ throughput, respectively, for random access, Zipfian access and
+ Gaussian access, when the average number of jobs per CPU is 2.
+
+Daniel from Michigan Tech reported [14]:
+ With Memcached allocating ~100GB of byte-addressable Optane,
+ performance improvement in terms of throughput (measured as queries
+ per second) was about 10% for a series of workloads.
+
+Large-scale deployments
+-----------------------
+We've rolled out MGLRU to tens of millions of ChromeOS users and
+about a million Android users. Google's fleetwide profiling [15] shows
+an overall 40% decrease in kswapd CPU usage, in addition to
+improvements in other UX metrics, e.g., an 85% decrease in the number
+of low-memory kills at the 75th percentile and an 18% decrease in
+app launch time at the 50th percentile.
+
+The downstream kernels that have been using MGLRU include:
+1. Android [16]
+2. Arch Linux Zen [17]
+3. Armbian [18]
+4. ChromeOS [19]
+5. Liquorix [20]
+6. OpenWrt [21]
+7. post-factum [22]
+8. XanMod [23]
+
+[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
+[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
+[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
+[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
+[15] https://dl.acm.org/doi/10.1145/2749469.2750392
+[16] https://android.com
+[17] https://archlinux.org
+[18] https://armbian.com
+[19] https://chromium.org
+[20] https://liquorix.net
+[21] https://openwrt.org
+[22] https://codeberg.org/pf-kernel
+[23] https://xanmod.org
+
+Summary
+=======
+The facts are:
+1. The independent lab results and the real-world applications
+ indicate substantial improvements; there are no known regressions.
+2. Thrashing prevention, working set estimation and proactive reclaim
+ work out of the box; there are no equivalent solutions.
+3. There is a lot of new code; no smaller changes have been
+ demonstrated to have similar effects.
+
+Our options, accordingly, are:
+1. Given the amount of evidence, the reported improvements will likely
+ materialize for a wide range of workloads.
+2. Gauging the interest from the past discussions, the new features
+ will likely be put to use for both personal computers and data
+ centers.
+3. Based on Google's track record, the new code will likely be well
+ maintained in the long term. It'd be more difficult if not
+ impossible to achieve similar effects with other approaches.
+
+This patch (of 14):
+
+Some architectures automatically set the accessed bit in PTEs, e.g., x86
+and arm64 v8.2. On architectures that do not have this capability,
+clearing the accessed bit in a PTE usually triggers a page fault following
+the TLB miss of this PTE (to emulate the accessed bit).
+
+Being aware of this capability can help make better decisions, e.g.,
+whether to spread the work out over a period of time to reduce bursty page
+faults when trying to clear the accessed bit in many PTEs.
+
+Note that theoretically this capability can be unreliable, e.g.,
+hotplugged CPUs might be different from builtin ones. Therefore it should
+not be used in architecture-independent code that involves correctness,
+e.g., to determine whether TLB flushes are required (in combination with
+the accessed bit).
+
+Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
+Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
-Change-Id: Ib49b44fb56df3333a2ff1fcc496fb1980b976e7a
+Reviewed-by: Barry Song <baohua@kernel.org>
+Acked-by: Brian Geffon <bgeffon@google.com>
+Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
+Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
+Acked-by: Steven Barrett <steven@liquorix.net>
+Acked-by: Suleiman Souhlal <suleiman@google.com>
+Acked-by: Will Deacon <will@kernel.org>
+Tested-by: Daniel Byrne <djbyrne@mtu.edu>
+Tested-by: Donald Carr <d@chaos-reins.com>
+Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
+Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
+Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
+Tested-by: Sofia Trinh <sofia.trinh@edi.works>
+Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
+Cc: Andi Kleen <ak@linux.intel.com>
+Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
+Cc: Catalin Marinas <catalin.marinas@arm.com>
+Cc: Dave Hansen <dave.hansen@linux.intel.com>
+Cc: Hillf Danton <hdanton@sina.com>
+Cc: Jens Axboe <axboe@kernel.dk>
+Cc: Johannes Weiner <hannes@cmpxchg.org>
+Cc: Jonathan Corbet <corbet@lwn.net>
+Cc: Linus Torvalds <torvalds@linux-foundation.org>
+Cc: linux-arm-kernel@lists.infradead.org
+Cc: Matthew Wilcox <willy@infradead.org>
+Cc: Mel Gorman <mgorman@suse.de>
+Cc: Michael Larabel <Michael@MichaelLarabel.com>
+Cc: Michal Hocko <mhocko@kernel.org>
+Cc: Mike Rapoport <rppt@kernel.org>
+Cc: Peter Zijlstra <peterz@infradead.org>
+Cc: Tejun Heo <tj@kernel.org>
+Cc: Vlastimil Babka <vbabka@suse.cz>
+Cc: Miaohe Lin <linmiaohe@huawei.com>
+Cc: Mike Rapoport <rppt@linux.ibm.com>
+Cc: Qi Zheng <zhengqi.arch@bytedance.com>
+Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
- arch/arm64/include/asm/cpufeature.h | 5 +++++
- arch/arm64/include/asm/pgtable.h | 13 ++++++++-----
- arch/arm64/kernel/cpufeature.c | 10 ++++++++++
- arch/arm64/tools/cpucaps | 1 +
- arch/x86/include/asm/pgtable.h | 6 +++---
- include/linux/pgtable.h | 13 +++++++++++++
- mm/memory.c | 14 +-------------
- 7 files changed, 41 insertions(+), 21 deletions(-)
-
---- a/arch/arm64/include/asm/cpufeature.h
-+++ b/arch/arm64/include/asm/cpufeature.h
-@@ -808,6 +808,11 @@ static inline bool system_supports_tlb_r
- cpus_have_const_cap(ARM64_HAS_TLB_RANGE);
- }
-
-+static inline bool system_has_hw_af(void)
-+{
-+ return IS_ENABLED(CONFIG_ARM64_HW_AFDBM) && cpus_have_const_cap(ARM64_HW_AF);
-+}
-+
- extern int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
-
- static inline u32 id_aa64mmfr0_parange_to_phys_shift(int parange)
+ arch/arm64/include/asm/pgtable.h | 14 ++------------
+ arch/x86/include/asm/pgtable.h | 6 +++---
+ include/linux/pgtable.h | 13 +++++++++++++
+ mm/memory.c | 14 +-------------
+ 4 files changed, 19 insertions(+), 28 deletions(-)
+
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
-@@ -999,13 +999,16 @@ static inline void update_mmu_cache(stru
+@@ -999,23 +999,13 @@ static inline void update_mmu_cache(stru
* page after fork() + CoW for pfn mappings. We don't always have a
* hardware-managed access flag on arm64.
*/
-static inline bool arch_faults_on_old_pte(void)
-+static inline bool arch_has_hw_pte_young(bool local)
- {
+-{
- WARN_ON(preemptible());
-+ if (local) {
-+ WARN_ON(preemptible());
-+ return cpu_has_hw_af();
-+ }
-
+-
- return !cpu_has_hw_af();
-+ return system_has_hw_af();
- }
+-}
-#define arch_faults_on_old_pte arch_faults_on_old_pte
-+#define arch_has_hw_pte_young arch_has_hw_pte_young
++#define arch_has_hw_pte_young cpu_has_hw_af
/*
* Experimentally, it's cheap to set the access flag in hardware and we
-@@ -1013,7 +1016,7 @@ static inline bool arch_faults_on_old_pt
+ * benefit from prefaulting mappings as 'old' to start with.
*/
- static inline bool arch_wants_old_prefaulted_pte(void)
- {
+-static inline bool arch_wants_old_prefaulted_pte(void)
+-{
- return !arch_faults_on_old_pte();
-+ return arch_has_hw_pte_young(true);
- }
- #define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
+-}
+-#define arch_wants_old_prefaulted_pte arch_wants_old_prefaulted_pte
++#define arch_wants_old_prefaulted_pte cpu_has_hw_af
+
+ #endif /* !__ASSEMBLY__ */
---- a/arch/arm64/kernel/cpufeature.c
-+++ b/arch/arm64/kernel/cpufeature.c
-@@ -2187,6 +2187,16 @@ static const struct arm64_cpu_capabiliti
- .matches = has_hw_dbm,
- .cpu_enable = cpu_enable_hw_dbm,
- },
-+ {
-+ .desc = "Hardware update of the Access flag",
-+ .type = ARM64_CPUCAP_SYSTEM_FEATURE,
-+ .capability = ARM64_HW_AF,
-+ .sys_reg = SYS_ID_AA64MMFR1_EL1,
-+ .sign = FTR_UNSIGNED,
-+ .field_pos = ID_AA64MMFR1_HADBS_SHIFT,
-+ .min_field_value = 1,
-+ .matches = has_cpuid_feature,
-+ },
- #endif
- {
- .desc = "CRC32 instructions",
---- a/arch/arm64/tools/cpucaps
-+++ b/arch/arm64/tools/cpucaps
-@@ -35,6 +35,7 @@ HAS_STAGE2_FWB
- HAS_SYSREG_GIC_CPUIF
- HAS_TLB_RANGE
- HAS_VIRT_HOST_EXTN
-+HW_AF
- HW_DBM
- KVM_PROTECTED_MODE
- MISMATCHED_CACHE_TYPE
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_c
-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
-+static inline bool arch_has_hw_pte_young(bool local)
++static inline bool arch_has_hw_pte_young(void)
{
- return false;
+ return true;
+#ifndef arch_has_hw_pte_young
+/*
-+ * Return whether the accessed bit is supported by the local CPU or all CPUs.
++ * Return whether the accessed bit is supported on the local CPU.
+ *
-+ * Those arches which have hw access flag feature need to implement their own
-+ * helper. By default, "false" means pagefault will be hit on old pte.
++ * This stub assumes accessing through an old PTE triggers a page fault.
++ * Architectures that automatically set the accessed bit should override it.
+ */
-+static inline bool arch_has_hw_pte_young(bool local)
++static inline bool arch_has_hw_pte_young(void)
+{
+ return false;
+}
#ifndef arch_wants_old_prefaulted_pte
static inline bool arch_wants_old_prefaulted_pte(void)
{
-@@ -2782,7 +2770,7 @@ static inline bool cow_user_page(struct
+@@ -2791,7 +2779,7 @@ static inline int cow_user_page(struct p
* On architectures with software "accessed" bits, we would
* take a double page fault, so mark it accessed here.
*/
- if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
-+ if (!arch_has_hw_pte_young(true) && !pte_young(vmf->orig_pte)) {
++ if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
pte_t entry;
vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);