From a4103262b01a1b8704b37c01c7c813df91b7b119 Mon Sep 17 00:00:00 2001
From: Yu Zhao <yuzhao@google.com>
Date: Sun, 18 Sep 2022 01:59:58 -0600
Subject: [PATCH 01/29] mm: x86, arm64: add arch_has_hw_pte_young()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Patch series "Multi-Gen LRU Framework", v14.

What's new
==========
1. OpenWrt, in addition to Android, Arch Linux Zen, Armbian, ChromeOS,
   Liquorix, post-factum and XanMod, is now shipping MGLRU on 5.15.
2. Fixed long-tailed direct reclaim latency seen on high-memory (TBs)
   machines. The old direct reclaim backoff, which tries to enforce a
   minimum fairness among all eligible memcgs, over-swapped by about
   (total_mem>>DEF_PRIORITY)-nr_to_reclaim; see the sketch after this
   list. The new backoff, which pulls the plug on swapping once the
   target is met, trades some fairness for curtailed latency:
   https://lore.kernel.org/r/20220918080010.2920238-10-yuzhao@google.com/
3. Fixed minor build warnings and conflicts; addressed more comments
   and nits.

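As a back-of-the-envelope illustration of the over-swap bound in item 2,
consider the following standalone C sketch. The 4 TB machine size and the
32-page reclaim target are assumed example values, not numbers from this
series; DEF_PRIORITY is 12, as defined in include/linux/mmzone.h.

  #include <stdio.h>

  #define DEF_PRIORITY 12 /* as defined in include/linux/mmzone.h */

  int main(void)
  {
          /* Assumed example values: a 4 TB machine with 4 KiB pages. */
          unsigned long long total_mem = 4ULL << 40;
          unsigned long long nr_to_reclaim = 32ULL << 12;  /* 32 pages */
          unsigned long long over_swap =
                  (total_mem >> DEF_PRIORITY) - nr_to_reclaim;

          /* Prints ~1023 MiB: about a gigabyte beyond the target. */
          printf("worst-case over-swap: ~%llu MiB\n", over_swap >> 20);
          return 0;
  }

On a machine that size, the old backoff could thus swap out roughly a
gigabyte more than requested, which is where the long-tailed direct
reclaim latency came from.
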
TLDR
====
The current page reclaim is too expensive in terms of CPU usage and it
often makes poor choices about what to evict. This patchset offers an
alternative solution that is performant, versatile and
straightforward.

Patchset overview
=================
The design and implementation overview is in patch 14:
https://lore.kernel.org/r/20220918080010.2920238-15-yuzhao@google.com/

01. mm: x86, arm64: add arch_has_hw_pte_young()
02. mm: x86: add CONFIG_ARCH_HAS_NONLEAF_PMD_YOUNG
Take advantage of hardware features when trying to clear the accessed
bit in many PTEs.

03. mm/vmscan.c: refactor shrink_node()
04. Revert "include/linux/mm_inline.h: fold __update_lru_size() into
    its sole caller"
Minor refactors to improve readability for the following patches.

05. mm: multi-gen LRU: groundwork
Adds the basic data structure and the functions that insert pages to
and remove pages from the multi-gen LRU (MGLRU) lists.

06. mm: multi-gen LRU: minimal implementation
A minimal implementation without optimizations.

07. mm: multi-gen LRU: exploit locality in rmap
Exploits spatial locality to improve efficiency when using the rmap.

08. mm: multi-gen LRU: support page table walks
Further exploits spatial locality by optionally scanning page tables.

09. mm: multi-gen LRU: optimize multiple memcgs
Optimizes the overall performance for multiple memcgs running mixed
types of workloads.

10. mm: multi-gen LRU: kill switch
Adds a kill switch to enable or disable MGLRU at runtime.

11. mm: multi-gen LRU: thrashing prevention
12. mm: multi-gen LRU: debugfs interface
Provide userspace with features like thrashing prevention, working set
estimation and proactive reclaim.

13. mm: multi-gen LRU: admin guide
14. mm: multi-gen LRU: design doc
Add an admin guide and a design doc.

Benchmark results
=================
Independent lab results
-----------------------
Based on the popularity of searches [01] and the memory usage in
Google's public cloud, the most popular open-source memory-hungry
applications, in alphabetical order, are:
   Apache Cassandra      Memcached
   Apache Hadoop         MongoDB
   Apache Spark          PostgreSQL
   MariaDB (MySQL)       Redis

An independent lab evaluated MGLRU with the most widely used benchmark
suites for the above applications. They posted 960 data points along
with kernel metrics and perf profiles collected over more than 500
hours of total benchmark time. Their final reports show that, with 95%
confidence intervals (CIs), the above applications all performed
significantly better for at least part of their benchmark matrices.

On 5.14:
1. Apache Spark [02] took 95% CIs [9.28, 11.19]% and [12.20, 14.93]%
   less wall time to sort three billion random integers, respectively,
   under the medium- and the high-concurrency conditions, when
   overcommitting memory. There were no statistically significant
   changes in wall time for the rest of the benchmark matrix.
2. MariaDB [03] achieved 95% CIs [5.24, 10.71]% and [20.22, 25.97]%
   more transactions per minute (TPM), respectively, under the medium-
   and the high-concurrency conditions, when overcommitting memory.
   There were no statistically significant changes in TPM for the rest
   of the benchmark matrix.
3. Memcached [04] achieved 95% CIs [23.54, 32.25]%, [20.76, 41.61]%
   and [21.59, 30.02]% more operations per second (OPS), respectively,
   for sequential access, random access and Gaussian (distribution)
   access, when THP=always; 95% CIs [13.85, 15.97]% and
   [23.94, 29.92]% more OPS, respectively, for random access and
   Gaussian access, when THP=never. There were no statistically
   significant changes in OPS for the rest of the benchmark matrix.
4. MongoDB [05] achieved 95% CIs [2.23, 3.44]%, [6.97, 9.73]% and
   [2.16, 3.55]% more operations per second (OPS), respectively, for
   exponential (distribution) access, random access and Zipfian
   (distribution) access, when underutilizing memory; 95% CIs
   [8.83, 10.03]%, [21.12, 23.14]% and [5.53, 6.46]% more OPS,
   respectively, for exponential access, random access and Zipfian
   access, when overcommitting memory.

On 5.15:
5. Apache Cassandra [06] achieved 95% CIs [1.06, 4.10]%, [1.94, 5.43]%
   and [4.11, 7.50]% more operations per second (OPS), respectively,
   for exponential (distribution) access, random access and Zipfian
   (distribution) access, when swap was off; 95% CIs [0.50, 2.60]%,
   [6.51, 8.77]% and [3.29, 6.75]% more OPS, respectively, for
   exponential access, random access and Zipfian access, when swap was
   on.
6. Apache Hadoop [07] took 95% CIs [5.31, 9.69]% and [2.02, 7.86]%
   less average wall time to finish twelve parallel TeraSort jobs,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in average wall time for the rest of the
   benchmark matrix.
7. PostgreSQL [08] achieved 95% CI [1.75, 6.42]% more transactions per
   minute (TPM) under the high-concurrency condition, when swap was
   off; 95% CIs [12.82, 18.69]% and [22.70, 46.86]% more TPM,
   respectively, under the medium- and the high-concurrency
   conditions, when swap was on. There were no statistically
   significant changes in TPM for the rest of the benchmark matrix.
8. Redis [09] achieved 95% CIs [0.58, 5.94]%, [6.55, 14.58]% and
   [11.47, 19.36]% more total operations per second (OPS),
   respectively, for sequential access, random access and Gaussian
   (distribution) access, when THP=always; 95% CIs [1.27, 3.54]%,
   [10.11, 14.81]% and [8.75, 13.64]% more total OPS, respectively,
   for sequential access, random access and Gaussian access, when
   THP=never.

Our lab results
---------------
To supplement the above results, we ran the following benchmark suites
on 5.16-rc7 and found no regressions [10].
   fs_fio_bench_hdd_mq      pft
   fs_lmbench               pgsql-hammerdb
   fs_parallelio            redis
   fs_postmark              stream
   hackbench                sysbenchthread
   kernbench                tpcc_spark
   memcached                unixbench
   multichase               vm-scalability
   mutilate                 will-it-scale
   nginx

[01] https://trends.google.com
[02] https://lore.kernel.org/r/20211102002002.92051-1-bot@edi.works/
[03] https://lore.kernel.org/r/20211009054315.47073-1-bot@edi.works/
[04] https://lore.kernel.org/r/20211021194103.65648-1-bot@edi.works/
[05] https://lore.kernel.org/r/20211109021346.50266-1-bot@edi.works/
[06] https://lore.kernel.org/r/20211202062806.80365-1-bot@edi.works/
[07] https://lore.kernel.org/r/20211209072416.33606-1-bot@edi.works/
[08] https://lore.kernel.org/r/20211218071041.24077-1-bot@edi.works/
[09] https://lore.kernel.org/r/20211122053248.57311-1-bot@edi.works/
[10] https://lore.kernel.org/r/20220104202247.2903702-1-yuzhao@google.com/

Real-world applications
=======================
Third-party testimonials
------------------------
Konstantin reported [11]:
   I have Archlinux with 8G RAM + zswap + swap. While developing, I
   have lots of apps opened such as multiple LSP-servers for different
   langs, chats, two browsers, etc... Usually, my system gets quickly
   to a point of SWAP-storms, where I have to kill LSP-servers,
   restart browsers to free memory, etc, otherwise the system lags
   heavily and is barely usable.

   1.5 day ago I migrated from 5.11.15 kernel to 5.12 + the LRU
   patchset, and I started up by opening lots of apps to create memory
   pressure, and worked for a day like this. Till now I had not a
   single SWAP-storm, and mind you I got 3.4G in SWAP. I was never
   getting to the point of 3G in SWAP before without a single
   SWAP-storm.

Vaibhav from IBM reported [12]:
   In a synthetic MongoDB Benchmark, seeing an average of ~19%
   throughput improvement on POWER10 (Radix MMU + 64K Page Size) with
   MGLRU patches on top of 5.16 kernel for MongoDB + YCSB across
   three different request distributions, namely, Exponential, Uniform
   and Zipfian.

Shuang from U of Rochester reported [13]:
   With the MGLRU, fio achieved 95% CIs [38.95, 40.26]%, [4.12, 6.64]%
   and [9.26, 10.36]% higher throughput, respectively, for random
   access, Zipfian (distribution) access and Gaussian (distribution)
   access, when the average number of jobs per CPU is 1; 95% CIs
   [42.32, 49.15]%, [9.44, 9.89]% and [20.99, 22.86]% higher
   throughput, respectively, for random access, Zipfian access and
   Gaussian access, when the average number of jobs per CPU is 2.

Daniel from Michigan Tech reported [14]:
   With Memcached allocating ~100GB of byte-addressable Optane,
   performance improvement in terms of throughput (measured as queries
   per second) was about 10% for a series of workloads.

Large-scale deployments
-----------------------
We've rolled out MGLRU to tens of millions of ChromeOS users and
about a million Android users. Google's fleetwide profiling [15] shows
an overall 40% decrease in kswapd CPU usage, in addition to
improvements in other UX metrics, e.g., an 85% decrease in the number
of low-memory kills at the 75th percentile and an 18% decrease in
app launch time at the 50th percentile.

The downstream kernels that have been using MGLRU include:
1. Android [16]
2. Arch Linux Zen [17]
3. Armbian [18]
4. ChromeOS [19]
5. Liquorix [20]
6. OpenWrt [21]
7. post-factum [22]
8. XanMod [23]

[11] https://lore.kernel.org/r/140226722f2032c86301fbd326d91baefe3d7d23.camel@yandex.ru/
[12] https://lore.kernel.org/r/87czj3mux0.fsf@vajain21.in.ibm.com/
[13] https://lore.kernel.org/r/20220105024423.26409-1-szhai2@cs.rochester.edu/
[14] https://lore.kernel.org/r/CA+4-3vksGvKd18FgRinxhqHetBS1hQekJE2gwco8Ja-bJWKtFw@mail.gmail.com/
[15] https://dl.acm.org/doi/10.1145/2749469.2750392
[16] https://android.com
[17] https://archlinux.org
[18] https://armbian.com
[19] https://chromium.org
[20] https://liquorix.net
[21] https://openwrt.org
[22] https://codeberg.org/pf-kernel
[23] https://xanmod.org

Summary
=======
The facts are:
1. The independent lab results and the real-world applications
   indicate substantial improvements; there are no known regressions.
2. Thrashing prevention, working set estimation and proactive reclaim
   work out of the box; there are no equivalent solutions.
3. There is a lot of new code; no smaller changes have demonstrated
   similar effects.

Our options, accordingly, are:
1. Given the amount of evidence, the reported improvements will likely
   materialize for a wide range of workloads.
2. Gauging the interest from the past discussions, the new features
   will likely be put to use for both personal computers and data
   centers.
3. Based on Google's track record, the new code will likely be well
   maintained in the long term. It'd be more difficult, if not
   impossible, to achieve similar effects with other approaches.

This patch (of 14):

Some architectures automatically set the accessed bit in PTEs, e.g., x86
and arm64 v8.2. On architectures that do not have this capability,
clearing the accessed bit in a PTE usually triggers a page fault following
the TLB miss of this PTE (to emulate the accessed bit).

Being aware of this capability can help make better decisions, e.g.,
whether to spread the work out over a period of time to reduce bursty page
faults when trying to clear the accessed bit in many PTEs.

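One such decision is whether to clear the accessed bit in a single bulk
pass or to spread the clearing out. The following self-contained C sketch
only illustrates that decision: arch_has_hw_pte_young() is stubbed here
rather than taken from the kernel, and the two clear_young_*() helpers
are hypothetical stand-ins invented for this example.

  #include <stdbool.h>
  #include <stdio.h>

  /* Stub for the helper this patch adds; real code would use the
     kernel's arch_has_hw_pte_young(). */
  static bool arch_has_hw_pte_young(void) { return true; }

  /* Hypothetical helpers, assumed for this sketch only. */
  static void clear_young_in_one_pass(void)
  {
          printf("bulk pass: hardware maintains the accessed bit\n");
  }

  static void clear_young_spread_over_time(void)
  {
          printf("incremental passes: avoid bursty minor faults\n");
  }

  static void age_many_ptes(void)
  {
          if (arch_has_hw_pte_young())
                  /* Clearing the accessed bit costs no extra faults. */
                  clear_young_in_one_pass();
          else
                  /* Old PTEs fault on their next access; spread the
                     work out over a period of time. */
                  clear_young_spread_over_time();
  }

  int main(void)
  {
          age_many_ptes();
          return 0;
  }

As the next paragraph notes, this is a performance hint only:
architecture-independent code must not rely on it for correctness, e.g.,
for TLB flush decisions.
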
Note that theoretically this capability can be unreliable, e.g.,
hotplugged CPUs might be different from builtin ones. Therefore it should
not be used in architecture-independent code that involves correctness,
e.g., to determine whether TLB flushes are required (in combination with
the accessed bit).

Link: https://lkml.kernel.org/r/20220918080010.2920238-1-yuzhao@google.com
Link: https://lkml.kernel.org/r/20220918080010.2920238-2-yuzhao@google.com
Signed-off-by: Yu Zhao <yuzhao@google.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: Brian Geffon <bgeffon@google.com>
Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
Acked-by: Steven Barrett <steven@liquorix.net>
Acked-by: Suleiman Souhlal <suleiman@google.com>
Acked-by: Will Deacon <will@kernel.org>
Tested-by: Daniel Byrne <djbyrne@mtu.edu>
Tested-by: Donald Carr <d@chaos-reins.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
Tested-by: Sofia Trinh <sofia.trinh@edi.works>
Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michael Larabel <Michael@MichaelLarabel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/arm64/include/asm/pgtable.h | 14 ++------------
 arch/x86/include/asm/pgtable.h   |  6 +++---
 include/linux/pgtable.h          | 13 +++++++++++++
 mm/memory.c                      | 14 +-------------
 4 files changed, 19 insertions(+), 28 deletions(-)

--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -999,23 +999,13 @@ static inline void update_mmu_cache(stru
  * page after fork() + CoW for pfn mappings. We don't always have a
  * hardware-managed access flag on arm64.
  */
-static inline bool arch_faults_on_old_pte(void)
-{
-	WARN_ON(preemptible());
-
-	return !cpu_has_hw_af();
-}
-#define arch_faults_on_old_pte		arch_faults_on_old_pte
+#define arch_has_hw_pte_young		cpu_has_hw_af
 
 /*
  * Experimentally, it's cheap to set the access flag in hardware and we
  * benefit from prefaulting mappings as 'old' to start with.
  */
-static inline bool arch_wants_old_prefaulted_pte(void)
-{
-	return !arch_faults_on_old_pte();
-}
-#define arch_wants_old_prefaulted_pte	arch_wants_old_prefaulted_pte
+#define arch_wants_old_prefaulted_pte	cpu_has_hw_af
 
 #endif /* !__ASSEMBLY__ */
 
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1397,10 +1397,10 @@ static inline bool arch_has_pfn_modify_c
 	return boot_cpu_has_bug(X86_BUG_L1TF);
 }
 
-#define arch_faults_on_old_pte arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
+#define arch_has_hw_pte_young arch_has_hw_pte_young
+static inline bool arch_has_hw_pte_young(void)
 {
-	return false;
+	return true;
 }
 
 #endif /* __ASSEMBLY__ */
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -259,6 +259,19 @@ static inline int pmdp_clear_flush_young
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif
 
+#ifndef arch_has_hw_pte_young
+/*
+ * Return whether the accessed bit is supported on the local CPU.
+ *
+ * This stub assumes accessing through an old PTE triggers a page fault.
+ * Architectures that automatically set the access bit should overwrite it.
+ */
+static inline bool arch_has_hw_pte_young(void)
+{
+	return false;
+}
+#endif
+
 #ifndef __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				       unsigned long address,
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -121,18 +121,6 @@ int randomize_va_space __read_mostly =
 					2;
 #endif
 
-#ifndef arch_faults_on_old_pte
-static inline bool arch_faults_on_old_pte(void)
-{
-	/*
-	 * Those arches which don't have hw access flag feature need to
-	 * implement their own helper. By default, "true" means pagefault
-	 * will be hit on old pte.
-	 */
-	return true;
-}
-#endif
-
 #ifndef arch_wants_old_prefaulted_pte
 static inline bool arch_wants_old_prefaulted_pte(void)
 {
@@ -2782,7 +2770,7 @@ static inline bool cow_user_page(struct
 	 * On architectures with software "accessed" bits, we would
 	 * take a double page fault, so mark it accessed here.
 	 */
-	if (arch_faults_on_old_pte() && !pte_young(vmf->orig_pte)) {
+	if (!arch_has_hw_pte_young() && !pte_young(vmf->orig_pte)) {
 		pte_t entry;
 
 		vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);