kernel: Update MGLRU patchset
[openwrt/staging/dedeckeh.git] target/linux/generic/backport-5.15/020-v6.1-05-mm-multi-gen-LRU-groundwork.patch
1 From a9b328add8422921a0dbbef162730800e16e8cfd Mon Sep 17 00:00:00 2001
2 From: Yu Zhao <yuzhao@google.com>
3 Date: Sun, 18 Sep 2022 02:00:02 -0600
4 Subject: [PATCH 05/29] mm: multi-gen LRU: groundwork
5 MIME-Version: 1.0
6 Content-Type: text/plain; charset=UTF-8
7 Content-Transfer-Encoding: 8bit
8
9 Evictable pages are divided into multiple generations for each lruvec.
10 The youngest generation number is stored in lrugen->max_seq for both
11 anon and file types as they are aged on an equal footing. The oldest
12 generation numbers are stored in lrugen->min_seq[] separately for anon
13 and file types as clean file pages can be evicted regardless of swap
14 constraints. These three variables are monotonically increasing.
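These counters live in the new struct lru_gen_struct embedded in struct lruvec
by this patch (abridged from the include/linux/mmzone.h hunk below):

	struct lru_gen_struct {
		/* the aging increments the youngest generation number */
		unsigned long max_seq;
		/* the eviction increments the oldest generation numbers */
		unsigned long min_seq[ANON_AND_FILE];
		/* the multi-gen LRU lists, lazily sorted on eviction */
		struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
		/* the multi-gen LRU sizes, eventually consistent */
		long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
	};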
15
16 Generation numbers are truncated into order_base_2(MAX_NR_GENS+1) bits
17 in order to fit into the gen counter in page->flags. Each truncated
18 generation number is an index to lrugen->lists[]. The sliding window
19 technique is used to track at least MIN_NR_GENS and at most
20 MAX_NR_GENS generations. The gen counter stores a value within [1,
21 MAX_NR_GENS] while a page is on one of lrugen->lists[]. Otherwise it
22 stores 0.
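Concretely, this mapping is implemented by two helpers added below in
include/linux/mm_inline.h (condensed here; page_lru_gen() returns -1 for a
page that is not on any lrugen list):

	/* index into lrugen->lists[] within the sliding window */
	static inline int lru_gen_from_seq(unsigned long seq)
	{
		return seq % MAX_NR_GENS;
	}

	/* the gen counter in page->flags stores gen + 1, or 0 when off the lists */
	static inline int page_lru_gen(struct page *page)
	{
		unsigned long flags = READ_ONCE(page->flags);

		return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
	}

lru_gen_add_page() stores the counter as (gen + 1UL) << LRU_GEN_PGOFF.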
23
24 There are two conceptually independent procedures: "the aging", which
25 produces young generations, and "the eviction", which consumes old
26 generations. They form a closed-loop system, i.e., "the page reclaim".
27 Both procedures can be invoked from userspace for the purposes of working
28 set estimation and proactive reclaim. These techniques are commonly used
29 to optimize job scheduling (bin packing) in data centers [1][2].
30
31 To avoid confusion, the terms "hot" and "cold" will be applied to the
32 multi-gen LRU, as a new convention; the terms "active" and "inactive" will
33 be applied to the active/inactive LRU, as usual.
34
35 The protection of hot pages and the selection of cold pages are based
36 on page access channels and patterns. There are two access channels:
37 one through page tables and the other through file descriptors. The
38 protection of the former channel is by design stronger because:
39 1. The uncertainty in determining the access patterns of the former
40 channel is higher due to the approximation of the accessed bit.
41 2. The cost of evicting the former channel is higher due to the TLB
42 flushes required and the likelihood of encountering the dirty bit.
43 3. The penalty of underprotecting the former channel is higher because
44 applications usually do not prepare themselves for major page
45 faults like they do for blocked I/O. E.g., GUI applications
46 commonly use dedicated I/O threads to avoid blocking rendering
47 threads.
48
49 There are also two access patterns: one with temporal locality and the
50 other without. For the reasons listed above, the former channel is
51 assumed to follow the former pattern unless VM_SEQ_READ or VM_RAND_READ is
52 present; the latter channel is assumed to follow the latter pattern unless
53 outlying refaults have been observed [3][4].
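The page-table-channel assumption is applied at fault time by
lru_gen_enter_fault(), added below in mm/memory.c:

	static void lru_gen_enter_fault(struct vm_area_struct *vma)
	{
		/* the LRU algorithm doesn't apply to sequential or random reads */
		current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
	}

lru_cache_add() then activates pages faulted in while in_lru_fault is set (see
the mm/swap.c hunk below).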
54
55 The next patch will address the "outlying refaults". Three macros, i.e.,
56 LRU_REFS_WIDTH, LRU_REFS_PGOFF and LRU_REFS_MASK, used later are added in
57 this patch to make the entire patchset less diffy.
58
59 A page is added to the youngest generation on faulting. The aging needs
60 to check the accessed bit at least twice before handing this page over to
61 the eviction. The first check takes care of the accessed bit set on the
62 initial fault; the second check makes sure this page has not been used
63 since then. This protocol, AKA second chance, requires a minimum of two
64 generations, hence MIN_NR_GENS.
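To stay compatible with the active/inactive LRU ABI (e.g., /proc/vmstat), the
two youngest generations are reported as active and the rest as inactive, as
tested by lru_gen_is_active() added below in include/linux/mm_inline.h
(condensed):

	static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
	{
		unsigned long max_seq = lruvec->lrugen.max_seq;

		/* see the comment on MIN_NR_GENS */
		return gen == lru_gen_from_seq(max_seq) ||
		       gen == lru_gen_from_seq(max_seq - 1);
	}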
65
66 [1] https://dl.acm.org/doi/10.1145/3297858.3304053
67 [2] https://dl.acm.org/doi/10.1145/3503222.3507731
68 [3] https://lwn.net/Articles/495543/
69 [4] https://lwn.net/Articles/815342/
70
71 Link: https://lkml.kernel.org/r/20220918080010.2920238-6-yuzhao@google.com
72 Signed-off-by: Yu Zhao <yuzhao@google.com>
73 Acked-by: Brian Geffon <bgeffon@google.com>
74 Acked-by: Jan Alexander Steffens (heftig) <heftig@archlinux.org>
75 Acked-by: Oleksandr Natalenko <oleksandr@natalenko.name>
76 Acked-by: Steven Barrett <steven@liquorix.net>
77 Acked-by: Suleiman Souhlal <suleiman@google.com>
78 Tested-by: Daniel Byrne <djbyrne@mtu.edu>
79 Tested-by: Donald Carr <d@chaos-reins.com>
80 Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
81 Tested-by: Konstantin Kharlamov <Hi-Angel@yandex.ru>
82 Tested-by: Shuang Zhai <szhai2@cs.rochester.edu>
83 Tested-by: Sofia Trinh <sofia.trinh@edi.works>
84 Tested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
85 Cc: Andi Kleen <ak@linux.intel.com>
86 Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
87 Cc: Barry Song <baohua@kernel.org>
88 Cc: Catalin Marinas <catalin.marinas@arm.com>
89 Cc: Dave Hansen <dave.hansen@linux.intel.com>
90 Cc: Hillf Danton <hdanton@sina.com>
91 Cc: Jens Axboe <axboe@kernel.dk>
92 Cc: Johannes Weiner <hannes@cmpxchg.org>
93 Cc: Jonathan Corbet <corbet@lwn.net>
94 Cc: Linus Torvalds <torvalds@linux-foundation.org>
95 Cc: Matthew Wilcox <willy@infradead.org>
96 Cc: Mel Gorman <mgorman@suse.de>
97 Cc: Miaohe Lin <linmiaohe@huawei.com>
98 Cc: Michael Larabel <Michael@MichaelLarabel.com>
99 Cc: Michal Hocko <mhocko@kernel.org>
100 Cc: Mike Rapoport <rppt@kernel.org>
101 Cc: Mike Rapoport <rppt@linux.ibm.com>
102 Cc: Peter Zijlstra <peterz@infradead.org>
103 Cc: Qi Zheng <zhengqi.arch@bytedance.com>
104 Cc: Tejun Heo <tj@kernel.org>
105 Cc: Vlastimil Babka <vbabka@suse.cz>
106 Cc: Will Deacon <will@kernel.org>
107 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
108 ---
109 fs/fuse/dev.c | 3 +-
110 include/linux/mm.h | 2 +
111 include/linux/mm_inline.h | 177 +++++++++++++++++++++++++++++-
112 include/linux/mmzone.h | 100 +++++++++++++++++
113 include/linux/page-flags-layout.h | 13 ++-
114 include/linux/page-flags.h | 4 +-
115 include/linux/sched.h | 4 +
116 kernel/bounds.c | 5 +
117 mm/Kconfig | 8 ++
118 mm/huge_memory.c | 3 +-
119 mm/memcontrol.c | 2 +
120 mm/memory.c | 25 +++++
121 mm/mm_init.c | 6 +-
122 mm/mmzone.c | 2 +
123 mm/swap.c | 10 +-
124 mm/vmscan.c | 75 +++++++++++++
125 16 files changed, 425 insertions(+), 14 deletions(-)
126
127 diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
128 index d6b5339c56e2..4ec08f7c3e75 100644
129 --- a/fs/fuse/dev.c
130 +++ b/fs/fuse/dev.c
131 @@ -785,7 +785,8 @@ static int fuse_check_page(struct page *page)
132 1 << PG_active |
133 1 << PG_workingset |
134 1 << PG_reclaim |
135 - 1 << PG_waiters))) {
136 + 1 << PG_waiters |
137 + LRU_GEN_MASK | LRU_REFS_MASK))) {
138 dump_page(page, "fuse: trying to steal weird page");
139 return 1;
140 }
141 diff --git a/include/linux/mm.h b/include/linux/mm.h
142 index e4e1817bb3b8..699068f39aa0 100644
143 --- a/include/linux/mm.h
144 +++ b/include/linux/mm.h
145 @@ -1093,6 +1093,8 @@ vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf);
146 #define ZONES_PGOFF (NODES_PGOFF - ZONES_WIDTH)
147 #define LAST_CPUPID_PGOFF (ZONES_PGOFF - LAST_CPUPID_WIDTH)
148 #define KASAN_TAG_PGOFF (LAST_CPUPID_PGOFF - KASAN_TAG_WIDTH)
149 +#define LRU_GEN_PGOFF (KASAN_TAG_PGOFF - LRU_GEN_WIDTH)
150 +#define LRU_REFS_PGOFF (LRU_GEN_PGOFF - LRU_REFS_WIDTH)
151
152 /*
153 * Define the bit shifts to access each section. For non-existent
154 diff --git a/include/linux/mm_inline.h b/include/linux/mm_inline.h
155 index a822d6b690a5..65320d2b8f60 100644
156 --- a/include/linux/mm_inline.h
157 +++ b/include/linux/mm_inline.h
158 @@ -26,10 +26,13 @@ static inline int page_is_file_lru(struct page *page)
159
160 static __always_inline void __update_lru_size(struct lruvec *lruvec,
161 enum lru_list lru, enum zone_type zid,
162 - int nr_pages)
163 + long nr_pages)
164 {
165 struct pglist_data *pgdat = lruvec_pgdat(lruvec);
166
167 + lockdep_assert_held(&lruvec->lru_lock);
168 + WARN_ON_ONCE(nr_pages != (int)nr_pages);
169 +
170 __mod_lruvec_state(lruvec, NR_LRU_BASE + lru, nr_pages);
171 __mod_zone_page_state(&pgdat->node_zones[zid],
172 NR_ZONE_LRU_BASE + lru, nr_pages);
173 @@ -86,11 +89,177 @@ static __always_inline enum lru_list page_lru(struct page *page)
174 return lru;
175 }
176
177 +#ifdef CONFIG_LRU_GEN
178 +
179 +static inline bool lru_gen_enabled(void)
180 +{
181 + return true;
182 +}
183 +
184 +static inline bool lru_gen_in_fault(void)
185 +{
186 + return current->in_lru_fault;
187 +}
188 +
189 +static inline int lru_gen_from_seq(unsigned long seq)
190 +{
191 + return seq % MAX_NR_GENS;
192 +}
193 +
194 +static inline int page_lru_gen(struct page *page)
195 +{
196 + unsigned long flags = READ_ONCE(page->flags);
197 +
198 + return ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
199 +}
200 +
201 +static inline bool lru_gen_is_active(struct lruvec *lruvec, int gen)
202 +{
203 + unsigned long max_seq = lruvec->lrugen.max_seq;
204 +
205 + VM_WARN_ON_ONCE(gen >= MAX_NR_GENS);
206 +
207 + /* see the comment on MIN_NR_GENS */
208 + return gen == lru_gen_from_seq(max_seq) || gen == lru_gen_from_seq(max_seq - 1);
209 +}
210 +
211 +static inline void lru_gen_update_size(struct lruvec *lruvec, struct page *page,
212 + int old_gen, int new_gen)
213 +{
214 + int type = page_is_file_lru(page);
215 + int zone = page_zonenum(page);
216 + int delta = thp_nr_pages(page);
217 + enum lru_list lru = type * LRU_INACTIVE_FILE;
218 + struct lru_gen_struct *lrugen = &lruvec->lrugen;
219 +
220 + VM_WARN_ON_ONCE(old_gen != -1 && old_gen >= MAX_NR_GENS);
221 + VM_WARN_ON_ONCE(new_gen != -1 && new_gen >= MAX_NR_GENS);
222 + VM_WARN_ON_ONCE(old_gen == -1 && new_gen == -1);
223 +
224 + if (old_gen >= 0)
225 + WRITE_ONCE(lrugen->nr_pages[old_gen][type][zone],
226 + lrugen->nr_pages[old_gen][type][zone] - delta);
227 + if (new_gen >= 0)
228 + WRITE_ONCE(lrugen->nr_pages[new_gen][type][zone],
229 + lrugen->nr_pages[new_gen][type][zone] + delta);
230 +
231 + /* addition */
232 + if (old_gen < 0) {
233 + if (lru_gen_is_active(lruvec, new_gen))
234 + lru += LRU_ACTIVE;
235 + __update_lru_size(lruvec, lru, zone, delta);
236 + return;
237 + }
238 +
239 + /* deletion */
240 + if (new_gen < 0) {
241 + if (lru_gen_is_active(lruvec, old_gen))
242 + lru += LRU_ACTIVE;
243 + __update_lru_size(lruvec, lru, zone, -delta);
244 + return;
245 + }
246 +}
247 +
248 +static inline bool lru_gen_add_page(struct lruvec *lruvec, struct page *page, bool reclaiming)
249 +{
250 + unsigned long seq;
251 + unsigned long flags;
252 + int gen = page_lru_gen(page);
253 + int type = page_is_file_lru(page);
254 + int zone = page_zonenum(page);
255 + struct lru_gen_struct *lrugen = &lruvec->lrugen;
256 +
257 + VM_WARN_ON_ONCE_PAGE(gen != -1, page);
258 +
259 + if (PageUnevictable(page))
260 + return false;
261 + /*
262 + * There are three common cases for this page:
263 + * 1. If it's hot, e.g., freshly faulted in or previously hot and
264 + * migrated, add it to the youngest generation.
265 + * 2. If it's cold but can't be evicted immediately, i.e., an anon page
266 + * not in swapcache or a dirty page pending writeback, add it to the
267 + * second oldest generation.
268 + * 3. Everything else (clean, cold) is added to the oldest generation.
269 + */
270 + if (PageActive(page))
271 + seq = lrugen->max_seq;
272 + else if ((type == LRU_GEN_ANON && !PageSwapCache(page)) ||
273 + (PageReclaim(page) &&
274 + (PageDirty(page) || PageWriteback(page))))
275 + seq = lrugen->min_seq[type] + 1;
276 + else
277 + seq = lrugen->min_seq[type];
278 +
279 + gen = lru_gen_from_seq(seq);
280 + flags = (gen + 1UL) << LRU_GEN_PGOFF;
281 + /* see the comment on MIN_NR_GENS about PG_active */
282 + set_mask_bits(&page->flags, LRU_GEN_MASK | BIT(PG_active), flags);
283 +
284 + lru_gen_update_size(lruvec, page, -1, gen);
285 + /* for rotate_reclaimable_page() */
286 + if (reclaiming)
287 + list_add_tail(&page->lru, &lrugen->lists[gen][type][zone]);
288 + else
289 + list_add(&page->lru, &lrugen->lists[gen][type][zone]);
290 +
291 + return true;
292 +}
293 +
294 +static inline bool lru_gen_del_page(struct lruvec *lruvec, struct page *page, bool reclaiming)
295 +{
296 + unsigned long flags;
297 + int gen = page_lru_gen(page);
298 +
299 + if (gen < 0)
300 + return false;
301 +
302 + VM_WARN_ON_ONCE_PAGE(PageActive(page), page);
303 + VM_WARN_ON_ONCE_PAGE(PageUnevictable(page), page);
304 +
305 + /* for migrate_page_states() */
306 + flags = !reclaiming && lru_gen_is_active(lruvec, gen) ? BIT(PG_active) : 0;
307 + flags = set_mask_bits(&page->flags, LRU_GEN_MASK, flags);
308 + gen = ((flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
309 +
310 + lru_gen_update_size(lruvec, page, gen, -1);
311 + list_del(&page->lru);
312 +
313 + return true;
314 +}
315 +
316 +#else /* !CONFIG_LRU_GEN */
317 +
318 +static inline bool lru_gen_enabled(void)
319 +{
320 + return false;
321 +}
322 +
323 +static inline bool lru_gen_in_fault(void)
324 +{
325 + return false;
326 +}
327 +
328 +static inline bool lru_gen_add_page(struct lruvec *lruvec, struct page *page, bool reclaiming)
329 +{
330 + return false;
331 +}
332 +
333 +static inline bool lru_gen_del_page(struct lruvec *lruvec, struct page *page, bool reclaiming)
334 +{
335 + return false;
336 +}
337 +
338 +#endif /* CONFIG_LRU_GEN */
339 +
340 static __always_inline void add_page_to_lru_list(struct page *page,
341 struct lruvec *lruvec)
342 {
343 enum lru_list lru = page_lru(page);
344
345 + if (lru_gen_add_page(lruvec, page, false))
346 + return;
347 +
348 update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
349 list_add(&page->lru, &lruvec->lists[lru]);
350 }
351 @@ -100,6 +269,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
352 {
353 enum lru_list lru = page_lru(page);
354
355 + if (lru_gen_add_page(lruvec, page, true))
356 + return;
357 +
358 update_lru_size(lruvec, lru, page_zonenum(page), thp_nr_pages(page));
359 list_add_tail(&page->lru, &lruvec->lists[lru]);
360 }
361 @@ -107,6 +279,9 @@ static __always_inline void add_page_to_lru_list_tail(struct page *page,
362 static __always_inline void del_page_from_lru_list(struct page *page,
363 struct lruvec *lruvec)
364 {
365 + if (lru_gen_del_page(lruvec, page, false))
366 + return;
367 +
368 list_del(&page->lru);
369 update_lru_size(lruvec, page_lru(page), page_zonenum(page),
370 -thp_nr_pages(page));
371 diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
372 index 6ba100216530..0c39f72184d0 100644
373 --- a/include/linux/mmzone.h
374 +++ b/include/linux/mmzone.h
375 @@ -294,6 +294,102 @@ enum lruvec_flags {
376 */
377 };
378
379 +#endif /* !__GENERATING_BOUNDS_H */
380 +
381 +/*
382 + * Evictable pages are divided into multiple generations. The youngest and the
383 + * oldest generation numbers, max_seq and min_seq, are monotonically increasing.
384 + * They form a sliding window of a variable size [MIN_NR_GENS, MAX_NR_GENS]. An
385 + * offset within MAX_NR_GENS, i.e., gen, indexes the LRU list of the
386 + * corresponding generation. The gen counter in page->flags stores gen+1 while
387 + * a page is on one of lrugen->lists[]. Otherwise it stores 0.
388 + *
389 + * A page is added to the youngest generation on faulting. The aging needs to
390 + * check the accessed bit at least twice before handing this page over to the
391 + * eviction. The first check takes care of the accessed bit set on the initial
392 + * fault; the second check makes sure this page hasn't been used since then.
393 + * This process, AKA second chance, requires a minimum of two generations,
394 + * hence MIN_NR_GENS. And to maintain ABI compatibility with the active/inactive
395 + * LRU, e.g., /proc/vmstat, these two generations are considered active; the
396 + * rest of generations, if they exist, are considered inactive. See
397 + * lru_gen_is_active().
398 + *
399 + * PG_active is always cleared while a page is on one of lrugen->lists[] so that
400 + * the aging needs not to worry about it. And it's set again when a page
401 + * considered active is isolated for non-reclaiming purposes, e.g., migration.
402 + * See lru_gen_add_page() and lru_gen_del_page().
403 + *
404 + * MAX_NR_GENS is set to 4 so that the multi-gen LRU can support twice the
405 + * number of categories of the active/inactive LRU when keeping track of
406 + * accesses through page tables. This requires order_base_2(MAX_NR_GENS+1) bits
407 + * in page->flags.
408 + */
409 +#define MIN_NR_GENS 2U
410 +#define MAX_NR_GENS 4U
411 +
412 +#ifndef __GENERATING_BOUNDS_H
413 +
414 +struct lruvec;
415 +
416 +#define LRU_GEN_MASK ((BIT(LRU_GEN_WIDTH) - 1) << LRU_GEN_PGOFF)
417 +#define LRU_REFS_MASK ((BIT(LRU_REFS_WIDTH) - 1) << LRU_REFS_PGOFF)
418 +
419 +#ifdef CONFIG_LRU_GEN
420 +
421 +enum {
422 + LRU_GEN_ANON,
423 + LRU_GEN_FILE,
424 +};
425 +
426 +/*
427 + * The youngest generation number is stored in max_seq for both anon and file
428 + * types as they are aged on an equal footing. The oldest generation numbers are
429 + * stored in min_seq[] separately for anon and file types as clean file pages
430 + * can be evicted regardless of swap constraints.
431 + *
432 + * Normally anon and file min_seq are in sync. But if swapping is constrained,
433 + * e.g., out of swap space, file min_seq is allowed to advance and leave anon
434 + * min_seq behind.
435 + *
436 + * The number of pages in each generation is eventually consistent and therefore
437 + * can be transiently negative.
438 + */
439 +struct lru_gen_struct {
440 + /* the aging increments the youngest generation number */
441 + unsigned long max_seq;
442 + /* the eviction increments the oldest generation numbers */
443 + unsigned long min_seq[ANON_AND_FILE];
444 + /* the multi-gen LRU lists, lazily sorted on eviction */
445 + struct list_head lists[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
446 + /* the multi-gen LRU sizes, eventually consistent */
447 + long nr_pages[MAX_NR_GENS][ANON_AND_FILE][MAX_NR_ZONES];
448 +};
449 +
450 +void lru_gen_init_lruvec(struct lruvec *lruvec);
451 +
452 +#ifdef CONFIG_MEMCG
453 +void lru_gen_init_memcg(struct mem_cgroup *memcg);
454 +void lru_gen_exit_memcg(struct mem_cgroup *memcg);
455 +#endif
456 +
457 +#else /* !CONFIG_LRU_GEN */
458 +
459 +static inline void lru_gen_init_lruvec(struct lruvec *lruvec)
460 +{
461 +}
462 +
463 +#ifdef CONFIG_MEMCG
464 +static inline void lru_gen_init_memcg(struct mem_cgroup *memcg)
465 +{
466 +}
467 +
468 +static inline void lru_gen_exit_memcg(struct mem_cgroup *memcg)
469 +{
470 +}
471 +#endif
472 +
473 +#endif /* CONFIG_LRU_GEN */
474 +
475 struct lruvec {
476 struct list_head lists[NR_LRU_LISTS];
477 /* per lruvec lru_lock for memcg */
478 @@ -311,6 +407,10 @@ struct lruvec {
479 unsigned long refaults[ANON_AND_FILE];
480 /* Various lruvec state flags (enum lruvec_flags) */
481 unsigned long flags;
482 +#ifdef CONFIG_LRU_GEN
483 + /* evictable pages divided into generations */
484 + struct lru_gen_struct lrugen;
485 +#endif
486 #ifdef CONFIG_MEMCG
487 struct pglist_data *pgdat;
488 #endif
489 diff --git a/include/linux/page-flags-layout.h b/include/linux/page-flags-layout.h
490 index ef1e3e736e14..240905407a18 100644
491 --- a/include/linux/page-flags-layout.h
492 +++ b/include/linux/page-flags-layout.h
493 @@ -55,7 +55,8 @@
494 #define SECTIONS_WIDTH 0
495 #endif
496
497 -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
498 +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_SHIFT \
499 + <= BITS_PER_LONG - NR_PAGEFLAGS
500 #define NODES_WIDTH NODES_SHIFT
501 #elif defined(CONFIG_SPARSEMEM_VMEMMAP)
502 #error "Vmemmap: No space for nodes field in page flags"
503 @@ -89,8 +90,8 @@
504 #define LAST_CPUPID_SHIFT 0
505 #endif
506
507 -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT \
508 - <= BITS_PER_LONG - NR_PAGEFLAGS
509 +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
510 + KASAN_TAG_WIDTH + LAST_CPUPID_SHIFT <= BITS_PER_LONG - NR_PAGEFLAGS
511 #define LAST_CPUPID_WIDTH LAST_CPUPID_SHIFT
512 #else
513 #define LAST_CPUPID_WIDTH 0
514 @@ -100,10 +101,12 @@
515 #define LAST_CPUPID_NOT_IN_PAGE_FLAGS
516 #endif
517
518 -#if ZONES_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH \
519 - > BITS_PER_LONG - NR_PAGEFLAGS
520 +#if ZONES_WIDTH + LRU_GEN_WIDTH + SECTIONS_WIDTH + NODES_WIDTH + \
521 + KASAN_TAG_WIDTH + LAST_CPUPID_WIDTH > BITS_PER_LONG - NR_PAGEFLAGS
522 #error "Not enough bits in page flags"
523 #endif
524
525 +#define LRU_REFS_WIDTH 0
526 +
527 #endif
528 #endif /* _LINUX_PAGE_FLAGS_LAYOUT */
529 diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
530 index fbfd3fad48f2..a7d7ff4c621d 100644
531 --- a/include/linux/page-flags.h
532 +++ b/include/linux/page-flags.h
533 @@ -845,7 +845,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
534 1UL << PG_private | 1UL << PG_private_2 | \
535 1UL << PG_writeback | 1UL << PG_reserved | \
536 1UL << PG_slab | 1UL << PG_active | \
537 - 1UL << PG_unevictable | __PG_MLOCKED)
538 + 1UL << PG_unevictable | __PG_MLOCKED | LRU_GEN_MASK)
539
540 /*
541 * Flags checked when a page is prepped for return by the page allocator.
542 @@ -856,7 +856,7 @@ static inline void ClearPageSlabPfmemalloc(struct page *page)
543 * alloc-free cycle to prevent from reusing the page.
544 */
545 #define PAGE_FLAGS_CHECK_AT_PREP \
546 - (PAGEFLAGS_MASK & ~__PG_HWPOISON)
547 + ((PAGEFLAGS_MASK & ~__PG_HWPOISON) | LRU_GEN_MASK | LRU_REFS_MASK)
548
549 #define PAGE_FLAGS_PRIVATE \
550 (1UL << PG_private | 1UL << PG_private_2)
551 diff --git a/include/linux/sched.h b/include/linux/sched.h
552 index e418935f8db6..545f6b1ccd50 100644
553 --- a/include/linux/sched.h
554 +++ b/include/linux/sched.h
555 @@ -911,6 +911,10 @@ struct task_struct {
556 #ifdef CONFIG_MEMCG
557 unsigned in_user_fault:1;
558 #endif
559 +#ifdef CONFIG_LRU_GEN
560 + /* whether the LRU algorithm may apply to this access */
561 + unsigned in_lru_fault:1;
562 +#endif
563 #ifdef CONFIG_COMPAT_BRK
564 unsigned brk_randomized:1;
565 #endif
566 diff --git a/kernel/bounds.c b/kernel/bounds.c
567 index 9795d75b09b2..5ee60777d8e4 100644
568 --- a/kernel/bounds.c
569 +++ b/kernel/bounds.c
570 @@ -22,6 +22,11 @@ int main(void)
571 DEFINE(NR_CPUS_BITS, ilog2(CONFIG_NR_CPUS));
572 #endif
573 DEFINE(SPINLOCK_SIZE, sizeof(spinlock_t));
574 +#ifdef CONFIG_LRU_GEN
575 + DEFINE(LRU_GEN_WIDTH, order_base_2(MAX_NR_GENS + 1));
576 +#else
577 + DEFINE(LRU_GEN_WIDTH, 0);
578 +#endif
579 /* End of constants */
580
581 return 0;
582 diff --git a/mm/Kconfig b/mm/Kconfig
583 index c048dea7e342..0eeb27397884 100644
584 --- a/mm/Kconfig
585 +++ b/mm/Kconfig
586 @@ -897,6 +897,14 @@ config IO_MAPPING
587 config SECRETMEM
588 def_bool ARCH_HAS_SET_DIRECT_MAP && !EMBEDDED
589
590 +config LRU_GEN
591 + bool "Multi-Gen LRU"
592 + depends on MMU
593 + # make sure page->flags has enough spare bits
594 + depends on 64BIT || !SPARSEMEM || SPARSEMEM_VMEMMAP
595 + help
596 + A high performance LRU implementation to overcommit memory.
597 +
598 source "mm/damon/Kconfig"
599
600 endmenu
601 diff --git a/mm/huge_memory.c b/mm/huge_memory.c
602 index 98ff57c8eda6..f260ef82f03a 100644
603 --- a/mm/huge_memory.c
604 +++ b/mm/huge_memory.c
605 @@ -2366,7 +2366,8 @@ static void __split_huge_page_tail(struct page *head, int tail,
606 #ifdef CONFIG_64BIT
607 (1L << PG_arch_2) |
608 #endif
609 - (1L << PG_dirty)));
610 + (1L << PG_dirty) |
611 + LRU_GEN_MASK | LRU_REFS_MASK));
612
613 /* ->mapping in first tail page is compound_mapcount */
614 VM_BUG_ON_PAGE(tail > 2 && page_tail->mapping != TAIL_MAPPING,
615 diff --git a/mm/memcontrol.c b/mm/memcontrol.c
616 index b68b2fe639fd..8b634dc72e7f 100644
617 --- a/mm/memcontrol.c
618 +++ b/mm/memcontrol.c
619 @@ -5178,6 +5178,7 @@ static void __mem_cgroup_free(struct mem_cgroup *memcg)
620
621 static void mem_cgroup_free(struct mem_cgroup *memcg)
622 {
623 + lru_gen_exit_memcg(memcg);
624 memcg_wb_domain_exit(memcg);
625 __mem_cgroup_free(memcg);
626 }
627 @@ -5241,6 +5242,7 @@ static struct mem_cgroup *mem_cgroup_alloc(void)
628 memcg->deferred_split_queue.split_queue_len = 0;
629 #endif
630 idr_replace(&mem_cgroup_idr, memcg, memcg->id.id);
631 + lru_gen_init_memcg(memcg);
632 return memcg;
633 fail:
634 mem_cgroup_id_remove(memcg);
635 diff --git a/mm/memory.c b/mm/memory.c
636 index 392b7326a2d2..7d5be951de9e 100644
637 --- a/mm/memory.c
638 +++ b/mm/memory.c
639 @@ -4778,6 +4778,27 @@ static inline void mm_account_fault(struct pt_regs *regs,
640 perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN, 1, regs, address);
641 }
642
643 +#ifdef CONFIG_LRU_GEN
644 +static void lru_gen_enter_fault(struct vm_area_struct *vma)
645 +{
646 + /* the LRU algorithm doesn't apply to sequential or random reads */
647 + current->in_lru_fault = !(vma->vm_flags & (VM_SEQ_READ | VM_RAND_READ));
648 +}
649 +
650 +static void lru_gen_exit_fault(void)
651 +{
652 + current->in_lru_fault = false;
653 +}
654 +#else
655 +static void lru_gen_enter_fault(struct vm_area_struct *vma)
656 +{
657 +}
658 +
659 +static void lru_gen_exit_fault(void)
660 +{
661 +}
662 +#endif /* CONFIG_LRU_GEN */
663 +
664 /*
665 * By the time we get here, we already hold the mm semaphore
666 *
667 @@ -4809,11 +4830,15 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
668 if (flags & FAULT_FLAG_USER)
669 mem_cgroup_enter_user_fault();
670
671 + lru_gen_enter_fault(vma);
672 +
673 if (unlikely(is_vm_hugetlb_page(vma)))
674 ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
675 else
676 ret = __handle_mm_fault(vma, address, flags);
677
678 + lru_gen_exit_fault();
679 +
680 if (flags & FAULT_FLAG_USER) {
681 mem_cgroup_exit_user_fault();
682 /*
683 diff --git a/mm/mm_init.c b/mm/mm_init.c
684 index 9ddaf0e1b0ab..0d7b2bd2454a 100644
685 --- a/mm/mm_init.c
686 +++ b/mm/mm_init.c
687 @@ -65,14 +65,16 @@ void __init mminit_verify_pageflags_layout(void)
688
689 shift = 8 * sizeof(unsigned long);
690 width = shift - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH
691 - - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH;
692 + - LAST_CPUPID_SHIFT - KASAN_TAG_WIDTH - LRU_GEN_WIDTH - LRU_REFS_WIDTH;
693 mminit_dprintk(MMINIT_TRACE, "pageflags_layout_widths",
694 - "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Flags %d\n",
695 + "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d Gen %d Tier %d Flags %d\n",
696 SECTIONS_WIDTH,
697 NODES_WIDTH,
698 ZONES_WIDTH,
699 LAST_CPUPID_WIDTH,
700 KASAN_TAG_WIDTH,
701 + LRU_GEN_WIDTH,
702 + LRU_REFS_WIDTH,
703 NR_PAGEFLAGS);
704 mminit_dprintk(MMINIT_TRACE, "pageflags_layout_shifts",
705 "Section %d Node %d Zone %d Lastcpupid %d Kasantag %d\n",
706 diff --git a/mm/mmzone.c b/mm/mmzone.c
707 index eb89d6e018e2..2ec0d7793424 100644
708 --- a/mm/mmzone.c
709 +++ b/mm/mmzone.c
710 @@ -81,6 +81,8 @@ void lruvec_init(struct lruvec *lruvec)
711
712 for_each_lru(lru)
713 INIT_LIST_HEAD(&lruvec->lists[lru]);
714 +
715 + lru_gen_init_lruvec(lruvec);
716 }
717
718 #if defined(CONFIG_NUMA_BALANCING) && !defined(LAST_CPUPID_NOT_IN_PAGE_FLAGS)
719 diff --git a/mm/swap.c b/mm/swap.c
720 index af3cad4e5378..0bdc96661fb6 100644
721 --- a/mm/swap.c
722 +++ b/mm/swap.c
723 @@ -446,6 +446,11 @@ void lru_cache_add(struct page *page)
724 VM_BUG_ON_PAGE(PageActive(page) && PageUnevictable(page), page);
725 VM_BUG_ON_PAGE(PageLRU(page), page);
726
727 + /* see the comment in lru_gen_add_page() */
728 + if (lru_gen_enabled() && !PageUnevictable(page) &&
729 + lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
730 + SetPageActive(page);
731 +
732 get_page(page);
733 local_lock(&lru_pvecs.lock);
734 pvec = this_cpu_ptr(&lru_pvecs.lru_add);
735 @@ -547,7 +552,7 @@ static void lru_deactivate_file_fn(struct page *page, struct lruvec *lruvec)
736
737 static void lru_deactivate_fn(struct page *page, struct lruvec *lruvec)
738 {
739 - if (PageActive(page) && !PageUnevictable(page)) {
740 + if (!PageUnevictable(page) && (PageActive(page) || lru_gen_enabled())) {
741 int nr_pages = thp_nr_pages(page);
742
743 del_page_from_lru_list(page, lruvec);
744 @@ -661,7 +666,8 @@ void deactivate_file_page(struct page *page)
745 */
746 void deactivate_page(struct page *page)
747 {
748 - if (PageLRU(page) && PageActive(page) && !PageUnevictable(page)) {
749 + if (PageLRU(page) && !PageUnevictable(page) &&
750 + (PageActive(page) || lru_gen_enabled())) {
751 struct pagevec *pvec;
752
753 local_lock(&lru_pvecs.lock);
754 diff --git a/mm/vmscan.c b/mm/vmscan.c
755 index dc5f0381513f..41826fe17eb3 100644
756 --- a/mm/vmscan.c
757 +++ b/mm/vmscan.c
758 @@ -2821,6 +2821,81 @@ static bool can_age_anon_pages(struct pglist_data *pgdat,
759 return can_demote(pgdat->node_id, sc);
760 }
761
762 +#ifdef CONFIG_LRU_GEN
763 +
764 +/******************************************************************************
765 + * shorthand helpers
766 + ******************************************************************************/
767 +
768 +#define for_each_gen_type_zone(gen, type, zone) \
769 + for ((gen) = 0; (gen) < MAX_NR_GENS; (gen)++) \
770 + for ((type) = 0; (type) < ANON_AND_FILE; (type)++) \
771 + for ((zone) = 0; (zone) < MAX_NR_ZONES; (zone)++)
772 +
773 +static struct lruvec __maybe_unused *get_lruvec(struct mem_cgroup *memcg, int nid)
774 +{
775 + struct pglist_data *pgdat = NODE_DATA(nid);
776 +
777 +#ifdef CONFIG_MEMCG
778 + if (memcg) {
779 + struct lruvec *lruvec = &memcg->nodeinfo[nid]->lruvec;
780 +
781 + /* for hotadd_new_pgdat() */
782 + if (!lruvec->pgdat)
783 + lruvec->pgdat = pgdat;
784 +
785 + return lruvec;
786 + }
787 +#endif
788 + VM_WARN_ON_ONCE(!mem_cgroup_disabled());
789 +
790 + return pgdat ? &pgdat->__lruvec : NULL;
791 +}
792 +
793 +/******************************************************************************
794 + * initialization
795 + ******************************************************************************/
796 +
797 +void lru_gen_init_lruvec(struct lruvec *lruvec)
798 +{
799 + int gen, type, zone;
800 + struct lru_gen_struct *lrugen = &lruvec->lrugen;
801 +
802 + lrugen->max_seq = MIN_NR_GENS + 1;
803 +
804 + for_each_gen_type_zone(gen, type, zone)
805 + INIT_LIST_HEAD(&lrugen->lists[gen][type][zone]);
806 +}
807 +
808 +#ifdef CONFIG_MEMCG
809 +void lru_gen_init_memcg(struct mem_cgroup *memcg)
810 +{
811 +}
812 +
813 +void lru_gen_exit_memcg(struct mem_cgroup *memcg)
814 +{
815 + int nid;
816 +
817 + for_each_node(nid) {
818 + struct lruvec *lruvec = get_lruvec(memcg, nid);
819 +
820 + VM_WARN_ON_ONCE(memchr_inv(lruvec->lrugen.nr_pages, 0,
821 + sizeof(lruvec->lrugen.nr_pages)));
822 + }
823 +}
824 +#endif
825 +
826 +static int __init init_lru_gen(void)
827 +{
828 + BUILD_BUG_ON(MIN_NR_GENS + 1 >= MAX_NR_GENS);
829 + BUILD_BUG_ON(BIT(LRU_GEN_WIDTH) <= MAX_NR_GENS);
830 +
831 + return 0;
832 +};
833 +late_initcall(init_lru_gen);
834 +
835 +#endif /* CONFIG_LRU_GEN */
836 +
837 static void shrink_lruvec(struct lruvec *lruvec, struct scan_control *sc)
838 {
839 unsigned long nr[NR_LRU_LISTS];
840 --
841 2.40.0
842