    x86/mm: Flush more aggressively in lazy TLB mode · b956575b
    Andy Lutomirski authored
    Since commit:

      94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
    
    x86's lazy TLB mode has been all the way lazy: when running a kernel thread
    (including the idle thread), the kernel keeps using the last user mm's
    page tables without attempting to maintain user TLB coherence at all.
    
    From a pure semantic perspective, this is fine -- kernel threads won't
    attempt to access user pages, so having stale TLB entries doesn't matter.
    
    Unfortunately, I forgot about a subtlety.  By skipping TLB flushes,
    we also allow any paging-structure caches that may exist on the CPU
    to become incoherent.  This means that we can have a
    paging-structure cache entry that references a freed page table, and
    the CPU is within its rights to do a speculative page walk starting
    at the freed page table.
    
    I can imagine this causing two different problems:
    
     - A speculative page walk starting from a bogus page table could read
       IO addresses.  I haven't seen any reports of this causing problems.
    
     - A speculative page walk that involves a bogus page table can install
       garbage in the TLB.  Such garbage would always be at a user VA, but
       some AMD CPUs have logic that triggers a machine check when it notices
       these bogus entries.  I've seen a couple reports of this.
    
    Boris further explains the failure mode:
    
    > It is actually more of an optimization which assumes that paging-structure
    > entries are in WB DRAM:
    >
    > "TlbCacheDis: cacheable memory disable. Read-write. 0=Enables
    > performance optimization that assumes PML4, PDP, PDE, and PTE entries
    > are in cacheable WB-DRAM; memory type checks may be bypassed, and
    > addresses outside of WB-DRAM may result in undefined behavior or NB
    > protocol errors. 1=Disables performance optimization and allows PML4,
    > PDP, PDE and PTE entries to be in any memory type. Operating systems
    > that maintain page tables in memory types other than WB-DRAM must set
    > TlbCacheDis to insure proper operation."
    >
    > The MCE generated is an NB protocol error to signal that
    >
    > "Link: A specific coherent-only packet from a CPU was issued to an
    > IO link. This may be caused by software which addresses page table
    > structures in a memory type other than cacheable WB-DRAM without
    > properly configuring MSRC001_0015[TlbCacheDis]. This may occur, for
    > example, when page table structure addresses are above top of memory. In
    > such cases, the NB will generate an MCE if it sees a mismatch between
    > the memory operation generated by the core and the link type."
    >
    > I'm assuming coherent-only packets don't go out on IO links, thus the
    > error.
    
    To fix this, reinstate TLB coherence in lazy mode.  With this patch
    applied, we do it in one of two ways:
    
     - If we have PCID, we simply switch back to init_mm's page tables
       when we enter a kernel thread -- this seems to be quite cheap
       except for the cost of serializing the CPU.
    
     - If we don't have PCID, then we set a flag and switch to init_mm
       the first time we would otherwise need to flush the TLB.
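    The two cases above can be sketched as a small userspace model.  This
    is only an illustration under stated assumptions, not the code in this
    patch: struct cpu_state, enter_lazy() and flush_request() are
    hypothetical names, and the "switch" is modeled as a plain assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
enum mm_id { MM_INIT, MM_USER };

struct cpu_state {
	enum mm_id loaded_mm;	/* whose page tables this CPU is using */
	bool is_lazy;		/* kernel thread running on a borrowed user mm */
	bool has_pcid;		/* CPU supports PCID */
};

/* Model of what happens when the CPU starts running a kernel thread. */
void enter_lazy(struct cpu_state *cpu)
{
	assert(cpu != NULL);

	if (cpu->has_pcid) {
		/* PCID case: switch to init_mm's page tables right away. */
		cpu->loaded_mm = MM_INIT;
		cpu->is_lazy = false;
	} else {
		/* No PCID: keep the user page tables, but remember that. */
		cpu->is_lazy = true;
	}
}

/*
 * Model of a flush request for the user mm arriving at this CPU.
 * Returns true if an ordinary TLB flush is still required.
 */
bool flush_request(struct cpu_state *cpu)
{
	assert(cpu != NULL);

	if (cpu->loaded_mm != MM_USER)
		return false;	/* not using the user page tables at all */

	if (cpu->is_lazy) {
		/* First flush while lazy: leave the user mm instead. */
		cpu->loaded_mm = MM_INIT;
		cpu->is_lazy = false;
		return false;
	}

	return true;		/* actively running the user mm */
}
```

    In this model a lazy non-PCID CPU pays for the switch only if a flush
    is actually requested while it is lazy; a PCID CPU pays for it on
    every entry into a kernel thread.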
    
    The /sys/kernel/debug/x86/tlb_use_lazy_mode debug switch can be changed
    to override the default mode for benchmarking.
    
    In theory, we could optimize this better by only flushing the TLB in
    lazy CPUs when a page table is freed.  Doing that would require
    auditing the mm code to make sure that all page table freeing goes
    through tlb_remove_page() as well as reworking some data structures
    to implement the improved flush logic.
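    As a rough sketch of that possible optimization (not implemented by
    this patch), the decision could look like the following.  The
    struct flush_req type, its freed_tables field and
    lazy_cpu_needs_flush() are assumptions made up for illustration; the
    sketch also assumes every page table free would be reported reliably,
    which is exactly what the audit mentioned above would have to
    establish.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one flush request. */
struct flush_req {
	bool freed_tables;	/* did this operation free a page table? */
};

/*
 * Decide whether a CPU has to act on a flush request.
 *
 * A non-lazy CPU always has to flush.  A lazy CPU never touches user
 * pages, so stale leaf TLB entries are harmless; it only has to act
 * when a page table was freed, because a stale paging-structure cache
 * entry could otherwise point at the freed table.
 */
bool lazy_cpu_needs_flush(bool cpu_is_lazy, const struct flush_req *req)
{
	assert(req != NULL);

	if (!cpu_is_lazy)
		return true;

	return req->freed_tables;
}
```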
    
    Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de>
    Reported-by: Adam Borowski <kilobyte@angband.pl>
    Signed-off-by: Andy Lutomirski <luto@kernel.org>
    Signed-off-by: Borislav Petkov <bp@suse.de>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Brian Gerst <brgerst@gmail.com>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Cc: Eric Biggers <ebiggers@google.com>
    Cc: Johannes Hirte <johannes.hirte@datenkhaos.de>
    Cc: Kees Cook <keescook@chromium.org>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Roman Kagan <rkagan@virtuozzo.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Fixes: 94b1b03b ("x86/mm: Rework lazy TLB mode and TLB freshness tracking")
    Link: http://lkml.kernel.org/r/20171009170231.fkpraqokz6e4zeco@pd.tnic
    
    
    Signed-off-by: Ingo Molnar <mingo@kernel.org>