1. 23 Jan, 2020 1 commit
    • powerpc/mm/hash: Fix sharing context ids between kernel & userspace · 5d2e5dd5
      Aneesh Kumar K.V authored
      Commit 0034d395 ("powerpc/mm/hash64: Map all the kernel regions in
      the same 0xc range") has a bug in the definition of MIN_USER_CONTEXT.
      
      The result is that the context id used for the vmemmap and the lowest
      context id handed out to userspace are the same. The context id is
      essentially the process identifier as far as the first stage of the
      MMU translation is concerned.
      
      This can result in multiple SLB entries with the same VSID (Virtual
      Segment ID), accessible to the kernel and some random userspace
      process that happens to get the overlapping id, which is not expected
      eg:
      
        07 c00c000008000000 40066bdea7000500  1T  ESID=   c00c00  VSID=      66bdea7 LLP:100
        12 0002000008000000 40066bdea7000d80  1T  ESID=      200  VSID=      66bdea7 LLP:100
      
      Even though the user process and the kernel use the same VSID, the
      permissions in the hash page table prevent the user process from
      reading or writing to any kernel mappings.
      
      It can also lead to SLB entries with different base page size
      encodings (LLP), eg:
      
        05 c00c000008000000 00006bde0053b500 256M ESID=c00c00000  VSID=    6bde0053b LLP:100
        09 0000000008000000 00006bde0053bc80 256M ESID=        0  VSID=    6bde0053b LLP:  0
      
      Such SLB entries can result in machine checks, eg. as seen on a G5:
      
        Oops: Machine check, sig: 7 [#1]
        BE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=4 NUMA PowerMac
        NIP: c00000000026f248 LR: c000000000295e58 CTR: 0000000000000000
        REGS: c0000000erfd3d70 TRAP: 0200 Tainted: G M (5.5.0-rc1-gcc-8.2.0-00010-g228b667d8ea1)
        MSR: 9000000000109032 <SF,HV,EE,ME,IR,DR,RI> CR: 24282048 XER: 00000000
        DAR: c00c000000612c80 DSISR: 00000400 IRQMASK: 0
        ...
        NIP [c00000000026f248] .kmem_cache_free+0x58/0x140
        LR  [c000000000295e58] .putname+0x88/0xa0
        Call Trace:
          .putname+0xb8/0xa0
          .filename_lookup.part.76+0xbe/0x160
          .do_faccessat+0xe0/0x380
          system_call+0x5c/0x68
      
      This happens with 256MB segments and 64K pages, as the duplicate VSID
      is hit with the first vmemmap segment and the first user segment, and
      older 32-bit userspace maps things in the first user segment.
      
      On other CPUs a machine check is not seen. Instead the userspace
      process can get stuck continuously faulting, with the fault never
      properly serviced, due to the kernel not understanding that there is
      already a HPTE for the address but with inaccessible permissions.
      
      On machines with 1T segments we've not seen the bug hit other than by
      deliberately exercising it. That seems to be just a matter of luck
      though, due to the typical layout of the user virtual address space
      and the ranges of vmemmap that are typically populated.
      
      To fix it we add 2 to MIN_USER_CONTEXT. This ensures the lowest
      context given to userspace doesn't overlap with the VMEMMAP context,
      or with the context for INVALID_REGION_ID.
      
      Fixes: 0034d395 ("powerpc/mm/hash64: Map all the kernel regions in the same 0xc range")
      Cc: stable@vger.kernel.org # v5.2+
      Reported-by: Christian Marillat <marillat@debian.org>
      Reported-by: Romain Dolbeau <romain@dolbeau.org>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      [mpe: Account for INVALID_REGION_ID, mostly rewrite change log]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20200123102547.11623-1-mpe@ellerman.id.au
      5d2e5dd5
  2. 30 May, 2019 1 commit
  3. 21 Apr, 2019 6 commits
  4. 23 Feb, 2019 1 commit
  5. 14 Oct, 2018 3 commits
  6. 09 Oct, 2018 1 commit
    • KVM: PPC: Book3S HV: Implement H_TLB_INVALIDATE hcall · e3b6b466
      Suraj Jitindar Singh authored
      
      
      When running a nested (L2) guest the guest (L1) hypervisor will use
      the H_TLB_INVALIDATE hcall when it needs to change the partition
      scoped page tables or the partition table which it manages.  It will
      use this hcall in the situations where it would use a partition-scoped
      tlbie instruction if it were running in hypervisor mode.
      
      The H_TLB_INVALIDATE hcall can invalidate different scopes:
      
      Invalidate TLB for a given target address:
      - This invalidates a single L2 -> L1 pte
      - We need to invalidate any L2 -> L0 shadow_pgtable ptes which map the L2
        address space which is being invalidated. This is because a single
        L2 -> L1 pte may have been mapped with more than one pte in the
        L2 -> L0 page tables.
      
      Invalidate the entire TLB for a given LPID or for all LPIDs:
      - Invalidate the entire shadow_pgtable for a given nested guest, or
        for all nested guests.
      
      Invalidate the PWC (page walk cache) for a given LPID or for all LPIDs:
      - We don't cache the PWC, so nothing to do.
      
      Invalidate the entire TLB, PWC and partition table for a given/all LPIDs:
      - Here we re-read the partition table entry and remove the nested state
        for any nested guest for which the first doubleword of the partition
        table entry is now zero.
      
      The H_TLB_INVALIDATE hcall takes as parameters the tlbie instruction
      word (of which only the RIC, PRS and R fields are used), the rS value
      (giving the lpid, where required) and the rB value (giving the IS, AP
      and EPN values).
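
      The resulting dispatch might look roughly like this (a hand-written
      sketch of the scopes listed above, not the kernel's implementation; the
      function name and the already-decoded ric/prs/lpid/epn parameters are
      assumptions):

        #include <stdio.h>

        /* RIC values from the tlbie instruction word. */
        enum { RIC_FLUSH_TLB = 0, RIC_FLUSH_PWC = 1, RIC_FLUSH_ALL = 2 };

        static long handle_h_tlb_invalidate(int ric, int prs,
                                            unsigned long lpid, unsigned long epn)
        {
            (void)prs;
            switch (ric) {
            case RIC_FLUSH_TLB:
                /* One guest pte: flush every L2 -> L0 shadow pte mapping this
                 * guest address, since one L2 -> L1 pte may be backed by
                 * several shadow ptes. */
                printf("flush shadow ptes: lpid=%lu epn=%#lx\n", lpid, epn);
                return 0;
            case RIC_FLUSH_PWC:
                /* The page walk cache is not cached separately: nothing to do. */
                return 0;
            case RIC_FLUSH_ALL:
                /* Re-read the partition table entry and drop the nested state
                 * of any guest whose first doubleword is now zero. */
                printf("revalidate partition table: lpid=%lu\n", lpid);
                return 0;
            default:
                return -1;  /* reserved RIC value */
            }
        }

        int main(void)
        {
            return (int)handle_h_tlb_invalidate(RIC_FLUSH_TLB, 1, 7, 0x4000);
        }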
      
      [paulus@ozlabs.org - adapted to having the partition table in guest
      memory, added the H_TLB_INVALIDATE implementation, removed tlbie
      instruction emulation, reworded the commit message.]
      
      Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e3b6b466
  7. 03 Oct, 2018 1 commit
  8. 19 Sep, 2018 4 commits
    • powerpc/64s/hash: provide arch_setup_exec hooks for hash slice setup · 2e162674
      Nicholas Piggin authored
      
      
      This will be used by the SLB code in the next patch, but for now this
      sets the slb_addr_limit to the correct size for 32-bit tasks.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      2e162674
    • powerpc/64s/hash: remove user SLB data from the paca · 8fed04d0
      Nicholas Piggin authored
      
      
      User SLB mapping data is copied into the PACA from the mm->context so
      it can be accessed by the SLB miss handlers.
      
      After the C conversion, SLB miss handlers now run with relocation on,
      and user SLB misses are able to take recursive kernel SLB misses, so
      the user SLB mapping data can be removed from the paca and accessed
      directly.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      8fed04d0
    • powerpc/64s/hash: remove the vmalloc segment from the bolted SLB · 85376e2a
      Nicholas Piggin authored
      
      
      Remove the vmalloc segment from bolted SLBEs. This is not required to
      be bolted, and seems like it was added to help pre-load the SLB on
      context switch. However there are now other segments like the vmemmap
      segment and non-zero node memory that often take misses after a context
      switch, so it is better to solve this in a more general way.
      
      A subsequent change will track free SLB entries and use those rather
      than round-robin overwriting valid entries, which makes it far less
      likely for kernel SLBEs to be evicted after they are installed.
      
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      85376e2a
    • powerpc/pseries: Dump the SLB contents on SLB MCE errors. · c6d15258
      Mahesh Salgaonkar authored
      
      
      If we get a machine check exception due to SLB errors, dump the current
      SLB contents, which will be very helpful in debugging the root cause of
      the errors. Introduce a dedicated per-cpu buffer to hold the faulty SLB
      entries. The real-mode MCE handler saves the old SLB contents into this
      buffer, accessible through the paca, and prints them out later in
      virtual mode.
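
      A rough sketch of that split (standalone C with made-up names; the real
      buffer hangs off the paca and is filled by the pseries MCE code):

        #include <stdio.h>

        #define NR_CPUS     4
        #define SLB_ENTRIES 32

        struct slb_entry { unsigned long esid, vsid; };

        /* One buffer per cpu: written in real mode, read later in virtual mode. */
        static struct slb_entry mce_slb_save[NR_CPUS][SLB_ENTRIES];

        /* Real-mode path: only copy the faulty SLB out, no printing here. */
        static void save_slb_contents(int cpu, const struct slb_entry *slb, int n)
        {
            for (int i = 0; i < n && i < SLB_ENTRIES; i++)
                mce_slb_save[cpu][i] = slb[i];
        }

        /* Virtual-mode path: dump what was captured at MCE time. */
        static void dump_saved_slb(int cpu, int n)
        {
            printf("SLB contents of cpu %d\n", cpu);
            for (int i = 0; i < n && i < SLB_ENTRIES; i++)
                printf("%02d %016lx %016lx\n", i,
                       mce_slb_save[cpu][i].esid, mce_slb_save[cpu][i].vsid);
        }

        int main(void)
        {
            struct slb_entry fake[2] = {
                { 0xc000000008000000UL, 0x400ea1b217000500UL },
                { 0xd000000008000000UL, 0x400d43642f000510UL },
            };
            save_slb_contents(1, fake, 2);
            dump_saved_slb(1, 2);
            return 0;
        }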
      
      With this patch the console will log SLB contents like below on SLB MCE
      errors:
      
      [  507.297236] SLB contents of cpu 0x1
      [  507.297237] Last SLB entry inserted at slot 16
      [  507.297238] 00 c000000008000000 400ea1b217000500
      [  507.297239]   1T  ESID=   c00000  VSID=      ea1b217 LLP:100
      [  507.297240] 01 d000000008000000 400d43642f000510
      [  507.297242]   1T  ESID=   d00000  VSID=      d43642f LLP:110
      [  507.297243] 11 f000000008000000 400a86c85f000500
      [  507.297244]   1T  ESID=   f00000  VSID=      a86c85f LLP:100
      [  507.297245] 12 00007f0008000000 4008119624000d90
      [  507.297246]   1T  ESID=       7f  VSID=      8119624 LLP:110
      [  507.297247] 13 0000000018000000 00092885f5150d90
      [  507.297247]  256M ESID=        1  VSID=   92885f5150 LLP:110
      [  507.297248] 14 0000010008000000 4009e7cb50000d90
      [  507.297249]   1T  ESID=        1  VSID=      9e7cb50 LLP:110
      [  507.297250] 15 d000000008000000 400d43642f000510
      [  507.297251]   1T  ESID=   d00000  VSID=      d43642f LLP:110
      [  507.297252] 16 d000000008000000 400d43642f000510
      [  507.297253]   1T  ESID=   d00000  VSID=      d43642f LLP:110
      [  507.297253] ----------------------------------
      [  507.297254] SLB cache ptr value = 3
      [  507.297254] Valid SLB cache entries:
      [  507.297255] 00 EA[0-35]=    7f000
      [  507.297256] 01 EA[0-35]=        1
      [  507.297257] 02 EA[0-35]=     1000
      [  507.297257] Rest of SLB cache entries:
      [  507.297258] 03 EA[0-35]=    7f000
      [  507.297258] 04 EA[0-35]=        1
      [  507.297259] 05 EA[0-35]=     1000
      [  507.297260] 06 EA[0-35]=       12
      [  507.297260] 07 EA[0-35]=    7f000
      
      Suggested-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Suggested-by: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      c6d15258
  9. 10 Aug, 2018 1 commit
  10. 30 Jul, 2018 1 commit
  11. 24 Jul, 2018 1 commit
  12. 20 Jan, 2018 1 commit
  13. 13 Nov, 2017 1 commit
  14. 31 Aug, 2017 1 commit
  15. 16 Aug, 2017 1 commit
    • powerpc/mm/hugetlb: Add support for reserving gigantic huge pages via kernel command line · 79cc38de
      Aneesh Kumar K.V authored
      With commit aa888a74 ("hugetlb: support larger than MAX_ORDER") we added
      support for allocating gigantic hugepages via the kernel command line.
      Switch the ppc64 arch specific code to use that.

      W.r.t. FSL support, we now limit our allocation range using
      BOOTMEM_ALLOC_ACCESSIBLE.
      
      We use the kernel command line to do reservation of hugetlb pages on
      powernv platforms. In pseries hash MMU mode the supported gigantic huge
      page size is 16GB, and that can only be allocated with hypervisor
      assistance. For pseries the command line option doesn't do the
      allocation. Instead pseries does gigantic hugepage allocation based on
      a hypervisor hint that is specified via the "ibm,expected#pages"
      property of the memory node.
      
      Cc: Scott Wood <oss@buserror.net>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      79cc38de
  16. 28 Apr, 2017 1 commit
  17. 01 Apr, 2017 2 commits
  18. 31 Mar, 2017 4 commits
    • powerpc/mm/hash: Convert mask to unsigned long · 59248aec
      Aneesh Kumar K.V authored
      
      
      This doesn't have any functional change, but it helps avoid mistakes in
      case the shift value changes.
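
      The kind of mistake this guards against (a generic illustration, not the
      actual macro involved): with an int-typed constant the mask silently
      breaks once the shift grows past 31 on a 64-bit build.

        #include <stdio.h>

        #define SHIFT 40  /* imagine the shift growing in a later patch */

        int main(void)
        {
            unsigned long good = (1UL << SHIFT) - 1;  /* 0xffffffffff, as intended */
            /* (1 << SHIFT) - 1 would shift a 32-bit int by 40 bits, which is
             * undefined behaviour and certainly not the wanted mask. */
            printf("%#lx\n", good);
            return 0;
        }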
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      59248aec
    • powerpc/mm/hash: Support 68 bit VA · e6f81a92
      Aneesh Kumar K.V authored
      
      
      In order to support a larger effective address range (512TB), we want
      to increase the virtual address bits to 68. But we do have platforms
      like p4 and p5 that can only do 65 bit VA. We support those platforms
      by limiting the context bits on them to 16.

      The protovsid -> vsid conversion is verified to work with both 65 and
      68 bit VA values. I also documented the restrictions in a table format
      as part of code comments.
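
      Written out, the arithmetic for the two configurations looks like this
      (a sketch assuming 256MB segments and an illustrative 21-bit ESID field):

        #include <stdio.h>

        #define SID_SHIFT 28  /* 256MB segment offset bits */
        #define ESID_BITS 21  /* assumed effective segment id bits */

        int main(void)
        {
            int ctx_bits = 19, ctx_bits_p4p5 = 16;

            /* VA bits = context bits + ESID bits + segment offset bits */
            printf("68-bit VA platforms: %d\n", ctx_bits      + ESID_BITS + SID_SHIFT);
            printf("65-bit VA (p4/p5):   %d\n", ctx_bits_p4p5 + ESID_BITS + SID_SHIFT);
            return 0;
        }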
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      e6f81a92
    • powerpc/mm/hash: Check for non-kernel address in get_kernel_vsid() · 85beb1c4
      Michael Ellerman authored
      
      
      get_kernel_vsid() has a very stern comment saying that it's only valid
      for kernel addresses, but there's nothing in the code to enforce that.
      
      Rather than hoping our callers are well behaved, add a check and return
      a VSID of 0 (invalid).
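
      A minimal sketch of the guard (illustrative; the real check keys off the
      kernel region of the effective address):

        #include <stdio.h>

        static unsigned long get_kernel_vsid(unsigned long ea)
        {
            if (ea < 0xc000000000000000UL)  /* not a kernel address */
                return 0;                   /* VSID 0 is invalid */
            /* ... the usual kernel VSID computation would follow ... */
            return 1;                       /* placeholder */
        }

        int main(void)
        {
            printf("%lu\n", get_kernel_vsid(0x10000000UL));  /* prints 0 */
            return 0;
        }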
      
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      85beb1c4
    • powerpc/mm/hash: Use context ids 1-4 for the kernel · 941711a3
      Aneesh Kumar K.V authored
      
      
      Currently we use the top 4 context ids (0x7fffc-0x7ffff) for the kernel.
      Kernel VSIDs are built using these top context values and the effective
      segment ID. In subsequent patches we want to increase the max effective
      address to 512TB. We will achieve that by increasing the effective
      segment IDs, thereby increasing the virtual address range.
      
      We will be switching to a 68bit virtual address in the following patch.
      But platforms like Power4 and Power5 only support a 65 bit virtual
      address. We will handle that by limiting the context bits to 16 instead
      of 19 on those platforms. That means the max context id will have a
      different value on different platforms.
      
      So that we don't have to deal with the kernel context ids changing
      between different platforms, move the kernel context ids down to use
      context ids 1-4.
      
      We can't use segment 0 of context-id 0, because that maps to VSID 0,
      which we want to keep as invalid, so we avoid context-id 0 entirely.
      Similarly we can't use the last segment of the maximum context, so we
      avoid it too.
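
      To see why segment 0 of context 0 is a problem, note that the scramble
      is a multiply-and-modulo, so a zero proto-VSID always produces VSID 0
      (a worked toy, with an illustrative multiplier and modulus):

        #include <stdio.h>

        #define ESID_BITS       18          /* illustrative */
        #define VSID_MULTIPLIER 1000003UL   /* illustrative prime */
        #define VSID_MODULUS    ((1UL << 37) - 1)

        static unsigned long vsid_scramble(unsigned long protovsid)
        {
            return (protovsid * VSID_MULTIPLIER) % VSID_MODULUS;
        }

        int main(void)
        {
            /* context 0, ESID 0 -> proto-VSID 0 -> VSID 0, which must stay invalid */
            printf("%lu\n", vsid_scramble((0UL << ESID_BITS) | 0));
            return 0;
        }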
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      [mpe: Switch from 0-3 to 1-4 so VSID=0 remains invalid]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      941711a3
  19. 10 Feb, 2017 1 commit
    • powerpc/pseries: Add support for hash table resizing · dbcf929c
      David Gibson authored
      
      
      This adds support for using two hypercalls to change the size of the
      main hash page table while running as a PAPR guest. For now these
      hypercalls are only in experimental qemu versions.
      
      The interface is two part: first H_RESIZE_HPT_PREPARE is used to
      allocate and prepare the new hash table. This may be slow, but can be
      done asynchronously. Then, H_RESIZE_HPT_COMMIT is used to switch to the
      new hash table. This requires that no CPUs be concurrently updating the
      HPT, and so must be run under stop_machine().
      
      This also adds a debugfs file which can be used to manually control HPT
      resizing, for testing purposes.
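
      From the guest side the calling pattern is roughly the following (a
      sketch of the two-phase flow; the wrapper functions and return codes
      stand in for the PAPR hcall interface and are not exact signatures):

        #include <stdio.h>

        enum { H_SUCCESS = 0, H_BUSY = 1, H_LONG_BUSY = 2 };  /* stand-ins */

        /* Hypothetical wrappers around the two hypercalls. */
        static int h_resize_hpt_prepare(unsigned long shift) { (void)shift; return H_SUCCESS; }
        static int h_resize_hpt_commit(unsigned long shift)  { (void)shift; return H_SUCCESS; }

        static int resize_hpt(unsigned long new_shift)
        {
            int rc;

            /* Phase 1: the hypervisor allocates and prepares the new HPT.
             * This may be slow, so poll while it reports busy. */
            do {
                rc = h_resize_hpt_prepare(new_shift);
            } while (rc == H_BUSY || rc == H_LONG_BUSY);
            if (rc != H_SUCCESS)
                return rc;

            /* Phase 2: switch to the new HPT.  In the kernel this runs under
             * stop_machine() so no other CPU can update the HPT meanwhile. */
            return h_resize_hpt_commit(new_shift);
        }

        int main(void)
        {
            printf("resize rc = %d\n", resize_hpt(27));
            return 0;
        }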
      
      Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
      Reviewed-by: Paul Mackerras <paulus@samba.org>
      [mpe: Rename the debugfs file to "hpt_order"]
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      dbcf929c
  20. 30 Jan, 2017 1 commit
    • powerpc/mm/hash: Properly mask the ESID bits when building proto VSID · 79270e0a
      Aneesh Kumar K.V authored
      
      
      The proto VSID is built using both the MMU context id and effective
      segment ID (ESID). We should not have overlapping bits between those.
      That could result in us having a VSID collision. With the current code
      we missed masking the top bits of the ESID. This means that for kernel
      addresses we ended up using the top 4 bits of the ESID as part of the
      proto VSID, which is wrong.
      
      The current code uses the top 4 context values (0x7fffc - 0x7ffff) for
      the kernel. With those context IDs used for the kernel, we don't run
      into VSID collisions because we get the same proto VSID irrespective of
      whether we mask the ESID bits or not. eg:
      
        ea         = 0xf000000000000000
        context    = 0x7ffff
      
        w/out masking:
        proto_vsid = (0x7ffff << 6 | 0xf000000000000000 >> 40)
                   = (0x1ffffc0 | 0xf00000)
                   =  0x1ffffc0

        with masking:
        proto_vsid = (0x7ffff << 6 | ((0xf000000000000000 >> 40) & 0x3f))
                   = (0x1ffffc0 | (0xf00000 & 0x3f))
                   = (0x1ffffc0 | 0)
                   =  0x1ffffc0
      
      So although there is no bug, the code is still overly subtle, so fix it
      to save ourselves pain in future.
      
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      79270e0a
  21. 16 Nov, 2016 1 commit
    • powerpc/64: Simplify adaptation to new ISA v3.00 HPTE format · 6b243fcf
      Paul Mackerras authored
      This changes the way that we support the new ISA v3.00 HPTE format.
      Instead of adapting everything that uses HPTE values to handle either
      the old format or the new format, depending on which CPU we are on,
      we now convert explicitly between old and new formats if necessary
      in the low-level routines that actually access HPTEs in memory.
      This limits the amount of code that needs to know about the new
      format and makes the conversions explicit.  This is OK because the
      old format contains all the information that is in the new format.
      
      This also fixes operation under a hypervisor, because the H_ENTER
      hypercall (and other hypercalls that deal with HPTEs) will continue
      to require the HPTE value to be supplied in the old format.  At
      present the kernel will not boot in HPT mode on POWER9 under a
      hypervisor.
      
      This fixes and partially reverts commit 50de596d
      ("powerpc/mm/hash: Add support for Power9 Hash", 2016-04-29).
      
      Fixes: 50de596d ("powerpc/mm/hash: Add support for Power9 Hash")
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      6b243fcf
  22. 09 Sep, 2016 1 commit
    • powerpc/mm: Speed up computation of base and actual page size for a HPTE · 0eeede0c
      Paul Mackerras authored
      
      
      This replaces a 2-D search through an array with a simple 8-bit table
      lookup for determining the actual and/or base page size for a HPT entry.
      
      The encoding in the second doubleword of the HPTE is designed to encode
      the actual and base page sizes without using any more bits than would be
      needed for a 4k page number, by using between 1 and 8 low-order bits of
      the RPN (real page number) field to encode the page sizes.  A single
      "large page" bit in the first doubleword indicates that these low-order
      bits are to be interpreted like this.
      
      We can determine the page sizes by using the low-order 8 bits of the RPN
      to look up a 256-entry table.  For actual page sizes less than 1MB, some
      of the upper bits of these 8 bits are going to be real address bits, but
      we can cope with that by replicating the entries for those smaller page
      sizes.
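
      The replication trick can be shown with a toy table (made-up encodings,
      not the real HPTE penc values): an encoding that only uses the low n
      bits is copied into every one of the 256/2^n slots sharing those bits,
      so a single 8-bit index always finds a valid answer.

        #include <stdio.h>

        static unsigned char page_shift[256];  /* log2 of the decoded page size */

        /* "low 'bits' bits == pattern" decodes to a 2^shift byte page; replicate
         * the entry over every table index whose low bits match. */
        static void add_encoding(unsigned pattern, unsigned bits, unsigned shift)
        {
            for (unsigned i = pattern; i < 256; i += 1u << bits)
                page_shift[i] = (unsigned char)shift;
        }

        int main(void)
        {
            add_encoding(0x0, 1, 12);  /* toy: low bit 0     -> 4K  */
            add_encoding(0x1, 4, 16);  /* toy: low bits 0001 -> 64K */
            add_encoding(0x9, 4, 24);  /* toy: low bits 1001 -> 16M */

            unsigned rpn_low = 0x31;   /* low 8 bits of a fake RPN */
            printf("page shift = %u\n", page_shift[rpn_low]);  /* 16 -> 64K */
            return 0;
        }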
      
      While we're at it, let's move the hpte_page_size() and hpte_base_page_size()
      functions from a KVM-specific header to a header for 64-bit HPT systems,
      since this computation doesn't have anything specifically to do with KVM.
      
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
      0eeede0c
  23. 01 Aug, 2016 2 commits
  24. 26 Jul, 2016 2 commits