Skip to content
  • Jonathan Toppins's avatar
    mm: ratelimit PFNs busy info message · 75dddef3
    Jonathan Toppins authored
    The RDMA subsystem can generate several thousand of these messages per
    second eventually leading to a kernel crash.  Ratelimit these messages
    to prevent this crash.
    
    Doug said:
     "I've been carrying a version of this for several kernel versions. I
      don't remember when they started, but we have one (and only one) class
      of machines: Dell PE R730xd, that generate these errors. When it
      happens, without a rate limit, we get rcu timeouts and kernel oopses.
      With the rate limit, we just get a lot of annoying kernel messages but
      the machine continues on, recovers, and eventually the memory
      operations all succeed"
    
    And:
     "> Well... why are all these EBUSY's occurring? It sounds inefficient
      > (at least) but if it is expected, normal and unavoidable then
      > perhaps we should just remove that message altogether?
    
      I don't have an answer to that question. To be honest, I haven't
      looked real hard. We never had this at all, then it started out of the
      blue, but only on our Dell 730xd machines (and it hits all of them),
      but no other classes or brands of machines. And we have our 730xd
      machines loaded up with different brands and models of cards (for
      instance one dedicated to mlx4 hardware, one for qib, one for mlx5, an
      ocrdma/cxgb4 combo, etc), so the fact that it hit all of the machines
      meant it wasn't tied to any particular brand/model of RDMA hardware.
      To me, it always smelled of a hardware oddity specific to maybe the
      CPUs or mainboard chipsets in these machines, so given that I'm not an
      mm expert anyway, I never chased it down.
    
      A few other relevant details: it showed up somewhere around 4.8/4.9 or
      thereabouts. It never happened before, but the prinkt has been there
      since the 3.18 days, so possibly the test to trigger this message was
      changed, or something else in the allocator changed such that the
      situation started happening on these machines?
    
      And, like I said, it is specific to our 730xd machines (but they are
      all identical, so that could mean it's something like their specific
      ram configuration is causing the allocator to hit this on these
      machine but not on other machines in the cluster, I don't want to say
      it's necessarily the model of chipset or CPU, there are other bits of
      identicalness between these machines)"
    
    Link: http://lkml.kernel.org/r/499c0f6cc10d6eb829a67f2a4d75b4228a9b356e.1501695897.git.jtoppins@redhat.com
    
    
    Signed-off-by: default avatarJonathan Toppins <jtoppins@redhat.com>
    Reviewed-by: default avatarDoug Ledford <dledford@redhat.com>
    Tested-by: default avatarDoug Ledford <dledford@redhat.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Mel Gorman <mgorman@techsingularity.net>
    Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
    Cc: <stable@vger.kernel.org>
    Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
    Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
    75dddef3