1. 14 Feb, 2015 11 commits
  2. 13 Feb, 2015 6 commits
    • Joe Perches's avatar
      printk: correct timeout comment, neaten MODULE_PARM_DESC · 205bd3d2
      Joe Perches authored
      
      
      Neaten the MODULE_PARAM_DESC message.
      Use 30 seconds in the comment for the zap console locks timeout.
      
      Signed-off-by: default avatarJoe Perches <joe@perches.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      205bd3d2
    • Cyril Bur's avatar
      kernel/sched/clock.c: add another clock for use with the soft lockup watchdog · 545a2bf7
      Cyril Bur authored
      
      
      When the hypervisor pauses a virtualised kernel the kernel will observe a
      jump in timebase, this can cause spurious messages from the softlockup
      detector.
      
      Whilst these messages are harmless, they are accompanied with a stack
      trace which causes undue concern and more problematically the stack trace
      in the guest has nothing to do with the observed problem and can only be
      misleading.
      
      Futhermore, on POWER8 this is completely avoidable with the introduction
      of the Virtual Time Base (VTB) register.
      
      This patch (of 2):
      
      This permits the use of arch specific clocks for which virtualised kernels
      can use their notion of 'running' time, not the elpased wall time which
      will include host execution time.
      
      Signed-off-by: default avatarCyril Bur <cyrilbur@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrew Jones <drjones@redhat.com>
      Acked-by: default avatarDon Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: chai wen <chaiw.fnst@cn.fujitsu.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ben Zhang <benzh@chromium.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      545a2bf7
    • Andy Lutomirski's avatar
      all arches, signal: move restart_block to struct task_struct · f56141e3
      Andy Lutomirski authored
      
      
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
      
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: default avatarAndy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: default avatarRichard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarJames Hogan <james.hogan@imgtec.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      f56141e3
    • Rasmus Villemoes's avatar
      kernel/cpuset.c: Mark cpuset_init_current_mems_allowed as __init · 8f4ab07f
      Rasmus Villemoes authored
      
      
      The only caller of cpuset_init_current_mems_allowed is the __init
      annotated build_all_zonelists_init, so we can also make the former __init.
      
      Signed-off-by: default avatarRasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vishnu Pratap Singh <vishnu.ps@samsung.com>
      Cc: Pintu Kumar <pintu.k@samsung.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      8f4ab07f
    • Kirill A. Shutemov's avatar
      mm: do not use mm->nr_pmds on !MMU configurations · 2d2f5119
      Kirill A. Shutemov authored
      
      
      mm->nr_pmds doesn't make sense on !MMU configurations
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      2d2f5119
    • Vladimir Davydov's avatar
      cgroup: release css->id after css_free · 01e58659
      Vladimir Davydov authored
      
      
      Currently, we release css->id in css_release_work_fn, right before calling
      css_free callback, so that when css_free is called, the id may have
      already been reused for a new cgroup.
      
      I am going to use css->id to create unique names for per memcg kmem
      caches.  Since kmem caches are destroyed only on css_free, I need css->id
      to be freed after css_free was called to avoid name clashes.  This patch
      therefore moves css->id removal to css_free_work_fn.  To prevent
      css_from_id from returning a pointer to a stale css, it makes
      css_release_work_fn replace the css ptr at css_idr:css->id with NULL.
      
      Signed-off-by: default avatarVladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      01e58659
  3. 12 Feb, 2015 5 commits
    • Kirill A. Shutemov's avatar
      mm: fix false-positive warning on exit due mm_nr_pmds(mm) · b30fe6c7
      Kirill A. Shutemov authored
      
      
      The problem is that we check nr_ptes/nr_pmds in exit_mmap() which happens
      *before* pgd_free().  And if an arch does pte/pmd allocation in
      pgd_alloc() and frees them in pgd_free() we see offset in counters by the
      time of the checks.
      
      We tried to workaround this by offsetting expected counter value according
      to FIRST_USER_ADDRESS for both nr_pte and nr_pmd in exit_mmap().  But it
      doesn't work in some cases:
      
      1. ARM with LPAE enabled also has non-zero USER_PGTABLES_CEILING, but
         upper addresses occupied with huge pmd entries, so the trick with
         offsetting expected counter value will get really ugly: we will have
         to apply it nr_pmds, but not nr_ptes.
      
      2. Metag has non-zero FIRST_USER_ADDRESS, but doesn't do allocation
         pte/pmd page tables allocation in pgd_alloc(), just setup a pgd entry
         which is allocated at boot and shared accross all processes.
      
      The proposal is to move the check to check_mm() which happens *after*
      pgd_free() and do proper accounting during pgd_alloc() and pgd_free()
      which would bring counters to zero if nothing leaked.
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarTyler Baker <tyler.baker@linaro.org>
      Tested-by: default avatarTyler Baker <tyler.baker@linaro.org>
      Tested-by: default avatarNishanth Menon <nm@ti.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      b30fe6c7
    • Kirill A. Shutemov's avatar
      mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov authored
      
      
      Dave noticed that unprivileged process can allocate significant amount of
      memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
      memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
      kernel doesn't account PMD tables to the process, only PTE.
      
      The use-cases below use few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by account PMD tables to the process the
      same way we account PTE.
      
      The main place where PMD tables is accounted is __pmd_alloc() and
      free_pmd_range(). But there're few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles by accounting
         the table to all processes who share it.
      
       - x86 PAE pre-allocates few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configuration where PMD page table's level is
      present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
      counter value is used to calculate baseline for badness score by
      oom-killer.
      
      Signed-off-by: default avatarKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: default avatarDave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: default avatarCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: default avatarSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
    • Michal Hocko's avatar
      oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Michal Hocko authored
      Commit 5695be14
      
       ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") has left a race window when OOM killer manages to
      note_oom_kill after freeze_processes checks the counter.  The race
      window is quite small and really unlikely and partial solution deemed
      sufficient at the time of submission.
      
      Tejun wasn't happy about this partial solution though and insisted on a
      full solution.  That requires the full OOM and freezer's task freezing
      exclusion, though.  This is done by this patch which introduces oom_sem
      RW lock and turns oom_killer_disable() into a full OOM barrier.
      
      oom_killer_disabled check is moved from the allocation path to the OOM
      level and we take oom_sem for reading for both the check and the whole
      OOM invocation.
      
      oom_killer_disable() takes oom_sem for writing so it waits for all
      currently running OOM killer invocations.  Then it disable all the further
      OOMs by setting oom_killer_disabled and checks for any oom victims.
      Victims are counted via mark_tsk_oom_victim resp.  unmark_oom_victim.  The
      last victim wakes up all waiters enqueued by oom_killer_disable().
      Therefore this function acts as the full OOM barrier.
      
      The page fault path is covered now as well although it was assumed to be
      safe before.  As per Tejun, "We used to have freezing points deep in file
      system code which may be reacheable from page fault." so it would be
      better and more robust to not rely on freezing points here.  Same applies
      to the memcg OOM killer.
      
      out_of_memory tells the caller whether the OOM was allowed to trigger and
      the callers are supposed to handle the situation.  The page allocation
      path simply fails the allocation same as before.  The page fault path will
      retry the fault (more on that later) and Sysrq OOM trigger will simply
      complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks so the function will not block. But if there was an
      OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
      finish yet then we have to wait for it. This should complete in a finite
      time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all the further
      	  allocation would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Suggested-by: default avatarTejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe
    • Michal Hocko's avatar
      PM: convert printk to pr_* equivalent · 35536ae1
      Michal Hocko authored
      
      
      While touching this area let's convert printk to pr_*.  This also makes
      the printing of continuation lines done properly.
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Acked-by: default avatarTejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      35536ae1
    • Michal Hocko's avatar
      oom: add helpers for setting and clearing TIF_MEMDIE · 49550b60
      Michal Hocko authored
      This patchset addresses a race which was described in the changelog for
      5695be14
      
       ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
      
      : PM freezer relies on having all tasks frozen by the time devices are
      : getting frozen so that no task will touch them while they are getting
      : frozen.  But OOM killer is allowed to kill an already frozen task in order
      : to handle OOM situtation.  In order to protect from late wake ups OOM
      : killer is disabled after all tasks are frozen.  This, however, still keeps
      : a window open when a killed task didn't manage to die by the time
      : freeze_processes finishes.
      
      The original patch hasn't closed the race window completely because that
      would require a more complex solution as it can be seen by this patchset.
      
      The primary motivation was to close the race condition between OOM killer
      and PM freezer _completely_.  As Tejun pointed out, even though the race
      condition is unlikely the harder it would be to debug weird bugs deep in
      the PM freezer when the debugging options are reduced considerably.  I can
      only speculate what might happen when a task is still runnable
      unexpectedly.
      
      On a plus side and as a side effect the oom enable/disable has a better
      (full barrier) semantic without polluting hot paths.
      
      I have tested the series in KVM with 100M RAM:
      - many small tasks (20M anon mmap) which are triggering OOM continually
      - s2ram which resumes automatically is triggered in a loop
      	echo processors > /sys/power/pm_test
      	while true
      	do
      		echo mem > /sys/power/state
      		sleep 1s
      	done
      - simple module which allocates and frees 20M in 8K chunks. If it sees
        freezing(current) then it tries another round of allocation before calling
        try_to_freeze
      - debugging messages of PM stages and OOM killer enable/disable/fail added
        and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
        it wakes up waiters.
      - rebased on top of the current mmotm which means some necessary updates
        in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
        I think this should be OK because __thaw_task shouldn't interfere with any
        locking down wake_up_process. Oleg?
      
      As expected there are no OOM killed tasks after oom is disabled and
      allocations requested by the kernel thread are failing after all the tasks
      are frozen and OOM disabled.  I wasn't able to catch a race where
      oom_killer_disable would really have to wait but I kinda expected the race
      is really unlikely.
      
      [  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
      [  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
      [  243.636072] (elapsed 2.837 seconds) done.
      [  243.641985] Trying to disable OOM killer
      [  243.643032] Waiting for concurent OOM victims
      [  243.644342] OOM killer disabled
      [  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
      [  243.652983] Suspending console(s) (use no_console_suspend to debug)
      [  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
      [...]
      [  243.992600] PM: suspend of devices complete after 336.667 msecs
      [  243.993264] PM: late suspend of devices complete after 0.660 msecs
      [  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
      [  243.994717] ACPI: Preparing to enter system sleep state S3
      [  243.994795] PM: Saving platform NVS memory
      [  243.994796] Disabling non-boot CPUs ...
      
      The first 2 patches are simple cleanups for OOM.  They should go in
      regardless the rest IMO.
      
      Patches 3 and 4 are trivial printk -> pr_info conversion and they should
      go in ditto.
      
      The main patch is the last one and I would appreciate acks from Tejun and
      Rafael.  I think the OOM part should be OK (except for __thaw_task vs.
      task_lock where a look from Oleg would appreciated) but I am not so sure I
      haven't screwed anything in the freezer code.  I have found several
      surprises there.
      
      This patch (of 5):
      
      This patch is just a preparatory and it doesn't introduce any functional
      change.
      
      Note:
      I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
      wait for the oom victim and to prevent from new killing. This is
      just a side effect of the flag. The primary meaning is to give the oom
      victim access to the memory reserves and that shouldn't be necessary
      here.
      
      Signed-off-by: default avatarMichal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: default avatarAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: default avatarLinus Torvalds <torvalds@linux-foundation.org>
      49550b60
  4. 11 Feb, 2015 3 commits
    • Steven Rostedt (Red Hat)'s avatar
      ring-buffer: Do not wake up a splice waiter when page is not full · 1e0d6714
      Steven Rostedt (Red Hat) authored
      When an application connects to the ring buffer via splice, it can only
      read full pages. Splice does not work with partial pages. If there is
      not enough data to fill a page, the splice command will either block
      or return -EAGAIN (if set to nonblock).
      
      Code was added where if the page is not full, to just sleep again.
      The problem is, it will get woken up again on the next event. That
      is, when something is written into the ring buffer, if there is a waiter
      it will wake it up. The waiter would then check the buffer, see that
      it still does not have enough data to fill a page and go back to sleep.
      To make matters worse, when the waiter goes back to sleep, it could
      cause another event, which would wake it back up again to see it
      doesn't have enough data and sleep again. This produces a tremendous
      overhead and fills the ring buffer with noise.
      
      For example, recording sched_switch on an idle system for 10 seconds
      produces 25,350,475 events!!!
      
      Create another wait queue for those waiters wanting full pages.
      When an event is written, it only wakes up waiters if there's a full
      page of data. It does not wake up the waiter if the page is not yet
      full.
      
      After this change, recording sched_switch on an idle system for 10
      seconds produces only 800 events. Getting rid of 25,349,675 useless
      events (99.9969% of events!!), is something to take seriously.
      
      Cc: stable@vger.kernel.org # 3.16+
      Cc: Rabin Vincent <rabin@rab.in>
      Fixes: e30f53aa
      
       "tracing: Do not busy wait in buffer splice"
      Signed-off-by: default avatarSteven Rostedt <rostedt@goodmis.org>
      1e0d6714
    • Peter Zijlstra's avatar
      module: Replace over-engineered nested sleep · 9cc019b8
      Peter Zijlstra authored
      
      
      Since the introduction of the nested sleep warning; we've established
      that the occasional sleep inside a wait_event() is fine.
      
      wait_event() loops are invariant wrt. spurious wakeups, and the
      occasional sleep has a similar effect on them. As long as its occasional
      its harmless.
      
      Therefore replace the 'correct' but verbose wait_woken() thing with
      a simple annotation to shut up the warning.
      
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      9cc019b8
    • Peter Zijlstra's avatar
      module: Annotate nested sleep in resolve_symbol() · d64810f5
      Peter Zijlstra authored
      
      
      Because wait_event() loops are safe vs spurious wakeups we can allow the
      occasional sleep -- which ends up being very similar.
      
      Reported-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: default avatarDave Jones <davej@codemonkey.org.uk>
      Signed-off-by: default avatarRusty Russell <rusty@rustcorp.com.au>
      d64810f5
  5. 10 Feb, 2015 2 commits
  6. 09 Feb, 2015 1 commit
  7. 06 Feb, 2015 3 commits
  8. 05 Feb, 2015 2 commits
  9. 04 Feb, 2015 7 commits
    • Josh Poimboeuf's avatar
      livepatch: rename config to CONFIG_LIVEPATCH · 12cf89b5
      Josh Poimboeuf authored
      
      
      Rename CONFIG_LIVE_PATCHING to CONFIG_LIVEPATCH to make the naming of
      the config and the code more consistent.
      
      Signed-off-by: default avatarJosh Poimboeuf <jpoimboe@redhat.com>
      Reviewed-by: default avatarJingoo Han <jg1.han@samsung.com>
      Signed-off-by: default avatarJiri Kosina <jkosina@suse.cz>
      12cf89b5
    • Mark Rutland's avatar
      perf: Decouple unthrottling and rotating · 2fde4f94
      Mark Rutland authored
      
      
      Currently the adjusments made as part of perf_event_task_tick() use the
      percpu rotation lists to iterate over any active PMU contexts, but these
      are not used by the context rotation code, having been replaced by
      separate (per-context) hrtimer callbacks. However, some manipulation of
      the rotation lists (i.e. removal of contexts) has remained in
      perf_rotate_context(). This leads to the following issues:
      
      * Contexts are not always removed from the rotation lists. Removal of
        PMUs which have been placed in rotation lists, but have not been
        removed by a hrtimer callback can result in corruption of the rotation
        lists (when memory backing the context is freed).
      
        This has been observed to result in hangs when PMU drivers built as
        modules are inserted and removed around the creation of events for
        said PMUs.
      
      * Contexts which do not require rotation may be removed from the
        rotation lists as a result of a hrtimer, and will not be considered by
        the unthrottling code in perf_event_task_tick.
      
      This patch fixes the issue by updating the rotation ist when events are
      scheduled in/out, ensuring that each rotation list stays in sync with
      the HW state. As each event holds a refcount on the module of its PMU,
      this ensures that when a PMU module is unloaded none of its CPU contexts
      can be in a rotation list. By maintaining a list of perf_event_contexts
      rather than perf_event_cpu_contexts, we don't need separate paths to
      handle the cpu and task contexts, which also makes the code a little
      simpler.
      
      As the rotation_list variables are not used for rotation, these are
      renamed to active_ctx_list, which better matches their current function.
      perf_pmu_rotate_{start,stop} are renamed to
      perf_pmu_ctx_{activate,deactivate}.
      
      Reported-by: default avatarJohannes Jensen <johannes.jensen@arm.com>
      Signed-off-by: Mark Rutland's avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Will Deacon <Will.Deacon@arm.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20150129134511.GR17721@leverpostej
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      2fde4f94
    • Mark Rutland's avatar
      perf: Drop module reference on event init failure · cc34b98b
      Mark Rutland authored
      
      
      When initialising an event, perf_init_event will call try_module_get() to
      ensure that the PMU's module cannot be removed for the lifetime of the
      event, with __free_event() dropping the reference when the event is
      finally destroyed. If something fails after the event has been
      initialised, but before the event is installed, perf_event_alloc will
      drop the reference on the module.
      
      However, if we fail to initialise an event for some reason (e.g. we ask
      an uncore PMU to perform sampling, and it refuses to initialise the
      event), we do not drop the refcount. If we try to open such a bogus
      event without a precise IDR type, we will loop over each PMU in the pmus
      list, incrementing each of their refcounts without decrementing them.
      
      This patch adds a module_put when pmu->event_init(event) fails, ensuring
      that the refcounts are balanced in failure cases. As the innards of the
      precise and search based initialisation look very similar, this logic is
      hoisted out into a new helper function. While the early return for the
      failed try_module_get is removed from the search case, this is handled
      by the remaining return when ret is not -ENOENT.
      
      Signed-off-by: Mark Rutland's avatarMark Rutland <mark.rutland@arm.com>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1420642611-22667-1-git-send-email-mark.rutland@arm.com
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      cc34b98b
    • Jiri Olsa's avatar
      perf: Use POLLIN instead of POLL_IN for perf poll data in flag · 7c60fc0e
      Jiri Olsa authored
      
      
      Currently we flag available data (via poll syscall) on perf fd with
      POLL_IN macro, which is normally used for SIGIO interface.
      
      We've been lucky, because POLLIN (0x1) is subset of POLL_IN (0x20001)
      and sys_poll (do_pollfd function) cut the extra bit out (0x20000).
      
      Signed-off-by: default avatarJiri Olsa <jolsa@kernel.org>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1422467678-22341-1-git-send-email-jolsa@kernel.org
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      7c60fc0e
    • Peter Zijlstra's avatar
      perf: Fix put_event() ctx lock · a83fe28e
      Peter Zijlstra authored
      
      
      So what I suspect; but I'm in zombie mode today it seems; is that while
      I initially thought that it was impossible for ctx to change when
      refcount dropped to 0, I now suspect its possible.
      
      Note that until perf_remove_from_context() the event is still active and
      visible on the lists. So a concurrent sys_perf_event_open() from another
      task into this task can race.
      
      Reported-by: default avatarVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: mark.rutland@arm.com
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20150129134434.GB26304@twins.programming.kicks-ass.net
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      a83fe28e
    • Peter Zijlstra (Intel)'s avatar
      perf: Fix move_group() order · 8f95b435
      Peter Zijlstra (Intel) authored
      
      
      Jiri reported triggering the new WARN_ON_ONCE in event_sched_out over
      the weekend:
      
        event_sched_out.isra.79+0x2b9/0x2d0
        group_sched_out+0x69/0xc0
        ctx_sched_out+0x106/0x130
        task_ctx_sched_out+0x37/0x70
        __perf_install_in_context+0x70/0x1a0
        remote_function+0x48/0x60
        generic_exec_single+0x15b/0x1d0
        smp_call_function_single+0x67/0xa0
        task_function_call+0x53/0x80
        perf_install_in_context+0x8b/0x110
      
      I think the below should cure this; if we install a group leader it
      will iterate the (still intact) group list and find its siblings and
      try and install those too -- even though those still have the old
      event->ctx -- in the new ctx.
      
      Upon installing the first group sibling we'd try and schedule out the
      group and trigger the above warn.
      
      Fix this by installing the group leader last, installing siblings
      would have no effect, they're not reachable through the group lists
      and therefore we don't schedule them.
      
      Also delay resetting the state until we're absolutely sure the events
      are quiescent.
      
      Reported-by: default avatarJiri Olsa <jolsa@redhat.com>
      Reported-by: default avatar <vincent.weaver@maine.edu>
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20150126162639.GA21418@twins.programming.kicks-ass.net
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      8f95b435
    • Peter Zijlstra's avatar
      perf: Fix event->ctx locking · f63a8daa
      Peter Zijlstra authored
      
      
      There have been a few reported issues wrt. the lack of locking around
      changing event->ctx. This patch tries to address those.
      
      It avoids the whole rwsem thing; and while it appears to work, please
      give it some thought in review.
      
      What I did fail at is sensible runtime checks on the use of
      event->ctx, the RCU use makes it very hard.
      
      Signed-off-by: default avatarPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20150123125834.209535886@infradead.org
      
      
      Signed-off-by: default avatarIngo Molnar <mingo@kernel.org>
      f63a8daa