1. 12 Sep, 2019 1 commit
    • cgroup: freezer: fix frozen state inheritance · 97a61369
      Roman Gushchin authored
      
      
      If a new child cgroup is created in a frozen cgroup hierarchy
      (one or more ancestor cgroups are frozen), the CGRP_FREEZE cgroup
      flag should be set. Otherwise, a process attached to the
      child cgroup won't become frozen.
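
The inheritance rule can be illustrated with a toy model in Python. This is not kernel code: the `Cgroup` class and its `freeze_requested` and `effectively_frozen` attributes are hypothetical names chosen for illustration. Only the rule itself (a child created under a frozen ancestor must start out with the freeze flag set) comes from the commit message.

```python
class Cgroup:
    """Toy stand-in for a cgroup; all names here are hypothetical."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        self.freeze_requested = False  # models writing 1 to cgroup.freeze
        # Models CGRP_FREEZE: set when this cgroup or any ancestor is frozen.
        # The fix amounts to computing this at creation time as well.
        self.effectively_frozen = parent.effectively_frozen if parent else False
        if parent:
            parent.children.append(self)

    def set_freeze(self, freeze):
        self.freeze_requested = freeze
        self._propagate()

    def _propagate(self):
        inherited = self.parent.effectively_frozen if self.parent else False
        self.effectively_frozen = self.freeze_requested or inherited
        for child in self.children:
            child._propagate()


root = Cgroup("root")
a = Cgroup("cg_test_mkdir_A", root)
a.set_freeze(True)
b = Cgroup("cg_test_mkdir_B", a)  # created while A is already frozen
print(b.effectively_frozen)
```

In this model the child reports frozen immediately after creation, which is the condition test_cgfreezer_mkdir checks.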
      
      The problem can be reproduced with the test_cgfreezer_mkdir test.
      
      This is the output before this patch:
        ~/test_freezer
        ok 1 test_cgfreezer_simple
        ok 2 test_cgfreezer_tree
        ok 3 test_cgfreezer_forkbomb
        Cgroup /sys/fs/cgroup/cg_test_mkdir_A/cg_test_mkdir_B isn't frozen
        not ok 4 test_cgfreezer_mkdir
        ok 5 test_cgfreezer_rmdir
        ok 6 test_cgfreezer_migrate
        ok 7 test_cgfreezer_ptrace
        ok 8 test_cgfreezer_stopped
        ok 9 test_cgfreezer_ptraced
        ok 10 test_cgfreezer_vfork
      
      And with this patch:
        ~/test_freezer
        ok 1 test_cgfreezer_simple
        ok 2 test_cgfreezer_tree
        ok 3 test_cgfreezer_forkbomb
        ok 4 test_cgfreezer_mkdir
        ok 5 test_cgfreezer_rmdir
        ok 6 test_cgfreezer_migrate
        ok 7 test_cgfreezer_ptrace
        ok 8 test_cgfreezer_stopped
        ok 9 test_cgfreezer_ptraced
        ok 10 test_cgfreezer_vfork
      Reported-by: Mark Crossen <mcrossen@fb.com>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Fixes: 76f969e8 ("cgroup: cgroup v2 freezer")
      Cc: Tejun Heo <tj@kernel.org>
      Cc: stable@vger.kernel.org # v5.2+
      Signed-off-by: Tejun Heo <tj@kernel.org>
  2. 21 Jun, 2019 1 commit
  3. 14 Jun, 2019 1 commit
  4. 10 Jun, 2019 2 commits
    • cgroup: Fix css_task_iter_advance_css_set() cset skip condition · c596687a
      Tejun Heo authored
      While adding handling for dying task group leaders, c03cd773
      ("cgroup: Include dying leaders with live threads in PROCS
      iterations") added an inverted cset skip condition to
      css_task_iter_advance_css_set().  It should skip cset if it's
      completely empty but was incorrectly testing for the inverse condition
      for the dying_tasks list.  Fix it.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: c03cd773 ("cgroup: Include dying leaders with live threads in PROCS iterations")
      Reported-by: syzbot+d4bba5ccd4f9a2a68681@syzkaller.appspotmail.com
    • cgroup/bfq: revert bfq.weight symlink change · cf892988
      Jens Axboe authored
      There's some discussion on how best to do this, and Tejun prefers
      that BFQ just create the file itself instead of having cgroups support
      a symlink feature.
      
      Hence revert commits 54b7b868 and 19e9da9e for 5.2, and this
      can be done properly for 5.3.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 07 Jun, 2019 1 commit
  6. 05 Jun, 2019 1 commit
  7. 01 Jun, 2019 1 commit
    • mm, memcg: consider subtrees in memory.events · 9852ae3f
      Chris Down authored
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
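
The write-time aggregation described above can be sketched in user-space Python. This is a simplified model, not the memcg implementation: the `Memcg` class and its `record()` method are hypothetical, and `parse_events()` only assumes the flat "key value" layout shown in the transcript.

```python
def parse_events(text):
    """Parse the flat 'key value' format used by memory.events."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            events[key] = int(value)
    return events


class Memcg:
    """Hypothetical model of a memory cgroup's event counters."""

    def __init__(self, parent=None):
        self.parent = parent
        self.events = {"low": 0, "high": 0, "max": 0, "oom": 0, "oom_kill": 0}

    def record(self, event):
        # Write-time aggregation: bump the counter here and in every ancestor.
        node = self
        while node is not None:
            node.events[event] += 1
            node = node.parent


system_slice = Memcg()
unit = Memcg(parent=system_slice)
unit.record("oom")
unit.record("oom_kill")
print(system_slice.events["oom"], system_slice.events["oom_kill"])
```

Because the counters are bumped in every ancestor when the event happens, a read of a non-leaf node needs no tree walk.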
      
      akpm: this is a behaviour change, so Cc:stable.  This is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
      
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 31 May, 2019 3 commits
    • cgroup: add cgroup_parse_float() · a5e112e6
      Tejun Heo authored
      
      
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers.  While at it, also clarify the default time unit.
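
As a rough illustration of what such a helper does, the following Python sketch parses a decimal string into an integer scaled by a power of ten. It is not the kernel's cgroup_parse_float(): the function name `parse_fixed_point`, the `dec_shift` parameter, and the choice to truncate (rather than round) extra fractional digits are assumptions made for this example.

```python
import re


def parse_fixed_point(text, dec_shift):
    """Parse a plain decimal such as '12.5' into an integer scaled by
    10**dec_shift. Extra fractional digits are truncated, not rounded."""
    match = re.fullmatch(r"\s*([+-]?)(\d+)(?:\.(\d*))?\s*", text)
    if match is None:
        raise ValueError(f"not a decimal number: {text!r}")
    sign, whole, frac = match.group(1), match.group(2), match.group(3) or ""
    frac = frac[:dec_shift].ljust(dec_shift, "0")  # truncate or zero-pad
    value = int(whole) * 10 ** dec_shift + (int(frac) if frac else 0)
    return -value if sign == "-" else value


print(parse_fixed_point("12.5", 2))
```

Keeping the result as a scaled integer avoids floating-point arithmetic, which matters for code that cannot use the FPU freely.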
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Include dying leaders with live threads in PROCS iterations · c03cd773
      Tejun Heo authored
      
      
      CSS_TASK_ITER_PROCS currently iterates live group leaders; however,
      this means that a process with a dying leader and live threads will be
      skipped.  IOW, cgroup.procs might be empty while cgroup.threads isn't,
      which is confusing to say the least.
      
      Fix it by making cset track dying tasks and include dying leaders with
      live threads in PROCS iteration.
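
The intended iteration rule can be modelled in a few lines of Python. The `Task` dataclass and `iterate_procs()` are hypothetical illustrations and do not mirror the css_task_iter data structures. They only encode the rule that a thread group is reported if its leader is alive, or if its leader is dying but other threads are still alive.

```python
from dataclasses import dataclass


@dataclass
class Task:
    tid: int
    tgid: int  # thread-group id; the group leader has tid == tgid
    dying: bool = False


def iterate_procs(tasks):
    """Yield one id per thread group, including groups whose leader is
    dying while other threads are still alive."""
    live_threads = {}
    for t in tasks:
        if not t.dying:
            live_threads[t.tgid] = live_threads.get(t.tgid, 0) + 1
    for t in tasks:
        is_leader = t.tid == t.tgid
        if is_leader and (not t.dying or live_threads.get(t.tgid, 0) > 0):
            yield t.tgid


tasks = [Task(100, 100, dying=True), Task(101, 100), Task(200, 200)]
print(list(iterate_procs(tasks)))
```

Group 100 is reported even though its leader is dying, so the process listing no longer disagrees with the thread listing.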
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Topi Miettinen <toiwoton@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
    • cgroup: Implement css_task_iter_skip() · b636fd38
      Tejun Heo authored
      
      
      When a task is moved out of a cset, task iterators pointing to the
      task are advanced using the normal css_task_iter_advance() call.  This
      is fine but we'll be tracking dying tasks on csets and thus moving
      tasks from cset->tasks to (to be added) cset->dying_tasks.  When we
      remove a task from cset->tasks, if we advance the iterators, they may
      move over to the next cset before we had the chance to add the task
      back on the dying list, which can allow the task to escape iteration.
      
      This patch separates out skipping from advancing.  Skipping only moves
      the affected iterators to the next pointer rather than fully advancing
      it and the following advancing will recognize that the cursor has
      already been moved forward and do the rest of advancing.  This ensures
      that when a task moves from one list to another in its cset, as long
      as it moves in the right direction, it's always visible to iteration.
      
      This doesn't cause any visible behavior changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
  9. 28 May, 2019 2 commits
    • bpf: decouple the lifetime of cgroup_bpf from cgroup itself · 4bfc0bb2
      Roman Gushchin authored
      
      
      Currently the lifetime of bpf programs attached to a cgroup is bound
      to the lifetime of the cgroup itself. It means that if a user
      forgets (or intentionally avoids) to detach a bpf program before
      removing the cgroup, it will stay attached up to the release of the
      cgroup. Since the cgroup can stay in the dying state (the state
      between being rmdir()'ed and being released) for a very long time, it
      leads to a waste of memory. It also blocks the possibility of implementing
      memcg-based memory accounting for bpf objects, because a circular
      reference dependency will occur. Charged memory pages are pinning the
      corresponding memory cgroup, and if the memory cgroup is pinning
      the attached bpf program, nothing will be ever released.
      
      A dying cgroup can not contain any processes, so the only chance for
      an attached bpf program to be executed is a live socket associated
      with the cgroup. So in order to release all bpf data early, let's
      count associated sockets using a new percpu refcounter. On cgroup
      removal the counter is transitioned to the atomic mode, and as soon
      as it reaches 0, all bpf programs are detached.
      
      Because cgroup_bpf_release() can block, it can't be called from
      the percpu ref counter callback directly, so instead an asynchronous
      work is scheduled.
      
      The reference counter is not socket specific, and can be used for any
      other types of programs that can be executed from a cgroup-bpf hook
      outside of the process context, should such a need arise in the future.
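
The lifetime rule described above can be sketched as a small reference-counting model in Python. This is an illustration only: the `CgroupBpf` class and its method names are hypothetical, a plain integer stands in for the percpu refcounter, and the release happens inline here rather than from a scheduled work item.

```python
class CgroupBpf:
    """Hypothetical model: programs are released once the cgroup has been
    removed and no sockets reference it any more."""

    def __init__(self, programs):
        self.programs = list(programs)
        self.refs = 1  # base reference held while the cgroup is alive
        self.released = False

    def socket_attach(self):
        self.refs += 1

    def socket_close(self):
        self._put()

    def rmdir(self):
        # Dropping the base reference models killing the refcounter.
        self._put()

    def _put(self):
        self.refs -= 1
        if self.refs == 0:
            # The commit defers this step to asynchronous work because the
            # release function can block; the model does it inline.
            self.programs.clear()
            self.released = True


bpf = CgroupBpf(["ingress_filter"])
bpf.socket_attach()
bpf.rmdir()
print(bpf.released)
bpf.socket_close()
print(bpf.released)
```

The programs survive the rmdir as long as a socket still references the cgroup, and are dropped as soon as the last socket goes away.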
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: jolsa@redhat.com
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • locking/percpu-rwsem: Add DEFINE_PERCPU_RWSEM(), use it to initialize cgroup_threadgroup_rwsem · 3f2947b7
      Oleg Nesterov authored
      
      
      Turn DEFINE_STATIC_PERCPU_RWSEM() into __DEFINE_PERCPU_RWSEM() with the
      additional "is_static" argument to introduce DEFINE_PERCPU_RWSEM().
      
      Change cgroup.c to use DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem).
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
  10. 25 May, 2019 2 commits
  11. 15 May, 2019 2 commits
  12. 06 May, 2019 2 commits
  13. 19 Apr, 2019 4 commits
    • cgroup: add tracing points for cgroup v2 freezer · 4c476d8c
      Roman Gushchin authored
      
      
      Add cgroup:cgroup_freeze and cgroup:cgroup_unfreeze events,
      which are using the existing cgroup tracing infrastructure.
      
      Add the cgroup_event event class, which is similar to the cgroup
      class, but contains an additional integer field to store a new
      value (the level field is dropped).
      
      Also add two tracing events: cgroup_notify_populated and
      cgroup_notify_frozen, which are raised in a generic way using
      the TRACE_CGROUP_PATH() macro.
      
      This allows tracing cgroup state transitions and is generally
      helpful for debugging the cgroup freezer code.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: cgroup v2 freezer · 76f969e8
      Roman Gushchin authored
      
      
      Cgroup v1 implements the freezer controller, which provides an ability
      to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      
      This patch implements freezer for cgroup v2.
      
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate a frozen task to another cgroup.
      
      This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
      mostly tried to imitate the system-wide freezer. While uninterruptible
      sleep is fine when all tasks are going to be frozen (hibernation case),
      it's not an acceptable state for some subset of the system.
      
      The cgroup v2 freezer does not support freezing kthreads.
      If a non-root cgroup contains a kthread, the cgroup can still be frozen,
      but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      
      * PTRACE_ATTACH is not working because non-fatal signal delivery
      is blocked in frozen state.
      
      There are some interface differences between cgroup v1 and cgroup v2
      freezer too, which are required to conform to the cgroup v2 interface
      design principles:
      1) There is no separate controller, which has to be turned on:
      the functionality is always available and is represented by
      cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      using cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes with the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      no matter if some cgroups are frozen.
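
From user space, the asynchronous interface in points 2 and 3 could be driven roughly as follows. This is a minimal sketch: the helper names `parse_cgroup_events` and `freeze_and_wait` are hypothetical, the "key value" layout of cgroup.events is assumed, and simple polling is used where a real tool might prefer file-change notifications.

```python
import os
import time


def parse_cgroup_events(text):
    """Parse 'key value' lines from cgroup.events into a dict of ints."""
    events = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            events[key] = int(value)
    return events


def freeze_and_wait(cgroup_dir, timeout=5.0, interval=0.05):
    """Write the desired state to cgroup.freeze, then poll cgroup.events
    until the 'frozen' field reads 1. Returns True on success."""
    with open(os.path.join(cgroup_dir, "cgroup.freeze"), "w") as f:
        f.write("1")
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        with open(os.path.join(cgroup_dir, "cgroup.events")) as f:
            if parse_cgroup_events(f.read()).get("frozen") == 1:
                return True
        time.sleep(interval)
    return False
```

A caller would pass the cgroup's directory, for example `freeze_and_wait("/sys/fs/cgroup/mygroup")`, where the path is a placeholder. A False return means the cgroup did not report frozen within the timeout, which the commit message notes can happen when the cgroup contains a kthread.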
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
    • cgroup: protect cgroup->nr_(dying_)descendants by css_set_lock · 4dcabece
      Roman Gushchin authored
      
      
      The number of descendant cgroups and the number of dying
      descendant cgroups are currently synchronized using the cgroup_mutex.
      
      The number of descendant cgroups will be required by the cgroup v2
      freezer, which will use it to determine if a cgroup is frozen
      (depending on total number of descendants and number of frozen
      descendants). It's not always acceptable to grab the cgroup_mutex,
      especially from quite hot paths (e.g. exit()).
      
      To avoid this, let's additionally synchronize these counters using
      the css_set_lock.
      
      So, it's safe to read these counters with either cgroup_mutex or
      css_set_lock locked, and for changing both locks should be acquired.
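
The resulting locking rule (read under either lock, write under both) can be illustrated with ordinary Python locks. The two `threading.Lock` objects are stand-ins that merely borrow the kernel lock names, and the `Counters` class is hypothetical.

```python
import threading

cgroup_mutex = threading.Lock()  # stand-in for the kernel's cgroup_mutex
css_set_lock = threading.Lock()  # stand-in for the kernel's css_set_lock


class Counters:
    def __init__(self):
        self.nr_descendants = 0

    def add_descendant(self):
        # Writers hold both locks, always acquired in the same order.
        with cgroup_mutex, css_set_lock:
            self.nr_descendants += 1

    def read_from_hot_path(self):
        # Readers need only one of the two locks; here the cheaper one.
        with css_set_lock:
            return self.nr_descendants


counters = Counters()
counters.add_descendant()
print(counters.read_from_hot_path())
```

Since every writer holds both locks, a reader holding either one is guaranteed that the value cannot change underneath it.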
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
    • cgroup: implement __cgroup_task_count() helper · aade7f9e
      Roman Gushchin authored
      
      
      The helper is identical to the existing cgroup_task_count()
      except it doesn't take the css_set_lock by itself, assuming
      that the caller does.
      
      Also, move cgroup_task_count() implementation into
      kernel/cgroup/cgroup.c, as there is nothing specific to cgroup v1.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: kernel-team@fb.com
  14. 04 Apr, 2019 1 commit
  15. 06 Mar, 2019 1 commit
  16. 28 Feb, 2019 9 commits
    • kernfs, sysfs, cgroup, intel_rdt: Support fs_context · 23bf1b6b
      David Howells authored
      
      
      Make kernfs support superblock creation/mount/remount with fs_context.
      
      This requires that sysfs, cgroup and intel_rdt, which are built on kernfs,
      be made to support fs_context also.
      
      Notes:
      
       (1) A kernfs_fs_context struct is created to wrap fs_context and the
           kernfs mount parameters are moved in here (or are in fs_context).
      
       (2) kernfs_mount{,_ns}() are made into kernfs_get_tree().  The extra
           namespace tag parameter is passed in the context if desired.
      
       (3) kernfs_free_fs_context() is provided as a destructor for the
           kernfs_fs_context struct, but for the moment it does nothing except
           get called in the right places.
      
       (4) sysfs doesn't wrap kernfs_fs_context since it has no parameters to
           pass, but possibly this should be done anyway in case someone wants to
           add a parameter in future.
      
       (5) A cgroup_fs_context struct is created to wrap kernfs_fs_context and
           the cgroup v1 and v2 mount parameters are all moved there.
      
       (6) cgroup1 parameter parsing error messages are now handled by invalf(),
           which allows userspace to collect them directly.
      
       (7) cgroup1 parameter cleanup is now done in the context destructor rather
           than in the mount/get_tree and remount functions.
      
      Weirdies:
      
       (*) cgroup_do_get_tree() calls cset_cgroup_from_root() with locks held,
           but then uses the resulting pointer after dropping the locks.  I'm
           told this is okay and needs commenting.
      
       (*) The cgroup refcount web.  This really needs documenting.
      
       (*) cgroup2 only has one root?
      
      Add a suggestion from Thomas Gleixner in which the RDT enablement code is
      placed into its own function.
      
      [folded a leak fix from Andrey Vagin]
      Signed-off-by: David Howells <dhowells@redhat.com>
      cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc: Tejun Heo <tj@kernel.org>
      cc: Li Zefan <lizefan@huawei.com>
      cc: Johannes Weiner <hannes@cmpxchg.org>
      cc: cgroups@vger.kernel.org
      cc: fenghua.yu@intel.com
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cgroup: store a reference to cgroup_ns into cgroup_fs_context · cca8f327
      Al Viro authored
      
      ... and trim cgroup_do_mount() arguments (renaming it to cgroup_do_get_tree())
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cgroup_do_mount(): massage calling conventions · 71d883c3
      Al Viro authored
      
      pass it fs_context instead of fs_type/flags/root triple, have
      it return int instead of dentry and make it deal with setting
      fc->root.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cgroup: stash cgroup_root reference into cgroup_fs_context · cf6299b1
      Al Viro authored
      
      Note that this reference is *NOT* contributing to refcount of
      cgroup_root in question and is valid only until cgroup_do_mount()
      returns.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cgroup2: switch to option-by-option parsing · e34a98d5
      Al Viro authored
      
      [again, carved out of patch by dhowells]
      [NB: we probably want to handle "source" in parse_param here]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cgroup1: switch to option-by-option parsing · 8d2451f4
      Al Viro authored
      
      [dhowells should be the author - it's carved out of his patch]
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • cgroup: take options parsing into ->parse_monolithic() · f5dfb531
      Al Viro authored
      
      
      Store the results in cgroup_fs_context.  There's a nasty twist caused
      by the enabling/disabling subsystems - we can't do the checks sensitive
      to that until cgroup_mutex gets grabbed.  Frankly, these checks are
      complete bullshit (e.g. all,none combination is accepted if all subsystems
      are disabled; so's cpusets,none and all,cpusets when cpusets is disabled,
      etc.), but touching that would be a userland-visible behaviour change ;-/
      
      So we do parsing in ->parse_monolithic() and have the consistency checks
      done in check_cgroupfs_options(), with the latter called (on already parsed
      options) from cgroup1_get_tree() and cgroup1_reconfigure().
      
      Freeing the strdup'ed strings is done from fs_context destructor, which
      somewhat simplifies the life for cgroup1_{get_tree,reconfigure}().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • 7feeef58
    • cgroup: start switching to fs_context · 90129625
      Al Viro authored
      
      
      Unfortunately, cgroup is tangled into kernfs infrastructure.
      To avoid converting all kernfs-based filesystems at once,
      we need to untangle the remount part of things, instead of
      having it go through kernfs_sop_remount_fs().  Fortunately,
      it's not hard to do.
      
      This commit just gets cgroup/cgroup1 to use fs_context to
      deliver options on mount and remount paths.  Parsing those
      is going to be done in the next commits; for now we do
      pretty much what legacy case does.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  17. 31 Jan, 2019 2 commits
    • cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · 51bee5ab
      Oleg Nesterov authored
      
      
      The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
      needs pids_free() to uncharge the pid.
      
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			exit(0);
      	}
      
      can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU gp after the task/pid goes away and before the final put().
      
      Test-case:
      
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      
      Without this patch the forking process fails soon after migration.
      
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
      into the new helper, cgroup_release(), called by release_task() which actually
      frees the pid(s).
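
The accounting problem can be modelled without a kernel. In the Python sketch below, the `PidsCounter` class and the `fork_and_reap()` helper are hypothetical. The deferred case deliberately models the worst case, in which no grace period completes while the loop runs, so every uncharge is still pending.

```python
class PidsLimitExceeded(Exception):
    pass


class PidsCounter:
    """Hypothetical model of a pids.max style limit."""

    def __init__(self, limit, current=0):
        self.limit = limit
        self.current = current

    def charge(self):
        if self.current + 1 > self.limit:
            raise PidsLimitExceeded("fork would exceed the pids limit")
        self.current += 1

    def uncharge(self):
        self.current -= 1


def fork_and_reap(counter, iterations, uncharge_on_release):
    """Fork a child, let it exit, and reap it, 'iterations' times."""
    deferred = 0
    for _ in range(iterations):
        counter.charge()  # fork() charges the new pid
        if uncharge_on_release:
            counter.uncharge()  # uncharged as soon as the task is released
        else:
            deferred += 1  # uncharge waits for the delayed final put
    for _ in range(deferred):
        counter.uncharge()
    return counter.current


# One pid is already charged for the forking parent, as in the test-case.
print(fork_and_reap(PidsCounter(limit=2, current=1), 100, True))
```

With `uncharge_on_release=False` the same loop raises `PidsLimitExceeded` on its second iteration, which corresponds to the fork failure described in the test-case.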
      Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • bpf, cgroups: clean up kerneldoc warnings · 1832f4ef
      Valdis Kletnieks authored
      
      
      Building with W=1 reveals some bitrot:
      
        CC      kernel/bpf/cgroup.o
      kernel/bpf/cgroup.c:238: warning: Function parameter or member 'flags' not described in '__cgroup_bpf_attach'
      kernel/bpf/cgroup.c:367: warning: Function parameter or member 'unused_flags' not described in '__cgroup_bpf_detach'
      
      Add a kerneldoc line for 'flags'.
      
      Fixing the warning for 'unused_flags' is best approached by
      removing the unused parameter on the function call.
      Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  18. 17 Jan, 2019 2 commits
    • cgroup: saner refcounting for cgroup_root · 35ac1184
      Al Viro authored
      
      
      * make the reference from superblock to cgroup_root counting -
      do cgroup_put() in cgroup_kill_sb() whether we'd done
      percpu_ref_kill() or not; matching grab is done when we allocate
      a new root.  That gives the same refcounting rules for all callers
      of cgroup_do_mount() - a reference to cgroup_root has been grabbed
      by caller and it either is transferred to new superblock or dropped.
      
      * have cgroup_kill_sb() treat an already killed refcount as "just
      don't bother killing it, then".
      
      * after successful cgroup_do_mount() have cgroup1_mount() recheck
      if we'd raced with mount/umount from somebody else and cgroup_root
      got killed.  In that case we drop the superblock and bugger off
      with -ERESTARTSYS, same as if we'd found it in the list already
      dying.
      
      * don't bother with delayed initialization of refcount - it's
      unreliable and not needed.  No need to prevent attempts to bump
      the refcount if we find cgroup_root of another mount in progress -
      sget will reuse an existing superblock just fine and if the
      other sb manages to die before we get there, we'll catch
      that immediately after cgroup_do_mount().
      
      * don't bother with kernfs_pin_sb() - no need for doing that
      either.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • fix cgroup_do_mount() handling of failure exits · 399504e2
      Al Viro authored
      same story as with last May fixes in sysfs (7b745a4e
      "unfuck sysfs_mount()"); new_sb is left uninitialized
      in case of early errors in kernfs_mount_ns() and papering
      over it by treating any error from kernfs_mount_ns() as
      equivalent to !new_ns ends up conflating the cases when
      objects had never been transferred to a superblock with
      ones when that has happened and resulting new superblock
      had been dropped.  Easily fixed (same way as in sysfs
      case).  Additionally, there's a superblock leak on
      kernfs_node_dentry() failure *and* a dentry leak inside
      kernfs_node_dentry() itself - the latter on probably
      impossible errors, but the former not impossible to trigger
      (as a matter of fact, injecting allocation failures
      at that point *does* trigger it).
      
      Cc: stable@kernel.org
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  19. 28 Dec, 2018 1 commit
    • cgroup: fix parsing empty mount option string · e250d91d
      Ondrej Mosnacek authored
      This fixes the case where all mount options specified are consumed by an
      LSM and all that's left is an empty string. In this case cgroupfs should
      accept the string and not fail.
      
      How to reproduce (with SELinux enabled):
      
          # umount /sys/fs/cgroup/unified
          # mount -o context=system_u:object_r:cgroup_t:s0 -t cgroup2 cgroup2 /sys/fs/cgroup/unified
          mount: /sys/fs/cgroup/unified: wrong fs type, bad option, bad superblock on cgroup2, missing codepage or helper program, or other error.
          # dmesg | tail -n 1
          [   31.575952] cgroup: cgroup2: unknown option ""
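
The failure mode is easy to reproduce in a toy parser. The Python sketch below is not the kernel's option parser: `KNOWN_FLAGS` and `parse_cgroup2_options()` are hypothetical, and the flag set is an illustrative subset. It shows why an explicit early return for an empty string is needed, since splitting an empty string still yields one empty token.

```python
KNOWN_FLAGS = {"nsdelegate"}  # illustrative subset of cgroup2 mount options


def parse_cgroup2_options(data):
    """Parse a comma-separated option string into a set of flags."""
    flags = set()
    if not data:
        # Nothing left after an LSM consumed its options: accept it.
        return flags
    for token in data.split(","):
        if token not in KNOWN_FLAGS:
            raise ValueError(f'cgroup2: unknown option "{token}"')
        flags.add(token)
    return flags


print(parse_cgroup2_options(""))
print(parse_cgroup2_options("nsdelegate"))
```

Without the early return, `"".split(",")` produces `[""]`, and the loop would reject the empty token with the same kind of `unknown option ""` error seen in the dmesg output.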
      
      Fixes: 67e9c74b ("cgroup: replace __DEVEL__sane_behavior with cgroup2 fs type")
      [NOTE: should apply on top of commit 5136f636 ("cgroup: implement "nsdelegate" mount option"), older versions need manual rebase]
      Suggested-by: Stephen Smalley <sds@tycho.nsa.gov>
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  20. 08 Dec, 2018 1 commit