1. 19 Apr, 2019 1 commit
    • cgroup: cgroup v2 freezer · 76f969e8
      Roman Gushchin authored
      Cgroup v1 implements the freezer controller, which provides an ability
      to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      This patch implements freezer for cgroup v2.
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate a frozen task to another cgroup.
      This distinguishes the cgroup v2 freezer from the cgroup v1 freezer,
      which mostly tried to imitate the system-wide freezer. While
      uninterruptible sleep is fine when all tasks are going to be frozen
      (the hibernation case), it is not an acceptable state for a subset
      of the system.
      Cgroup v2 freezer does not support freezing kthreads.
      If a non-root cgroup contains a kthread, the cgroup can still be frozen,
      but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      * PTRACE_ATTACH is not working because non-fatal signal delivery
      is blocked in frozen state.
      There are some interface differences between cgroup v1 and cgroup v2
      freezer too, which are required to conform to the cgroup v2 interface
      design principles:
      1) There is no separate controller, which has to be turned on:
      the functionality is always available and is represented by
      cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      using cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes with the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      no matter if some cgroups are frozen.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
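      Because the interface is asynchronous, a userspace tool that writes to
      cgroup.freeze would typically re-read cgroup.events until the "frozen"
      field reports the requested state. The following is a minimal sketch of
      the parsing step only, assuming the usual "<key> <value>" line layout of
      cgroup.events; the helper name events_frozen() is hypothetical and not
      part of the patch.

```c
#include <string.h>

/*
 * Hypothetical helper: report the "frozen" field from the text of a
 * cgroup.events file. Returns 1 if frozen, 0 if not frozen, and -1 if
 * the field is missing or has an unexpected value.
 */
static int events_frozen(const char *events_text)
{
	const char *line = events_text;

	while (line && *line) {
		/* Each line is "<key> <value>"; match the key at line start. */
		if (strncmp(line, "frozen ", 7) == 0)
			return line[7] == '1' ? 1 : (line[7] == '0' ? 0 : -1);
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return -1;
}
```

      A caller would read the whole file into a buffer, pass it to the helper,
      and retry while the result differs from the value written to
      cgroup.freeze.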
  2. 31 Jan, 2019 1 commit
    • cgroup/pids: turn cgroup_subsys->free() into cgroup_subsys->release() to fix the accounting · 51bee5ab
      Oleg Nesterov authored
      The only user of cgroup_subsys->free() callback is pids_cgrp_subsys which
      needs pids_free() to uncharge the pid.
      However, ->free() is called from __put_task_struct()->cgroup_free() and this
      is too late. Even the trivial program which does
      	for (;;) {
      		int pid = fork();
      		assert(pid >= 0);
      		if (pid)
      			wait(NULL);
      		else
      			return 0;
      	}
      can run out of limits because release_task()->call_rcu(delayed_put_task_struct)
      implies an RCU gp after the task/pid goes away and before the final put().
      	mkdir -p /tmp/CG
      	mount -t cgroup2 none /tmp/CG
      	echo '+pids' > /tmp/CG/cgroup.subtree_control
      	mkdir /tmp/CG/PID
      	echo 2 > /tmp/CG/PID/pids.max
      	perl -e 'while ($p = fork) { wait; } $p // die "fork failed: $!\n"' &
      	echo $! > /tmp/CG/PID/cgroup.procs
      Without this patch the forking process fails soon after migration.
      Rename cgroup_subsys->free() to cgroup_subsys->release() and move the callsite
      into the new helper, cgroup_release(), called by release_task() which actually
      frees the pid(s).
      Reported-by: Herton R. Krzesinski <hkrzesin@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  3. 08 Dec, 2018 1 commit
  4. 02 Nov, 2018 1 commit
  5. 26 Oct, 2018 1 commit
  6. 24 Sep, 2018 1 commit
  7. 22 Sep, 2018 1 commit
  8. 12 Aug, 2018 1 commit
    • bpf: Introduce bpf_skb_ancestor_cgroup_id helper · 77236281
      Andrey Ignatov authored
      == Problem description ==
      It's useful to be able to identify cgroup associated with skb in TC so
      that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
      helper can help with this.
      In practice, though, the cgroup hierarchy and the hierarchy a policy
      applies to don't map 1:1.
      It's often the case that there is a container and corresponding cgroup,
      but there are many more sub-cgroups inside container, e.g. because it's
      delegated to containerized application to control resources for its
      subsystems, or to separate application inside container from infra that
      belongs to containerization system (e.g. sshd).
      At the same time it may be useful to apply a policy to a container as a
      whole.
      If multiple containers like this are run on a host (which is often the
      case) and many of them have sub-cgroups, it may not be possible to apply
      per-container policy in TC with existing helpers such as
      bpf_skb_under_cgroup or bpf_skb_cgroup_id:
      * bpf_skb_cgroup_id will return id of immediate cgroup associated with
        skb, i.e. if it's a sub-cgroup inside container, it can't be used to
        identify container's cgroup;
      * bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
        i.e. if there are N containers on a host and a policy has to be
        applied to M of them (0 <= M <= N), it'd require M calls to
        bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &
        load new BPF program.
      == Solution ==
      The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
      used to get id of cgroup v2 that is an ancestor of cgroup associated
      with skb at specified level of cgroup hierarchy.
      That way an admin can place all containers on one level of the cgroup
      hierarchy (which is good practice in general and already used in many
      configurations) and identify a specific cgroup on this level no matter
      what sub-cgroup the skb is associated with.
      E.g. if there is a cgroup hierarchy:
      , then having skb associated with root/container1/app11/sub-app-a/ it's
      possible to get the ancestor at level 1, which is container1, and apply
      the policy for this container, or apply another policy if it's container2.
      Policies can be kept e.g. in a hash map where key is a container cgroup
      id and value is an action.
      Levels where container cgroups are created are usually known in advance,
      whereas the cgroup hierarchy inside a container may be hard to predict,
      especially when its creation is delegated to the containerized
      application.
      == Implementation details ==
      The helper gets ancestor by walking parents up to specified level.
      Another option would be to get different kind of "id" from
      cgroup->ancestor_ids[level] and use it with idr_find() to get struct
      cgroup for ancestor. But that would require a radix lookup, which doesn't
      seem to be better (at least it's not obviously better).
      Format of return value of the new helper is same as that of
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
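      The "walk parents up to the specified level" approach from the
      implementation notes can be modelled in userspace as follows. This is a
      toy sketch, not the kernel code: struct toy_cgroup and both helpers are
      hypothetical stand-ins, and returning 0 when no ancestor exists at the
      requested level is an assumption of the sketch.

```c
#include <stddef.h>

/* Toy stand-in for struct cgroup; only the fields the walk needs. */
struct toy_cgroup {
	struct toy_cgroup *parent;	/* NULL for the root */
	int level;			/* the root is level 0 */
	unsigned long long id;
};

/*
 * Walk parents until the requested level is reached. Returns NULL when
 * the cgroup sits above that level, so no such ancestor exists.
 */
static struct toy_cgroup *toy_ancestor(struct toy_cgroup *cg, int level)
{
	if (level < 0 || cg->level < level)
		return NULL;
	while (cg && cg->level > level)
		cg = cg->parent;
	return cg;
}

/* Returns the ancestor's id, or 0 when there is no ancestor at that level. */
static unsigned long long toy_ancestor_id(struct toy_cgroup *cg, int level)
{
	struct toy_cgroup *anc = toy_ancestor(cg, level);

	return anc ? anc->id : 0;
}
```

      With containers placed at level 1, the id returned for level 1 is the
      same for every sub-cgroup of a container, so it can serve as the key of
      a policy map regardless of how deep the skb's own cgroup is nested.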
  9. 26 Apr, 2018 2 commits
    • cgroup: Replace cgroup_rstat_mutex with a spinlock · 0fa294fb
      Tejun Heo authored
      Currently, rstat flush path is protected with a mutex which is fine as
      all the existing users are from interface file show path.  However,
      rstat is being generalized for use by controllers and flushing from
      atomic contexts will be necessary.
      This patch replaces cgroup_rstat_mutex with a spinlock and adds an
      irq-safe flush function - cgroup_rstat_flush_irqsafe().  Explicit
      yield handling is added to the flush path so that other flush
      functions can yield to other threads and flushers.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: Factor out and expose cgroup_rstat_*() interface functions · 6162cef0
      Tejun Heo authored
      cgroup_rstat is being generalized so that controllers can use it too.
      This patch factors out and exposes the following interface functions.
      * cgroup_rstat_updated(): Renamed from cgroup_rstat_cpu_updated() for
      * cgroup_rstat_flush_hold/release(): Factored out from base stat
      * cgroup_rstat_flush(): Verbatim expose.
      While at it, drop assert on cgroup_rstat_mutex in
      cgroup_base_stat_flush() as it crosses layers and make a minor comment
      v2: Added EXPORT_SYMBOL_GPL(cgroup_rstat_updated) to fix a build bug.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  10. 02 Nov, 2017 1 commit
    • License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman authored
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      How this work was done:
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information in it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      Criteria used to select files for SPDX license identifier tagging were:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
      All documentation files were explicitly excluded.
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
         For non */uapi/* files that summary was:
         SPDX license identifier                            # files
         GPL-2.0                                              11139
         and resulted in the first patch in this series.
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                        930
         and resulted in the second patch in this series.
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
         SPDX license identifier                            # files
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
         and that resulted in the third patch in this series.
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there were new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: Philippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
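      The commit notes that header and source .c files need different comment
      types for the tag. A small sketch of that distinction follows, assuming
      the convention from the kernel's license rules documentation (a "//"
      comment in C source files, a "/* */" comment in headers). It is not the
      script used for this work; spdx_tag_line() is a hypothetical helper that
      only looks at the file name suffix.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the SPDX tag line for a file, choosing the
 * comment style from the file name. Returns the snprintf() result, i.e.
 * the length the full line needs, excluding the terminating nul.
 */
static int spdx_tag_line(const char *filename, const char *license_id,
			 char *buf, size_t buflen)
{
	size_t len = strlen(filename);
	int is_header = len >= 2 && strcmp(filename + len - 2, ".h") == 0;

	if (is_header)
		return snprintf(buf, buflen,
				"/* SPDX-License-Identifier: %s */", license_id);
	return snprintf(buf, buflen, "// SPDX-License-Identifier: %s", license_id);
}
```

      Other file types (Makefiles, assembly, scripts) use further comment
      styles that this sketch does not cover.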
  11. 26 Oct, 2017 1 commit
    • cgroup, sched: Move basic cpu stats from cgroup.stat to cpu.stat · d41bf8c9
      Tejun Heo authored
      The basic cpu stat is currently shown with "cpu." prefix in
      cgroup.stat, and the same information is duplicated in cpu.stat when
      cpu controller is enabled.  This is ugly and not very scalable as we
      want to expand the coverage of stat information which is always
      This patch makes cgroup core always create "cpu.stat" file and show
      the basic cpu stat there and calls the cpu controller to show the
      extra stats when enabled.  This ensures that the same information
      isn't presented in multiple places and makes future expansion of basic
      stats easier.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  12. 25 Sep, 2017 2 commits
    • cgroup: Implement cgroup2 basic CPU usage accounting · 041cd640
      Tejun Heo authored
      In cgroup1, while cpuacct isn't actually controlling any resources, it
      is a separate controller due to combination of two factors -
      1. enabling cpu controller has significant side effects, and 2. we
      have to pick one of the hierarchies to account CPU usages on.  cpuacct
      controller is effectively used to designate a hierarchy to track CPU
      usages on.
      cgroup2's unified hierarchy removes the second reason and we can
      account basic CPU usages by default.  While we can use cpuacct for
      this purpose, both its interface and implementation leave a lot to be
      desired - it collects and exposes two sources of truth which don't
      agree with each other and some of the exposed statistics don't make
      much sense.  Also, it propagates all the way up the hierarchy on each
      accounting event which is unnecessary.
      This patch adds basic resource accounting mechanism to cgroup2's
      unified hierarchy and accounts CPU usages using it.
      * All accountings are done per-cpu and don't propagate immediately.
        It just bumps the per-cgroup per-cpu counters and links to the
        parent's updated list if not already on it.
      * On a read, the per-cpu counters are collected into the global ones
        and then propagated upwards.  Only the per-cpu counters which have
        changed since the last read are propagated.
      * CPU usage stats are collected and shown in "cgroup.stat" with "cpu."
        prefix.  Total usage is collected from scheduling events.  User/sys
        breakdown is sourced from tick sampling and adjusted to the usage
        using cputime_adjust().
      This keeps the accounting side hot path O(1) and per-cpu and the read
      side O(nr_updated_since_last_read).
      v2: Minor changes and documentation updates as suggested by Waiman and
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
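      The split between a cheap per-cpu accounting path and a lazy,
      read-triggered propagation can be illustrated with a userspace toy
      model. Everything here is a simplification under stated assumptions:
      struct toy_cgroup and the toy_*() functions are hypothetical, there is
      no locking, and the sketch scans every CPU on flush instead of keeping
      the updated list the commit describes.

```c
#include <stdint.h>
#include <stddef.h>

#define TOY_NR_CPUS 4	/* arbitrary for the sketch */

struct toy_cgroup {
	struct toy_cgroup *parent;		/* NULL for the root */
	uint64_t percpu[TOY_NR_CPUS];		/* bumped on the hot path */
	uint64_t percpu_seen[TOY_NR_CPUS];	/* part already folded into total */
	uint64_t total;				/* self plus flushed descendants */
	uint64_t propagated;			/* part of total already given to parent */
};

/* Hot path: O(1), touches only the local per-cpu counter. */
static void toy_account(struct toy_cgroup *cg, int cpu, uint64_t delta)
{
	cg->percpu[cpu] += delta;
}

/* Read path: fold per-cpu deltas into total, then pass the new part upwards. */
static void toy_flush_one(struct toy_cgroup *cg)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		cg->total += cg->percpu[cpu] - cg->percpu_seen[cpu];
		cg->percpu_seen[cpu] = cg->percpu[cpu];
	}
	if (cg->parent) {
		cg->parent->total += cg->total - cg->propagated;
		cg->propagated = cg->total;
	}
}

/* Flush a cgroup and each of its ancestors, bottom-up. */
static void toy_flush_chain(struct toy_cgroup *leaf)
{
	for (struct toy_cgroup *cg = leaf; cg; cg = cg->parent)
		toy_flush_one(cg);
}
```

      Only deltas move upwards, so flushing twice without new accounting
      leaves every total unchanged.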
    • cpuacct: Introduce cgroup_account_cputime[_field]() · d2cc5ed6
      Tejun Heo authored
      Introduce cgroup_account_cputime[_field]() which wrap cpuacct_charge()
      and cgroup_account_field().  This doesn't introduce any functional
      changes and will be used to add cgroup basic resource accounting.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@redhat.com>
  13. 11 Aug, 2017 1 commit
    • cgroup: misc changes · 3e48930c
      Tejun Heo authored
      Misc trivial changes to prepare for future changes.  No functional
      changes.
      * Expose cgroup_get(), cgroup_tryget() and cgroup_parent().
      * Implement task_dfl_cgroup() which dereferences css_set->dfl_cgrp.
      * Rename cgroup_stats_show() to cgroup_stat_show() for consistency
        with the file name.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  14. 29 Jul, 2017 3 commits
  15. 21 Jul, 2017 3 commits
    • cgroup: implement CSS_TASK_ITER_THREADED · 450ee0c1
      Tejun Heo authored
      cgroup v2 is in the process of growing thread granularity support.
      Once thread mode is enabled, the root cgroup of the subtree serves as
      the dom_cgrp to which the processes of the subtree conceptually belong
      and domain-level resource consumptions not tied to any specific task
      are charged.  In the subtree, threads won't be subject to process
      granularity or no-internal-task constraint and can be distributed
      arbitrarily across the subtree.
      This patch implements a new task iterator flag CSS_TASK_ITER_THREADED,
      which, when used on a dom_cgrp, makes the iteration include the tasks
      on all the associated threaded css_sets.  "cgroup.procs" read path is
      updated to use it so that reading the file on a proc_cgrp lists all
      processes.  This will also be used by controller implementations which
      need to walk processes or tasks at the resource domain level.
      Task iteration is implemented nested in css_set iteration.  If
      CSS_TASK_ITER_THREADED is specified, after walking tasks of each
      !threaded css_set, all the associated threaded css_sets are visited
      before moving onto the next !threaded css_set.
      v2: ->cur_pcset renamed to ->cur_dcset.  Updated for the new
          enable-threaded-per-cgroup behavior.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: introduce cgroup->dom_cgrp and threaded css_set handling · 454000ad
      Tejun Heo authored
      cgroup v2 is in the process of growing thread granularity support.  A
      threaded subtree is composed of a thread root and threaded cgroups
      which are proper members of the subtree.
      The root cgroup of the subtree serves as the domain cgroup to which
      the processes (as opposed to threads / tasks) of the subtree
      conceptually belong and domain-level resource consumptions not tied to
      any specific task are charged.  Inside the subtree, threads won't be
      subject to process granularity or no-internal-task constraint and can
      be distributed arbitrarily across the subtree.
      This patch introduces cgroup->dom_cgrp along with threaded css_set
      * cgroup->dom_cgrp points to self for normal and thread roots.  For
        proper thread subtree members, points to the dom_cgrp (the thread
      * css_set->dom_cset points to self for normal and thread roots.  If
        threaded, points to the css_set which belongs to the cgrp->dom_cgrp.
        The dom_cgrp serves as the resource domain and keeps the matching
        csses available.  The dom_cset holds those csses and makes them
        easily accessible.
      * All threaded csets are linked on their dom_csets to enable iteration
        of all threaded tasks.
      * cgroup->nr_threaded_children keeps track of the number of threaded
      This patch adds the above but doesn't actually use them yet.  The
      following patches will build on top.
      v4: ->nr_threaded_children added.
      v3: ->proc_cgrp/cset renamed to ->dom_cgrp/cset.  Updated for the new
          enable-threaded-per-cgroup behavior.
      v2: Added cgroup_is_threaded() helper.
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: add @flags to css_task_iter_start() and implement CSS_TASK_ITER_PROCS · bc2fb7ed
      Tejun Heo authored
      css_task_iter currently always walks all tasks.  With the scheduled
      cgroup v2 thread support, the iterator would need to handle multiple
      types of iteration.  As a preparation, add @flags to
      css_task_iter_start() and implement CSS_TASK_ITER_PROCS.  If the flag
      is not specified, it walks all tasks as before.  When asserted, the
      iterator only walks the group leaders.
      For now, the only user of the flag is cgroup v2 "cgroup.procs" file
      which no longer needs to skip non-leader tasks in cgroup_procs_next().
      Note that cgroup v1 "cgroup.procs" can't use the group leader walk as
      v1 "cgroup.procs" doesn't mean "list all thread group leaders in the
      cgroup" but "list all thread group id's with any threads in the
      While at it, update cgroup_procs_show() to use task_pid_vnr() instead
      of task_tgid_vnr().  As the iteration guarantees that the function
      only sees group leaders, this doesn't change the output and will allow
      sharing the function for thread iteration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
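      The effect of the flag can be shown with a userspace toy iterator: with
      no flag every task is visited, and with the flag only thread group
      leaders (tasks whose pid equals their tgid) are. The names below,
      including TOY_ITER_PROCS, are hypothetical stand-ins rather than the
      kernel's definitions.

```c
#include <stddef.h>

#define TOY_ITER_PROCS (1U << 0)	/* stand-in for CSS_TASK_ITER_PROCS */

struct toy_task {
	int pid;	/* id of this thread */
	int tgid;	/* id of the thread group leader */
};

/*
 * Copy the pids of the visited tasks into out[] (up to out_cap entries)
 * and return how many tasks matched the iteration flags.
 */
static size_t toy_iter_collect(const struct toy_task *tasks, size_t n,
			       unsigned int flags, int *out, size_t out_cap)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		/* With the flag set, skip every non-leader thread. */
		if ((flags & TOY_ITER_PROCS) && tasks[i].pid != tasks[i].tgid)
			continue;
		if (count < out_cap)
			out[count] = tasks[i].pid;
		count++;
	}
	return count;
}
```

      Since a leader's pid and tgid are equal, printing the pid of each
      visited task gives the same output as printing its tgid, which is why
      the show function could switch between the two without a visible change.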
  16. 17 Jul, 2017 1 commit
    • cgroup: distinguish local and children populated states · 788b950c
      Tejun Heo authored
      cgrp->populated_cnt counts both local (the cgroup's populated
      css_sets) and subtree proper (populated children) so that it's only
      zero when the whole subtree, including self, is empty.
      This patch splits the counter into two so that local and children
      populated states are tracked separately.  It allows finer-grained
      tests on the state of the hierarchy which will be used to replace
      css_set walking local populated test.
      Signed-off-by: Tejun Heo <tj@kernel.org>
  17. 24 May, 2017 1 commit
    • cpuset: consider dying css as offline · 41c25707
      Tejun Heo authored
      In most cases, a cgroup controller doesn't care about the lifetimes of
      cgroups.  For the controller, a css becomes online when ->css_online()
      is called on it and offline when ->css_offline() is called.
      However, cpuset is special in that the user interface it exposes cares
      whether certain cgroups exist or not.  Combined with the RCU delay
      between cgroup removal and css offlining, this can lead to user
      visible behavior oddities where operations which should succeed after
      cgroup removals fail for some time period.  The effects of cgroup
      removals are delayed when seen from userland.
      This patch adds css_is_dying() which tests whether offline is pending
      and updates is_cpuset_online() so that the function returns false also
      while offline is pending.  This gets rid of the userland visible
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Link: http://lkml.kernel.org/r/327ca1f5-7957-fbb9-9e5f-9ba149d40ba2@oracle.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Tejun Heo <tj@kernel.org>
  18. 24 Mar, 2017 1 commit
  19. 17 Mar, 2017 1 commit
    • cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups · 77f88796
      Tejun Heo authored
      Creation of a kthread goes through a couple interlocked stages between
      the kthread itself and its creator.  Once the new kthread starts
      running, it initializes itself and wakes up the creator.  The creator
      then can further configure the kthread and then let it start doing its
      job by waking it up.
      In this configuration-by-creator stage, the creator is the only one
      that can wake it up but the kthread is visible to userland.  When
      altering the kthread's attributes from userland is allowed, this is
      fine; however, for cases where CPU affinity is critical,
      kthread_bind() is used to first disable affinity changes from userland
      and then set the affinity.  This also prevents the kthread from being
      migrated into non-root cgroups as that can affect the CPU affinity and
      many other things.
      Unfortunately, the cgroup side of protection is racy.  While the
      PF_NO_SETAFFINITY flag prevents further migrations, userland can win
      the race before the creator sets the flag with kthread_bind() and put
      the kthread in a non-root cgroup, which can lead to all sorts of
      problems including incorrect CPU affinity and starvation.
      This bug got triggered by userland which periodically tries to migrate
      all processes in the root cpuset cgroup to a non-root one.  Per-cpu
      workqueue workers got caught while being created and ended up with
      incorrect CPU affinity breaking concurrency management and sometimes
      stalling workqueue execution.
      This patch adds task->no_cgroup_migration which disallows the task to
      be migrated by userland.  kthreadd starts with the flag set making
      every child kthread start in the root cgroup with migration
      disallowed.  The flag is cleared after the kthread finishes
      initialization by which time PF_NO_SETAFFINITY is set if the kthread
      should stay in the root cgroup.
      It'd be better to wait for the initialization instead of failing but I
      couldn't think of a way of implementing that without adding either a
      new PF flag, or sleeping and retrying from waiting side.  Even if
      userland depends on changing cgroup membership of a kthread, it either
      has to be synchronized with kthread_create() or periodically repeat,
      so it's unlikely that this would break anything.
      v2: Switch to a simpler implementation using a new task_struct bit
          field suggested by Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-and-debugged-by: Chris Mason <clm@fb.com>
      Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
      Signed-off-by: Tejun Heo <tj@kernel.org>
  20. 06 Mar, 2017 1 commit
  21. 27 Dec, 2016 1 commit
  22. 13 Aug, 2016 1 commit
  23. 10 Aug, 2016 2 commits
    • cgroup: make cgroup_path() and friends behave in the style of strlcpy() · 4c737b41
      Tejun Heo authored
      cgroup_path() and friends used to format the path from the end and
      thus the resulting path usually didn't start at the start of the
      passed in buffer.  Also, when the buffer was too small, the partial
      result was truncated from the head rather than tail and there was no
      way to tell how long the full path would be.  These make the functions
      less robust and more awkward to use.
      With recent updates to kernfs_path(), cgroup_path() and friends can be
      made to behave in strlcpy() style.
      * cgroup_path(), cgroup_path_ns[_locked]() and task_cgroup_path() now
        always return the length of the full path.  If buffer is too small,
        it contains nul terminated truncated output.
      * All users updated accordingly.
      v2: cgroup_path() usage in kernel/sched/debug.c converted.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
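      The strlcpy()-style contract described in this commit can be sketched
      as a small Python model. This is not the kernel code: format_path(),
      was_truncated() and buflen are hypothetical names, and the buffer is
      modelled as a character budget that reserves one slot for the NUL
      terminator.

```python
def format_path(components, buflen):
    """Model of a strlcpy()-style path formatter.

    Returns (buf, full_len). buf holds at most buflen - 1 characters of the
    path (one slot is reserved for the NUL terminator in C), and full_len is
    always the length of the untruncated path.
    """
    full = "/" + "/".join(components) if components else "/"
    buf = full[:max(buflen - 1, 0)]
    return buf, len(full)


def was_truncated(full_len, buflen):
    # Same check callers use with strlcpy(): the output did not fit
    # if the full length is greater than or equal to the buffer size.
    return full_len >= buflen


buf, n = format_path(["system.slice", "sshd.service"], 16)
print(buf, n, was_truncated(n, 16))
```

      The output always starts at the beginning of the buffer and is cut
      from the tail, and a caller that cares about truncation can size a
      retry buffer from the returned length.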
    • kernfs: make kernfs_path*() behave in the style of strlcpy() · 3abb1d90
      Tejun Heo authored
      kernfs_path*() functions always return the length of the full path but
      the path content is undefined if the length is larger than the
      provided buffer.  This makes its behavior different from strlcpy() and
      requires error handling in all its users even when they don't care
      about truncation.  In addition, the implementation can actually be
      simplified by making it behave properly in strlcpy() style.
      * Update kernfs_path_from_node_locked() to always fill up the buffer
        with path.  If the buffer is not large enough, the output is
        truncated and terminated.
      * kernfs_path() no longer needs error handling.  Make it a simple
        inline wrapper around kernfs_path_from_node().
      * sysfs_warn_dup()'s use of kernfs_path() doesn't need error handling.
        Updated accordingly.
      * cgroup_path()'s use of kernfs_path() updated to retain the old
        behavior.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Acked-by: Serge Hallyn <serge.hallyn@ubuntu.com>
  24. 08 Aug, 2016 1 commit
  25. 01 Jul, 2016 1 commit
  26. 16 Feb, 2016 1 commit
    • cgroup: introduce cgroup namespaces · a79a908f
      Aditya Kali authored
      Introduce the ability to create a new cgroup namespace. The newly created
      cgroup namespace remembers the cgroup of the process at the point
      of creation of the cgroup namespace (referred to as cgroupns-root).
      The main purpose of cgroup namespace is to virtualize the contents
      of /proc/self/cgroup file. Processes inside a cgroup namespace
      are only able to see paths relative to their namespace root
      (unless they are moved outside of their cgroupns-root, at which point
       they will see a relative path from their cgroupns-root).
      For a correctly set up container this enables container-tools
      (like libcontainer, lxc, lmctfy, etc.) to create completely virtualized
      containers without leaking system level cgroup hierarchy to the task.
      This patch only implements the 'unshare' part of the cgroupns.
      Signed-off-by: Aditya Kali <adityakali@google.com>
      Signed-off-by: Serge Hallyn <serge.hallyn@canonical.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
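      The path virtualization described in this commit can be illustrated
      with a short Python model. It only mimics how a path might look
      relative to a cgroupns-root; ns_relative_path() and the example cgroup
      names are hypothetical, and the exact format the kernel prints is an
      assumption of this sketch.

```python
import posixpath


def ns_relative_path(cgroup_path, cgroupns_root):
    """Path of cgroup_path as seen from a namespace rooted at cgroupns_root."""
    rel = posixpath.relpath(cgroup_path, start=cgroupns_root)
    return "/" if rel == "." else "/" + rel


print(ns_relative_path("/batch/job1", "/batch/job1"))         # the namespace root itself
print(ns_relative_path("/batch/job1/worker", "/batch/job1"))  # inside the namespace
print(ns_relative_path("/batch/job2", "/batch/job1"))         # outside the cgroupns-root
```

      A process inside the namespace sees its root as "/", so the system
      level prefix ("/batch/job1" here) is not exposed to it.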
  27. 09 Dec, 2015 1 commit
    • sock, cgroup: add sock->sk_cgroup · bd1060a1
      Tejun Heo authored
      In cgroup v1, dealing with cgroup membership was difficult because the
      number of membership associations was unbounded.  As a result, cgroup v1
      grew several controllers whose primary purpose is either tagging
      membership or pulling in configuration knobs from other subsystems so
      that cgroup membership tests can be avoided.
      net_cls and net_prio controllers are examples of the latter.  They
      allow configuring network-specific attributes from cgroup side so that
      network subsystem can avoid testing cgroup membership; unfortunately,
      these are not only cumbersome but also problematic.
      Both net_cls and net_prio aren't properly hierarchical.  Both inherit
      configuration from the parent on creation but there's no interaction
      afterwards.  An ancestor doesn't restrict the behavior in its subtree
      in any way and configuration changes aren't propagated downwards.
      Especially when combined with cgroup delegation, this is problematic
      because delegatees can mess up whatever network configuration is
      implemented at the system level.  net_prio would allow the delegatees
      to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
      the same for classid.
      While it is possible to solve these issues from controller side by
      implementing hierarchical allowable ranges in both controllers, it
      would involve quite a bit of complexity in the controllers and further
      obfuscate network configuration as it becomes even more difficult to
      tell what's actually being configured looking from the network side.
      While not much can be done for v1 at this point, as membership
      handling is sane on cgroup v2, it'd be better to make cgroup matching
      behave like other network matches and classifiers than introducing
      further complications.
      In preparation, this patch updates sock->sk_cgrp_data handling so that
      it points to the v2 cgroup that sock was created in until either
      net_prio or net_cls is used.  Once either of the two is used,
      sock->sk_cgrp_data reverts to its previous role of carrying prioidx
      and classid.  This is to avoid adding yet another cgroup related field
      to struct sock.
      As the mode switching can happen at most once per boot, the switching
      mechanism is aimed at lowering hot path overhead.  It may leak a
      finite, likely small, number of cgroup refs and report spurious
      prioidx or classid on switching; however, dynamic updates of prioidx
      and classid have always been racy and lossy - socks between creation
      and fd installation are never updated, config changes don't update
      existing sockets at all, and prioidx may index with dead and recycled
      cgroup IDs.  Non-critical inaccuracies from small race windows won't
      make any noticeable difference.
      This patch doesn't make use of the pointer yet.  The following patch
      will implement netfilter match for cgroup2 membership.
      v2: Use sock_cgroup_data to avoid inflating struct sock w/ another
          cgroup specific field.
      v3: Add comments explaining why sock_data_prioidx() and
          sock_data_classid() use different fallback values.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Daniel Wagner <daniel.wagner@bmw-carit.de>
      CC: Neil Horman <nhorman@tuxdriver.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
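      The dual role of sock->sk_cgrp_data described in this commit can be
      modelled roughly as below. This is a simplified Python sketch, not the
      kernel data structure: the class and method names are hypothetical,
      the switch is modelled per object rather than once per boot, and the
      two fallback values are assumptions (the commit only states that they
      differ).

```python
# Assumed fallback values reported while the slot still holds a cgroup
# reference; the commit only says the two fallbacks differ.
FALLBACK_PRIOIDX = 1
FALLBACK_CLASSID = 0


class SockCgroupData:
    """One slot holding either a v2 cgroup or a (prioidx, classid) pair."""

    def __init__(self, cgroup):
        self._cgroup = cgroup  # v2 cgroup the sock was created in
        self._ids = None       # set once net_prio or net_cls is used

    def _switch_to_ids(self):
        # One-way switch: the cgroup reference is dropped and never restored.
        if self._ids is None:
            self._cgroup = None
            self._ids = [FALLBACK_PRIOIDX, FALLBACK_CLASSID]

    def set_prioidx(self, prioidx):
        self._switch_to_ids()
        self._ids[0] = prioidx

    def set_classid(self, classid):
        self._switch_to_ids()
        self._ids[1] = classid

    def cgroup(self):
        return self._cgroup

    def prioidx(self):
        return self._ids[0] if self._ids else FALLBACK_PRIOIDX

    def classid(self):
        return self._ids[1] if self._ids else FALLBACK_CLASSID


skcd = SockCgroupData("/user.slice")
print(skcd.cgroup(), skcd.prioidx(), skcd.classid())
skcd.set_classid(0x100001)
print(skcd.cgroup(), skcd.prioidx(), skcd.classid())
```

      Sharing one slot avoids a second cgroup-related field, at the cost of
      losing the cgroup reference once either legacy controller is used.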
  28. 03 Dec, 2015 2 commits
    • cgroup: kill cgrp_ss_priv[CGROUP_CANFORK_COUNT] and friends · b53202e6
      Oleg Nesterov authored
      Now that nobody uses the "priv" arg passed to can_fork/cancel_fork/fork we can
      kill CGROUP_CANFORK_COUNT/SUBSYS_TAG/etc and cgrp_ss_priv[] in copy_process().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • cgroup: fix handling of multi-destination migration from subtree_control enabling · 1f7dd3e5
      Tejun Heo authored
      Consider the following v2 hierarchy.
        P0 (+memory) --- P1 (-memory) --- A
                                       \- B
      P0 has memory enabled in its subtree_control while P1 doesn't.  If
      both A and B contain processes, they would belong to the memory css of
      P1.  Now if memory is enabled on P1's subtree_control, memory csses
      should be created on both A and B and A's processes should be moved to
      the former and B's processes the latter.  IOW, enabling controllers
      can cause atomic migrations into different csses.
      The core cgroup migration logic has been updated accordingly but the
      controller migration methods haven't and still assume that all tasks
      migrate to a single target css; furthermore, the methods were fed the
      css in which subtree_control was updated which is the parent of the
      target csses.  pids controller depends on the migration methods to
      move charges and this made the controller attribute charges to the
      wrong csses often triggering the following warning by driving a
      counter negative.
       WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
       Modules linked in:
       CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
        ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
        ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
        ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
       Call Trace:
        [<ffffffff81551ffc>] dump_stack+0x4e/0x82
        [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
        [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
        [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
        [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
        [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
        [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
        [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
        [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
        [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
        [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
        [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
        [<ffffffff81265f88>] __vfs_write+0x28/0xe0
        [<ffffffff812666fc>] vfs_write+0xac/0x1a0
        [<ffffffff81267019>] SyS_write+0x49/0xb0
        [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
      This patch fixes the bug by removing the @css parameter from the three
      migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
      updating the cgroup_taskset iteration helpers to also return the destination
      css in addition to the task being migrated.  All controllers are
      updated accordingly.
      * Controllers which don't care whether there are one or multiple
        target csses can be converted trivially.  cpu, io, freezer, perf,
        netclassid and netprio fall in this category.
      * cpuset's current implementation assumes that there's a single source
        and destination and thus doesn't support v2 hierarchy already.  The
        only change made by this patchset is how that single destination css
        is obtained.
      * memory migration path already doesn't do anything on v2.  How the
        single destination css is obtained is updated and the prep stage of
        mem_cgroup_can_attach() is reordered to accommodate the change.
      * pids is the only controller which was affected by this bug.  It now
        correctly handles multi-destination migrations and no longer causes
        counter underflow from incorrect accounting.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
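      The accounting problem described in this commit can be reproduced in a
      toy Python model. It is a deliberately flat sketch: real pids charges
      are hierarchical, and attach_single_css(), attach_per_task() and
      exit_task() are hypothetical stand-ins for the controller methods, not
      their actual signatures.

```python
from collections import Counter


def attach_single_css(charges, css, tasks):
    """Old shape: one css is passed for the whole taskset (the parent, P1)."""
    for _task in tasks:
        charges[css] += 1


def attach_per_task(charges, taskset):
    """New shape: iteration yields (task, dst_css) pairs."""
    for _task, dst_css in taskset:
        charges[dst_css] += 1


def exit_task(charges, css):
    # On exit a task is uncharged from the css it actually belongs to.
    charges[css] -= 1


# After the controller is enabled on P1, t1 lives in A and t2 in B.
taskset = [("t1", "A"), ("t2", "B")]

old = Counter()
attach_single_css(old, "P1", [task for task, _ in taskset])
new = Counter()
attach_per_task(new, taskset)

for _task, css in taskset:
    exit_task(old, css)
    exit_task(new, css)

print(dict(old))  # charges stranded on P1, A and B driven negative
print(dict(new))  # balanced
```

      With a single css argument the charges land on the parent while the
      uncharges hit the real destinations, which is the kind of mismatch
      that drives a counter negative.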
  29. 20 Nov, 2015 2 commits
    • cgroup: implement cgroup_get_from_path() and expose cgroup_put() · 16af4396
      Tejun Heo authored
      Implement cgroup_get_from_path() using kernfs_walk_and_get() which
      obtains a default hierarchy cgroup from its path.  This will be used
      to allow cgroup path based matching from outside cgroup proper -
      e.g. networking and perf.
      v2: Add EXPORT_SYMBOL_GPL(cgroup_get_from_path).
      Signed-off-by: Tejun Heo <tj@kernel.org>
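      The get/put pairing that this commit exposes can be sketched with a
      small Python model of a path walk that takes a reference on success.
      Node, get_from_path() and put() are hypothetical names, and returning
      None on a failed lookup is a simplification of this sketch.

```python
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.children = {}
        self.refcnt = 1  # base reference held by the hierarchy itself
        if parent is not None:
            parent.children[name] = self


def get_from_path(root, path):
    """Walk path from root; on success take a reference and return the node."""
    node = root
    for comp in filter(None, path.split("/")):
        node = node.children.get(comp)
        if node is None:
            return None  # lookup failed, no reference taken
    node.refcnt += 1
    return node


def put(node):
    """Drop a reference previously taken by get_from_path()."""
    node.refcnt -= 1


root = Node("")
user = Node("user.slice", root)
session = Node("session-1.scope", user)

found = get_from_path(root, "/user.slice/session-1.scope")
print(found.name, found.refcnt)
put(found)
print(found.refcnt)
```

      Every successful lookup must be balanced by a put, which is why the
      put side has to be available to users outside cgroup proper.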
    • cgroup: record ancestor IDs and reimplement cgroup_is_descendant() using it · b11cfb58
      Tejun Heo authored
      cgroup_is_descendant() currently walks up the hierarchy and compares
      each ancestor to the cgroup in question.  While enough for cgroup core
      usages, this can't be used in hot paths to test cgroup membership.
      This patch adds cgroup->ancestor_ids[] which records the IDs of all
      ancestors including self and cgroup->level for the nesting level.
      This allows testing whether a given cgroup is a descendant of another
      in three finite steps - testing whether the two belong to the same
      hierarchy, whether the descendant candidate is at the same or a higher
      level than the ancestor and comparing the recorded ancestor_id at the
      matching level.  cgroup_is_descendant() is accordingly reimplemented
      and made inline.
      Signed-off-by: Tejun Heo <tj@kernel.org>
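      The three-step test described in this commit can be sketched in Python
      as follows. The field names mirror the commit text (level,
      ancestor_ids), but the Cgroup class and the ID allocation are
      assumptions made for illustration.

```python
import itertools

_next_id = itertools.count(1)


class Cgroup:
    def __init__(self, parent=None):
        self.id = next(_next_id)
        self.root = parent.root if parent is not None else self
        self.level = parent.level + 1 if parent is not None else 0
        # IDs of all ancestors including self, indexed by nesting level.
        inherited = parent.ancestor_ids if parent is not None else []
        self.ancestor_ids = inherited + [self.id]


def is_descendant(cgrp, ancestor):
    """True if cgrp is ancestor itself or lies somewhere below it."""
    if cgrp.root is not ancestor.root:  # 1. same hierarchy?
        return False
    if cgrp.level < ancestor.level:     # 2. nested at least as deep?
        return False
    # 3. compare the recorded ID at the ancestor's level
    return cgrp.ancestor_ids[ancestor.level] == ancestor.id


root = Cgroup()
a = Cgroup(root)
b = Cgroup(a)
c = Cgroup(root)
print(is_descendant(b, a), is_descendant(b, c), is_descendant(a, b))
```

      The check costs a constant number of comparisons regardless of depth,
      in exchange for storing one ID per ancestor in every cgroup.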
  30. 16 Nov, 2015 1 commit
    • cgroup: fix cftype->file_offset handling · 34c06254
      Tejun Heo authored
      6f60eade ("cgroup: generalize obtaining the handles of and
      notifying cgroup files") introduced cftype->file_offset so that the
      handles for per-css file instances can be recorded.  These handles
      then can be used, for example, to generate file modified
      notifications.
      Unfortunately, it made the wrong assumption that files are created
      once for a given css and removed on its destruction.  Due to the
      dependencies among subsystems, a css may be hidden from userland and
      then later shown again.  This is implemented by removing and
      re-creating the affected files, so the associated kernfs_node for a
      given cgroup file may change over time.  This incorrect assumption led
      to the corruption of css->files lists.
      Reimplement cftype->file_offset handling so that cgroup_file->kn is
      protected by a lock and updated as files are created and destroyed.
      This also makes keeping them on per-cgroup list unnecessary.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: James Sedgwick <jsedgwick@fb.com>
      Fixes: 6f60eade ("cgroup: generalize obtaining the handles of and notifying cgroup files")
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Zefan Li <lizefan@huawei.com>
  31. 15 Oct, 2015 1 commit
    • cgroup: keep zombies associated with their original cgroups · 2e91fa7f
      Tejun Heo authored
      cgroup_exit() is called when a task exits; it disassociates the
      exiting task from its cgroups and half-attaches it to the root cgroup.
      This is unnecessary and undesirable.
      No controller actually needs an exiting task to be disassociated from
      non-root cgroups.  Both cpu and perf_event controllers update the
      association to the root cgroup from their exit callbacks just to stay
      consistent with the cgroup core behavior.
      Also, this disassociation makes it difficult to track resources held
      by zombies or determine where the zombies came from.  Currently, pids
      controller is completely broken as it uncharges on exit and zombies
      always escape the resource restriction.  With cgroup association being
      reset on exit, fixing it is pretty painful.
      There's no reason to reset cgroup membership on exit.  The zombie can
      be removed from its css_set so that it doesn't show up on
      "cgroup.procs" and thus can't be migrated or interfere with cgroup
      removal.  It can still pin and point to the css_set so that its cgroup
      membership is maintained.  This patch makes cgroup core keep zombies
      associated with their cgroups at the time of exit.
      * Previous patches decoupled populated_cnt tracking from css_set
        lifetime, so a dying task can be simply unlinked from its css_set
        while pinning and pointing to the css_set.  This keeps css_set
        association from task side alive while hiding it from "cgroup.procs"
        and populated_cnt tracking.  The css_set reference is dropped when
        the task_struct is freed.
      * ->exit() callback no longer needs the css arguments as the
        associated css never changes once PF_EXITING is set.  Removed.
      * cpu and perf_events controllers no longer need ->exit() callbacks.
        There's no reason to explicitly switch away on exit.  The final
        schedule out is enough.  The callbacks are removed.
      * On traditional hierarchies, nothing changes.  "/proc/PID/cgroup"
        still reports "/" for all zombies.  On the default hierarchy,
        "/proc/PID/cgroup" keeps reporting the cgroup that the task belonged
        to at the time of exit.  If the cgroup gets removed before the task
        is reaped, " (deleted)" is appended.
      v2: Build breakage due to missing dummy cgroup_free() when
          !CONFIG_CGROUP fixed.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
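      The "/proc/PID/cgroup" reporting rules for zombies listed in this
      commit can be condensed into a small Python function. It is only a
      model of the described behavior: proc_cgroup_path() and its parameters
      are hypothetical names.

```python
def proc_cgroup_path(exit_cgroup, on_default_hierarchy, cgroup_removed=False):
    """Path reported for a zombie task in this simplified model."""
    if not on_default_hierarchy:
        return "/"  # traditional hierarchies: behavior is unchanged
    # Default hierarchy: keep reporting the cgroup at the time of exit.
    return exit_cgroup + (" (deleted)" if cgroup_removed else "")


print(proc_cgroup_path("/workload/job", on_default_hierarchy=False))
print(proc_cgroup_path("/workload/job", on_default_hierarchy=True))
print(proc_cgroup_path("/workload/job", on_default_hierarchy=True, cgroup_removed=True))
```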