    tipc: add neighbor monitoring framework · 35c55c98
    Jon Paul Maloy authored
    TIPC based clusters are by default set up with full-mesh link
    connectivity between all nodes. Those links are expected to provide
    a short failure detection time, by default set to 1500 ms. Because
    of this, the background load for neighbor monitoring in an N-node
    cluster grows linearly with N on each node, while the overall
    monitoring traffic through the network infrastructure grows at
    a ~(N * (N - 1)) rate. Experience has shown that such clusters don't
    scale well beyond ~100 nodes unless we significantly increase the
    failure detection tolerance.
    This commit introduces a framework and an algorithm that drastically
    reduce this background load, while largely maintaining the original
    failure detection times across the whole cluster. Using this algorithm,
    background load will now grow at a rate of ~(2 * sqrt(N)) per node, and
    at ~(2 * N * sqrt(N)) in traffic overhead. As an example, each node will
    now have to actively monitor 38 neighbors in a 400-node cluster, instead
    of 399 as before.
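    The figures above can be reproduced with a little arithmetic. The
    sketch below is illustrative only and is not taken from the patch: the
    function names are hypothetical, and it assumes a domain size of
    ceil(sqrt(N)), with each node monitoring the other members of its own
    domain plus one head for each remaining domain.

```python
import math


def full_mesh_load(n):
    """Return (peers monitored per node, monitored links cluster-wide)."""
    return n - 1, n * (n - 1)


def ring_load(n):
    """Return (peers monitored per node, monitored links cluster-wide).

    Assumption (not from the patch): domain size d = ceil(sqrt(n)); a node
    monitors d - 1 local members and one head per other domain.
    """
    d = math.isqrt(n - 1) + 1      # ceil(sqrt(n)) for n >= 1
    local_members = d - 1          # downstream members of the local domain
    remote_heads = -(-n // d) - 1  # ceil(n / d) domains, minus our own
    per_node = local_members + remote_heads
    return per_node, n * per_node


print(full_mesh_load(400))
print(ring_load(400))
```

    Under these assumptions a 400-node cluster gives 19 local members and
    19 remote heads per node, which matches the 38 quoted above.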
    This "Overlapping Ring Supervision Algorithm" is completely distributed
    and employs no centralized or coordinated state. It goes as follows:
    - Each node builds a linearly ascending, circular list of all its N
      known neighbors, based on their TIPC node identity. The ordering
      algorithm must be the same on all nodes.
    - The node then selects the next M = sqrt(N) - 1 nodes downstream from
      itself in the list, and chooses to actively monitor those. This is
      called its "local monitoring domain".
    - It creates a domain record describing the monitoring domain, and
      piggy-backs this in the data area of all neighbor monitoring messages
      (LINK_PROTOCOL/STATE) leaving that node. This means that all nodes in
      the cluster eventually (default within 400 ms) will learn about
      its monitoring domain.
    - Whenever a node discovers a change in its local domain, e.g., a node
      has been added or has gone down, it creates and sends out a new
      version of its domain record to inform all neighbors about the change.
    - A node receiving a domain record from anybody outside its local domain
      matches this against its own list (which may not look the same), and
      chooses to not actively monitor those members of the received domain
      record that are also present in its own list. Instead, it relies on
      indications from the direct monitoring nodes if an indirectly
      monitored node has gone up or down. If a node is indicated lost, the
      receiving node temporarily activates its own direct monitoring towards
      that node in order to confirm, or not, that it is actually gone.
    - Since each node is actively monitoring roughly sqrt(N) downstream
      neighbors, each node is also actively monitored by the same number
      of upstream neighbors. This means that all nodes that do not monitor
      a given node directly will normally receive roughly sqrt(N)
      indications that it is gone.
    - A major drawback with ring monitoring is how it handles failures that
      cause massive network partitioning. If both a lost node and all its
      direct monitoring neighbors are inside the lost partition, the nodes in
      the remaining partition will never receive indications about the loss.
      To overcome this, each node also chooses to actively monitor some
      nodes outside its local domain. Those nodes are called remote domain
      "heads", and are selected in such a way that no node in the cluster
      will be more than two direct monitoring hops away. Because of this,
      each node, apart from monitoring the members of its local domain, will
      also typically monitor sqrt(N) remote head nodes.
    - As an optimization, local list status, domain status and domain
      records are marked with a generation number. This saves senders from
      unnecessarily conveying unaltered domain records, and receivers from
      performing unneeded re-adaptations of their node monitoring list, such
      as re-assigning domain heads.
    - As a measure of caution we have added the possibility to disable the
      new algorithm through configuration. We do this by keeping a threshold
      value for the cluster size; a cluster that grows beyond this value
      will switch from full-mesh to ring monitoring, and back again when
      it shrinks below the value. This means that if the threshold (32 by
      default) is set to a value larger than any anticipated cluster size,
      the new algorithm is effectively disabled. A patch set for altering the
      threshold value and for listing the table contents will follow shortly.
    - This change is fully backwards compatible.
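    The selection steps above can be sketched as follows. This is a
    simplified illustration in Python, not the kernel implementation: the
    names dom_size, ring_from and monitored_peers are hypothetical, every
    node is assumed to see the same membership, and remote heads are picked
    at fixed strides rather than derived from received domain records.

```python
import math


def dom_size(n):
    """Smallest d such that d * d >= n (hypothetical helper)."""
    return math.isqrt(n - 1) + 1


def ring_from(self_id, node_ids):
    """Ascending circular list of all nodes, rotated so self_id comes first."""
    ring = sorted(set(node_ids) | {self_id})
    i = ring.index(self_id)
    return ring[i:] + ring[:i]


def monitored_peers(self_id, node_ids, threshold=32):
    """Return (local_domain, remote_heads) that self_id monitors actively.

    At or below the threshold the node stays in full-mesh mode and
    monitors every peer directly.
    """
    ring = ring_from(self_id, node_ids)
    n = len(ring)
    if n <= threshold:
        return ring[1:], []
    d = dom_size(n)
    local_domain = ring[1:d]   # next d - 1 nodes downstream of self
    remote_heads = ring[d::d]  # first node of each following domain
    return local_domain, remote_heads


nodes = list(range(1, 401))
local_domain, remote_heads = monitored_peers(1, nodes)
print(len(local_domain), len(remote_heads))
```

    With 400 nodes, node 1 directly monitors nodes 2..20 and the heads
    21, 41, ..., 381. Each head in turn monitors the 19 nodes downstream of
    itself, so in this sketch every node is at most two monitoring hops
    away.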
    Acked-by: Ying Xue <ying.xue@windriver.com>
    Signed-off-by: Jon Maloy <jon.maloy@ericsson.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>