Closed
Issue created Jun 12, 2017 by Darryl Green

Add a generic DF join API

Created by: derkling

In many analyses we are interested in joining information coming from different DFs. For example, let's say we have a trace like this:

            adbd-5709  [007]  2943.184105: sched_contrib_scale_f: cpu=7 cpu_scale_factor=1
            adbd-5709  [007]  2943.184105: sched_load_avg_cpu:   cpu=7 util_avg=825
     ->transport-5713  [006]  2943.184106: sched_load_avg_cpu:   cpu=6 util_avg=292
     ->transport-5713  [006]  2943.184107: sched_contrib_scale_f: cpu=6 cpu_scale_factor=2
            adbd-5709  [007]  2943.184108: sched_load_avg_cpu:   cpu=7 util_avg=850
            adbd-5709  [007]  2943.184109: sched_contrib_scale_f: cpu=7 cpu_scale_factor=3
            adbd-5709  [007]  2943.184110: sched_load_avg_cpu:   cpu=6 util_avg=315

Currently we can easily build two DFs, one for sched_load_avg_cpu and another for sched_contrib_scale_f.

However, in some analyses it could be useful to correlate the information from these two events, thus getting a single DF with a consistent view of the most up-to-date information from both.

In these cases we have a "master_df", e.g. sched_load_avg_cpu, into which we want to propagate the information from a "secondary_df", e.g. sched_contrib_scale_f.

This requires us to:

  1. Join the master_df with the secondary_df

  2. Fix any index collisions that may happen; for example, in the small trace above we can see that at time 2943.184105 we have an event from both master_df and secondary_df on the same CPU.

A join of these two DFs should guarantee that:

  • the order of the events is consistent with the trace ordering, i.e. sched_contrib_scale_f should come before sched_load_avg_cpu on CPU7 but after it on CPU6
  • the time difference between the two events should be barely noticeable; we can thus fix the overlapping timestamps by adding 1ns to each duplicated index entry, which removes the index collision without risking the creation of a new one with a following event (a sketch follows)
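
A minimal sketch of this collision fix in pandas (deduplicate_index is an illustrative name, not an existing trappy API; it assumes a DF that has already been concatenated and time-sorted):

def deduplicate_index(df, nudge=1e-9):
    # cumcount() numbers repeated timestamps 0, 1, 2, ..., so the first
    # occurrence stays put and each later duplicate is pushed a further
    # 1ns forward
    dupes = df.groupby(level=0).cumcount().values
    df = df.copy()
    df.index = df.index + dupes * nudge
    return df

Since the counter increments per duplicate, even a triple collision is resolved in a single pass, and the 1ns offsets stay well below the 1us resolution of the trace timestamps.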

Then we need to:

  1. forward-propagate each secondary_df column according to the value of a "pivot" column shared between the two DFs; for example, the cpu value can be used to forward-propagate the other sched_contrib_scale_f columns (i.e. freq_scale_factor and cpu_scale_factor) into the sched_load_avg_cpu rows

  2. remove all the secondary_df rows, whose values have already been propagated into the following primary_df rows (see the sketch after this list)
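
A hedged sketch of these two steps, assuming merged is the de-duplicated join from above and that a boolean _from_secondary marker column (an illustrative name) flags the rows that originated in secondary_df:

# Forward-propagate each secondary_df column within its pivot group,
# i.e. within each distinct cpu value
for col in ['cpu_scale_factor']:
    merged[col] = merged.groupby('cpu')[col].ffill()

# Drop the secondary rows: their values now live in the primary rows
# that follow them
merged = merged[~merged['_from_secondary']].drop(columns='_from_secondary')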

All these operations together should be supported by a new generic convenience API, invoked with something like:

trappy.ftrace.utils.merge_df(primary_df = 'sched_load_avg_cpu',
                             secondary_df='sched_contrib_scale_f',
                             pivot='cpu')

where primary_df is:

          cpu  util_avg
Time                   
0.000000    7       825
0.000001    6       292
0.000003    7       850
0.000005    6       315

and secondary_df is:

          cpu  cpu_scale_factor
Time                           
0.000000    7                 1
0.000002    6                 2
0.000004    7                 3

should return a single DF:

          cpu  util_avg  cpu_scale_factor
Time                                     
0.000000  7.0     825.0               1.0
0.000001  6.0     292.0               NaN   <- no valid secondary_df entry seen before this row
0.000003  7.0     850.0               1.0  <- propagation of the previous value
0.000005  6.0     315.0               2.0
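
To make the proposal concrete, here is one possible end-to-end sketch of such a helper, here operating directly on the two DataFrames rather than on event names. None of this exists in trappy yet, and the ordering of colliding events (secondary before primary, matching the CPU7 case in the trace above) is an assumption of this sketch:

import pandas as pd

def merge_df(primary_df, secondary_df, pivot):
    # Mark the origin of each row so the secondary ones can be dropped
    # once their values have been propagated
    secondary = secondary_df.copy()
    secondary['_from_secondary'] = True

    # 1) Join. A stable sort keeps the relative order of events sharing
    #    a timestamp; listing secondary first makes colliding secondary
    #    events precede the primary ones.
    merged = pd.concat([secondary, primary_df]).sort_index(kind='mergesort')

    # 2) Fix index collisions by nudging each duplicated timestamp
    #    forward by a further 1ns
    dupes = merged.groupby(level=0).cumcount().values
    merged.index = merged.index + dupes * 1e-9

    # 3) Forward-propagate the secondary columns per pivot value
    for col in secondary_df.columns.difference([pivot]):
        merged[col] = merged.groupby(pivot)[col].ffill()

    # 4) Keep only the primary rows (those never marked as secondary)
    merged = merged[merged['_from_secondary'].isna()]
    return merged.drop(columns='_from_secondary')

Running this on the two example DFs reproduces the expected values above (modulo column ordering), except that the first primary row moves from 0.000000 to 0.000000001 because of the collision nudge.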

Here is a notebook to play with the same example: https://gist.github.com/derkling/786e911ae01ca170377e1893d6696384. There we can see that the current join API needs to be extended to obtain the exact result described above.
