Add a generic DF join API
Created by: derkling
In many analyses we are interested in joining information coming from different DFs (DataFrames). For example, let's say we have a trace like this:
    adbd-5709  2943.184105: sched_contrib_scale_f: cpu=7 cpu_scale_factor=1
    adbd-5709  2943.184105: sched_load_avg_cpu: cpu=7 util_avg=825
    ->transport-5713  2943.184106: sched_load_avg_cpu: cpu=6 util_avg=292
    ->transport-5713  2943.184107: sched_contrib_scale_f: cpu=6 cpu_scale_factor=2
    adbd-5709  2943.184108: sched_load_avg_cpu: cpu=7 util_avg=850
    adbd-5709  2943.184109: sched_contrib_scale_f: cpu=7 cpu_scale_factor=3
    adbd-5709  2943.184110: sched_load_avg_cpu: cpu=6 util_avg=315
Currently we can easily build two DFs, one for sched_load_avg_cpu and another for sched_contrib_scale_f.
However, in some analyses it could be useful to correlate the information from these two events, thus getting a single DF that gives a consistent view of the most up-to-date information from both.
In these cases we have a "master_df" (or primary_df), e.g. sched_load_avg_cpu, into which we want to propagate the information from a "secondary_df", e.g. sched_contrib_scale_f.
This would require us to:
- Join the master_df with the secondary_df.
- Fix any index collision that may occur. For example, in the small trace above, at the exact timestamp 2943.184105 there is one event for both master_df and secondary_df (on CPU 7).
A join of these two DFs should guarantee that:
- the order of the events is consistent with the trace ordering, i.e. sched_contrib_scale_f should come before sched_load_avg_cpu on CPU7 but after it on CPU6
- the time difference between the two events is barely noticeable; we should probably fix the overlapping timestamps by adding 1 ns to each duplicated index, which removes the index collision without risking a new collision with a following event.
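The 1 ns adjustment could be sketched as follows. This is only an illustration, not existing trappy code: the helper name `shift_duplicate_timestamps` is hypothetical, and it assumes the joined rows are already in trace order with the index holding timestamps in seconds.

```python
import pandas as pd


def shift_duplicate_timestamps(df, step=1e-9):
    """Make the index unique by nudging repeated timestamps forward.

    Hypothetical helper. Assumes `df` rows are already in trace order and
    the index holds timestamps in seconds. `step` (1 ns by default) is
    added once to the second row sharing a timestamp, twice to the third,
    and so on, so the original row order is preserved.
    """
    # Position of each row among the rows sharing its timestamp: 0, 1, 2, ...
    rank = df.groupby(level=0).cumcount().to_numpy()
    fixed = df.copy()
    fixed.index = pd.Index(df.index.to_numpy() + rank * step, name=df.index.name)
    return fixed


# Two events share the timestamp 2943.184105; rows are listed in trace order
joined = pd.DataFrame(
    {
        "event": ["sched_contrib_scale_f", "sched_load_avg_cpu", "sched_load_avg_cpu"],
        "cpu": [7, 7, 6],
    },
    index=pd.Index([2943.184105, 2943.184105, 2943.184106], name="Time"),
)
fixed = shift_duplicate_timestamps(joined)
```

Since the trace timestamps have microsecond resolution, a 1 ns shift cannot reach the next event unless close to a thousand events share the same timestamp.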
Then we need to:
- forward propagate each secondary_df column by considering the value of a "pivot" column which is shared between the two DFs; for example, the value of cpu can be used to forward propagate the other columns (e.g. cpu_scale_factor)
- remove all the secondary_df rows whose values have already been propagated into the following primary_df rows
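The two steps above might look like this in pandas. It is a sketch under assumptions: `propagate_and_drop` is a hypothetical name, the input is an already joined DF with a unique index, and secondary_df rows are recognised by having NaN in all primary columns.

```python
import numpy as np
import pandas as pd


def propagate_and_drop(joined, pivot, primary_cols, secondary_cols):
    """Hypothetical helper: per-pivot forward fill, then drop secondary rows."""
    out = joined.copy()
    # Forward fill the secondary columns separately for each pivot value,
    # so a CPU only ever inherits values seen earlier on that same CPU
    out[secondary_cols] = out.groupby(pivot)[secondary_cols].ffill()
    # Rows that came from secondary_df carry no primary data: drop them
    return out.dropna(subset=primary_cols, how="all")


# Joined view of the example, with the 0.000000 collision already shifted by 1 ns
joined = pd.DataFrame(
    {
        "cpu": [7, 7, 6, 6, 7, 7, 6],
        "util_avg": [np.nan, 825, 292, np.nan, 850, np.nan, 315],
        "cpu_scale_factor": [1, np.nan, np.nan, 2, np.nan, 3, np.nan],
    },
    index=pd.Index([0.0, 1e-9, 1e-6, 2e-6, 3e-6, 4e-6, 5e-6], name="Time"),
)
result = propagate_and_drop(joined, "cpu", ["util_avg"], ["cpu_scale_factor"])
```

Grouping by the pivot before filling is what keeps a value recorded for CPU7 from leaking into a CPU6 row.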
All these operations together should be supported by a new generic convenience API which, once called with something like:
    trappy.ftrace.utils.merge_df(primary_df='sched_load_avg_cpu', secondary_df='sched_contrib_scale_f', pivot='cpu')
where primary_df is:

              cpu  util_avg
    Time
    0.000000    7       825
    0.000001    6       292
    0.000003    7       850
    0.000005    6       315

and secondary_df is:

              cpu  cpu_scale_factor
    Time
    0.000000    7                 1
    0.000002    6                 2
    0.000004    7                 3

should return a single DF which is:

              cpu  util_avg  cpu_scale_factor
    Time
    0.000000  7.0     825.0               1.0
    0.000001  6.0     292.0               NaN  <- no valid secondary_df entry before this row
    0.000003  7.0     850.0               1.0  <- propagation of the previous value
    0.000005  6.0     315.0               2.0
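For comparison, recent pandas versions (0.19 and later) provide `pandas.merge_asof`, which can reproduce the values of this example. The wrapper below is only a sketch: `merge_df_sketch` is a hypothetical name, and unlike the proposed API it takes the DataFrames themselves rather than event names.

```python
import pandas as pd


def merge_df_sketch(primary_df, secondary_df, pivot):
    """Hypothetical stand-in for the proposed merge_df(), taking DataFrames.

    For every primary_df row, pick the most recent secondary_df row that
    has the same pivot value and a timestamp not later than the primary one.
    """
    return pd.merge_asof(
        primary_df.sort_index(),    # merge_asof needs sorted keys
        secondary_df.sort_index(),
        left_index=True,
        right_index=True,
        by=pivot,                   # only match rows with the same pivot value
    )


primary_df = pd.DataFrame(
    {"cpu": [7, 6, 7, 6], "util_avg": [825, 292, 850, 315]},
    index=pd.Index([0.0, 1e-6, 3e-6, 5e-6], name="Time"),
)
secondary_df = pd.DataFrame(
    {"cpu": [7, 6, 7], "cpu_scale_factor": [1, 2, 3]},
    index=pd.Index([0.0, 2e-6, 4e-6], name="Time"),
)
merged = merge_df_sketch(primary_df, secondary_df, pivot="cpu")
```

Note one limitation of this approach: with the default `allow_exact_matches=True`, a secondary_df event with the same timestamp as a primary_df event is always treated as preceding it, so the trace ordering of colliding events is not taken into account.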
Here is a notebook to play with the same example: https://gist.github.com/derkling/786e911ae01ca170377e1893d6696384
It shows that the current join API needs to be extended to get the exact result described above.