
Polars support

Douglas Raillard requested to merge douglas-raillard-arm/lisa:_home_pr90452 into main

Polars dataframe support:

  • You'll need to re-run lisa-install (as indicated when sourcing init_env) to install the new dependency
  • This is a large MR. Please let me know if anything goes wrong after the update, including any significant performance regression.
  • You can now use polars end-to-end using either trace.df_event(..., df_fmt='polars-lazyframe') or trace.get_view(df_fmt='polars-lazyframe').df_event(...). If you do so, note that the polars dataframe differs from the equivalent pandas dataframe in 2 ways:
    • The Time column uses a pl.Duration("ns") dtype. This contrasts with the float64 index we used with pandas. The main implications are that:
      • it will require converting back to float for bokeh
      • you need to be careful when e.g. filtering on that column (use lisa.datautils.Timestamp().as_nanoseconds, or build a pl.duration() yourself). Note that Timestamp() behaves like a float for backward compatibility, but actually encodes a nanosecond-precision integer. This avoids the accuracy issues of large-magnitude floats.
    • The dataframe you get will differ from what you would get with pandas if you use trace.df_event(window=...) or trace slices, due to a change in the default of the df_event(signals=...) parameter when df_fmt='polars-lazyframe' is used. You now need to provide your own signals. This was done to put all events on the same footing and not require any event-specific support in LISA.
  • The polars LazyFrame support means you can now use out-of-core processing (larger-than-memory datasets), within the limits of what polars can achieve. The polars streaming API is a fast-moving target though, so check regularly for newly supported features on that front.
  • TraceBase.get_view() has been heavily extended. In addition to windowing, it can now handle:
    • namespaces
    • events preloading
    • signals definitions
    • change to default dataframe format
    • time normalization (so the trace starts at 0 instead of whatever timestamp there is)
    • a custom function run on the LazyFrame returned by df_event(). This can be used to provide your own "custom views" of the trace (e.g. a trace that only shows the events pertaining to a specific task). Note that this may or may not break some analyses, so be careful when doing that.
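
To illustrate the earlier note on the Time column dtype, here is a minimal standard-library sketch of why an integer nanosecond count is more robust than float64 seconds. It does not use LISA or polars. The timestamp value and the ns_to_seconds() helper are made up for illustration; an epoch-like magnitude was picked so that the effect is visible.

```python
import math

# Hypothetical timestamp with nanosecond detail, stored as an exact integer.
t_ns = 1_700_000_000_123_456_789

# The same instant expressed as float64 seconds (the pandas-style index).
t_s = t_ns / 1e9

# Gap between adjacent float64 values around t_s, expressed in nanoseconds.
resolution_ns = math.ulp(t_s) * 1e9

# Converting back to integer nanoseconds does not recover the original value.
round_trip_ns = round(t_s * 1e9)

print(f"float64 resolution near t_s: {resolution_ns:.0f} ns")
print(f"round-trip error: {abs(round_trip_ns - t_ns)} ns")


def ns_to_seconds(ns: int) -> float:
    """Convert integer nanoseconds to float seconds, e.g. before plotting."""
    return ns / 1e9
```

With small, trace-relative values the conversion to float loses very little, which is why converting back to float just before plotting is usually harmless, while comparisons and filters are safer on the integer representation.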

Other less significant upgrades are:

  • Since we use an integer-backed pl.Duration("ns") for the polars LazyFrame/DataFrame and migrated the Trace internals to that, timestamp deduplication for pandas is now done by adding nanoseconds, instead of adding the minimal delta (nextafter()). This shouldn't really be noticeable anyway.
  • TraceEventCheckerBase decorators (such as requires_events()) will now preload all the events mentioned in the checker. This typically saves a lot of parsing time, as these events will be requested anyway and we can parse them all with a single invocation of the parser (and therefore a single trace traversal).
  • Some classes have been made private, such as TraceView and TraceCache. They were never really intended to be instantiated by end-users anyway, so you should not be impacted.
  • Significant for maintenance but hidden from the user: TraceView has been split into several classes, and it's now much easier to add a new feature to get_view() or deprecate existing ones if needed.
  • Minor adjustments to the Rust parser, mostly to support the polars use case better
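
The change in timestamp deduplication mentioned above can be sketched as follows. This is only an illustration of the two strategies on plain Python lists, not LISA's actual implementation. The function names are hypothetical, and the input is assumed to be sorted already.

```python
import math


def dedup_float(timestamps):
    """Previous strategy: move a duplicate to the next representable float."""
    out = []
    for t in timestamps:
        if out and t <= out[-1]:
            t = math.nextafter(out[-1], math.inf)
        out.append(t)
    return out


def dedup_ns(timestamps_ns):
    """New strategy: move a duplicate forward by one nanosecond."""
    out = []
    for t in timestamps_ns:
        if out and t <= out[-1]:
            t = out[-1] + 1
        out.append(t)
    return out


print(dedup_ns([1_000, 1_000, 1_000, 2_000]))  # [1000, 1001, 1002, 2000]
```

Both variants produce strictly increasing timestamps. The integer version shifts duplicates by a fixed, known amount, whereas the nextafter() step depends on the magnitude of the float.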
Edited by Douglas Raillard
