trace/base: Optimizing DataFrame memory footprint

Darryl Green requested to merge github/fork/derkling/optimize-df-memory into master

Created by: derkling

Under the hood pandas represents numeric values as NumPy ndarrays and stores them in a contiguous block of memory. All values in a column are represented using the same type, and thus the same number of bytes.
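For example, the per-column footprint can be inspected directly with DataFrame.memory_usage(); this minimal sketch (not part of this change) shows that a numeric column's footprint is simply its row count times the byte width of its dtype:

```python
import numpy as np
import pandas as pd

# Two columns of 1000 rows each; both dtypes use 8 bytes per value.
df = pd.DataFrame({"ts": np.arange(1000, dtype=np.float64),
                   "cpu": np.zeros(1000, dtype=np.int64)})

print(df.dtypes)                   # ts: float64, cpu: int64
print(df.memory_usage(deep=True))  # ~8000 bytes per column, plus the index
```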

Many types in pandas have multiple subtypes that can use fewer bytes to represent each value. For example, the float type has the float16, float32, and float64 subtypes.

Use the function pd.to_numeric() to downcast numeric types, so that each value uses the minimum number of bytes that is still enough to represent the largest value in a given column.
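As an illustration, a downcasting pass over a DataFrame could look like the following sketch; the helper name and the unsigned/signed split are assumptions for illustration, not code from this change:

```python
import numpy as np
import pandas as pd

def downcast_numerics(df):
    """Downcast each numeric column to the smallest subtype that can
    still represent all of its values."""
    for col in df.select_dtypes(include=[np.integer]).columns:
        # Prefer unsigned subtypes when the column has no negative values.
        target = "unsigned" if (df[col] >= 0).all() else "integer"
        df[col] = pd.to_numeric(df[col], downcast=target)
    for col in df.select_dtypes(include=[np.floating]).columns:
        df[col] = pd.to_numeric(df[col], downcast="float")
    return df
```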

Also use "Categoricals", introduced in pandas 0.15. The category type uses integer values under the hood to represent the values in a column, rather than the raw values. Use category to efficiently compress the representation of string values by replacing 64-bit string pointers with an index that uses fewer bits.
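One possible way to apply this is sketched below; the 50% uniqueness cutoff is a heuristic along the lines of the one discussed in the article credited below, not code from this change:

```python
def categorize_strings(df, max_unique_ratio=0.5):
    """Convert string (object) columns to 'category' when the number of
    distinct values is small relative to the column length."""
    for col in df.select_dtypes(include=["object"]).columns:
        if df[col].nunique() / len(df[col]) < max_unique_ratio:
            df[col] = df[col].astype("category")
    return df
```

Converting a column with few distinct values pays off the most, since each raw string is then stored once and every row holds only a small integer code.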

Credits go to:

Using pandas with Large Data Sets https://www.dataquest.io/blog/pandas-big-data/

where these changes are proposed and discussed in detail.

Applied to a 473 MB example trace, the proposed change gives the following results:

                    | Events |  Memory (MB)  |  Compression
                    |  count | Before  After |      percent
    ----------------+--------+---------------+-------------
     clock_disable  |  18368 |   3.34   0.88 |    73.652695
     clock_enable   |  19149 |   3.43   0.92 |    73.177843
     clock_set_rate |  42099 |   7.63   2.01 |    73.656619
     cpu_idle       | 272726 |  28.87  12.74 |    55.871147
     sched_switch   | 315951 |  82.86  23.26 |    71.928554

Signed-off-by: Patrick Bellasi patrick.bellasi@arm.com