An introduction to the collapse R package, testing and benchmarking some common data manipulation tasks against data.table and dplyr
I recently came across collapse, which is a C/C++ based package for data transformation and statistical computing in R. It offers a lot of functionality for statistical programming and time series analysis, and integrates well with the two main packages used for data manipulation in R, dplyr and data.table.
My focus in this post is just to look at some basic data manipulation operations such as subsetting, recoding variables, de-duplicating and aggregating data to calculate summary statistics.
I've written this primarily as a self-learning aid, and there are likely better ways to perform these operations, as well as other features of collapse that are worth investigating. Corrections and suggestions are welcome.
The first thing to do is to create a sample dataset so that we can test out the functions.
library(data.table)
set.seed(1)
test_data <- data.table(ID     = 1:10e6,
                        values = sample(100:1000, size = 50, replace = TRUE),
                        Code   = sample(LETTERS[1:4], 10e6, replace = TRUE),
                        City   = rep(c("Tokyo", "Berlin", "Denver", "Stockholm"), 10e6))
setkey(test_data, City)
As collapse is geared towards efficient operation, I thought it best to use a fairly large dataset - this one has 40 million records.
Classes 'data.table' and 'data.frame': 40000000 obs. of 4 variables:
$ ID : int 2 6 10 14 18 22 26 30 34 38 ...
$ values: int 778 398 696 429 824 977 973 951 521 605 ...
$ Code : chr "C" "D" "B" "A" ...
$ City : chr "Berlin" "Berlin" "Berlin" "Berlin" ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "City"
A very common task is to group data by some categorical feature and then summarise it. Here I calculate the mean value for each city using the fmean function.
collapse::fmean(x = test_data$values, g = test_data$City)
Berlin Denver Stockholm Tokyo
633.56 546.44 633.56 546.44
We can compare the performance of this function against the data.table and dplyr ways of doing the same thing. The box plot below shows the results (please see the side note on microbenchmark at the end of this post for details of how to produce this chart).
data.table is on average the best performing, but collapse does provide some improvement over dplyr's group_by.
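For reference, here is a minimal sketch of the kind of comparison behind that chart; the exact expressions and the number of iterations are my assumptions rather than the original benchmarking code.
library(microbenchmark)
library(dplyr)

# group-by mean in each of the three packages (times = 10 is an arbitrary choice)
bm_mean <- microbenchmark(
  collapse   = collapse::fmean(test_data$values, g = test_data$City),
  data.table = test_data[, .(mean_value = mean(values)), by = City],
  dplyr      = test_data %>% group_by(City) %>% summarise(mean_value = mean(values)),
  times = 10
)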
Subsetting or filtering the data based on some criteria is also a really common task. Here I used the fsubset function to filter the data to only the rows where Code is equal to "A".
collapse::fsubset(test_data, Code =="A")
ID values Code City
1: 14 429 A Berlin
2: 26 973 A Berlin
3: 30 951 A Berlin
4: 38 605 A Berlin
5: 46 636 A Berlin
---
10002424: 9999877 481 A Tokyo
10002425: 9999897 474 A Tokyo
10002426: 9999941 442 A Tokyo
10002427: 9999945 783 A Tokyo
10002428: 9999953 228 A Tokyo
Again we can compare against data.table and dplyr - here data.table wins hands down, but collapse offers a significant improvement on dplyr.
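A sketch of the equivalent subsetting expressions that could be benchmarked (again my assumed calls, with microbenchmark and dplyr loaded as above):
bm_subset <- microbenchmark(
  collapse   = collapse::fsubset(test_data, Code == "A"),
  data.table = test_data[Code == "A"],
  dplyr      = test_data %>% filter(Code == "A"),
  times = 10
)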
Sometimes it is necessary to select only the unique values in a dataset. collapse provides the funique function, which allows you to specify which columns are used to identify unique rows in the dataset.
Here I used fselect to select the City and Code columns and then funique to return the unique combinations of City and Code.
library(collapse)
test_data %>% fselect(City, Code) %>% funique()
City Code
1: Berlin C
2: Berlin D
3: Berlin B
4: Berlin A
5: Denver B
6: Denver C
7: Denver D
8: Denver A
9: Stockholm B
10: Stockholm C
11: Stockholm A
12: Stockholm D
13: Tokyo C
14: Tokyo D
15: Tokyo A
16: Tokyo B
collapse seems to be significantly faster here.
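The de-duplication comparison can be sketched along the same lines (the data.table and dplyr calls are my assumed equivalents):
bm_unique <- microbenchmark(
  collapse   = test_data %>% fselect(City, Code) %>% funique(),
  data.table = unique(test_data[, .(City, Code)]),
  dplyr      = test_data %>% distinct(City, Code),
  times = 10
)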
collapse also offers a whole range of substitute functions for converting matrices, data frames etc. Here I look at converting a character vector to a factor.
f1 <- collapse::qF(test_data$City)
# you could assign the result back to test_data$City to recode the column in the data.table itself
str(f1)
Factor w/ 4 levels "Berlin","Denver",..: 1 1 1 1 1 1 1 1 1 1 ...
I compared this against the base method of doing this, as.factor. collapse offers a significant performance improvement here over the base function.
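A sketch of that comparison (the iteration count is arbitrary):
bm_factor <- microbenchmark(
  qF        = collapse::qF(test_data$City),
  as.factor = as.factor(test_data$City),
  times = 10
)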
In practice I commonly use as.factor inside a data.table and use lapply to convert multiple fields, and qF incorporates well in the same way (see the sketch below). collapse provides a whole suite of re-coding functions (e.g. replacing NAs and Infs) which I think would be easy to incorporate with data.table or dplyr and clearly offer an improved level of performance.
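As an illustration, here is a minimal sketch of using qF via lapply inside a data.table, plus one of the re-coding helpers; the column choices and the replace_NA() call are just examples, not code from the original benchmarks.
# convert several character columns to factors in place, using qF inside data.table
cols <- c("City", "Code")
test_data[, (cols) := lapply(.SD, collapse::qF), .SDcols = cols]

# replace any NAs in the values column with 0 using one of collapse's re-coding helpers
test_data[, values := collapse::replace_NA(values, value = 0)]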
I think collapse is a great package with a whole array of functions which could be useful to improve the performance of data manipulation tasks, although I would still rely on data.table/dplyr as the primary go-tos for handling data.
One thing I really like about collapse is that the %>% pipe is supported and it integrates well with both dplyr and data.table. It might be a good option for those not familiar with data.table who want improved performance.
I have only touched the surface here, as collapse offers a lot of additional statistical functionality which could be very useful, especially when analysing financial data.
I don’t think collapse will replace dplyr or data.table, but it’s not really about either/or; it’s about taking the best parts of each package and using them to improve your code. It’s definitely a package worth further investigation, which I plan to do by trying out more of the functions and incorporating them into my workflow.
Throughout this post I used the microbenchmark package to compare the performance of each operation. One of the aspects of microbenchmark I really like is that it comes with a default method to provide plot outputs of the results. This is in the form of a ggplot object which creates a violin plot.
However, I’m not really a fan of violin plots and wanted to change the default to a box plot. autoplot.microbenchmark is an S3 method, which can easily be amended by using getS3method to return the underlying function and then modifying it. Here I changed ggplot2::stat_ydensity() to ggplot2::geom_boxplot().
This post on Stack Overflow was very helpful.
# getS3method("autoplot", "microbenchmark")
function (object, ..., log = TRUE, y_max = 1.05 * max(object$time))
{
    if (!requireNamespace("ggplot2"))
        stop("Missing package 'ggplot2'.")
    y_min <- 0
    object$ntime <- convert_to_unit(object$time, "t")
    plt <- ggplot2::ggplot(object, ggplot2::aes_string(x = "expr",
        y = "ntime"))
    plt <- plt + ggplot2::coord_cartesian(ylim = c(y_min, y_max))
    plt <- plt + ggplot2::stat_ydensity()
    plt <- plt + ggplot2::scale_x_discrete(name = "")
    y_label <- sprintf("Time [%s]", attr(object$ntime, "unit"))
    plt <- if (log) {
        plt + ggplot2::scale_y_log10(name = y_label)
    }
    else {
        plt + ggplot2::scale_y_continuous(name = y_label)
    }
    plt <- plt + ggplot2::coord_flip()
    plt
}
# I changed ggplot2::stat_ydensity() to ggplot2::geom_boxplot() and added a minimal theme;
# convert_to_unit() is called via microbenchmark::: because it is not exported
plot_benchmark <- function (object, ..., log = TRUE, y_max = 1.05 * max(object$time))
{
    if (!requireNamespace("ggplot2"))
        stop("Missing package 'ggplot2'.")
    y_min <- 0
    object$ntime <- microbenchmark:::convert_to_unit(object$time, "t")
    plt <- ggplot2::ggplot(object, ggplot2::aes_string(x = "expr",
        y = "ntime"))
    plt <- plt + ggplot2::coord_cartesian(ylim = c(y_min, y_max))
    plt <- plt + ggplot2::geom_boxplot()
    plt <- plt + ggplot2::scale_x_discrete(name = "")
    y_label <- sprintf("Time [%s]", attr(object$ntime, "unit"))
    plt <- if (log) {
        plt + ggplot2::scale_y_log10(name = y_label)
    }
    else {
        plt + ggplot2::scale_y_continuous(name = y_label)
    }
    plt <- plt + ggplot2::coord_flip()
    plt + cowplot::theme_minimal_hgrid(12)  # theme_minimal_hgrid() comes from the cowplot package
}
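For example, assuming the bm_mean object from the earlier group-by sketch is available, the modified function can be used in place of autoplot():
# draw one of the earlier benchmark results as a box plot
plot_benchmark(bm_mean)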
setting  value
version  R version 3.6.3 (2020-02-29)
os       Windows 10 x64
system   x86_64, mingw32
ui       RStudio
language en_US.utf8