Using data.table with magrittr pipes: best of both worlds
Should we use magrittr pipes with data.table?
Why ask the question?
If you are fairly new to R, you might find it puzzling / intriguing that R questions on Stack Overflow tend to attract a range of solutions which all have different syntax “styles”, but almost all seem to be valid answers to some extent (as indicated by the number of upvotes to the solution). This is because there are three main syntax styles in the R universe:
- base R - This refers to a syntax style that mostly utilises functions and operators available within the basic packages, notably base but also stats. These are packages that are loaded automatically when you start up R.1 In the context of
irisdataset, the use of
iris[["Species"]]to refer to columns, or
subset(iris, Species == "setosa")to subset/filter rows are characteristic of the base R style. Impressionistically speaking, I’d say that R code written prior to 2014 (based on books and Stack Overflow solutions) are generally more likely to be in this style. (dplyr superseded plyr in 2014)2
- tidyverse / dplyr style - This is a style that is increasingly becoming the ‘standard’ style for data analysis due to its superb readability and consistency with an ecosystem of packages e.g. “official” tidyverse packages such as stringr and purrr, but also packages written with the same principles in mind such as tidytext, tidyquant, and srvyr, to name a few. The tidyverse style is based on a set of principles which is designed to enhance analysis through greater readability and reproducibility. The hallmark of this style is the
%>%pipe operator (from the magrittr package), which chains up analysis operations in a way that enables you to “read” code in the form of “do x, then do y, then do z”. Other functions (also called dplyr ‘verbs’ of data manipulation) that are characteristic of this style include
- data.table style - Unlike the tidyverse style, the data.table style is based off a single package: data.table. The package description for data.table puts it as:
“Fast aggregation of large data (e.g. 100GB in RAM), fast ordered joins, fast add/modify/delete of columns by group using no copies at all, list columns, friendly and fast character-separated-value read/write. Offers a natural and flexible syntax, for faster development.”
The key advantage that I value in data.table is its speed, especially when working with grouping operations that involve a large number of groups (e.g. analysis by PEOPLE groups in a VISIT/TRANSACTION level database), where my experience is that it is much faster than dplyr. Syntax-wise, it is fairly readable, perhaps somewhere in between tidyverse and base, where I’d say tidyverse is most readable, and least for base (Caveat: there is always an element of subjectivity in this - whatever you are most familiar with you tend to to find easier to read!).
Okay, now to the main point of this post.
All these three styles mentioned above have their own pros and cons, but the general convention is that one should stick with a single style in the same piece of analysis, so that other people can more easily what you are doing, enabling easier collaboration and greater reproducibility. As a general rule, consistency of style is good practice - imagine trying to read someone else’s analysis where they filter a row in three different ways! 🙈
But what if I had a legitimate reason for combining these styles in my code?
Use Case: Combining magrittr pipes and data.table
I’ve once worked on a piece of analysis where I used the tidyverse style (i.e. dplyr verbs + magrittr pipes), chiefly for its advantageous of being very readable and intuitive. Everything worked fine when I was only dealing with the summarised numbers from the analysis, but when I had to group or join data from the significantly larger raw data it became agonisingly slow - to the extent that you can literally make cups of tea in between runs.
☕ ☕ ☕
I then decided to switch to data.table for that specific part of the analysis, but felt somewhat uneasy with the fact that I would be making the code more complex by introducing another “style”. Then I realised that I could minimise the complexity and preserving the readability by using magrittr pipes (
%>%) together with data.table operations.
Below is an example where I would take some data in an ordinary ‘base’ data frame, convert into a data.table object, run some analysis on it, and convert this back into a “tidyverse-friendly” tibble object.
library(nycflights13) nycflights13::flights %>% # `flights` dataset from nycflights13 data.table::as.data.table() -> flights_tb # Convert to data.table # Run some analysis on delay times # Create a new column called carrier_flight, then use it as a grouping variable # Use data.table syntax flights_tb %>% .[, carrier_flight := paste0(carrier,"_",flight)] %>% .[,.(dep_delay = mean(dep_delay, na.rm = TRUE), arr_delay = mean(arr_delay, na.rm = TRUE)), by = carrier_flight] %>% dplyr::as_tibble() # Convert this back to a tibble (tidyverse) object
## # A tibble: 5,725 x 3 ## carrier_flight dep_delay arr_delay ## <chr> <dbl> <dbl> ## 1 UA_1545 1.51 -9.75 ## 2 UA_1714 5.11 -2.68 ## 3 AA_1141 2.10 -6.56 ## 4 B6_725 0.975 1.66 ## 5 DL_461 -0.709 -5.83 ## 6 UA_1696 21.6 8 ## 7 B6_507 0.789 -0.133 ## 8 EV_5708 16.9 8.85 ## 9 B6_79 -0.441 -2.32 ## 10 AA_301 -3.01 -10.9 ## # ... with 5,715 more rows
My justification of such a ‘hybrid’ approach is that it marries the readability of dplyr / magrittr pipes
%>% and the blazing speed of data.table. To be fair, doing everything in dplyr isn’t always slower, although there are clearly scenarios where data.table has a performance advantage. (See either Tom Alby’s post on dplyr vs sparklyr vs data.table performance and Stephen Turner’s post comparing various R expressions using microbenchmark)
The other alternative is to stick with data.table entirely, but the way that data.table is chained “traditionally” does not feel as intuitive to me (hitting ‘return’ between each
flights_tb[,carrier_flight:=paste0(carrier,"_",flight) ][,.(dep_delay = mean(dep_delay, na.rm = TRUE), arr_delay = mean(arr_delay, na.rm = TRUE)), by = carrier_flight]
Given these circumstances, combining
%>% and data.table was an approach that worked really well for me.
I later discovered that quite a few others have also adopted this convention of combining
%>% and data.table:
This Stack Overflow response
This blog post by G L Li on the differences between data.table pipes and magrittr pipes
Another blog by Jeffrey Horner that talks about combining the two styles
It’s good to know I’m not the only one!
The wider divergence between the three syntax styles within R also is an interesting subject on its own. The question in this blog is mainly about tidyverse versus data.table, but that doesn’t mean there isn’t a debate around base R: Jozef Hajnala has a website called case4base, which as advertised, makes a case for using base R.
But how should new users learn R? To what extent should we combine styles? When should we use which style? These are all important questions that the R community needs to think and discuss about as the number of R users continue to grow.
See following link regarding packages that are loaded on start up: https://stat.ethz.ch/R-manual/R-patched/library/base/html/Startup.html↩