LondonR: Hadley Wickham & tidyverse's greatest hits
Meeting Hadley!
Last Monday, I had the pleasure of attending a talk given by Hadley Wickham at LondonR, which was held at one of their usual venues at the UCL Darwin Lecture Theatre.
For most readers of this blog, Hadley needs no introduction: it is a running joke amongst R users that if tidyverse hadn’t been rebranded, it would’ve been known as the hadleyverse - and this pretty much says it all. If it weren’t for his contributions to all these packages (tidyr, dplyr, and gplot2 - to name a few), I probably wouldn’t even be using R today.
I had really looked forward to this event, as it’s always an interesting experience to meet in real life these people you seem to know so well or have heard so much about virtually. Another occasion I could recall was Hilary Parker’s keynote address at EARL, which I know her through her brilliant data science podcast (co-hosted with Roger Peng), Not So Standard Deviations. Do check it out - I highly recommend it.
In this post, I’m going to briefly summarize()
(sorry 😆) what Hadley covered in his talk, and some of my thoughts on his points.
Tidyverse: the greatest hits
This was perhaps the busiest LondonR sessions I’ve ever been to, but understandably so! The lecture hall usually has a fair number of free seats left, but on this occasion late-comers struggled to find free seats. Speaking to a couple of other attendees around me, nobody seemed to know in advance what Hadley’s talk was going to be about - this was kept somewhat mysterious when the event opened for registration. Matt Aldridge (CEO of Mango Solutions) introduced the event, and apparently this is a brand new talk by Hadley: Tidyverse: the greatest hits.
Interestingly, this turned out to be one of those attention-catching titles - what Hadley really planned to talk about was the greatest mistakes of the tidyverse. As he claims, whilst the intuitive expectation of good developers and R coders may be that they make fewer mistakes, it’s more the case for him that he makes many mistakes as fast as possible - which, I imagine, is partly responsible for his prolific body of work in R. Unfortunately, some of these “mistakes” have become “permanent” within tidyverse, which in his talk he explained some of these oddities in the tidyverse that most users probably have questioned about at some point.
Hadley mentioned a number of these “permanent mistakes”, and probably two of those which tidyverse users resonated the most with are:
- the conflicting function names with
stats::filter()
andstats::lag()
- the use of the
+
operator rather than%>%
in ggplot2
Most regular tidyverse users would probably have wondered on at least one occasion about the first case: why are there always function conflicts when we run library(tidyverse)
, and how do we get rid of it? The function conflict happens because dplyr::filter()
and the base stats::filter()
have the same name, so it will always conflict if you load tidyverse. It is the classic programming problem of naming variables and functions: you may think a variable name is intuitive or sensible initially, but not thinking it through thoroughly enough can sometimes come back and bite you in the future.
Hadley’s argument for choosing filter()
as the dplyr verb for filtering rows despite the existence of stats::filter()
is because of the relatively niche applications of the latter function.
This is what the documentation of stats::filter()
says about the function:
Applies linear filtering to a univariate time series or to each series separately of a multivariate time series.
I’ve never used stats::filter()
myself and personally find filter()
to be quite an intuitive verb, so I’m not too much bothered by this one.
Another similar function-naming “mistake” that Hadley talked about is the lack of intuitiveness of gather()
and spread()
, where the criticism is that it isn’t immediately clear to an unfamiliar tidyr user which of those functions converts data from long to wide format, and vice versa. Unlike dplyr::filter()
where there are no plans for a new filtering or subsetting function, in the development version of tidyr there will be two new functions for pivoting data frames, pivot_wide()
and pivot_long()
, which remove the ambiguity you get in spread()
and gather()
. Note that there isn’t any intent to deprecate spread()
and gather()
, but I think you simply get two new alternatives which make the code easier to read and use.
The other interesting mistake that Hadley talked about is the +
operator in ggplot2. To put it simply, this refers to the problem that whilst the rest of the tidyverse uses the pipe operator %>%
to chain analysis steps together, ggplot2 alone uses a different operator. Here’s a simple illustration of the problem:
iris %>% # You can pipe
select(Species, Sepal.Length, Sepal.Width) %>% # Still piping
ggplot(aes(x = Sepal.Length, y = Sepal.Width, colour = Species)) + # You cannot use pipe here
geom_point()
If you use the %>%
operator instead of +
once you start to use the ggplot functions, you get the following error message:
Error:
mapping
must be created byaes()
Did you use %>% instead of +?
This is very much a mistake of legacy, because the magrittr pipe %>%
was not in use when ggplot2 was written. Again, this feels like a quirk or inconvenience that tidyverse users will need to live with, but from a macro perspective ggplot2 is still a fantastic package with powerful functionalities that played a significant role in popularising the use of R.
Hannah Frick lists a couple more points that Hadley mentioned during his talk:
The greatest tidyverse mistakes:
— Hannah Frick (@hfcfrick) August 19, 2019
💥 no pipe in ggplot2
💥 overwriting filter and lag
💥 using the . in arg names
💥 tidyeval pushed too early
💥 tidyverse as a name made some people think it's meant to be used in isolation - nah, use it with whatever in #rstats is useful for you!
Conclusions
Ultimately, what Hadley wanted to get across for talking about the “mistakes of tidyverse” is perhaps this slide:
Quote on slide:
“The cost of never making a mistake is very often never making a change. It’s just too incredibly hard to be sure.” - GeePaw Hill
Here’s a link to the GeePaw Hill blog Hadley referred to.
Are these mistakes “necessary evil”? It’s something to debate about, really. My own view is that these slight quirks and inconveniences are a small price to pay, if they are unavoidable in developing these highly effective R packages in such a short space of time. I do have to say, I am also both impressed and humbled to see Hadley being so open (if not apologetic) about these mistakes in the tidyverse.
In the Q&As, there were also a couple of questions that you’d almost expect: what is the future of R, given Python? Hadley’s view is that whilst Python is a great programming language, there are certain peculiarities about R - such as non-standard evaluation - which make it highly effective for doing data science. This enables R to achieve a certain level of fluency for doing data science that Python, without non-standard evaluation, is unlikely to do so. He didn’t go into a huge amount of detail on this controversial debate (raising the Python vs R question amongst this audience is probably no less controversial than Brexit amongst the Brits). However, I did find it intriguing that for someone who did so much for the R language, he sees Python and R as complementary rather than competitive.
Hadley also made it very clear that the intent isn’t for tidyverse to be used on its own (without other packages), but as a starting point. It doesn’t matter, for instance, if you use data.table and tidyverse together.
This was a great LondonR session, even though the content was much more about programming than data science. A big thank you to Hadley Wickham and the Mango Solutions team for organising this!