Musings on R

Common Statistical Tests in R - Part I

2022-10-13T00:00:00+00:00

Introduction

This post will focus on common statistical tests in R to understand and validate the relationship between two variables.

There must be tons of similar tutorials around, you may be thinking. So why?

The primary (and selfish) goal of the post is to create a guide that is practical enough for myself to refer to from time to time. This post is edited from my own notes from learning statistics and R, and has been applied to a data example/scenario that I am familiar with. This means that the examples should be easily generalisable and mostly consistent with my usual coding approach (mostly ‘tidy’ and using pipes). Along the way, this will hopefully benefit others who are learning statistics and R too.


To illustrate the R code, I will be using a sample dataset pq_data from the package vivainsights, which is a cross-sectional time-series dataset measuring the collaboration behaviour of simulated employees in an organization. Each row represents an employee on a certain week, with columns measuring behaviours such as total weekly time spent in email, meetings, chats, and so on. The vivainsights package itself provides visualisation and analysis functions tailored for these datasets which are available from Microsoft Viva Insights.

A note about the structure of this post: in the real world, one should as a best practice visually check the data distribution and run tests for assumptions like normality prior to performing any tests. For the sake of narrative and covering all the scenarios, this practice isn’t really observed in this post. Hence, please be forgiving as you see us run ‘head first’ into a test without examining the data - and avoid this in real life!

Set-up: packages and data

The package vivainsights is available on CRAN, so you can install this with install.packages("vivainsights").

You can load the dataset in R by calling pq_data after loading the vivainsights package. Here is a preview of the first ten columns of the dataset using dplyr::glimpse():

library(vivainsights)
library(tidyverse) # for glimpse(), %>% and the dplyr verbs used below

glimpse(pq_data[, 1:10])

## Rows: 5,593
## Columns: 10
## $ PersonId                           <chr> "2b625906-1f36-3273-8d0d-13e714c5f6~
## $ MetricDate                         <date> 2021-12-26, 2021-12-26, 2021-12-26~
## $ After_hours_call_hours             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ After_hours_chat_hours             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ After_hours_collaboration_hours    <dbl> 7.6624994, 2.4908612, 0.1625000, 1.~
## $ After_hours_email_hours            <dbl> 0.2600000, 0.5883611, 0.1625000, 0.~
## $ After_hours_meeting_hours          <dbl> 7.50, 2.00, 0.00, 1.25, 19.00, 0.25~
## $ After_hours_scheduled_call_hours   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ After_hours_unscheduled_call_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Call_hours                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~

This tutorial also uses functions from tidyverse, so ensure that you run library(tidyverse) to reproduce the example outputs.

Framing the problem

One of the most fundamental tasks in statistics and data science is to understand the relation between two variables. Sometimes the motivation is to understand whether the relationship is causal, i.e. whether one causes another. This is not always the case, as for instance, one may simply wish to test for multicollinearity when selecting predictors for a model.¹

Our dataset pq_data represents the simulated collaboration data of a company, and each row represents an employee’s week. There are two metrics of interest:

Multitasking_hours measures the total number of hours the person spent sending emails or instant messages during a meeting or a Teams call.
After_hours_collaboration_hours measures the number of hours a person has spent in collaboration (meetings, emails, IMs, and calls) outside of working hours.²

Imagine then we have two questions to address:

Do managers multi-task more than senior individual contributors (IC)?
The HR leadership suspects that meeting multitasking behaviour could be correlated with after-hours working, as the former represents wasted time and productivity during meetings. What can we do to understand the relationship between the two?

In this post, we will tackle the first question, and focus primarily on comparison tests and their non-parametric equivalents in R. In subsequent posts I would also like to cover other relevant tools/concepts such as correlation tests, regression tests, effect size, and statistical power.

It is worth noting that the first question postulates a relation between a categorical variable (manager/senior IC) and a continuous variable (multitasking hours), whereas the second question postulates a relation between two continuous variables (multitasking hours, after-hours collaboration). The types of the variables in question help determine which tests are appropriate.

The categorical variable that provides us information on whether an employee is a manager or a senior IC in pq_data is stored in LevelDesignation. We can use vivainsights::hrvar_count() to explore this variable:

hrvar_count(pq_data, hrvar = "LevelDesignation")

1. Comparison tests: the t-test

Two common comparison tests would be the t-test and Analysis of Variance (ANOVA). The oft-cited practical difference between the two is that you would use the t-test for comparing means between two groups, and ANOVA for more than two groups. There is a bit more nuance than that, but we will start with the t-test.

A t-test can be paired or unpaired, where the former is used for comparing two sets of measurements taken on the same subjects (e.g. before and after an intervention), and the latter for independent samples from two populations or groups. Since managers and senior ICs are two different populations, an unpaired (two-sample) t-test is therefore appropriate for the scenario in question one.

Before we jump into the test, we’ll need to prepare the data. Since we are interested in the difference between managers and senior ICs, we will first need to create a factor variable from the data that has only two levels. In the below code, we will first filter out any values of LevelDesignation that are not "Manager" and "Senior IC", and create a new factor column as ManagerIndicator:

pq_data_grouped <-
  pq_data %>%
  filter(LevelDesignation %in% c("Manager", "Senior IC")) %>%
  mutate(
    ManagerIndicator =
      factor(LevelDesignation,
      levels = c("Manager", "Senior IC"))
  )

Recall also that our dataset pq_data is a cross-sectional time-series dataset, which means that for every individual identified by PersonId, there will be multiple rows representing a snapshot of a different week. In other words, a unique identifier would be something like a PersonWeekId. To simplify the dataset so that we are looking at person averages, we can group the dataset by PersonId and calculate the mean of Multitasking_hours for each person. After this manipulation, Multitasking_hours would represent the mean multitasking hours per person, as opposed to per person per week. Let us do this by building on the pipe-chain:

pq_data_grouped <-
  pq_data %>%
  filter(LevelDesignation %in% c("Manager", "Senior IC")) %>%
  mutate(
    ManagerIndicator =
      factor(LevelDesignation,
      levels = c("Manager", "Senior IC"))
  ) %>%
  group_by(PersonId, ManagerIndicator) %>%
  summarise(Multitasking_hours = mean(Multitasking_hours), .groups = "drop")
  
glimpse(pq_data_grouped)

## Rows: 56
## Columns: 3
## $ PersonId           <chr> "00f6d464-ba1f-31ee-b51e-ab6e8ec4fb79", "023ddb61-1~
## $ ManagerIndicator   <fct> Senior IC, Manager, Senior IC, Senior IC, Manager, ~
## $ Multitasking_hours <dbl> 0.2813373, 0.5980080, 0.3319752, 0.2938879, 0.70762~

Now our data is in the right format.

Let us presume that the data satisfies all the assumptions of the t-test, and see what happens when we run it with the base t.test() function:

t.test(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped,
  paired = FALSE
)

## 
##  Welch Two Sample t-test
## 
## data:  Multitasking_hours by ManagerIndicator
## t = 10.097, df = 28.758, p-value = 5.806e-11
## alternative hypothesis: true difference in means between group Manager and group Senior IC is not equal to 0
## 95 percent confidence interval:
##  0.3444870 0.5195712
## sample estimates:
##   mean in group Manager mean in group Senior IC 
##               0.8103354               0.3783063

In the function, the outcome and predictor variables are supplied as a formula using the tilde format common in R (outcome ~ predictor), and we have specified paired = FALSE to use an unpaired t-test. As for the output,

t represents the t-statistic.
df represents the degrees of freedom.
p-value is - well - the p-value. The value here is significant, as it is smaller than the significance level of 0.05.
the test allows us to reject the null hypothesis that the means of multitasking hours between managers and ICs are the same.

Note that the t-test used here is Welch’s t-test, which is an adaptation of the classic Student’s t-test. Welch’s t-test does not assume that the two groups have equal variances (i.e. it handles heteroscedasticity), whereas the classic Student’s t-test assumes the variances of the two groups to be equal (fancy term = homoscedastic).
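To make the difference between the two versions concrete, here is a minimal sketch in base R. The data are simulated and purely hypothetical (they are not pq_data); the group sizes and spreads are assumptions chosen only so that the two groups have visibly unequal variances.

```r
# Simulated data (hypothetical; not pq_data): two groups with unequal
# sizes and unequal spread.
set.seed(123)
sim <- data.frame(
  hours = c(rnorm(23, mean = 0.8, sd = 0.20), rnorm(33, mean = 0.4, sd = 0.10)),
  group = factor(rep(c("Manager", "Senior IC"), times = c(23, 33)))
)

# Default behaviour: Welch's t-test (var.equal = FALSE)
welch <- t.test(hours ~ group, data = sim)

# Classic Student's t-test: pooled variance, so equal variances are assumed
student <- t.test(hours ~ group, data = sim, var.equal = TRUE)

# Student's df is always n1 + n2 - 2; Welch's df is estimated from the data
c(welch = unname(welch$parameter), student = unname(student$parameter))
```

The only change between the two calls is the var.equal argument, so switching between the Welch and Student versions is a one-argument decision.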

1.1 Testing for normality

But hang on!

There are several assumptions behind the classic t-test we haven’t examined properly, namely:

(1) independence - sample is independent
(2) normality - data for each group is normally distributed
(3) homoscedasticity - data across samples have equal variance

We can at least be sure of (1), as we know that senior ICs and Managers are separate populations. However, (2) and (3) are assumptions that we have to validate and address specifically. To test whether our data is normally distributed, we can use the Shapiro-Wilk test of normality, with the function shapiro.test():

pq_data_grouped %>%
  group_by(ManagerIndicator) %>%
  summarise(
    p = shapiro.test(Multitasking_hours)$p.value,
    statistic = shapiro.test(Multitasking_hours)$statistic
  )

## # A tibble: 2 x 3
##   ManagerIndicator      p statistic
##   <fct>             <dbl>     <dbl>
## 1 Manager          0.146      0.936
## 2 Senior IC        0.0722     0.941

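Both p-values here are greater than 0.05 (0.146 for managers and 0.0722 for senior ICs), so the test does not give us grounds to reject the null hypothesis that the data are normally distributed. Note that failing to reject is not the same as proving normality, and the senior IC p-value is fairly close to the threshold. To confirm, you can also perform a visual check for normality using a histogram or a Q-Q plot.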

# Multitasking hours - IC
mth_ic <-
  pq_data_grouped %>%
  filter(ManagerIndicator == "Senior IC") %>%
  pull(Multitasking_hours) 

qqnorm(mth_ic, pch = 1, frame = FALSE)
qqline(mth_ic, col = "steelblue", lwd = 2)

# Multitasking hours - Manager
mth_man <-
  pq_data_grouped %>%
  filter(ManagerIndicator == "Manager") %>%
  pull(Multitasking_hours) 

qqnorm(mth_man, pch = 1, frame = FALSE)
qqline(mth_man, col = "steelblue", lwd = 2)

In the Q-Q plots, the points broadly adhere to the reference line. The graphical approach is therefore consistent with the Shapiro-Wilk p-values above 0.05 in the output shown earlier, with neither check giving strong evidence against normality. Below is a good thing to bear in mind:³

Statistical tests have the advantage of making an objective judgment of normality but have the disadvantage of sometimes not being sensitive enough at low sample sizes or overly sensitive to large sample sizes. Graphical interpretation has the advantage of allowing good judgment to assess normality in situations when numerical tests might be over or undersensitive.

In other words, with groups as small as ours (23 managers and 33 senior ICs), the Shapiro-Wilk test may simply not be sensitive enough to detect a departure from normality.⁴ As our data isn’t conclusively normal, this in turn makes the unpaired t-test less conclusive. When we cannot safely assume normality, we can consider other alternatives such as the non-parametric two-sample Wilcoxon Rank-Sum test. This is covered further down below.
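If you would like to see the sample-size point from the quote for yourself, one rough way is to draw a small and a large sample from the same mildly non-normal distribution and compare the Shapiro-Wilk p-values. The sketch below uses simulated data; the choice of a t-distribution with 5 degrees of freedom and the two sample sizes are my own assumptions for illustration. The larger sample will often give a much smaller p-value, but this depends on the random draw, so no particular output is promised here.

```r
set.seed(2022)

# Same mildly heavy-tailed distribution (t with 5 df), two sample sizes.
small_sample <- rt(30, df = 5)
large_sample <- rt(5000, df = 5)  # shapiro.test() accepts 3 to 5000 values

p_values <- c(
  n_30 = shapiro.test(small_sample)$p.value,
  n_5000 = shapiro.test(large_sample)$p.value
)

p_values
```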

1.2 Testing for equality of variance (homoscedasticity)

Aside from normality, another assumption of the t-test that we hadn’t properly tested prior to running t.test() is equality of variance across the two groups (homoscedasticity). Thankfully, this was not something we had to worry about as we used Welch’s t-test. Recall that the classic Student’s t-test assumes equality between the two variances, whereas Welch’s t-test does not.

If required, however, here is an example on how you can test for homoscedasticity in R, using var.test():

# F test to compare two variances
var.test(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped
  )

## 
##  F test to compare two variances
## 
## data:  Multitasking_hours by ManagerIndicator
## F = 4.5726, num df = 22, denom df = 32, p-value = 0.0001085
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##   2.146082 10.318237
## sample estimates:
## ratio of variances 
##           4.572575

The var.test() function run above is an F-test (i.e. uses the F-distribution) used to compare whether the variances of two samples are the same. Under the null hypothesis of the test, there is homoscedasticity, and as the F-statistic is a ratio of variances, it would be close to 1. The arguments are provided in a similar format to t.test().

It appears that homoscedasticity does not hold: since the p-value is less than 0.05, we should reject the null hypothesis that variances between the manager and IC dataset are equal. The Student’s t-test would not have been appropriate here, and we were correct to have used the Welch’s t-test.
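As a quick sanity check on what var.test() is doing, the sketch below reproduces the F-statistic by hand as a ratio of two sample variances. The two vectors are simulated and hypothetical (not the pq_data groups); only the group sizes of 23 and 33 are borrowed from the example above.

```r
set.seed(42)

# Hypothetical groups with different spreads
manager <- rnorm(23, mean = 0.8, sd = 0.20)
senior_ic <- rnorm(33, mean = 0.4, sd = 0.10)

res <- var.test(manager, senior_ic)

# The F-statistic is simply the ratio of the two sample variances
manual_f <- var(manager) / var(senior_ic)
all.equal(unname(res$statistic), manual_f)

# Degrees of freedom are n - 1 for each group
res$parameter
```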

Homoscedasticity can also be examined visually, using a boxplot or a dotplot (using graphics::dotchart() - suitable for small datasets). The code to do so would be as follows. For this example, visual examination is a bit more challenging as the senior IC and Manager groups have starkly different levels of multi-tasking hours.

dotchart(
  x = pq_data_grouped$Multitasking_hours,
  groups = pq_data_grouped$ManagerIndicator
)

boxplot(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped
)

2. Non-parametric tests

2.1 Wilcoxon Rank-Sum Test

Previously, we could not safely rely on the unpaired two-sample t-test because we are not fully confident that the data satisfies the normality condition. As an alternative, we can use the Wilcoxon Rank-Sum test (aka Mann Whitney U Test). The Wilcoxon test is described as a non-parametric test, which in statistics typically means that there is no specification on a distribution, or the parameters of a distribution. In this case, the Wilcoxon test does not assume a normal distribution.

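Another difference between the Wilcoxon Rank-Sum test and the unpaired t-test is that the former works on the ranks of the observations and tests whether values in one group tend to be larger than values in the other (a location shift), whereas the latter parametric test compares means between two independent groups. The Wilcoxon test can be read as a comparison of medians only under the additional assumption that the two distributions have the same shape.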

This is run using wilcox.test()

wilcox.test(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped,
  paired = FALSE
)

## 
##  Wilcoxon rank sum exact test
## 
## data:  Multitasking_hours by ManagerIndicator
## W = 752, p-value = 2.842e-14
## alternative hypothesis: true location shift is not equal to 0

The p-value of the test is less than the significance level (alpha = 0.05), which allows us to conclude that Managers’ multitasking hours are significantly different from the ICs’ (a difference in medians, if we assume the two distributions share the same shape).
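To make the ‘rank-sum’ part of the name concrete, here is a small sketch that rebuilds the W statistic by hand. The two vectors are simulated and hypothetical; the idea is just to show that W is the sum of the first group's ranks, shifted by its minimum possible value.

```r
set.seed(7)

# Hypothetical groups (continuous values, so no ties)
manager <- rnorm(12, mean = 0.8, sd = 0.2)
senior_ic <- rnorm(15, mean = 0.4, sd = 0.1)

res <- wilcox.test(manager, senior_ic, paired = FALSE)

# Rank all observations together, then sum the ranks of the first group
all_ranks <- rank(c(manager, senior_ic))
rank_sum <- sum(all_ranks[seq_along(manager)])

# R reports W as that rank sum minus its minimum possible value
manual_w <- rank_sum - length(manager) * (length(manager) + 1) / 2

c(reported = unname(res$statistic), manual = manual_w)
```

Because only the ranks enter the calculation, an extreme value in either group has far less influence here than it would on a mean.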

Note that the Wilcoxon Rank-Sum test is different from the similarly named Wilcoxon Signed-Rank test, which is the equivalent alternative for the paired t-test. To perform the Wilcoxon Signed-Rank test instead, you can simply set the argument paired = TRUE. As with the choice between the paired and the unpaired t-test, you should ensure that the observations are genuinely paired (i.e. two measurements on the same subjects) if you use the Wilcoxon Signed-Rank test.

2.2 Kruskal-Wallis test

So far, we have only been looking at tests which compare exactly two populations. If we are looking for a test that works with comparisons across three or more populations, we can consider the Kruskal-Wallis test.

Let us create a new data frame that is grouped at the PersonId level, but filtering out fewer values in LevelDesignation:

pq_data_grouped_2 <-
  pq_data %>%
  filter(LevelDesignation %in% c(
    "Support",
    "Senior IC",
    "Junior IC",
    "Manager",
    "Director"
  )) %>%
  mutate(ManagerIndicator = factor(LevelDesignation)) %>%
  group_by(PersonId, ManagerIndicator) %>%
  summarise(Multitasking_hours = mean(Multitasking_hours), .groups = "drop")
  
glimpse(pq_data_grouped_2)

## Rows: 198
## Columns: 3
## $ PersonId           <chr> "0049ef24-ec83-356d-89f7-46b67364e677", "00f6d464-b~
## $ ManagerIndicator   <fct> Support, Senior IC, Manager, Support, Support, Supp~
## $ Multitasking_hours <dbl> 0.3812649, 0.2813373, 0.5980080, 0.2918829, 0.42288~

We can then run the Kruskal-Wallis test:

kruskal.test(
  Multitasking_hours ~ ManagerIndicator,
  data = pq_data_grouped_2
)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Multitasking_hours by ManagerIndicator
## Kruskal-Wallis chi-squared = 91.061, df = 4, p-value < 2.2e-16

Based on the Kruskal-Wallis test, we reject the null hypothesis and conclude that at least one level of LevelDesignation differs from the others in terms of weekly hours spent multitasking. The most obvious downside to this method is that it does not tell us which groups are different from which, so this may need to be followed up with multiple pairwise-comparison tests (also known as post-hoc tests).

3. Comparison tests: ANOVA

3.1 ANOVA

What if we want to run the t-test across more than two groups?

Analysis of Variance (ANOVA) is an alternative method that generalises the t-test beyond two groups, so it is used to compare three or more groups.

There are several versions of ANOVA. The simple version is the one-way ANOVA, but there is also two-way ANOVA which is used to estimate how the mean of a quantitative variable changes according to the levels of two categorical variables (e.g. rain/no-rain and weekend/weekday with respect to ice cream sales). In this example we will focus on one-way ANOVA.

There are three assumptions in ANOVA, and this may look familiar:

The data are independent.
The responses for each factor level have a normal population distribution.
These distributions have the same variance.

These assumptions are the same as those required for the classic t-test above, and it is recommended that you check for variance and normality prior to ANOVA.

ANOVA calculates the ratio of the between-group variance and the within-group variance (quantified using sum of squares), and then compares this with a threshold from the F-distribution (typically based on a significance level). The key function is aov():

res_aov <-
  aov(
    Multitasking_hours ~ ManagerIndicator,
    data = pq_data_grouped_2
  )

summary(res_aov)

##                   Df Sum Sq Mean Sq F value Pr(>F)    
## ManagerIndicator   4  40.55   10.14   504.6 <2e-16 ***
## Residuals        193   3.88    0.02                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The interpretation is as follows:⁵

Df: degrees of freedom for…
- the independent variable (ManagerIndicator), i.e. the number of levels in the variable minus 1 (5 - 1 = 4)
- the residuals, i.e. the total number of observations minus the number of levels in the independent variable (198 - 5 = 193)
Sum Sq: sum of squares, i.e. the variation between the group means and the overall mean (first row) and the remaining variation within groups (Residuals row)
Mean Sq: mean of the sum of squares, calculated by dividing the sum of squares by the degrees of freedom for each parameter
F value: test statistic from the F test. This is the mean square of each independent variable divided by the mean square of the residuals. The larger the F value, the more likely it is that the variation associated with the independent variable is real and not due to chance.
Pr(>F): p-value of the F-statistic. This shows how likely it is that the F-value calculated from the test would have occurred if the null hypothesis of no difference among group means were true.
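The columns above can be rebuilt by hand, which I find helps the definitions stick. The sketch below uses a simulated, hypothetical three-level dataset (not pq_data_grouped_2) and compares a manually computed F value with the one in the aov() table.

```r
set.seed(99)

# Hypothetical data: three levels with different means
sim <- data.frame(
  hours = c(rnorm(20, 0.3, 0.1), rnorm(25, 0.4, 0.1), rnorm(15, 0.8, 0.1)),
  level = factor(rep(c("Junior IC", "Senior IC", "Manager"), times = c(20, 25, 15)))
)

# First element of summary() is the ANOVA table itself
tab <- summary(aov(hours ~ level, data = sim))[[1]]

n <- nrow(sim)            # total number of observations
k <- nlevels(sim$level)   # number of groups

group_means <- ave(sim$hours, sim$level)              # each row's group mean
ss_between <- sum((group_means - mean(sim$hours))^2)  # Sum Sq, first row
ss_within <- sum((sim$hours - group_means)^2)         # Sum Sq, Residuals row

# F value = between-group mean square / within-group mean square
manual_f <- (ss_between / (k - 1)) / (ss_within / (n - k))

c(table_F = tab[["F value"]][1], manual_F = manual_f)
```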

Given that the p-value is smaller than 0.05, we reject the null hypothesis that all means are equal. Therefore, we can conclude that at least one level of LevelDesignation differs from the others in terms of weekly hours spent multitasking.

Antoine Soetewey’s blog recommends the use of the report package, which can help you make sense of the results more easily:

library(report)

report(res_aov)

## The ANOVA (formula: Multitasking_hours ~ ManagerIndicator) suggests that:
## 
##   - The main effect of ManagerIndicator is statistically significant and large
## (F(4, 193) = 504.61, p < .001; Eta2 = 0.91, 95% CI [0.90, 1.00])
## 
## Effect sizes were labelled following Field's (2013) recommendations.

The same drawback that applies to the Kruskal-Wallis test also applies to ANOVA, in that it doesn’t actually tell you which exact group is different from which; it only tells you whether at least one group mean differs from the others. This ANOVA test is hence sometimes also referred to as an ‘omnibus’ test.

3.2 Next steps after ANOVA

A pairwise t-test (note: pairwise, not paired!) is likely required to provide more information, and it is recommended that you review the p-value adjustment methods when doing so.⁶ Type I errors are more likely when running t-tests pairwise across many groups, and therefore correction is necessary. Here is an example of how you might run a pairwise t-test:

pairwise.t.test(
  x = pq_data_grouped_2$Multitasking_hours,
  g = pq_data_grouped_2$ManagerIndicator,
  paired = FALSE,
  p.adjust.method = "bonferroni"
)

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  pq_data_grouped_2$Multitasking_hours and pq_data_grouped_2$ManagerIndicator 
## 
##           Director Junior IC Manager Senior IC
## Junior IC <2e-16   -         -       -        
## Manager   <2e-16   <2e-16    -       -        
## Senior IC <2e-16   1         <2e-16  -        
## Support   <2e-16   1         <2e-16  1        
## 
## P value adjustment method: bonferroni
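For anyone wondering what the Bonferroni adjustment actually does to the p-values, the sketch below compares unadjusted and adjusted results on a simulated, hypothetical three-level dataset (not pq_data_grouped_2). The adjustment is just each raw p-value multiplied by the number of comparisons, capped at 1.

```r
set.seed(11)

# Hypothetical data: two similar levels and one clearly different level
sim <- data.frame(
  hours = c(rnorm(20, 0.3, 0.1), rnorm(25, 0.35, 0.1), rnorm(15, 0.8, 0.1)),
  level = factor(rep(c("Junior IC", "Senior IC", "Manager"), times = c(20, 25, 15)))
)

raw <- pairwise.t.test(sim$hours, sim$level, p.adjust.method = "none")$p.value
adj <- pairwise.t.test(sim$hours, sim$level, p.adjust.method = "bonferroni")$p.value

# Bonferroni: multiply each raw p-value by the number of comparisons, cap at 1
n_comparisons <- sum(!is.na(raw))
manual <- pmin(raw * n_comparisons, 1)

all.equal(adj, manual)
```

This also explains why several cells in the output above show exactly 1: the capped value is what you get once the raw p-value times the number of comparisons exceeds 1.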

It may not be surprising that a pairwise method also exists as a follow-up for the Kruskal-Wallis test - which is the pairwise Wilcoxon test! This can be run using pairwise.wilcox.test(). The API for pairwise.wilcox.test() is very similar to pairwise.t.test(), where you can change the p-value adjustment method using the argument p.adjust.method:

pairwise.wilcox.test(
  x = pq_data_grouped_2$Multitasking_hours,
  g = pq_data_grouped_2$ManagerIndicator,
  paired = FALSE,
  p.adjust.method = "bonferroni"
)

## 
##  Pairwise comparisons using Wilcoxon rank sum exact test 
## 
## data:  pq_data_grouped_2$Multitasking_hours and pq_data_grouped_2$ManagerIndicator 
## 
##           Director Junior IC Manager Senior IC
## Junior IC 5.3e-09  -         -       -        
## Manager   3.3e-09  1.3e-09   -       -        
## Senior IC 5.9e-11  1         2.8e-13 -        
## Support   1.3e-08  1         8.6e-13 1        
## 
## P value adjustment method: bonferroni

4. Summary

So far, the following tests we performed have yielded similar results:

(1) For comparing Senior ICs and Managers:
- unpaired two-sample t-test (assumes normality)
- Wilcoxon Rank-Sum test (non-parametric)
(2) For comparing across more than two values:
- ANOVA (assumes normality)
- Kruskal-Wallis test (non-parametric)
(3) For following up on (2) with pairwise comparisons:
- pairwise t-test with correction (assumes normality)
- pairwise Wilcoxon test (non-parametric)

To the first business question, we can conclude that Senior ICs have significantly lower multitasking hours than Managers. Although normality could not be conclusively established and the two groups are not equal in variance, the mitigating solutions we used (Welch’s t-test and the Wilcoxon Rank-Sum test) have also found the differences to be significant. Moreover, the post-hoc tests suggest that significant differences also exist between several of the other levels.

4.1 Should I use a t-test or ANOVA for comparing exactly two groups?

One question worth discussing is the scenario at (1). Suppose that normality is observed in both groups, does it make a difference whether I use the t-test or ANOVA if I am comparing exactly two groups?

The textbook recommendation is that whenever one is comparing exactly two groups one should use the t-test, and ANOVA whenever there are more than two groups being compared. What can get confusing here is that there is the classic Student’s t-test and the Welch’s t-test.

When ANOVA is used to compare two groups, the results will be equivalent to a classic (Student’s) t-test with equal variances.⁷ However, if we are talking about Welch’s t-test instead, it may be preferable over ANOVA because Welch’s t-test takes into account heteroscedasticity. When there is heteroscedasticity, ANOVA (as well as Kruskal-Wallis) can become unreliable, with Type I error rates that are:

conservative for large sample sizes
inflated for small sample sizes⁸

To further complicate matters, there is also a method called Welch’s ANOVA which is like classic ANOVA but handles unequal variances better. This can be done in R using oneway.test(), but there is some debate around best practice that is beyond the scope of this post. ⁹ It would be prudent to run the Welch versions of the tests whenever we suspect the data to be heteroscedastic.
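As a rough illustration of oneway.test(), the sketch below runs Welch’s ANOVA on a simulated, hypothetical two-group dataset (not pq_data) and compares it with Welch’s t-test. With exactly two groups the two should agree, with the F-statistic being the square of the t-statistic.

```r
set.seed(5)

# Hypothetical data: two groups with unequal sizes and unequal variances
sim <- data.frame(
  hours = c(rnorm(23, 0.8, 0.20), rnorm(33, 0.4, 0.10)),
  group = factor(rep(c("Manager", "Senior IC"), times = c(23, 33)))
)

# Welch's ANOVA: var.equal = FALSE is the default, spelled out for clarity
welch_anova <- oneway.test(hours ~ group, data = sim, var.equal = FALSE)
welch_t <- t.test(hours ~ group, data = sim, var.equal = FALSE)

# With two groups, Welch's F equals the squared Welch t-statistic,
# and the p-values coincide
c(F = unname(welch_anova$statistic), t_squared = unname(welch_t$statistic)^2)
c(anova_p = welch_anova$p.value, t_test_p = welch_t$p.value)
```

Setting var.equal = TRUE in oneway.test() would instead give the classic one-way ANOVA F-test, which makes it easy to run both versions side by side.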

The recurring themes here are: (1) to check for heteroscedasticity and normality, and (2) to run multiple tests to acquire a more comprehensive view.

4.2 t-tests, ANOVA, and linear regression - are they completely different?

The common assumptions shared by the three methods may have given it away, but the t-test, ANOVA, and linear regression are actually related in the sense that one is a special case of another.

The t-test is considered a special case of ANOVA, since the classic Student’s t-test is the same as ANOVA in comparing two groups when variances are equal. When the t-test statistic is squared, you get the corresponding F-statistic in the ANOVA.¹⁰

On the other hand, an ANOVA model is the same as a regression with a dummy variable. In fact, the aov() function in R is a wrapper around the linear regression function lm(). Steve Midway’s Analysis in R has a chapter which compares the outputs when running ANOVA using lm() versus aov().

All of these procedures are subsumed under the General Linear Model and share the same assumptions.
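The relationships described in this section can be checked numerically. The sketch below uses a simulated, hypothetical two-group dataset (not pq_data) and compares the Student’s t-test, aov(), and lm() on the same formula.

```r
set.seed(8)

# Hypothetical data: two groups with equal variances
sim <- data.frame(
  hours = c(rnorm(23, 0.8, 0.15), rnorm(33, 0.4, 0.15)),
  group = factor(rep(c("Manager", "Senior IC"), times = c(23, 33)))
)

student_t <- t.test(hours ~ group, data = sim, var.equal = TRUE)
anova_f <- summary(aov(hours ~ group, data = sim))[[1]][["F value"]][1]
lm_fit <- summary(lm(hours ~ group, data = sim))

# t-test vs ANOVA: the squared t-statistic equals the F-statistic
c(t_squared = unname(student_t$statistic)^2, anova_F = anova_f)

# ANOVA vs regression: lm() reports the same F-statistic, and the slope on
# the dummy variable is the difference between the two group means
c(lm_F = unname(lm_fit$fstatistic["value"]),
  slope = coef(lm_fit)[2, "Estimate"])
```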

End Notes

This has been a very long post - hope you have found this useful! Due to the vastness of the subject, it will not be possible to detail every consideration and method. However, this should hopefully make flow charts like the below easier to follow:

Flowchart for inferential statistics from Grosofsky (2009)

Please comment in the Disqus box down below if you have any feedback or suggestions. Do also check out the References list below for further reading; as I wrote this I have attempted to link to the brilliant resources referenced as diligently as possible.

References

¹ Multicollinearity is a scenario in modelling where your predictor variables are correlated, which could lead to poor inference.
² See https://learn.microsoft.com/en-us/viva/insights/use/metric-definitions for definitions.
³ See Mishra P, Pandey CM, Singh U, Gupta A, Sahu C, Keshri A. Descriptive statistics and normality tests for statistical data. Ann Card Anaesth. 2019 Jan-Mar;22(1):67-72. doi: 10.4103/aca.ACA_157_18. PMID: 30648682; PMCID: PMC6350423.
⁴ The other well-known alternative test for normality is the Kolmogorov-Smirnov test, run in R using ks.test(). The KS test looks at the quantile where your empirical cumulative distribution function differs maximally from the normal’s theoretical cumulative distribution function. This is often somewhere in the middle of the distribution. On the other hand, the Shapiro-Wilk test focusses on the tails of the distribution, which is consistent with what we are seeing in the Q-Q plots.
⁵ References original article at https://www.scribbr.com/statistics/anova-in-r/.
⁶ An alternative is the Tukey Honest Significant Differences (TukeyHSD()), which won’t be detailed here. The TukeyHSD() function operates on top of the object returned by aov().
⁷ See this discussion and this.
⁸ https://www.statisticshowto.com/welchs-anova/
⁹ See https://statisticsbyjim.com/anova/welchs-anova-compared-to-classic-one-way-anova/; https://blog.minitab.com/en/adventures-in-statistics-2/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete; http://ritsokiguess.site/docs/2017/05/19/welch-analysis-of-variance/. See also Liu, H. (2015). Comparing Welch ANOVA, a Kruskal-Wallis test, and traditional ANOVA in case of heterogeneity of variance. Virginia Commonwealth University.
¹⁰ It is worth a quick footnote on the differences between the t-statistic and the F-statistic. The F-statistic is an output that is found in both the F-test for variance (see var.test()) and ANOVA (see aov()). The F-statistic is a ratio of two variances, and variance is squared standard deviation. Note that the F-test for variance and ANOVA are not the same, as the former compares variances of two populations whereas the latter compares within- and between-group variances, even though both tests use the F-distribution. When there are only two groups for the one-way ANOVA F-test, the F-statistic is equal to the square of the Student’s t-statistic.

Top 10 tips to make your R package even more awesome

2020-12-22T00:00:00+00:00

What this post is about

This post shares top ten tips on how to make your R package even more awesome than it already is. As an R developer, you’ve already put a lot of work into creating and testing your package - so why waste this opportunity to really showcase your work and make it go even further? The tips mentioned in this post can be divided into three main categories:

Communicating your package: so others can access your package and try it more easily
Wrestling time back from developer chores: so you can spend more time on the important things
DevOps best practices: so other fellow R users will feel more confident about using your package, and make it easier for other developers to collaborate or contribute.

This post assumes that you’ve already written an R package, and therefore won’t focus on the coding component of R package development.

Background

Before we begin, shameless plug alert: I’ve written a few tiny R packages. Here are a few of them:

rwa: you can run Relative Weights Analysis (a.k.a. Key Drivers Analysis) to measure variable importance. This is available on CRAN.
hkdatasets: contains datasets that relate to Hong Kong, and is used for our own projects at Hong Kong Districts Info. This is also available on CRAN.
parallaxr: allows you to generate pretty parallax scroll documents with R and Markdown.
surveytoolbox: this package contains all the ‘convenience’ functions back in the days when I was analysing mostly survey data.
hkdistrictballs: created for fun, that allows you to generate “country ball” graphics but for the 18 districts of Hong Kong. Makes use of the magick package.

Admittedly, I did not write all these R packages for entirely altruistic reasons. Writing an R package is an exercise that is valuable in itself, as it allows you to put all your most commonly used custom functions into a neat, self-contained package which you can just load at the start of your analysis sessions, instead of copying and pasting snippets of code from GitHub Gists or randomly placed R scripts.

I used to keep a GitHub Gist which contained 1000+ lines of my most used functions, but trust me, you won’t want to do that. Not only does such a maniacal workflow make the likelihood of your future self being able to reproduce your work completely dependent on your organisational or documentation skills, it also represents a potential loss to your colleagues or the R community, as all the work that you have put into writing your custom functions will help nobody else but yourself, as nobody else can access or understand your functions.

However, one big reason why I write all these R packages is that I enjoy the creative process. I believe a significant, but sometimes neglected, part of writing R packages is communicating to your package users why they should use your package, and how they can use it. Easy to follow examples, reproducible vignettes, documentation that isn’t 100% technical-lingo - all these things help with making an R package easier to use, yet are unrelated to the quality or the implementation of the R code itself. A lot of this is about communication, which is mostly what this post is about (for the code quality aspect, I would recommend resources like Advanced R or R Packages instead).

So here are my top ten recommendations on how to make your R package even more awesome than it currently is. Let’s go!

1. Create a package website with pkgdown

Whilst this tip is quite well-known, its place in the top ten is unquestionable. The pkgdown package makes it incredibly easy to create a package website straight from the files that ‘naturally’ exist in your package, such as README.md and DESCRIPTION. This package website will document all the functions in your package, even running all the examples in your R scripts (under @examples), and make it incredibly easy for your users to navigate your package and try out its functionalities.

The alternative is to make your users go through the official PDF R package manual, which, although easy enough to generate with devtools::build_manual(), is not the easiest to navigate, does not natively support plot examples, and is definitely more likely to put off new R users from using your package.

For an example of the website in action, here is an R package that I’ve recently written for work, which leverages pkgdown to showcase the large number of functions in the package, and to include an “Analyst Guide” to make it easier to explore the package’s features.

The set-up I would recommend is to configure a GitHub Actions workflow that generates the pkgdown website to a separate gh-pages branch every time you push a commit to the main or master branch on GitHub, and set your GitHub Pages to point to gh-pages for hosting.

What this effectively means is that you will have a package website that practically “maintains itself”, as the website will update itself as you update your package (like DESCRIPTION or the function documentation) and push the changes onto GitHub. What’s more, this set-up is free as it’s hosted on GitHub!

To set all this up, you just need to run:

usethis::use_github_action("pkgdown")

This line of code will configure a GitHub Actions workflow to build and deploy your pkgdown site whenever you push changes to GitHub. The workflow file should be created and saved at .github/workflows/pkgdown.yaml. The only manual step you’ll need to do is to go to Settings in your GitHub repo, go to Options, and scroll down until you see GitHub Pages. For Source, the site should be set to build from the root folder of the gh-pages branch.

Once this is set-up and saved, it should just take a few minutes and you should be able to see your website at https://<YOUR-GITHUB-USERNAME>.github.io/<YOUR-PACKAGE-NAME>. You can also of course use a custom domain if you wish.

If you’d like to customise your website, you may add a _pkgdown.yml file in which you can specify things like what to show in your navigation bar, Google Analytics tracking code, site theme, social network icons, etc. There are plenty of package sites that are set up this way, so if you’re looking for inspiration you can just take a peek at the _pkgdown.yml file for any pkgdown sites that use this set-up (you can start with the actual pkgdown pkgdown site). The five R packages mentioned at the beginning of this post also use this set-up.

2. Automated R CMD checks with GitHub Actions

Chances are, if you’ve already written a package, you’ll at least have run an R CMD check, or run devtools::check(), to test for errors in your R package.¹ The R CMD check automatically checks your code for common problems, e.g.:

whether the package can be successfully installed on different operating systems
whether there are syntax errors in the script
whether there are undocumented arguments in your functions, etc.

Now, you can either run this manually on your local machine, OR, you can configure GitHub Actions to run this check automatically whenever you push a commit or merge a change to your main/master branch. The bonus with the latter, of course, is that you get a nice fancy badge that you can place in your README.md, like this:

The only thing you have to make sure of is that your package passes these checks before you add the badge for the first time, otherwise you’ll get an alarming failing badge on your repo!

The easiest way to add GitHub Actions, again, is to use the usethis package:

usethis::use_github_actions()

Similar to tip #1, this adds a yaml file under .github/workflows called R-CMD-check.yaml. To add a badge, you can then run:

usethis::use_github_actions_badge()

You can check the usethis documentation on the specific details of this function.

Adding automated checks embodies the principles of CI/CD (continuous integration, continuous delivery) coding practice, which prefers regular and frequent code check-ins to version control repositories. Automated checks are a form of continuous testing, which is a condition for CI/CD. The argument goes that this leads to better collaboration due to greater transparency, and higher software quality due to continuous testing. Errors can be identified sooner, plus a ‘passing’ badge helps assure potential users of your package that you have done your homework to make sure that your package is passing all the basic checks.

3. CodeFactor

Speaking of badges, here’s another that you can add to your GitHub!

CodeFactor performs an automated review of your R code for code quality, and returns a grade (just like in school!). As you’ll see, it’s possible to get an A+, but you can also get a few of the following grades:

Instead of checking whether your functions fail or whether your package can be successfully installed, CodeFactor checks for things like:

Whether you use library() within a function - which is not recommended
Whether you have arguments which have been defined but never used in the function
Whether you adopt sub-ideal practices like 1:100 (instead of seq_along()) or sapply() (due to return type uncertainty)
Using options() directly inside a function instead of withr::with_options()
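To make two of these checks more concrete, here is a minimal base R sketch of why the flagged patterns can bite. I am assuming here that the sequence check is mainly about the 1:length(x) pattern; the exact rules that CodeFactor applies may differ, and the object names are purely illustrative.

```r
# An empty input is where the flagged patterns tend to go wrong
x <- integer(0)

# 1:length(x) becomes 1:0, i.e. c(1, 0), so the loop body still runs twice
bad_iterations <- 0
for (i in 1:length(x)) bad_iterations <- bad_iterations + 1

# seq_along(x) yields an empty sequence, so the loop body never runs
good_iterations <- 0
for (i in seq_along(x)) good_iterations <- good_iterations + 1

# sapply() decides its return type from the input: an empty input gives a list
sapply_class <- class(sapply(x, function(v) v * 2))

# vapply() always returns the type declared in FUN.VALUE
vapply_class <- class(vapply(x, function(v) v * 2, numeric(1)))

c(bad = bad_iterations, good = good_iterations)
c(sapply = sapply_class, vapply = vapply_class)
```

The point is not that 1:n or sapply() are always wrong, but that seq_along() and vapply() behave predictably on edge cases such as empty inputs.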

This is a great way to review your code automatically, instead of badgering a friend who happens to be an experienced R developer to review your package for you.

And speaking of badgers, I highly recommend checking out the badger package, which allows you to generate badges in your README. There are so many other badges that you can add to your package README (e.g. code coverage, number of downloads), but I won’t detail them here as this would turn into a post about badges.

4. Use conventional commits

Write every commit message as if it's part of a PR to your future employer.
- Confucius
— 🐢 Florian (@fistful_of_bass) December 15, 2020

There are many reasons for making sure your commit messages are sensible rather than unhelpful and silly (e.g. “update repo lol”), including the one cited above. Here, the recommendation is to actually take this further and use conventional commits. What this refers to is the adherence to a set of conventions when writing commit messages by expressing intent. Each commit message would be prefixed with, for instance, fix: or feat: to indicate whether it is a bug fix or a feature change. Some examples are:

feat: add new barplot function - a new feature introduced
fix: syntax error - a bug fix
format: ggplot theme changes - a change to formatting that doesn’t affect code logic
perf: remove nested loop - a change to performance by removing nested loops
docs: add examples - a change to the documentation only

You can find out more about conventional commits here. I highly recommend at least reading through the FAQ section, which answers some common questions which pop up when you are coming across conventional commits for the first time.

The benefit of using conventional commits is that it increases the transparency of the entire project, and makes it more welcoming and inclusive for collaborators. I’m also sure it will impress potential future employers, with its incredible neatness! It will also make things much easier when you are writing up pull request summaries and any package change logs.

To make this even more inclusive for other collaborators, you can add a Git Style Guide to the Wiki page of your GitHub repository, like this. Kudos to Avision Ho for sharing this idea and concept with me in the first place.

5. Package start-up message

This is probably the most controversial tip in this post, i.e. adding a start-up message to your package. This is a short message to your package users which will come up whenever they run library(YOURPACKAGE).

Why might you do this? Personally, I think it is a nice way to put certain details such as where to find out more resources about the package, or report bugs. Some developers also use this space to include a few lines to advertise some of their other work. In tidyquant, you get a subtle start-up message when you load the package:

== Need to Learn tidyquant? =====================================================
Business Science offers a 1-hour course - Learning Lab #9: Performance Analysis & Portfolio Optimization with tidyquant!
</> Learn more at: https://university.business-science.io/p/learning-labs-pro </>

How do you add a start-up message? This can be done by adding a function .onAttach() to one of your R scripts in the package. Here’s one I’ve created earlier for the wpa package:

.onAttach <- function(libname, pkgname) {
  message <- c("\n Thank you for using the {wpa} R package!",
               "\n \n Our analysts have taken every care to ensure that this package runs smoothly and bug-free.",
               "\n However, if you do happen to encounter any, please email mac@microsoft.com to report any issues.",
               "\n \n Happy coding!")
  packageStartupMessage(message)
}

The reason why this is controversial is that some argue that package start-up messages clutter up the console and interfere with reproducibility.² However, there is also another line of argument that defends the right of open-source developers to place adverts in the packages that they’ve worked so hard on (see this Twitter thread). Of course, you might just want to add a welcome message rather than an advert to your package, but I’ll leave this to the reader to decide.
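One thing that softens the clutter argument: messages emitted with packageStartupMessage() (rather than message() or cat()) can be silenced by the user. The sketch below imitates a start-up hook with a stand-alone function; startup_hook and "mypkg" are hypothetical names used only for illustration, not part of any real package.

```r
# Hypothetical stand-in for a package's .onAttach() hook
startup_hook <- function(libname, pkgname) {
  packageStartupMessage("Thank you for using the {", pkgname, "} R package!")
}

# Called normally, the message is shown in the console
startup_hook(libname = "", pkgname = "mypkg")

# Users who prefer a quiet console can wrap the call; in real use this would be
# suppressPackageStartupMessages(library(YOURPACKAGE))
suppressPackageStartupMessages(startup_hook(libname = "", pkgname = "mypkg"))

# The message is signalled as a condition of class "packageStartupMessage",
# which is what makes the selective suppression above possible
caught <- tryCatch(
  startup_hook(libname = "", pkgname = "mypkg"),
  packageStartupMessage = function(m) conditionMessage(m)
)
caught
```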

6. Add a GIF in your README

GIFs are awesome, even in the context of R package READMEs. I’ve recently experimented with screen-recording an example of my package in action, converting the video into a GIF, and adding it to the README - receiving mostly positive feedback. See the below example from the parallaxr package:

If your package allows you to generate visual outputs like plots or HTML widgets, this is a great way to let potential users see what they can achieve without leaving it only to their imagination (“what happens when I run foo_bar()?”).

7. Add a Contributor Guide and PR templates

This tip is actually what GitHub recommends under its settings in Insights > Community. And there are good reasons for doing so. The recommendation is that you should add a contributor guideline (CONTRIBUTING.md) and pull request template to your repository so that it makes it easier for others to collaborate on your package.

I would highly recommend doing anything that would make it easier for others to contribute, as I think it’s fair to say that the number of contributions (in the form of submitted issues, forks, and pull requests) is a mark of an R package’s success (you can measure using GitHub Stars too if you want, I guess).

GitHub has a comprehensive guide on how to add a Contributor Guide, and it’s really up to you to decide on how you would like others to contribute changes to your package. Still not sure what to put on your CONTRIBUTING.md? The best places to look are the big, popular R package GitHub repositories, and look at what they put in theirs (probably one of the most important takeaways of this post).

To add a pull request template, you’ll need to add a file named pull_request_template.md in the .github subfolder of your package. Certain things you may consider adding to your pull request template are:

Summary of changes from the branch
Checks to perform when reviewing the pull request
What issues are linked to this pull request

You can use this version originally put together by Avision Ho as a starting point for authoring your own templates.

8. Add a hex sticker

There’s no way an R package is complete without a hex sticker. It’s tradition, it’s cool, although arguably not essential - but why not? It’s very easy to add one, and it makes people want to download your package first even when they haven’t quite figured out the use case for your code yet.

What’s more, you can create an R package hex sticker with an R package! If you’ve not heard of it yet, you should give GuangchuangYu’s hexSticker a go.

Alternatively, if you’re somewhat of a visual artist yourself, you can also choose to create one on your own with Inkscape, which is an open-source vector graphics editor. Choose an existing hex sticker as a template, and edit the underlying SVG.

I would recommend editing with SVG because it preserves resolution, which may come in handy one day if your R package makes it big and people want to print it on merch. Dreaming on…

9. Create a package cheatsheet

Although I’m not aware of any R packages out there that can generate a package cheatsheet for you (tell me if you know of one), it’s one of the things that are totally worth doing even manually.

A cheatsheet helps users view at a glance all the functions that are available in your package, categorised in a meaningful way as you yourself (the developer) would have done it. The RStudio cheatsheet collection provides plenty of examples that you can reference, as well as a template with which you can create your own cheatsheet using either Keynote or PowerPoint. Here’s one I made earlier.

10. Submit to CRAN

Okay, this is kind of a no-brainer, and everyone would ideally want to have their package on CRAN. It really is something you should try to do, even if it is a bit of work getting all the bits right, as it gives your package a mark of approval and a boost in popularity.

Having automated R CMD checks will help you get there slightly faster and more easily, and to be honest I did not find the process as difficult as I previously imagined. The CRAN reviewers (who are volunteers, by the way!) have all been very helpful and explicit in their feedback on what needs to be changed in order to re-submit a package. Having said that, it’s a courtesy to make sure you test and review your package thoroughly before submitting it to CRAN, so you don’t waste the time of both the CRAN team and yourself! Submitting to CRAN is a substantial topic in itself, so I’m going to just put down some links.

Karl Broman has a pretty informative primer on how to get your R package on CRAN.

Bonus tip…

Since the last tip was probably slightly less informative, I’ve decided to throw in a bonus tip, which is a list of channels in which you should try to promote your R package:

Write a blog about your package, and submit to R-bloggers. There is a huge readership / following with R-bloggers, and this is a great way of getting the R community aware of your package.
Submit your package to RWeekly, either as a blog or as a simple package release message. You can submit to RWeekly by creating a pull request to merge to its DRAFT.md, or use one of the other submission methods listed on the website.
Post your package release message on Twitter with the #rstats hashtag. This makes it much more likely for the package to be picked up by the R community. Note that the convention is to use #Rstats rather than #R as a hashtag - see https://www.t4rstats.com/hashtags-what-are-they-good-for.html.

I declare #rstats the official R statistical prog lang hashtag, pass it on to friends, family and Stata users
— Drew Conway (@drewconway) April 3, 2009

If you use Reddit, consider posting in the Rstats subreddit.

Finally, it’s worth emphasising that the best way to learn how to improve your package is to look at how others do it. In the process of writing this post, I’ve learnt something myself when looking at the sjmisc package GitHub repository, i.e. a way to make it easy for others to cite your R package, with:

citation('data.table')
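As a small illustration of what citation() gives you, here is a sketch that uses the stats package (which ships with every R installation) rather than any particular contributed package, and converts the result to BibTeX with toBibtex():

```r
# citation() returns an object of class "citation" / "bibentry"
cit <- citation("stats")

# Printing shows a human-readable reference
print(cit)

# toBibtex() converts it to a BibTeX entry that can be pasted into a .bib file
bib <- toBibtex(cit)
bib
```

Package authors can customise what is returned by shipping a CITATION file with their package, so it is worth checking what your own package prints here.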

I’m sure there are plenty of other great tips out there that I’ve not included, but again I hope this post was useful enough. If you enjoyed this post, please comment in the original blog link. Take care and stay safe, and happy coding!

See https://r-pkgs.org/r-cmd-check.html for a detailed explanation of the R CMD check. ↩
See https://win-vector.com/2019/08/30/it-is-time-for-cran-to-ban-package-ads/. ↩

Comparing Common Operations in dplyr and data.table

2020-11-06T00:00:00+00:00

Background

This post compares common data manipulation operations in dplyr and data.table.

For newcomers to R who are not aware, there are many ways to do the same thing in R. Depending on the purpose of the code (readability vs creating functions) and the size of the data, I for one often find myself switching from one flavour (or dialect) of R data manipulation to another. Generally, I prefer the dplyr style for its readability and intuitiveness (for myself), data.table for its speed in grouping and summarising operations,¹ and base R when I am writing functions. This is by no means the R community consensus by the way (perfectly aware that I am venturing into a total minefield),² but is more of a representation of how I personally navigate the messy (but awesome) R world.

In this post, I am going to list out some of the most common data manipulations in both styles:

group_by(), summarise() (a single column)
group_by(), summarise_at() (multiple columns)
filter(), mutate()
mutate_at() (changing multiple columns)
Row-wise operations
Vectorised multiple if-else (case_when())
Function-writing: referencing a column with string

There is a vast amount of resources out there on the internet on the comparison of dplyr and data.table. For those who love to get into the details, I would really recommend Atrebas’s seminal blog post that gives a comprehensive tour of dplyr and data.table, comparing the code side-by-side. I would also recommend this comparison of the three R dialects by Jason Mercer, which not only includes base R in its comparison, but also goes into a fair bit of detail on elements such as piping/chaining (%>%). There’s also an excellent cheat sheet from DataCamp, linked here.

Why write a new blog post then, you ask? One key (selfish / self-centred) reason is that I myself often refer to my blog for an aide-memoire on how to do a certain thing in R, and my notes are optimised to only contain my most frequently used code. They also contain certain idiosyncrasies in the way that I code (e.g. using pipes with data.table), which I’d like to be upfront about - and would at the same time very much welcome any discussion on it. It is perhaps also justifiable that I at least attempted to build on and unify the work of others in this post, which I have argued is what is ultimately important in relation to duplicated R artefacts.

Rambling on… so here we go!

To make it easy to reproduce results, I am going to just stick to the good ol’ mtcars and iris datasets which come shipped with R. I will also err on the side of verbosity and load the packages at the beginning of each code chunk, as if each code chunk is its own independent R session.

1. `group_by()`, `summarise()` (a single column)

Analysis: Maximum MPG (mpg) value for each cylinder type in the mtcars dataset.
Operations: Summarise with the max() function by group.

To group by and summarise values, you would run something like this in dplyr:

library(dplyr)

mtcars %>%
    group_by(cyl) %>%
    summarise(max_mpg = max(mpg), .groups = "drop_last")

You could do the same in data.table, and still use magrittr pipes:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

mtcars %>%
    as.data.table() %>%
    .[,.(max_mpg = max(mpg)), by = cyl]
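For reference, since base R is the third dialect mentioned in the background, the same summary can be sketched without any packages using aggregate(). This is just one of several possible base R approaches (tapply() is another), and the object name max_mpg_by_cyl is my own choice for illustration.

```r
# Group by `cyl` and take the maximum of `mpg` using the formula interface
max_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = max)

# Rename the summarised column to match the dplyr / data.table output
names(max_mpg_by_cyl)[names(max_mpg_by_cyl) == "mpg"] <- "max_mpg"

max_mpg_by_cyl
```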

2. `group_by()`, `summarise_at()` (multiple columns)

Analysis: Mean value of Sepal.Width and Sepal.Length for each iris Species in the iris dataset.
Operations: Summarise with the mean() function by group.

Note: this is slightly different from the scenario above because the “summarisation” is applied to multiple columns.

In dplyr:

library(dplyr)

# Option 1
iris %>%
    group_by(Species) %>%
    summarise_at(vars(contains("Sepal")),~mean(.))

# Option 2
iris %>%
  group_by(Species) %>%
  summarise(across(contains("Sepal"), mean), .groups = "drop_last")

In data.table with pipes:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

# Option 1
iris %>%
    as.data.table() %>%
    .[,lapply(.SD, mean), by = Species, .SDcols = c("Sepal.Length", "Sepal.Width")]
    
# Option 2
iris %>%
  as.data.table() %>%
  .[,lapply(.SD, mean), by = Species, .SDcols = names(.) %like% "Sepal"]

3. `filter()`, `mutate()`

Analysis: Find out what the product of Sepal.Width and Sepal.Length would be for the iris species setosa.
Operations: Filter by Species=="setosa" and create a new column called Sepal_Index.

In dplyr:

library(dplyr)

iris %>%
    filter(Species == "setosa") %>%
    mutate(Sepal_Index = Sepal.Width * Sepal.Length)

In data.table:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

iris %>%
    as.data.table() %>%
    .[, Species := as.character(Species)] %>%
    .[Species == "setosa"] %>%
    .[, Sepal_Index := Sepal.Width * Sepal.Length] %>%
  .[]

4. `mutate_at()` (changing multiple columns)

Analysis: Multiply Sepal.Width and Sepal.Length by 100.
Operations: As above

In dplyr:

library(dplyr)

# Option 1
iris %>%
    mutate_at(vars(Sepal.Length, Sepal.Width), ~.*100)

# Option 2
iris %>%
  mutate(across(starts_with("Sepal"), ~.*100))

In data.table with pipes:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)


sepal_vars <- c("Sepal.Length", "Sepal.Width")

iris %>%
  as.data.table() %>%
  .[,as.vector(sepal_vars) := lapply(.SD, function(x) x * 100), .SDcols = sepal_vars] %>%
  .[]

5. Row-wise operations

This is always an awkward one, even for dplyr. For this, I will list a couple of options for row-wise calculations.

Analysis: Create a TotalSize column by summing all four columns of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width.
Operations: As above

In dplyr:

library(dplyr)

# Option 1 - use `rowwise()`
iris %>%
  rowwise() %>%
  mutate(TotalSize = sum(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width))

# Option 2 - use `apply()` and `select()`
# Select all columns BUT `Species`
iris %>%
  mutate(TotalSize = select(., -Species) %>% apply(MARGIN = 1, FUN = sum))

# Option 3 - `rowwise()` and `c_across()`
# Select all columns BUT `Species`
iris %>%
  rowwise() %>%
  mutate(TotalSize = sum(c_across(-Species)))

In data.table with pipes:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

# Get all the column names in `iris` except for `Species`
all_vars <- names(iris)[names(iris) != "Species"]

iris %>%
  as.data.table() %>%
  .[, "TotalSize" := apply(.SD, 1, sum), .SDcols = all_vars] %>%
  .[]
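When the row-wise operation is a plain sum, as it is here, a vectorised base R alternative is rowSums(), which avoids looping over rows altogether. The sketch below is a minimal illustration rather than a recommendation over the options above; iris_total and num_cols are names I have made up for the example.

```r
# All columns except `Species` are numeric measurements
num_cols <- names(iris)[names(iris) != "Species"]

# Work on a copy so that the built-in `iris` dataset stays untouched
iris_total <- iris

# rowSums() adds up the selected columns for every row in one vectorised call
iris_total$TotalSize <- rowSums(iris_total[, num_cols])

head(iris_total)
```

This only works for sums (or means, with rowMeans()); for arbitrary row-wise functions you would still need something like rowwise() or apply().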

6. Vectorised multiple if-else (`case_when()`)

Analysis: Classify an Age into different categories
Operations: Create a new column called AgeLabel based on the Age variable

In dplyr:

library(dplyr)

age_data <- tibble(Age = seq(1, 100))

age_data %>%
  mutate(AgeLabel = case_when(Age < 18 ~ "0 - 17",
                              Age < 35 ~ "18 - 34",
                              Age < 65 ~ "35 - 64",
                              TRUE ~ "65+"))

In data.table:

library(data.table)
library(magrittr) # Or any package that imports the pipe (`%>%`)

# Option 1 - without pipes
age_data <- data.table(Age = 0:100)
age_data[, AgeLabel := "65+"]
age_data[Age < 65, AgeLabel := "35-64"]
age_data[Age < 35, AgeLabel := "18-34"]
age_data[Age < 18, AgeLabel := "0-17"]        

# Option 2 - with pipes
age_data2 <- data.table(Age = 0:100)

age_data2 %>%
  .[, AgeLabel := "65+"] %>%
  .[Age < 65, AgeLabel := "35-64"] %>%
  .[Age < 35, AgeLabel := "18-34"] %>%
  .[Age < 18, AgeLabel := "0-17"] %>%
  .[]

One thing to note is that there are two options here - Option 2 with and Option 1 without using magrittr pipes. The reason why Option 1 is possible without any assignment (<-) is because of reference semantics in data.table. When := is used in data.table, a change is made to the data.table object via ‘modify by reference’, without creating a copy of the data.table object; by contrast, creating a modified copy and assigning it to a new object (as you would with dplyr) is referred to as ‘modify by copy’.

As Tyson Barrett nicely summarises, this ‘modifying by reference’ behaviour in data.table is partly what makes it efficient, but can be surprising if you do not expect or understand it; however, the good news is that data.table gives you the option whether to modify by reference or by making a copy.
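Here is a minimal sketch of that behaviour, assuming data.table is installed. It shows that a plain assignment does not protect the original object from :=, whereas data.table::copy() does. The object names (dt, alias, dt_copy) are only for illustration.

```r
library(data.table)

dt <- data.table(x = 1:3)

# A plain assignment does not copy: `alias` and `dt` refer to the same object,
# so adding a column to `alias` by reference also adds it to `dt`
alias <- dt
alias[, y := x * 2]
y_in_original <- "y" %in% names(dt)

# copy() creates an independent object, so `dt` is left alone this time
dt_copy <- copy(dt)
dt_copy[, z := x + 1]
z_in_original <- "z" %in% names(dt)

c(y_in_original = y_in_original, z_in_original = z_in_original)
```

If you want dplyr-like ‘modify by copy’ behaviour with data.table, wrapping the input in copy() before using := is one way to get it.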

7. Function-writing: referencing a column with string

Requirement: Create a function that will multiply a column by three. A string should be supplied to the argument to specify the column to be multiplied. The function returns the original data frame with the modified column.

Here, I intentionally name the packages explicitly within the function and not load them, as it’s best practice for functions to be able to run on their own without loading in an entire library.

In dplyr:

multiply_three <- function(data, variable){
  
  dplyr::mutate(data, !!rlang::sym(variable) := !!rlang::sym(variable) * 3)
}

multiply_three(iris, "Sepal.Length")

In data.table:

(See https://stackoverflow.com/questions/45982595/r-using-get-and-data-table-within-a-user-defined-function)

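multiply_three <- function(data, variable){
  
  # Convert to a data.table (this makes a copy, so `data` itself is untouched)
  dt <- data.table::as.data.table(data)
  # Wrapping `variable` in brackets makes `:=` use the string it holds as the column name
  dt[, (variable) := get(variable) * 3]
  dt[] # Return the result so that it prints
}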

multiply_three(iris, "Sepal.Length")
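For completeness, the same requirement can be sketched in base R, where [[ accepts a column name as a string directly and no quoting or unquoting is needed. The function name multiply_three_base and the input checks are my own additions for illustration.

```r
multiply_three_base <- function(data, variable) {
  # Fail early if `variable` is not a single, existing column name
  stopifnot(
    is.character(variable),
    length(variable) == 1,
    variable %in% names(data)
  )

  # `[[` looks up the column by the string stored in `variable`
  data[[variable]] <- data[[variable]] * 3
  data
}

# Works with a literal string...
out_literal <- multiply_three_base(iris, "Sepal.Length")

# ...and with a string stored in another object
col_name <- "Sepal.Width"
out_stored <- multiply_three_base(iris, col_name)

head(out_literal)
```

Because data frames are copied on modification in base R, the original iris is left unchanged.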

End Note

This is it! For anything with greater detail, please consult the blogs and cheat sheets I recommended at the beginning of this blog post. I’d say this covers 65% (not a strictly empirical statistic) of my needs for data manipulation, so I hope this is of some help to you. (The gather() vs melt() vs pivot_longer() subject is a whole other beast, and ought to be dealt with in another post)

Elio Campitelli has an excellent blog post, Why I love data.table, which is a nice short piece on why data.table is pretty awesome.↩︎
As noted in the DS4PS blog, the debate of dplyr versus data.table has resulted in “Twitter clashes, and even became an inspiration for memes.”↩︎

A Shiny app on Hong Kong District Councillors

2020-09-05T00:00:00+00:00

👀 TL;DR

We built an R Shiny app to improve access to information on Hong Kong’s local politicians. This is so that voters can make more informed choices. The app shows basic information on each politician, alongside a live feed of their Facebook page and illustrative maps of their district. We took advantage of this project to test out a range of R packages and techniques and to implement some DevOps best practices, which we will discuss in this post.

This project is an attempt to help make a difference with R programming. It’s an opportunity for us to learn, to code, to have fun, and to make a difference.

This blog post was originally published on https://martinctc.github.io/blog/, and was co-authored by Martin Chan and Avision Ho.

💻 Overview

Our project was mainly motivated by an observation that the engagement of the Hong Kong public with their local politicians was very low.¹ Historically, the work of Hong Kong’s District Councillors (DCs) has been neither widely known nor closely scrutinised by the public media. Until recently, most District Councillors did not use webpages or Facebook pages to share their work, but instead favoured distributing physical copies of ‘work reports’ via Direct Mail. This changed significantly with the 2019 District Council election, a significant election in which turnout jumped to 71% (from 47% in 2015), for different reasons. For context, Hong Kong’s District Councils are the most local level of government, and the only level in which there is full universal suffrage for all candidates.

As of the summer of 2020, we identified that 96% (434) of the 452 District Councillors elected in 2019 actually have a dedicated Facebook page for delivering updates to and engaging with local residents. However, these Facebook pages have to be manually searched for online, and there is not a readily available tool where people can quickly map a District to a District Councillor and to their Facebook feeds.

As a wise person once said, “If you can solve a problem effectively in R, why the hell not?”. We tackled this problem by creating a Shiny app in R, which brings together the Facebook feeds and constituency information for Hong Kong’s district councillors in one place. In this way, people will be able to access the currently disparately stored information in a single web app.

You can access:

The Shiny app here.
Our GitHub repository here.
Don’t forget to also provide some feedback to the Shiny app here!

Whether you are more of an R enthusiast or simply someone who has an interest in Hong Kong politics (hopefully both!), we hope this post will bring you some inspiration on how you can use R not just as a great tool for data analysis, but also as an enabler for you to do something tangible for your community and contribute to causes you care about.

🔍 What is in the app?

The Shiny app is built like a dashboard which combines information about each district councillor alongside their Facebook page posts (if it exists) and the district they serve, illustrated on an interactive map. By using the District and Constituency dropdown lists, you can retrieve information about the District Councillor and their Facebook feed.

Specifically, there are several key components that were used on top of the incredible shiny package:

shinydashboard: For mobile-friendly dashboard layout.
- We understood that our users, primarily HK citizens, frequently use mobiles. Thus, to ensure this app was useful to them, we centred our design on how the app looked on their mobile browsers.

googlesheets4: For seamless access to Google Sheets.
- We understood that our users are not all technical so we stored the core data in a format and platform familiar and accessible to most people, Google Sheets.
- At a later stage of the app development, we migrated to storing the data in an R package we wrote, called hkdatasets as we sought to keep the data in one place. However, the Google Sheets implementation worked very well, and the app could be deployed with no impact on performance or user experience.
sf and leaflet: For importing geographic data and creating interactive maps.
- We understood that our users may want to explore other parts of Hong Kong but may not know the names of each constituency. Thus, we provided a map functionality to make it easier for them to learn more about different parts of Hong Kong.
rintrojs: For interactive tutorials.
- We understood that our users are not necessarily keen to read pages of instructions on how to use the app, especially if they are on mobile. Thus, we implemented a dynamic feature that visually walks them through each component of the app.

🗄️ How was the data collected?

Since there was no existing single data source on the DCs, we had to put this together ourselves. All the data on each DC, their constituency, the party they belong to, and their Facebook page was collected manually from a combination of Wikipedia and Facebook. The data was initially housed on Google Sheets, for multiple reasons:

Using Google Sheets made it easy for multiple people to collaborate on data entry.
Keeping the data outside of the repo has the advantage of keeping the repo size minimal, in line with best practices.
By storing the data in Google Sheets, non-technical users would also be able to access the data.

Most of all, it was easy to access the Google Sheets data with the {googlesheets4} package! For editing the data for pre-processing, a key function is googlesheets4::gs4_auth(), which directs the developer to a web browser, where they are asked to sign in to their Google account and to grant googlesheets4 permission to operate on their behalf with Google Sheets. We then set up the main Google Sheet - the nicely formatted version intended for the app to ingest - to provide read-only access to anyone with the link, and used googlesheets4::gs4_deauth() to access the public Google Sheet in a de-authorised state. The Shiny app itself does not have any Google credentials stored alongside it (which it shouldn’t, for security reasons), and this workflow allows (i) collaborators/developers to edit the data from R and (ii) the app to access the Google Sheet data without any need for users to log in.

This Google Sheet is available here.
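The read-only side of this workflow can be sketched as below. This is a minimal illustration rather than the app's actual code: read_public_sheet() is a hypothetical helper name, and the sheet URL is left as a placeholder.

```r
library(googlesheets4)

# Hypothetical helper: read a publicly shared Google Sheet without credentials.
# `sheet` can be a URL or a sheet ID; the sheet must be readable by anyone with the link.
read_public_sheet <- function(sheet) {
  gs4_deauth()       # switch to a de-authorised state, so no user token is used
  read_sheet(sheet)  # read the first worksheet into a tibble
}

# Usage, with a placeholder rather than the real sheet URL:
# dc_data <- read_public_sheet("<URL or ID of the public Google Sheet>")
```

Calling gs4_deauth() inside the helper means the function never triggers a browser sign-in, which is the behaviour wanted for a deployed app.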

Creating a map with constituency boundaries also required additional data. Boundaries for each constituency were obtained through a Freedom of Information (FOI) request by a member of the public here (see discussion of shapefiles below).

This was pretty much Phase #1 of data collection, where we had a single Google Sheet with basic information about the District Councillors and their Facebook feeds, which enabled us to create a proof of concept of the Shiny app, i.e. making sure that we can set up a mechanism where the user can select a constituency and the app returns the corresponding Facebook feed of the District Councillor.

Based on user feedback, we started with Phase #2 of data collection, which involved a web-scraping exercise on the official Hong Kong District Council website and the HK01 News Page on the 2019 District Council elections to get extra data points, such as:

- Contact email address
- Contact number
- Office address
- Number of votes, and share of votes won in 2019

A function that was extremely helpful for figuring out the URL of the District Councillors’ individual official pages is the following. It runs a Bing search on the https://www.districtcouncils.gov.hk website, and scrapes from the search results any links which match what we want (based on what the URL string looks like). Although this doesn’t always work, it went a long way in helping us with the 452 District Councillors.

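library(rvest) # provides html_nodes(), html_attr() and the %>% pipe

scrape_dcs <- function(search_term){

  # Restrict the Bing search to the District Council website
  query_string <- paste("site: https://www.districtcouncils.gov.hk", search_term)

  # Encode the query so that it can be used in a URL
  squery <- URLencode(query_string)

  squeryfull <- paste0("https://www.bing.com/search?q=", squery)

  main_page <- xml2::read_html(squeryfull)

  # Extract the links from the search result titles
  temp <- html_nodes(main_page, '.b_title a') %>%
    html_attr("href")

  # Keep only the links that look like a councillor's individual page
  temp[grepl("member_id=", temp)]
}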
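
To see the two string-handling steps of this function without hitting Bing, here is a small offline sketch using only base R. The search term and the links in mock_links are made up for illustration and are not real search results.

```r
# Build the search URL in the same way as scrape_dcs(), with a hypothetical search term
query_string <- paste("site: https://www.districtcouncils.gov.hk", "Central and Western")
search_url <- paste0("https://www.bing.com/search?q=", URLencode(query_string))
search_url

# Hypothetical links, standing in for what a search result page might return
mock_links <- c(
  "https://www.districtcouncils.gov.hk/central/english/members/info/dc_member_list_detail.php?member_id=1",
  "https://www.districtcouncils.gov.hk/central/english/welcome/welcome.html",
  "https://example.com/unrelated"
)

# Keep only the links that point to an individual member page
member_links <- mock_links[grepl("member_id=", mock_links)]
member_links
```

The grepl() filter is what makes the approach tolerant of noisy search results: anything that does not carry a member_id parameter is simply dropped.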

One key thing to note is that all of the above data we compiled is available and accessible in the public domain, where we simply took an extra step to improve the accessibility.² The Phase #2 data was used in the final app to provide more information to the user when a particular constituency or District Councillor is selected.

📦 Creating a data package

Our data R package, hkdatasets, is to some extent a spin-off of this project. We decided to migrate from Google Sheets to an R data package approach, for the following reasons:

An R data package could allow us to provide more detailed documentation and tracking of how the data would change over time. If we choose to expand the dataset in the future, we can easily add this to the package release notes.
An R data package would fit well with our broader ambition to work on other Hong Kong themed, open-source projects. From sharing our project with friends, we were approached to help with another project to visualise Hong Kong traffic collisions data, where the repo is here. As part of this, we obtained traffic collisions data via an FOI request, and this data is also available through hkdatasets.
An R data package would make it easier for learners and students in the R community to practise with the datasets we’ve put together, without having to learn about the googlesheets4 package. Our thinking is that this would benefit others, just as data packages like nycflights13 and babynames have benefitted us as we learned R.

hkdatasets is currently only available on GitHub, and our aim is to release it on CRAN in the future so that more R users can take advantage of it. Check out our GitHub repo to find out more about it.

🔗 Linking our Shiny App to Facebook

When we first conceptualised this project, our aim was always to make the Facebook Page content the centrepiece of the app. This was contingent on using some form of Facebook API to access content on the District Councillors’ Public Pages, which we initially thought would be easy as Public Page content is ‘out there’, and shouldn’t require any additional permissions or approvals.

It turns out that reading public posts from Facebook Pages that we do not have admin access to requires a permission called Page Public Content Access, which in turn requires us to submit our app to Facebook for review. Reading several threads (such as this) online soon convinced us that this would be a fairly challenging process, as we would effectively need to submit a proposal on why we had to request this permission. To add to the difficulty, we understood that the App Review process had been put on pause at the time, due to the re-allocation of resourcing during COVID-19.

This drove us to search for a workaround, and this is where we stumbled across iframes as a solution. An iframe is basically a frame that enables you to embed an HTML document within another HTML document (they’ve existed for a long time, as I recall using them in the really early GeoCities and Xanga websites).

The iframe concept roughly works as follows. All the Facebook Page URLs are saved in a vector called FacebookURL, and this is “wrapped” with some HTML markup that enables it to be rendered within the Shiny App as a UI component (using shiny::uiOutput()):

chunk1 <- '<iframe src="https://www.facebook.com/plugins/page.php?href='
chunk3 <- '&tabs=timeline&width=400&height=800&small_header=false&adapt_container_width=true&hide_cover=false&show_facepile=true&appId=3131730406906292" width="400" height="800" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true" allow="encrypted-media"></iframe>'
iframe <- paste0(chunk1, FacebookURL, chunk3)
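
Because paste0() is vectorised, the same three lines produce one iframe string per Facebook Page. The sketch below illustrates this with two made-up Page URLs and shortened HTML chunks; none of these values are taken from the app.

```r
# Hypothetical Facebook Page URLs, standing in for the FacebookURL vector
FacebookURL <- c(
  "https://www.facebook.com/examplepage1",
  "https://www.facebook.com/examplepage2"
)

# Shortened versions of the HTML chunks, for illustration only
chunk1 <- '<iframe src="https://www.facebook.com/plugins/page.php?href='
chunk3 <- '&tabs=timeline&width=400&height=800" width="400" height="800"></iframe>'

# paste0() recycles the chunks, returning one iframe string per Page URL
iframe <- paste0(chunk1, FacebookURL, chunk3)

length(iframe)
iframe[1]
```

In a Shiny server function, a single element of this vector could then be wrapped in shiny::HTML() inside shiny::renderUI(), so that the markup is rendered as HTML rather than escaped as text.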

Although the iframe solution comes with its own challenges, such as the difficulty in making it truly responsive / mobile-optimised, it was nonetheless an expedient and effective workaround that allowed us to produce a proof-of-concept; the alternative was to splash around in Facebook’s API documentation and discussion boards for at least another month to achieve the App Approval (bearing in mind that we were working on this in our own free time, with limited resources).

🌍 Visualising the shapefiles

The first rule of optimisation is you don’t.

— Michael A. Jackson

We obtained shapefiles from AccessInfo.HK in order to visualise the individual Districts on a map. A shapefile is, according to the ArcGIS website:

… a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas).

These shapefiles could be easily used as part of a ggplot2 workflow, which we created with geom_sf() to rapidly get a Proof of Concept. This was to quickly visualise the districts and how they look in relation to the Shiny app.

The code we used was as follows:

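library(ggplot2)
library(sf) # needed for ggplot2 to draw sf objects with geom_sf()
# shape_hk and shape_district are the sf objects read in from the shapefiles
map_hk_districts <- ggplot() +
  geom_sf(data = shape_hk, fill = '#009E73') +
  geom_sf(data = shape_district, fill = '#56B4E9', alpha = 0.2, linetype = 'dotted', size = 0.2)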

(Image shows an earlier iteration of the app)

Once we settled on how the map looked in relation to the Shiny app, we spent some additional time and effort investigating leaflet. The reason for moving to leaflet maps was their interactivity: we understood our users would want to explore the HK map interactively to find out which constituency they belong to, or to find one that was of interest. This was because we were aware that people may know what region they live in but may not know the name of the constituency.
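
As a rough idea of what the leaflet version involves, here is a minimal sketch. It does not use the real shapefiles: the square polygon and the constituency name are made up, and the styling arguments are arbitrary choices rather than the ones used in the app.

```r
library(sf)
library(leaflet)

# A hypothetical square polygon standing in for one constituency boundary
square <- st_polygon(list(rbind(
  c(114.15, 22.27), c(114.18, 22.27), c(114.18, 22.29),
  c(114.15, 22.29), c(114.15, 22.27)
)))

# Wrap the geometry in an sf data frame with a made-up constituency name
shape_demo <- st_sf(
  constituency = "Example constituency",
  geometry = st_sfc(square, crs = 4326)
)

# Draw the polygon on a base map; the label is shown when hovering over it
map_demo <- leaflet(shape_demo) %>%
  addTiles() %>%
  addPolygons(
    label = ~constituency,
    weight = 1,
    fillOpacity = 0.2,
    highlightOptions = highlightOptions(weight = 3, bringToFront = TRUE)
  )
```

The hover label is the part that addresses the problem described above: users can find a constituency by location first, and learn its name second.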

💭 What are our next steps?

There were some cool features that we would have liked to implement, but have not been able to:

pre-commit hooks: Those familiar with Python may be aware of pre-commit hooks as a way to automatically detect whether your repo contains anything sensitive like a .secrets file. Setting this up would enable us to have automated checks run each time we make a commit, to ensure we follow specified standards.
- Unfortunately, we named our repo with a hyphen so the pre-commit hooks won’t work.

codecov: Allows us to track how thoroughly our functions are tested, so that we can be confident they work under a multitude of scenarios, such as when users encounter problems.

modularise shiny code: Ensures our Shiny code is chunked so individual pieces of logic are isolated.
- This makes the overall code easier to follow as it separates the objects that are connected from those that are not. It also makes testing easier because you can test each isolated chunk.

language selection: Currently the app is a smorgasbord of English and Chinese. Consequently, it looks messy. We want to implement the ability for the user to choose which language they want to see the app in and the app’s language will update accordingly.

Release to alpha testers to get early feedback.

More of our enhancements / spikes are listed here on GitHub.

One of the things that we wanted to try out with this open-source project was to adhere to some DevOps best practices, yet unfortunately some of these would have been easier to set up from the beginning, or require more time and knowledge (on our part) to set up. As we develop a V2 of this Shiny App and work on other projects, we hope to find the opportunity to implement more of the above features.

🔥 Other features in the app

There are also a number of features that we have implemented, but have not detailed in this post. For instance:

Adding a searchable DataTable with information on the District Councillors, with the DT package
Embedding a user survey within the Shiny app
Adding a tutorial to go through features of the Shiny app, using the rintrojs package
Adding loader animations with shinycssloaders

We will cover more of that detail in a Part 2 of this blog, so watch this space!

💪 Who is behind this?

Multiple people contributed to this work. Avision Ho is a data scientist who wrote the majority of the Shiny app, and who was also previously interviewed on this blog. Avision is a co-author on this post. Ocean Cheung came up with the original idea of this app, and made it all possible with his knowledge and network with District Councillors. We would also like to credit Justin Yim, Tiffany Chau, and Gabriel Tam for their feedback and advice on the scope and the direction of this app. We are currently working on a number of other projects, which you can find out more about on our website: https://hong-kong-districts-info.github.io/.

(Disclaimer! We are not affiliated to any political individuals nor movements. We are simply some people who’d like to contribute to society through code and open-source projects.)

✋ Want to get involved?

We’re looking for collaborators or reviewers, so please send us an email (hkdistricts.info@gmail.com), or comment down below if you are interested! We would also appreciate any feedback or questions, which you could either comment below or respond to our in-app survey. You can also get an idea of things we are planning to work on through our Trello board here.

When we first started out, we were just a couple of people who wanted to learn and practise a new skill (e.g. building a Shiny app, implementing best practices), and wanted a meaningful open-source project that we could work on. Read more about our Vision Statement here.

There are many reasons for this, and arguably a similar phenomenon can be observed in most local elections in other countries. See Lee, F. L., & Chan, J. M. (2008). Making sense of participation: The political culture of pro-democracy demonstrators in Hong Kong. The China Quarterly, 84-101.↩︎
This is in compliance with the ICO’s description of the ‘public domain’, i.e. that information is only in the public domain if it is realistically accessible to a member of the general public at the time of the request. It must be available in practice, not just in theory.↩︎

Vignette: Generate your own ggplot theme gallery

2020-05-08T00:00:00+00:00

Background

I’ve always found it a bit of a pain to explore and choose from all the different themes available out there for {ggplot2}.

Yes I know, I know - there are probably tons of websites out there with a ggplot theme gallery which I can Google,¹ but it’s always more fun if you can create your own. So here’s my attempt to do this, on a lockdown Bank Holiday afternoon.

DIY ggplot theme gallery 📊

1. Start with a list of plots and a list of themes

The outcome I want to achieve from this is to create something that would make it easier to decide which ggplot theme to pick for the visualisation at hand. The solution doesn’t need to be fancy: it would be helpful enough to generate all the combinations of plot types × themes, so I can browse through them and get inspiration more easily.

I took a leaf out of Shayne Lynn’s book/blog and created a couple of “base plots” using iris (yes, boring, but it works). I did these for four types of plots:

scatter plot
bar plot
box plot
density plot

I then assigned these four plots into a list object called plot_list, and converted them into a tibble (plot_base) that I could use for joining afterwards.

This step is then repeated for themes, where I punched virtually all the existing themes in {ggplot2} and {ggthemes} into a named list (theme_list), and also created a tibble (theme_base). You can make this list as long and exhaustive as you want, but for this example I didn’t want to go into overkill.

You’ll see that I’ve made the names quite elaborate in terms of specifying the package source. This is because these names will be used afterwards in the plot output, and it will be helpful for identifying the function for generating the theme in the gallery.

#### Load packages ####
library(tidyverse)
library(ggthemes) # Optional - only for testing additional themes


#### Create base plots ####
## scatter plot
point_plot <-
  ggplot(iris, aes(x=jitter(Sepal.Width),
                   y=jitter(Sepal.Length),
                   col=Species)) +
  geom_point() +
  labs(x="Sepal Width (cm)",
       y="Sepal Length (cm)",
       col="Species",
       title="Iris Dataset - Scatter plot")

## bar plot
bar_plot <-
  iris %>%
  group_by(Species) %>%
  summarise(Sepal.Width = mean(Sepal.Width)) %>%
  ggplot(aes(x=Species, y=Sepal.Width, fill=Species)) +
  geom_col() +
  labs(x="Species",
       y="Mean Sepal Width (cm)",
       fill="Species",
       title="Iris Dataset - Bar plot")

## box plot
box_plot <- ggplot(iris,
                   aes(x=Species,
                       y=Sepal.Width,
                       fill=Species)) +
  geom_boxplot() +
  labs(x="Species",
       y="Sepal Width (cm)",
       fill="Species",
       title="Iris Dataset - Box plot")

## density plot
density_plot <-
  iris %>%
  ggplot(aes(x = Sepal.Length, fill = Species)) +
  geom_density() +
  facet_wrap(.~Species) +
  labs(x="Sepal Length (cm)",
       y="Density",
       fill="Species",
       title="Iris Dataset - Density plot")

#### Create iteration table ####
## Put all base plots in a list
plot_list <-
  list("bar plot" = bar_plot,
       "box plot" = box_plot,
       "scatter plot" = point_plot,
       "density plot" = density_plot)

## Convert list into a tibble
plot_base <-
  tibble(plot = plot_list,
         plot_names = names(plot_list))

## Put all themes to test in a named list
## names will be fed into subtitles
theme_list <-
  list("ggplot2::theme_minimal()" = theme_minimal(),
       "ggplot2::theme_classic()" = theme_classic(),
       "ggplot2::theme_bw()" = theme_bw(),
       "ggplot2::theme_gray()" = theme_gray(),
       "ggplot2::theme_linedraw()" = theme_linedraw(),
       "ggplot2::theme_light()" = theme_light(),
       "ggplot2::theme_dark()" = theme_dark(),
       "ggthemes::theme_economist()" = ggthemes::theme_economist(),
       "ggthemes::theme_economist_white()" = ggthemes::theme_economist_white(),
       "ggthemes::theme_calc()" = ggthemes::theme_calc(),
       "ggthemes::theme_clean()" = ggthemes::theme_clean(),
       "ggthemes::theme_excel()" = ggthemes::theme_excel(),
       "ggthemes::theme_excel_new()" = ggthemes::theme_excel_new(),
       "ggthemes::theme_few()" = ggthemes::theme_few(),
       "ggthemes::theme_fivethirtyeight()" = ggthemes::theme_fivethirtyeight(),
       "ggthemes::theme_foundation()" = ggthemes::theme_foundation(),
       "ggthemes::theme_gdocs()" = ggthemes::theme_gdocs(),
       "ggthemes::theme_hc()" = ggthemes::theme_hc(),
       "ggthemes::theme_igray()" = ggthemes::theme_igray(),
       "ggthemes::theme_solarized()" = ggthemes::theme_solarized(),
       "ggthemes::theme_solarized_2()" = ggthemes::theme_solarized_2(),
       "ggthemes::theme_solid()" = ggthemes::theme_solid(),
       "ggthemes::theme_stata()" = ggthemes::theme_stata(),
       "ggthemes::theme_tufte()" = ggthemes::theme_tufte(),
       "ggthemes::theme_wsj()" = ggthemes::theme_wsj())

## Convert list into a tibble
theme_base <-
  tibble(theme = theme_list,
         theme_names = names(theme_list))

plot_base

## # A tibble: 4 x 2
##   plot         plot_names  
##   <named list> <chr>       
## 1 <gg>         bar plot    
## 2 <gg>         box plot    
## 3 <gg>         scatter plot
## 4 <gg>         density plot

theme_base

## # A tibble: 25 x 2
##    theme        theme_names                      
##    <named list> <chr>                            
##  1 <theme>      ggplot2::theme_minimal()         
##  2 <theme>      ggplot2::theme_classic()         
##  3 <theme>      ggplot2::theme_bw()              
##  4 <theme>      ggplot2::theme_gray()            
##  5 <theme>      ggplot2::theme_linedraw()        
##  6 <theme>      ggplot2::theme_light()           
##  7 <theme>      ggplot2::theme_dark()            
##  8 <theme>      ggthemes::theme_economist()      
##  9 <theme>      ggthemes::theme_economist_white()
## 10 <theme>      ggthemes::theme_calc()           
## # ... with 15 more rows

2. Create an iteration table

The next step is to create what I call an iteration table. Here I use tidyr::expand_grid(), which creates a tibble from all combinations of inputs. Actually you can use either tidyr::expand_grid() or the base function expand.grid(), but I like the fact that the former returns a tibble rather than a data frame.

The output is all_combos, which is a two column tibble with all combinations of theme_names and plot_names, as character vectors. I then use left_join() twice to bring in the themes and the base plots:

## Create an iteration data frame
## Use `expand_grid()` to generate all combinations
## of themes and plots

all_combos <-
  expand_grid(plot_names = plot_base$plot_names,
              theme_names = theme_base$theme_names)
  
iter_df <-
  all_combos %>%
  left_join(plot_base, by = "plot_names") %>%
  left_join(theme_base, by = "theme_names") %>%
  select(theme_names, theme, plot_names, plot) # Reorder columns

iter_df

## # A tibble: 100 x 4
##    theme_names                       theme   plot_names plot  
##    <chr>                             <list>  <chr>      <list>
##  1 ggplot2::theme_minimal()          <theme> bar plot   <gg>  
##  2 ggplot2::theme_classic()          <theme> bar plot   <gg>  
##  3 ggplot2::theme_bw()               <theme> bar plot   <gg>  
##  4 ggplot2::theme_gray()             <theme> bar plot   <gg>  
##  5 ggplot2::theme_linedraw()         <theme> bar plot   <gg>  
##  6 ggplot2::theme_light()            <theme> bar plot   <gg>  
##  7 ggplot2::theme_dark()             <theme> bar plot   <gg>  
##  8 ggthemes::theme_economist()       <theme> bar plot   <gg>  
##  9 ggthemes::theme_economist_white() <theme> bar plot   <gg>  
## 10 ggthemes::theme_calc()            <theme> bar plot   <gg>  
## # ... with 90 more rows
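
As a side note on the choice between tidyr::expand_grid() and expand.grid(), the two also differ in row order, which is why plot_names stays on "bar plot" for the first rows of iter_df above. The toy inputs x and y below are made up purely to show the difference.

```r
library(tidyr)

x <- c("a", "b")
y <- c("p", "q", "r")

tidy_grid <- expand_grid(x = x, y = y)
base_grid <- expand.grid(x = x, y = y, stringsAsFactors = FALSE)

# Both contain every combination of x and y
nrow(tidy_grid) == nrow(base_grid)

# expand_grid() varies the first input slowest ...
tidy_grid$x

# ... whereas expand.grid() varies the first input fastest
base_grid$x
```

Note also that expand.grid() converts character inputs to factors unless stringsAsFactors = FALSE is set, which would matter for the left_join() calls on character columns.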

3. Run your ggplot gallery!

The final step is to create the ggplot “gallery”.

I used purrr::pmap() on iter_df, which applies a function to the data frame, using the values in each column as inputs to the arguments of the function. You will see that:

iter_label is ultimately used as the names for the list of plots (plot_gallery).
label within the function is used for populating the subtitles of the plots
output_plot is the plot that is created within the function

#### Run plots ####
## Use `pmap()` to run all the plots-theme combinations

## Create labels to be used as names for `plot_gallery`
iter_label <-
  paste0("Theme: ",
         iter_df$theme_names,
         "; Plot type: ",
         iter_df$plot_names)

## Create a list of plots
plot_gallery <-
  iter_df %>%
  pmap(function(theme_names, theme, plot_names, plot){
    
    label <- 
      paste0("Theme: ",
             theme_names,
             "\nPlot type: ",
             plot_names)

    output_plot <-
      plot +
      theme +
      labs(subtitle = label)
    
    return(output_plot)
  }) %>%
  set_names(iter_label)


plot_gallery

## $`Theme: ggplot2::theme_minimal(); Plot type: bar plot`
## 
## $`Theme: ggplot2::theme_classic(); Plot type: bar plot`
## 
## $`Theme: ggplot2::theme_bw(); Plot type: bar plot`
## 
## [... the output continues in the same pattern for all 100 combinations of the 25 themes and 4 plot types (bar, box, scatter, density), each followed by its plot ...]
## 
## $`Theme: ggthemes::theme_wsj(); Plot type: density plot`
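
If the way pmap() maps columns to function arguments is not obvious from the plotting code, the toy example below shows the same mechanism on plain strings. The data frame toy_df and its columns are made up for illustration.

```r
library(purrr)
library(tibble)

# Each row supplies one set of arguments; columns are matched to arguments by name
toy_df <- tibble(
  greeting = c("Hello", "Hi"),
  name = c("R", "ggplot2")
)

toy_out <- pmap(toy_df, function(greeting, name) {
  paste(greeting, name)
})

toy_out
```

In the gallery code, the columns theme_names, theme, plot_names and plot of iter_df play the same role as greeting and name here.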

End Notes

And here it is! That didn’t take that many lines of code, but you can already generate a great number of plots with expand_grid() and pmap().

I should also caveat that this is by no means a “pretty” gallery; it’s very much a minimal implementation, but is good enough for my own consumption.

See https://ggplot2.tidyverse.org/reference/ggtheme.html and https://cmdlinetips.com/2019/10/8-ggplot2-themes/ for instance.↩︎

Vignette: Simulating a minimal SPSS dataset from R

2020-04-30T00:00:00+00:00

What this is about 📖

I will simulate a minimal labelled survey dataset that can be exported as an SPSS (.SAV) file (with full variable and value labels) in R. I will also attempt to fabricate ‘meaningful patterns’ in the dataset such that it can be more effectively used for creating demo examples.

image from Giphy

Background

Simulating data is one of the most useful skills to have in R. For one, it is helpful when you’re debugging code, and you want to create a reprex (reproducible example) to ask for help more effectively (help others help you, as the saying goes).¹ However, regardless of whether you’re a researcher or a business analyst, the data associated with your code is likely to be either confidential, so you cannot share it on Stack Overflow, or way too large or complex for you to upload anyway. Creating an example dataset from a few lines of code which you can safely share is an effective way to get around this problem.

Data simulation is slightly more tricky with survey datasets, which are characterised by (1) labels on both variables and values/codes, and (2) a large proportion of ordinal / categorical variables.

For instance, a Net Promoter Score (NPS) variable is usually accompanied by the variable label “On a scale of 0-10, how likely are you to recommend X to a friend or family?” (i.e. the actual question asked in a survey), and is itself an instance of an ordinal variable. If you are trying to produce an example that hinges on an issue where labels are relevant, you would need to simulate the labels as well.

There are also educational reasons for simulating data: it is useful to simulate data to demo an analysis or a function, because this makes it easy for the audience to reproduce the example. For this purpose, it would be especially beneficial if you can simulate a dataset where you can introduce some arbitrary relationships between the variables, rather than them being completely random (sample() all the way).
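
To make the difference concrete, here is a minimal base R sketch contrasting a completely random variable with one that is deliberately tied to another. The variable names (satisfaction, nps_random, nps_related) and the noise level are arbitrary choices for illustration, and not necessarily the approach taken in the rest of this post.

```r
set.seed(100)
n <- 1000

# A 1-5 satisfaction rating, drawn completely at random
satisfaction <- sample(1:5, size = n, replace = TRUE)

# An NPS-style 0-10 score drawn independently of satisfaction
nps_random <- sample(0:10, size = n, replace = TRUE)

# An NPS-style score that depends on satisfaction:
# rescale to roughly 0-10, add some noise, then round and clip to the valid range
nps_related <- round(satisfaction * 2 + rnorm(n, mean = 0, sd = 1.5))
nps_related <- pmin(pmax(nps_related, 0), 10)

# Compare how strongly each score is associated with satisfaction
cor(satisfaction, nps_random)
cor(satisfaction, nps_related)
```

The second correlation should come out much higher than the first, which is the kind of fabricated pattern that makes a demo dataset more interesting to analyse.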

Personally, I have in the past found it a pain to simulate datasets which are suited for demo-ing survey related functions, especially when I was working on examples for the {surveytoolbox} package 📦. Hence, this is partly an attempt to simulate a labelled dataset that is minimally sufficient for demonstrating some of the {surveytoolbox} functions.

🏷 For more information specifically on manipulating labels in R, do check out a previous post I’ve written on working with SPSS labels in R.

Getting started

To run this example, we’ll need to load {tidyverse}, {surveytoolbox}, and {haven}. Specifically, I’m using {tidyverse} for its data manipulation functions, {surveytoolbox} for functions to set up variable/value labels, and finally {haven} to export the data as a .SAV file.

Note that {surveytoolbox} is not yet available on CRAN, but you can install it by running devtools::install_github("martinctc/surveytoolbox"). You’ll need {devtools} installed, if you haven’t got it already.

In addition to loading the packages, we will also set the seed² with set.seed() to make the simulated numbers reproducible:

library(tidyverse)
library(surveytoolbox) # Install with devtools::install_github("martinctc/surveytoolbox")
library(haven)

set.seed(100) # Enable reproducibility - 100 is arbitrary
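As a quick illustration of what set.seed() buys us, here is a minimal base R sketch (the object names first_draw and second_draw are made up for this example): re-seeding with the same number before each draw reproduces exactly the same "random" sample.

```r
# Re-seeding before each draw reproduces exactly the same "random" numbers
set.seed(100)
first_draw <- sample(1:10, size = 5)

set.seed(100)
second_draw <- sample(1:10, size = 5)

identical(first_draw, second_draw) # TRUE: same seed, same draw
```

This is why the seed only needs to be set once at the top of a script: anyone re-running the whole script from the start should get the same simulated data.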

Create individual vectors

For the purpose of clarity and ease of debugging, my approach will be to first set up each simulated variable as individual labelled vectors, and then bind them together into a data frame at the end. To adorn variable and value labels to a numeric vector, I will use set_varl() and set_vall() from {surveytoolbox} to do these tasks respectively.

I want to create a dataset with 1000 observations, so I will start with creating v_id as an ID variable running from 1 to 1000, which can simply be generated with the seq() function.³ I will then use set_varl() from {surveytoolbox} to set a variable label for the v_id vector. The second argument of set_varl() takes in a character vector and assigns it as the variable label of the target variable - super straightforward.

## Record Identifier
v_id <-
  seq(1, 1000) %>%
  set_varl("Record Identifier")
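For intuition about what a "label" is in R, here is a rough base R sketch. It assumes the {haven} convention, where the variable label lives in an attribute called "label" and the value labels in an attribute called "labels"; I am assuming, rather than asserting, that the {surveytoolbox} helpers do something equivalent under the hood. The vector x is made up for illustration.

```r
# Sketch of the {haven} convention: labels are plain attributes on a vector
# (assumption: the labelling helpers used in this post follow the same convention)
x <- c(1, 2, 2, 3)
attr(x, "label")  <- "Q1. Gender"                             # variable label
attr(x, "labels") <- c("Male" = 1, "Female" = 2, "Other" = 3) # value labels

attr(x, "label")          # the question text attached to the variable
names(attr(x, "labels"))  # the label attached to each code
```

The underlying numbers are untouched, which is why a labelled variable can still be summarised numerically.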

The same goes for v_gender, but this time I want to also (1) apply an arbitrary probability to the distribution, and (2) give each value in the vector a value label (“Male”, “Female”, “Other”).

To do (1), I pass a numeric vector to the prob argument of sample(), giving the probabilities with which 1, 2, and 3 are drawn in each of the 1000 draws.

To do (2), I run set_vall() and pass the desired labels to the value_labels argument. set_vall() accepts a named vector, where the names are the labels and the values are the codes they describe.

Finally, I run set_varl() again to make sure that a variable label is present.

## Gender
v_gender <-
  sample(x = 1:3,
         size = 1000, replace = TRUE,
         prob = c(.48, .48, .04)) %>% # arbitrary probability
  set_vall(value_labels = c("Male" = 1,
                            "Female" = 2,
                            "Other" = 3)) %>%
  set_varl("Q1. Gender")
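A simple sanity check on the prob argument is to tabulate the draws and compare the observed shares with the probabilities that were requested. The sketch below uses base R only and leaves the labels out; the names g and observed are made up for this example.

```r
set.seed(100)
g <- sample(x = 1:3, size = 1000, replace = TRUE, prob = c(.48, .48, .04))

# Observed share of each code; these should land near 0.48 / 0.48 / 0.04
observed <- prop.table(table(g))
round(observed, 2)
```

The shares will not match the requested probabilities exactly, because each draw is random; with 1000 draws they should be close.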

Now that we’ve got our ID variable and a basic grouping variable (gender), let’s also create some mock metric variables.

I want to create a 5-point scale KPI variable (which could represent customer satisfaction or likelihood to recommend). One way to do this is to simply run sample() again, and do the same thing we did for v_gender:

## KPI - #1 simple sampling
v_kpi <-
  sample(x = 1:5,
         size = 1000,
         replace = TRUE) %>%
  set_vall(value_labels = c("Extremely dissatisfied" = 1,
                            "Somewhat dissatisfied" = 2,
                            "Neither" = 3,
                            "Satisfied" = 4,
                            "Extremely satisfied" = 5)) %>%
  set_varl("Q2. KPI")

Whilst the above approach is straightforward, the downside is that the numbers are likely to look completely random if we try to actually analyse the results - which is what sample() is supposed to do - but clearly isn’t ideal.

I want to simulate numbers that are more realistic, i.e. data which will form a discernible pattern when grouping and summarising by gender. What I’ll therefore do is to iterate through each number in v_gender, and sample numbers based on the gender of the ‘respondent’.

The values that are passed below to the prob argument within sample() are completely arbitrary, but are designed to generate results where a bigger KPI value is more likely if v_gender == 1, followed by v_gender == 3, then v_gender == 2.

Note that I’ve used map_dbl() here (from the {purrr} package, part of {tidyverse}), which “loops” through v_gender and returns a numeric value for each iteration.

## KPI - #2 gender-dependent sampling
v_kpi <-
  v_gender %>%
  map_dbl(function(x){
    if(x == 1){
      sample(1:5,
             size = 1,
             prob = c(10, 17, 17, 28, 28)) # Sum to 100
    } else if(x == 2){
      sample(1:5,
             size = 1,
             prob = c(11, 22, 28, 22, 17)) # Sum to 100

    } else {
      sample(1:5,
             size = 1,
             prob = c(13, 20, 20, 27, 20)) # Sum to 100
    }
  }) %>%
  set_vall(value_labels = c("Extremely dissatisfied" = 1,
                            "Somewhat dissatisfied" = 2,
                            "Neither" = 3,
                            "Satisfied" = 4,
                            "Extremely satisfied" = 5)) %>%
  set_varl("Q2. KPI")
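Two things about the weights above are worth spelling out. First, sample() treats prob as weights and rescales them, so they do not have to sum to 1. Second, the weights imply an expected KPI for each gender, which can be worked out directly. The sketch below is a hypothetical base R alternative to the if/else chain, not the approach used in this post: it stores the same weights in a matrix (kpi_weights, a made-up name) and looks up the row matching each gender code.

```r
# Hypothetical alternative: store the weights in a matrix, one row per gender code
kpi_weights <- rbind(
  c(10, 17, 17, 28, 28), # gender == 1
  c(11, 22, 28, 22, 17), # gender == 2
  c(13, 20, 20, 27, 20)  # gender == 3
)

# Expected KPI per gender implied by the weights
expected_kpi <- as.vector(kpi_weights %*% 1:5) / rowSums(kpi_weights)
expected_kpi # 3.47 3.12 3.21

# Draw one KPI value per respondent, using the row that matches their gender
set.seed(100)
gender <- sample(1:3, size = 1000, replace = TRUE, prob = c(.48, .48, .04))
kpi <- vapply(gender,
              function(g) sample(1:5, size = 1, prob = kpi_weights[g, ]),
              FUN.VALUE = integer(1))
```

The expected values confirm the intended ordering: gender 1 scores highest on average, followed by gender 3, then gender 2.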

To add a level of complexity, let me also simulate a mock NPS variable. One way to do this is to punch in random numbers like how it is done above with v_kpi, but this will involve a lot more random punching than is desirable for an 11-point scale NPS variable.

I will therefore instead write a custom function called skew_inputs() that ‘expands’ three arbitrary input numbers into 11 numbers, which will then serve as the probability anchors for my sample() functions later on.

## Generate skew inputs for sample probability
##
## `value1`, `value2` and `value3`
## generate the skewed probabilities
##
skew_inputs <- function(value1, value2, value3){
  
  all_n <-
  c(rep(value1, 7), # 0 - 6
    rep(value2, 2), # 7 - 8
    rep(value3, 2)) # 9 - 10
  
  return(sort(all_n))
}

## Outcome KPI - NPS
v_nps <-
  v_gender %>%
  map_dbl(function(x){
    if(x == 1){

      sample(0:10, size = 1, prob = skew_inputs(1, 1, 8))

    } else if(x == 2){

      sample(0:10, size = 1, prob = skew_inputs(2, 3, 5))

    } else if(x == 3){

      sample(0:10, size = 1, prob = skew_inputs(1, 3, 6))

    } else {

      stop("Error - check x")

    }
  }) %>%
  set_varl("Q3. NPS")
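To make the ‘expansion’ concrete, the sketch below repeats the helper so that it runs on its own, prints the 11 weights for one set of inputs, and rescales them by hand to show the probabilities that sample() ends up using (the names weights and probs are made up for this example).

```r
# Same helper as above, repeated so the snippet runs on its own
skew_inputs <- function(value1, value2, value3){
  all_n <- c(rep(value1, 7), rep(value2, 2), rep(value3, 2))
  return(sort(all_n))
}

weights <- skew_inputs(2, 3, 5)
weights # 2 2 2 2 2 2 2 3 3 5 5

# sample() rescales the weights internally; doing it by hand shows the implied probabilities
probs <- setNames(weights / sum(weights), 0:10)
round(probs, 3)

# Caveat: sort() reorders the weights, so the 0-6 / 7-8 / 9-10 mapping only
# holds when value1 <= value2 <= value3
skew_inputs(5, 3, 2) # 2 2 3 3 5 5 5 5 5 5 5
```

All three calls used for v_nps pass the values in ascending order, so the sorting does not change anything there; it would matter if you wanted to skew the scores towards the low end.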

Admittedly, the above procedure isn’t minimal, but note that this is a trade-off to introduce some arbitrary patterns to the data. A ‘quick and dirty’ alternative simulation would simply be to run sample(x = 0:10, size = 1000, replace = TRUE) for v_nps.

There is one slight technicality: the so-called NPS question is strictly speaking a likelihood to recommend question which ranges from 0 to 10, and the Net Promoter Score itself is calculated on a recoded version of that question where Detractors (scoring 0 to 6) have to be coded as -100, Passives (scoring 7 to 8) as 0, and Promoters (scoring 9 to 10) as +100. The Net Promoter Score is simply calculated as a mean of those recoded values.
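The recoding can be sketched by hand in base R. This is for illustration only and is not how {surveytoolbox} implements it; ltr holds ten made-up likelihood-to-recommend scores.

```r
# Hand-rolled recode for illustration only (not the {surveytoolbox} implementation)
ltr <- c(10, 9, 9, 8, 7, 6, 5, 10, 10, 0) # ten made-up scores on the 0-10 scale

# Promoters (9-10) -> 100, Passives (7-8) -> 0, Detractors (0-6) -> -100
recoded <- ifelse(ltr >= 9, 100, ifelse(ltr >= 7, 0, -100))

mean(recoded) # 20

# Equivalent formulation: % Promoters minus % Detractors
100 * (mean(ltr >= 9) - mean(ltr <= 6)) # 20
```

Here five of the ten respondents are Promoters and three are Detractors, so the score is 50 - 30 = 20.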

Fortunately, the {surveytoolbox} package comes shipped with an as_nps() function that does this recoding for you, and also automatically applies the value labels. Let’s call this new variable v_nps2:

## Outcome KPI - Recoded NPS (NPS2)

v_nps2 <- as_nps(v_nps) %>% set_varl("Q3X. Recoded NPS")

Combine vectors

Now that all the individual variables are set up, I can simply combine them all into a tibble in one swift movement⁴:

#### Combine individual vectors ####
combined_df <-
  tibble(id = v_id,
         gender = v_gender,
         kpi = v_kpi,
         nps = v_nps,
         nps2 = v_nps2)

Results!

image from Giphy

Let’s run a few checks on our dataset to confirm that everything has worked out okay.

The classic {dplyr} glimpse():

combined_df %>% glimpse()

## Observations: 1,000
## Variables: 5
## $ id     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1...
## $ gender <int+lbl> 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2,...
## $ kpi    <dbl+lbl> 3, 2, 5, 2, 5, 5, 2, 2, 5, 3, 3, 4, 1, 5, 4, 4, 4, 1, 2,...
## $ nps    <dbl> 10, 5, 10, 8, 10, 9, 7, 5, 2, 5, 1, 4, 5, 9, 9, 10, 9, 3, 10...
## $ nps2   <dbl+lbl> 100, -100, 100, 0, 100, 100, 0, -100, -100, -100, -100, ...

Then head() to see the first six rows:

combined_df %>% head()

## # A tibble: 6 x 5
##      id     gender                       kpi   nps             nps2
##   <int>  <int+lbl>                 <dbl+lbl> <dbl>        <dbl+lbl>
## 1     1 2 [Female] 3 [Neither]                  10  100 [Promoter] 
## 2     2 2 [Female] 2 [Somewhat dissatisfied]     5 -100 [Detractor]
## 3     3 1 [Male]   5 [Extremely satisfied]      10  100 [Promoter] 
## 4     4 2 [Female] 2 [Somewhat dissatisfied]     8    0 [Passive]  
## 5     5 2 [Female] 5 [Extremely satisfied]      10  100 [Promoter] 
## 6     6 1 [Male]   5 [Extremely satisfied]       9  100 [Promoter]

So it appears that the value labels have been properly attached, and the range of values is what we’d expect. Now what about the “fake patterns”?

Looking at the topline result of the data, we seem to have succeeded in fabricating some sensible patterns in the data. It appears that this company X will need to work harder at winning over its female customers, who have rated them lower on two KPI metrics:

combined_df %>%
  group_by(gender) %>%
  summarise(n = n_distinct(id),
            kpi = mean(kpi),
            nps2 = mean(nps2))

## # A tibble: 3 x 4
##       gender     n   kpi  nps2
##    <int+lbl> <int> <dbl> <dbl>
## 1 1 [Male]     490  3.49 31.0 
## 2 2 [Female]   464  3.07 -8.62
## 3 3 [Other]     46  3.15 17.4
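As a rough cross-check on the nps2 column, the score implied by each set of skew_inputs() weights can be worked out by hand: seven detractor codes carry the first weight, two passive codes the second, and two promoter codes the third. The helper expected_nps() below is made up for this sketch, and it assumes the three values are passed in ascending order, as they are in this post.

```r
# Expected recoded NPS implied by three skew_inputs()-style weights
# (`expected_nps` is a made-up helper; assumes value1 <= value2 <= value3)
expected_nps <- function(value1, value2, value3){
  w <- c(detractor = 7 * value1, passive = 2 * value2, promoter = 2 * value3)
  100 * unname(w["promoter"] - w["detractor"]) / sum(w)
}

expected_nps(1, 1, 8) # Male:   36
expected_nps(2, 3, 5) # Female: -13.33 (rounded)
expected_nps(1, 3, 6) # Other:  20
```

The simulated values of 31.0, -8.62 and 17.4 follow the same ordering and sit reasonably close to these figures; the gaps are what you would expect from random sampling, particularly for the small ‘Other’ group.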

Check the labels 🏷🏷🏷

Finally I’d like to share a couple of functions that enable you to explore the labels in a labelled dataset. surveytoolbox::varl_tb() accepts a labelled data frame, and returns a two-column data frame with the variable name and its corresponding variable label:

combined_df %>% varl_tb()

## # A tibble: 5 x 2
##   var    var_label        
##   <chr>  <chr>            
## 1 id     Record Identifier
## 2 gender Q1. Gender       
## 3 kpi    Q2. KPI          
## 4 nps    Q3. NPS          
## 5 nps2   Q3X. Recoded NPS

surveytoolbox::data_dict() takes this further, and also shows the value labels as a third column. This is effectively what’s typically referred to as a code frame in a market research context:

combined_df %>%
  select(-id) %>%
  data_dict()

##      var        label_var
## 1 gender       Q1. Gender
## 2    kpi          Q2. KPI
## 3    nps          Q3. NPS
## 4   nps2 Q3X. Recoded NPS
##                                                                                label_val
## 1                                                                    Male; Female; Other
## 2 Extremely dissatisfied; Somewhat dissatisfied; Neither; Satisfied; Extremely satisfied
## 3                                                                                       
## 4                                            Detractor; Passive; Promoter; Missing value
##                              value
## 1                          1; 2; 3
## 2                    1; 2; 3; 4; 5
## 3 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; 10
## 4                     -100; 0; 100

I would also highly recommend the view_df() function from {sjPlot}, which exports a similar overview of variables and labels in a nicely formatted HTML table. For huge labelled datasets, this offers a fantastic light-weight way to browse through your variables and labels.

combined_df %>% sjPlot::view_df()

Once we’ve checked all the labels and we’re happy with everything, we can then export our dataset with haven::write_sav()! If everything’s worked properly, all the labels should appear properly if you choose to open your example dataset in SPSS, or Q:

combined_df %>% haven::write_sav("Simulated Dataset.sav")

End notes

I hope you’ve found this vignette useful!

If you ever get a chance to try out {surveytoolbox}, I would really appreciate if you can submit any issues/feedback on GitHub, or get in touch with me directly. I’m looking for collaborators to make the package more user-friendly and powerful, so if you’re interested, please don’t be shy and give me a shout! 😄

¹ Check out this RStudio Community thread to learn more about reprex (the portmanteau reprex was coined by Romain Francois).
² If you’re not familiar with this concept / approach, I’d recommend checking out this Stack Overflow thread.
³ For those who are more ambitious, I would recommend checking out the {uuid} package for generating proper GUIDs (Globally Unique Identifiers). However, this then wouldn’t be minimal, so I would just stick with running a simple seq() sequence.
⁴ I shouldn’t need to footnote this, but here’s a Rocky Flintstone tribute for any Belinkers out there. 🤣

Data Chats: An Interview on Data-Driven Campaigns, Bias & Ethics

2020-04-27T00:00:00+00:00

Background

One of the motives for starting the Data Chats interview series was to shed light on the many ways in which data and analytics professionals operate across different fields and cultures. Previously, Avision Ho (Senior Data Scientist at the British Department for Education at the time) and Abhishek Modi (Data Science Consultant at Deloitte at the time) described the data science career journey and answered technology-specific questions (e.g. favourite R packages). So I thought I’d do an interview on how analytics is applied in a very different, yet important, setting: politics.

This time, I have the pleasure of speaking with Christopher Treshan Perera and Nanthida Rakwong from Worldacquire, a digital consultancy with politics as a core practice area. They launched with a mission to use analytics and digital tech in political campaigns, public affairs and human rights. They notably managed campaigns at the 2019 Thailand general election and the 2019 Hong Kong District Council elections; at the latter, they helped a pro-democracy candidate defeat a long-standing incumbent. Their co-founders have spoken at the United Nations and the UK Parliament on ethical issues in technology and society¹ and managed research for Konrad-Adenauer-Stiftung, Germany’s governing party think tank.

In this in-depth interview, Christopher and Nanthida discuss how they navigate analytics and politics, challenges they encountered (e.g. how to obtain reliable data in Thailand), ethical questions (Cambridge Analytica, GDPR) and other practical considerations.

The Interview

M: It’s great to have you guys here. Tell me a bit about Worldacquire and your journey to bringing together analytics and politics!

C: We launched two years ago to explore how AI and data-driven technology could enhance democracy - whether by helping aspiring politicians win elections or by supporting research and campaigning efforts of parties and public organisations.

N: I grew up in Thailand, but moved to London in 2010 to work as a political consultant for Amsterdam and Partners, an international law firm representing political figures. My most important case was in Thai politics - bringing the Thai junta (military government) to justice at the International Criminal Court (ICC) over the 2010 Bangkok Massacres. I also advised the “Red Shirts” pro-democracy movement in Thailand and various other political parties around the world.

C: I have a more corporate and techie background. My career started as a data analyst at Bloomberg in London, followed by several years at viagogo, an online marketplace for sports and show tickets, where I rose from digital marketing exec to global advertising management. It is common to wear many hats at a tech startup and mine included data analysis, business intelligence, product management and algorithm design for marketing APIs. Then I moved to American Express where my task was to transfer digital marketing know-how from the tech world to “big finance”.

Before Worldacquire I already wanted to connect tech with social causes. Back in 2015 I founded Outreach Digital, an entirely volunteer-run association making digital skills more accessible; it has since become the largest meetup group in London’s tech and digital space.

M: That’s a great fusion of politics and digital marketing. Also, Chris - we first met at one of those meetup groups, which makes a great case for their networking value. So what exactly got you thinking: wow, it would be great to merge analytics and political consulting?

N: While I was advising the Thai pro-democracy movement, I realized how crucial it was to understand a situation in real-time in order to make better decisions. By collecting a lot of data, compiling and analysing it faster, we could operate in a more efficient and scalable way. That led to my interest in data, analytical technologies and AI, and I immediately saw their dangers, too: the Cambridge Analytica scandal, for example, was a misuse of those technologies. But you can change something for the better only if you engage and participate in shaping it.

M: Using AI and data to make better decisions - could you describe how you can do that in politics? How do other organizations do that today?

N: Let’s say you want to understand millions of people including your potential voters. By gathering data from multiple sources, including Facebook, Twitter, forums, emails and more, you can find new paths to make everyone work together towards the same goal. If you want to run a political campaign of any kind but don’t understand what exactly people want, it’s very hard to bring them together.

Moreover, you always need to identify new supporters. There may be people who are unsure about your cause or movement - perhaps they are friends of your ardent supporters - but they hesitate to join because they don’t understand it well enough. You can use AI and data technology to understand what they need and how to best communicate with them.

C: One of the things Cambridge Analytica did was to uncover a blind spot in the political “market share” or electorate and target those swing voters (and people who never wanted to vote) using unethical advertising methods, including fake news.

This raises the question of whether the technology can be used in an ethical and transparent way. If everyone was aware of the practices, if parties or candidates communicated transparently about them, then perhaps we’d have a different situation even in the UK right now. The other consideration is which specific AI and analytics methods to apply: you could implement recommendation systems, pattern recognition techniques or combine existing methods.

N: And use them to spread truth and support good causes, rather than rumours, disinformation and division.

M: So you do the opposite of what Cambridge Analytica did by using the technologies in a more ethical and transparent way?

N: That’s right. If we do not engage in this game, we basically allow malicious or unethical players to misuse it. If we withdraw we cannot help shape new solutions and perspectives.

C: Think practically! Advertising has a very long history - newspaper, TV, radio and billboards. Over the decades every new medium got increasingly regulated, but advertising still exists and it has arguably become more transparent and ethical. If traditional advertising channels can improve over time, so can digital advertising.

M: You also believe that by using these techniques for a good cause, we can make our democratic systems more resilient and less prone to being abused?

C&N: Correct! Moreover, they can also be used to improve public services and government-to-citizen communications, especially during a crisis.

M: Now let’s talk about the election campaign that you guys managed in Thailand. Could you tell me more about it?

N: First of all, it was a long-awaited election because Thailand has been ruled by an authoritarian military junta since 2014. We advised a first-time candidate from a new party standing for MP in Bangkok. Candidates were given only six weeks to campaign - a major challenge considering this was one of the largest constituencies.

Without any data to start with, we went to the local administration office to request the electoral register, but they essentially refused to share anything. We suspected this happened because of the deep influence of the incumbent. We had to think differently!

We could have started canvassing (surveying and campaigning) from door to door, but this would have been an issue for several reasons: firstly, Thai people don’t vote based on where they reside, but where their home was registered. Thus, people who live in a constituency may not have the right to vote there. Secondly, the dangerous political climate and Thailand’s harsh censorship laws made people extremely wary of sharing their political views or past voting behaviour.

Another option was to collect data and communicate online through social media advertising. Unfortunately the party leadership wasn’t willing to invest in it. They preferred to play it safe and spend money on leaflets and billboards instead.

We ultimately went for a sampling method using fieldwork mapping: we divided the constituency into smaller areas based on the polling stations that cover them and interviewed a sample of people in each area (this was still challenging considering how wary people were!) and built our understanding of the overall constituency based on data. We facilitated this by using an app called Mela.

C: We didn’t use any advanced AI magic in this instance, but the project highlighted how important it is to get the right data and to ensure that the initial dataset is clean. What data scientists often do is go on the internet and pick whatever datasets are available online - but these are often outdated, incomplete or even biased.

Especially in a developing country with poor accountability and no balance of powers, it is hard to verify if research and survey data is correct. You have to go hands-on and create the conditions for people to share their genuine views. Once the data is accurate, you can start doing advanced stuff.

N: We would have certainly received more valuable insights had the party leadership approved social media advertising. It takes more effort to measure the conversion rate of leaflets and billboards. Especially if you only have six weeks to campaign.

C: Nonetheless we were able to run a smaller-scale test and get accurate social insights. Ideally, we would have gathered enough data to run prediction and network algorithms to evaluate the profile, behaviour and preferences of each constituency sub-area, and thereby understand which political issues mattered to them the most. Then we could have tailored the messages that would best resonate with each group. Despite the small amount of data, it was enough to draw some important conclusions about the potential voters.

M: It really makes sense when facing such time constraints, so it’s a shame that this is underutilized in election campaigns. Chris, what is the biggest difference between digital advertising in a commercial vs a political setting?

C: In business you have more wiggle space for your message choice. You can use superlatives like “best tickets” or “best concerts”. In politics people do that, too, but it can be dangerous; think about the Brexit bus with the exaggerated claim about the NHS money. There will also be very different budgets due to both internal and external factors. Businesses are often more willing to invest in advertising as it leads to direct and immediate sales. In politics, the “sales cycle” can take much longer.

N: In many countries, including Thailand, there’s also a legal cap on campaigning spend.

C: Another issue is visibility. Most digital platforms decide which content to display based on ranking algorithms; one factor that influences those algorithms is pre-existing activity and performance. For example, if you want to advertise a car on Google, and you build a Google Ads campaign around the keyword “car”, Google’s algorithms will already know that this is something businesses want to advertise based on historic performance data. The algorithms will also know that people click on those ads after searching “car”. On the contrary, if a keyword was never used before or isn’t typically associated with people clicking on ads, Google will wait a little longer before displaying ads for it. So there can be some delays before an ad for a new politician is actually visible, but it is still faster than trying to get a blog post to go viral.

M: So there’s both a legal and a search engine strategy aspect. How about analytics in business vs politics?

C: There are many similarities. You just need to translate a concept from one field into another. In business, KPIs and metrics are formed around impressions, actions (sales), CTR and conversion rates. In politics it’s more about long-term performance, maybe along the lines of CLV (customer lifetime value).

N: Having many clicks on your ad or post doesn’t necessarily translate into votes. It could even be negative - think about Prince Andrew!

M: What were your top challenges at the Thai election?

N: The lack of accurate data and the poor awareness about the importance of data by the leadership - especially current or real-time data. People still rely on old reports and outdated information.

M: What was the outcome of the election campaign?

N: Our candidate lost, but exceeded our expectations. More importantly, the winning candidate was from the Future Forward Party, another new and allied party that did invest significantly in its social media at a national level. We observed that they also used tailored, targeted advertising and A/B-testing to gather data about voter preferences. The political party’s image really helped that candidate. Like our candidate, he was not a resident of the constituency yet still won. This was the very first time that anyone used social media as a key channel for data collection in an election campaign in Thailand.

C: Indeed, this was forward-thinking. Many political campaigners around the world use social media, but don’t make the most of its advanced algorithms and data-gathering capabilities. Considering the difficulties we had in accessing data, one of the biggest learnings is that even in the face of authoritarian red tape and bureaucracy, digital platforms can help overcome hurdles in understanding your audience.

M: Let’s talk about regulation and ethics. Is data the new oil? A valuable resource that helps society progress?

C: Yes, it can be - depending on which society it will be used in. If a society has strong data protection and privacy laws, then data can be used hand-in-hand with democratic principles. If not, then it can be a very “bad oil”.

Looking back in time, radio was “the new oil” at some point. TV, too. From a regulatory point of view, they all provoked concerns (including about propaganda), but over time different bodies and regulations were formed to address them, such as today’s ASA in the UK; and now for data and digital technologies, we have the ICO in the UK.

M: It’s important to understand where all the data people provide goes and I believe there’s also the question of consent, which is covered by GDPR. What I find really interesting is algorithmic bias, but also the idea that people can better judge what is right and wrong only after they are educated about how the algorithm works.

C: Exactly, and this is what regulators are starting - and should be starting - to think about.

There are some who think that political advertising should be completely banned. Twitter went down this road and many applauded their decision, with the main perception being that data and algorithms can be misused the way Cambridge Analytica did. However, what was ignored was the fact that all these algorithms don’t need to work in a black box - in fact, they can be revealed, changed, overridden.

For example, a recommendation system could, instead of saying “A is better than B”, explain “We recommend A over B because our algorithm observed that you like x, y and z.” Companies may be reluctant to fully reveal their algorithm code, but they could at least give an idea of what parameters are taken into account, what outcomes can be expected, and why. Once again, transparency is key.

Another aspect ignored in the whole debate about political advertising is the fact that if digital platforms like Twitter ban them, newcomer politicians will struggle to gain a following or communicate with their target audience if they have a time constraint, like in the case of Thailand. Banning political ads carries huge disadvantages, especially for politicians who already lack resources.

The right approach would have been to push for more transparency - not only in advertising (“paid”) algorithms, but also in the “organic” and “earned” algorithms used on the very same social media platforms.

N: Moreover, Twitter appears not to care as much as they say about what content is posted on the platform. If they can ban political ads, why do they do so little against hate speech and other online harms? Many, including myself, have experienced harassment and public death threats on Twitter, yet Twitter refused to cooperate swiftly and proactively with the British police - instead they shifted the burden of proof to the victims.

M: It is also very unclear how they will actually implement the ban.

N: I honestly think it’s dangerous that Jack Dorsey (the CEO of Twitter) calls for “earned” popularity and encourages a culture of going viral. If a tweet goes viral does it really mean that it “earned” it? Is it really more accurate and correct than other tweets? More than a few times, a viral message has spread false accusations and fake news.

C: To make things worse, going viral is also determined by algorithms. How does Twitter decide which content should get more visibility? Is it the likes, the retweets, the popularity of the tweet author and their followers? We have seen (and tested) how easily this can be manipulated.

So banning political ads doesn’t solve the problem of powerful obscure algorithms, as organic content is decided by even less transparent ones! Typically, such algorithms seem to favour users who already have a strong following - in the case of an election, this is often the incumbent. Newcomers can be heavily penalized by this dynamic.

Another issue is fake users and bots on Twitter. These can be bought in thousands or more to mass-like or mass-follow and exaggerate the popularity of a particular user or a post. Equally, your competitor could get 10,000 Twitter bots or fake users to report your public posts as spam or abusive. This is an easy way to manipulate the system (also tested in Thailand). The Twitter algorithm will likely disregard the fact that the users are fake and make the falsely reported tweet disappear even if it is genuine and popular.

M: This is something I have seen in Hong Kong, too. Do you think censorship is a solution?

C: Did TVs ban political content? Do billboards ban political ads? Not really, so we really don’t think that banning political ads is a solution.

N: It should be more about regulating content, what is not OK to say.

M: How do you differentiate yourselves from other consultancies or agencies that do the same or similar things as you?

N: Firstly, aside from working on the big picture strategy we actually also implement it. Working hands-on gives us a much more tangible picture of the dynamics, limitations and issues that could be faced.

C: When you hear academics, researchers and thought leaders jump enthusiastically to praise Jack Dorsey for banning political ads on Twitter, a practitioner who personally set up digital campaigns for politics will tell you how disastrous the effects of his decision can be. Our USP is that we work on the ground and thoroughly understand technical implications and their consequences.

M: What is your vision for your business?

C: We truly believe that AI, data and digital technologies can be used for good causes - and we want to show that this is true and applies to anywhere around the world. It is also important for people to understand the uses of these technologies and the actors who control them. We want to help people understand both sides of the coin. Many governments and NGOs have digital and data on their agendas, but often seem to have a very superficial sense of the technologies - we want to help there. And we want to be involved in and lead the societal, political and ethical debates around these technologies, as well as demystify the exaggerated perceptions of danger.

M: Is there any advice you have for data scientists who aspire to work on political projects, political data scientists, or data scientists in general?

C: Get the right data, know the sources and eliminate biases! This includes statistical bias (especially sampling, funding and reporting bias), cognitive bias and other social biases. Online datasets may be easier to obtain, but are outdated or not real time enough. Worse, they could be doctored to reflect the narratives of an authoritarian regime or lobbyist group.

Ask yourselves: what were the context and conditions during the data collection process? What kind of limitations existed? Whether it’s data from a sentiment analysis report or a simple survey, what could be wrong with the data? Could there be any noise? Anything unusual?

Also, the logic behind the metrics in a dataset can be misleading; think about GDP, a measure for economic growth. Does a growing GDP mean the country is improving and everyone is better off? Not really. If you look closer you might see that the GDP growth is distributed only to a small percentage of the population.

How were survey responses recorded? Did the method change during the campaign? What’s the logic behind the metrics? What kind of issues may lead to the data being wrong? Could there be a situation of reluctant journalists or silenced human rights activists?

Finally, ethics is not only for philosophers, but also for engineers. This will be a hot topic over the next years, and AI and data specialists will need to be able to explain to consumers and other stakeholders the different problems and solutions in algorithm-driven products.

N: In politics and economics especially, it is really important to ask who created the dataset, who financed or sponsored it, and who really controls the overall character and narrative of the data.

C: You should never be afraid to go out there on the field and collect the data by yourself - it can be really fun!

M: Thank you again for your time and for this very fascinating interview.

Endnotes

One theme I found interesting is the ubiquity of analytics applications. But some of the data challenges that Nanthida and Chris raised are very real, and confirm the view that a considerable chunk of time in data analysis is spent on collecting, cleaning and getting the data right for analysis in the first place, not only on the analysis itself.

I hope you’ve enjoyed reading the above interview. If you would like to get in touch with Christopher and Nanthida, you may reach them through their website here. I’m also looking to do more interviews, so if you are a data / analytics practitioner and you think you have something interesting to share, please feel free to get in touch!

Image credits: Artem Bryzgalov, Kelvin Yup, Frida Aguilar Estrada, Franz Wender, ThisisEngineering, camilo jimenez, dole777, jbdodane, History in HD, Jonathan Francisca, Carlos Muza

Themes: political analytics, political data science, microtargeting, political advertising, data collection, social analytics, data bias, data integrity, algorithmic bias, statistical bias, sample bias, observer bias, bias, data ethics, data regulation, privacy, AI ethics, gdpr, tech for good, surveys, cambridge analytica, thailand, algorithm design, botnets, disinformation

See UN-related work on digital ethics and transparency, and on anti-disinformation: https://worldacquire.com/2019/12/09/worldacquire-at-the-united-nations-igf-2019/ and https://worldacquire.com/2020/02/27/online-disinformation-and-extremism-how-it-spreads-and-how-to-stop-it/↩︎

Data cleaning with Kamehamehas in R

2020-04-11T00:00:00+00:00

Background

Given present circumstances in the world, I thought it might be nice to write a post on a lighter subject.

Recently, I came across an interesting Kaggle dataset that features the power levels of Dragon Ball characters at different points in the franchise. Whilst the dataset itself is quite simple with only four columns (Character, Power_Level, Saga_or_Movie, Dragon_Ball_Series), I noticed that you do need to do a fair amount of data and string manipulation before you can perform any meaningful data analysis with it. Therefore, if you’re a fan of Dragon Ball and interested in learning about string manipulation in R, this post is definitely for you!

The Kamehameha - image from Giphy

For those who aren’t as interested in Dragon Ball but still interested in general R tricks, please do read ahead anyway - you won’t need to understand the references to know what’s going on with the code. But you have been warned about spoilers! 😂

Functions or techniques that are covered in this post:

Basic regular expression (regex) matching
stringr::str_detect()
stringr::str_remove_all() or stringr::str_remove()
dplyr::anti_join()
Example of a ‘dark mode’ ggplot using theme()

Getting started

You can download the dataset from Kaggle, although you’ll need to register an account to do so. I would highly recommend registering if you haven’t already, since they’ve got tons of datasets available on the website which you can practise on.

The next thing I’ll do is to set up my R working directory in this style, and ensure that the dataset is saved in the datasets subfolder. I’ll use the {here} workflow for this example, which is generally good practice as here::here implicitly sets the path root to the top level of the current project.

Let’s load our packages and explore the data using glimpse():

library(tidyverse)
library(here)

dball_data <- read_csv(here("datasets", "Dragon_Ball_Data_Set.csv"))

dball_data %>% glimpse()

## Observations: 1,244
## Variables: 4
## $ Character          <chr> "Goku", "Bulma", "Bear Thief", "Master Roshi", "...
## $ Power_Level        <chr> "10", "1.5", "7", "30", "5", "8.5", "4", "8", "2...
## $ Saga_or_Movie      <chr> "Emperor Pilaf Saga", "Emperor Pilaf Saga", "Emp...
## $ Dragon_Ball_Series <chr> "Dragon Ball", "Dragon Ball", "Dragon Ball", "Dr...

…and also tail() to view the last six rows of the data, just so we get a more comprehensive picture of what some of the other observations in the data look like:

dball_data %>% tail()

## # A tibble: 6 x 4
##   Character              Power_Level       Saga_or_Movie       Dragon_Ball_Seri~
##   <chr>                  <chr>             <chr>               <chr>            
## 1 Goku (base with SSJG ~ 448,000,000,000   Movie 14: Battle o~ Dragon Ball Z    
## 2 Goku (MSSJ with SSJG'~ 22,400,000,000,0~ Movie 14: Battle o~ Dragon Ball Z    
## 3 Goku (SSJG)            224,000,000,000,~ Movie 14: Battle o~ Dragon Ball Z    
## 4 Goku                   44,800,000,000    Movie 14: Battle o~ Dragon Ball Z    
## 5 Beerus (full power, n~ 896,000,000,000,~ Movie 14: Battle o~ Dragon Ball Z    
## 6 Whis (full power, nev~ 4,480,000,000,00~ Movie 14: Battle o~ Dragon Ball Z

Who does the strongest Kamehameha? 🔥

In the Dragon Ball series, there is an energy attack called Kamehameha, which is a signature (and perhaps the most well recognised) move by the main character Goku. This move is however not unique to him, and has also been used by other characters in the series, including his son Gohan and his master Muten Roshi.

Goku and Muten Roshi - image from Giphy

As you’ll see, this dataset includes observations which detail the power level of the notable occasions when this attack was used. Our task here is to get some understanding about this attack move from the data, and see if we can figure out whose kamehameha is actually the strongest out of all the characters.

Data cleaning

Here, we use regex (regular expression) string matching to filter on the Character column. The str_detect() function from the {stringr} package detects whether a pattern or expression exists in a string, and returns a logical value of either TRUE or FALSE (which is what dplyr::filter() takes in the second argument). I also used the stringr::regex() function and set the ignore_case argument to TRUE, which makes the filter case-insensitive, such that cases of ‘Kame’ and ‘kAMe’ are also picked up if they do exist.

dball_data %>%
  filter(str_detect(Character, regex("kameha", ignore_case = TRUE))) -> dball_data_1

dball_data_1 %>% head()

## # A tibble: 6 x 4
##   Character                     Power_Level Saga_or_Movie      Dragon_Ball_Seri~
##   <chr>                         <chr>       <chr>              <chr>            
## 1 Master Roshi's Max Power Kam~ 180         Emperor Pilaf Saga Dragon Ball      
## 2 Goku's Kamehameha             12          Emperor Pilaf Saga Dragon Ball      
## 3 Jackie Chun's Max power Kame~ 330         Tournament Saga    Dragon Ball      
## 4 Goku's Kamehameha             90          Red Ribbon Army S~ Dragon Ball      
## 5 Goku's Kamehameha             90          Red Ribbon Army S~ Dragon Ball      
## 6 Goku's Super Kamehameha       740         Piccolo Jr. Saga   Dragon Ball
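To see what the case-insensitive match does in isolation, here is a minimal sketch on a made-up character vector (the strings below are invented for illustration and are not rows from the dataset). It compares a plain, case-sensitive pattern with the regex(..., ignore_case = TRUE) version:

```r
library(stringr)

# Made-up labels for illustration only; these are not rows from the dataset
moves <- c("Goku's Kamehameha", "Max power KAMEHAMEHA",
           "goku's kamehameha", "Spirit Bomb")

# Case-sensitive match: only the exact spelling "Kamehameha" is detected
strict <- str_detect(moves, "Kamehameha")

# Case-insensitive match: any capitalisation of "kameha" is detected
loose <- str_detect(moves, regex("kameha", ignore_case = TRUE))

data.frame(moves, strict, loose)
```

Since both results are plain logical vectors, either can be passed straight into dplyr::filter().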

If this filter feels convoluted, it’s for a good reason. The dataset uses a variety of cases and spellings, which a ‘straightforward’ case-sensitive filter wouldn’t have picked up. There are two such rows:

dball_data %>%
  filter(str_detect(Character, "Kamehameha")) -> dball_data_1b

## Show the rows in dball_data_1 that have no match in dball_data_1b
dball_data_1 %>%
  dplyr::anti_join(dball_data_1b, by = "Character")

## # A tibble: 2 x 4
##   Character                        Power_Level Saga_or_Movie   Dragon_Ball_Seri~
##   <chr>                            <chr>       <chr>           <chr>            
## 1 Jackie Chun's Max power Kameham~ 330         Tournament Saga Dragon Ball      
## 2 Android 19 (Goku's kamehameha a~ 230,000,000 Android Saga    Dragon Ball Z
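If anti_join() is new to you, the sketch below shows the same idea on two tiny made-up tables (the names all_matches and strict_matches are hypothetical and only used here). It keeps the rows of the first table that have no matching key in the second:

```r
library(dplyr)

# Toy tables for illustration only; not taken from the dataset
all_matches    <- tibble(Character = c("A's Kamehameha",
                                       "B's kamehameha",
                                       "C's Kamehameha"))
strict_matches <- tibble(Character = c("A's Kamehameha",
                                       "C's Kamehameha"))

# Keep the rows of the first table with no matching Character in the second
missed <- anti_join(all_matches, strict_matches, by = "Character")
missed
```

This makes anti_join() a handy way to audit what a stricter filter would have dropped.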

Before we go any further with any analysis, we’ll also need to do something about Power_Level, as it is currently in the form of character / text, which means we can’t do any meaningful analysis until we convert it to numeric. To do this, we can start with removing the comma separators with stringr::str_remove_all(), and then run as.numeric().

In ‘real life’, you often get data saved with k and m suffixes for thousands and millions, which will require a bit more cleaning to do - so here, I’m just thankful that all I have to do is to remove some comma separators.

dball_data_1 %>%
  mutate(Power_Level = str_remove_all(Power_Level, ",")) %>% # Remove comma separators
  mutate(Power_Level = as.numeric(Power_Level)) -> dball_data_2 # Convert text to numeric

dball_data_2 %>% tail()

## # A tibble: 6 x 4
##   Character           Power_Level Saga_or_Movie                Dragon_Ball_Seri~
##   <chr>                     <dbl> <chr>                        <chr>            
## 1 Goku's Super Kame~  25300000000 OVA: Plan to Eradicate the ~ Dragon Ball Z    
## 2 Family Kamehameha  300000000000 Movie 10: Broly- The Second~ Dragon Ball Z    
## 3 Krillin's Kameham~      8000000 Movie 11: Bio-Broly          Dragon Ball Z    
## 4 Goten's Kamehameha    950000000 Movie 11: Bio-Broly          Dragon Ball Z    
## 5 Trunk's Kamehameha    980000000 Movie 11: Bio-Broly          Dragon Ball Z    
## 6 Goten's Super Kam~   3000000000 Movie 11: Bio-Broly          Dragon Ball Z
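As an aside on the k and m suffixes mentioned earlier, here is a rough sketch of how that extra cleaning could look. The helper parse_suffixed() is hypothetical (it is not from any package), and it assumes that the only suffixes are k for thousands and m for millions:

```r
library(stringr)

# Hypothetical helper (not part of any package): converts strings such as
# "1.5k" or "2m" to numbers, assuming k = thousands and m = millions only
parse_suffixed <- function(x) {
  x <- str_remove_all(x, ",")                        # drop comma separators
  suffix <- str_to_lower(str_extract(x, "[kKmM]$"))  # NA when no suffix
  value <- as.numeric(str_remove(x, "[kKmM]$"))      # numeric part only
  multiplier <- ifelse(is.na(suffix), 1,
                       ifelse(suffix == "k", 1e3, 1e6))
  value * multiplier
}

parse_suffixed(c("180", "1.5k", "2m", "230,000,000"))
```

Anything that doesn’t fit these assumptions (for example a b suffix for billions) would come back as NA with a warning, which is a useful prompt to inspect those rows.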

Now that we’ve fixed the Power_Level column, the next step is to isolate the information about the characters from the Character column. The reason why we have to do this is because, inconveniently, the column provides information for both the character and the occasion of when the kamehameha is used, which means we won’t be able to easily filter or group the dataset by the characters only.

One way to overcome this problem is to use the apostrophe (or single quote) as a delimiter to extract the characters from the column. Before I do this, I will take another manual step to remove the rows corresponding to absorbed kamehamehas, e.g. Android 19 (Goku’s kamehameha absorbed), as it refers to the character’s power level after absorbing the attack, rather than the attack itself. (Yes, some characters are able to absorb kamehameha attacks and make themselves stronger..!)

After applying the filter, I use mutate() to create a new column called Character_Single, and then str_remove_all() to remove the apostrophe and all the characters that appear after it:

dball_data_2 %>%
  filter(!str_detect(Character, "absorbed")) %>% # Remove 2 rows unrelated to kamehameha attacks
  mutate(Character_Single = str_remove_all(Character, "\\'.+")) %>% # Remove everything after apostrophe
  select(Character_Single, everything()) -> dball_data_3

dball_data_3 %>% head(10)

## # A tibble: 10 x 5
##    Character_Single Character       Power_Level Saga_or_Movie   Dragon_Ball_Ser~
##    <chr>            <chr>                 <dbl> <chr>           <chr>           
##  1 Master Roshi     Master Roshi's~         180 Emperor Pilaf ~ Dragon Ball     
##  2 Goku             Goku's Kameham~          12 Emperor Pilaf ~ Dragon Ball     
##  3 Jackie Chun      Jackie Chun's ~         330 Tournament Saga Dragon Ball     
##  4 Goku             Goku's Kameham~          90 Red Ribbon Arm~ Dragon Ball     
##  5 Goku             Goku's Kameham~          90 Red Ribbon Arm~ Dragon Ball     
##  6 Goku             Goku's Super K~         740 Piccolo Jr. Sa~ Dragon Ball     
##  7 Goku             Goku's Kameham~         950 Saiyan Saga     Dragon Ball Z   
##  8 Goku             Goku's Kameham~       36000 Saiyan Saga     Dragon Ball Z   
##  9 Goku             Goku's Kameham~       44000 Saiyan Saga     Dragon Ball Z   
## 10 Goku             Goku's Angry K~   180000000 Frieza Saga     Dragon Ball Z

Note that the apostrophe in the pattern is escaped by adding two backslashes before it. Strictly speaking, the apostrophe is not a special character in regular expressions, so the escape is optional here, but it does no harm. The dot (.) matches any character, and + tells R to match the preceding dot one or more times. Regex is a very useful thing to learn, and I would highly recommend just reading through the linked references below if you’ve never used regular expressions before.¹
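As a quick check of what the pattern does, the sketch below applies it to a single string, once with the escaped apostrophe and once without. The string is just an example value written in the style of the Character column:

```r
library(stringr)

# Example value in the style of the Character column
x <- "Goku's Super Kamehameha"

# The apostrophe escaped with two backslashes, as in the pipeline above
with_escape <- str_remove_all(x, "\\'.+")

# The same pattern with a plain apostrophe
without_escape <- str_remove_all(x, "'.+")

with_escape
without_escape
```

Both versions strip the apostrophe and everything after it, leaving just the character’s name.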

Analysis

Now that we’ve got a clean dataset, what can we find out about the Kamehamehas?

The Kamehameha - image from Giphy

My approach is to start by calculating the average power levels of Kamehamehas in R, grouped by Character_Single. The resulting table tells us that on average, Goku’s Kamehameha is the most powerful, followed by Gohan:

dball_data_3 %>%
  group_by(Character_Single) %>%
  summarise_at(vars(Power_Level), ~mean(.)) %>%
  arrange(desc(Power_Level)) -> kame_data_grouped # Sort by descending

kame_data_grouped

## # A tibble: 11 x 2
##    Character_Single           Power_Level
##    <chr>                            <dbl>
##  1 Goku                           3.46e14
##  2 Gohan                          1.82e12
##  3 Family Kamehameha              3.00e11
##  4 Super Perfect Cell             8.00e10
##  5 Perfect Cell                   3.02e10
##  6 Goten                          1.98e 9
##  7 Trunk                          9.80e 8
##  8 Krillin                        8.00e 6
##  9 Student-Teacher Kamehameha     1.70e 4
## 10 Jackie Chun                    3.30e 2
## 11 Master Roshi                   1.80e 2

However, it’s not helpful to directly visualise this on a bar chart, as the Power Level of the strongest Kamehameha is 175,433 times greater than the median!

kame_data_grouped %>%
  pull(Power_Level) %>%
  summary()

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 1.800e+02 4.008e+06 1.975e+09 3.170e+13 1.900e+11 3.465e+14

A way around this is to log transform the Power_Level variable prior to visualising it, saving the result into a new column called Power_Index. Then, we can pipe the data directly into a ggplot chain, and set a dark mode using theme():

kame_data_grouped %>%
  mutate(Power_Index = log(Power_Level)) %>% # Log transform Power Levels
  ggplot(aes(x = reorder(Character_Single, Power_Level),
             y = Power_Index,
             fill = Character_Single)) +
  geom_col() +
  coord_flip() +
  scale_fill_brewer(palette = "Spectral") +
  theme_minimal() +
  geom_text(aes(y = Power_Index,
                label = round(Power_Index, 1),
                hjust = -.2),
            colour = "#FFFFFF") +
  ggtitle("Power Levels of Kamehamehas", subtitle = "By Dragon Ball characters") +
  theme(plot.background = element_rect(fill = "grey20"),
        text = element_text(colour = "#FFFFFF"),
        panel.grid = element_blank(),
        plot.title = element_text(colour="#FFFFFF", face="bold", size=20),
        axis.line = element_line(colour = "#FFFFFF"),
        legend.position = "none",
        axis.title = element_text(colour = "#FFFFFF", size = 12),
        axis.text = element_text(colour = "#FFFFFF", size = 12)) +
  ylab("Power Levels (log transformed)") +
  xlab(" ")
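A small note on the transform itself: log() in R uses the natural logarithm by default, so that is what Power_Index holds. The sketch below takes two of the grouped averages from the table above and compares this with log10(), which is an alternative I haven’t used in the chart but whose values read directly as orders of magnitude:

```r
# Two grouped averages copied from the table above
power <- c("Master Roshi" = 1.80e2, "Goku" = 3.46e14)

# log() uses base e by default, which is what Power_Index holds in the chart
natural <- round(log(power), 1)

# log10() is an alternative (not used in the chart) that reads as
# orders of magnitude
base10 <- round(log10(power), 1)

natural
base10
```

Either way, the ordering of the bars stays the same; only the scale of the labels changes.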

So as it turns out, the results aren’t too surprising. Goku’s Kamehameha is the strongest of all the characters on average, although it has been referenced several times in the series that his son Gohan’s latent powers are beyond Goku’s.

Also, it is perhaps unsurprising that Master Roshi’s Kamehameha is the least powerful, given a highly powered comparison set of characters. Interestingly, Roshi’s Kamehameha is stronger as ‘Jackie Chun’ than as himself.

We can also see the extent to which Goku’s Kamehameha has grown more powerful across the series, using the information available in the column Saga_or_Movie. Following the same approach as above, we can do this by grouping the data by Saga_or_Movie, and piping this into a ggplot bar chart:

dball_data_3 %>%
  filter(Character_Single == "Goku") %>%
  mutate(Power_Index = log(Power_Level)) %>% # Log transform Power Levels
  group_by(Saga_or_Movie) %>%
  summarise(Power_Index = mean(Power_Index)) %>%
  ggplot(aes(x = reorder(Saga_or_Movie, Power_Index),
             y = Power_Index)) +
  geom_col(fill = "#F85B1A") +
  theme_minimal() +
  geom_text(aes(y = Power_Index,
                label = round(Power_Index, 1),
                vjust = -.5),
            colour = "#FFFFFF") +
  ggtitle("Power Levels of Goku's Kamehamehas", subtitle = "By Saga/Movie") +
  scale_y_continuous(limits = c(0, 40)) +
  theme(plot.background = element_rect(fill = "grey20"),
        text = element_text(colour = "#FFFFFF"),
        panel.grid = element_blank(),
        plot.title = element_text(colour="#FFFFFF", face="bold", size=20),
        plot.subtitle = element_text(colour="#FFFFFF", face="bold", size=12),
        axis.line = element_line(colour = "#FFFFFF"),
        legend.position = "none",
        axis.title = element_text(colour = "#FFFFFF", size = 10),
        axis.text.y = element_text(colour = "#FFFFFF", size = 8),
        axis.text.x = element_text(colour = "#FFFFFF", size = 8, angle = 45, hjust = 1)) +
  ylab("Power Levels (log transformed)") +
  xlab(" ")

I don’t have full knowledge of the chronology of the franchise, but I do know that Emperor Pilaf Saga, Red Ribbon Army Saga, and Piccolo Jr. Saga are the earliest story arcs where Goku’s martial arts abilities are still developing. It also appears that if I’d like to witness Goku’s most powerful Kamehameha attack, I should find this in the Baby Saga!

Notes

Hope this was an interesting read for you, and that this tells you something new about R or Dragon Ball.

There is certainly more you can do with this dataset, especially once it is processed into a usable, tidy format.

If you have any related datasets that will help make this analysis more interesting, please let me know!

In the meantime, please stay safe and take care all!

See https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html and https://stringr.tidyverse.org/articles/regular-expressions.html ↩︎

RStudio Projects and Working Directories: A Beginner’s Guide

2020-01-23T00:00:00+00:00

Introduction 📂📂📂

This post provides a basic introduction on how to use RStudio Projects and structure your working directories - which is well worth a read if you are still using setwd() to set your directories!

Although the R working directory is quite a basic and reasonably well-covered subject, I felt that it would still be worth sharing my own approach of structuring working directories, as clearly there can be multiple sensible and valid ways of structuring a working directory. The project directory structure covered in this post is one that I use day-to-day myself, and one that I find the most appropriate for the kind of analysis work that I typically deal with, i.e. data sets loaded into memory, and saved within the working directory itself.

If you are just starting out in R, my personal advice is that using RStudio projects and structuring working directories are ‘must-knows’. Using RStudio projects eliminates so much of the early-stage hassle and confusion around reading in and exporting data. Setting up a working directory properly also helps build up good habits that are conducive to reproducible analysis. It’s one of the non-code related parts of R programming that I think is extremely helpful to know, and arguably for a learner, even a greater priority than learning how to use GitHub! ¹

What is an RStudio project, and why?

When I first started using R several years ago, the textbook and mainstream approach for setting working directories was to use setwd(), which takes an absolute file path as an input then sets it as the current working directory of the R process. You then use getwd() to find out what the current working directory is, and check that your working directory is correctly set.

The problem with this approach is that since setwd() relies on an absolute file path, the links break very easily, and it becomes very difficult to share your analysis with others. A simple action of moving the entire directory to a different sub-folder or to a different drive will break the links, and your script will not run. As Jenny Bryan points out, the setwd() approach makes it virtually impossible for anyone other than the original author of the script, on his or her computer, to make the file paths work:

The chance of the setwd() command having the desired effect – making the file paths work – for anyone besides its author is 0%. It’s also unlikely to work for the author one or two years or computers from now. The project is not self-contained and portable. To recreate and perhaps extend this plot, the lucky recipient will need to hand edit one or more paths to reflect where the project has landed on their machine. When you do this for the 73rd time in 2 days, while marking an assignment, you start to fantasize about lighting the perpetrator’s computer on fire.

(Check out this link for the original blog post)

At the beginning I was sceptical about the seemingly radical move of abandoning the setwd() orthodoxy entirely, but since I’ve tried out the project workflow I’ve never really thought about using absolute file paths again. So I’m totally with Jenny Bryan on this one!²

Easy file path referencing with RStudio projects

RStudio projects solve the problem of ‘fragile’ file paths by making file paths relative. The RStudio project file is a file that sits in the root directory, with the extension .Rproj. When your RStudio session is running through the project file (.Rproj), the current working directory points to the root folder where that .Rproj file is saved.

Here’s an example - let’s suppose my working directory is a folder named SurveyAnalysis1. Instead of listing out the full absolute file path, C:/Users/Martin/Documents/Analysis/SurveyAnalysis1/Data/Data1.xlsx, I can simply refer to the same Excel file at the directory level when using projects, i.e. just refer to the file by Data/Data1.xlsx. The idea is that if one day I decide to move my entire SurveyAnalysis1 folder/directory to another location, or perhaps open this up on a different computer, all the file paths specified in my R scripts would still work as long as I start the session through opening the .Rproj file.
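To make the idea concrete, here is a minimal base R sketch of how a relative path gets resolved. It assumes the session was started from the .Rproj file, so that getwd() returns the project root; the file itself doesn’t need to exist for the paths to be built:

```r
# Relative path, as it would be written in a script inside the project
rel_path <- file.path("Data", "Data1.xlsx")

# R resolves relative paths against the current working directory, which is
# the project root when the session was started from the .Rproj file
abs_path <- file.path(getwd(), rel_path)

rel_path
abs_path
```

Only rel_path ever needs to appear in the script, which is why the script keeps working after the folder is moved.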

This .Rproj file can be created by going to File > New Project… in RStudio, and it then becomes associated with the specified folder or directory. The mindset should then be that the directory (the whole folder and its sub-folders and contents) is stand-alone and portable, which means that you shouldn’t be reading in data from or writing data to files outside the directory. Everything relating to that analysis or project should only happen within that directory, except for cases where your analysis requires interacting with an Internet source, e.g. web-scraping, calling APIs. When opening an existing project, you should open the .Rproj file first and only subsequently open any R scripts (files with the .R extension) from the RStudio session, rather than going straight to the R scripts to open them. You can think of opening the .Rproj file as an ‘initialisation’ step for the RStudio session, which ensures that everything you run from this session can find the proper file paths within that directory. RStudio has more detailed documentation on RStudio projects which is worth checking out, including more information on .RData and .Rhistory files. Chapter 8 (Workflow: projects) of R for Data Science also gives a ‘quick start’ guide on how to use RStudio projects.

Structuring your working directory 🔨

Aside from using RStudio projects, it’s also good practice to structure your directory in a way that helps anybody else you are collaborating with - or a future version of you trying to reproduce some analysis - to navigate the analysis easily. I recommend the following as a basic ‘starter’ directory set up:

Basic Structure

In your working directory, you will have the following:

Data - this is the subfolder where I save any files that I need to read into R in order to do my analysis or visualisation. These could be anything from SPSS (*.sav) files, Excel / CSV files, .FST or .RDS files. The key idea is that these are source data files, and to ensure reproducibility, at no point should R be saving over or editing these files. The reasoning is that reproducible analysis isn’t really possible if the source data file keeps getting changed by the analysis (think analysis in spreadsheets). If you do need to change the source data file, create a new version and ensure that the new file name appropriately reflects that change.
Script - this is where I save my R scripts and RMarkdown files (files with the extension .R and .Rmd).
- Analysis - All my main analysis R scripts are saved here. For most intents and purposes, I think it is fine to have multiple scripts that perform different tasks saved here. I don’t personally have one project per distinct piece of analysis, as this could get out of hand when I may have 20+ different analyses that I’d like to perform on a single dataset. My (actually quite simple) rule-of-thumb for deciding whether to separate out an analysis is to imagine whether someone completely new to the project would be able to navigate and figure out what is going on with this directory. As a side note - thoughtful and sensible file names help a lot!
- Functions - It is optional whether you have your custom functions saved in a separate sub-folder. I find this convenient personally because if I want to re-use a function that I remember I’ve written in a particular project, I can at a quick glance browse all the functions I’ve written for that project. Saving functions separately accompanies a workflow where you use source() to read functions into the ‘main analysis script’, rather than having them together with the main analysis.
- RMarkdown files - RMarkdown files are a special case, as they work slightly differently to .R files in terms of file paths, i.e. they behave like mini projects of their own, where the default working directory is where the Rmd file is saved. To save RMarkdown files in this set up, it’s recommended that you use the {here} package and its workflow. Alternatively, you can run knitr::opts_knit$set(root.dir = "../") in your setup chunk so that the working directory is set in the root directory rather than another sub-folder where the RMarkdown file is saved (less ideal than using {here}). In my other post, I briefly discussed a directory structure for combining multiple RMarkdown files into a single long RMarkdown document (https://martinctc.github.io/blog/first-world-problems-very-long-rmarkdown-documents/).
Output - Save all your outputs here, including plots, HTML, and data exports.
- Having this Output folder helps others identify what files are outputs of the code, as opposed to source files that were used to produce the analysis.
- What you have set up as the sub-folders doesn’t matter too much, as long as they’re sensible. You may decide to set up the sub-folders so that they align with the analysis rather than type of file export.
- The timed_fn() function from my package surveytoolbox (available on GitHub) helps create timestamps for file names, which I use often to ensure that I don’t lose work when I am iterating analysis.
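The structure above can be sketched in a few lines of base R. This is only an illustration under my own assumptions: the project root is a hypothetical folder inside tempdir() so that nothing real is touched, and the timestamped file name is a base R stand-in rather than how timed_fn() itself works:

```r
# Hypothetical project root in a temporary folder, so nothing real is touched
root <- file.path(tempdir(), "SurveyAnalysis1")

# Folder names follow the structure described above
folders <- c("Data",
             file.path("Script", "Analysis"),
             file.path("Script", "Functions"),
             "Output")

# Create each folder, including any parent folders that don't exist yet
for (folder in folders) {
  dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Base R stand-in for a timestamped file name; this is not timed_fn() itself
out_file <- file.path(root, "Output",
                      paste0(format(Sys.time(), "%Y%m%d_%H%M%S"), "_results.csv"))

list.dirs(root, full.names = FALSE, recursive = TRUE)
out_file
```

In a real project you would swap root for the folder that holds your .Rproj file, and only run the set-up once.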

This directory structure ‘template’ should provide a good starting point for organising projects if a project workflow is new to you. However, whilst having consistency is great, different projects will have different needs, and therefore one should always think about what is needed and what will happen when setting up the working directory structure, and adapt appropriately.

Vignette: Downloadable tables in RMarkdown with the DT package

2019-12-25T00:00:00+00:00

Background

In an earlier post in April this year, I discussed using flexdashboard (with RMarkdown) as an appealing and practical R alternative to Excel-based reporting dashboards. Since these ‘flexdashboards’ can be (i) exported as static HTML files that can be opened on practically any computer (virtually no dependencies), (ii) shared as attachments over emails, and (iii) run without relying on servers and Internet access, they rival ‘traditional’ Excel dashboards on portability. This is an advantage that you don’t really get with other dashboarding solutions such as Tableau and Shiny, as far as I’m aware.

Traditionally, people also like Excel dashboards for another reason, which is that all the data that is reported in the dashboard is usually self-contained and available in the Excel file itself, provided that the source data within Excel isn’t hidden and protected. This enables any keen user to extract the source data to produce charts or analysis on their own “off-dashboard”. Moreover, having the data available within the dashboard itself helps with reproducibility, in the sense that one can more easily trace back the relationship between the source data and the reported analysis or visualisation.

In this post, I am going to share a trick on how to implement this feature within RMarkdown (and therefore also in flexdashboard) such that the users of your dashboards can export/download your source data. This will be implemented using the DT package created by RStudio, which provides an R interface to the JavaScript library DataTables.¹

(Credits to Jonathan Ng for sharing this trick with me in the first place! His original video tutorial that first mentions this is available here)

The DT package

In a nutshell, DT is an R package that enables the creation of interactive, pretty HTML tables with fancy features such as filter, search, scroll, pagination, and sort - to name a few. Since DT generates an HTML widget (e.g. just like what leaflet, rbokeh, and plotly do), it can be used in RMarkdown HTML outputs and Shiny dashboards. I’ve personally found DT very useful when creating RMarkdown documents (knitted to HTML) because it allows you to create professional-looking, business-ready interactive tables with literally only a couple of lines of code, and you can do this entirely in R without knowing any JavaScript. The other alternative packages that perform a similar job of producing quick and pretty HTML tables are formattable, knitr::kable() and kableExtra, but as far as I’m aware only DT allows you to add these ‘data download’ buttons that we are focussing on in this post.

Downloadable tables

What we are trying to get to is an interactive table with buttons that allow you to perform the following actions:

Copy to clipboard
Export to CSV
Export to Excel
Export to PDF
Print

Though you might require only one or two of the above buttons, I’m going to go through an example that lets you do all five at the same time. Below is what the final output looks like, using the iris dataset, where the download options are shown at the top of the widget:

To see what the interactive version is like, click here.

The Solution

The main function from DT to create the interactive table is DT::datatable(). The first argument accepts a data frame, so this makes it easy to use it with dplyr / magrittr pipes. This is how we will create the above table, using the inbuilt iris dataset:

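library(tidyverse) # provides the %>% pipe
library(DT)

iris %>%
  datatable(extensions = 'Buttons', # enable the Buttons extension
            options = list(dom = 'Blfrtip', # control elements and their order
                           buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
                           # page length values, then the labels shown for them
                           lengthMenu = list(c(10,25,50,-1),
                                             c(10,25,50,"All"))))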

And here is a brief explanation for each of the arguments used in the above code:

extensions: this takes a character vector of the names of DataTables plug-ins, but only plug-ins supported by the DT package can be used here. We’ll just put ‘Buttons’ here.
options: this argument is where you feed in all the additional customisation options, which are specified in a list.² I usually think of these as ‘expanded features’ that aren’t / haven’t been built into the DT package yet, but are available in the ‘source’ JavaScript library DataTables.
- dom: This argument defines the table control elements to appear on the page and in what order. Here, we have specified this to be Blfrtip, where:
  - B stands for buttons,
  - l for length changing input control,
  - f for filtering input,
  - r for processing display element,
  - t for the table,
  - i for table information summary,
  - and finally, p for pagination display.
  You may move the letters around to control where the buttons are placed; for instance, lfrtipB would place the buttons at the very bottom of the widget.
- buttons: you pass a character vector through to specify what buttons to actually display in the widget, where ‘copy’ stands for copy to clipboard, ‘csv’ stands for ‘export to csv’, etc.
- lengthMenu: this allows you to specify the options for how many rows of data to display on each page. Here, I’ve passed a list through with two vectors, where the first specifies the page length values (with -1 meaning all rows) and the second the labels displayed for them.
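As a quick illustration of the options described above, the sketch below keeps only the CSV and Excel buttons and moves them to the bottom of the widget by putting B at the end of dom. The object name bottom_buttons and the particular page lengths are placeholders of mine.

```r
library(DT)

bottom_buttons <- datatable(
  iris,
  extensions = "Buttons",
  options = list(
    dom = "lfrtipB",                        # B last: buttons sit below the table
    buttons = c("csv", "excel"),            # only two of the five buttons
    pageLength = 5,                         # initial number of rows per page
    lengthMenu = list(c(5, 20, -1),         # page lengths (-1 = all rows)
                      c("5", "20", "All"))  # labels shown in the dropdown
  )
)

bottom_buttons
```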

Try it out! Note that if you run this code in an R script, the table will open up in your Viewer Pane in RStudio, but you will need to run the code within an RMarkdown document in order to produce a share-able HTML output.
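If you only need a single table rather than a full report, one alternative that I believe works outside of RMarkdown is htmlwidgets::saveWidget(), which writes a widget to an HTML file. This is a sketch rather than part of the original trick: the file name is made up, and selfcontained = FALSE is used here because bundling everything into one file relies on pandoc being installed.

```r
library(DT)
library(htmlwidgets)

tbl <- datatable(
  iris,
  extensions = "Buttons",
  options = list(dom = "Blfrtip",
                 buttons = c("copy", "csv", "excel", "pdf", "print"))
)

# Hypothetical output location; a temporary folder is used for illustration
out_file <- file.path(tempdir(), "iris_table.html")

# Writes the HTML file plus a folder of supporting JavaScript/CSS files
saveWidget(tbl, file = out_file, selfcontained = FALSE)
```

With selfcontained = FALSE, the HTML file needs to be shared together with its accompanying folder of supporting files.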

Create a function (for cleaner code)

I’ve wrapped the solution in a handy function called create_dt(), which adds a bit of convenience: I can simply load this script at the beginning of an RMarkdown document and then call the function throughout the document, whenever I want to display some data and make it downloadable. Here it is:

create_dt <- function(x){
  DT::datatable(x,
                extensions = 'Buttons',
                options = list(dom = 'Blfrtip',
                               buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
                               lengthMenu = list(c(10,25,50,-1),
                                                 c(10,25,50,"All"))))
}

You can customise this function to suit whatever needs you have for your project, but I find creating a function for the task of generating DT tables just makes the overall code cleaner, shorter, and easier to follow.
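As one example of such customisation, here is a sketch of a variant that exposes the buttons and the dom string as arguments, and passes anything else on to DT::datatable(). The name create_dt_custom() and its argument defaults are my own illustration.

```r
create_dt_custom <- function(x,
                             buttons = c('copy', 'csv', 'excel', 'pdf', 'print'),
                             dom = 'Blfrtip',
                             ...){
  DT::datatable(x,
                extensions = 'Buttons',
                options = list(dom = dom,
                               buttons = buttons,
                               lengthMenu = list(c(10,25,50,-1),
                                                 c(10,25,50,"All"))),
                ...) # further datatable() arguments, e.g. rownames = FALSE
}

# Only offer CSV and Excel downloads, and hide the row names
csv_xlsx_tbl <- create_dt_custom(iris,
                                 buttons = c('csv', 'excel'),
                                 rownames = FALSE)
```

Keeping the defaults identical to the original five-button setup means that a call with just the data frame gives the same table as before.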

End notes

Hope you enjoyed this short vignette.

Do comment down below if you find this useful, or if you have any related ideas or suggestions you’d like to share. If you liked this post, please do check out my blog for more R and data science related content.

And have a Merry Christmas everyone!

¹ Not to be confused with the data.table package, which is practically a “super” package for fast data manipulation and wrangling.
² See https://datatables.net/reference/option/ for full documentation of the options.
