<p>Common Statistical Tests in R - Part I<br />
Martin Chan, 2022-10-13. From <em>Musings on R</em>, a blog on all things R and Data Science: https://martinctc.github.io/blog/common-statistical-tests-in-r---part-i</p>
<script src="https://martinctc.github.io/blog/knitr_files/common-statistical-tests_20221007_files/header-attrs-2.16/header-attrs.js"></script>
<section class="main-content">
<div id="introduction" class="section level1">
<h1>Introduction</h1>
<p>This post will focus on common statistical tests in R to understand
and validate the relationship between two variables.</p>
<p>There must be tons of similar tutorials around, you may be thinking.
So why?</p>
<p>The primary (and selfish) goal of the post is to create a guide that
is practical enough for myself to refer to from time to time. This post
is edited from my own notes from learning statistics and R, and has
been applied to a data example/scenario that I am familiar with. This
means that the examples should be easily generalisable and mostly
consistent with my usual coding approach (mostly ‘tidy’ and using
pipes). Along the way, this will hopefully benefit others who are
learning statistics and R too.</p>
<div class="figure">
<img src="https://raw.githubusercontent.com/martinctc/blog/master/images/breaking-bad-heisenberg.gif" alt="" />
<p class="caption">image from Giphy</p>
</div>
<p>To illustrate the R code, I will be using a sample dataset
<code>pq_data</code> from the package <a href="https://microsoft.github.io/vivainsights/"><strong>vivainsights</strong></a>,
which is a cross-sectional time-series dataset measuring the
collaboration behaviour of simulated employees in an organization. Each
row represents an employee on a certain week, with columns measuring
behaviours such as total weekly time spent in email, meetings, chats,
and so on. The <strong>vivainsights</strong> package itself provides
visualisation and analysis functions tailored for these datasets which
are available from <a href="https://www.microsoft.com/en-us/microsoft-viva/insights/">Microsoft
Viva Insights</a>.</p>
<p>A note about the structure of this post: in the real world, one
should as a best practice visually check the data distribution and run
tests for assumptions like normality prior to performing any tests. For
the sake of narrative and covering all the scenarios, this practice
isn’t really observed in this post. Hence, please be forgiving as you
see us run ‘head first’ into a test without examining the data - and
avoid this in real life!</p>
</div>
<div id="set-up-packages-and-data" class="section level1">
<h1>Set-up: packages and data</h1>
<p>The package <strong>vivainsights</strong> is available on CRAN, so
you can install this with
<code>install.packages("vivainsights")</code>.</p>
<p>You can load the dataset in R by calling <code>pq_data</code> after
loading the <strong>vivainsights</strong> package. Here is a preview of
the first ten columns of the dataset using
<code>dplyr::glimpse()</code>:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(vivainsights)</span>
<span id="cb1-2"><a href="#cb1-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb1-3"><a href="#cb1-3" aria-hidden="true" tabindex="-1"></a><span class="fu">glimpse</span>(pq_data[, <span class="dv">1</span><span class="sc">:</span><span class="dv">10</span>])</span></code></pre></div>
<pre><code>## Rows: 5,593
## Columns: 10
## $ PersonId <chr> "2b625906-1f36-3273-8d0d-13e714c5f6~
## $ MetricDate <date> 2021-12-26, 2021-12-26, 2021-12-26~
## $ After_hours_call_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ After_hours_chat_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ After_hours_collaboration_hours <dbl> 7.6624994, 2.4908612, 0.1625000, 1.~
## $ After_hours_email_hours <dbl> 0.2600000, 0.5883611, 0.1625000, 0.~
## $ After_hours_meeting_hours <dbl> 7.50, 2.00, 0.00, 1.25, 19.00, 0.25~
## $ After_hours_scheduled_call_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ After_hours_unscheduled_call_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Call_hours <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~</code></pre>
<p>This tutorial also uses functions from <strong>tidyverse</strong>, so
ensure that you run <code>library(tidyverse)</code> to reproduce the
example outputs.</p>
</div>
<div id="framing-the-problem" class="section level1">
<h1>Framing the problem</h1>
<p>One of the most fundamental tasks in statistics and data science is
to understand the relation between two variables. Sometimes the
motivation is to understand whether the relationship is causal,
i.e. whether one causes another. This is not always the case, as for
instance, one may simply wish to test for
<strong>multicollinearity</strong> when selecting predictors for a
model.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<p>Our dataset <code>pq_data</code> represents the simulated
collaboration data of a company, and each row represents an employee’s
week. There are two metrics of interest:</p>
<ul>
<li><code>Multitasking_hours</code> measures the total number of hours
the person spent sending emails or instant messages during a meeting or
a Teams call.</li>
<li><code>After_hours_collaboration_hours</code> measures the number of
hours a person has spent in collaboration (meetings, emails, IMs, and
calls) outside of working hours.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a></li>
</ul>
<p>Imagine then we have two questions to address:</p>
<ol style="list-style-type: decimal">
<li><p>Do <em>managers</em> multi-task more than <em>senior individual
contributors (IC)</em>?</p></li>
<li><p>The HR leadership suspects that meeting multitasking behaviour
could be correlated with after-hours working, as the former represents
wasted time and productivity during meetings. What can we do to
understand the relationship between the two?</p></li>
</ol>
<p>In this post, we will tackle the first question, and focus primarily
on <strong>comparison tests</strong> and their non-parametric
equivalents in R. In subsequent posts I would also like to cover other
relevant tools/concepts such as correlation tests, regression tests,
effect size, and statistical power.</p>
<p>It is worth noting that the first question postulates a relation
between a <strong>categorical</strong> variable (manager/ senior IC) and
a <strong>continuous</strong> variable (multitasking hours), whereas the
second question postulates a relation between two <strong>continuous</strong>
variables (multitasking hours, after-hours collaboration). The types of
the variables in question help determine which tests are
appropriate.</p>
<p>The categorical variable that provides us information on whether an
employee is a manager or a senior IC in <code>pq_data</code> is stored
in <code>LevelDesignation</code>. We can use
<code>vivainsights::hrvar_count()</code> to explore this variable:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1" aria-hidden="true" tabindex="-1"></a><span class="fu">hrvar_count</span>(pq_data, <span class="at">hrvar =</span> <span class="st">"LevelDesignation"</span>)</span></code></pre></div>
<p><img src="https://martinctc.github.io/blog/knitr_files/common-statistical-tests_20221007_files/figure-html/unnamed-chunk-3-1.png" /><!-- --></p>
</div>
<div id="comparison-tests-the-t-test" class="section level1">
<h1>1. Comparison tests: the t-test</h1>
<p>Two common comparison tests would be the <strong>t-test</strong> and
<strong>Analysis of Variance (ANOVA)</strong>. The oft-cited
<em>practical</em> difference between the two is that you would use the
t-test for comparing means between two groups, and ANOVA for more than
two groups. There is a bit more nuance than that, but we will start with
the t-test.</p>
<p>A t-test can be <em>paired</em> or <em>unpaired</em>, where the
former is used for comparing two sets of measurements taken on the <em>same
subjects</em> (e.g. the same people in two different weeks), and the
latter for <em>independent samples from two
populations or groups</em>. Since managers and senior ICs are two
different populations, an unpaired (two-sample) t-test is therefore
appropriate for the scenario in the first question.</p>
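<p>To make the paired/unpaired distinction concrete, here is a minimal sketch in base R. The two vectors are hypothetical toy values that I made up for illustration (they are not taken from <code>pq_data</code>), and the names <code>week_1</code> and <code>week_2</code> are my own.</p>

```r
# Hypothetical toy data (not taken from pq_data): multitasking hours
# for the same six people, measured in two different weeks
week_1 <- c(0.62, 0.35, 0.80, 0.41, 0.55, 0.73)
week_2 <- c(0.70, 0.39, 0.85, 0.50, 0.58, 0.80)

# Paired: each person is compared with themselves across the two weeks
res_paired <- t.test(week_1, week_2, paired = TRUE)

# Unpaired: the two vectors are treated as independent groups
res_unpaired <- t.test(week_1, week_2, paired = FALSE)

c(paired = res_paired$p.value, unpaired = res_unpaired$p.value)
```

<p>In this made-up example the paired test gives the smaller p-value, because comparing each person with themselves removes the person-to-person variation. This is only valid when the two sets of measurements really do belong to the same subjects.</p>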
<p>Before we jump into the test, we’ll need to prepare the data. Since
we are interested in the difference between managers and senior ICs, we
will first need to create a factor variable from the data that has only
two levels. In the below code, we will first filter out any values of
<code>LevelDesignation</code> that are not <code>"Manager"</code> and
<code>"Senior IC"</code>, and create a new factor column as
<code>ManagerIndicator</code>:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1" aria-hidden="true" tabindex="-1"></a>pq_data_grouped <span class="ot"><-</span></span>
<span id="cb4-2"><a href="#cb4-2" aria-hidden="true" tabindex="-1"></a> pq_data <span class="sc">%>%</span></span>
<span id="cb4-3"><a href="#cb4-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(LevelDesignation <span class="sc">%in%</span> <span class="fu">c</span>(<span class="st">"Manager"</span>, <span class="st">"Senior IC"</span>)) <span class="sc">%>%</span></span>
<span id="cb4-4"><a href="#cb4-4" aria-hidden="true" tabindex="-1"></a> <span class="fu">mutate</span>(</span>
<span id="cb4-5"><a href="#cb4-5" aria-hidden="true" tabindex="-1"></a> <span class="at">ManagerIndicator =</span></span>
<span id="cb4-6"><a href="#cb4-6" aria-hidden="true" tabindex="-1"></a> <span class="fu">factor</span>(LevelDesignation,</span>
<span id="cb4-7"><a href="#cb4-7" aria-hidden="true" tabindex="-1"></a> <span class="at">levels =</span> <span class="fu">c</span>(<span class="st">"Manager"</span>, <span class="st">"Senior IC"</span>))</span>
<span id="cb4-8"><a href="#cb4-8" aria-hidden="true" tabindex="-1"></a> )</span></code></pre></div>
<p>Recall also that our dataset <code>pq_data</code> is a
cross-sectional time-series dataset, which means that for every
individual identified by <code>PersonId</code>, there will be multiple
rows representing a snapshot of a different week. In other words, a
unique identifier would be something like a <code>PersonWeekId</code>.
To simplify the dataset so that we are looking at person averages, we
can group the dataset by <code>PersonId</code> and calculate the mean of
<code>Multitasking_hours</code> for each person. After this
manipulation, <code>Multitasking_hours</code> would represent the mean
multitasking hours <em>per person</em>, as opposed to <em>per person per
week</em>. Let us do this by building on the pipe-chain:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a>pq_data_grouped <span class="ot"><-</span></span>
<span id="cb5-2"><a href="#cb5-2" aria-hidden="true" tabindex="-1"></a> pq_data <span class="sc">%>%</span></span>
<span id="cb5-3"><a href="#cb5-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(LevelDesignation <span class="sc">%in%</span> <span class="fu">c</span>(<span class="st">"Manager"</span>, <span class="st">"Senior IC"</span>)) <span class="sc">%>%</span></span>
<span id="cb5-4"><a href="#cb5-4" aria-hidden="true" tabindex="-1"></a> <span class="fu">mutate</span>(</span>
<span id="cb5-5"><a href="#cb5-5" aria-hidden="true" tabindex="-1"></a> <span class="at">ManagerIndicator =</span></span>
<span id="cb5-6"><a href="#cb5-6" aria-hidden="true" tabindex="-1"></a> <span class="fu">factor</span>(LevelDesignation,</span>
<span id="cb5-7"><a href="#cb5-7" aria-hidden="true" tabindex="-1"></a> <span class="at">levels =</span> <span class="fu">c</span>(<span class="st">"Manager"</span>, <span class="st">"Senior IC"</span>))</span>
<span id="cb5-8"><a href="#cb5-8" aria-hidden="true" tabindex="-1"></a> ) <span class="sc">%>%</span></span>
<span id="cb5-9"><a href="#cb5-9" aria-hidden="true" tabindex="-1"></a> <span class="fu">group_by</span>(PersonId, ManagerIndicator) <span class="sc">%>%</span></span>
<span id="cb5-10"><a href="#cb5-10" aria-hidden="true" tabindex="-1"></a> <span class="fu">summarise</span>(<span class="at">Multitasking_hours =</span> <span class="fu">mean</span>(Multitasking_hours), <span class="at">.groups =</span> <span class="st">"drop"</span>)</span>
<span id="cb5-11"><a href="#cb5-11" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb5-12"><a href="#cb5-12" aria-hidden="true" tabindex="-1"></a><span class="fu">glimpse</span>(pq_data_grouped)</span></code></pre></div>
<pre><code>## Rows: 56
## Columns: 3
## $ PersonId <chr> "00f6d464-ba1f-31ee-b51e-ab6e8ec4fb79", "023ddb61-1~
## $ ManagerIndicator <fct> Senior IC, Manager, Senior IC, Senior IC, Manager, ~
## $ Multitasking_hours <dbl> 0.2813373, 0.5980080, 0.3319752, 0.2938879, 0.70762~</code></pre>
<p>Now our data is in the right format.</p>
<p>Let us presume that the data satisfies all the assumptions of the
t-test, and see what happens when we run it with the base
<code>t.test()</code> function:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1" aria-hidden="true" tabindex="-1"></a><span class="fu">t.test</span>(</span>
<span id="cb7-2"><a href="#cb7-2" aria-hidden="true" tabindex="-1"></a> Multitasking_hours <span class="sc">~</span> ManagerIndicator,</span>
<span id="cb7-3"><a href="#cb7-3" aria-hidden="true" tabindex="-1"></a> <span class="at">data =</span> pq_data_grouped,</span>
<span id="cb7-4"><a href="#cb7-4" aria-hidden="true" tabindex="-1"></a> <span class="at">paired =</span> <span class="cn">FALSE</span></span>
<span id="cb7-5"><a href="#cb7-5" aria-hidden="true" tabindex="-1"></a>)</span></code></pre></div>
<pre><code>##
## Welch Two Sample t-test
##
## data: Multitasking_hours by ManagerIndicator
## t = 10.097, df = 28.758, p-value = 5.806e-11
## alternative hypothesis: true difference in means between group Manager and group Senior IC is not equal to 0
## 95 percent confidence interval:
## 0.3444870 0.5195712
## sample estimates:
## mean in group Manager mean in group Senior IC
## 0.8103354 0.3783063</code></pre>
<p>In the function, the outcome and predictor variables are supplied
using the formula format (<code>outcome ~ predictor</code>) common in R, and we have specified
<code>paired = FALSE</code> to use an unpaired t-test. As for the
output,</p>
<ul>
<li><code>t</code> represents the t-statistic.</li>
<li><code>df</code> represents the degrees of freedom.</li>
<li><code>p-value</code> is - well - the p-value. The value here is
significant, as it is smaller than the significance level of
0.05.</li>
<li>the test allows us to reject the null hypothesis that the mean
multitasking hours of managers and senior ICs are the same.</li>
</ul>
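<p>The pieces of the printed output listed above can also be pulled out programmatically, since <code>t.test()</code> returns a list of class <code>htest</code>. Below is a minimal sketch using hypothetical toy values (the vectors <code>managers</code> and <code>senior_ics</code> are made up for illustration and are not the real group data).</p>

```r
# Hypothetical toy values standing in for per-person multitasking hours
managers   <- c(0.81, 0.95, 0.62, 1.10, 0.74, 0.88)
senior_ics <- c(0.38, 0.41, 0.35, 0.30, 0.44, 0.39)

res <- t.test(managers, senior_ics)  # Welch's t-test by default

res$statistic  # t-statistic
res$parameter  # degrees of freedom
res$p.value    # p-value
res$conf.int   # 95% confidence interval for the difference in means
res$estimate   # the two group means
```

<p>Storing the result in an object like this is handy when you want to report or tabulate several tests, rather than reading the values off the console.</p>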
<p>Note that the t-test used here is <strong>Welch’s
t-test</strong>, which is an adaptation of the classic <strong>Student’s
t-test</strong>. Welch’s t-test does not assume that the two
groups have equal variances (i.e. it handles heteroscedasticity), whereas the classic Student’s
t-test assumes the variances of the two groups to be equal (fancy term =
homoscedastic).</p>
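<p>To see what distinguishes the two tests, the sketch below computes Welch’s t-statistic and its degrees of freedom (the Welch-Satterthwaite approximation) by hand, and compares them with the output of <code>t.test()</code>. The data are hypothetical toy values of my own, not the values from <code>pq_data</code>.</p>

```r
# Hypothetical toy values standing in for per-person multitasking hours
managers   <- c(0.81, 0.95, 0.62, 1.10, 0.74, 0.88)
senior_ics <- c(0.38, 0.41, 0.35, 0.30, 0.44, 0.39)

# Squared standard error of each group mean
se2_m <- var(managers) / length(managers)
se2_i <- var(senior_ics) / length(senior_ics)

# Welch's t-statistic: the two variances are not pooled
t_welch <- (mean(managers) - mean(senior_ics)) / sqrt(se2_m + se2_i)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (se2_m + se2_i)^2 /
  (se2_m^2 / (length(managers) - 1) + se2_i^2 / (length(senior_ics) - 1))

welch   <- t.test(managers, senior_ics)                    # var.equal = FALSE is the default
student <- t.test(managers, senior_ics, var.equal = TRUE)  # classic Student's t-test

c(by_hand = t_welch,  t.test = unname(welch$statistic))
c(by_hand = df_welch, t.test = unname(welch$parameter))
student$parameter  # Student's df is simply n1 + n2 - 2
```

<p>The non-integer degrees of freedom in the Welch output (such as the 28.758 in our result above) come from this approximation, whereas Student’s t-test always uses <code>n1 + n2 - 2</code>.</p>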
<div id="testing-for-normality" class="section level2">
<h2>1.1 Testing for normality</h2>
<p>But hang on!</p>
<p>There are several assumptions behind the classic t-test we haven’t
examined properly, namely:</p>
<ol style="list-style-type: decimal">
<li>independence - sample is independent</li>
<li>normality - data for each group is normally distributed</li>
<li>homoscedasticity - data across samples have equal variance</li>
</ol>
<p>We can at least be sure of (1), as we know that senior ICs and
Managers are separate populations. However, (2) and (3) are assumptions
that we have to validate and address specifically. To test whether our
data is normally distributed, we can use the <strong>Shapiro-Wilk test
of normality</strong>, with the function
<code>shapiro.test()</code>:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1" aria-hidden="true" tabindex="-1"></a>pq_data_grouped <span class="sc">%>%</span></span>
<span id="cb9-2"><a href="#cb9-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">group_by</span>(ManagerIndicator) <span class="sc">%>%</span></span>
<span id="cb9-3"><a href="#cb9-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">summarise</span>(</span>
<span id="cb9-4"><a href="#cb9-4" aria-hidden="true" tabindex="-1"></a> <span class="at">p =</span> <span class="fu">shapiro.test</span>(Multitasking_hours)<span class="sc">$</span>p.value,</span>
<span id="cb9-5"><a href="#cb9-5" aria-hidden="true" tabindex="-1"></a> <span class="at">statistic =</span> <span class="fu">shapiro.test</span>(Multitasking_hours)<span class="sc">$</span>statistic</span>
<span id="cb9-6"><a href="#cb9-6" aria-hidden="true" tabindex="-1"></a> )</span></code></pre></div>
<pre><code>## # A tibble: 2 x 3
## ManagerIndicator p statistic
## <fct> <dbl> <dbl>
## 1 Manager 0.146 0.936
## 2 Senior IC 0.0722 0.941</code></pre>
<p>As both p-values are greater than 0.05, we fail to reject the null
hypothesis that the data are normally distributed. In other words, the
test finds no strong evidence against normality, although the result for
senior ICs (p = 0.0722) is borderline. To confirm, you can also perform a
visual check for normality using a histogram or a Q-Q plot.</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Multitasking hours - IC</span></span>
<span id="cb11-2"><a href="#cb11-2" aria-hidden="true" tabindex="-1"></a>mth_ic <span class="ot"><-</span></span>
<span id="cb11-3"><a href="#cb11-3" aria-hidden="true" tabindex="-1"></a> pq_data_grouped <span class="sc">%>%</span></span>
<span id="cb11-4"><a href="#cb11-4" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(ManagerIndicator <span class="sc">==</span> <span class="st">"Senior IC"</span>) <span class="sc">%>%</span></span>
<span id="cb11-5"><a href="#cb11-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">pull</span>(Multitasking_hours) </span>
<span id="cb11-6"><a href="#cb11-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb11-7"><a href="#cb11-7" aria-hidden="true" tabindex="-1"></a><span class="fu">qqnorm</span>(mth_ic, <span class="at">pch =</span> <span class="dv">1</span>, <span class="at">frame =</span> <span class="cn">FALSE</span>)</span>
<span id="cb11-8"><a href="#cb11-8" aria-hidden="true" tabindex="-1"></a><span class="fu">qqline</span>(mth_ic, <span class="at">col =</span> <span class="st">"steelblue"</span>, <span class="at">lwd =</span> <span class="dv">2</span>)</span></code></pre></div>
<p><img src="https://martinctc.github.io/blog/knitr_files/common-statistical-tests_20221007_files/figure-html/unnamed-chunk-8-1.png" /><!-- --></p>
<div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1" aria-hidden="true" tabindex="-1"></a><span class="co"># Multitasking hours - Manager</span></span>
<span id="cb12-2"><a href="#cb12-2" aria-hidden="true" tabindex="-1"></a>mth_man <span class="ot"><-</span></span>
<span id="cb12-3"><a href="#cb12-3" aria-hidden="true" tabindex="-1"></a> pq_data_grouped <span class="sc">%>%</span></span>
<span id="cb12-4"><a href="#cb12-4" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(ManagerIndicator <span class="sc">==</span> <span class="st">"Manager"</span>) <span class="sc">%>%</span></span>
<span id="cb12-5"><a href="#cb12-5" aria-hidden="true" tabindex="-1"></a> <span class="fu">pull</span>(Multitasking_hours) </span>
<span id="cb12-6"><a href="#cb12-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb12-7"><a href="#cb12-7" aria-hidden="true" tabindex="-1"></a><span class="fu">qqnorm</span>(mth_man, <span class="at">pch =</span> <span class="dv">1</span>, <span class="at">frame =</span> <span class="cn">FALSE</span>)</span>
<span id="cb12-8"><a href="#cb12-8" aria-hidden="true" tabindex="-1"></a><span class="fu">qqline</span>(mth_man, <span class="at">col =</span> <span class="st">"steelblue"</span>, <span class="at">lwd =</span> <span class="dv">2</span>)</span></code></pre></div>
<p><img src="https://martinctc.github.io/blog/knitr_files/common-statistical-tests_20221007_files/figure-html/unnamed-chunk-8-2.png" /><!-- --></p>
<p>In the Q-Q plots, the points broadly adhere to the reference line.
The graphical approach is therefore consistent with the Shapiro-Wilk
p-values of 0.146 and 0.0722, neither of which falls below 0.05. Below is a good thing to bear in
mind:<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a></p>
<blockquote>
<p>Statistical tests have the advantage of making an objective judgment
of normality but have the disadvantage of sometimes not being sensitive
enough at low sample sizes or overly sensitive to large sample sizes.
Graphical interpretation has the advantage of allowing good judgment to
assess normality in situations when numerical tests might be over or
undersensitive.</p>
</blockquote>
<p>In other words, the small sample sizes (23 managers and 33 senior ICs) may well have
limited the ability of our Shapiro-Wilk test to detect non-normality.<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a> A non-significant result does not prove that the data are normal, so our data isn’t
conclusively normal - this in turn makes the unpaired t-test less
conclusive. When we cannot safely assume normality, we can consider
other alternatives such as the <strong>non-parametric two-samples
Wilcoxon Rank-Sum test</strong>. This is covered further down below.</p>
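<p>To get a feel for how sample size affects the Shapiro-Wilk test, here is a small simulation sketch. It repeatedly draws samples from a mildly right-skewed gamma distribution (an assumption I chose purely for illustration, not a model of <code>pq_data</code>) and records how often <code>shapiro.test()</code> rejects normality at the 0.05 level. The helper <code>rejection_rate()</code> is a hypothetical function of my own.</p>

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Share of simulated samples in which shapiro.test() rejects normality at 0.05.
# Each sample is drawn from a mildly right-skewed gamma distribution.
rejection_rate <- function(n, reps = 200) {
  p_values <- replicate(reps, shapiro.test(rgamma(n, shape = 10))$p.value)
  mean(p_values < 0.05)
}

rate_small <- rejection_rate(n = 20)    # small samples
rate_large <- rejection_rate(n = 2000)  # large samples

c(n_20 = rate_small, n_2000 = rate_large)
```

<p>The underlying distribution is the same in both cases, yet the test rejects normality far more often with the larger samples. This is why it is worth pairing the test with a visual check.</p>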
</div>
<div id="testing-for-equality-of-variance-homoscedasticity" class="section level2">
<h2>1.2 Testing for equality of variance (homoscedasticity)</h2>
<p>Aside from normality, another assumption of the t-test that we
did not check prior to running <code>t.test()</code> is
equality of variance across the two groups (homoscedasticity).
Thankfully, this was not something we had to worry about as we used
Welch’s t-test. Recall that the classic Student’s t-test assumes
equality between the two variances, but Welch’s t-test already takes
the difference in variance into account.</p>
<p>If required, however, here is an example on how you can test for
homoscedasticity in R, using <code>var.test()</code>:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1" aria-hidden="true" tabindex="-1"></a><span class="co"># F test to compare two variances</span></span>
<span id="cb13-2"><a href="#cb13-2" aria-hidden="true" tabindex="-1"></a><span class="fu">var.test</span>(</span>
<span id="cb13-3"><a href="#cb13-3" aria-hidden="true" tabindex="-1"></a> Multitasking_hours <span class="sc">~</span> ManagerIndicator,</span>
<span id="cb13-4"><a href="#cb13-4" aria-hidden="true" tabindex="-1"></a> <span class="at">data =</span> pq_data_grouped</span>
<span id="cb13-5"><a href="#cb13-5" aria-hidden="true" tabindex="-1"></a> )</span></code></pre></div>
<pre><code>##
## F test to compare two variances
##
## data: Multitasking_hours by ManagerIndicator
## F = 4.5726, num df = 22, denom df = 32, p-value = 0.0001085
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 2.146082 10.318237
## sample estimates:
## ratio of variances
## 4.572575</code></pre>
<p>The <code>var.test()</code> function run above is an F-test
(i.e. it uses the F-distribution) that checks whether the variances of
two samples are the same. Under the null hypothesis of the test, there
is homoscedasticity and, as the F-statistic is a ratio of
variances, the F-statistic would tend towards 1. The arguments are
provided in a similar format to <code>t.test()</code>.</p>
<p>It appears that homoscedasticity does not hold: since the p-value is
less than 0.05, we should reject the null hypothesis that the variances
of the manager and senior IC groups are equal. The estimated ratio of
about 4.57 suggests that multitasking hours vary considerably more among
managers. The Student’s t-test would
not have been appropriate here, and we were correct to have used
Welch’s t-test.</p>
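<p>Since the F-statistic is just a ratio of two sample variances, it is easy to reproduce by hand. The sketch below does so with hypothetical toy values (not the real group data) and checks the result against <code>var.test()</code>.</p>

```r
# Hypothetical toy values standing in for per-person multitasking hours
managers   <- c(0.81, 0.95, 0.62, 1.10, 0.74, 0.88)
senior_ics <- c(0.38, 0.41, 0.35, 0.30, 0.44, 0.39)

# F-statistic: ratio of the two sample variances
f_by_hand <- var(managers) / var(senior_ics)

# Two-sided p-value from the F-distribution with (n1 - 1, n2 - 1) df
df1 <- length(managers) - 1
df2 <- length(senior_ics) - 1
p_lower   <- pf(f_by_hand, df1, df2)
p_by_hand <- 2 * min(p_lower, 1 - p_lower)

res <- var.test(managers, senior_ics)

c(by_hand = f_by_hand, var.test = unname(res$statistic))
c(by_hand = p_by_hand, var.test = res$p.value)
```

<p>One caveat worth bearing in mind is that this F-test itself assumes normality and is known to be sensitive to departures from it, so it is best read alongside the visual checks.</p>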
<p>Homoscedasticity can also be examined visually, using a boxplot or a
dotplot (using <code>graphics::dotchart()</code> - suitable for small
datasets). The code to do so would be as follows. For this example,
visual examination is a bit more challenging as the senior IC and
Manager groups have starkly different levels of multi-tasking hours.</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1" aria-hidden="true" tabindex="-1"></a><span class="fu">dotchart</span>(</span>
<span id="cb15-2"><a href="#cb15-2" aria-hidden="true" tabindex="-1"></a> <span class="at">x =</span> pq_data_grouped<span class="sc">$</span>Multitasking_hours,</span>
<span id="cb15-3"><a href="#cb15-3" aria-hidden="true" tabindex="-1"></a> <span class="at">groups =</span> pq_data_grouped<span class="sc">$</span>ManagerIndicator</span>
<span id="cb15-4"><a href="#cb15-4" aria-hidden="true" tabindex="-1"></a>)</span></code></pre></div>
<p><img src="https://martinctc.github.io/blog/knitr_files/common-statistical-tests_20221007_files/figure-html/unnamed-chunk-10-1.png" /><!-- --></p>
<div class="sourceCode" id="cb16"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb16-1"><a href="#cb16-1" aria-hidden="true" tabindex="-1"></a><span class="fu">boxplot</span>(</span>
<span id="cb16-2"><a href="#cb16-2" aria-hidden="true" tabindex="-1"></a> Multitasking_hours <span class="sc">~</span> ManagerIndicator,</span>
<span id="cb16-3"><a href="#cb16-3" aria-hidden="true" tabindex="-1"></a> <span class="at">data =</span> pq_data_grouped</span>
<span id="cb16-4"><a href="#cb16-4" aria-hidden="true" tabindex="-1"></a>)</span></code></pre></div>
<p><img src="https://martinctc.github.io/blog/knitr_files/common-statistical-tests_20221007_files/figure-html/unnamed-chunk-10-2.png" /><!-- --></p>
</div>
</div>
<div id="non-parametric-tests" class="section level1">
<h1>2. Non-parametric tests</h1>
<div id="wilcoxon-rank-sum-test" class="section level2">
<h2>2.1 Wilcoxon Rank-Sum Test</h2>
<p>Previously, we could not safely rely on the unpaired two-sample
t-test because we were not fully confident that the data satisfied the
normality condition. As an alternative, we can use the <strong>Wilcoxon
Rank-Sum test</strong> (aka Mann-Whitney U test). The Wilcoxon test is
described as a <strong>non-parametric test</strong>, which in statistics
typically means that there is no specification on a distribution, or the
parameters of a distribution. In this case, the Wilcoxon test does not
assume a normal distribution.</p>
<p>Another difference between the Wilcoxon Rank-Sum test and the
unpaired t-test is that the former tests whether two populations have
the same distribution, which amounts to comparing medians only if the two
distributions have the same shape, whereas the latter parametric test
compares means between two independent groups.</p>
<p>This is run using <code>wilcox.test()</code>:</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb17-1"><a href="#cb17-1" aria-hidden="true" tabindex="-1"></a><span class="fu">wilcox.test</span>(</span>
<span id="cb17-2"><a href="#cb17-2" aria-hidden="true" tabindex="-1"></a> Multitasking_hours <span class="sc">~</span> ManagerIndicator,</span>
<span id="cb17-3"><a href="#cb17-3" aria-hidden="true" tabindex="-1"></a> <span class="at">data =</span> pq_data_grouped,</span>
<span id="cb17-4"><a href="#cb17-4" aria-hidden="true" tabindex="-1"></a> <span class="at">paired =</span> <span class="cn">FALSE</span></span>
<span id="cb17-5"><a href="#cb17-5" aria-hidden="true" tabindex="-1"></a>)</span></code></pre></div>
<pre><code>##
## Wilcoxon rank sum exact test
##
## data: Multitasking_hours by ManagerIndicator
## W = 752, p-value = 2.842e-14
## alternative hypothesis: true location shift is not equal to 0</code></pre>
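<p>The p-value of the test is less than the significance level (alpha =
0.05), which allows us to conclude that the distribution of managers’ multitasking
hours is significantly shifted relative to that of the senior ICs. If the
two distributions are assumed to have the same shape, this can be read as
a difference in medians.</p>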
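<p>The <code>W</code> statistic can also be reproduced by hand, which helps show that the test works on ranks rather than on the raw values. The sketch below uses hypothetical toy values of my own (not the real group data).</p>

```r
# Hypothetical toy values standing in for per-person multitasking hours
managers   <- c(0.81, 0.95, 0.62, 1.10, 0.74, 0.88)
senior_ics <- c(0.38, 0.41, 0.35, 0.30, 0.44, 0.39)

# Rank all observations together, ignoring group membership
all_ranks <- rank(c(managers, senior_ics))
n_m <- length(managers)

# W = rank sum of the first group, minus its smallest possible rank sum
w_by_hand <- sum(all_ranks[seq_len(n_m)]) - n_m * (n_m + 1) / 2

res <- wilcox.test(managers, senior_ics)

c(by_hand = w_by_hand, wilcox.test = unname(res$statistic))
```

<p>In this toy example every manager value exceeds every senior IC value, so <code>W</code> reaches its maximum of <code>n1 * n2 = 36</code>. By the same logic, the <code>W = 752</code> in our real output is close to its maximum of 23 * 33 = 759, which is why the p-value is so small.</p>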
<p>Note that the Wilcoxon Rank-Sum test is different from the similarly
named Wilcoxon Signed-Rank test, which is the equivalent alternative for
the <em>paired</em> t-test. To perform the Wilcoxon Signed-Rank test
instead, you can simply specify the argument to be
<code>paired = TRUE</code>. Similar to the decision of whether to use
the paired or the unpaired t-test, you should ensure that the observations
are genuinely paired (e.g. two measurements on the same person) if you use the Wilcoxon Signed-Rank test.</p>
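<p>Here is a minimal sketch of the Wilcoxon Signed-Rank test, using hypothetical toy values for the same six people in two different weeks (the vectors are made up for illustration). I pass the two vectors directly rather than using a formula, as I am not certain that every R version accepts <code>paired</code> alongside the formula interface.</p>

```r
# Hypothetical toy data (not taken from pq_data): multitasking hours
# for the same six people, measured in two different weeks
week_1 <- c(0.62, 0.35, 0.80, 0.41, 0.55, 0.73)
week_2 <- c(0.70, 0.39, 0.85, 0.50, 0.58, 0.80)

# Signed-rank test on the within-person differences (week_1 - week_2)
res_signed <- wilcox.test(week_1, week_2, paired = TRUE)

res_signed$statistic  # V: sum of the ranks of the positive differences
res_signed$p.value
```

<p>In this toy example every person multitasked more in the second week, so there are no positive differences and <code>V</code> is 0.</p>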
</div>
<div id="kruskal-wallis-test" class="section level2">
<h2>2.2 Kruskal-Wallis test</h2>
<p>So far, we have only been looking at tests which compare exactly two
populations. If we are looking for a test that works with comparisons
across three or more populations, we can consider the
<strong>Kruskal-Wallis test</strong>.</p>
<p>Let us create a new data frame that is grouped at the
<code>PersonId</code> level, but this time keeping more levels of
<code>LevelDesignation</code>:</p>
<div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb19-1"><a href="#cb19-1" aria-hidden="true" tabindex="-1"></a>pq_data_grouped_2 <span class="ot"><-</span></span>
<span id="cb19-2"><a href="#cb19-2" aria-hidden="true" tabindex="-1"></a> pq_data <span class="sc">%>%</span></span>
<span id="cb19-3"><a href="#cb19-3" aria-hidden="true" tabindex="-1"></a> <span class="fu">filter</span>(LevelDesignation <span class="sc">%in%</span> <span class="fu">c</span>(</span>
<span id="cb19-4"><a href="#cb19-4" aria-hidden="true" tabindex="-1"></a> <span class="st">"Support"</span>,</span>
<span id="cb19-5"><a href="#cb19-5" aria-hidden="true" tabindex="-1"></a> <span class="st">"Senior IC"</span>,</span>
<span id="cb19-6"><a href="#cb19-6" aria-hidden="true" tabindex="-1"></a> <span class="st">"Junior IC"</span>,</span>
<span id="cb19-7"><a href="#cb19-7" aria-hidden="true" tabindex="-1"></a> <span class="st">"Manager"</span>,</span>
<span id="cb19-8"><a href="#cb19-8" aria-hidden="true" tabindex="-1"></a> <span class="st">"Director"</span></span>
<span id="cb19-9"><a href="#cb19-9" aria-hidden="true" tabindex="-1"></a> )) <span class="sc">%>%</span></span>
<span id="cb19-10"><a href="#cb19-10" aria-hidden="true" tabindex="-1"></a> <span class="fu">mutate</span>(<span class="at">ManagerIndicator =</span> <span class="fu">factor</span>(LevelDesignation)) <span class="sc">%>%</span></span>
<span id="cb19-11"><a href="#cb19-11" aria-hidden="true" tabindex="-1"></a> <span class="fu">group_by</span>(PersonId, ManagerIndicator) <span class="sc">%>%</span></span>
<span id="cb19-12"><a href="#cb19-12" aria-hidden="true" tabindex="-1"></a> <span class="fu">summarise</span>(<span class="at">Multitasking_hours =</span> <span class="fu">mean</span>(Multitasking_hours), <span class="at">.groups =</span> <span class="st">"drop"</span>)</span>
<span id="cb19-13"><a href="#cb19-13" aria-hidden="true" tabindex="-1"></a> </span>
<span id="cb19-14"><a href="#cb19-14" aria-hidden="true" tabindex="-1"></a><span class="fu">glimpse</span>(pq_data_grouped_2)</span></code></pre></div>
<pre><code>## Rows: 198
## Columns: 3
## $ PersonId <chr> "0049ef24-ec83-356d-89f7-46b67364e677", "00f6d464-b~
## $ ManagerIndicator <fct> Support, Senior IC, Manager, Support, Support, Supp~
## $ Multitasking_hours <dbl> 0.3812649, 0.2813373, 0.5980080, 0.2918829, 0.42288~</code></pre>
<p>We can then run the Kruskal-Wallis test:</p>
<div class="sourceCode" id="cb21"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb21-1"><a href="#cb21-1" aria-hidden="true" tabindex="-1"></a><span class="fu">kruskal.test</span>(</span>
<span id="cb21-2"><a href="#cb21-2" aria-hidden="true" tabindex="-1"></a> Multitasking_hours <span class="sc">~</span> ManagerIndicator,</span>
<span id="cb21-3"><a href="#cb21-3" aria-hidden="true" tabindex="-1"></a> <span class="at">data =</span> pq_data_grouped_2</span>
<span id="cb21-4"><a href="#cb21-4" aria-hidden="true" tabindex="-1"></a>)</span></code></pre></div>
<pre><code>##
## Kruskal-Wallis rank sum test
##
## data: Multitasking_hours by ManagerIndicator
## Kruskal-Wallis chi-squared = 91.061, df = 4, p-value < 2.2e-16</code></pre>
<p>Based on the Kruskal-Wallis test, we reject the null hypothesis and
conclude that at least one level of <code>LevelDesignation</code>
differs from the others in terms of weekly hours spent multitasking. The
most obvious downside to this method is that it does not tell us which
groups are different from which, so this may need to be followed up with
multiple pairwise-comparison tests (also known as <em>post-hoc
tests</em>).</p>
</div>
</div>
<div id="comparison-tests-anova" class="section level1">
<h1>3. Comparison tests: ANOVA</h1>
<div id="anova" class="section level2">
<h2>3.1 ANOVA</h2>
<p>What if we want to run the t-test across more than two groups?</p>
<p><strong>Analysis of Variance (ANOVA)</strong> is an alternative
method that generalises the t-test beyond two groups, so it is used to
compare three or more groups.</p>
<p>There are several versions of ANOVA. The simple version is the
<em>one-way ANOVA</em>, but there is also <em>two-way ANOVA</em> which
is used to estimate how the mean of a quantitative variable changes
according to the levels of two categorical variables (e.g. rain/no-rain
and weekend/weekday with respect to ice cream sales). In this example we
will focus on one-way ANOVA.</p>
<p>There are three assumptions in ANOVA, and this may look familiar:</p>
<ul>
<li>The data are independent.</li>
<li>The responses for each factor level have a normal population
distribution.</li>
<li>These distributions have the same variance.</li>
</ul>
<p>These assumptions are the same as those required for the classic
t-test above, and it is recommended that you check for equal variances
and normality prior to ANOVA.</p>
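<p>As a rough sketch of what such checks could look like, the snippet
below uses the built-in <code>PlantGrowth</code> dataset as a stand-in
(an assumption made purely so that the example is self-contained; swap
in <code>pq_data_grouped_2</code> and its columns as needed). It runs a
Shapiro-Wilk test within each group, and Bartlett’s test for equal
variances across groups:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Stand-in data (assumption): PlantGrowth ships with base R
data(PlantGrowth)

# Normality within each group: one Shapiro-Wilk p-value per factor level
shapiro_p <- tapply(
  PlantGrowth$weight,
  PlantGrowth$group,
  function(x) shapiro.test(x)$p.value
)
shapiro_p

# Equal variances across groups: Bartlett's test
bartlett_res <- bartlett.test(weight ~ group, data = PlantGrowth)
bartlett_res$p.value</code></pre></div>
<p>A small p-value in either test would suggest that the corresponding
assumption is questionable. Bartlett’s test is itself sensitive to
non-normality, so it is best read alongside the normality checks rather
than in isolation.</p>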
<p>ANOVA calculates the ratio of the <strong>between-group
variance</strong> and the <strong>within-group variance</strong>
(quantified using sum of squares), and then compares this with a
threshold from the Fisher distribution (typically based on a
significance level). The key function is <code>aov()</code>:</p>
<div class="sourceCode" id="cb23"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb23-1"><a href="#cb23-1" aria-hidden="true" tabindex="-1"></a>res_aov <span class="ot"><-</span></span>
<span id="cb23-2"><a href="#cb23-2" aria-hidden="true" tabindex="-1"></a> <span class="fu">aov</span>(</span>
<span id="cb23-3"><a href="#cb23-3" aria-hidden="true" tabindex="-1"></a> Multitasking_hours <span class="sc">~</span> ManagerIndicator,</span>
<span id="cb23-4"><a href="#cb23-4" aria-hidden="true" tabindex="-1"></a> <span class="at">data =</span> pq_data_grouped_2</span>
<span id="cb23-5"><a href="#cb23-5" aria-hidden="true" tabindex="-1"></a> )</span>
<span id="cb23-6"><a href="#cb23-6" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb23-7"><a href="#cb23-7" aria-hidden="true" tabindex="-1"></a><span class="fu">summary</span>(res_aov)</span></code></pre></div>
<pre><code>## Df Sum Sq Mean Sq F value Pr(>F)
## ManagerIndicator 4 40.55 10.14 504.6 <2e-16 ***
## Residuals 193 3.88 0.02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1</code></pre>
<p>The interpretation is as follows:<a href="#fn5" class="footnote-ref" id="fnref5"><sup>5</sup></a></p>
<ul>
<li><p><code>Df</code>: degrees of freedom for…</p>
<ul>
<li>the independent (grouping) variable, i.e. the number of levels in
the variable minus 1 (here, 5 - 1 = 4)</li>
<li>the residuals, i.e. the total number of observations minus the
number of levels in the independent variable (here, 198 - 5 = 193)</li>
</ul></li>
<li><p><code>Sum Sq</code>: sum of squares. For the independent
variable, this is the variation between the group means and the overall
mean; for the residuals, it is the variation within groups that the
independent variable does not explain</p></li>
<li><p><code>Mean Sq</code>: mean of the sum of squares, calculated by
dividing the sum of squares by the degrees of freedom for each
parameter</p></li>
<li><p><code>F value</code>: test statistic from the F test. This is the
mean square of each independent variable divided by the mean square of
the residuals. The larger the F value, the more likely it is that the
variation associated with the independent variable is real and not due
to chance.</p></li>
<li><p><code>Pr(>F)</code>: p-value of the F-statistic. This shows
how likely it is that an F-value at least as large as the one calculated
from the test would have occurred if the null hypothesis of no
difference among group means were true.</p></li>
</ul>
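<p>To see where these numbers come from, the quantities in the list
above can be reproduced by hand. The sketch below again uses the
built-in <code>PlantGrowth</code> dataset as a stand-in (an assumption,
so that the code runs without any extra packages) and compares the
manual calculation against the table returned by
<code>aov()</code>:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Stand-in data (assumption): PlantGrowth ships with base R
y <- PlantGrowth$weight
g <- PlantGrowth$group

n <- length(y)    # total number of observations
k <- nlevels(g)   # number of groups

group_means <- tapply(y, g, mean)
group_sizes <- tapply(y, g, length)

# Between-group sum of squares: group means vs the overall mean
ss_between <- sum(group_sizes * (group_means - mean(y))^2)

# Within-group (residual) sum of squares: observations vs their own group mean
ss_within <- sum((y - group_means[as.character(g)])^2)

# Degrees of freedom
df_between <- k - 1
df_within  <- n - k

# F value: ratio of the two mean squares, and its upper-tail p-value
f_value <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Compare with the table produced by aov()
aov_table <- summary(aov(weight ~ group, data = PlantGrowth))[[1]]
all.equal(f_value, aov_table[["F value"]][1])
all.equal(p_value, aov_table[["Pr(>F)"]][1])</code></pre></div>
<p>The same arithmetic applies to the output above: 4 degrees of freedom
for five levels, and 193 residual degrees of freedom for 198
observations.</p>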
<p>Given that the p-value is smaller than 0.05, we reject the null
hypothesis that all group means are equal. Therefore, we can conclude
that at least one level of <code>LevelDesignation</code> differs from
the others in terms of weekly hours spent multitasking.</p>
<p><a href="https://statsandr.com/blog/anova-in-r/">Antoine Soetewey’s
blog</a> recommends the use of the <strong>report</strong> package,
which can help you make sense of the results more easily:</p>
<div class="sourceCode" id="cb25"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb25-1"><a href="#cb25-1" aria-hidden="true" tabindex="-1"></a><span class="fu">library</span>(report)</span>
<span id="cb25-2"><a href="#cb25-2" aria-hidden="true" tabindex="-1"></a></span>
<span id="cb25-3"><a href="#cb25-3" aria-hidden="true" tabindex="-1"></a><span class="fu">report</span>(res_aov)</span></code></pre></div>
<pre><code>## The ANOVA (formula: Multitasking_hours ~ ManagerIndicator) suggests that:
##
## - The main effect of ManagerIndicator is statistically significant and large
## (F(4, 193) = 504.61, p < .001; Eta2 = 0.91, 95% CI [0.90, 1.00])
##
## Effect sizes were labelled following Field's (2013) recommendations.</code></pre>
<p>The same drawback that applies to the Kruskal-Wallis test also
applies to ANOVA, in that it doesn’t actually tell you which exact group
is different from which; it only tells you whether at least one group
mean differs significantly from the others. This ANOVA test is hence
sometimes also referred to as an ‘omnibus’ test.</p>
</div>
<div id="next-steps-after-anova" class="section level2">
<h2>3.2 Next steps after ANOVA</h2>
<p>A <em>pairwise</em> t-test (note: <em>pairwise</em>, not
<em>paired</em>!) is likely required to provide more information, and it
is recommended that you review the <a href="https://rdrr.io/r/stats/p.adjust.html">p-value adjustment
methods</a> when doing so.<a href="#fn6" class="footnote-ref" id="fnref6"><sup>6</sup></a> Type I errors become more likely when running
t-tests pairwise across many groups, because each additional comparison
is another chance of a false positive, and therefore correction is
necessary. Here is an example of how you might run a pairwise
t-test:</p>
<div class="sourceCode" id="cb27"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb27-1"><a href="#cb27-1" aria-hidden="true" tabindex="-1"></a><span class="fu">pairwise.t.test</span>(</span>
<span id="cb27-2"><a href="#cb27-2" aria-hidden="true" tabindex="-1"></a> <span class="at">x =</span> pq_data_grouped_2<span class="sc">$</span>Multitasking_hours,</span>
<span id="cb27-3"><a href="#cb27-3" aria-hidden="true" tabindex="-1"></a> <span class="at">g =</span> pq_data_grouped_2<span class="sc">$</span>ManagerIndicator,</span>
<span id="cb27-4"><a href="#cb27-4" aria-hidden="true" tabindex="-1"></a> <span class="at">paired =</span> <span class="cn">FALSE</span>,</span>
<span id="cb27-5"><a href="#cb27-5" aria-hidden="true" tabindex="-1"></a> <span class="at">p.adjust.method =</span> <span class="st">"bonferroni"</span></span>
<span id="cb27-6"><a href="#cb27-6" aria-hidden="true" tabindex="-1"></a>)</span></code></pre></div>
<pre><code>##
## Pairwise comparisons using t tests with pooled SD
##
## data: pq_data_grouped_2$Multitasking_hours and pq_data_grouped_2$ManagerIndicator
##
## Director Junior IC Manager Senior IC
## Junior IC <2e-16 - - -
## Manager <2e-16 <2e-16 - -
## Senior IC <2e-16 1 <2e-16 -
## Support <2e-16 1 <2e-16 1
##
## P value adjustment method: bonferroni</code></pre>
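<p>To make the adjustment step less of a black box, here is a small
sketch using made-up p-values (purely illustrative, and not taken from
the data above). With five levels there are ten pairwise comparisons.
The Bonferroni method multiplies each raw p-value by the number of
comparisons and caps the result at 1, whereas the Holm method (the
default in <code>pairwise.t.test()</code>) is a step-down alternative
that is never more conservative:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Number of pairwise comparisons among 5 groups
choose(5, 2)

# Hypothetical raw p-values from three comparisons (illustrative only)
p_raw <- c(0.01, 0.04, 0.20)

# Bonferroni: multiply by the number of tests, capped at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Equivalent manual calculation
p_manual <- pmin(1, p_raw * length(p_raw))

# Holm: step-down method, never more conservative than Bonferroni
p_holm <- p.adjust(p_raw, method = "holm")

# Side-by-side view of raw and adjusted p-values
data.frame(p_raw, p_bonf, p_manual, p_holm)</code></pre></div>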
<p>It may not be surprising that a pairwise method also exists as a
follow-up for the Kruskal-Wallis test: the pairwise Wilcoxon test. This
can be run using <code>pairwise.wilcox.test()</code>. The API for
<code>pairwise.wilcox.test()</code> is very similar to that of
<code>pairwise.t.test()</code>, and you can change the p-value
adjustment method using the argument <code>p.adjust.method</code>:</p>
<div class="sourceCode" id="cb29"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb29-1"><a href="#cb29-1" aria-hidden="true" tabindex="-1"></a><span class="fu">pairwise.wilcox.test</span>(</span>
<span id="cb29-2"><a href="#cb29-2" aria-hidden="true" tabindex="-1"></a> <span class="at">x =</span> pq_data_grouped_2<span class="sc">$</span>Multitasking_hours,</span>
<span id="cb29-3"><a href="#cb29-3" aria-hidden="true" tabindex="-1"></a> <span class="at">g =</span> pq_data_grouped_2<span class="sc">$</span>ManagerIndicator,</span>
<span id="cb29-4"><a href="#cb29-4" aria-hidden="true" tabindex="-1"></a> <span class="at">paired =</span> <span class="cn">FALSE</span>,</span>
<span id="cb29-5"><a href="#cb29-5" aria-hidden="true" tabindex="-1"></a> <span class="at">p.adjust.method =</span> <span class="st">"bonferroni"</span></span>
<span id="cb29-6"><a href="#cb29-6" aria-hidden="true" tabindex="-1"></a>)</span></code></pre></div>
<pre><code>##
## Pairwise comparisons using Wilcoxon rank sum exact test
##
## data: pq_data_grouped_2$Multitasking_hours and pq_data_grouped_2$ManagerIndicator
##
## Director Junior IC Manager Senior IC
## Junior IC 5.3e-09 - - -
## Manager 3.3e-09 1.3e-09 - -
## Senior IC 5.9e-11 1 2.8e-13 -
## Support 1.3e-08 1 8.6e-13 1
##
## P value adjustment method: bonferroni</code></pre>
</div>
</div>
<div id="summary" class="section level1">
<h1>4. Summary</h1>
<p>So far, the following tests we performed have yielded similar
results:</p>
<ol style="list-style-type: decimal">
<li><em>For comparing Senior ICs and Managers:</em>
<ul>
<li>unpaired two-sample t-test (assumes normality)</li>
<li>Wilcoxon Rank-Sum test (non-parametric)</li>
</ul></li>
<li><em>For comparing across more than two values:</em>
<ul>
<li>ANOVA (assumes normality)</li>
<li>Kruskal-Wallis test (non-parametric)</li>
</ul></li>
<li><em>For following up on (2) with pairwise comparisons:</em>
<ul>
<li>pairwise t-test with correction (assumes normality)</li>
<li>pairwise Wilcoxon test (non-parametric)</li>
</ul></li>
</ol>
<p>To the first business question, we can conclude that Senior ICs have
significantly lower multitasking hours than Managers. Although the data
for the two groups are not normal or equal in variance, the mitigating
solutions we used have also found the differences to be significant.
Moreover, it appears that significant differences also exist across
other levels when we reviewed the post-hoc tests.</p>
<div id="should-i-use-a-t-test-or-anova-for-comparing-exactly-two-groups" class="section level2">
<h2>4.1 Should I use a t-test or ANOVA for comparing exactly two
groups?</h2>
<p>One question worth discussing is the scenario at (1). Suppose that
normality is observed in both groups, does it make a difference whether
I use the t-test or ANOVA if I am comparing exactly two groups?</p>
<p>The textbook recommendation is that whenever one is comparing exactly
two groups one should use the t-test, and ANOVA whenever there are more
than two groups being compared. What can get confusing here is that
there is the classic Student’s t-test and the Welch’s t-test.</p>
<p>When ANOVA is used to compare two groups, the results will be
equivalent to a classic (Student’s) t-test with equal variances.<a href="#fn7" class="footnote-ref" id="fnref7"><sup>7</sup></a> However,
if we are talking about Welch’s t-test instead, it may be preferable
to ANOVA because Welch’s t-test takes heteroscedasticity into account.
When there is heteroscedasticity, ANOVA (as well as Kruskal-Wallis) can
become unreliable, with Type I error rates that are:</p>
<ul>
<li>conservative for large sample sizes</li>
<li>inflated for small sample sizes<a href="#fn8" class="footnote-ref" id="fnref8"><sup>8</sup></a></li>
</ul>
<p>To further complicate matters, there is also a method called Welch’s
ANOVA, which is like classic ANOVA but handles unequal variances better.
This can be done in R using <code>oneway.test()</code>, but there is
some debate around best practice that is beyond the scope of this
post.<a href="#fn9" class="footnote-ref" id="fnref9"><sup>9</sup></a> It
would be prudent to run the Welch versions of the tests whenever we
suspect the data to be heteroscedastic.</p>
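<p>For reference, a minimal sketch of Welch’s ANOVA with
<code>oneway.test()</code> is shown below. It uses the built-in
<code>PlantGrowth</code> dataset as a stand-in (an assumption, so that
the example is self-contained), and also shows that setting
<code>var.equal = TRUE</code> recovers the classic one-way ANOVA
F-statistic:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Stand-in data (assumption): PlantGrowth ships with base R
# Welch's ANOVA: equal variances are not assumed (var.equal = FALSE is the default)
welch_res <- oneway.test(weight ~ group, data = PlantGrowth, var.equal = FALSE)
welch_res

# With var.equal = TRUE, oneway.test() performs the classic one-way ANOVA F test
classic_res <- oneway.test(weight ~ group, data = PlantGrowth, var.equal = TRUE)
classic_f   <- summary(aov(weight ~ group, data = PlantGrowth))[[1]][["F value"]][1]

# The classic version matches aov()
all.equal(unname(classic_res$statistic), classic_f)</code></pre></div>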
<p>The recurring themes here are: (1) to check for heteroscedasticity
and normality, and (2) to run multiple tests to acquire a more
comprehensive view.</p>
</div>
<div id="t-tests-anova-and-linear-regression---are-they-completely-different" class="section level2">
<h2>4.2 t-tests, ANOVA, and linear regression - are they completely
different?</h2>
<p>The common assumptions shared by the three methods may have given it
away, but the t-test, ANOVA, and linear regression are actually related
in the sense that one is a special case of another.</p>
<p>The t-test is considered a special case of ANOVA, since the classic
Student’s t-test is the same as ANOVA in comparing two groups when
variances are equal. When the t-statistic is squared, you get the
corresponding F-statistic in the ANOVA.<a href="#fn10" class="footnote-ref" id="fnref10"><sup>10</sup></a></p>
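<p>This relationship is easy to check numerically. The sketch below
uses the built-in <code>mtcars</code> dataset as a stand-in (an
assumption, chosen only because it has a convenient two-level grouping
variable) and compares a Student’s t-test against a one-way ANOVA on the
same two groups:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Stand-in data (assumption): mtcars ships with base R
# Compare fuel efficiency (mpg) across the two transmission types
cars <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))

# Classic Student's t-test (equal variances assumed)
t_res <- t.test(mpg ~ am, data = cars, var.equal = TRUE)

# One-way ANOVA on the same two groups
aov_tab <- summary(aov(mpg ~ am, data = cars))[[1]]

t_squared <- unname(t_res$statistic)^2
f_value   <- aov_tab[["F value"]][1]

# Check that the squared t-statistic matches F, and that the p-values agree
all.equal(t_squared, f_value)
all.equal(t_res$p.value, aov_tab[["Pr(>F)"]][1])</code></pre></div>
<p>Note that the equivalence only holds with
<code>var.equal = TRUE</code>; the default in <code>t.test()</code> is
the Welch version, which will generally give a different statistic.</p>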
<p>On the other hand, an ANOVA model is the same as a regression with a
dummy variable. In fact, the <code>aov()</code> function in R is a
wrapper around the linear regression function <code>lm()</code>. Steve
Midway’s <a href="https://bookdown.org/steve_midway/DAR/understanding-anova-in-r.html"><em>Analysis
in R</em></a> has a chapter which compares the outputs when running
ANOVA using <code>lm()</code> versus <code>aov()</code>.</p>
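<p>A minimal sketch of this comparison, again using the built-in
<code>PlantGrowth</code> dataset as a stand-in (an assumption, for
self-containment), could look like this:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Stand-in data (assumption): PlantGrowth ships with base R
fit_lm  <- lm(weight ~ group, data = PlantGrowth)
fit_aov <- aov(weight ~ group, data = PlantGrowth)

# aov objects inherit from lm
class(fit_aov)

# The ANOVA table from the regression matches the one from aov()
anova(fit_lm)
summary(fit_aov)

# With the default treatment contrasts, the intercept is the mean of the
# reference level ("ctrl") and the other coefficients are differences
# from that reference mean
coef(fit_lm)
tapply(PlantGrowth$weight, PlantGrowth$group, mean)</code></pre></div>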
<p>All of these procedures are subsumed under the General Linear Model
and share the same assumptions.</p>
</div>
</div>
<div id="end-notes" class="section level1">
<h1>End Notes</h1>
<p>This has been a very long post - hope you have found this useful! Due
to the vastness of the subject, it will not be possible to detail every
consideration and method. However, this should hopefully make flow
charts like the below easier to follow:</p>
<div class="figure">
<img src="https://raw.githubusercontent.com/martinctc/blog/master/images/statistical-tests-decision-tree-grosofsky.png" alt="" />
<p class="caption">Flowchart for inferential statistics from Grosofsky
(2009)</p>
</div>
<p>Please comment in the Disqus box down below if you have any feedback
or suggestions. Do also check out the References list below for further
reading; as I wrote this I have attempted to link to the brilliant
resources referenced as diligently as possible.</p>
</div>
<div id="references" class="section level1">
<h1>References</h1>
<ul>
<li><a href="https://online.stat.psu.edu/stat500/lesson/9/9.4/9.4.2">PennState
STAT500</a></li>
<li><a href="https://www.scribbr.com/statistics/statistical-tests/">Guide on
when to use which statistical tests</a></li>
<li><a href="http://www.sthda.com/english/wiki/unpaired-two-samples-t-test-in-r/">Unpaired
t-tests in R</a></li>
<li><a href="https://statsandr.com/blog/anova-in-r/">ANOVA in R</a></li>
<li><a href="https://statsandr.com/blog/kruskal-wallis-test-nonparametric-version-anova/">Kruskal-Wallis
Test in R</a></li>
<li><a href="https://stats.stackexchange.com/questions/1637/if-the-t-test-and-the-anova-for-two-groups-are-equivalent-why-arent-their-assu">t-test
versus ANOVA for two groups</a></li>
<li><a href="https://bookdown.org/steve_midway/DAR/understanding-anova-in-r.html">Understanding
ANOVA in R</a></li>
<li><a href="https://medium.com/git-connected/when-and-why-you-should-use-non-parametric-tests-5ed486a84826">Why
you should use non-parametric tests</a></li>
<li><a href="https://sphweb.bumc.bu.edu/otlt/MPH-Modules/PH717-QuantCore/PH717-Module9-Correlation-Regression/index.html">Correlation
and Regression</a></li>
</ul>
</div>
<div class="footnotes footnotes-end-of-document">
<hr />
<ol>
<li id="fn1"><p>a scenario in modelling where your predictor variables
are correlated, which could lead to a poor inference.<a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p>See <a href="https://learn.microsoft.com/en-us/viva/insights/use/metric-definitions" class="uri">https://learn.microsoft.com/en-us/viva/insights/use/metric-definitions</a>
for definitions.<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
<li id="fn3"><p>See Mishra P, Pandey CM, Singh U, Gupta A, Sahu C,
Keshri A. Descriptive statistics and normality tests for statistical
data. Ann Card Anaesth. 2019 Jan-Mar;22(1):67-72. doi:
10.4103/aca.ACA_157_18. PMID: 30648682; PMCID: PMC6350423.<a href="#fnref3" class="footnote-back">↩︎</a></p></li>
<li id="fn4"><p>The other well-known alternative test for normality is
the <strong>Kolmogorov-Smirnov</strong> test, run in R using
<code>ks.test()</code>. The KS test looks at the quantile where your
empirical cumulative distribution function differs maximally from the
normal’s theoretical cumulative distribution function. This is often
somewhere in the middle of the distribution. On the other hand, the
Shapiro-Wilk test focusses on the tails of the distribution, which is
consistent with what we are seeing in the Q-Q plots.<a href="#fnref4" class="footnote-back">↩︎</a></p></li>
<li id="fn5"><p>References original article at <a href="https://www.scribbr.com/statistics/anova-in-r/" class="uri">https://www.scribbr.com/statistics/anova-in-r/</a>.<a href="#fnref5" class="footnote-back">↩︎</a></p></li>
<li id="fn6"><p>An alternative is the Tukey Honest Significant
Differences (<code>TukeyHSD()</code>), which won’t be detailed here. The
<code>TukeyHSD()</code> function operates on top of the object returned
by <code>aov()</code>.<a href="#fnref6" class="footnote-back">↩︎</a></p></li>
<li id="fn7"><p>See <a href="https://stats.stackexchange.com/questions/236877/is-it-wrong-to-use-anova-instead-of-a-t-test-for-comparing-two-means">this
discussion</a> and <a href="https://stats.stackexchange.com/questions/409503/anova-vs-t-test-for-two-groups">this</a>.<a href="#fnref7" class="footnote-back">↩︎</a></p></li>
<li id="fn8"><p><a href="https://www.statisticshowto.com/welchs-anova/" class="uri">https://www.statisticshowto.com/welchs-anova/</a><a href="#fnref8" class="footnote-back">↩︎</a></p></li>
<li id="fn9"><p>See <a href="https://statisticsbyjim.com/anova/welchs-anova-compared-to-classic-one-way-anova/" class="uri">https://statisticsbyjim.com/anova/welchs-anova-compared-to-classic-one-way-anova/</a>;
<a href="https://blog.minitab.com/en/adventures-in-statistics-2/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete" class="uri">https://blog.minitab.com/en/adventures-in-statistics-2/did-welchs-anova-make-fishers-classic-one-way-anova-obsolete</a>;
<a href="http://ritsokiguess.site/docs/2017/05/19/welch-analysis-of-variance/" class="uri">http://ritsokiguess.site/docs/2017/05/19/welch-analysis-of-variance/</a>.
See also Liu, H. (2015). Comparing Welch ANOVA, a Kruskal-Wallis test,
and traditional ANOVA in case of heterogeneity of variance. Virginia
Commonwealth University.<a href="#fnref9" class="footnote-back">↩︎</a></p></li>
<li id="fn10"><p>It is worth a quick footnote on the differences between
the t-statistic and the F-statistic. The F-statistic is an output that
is found in both the F-test for variance (see <code>var.test()</code>)
and ANOVA (see <code>aov()</code>). The F-statistic is a ratio of two
variances, and variance is squared standard deviation. Note that the
F-test for variance and ANOVA are not the same, as the former compares
the variances of two populations whereas the latter compares within- and
between-group variances, even though both tests use the F-distribution.
When there are only two groups for the one-way ANOVA F-test, the
F-statistic is equal to the square of the Student’s t-statistic.<a href="#fnref10" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</section>
<p>Martin Chan</p>
<h1>Top 10 tips to make your R package even more awesome</h1>
<p>2020-12-22 · <a href="https://martinctc.github.io/blog/make-package-even-more-awesome">https://martinctc.github.io/blog/make-package-even-more-awesome</a></p>
<p><img src="https://github.com/martinctc/blog/raw/master/images/vibing-cat-r.gif" alt="" /></p>
<h2 id="what-this-post-is-about">What this post is about</h2>
<p>This post shares top ten tips on how to make your R package even more awesome than it already is. As an R developer, you’ve already put a lot of work into creating and testing your package - so why waste this opportunity to really showcase your work and make it go even further? The tips mentioned in this post can be divided into three main categories:</p>
<ol>
<li>
<p><strong>Communicating your package</strong>: so others can access your package and try it more easily</p>
</li>
<li>
<p><strong>Wrestling time back from developer chores</strong>: so you can spend more time on the important things</p>
</li>
<li>
<p><strong>DevOps best practices</strong>: so other fellow R users will feel more confident about using your package, and make it easier for other developers to collaborate or contribute.</p>
</li>
</ol>
<p>This post assumes that you’ve already written an R package, and therefore won’t focus on the coding component of R package development.</p>
<hr />
<h2 id="background">Background</h2>
<p>Before we begin, shameless plug alert: I’ve written a few tiny R packages. Here are a few of them:</p>
<ul>
<li>
<p><a href="https://martinctc.github.io/rwa/"><strong>rwa</strong></a>: you can run Relative Weights Analysis (a.k.a. Key Drivers Analysis) to measure variable importance. This is available on CRAN.</p>
</li>
<li>
<p><a href="https://hong-kong-districts-info.github.io/hkdatasets/"><strong>hkdatasets</strong></a>: contains datasets that relate to Hong Kong, and is used for our own projects at <a href="https://hong-kong-districts-info.github.io">Hong Kong Districts Info</a>. This is also available on CRAN.</p>
</li>
<li>
<p><a href="https://martinctc.github.io/parallaxr"><strong>parallaxr</strong></a>: allows you to generate pretty parallax scroll documents with R and Markdown.</p>
</li>
<li>
<p><a href="https://martinctc.github.io/surveytoolbox"><strong>surveytoolbox</strong></a>: this package contains all the ‘convenience’ functions back in the days when I was analysing mostly survey data.</p>
</li>
<li>
<p><a href="https://hong-kong-districts-info.github.io/hkdistrictballs/"><strong>hkdistrictballs</strong></a>: created for fun, this package allows you to generate “country ball” graphics for the 18 districts of Hong Kong. It makes use of the <strong>magick</strong> package.</p>
</li>
</ul>
<p>Admittedly, I did not write all these R packages for entirely altruistic reasons. Writing an R package is an exercise that is valuable in itself, as it allows you to put all your most commonly used custom functions into a neat, self-contained package which you can just load at the start of your analysis sessions, instead of copying and pasting snippets of code from GitHub Gists or randomly placed R scripts.</p>
<p>I used to keep a GitHub Gist which contained 1000+ lines of my most used functions, but trust me, you won’t want to do that. Not only does such a maniacal workflow make the likelihood of your future self being able to reproduce your work <em>completely</em> dependent on your organisational or documentation skills, it also represents a potential loss to your colleagues or the R community, as all the work that you have put into writing your custom functions will help nobody else but yourself, as nobody else can access or understand your functions.</p>
<p>However, one big reason why I write all these R packages is that I enjoy the <strong>creative</strong> process. I believe a significant, but sometimes neglected, part of writing R packages is communicating to your package users <strong>why</strong> they should use your package, and <strong>how</strong> they can use it. Easy-to-follow examples, reproducible vignettes, documentation that isn’t 100% technical-lingo - all these things help with making an R package easier to use, yet are unrelated to the quality or the implementation of the R code itself. A lot of this is about communication, which is mostly what this post is about (for the code quality aspect, I would recommend resources like <a href="https://adv-r.hadley.nz/">Advanced R</a> or <a href="https://r-pkgs.org/">R Packages</a> instead).</p>
<p>So here are my top ten recommendations on how to make your R package even more awesome than it currently is. Let’s go!</p>
<h2 id="1-create-a-package-website-with-pkgdown">1. Create a package website with <strong>pkgdown</strong></h2>
<p>Whilst this tip is quite well-known, its place in the top ten is unquestionable. The <strong>pkgdown</strong> package makes it incredibly easy to create a package website straight from the files that ‘naturally’ exist in your package, such as <code class="language-plaintext highlighter-rouge">README.md</code> and <code class="language-plaintext highlighter-rouge">DESCRIPTION</code>. This package website will document all the functions in your package, even running all the examples in your R scripts (under <code class="language-plaintext highlighter-rouge">@examples</code>), and make it incredibly easy for your users to navigate your package and try out its functionalities.</p>
<p>The alternative is to make your users go through the official PDF R package manual which, although easy enough to generate with <code class="language-plaintext highlighter-rouge">devtools::build_manual()</code>, is not the easiest to navigate, does not natively support plot examples, and is definitely more likely to put off new R users from using your package.</p>
<p>For an example of the website in action, here is <a href="https://microsoft.github.io/wpa">an R package that I’ve recently written for work</a>, which leverages <strong>pkgdown</strong> to showcase <a href="https://microsoft.github.io/wpa/reference/">the large number of functions in the package</a>, and to include an <a href="https://microsoft.github.io/wpa/analyst_guide.html">“Analyst Guide”</a> to make it easier to explore the package’s features.</p>
<p>The set-up I would recommend is a GitHub Actions workflow that builds the pkgdown website to a separate <code class="language-plaintext highlighter-rouge">gh-pages</code> branch every time you push a commit to the <code class="language-plaintext highlighter-rouge">main</code> or <code class="language-plaintext highlighter-rouge">master</code> branch on GitHub, with your GitHub Pages set to point to <code class="language-plaintext highlighter-rouge">gh-pages</code> for hosting.</p>
<p>What this effectively means is that you will have a package website that practically “maintains itself”, as the website will update itself as you update your package (e.g. <code class="language-plaintext highlighter-rouge">DESCRIPTION</code> or the function documentation) and push the changes to GitHub. What’s more, this set-up is free as it’s hosted on GitHub!</p>
<p>To set all this up, you just need to run:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">usethis</span><span class="o">::</span><span class="n">use_github_action</span><span class="p">(</span><span class="s2">"pkgdown"</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>This line of code will configure a GitHub Actions workflow to build and deploy your pkgdown site whenever you push changes to GitHub. The workflow file should be created and saved at <code class="language-plaintext highlighter-rouge">.github/workflows/pkgdown.yaml</code>. The only manual step you’ll need to do is to go to <strong>Settings</strong> in your GitHub repo, go to <strong>Options</strong>, and scroll down until you see <strong>GitHub Pages</strong>. For <strong>Source</strong>, the site should be set to build from the root folder of the <code class="language-plaintext highlighter-rouge">gh-pages</code> branch.</p>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/gh-pages.png" alt="" /></p>
<p>Once this is set-up and saved, it should just take a few minutes and you should be able to see your website at <code class="language-plaintext highlighter-rouge">https://<YOUR-GITHUB-USERNAME>.github.io/<YOUR-PACKAGE-NAME></code>. You can also of course use a custom domain if you wish.</p>
<p>If you’d like to customise your website, you may add a <code class="language-plaintext highlighter-rouge">_pkgdown.yml</code> file in which you can specify things like what to show in your navigation bar, Google Analytics tracking code, site theme, social network icons, etc. There are plenty of package sites that are set up this way, so if you’re looking for inspiration you can just take a peek at the <code class="language-plaintext highlighter-rouge">_pkgdown.yml</code> file for any pkgdown sites that use this set-up (you can start with the one used by the <a href="https://github.com/r-lib/usethis/blob/master/_pkgdown.yml"><strong>usethis</strong> pkgdown site</a>). The five R packages mentioned at the beginning of this post also use this set-up.</p>
<h2 id="2-automated-r-cmd-checks-with-github-actions">2. Automated R CMD checks with GitHub Actions</h2>
<p>Chances are, if you’ve already written a package, you’ll at least have run <code class="language-plaintext highlighter-rouge">R CMD check</code>, or run <code class="language-plaintext highlighter-rouge">devtools::check()</code>, to test for errors in your R package.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> <code class="language-plaintext highlighter-rouge">R CMD check</code> automatically checks your code for common problems, e.g.:</p>
<ul>
<li>whether the package can be successfully installed on different operating systems</li>
<li>whether there are syntax errors in the script</li>
<li>whether there are undocumented arguments in your functions, etc.</li>
</ul>
<p>Now, you can either run this manually on your local machine, OR you can configure GitHub Actions to run this check automatically whenever you push a commit or merge a change to your main/master branch. The bonus with the latter, of course, is that you get a nice fancy badge that you can place in your README.md, like this:</p>
<p><img src="https://github.com/martinctc/blog/raw/master/images/r-cmd-check-passing.svg" alt="" /></p>
<p>The only thing you have to make sure of is that your package passes these checks before you add the badge for the first time, otherwise you’ll get an alarming <strong>failing</strong> badge on your repo!</p>
<p>The easiest way to add GitHub Actions, again, is to use the <strong>usethis</strong> package:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">usethis</span><span class="o">::</span><span class="n">use_github_actions</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<p>Similar to tip #1, this adds a yaml file under <code class="language-plaintext highlighter-rouge">.github/workflows</code> called <code class="language-plaintext highlighter-rouge">R-CMD-check.yaml</code>. To add a badge, you can then run:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">usethis</span><span class="o">::</span><span class="n">use_github_actions_badge</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<p>You can check the <a href="https://usethis.r-lib.org/reference/github_actions.html"><strong>usethis</strong> documentation</a> on the specific details of this function.</p>
<p>Adding automated checks embodies the principles of <strong>CI/CD</strong> (continuous integration, continuous delivery) coding practice, which prefers regular and frequent code check-ins to version control repositories. Automated checks are a form of <strong>continuous testing</strong>, which is a condition for CI/CD. The argument goes that this leads to better collaboration due to greater transparency, and higher software quality due to continuous testing. Errors can be identified sooner, plus a ‘passing’ badge helps assure potential users of your package that you have done your homework to make sure that your package passes all the basic checks.</p>
<h2 id="3-codefactor">3. CodeFactor</h2>
<p>Speaking of badges, here’s another that you can add to your GitHub!</p>
<p><img src="https://github.com/martinctc/blog/raw/master/images/codefactor-a-plus.svg" alt="" /></p>
<p><a href="https://www.codefactor.io/">CodeFactor</a> performs an automated review of your R code for code quality, and returns a grade (just like in school!). As you’ll see, it’s possible to get an A+, but you can also get a few of the following grades:</p>
<p><img src="https://www.codefactor.io/Content/badges/B.svg" alt="" /><br />
<img src="https://www.codefactor.io/Content/badges/C.svg" alt="" /><br />
<img src="https://www.codefactor.io/Content/badges/F.svg" alt="" /></p>
<p>Instead of checking whether your functions fail or whether your package can be successfully installed, CodeFactor checks for things like:</p>
<ul>
<li>Whether you use <code class="language-plaintext highlighter-rouge">library()</code> <em>within</em> a function - which is not recommended</li>
<li>Whether you have arguments which have been defined but never used in the function</li>
<li>Whether you adopt sub-ideal practices like <code class="language-plaintext highlighter-rouge">1:length(x)</code> (instead of <code class="language-plaintext highlighter-rouge">seq_along(x)</code>) or <code class="language-plaintext highlighter-rouge">sapply()</code> (due to return type uncertainty)</li>
<li>Whether you use <code class="language-plaintext highlighter-rouge">options()</code> directly inside a function instead of <code class="language-plaintext highlighter-rouge">withr::with_options()</code></li>
</ul>
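<p>To make the sequence and <code class="language-plaintext highlighter-rouge">sapply()</code> points concrete, here is a minimal base R sketch of why these patterns tend to get flagged. It is my own illustration rather than CodeFactor output, and the object names are hypothetical. The edge case to watch is an empty input:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># An empty vector is where the difference shows up
x <- character(0)

# 1:length(x) evaluates to 1:0, i.e. c(1, 0), so a loop over it runs twice
bad_index <- 1:length(x)

# seq_along(x) returns an empty integer vector, so a loop over it never runs
good_index <- seq_along(x)

# sapply() changes its return type with the input: an empty input gives a list
loose <- sapply(x, nchar)

# vapply() always returns the declared type, here an integer vector
strict <- vapply(x, nchar, integer(1))
</code></pre></div></div>
<p>With a non-empty <code class="language-plaintext highlighter-rouge">x</code> both versions behave the same, which is exactly why this kind of bug tends to survive until a function meets an unexpected input.</p>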
<p>This is a great way to review your code automatically, instead of badgering a friend who happens to be an experienced R developer to review your package for you.</p>
<p><img src="https://github.com/martinctc/blog/raw/master/images/badger.gif" alt="" /></p>
<p>And speaking of badgers, I highly recommend checking out the <a href="https://github.com/GuangchuangYu/badger"><strong>badger</strong> package</a>, which allows you to generate badges in your README. There are so many other badges that you can add to your package README (e.g. code coverage, number of downloads), but I won’t detail them here as this would turn into a post about badges.</p>
<h2 id="4-use-conventional-commits">4. Use conventional commits</h2>
<blockquote class="twitter-tweet" data-conversation="none" data-theme="dark"><p lang="en" dir="ltr">Write every commit message as if it's part of a PR to your future employer.<br />- Confucius</p>— 🐢 Florian (@fistful_of_bass) <a href="https://twitter.com/fistful_of_bass/status/1338645998634033152?ref_src=twsrc%5Etfw">December 15, 2020</a></blockquote>
<p>There are many reasons for making sure your commit messages are sensible rather than unhelpful and silly (e.g. “update repo lol”), including the one cited above. Here, the recommendation is to take this further and use <strong>conventional commits</strong>. This means adhering to a set of conventions that make the <em>intent</em> of each commit message explicit. Each commit message would be prefixed with, for instance, <code class="language-plaintext highlighter-rouge">fix:</code> or <code class="language-plaintext highlighter-rouge">feat:</code> to indicate whether it is a bug fix or a feature change. Some examples are:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">feat: add new barplot function</code> - a new feature introduced</li>
<li><code class="language-plaintext highlighter-rouge">fix: syntax error</code> - a bug fix</li>
<li><code class="language-plaintext highlighter-rouge">format: ggplot theme changes</code> - a change to <em>formatting</em> that doesn’t affect code logic</li>
<li><code class="language-plaintext highlighter-rouge">perf: remove nested loop</code> - a change to <em>performance</em> by removing nested loops</li>
<li><code class="language-plaintext highlighter-rouge">docs: add examples</code> - a change to the documentation only</li>
</ul>
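<p>If you want to check messages against such a convention programmatically, a regular expression goes a long way. The sketch below is only an illustration: <code class="language-plaintext highlighter-rouge">is_conventional()</code> is a hypothetical helper, and the list of accepted types is simply the set used in the examples above, not an official list.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical helper: TRUE when a message starts with an accepted type,
# an optional scope in brackets, an optional "!", then ": " and a description
is_conventional <- function(msg, types = c("feat", "fix", "format", "perf", "docs")) {
  pattern <- paste0("^(", paste(types, collapse = "|"), ")(\\([a-z0-9-]+\\))?!?: .+")
  grepl(pattern, msg)
}

# Vectorised over messages, so it can be run over a whole commit history
checked <- is_conventional(c("feat: add new barplot function", "update repo lol"))
</code></pre></div></div>
<p>A check like this could live in a Git hook or a CI step, but even run by hand it is a quick way to see how consistently a repository follows its own style guide.</p>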
<p>You can find out more about conventional commits <a href="https://www.conventionalcommits.org/en/v1.0.0/">here</a>. I highly recommend at least reading through the FAQ section, which answers some common questions which pop up when you are coming across conventional commits for the first time.</p>
<p>The benefit of using conventional commits is that it increases the transparency of the entire project, and makes it more welcoming and inclusive for collaborators. I’m also sure it will impress potential future employers, with its incredible neatness! It will also make things much easier when you are writing up pull request summaries and any package change logs.</p>
<p>To make this even more inclusive for other collaborators, you can add a Git Style Guide to the Wiki page of your GitHub repository, like <a href="https://github.com/Hong-Kong-Districts-Info/dashboard-hkdistrictcouncillors/wiki/Style-Guide:-Git">this</a>. Kudos to Avision Ho for sharing this idea and concept with me in the first place.</p>
<h2 id="5-package-start-up-message">5. Package start-up message</h2>
<p>This is probably the most controversial tip in this post, i.e. adding a start-up message to your package. This is a short message for your package users that comes up whenever they run <code class="language-plaintext highlighter-rouge">library(YOURPACKAGE)</code>.</p>
<p>Why might you do this? Personally, I think it is a nice way to share details such as where to find more resources about the package, or where to report bugs. Some developers also use this space to include a few lines to advertise some of their other work. In <strong>tidyquant</strong>, you get a subtle start-up message when you load the package:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">==</span><span class="w"> </span><span class="n">Need</span><span class="w"> </span><span class="n">to</span><span class="w"> </span><span class="n">Learn</span><span class="w"> </span><span class="n">tidyquant</span><span class="o">?</span><span class="w"> </span><span class="o">=====================================================</span><span class="w">
</span><span class="n">Business</span><span class="w"> </span><span class="n">Science</span><span class="w"> </span><span class="n">offers</span><span class="w"> </span><span class="n">a</span><span class="w"> </span><span class="m">1</span><span class="o">-</span><span class="n">hour</span><span class="w"> </span><span class="n">course</span><span class="w"> </span><span class="o">-</span><span class="w"> </span><span class="n">Learning</span><span class="w"> </span><span class="n">Lab</span><span class="w"> </span><span class="c1">#9: Performance Analysis & Portfolio Optimization with tidyquant!</span><span class="w">
</span><span class="o"></></span><span class="w"> </span><span class="n">Learn</span><span class="w"> </span><span class="n">more</span><span class="w"> </span><span class="n">at</span><span class="o">:</span><span class="w"> </span><span class="n">https</span><span class="o">://</span><span class="n">university.business</span><span class="o">-</span><span class="n">science.io</span><span class="o">/</span><span class="n">p</span><span class="o">/</span><span class="n">learning</span><span class="o">-</span><span class="n">labs</span><span class="o">-</span><span class="n">pro</span><span class="w"> </span><span class="o"></></span><span class="w">
</span></code></pre></div></div>
<p>How do you add a start-up message? This can be done by adding a function <code class="language-plaintext highlighter-rouge">.onAttach()</code> to one of your R scripts in the package. Here’s <a href="https://github.com/microsoft/wpa/blob/main/R/init.R">one I’ve created earlier</a> for the <strong>wpa</strong> package:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">.onAttach</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="k">function</span><span class="p">(</span><span class="n">libname</span><span class="p">,</span><span class="w"> </span><span class="n">pkgname</span><span class="p">)</span><span class="w"> </span><span class="p">{</span><span class="w">
</span><span class="n">message</span><span class="w"> </span><span class="o"><-</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s2">"\n Thank you for using the {wpa} R package!"</span><span class="p">,</span><span class="w">
</span><span class="s2">"\n \n Our analysts have taken every care to ensure that this package runs smoothly and bug-free."</span><span class="p">,</span><span class="w">
</span><span class="s2">"\n However, if you do happen to encounter any, please email mac@microsoft.com to report any issues."</span><span class="p">,</span><span class="w">
</span><span class="s2">"\n \n Happy coding!"</span><span class="p">)</span><span class="w">
</span><span class="n">packageStartupMessage</span><span class="p">(</span><span class="n">message</span><span class="p">)</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>This is controversial because some argue that package start-up messages clutter up the console and interfere with reproducibility.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> However, there is also another line of argument that defends the right of open-source developers to place adverts in the packages that they’ve worked so hard on (see <a href="https://twitter.com/_ColinFay/status/1305423796380336128?s=20">this Twitter thread</a>). Of course, you might just want to add a welcome message rather than an advert to your package, but I’ll leave this to the reader to decide.</p>
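<p>One thing that softens the clutter argument: anything raised with <code class="language-plaintext highlighter-rouge">packageStartupMessage()</code> (as opposed to <code class="language-plaintext highlighter-rouge">cat()</code> or <code class="language-plaintext highlighter-rouge">print()</code>) can be silenced by the user. Here is a minimal sketch using only base R, where <strong>stats</strong> stands in for any package and the message text is made up:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Users who do not want start-up messages can wrap the library() call;
# "stats" ships with R and stands in for any package here
suppressPackageStartupMessages(library(stats))

# The same mechanism, shown with a hand-raised start-up message:
# the message is muffled and the block still returns its value
quiet <- suppressPackageStartupMessages({
  packageStartupMessage("Thank you for using this package!")
  "loaded"
})
</code></pre></div></div>
<p>So if you do add a start-up message, raising it through <code class="language-plaintext highlighter-rouge">packageStartupMessage()</code> inside <code class="language-plaintext highlighter-rouge">.onAttach()</code> at least leaves the choice with your users.</p>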
<h2 id="6-add-a-gif-in-your-readme">6. Add a GIF in your README</h2>
<p>GIFs are awesome, even in the context of R package READMEs. I’ve recently experimented with screen-recording an example of my package in action, converting the video into a GIF, and adding it to the README, and the feedback has been mostly positive. See the example below from the <a href="https://martinctc.github.io/parallaxr/">parallaxr</a> package:</p>
<p><img src="https://raw.githubusercontent.com/martinctc/parallaxr/main/.dev/parallaxr.gif" alt="" /></p>
<p>If your package allows you to generate visual outputs like plots or HTML widgets, this is a great way to let potential users see what they can achieve without leaving it only to their imagination (<em>“what happens when I run <code class="language-plaintext highlighter-rouge">foo_bar()</code>?”</em>).</p>
<h2 id="7-add-a-contributor-guide-and-pr-templates">7. Add a Contributor Guide and PR templates</h2>
<p>This tip is actually what GitHub recommends under its settings in <strong>Insights > Community</strong>. And there are good reasons for doing so. The recommendation is that you add a contributor guideline (<code class="language-plaintext highlighter-rouge">CONTRIBUTING.md</code>) and a pull request template to your repository to make it easier for others to collaborate on your package.</p>
<p>I would highly recommend doing anything that would make it easier for others to contribute, as I think it’s fair to say that the number of contributions (in the form of submitted issues, forks, and pull requests) is a mark of an R package’s success (you can measure using GitHub Stars too if you want, I guess).</p>
<p><a href="https://docs.github.com/en/free-pro-team@latest/github/building-a-strong-community/setting-guidelines-for-repository-contributors">GitHub</a> has a comprehensive guide on how to add a Contributor Guide, and it’s really up to you to decide on how you would like others to contribute changes to your package. Still not sure what to put in your <code class="language-plaintext highlighter-rouge">CONTRIBUTING.md</code>? The best place to look is the big, popular R package repositories on GitHub: see what they put in theirs (probably one of the most important takeaways of this post).</p>
<p>To add a pull request template, you’ll need to add a file named <code class="language-plaintext highlighter-rouge">pull_request_template.md</code> in the <code class="language-plaintext highlighter-rouge">.github</code> subfolder of your package. Certain things you may consider adding to your pull request template are:</p>
<ul>
<li>Summary of changes from the branch</li>
<li>Checks to perform when reviewing the pull request</li>
<li>What issues are linked to this pull request</li>
</ul>
<p>You can use <a href="https://github.com/Hong-Kong-Districts-Info/hkdatasets/blob/master/.github/pull_request_template.md">this version originally put together by Avision Ho</a> as a starting point for authoring your own templates.</p>
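<p>If you prefer to stay in R, the template file can be created with a few lines of base R. This is only a sketch: the headings are hypothetical placeholders based on the list above, and it writes to a temporary directory so that it is safe to run as-is. In practice <code class="language-plaintext highlighter-rouge">pkg_root</code> would be the root of your package.</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical template content; adapt the headings to your own project
template <- c(
  "## Summary of changes",
  "",
  "## Checks for the reviewer",
  "- [ ] R CMD check passes",
  "- [ ] Documentation has been updated",
  "",
  "## Linked issues"
)

# A temporary directory is used here; in practice this is your package root
pkg_root <- file.path(tempdir(), "mypackage")
dir.create(file.path(pkg_root, ".github"), recursive = TRUE, showWarnings = FALSE)

# GitHub looks for this file name inside the .github folder
template_path <- file.path(pkg_root, ".github", "pull_request_template.md")
writeLines(template, template_path)
</code></pre></div></div>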
<h2 id="8-add-a-hex-sticker">8. Add a hex sticker</h2>
<p>There’s no way an R package is complete without a hex sticker. It’s tradition, it’s cool, although arguably not <em>essential</em> - but why not? It’s very easy to add one, and it makes people want to download your package first even when they haven’t quite figured out the use case for your code yet.</p>
<p><img src="https://github.com/martinctc/blog/raw/master/images/bilbo-why-not.jpg" alt="" /></p>
<p>What’s more, you can create an R package hex sticker with <strong>an R package</strong>! If you’ve not heard of it yet, you should give Guangchuang Yu’s <a href="https://github.com/GuangchuangYu/hexSticker">hexSticker</a> a go.</p>
<p>Alternatively, if you’re something of a visual artist yourself, you can also choose to create one on your own with Inkscape, which is an open-source vector graphics editor. Choose an existing hex sticker as a template, and edit the underlying SVG.</p>
<p>I would recommend editing in SVG because vector graphics scale without losing resolution, which <em>may</em> come in handy one day if your R package makes it big and people want to print it on merch. Dreaming on…</p>
<h2 id="9-create-a-package-cheatsheet">9. Create a package cheatsheet</h2>
<p>I’m not aware of any R packages out there that can generate a package cheatsheet for you (tell me if you are), but it’s one of those things that is totally worth doing even <em>manually</em>.</p>
<p><img src="https://github.com/martinctc/blog/raw/master/images/ggplot-cheat-sheet.png" alt="" /></p>
<p>A cheatsheet helps users view at a glance all the functions that are available in your package, categorised in a way that you (the developer) find meaningful. The <a href="https://rstudio.com/resources/cheatsheets/">RStudio cheatsheet collection</a> provides plenty of examples that you can reference, as well as a <a href="https://rstudio.com/resources/cheatsheets/how-to-contribute-a-cheatsheet/">template</a> with which you can create your own cheatsheet using either Keynote or PowerPoint. Here’s <a href="https://github.com/microsoft/wpa/blob/main/man/figures/wpa%20cheatsheet.pdf">one</a> I made earlier.</p>
<h2 id="10-submit-to-cran">10. Submit to CRAN</h2>
<p>Okay, this is kind of a no-brainer, and everyone <em>ideally</em> would want to have their package on CRAN. It really is something you should try to do, even if it is a bit of work getting all the bits right, as it gives your package a mark of approval and a boost in popularity.</p>
<p>Having automated R CMD checks will help you get there slightly faster and more easily, and to be honest I did not find the process as difficult as I previously imagined. The CRAN reviewers (who are volunteers, by the way!) have all been very helpful and explicit in their feedback on what needs to be changed in order to re-submit a package. Having said that, it’s a courtesy to test and review your package thoroughly before submitting it to CRAN so you don’t waste time for both the CRAN team and yourself! Submitting to CRAN is a substantial topic in itself, so I’m going to just put down some links.</p>
<p>Karl Broman has a pretty informative primer on <a href="https://kbroman.org/pkg_primer/pages/cran.html">how to get your R package on CRAN</a>.</p>
<ul>
<li><a href="https://cran.r-project.org/web/packages/submission_checklist.html">Checklist for CRAN submissions</a></li>
<li><a href="https://cran.r-project.org/web/packages/policies.html">CRAN Repository Policy</a></li>
</ul>
<h2 id="bonus-tip">Bonus tip…</h2>
<p>Since the last tip was probably slightly less informative, I’ve decided to throw in a bonus tip, which is a list of channels in which you should try to promote your R package:</p>
<ul>
<li>
<p>Write a blog post about your package, and <a href="https://www.r-bloggers.com/add-your-blog/">submit to R-bloggers</a>. R-bloggers has a huge readership / following, and this is a great way of making the R community aware of your package.</p>
</li>
<li>
<p>Submit your package to <a href="https://rweekly.org/submit">RWeekly</a>, either as a blog or as a simple package release message. You can submit to RWeekly by creating a pull request to merge to its <code class="language-plaintext highlighter-rouge">DRAFT.md</code>, or use one of the other submission methods listed on the website.</p>
</li>
<li>
<p>Post your package release message on Twitter with the #rstats hashtag. This makes it much more likely for the package to be picked up by the R community. Note that the convention is to use #rstats rather than #R as a hashtag - see <a href="https://www.t4rstats.com/hashtags-what-are-they-good-for.html">https://www.t4rstats.com/hashtags-what-are-they-good-for.html</a>.</p>
</li>
</ul>
<blockquote class="twitter-tweet" data-theme="dark"><p lang="en" dir="ltr">I declare <a href="https://twitter.com/hashtag/rstats?src=hash&ref_src=twsrc%5Etfw">#rstats</a> the official R statistical prog lang hashtag, pass it on to friends, family and Stata users</p>— Drew Conway (@drewconway) <a href="https://twitter.com/drewconway/status/1448027809?ref_src=twsrc%5Etfw">April 3, 2009</a></blockquote>
<ul>
<li>If you use Reddit, consider posting in the <a href="https://www.reddit.com/r/rstats/">Rstats</a> subreddit.</li>
</ul>
<p>Finally, it’s worth emphasising that the best way to learn how to improve your package is to look at how others do it. In the process of writing this post, I’ve learnt something myself when looking at the <a href="https://github.com/strengejacke/sjmisc">sjmisc</a> package GitHub repository, i.e. a way to make it easy for others to cite your R package, with:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">citation</span><span class="p">(</span><span class="s1">'data.table'</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
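<p>The same call works for any installed package, and the result can be converted to BibTeX for those who manage references that way. A minimal sketch, where <strong>stats</strong> (which ships with R) stands in for your own package name:</p>
<div class="language-R highlighter-rouge"><div class="highlight"><pre class="highlight"><code># "stats" ships with R and stands in for your own package name
cit <- citation("stats")

# Convert to a BibTeX entry that can be pasted into a reference manager
bib <- toBibtex(cit)
bib
</code></pre></div></div>
<p>By default the citation is generated from the package’s <code class="language-plaintext highlighter-rouge">DESCRIPTION</code> file; if you want to point users to a paper or book instead, you can ship a <code class="language-plaintext highlighter-rouge">CITATION</code> file in the <code class="language-plaintext highlighter-rouge">inst</code> folder of your package.</p>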
<p>I’m sure there are plenty of other great tips out there that I’ve not included, but again I hope this post was useful enough. If you enjoyed this post, please leave a comment on the original blog post. Take care and stay safe, and happy coding!</p>
<div class="footnotes" role="doc-endnotes">
<ol>
<li id="fn:1" role="doc-endnote">
<p>See https://r-pkgs.org/r-cmd-check.html for a detailed explanation of the <code class="language-plaintext highlighter-rouge">R CMD check</code>. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>See https://win-vector.com/2019/08/30/it-is-time-for-cran-to-ban-package-ads/. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">↩</a></p>
</li>
</ol>
</div>Martin ChanComparing Common Operations in dplyr and data.table2020-11-06T00:00:00+00:002020-11-06T00:00:00+00:00https://martinctc.github.io/blog/comparing-common-operations-in-dplyr-and-data.table<script src="https://martinctc.github.io/blog/knitr_files/common-operations-dplyr-datatable-09-11-2020_files/header-attrs-2.3/header-attrs.js"></script>
<section class="main-content">
<div id="background" class="section level1">
<h1>Background</h1>
<p>This post compares common data manipulation operations in <strong>dplyr</strong> and <strong>data.table</strong>.</p>
<p><img src="https://martinctc.github.io/blog/images/manipulate.gif" width="80%" /></p>
<p>For newcomers to R who may not be aware, <a href="https://martinctc.github.io/blog/using-data.table-with-magrittr-pipes-best-of-both-worlds/">there are <em>many</em> ways to do the same thing in R</a>. Depending on the purpose of the code (readability vs creating functions) and the size of the data, I for one often find myself switching from one flavour (or dialect) of R data manipulation to another. Generally, I prefer the <strong>dplyr</strong> style for its readability and intuitiveness (for myself), <strong>data.table</strong> for its speed in grouping and summarising operations,<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> and <strong>base R</strong> when I am writing functions. This is by no means the R community consensus by the way (perfectly aware that I am venturing into a total minefield),<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> but is more of a representation of how I personally navigate the messy (but awesome) R world.</p>
<p>In this post, I am going to list out some of the most common data manipulations in both styles:</p>
<ol style="list-style-type: decimal">
<li><code>group_by()</code>, <code>summarise()</code> (a single column)</li>
<li><code>group_by()</code>, <code>summarise_at()</code> (multiple columns)</li>
<li><code>filter()</code>, <code>mutate()</code></li>
<li><code>mutate_at()</code> (changing multiple columns)</li>
<li>Row-wise operations</li>
<li>Vectorised multiple if-else (<code>case_when()</code>)</li>
<li>Function-writing: referencing a column with string</li>
</ol>
<p>There are plenty of resources out there on the internet comparing <strong>dplyr</strong> and <strong>data.table</strong>. For those who love to get into the details, I would really recommend <a href="https://atrebas.github.io/post/2019-03-03-datatable-dplyr/">Atrebas’s seminal blog post</a> that gives a comprehensive tour of <strong>dplyr</strong> and <strong>data.table</strong>, comparing the code side-by-side. I would also recommend <a href="https://wetlandscapes.com/blog/a-comparison-of-r-dialects/">this comparison of the three R dialects</a> by Jason Mercer, which not only includes base R in its comparison, but also goes into a fair bit of detail on elements such as piping/chaining (<code>%>%</code>). There’s also an excellent cheat sheet from DataCamp, linked <a href="https://s3.amazonaws.com/assets.datacamp.com/img/blog/data+table+cheat+sheet.pdf">here</a>.</p>
<p>Why write a new blog post then, you ask? One key (selfish / self-centred) reason is that I myself often refer to my blog for an <em>aide-memoire</em> on how to do a certain thing in R, and my notes are optimised to only contain my most frequently used code. They also contain certain idiosyncrasies in the way that I code (e.g. using pipes with <strong>data.table</strong>), which I’d like to be upfront about - and would at the same time very much welcome any discussion on it. It is perhaps also justifiable that I have at least attempted to build on and unify the work of others in this post, which I have argued is what ultimately matters <a href="https://martinctc.github.io/blog/a-short-essay-on-duplicated-r-artefacts/">in relation to duplicated R artefacts</a>.</p>
<p>Rambling on… so here we go!</p>
<p>To make it easy to reproduce results, I am going to just stick to the good ol’ <strong>mtcars</strong> and <strong>iris</strong> datasets which come shipped with R. I will also err on the side of verbosity and load the packages at the beginning of each code chunk, as if each code chunk is its own independent R session.</p>
</div>
<div id="group_by-summarise-a-single-column" class="section level1">
<h1>1. <code>group_by()</code>, <code>summarise()</code> (a single column)</h1>
<ul>
<li><strong>Analysis</strong>: Maximum MPG (<code>mpg</code>) value for each cylinder type in the <strong>mtcars</strong> dataset.<br />
</li>
<li><strong>Operations</strong>: Summarise with the <code>max()</code> function by group.</li>
</ul>
<p>To group by and summarise values, you would run something like this in <strong>dplyr</strong>:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">library</span>(dplyr)</span>
<span id="cb1-2"><a href="#cb1-2"></a></span>
<span id="cb1-3"><a href="#cb1-3"></a>mtcars <span class="op">%>%</span></span>
<span id="cb1-4"><a href="#cb1-4"></a><span class="st"> </span><span class="kw">group_by</span>(cyl) <span class="op">%>%</span></span>
<span id="cb1-5"><a href="#cb1-5"></a><span class="st"> </span><span class="kw">summarise</span>(<span class="dt">max_mpg =</span> <span class="kw">max</span>(mpg), <span class="dt">.groups =</span> <span class="st">"drop_last"</span>)</span></code></pre></div>
<p>You could do the same in <strong>data.table</strong>, and still use magrittr pipes:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">library</span>(data.table)</span>
<span id="cb2-2"><a href="#cb2-2"></a><span class="kw">library</span>(magrittr) <span class="co"># Or any package that imports the pipe (`%>%`)</span></span>
<span id="cb2-3"><a href="#cb2-3"></a></span>
<span id="cb2-4"><a href="#cb2-4"></a>mtcars <span class="op">%>%</span></span>
<span id="cb2-5"><a href="#cb2-5"></a><span class="st"> </span><span class="kw">as.data.table</span>() <span class="op">%>%</span></span>
<span id="cb2-6"><a href="#cb2-6"></a><span class="st"> </span>.[,.(<span class="dt">max_mpg =</span> <span class="kw">max</span>(mpg)), by =<span class="st"> </span>cyl]</span></code></pre></div>
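<p>For completeness, and as a quick way to sanity-check the two results above, the same summary can be sketched in base R with <code>aggregate()</code>. This is not part of the dplyr / data.table comparison itself, and the object name <code>max_mpg</code> is just for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Base R equivalent: maximum mpg within each level of cyl
max_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = max)

# aggregate() keeps the original column name, so rename it to match
names(max_mpg)[2] <- "max_mpg"
max_mpg</code></pre></div>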
</div>
<div id="group_by-summarise_at-multiple-columns" class="section level1">
<h1>2. <code>group_by()</code>, <code>summarise_at()</code> (multiple columns)</h1>
<ul>
<li><strong>Analysis</strong>: Mean value of <code>Sepal.Width</code> and <code>Sepal.Length</code> for each iris <code>Species</code> in the <strong>iris</strong> dataset.<br />
</li>
<li><strong>Operations</strong>: Summarise with the <code>mean()</code> function by group.</li>
</ul>
<p>Note: this is slightly different from the scenario above because the “summarisation” is applied to multiple columns.</p>
<p>In <strong>dplyr</strong>:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1"></a><span class="kw">library</span>(dplyr)</span>
<span id="cb3-2"><a href="#cb3-2"></a></span>
<span id="cb3-3"><a href="#cb3-3"></a><span class="co"># Option 1</span></span>
<span id="cb3-4"><a href="#cb3-4"></a>iris <span class="op">%>%</span></span>
<span id="cb3-5"><a href="#cb3-5"></a><span class="st"> </span><span class="kw">group_by</span>(Species) <span class="op">%>%</span></span>
<span id="cb3-6"><a href="#cb3-6"></a><span class="st"> </span><span class="kw">summarise_at</span>(<span class="kw">vars</span>(<span class="kw">contains</span>(<span class="st">"Sepal"</span>)),<span class="op">~</span><span class="kw">mean</span>(.))</span>
<span id="cb3-7"><a href="#cb3-7"></a></span>
<span id="cb3-8"><a href="#cb3-8"></a><span class="co"># Option 2</span></span>
<span id="cb3-9"><a href="#cb3-9"></a>iris <span class="op">%>%</span></span>
<span id="cb3-10"><a href="#cb3-10"></a><span class="st"> </span><span class="kw">group_by</span>(Species) <span class="op">%>%</span></span>
<span id="cb3-11"><a href="#cb3-11"></a><span class="st"> </span><span class="kw">summarise</span>(<span class="kw">across</span>(<span class="kw">contains</span>(<span class="st">"Sepal"</span>), mean), <span class="dt">.groups =</span> <span class="st">"drop_last"</span>)</span></code></pre></div>
<p>In <strong>data.table</strong> with pipes:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1"></a><span class="kw">library</span>(data.table)</span>
<span id="cb4-2"><a href="#cb4-2"></a><span class="kw">library</span>(magrittr) <span class="co"># Or any package that imports the pipe (`%>%`)</span></span>
<span id="cb4-3"><a href="#cb4-3"></a></span>
<span id="cb4-4"><a href="#cb4-4"></a><span class="co"># Option 1</span></span>
<span id="cb4-5"><a href="#cb4-5"></a>iris <span class="op">%>%</span></span>
<span id="cb4-6"><a href="#cb4-6"></a><span class="st"> </span><span class="kw">as.data.table</span>() <span class="op">%>%</span></span>
<span id="cb4-7"><a href="#cb4-7"></a><span class="st"> </span>.[,<span class="kw">lapply</span>(.SD, mean), by =<span class="st"> </span>Species, .SDcols =<span class="st"> </span><span class="kw">c</span>(<span class="st">"Sepal.Length"</span>, <span class="st">"Sepal.Width"</span>)]</span>
<span id="cb4-8"><a href="#cb4-8"></a> </span>
<span id="cb4-9"><a href="#cb4-9"></a><span class="co"># Option 2</span></span>
<span id="cb4-10"><a href="#cb4-10"></a>iris <span class="op">%>%</span></span>
<span id="cb4-11"><a href="#cb4-11"></a><span class="st"> </span><span class="kw">as.data.table</span>() <span class="op">%>%</span></span>
<span id="cb4-12"><a href="#cb4-12"></a><span class="st"> </span>.[,<span class="kw">lapply</span>(.SD, mean), by =<span class="st"> </span>Species, .SDcols =<span class="st"> </span><span class="kw">names</span>(.) <span class="op">%like%</span><span class="st"> "Sepal"</span>]</span></code></pre></div>
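<p>A base R sketch of the same multi-column summary, again only as a reference point and not one of the two styles being compared, uses <code>cbind()</code> on the left-hand side of the formula (the object name <code>sepal_means</code> is hypothetical):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># cbind() on the left-hand side summarises several columns at once
sepal_means <- aggregate(
  cbind(Sepal.Length, Sepal.Width) ~ Species,
  data = iris,
  FUN = mean
)
sepal_means</code></pre></div>
<p>Unlike the <code>contains("Sepal")</code> and <code>%like%</code> options above, the columns have to be spelled out here, which is one reason the other two styles scale better to wide tables.</p>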
</div>
<div id="filter-mutate" class="section level1">
<h1>3. <code>filter()</code>, <code>mutate()</code></h1>
<ul>
<li><strong>Analysis</strong>: Find out what the product of <code>Sepal.Width</code> and <code>Sepal.Length</code> would be for the iris species <code>setosa</code>.<br />
</li>
<li><strong>Operations</strong>: Filter by <code>Species=="setosa"</code> and create a new column called <code>Sepal_Index</code>.</li>
</ul>
<p>In <strong>dplyr</strong>:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1"></a><span class="kw">library</span>(dplyr)</span>
<span id="cb5-2"><a href="#cb5-2"></a></span>
<span id="cb5-3"><a href="#cb5-3"></a>iris <span class="op">%>%</span></span>
<span id="cb5-4"><a href="#cb5-4"></a><span class="st"> </span><span class="kw">filter</span>(Species <span class="op">==</span><span class="st"> "setosa"</span>) <span class="op">%>%</span></span>
<span id="cb5-5"><a href="#cb5-5"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="dt">Sepal_Index =</span> Sepal.Width <span class="op">*</span><span class="st"> </span>Sepal.Length)</span></code></pre></div>
<p>In <strong>data.table</strong>:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1"></a><span class="kw">library</span>(data.table)</span>
<span id="cb6-2"><a href="#cb6-2"></a><span class="kw">library</span>(magrittr) <span class="co"># Or any package that imports the pipe (`%>%`)</span></span>
<span id="cb6-3"><a href="#cb6-3"></a></span>
<span id="cb6-4"><a href="#cb6-4"></a>iris <span class="op">%>%</span></span>
<span id="cb6-5"><a href="#cb6-5"></a><span class="st"> </span><span class="kw">as.data.table</span>() <span class="op">%>%</span></span>
<span id="cb6-6"><a href="#cb6-6"></a><span class="st"> </span>.[, Species <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="kw">as.character</span>(Species)] <span class="op">%>%</span></span>
<span id="cb6-7"><a href="#cb6-7"></a><span class="st"> </span>.[Species <span class="op">==</span><span class="st"> "setosa"</span>] <span class="op">%>%</span></span>
<span id="cb6-8"><a href="#cb6-8"></a><span class="st"> </span>.[, Sepal_Index <span class="op">:</span><span class="er">=</span><span class="st"> </span>Sepal.Width <span class="op">*</span><span class="st"> </span>Sepal.Length] <span class="op">%>%</span></span>
<span id="cb6-9"><a href="#cb6-9"></a><span class="st"> </span>.[]</span></code></pre></div>
</div>
<div id="mutate_at-changing-multiple-columns" class="section level1">
<h1>4. <code>mutate_at()</code> (changing multiple columns)</h1>
<ul>
<li><strong>Analysis</strong>: Multiply <code>Sepal.Width</code> and <code>Sepal.Length</code> by 100.<br />
</li>
<li><strong>Operations</strong>: As above</li>
</ul>
<p>In <strong>dplyr</strong>:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1"></a><span class="kw">library</span>(dplyr)</span>
<span id="cb7-2"><a href="#cb7-2"></a></span>
<span id="cb7-3"><a href="#cb7-3"></a><span class="co"># Option 1</span></span>
<span id="cb7-4"><a href="#cb7-4"></a>iris <span class="op">%>%</span></span>
<span id="cb7-5"><a href="#cb7-5"></a><span class="st"> </span><span class="kw">mutate_at</span>(<span class="kw">vars</span>(Sepal.Length, Sepal.Width), <span class="op">~</span>.<span class="op">*</span><span class="dv">100</span>)</span>
<span id="cb7-6"><a href="#cb7-6"></a></span>
<span id="cb7-7"><a href="#cb7-7"></a><span class="co"># Option 2</span></span>
<span id="cb7-8"><a href="#cb7-8"></a>iris <span class="op">%>%</span></span>
<span id="cb7-9"><a href="#cb7-9"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="kw">across</span>(<span class="kw">starts_with</span>(<span class="st">"Sepal"</span>), <span class="op">~</span>.<span class="op">*</span><span class="dv">100</span>))</span></code></pre></div>
<p>In <strong>data.table</strong> with pipes:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1"></a><span class="kw">library</span>(data.table)</span>
<span id="cb8-2"><a href="#cb8-2"></a><span class="kw">library</span>(magrittr) <span class="co"># Or any package that imports the pipe (`%>%`)</span></span>
<span id="cb8-3"><a href="#cb8-3"></a></span>
<span id="cb8-4"><a href="#cb8-4"></a></span>
<span id="cb8-5"><a href="#cb8-5"></a>sepal_vars <-<span class="st"> </span><span class="kw">c</span>(<span class="st">"Sepal.Length"</span>, <span class="st">"Sepal.Width"</span>)</span>
<span id="cb8-6"><a href="#cb8-6"></a></span>
<span id="cb8-7"><a href="#cb8-7"></a>iris <span class="op">%>%</span></span>
<span id="cb8-8"><a href="#cb8-8"></a><span class="st"> </span><span class="kw">as.data.table</span>() <span class="op">%>%</span></span>
<span id="cb8-9"><a href="#cb8-9"></a><span class="st"> </span>.[,<span class="kw">as.vector</span>(sepal_vars) <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="kw">lapply</span>(.SD, <span class="cf">function</span>(x) x <span class="op">*</span><span class="st"> </span><span class="dv">100</span>), .SDcols =<span class="st"> </span>sepal_vars] <span class="op">%>%</span></span>
<span id="cb8-10"><a href="#cb8-10"></a><span class="st"> </span>.[]</span></code></pre></div>
</div>
<div id="row-wise-operations" class="section level1">
<h1>5. Row-wise operations</h1>
<p>This is always an awkward one, even for <strong>dplyr</strong>. For this, I will list a couple of options for row-wise calculations.</p>
<ul>
<li><strong>Analysis</strong>: Create a <code>TotalSize</code> column by summing all four columns of <code>Sepal.Length</code>, <code>Sepal.Width</code>, <code>Petal.Length</code>, and <code>Petal.Width</code>.</li>
<li><strong>Operations</strong>: As above</li>
</ul>
<p>In <strong>dplyr</strong>:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1"></a><span class="kw">library</span>(dplyr)</span>
<span id="cb9-2"><a href="#cb9-2"></a></span>
<span id="cb9-3"><a href="#cb9-3"></a><span class="co"># Option 1 - use `rowwise()`</span></span>
<span id="cb9-4"><a href="#cb9-4"></a>iris <span class="op">%>%</span></span>
<span id="cb9-5"><a href="#cb9-5"></a><span class="st"> </span><span class="kw">rowwise</span>() <span class="op">%>%</span></span>
<span id="cb9-6"><a href="#cb9-6"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="dt">TotalSize =</span> <span class="kw">sum</span>(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width))</span>
<span id="cb9-7"><a href="#cb9-7"></a></span>
<span id="cb9-8"><a href="#cb9-8"></a><span class="co"># Option 2 - use `apply()` and `select()`</span></span>
<span id="cb9-9"><a href="#cb9-9"></a><span class="co"># Select all columns BUT `Species`</span></span>
<span id="cb9-10"><a href="#cb9-10"></a>iris <span class="op">%>%</span></span>
<span id="cb9-11"><a href="#cb9-11"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="dt">TotalSize =</span> <span class="kw">select</span>(., <span class="op">-</span>Species) <span class="op">%>%</span><span class="st"> </span><span class="kw">apply</span>(<span class="dt">MARGIN =</span> <span class="dv">1</span>, <span class="dt">FUN =</span> sum))</span>
<span id="cb9-12"><a href="#cb9-12"></a></span>
<span id="cb9-13"><a href="#cb9-13"></a><span class="co"># Option 3 - `rowwise()` and `c_across()`</span></span>
<span id="cb9-14"><a href="#cb9-14"></a><span class="co"># Select all columns BUT `Species`</span></span>
<span id="cb9-15"><a href="#cb9-15"></a>iris <span class="op">%>%</span></span>
<span id="cb9-16"><a href="#cb9-16"></a><span class="st"> </span><span class="kw">rowwise</span>() <span class="op">%>%</span></span>
<span id="cb9-17"><a href="#cb9-17"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="dt">TotalSize =</span> <span class="kw">sum</span>(<span class="kw">c_across</span>(<span class="op">-</span>Species)))</span></code></pre></div>
<p>In <strong>data.table</strong> with pipes:</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb10-1"><a href="#cb10-1"></a><span class="kw">library</span>(data.table)</span>
<span id="cb10-2"><a href="#cb10-2"></a><span class="kw">library</span>(magrittr) <span class="co"># Or any package that imports the pipe (`%>%`)</span></span>
<span id="cb10-3"><a href="#cb10-3"></a></span>
<span id="cb10-4"><a href="#cb10-4"></a><span class="co"># Get all the column names in `iris` except for `Species`</span></span>
<span id="cb10-5"><a href="#cb10-5"></a>all_vars <-<span class="st"> </span><span class="kw">names</span>(iris)[<span class="kw">names</span>(iris) <span class="op">!=</span><span class="st"> "Species"</span>]</span>
<span id="cb10-6"><a href="#cb10-6"></a></span>
<span id="cb10-7"><a href="#cb10-7"></a>iris <span class="op">%>%</span></span>
<span id="cb10-8"><a href="#cb10-8"></a><span class="st"> </span><span class="kw">as.data.table</span>() <span class="op">%>%</span></span>
<span id="cb10-9"><a href="#cb10-9"></a><span class="st">  </span>.[, <span class="st">"TotalSize"</span> <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="kw">apply</span>(.SD, <span class="dv">1</span>, sum), .SDcols =<span class="st"> </span>all_vars] <span class="op">%>%</span></span>
<span id="cb10-10"><a href="#cb10-10"></a><span class="st"> </span>.[] </span></code></pre></div>
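<p>As a possible alternative to <code>apply()</code>, the same row totals can be computed with the vectorised base R function <code>rowSums()</code>. The following is a minimal sketch without pipes; the object names <code>size_vars</code> and <code>iris_dt</code> are illustrative assumptions rather than part of the original example.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(data.table)

# All column names in `iris` except `Species` (illustrative name)
size_vars <- setdiff(names(iris), "Species")

iris_dt <- as.data.table(iris)

# `rowSums()` works on the whole subset of columns at once,
# rather than looping over rows as `apply(.SD, 1, sum)` does
iris_dt[, TotalSize := rowSums(.SD), .SDcols = size_vars]

head(iris_dt)</code></pre></div>
<p>For purely numeric columns like these, <code>rowSums()</code> should give the same totals as the <code>apply()</code> version.</p>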
</div>
<div id="vectorised-multiple-if-else-case_when" class="section level1">
<h1>6. Vectorised multiple if-else (<code>case_when()</code>)</h1>
<ul>
<li><strong>Analysis</strong>: Classify an <code>Age</code> into different categories</li>
<li><strong>Operations</strong>: Create a new column called <code>AgeLabel</code> based on the <code>Age</code> variable</li>
</ul>
<p>In <strong>dplyr</strong>:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1"></a><span class="kw">library</span>(dplyr)</span>
<span id="cb11-2"><a href="#cb11-2"></a></span>
<span id="cb11-3"><a href="#cb11-3"></a>age_data <-<span class="st"> </span><span class="kw">tibble</span>(<span class="dt">Age =</span> <span class="kw">seq</span>(<span class="dv">1</span>, <span class="dv">100</span>))</span>
<span id="cb11-4"><a href="#cb11-4"></a></span>
<span id="cb11-5"><a href="#cb11-5"></a>age_data <span class="op">%>%</span></span>
<span id="cb11-6"><a href="#cb11-6"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="dt">AgeLabel =</span> <span class="kw">case_when</span>(Age <span class="op"><</span><span class="st"> </span><span class="dv">18</span> <span class="op">~</span><span class="st"> "0 - 17"</span>,</span>
<span id="cb11-7"><a href="#cb11-7"></a> Age <span class="op"><</span><span class="st"> </span><span class="dv">35</span> <span class="op">~</span><span class="st"> "18 - 34"</span>,</span>
<span id="cb11-8"><a href="#cb11-8"></a> Age <span class="op"><</span><span class="st"> </span><span class="dv">65</span> <span class="op">~</span><span class="st"> "35 - 64"</span>,</span>
<span id="cb11-9"><a href="#cb11-9"></a> <span class="ot">TRUE</span> <span class="op">~</span><span class="st"> "65+"</span>))</span></code></pre></div>
<p>In <strong>data.table</strong>:</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb12-1"><a href="#cb12-1"></a><span class="kw">library</span>(data.table)</span>
<span id="cb12-2"><a href="#cb12-2"></a><span class="kw">library</span>(magrittr) <span class="co"># Or any package that imports the pipe (`%>%`)</span></span>
<span id="cb12-3"><a href="#cb12-3"></a></span>
<span id="cb12-4"><a href="#cb12-4"></a><span class="co"># Option 1 - without pipes</span></span>
<span id="cb12-5"><a href="#cb12-5"></a>age_data <-<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">Age =</span> <span class="dv">0</span><span class="op">:</span><span class="dv">100</span>)</span>
<span id="cb12-6"><a href="#cb12-6"></a>age_data[, AgeLabel <span class="op">:</span><span class="er">=</span><span class="st"> "65+"</span>]</span>
<span id="cb12-7"><a href="#cb12-7"></a>age_data[Age <span class="op"><</span><span class="st"> </span><span class="dv">65</span>, AgeLabel <span class="op">:</span><span class="er">=</span><span class="st"> "35-64"</span>]</span>
<span id="cb12-8"><a href="#cb12-8"></a>age_data[Age <span class="op"><</span><span class="st"> </span><span class="dv">35</span>, AgeLabel <span class="op">:</span><span class="er">=</span><span class="st"> "18-34"</span>]</span>
<span id="cb12-9"><a href="#cb12-9"></a>age_data[Age <span class="op"><</span><span class="st"> </span><span class="dv">18</span>, AgeLabel <span class="op">:</span><span class="er">=</span><span class="st"> "0-17"</span>] </span>
<span id="cb12-10"><a href="#cb12-10"></a></span>
<span id="cb12-11"><a href="#cb12-11"></a><span class="co"># Option 2 - with pipes</span></span>
<span id="cb12-12"><a href="#cb12-12"></a>age_data2 <-<span class="st"> </span><span class="kw">data.table</span>(<span class="dt">Age =</span> <span class="dv">0</span><span class="op">:</span><span class="dv">100</span>)</span>
<span id="cb12-13"><a href="#cb12-13"></a></span>
<span id="cb12-14"><a href="#cb12-14"></a>age_data2 <span class="op">%>%</span></span>
<span id="cb12-15"><a href="#cb12-15"></a><span class="st"> </span>.[, AgeLabel <span class="op">:</span><span class="er">=</span><span class="st"> "65+"</span>] <span class="op">%>%</span></span>
<span id="cb12-16"><a href="#cb12-16"></a><span class="st"> </span>.[Age <span class="op"><</span><span class="st"> </span><span class="dv">65</span>, AgeLabel <span class="op">:</span><span class="er">=</span><span class="st"> "35-64"</span>] <span class="op">%>%</span></span>
<span id="cb12-17"><a href="#cb12-17"></a><span class="st"> </span>.[Age <span class="op"><</span><span class="st"> </span><span class="dv">35</span>, AgeLabel <span class="op">:</span><span class="er">=</span><span class="st"> "18-34"</span>] <span class="op">%>%</span></span>
<span id="cb12-18"><a href="#cb12-18"></a><span class="st"> </span>.[Age <span class="op"><</span><span class="st"> </span><span class="dv">18</span>, AgeLabel <span class="op">:</span><span class="er">=</span><span class="st"> "0-17"</span>] <span class="op">%>%</span></span>
<span id="cb12-19"><a href="#cb12-19"></a><span class="st"> </span>.[]</span></code></pre></div>
<p>One thing to note is that there are two options here: Option 1 <em>without</em> and Option 2 <em>with</em> magrittr pipes. Option 1 works without any assignment (<code><-</code>) because of <strong>reference semantics</strong> in <strong>data.table</strong>. When <code>:=</code> is used in <strong>data.table</strong>, the change is made to the existing data.table object via ‘modify by reference’, without creating a copy of it. By contrast, the usual R approach of computing a modified object and assigning the result with <code><-</code> works on a copy, which is referred to as ‘modify by copy’.</p>
<p>As <a href="https://tysonbarrett.com/jekyll/update/2019/07/12/datatable/">Tyson Barrett</a> nicely summarises, this ‘modifying by reference’ behaviour in <strong>data.table</strong> is partly what makes it efficient, but it can be surprising if you do not expect or understand it. The good news is that <strong>data.table</strong> lets you choose between modifying by reference and making a copy.</p>
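<p>To make the difference concrete, here is a small sketch of the two behaviours. The object names (<code>dt_original</code>, <code>dt_alias</code>, <code>dt_copy</code>) and the example columns are made up for illustration; <code>copy()</code> is the <strong>data.table</strong> function for making an explicit copy.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(data.table)

dt_original <- data.table(Age = c(10, 40, 70))

# Plain assignment does not copy a data.table:
# both names point to the same underlying object
dt_alias <- dt_original
dt_alias[, AgeLabel := "added by reference"]
"AgeLabel" %in% names(dt_original) # TRUE: the original gained the column too

# `copy()` creates an independent data.table,
# so changes to the copy leave the original untouched
dt_copy <- copy(dt_original)
dt_copy[, Extra := TRUE]
"Extra" %in% names(dt_original) # FALSE</code></pre></div>
<p>In other words, if you want to keep the original object as it is, take a <code>copy()</code> before using <code>:=</code>.</p>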
</div>
<div id="function-writing-referencing-a-column-with-string" class="section level1">
<h1>7. Function-writing: referencing a column with string</h1>
<ul>
<li><strong>Requirement</strong>: Create a function that will multiply a column by three. A string should be supplied to the argument to specify the column to be multiplied. The function returns the original data frame with the modified column.</li>
</ul>
<p>Here, I intentionally reference the packages explicitly within the function (using <code>::</code>) rather than loading them, as it’s best practice for functions to be able to run on their own without attaching an entire package.</p>
<p>In <strong>dplyr</strong>:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1"></a>multiply_three <-<span class="st"> </span><span class="cf">function</span>(data, variable){</span>
<span id="cb13-2"><a href="#cb13-2"></a> </span>
<span id="cb13-3"><a href="#cb13-3"></a> dplyr<span class="op">::</span><span class="kw">mutate</span>(data, <span class="op">!!</span>rlang<span class="op">::</span><span class="kw">sym</span>(variable) <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="op">!!</span>rlang<span class="op">::</span><span class="kw">sym</span>(variable) <span class="op">*</span><span class="st"> </span><span class="dv">3</span>)</span>
<span id="cb13-4"><a href="#cb13-4"></a>}</span>
<span id="cb13-5"><a href="#cb13-5"></a></span>
<span id="cb13-6"><a href="#cb13-6"></a><span class="kw">multiply_three</span>(iris, <span class="st">"Sepal.Length"</span>)</span></code></pre></div>
<p>In <strong>data.table</strong>:</p>
<p>(See <a href="https://stackoverflow.com/questions/45982595/r-using-get-and-data-table-within-a-user-defined-function" class="uri">https://stackoverflow.com/questions/45982595/r-using-get-and-data-table-within-a-user-defined-function</a>)</p>
<div class="sourceCode" id="cb14"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb14-1"><a href="#cb14-1"></a>multiply_three <-<span class="st"> </span><span class="cf">function</span>(data, variable){</span>
<span id="cb14-2"><a href="#cb14-2"></a> </span>
<span id="cb14-3"><a href="#cb14-3"></a> dt <-<span class="st"> </span>data.table<span class="op">::</span><span class="kw">as.data.table</span>(data)</span>
<span id="cb14-4"><a href="#cb14-4"></a>  dt[, (variable) <span class="op">:</span><span class="er">=</span><span class="st"> </span><span class="kw">get</span>(variable) <span class="op">*</span><span class="st"> </span><span class="dv">3</span>]</span>
<span id="cb14-5"><a href="#cb14-5"></a> dt[] <span class="co"># Print</span></span>
<span id="cb14-6"><a href="#cb14-6"></a>}</span>
<span id="cb14-7"><a href="#cb14-7"></a></span>
<span id="cb14-8"><a href="#cb14-8"></a><span class="kw">multiply_three</span>(iris, <span class="st">"Sepal.Length"</span>)</span></code></pre></div>
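<p>For comparison, another way to handle a column name supplied as a string in <strong>dplyr</strong> is to use <code>across()</code> with <code>all_of()</code>. This is only a sketch of an alternative, not the approach used above, and the function name <code>multiply_three_across</code> is hypothetical.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">multiply_three_across <- function(data, variable){

  # `all_of()` selects the column(s) named in the character vector `variable`;
  # the lambda is then applied to each selected column
  dplyr::mutate(data, dplyr::across(dplyr::all_of(variable), ~ .x * 3))
}

out <- multiply_three_across(iris, "Sepal.Length")
head(out)</code></pre></div>
<p>One possible advantage of this form is that <code>variable</code> can hold more than one column name without any change to the function.</p>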
</div>
<div id="end-note" class="section level1">
<h1>End Note</h1>
<p>This is it! For anything in greater detail, please consult the blogs and cheat sheets I recommended at the beginning of this blog post. I’d say this covers 65% (not a strictly empirical statistic) of my needs for data manipulation, so I hope this is of some help to you. (The <code>gather()</code> vs <code>melt()</code> vs <code>pivot_longer()</code> subject is a whole other beast, and ought to be dealt with in another post.)</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Elio Campitelli has an excellent blog post on <em>Why I love data.table</em>, which is a nice short piece on why <strong>data.table</strong> is pretty awesome.<a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p>As noted in the <a href="https://ds4ps.org/2019/04/20/datatable-vs-dplyr.html">DS4PS blog</a>, the debate of <strong>dplyr</strong> versus <strong>data.table</strong> has resulted in “Twitter clashes, and even became an inspiration for memes.”<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</section>
Martin Chan
A Shiny app on Hong Kong District Councillors
2020-09-05T00:00:00+00:00
https://martinctc.github.io/blog/a-shiny-app-on-hong-kong-district-councillors
<script src="https://martinctc.github.io/blog/knitr_files/a-shiny-app-for-hong-kong-district-councillors_20200905_files/header-attrs-2.3/header-attrs.js"></script>
<script src="https://martinctc.github.io/blog/knitr_files/a-shiny-app-for-hong-kong-district-councillors_20200905_files/accessible-code-block-0.0.1/empty-anchor.js"></script>
<section class="main-content">
<div id="tldr" class="section level2">
<h2>👀 TL;DR</h2>
<p>We built an <a href="https://hkdistricts-info.shinyapps.io/dashboard-hkdistrictcouncillors/">R Shiny app</a> to improve access to information on Hong Kong’s local politicians. This is so that voters can make more informed choices. The app shows basic information on each politician, alongside a live feed of their Facebook page and illustrative maps of their district. We took advantage of this project to test out <strong>a range of R packages and techniques</strong> and to <strong>implement some DevOps best practices</strong>, which we will discuss in this post.</p>
<blockquote>
<p><em>This project is an attempt to help make a difference with R programming. It’s an opportunity for us to learn, to code, to have fun, and to make a difference.</em></p>
</blockquote>
<p>This blog post was originally published on <a href="https://martinctc.github.io/blog/" class="uri">https://martinctc.github.io/blog/</a>, and is co-authored by Martin Chan and Avision Ho.</p>
</div>
<div id="overview" class="section level2">
<h2>💻 Overview</h2>
<p>Our project was mainly motivated by an observation <strong>that the engagement of the Hong Kong public with their local politicians was very low.</strong><a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> Historically, the work of Hong Kong’s District Councillors (DCs) has been neither widely known nor closely scrutinised by the media. Until recently, most District Councillors did not use webpages or Facebook pages to share their work, favouring instead the distribution of physical copies of ‘work reports’ by direct mail. This changed significantly with the 2019 District Council election, in which turnout jumped to 71% (from 47% in 2015) for a variety of reasons. For context, Hong Kong’s District Councils are the most local level of government, and the only level at which there is full universal suffrage for all candidates.</p>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/18-district-council.png" alt="A map of Hong Kong's 18 District Councils. Illustration by Ocean Cheung" style="max-width:500px;" /></p>
<p>As of the summer of 2020, we identified that 96% (434) of the 452 District Councillors elected in 2019 had a dedicated Facebook page for delivering updates to and engaging with local residents. However, these Facebook pages had to be searched for manually online, and there was no readily available tool with which people could quickly map a District to a District Councillor and to their Facebook feed.</p>
<p>As a wise person once said, <em>“If you can solve a problem effectively in R, why the hell not?”</em>. We tackled this problem by creating a Shiny app in R, which brings together the Facebook feeds and constituency information for Hong Kong’s district councillors in one place. In this way, people are able to access information that is currently scattered across different places through a single web app.</p>
<p>You can access:</p>
<ul>
<li>The Shiny app <a href="https://hkdistricts-info.shinyapps.io/dashboard-hkdistrictcouncillors/">here</a>.</li>
<li>Our GitHub repository <a href="https://github.com/Hong-Kong-Districts-Info/dashboard-hkdistrictcouncillors">here</a>.</li>
<li>Don’t forget to also provide some feedback to the Shiny app <a href="https://hkdistrictsinfo.typeform.com/to/gFHC02gE">here</a>!</li>
</ul>
<p>Whether you are more of an R enthusiast or simply someone who has an interest in Hong Kong politics (hopefully both!), we hope this post will bring you some inspiration on how you can use R <em>not just</em> as a great tool for data analysis, but also as an enabler for you to do something tangible for your community and contribute to causes you care about.</p>
</div>
<div id="what-is-in-the-app" class="section level2">
<h2>🔍 What is in the app?</h2>
<img src="https://raw.githubusercontent.com/martinctc/blog/master/images/hkdc-app-example.png" style="max-width:500px;" />
<p>The Shiny app is built like a dashboard which combines information about each district councillor alongside their Facebook page posts (if it exists) and the district they serve, illustrated on an interactive map. By using the District and Constituency dropdown lists, you can retrieve information about the District Councillor and their Facebook feed.</p>
<p>Specifically, there are several key components that were used on top of the incredible <a href="https://github.com/rstudio/shiny">shiny</a> package:</p>
<ul>
<li><a href="https://github.com/rstudio/shinydashboard">shinydashboard</a>: For mobile-friendly dashboard layout.
<ul>
<li>We understood that our users, primarily HK citizens, frequently use mobiles. Thus, to ensure this app was useful to them, we centred our design on how the app looked on their mobile browsers.</li>
</ul></li>
</ul>
<ul>
<li><a href="https://github.com/tidyverse/googlesheets4">googlesheets4</a>: For seamless access to Google Sheets.
<ul>
<li>We understood that our users are not all technical so we stored the core data in a format and platform familiar and accessible to most people, Google Sheets.</li>
<li>At a later stage of the app development, we migrated to storing the data in an R package we wrote, called <a href="https://github.com/hong-Kong-Districts-Info/hkdatasets"><strong>hkdatasets</strong></a> as we sought to keep the data in one place. However, the Google Sheets implementation worked very well, and the app could be deployed with no impact on performance or user experience.</li>
</ul></li>
<li><a href="https://github.com/r-spatial/sf">sf</a> and <a href="https://github.com/rstudio/leaflet">leaflet</a>: For importing geographic data and creating interactive maps.
<ul>
<li>We understood that our users may want to explore other parts of Hong Kong but may not know the names of each constituency. Thus, we provided a map functionality to make it easier for them to learn more about different parts of Hong Kong.</li>
</ul></li>
<li><a href="https://github.com/carlganz/rintrojs">rintrojs</a>: For interactive tutorials.
<ul>
<li>We understood that our users are not necessarily keen to read pages of instructions on how to use the app, especially if they are on mobile. Thus, we implemented a dynamic feature that visually walks them through each component of the app.</li>
</ul></li>
</ul>
</div>
<div id="how-was-the-data-collected" class="section level2">
<h2>🗄️ How was the data collected?</h2>
<p>Since there was no existing single data source on the DCs, we had to put this together ourselves. The data on each DC, their constituency, the party they belong to, and their Facebook page was all collected manually through a combination of Wikipedia and Facebook. The data was initially housed on Google Sheets, for multiple reasons:</p>
<ol style="list-style-type: decimal">
<li>Using Google Sheets made it easy for multiple people to collaborate on data entry.</li>
<li>Keeping the data outside of the repo has the advantage of keeping the size of the repo minimal, in line with best practices.</li>
<li>By storing the data in Google Sheets, non-technical users would also be able to access the data.</li>
</ol>
<p>Most of all, it was easy to access the Google Sheets data with the {googlesheets4} package! For editing the data during <em>pre-processing</em>, a key function is <code>googlesheets4::gs4_auth()</code>, which directs the developer to a web browser, where they are asked to sign in to their Google account and to grant googlesheets4 permission to operate on their behalf with Google Sheets. We then set up the main Google Sheet (the nicely formatted version intended for the app to ingest) to provide read-only access to anyone with the link, and used <code>googlesheets4::gs4_deauth()</code> to access the public Google Sheet in a <em>de-authorised</em> state. The Shiny app itself does not have any Google credentials stored alongside it (which it shouldn’t, for security reasons), and this workflow allows (i) collaborators/developers to edit the data from R and (ii) the app to access the Google Sheet data without any need for users to log in.</p>
<p>This Google Sheet is available <a href="https://docs.google.com/spreadsheets/d/1007RLMHSukSJ5OfCcDJdnJW5QMZyS2P-81fe7utCZwk/">here</a>.</p>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/googlesheet-example.png" style="max-width:500px;" /></p>
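<p>The read-only part of this workflow can be sketched roughly as follows. This is a minimal illustration rather than the exact code used in the app: the helper name <code>read_public_sheet</code> is hypothetical, and it assumes the sheet is shared so that anyone with the link can view it.</p>
<pre><code>read_public_sheet <- function(sheet_url){
  # Switch googlesheets4 to a de-authorised state, so that no Google login
  # is attempted; this only works for sheets that are publicly readable
  googlesheets4::gs4_deauth()

  # Read the first worksheet of the Google Sheet into a tibble
  googlesheets4::read_sheet(sheet_url)
}

# Usage (requires an internet connection):
# dc_data <- read_public_sheet("https://docs.google.com/spreadsheets/d/1007RLMHSukSJ5OfCcDJdnJW5QMZyS2P-81fe7utCZwk/")</code></pre>
<p>Developers who need to edit the data would instead call <code>googlesheets4::gs4_auth()</code> in their own R session before writing to the sheet.</p>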
<p>Creating a map with constituency boundaries also required additional data. Boundaries for each constituency were obtained through a Freedom of Information (FOI) request by a member of the public <a href="https://accessinfo.hk/en/request/shapefileshp_for_2019_district_c">here</a> (see discussion of <em>shapefiles</em> below).</p>
<p>This was pretty much Phase #1 of data collection, where we had a single Google Sheet with basic information about the District Councillors and their Facebook feeds. This enabled us to create a proof of concept of the Shiny app, i.e. making sure that we could set up a mechanism where the user selects a constituency and the app returns the corresponding Facebook feed of the District Councillor.</p>
<p>Based on user feedback, we started with Phase #2 of data collection, which involved a web-scraping exercise on the official <a href="https://www.districtcouncils.gov.hk/index.html">Hong Kong District Council website</a> and the <a href="https://dce2019.hk01.com/">HK01 News Page on the 2019 District Council elections</a> to get extra data points, such as:</p>
<ul>
<li>Contact email address</li>
<li>Contact number</li>
<li>Office address</li>
<li>Number of votes, and share of votes won in 2019</li>
</ul>
<p>The following function was extremely helpful for figuring out the URLs of the District Councillors’ individual official pages. It runs a Bing search on the <a href="https://www.districtcouncils.gov.hk" class="uri">https://www.districtcouncils.gov.hk</a> website, and scrapes from the search results any links which match what we want (based on what the URL string looks like). Although this doesn’t always work, it went a long way towards covering the 452 District Councillors.</p>
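<pre><code>scrape_dcs <- function(search_term){
  # Restrict the Bing search to the official District Council website
  query_string <- paste("site: https://www.districtcouncils.gov.hk", search_term)
  squery <- utils::URLencode(query_string)
  squeryfull <- paste0("https://www.bing.com/search?q=", squery)

  # Read the search results page and extract the links in the result titles
  main_page <- xml2::read_html(squeryfull)
  title_links <- rvest::html_nodes(main_page, '.b_title a')
  temp <- rvest::html_attr(title_links, "href")

  # Keep only the links that point to an individual councillor's page
  temp[grepl("member_id=", temp)]
}</code></pre>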
<p>One key thing to note is that all of the data we compiled above is available and accessible in the public domain; we simply took an extra step to improve its accessibility.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> The Phase #2 data was used in the final app to provide more information to the user when a particular constituency or District Councillor is selected.</p>
</div>
<div id="creating-a-data-package" class="section level2">
<h2>📦 Creating a data package</h2>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/hkdatasets-hex.png" style="max-width:150px;" /></p>
<p>Our data R package, <a href="https://github.com/hong-Kong-Districts-Info/hkdatasets"><strong>hkdatasets</strong></a>, is to some extent a spin-off of this project. We decided to migrate from Google Sheets to an R data package approach, for the following reasons:</p>
<ul>
<li><p>An R data package could allow us to provide more detailed documentation and tracking of how the data would change over time. If we choose to expand the dataset in the future, we can easily add this to the package release notes.</p></li>
<li><p>An R data package would fit well with our broader ambition to work on other Hong Kong themed, open-source projects. After sharing our project with friends, we were approached to help with another project to visualise Hong Kong traffic collisions data, for which the repo is <a href="https://github.com/Hong-Kong-Districts-Info/hktrafficcollisions">here</a>. The traffic collisions data for that project was obtained via an FOI request, and is also available through <strong>hkdatasets</strong>.</p></li>
<li><p>An R data package would make it easier for learners and students in the R community to practise with the datasets we’ve put together, without having to learn about the <strong>googlesheets4</strong> package. Our thinking is that this would benefit others in the same way that data packages like <strong>nycflights13</strong> and <strong>babynames</strong> benefitted us as we learned R.</p></li>
</ul>
<p><strong>hkdatasets</strong> is currently only available on GitHub, and our aim is to release it on CRAN in the future so that more R users can take advantage of it. Check out our <a href="https://github.com/hong-Kong-Districts-Info/hkdatasets">GitHub repo</a> to find out more about it.</p>
</div>
<div id="linking-our-shiny-app-to-facebook" class="section level2">
<h2>🔗 Linking our Shiny App to Facebook</h2>
<p><img src="https://github.com/Hong-Kong-Districts-Info/hong-kong-districts-info.github.io/raw/master/images/DCAppDemo3.gif" style="max-width:500px;" /></p>
<p>From the moment we first conceptualised this project, our aim was to make the Facebook Page content the centrepiece of the app. This was contingent on using some form of Facebook API to access content on the District Councillors’ Public Pages, which we initially thought would be easy as Public Page content is ‘out there’, and shouldn’t require any additional permissions or approvals.</p>
<p>It turns out that reading public posts from Facebook Pages that we do not have admin access to requires a permission called <strong>Page Public Content Access</strong>, which in turn requires us to submit our app to Facebook for review. Reading several threads online (such as <a href="https://developers.facebook.com/community/threads/385801828797027/">this</a>) soon convinced us that this would be a fairly challenging process, as we would effectively need to submit a proposal on why we had to request this permission. To add to the difficulty, we understood that the App Review process had been put on pause at the time, due to the re-allocation of resourcing during COVID-19.</p>
<p>This drove us to search for a workaround, and this is where we stumbled across <em>iframes</em> as a solution. An <em>iframe</em> is basically a frame that enables you to embed an HTML document within another HTML document (they’ve existed for a long time, as I recall using them in the really early GeoCities and Xanga websites).</p>
<p>The <code>iframe</code> concept roughly works as follows. All the Facebook Page URLs are saved in a vector called <code>FacebookURL</code>, and this is “wrapped” with some HTML markup that enables it to be rendered within the Shiny App as a UI component (using <code>shiny::uiOutput()</code>):</p>
<pre><code>chunk1 <- '<iframe src="https://www.facebook.com/plugins/page.php?href='
chunk3 <- '&tabs=timeline&width=400&height=800&small_header=false&adapt_container_width=true&hide_cover=false&show_facepile=true&appId=3131730406906292" width="400" height="800" style="border:none;overflow:hidden" scrolling="no" frameborder="0" allowTransparency="true" allow="encrypted-media"></iframe>'
iframe <- paste0(chunk1, FacebookURL, chunk3)</code></pre>
<p>Although the <code>iframe</code> solution comes with its own challenges, such as the difficulty in making it truly responsive / mobile-optimised, it was nonetheless an expedient and effective workaround that allowed us to produce a proof-of-concept; the alternative was to splash around in Facebook’s API documentation and discussion boards for at least another month to achieve the App Approval (bearing in mind that we were working on this in our own free time, with limited resources).</p>
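<p>To make the <code>uiOutput()</code> side of this more concrete, here is a minimal sketch of how such an iframe could be wired into a Shiny app. It is not the code from our app: the helper <code>fb_iframe()</code>, the IDs <code>page</code> and <code>fb_feed</code>, and the placeholder URL are all made up for illustration, and it builds the tag with <code>shiny::tags$iframe()</code> rather than pasting an HTML string together.</p>
<pre><code>library(shiny)

# Hypothetical helper (not from the app): build a Page Plugin iframe for one URL
fb_iframe <- function(page_url, width = 400, height = 800) {
  src <- paste0(
    "https://www.facebook.com/plugins/page.php?href=",
    utils::URLencode(page_url, reserved = TRUE),
    "&tabs=timeline&width=", width, "&height=", height
  )
  tags$iframe(
    src = src, width = width, height = height,
    style = "border:none;overflow:hidden"
  )
}

# Placeholder URL standing in for one element of `FacebookURL`
example_pages <- c("https://www.facebook.com/facebook")

ui <- fluidPage(
  selectInput("page", "Facebook Page", choices = example_pages),
  uiOutput("fb_feed")
)

server <- function(input, output, session) {
  # Re-render the iframe whenever a different page is selected
  output$fb_feed <- renderUI(fb_iframe(input$page))
}

# shinyApp(ui, server)  # uncomment to launch the app interactively</code></pre>
<p>Building the tag with <code>tags$iframe()</code> means the attributes are escaped for you, at the cost of being a little more verbose than a single <code>paste0()</code> call.</p>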
</div>
<div id="visualising-the-shapefiles" class="section level2">
<h2>🌍 Visualising the shapefiles</h2>
<blockquote>
<p>The first rule of optimisation is you don’t.</p>
<p>— <em>Michael A. Jackson</em></p>
</blockquote>
<p>To visualise the individual Districts on a map, we acquired shapefiles from <a href="https://accessinfo.hk/en/request/shapefileshp_for_2019_district_c">AccessInfo.HK</a>. A shapefile is, according to <a href="https://desktop.arcgis.com/en/arcmap/10.3/manage-data/shapefiles/what-is-a-shapefile.htm">the ArcGIS website</a>:</p>
<blockquote>
<p>… a simple, nontopological format for storing the geometric location and attribute information of geographic features. Geographic features in a shapefile can be represented by points, lines, or polygons (areas).</p>
</blockquote>
<p>These shapefiles could be easily used as part of a <strong>ggplot2</strong> workflow: we used <code>geom_sf()</code> to rapidly get a proof of concept, so that we could quickly see the districts and how they would look in relation to the Shiny app.</p>
<p>The code we used was as follows:</p>
<pre><code>map_hk_districts <- ggplot() +
geom_sf(data = shape_hk, fill = '#009E73') +
geom_sf(data = shape_district, fill = '#56B4E9', alpha = 0.2, linetype = 'dotted', size = 0.2)</code></pre>
<p><img src="https://user-images.githubusercontent.com/25527485/88489291-6891a280-cf8b-11ea-8c0e-eb48a5af5094.png" style="max-width:500px;" /> </p>
<p class="caption">(Image shows an earlier iteration of the app)</p>
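<p>The snippet above assumes that the shapefiles have already been read into R. As a hedged sketch of that step, the <strong>sf</strong> package can read a <code>.shp</code> file with <code>st_read()</code>. The example below uses the North Carolina shapefile that ships with <strong>sf</strong> as a stand-in, since the file paths for the Hong Kong shapefiles are not shown in this post; <code>shape_demo</code> and <code>map_demo</code> are illustrative names.</p>
<pre><code>library(sf)
library(ggplot2)

# Stand-in shapefile shipped with {sf}; replace with the path to your own .shp file
shp_path <- system.file("shape/nc.shp", package = "sf")
shape_demo <- st_read(shp_path, quiet = TRUE)

# An sf object is a data frame with a geometry column, so it plugs into ggplot2
map_demo <- ggplot() +
  geom_sf(data = shape_demo, fill = "#56B4E9", alpha = 0.2, linetype = "dotted") +
  theme_void()

map_demo</code></pre>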
<p>Once we settled on how the map looked in relation to the Shiny app, we then spent some additional time and effort to investigate using <a href="https://github.com/rstudio/leaflet">leaflet</a>. The reason for moving to <strong>leaflet</strong> maps was their interactivity: we understood our users would want to explore the HK map interactively to find out what constituency they belong to, or to find one that was of interest. This was because we were aware that people may know what region they live in but not the name of the constituency.</p>
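<p>For illustration only, an interactive version of such a map could look roughly like the sketch below. Again, this uses the stand-in shapefile from <strong>sf</strong> rather than our Hong Kong data, and the column <code>NAME</code> belongs to that stand-in dataset; in the real app the label would be the constituency name.</p>
<pre><code>library(sf)
library(leaflet)

# Stand-in polygons from {sf}; leaflet expects longitude/latitude (EPSG:4326)
shape_demo <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)
shape_demo <- st_transform(shape_demo, 4326)

map_interactive <- leaflet(shape_demo) %>%
  addTiles() %>%
  addPolygons(
    label = ~NAME,     # hover label for each polygon
    layerId = ~NAME,   # ID that Shiny reports back when a polygon is clicked
    weight = 1,
    fillOpacity = 0.2,
    highlightOptions = highlightOptions(weight = 3, bringToFront = TRUE)
  )

map_interactive</code></pre>
<p>Hovering over a polygon shows its name, which is the behaviour that helps users who know roughly where they live but not what their constituency is called. Inside a Shiny app, the clicked polygon’s <code>layerId</code> is made available through the map’s <code>_shape_click</code> input.</p>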
</div>
<div id="what-are-our-next-steps" class="section level2">
<h2>💭 What are our next steps?</h2>
<p>There were some cool features that we would have liked to implement, but have not yet been able to:</p>
<ul>
<li><a href="https://github.com/Hong-Kong-Districts-Info/dashboard-hkdistrictcouncillors/issues/17">precommit hooks</a>: Those familiar with Python may be aware of pre-commit hooks as ways to automatically detect whether your repo contains anything sensitive like a <code>.secrets</code> file. Setting this up will enable us to have automated checks run each time we make a commit to ensure we are following specified standards.
<ul>
<li>Unfortunately, we named our repo with a hyphen so the pre-commit hooks won’t work.</li>
<li><a href="https://github.com/Hong-Kong-Districts-Info/dashboard-hkdistrictcouncillors/issues/33">codecov</a>: Allows us to robustly test the functions in our code so that they work under a multitude of scenarios such as when users encounter problems.</li>
</ul></li>
<li><a href="https://github.com/Hong-Kong-Districts-Info/dashboard-hkdistrictcouncillors/issues/26">modularise shiny code</a>: Ensures our Shiny code is chunked so individual pieces of logic are isolated.
<ul>
<li>This makes the overall code easier to follow as it separates the objects that are connected from those that are not. It also makes testing easier because you can test each isolated chunk.</li>
</ul></li>
<li><a href="https://github.com/Hong-Kong-Districts-Info/dashboard-hkdistrictcouncillors/issues/36">language selection</a>: Currently the app is a smorgasbord of English and Chinese. Consequently, it looks messy. We want to implement the ability for the user to choose which language they want to see the app in and the app’s language will update accordingly.</li>
<li>Release to alpha testers to get early feedback.</li>
<li>More of our enhancements / spikes are listed here on <a href="https://github.com/Hong-Kong-Districts-Info/dashboard-hkdistrictcouncillors/issues">GitHub</a></li>
</ul>
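<p>To give a flavour of what modularising Shiny code means, here is a minimal sketch of a Shiny module. It is not taken from our app: the module name, its inputs and the placeholder choices are all hypothetical, and a real module would do something more useful than echoing the selection.</p>
<pre><code>library(shiny)

# Hypothetical module: a self-contained "councillor details" panel
councillor_ui <- function(id) {
  ns <- NS(id)  # namespaces input/output IDs so that modules cannot clash
  tagList(
    selectInput(ns("name"), "Councillor",
                choices = c("Councillor A", "Councillor B")),
    textOutput(ns("summary"))
  )
}

councillor_server <- function(id) {
  moduleServer(id, function(input, output, session) {
    # Logic here can only see this module's own inputs and outputs
    output$summary <- renderText(paste("Selected:", input$name))
  })
}

ui <- fluidPage(councillor_ui("details"))

server <- function(input, output, session) {
  councillor_server("details")
}

# shinyApp(ui, server)  # uncomment to run interactively</code></pre>
<p>Because the module’s server logic is a plain function, it can be exercised on its own with <code>shiny::testServer()</code>, which is what makes the isolated chunks easier to test.</p>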
<p>One of the things that we wanted to try out with this open-source project is to adhere to some DevOps best practices, yet unfortunately some of these were either easier to set up from the beginning, or require more time and knowledge (on our part) to set up. As we develop a V2 of this Shiny App and work on <a href="https://hong-kong-districts-info.github.io/portfolio/">other projects</a>, we hope to find the opportunity to implement more of the above features.</p>
</div>
<div id="other-features-in-the-app" class="section level2">
<h2>🔥 Other features in the app</h2>
<p>There were also a number of features that we have implemented, but were not detailed in this post. For instance:</p>
<ul>
<li>Adding a searchable DataTable with information on the District Councillors, with the <strong>DT</strong> package</li>
<li>Embedding a user survey within the Shiny app</li>
<li>Adding a tutorial to go through features of the Shiny app, using the <strong>rintrojs</strong> package</li>
<li>Adding loader animations with <strong>shinycssloaders</strong></li>
</ul>
<p>We will cover more of that detail in a Part 2 of this blog, so watch this space!</p>
</div>
<div id="who-is-behind-this" class="section level2">
<h2>💪 Who is behind this?</h2>
<p>Multiple people contributed to this work. <strong>Avision Ho</strong> is a data scientist who wrote the majority of the Shiny app, and who was also <a href="https://martinctc.github.io/blog/data-chats-an-interview-with-avision-ho/">previously interviewed on this blog</a>. Avision is a co-author on this post. <strong>Ocean Cheung</strong> came up with the original idea of this app, and made it all possible with his knowledge and network with District Councillors. We would also like to credit <strong>Justin Yim</strong>, <strong>Tiffany Chau</strong>, and <strong>Gabriel Tam</strong> for their feedback and advice on the scope and the direction of this app. We are currently working on a number of other projects, which you can find out more about on our website: <a href="https://hong-kong-districts-info.github.io/" class="uri">https://hong-kong-districts-info.github.io/</a>.</p>
<p>(Disclaimer! We are not affiliated to any political individuals nor movements. We are simply some people who’d like to contribute to society through code and open-source projects.)</p>
</div>
<div id="want-to-get-involved" class="section level2">
<h2>✋ Want to get involved?</h2>
<p>We’re looking for collaborators or reviewers, so please send us an email (<a href="mailto:hkdistricts.info@gmail.com" class="email">hkdistricts.info@gmail.com</a>), or comment down below if you are interested! We would also appreciate any feedback or questions, which you can either leave in the comments below or submit through our <a href="https://hkdistricts-info.shinyapps.io/dashboard-hkdistrictcouncillors/">in-app survey</a>. You can also get an idea of things we are planning to work on through our Trello board <a href="https://trello.com/b/n5l7DMS5/doing">here</a>.</p>
<p>When we first started out, we were just a couple of people who wanted to learn and practise a new skill (e.g. building a Shiny app, implementing best practices), and wanted a meaningful open-source project that we could work on. Read more about <a href="https://hong-kong-districts-info.github.io/about/">our Vision Statement here</a>.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>There are many reasons for this, and arguably a similar phenomenon can be observed in most local elections in other countries. See Lee, F. L., & Chan, J. M. (2008). Making sense of participation: The political culture of pro-democracy demonstrators in Hong Kong. <em>The China Quarterly</em>, 84-101.<a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p>This is in compliance with the ICO’s description of the ‘public domain’, i.e. that <em>information is only in the public domain if it is realistically accessible to a member of the general public at the time of the request. It must be available in practice, not just in theory</em>.<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</section>Martin ChanVignette: Generate your own ggplot theme gallery2020-05-08T00:00:00+00:002020-05-08T00:00:00+00:00https://martinctc.github.io/blog/vignette-generate-your-own-ggplot-theme-gallery<script src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/header-attrs-2.1.1/header-attrs.js"></script>
<section class="main-content">
<div id="background" class="section level2">
<h2>Background</h2>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/diy-ggplot-theme.png" /></p>
<p>I’ve always found it a bit of a pain to explore and choose from all the different themes available out there for {ggplot2}.</p>
<p>Yes I know, I know - there are probably tons of websites out there with a ggplot theme gallery which I can Google,<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> but it’s always more fun if you can create your own. So here’s my attempt to do this, on a lockdown Bank Holiday afternoon.</p>
</div>
<div id="diy-ggplot-theme-gallery" class="section level2">
<h2>DIY ggplot theme gallery 📊</h2>
<div id="start-with-a-list-of-plots-and-a-list-of-themes" class="section level3">
<h3>1. Start with a list of plots and a list of themes</h3>
<p>The outcome I want to achieve from this is to create something that would make it easier to decide which ggplot theme to pick for the visualisation at hand. The solution doesn’t need to be fancy: it would be helpful enough to generate all the combinations of plot types X themes, so I can browse through them and get inspiration more easily.</p>
<p>I took a leaf out of <a href="https://www.shanelynn.ie/themes-and-colours-for-r-ggplots-with-ggthemr/">Shane Lynn’s book/blog</a> and created a couple of “base plots” using <code>iris</code> (yes, boring, but it works). I did these for four types of plots:</p>
<ol style="list-style-type: decimal">
<li>scatter plot</li>
<li>bar plot</li>
<li>box plot</li>
<li>density plot</li>
</ol>
<p>I then assigned these four plots into a list object called <code>plot_list</code>, and converted them into a tibble (<code>plot_base</code>) that I could use for joining afterwards.</p>
<p>This step is then repeated for themes, where I punched virtually all the existing themes in {ggplot2} and {ggthemes} into a named list (<code>theme_list</code>), and also created a tibble (<code>theme_base</code>). You can make this list as long and exhaustive as you want, but for this example I didn’t want to go into overkill.</p>
<p>You’ll see that I’ve made the names quite elaborate in terms of specifying the package source. This is because these names will be used afterwards in the plot output, and it will be helpful for identifying the function for generating the theme in the gallery.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="co">#### Load packages ####</span></span>
<span id="cb1-2"><a href="#cb1-2"></a><span class="kw">library</span>(tidyverse)</span>
<span id="cb1-3"><a href="#cb1-3"></a><span class="kw">library</span>(ggthemes) <span class="co"># Optional - only for testing additional themes</span></span>
<span id="cb1-4"><a href="#cb1-4"></a></span>
<span id="cb1-5"><a href="#cb1-5"></a></span>
<span id="cb1-6"><a href="#cb1-6"></a><span class="co">#### Create base plots ####</span></span>
<span id="cb1-7"><a href="#cb1-7"></a><span class="co">## scatter plot</span></span>
<span id="cb1-8"><a href="#cb1-8"></a>point_plot <-</span>
<span id="cb1-9"><a href="#cb1-9"></a><span class="st"> </span><span class="kw">ggplot</span>(iris, <span class="kw">aes</span>(<span class="dt">x=</span><span class="kw">jitter</span>(Sepal.Width),</span>
<span id="cb1-10"><a href="#cb1-10"></a> <span class="dt">y=</span><span class="kw">jitter</span>(Sepal.Length),</span>
<span id="cb1-11"><a href="#cb1-11"></a> <span class="dt">col=</span>Species)) <span class="op">+</span></span>
<span id="cb1-12"><a href="#cb1-12"></a><span class="st"> </span><span class="kw">geom_point</span>() <span class="op">+</span></span>
<span id="cb1-13"><a href="#cb1-13"></a><span class="st"> </span><span class="kw">labs</span>(<span class="dt">x=</span><span class="st">"Sepal Width (cm)"</span>,</span>
<span id="cb1-14"><a href="#cb1-14"></a> <span class="dt">y=</span><span class="st">"Sepal Length (cm)"</span>,</span>
<span id="cb1-15"><a href="#cb1-15"></a> <span class="dt">col=</span><span class="st">"Species"</span>,</span>
<span id="cb1-16"><a href="#cb1-16"></a> <span class="dt">title=</span><span class="st">"Iris Dataset - Scatter plot"</span>)</span>
<span id="cb1-17"><a href="#cb1-17"></a></span>
<span id="cb1-18"><a href="#cb1-18"></a><span class="co">## bar plot</span></span>
<span id="cb1-19"><a href="#cb1-19"></a>bar_plot <-</span>
<span id="cb1-20"><a href="#cb1-20"></a><span class="st"> </span>iris <span class="op">%>%</span></span>
<span id="cb1-21"><a href="#cb1-21"></a><span class="st"> </span><span class="kw">group_by</span>(Species) <span class="op">%>%</span></span>
<span id="cb1-22"><a href="#cb1-22"></a><span class="st"> </span><span class="kw">summarise</span>(<span class="dt">Sepal.Width =</span> <span class="kw">mean</span>(Sepal.Width)) <span class="op">%>%</span></span>
<span id="cb1-23"><a href="#cb1-23"></a><span class="st"> </span><span class="kw">ggplot</span>(<span class="kw">aes</span>(<span class="dt">x=</span>Species, <span class="dt">y=</span>Sepal.Width, <span class="dt">fill=</span>Species)) <span class="op">+</span></span>
<span id="cb1-24"><a href="#cb1-24"></a><span class="st"> </span><span class="kw">geom_col</span>() <span class="op">+</span></span>
<span id="cb1-25"><a href="#cb1-25"></a><span class="st"> </span><span class="kw">labs</span>(<span class="dt">x=</span><span class="st">"Species"</span>,</span>
<span id="cb1-26"><a href="#cb1-26"></a> <span class="dt">y=</span><span class="st">"Mean Sepal Width (cm)"</span>,</span>
<span id="cb1-27"><a href="#cb1-27"></a> <span class="dt">fill=</span><span class="st">"Species"</span>,</span>
<span id="cb1-28"><a href="#cb1-28"></a> <span class="dt">title=</span><span class="st">"Iris Dataset - Bar plot"</span>)</span>
<span id="cb1-29"><a href="#cb1-29"></a></span>
<span id="cb1-30"><a href="#cb1-30"></a><span class="co">## box plot</span></span>
<span id="cb1-31"><a href="#cb1-31"></a>box_plot <-<span class="st"> </span><span class="kw">ggplot</span>(iris,</span>
<span id="cb1-32"><a href="#cb1-32"></a> <span class="kw">aes</span>(<span class="dt">x=</span>Species,</span>
<span id="cb1-33"><a href="#cb1-33"></a> <span class="dt">y=</span>Sepal.Width,</span>
<span id="cb1-34"><a href="#cb1-34"></a> <span class="dt">fill=</span>Species)) <span class="op">+</span></span>
<span id="cb1-35"><a href="#cb1-35"></a><span class="st"> </span><span class="kw">geom_boxplot</span>() <span class="op">+</span></span>
<span id="cb1-36"><a href="#cb1-36"></a><span class="st"> </span><span class="kw">labs</span>(<span class="dt">x=</span><span class="st">"Species"</span>,</span>
<span id="cb1-37"><a href="#cb1-37"></a> <span class="dt">y=</span><span class="st">"Sepal Width (cm)"</span>,</span>
<span id="cb1-38"><a href="#cb1-38"></a> <span class="dt">fill=</span><span class="st">"Species"</span>,</span>
<span id="cb1-39"><a href="#cb1-39"></a> <span class="dt">title=</span><span class="st">"Iris Dataset - Box plot"</span>)</span>
<span id="cb1-40"><a href="#cb1-40"></a></span>
<span id="cb1-41"><a href="#cb1-41"></a><span class="co">## density plot</span></span>
<span id="cb1-42"><a href="#cb1-42"></a>density_plot <-</span>
<span id="cb1-43"><a href="#cb1-43"></a><span class="st"> </span>iris <span class="op">%>%</span></span>
<span id="cb1-44"><a href="#cb1-44"></a><span class="st"> </span><span class="kw">ggplot</span>(<span class="kw">aes</span>(<span class="dt">x =</span> Sepal.Length, <span class="dt">fill =</span> Species)) <span class="op">+</span></span>
<span id="cb1-45"><a href="#cb1-45"></a><span class="st"> </span><span class="kw">geom_density</span>() <span class="op">+</span></span>
<span id="cb1-46"><a href="#cb1-46"></a><span class="st"> </span><span class="kw">facet_wrap</span>(.<span class="op">~</span>Species) <span class="op">+</span></span>
<span id="cb1-47"><a href="#cb1-47"></a><span class="st"> </span><span class="kw">labs</span>(<span class="dt">x=</span><span class="st">"Sepal Length (cm)"</span>,</span>
<span id="cb1-48"><a href="#cb1-48"></a> <span class="dt">y=</span><span class="st">"Density"</span>,</span>
<span id="cb1-49"><a href="#cb1-49"></a> <span class="dt">fill=</span><span class="st">"Species"</span>,</span>
<span id="cb1-50"><a href="#cb1-50"></a> <span class="dt">title=</span><span class="st">"Iris Dataset - Density plot"</span>)</span>
<span id="cb1-51"><a href="#cb1-51"></a></span>
<span id="cb1-52"><a href="#cb1-52"></a><span class="co">#### Create iteration table ####</span></span>
<span id="cb1-53"><a href="#cb1-53"></a><span class="co">## Put all base plots in a list</span></span>
<span id="cb1-54"><a href="#cb1-54"></a>plot_list <-</span>
<span id="cb1-55"><a href="#cb1-55"></a><span class="st"> </span><span class="kw">list</span>(<span class="st">"bar plot"</span> =<span class="st"> </span>bar_plot,</span>
<span id="cb1-56"><a href="#cb1-56"></a> <span class="st">"box plot"</span> =<span class="st"> </span>box_plot,</span>
<span id="cb1-57"><a href="#cb1-57"></a> <span class="st">"scatter plot"</span> =<span class="st"> </span>point_plot,</span>
<span id="cb1-58"><a href="#cb1-58"></a> <span class="st">"density plot"</span> =<span class="st"> </span>density_plot)</span>
<span id="cb1-59"><a href="#cb1-59"></a></span>
<span id="cb1-60"><a href="#cb1-60"></a><span class="co">## Convert list into a tibble</span></span>
<span id="cb1-61"><a href="#cb1-61"></a>plot_base <-</span>
<span id="cb1-62"><a href="#cb1-62"></a><span class="st"> </span><span class="kw">tibble</span>(<span class="dt">plot =</span> plot_list,</span>
<span id="cb1-63"><a href="#cb1-63"></a> <span class="dt">plot_names =</span> <span class="kw">names</span>(plot_list))</span>
<span id="cb1-64"><a href="#cb1-64"></a></span>
<span id="cb1-65"><a href="#cb1-65"></a><span class="co">## Put all themes to test in a named list</span></span>
<span id="cb1-66"><a href="#cb1-66"></a><span class="co">## names will be fed into subtitles</span></span>
<span id="cb1-67"><a href="#cb1-67"></a>theme_list <-</span>
<span id="cb1-68"><a href="#cb1-68"></a><span class="st"> </span><span class="kw">list</span>(<span class="st">"ggplot2::theme_minimal()"</span> =<span class="st"> </span><span class="kw">theme_minimal</span>(),</span>
<span id="cb1-69"><a href="#cb1-69"></a> <span class="st">"ggplot2::theme_classic()"</span> =<span class="st"> </span><span class="kw">theme_classic</span>(),</span>
<span id="cb1-70"><a href="#cb1-70"></a> <span class="st">"ggplot2::theme_bw()"</span> =<span class="st"> </span><span class="kw">theme_bw</span>(),</span>
<span id="cb1-71"><a href="#cb1-71"></a> <span class="st">"ggplot2::theme_gray()"</span> =<span class="st"> </span><span class="kw">theme_gray</span>(),</span>
<span id="cb1-72"><a href="#cb1-72"></a> <span class="st">"ggplot2::theme_linedraw()"</span> =<span class="st"> </span><span class="kw">theme_linedraw</span>(),</span>
<span id="cb1-73"><a href="#cb1-73"></a> <span class="st">"ggplot2::theme_light()"</span> =<span class="st"> </span><span class="kw">theme_light</span>(),</span>
<span id="cb1-74"><a href="#cb1-74"></a> <span class="st">"ggplot2::theme_dark()"</span> =<span class="st"> </span><span class="kw">theme_dark</span>(),</span>
<span id="cb1-75"><a href="#cb1-75"></a> <span class="st">"ggthemes::theme_economist()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_economist</span>(),</span>
<span id="cb1-76"><a href="#cb1-76"></a> <span class="st">"ggthemes::theme_economist_white()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_economist_white</span>(),</span>
<span id="cb1-77"><a href="#cb1-77"></a> <span class="st">"ggthemes::theme_calc()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_calc</span>(),</span>
<span id="cb1-78"><a href="#cb1-78"></a> <span class="st">"ggthemes::theme_clean()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_clean</span>(),</span>
<span id="cb1-79"><a href="#cb1-79"></a> <span class="st">"ggthemes::theme_excel()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_excel</span>(),</span>
<span id="cb1-80"><a href="#cb1-80"></a> <span class="st">"ggthemes::theme_excel_new()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_excel_new</span>(),</span>
<span id="cb1-81"><a href="#cb1-81"></a> <span class="st">"ggthemes::theme_few()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_few</span>(),</span>
<span id="cb1-82"><a href="#cb1-82"></a> <span class="st">"ggthemes::theme_fivethirtyeight()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_fivethirtyeight</span>(),</span>
<span id="cb1-83"><a href="#cb1-83"></a> <span class="st">"ggthemes::theme_foundation()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_foundation</span>(),</span>
<span id="cb1-84"><a href="#cb1-84"></a> <span class="st">"ggthemes::theme_gdocs()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_gdocs</span>(),</span>
<span id="cb1-85"><a href="#cb1-85"></a> <span class="st">"ggthemes::theme_hc()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_hc</span>(),</span>
<span id="cb1-86"><a href="#cb1-86"></a> <span class="st">"ggthemes::theme_igray()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_igray</span>(),</span>
<span id="cb1-87"><a href="#cb1-87"></a> <span class="st">"ggthemes::theme_solarized()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_solarized</span>(),</span>
<span id="cb1-88"><a href="#cb1-88"></a> <span class="st">"ggthemes::theme_solarized_2()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_solarized_2</span>(),</span>
<span id="cb1-89"><a href="#cb1-89"></a> <span class="st">"ggthemes::theme_solid()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_solid</span>(),</span>
<span id="cb1-90"><a href="#cb1-90"></a> <span class="st">"ggthemes::theme_stata()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_stata</span>(),</span>
<span id="cb1-91"><a href="#cb1-91"></a> <span class="st">"ggthemes::theme_tufte()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_tufte</span>(),</span>
<span id="cb1-92"><a href="#cb1-92"></a> <span class="st">"ggthemes::theme_wsj()"</span> =<span class="st"> </span>ggthemes<span class="op">::</span><span class="kw">theme_wsj</span>())</span>
<span id="cb1-93"><a href="#cb1-93"></a></span>
<span id="cb1-94"><a href="#cb1-94"></a><span class="co">## Convert list into a tibble</span></span>
<span id="cb1-95"><a href="#cb1-95"></a>theme_base <-</span>
<span id="cb1-96"><a href="#cb1-96"></a><span class="st"> </span><span class="kw">tibble</span>(<span class="dt">theme =</span> theme_list,</span>
<span id="cb1-97"><a href="#cb1-97"></a> <span class="dt">theme_names =</span> <span class="kw">names</span>(theme_list))</span>
<span id="cb1-98"><a href="#cb1-98"></a></span>
<span id="cb1-99"><a href="#cb1-99"></a>plot_base</span></code></pre></div>
<pre><code>## # A tibble: 4 x 2
## plot plot_names
## <named list> <chr>
## 1 <gg> bar plot
## 2 <gg> box plot
## 3 <gg> scatter plot
## 4 <gg> density plot</code></pre>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1"></a>theme_base</span></code></pre></div>
<pre><code>## # A tibble: 25 x 2
## theme theme_names
## <named list> <chr>
## 1 <theme> ggplot2::theme_minimal()
## 2 <theme> ggplot2::theme_classic()
## 3 <theme> ggplot2::theme_bw()
## 4 <theme> ggplot2::theme_gray()
## 5 <theme> ggplot2::theme_linedraw()
## 6 <theme> ggplot2::theme_light()
## 7 <theme> ggplot2::theme_dark()
## 8 <theme> ggthemes::theme_economist()
## 9 <theme> ggthemes::theme_economist_white()
## 10 <theme> ggthemes::theme_calc()
## # ... with 15 more rows</code></pre>
</div>
<div id="create-an-iteration-table" class="section level3">
<h3>2. Create an iteration table</h3>
<p>The next step is to create what I call an iteration table. Here I use <code>tidyr::expand_grid()</code>, which <strong>creates a tibble from all combinations of inputs</strong>. Actually you can use either <code>tidyr::expand_grid()</code> or the base function <code>expand.grid()</code>, but I like the fact that the former returns a tibble rather than a data frame.</p>
<p>The output is <code>all_combos</code>, which is a two-column tibble with all combinations of <code>theme_names</code> and <code>plot_names</code>, as character vectors. I then use <code>left_join()</code> twice to bring in the themes and the base plots:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1"></a><span class="co">## Create an iteration data frame</span></span>
<span id="cb5-2"><a href="#cb5-2"></a><span class="co">## Use `expand_grid()` to generate all combinations</span></span>
<span id="cb5-3"><a href="#cb5-3"></a><span class="co">## of themes and plots</span></span>
<span id="cb5-4"><a href="#cb5-4"></a></span>
<span id="cb5-5"><a href="#cb5-5"></a>all_combos <-</span>
<span id="cb5-6"><a href="#cb5-6"></a><span class="st"> </span><span class="kw">expand_grid</span>(<span class="dt">plot_names =</span> plot_base<span class="op">$</span>plot_names,</span>
<span id="cb5-7"><a href="#cb5-7"></a> <span class="dt">theme_names =</span> theme_base<span class="op">$</span>theme_names)</span>
<span id="cb5-8"><a href="#cb5-8"></a> </span>
<span id="cb5-9"><a href="#cb5-9"></a>iter_df <-</span>
<span id="cb5-10"><a href="#cb5-10"></a><span class="st"> </span>all_combos <span class="op">%>%</span></span>
<span id="cb5-11"><a href="#cb5-11"></a><span class="st"> </span><span class="kw">left_join</span>(plot_base, <span class="dt">by =</span> <span class="st">"plot_names"</span>) <span class="op">%>%</span></span>
<span id="cb5-12"><a href="#cb5-12"></a><span class="st"> </span><span class="kw">left_join</span>(theme_base, <span class="dt">by =</span> <span class="st">"theme_names"</span>) <span class="op">%>%</span></span>
<span id="cb5-13"><a href="#cb5-13"></a><span class="st"> </span><span class="kw">select</span>(theme_names, theme, plot_names, plot) <span class="co"># Reorder columns</span></span>
<span id="cb5-14"><a href="#cb5-14"></a></span>
<span id="cb5-15"><a href="#cb5-15"></a>iter_df</span></code></pre></div>
<pre><code>## # A tibble: 100 x 4
## theme_names theme plot_names plot
## <chr> <list> <chr> <list>
## 1 ggplot2::theme_minimal() <theme> bar plot <gg>
## 2 ggplot2::theme_classic() <theme> bar plot <gg>
## 3 ggplot2::theme_bw() <theme> bar plot <gg>
## 4 ggplot2::theme_gray() <theme> bar plot <gg>
## 5 ggplot2::theme_linedraw() <theme> bar plot <gg>
## 6 ggplot2::theme_light() <theme> bar plot <gg>
## 7 ggplot2::theme_dark() <theme> bar plot <gg>
## 8 ggthemes::theme_economist() <theme> bar plot <gg>
## 9 ggthemes::theme_economist_white() <theme> bar plot <gg>
## 10 ggthemes::theme_calc() <theme> bar plot <gg>
## # ... with 90 more rows</code></pre>
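<p>If it helps to see <code>expand_grid()</code> in isolation, here is a small sketch with toy inputs. The object <code>toy_combos</code> and the shortened theme names are made up for illustration; the idea is the same as the 4 plots x 25 themes = 100 rows above.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(tidyr)

# Toy inputs standing in for plot_base$plot_names and theme_base$theme_names
toy_combos <-
  expand_grid(plot_names = c("bar plot", "box plot"),
              theme_names = c("theme_bw()", "theme_dark()", "theme_light()"))

nrow(toy_combos)   # 2 plot types x 3 themes = 6 rows
class(toy_combos)  # a tibble, unlike the plain data frame from expand.grid()</code></pre></div>
<p>Because the result only holds the names, it stays small and easy to inspect; the heavier list-columns of plots and themes are only attached afterwards by the joins.</p>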
</div>
<div id="run-your-ggplot-gallery" class="section level3">
<h3>3. Run your ggplot gallery!</h3>
<p>The final step is to create the ggplot “gallery”.</p>
<p>I used <code>purrr::pmap()</code> on <code>iter_df</code>, which applies a function to the data frame, using the values in each column as inputs to the arguments of the function. You will see that:</p>
<ul>
<li><code>iter_label</code> is ultimately used as the names for the list of plots (<code>plot_gallery</code>).</li>
<li><code>label</code> within the function is used for populating the subtitles of the plots</li>
<li><code>output_plot</code> is the plot that is created within the function</li>
</ul>
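<p>Before the full call, the way <code>pmap()</code> matches columns to function arguments can be seen in a toy sketch. The data frame <code>toy_df</code> and its columns are made up for illustration, and naming the result in a separate step is an assumption that mirrors how <code>iter_label</code> is described above.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(purrr)
library(tibble)

toy_df <- tibble(a = 1:3, b = c(10, 20, 30))

# Each row becomes one function call; column names must match argument names
toy_out <- pmap(toy_df, function(a, b) a + b)

# Names are attached separately, in the same spirit as `iter_label`
names(toy_out) <- paste0("row_", seq_along(toy_out))

toy_out$row_2  # second row: 2 + 20 = 22</code></pre></div>
<p>This is also why the function below takes exactly four arguments named after the four columns of <code>iter_df</code>.</p>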
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1"></a><span class="co">#### Run plots ####</span></span>
<span id="cb7-2"><a href="#cb7-2"></a><span class="co">## Use `pmap()` to run all the plots-theme combinations</span></span>
<span id="cb7-3"><a href="#cb7-3"></a></span>
<span id="cb7-4"><a href="#cb7-4"></a><span class="co">## Create labels to be used as names for `plot_gallery`</span></span>
<span id="cb7-5"><a href="#cb7-5"></a>iter_label <-</span>
<span id="cb7-6"><a href="#cb7-6"></a><span class="st"> </span><span class="kw">paste0</span>(<span class="st">"Theme: "</span>,</span>
<span id="cb7-7"><a href="#cb7-7"></a> iter_df<span class="op">$</span>theme_names,</span>
<span id="cb7-8"><a href="#cb7-8"></a> <span class="st">"; Plot type: "</span>,</span>
<span id="cb7-9"><a href="#cb7-9"></a> iter_df<span class="op">$</span>plot_names)</span>
<span id="cb7-10"><a href="#cb7-10"></a></span>
<span id="cb7-11"><a href="#cb7-11"></a><span class="co">## Create a list of plots</span></span>
<span id="cb7-12"><a href="#cb7-12"></a>plot_gallery <-</span>
<span id="cb7-13"><a href="#cb7-13"></a><span class="st"> </span>iter_df <span class="op">%>%</span></span>
<span id="cb7-14"><a href="#cb7-14"></a><span class="st"> </span><span class="kw">pmap</span>(<span class="cf">function</span>(theme_names, theme, plot_names, plot){</span>
<span id="cb7-15"><a href="#cb7-15"></a> </span>
<span id="cb7-16"><a href="#cb7-16"></a> label <-<span class="st"> </span></span>
<span id="cb7-17"><a href="#cb7-17"></a><span class="st"> </span><span class="kw">paste0</span>(<span class="st">"Theme: "</span>,</span>
<span id="cb7-18"><a href="#cb7-18"></a> theme_names,</span>
<span id="cb7-19"><a href="#cb7-19"></a> <span class="st">"</span><span class="ch">\n</span><span class="st">Plot type: "</span>,</span>
<span id="cb7-20"><a href="#cb7-20"></a> plot_names)</span>
<span id="cb7-21"><a href="#cb7-21"></a></span>
<span id="cb7-22"><a href="#cb7-22"></a> output_plot <-</span>
<span id="cb7-23"><a href="#cb7-23"></a><span class="st"> </span>plot <span class="op">+</span></span>
<span id="cb7-24"><a href="#cb7-24"></a><span class="st"> </span>theme <span class="op">+</span></span>
<span id="cb7-25"><a href="#cb7-25"></a><span class="st"> </span><span class="kw">labs</span>(<span class="dt">subtitle =</span> label)</span>
<span id="cb7-26"><a href="#cb7-26"></a> </span>
<span id="cb7-27"><a href="#cb7-27"></a> <span class="kw">return</span>(output_plot)</span>
<span id="cb7-28"><a href="#cb7-28"></a> }) <span class="op">%>%</span></span>
<span id="cb7-29"><a href="#cb7-29"></a><span class="st"> </span><span class="kw">set_names</span>(iter_label)</span>
<span id="cb7-30"><a href="#cb7-30"></a></span>
<span id="cb7-31"><a href="#cb7-31"></a></span>
<span id="cb7-32"><a href="#cb7-32"></a>plot_gallery</span></code></pre></div>
<pre><code>## $`Theme: ggplot2::theme_minimal(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-1.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_classic(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-2.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_bw(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-3.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_gray(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-4.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_linedraw(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-5.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_light(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-6.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_dark(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-7.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_economist(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-8.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_economist_white(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-9.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_calc(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-10.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_clean(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-11.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_excel(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-12.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_excel_new(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-13.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_few(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-14.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_fivethirtyeight(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-15.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_foundation(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-16.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_gdocs(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-17.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_hc(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-18.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_igray(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-19.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solarized(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-20.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solarized_2(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-21.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solid(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-22.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_stata(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-23.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_tufte(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-24.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_wsj(); Plot type: bar plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-25.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_minimal(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-26.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_classic(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-27.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_bw(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-28.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_gray(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-29.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_linedraw(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-30.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_light(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-31.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_dark(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-32.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_economist(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-33.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_economist_white(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-34.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_calc(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-35.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_clean(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-36.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_excel(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-37.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_excel_new(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-38.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_few(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-39.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_fivethirtyeight(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-40.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_foundation(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-41.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_gdocs(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-42.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_hc(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-43.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_igray(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-44.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solarized(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-45.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solarized_2(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-46.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solid(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-47.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_stata(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-48.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_tufte(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-49.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_wsj(); Plot type: box plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-50.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_minimal(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-51.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_classic(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-52.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_bw(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-53.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_gray(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-54.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_linedraw(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-55.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_light(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-56.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_dark(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-57.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_economist(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-58.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_economist_white(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-59.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_calc(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-60.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_clean(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-61.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_excel(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-62.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_excel_new(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-63.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_few(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-64.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_fivethirtyeight(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-65.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_foundation(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-66.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_gdocs(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-67.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_hc(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-68.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_igray(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-69.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solarized(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-70.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solarized_2(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-71.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solid(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-72.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_stata(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-73.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_tufte(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-74.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_wsj(); Plot type: scatter plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-75.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_minimal(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-76.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_classic(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-77.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_bw(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-78.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_gray(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-79.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_linedraw(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-80.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_light(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-81.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggplot2::theme_dark(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-82.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_economist(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-83.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_economist_white(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-84.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_calc(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-85.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_clean(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-86.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_excel(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-87.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_excel_new(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-88.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_few(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-89.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_fivethirtyeight(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-90.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_foundation(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-91.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_gdocs(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-92.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_hc(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-93.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_igray(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-94.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solarized(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-95.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solarized_2(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-96.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_solid(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-97.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_stata(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-98.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_tufte(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-99.png" /><!-- --></p>
<pre><code>##
## $`Theme: ggthemes::theme_wsj(); Plot type: density plot`</code></pre>
<p><img src="https://martinctc.github.io/blog/knitr_files/generate-your-own-ggplot-gallery_20200508_files/figure-html/unnamed-chunk-4-100.png" /><!-- --></p>
</div>
<div id="end-notes" class="section level3">
<h3>End Notes</h3>
<p>And here it is! It didn’t take many lines of code, yet <code>expand_grid()</code> and <code>pmap()</code> already generate a large number of plots: 25 themes across 4 plot types gives 100 in total.</p>
<p>I should also caveat that this is by no means a “pretty” gallery; it’s very much a minimal implementation, but is good enough for my own consumption.</p>
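<p>To make the arithmetic behind the gallery concrete, here is a scaled-down sketch, assuming toy plot and theme names rather than the full lists used above. <code>expand_grid()</code> returns one row per combination, so 2 plot types and 3 themes give 6 rows, in the same way that 4 plot types and 25 themes give the 100 rows of <code>iter_df</code>:</p>
<pre class="sourceCode r"><code class="sourceCode r">library(tidyr)

# Hypothetical, shortened inputs for illustration only
toy_grid <- expand_grid(
  plot_names  = c("bar plot", "box plot"),
  theme_names = c("theme_minimal()", "theme_classic()", "theme_bw()")
)

toy_grid
nrow(toy_grid) # 6</code></pre>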
</div>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>See <a href="https://ggplot2.tidyverse.org/reference/ggtheme.html" class="uri">https://ggplot2.tidyverse.org/reference/ggtheme.html</a> and <a href="https://cmdlinetips.com/2019/10/8-ggplot2-themes/" class="uri">https://cmdlinetips.com/2019/10/8-ggplot2-themes/</a> for instance.<a href="#fnref1" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</section>Martin ChanVignette: Simulating a minimal SPSS dataset from R2020-04-30T00:00:00+00:002020-04-30T00:00:00+00:00https://martinctc.github.io/blog/vignette-simulating-a-minimal-spss-dataset-from-r<script src="https://martinctc.github.io/blog/knitr_files/minimal-sav_22-04-2020_files/header-attrs-2.1.1/header-attrs.js"></script>
<section class="main-content">
<div id="tldr" class="section level2">
<h2>What this is about 📖</h2>
<p>I will simulate a minimal <strong>labelled survey</strong> dataset in R that can be exported as an SPSS (.SAV) file, with full variable and value labels. I will also attempt to build ‘meaningful patterns’ into the dataset so that it can be used more effectively for creating demo examples.</p>
<div class="figure">
<img src="https://raw.githubusercontent.com/martinctc/blog/master/images/surveysays.gif" alt="" />
<p class="caption">image from Giphy</p>
</div>
</div>
<div id="background" class="section level2">
<h2>Background</h2>
<p>Simulating data is one of the most useful skills to have in R. For one, it is helpful when you’re debugging code and want to create a <strong>reprex</strong> (reproducible example) to ask for help more effectively (<em>help others help you</em>, as the saying goes).<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> However, regardless of whether you’re a researcher or a business analyst, the data associated with your code is likely to be either <strong>confidential</strong>, so you cannot share it on <a href="https://stackoverflow.com/">Stack Overflow</a>, or far too large or complex to upload anyway. Creating an example dataset from a few lines of code that you can safely share is an effective way to get around this problem.</p>
<p>Data simulation is slightly more tricky with <strong>survey datasets</strong>, which are characterised by (1) <strong>labels on both variables and values/codes</strong>, and (2) <strong>a large proportion of ordinal / categorical variables</strong>.</p>
<p>For instance, a Net Promoter Score (NPS) variable is usually accompanied by the variable label <em>“On a scale of 0-10, how likely are you to recommend X to a friend or family?”</em> (i.e. the actual question asked in a survey), and is itself an instance of an ordinal variable. If you are trying to produce an example that hinges on an issue where labels are relevant, you would need to simulate the labels as well.</p>
<p>There are also <em>educational</em> reasons for simulating data: it is useful to simulate data to demo an analysis or a function, because this makes it easy for the audience to reproduce the example. For this purpose, it is especially beneficial if you can simulate a dataset in which you introduce some arbitrary relationships between the variables, rather than leaving them completely random (<code>sample()</code> all the way).</p>
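<p>To show what such a labelled variable looks like in R, here is a minimal sketch using <code>haven::labelled()</code> directly, rather than the {surveytoolbox} helpers used later in this post. The scores and the two value labels (“Not at all likely” and “Extremely likely”) are made up for illustration:</p>
<pre class="sourceCode r"><code class="sourceCode r">library(haven)

# Made-up responses, purely for illustration
nps_scores <- c(10, 9, 3, 7, 0)

nps <- labelled(
  nps_scores,
  # Value labels: assumed wording for the end points of the scale
  labels = c("Not at all likely" = 0, "Extremely likely" = 10),
  # Variable label: the survey question itself
  label = "On a scale of 0-10, how likely are you to recommend X to a friend or family?"
)

attr(nps, "label")  # the variable label
attr(nps, "labels") # the value labels</code></pre>
<p>Both kinds of labels are stored as attributes on the vector, which is why they need to be simulated alongside the numbers themselves.</p>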
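<p>As a rough sketch of what an “arbitrary relationship” could look like, the example below makes one simulated variable depend on another plus some random noise. The variable names <code>satisfaction</code> and <code>recommend</code>, the multiplier and the noise level are all assumptions chosen for illustration, not the approach used later in this post:</p>
<pre class="sourceCode r"><code class="sourceCode r">set.seed(100) # arbitrary seed, for reproducibility

n <- 1000

# A purely random 1-5 rating
satisfaction <- sample(1:5, n, replace = TRUE)

# A 0-10 score that depends on `satisfaction` plus random noise
recommend_raw <- 2 * satisfaction + rnorm(n, mean = 0, sd = 1.5)

# Round to whole numbers and keep the values within the 0-10 scale
recommend <- pmin(pmax(round(recommend_raw), 0), 10)

# The two variables are now positively correlated
cor(satisfaction, recommend)</code></pre>
<p>Had <code>recommend</code> been drawn with <code>sample()</code> on its own, the correlation would hover around zero and there would be no pattern for a demo to pick up.</p>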
<p>Personally, I have in the past found it a pain to simulate datasets which are suited for demo-ing survey related functions, especially when I was working on examples for the <a href="https://www.github.com/martinctc/surveytoolbox">{surveytoolbox}</a> package 📦. Hence, this is partly an attempt to simulate a labelled dataset that is minimally sufficient for demonstrating some of the <a href="https://www.github.com/martinctc/surveytoolbox">{surveytoolbox}</a> functions.</p>
<p>🏷 For more information specifically on manipulating labels in R, do check out a previous post I’ve written on <a href="https://martinctc.github.io/blog/working-with-spss-labels-in-r/">working with SPSS labels in R</a>.</p>
</div>
<div id="getting-started" class="section level2">
<h2>Getting started</h2>
<p>To run this example, we’ll need to load <a href="https://www.tidyverse.org/">{tidyverse}</a>, <a href="https://www.github.com/martinctc/surveytoolbox">{surveytoolbox}</a>, and <a href="https://haven.tidyverse.org/">{haven}</a>. Specifically, I’m using {tidyverse} for its data manipulation functions, {surveytoolbox} for functions to set up variable/value labels, and finally {haven} to export the data as a .SAV file.</p>
<p>Note that {surveytoolbox} is not yet available on CRAN, but you can install it by running <code>devtools::install_github("martinctc/surveytoolbox")</code>. You’ll need {devtools} installed, if you haven’t got it already.</p>
<p>In addition to loading the packages, we will also set the seed<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> with <code>set.seed()</code> to make the simulated numbers reproducible:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">library</span>(tidyverse)</span>
<span id="cb1-2"><a href="#cb1-2"></a><span class="kw">library</span>(surveytoolbox) <span class="co"># Install with devtools::install_github("martinctc/surveytoolbox")</span></span>
<span id="cb1-3"><a href="#cb1-3"></a><span class="kw">library</span>(haven)</span>
<span id="cb1-4"><a href="#cb1-4"></a></span>
<span id="cb1-5"><a href="#cb1-5"></a><span class="kw">set.seed</span>(<span class="dv">100</span>) <span class="co"># Enable reproducibility - 100 is arbitrary</span></span></code></pre></div>
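<p>As a quick sanity check of what the seed buys us, here is a minimal base R sketch (the object names <code>first_draw</code> and <code>second_draw</code> are purely illustrative): resetting the same seed before each call should reproduce the same draw.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(100)
first_draw <- sample(1:10, size = 5)

set.seed(100)  # reset to the same seed
second_draw <- sample(1:10, size = 5)

identical(first_draw, second_draw)  # TRUE: same seed, same draw</code></pre></div>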
</div>
<div id="create-individual-vectors" class="section level2">
<h2>Create individual vectors</h2>
<p>For the purpose of clarity and ease of debugging, my approach will be to first set up each simulated variable as an individual labelled vector, and then bind them together into a data frame at the end. To attach variable and value labels to a numeric vector, I will use <code>set_varl()</code> and <code>set_vall()</code> from {surveytoolbox} respectively.</p>
<p>I want to create a dataset with 1000 observations, so I will start with creating <code>v_id</code> as an ID variable running from 1 to 1000, which can simply be generated with the <code>seq()</code> function.<a href="#fn3" class="footnote-ref" id="fnref3"><sup>3</sup></a> I will then use <code>set_varl()</code> from {surveytoolbox} to set a variable label for the <code>v_id</code> vector. The second argument of <code>set_varl()</code> takes in a character vector and assigns it as the variable label of the target variable - super straightforward.</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb2-1"><a href="#cb2-1"></a><span class="co">## Record Identifier</span></span>
<span id="cb2-2"><a href="#cb2-2"></a>v_id <-</span>
<span id="cb2-3"><a href="#cb2-3"></a><span class="st"> </span><span class="kw">seq</span>(<span class="dv">1</span>, <span class="dv">1000</span>) <span class="op">%>%</span></span>
<span id="cb2-4"><a href="#cb2-4"></a><span class="st"> </span><span class="kw">set_varl</span>(<span class="st">"Record Identifier"</span>)</span></code></pre></div>
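<p>If you are curious where a variable label ends up, here is a minimal base R sketch. It assumes that <code>set_varl()</code> follows the {haven} convention of storing the variable label in the <code>label</code> attribute, which is an assumption about its internals rather than something shown in this post; <code>v_demo</code> is a made-up name.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">v_demo <- seq(1, 10)

# Assumed storage convention: the variable label lives in the "label" attribute
attr(v_demo, "label") <- "Record Identifier"

attr(v_demo, "label")  # retrieve the variable label
str(v_demo)            # the attribute travels with the vector</code></pre></div>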
<p>The same goes for <code>v_gender</code>, but this time I want to also (1) <em>apply an arbitrary probability to the distribution</em>, and (2) <em>give each value in the vector a value label (“Male”, “Female”, “Other”)</em>.</p>
<p>To do (1), I pass a numeric vector to the <code>prob</code> argument of <code>sample()</code>, giving the probabilities with which 1, 2, and 3 are drawn across the 1000 samples.</p>
<p>To do (2), I run <code>set_vall()</code> and pass the desired labels to the <code>value_labels</code> argument. As the code below shows, <code>set_vall()</code> accepts a named vector, where the names are the labels and the values are the codes they apply to.</p>
<p>Finally, I run <code>set_varl()</code> again to make sure that a variable label is present.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1"></a><span class="co">## Gender</span></span>
<span id="cb3-2"><a href="#cb3-2"></a>v_gender <-</span>
<span id="cb3-3"><a href="#cb3-3"></a><span class="st"> </span><span class="kw">sample</span>(<span class="dt">x =</span> <span class="dv">1</span><span class="op">:</span><span class="dv">3</span>,</span>
<span id="cb3-4"><a href="#cb3-4"></a> <span class="dt">size =</span> <span class="dv">1000</span>, <span class="dt">replace =</span> <span class="ot">TRUE</span>,</span>
<span id="cb3-5"><a href="#cb3-5"></a> <span class="dt">prob =</span> <span class="kw">c</span>(.<span class="dv">48</span>, <span class="fl">.48</span>, <span class="fl">.04</span>)) <span class="op">%>%</span><span class="st"> </span><span class="co"># arbitrary probability</span></span>
<span id="cb3-6"><a href="#cb3-6"></a><span class="st"> </span><span class="kw">set_vall</span>(<span class="dt">value_labels =</span> <span class="kw">c</span>(<span class="st">"Male"</span> =<span class="st"> </span><span class="dv">1</span>,</span>
<span id="cb3-7"><a href="#cb3-7"></a> <span class="st">"Female"</span> =<span class="st"> </span><span class="dv">2</span>,</span>
<span id="cb3-8"><a href="#cb3-8"></a> <span class="st">"Other"</span> =<span class="st"> </span><span class="dv">3</span>)) <span class="op">%>%</span></span>
<span id="cb3-9"><a href="#cb3-9"></a><span class="st"> </span><span class="kw">set_varl</span>(<span class="st">"Q1. Gender"</span>)</span></code></pre></div>
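<p>To see how closely the sampled values track the probabilities passed to <code>prob</code>, a quick base R check on the raw codes (without labels) might look like the sketch below. <code>gender_codes</code> and <code>gender_share</code> are illustrative names, and the exact proportions will depend on the seed.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(100)

# Same sampling call as above, minus the labelling steps
gender_codes <- sample(x = 1:3,
                       size = 1000,
                       replace = TRUE,
                       prob = c(.48, .48, .04))

# Share of each code in the sample; should sit close to .48, .48 and .04
gender_share <- prop.table(table(gender_codes))
round(gender_share, 3)</code></pre></div>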
<p>Now that we’ve got our ID variable and a basic grouping variable (gender), let’s also create some mock metric variables.</p>
<p>I want to create a 5-point scale KPI variable (which could represent <em>customer satisfaction</em> or <em>likelihood to recommend</em>). One way to do this is to simply run <code>sample()</code> again, and do the same thing we did for <code>v_gender</code>:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb4-1"><a href="#cb4-1"></a><span class="co">## KPI - #1 simple sampling</span></span>
<span id="cb4-2"><a href="#cb4-2"></a>v_kpi <-</span>
<span id="cb4-3"><a href="#cb4-3"></a><span class="st"> </span><span class="kw">sample</span>(<span class="dt">x =</span> <span class="dv">1</span><span class="op">:</span><span class="dv">5</span>,</span>
<span id="cb4-4"><a href="#cb4-4"></a> <span class="dt">size =</span> <span class="dv">1000</span>,</span>
<span id="cb4-5"><a href="#cb4-5"></a> <span class="dt">replace =</span> <span class="ot">TRUE</span>) <span class="op">%>%</span></span>
<span id="cb4-6"><a href="#cb4-6"></a><span class="st"> </span><span class="kw">set_vall</span>(<span class="dt">value_labels =</span> <span class="kw">c</span>(<span class="st">"Extremely dissatisfied"</span> =<span class="st"> </span><span class="dv">1</span>,</span>
<span id="cb4-7"><a href="#cb4-7"></a> <span class="st">"Somewhat dissatisfied"</span> =<span class="st"> </span><span class="dv">2</span>,</span>
<span id="cb4-8"><a href="#cb4-8"></a> <span class="st">"Neither"</span> =<span class="st"> </span><span class="dv">3</span>,</span>
<span id="cb4-9"><a href="#cb4-9"></a> <span class="st">"Satisfied"</span> =<span class="st"> </span><span class="dv">4</span>,</span>
<span id="cb4-10"><a href="#cb4-10"></a> <span class="st">"Extremely satisfied"</span> =<span class="st"> </span><span class="dv">5</span>)) <span class="op">%>%</span></span>
<span id="cb4-11"><a href="#cb4-11"></a><span class="st"> </span><span class="kw">set_varl</span>(<span class="st">"Q2. KPI"</span>)</span></code></pre></div>
<p>Whilst the above approach is straightforward, the downside is that the numbers will look completely random when we actually analyse the results. This is exactly what <code>sample()</code> is supposed to do, but it clearly isn’t ideal here.</p>
<p>I want to simulate numbers that are more realistic, i.e. data which will form a discernible pattern when grouping and summarising by gender. What I’ll therefore do is to iterate through each number in <code>v_gender</code>, and sample numbers based on the gender of the ‘respondent’.</p>
<p>The values that are passed below to the <code>prob</code> argument within <code>sample()</code> are completely arbitrary, but are designed to generate results where a bigger KPI value is more likely if <code>v_gender == 1</code>, followed by <code>v_gender == 3</code>, then <code>v_gender == 2</code>.</p>
<p>Note that I’ve used <code>map_dbl()</code> here (from the {purrr} package, part of {tidyverse}), which “loops” through <code>v_gender</code> and returns a numeric value for each iteration.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1"></a><span class="co">## KPI - #2 gender-dependent sampling</span></span>
<span id="cb5-2"><a href="#cb5-2"></a>v_kpi <-</span>
<span id="cb5-3"><a href="#cb5-3"></a><span class="st"> </span>v_gender <span class="op">%>%</span></span>
<span id="cb5-4"><a href="#cb5-4"></a><span class="st"> </span><span class="kw">map_dbl</span>(<span class="cf">function</span>(x){</span>
<span id="cb5-5"><a href="#cb5-5"></a> <span class="cf">if</span>(x <span class="op">==</span><span class="st"> </span><span class="dv">1</span>){</span>
<span id="cb5-6"><a href="#cb5-6"></a> <span class="kw">sample</span>(<span class="dv">1</span><span class="op">:</span><span class="dv">5</span>,</span>
<span id="cb5-7"><a href="#cb5-7"></a> <span class="dt">size =</span> <span class="dv">1</span>,</span>
<span id="cb5-8"><a href="#cb5-8"></a> <span class="dt">prob =</span> <span class="kw">c</span>(<span class="dv">10</span>, <span class="dv">17</span>, <span class="dv">17</span>, <span class="dv">28</span>, <span class="dv">28</span>)) <span class="co"># Sum to 100</span></span>
<span id="cb5-9"><a href="#cb5-9"></a> } <span class="cf">else</span> <span class="cf">if</span>(x <span class="op">==</span><span class="st"> </span><span class="dv">2</span>){</span>
<span id="cb5-10"><a href="#cb5-10"></a> <span class="kw">sample</span>(<span class="dv">1</span><span class="op">:</span><span class="dv">5</span>,</span>
<span id="cb5-11"><a href="#cb5-11"></a> <span class="dt">size =</span> <span class="dv">1</span>,</span>
<span id="cb5-12"><a href="#cb5-12"></a> <span class="dt">prob =</span> <span class="kw">c</span>(<span class="dv">11</span>, <span class="dv">22</span>, <span class="dv">28</span>, <span class="dv">22</span>, <span class="dv">17</span>)) <span class="co"># Sum to 100</span></span>
<span id="cb5-13"><a href="#cb5-13"></a></span>
<span id="cb5-14"><a href="#cb5-14"></a> } <span class="cf">else</span> {</span>
<span id="cb5-15"><a href="#cb5-15"></a> <span class="kw">sample</span>(<span class="dv">1</span><span class="op">:</span><span class="dv">5</span>,</span>
<span id="cb5-16"><a href="#cb5-16"></a> <span class="dt">size =</span> <span class="dv">1</span>,</span>
<span id="cb5-17"><a href="#cb5-17"></a> <span class="dt">prob =</span> <span class="kw">c</span>(<span class="dv">13</span>, <span class="dv">20</span>, <span class="dv">20</span>, <span class="dv">27</span>, <span class="dv">20</span>)) <span class="co"># Sum to 100</span></span>
<span id="cb5-18"><a href="#cb5-18"></a> }</span>
<span id="cb5-19"><a href="#cb5-19"></a> }) <span class="op">%>%</span></span>
<span id="cb5-20"><a href="#cb5-20"></a><span class="st"> </span><span class="kw">set_vall</span>(<span class="dt">value_labels =</span> <span class="kw">c</span>(<span class="st">"Extremely dissatisfied"</span> =<span class="st"> </span><span class="dv">1</span>,</span>
<span id="cb5-21"><a href="#cb5-21"></a> <span class="st">"Somewhat dissatisfied"</span> =<span class="st"> </span><span class="dv">2</span>,</span>
<span id="cb5-22"><a href="#cb5-22"></a> <span class="st">"Neither"</span> =<span class="st"> </span><span class="dv">3</span>,</span>
<span id="cb5-23"><a href="#cb5-23"></a> <span class="st">"Satisfied"</span> =<span class="st"> </span><span class="dv">4</span>,</span>
<span id="cb5-24"><a href="#cb5-24"></a> <span class="st">"Extremely satisfied"</span> =<span class="st"> </span><span class="dv">5</span>)) <span class="op">%>%</span></span>
<span id="cb5-25"><a href="#cb5-25"></a><span class="st"> </span><span class="kw">set_varl</span>(<span class="st">"Q2. KPI"</span>)</span></code></pre></div>
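<p>The if/else chain above can also be expressed as a lookup. The following is a base R sketch rather than a replacement for the code above: the list <code>kpi_probs</code> and the other object names are illustrative, and the weights are the same arbitrary ones used above. Note that <code>sample()</code> rescales <code>prob</code> internally, so the weights do not strictly need to sum to 100.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">set.seed(100)
gender_codes <- sample(1:3, size = 1000, replace = TRUE,
                       prob = c(.48, .48, .04))

# One weight vector per gender code (position 1 = Male, 2 = Female, 3 = Other)
kpi_probs <- list(
  c(10, 17, 17, 28, 28),
  c(11, 22, 28, 22, 17),
  c(13, 20, 20, 27, 20)
)

# Draw one KPI value per respondent, using the weights for their gender code
kpi_codes <- vapply(
  gender_codes,
  function(g) sample(1:5, size = 1, prob = kpi_probs[[g]]),
  FUN.VALUE = integer(1)
)

# Mean KPI by gender code
tapply(kpi_codes, gender_codes, mean)</code></pre></div>
<p>Keeping the weights in one list makes it easier to add or tweak a group without touching the sampling logic.</p>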
<p>To add a level of complexity, let me also simulate a mock NPS variable. One way to do this is to punch in arbitrary probabilities by hand, as was done above for <code>v_kpi</code>, but this would involve a lot more manual punching than is desirable for an 11-point scale NPS variable.</p>
<p>I will therefore instead write a custom function called <code>skew_inputs()</code> that ‘expands’ three arbitrary input numbers into 11 numbers, which will then serve as the probability anchors for my <code>sample()</code> functions later on.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb6-1"><a href="#cb6-1"></a><span class="co">## Generate skew inputs for sample probability</span></span>
<span id="cb6-2"><a href="#cb6-2"></a><span class="co">##</span></span>
<span id="cb6-3"><a href="#cb6-3"></a><span class="co">## `value1`, `value2` and `value3`</span></span>
<span id="cb6-4"><a href="#cb6-4"></a><span class="co">## generate the skewed probabilities</span></span>
<span id="cb6-5"><a href="#cb6-5"></a><span class="co">##</span></span>
<span id="cb6-6"><a href="#cb6-6"></a>skew_inputs <-<span class="st"> </span><span class="cf">function</span>(value1, value2, value3){</span>
<span id="cb6-7"><a href="#cb6-7"></a> </span>
<span id="cb6-8"><a href="#cb6-8"></a> all_n <-</span>
<span id="cb6-9"><a href="#cb6-9"></a><span class="st"> </span><span class="kw">c</span>(<span class="kw">rep</span>(value1, <span class="dv">7</span>), <span class="co"># 0 - 6</span></span>
<span id="cb6-10"><a href="#cb6-10"></a> <span class="kw">rep</span>(value2, <span class="dv">2</span>), <span class="co"># 7 - 8</span></span>
<span id="cb6-11"><a href="#cb6-11"></a> <span class="kw">rep</span>(value3, <span class="dv">2</span>)) <span class="co"># 9 - 10</span></span>
<span id="cb6-12"><a href="#cb6-12"></a> </span>
<span id="cb6-13"><a href="#cb6-13"></a> <span class="kw">return</span>(<span class="kw">sort</span>(all_n))</span>
<span id="cb6-14"><a href="#cb6-14"></a>}</span>
<span id="cb6-15"><a href="#cb6-15"></a></span>
<span id="cb6-16"><a href="#cb6-16"></a><span class="co">## Outcome KPI - NPS</span></span>
<span id="cb6-17"><a href="#cb6-17"></a>v_nps <-</span>
<span id="cb6-18"><a href="#cb6-18"></a><span class="st"> </span>v_gender <span class="op">%>%</span></span>
<span id="cb6-19"><a href="#cb6-19"></a><span class="st"> </span><span class="kw">map_dbl</span>(<span class="cf">function</span>(x){</span>
<span id="cb6-20"><a href="#cb6-20"></a> <span class="cf">if</span>(x <span class="op">==</span><span class="st"> </span><span class="dv">1</span>){</span>
<span id="cb6-21"><a href="#cb6-21"></a></span>
<span id="cb6-22"><a href="#cb6-22"></a> <span class="kw">sample</span>(<span class="dv">0</span><span class="op">:</span><span class="dv">10</span>, <span class="dt">size =</span> <span class="dv">1</span>, <span class="dt">prob =</span> <span class="kw">skew_inputs</span>(<span class="dv">1</span>, <span class="dv">1</span>, <span class="dv">8</span>))</span>
<span id="cb6-23"><a href="#cb6-23"></a></span>
<span id="cb6-24"><a href="#cb6-24"></a> } <span class="cf">else</span> <span class="cf">if</span>(x <span class="op">==</span><span class="st"> </span><span class="dv">2</span>){</span>
<span id="cb6-25"><a href="#cb6-25"></a></span>
<span id="cb6-26"><a href="#cb6-26"></a> <span class="kw">sample</span>(<span class="dv">0</span><span class="op">:</span><span class="dv">10</span>, <span class="dt">size =</span> <span class="dv">1</span>, <span class="dt">prob =</span> <span class="kw">skew_inputs</span>(<span class="dv">2</span>, <span class="dv">3</span>, <span class="dv">5</span>))</span>
<span id="cb6-27"><a href="#cb6-27"></a></span>
<span id="cb6-28"><a href="#cb6-28"></a> } <span class="cf">else</span> <span class="cf">if</span>(x <span class="op">==</span><span class="st"> </span><span class="dv">3</span>){</span>
<span id="cb6-29"><a href="#cb6-29"></a></span>
<span id="cb6-30"><a href="#cb6-30"></a> <span class="kw">sample</span>(<span class="dv">0</span><span class="op">:</span><span class="dv">10</span>, <span class="dt">size =</span> <span class="dv">1</span>, <span class="dt">prob =</span> <span class="kw">skew_inputs</span>(<span class="dv">1</span>, <span class="dv">3</span>, <span class="dv">6</span>))</span>
<span id="cb6-31"><a href="#cb6-31"></a></span>
<span id="cb6-32"><a href="#cb6-32"></a> } <span class="cf">else</span> {</span>
<span id="cb6-33"><a href="#cb6-33"></a></span>
<span id="cb6-34"><a href="#cb6-34"></a> <span class="kw">stop</span>(<span class="st">"Error - check x"</span>)</span>
<span id="cb6-35"><a href="#cb6-35"></a></span>
<span id="cb6-36"><a href="#cb6-36"></a> }</span>
<span id="cb6-37"><a href="#cb6-37"></a> }) <span class="op">%>%</span></span>
<span id="cb6-38"><a href="#cb6-38"></a><span class="st"> </span><span class="kw">set_varl</span>(<span class="st">"Q3. NPS"</span>)</span></code></pre></div>
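<p>To make the effect of those three input numbers more concrete, the sketch below works out the NPS that each set of weights implies in expectation. <code>expected_nps()</code> is a hypothetical helper written for this illustration and is not part of {surveytoolbox}. Because <code>skew_inputs()</code> sorts its output, the mapping of positions to scores only holds when <code>value1 <= value2 <= value3</code>, which is the case for all three calls here.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">skew_inputs <- function(value1, value2, value3){
  sort(c(rep(value1, 7), rep(value2, 2), rep(value3, 2)))
}

# Hypothetical helper: expected NPS implied by 11 weights for scores 0 to 10
expected_nps <- function(weights){
  p <- weights / sum(weights)   # rescale weights to probabilities
  promoters  <- sum(p[10:11])   # scores 9 and 10
  detractors <- sum(p[1:7])     # scores 0 to 6
  100 * (promoters - detractors)
}

expected_nps(skew_inputs(1, 1, 8))  # gender 1: about 36
expected_nps(skew_inputs(2, 3, 5))  # gender 2: about -13.3
expected_nps(skew_inputs(1, 3, 6))  # gender 3: about 20</code></pre></div>
<p>A simulated sample will of course wobble around these values, especially for the small “Other” group.</p>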
<p>Admittedly, the above procedure isn’t <em>minimal</em>, but this is a trade-off made in order to introduce some arbitrary patterns to the data. A ‘quick and dirty’ alternative simulation would simply be to run <code>sample(x = 0:10, size = 1000, replace = TRUE)</code> for <code>v_nps</code>.</p>
<p>There is one slight technicality: the so-called NPS question is strictly speaking a <em>likelihood to recommend</em> question which ranges from 0 to 10, and the <strong>Net Promoter Score</strong> itself is calculated on a recoded version of that question where <em>Detractors</em> (scoring 0 to 6) have to be coded as -100, <em>Passives</em> (scoring 7 to 8) as 0, and <em>Promoters</em> (scoring 9 to 10) as +100. The <strong>Net Promoter Score</strong> is simply calculated as a mean of those recoded values.</p>
<p>Fortunately, the {surveytoolbox} package comes shipped with an <code>as_nps()</code> function that does this recoding for you, and also automatically applies the value labels. Let’s call this new variable <code>v_nps2</code>:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1"></a><span class="co">## Outcome KPI - Recoded NPS (NPS2)</span></span>
<span id="cb7-2"><a href="#cb7-2"></a></span>
<span id="cb7-3"><a href="#cb7-3"></a>v_nps2 <-<span class="st"> </span><span class="kw">as_nps</span>(v_nps) <span class="op">%>%</span><span class="st"> </span><span class="kw">set_varl</span>(<span class="st">"Q3X. Recoded NPS"</span>)</span></code></pre></div>
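<p>For readers who want to see the recoding spelled out, here is a base R sketch of the rule described above. <code>recode_nps()</code> is a hypothetical helper for illustration only; it is not how <code>as_nps()</code> is implemented, and it does not attach any value labels.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical helper: recode 0-10 likelihood-to-recommend scores
recode_nps <- function(x){
  ifelse(x <= 6, -100,      # Detractors: 0 to 6
         ifelse(x <= 8, 0,  # Passives:   7 to 8
                100))       # Promoters:  9 to 10
}

scores <- c(0, 6, 7, 8, 9, 10)

recode_nps(scores)        # -100 -100 0 0 100 100
mean(recode_nps(scores))  # the Net Promoter Score for these six scores: 0</code></pre></div>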
</div>
<div id="combine-vectors" class="section level2">
<h2>Combine vectors</h2>
<p>Now that all the individual variables are set up, I can simply combine them all into a tibble in one swift movement<a href="#fn4" class="footnote-ref" id="fnref4"><sup>4</sup></a>:</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb8-1"><a href="#cb8-1"></a><span class="co">#### Combine individual vectors ####</span></span>
<span id="cb8-2"><a href="#cb8-2"></a>combined_df <-</span>
<span id="cb8-3"><a href="#cb8-3"></a><span class="st"> </span><span class="kw">tibble</span>(<span class="dt">id =</span> v_id,</span>
<span id="cb8-4"><a href="#cb8-4"></a> <span class="dt">gender =</span> v_gender,</span>
<span id="cb8-5"><a href="#cb8-5"></a> <span class="dt">kpi =</span> v_kpi,</span>
<span id="cb8-6"><a href="#cb8-6"></a> <span class="dt">nps =</span> v_nps,</span>
<span id="cb8-7"><a href="#cb8-7"></a> <span class="dt">nps2 =</span> v_nps2)</span></code></pre></div>
</div>
<div id="results" class="section level2">
<h2>Results!</h2>
<div class="figure">
<img src="https://media.giphy.com/media/IgLnqEAUh3XP6dagEk/giphy.gif" alt="" />
<p class="caption">image from Giphy</p>
</div>
<p>Let’s run a few checks on our dataset to confirm that everything has worked out okay.</p>
<p>The classic {dplyr} <code>glimpse()</code>:</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1"></a>combined_df <span class="op">%>%</span><span class="st"> </span><span class="kw">glimpse</span>()</span></code></pre></div>
<pre><code>## Observations: 1,000
## Variables: 5
## $ id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 1...
## $ gender <int+lbl> 2, 2, 1, 2, 2, 1, 1, 2, 1, 2, 1, 1, 2, 2, 1, 1, 2, 2, 2,...
## $ kpi <dbl+lbl> 3, 2, 5, 2, 5, 5, 2, 2, 5, 3, 3, 4, 1, 5, 4, 4, 4, 1, 2,...
## $ nps <dbl> 10, 5, 10, 8, 10, 9, 7, 5, 2, 5, 1, 4, 5, 9, 9, 10, 9, 3, 10...
## $ nps2 <dbl+lbl> 100, -100, 100, 0, 100, 100, 0, -100, -100, -100, -100, ...</code></pre>
<p>Then <code>head()</code> to see the first six rows:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1"></a>combined_df <span class="op">%>%</span><span class="st"> </span><span class="kw">head</span>()</span></code></pre></div>
<pre><code>## # A tibble: 6 x 5
## id gender kpi nps nps2
## <int> <int+lbl> <dbl+lbl> <dbl> <dbl+lbl>
## 1 1 2 [Female] 3 [Neither] 10 100 [Promoter]
## 2 2 2 [Female] 2 [Somewhat dissatisfied] 5 -100 [Detractor]
## 3 3 1 [Male] 5 [Extremely satisfied] 10 100 [Promoter]
## 4 4 2 [Female] 2 [Somewhat dissatisfied] 8 0 [Passive]
## 5 5 2 [Female] 5 [Extremely satisfied] 10 100 [Promoter]
## 6 6 1 [Male] 5 [Extremely satisfied] 9 100 [Promoter]</code></pre>
<p>So it appears that the value labels have been properly attached, and the range of values is what we’d expect. Now what about the “fake patterns”?</p>
<p>Looking at the topline result of the data, we seem to have succeeded in fabricating some sensible patterns in the data. It appears that this company X will need to work harder at winning over its female customers, who have rated them lower on two KPI metrics:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1"></a>combined_df <span class="op">%>%</span></span>
<span id="cb13-2"><a href="#cb13-2"></a><span class="st"> </span><span class="kw">group_by</span>(gender) <span class="op">%>%</span></span>
<span id="cb13-3"><a href="#cb13-3"></a><span class="st"> </span><span class="kw">summarise</span>(<span class="dt">n =</span> <span class="kw">n_distinct</span>(id),</span>
<span id="cb13-4"><a href="#cb13-4"></a> <span class="dt">kpi =</span> <span class="kw">mean</span>(kpi),</span>
<span id="cb13-5"><a href="#cb13-5"></a> <span class="dt">nps2 =</span> <span class="kw">mean</span>(nps2))</span></code></pre></div>
<pre><code>## # A tibble: 3 x 4
## gender n kpi nps2
## <int+lbl> <int> <dbl> <dbl>
## 1 1 [Male] 490 3.49 31.0
## 2 2 [Female] 464 3.07 -8.62
## 3 3 [Other] 46 3.15 17.4</code></pre>
</div>
<div id="check-the-labels" class="section level2">
<h2>Check the labels 🏷🏷🏷</h2>
<p>Finally I’d like to share a couple of functions that enable you to explore the labels in a labelled dataset. <code>surveytoolbox::varl_tb()</code> accepts a labelled data frame, and returns a two-column data frame with the variable name and its corresponding variable label:</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1"></a>combined_df <span class="op">%>%</span><span class="st"> </span><span class="kw">varl_tb</span>()</span></code></pre></div>
<pre><code>## # A tibble: 5 x 2
## var var_label
## <chr> <chr>
## 1 id Record Identifier
## 2 gender Q1. Gender
## 3 kpi Q2. KPI
## 4 nps Q3. NPS
## 5 nps2 Q3X. Recoded NPS</code></pre>
<p><code>surveytoolbox::data_dict()</code> takes this further, and also shows the value labels as a third column. This is effectively what’s typically referred to as a <strong>code frame</strong> in a market research context:</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb17-1"><a href="#cb17-1"></a>combined_df <span class="op">%>%</span></span>
<span id="cb17-2"><a href="#cb17-2"></a><span class="st"> </span><span class="kw">select</span>(<span class="op">-</span>id) <span class="op">%>%</span></span>
<span id="cb17-3"><a href="#cb17-3"></a><span class="st"> </span><span class="kw">data_dict</span>()</span></code></pre></div>
<pre><code>## var label_var
## 1 gender Q1. Gender
## 2 kpi Q2. KPI
## 3 nps Q3. NPS
## 4 nps2 Q3X. Recoded NPS
## label_val
## 1 Male; Female; Other
## 2 Extremely dissatisfied; Somewhat dissatisfied; Neither; Satisfied; Extremely satisfied
## 3
## 4 Detractor; Passive; Promoter; Missing value
## value
## 1 1; 2; 3
## 2 1; 2; 3; 4; 5
## 3 0; 1; 2; 3; 4; 5; 6; 7; 8; 9; 10
## 4 -100; 0; 100</code></pre>
<p>I would also highly recommend the <code>view_df()</code> function from {sjPlot}, which exports a similar overview of variables and labels in a nicely formatted HTML table. For huge labelled datasets, this offers a fantastic light-weight way to browse through your variables and labels.</p>
<div class="sourceCode" id="cb19"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb19-1"><a href="#cb19-1"></a>combined_df <span class="op">%>%</span><span class="st"> </span>sjPlot<span class="op">::</span><span class="kw">view_df</span>()</span></code></pre></div>
<p>Once we’ve checked all the labels and we’re happy with everything, we can then export our dataset with <code>haven::write_sav()</code>! If everything’s worked properly, all the labels should appear properly if you choose to open your example dataset in SPSS, or Q:</p>
<div class="sourceCode" id="cb20"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb20-1"><a href="#cb20-1"></a>combined_df <span class="op">%>%</span><span class="st"> </span>haven<span class="op">::</span><span class="kw">write_sav</span>(<span class="st">"Simulated Dataset.sav"</span>)</span></code></pre></div>
</div>
<div id="end-notes" class="section level2">
<h2>End notes</h2>
<p>I hope you’ve found this vignette useful!</p>
<p>If you ever get a chance to try out <a href="https://www.github.com/martinctc/surveytoolbox">{surveytoolbox}</a>, I would really appreciate it if you could submit any <a href="https://github.com/martinctc/surveytoolbox/issues">issues/feedback on GitHub</a>, or get in touch with me directly. I’m looking for collaborators to make the package more user-friendly and powerful, so if you’re interested, please don’t be shy and give me a shout! 😄</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Check out <a href="https://community.rstudio.com/t/faq-whats-a-reproducible-example-reprex-and-how-do-i-do-one/5219">this RStudio Community thread</a> to learn more about <strong>reprex</strong> (the portmanteau <em>reprex</em> is coined by <a href="https://twitter.com/romain_francois/status/530011023743655936">Romain Francois</a>)<a href="#fnref1" class="footnote-back">↩︎</a></p></li>
<li id="fn2"><p>If you’re not familiar with this concept / approach, I’d recommend checking out <a href="https://stackoverflow.com/questions/13605271/reasons-for-using-the-set-seed-function">this Stack Overflow thread</a>.<a href="#fnref2" class="footnote-back">↩︎</a></p></li>
<li id="fn3"><p>For those who are more ambitious, I would recommend checking out the <a href="https://cran.r-project.org/web/packages/uuid/index.html">{uuid} package</a> for generating proper GUIDs (Globally Unique Identifier). However, this then wouldn’t be <em>minimal</em>, so I would just stick with running a simple <code>seq()</code> sequence.<a href="#fnref3" class="footnote-back">↩︎</a></p></li>
<li id="fn4"><p>I shouldn’t need to footnote this, but here’s a Rocky Flintstone tribute for any Belinkers out there. 🤣<a href="#fnref4" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</section>Martin ChanData Chats: An Interview on Data-Driven Campaigns, Bias & Ethics2020-04-27T00:00:00+00:002020-04-27T00:00:00+00:00https://martinctc.github.io/blog/data-chats-an-interview-on-data-driven-campaigns-bias-and-ethics<script src="https://martinctc.github.io/blog/knitr_files/data-chats-analytics-and-politics_an-interview_20200427_files/header-attrs-2.1.1/header-attrs.js"></script>
<section class="main-content">
<div id="background" class="section level2">
<h2>Background</h2>
<p>One of the motives for starting the <em>Data Chats</em> interview series was to shed light on the many ways in which data and analytics professionals operate across different fields and cultures. Previously, <a href="https://martinctc.github.io/blog/data-chats-an-interview-with-avision-ho/">Avision Ho</a> (Senior Data Scientist at the British Department for Education at the time) and <a href="https://martinctc.github.io/blog/data-chats-from-physics-student-to-data-science-consultant/">Abhishek Modi</a> (Data Science Consultant at Deloitte at the time) described the data science career <em>journey</em> and answered technology-specific questions (e.g. favourite R packages). So I thought I’d do an interview on how analytics is applied in a very different, yet important, setting: politics.</p>
<p>This time, I have the pleasure of speaking with <a href="https://www.linkedin.com/in/treshan">Christopher Treshan Perera</a> and <a href="https://de.linkedin.com/in/da-nanthida-rakwong-27784915a">Nanthida Rakwong</a> from <a href="https://worldacquire.com/">Worldacquire</a>, a digital consultancy with politics as a core practice area. They launched with a mission to use analytics and digital tech in <em>political campaigns</em>, <em>public affairs</em> and <em>human rights</em>. They notably managed campaigns at the <em>2019 Thailand general election</em> and the <em>2019 Hong Kong District Council elections</em>; at the latter, they <strong>helped a pro-democracy candidate defeat a long-standing incumbent</strong>. Their co-founders have spoken at the <em>United Nations</em> and the <em>UK Parliament</em> on ethical issues in technology and society<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a> and managed research for <em>Konrad-Adenauer-Stiftung</em>, Germany’s governing party think tank.</p>
<p>In this in-depth interview, Christopher and Nanthida discuss how they navigate analytics and politics, challenges they encountered (e.g. how to obtain reliable data in Thailand), ethical questions (Cambridge Analytica, GDPR) and other practical considerations.</p>
<p><img src="https://images.unsplash.com/photo-1558685203-2c1f7ee563c3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
</div>
<div id="the-interview" class="section level2">
<h2>The Interview</h2>
<div id="m-its-great-to-have-you-guys-here.-tell-me-a-bit-about-worldacquire-and-your-journey-to-bringing-together-analytics-and-politics" class="section level4">
<h4>M: It’s great to have you guys here. Tell me a bit about Worldacquire and your journey to bringing together analytics and politics!</h4>
<p>C: We launched two years ago to explore how AI and data-driven technology could enhance democracy - whether by helping aspiring politicians win elections or by supporting research and campaigning efforts of parties and public organisations.</p>
<p>N: I grew up in Thailand, but moved to London in 2010 to work as a political consultant for <em>Amsterdam and Partners</em>, an international law firm representing political figures. My most important case was in Thai politics - <a href="https://www.spiegel.de/international/world/red-shirt-lawyer-to-file-complaint-with-icc-thai-pm-gave-carte-blanche-order-to-massacre-civilians-a-742653.html">bringing the Thai junta (military government) to justice</a> at the <em>International Criminal Court (ICC)</em> over the <a href="https://amsterdamandpartners.com/white-papers/thailand-the-bangkok-massacres-a-call-for-accountability/">2010 Bangkok Massacres</a>. I also advised the “Red Shirts” pro-democracy movement in Thailand and various other political parties around the world.</p>
<p><img src="https://live.staticflickr.com/4584/38794179861_8c07bb0c0f_b.jpg" /></p>
<p>C: I have a more corporate and techie background. My career started as a data analyst at <em>Bloomberg</em> in London, followed by several years at <em>viagogo</em>, an online marketplace for sports and show tickets, where I rose from digital marketing exec to global advertising management. It is common to wear many hats at a tech startup and mine included data analysis, business intelligence, product management and algorithm design for marketing APIs. Then I moved to <em>American Express</em> where my task was to transfer digital marketing know-how from the tech world to “big finance”.</p>
<p>Before Worldacquire I already wanted to connect tech with social causes. Back in 2015 I founded <a href="https://outreachdigital.org/">Outreach Digital</a>, an entirely volunteer-run association making digital skills more accessible; it has since become the <a href="https://www.meetup.com/Digital/">largest meetup group</a> in London’s tech and digital space.</p>
<p><img src="https://pbs.twimg.com/media/D2l0lc8XQAAtxYN?format=png&name=small" /></p>
</div>
<div id="m-thats-a-great-fusion-of-politics-and-digital-marketing.-also-chris---we-first-met-at-one-of-those-meetup-groups-which-makes-a-great-case-for-their-networking-value.-so-what-exactly-got-you-thinking-wow-it-would-be-great-to-merge-analytics-and-political-consulting" class="section level4">
<h4>M: That’s a great fusion of politics and digital marketing. Also, Chris - we first met at one of those meetup groups, which makes a great case for their networking value. So what exactly got you thinking: wow, it would be great to merge analytics and political consulting?</h4>
<p>N: While I was advising the Thai pro-democracy movement, I realized how crucial it was to understand a situation in real-time in order to make better decisions. By collecting a lot of data, compiling and analysing it faster, we could operate in a more efficient and scalable way. That led to my interest in data, analytical technologies and AI, and I immediately saw their dangers, too: the <a href="https://www.wired.com/amp-stories/cambridge-analytica-explainer/">Cambridge Analytica scandal</a>, for example, was a misuse of those technologies. But you can change something for the better only if you engage and participate in shaping it.</p>
</div>
<div id="m-using-ai-and-data-to-make-better-decisions---could-you-describe-how-you-can-do-that-in-politics-how-do-other-organizations-do-that-today" class="section level4">
<h4>M: Using AI and data to make better decisions - could you describe how you can do that in politics? How do other organizations do that today?</h4>
<p>N: Let’s say you want to understand millions of people including your potential voters. By gathering data from multiple sources, including Facebook, Twitter, forums, emails and more, you can find new paths to make everyone work together towards the same goal. If you want to run a political campaign of any kind but don’t understand what exactly people want, it’s very hard to bring them together.</p>
<p>Moreover, you always need to identify new supporters. There may be people who are unsure about your cause or movement - perhaps they are friends of your ardent supporters - but they hesitate to join because they don’t understand it well enough. You can use AI and data technology to understand what they need and how to best communicate with them.</p>
<p><img src="https://images.unsplash.com/photo-1553268169-8232852a2377?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
<p>C: One of the things Cambridge Analytica did was to uncover a blind spot in the political “market share” or electorate and target those swing voters (and people who never wanted to vote) using unethical advertising methods, including fake news.</p>
<p>This raises the question of whether the technology can be used in an ethical and transparent way. If everyone was aware of the practices, if parties or candidates communicated transparently about them, then perhaps we’d have a different situation even in the UK right now. The other consideration is which specific AI and analytics methods to apply: you could implement recommendation systems, pattern recognition techniques or combine existing methods.</p>
<p>N: And use them to spread truth and support good causes, rather than rumours, disinformation and division.</p>
<p><img src="https://images.unsplash.com/photo-1572356722933-adf495627701?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
</div>
<div id="m-so-you-do-the-opposite-of-what-cambridge-analytica-did-by-using-the-technologies-in-a-more-ethical-and-transparent-way" class="section level4">
<h4>M: So you do the opposite of what Cambridge Analytica did by using the technologies in a more ethical and transparent way?</h4>
<p>N: That’s right. If we do not engage in this game, we basically allow malicious or unethical players to misuse it. If we withdraw we cannot help shape new solutions and perspectives.</p>
<p>C: Think practically! Advertising has a very long history - newspaper, TV, radio and billboards. Over the decades every new medium got increasingly regulated, but advertising still exists and it has arguably become more transparent and ethical. <strong>If traditional advertising channels can improve over time, so can digital advertising</strong>.</p>
</div>
<div id="m-you-also-believe-that-by-using-these-techniques-for-a-good-cause-we-can-make-our-democratic-systems-more-resilient-and-less-prone-to-being-abused" class="section level4">
<h4>M: You also believe that by using these techniques for a good cause, we can make our democratic systems more resilient and less prone to being abused?</h4>
<p>C&N: Correct! Moreover, they can also be used to <a href="https://worldacquire.com/2020/02/14/thai-mass-shooting-a-case-for-microtargeting-in-emergencies/">improve public services and government-to-citizen communications, especially during a crisis</a>.</p>
</div>
<div id="m-it-is-interesting-that-you-mention-the-blind-spot-in-the-market-share.-the-common-wisdom-is-that-the-results-of-political-campaigning-remain-unknown-until-after-the-election.-political-polling-is-known-to-be-inaccurate.-transparency-could-be-a-game-changer." class="section level4">
<h4>M: It is interesting that you mention the blind spot in the market share. The common wisdom is that the results of political campaigning remain unknown until after the election. Political polling is known to be inaccurate. Transparency could be a game-changer.</h4>
</div>
<div id="m-now-lets-talk-about-the-election-campaign-that-you-guys-managed-in-thailand.-could-you-tell-me-more-about-it" class="section level4">
<h4>M: Now let’s talk about the election campaign that you guys managed in Thailand. Could you tell me more about it?</h4>
<p>N: First of all, it was <a href="https://www.bbc.com/news/world-asia-47664201">a long-awaited election</a> because Thailand has been ruled by an authoritarian military junta since 2014. We advised a first-time candidate from a new party standing for MP in Bangkok. Candidates were given only six weeks to campaign - a major challenge considering this was one of the largest constituencies.</p>
<p>Without any data to start with, we went to the local administration office to request the electoral register, but they essentially refused to share anything. We suspected this happened because of the deep influence of the incumbent. We had to think differently!</p>
<p><img src="https://images.unsplash.com/photo-1571467046329-3ae3dc7430da?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
<p>We could have started canvassing (surveying and campaigning) door to door, but this would have been an issue for several reasons: firstly, Thai people don’t vote based on where they reside, but on where their home is registered. Thus, people who live in a constituency may not have the right to vote there. Secondly, the dangerous political climate and <a href="https://www.hrw.org/report/2019/10/24/speak-out-dangerous/criminalization-peaceful-expression-thailand">Thailand’s harsh censorship laws</a> made people extremely wary of sharing their political views or past voting behaviour.</p>
<p>Another option was to collect data and communicate online through social media advertising. Unfortunately the party leadership wasn’t willing to invest in it. They preferred to play it safe and spend money on leaflets and billboards instead.</p>
<p>We ultimately went for <a href="https://worldacquire.com/2019/04/11/advancing-democracy-with-digital-at-the-2019-thailand-election/">a sampling method using <strong>fieldwork mapping</strong></a>: we divided the constituency into smaller areas based on the polling stations that cover them and interviewed a sample of people in each area (this was still challenging considering how wary people were!) and built our understanding of the overall constituency based on data. We facilitated this by using <a href="https://www.mela.work/">an app called <em>Mela</em></a>.</p>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/Da-Thai-Campaign.jpg" /><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/Worldacquire-client-MELA.png" alt="Mela" /></p>
<p>C: We didn’t use any advanced AI magic in this instance, but the project highlighted how important it is to get the right data and to ensure that the initial dataset is clean. <strong>What data scientists often do is go on the internet and pick whatever datasets are available online - but these are often outdated, incomplete or even biased</strong>.</p>
<p>Especially in a developing country with poor accountability and no balance of powers, it is hard to verify if research and survey data is correct. You have to go hands-on and create the conditions for people to share their genuine views. Once the data is accurate, you can start doing advanced stuff.</p>
<p>N: We would certainly have received more valuable insights had the party leadership approved social media advertising. It takes more effort to measure the conversion rate of leaflets and billboards, especially if you only have six weeks to campaign.</p>
<p>C: Nonetheless we were able to run a smaller-scale test and get accurate social insights. Ideally, we would have gathered enough data to run prediction and network algorithms to evaluate the profile, behaviour and preferences of each constituency sub-area, and thereby understand which political issues mattered to them the most. Then we could have tailored the messages that would best resonate with each group. Despite the small amount of data, it was enough to draw some important conclusions about the potential voters.</p>
<p><img src="https://images.unsplash.com/photo-1507611268508-bf74edce9029?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
</div>
<div id="m-why-was-the-party-leadership-reluctant-to-invest-in-social-media" class="section level4">
<h4>M: Why was the party leadership reluctant to invest in social media?</h4>
<p>C: It was mainly about the social media <em>advertising</em> budget.</p>
<p>N: We did have social media (everyone in Thailand does) but without proper <em>advertising</em> campaigns aimed at targeted data collection and communication. Most politicians in Thailand only engage in one-way communication and are not interested in understanding their audience.</p>
<p>C: When I used to work in tech and finance, I spent a lot of time with Google Ads and Facebook Ads. A major lesson that applies here, too, is that you can post social and blog content to build an online presence and get visitor traffic data over time. But <strong>when you face tight time constraints, the data you can get through online advertising is significantly more valuable</strong>; all activity is tracked, timely, more relevant and more accurate than anything you can get through “organic” or “earned” marketing. The same goes for building an audience in such a short time. That is simply how digital platforms work today.</p>
<p>Many colleagues in digital marketing will confess that it can take weeks (or months) to build a robust presence, let alone become “number one” on Google or Facebook results. Unless you are really lucky and go “viral”.</p>
<p>So when you do an election campaign, when you only have six weeks and you’re a newcomer challenging powerful incumbents, it is unwise to rely on “organic” or “earned” marketing. Paid advertising gives you immediate data and immediate results.</p>
<p><img src="https://images.unsplash.com/photo-1579869847514-7c1a19d2d2ad?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
</div>
<div id="m-it-really-makes-sense-when-facing-such-time-constraints-so-its-a-shame-that-this-is-underutilized-in-election-campaigns.-chris-what-is-the-biggest-difference-between-digital-advertising-in-a-commercial-vs-a-political-setting" class="section level4">
<h4>M: It really makes sense when facing such time constraints, so it’s a shame that this is underutilized in election campaigns. Chris, what is the biggest difference between digital advertising in a commercial vs a political setting?</h4>
<p>C: In business you have more wiggle room in your choice of message. You can use superlatives like “best tickets” or “best concerts”. In politics people do that, too, but it can be dangerous; think about the Brexit bus with the exaggerated claim about the NHS money. There will also be very different budgets due to both internal and external factors. Businesses are often more willing to invest in advertising as it leads to direct and immediate sales. In politics, the “sales cycle” can take much longer.</p>
<p>N: In many countries, including Thailand, there’s also a legal cap on campaigning spend.</p>
<p>C: Another issue is visibility. <strong>Most digital platforms decide which content to display based on ranking algorithms</strong>; <strong>one factor that influences those algorithms is pre-existing activity and performance</strong>. For example, if you want to advertise a car on Google, and you build a Google Ads campaign around the keyword “car”, Google’s algorithms will already know that this is something businesses want to advertise based on historic performance data. The algorithms will also know that people click on those ads after searching “car”. On the contrary, if a keyword was never used before or isn’t typically associated with people clicking on ads, <a href="https://support.google.com/google-ads/answer/2616014?hl=en">Google will wait a little longer before displaying ads for it</a>. So there can be some delays before an ad for a new politician is actually visible, but it is still faster than trying to get a blog post to go viral.</p>
</div>
<div id="m-so-theres-both-a-legal-and-a-search-engine-strategy-aspect.-how-about-analytics-in-business-vs-politics" class="section level4">
<h4>M: So there’s both a legal and a search engine strategy aspect. How about analytics in business vs politics?</h4>
<p>C: There are many similarities. You just need to translate a concept from one field into another. In business, KPIs and metrics are formed around impressions, actions (sales), CTR and conversion rates. In politics it’s more about long-term performance, maybe along the lines of CLV (customer lifetime value).</p>
<p>N: Having many clicks on your ad or post doesn’t necessarily translate into votes. It could even be negative - think about Prince Andrew!</p>
</div>
<div id="m-what-were-your-top-challenges-at-the-thai-election" class="section level4">
<h4>M: What were your top challenges at the Thai election?</h4>
<p>N: The lack of accurate data and the leadership’s poor awareness of the importance of data - especially current or real-time data. People still rely on old reports and outdated information.</p>
</div>
<div id="m-what-was-the-outcome-of-the-election-campaign" class="section level4">
<h4>M: What was the outcome of the election campaign?</h4>
<p>N: Our candidate lost, but exceeded our expectations. More importantly, the winning candidate was from the <em>Future Forward Party</em>, another new and allied party that <em>did</em> invest significantly in its social media at a national level. We observed that they also used tailored, targeted advertising and A/B-testing to gather data about voter preferences. The political party’s image really helped that candidate. Like our candidate, he was not a resident of the constituency yet still won. This was the very first time that anyone used social media as a key channel for data collection in an election campaign in Thailand.</p>
<p>C: Indeed, this was forward-thinking. <strong>Many political campaigners around the world use social media, but don’t make the most of its advanced algorithms and data-gathering capabilities</strong>. Considering the difficulties we had in accessing data, one of the biggest learnings is that even in the face of authoritarian red tape and bureaucracy, digital platforms can help overcome hurdles in understanding your audience.</p>
<p><img src="https://images.unsplash.com/photo-1506801310323-534be5e7a730?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
</div>
<div id="m-lets-talk-about-regulation-and-ethics.-is-data-is-the-new-oil-a-valuable-resource-that-helps-society-progress" class="section level4">
<h4>M: Let’s talk about regulation and ethics. Is data the new oil? A valuable resource that helps society progress?</h4>
<p>C: Yes, it can be - depending on which society it will be used in. If a society has strong data protection and privacy laws, then data can be used hand-in-hand with democratic principles. If not, then it can be a very “bad oil”.</p>
<p>Looking back in time, radio was “the new oil” at some point. TV, too. From a regulatory point of view, they all provoked concerns (including about propaganda), but over time different bodies and regulations were formed to address them, such as today’s ASA in the UK; and now for data and digital technologies, we have the ICO in the UK.</p>
</div>
<div id="m-what-is-your-position-regarding-gdpr-also-when-people-think-about-applying-analytics-in-political-advertising-many-worry-about-cambridge-analytica-and-brexit.-obviously-you-want-to-promote-good-causes-and-democracy---but-can-you-avoid-repeating-their-mistakes" class="section level4">
<h4>M: What is your position regarding GDPR? Also, when people think about applying analytics in political advertising, many worry about Cambridge Analytica and Brexit. Obviously you want to promote good causes and democracy - but can you avoid repeating their mistakes?</h4>
<p>N: Let’s start with Obama’s presidential campaign. Obama used similar tactics, but what did he do differently? He informed people about how their data would be used. So transparency is really a key factor.</p>
<p><img src="https://images.unsplash.com/photo-1541872703-74c5e44368f9?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
<p>C: I think there can be some misunderstanding that influences the public’s perception, but also that of advertisers. Some advertisers (including businesses and politicians) can see GDPR as a nuisance, when in fact it is an opportunity. <strong>In the same way that media regulation helped build trust in radio and TV, GDPR and future internet or data protection laws will increase trust in digital spaces</strong>. Once people feel safe about the new media, they will feel even more comfortable using the platforms and sharing their data.</p>
<p>When people know that there is a rule of law and accountability around data, they will feel less suspicious than with the dodgy websites today that say very little about data protection. At the same time, <a href="https://worldacquire.com/2018/03/24/cambridge-analytica-facebook-european-data-protection-law/">GDPR will push companies - including advertisers - to design their products, solutions and activities around privacy</a>. This should help encourage ethical uses of this “new oil”.</p>
<p>As a matter of fact, there will be an increasing number of new regulations that will touch upon different issues in the realm of data-driven products, notably in AI. <a href="https://en.wikipedia.org/wiki/Algorithmic_bias">Algorithmic bias</a>, for instance, is one hot topic at the moment. Since AI is based on algorithms that learn from historical data, that data could reinforce existing social biases - whether it’s about predicting who will be the next criminal, or who deserves a visa or insurance. How to solve this? Regulation can and needs to answer this, whether by requiring AI products to make their program code public or by creating mechanisms to prevent or defeat the bias.</p>
<p>N: People need to understand how software works. Once again, transparency is vital. <strong>Many of the issues around algorithms and the misperceptions of AI being dangerous come down to the fact that the technical issues are not properly explained</strong>. When AI experts and advocates come together and make an effort to ensure that biases and manipulation are eliminated from these algorithms, that’s when products and solutions based on AI can be more ethical and transparent.</p>
<p><img src="https://images.unsplash.com/photo-1580894912989-0bc892f4efd0?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
</div>
<div id="m-its-important-to-understand-where-all-the-data-people-provide-goes-and-i-believe-theres-also-the-question-of-consent-which-is-covered-by-gdpr.-what-i-find-really-interesting-is-algorithmic-bias-but-also-the-idea-that-people-can-better-judge-what-is-right-and-wrong-only-after-they-are-educated-about-how-the-algorithm-works." class="section level4">
<h4>M: It’s important to understand where all the data people provide goes and I believe there’s also the question of consent, which is covered by GDPR. What I find really interesting is algorithmic bias, but also the idea that people can better judge what is right and wrong only after they are educated about how the algorithm works.</h4>
<p>C: Exactly, and this is what regulators are starting - and <em>should</em> be starting - to think about.</p>
<p>There are some who think that political advertising should be completely banned. <a href="https://techcrunch.com/2019/11/15/twitter-makes-its-political-ad-ban-official/">Twitter went down this road</a> and many applauded their decision, with the main perception being that data and algorithms can be misused the way Cambridge Analytica did. <strong>However, what was ignored was the fact that all these algorithms don’t <em>need</em> to work in a black box - in fact, they can be revealed, changed, overridden</strong>.</p>
<p>For example, a recommendation system could, instead of saying “A is better than B”, explain “We recommend A over B because our algorithm observed that you like x, y and z.” Companies may be reluctant to fully reveal their algorithm code, but <strong>they could at least give an idea of what parameters are taken into account, what outcomes can be expected, and why</strong>. Once again, transparency is key.</p>
<p>Another aspect ignored in the whole debate about political advertising is the fact that if digital platforms like Twitter ban them, newcomer politicians will struggle to gain a following or communicate with their target audience if they have a time constraint, like in the case of Thailand. <a href="https://www.brookings.edu/blog/techtank/2020/01/08/twitters-ban-on-political-advertisements-hurts-our-democracy/">Banning political ads carries huge disadvantages</a>, especially for politicians who already lack resources.</p>
<p>The right approach would have been to push for more transparency - not only in advertising (“paid”) algorithms, but also in the “organic” and “earned” algorithms used on the very same social media platforms.</p>
<p>N: Moreover, Twitter appears not to care as much as they say about what content is posted on the platform. If they can ban political ads, why do they do so little against hate speech and other online harms? <strong>Many, including myself, have experienced harassment and public death threats on Twitter, yet Twitter refused to cooperate swiftly and proactively with the British police - instead they shifted the burden of proof to the victims</strong>.</p>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/ChrisatUNIGF.PNG" /></p>
<hr />
</div>
<div id="m-it-is-also-very-unclear-how-they-will-actually-implement-the-ban." class="section level4">
<h4>M: It is also very unclear how they will actually implement the ban.</h4>
<p>N: I honestly think it’s dangerous that Jack Dorsey (the CEO of Twitter) calls for “earned” popularity and encourages a culture of going viral. <strong>If a tweet goes viral does it really mean that it “earned” it? Is it really more accurate and correct than other tweets? More than a few times, a viral message has spread false accusations and fake news</strong>.</p>
<p>C: To make things worse, <strong>going viral is <em>also</em> determined by algorithms</strong>. How does Twitter decide which content should get more visibility? Is it the likes, the retweets, the popularity of the tweet author and their followers? We have seen (and tested) how easily this can be manipulated.</p>
<p>So banning political ads doesn’t solve the problem of powerful obscure algorithms, as organic content is decided by even less transparent ones! Typically, such algorithms seem to favour users who already have a strong following - in the case of an election, this is often the incumbent. Newcomers can be heavily penalized by this dynamic.</p>
<p>Another issue is <strong>fake users and bots on Twitter</strong>. These can be bought in their thousands to mass-like or mass-follow and exaggerate the popularity of a particular user or a post. Equally, <strong>your competitor could get 10,000 Twitter bots or fake users to report your public posts as spam or abusive</strong>. <a href="https://worldacquire.com/2020/02/27/online-disinformation-and-extremism-how-it-spreads-and-how-to-stop-it/">This is an easy way to manipulate the system (also tested in Thailand)</a>. The Twitter algorithm will likely disregard the fact that the users are fake and make the falsely reported tweet disappear even if it is genuine and popular.</p>
<p><img src="https://images.unsplash.com/photo-1573152143286-0c422b4d2175?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
</div>
<div id="m-this-is-something-i-have-seen-in-hong-kong-too.-do-you-think-censorship-is-a-solution" class="section level4">
<h4>M: This is something I have seen in Hong Kong, too. Do you think censorship is a solution?</h4>
<p>C: Did TVs ban political content? Do billboards ban political ads? Not really, so we really don’t think that banning political ads is a solution.</p>
<p>N: It should be more about regulating content, what is not OK to say.</p>
</div>
<div id="m-how-do-you-differentiate-yourselves-from-other-consultancies-or-agencies-that-do-the-same-or-similar-things-as-you" class="section level4">
<h4>M: How do you differentiate yourselves from other consultancies or agencies that do the same or similar things as you?</h4>
<p>N: Firstly, aside from working on the big picture strategy we actually also implement it. Working hands-on gives us a much more tangible picture of the dynamics, limitations and issues that could be faced.</p>
<p>C: While academics, researchers and thought leaders jump enthusiastically to praise Jack Dorsey for banning political ads on Twitter, a practitioner who has personally set up digital campaigns in politics will tell you how disastrous the effects of his decision can be. Our USP is that we work on the ground and thoroughly understand technical implications and their consequences.</p>
</div>
<div id="m-this-ties-back-with-what-you-mentioned-about-blind-spots.-being-on-the-ground-helps-you-spot-them-too-right" class="section level4">
<h4>M: This ties back to what you mentioned about ‘blind spots’. Being on the ground helps you spot them, too, right?</h4>
<p>N: Correct, and it also helps with finding alternative solutions in case Plan A didn’t work in the first place.</p>
</div>
<div id="m-what-is-your-vision-for-your-business" class="section level4">
<h4>M: What is your vision for your business?</h4>
<p>C: We truly believe that AI, data and digital technologies can be used for good causes - and we want to show that this is true and applies anywhere in the world. It is also important for people to understand the uses of these technologies and the actors who control them. We want to help people understand both sides of the coin. Many governments and NGOs have digital and data on their agendas, but often seem to have a very superficial sense of the technologies - we want to help there. And we want to be involved in and lead the societal, political and ethical debates around these technologies, as well as demystify the exaggerated perceptions of danger.</p>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/Chris-at-KAS.jpg" /></p>
</div>
<div id="m-is-there-any-advice-you-have-for-data-scientists-who-aspire-to-work-on-political-projects-political-data-scientists-or-data-scientists-in-general" class="section level4">
<h4>M: Is there any advice you have for data scientists who aspire to work on political projects, political data scientists, or data scientists in general?</h4>
<p>C: <strong>Get the right data, know the sources and eliminate biases!</strong> This includes statistical bias (especially sampling, funding and reporting bias), cognitive bias and other social biases. Online datasets may be easier to obtain, but are often outdated or not real-time enough. Worse, they could be doctored to reflect the narratives of an authoritarian regime or lobbyist group.</p>
<p>Ask yourselves: what were the context and conditions during the data collection process? What kind of limitations existed? Whether it’s data from a sentiment analysis report or a simple survey, what could be wrong with the data? Could there be any noise? Anything unusual?</p>
<p>Also, the logic behind the metrics in a dataset can be misleading; think about GDP, a common measure of economic output. Does a growing GDP mean the country is improving and everyone is better off? Not really. If you look closer you might see that the GDP growth is distributed only to a small percentage of the population.</p>
<p>How were survey responses recorded? Did the method change during the campaign? What’s the logic behind the metrics? What kind of issues may lead to the data being wrong? Could there be a situation of reluctant journalists or silenced human rights activists?</p>
<p>Finally, ethics is not only for philosophers, but also for engineers. This will be a hot topic over the next years, and AI and data specialists will need to be able to explain to consumers and other stakeholders the different problems and solutions in algorithm-driven products.</p>
<p>N: In politics and economics especially, <strong>it is really important to ask who created the dataset, who financed or sponsored it, and who really controls the overall character and narrative of the data</strong>.</p>
<p>C: You should never be afraid to go out there on the field and collect the data by yourself - it can be really fun!</p>
</div>
<div id="m-thank-you-again-for-your-time-and-for-this-very-fascinating-interview." class="section level4">
<h4>M: Thank you again for your time and for this very fascinating interview.</h4>
<p><img src="https://images.unsplash.com/photo-1460925895917-afdab827c52f?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=1000&q=80" /></p>
<hr />
</div>
</div>
<div id="endnotes" class="section level2">
<h2>Endnotes</h2>
<p>One theme I found interesting is how ubiquitous the application of analytics is. But some of the data challenges that Nanthida and Chris raised are very real, and confirm the view that a considerable chunk of time in data analysis is spent on collecting, cleaning and getting the data right for analysis in the first place, not only on the analysis itself.</p>
<p>I hope you’ve enjoyed reading the above interview. If you would like to get in touch with Christopher and Nanthida, you may reach them through their website <a href="https://Worldacquire.com">here</a>. I’m also looking to do more interviews, so if you are a data / analytics practitioner and you think you have something interesting to share, please feel free to get in touch!</p>
<p><small> Image credits: Artem Bryzgalov, Kelvin Yup, Frida Aguilar Estrada, Franz Wender, ThisisEngineering, camilo jimenez, dole777, jbdodane, History in HD, Jonathan Francisca, Carlos Muza </small></p>
<p><small><b>Themes</b>: political analytics, political data science, microtargeting, political advertising, data collection, social analytics, data bias, data integrity, algorithmic bias, statistical bias, sample bias, observer bias, bias, data ethics, data regulation, privacy, AI ethics, gdpr, tech for good, surveys, cambridge analytica, thailand, algorithm design, botnets, disinformation</small></p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>See UN-related work on digital ethics and transparency, and on anti-disinformation: <a href="https://worldacquire.com/2019/12/09/worldacquire-at-the-united-nations-igf-2019/" class="uri">https://worldacquire.com/2019/12/09/worldacquire-at-the-united-nations-igf-2019/</a> and <a href="https://worldacquire.com/2020/02/27/online-disinformation-and-extremism-how-it-spreads-and-how-to-stop-it/" class="uri">https://worldacquire.com/2020/02/27/online-disinformation-and-extremism-how-it-spreads-and-how-to-stop-it/</a><a href="#fnref1" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</section>Martin ChanData cleaning with Kamehamehas in R2020-04-11T00:00:00+00:002020-04-11T00:00:00+00:00https://martinctc.github.io/blog/data-cleaning-with-kamehamehas-in-r<script src="https://martinctc.github.io/blog/knitr_files/strongest-kamehameha_20200223_files/header-attrs-2.1.1/header-attrs.js"></script>
<section class="main-content">
<div id="background" class="section level2">
<h2>Background</h2>
<p>Given present circumstances in the world, I thought it might be nice to write a post on a lighter subject.</p>
<p>Recently, I came across an interesting Kaggle dataset that features <a href="https://www.kaggle.com/shiddharthsaran/dragon-ball-dataset">the power levels of Dragon Ball characters at different points in the franchise</a>. Whilst the dataset itself is quite simple with only four columns (<code>Character</code>, <code>Power_Level</code>, <code>Saga_or_Movie</code>, <code>Dragon_Ball_Series</code>), I noticed that you do need to do a fair amount of data and string manipulation before you can perform any meaningful data analysis with it. Therefore, if you’re a fan of Dragon Ball and interested in learning about string manipulation in R, this post is definitely for you!</p>
<div class="figure">
<img src="https://raw.githubusercontent.com/martinctc/blog/master/images/kamehameha.gif" alt="" />
<p class="caption">The Kamehameha - image from Giphy</p>
</div>
<p>For those who aren’t as interested in Dragon Ball but still interested in general R tricks, please do read ahead anyway - you won’t need to understand the references to know what’s going on with the code. But be warned that there are spoilers ahead! 😂</p>
<p>Functions or techniques that are covered in this post:</p>
<ul>
<li>Basic regular expression (regex) matching</li>
<li><code>stringr::str_detect()</code></li>
<li><code>stringr::str_remove_all()</code> or <code>stringr::str_remove()</code></li>
<li><code>dplyr::anti_join()</code></li>
<li>Example of a ‘dark mode’ theme in ggplot</li>
</ul>
</div>
<div id="getting-started" class="section level2">
<h2>Getting started</h2>
<p>You can download the dataset from <a href="https://www.kaggle.com/shiddharthsaran/dragon-ball-dataset">Kaggle</a>, although you’ll need to register an account to do so. I would highly recommend registering if you haven’t already, since there are tons of datasets available on the website which you can practise on.</p>
<p>The next thing I’ll do is to set up my R working directory <a href="https://martinctc.github.io/blog/rstudio-projects-and-working-directories-a-beginner%27s-guide/">in this style</a>, and ensure that the dataset is saved in the <em>datasets</em> subfolder. I’ll use the {here} workflow for this example, which is generally good practice as <code>here::here</code> implicitly sets the path root to the top level of the current project.</p>
<p>Let’s load our packages and explore the data using <code>glimpse()</code>:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">library</span>(tidyverse)</span>
<span id="cb1-2"><a href="#cb1-2"></a><span class="kw">library</span>(here)</span>
<span id="cb1-3"><a href="#cb1-3"></a></span>
<span id="cb1-4"><a href="#cb1-4"></a>dball_data <-<span class="st"> </span><span class="kw">read_csv</span>(<span class="kw">here</span>(<span class="st">"datasets"</span>, <span class="st">"Dragon_Ball_Data_Set.csv"</span>))</span>
<span id="cb1-5"><a href="#cb1-5"></a></span>
<span id="cb1-6"><a href="#cb1-6"></a>dball_data <span class="op">%>%</span><span class="st"> </span><span class="kw">glimpse</span>()</span></code></pre></div>
<pre><code>## Observations: 1,244
## Variables: 4
## $ Character <chr> "Goku", "Bulma", "Bear Thief", "Master Roshi", "...
## $ Power_Level <chr> "10", "1.5", "7", "30", "5", "8.5", "4", "8", "2...
## $ Saga_or_Movie <chr> "Emperor Pilaf Saga", "Emperor Pilaf Saga", "Emp...
## $ Dragon_Ball_Series <chr> "Dragon Ball", "Dragon Ball", "Dragon Ball", "Dr...</code></pre>
<p>…and also <code>tail()</code> to view the last six rows of the data, just so we get a more comprehensive picture of what some of the other observations in the data look like:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb3-1"><a href="#cb3-1"></a>dball_data <span class="op">%>%</span><span class="st"> </span><span class="kw">tail</span>()</span></code></pre></div>
<pre><code>## # A tibble: 6 x 4
## Character Power_Level Saga_or_Movie Dragon_Ball_Seri~
## <chr> <chr> <chr> <chr>
## 1 Goku (base with SSJG ~ 448,000,000,000 Movie 14: Battle o~ Dragon Ball Z
## 2 Goku (MSSJ with SSJG'~ 22,400,000,000,0~ Movie 14: Battle o~ Dragon Ball Z
## 3 Goku (SSJG) 224,000,000,000,~ Movie 14: Battle o~ Dragon Ball Z
## 4 Goku 44,800,000,000 Movie 14: Battle o~ Dragon Ball Z
## 5 Beerus (full power, n~ 896,000,000,000,~ Movie 14: Battle o~ Dragon Ball Z
## 6 Whis (full power, nev~ 4,480,000,000,00~ Movie 14: Battle o~ Dragon Ball Z</code></pre>
</div>
<div id="who-does-the-strongest-kamehameha" class="section level2">
<h2>Who does the strongest Kamehameha? 🔥</h2>
<p>In the Dragon Ball series, there is an energy attack called <em>Kamehameha</em>, which is a signature (and perhaps the most well recognised) move by the main character <strong>Goku</strong>. This move is however not unique to him, and has also been used by other characters in the series, including his son <strong>Gohan</strong> and his master <strong>Muten Roshi</strong>.</p>
<div class="figure">
<img src="https://raw.githubusercontent.com/martinctc/blog/master/images/goku-and-roshi.gif" alt="" />
<p class="caption">Goku and Muten Roshi - image from Giphy</p>
</div>
<p>As you’ll see, this dataset includes observations which detail the power level of the notable occasions when this attack was used. Our task here is to get some understanding about this attack move from the data, and see if we can figure out whose kamehameha is actually the strongest out of all the characters.</p>
<div id="data-cleaning" class="section level3">
<h3>Data cleaning</h3>
<p>Here, we use regex (regular expression) string matching to filter on the <code>Character</code> column. The <code>str_detect()</code> function from the {stringr} package detects whether a pattern or expression exists in each string, and returns a logical vector of <code>TRUE</code> or <code>FALSE</code> values (which is what <code>dplyr::filter()</code> expects as its filtering condition). I also used the <code>stringr::regex()</code> function and set the <code>ignore_case</code> argument to <code>TRUE</code>, which makes the filter case-insensitive, such that cases of ‘Kame’ and ‘kAMe’ are also picked up if they do exist.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1"></a>dball_data <span class="op">%>%</span></span>
<span id="cb5-2"><a href="#cb5-2"></a><span class="st"> </span><span class="kw">filter</span>(<span class="kw">str_detect</span>(Character, <span class="kw">regex</span>(<span class="st">"kameha"</span>, <span class="dt">ignore_case =</span> <span class="ot">TRUE</span>))) -><span class="st"> </span>dball_data_<span class="dv">1</span></span>
<span id="cb5-3"><a href="#cb5-3"></a></span>
<span id="cb5-4"><a href="#cb5-4"></a>dball_data_<span class="dv">1</span> <span class="op">%>%</span><span class="st"> </span><span class="kw">head</span>()</span></code></pre></div>
<pre><code>## # A tibble: 6 x 4
## Character Power_Level Saga_or_Movie Dragon_Ball_Seri~
## <chr> <chr> <chr> <chr>
## 1 Master Roshi's Max Power Kam~ 180 Emperor Pilaf Saga Dragon Ball
## 2 Goku's Kamehameha 12 Emperor Pilaf Saga Dragon Ball
## 3 Jackie Chun's Max power Kame~ 330 Tournament Saga Dragon Ball
## 4 Goku's Kamehameha 90 Red Ribbon Army S~ Dragon Ball
## 5 Goku's Kamehameha 90 Red Ribbon Army S~ Dragon Ball
## 6 Goku's Super Kamehameha 740 Piccolo Jr. Saga Dragon Ball</code></pre>
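<p>To see what the case-insensitive match is doing, here is a minimal sketch on a made-up character vector. The values in <code>x</code> are illustrative assumptions and are not taken from the dataset:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(stringr)

# Hypothetical examples with mixed casing (not from the dataset)
x &lt;- c("Goku's Kamehameha", "Jackie Chun's Max power KAMEHAMEHA", "Bulma")

# Case-sensitive: only the exact casing is detected
strict &lt;- str_detect(x, "Kameha")
strict
# TRUE FALSE FALSE

# Case-insensitive: both casings are detected
relaxed &lt;- str_detect(x, regex("kameha", ignore_case = TRUE))
relaxed
# TRUE TRUE FALSE</code></pre></div>
<p>Because the result is a logical vector with one value per element, it can be passed straight into <code>filter()</code> as a condition.</p>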
<p>If this filter feels convoluted, it’s for a good reason. The casing of ‘Kamehameha’ varies in this dataset, so a ‘straightforward’ case-sensitive filter wouldn’t have picked up every row. There are two rows that it would have missed:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb7-1"><a href="#cb7-1"></a>dball_data <span class="op">%>%</span></span>
<span id="cb7-2"><a href="#cb7-2"></a><span class="st"> </span><span class="kw">filter</span>(<span class="kw">str_detect</span>(Character, <span class="st">"Kamehameha"</span>)) -><span class="st"> </span>dball_data_1b</span>
<span id="cb7-3"><a href="#cb7-3"></a></span>
<span id="cb7-4"><a href="#cb7-4"></a><span class="co">## Show the rows in dball_data_1 that have no match in dball_data_1b</span></span>
<span id="cb7-5"><a href="#cb7-5"></a>dball_data_<span class="dv">1</span> <span class="op">%>%</span></span>
<span id="cb7-6"><a href="#cb7-6"></a><span class="st"> </span>dplyr<span class="op">::</span><span class="kw">anti_join</span>(dball_data_1b, <span class="dt">by =</span> <span class="st">"Character"</span>)</span></code></pre></div>
<pre><code>## # A tibble: 2 x 4
## Character Power_Level Saga_or_Movie Dragon_Ball_Seri~
## <chr> <chr> <chr> <chr>
## 1 Jackie Chun's Max power Kameham~ 330 Tournament Saga Dragon Ball
## 2 Android 19 (Goku's kamehameha a~ 230,000,000 Android Saga Dragon Ball Z</code></pre>
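<p>The same comparison can be sketched on two tiny made-up tables. The names <code>all_matches</code>, <code>strict_matches</code> and <code>missed</code> are hypothetical and only serve to show what <code>anti_join()</code> returns:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(dplyr)

# Hypothetical tables, for illustration only
all_matches &lt;- tibble(
  Character = c("Goku's Kamehameha", "Goku's kamehameha", "Gohan's Kamehameha")
)
strict_matches &lt;- tibble(
  Character = c("Goku's Kamehameha", "Gohan's Kamehameha")
)

# Keep the rows of all_matches whose Character has no match in strict_matches
missed &lt;- anti_join(all_matches, strict_matches, by = "Character")
missed</code></pre></div>
<p>Note that <code>anti_join()</code> is not symmetric: it only returns rows from the first table, so the order of the two tables matters.</p>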
<p>Before we go any further with any analysis, we’ll also need to do something about <code>Power_Level</code>, as it is currently in the form of character / text, which means we can’t do any meaningful analysis until we convert it to numeric. To do this, we can start with removing the comma separators with <code>stringr::str_remove_all()</code>, and then run <code>as.numeric()</code>.</p>
<p>In ‘real life’, you often get data saved with <em>k</em> and <em>m</em> suffixes for thousands and millions, which will require a bit more cleaning to do - so here, I’m just thankful that all I have to do is to remove some comma separators.</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb9-1"><a href="#cb9-1"></a>dball_data_<span class="dv">1</span> <span class="op">%>%</span></span>
<span id="cb9-2"><a href="#cb9-2"></a><span class="st"> </span><span class="kw">mutate_at</span>(<span class="st">"Power_Level"</span>, <span class="op">~</span><span class="kw">str_remove_all</span>(., <span class="st">","</span>)) <span class="op">%>%</span></span>
<span id="cb9-3"><a href="#cb9-3"></a><span class="st"> </span><span class="kw">mutate_at</span>(<span class="st">"Power_Level"</span>, <span class="op">~</span><span class="kw">as.numeric</span>(.)) -><span class="st"> </span>dball_data_<span class="dv">2</span></span>
<span id="cb9-4"><a href="#cb9-4"></a></span>
<span id="cb9-5"><a href="#cb9-5"></a>dball_data_<span class="dv">2</span> <span class="op">%>%</span><span class="st"> </span><span class="kw">tail</span>()</span></code></pre></div>
<pre><code>## # A tibble: 6 x 4
## Character Power_Level Saga_or_Movie Dragon_Ball_Seri~
## <chr> <dbl> <chr> <chr>
## 1 Goku's Super Kame~ 25300000000 OVA: Plan to Eradicate the ~ Dragon Ball Z
## 2 Family Kamehameha 300000000000 Movie 10: Broly- The Second~ Dragon Ball Z
## 3 Krillin's Kameham~ 8000000 Movie 11: Bio-Broly Dragon Ball Z
## 4 Goten's Kamehameha 950000000 Movie 11: Bio-Broly Dragon Ball Z
## 5 Trunk's Kamehameha 980000000 Movie 11: Bio-Broly Dragon Ball Z
## 6 Goten's Super Kam~ 3000000000 Movie 11: Bio-Broly Dragon Ball Z</code></pre>
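<p>As a rough sketch of the ‘real life’ case mentioned above, the following shows one way to handle both comma separators and <em>k</em> / <em>m</em> suffixes. The helper <code>parse_power()</code> and the values in <code>raw</code> are hypothetical; they are not part of {stringr} or of this dataset:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(stringr)

# Hypothetical raw values, for illustration only
raw &lt;- c("1,200", "448,000,000,000", "1.5k", "2m")

# Hypothetical helper: convert strings with commas and k/m suffixes to numeric
parse_power &lt;- function(x) {
  x &lt;- str_remove_all(x, ",")                       # drop comma separators
  suffix &lt;- str_extract(str_to_lower(x), "[km]$")   # NA when there is no suffix
  multiplier &lt;- ifelse(is.na(suffix), 1,
                       ifelse(suffix == "k", 1e3, 1e6))
  as.numeric(str_remove(x, "[kKmM]$")) * multiplier  # strip suffix, then scale
}

parsed &lt;- parse_power(raw)
parsed</code></pre></div>
<p>This assumes that <em>k</em> and <em>m</em> are the only suffixes present; anything else would come back as <code>NA</code> with a warning from <code>as.numeric()</code>, which is a useful signal that more cleaning is needed.</p>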
<p>Now that we’ve fixed the <code>Power_Level</code> column, the next step is to isolate the information about the characters from the <code>Character</code> column. We have to do this because, inconveniently, the column provides information for both the <em>character</em> and <em>the occasion</em> of when the kamehameha is used, which means we won’t be able to easily filter or group the dataset by the characters only.</p>
<p>One way to overcome this problem is to use the apostrophe (or single quote) as a delimiter to extract the characters from the column. Before I do this, I will take another manual step to remove the rows corresponding to absorbed kamehamehas, e.g. <em>Android 19 (Goku’s kamehameha absorbed)</em>, as it refers to the character’s power level <em>after</em> absorbing the attack, rather than the attack itself. (Yes, some characters are able to absorb kamehameha attacks and make themselves stronger..!)</p>
<p>After applying the filter, I use <code>mutate()</code> to create a new column called <code>Character_Single</code>, and then <code>str_remove_all()</code> to remove the apostrophe and all the characters that appear after it:</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb11-1"><a href="#cb11-1"></a>dball_data_<span class="dv">2</span> <span class="op">%>%</span></span>
<span id="cb11-2"><a href="#cb11-2"></a><span class="st"> </span><span class="kw">filter</span>(<span class="op">!</span><span class="kw">str_detect</span>(Character, <span class="st">"absorbed"</span>)) <span class="op">%>%</span><span class="st"> </span><span class="co"># Remove 2 rows unrelated to kamehameha attacks</span></span>
<span id="cb11-3"><a href="#cb11-3"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="dt">Character_Single =</span> <span class="kw">str_remove_all</span>(Character, <span class="st">"</span><span class="ch">\\</span><span class="st">'.+"</span>)) <span class="op">%>%</span><span class="st"> </span><span class="co"># Remove everything after apostrophe</span></span>
<span id="cb11-4"><a href="#cb11-4"></a><span class="st"> </span><span class="kw">select</span>(Character_Single, <span class="kw">everything</span>()) -><span class="st"> </span>dball_data_<span class="dv">3</span></span></code></pre></div>
<pre><code>## # A tibble: 10 x 5
## Character_Single Character Power_Level Saga_or_Movie Dragon_Ball_Ser~
## <chr> <chr> <dbl> <chr> <chr>
## 1 Master Roshi Master Roshi's~ 180 Emperor Pilaf ~ Dragon Ball
## 2 Goku Goku's Kameham~ 12 Emperor Pilaf ~ Dragon Ball
## 3 Jackie Chun Jackie Chun's ~ 330 Tournament Saga Dragon Ball
## 4 Goku Goku's Kameham~ 90 Red Ribbon Arm~ Dragon Ball
## 5 Goku Goku's Kameham~ 90 Red Ribbon Arm~ Dragon Ball
## 6 Goku Goku's Super K~ 740 Piccolo Jr. Sa~ Dragon Ball
## 7 Goku Goku's Kameham~ 950 Saiyan Saga Dragon Ball Z
## 8 Goku Goku's Kameham~ 36000 Saiyan Saga Dragon Ball Z
## 9 Goku Goku's Kameham~ 44000 Saiyan Saga Dragon Ball Z
## 10 Goku Goku's Angry K~ 180000000 Frieza Saga Dragon Ball Z</code></pre>
<p>Note that the apostrophe in the pattern is escaped by adding two backslashes before it. Strictly speaking, the apostrophe is not a special character in regex, and since the pattern is wrapped in double quotes the escape is not required here, but it does no harm. The dot (<code>.</code>) matches any character, and <code>+</code> tells R to match the preceding dot one or more times, so the pattern matches the apostrophe and everything after it. Regex is a very useful thing to learn, and I would highly recommend reading through the references linked in the footnote if you’ve never used regular expressions before.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
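<p>A minimal sketch of how this pattern behaves on a few made-up strings (the vector <code>x</code> is an assumption for illustration). It also shows that, in this case, the pattern without the backslashes gives the same result:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(stringr)

# Hypothetical examples, for illustration only
x &lt;- c("Goku's Super Kamehameha",
       "Master Roshi's Max Power Kamehameha",
       "Family Kamehameha")

# Escaped apostrophe, as used in the pipeline above
escaped &lt;- str_remove_all(x, "\\'.+")

# Unescaped apostrophe
unescaped &lt;- str_remove_all(x, "'.+")

escaped
identical(escaped, unescaped)</code></pre></div>
<p>Strings without an apostrophe, such as ‘Family Kamehameha’, are left untouched, which is why that label survives as a ‘character’ in the grouped results.</p>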
</div>
<div id="analysis" class="section level3">
<h3>Analysis</h3>
<p>Now that we’ve got a clean dataset, what can we find out about the Kamehamehas?</p>
<div class="figure">
<img src="https://raw.githubusercontent.com/martinctc/blog/master/images/kamehameha2.gif" alt="" />
<p class="caption">The Kamehameha - image from Giphy</p>
</div>
<p>My approach is to start by calculating the average power levels of Kamehamehas in R, grouped by <code>Character_Single</code>. The resulting table tells us that on average, Goku’s Kamehameha is the most powerful, followed by Gohan:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb13-1"><a href="#cb13-1"></a>dball_data_<span class="dv">3</span> <span class="op">%>%</span></span>
<span id="cb13-2"><a href="#cb13-2"></a><span class="st"> </span><span class="kw">group_by</span>(Character_Single) <span class="op">%>%</span></span>
<span id="cb13-3"><a href="#cb13-3"></a><span class="st"> </span><span class="kw">summarise_at</span>(<span class="kw">vars</span>(Power_Level), <span class="op">~</span><span class="kw">mean</span>(.)) <span class="op">%>%</span></span>
<span id="cb13-4"><a href="#cb13-4"></a><span class="st"> </span><span class="kw">arrange</span>(<span class="kw">desc</span>(Power_Level)) -><span class="st"> </span>kame_data_grouped <span class="co"># Sort by descending</span></span>
<span id="cb13-5"><a href="#cb13-5"></a></span>
<span id="cb13-6"><a href="#cb13-6"></a>kame_data_grouped</span></code></pre></div>
<pre><code>## # A tibble: 11 x 2
## Character_Single Power_Level
## <chr> <dbl>
## 1 Goku 3.46e14
## 2 Gohan 1.82e12
## 3 Family Kamehameha 3.00e11
## 4 Super Perfect Cell 8.00e10
## 5 Perfect Cell 3.02e10
## 6 Goten 1.98e 9
## 7 Trunk 9.80e 8
## 8 Krillin 8.00e 6
## 9 Student-Teacher Kamehameha 1.70e 4
## 10 Jackie Chun 3.30e 2
## 11 Master Roshi 1.80e 2</code></pre>
<p>However, it’s not helpful to directly visualise this on a bar chart, as the highest average Power Level is 175,433 times greater than the median!</p>
<div class="sourceCode" id="cb15"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb15-1"><a href="#cb15-1"></a>kame_data_grouped <span class="op">%>%</span></span>
<span id="cb15-2"><a href="#cb15-2"></a><span class="st"> </span><span class="kw">pull</span>(Power_Level) <span class="op">%>%</span></span>
<span id="cb15-3"><a href="#cb15-3"></a><span class="st"> </span><span class="kw">summary</span>()</span></code></pre></div>
<pre><code>## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.800e+02 4.008e+06 1.975e+09 3.170e+13 1.900e+11 3.465e+14</code></pre>
<p>A way around this is to log transform the <code>Power_Level</code> variable prior to visualising it, saving the result into a new column called <code>Power_Index</code> (note that <code>log()</code> in R computes the natural logarithm by default). Then, we can pipe the data directly into a ggplot chain, and set a dark mode using <code>theme()</code>:</p>
<div class="sourceCode" id="cb17"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb17-1"><a href="#cb17-1"></a>kame_data_grouped <span class="op">%>%</span></span>
<span id="cb17-2"><a href="#cb17-2"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="dt">Power_Index =</span> <span class="kw">log</span>(Power_Level)) <span class="op">%>%</span><span class="st"> </span><span class="co"># Log transform Power Levels</span></span>
<span id="cb17-3"><a href="#cb17-3"></a><span class="st"> </span><span class="kw">ggplot</span>(<span class="kw">aes</span>(<span class="dt">x =</span> <span class="kw">reorder</span>(Character_Single, Power_Level),</span>
<span id="cb17-4"><a href="#cb17-4"></a> <span class="dt">y =</span> Power_Index,</span>
<span id="cb17-5"><a href="#cb17-5"></a> <span class="dt">fill =</span> Character_Single)) <span class="op">+</span></span>
<span id="cb17-6"><a href="#cb17-6"></a><span class="st"> </span><span class="kw">geom_col</span>() <span class="op">+</span></span>
<span id="cb17-7"><a href="#cb17-7"></a><span class="st"> </span><span class="kw">coord_flip</span>() <span class="op">+</span></span>
<span id="cb17-8"><a href="#cb17-8"></a><span class="st"> </span><span class="kw">scale_fill_brewer</span>(<span class="dt">palette =</span> <span class="st">"Spectral"</span>) <span class="op">+</span></span>
<span id="cb17-9"><a href="#cb17-9"></a><span class="st"> </span><span class="kw">theme_minimal</span>() <span class="op">+</span></span>
<span id="cb17-10"><a href="#cb17-10"></a><span class="st"> </span><span class="kw">geom_text</span>(<span class="kw">aes</span>(<span class="dt">y =</span> Power_Index,</span>
<span id="cb17-11"><a href="#cb17-11"></a> <span class="dt">label =</span> <span class="kw">round</span>(Power_Index, <span class="dv">1</span>),</span>
<span id="cb17-12"><a href="#cb17-12"></a> <span class="dt">hjust =</span> <span class="fl">-.2</span>),</span>
<span id="cb17-13"><a href="#cb17-13"></a> <span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>) <span class="op">+</span></span>
<span id="cb17-14"><a href="#cb17-14"></a><span class="st"> </span><span class="kw">ggtitle</span>(<span class="st">"Power Levels of Kamehamehas"</span>, <span class="dt">subtitle =</span> <span class="st">"By Dragon Ball characters"</span>) <span class="op">+</span></span>
<span id="cb17-15"><a href="#cb17-15"></a><span class="st"> </span><span class="kw">theme</span>(<span class="dt">plot.background =</span> <span class="kw">element_rect</span>(<span class="dt">fill =</span> <span class="st">"grey20"</span>),</span>
<span id="cb17-16"><a href="#cb17-16"></a> <span class="dt">text =</span> <span class="kw">element_text</span>(<span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>),</span>
<span id="cb17-17"><a href="#cb17-17"></a> <span class="dt">panel.grid =</span> <span class="kw">element_blank</span>(),</span>
<span id="cb17-18"><a href="#cb17-18"></a> <span class="dt">plot.title =</span> <span class="kw">element_text</span>(<span class="dt">colour=</span><span class="st">"#FFFFFF"</span>, <span class="dt">face=</span><span class="st">"bold"</span>, <span class="dt">size=</span><span class="dv">20</span>),</span>
<span id="cb17-19"><a href="#cb17-19"></a> <span class="dt">axis.line =</span> <span class="kw">element_line</span>(<span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>),</span>
<span id="cb17-20"><a href="#cb17-20"></a> <span class="dt">legend.position =</span> <span class="st">"none"</span>,</span>
<span id="cb17-21"><a href="#cb17-21"></a> <span class="dt">axis.title =</span> <span class="kw">element_text</span>(<span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>, <span class="dt">size =</span> <span class="dv">12</span>),</span>
<span id="cb17-22"><a href="#cb17-22"></a> <span class="dt">axis.text =</span> <span class="kw">element_text</span>(<span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>, <span class="dt">size =</span> <span class="dv">12</span>)) <span class="op">+</span></span>
<span id="cb17-23"><a href="#cb17-23"></a><span class="st"> </span><span class="kw">ylab</span>(<span class="st">"Power Levels (log transformed)"</span>) <span class="op">+</span></span>
<span id="cb17-24"><a href="#cb17-24"></a><span class="st"> </span><span class="kw">xlab</span>(<span class="st">" "</span>)</span></code></pre></div>
<p><img src="https://martinctc.github.io/blog/knitr_files/strongest-kamehameha_20200223_files/figure-html/unnamed-chunk-11-1.png" /><!-- --></p>
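<p>To get a feel for what the log transformation does to these numbers, here is a small sketch using three of the averages from the table above (rounded as printed there). The vector name <code>power</code> is just for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Three average power levels, copied from the grouped table above
power &lt;- c(Goku = 3.46e14, Gohan = 1.82e12, "Master Roshi" = 1.80e2)

# On the raw scale, the largest value dwarfs the smallest
raw_ratio &lt;- max(power) / min(power)
raw_ratio

# Natural log (the default for log()) compresses the range
power_index &lt;- log(power)
round(power_index, 1)

# log10() is an alternative that reads directly as orders of magnitude
round(log10(power), 1)</code></pre></div>
<p>The ordering of the characters is preserved, but the bars become comparable in length. The trade-off is that the axis no longer shows power levels directly, so the label needs to say that the values are log transformed.</p>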
<p>So as it turns out, the results aren’t too surprising. Goku’s Kamehameha is the strongest of all the characters on average, although it has been referenced several times in the series that his son Gohan’s latent powers are beyond Goku’s.</p>
<p>Also, it is perhaps unsurprising that Master Roshi’s Kamehameha is the least powerful, given a highly powered comparison set of characters. Interestingly, Roshi’s Kamehameha is stronger as ‘Jackie Chun’ than as himself.</p>
<p>We can also see the extent to which Goku’s Kamehameha has grown more powerful across the series. The saga or movie in which each attack appears is recorded in the column <code>Saga_or_Movie</code>. Following the same approach as above, we can group the data by <code>Saga_or_Movie</code>, and pipe the result into a ggplot bar chart:</p>
<div class="sourceCode" id="cb18"><pre class="sourceCode r"><code class="sourceCode r"><span id="cb18-1"><a href="#cb18-1"></a>dball_data_<span class="dv">3</span> <span class="op">%>%</span></span>
<span id="cb18-2"><a href="#cb18-2"></a><span class="st"> </span><span class="kw">filter</span>(Character_Single <span class="op">==</span><span class="st"> "Goku"</span>) <span class="op">%>%</span></span>
<span id="cb18-3"><a href="#cb18-3"></a><span class="st"> </span><span class="kw">mutate</span>(<span class="dt">Power_Index =</span> <span class="kw">log</span>(Power_Level)) <span class="op">%>%</span><span class="st"> </span><span class="co"># Log transform Power Levels</span></span>
<span id="cb18-4"><a href="#cb18-4"></a><span class="st"> </span><span class="kw">group_by</span>(Saga_or_Movie) <span class="op">%>%</span></span>
<span id="cb18-5"><a href="#cb18-5"></a><span class="st"> </span><span class="kw">summarise</span>(<span class="dt">Power_Index =</span> <span class="kw">mean</span>(Power_Index)) <span class="op">%>%</span></span>
<span id="cb18-6"><a href="#cb18-6"></a><span class="st"> </span><span class="kw">ggplot</span>(<span class="kw">aes</span>(<span class="dt">x =</span> <span class="kw">reorder</span>(Saga_or_Movie, Power_Index),</span>
<span id="cb18-7"><a href="#cb18-7"></a> <span class="dt">y =</span> Power_Index)) <span class="op">+</span></span>
<span id="cb18-8"><a href="#cb18-8"></a><span class="st"> </span><span class="kw">geom_col</span>(<span class="dt">fill =</span> <span class="st">"#F85B1A"</span>) <span class="op">+</span></span>
<span id="cb18-9"><a href="#cb18-9"></a><span class="st"> </span><span class="kw">theme_minimal</span>() <span class="op">+</span></span>
<span id="cb18-10"><a href="#cb18-10"></a><span class="st"> </span><span class="kw">geom_text</span>(<span class="kw">aes</span>(<span class="dt">y =</span> Power_Index,</span>
<span id="cb18-11"><a href="#cb18-11"></a> <span class="dt">label =</span> <span class="kw">round</span>(Power_Index, <span class="dv">1</span>),</span>
<span id="cb18-12"><a href="#cb18-12"></a> <span class="dt">vjust =</span> <span class="fl">-.5</span>),</span>
<span id="cb18-13"><a href="#cb18-13"></a> <span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>) <span class="op">+</span></span>
<span id="cb18-14"><a href="#cb18-14"></a><span class="st"> </span><span class="kw">ggtitle</span>(<span class="st">"Power Levels of Goku's Kamehamehas"</span>, <span class="dt">subtitle =</span> <span class="st">"By Saga/Movie"</span>) <span class="op">+</span></span>
<span id="cb18-15"><a href="#cb18-15"></a><span class="st"> </span><span class="kw">scale_y_continuous</span>(<span class="dt">limits =</span> <span class="kw">c</span>(<span class="dv">0</span>, <span class="dv">40</span>)) <span class="op">+</span></span>
<span id="cb18-16"><a href="#cb18-16"></a><span class="st"> </span><span class="kw">theme</span>(<span class="dt">plot.background =</span> <span class="kw">element_rect</span>(<span class="dt">fill =</span> <span class="st">"grey20"</span>),</span>
<span id="cb18-17"><a href="#cb18-17"></a> <span class="dt">text =</span> <span class="kw">element_text</span>(<span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>),</span>
<span id="cb18-18"><a href="#cb18-18"></a> <span class="dt">panel.grid =</span> <span class="kw">element_blank</span>(),</span>
<span id="cb18-19"><a href="#cb18-19"></a> <span class="dt">plot.title =</span> <span class="kw">element_text</span>(<span class="dt">colour=</span><span class="st">"#FFFFFF"</span>, <span class="dt">face=</span><span class="st">"bold"</span>, <span class="dt">size=</span><span class="dv">20</span>),</span>
<span id="cb18-20"><a href="#cb18-20"></a> <span class="dt">plot.subtitle =</span> <span class="kw">element_text</span>(<span class="dt">colour=</span><span class="st">"#FFFFFF"</span>, <span class="dt">face=</span><span class="st">"bold"</span>, <span class="dt">size=</span><span class="dv">12</span>),</span>
<span id="cb18-21"><a href="#cb18-21"></a> <span class="dt">axis.line =</span> <span class="kw">element_line</span>(<span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>),</span>
<span id="cb18-22"><a href="#cb18-22"></a> <span class="dt">legend.position =</span> <span class="st">"none"</span>,</span>
<span id="cb18-23"><a href="#cb18-23"></a> <span class="dt">axis.title =</span> <span class="kw">element_text</span>(<span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>, <span class="dt">size =</span> <span class="dv">10</span>),</span>
<span id="cb18-24"><a href="#cb18-24"></a> <span class="dt">axis.text.y =</span> <span class="kw">element_text</span>(<span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>, <span class="dt">size =</span> <span class="dv">8</span>),</span>
<span id="cb18-25"><a href="#cb18-25"></a> <span class="dt">axis.text.x =</span> <span class="kw">element_text</span>(<span class="dt">colour =</span> <span class="st">"#FFFFFF"</span>, <span class="dt">size =</span> <span class="dv">8</span>, <span class="dt">angle =</span> <span class="dv">45</span>, <span class="dt">hjust =</span> <span class="dv">1</span>)) <span class="op">+</span></span>
<span id="cb18-26"><a href="#cb18-26"></a><span class="st"> </span><span class="kw">ylab</span>(<span class="st">"Power Levels (log transformed)"</span>) <span class="op">+</span></span>
<span id="cb18-27"><a href="#cb18-27"></a><span class="st"> </span><span class="kw">xlab</span>(<span class="st">" "</span>)</span></code></pre></div>
<p><img src="https://martinctc.github.io/blog/knitr_files/strongest-kamehameha_20200223_files/figure-html/unnamed-chunk-12-1.png" /><!-- --></p>
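<p>One subtlety worth flagging: the first chart log-transforms the per-character averages, whereas this chart averages values that have already been log-transformed. The two are not the same quantity. A quick sketch using Goku’s three Saiyan Saga rows shown earlier (the vector name <code>saiyan</code> is just for illustration):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Goku's Saiyan Saga power levels, copied from the cleaned table above
saiyan &lt;- c(950, 36000, 44000)

# Log of the arithmetic mean (the approach in the first chart)
log_of_mean &lt;- log(mean(saiyan))

# Mean of the logs (the approach in this chart)
mean_of_log &lt;- mean(log(saiyan))

log_of_mean
mean_of_log

# Back-transforming the mean of the logs gives the geometric mean
geo_mean &lt;- exp(mean_of_log)
geo_mean</code></pre></div>
<p>The mean of the logs is never larger than the log of the mean, and it is less sensitive to a single very large value, which is arguably a sensible choice for data this skewed. Either approach is fine for a chart like this, as long as it is applied consistently when comparing results.</p>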
<p>I don’t have full knowledge of the chronology of the franchise, but I do know that <em>Emperor Pilaf Saga</em>, <em>Red Ribbon Army Saga</em>, and <em>Piccolo Jr. Saga</em> are the earliest story arcs where Goku’s martial arts abilities are still developing. It also appears that if I’d like to witness Goku’s most powerful Kamehameha attack, I should look for it in the <em>Baby Saga</em>!</p>
</div>
</div>
<div id="notes" class="section level2">
<h2>Notes</h2>
<p>Hope this was an interesting read for you, and that this tells you something new about R or Dragon Ball.</p>
<p>There is certainly more you can do with this dataset, especially once it is processed into a usable, tidy format.</p>
<p>If you have any related datasets that will help make this analysis more interesting, please let me know!</p>
<p>In the meantime, please stay safe and take care, all!</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>See <a href="https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html" class="uri">https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html</a> and <a href="https://stringr.tidyverse.org/articles/regular-expressions.html" class="uri">https://stringr.tidyverse.org/articles/regular-expressions.html</a><a href="#fnref1" class="footnote-back">↩︎</a></p></li>
</ol>
</div>
</section>Martin ChanRStudio Projects and Working Directories: A Beginner’s Guide2020-01-23T00:00:00+00:002020-01-23T00:00:00+00:00https://martinctc.github.io/blog/rstudio-projects-and-working-directories-a-beginner's-guide<section class="main-content">
<div id="introduction" class="section level2">
<h2>Introduction 📂📂📂</h2>
<p>This post provides a basic introduction on how to use RStudio Projects and structure your working directories - which is well worth a read if you are still using <code>setwd()</code> to set your directories!</p>
<p>Although the R working directory is quite a basic and reasonably well-covered subject, I felt that it would still be worth sharing my own approach of structuring working directories, as clearly there can be multiple sensible and valid ways of structuring a working directory. The project directory structure covered in this post is one that I use day-to-day myself, and one that I find the most appropriate for the kind of analysis work that I typically deal with, i.e. data sets loaded into memory, and saved within the working directory itself.</p>
<p>If you are just starting out in R, my personal advice is that using RStudio projects and structuring working directories are ‘must-knows’. Using RStudio projects eliminates so much of the early-stage hassle and confusion around reading in and exporting data. Setting up a working directory properly also helps build up good habits that are conducive to reproducible analysis. It’s one of the non-code related parts of R programming that I think is extremely helpful to know, and arguably for a learner, even a greater priority than learning how to use GitHub! <a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
</div>
<div id="what-is-a-rstudio-project-and-why" class="section level2">
<h2>What is an RStudio project, and why?</h2>
<p>When I first started using R several years ago, the textbook and mainstream approach for setting working directories was to use <code>setwd()</code>, which takes an <em>absolute</em> file path as an input then sets it as the current working directory of the R process. You then use <code>getwd()</code> to find out what the current working directory is, and check that your working directory is correctly set.</p>
<p>The problem with this approach is that since <code>setwd()</code> relies on an <em>absolute</em> file path, the links break very easily, and it becomes very difficult to share your analysis with others. A simple action of moving the entire directory to a different sub-folder or to a different drive will break the links, and your script will not run. As <a href="https://www.tidyverse.org/blog/2017/12/workflow-vs-script/">Jenny Bryan points out</a>, the <code>setwd()</code> approach makes it virtually impossible for anyone other than the original author of the script, on his or her computer, to make the file paths work:</p>
<blockquote>
<p>The chance of the <code>setwd()</code> command having the desired effect – making the file paths work – for anyone besides its author is 0%. It’s also unlikely to work for the author one or two years or computers from now. The project is not self-contained and portable. To recreate and perhaps extend this plot, the lucky recipient will need to hand edit one or more paths to reflect where the project has landed on their machine. When you do this for the 73rd time in 2 days, while marking an assignment, you start to fantasize about lighting the perpetrator’s computer on fire.</p>
</blockquote>
<p>(Check out <a href="https://www.tidyverse.org/blog/2017/12/workflow-vs-script/">this link</a> for the original blog post)</p>
<p>At the beginning I was sceptical about the seemingly radical move of abandoning the <code>setwd()</code> orthodoxy entirely, but since I’ve tried out the project workflow I’ve never really thought about using absolute file paths again. So I’m totally with Jenny Bryan on this one!<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a></p>
</div>
<div id="easy-file-path-referencing-with-rstudio-projects" class="section level2">
<h2>Easy file path referencing with RStudio projects</h2>
<p>RStudio projects solve the problem of ‘fragile’ file paths by making file paths <em>relative</em>. The RStudio project file is a file that sits in the root directory, with the extension .Rproj. When your RStudio session is running through the project file (.Rproj), the current working directory points to the root folder where that .Rproj file is saved.</p>
<p>Here’s an example - let’s suppose my working directory is a folder named <em>SurveyAnalysis1</em>. Instead of listing out the full absolute file path, <em>C:/Users/Martin/Documents/Analysis/SurveyAnalysis1/Data/Data1.xlsx</em>, I can simply refer to the same Excel file relative to the project directory when using projects, i.e. just refer to the file by <em>Data/Data1.xlsx</em>. The idea is that if one day I decide to move my entire <em>SurveyAnalysis1</em> folder/directory to another location, or perhaps open this up on a different computer, all the file paths specified in my R scripts would still work as long as I start the session through opening the .Rproj file.</p>
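<p>As a rough sketch of the difference, using the folder and file names from the example above (the <code>readxl</code> call is left commented out, and is only an assumption about how the Excel file might be read):</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Fragile: an absolute path that only exists on one particular machine
abs_path <- "C:/Users/Martin/Documents/Analysis/SurveyAnalysis1/Data/Data1.xlsx"

# Portable: a path relative to the project root (where the .Rproj file sits)
rel_path <- file.path("Data", "Data1.xlsx")

# Relative paths are resolved against the current working directory,
# which is the project root when the session is opened via the .Rproj file
resolved_path <- file.path(getwd(), rel_path)

# Assumption: the file exists and the readxl package is installed
# survey_data <- readxl::read_excel(rel_path)</code></pre></div>
<p>Only the relative path needs to appear in the script, so the script keeps working wherever the project folder ends up.</p>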
<p>This .Rproj file can be created by going to <strong>File > New Project…</strong> in RStudio, which then becomes associated with the specified folder or directory. The mindset should then be that the directory (the whole folder and its sub-folders and contents) is stand-alone and portable, which means that you shouldn’t be reading in data from or writing data to files <em>outside</em> the directory. Everything relating to that analysis or project should only happen within that directory, except for cases where your analysis requires interacting with an Internet source, e.g. web-scraping or calling APIs. When opening an existing project, you should open the .Rproj file first and only subsequently open any R scripts (files with the .R extension) from the RStudio session, rather than going straight to the R scripts to open them. You can think of opening the .Rproj file as an ‘initialisation’ step for the RStudio session, which ensures that everything you run from this session can find the proper file paths within that directory. RStudio has more <a href="https://support.rstudio.com/hc/en-us/articles/200526207-Using-Projects">detailed documentation on RStudio projects</a> that is worth checking out, including more information on .RData and .Rhistory files. <a href="https://r4ds.had.co.nz/workflow-projects.html">Chapter 8</a> (<em>Workflow: projects</em>) of <em>R for Data Science</em> also gives a ‘quick start’ guide on how to use RStudio projects.</p>
</div>
<div id="structuring-your-working-directory" class="section level2">
<h2>Structuring your working directory 🔨</h2>
<p>Aside from using RStudio projects, it’s also good practice to structure your directory in a way that helps anybody else you are collaborating with - or a future version of you trying to reproduce some analysis - to navigate the analysis easily. I recommend the following as a basic ‘starter’ directory set up:</p>
<div class="figure">
<img src="https://raw.githubusercontent.com/martinctc/blog/master/images/RPROJECT_2000dpi.png" alt="Basic Structure" />
<p class="caption">Basic Structure</p>
</div>
<p>In your working directory, you will have the following:</p>
<ul>
<li><strong>Data</strong> - this is the subfolder where I save any files that I need to read into R in order to do my analysis or visualisation. These could be anything from SPSS (*.sav) files, Excel / CSV files, .FST or .RDS files. The key idea is that these are <strong>source data files</strong>, and at no point should R be saving over or editing these files in order to ensure reproducibility. The reasoning is that reproducible analysis isn’t really possible if the source data file keeps getting changed by the analysis (think analysis in spreadsheets). If you do need to change the source data file, create a new version and ensure that the new file name appropriately reflects that change.</li>
<li><strong>Script</strong> - this is where I save my R scripts and RMarkdown files (files with the extensions *.R and *.Rmd).
<ul>
<li><strong>Analysis</strong> - All my main analysis R scripts are saved here. For most intents and purposes, I think it is fine to have multiple scripts that perform different tasks saved here. I don’t personally have one project per distinct piece of analysis, as this could get out of hand when I may have 20+ different analyses that I’d like to perform on a single dataset. My (actually quite simple) rule-of-thumb for deciding whether to separate out an analysis is to imagine whether someone completely new to the project would be able to navigate and figure out what is going on with this directory. As a side note - thoughtful and sensible file names help a lot!</li>
<li><strong>Functions</strong> - It is optional whether you have your custom functions saved in a separate sub-folder. I find this convenient personally because if I want to re-use a function that I remember I’ve written in a particular project, I can at a quick glance browse all the functions I’ve written for that project. Saving functions separately accompanies a workflow where you use <code>source()</code> to read functions into the ‘main analysis script’, rather than keeping them together with the main analysis.</li>
<li><strong>RMarkdown files</strong> - RMarkdown files are a special case, as they work slightly differently to .R files in terms of file paths, i.e. they behave like mini projects of their own, where the default working directory is where the Rmd file is saved. To save RMarkdown files in this set up, it’s recommended that you use the <a href="https://github.com/jennybc/here_here">{here}</a> package and its workflow. Alternatively, you can run <code>knitr::opts_knit$set(root.dir = "../")</code> in your setup chunk so that the working directory is set in the root directory rather than another sub-folder where the RMarkdown file is saved (less ideal than using {here}). In <a href="https://martinctc.github.io/blog/first-world-problems-very-long-rmarkdown-documents/">my other post</a>, I briefly discussed a directory structure for combining multiple RMarkdown files into a single long RMarkdown document.</li>
</ul></li>
<li><strong>Output</strong> - Save all your outputs here, including plots, HTML, and data exports.
<ul>
<li>Having this Output folder helps others identify what files are <strong>outputs</strong> of the code, as opposed to source files that were used to produce the analysis.</li>
<li>What you have set up as the sub-folders doesn’t matter too much, as long as they’re sensible. You may decide to set up the sub-folders so that they align with the analysis rather than type of file export.</li>
<li>The <code>timed_fn()</code> function from my package <a href="https://www.github.com/martinctc">surveytoolbox</a> (available on GitHub) helps create timestamps for file names, which I use often to ensure that I don’t lose work when I am iterating analysis.</li>
</ul></li>
</ul>
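<p>For illustration only, the ‘starter’ structure above could be created from R along the following lines. This is a minimal sketch rather than a prescribed method: the project root sits in a temporary folder so that it is safe to run, <code>add_one()</code> is a made-up helper used to show the <code>source()</code> workflow, and the timestamped file name is a base R stand-in rather than <code>timed_fn()</code> itself.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r"># Hypothetical project root inside a temporary folder, so the sketch is safe to run
root <- file.path(tempdir(), "SurveyAnalysis1")

folders <- c("Data",
             file.path("Script", "Analysis"),
             file.path("Script", "Functions"),
             "Output")

# Create each folder, including any missing parent folders
for (f in folders) {
  dir.create(file.path(root, f), recursive = TRUE, showWarnings = FALSE)
}

# Save a (made-up) helper function in Script/Functions ...
fn_file <- file.path(root, "Script", "Functions", "add_one.R")
writeLines("add_one <- function(x) x + 1", fn_file)

# ... and read it into the main analysis script with source()
source(fn_file)
add_one(1)

# Base R stand-in for a timestamped output file name (not timed_fn() itself)
out_file <- file.path(root, "Output",
                      paste0("results_", format(Sys.time(), "%Y%m%d_%H%M%S"), ".csv"))
write.csv(head(iris), out_file, row.names = FALSE)</code></pre></div>
<p>In a real project, <code>root</code> would simply be the folder that holds the .Rproj file, and the paths in your scripts would be written relative to it.</p>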
<p>This directory structure ‘template’ should provide a good starting point for organising projects if a project workflow is new to you. However, whilst having consistency is great, different projects will have different needs, and therefore one should always think about what is needed and what will happen when setting up the working directory structure, and adapt appropriately.</p>
</div>
<div id="further-reading" class="section level2">
<h2>Further reading 📖</h2>
<p>For further reading, Chris Von Csefalvay has this <a href="https://chrisvoncsefalvay.com/2018/08/09/structuring-r-projects/">great article on structuring R projects</a>, which provides a more detailed guide on what you should consider when structuring your R projects. He recommends also having a README file available for each project (saved at the root directory), which is particularly important for projects with more complexity.</p>
<p>As per usual, feedback, comments, and questions are all very welcome! If you like this post please do check out my other posts on <a href="https://martinctc.github.io/blog/" class="uri">https://martinctc.github.io/blog/</a>.</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>GitHub repositories are structured as working directories, hence it would make sense to learn how to structure a working directory before learning about how to use GitHub.<a href="#fnref1" class="footnote-back">↩</a></p></li>
<li id="fn2"><p><em>Garrett Grolemund</em> and Hadley Wickham’s <em>R for Data Science</em> book makes a similar recommendation in chapter <a href="https://r4ds.had.co.nz/workflow-projects.html">8.3</a>.<a href="#fnref2" class="footnote-back">↩</a></p></li>
</ol>
</div>
</section>Martin ChanVignette: Downloadable tables in RMarkdown with the DT package2019-12-25T00:00:00+00:002019-12-25T00:00:00+00:00https://martinctc.github.io/blog/vignette-downloadable-tables-in-rmarkdown-with-the-dt-package<section class="main-content">
<div id="background" class="section level2">
<h2>Background</h2>
<p>In an earlier post in April this year, I discussed using <a href="https://martinctc.github.io/blog/my-favourite-alternative-to-excel-dashboards/">flexdashboard</a> (with <strong>RMarkdown</strong>) as an appealing and practical R alternative to Excel-based reporting dashboards. Since it’s possible to (i) export these ‘flexdashboards’ as static HTML files that can be opened on practically any computer (virtually no dependencies), (ii) share them as attachments over email, and (iii) run them without relying on servers and Internet access, they rival ‘traditional’ Excel dashboards on <em>portability</em>. This is an advantage that you don’t really get with other dashboarding solutions such as Tableau and <strong>Shiny</strong>, as far as I’m aware.</p>
<p>Traditionally, people also like Excel dashboards for another reason, which is that all the data that is reported in the dashboard is usually <em>self-contained</em> and available in the Excel file in itself, provided that the source data within Excel isn’t hidden and protected. This enables any keen user to extract the source data to produce charts or analysis on their own “off-dashboard”. Moreover, having the data available within the dashboard itself helps with <em>reproducibility</em>, in the sense that one can more easily trace back the relationship between the source data and the reported analysis or visualisation.</p>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/dashboard-excel-flexdashboard-meme.jpg" /></p>
<p>In this post, I am going to share a trick on how to implement this feature within <strong>RMarkdown</strong> (which means you can also do this in <strong>flexdashboard</strong>) such that the users of your dashboards can export/download your source data. This will be implemented using the <a href="https://rstudio.github.io/DT/">DT</a> package created by <a href="https://rstudio.com/">RStudio</a>, which provides an R interface to the JavaScript library <a href="https://datatables.net/">DataTables</a>.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<p>(Credits to <a href="https://datastrategywithjonathan.com/">Jonathan Ng</a> for sharing this trick with me in the first place! His original video tutorial that first mentions this is available <a href="https://www.youtube.com/watch?v=ux2tQqgY8Gg">here</a>)</p>
</div>
<div id="the-dt-package" class="section level2">
<h2>The DT package</h2>
<p>In a nutshell, <a href="https://github.com/rstudio/DT">DT</a> is an R package that enables the creation of interactive, pretty HTML tables with fancy features such as filter, search, scroll, pagination, and sort - to name a few. Since <strong>DT</strong> generates an <a href="https://www.htmlwidgets.org/showcase_leaflet.html">HTML widget</a> (e.g. just like what <strong>leaflet</strong>, <strong>rbokeh</strong>, and <strong>plotly</strong> do), it can be used in RMarkdown HTML outputs and Shiny dashboards. I’ve personally found <strong>DT</strong> very useful when creating RMarkdown documents (knitted to HTML) because it allows you to create professional-looking, business-ready interactive tables with literally only a couple of lines of code, and you can do this entirely in R without knowing any JavaScript. The other alternative packages that perform a similar job of producing quick and pretty HTML tables are <a href="https://github.com/renkun-ken/formattable">formattable</a>, <code>knitr::kable()</code> and <a href="https://github.com/haozhu233/kableExtra">kableExtra</a>, but as far as I’m aware only <strong>DT</strong> allows you to add these ‘data download’ buttons that we are focussing on in this post.</p>
</div>
<div id="downloadable-tables" class="section level2">
<h2>Downloadable tables</h2>
<p>What we are trying to get to is an interactive table with buttons that allow you to perform the following actions:</p>
<ul>
<li>Copy to clipboard</li>
<li>Export to CSV</li>
<li>Export to Excel</li>
<li>Export to PDF</li>
<li>Print</li>
</ul>
<p>Though you might require only one or two of the above buttons, I’m going to go through an example that lets you do all five at the same time. The below is what the <a href="https://martinctc.github.io/blog/examples/dt-download-example.html">final output</a> looks like, using the <code>iris</code> dataset, where the download options are shown at the top of the widget:</p>
<p><img src="https://raw.githubusercontent.com/martinctc/blog/master/images/dt-downloadable.PNG" /></p>
<p>To see what the interactive version is like, click <a href="https://martinctc.github.io/blog/examples/dt-download-example.html">here</a>.</p>
</div>
<div id="the-solution" class="section level2">
<h2>The Solution</h2>
<p>The main function from <strong>DT</strong> to create the interactive table is <code>DT::datatable()</code>. The first argument accepts a data frame, so this makes it easy to use it with <strong>dplyr</strong> / <strong>magrittr</strong> pipes. This is how we will create the above table, using the inbuilt <code>iris</code> dataset:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode r"><code class="sourceCode r"><span class="kw">library</span>(tidyverse)
<span class="kw">library</span>(DT)
iris <span class="op">%>%</span>
<span class="kw">datatable</span>(<span class="dt">extensions =</span> <span class="st">'Buttons'</span>,
<span class="dt">options =</span> <span class="kw">list</span>(<span class="dt">dom =</span> <span class="st">'Blfrtip'</span>,
<span class="dt">buttons =</span> <span class="kw">c</span>(<span class="st">'copy'</span>, <span class="st">'csv'</span>, <span class="st">'excel'</span>, <span class="st">'pdf'</span>, <span class="st">'print'</span>),
<span class="dt">lengthMenu =</span> <span class="kw">list</span>(<span class="kw">c</span>(<span class="dv">10</span>,<span class="dv">25</span>,<span class="dv">50</span>,<span class="op">-</span><span class="dv">1</span>),
<span class="kw">c</span>(<span class="dv">10</span>,<span class="dv">25</span>,<span class="dv">50</span>,<span class="st">"All"</span>))))</code></pre></div>
<p>And here is a brief explanation for each of the arguments used in the above code:</p>
<ul>
<li><p><strong>extensions</strong>: this takes in a character vector of the names of <a href="https://rstudio.github.io/DT/plugins.html">DataTables plug-ins</a>, but only plugins supported by the DT package can be used here. We’ll just put ‘Buttons’ here.</p></li>
<li><p><strong>options</strong>: this argument is where you feed in all the additional customisation options, which are specified in a list.<a href="#fn2" class="footnote-ref" id="fnref2"><sup>2</sup></a> I usually think of these as ‘expanded features’ that aren’t / haven’t been built into the <strong>DT</strong> package yet, but are available in the ‘source’ JavaScript library <strong>DataTables</strong>.</p>
<ul>
<li><p><strong>dom</strong>: This argument defines the table control elements to appear on the page and in what order. Here, we have specified this to be <code>Blfrtip</code>, where:</p>
<ul>
<li><p><code>B</code> stands for <strong>b</strong>uttons,</p></li>
<li><p><code>l</code> for <strong>l</strong>ength changing input control,</p></li>
<li><p><code>f</code> for <strong>f</strong>iltering input,</p></li>
<li><p><code>r</code> for p<strong>r</strong>ocessing display element,</p></li>
<li><p><code>t</code> for the <strong>t</strong>able,</p></li>
<li><p><code>i</code> for table <strong>i</strong>nformation summary,</p></li>
<li><p>and finally, <code>p</code> for <strong>p</strong>agination display.</p></li>
</ul>
<p>You may move the letters around to control where each element is placed; for instance, <code>lfrtipB</code> would place the buttons at the very bottom of the widget.</p></li>
<li><p><strong>buttons</strong>: you pass a character vector through to specify what buttons to actually display in the widget, where ‘copy’ stands for copy to clipboard, ‘csv’ stands for ‘export to csv’, etc.</p></li>
<li><p><strong>lengthMenu</strong>: this allows you to specify display options for how many rows of data to display on each page. Here, I’ve passed a list through with two vectors, where the first specifies the page length values and the second the displayed options.</p></li>
</ul></li>
</ul>
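<p>To make the effect of these options more concrete, here is a hedged variation on the table above. It assumes the <strong>DT</strong> package is installed, moves the buttons to the bottom of the widget with <code>lfrtipB</code>, and keeps only the CSV and Excel buttons; the object name <code>tbl_bottom</code> is just for illustration.</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(DT)

# Same iris table as above, but with the buttons moved to the bottom ('lfrtipB')
# and only the CSV and Excel export buttons kept
tbl_bottom <- datatable(iris,
                        extensions = 'Buttons',
                        options = list(dom = 'lfrtipB',
                                       buttons = c('csv', 'excel'),
                                       lengthMenu = list(c(10, 25, 50, -1),
                                                         c(10, 25, 50, "All"))))

# Printing the object displays the widget
tbl_bottom</code></pre></div>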
<p>Try it out! Note that if you run this code in an R script, the table will open up in your Viewer Pane in RStudio, but you will need to run the code within an <strong>RMarkdown</strong> document in order to produce a share-able HTML output.</p>
</div>
<div id="create-a-function-for-cleaner-code" class="section level2">
<h2>Create a function (for cleaner code)</h2>
<p>I’ve wrapped the solution in a handy function called <code>create_dt()</code>, which adds a bit of convenience: I can simply load this script at the beginning of an RMarkdown document and then call the function throughout the document, whenever I want to display the data and make them downloadable. Here it is:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode r"><code class="sourceCode r">create_dt <-<span class="st"> </span><span class="cf">function</span>(x){
DT<span class="op">::</span><span class="kw">datatable</span>(x,
<span class="dt">extensions =</span> <span class="st">'Buttons'</span>,
<span class="dt">options =</span> <span class="kw">list</span>(<span class="dt">dom =</span> <span class="st">'Blfrtip'</span>,
<span class="dt">buttons =</span> <span class="kw">c</span>(<span class="st">'copy'</span>, <span class="st">'csv'</span>, <span class="st">'excel'</span>, <span class="st">'pdf'</span>, <span class="st">'print'</span>),
<span class="dt">lengthMenu =</span> <span class="kw">list</span>(<span class="kw">c</span>(<span class="dv">10</span>,<span class="dv">25</span>,<span class="dv">50</span>,<span class="op">-</span><span class="dv">1</span>),
<span class="kw">c</span>(<span class="dv">10</span>,<span class="dv">25</span>,<span class="dv">50</span>,<span class="st">"All"</span>))))
}</code></pre></div>
<p>You can customise this function to suit whatever needs you have for your project, but I find creating a function for the task of generating <strong>DT</strong> tables just makes the overall code cleaner, shorter, and easier to follow.</p>
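<p>One possible customisation, sketched here under my own assumptions, is to expose the button set and the page lengths as arguments so that each table in a document can differ slightly. The function name <code>create_dt2()</code> and the argument names <code>buttons</code> and <code>page_lengths</code> are hypothetical and only for illustration:</p>
<div class="sourceCode"><pre class="sourceCode r"><code class="sourceCode r">library(DT)

# Hypothetical variant of create_dt(): same defaults as the original,
# but the buttons and page lengths can be overridden per call.
create_dt2 <- function(x,
                       buttons = c("copy", "csv", "excel", "pdf", "print"),
                       page_lengths = c(10, 25, 50)) {
  DT::datatable(
    x,
    extensions = "Buttons",
    options = list(
      dom = "Blfrtip",
      buttons = buttons,
      # append -1 / "All" so that showing every row is always an option
      lengthMenu = list(c(page_lengths, -1), c(page_lengths, "All"))
    )
  )
}

# Usage: a table with only the copy and CSV buttons
small_dt <- create_dt2(head(iris, 30), buttons = c("copy", "csv"))
small_dt</code></pre></div>
<p>Calling <code>create_dt2(iris)</code> with no extra arguments should behave like the original <code>create_dt()</code>, since the defaults mirror the values used above.</p>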
</div>
<div id="end-notes" class="section level2">
<h2>End notes</h2>
<p>Hope you enjoyed this short vignette.</p>
<p>Do comment down below if you find this useful, or if you have any related ideas or suggestions you’d like to share. If you liked this post, please do check out my <a href="https://martinctc.github.io/blog/">blog</a> for more R and data science related content.</p>
<p>And have a Merry Christmas everyone!</p>
</div>
<div class="footnotes">
<hr />
<ol>
<li id="fn1"><p>Not to be confused with the <a href="https://github.com/Rdatatable/data.table">data.table</a> package, which is practically a “super” package for <a href="https://martinctc.github.io/blog/using-data.table-with-magrittr-pipes-best-of-both-worlds/">fast data manipulation and wrangling</a>.<a href="#fnref1" class="footnote-back">↩</a></p></li>
<li id="fn2"><p>See <a href="https://datatables.net/reference/option/" class="uri">https://datatables.net/reference/option/</a> for full documentation of the options.<a href="#fnref2" class="footnote-back">↩</a></p></li>
</ol>
</div>
</section>Martin Chan