# install.packages("tidyverse")
# install.packages("medicaldata")
library(tidyverse)
library(medicaldata)
count() Function in R’s tidyverse
Carlos Fernández
June 3, 2024
R’s tidyverse is a collection of packages designed for data science. One of its most useful functions is count(), part of the dplyr package. It counts the occurrences of each unique value in a dataset, a core step in exploratory data analysis (EDA). In this blog post, we’ll look at how count() works, particularly with the sort = TRUE option, and we’ll use a dataset from the medicaldata package to illustrate its application.
count()
The count() function in dplyr counts the number of occurrences of each unique value of one or more variables. It returns a data frame with the counts of these values. Here’s the basic syntax of the function:
count(x, ..., wt = NULL, sort = FALSE, name = NULL)
x: The data frame.
...: The variable(s) whose unique values, or combinations of values, are counted.
wt: Optional. If provided, counts are weighted by this variable: its values are summed within each group instead of counting rows.
sort: If TRUE, the resulting data frame is sorted in descending order by the count.
name: The name of the count column. If omitted, it defaults to “n”.
count() with sort = TRUE
Setting sort = TRUE orders the output by the count in descending order, which is helpful when you want to quickly see the most frequent values.
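To make the arguments concrete, here is a minimal sketch on a small made-up table. The tibble `toy` and its columns `clinic`, `result`, and `tests` are hypothetical and exist only for illustration; they are not part of the medicaldata package.

```r
library(dplyr)

# Hypothetical toy data (not from the post): one row per clinic/result pair,
# with a pre-aggregated number of tests in `tests`.
toy <- tibble(
  clinic = c("a", "a", "b", "b", "c"),
  result = c("negative", "positive", "negative", "negative", "negative"),
  tests  = c(10, 2, 5, 7, 1)
)

# Unweighted: number of rows per result, most frequent first.
rows_per_result <- toy %>%
  count(result, sort = TRUE)

# Weighted: sum `tests` within each result and rename the count column.
tests_per_result <- toy %>%
  count(result, wt = tests, sort = TRUE, name = "n_tests")

print(rows_per_result)
print(tests_per_result)
```

The first call counts rows, while the second sums the `tests` column within each result, which is the behaviour to reach for when the data are already partly aggregated.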
medicaldata package
The medicaldata package contains various datasets from medical research, which are great for demonstration. We’ll use the covid_testing dataset for our example.
First, install and load the required packages:
Next, load the covid_testing dataset and take a look at its structure:
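The call for this step is not shown in the text. The column-wise listing that follows looks like the output of dplyr’s glimpse(), so a sketch under that assumption would be:

```r
library(dplyr)
library(medicaldata)

# Make the load explicit; the dataset ships with the medicaldata package.
data("covid_testing", package = "medicaldata")

# Print one line per column: name, type, and the first few values.
glimpse(covid_testing)
```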
Rows: 15,524
Columns: 17
$ subject_id <dbl> 1412, 533, 9134, 8518, 8967, 11048, 663, 2158, 3794, 4…
$ fake_first_name <chr> "jhezane", "penny", "grunt", "melisandre", "rolley", "…
$ fake_last_name <chr> "westerling", "targaryen", "rivers", "swyft", "karstar…
$ gender <chr> "female", "female", "male", "female", "male", "female"…
$ pan_day <dbl> 4, 7, 7, 8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 1…
$ test_id <chr> "covid", "covid", "covid", "covid", "covid", "covid", …
$ clinic_name <chr> "inpatient ward a", "clinical lab", "clinical lab", "c…
$ result <chr> "negative", "negative", "negative", "negative", "negat…
$ demo_group <chr> "patient", "patient", "patient", "patient", "patient",…
$ age <dbl> 0.0, 0.0, 0.8, 0.8, 0.8, 0.8, 0.8, 0.0, 0.0, 0.9, 0.9,…
$ drive_thru_ind <dbl> 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, …
$ ct_result <dbl> 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45, 45…
$ orderset <dbl> 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, …
$ payor_group <chr> "government", "commercial", NA, NA, "government", "com…
$ patient_class <chr> "inpatient", "not applicable", NA, NA, "emergency", "r…
$ col_rec_tat <dbl> 1.4, 2.3, 7.3, 5.8, 1.2, 1.4, 2.6, 0.7, 1.0, 7.1, 2.5,…
$ rec_ver_tat <dbl> 5.2, 5.8, 4.7, 5.0, 6.4, 7.0, 4.2, 6.3, 5.6, 7.0, 3.8,…
The covid_testing dataset contains data from deidentified results of COVID-19 testing at the Children’s Hospital of Philadelphia (CHOP) in 2020. Suppose we want to count the number of participants according to the COVID-19 test result and sort the results by the count in descending order. Here’s how we can do that:
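The code for this step is missing from the text. A minimal sketch that matches the description, assuming the result column is named `result` as in the structure listing above (`result_counts` is a name chosen here for illustration):

```r
library(dplyr)
library(medicaldata)

# Count rows per test result, most frequent result first.
result_counts <- covid_testing %>%
  count(result, sort = TRUE)

print(result_counts)
```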
The output is a tibble where the first column is the test result, and the second column, named n, shows the counts. Because we used sort = TRUE, the result with the highest count (negative) appears first (n = 14,358 patients), followed by positive results (n = 865) and invalid results (n = 301).
You can also count combinations of multiple variables. For example, if we want to count combinations of result and gender:
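The call behind the table that follows is not shown. A sketch that should reproduce it, assuming both variables are passed to count() together with sort = TRUE (`result_gender_counts` is a name chosen here for illustration):

```r
library(dplyr)
library(medicaldata)

# Count rows for every combination of result and gender, largest group first.
result_gender_counts <- covid_testing %>%
  count(result, gender, sort = TRUE)

print(result_gender_counts)
```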
# A tibble: 6 × 3
result gender n
<chr> <chr> <int>
1 negative female 7237
2 negative male 7121
3 positive female 449
4 positive male 416
5 invalid male 155
6 invalid female 146
This will give us a data frame with the counts of each result-gender combination, sorted by the count.
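To see what count() is doing, it can help to write the same summary by hand. The sketch below is an illustration rather than anything from the original post: it compares count() with a group_by(), summarise(), and arrange() pipeline on the same two variables. The object names `with_count` and `by_hand` are made up for this example.

```r
library(dplyr)
library(medicaldata)

# Shortcut: count combinations and sort by the count.
with_count <- covid_testing %>%
  count(result, gender, sort = TRUE)

# Long form: group, count rows per group with n(), then sort descending.
by_hand <- covid_testing %>%
  group_by(result, gender) %>%
  summarise(n = n(), .groups = "drop") %>%
  arrange(desc(n))

# Compare the count columns of the two results.
print(identical(with_count$n, by_hand$n))
```

The long form is handy when you need more than counts from each group, while count() keeps the common case to a single line.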
The count() function is particularly useful for:
The count() function in R’s tidyverse is a powerful tool for summarizing data. By using the sort = TRUE option, you can quickly identify the most frequent values in your dataset. Whether you’re counting single variables or combinations of variables, count() simplifies the task and makes your data analysis workflow more efficient.
Happy counting!
Higgins P (2021). medicaldata: Data Package for Medical Datasets. R package version 0.2.0, https://CRAN.R-project.org/package=medicaldata.
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686 https://doi.org/10.21105/joss.01686.