Example • ezPurrr

The ezPurrr package is built on top of the purrr library. Its purpose is to make functional programming workflows simpler and more efficient. Let’s start by loading ezPurrr and a few other packages we’ll use for data manipulation and plotting in this vignette.

library(ezPurrr)
library(dplyr)
library(tidyr)
library(ggplot2)

ezPurrr requires a nested dataset. For example:

nest_df <- palmerpenguins::penguins %>%
  group_by(island, species) %>%
  nest()

head(nest_df)
#> # A tibble: 5 × 3
#> # Groups:   island, species [5]
#>   species   island    data              
#>   <fct>     <fct>     <list>            
#> 1 Adelie    Torgersen <tibble [52 × 6]> 
#> 2 Adelie    Biscoe    <tibble [44 × 6]> 
#> 3 Adelie    Dream     <tibble [56 × 6]> 
#> 4 Gentoo    Biscoe    <tibble [124 × 6]>
#> 5 Chinstrap Dream     <tibble [68 × 6]>

This data frame has columns for the penguin species and the island on which the data were collected but, critically, it also has an additional data column. This is a list column that holds all the data for the corresponding species/island row. All functions in ezPurrr require the data to be in this format first.
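To make the list-column structure concrete, here is a small sketch that uses only dplyr and tidyr (no ezPurrr functions) to inspect one nested tibble by hand and to check that unnest() recovers the original rows. The object names first_group and flat_df are ours, chosen purely for illustration.

```r
library(dplyr)
library(tidyr)

nest_df <- palmerpenguins::penguins %>%
  group_by(island, species) %>%
  nest()

# The data column is an ordinary list holding one tibble per island/species row
is.list(nest_df$data)

# Pull out the tibble stored in the first row
first_group <- nest_df$data[[1]]
dim(first_group)

# unnest() reverses nest(), giving back one row per penguin
flat_df <- nest_df %>%
  unnest(data) %>%
  ungroup()
nrow(flat_df) == nrow(palmerpenguins::penguins)
```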

From here, we want to be able to work with the data as usual, but that is difficult when the data are in this list-column format. This is where the sample_*() functions come in.

`sample_*()`

`sample_row()`

The sample_row() function returns one row from the nested data, either chosen at random or, when an index argument is supplied, the row at that position. The result is a list or a data frame. If returning a list (the default), the first item of the list will be the data from the given row, while the remaining elements will be the grouping variables from that row (i.e., the species and island for the corresponding row in our previous example). If returning a data frame (with the type = 'df' argument), the data frame will contain the nested data from that row, plus one column per grouping variable, with the grouping value repeated in every row.

For example:

nest_df %>% 
  sample_row()
#> row 5 is selected randomly
#> $data
#> # A tibble: 68 × 6
#>    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
#>             <dbl>         <dbl>             <int>       <int> <fct>  <int>
#>  1           46.5          17.9               192        3500 female  2007
#>  2           50            19.5               196        3900 male    2007
#>  3           51.3          19.2               193        3650 male    2007
#>  4           45.4          18.7               188        3525 female  2007
#>  5           52.7          19.8               197        3725 male    2007
#>  6           45.2          17.8               198        3950 female  2007
#>  7           46.1          18.2               178        3250 female  2007
#>  8           51.3          18.2               197        3750 male    2007
#>  9           46            18.9               195        4150 female  2007
#> 10           51.3          19.9               198        3700 male    2007
#> # … with 58 more rows
#> 
#> $island
#> [1] Dream
#> Levels: Biscoe Dream Torgersen
#> 
#> $species
#> [1] Chinstrap
#> Levels: Adelie Chinstrap Gentoo

nest_df %>% 
  sample_row(index = 3) %>% 
  head()
#> $data
#> # A tibble: 56 × 6
#>    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
#>             <dbl>         <dbl>             <int>       <int> <fct>  <int>
#>  1           39.5          16.7               178        3250 female  2007
#>  2           37.2          18.1               178        3900 male    2007
#>  3           39.5          17.8               188        3300 female  2007
#>  4           40.9          18.9               184        3900 male    2007
#>  5           36.4          17                 195        3325 female  2007
#>  6           39.2          21.1               196        4150 male    2007
#>  7           38.8          20                 190        3950 male    2007
#>  8           42.2          18.5               180        3550 female  2007
#>  9           37.6          19.3               181        3300 female  2007
#> 10           39.8          19.1               184        4650 male    2007
#> # … with 46 more rows
#> 
#> $island
#> [1] Dream
#> Levels: Biscoe Dream Torgersen
#> 
#> $species
#> [1] Adelie
#> Levels: Adelie Chinstrap Gentoo

nest_df %>% 
  sample_row(index = 3, type = 'df') %>% 
  head()
#> # A tibble: 6 × 8
#>   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
#>            <dbl>         <dbl>             <int>       <int> <fct>  <int>
#> 1           39.5          16.7               178        3250 female  2007
#> 2           37.2          18.1               178        3900 male    2007
#> 3           39.5          17.8               188        3300 female  2007
#> 4           40.9          18.9               184        3900 male    2007
#> 5           36.4          17                 195        3325 female  2007
#> 6           39.2          21.1               196        4150 male    2007
#> # … with 2 more variables: group.island <fct>, group.species <fct>
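For intuition, the two return shapes described above can be approximated by hand with ordinary subsetting and unnest(). This is only a sketch of the described behavior, not ezPurrr's actual implementation; the names i, row_i, by_hand, and by_hand_df are hypothetical.

```r
library(dplyr)
library(tidyr)

nest_df <- palmerpenguins::penguins %>%
  group_by(island, species) %>%
  nest()

i <- 3                          # the row we want to inspect
row_i <- ungroup(nest_df)[i, ]  # a one-row tibble: species, island, data

# List form: the nested data first, then each grouping variable
by_hand <- list(
  data    = row_i$data[[1]],
  island  = row_i$island,
  species = row_i$species
)

# Data frame form: unnesting the single row repeats the grouping values
by_hand_df <- unnest(row_i, data)
```

One visible difference: in the output above, sample_row() names the grouping columns group.island and group.species, whereas unnest() keeps the original names island and species.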

`sample_data()`

sample_data() returns the nested data from one random row or one particular row (with the index argument). The primary difference between sample_data() and sample_row() is that the grouping variables are not returned, just the data. For example:

nest_df %>% 
  sample_data()
#> row 2 is selected randomly
#> # A tibble: 44 × 6
#>    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
#>             <dbl>         <dbl>             <int>       <int> <fct>  <int>
#>  1           37.8          18.3               174        3400 female  2007
#>  2           37.7          18.7               180        3600 male    2007
#>  3           35.9          19.2               189        3800 female  2007
#>  4           38.2          18.1               185        3950 male    2007
#>  5           38.8          17.2               180        3800 male    2007
#>  6           35.3          18.9               187        3800 female  2007
#>  7           40.6          18.6               183        3550 male    2007
#>  8           40.5          17.9               187        3200 female  2007
#>  9           37.9          18.6               172        3150 female  2007
#> 10           40.5          18.9               180        3950 male    2007
#> # … with 34 more rows

nest_df %>% 
  sample_data(index = 3)
#> # A tibble: 56 × 6
#>    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
#>             <dbl>         <dbl>             <int>       <int> <fct>  <int>
#>  1           39.5          16.7               178        3250 female  2007
#>  2           37.2          18.1               178        3900 male    2007
#>  3           39.5          17.8               188        3300 female  2007
#>  4           40.9          18.9               184        3900 male    2007
#>  5           36.4          17                 195        3325 female  2007
#>  6           39.2          21.1               196        4150 male    2007
#>  7           38.8          20                 190        3950 male    2007
#>  8           42.2          18.5               180        3550 female  2007
#>  9           37.6          19.3               181        3300 female  2007
#> 10           39.8          19.1               184        4650 male    2007
#> # … with 46 more rows

`sample_group()`

Finally, sample_group() returns the grouping columns from one random row or one particular row (with the index argument) of the nested dataset, without the data. For example:

nest_df %>% 
  sample_group()
#> row 3 is selected randomly
#> # A tibble: 1 × 2
#> # Groups:   island, species [1]
#>   species island
#>   <fct>   <fct> 
#> 1 Adelie  Dream

nest_df %>% 
  sample_group(index = 3)
#> # A tibble: 1 × 2
#> # Groups:   island, species [1]
#>   species island
#>   <fct>   <fct> 
#> 1 Adelie  Dream

`broadcast()`

Generally, we will use the data we have sampled to develop some operations. Once those operations work, we just need to wrap them in a function, then broadcast() it across all the sets of data (i.e., all rows).

In other words, the sampled data set is used to test the code you want applied to each row, such as figures, models, or transformations.

Let’s look at an example for plotting. First, we create our sample data.

samp <- nest_df %>% 
  sample_data(index = 3)

samp
#> # A tibble: 56 × 6
#>    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex     year
#>             <dbl>         <dbl>             <int>       <int> <fct>  <int>
#>  1           39.5          16.7               178        3250 female  2007
#>  2           37.2          18.1               178        3900 male    2007
#>  3           39.5          17.8               188        3300 female  2007
#>  4           40.9          18.9               184        3900 male    2007
#>  5           36.4          17                 195        3325 female  2007
#>  6           39.2          21.1               196        4150 male    2007
#>  7           38.8          20                 190        3950 male    2007
#>  8           42.2          18.5               180        3550 female  2007
#>  9           37.6          19.3               181        3300 female  2007
#> 10           39.8          19.1               184        4650 male    2007
#> # … with 46 more rows

Next, we write the code to create our plot, using samp as our sample data to test our code.

ggplot(samp, aes(x = bill_length_mm,
                 y = bill_depth_mm,
                 color = body_mass_g)) +
  geom_point()

Finally, we can wrap this into a function.

plotting <- function(data){
  ggplot(data, aes(x = bill_length_mm,
                   y = bill_depth_mm,
                   color = body_mass_g)) +
    geom_point()
}

Notice the code above is exactly what we had before; we’ve just replaced our sample data set with data, the argument of the function. Normally, writing functions for plotting (and for many other tidyverse functions) is difficult because of non-standard evaluation, but in this case we don’t have to worry about any of that, because the column names stay hard-coded and only the data frame is passed in.
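To illustrate the non-standard evaluation point: extra tooling such as rlang's `{{ }}` (embrace) operator is only needed once the column names themselves become function arguments. The sketch below contrasts the two cases. plot_fixed and plot_flexible are hypothetical names, and neither is part of ezPurrr.

```r
library(ggplot2)

# Column names are hard-coded; only the data frame varies
plot_fixed <- function(data) {
  ggplot(data, aes(x = bill_length_mm, y = bill_depth_mm)) +
    geom_point()
}

# Column names are arguments, so they must be embraced with {{ }}
plot_flexible <- function(data, x, y) {
  ggplot(data, aes(x = {{ x }}, y = {{ y }})) +
    geom_point()
}

penguins <- palmerpenguins::penguins
p1 <- plot_fixed(penguins)
p2 <- plot_flexible(penguins, bill_length_mm, bill_depth_mm)
```

The plotting() function defined above is of the first kind, which is why it can be written without any of this.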

Now, we can broadcast() the function to all of the data!

broadcasted_df = nest_df %>% 
  broadcast(plotting)
broadcasted_df
#> # A tibble: 5 × 4
#> # Groups:   island, species [5]
#>   species   island    data               output
#>   <fct>     <fct>     <list>             <list>
#> 1 Adelie    Torgersen <tibble [52 × 6]>  <gg>  
#> 2 Adelie    Biscoe    <tibble [44 × 6]>  <gg>  
#> 3 Adelie    Dream     <tibble [56 × 6]>  <gg>  
#> 4 Gentoo    Biscoe    <tibble [124 × 6]> <gg>  
#> 5 Chinstrap Dream     <tibble [68 × 6]>  <gg>

Notice we now have a new column, called output, which holds all the plots! To look at the first plot, we can print it with:

broadcasted_df$output[[1]]
#> Warning: Removed 1 rows containing missing values (geom_point).

Or, similarly:

broadcasted_df %>% 
  pull(output) %>% 
  purrr::pluck(1)
#> Warning: Removed 1 rows containing missing values (geom_point).
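If you already know purrr, it may help to see a hand-written counterpart. The sketch below builds a similar output list column with purrr::map(). It assumes nothing about how broadcast() is implemented internally, and manual_df is a name of our choosing.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

nest_df <- palmerpenguins::penguins %>%
  group_by(island, species) %>%
  nest()

plotting <- function(data) {
  ggplot(data, aes(x = bill_length_mm,
                   y = bill_depth_mm,
                   color = body_mass_g)) +
    geom_point()
}

# Apply plotting() to every tibble in the data column and
# store the resulting plots in a new list column
manual_df <- nest_df %>%
  ungroup() %>%
  mutate(output = purrr::map(data, plotting))

# Each element of output is a ggplot object, one per island/species row
manual_df$output[[1]]
```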

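You can also use it for modeling or other functions. For example, we can first try a model on one sampled data set, then wrap a model in a function and broadcast it. Note that the test fit below includes an interaction term (*), while the function that is broadcast fits the additive model (+).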

tmp_df = nest_df %>% 
  sample_data(index = 3)

lm(body_mass_g ~ bill_length_mm * bill_depth_mm, data = tmp_df)
#> 
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm * bill_depth_mm, data = tmp_df)
#> 
#> Coefficients:
#>                  (Intercept)                bill_length_mm  
#>                   -2079.0346                       92.6987  
#>                bill_depth_mm  bill_length_mm:bill_depth_mm  
#>                     145.9699                       -0.6616

modeling <- function(data){
   lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
}

nest_df %>% 
  broadcast(modeling) %>% 
  .$output
#> [[1]]
#> 
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
#> 
#> Coefficients:
#>    (Intercept)  bill_length_mm   bill_depth_mm  
#>       -1161.63           47.79          163.14  
#> 
#> 
#> [[2]]
#> 
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
#> 
#> Coefficients:
#>    (Intercept)  bill_length_mm   bill_depth_mm  
#>       -3015.76           86.27          183.06  
#> 
#> 
#> [[3]]
#> 
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
#> 
#> Coefficients:
#>    (Intercept)  bill_length_mm   bill_depth_mm  
#>       -1622.93           80.87          120.41  
#> 
#> 
#> [[4]]
#> 
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
#> 
#> Coefficients:
#>    (Intercept)  bill_length_mm   bill_depth_mm  
#>       -1452.12           57.64          252.96  
#> 
#> 
#> [[5]]
#> 
#> Call:
#> lm(formula = body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
#> 
#> Coefficients:
#>    (Intercept)  bill_length_mm   bill_depth_mm  
#>        -356.11           23.82          158.84

Instead of returning the whole model object, the function can return a single number (a double), such as one coefficient.


modeling <- function(data){
  model = lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
  model$coefficients[2] # the slope for bill_length_mm
}

nest_df %>% broadcast(modeling) %>% .$output
#> [[1]]
#> bill_length_mm 
#>       47.78764 
#> 
#> [[2]]
#> bill_length_mm 
#>       86.27406 
#> 
#> [[3]]
#> bill_length_mm 
#>       80.86942 
#> 
#> [[4]]
#> bill_length_mm 
#>       57.64173 
#> 
#> [[5]]
#> bill_length_mm 
#>       23.82257
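Note that the output printed above is still a list, with one length-one double per row. If a plain numeric column is more convenient, one option outside ezPurrr is purrr::map_dbl(), sketched here. slope_bill_length and slopes_df are hypothetical names, and this is not a description of what broadcast() does internally.

```r
library(dplyr)
library(tidyr)

nest_df <- palmerpenguins::penguins %>%
  group_by(island, species) %>%
  nest()

# Return the bill_length_mm slope as a bare (unnamed) double
slope_bill_length <- function(data) {
  model <- lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
  unname(coef(model)["bill_length_mm"])
}

# map_dbl() collects the results into a numeric vector instead of a list
slopes_df <- nest_df %>%
  ungroup() %>%
  mutate(slope = purrr::map_dbl(data, slope_bill_length)) %>%
  select(species, island, slope)

slopes_df
```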

`broadcast_group()`

broadcast_group() also lets you use the grouping variables in the function. It is still under development, so it currently works only in a more limited way. For example, you can use the grouping variables in the title of each plot.

df = nest_df %>% 
  sample_row(index = 3, type = 'list')

plotting = function(data, species, island){
  ggplot(data, aes(x = bill_length_mm,
                   y = bill_depth_mm,
                   color = body_mass_g)) +
    geom_point() +
    labs(title = paste(species, island))
}

# example with a single plot
plotting(df$data, df$species, df$island)


nest_df = nest_df %>% 
  broadcast_group(plotting)

nest_df$output[[1]]
#> Warning: Removed 1 rows containing missing values (geom_point).

nest_df$output[[4]]
#> Warning: Removed 1 rows containing missing values (geom_point).
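A comparable result can be sketched in plain purrr with pmap(), which passes several columns to the function in parallel. This is an illustration under our own assumptions rather than a description of how broadcast_group() works; manual_df is a hypothetical name, and the grouping factors are converted to character so that paste() receives plain strings regardless of purrr version.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

nest_df <- palmerpenguins::penguins %>%
  group_by(island, species) %>%
  nest()

plotting <- function(data, species, island) {
  ggplot(data, aes(x = bill_length_mm,
                   y = bill_depth_mm,
                   color = body_mass_g)) +
    geom_point() +
    labs(title = paste(species, island))
}

# pmap() walks the three vectors in parallel and matches them
# positionally to plotting(data, species, island)
manual_df <- nest_df %>%
  ungroup() %>%
  mutate(output = purrr::pmap(
    list(data, as.character(species), as.character(island)),
    plotting
  ))

manual_df$output[[1]]
```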