class: center, middle, inverse, title-slide

# Lec 24 - Tidymodels
## Statistical Programming
### Sta 323 | Spring 2022
###
Dr. Colin Rundel --- exclude: true --- background-image: url(imgs/hex_tidymodels.png) background-position: center background-size: contain --- ## Tidymodels ```r library(tidymodels) ``` ``` ## ── Attaching packages ────────────────── tidymodels 0.2.0 ── ``` ``` ## ✓ broom 0.7.12 ✓ rsample 0.1.1 ## ✓ dials 0.1.0 ✓ tune 0.2.0 ## ✓ infer 1.0.0 ✓ workflows 0.2.6 ## ✓ modeldata 0.1.1 ✓ workflowsets 0.2.1 ## ✓ parsnip 0.2.1 ✓ yardstick 0.0.9 ## ✓ recipes 0.2.0 ``` ``` ## ── Conflicts ───────────────────── tidymodels_conflicts() ── ## x scales::discard() masks purrr::discard() ## x dplyr::filter() masks stats::filter() ## x recipes::fixed() masks stringr::fixed() ## x dplyr::lag() masks stats::lag() ## x rsample::populate() masks Rcpp::populate() ## x yardstick::spec() masks readr::spec() ## x recipes::step() masks stats::step() ## • Learn how to get started at https://www.tidymodels.org/start/ ``` --- ## Book data .pull-left[ ```r (books = DAAG::allbacks %>% as_tibble() %>% select(-area) %>% mutate( cover = forcats::fct_recode( cover, "hardback" = "hb", "paperback" = "pb" ) ) ) ``` ``` ## # A tibble: 15 × 3 ## volume weight cover ## <dbl> <dbl> <fct> ## 1 885 800 hardback ## 2 1016 950 hardback ## 3 1125 1050 hardback ## 4 239 350 hardback ## 5 701 750 hardback ## 6 641 600 hardback ## 7 1228 1075 hardback ## 8 412 250 paperback ## 9 953 700 paperback ## 10 929 650 paperback ## 11 1492 975 paperback ## 12 419 350 paperback ## 13 1010 950 paperback ## 14 595 425 paperback ## 15 1034 725 paperback ``` ] -- .pull-right[ ```r ggplot(books, aes(x=volume, y=weight, color = cover)) + geom_point(size=2) ``` <img src="Lec24_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" /> ] --- ## Building a tidymodel ```r linear_reg() ``` ``` ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` --- ## Building a tidymodel ```r linear_reg() %>% set_engine("lm") ``` ``` ## Linear Regression Model Specification 
(regression) ## ## Computational engine: lm ``` --- ## Building a tidymodel .pull-left[ ```r linear_reg() %>% set_engine("lm") %>% fit(weight ~ volume * cover, data = books) ``` ``` ## parsnip model object ## ## ## Call: ## stats::lm(formula = weight ~ volume * cover, data = data) ## ## Coefficients: ## (Intercept) volume ## 161.58654 0.76159 ## coverpaperback volume:coverpaperback ## -120.21407 -0.07573 ``` ] -- .pull-right[ ```r lm(weight ~ volume * cover, data = books) ``` ``` ## ## Call: ## lm(formula = weight ~ volume * cover, data = books) ## ## Coefficients: ## (Intercept) volume ## 161.58654 0.76159 ## coverpaperback volume:coverpaperback ## -120.21407 -0.07573 ``` ] --- background-image: url("imgs/hex_broom.png") background-position: 98% 2% background-size: 15% ## Tidy model objects .pull-left[ .small[ ```r summary(lm(weight ~ volume * cover, data = books)) ``` ``` ## ## Call: ## lm(formula = weight ~ volume * cover, data = books) ## ## Residuals: ## Min 1Q Median 3Q Max ## -89.67 -32.07 -21.82 17.94 215.91 ## ## Coefficients: ## Estimate Std. Error t value ## (Intercept) 161.58654 86.51918 1.868 ## volume 0.76159 0.09718 7.837 ## coverpaperback -120.21407 115.65899 -1.039 ## volume:coverpaperback -0.07573 0.12802 -0.592 ## Pr(>|t|) ## (Intercept) 0.0887 . ## volume 7.94e-06 *** ## coverpaperback 0.3209 ## volume:coverpaperback 0.5661 ## --- ## Signif. codes: ## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 
0.1 ' ' 1 ## ## Residual standard error: 80.41 on 11 degrees of freedom ## Multiple R-squared: 0.9297, Adjusted R-squared: 0.9105 ## F-statistic: 48.5 on 3 and 11 DF, p-value: 1.245e-06 ``` ] ] -- .pull-right[ <br/> <br/> <br/> <br/> .small[ ```r lm_tm = linear_reg() %>% set_engine("lm") %>% fit(weight ~ volume * cover, data = books) ``` ```r summary(lm_tm) ``` ``` ## Length Class Mode ## lvl 0 -none- NULL ## spec 5 linear_reg list ## fit 13 lm list ## preproc 1 -none- list ## elapsed 1 -none- list ``` ```r broom::tidy(lm_tm) ``` ``` ## # A tibble: 4 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.62e+2 86.5 1.87 8.87e-2 ## 2 volume 7.62e-1 0.0972 7.84 7.94e-6 ## 3 coverpaperback -1.20e+2 116. -1.04 3.21e-1 ## 4 volume:coverpaperback -7.57e-2 0.128 -0.592 5.66e-1 ``` ] ] --- ## Tidy model statistics ```r broom::glance(lm(weight ~ volume * cover, data = books)) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squared sigma statistic p.value df ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.930 0.911 80.4 48.5 0.00000124 3 ## # … with 6 more variables: logLik <dbl>, AIC <dbl>, ## # BIC <dbl>, deviance <dbl>, df.residual <int>, ## # nobs <int> ``` ```r broom::glance(lm_tm) ``` ``` ## # A tibble: 1 × 12 ## r.squared adj.r.squared sigma statistic p.value df ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.930 0.911 80.4 48.5 0.00000124 3 ## # … with 6 more variables: logLik <dbl>, AIC <dbl>, ## # BIC <dbl>, deviance <dbl>, df.residual <int>, ## # nobs <int> ``` --- ## Tidy model prediction ```r broom::augment(lm_tm, new_data = books) ``` ``` ## # A tibble: 15 × 5 ## volume weight cover .pred .resid ## <dbl> <dbl> <fct> <dbl> <dbl> ## 1 885 800 hardback 836. -35.6 ## 2 1016 950 hardback 935. 14.6 ## 3 1125 1050 hardback 1018. 31.6 ## 4 239 350 hardback 344. 6.39 ## 5 701 750 hardback 695. 54.5 ## 6 641 600 hardback 650. -49.8 ## 7 1228 1075 hardback 1097. -21.8 ## 8 412 250 paperback 324. -73.9 ## 9 953 700 paperback 695. 
5.00 ## 10 929 650 paperback 679. -28.5 ## 11 1492 975 paperback 1065. -89.7 ## 12 419 350 paperback 329. 21.3 ## 13 1010 950 paperback 734. 216. ## 14 595 425 paperback 449. -24.5 ## 15 1034 725 paperback 751. -25.6 ``` --- ## Putting it together .small[ ```r lm_tm %>% augment( new_data = tidyr::expand_grid( volume = seq(0, 1500, by=5), cover = c("hardback", "paperback") %>% as.factor() ) ) %>% rename(weight = .pred) %>% ggplot(aes(x = volume, y = weight, color = cover, group = cover)) + geom_line() + geom_point(data = books) ``` <img src="Lec24_files/figure-html/unnamed-chunk-16-1.png" width="80%" style="display: block; margin: auto;" /> ] --- background-image: url("imgs/hex_parsnip.png") background-position: 98% 2% background-size: 15% ## Why do we care? .pull-left[ .small[ ```r show_engines("linear_reg") ``` ``` ## # A tibble: 7 × 2 ## engine mode ## <chr> <chr> ## 1 lm regression ## 2 glm regression ## 3 glmnet regression ## 4 stan regression ## 5 spark regression ## 6 keras regression ## 7 brulee regression ``` ] ] .pull-right[ ] -- .small[ ```r (bayes_tm = linear_reg() %>% set_engine( "stan", prior_intercept = rstanarm::student_t(df = 1), prior = rstanarm::student_t(df = 1), seed = 1234 ) ) ``` ``` ## Linear Regression Model Specification (regression) ## ## Engine-Specific Arguments: ## prior_intercept = rstanarm::student_t(df = 1) ## prior = rstanarm::student_t(df = 1) ## seed = 1234 ## ## Computational engine: stan ``` ] --- ## Fitting with `rstanarm` .small[ ```r (bayes_tm = bayes_tm %>% fit(weight ~ volume * cover, data = books) ) ``` ``` ## Warning: There were 32 divergent transitions after warmup. See ## https://mc-stan.org/misc/warnings.html#divergent-transitions-after-warmup ## to find out why this is a problem and how to eliminate them. 
``` ``` ## Warning: Examine the pairs() plot to diagnose sampling problems ``` ``` ## parsnip model object ## ## stan_glm ## family: gaussian [identity] ## formula: weight ~ volume * cover ## observations: 15 ## predictors: 4 ## ------ ## Median MAD_SD ## (Intercept) 98.9 63.5 ## volume 0.8 0.1 ## coverpaperback -0.3 3.8 ## volume:coverpaperback -0.2 0.1 ## ## Auxiliary parameter(s): ## Median MAD_SD ## sigma 85.3 18.3 ## ## ------ ## * For help interpreting the printed output see ?print.stanreg ## * For info on the priors used see ?prior_summary.stanreg ``` ] .footnote[ See `?details_linear_reg_stan` for details within `parsnip` ] --- ## What was actually run? .small[ ```r linear_reg() %>% set_engine( "stan", prior_intercept = rstanarm::student_t(df = 1), prior = rstanarm::student_t(df = 1), seed = 1234 ) %>% translate() ``` ``` ## Linear Regression Model Specification (regression) ## ## Engine-Specific Arguments: ## prior_intercept = rstanarm::student_t(df = 1) ## prior = rstanarm::student_t(df = 1) ## seed = 1234 ## ## Computational engine: stan ## ## Model fit template: ## rstanarm::stan_glm(formula = missing_arg(), data = missing_arg(), ## weights = missing_arg(), prior_intercept = rstanarm::student_t(df = 1), ## prior = rstanarm::student_t(df = 1), seed = 1234, family = stats::gaussian, ## refresh = 0) ``` ] --- ## Back to broom ```r broom::tidy(bayes_tm) ``` ``` ## Error in warn_on_stanreg(x): The supplied model object seems to be outputted from the rstanarm package. Tidiers for mixed model output now live in the broom.mixed package. 
``` -- ```r broom.mixed::tidy(bayes_tm) ``` ``` ## # A tibble: 4 × 3 ## term estimate std.error ## <chr> <dbl> <dbl> ## 1 (Intercept) 98.9 63.5 ## 2 volume 0.824 0.0762 ## 3 coverpaperback -0.285 3.84 ## 4 volume:coverpaperback -0.195 0.0523 ``` ```r broom.mixed::glance(bayes_tm) ``` ``` ## # A tibble: 1 × 4 ## algorithm pss nobs sigma ## <chr> <dbl> <int> <dbl> ## 1 sampling 4000 15 85.3 ``` --- ## Augment ```r augment(bayes_tm, new_data=books) ``` ``` ## # A tibble: 15 × 5 ## volume weight cover .pred .resid ## <dbl> <dbl> <fct> <dbl> <dbl> ## 1 885 800 hardback 828. -28.4 ## 2 1016 950 hardback 937. 13.3 ## 3 1125 1050 hardback 1027. 23.2 ## 4 239 350 hardback 295. 55.5 ## 5 701 750 hardback 676. 73.7 ## 6 641 600 hardback 627. -26.7 ## 7 1228 1075 hardback 1112. -36.9 ## 8 412 250 paperback 355. -105. ## 9 953 700 paperback 697. 3.39 ## 10 929 650 paperback 681. -31.4 ## 11 1492 975 paperback 1037. -62.4 ## 12 419 350 paperback 359. -8.99 ## 13 1010 950 paperback 733. 217. ## 14 595 425 paperback 470. -45.3 ## 15 1034 725 paperback 748. 
-22.8 ``` --- ## Predictions .small[ ```r bayes_tm %>% augment( new_data = tidyr::expand_grid( volume = seq(0, 1500, by=5), cover = c("hardback", "paperback") %>% as.factor() ) ) %>% rename(weight = .pred) %>% ggplot(aes(x = volume, y = weight, color = cover, group = cover)) + geom_line() + geom_point(data = books) ``` <img src="Lec24_files/figure-html/unnamed-chunk-25-1.png" width="80%" style="display: block; margin: auto;" /> ] --- background-image: url("imgs/hex_yardstick.png") background-position: 98% 2% background-size: 15% ## Performance .pull-left[ ```r lm_tm %>% augment(new_data = books) %>% yardstick::rmse(weight, .pred) ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 68.9 ``` ```r bayes_tm %>% augment(new_data = books) %>% yardstick::rmse(weight, .pred) ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 72.0 ``` ] .pull-right[ ] .footnote[ More on combining these types of results later, see [`workflows`](https://workflows.tidymodels.org/) ] --- class: middle, center # Cross validation and Feature engineering --- ## The Office & IMDB ```r (office_ratings = read_csv("data/office_ratings.csv")) ``` ``` ## # A tibble: 188 × 6 ## season episode title imdb_rating total_votes air_date ## <dbl> <dbl> <chr> <dbl> <dbl> <date> ## 1 1 1 Pilot 7.6 3706 2005-03-24 ## 2 1 2 Divers… 8.3 3566 2005-03-29 ## 3 1 3 Health… 7.9 2983 2005-04-05 ## 4 1 4 The Al… 8.1 2886 2005-04-12 ## 5 1 5 Basket… 8.4 3179 2005-04-19 ## 6 1 6 Hot Gi… 7.8 2852 2005-04-26 ## 7 2 1 The Du… 8.7 3213 2005-09-20 ## 8 2 2 Sexual… 8.2 2736 2005-09-27 ## 9 2 3 Office… 8.4 2742 2005-10-04 ## 10 2 4 The Fi… 8.4 2713 2005-10-11 ## # … with 178 more rows ``` .footnote[ These data are from [data.world](https://data.world/anujjain7/the-office-imdb-ratings-dataset), by way of [TidyTuesday](https://github.com/rfordatascience/tidytuesday/blob/master/data/2020/2020-03-17/readme.md). 
] --- ## Rating vs Air Date <img src="Lec24_files/figure-html/unnamed-chunk-28-1.png" width="80%" style="display: block; margin: auto;" /> --- background-image: url("imgs/hex_rsample.png") background-position: 98% 2% background-size: 15% ## Test-train split .pull-left[ ```r set.seed(123) (office_split = initial_split(office_ratings, prop = 0.8)) ``` ``` ## <Analysis/Assess/Total> ## <150/38/188> ``` ] .pull-right[ ] -- <br/> <div> .pull-left[ ```r (office_train = training(office_split)) ``` ``` ## # A tibble: 150 × 6 ## season episode title imdb_rating total_votes air_date ## <dbl> <dbl> <chr> <dbl> <dbl> <date> ## 1 8 18 Last D… 7.8 1429 2012-03-08 ## 2 9 14 Vandal… 7.6 1402 2013-01-31 ## 3 2 8 Perfor… 8.2 2416 2005-11-15 ## 4 9 5 Here C… 7.1 1515 2012-10-25 ## 5 3 22 Beach … 9.1 2783 2007-05-10 ## 6 7 1 Nepoti… 8.4 1897 2010-09-23 ## 7 3 15 Phylli… 8.3 2283 2007-02-08 ## 8 9 21 Livin'… 8.9 2041 2013-05-02 ## 9 9 18 Promos 8 1445 2013-04-04 ## 10 8 12 Pool P… 8 1612 2012-01-19 ## # … with 140 more rows ``` ] .pull-right[ ```r (office_test = testing(office_split)) ``` ``` ## # A tibble: 38 × 6 ## season episode title imdb_rating total_votes air_date ## <dbl> <dbl> <chr> <dbl> <dbl> <date> ## 1 1 2 Divers… 8.3 3566 2005-03-29 ## 2 2 4 The Fi… 8.4 2713 2005-10-11 ## 3 2 9 E-Mail… 8.4 2527 2005-11-22 ## 4 2 12 The In… 9 3282 2006-01-12 ## 5 2 22 Casino… 9.3 3644 2006-05-11 ## 6 3 5 Initia… 8.2 2254 2006-10-19 ## 7 3 16 Busine… 8.8 2622 2007-02-15 ## 8 3 17 Cockta… 8.5 2264 2007-02-22 ## 9 4 6 Branch… 8.5 2185 2007-11-01 ## 10 4 7 Surviv… 8.3 2110 2007-11-08 ## # … with 28 more rows ``` ] </div> --- ## Feature engineering with dplyr ```r office_train %>% mutate( season = as_factor(season), month = lubridate::month(air_date), wday = lubridate::wday(air_date), top10_votes = as.integer(total_votes > quantile(total_votes, 0.9)) ) ``` ``` ## # A tibble: 150 × 9 ## season episode title imdb_rating total_votes air_date month wday top10_votes ## <fct> <dbl> <chr> <dbl> <dbl> 
<date> <dbl> <dbl> <int> ## 1 8 18 Last Day in Florida 7.8 1429 2012-03-08 3 5 0 ## 2 9 14 Vandalism 7.6 1402 2013-01-31 1 5 0 ## 3 2 8 Performance Review 8.2 2416 2005-11-15 11 3 0 ## 4 9 5 Here Comes Treble 7.1 1515 2012-10-25 10 5 0 ## 5 3 22 Beach Games 9.1 2783 2007-05-10 5 5 0 ## 6 7 1 Nepotism 8.4 1897 2010-09-23 9 5 0 ## 7 3 15 Phyllis' Wedding 8.3 2283 2007-02-08 2 5 0 ## 8 9 21 Livin' the Dream 8.9 2041 2013-05-02 5 5 0 ## 9 9 18 Promos 8 1445 2013-04-04 4 5 0 ## 10 8 12 Pool Party 8 1612 2012-01-19 1 5 0 ## # … with 140 more rows ``` -- <br/> <br/> .center[ Anyone see a potential problem with the code above? ] --- background-image: url("imgs/hex_recipes.png") background-position: 98% 2% background-size: 15% ## Better living through recipes .pull-left[ ```r (r = recipe(imdb_rating ~ ., data = office_train)) ``` ``` ## Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 5 ``` ] .pull-right[ ] -- <div> .pull-left[ ```r summary(r) ``` ``` ## # A tibble: 6 × 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 season numeric predictor original ## 2 episode numeric predictor original ## 3 title nominal predictor original ## 4 total_votes numeric predictor original ## 5 air_date date predictor original ## 6 imdb_rating numeric outcome original ``` ] .pull-right[] </div> --- ## Recipe roles .pull-left[ ```r (r = recipe(imdb_rating ~ ., data = office_train) %>% update_role(title, new_role = "ID") ) ``` ``` ## Recipe ## ## Inputs: ## ## role #variables ## ID 1 ## outcome 1 ## predictor 4 ``` ] .pull-right[ ```r summary(r) ``` ``` ## # A tibble: 6 × 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 season numeric predictor original ## 2 episode numeric predictor original ## 3 title nominal ID original ## 4 total_votes numeric predictor original ## 5 air_date date predictor original ## 6 imdb_rating numeric outcome original ``` ] --- ## Adding features (month & day of week) .pull-left[ ```r (r = recipe(imdb_rating ~ ., data = 
office_train) %>% update_role(title, new_role = "ID") %>% step_date(air_date, features = c("dow", "month")) ) ``` ``` ## Recipe ## ## Inputs: ## ## role #variables ## ID 1 ## outcome 1 ## predictor 4 ## ## Operations: ## ## Date features from air_date ``` ] .pull-right[ ```r summary(r) ``` ``` ## # A tibble: 6 × 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 season numeric predictor original ## 2 episode numeric predictor original ## 3 title nominal ID original ## 4 total_votes numeric predictor original ## 5 air_date date predictor original ## 6 imdb_rating numeric outcome original ``` ] --- ## Adding Holidays ```r (r = recipe(imdb_rating ~ ., data = office_train) %>% update_role(title, new_role = "ID") %>% step_date(air_date, features = c("dow", "month")) %>% step_holiday( air_date, holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), keep_original_cols = FALSE ) ) ``` ``` ## Recipe ## ## Inputs: ## ## role #variables ## ID 1 ## outcome 1 ## predictor 4 ## ## Operations: ## ## Date features from air_date ## Holiday features from air_date ``` --- ## Seasons as factors ```r (r = recipe(imdb_rating ~ ., data = office_train) %>% update_role(title, new_role = "ID") %>% step_date(air_date, features = c("dow", "month")) %>% step_holiday( air_date, holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), keep_original_cols = FALSE ) %>% step_num2factor(season, levels = as.character(1:9)) ) ``` ``` ## Recipe ## ## Inputs: ## ## role #variables ## ID 1 ## outcome 1 ## predictor 4 ## ## Operations: ## ## Date features from air_date ## Holiday features from air_date ## Factor variables from season ``` --- ## Dummy coding ```r (r = recipe(imdb_rating ~ ., data = office_train) %>% update_role(title, new_role = "ID") %>% step_date(air_date, features = c("dow", "month")) %>% step_holiday( air_date, holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), 
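    # keep_original_cols = FALSE drops the raw air_date column
    # once its holiday indicator features have been generated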
keep_original_cols = FALSE ) %>% step_num2factor(season, levels = as.character(1:9)) %>% step_dummy(all_nominal_predictors()) ) ``` ``` ## Recipe ## ## Inputs: ## ## role #variables ## ID 1 ## outcome 1 ## predictor 4 ## ## Operations: ## ## Date features from air_date ## Holiday features from air_date ## Factor variables from season ## Dummy variables from all_nominal_predictors() ``` --- ## `top10_votes` .small[ ```r (r = recipe(imdb_rating ~ ., data = office_train) %>% update_role(title, new_role = "ID") %>% step_date(air_date, features = c("dow", "month")) %>% step_holiday( air_date, holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), keep_original_cols = FALSE ) %>% step_num2factor(season, levels = as.character(1:9)) %>% step_dummy(all_nominal_predictors()) %>% step_percentile(total_votes) %>% step_mutate(top10 = as.integer(total_votes >= 0.9)) %>% step_rm(total_votes) ) ``` ``` ## Recipe ## ## Inputs: ## ## role #variables ## ID 1 ## outcome 1 ## predictor 4 ## ## Operations: ## ## Date features from air_date ## Holiday features from air_date ## Factor variables from season ## Dummy variables from all_nominal_predictors() ## Percentile transformation on <none> ## Variable mutation for as.integer(total_votes >= 0.9) ## Variables removed total_votes ``` ] --- ## Preparing a recipe ```r prep(r) ``` ``` ## Recipe ## ## Inputs: ## ## role #variables ## ID 1 ## outcome 1 ## predictor 4 ## ## Training data contained 150 data points and no missing data. 
## ## Operations: ## ## Date features from air_date [trained] ## Holiday features from air_date [trained] ## Factor variables from season [trained] ## Dummy variables from season, air_date_dow, air_date_month [trained] ## Percentile transformation on ~total_votes [trained] ## Variable mutation for ~as.integer(total_votes >= 0.9) [trained] ## Variables removed total_votes [trained] ``` --- ## Baking a recipe ```r prep(r) %>% bake(new_data = office_train) ``` ``` ## # A tibble: 150 × 33 ## episode title imdb_rating air_date_USThan… air_date_USChri… air_date_USNewY… air_date_USInde… ## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 18 Last Day… 7.8 0 0 0 0 ## 2 14 Vandalism 7.6 0 0 0 0 ## 3 8 Performa… 8.2 0 0 0 0 ## 4 5 Here Com… 7.1 0 0 0 0 ## 5 22 Beach Ga… 9.1 0 0 0 0 ## 6 1 Nepotism 8.4 0 0 0 0 ## 7 15 Phyllis'… 8.3 0 0 0 0 ## 8 21 Livin' t… 8.9 0 0 0 0 ## 9 18 Promos 8 0 0 0 0 ## 10 12 Pool Par… 8 0 0 0 0 ## # … with 140 more rows, and 26 more variables: season_X2 <dbl>, season_X3 <dbl>, season_X4 <dbl>, ## # season_X5 <dbl>, season_X6 <dbl>, season_X7 <dbl>, season_X8 <dbl>, season_X9 <dbl>, ## # air_date_dow_Mon <dbl>, air_date_dow_Tue <dbl>, air_date_dow_Wed <dbl>, air_date_dow_Thu <dbl>, ## # air_date_dow_Fri <dbl>, air_date_dow_Sat <dbl>, air_date_month_Feb <dbl>, ## # air_date_month_Mar <dbl>, air_date_month_Apr <dbl>, air_date_month_May <dbl>, ## # air_date_month_Jun <dbl>, air_date_month_Jul <dbl>, air_date_month_Aug <dbl>, ## # air_date_month_Sep <dbl>, air_date_month_Oct <dbl>, air_date_month_Nov <dbl>, … ``` --- ## Informative features? 
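Before fitting, it is worth checking whether the engineered columns actually vary. A quick screen is to count distinct values per column — a constant column (one unique value) cannot inform the model. A minimal sketch on a made-up tibble (`demo` is invented for illustration):

```r
library(tibble)
library(purrr)

# a toy tibble with one constant (zero-variance) column
demo = tibble(
  x    = c(1, 2, 3),
  flag = c(0, 0, 0)  # only one unique value -- uninformative
)

# count distinct values per column; a 1 flags a constant column
map_int(demo, ~ length(unique(.x)))
```

The same check applied to the baked training data: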
```r prep(r) %>% bake(new_data = office_train) %>% map_int(~ length(unique(.x))) ``` ``` ## episode title imdb_rating ## 26 150 26 ## air_date_USThanksgivingDay air_date_USChristmasDay air_date_USNewYearsDay ## 1 1 1 ## air_date_USIndependenceDay season_X2 season_X3 ## 1 2 2 ## season_X4 season_X5 season_X6 ## 2 2 2 ## season_X7 season_X8 season_X9 ## 2 2 2 ## air_date_dow_Mon air_date_dow_Tue air_date_dow_Wed ## 1 2 1 ## air_date_dow_Thu air_date_dow_Fri air_date_dow_Sat ## 2 1 1 ## air_date_month_Feb air_date_month_Mar air_date_month_Apr ## 2 2 2 ## air_date_month_May air_date_month_Jun air_date_month_Jul ## 2 1 1 ## air_date_month_Aug air_date_month_Sep air_date_month_Oct ## 1 2 2 ## air_date_month_Nov air_date_month_Dec top10 ## 2 2 2 ``` --- ## Removing zero variance predictors .small[ ```r r = recipe(imdb_rating ~ ., data = office_train) %>% update_role(title, new_role = "ID") %>% step_date(air_date, features = c("dow", "month")) %>% step_holiday( air_date, holidays = c("USThanksgivingDay", "USChristmasDay", "USNewYearsDay", "USIndependenceDay"), keep_original_cols = FALSE ) %>% step_num2factor(season, levels = as.character(1:9)) %>% step_dummy(all_nominal_predictors()) %>% step_percentile(total_votes) %>% step_mutate(top10 = as.integer(total_votes >= 0.9)) %>% step_rm(total_votes) %>% step_zv(all_predictors()) prep(r) %>% bake(new_data = office_train) ``` ``` ## # A tibble: 150 × 22 ## episode title imdb_rating season_X2 season_X3 season_X4 season_X5 season_X6 season_X7 season_X8 ## <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 18 Last D… 7.8 0 0 0 0 0 0 1 ## 2 14 Vandal… 7.6 0 0 0 0 0 0 0 ## 3 8 Perfor… 8.2 1 0 0 0 0 0 0 ## 4 5 Here C… 7.1 0 0 0 0 0 0 0 ## 5 22 Beach … 9.1 0 1 0 0 0 0 0 ## 6 1 Nepoti… 8.4 0 0 0 0 0 1 0 ## 7 15 Phylli… 8.3 0 1 0 0 0 0 0 ## 8 21 Livin'… 8.9 0 0 0 0 0 0 0 ## 9 18 Promos 8 0 0 0 0 0 0 0 ## 10 12 Pool P… 8 0 0 0 0 0 0 1 ## # … with 140 more rows, and 12 more variables: season_X9 <dbl>, air_date_dow_Tue <dbl>, 
## # air_date_dow_Thu <dbl>, air_date_month_Feb <dbl>, air_date_month_Mar <dbl>, ## # air_date_month_Apr <dbl>, air_date_month_May <dbl>, air_date_month_Sep <dbl>, ## # air_date_month_Oct <dbl>, air_date_month_Nov <dbl>, air_date_month_Dec <dbl>, top10 <int> ``` ] --- background-image: url("imgs/hex_workflows.png") background-position: 98% 2% background-size: 15% ## Really putting it all together .pull-left[ ```r (office_work = workflow() %>% add_recipe(r) %>% add_model( linear_reg() %>% set_engine("lm") ) ) ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────── ## 8 Recipe Steps ## ## • step_date() ## • step_holiday() ## • step_num2factor() ## • step_dummy() ## • step_percentile() ## • step_mutate() ## • step_rm() ## • step_zv() ## ## ── Model ─────────────────────────────────────────────────────────────────────────────────────────── ## Linear Regression Model Specification (regression) ## ## Computational engine: lm ``` ] .pull-right[ ] --- ## Workflow fit .small[ ```r (office_fit = office_work %>% fit(data = office_train)) ``` ``` ## ══ Workflow [trained] ══════════════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: linear_reg() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────────────────────────── ## 8 Recipe Steps ## ## • step_date() ## • step_holiday() ## • step_num2factor() ## • step_dummy() ## • step_percentile() ## • step_mutate() ## • step_rm() ## • step_zv() ## ## ── Model ─────────────────────────────────────────────────────────────────────────────────────────── ## ## Call: ## stats::lm(formula = ..y ~ ., data = data) ## ## Coefficients: ## (Intercept) episode season_X2 season_X3 season_X4 ## 7.342550 0.007041 1.216204 1.411946 1.433737 ## season_X5 season_X6 
season_X7 season_X8 season_X9 ## 1.353140 1.150998 1.218278 0.526168 0.789066 ## air_date_dow_Tue air_date_dow_Thu air_date_month_Feb air_date_month_Mar air_date_month_Apr ## -0.286676 -0.362900 -0.077280 -0.048725 0.015377 ## air_date_month_May air_date_month_Sep air_date_month_Oct air_date_month_Nov air_date_month_Dec ## 0.144080 0.092809 -0.047704 -0.077198 0.306999 ## top10 ## 0.890058 ``` ] --- ## Performance .pull-left[ ```r office_fit %>% augment(office_train) %>% rmse(imdb_rating, .pred) ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 0.342 ``` ] .pull-right[ ```r office_fit %>% augment(office_test) %>% rmse(imdb_rating, .pred) ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 rmse standard 0.399 ``` ] --- ## k-fold cross validation <img src="imgs/kfold-cv.png" width="80%" style="display: block; margin: auto;" /> --- ## Creating folds ```r set.seed(123) (folds = vfold_cv(office_train, v=5)) ``` ``` ## # 5-fold cross-validation ## # A tibble: 5 × 2 ## splits id ## <list> <chr> ## 1 <split [120/30]> Fold1 ## 2 <split [120/30]> Fold2 ## 3 <split [120/30]> Fold3 ## 4 <split [120/30]> Fold4 ## 5 <split [120/30]> Fold5 ``` -- ```r (office_fit_folds = office_work %>% fit_resamples(folds) ) ``` ``` ## ! Fold2: preprocessor 1/1, model 1/1 (predictions): prediction from a rank-deficient fit may be misle... 
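```

This rank-deficiency warning usually means that, within Fold2's analysis set, some predictor columns were constant or collinear — the recipe (including `step_zv()`) is re-prepared on each fold, so a dummy column that varies in the full training set can still be degenerate within a single fold. The output suggests `collect_notes()`; a sketch of inspecting the captured notes:

```r
# one row per captured warning/note, with the fold it came from
tune::collect_notes(office_fit_folds)
```

```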
``` ``` ## # Resampling results ## # 5-fold cross-validation ## # A tibble: 5 × 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [120/30]> Fold1 <tibble [2 × 4]> <tibble [0 × 3]> ## 2 <split [120/30]> Fold2 <tibble [2 × 4]> <tibble [1 × 3]> ## 3 <split [120/30]> Fold3 <tibble [2 × 4]> <tibble [0 × 3]> ## 4 <split [120/30]> Fold4 <tibble [2 × 4]> <tibble [0 × 3]> ## 5 <split [120/30]> Fold5 <tibble [2 × 4]> <tibble [0 × 3]> ## ## There were issues with some computations: ## ## - Warning(s) x1: prediction from a rank-deficient fit may be misleading ## ## Use `collect_notes(object)` for more information. ``` --- ## Fold performance ```r tune::collect_metrics(office_fit_folds) ``` ``` ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 rmse standard 0.420 5 0.0182 Preprocessor1_Model1 ## 2 rsq standard 0.429 5 0.0597 Preprocessor1_Model1 ``` -- ```r tune::collect_metrics(office_fit_folds, summarize = FALSE) %>% filter(.metric == "rmse") ``` ``` ## # A tibble: 5 × 5 ## id .metric .estimator .estimate .config ## <chr> <chr> <chr> <dbl> <chr> ## 1 Fold1 rmse standard 0.467 Preprocessor1_Model1 ## 2 Fold2 rmse standard 0.403 Preprocessor1_Model1 ## 3 Fold3 rmse standard 0.405 Preprocessor1_Model1 ## 4 Fold4 rmse standard 0.454 Preprocessor1_Model1 ## 5 Fold5 rmse standard 0.368 Preprocessor1_Model1 ``` .footnote[ More on the `tune` package next time ]
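
---

## Looking ahead

Once a workflow's resampled performance looks reasonable, `tune::last_fit()` refits it on the full training set and evaluates it once on the held-out test set. A sketch (a preview of `tune`, not run here), reusing `office_work` and `office_split` from above:

```r
# refit on all of office_train, then score on office_test
office_final = tune::last_fit(office_work, office_split)

# held-out metrics (rmse and rsq by default)
tune::collect_metrics(office_final)

# test-set predictions, if needed
tune::collect_predictions(office_final)
```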