class: center, middle, inverse, title-slide # Lec 23 - Bigish data ##
Statistical Programming ### Sta 323 | Spring 2022 ###
Dr. Colin Rundel --- exclude: true --- ## Big vs Bigish data * We will be working with data that is large, but it will still fit in memory. * R *loves* to make extra copies of objects, so we need to be careful - even a handful of copies with exhaust the memory on most systems. * Less of an issue on departental server (256 GB of Ram), but this is a shared resource - use too much and your session/process might be killed. --- ## NYC parking ticket data New York City is at the forefront of the open data movement among local, state and federal governments. They have made publicly available a huge amount of data (NYC Open Data) on everything from street trees, to restaurant inspections, to parking violations. We will be looking at all parking tickets issued in NYC for the 2022 Fiscal year - the data were obtained from [NYC Open Data](https://data.cityofnewyork.us/City-Government/Parking-Violations-Issued-Fiscal-Year-2022/pvqr-7yc4). ```r fs::file_size("data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv") ``` ``` ## 1.76G ``` --- ## Reading the data If we use the basic approach of `read.csv`, we end up waiting awhile, ```r system.time( read.csv("data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv") ) ## user system elapsed ## 90.764 11.426 108.169 ``` Almost 2 minutes to read in a 1.76 gigabyte CSV file. --- ## Improvements If we use `stringsAsFactors=FALSE` and `comment.char=""` arguments we can speed things up a bit. ```r system.time( read.csv( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv", stringsAsFactors=FALSE, comment.char="" ) ) ## user system elapsed ## 75.470 14.124 96.476 ``` We can take this farther by specifying the structure of the data using the `colClasses` argument. --- ## data.table::fread ```r system.time({ nyc_fread = data.table::fread("data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv") }) ## |--------------------------------------------------| ## |==================================================| ## user system elapsed ## 17.233 1.041 21.857 ``` -- ```r class(nyc_fread) ## "data.table" "data.frame" nyc = as.data.frame(nyc_fread) class(nyc) ## [1] "data.frame" ``` --- ## readr::read_csv ```r system.time({ nyc = readr::read_csv("data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv") }) ## Rows: 9980449 Columns: 43 ## ── Column specification ────────────────────────────────────────────────────────────────────────── ## Delimiter: "," ## chr (25): Plate ID, Registration State, Plate Type, Issue Date, Vehicle Body Type, Vehicle Mak... ## dbl (14): Summons Number, Violation Code, Street Code1, Street Code2, Street Code3, Vehicle Ex... ## lgl (4): Violation Legal Code, No Standing or Stopping Violation, Hydrant Violation, Double P... ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ## user system elapsed ## 55.796 10.596 40.465 ``` --- ## readr::read_csv - lazy ```r system.time({ nyc_lazy = readr::read_csv("data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv", lazy=TRUE) }) ## Rows: 9980449 Columns: 43 ## ── Column specification ────────────────────────────────────────────────────────────────────────── ## Delimiter: "," ## chr (25): Plate ID, Registration State, Plate Type, Issue Date, Vehicle Body Type, Vehicle Mak... ## dbl (14): Summons Number, Violation Code, Street Code1, Street Code2, Street Code3, Vehicle Ex... ## lgl (4): Violation Legal Code, No Standing or Stopping Violation, Hydrant Violation, Double P... ## ## ℹ Use `spec()` to retrieve the full column specification for this data. ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message. ## user system elapsed ## 13.972 4.729 4.497 ``` -- ```r pryr::object_size(nyc) ## 3,461,255,280 B pryr::object_size(nyc_lazy) ## 159,707,840 B ``` --- ## readr This package is part of the tidyverse and is designed to be a fast and friendly way of reading tabular data into R. <br/> Core features: * Faster than base R (~3-4x) * No strings as factors * No column name mangling * Consistent argument/function naming scheme * Plays nice with dplyr (`tbl_df`) --- ## Ticket Data ```r nyc ## # A tibble: 9,980,449 × 43 ## `Summons Number` `Plate ID` `Registration State` `Plate Type` `Issue Date` `Violation Code` ## <dbl> <chr> <chr> <chr> <chr> <dbl> ## 1 1457617912 JEB5683 NY PAS 06/25/2021 40 ## 2 1457617924 JAN2986 NY PAS 06/25/2021 20 ## 3 1457622427 FJH6630 TX PAS 06/17/2021 98 ## 4 1457638629 RD1Y5N MO PAS 06/16/2021 98 ## 5 1457639580 T503814C NY OMT 07/04/2021 40 ## 6 1457643042 JLN5490 NY PAS 06/28/2021 98 ## 7 1457663909 UMB4505 VA PAS 07/02/2021 98 ## 8 1457670471 JPS7544 NY PAS 06/19/2021 40 ## 9 1457670537 UPS7544 NY PAS 06/19/2021 70 ## 10 1457677623 07027R5 TX PAS 07/03/2021 74 ## # … with 9,980,439 more rows, and 37 more variables: `Vehicle Body Type` <chr>, `Vehicle Make` <chr>, ## # `Issuing Agency` <chr>, `Street Code1` <dbl>, `Street Code2` <dbl>, `Street Code3` <dbl>, ## # `Vehicle Expiration Date` <dbl>, `Violation Location` <chr>, `Violation Precinct` <dbl>, ## # `Issuer Precinct` <dbl>, `Issuer Code` <dbl>, `Issuer Command` <chr>, `Issuer Squad` <chr>, ## # `Violation Time` <chr>, `Time First Observed` <chr>, `Violation County` <chr>, ## # `Violation In Front Of Or Opposite` <chr>, `House Number` <chr>, `Street Name` <chr>, ## # `Intersecting Street` <chr>, `Date First Observed` <dbl>, `Law Section` <dbl>, … ``` --- ## Fixing column names ```r (nyc = janitor::clean_names(nyc)) ## # A tibble: 9,980,449 × 43 ## summons_number plate_id registration_sta… plate_type issue_date violation_code vehicle_body_ty… vehicle_make ## <dbl> <chr> <chr> <chr> <chr> <dbl> <chr> <chr> ## 1 1457617912 JEB5683 NY PAS 06/25/2021 40 VAN FORD ## 2 1457617924 JAN2986 NY PAS 06/25/2021 20 SUBN DODGE ## 3 1457622427 FJH6630 TX PAS 06/17/2021 98 SDN AUDI ## 4 1457638629 RD1Y5N MO PAS 06/16/2021 98 SDN TOYOT ## 5 1457639580 T503814C NY OMT 07/04/2021 40 TAXI HONDA ## 6 1457643042 JLN5490 NY PAS 06/28/2021 98 SDN HONDA ## 7 1457663909 UMB4505 VA PAS 07/02/2021 98 SDN SUBAR ## 8 1457670471 JPS7544 NY PAS 06/19/2021 40 SDN NISSA ## 9 1457670537 UPS7544 NY PAS 06/19/2021 70 SDN NISSA ## 10 1457677623 07027R5 TX PAS 07/03/2021 74 SUBN LEXUS ## # … with 9,980,439 more rows, and 35 more variables: issuing_agency <chr>, street_code1 <dbl>, ## # street_code2 <dbl>, street_code3 <dbl>, vehicle_expiration_date <dbl>, violation_location <chr>, ## # violation_precinct <dbl>, issuer_precinct <dbl>, issuer_code <dbl>, issuer_command <chr>, ## # issuer_squad <chr>, violation_time <chr>, time_first_observed <chr>, violation_county <chr>, ## # violation_in_front_of_or_opposite <chr>, house_number <chr>, street_name <chr>, intersecting_street <chr>, ## # date_first_observed <dbl>, law_section <dbl>, sub_division <chr>, violation_legal_code <lgl>, ## # days_parking_in_effect <chr>, from_hours_in_effect <chr>, to_hours_in_effect <chr>, vehicle_color <chr>, … ``` --- ## Simplifying There is a lot of variables we won't care about for the time being, so lets make life easier by selecting a subset of columns. ```r (nyc_trim = nyc %>% select( registration_state:issuing_agency, violation_location, violation_precinct, violation_time, house_number:intersecting_street, vehicle_color ) ) ## # A tibble: 9,980,449 × 14 ## registration_state plate_type issue_date violation_code vehicle_body_type vehicle_make issuing_agency ## <chr> <chr> <chr> <dbl> <chr> <chr> <chr> ## 1 NY PAS 06/25/2021 40 VAN FORD P ## 2 NY PAS 06/25/2021 20 SUBN DODGE P ## 3 TX PAS 06/17/2021 98 SDN AUDI P ## 4 MO PAS 06/16/2021 98 SDN TOYOT P ## 5 NY OMT 07/04/2021 40 TAXI HONDA P ## 6 NY PAS 06/28/2021 98 SDN HONDA P ## 7 VA PAS 07/02/2021 98 SDN SUBAR P ## 8 NY PAS 06/19/2021 40 SDN NISSA P ## 9 NY PAS 06/19/2021 70 SDN NISSA P ## 10 TX PAS 07/03/2021 74 SUBN LEXUS P ## # … with 9,980,439 more rows, and 7 more variables: violation_location <chr>, violation_precinct <dbl>, ## # violation_time <chr>, house_number <chr>, street_name <chr>, intersecting_street <chr>, ## # vehicle_color <chr> ``` --- ## Size of objects ```r pryr::object_size(nyc) ## 3,461,255,280 B pryr::object_size(nyc_fread) ## 2,942,262,600 B pryr::object_size(nyc_trim) ## 1,137,418,720 B ``` --- ## Clean data? How many different car colors are in this data set? -- .pull-left[ .small[ ```r nyc %>% count(vehicle_color) %>% arrange(desc(n)) ## # A tibble: 953 × 2 ## vehicle_color n ## <chr> <int> ## 1 GY 1943084 ## 2 WH 1779206 ## 3 BK 1701522 ## 4 NA 683723 ## 5 WHITE 654584 ## 6 BL 653795 ## 7 BLACK 423010 ## 8 RD 377793 ## 9 GREY 302187 ## 10 SILVE 162462 ## # … with 943 more rows ``` ] ] .pull-right[ .small[ ```r nyc %>% count(vehicle_color) ## # A tibble: 953 × 2 ## vehicle_color n ## <chr> <int> ## 1 - 1 ## 2 - / 1 ## 3 -.I. 1 ## 4 ?:{ 15 ## 5 . 2 ## 6 .. 4 ## 7 .N 1 ## 8 .X 1 ## 9 '} 1 ## 10 // 4 ## # … with 943 more rows ``` ] ] --- ## Issue Dates ```r library(lubridate) class(nyc$issue_date) ## [1] "character" head(nyc$issue_date) ## [1] "06/25/2021" "06/25/2021" "06/17/2021" "06/16/2021" "07/04/2021" "06/28/2021" nyc = nyc %>% mutate(issue_date = mdy(issue_date, tz="America/New_York")) class(nyc$issue_date) ## [1] "POSIXct" "POSIXt" range(nyc$issue_date) ## [1] "1973-09-24 EDT" "2067-11-28 EST" ``` --- ```r nyc ## # A tibble: 9,980,449 × 43 ## summons_number plate_id registration_st… plate_type issue_date violation_code vehicle_body_ty… ## <dbl> <chr> <chr> <chr> <dttm> <dbl> <chr> ## 1 1457617912 JEB5683 NY PAS 2021-06-25 00:00:00 40 VAN ## 2 1457617924 JAN2986 NY PAS 2021-06-25 00:00:00 20 SUBN ## 3 1457622427 FJH6630 TX PAS 2021-06-17 00:00:00 98 SDN ## 4 1457638629 RD1Y5N MO PAS 2021-06-16 00:00:00 98 SDN ## 5 1457639580 T503814C NY OMT 2021-07-04 00:00:00 40 TAXI ## 6 1457643042 JLN5490 NY PAS 2021-06-28 00:00:00 98 SDN ## 7 1457663909 UMB4505 VA PAS 2021-07-02 00:00:00 98 SDN ## 8 1457670471 JPS7544 NY PAS 2021-06-19 00:00:00 40 SDN ## 9 1457670537 UPS7544 NY PAS 2021-06-19 00:00:00 70 SDN ## 10 1457677623 07027R5 TX PAS 2021-07-03 00:00:00 74 SUBN ## # … with 9,980,439 more rows, and 36 more variables: vehicle_make <chr>, issuing_agency <chr>, ## # street_code1 <dbl>, street_code2 <dbl>, street_code3 <dbl>, vehicle_expiration_date <dbl>, ## # violation_location <chr>, violation_precinct <dbl>, issuer_precinct <dbl>, issuer_code <dbl>, ## # issuer_command <chr>, issuer_squad <chr>, violation_time <chr>, time_first_observed <chr>, ## # violation_county <chr>, violation_in_front_of_or_opposite <chr>, house_number <chr>, ## # street_name <chr>, intersecting_street <chr>, date_first_observed <dbl>, law_section <dbl>, ## # sub_division <chr>, violation_legal_code <lgl>, days_parking_in_effect <chr>, … ``` --- ## More date issues ```r nyc$issue_date %>% year() %>% table() ## 1973 2000 2001 2002 2005 2006 2007 2009 2010 2011 2012 ## 2 51 5 4 1 2 2 2 9 9 25 ## 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 ## 3 2 4 5 4 5 17 211 8062258 1917658 108 ## 2024 2025 2026 2027 2028 2029 2030 2031 2032 2046 2061 ## 10 9 2 7 1 3 8 17 1 1 2 ## 2067 ## 1 ``` ] --- ```r filter(nyc, issue_date >= mdy("1/1/2021"), issue_date <= mdy("12/31/2022")) ## # A tibble: 9,979,916 × 43 ## summons_number plate_id registration_state plate_type issue_date violation_code ## <dbl> <chr> <chr> <chr> <dttm> <dbl> ## 1 1457617912 JEB5683 NY PAS 2021-06-25 00:00:00 40 ## 2 1457617924 JAN2986 NY PAS 2021-06-25 00:00:00 20 ## 3 1457622427 FJH6630 TX PAS 2021-06-17 00:00:00 98 ## 4 1457638629 RD1Y5N MO PAS 2021-06-16 00:00:00 98 ## 5 1457639580 T503814C NY OMT 2021-07-04 00:00:00 40 ## 6 1457643042 JLN5490 NY PAS 2021-06-28 00:00:00 98 ## 7 1457663909 UMB4505 VA PAS 2021-07-02 00:00:00 98 ## 8 1457670471 JPS7544 NY PAS 2021-06-19 00:00:00 40 ## 9 1457670537 UPS7544 NY PAS 2021-06-19 00:00:00 70 ## 10 1457677623 07027R5 TX PAS 2021-07-03 00:00:00 74 ## # … with 9,979,906 more rows, and 37 more variables: vehicle_body_type <chr>, ## # vehicle_make <chr>, issuing_agency <chr>, street_code1 <dbl>, street_code2 <dbl>, ## # street_code3 <dbl>, vehicle_expiration_date <dbl>, violation_location <chr>, ## # violation_precinct <dbl>, issuer_precinct <dbl>, issuer_code <dbl>, issuer_command ## <chr>, ## # issuer_squad <chr>, violation_time <chr>, time_first_observed <chr>, ## # violation_county <chr>, violation_in_front_of_or_opposite <chr>, house_number <chr>, ## # street_name <chr>, intersecting_street <chr>, date_first_observed <dbl>, … ``` ] --- ## Performance? ```r system.time( filter(nyc, issue_date >= mdy("1/1/2021"), issue_date <= mdy("12/31/2022")) ) ## user system elapsed ## 1.079 0.856 2.396 system.time( filter(nyc, year(issue_date) %in% c(2013, 2014)) ) ## user system elapsed ## 1.007 0.085 1.090 ``` --- ## Putting it all together ```r nyc = readr::read_csv("data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv") %>% janitor::clean_names() %>% select(registration_state:issuing_agency, violation_location, violation_precinct, violation_time, house_number:intersecting_street, vehicle_color) %>% mutate(issue_date = mdy(issue_date)) %>% mutate(issue_day = day(issue_date), issue_month = month(issue_date), issue_year = year(issue_date), issue_wday = wday(issue_date, label=TRUE)) %>% filter(issue_year %in% 2021:2022) nyc ## # A tibble: 9,979,916 × 18 ## registration_state plate_type issue_date violation_code vehicle_body_type vehicle_make ## <chr> <chr> <date> <dbl> <chr> <chr> ## 1 NY PAS 2021-06-25 40 VAN FORD ## 2 NY PAS 2021-06-25 20 SUBN DODGE ## 3 TX PAS 2021-06-17 98 SDN AUDI ## 4 MO PAS 2021-06-16 98 SDN TOYOT ## 5 NY OMT 2021-07-04 40 TAXI HONDA ## 6 NY PAS 2021-06-28 98 SDN HONDA ## 7 VA PAS 2021-07-02 98 SDN SUBAR ## 8 NY PAS 2021-06-19 40 SDN NISSA ## 9 NY PAS 2021-06-19 70 SDN NISSA ## 10 TX PAS 2021-07-03 74 SUBN LEXUS ## # … with 9,979,906 more rows, and 12 more variables: issuing_agency <chr>, ## # violation_location <chr>, violation_precinct <dbl>, violation_time <chr>, ## # house_number <chr>, street_name <chr>, intersecting_street <chr>, vehicle_color <chr>, ## # issue_day <int>, issue_month <dbl>, issue_year <dbl>, issue_wday <ord> ``` --- ## Ticket Frequency ```r nyc %>% count(issue_date) %>% ggplot(aes(x=issue_date, y=n)) + geom_line()+ xlim(mdy("7/1/2021"), mdy("2/28/2022")) ``` <img src="imgs/nyc_date_freq.png" width="80%" style="display: block; margin: auto;" /> --- ## Demos Some dplyr practice, 1. Create a plot of the weekly pattern (tickets issued per day of the week) - When are you most likely to get a ticket and when are you least likely to get a ticket? 2. Which precinct issued the most tickets to Toyotas? --- class: center, middle ## Arrow --- ## Apache Arrow > Apache Arrow is a software development platform for building high performance applications that process and transport large data sets. It is designed to both improve the performance of analytical algorithms and the efficiency of moving data from one system or programming language to another. > > A critical component of Apache Arrow is its in-memory columnar format, a standardized, language-agnostic specification for representing structured, table-like datasets in-memory. This data format has a rich data type system (included nested and user-defined data types) designed to support the needs of analytic database systems, data frame libraries, and more. <img src="imgs/arrow_cols.png" width="40%" style="display: block; margin: auto;" /> --- ## Language support .pull-left-narrow[ Core implementations in: * C * C++ * C# * go * Java * JavaScript * Julia * Rust * MATLAB * Python * R * Ruby ] .pull-right-wide[ <img src="imgs/arrow_memory.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Arrow + CSV ```r system.time({ nyc_arrow_df = arrow::read_csv_arrow( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv" ) }) ## user system elapsed ## 9.331 19.270 11.000 ``` -- ```r system.time({ nyc_arrow = arrow::read_csv_arrow( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv", as_data_frame = FALSE ) }) ## user system elapsed ## 8.860 12.826 7.174 ``` --- ## Arrow tables ```r nyc_arrow ## Table ## 9980449 rows x 43 columns ## $Summons Number <int64> ## $Plate ID <string> ## $Registration State <string> ## $Plate Type <string> ## $Issue Date <string> ## $Violation Code <int64> ## $Vehicle Body Type <string> ## $Vehicle Make <string> ## $Issuing Agency <string> ## $Street Code1 <int64> ## $Street Code2 <int64> ## $Street Code3 <int64> ## $Vehicle Expiration Date <int64> ## $Violation Location <int64> ## $Violation Precinct <int64> ## $Issuer Precinct <int64> ## $Issuer Code <int64> ## $Issuer Command <string> ## $Issuer Squad <string> ## $Violation Time <string> ## $Time First Observed <string> ## $Violation County <string> ## $Violation In Front Of Or Opposite <string> ## $House Number <string> ## $Street Name <string> ## $Intersecting Street <string> ## $Date First Observed <int64> ## $Law Section <int64> ## $Sub Division <string> ## $Violation Legal Code <string> ## $Days Parking In Effect <string> ## $From Hours In Effect <string> ## $To Hours In Effect <string> ## $Vehicle Color <string> ## $Unregistered Vehicle? <int64> ## $Vehicle Year <int64> ## $Meter Number <string> ## $Feet From Curb <int64> ## $Violation Post Code <string> ## $Violation Description <string> ## $No Standing or Stopping Violation <null> ## $Hydrant Violation <null> ## $Double Parking Violation <null> ``` --- class: center, middle ## An aside on tabular file formats --- ## Comma Separated Values This and other text & delimiter based file formats are the most common and generally considered the most portable, however they have a number of significant draw backs * no explicit schema or other metadata * column types must be inferred from the data * numerical values stored as text (efficiency and precision issues) * limited compression options --- ## (Apache) Parquet > ... provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop with systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark adopting it as a shared standard for high performance data IO. Core features: > The values in each column are physically stored in contiguous memory locations and this columnar storage provides the following benefits: > > * Column-wise compression is efficient and saves storage space > * Compression techniques specific to a type can be applied as the column values tend to be of the same type > * Queries that fetch specific column values need not read the entire row data thus improving performance > * Different encoding techniques can be applied to different columns --- ## Feather > ... is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally. Feather was created early in the Arrow project as a proof of concept for fast, language-agnostic data frame storage for Python (pandas) and R. Core features: * Direct columnar serialization of Arrow tables * Supports all Arrow data types and compression * Language agnostic * Metadata makes it possible to read only the necessary columns for an operation --- ## File sizes ```r fs::file_size( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv" ) ``` ``` ## 1.76G ``` ```r fs::file_size( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.parquet" ) ``` ``` ## 295M ``` ```r fs::file_size( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.feather" ) ``` ``` ## 1.04G ``` --- ## Read performance ```r system.time( arrow::read_csv_arrow( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv" ) ) ## user system elapsed ## 9.456 18.643 10.660 ``` -- ```r system.time( arrow::read_parquet( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.parquet" ) ) ## user system elapsed ## 4.548 1.545 1.166 ``` -- ```r system.time( arrow::read_feather( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.feather" ) ) ## user system elapsed ## 0.774 1.846 0.975 ``` --- ## Read performance - column subset ```r system.time( arrow::read_csv_arrow( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.csv", col_select = "Violation Precinct" ) ) ## user system elapsed ## 9.457 18.202 10.182 ``` -- ```r system.time( arrow::read_parquet( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.parquet", col_select = "Violation Precinct" ) ) ## user system elapsed ## 0.066 0.011 0.080 ``` -- ```r system.time( arrow::read_feather( "data/Parking_Violations_Issued_-_Fiscal_Year_2022.feather", col_select = "Violation Precinct" ) ) ## user system elapsed ## 0.047 0.044 0.090 ```