Lec 20 - databases & dplyr

class: center, middle, inverse, title-slide

# Lec 20 - databases & dplyr
## Statistical Programming
### Sta 323 | Spring 2022
### Dr. Colin Rundel

---

exclude: true

---
class: middle

# The why of databases

---

## Numbers every programmer should know

| Task                                | Timing (ns)       | Timing (μs)       |
|-------------------------------------|-------------------|-------------------|
| L1 cache reference                  | 0.5               |                   |
| L2 cache reference                  | 7                 |                   |
| Main memory reference               | 100               | 0.1               |
| Random seek SSD                     | 150,000           | 150               |
| Read 1 MB sequentially from memory  | 250,000           | 250               |
| Read 1 MB sequentially from SSD     | 1,000,000         | 1,000             |
| Disk seek                           | 10,000,000        | 10,000            |  
| Read 1 MB sequentially from disk    | 20,000,000        | 20,000            |
| Send packet CA->Netherlands->CA     | 150,000,000       | 150,000           |

.footnote[
From [jboner/latency.txt](https://gist.github.com/jboner/2841832) & [sirupsen/napkin-math](https://github.com/sirupsen/napkin-math) 
Jeff Dean's original [talk](http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf) 
]

---

## Implications for big data

Lets imagine we have a *10 GB* flat data file and that we want to select certain rows based on a particular criteria. This requires a sequential read across the entire data set.

If we can store the file in memory:

* `$10~GB \times (250~\mu s / 1~MB) = 2.5$` seconds

If we have to access the file from SSD:

* `$10~GB \times (1~ms / 1~MB) = 10$` seconds

If we have to access the file from disk:

* `$10~GB \times (20~ms / 1~MB) = 200$` seconds

This is just for *reading* sequential data, if we make any modifications (*writing*) or the data is fragmented things are much worse.

---

## Blocks

.center[
*Cost*: Disk << SSD <<< Memory
]
 
.center[
*Speed*: Disk <<< SSD << Memory
]

So usually possible to grow our disk storage to accommodate our data. However, memory is usually the limiting resource, and if we can't fit everything into memory?

Create *blocks* - group related data (i.e. rows) and read in multiple rows at a time. Optimal size will depend on the task and the properties of the disk.

---

## Linear vs Binary Search

Even with blocks, any kind of querying / subsetting of rows requires a linear search, which requires `$\mathcal{O}(N)$` accesses where `$N$` is the number of blocks.

We can do much better if we are careful about how we structure our data, specifically sorting some or all of the columns.

* Sorting is expensive, `$\mathcal{O}(N \log N)$`, but it only needs to be done once.

* After sorting, we can use a binary search for any subsetting tasks ( `$\mathcal{O}(\log N)$` ).

* These "sorted" columns are known as *indexes*.

* Indexes require additional storage, but usually small enough to be kept in memory while blocks stay on disk.

---

## and then?

This is just barely scratching the surface,

* Efficiency gains are not just for disk, access is access

* In general, trade off between storage and efficiency

* Reality is a lot more complicated for everything mentioned so far, lots of very smart people have spent a lot of time thinking about and implementing tools

* Different tasks with different requirements require different implementations and have different criteria for optimization

---
class: middle

# Databases

---

## R & databases - the DBI package

Low level package for interfacing R with Database management systems (DBMS) that provides a common interface to achieve the following functionality:

* connect/disconnect from DB
* create and execute statements in the DB
* extract results/output from statements
* error/exception handling
* information (meta-data) from database objects
* transaction management (optional)

.footnote[
See [r-dbi.org](https://www.r-dbi.org/) for more details
]

---

## RSQLite

Provides the implementation necessary to use DBI to interface with an SQLite database.

```r
library(RSQLite)
```

this package also loads the necessary DBI functions as well.

Once loaded we can create a connection to our database,

```r
con = dbConnect(RSQLite::SQLite(), ":memory:")
str(con)

## Formal class 'SQLiteConnection' [package "RSQLite"] with 5 slots
## ..@ Id :<externalptr> 
## ..@ dbname : chr ":memory:"
## ..@ loadable.extensions: logi TRUE
## ..@ flags : int 6
## ..@ vfs : chr ""
```

---

## Example Table
 
```r
employees = data.frame(
  name   = c("Alice","Bob","Carol","Dave","Eve","Frank"),
  email  = c("alice@company.com", "bob@company.com",
             "carol@company.com", "dave@company.com",
             "eve@company.com",   "frank@comany.com"),
  salary = c(52000, 40000, 30000, 33000, 44000, 37000),
  dept   = c("Accounting", "Accounting","Sales",
             "Accounting","Sales","Sales"),
)
```

```r
dbWriteTable(con, name = "employees", value = employees)
## [1] TRUE

dbListTables(con)
## [1] "employees"
```

---

## Removing Tables

```r
dbWriteTable(con, "employs", employees)
## [1] TRUE

dbListTables(con)
## [1] "employees" "employs"

dbRemoveTable(con,"employs")
## [1] TRUE

dbListTables(con)
## [1] "employees"
```

---

## Querying Tables

```r
(res = dbSendQuery(con, "SELECT * FROM employees"))
## <SQLiteResult>
## SQL SELECT * FROM employees
## ROWS Fetched: 0 [incomplete]
## Changed: 0

dbFetch(res)
##    name             email salary       dept
## 1 Alice alice@company.com  52000 Accounting
## 2   Bob   bob@company.com  40000 Accounting
## 3 Carol carol@company.com  30000      Sales
## 4  Dave  dave@company.com  33000 Accounting
## 5   Eve   eve@company.com  44000      Sales
## 6 Frank  frank@comany.com  37000      Sales

dbClearResult(res)
## [1] TRUE
```

---

## Creating empty tables

```r
dbCreateTable(con, "iris", iris)

(res = dbSendQuery(con, "select * from iris"))
## <SQLiteResult>
## SQL select * from iris
## ROWS Fetched: 0 [complete]
## Changed: 0

dbFetch(res)
## [1] Sepal.Length Sepal.Width Petal.Length Petal.Width Species 
## <0 rows> (or 0-length row.names)
```

```r
dbFetch(res) %>% as_tibble()
## # A tibble: 0 × 5
## # … with 5 variables: Sepal.Length <dbl>, Sepal.Width <dbl>, Petal.Length <dbl>,
## # Petal.Width <dbl>, Species <chr>

dbClearResult(res)
```

---

## Adding to tables

```r
dbAppendTable(con, name = "iris", value = iris)
## [1] 150
## Warning message:
## Factors converted to character 
```

```r
(res = dbSendQuery(con, "select * from iris"))
## <SQLiteResult>
## SQL select * from iris
## ROWS Fetched: 0 [incomplete]
## Changed: 0

dbFetch(res) %>% as_tibble()
## # A tibble: 150 x 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## <dbl> <dbl> <dbl> <dbl> <chr> 
## 1 5.1 3.5 1.4 0.2 setosa 
## 2 4.9 3 1.4 0.2 setosa 
## 3 4.7 3.2 1.3 0.2 setosa 
## 4 4.6 3.1 1.5 0.2 setosa 
## 5 5 3.6 1.4 0.2 setosa 
## 6 5.4 3.9 1.7 0.4 setosa 
## 7 4.6 3.4 1.4 0.3 setosa 
## 8 5 3.4 1.5 0.2 setosa 
## 9 4.4 2.9 1.4 0.2 setosa 
## 10 4.9 3.1 1.5 0.1 setosa 
## # … with 140 more rows
```

---

## Ephemeral results

```r
res
## <SQLiteResult>
## SQL select * from iris
## ROWS Fetched: 150 [complete]
## Changed: 0

dbFetch(res) %>% as_tibble()
## # A tibble: 0 x 5
## # … with 5 variables: Sepal.Length <dbl>, Sepal.Width <dbl>, Petal.Length <dbl>, Petal.Width <dbl>,
## # Species <chr>

dbClearResult(res)
```

---

## Closing the connection

```r
con
## <SQLiteConnection>
## Path: :memory:
## Extensions: TRUE

dbDisconnect(con)
## [1] TRUE

con
## <SQLiteConnection>
## DISCONNECTED
```

---
class: center, middle

# dplyr & databases

---

## Creating a database

```r
(db = DBI::dbConnect(RSQLite::SQLite(), "flights.sqlite"))
```

```
## <SQLiteConnection>
## Path: flights.sqlite
## Extensions: TRUE
```

```r
flight_tbl = dplyr::copy_to(db, nycflights13::flights, name = "flights", temporary = FALSE)
flight_tbl
```

```
## # Source: table<flights> [?? x 19]
## # Database: sqlite 3.37.2 [flights.sqlite]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # … with more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
## # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dbl>
```

---

## What have we created?

All of this data now lives in the database on the *filesystem* not in *memory*,

```r
pryr::object_size(db)
```

```
## 2,456 B
```

```r
pryr::object_size(flight_tbl)
```

```
## 6,192 B
```

```r
pryr::object_size(nycflights13::flights)
```

```
## 40,650,048 B
```
--

```bash
ls -lah *.sqlite
```

```
## -rw-r--r--  1 rundel  staff    21M Mar 23 10:36 flights.sqlite
```

---

## What is `flight_tbl`?

```r
class(nycflights13::flights)
```

```
## [1] "tbl_df"     "tbl"        "data.frame"
```

```r
class(flight_tbl)
```

```
## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"
```

```r
str(flight_tbl)
```

```
## List of 2
## $ src:List of 2
## ..$ con :Formal class 'SQLiteConnection' [package "RSQLite"] with 8 slots
## .. .. ..@ ptr :<externalptr> 
## .. .. ..@ dbname : chr "flights.sqlite"
## .. .. ..@ loadable.extensions: logi TRUE
## .. .. ..@ flags : int 70
## .. .. ..@ vfs : chr ""
## .. .. ..@ ref :<environment: 0x1254dc5d8> 
## .. .. ..@ bigint : chr "integer64"
## .. .. ..@ extended_types : logi FALSE
## ..$ disco: NULL
## ..- attr(*, "class")= chr [1:4] "src_SQLiteConnection" "src_dbi" "src_sql" "src"
## $ ops:List of 2
## ..$ x : 'ident' chr "flights"
## ..$ vars: chr [1:19] "year" "month" "day" "dep_time" ...
## ..- attr(*, "class")= chr [1:3] "op_base_remote" "op_base" "op"
## - attr(*, "class")= chr [1:5] "tbl_SQLiteConnection" "tbl_dbi" "tbl_sql" "tbl_lazy" ...
```

---

## Accessing existing tables

```r
(dplyr::tbl(db, "flights"))
```

---

## Using dplyr with sqlite

.pull-left[

```r
(oct_21 = flight_tbl %>%
   filter(month == 10, day == 21) %>%
   select(origin, dest, tailnum)
)
```

```
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.37.2 [flights.sqlite]
## origin dest tailnum
## <chr> <chr> <chr> 
## 1 EWR CLT N152UW 
## 2 EWR IAH N535UA 
## 3 JFK MIA N5BSAA 
## 4 JFK SJU N531JB 
## 5 JFK BQN N827JB 
## 6 LGA IAH N15710 
## 7 JFK IAD N825AS 
## 8 EWR TPA N802UA 
## 9 LGA ATL N996DL 
## 10 JFK FLL N627JB 
## # … with more rows
```
]

.pull-right[

```r
dplyr::collect(oct_21)
```

```
## # A tibble: 991 × 3
## origin dest tailnum
## <chr> <chr> <chr> 
## 1 EWR CLT N152UW 
## 2 EWR IAH N535UA 
## 3 JFK MIA N5BSAA 
## 4 JFK SJU N531JB 
## 5 JFK BQN N827JB 
## 6 LGA IAH N15710 
## 7 JFK IAD N825AS 
## 8 EWR TPA N802UA 
## 9 LGA ATL N996DL 
## 10 JFK FLL N627JB 
## # … with 981 more rows
```
]

---

## Laziness

dplyr / dbplyr uses lazy evaluation as much as possible, particularly when working with non-local backends.

* When building a query, we don't want the entire table, often we want just enough to check if our query is working / makes sense.

* Since we would prefer to run one complex query over many simple queries, laziness allows for verbs to be strung together.

* Therefore, by default `dplyr`

* won't connect and query the database until absolutely necessary (e.g. show output),

* and unless explicitly told to, will only query a handful of rows to give a sense of what the result will look like.
    
    * we can force evaluation via `compute()`, `collect()`, or `collapse()`

---

## A crude benchmark

.pull-left[

```r
system.time({
  (oct_21 = flight_tbl %>%
    filter(month == 10, day == 21) %>%
    select(origin, dest, tailnum)
  )
})
```

```
##    user  system elapsed 
##   0.002   0.000   0.002
```
]

.pull-right[

```r
system.time({
  print(oct_21) %>% 
    capture.output() %>% 
    invisible()
})
```

```
##    user  system elapsed 
##   0.015   0.000   0.015
```
]

<div>
.pull-left[

```r
system.time({
  dplyr::collect(oct_21) %>% 
    capture.output() %>% 
    invisible()
})
```

```
## user system elapsed 
## 0.038 0.003 0.041
```
]
</div>

---

## dplyr -> SQL - dplyr::show_query()

```r
class(oct_21)
```

```
## [1] "tbl_SQLiteConnection" "tbl_dbi"              "tbl_sql"             
## [4] "tbl_lazy"             "tbl"
```

```r
show_query(oct_21)
```

```
## <SQL>
## SELECT `origin`, `dest`, `tailnum`
## FROM `flights`
## WHERE ((`month` = 10.0) AND (`day` = 21.0))
```

---

## More complex queries

.pull-left[

```r
oct_21 %>% 
  group_by(origin, dest) %>% 
  summarize(n=n(), .groups = "drop")
```

```
## # Source: lazy query [?? x 3]
## # Database: sqlite 3.37.2 [flights.sqlite]
## origin dest n
## <chr> <chr> <int>
## 1 EWR ATL 15
## 2 EWR AUS 3
## 3 EWR AVL 1
## 4 EWR BNA 7
## 5 EWR BOS 17
## 6 EWR BTV 3
## 7 EWR BUF 2
## 8 EWR BWI 1
## 9 EWR CHS 4
## 10 EWR CLE 4
## # … with more rows
```
]

.pull-right[

```r
oct_21 %>% 
  group_by(origin, dest) %>% 
  summarize(n=n(), .groups = "drop") %>% 
  show_query()
```

```
## <SQL>
## SELECT `origin`, `dest`, COUNT(*) AS `n`
## FROM (SELECT `origin`, `dest`, `tailnum`
## FROM `flights`
## WHERE ((`month` = 10.0) AND (`day` = 21.0)))
## GROUP BY `origin`, `dest`
```
]

---

```r
oct_21 %>% count(origin, dest) %>% show_query()
```

```
## <SQL>
## SELECT `origin`, `dest`, COUNT(*) AS `n`
## FROM (SELECT `origin`, `dest`, `tailnum`
## FROM `flights`
## WHERE ((`month` = 10.0) AND (`day` = 21.0)))
## GROUP BY `origin`, `dest`
```

---

## SQL Translation

In general, dplyr / dbplyr knows how to translate basic math, logical, and summary functions from R to SQL. dbplyr has a function, `translate_sql`, that lets you experiment with how R functions are translated to SQL.

```r
dbplyr::translate_sql(x == 1 & (y < 2 | z > 3))
```

```
## <SQL> `x` = 1.0 AND (`y` < 2.0 OR `z` > 3.0)
```

```r
dbplyr::translate_sql(x ^ 2 < 10)
```

```
## <SQL> POWER(`x`, 2.0) < 10.0
```

```r
dbplyr::translate_sql(x %% 2 == 10)
```

```
## <SQL> `x` % 2.0 = 10.0
```

```r
dbplyr::translate_sql(mean(x))
```

```
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
```

```
## <SQL> AVG(`x`) OVER ()
```

```r
dbplyr::translate_sql(mean(x, na.rm=TRUE))
```

```
## <SQL> AVG(`x`) OVER ()
```

---

```r
dbplyr::translate_sql(sd(x))
```

```
## <SQL> sd(`x`)
```

```r
dbplyr::translate_sql(paste(x,y))
```

```
## <SQL> CONCAT_WS(' ', `x`, `y`)
```

```r
dbplyr::translate_sql(cumsum(x))
```

```
## Warning: Windowed expression 'SUM(`x`)' does not have explicit order.
## Please use arrange() or window_order() to make determinstic.
```

```
## <SQL> SUM(`x`) OVER (ROWS UNBOUNDED PRECEDING)
```

```r
dbplyr::translate_sql(lag(x))
```

```
## <SQL> LAG(`x`, 1, NULL) OVER ()
```

---

## Dialectic variations?

By default `dbplyr::translate_sql()` will translate R / dplyr code into ANSI SQL, if we want to see results specific to a certain database we can pass in a connection object,

```r
dbplyr::translate_sql(sd(x), con = db)
```

```
## Warning: Missing values are always removed in SQL.
## Use `STDEV(x, na.rm = TRUE)` to silence this warning
## This warning is displayed only once per session.
```

```
## <SQL> STDEV(`x`) OVER ()
```

```r
dbplyr::translate_sql(paste(x,y), con = db)
```

```
## <SQL> `x` || ' ' || `y`
```

```r
dbplyr::translate_sql(cumsum(x), con = db)
```

```
## Warning: Windowed expression 'SUM(`x`)' does not have explicit order.
## Please use arrange() or window_order() to make determinstic.
```

```
## <SQL> SUM(`x`) OVER (ROWS UNBOUNDED PRECEDING)
```

```r
dbplyr::translate_sql(lag(x), con = db)
```

```
## <SQL> LAG(`x`, 1, NULL) OVER ()
```

---

## Complications?

```r
oct_21 %>% mutate(tailnum_n_prefix = grepl("^N", tailnum))
```

```
## Error: no such function: grepl
```

```r
oct_21 %>% mutate(tailnum_n_prefix = grepl("^N", tailnum)) %>% show_query()
```

```
## <SQL>
## SELECT `origin`, `dest`, `tailnum`, grepl('^N', `tailnum`) AS `tailnum_n_prefix`
## FROM `flights`
## WHERE ((`month` = 10.0) AND (`day` = 21.0))
```

---
class: middle, center

# SQL -> R / dplyr

---

## Running SQL queries against R objects

There are two packages that implement this in R which take very different approaches,

* [`tidyquery`](https://github.com/ianmcook/tidyquery) - this package parses your SQL code using the `queryparser` package and then translates the result into R / dplyr code.

* [`sqldf`](https://github.com/ggrothendieck/sqldf) - transparently creates a database with teh data and then runs the query using that database. Defaults to SQLite but other backends are available.

---

## tidyquery

.pull-left[

```r
data(flights, package = "nycflights13")

tidyquery::query(
  "SELECT origin, dest, COUNT(*) AS n
   FROM flights
   WHERE month = 10 AND day = 21
   GROUP BY origin, dest"
)
```

```
## # A tibble: 181 × 3
## origin dest n
## <chr> <chr> <int>
## 1 EWR ATL 15
## 2 EWR AUS 3
## 3 EWR AVL 1
## 4 EWR BNA 7
## 5 EWR BOS 17
## 6 EWR BTV 3
## 7 EWR BUF 2
## 8 EWR BWI 1
## 9 EWR CHS 4
## 10 EWR CLE 4
## # … with 171 more rows
```
]

.pull-right[

```r
flights %>%
  tidyquery::query(
    "SELECT origin, dest, COUNT(*) AS n
     WHERE month = 10 AND day = 21
     GROUP BY origin, dest"
  ) %>%
  arrange(desc(n))
```

```
## # A tibble: 181 × 3
## origin dest n
## <chr> <chr> <int>
## 1 JFK LAX 32
## 2 LGA ORD 31
## 3 LGA ATL 30
## 4 JFK SFO 24
## 5 LGA CLT 22
## 6 EWR ORD 18
## 7 EWR SFO 18
## 8 EWR BOS 17
## 9 LGA MIA 17
## 10 EWR LAX 16
## # … with 171 more rows
```
]

---

## Translating to dplyr

```r
tidyquery::show_dplyr(
  "SELECT origin, dest, COUNT(*) AS n
   FROM flights
   WHERE month = 10 AND day = 21
   GROUP BY origin, dest"
)
```

```
## flights %>%
##   filter(month == 10 & day == 21) %>%
##   group_by(origin, dest) %>%
##   summarise(n = dplyr::n()) %>%
##   ungroup()
```

---

## sqldf

.pull-left[

```r
sqldf::sqldf(
  "SELECT origin, dest, COUNT(*) AS n
   FROM flights
   WHERE month = 10 AND day = 21
   GROUP BY origin, dest"
)
```

```
##     origin dest  n
## 1      EWR  ATL 15
## 2      EWR  AUS  3
## 3      EWR  AVL  1
## 4      EWR  BNA  7
## 5      EWR  BOS 17
## 6      EWR  BTV  3
## 7      EWR  BUF  2
## 8      EWR  BWI  1
## 9      EWR  CHS  4
## 10     EWR  CLE  4
## 11     EWR  CLT 15
## 12     EWR  CMH  3
## 13     EWR  CVG  9
## 14     EWR  DAY  4
## 15     EWR  DCA  3
## 16     EWR  DEN  8
## 17     EWR  DFW  9
## 18     EWR  DSM  2
## 19     EWR  DTW 10
## 20     EWR  FLL 10
## 21     EWR  GRR  2
## 22     EWR  GSO  4
## 23     EWR  GSP  2
## 24     EWR  HNL  1
## 25     EWR  HOU  3
## 26     EWR  IAD  5
## 27     EWR  IAH 11
## 28     EWR  IND  5
## 29     EWR  JAX  4
## 30     EWR  LAS  6
## 31     EWR  LAX 16
## 32     EWR  MCI  4
## 33     EWR  MCO 13
## 34     EWR  MDW  6
## 35     EWR  MEM  3
## 36     EWR  MHT  3
## 37     EWR  MIA  7
## 38     EWR  MKE  3
## 39     EWR  MSN  1
## 40     EWR  MSP 10
## 41     EWR  MSY  4
## 42     EWR  OKC  1
## 43     EWR  OMA  2
## 44     EWR  ORD 18
## 45     EWR  ORF  1
## 46     EWR  PBI  5
## 47     EWR  PDX  2
## 48     EWR  PHX  7
## 49     EWR  PIT  1
## 50     EWR  PVD  1
## 51     EWR  PWM  1
## 52     EWR  RDU  4
## 53     EWR  RIC  5
## 54     EWR  RSW  3
## 55     EWR  SAN  3
## 56     EWR  SAT  1
## 57     EWR  SAV  2
## 58     EWR  SDF  3
## 59     EWR  SEA  5
## 60     EWR  SFO 18
## 61     EWR  SJU  2
## 62     EWR  SLC  1
## 63     EWR  SNA  3
## 64     EWR  STL  8
## 65     EWR  TPA  6
## 66     EWR  TUL  1
## 67     EWR  TYS  1
## 68     EWR  XNA  1
## 69     JFK  ABQ  1
## 70     JFK  ATL  5
## 71     JFK  AUS  4
## 72     JFK  BNA  2
## 73     JFK  BOS 16
## 74     JFK  BQN  1
## 75     JFK  BTV  4
## 76     JFK  BUF 12
## 77     JFK  BUR  1
## 78     JFK  BWI  3
## 79     JFK  CHS  3
## 80     JFK  CLE  1
## 81     JFK  CLT  8
## 82     JFK  CMH  2
## 83     JFK  CVG  3
## 84     JFK  DCA  9
## 85     JFK  DEN  2
## 86     JFK  DFW  2
## 87     JFK  DTW  3
## 88     JFK  FLL  9
## 89     JFK  HNL  1
## 90     JFK  HOU  2
## 91     JFK  IAD  7
## 92     JFK  IAH  1
## 93     JFK  IND  2
## 94     JFK  JAX  3
## 95     JFK  LAS 12
## 96     JFK  LAX 32
## 97     JFK  LGB  2
## 98     JFK  MCI  1
## 99     JFK  MCO 14
## 100    JFK  MIA  9
## 101    JFK  MSP  3
## 102    JFK  MSY  5
## 103    JFK  OAK  1
## 104    JFK  ORD  7
## 105    JFK  ORF  2
## 106    JFK  PBI  4
## 107    JFK  PDX  2
## 108    JFK  PHL  2
## 109    JFK  PHX  6
## 110    JFK  PIT  3
## 111    JFK  PSE  1
## 112    JFK  PWM  4
## 113    JFK  RDU  9
## 114    JFK  ROC  4
## 115    JFK  RSW  2
## 116    JFK  SAN  5
## 117    JFK  SAT  1
## 118    JFK  SEA  7
## 119    JFK  SFO 24
## 120    JFK  SJC  1
## 121    JFK  SJU 11
## 122    JFK  SLC  6
## 123    JFK  SMF  1
## 124    JFK  SRQ  1
## 125    JFK  SYR  4
## 126    JFK  TPA  8
## 127    LGA  ATL 30
## 128    LGA  BGR  2
## 129    LGA  BHM  1
## 130    LGA  BNA 11
## 131    LGA  BOS 16
## 132    LGA  BTV  1
## 133    LGA  BUF  2
## 134    LGA  CAK  2
## 135    LGA  CHS  3
## 136    LGA  CLE  6
## 137    LGA  CLT 22
## 138    LGA  CMH  7
## 139    LGA  CVG  1
## 140    LGA  DAY  1
## 141    LGA  DCA 16
## 142    LGA  DEN 11
## 143    LGA  DFW 14
## 144    LGA  DSM  1
## 145    LGA  DTW 15
## 146    LGA  FLL 10
## 147    LGA  GSO  4
## 148    LGA  GSP  1
## 149    LGA  HOU  2
## 150    LGA  IAD  6
## 151    LGA  IAH  9
## 152    LGA  ILM  1
## 153    LGA  IND  1
## 154    LGA  JAX  2
## 155    LGA  MCI  1
## 156    LGA  MCO 10
## 157    LGA  MDW  6
## 158    LGA  MEM  3
## 159    LGA  MIA 17
## 160    LGA  MKE  5
## 161    LGA  MSN  2
## 162    LGA  MSP 11
## 163    LGA  MSY  3
## 164    LGA  OMA  1
## 165    LGA  ORD 31
## 166    LGA  ORF  1
## 167    LGA  PBI  6
## 168    LGA  PHL  2
## 169    LGA  PIT  5
## 170    LGA  RDU 11
## 171    LGA  RIC  3
## 172    LGA  ROC  2
## 173    LGA  RSW  2
## 174    LGA  SAV  1
## 175    LGA  SDF  1
## 176    LGA  SRQ  2
## 177    LGA  STL  6
## 178    LGA  SYR  1
## 179    LGA  TPA  6
## 180    LGA  TYS  1
## 181    LGA  XNA  3
```
]

.pull-right[

```r
sqldf::sqldf(
  "SELECT origin, dest, COUNT(*) AS n
   FROM flights
   WHERE month = 10 AND day = 21
   GROUP BY origin, dest"
) %>%
  as_tibble() %>%
  arrange(desc(n))
```