Example adapted from: https://github.com/WinVector/rquery/blob/master/extras/SparkR.md.
First let’s take a look at some example data.
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Our notional task: find the species with the largest mean petal width. This is easy to write as a pipeline of Codd-style relational operators using the rqdatatable implementation of the rquery data wrangling grammar.
library("rqdatatable")
## Loading required package: rquery
iris %.>%
  project_nse(., groupby=c('Species'),
              mean_petal_width = mean(Petal.Width)) %.>%
  pick_top_k(.,
             k = 1,
             orderby = c('mean_petal_width', 'Species'),
             reverse = c('mean_petal_width', 'Species')) %.>%
  select_columns(., c('Species', 'mean_petal_width'))
## Species mean_petal_width
## 1: virginica 2.026
The in-memory implementation is supplied by data.table, so operations tend to be correct and very fast. Directly piping data into operator constructors (instead of first building an operator tree object) is a convenience. For production work, we suggest rquery/rqdatatable users consider piping data into already assembled pipelines as a best practice.
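To make the data.table connection concrete, here is a minimal sketch of the same task written directly in data.table. It is a hand-written equivalent for comparison, not necessarily the code rqdatatable runs internally; the names dt, agg, and top are ours.

```r
library("data.table")

# Copy iris into a data.table so grouped aggregation syntax is available.
dt <- as.data.table(iris)

# Projection: one row per Species holding the mean petal width.
agg <- dt[, .(mean_petal_width = mean(Petal.Width)), by = Species]

# Order by both keys, descending, then keep the first row (the "top 1" pick).
setorderv(agg, cols = c("mean_petal_width", "Species"), order = c(-1L, -1L))
top <- agg[1, .(Species, mean_petal_width)]
print(top)
```

The three steps (aggregate, order, take the first row) line up with the project_nse(), pick_top_k(), and select_columns() stages of the rquery pipeline.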
One can achieve a similar effect with dplyr. However, notice we have to know the canonical dplyr tricks to achieve a projection (3 statements) and window functions (2 statements).
library("dplyr")
## Warning: package 'dplyr' was built under R version 3.5.1
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
iris %>%
  group_by(., Species) %>%                                  # 1 canonical way to aggregate in dplyr
  summarize(., mean_petal_width = mean(Petal.Width)) %>%    # 1
  ungroup(.) %>%                                            # 1
  arrange(., desc(mean_petal_width), desc(Species)) %>%     # 2 window function in dplyr
  filter(., row_number()==1) %>%                            # 2
  select(., Species, mean_petal_width)
## # A tibble: 1 x 2
## Species mean_petal_width
## <fct> <dbl>
## 1 virginica 2.03
rquery itself is “database first”, that is, optimized to work with remote (and possibly large) databases such as PostgreSQL and Apache Spark. Let’s set up a database example.
db <- DBI::dbConnect(RPostgreSQL::PostgreSQL(),
                     host = 'localhost',
                     port = 5432,
                     user = 'johnmount',
                     password = '')
dbopts <- rq_connection_tests(db)
db_info <- rquery_db_info(connection = db,
                          is_dbi = TRUE,
                          connection_options = dbopts)
print(db_info)
## [1] "rquery_db_info(PostgreSQLConnection, is_dbi=TRUE, note=\"\")"
To use rquery on a database table, start the pipeline with a table description instead of an actual data.frame.
table_description <- rquery::rq_copy_to(db_info, "iris", iris,
                                        overwrite = TRUE,
                                        temporary = TRUE)
print(table_description)
## [1] "table(\"iris\"; Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)"
The table description is just the name of the table and the column names assumed to be in the table. It does not contain a database handle or any sort of reference to the database. With the table description we build an optree, or operator pipeline, as follows.
ops <- table_description %.>%
  project_nse(., groupby=c('Species'),
              mean_petal_width = mean(Petal.Width)) %.>%
  pick_top_k(.,
             k = 1,
             orderby = c('mean_petal_width', 'Species'),
             reverse = c('mean_petal_width', 'Species')) %.>%
  select_columns(., c('Species', 'mean_petal_width'))
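Because a table description carries no database handle, the same kind of pipeline can be specified with no connection at all. The following is a minimal sketch, assuming rquery's mk_td() is used to write the description by hand; the names td and ops_offline are ours, and wrapr is attached explicitly only to be sure the %.>% pipe is available.

```r
library("rquery")
library("wrapr")  # supplies the %.>% pipe

# Describe the table by name and columns only; no database is contacted.
td <- mk_td("iris",
            c("Sepal.Length", "Sepal.Width",
              "Petal.Length", "Petal.Width", "Species"))
print(td)

# Build the operator tree against the description alone.
ops_offline <- td %.>%
  project_nse(., groupby = c('Species'),
              mean_petal_width = mean(Petal.Width)) %.>%
  pick_top_k(.,
             k = 1,
             orderby = c('mean_petal_width', 'Species'),
             reverse = c('mean_petal_width', 'Species')) %.>%
  select_columns(., c('Species', 'mean_petal_width'))

# The tree already knows which columns it will produce.
print(column_names(ops_offline))
```

Nothing here touches data: the tree is a specification that can later be handed to a database connection or to in-memory data.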
For database operations, rquery operator specification and execution are completely separate. ops is the tree of operators, not a result or a result handle. We can examine ops (as we show below) or even add more stages.
ops %.>%
  op_diagram(.) %.>%
  DiagrammeR::DiagrammeR(diagram = ., type = "grViz")
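Besides drawing a diagram, an operator tree can be inspected and extended programmatically. Here is a small sketch, assuming a hand-written table description via mk_td(); the names td, ops_sketch, ops_longer, and the derived column mean_petal_width_mm are hypothetical and exist only for illustration.

```r
library("rquery")
library("wrapr")  # supplies the %.>% pipe

td <- mk_td("iris",
            c("Sepal.Length", "Sepal.Width",
              "Petal.Length", "Petal.Width", "Species"))

# A short tree: just the grouped aggregation.
ops_sketch <- td %.>%
  project_nse(., groupby = c('Species'),
              mean_petal_width = mean(Petal.Width))

# Columns the tree produces, and the source columns it actually needs.
print(column_names(ops_sketch))
print(columns_used(ops_sketch))

# Adding a stage yields a new, longer tree; ops_sketch itself is unchanged.
ops_longer <- ops_sketch %.>%
  extend_nse(., mean_petal_width_mm = 10 * mean_petal_width)
print(ops_longer)
```

Since the tree is an ordinary R object, stages can be appended at any time before execution.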
The preferred method of using rquery on databases is through the materialize() command. This takes inputs from the database and writes the result into the database without ever moving data into R. This is essential when working with big data.
res_table_description <- materialize(db_info, ops)
rstr(db_info, res_table_description$table_name)
## table "rquery_mat_03125186669070964167_0000000000" PostgreSQLConnection
## nrow: 1
## 'data.frame': 1 obs. of 2 variables:
## $ Species : chr "virginica"
## $ mean_petal_width: num 2.03
Results can be brought back directly using execute() or just by piping our database connection into the operator tree.
execute(db_info, ops)
## Species mean_petal_width
## 1 virginica 2.026
db_info %.>% ops
## Species mean_petal_width
## 1 virginica 2.026
In the database case, rquery transforms are implemented using one or more SQL statements. The SQL to implement the above transformations is shown here. Notice how SQL represents composition by nesting, so the query must be read backwards: the first pipeline stage is the innermost SELECT and the last stage is the outermost.
cat(to_sql(ops, db_info))
## SELECT
## "Species",
## "mean_petal_width"
## FROM (
## SELECT * FROM (
## SELECT
## "Species",
## "mean_petal_width",
## row_number ( ) OVER ( ORDER BY "mean_petal_width" DESC, "Species" DESC ) AS "row_number"
## FROM (
## SELECT "Species", avg ( "Petal.Width" ) AS "mean_petal_width" FROM (
## SELECT
## "Petal.Width",
## "Species"
## FROM
## "iris"
## ) tsql_95898858752954886005_0000000000
## GROUP BY
## "Species"
## ) tsql_95898858752954886005_0000000001
## ) tsql_95898858752954886005_0000000002
## WHERE "row_number" <= 1
## ) tsql_95898858752954886005_0000000003
The exact same operator tree can be used on different databases, or even on in-memory data.
iris %.>% ops
## Species mean_petal_width
## 1: virginica 2.026
The above is what we mean by piping data to a pre-built tree. This is the preferred way to use rquery/rqdatatable as it avoids expensive re-copying of intermediate results.
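As a sketch of what reusing a pre-built tree can look like, the example below assembles the pipeline once from a description of a local data.frame (via rquery's local_td()) and then pipes two different data frames into it. The names ops_reuse, no_virginica, res_all, and res_sub are ours.

```r
library("rqdatatable")
library("wrapr")  # supplies the %.>% pipe

# Assemble the pipeline once, from the column structure of iris.
ops_reuse <- local_td(iris) %.>%
  project_nse(., groupby = c('Species'),
              mean_petal_width = mean(Petal.Width)) %.>%
  pick_top_k(.,
             k = 1,
             orderby = c('mean_petal_width', 'Species'),
             reverse = c('mean_petal_width', 'Species')) %.>%
  select_columns(., c('Species', 'mean_petal_width'))

# Apply the same tree to the full data ...
res_all <- iris %.>% ops_reuse
print(res_all)

# ... and to a second data frame with the same columns.
no_virginica <- iris[iris$Species != "virginica", ]
res_sub <- no_virginica %.>% ops_reuse
print(res_sub)
```

The pipeline is specified and checked once; only the data changes between applications.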
dplyr can also be used on databases, via the dbplyr package. dplyr uses a remote table handle instead of a table description (i.e. the table representation holds a database reference).
table_handle <- dplyr::tbl(db, "iris")
table_handle %>%
  group_by(., Species) %>%
  summarize(., mean_petal_width = mean(Petal.Width, na.rm = TRUE)) %>%
  ungroup(.) %>%
  arrange(., desc(mean_petal_width), desc(Species)) %>%
  filter(., row_number()==1) %>%
  select(., Species, mean_petal_width)
## # Source: lazy query [?? x 2]
## # Database: postgres 10.4.0 [johnmount@localhost:5432/johnmount]
## # Ordered by: desc(mean_petal_width), desc(Species)
## Species mean_petal_width
## <chr> <dbl>
## 1 virginica 2.03
dplyr separates operator specification and execution through lazy evaluation. Until something forces a result (say a print() or a dplyr::compute()), no calculation is performed. The implicit printing found in R triggered the above calculation. Below, in contrast, only the query is printed and no data processing occurs.
table_handle %>%
  group_by(., Species) %>%
  summarize(., mean_petal_width = mean(Petal.Width, na.rm = TRUE)) %>%
  ungroup(.) %>%
  arrange(., desc(mean_petal_width), desc(Species)) %>%
  filter(., row_number()==1) %>%
  select(., Species, mean_petal_width) %>%
  dbplyr::remote_query(.)
## <SQL> SELECT "Species", "mean_petal_width"
## FROM (SELECT "Species", "mean_petal_width"
## FROM (SELECT "Species", "mean_petal_width", row_number() OVER (ORDER BY "mean_petal_width" DESC, "Species" DESC) AS "zzz3"
## FROM (SELECT *
## FROM (SELECT "Species", AVG("Petal.Width") AS "mean_petal_width"
## FROM "iris"
## GROUP BY "Species") "rsktwdnkgm"
## ORDER BY "mean_petal_width" DESC, "Species" DESC) "cdclkgofnq") "bypbdhkwst"
## WHERE ("zzz3" = 1.0)) "lfysjlcxae"
In contrast, rquery separates specification and execution explicitly: the operator tree is built first and nothing runs until it is handed to materialize(), execute(), or a data source. rquery also moves a number of checks and optimizations into the specification phase. This means some errors, such as misspelling a column late in a pipeline, are cheap to find with rquery and potentially expensive to find with dplyr.
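To illustrate the early-checking point, here is a sketch in which the last stage of a pipeline misspells a column name. We assume, consistent with the description above, that rquery rejects the unknown column while the tree is being built, before any data or database is involved. The names td and attempt are ours, the typo mean_petal_widht is deliberate, and the exact wording of the error message may vary by rquery version.

```r
library("rquery")
library("wrapr")  # supplies the %.>% pipe

# A table description only; no data and no database connection.
td <- mk_td("iris",
            c("Sepal.Length", "Sepal.Width",
              "Petal.Length", "Petal.Width", "Species"))

# Try to build a pipeline whose final stage has a misspelled column name.
attempt <- tryCatch({
  td %.>%
    project_nse(., groupby = c('Species'),
                mean_petal_width = mean(Petal.Width)) %.>%
    select_columns(., c('Species', 'mean_petal_widht'))  # deliberate typo
  "no error"
}, error = function(e) conditionMessage(e))

# Holds the error message if the mistake was caught at specification time.
print(attempt)
```

With a lazily evaluated remote pipeline, the same mistake may only surface once the query is actually sent to the database.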
DBI::dbDisconnect(db)
## [1] TRUE