To introduce working with a database, we’ll use one you can’t break or mess up, since it’s just running locally on your computer. SQLite is a lightweight DB platform that’s widely available and relatively simple. It’s so simple that the “server” part is literally just a file sitting in your working directory, instead of a whole other application. You can use your command line as a client, or you can use R.
You could open your command line/Terminal/PowerShell and type individual commands:
cd ./PATH/TO/YOUR/WORKING/DIRECTORY
sqlite3 experiment.db
Then you can use .help for instructions, and input all the SQL you want (or .exit to quit). But it’s a little bit hard to read at the command line, so lets do it from R instead.
The con object we just created is important. It’s how R knows where and how to send SQL queries, and how to process the data that gets returned back from them.
You can always make new connections to a database, but either your DB or R will crash if you run out of memory. This is very unlikely to happen in normal usage (both can handle millions of connections concurrently), but it’s a good idea to close a connection you opened after your script is done:
dbDisconnect(con)
NOTE: Connection objects contain pointers to memory addresses, so you can’t save them in your R environment for later, or save them to disk. Every time you restart R or open a new R session, you have to make a new connection at the top of your script.
Now we have an empty database, stored in a file in our working directory.
Let’s go back to our yields experiment and put that data into our database. Again, all of this could be done at the command line instead.
treatments <- data.frame(
plot_id = as.integer(c(101, 102, 103, 104, 201, 202, 203, 204)),
crop = c("corn", "corn", "soy", "soy", "corn", "soy", "corn", "soy"),
tillage = c(
"no-till", "strip-till", "strip-till", "no-till",
"strip-till", "no-till", "no-till", "strip-till"
)
)
yields <- data.frame(
plot_id = as.integer(c(101, 102, 103, 104, 201, 202, 203, 204)),
Mg_ha = c(10, 12, 4.5, 4.5, 11, 3, 9.5, 5)
)You can run arbitrary SQL from R using dbExecute.
CREATE TABLE, which is nice and easy). Note that when you make each column, you have to decide what data type goes in there, and different DBs support different data types.INSERT INTO that table sets of VALUES.dbGetQuery to see what was written.[1] 0
# 0 rows affected (new table)
dbExecute(
con, "
INSERT INTO treatments (plot_id, crop, tillage)
VALUES
(101,'corn','no-till'),
(102,'corn','strip-till'),
(103,'soy','strip-till'),
(104,'soy','no-till'),
(201,'corn','strip-till'),
(202,'soy','no-till'),
(203,'corn','no-till'),
(204,'soy','strip-till');
")[1] 8
plot_id crop tillage
1 101 corn no-till
2 102 corn strip-till
3 103 soy strip-till
4 104 soy no-till
5 201 corn strip-till
6 202 soy no-till
7 203 corn no-till
8 204 soy strip-till
That is kind of hard to read, and hard to write by hand without making any mistakes. We can use the R to handle a lot of that for us.
plot_id Mg_ha
1 101 10.0
2 102 12.0
3 103 4.5
4 104 4.5
5 201 11.0
6 202 3.0
7 203 9.5
8 204 5.0
dbWriteTable did all the work of converting the dataframes into those long SQL statements, then sent and executed the query. If you want to reset the database back to the beginning to try running these again, you should run dbRemoveTable(con, "treatments") and dbRemoveTable(con, "yields")
[1] "treatments" "yields"
This is the same output you’d get from .tables at the SQLite command line.
I just showed examples of how to read individual tables with dbGetQuery(con, "SELECT * FROM treatments;") and dbReadTable(con, "yields"). These functions both send out a query, and return the whole table back to R immediately. This is fine for small tables, but if a table has a million rows, this might be too slow and cumbersome.
Fortunately one of the powerful features of databases is that you can send a query and get back a preview without computing the whole request. Besides just getting back whole tables, SQL provides many operations that you can do inside the database, sending back just that preview. These functions include counting rows, adding two columns together, filtering on conditions, and lots more. This way you can construct a query and change it bit-by-bit until you’re sure you’re getting the right data without sacrificing speed and memory at each step.
You can do this with the vanilla SQL interface:
<SQLiteResult>
SQL SELECT * FROM treatments
ROWS Fetched: 0 [incomplete]
Changed: 0
plot_id crop tillage
1 101 corn no-till
2 102 corn strip-till
3 103 soy strip-till
4 104 soy no-till
5 201 corn strip-till
Then you look at the preview and decide that is the query you wanted:
plot_id crop tillage
1 101 corn no-till
2 102 corn strip-till
3 103 soy strip-till
4 104 soy no-till
5 201 corn strip-till
6 202 soy no-till
7 203 corn no-till
8 204 soy strip-till
And you have to clear the pending query when you’re done. This pattern of having intermediate function calls and variables is a pain and a common source of errors, so I don’t recommend it. But if you have a particularly complex SQL query you’ve written, this is one way to go.
If you don’t want to memorize new SQL functions and handle query-state management, there is an alternative in the database backend of the popular {dplyr} package. {dbplyr} takes the common verbs you can learn on locally stored dataframes and translates them into SQL, then constructs the request for you.
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
tbl(con, "treatments")# Source: table<treatments> [?? x 3]
# Database: sqlite 3.30.1 [./experiment.db]
plot_id crop tillage
<int> <chr> <chr>
1 101 corn no-till
2 102 corn strip-till
3 103 soy strip-till
4 104 soy no-till
5 201 corn strip-till
6 202 soy no-till
7 203 corn no-till
8 204 soy strip-till
Notice at the top, it says table<treatments> [?? x 3]? This tells us that we’re viewing table (table<treatments>) in a database (Database: sqlite 3.30.1 [./experiment.db]), instead of a local dataframe. (You might also see lazy query instead of a table name; that means we’ve constructed a temporary unnamed table in the DB’s memory.)
It also tells us that we have 3 columns, but unknown number of rows. Since we just created this table we actually know there are only those 8 rows. However, often in queries, the database could have millions of rows. It’s only returning this short preview. If we want to manipulate this query, we can do that without forcing the database to give us everything (which saves your time in front of a computer waiting for commands to run).
Let’s look at only the rows that match corn plots.
# Source: lazy query [?? x 3]
# Database: sqlite 3.30.1 [./experiment.db]
plot_id crop tillage
<int> <chr> <chr>
1 101 corn no-till
2 102 corn strip-till
3 201 corn strip-till
4 203 corn no-till
Now we’ve got a “lazy query”, because the filtering happened in the database, not in R. We can poke into the backend and find out what’s being executed:
<SQL>
SELECT *
FROM `treatments`
WHERE (`crop` = 'corn')
You can choose rows based on the values in any column, whether they’re text (like crop == "corn" above), numeric, dates, boolean/logical, etc. Let’s look at the yield table, and filter for the rows where the yields were low:
# Source: lazy query [?? x 2]
# Database: sqlite 3.30.1 [./experiment.db]
plot_id Mg_ha
<int> <dbl>
1 103 4.5
2 104 4.5
3 202 3
This is the same as using the vanilla SQL query: SELECT * FROM yields WHERE (Mg_ha < 5.0);.
Another common use case of database operations is group summaries:
# Source: lazy query [?? x 3]
# Database: sqlite 3.30.1 [./experiment.db]
# Groups: crop
crop tillage n
<chr> <chr> <int>
1 corn no-till 2
2 corn strip-till 2
3 soy no-till 2
4 soy strip-till 2
And again, we can see what SQL query is really being sent with show_query():
<SQL>
SELECT `crop`, `tillage`, COUNT() AS `n`
FROM `treatments`
GROUP BY `crop`, `tillage`
But there are some functions that can only be run inside R. For those, you have to force the lazy query to execute fully, with collect().
tbl(con, "treatments") %>%
collect() %>%
mutate(
label = stringr::str_to_title(tillage),
Rep = plot_id %/% 100
)# A tibble: 8 x 5
plot_id crop tillage label Rep
<int> <chr> <chr> <chr> <dbl>
1 101 corn no-till No-Till 1
2 102 corn strip-till Strip-Till 1
3 103 soy strip-till Strip-Till 1
4 104 soy no-till No-Till 1
5 201 corn strip-till Strip-Till 2
6 202 soy no-till No-Till 2
7 203 corn no-till No-Till 2
8 204 soy strip-till Strip-Till 2
You can try running it without the collect() and you’ll see you get a fairly unhelpful error message. However, if you see any kind of error that claims a column isn’t present when you know it’s there, or a function is missing, that’s a good indicator of a failed lazy query.
Here’s a handy reference for the functions that {dbplyr} currently knows how to translate and run inside the DB: https://dbplyr.tidyverse.org/articles/sql-translation.html
One of the most powerful functions in SQL is the join. There are several types of joins, and it usually takes a little thinking to make sure you’re getting the one you want. If there’s a one-to-one relationship with your keys (each row is identified by one plot_id, and each plot_id only refers to one row in a table, for example), then a LEFT JOIN is a good bet.
dbGetQuery(
con, "
SELECT * FROM treatments
LEFT JOIN yields
ON treatments.plot_id = yields.plot_id;
"
) plot_id crop tillage plot_id..4 Mg_ha
1 101 corn no-till 101 10.0
2 102 corn strip-till 102 12.0
3 103 soy strip-till 103 4.5
4 104 soy no-till 104 4.5
5 201 corn strip-till 201 11.0
6 202 soy no-till 202 3.0
7 203 corn no-till 203 9.5
8 204 soy strip-till 204 5.0
Notice that plot_id gets printed twice, this is a quirk of SQL. While it is nice to make sure that the right columns were matched, it’s usually just in the way. There are cleaner ways to get your query using tidy R syntax.
Joining, by = "plot_id"
# Source: lazy query [?? x 4]
# Database: sqlite 3.30.1 [./experiment.db]
plot_id crop tillage Mg_ha
<int> <chr> <chr> <dbl>
1 101 corn no-till 10
2 102 corn strip-till 12
3 103 soy strip-till 4.5
4 104 soy no-till 4.5
5 201 corn strip-till 11
6 202 soy no-till 3
7 203 corn no-till 9.5
8 204 soy strip-till 5
Again, notice that this is a lazy query! The join happened inside the DB. This can be MUCH faster than inside R for large complex joins. We can then collect() the result and use it for further analysis.
Other types of joins you might run into are full/outer/cross, inner, right, semi, and anti. The help page at ?dplyr::full_join is a good resource for explaining them, and there are many online tutorials as well (e.g. https://r4ds.had.co.nz/relational-data.html#understanding-joins). However, not all the joins are implemented in SQLite, so you may have to collect() and do them locally inside R. I’m fairly sure all of them are implemented in {RPostgres} though, so we’ll look at them more when we read the PSA On-Farm database.
data_summary <- left_join(
tbl(con, "treatments"),
tbl(con, "yields")
) %>%
group_by(crop, tillage) %>%
summarise(
mean = mean(Mg_ha, na.rm = TRUE),
sd = sd(Mg_ha, na.rm = TRUE)
)Joining, by = "plot_id"
<SQL>
SELECT `crop`, `tillage`, AVG(`Mg_ha`) AS `mean`, STDEV(`Mg_ha`) AS `sd`
FROM (SELECT `LHS`.`plot_id` AS `plot_id`, `LHS`.`crop` AS `crop`, `LHS`.`tillage` AS `tillage`, `RHS`.`Mg_ha` AS `Mg_ha`
FROM `treatments` AS `LHS`
LEFT JOIN `yields` AS `RHS`
ON (`LHS`.`plot_id` = `RHS`.`plot_id`)
)
GROUP BY `crop`, `tillage`
# Source: lazy query [?? x 4]
# Database: sqlite 3.30.1 [./experiment.db]
# Groups: crop
crop tillage mean sd
<chr> <chr> <dbl> <dbl>
1 corn no-till 9.75 0.354
2 corn strip-till 11.5 0.707
3 soy no-till 3.75 1.06
4 soy strip-till 4.75 0.354
It’s not strictly required to close connections manually, because R will close inactive ones eventually. But it’s good practice to do, so that it will be a habit when you’re connecting to remote databases that may have many concurrent users.