To follow along with this tutorial (and eventually use the database for real analyses), you need the login credentials. Rohit is the contact person for that, and you can see how I’ve stored them below. There’s a placeholder file called secret-example.R in this repository, so you can copy it and paste in your own username and password (without the < >).
Source file: secret-example.R
# DB credentials
pg_dbname = "<Postgres db name>"
pg_host = "<Host URL>"
pg_port = 0000
pg_user = "<your username>"
pg_password = "<your password>"
Then save it as secret.R and this code chunk will run it, storing those 5 variables in the global environment.
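A minimal sketch of that chunk (assuming secret.R sits next to this notebook):
# run the credentials file, putting the pg_* variables in the global environment
source("secret.R")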
Now we can use them to connect to the database:
library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)
library(RPostgres)
con <- dbConnect(
Postgres(),
dbname = pg_dbname,
host = pg_host,
port = pg_port,
user = pg_user,
password = pg_password,
sslmode = "require",
bigint = "numeric"
)
NOTE: Remember to close connections with dbDisconnect(con) at the end of your R script. Occasionally your connection will be reset without you doing it explicitly. That could be because the DB thought you were inactive, or for a number of other reasons (like spotty WiFi). If you find that queries time out or stop responding, just reconnect using this chunk of code.
WARNING: Pay attention to the returned type of numeric values. Computers nowadays use 64-bit processors, which means they’re able to represent very large numbers in memory. However, not all programs implement them in the same way.
When we made the connection above, we specified that the Postgres data type BIGINT should be returned as regular R numeric (a.k.a. a double-precision floating point number, or 64 bits). With the scale of data we have, that won’t be a problem, as doubles can hold values up to 1.797693e+308 (which you can find with .Machine$double.xmax). If you need exact integers, you can use bigint = "integer" when you make your database connection, but that only supports values up to 2147483647 (since they’re signed 32-bit integers). I don’t recommend using the default, bigint = "integer64". This is a newer class of 64-bit integers that doesn’t have full support yet, so arithmetic and subsetting can have unexpected results.
First thing, let’s make sure this is the right database and everything looks correct. We should see what tables are available:
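A minimal sketch of that lookup, using {DBI}:
# list every table visible to this connection
dbListTables(con)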
[1] "cc_families" "decomp_biomass"
[3] "quadrats" "rows"
[5] "weed_research" "yield_monitor"
[7] "in_field_biomass" "site_information"
[9] "yield" "applied_chemicals"
[11] "producer_ids" "treatments"
[13] "gps_locations" "soil"
[15] "bulk_density" "chemical_families"
[17] "subsamples" "cc_species"
[19] "cc_mixture" "chemical_names"
[21] "cc_planting_methods" "cc_termination_methods"
[23] "gps_point_types" "cash_crops"
[25] "codes" "states"
[27] "seasons" "textural_classes"
[29] "depths" "subplots"
[31] "texture" "times"
[33] "weather" "wsensor_allocation"
[35] "farm_history" "wsensor_data"
[37] "pg_stat_statements" "pg_buffercache"
Let’s look at one of the tables, the list of cover crop species:
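A minimal sketch of that read, using {DBI}'s dbReadTable() to pull the whole (small) table at once:
# read the entire table into a local data frame
dbReadTable(con, "cc_species")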
cc_specie cc_family
1 Annual Ryegrass Grasses
2 Sudan grass Grasses
3 Rape/Canola Broadleaves
4 Cereal Rye Grasses
5 Winter Rye Grasses
6 Triticale Grasses
7 Oats Grasses
8 Crimson Clover Legumes
9 Red Clover Legumes
10 White Clover Legumes
11 Unknown Clover spp Legumes
12 Hairy Vetch Legumes
13 Woolypod Vetch Legumes
14 Crown Vetch Legumes
15 Unknown Vetch spp Legumes
16 Radish/Turnip Broadleaves
17 Mustard Broadleaves
18 Winter Peas Legumes
19 Alfalfa Legumes
20 Kale Broadleaves
21 Wheat Grasses
22 Barley Grasses
23 White Cahaba Vetch Legumes
Notice that it returned the whole table, since this was not a lazy query. For a small table like this, that’s not a problem. But some of the tables have many rows. If you try to pull them all in using dbReadTable, you’ll end up waiting a while for some of them.
# Takes ~11s to return
dbListTables(con) %>%
stringr::str_subset("^pg", negate = TRUE) %>%
purrr::set_names() %>%
purrr::map_dfr(
~tbl(con, .x) %>%
tally() %>%
collect(),
.id = "table"
) %>%
print(n = Inf)
# A tibble: 36 x 2
table n
<chr> <dbl>
1 cc_families 3
2 decomp_biomass 3184
3 quadrats 0
4 rows 2
5 weed_research 0
6 yield_monitor 0
7 in_field_biomass 246
8 site_information 134
9 yield 651
10 applied_chemicals 105
11 producer_ids 61
12 treatments 2
13 gps_locations 0
14 soil 982
15 bulk_density 64
16 chemical_families 2
17 subsamples 2
18 cc_species 23
19 cc_mixture 240
20 chemical_names 14
21 cc_planting_methods 4
22 cc_termination_methods 5
23 gps_point_types 0
24 cash_crops 10
25 codes 17576
26 states 50
27 seasons 2
28 textural_classes 12
29 depths 4
30 subplots 4
31 texture 255
32 times 6
33 weather 1496064
34 wsensor_allocation 0
35 farm_history 131
36 wsensor_data 783383
That’s a helpful list, but that was also a mouthful of code. Let’s break it down a bit. The following is identical code, but written with intermediate variables instead of chaining with the %>% (pipe) operator. If you’re more familiar with base R, you could use lapply instead of purrr::map_*, but {purrr} handles a lot of details, like the output type, for you. You don’t need to know {purrr} for the rest of this tutorial, but if you write a lot of loops, it will improve your code. The output isn’t shown below, since it’s the same as above.
# get the list of tables as a vector
tables <- dbListTables(con)
tables
# discard the tables that begin with "pg"
clean_tables <- stringr::str_subset(tables, "^pg", negate = TRUE)
clean_tables
# name the vector so we can identify each row later
clean_tables <- purrr::set_names(clean_tables)
clean_tables
# helper function to count rows as a lazy query
nrows_remote <- function(name) {
remote_table <- tbl(con, name) # lazily pull the table
remote_count <- tally(remote_table) # count the rows on the DB side,
# not locally
collect(remote_count) # pull the count into your local
# R environment
}
# loop over each table and apply the helper function,
# row-binding the results to a single tibble/dataframe
purrr::map_dfr(
clean_tables,
~nrows_remote(.x),
.id = "table"
)
# `print(..., n=Inf)` just makes sure it prints
# all the rows to the output
Okay, now that we’re set up and we know some pitfalls to watch out for, let’s try a query to look at the sites enrolled. Remember, since it’s a lazy query, we don’t know how many rows there are until we collect() it.
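A minimal sketch of that preview (printing the lazy table only fetches the first rows):
tbl(con, "site_information")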
# Source: table<site_information> [?? x 11]
# Database: postgres [...]
cid code year state county longitude latitude notes additional_cont…
<int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
2 1222 TKA 2017 MD Talbot -76.1 39.0 <NA> <NA>
3 1223 TKC 2017 MD Talbot -76.1 38.8 <NA> <NA>
4 1224 TKD 2017 MD Talbot -76.0 38.7 <NA> <NA>
5 1225 TCA 2017 MD Talbot -76.2 38.9 <NA> <NA>
6 1226 TCB 2017 MD Talbot -76.2 38.8 <NA> <NA>
7 1227 TCC 2017 MD Talbot -76.2 38.8 <NA> <NA>
8 1228 BGR 2017 MD Balti… -76.6 39.6 <NA> <NA>
9 1229 SRZ 2017 PA Berks -76.0 40.5 <NA> <NA>
10 1230 CLB 2017 PA Chamb… -77.7 39.8 <NA> <NA>
# … with more rows, and 2 more variables: producer_id <chr>, address <chr>
Remember that this is identical to previewing SELECT * FROM site_information in vanilla SQL. Let’s see how many sites were enrolled in each year:
year_data <- tbl(con, "site_information") %>%
arrange(year) %>%
group_by(year) %>%
tally()
year_data
# Source: lazy query [?? x 2]
# Database: postgres [...]
year n
<chr> <dbl>
1 2017 36
2 2018 43
3 2019 55
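We can also peek at the SQL that {dbplyr} generated for this lazy query (shown below); a sketch:
show_query(year_data)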
<SQL>
SELECT "year", COUNT(*) AS "n"
FROM (SELECT *
FROM "site_information"
ORDER BY "year") "dbplyr_153"
GROUP BY "year"
What if we wanted to store a list of the site codes along with the state and year they were enrolled in?
site_list <- tbl(con, "site_information") %>%
select(year, state, code, latitude, longitude) %>%
arrange(year, state, code) %>%
collect()
site_list
# A tibble: 134 x 5
year state code latitude longitude
<chr> <chr> <chr> <dbl> <dbl>
1 2017 GA HFS 32.6 -82.1
2 2017 GA IHS 33.4 -83.2
3 2017 MD BGR 39.6 -76.6
4 2017 MD CJL 39.0 -75.8
5 2017 MD CRD 38.9 -75.8
6 2017 MD CSK 38.7 -75.8
7 2017 MD QHS 39.2 -75.8
8 2017 MD SGK 38.2 -75.7
9 2017 MD TCA 38.9 -76.2
10 2017 MD TCB 38.8 -76.2
# … with 124 more rows
Now you could do any local computation or manipulation you wanted. My rule of thumb is that if it’s one of the common operations (selecting columns, arranging to sort, filtering rows, *_joining two tables), go ahead and do it with lazy queries. If you have more complex operations to do, it might be better to pull the table in locally with collect(). Again, it’s a good idea to check how many rows will be returned before you do that, though:
tbl(con, "site_information") %>%
select(year, state, code, latitude, longitude) %>%
arrange(year, state, code) %>%
tally()
# Source: lazy query [?? x 1]
# Database: postgres [...]
n
<dbl>
1 134
So let’s do some local computation on the site_list object we stored above. First we’ll make a static map using the {ggplot2} package, which is good for making high-quality, customizable images. Then we’ll make an interactive map with {leaflet}, an R interface to a JavaScript library that embeds maps in documents like these or in web applications.
We won’t fuss around too much with the options in either, but both are powerful enough to do any kind of mapping visualization you can imagine. You don’t need any of the other packages I’m using here if you aren’t making maps, but they’re handy to have if you are.
state_outlines <-
rnaturalearthdata::states50 %>%
sf::st_as_sf() %>%
filter(sr_adm0_a3 == "USA")
library(ggplot2)
ggplot() +
geom_sf(data = state_outlines) +
geom_point(
data = site_list,
aes(longitude, latitude, color = year)
) +
coord_sf(
xlim = range(site_list$longitude, na.rm = TRUE),
ylim = range(site_list$latitude, na.rm = TRUE)
)
library(leaflet)
leaflet(
site_list,
options = leafletOptions(maxZoom = 6)
) %>%
addTiles() %>%
addMarkers(
~round(longitude, 1),
~round(latitude, 1)
)
Data contains 6 rows with either missing or invalid lat/lon values and will be ignored
With the mapping example we wanted all the sites, but some tables just have a lot of rows. You’ll probably want to use a subset of those rows to prototype your model or visualization before jumping in with the whole thing.
We saw above that the largest table in the DB is the "weather" table. Let’s look at just the first few rows with a lazy query:
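A minimal sketch of that preview:
# lazy reference; printing only fetches the first rows
tbl(con, "weather")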
# Source: table<weather> [?? x 14]
# Database: postgres [...]
cid code date timez rainfall air_temperature
<dbl> <chr> <date> <dttm> <dbl> <dbl>
1 2.40e6 CLA 2018-08-05 2018-08-05 17:00:00 0 29.8
2 2.05e6 CJL 2017-02-19 2017-02-19 06:00:00 0 9.95
3 2.05e6 CJL 2017-02-19 2017-02-19 07:00:00 0 9.67
4 2.05e6 CJL 2017-02-19 2017-02-19 11:00:00 0 9.49
5 2.05e6 CJL 2017-02-19 2017-02-19 12:00:00 0 9.69
6 2.05e6 CJL 2017-02-19 2017-02-19 15:00:00 0 15.0
7 2.05e6 CJL 2017-02-19 2017-02-19 16:00:00 0 16.2
8 2.05e6 CJL 2017-02-19 2017-02-19 17:00:00 0 17.5
9 2.05e6 CJL 2017-02-19 2017-02-19 21:00:00 0 19.7
10 2.05e6 CJL 2017-02-19 2017-02-19 22:00:00 0 17.6
# … with more rows, and 8 more variables: humidity <dbl>,
# longwave_radiation <dbl>, shortwave_radiation <dbl>, zonal_wind <dbl>,
# meridional_wind <dbl>, eto <dbl>, soil_temperature <dbl>,
# soil_water_content <dbl>
We learned in the last notebook on SQLite how to filter on text and numeric columns, but there are two new column types here: date and datetime. Under the hood these are still just numbers, usually counting the days or seconds since 1970-01-01. But it’s hard to think in terms of when 18262 is, or 1577854800. You’re much more familiar with working with January 1, 2020 and 2020-01-01 12:00AM EST.
There are two main ways to work with date(time)s in R. In base R, there’s as.Date and as.POSIXct (but the documentation for the formats is in ?strptime), and in the {tidyverse}, there’s a whole package called {lubridate}. If that seems too complicated, you’re right: dates and times are notorious for being a bear in every programming language and on every computer platform. It’s easiest if you use an unambiguous international standard like ISO 8601 (further reading).
dts <- c("1984-08-23", "2020-01-01")
tms <- c("1984-08-23 7:30:00 PM", "2020-01-01 12:00:00 AM")
iso <- c("1984-08-23 19:30:00", "2020-01-01 00:00:00")
data.frame(
dates = as.Date(dts),
times = as.POSIXct(tms, format = "%Y-%m-%d %I:%M:%S %p"),
iso = as.POSIXct(iso)
)
dates times iso
1 1984-08-23 1984-08-23 19:30:00 1984-08-23 19:30:00
2 2020-01-01 2020-01-01 00:00:00 2020-01-01 00:00:00
tibble(
dates = lubridate::as_date(dts),
times = lubridate::as_datetime(tms),
iso = lubridate::as_datetime(iso)
)
# A tibble: 2 x 3
dates times iso
<date> <dttm> <dttm>
1 1984-08-23 1984-08-23 19:30:00 1984-08-23 19:30:00
2 2020-01-01 2020-01-01 00:00:00 2020-01-01 00:00:00
Notice that using the standard format in your text input makes your code a lot simpler to read. Likewise, {lubridate} is pretty good at guessing, so it can be a handy shortcut as well. I’ll stick with base R for this example, though I generally prefer {lubridate}.
So let’s say that instead of the million or so rows that "weather" has, we want just the observations from July 4, 2018. If we’re using the date column, we need to filter on exact equality. But if we’re using the timez column, we want to select a range: everything from midnight that morning through 11:59:59 PM that night.
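A sketch of the date-equality version (assuming we call the lazy query date_query; as.Date() gets translated to the CAST you see in the SQL below):
date_query <- tbl(con, "weather") %>%
  filter(date == as.Date("2018-07-04"))
date_query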
# Source: lazy query [?? x 14]
# Database: postgres [...]
cid code date timez rainfall air_temperature
<dbl> <chr> <date> <dttm> <dbl> <dbl>
1 2.19e6 ALR 2018-07-04 2018-07-04 00:00:00 0 30.3
2 2.19e6 ALR 2018-07-04 2018-07-04 01:00:00 0 28.9
3 2.19e6 ALR 2018-07-04 2018-07-04 02:00:00 0 27.4
4 2.19e6 ALR 2018-07-04 2018-07-04 03:00:00 0 25.9
5 2.19e6 ALR 2018-07-04 2018-07-04 04:00:00 0 25.2
6 2.19e6 ALR 2018-07-04 2018-07-04 05:00:00 0 24.6
7 2.19e6 ALR 2018-07-04 2018-07-04 06:00:00 0 23.9
8 2.19e6 ALR 2018-07-04 2018-07-04 07:00:00 0 23.6
9 2.19e6 ALR 2018-07-04 2018-07-04 08:00:00 0 23.4
10 2.19e6 ALR 2018-07-04 2018-07-04 09:00:00 0 23.1
# … with more rows, and 8 more variables: humidity <dbl>,
# longwave_radiation <dbl>, shortwave_radiation <dbl>, zonal_wind <dbl>,
# meridional_wind <dbl>, eto <dbl>, soil_temperature <dbl>,
# soil_water_content <dbl>
<SQL>
SELECT *
FROM "weather"
WHERE ("date" = CAST('2018-07-04' AS DATE))
# Source: lazy query [?? x 2]
# Database: postgres [...]
code n
<chr> <dbl>
1 ALR 24
2 AWP 24
3 BGS 24
4 BRL 24
5 BWA 24
6 CCP 24
7 CCS 24
8 CDA 24
9 CDB 24
10 CJA 24
# … with more rows
So we can see that 24 hours of observations are returned in our query for each site. Let’s try it with times instead:
t_start <- as.POSIXct("2018-07-04 00:00:00")
t_end <- as.POSIXct("2018-07-04 23:59:59")
time_query <- tbl(con, "weather") %>%
filter(between(timez, t_start, t_end))
time_query
# Source: lazy query [?? x 14]
# Database: postgres [...]
cid code date timez rainfall air_temperature
<dbl> <chr> <date> <dttm> <dbl> <dbl>
1 2.19e6 ALR 2018-07-04 2018-07-04 04:00:00 0 25.2
2 2.19e6 ALR 2018-07-04 2018-07-04 05:00:00 0 24.6
3 2.19e6 ALR 2018-07-04 2018-07-04 06:00:00 0 23.9
4 2.19e6 ALR 2018-07-04 2018-07-04 07:00:00 0 23.6
5 2.19e6 ALR 2018-07-04 2018-07-04 08:00:00 0 23.4
6 2.19e6 ALR 2018-07-04 2018-07-04 09:00:00 0 23.1
7 2.19e6 ALR 2018-07-04 2018-07-04 10:00:00 0 23.9
8 2.19e6 ALR 2018-07-04 2018-07-04 11:00:00 0 24.7
9 2.19e6 ALR 2018-07-04 2018-07-04 12:00:00 0 25.6
10 2.19e6 ALR 2018-07-04 2018-07-04 13:00:00 0 27.0
# … with more rows, and 8 more variables: humidity <dbl>,
# longwave_radiation <dbl>, shortwave_radiation <dbl>, zonal_wind <dbl>,
# meridional_wind <dbl>, eto <dbl>, soil_temperature <dbl>,
# soil_water_content <dbl>
<SQL>
SELECT *
FROM "weather"
WHERE ("timez" BETWEEN '2018-07-04T04:00:00Z' AND '2018-07-05T03:59:59Z')
# Source: lazy query [?? x 2]
# Database: postgres [...]
code n
<chr> <dbl>
1 ALR 24
2 AWP 24
3 BGS 24
4 BRL 24
5 BWA 24
6 CCP 24
7 CCS 24
8 CDA 24
9 CDB 24
10 CJA 24
# … with more rows
We get almost the same results. Note that instead of dplyr::between, we could have said filter(timez >= t_start, timez <= t_end), and it would have been the same. Sometimes I prefer the long-winded way of writing it, if it’s clearer to read the code later.
WARNING: You have to watch your timezones carefully. You’ll notice in the translated SQL, it used
'2018-07-04T04:00:00Z'as our start time. I am writing this code on a machine in theAmerica/New_Yorktime zone, and on July 4 2018, that was equivalent to a-4offset (because of Daylight Saving Time, it’s-5the rest of the year). Thus R assumed that any time I put in without an explicit timezone is my local time, while thetimezcolumn in the DB is UTC (or “Zulu”) time.
NOTE: Whether the first query (the whole of 2018-07-04 at the prime meridian) or the second query (between midnight and midnight on 2018-07-04 on the east coast) is what you want is up to you. Just be aware of what times you’re actually asking for and getting in return.
We already explored the most important table, "site_information". It’s important because it keys the three-letter farm codes to all the other data about each farm. What about "decomp_biomass"? That’s all the data from the litterbag decomposition study, and it’s where the dry weight of the cover crop biomass is stored.
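A minimal sketch of that lazy preview:
tbl(con, "decomp_biomass")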
# Source: table<decomp_biomass> [?? x 14]
# Database: postgres [...]
cid code subplot subsample time empty_bag_wt fresh_biomass_wt
<int> <chr> <int> <chr> <int> <dbl> <dbl>
1 5057 TRR 1 B 0 62.1 326.
2 5064 TRR 1 B 2 56.6 321.
3 5068 TRR 2 A 0 62.3 315.
4 5075 TRR 2 B 1 53.5 306.
5 3029 LBR 2 A 0 55.4 144.
6 5048 RFV 2 A 3 53.5 298.
7 4051 BWA 1 A 3 53.5 NA
8 4022 JYS 2 B 4 53.5 349
9 4016 JYS 2 A 4 56 346
10 3974 ALR 2 B 4 52.5 135
# … with more rows, and 7 more variables: dry_biomass_wt <dbl>,
# crucible_wt <dbl>, tot_bwt_at_550 <dbl>, percent_c <dbl>,
# percent_n <dbl>, recovery_date <date>, tot_bwt_at_65 <dbl>
But it only has the farm codes, so we need to join it with the site information table. What kind of join is appropriate? In the decomp data, there are multiple rows per farm code (24 if no bags got destroyed), and farm code is the only key that connects the two tables. This is called a one-to-many correspondence.
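The two listings below, counting rows per farm code in each table, could come from a pair of lazy tallies; a sketch:
# many rows per code in the decomp data
tbl(con, "decomp_biomass") %>% group_by(code) %>% tally()
# exactly one row per code in the site information
tbl(con, "site_information") %>% group_by(code) %>% tally()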
# Source: lazy query [?? x 2]
# Database: postgres [...]
code n
<chr> <dbl>
1 ABE 20
2 ALR 24
3 AWP 24
4 BBS 23
5 BGR 24
6 BGS 24
7 BJD 24
8 BNF 24
9 BNM 20
10 BRL 24
# … with more rows
# Source: lazy query [?? x 2]
# Database: postgres [...]
code n
<chr> <dbl>
1 AAZ 1
2 ABE 1
3 ALR 1
4 AWP 1
5 BBS 1
6 BET 1
7 BGR 1
8 BGS 1
9 BJD 1
10 BNF 1
# … with more rows
Assuming we join them with "site_information" as the “left” table and "decomp_biomass" as the “right” table:
left_join() would return every site, with its site info repeated for each matching bag row (sites with no bags would show up once, with NAs in the bag columns)
right_join() would return every bag row (bags whose code isn’t in the site table would show up with NAs for lat/long etc)
inner_join() would return only the rows whose code appears in both tables (no NAs)
full_join() (an outer join) would return every row from both tables (unmatched bags show up with NAs in lat/long & unmatched sites with NAs in dry weight)
semi_join() would return just the site rows that have at least one matching bag (one row per site, no bag columns, no NAs)
anti_join() would return just the site rows that have no matching bags (no bag columns, no NAs)
I almost always want a full_join, so I can make sure I catch any mismatches and investigate them. A right_join is also good for “one-to-many” correspondences, as is an inner_join. Beware of the dropped observations with inner joins; however, those rows would be “unidentifiable” because of missing data, so you’d end up dropping them in your analysis anyway. It’s just something to keep in mind.
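The two column listings below could come from asking the DB for each table’s fields; a sketch with {DBI}:
dbListFields(con, "site_information")
dbListFields(con, "decomp_biomass")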
[1] "cid" "code" "year"
[4] "state" "county" "longitude"
[7] "latitude" "notes" "additional_contact"
[10] "producer_id" "address"
[1] "cid" "code" "subplot"
[4] "subsample" "time" "empty_bag_wt"
[7] "fresh_biomass_wt" "dry_biomass_wt" "crucible_wt"
[10] "tot_bwt_at_550" "percent_c" "percent_n"
[13] "recovery_date" "tot_bwt_at_65"
WARNING: Always look at what columns joins are matching on if you’re not specifying them manually. For example, both of these tables share the code column (which we want) as well as the cid column (which is an internal DB identifier). If we let it join on the cid column, we’re gonna have trouble, since they won’t ever match.
inner_join(
tbl(con, "site_information"),
tbl(con, "decomp_biomass"),
by = "code" # force using only the `code` col. for matching
)
# Source: lazy query [?? x 24]
# Database: postgres [...]
cid.x code year state county longitude latitude notes additional_cont…
<int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
2 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
3 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
4 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
5 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
6 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
7 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
8 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
9 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
10 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
# … with more rows, and 15 more variables: producer_id <chr>, address <chr>,
# cid.y <int>, subplot <int>, subsample <chr>, time <int>,
# empty_bag_wt <dbl>, fresh_biomass_wt <dbl>, dry_biomass_wt <dbl>,
# crucible_wt <dbl>, tot_bwt_at_550 <dbl>, percent_c <dbl>,
# percent_n <dbl>, recovery_date <date>, tot_bwt_at_65 <dbl>
Notice that now there’s cid.x and cid.y, which are from the left and right tables respectively and don’t match each other. This query returns way too many columns, so let’s simplify the data so it’s easier to look at:
bag_data <- inner_join(
tbl(con, "site_information"),
tbl(con, "decomp_biomass"),
by = "code"
) %>%
select(
code, year, state, # vars to identify the site
subplot, subsample, time, # vars to identify the bag
dry_biomass_wt, percent_n # actual data we want
)
show_query(bag_data)
<SQL>
SELECT "code", "year", "state", "subplot", "subsample", "time", "dry_biomass_wt", "percent_n"
FROM (SELECT "LHS"."cid" AS "cid.x", "LHS"."code" AS "code", "LHS"."year" AS "year", "LHS"."state" AS "state", "LHS"."county" AS "county", "LHS"."longitude" AS "longitude", "LHS"."latitude" AS "latitude", "LHS"."notes" AS "notes", "LHS"."additional_contact" AS "additional_contact", "LHS"."producer_id" AS "producer_id", "LHS"."address" AS "address", "RHS"."cid" AS "cid.y", "RHS"."subplot" AS "subplot", "RHS"."subsample" AS "subsample", "RHS"."time" AS "time", "RHS"."empty_bag_wt" AS "empty_bag_wt", "RHS"."fresh_biomass_wt" AS "fresh_biomass_wt", "RHS"."dry_biomass_wt" AS "dry_biomass_wt", "RHS"."crucible_wt" AS "crucible_wt", "RHS"."tot_bwt_at_550" AS "tot_bwt_at_550", "RHS"."percent_c" AS "percent_c", "RHS"."percent_n" AS "percent_n", "RHS"."recovery_date" AS "recovery_date", "RHS"."tot_bwt_at_65" AS "tot_bwt_at_65"
FROM "site_information" AS "LHS"
INNER JOIN "decomp_biomass" AS "RHS"
ON ("LHS"."code" = "RHS"."code")
) "dbplyr_157"
# Source: lazy query [?? x 8]
# Database: postgres [...]
code year state subplot subsample time dry_biomass_wt percent_n
<chr> <chr> <chr> <int> <chr> <int> <dbl> <dbl>
1 CSK 2017 MD 1 A 0 92.2 2.71
2 CSK 2017 MD 1 A 1 91 2.75
3 CSK 2017 MD 1 A 2 84 3
4 CSK 2017 MD 1 A 3 81.2 2.67
5 CSK 2017 MD 1 A 4 68.4 2.57
6 CSK 2017 MD 1 A 5 58.4 2.58
7 CSK 2017 MD 1 B 0 98.6 2.42
8 CSK 2017 MD 1 B 1 83.9 3.12
9 CSK 2017 MD 1 B 2 83.6 2.92
10 CSK 2017 MD 1 B 3 76.3 2.66
# … with more rows
That nasty SQL statement could be simplified to:
dbGetQuery(con,
"
SELECT LHS.code, year, state, time, dry_biomass_wt, percent_n
FROM site_information AS LHS
INNER JOIN decomp_biomass AS RHS
ON (LHS.code = RHS.code)
")But again, that would return the whole query at once, instead of lazily, so if you want to iterate over different versions of your query for testing, it can be slow and eat up memory. All the explicit naming and quoting that {dbplyr} is doing when it translates to SQL prevents a lot of errors, but if your hand-written SQL is better than mine, you can use it straight instead.
Let’s look at rows that have already had the bags run on the C:N analyzer to get %N content:
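A sketch of that filter (is.na() translates to IS NULL on the DB side, and collect() pulls the result in locally):
bag_data %>%
  filter(!is.na(percent_n)) %>%
  collect()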
# A tibble: 413 x 8
code year state subplot subsample time dry_biomass_wt percent_n
<chr> <chr> <chr> <int> <chr> <int> <dbl> <dbl>
1 CSK 2017 MD 2 B 4 74.7 2.76
2 SGK 2017 MD 1 B 3 67.6 2.96
3 TKA 2017 MD 1 A 0 61.9 3.21
4 KTC 2018 MD 2 A 0 62 4.64
5 KTE 2018 MD 1 A 0 64.7 2.85
6 CJL 2017 MD 2 B 2 74.1 1.61
7 QAS 2018 MD 1 A 0 59.7 2.92
8 CSA 2018 MD 1 A 0 91.8 1.84
9 QAS 2018 MD 2 A 0 59.9 3.00
10 TKA 2017 MD 1 A 1 59.6 3.69
# … with 403 more rows
Let’s try an example with three tables. I’ll split it up into several separate queries to make it a little clearer what each step is doing, but they could all be chained together into a single statement.
1. Use the decomp_biomass table to estimate moisture content for bags at time == 0 (first collection).
2. Join that with the in_field_biomass table and calculate dry equivalents of the fresh biomass there.
3. Join with the site_information table to get state and year information.
bag_moisture_query <- tbl(con, "decomp_biomass") %>%
filter(time == 0) %>%
mutate(m_ratio = dry_biomass_wt / fresh_biomass_wt) %>%
group_by(code, subplot) %>%
summarize(m_ratio = mean(m_ratio, na.rm = TRUE))
bag_moisture_query
# Source: lazy query [?? x 3]
# Database: postgres [...]
# Groups: code
code subplot m_ratio
<chr> <int> <dbl>
1 ALR 1 0.646
2 ALR 2 0.661
3 AWP 1 0.416
4 AWP 2 0.426
5 BBS 1 0.334
6 BBS 2 0.297
7 BGR 1 0.276
8 BGR 2 0.268
9 BGS 1 0.278
10 BGS 2 0.344
# … with more rows
Here we looked up the decomp table, filtered to only the time-zero bags, and calculated the ratio of dry to fresh biomass weight. Then we separated it into groups for each farm code and subplot (there should be two rows per group, subsamples A and B) and calculated the mean of that moisture ratio for each group.
in_field_query <- full_join(
bag_moisture_query,
tbl(con, "in_field_biomass")
) %>%
mutate(
dry_wt_est_g_m2 = m_ratio * (fresh_wt_a + fresh_wt_b) / 2
)
Joining, by = c("code", "subplot")
# Source: lazy query [?? x 25]
# Database: postgres [...]
# Groups: code
code subplot m_ratio cid percent_fat percent_cp percent_ash percent_adf
<chr> <int> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 SRY 1 0.374 1201 NA NA NA NA
2 FSB 2 0.491 1198 NA NA NA NA
3 SRY 2 0.425 1202 NA NA NA NA
4 ULB 1 0.282 908 2.32 13.2 7.44 32.6
5 ULB 2 0.358 909 1.95 11.6 6.6 34.7
6 MKD 2 0.298 995 3.2 13.7 5.4 22.8
7 BRL 1 0.276 966 2.84 18.0 8.91 25.2
8 STA 1 0.331 1223 2.16 9.63 5.09 31.1
9 DAE 1 0.438 1219 2.3 7.48 2.62 25.8
10 DAE 2 0.381 1220 2.4 6.99 3.67 26.9
# … with more rows, and 17 more variables: percent_ndf <dbl>,
# percent_lignin <dbl>, percent_nfc <dbl>, percent_cellulose_calc <dbl>,
# percent_hemicell_calc <dbl>, percent_carb_nnorm <dbl>,
# percent_cellulose_nnorm <dbl>, percent_lignin_nnorm <dbl>,
# percent_carb_norm <dbl>, percent_cellulose_norm <dbl>,
# percent_lignin_norm <dbl>, fresh_wt_a <dbl>, fresh_wt_b <dbl>,
# bag_wt <dbl>, legumes_40 <lgl>, percent_n_nir <dbl>,
# dry_wt_est_g_m2 <dbl>
The first query (bag_moisture_query) constructed a temporary table inside the database. Now we’re telling the DB we want to do a full (or outer) join on that temporary table and the in_field_biomass table (containing all matches between rows, and every column in both tables). Since we constructed the temporary table to only have three columns (code, subplot, and m_ratio), we don’t have to worry about specifying which columns to join on: we should get perfect 1:1 matching on code and subplot. A helpful message is printed letting us know those are the common columns.
After the join, we then make a new column in this new temporary table. The new column is the average of the two subsamples of fresh weights, times the moisture ratio we calculated before. Again, simple arithmetic operations like this are easy for the database to do internally. Now we have a temporary table with all the biomass properties, including an estimate of dry matter content in \(grams/meter^{2}\).
biomass_with_site_query <- full_join(
tbl(con, "site_information"),
in_field_query,
by = "code"
) %>%
arrange(year, state)
biomass_with_site_query
# Source: lazy query [?? x 35]
# Database: postgres [...]
# Ordered by: year, state
cid.x code year state county longitude latitude notes additional_cont…
<int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1085 HFS 2017 GA Burke -82.1 32.6 <NA> <NA>
2 1086 IHS 2017 GA Oconee -83.2 33.4 <NA> <NA>
3 1085 HFS 2017 GA Burke -82.1 32.6 <NA> <NA>
4 1086 IHS 2017 GA Oconee -83.2 33.4 <NA> <NA>
5 1228 BGR 2017 MD Balti… -76.6 39.6 <NA> <NA>
6 1231 CRD 2017 MD Carol… -75.8 38.9 <NA> <NA>
7 1231 CRD 2017 MD Carol… -75.8 38.9 <NA> <NA>
8 1234 CJL 2017 MD Carol… -75.8 39.0 <NA> <NA>
9 1234 CJL 2017 MD Carol… -75.8 39.0 <NA> <NA>
10 1221 CSK 2017 MD Carol… -75.8 38.7 <NA> <NA>
# … with more rows, and 26 more variables: producer_id <chr>, address <chr>,
# subplot <int>, m_ratio <dbl>, cid.y <int>, percent_fat <dbl>,
# percent_cp <dbl>, percent_ash <dbl>, percent_adf <dbl>,
# percent_ndf <dbl>, percent_lignin <dbl>, percent_nfc <dbl>,
# percent_cellulose_calc <dbl>, percent_hemicell_calc <dbl>,
# percent_carb_nnorm <dbl>, percent_cellulose_nnorm <dbl>,
# percent_lignin_nnorm <dbl>, percent_carb_norm <dbl>,
# percent_cellulose_norm <dbl>, percent_lignin_norm <dbl>,
# fresh_wt_a <dbl>, fresh_wt_b <dbl>, bag_wt <dbl>, legumes_40 <lgl>,
# percent_n_nir <dbl>, dry_wt_est_g_m2 <dbl>
Now we take the temporary table and do another full/outer join, this time with the site info table. Since we didn’t strip out any columns in the last query, it’s a good idea to specify by = "code" to make sure it ignores any other common columns (in particular the internal row identifier, cid). Now we have a temporary table with all the biomass properties, as well as information about which state and year they were recorded in.
biomass_with_units_query <- biomass_with_site_query %>%
mutate_at(
vars(matches("percent_")),
list("g_m2" = ~.*dry_wt_est_g_m2/100)
)
# If you want to see a real mouthful, show the underlying SQL call
show_query(biomass_with_units_query)
<SQL>
SELECT "cid.x", "code", "year", "state", "county", "longitude", "latitude", "notes", "additional_contact", "producer_id", "address", "subplot", "m_ratio", "cid.y", "percent_fat", "percent_cp", "percent_ash", "percent_adf", "percent_ndf", "percent_lignin", "percent_nfc", "percent_cellulose_calc", "percent_hemicell_calc", "percent_carb_nnorm", "percent_cellulose_nnorm", "percent_lignin_nnorm", "percent_carb_norm", "percent_cellulose_norm", "percent_lignin_norm", "fresh_wt_a", "fresh_wt_b", "bag_wt", "legumes_40", "percent_n_nir", "dry_wt_est_g_m2", "percent_fat" * "dry_wt_est_g_m2" / 100.0 AS "percent_fat_g_m2", "percent_cp" * "dry_wt_est_g_m2" / 100.0 AS "percent_cp_g_m2", "percent_ash" * "dry_wt_est_g_m2" / 100.0 AS "percent_ash_g_m2", "percent_adf" * "dry_wt_est_g_m2" / 100.0 AS "percent_adf_g_m2", "percent_ndf" * "dry_wt_est_g_m2" / 100.0 AS "percent_ndf_g_m2", "percent_lignin" * "dry_wt_est_g_m2" / 100.0 AS "percent_lignin_g_m2", "percent_nfc" * "dry_wt_est_g_m2" / 100.0 AS "percent_nfc_g_m2", "percent_cellulose_calc" * "dry_wt_est_g_m2" / 100.0 AS "percent_cellulose_calc_g_m2", "percent_hemicell_calc" * "dry_wt_est_g_m2" / 100.0 AS "percent_hemicell_calc_g_m2", "percent_carb_nnorm" * "dry_wt_est_g_m2" / 100.0 AS "percent_carb_nnorm_g_m2", "percent_cellulose_nnorm" * "dry_wt_est_g_m2" / 100.0 AS "percent_cellulose_nnorm_g_m2", "percent_lignin_nnorm" * "dry_wt_est_g_m2" / 100.0 AS "percent_lignin_nnorm_g_m2", "percent_carb_norm" * "dry_wt_est_g_m2" / 100.0 AS "percent_carb_norm_g_m2", "percent_cellulose_norm" * "dry_wt_est_g_m2" / 100.0 AS "percent_cellulose_norm_g_m2", "percent_lignin_norm" * "dry_wt_est_g_m2" / 100.0 AS "percent_lignin_norm_g_m2", "percent_n_nir" * "dry_wt_est_g_m2" / 100.0 AS "percent_n_nir_g_m2"
FROM (SELECT *
FROM (SELECT "LHS"."cid" AS "cid.x", COALESCE("LHS"."code", "RHS"."code") AS "code", "LHS"."year" AS "year", "LHS"."state" AS "state", "LHS"."county" AS "county", "LHS"."longitude" AS "longitude", "LHS"."latitude" AS "latitude", "LHS"."notes" AS "notes", "LHS"."additional_contact" AS "additional_contact", "LHS"."producer_id" AS "producer_id", "LHS"."address" AS "address", "RHS"."subplot" AS "subplot", "RHS"."m_ratio" AS "m_ratio", "RHS"."cid" AS "cid.y", "RHS"."percent_fat" AS "percent_fat", "RHS"."percent_cp" AS "percent_cp", "RHS"."percent_ash" AS "percent_ash", "RHS"."percent_adf" AS "percent_adf", "RHS"."percent_ndf" AS "percent_ndf", "RHS"."percent_lignin" AS "percent_lignin", "RHS"."percent_nfc" AS "percent_nfc", "RHS"."percent_cellulose_calc" AS "percent_cellulose_calc", "RHS"."percent_hemicell_calc" AS "percent_hemicell_calc", "RHS"."percent_carb_nnorm" AS "percent_carb_nnorm", "RHS"."percent_cellulose_nnorm" AS "percent_cellulose_nnorm", "RHS"."percent_lignin_nnorm" AS "percent_lignin_nnorm", "RHS"."percent_carb_norm" AS "percent_carb_norm", "RHS"."percent_cellulose_norm" AS "percent_cellulose_norm", "RHS"."percent_lignin_norm" AS "percent_lignin_norm", "RHS"."fresh_wt_a" AS "fresh_wt_a", "RHS"."fresh_wt_b" AS "fresh_wt_b", "RHS"."bag_wt" AS "bag_wt", "RHS"."legumes_40" AS "legumes_40", "RHS"."percent_n_nir" AS "percent_n_nir", "RHS"."dry_wt_est_g_m2" AS "dry_wt_est_g_m2"
FROM "site_information" AS "LHS"
FULL JOIN (SELECT "code", "subplot", "m_ratio", "cid", "percent_fat", "percent_cp", "percent_ash", "percent_adf", "percent_ndf", "percent_lignin", "percent_nfc", "percent_cellulose_calc", "percent_hemicell_calc", "percent_carb_nnorm", "percent_cellulose_nnorm", "percent_lignin_nnorm", "percent_carb_norm", "percent_cellulose_norm", "percent_lignin_norm", "fresh_wt_a", "fresh_wt_b", "bag_wt", "legumes_40", "percent_n_nir", "m_ratio" * ("fresh_wt_a" + "fresh_wt_b") / 2.0 AS "dry_wt_est_g_m2"
FROM (SELECT COALESCE("LHS"."code", "RHS"."code") AS "code", COALESCE("LHS"."subplot", "RHS"."subplot") AS "subplot", "LHS"."m_ratio" AS "m_ratio", "RHS"."cid" AS "cid", "RHS"."percent_fat" AS "percent_fat", "RHS"."percent_cp" AS "percent_cp", "RHS"."percent_ash" AS "percent_ash", "RHS"."percent_adf" AS "percent_adf", "RHS"."percent_ndf" AS "percent_ndf", "RHS"."percent_lignin" AS "percent_lignin", "RHS"."percent_nfc" AS "percent_nfc", "RHS"."percent_cellulose_calc" AS "percent_cellulose_calc", "RHS"."percent_hemicell_calc" AS "percent_hemicell_calc", "RHS"."percent_carb_nnorm" AS "percent_carb_nnorm", "RHS"."percent_cellulose_nnorm" AS "percent_cellulose_nnorm", "RHS"."percent_lignin_nnorm" AS "percent_lignin_nnorm", "RHS"."percent_carb_norm" AS "percent_carb_norm", "RHS"."percent_cellulose_norm" AS "percent_cellulose_norm", "RHS"."percent_lignin_norm" AS "percent_lignin_norm", "RHS"."fresh_wt_a" AS "fresh_wt_a", "RHS"."fresh_wt_b" AS "fresh_wt_b", "RHS"."bag_wt" AS "bag_wt", "RHS"."legumes_40" AS "legumes_40", "RHS"."percent_n_nir" AS "percent_n_nir"
FROM (SELECT "code", "subplot", AVG("m_ratio") AS "m_ratio"
FROM (SELECT "cid", "code", "subplot", "subsample", "time", "empty_bag_wt", "fresh_biomass_wt", "dry_biomass_wt", "crucible_wt", "tot_bwt_at_550", "percent_c", "percent_n", "recovery_date", "tot_bwt_at_65", "dry_biomass_wt" / "fresh_biomass_wt" AS "m_ratio"
FROM (SELECT *
FROM "decomp_biomass"
WHERE ("time" = 0.0)) "dbplyr_170") "dbplyr_171"
GROUP BY "code", "subplot") "LHS"
FULL JOIN "in_field_biomass" AS "RHS"
ON ("LHS"."code" = "RHS"."code" AND "LHS"."subplot" = "RHS"."subplot")
) "dbplyr_172") "RHS"
ON ("LHS"."code" = "RHS"."code")
) "dbplyr_173"
ORDER BY "year", "state") "dbplyr_174"
The last step is to create new columns using an “anonymous function”: ~.*dry_wt_est_g_m2/100. This is the same thing as a function that looks like multiplier <- function(pct) { pct * dry_wt_est_g_m2 / 100}. We then use that function on all the variables/columns (vars()) that match the text "percent_". For every such column, it will now make a new column with the name percent_foo_g_m2, which will be the total amount of “foo” in \(grams/meter^{2}\).
You can also see I printed out the generated SQL query, since this is still all happening inside the database. You could have pulled the query in locally at a number of points during this chain, and the {dplyr} syntax would have been identical (the calls to filter, mutate, summarize, group_by, full_join); only the tbl(con, "table_name") calls would have been replaced by your local data frames.
Again, this is overly-explicit generated SQL, which could probably be simplified if you wrote it by hand. However, the {dbplyr}-translated version checks for a lot of possible errors and tries to optimize the number of operations for speed and memory-efficiency.
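A sketch of that final pull into a local data frame (the name local_biomass_df is what the plotting code below uses):
local_biomass_df <- collect(biomass_with_units_query)
local_biomass_df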
# A tibble: 259 x 51
cid.x code year state county longitude latitude notes additional_cont…
<int> <chr> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
1 1086 IHS 2017 GA Oconee -83.2 33.4 <NA> <NA>
2 1086 IHS 2017 GA Oconee -83.2 33.4 <NA> <NA>
3 1085 HFS 2017 GA Burke -82.1 32.6 <NA> <NA>
4 1085 HFS 2017 GA Burke -82.1 32.6 <NA> <NA>
5 1223 TKC 2017 MD Talbot -76.1 38.8 <NA> <NA>
6 1226 TCB 2017 MD Talbot -76.2 38.8 <NA> <NA>
7 1233 QHS 2017 MD Queen… -75.8 39.2 <NA> <NA>
8 1225 TCA 2017 MD Talbot -76.2 38.9 <NA> <NA>
9 1231 CRD 2017 MD Carol… -75.8 38.9 <NA> <NA>
10 1226 TCB 2017 MD Talbot -76.2 38.8 <NA> <NA>
# … with 249 more rows, and 42 more variables: producer_id <chr>,
# address <chr>, subplot <int>, m_ratio <dbl>, cid.y <int>,
# percent_fat <dbl>, percent_cp <dbl>, percent_ash <dbl>,
# percent_adf <dbl>, percent_ndf <dbl>, percent_lignin <dbl>,
# percent_nfc <dbl>, percent_cellulose_calc <dbl>,
# percent_hemicell_calc <dbl>, percent_carb_nnorm <dbl>,
# percent_cellulose_nnorm <dbl>, percent_lignin_nnorm <dbl>,
# percent_carb_norm <dbl>, percent_cellulose_norm <dbl>,
# percent_lignin_norm <dbl>, fresh_wt_a <dbl>, fresh_wt_b <dbl>,
# bag_wt <dbl>, legumes_40 <lgl>, percent_n_nir <dbl>,
# dry_wt_est_g_m2 <dbl>, percent_fat_g_m2 <dbl>, percent_cp_g_m2 <dbl>,
# percent_ash_g_m2 <dbl>, percent_adf_g_m2 <dbl>, percent_ndf_g_m2 <dbl>,
# percent_lignin_g_m2 <dbl>, percent_nfc_g_m2 <dbl>,
# percent_cellulose_calc_g_m2 <dbl>, percent_hemicell_calc_g_m2 <dbl>,
# percent_carb_nnorm_g_m2 <dbl>, percent_cellulose_nnorm_g_m2 <dbl>,
# percent_lignin_nnorm_g_m2 <dbl>, percent_carb_norm_g_m2 <dbl>,
# percent_cellulose_norm_g_m2 <dbl>, percent_lignin_norm_g_m2 <dbl>,
# percent_n_nir_g_m2 <dbl>
Now that we’ve pulled that long three-table query into the local R environment, we can do some visualization and analysis.
Let’s look at hemicellulose content as an example.
local_biomass_df %>%
filter(!is.na(percent_hemicell_calc_g_m2)) %>%
ggplot(aes(state, percent_hemicell_calc_g_m2)) +
geom_boxplot() +
facet_grid(
~year,
space = "free_x",
scales = "free_x"
)
What about the relationships between the composition data? Let’s look at crude protein content as a function of total dry matter:
local_biomass_df %>%
filter(!is.na(percent_cp), !is.na(dry_wt_est_g_m2)) %>%
ggplot(aes(dry_wt_est_g_m2, percent_cp, color = factor(year))) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
stat_smooth(method = "lm", se = FALSE)