PSA Training Series github.com/precision-sustainable-ag

Database Querying with R — Brian W. Davis 2020-01-07

  1. What is a DB? (video)
  2. Getting started with R (video)
  3. How to use a local DB (video)
  4. How to use the PSA DB (video)

1 What is a DB?

A database is a rectangular storage system, columns and rows of data, just like a spreadsheet. Unlike a spreadsheet, the way you read and write data in a DB is through code or the command line, not through a point-and-click interface like Google Sheets or Excel.

I’ll mostly be using “database” to mean a relational database management system, like PostgreSQL or SQLite, but there are other paradigms.

The code you use to interact with the DB is called “SQL” (Structured Query Language), but you don’t have to know SQL. In the code examples I’ll show, there will be a “vanilla SQL” version and a “tidy R” version.

1.1 Where is a DB?

A database consists of a server and one or more clients. “Server” and “clients” here just mean computer applications; the server can be a webserver living out in the cloud, or it can be your own laptop. The client can be a command line terminal, an R session, a SAS session, a webpage running JavaScript, etc.

I’ll demonstrate later on how to use a local DB, where the server and the client are both on your computer. Then I’ll discuss the PSA On-Farm database, which is hosted on Azure and you can access from anywhere with an internet connection.

1.2 What’s inside?

Inside a database there are one or more tables, which are analogous to individual tabs in a regular spreadsheet. What makes these tables different is that:

  • Each column only contains one type of data (numbers, text, dates)
  • The type of each column must be decided before you put any data in it
  • There is nothing outside the “rectangle” (no plots, no notes in the margins, no comment boxes floating over top)

There are additional benefits to using a structured relational database:

  • Each table has a “key” that connects an observation between tables
  • Keys can map one row to another row, or many rows to one row, or vice versa
  • Planning a structure (schema) helps you decide how to organize your data

Are there downsides to using a database?

  • You have to learn how to use them, by reading this tutorial

1.3 Example

Let’s look at a toy database with two tables:

  plot_id crop    tillage
1     101 corn    no-till
2     102 corn strip-till
3     103  soy strip-till
4     104  soy    no-till
5     201 corn strip-till
6     202  soy    no-till
7     203 corn    no-till
8     204  soy strip-till
  plot_id Mg_ha
1     101  10.0
2     102  12.0
3     103   4.5
4     104   4.5
5     201  11.0
6     202   3.0
7     203   9.5
8     204   5.0

If you were making a spreadsheet of yields, you would probably want all of this in one table, so you could associate each observation with the treatments. But I’m sure you’ve noticed that copy-pasting columns of treatments inevitably ends up with some errors that you have to go back and clean up. Then when you want to have another observation type, like cover crop biomass or ground cover or weed count, you have to do it all over again.

With a database, instead you record the minimally identifiable information, like plot_id and the observation. Then once all your data is recorded and stored and cleaned, you can join tables together for your analyses. Note that here we’re using plot_id as our key. That’s how we know which row from one table is connected to a row from another table.

This sort of structure is especially helpful where information from one table would need to be repeated many times. Imagine an experiment with 3 or 4 columns of treatment variables, and then a table of sensor readings every 15 minutes for a year (~35000 rows per plot!). You’d have a lot of duplicated information, and any mistakes (misspelling, miscoding treatments, rows out of order) would break the whole thing.

1.4 What do I do with it though?

In general, databases are for storage and organization, and you use other tools like R or SAS for analysis. Once you have the two tables, you join them either in the DB (via SQL) or locally (using merge or dplyr::full_join):

  plot_id crop    tillage Mg_ha
1     101 corn    no-till  10.0
2     102 corn strip-till  12.0
3     103  soy strip-till   4.5
4     104  soy    no-till   4.5
5     201 corn strip-till  11.0
6     202  soy    no-till   3.0
7     203 corn    no-till   9.5
8     204  soy strip-till   5.0

And then analysis is straightforward, using your normal tools.

Analysis of Variance Table

Response: Mg_ha
             Df Sum Sq Mean Sq  F value    Pr(>F)    
crop          1 81.281  81.281 173.4000 0.0001921 ***
tillage       1  3.781   3.781   8.0667 0.0468587 *  
crop:tillage  1  0.281   0.281   0.6000 0.4818174    
Residuals     4  1.875   0.469                       
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
  crop    tillage Mg_ha
1 corn    no-till  9.75
2  soy    no-till  3.75
3 corn strip-till 11.50
4  soy strip-till  4.75

Or if you prefer tidyverse-flavored R:

# A tibble: 4 x 4
# Groups:   crop [2]
  crop  tillage     mean    sd
  <chr> <chr>      <dbl> <dbl>
1 corn  no-till     9.75 0.354
2 corn  strip-till 11.5  0.707
3 soy   no-till     3.75 1.06 
4 soy   strip-till  4.75 0.354
LS0tCnRpdGxlOiAiMDAxIC0gV2hhdCBpcyBhIGRhdGFiYXNlPyIKb3V0cHV0OgogIGh0bWxfbm90ZWJvb2s6IAogICAgaGlnaGxpZ2h0OiB0YW5nbwogICAgbnVtYmVyX3NlY3Rpb25zOiB5ZXMKICAgIHRoZW1lOiByZWFkYWJsZQogICAgdG9jOiB5ZXMKICAgIHRvY19mbG9hdDogeWVzCiAgICBjb2RlX2ZvbGRpbmc6IG5vbmUKICAgIGluY2x1ZGVzOgogICAgICBiZWZvcmVfYm9keTogYXNzZXRzL2hlYWRlci5odG1sCiAgICAgIGFmdGVyX2JvZHk6IGFzc2V0cy9mb290ZXIuaHRtbAotLS0KIAojIFdoYXQgaXMgYSBEQj8KQSBkYXRhYmFzZSBpcyBhIHJlY3Rhbmd1bGFyIHN0b3JhZ2Ugc3lzdGVtLCBjb2x1bW5zIGFuZCByb3dzIG9mIGRhdGEsIGp1c3QgbGlrZSBhIHNwcmVhZHNoZWV0LiBVbmxpa2UgYSBzcHJlYWRzaGVldCwgdGhlIHdheSB5b3UgcmVhZCBhbmQgd3JpdGUgZGF0YSBpbiBhIERCIGlzIHRocm91Z2ggY29kZSBvciB0aGUgY29tbWFuZCBsaW5lLCBub3QgdGhyb3VnaCBhIHBvaW50LWFuZC1jbGljayBpbnRlcmZhY2UgbGlrZSBHb29nbGUgU2hlZXRzIG9yIEV4Y2VsLgoKSSdsbCBtb3N0bHkgYmUgdXNpbmcgImRhdGFiYXNlIiB0byBtZWFuIGEgcmVsYXRpb25hbCBkYXRhYmFzZSBtYW5hZ2VtZW50IHN5c3RlbSwgbGlrZSBQb3N0Z3JlU1FMIG9yIFNRTGl0ZSwgYnV0IHRoZXJlIGFyZSBvdGhlciBwYXJhZGlnbXMuCiAKVGhlIGNvZGUgeW91IHVzZSB0byBpbnRlcmFjdCB3aXRoIHRoZSBEQiBpcyBjYWxsZWQgIlNRTCIgKFN0cnVjdHVyZWQgUXVlcnkgTGFuZ3VhZ2UpLCBidXQgeW91IGRvbid0ICpoYXZlKiB0byBrbm93IFNRTC4gSW4gdGhlIGNvZGUgZXhhbXBsZXMgSSdsbCBzaG93LCB0aGVyZSB3aWxsIGJlIGEgInZhbmlsbGEgU1FMIiB2ZXJzaW9uIGFuZCBhICJ0aWR5IFIiIHZlcnNpb24uCgojIyBXaGVyZSBpcyBhIERCPwoKQSBkYXRhYmFzZSBjb25zaXN0cyBvZiBhIHNlcnZlciBhbmQgb25lIG9yIG1vcmUgY2xpZW50cy4gIlNlcnZlciIgYW5kICJjbGllbnRzIiBoZXJlIGp1c3QgbWVhbiBjb21wdXRlciBhcHBsaWNhdGlvbnM7IHRoZSBzZXJ2ZXIgY2FuIGJlIGEgd2Vic2VydmVyIGxpdmluZyBvdXQgaW4gdGhlIGNsb3VkLCBvciBpdCBjYW4gYmUgeW91ciBvd24gbGFwdG9wLiBUaGUgY2xpZW50IGNhbiBiZSBhIGNvbW1hbmQgbGluZSB0ZXJtaW5hbCwgYW4gUiBzZXNzaW9uLCBhIFNBUyBzZXNzaW9uLCBhIHdlYnBhZ2UgcnVubmluZyBKYXZhU2NyaXB0LCBldGMuIAoKSSdsbCBkZW1vbnN0cmF0ZSBsYXRlciBvbiBob3cgdG8gdXNlIGEgbG9jYWwgREIsIHdoZXJlIHRoZSBzZXJ2ZXIgYW5kIHRoZSBjbGllbnQgYXJlIGJvdGggb24geW91ciBjb21wdXRlci4gVGhlbiBJJ2xsIGRpc2N1c3MgdGhlIFBTQSBPbi1GYXJtIGRhdGFiYXNlLCB3aGljaCBpcyBob3N0ZWQgb24gQXp1cmUgYW5kIHlvdSBjYW4gYWNjZXNzIGZyb20gYW55d2hlcmUgd2l0aCBhbiBpbnRlcm5ldCBjb25uZWN0aW9uLgoKIyMgV2hhdCdzIGluc2lkZT8KCkluc2lkZSBhIGRhdGFiYXNlIHRoZXJlIGFyZSBvbmUgb3IgbW9yZSB0YWJsZXMsIHdoaWNoIGFyZSBhbmFsb2dvdXMgdG8gaW5kaXZpZHVhbCB0YWJzIGluIGEgcmVndWxhciBzcHJlYWRzaGVldC4gV2hhdCBtYWtlcyB0aGVzZSB0YWJsZXMgZGlmZmVyZW50IGlzIHRoYXQ6CgogKiBFYWNoIGNvbHVtbiBvbmx5IGNvbnRhaW5zIG9uZSB0eXBlIG9mIGRhdGEgKG51bWJlcnMsIHRleHQsIGRhdGVzKQogKiBUaGUgdHlwZSBvZiBlYWNoIGNvbHVtbiBtdXN0IGJlIGRlY2lkZWQgYmVmb3JlIHlvdSBwdXQgYW55IGRhdGEgaW4gaXQKICogVGhlcmUgaXMgbm90aGluZyBvdXRzaWRlIHRoZSAicmVjdGFuZ2xlIiAobm8gcGxvdHMsIG5vIG5vdGVzIGluIHRoZSBtYXJnaW5zLCBubyBjb21tZW50IGJveGVzIGZsb2F0aW5nIG92ZXIgdG9wKQogClRoZXJlIGFyZSBhZGRpdGlvbmFsIGJlbmVmaXRzIHRvIHVzaW5nIGEgc3RydWN0dXJlZCByZWxhdGlvbmFsIGRhdGFiYXNlOgoKICogRWFjaCB0YWJsZSBoYXMgYSAia2V5IiB0aGF0IGNvbm5lY3RzIGFuIG9ic2VydmF0aW9uIGJldHdlZW4gdGFibGVzCiAqIEtleXMgY2FuIG1hcCBvbmUgcm93IHRvIGFub3RoZXIgcm93LCBvciBtYW55IHJvd3MgdG8gb25lIHJvdywgb3IgdmljZSB2ZXJzYQogKiBQbGFubmluZyBhIHN0cnVjdHVyZSAoc2NoZW1hKSBoZWxwcyB5b3UgZGVjaWRlIGhvdyB0byBvcmdhbml6ZSB5b3VyIGRhdGEKIApBcmUgdGhlcmUgZG93bnNpZGVzIHRvIHVzaW5nIGEgZGF0YWJhc2U/CgogKiBZb3UgaGF2ZSB0byBsZWFybiBob3cgdG8gdXNlIHRoZW0sIGJ5IHJlYWRpbmcgdGhpcyB0dXRvcmlhbAoKIyMgRXhhbXBsZQpMZXQncyBsb29rIGF0IGEgdG95IGRhdGFiYXNlIHdpdGggdHdvIHRhYmxlczoKCmBgYHtyIGVjaG8gPSBGQUxTRX0KdHJlYXRtZW50cyA8LSBkYXRhLmZyYW1lKAogIHBsb3RfaWQgPSBjKDEwMSwgMTAyLCAxMDMsIDEwNCwgMjAxLCAyMDIsIDIwMywgMjA0KSwKICBjcm9wID0gYygiY29ybiIsICJjb3JuIiwgInNveSIsICJzb3kiLCAiY29ybiIsICJzb3kiLCAiY29ybiIsICJzb3kiKSwKICB0aWxsYWdlID0gYygibm8tdGlsbCIsICJzdHJpcC10aWxsIiwgInN0cmlwLXRpbGwiLCAibm8tdGlsbCIsICJzdHJpcC10aWxsIiwgIm5vLXRpbGwiLCAibm8tdGlsbCIsICJzdHJpcC10aWxsIiksCiAgc3RyaW5nc0FzRmFjdG9ycyA9IEZBTFNFCikKCnlpZWxkcyA8LSBkYXRhLmZyYW1lKAogIHBsb3RfaWQgPSBjKDEwMSwgMTAyLCAxMDMsIDEwNCwgMjAxLCAyMDIsIDIwMywgMjA0KSwKICBNZ19oYSA9IGMoMTAsIDEyLCA0LjUsIDQuNSwgMTEsIDMsIDkuNSwgNSksCiAgc3RyaW5nc0FzRmFjdG9ycyA9IEZBTFNFCikKYGBgCgpgYGB7ciBwYWdlZC5wcmludCA9IEZ9CnRyZWF0bWVudHMKeWllbGRzCmBgYApJZiB5b3Ugd2VyZSBtYWtpbmcgYSBzcHJlYWRzaGVldCBvZiB5aWVsZHMsIHlvdSB3b3VsZCBwcm9iYWJseSB3YW50IGFsbCBvZiB0aGlzIGluIG9uZSB0YWJsZSwgc28geW91IGNvdWxkIGFzc29jaWF0ZSBlYWNoIG9ic2VydmF0aW9uIHdpdGggdGhlIHRyZWF0bWVudHMuIEJ1dCBJJ20gc3VyZSB5b3UndmUgbm90aWNlZCB0aGF0IGNvcHktcGFzdGluZyBjb2x1bW5zIG9mIHRyZWF0bWVudHMgaW5ldml0YWJseSBlbmRzIHVwIHdpdGggc29tZSBlcnJvcnMgdGhhdCB5b3UgaGF2ZSB0byBnbyBiYWNrIGFuZCBjbGVhbiB1cC4gVGhlbiB3aGVuIHlvdSB3YW50IHRvIGhhdmUgYW5vdGhlciBvYnNlcnZhdGlvbiB0eXBlLCBsaWtlIGNvdmVyIGNyb3AgYmlvbWFzcyBvciBncm91bmQgY292ZXIgb3Igd2VlZCBjb3VudCwgeW91IGhhdmUgdG8gZG8gaXQgYWxsIG92ZXIgYWdhaW4uCgpXaXRoIGEgZGF0YWJhc2UsIGluc3RlYWQgeW91IHJlY29yZCB0aGUgbWluaW1hbGx5IGlkZW50aWZpYWJsZSBpbmZvcm1hdGlvbiwgbGlrZSBgcGxvdF9pZGAgYW5kIHRoZSBvYnNlcnZhdGlvbi4gVGhlbiBvbmNlIGFsbCB5b3VyIGRhdGEgaXMgcmVjb3JkZWQgYW5kIHN0b3JlZCBhbmQgY2xlYW5lZCwgeW91IGNhbiAqKmpvaW4qKiB0YWJsZXMgdG9nZXRoZXIgZm9yIHlvdXIgYW5hbHlzZXMuIE5vdGUgdGhhdCBoZXJlIHdlJ3JlIHVzaW5nIGBwbG90X2lkYCBhcyBvdXIgKiprZXkqKi4gVGhhdCdzIGhvdyB3ZSBrbm93IHdoaWNoIHJvdyBmcm9tIG9uZSB0YWJsZSBpcyBjb25uZWN0ZWQgdG8gYSByb3cgZnJvbSBhbm90aGVyIHRhYmxlLiAKClRoaXMgc29ydCBvZiBzdHJ1Y3R1cmUgaXMgZXNwZWNpYWxseSBoZWxwZnVsIHdoZXJlIGluZm9ybWF0aW9uIGZyb20gb25lIHRhYmxlIHdvdWxkIG5lZWQgdG8gYmUgcmVwZWF0ZWQgbWFueSB0aW1lcy4gSW1hZ2luZSBhbiBleHBlcmltZW50IHdpdGggMyBvciA0IGNvbHVtbnMgb2YgdHJlYXRtZW50IHZhcmlhYmxlcywgYW5kIHRoZW4gYSB0YWJsZSBvZiBzZW5zb3IgcmVhZGluZ3MgZXZlcnkgMTUgbWludXRlcyBmb3IgYSB5ZWFyICh+MzUwMDAgcm93cyBwZXIgcGxvdCEpLiBZb3UnZCBoYXZlIGEgbG90IG9mIGR1cGxpY2F0ZWQgaW5mb3JtYXRpb24sIGFuZCBhbnkgbWlzdGFrZXMgKG1pc3NwZWxsaW5nLCBtaXNjb2RpbmcgdHJlYXRtZW50cywgcm93cyBvdXQgb2Ygb3JkZXIpIHdvdWxkIGJyZWFrIHRoZSB3aG9sZSB0aGluZy4KCiMjIFdoYXQgZG8gSSBkbyB3aXRoIGl0IHRob3VnaD8KCkluIGdlbmVyYWwsIGRhdGFiYXNlcyBhcmUgZm9yICoqc3RvcmFnZSoqIGFuZCAqKm9yZ2FuaXphdGlvbioqLCBhbmQgeW91IHVzZSBvdGhlciB0b29scyBsaWtlIFIgb3IgU0FTIGZvciAqKmFuYWx5c2lzKiouIE9uY2UgeW91IGhhdmUgdGhlIHR3byB0YWJsZXMsIHlvdSBqb2luIHRoZW0gZWl0aGVyIGluIHRoZSBEQiAodmlhIFNRTCkgb3IgbG9jYWxseSAodXNpbmcgYG1lcmdlYCBvciBgZHBseXI6OmZ1bGxfam9pbmApOgoKYGBge3IgcGFnZWQucHJpbnQgPSBGfQp5aWVsZF9kYXRhIDwtIG1lcmdlKHRyZWF0bWVudHMsIHlpZWxkcykKeWllbGRfZGF0YQpgYGAKCkFuZCB0aGVuIGFuYWx5c2lzIGlzIHN0cmFpZ2h0Zm9yd2FyZCwgdXNpbmcgeW91ciBub3JtYWwgdG9vbHMuCgpgYGB7ciBwYWdlZC5wcmludCA9IEZ9Cm1vZGVsIDwtIGxtKE1nX2hhIH4gY3JvcCAqIHRpbGxhZ2UsIGRhdGEgPSB5aWVsZF9kYXRhKQphbm92YShtb2RlbCkKYWdncmVnYXRlKAogIE1nX2hhIH4gY3JvcCp0aWxsYWdlLCAKICBkYXRhID0geWllbGRfZGF0YSwgCiAgRlVOID0gbWVhbgogICkKYGBgCgpPciBpZiB5b3UgcHJlZmVyIHRpZHl2ZXJzZS1mbGF2b3JlZCBSOgoKYGBge3IgcGFnZWQucHJpbnQgPSBGfQpsaWJyYXJ5KGRwbHlyLCB3YXJuLmNvbmZsaWN0cyA9IEZBTFNFKQoKeWllbGRfZGF0YSAlPiUgCiAgZ3JvdXBfYnkoY3JvcCwgdGlsbGFnZSkgJT4lIAogIHN1bW1hcmlzZV9hdCgKICAgIHZhcnMoTWdfaGEpLCAKICAgIGxpc3Qofm1lYW4oLiksIH5zZCguKSkKICAgICkKYGBgCg==

precisionsustainableag.org


Icons: fontawesome.com, BY 4.0 | This work, BY-SA 4.0