9 Scientific programming

Its possible to connect to the database with various scientific programming languages, such as R, Python, Matlab etc. This section only gives a rough guide as it’s expected that users of those tools find good ressources/documentation on their own anyway.

9.1 R

It is adviced to use the DBI package together with the RPostgreSQL driver and use dbplyr for working with the database. A good place to start is also the RStudio documentation.

A workflow could look similar to this:

9.1.1 Establish a database connection

# load necessary packages ------------------

library(DBI) # for the connection
library(tidyverse) # including dplyr/dbplyr
library(ggplot2) # to do the plots shown below

# connection -------------------------------

con <- dbConnect(RPostgres::Postgres(), dbname = 'iht_database', 
                 host = 'dazzling-calculator-894.db.databaselabs.io', 
                 port = 5432,
                 user = 'YOUR_USERNAME_HERE',
                 password = 'YOUR_PASSWORD_HERE')

# info -------------------------------------
#list all tables in DB
dbListTables(con)
#db info
dbGetInfo(con)

9.1.2 Retrieve data

here we show how to use a table (resp. a View as a ‘virtual table’ - “solid_bending3”) and just manipulate it like a tibble. dbplyr translates the functions to SQL and sends it to the database. To plot we need to call collect() to read the data to local memory.

To use a table/view in the public schema simply type:

dat <- tbl(con, "solid_bending3")

To access any schema use in_schema():

dat <- tbl(con, dbplyr::in_schema('public','solid_bending3'))

We can then use the object as usual, use collect() to pull the data from the database to local memory (only do this for tasks not possible within the database):

ggplot(collect(dat), aes(reorder(species, measurement_value, FUN=median), measurement_value)) + 
  geom_boxplot() +
  ylab('bending strength') +
  xlab('species') + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, face='italic'))

# here we plot the same, but filter for species with replication > 10

give.n <- function(x){
  return(c(y = 1, label = length(x)))
}

dat %>%   
  group_by(species) %>%
  collect %>% 
  filter(n() > 10) %>% 
  ggplot(aes(reorder(species, measurement_value, FUN=median), measurement_value)) + 
  geom_boxplot() +
  ylab('bending strength') +
  xlab('species') + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1, face='italic'))+
  stat_summary(fun.data = give.n, geom = "text")

9.1.3 Basic queries

You can manipulate the data using normal dplyr syntax. dbplyr will translate the manipulations to SQL and calculations will be done within the database.

Some examples:

Show columns from measurement_value to measurement_direction and filter values > 110:

colnames(dat)

dat %>% 
  select(measurement_value:measurement_direction) %>% 
  filter(measurement_value > 110)

Find max value per species and sort descending show highes 5:

(query1 <- dat %>% 
  group_by(species) %>% 
  summarize(max_by_species = max(measurement_value)) %>% 
  arrange(desc(max_by_species)) %>% 
  top_n(5))

If you are interested in the SQL code sent to the database you can use show_query():

query1 %>% show_query()

9.1.4 Data wrangling

data is stored in the database in the “long format”. For example, the sugar content of a speciemen is therefore stored in multiple rows for different sugars, only connected through the specimen id. To get the to the wide format (specimen in rows, sugars in columns) you could do the following:

dat <- tbl(con, dbplyr::in_schema('solid','stm_holocellulose_content'))
(tmp <- dat %>% 
  select(measurement_value, project_specimen_id, carbohydrate, species) %>% 
  collect() %>% 
  spread(carbohydrate, measurement_value))

Now we could sum up the sugar contents of the individual specimens to receive a total sugar content per specimen and then average per species:

tmp %>% 
  mutate(total = rowSums(tmp %>% select(-project_specimen_id, -species))) %>% 
  select(species, project_specimen_id, total) %>% 
  group_by(species) %>% 
  summarise(mean(total))

9.1.5 Disconnect from the database

dbDisconnect(con)

9.2 Python

pyscopg2 is the most popular and adviced driver to connect through python. Anyway, if you are looking for alternatives, you find suggestions here.

9.3 Matlab

Two possible options are: