FRES machine learning

Scientific seminar 05.12.24

Institute of Marine Research

Contents

  • Demystifying AI & Machine learning
    • Definitions & nomenclature
    • The workflow
  • Applying ML to penguin data
  • Discussion on:
    • How do we use ML today, and how should we use it?
    • How should we store and use data?

On AI and machine learning

Definitions

  • AI: Intelligence exhibited by machines
    • Rule- or expert-based systems: WolframAlpha, DEREK Nexus
  • ML: Learning patterns from existing data without being explicitly programmed, and applying these patterns to new data.
    • Regression, PCA, clustering, random forest, gradient boosting, …
  • DL: Deep learning, i.e. machine learning with (multi-layer) neural networks
    • Large language models (ChatGPT)
    • Image recognition

Nomenclature of ML

Data

Target: species. Features: island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g.

species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
Adelie     Torgersen  39.1            18.7           181                3750
Gentoo     Biscoe     46.1            13.2           211                4500
Chinstrap  Dream      46.5            17.9           192                3500

Figure 1: Data from Horst AM, Hill AP, Gorman KB (2020), doi:10.5281/zenodo.3960218

Models

  • Supervised learning: we have targets/responses
  • Unsupervised learning: we have only features, no labels or targets
  • Hyperparameters: settings of a model chosen before training (e.g. the number of trees), not learned from the data; see the sketch below
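
A minimal sketch of the distinction, assuming the palmerpenguins data used later in this talk:

library(tidyverse)
library(palmerpenguins)

pen <- penguins %>% drop_na()

# supervised: the target (body_mass_g) is given to the model
supervised_fit <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, data = pen)

# unsupervised: only features are given, no target
unsupervised_fit <- kmeans(select(pen, bill_length_mm, bill_depth_mm), centers = 3)

# 'centers = 3' is a hyperparameter: chosen by us, not estimated from the data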

The ML toolbox

(fig from app.datacamp.com/learn/courses/understanding-machine-learning)

Supervised learning

  • Regression (linear and logistic)
  • Tree-based methods (random forest, boosting)
  • Neural networks
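
As a minimal supervised-learning sketch with tidymodels (logistic regression predicting penguin sex); this is illustrative only and not part of the penguin workflow below:

library(tidymodels)
library(palmerpenguins)

pen <- penguins %>% drop_na()

# logistic regression: a supervised classifier for a binary target
logit_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(sex ~ body_mass_g + bill_depth_mm, data = pen)

predict(logit_fit, new_data = pen) %>% head()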

Unsupervised learning

  • Dimensionality reduction (PCA)
  • Clustering
    • k-means
    • Hierarchical agglomerative clustering (HAC)
    • DBSCAN
  • Neural networks

Reinforcement learning (e.g. AlphaGo) is the third main class, but it is outside the scope of this seminar.
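
And a minimal unsupervised sketch on the same measurements (PCA for dimensionality reduction, k-means for clustering); column names assume the palmerpenguins data:

library(tidyverse)
library(palmerpenguins)

pen_num <- penguins %>%
  drop_na() %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# dimensionality reduction: PCA on standardised features
pen_pca <- prcomp(pen_num, scale. = TRUE)
summary(pen_pca)

# clustering: k-means on the same standardised features
set.seed(123)
pen_km <- kmeans(scale(pen_num), centers = 3)
table(pen_km$cluster)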

The ML workflow

(fig from app.datacamp.com/learn/courses/understanding-machine-learning)

Supervised learning with a tree-based method

Random forest applied to penguin data

Random forest

Figure by Jeremybeauchamp (Wikipedia)

The data

Figures by @allison_horst
penguins %>% glimpse()
Rows: 344
Columns: 7
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…

The recipe

(fig from app.datacamp.com/learn/courses/understanding-machine-learning)

0 Tidy data

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Figure from R for Data Science (2nd ed., 2023) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
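
A minimal sketch of what tidying looks like in practice; the wide table below is invented purely for illustration:

library(tidyverse)

# untidy: one variable (a yearly value) spread across year columns (invented example)
untidy <- tribble(
  ~station, ~`2022`, ~`2023`,
  "A",      1.2,     1.5,
  "B",      0.8,     0.9
)

# tidy: each variable (station, year, value) is a column, each observation a row
tidy <- untidy %>%
  pivot_longer(-station, names_to = "year", values_to = "value")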

1 Extract features

penguins %>% glimpse()
Rows: 344
Columns: 7
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
# keep the target and the numeric features; drop rows with missing values
penguins_extracted <- penguins %>% select(-sex, -island) %>% drop_na()
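
Optionally, the extracted features can be inspected visually, e.g. with a pairs plot; a sketch using the GGally package from the packages list:

# pairwise feature relationships, coloured by the target
penguins_extracted %>%
  GGally::ggpairs(aes(colour = species))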

2 Split data

  • Training data - used to fit the model
    • Validation/development data - used for model tuning and selection
  • Test data - used for an unbiased evaluation of the final model
set.seed(123)
penguin_split <- initial_split(penguins_extracted, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

## bootstrap resamples of the training data act as validation sets for tuning
penguin_boot <- bootstraps(penguin_train, times = 25)
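
A quick optional check that the stratified split keeps the class balance similar in the two partitions:

# class proportions in training and test data should be roughly equal
penguin_train %>% count(species) %>% mutate(prop = n / sum(n))
penguin_test %>% count(species) %>% mutate(prop = n / sum(n))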

3 Train model

Specification and hyperparameter tuning

Set up the model and tune it: run it with different hyperparameter values and assess performance.

tune_spec <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_recipe <- recipe(species ~ ., data = penguin_train)

tune_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(tune_spec)

doParallel::registerDoParallel(cores = 14)

tune_res <- tune_grid(tune_wf, resamples = penguin_boot, grid = 200)
tune_res %>% autoplot(metric = "accuracy")
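
Besides the plot, the tuning results can be inspected as tables, for example:

# all resampled metrics, and the best candidates ranked by accuracy
tune_res %>% collect_metrics()
tune_res %>% show_best(metric = "accuracy", n = 5)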

3 Train model

Final model

We select the best-performing hyperparameters and finalize the model with them.

best_hyperparameters <- select_best(tune_res, metric = "accuracy")

final_rf <- finalize_model(
  tune_spec,
  best_hyperparameters
)

We can also get the variable importance:

final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(species ~ .,
    data = juice(prep(rf_recipe))
  ) %>% vi()
# A tibble: 4 × 2
  Variable          Importance
  <chr>                  <dbl>
1 bill_length_mm         0.267
2 bill_depth_mm          0.156
3 flipper_length_mm      0.152
4 body_mass_g            0.106
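
The same importances can also be plotted with the vip package (a sketch, reusing the fit above):

# bar chart of permutation importances instead of the printed tibble
final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(species ~ ., data = juice(prep(rf_recipe))) %>%
  vip(geom = "col")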

4 Evaluate on unseen test data

final_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(final_rf)

final_res <- final_wf %>% last_fit(penguin_split)

final_res %>%
  collect_predictions() %>%
  conf_mat(truth = species, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Confusion matrix")

We have 96.61% accuracy. Not bad!
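
The quoted accuracy comes from the collected test-set metrics of the final fit:

# overall test-set metrics (accuracy and ROC AUC by default)
final_res %>% collect_metrics()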

97% accuracy?

Ok. Great!

The same workflow again, on messier data: classifying saithe, haddock, and cod

Using POPs data from Boitsov et al. 2024

Using POPs data from Boitsov et al. 2024 (continued)

Accuracy = 82.4 %

# A tibble: 11 × 2
   Variable Importance
   <chr>         <dbl>
 1 pp_dde       0.127 
 2 pp_ddd       0.106 
 3 pcb_180      0.105 
 4 sum_ddt      0.103 
 5 pcb_153      0.101 
 6 pcb_138      0.0390
 7 hcb          0.0363
 8 pcb_118      0.0350
 9 sum_6pcb     0.0298
10 sum_7pcb     0.0260
11 sum_hch      0.0207
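
The fish code is not shown here, but the workflow mirrors the penguin example. As a purely hypothetical sketch, assuming a tibble `pops` holding species plus the POP concentrations listed above:

# hypothetical: 'pops' is assumed to contain species and the POP features above
set.seed(123)
pops_split <- initial_split(pops, strata = species)
pops_train <- training(pops_split)
pops_recipe <- recipe(species ~ ., data = pops_train)
pops_wf <- workflow() %>%
  add_recipe(pops_recipe) %>%
  add_model(tune_spec)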

Make it reproducible

# save environment information (packages and R-version)
renv::init()
renv::snapshot()
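
Collaborators (or a future you) can then recreate the same package library from the lockfile:

# restore the recorded package versions on another machine
renv::restore()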

Push code and data to GitHub: https://github.com/arebruvold/FRES_machine_learning

References

  • Boitsov, S., Frantzen, S., Bruvold, A., & Grøsvik, B. E. (2024) Chemosphere 349, 140939. doi:10.1016/j.chemosphere.2023.140939
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013) An Introduction to Statistical Learning. doi:10.1007/978-1-4614-7138-7
  • Probst, P., Wright, M., & Boulesteix, A.-L. (2019) Hyperparameters and Tuning Strategies for Random Forest. doi:10.1002/widm.1301
  • https://www.tidymodels.org/start/case-study/
  • https://bradleyboehmke.github.io/uc-bana-4080/lesson-6c-random-forests.html & https://juliasilge.com/blog/sf-trees-random-tuning/
  • Theme: https://github.com/mine-cetinkaya-rundel/tidyperspective/blob/main/talks/dagstat-2022.qmd
  • packages:
library(gt)
library(tidyverse)
library(palmerpenguins)
library(tidymodels)
library(vip)
library(GGally)

How do we work with AI and machine learning?

  • HR-MS analysis of emerging contaminants
    • Non-target quantification, data treatment, filtering
    • In-silico prediction of suspects/transformation products and their properties
  • In-silico toxicity predictions
  • Monitoring legacy POPs
    • Interpolation of missing data by ML
    • Concentration and trend predictions?
    • Anomaly/outlier detection?