FRES machine learning

Scientific seminar 05.12.24

Institute of Marine Research

Contents

  • Demystifying AI & Machine learning
    • Definitions & nomenclature
    • The workflow
  • Applying ML to penguin data
  • Discussion on:
    • How do we use ML today, and how should we use it?
    • How should we store and use data?

On AI and machine learning

Definitions

  • AI: Intelligence exhibited by machines
    • Rule- or expert-based systems: WolframAlpha, DEREK Nexus
  • ML: Learning patterns from existing data without being explicitly programmed, and applying these patterns to new data.
    • Regression, PCA, clustering, random forest, gradient boosting, …
  • DL: Deep learning, i.e. machine learning with (multi-layer) neural networks
    • Large language models (ChatGPT)
    • Image recognition

Nomenclature of ML

Data

Target: species. Features: island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g.

species    island     bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
Adelie     Torgersen  39.1            18.7           181                3750
Gentoo     Biscoe     46.1            13.2           211                4500
Chinstrap  Dream      46.5            17.9           192                3500

Figure 1: Data from Horst AM, Hill AP, Gorman KB (2020), doi:10.5281/zenodo.3960218

Models

  • Supervised learning: we have targets/responses
  • Unsupervised learning: we have only features, no labels or targets
  • Hyperparameters: settings of a model chosen before training (e.g. the number of trees), not learned from the data; see the sketch below
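
A minimal sketch of the distinction, assuming the palmerpenguins data used later in this talk:

library(tidyverse)
library(palmerpenguins)

pen <- penguins %>% drop_na()

# supervised: the target (body_mass_g) is given to the model
supervised_fit <- lm(body_mass_g ~ flipper_length_mm + bill_length_mm, data = pen)

# unsupervised: only features are given, no target
unsupervised_fit <- kmeans(select(pen, bill_length_mm, bill_depth_mm), centers = 3)

# 'centers = 3' is a hyperparameter: chosen by us, not estimated from the data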

The ML toolbox

(fig from app.datacamp.com/learn/courses/understanding-machine-learning)

Supervised learning

  • Regression (linear and logistic)
  • Tree-based methods (random forest, boosting)
  • Neural networks
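
As a minimal supervised-learning sketch with tidymodels (logistic regression predicting penguin sex); this is illustrative only and not part of the penguin workflow below:

library(tidymodels)
library(palmerpenguins)

pen <- penguins %>% drop_na()

# logistic regression: a supervised classifier for a binary target
logit_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(sex ~ body_mass_g + bill_depth_mm, data = pen)

predict(logit_fit, new_data = pen) %>% head()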

Unsupervised learning

  • Dimensionality reduction (PCA)
  • Clustering
    • k-means
    • Hierarchical agglomerative clustering (HAC)
    • DBSCAN
  • Neural networks

Reinforcement learning (e.g. AlphaGo) is the third main class, but it is outside the scope of this seminar.
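
And a minimal unsupervised sketch on the same measurements (PCA for dimensionality reduction, k-means for clustering); column names assume the palmerpenguins data:

library(tidyverse)
library(palmerpenguins)

pen_num <- penguins %>%
  drop_na() %>%
  select(bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g)

# dimensionality reduction: PCA on standardised features
pen_pca <- prcomp(pen_num, scale. = TRUE)
summary(pen_pca)

# clustering: k-means on the same standardised features
set.seed(123)
pen_km <- kmeans(scale(pen_num), centers = 3)
table(pen_km$cluster)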

The ML workflow

(fig from app.datacamp.com/learn/courses/understanding-machine-learning)

Supervised learning with a tree-based method

Random forest applied to penguin data

Random forest

Figure by Jeremybeauchamp (Wikipedia)

The data

Figures by @allison_horst
penguins %>% glimpse()
Rows: 344
Columns: 7
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…

The recipe

(fig from app.datacamp.com/learn/courses/understanding-machine-learning)

0 Tidy data

  1. Each variable forms a column
  2. Each observation forms a row
  3. Each type of observational unit forms a table

Figure from R for Data Science (2nd ed., 2023) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund.
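
A minimal sketch of what tidying looks like in practice; the wide table below is invented purely for illustration:

library(tidyverse)

# untidy: one variable (a yearly value) spread across year columns (invented example)
untidy <- tribble(
  ~station, ~`2022`, ~`2023`,
  "A",      1.2,     1.5,
  "B",      0.8,     0.9
)

# tidy: each variable (station, year, value) is a column, each observation a row
tidy <- untidy %>%
  pivot_longer(-station, names_to = "year", values_to = "value")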

1 Extract features

penguins %>% glimpse()
Rows: 344
Columns: 7
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
# keep the target and the numeric features; drop rows with missing values
penguins_extracted <- penguins %>% select(-sex, -island) %>% drop_na()
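
Optionally, the extracted features can be inspected visually, e.g. with a pairs plot; a sketch using the GGally package from the packages list:

# pairwise feature relationships, coloured by the target
penguins_extracted %>%
  GGally::ggpairs(aes(colour = species))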

2 Split data

  • Training data - used to fit the model
    • Validation/development data - used for model tuning and selection
  • Test data - used for an unbiased evaluation of the final model
set.seed(123)
penguin_split <- initial_split(penguins_extracted, strata = species)
penguin_train <- training(penguin_split)
penguin_test <- testing(penguin_split)

## bootstrap resamples of the training data act as validation sets for tuning
penguin_boot <- bootstraps(penguin_train, times = 25)
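
A quick optional check that the stratified split keeps the class balance similar in the two partitions:

# class proportions in training and test data should be roughly equal
penguin_train %>% count(species) %>% mutate(prop = n / sum(n))
penguin_test %>% count(species) %>% mutate(prop = n / sum(n))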

3 Train model

Specification and hyperparameter tuning

Set up the model and tune it: run it with different hyperparameter values and assess performance.

tune_spec <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) %>%
  set_mode("classification") %>%
  set_engine("ranger")

rf_recipe <- recipe(species ~ ., data = penguin_train)

tune_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(tune_spec)

doParallel::registerDoParallel(cores = 14)

tune_res <- tune_grid(tune_wf, resamples = penguin_boot, grid = 200)
tune_res %>% autoplot(metric = "accuracy")
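
Besides the plot, the tuning results can be inspected as tables, for example:

# all resampled metrics, and the best candidates ranked by accuracy
tune_res %>% collect_metrics()
tune_res %>% show_best(metric = "accuracy", n = 5)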

3 Train model

Final model

We select the best-performing hyperparameters and finalize the model with them.

best_hyperparameters <- select_best(tune_res, metric = "accuracy")

final_rf <- finalize_model(
  tune_spec,
  best_hyperparameters
)

We can also get the variable importance:

final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(species ~ .,
    data = juice(prep(rf_recipe))
  ) %>% vi()
# A tibble: 4 × 2
  Variable          Importance
  <chr>                  <dbl>
1 bill_length_mm         0.267
2 bill_depth_mm          0.156
3 flipper_length_mm      0.152
4 body_mass_g            0.106
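
The same importances can also be plotted with the vip package (a sketch, reusing the fit above):

# bar chart of permutation importances instead of the printed tibble
final_rf %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(species ~ ., data = juice(prep(rf_recipe))) %>%
  vip(geom = "col")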

4 Evaluate on unseen test data

final_wf <- workflow() %>%
  add_recipe(rf_recipe) %>%
  add_model(final_rf)

final_res <- final_wf %>% last_fit(penguin_split)

final_res %>%
  collect_predictions() %>%
  conf_mat(truth = species, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Confusion matrix")

We have 96.61% accuracy. Not bad!
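
The quoted accuracy comes from the collected test-set metrics of the final fit:

# overall test-set metrics (accuracy and ROC AUC by default)
final_res %>% collect_metrics()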

97% accuracy?

Ok. Great!

The same workflow again, on messier data: classifying saithe, haddock, and cod

Using POPs data from Boitsov et al. 2024

Using POPs data from Boitsov et al. 2024 (continued)

Accuracy = 82.4 %

# A tibble: 11 × 2
   Variable Importance
   <chr>         <dbl>
 1 pp_dde       0.127 
 2 pp_ddd       0.106 
 3 pcb_180      0.105 
 4 sum_ddt      0.103 
 5 pcb_153      0.101 
 6 pcb_138      0.0390
 7 hcb          0.0363
 8 pcb_118      0.0350
 9 sum_6pcb     0.0298
10 sum_7pcb     0.0260
11 sum_hch      0.0207
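
The fish code is not shown here, but the workflow mirrors the penguin example. As a purely hypothetical sketch, assuming a tibble `pops` holding species plus the POP concentrations listed above:

# hypothetical: 'pops' is assumed to contain species and the POP features above
set.seed(123)
pops_split <- initial_split(pops, strata = species)
pops_train <- training(pops_split)
pops_recipe <- recipe(species ~ ., data = pops_train)
pops_wf <- workflow() %>%
  add_recipe(pops_recipe) %>%
  add_model(tune_spec)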

Make it reproducible

# save environment information (packages and R-version)
renv::init()
renv::snapshot()
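
Collaborators (or a future you) can then recreate the same package library from the lockfile:

# restore the recorded package versions on another machine
renv::restore()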

Push code and data to GitHub: https://github.com/arebruvold/FRES_machine_learning

References

  • Boitsov, S., Frantzen, S., Bruvold, A., & Grøsvik, B. E. (2024) Chemosphere 349, 140939. doi:10.1016/j.chemosphere.2023.140939
  • James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013) An Introduction to Statistical Learning. doi:10.1007/978-1-4614-7138-7
  • Probst, P., Wright, M., & Boulesteix, A.-L. (2019) Hyperparameters and Tuning Strategies for Random Forest. doi:10.1002/widm.1301
  • https://www.tidymodels.org/start/case-study/
  • https://bradleyboehmke.github.io/uc-bana-4080/lesson-6c-random-forests.html & https://juliasilge.com/blog/sf-trees-random-tuning/
  • Theme: https://github.com/mine-cetinkaya-rundel/tidyperspective/blob/main/talks/dagstat-2022.qmd
  • packages:
library(gt)
library(tidyverse)
library(palmerpenguins)
library(tidymodels)
library(vip)
library(GGally)

How do we work with AI and machine learning?

  • HR-MS analysis of emerging contaminants
    • Non-target quantification, data treatment, filtering
    • In-silico prediction of suspects/transformation products and their properties
  • In-silico toxicity predictions
  • Monitoring legacy POPs
    • Interpolation of missing data by ML
    • Concentration and trend predictions?
    • Anomaly/outlier detection?