Introduction to Data Science
4-8 August 2025
Instructor
Gábor Pozsgai, Postdoctoral Research Fellow at the University of the Azores
Course description and aims
Most postgraduate curricula include basic statistics courses, but these typically demonstrate methods on carefully selected, tidy datasets. Students and researchers then struggle when they face real-life datasets, which are often poorly structured, riddled with errors, or full of special characters. This summer school provides a hands-on, applied introduction to data science, with a particular focus on working with messy, complex datasets in R.
Participants will learn:
- How to identify and handle issues in datasets.
- How to structure and store data effectively.
- Best practices for creating robust and reusable datasets.
- Key analytical workflows using R.
About the instructor
Dr Gábor Pozsgai is an insect ecologist and data scientist with nearly two decades of research experience. He holds a PhD in Ecology from the University of Aberdeen (UK) and is currently a Postdoctoral Research Fellow at the University of the Azores. His research explores ecological patterns, ecological networks, and, more recently, spatial interaction models in regional science. He is proficient in a range of modelling techniques, multivariate statistics, spatial analysis, and machine learning, with a particular interest in AI-based biodiversity monitoring. Dr Pozsgai is an expert in R and Python programming and regularly publishes in scientific journals, for which he also serves as a reviewer.
Program schedule
Day 1: Introduction to data and data science
Morning:
- What is data?
- Overview of data science and its applications in ecology and beyond
- Data types, formats, and structures
- Introduction to metadata
- RStudio setup and essentials
- Introduction to R: syntax, variables, data structures
- Loading and exploring basic datasets
Day 2
Morning:
- Designing a data collection plan
- Common pitfalls in data entry and formatting
- Reading and importing data from various sources (CSV, Excel, web, MySQL)
- Dealing with character encodings and locale issues
- Hands-on importing and inspecting real-life datasets
- Spotting and correcting format issues
- Introduction to cleaning data with the tidyverse
Day 3
Morning:
- Cleaning and transforming data with dplyr and tidyr
- Handling missing values and outliers
- Understanding dataset properties: factors, ranges, summaries
- Step-by-step wrangling tasks
- Creating new variables, filtering, and summarising
- Visualising data with ggplot2 (histograms, scatterplots, boxplots)
Day 4
Morning:
- Introduction to spatial data
- Working with image and video data
- Basics of network data and ecological interaction networks
- Brief look at Python as a tool for data analysis
- Mapping with sf and ggmap
- Network visualisation using igraph
- Optional: data analysis experiment with Python
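To give a flavour of what the optional Python experiment might look like, here is a minimal sketch. It is not course material. The inline dataset, its column names (site, species, count) and the helper functions clean_rows and mean_count_by_site are all hypothetical and invented for illustration. It uses only Python's standard library to tidy a small messy table (stray spaces, inconsistent capitalisation, missing values) and summarise it by group.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical messy survey data (invented for illustration):
# stray spaces, inconsistent capitalisation, and missing counts.
RAW = """site,species,count
A, Carabus ,12
a,Carabus,NA
B,carabus,7
B,Pterostichus,
A,Pterostichus,3
"""


def clean_rows(text):
    """Yield (site, species, count) tuples, skipping rows without a usable count."""
    for row in csv.DictReader(io.StringIO(text)):
        raw_count = row["count"].strip()
        if raw_count in ("", "NA"):
            continue  # treat empty cells and "NA" as missing values
        site = row["site"].strip().upper()               # "a" -> "A"
        species = row["species"].strip().capitalize()    # " carabus " -> "Carabus"
        yield site, species, int(raw_count)


def mean_count_by_site(text):
    """Return a dict mapping each site to its mean count after cleaning."""
    groups = defaultdict(list)
    for site, _species, count in clean_rows(text):
        groups[site].append(count)
    return {site: mean(counts) for site, counts in groups.items()}


if __name__ == "__main__":
    print(mean_count_by_site(RAW))
```

Dropping rows with missing counts is only one possible choice. Depending on the research question, imputing or flagging them may be more appropriate.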
Day 5
Morning:
- Finding and using open data repositories
- Data ethics and reproducibility
- Wrap-up discussion: integrating data science into your research
- Final project: wrangling and visualising a messy dataset
- Peer feedback and group discussion
- Summary of techniques and future learning paths
Evaluation
Participants will be assessed based on:
- Active engagement in practical sessions
- A final-day data wrangling and visualisation mini-project
Suggested readings & resources
- Wickham, H. & Grolemund, G. (2017). "R for Data Science." https://r4ds.had.co.nz
- Ellison, A.M. (2010). "Repeatability and transparency in ecological research." Ecology 91(9): 2536–2539.
- Marwick, B., Boettiger, C. & Mullen, L. (2018). "Packaging Data Analytical Work Reproducibly Using R (and Friends)." The American Statistician 72(1): 80–88.
- https://datacarpentry.org – Free hands-on lessons for data science
- https://ropensci.org – R tools for open science
- https://www.tidyverse.org – Core tools for modern R data science