From interactive data exploration to model fitting: A data science journey on NeSI
A data science project involves many steps, from data collection to reporting, each with a different set of computing requirements. To help data scientists with their projects, New Zealand eScience Infrastructure (NeSI) provides a variety of tools and devices that can be applied at each of these stages.
In this talk, we will journey through the phases of a typical NeSI data science project and explain which tools are best suited for each step. In the early phases of a project, when a data scientist might want to interactively explore a dataset, we’ll show how this can be done using the Jupyter and RStudio environments. For the next phase of the project, when the data scientist might need to scale up their processing, we will show how to request scalable computing resources (e.g. GPUs, high-memory nodes) using Slurm. Mid-project, ensuring the reproducibility of the results becomes crucial to the data scientist – we will show how virtual environments and containers can support this.
As the data scientist adds more processing steps to their research workflows (data cleaning, model fitting, validation and reporting, model testing), we’ll show how a workflow management system can orchestrate this complexity. Finally, to automate the data scientist’s data flows, tools like Globus can be used to schedule transfers of large amounts of data.
ABOUT THE AUTHORS
Maxime Rio is a data science engineer and data scientist at NeSI and NIWA. He enjoys helping researchers analyse their data, from visualisation to probabilistic modelling.
Alex Pletzer is a research software engineer for NeSI at NIWA, helping researchers in Aotearoa run their code better and faster.
Chris Scott is a research software engineer for NeSI based at the University of Auckland. He leads the Computational Science Team in charge of the consultancy service at NeSI.