boilerplate for reproducible and transparent science
This is a template for Python-based data analysis workflows and tools. It’s a fork of Mario Krapp’s boilerplate for Python data science projects, Reproducible Science. The original derives from Cookiecutter Data Science: A logical, reasonably standardized, but flexible project structure for doing and sharing data science work.
Here I reintroduce some elements according to my own needs and preferences, along with ideas from other important sources of inspiration.
Install cookiecutter from Pip or Conda.
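For example, either of the following should work:

pip install cookiecutter

conda install -c conda-forge cookiecutter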
To start a new science project from this version of the template:
cookiecutter gh:miguelarbesu/cookiecutter-reproducible-science --checkout main
If you have a local copy of the template instead, you can run:
cookiecutter ./cookiecutter-reproducible-science --checkout main
The template creates the following project structure:

├─ data                 <--- Experimental data
│  ├─ external
│  ├─ interim
│  ├─ processed
│  └─ raw
├─ devtools             <--- Development tools
├─ doc                  <--- Project documentation
├─ notebook             <--- Exploratory Jupyter notebooks
├─ output               <--- Final analysis report
│  └─ figures
├─ src
│  ├─ packagename
│  │  ├─ __init__.py
│  │  ├─ __main__.py
│  │  └─ modulename.py
│  ├─ data
│  └─ tests
│     ├─ __init__.py
│     └─ test_modulename.py
└─ setup.py
Typically, I start by analyzing a given data set (or sets) with experimental results and/or a reference database from others. My own data is saved as an immutable dump under data/raw, and third-party data under data/external. An exploratory stage to evaluate and clean the data usually follows; intermediate data belong in data/interim. Finally, some kind of elaborated data is derived (e.g., parameters from a fitting): the processed data, stored under data/processed.
I usually start exploring data in Jupyter notebooks under /notebook, writing basic functions to delineate a piece of the analysis pipeline, and then refactor them under /src once they are functional. This exploratory phase should not eclipse proper coding: write directly in the module and start writing tests early.
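As a minimal sketch of that refactoring step (the file contents below are hypothetical and assume pandas; only the paths come from the template), a cleaned-up helper could live in src/packagename/modulename.py:

# src/packagename/modulename.py -- hypothetical helper refactored out of a notebook
import pandas as pd

def load_raw_measurements(path):
    """Read a raw CSV dump and drop rows with missing values."""
    data = pd.read_csv(path)
    return data.dropna()

with a matching test in src/tests/test_modulename.py:

# src/tests/test_modulename.py -- hypothetical pytest test for the helper above
from packagename.modulename import load_raw_measurements

def test_load_raw_measurements_drops_missing_rows(tmp_path):
    csv_file = tmp_path / "measurements.csv"
    csv_file.write_text("a,b\n1,2\n3,\n")  # second row has a missing value
    data = load_raw_measurements(csv_file)
    assert len(data) == 1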
At this point, the nature of the project should define the form of the repository. For a one-off analysis of a small set of measurements, a simple module usually does the trick, and one need not bother distributing a proper package. This does not mean lousy code: document and test properly, as these small pieces may be needed in the future or incorporated into larger projects. An example of this is the code associated with the figures in a research article.
For recurrent analyses on new data sets of the same kind, a proper tool is needed. Usually, a Command Line Interface (CLI) is the way to go. Turning a module into a CLI is natural in Python: it just involves adding a parsing layer. While argparse is the standard-library tool, Click is easy and powerful. Further down the road, one may want to create a GUI.
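As a hedged sketch of such a parsing layer (the package and function names are hypothetical, and Click is used only as an example), the __main__.py entry point could look like this:

# src/packagename/__main__.py -- hypothetical Click-based entry point
import click

from packagename.modulename import load_raw_measurements  # hypothetical helper


@click.command()
@click.argument("input_path", type=click.Path(exists=True))
@click.option("--output-dir", default="output", show_default=True,
              help="Where to write results and figures.")
def main(input_path, output_dir):
    """Run the analysis on INPUT_PATH and write results to the output directory."""
    data = load_raw_measurements(input_path)
    click.echo(f"Loaded {len(data)} rows; writing results to {output_dir}")


if __name__ == "__main__":
    main()

With this layout, the tool runs as python -m packagename data/raw/measurements.csv thanks to the __main__.py module.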
For large-scale pipeline analyses with a DAG structure (e.g., bioinformatics studies processing thousands of files), a Makefile is desirable. Software Carpentry has a great tutorial on the topic.
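As an illustration only (the data files and scripts here are hypothetical; only the directory layout follows the template), a two-step rule chain in a Makefile could look like this, where $< is the first prerequisite, $@ is the target, and each recipe line must be indented with a tab:

# Clean the raw dump, then fit the cleaned data.
data/interim/clean.csv: data/raw/measurements.csv
	python src/packagename/clean.py $< $@

data/processed/params.csv: data/interim/clean.csv
	python src/packagename/fit.py $< $@

Running make data/processed/params.csv then rebuilds only the steps whose inputs have changed.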
The distilled information derived from the analysis is usually presented as plots integrated into a report. A Jupyter notebook is a good format to put all this together in an interactive and exportable form.
All this material is finally found under output and output/figures.
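For example, a report notebook can be exported there as HTML with nbconvert (the notebook name is hypothetical):

jupyter nbconvert --to html notebook/report.ipynb --output-dir output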
This project is licensed under the terms of the BSD License.