Eric J Ma's Website

Use pyprojroot and Python’s pathlib to manage your data paths!

written by Eric J. Ma on 2020-04-21 | tags: data science pathlib python packages tools


If you adopt a proper organizational structure for your data projects, then each project gets its own directory (i.e. a clean and isolated "workspace") and its own isolated analysis environment (e.g. a conda environment).

In that workspace, your directory structure might look like this:

project/
- data/
- notebooks/
- src/
- setup.py
- README

As such, your notebooks are all going to be in a different directory from your data. This separation keeps the mind sane, but it comes at a cost: you might have subdirectories in the notebooks/ directory that you use to organize the notebooks further, and several notebooks at different depths may need the same file, which leads to brittle path linking. After all, in one notebook, you might do:

import pandas as pd

df = pd.read_csv("data.csv")

But in another notebook that lives in a different directory, to link to the dataset, you might have to do:

import pandas as pd

df = pd.read_csv("../other_dir/data.csv")

The potential for confusion is just immense here.
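
To make the problem concrete: a relative path is interpreted against the current working directory, which for a notebook is typically the directory the notebook lives in. Here is a minimal sketch using a throwaway directory tree. The directory names (notebooks/, exploration/) and the helper resolve_from are made up for illustration.

```python
import os
import tempfile
from pathlib import Path


def resolve_from(cwd: Path, relative: str) -> Path:
    """Return the absolute path that `relative` points to when running inside `cwd`."""
    previous = Path.cwd()
    os.chdir(cwd)
    try:
        return Path(relative).resolve()
    finally:
        # Always restore the original working directory.
        os.chdir(previous)


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve()
    top = project / "notebooks"
    nested = project / "notebooks" / "exploration"
    nested.mkdir(parents=True)

    # The same string, read from two notebook locations.
    from_top = resolve_from(top, "data.csv")
    from_nested = resolve_from(nested, "data.csv")
    same_target = from_top == from_nested

    print(from_top.relative_to(project))
    print(from_nested.relative_to(project))
    print(same_target)
```

The identical string "data.csv" ends up pointing into two different directories, so whether a notebook finds the file depends entirely on where that notebook sits.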

A better way is to provide one authoritative path to a particular dataset that you can use. For example:

import pandas as pd

df = pd.read_csv("../data/data.csv")

But even that is a bit tricky: if you move the notebook for whatever good reason, the path to the data might break. It’s still brittle. We need a better way to resolve paths.

Enter pyprojroot. Written by my fellow PyData conference doppelgänger Daniel Chen, it provides a here function that resolves to your project root directory (hence the package name). The original was written in R (rprojroot), and it’s a wonderful tool for data scientists. Let’s see it in action:

import pandas as pd
from pyprojroot import here

df = pd.read_csv(here() / "data/data.csv")

And voilà! No fragile relative paths, and no long chains of ../../..! Just nice and clean resolution relative to your project root.

How does it work? What pyprojroot does under the hood is climb the directory tree, starting from the current working directory, until it finds one of a set of pre-specified files that are commonly found in a project’s root directory. For example, .git is a common one. For Python packages, setup.py is another.
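
The search can be sketched in a few lines of plain pathlib. This illustrates the idea only and is not pyprojroot's actual source: the function name find_project_root and the marker list are assumptions chosen for the example.

```python
from pathlib import Path
from typing import Iterable, Optional

# Hypothetical marker list for this sketch; pyprojroot ships its own defaults.
MARKERS = (".git", "setup.py", ".here")


def find_project_root(
    start: Optional[Path] = None,
    markers: Iterable[str] = MARKERS,
) -> Path:
    """Walk upward from `start` until a directory contains one of `markers`."""
    current = (start or Path.cwd()).resolve()
    markers = tuple(markers)
    # Check the starting directory first, then every parent up to the filesystem root.
    for directory in (current, *current.parents):
        if any((directory / marker).exists() for marker in markers):
            return directory
    raise FileNotFoundError(f"No project root found at or above {current}")
```

Calling find_project_root() from any notebook inside the project would return the same directory, which is what makes paths built on top of it stable when notebooks move around.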

If your project doesn’t "fit" any of the conventions assumed, or if you have a fancier structure, you can always add a .here file to your project root, and configure the project_files keyword argument so that here only looks for that one authoritative file:

import pandas as pd
from pyprojroot import here

root = here(project_files=[".here"])

df = pd.read_csv(root / "data/data.csv")

And what exactly is the here function returning? Well, it’s returning a pathlib.Path object. Path overloads the / operator (through its __truediv__ method), so you can join path segments in native Python code without any string concatenation!
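
Here is a quick illustration of that operator using only the standard library. PurePosixPath is used so the printed form is the same on every operating system, and the directory /home/user/project is a made-up stand-in for whatever here() would return on your machine.

```python
from pathlib import PurePosixPath

# Made-up project root, standing in for the value returned by here().
root = PurePosixPath("/home/user/project")

# The / operator joins path segments one at a time...
data_file = root / "data" / "data.csv"

# ...and a multi-segment string on the right-hand side gives the same result.
same_file = root / "data/data.csv"

print(data_file)               # /home/user/project/data/data.csv
print(data_file == same_file)  # True
print(data_file.name)          # data.csv
print(data_file.parent)        # /home/user/project/data
```

Because the result is a path object rather than a string, you also get attributes such as name, suffix, and parent for free.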

Now, let us all toast to cleaner path resolution in our data projects!


Cite this blog post:
@article{
    ericmjl-2020-use-paths,
    author = {Eric J. Ma},
    title = {Use pyprojroot and Python’s pathlib to manage your data paths!},
    year = {2020},
    month = {04},
    day = {21},
    howpublished = {\url{https://ericmjl.github.io}},
    journal = {Eric J. Ma's Blog},
    url = {https://ericmjl.github.io/blog/2020/4/21/use-pyprojroot-and-pythons-pathlib-to-manage-your-data-paths},
}
  
