
Datamate Documentation

Datamate is a lightweight data and configuration management framework in Python, tailored to support machine learning research. It provides a programming interface to store and retrieve files on a hierarchical filesystem through Directory objects, enabling data creation and access with standard Python code. Built on HDF5 and numpy, it handles file I/O while treating the filesystem as memory.

Example

The following example demonstrates how to set up an experiment directory, store the experiment configuration in _meta.yaml, and store arrays as HDF5 files, all without boilerplate code. More examples can be found in the Examples section.

import datamate
import numpy as np

# Set up experiment directory
datamate.set_root_dir("./experiments")

# Set up experiment configuration
config = {
    "model": "01",
    "date": "2024-03-20",
    "optimizer": "Adam",
    "learning_rate": 0.001,
    "n_epochs": 100,
    "description": "Setting the learning rate to 0.001"
}

# Set up experiment directory at ./experiments/vision_study/model_01
# and store configuration in _meta.yaml
exp = datamate.Directory("vision_study/model_01", config)

# Store arrays as HDF5 files
exp.images = np.random.rand(100, 64, 64)  # stored as images.h5
exp.responses = np.zeros((100, 1000))     # stored as responses.h5

# Verify that the experiment data is set up
print(exp)

def train(exp: datamate.Directory):
    """Train a model using the experiment's config and data."""
    # Set up optimizer using config
    optimizer = get_optimizer(exp.config.optimizer, lr=exp.config.learning_rate)

    losses = []
    # Training loop using config parameters
    for epoch in range(exp.config.n_epochs):
        # ... training code producing a scalar loss ...

        # Cache results in memory to avoid high I/O overhead
        losses.append(loss)

    # Store results back in experiment directory outside of the training loop
    exp.losses = np.array(losses)  # creates losses.h5

# Run training
train(exp)

# Access results
mean_loss = exp.losses.mean()  # compute mean loss
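
The stored configuration and arrays persist on disk, so the same experiment directory can be reopened later from its path alone. The following is a minimal sketch of such a read-back session; it assumes that constructing a Directory for an existing path opens it without passing the configuration again, and that slicing an array attribute with [:] materializes it as a numpy array.

import datamate

datamate.set_root_dir("./experiments")

# Reopen the existing experiment directory by its relative path
# (assumption: an existing directory is opened as-is, no config needed)
exp = datamate.Directory("vision_study/model_01")

# Configuration stored in _meta.yaml is available via exp.config
print(exp.config.learning_rate)  # 0.001

# Arrays stored as HDF5 files are read back on attribute access
images = exp.images[:]           # load images.h5 (assumption: [:] yields a numpy array)
print(images.shape)              # (100, 64, 64)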

Main Features

The main features of datamate are:

  • Filesystem as memory through Directory objects
  • Hierarchical data organization (see the sketch after this list)
  • Automatic path handling and resolution with pathlib
  • Array storage in HDF5 format
  • Parallel read/write operations
  • Configuration-based compilation and access of data
  • Configuration management in YAML files
  • Configuration comparison and diffing
  • Pandas DataFrame integration
  • Directory structure visualization (tree view)
  • Basic experiment status tracking
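
As an illustration of the hierarchical organization and tree view, the sketch below nests data under a subdirectory. It assumes that attribute access on a Directory yields nested sub-Directory objects, so that assigning an array to a nested attribute creates the corresponding subdirectory and HDF5 file; the analysis subdirectory and array name are hypothetical.

import datamate
import numpy as np

datamate.set_root_dir("./experiments")
exp = datamate.Directory("vision_study/model_01")

# Assumption: attribute access returns a nested sub-Directory, so the
# assignment below would create analysis/tuning_curves.h5 on disk
exp.analysis.tuning_curves = np.random.rand(100, 8)

# Printing a Directory renders its contents as a tree view
print(exp)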

Installation

pip install datamate

Tutorials

API Reference

For detailed information about Datamate’s components and functions, please refer to our API Reference section.

datamate was co-developed with the flyvis project, which serves as a comprehensive real-world usage example of datamate.

artisan is the original framework that inspired datamate.

Getting Help

If you have any questions or encounter any issues, please check our FAQ or Contributing pages for more information.