warnason 23566b0885 ExifTool metadata extraction + size in observations + workflow doc
- metadata.py: persistent ExifTool session (avoids Perl startup per file),
  filters out File:/ExifTool: noise
- scanner now populates observations.meta as JSONB
- size duplicated into observations for self-contained queries and to
  strengthen the rescan idempotency check (path + mtime + size)
- README rewritten with state diagram, schema tables, scan/apply workflow
2026-05-26 09:00:35 +02:00

mama

Media Archive Meets Automation — a self-hosted system for ingesting, deduplicating, and organizing personal media (photos, videos, music, documents) on top of ZFS.

⚠️ Project Status

Pre-alpha. Not ready for use by anyone but the author.

  • Database schema, CLI surface, on-disk layout, and HTTP API are unstable and will change without migration paths.
  • Most features described below are only partly implemented; the rest are planned.
  • Documentation lags behind code.

Do not point mama at irreplaceable data. Keep independent backups.

Concept

mama treats every file as two separate things:

  • A blob — pure content, identified by its BLAKE3 hash. Stored once in a content-addressed store, regardless of how many places it appears.
  • An observation — a sighting of that content at a specific filesystem path on a specific host at a specific time, with its own filesystem metadata and embedded metadata (EXIF, ID3, sidecar, ...).

This split is what enables real deduplication without losing context. Identical content found on a phone, a backup DVD, and an old laptop becomes three observations referencing one blob.
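A minimal sketch of the blob/observation split, for illustration only. The class and field names are simplified assumptions and do not mirror mama's actual models (for example, the real schema splits the path into basedir, relpath, and filename).

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Blob:
    hash: str  # hex digest of the content (BLAKE3 in mama)
    size: int  # content size in bytes


@dataclass(frozen=True)
class Observation:
    blob_hash: str  # points at Blob.hash
    hostname: str
    path: str       # simplified: one string instead of basedir/relpath/filename
    seen_at: datetime


def group_by_blob(observations):
    """Map each blob hash to the observations that reference it."""
    grouped = {}
    for obs in observations:
        grouped.setdefault(obs.blob_hash, []).append(obs)
    return grouped


now = datetime.now(timezone.utc)
digest = "ab" * 32  # placeholder 64-character hex string, not a real hash
blob = Blob(hash=digest, size=1024)
sightings = [
    Observation(digest, "phone", "/DCIM/IMG_0001.jpg", now),
    Observation(digest, "dvd", "/backup/2012/IMG_0001.jpg", now),
    Observation(digest, "laptop", "/home/user/Pictures/IMG_0001.jpg", now),
]
by_blob = group_by_blob(sightings)
```

Three sightings collapse onto a single key in `by_blob`, while each observation keeps its own host and path.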

Workflow

mama operates in two phases per source folder:

  1. mama-scan — walk filesystem, hash files, record observations in DB. No copies, no moves. Safe to re-run.
  2. mama-apply — materialize observations into the archive (CAS blobs + hardlinked views). Idempotent.
                ┌─────────────────┐
   filesystem ─▶│   mama-scan     │─▶ observations + blobs in DB
                └─────────────────┘
                                            │
                                            ▼
                                  ┌─────────────────┐
                                  │   mama-apply    │─▶ CAS + views
                                  └─────────────────┘

Storage Layout

<archive_root>/
├── blobs/        Content-addressed storage: blobs/ab/cd/<full-hash>
│                 - mode 444, identical hashes share one inode
├── views/        Hardlink trees scoped by source_kind:
│                 views/<source_kind>/<basedir>/<relpath>/<filename>
│                 - same inode as the corresponding blob (zero extra storage
│                   within the same ZFS dataset)
└── previews/     (planned: derived thumbnails / low-res for browsing)
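The two path schemes above could be computed roughly as follows. This is a sketch, not mama's code: the function names are hypothetical, and the way an absolute basedir is turned into a relative path component (dropping the leading slash) is an assumption.

```python
from pathlib import PurePosixPath


def cas_path(archive_root: PurePosixPath, hex_hash: str) -> PurePosixPath:
    """Return <archive_root>/blobs/<chars 0-1>/<chars 2-3>/<full hash>."""
    if len(hex_hash) != 64:
        raise ValueError("expected a 64-character hex digest")
    return archive_root / "blobs" / hex_hash[:2] / hex_hash[2:4] / hex_hash


def view_path(
    archive_root: PurePosixPath,
    source_kind: str,
    basedir: str,
    relpath: str,
    filename: str,
) -> PurePosixPath:
    """Return <archive_root>/views/<source_kind>/<basedir>/<relpath>/<filename>."""
    # Assumption: an absolute basedir is made relative by dropping the leading "/".
    return archive_root / "views" / source_kind / basedir.lstrip("/") / relpath / filename


root = PurePosixPath("/tank/archive")
digest = "abcd" + "0" * 60  # placeholder digest for illustration
blob_location = cas_path(root, digest)
view_location = view_path(root, "import", "/mnt/dvd", "2012/holiday", "IMG_0001.jpg")
```

The two-level fan-out keeps any single directory under `blobs/` from growing too large.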

Database

blobs — one row per unique content (BLAKE3 hash):

| Column | Type | Purpose |
|---|---|---|
| hash | str(64) PK | BLAKE3 hex digest |
| size | bigint | content size in bytes |
| storage_path | text | location in CAS (set by mama-apply) |
| first_seen | timestamptz | when first scanned |
| mime | str(128)? | detected via libmagic |
| block_reason | str(32)? | NULL = active; planned: deleted/blocked |

observations — one row per file sighting:

| Column | Type | Purpose |
|---|---|---|
| id | int PK | |
| blob_hash | str(64) FK | links to blobs.hash |
| hostname | str(255) | machine where the file was seen |
| basedir | text | scan root path |
| relpath | text | directory below scan root |
| filename | text | |
| size | bigint | size as seen (also in blobs, denormalized) |
| mtime | timestamptz | file's modification time |
| ctime | timestamptz | file's change time |
| scan_time | timestamptz | last time this path was confirmed |
| source_kind | str(32) | syncthing / incoming / existing / import |
| status | str(32) | pending / assigned / ignored |
| meta | jsonb? | ExifTool / ID3 / sidecar metadata |

Indexes:

  • ix_observations_blob_hash — for joins
  • ix_observations_path_mtime — for rescan idempotency (hostname, basedir, relpath, filename, mtime, size)
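To illustrate the kind of join that ix_observations_blob_hash serves, here is a sketch that lists blobs seen in more than one place. It uses an in-memory SQLite database as a stand-in for PostgreSQL and a heavily cut-down version of the two tables, so types and constraints differ from the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE blobs (hash TEXT PRIMARY KEY, size INTEGER NOT NULL);
    CREATE TABLE observations (
        id INTEGER PRIMARY KEY,
        blob_hash TEXT NOT NULL REFERENCES blobs(hash),
        hostname TEXT NOT NULL,
        filename TEXT NOT NULL
    );
    CREATE INDEX ix_observations_blob_hash ON observations (blob_hash);
    """
)
# Placeholder hashes ("aa", "bb") keep the example short.
conn.executemany("INSERT INTO blobs VALUES (?, ?)", [("aa", 10), ("bb", 20)])
conn.executemany(
    "INSERT INTO observations (blob_hash, hostname, filename) VALUES (?, ?, ?)",
    [
        ("aa", "phone", "a.jpg"),
        ("aa", "laptop", "a-copy.jpg"),
        ("bb", "phone", "b.jpg"),
    ],
)

# Blobs that have more than one observation, i.e. duplicated content.
duplicates = conn.execute(
    """
    SELECT b.hash, b.size, COUNT(o.id) AS sightings
    FROM blobs AS b
    JOIN observations AS o ON o.blob_hash = b.hash
    GROUP BY b.hash, b.size
    HAVING COUNT(o.id) > 1
    ORDER BY b.hash
    """
).fetchall()
```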

Observation Lifecycle

stateDiagram-v2
    [*] --> pending: mama-scan (new file)
    pending --> assigned: mama-apply
    pending --> ignored: curation (planned)
    assigned --> ignored: curation (planned)
    ignored --> assigned: curation (planned)

status represents the target (desired) state, not the current state of the filesystem:

  • pending — newly scanned, target not yet decided
    • current: mama-apply auto-promotes to assigned
    • planned: stays pending until reviewed via web UI or rules
  • assigned — should be in the archive; mama-apply ensures the view exists
  • ignored — should not be in the archive; mama-apply ensures no view (planned)

mama-apply's job is to reconcile the filesystem with the target state.
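The transitions in the state diagram can be written down as a small lookup table. This is a sketch only; the names `ALLOWED` and `transition` are hypothetical, and whether mama enforces transitions in code at all is not stated here.

```python
# Allowed status transitions, copied from the state diagram.
ALLOWED = {
    "pending": {"assigned", "ignored"},
    "assigned": {"ignored"},
    "ignored": {"assigned"},
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the diagram has no such edge."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


status = transition("pending", "assigned")
```

Note that nothing leads back to pending: once an observation has been decided, it only moves between assigned and ignored.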

mama-scan in detail

For each file under the scan root:

1. Cheap path check (no content I/O)

Reads:

  • stat() → size, mtime, ctime
  • DB query for an observation matching (hostname, basedir, relpath, filename, mtime, size)

If a match is found:

  • update scan_time on that observation
  • increment unchanged counter
  • skip everything else (no hashing, no metadata extraction)
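A sketch of the cheap check, assuming the known observations are available as a set of tuples. In mama the lookup is a DB query and mtime is stored as timestamptz; here a Python set and the raw nanosecond mtime stand in for both, and the function names are hypothetical.

```python
from pathlib import Path


def rescan_key(hostname: str, basedir: Path, file_path: Path) -> tuple:
    """Build (hostname, basedir, relpath, filename, mtime, size) from stat() alone."""
    st = file_path.stat()  # metadata only, the file content is not read
    rel = file_path.relative_to(basedir)
    # relpath is "." for files that sit directly in the scan root.
    return (
        hostname,
        str(basedir),
        rel.parent.as_posix(),
        rel.name,
        st.st_mtime_ns,
        st.st_size,
    )


def is_unchanged(known_keys: set, key: tuple) -> bool:
    """True if an observation with exactly this key was recorded before."""
    return key in known_keys
```

Any change to size or mtime produces a different key, so the file falls through to full processing.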

2. Full processing (new or modified file)

Reads:

  • BLAKE3 over content → hash
  • libmagic → mime
  • ExifTool → meta JSON
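The hashing step could look roughly like this: the file is read in fixed-size chunks so memory use stays flat for large videos. The sketch assumes the third-party `blake3` package; if it is not installed, it falls back to BLAKE2b from the standard library purely so the example runs. BLAKE2b is a different algorithm and produces different digests, so it is not a substitute in a real archive.

```python
import hashlib
from pathlib import Path

try:
    from blake3 import blake3 as new_hasher  # third-party "blake3" package
except ImportError:
    def new_hasher():
        # Stand-in only: BLAKE2b is NOT BLAKE3 and yields different digests.
        return hashlib.blake2b(digest_size=32)


def hash_file(path: Path, chunk_size: int = 1024 * 1024) -> str:
    """Return a 64-character hex digest of the file content."""
    hasher = new_hasher()
    with path.open("rb") as fh:
        # Read until EOF; an empty bytes object ends the loop.
        while chunk := fh.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()
```

The chunk size does not affect the result, only the number of read calls.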

Writes:

  • new Blob row if hash not seen before (sets: hash, size, mime, first_seen; leaves storage_path empty for mama-apply to fill)
  • new Observation row (sets all fields, status='pending')

Counters reported: files | new obs | unchanged | new blobs | duplicates | with metadata | errors.
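The commit notes say that File: and ExifTool: entries are filtered out of the extracted metadata. Assuming group-prefixed keys of the form `Group:Tag` (as ExifTool emits with its `-j -G` options), that filter might be sketched as below. The function name, the sample values, and the rule that an observation counts as "with metadata" when anything remains are all assumptions.

```python
NOISE_PREFIXES = ("File:", "ExifTool:")


def filter_meta(raw: dict) -> dict:
    """Drop keys in the File: and ExifTool: groups, keep everything else."""
    return {k: v for k, v in raw.items() if not k.startswith(NOISE_PREFIXES)}


# Made-up sample in the shape of one ExifTool JSON record.
raw = {
    "SourceFile": "IMG_0001.jpg",
    "ExifTool:ExifToolVersion": 12.76,
    "File:FileSize": "2.1 MB",
    "EXIF:Make": "ExampleCam",
    "EXIF:DateTimeOriginal": "2020:01:02 03:04:05",
}
meta = filter_meta(raw)
has_metadata = bool(meta)  # assumption: feeds the "with metadata" counter
```

Whether mama also drops `SourceFile` is not stated; this sketch keeps it.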

mama-apply in detail

Processes observations in cursor-paginated batches, ordered by id.
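Cursor (keyset) pagination ordered by id can be sketched as a generator that remembers the last id it saw instead of using an offset. The `fetch_after` callback is a hypothetical stand-in for a query along the lines of `WHERE id > :last_id ORDER BY id LIMIT :limit`; the in-memory rows only exist to make the example runnable.

```python
from typing import Callable, Iterator, Sequence


def iter_batches(
    fetch_after: Callable[[int, int], Sequence[dict]],
    batch_size: int = 500,
) -> Iterator[Sequence[dict]]:
    """Yield batches of rows in id order until the source is exhausted."""
    last_id = 0
    while True:
        batch = fetch_after(last_id, batch_size)
        if not batch:
            return
        yield batch
        last_id = batch[-1]["id"]  # cursor: highest id processed so far


# In-memory stand-in for the observations table.
ROWS = [{"id": i, "status": "pending"} for i in range(1, 8)]


def fetch_after(last_id: int, limit: int) -> list:
    return [row for row in ROWS if row["id"] > last_id][:limit]


batch_sizes = [len(batch) for batch in iter_batches(fetch_after, batch_size=3)]
```

Unlike OFFSET-based paging, the cursor stays valid when rows processed earlier change status between batches.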

For each observation:

  1. If blob.block_reason IS NOT NULL → skip, count as blocked
  2. Compute CAS target path: <archive_root>/blobs/<2>/<2>/<full-hash>
  3. If CAS target doesn't exist:
    • resolve source path: basedir/relpath/filename
    • if source is missing → skip, count as missing
    • try os.link() (instant, same dataset)
    • if the hardlink fails because source and archive are on different datasets (hardlinks cannot cross filesystems), fall back to shutil.copy2(), which costs extra space
    • chmod 444 on the blob
    • set blob.storage_path to the CAS-relative path
  4. Compute view path: <archive_root>/views/<source_kind>/<basedir>/<relpath>/<filename>
  5. If view doesn't exist → os.link() from CAS blob to view path
  6. Set observation.status = 'assigned'

The whole loop is idempotent — re-running mama-apply with no pending observations does nothing.
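Steps 3 and 5 can be sketched as two small helpers. This is not mama's implementation: the function names are hypothetical, and falling back to a copy only on `EXDEV` (the error `link()` returns across filesystems) is an assumption about how the fallback is triggered.

```python
import errno
import os
import shutil
from pathlib import Path


def materialize_blob(source: Path, target: Path) -> str:
    """Place source content at the CAS target; return 'exists', 'linked' or 'copied'."""
    if target.exists():
        return "exists"
    target.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(source, target)  # instant, no extra space on the same dataset
        result = "linked"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        shutil.copy2(source, target)  # different filesystem: real copy
        result = "copied"
    # After a hardlink, source and blob share one inode, so this also
    # makes the source file read-only.
    os.chmod(target, 0o444)
    return result


def ensure_view(blob: Path, view: Path) -> bool:
    """Hardlink the view to the blob if missing; return True if it was created."""
    if view.exists():
        return False
    view.parent.mkdir(parents=True, exist_ok=True)
    os.link(blob, view)
    return True
```

Both helpers return early when their target already exists, which is what makes repeated runs harmless.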

Rescan safety

mama-scan can be re-run on the same path any number of times:

  • unchanged files (same path, size, and mtime) → only scan_time is updated; no new observation, no hashing
  • modified files → re-hashed, new observation row added (old one stays for history)
  • new files → full processing
  • removed files → observation stays in DB (planned: mark as gone)

This makes mama-scan cheap to schedule on a timer for the Syncthing folders.
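The four rescan outcomes above can be expressed as a comparison between what the DB already knows and what the filesystem currently shows. The sketch below is a simplification: it keys on a single path string and keeps only the latest (size, mtime) per path, whereas mama keeps old observation rows for history. All names are hypothetical.

```python
def classify(known: dict, current: dict) -> dict:
    """Compare path -> (size, mtime) maps and sort paths into four outcomes."""
    result = {"unchanged": [], "modified": [], "new": [], "removed": []}
    for path, signature in current.items():
        if path not in known:
            result["new"].append(path)
        elif known[path] == signature:
            result["unchanged"].append(path)
        else:
            result["modified"].append(path)
    # Paths recorded earlier that no longer exist on disk.
    result["removed"] = [path for path in known if path not in current]
    return result


known = {"a.jpg": (100, 1), "b.jpg": (200, 1), "c.jpg": (300, 1)}
current = {"a.jpg": (100, 1), "b.jpg": (250, 2), "d.jpg": (400, 3)}
outcome = classify(known, current)
```

Only the modified and new groups need hashing, which is why a scheduled rescan of a mostly static folder stays cheap.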

Components

  • mama-scan — index files into DB (above)
  • mama-apply — materialize archive (above)
  • mama-dev — developer utilities (reset, stats)
  • mama-web — planned: browse, merge duplicates, filter, export, set status

Tech Stack

  • Python 3.13, FastAPI, SQLAlchemy 2.x (async), Alembic
  • PostgreSQL 17 (JSONB for embedded metadata)
  • Vue 3, Vite
  • ZFS (single archive dataset, snapshots, NFS export), Caddy
  • ExifTool, BLAKE3, libmagic, ffmpeg, Pillow
  • Docker Compose for companion viewers (Immich, Navidrome, Paperless-ngx)

Disclaimer

mama is provided as-is for personal use. The author assumes no responsibility for data loss, corruption, mis-deduplication, accidental deletion, or any other adverse outcome arising from its use. Use at your own risk and only on data you can afford to lose.

License

MIT