mama
Media Archive Meets Automation — a self-hosted system for ingesting, deduplicating, and organizing personal media (photos, videos, music, documents) on top of ZFS.
⚠️ Project Status
Pre-alpha. Not ready for use by anyone but the author.
- Database schema, CLI surface, on-disk layout, and HTTP API are unstable and will change without migration paths.
- Most features described below are partly implemented, partly planned.
- Documentation lags behind code.
Do not point mama at irreplaceable data. Keep independent backups.
Concept
mama treats every file as two separate things:
- A blob — pure content, identified by its BLAKE3 hash. Stored once in a content-addressed store, regardless of how many places it appears.
- An observation — a sighting of that content at a specific filesystem path on a specific host at a specific time, with its own filesystem metadata and embedded metadata (EXIF, ID3, sidecar, ...).
This split is what enables real deduplication without losing context. Identical content from a phone, a backup DVD, and an old laptop becomes three observations referencing one blob.
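
A minimal sketch of the split, using plain dataclasses. The class names, the `observe` helper, and the reduced field set are illustrative assumptions, not mama's actual models. The digest is computed with `hashlib.blake2b` from the standard library purely as a stand-in; mama itself uses BLAKE3.

```python
from dataclasses import dataclass
import hashlib


@dataclass(frozen=True)
class Blob:
    hash: str  # content digest (stand-in: BLAKE2b; mama uses BLAKE3)
    size: int  # content size in bytes


@dataclass(frozen=True)
class Observation:
    blob_hash: str  # references Blob.hash
    hostname: str
    path: str


def observe(store: dict[str, Blob], content: bytes, hostname: str, path: str) -> Observation:
    """Record one sighting; create the blob only if the content is new."""
    digest = hashlib.blake2b(content, digest_size=32).hexdigest()
    store.setdefault(digest, Blob(hash=digest, size=len(content)))
    return Observation(blob_hash=digest, hostname=hostname, path=path)


blobs: dict[str, Blob] = {}
photo = b"same bytes everywhere"
# Hypothetical hosts and paths, for illustration only.
sightings = [
    observe(blobs, photo, "phone", "/DCIM/IMG_0001.jpg"),
    observe(blobs, photo, "dvd", "/backup/2014/IMG_0001.jpg"),
    observe(blobs, photo, "laptop", "/home/me/Pictures/holiday.jpg"),
]
print(len(sightings), "observations,", len(blobs), "blob")
```

Each sighting keeps its own host and path, while the content is stored once.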
Workflow
mama operates in two phases per source folder:
1. mama-scan: walk filesystem, hash files, record observations in DB. No copies, no moves. Safe to re-run.
2. mama-apply: materialize observations into the archive (CAS blobs + hardlinked views). Idempotent.
             ┌─────────────────┐
filesystem ─▶│    mama-scan    │─▶ observations + blobs in DB
             └─────────────────┘
                      │
                      ▼
             ┌─────────────────┐
             │   mama-apply    │─▶ CAS + views
             └─────────────────┘
Storage Layout
<archive_root>/
├── blobs/      Content-addressed storage: blobs/ab/cd/<full-hash>
│                 - mode 444, identical hashes share one inode
├── views/      Hardlink trees scoped by source_kind:
│                 views/<source_kind>/<basedir>/<relpath>/<filename>
│                 - same inode as the corresponding blob (zero extra storage
│                   within the same ZFS dataset)
└── previews/   (planned: derived thumbnails / low-res for browsing)
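
The fan-out in `blobs/ab/cd/<full-hash>` can be sketched as below. `cas_path` is a hypothetical helper name, and the reading that the two directory levels are the first four hex characters of the digest is taken from the layout above, not from mama's code.

```python
from pathlib import PurePosixPath


def cas_path(archive_root: str, hex_digest: str) -> PurePosixPath:
    """Map a hex digest to <archive_root>/blobs/<2>/<2>/<full-hash>."""
    if len(hex_digest) != 64 or any(c not in "0123456789abcdef" for c in hex_digest):
        raise ValueError("expected a 64-character lowercase hex digest")
    return PurePosixPath(archive_root, "blobs", hex_digest[:2], hex_digest[2:4], hex_digest)


# "/tank/archive" is a made-up archive root; the digest is a placeholder.
digest = "abcd" + "0" * 60
print(cas_path("/tank/archive", digest))
```

Two levels of two hex characters give at most 65,536 leaf directories, which keeps any single directory from growing too large.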
Database
blobs — one row per unique content (BLAKE3 hash):
| Column | Type | Purpose |
|---|---|---|
| hash | str(64) PK | BLAKE3 hex digest |
| size | bigint | content size in bytes |
| storage_path | text | location in CAS (set by mama-apply) |
| first_seen | timestamptz | when first scanned |
| mime | str(128)? | detected via libmagic |
| block_reason | str(32)? | NULL = active; planned: deleted/blocked |
observations — one row per file sighting:
| Column | Type | Purpose |
|---|---|---|
| id | int PK | |
| blob_hash | str(64) FK | links to blobs.hash |
| hostname | str(255) | machine where the file was seen |
| basedir | text | scan root path |
| relpath | text | directory below scan root |
| filename | text | |
| size | bigint | size as seen (also in blobs, denormalized) |
| mtime | timestamptz | file's modification time |
| ctime | timestamptz | file's change time |
| scan_time | timestamptz | last time this path was confirmed |
| source_kind | str(32) | syncthing / incoming / existing / import |
| status | str(32) | pending / assigned / ignored |
| meta | jsonb? | ExifTool / ID3 / sidecar metadata |
Indexes:
- ix_observations_blob_hash: for joins
- ix_observations_path_mtime: for rescan idempotency (hostname, basedir, relpath, filename, mtime, size)
Observation Lifecycle
stateDiagram-v2
[*] --> pending: mama-scan (new file)
pending --> assigned: mama-apply
pending --> ignored: curation (planned)
assigned --> ignored: curation (planned)
ignored --> assigned: curation (planned)
status represents the target state (Soll-Zustand):
- pending: newly scanned, target not yet decided
  - current: mama-apply auto-promotes to assigned
  - planned: stays pending until reviewed via web UI or rules
- assigned: should be in the archive; mama-apply ensures the view exists
- ignored: should not be in the archive; mama-apply ensures no view (planned)
mama-apply's job is to reconcile the filesystem with the target state.
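
The lifecycle can be written down as a small transition table. This is a sketch transcribed from the state diagram above; the `Status` enum and `transition` function are hypothetical names, and only the pending to assigned edge is described as implemented today.

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    ASSIGNED = "assigned"
    IGNORED = "ignored"


# Allowed transitions, transcribed from the state diagram.
# Edges other than pending -> assigned are marked "planned" in the README.
TRANSITIONS: dict[Status, set[Status]] = {
    Status.PENDING: {Status.ASSIGNED, Status.IGNORED},
    Status.ASSIGNED: {Status.IGNORED},
    Status.IGNORED: {Status.ASSIGNED},
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the diagram has no such edge."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


status = transition(Status.PENDING, Status.ASSIGNED)
print(status.value)
```

Nothing ever returns to pending, so a reviewed observation cannot silently fall back into the unreviewed queue.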
mama-scan in detail
For each file under the scan root:
1. Cheap path check (no content I/O)
Reads:
- stat() → size, mtime, ctime
- DB query for an observation matching (hostname, basedir, relpath, filename, mtime, size)
If a match is found:
- update scan_time on that observation
- increment unchanged counter
- skip everything else (no hashing, no metadata extraction)
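
A rough sketch of the cheap path check. It uses `sqlite3` from the standard library as a stand-in for PostgreSQL and stores `mtime` as integer nanoseconds instead of `timestamptz`. The `cheap_check` function and the reduced table are assumptions made for illustration.

```python
import os
import sqlite3
import tempfile
import time


def cheap_check(db: sqlite3.Connection, hostname: str, basedir: str,
                relpath: str, filename: str) -> bool:
    """Return True if the file is unchanged; only scan_time is touched then."""
    st = os.stat(os.path.join(basedir, relpath, filename))  # no content I/O
    row = db.execute(
        "SELECT id FROM observations WHERE hostname=? AND basedir=? AND relpath=?"
        " AND filename=? AND mtime=? AND size=?",
        (hostname, basedir, relpath, filename, st.st_mtime_ns, st.st_size),
    ).fetchone()
    if row is None:
        return False  # new or modified: caller continues with full processing
    db.execute("UPDATE observations SET scan_time=? WHERE id=?", (time.time_ns(), row[0]))
    return True


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE observations (id INTEGER PRIMARY KEY, hostname TEXT, basedir TEXT,"
    " relpath TEXT, filename TEXT, mtime INTEGER, size INTEGER, scan_time INTEGER)"
)
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "2024"))
    target = os.path.join(root, "2024", "a.jpg")
    with open(target, "wb") as fh:
        fh.write(b"abc")
    st = os.stat(target)
    db.execute(
        "INSERT INTO observations (hostname, basedir, relpath, filename, mtime, size)"
        " VALUES (?, ?, ?, ?, ?, ?)",
        ("nas", root, "2024", "a.jpg", st.st_mtime_ns, st.st_size),
    )
    first = cheap_check(db, "nas", root, "2024", "a.jpg")   # matches the stored row
    with open(target, "ab") as fh:
        fh.write(b"more")                                   # size changes
    second = cheap_check(db, "nas", root, "2024", "a.jpg")  # no longer matches
print(first, second)
```

Including `size` in the key means a file rewritten within the same mtime tick is still detected as long as its length changed.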
2. Full processing (new or modified file)
Reads:
- BLAKE3 over content → hash
- libmagic → mime
- ExifTool → meta JSON
Writes:
- new Blob row if hash not seen before (sets: hash, size, mime, first_seen; leaves storage_path empty for mama-apply to fill)
- new Observation row (sets all fields, status='pending')
Counters reported: files | new obs | unchanged | new blobs | duplicates | with metadata | errors.
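
The hashing and blob bookkeeping of the full-processing step might look roughly like this. It is a sketch under assumptions: `hash_file` and `record` are hypothetical helpers, the blob table is a plain dict, the libmagic and ExifTool steps are left out, and `hashlib.blake2b` stands in for BLAKE3 so the example runs with the standard library alone.

```python
import hashlib
import os
import tempfile


def hash_file(path: str, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Stream a file through the hasher; return (hex digest, bytes read)."""
    # Stand-in hasher: BLAKE2b with a 32-byte digest (64 hex characters).
    # mama uses BLAKE3, which is not part of the standard library.
    hasher = hashlib.blake2b(digest_size=32)
    total = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            hasher.update(chunk)
            total += len(chunk)
    return hasher.hexdigest(), total


def record(blobs: dict[str, dict], digest: str, size: int) -> bool:
    """Insert a blob row if the hash is new; return True when it was new."""
    if digest in blobs:
        return False  # duplicate content: only an observation would be added
    blobs[digest] = {"size": size, "storage_path": None}  # filled by mama-apply
    return True


blobs: dict[str, dict] = {}
counters = {"new blobs": 0, "duplicates": 0}
with tempfile.TemporaryDirectory() as root:
    for name in ("a.bin", "copy-of-a.bin"):
        path = os.path.join(root, name)
        with open(path, "wb") as fh:
            fh.write(b"x" * 2_500_000)  # larger than one chunk
        digest, size = hash_file(path)
        counters["new blobs" if record(blobs, digest, size) else "duplicates"] += 1
print(counters)
```

Reading in fixed-size chunks keeps memory use flat regardless of file size, which matters for large videos.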
mama-apply in detail
Processes observations in cursor-paginated batches, ordered by id.
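
One common way to paginate by cursor over an `id`-ordered table is keyset pagination, sketched here. This assumes "cursor-paginated" means resuming from the last seen `id`; the `iter_batches` helper, the batch size, and the `sqlite3` stand-in for PostgreSQL are illustrative choices.

```python
import sqlite3
from collections.abc import Iterator


def iter_batches(db: sqlite3.Connection, batch_size: int = 500) -> Iterator[list[tuple]]:
    """Yield batches of (id, blob_hash) in id order using a keyset cursor."""
    last_id = 0
    while True:
        rows = db.execute(
            "SELECT id, blob_hash FROM observations WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # resume after the last id seen; no OFFSET needed


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE observations (id INTEGER PRIMARY KEY, blob_hash TEXT)")
db.executemany(
    "INSERT INTO observations (blob_hash) VALUES (?)", [(f"h{i}",) for i in range(7)]
)
sizes = [len(batch) for batch in iter_batches(db, batch_size=3)]
print(sizes)
```

Unlike `OFFSET`, this stays cheap on large tables and does not skip or repeat rows when earlier rows change between batches.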
For each observation:
- If blob.block_reason IS NOT NULL → skip, count as blocked
- Compute CAS target path: <archive_root>/blobs/<2>/<2>/<full-hash>
- If CAS target doesn't exist:
  - resolve source path: basedir/relpath/filename
  - if source is missing → skip, count as missing
  - try os.link() (instant, same dataset)
  - fall back to shutil.copy2() (cross-dataset; hard links cannot span filesystems, so the copy costs space)
  - chmod 444 on the blob
  - set blob.storage_path to the CAS-relative path
- Compute view path: <archive_root>/views/<source_kind>/<basedir>/<relpath>/<filename>
- If view doesn't exist → os.link() from CAS blob to view path
- Set observation.status = 'assigned'
The whole loop is idempotent — re-running mama-apply with no pending
observations does nothing.
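
The link-or-copy step and the idempotent view creation can be sketched as follows. `link_or_copy` and `materialize` are hypothetical names, the paths are simplified, and the database updates are omitted. Treating only `EXDEV` as the trigger for the copy fallback is an assumption about how the fallback is meant to work.

```python
import errno
import os
import shutil
import tempfile


def link_or_copy(src: str, dst: str) -> str:
    """Place src at dst via hardlink; copy only if linking across filesystems fails."""
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    try:
        os.link(src, dst)
        return "linked"
    except OSError as exc:
        if exc.errno != errno.EXDEV:  # only a cross-device error triggers the copy
            raise
        shutil.copy2(src, dst)
        return "copied"


def materialize(src: str, cas: str, view: str) -> None:
    """Idempotently ensure the CAS blob and its view exist."""
    if not os.path.exists(cas):
        link_or_copy(src, cas)
        # With a hardlink the mode is shared, so the source becomes read-only too.
        os.chmod(cas, 0o444)
    if not os.path.exists(view):
        os.makedirs(os.path.dirname(view), exist_ok=True)
        os.link(cas, view)


with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "incoming", "a.jpg")
    os.makedirs(os.path.dirname(src))
    with open(src, "wb") as fh:
        fh.write(b"jpeg bytes")
    placeholder_hash = "abcd" + "0" * 60  # not a real digest
    cas = os.path.join(root, "archive", "blobs", "ab", "cd", placeholder_hash)
    view = os.path.join(root, "archive", "views", "incoming", "a.jpg")
    materialize(src, cas, view)
    materialize(src, cas, view)  # second run finds both paths and does nothing
    same_inode = os.stat(cas).st_ino == os.stat(view).st_ino
    mode = oct(os.stat(cas).st_mode & 0o777)
print(same_inode, mode)
```

Because every step checks for existence first, an interrupted run can simply be restarted.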
Rescan safety
mama-scan can be re-run on the same path any number of times:
- unchanged files (matching (path, size, mtime)) → only scan_time updated, no new observation, no hashing
- new files → full processing
- removed files → observation stays in DB (planned: mark as gone)
This makes mama-scan cheap to schedule on a timer for the Syncthing folders.
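
The four rescan outcomes can be illustrated with two in-memory maps from path to `(size, mtime)`. This is a simplification: `classify` is a hypothetical function, and in mama the "known" side would come from the observations table, not from a dict.

```python
def classify(known: dict[str, tuple[int, int]],
             seen: dict[str, tuple[int, int]]) -> dict[str, list[str]]:
    """Compare path -> (size, mtime) maps from the DB and from the current walk."""
    result: dict[str, list[str]] = {"unchanged": [], "modified": [], "new": [], "removed": []}
    for path, fingerprint in sorted(seen.items()):
        if path not in known:
            result["new"].append(path)        # full processing
        elif known[path] == fingerprint:
            result["unchanged"].append(path)  # only scan_time is updated
        else:
            result["modified"].append(path)   # re-hash, add a new observation
    result["removed"] = sorted(set(known) - set(seen))  # rows stay in the DB
    return result


# Made-up paths and fingerprints for illustration.
known = {"a.jpg": (100, 1), "b.jpg": (200, 2), "c.jpg": (300, 3)}
seen = {"a.jpg": (100, 1), "b.jpg": (250, 9), "d.jpg": (400, 4)}
print(classify(known, seen))
```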
Components
- mama-scan: index files into DB (above)
- mama-apply: materialize archive (above)
- mama-dev: developer utilities (reset, stats)
- mama-web: planned: browse, merge duplicates, filter, export, set status
Tech Stack
- Python 3.13, FastAPI, SQLAlchemy 2.x (async), Alembic
- PostgreSQL 17 (JSONB for embedded metadata)
- Vue 3, Vite
- ZFS (single archive dataset, snapshots, NFS export), Caddy
- ExifTool, BLAKE3, libmagic, ffmpeg, Pillow
- Docker Compose for companion viewers (Immich, Navidrome, Paperless-ngx)
Disclaimer
mama is provided as-is for personal use. The author assumes no responsibility for data loss, corruption, mis-deduplication, accidental deletion, or any other adverse outcome arising from its use. Use at your own risk and only on data you can afford to lose.