# mama

**M**edia **A**rchive **M**eets **A**utomation — a self-hosted system for ingesting, deduplicating, and organizing personal media (photos, videos, music, documents) on top of ZFS.

## ⚠️ Project Status

**Pre-alpha. Not ready for use by anyone but the author.**

- Database schema, CLI surface, on-disk layout, and HTTP API are unstable and will change without migration paths.
- Most features described below are partly implemented, partly planned.
- Documentation lags behind code.

Do not point mama at irreplaceable data. Keep independent backups.

## Concept

mama treats every file as two separate things:

- **A blob** — pure content, identified by its BLAKE3 hash. Stored once in a content-addressed store, regardless of how many places it appears.
- **An observation** — a sighting of that content at a specific filesystem path on a specific host at a specific time, with its own filesystem metadata and embedded metadata (EXIF, ID3, sidecar, ...).

This split is what enables real deduplication without losing context. Identical content from a phone, a backup DVD, and an old laptop become three observations referencing one blob.

## Workflow

mama operates in two phases per source folder:

1. **`mama-scan`** — walk filesystem, hash files, record observations in DB. No copies, no moves. Safe to re-run.
2. **`mama-apply`** — materialize observations into the archive (CAS blobs + hardlinked views). Idempotent.

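
The blob/observation split and the scan phase described above can be sketched in a few lines. This is only an illustration, not mama's actual code: the `Blob` and `Observation` dataclasses, the in-memory `blobs` dict, and the `scan()` and `demo()` helpers are hypothetical stand-ins for the database layer, and `hashlib.blake2b` is used in place of BLAKE3 because BLAKE3 is not part of the Python standard library.

```python
import hashlib
import os
import socket
import tempfile
from dataclasses import dataclass
from pathlib import Path


def content_hash(path: Path) -> str:
    """Hash file content in 1 MiB chunks.

    Assumption: blake2b (32-byte digest, 64 hex chars) stands in for BLAKE3.
    """
    h = hashlib.blake2b(digest_size=32)
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


@dataclass
class Blob:
    """Pure content: one instance per unique hash (hypothetical model)."""
    hash: str
    size: int


@dataclass
class Observation:
    """One sighting of a blob at a path on a host (hypothetical model)."""
    blob_hash: str
    hostname: str
    basedir: str
    relpath: str
    filename: str
    size: int
    mtime: float
    status: str = "pending"


def scan(basedir: Path, blobs: dict[str, Blob], observations: list[Observation]) -> None:
    """Record one observation per file and one blob per unique content.

    Nothing is copied or moved; only the in-memory records change.
    """
    hostname = socket.gethostname()
    for root, _dirs, files in os.walk(basedir):
        for name in sorted(files):
            path = Path(root) / name
            st = path.stat()
            digest = content_hash(path)
            # A blob is created only the first time this content is seen.
            blobs.setdefault(digest, Blob(hash=digest, size=st.st_size))
            observations.append(
                Observation(
                    blob_hash=digest,
                    hostname=hostname,
                    basedir=str(basedir),
                    relpath=os.path.relpath(root, basedir),
                    filename=name,
                    size=st.st_size,
                    mtime=st.st_mtime,
                )
            )


def demo() -> tuple[int, int]:
    """Scan three files, two of which have identical content."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        (base / "phone").mkdir()
        (base / "laptop").mkdir()
        (base / "phone" / "a.jpg").write_bytes(b"same bytes")
        (base / "laptop" / "copy.jpg").write_bytes(b"same bytes")
        (base / "laptop" / "other.jpg").write_bytes(b"different")
        blobs: dict[str, Blob] = {}
        observations: list[Observation] = []
        scan(base, blobs, observations)
        return len(blobs), len(observations)


if __name__ == "__main__":
    print(demo())  # (2, 3): two unique blobs, three observations
```

The two identical files produce two observations that point at the same blob hash, so the context of each sighting (host, path, mtime) is kept while the content is counted once.
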
```
             ┌─────────────────┐
filesystem ─▶│   mama-scan     │─▶ observations + blobs in DB
             └─────────────────┘
                      │
                      ▼
             ┌─────────────────┐
             │   mama-apply    │─▶ CAS + views
             └─────────────────┘
```

### Storage Layout

```
/
├── blobs/       Content-addressed storage: blobs/ab/cd/
│                - mode 444, identical hashes share one inode
├── views/       Hardlink trees scoped by source_kind:
│                views////
│                - same inode as the corresponding blob (zero extra storage
│                  within the same ZFS dataset)
└── previews/    (planned: derived thumbnails / low-res for browsing)
```

### Database

**`blobs`** — one row per unique content (BLAKE3 hash):

| Column       | Type        | Purpose                                 |
|--------------|-------------|-----------------------------------------|
| hash         | str(64) PK  | BLAKE3 hex digest                       |
| size         | bigint      | content size in bytes                   |
| storage_path | text        | location in CAS (set by `mama-apply`)   |
| first_seen   | timestamptz | when first scanned                      |
| mime         | str(128)?   | detected via libmagic                   |
| block_reason | str(32)?    | NULL = active; planned: deleted/blocked |

**`observations`** — one row per file sighting:

| Column      | Type        | Purpose                                    |
|-------------|-------------|--------------------------------------------|
| id          | int PK      |                                            |
| blob_hash   | str(64) FK  | links to `blobs.hash`                      |
| hostname    | str(255)    | machine where the file was seen            |
| basedir     | text        | scan root path                             |
| relpath     | text        | directory below scan root                  |
| filename    | text        |                                            |
| size        | bigint      | size as seen (also in blobs, denormalized) |
| mtime       | timestamptz | file's modification time                   |
| ctime       | timestamptz | file's change time                         |
| scan_time   | timestamptz | last time this path was confirmed          |
| source_kind | str(32)     | syncthing / incoming / existing / import   |
| status      | str(32)     | pending / assigned / ignored               |
| meta        | jsonb?      | ExifTool / ID3 / sidecar metadata          |

Indexes:

- `ix_observations_blob_hash` — for joins
- `ix_observations_path_mtime` — for rescan idempotency (hostname, basedir, relpath, filename, mtime, size)

### Observation Lifecycle

```mermaid
stateDiagram-v2
    [*] --> pending: mama-scan (new file)
    pending --> assigned: mama-apply
    pending --> ignored: curation (planned)
    assigned --> ignored: curation (planned)
    ignored --> assigned: curation (planned)
```

`status` represents the **target state** (Soll-Zustand):

- `pending` — newly scanned, target not yet decided
  - current: `mama-apply` auto-promotes to `assigned`
  - planned: stays `pending` until reviewed via web UI or rules
- `assigned` — should be in the archive; `mama-apply` ensures the view exists
- `ignored` — should not be in the archive; `mama-apply` ensures no view (planned)

`mama-apply`'s job is to reconcile the filesystem with the target state.

### mama-scan in detail

For each file under the scan root:

**1. Cheap path check (no content I/O)**

Reads:

- `stat()` → `size`, `mtime`, `ctime`
- DB query for an observation matching `(hostname, basedir, relpath, filename, mtime, size)`

If a match is found:

- update `scan_time` on that observation
- increment `unchanged` counter
- **skip everything else** (no hashing, no metadata extraction)

**2. Full processing (new or modified file)**

Reads:

- BLAKE3 over content → `hash`
- libmagic → `mime`
- ExifTool → `meta` JSON

Writes:

- new `Blob` row if `hash` not seen before (sets: `hash`, `size`, `mime`, `first_seen`; leaves `storage_path` empty for `mama-apply` to fill)
- new `Observation` row (sets all fields, `status='pending'`)

Counters reported: `files | new obs | unchanged | new blobs | duplicates | with metadata | errors`.

### mama-apply in detail

Processes observations in cursor-paginated batches, ordered by `id`. For each observation:

1. If `blob.block_reason IS NOT NULL` → skip, count as `blocked`
2. Compute CAS target path: `/blobs/<2>/<2>/`
3. If CAS target doesn't exist:
   - resolve source path: `basedir/relpath/filename`
   - if source is missing → skip, count as `missing`
   - try `os.link()` (instant, same dataset)
   - fall back to `shutil.copy2()` (cross-dataset; POSIX limit, costs space)
   - `chmod 444` on the blob
   - set `blob.storage_path` to the CAS-relative path
4. Compute view path: `/views////`
5. If view doesn't exist → `os.link()` from CAS blob to view path
6. Set `observation.status = 'assigned'`

The whole loop is idempotent — re-running `mama-apply` with no pending observations does nothing.

### Rescan safety

`mama-scan` can be re-run on the same path any number of times:

- unchanged files (matching `(path, size, mtime)`) → only `scan_time` updated, no new observation, no hashing
- modified files → re-hashed, new observation row added (old one stays for history)
- new files → full processing
- removed files → observation stays in DB (planned: mark as gone)

This makes `mama-scan` cheap to schedule on a timer for the Syncthing folders.

## Components

- **`mama-scan`** — index files into DB (above)
- **`mama-apply`** — materialize archive (above)
- **`mama-dev`** — developer utilities (`reset`, `stats`)
- **`mama-web`** — planned: browse, merge duplicates, filter, export, set status

## Tech Stack

- Python 3.13, FastAPI, SQLAlchemy 2.x (async), Alembic
- PostgreSQL 17 (JSONB for embedded metadata)
- Vue 3, Vite
- ZFS (single archive dataset, snapshots, NFS export), Caddy
- ExifTool, BLAKE3, libmagic, ffmpeg, Pillow
- Docker Compose for companion viewers (Immich, Navidrome, Paperless-ngx)

## Disclaimer

mama is provided as-is for personal use. The author assumes no responsibility for data loss, corruption, mis-deduplication, accidental deletion, or any other adverse outcome arising from its use. Use at your own risk and only on data you can afford to lose.

## License

[MIT](LICENSE)