warnason 23566b0885 ExifTool metadata extraction + size in observations + workflow doc

- metadata.py: persistent ExifTool session (avoids Perl startup per file),
  filters out File:/ExifTool: noise
- scanner now populates observations.meta as JSONB
- size duplicated into observations for self-contained queries and to
  strengthen the rescan idempotency check (path + mtime + size)
- README rewritten with state diagram, schema tables, scan/apply workflow

2026-05-26 09:00:35 +02:00

8.2 KiB


mama

Media Archive Meets Automation — a self-hosted system for ingesting, deduplicating, and organizing personal media (photos, videos, music, documents) on top of ZFS.

⚠️ Project Status

Pre-alpha. Not ready for use by anyone but the author.

Database schema, CLI surface, on-disk layout, and HTTP API are unstable and will change without migration paths.
Most features described below are only partly implemented; the remainder is planned.
Documentation lags behind code.

Do not point mama at irreplaceable data. Keep independent backups.

Concept

mama treats every file as two separate things:

A blob — pure content, identified by its BLAKE3 hash. Stored once in a content-addressed store, regardless of how many places it appears.
An observation — a sighting of that content at a specific filesystem path on a specific host at a specific time, with its own filesystem metadata and embedded metadata (EXIF, ID3, sidecar, ...).

This split is what enables real deduplication without losing context. Identical content found on a phone, a backup DVD, and an old laptop becomes three observations referencing one blob.
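A minimal sketch of the blob/observation split as plain Python dataclasses. This is not the project's actual model code: the class layout, the `group_by_blob` helper, and the example hosts and paths are hypothetical, and the digest is a placeholder rather than a real BLAKE3 hash.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Blob:
    """Pure content; `hash` stands for the BLAKE3 hex digest."""
    hash: str
    size: int


@dataclass(frozen=True)
class Observation:
    """One sighting of a blob at a path on a host at a point in time."""
    blob_hash: str
    hostname: str
    basedir: str
    relpath: str
    filename: str
    scan_time: datetime


def group_by_blob(observations):
    """Map each blob hash to the list of observations referencing it."""
    grouped = {}
    for obs in observations:
        grouped.setdefault(obs.blob_hash, []).append(obs)
    return grouped


# Hypothetical example data: the same photo seen in three places.
now = datetime.now(timezone.utc)
digest = "ab" * 32  # placeholder for a 64-character hex digest
blob = Blob(hash=digest, size=2_481_133)
sightings = [
    Observation(digest, "phone", "/sdcard/DCIM", "Camera", "IMG_0001.jpg", now),
    Observation(digest, "nas", "/mnt/dvd-2014", "photos", "IMG_0001.jpg", now),
    Observation(digest, "old-laptop", "/home/user", "Pictures", "holiday.jpg", now),
]
grouped = group_by_blob(sightings)
```

Here `grouped` holds a single key with three observations: the content is stored once, while every sighting keeps its own host, path, and name.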

Workflow

mama operates in two phases per source folder:

mama-scan — walk filesystem, hash files, record observations in DB. No copies, no moves. Safe to re-run.
mama-apply — materialize observations into the archive (CAS blobs + hardlinked views). Idempotent.

                ┌─────────────────┐
   filesystem ─▶│   mama-scan     │─▶ observations + blobs in DB
                └─────────────────┘
                                            │
                                            ▼
                                  ┌─────────────────┐
                                  │   mama-apply    │─▶ CAS + views
                                  └─────────────────┘

Storage Layout

<archive_root>/
├── blobs/        Content-addressed storage: blobs/ab/cd/<full-hash>
│                 - mode 444, identical hashes share one inode
├── views/        Hardlink trees scoped by source_kind:
│                 views/<source_kind>/<basedir>/<relpath>/<filename>
│                 - same inode as the corresponding blob (zero extra storage
│                   within the same ZFS dataset)
└── previews/     (planned: derived thumbnails / low-res for browsing)
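The layout above can be expressed as two small path helpers. This is a sketch under assumptions: it reads `ab/cd` as the first two pairs of hex characters of the digest, and it strips leading slashes from `basedir` so that an absolute scan root nests under `views/`. Neither detail is confirmed by the text, and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def cas_relpath(hex_digest: str) -> PurePosixPath:
    """Archive-relative blob path: blobs/ab/cd/<full-hash>.

    Assumption: the two fan-out directories are hex characters 0-1 and 2-3.
    """
    if len(hex_digest) < 4:
        raise ValueError("digest too short for two fan-out levels")
    return PurePosixPath("blobs", hex_digest[:2], hex_digest[2:4], hex_digest)


def view_relpath(source_kind: str, basedir: str, relpath: str,
                 filename: str) -> PurePosixPath:
    """Archive-relative view path: views/<source_kind>/<basedir>/<relpath>/<filename>.

    Assumption: leading slashes are stripped so absolute paths do not
    replace the views/ prefix when the segments are joined.
    """
    return PurePosixPath(
        "views", source_kind, basedir.lstrip("/"), relpath.lstrip("/"), filename
    )
```

For example, `view_relpath("syncthing", "/srv/sync/phone", "DCIM/Camera", "a.jpg")` yields `views/syncthing/srv/sync/phone/DCIM/Camera/a.jpg`.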

Database

blobs — one row per unique content (BLAKE3 hash):

Column	Type	Purpose
hash	str(64) PK	BLAKE3 hex digest
size	bigint	content size in bytes
storage_path	text	location in CAS (set by `mama-apply`)
first_seen	timestamptz	when first scanned
mime	str(128)?	detected via libmagic
block_reason	str(32)?	NULL = active; planned: deleted/blocked

observations — one row per file sighting:

Column	Type	Purpose
id	int PK
blob_hash	str(64) FK	links to `blobs.hash`
hostname	str(255)	machine where the file was seen
basedir	text	scan root path
relpath	text	directory below scan root
filename	text
size	bigint	size as seen (also in blobs, denormalized)
mtime	timestamptz	file's modification time
ctime	timestamptz	file's change time
scan_time	timestamptz	last time this path was confirmed
source_kind	str(32)	syncthing / incoming / existing / import
status	str(32)	pending / assigned / ignored
meta	jsonb?	ExifTool / ID3 / sidecar metadata

Indexes:

ix_observations_blob_hash — for joins
ix_observations_path_mtime — for rescan idempotency (hostname, basedir, relpath, filename, mtime, size)

Observation Lifecycle

stateDiagram-v2
    [*] --> pending: mama-scan (new file)
    pending --> assigned: mama-apply
    pending --> ignored: curation (planned)
    assigned --> ignored: curation (planned)
    ignored --> assigned: curation (planned)

status represents the target state (German: Soll-Zustand), which is not necessarily what is on disk yet:

pending — newly scanned, target not yet decided
- current: mama-apply auto-promotes to assigned
- planned: stays pending until reviewed via web UI or rules
assigned — should be in the archive; mama-apply ensures the view exists
ignored — should not be in the archive; mama-apply ensures no view (planned)

mama-apply's job is to reconcile the filesystem with the target state.
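The lifecycle above can be encoded as a transition table. The sketch below only mirrors the arrows of the state diagram; the names `ALLOWED_TRANSITIONS` and `transition` are hypothetical and not taken from the codebase.

```python
# Allowed status changes, copied from the state diagram.
ALLOWED_TRANSITIONS = {
    "pending": {"assigned", "ignored"},
    "assigned": {"ignored"},
    "ignored": {"assigned"},
}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise if the diagram has no such arrow."""
    if target not in ALLOWED_TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal status transition: {current} -> {target}")
    return target
```

Note that nothing leads back to `pending`: once an observation has a decided target state, it only moves between `assigned` and `ignored`.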

mama-scan in detail

For each file under the scan root:

1. Cheap path check (no content I/O)

Reads:

stat() → size, mtime, ctime
DB query for an observation matching (hostname, basedir, relpath, filename, mtime, size)

If a match is found:

update scan_time on that observation
increment unchanged counter
skip everything else (no hashing, no metadata extraction)
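The cheap path check can be sketched as follows. This is an illustration only: it uses an in-memory SQLite database as a stand-in for PostgreSQL, stores mtime as integer nanoseconds instead of `timestamptz`, and the helper names (`open_db`, `record_observation`, `cheap_check`) are hypothetical.

```python
import os
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE observations (
    id        INTEGER PRIMARY KEY,
    hostname  TEXT, basedir TEXT, relpath TEXT, filename TEXT,
    mtime_ns  INTEGER,   -- stand-in for the timestamptz mtime column
    size      INTEGER,
    scan_time TEXT
);
CREATE INDEX ix_observations_path_mtime
    ON observations (hostname, basedir, relpath, filename, mtime_ns, size);
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def _key(hostname, basedir, relpath, filename):
    """Build the lookup tuple from a single stat() call (no content I/O)."""
    st = os.stat(os.path.join(basedir, relpath, filename))
    return (hostname, basedir, relpath, filename, st.st_mtime_ns, st.st_size)


def record_observation(conn, hostname, basedir, relpath, filename) -> None:
    """Insert a row for the file as it currently looks on disk."""
    now = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO observations (hostname, basedir, relpath, filename,"
        " mtime_ns, size, scan_time) VALUES (?, ?, ?, ?, ?, ?, ?)",
        _key(hostname, basedir, relpath, filename) + (now,),
    )


def cheap_check(conn, hostname, basedir, relpath, filename) -> bool:
    """Return True and refresh scan_time if a matching observation exists."""
    row = conn.execute(
        "SELECT id FROM observations WHERE hostname = ? AND basedir = ?"
        " AND relpath = ? AND filename = ? AND mtime_ns = ? AND size = ?"
        " ORDER BY id DESC LIMIT 1",
        _key(hostname, basedir, relpath, filename),
    ).fetchone()
    if row is None:
        return False  # new or modified: caller continues with full processing
    now = datetime.now(timezone.utc).isoformat()
    conn.execute("UPDATE observations SET scan_time = ? WHERE id = ?", (now, row[0]))
    return True
```

Because every column in the `WHERE` clause is part of the composite index, the lookup can be answered from the index alone, which is what keeps rescans of unchanged trees cheap.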

2. Full processing (new or modified file)

Reads:

BLAKE3 over content → hash
libmagic → mime
ExifTool → meta JSON

Writes:

new Blob row if hash not seen before (sets: hash, size, mime, first_seen; leaves storage_path empty for mama-apply to fill)
new Observation row (sets all fields, status='pending')
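The hashing part of full processing might look like the sketch below, which streams the file in chunks so large videos never need to fit in memory. It assumes the third-party `blake3` package; if that is not installed, it falls back to `hashlib.blake2b`, which is a different algorithm and serves only to keep the example runnable. MIME detection and ExifTool extraction are left out, and `hash_file` is a hypothetical name.

```python
import hashlib

try:
    from blake3 import blake3  # third-party "blake3" package

    def new_hasher():
        return blake3()
except ImportError:
    def new_hasher():
        # STAND-IN ONLY: BLAKE2b is not BLAKE3; digests will differ.
        return hashlib.blake2b(digest_size=32)


def hash_file(path, chunk_size: int = 1024 * 1024) -> tuple[str, int]:
    """Return (hex digest, size in bytes), reading the file in chunks."""
    hasher = new_hasher()
    size = 0
    with open(path, "rb") as fh:
        while chunk := fh.read(chunk_size):
            hasher.update(chunk)
            size += len(chunk)
    return hasher.hexdigest(), size
```

Both hashers produce a 32-byte digest, so the hex string is 64 characters long and fits the `str(64)` hash column.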

mama-apply in detail

Processes observations in cursor-paginated batches, ordered by id.
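Cursor (keyset) pagination ordered by id can be sketched like this. The example runs against any DB-API connection with an `observations` table and was written with an in-memory SQLite database in mind as a stand-in for PostgreSQL; the function name, the batch size, and the absence of a status filter are assumptions.

```python
import sqlite3


def iter_batches(conn: sqlite3.Connection, batch_size: int = 500):
    """Yield lists of (id, status) rows, resuming after the last seen id."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, status FROM observations"
            " WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        last_id = rows[-1][0]  # the cursor: highest id processed so far
```

Unlike `OFFSET`-based paging, the `id > ?` cursor stays correct when rows change status or new rows are inserted while the loop is running, and each batch is an index range scan rather than a re-read of all earlier rows.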

For each observation:

If blob.block_reason IS NOT NULL → skip, count as blocked
Compute CAS target path: <archive_root>/blobs/<2>/<2>/<full-hash> (two 2-character fan-out directories, as in blobs/ab/cd/<full-hash> under Storage Layout)
If CAS target doesn't exist:
- resolve source path: basedir/relpath/filename
- if source is missing → skip, count as missing
- try os.link() (instant, same dataset)
- if linking fails, fall back to shutil.copy2() (hardlinks cannot cross dataset boundaries; the copy costs extra space)
- chmod 444 on the blob
- set blob.storage_path to the CAS-relative path
Compute view path: <archive_root>/views/<source_kind>/<basedir>/<relpath>/<filename>
If view doesn't exist → os.link() from CAS blob to view path
Set observation.status = 'assigned'

The whole loop is idempotent — re-running mama-apply with no pending observations does nothing.
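The filesystem side of the loop (link or copy into the CAS, then link the view) could be sketched as follows. The function names and return values are hypothetical, database updates are omitted, and any `OSError` from `os.link()` triggers the copy fallback, which is broader than checking for a cross-device error specifically.

```python
import os
import shutil
from pathlib import Path


def ensure_blob(source: Path, blob_path: Path) -> str:
    """Materialize a blob; returns 'exists', 'missing', 'linked' or 'copied'."""
    if blob_path.exists():
        return "exists"
    if not source.exists():
        return "missing"
    blob_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(source, blob_path)        # instant when on the same dataset
        outcome = "linked"
    except OSError:
        shutil.copy2(source, blob_path)   # cross-dataset fallback, costs space
        outcome = "copied"
    # Read-only for everyone. After a hardlink, source and blob share one
    # inode, so this also changes the mode seen at the source path.
    os.chmod(blob_path, 0o444)
    return outcome


def ensure_view(blob_path: Path, view_path: Path) -> bool:
    """Hardlink the view to the blob; returns False if it already exists."""
    if view_path.exists():
        return False
    view_path.parent.mkdir(parents=True, exist_ok=True)
    os.link(blob_path, view_path)
    return True
```

Both helpers check for existence first, so calling them again for an already materialized observation changes nothing on disk.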

Rescan safety

mama-scan can be re-run on the same path any number of times:

unchanged files (matching (path, size, mtime)) → only scan_time updated, no new observation, no hashing
modified files → re-hashed, new observation row added (old one stays for history)
new files → full processing
removed files → observation stays in DB (planned: mark as gone)

This makes mama-scan cheap to schedule on a timer for the Syncthing folders.
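The four rescan outcomes can be illustrated with a small classifier over two snapshots of a folder. This is a sketch with plain dicts standing in for the database and the filesystem walk; `classify_rescan` and the shape of its inputs are hypothetical.

```python
def classify_rescan(known: dict, current: dict) -> dict:
    """Compare two {path: (size, mtime)} snapshots.

    `known` is what earlier scans recorded, `current` is what the walk
    sees now. Returns the paths in each of the four categories.
    """
    result = {"unchanged": [], "modified": [], "new": [], "removed": []}
    for path, signature in current.items():
        if path not in known:
            result["new"].append(path)          # full processing
        elif known[path] == signature:
            result["unchanged"].append(path)    # only scan_time is updated
        else:
            result["modified"].append(path)     # re-hash, add an observation
    for path in known:
        if path not in current:
            result["removed"].append(path)      # observation stays in the DB
    return result
```

Only the `new` and `modified` lists require reading file content, so the cost of a rescan scales with what changed rather than with the size of the folder.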

Components

mama-scan — index files into DB (above)
mama-apply — materialize archive (above)
mama-dev — developer utilities (reset, stats)
mama-web — planned: browse, merge duplicates, filter, export, set status

Tech Stack

Python 3.13, FastAPI, SQLAlchemy 2.x (async), Alembic
PostgreSQL 17 (JSONB for embedded metadata)
Vue 3, Vite
ZFS (single archive dataset, snapshots, NFS export), Caddy
ExifTool, BLAKE3, libmagic, ffmpeg, Pillow
Docker Compose for companion viewers (Immich, Navidrome, Paperless-ngx)

Disclaimer

mama is provided as-is for personal use. The author assumes no responsibility for data loss, corruption, mis-deduplication, accidental deletion, or any other adverse outcome arising from its use. Use at your own risk and only on data you can afford to lose.

License

MIT
