# mama

**M**edia **A**rchive **M**eets **A**utomation — a self-hosted system for
ingesting, deduplicating, and organizing personal media (photos, videos, music,
documents) on top of ZFS.

## ⚠️ Project Status

**Pre-alpha. Not ready for use by anyone but the author.**

- Database schema, CLI surface, on-disk layout, and HTTP API are unstable and
  will change without migration paths.
- Most features described below are partly implemented, partly planned.
- Documentation lags behind code.

Do not point mama at irreplaceable data. Keep independent backups.

## Concept

mama treats every file as two separate things:

- **A blob** — pure content, identified by its BLAKE3 hash. Stored once in
  a content-addressed store, regardless of how many places it appears.
- **An observation** — a sighting of that content at a specific filesystem
  path on a specific host at a specific time, with its own filesystem
  metadata and embedded metadata (EXIF, ID3, sidecar, ...).

This split is what enables real deduplication without losing context.
Identical content from a phone, a backup DVD, and an old laptop becomes three
observations referencing one blob.

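As a rough illustration of the blob/observation split, the sketch below records the same content seen in three places. The class and function names (`Blob`, `Observation`, `record`) and the in-memory containers are hypothetical and only mirror the concept; they are not mama's actual models.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Blob:
    """Pure content, keyed by its hex digest (hypothetical model)."""
    hash: str
    size: int


@dataclass(frozen=True)
class Observation:
    """One sighting of a blob at a path on a host (hypothetical model)."""
    blob_hash: str
    hostname: str
    path: str
    seen_at: datetime


def record(blobs: dict[str, Blob], observations: list[Observation],
           content_hash: str, size: int, hostname: str, path: str) -> None:
    # The content is stored once, however often it is sighted.
    blobs.setdefault(content_hash, Blob(content_hash, size))
    # Every sighting keeps its own context (host, path, time).
    observations.append(
        Observation(content_hash, hostname, path, datetime.now(timezone.utc))
    )


blobs: dict[str, Blob] = {}
observations: list[Observation] = []
same_content = "ab" * 32  # placeholder digest, not a real hash

for host, path in [
    ("phone", "/DCIM/IMG_0001.jpg"),
    ("dvd", "/backup/2019/IMG_0001.jpg"),
    ("laptop", "/home/me/Pictures/IMG_0001.jpg"),
]:
    record(blobs, observations, same_content, 1024, host, path)
```

After the loop, `blobs` holds a single entry while `observations` holds three, one per place the content was found.
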
## Workflow

mama operates in two phases per source folder:

1. **`mama-scan`** — walk filesystem, hash files, record observations in DB.
   No copies, no moves. Safe to re-run.
2. **`mama-apply`** — materialize observations into the archive (CAS blobs +
   hardlinked views). Idempotent.

```
             ┌─────────────────┐
filesystem ─▶│    mama-scan    │─▶ observations + blobs in DB
             └─────────────────┘
                      │
                      ▼
             ┌─────────────────┐
             │   mama-apply    │─▶ CAS + views
             └─────────────────┘
```

### Storage Layout

```
<archive_root>/
├── blobs/      Content-addressed storage: blobs/ab/cd/<full-hash>
│               - mode 444, identical hashes share one inode
├── views/      Hardlink trees scoped by source_kind:
│                 views/<source_kind>/<basedir>/<relpath>/<filename>
│               - same inode as the corresponding blob (zero extra storage
│                 within the same ZFS dataset)
└── previews/   (planned: derived thumbnails / low-res for browsing)
```

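The two path patterns above could be computed roughly as follows. This is a sketch only: `cas_path` and `view_path` are hypothetical helper names, and stripping the leading slash from an absolute `basedir` before nesting it under `views/` is an assumption, not something the layout above specifies.

```python
from pathlib import Path

HEX_DIGITS = set("0123456789abcdef")


def cas_path(archive_root: Path, hex_digest: str) -> Path:
    """Return <archive_root>/blobs/<first 2>/<next 2>/<full hash>."""
    if len(hex_digest) != 64 or not set(hex_digest) <= HEX_DIGITS:
        raise ValueError("expected a 64-character lowercase hex digest")
    return archive_root / "blobs" / hex_digest[:2] / hex_digest[2:4] / hex_digest


def view_path(archive_root: Path, source_kind: str, basedir: str,
              relpath: str, filename: str) -> Path:
    """Return <archive_root>/views/<source_kind>/<basedir>/<relpath>/<filename>."""
    # Assumption: an absolute basedir is made relative first; joining an
    # absolute path with pathlib would otherwise discard the archive root.
    return (archive_root / "views" / source_kind
            / basedir.lstrip("/") / relpath / filename)
```

For example, a digest starting with `abcd` ends up under `blobs/ab/cd/`, which keeps any single directory from growing too large.
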
### Database

**`blobs`** — one row per unique content (BLAKE3 hash):

| Column        | Type        | Purpose                                  |
|---------------|-------------|------------------------------------------|
| hash          | str(64) PK  | BLAKE3 hex digest                        |
| size          | bigint      | content size in bytes                    |
| storage_path  | text        | location in CAS (set by `mama-apply`)    |
| first_seen    | timestamptz | when first scanned                       |
| mime          | str(128)?   | detected via libmagic                    |
| block_reason  | str(32)?    | NULL = active; planned: deleted/blocked  |

**`observations`** — one row per file sighting:

| Column      | Type        | Purpose                                    |
|-------------|-------------|--------------------------------------------|
| id          | int PK      |                                            |
| blob_hash   | str(64) FK  | links to `blobs.hash`                      |
| hostname    | str(255)    | machine where the file was seen            |
| basedir     | text        | scan root path                             |
| relpath     | text        | directory below scan root                  |
| filename    | text        |                                            |
| size        | bigint      | size as seen (also in blobs, denormalized) |
| mtime       | timestamptz | file's modification time                   |
| ctime       | timestamptz | file's change time                         |
| scan_time   | timestamptz | last time this path was confirmed          |
| source_kind | str(32)     | syncthing / incoming / existing / import   |
| status      | str(32)     | pending / assigned / ignored               |
| meta        | jsonb?      | ExifTool / ID3 / sidecar metadata          |

Indexes:
- `ix_observations_blob_hash` — for joins
- `ix_observations_path_mtime` — for rescan idempotency (hostname, basedir, relpath, filename, mtime, size)

### Observation Lifecycle

```mermaid
stateDiagram-v2
    [*] --> pending: mama-scan (new file)
    pending --> assigned: mama-apply
    pending --> ignored: curation (planned)
    assigned --> ignored: curation (planned)
    ignored --> assigned: curation (planned)
```

`status` represents the **target state** (the desired state, not necessarily
the current state on disk):

- `pending` — newly scanned, target not yet decided
  - current: `mama-apply` auto-promotes to `assigned`
  - planned: stays `pending` until reviewed via web UI or rules
- `assigned` — should be in the archive; `mama-apply` ensures the view exists
- `ignored` — should not be in the archive; `mama-apply` ensures no view (planned)

`mama-apply`'s job is to reconcile the filesystem with the target state.

### mama-scan in detail

For each file under the scan root:

**1. Cheap path check (no content I/O)**

Reads:
- `stat()` → `size`, `mtime`, `ctime`
- DB query for an observation matching
  `(hostname, basedir, relpath, filename, mtime, size)`

If a match is found:
- update `scan_time` on that observation
- increment `unchanged` counter
- **skip everything else** (no hashing, no metadata extraction)

**2. Full processing (new or modified file)**

Reads:
- BLAKE3 over content → `hash`
- libmagic → `mime`
- ExifTool → `meta` JSON

Writes:
- new `Blob` row if `hash` not seen before (sets: `hash`, `size`, `mime`,
  `first_seen`; leaves `storage_path` empty for `mama-apply` to fill)
- new `Observation` row (sets all fields, `status='pending'`)

Counters reported: `files | new obs | unchanged | new blobs | duplicates |
with metadata | errors`.

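The two-step scan above might look roughly like this. The sketch makes several assumptions: an in-memory dict (`seen`) stands in for the `observations` table, the standard library's `hashlib.blake2b` stands in for BLAKE3 so the example runs without extra packages, and the MIME and metadata steps are left out. `scan_key`, `hash_file`, and `scan` are hypothetical names.

```python
import hashlib
import os
from pathlib import Path

# (hostname, basedir, relpath, filename, mtime in ns, size)
Key = tuple[str, str, str, str, int, int]


def scan_key(hostname: str, basedir: Path, path: Path) -> Key:
    """Build the cheap identity of a file from stat() alone."""
    st = path.stat()
    relpath = path.parent.relative_to(basedir)
    return (hostname, str(basedir), str(relpath), path.name,
            st.st_mtime_ns, st.st_size)


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash in chunks so large media files never sit in memory at once."""
    digest = hashlib.blake2b(digest_size=32)  # stand-in for BLAKE3
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def scan(hostname: str, basedir: Path, seen: dict[Key, str]) -> dict[str, int]:
    counters = {"files": 0, "new obs": 0, "unchanged": 0}
    for dirpath, _dirnames, filenames in os.walk(basedir):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            counters["files"] += 1
            key = scan_key(hostname, basedir, path)
            if key in seen:
                # Step 1 matched: no content I/O at all.
                counters["unchanged"] += 1
                continue
            # Step 2: new or modified file, so read and hash the content.
            seen[key] = hash_file(path)
            counters["new obs"] += 1
    return counters
```

Because the key includes `mtime` and `size`, a modified file produces a new key and therefore a new entry, while the old entry stays in place.
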
### mama-apply in detail

`mama-apply` processes observations in cursor-paginated batches, ordered by `id`.

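Cursor (keyset) pagination over `id` can be sketched as below. The example uses the standard library's `sqlite3` as a stand-in for PostgreSQL so it runs anywhere; `iter_batches`, the batch size, and the selected columns are assumptions for illustration.

```python
import sqlite3
from collections.abc import Iterator


def iter_batches(conn: sqlite3.Connection,
                 batch_size: int = 500) -> Iterator[list[tuple[int, str]]]:
    """Yield observations in id order, resuming after the last id seen."""
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, status FROM observations "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return
        yield rows
        # The cursor is the highest id handed out so far, not an OFFSET,
        # so rows updated while iterating do not shift later pages.
        last_id = rows[-1][0]
```

Compared with `OFFSET`-based paging, each query here starts from an indexed `id`, so later batches cost about the same as the first one.
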
For each observation:

1. If `blob.block_reason IS NOT NULL` → skip, count as `blocked`
2. Compute CAS target path: `<archive_root>/blobs/<2>/<2>/<full-hash>`
3. If CAS target doesn't exist:
   - resolve source path: `basedir/relpath/filename`
   - if source is missing → skip, count as `missing`
   - try `os.link()` (instant, same dataset)
   - if the source is on a different dataset, fall back to `shutil.copy2()`
     (hardlinks cannot cross filesystems, so the copy costs extra space)
   - `chmod 444` on the blob
   - set `blob.storage_path` to the CAS-relative path
4. Compute view path: `<archive_root>/views/<source_kind>/<basedir>/<relpath>/<filename>`
5. If view doesn't exist → `os.link()` from CAS blob to view path
6. Set `observation.status = 'assigned'`

The whole loop is idempotent — re-running `mama-apply` with no pending
observations does nothing.

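Steps 3 and 5 above (link, fall back to copy, then link the view) can be sketched as follows. `materialize` and `ensure_view` are hypothetical names, and falling back only on `EXDEV` (the error raised for cross-filesystem links) is an assumption about when the copy path should be taken.

```python
import errno
import os
import shutil
from pathlib import Path


def materialize(source: Path, cas_target: Path) -> str:
    """Ensure the CAS blob exists; return what happened."""
    if cas_target.exists():
        return "exists"
    if not source.exists():
        return "missing"
    cas_target.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(source, cas_target)  # instant, shares the inode
        outcome = "linked"
    except OSError as exc:
        if exc.errno != errno.EXDEV:
            raise
        # Different filesystem or dataset: a real copy is the only option.
        shutil.copy2(source, cas_target)
        outcome = "copied"
    # Permissions belong to the inode, so after a hardlink the source
    # path becomes read-only as well.
    os.chmod(cas_target, 0o444)
    return outcome


def ensure_view(cas_target: Path, view: Path) -> bool:
    """Hardlink the view to the blob; return True if it was created."""
    if view.exists():
        return False
    view.parent.mkdir(parents=True, exist_ok=True)
    os.link(cas_target, view)
    return True
```

Both functions check for the target first, so calling them again on an already materialized observation changes nothing on disk.
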
### Rescan safety

`mama-scan` can be re-run on the same path any number of times:

- unchanged files (matching `(path, size, mtime)`) → only `scan_time` updated,
  no new observation, no hashing
- modified files → re-hashed, new observation row added (old one stays for history)
- new files → full processing
- removed files → observation stays in DB (planned: mark as gone)

This makes `mama-scan` cheap to schedule on a timer for the Syncthing folders.

## Components

- **`mama-scan`** — index files into DB (above)
- **`mama-apply`** — materialize archive (above)
- **`mama-dev`** — developer utilities (`reset`, `stats`)
- **`mama-web`** — planned: browse, merge duplicates, filter, export, set status

## Tech Stack

- Python 3.13, FastAPI, SQLAlchemy 2.x (async), Alembic
- PostgreSQL 17 (JSONB for embedded metadata)
- Vue 3, Vite
- ZFS (single archive dataset, snapshots, NFS export), Caddy
- ExifTool, BLAKE3, libmagic, ffmpeg, Pillow
- Docker Compose for companion viewers (Immich, Navidrome, Paperless-ngx)

## Disclaimer

mama is provided as-is for personal use. The author assumes no responsibility
for data loss, corruption, mis-deduplication, accidental deletion, or any other
adverse outcome arising from its use. Use at your own risk and only on data you
can afford to lose.

## License

[MIT](LICENSE)