
# mama

**M**edia **A**rchive **M**eets **A**utomation — a self-hosted system for
ingesting, deduplicating, and organizing personal media (photos, videos, music,
documents) on top of ZFS.

## ⚠️ Project Status

**Pre-alpha. Not ready for use by anyone but the author.**

- Database schema, CLI surface, on-disk layout, and HTTP API are unstable and
  will change without migration paths.
- Most features described below are only partly implemented; anything marked
  "planned" does not exist yet.
- Documentation lags behind code.

Do not point mama at irreplaceable data. Keep independent backups.

## Concept

mama treats every file as two separate things:

- **A blob** — pure content, identified by its BLAKE3 hash. Stored once in
  a content-addressed store, regardless of how many places it appears.
- **An observation** — a sighting of that content at a specific filesystem
  path on a specific host at a specific time, with its own filesystem
  metadata and embedded metadata (EXIF, ID3, sidecar, ...).

This split is what enables real deduplication without losing context.
Identical content from a phone, a backup DVD, and an old laptop become three
observations referencing one blob.
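
As a rough illustration of the split (not mama's actual models), the two concepts can be sketched as plain dataclasses. The field selection, host names, paths, and the placeholder hash below are all made up for the example:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Blob:
    """Pure content, keyed by its hash (a placeholder string here)."""
    hash: str
    size: int


@dataclass(frozen=True)
class Observation:
    """One sighting of a blob at a path on a host at a point in time."""
    blob_hash: str
    hostname: str
    path: str
    scan_time: datetime


# Hypothetical data: the same photo seen in three places.
now = datetime.now(timezone.utc)
blob = Blob(hash="ab" * 32, size=2_481_113)
observations = [
    Observation(blob.hash, "phone", "DCIM/Camera/IMG_0001.jpg", now),
    Observation(blob.hash, "dvd-2009", "backup/photos/IMG_0001.jpg", now),
    Observation(blob.hash, "old-laptop", "Pictures/holiday/beach.jpg", now),
]

# Deduplication collapses content, not context: one blob, three sightings.
unique_blobs = {obs.blob_hash for obs in observations}
```

The content is stored once, while each original location, host, and timestamp survives as its own record.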

## Workflow

mama operates in two phases per source folder:

1. **`mama-scan`** — walk filesystem, hash files, record observations in DB.
   No copies, no moves. Safe to re-run.
2. **`mama-apply`** — materialize observations into the archive (CAS blobs +
   hardlinked views). Idempotent.

```
                ┌─────────────────┐
   filesystem ─▶│   mama-scan     │─▶ observations + blobs in DB
                └─────────────────┘
                                            │
                                            ▼
                                  ┌─────────────────┐
                                  │   mama-apply    │─▶ CAS + views
                                  └─────────────────┘
```

### Storage Layout

```
<archive_root>/
├── blobs/        Content-addressed storage: blobs/ab/cd/<full-hash>
│                 - mode 444, identical hashes share one inode
├── views/        Hardlink trees scoped by source_kind:
│                 views/<source_kind>/<basedir>/<relpath>/<filename>
│                 - same inode as the corresponding blob (zero extra storage
│                   within the same ZFS dataset)
└── previews/     (planned: derived thumbnails / low-res for browsing)
```
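
The two path schemes above can be sketched as small helpers. The function names are hypothetical, and stripping the leading slash from an absolute `basedir` is an assumption about how it is re-rooted under `views/`:

```python
from pathlib import Path


def cas_path(archive_root: Path, blob_hash: str) -> Path:
    """Return <archive_root>/blobs/<first 2 hex>/<next 2 hex>/<full hash>."""
    return archive_root / "blobs" / blob_hash[:2] / blob_hash[2:4] / blob_hash


def view_path(
    archive_root: Path, source_kind: str, basedir: str, relpath: str, filename: str
) -> Path:
    """Return <archive_root>/views/<source_kind>/<basedir>/<relpath>/<filename>."""
    # Assumption: an absolute basedir is re-rooted by dropping its leading "/".
    # Without this, joining an absolute segment would discard the archive root.
    return (
        archive_root / "views" / source_kind / basedir.lstrip("/") / relpath / filename
    )


# Hypothetical archive root and hash, for illustration only.
root = Path("/tank/archive")
example_hash = "abcd" + "0" * 60
example_blob = cas_path(root, example_hash)
example_view = view_path(root, "syncthing", "/srv/sync/phone", "DCIM", "IMG_0001.jpg")
```

The two-level fan-out (`ab/cd/`) keeps any single directory from accumulating every blob in the archive.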

### Database

**`blobs`** — one row per unique content (BLAKE3 hash):

| Column        | Type        | Purpose                                  |
|---------------|-------------|------------------------------------------|
| hash          | str(64) PK  | BLAKE3 hex digest                        |
| size          | bigint      | content size in bytes                    |
| storage_path  | text        | location in CAS (set by `mama-apply`)    |
| first_seen    | timestamptz | when first scanned                       |
| mime          | str(128)?   | detected via libmagic                    |
| block_reason  | str(32)?    | NULL = active; planned: deleted/blocked  |

**`observations`** — one row per file sighting:

| Column      | Type        | Purpose                                  |
|-------------|-------------|------------------------------------------|
| id          | int PK      |                                          |
| blob_hash   | str(64) FK  | links to `blobs.hash`                    |
| hostname    | str(255)    | machine where the file was seen          |
| basedir     | text        | scan root path                           |
| relpath     | text        | directory below scan root                |
| filename    | text        |                                          |
| size        | bigint      | size as seen (denormalized from blobs)   |
| mtime       | timestamptz | file's modification time                 |
| ctime       | timestamptz | file's change time                       |
| scan_time   | timestamptz | last time this path was confirmed        |
| source_kind | str(32)     | syncthing / incoming / existing / import |
| status      | str(32)     | pending / assigned / ignored             |
| meta        | jsonb?      | ExifTool / ID3 / sidecar metadata        |

Indexes:
- `ix_observations_blob_hash` — for joins
- `ix_observations_path_mtime` — for rescan idempotency (hostname, basedir, relpath, filename, mtime, size)

### Observation Lifecycle

```mermaid
stateDiagram-v2
    [*] --> pending: mama-scan (new file)
    pending --> assigned: mama-apply
    pending --> ignored: curation (planned)
    assigned --> ignored: curation (planned)
    ignored --> assigned: curation (planned)
```

`status` represents the **target state** (what should be true, not necessarily what is true on disk yet):

- `pending` — newly scanned, target not yet decided
  - current: `mama-apply` auto-promotes to `assigned`
  - planned: stays `pending` until reviewed via web UI or rules
- `assigned` — should be in the archive; `mama-apply` ensures the view exists
- `ignored` — should not be in the archive; `mama-apply` ensures no view (planned)

`mama-apply`'s job is to reconcile the filesystem with the target state.
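
The transitions in the diagram can be written down as a small lookup table. This is only a sketch: the `Status` enum, the `ALLOWED` set, and `transition()` are illustrative names rather than mama's code, and the curation transitions are still planned:

```python
from enum import Enum


class Status(str, Enum):
    PENDING = "pending"
    ASSIGNED = "assigned"
    IGNORED = "ignored"


# Edges taken from the state diagram above.
ALLOWED = {
    (Status.PENDING, Status.ASSIGNED),   # mama-apply
    (Status.PENDING, Status.IGNORED),    # curation (planned)
    (Status.ASSIGNED, Status.IGNORED),   # curation (planned)
    (Status.IGNORED, Status.ASSIGNED),   # curation (planned)
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the diagram has no such edge."""
    if current == target:
        return current  # setting the same target again is a no-op
    if (current, target) not in ALLOWED:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Note that nothing leads back to `pending`: once a target has been decided, it can only be flipped between `assigned` and `ignored`.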

### mama-scan in detail

For each file under the scan root:

**1. Cheap path check (no content I/O)**

Reads:
- `stat()` → `size`, `mtime`, `ctime`
- DB query for an observation matching
  `(hostname, basedir, relpath, filename, mtime, size)`

If a match is found:
- update `scan_time` on that observation
- increment `unchanged` counter
- **skip everything else** (no hashing, no metadata extraction)
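
A minimal sketch of this fast path, under two loud assumptions: SQLite (stdlib) stands in for PostgreSQL, and `mtime` is stored as integer nanoseconds so that equality is exact. The lookup and the `scan_time` update are folded into one `UPDATE`; the helper name and sample row are hypothetical:

```python
import sqlite3
import time

# Assumption: in-memory SQLite replaces PostgreSQL for the sake of the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE observations (
        id INTEGER PRIMARY KEY,
        hostname TEXT, basedir TEXT, relpath TEXT, filename TEXT,
        mtime INTEGER, size INTEGER, scan_time REAL
    )
    """
)
conn.execute(
    "CREATE INDEX ix_observations_path_mtime ON observations "
    "(hostname, basedir, relpath, filename, mtime, size)"
)


def confirm_unchanged(conn, key, now):
    """Refresh scan_time on an exact key match; return True if one existed.

    key is (hostname, basedir, relpath, filename, mtime, size).
    """
    cur = conn.execute(
        "UPDATE observations SET scan_time = ? "
        "WHERE hostname = ? AND basedir = ? AND relpath = ? "
        "AND filename = ? AND mtime = ? AND size = ?",
        (now, *key),
    )
    return cur.rowcount > 0


# Hypothetical observation left behind by an earlier scan.
key = ("nas", "/srv/sync/phone", "DCIM", "IMG_0001.jpg", 1_700_000_000_000_000_000, 2048)
conn.execute(
    "INSERT INTO observations "
    "(hostname, basedir, relpath, filename, mtime, size, scan_time) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    (*key, 0.0),
)

unchanged = confirm_unchanged(conn, key, time.time())  # same mtime and size
touched_key = key[:4] + (key[4] + 1, key[5])           # mtime moved by 1 ns
modified = confirm_unchanged(conn, touched_key, time.time())
```

Only a `False` result sends the file on to hashing and metadata extraction.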

**2. Full processing (new or modified file)**

Reads:
- BLAKE3 over content → `hash`
- libmagic → `mime`
- ExifTool → `meta` JSON

Writes:
- new `Blob` row if `hash` not seen before (sets: `hash`, `size`, `mime`,
  `first_seen`; leaves `storage_path` empty for `mama-apply` to fill)
- new `Observation` row (sets all fields, `status='pending'`)

Counters reported: `files | new obs | unchanged | new blobs | duplicates |
with metadata | errors`.
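
The hashing and bookkeeping part of this step might look roughly like the sketch below. Assumptions: stdlib BLAKE2b (configured for a 32-byte digest, so the hex string is 64 characters like a BLAKE3 digest) stands in for BLAKE3, plain dicts and lists stand in for the database, and the libmagic and ExifTool steps are left out. `hash_file` and `process_file` are hypothetical names:

```python
import hashlib
import tempfile
from collections import Counter
from pathlib import Path


def hash_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through the hasher so large videos never sit in RAM."""
    # Assumption: BLAKE2b replaces BLAKE3 here to stay within the stdlib.
    hasher = hashlib.blake2b(digest_size=32)
    with path.open("rb") as fh:
        while chunk := fh.read(chunk_size):
            hasher.update(chunk)
    return hasher.hexdigest()


def process_file(path: Path, blobs: dict, observations: list, counters: Counter) -> None:
    """Record one new observation, creating the blob only on first sight."""
    digest = hash_file(path)
    if digest in blobs:
        counters["duplicates"] += 1
    else:
        # storage_path stays empty until mama-apply fills it in.
        blobs[digest] = {"size": path.stat().st_size, "storage_path": None}
        counters["new blobs"] += 1
    observations.append(
        {"blob_hash": digest, "filename": path.name, "status": "pending"}
    )
    counters["new obs"] += 1


# Demo: two files with identical content, one with different content.
blobs, observations, counters = {}, [], Counter()
with tempfile.TemporaryDirectory() as tmp:
    for name, payload in [("a.jpg", b"same"), ("copy.jpg", b"same"), ("b.jpg", b"other")]:
        path = Path(tmp) / name
        path.write_bytes(payload)
        counters["files"] += 1
        process_file(path, blobs, observations, counters)
```

In the demo, three files yield three observations but only two blobs, with the second copy counted as a duplicate.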

### mama-apply in detail

Processes observations in cursor-paginated batches, ordered by `id`.
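
Cursor (keyset) pagination by `id` can be sketched as a generator that remembers the last `id` it saw. The `fetch_page` callable is a hypothetical stand-in for a query of the form `WHERE id > :after_id ORDER BY id LIMIT :limit`; the in-memory table only exists for the demo:

```python
from typing import Callable, Iterator

FetchPage = Callable[[int, int], list]


def iter_batches(fetch_page: FetchPage, batch_size: int = 500) -> Iterator[list]:
    """Yield batches of rows, resuming after the highest id seen so far."""
    last_id = 0
    while True:
        rows = fetch_page(last_id, batch_size)
        if not rows:
            return
        yield rows
        last_id = rows[-1]["id"]  # the cursor: no OFFSET involved


# Demo with an in-memory "table" already ordered by id.
table = [{"id": i} for i in range(1, 8)]


def fetch_page(after_id: int, limit: int) -> list:
    return [row for row in table if row["id"] > after_id][:limit]


batch_sizes = [len(batch) for batch in iter_batches(fetch_page, batch_size=3)]
```

Because each page is anchored on an `id` rather than an offset, rows whose `status` changes during the run do not shift the pages that follow.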

For each observation:

1. If `blob.block_reason IS NOT NULL` → skip, count as `blocked`
2. Compute CAS target path: `<archive_root>/blobs/<2>/<2>/<full-hash>`
3. If CAS target doesn't exist:
   - resolve source path: `basedir/relpath/filename`
   - if source is missing → skip, count as `missing`
   - try `os.link()` (instant, same dataset)
   - fall back to `shutil.copy2()` if linking fails (hard links cannot cross
     datasets; the copy costs extra space)
   - `chmod 444` on the blob
   - set `blob.storage_path` to the CAS-relative path
4. Compute view path: `<archive_root>/views/<source_kind>/<basedir>/<relpath>/<filename>`
5. If view doesn't exist → `os.link()` from CAS blob to view path
6. Set `observation.status = 'assigned'`

The whole loop is idempotent — re-running `mama-apply` with no pending
observations does nothing.
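
Steps 3 to 5 can be sketched with the standard library alone. This is an illustration under assumptions, not the actual implementation: `link_or_copy` and `materialize` are hypothetical names, the database updates are omitted, and any `OSError` from `os.link()` triggers the copy fallback:

```python
import os
import shutil
import stat
import tempfile
from pathlib import Path


def link_or_copy(src: Path, dst: Path) -> str:
    """Hard-link src to dst; copy instead if linking is not possible."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(src, dst)
        return "linked"
    except OSError:
        # Typically EXDEV: hard links cannot cross filesystems.
        shutil.copy2(src, dst)
        return "copied"


def materialize(src: Path, blob: Path, view: Path) -> str:
    """Ensure the CAS blob and its view exist; safe to call repeatedly."""
    if not blob.exists():
        if not src.exists():
            return "missing"
        link_or_copy(src, blob)
        # A hard link shares the inode, so this also changes the source's mode.
        os.chmod(blob, 0o444)
    if not view.exists():
        view.parent.mkdir(parents=True, exist_ok=True)
        os.link(blob, view)
    return "assigned"


# Demo in a throwaway directory; paths and the hash are made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "incoming" / "a.jpg"
    src.parent.mkdir()
    src.write_bytes(b"pixels")
    blob = root / "archive" / "blobs" / "ab" / "cd" / ("abcd" + "0" * 60)
    view = root / "archive" / "views" / "incoming" / "a.jpg"

    first = materialize(src, blob, view)
    second = materialize(src, blob, view)  # nothing left to do
    same_inode = blob.stat().st_ino == view.stat().st_ino
    blob_mode = stat.S_IMODE(blob.stat().st_mode)
    missing = materialize(root / "gone.jpg", root / "archive" / "blobs" / "x", root / "v")
```

Every step checks for existence before acting, which is what makes a second call a no-op.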

### Rescan safety

`mama-scan` can be re-run on the same path any number of times:

- unchanged files (same `hostname`, `basedir`, `relpath`, `filename`, `size`,
  and `mtime`) → only `scan_time` updated, no new observation, no hashing
- modified files → re-hashed, new observation row added (old one stays for history)
- new files → full processing
- removed files → observation stays in DB (planned: mark as gone)

This makes `mama-scan` cheap to schedule on a timer for the Syncthing folders.

## Components

- **`mama-scan`** — index files into DB (above)
- **`mama-apply`** — materialize archive (above)
- **`mama-dev`** — developer utilities (`reset`, `stats`)
- **`mama-web`** — planned: browse, merge duplicates, filter, export, set status

## Tech Stack

- Python 3.13, FastAPI, SQLAlchemy 2.x (async), Alembic
- PostgreSQL 17 (JSONB for embedded metadata)
- Vue 3, Vite
- ZFS (single archive dataset, snapshots, NFS export), Caddy
- ExifTool, BLAKE3, libmagic, ffmpeg, Pillow
- Docker Compose for companion viewers (Immich, Navidrome, Paperless-ngx)

## Disclaimer

mama is provided as-is for personal use. The author assumes no responsibility
for data loss, corruption, mis-deduplication, accidental deletion, or any other
adverse outcome arising from its use. Use at your own risk and only on data you
can afford to lose.

## License

[MIT](LICENSE)