plainfile-family-history

TOOLING.md — Plainfile Family History: Tooling Design

Version 1.2 — 2026-06-12 · companion to SPEC.md v1.2 (versions track the SPEC)

This is the implementation design for every tool that supports the archive. SPEC.md Part IV states what each tool must do; this document specifies how — schemas, algorithms, command shapes, libraries, and edge cases — in enough detail to rebuild any tool from scratch in any language. Tools are conveniences: the archive works by hand without them, and every artifact they produce is a disposable cache.

Tooling design changes are governed by the SPEC decision log.

Operating principles for all tools: no daemons, watchers, or schedulers — freshness is refresh-on-use (cheap incremental refreshes run at the start of any tool that needs them, notably fha report). Deterministic work belongs in fha tools; AI judgment belongs in workbench skills; human review is the only gate to accepted. AI passes are always invoked and recorded — nothing mines, extracts, or classifies silently.


1. Shared foundations

All tools are subcommands of one CLI, fha (family-history archive), implemented as one Python file per tool under tools/, each also runnable standalone (python tools/lint.py). A tool must never import another tool; shared code lives in tools/_lib.py.

Environment. Python ≥ 3.10. Permitted dependencies: PyYAML, Jinja2 (site generator only). External binary: exiftool (embedded-metadata read/write). Everything else stdlib. No network access except the geocoder’s optional gazetteer download.

Archive root discovery. An archive root is identified by the presence of fha.yaml; discovery walks upward from CWD to find it, and --root PATH overrides. (sources/ and the other record directories are expected but not required for root detection — a brand-new archive may have fha.yaml before its record directories exist.) The spec docs (SPEC.md/TOOLING.md) normally live at the archive root but may instead be installed with the tools; --spec-root PATH points at them when separated. In the public spec repo, the fixture is run explicitly: fha lint --root example-archive --spec-root . — the repo root is not itself an archive. All stored paths are alias-form with forward slashes (normalize on Windows).
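The upward walk described above can be sketched as follows. This is a minimal illustration, not the shipped _lib.py code: the function name find_archive_root and the choice to trust an explicit override without checking for fha.yaml are assumptions.

```python
from pathlib import Path
from typing import Optional


def find_archive_root(start: Path, override: Optional[Path] = None) -> Optional[Path]:
    """Return the archive root for `start`, or None if no fha.yaml is found.

    `override` models --root PATH: when given it wins outright
    (assumption: it is trusted without checking for fha.yaml).
    """
    if override is not None:
        return override.resolve()
    start = start.resolve()
    # Check the starting directory first, then each ancestor up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / "fha.yaml").is_file():
            return candidate
    return None
```

A caller would typically pass Path.cwd() as start and turn a None result into exit code 3 (tool failure).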

--root per subcommand. Because Python’s argparse does not propagate parent-parser flags into subparsers, every subcommand defines its own --root (and --spec-root) flag. Both fha --root X lint and fha lint --root X work; the subcommand’s own flag wins. Implementations in other languages must respect the same dual-position convention.
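One way to get the dual-position behaviour with argparse is to give the subcommand's copy of each flag a default of argparse.SUPPRESS, so the subparser only sets the attribute when the flag is actually typed after the subcommand. This is a sketch under that assumption; build_parser and the two-subcommand list are illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="fha")
    # Top-level position: fha --root X lint
    parser.add_argument("--root", default=None)
    parser.add_argument("--spec-root", default=None)

    sub = parser.add_subparsers(dest="command", required=True)
    for name in ("lint", "index"):  # illustrative subset of subcommands
        sp = sub.add_parser(name)
        # Subcommand position: fha lint --root X
        # SUPPRESS means "leave the attribute alone unless the flag is given",
        # so a value set at the top level survives, and the subcommand's own
        # flag overwrites it when both are present.
        sp.add_argument("--root", default=argparse.SUPPRESS)
        sp.add_argument("--spec-root", default=argparse.SUPPRESS)
    return parser
```

With this shape, fha --root X lint --root Y resolves to Y, matching the rule that the subcommand's own flag wins.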

Path resolution (resolve_path). Record paths begin with a root alias (photos/…, documents/…). _lib.py loads fha.yaml once per run; resolve_path(p) maps the first segment through roots: (absolute values used as-is; relative values join the archive root; a missing fha.yaml or alias defaults to the internal directory of the same name). Every tool that touches asset files — lint E011/E012, photoindex, packet, process, site — resolves through this function and never assumes assets are under the archive root.
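A minimal sketch of the alias mapping, assuming the roots: mapping from fha.yaml has already been loaded into a plain dict. The three-argument signature is an assumption made to keep the example self-contained; the real resolve_path(p) takes only the path and reads the cached config.

```python
from pathlib import Path


def resolve_path(alias_path: str, archive_root: Path, roots: dict) -> Path:
    """Map an alias-form record path (photos/..., documents/...) to a real path."""
    # Stored paths use forward slashes; tolerate backslashes from Windows input.
    first, _, rest = alias_path.replace("\\", "/").partition("/")
    mapped = roots.get(first)
    if mapped is None:
        # Missing alias: default to the internal directory of the same name.
        base = archive_root / first
    else:
        base = Path(mapped)
        if not base.is_absolute():
            # Relative root values join the archive root.
            base = archive_root / base
    return base / rest if rest else base
```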

Caches. All generated non-.md artifacts live in archive/.cache/ (index, photo catalog, logs) — gitignored, deletable at any time. Generated .md views are written into the tree where humans browse, and always begin:

<!-- GENERATED by fha <tool> on <date> — do not edit; regenerate instead -->

The linter treats hand-edits below that header as a warning.

Parsing layer (_lib.py). The four parsing primitives every tool uses:

ID_RE      = re.compile(r'\b([PSCLH])-([0-9a-hjkmnp-tv-z]{10})\b', re.I)   # Crockford Base32, case-insensitive match
TOKEN_RE   = re.compile(r'\[([PSCLH]-[0-9a-hjkmnp-tv-z]{10})\]', re.I)     # citations/cross-links
FRONT_RE   = re.compile(r'\A---\r?\n(.*?)\r?\n---\r?\n', re.S)  # YAML frontmatter
CLAIMS_RE  = re.compile(r'^## Claims.*?```yaml\r?\n(.*?)```', re.S | re.M)

read_record(path) → {meta: dict, claims: list, stories: str|None, body: str}. The parser normalizes YAML scalars: booleans (living: false) and dates (created: 2026-06-12, read as date objects) are coerced to the canonical strings the index expects (living TEXT holds 'true'|'false'|'unknown'; dates serialize back to ISO). Person companion files (_research/_timeline/_sources-index, same P-id) are linked via person_files, not treated as duplicate person rows; the profile populates persons.path. Claims blocks parse with yaml.safe_load; a parse failure is a lint error, never a crash: tools collect errors and continue. All file IO is UTF-8.

EDTF handling. Tools never need full EDTF; they need validation and sortable bounds. edtf_bounds(s) -> (min_iso, max_iso):

| EDTF | min | max |
|---|---|---|
| 1850 | 1850-01-01 | 1850-12-31 |
| 1850~ / 1850? | 1849-01-01 | 1851-12-31 (widened ±1y) |
| 185X | 1850-01-01 | 1859-12-31 |
| 1850-05 | 1850-05-01 | 1850-05-31 |
| 1850-~05 or 1850-05~ | 1850-04-01 | 1850-06-30 (widened ±1m) |
| [..1920] | 0001-01-01 | 1920-12-31 |
| A/B (interval) | min(A) | max(B) |

Month-approximate tilde position: EDTF permits the qualifier either before the month component (1850-~05, the Level 2 individual-component form, which qualifies only the month) or after the full date (1850-05~, the Level 1 trailing form). Both are valid and treated identically by this system. Real-world data uses both; the validation regex and bounds computation must accept either position.

Validation regex (accepting the subset above) is the linter’s date check; anything else is E014.
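The bounds table above can be implemented with a handful of anchored patterns. The following is a sketch covering exactly the rows in that table and nothing more; the helper name _month_span and the decision to raise ValueError for anything outside the subset (which a linter could report as E014) are assumptions.

```python
import calendar
import re

_YEAR = re.compile(r'^(\d{4})([~?]?)$')          # 1850, 1850~, 1850?
_DECADE = re.compile(r'^(\d{3})X$')              # 185X
_MONTH = re.compile(r'^(\d{4})-(~?)(\d{2})(~?)$')  # 1850-05, 1850-~05, 1850-05~
_OPEN_START = re.compile(r'^\[\.\.(.+)\]$')      # [..1920]


def _month_span(year, month, widen=0):
    """First day of (month - widen) and last day of (month + widen)."""
    lo = year * 12 + (month - 1) - widen
    hi = year * 12 + (month - 1) + widen
    lo_y, lo_m = divmod(lo, 12)
    hi_y, hi_m = divmod(hi, 12)
    last_day = calendar.monthrange(hi_y, hi_m + 1)[1]
    return f"{lo_y:04d}-{lo_m + 1:02d}-01", f"{hi_y:04d}-{hi_m + 1:02d}-{last_day:02d}"


def edtf_bounds(s):
    """Return (min_iso, max_iso) for the supported EDTF subset, else raise ValueError."""
    s = s.strip()
    if "/" in s:  # interval A/B: min(A), max(B)
        a, b = s.split("/", 1)
        return edtf_bounds(a)[0], edtf_bounds(b)[1]
    if m := _OPEN_START.match(s):
        return "0001-01-01", edtf_bounds(m.group(1))[1]
    if m := _YEAR.match(s):
        year, widen = int(m.group(1)), (1 if m.group(2) else 0)
        return f"{year - widen:04d}-01-01", f"{year + widen:04d}-12-31"
    if m := _DECADE.match(s):
        decade = int(m.group(1)) * 10
        return f"{decade:04d}-01-01", f"{decade + 9:04d}-12-31"
    if m := _MONTH.match(s):
        month = int(m.group(3))
        if 1 <= month <= 12:
            widen = 1 if (m.group(2) or m.group(4)) else 0
            return _month_span(int(m.group(1)), month, widen)
    raise ValueError(f"not in the supported EDTF subset: {s!r}")
```

Because the results are zero-padded ISO strings, they sort and compare correctly as plain text, which is what the index's date_min/date_max columns rely on.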

Exit codes (all tools). 0 clean · 1 warnings only · 2 errors found · 3 tool failure. --json emits machine-readable findings; default output is human lines: SEVERITY CODE path: message.
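As an illustration of the exit-code contract, a tool might reduce its collected findings like this. The finding shape (a dict with severity, code, path, message) and the literal severity strings ERROR and WARNING are assumptions; only the 0/1/2/3 mapping and the human line format come from the text above.

```python
def exit_code(findings, tool_failed=False):
    """Map collected findings to the shared exit codes: 0, 1, 2, or 3."""
    if tool_failed:
        return 3  # the tool itself broke; findings are not trustworthy
    severities = {f["severity"] for f in findings}
    if "ERROR" in severities:
        return 2
    if "WARNING" in severities:
        return 1
    return 0


def human_line(finding):
    """Render one finding in the default format: SEVERITY CODE path: message."""
    return f"{finding['severity']} {finding['code']} {finding['path']}: {finding['message']}"
```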


2. fha index — the index builder

Purpose. Rebuild .cache/index.sqlite from scratch by walking the tree. It is the query surface for every other tool and for ad-hoc research (sqlite3 or any viewer). Never appended; never authoritative.

Incremental mode. fha index --source S-xxxx upserts one source: delete its rows, re-parse the one file, re-insert — sub-second, run automatically at the end of review sessions. Deletion order matters: collect claim IDs first, then delete claim_persons and claim_links by those IDs, then delete claims, then sources/source_files/source_people. Reversing this order leaves orphan rows in the child tables because the parent subquery finds nothing. Full rebuild remains the periodic truth-check; any discrepancy between incremental and full states is a bug in incremental, by definition.
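The deletion order can be sketched with the standard sqlite3 module. The function name delete_source_rows is hypothetical; the table and column names are those of the schema in this section. Claim IDs are collected before anything is deleted, so the child-table deletes never depend on rows that are already gone.

```python
import sqlite3


def delete_source_rows(db: sqlite3.Connection, source_id: str) -> None:
    """Remove every index row belonging to one source, children first."""
    # 1. Collect claim IDs while the claims rows still exist.
    claim_ids = [row[0] for row in db.execute(
        "SELECT id FROM claims WHERE source_id = ?", (source_id,))]
    marks = ",".join("?" * len(claim_ids))
    with db:  # one transaction: commit on success, roll back on error
        if claim_ids:
            # 2. Child tables keyed by claim ID.
            db.execute(f"DELETE FROM claim_persons WHERE claim_id IN ({marks})", claim_ids)
            db.execute(f"DELETE FROM claim_links WHERE claim_id IN ({marks})", claim_ids)
        # 3. The claims themselves.
        db.execute("DELETE FROM claims WHERE source_id = ?", (source_id,))
        # 4. The source row and its own child tables.
        db.execute("DELETE FROM source_files WHERE source_id = ?", (source_id,))
        db.execute("DELETE FROM source_people WHERE source_id = ?", (source_id,))
        db.execute("DELETE FROM sources WHERE id = ?", (source_id,))
```

After this, the upsert re-parses the one record file and re-inserts its rows.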

Build algorithm. (1) glob sources/**/*.md, people/**/*.md, places/places.yaml, notes/**/*.md; (2) parse each with the parsing layer; (3) insert in one transaction; (4) scan all prose bodies for TOKEN_RE → citations table; (5) glob asset trees for filenames carrying S-ids → files table reconciliation; (6) build FTS tables. Target: full rebuild of a mature archive (~5k sources, ~30k claims) in under a minute; correctness over speed.

Schema (DDL).

CREATE TABLE persons(
  id TEXT PRIMARY KEY, name TEXT NOT NULL, surname TEXT, sex TEXT,
  living TEXT NOT NULL,            -- 'true' | 'false' | 'unknown' (unknown = living for exports)
  tier TEXT NOT NULL, status TEXT DEFAULT 'active', merged_into TEXT,
  no_known_marriages INTEGER DEFAULT 0, no_known_children INTEGER DEFAULT 0, path TEXT NOT NULL);
CREATE TABLE person_variants(person_id TEXT, variant TEXT);
CREATE TABLE person_face_tags(person_id TEXT, tag TEXT);
CREATE TABLE person_files(person_id TEXT, kind TEXT, path TEXT, generated INTEGER DEFAULT 0,
  PRIMARY KEY(person_id, kind));   -- profile | research | timeline | sources-index; profile populates persons.path
CREATE TABLE person_external(person_id TEXT, system TEXT, ext_id TEXT);

CREATE TABLE sources(
  id TEXT PRIMARY KEY, title TEXT NOT NULL, source_type TEXT,
  date_edtf TEXT, date_min TEXT, date_max TEXT,
  repository TEXT, restricted INTEGER DEFAULT 0,
  source_class TEXT, publication_ok INTEGER,   -- flattened from rights.publication_ok
  status TEXT DEFAULT 'active',
  superseded_by TEXT, path TEXT NOT NULL);
CREATE TABLE source_files(
  source_id TEXT, path TEXT, role TEXT, copy TEXT,
  derived INTEGER DEFAULT 0, original_filename TEXT,
  exists_on_disk INTEGER, in_inventory INTEGER);   -- reconciliation flags

CREATE TABLE claims(
  id TEXT PRIMARY KEY, source_id TEXT NOT NULL, type TEXT NOT NULL,
  subtype TEXT, date_edtf TEXT, date_min TEXT, date_max TEXT,
  place_id TEXT, place_text TEXT, value TEXT NOT NULL, status TEXT NOT NULL,
  reviewed TEXT, confidence TEXT, information TEXT, evidence TEXT,
  asset TEXT, anchor TEXT, hypothesis TEXT,
  significance_override TEXT, significance_reason TEXT,
  negated INTEGER DEFAULT 0, notes TEXT);
CREATE TABLE claim_persons(claim_id TEXT, person_id TEXT, position INTEGER, role TEXT);
CREATE TABLE claim_links(claim_id TEXT, rel TEXT, target_id TEXT); -- corroborates|contradicts
CREATE TABLE source_people(source_id TEXT, person_id TEXT);  -- people: any source (speakers, photo subjects, household)

-- Derived, rebuildable edge list for graph traversal (NOT new truth — flattened from
-- accepted relationship/marriage/divorce/death claims; every edge keeps its source claim_id).
CREATE TABLE relationships(
  person_id TEXT, rel TEXT,            -- parent | child | spouse | friend | associate | neighbor
  other_id TEXT, claim_id TEXT,        -- provenance: the claim this edge came from
  date_start TEXT, date_end TEXT);     -- spouse edges: marriage date / divorce|death date

CREATE TABLE places(id TEXT PRIMARY KEY, name TEXT, hierarchy TEXT,
  within TEXT, lat REAL, lon REAL);   -- within: recursed for containment queries
CREATE TABLE place_names(place_id TEXT, alt_name TEXT);
CREATE TABLE place_history(place_id TEXT, period_edtf TEXT, date_min TEXT, date_max TEXT, hierarchy TEXT);

CREATE TABLE search_log(date TEXT, person_id TEXT, question TEXT,
  repository TEXT, collection TEXT, terms TEXT, result TEXT, source_id TEXT, path TEXT);

CREATE TABLE hypotheses(id TEXT PRIMARY KEY, person_id TEXT, hypothesis TEXT,
  basis TEXT, verify TEXT, origin TEXT, status TEXT, verified_claim TEXT, path TEXT);

CREATE TABLE citations(token TEXT, kind TEXT, path TEXT, line INTEGER);

CREATE VIRTUAL TABLE notes_fts USING fts5(path, content);
CREATE VIRTUAL TABLE transcripts_fts USING fts5(source_id, path, content);

Relationship derivation (during index build, after claims load): for each accepted claim — relationship subtype: child-of → a child edge (subject→parents) and reciprocal parent edges; subtype: spouse-of and type: marriage → reciprocal spouse edges dated from the claim, date_end backfilled from any divorce claim between the pair or a death claim on either; social subtypes → friend/associate/neighbor edges. Edges are pure cache: dropping and rebuilding relationships from claims is a no-op on truth. Bilinear by nature — traversal follows edges wherever they lead, including in-law branches.
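A reduced sketch of the edge flattening, covering only child-of and spouse edges and leaving out dates and the date_end backfill. Several things here are assumptions: claims arrive as dicts, the first entry of persons is the subject with the remaining entries as the other parties, and rel names the role of person_id relative to other_id.

```python
def derive_edges(claims):
    """Flatten accepted claims into (person_id, rel, other_id, claim_id) rows."""
    edges = []
    for claim in claims:
        if claim.get("status") != "accepted" or not claim.get("persons"):
            continue  # only accepted claims produce edges
        subject, *others = claim["persons"]
        ctype, subtype = claim.get("type"), claim.get("subtype")
        if ctype == "relationship" and subtype == "child-of":
            for parent in others:
                edges.append((subject, "child", parent, claim["id"]))
                edges.append((parent, "parent", subject, claim["id"]))  # reciprocal
        elif ctype == "marriage" or (ctype == "relationship" and subtype == "spouse-of"):
            for spouse in others:
                edges.append((subject, "spouse", spouse, claim["id"]))
                edges.append((spouse, "spouse", subject, claim["id"]))  # reciprocal
    return edges
```

Every row keeps the claim ID it came from, so dropping the table and rerunning this over the claims reproduces the same edges.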

Derived significance is computed at query time from the SPEC §8.2 table (shipped in _lib.py as SIGNIFICANCE: dict), honoring overrides.

Canonical queries (shipped as views): person timeline; vitals completeness per person; suggested backlog per source; contradiction pairs lacking questions.


3. fha lint — the linter

Purpose. Verify the archive against SPEC. Report-only by default. Runs file-by-file plus cross-file passes over a fresh in-memory index (it builds its own; it must not require fha index to have run).

Checks. Errors:

| Code | Check | Detection |
|---|---|---|
| E001 | Duplicate ID | A non-person record ID in two records’ frontmatter is an error. For P-ids, one primary profile plus its companion files (_research, _timeline, _sources-index) may share the ID; two primary profile files with one P-id is the error. |
| E002 | Malformed ID / filename | filename fails the §13 grammars; ID fails ID_RE |
| E003 | Filename ID ≠ record ID | compare filename suffix to frontmatter id |
| E004 | Orphan reference | any [token], persons:, place:, corroborates/contradicts target not found |
| E005 | Referenced person lacks a stub | P-id appears anywhere but no person record parses with it |
| E006 | accepted claim missing reviewed | field check |
| E007 | Claim type outside vocabulary | membership in §8.2 set |
| E008 | Significance override without reason | field pair check |
| E009 | contradicts: without an open question | search notes/questions.md + research files for both C-ids; absent → error |
| E010 | Frontmatter schema failure | required fields per record kind |
| E011 | Inventory ↔ disk mismatch | Either a files: entry is missing on disk (path resolved via fha.yaml), or an on-disk file carrying this S-id, found by filename (documents root) or by embedded SOURCE: keyword scan (photos root, --with-exif), is absent from the inventory. Exception: status: missing-fixture under example-archive/ or tests/ (any directory named tests) is suppressed entirely (no finding emitted); a missing-fixture in a real archive (not under those roots) is an error. |
| E012 | Embedded source-keyword disagreement | documents root: SOURCE: keyword must agree with the filename S-id and inventory. Photos root: SOURCE: must agree with the record inventory (photos carry no filename S-ids by design). (--with-exif, slower) |
| E013 | Summary block drift | parse **Born/Died/Married/Parents/Children:** lines; compare cited [S-]/[P-] ids and dates against accepted claims for that person; mismatch → error, summary line lacking any accepted claim → warning W104 |
| E014 | Non-EDTF date | validation regex |
| E015 | type: relationship claim missing roles: | field check |
| E016 | New claim references a merged person directly | resolve via merged_into; flag for cleanup |
| E017 | DNA source not restricted: true, or DNA file outside documents/dna/ | field + path check |
| E018 | Agent-instruction drift | AGENTS.md / skills reference deprecated commands (fha promote), stale skill names, or contradict locked rules (photo renames) |

Warnings / reports: W101 vitals gaps per person (the completeness report) · W102 suggested-claim backlog per source · W103 stale folder bracket lists vs relationship claims · W104 summary line without supporting accepted claim · W105 hand-edits under a GENERATED header · W106 accepted claims missing Mills analysis fields (informational; cleanup-session fodder) · W107 direct references to merged persons (gradual cleanup list) · W108 README.md older than the last SPEC.md change (the README rule) · W109 non-vital or low-confidence claim missing notes context (the context nudge); also used as the catch-all code for unrecognized source_type vocabulary and for file-format issues surfaced by --format-check (missing final newline, CRLF line endings) — a future cleanup may give these their own codes · W110 Ahnentafel placement issue (requires root_person in fha.yaml): a direct-line couple folder’s numeric prefix does not match the derived Ahnentafel number, or a direct-line person’s files live in the wrong couple folder — fha views brackets --fix resolves both.

Formatter (fha lint --format-check / --format-write): conservative normalization only — frontmatter and claim key order, lowercase IDs, blank-line and final-newline hygiene, YAML list indentation. Never rewrites prose beyond trailing whitespace.

Fix modes (each gated behind an explicit flag; use --dry-run to preview what would change before writing): --mint-stubs (E005 → create stubs in people/stubs/), --spawn-questions (E009 → append templated question to notes/questions.md), --fix-inventory (E011 → regenerate files: from the ID-glob, preserving hand-written role/original_filename where the path matches).

Required claim fields enforced by E010: id, type, persons, value, status. Note: confidence is required per SPEC §8.5 but is not in the E010 required-field set for milestone 1 — it is derived by tooling from source_type and the linter does not yet enforce its presence. Future work: add confidence to E010 enforcement, or emit a dedicated warning for accepted claims lacking it.

Summary-block parsing detail (E013). The block is the run of **Label:** … entries after the H1, before the first ## Section header. Labels are: Born, Died, Married, Parents, Children. The parser must handle both layout styles: one label per line, and several labels inline on a single line.

Implementation: scan the summary text (up to the first ## Section) with finditer on the **Label:** pattern; split into segments between consecutive label positions; extract [S-id] and [P-id] tokens from each segment. Do not split by line — line splitting fails on the inline form.

Comparison rule: each [S-id] in a segment must correspond to an accepted claim of the matching type (birth/death/marriage) for this person from that source; [P-id]s in Parents/Children must match accepted relationship claims. Display text is not parsed for equality — the citation is the contract.
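The segment scan described above can be sketched as follows. The function name parse_summary and the list-of-tuples return shape are assumptions; the token pattern is the TOKEN_RE from §1. The comparison against accepted claims is left to the caller.

```python
import re

LABEL_RE = re.compile(r'\*\*(Born|Died|Married|Parents|Children):\*\*')
TOKEN_RE = re.compile(r'\[([PSCLH]-[0-9a-hjkmnp-tv-z]{10})\]', re.I)
SECTION_RE = re.compile(r'^## ', re.M)


def parse_summary(body):
    """Return [(label, [cited ids])] for the summary block of a person profile."""
    # The summary ends at the first "## Section" header (or at end of text).
    cut = SECTION_RE.search(body)
    summary = body[:cut.start()] if cut else body
    marks = list(LABEL_RE.finditer(summary))
    result = []
    for i, mark in enumerate(marks):
        # A segment runs from the end of this label to the start of the next one,
        # which works whether labels are on separate lines or share one line.
        end = marks[i + 1].start() if i + 1 < len(marks) else len(summary)
        segment = summary[mark.end():end]
        result.append((mark.group(1), TOKEN_RE.findall(segment)))
    return result
```

Splitting on label positions rather than on newlines is what lets the inline form parse correctly.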


3a. fha doctor — archive health

One command answering “what is wrong with this archive?”: archive root found; fha.yaml parses; every mapped root reachable; exiftool and Python deps present; index and photoindex freshness (age vs. newest record mtime); one-line lint summary (E/W counts); inbox items older than 14 days; counts of restricted sources, living and unknown-living persons; E018 agent-drift findings; reminder line that archive + mapped roots must both be in the backup policy (the spec takes no position on backup tooling). Exit codes as §1. Run it first after any migration or machine move. Stated tradeoff: this spec does not attempt to detect bit rot or silent file alteration; that preservation concern is deliberately deferred to the backup strategy, outside the archive format. A future fha doctor --fixity could add optional checksum verification without making checksums part of the research model.

4. fha id — minting

fha id mint P [-n 5] → prints fresh IDs. Algorithm: draw 10 chars from the Crockford Base32 alphabet 0123456789abcdefghjkmnpqrstvwxyz via secrets.choice (lowercase, omitting i l o u); check non-existence by (a) ripgrep-style scan of the tree for the candidate, or (b) the index if fresh (--fast); retry on the ~impossible collision. fha id check <ID> → where it appears. Dice-roll fallback documented in output of fha id --help (hand-minting: any 10 Base32 chars from the alphabet above + a search to confirm absence). On input, IDs are normalized to lowercase before matching, so a hand-typed uppercase ID still resolves.
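A minimal sketch of the minting loop. The exists callback is an assumption standing in for either the tree scan or the index lookup; mint_id is a hypothetical name.

```python
import secrets

# Crockford Base32, lowercase, omitting i, l, o, u.
ALPHABET = "0123456789abcdefghjkmnpqrstvwxyz"
KINDS = ("P", "S", "C", "L", "H")


def mint_id(kind, exists, length=10):
    """Draw random IDs until one is not already present in the archive.

    `exists` is any callable taking a candidate ID and returning True
    when that ID already appears somewhere (tree scan or fresh index).
    """
    if kind not in KINDS:
        raise ValueError(f"unknown ID kind: {kind!r}")
    while True:
        candidate = f"{kind}-" + "".join(secrets.choice(ALPHABET) for _ in range(length))
        if not exists(candidate):
            return candidate
```

With 32 symbols and 10 positions there are 32**10 possible suffixes per kind, so the retry branch exists for correctness rather than because collisions are expected.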


4a. fha find — the universal locator

fha find <ID> answers “where does this thing live?” for any ID — necessary because photos aren’t renamed, so disk search alone can’t locate a photo by S-id. Output by type: S-id → record path, every asset file (paths resolved through fha.yaml, located via filename for documents / inventory+keyword for photos), citation sites, claim count by status. P-id → person file, couple folder, claims naming them, photo count (via §9 resolution), citation sites. C-id → its source record + line, status, links in/out. L-id/H-id → record + every reference. fha find <text> falls through to FTS across notes, transcripts, captions, and comments. (fha id check is an alias.)

fha find --related <ID> — “show me everything adjacent to this,” for any ID type, ranked, pure query over the index (no schema change). What counts as adjacent depends on what you point at — each type has a natural neighborhood the index already holds:

fha find --related --date <EDTF> — the time neighborhood, orthogonal to ID type: all claims (and the people, sources, and photos behind them) whose dates overlap the given EDTF range, via the index’s date_min/date_max bounds (§1) and photo EDTFs. “Who and what was active 1869–1874.” Combinable: --related <L-id> --date 187X = the place, narrowed to a decade.
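The time neighborhood reduces to an interval-overlap test on the stored bounds. A sketch against the claims table, assuming the caller has already turned the EDTF argument into ISO bounds (for example with edtf_bounds from §1); claims_overlapping is a hypothetical name and ranking is omitted.

```python
import sqlite3


def claims_overlapping(db: sqlite3.Connection, lo: str, hi: str):
    """Return IDs of claims whose [date_min, date_max] overlaps [lo, hi]."""
    # Two closed intervals overlap when each one starts before the other ends.
    # Zero-padded ISO strings compare correctly as text; undated claims
    # (NULL bounds) never match.
    rows = db.execute(
        "SELECT id FROM claims "
        "WHERE date_min <= ? AND date_max >= ? "
        "ORDER BY date_min, id",
        (hi, lo),
    )
    return [row[0] for row in rows]
```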

fha find --text "…" — full-text search across everything textual: record bodies and notes, transcripts, and the photo/document index caption/comment/keyword fields (§9). Returns hits with their record or asset and context. Cheap because the corpus is plain text plus the two SQLite FTS tables; one query spans prose and media metadata.

Together these make find the connection-discovery primitive: any ID or date or phrase → its neighborhood, ranked, every edge carrying provenance — the raw material the report, the FAN view, and cooccur all build on. Uses the index when fresh; degrades to a tree scan with a warning when not. (fha id check is an alias for the bare locator.)

5. fha stubs — stub minter

Scan claims for P-ids without person records (E005 set) → create people/stubs/{surname}__{given}_{P-id}.md with minimal frontmatter (tier: stub). Name/surname resolution: from claim value text where parseable, else unknown__unknown_{P-id}.md flagged for hand-rename. Also supports --from-names "Ethel Hartley; Frances Hartley" interactive minting (new IDs + stubs in one step). Never overwrites; never moves a stub out of stubs/ (placement into couple folders is a human act).


6. fha process — processing assistant

fha process <file|folder> [--type photo --title "…" --slug …]

Folder mode: runs the triage scorer (§15b) over the folder, prints ranked candidates with signals, and processes only the ones the human selects.

Input is a bare file, a source-stub sidecar (one asset + *.notes.md), a source-stub bundle folder (multiple files + a notes.md, e.g. recording + transcript; each file becomes a role-tagged version sharing the minted S-id, the folder dissolves, SPEC §12.1), a variation group (see below), or a folder to triage. File mode is transactional (any failure → roll back); its numbered steps are listed at the end of this section, after the variation-group material.

Variation detection — implicit grouping from a mixed folder. A real photo library has variation-siblings scattered among unrelated photos: three scans of the same portrait at different resolutions, a front-and-back pair, a color original and a restored version. They don’t need a bundle folder — fha process (and photoindex) detects them automatically using two tiers:

Tier 1: deterministic (free, always on). Files sharing the same basename root with only a suffix variation (portrait_1880.jpg, portrait_1880_back.jpg, portrait_1880b.jpg, portrait_1880b_back.jpg) are flagged as a candidate group. Known role suffixes: -b, -c, etc. (multiple scans of the same original photo, though each may carry separate context); b, c, etc. without a dash (variation letters do not always start with a dash); -negative (the film negative of the photo); -front (fronts are occasionally tagged alongside backs); -page-N (one page of a multi-page photo book); -back (reverse side); -bw (greyscale version); -crop (cropped). Any unrecognized suffix is kept as a freeform role. The primary is the file with the shortest/root name; the others are listed as variants.

NOTE on page-N variations. These usually arise because, for photo books, the whole page is scanned along with each individual photo on it. Page-N variations therefore share less context than other variations, but they may still be relevant for documenting the original source.

Tier 2 — model-assisted (optional, gated behind --with-vision): a vision pass over candidate pairs to confirm perceptual similarity — catches variations with unrelated names (e.g. scan001.jpg and scan002.jpg that are actually front/back of one photo). Expensive; used only on ambiguous cases Tier 1 can’t resolve. Backlog until the core tools exist.

Filename parsing algorithm. Suffixes are stripped in this fixed priority order (implemented in parse_media_filename):

  1. Strip -crop suffix first → is_crop = True. Crop stacks on any other suffix combination.
  2. Strip part-kind suffix: -negative checked before -back / -front / -pageN.
  3. Detect trailing single-letter variant ID — hyphen-letter (-b) or letter immediately after a digit (034b).
  4. Remaining stem → base_id.

Result: ParsedName(base_id, variant_id, part_kind, page_num, is_crop), where part_kind ∈ "front" | "back" | "page" | "negative" | "none". Semantics: -negative is mutually exclusive with -front, -back, and -pageN. It marks a scan of the physical film or glass-plate negative; in the grouper it is always stored at the stem level regardless of any variant letter (a negative is source material for the root image, not a print variant). -crop marks a derivative detail image (zoomed text, face close-up); crops are stored in dedicated _crop slots alongside their parent scan and supplied as supplementary AI context, never treated as independent primary images. A missing -page1 is never inferred by the parser: if a group contains explicit -pageN siblings, the grouper may promote an untagged file to page 1, but the parser itself never does this.
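The four stripping steps can be sketched like this, following the hyphenated suffix grammar of the numbered list (so the underscore-style names in the Tier 1 examples are not handled here). The regexes are assumptions about the exact grammar; only the step order and the ParsedName fields come from the text.

```python
import re
from pathlib import PurePosixPath
from typing import NamedTuple, Optional


class ParsedName(NamedTuple):
    base_id: str
    variant_id: Optional[str]
    part_kind: str            # "front" | "back" | "page" | "negative" | "none"
    page_num: Optional[int]
    is_crop: bool


# -negative is listed first; only one part-kind suffix can end the stem.
_PART_RE = re.compile(r'-(negative|back|front|page(\d+))$', re.I)
# Variant letter: "-b", or a letter directly after a digit ("034b").
_VARIANT_RE = re.compile(r'(?:-([a-z])|(?<=\d)([a-z]))$', re.I)


def parse_media_filename(filename):
    stem = PurePosixPath(filename).stem
    # 1. -crop comes off first and stacks on any other suffix combination.
    is_crop = stem.lower().endswith("-crop")
    if is_crop:
        stem = stem[:-len("-crop")]
    # 2. Part-kind suffix.
    part_kind, page_num = "none", None
    if m := _PART_RE.search(stem):
        if m.group(2) is not None:
            part_kind, page_num = "page", int(m.group(2))
        else:
            part_kind = m.group(1).lower()
        stem = stem[:m.start()]
    # 3. Trailing single-letter variant ID.
    variant_id = None
    if m := _VARIANT_RE.search(stem):
        variant_id = (m.group(1) or m.group(2)).lower()
        stem = stem[:m.start()]
    # 4. Whatever remains is the base ID. No page 1 is ever inferred here.
    return ParsedName(stem, variant_id, part_kind, page_num, is_crop)
```

Note that the digit-then-letter rule is greedy by design: a stem that legitimately ends in a digit followed by a letter would be read as carrying a variant ID.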

Multi-image batch types. When a folder contains multiple images for one source, classify the set before processing:

| Type | Description | Treatment |
|---|---|---|
| A: Variant scans | Same physical side/object, different crop/exposure/rotation/quality | Merge into one source |
| B: Front/back | One or more front scans and one or more back scans of the same physical photo, postcard, or document | Merge into one source |
| C: Multi-page document | Different pages of a booklet, album, scrapbook, or document set (filenames -page1, -page2, …) | One ordered document set; preserve page order exactly; transcribe text across all pages labeling each section [Page 1], [Page 2], … in captions |
| D: Helper crops | Small crops from a larger page to aid legibility of text or a detail | Supplementary views of their parent page only, not independent pages or sources |

Processing a variation group: when a group is detected, fha process surfaces it with a confirmation prompt:

Found 3 files that appear to be variations of the same photo:
  portrait_1880.jpg        [primary]
  portrait_1880_back.jpg   [role: back]
  portrait_1880_bw.jpg     [role: bw]
Process as ONE source (shared S-id) or separately?  [one / separate / skip]

On one: a single S-id is minted. The photos stay exactly where they are, with names and locations unchanged. The root photo is flagged as is_primary=true. The source record’s inventory is updated to list each file at its existing path with a role annotation:

files:

The SOURCE: S-xxxxxxxxxx keyword is written into each file’s embedded metadata via exiftool so identity travels with the file if it moves. That is the only write to the files themselves — content and filename are untouched.

On separate: each is processed as its own source. If one is clearly a derivative of another, a provenance note is suggested. On skip: deferred to the next session.

The role annotations in files: are the only thing binding the variants — role lives in the record and the embedded SOURCE: keyword, not the filename. This is the opposite of documents-root files, which do carry the S-id in the filename precisely because they can be renamed; photos cannot be, so the record carries the meaning instead. The photo_groups index table caches the grouping for fast “show me all variants of this photo” queries.

File-mode steps: (1) mint S-id; (2) mark identity: in the documents root, rename to {slug}_{S-id}.{ext} in place (record original_filename); in the photos root, NEVER rename (Lightroom catalog integrity), keyword only; (3) exiftool -keywords+="SOURCE: {S-id}" -overwrite_original_in_place where the format supports keywords; (4) scaffold sources/{type}/{slug}_{S-id}.md from the §14 template, inventory pre-filled, empty ## Claims; (5) print the path. --more FILE role[:copy] attaches versions. Refuses files already carrying an S-id (filename or keyword).
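A sketch of the refusal check and the keyword write for step (3). The helper names and FILENAME_SID_RE are hypothetical. A filename-specific pattern is used because the \b in ID_RE does not fire between an underscore and the S of {slug}_{S-id}, both being word characters. The exiftool arguments are the ones named in the step; the command is only built here, not run.

```python
import re

# Assumed filename grammar: {slug}_{S-id}.{ext}
FILENAME_SID_RE = re.compile(r'_(S-[0-9a-hjkmnp-tv-z]{10})\.[^.]+$', re.I)


def carries_source_id(filename, keywords):
    """True when a file is already identified, by filename or SOURCE: keyword."""
    in_name = FILENAME_SID_RE.search(filename) is not None
    in_keywords = any(k.startswith("SOURCE:") for k in keywords)
    return in_name or in_keywords


def keyword_command(path, source_id):
    """Argument list for writing the SOURCE: keyword into a file in place."""
    return [
        "exiftool",
        f"-keywords+=SOURCE: {source_id}",
        "-overwrite_original_in_place",
        path,
    ]
```

Passing the list to subprocess.run without a shell keeps the space inside the keyword value intact.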

The deterministic command is Stage A of the process pipeline; Stages B (AI draft: read the file incl. vision, resolve names/places against the index, draft suggested claims with anchors, pull ## Stories) and C (review) are the process-source and review-claims skills (§16). Lint E012 for photos checks keyword↔inventory agreement (no filename carrier).


7. View generators — fha views

All write GENERATED-headed .md into the tree; all derive purely from the index.


8. fha packet — person data dump

fha packet P-de957bcda1 [-o out/] [--include-research] [--include-restricted]

Gathers, as copies, into packet_{surname}_{P-id}_{date}/ then zips:

README.txt        ← generated manifest + "derived export, not research data" disclaimer + date
profile/          ← person .md (+ research file only with --include-research)
timeline.md       ← freshly generated
sources/          ← every source record citing the person
files/            ← those sources' files (copied, original names kept)
photos/           ← ALL photos of the person (see below)

Photo gathering (the “all photos of grandma” requirement): union of (a) photos carrying the bare {P-id} keyword; (b) photos whose face-region/people tags match the person’s face_tags: exactly; (c) name/name_variants matches — listed in the README as name-matched, unverified; (d) files of sources citing the person. Requires the photo index (§9); refuses with a clear message if it is missing/stale unless --no-photos.

Audience: fha packet is a family/private research export, not a public publication format; included materials may mention other living people. Public sharing goes through fha site --standalone (or another exporter), which redacts living and unknown-living persons by default. The cautions in the packet README name both living: true and living: unknown persons.

Privacy enforcement: living: unknown is treated as living. Sources with restricted: true are excluded (listed by ID only in the README) unless --include-restricted is given. DNA sources are never included by default and stay excluded even with --include-restricted; only --include-dna includes them. Any other person in the packet’s materials who is living (living: true, or living: unknown per the rule above) is named in a README caution.


9. fha photoindex — photo metadata catalog

Purpose. The photo library currently lives in photo-organizing software (Lightroom); finding anything must not require opening that software. This tool scrapes embedded metadata for the entire photos/ tree into .cache/photos.sqlite, a disposable catalog that makes the library searchable in milliseconds. (It reads files, never the external catalog: embedded metadata is the durable layer. If a good open-source media browser is later adopted, it slots in at the interface layer; this catalog stays the scriptable surface. Candidates are evaluated in the owner’s private tool log.)

Scrape — the field set that matters (batched exiftool -j … -r photos/, one process, JSON out, ~50–100 files/sec; incremental by (path, mtime, size); --full rebuilds):

| Field | Meaning in this library |
|---|---|
| Title | Usually null; set only when the photo’s identity is unmistakable |
| Caption (Description) | What is written in or on the photo: a direct transcription of contents |
| UserComment | Contextual info that may not be on the photo, incl. the pipeline’s AI summary; the richest text field |
| DateTimeOriginal | Actual/estimated original-creation date (paired with the DATE: confidence pattern → EDTF) |
| Sub-location (“location”) | Neighborhood / specific area within a city |
| City / State / Country | Place hierarchy as tagged |
| GPS lat/lon | Authoritative when present (logger-recorded or manually verified); never second-guessed, and a backfill source for places.yaml coords |
| Keywords/Subject | Incl. SOURCE: S-…, bare P-… person ids, DATE: patterns |
| Face regions | XMP-mwg-rs RegionInfo (the trickiest: structured regions object, parsed for names + areas) → face-tag strings |

Schema.

CREATE TABLE photos(path TEXT PRIMARY KEY, mtime REAL, size INTEGER,
  title TEXT, caption TEXT, user_comment TEXT,        -- §field table above
  exif_date TEXT, date_pattern TEXT, edtf TEXT,       -- pattern→EDTF via SPEC §20 table
  sublocation TEXT, city TEXT, state TEXT, country TEXT,
  gps_lat REAL, gps_lon REAL,                         -- authoritative when present
  source_id TEXT,                                     -- from SOURCE: keyword
  group_id TEXT, is_primary INTEGER DEFAULT 0,        -- variation grouping (below)
  variant_copy TEXT, variant_role TEXT);              -- parsed from filename suffixes
CREATE TABLE photo_groups(group_id TEXT PRIMARY KEY, primary_path TEXT,
  edtf_resolved TEXT, date_conflict INTEGER DEFAULT 0, file_count INTEGER);
CREATE TABLE photo_keywords(path TEXT, keyword TEXT);
CREATE TABLE photo_people(path TEXT, person_ref TEXT, via TEXT); -- via: pid-keyword | face-tag | name-match
CREATE VIRTUAL TABLE photo_fts USING fts5(path, title, caption, user_comment, keywords);

Variation grouping. Versions of one physical photo must index as one logical photo. Group key, in priority order: (1) shared S-id (filename or SOURCE: keyword) — processed photos group by source; (2) same directory + same base stem after stripping the recognized suffix grammar [-{copy}][-{role}] (b, c, -b, -c, -negative, -back, -front, -page-N) — the pipeline’s own naming convention; “text from alternate version” keywords corroborate a grouping but never create one (too fuzzy). (Note: copy letters are not always preceded by a dash, but they always come last in the stem, immediately before any additional tags.) Grouping is conservative: never across directories, never on caption similarity. The primary variant is the front of copy a (fallback: lexicographically first); search results and the packet’s photo gathering return groups, copying all variants but counting the photo once (--files exposes raw rows).
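
A rough sketch of group key (2), stripping the suffix grammar from a filename stem. The regex and the name `parse_variant` are illustrative assumptions; in particular, this sketch only accepts a dashless copy letter directly after a digit, which is one conservative reading of the note above and not a rule stated in this document. Compound roles such as front-crop are left out.

```python
import re
from pathlib import PurePosixPath

# Assumption: a dashless copy letter is only recognized directly after a digit,
# so ordinary words such as "family" are not parsed as "famil" + copy "y".
SUFFIX_RE = re.compile(
    r"^(?P<base>.+?)"
    r"(?:(?:-|(?<=\d))(?P<copy>[a-z]))?"
    r"(?:-(?P<role>negative|back|front|page-\d+))?$"
)

def parse_variant(path):
    """Return ((directory, base_stem), copy_letter, role) for an alias-form path."""
    p = PurePosixPath(path)
    m = SUFFIX_RE.match(p.stem)
    group_key = (str(p.parent), m["base"])  # never groups across directories
    return group_key, m["copy"], m["role"]
```

Two files land in the same group when their `group_key` tuples are equal; the copy letter and role only describe the variant within that group.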

Group data structure. After grouping, each stem’s entry holds the following slots (parallel to the group_folder_images output used by the photo pipeline, see §6 for the parsing algorithm):

Negative grouping rule: negatives are stored at the stem level regardless of any variant letter in their filename — a negative is source material for the root image, not an A/B variant of the print. For page sets, all_fronts is the sorted page list and all_backs is all-None; photo_groups in the schema caches this structure for “show me all variants” queries. The photos.variant_role column holds the compound role value (front, back, front-crop, back-crop, negative, negative-crop, page-1, …).

Group date resolution. Each variant may carry its own DATE: pattern and EXIF date (different backs say different things — that is evidence, not noise). The group’s edtf_resolved is the best-confidence variant’s EDTF: score by number of ! components (D > M > Y), then ~ over ?; deterministic tie-break by path. If any two variants’ EDTF bounds (per §1 edtf_bounds) fail to overlap, set date_conflict = 1 — and fha photoindex report lists all conflicted groups, because a date disagreement between the front and the back of the same photo is a research finding worth a question, not a value to silently average. Person/keyword attributes aggregate as the union across variants.
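
The resolution rule can be sketched as a sort key plus a pairwise overlap check. The variant fields used here (`exact` for the count of ! components, `qualifier`, `bounds` as ISO date strings) are hypothetical stand-ins for whatever the pattern parser and edtf_bounds produce, and ranking an unqualified date above ~ is an assumption of the sketch.

```python
from itertools import combinations

# Assumption: an unqualified date outranks "~", which outranks "?".
QUALIFIER_RANK = {"": 2, "~": 1, "?": 0}

def resolve_group(variants):
    """Return (edtf_resolved, date_conflict) for one photo group."""
    best = min(
        variants,
        key=lambda v: (-v["exact"], -QUALIFIER_RANK[v["qualifier"]], v["path"]),
    )
    # Two closed intervals fail to overlap when one ends before the other starts.
    conflict = any(
        a["bounds"][1] < b["bounds"][0] or b["bounds"][1] < a["bounds"][0]
        for a, b in combinations(variants, 2)
    )
    return best["edtf"], int(conflict)
```

Comparing bounds as strings works only because both ends are full YYYY-MM-DD dates; a real implementation would compare whatever type edtf_bounds returns.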

photo_people resolution, in confidence order: (1) bare P-… ID keywords (regex ^P-[0-9a-hjkmnp-tv-z]{10}$, case-insensitive) → via=pid-keyword, authoritative; (2) face-region/people-tag strings matched exactly against person records’ face_tags: → via=face-tag; (3) name/name_variants matches → via=name-match; (4) caption/comment hits → weakest, flagged. A tag string matching multiple persons is ambiguous, never guessed — surfaced for tag-person resolution. fha photoindex tag-person <P-id> [--from-face-tag "X" | paths…] writes the bare P-id keyword (via exiftool, previewed list first) across a face-tag match or onto specific photos — making identifications in-file durable and settling same-name collisions.
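
The first two resolution tiers can be sketched as follows. The regex is the one given above; the function name `resolve_people` and the `face_tag_index` mapping (tag string to the list of person ids declaring it) are assumptions for illustration, and the example ids in the usage note are made up apart from the one quoted elsewhere in this document.

```python
import re

PID_RE = re.compile(r"^P-[0-9a-hjkmnp-tv-z]{10}$", re.IGNORECASE)

def resolve_people(keywords, face_strings, face_tag_index):
    """Return (rows, ambiguous): (person_ref, via) pairs and unresolved tag strings."""
    rows, ambiguous = [], []
    for kw in keywords:
        if PID_RE.match(kw):
            rows.append((kw, "pid-keyword"))  # authoritative
    for tag in face_strings:
        owners = face_tag_index.get(tag, [])
        if len(owners) == 1:
            rows.append((owners[0], "face-tag"))
        elif len(owners) > 1:
            ambiguous.append(tag)  # never guessed; left for tag-person
    return rows, ambiguous
```

A keyword such as P-cd795c61e0 matches the pattern, while a string containing i, l, o, or u after the prefix does not, because those letters are outside the character class.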

Reconciliation — fha photoindex reconcile (and the general fha reconcile). Because assets are organized by moving files (and other systems rename them), the on-disk reality drifts from the index’s stored paths. Reconcile compares disk to index across all asset trees: files whose path still matches → untouched; a stored path now missing on disk → re-match by embedded ID (SOURCE: keyword for photos, filename S-id for documents) and, on a hit, update the stored path silently; files on disk that the index doesn’t know → log as new (and, if they carry a SOURCE:/S-id, attach to the source’s inventory). Anything unmatchable is reported for human attention. This is why folder location is never truth: identity rides in the file (keyword/ID), and paths are a refreshable cache. Runs incrementally; folded into fha doctor and the report’s freshness step. fha reconcile applies the same disk↔index path-healing to every file type, not just photos.
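
The disk-to-index comparison can be sketched with two mappings of path to embedded ID. The function name, the mapping shape, and the decision to treat several candidate files with the same ID as unmatched are assumptions of this sketch; the ids in any example are made up.

```python
def reconcile(index, disk):
    """index and disk both map path -> embedded id (or None when the file has none).

    Returns (healed, new, unmatched): stored-path -> new-path updates, paths the
    index has never seen, and stored paths needing human attention.
    """
    # Files on disk that the index does not know, grouped by embedded id.
    unknown_by_id = {}
    for path, eid in disk.items():
        if eid and path not in index:
            unknown_by_id.setdefault(eid, []).append(path)

    healed, unmatched = {}, []
    for path, eid in index.items():
        if path in disk:
            continue  # path still matches: untouched
        candidates = unknown_by_id.get(eid, []) if eid else []
        if len(candidates) == 1:
            healed[path] = candidates[0]  # moved file found by its id
        else:
            unmatched.append(path)  # no id, no hit, or ambiguous

    claimed = set(healed.values())
    new = sorted(p for p in disk if p not in index and p not in claimed)
    return healed, new, sorted(unmatched)
```

Because identity comes from the embedded ID, a file moved to a different folder is healed without any reference to its old location.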

Query. fha photoindex find --person P-… | --keyword … | --edtf 192X | --text "…" → paths.


10. fha places — registry + geocoder

fha places lint (orphan L-ids, duplicates by normalized name, dangling/cyclic within: links, within: pointing at a non-settlement)

fha places candidates — the recurrence detector: normalize unlinked claim place_text (case, punctuation, abbreviation expansion: St/Street, Co/County), cluster near-variants (token-set match), emit groups ≥3 occurrences with their claims and date spread; plus photo-GPS clusters (≥3 photos within ~150m) near no known place’s coords. Elevation is a guided flow: human confirms → mint L-id (+within:) → per-claim place: backfill, each shown (same text across decades may be different buildings); place_text never altered. This deterministic-cluster → human-confirmed-elevation pattern is the template for future recurring-people (FAN) detection.

fha places geocode — backfill coords/alt_names. Gazetteer: GeoNames offline dump (cities15000 + allCountries as needed), downloaded once into .cache/geonames/; no live API dependency. Match name + hierarchy tokens against GeoNames name/admin1/admin2/country; on a unique high-confidence hit, propose coords and alternate names; every write requires interactive confirmation (place identity is a research judgment, not a string match). Writes preserve YAML comments (ruamel.yaml permitted here, or regenerate the file wholesale with a GENERATED note for hand-restoration of comments).
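
The text half of the recurrence detector in fha places candidates can be sketched as normalize-then-bucket. This sketch treats "token-set match" as exact equality of normalized token sets, which is the simplest reading and an assumption; the abbreviation table holds only the two pairs named above, and the names `normalize` and `place_candidates` are hypothetical.

```python
import re
from collections import defaultdict

ABBREVIATIONS = {"st": "street", "co": "county"}  # only the pairs named in the text

def normalize(place_text):
    """Lowercase, drop punctuation, expand abbreviations, ignore word order."""
    tokens = re.findall(r"[a-z0-9]+", place_text.lower())
    return frozenset(ABBREVIATIONS.get(t, t) for t in tokens)

def place_candidates(claims, min_count=3):
    """claims: (claim_id, place_text) pairs for claims with no place: link."""
    clusters = defaultdict(list)
    for claim_id, place_text in claims:
        clusters[normalize(place_text)].append((claim_id, place_text))
    # Original place_text is carried through unchanged for the human to review.
    return [members for members in clusters.values() if len(members) >= min_count]
```

Expanding "St" unconditionally would misread "St Louis", which is one reason the output is a candidate list for human confirmation and never an automatic link.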


11. fha convert-mining — interview converter

Migrates a legacy transcript-mining pipeline output (facts.txt table rows, stories.txt blocks, questions.txt blocks, sources.txt, alias files) into conformant records.

  1. Sources first: each legacy S###/transcript → copy transcript into documents/interviews/, process (mint S-id, rename, scaffold record with people: resolved via alias files → P-ids, source_type: interview), record the legacy extraction pass in ## Notes (model: gpt-4-class, dates from run headers).
  2. Facts → claims: parse the markdown table rows + Update(T###) continuation lines (merge updates into the claim’s notes). Field map: Claim→value; Earliest/Latest→EDTF (same date → single value; range → interval; their ~/?? → EDTF ~/X); Confidence High/Medium/Low→confidence; Section → dropped; status: suggested for all (AI-extracted); type assigned by keyword heuristics (birth/marriage/served/lived-at → vocabulary) defaulting to event + subtype from the legacy Section.
  3. Anchors: best-effort — take the 3 rarest content words of value, search the transcript; a unique window → anchor: line N; else omit.
  4. Stories → ## Stories, people refs resolved to P-ids; questions → notes/questions.md (open, with [S-id] references mapped).
  5. Audit trail: write .cache/convert_mapping.csv (legacy_id, new_id, notes) and a dry-run report; --apply to write. Unresolvable people → minted stubs via §5.
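
Step 3 (anchors) can be sketched as below. Several choices here are assumptions, not part of the design: the stopword list, treating a single transcript line as the "window", and skipping claim words that never occur in the transcript (a paraphrased claim would otherwise never anchor).

```python
import re
from collections import Counter

STOPWORDS = frozenset("a an and at for in of on the to was were".split())

def _words(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def find_anchor(value, transcript_lines):
    """Return a 1-based line number when the rarest words pin down one line, else None."""
    freq = Counter(w for line in transcript_lines for w in _words(line))
    content = {w for w in _words(value) if w not in STOPWORDS and freq[w]}
    rarest = sorted(content, key=lambda w: (freq[w], w))[:3]
    if not rarest:
        return None
    hits = [
        n for n, line in enumerate(transcript_lines, 1)
        if set(rarest) <= set(_words(line))
    ]
    return hits[0] if len(hits) == 1 else None  # ambiguous or absent: omit the anchor
```

Returning None for anything but a unique hit matches the best-effort intent: a missing anchor is recoverable in review, a wrong one is misleading.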

12. fha site — static HTML explorer

Output: .cache/site/ (or --out), fully relative links, no CDN, no JS frameworks — works from file:// and a USB stick; the packet tool can embed a single-person slice.

Scope: the whole-family site — the archive as a browsable website, not a single profile (the packet embeds a one-person slice of the same generator).

Pages: Home — an interactive descendant explorer (v1 hero): expand/collapse nodes forward from a root ancestor, each node linking to its person page; plus an ancestor-pedigree view and the surname A–Z index and recent-discoveries teaser. All rendered from the relationships edges; no server, works from file://. Person (curated) — summary block; biography with [S-] rendered as numbered footnotes, [P-] as links; timeline; photo strip (via §9 person resolution, i.e. face_tags); Stories; Friends & Family; sources list. Person (stub) — one-line entries on their couple’s section. Source — citation, metadata, claims table with status badges, thumbnails + file links. Place — name, coords (map URLs, no embedded map dependency), dated history:, claims naming it, contained micro-places (within: children). Discoveries — rendered from notes/discoveries.md.

Assets — self-contained snapshot by default. The generated site is a standalone snapshot, not a live view of the archive: the generator produces its own web-optimized image derivatives (resized, EXIF stripped so living-person/location metadata never leaks) and copies them into the site folder, so the site depends on nothing outside itself — deploy it to a USB stick, a static host, or hand it to a relative, with the archive absent. Full-resolution originals never leave the archive; the site carries derivatives only. The snapshot contains only publication-eligible material (living/unknown persons redacted, restricted/DNA excluded, rights.publication_ok: false sources withheld), so “generate the site” is also “produce the safe-to-share version.” --linked is the opt-in alternative for fast local preview (relative links to real files, no copying, no redaction guarantees — developer mode only).

Modularity: because the site is a snapshot, it is decoupled from the archive’s churn — regenerating is idempotent, and an old site folder remains a valid frozen view even as the archive moves on. Thumbnails/derivatives via PIL.

Implementation: Jinja2 templates in tools/templates/; markdown→HTML via a minimal stdlib converter (headings, bold, lists, links — the profile format is deliberately simple) to avoid a markdown dependency; image derivatives via PIL. Token swap: TOKEN_RE → relative hrefs; unresolved tokens render highlighted (already lint errors).

Tree rendering — borrow the engine. The interactive trees (descendant explorer, ancestor pedigree, FAN graph) are rendered by a vendored client-side library, not hand-rolled D3 — current best candidate family-chart (donatso, MIT, D3-based, framework-agnostic, has its own JSON input format). It is a replaceable rendering adapter in the borrow-the-engines spirit: fha views tree emits the neutral tree JSON (§14b), the site bundles the library and feeds it that JSON (mapped to the library’s format), and swapping renderers later touches only the adapter. The library is vendored into the site bundle so the snapshot stays self-contained and offline. Verdict + alternatives (Yakubovich/descendant_tree, others) evaluated in the owner’s private tool log.


13. fha wikitree — profile exporter

Renders a curated profile to the user’s extended WikiTree dialect (established in existing profiles, e.g. Hartley-6084), not vanilla markup:


13a. fha gedcom — relationship exchange export

Derives a standard GEDCOM 5.5.1 file at export time — never stored, never re-imported as truth (SPEC §22 holds the never-the-corpus rule). fha gedcom <P-id> [--mode descendants|ancestors|connected] [--generations N] or --all. From the relationships edges and vital claims: INDI records (name, sex, birth/death/marriage from accepted vital claims with dates), FAM records (from spouse + child edges). living/unknown persons are redacted by default (living individuals → name withheld, à la standard privacy); --include-living overrides. Sources: each fact’s [S-id] becomes a SOUR note referencing the source citation. Output is a .ged file; round-tripping back in is explicitly unsupported — GEDCOM is a one-way bridge to other apps.
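
A minimal sketch of the record shapes the exporter emits, assuming the person and family data have already been derived from the edges and accepted vital claims. The input dict keys (`life`, `BIRT`, `DEAT`, `husband`, `wife`, `children`), the "Living" placeholder name, and the submitter name are all assumptions of the example; dates are taken to be in GEDCOM form already, and SOUR citations are omitted for brevity.

```python
def to_gedcom(people, families, include_living=False):
    """Build a small GEDCOM 5.5.1 document as a string (illustrative only)."""
    xref = {p["id"]: f"@I{n}@" for n, p in enumerate(people, 1)}
    fams, famc = {}, {}  # person id -> family xrefs as spouse / as child
    for n, fam in enumerate(families, 1):
        for role in ("husband", "wife"):
            if fam.get(role):
                fams.setdefault(fam[role], []).append(f"@F{n}@")
        for child in fam["children"]:
            famc.setdefault(child, []).append(f"@F{n}@")

    out = ["0 HEAD", "1 SOUR fha", "1 SUBM @U1@", "1 GEDC", "2 VERS 5.5.1",
           "2 FORM LINEAGE-LINKED", "1 CHAR UTF-8",
           "0 @U1@ SUBM", "1 NAME Archive owner"]
    for p in people:
        out.append(f"0 {xref[p['id']]} INDI")
        if p["life"] != "deceased" and not include_living:
            out.append("1 NAME Living")  # living/unknown: name and events withheld
        else:
            out.append(f"1 NAME {p['given']} /{p['surname']}/")
            out.append(f"1 SEX {p['sex']}")
            for tag in ("BIRT", "DEAT"):
                if p.get(tag):
                    out += [f"1 {tag}", f"2 DATE {p[tag]}"]
        out += [f"1 FAMS {f}" for f in fams.get(p["id"], [])]
        out += [f"1 FAMC {f}" for f in famc.get(p["id"], [])]
    for n, fam in enumerate(families, 1):
        out.append(f"0 @F{n}@ FAM")
        if fam.get("husband"):
            out.append(f"1 HUSB {xref[fam['husband']]}")
        if fam.get("wife"):
            out.append(f"1 WIFE {xref[fam['wife']]}")
        out += [f"1 CHIL {xref[c]}" for c in fam["children"]]
    out.append("0 TRLR")
    return "\n".join(out) + "\n"
```

Sequential @I…@ and @F…@ pointers are generated at export time, so no GEDCOM identifier ever needs to be stored in the archive.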

13b. fha capture — web record capture (the intake on-ramp)

The primary way starter sources enter the archive. Most existing research lives behind logins (Ancestry especially), so capture is designed to run on the page the human already has open — it never logs in, never scrapes credentialed endpoints on its own, and sees only what the browser is already showing. Two delivery forms, one backend:

Delivery (interface layer, replaceable):

The companion’s output is a source stub in the inbox (SPEC §12.1), never a finished record. Backend (fha capture, deterministic + skill):

  1. Extract citation fields from the page HTML. Site recipes for the heavy sources — Ancestry, FamilySearch, Newspapers.com, FindAGrave — know where title, date, collection, repository, image URL, and the persons/relationships listed on the page live. An unknown site falls to a generic recipe: capture page title, URL, accessed-date, and visible text as the citation basis.
  2. Asset handling, three cases: (a) a downloadable image/document is present → the human saves it (or drops it into the companion) and it becomes the source’s asset; (b) the record is the page → store a cleaned HTML snapshot as the asset (documents/web/…, an acceptable second-tier format per SPEC §2), so the evidence is preserved even if the page rots; (c) the page only points elsewhere (“record available at…”) → no asset, citation + external_links only, flagged for later retrieval.
  3. Pre-fill the source record: map recipe output into §14 frontmatter — source_type (recipe-inferred: census/vital-record/newspaper/…), citation, repository, external_links, source_date, and people: from the names the page lists (resolved against the index, unresolved → stub candidates).
  4. Write a research-log entry automatically (date, repository, collection, terms if visible, result: found [S-id]) — capture is itself a logged search, closing the loop §16 opened.
  5. Hand off to fha process + the draft pass: the page’s structured data (a census page’s household table, a marriage index’s fields) seeds suggested claims with the page as anchor, ready for review.

Boundaries: capture reads the open DOM/HTML only — it does not paginate, query APIs, or fetch behind auth; bulk or automated retrieval against a site’s terms is out of scope by design. The recipe set is data (tools/capture_recipes/), extensible without touching code. Everything it produces enters at suggested/needs-review like any intake.
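
The generic recipe (page title, URL, accessed-date, visible text) can be sketched with the standard library's HTML parser. The class and function names and the returned dict keys are assumptions of the sketch; real site recipes would live as data under tools/capture_recipes/ as described above.

```python
from datetime import date
from html.parser import HTMLParser

class _GenericPage(HTMLParser):
    """Collect the <title> text and visible text, skipping script/style content."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.text = []
        self._skip = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip and data.strip():
            self.text.append(data.strip())

def generic_capture(html, url, accessed=None):
    """Return the citation basis for a page no site recipe recognizes."""
    page = _GenericPage()
    page.feed(html)
    page.close()
    return {
        "title": page.title.strip(),
        "url": url,
        "accessed": (accessed or date.today()).isoformat(),
        "visible_text": " ".join(page.text),
    }
```

The function works on HTML the browser has already loaded and handed over; it performs no fetching of its own, in keeping with the boundaries above.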

13b.1 The companion workflow (the ideal experience)

Capture happens mid-research — the human is in flow on a record page and must not be forced into a full review session to save what they’re looking at. The governing principle: fast, forgiving capture now; structured review later. The companion grabs everything while the page is open and defers every judgment call to the workbench. It never blocks on a decision that can’t be answered in two seconds.

Phase 1 — Invoke (one gesture). On any record page the human clicks the companion (bookmarklet, extension button, or Claude-in-Chrome action). A small panel opens over the page; nothing has been written yet. If the site matches a recipe, the panel says so (“Ancestry census recognized”); otherwise it announces generic capture.

Phase 2 — Confirm (glance, don’t fill). The panel shows what the recipe already pulled — title, date, collection/repository, the persons the page lists, the image it found — as pre-filled, editable fields, plus a free-text notes box for anything the human wants to say in the moment (“this is Bob’s neighbor who babysat him”). The human’s job is a glance and a nudge, not data entry: fix a mangled title, untick an irrelevant person, type a sentence of context. Everything is optional; a human in a hurry clicks straight through. The panel’s only insistence is on asset capture (Phase 3), because that’s the thing that can’t be redone later once the page is closed.

Phase 3 — Capture the evidence (the part that can’t wait). The companion resolves the asset by the three §13b cases, but interactively, because this is where pages fight back:

Phase 4 — Stage, don’t process (the hand-off). Clicking Capture writes a source stub into inbox/ (SPEC §12.1) — a lone *.notes.md sidecar beside a single asset, or a bundle folder when there are several files or none. The notes file carries optional frontmatter (the recipe’s citation fields, parsed person list, source-type guess) and the human’s free-text notes as its body; the asset(s) sit alongside. Capture also auto-writes the research-log entry (date, repository, collection, terms if visible). That’s the end of the browser’s job. No source record is minted, no claims are drafted, no S-id is assigned — the stub is pre-source. The human goes back to researching, capturing five more the same way.

Later, in the workbench: a process-source session works the inbox bundles. Now the deferred judgment happens, with full context and the index at hand: the sidecar pre-fills the §14 frontmatter, the person list is reconciled against existing records (“the page named Margaret Cole — match P-cd795c61e0, or new?”), suggested claims are drafted from the page’s structured data with the page as anchor, and review proceeds as normal. A provisional-image asset surfaces a reminder that a better scan may be obtainable; an asset-elsewhere flag lands in the research-to-do.

Why staged, not immediate: the split mirrors the operating loop’s capture→file→process spine (§4). Capture is the file step — get it safely into the inbox while you’re looking at it. Processing is a deliberate, index-aware act that benefits from not being rushed mid-browse. It also makes batch capture natural (a research sitting yields a dozen bundles; one later session processes them all) and keeps the companion dumb and replaceable — it stages, the durable tooling decides. The paste-fallback path produces the same staged bundle from copied page content, so the two delivery forms converge the moment they hit the inbox.

Recipe coverage, in priority order: Ancestry (record + image viewer, census household tables), Newspapers.com (clipping + citation), FamilySearch (record + tree person), FindAGrave (memorial + cemetery place). Each recipe knows that site’s DOM well enough to fill the panel; the generic recipe (title, URL, visible text, accessed-date) ensures any page is capturable, just with more to fix in review.

13c. fha install / fha update-tools — scaffolding & updating a private archive

Vendoring the operating layer into an archive — and keeping it current as the public repo evolves — is a fixed ritual that should be a command, not a remembered checklist. This pair handles both, and both are generic glue (they move operating-layer files between a public-repo clone and a private archive, and touch no family data), so they live in the public tool suite.

The manifest (the package’s own packing list). The public repo ships a manifest.json (or equivalent) that is the definition of the operating layer: every path that belongs in an archive, plus a content checksum and the spec/tool version for each. This is the single source of truth for “what gets copied” — when the package grows (a second tools folder, a fifth doc), it’s added to the manifest, and update-tools automatically knows to copy it. The list lives in versioned data, never hardcoded in prose or memory.

The manifest covers the operating layer (tools/ and any future tool folders, SPEC.md, TOOLING.md, AGENTS.md, CLAUDE.md) and the skeleton (fha.yaml, the empty record dirs, a seeded places.yaml). It explicitly excludes spec-repo furniture (example-archive/, archive-template/, tests/, .github/, the public README.md, PRIVACY.md, RELEASE_CHECKLIST.md), which never enters an archive.

fha install <archive-path> — the first-time bootstrap, run from a clone of the public repo (the only place the code exists before anything is copied). Creates <archive-path> if absent (skeleton + full operating layer), or populates an existing folder that has no tools yet. Stamps .plainfile-version recording the manifest version and per-file checksums received. After this the archive is self-contained — tools and rulebook both vendored — and update-tools can run from within it. (Bootstrap note: install necessarily runs from the clone, e.g. python tools/fha.py install ~/my-family-archive, because the archive has no fha yet. This is the one command that can’t be run from the archive — by definition.)

fha update-tools — run from the archive (or the clone, pointed at it) any time the public repo improves. Reads the public manifest, compares against the archive’s .plainfile-version stamp, and reconciles without ever destroying anything:

| Situation | What update-tools does |
|---|---|
| Manifest has a new file/folder | Copy it in. |
| File is unchanged from stock (checksum matches the recorded version) | Overwrite silently — no work to lose. |
| File was customized (checksum differs — you edited it) | Move your version to .plainfile-backup/{date}/, install the new stock version, and report it so you can reconcile your changes. |
| File removed from the manifest upstream (a tool retired/renamed) | Move it to .plainfile-backup/{date}/ (never deleted) and report it. |

Then it refreshes the stamp and prints a summary (added / updated / backed-up / quarantined, with the backup path). --dry-run reports the plan without writing.
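
The four situations reduce to a three-way checksum comparison, sketched here. The function name, the action labels, and the flat `path -> checksum` mappings are assumptions; the case of a stamped file the user deleted locally is not covered by the table, and this sketch simply treats it as an add.

```python
def plan_update(manifest, stamp, current):
    """Decide an action per path.

    manifest: path -> checksum shipped by the public repo
    stamp:    path -> checksum recorded in .plainfile-version at last install
    current:  path -> checksum of what is on disk in the archive now
    """
    plan = {}
    for path in manifest:
        if path not in current:
            plan[path] = "add"
        elif current[path] == stamp.get(path):
            plan[path] = "overwrite"  # pristine stock file: nothing to lose
        else:
            plan[path] = "backup-then-install"  # customized: move aside and report
    for path in stamp:
        if path not in manifest and path in current:
            plan[path] = "quarantine"  # retired upstream: moved to backup, never deleted
    return plan
```

A --dry-run would print this plan and stop; nothing in the decision step touches the filesystem.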

The governing principle: the updater never deletes and never silently overwrites your work. It only adds, replaces-pristine-with-stock, or moves-aside-and-reports. Customizations and retired files accumulate in .plainfile-backup/ for the human to prune when confident — the same project-wide bias as everywhere else: never lose the human’s work; make the human the one who throws things away. You may edit any operating-layer file freely (a tool, AGENTS.md, the spec); the checksum compare detects your change and protects it on the next update. (.plainfile-backup/ and .plainfile-version are the updater’s only footprints; both are safe to inspect or delete by hand.)

14. Backlog (ideas, not yet designed)

| Idea | Sketch |
|---|---|
| photo-context skill | Update a photo’s embedded AI summary (UserComment) with archive knowledge: identified people’s relationships, the event/claim context, place history — the pipeline’s captions get smarter as the archive grows. Writes are marked as AI per §20. |
| WikiTree importer | Reverse of §13 for legacy profiles: named refs → draft source records (Ancestry/Newspapers.com citation patterns recognized), spacetime spans → claim date/place hints, wikilinks → external_ids, sections → profile scaffold; everything enters suggested. The existing WikiTree corpus is migration source material. |
| Citation assistant | Match uncited factual sentences in profiles against accepted claims (person + type + date overlap); suggest [S-] insertions as a diff. |
| Auto-anchor refinement | Re-run anchor matching with better text alignment for converter output. |

14a. fha xref — cross-reference pass

Triggered at the end of every review session (by the review-claims skill) and inside fha report — never scheduled. Deterministic candidates from the index: (a) corroboration — pairs of accepted/needs-review claims, same person + same type + overlapping EDTF bounds + different sources, not already linked; (b) contradiction — same person + same type + non-overlapping bounds, or same type with incompatible values (vital types only for value comparison; substantive types compare dates only). Output: candidate pairs with both claims’ context. Judgment + gate: the skill layer assesses each candidate; on human confirmation, writes corroborates:/contradicts: links (incremental reindex follows) and spawns a structured question (origin: tool) for each confirmed contradiction.
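
The deterministic candidate pass can be sketched as a filter over claim pairs. The claim dict keys and the `linked` set of already-linked id pairs are assumptions; value comparison for vital types is left out, so this sketch covers only the bounds-based half of the contradiction rule.

```python
from itertools import combinations

ELIGIBLE = {"accepted", "needs-review"}

def xref_candidates(claims, linked=frozenset()):
    """Return (corroborations, contradictions) as lists of claim-id pairs."""
    pool = [c for c in claims if c["status"] in ELIGIBLE]
    corroborations, contradictions = [], []
    for a, b in combinations(pool, 2):
        if a["person"] != b["person"] or a["type"] != b["type"]:
            continue
        if frozenset((a["id"], b["id"])) in linked:
            continue  # already linked: not re-proposed
        overlap = a["bounds"][0] <= b["bounds"][1] and b["bounds"][0] <= a["bounds"][1]
        if overlap and a["source"] != b["source"]:
            corroborations.append((a["id"], b["id"]))
        elif not overlap:
            contradictions.append((a["id"], b["id"]))
    return corroborations, contradictions
```

The output is only a candidate list; nothing is written until the skill layer has assessed a pair and the human has confirmed it.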

14a2. fha cooccur — connection candidate detection

The recurrence detector pointed at people and named entities — sibling to fha places candidates, same deterministic-cluster → human-confirm discipline, consumed through the report.

People co-occurrence → candidate social edges. From the index: person-pairs named together in ≥2 distinct sources with no existing relationship edge between them (kin or social). Rank by co-occurrence count and source variety (different source_types weigh more — a census + a newspaper + a photo back beats three pages of one census). Output: candidate pairs with the shared sources and T.E.’s-side context. Confirmation mints a relationship claim (subtype: friend|associate|neighbor, the confirming source cited); dismissal records a tombstone so the pair isn’t re-proposed. Noise control is the threshold + the human gate: every witness co-occurs with everyone on a document, so low-variety pairs rank low and most are dismissed — threshold tuning is pilot-data work.
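
The people co-occurrence pass can be sketched as pair counting over sources. The input shapes are assumptions, as is the ranking (source-type variety first, then raw count), which is one possible weighting and not a tuned one.

```python
from collections import defaultdict
from itertools import combinations

def cooccur_candidates(source_people, source_types, edges,
                       tombstones=frozenset(), min_sources=2):
    """Return [(pair, shared_source_ids, variety)] ranked best-first.

    source_people: source id -> person ids named in it
    source_types:  source id -> source_type
    edges, tombstones: sets of sorted (person, person) tuples to exclude
    """
    shared = defaultdict(set)
    for sid, people in source_people.items():
        for pair in combinations(sorted(set(people)), 2):
            shared[pair].add(sid)

    out = []
    for pair, sids in shared.items():
        if len(sids) < min_sources or pair in edges or pair in tombstones:
            continue
        variety = len({source_types[s] for s in sids})  # distinct source types
        out.append((pair, sorted(sids), variety))
    out.sort(key=lambda r: (-r[2], -len(r[1]), r[0]))
    return out
```

Dismissed pairs go into `tombstones`, so the same lead does not resurface in the next report.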

Organization/entity recurrence → connection hubs. Named entities living as claim values (employers in occupation, units in military, clubs in membership event/note claims) that recur across ≥N people or sources are surfaced as candidate hubs: “Plains Junction Railroad — 4 people, 6 sources.” Per SPEC §22 these stay claim values (no O- object); the detector simply makes the shared affiliation visible and queryable, and flags it should the recurrence ever justify promoting to a real record later. No schema change.

Both feed the report; both render in the FAN view as provisional (dashed) edges distinct from confirmed claims.

14b. fha views draft-queue — material awaiting narrative

Per curated person: accepted claims whose source is not cited in the person’s profile ([S-id] token absent from the profile body) — the writing backlog, computed as a set diff. Generated view next to the timeline; consumed by the write-biography skill; surfaced in the report.
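
The set diff can be sketched in a few lines. The bracketed-token regex is a loose assumption about how [S-id] tokens appear in a profile body (the real pattern is TOKEN_RE), and the claim dict keys and example ids are made up.

```python
import re

def draft_queue(accepted_claims, profile_body):
    """Accepted claims whose source is not yet cited in the profile body."""
    cited = set(re.findall(r"\[(S-[^\]\s]+)\]", profile_body))
    return [c for c in accepted_claims if c["source"] not in cited]
```

An empty result means every accepted claim for that person is backed by a source the biography already cites.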

15a. fha report — the session report (research feed)

Purpose. The “login screen”: one command that refreshes state and tells the researcher where to focus. Surfaced in the workbench as the /today skill (the skill runs the report, then narrates and offers to start the top item, e.g. a review-claims session).

Steps. (1) fha photoindex incremental refresh; (2) fha index rebuild; (3) lint in report mode; (4) diff current IDs/statuses against the snapshot in .cache/last_report.json; (5) assemble sections; (6) write the new snapshot.
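
Step 4, the snapshot diff, can be sketched as a comparison of two id-to-status mappings. The flat mapping shape for .cache/last_report.json is an assumption, as are the function name and the example ids.

```python
def diff_snapshot(previous, current):
    """Compare id -> status mappings from the last report and the fresh index."""
    prev_ids, cur_ids = set(previous), set(current)
    return {
        "added": sorted(cur_ids - prev_ids),
        "removed": sorted(prev_ids - cur_ids),
        "changed": sorted(
            (k, previous[k], current[k])
            for k in prev_ids & cur_ids
            if previous[k] != current[k]
        ),
    }
```

With --full the previous snapshot would be treated as empty, so everything reports as added.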

Sections. (Discoveries lead — the report is a research narrative first, a chore list second.)

  1. Discoveries since last session — questions answered, contradictions resolved, first corroborations (a claim gaining its first independent second source), profiles newly vital-complete, confirmed connections. With confirmation, each is appended to notes/discoveries.md (dated, with [S-]/[P-] refs) — the durable log of research wins.
  2. Review queue — suggested claims grouped by source, oldest first (lint W102).
  3. New since last session — sources, claims, people added/changed (the snapshot diff).
  4. Vitals gaps — curated people first, then people touched since last report (lint W101).
  5. Contradictions — contradicts: pairs lacking an open question (lint E009 set).
  6. Search-log awareness — leads in any section are annotated “already searched (date)” from the search_log; nils older than a configurable horizon (default 18 months) are flagged as worth re-running (collections grow).
  5b. Answerable questions — open questions whose referenced gap (person+type) now has an accepted claim, or whose referenced C-ids changed status: each proposed for answered [S-…] or review. Closing requires human confirmation — the report proposes, never executes.
  7. Photo processing triage — top N candidates (§15b).
  6b. Place candidates — recurring unlinked place_text clusters and GPS clusters (fha places candidates), with elevate-or-dismiss prompts.
  8. Hypotheses & draft queues — open hypotheses awaiting verification; per-person draft-queue counts (§14b).
  9. Possible connections (fha cooccur) — deterministic leads, never facts, with confirm/dismiss prompts: (a) person-pairs co-occurring in ≥2 sources with no relationship edge; (b) accepted claims of different unlinked people sharing a place with overlapping EDTF bounds; (c) recurring organizations/clubs/employers as shared-affiliation hubs. The /today skill may narrate judgment on top.

Output: markdown to stdout and .cache/report_{date}.md. Flags: --full (ignore snapshot), --section <name>.

15b. fha photoindex triage — processing candidates

Answers “which of these thousands of photos deserve source records?” — consumed almost entirely through the report’s triage section, rarely called directly. Ranks unprocessed photo groups (no source_id) by evidence signals: +3 caption contains transcribed text; +2 any bare P-id person keyword; +1 date pattern at Y! or better; +1 group has a -back variant (writing likely); −2 AI caption only (no human/verbatim text). Emit top N with path, signals, and suggested next step (fha process …). Feeds report §15a.6. (Flagged as a future-intelligence focus: a model-assisted scoring pass over caption/UserComment text, learning from past process/skip decisions, is the designed upgrade path — backlog.)
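
The signal weights above can be sketched as an additive score. The boolean field names on the group dict are hypothetical; how each signal is detected (for example, telling transcribed text from an AI caption) is outside this sketch.

```python
# (field name, weight) pairs; field names are assumptions, weights are from the text.
SIGNALS = [
    ("caption_has_transcribed_text", 3),
    ("has_pid_keyword", 2),
    ("date_year_exact_or_better", 1),
    ("has_back_variant", 1),
    ("ai_caption_only", -2),
]

def triage_score(group):
    """Return (score, fired_signal_names) for one unprocessed photo group."""
    fired = [name for name, _ in SIGNALS if group.get(name)]
    score = sum(weight for name, weight in SIGNALS if group.get(name))
    return score, fired

def top_candidates(groups, n=10):
    """Rank groups with no source_id by score, highest first, ties by path."""
    unprocessed = [g for g in groups if not g.get("source_id")]
    return sorted(unprocessed, key=lambda g: (-triage_score(g)[0], g["path"]))[:n]
```

Returning the fired signal names alongside the score lets the report show why a group ranked where it did.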


15. Build order & testing

Order: _lib foundations → install/update-tools (scaffolding, §13c) → id → index → lint → stubs → views → photoindex (+triage, +reconcile) → xref → cooccur → report → packet → process → convert-mining → site → wikitree → gedcom → capture → places geocode. Rationale: lint validates everything later tools write; the photo index gates the packet; the converter needs stubs + lint to validate its output.

Testing: the pilot tree is the clean golden fixture (tests/fixtures/) and must lint clean (exit 0; TODO-marked asset gaps are W-level, not E-level, in fixture mode); intentionally broken fixtures live separately under tests/fixtures/broken-*/, one per lint code. The example-archive/ is a separate demonstration fixture — it is permitted to exit 1 with documented known warnings (e.g. W101 for historical figures whose death records have not been located); those warnings must not regress in count or code without a deliberate change. Each tool ships --dry-run; tests are golden-file comparisons against a committed copy of the pilot (tests/fixtures/), plus deliberate-corruption fixtures for every lint code (a file per E/W code, asserting it fires). fha lint must run clean on the pilot before any release of any tool; a make check target runs the suite.


16. The research workbench (harness configuration)

Pattern (SPEC §6): an agentic CLI harness opened on the archive root, beside a plain text editor — human and AI edit the same files. Claude Code is the operating choice, not a required one. The configuration below is what makes any conforming harness genealogy-aware, and what keeps the choice reversible.

Vendor-lock prevention rules.

  1. AGENTS.md is canonical. All agent operating instructions live there, in plain markdown, harness-agnostic. CLAUDE.md is a one-line deferral (Read and follow AGENTS.md.) plus, at most, Claude-Code-specific notes. Any other harness’s convention file (e.g. for Codex or Gemini CLI) gets the same one-line deferral.
  2. Skills are portable. Workflow skills live in .claude/skills/{name}/SKILL.md using the open SKILL.md standard (adopted beyond Claude Code); they contain instructions and fha invocations only — no harness APIs.
  3. No harness-only state is load-bearing. Session memory, harness caches, and MCP configurations are disposable; anything worth keeping is written into archive records. Switching harnesses must cost one afternoon, not a migration.
  4. The harness’s “knowledge” of the archive is the index and the fha tools, never bulk file ingestion — ten thousand photos cost zero context because photo questions are fha photoindex calls.

Initial skills (build alongside linter v1).

External roots in the workbench. When fha.yaml maps a root outside the archive, the harness needs access granted: for Claude Code, launch with --add-dir <photos-root> (the settings-file additionalDirectories route has had reliability reports; prefer the flag, e.g. in a small launch script committed next to fha.yaml). The agent still must not bulk-read asset trees — access exists for exiftool/process/packet operations, not ingestion.

Model selection (workbench economics). Deterministic fha tools cost no model credits — the deterministic/judgment split is also the cost model. For model work, tier by judgment density, not habit: the workhorse tier (currently Claude Sonnet) is the default for tool-building, processing, review, and drafting; the frontier tier (currently Claude Opus / Fable) is escalated per task for proof arguments, merge/separate judgment, brick-wall research, spec-refinement, and stuck debugging — the tell is cheap to attempt, expensive to get wrong; the fast tier (currently Claude Haiku) serves batch API pipelines only after a sample-quality bake-off (handwriting transcription degrades quietly on small models). Switch per session (/model); the tiers are roles, not vendors — any harness’s equivalents slot in.

Workbench session hygiene (enforced by AGENTS.md): run fha lint after any batch of edits; never bypass fha process for renames; AI-drafted claims always start as status: suggested; never edit below a GENERATED header.
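
The last rule can be checked by comparing a file before and after an edit. This is a rough sketch under a stated assumption: the exact shape of the GENERATED header is not given here, so the code treats the first line containing the word GENERATED as the header. Both function names are hypothetical.

```python
def generated_region(text: str, marker: str = "GENERATED") -> str:
    """Return the text from the first line containing the marker to the end.

    Assumption: any line containing the marker counts as the header.
    Returns an empty string when the file has no generated region.
    """
    lines = text.splitlines(keepends=True)
    for i, line in enumerate(lines):
        if marker in line:
            return "".join(lines[i:])
    return ""


def edit_respects_generated(before: str, after: str) -> bool:
    """True when the edit left the header and everything below it untouched."""
    return generated_region(before) == generated_region(after)
```

A guard like this only detects a violation after the fact; the rule itself still lives in AGENTS.md.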


17. Command reference (the full surface)

Invocation surfaces: T terminal (works with no AI) · C conversational (agent shells out) · A auto-triggering skill · / slash wrapper. Commands are organized by how often you touch them; the skills are the real working surface, and most CLI commands are what skills shell into.

The working surface (daily)

| Surface | What it is / when |
|---|---|
| today skill · /today | Session start. Runs and narrates the report, discoveries first; offers the top item. |
| process-source skill (A) | “Process the inbox / this file.” fha process + AI draft + hand-off to review. |
| review-claims skill (A) | “Review the census claims.” The gate; ends with reindex + xref + lint. |
| mine-transcript skill (A) | “Mine grandpa’s interview.” Invoked, never automatic. |
| write-biography skill (A) | “Draft Margaret’s bio.” Consumes the draft queue; AI-DRAFT markers. |
| research-next skill (A) | “Where should I look for X?” Log-aware leads; writes hypotheses. |
| merge-identities skill (A) | “Same person” / “two people.” Frontier-tier candidate. |
| place-research skill (A) | “Fill in Suwałki’s history.” Loose citations OK. |
| fha lint · /lint (T C) | After any batch of edits; the done-gate. Flags: --with-exif, fix modes (diff-previewed). |
| fha doctor · /doctor (T C) | “What’s wrong with this archive?” After moves, migrations, weirdness. |
| fha process <file\|folder> (T C) | Direct intake without the skill conversation; folder mode triages first. |
| fha packet <P-id> (T C) | “Make grandma’s packet.” --include-research/--include-restricted/--include-dna/--no-photos. |
| fha report (T C) | The raw report, un-narrated. --full, --section. |
| fha find <ID\|text> (T C) | “Where does this live?” Records + assets + citations for any ID; FTS for text. |
| fha find --related <ID> (T C) | “What’s adjacent to this?” Neighborhood of any ID (person/place/source/claim/hypothesis), ranked, with provenance. The connection-discovery primitive. |
| fha find --related --date <EDTF> (T C) | The time neighborhood: everyone/everything active in a date range. Combinable with an ID. |
| fha find --text "…" (T C) | Full-text search across record bodies, notes, transcripts, and photo/document caption+keyword fields. |
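
The review-claims row says a session ends with reindex + xref + lint. Below is a minimal sketch of how a skill might shell into that chain. It makes assumptions that this section does not confirm: that a bare fha index is the right reindex call, that each command signals failure with a nonzero exit code, and that stopping at the first failure is the desired behaviour. The function names are hypothetical.

```python
import subprocess
from typing import Callable, Sequence

# The three commands named in the review-claims row, in order.
SESSION_END = (["fha", "index"], ["fha", "xref"], ["fha", "lint"])


def run_exit_code(argv: Sequence[str]) -> int:
    """Run one command and return its exit code."""
    return subprocess.run(list(argv)).returncode


def close_review_session(
    run: Callable[[Sequence[str]], int] = run_exit_code,
) -> list[str]:
    """Run reindex, xref, lint in order.

    Returns the argv of the first command that failed, or [] when all three
    succeeded. Assumption: a nonzero exit code means failure.
    """
    for argv in SESSION_END:
        if run(argv) != 0:
            return list(argv)
    return []
```

The runner is injectable so the chain can be exercised without an archive or an installed fha.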

Occasional (setup, migration, publication moments)

| Command | When |
|---|---|
| fha site [--standalone] (T C) | Regenerating the family site; before any share or USB hand-off. |
| fha wikitree <P-id> (T C) | Publishing a curated profile in the WikiTree dialect. Never uploads. |
| fha install <path> (clone) / fha update-tools (T C) | Bootstrap a private archive with the operating layer, or refresh it later; backs up your edits, never deletes, never touches data. |
| fha capture (T C, + browser companion) | Capturing a record from an open web page (Ancestry etc.): citation + asset/HTML-snapshot + research-log entry → fha process. The main intake on-ramp. |
| fha gedcom <P-id\|--all> (T C) | Exporting relationships+vitals to GEDCOM for another genealogy app. One-way; redacts living/unknown. |
| fha views tree <P-id> --mode … (T C) | Generating an ancestor/descendant/FAN tree (json/html/dot). |
| fha views timeline\|sources-index\|draft-queue\|brackets (T C) | Manual view refresh (review sessions auto-trigger); brackets --fix after family-structure changes, which also verifies and corrects Ahnentafel folder numbers and person file placement (requires root_person in fha.yaml). |
| fha places geocode / places lint (T C) | Coordinate backfill (human-confirmed); registry hygiene. |
| fha convert-mining [--apply] (T C) | One-time: legacy ChatGPT mining migration. |
| fha photoindex find / report (T C) | Ad-hoc photo search; cross-variant date-conflict review. |
| fha reconcile / photoindex reconcile (T C) | Heal index paths after moving/renaming files; run after a photo-organizing session. |
| fha photoindex tag-person (T C) | Writing bare P-id keywords across a face-tag match or onto specific photos (previewed). |

Plumbing (invoked by the surface; direct calls are the by-hand fallback)

| Command | Who calls it |
|---|---|
| fha id mint P\|S\|C\|L\|H | process, skills, converters. Direct use = hand-authoring with zero tooling, which is its reason to exist. |
| fha stubs | lint --mint-stubs, the process pipeline, converters. |
| fha index [--source S-x] | review sessions (incremental), the report, anything needing queries. |
| fha xref | review-session end; the report. |
| fha cooccur | the report’s connections section (people pairs + org/place recurrence). |
| fha photoindex (scan) | The report’s first step; incremental, silent. |
| fha photoindex triage | The report’s triage section. |
| fha lint --format-check/--format-write | Cleanup sessions; always diff-previewed. |

Recommended slash wrappers (thin files in .claude/commands/): /today, /lint, /doctor. Everything else is conversational; skills auto-trigger on matching requests.
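
For a sense of how thin these wrapper files can be, here is a sketch that writes the three of them. The wrapper wording is invented for illustration and is not the archive's actual text, and skipping files that already exist is an assumption chosen so hand edits are never overwritten.

```python
from pathlib import Path

# Hypothetical wrapper bodies: each one only points at a skill or an fha call.
WRAPPERS = {
    "today": "Use the today skill: run and narrate the report, discoveries first.",
    "lint": "Run `fha lint` and summarize the findings.",
    "doctor": "Run `fha doctor` and summarize the findings.",
}


def write_wrappers(root: Path) -> list[Path]:
    """Create missing wrapper files under .claude/commands/; return those written."""
    out_dir = root / ".claude" / "commands"
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, body in WRAPPERS.items():
        path = out_dir / f"{name}.md"
        if path.exists():
            continue  # assumption: never overwrite a wrapper someone has edited
        path.write_text(body + "\n", encoding="utf-8")
        written.append(path)
    return written
```

Each file holds one sentence, in keeping with the idea that the wrappers add a shortcut and nothing else.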