Nadia · NLP specialist
How it happened
The journey behind the submission. Pivots, surprises, the validation that kept the work honest, the rules that emerged, and where each one came from. Use the persona overview at the top to orient; read top-to-bottom, or jump to a section.
🧪 Reliability & validation
The first rule on the project, and the one I'd promote to the top of every future one, is to treat any LLM as unreliable on resources and confident-sounding by default. Cost claims, time estimates, and "I can't access X" refusals all need to be checked against a real run before they shape a decision. Pierre is on Claude Max 20×, so the cost framing was retired entirely; what stayed is the discipline.
The headline insight: hand-labelling 100 rows changed every downstream argument. The hand-tagged set is the only ground truth in the project; the model labels and the cross-check models are candidates, not arbiters. View the full table at 📋 hand-labelled rows: 50 touched rows where the original raw classifier said joy/love/surprise on 1- and 2-star reviews, set against the 8b correction, the j-hartmann cross-check, and Pierre's hand-label.
50 hand-labelled rows
100% directional accuracy of the fix
42% exact-emotion accuracy of the fix
18% indie cross-check accuracy on the same gold
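To pin down what the two accuracy figures above mean, here is a minimal sketch; the DataFrame and its column names (hand_label, fixed_label) are hypothetical stand-ins, not the project's actual schema.

```python
import pandas as pd

NEGATIVE = {"anger", "sadness", "fear", "disgust"}

def directional_accuracy(df: pd.DataFrame) -> float:
    # A fix counts as correct if it lands on the right side of the
    # positive/negative divide, whatever the exact emotion.
    same_side = df["fixed_label"].isin(NEGATIVE) == df["hand_label"].isin(NEGATIVE)
    return float(same_side.mean())

def exact_accuracy(df: pd.DataFrame) -> float:
    # Stricter: the fix must name the same emotion as the hand label.
    return float((df["fixed_label"] == df["hand_label"]).mean())
```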
The reliability rule
Verify before claiming. Before saying "this is live", or "X is reachable", or "we don't have access to Y", run the check from the user's vantage. Curl from the wrong shell isn't what the user sees in a browser. "I can't access X" is a refusal, not a fact: try the documented path; if it fails, the failure is the answer.
Eyeball the first 20 rows
Every aggregate metric was sense-checked against twenty actual rows on the way through the pipeline. An aggregate accuracy of 0.78 is meaningless without seeing 20 examples, because the average hides the failure mode. The 50-row hand-labelled table is the formalisation of that habit.
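A minimal sketch of the habit, assuming a pandas DataFrame of reviews; the column names (stars, emotion, text) are illustrative.

```python
import pandas as pd

def eyeball(df: pd.DataFrame, n: int = 20, seed: int = 42) -> None:
    # Fixed seed so a re-run shows the same rows and the check is repeatable.
    for _, row in df.sample(n, random_state=seed).iterrows():
        print(f"{row['stars']}* | {row['emotion']:>9} | {row['text'][:120]}")
```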
🎯 Headline moves
Five decisions that shaped the work: what changed, why it mattered, and the evidence that landed it.
The classifier was trained on US Twitter; our reviews are British polite-complaint prose
"I have been a loyal customer, however..." reads as joy on a Twitter-trained model and as anger or sadness in UK English. 20.6% of 1-star reviews tagged joy β far above the sarcasm base rate (2-5% per SARC/iSarcasm), which means it is a domain mismatch, not sarcasm. Two or three examples of each (anger / sadness / fear from joy/love/surprise) live in the appendix; the full 50-row table lives at π hand-labelled rows. The fix re-ranked 1,512 rows using the model's own probability vector, kept the originals in emotion_raw for the audit trail, and left topic rankings unchanged β it's a data-quality intervention, not a finding-generator.
PureGym replies fastest to joy, slowest to anger
Median reply latency: joy 98 hours, anger 130 hours. Only 6.3% of angry reviews are answered within 24 hours; 38.3% are still unanswered after a week. Industry benchmark: a 1-hour reply correlates with 71% retention versus 48% at 24 hours. Easy change, easy to monitor: re-rank the reply queue by emotion at intake, no new infrastructure.
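A minimal sketch of re-ranking at intake, assuming a pandas DataFrame queue with hypothetical emotion and received_at columns; the priority ordering is one plausible choice.

```python
import pandas as pd

# Lower number = answered sooner; anger jumps the queue.
PRIORITY = {"anger": 0, "fear": 1, "sadness": 2, "surprise": 3, "love": 4, "joy": 5}

def prioritise(queue: pd.DataFrame) -> pd.DataFrame:
    ranked = queue.assign(_priority=queue["emotion"].map(PRIORITY))
    # Within an emotion band, oldest review first.
    return ranked.sort_values(["_priority", "received_at"]).drop(columns="_priority")
```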
The top-10 worst clubs are 80% London
422 negative reviews concentrate in 10 of 410 corporate sites (2.4% of the network → 7% of negative-review volume). 8 of 10 are London or Greater London. London generates a disproportionate share of the noise, so watch those locations specifically. A London-focused 90-day rapid-improvement programme covers most of the worst-club issue in one move.
Negative reviews tripled from June 2023
A real trend with several candidate causes: H1 2023 price increases, 54 new sites opened in the same window, the Fitness World Denmark rebrand, and the cost-of-living squeeze. The timing has a date stamp; the cause is plural and not yet cleanly attributable. Worth flagging in any longitudinal review.
Two location IDs (345, 398) looked like junk β they were legitimate
Both are multi-site catch-alls (174 reviews aggregating multiple London gyms; 42 predominantly Shrewsbury). The initial instinct was to drop them as data-quality noise. A deeper look showed 112 of 216 rows are five-star: real reviews with broken display-name metadata, not bot output. They stay in the topic and sentiment analysis; they're excluded only from per-location ranking, where the catch-all aggregation would be misleading.
📏 Working rules
Discipline that came out of the project. Each is a pattern that paid for itself when followed and cost something when ignored.
Subtract by default: when adding X, name what retires
Strong prior that installs are tax. Resist tool sprawl by stating the constraint, not approving anyway. Caught a "let's install WSL on the laptop" in this project; a two-line shell alias served the same purpose with no install footprint.
List the right things, let stakeholders price them up
Russell-validated framework, adopted across the extended consultant memo. Every recommendation tagged OpEx (recurring) or CapEx (one-off); no fabricated pound figures per recommendation. The consultant's job is to surface and tag, not to fabricate the costing.
Clinical, not nice, in audits
Don't use "strong / robust / excellent" about your own work. Quote the rubric verbatim with status tags (HIT, THIN, MISS, BEYOND). Per-chart commentary: rubric tag, one line on what it shows, one line on what it doesn't. If something's weak, say "weak because X"; don't soften with "could potentially be strengthened". The marker reads the language.
Stop overwriting the canonical notebook
In-place editing destroyed work three times in one session. Now code-enforced via a PreToolUse hook that denies writes not matching the _pending filename suffix pattern; a sketch follows. Promoted from prompt-level reminder to hook layer because the prompt-level rule was ignored every session.
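A minimal sketch of such a guard, assuming the PreToolUse hook receives the tool call as JSON on stdin and that exit code 2 denies it; the payload field names follow Claude Code's documented hook shape but should be checked against your version, and the .ipynb filter is an illustrative assumption.

```python
#!/usr/bin/env python3
import json
import sys

payload = json.load(sys.stdin)
if payload.get("tool_name") in {"Write", "Edit"}:
    path = payload.get("tool_input", {}).get("file_path", "")
    # Deny in-place writes to any notebook that isn't a versioned copy.
    if path.endswith(".ipynb") and "_pending" not in path:
        print("blocked: write to a *_pending.ipynb copy, not the canonical notebook",
              file=sys.stderr)
        sys.exit(2)  # exit 2 = block the tool call and surface stderr to the model
sys.exit(0)
```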
📅 Journey
Six dated moments that together tell how the project got from "what is BERTopic?" to a defended submission.
2026-04-14 · Visual-verify discipline
A bar chart claimed as correctly rendered was not. The assistant had checked its own source code instead of the rendered page. Rule promoted: every UI claim is verified in a browser, not from source.
2026-04-16 · Cohort Q&A
Falcon → Qwen swap green-lit. Custom stopwords ruling (pure, pure gym, gym). Trustpilot Title+Content merge: 59% of titles add information, so the merge is mandatory. Iterative stopword refinement licensed.
2026-04-18 · Workbench discipline codified
Four artefacts shipped: an apply.py patch executor, a render.py dashboard, a data-workbench-guard.sh PreToolUse hook, and a preflight.py environment check. Workbench tooling lifted to a standalone repo so one install serves every project.
2026-04-19 · First clean end-to-end run
Topic ordering shuffled on a re-run, and theme labels went wrong because they relied on a hardcoded topic-ID dict. Fix: keyword-rule labelling that survives UMAP-induced topic-ID shuffles, sketched below.
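A minimal sketch of keyword-rule labelling, assuming a fitted BERTopic model; the theme names and rule keywords are illustrative stand-ins for the project's actual rules.

```python
# Hypothetical theme rules: label by topic content, never by topic ID.
RULES = {
    "staff & service": {"staff", "manager", "rude"},
    "equipment":       {"machine", "broken", "equipment"},
    "music & noise":   {"music", "loud", "volume"},
}

def label_topics(topic_model) -> dict[int, str]:
    labels = {}
    for topic_id in topic_model.get_topics():
        words = {word for word, _ in topic_model.get_topic(topic_id)}
        for label, keywords in RULES.items():
            if words & keywords:  # survives UMAP-induced ID shuffles
                labels[topic_id] = label
                break
    return labels
```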
2026-04-25 · Russell day
Submission status check passed (53/53 cells, 0 errors, 216 placeholders kept, intersection 312 → 335). Russell 1:1 in the morning landed the stakeholder framework: list the right things, let stakeholders price them up. The Sonnet shift-worker validation completed earlier the same day; the framework gave the headline-saving reframe a place to land.
2026-04-25 evening · Two parallel sessions, then synthesis
One session in pace-nlp-project (visual polish + rubric tightening), one in data-workbench (extended consultant report + Sonnet shift-worker reframe + pilot designs). EXTENDED_REPORT.md (40 KB) deployed at /extended. Three parallel agents mined project docs, brain-vault, and Claude Code session logs to compile the addendum.
Proof of understanding deepening over time
A small audit, exploratory in spirit. From 223 user-typed prompts across 10 Claude Code sessions, the same topic surfacing early (working it out) and late (compressed, confident).
Versioning the canonical notebook (catching the destructive in-place edit default)
2026-04-18 early
"are we versioning? or are you just writing over the smae notebook file everyrt fucking time? that's destrcutive right?"
Naive realisation mid-thrash: the canonical notebook had been edited ~8 times in-place during the ollama-to-HF migration, and Pierre suddenly stops the loop to interrogate the version-control assumption out loud, profanity-laced and verbose, including a self-check ('that's destrcutive right?') that's still asking the assistant to confirm the vocabulary.
2026-04-24 late
"youre overwriteing again"
Instant pattern recognition: three words, no profanity, no question mark. Same destructive default, same notebook, six days later. The full prompt then includes a pasted FileNotFoundError trace and ends 'also sloppy with this. look at the v3 file and learn from that please', but the cognitive move is the opening: Pierre saw the symptom and named the pattern in a glance, the way a senior catches a junior.
On Apr 18 he was discovering that 'destructive in-place edit of the canonical notebook' was even a thing he should be watching for; by Apr 24 it had become a reflex catch: same failure mode, a fraction of the cognitive cost. The next morning (Apr 25, 45465c5b:949) he promoted the rule to a standing instruction ('stop overwriting the basic notebook you've got to do versioning right and you have to commit to Github').
Emotion classifier OOD failure on UK reviews (sad-vs-anger and the joy-on-1-star pattern)
2026-04-16 early
"adhd friendly please. explain to me how to ask about the sad v anger classification trouble nad what does that mean for the client at the end of the day in terms of recommnedations"
Live during a tutor Q&A, working out HOW to even ask the question: 'explain to me how to ask' is a meta-step before he can intelligently raise it with his tutor. He doesn't yet have the vocabulary for 'out-of-distribution'; he calls it 'the sad v anger classification trouble' and asks for both the question framing and the downstream business meaning.
2026-04-25 late
"for the top 20 - not sure how useful it is many of these words. and also what's up with 'joy' in the emotion stuff? it's really weird right? can you show me examples and can we figure this out? is there another model that is better for british stuff? or should we train one perhapss? how hard would that be?"
Three nested escalations in one prompt: (1) here's evidence the model is misbehaving, show me the examples; (2) is there a better-trained UK-domain model already out there; (3) if not, how expensive would training one be? Confident framing that assumes the OOD pattern is real, names 'british stuff' as the domain shift, and treats fine-tuning as a scopable next step rather than an unknown.
On Apr 16 the misclassification was a fuzzy phenomenon he needed help articulating to a tutor; by Apr 25 it was a known pattern (the j-hartmann / GoEmotions OOD on UK polite-complaint prose), and the prompt skips the 'what is happening' question entirely to go straight to 'show me examples → swap to a better model → or train our own → what's the cost?'. That's the move of a senior analyst who has internalised the diagnosis and is now scoping the remediation.
Method: 223 user prompts scanned across 10 sessions, spanning 2026-04-13 to 2026-04-25. Filtered to 85 learner-posture candidates via curiosity / doubt / mental-model regex markers, then two paired examples picked manually where the same concrete topic shows up early (naive framing) and late (compressed, confident).
📎 Appendix
Items that didn't earn the main page but are kept for transparency. Brief by design.
Surprises & findings
20.6% of 1-star reviews tagged joy
The classifier was right about its training data and wrong about ours. The fix is to label the failure mode, not retrain the world. A few verbatims live in the row table at 📋 hand-labelled rows. The mechanism deserves its own study; for this submission it was diagnosed and bounded.
Glassdoor staff have no public discussion of music
Initial hypothesis: loud music is corporate-mandated, so the lever is HQ. The Glassdoor null result undercuts that: staff appear to have control over volume. Operational fix: trigger a manager review when 2+ reviews at a site mention loud music (sketched below), and let staff try turning the music down a step. Music volume correlates positively with overall negativity (r=+0.60), so it's a possible aggravator worth treating as a per-site lever.
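A minimal sketch of the trigger, with hypothetical column names (site_id, text) and an illustrative keyword pattern.

```python
import pandas as pd

def sites_needing_music_review(reviews: pd.DataFrame, threshold: int = 2) -> pd.Series:
    # Non-capturing group avoids pandas' "match groups" warning.
    mentions = reviews["text"].str.contains(r"\b(?:loud|music|volume)\b", case=False)
    counts = reviews.loc[mentions].groupby("site_id").size()
    return counts[counts >= threshold]  # sites to flag for a manager review
```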
Income → negative reviews is a middle-income story
The headline correlation (r=+0.33) flattens out at the extremes. Less negativity in poor towns; less in very wealthy ones; the loudest complaints come from middle-income towns with options, places where the gym is one of several choices and the customer feels free to switch. The mechanism is expectation gap modulated by exit cost.
Hand-label gold accuracy: raw 0%, 8b 42%, j-hartmann 18%, Gemini 40%, Claude 74%
The "indie cross-check" model (j-hartmann) is worse than the rubric model on this distribution β different OOD axis. Don't assume an indie cross-check beats the rubric model. Test it.
Pivots that mattered (kept brief)
Falcon-7B → Qwen2.5-7B
Falcon's rubric prompts no longer reproduce under post-update weights. Russell verbally green-lit the swap. Qwen is Apache 2.0, structured-output, runs deterministically on A100.
Stochastic UMAP → seeded random_state=42
Without seeding, topic IDs reshuffle between runs and a hardcoded label dict silently lies. Seeded across all four BERTopic calls.
Versioning the canonical notebook (Claude Code addition)
Rule promoted to a PreToolUse hook denying writes that don't match the _pending suffix. Born from the turn, quoted in the Journey section above, that opened by asking whether we were versioning or just overwriting the same notebook every time.
Validation methods (kept brief)
Sonnet 200-sample shift-worker validation
Worth further investigation: sample confirmed shift-worker rows specifically and re-fit the cohort. The current keyword filter mostly captures 24/7-access praise.
j-hartmann cross-check on a 200-sample stratified set
DistilRoBERTa fine-tuned on 7 diverse corpora rather than Twitter. Used as an OOD canary, not a tie-breaker; disagreement at 3-star confirms the OOD pattern.
Workbench tested 10 preprocessing configurations
Empirical decision, not recipe. The "more cleaning hurts BERT" finding came out of running the test; it would have been hard to argue from theory alone. Up-front investment in the harness pays off when the failure plays out and gets fixed in plain sight.
Technical specifics (kept brief)
BERTopic tuning
Seeded UMAP across all 4 fit_transform calls, KeyBERTInspired + MaximalMarginalRelevance representation, reduce_outliers(strategy="embeddings"), keyword-rule labelling.
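A minimal sketch of that configuration; beyond the pieces named above (seeded UMAP, the two representation models, embedding-based outlier reduction), the values and the load_review_texts helper are illustrative assumptions.

```python
from umap import UMAP
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

docs = load_review_texts()  # assumed helper returning a list of review strings

umap_model = UMAP(random_state=42)  # seeded: topic IDs stop shuffling between runs
representation = [KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)]

topic_model = BERTopic(umap_model=umap_model, representation_model=representation)
topics, probs = topic_model.fit_transform(docs)

# Pull outliers back into topics using document embeddings.
topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings")
```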
Few-shot beats zero-shot for structured outputs; smaller models benefit more. Robust JSON parsing (find the first [, the last ], json.loads on the slice). Left-padding for decoder-only models. Worth flagging that any single-model emotion run on UK reviews, without US/UK context-sensitive training, is barely reliable on joy.
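A minimal sketch of the two habits just named; the checkpoint is one plausible choice given the Qwen2.5-7B swap, not necessarily the project's exact one.

```python
import json
from transformers import AutoTokenizer

def parse_json_list(raw: str) -> list:
    # Slice from the first '[' to the last ']' so stray prose around
    # the JSON doesn't break the parse.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("no JSON list in model output")
    return json.loads(raw[start:end + 1])

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer.padding_side = "left"  # decoder-only models continue from the right
                                 # edge, so batch padding must sit on the left
```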
Rubric-item-specific decisions
Item 26 (BERT emotion): rubric-mandated, kept; OOD handled inside the model's own probability vector. Item 31 (Falcon): substituted for Qwen via instructor green-light. Item 40 (pyLDAvis): the only thin item; it hangs on prepare() for large corpora, with fallbacks documented.
Domain preprocessing & hand-labelling discipline
Custom stopwords (pure, gym, puregym), generic-stops extension, Trustpilot Title+Content merge, langdetect deterministic seed. Hand-labelling discipline (find the pivot word "but"/"however"; anger blames outward, sadness grieves inward; wistful past = sadness; accusatory past = anger; explicit emotion words override) is the method behind the gold rows; it's recommended as a first move on any new emotion classifier deployment.
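A minimal sketch of the code-expressible preprocessing pieces; the stopword values match the ones named above, and the vectorizer is the standard scikit-learn one rather than the project's exact pipeline.

```python
from langdetect import DetectorFactory
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, CountVectorizer

DetectorFactory.seed = 0  # langdetect is stochastic by default; seed it

domain_stops = {"pure", "gym", "puregym"}
vectorizer = CountVectorizer(stop_words=list(ENGLISH_STOP_WORDS | domain_stops))
```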
Tooling / lessons learned
Ways of working
Colab kaleido / Plotly fallbacks, Windows cp1252 stdout trap, Mermaid v11 quirks, two-Claude-sessions coordination on the same repo, secrets-discipline content-scan before commit. The full list is in the project's LESSONS_ADDENDUM.
Companies House FY2024 vs Perplexity Sonar Deep Research: an open issue
Tested Claude Code deep research against other AI deep-research products before settling on Perplexity Sonar; ended up with similar reliability concerns either way. Real numbers cross-checked against the FY2024 filing corrected several Perplexity outputs (gym format 5,500-25,000 sqft, not 2,500; 433 UK gyms, not ~400; ARPM £22.64, not £21.60; Adj EBITDA 29.7%, not 23%). Treat AI deep-research as a starting scaffold, not the final answer. Still an open issue; the cross-check pass is part of the workflow now.
Open issues
Phase 8c sadness-aware re-rank is future work; an 8.54% residual error ceiling remains on high-star reviews tagged negative; the music-negativity correlation is observational only; there is no causal identification per intervention; the word count is over the rubric ceiling.