PACE · Extended Findings · operational memo built on the same dataset as the academic report · LGP+KKR-framed

PureGym Extended Findings — Beyond the Rubric

Audience: Leonard Green & Partners + KKR (PE owners), Pinnacle Topco / Pinnacle Bidco (UK group + bond issuer), Clive Chesser (CEO from Nov 2024), Humphrey Cobbold (Chairman). Treat as a consultant memo, not an academic report — every recommendation tagged OpEx (recurring £/year) vs CapEx (one-off £) and ranked by effort × leverage.

Scope. Extends the CAM_DS_301 academic submission (report.md) with operational findings that fell outside the 800-1000 word limit. All findings derive from the same dataset: 27,666 PureGym customer reviews (Trustpilot 15,815 + Google 11,851), plus 43 PureGym Glassdoor staff reviews, plus the FY2024 Companies House filing (Pure Gym Limited, reg 06690189), plus a Perplexity Sonar Deep Research industry brief (with documented errata — see § Caveats).

Frame. PureGym is a mature 410-corporate-gym + 23-franchise UK operator (433 total) with £416m turnover, £187m reported EBITDA (45% margin), £123.5m adjusted EBITDA (29.7% margin), 1.50m UK members, ARPM £22.64, +13% YoY revenue, +7% YoY UK membership (group 2.25m, +21% YoY post-Blink). 2024 saw 43 new UK gyms opened at £2m+ per site plus a £150m senior secured note (Oct 2024) funding the £97m Blink Fitness acquisition (56 US gyms from Chapter 11). LGP + KKR exit thesis is therefore *both* CapEx-led growth (active) *and* margin expansion (latent) — the memo argues the next-marginal-pound is more productive in retention/margin than in further UK new-store CapEx, given urban-saturation diminishing returns. Every recommendation here is filtered through that lens: recurring OpEx savings and retention lifts beat one-off CapEx unless the CapEx triggers a margin ratchet.

TL;DR — what to do Monday morning

#	Action	Effort	Leverage	OpEx / CapEx	Verdict
1	Re-rank Trustpilot reply queue by anger / sadness emotion so angry reviews get answered before joyful ones	1 unit	5/5	OpEx (~0 — workflow change to existing reply ops)	Ship this week
2	Lean into 24/7-flexibility as the lead acquisition narrative. Sonnet validation (2026-04-25) re-framed the 1,177 "shift-worker" matches as a 24/7-praise filter (3% verified shift-worker, 77% unclear, 20% no). The dominant signal is 24/7 access as a lifestyle promise, with shift workers as a small but visible proof cluster (~35 projected from the 3% verified rate). Marketing reallocates around 24/7-flexibility, not occupational identity	2 units	3/5	OpEx (marketing budget reallocation)	Q3 marketing slot
3	Seasonal HVAC pre-emptive ops: May–Sep AC complaints run 3× the spring baseline; Nov–Dec cold-shower complaints run 4.6× the February trough. Pre-summer service contract + winter boiler audit	2 units	4/5	OpEx (service contract)	Annual cycle
4	Set per-site music-volume target with manager accountability — Glassdoor null result confirms music is per-site, not corporate. Ownership is club managers, not HQ	1 unit	3/5	OpEx (~0 — policy memo)	Ship next month
5	Revisit the top-10 worst clubs as a programme, already drafted in `v3_11_club_briefs.md` (London Stratford 81, Leicester Walnut Street 59, Enfield 48, Swiss Cottage 37, Birmingham 34, Bermondsey 33, Seven Sisters 33, Hayes 33, New Barnet 32, Canary Wharf 32; see § 5). Sequenced 90-day plans per site	3 units	4/5	Mixed (cleaning OpEx + locker / equipment CapEx)	Q3 ops plan
6	Speculative pilots — warm-shower temperature A/B, AC double-doors, capacity-cap queueing app	1 unit (writing only)	unknown	n/a — design proposals	Discussion item

§ 1 — Reply latency: the highest-leverage Trustpilot intervention

The data

PureGym replies to 97% of Trustpilot reviews (15,356 of 15,815). Reply rate is not the problem. Reply prioritisation is.

Emotion	Count	Median reply (hours)	Mean reply (hours)
Joy	9,960	98.8	136.5
Surprise	99	87.5	122.7
Love	328	99.0	126.7
Fear	482	122.6	163.3
Sadness	1,606	124.1	173.1
Anger	2,881	130.1	173.1

PureGym replies to angry customers about 32% slower than to joyful ones (median 130.1 hours vs 98.8 hours). This is the inverse of best practice.

Latency band breakdown for the 2,881 anger-tagged Trustpilot reviews:

Band	Anger count	% of anger
<1 hour	6	0.2%
1–4 hours	24	0.8%
4–24 hours	153	5.3%
24–72 hours	494	17.1%
3–7 days	1,101	38.2%
1–4 weeks	1,103	38.3%

Only 6.4% of angry reviews (183 of 2,881) are answered within 24 hours. 38.3% wait more than a week for a reply.
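The band shares and the anger-versus-joy gap can be re-derived directly from the two tables above. The following is a minimal sketch, assuming only the counts and medians printed in this section; the variable and function names are illustrative and not taken from the project code.

```python
# Anger-tagged Trustpilot reviews by reply-latency band (counts from the table above).
bands = [
    ("<1 hour", 6),
    ("1-4 hours", 24),
    ("4-24 hours", 153),
    ("24-72 hours", 494),
    ("3-7 days", 1101),
    ("1-4 weeks", 1103),
]
total = sum(count for _, count in bands)


def share(labels):
    """Percentage of anger-tagged reviews whose reply fell in the given bands."""
    return 100 * sum(count for label, count in bands if label in labels) / total


within_24h = share({"<1 hour", "1-4 hours", "4-24 hours"})
over_a_week = share({"1-4 weeks"})

# Median reply times in hours, from the emotion table above.
median_hours = {"joy": 98.8, "anger": 130.1}
anger_vs_joy = 100 * (median_hours["anger"] / median_hours["joy"] - 1)

print(f"anger reviews: {total}")
print(f"answered within 24h: {within_24h:.1f}%")
print(f"answered after more than a week: {over_a_week:.1f}%")
print(f"anger median slower than joy by: {anger_vs_joy:.1f}%")
```

Working from the raw counts rather than the rounded band percentages avoids compounding rounding error when bands are summed.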

Industry anchor

A 2026 support-response benchmark study found that organisations responding to customer messages within 1 hour achieve 71% retention vs 48% for those responding within 24 hours, a 23-percentage-point retention gap associated with reply speed ([Unthread 2026](https://unthread.io/blog/internal-support-response-time-service-desk/)). Top retailers respond within 2 hours; the retail-sector benchmark median for email response is 17 hours ([Email Meter](https://www.emailmeter.com/blog/understanding-industry-standard-sla-response-times)). PureGym's 130-hour anger median is 7.6× the retail median.

The 2026 Google Business Profile benchmarks survey shows that "fast response is the strongest predictor of star-rating recovery after a 1- or 2-star review" ([WebFX 2026](https://www.webfx.com/blog/seo/google-business-profile-benchmarks/)).

The recommendation

Re-rank the Trustpilot reply queue by emotion before assignment to a customer-service agent. Anger and sadness move to the front. Joy and surprise can wait. The combined_phase9.parquet already has is_anger / is_sadness flags from the BERT emotion classifier; the same model can run inline at review-ingest time.
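As an illustration of the re-rank rule, here is a minimal sketch of an emotion-first queue ordering. The `Review` class, the `PRIORITY` tiers, and the sample data are all hypothetical: the memo only specifies that anger and sadness go first and joy and surprise last, so the placement of fear and love in the middle tiers is an assumption of this sketch, as is the oldest-first tie-break.

```python
from dataclasses import dataclass
from datetime import datetime

# Lower number = answered sooner. Anger/sadness first and joy/surprise last
# follow the memo; the middle tiers for fear and love are an assumption.
PRIORITY = {"anger": 0, "sadness": 0, "fear": 1, "love": 2, "joy": 3, "surprise": 3}


@dataclass
class Review:
    review_id: str
    emotion: str        # label produced by the emotion classifier at ingest
    received: datetime  # when the review arrived


def rerank(queue):
    """Order by emotion tier, then oldest first within a tier.

    Unknown emotion labels fall into the middle tier (1).
    """
    return sorted(queue, key=lambda r: (PRIORITY.get(r.emotion, 1), r.received))


# Hypothetical sample queue, for illustration only.
queue = [
    Review("r1", "joy", datetime(2026, 4, 1, 9, 0)),
    Review("r2", "anger", datetime(2026, 4, 1, 11, 0)),
    Review("r3", "sadness", datetime(2026, 4, 1, 10, 0)),
    Review("r4", "surprise", datetime(2026, 4, 1, 8, 0)),
]

ordered_ids = [r.review_id for r in rerank(queue)]
print(ordered_ids)
```

Keeping arrival time as the secondary key means no review inside a tier is starved by newer arrivals; only the order across tiers changes, which matches the "order of work, not amount of work" framing.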

Cost. OpEx ~zero. The reply ops team already exists; this changes the *order* of work, not the *amount*. One sprint to wire the emotion classifier into the existing Trustpilot intake pipeline.

Leverage. Anger-tagged reviews are 18% of all Trustpilot reviews (2,881 of 15,815). If even half of them shift from "ignored for a week" to "answered within 24h", at industry retention multipliers, the implied churn reduction is material. Without a fabricated £ number: this is the single most concentrated improvement available from the existing data.

Owner. Customer service ops (corporate). Not a per-site fix.

→ What we landed on (and a hypothesis worth checking before re-rank goes live)

The naive read of the latency table is "PureGym replies fastest to joy-tagged reviews." Two complications surfaced after the report's first draft, and the re-rank recommendation should run *alongside* (not before) a check on each:

1. Sarcasm in the joy-tagged low-star band. 20.6% of 1- and 2-star reviews are tagged "joy" by the rubric BERT classifier. The dominant explanation is OOD register-mismatch (Twitter classifier hitting British complaint politeness — "Lovely staff, terrible billing, however..."). Brown & Levinson 1987 politeness theory + Biber & Conrad 2009 register theory frame this as classifier OOD, not sarcasm. But. The OOD framing assumes the misclassification is uniform; sarcasm at the individual-review level is a separate signal that can co-exist with register-mismatch and is operationally different. A 100-row hand-label of joy-tagged 1–2★ reviews — split for genuine politeness-prefix vs sarcasm vs misread negation — would resolve which pathway dominates. Cost: ~30 min of hand-labelling. Outcome: if sarcasm is >10% of the joy-low-star band, that's a separate model-level intervention, not a queue-reordering one.

2. Why is the joy queue moving fastest at all? If staff route by emotion tag, the inversion is explained. If they don't, *something else* is drawing reply ops toward joy-tagged reviews first. Possible candidates: reply effort rather than review length (joy median 34 words vs anger 30 words, so joy reviews are not shorter, but they tend to be more formulaic and faster to answer with a generic reply); fewer unanswered substantive complaints (a "thanks!" reply takes seconds, an angry complaint requires a refund-policy lookup); chirpy framing pulls staff toward easy wins under workload pressure. Worth a 30-minute interview with two reply-ops agents before assuming the re-rank is purely a workflow fix. If the inversion is partly a workload-survival pattern, queue-reordering will get resisted unless paired with reply-template support for the harder anger cases.

The §1 recommendation (re-rank by emotion) still stands: an anger median latency of 130h vs 99h for joy is the inversion, regardless of mechanism. But these two checks turn it from a one-shot policy change into a sequenced rollout: hand-label first, interview second, re-rank third.

§ 2 — Shift-worker cohort: an underserved positive segment

The data

Filtering all 27,666 reviews for shift-worker keywords (nurse, police, officer, paramedic, NHS, night shift, after my shift, finished work, 24/7, 5am, 6am, etc.) returns 1,177 matches (4.3% of corpus). The sentiment skew is striking:

Score	Count	% of cohort
5★	462	39%
4★	234	20%
3★	173	15%
2★	103	9%
1★	205	17%

59% are 4–5 star, versus a ~30% positive baseline across the full negative-filtered analysis. 681 of the 1,177 tag "joy" as primary emotion (58%).

Caveat from validation. A Claude Sonnet 4.6 zero-shot pass on a 200-row stratified sample of these matches confirms shift-worker identity in only 6 / 200 (3%, "yes"); 154 / 200 (77%) are "unclear" — the writer praises 24/7 access without naming a profession; 40 / 200 (20%) are explicit "no". The keyword filter is a 24/7-praise filter, not a verified-shift-worker filter. See § 7.5 for methodology. The campaign opportunity is no smaller for the correction, but the framing must lead with 24/7-as-flexibility-promise as the dominant data signal, with shift workers as the visible but smaller proof cohort.

Top keyword hits:
- 24/7 access: 798 mentions ← the dominant signal
- Security / cleaner / warehouse / bouncer: 254
- After-my-shift / finished work / before work: 102
- Police / officer: 27
- NHS / hospital / midwife / doctor: 15
- Night shift / shift work: 13

Verbatim positives (sample)

"Open 24 hours so I can always go after doing a night shift!" — Bletchley, 5★

"Don's 6am BOOTCAMP class on Monday's are the best!!! He's very encouraging and pushes you to be the best YOU" — Elkridge, 5★

"Good selection of equipment, not too busy (at the times I go) and accessible 24/7" — Inverness Inshes, 4★

"Brought for my son less than £6pw an excellent membership can use anytime as its open 24/7" — Weston Super Mare, 5★

"Lovely gym 24/7 access classes included with membership also free parking" — Newcastle Longbenton, 5★

The full 1,177-row CSV is at v3/output/v3_ext_shiftworker_matches.csv.

The recommendation

A targeted "We serve the people who serve you" campaign — discount or perk for verified NHS / police / fire / ambulance / armed-forces members. The data shows this cohort is already disproportionately satisfied; the marketing move converts a passive strength into an explicit acquisition channel and a defensive retention moat.

Channel design (proposed, not measured):
1. Verification via existing platforms (Blue Light Card, Defence Discount Service), at ~zero infrastructure cost.
2. Headline benefit: dedicated 5–7am "shift-end" perks (towels, free coffee, priority lockers) at sites with relevant catchments.
3. Storytelling content: the 5 verbatim quotes (above) become the campaign's first creatives.
4. Ad campaign sequenced for an autumn / January push, when shift workers' New Year resolutions intersect with PureGym's January acquisition peak.

Cost. OpEx (marketing reallocation, not net new). The verification platforms already exist and are cheap-or-free for partners.

Leverage. The Service-Profit Chain literature (Heskett 1994; Bernhardt, Donthu & Kennett 2000) supports the upstream link from staff-and-niche-cohort satisfaction to retention. PureGym is the most-mentioned 24/7 budget gym in this dataset — defending that position with a campaign is cheaper than out-spending The Gym Group on generic acquisition.

Owner. Marketing (corporate) + ops (site-level execution).

→ What we landed on ultimately

The walk from first draft to current framing went through three positions, each killed by the next data point:

1. Initial framing. "1,177 underserved shift-workers, 696 are 4–5★, 681 tag joy: go after them." Strong narrative, weak validation.
2. First correction (LESSONS_ADDENDUM, 2026-04-25). Sonnet 4.6 200-sample classifier flipped the headline: 3% confirmed shift-worker, 77% unclear, 20% no. The keyword filter is a 24/7-praise filter, not a verified-shift-worker filter. Verified-yes count projects to ~35 across the 1,177 (3% × 1,177).
3. Second correction (same addendum). Shift workers are NOT calmer than the general PureGym population: mean rating 3.84 vs general 3.89, joy 64.3% vs general 65.1%, and they have MORE equipment complaints, not fewer. The "underserved positive cohort" framing is wrong on its own data: this cohort is approximately *average* on satisfaction and slightly *worse* on equipment. The lift is in the 24/7-flexibility praise the keyword filter caught, not in shift-worker satisfaction.

Final position. The recommendation is now: lead acquisition copy with 24/7 flexibility as a lifestyle promise (proof: 798 unprompted "24/7" mentions in the keyword set, the dominant signal). Shift workers are a *small but vivid proof cluster* (~35 projected from the 3% verified rate) that supports the narrative: quote them in creative, but don't make occupational identity the segmentation axis. The Blue Light Card / Defence Discount route is still cheap to add, but as a *secondary* tactic (verified retention proof), not the headline.

The "we serve who serves you" copy still works for those ~35 projected shift workers as a niche secondary creative; it just can't carry the campaign.

§ 3 — Seasonality: AC summer, heating winter, both predictable

AC complaints (negative reviews mentioning aircon / cooling / "too hot" / stuffy / sweat)

Month	Negative reviews	AC mentions	Rate
Jan	554	32	5.8%
Feb	632	44	7.0%
Mar	713	31	4.3%
Apr	704	30	4.3%
May	260	28	10.8%
Jun	371	52	14.0%
Jul	481	66	13.7%
Aug	366	45	12.3%
Sep	314	45	14.3%
Oct	613	44	7.2%
Nov	253	21	8.3%
Dec	564	31	5.5%

Summer (May–Sep) AC complaint rate = 13.0% vs spring (Mar–Apr) baseline 4.3% — 3× lift.
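The summer-versus-spring lift can be reproduced from the monthly counts in the table above. This is a minimal sketch assuming only those counts; the function names are illustrative. It shows both the unweighted mean of monthly rates (which matches the 13.0% headline) and the pooled rate, since the two differ slightly when monthly volumes are uneven.

```python
# (negative reviews, AC mentions) per month, copied from the table above.
ac = {
    "Jan": (554, 32), "Feb": (632, 44), "Mar": (713, 31), "Apr": (704, 30),
    "May": (260, 28), "Jun": (371, 52), "Jul": (481, 66), "Aug": (366, 45),
    "Sep": (314, 45), "Oct": (613, 44), "Nov": (253, 21), "Dec": (564, 31),
}


def mean_rate(months):
    """Unweighted mean of the monthly AC-mention rates, in percent."""
    return sum(100 * ac[m][1] / ac[m][0] for m in months) / len(months)


def pooled_rate(months):
    """Total AC mentions over total negative reviews for the months, in percent."""
    mentions = sum(ac[m][1] for m in months)
    negatives = sum(ac[m][0] for m in months)
    return 100 * mentions / negatives


summer = ["May", "Jun", "Jul", "Aug", "Sep"]
spring = ["Mar", "Apr"]

summer_mean = mean_rate(summer)
spring_mean = mean_rate(spring)
lift = summer_mean / spring_mean

print(f"summer mean rate: {summer_mean:.1f}%")
print(f"summer pooled rate: {pooled_rate(summer):.1f}%")
print(f"spring mean rate: {spring_mean:.1f}%")
print(f"lift: {lift:.1f}x")
```

The choice of March and April as the baseline follows the memo; picking a different baseline window would change the multiplier, so the lift should be read alongside the full monthly table.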

Heating + cold-shower complaints

Month	Rate
Feb	1.1% (lowest)
Mar	1.7%
Apr	1.4%
May	2.7%
Jun	3.2%
Jul	2.1%
Aug	3.3%
Sep	2.5%
Oct	3.1%
Nov	4.7%
Dec	5.5%
Jan	2.3%

Winter (Nov–Dec) cold-shower rate = 5.1% vs the February trough of 1.1%: a 4.6× lift.

Temperature mentions

Identical pattern: November–December peak (4.0%, 4.3%) vs spring trough (March 0.8%, April 0.6%).

The recommendation

Run an annual pre-emptive HVAC service cycle keyed to the data.
- April pre-summer audit: every site's AC inspected before May, so faults are caught ahead of the May–Sep complaint surge.
- October pre-winter boiler / hot-water audit: every site's hot-water system inspected before November. Cold-shower complaints in December are the clearest preventable churn driver in the dataset.

The existing v3_ctx_weather_* and v3_ext_weather_* artefacts already plot weather × negativity; this section closes the loop with the operational action.

Cost. OpEx (recurring service contract). The Perplexity panel review correctly flagged that the deep-research HVAC cost numbers were anchored on a 2,500 sqft boutique-studio assumption rather than PureGym's actual 10,000–20,000 sqft warehouse format — so absolute £ numbers must come from real PureGym facilities ops, not from this report. The directional recommendation (pre-emptive service > reactive repair) holds regardless.

Leverage. Removing one season-of-complaints from each year flattens the negative-review trajectory documented in the academic report's "Things Are Getting Worse" section.

Owner. Facilities ops (corporate, with site-level execution).

§ 4 — Music volume: a per-site finding, not a corporate one

The Glassdoor null result

The 43 PureGym Glassdoor staff reviews contain zero mentions of music, volume, loud, playlist, sound system, or speaker. The hypothesis that music volume is a corporate-mandated policy (à la some retail chains) is not supported by the data we have.

This re-frames the 43 Google "loud music" complaints (already documented in report.md) from a chain-wide policy lever into a per-site manager accountability lever.

The recommendation

Set a target dB range (e.g. 65–75 dB at the gym floor centre) and make it a club-manager monthly compliance metric. No CapEx — every PA system already supports volume control. The intervention is purely policy + manager dashboard.

If the company wants harder evidence, a 2-week dB measurement pass at the 10 most-complained sites is the next data step. Cost: a £30 SPL meter and ~10 site visits.

Leverage. 43 reviews is a small but concentrated thread; cleaning it up removes a recurring Google-review topic from the 11 BERTopic clusters identified in the academic report.

Owner. Club managers (per-site), accountable through standard ops review.

§ 5 — Top-10 worst clubs: an operational triage programme

The 10 clubs and their concentration

#	Club	Negative reviews	London?	Recurring theme cluster
1	London Stratford	81	yes	Hygiene + named-staff issues + WiFi/ventilation broken
2	Leicester Walnut Street	59	no	Temperature/HVAC + equipment + showers + rule enforcement
3	London Enfield	48	yes	Staff attitude + equipment disorganisation + overcrowding
4	London Swiss Cottage	37	yes	Maintenance/odours + overcrowding + missing equipment
5	Birmingham City Centre	34	no	Outdated equipment + flickering lights + lockers
6	London Bermondsey	33	yes	Hygiene + overcrowding + temperature/ventilation
7	London Seven Sisters	33	yes	Insufficient lockers + frequent unannounced closures
8	London Hayes	33	yes	Toilets/showers/water/equipment maintenance
9	New Barnet	32	suburban	Overcrowding + cleanliness + class noise + AC
10	London Canary Wharf	32	yes	Maintenance + ventilation + alleged staff harassment

422 negative reviews concentrated in 10 of 410 corporate sites — 2.4% of the network generating ~7% of the negative-review volume in this corpus. 8 of 10 are London / Greater London. The geographic concentration is itself the finding: a London-focused 90-day rapid-improvement programme covers 80% of the worst-club issue.
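The concentration figures can be checked with a few lines of arithmetic. This is a minimal sketch using the club counts from the table above. One loudly flagged assumption: the corpus-wide negative-review total (5,825) is taken here as the sum of the monthly "Negative reviews" column in the § 3 AC table, which the memo does not state explicitly.

```python
# Negative-review counts for the ten clubs in the table above.
top10 = {
    "London Stratford": 81, "Leicester Walnut Street": 59, "London Enfield": 48,
    "London Swiss Cottage": 37, "Birmingham City Centre": 34,
    "London Bermondsey": 33, "London Seven Sisters": 33, "London Hayes": 33,
    "New Barnet": 32, "London Canary Wharf": 32,
}
# The two clubs marked "no" in the London? column.
outside_london = {"Leicester Walnut Street", "Birmingham City Centre"}

CORPORATE_SITES = 410
# Assumption: total negative reviews = sum of the monthly counts in the
# section 3 AC table. The memo does not state this total directly.
TOTAL_NEGATIVE = 5825

top10_total = sum(top10.values())
site_share = 100 * len(top10) / CORPORATE_SITES
volume_share = 100 * top10_total / TOTAL_NEGATIVE

london_total = sum(n for club, n in top10.items() if club not in outside_london)
london_share = 100 * london_total / top10_total

print(f"top-10 negative reviews: {top10_total}")
print(f"share of corporate sites: {site_share:.1f}%")
print(f"share of negative-review volume: {volume_share:.1f}%")
print(f"London / Greater London share of top-10 reviews: {london_share:.0f}%")
```

Measuring the London share by review volume as well as by club count gives a slightly more conservative view of how much of the problem a London-focused programme would reach.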

What fixing each typical issue would avoid paying for

Drawing on the recurring themes across the 10 briefs (v3_11_club_briefs.md), framed in the way Russell taught: list what you'd avoid paying for, hand the pricing to the owner.

Recurring issue	OpEx avoidance (recurring)	CapEx avoidance (one-off)	Owner
Hygiene / cleanliness	Negative-review-driven retention loss; refund / goodwill credits; viral 1★ Google velocity	None (cleaning is OpEx-only)	Regional ops + cleaning contract review
Equipment maintenance	Repeated repair callouts; member-LTV impact when 2-3 machines down at once	Phased equipment refresh (delays the inevitable, doesn't avoid)	Facilities ops
Overcrowding at city-centre sites	Refund pressure when peak-hour members can't use machines	Possible vestibule / capacity-cap CapEx (see Pilot 2)	Site-level ops
HVAC / ventilation / hot-water	Reactive repair callouts at peak failure; December cold-shower churn (§3)	Pre-emptive service contract (OpEx) usually beats reactive replacement (CapEx)	Facilities ops
Locker theft / insufficient lockers	Goodwill credits to theft victims; recurring 1★ reviews at affected sites	Locker hardware refresh at Seven Sisters / Stratford	Site-level ops + capital plan
Named-staff issues (e.g. Stratford)	HR cost of complaint resolution; brand damage if unaddressed	Training programmes (modest OpEx)	HR + regional manager
Frequent unannounced closures (Seven Sisters)	Pro-rata refund pressure; cancellations	Communication system / ops process	Regional ops

Recommendation. A 90-day-per-site programme, sequenced by complaint volume × London-density × OpEx-only-fixability. London Stratford first (81 reviews, manager-issue flagged, mostly hygiene/staff = pure OpEx). Birmingham and Leicester last (out-of-London, equipment-refresh CapEx-heavy = budget approval needed). The Qwen briefs at v3_11_club_briefs.md already provide the per-site this-week / next-month / next-quarter sequencing.

Owner. Regional ops manager (programme lead) + facilities ops (HVAC + locker work) + HR (named-staff escalations).

§ 6 — Speculative pilots (Tier 4: design proposals, not findings)

Full experiment specs at [v3/output/pilot_designs.md](v3/output/pilot_designs.md) — hypothesis / sites / duration / measurement / kill criteria / risk per pilot.

The academic report and §1–§5 above are grounded in dataset evidence. The proposals below are design ideas to test, not analyses with conclusions. Each is presented as an experiment design, not a recommendation.

6.1 Warm-shower temperature A/B

Hypothesis. Increasing shower water temperature 2–3°C (within safe limits) shortens average shower duration enough to offset the additional gas/electric cost, *and* reduces the cold-shower complaint cluster in the December review set.

Experiment design. Pick 4 matched sites by member count + complaint volume. Increase shower temp at 2 sites, hold the other 2 as control. Measure: shower duration (existing flow-meter data, if instrumented), gas/electric cost per member, December review complaint rate.

This memo will not estimate £ savings or hours. Pierre's working rule: cost / time / size claims fabricated by an LLM are always wrong unless quoted from real measurement. The experiment is the source of truth.

Risk. Scalding liability, member preference variance.

6.2 AC double-doors / capacity caps in summer

Hypothesis. The summer AC complaint surge (§3, 13% rate vs 4% baseline) is driven partly by door-open heat ingress at peak hours plus over-capacity heat generation. A door vestibule + a capacity cap at peak (e.g. 100 members on the floor max during 17:00–19:00 weekdays in July–August) reduces both.

Experiment design. Single-site pilot at a high-summer-complaint London site. Install temporary vestibule curtain + capacity-cap notice. Measure: AC complaint rate vs prior summer + control site, emotional intensity of the complaints, cancellation rate during the pilot.

Cost. CapEx (vestibule install) + OpEx (capacity-cap signage and staff). Both small per site.

Risk. Member backlash on capacity caps. Mitigate: app-based reservation system.

6.3 Disney-style queueing app

Hypothesis. A "tell us what equipment you want for your workout, we'll give you a 10-minute window" reservation app reduces overcrowding-driven complaints (currently #3 BERTopic cluster in the Google reviews).

This is the most speculative item in the memo. No evidence for or against from the dataset. Worth one product-discovery sprint to validate user demand.

6.4 Buddy-pairing for similar-routine members

Hypothesis. Members of similar weight / routine paired into 30-minute slots create community-formation effects (Heskett service-profit-chain literature, Perplexity research §retention drivers).

Status. Left field. Park as a Phase 3 product idea. Mentioned for completeness.

6.5 45-day induction programme — gap noted, not previously in this memo

Hypothesis. The first 45 days post-join is the highest-churn window across the budget-gym sector. A structured induction (week-1 onboarding session, week-3 check-in, week-6 milestone) reduces 90-day churn by enough to recover programme cost.

Why it's here now. The Gym Group's FY25 annual report cites a 45-day induction programme with a 2–4 month payback. This memo's first draft did not consider it — call-out flagged in the LESSONS_ADDENDUM "open issues" section as missing from the intervention list. Adding here for completeness; PureGym's "low-labour-cost model" (per CEO's FY2024 statement) is a structural argument for *and* against — it makes induction cheap to staff but cuts against any model that requires more than a kiosk.

Experiment design. A/B at matched site pairs. Intervention site runs the 45-day induction; control runs status quo. Measure 90-day churn, week-2 / week-6 visit frequency, NPS at day-45.

Status. Worth a discovery sprint to assess feasibility under the low-labour-cost model. Not a Tier 1 recommendation; a Tier 4 design proposal flagged so it doesn't sit in the gap forever.

§ 7 — Caveats: what the deep research got wrong

The Perplexity Sonar Deep Research output (puregym_deep_research.md) was reviewed by a 5-expert panel (panel_review_perplexity.md). Errors found:

1. Leonard Green acquisition date wrong. Perplexity says Nov 2013 citing CCMP. CCMP did acquire PureGym in 2013; LGP acquired it from CCMP in 2017 for $786m (Bloomberg / Pitchbook confirmed via post-publish cross-check, 2026-04-25). The Companies House FY2024 filing names LGP + KKR as current investors. The date + transaction-size correction doesn't change the exit-thesis framing but is required for any transaction-multiple math anchored on entry price.

2. Gym floor size assumption wrong. Report uses 2,000–3,000 sqft (boutique studio); PureGym sites are 10,000–20,000+ sqft warehouse format. Any HVAC, cleaning, or per-sqft-OpEx number from the research is calibrated to the wrong floor size: disregard absolute £ figures, retain only directional ratios.

3. Year-1 churn 40% / Year-2+ retention 85% comes from a single non-peer-reviewed source (benfit.co.uk, a blog aggregator). Useful as a sense-check, not as a citable benchmark. The Gym Group's own 2024 / 2025 annual reports are the authoritative source if the actual number matters to a recommendation.

4. ROI range "265–905%" methodologically unsound — divides low benefits by high costs and high benefits by low costs. Same-percentile gives 453–525%, still high but more defensible. Sensitivity analysis missing.

5. r=0.60 music-negativity correlation needs caveats — Pearson vs Spearman distinction not stated; review-length confound not controlled; causation vs co-occurrence not separated. Treat as suggestive, not confirmatory.

6. Milliman 1980s music research cited from a commercial audio-equipment blog. Original sources are Milliman 1982 (J Marketing) and 1986 (J Consumer Research). The directional finding (slow tempo → longer dwell) is robust; the specific 25% number should be re-cited from primary source if used in a board pack.

The PureGym FY2024 financials in this memo come directly from Companies House (Pure Gym Limited, reg 06690189, filed 23/05/2025, audited by KPMG LLP Nottingham) and are not subject to the panel-review caveats above.

§ 7.5 — Methodology robustness checks

The panel review of the Perplexity industry research flagged "no sensitivity analysis" as a methodological gap. This section closes the most material parts of that gap for findings new to this memo.

Shift-worker keyword filter — Sonnet 4.6 validation pass (§2)

The 1,177-row keyword-filtered set in §2 was validated by a Claude Sonnet 4.6 zero-shot classification on a 200-row stratified sample (100 score≥4, 100 score<4). Prompt: "Does the writer plausibly identify as a shift worker — nurse, police, paramedic, firefighter, NHS staff, security, night cleaner, warehouse, or someone whose work hours fall outside standard 9-5? answer yes / no / unclear."

Verdict	Count	% of 200
yes	6	3.0%
unclear	154	77.0%
no	40	20.0%

Implication. The keyword filter is a 24/7-praise filter, not a verified-shift-worker filter. Of the 1,177 matches, the projected verified-yes count is ~35 (3% × 1,177 ≈ 35.3). The 77% unclear band is genuinely indeterminate: many may still be shift workers but the review doesn't say so. The headline §2 finding is therefore reframed: 24/7-flexibility appreciation is the dominant signal, with shift workers as the visible but small proof cluster. The "we serve who serves you" campaign is still actionable, but on a smaller verified base than the unfiltered keyword count suggested.

Methodology + raw verdicts: v3/output/v3_ext_shiftworker_validate.{py,json,log}. Cost: 37k input tokens + 0.8k output tokens (Sonnet 4.6 standard pricing).
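Because the projection rests on only 6 "yes" verdicts in 200, it helps to carry an uncertainty range alongside the point estimate. The following is a minimal sketch using a Wilson score interval. It is not part of the validation script referenced above, and it treats the 200 rows as a simple random sample, which is an assumption: the real sample was stratified 100/100 by score, and the per-stratum "yes" counts are not given in this memo.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


YES, SAMPLE, MATCHES = 6, 200, 1177  # figures from the validation table above

# Assumption: simple random sampling, ignoring the 100/100 stratification.
point = YES / SAMPLE * MATCHES
low, high = wilson_interval(YES, SAMPLE)

print(f"projected verified shift workers: {point:.1f}")
print(f"approx. 95% range: {low * MATCHES:.0f} to {high * MATCHES:.0f}")
```

The Wilson interval is used here instead of the simpler normal approximation because the observed proportion is close to zero, where the normal approximation can produce a lower bound below zero.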

Emotion classifier OOD risk (§1, §2)

The bhadresh-savani/bert-base-uncased-emotion classifier used for the anger / sadness / joy / fear labels is Twitter-trained. The academic appendix already documents that ~1,486 1- and 2-star reviews are mis-labelled "joy", a classic out-of-distribution failure on understated British complaint prose. Reply-latency-by-emotion (§1) and emotion-by-cohort (§2) both inherit this risk:
- If 30% of "joy" labels are actually anger/sadness in the OOD prose (the appendix's churn_signal merge implies roughly that scale), the anger-median-130h vs joy-median-99h gap shrinks but does not disappear. The "PureGym replies fastest to joy and slowest to anger" inversion is robust to that level of mis-classification because the absolute ordering is preserved across both populations.
- The §2 "joy" count (681 of the 1,177 matches) is more sensitive: if 30% of those are mis-labelled, the headline drops to ~480 verified joy. Still substantial. The campaign pitch is unchanged.

Cross-classifier gold-accuracy results (added 2026-04-25)

The intuitive cross-check — "validate the rubric BERT classifier against an independent emotion model" — was performed against j-hartmann/emotion-english-distilroberta-base on the 50-row hand-labelled gold set. Result was *not* what we initially expected.

Classifier	Gold accuracy	Notes
Raw rubric BERT (`bhadresh-savani/bert-base-uncased-emotion`)	0%	Expected — Twitter OOD failure
Phase-8b score-guided re-rank	42%	Best of the open-source pipeline
j-hartmann/emotion-english-distilroberta-base (indie cross-check)	18%	Worse than 8b — different OOD axis
Gemini 2.5 Flash zero-shot	40%	Roughly tied with 8b
Claude Sonnet 4.6 zero-shot	74%	Only model approaching usable accuracy

What we landed on. The "indie classifier validates 8b" assumption was wrong: j-hartmann fails on a *different* OOD axis (different training corpus, different bias profile) and underperforms 8b on this distribution. The honest cross-check is a frontier LLM (Sonnet at 74%); only Sonnet rises above coin-flip against the hand-labelled gold set on this kind of British-complaint-prose register. §1's reply-latency-by-emotion table inherits a known mis-classification floor (8b is at 42% on gold). The directional finding (joy answered fastest, anger answered slowest) is robust to that (see "Emotion classifier OOD risk" above), but any per-emotion magnitude must carry the 42%-gold caveat.

For published material, the operational rule is: don't validate one open-source classifier with another open-source classifier on this register. Either hand-label, or use a frontier LLM.

r=0.60 music-negativity correlation (panel review caveat)

The panel review flagged the r=0.60 finding as needing "Pearson vs Spearman, control for review length, causation vs co-occurrence" caveats. None of the recommendations in this memo (§4) lean on the magnitude of that correlation; the music-volume hypothesis is supported on the count side (43 Google reviews mentioning loud music) and falsified on the policy side (Glassdoor null result). Sensitivity to the r=0.60 number itself is therefore not material to any §4 action.

§ 8 — What this memo deliberately does not say

- No £ savings figures per recommendation. Per Pierre's working rule (and the LGP+KKR audience's likely read), an LLM-fabricated cost number is worse than no number. Anchor any £ calculation in a real measurement (FY2025 facilities-ops data, real CAC from PureGym's own marketing analytics, real shower-duration measurement at a pilot site).

- No claim about The Gym Group beyond what their public filings show. The Gym Group's £244.9m FY2025 revenue, 23% Adjusted EBITDA Less Normalised Rent margin, 27% mature-site ROIC, £21.60 ARPM, and 4% YoY membership growth are filing-direct ([The Gym Group 2025 results](https://www.tggplc.com/investors)). PureGym's £22.64 ARPM is about 5% above this; the pricing-power gap is real and worth defending.

- No CapEx-led growth recommendation. The literature (Huff gravity, retail cannibalization) suggests urban-market saturation is a real risk above ~5,000 UK clubs. PureGym's expansion programme (433 → 460+ targeted) faces decreasing-returns headwinds. Retention-led EBITDA expansion at the 433 existing sites is the LGP+KKR exit-friendly play. Every recommendation in §1–§5 above is retention-flavoured for that reason.

§ 9 — Cross-references

- Academic submission: report.md (995 words, Falcon→Qwen + T4→A100 drift fixed, Sonnet shootout in appendix); Canvas 3354, due 2026-04-27.

- Deployed site (PIN-gated): [pace-study.pages.dev](https://pace-study.pages.dev/), 6 pages currently (Report, Audio deck, Cribsheet, Walkthrough, Row inspector, Top 20 tips). This memo proposes adding a 7th: Extended findings.

- Companion site: [pace-compass.pages.dev](https://pace-compass.pages.dev/) (alive, distinct project).

- Business model calculator: v3/business-model.html, with sliders for # gyms, churn rate, ARPM, CAC, retention investment, churn reduction. Constants: 410 base gyms, 29.7% EBITDA margin, 10× valuation multiple, £2m CapEx per gym. Source for the LGP+KKR exit math.

- Per-club briefs: v3/output/v3_11_club_briefs.md, top-10 worst-club action briefs (Qwen2.5-72B), reviewed for §5 above.

- Industry research: v3/output/puregym_deep_research.md (use with §7 caveats).

- Panel review of research: v3/output/panel_review_perplexity.md.

- PureGym FY2024 real numbers: PUREGYM_FY2024_REAL_NUMBERS.md (Companies House primary).

- Glassdoor staff data: v3/output/glassdoor_staff_reviews_full.csv (43 reviews, 2023–2024 span, used for §4 null-result on music).

- Shift-worker filtered set: v3/output/v3_ext_shiftworker_matches.csv (1,177 reviews, generated for §2).

§ 10 — Sources cited in this memo (web)

- [Unthread 2026: Support Response Time Benchmarks](https://unthread.io/blog/internal-support-response-time-service-desk/): 1h reply = 71% retention; 24h reply = 48%.

- [WebFX 2026: Google Business Profile Benchmarks](https://www.webfx.com/blog/seo/google-business-profile-benchmarks/): fast response is the strongest predictor of star-rating recovery.

- [Email Meter: Industry Standard SLA Response Times](https://www.emailmeter.com/blog/understanding-industry-standard-sla-response-times): retail median 17h email response; financial services 14h.

- [Spidya 2026: Enterprise Services SLA Benchmark](https://spidya.com/en/blog/it-service-management/enterprise-services-sla-benchmark-2026-key-metrics-for-success): 95% compliance industry target.

- Companies House: Pure Gym Limited (reg 06690189), Annual Report & Financial Statements year ended 31 December 2024.

§ 11 — Project meta-analytics: how this work was made

This appendix is a working-process record. It exists to be transparent about how a single person built the analyses, the report, the deployed site, and the supporting artefacts in this memo, using Claude Code (Anthropic's coding CLI) as the primary tool. Numbers are extracted from on-disk session JSONLs by pipelines/session-analytics/stats_summary.py; window is 17 active days from project start (2026-04-09) to today (2026-04-26).

§ 11.1 — Headline numbers

Metric	Pace-NLP only
Active days	17
Sessions (top-level)	26
Real human prompts typed	351
Total events (user + assistant + meta)	9,899
Average prompts per session	13.5
Total characters typed by Pierre	~308,000
Total words typed by Pierre	~39,000
Output tokens emitted by Claude	10.0 M
Cache reads (hits)	1,255 M
Cache creations	61.6 M
Fresh input tokens	0.2 M

The cache-read figure dwarfs fresh input by ~6,250×. The CLAUDE.md preamble + LESSONS_ADDENDUM + memory files are loaded once and re-used across every turn in a session — the long-conversation pattern this project ran is the right shape for prompt caching.
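The ratio can be re-derived from the rounded figures in the headline table. The short check below uses only those table values; because fresh input is rounded to 0.2 M, the result (about 6,275×) sits close to, but not exactly on, the ~6,250× quoted. The variable names are illustrative.

```python
# Figures copied from the headline table, in millions of tokens.
cache_reads_m = 1255
cache_creations_m = 61.6
fresh_input_m = 0.2

# How many cached tokens were read per freshly supplied input token.
ratio = cache_reads_m / fresh_input_m

# Cache reads as a share of everything on the input side.
total_input_m = cache_reads_m + cache_creations_m + fresh_input_m
cache_read_share = cache_reads_m / total_input_m

print(f"cache reads / fresh input = {ratio:,.0f}x")
print(f"cache reads share of input tokens = {cache_read_share:.1%}")
```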

§ 11.2 — Project arc (daily timeline, pace-nlp cwd only)

Date	Day	Sessions	Prompts	Anchor event
2026-04-09	Thu	2	20	Project start
2026-04-10	Fri	3	23
2026-04-11	Sat	5	42	First gold-label exploration; 50/50 hand-labelling
2026-04-13	Mon	1	25
2026-04-14	Tue	7	38	Visual-verify discipline established; Mermaid v11 CDN switch
2026-04-16	Thu	2	31	Russell live Q&A — Falcon→Qwen swap green-lit
2026-04-17	Fri	1	8
2026-04-18	Sat	5	96	Workbench tooling lifted out; 5 parallel agents; submission artefact ready 9 days early
2026-04-19	Sun	1	4	Quiet — PACE 301 finalisation
2026-04-25	Sat	2	27	Two parallel sessions on same repo (visual polish + extended report)
2026-04-26	Sun	2	37	Today — extended-report refresh, audio note, lift to data-workbench

Quiet pace-nlp days (Apr 12, 15, 20–24) overlapped with work on the same artefacts from the data-workbench cwd post-lift. True pace-related footprint is 422 prompts across 67 sessions when you union the cwd filter with a keyword filter (pace, puregym, rubric, bertopic, trustpilot, c301).
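The union described above is a logical OR of two predicates, followed by counting prompts and distinct sessions. A minimal sketch, assuming each prompt is a dict with hypothetical `session_id`, `cwd` and `text` keys (the real parquet columns are not shown in this memo):

```python
KEYWORDS = ("pace", "puregym", "rubric", "bertopic", "trustpilot", "c301")


def is_pace_related(prompt, cwd_marker="pace-nlp"):
    """True if the prompt ran in the project cwd OR mentions a keyword."""
    in_cwd = cwd_marker in prompt["cwd"]
    text = prompt["text"].lower()
    has_keyword = any(k in text for k in KEYWORDS)
    return in_cwd or has_keyword


# Hypothetical records, for illustration only.
prompts = [
    {"session_id": "s1", "cwd": "/home/u/pace-nlp-project", "text": "rerun the notebook"},
    {"session_id": "s2", "cwd": "/home/u/data-workbench", "text": "refresh the PureGym extended report"},
    {"session_id": "s2", "cwd": "/home/u/data-workbench", "text": "fix the linter"},
    {"session_id": "s3", "cwd": "/home/u/other", "text": "unrelated question"},
]

matched = [p for p in prompts if is_pace_related(p)]
sessions = {p["session_id"] for p in matched}
print(f"{len(matched)} prompts across {len(sessions)} sessions")
```

Plain substring matching is deliberately loose: "pace" also matches "space" or "workspace", so a keyword filter of this kind can over-count and is best read as an upper bound.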

§ 11.3 — Models and tools

Model	Assistant events	Output tokens
`claude-opus-4-6`	3,619	4.6 M
`claude-opus-4-7`	2,425	5.4 M
(blank model field)	2	0 M

No Sonnet, no Haiku. Pure Opus across both 4.6 (the primary model for most of the project) and 4.7 (the 1M-context variant used for heavier turns from late April onwards).

For the tool-use breakdown during pace-nlp work, see the existing pace-deploy/.workspace/analytics.json — generated by a separate pipeline with a narrower window (analyze_pace.py, last refreshed 2026-04-25). Top tools that pipeline observed: Bash (1,082), Read (272), Edit (207), TaskUpdate (168), Write (135), TaskCreate (89), Grep (54), NotebookEdit (32). Two analytics surfaces, two windows; treat the table above as the primary 30-day view, treat analytics.json as the narrower tool-distribution slice.

§ 11.4 — Slash-command discipline (within pace-nlp)

Command	Count
`/clear`	9
`/done`	8
`/commit`	7
`/workbench`	5
`/model`	4
`/exit`	2
Others (1 each): `/chrome`, `/steve`, `/audit`, `/broadcast`, `/alex`	—

40 of 351 prompts (11.4%) were slash commands. Substantially lower than the cross-project rate (25.0%) — pace-nlp work was conducted conversationally, not workflow-macro-driven. The /gsd-* workflow macros that dominate cross-project use (238 invocations across 30 days) ran on pace-nlp from inside the data-workbench cwd post-lift, not from the pace-nlp-project cwd directly.
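The 11.4% figure follows directly from the table, and the same tally can be produced from raw prompts. The sketch below first cross-checks the table arithmetic, then shows one possible counting helper; `slash_share` and the demo prompts are hypothetical, not the memo's pipeline.

```python
from collections import Counter


def slash_share(prompts):
    """Return (per-command counts, share of prompts that are slash commands)."""
    commands = Counter(p.split()[0] for p in prompts if p.startswith("/"))
    return commands, sum(commands.values()) / len(prompts)


# Cross-check of the table: six named commands plus five singletons.
table = {"/clear": 9, "/done": 8, "/commit": 7, "/workbench": 5, "/model": 4, "/exit": 2}
singletons = ["/chrome", "/steve", "/audit", "/broadcast", "/alex"]
total_slash = sum(table.values()) + len(singletons)
print(total_slash, f"{total_slash / 351:.1%}")  # 40 11.4%

# Hypothetical prompt log, for illustration only.
demo = ["/clear", "rerun the topic model", "/commit fix typo", "thanks", "/clear"]
counts, share = slash_share(demo)
print(counts.most_common(), f"{share:.0%}")
```

A prompt that merely begins with a file path such as `/home/...` would be miscounted by this prefix test, so a real tally would want an allow-list of known commands.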

§ 11.5 — Prompt-length distribution (chars typed by Pierre)

Percentile	Chars
P10	7
P25	29
Median	82
P75	216
P90	616
P99	5,089
Mean	809

Median pace-nlp prompt was 82 characters — 40% longer than the cross-project median (59 chars). More instruction-dense per turn, fewer one-word acks. Mean is dragged up by a long tail of large pastes (data dumps, error tracebacks, panel-review specs).
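The gap between median and mean is what a long right tail produces. As an illustration, a percentile profile of this shape can be computed with the standard library; the interpolation method used by stats_summary.py is not stated in this memo, so the `inclusive` method here is an assumption, and the sample prompts are hypothetical.

```python
import statistics


def length_profile(prompts):
    """Percentiles, median and mean of prompt lengths in characters."""
    lengths = sorted(len(p) for p in prompts)
    # 99 cut points; cuts[k - 1] is the k-th percentile.
    cuts = statistics.quantiles(lengths, n=100, method="inclusive")
    return {
        "P10": cuts[9],
        "P25": cuts[24],
        "Median": statistics.median(lengths),
        "P75": cuts[74],
        "P90": cuts[89],
        "P99": cuts[98],
        "Mean": statistics.fmean(lengths),
    }


# Hypothetical prompts: mostly short acks plus one large paste.
sample = ["ok", "yes", "run the tests", "x" * 80, "y" * 200, "z" * 5000]
profile = length_profile(sample)
for name, value in profile.items():
    print(f"{name:>6}: {value:,.1f}")
```

With one 5,000-character paste among six prompts, the mean lands far above the median, which is the same pattern the table shows at larger scale.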

§ 11.6 — Daily rhythm (UK time-zone)

Peak hour: 10:00. Different from Pierre's cross-project peak of 21:00. Pace-nlp got morning hours; deadline-driven course work shaped the rhythm.

```
00:00 # 3
07:00 # 2
08:00 ####### 15
09:00 ############ 24
10:00 #################### 38 <- peak
11:00 ############### 30
12:00 ############### 30
13:00 ############## 28
14:00 ######## 16
15:00 ###### 12
16:00 ########### 22
17:00 ### 7
18:00 ########### 22
20:00 ################# 34
21:00 ############### 30
22:00 #### 9
23:00 ####### 15
```


Daytime (06–22): 324 prompts. Night/early (22–06): 27 prompts (7.7%), against a cross-project night rate of 11.7%, so discipline held on this project. Two clear daily blocks: 09:00–13:00 morning + 18:00–21:00 evening.

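The night share and peak hour can be re-derived from the hourly counts in the chart. In the sketch below the counts are copied from the chart as rendered; hours with no bar (for example 19:00) are simply absent, so the daytime figure is taken as the 351 headline total minus the night count rather than summed from the bars.

```python
# Prompts per hour, copied from the chart; hours with no bar are omitted.
hourly = {0: 3, 7: 2, 8: 15, 9: 24, 10: 38, 11: 30, 12: 30, 13: 28, 14: 16,
          15: 12, 16: 22, 17: 7, 18: 22, 20: 34, 21: 30, 22: 9, 23: 15}
TOTAL_PROMPTS = 351  # headline figure from § 11.1


def is_night(hour):
    """Night/early window used in this section: 22:00 up to 06:00."""
    return hour >= 22 or hour < 6


night = sum(n for h, n in hourly.items() if is_night(h))
day = TOTAL_PROMPTS - night
peak_hour = max(hourly, key=hourly.get)

print(f"night={night} ({night / TOTAL_PROMPTS:.1%}), day={day}, peak={peak_hour:02d}:00")
# night=27 (7.7%), day=324, peak=10:00
```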
§ 11.7 — Method

- Source: Claude Code session JSONLs at ~/.claude/projects//*.jsonl. One file per session.

- Filtering: subagent-spawned JSONLs excluded (those are tool calls, not Pierre). Tool-result messages, meta-events, and /compact summaries excluded from the prompt count to leave only Pierre's typed input.

- Ingestion: pipelines/session-analytics/ingest.py (parallel JSONL parse → events.parquet + prompts.parquet).

- Tabulation: pipelines/session-analytics/stats_summary.py [project-name], which produces this section and the cross-project view.

- Caveats: older sessions (pre-2026-03-27) are not on disk; Claude Code rotates session files out. The 17-day pace-nlp window is the full lifecycle of this project, but the cross-project baseline is necessarily 30-day-bounded.
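The filtering step above can be pictured as a small generator over JSONL lines. This is a sketch only: the field names `type`, `is_subagent`, `is_meta` and `content` are hypothetical stand-ins, the real session schema is not documented in this memo and may differ, and this is not the code in ingest.py.

```python
import json


def human_prompts(jsonl_lines):
    """Yield typed prompt text, skipping subagent, meta and tool-result events.

    Field names are hypothetical placeholders for whatever the real
    session files use.
    """
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("type") != "user":
            continue  # assistant and system events are not prompts
        if event.get("is_subagent") or event.get("is_meta"):
            continue  # not typed by the human
        content = event.get("content")
        if not isinstance(content, str):
            continue  # in this sketch, tool results carry structured content
        yield content


# Hypothetical session lines, for illustration only.
sample = [
    '{"type": "user", "content": "rerun the notebook"}',
    '{"type": "assistant", "content": "Done."}',
    '{"type": "user", "is_meta": true, "content": "summary of earlier turns"}',
    '{"type": "user", "content": [{"tool_result": "exit 0"}]}',
    '{"type": "user", "is_subagent": true, "content": "spawned task"}',
]
print(list(human_prompts(sample)))
```

Taking an iterable of lines rather than a path keeps the filter testable without touching disk; a caller would pass an open file object per session.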

*Document prepared 2026-04-25 for the PACE-NLP-PROJECT extended findings deployment. § 11 added 2026-04-26. Source data and analysis scripts at pace-nlp-project/v3/output/v3_ext_*.csv, pace-nlp-project/v3/output/combined_phase9.parquet, and data-workbench/pipelines/session-analytics/.*

Restricted · PACE NLP
