PureGym Extended Findings: Beyond the Rubric
Audience: Leonard Green & Partners + KKR (PE owners), Pinnacle Topco / Pinnacle Bidco (UK group + bond issuer), Clive Chesser (CEO from Nov 2024), Humphrey Cobbold (Chairman). Treat as a consultant memo, not an academic report: every recommendation is tagged OpEx (recurring £/year) vs CapEx (one-off £) and ranked by effort × leverage.
Scope. Extends the CAM_DS_301 academic submission (report.md) with operational findings that fell outside the 800–1,000 word limit. All findings derive from the same dataset: 27,666 PureGym customer reviews (Trustpilot 15,815 + Google 11,851), plus 43 PureGym Glassdoor staff reviews, plus the FY2024 Companies House filing (Pure Gym Limited, reg 06690189), plus a Perplexity Sonar Deep Research industry brief (with documented errata; see § 7, Caveats).
Frame. PureGym is a mature 410-corporate-gym + 23-franchise UK operator (433 total) with £416m turnover, £187m reported EBITDA (45% margin), £123.5m adjusted EBITDA (29.7% margin), 1.50m UK members, ARPM £22.64, +13% YoY revenue, +7% YoY UK membership (group 2.25m, +21% YoY post-Blink). 2024 saw 43 new UK gyms opened at £2m+ per site plus a £150m senior secured note (Oct 2024) funding the £97m Blink Fitness acquisition (56 US gyms from Chapter 11). The LGP + KKR exit thesis is therefore *both* CapEx-led growth (active) *and* margin expansion (latent); the memo argues the next marginal pound is more productive in retention/margin than in further UK new-store CapEx, given diminishing returns from urban saturation. Every recommendation here is filtered through that lens: recurring OpEx savings and retention lifts beat one-off CapEx unless the CapEx triggers a margin ratchet.
TL;DR: what to do Monday morning
| # | Action | Effort | Leverage | OpEx / CapEx | Verdict |
|---|---|---|---|---|---|
| 1 | Re-rank Trustpilot reply queue by anger / sadness emotion so angry reviews get answered before joyful ones | 1 unit | 5/5 | OpEx (~0; workflow change to existing reply ops) | Ship this week |
| 2 | Lean into 24/7-flexibility as the lead acquisition narrative. Sonnet validation (2026-04-25) re-framed the 1,177 "shift-worker" matches as a 24/7-praise filter (3% verified shift-worker, 77% unclear, 20% no). The dominant signal is *24/7 access as lifestyle promise*, with shift workers as a small but visible proof cluster (~36 verified). Marketing reallocates around 24/7-flexibility, not occupational identity | 2 units | 3/5 | OpEx (marketing budget reallocation) | Q3 marketing slot |
| 3 | Seasonal HVAC pre-emptive ops: May–Sep AC complaints 3× baseline; Nov–Dec hot-water complaints 5× baseline. Pre-summer service contract + winter boiler audit | 2 units | 4/5 | OpEx (service contract) | Annual cycle |
| 4 | Set per-site music-volume target with manager accountability. The Glassdoor null result suggests music is set per site, not by corporate policy, so ownership sits with club managers, not HQ | 1 unit | 3/5 | OpEx (~0; policy memo) | Ship next month |
| 5 | Revisit the top-10 worst clubs as a programme, already drafted in v3_11_club_briefs.md (London Stratford 81, Leicester Walnut Street 59, Enfield 48, Swiss Cottage 37, Birmingham City Centre 34, Bermondsey 33, Seven Sisters 33, Hayes 33, New Barnet 32, Canary Wharf 32; see § 5). Sequenced 90-day plans per site | 3 units | 4/5 | Mixed (cleaning OpEx + locker / equipment CapEx) | Q3 ops plan |
| 6 | Speculative pilots: warm-shower temperature A/B, AC double-doors, capacity-cap queueing app | 1 unit (writing only) | unknown | n/a (design proposals) | Discussion item |
§ 1 - Reply latency: the highest-leverage Trustpilot intervention
The data
PureGym replies to 97% of Trustpilot reviews (15,356 of 15,815). Reply rate is not the problem. Reply prioritisation is.
| Emotion | Count | Median reply (hours) | Mean reply (hours) |
|---|---|---|---|
| Joy | 9,960 | 98.8 | 136.5 |
| Surprise | 99 | 87.5 | 122.7 |
| Love | 328 | 99.0 | 126.7 |
| Fear | 482 | 122.6 | 163.3 |
| Sadness | 1,606 | 124.1 | 173.1 |
| Anger | 2,881 | 130.1 | 173.1 |
Latency band breakdown for the 2,881 anger-tagged Trustpilot reviews:
| Band | Anger count | % of anger |
|---|---|---|
| <1 hour | 6 | 0.2% |
| 1–4 hours | 24 | 0.8% |
| 4–24 hours | 153 | 5.3% |
| 24–72 hours | 494 | 17.1% |
| 3–7 days | 1,101 | 38.2% |
| 1–4 weeks | 1,103 | 38.3% |
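The band shares in the table above can be reproduced from the raw counts. The sketch below is illustrative only: the band edges are read off the table labels, the treatment of boundary values (for example, exactly 24 hours) is an assumption, and `latency_band` and `band_shares` are hypothetical helper names, not part of the project pipeline.

```python
from bisect import bisect_right

# Upper edges of each band in hours, read off the table labels.
# Assumption: a value exactly on an edge falls into the later band,
# and anything above 7 days lands in the final "1-4 weeks" band.
BAND_EDGES = [1, 4, 24, 72, 168]
BAND_LABELS = ["<1 hour", "1-4 hours", "4-24 hours",
               "24-72 hours", "3-7 days", "1-4 weeks"]


def latency_band(hours: float) -> str:
    """Map a reply latency in hours to its band label."""
    return BAND_LABELS[bisect_right(BAND_EDGES, hours)]


def band_shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-band counts into percentages of the total."""
    total = sum(counts.values())
    return {band: round(100 * n / total, 1) for band, n in counts.items()}


# Anger-tagged counts copied from the table above.
anger_counts = dict(zip(BAND_LABELS, [6, 24, 153, 494, 1101, 1103]))
shares = band_shares(anger_counts)

# Share of anger-tagged reviews answered after more than 3 days.
slow_share = round(shares["3-7 days"] + shares["1-4 weeks"], 1)
print(shares, slow_share)
```

Read this way, roughly three quarters of anger-tagged reviews wait more than three days for a reply, which is the segment the re-rank is meant to shrink.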
Industry anchor
A 2026 support-response benchmark study found that organisations responding to customer messages within 1 hour achieve 71% retention vs 48% for those responding within 24 hours, a 23-percentage-point retention gap associated with reply speed ([Unthread 2026](https://unthread.io/blog/internal-support-response-time-service-desk/)). Top retailers respond within 2 hours; the retail-sector benchmark median for email response is 17 hours (Email Meter, see § 10). PureGym's 130-hour anger median is 7.6× the retail median.
The 2026 Google Business Profile benchmarks survey shows that "fast response is the strongest predictor of star-rating recovery after a 1- or 2-star review" ([WebFX 2026](https://www.webfx.com/blog/seo/google-business-profile-benchmarks/)).
The recommendation
Re-rank the Trustpilot reply queue by emotion before assignment to a customer-service agent. Anger and sadness move to the front. Joy and surprise can wait. The combined_phase9.parquet already has is_anger / is_sadness flags from the BERT emotion classifier; the same model can run inline at review-ingest time.
Cost. OpEx ~zero. The reply ops team already exists; this changes the *order* of work, not the *amount*. One sprint to wire the emotion classifier into the existing Trustpilot intake pipeline.
Leverage. Anger-tagged reviews are 18% of all Trustpilot reviews (2,881 of 15,815). If even half of them shift from "ignored for a week" to "answered within 24h", the implied churn reduction at industry retention multipliers is material. This memo does not attach a fabricated £ number to it: this is the single most concentrated improvement available from the existing data.
Owner. Customer service ops (corporate). Not a per-site fix.
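A minimal sketch of the re-rank described above, assuming each queued review already carries an emotion label and a posted timestamp. The `Review` class, the `EMOTION_PRIORITY` mapping and the `rerank` function are hypothetical names for illustration. The text only specifies that anger and sadness go first and that joy and surprise can wait; where fear and love sit is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime

# Lower number = answered sooner.
# Assumption: fear sits between the two groups named in the memo,
# and love waits alongside joy and surprise.
EMOTION_PRIORITY = {
    "anger": 0, "sadness": 0,
    "fear": 1,
    "love": 2, "surprise": 2, "joy": 2,
}


@dataclass
class Review:
    review_id: str
    emotion: str
    posted_at: datetime


def rerank(queue: list[Review]) -> list[Review]:
    """Order the reply queue by emotion priority, then oldest first."""
    return sorted(
        queue,
        # Unknown labels get the middle priority rather than being dropped.
        key=lambda r: (EMOTION_PRIORITY.get(r.emotion, 1), r.posted_at),
    )


queue = [
    Review("r1", "joy", datetime(2026, 4, 20, 9)),
    Review("r2", "anger", datetime(2026, 4, 21, 9)),
    Review("r3", "sadness", datetime(2026, 4, 20, 12)),
    Review("r4", "surprise", datetime(2026, 4, 19, 9)),
]
ordered_ids = [r.review_id for r in rerank(queue)]
print(ordered_ids)
```

Sorting on a (priority, age) pair keeps the change to ordering only: every review is still answered, and within a priority tier the oldest review is still handled first.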
What we landed on (and a hypothesis worth checking before re-rank goes live)
The naive read of the latency table is "PureGym replies fastest to joy-tagged reviews." Two complications surfaced after the report's first draft, and the re-rank should go live only after a check on each:
1. Sarcasm in the joy-tagged low-star band. 20.6% of 1- and 2-star reviews are tagged "joy" by the rubric BERT classifier. The dominant explanation is OOD register mismatch (a Twitter-trained classifier hitting British complaint politeness: "Lovely staff, terrible billing, however..."). Brown & Levinson 1987 politeness theory + Biber & Conrad 2009 register theory frame this as classifier OOD, not sarcasm. However, the OOD framing assumes the misclassification is uniform; sarcasm at the individual-review level is a separate signal that can co-exist with register mismatch and is operationally different. A 100-row hand-label of joy-tagged 1–2★ reviews, split into genuine politeness-prefix vs sarcasm vs misread negation, would resolve which pathway dominates. Cost: ~30 min of hand-labelling. Outcome: if sarcasm is >10% of the joy-low-star band, that calls for a separate model-level intervention, not a queue-reordering one.
2. Why is the joy queue moving fastest at all? If staff route by emotion tag, the inversion is explained. If they don't, *something else* is drawing reply ops toward joy-tagged reviews first. Possible candidates: ease of reply rather than length (joy median 34 words vs anger 30 words is close, but joy reviews tend to be more formulaic and quicker to answer with a generic reply); less substantive content to resolve (a "thanks!" reply takes seconds, an angry complaint requires a refund-policy lookup); chirpy framing that pulls staff toward easy wins under workload pressure. Worth a 30-minute interview with two reply-ops agents before assuming the re-rank is purely a workflow fix. If the inversion is partly a workload-survival pattern, queue-reordering will be resisted unless paired with reply-template support for the harder anger cases.
The §1 recommendation (re-rank by emotion) still stands: the anger median of 130h vs the joy median of 99h is the inversion, regardless of mechanism. But these two checks turn it from a one-shot policy change into a sequenced rollout: hand-label first, interview second, re-rank third.
§ 2 - Shift-worker cohort: an underserved positive segment
The data
Filtering all 27,666 reviews for shift-worker keywords (nurse, police, officer, paramedic, NHS, night shift, after my shift, finished work, 24/7, 5am, 6am, etc.) returns 1,177 matches (4.3% of corpus). The sentiment skew is striking:
| Score | Count | % of cohort |
|---|---|---|
| 5★ | 462 | 39% |
| 4★ | 234 | 20% |
| 3★ | 173 | 15% |
| 2★ | 103 | 9% |
| 1★ | 205 | 17% |
Caveat from validation. A Claude Sonnet 4.6 zero-shot pass on a 200-row stratified sample of these matches confirms shift-worker identity in only 6 / 200 (3%, "yes"); 154 / 200 (77%) are "unclear" (the writer praises 24/7 access without naming a profession); 40 / 200 (20%) are explicit "no". The keyword filter is a 24/7-praise filter, not a verified-shift-worker filter. See § 7.5 for methodology. The campaign opportunity is no smaller for the correction, but the framing must lead with 24/7-as-flexibility-promise as the dominant data signal, with shift workers as the visible but smaller proof cohort.
Top keyword hits:
- 24/7 access: 798 mentions (the dominant signal)
- Security / cleaner / warehouse / bouncer: 254
- After-my-shift / finished work / before work: 102
- Police / officer: 27
- NHS / hospital / midwife / doctor: 15
- Night shift / shift work: 13
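A minimal sketch of a keyword filter of this kind, to show why it behaves as a 24/7-praise filter. The keyword list is only the subset named in the text (the full list is elided there and is not reconstructed here), and `KEYWORDS`, `PATTERN` and `matched_keywords` are hypothetical names, not the project's actual filter.

```python
import re

# Subset of the keywords named in the text; the real list is longer.
KEYWORDS = ["nurse", "police", "paramedic", "nhs", "night shift",
            "after my shift", "24/7", "5am", "6am"]

# Match each keyword as a whole token, case-insensitively.
PATTERN = re.compile(
    "|".join(r"(?<!\w)" + re.escape(k) + r"(?!\w)" for k in KEYWORDS),
    re.IGNORECASE,
)


def matched_keywords(text: str) -> list[str]:
    """Return the distinct keywords found in a review, lower-cased."""
    return sorted({m.group(0).lower() for m in PATTERN.finditer(text)})


samples = [
    "Open 24 hours so I can always go after doing a night shift!",
    "Lovely gym 24/7 access classes included with membership",
    "Don's 6am BOOTCAMP class on Monday's are the best!!!",
    "Great equipment and friendly staff",
]
hits = [matched_keywords(s) for s in samples]
print(hits)
```

Only the first sample names shift work. The second and third match on opening hours or class times alone, which is the pattern the validation pass flagged as "unclear".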
Verbatim positives (sample)
"Open 24 hours so I can always go after doing a night shift!" (Bletchley, 5★)
"Don's 6am BOOTCAMP class on Monday's are the best!!! He's very encouraging and pushes you to be the best YOU" (Elkridge, 5★)
"Good selection of equipment, not too busy (at the times I go) and accessible 24/7" (Inverness Inshes, 4★)
"Brought for my son less than £6pw an excellent membership can use anytime as its open 24/7" (Weston Super Mare, 5★)
"Lovely gym 24/7 access classes included with membership also free parking" (Newcastle Longbenton, 5★)
The full 1,177-row CSV is at v3/output/v3_ext_shiftworker_matches.csv.
The recommendation
A targeted "We serve the people who serve you" campaign: a discount or perk for verified NHS / police / fire / ambulance / armed-forces members. The data shows this cohort is already disproportionately satisfied; the marketing move converts a passive strength into an explicit acquisition channel and a defensive retention moat.
Channel design (proposed, not measured):
1. Verification via existing platforms (Blue Light Card, Defence Discount Service): ~zero infrastructure cost.
2. Headline benefit: dedicated 5–7am "shift-end" perks (towels, free coffee, priority lockers) at sites with relevant catchments.
3. Storytelling content: the verbatim quotes above seed the campaign's first three creatives.
4. Ad campaign sequenced for an autumn / January push, when shift workers' New Year resolutions intersect with PureGym's January acquisition peak.
Cost. OpEx (marketing reallocation, not net new). The verification platforms already exist and are cheap-or-free for partners.
Leverage. The Service-Profit Chain literature (Heskett 1994; Bernhardt, Donthu & Kennett 2000) supports the upstream link from staff-and-niche-cohort satisfaction to retention. PureGym is the most-mentioned 24/7 budget gym in this dataset; defending that position with a campaign is cheaper than out-spending The Gym Group on generic acquisition.
Owner. Marketing (corporate) + ops (site-level execution).
What we landed on ultimately
The walk from first draft to current framing went through three positions, each killed by the next data point:
1. Initial framing. "1,177 underserved shift-workers, 696 are 4–5★, 681 tag joy: go after them." Strong narrative, weak validation.
2. First correction (LESSONS_ADDENDUM, 2026-04-25). Sonnet 4.6 200-sample classifier flipped the headline: 3% confirmed shift-worker, 77% unclear, 20% no. The keyword filter is a 24/7-praise filter, not a verified-shift-worker filter. Verified-yes count projects to ~36 across the 1,177.
3. Second correction (same addendum). Shift workers are NOT calmer than the general PureGym population: mean rating 3.84 vs general 3.89, joy 64.3% vs general 65.1%, and they have MORE equipment complaints, not fewer. The "underserved positive cohort" framing is wrong on its own data: this cohort is approximately *average* on satisfaction and slightly *worse* on equipment. The lift is in the 24/7-flexibility praise the keyword filter caught, not in shift-worker satisfaction.
Final position. The recommendation is now: lead acquisition copy with 24/7 flexibility as a lifestyle promise (proof: 798 unprompted "24/7" mentions in the keyword set, the dominant signal). Shift workers are a *small but vivid proof cluster* (verified ~36) that supports the narrative: quote them in creative, but don't make occupational identity the segmentation axis. The Blue Light Card / Defence Discount route is still cheap to add, but as a *secondary* tactic (verified retention proof), not the headline.
The "we serve who serves you" copy still works for those ~36 verified shift workers as a niche secondary creative; it just can't carry the campaign.
§ 3 - Seasonality: AC summer, heating winter, both predictable
AC complaints (negative reviews mentioning aircon / cooling / "too hot" / stuffy / sweat)
| Month | Negative reviews | AC mentions | Rate |
|---|---|---|---|
| Jan | 554 | 32 | 5.8% |
| Feb | 632 | 44 | 7.0% |
| Mar | 713 | 31 | 4.3% |
| Apr | 704 | 30 | 4.3% |
| May | 260 | 28 | 10.8% |
| Jun | 371 | 52 | 14.0% |
| Jul | 481 | 66 | 13.7% |
| Aug | 366 | 45 | 12.3% |
| Sep | 314 | 45 | 14.3% |
| Oct | 613 | 44 | 7.2% |
| Nov | 253 | 21 | 8.3% |
| Dec | 564 | 31 | 5.5% |
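The "3× baseline" figure can be checked directly from the table above. The sketch below pools May–Sep against a March–April trough; the choice of March–April as the baseline is an assumption, since the memo does not state which months define it, and `pooled_rate` is a hypothetical helper name.

```python
# (negative reviews, AC mentions) per month, copied from the table above.
AC_BY_MONTH = {
    "Jan": (554, 32), "Feb": (632, 44), "Mar": (713, 31), "Apr": (704, 30),
    "May": (260, 28), "Jun": (371, 52), "Jul": (481, 66), "Aug": (366, 45),
    "Sep": (314, 45), "Oct": (613, 44), "Nov": (253, 21), "Dec": (564, 31),
}


def pooled_rate(months: list[str]) -> float:
    """AC mentions as a percentage of negative reviews, pooled over months."""
    negatives = sum(AC_BY_MONTH[m][0] for m in months)
    mentions = sum(AC_BY_MONTH[m][1] for m in months)
    return 100 * mentions / negatives


# Per-month rates, rounded as in the table.
rates = {m: round(100 * hits / neg, 1) for m, (neg, hits) in AC_BY_MONTH.items()}

# Assumption: Mar-Apr is the baseline trough; May-Sep is the summer window.
baseline = pooled_rate(["Mar", "Apr"])
summer = pooled_rate(["May", "Jun", "Jul", "Aug", "Sep"])
ratio = summer / baseline
print(rates, round(baseline, 1), round(summer, 1), round(ratio, 2))
```

Pooling counts rather than averaging the monthly percentages keeps low-volume months such as May from being over-weighted.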
Heating + cold-shower complaints
| Month | Rate |
|---|---|
| Jan | 2.3% |
| Feb | 1.1% (lowest) |
| Mar | 1.7% |
| Apr | 1.4% |
| May | 2.7% |
| Jun | 3.2% |
| Jul | 2.1% |
| Aug | 3.3% |
| Sep | 2.5% |
| Oct | 3.1% |
| Nov | 4.7% |
| Dec | 5.5% |
Temperature mentions
Identical pattern: November–December peak (4.0%, 4.3%) vs spring trough (March 0.8%, April 0.6%).
The recommendation
Run an annual pre-emptive HVAC service cycle keyed to the data.
- April pre-summer audit: every site's AC inspected before May. Sites that catch faults in April should see a smaller May–Sep complaint surge.
- October pre-winter boiler / hot-water audit: every site's hot-water system inspected before November. Cold-shower complaints in December are the clearest preventable churn driver in the dataset.
The existing v3_ctx_weather_* and v3_ext_weather_* artefacts already plot weather × negativity; this section closes the loop with the operational action.
Cost. OpEx (recurring service contract). The Perplexity panel review correctly flagged that the deep-research HVAC cost numbers were anchored on a 2,500 sqft boutique-studio assumption rather than PureGym's actual 10,000–20,000 sqft warehouse format, so absolute £ numbers must come from real PureGym facilities ops, not from this report. The directional recommendation (pre-emptive service > reactive repair) holds regardless.
Leverage. Removing one season-of-complaints from each year flattens the negative-review trajectory documented in the academic report's "Things Are Getting Worse" section.
Owner. Facilities ops (corporate, with site-level execution).
§ 4 - Music volume: a per-site finding, not a corporate one
The Glassdoor null result
The 43 PureGym Glassdoor staff reviews contain zero mentions of music, volume, loud, playlist, sound system, or speaker. The hypothesis that music volume is a corporate-mandated policy (à la some retail chains) is not supported by the data we have.
This re-frames the 43 Google "loud music" complaints (already documented in report.md) from a chain-wide policy lever into a per-site manager accountability lever.
The recommendation
Set a target dB range (e.g. 65–75 dB at the gym floor centre) and make it a club-manager monthly compliance metric. No CapEx: every PA system already supports volume control. The intervention is purely policy + manager dashboard.
If the company wants harder evidence, a 2-week dB measurement pass at the 10 most-complained sites is the next data step. Cost: a £30 SPL meter and ~10 site visits.
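One way the monthly compliance metric could be computed from spot readings is sketched below. The 65–75 dB range is the example from the text; the site names and readings are invented placeholders, not measurements, and `compliance_rate` is a hypothetical helper.

```python
# Example target range from the recommendation above, in dB.
TARGET_DB = (65.0, 75.0)


def compliance_rate(readings: list[float],
                    target: tuple[float, float] = TARGET_DB) -> float:
    """Share of spot readings that fall inside the target range."""
    if not readings:
        raise ValueError("no readings supplied")
    low, high = target
    in_range = sum(low <= r <= high for r in readings)
    return in_range / len(readings)


# Placeholder readings for two hypothetical sites (not real data).
site_readings = {
    "Site A": [68.0, 71.5, 74.0, 79.0],
    "Site B": [66.0, 70.0, 72.5, 73.0],
}
report = {site: compliance_rate(vals) for site, vals in site_readings.items()}
print(report)
```

Counting in-range readings avoids averaging decibel values, which are logarithmic and do not average meaningfully as plain numbers.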
Leverage. 43 reviews is a small but concentrated thread; cleaning it up removes a recurring Google-review topic from the 11 BERTopic clusters identified in the academic report.
Owner. Club managers (per-site), accountable through standard ops review.
§ 5 - Top-10 worst clubs: an operational triage programme
The 10 clubs and their concentration
| # | Club | Negative reviews | London? | Recurring theme cluster |
|---|---|---|---|---|
| 1 | London Stratford | 81 | yes | Hygiene + named-staff issues + WiFi/ventilation broken |
| 2 | Leicester Walnut Street | 59 | no | Temperature/HVAC + equipment + showers + rule enforcement |
| 3 | London Enfield | 48 | yes | Staff attitude + equipment disorganisation + overcrowding |
| 4 | London Swiss Cottage | 37 | yes | Maintenance/odours + overcrowding + missing equipment |
| 5 | Birmingham City Centre | 34 | no | Outdated equipment + flickering lights + lockers |
| 6 | London Bermondsey | 33 | yes | Hygiene + overcrowding + temperature/ventilation |
| 7 | London Seven Sisters | 33 | yes | Insufficient lockers + frequent unannounced closures |
| 8 | London Hayes | 33 | yes | Toilets/showers/water/equipment maintenance |
| 9 | New Barnet | 32 | suburban | Overcrowding + cleanliness + class noise + AC |
| 10 | London Canary Wharf | 32 | yes | Maintenance + ventilation + alleged staff harassment |
What fixing each typical issue would avoid paying for
Drawing on the recurring themes across the 10 briefs (v3_11_club_briefs.md), framed in the way Russell taught: list what you'd avoid paying for, hand the pricing to the owner.
| Recurring issue | OpEx avoidance (recurring) | CapEx avoidance (one-off) | Owner |
|---|---|---|---|
| Hygiene / cleanliness | Negative-review-driven retention loss; refund / goodwill credits; viral 1★ Google velocity | None (cleaning is OpEx-only) | Regional ops + cleaning contract review |
| Equipment maintenance | Repeated repair callouts; member-LTV impact when 2-3 machines down at once | Phased equipment refresh (delays the inevitable, doesn't avoid) | Facilities ops |
| Overcrowding at city-centre sites | Refund pressure when peak-hour members can't use machines | Possible vestibule / capacity-cap CapEx (see Pilot 2) | Site-level ops |
| HVAC / ventilation / hot-water | Reactive repair callouts at peak failure; December cold-shower churn (§3) | Pre-emptive service contract (OpEx) usually beats reactive replacement (CapEx) | Facilities ops |
| Locker theft / insufficient lockers | Goodwill credits to theft victims; recurring 1★ reviews at affected sites | Locker hardware refresh at Seven Sisters / Stratford | Site-level ops + capital plan |
| Named-staff issues (e.g. Stratford) | HR cost of complaint resolution; brand damage if unaddressed | Training programmes (modest OpEx) | HR + regional manager |
| Frequent unannounced closures (Seven Sisters) | Pro-rata refund pressure; cancellations | Communication system / ops process | Regional ops |
v3_11_club_briefs.md already provides the per-site this-week / next-month / next-quarter sequencing.
Owner. Regional ops manager (programme lead) + facilities ops (HVAC + locker work) + HR (named-staff escalations).
§ 6 - Speculative pilots (Tier 4: design proposals, not findings)
Full experiment specs at [v3/output/pilot_designs.md](v3/output/pilot_designs.md): hypothesis / sites / duration / measurement / kill criteria / risk per pilot.
The academic report and §1–§5 above are grounded in dataset evidence. The proposals below are design ideas to test, not analyses with conclusions. Each is presented as an experiment design, not a recommendation.
6.1 Warm-shower temperature A/B
Hypothesis. Increasing shower water temperature 2–3°C (within safe limits) shortens average shower duration enough to offset the additional gas/electric cost, *and* reduces the cold-shower complaint cluster in the December review set.
Experiment design. Pick 4 matched sites by member count + complaint volume. Increase shower temp at 2 sites, hold the other 2 as control. Measure: shower duration (existing flow-meter data, if instrumented), gas/electric cost per member, December review complaint rate.
This memo will not estimate £ savings or hours. Pierre's working rule: cost / time / size claims fabricated by an LLM are always wrong unless quoted from real measurement. The experiment is the source of truth.
Risk. Scalding liability, member preference variance.
6.2 AC double-doors / capacity caps in summer
Hypothesis. The summer AC complaint surge (§3, 13% rate vs 4% baseline) is driven partly by door-open heat ingress at peak hours plus over-capacity heat generation. A door vestibule + a capacity cap at peak (e.g. 100 members on the floor max during 17:00–19:00 weekdays in July–August) reduces both.
Experiment design. Single-site pilot at a high-summer-complaint London site. Install temporary vestibule curtain + capacity-cap notice. Measure: AC complaint rate vs prior summer + control site, emotional intensity of complaints, cancellation rate during the pilot.
Cost. CapEx (vestibule install) + OpEx (capacity-cap signage and staff). Both small per site.
Risk. Member backlash on capacity caps. Mitigate: app-based reservation system.
6.3 Disney-style queueing app
Hypothesis. A "tell us what equipment you want for your workout, we'll give you a 10-minute window" reservation app reduces overcrowding-driven complaints (currently #3 BERTopic cluster in the Google reviews).
This is the most speculative item in the memo. No evidence for or against from the dataset. Worth one product-discovery sprint to validate user demand.
6.4 Buddy-pairing for similar-routine members
Hypothesis. Members of similar weight / routine paired into 30-minute slots create community-formation effects (Heskett service-profit-chain literature, Perplexity research § retention drivers).
Status. Left field. Park as a Phase 3 product idea. Mentioned for completeness.
6.5 45-day induction programme (gap noted, not previously in this memo)
Hypothesis. The first 45 days post-join is the highest-churn window across the budget-gym sector. A structured induction (week-1 onboarding session, week-3 check-in, week-6 milestone) reduces 90-day churn by enough to recover programme cost.
Why it's here now. The Gym Group's FY25 annual report cites a 45-day induction programme with a 2–4 month payback. This memo's first draft did not consider it; the LESSONS_ADDENDUM "open issues" section flags it as missing from the intervention list. Adding here for completeness; PureGym's "low-labour-cost model" (per CEO's FY2024 statement) is a structural argument for *and* against: it makes induction cheap to staff but cuts against any model that requires more than a kiosk.
Experiment design. A/B at matched site pairs. Intervention site runs the 45-day induction; control runs status quo. Measure 90-day churn, week-2 / week-6 visit frequency, NPS at day-45.
Status. Worth a discovery sprint to assess feasibility under the low-labour-cost model. Not a Tier 1 recommendation; a Tier 4 design proposal flagged so it doesn't sit in the gap forever.
§ 7 - Caveats: what the deep research got wrong
The Perplexity Sonar Deep Research output (puregym_deep_research.md) was reviewed by a 5-expert panel (panel_review_perplexity.md). Errors found:
1. Leonard Green acquisition date wrong. Perplexity says Nov 2013 citing CCMP. CCMP did acquire PureGym in 2013; LGP acquired it from CCMP in 2017 for $786m (Bloomberg / Pitchbook confirmed via post-publish cross-check, 2026-04-25). The Companies House FY2024 filing names LGP + KKR as current investors. The date + transaction-size correction doesn't change the exit-thesis framing but is required for any transaction-multiple math anchored on entry price.
2. Gym floor size assumption wrong. Report uses 2,000–3,000 sqft (boutique studio); PureGym sites are 10,000–20,000+ sqft warehouse format. Any HVAC, cleaning, or per-sqft-OpEx number from the research is inflated against this assumption: disregard absolute £ figures, retain only directional ratios.
3. Year-1 churn 40% / Year-2+ retention 85% comes from a single non-peer-reviewed source (benfit.co.uk, a blog aggregator). Useful as a sense-check, not as a citable benchmark. The Gym Group's own 2024 / 2025 annual reports are the authoritative source if the actual number matters to a recommendation.
4. ROI range "265–905%" methodologically unsound: divides low benefits by high costs and high benefits by low costs. Same-percentile gives 453–525%, still high but more defensible. Sensitivity analysis missing.
5. r=0.60 music-negativity correlation needs caveats: Pearson vs Spearman distinction not stated; review-length confound not controlled; causation vs co-occurrence not separated. Treat as suggestive, not confirmatory.
6. Milliman 1980s music research cited from a commercial audio-equipment blog. Original sources are Milliman 1982 (J Marketing) and 1986 (J Consumer Research). The directional finding (slow tempo → longer dwell) is robust; the specific 25% number should be re-cited from primary source if used in a board pack.
The PureGym FY2024 financials in this memo come directly from Companies House (Pure Gym Limited, reg 06690189, filed 23/05/2025, audited by KPMG LLP Nottingham) and are not subject to the panel-review caveats above.
§ 7.5 - Methodology robustness checks
The panel review of the Perplexity industry research flagged "no sensitivity analysis" as a methodological gap. This section closes the most material parts of that gap for findings new to this memo.
Shift-worker keyword filter: Sonnet 4.6 validation pass (§2)
The 1,177-row keyword-filtered set in §2 was validated by a Claude Sonnet 4.6 zero-shot classification on a 200-row stratified sample (100 score≥4, 100 score<4). Prompt: "Does the writer plausibly identify as a shift worker (nurse, police, paramedic, firefighter, NHS staff, security, night cleaner, warehouse, or someone whose work hours fall outside standard 9-5)? Answer yes / no / unclear."
| Verdict | Count | % of 200 |
|---|---|---|
| yes | 6 | 3.0% |
| unclear | 154 | 77.0% |
| no | 40 | 20.0% |
Methodology + raw verdicts: v3/output/v3_ext_shiftworker_validate.{py,json,log}. Cost: 37k input tokens + 0.8k output tokens (Sonnet 4.6 standard pricing).
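Because only 6 of 200 sampled rows were verified, the projection to the full 1,177-row set carries a wide margin. The sketch below computes a Wilson score interval for the "yes" share as a rough sense-check. It treats the sample as a simple random sample, which is an assumption: the actual sample was stratified 100/100 by score, so a properly weighted estimate could differ. `wilson_interval` is a hypothetical helper name.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


MATCHES = 1177           # size of the keyword-filtered set
YES, SAMPLE = 6, 200     # verified shift workers in the validation sample

low, high = wilson_interval(YES, SAMPLE)
point_estimate = YES / SAMPLE * MATCHES
projected_range = (low * MATCHES, high * MATCHES)
print(round(point_estimate), [round(x) for x in projected_range])
```

Under these assumptions the verified share is roughly 1.4% to 6.4%, so the verified cohort could plausibly be anywhere from under 20 to about 75 reviews. Either end supports treating it as a proof cluster rather than a segment.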
Emotion classifier OOD risk (§1, §2)
The bhadresh-savani/bert-base-uncased-emotion classifier used for the anger / sadness / joy / fear labels is Twitter-trained. The academic appendix already documents that ~1,486 1- and 2-star reviews are mis-labelled "joy", a classic out-of-distribution failure on understated British complaint prose. Reply-latency-by-emotion (§1) and emotion-by-cohort (§2) both inherit this risk:
- If 30% of "joy" labels are actually anger/sadness in the OOD prose (the appendix's churn_signal merge implies roughly that scale), the anger-median-130h vs joy-median-99h gap shrinks but does not disappear. The "PureGym replies fastest to joy and slowest to anger" inversion is robust to that level of mis-classification because the absolute ordering is preserved across both populations.
- The §2 joy count (681 of the 1,177 matches) is more sensitive: if 30% of those are mis-labelled, the headline drops to ~480 joy-tagged reviews. Still substantial. The campaign pitch is unchanged.
Cross-classifier gold-accuracy results (added 2026-04-25)
The intuitive cross-check ("validate the rubric BERT classifier against an independent emotion model") was performed against j-hartmann/emotion-english-distilroberta-base on the 50-row hand-labelled gold set. The result was *not* what we initially expected.
| Classifier | Gold accuracy | Notes |
|---|---|---|
| Raw rubric BERT (bhadresh-savani/bert-base-uncased-emotion) | 0% | Expected: Twitter OOD failure |
| Phase-8b score-guided re-rank | 42% | Best of the open-source pipeline |
| j-hartmann/emotion-english-distilroberta-base (indie cross-check) | 18% | Worse than 8b: different OOD axis |
| Gemini 2.5 Flash zero-shot | 40% | Roughly tied with 8b |
| Claude Sonnet 4.6 zero-shot | 74% | Only model approaching usable accuracy |
For published material, the operational rule is: don't validate one open-source classifier with another open-source classifier on this register. Either hand-label, or use a frontier LLM.
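On a 50-row gold set each row is worth two percentage points, so it helps to read the accuracy table in row counts. The sketch below converts the reported accuracies back to rows; the dictionary keys are shortened labels chosen here, and `accuracy` is a hypothetical helper showing how such a figure would be computed from labels.

```python
GOLD_ROWS = 50  # size of the hand-labelled gold set

# Accuracies as reported in the table above.
REPORTED = {
    "rubric BERT": 0.00,
    "Phase-8b re-rank": 0.42,
    "j-hartmann": 0.18,
    "Gemini 2.5 Flash": 0.40,
    "Claude Sonnet 4.6": 0.74,
}


def accuracy(predicted: list[str], gold: list[str]) -> float:
    """Share of predictions that exactly match the gold label."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Convert each reported accuracy into a count of correct rows.
correct_rows = {name: round(acc * GOLD_ROWS) for name, acc in REPORTED.items()}
gap_rows = correct_rows["Phase-8b re-rank"] - correct_rows["Gemini 2.5 Flash"]
print(correct_rows, gap_rows)
```

Seen this way, the "roughly tied" note on Gemini vs Phase-8b is a single row, while the gap to Sonnet 4.6 is sixteen rows or more.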
r=0.60 music-negativity correlation (panel review caveat)
The panel review flagged the r=0.60 finding as needing "Pearson vs Spearman, control for review length, causation vs co-occurrence" caveats. None of the recommendations in this memo (§4) lean on the magnitude of that correlation; the music-volume hypothesis is supported on the count side (43 Google reviews mentioning loud music) and falsified on the policy side (Glassdoor null result). Sensitivity to the r=0.60 number itself is therefore not material to any §4 action.
§ 8 - What this memo deliberately does not say
- No £ savings figures per recommendation. Per Pierre's working rule (and the LGP+KKR audience's likely read), an LLM-fabricated cost number is worse than no number. Anchor any £ calculation in a real measurement (FY2025 facilities-ops data, real CAC from PureGym's own marketing analytics, real shower-duration measurement at a pilot site).
- No claim about The Gym Group beyond what their public filings show. The Gym Group's £244.9m FY2025 revenue, 23% Adjusted EBITDA Less Normalised Rent margin, 27% mature-site ROIC, £21.60 ARPM, and 4% YoY membership growth are filing-direct ([The Gym Group 2025 results](https://www.tggplc.com/investors)). PureGym's £22.64 ARPM is about 5% above this; the pricing-power gap is real and worth defending.
- No CapEx-led growth recommendation. The literature (Huff gravity, retail cannibalization) suggests urban-market saturation is a real risk above ~5,000 UK clubs. PureGym's expansion programme (433 → 460+ targeted) faces decreasing-returns headwinds. Retention-led EBITDA expansion at the 433 existing sites is the LGP+KKR exit-friendly play. Every recommendation in §1–§5 above is retention-flavoured for that reason.
§ 9 - Cross-references
- Academic submission: report.md (995 words, Falcon→Qwen + T4→A100 drift fixed, Sonnet shootout in appendix); Canvas 3354 due 2026-04-27.
- Deployed site (PIN-gated): [pace-study.pages.dev](https://pace-study.pages.dev/), 6 pages currently (Report, Audio deck, Cribsheet, Walkthrough, Row inspector, Top 20 tips). This memo proposes adding a 7th: Extended findings.
- Companion site: [pace-compass.pages.dev](https://pace-compass.pages.dev/) (alive, distinct project).
- Business model calculator: v3/business-model.html, with sliders for # gyms, churn rate, ARPM, CAC, retention investment, churn reduction. Constants: 410 base gyms, 29.7% EBITDA margin, 10× valuation multiple, £2m CapEx per gym. Source for the LGP+KKR exit math.
- Per-club briefs: v3/output/v3_11_club_briefs.md, top-10 worst-club action briefs (Qwen2.5-72B), reviewed for §5 above.
- Industry research: v3/output/puregym_deep_research.md (use with §7 caveats).
- Panel review of research: v3/output/panel_review_perplexity.md.
- PureGym FY2024 real numbers: PUREGYM_FY2024_REAL_NUMBERS.md (Companies House primary).
- Glassdoor staff data: v3/output/glassdoor_staff_reviews_full.csv (43 reviews, 2023–2024 span, used for §4 null-result on music).
- Shift-worker filtered set: v3/output/v3_ext_shiftworker_matches.csv (1,177 reviews, generated for §2).
§ 10 - Sources cited in this memo (web)
- [Unthread 2026: Support Response Time Benchmarks](https://unthread.io/blog/internal-support-response-time-service-desk/): 1h reply = 71% retention; 24h reply = 48%.
- [WebFX 2026: Google Business Profile Benchmarks](https://www.webfx.com/blog/seo/google-business-profile-benchmarks/): fast response is the strongest predictor of star-rating recovery.
- [Email Meter: Industry Standard SLA Response Times](https://www.emailmeter.com/blog/understanding-industry-standard-sla-response-times): retail median 17h email response; financial services 14h.
- [Spidya 2026: Enterprise Services SLA Benchmark](https://spidya.com/en/blog/it-service-management/enterprise-services-sla-benchmark-2026-key-metrics-for-success): 95% compliance industry target.
- Companies House: Pure Gym Limited (reg 06690189), Annual Report & Financial Statements year ended 31 December 2024.
§ 11 - Project meta-analytics: how this work was made
This appendix is a working-process record. It exists to be transparent about how a single person built the analyses, the report, the deployed site, and the supporting artefacts in this memo, using Claude Code (Anthropic's coding CLI) as the primary tool. Numbers are extracted from on-disk session JSONLs by pipelines/session-analytics/stats_summary.py; window is 17 active days from project start (2026-04-09) to today (2026-04-26).
§ 11.1 - Headline numbers
| Metric | Pace-NLP only |
|---|---|
| Active days | 17 |
| Sessions (top-level) | 26 |
| Real human prompts typed | 351 |
| Total events (user + assistant + meta) | 9,899 |
| Average prompts per session | 13.5 |
| Total characters typed by Pierre | ~308,000 |
| Total words typed by Pierre | ~39,000 |
| Output tokens emitted by Claude | 10.0 M |
| Cache reads (hits) | 1,255 M |
| Cache creations | 61.6 M |
| Fresh input tokens | 0.2 M |
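The derived figures in this table can be re-checked with a few lines of arithmetic. This is a minimal sketch using only the values reported above. Treating cache reads, cache creations, and fresh input as the whole input side is an assumption made here, not something the pipeline states.

```python
# Headline figures copied from the table above (token counts in millions).
sessions = 26
prompts = 351
words_typed = 39_000        # approximate, as reported
cache_reads_m = 1_255.0
cache_creations_m = 61.6
fresh_input_m = 0.2

prompts_per_session = prompts / sessions
words_per_prompt = words_typed / prompts

# Assumption: input side = cache reads + cache creations + fresh input.
input_total_m = cache_reads_m + cache_creations_m + fresh_input_m
cache_read_share = cache_reads_m / input_total_m

print(f"{prompts_per_session:.1f} prompts per session")
print(f"{words_per_prompt:.0f} words per prompt")
print(f"{cache_read_share:.1%} of input tokens were cache reads")
```

The prompts-per-session result reproduces the 13.5 in the table. The cache-read share shows that almost all input volume was served from cache, not sent as fresh tokens.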
§ 11.2 - Project arc (daily timeline, pace-nlp cwd only)
| Date | Day | Sessions | Prompts | Anchor event |
|---|---|---|---|---|
| 2026-04-09 | Thu | 2 | 20 | Project start |
| 2026-04-10 | Fri | 3 | 23 | |
| 2026-04-11 | Sat | 5 | 42 | First gold-label exploration; 50/50 hand-labelling |
| 2026-04-13 | Mon | 1 | 25 | |
| 2026-04-14 | Tue | 7 | 38 | Visual-verify discipline established; Mermaid v11 CDN switch |
| 2026-04-16 | Thu | 2 | 31 | Russell live Q&A: Falcon→Qwen swap green-lit |
| 2026-04-17 | Fri | 1 | 8 | |
| 2026-04-18 | Sat | 5 | 96 | Workbench tooling lifted out; 5 parallel agents; submission artefact ready 9 days early |
| 2026-04-19 | Sun | 1 | 4 | Quiet: PACE 301 finalisation |
| 2026-04-25 | Sat | 2 | 27 | Two parallel sessions on same repo (visual polish + extended report) |
| 2026-04-26 | Sun | 2 | 37 | Today: extended-report refresh, audio note, lift to data-workbench |

This timeline excludes work run from the data-workbench cwd post-lift. The true pace-related footprint is 422 prompts across 67 sessions when the cwd filter is unioned with a keyword filter (pace, puregym, rubric, bertopic, trustpilot, c301).
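The union of the cwd filter and the keyword filter can be sketched as follows. The session record shape (a `cwd` string and a `prompts` list) and the function name are hypothetical, chosen for illustration. Only the keyword list comes from the text above.

```python
# Keywords quoted in the memo for the keyword filter.
KEYWORDS = ("pace", "puregym", "rubric", "bertopic", "trustpilot", "c301")


def is_pace_related(session: dict, project_dir: str = "pace-nlp-project") -> bool:
    """Return True if the session matches the cwd filter OR the keyword filter.

    `session` is a hypothetical record: {"cwd": str, "prompts": list[str]}.
    """
    cwd_match = project_dir in session.get("cwd", "")
    text = " ".join(session.get("prompts", [])).lower()
    keyword_match = any(keyword in text for keyword in KEYWORDS)
    return cwd_match or keyword_match


if __name__ == "__main__":
    sessions = [
        {"cwd": "/home/u/pace-nlp-project", "prompts": ["fix the chart"]},
        {"cwd": "/home/u/data-workbench", "prompts": ["rerun BERTopic on Trustpilot"]},
        {"cwd": "/home/u/other", "prompts": ["update invoice template"]},
    ]
    matched = [s for s in sessions if is_pace_related(s)]
    print(len(matched), "of", len(sessions), "sessions are pace-related")
```

Plain substring matching is deliberately loose: "pace" also matches words such as "workspace", so a keyword-only match is best treated as a candidate and not as a confirmed pace session.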
§ 11.3 - Models and tools
| Model | Assistant events | Output tokens |
|---|---|---|
| claude-opus-4-6 | 3,619 | 4.6 M |
| claude-opus-4-7 | 2,425 | 5.4 M |
|  | 2 | 0 M |
For the tool-use breakdown during pace-nlp work, see the existing pace-deploy/.workspace/analytics.json, generated by a separate pipeline with a narrower window (analyze_pace.py, last refreshed 2026-04-25). Top tools that pipeline observed: Bash (1,082), Read (272), Edit (207), TaskUpdate (168), Write (135), TaskCreate (89), Grep (54), NotebookEdit (32). Two analytics surfaces, two windows: treat the table above as the primary 30-day view and analytics.json as the narrower tool-distribution slice.
§ 11.4 - Slash-command discipline (within pace-nlp)
| Command | Count |
|---|---|
| /clear | 9 |
| /done | 8 |
| /commit | 7 |
| /workbench | 5 |
| /model | 4 |
| /exit | 2 |
| Others (1 each): /chrome, /steve, /audit, /broadcast, /alex | - |
The /gsd-* workflow macros that dominate cross-project use (238 invocations across 30 days) do not appear in this table: they ran on pace-nlp from inside the data-workbench cwd post-lift, not from the pace-nlp-project cwd directly.
§ 11.5 - Prompt-length distribution (chars typed by Pierre)
| Percentile | Chars |
|---|---|
| P10 | 7 |
| P25 | 29 |
| Median | 82 |
| P75 | 216 |
| P90 | 616 |
| P99 | 5,089 |
| Mean | 809 |
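A table of this shape can be produced from a list of prompt lengths with the standard library. The sketch below assumes the "inclusive" percentile method. The method actually used by stats_summary.py is not stated, and `length_summary` is a hypothetical helper name.

```python
from statistics import mean, quantiles


def length_summary(lengths: list[int]) -> dict[str, float]:
    """Percentiles and mean of prompt lengths (in characters)."""
    # quantiles(..., n=100) returns 99 cut points: index p-1 holds percentile p.
    cuts = quantiles(lengths, n=100, method="inclusive")
    summary = {f"P{p}": cuts[p - 1] for p in (10, 25, 50, 75, 90, 99)}
    summary["Mean"] = mean(lengths)
    return summary


if __name__ == "__main__":
    # Hypothetical data: mostly short prompts plus one very long one.
    demo = [10] * 9 + [5000]
    for label, value in length_summary(demo).items():
        print(f"{label:>5}: {value:,.0f}")
```

The reported mean (809) sits above P90 (616), so a small number of very long prompts carries most of the typed volume. The median (82) is the better description of a typical prompt.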
§ 11.6 - Daily rhythm (UK time-zone)
Peak hour: 10:00. Different from Pierre's cross-project peak of 21:00. Pace-nlp got morning hours; deadline-driven course work shaped the rhythm.
```
00:00 # 3
07:00 # 2
08:00 ####### 15
09:00 ############ 24
10:00 #################### 38 <- peak
11:00 ############### 30
12:00 ############### 30
13:00 ############## 28
14:00 ######## 16
15:00 ###### 12
16:00 ########### 22
17:00 ### 7
18:00 ########### 22
20:00 ################# 34
21:00 ############### 30
22:00 #### 9
23:00 ####### 15
```
Daytime (06–22): 324 prompts. Night/early (22–06): 27 prompts (7.7%). Compare to the cross-project night rate of 11.7%: discipline held on this project. Two clear daily blocks: 09:00–13:00 in the morning and 18:00–21:00 in the evening.
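The night share and the peak hour can be recomputed from the histogram. This is a minimal sketch using the hourly counts transcribed from the rows shown above. The denominator is the headline total of 351 prompts, on the assumption that the day/night split was computed against that total.

```python
# Hourly prompt counts transcribed from the histogram rows shown above.
hourly = {
    0: 3, 7: 2, 8: 15, 9: 24, 10: 38, 11: 30, 12: 30, 13: 28, 14: 16,
    15: 12, 16: 22, 17: 7, 18: 22, 20: 34, 21: 30, 22: 9, 23: 15,
}

# Headline prompt total from § 11.1 (assumed denominator for the split).
TOTAL_PROMPTS = 351


def is_night(hour: int) -> bool:
    """Night/early window as defined in the memo: 22:00 up to 06:00."""
    return hour >= 22 or hour < 6


night_prompts = sum(count for hour, count in hourly.items() if is_night(hour))
day_prompts = TOTAL_PROMPTS - night_prompts
night_share = night_prompts / TOTAL_PROMPTS
peak_hour = max(hourly, key=hourly.get)

print(f"night prompts: {night_prompts} ({night_share:.1%})")
print(f"day prompts:   {day_prompts}")
print(f"peak hour:     {peak_hour:02d}:00")
```

This reproduces the 27 night prompts, the 324 daytime prompts, and the 10:00 peak.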
§ 11.7 - Method
- Source: Claude Code session JSONLs at ~/.claude/projects/. One file per session.
- Filtering: subagent-spawned JSONLs excluded (those are tool calls, not Pierre). Tool-result messages, meta-events, and /compact summaries excluded from the prompt count to leave only Pierre's typed input.
- Ingestion: pipelines/session-analytics/ingest.py (parallel JSONL parse → events.parquet + prompts.parquet).
- Tabulation: pipelines/session-analytics/stats_summary.py [project-name], which produces this section and the cross-project view.
- Caveats: older sessions (pre-2026-03-27) are not on disk; Claude Code rotates session files out. The 17-day pace-nlp window is the full lifecycle of this project, but the cross-project baseline is necessarily 30-day-bounded.
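The filtering step described in the list above can be sketched as a predicate over parsed JSONL events. The field names used here (`type`, `is_subagent`, `is_tool_result`, `is_meta`, `is_compact_summary`) are hypothetical stand-ins. They are not the actual Claude Code session schema or the logic in ingest.py, and the per-event subagent flag stands in for the file-level exclusion the memo describes.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_events(jsonl_path: Path) -> Iterator[dict]:
    """Yield one parsed event per non-empty line of a session JSONL file."""
    with jsonl_path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def is_typed_prompt(event: dict) -> bool:
    """Keep only user events that are actual typed input.

    All flag names are hypothetical; they mirror the exclusions listed in
    the memo (subagents, tool results, meta-events, /compact summaries).
    """
    return (
        event.get("type") == "user"
        and not event.get("is_subagent", False)
        and not event.get("is_tool_result", False)
        and not event.get("is_meta", False)
        and not event.get("is_compact_summary", False)
    )


def count_typed_prompts(jsonl_path: Path) -> int:
    """Count the typed prompts in a single session file."""
    return sum(1 for event in iter_events(jsonl_path) if is_typed_prompt(event))
```

Keeping the exclusions in a single predicate makes the definition of a "real human prompt" auditable in one place, which matters because every prompt count in this appendix depends on it.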
*Document prepared 2026-04-25 for the PACE-NLP-PROJECT extended findings deployment. § 11 added 2026-04-26. Source data and analysis scripts at `pace-nlp-project/v3/output/v3_ext_*.csv`, `pace-nlp-project/v3/output/combined_phase9.parquet`, and `data-workbench/pipelines/session-analytics/`.*