Pierre Sutherland
CAM_DS_301 Weeks 4-5 | Topic Project 1
This project analyses 39,923 PureGym customer reviews across two platforms — Google Reviews (23,250) and Trustpilot (16,673) — using a multi-method NLP pipeline: word frequency analysis, BERTopic, Gensim LDA, BERT-based emotion classification, and Qwen2.5-7B-Instruct. The objective is to extract actionable complaint themes that PureGym management could use to reduce negative reviews and improve customer retention.
The analysis followed a progressive experimentation approach. After loading and cleaning the data (removing 9,352 Google reviews with no text), we parsed datetime fields and filtered to 5,931 negative reviews (score < 3); 4,137 of those came from 335 locations common to both platforms (after manual cross-platform name-merging) and fed cross-platform topic modelling.
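The cleaning and filtering steps above can be sketched in pandas. This is a minimal illustration with hypothetical column names (`text`, `score`, `location`) and toy rows; the real notebook's schema and shared-location set may differ.

```python
import pandas as pd

# Toy stand-in for the raw review pool (real data: 39,923 reviews).
reviews = pd.DataFrame({
    "text": ["Great gym", None, "Charged after cancelling", "Mould in showers"],
    "score": [5, 1, 1, 2],
    "location": ["Leeds", "Leeds", "Stratford", "Stratford"],
})

# Step 1: drop reviews with no text (real data: -9,352 Google reviews).
cleaned = reviews.dropna(subset=["text"])

# Step 2: keep negative reviews only (score < 3).
negatives = cleaned[cleaned["score"] < 3]

# Step 3: restrict to locations present on both platforms
# (hypothetical shared set; the real one has 335 locations).
shared_locations = {"Stratford"}
cross_platform = negatives[negatives["location"].isin(shared_locations)]
```

The same three-step shape (drop empty text, threshold on score, restrict to the merged location set) produces the 5,931 and 4,137 review counts reported above.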
Text preprocessing was evaluated using a workbench of 10 configurations. A critical finding: heavy preprocessing hurts BERTopic. Lemmatization increased outliers from 36.7% to 47.6%, because BERT embeddings rely on natural language context. Zipf’s Law analysis (slope -1.034, R²=0.993) confirmed why: BERT was trained on the full Zipf distribution, so stripping stopwords strips signal. The optimal pipeline feeds raw text for embeddings while using a CountVectorizer with custom stopwords and bigrams for labels.
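The Zipf's Law check is a straight-line fit of log frequency against log rank. A sketch on an ideal 1/rank distribution shows the mechanics; the synthetic frequencies here recover a slope of exactly -1, whereas the real review corpus gave slope -1.034 with R² = 0.993.

```python
import numpy as np

# Synthetic word frequencies following an exact Zipf (1/rank) law.
ranks = np.arange(1, 1001)
freqs = 1.0 / ranks

# Fit log(frequency) ~ log(rank); the first coefficient is the slope.
slope, intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
```

A slope near -1 means the high-frequency "stopwords" are an integral part of the distribution BERT was trained on, which is why removing them degrades the embeddings rather than denoising them.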
BERTopic was applied as four complementary lenses: full negative
reviews at common locations, the top-30-locations subset, anger-filtered
reviews, and an LLM-driven run where Qwen2.5-7B-Instruct first extracted
natural-language topic phrases from each anger-filtered review and
BERTopic then meta-clustered the resulting 5,999 phrase outputs. UMAP is
seeded (random_state=42) across all four runs for
reproducibility. Gensim LDA ran on the same negative pool as a classical
comparison (10 topics, coherence 0.449). Emotion analysis classified
every negative review across six categories. Qwen2.5-7B-Instruct
substitutes for the rubric’s Falcon-7B (rationale in Appendix); a
per-method comparison table is in the Appendix.
```mermaid
flowchart TB
subgraph Input["39,923 raw reviews"]
G["Google<br>23,250"]
T["Trustpilot<br>16,673"]
end
subgraph Clean["Clean and filter"]
D1["Drop empty text<br>-9,352 Google"]
D2["Parse datetime<br>extract temporal features"]
D3["Negative filter<br>5,931 reviews<br>4,137 at 335 shared locations"]
end
subgraph Methods["Multi-method analysis"]
M1["BERTopic<br>4 lenses<br>full · top-30 · anger · LLM"]
M2["Gensim LDA<br>10 classical topics<br>coherence 0.449"]
M3["Emotion classifier<br>6 emotions"]
M4["Qwen-7B on A100<br>5,999 phrase outputs"]
end
Input --> Clean
Clean --> Methods
Methods --> Out["Triangulated findings"]
classDef input fill:#16213e,stroke:#70c0d0,color:#b0e0e8
classDef clean fill:#2a1e10,stroke:#f0a040,color:#f0d8a0
classDef method fill:#14261a,stroke:#5fc95f,color:#c0f0c0
classDef out fill:#1e1830,stroke:#a078d8,color:#e0d0ff
class G,T input
class D1,D2,D3 clean
class M1,M2,M3,M4 method
class Out out
```
Figure 1. Multi-method NLP pipeline overview — from raw reviews through cleaning to four complementary analytical methods.
The most significant structural finding is that Google and Trustpilot capture fundamentally different complaint types. Google reviews focus on the physical gym experience: equipment, cleanliness, overcrowding, music volume, and parking. Trustpilot reviews focus on the business experience: membership fees, cancellation difficulties, billing errors, and customer service. The word “membership” ranks #3 in Trustpilot’s negative-review word frequencies but does not appear in Google’s top 10. Platform-specific bigrams confirm this: Google surfaces “free weights” and “peak times” while Trustpilot surfaces “joining fee” and “direct debit.”
```mermaid
flowchart TB
Core["PureGym negative reviews"]
subgraph Google["Google: physical experience"]
G1["Locker theft (51)"]
G2["Loud music (43)"]
G3["Overcrowding (37)"]
G4["Equipment / cleanliness"]
G5["Peak times, parking"]
end
subgraph Trust["Trustpilot: business experience"]
T1["Joining fee (231)"]
T2["PIN / app problems (164)"]
T3["Customer service (38)"]
T4["Cancellation / billing"]
T5["Direct debit disputes"]
end
Core --> Google
Core --> Trust
classDef core fill:#2a1212,stroke:#d07070,color:#f0b0b0
classDef google fill:#16213e,stroke:#70c0d0,color:#b0e0e8
classDef trust fill:#2a1e10,stroke:#f0a040,color:#f0d8a0
class Core core
class G1,G2,G3,G4,G5 google
class T1,T2,T3,T4,T5 trust
```
Figure 2. Platform-level complaint culture split — Google captures the physical gym experience, Trustpilot the business-process experience.
Running BERTopic separately on each platform produced 11 topics each, but with distinct compositions. Google found locker theft (51 reviews), loud music (43), and overcrowding (37) — all physical issues. Trustpilot found joining fee complaints (231 reviews), PIN/app problems (164), and customer service failures (38) — all business process issues.
BERTopic identified 10 distinct complaint clusters from the merged negative reviews. Beyond the expected general complaints (1,443 documents), the model isolated highly specific, actionable issues: parking fines of exactly £85 (124 reviews), a 4-hour class cancellation policy causing frustration (115 reviews), locker theft and broken locks (112 reviews), and water machines broken for weeks without management action (38 reviews). This granularity is where BERTopic excels over LDA, which tends to merge related themes.
LDA’s 10-topic model provided a useful complementary perspective. It automatically separated Danish (Topic 5) and German (Topic 7) language clusters — reviews from PureGym’s Danish and Swiss operations. It also produced a clear billing cluster (Topic 9: “membership, cancel, joining fee, payment”) that aligned with Trustpilot’s platform-specific topics.
Two further BERTopic runs sharpened the picture. The anger-filtered run (2,537 reviews, 24.8% outliers vs the full run’s 35.6%) narrowed the primary anger drivers to membership cancellation, rude staff, and equipment failures — confirming the same themes the full run found but at higher resolution, since the anger filter strips out the more diffuse complaints. The LLM-driven run, where Qwen first extracted natural-language topic phrases per review and BERTopic then meta-clustered the 5,999 phrase outputs, produced substantively different clusters dominated by intent-bearing phrases like “personal trainer turnover” and “rude staff feedback” — capturing customer meaning in a way bag-of-words BERTopic cannot.
Classifying all negative reviews by emotion shows anger as the dominant signal, with temporal and star-band patterns that point to different response strategies — a 1★ anger review is a furious customer needing a fast reply; a 2★ sadness review is a disappointed one needing a different touch.
| Emotion | Count | % of negatives | Peak hour | Dominant star | Median words |
|---|---|---|---|---|---|
| Anger | 2,815 | 44% | 20:00 | 1★ (48%) | 34 |
| Sadness | ~1,471 | 23% | 20:00 | 2★ (25%) | — |
| Fear | ~384 | 6% | 01:00 | — | — |
| Surprise | — | — | — | — | 58 |
Table 1. Emotion landscape — share of negative reviews, peak
hour, dominant star band, median word length. Counts marked
~ are derived from the percentage split.
Fear peaking at 01:00 is the late-night safety story at 24/7 gyms; surprise reviews running ~70% longer than anger hint that unexpected experiences prompt more detailed accounts.
Dynamic topic modelling revealed that every complaint category is rising. General complaints grew from 45 to 530 mentions across the analysis period. App and PIN problems — virtually nonexistent at the start — grew to 76 mentions, indicating a new systemic issue likely tied to a software update. Negative reviews tripled from June 2023 onwards, with April 2024 as the worst month (335 negative Google reviews). This trajectory suggests a structural change in operations, pricing, or staffing rather than isolated incidents.
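The monthly trend behind the June 2023 inflection is a straightforward resample of the parsed review datetimes. A sketch with synthetic stand-in dates (not the real review timestamps):

```python
import pandas as pd

# Synthetic stand-in dates for negative reviews.
dates = pd.to_datetime([
    "2023-05-10", "2023-06-02", "2023-06-15", "2023-06-28",
    "2024-04-01", "2024-04-09", "2024-04-20", "2024-04-30",
])

# One row per negative review, indexed by its datetime.
negatives = pd.Series(1, index=dates)

# Count negative reviews per calendar month ("MS" = month start).
monthly = negatives.resample("MS").sum()
worst_month = monthly.idxmax()
```

On the real data this aggregation is what exposes both the tripling from June 2023 and April 2024 as the worst month (335 negative Google reviews).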
Seven locations appear in both platforms’ top 20 worst lists,
confirming them as genuinely problematic rather than platform-specific
anomalies. London Stratford leads with 81 combined negative reviews. The
top-30 wordcloud sharpened against the full-dataset wordcloud — broad
complaint terms gave way to location-specific vocabulary
(mould, closed, instructor names) — confirming
that the focusing effect happens at the wordcloud level too, not just in
BERTopic. Running BERTopic on just the top 30 locations (37.1% outliers
vs 35.6% for the full dataset) acted as a different lens, surfacing
location-specific issues invisible at full scale: mould in showers,
individual gym closures despite 24/7 advertising, and
instructor-specific class complaints.
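The combined worst-location ranking is a sum of per-platform negative counts after the manual name-merge. A minimal sketch with toy per-platform numbers (the 50/31 split is invented; only the combined total of 81 for London Stratford is from the analysis):

```python
from collections import Counter

# Per-platform negative-review counts per merged location name
# (toy numbers for illustration).
google = Counter({"London Stratford": 50, "Leeds": 12})
trustpilot = Counter({"London Stratford": 31, "Leeds": 8})

# Counter addition sums counts key-by-key across platforms.
combined = google + trustpilot
worst, n = combined.most_common(1)[0]
```

Taking the top 20 of each platform's ranking separately and intersecting them is what yields the seven locations flagged on both lists.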
```mermaid
flowchart TB
BT["BERTopic<br>specific clusters"] --> F["Triangulated<br>conclusions"]
LDA["Gensim LDA<br>broad themes<br>+ language detection"] --> F
Em["Emotion classifier<br>urgency + timing"] --> F
LLM["Qwen-7B<br>natural phrases"] --> F
F --> A1["Rising app/PIN issues"]
F --> A2["7 worst locations<br>London Stratford #1"]
F --> A3["June 2023 inflection<br>structural change"]
F --> A4["Trustpilot response<br>132h median — slowest"]
classDef method fill:#14261a,stroke:#5fc95f,color:#c0f0c0
classDef synth fill:#1e1830,stroke:#a078d8,color:#e0d0ff
classDef action fill:#2a1e10,stroke:#f0a040,color:#f0d8a0
class BT,LDA,Em,LLM method
class F synth
class A1,A2,A3,A4 action
```
Figure 3. Triangulated synthesis — four method lenses converge on four headline findings.
The power of this analysis is triangulation: BERTopic isolates specific actionable clusters; LDA provides interpretable broad themes and detects multilingual segments; emotion classification adds urgency that pure topic modelling misses; Qwen-7B captures intent — “charged after cancelling” — that no bag-of-words model can.
For PureGym, the recommendations are tiered by urgency.
IMMEDIATE — fix this month. Re-rank the Trustpilot reply queue by emotion severity (anger first). Current median anger reply is 132 hours; industry benchmarks indicate 1-hour replies retain 71% of customers versus 48% at 24 hours.
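The re-ranked reply queue can be sketched as a two-key sort. Anger-first is the report's recommendation; the ordering of the remaining emotions and the record fields are assumptions for illustration:

```python
# Lower number = higher priority; anger first per the recommendation,
# the rest of the ordering is an assumption.
SEVERITY = {"anger": 0, "fear": 1, "sadness": 2, "surprise": 3, "joy": 4}

queue = [
    {"id": 1, "emotion": "sadness", "hours_waiting": 10},
    {"id": 2, "emotion": "anger", "hours_waiting": 2},
    {"id": 3, "emotion": "anger", "hours_waiting": 30},
]

# Sort by severity first, then longest-waiting within the same emotion.
queue.sort(key=lambda r: (SEVERITY[r["emotion"]], -r["hours_waiting"]))
order = [r["id"] for r in queue]
```

Under this rule the 30-hour anger review jumps the whole queue, which is exactly the behaviour needed to pull the 132-hour median down for the most churn-prone reviewers.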
OPERATIONAL — next 90 days. App/PIN problems grew from near-zero to 76 mentions — a likely software regression needing dedicated triage. Address the seven worst-performing locations (London Stratford leading at 81 combined negatives) as a sequenced programme.
STRATEGIC — next 6 months. Negative reviews tripled from June 2023 onwards. Cross-reference with operational, pricing, and staffing data to identify the structural cause before further degradation.
Table 2. Method-by-method comparison — input size, key hyperparameters, outlier rate, and headline cluster.
| Run | Input n | Hyperparams | Outliers | Headline cluster |
|---|---|---|---|---|
| BERTopic — full negatives | 4,137 | min_topic_size=20, UMAP seed=42 | 35.6% | parking £85 (124) |
| BERTopic — top-30 worst sites | 3,690 | min_topic_size=30, UMAP seed=42 | 37.1% | mould, 24/7 closures |
| BERTopic — anger-filtered | 2,537 | min_topic_size=10, UMAP seed=42 | 24.8% | membership, rude staff, equipment |
| BERTopic — LLM-driven (Qwen phrases) | 5,999 | min_topic_size=5, UMAP seed=42 | 9.8% | “personal trainer turnover”, “rude staff feedback” |
| Gensim LDA — classical comparison | full negatives | num_topics=10, passes=5, seed=42 | n/a | 10 broad themes; coherence 0.449; Danish + German auto-detected |
Table 3. Hyperparameter snapshot — full pipeline configuration.
| Component | Library | Configuration | Why |
|---|---|---|---|
| Sentence embeddings | sentence-transformers | model = all-MiniLM-L6-v2 | BERTopic default; speed/quality balance |
| UMAP | umap-learn | n_neighbors=15, n_components=5, min_dist=0.0, metric=cosine, random_state=42 | reproducibility across all 4 lenses |
| Vectorizer (topic labels) | sklearn | CountVectorizer + NLTK english + custom stops + bigrams; min_df=2 | balance generic + domain |
| BERTopic | bertopic | min_topic_size varies 5–30 per run (see method table above) | tuned per run for noise floor |
| Gensim LDA | gensim | num_topics=10, passes=5, random_state=42; coherence (u_mass) = 0.449 | like-for-like with BERTopic |
| Emotion classifier | transformers | model = bhadresh-savani/bert-base-uncased-emotion | rubric-mandated |
| LLM (topics + insights) | transformers | model = Qwen/Qwen2.5-7B-Instruct, max_new_tokens 120/400, temperature unset (greedy) | A100; replaces Falcon-7B; deterministic |
Why Qwen. Falcon-7B’s rubric prompts no longer reproduce under post-update weights (model-version drift). Qwen2.5-7B-Instruct runs the same prompts deterministically on an A100; the Sonnet-Qwen agreement below confirms operational equivalence.
Churn signal. The Twitter-trained emotion classifier
labels ~1,486 British-understated 1-2 star reviews as “joy”. A merged
churn_signal = emotion ∈ {anger, sadness} ∨ stars ≤ 2
recovers those OOD cases: on the 30,617-review full population it
captures 20.9% as churn-risk versus 14.2% for emotion alone — 2,032
additional reviews that emotion-only misses.
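The merged rule is a one-liner; a sketch with hypothetical field names:

```python
def churn_signal(emotion: str, stars: int) -> bool:
    """Merged churn-risk flag: negative emotion OR low star rating.

    The star-rating branch recovers British-understated 1-2 star
    reviews that the Twitter-trained classifier mislabels as "joy".
    """
    return emotion in {"anger", "sadness"} or stars <= 2
```

A 2-star review labelled “joy” by the classifier is still flagged, which is precisely the out-of-distribution case the emotion-only signal misses.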
Gold-vs-workhorse LLM comparison (30 held-out reviews). Claude Sonnet 4.6 produced gold-standard labels under a system prompt grounded in Perplexity Sonar Deep Research on PureGym’s PE-ownership economics (40% first-year churn, 27% mature-site ROIC, £600M Leonard Green acquisition EV). Qwen 2.5-7B zero-shot and Qwen 2.5-7B 10-shot (with ten Sonnet-derived few-shot examples) were benchmarked against that gold. Operational-lever agreement rose from 60.0% → 73.3%, churn-risk agreement from 53.3% → 70.0%, primary-topic token Jaccard from 0.124 → 0.166. Coaching a small open model with frontier-model examples closes most of the gap at zero marginal compute cost.
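The primary-topic token Jaccard metric can be sketched as set overlap on lowercased whitespace tokens (a simplification; the notebook's exact tokenisation is an assumption here):

```python
def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not (ta or tb):
        return 0.0          # two empty labels: define similarity as 0
    return len(ta & tb) / len(ta | tb)
```

For example, “rude staff” vs “rude staff feedback” scores 2/3, which gives a feel for why whole-label Jaccard values like 0.124 and 0.166 still reflect substantial topical overlap.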
Cost. $1.10 sunk Perplexity Sonar Deep Research
(2026-04-11) + $0.148 new Claude Sonnet 4.6 (40 calls: 10 iteration + 30
gold eval) + $0 Qwen on Colab Pro+ = $1.25 total, of which $0.148 is
incremental. Full code and artefacts in
basic/basic_notebook_appendix.ipynb at commit
81cede1.
Grootendorst, M. (2022) ‘BERTopic: Neural topic modeling with a class-based TF-IDF procedure’, arXiv preprint, arXiv:2203.05794. Available at: https://arxiv.org/abs/2203.05794.
Wang, W. et al. (2020) ‘MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers’, in Advances in Neural Information Processing Systems, 33. Available at: https://arxiv.org/abs/2002.10957.
Yang, A. et al. (2024) ‘Qwen2.5 Technical Report’, arXiv preprint, arXiv:2412.15115. Available at: https://arxiv.org/abs/2412.15115.