PureGym Customer Review Analysis: NLP Topic Modelling Report

Pierre Sutherland

CAM_DS_301 Weeks 4-5 | Topic Project 1

Introduction

This project analyses 39,923 PureGym customer reviews across two platforms — Google Reviews (23,250) and Trustpilot (16,673) — using a multi-method NLP pipeline: word frequency analysis, BERTopic, Gensim LDA, BERT-based emotion classification, and Qwen2.5-7B-Instruct. The objective is to extract actionable complaint themes that PureGym management could use to reduce negative reviews and improve customer retention.

Methodology

The analysis followed a progressive experimentation approach. After loading and cleaning the data (removing 9,352 Google reviews with no text), we parsed datetime fields and filtered to 5,931 negative reviews (score < 3); 4,137 of those came from 335 locations common to both platforms (after manual cross-platform name-merging) and fed cross-platform topic modelling.

Text preprocessing was evaluated using a workbench of 10 configurations. A critical finding: heavy preprocessing hurts BERTopic. Lemmatization increased outliers from 36.7% to 47.6%, because BERT embeddings rely on natural language context. Zipf’s Law analysis (slope -1.034, R²=0.993) confirmed why: BERT was trained on the full Zipf distribution, so stripping stopwords strips signal. The optimal pipeline feeds raw text for embeddings while using a CountVectorizer with custom stopwords and bigrams for labels.

BERTopic was applied as four complementary lenses: full negative reviews at common locations, the top-30-locations subset, anger-filtered reviews, and an LLM-driven run where Qwen2.5-7B-Instruct first extracted natural-language topic phrases from each anger-filtered review and BERTopic then meta-clustered the resulting 5,999 phrase outputs. UMAP is seeded (random_state=42) across all four runs for reproducibility. Gensim LDA ran on the same negative pool as a classical comparison (10 topics, coherence 0.449). Emotion analysis classified every negative review across six categories. Qwen2.5-7B-Instruct substitutes for the rubric’s Falcon-7B (rationale in Appendix); a per-method comparison table is in the Appendix.

flowchart TB
  subgraph Input["39,923 raw reviews"]
    G["Google<br>23,250"]
    T["Trustpilot<br>16,673"]
  end
  subgraph Clean["Clean and filter"]
    D1["Drop empty text<br>-9,352 Google"]
    D2["Parse datetime<br>extract temporal features"]
    D3["Negative filter<br>5,931 reviews<br>4,137 at 335 shared locations"]
  end
  subgraph Methods["Multi-method analysis"]
    M1["BERTopic<br>4 lenses<br>full · top-30 · anger · LLM"]
    M2["Gensim LDA<br>10 classical topics<br>coherence 0.449"]
    M3["Emotion classifier<br>6 emotions"]
    M4["Qwen-7B on A100<br>5,999 phrase outputs"]
  end
  Input --> Clean
  Clean --> Methods
  Methods --> Out["Triangulated findings"]
  classDef input fill:#16213e,stroke:#70c0d0,color:#b0e0e8
  classDef clean fill:#2a1e10,stroke:#f0a040,color:#f0d8a0
  classDef method fill:#14261a,stroke:#5fc95f,color:#c0f0c0
  classDef out fill:#1e1830,stroke:#a078d8,color:#e0d0ff
  class G,T input
  class D1,D2,D3 clean
  class M1,M2,M3,M4 method
  class Out out

Figure 1. Multi-method NLP pipeline overview — from raw reviews through cleaning to four complementary analytical methods.

Key Findings

Two Platforms, Two Complaint Cultures

The most significant structural finding is that Google and Trustpilot capture fundamentally different complaint types. Google reviews focus on the physical gym experience: equipment, cleanliness, overcrowding, music volume, and parking. Trustpilot reviews focus on the business experience: membership fees, cancellation difficulties, billing errors, and customer service. The word “membership” ranks #3 on Trustpilot negative reviews but does not appear in Google’s top 10. Platform-specific bigrams confirm this: Google surfaces “free weights” and “peak times” while Trustpilot surfaces “joining fee” and “direct debit.”

flowchart TB
  Core["PureGym negative reviews"]
  subgraph Google["Google: physical experience"]
    G1["Locker theft (51)"]
    G2["Loud music (43)"]
    G3["Overcrowding (37)"]
    G4["Equipment / cleanliness"]
    G5["Peak times, parking"]
  end
  subgraph Trust["Trustpilot: business experience"]
    T1["Joining fee (231)"]
    T2["PIN / app problems (164)"]
    T3["Customer service (38)"]
    T4["Cancellation / billing"]
    T5["Direct debit disputes"]
  end
  Core --> Google
  Core --> Trust
  classDef core fill:#2a1212,stroke:#d07070,color:#f0b0b0
  classDef google fill:#16213e,stroke:#70c0d0,color:#b0e0e8
  classDef trust fill:#2a1e10,stroke:#f0a040,color:#f0d8a0
  class Core core
  class G1,G2,G3,G4,G5 google
  class T1,T2,T3,T4,T5 trust

Figure 2. Platform-level complaint culture split — Google captures the physical gym experience, Trustpilot the business-process experience.

Running BERTopic separately on each platform produced 11 topics each, but with distinct compositions. Google found locker theft (51 reviews), loud music (43), and overcrowding (37) — all physical issues. Trustpilot found joining fee complaints (231 reviews), PIN/app problems (164), and customer service failures (38) — all business process issues.

Complaint Topics and Their Specificity

BERTopic identified 10 distinct complaint clusters from the merged negative reviews. Beyond the expected general complaints (1,443 documents), the model isolated highly specific, actionable issues: parking fines of exactly £85 (124 reviews), a 4-hour class cancellation policy causing frustration (115 reviews), locker theft and broken locks (112 reviews), and water machines broken for weeks without management action (38 reviews). This granularity is where BERTopic excels over LDA, which tends to merge related themes.

LDA’s 10-topic model provided a useful complementary perspective. It automatically separated Danish (Topic 5) and German (Topic 7) language clusters — reviews from PureGym’s Danish and Swiss operations. It also produced a clear billing cluster (Topic 9: “membership, cancel, joining fee, payment”) that aligned with Trustpilot’s platform-specific topics.

Two further BERTopic runs sharpened the picture. The anger-filtered run (2,537 reviews, 24.8% outliers vs the full run’s 35.6%) narrowed the primary anger drivers to membership cancellation, rude staff, and equipment failures — confirming the same themes the full run found but at higher resolution, since the anger filter strips out the more diffuse complaints. The LLM-driven run, where Qwen first extracted natural-language topic phrases per review and BERTopic then meta-clustered the 5,999 phrase outputs, produced substantively different clusters dominated by intent-bearing phrases like “personal trainer turnover” and “rude staff feedback” — capturing customer meaning in a way bag-of-words BERTopic cannot.

Emotion Analysis Reveals Urgency Layers

Classifying all negative reviews by emotion shows anger as the dominant signal, with temporal and star-band patterns that point to different response strategies — a 1★ anger review is a furious customer needing a fast reply; a 2★ sadness review is a disappointed one needing a different touch.

Emotion Count % of negs Peak hour Dominant star Median words
Anger 2,815 44% 20:00 1★ (48%) 34
Sadness ~1,471 23% 20:00 2★ (25%)
Fear ~384 6% 01:00
Surprise 58

Table 1. Emotion landscape — share of negative reviews, peak hour, dominant star band, median word length. Counts marked ~ are derived from the percentage split.

Fear peaking at 01:00 is the late-night safety story at 24/7 gyms; surprise reviews running ~70% longer than anger hint that unexpected experiences prompt more detailed accounts.

Dynamic topic modelling revealed that every complaint category is rising. General complaints grew from 45 to 530 mentions across the analysis period. App and PIN problems — virtually nonexistent at the start — grew to 76 mentions, indicating a new systemic issue likely tied to a software update. Negative reviews tripled from June 2023 onwards, with April 2024 as the worst month (335 negative Google reviews). This trajectory suggests a structural change in operations, pricing, or staffing rather than isolated incidents.

Location Hotspots

Seven locations appear in both platforms’ top 20 worst lists, confirming them as genuinely problematic rather than platform-specific anomalies. London Stratford leads with 81 combined negative reviews. The top-30 wordcloud sharpened against the full-dataset wordcloud — broad complaint terms gave way to location-specific vocabulary (mould, closed, instructor names) — confirming that the focusing effect happens at the wordcloud level too, not just in BERTopic. Running BERTopic on just the top 30 locations (37.1% outliers vs 35.6% for the full dataset) acted as a different lens, surfacing location-specific issues invisible at full scale: mould in showers, individual gym closures despite 24/7 advertising, and instructor-specific class complaints.

flowchart TB
  BT["BERTopic<br>specific clusters"] --> F["Triangulated<br>conclusions"]
  LDA["Gensim LDA<br>broad themes<br>+ language detection"] --> F
  Em["Emotion classifier<br>urgency + timing"] --> F
  LLM["Qwen-7B<br>natural phrases"] --> F
  F --> A1["Rising app/PIN issues"]
  F --> A2["7 worst locations<br>London Stratford #1"]
  F --> A3["June 2023 inflection<br>structural change"]
  F --> A4["Trustpilot response<br>132h median — slowest"]
  classDef method fill:#14261a,stroke:#5fc95f,color:#c0f0c0
  classDef synth fill:#1e1830,stroke:#a078d8,color:#e0d0ff
  classDef action fill:#2a1e10,stroke:#f0a040,color:#f0d8a0
  class BT,LDA,Em,LLM method
  class F synth
  class A1,A2,A3,A4 action

Figure 3. Triangulated synthesis — four method lenses converge on four headline findings.

Conclusion

The power of this analysis is triangulation: BERTopic isolates specific actionable clusters; LDA provides interpretable broad themes and detects multilingual segments; emotion classification adds urgency that pure topic modelling misses; Qwen-7B captures intent — “charged after cancelling” — that no bag-of-words model can.

For PureGym, the recommendations are tiered by urgency.

IMMEDIATE — fix this month. Re-rank the Trustpilot reply queue by emotion severity (anger first). Current median anger reply is 132 hours; industry benchmarks indicate 1-hour replies retain 71% of customers versus 48% at 24 hours.

OPERATIONAL — next 90 days. App/PIN problems grew from near-zero to 76 mentions — a likely software regression needing dedicated triage. Address the seven worst-performing locations (London Stratford leading at 81 combined negatives) as a sequenced programme.

STRATEGIC — next 6 months. Negative reviews tripled from June 2023 onwards. Cross-reference with operational, pricing, and staffing data to identify the structural cause before further degradation.

Appendix — Method comparison, churn signal, and LLM triangulation

Table 2. Method-by-method comparison — input size, key hyperparameters, outlier rate, and headline cluster.

Run Input n Hyperparams Outliers Headline cluster
BERTopic — full negatives 4,137 min_topic_size=20, UMAP seed=42 35.6% parking £85 (124)
BERTopic — top-30 worst sites 3,690 min_topic_size=30, UMAP seed=42 37.1% mould, 24/7 closures
BERTopic — anger-filtered 2,537 min_topic_size=10, UMAP seed=42 24.8% membership, rude staff, equipment
BERTopic — LLM-driven (Qwen phrases) 5,999 min_topic_size=5, UMAP seed=42 9.8% “personal trainer turnover”, “rude staff feedback”
Gensim LDA — classical comparison full negs num_topics=10, passes=5, seed=42 n/a 10 broad themes; coherence 0.449; Danish + German auto-detected

Table 3. Hyperparameter snapshot — full pipeline configuration.

Component Library Configuration Why
Sentence embeddings sentence-transformers model = all-MiniLM-L6-v2 BERTopic default; speed/quality balance
UMAP umap-learn n_neighbors=15, n_components=5, min_dist=0.0, metric=cosine, random_state=42 reproducibility across all 4 lenses
Vectorizer (topic labels) sklearn CountVectorizer + NLTK english + custom stops + bigrams; min_df=2 balance generic + domain
BERTopic bertopic min_topic_size varies 5–30 per run (see method table above) tuned per run for noise floor
Gensim LDA gensim num_topics=10, passes=5, random_state=42; coherence (u_mass) = 0.449 like-for-like with BERTopic
Emotion classifier transformers model = bhadresh-savani/bert-base-uncased-emotion rubric-mandated
LLM (topics + insights) transformers model = Qwen/Qwen2.5-7B-Instruct, max_new_tokens 120/400, temperature = null (greedy) A100; replaces Falcon-7B; deterministic

Why Qwen. Falcon-7B’s rubric prompts no longer reproduce under post-update weights (model-version drift). Qwen-2.5-7B runs the same prompts deterministically on A100; the Sonnet-Qwen agreement below confirms operational equivalence.

Churn signal. The Twitter-trained emotion classifier labels ~1,486 British-understated 1-2 star reviews as “joy”. A merged churn_signal = emotion ∈ {anger, sadness} ∨ stars ≤ 2 recovers those OOD cases: on the 30,617-review full population it captures 20.9% as churn-risk versus 14.2% for emotion alone — 2,032 additional reviews that emotion-only misses.

Gold-vs-workhorse LLM comparison (30 held-out reviews). Claude Sonnet 4.6 produced gold-standard labels under a system prompt grounded in Perplexity Sonar Deep Research on PureGym’s PE-ownership economics (40% first-year churn, 27% mature-site ROIC, £600M Leonard Green acquisition EV). Qwen 2.5-7B zero-shot and Qwen 2.5-7B 10-shot (with ten Sonnet-derived few-shot examples) were benchmarked against that gold. Operational-lever agreement rose from 60.0% → 73.3%, churn-risk agreement from 53.3% → 70.0%, primary-topic token Jaccard from 0.124 → 0.166. Coaching a small open model with frontier-model examples closes most of the gap at zero marginal compute cost.

Cost. $1.10 sunk Perplexity Sonar Deep Research (2026-04-11) + $0.148 new Claude Sonnet 4.6 (40 calls: 10 iteration + 30 gold eval) + $0 Qwen on Colab Pro+ = $1.25 total, of which $0.148 is incremental. Full code and artefacts in basic/basic_notebook_appendix.ipynb at commit 81cede1.

References

Grootendorst, M. (2022) ‘BERTopic: Neural topic modeling with a class-based TF-IDF procedure’, arXiv preprint, arXiv:2203.05794. Available at: https://arxiv.org/abs/2203.05794.

Wang, W. et al. (2020) ‘MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers’, in Advances in Neural Information Processing Systems, 33. Available at: https://arxiv.org/abs/2002.10957.

Yang, A. et al. (2024) ‘Qwen2.5 Technical Report’, arXiv preprint, arXiv:2412.15115. Available at: https://arxiv.org/abs/2412.15115.