PureGym Customer Review Analysis: NLP Topic Modelling Report

CAM_DS_301 Weeks 4-5 | Topic Project 1

Introduction

This project analyses 39,923 PureGym customer reviews from two platforms, Google Reviews (23,250) and Trustpilot (16,673), using a multi-method NLP pipeline: word frequency analysis, BERTopic, Gensim LDA, BERT-based emotion classification, and LLM topic extraction with Falcon-7b-instruct. The objective is to extract actionable complaint themes that PureGym management could use to reduce negative reviews and improve customer retention.

Methodology

The analysis followed a progressive experimentation approach. After loading and cleaning the data (removing 9,352 Google reviews with no text), we parsed datetime fields to extract temporal features and filtered to 6,328 negative reviews (score < 3) from 310 locations common to both platforms.
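The filtering step can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the field names (`platform`, `location`, `score`, `text`) and the toy rows are assumptions, as is the choice to drop empty-text reviews before computing the shared locations.

```python
from collections import defaultdict

# Toy rows standing in for the cleaned review tables (hypothetical data;
# the field names are assumptions, not taken from the report).
reviews = [
    {"platform": "google", "location": "Leeds", "score": 1, "text": "Broken lockers"},
    {"platform": "google", "location": "Leeds", "score": 5, "text": "Great gym"},
    {"platform": "google", "location": "Bath", "score": 2, "text": ""},
    {"platform": "google", "location": "York", "score": 1, "text": "Too crowded"},
    {"platform": "trustpilot", "location": "Leeds", "score": 2, "text": "Charged a joining fee twice"},
    {"platform": "trustpilot", "location": "Bath", "score": 1, "text": "Cannot cancel"},
]


def negative_reviews_at_common_locations(rows, max_score=3):
    """Keep reviews with text, a score below max_score, and a location seen on every platform."""
    # Drop reviews that have no text.
    with_text = [r for r in rows if r["text"].strip()]

    # Collect the set of locations seen on each platform.
    locations = defaultdict(set)
    for r in with_text:
        locations[r["platform"]].add(r["location"])

    # Locations present on all platforms.
    common = set.intersection(*locations.values()) if locations else set()

    return [r for r in with_text if r["score"] < max_score and r["location"] in common]


negatives = negative_reviews_at_common_locations(reviews)
```

In the toy data only Leeds has text reviews on both platforms, so the York and Bath rows drop out even though they are negative.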

Text preprocessing was evaluated using a workbench of 10 configurations. A critical finding: heavy preprocessing hurts BERTopic. Lemmatization increased the outlier rate from 36.7% to 47.6%, because BERT embeddings rely on natural language context. Zipf's Law analysis (log-log slope -1.034, R²=0.993) is consistent with this explanation: the reviews follow the natural word-frequency distribution that BERT saw in training, so stripping stopwords removes context the embeddings depend on. The optimal pipeline therefore feeds raw text to the embedding model and applies a CountVectorizer with custom stopwords and bigrams only when generating topic labels.
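A Zipf slope and R² of the kind quoted above can be estimated with an ordinary least-squares fit of log frequency against log rank. The sketch below uses only the standard library; the tokenisation regex is an assumption, and the report does not say how its own fit was computed. The input needs at least two distinct word frequencies, otherwise the fit is undefined.

```python
import math
import re
from collections import Counter


def zipf_fit(texts):
    """Return (slope, r_squared) of log10(frequency) regressed on log10(rank)."""
    # Simple lowercase word tokenisation (an assumption for this sketch).
    counts = Counter(tok for t in texts for tok in re.findall(r"[a-z']+", t.lower()))
    freqs = sorted(counts.values(), reverse=True)

    xs = [math.log10(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log10(f) for f in freqs]

    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, r_squared
```

A slope near -1 with R² near 1 indicates that frequency falls off roughly in inverse proportion to rank, which is the pattern Zipf's Law describes.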

BERTopic was run in multiple configurations: the full dataset, platform-specific subsets, the top 30 worst locations, and anger-filtered reviews. Gensim LDA provided a classical comparison (10 topics, coherence 0.449). Emotion analysis assigned every negative review to one of six emotions. Falcon-7b-instruct (run on a T4 GPU via Colab) extracted natural-language topic phrases from 600 reviews, averaging 3.9 topics per review and producing 1,349 unique topics, which were then meta-clustered with BERTopic.
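The split between raw text for embeddings and a CountVectorizer for labels might look like the sketch below. It assumes the `bertopic` and `scikit-learn` packages are installed. The stopword list is hypothetical, since the report does not give its own, and `build_topic_model` and `fit_topics` are illustrative helper names rather than the project's functions.

```python
# Hypothetical domain stopwords for topic labels; the report's actual list is not given.
CUSTOM_STOPWORDS = ["gym", "puregym", "pure"]


def build_topic_model(extra_stopwords=CUSTOM_STOPWORDS):
    # Imported lazily so this module loads without the heavy dependencies.
    from bertopic import BERTopic
    from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

    # The vectorizer only shapes the topic keywords used as labels;
    # the embeddings are still computed from the raw documents.
    vectorizer = CountVectorizer(
        stop_words=list(ENGLISH_STOP_WORDS.union(extra_stopwords)),
        ngram_range=(1, 2),  # unigrams plus bigrams such as "joining fee"
    )
    return BERTopic(vectorizer_model=vectorizer)


def fit_topics(raw_reviews):
    """Fit on unprocessed review text and report the share of outlier documents."""
    model = build_topic_model()
    topics, _ = model.fit_transform(raw_reviews)
    # BERTopic assigns topic -1 to documents it cannot place in a cluster.
    outlier_share = sum(1 for t in topics if t == -1) / len(topics)
    return model, outlier_share
```

Passing the same raw reviews through `fit_topics` under different preprocessing choices gives comparable outlier shares, which is the kind of comparison the preprocessing workbench reports.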

Key Findings

Two Platforms, Two Complaint Cultures

The most significant structural finding is that Google and Trustpilot capture fundamentally different complaint types. Google reviews focus on the physical gym experience: equipment, cleanliness, overcrowding, music volume, and parking. Trustpilot reviews focus on the business experience: membership fees, cancellation difficulties, billing errors, and customer service. The word "membership" is the third most frequent word in Trustpilot negative reviews but does not appear in Google's top 10. Platform-specific bigrams confirm this: Google surfaces "free weights" and "peak times" while Trustpilot surfaces "joining fee" and "direct debit."
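A platform-by-platform bigram comparison can be sketched with a counter over adjacent tokens. The stopword set and example reviews below are invented for illustration, and stopwords are removed before pairing (one common convention, not necessarily the one used in the project).

```python
import re
from collections import Counter

# Small illustrative stopword set (an assumption, not the project's list).
STOPWORDS = {"the", "a", "and", "to", "is", "are", "at", "of", "for", "my", "i", "was", "on"}


def top_bigrams(texts, k=3):
    """Return the k most frequent bigrams after stopword removal."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]
        counts.update(zip(tokens, tokens[1:]))
    return [" ".join(pair) for pair, _ in counts.most_common(k)]


# Hypothetical reviews written to echo the themes in the report.
google_reviews = [
    "Free weights are always taken at peak times",
    "No free weights at peak times",
    "Free weights area is tiny",
]
trustpilot_reviews = [
    "Charged a joining fee after my direct debit was set up",
    "Joining fee taken twice",
    "Joining fee and direct debit both charged",
]

google_top = top_bigrams(google_reviews, k=2)
trustpilot_top = top_bigrams(trustpilot_reviews, k=2)
```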

Running BERTopic separately on each platform produced 11 topics for each, but with distinct compositions. The Google model found locker theft (51 reviews), loud music (43), and overcrowding (37), all physical issues. The Trustpilot model found joining fee complaints (231 reviews), PIN/app problems (164), and customer service failures (38), all business process issues.

Complaint Topics and Their Specificity

BERTopic identified 10 distinct complaint clusters from the merged negative reviews. Beyond the expected general complaints (1,443 documents), the model isolated highly specific, actionable issues: parking fines of exactly £85 (124 reviews), a 4-hour class cancellation policy causing frustration (115 reviews), locker theft and broken locks (112 reviews), and water machines broken for weeks without management action (38 reviews). This granularity is where BERTopic excels over LDA, which tends to merge related themes.

LDA's 10-topic model provided a useful complementary perspective. It automatically separated Danish (Topic 5) and German (Topic 7) language clusters — reviews from PureGym's Danish and Swiss operations. It also produced a clear billing cluster (Topic 9: "membership, cancel, joining fee, payment") that aligned with Trustpilot's platform-specific topics.

Emotion Analysis Reveals Urgency Layers

Classifying all negative reviews by emotion showed that 44% express anger (2,815 reviews), split evenly between platforms. Sadness accounts for 23% and fear for 6%. The temporal dimension adds depth: anger and sadness peak at 8 PM, plausibly reflecting post-gym frustration, while fear peaks at 1 AM, which points to late-night safety concerns at 24/7 gyms. Surprise reviews are the longest (median 58 words vs 34 for anger), suggesting that unexpected experiences prompt more detailed accounts.

Star rating further differentiates emotions: 1-star reviews are 48% anger, while 2-star reviews shift toward sadness (25%) — the difference between a furious customer and a disappointed one, each requiring a different response strategy.
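Once each review carries an emotion label, the per-rating breakdown is a grouped proportion. The sketch below assumes the labels already exist as `(star_rating, emotion)` pairs. The toy pairs are invented and do not reproduce the report's percentages.

```python
from collections import Counter, defaultdict


def emotion_share_by_rating(rows):
    """Map each star rating to the share of reviews carrying each emotion label."""
    counts = defaultdict(Counter)
    for rating, emotion in rows:
        counts[rating][emotion] += 1
    return {
        rating: {emotion: n / sum(c.values()) for emotion, n in c.items()}
        for rating, c in counts.items()
    }


# Hypothetical (star_rating, emotion) pairs from an upstream classifier.
labelled = [
    (1, "anger"), (1, "anger"), (1, "sadness"), (1, "fear"),
    (2, "sadness"), (2, "anger"), (2, "sadness"), (2, "surprise"),
]

shares = emotion_share_by_rating(labelled)
```

Here `shares[1]["anger"]` and `shares[2]["sadness"]` are both 0.5, mirroring the direction (though not the size) of the shift described above.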

Temporal Trends: Things Are Getting Worse

Dynamic topic modelling revealed that every complaint category is rising. General complaints grew from 45 to 530 mentions across the analysis period. App and PIN problems — virtually nonexistent at the start — grew to 76 mentions, indicating a new systemic issue likely tied to a software update. Negative reviews tripled from June 2023 onwards, with April 2024 as the worst month (335 negative Google reviews). This trajectory suggests a structural change in operations, pricing, or staffing rather than isolated incidents.
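The monthly volume figures come down to bucketing review timestamps by calendar month. A minimal sketch, assuming ISO-formatted timestamp strings (the report does not state the stored format) and using invented dates:

```python
from collections import Counter
from datetime import datetime


def negatives_per_month(timestamps):
    """Count reviews per calendar month, keyed as 'YYYY-MM'."""
    return Counter(datetime.fromisoformat(ts).strftime("%Y-%m") for ts in timestamps)


# Hypothetical timestamps of negative reviews.
negative_timestamps = [
    "2023-05-03T10:00:00",
    "2023-06-11T20:15:00",
    "2023-06-20T08:30:00",
    "2024-04-02T19:00:00",
    "2024-04-15T20:45:00",
    "2024-04-28T01:10:00",
]

monthly = negatives_per_month(negative_timestamps)
worst_month, worst_count = monthly.most_common(1)[0]
```

The same counter, restricted to the documents of one topic, gives the per-topic trend that dynamic topic modelling reports.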

Location Hotspots

Seven locations appear in both platforms' top 20 worst lists, confirming them as genuinely problematic rather than platform-specific anomalies. London Stratford leads with 81 combined negative reviews. Running BERTopic on just the top 30 locations produced 38 topics with only 24.4% outliers — sharper than the full dataset's 32.7% — revealing location-specific issues such as mould in showers and gym closures despite 24/7 advertising.
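The cross-platform hotspot check is a set intersection of each platform's worst-N list. In the sketch below, "London Stratford" comes from the report; the other location names, the counts, and the helper names are hypothetical.

```python
from collections import Counter


def worst_locations(locations, n):
    """Return the n locations with the most negative reviews."""
    return [loc for loc, _ in Counter(locations).most_common(n)]


def shared_hotspots(google_locs, trustpilot_locs, n=20):
    """Locations that appear in both platforms' worst-n lists."""
    google_top = set(worst_locations(google_locs, n))
    trustpilot_top = set(worst_locations(trustpilot_locs, n))
    return sorted(google_top & trustpilot_top)


# One entry per negative review, holding that review's location (toy data).
google_negative = ["London Stratford"] * 3 + ["Leeds"] * 2 + ["Bath"]
trustpilot_negative = ["London Stratford"] * 3 + ["Bath"] * 2 + ["York"]

hotspots = shared_hotspots(google_negative, trustpilot_negative, n=2)
```

With `n=2`, only London Stratford sits in both toy lists; the report applies the same idea with the top 20 per platform.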

Conclusion

The power of this analysis lies in triangulation. BERTopic excels at isolating specific, actionable complaints. LDA provides interpretable broad themes and detects multilingual segments. Emotion classification adds a sense of urgency that pure topic modelling misses. Falcon-7b generates human-readable topic phrases that capture nuance: "charged after cancelling" conveys intent that a bag-of-words model would miss.

Two further analyses support these findings. Aspect-based sentiment analysis (using zero-shot NLI) showed that cleanliness and safety have the highest negative sentiment rates (98%) when mentioned, while parking is the least widely discussed aspect. A "complaint DNA" profiling approach, which clusters on combined semantic, emotion, and structural features, independently surfaced the non-English review issue and sarcasm patterns without being told about either, corroborating the findings of the other methods.

For PureGym, the immediate priorities are clear: address the rising app/PIN issues, improve response times for angry Trustpilot complaints (the median is currently 132 hours, the slowest, when these should be the fastest), investigate the June 2023 inflection point, and focus operational improvements on the seven consistently worst locations. Each method pointed to the same conclusion from a different angle, which is precisely what makes multi-method NLP analysis valuable.

