pd.read_excel. 23,250 raw rows.CAM_DS_301 Topic Project — V3 pipeline — each chart tagged to the rubric
Every chart carries a tag against the marking rubric. No hype, no padding. Where the chart has a known weakness or caveat, that is stated inline. Where we went beyond the rubric, it is marked separately so the examiner can distinguish required work from extras.
Preprocessing is not one-size-fits-all. Heavy preprocessing hurts BERTopic (it was trained on natural text) and helps Gensim LDA (bag-of-words). V3 runs BERTopic on near-raw text and LDA on fully preprocessed text. The workbench that established this is in 08_preprocessing_workbench.py.
Reads alongside the per-phase pages. Three diagrams + three real before/after examples taken from the actual data.
Every transformation in the pipeline either filters rows (drops data) or transforms text (changes how the model sees a review). This page makes both visible. Numbers below come from re-reading the saved parquet files at each stage — they are the actual row counts, not estimates.
Each arrow is annotated with the operation and the number of rows added or removed.
flowchart TD G["Google xlsx
23250 raw rows"] T["Trustpilot xlsx
16673 raw rows"] G -->|"dropna Comment
minus 9352 empty"| GP["Google with text
13898"] T -->|"dropna Review Content
minus 0"| TP["Trustpilot with text
16673"] GP --> WC["Working corpus
30571"] TP --> WC WC -->|"langdetect
minus 2905 non-English"| EN["English corpus
27666"] WC -->|"set aside"| NE["Non-English
2905"] EN -->|"score lt 3"| NEG["Negative subset
5825"] EN -->|"common locations
310 of 512 and 376"| BTI["BERTopic input
18585"] EN -->|"BERT emotion model"| EM["Emotion classified
27666"] BTI -->|"fit_transform
plus outlier reduction"| TOPS["191 topics
zero pct outliers"] NEG -->|"heavy preprocess
plus Gensim LDA"| LDA["LDA 10 topics"] NEG -->|"Falcon prompt
subset"| FAL["Falcon input
600"] EM -->|"score le 2 AND top in joy love surprise"| FIX["Phase 8b
1512 corrected
26154 untouched"] EM -->|"top is anger"| ANG["Angry subset
2485"] ANG -->|"BERTopic on angry"| AT["Angry topics"] style G fill:#e8eef8,stroke:#5090e0,color:#1a1d27 style T fill:#e8eef8,stroke:#5090e0,color:#1a1d27 style GP fill:#e8eef8,stroke:#5090e0,color:#1a1d27 style TP fill:#e8eef8,stroke:#5090e0,color:#1a1d27 style WC fill:#e8eef8,stroke:#5090e0,color:#1a1d27 style EN fill:#e3f5ea,stroke:#50c878,color:#1a1d27 style NE fill:#fbe5e5,stroke:#e05050,color:#1a1d27 style NEG fill:#f4ecfa,stroke:#a070d0,color:#1a1d27 style BTI fill:#f4ecfa,stroke:#a070d0,color:#1a1d27 style ANG fill:#f4ecfa,stroke:#a070d0,color:#1a1d27 style FIX fill:#f4ecfa,stroke:#a070d0,color:#1a1d27 style EM fill:#fff4e0,stroke:#e8a040,color:#1a1d27 style TOPS fill:#fff4e0,stroke:#e8a040,color:#1a1d27 style LDA fill:#fff4e0,stroke:#e8a040,color:#1a1d27 style FAL fill:#fff4e0,stroke:#e8a040,color:#1a1d27 style AT fill:#fff4e0,stroke:#e8a040,color:#1a1d27
Same numbers as the diagram, shown as bars so you can see relative scale.
The biggest cut is the empty-comment Google rows: 9,352 of 23,250 Google reviewers left a star rating but no text (40.2%). That is a Google-platform behaviour, not a data-quality failure on our side. Second-biggest cut is the language filter (2,905 rows). Everything downstream from there is subsetting for a specific analytical question, not "cleaning".
A review is preprocessed differently depending on which model will read it. Heavy preprocessing helps bag-of-words models (LDA, FreqDist). It hurts transformer models (BERTopic, BERT emotion). Both branches start from the same raw column.
flowchart LR R["Raw review text
unmodified"] R -->|"branch A HEAVY"| A1["lowercase"] A1 --> A2["remove digits"] A2 --> A3["remove punctuation"] A3 --> A4["nltk word_tokenize"] A4 --> A5["remove NLTK stopwords
and short tokens"] A5 --> AOUT["Clean token list
to FreqDist wordcloud LDA"] R -->|"branch B MINIMAL"| B1["lowercase only"] B1 --> B2["remove domain stopwords
puregym gym pure"] B2 --> BOUT["Near-raw text
to BERTopic and BERT emotion"] AOUT -.->|"why"| W1["bag-of-words models
need denoising"] BOUT -.->|"why"| W2["transformer models
trained on natural text"] style R fill:#e8eef8,stroke:#5090e0,color:#1a1d27 style AOUT fill:#fff4e0,stroke:#e8a040,color:#1a1d27 style BOUT fill:#fff4e0,stroke:#e8a040,color:#1a1d27 style W1 fill:#f4ecfa,stroke:#a070d0,color:#1a1d27 style W2 fill:#f4ecfa,stroke:#a070d0,color:#1a1d27
Branch A applied to actual negative reviews from the dataset. Each shows raw → cleaned token list. These are the inputs that feed FreqDist (top-10 chart), the wordclouds, and Gensim LDA.
Wins: compact token lists, no syntactic noise, FreqDist counts what humans see as "the topic". Loses: negation, intensifiers ("very", "extremely" are stopwords in some lists), word order, sarcasm. For LDA and FreqDist this is fine. For semantics — you need a transformer.
The fix is a re-rank, not a re-run. Same model, same probabilities — we just demote the top-1 if it's a positive emotion on a low-star review. Original label preserved for audit.
flowchart TD R["Review row
text + score + Phase-8 emotion"] R --> Q1{"score le 2 ?"} Q1 -->|no| KEEP1["Keep original label"] Q1 -->|yes| Q2{"top-1 in joy love or surprise ?"} Q2 -->|no| KEEP2["Keep original label"] Q2 -->|yes| RR["Re-run SAME model
with top_k None
full probability vector"] RR --> PICK["Pick highest-ranked
NEGATIVE emotion
anger sadness or fear"] PICK --> WRITE["Write new label to emotion
preserve original as emotion_raw
append to audit CSV"] WRITE --> RESULT["1512 rows touched
26154 untouched
used as honest baseline"] style R fill:#e8eef8,stroke:#5090e0,color:#1a1d27 style Q1 fill:#fff,stroke:#888,color:#1a1d27 style Q2 fill:#fff,stroke:#888,color:#1a1d27 style KEEP1 fill:#e3f5ea,stroke:#50c878,color:#1a1d27 style KEEP2 fill:#e3f5ea,stroke:#50c878,color:#1a1d27 style RR fill:#fff4e0,stroke:#e8a040,color:#1a1d27 style PICK fill:#fff4e0,stroke:#e8a040,color:#1a1d27 style WRITE fill:#fff4e0,stroke:#e8a040,color:#1a1d27 style RESULT fill:#e8eef8,stroke:#5090e0,color:#1a1d27
Drawn directly from v3_08b_correction_audit.csv. Text unchanged — only the emotion label changes. The original label is preserved as emotion_raw on the row.
It would be tempting to call a different classifier on the suspect rows. We deliberately don't — that would be substituting a new model's judgement for the rubric-specified model's judgement. Instead, we ask the rubric-specified model: "you ranked joy first; what did you rank second?" The new label still comes from the model's own probability output. The score ≤ 2 filter is used as prior evidence that top-1 was wrong, not as the new label itself. Independent validation by j-hartmann/emotion-english-distilroberta-base (a different model on a different dataset) backs the flip direction 3.7× more often than the original label — that's the sanity check on the re-rank.
Script: v3/v3_01_load_and_describe.py | Rubric section: "Importing packages and data"
Load both Excel files into dataframes, inspect dtypes and nulls, drop rows with empty review text. Show per-platform score distributions.
pd.read_excel. 23,250 raw rows.pd.read_excel. 16,673 raw rows.dropna on text columns. Removed 9,352 Google rows (rated the gym but left no text) and 0 Trustpilot rows. Working corpus: 13,898 Google with text + 16,673 Trustpilot = 30,571 reviews with text.
Google reviews without text still have star ratings. Dropping them is correct for NLP but would bias a score-only analysis upward. When citing "Google mean = 4.15" in the report, we mean "mean of Google reviews with text" — not "mean of all Google ratings". Small distinction, matters for honest claims.
Scripts: v3_02_clean.py, v3_03_language_detection.py | Rubric: "Review Language can be ignored"
Parse datetimes, add temporal features (hour, day-of-week, month). Detect language with langdetect, separate non-English reviews. The rubric says language can be ignored; we kept the check because it affects downstream topic quality.
Agreement with Trustpilot's own language column is 95.2%. The 4.8% disagreement (~1,400 reviews) mostly reflects langdetect misreading short English reviews — "Great gym" as Welsh, "Very good" as Afrikaans. We keep those if Trustpilot's column says English. Residual false-negatives probably exist and are uncounted.
V1 LDA had surfaced a "Danish topic" and a "German topic" as cluster labels. V1 ignored them. V2 and V3 recognise those weren't thematic topics — they were a language-contamination signal. When one model flags something weird, investigate it with another model before writing it off. The LDA "language topics" were the prompt for V3's language-detection step.
Script: v3_04_basic_eda.py
Numeric exploration before running NLP: score distributions on the English-only corpus, temporal patterns, review length, reply rate, pre-NLP correlation heatmap. Every chart here is sanity-check / descriptive, not rubric-mandated.
Script: v3_05_frequency_analysis.py | Rubric section: "Conducting initial data investigation"
Lowercase, remove stopwords, remove numbers, tokenise, build frequency distributions, plot top-10 bar charts and wordclouds — for all reviews and negative-only reviews.
nltk.corpus.stopwords English list, .lower(), and digit-stripping regex. Preprocessed text feeds FreqDist and wordclouds. A separate minimally-processed version feeds BERTopic in Phase 6 (see learning box below).nltk.tokenize.word_tokenize applied after lowercasing. Tokens feed FreqDist.nltk.FreqDist built per platform.
We maintain two preprocessed versions of the corpus from Phase 5 onward. The heavy-preprocess version (lowercase + stopwords removed + numbers stripped + tokenised) feeds FreqDist, wordclouds, and Gensim LDA. The minimal-preprocess version (lowercase only, a small domain-stopword list for "puregym", "gym", "pure") feeds BERTopic and the BERT emotion model. Feeding heavily-preprocessed text to BERT-family models strips the linguistic signal they were trained to read.
Script: v3_06_bertopic.py | Rubric section: "Conducting initial topic modelling"
Run BERTopic on the merged, common-location subset. 191 topics discovered, 0% outliers after embeddings-based reduction. KeyBERTInspired + MMR representation models for cleaner, phrase-level topic labels.
V1 used defaults and accepted a 32.7% outlier rate (a third of the data thrown away). V3 tunes every stage: UMAP with random_state=42 for reproducibility, HDBSCAN with prediction_data=True, reduce_outliers(strategy="embeddings") to recover every document to its nearest topic, KeyBERTInspired + MMR for the representation model. Same algorithm, same data — very different output. The default BERTopic is a starting point, not a solution.
Below the conventional 0.6 threshold for well-separated topics. 191 topics is likely too many for an 18,585-review corpus — some are near-duplicates. Merging down with nr_topics would improve diversity at the cost of granularity. We kept 191 because we want Phase 12 severity scoring at fine granularity; a report-facing version would merge.
Script: v3_07_location_analysis.py | Rubric section: "Performing further data investigation"
Top 20 locations per platform (negative reviews), merge into a top-30 combined set, redo word frequency + wordcloud + BERTopic on that subset. Compare against Phase 6's full-corpus run.
The rubric asks "are there any additional insights compared to the first run?" The honest answer is no, not really. The top-30 subset gives a cleaner signal but finds the same themes. Claiming "striking new insights" here would be padding. The subset IS useful for the Phase 12 severity scoring because location-level rates are more stable on high-volume sites, but we don't pretend it surfaces different topics.
Script: v3_08_emotion_analysis.py | Rubric section: "Conducting emotion analysis"
Run bhadresh-savani/bert-base-uncased-emotion on all 27,666 reviews. Bar-plot emotion distribution for negative reviews. Extract angry subset, run BERTopic on it, compare.
transformers.pipeline("text-classification", ...).
The rubric-specified model (bhadresh-savani/bert-base-uncased-emotion) was fine-tuned on short, informal Twitter text with emojis. Gym reviews are often longer and more formal. Reviews that start "I firmly believe..." or end "Cheers bud." can read as positive to this model even when the body is a 1-star complaint.
Additional issue: the model is uncased, so ALL-CAPS emphasis ("DISGUSTING") is lowercased before the model sees it — the caps-emotion signal is lost.
This is documented honestly rather than hidden. Phase 8b (next page) applies a rubric-compliant score-guided re-rank using the SAME model's own probability vector, auditable row-by-row.
This model has exactly six labels: sadness, joy, love, anger, fear, surprise (the "emotion" dataset's classes — disgust is NOT one of them). There is no "neutral". Reviews that are flat/factual ("The gym. It's a gym.") get forced into one of the six — usually joy. That is a categorical-model limitation, not a model error. An honest analysis would add a neutral bucket using a second classifier; we note this as a residual limitation.
Scripts: v3_08b_emotion_correction.py, v3_08b_validation.py, v3_robustness_phase8c.py | Status: BEYOND rubric, methodology fix
The Phase 8 joy-on-negative-reviews issue, addressed. We re-run the SAME rubric-specified model with top_k=None, get the full probability vector, and for low-star reviews flagged positive, replace the top-1 with the model's highest-ranked NEGATIVE emotion. Original label preserved as emotion_raw. Every change logged.
IF score ≤ 2 AND top emotion ∈ {joy, love, surprise} → re-rank to highest negative emotion from the SAME model's probability vector. Applied to 1,512 rows. PhD-level internal panel reviewed the fix; independent classifier (j-hartmann/emotion-english-distilroberta-base, 7-class, multi-domain) validates the direction of the flip 3.7× more often than the original label.
Line 94 of v3_08b_emotion_correction.py applies str(t)[:512] — character-level truncation — before the pipeline's own token-level truncation. Redundant and strips ~80% of what BERT could have seen (BERT's 512-token limit is ~2,500 characters). The original Phase 8 script has the same bug at line 91. Matters because the 8b fix targets long, polite complaints — exactly the reviews whose actual complaint arrives after the 512-character preface.
Sensitivity test (Phase 8c, v3_robustness_phase8c.py): re-run 8b without the character truncation. Row-by-row label agreement = 94.71% (80 of 1,512 disagree). Zero disagreement on reviews ≤512 chars; 28.9% disagreement on reviews >512 chars — exactly what theory predicts. Dominant disagreement flow is anger↔sadness (within-negative swaps), which does not affect the positive/negative binary the downstream analysis relies on.
We re-classified the 1,512 audited rows with j-hartmann/emotion-english-distilroberta-base (7-class, multi-domain: Reddit + reviews + self-reports + TV dialogue — deliberately NOT a Twitter model). Labels mapped: disgust→anger, neutral kept separate. On 46.2% of touched rows the indie model agrees with "this is a negative emotion" (backing the flip direction); on 15.4% it says positive (backing the original label). Ratio: 3.7× more flips agreed than originals agreed. Exact-emotion agreement is only 19.7% — the two classifiers disagree on which negative emotion, but they agree on the sign. That's enough for our downstream use. See VALIDATION_08B.md.
The 8b rule only targets low-star reviews tagged positive. It does NOT touch high-star reviews the rubric model tagged negative. That's 8.54% of high-star reviews (1,628 of 19,053 score≥4 reviews tagged anger/sadness/fear/disgust). Some of those are genuine "1-star polite" being offset by "5-star sarcastically negative", some are real classifier error in the opposite direction. We flag this as the methodology's explicit error floor and do not correct it — the "score guides the re-rank" rule wouldn't have a defensible direction on 5-star anger (is the complaint real, or is the rating sarcastic?).
"Not happy and delighted with this experience"
Model matched "delighted", missed the "Not happy" negation.
"I'm looking forward to training again"
Positive sign-off after a 1-star complaint about an unfair £15 joining fee.
"Gym for all — joke. Happy to take your money..."
"Happy" appears literally; syntax inverts it. Twitter model reads the token, not the rhetoric.
Script: v3_09_post_nlp_correlations.py
The heatmap Phase 4 could not produce — because we now have emotion labels. Combine emotion × time × score × platform × length. Correlations below are on untouched rows only (excludes the 1,512 rows adjusted in Phase 8b), which keeps the claims honest.
Script: v3_10_gensim_lda.py | Rubric section: "Using Gensim"
Foundational topic model from Week 3.3.3. 10 topics, probabilistic (bag-of-words over preprocessed text). pyLDAvis interactive distance map now working.
gensim.models.Phrases) + dictionary + corpus. Fed to LDA.LdaModel, num_topics=10, 20 passes, random_state=42. Coherence c_v = 0.449.prepare(); resolved by lowering the input dataset size and using mds='mmds'. Matplotlib fallbacks retained for static viewing.
LDA gave us broader themes (10 vs BERTopic's 191) and soft document assignments. It also surfaced the "Danish topic" and "German topic" clusters in V1 — LDA's bag-of-words representation is sensitive to language-specific vocabulary, which is why it separated them. BERTopic would not isolate these as discrete topics because multilingual embedding similarity to the English cluster centroids was low enough for those reviews to land as outliers rather than their own cluster.
LDA strength: probabilistic, interpretable, language-sensitive, broad themes. BERTopic strength: semantic at phrase level, density-based clustering, no fixed topic count, deterministic with seed.
They answer different questions. LDA tells you "this review is 40% equipment, 30% staff". BERTopic tells you "this review is specifically about showers being cold after a workout, which is adjacent to showers being dirty". Use both.
Reasonable for a 10-topic gym-review model but not high. Could be pushed by (a) cleaner preprocessing (trigrams, named-entity filtering), (b) more topics, (c) tuning alpha/eta. We did not chase coherence because LDA is the minimum-rubric model here, not the primary analytical engine — BERTopic does the heavy lifting. Honesty over gaming-the-metric.
Notebook: notebook_01_falcon7b.ipynb (Colab, T4 GPU) | Rubric section: "Using a large language model from Hugging Face"
tiiuae/falcon-7b-instruct loaded on Colab. Two prompting passes: first to extract top-3 topics per review, then to synthesise actionable recommendations. BERTopic re-run on the Falcon-generated topic list as a meta-clustering step.
pipeline("text-generation", ..., max_length=1000).output/falcon_results.csv.output/falcon_meta_clusters.html.
600 of 5,825 negative reviews (10.3%). The rubric explicitly permits subsetting on compute grounds, and we document the choice, but a 600-review sample will miss low-frequency issues. Second: Falcon-7b-instruct is a 2023 model prone to hallucinating plausible-sounding topics not present in the source. Hand-check of 50 outputs found ~5% noise. Both acknowledged in the report.
BERT-family encoders classify or cluster. Falcon is a decoder model — it generates. For a topic-labelling task, generation produces human-readable phrases ("charged after cancelling my membership") rather than bag-of-words ("charge cancel member"). That readability is the deliverable. We use the right architecture for the right task, not the most powerful model for every task.
Script: v3_12_actionable_findings.py
OIR chains (Observation → Implication → Recommendation) for each major theme. Severity is human-calibrated on business cost (churn, legal, capex). Frequency is BERTopic document count on the 18,585-review subset. This page feeds report items #45 ("conclusions supported by data") and #48 ("final insights").
We deliberately keep severity as a human-assigned 1–10 on business cost, not a model output. A model that scored severity would either (a) proxy frequency (obvious cleaning=9) or (b) proxy anger confidence (which inflates generic venting). Neither maps to "what would it cost us if we ignored this". Editorial judgement is the right tool, disclosed as such.
Scripts: v3_contextual_analysis.py, v3_postcode_weather_analysis.py, v3_robustness_oster.py, v3_panel_iterations.py | Status: BEYOND rubric
Everything from this page onward is beyond rubric. Included because it stress-tests the claims made in the main pipeline (Oster bound on the headline correlation) and provides context (weather, income, deprivation, music) that turns descriptive findings into actionable hypotheses.
V3 is more defensible after the panel stress-tests than before. Both load-bearing claims survived their sensitivity tests in the same direction: the music ↔ negativity correlation withstands an Oster bound at default assumptions (δ* ≈ 1.9), and the emotion fix is stable under a truncation sensitivity test (94.7% label agreement on a re-run without the bug). Neither test is a proof — Oster is observational, Phase 8c removes one bug in isolation — but both were pre-specified as panel residual concerns and both returned clean.