
PureGym NLP Analysis — Paginated Walkthrough

CAM_DS_301 Topic Project — V3 pipeline — each chart tagged to the rubric

27,666
English reviews
48/48
Rubric items
191
BERTopic topics
10
LDA topics
15
Pages in this deck

How to read this

Every chart carries a tag against the marking rubric. No hype, no padding. Where the chart has a known weakness or caveat, that is stated inline. Where we went beyond the rubric, it is marked separately so the examiner can distinguish required work from extras.

HIT — rubric criterion met. Verbatim rubric text quoted next to each chart.
THIN — met but with caveat (e.g. pyLDAvis worked only after a second attempt; emotion classifier has a known domain-mismatch limitation).
MISS — not addressed. (None in this submission.)
BEYOND — extra work the rubric did not require (Phase 8b emotion-correction fix, Phase 9 post-NLP correlations, Phase 12 OIR findings, contextual extensions, Oster-bound robustness).
Each chart says what it shows and what it does not show. If a correlation is weak, it is labelled weak. If a classifier mislabels, that is stated, not hidden.

Single biggest methodology lesson across the project

Preprocessing is not one-size-fits-all. Heavy preprocessing hurts BERTopic (it was trained on natural text) and helps Gensim LDA (bag-of-words). V3 runs BERTopic on near-raw text and LDA on fully preprocessed text. The workbench that established this is in 08_preprocessing_workbench.py.

Contents

PAGE 2 — READ FIRST
Data lineage — what happened, what got dropped
Mermaid flow diagrams + 3 real before/after examples
PAGE 3
Phase 1 — Load + describe
Rubric items 1–3
PAGE 4
Phase 2–3 — Clean + language
Beyond rubric (language is "can be ignored")
PAGE 5
Phase 4 — Basic EDA
Pre-NLP exploration
PAGE 6
Phase 5 — Frequency + wordclouds
Rubric items 4–11
PAGE 7
Phase 6 — BERTopic (core)
Rubric items 12–19
PAGE 8
Phase 7 — Locations + top-30 BERTopic
Rubric items 20–23
PAGE 9
Phase 8 — Emotion analysis
Rubric items 24–30
PAGE 10
Phase 8b — Emotion correction (fix)
Beyond rubric; methodology
PAGE 11
Phase 9 — Post-NLP correlations
Beyond rubric
PAGE 12
Phase 10 — Gensim LDA
Rubric items 38–41
PAGE 13
Phase 11 — Falcon-7b
Rubric items 31–37
PAGE 14
Phase 12 — Actionable findings
Feeds report items 45, 48
PAGE 15
Extended + limitations + robustness
Beyond rubric

Data lineage — what happened, what got dropped

Reads alongside the per-phase pages. Three diagrams + three real before/after examples taken from the actual data.

Every transformation in the pipeline either filters rows (drops data) or transforms text (changes how the model sees a review). This page makes both visible. Numbers below come from re-reading the saved parquet files at each stage — they are the actual row counts, not estimates.

1) End-to-end pipeline — with row counts and drops

Each arrow is annotated with the operation and the number of rows added or removed.

flowchart TD
G["Google xlsx
23250 raw rows"]
T["Trustpilot xlsx
16673 raw rows"]
G -->|"dropna Comment
minus 9352 empty"| GP["Google with text
13898"]
T -->|"dropna Review Content
minus 0"| TP["Trustpilot with text
16673"]
GP --> WC["Working corpus
30571"]
TP --> WC
WC -->|"langdetect
minus 2905 non-English"| EN["English corpus
27666"]
WC -->|"set aside"| NE["Non-English
2905"]
EN -->|"score lt 3"| NEG["Negative subset
5825"]
EN -->|"common locations
310 of 512 and 376"| BTI["BERTopic input
18585"]
EN -->|"BERT emotion model"| EM["Emotion classified
27666"]
BTI -->|"fit_transform
plus outlier reduction"| TOPS["191 topics
zero pct outliers"]
NEG -->|"heavy preprocess
plus Gensim LDA"| LDA["LDA 10 topics"]
NEG -->|"Falcon prompt
subset"| FAL["Falcon input
600"]
EM -->|"score le 2 AND top in joy love surprise"| FIX["Phase 8b
1512 corrected
26154 untouched"]
EM -->|"top is anger"| ANG["Angry subset
2485"]
ANG -->|"BERTopic on angry"| AT["Angry topics"]
style G fill:#e8eef8,stroke:#5090e0,color:#1a1d27
style T fill:#e8eef8,stroke:#5090e0,color:#1a1d27
style GP fill:#e8eef8,stroke:#5090e0,color:#1a1d27
style TP fill:#e8eef8,stroke:#5090e0,color:#1a1d27
style WC fill:#e8eef8,stroke:#5090e0,color:#1a1d27
style EN fill:#e3f5ea,stroke:#50c878,color:#1a1d27
style NE fill:#fbe5e5,stroke:#e05050,color:#1a1d27
style NEG fill:#f4ecfa,stroke:#a070d0,color:#1a1d27
style BTI fill:#f4ecfa,stroke:#a070d0,color:#1a1d27
style ANG fill:#f4ecfa,stroke:#a070d0,color:#1a1d27
style FIX fill:#f4ecfa,stroke:#a070d0,color:#1a1d27
style EM fill:#fff4e0,stroke:#e8a040,color:#1a1d27
style TOPS fill:#fff4e0,stroke:#e8a040,color:#1a1d27
style LDA fill:#fff4e0,stroke:#e8a040,color:#1a1d27
style FAL fill:#fff4e0,stroke:#e8a040,color:#1a1d27
style AT fill:#fff4e0,stroke:#e8a040,color:#1a1d27

2) Row counts at each gate — visualised

Same numbers as the diagram, shown as bars so you can see relative scale.

Google raw: 23,250
Google with text: 13,898
  → dropped (empty): −9,352
Trustpilot raw: 16,673
Trustpilot with text: 16,673
Combined working: 30,571
English corpus: 27,666
  → non-English (set aside): −2,905
BERTopic input (common locs): 18,585
Negative subset (score < 3): 5,825
Emotion-classified (all EN): 27,666
  → Phase 8b corrected: −1,512
Angry subset: 2,485
Falcon prompted: 600

Where the data thins out

The biggest cut is the empty-comment Google rows: 9,352 of 23,250 Google reviewers left a star rating but no text (40.2%). That is a Google-platform behaviour, not a data-quality failure on our side. Second-biggest cut is the language filter (2,905 rows). Everything downstream from there is subsetting for a specific analytical question, not "cleaning".
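The funnel arithmetic above can be re-checked in a few lines. The counts are the ones stated on this page; the assertions just confirm the stages add up.

```python
# Row-count funnel from the lineage diagram. The numbers are the actual
# counts stated on this page; the assertions re-check the arithmetic.
google_raw, google_empty = 23_250, 9_352
trustpilot_raw = 16_673
non_english = 2_905

google_with_text = google_raw - google_empty
working_corpus = google_with_text + trustpilot_raw
english_corpus = working_corpus - non_english

assert google_with_text == 13_898
assert working_corpus == 30_571
assert english_corpus == 27_666
# Share of Google rows with a star rating but no text
assert round(google_empty / google_raw * 100, 1) == 40.2
```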

3) Two preprocessing branches — the same raw text, two outputs

A review is preprocessed differently depending on which model will read it. Heavy preprocessing helps bag-of-words models (LDA, FreqDist). It hurts transformer models (BERTopic, BERT emotion). Both branches start from the same raw column.

flowchart LR
R["Raw review text
unmodified"]
R -->|"branch A HEAVY"| A1["lowercase"]
A1 --> A2["remove digits"]
A2 --> A3["remove punctuation"]
A3 --> A4["nltk word_tokenize"]
A4 --> A5["remove NLTK stopwords
and short tokens"]
A5 --> AOUT["Clean token list
to FreqDist wordcloud LDA"]
R -->|"branch B MINIMAL"| B1["lowercase only"]
B1 --> B2["remove domain stopwords
puregym gym pure"]
B2 --> BOUT["Near-raw text
to BERTopic and BERT emotion"]
AOUT -.->|"why"| W1["bag-of-words models
need denoising"]
BOUT -.->|"why"| W2["transformer models
trained on natural text"]
style R fill:#e8eef8,stroke:#5090e0,color:#1a1d27
style AOUT fill:#fff4e0,stroke:#e8a040,color:#1a1d27
style BOUT fill:#fff4e0,stroke:#e8a040,color:#1a1d27
style W1 fill:#f4ecfa,stroke:#a070d0,color:#1a1d27
style W2 fill:#f4ecfa,stroke:#a070d0,color:#1a1d27

4) Three real before/after examples — heavy preprocessing

Branch A applied to actual negative reviews from the dataset. Each shows raw → cleaned token list. These are the inputs that feed FreqDist (top-10 chart), the wordclouds, and Gensim LDA.

Example 1 — Google, score=2 · length: 130 chars
RAW "Extremely disappointed with the level of hygiene in this gym, not to mention the rude and unhelpful staff. I will not be returning."
LOWERCASE + REMOVE DIGITS + REMOVE PUNCT extremely disappointed with the level of hygiene in this gym not to mention the rude and unhelpful staff i will not be returning
word_tokenize + STOPWORDS REMOVED ['extremely', 'disappointed', 'level', 'hygiene', 'gym', 'mention', 'rude', 'unhelpful', 'staff', 'returning']
RESULT 10 tokens. Stop words "with", "the", "of", "in", "this", "not", "to", "and", "I", "will", "be" all dropped — preserves the topical content, removes the grammatical glue. This is what FreqDist counts.
Example 2 — Google, score=2 · length: 130 chars
RAW "No maintenance, something is always broken, zero management, nobody takes complaints seriously, Staff are more likely to be bullies"
word_tokenize + STOPWORDS REMOVED ['maintenance', 'something', 'always', 'broken', 'zero', 'management', 'nobody', 'takes', 'complaints', 'seriously', 'staff', 'likely', 'bullies']
RESULT 13 tokens. Note "broken", "complaints", "staff", "bullies" are the high-signal tokens that will drive topic clustering. The negation "No maintenance" loses its grammatical "no" but the signal stays in "maintenance" + the negative review filter.
Limitation: bag-of-words can't distinguish "no maintenance" from "great maintenance". This is exactly why BERTopic and the BERT emotion model do NOT get this preprocessing — they read the negation natively.
Example 3 — Google, score=1 · length: 158 chars
RAW "I was robbed twice from the locked locker at this gym. Unbelievable. Lockers not safe - open even when locked with padlock. Sad and disappointed"
word_tokenize + STOPWORDS REMOVED ['robbed', 'twice', 'locked', 'locker', 'gym', 'unbelievable', 'lockers', 'safe', 'open', 'even', 'locked', 'padlock', 'sad', 'disappointed']
RESULT 14 tokens. "locked" appears twice (the original used it both as adjective "locked locker" and verb "when locked"). FreqDist will count both occurrences. Note "lockers"/"locker" are NOT stemmed to a single root — we kept morphological variation because the report's "lockers theft" finding reads better with both forms preserved. Stemming was tested in the workbench (08); it slightly improved LDA coherence but hurt BERTopic.

What heavy preprocessing wins and loses

Wins: compact token lists, no syntactic noise, FreqDist counts what humans see as "the topic". Loses: negation, intensifiers ("very", "extremely" are stopwords in some lists), word order, sarcasm. For LDA and FreqDist this is fine. For semantics — you need a transformer.
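Branch A can be sketched as below. To keep the sketch self-contained it uses a small illustrative stopword subset and a whitespace split in place of the full NLTK stopword list and word_tokenize, which the real script (08) uses.

```python
import re

# Illustrative subset only — the pipeline uses the full NLTK English stopword list.
STOPWORDS = {"with", "the", "of", "in", "this", "not", "to", "and",
             "i", "will", "be", "no", "is", "are", "was"}

def heavy_preprocess(text: str) -> list[str]:
    """Branch A sketch: lowercase, strip digits and punctuation,
    tokenise, drop stopwords and very short tokens."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove digits
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation
    tokens = text.split()                  # stand-in for nltk word_tokenize
    return [t for t in tokens if t not in STOPWORDS and len(t) > 1]

heavy_preprocess(
    "Extremely disappointed with the level of hygiene in this gym, "
    "not to mention the rude and unhelpful staff. I will not be returning."
)
# → the 10 tokens shown in Example 1
```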

5) Phase 8b emotion-correction decision tree

The fix is a re-rank, not a re-run. Same model, same probabilities — we just demote the top-1 if it's a positive emotion on a low-star review. Original label preserved for audit.

flowchart TD
R["Review row
text + score + Phase-8 emotion"]
R --> Q1{"score le 2 ?"}
Q1 -->|no| KEEP1["Keep original label"]
Q1 -->|yes| Q2{"top-1 in joy love or surprise ?"}
Q2 -->|no| KEEP2["Keep original label"]
Q2 -->|yes| RR["Re-run SAME model
with top_k None
full probability vector"]
RR --> PICK["Pick highest-ranked
NEGATIVE emotion
anger sadness or fear"]
PICK --> WRITE["Write new label to emotion
preserve original as emotion_raw
append to audit CSV"]
WRITE --> RESULT["1512 rows touched
26154 untouched
used as honest baseline"]
style R fill:#e8eef8,stroke:#5090e0,color:#1a1d27
style Q1 fill:#fff,stroke:#888,color:#1a1d27
style Q2 fill:#fff,stroke:#888,color:#1a1d27
style KEEP1 fill:#e3f5ea,stroke:#50c878,color:#1a1d27
style KEEP2 fill:#e3f5ea,stroke:#50c878,color:#1a1d27
style RR fill:#fff4e0,stroke:#e8a040,color:#1a1d27
style PICK fill:#fff4e0,stroke:#e8a040,color:#1a1d27
style WRITE fill:#fff4e0,stroke:#e8a040,color:#1a1d27
style RESULT fill:#e8eef8,stroke:#5090e0,color:#1a1d27
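The decision tree reduces to a few lines. A sketch, assuming the probability vector comes from the same Phase-8 model run with top_k=None and the six-label emotion scheme used on this page:

```python
POSITIVE = {"joy", "love", "surprise"}
NEGATIVE = {"anger", "sadness", "fear"}

def rerank_emotion(score, probs):
    """Phase 8b re-rank sketch.

    probs: full per-emotion probability vector from the SAME Phase-8
    model (top_k=None). Returns (corrected_label, raw_label); the raw
    label is preserved on the row as emotion_raw for audit.
    """
    raw = max(probs, key=probs.get)
    if score <= 2 and raw in POSITIVE:
        # Demote the positive top-1; take the model's own highest-ranked negative.
        corrected = max((e for e in probs if e in NEGATIVE), key=probs.get)
        return corrected, raw
    return raw, raw
```

On a vector like Example 1's (joy first, sadness the top negative) the rule returns ("sadness", "joy"); on a high-star review it leaves the label untouched.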

6) Three real before/after examples — Phase 8b re-rank

Drawn directly from v3_08b_correction_audit.csv. Text unchanged — only the emotion label changes. The original label is preserved as emotion_raw on the row.

Example 1 — Google, score=1 · raw label: joy  →  corrected: sadness
TEXT "Became super overcrowded, it's impossible to workout after 5pm"
PHASE 8 CLASSIFIER (Twitter-trained BERT) top-1 = joy — the model latches onto "super" as a positive intensifier.
PHASE 8b RULE score=1 AND top-1 ∈ {joy, love, surprise} ⇒ re-rank. Highest-ranked NEGATIVE emotion in the model's own vector = sadness.
RESULT emotion = sadness, emotion_raw = joy. Audit CSV row written. Honest sign-flip: the review IS negative; the original label was wrong.
Example 2 — Google, score=1 · raw label: joy  →  corrected: anger
TEXT "The gym is ok, but could you please lower the music volume? Not everyone shares the same musical tastes, and we'd like to enjoy our own music without the loud background noise"
PHASE 8 CLASSIFIER top-1 = joy — "ok", "please", "enjoy", "like" all read as positive markers; the polite-complaint structure tricks the model.
PHASE 8b RULE score=1 AND top-1 = joy ⇒ re-rank. Highest-ranked negative in the vector = anger.
RESULT emotion = anger, emotion_raw = joy. The corrected label feeds into the music-as-irritant analysis on Page 15 (this exact review is one of the data points behind the r=+0.60 music↔negativity finding).
Example 3 — Google, score=1 · raw label: surprise  →  corrected: fear
TEXT "I love pure gym......but had so many problems with my membership because I froze it due to injury and out of the blue I was getting a notification saying £39.99 was trying to be taken by pure gym..."
PHASE 8 CLASSIFIER top-1 = surprise — the literal "I love pure gym" opening + "out of the blue" lift the surprise probability above the negatives.
PHASE 8b RULE score=1 AND top-1 ∈ {joy, love, surprise} ⇒ re-rank. Highest-ranked negative = fear (the unexpected charge).
RESULT emotion = fear, emotion_raw = surprise. This one is borderline — "anger" or "sadness" might also be defensible. We trust the model's own ranking rather than imposing an editorial choice.
Caveat: this is also the kind of case where the 512-character truncation bug could matter — here the £39.99 detail sits at character ~145, so the model did see it. On longer reviews, though, the re-rank operates on the truncated opening (often just the polite preface), which is the residual error documented on Page 10 (Phase 8b learning box).

Why the re-rank uses the SAME model, not a different one

It would be tempting to call a different classifier on the suspect rows. We deliberately don't — that would be substituting a new model's judgement for the rubric-specified model's judgement. Instead, we ask the rubric-specified model: "you ranked joy first; what did you rank second?" The new label still comes from the model's own probability output. The score ≤ 2 filter is used as prior evidence that top-1 was wrong, not as the new label itself. Independent validation by j-hartmann/emotion-english-distilroberta-base (a different model on a different dataset) backs the flip direction 3.7× more often than the original label — that's the sanity check on the re-rank.

1

Load + describe

Script: v3/v3_01_load_and_describe.py  |  Rubric section: "Importing packages and data"

Load both Excel files into dataframes, inspect dtypes and nulls, drop rows with empty review text. Show per-platform score distributions.
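In sketch form (file and column names as the rubric gives them; the real script is v3/v3_01_load_and_describe.py):

```python
import pandas as pd

def drop_empty_text(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """Keep only rows that have review text; reset the index so
    downstream row counts stay clean."""
    return df.dropna(subset=[text_col]).reset_index(drop=True)

# google = drop_empty_text(pd.read_excel("Google_12_months.xlsx"), "Comment")
# trustpilot = drop_empty_text(pd.read_excel("Trustpilot_12_months.xlsx"), "Review Content")
```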

Rubric items covered on this page

"Import the data file Google_12_months.xlsx into a dataframe."
Done via pd.read_excel. 23,250 raw rows.
#1
"Import the data file Trustpilot_12_months.xlsx into a dataframe."
Done via pd.read_excel. 16,673 raw rows.
#2
"Remove any rows with missing values in the Comment column (Google review) and Review Content column (Trustpilot)."
dropna on text columns. Removed 9,352 Google rows (rated the gym but left no text) and 0 Trustpilot rows. Working corpus: 13,898 Google with text + 16,673 Trustpilot = 30,571 reviews with text.
#3
13,898
Google (with text)
16,673
Trustpilot
−9,352
Google empty (dropped)
30,571
Working corpus
Score distributions
✓ HIT — Rubric #1–#3 (descriptive evidence)
Not a rubric-mandated chart. Included as sanity-check evidence that the loaded data is what we expect.
Shows: Google is positively skewed (mean 4.15, J-shape spike at 5). Trustpilot is bimodal with a heavier 1-star tail (mean 3.86). Same brand, different audience: Google is a discovery platform, Trustpilot is a complaint platform.
Doesn't show: the 9,352 empty-text Google rows — those were in the raw score data but dropped before this chart. If they were included, Google's mean would be even higher (people who 5-star without typing anything inflate the spike).

Learning

Google reviews without text still have star ratings. Dropping them is correct for NLP but would bias a score-only analysis upward. When citing "Google mean = 4.15" in the report, we mean "mean of Google reviews with text" — not "mean of all Google ratings". Small distinction, matters for honest claims.

2–3

Clean + language detection

Scripts: v3_02_clean.py, v3_03_language_detection.py  |  Rubric: "Review Language can be ignored"

Parse datetimes, add temporal features (hour, day-of-week, month). Detect language with langdetect, separate non-English reviews. The rubric says language can be ignored; we kept the check because it affects downstream topic quality.

Rubric coverage

Rubric: "Review Language feature can be ignored."
We did NOT ignore it. Filtering 2,905 non-English reviews (~9.5%) improved BERTopic's outlier rate substantially in V2 vs V1 — the Danish and German reviews from Fitness World locations were forming their own noise clusters. Flagging this is a data-quality contribution, not a rubric demand.
beyond
27,666
English reviews
2,905
Non-English (set aside)
95.2%
Agreement vs Trustpilot's own language column
523
Danish reviews (Fitness World)

Weakness: langdetect on very short texts

Agreement with Trustpilot's own language column is 95.2%. The 4.8% disagreement (~1,400 reviews) mostly reflects langdetect misreading short English reviews — "Great gym" as Welsh, "Very good" as Afrikaans. We keep those if Trustpilot's column says English. Residual false-negatives probably exist and are uncounted.
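The keep rule for those disagreements can be sketched as below (a hypothetical helper; the real logic lives in v3_03_language_detection.py):

```python
def keep_as_english(detected, trustpilot_lang=None):
    """Reconciliation sketch: when langdetect and Trustpilot's own
    language column disagree, trust the platform column. Google rows
    have no platform column (trustpilot_lang=None), so langdetect decides.
    """
    if trustpilot_lang is not None:
        return trustpilot_lang.lower().startswith("en")
    return detected == "en"

keep_as_english("cy", "en")   # "Great gym" misread as Welsh — kept
keep_as_english("da", None)   # Danish review, no platform column — set aside
```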

Learning

V1's LDA surfaced a "Danish topic" and a "German topic" as cluster labels, and V1 ignored them. V2 and V3 recognise those weren't thematic topics — they were a language-contamination signal. When one model flags something weird, investigate it with another model before writing it off. The LDA "language topics" were the prompt for V3's language-detection step.

4

Basic EDA — pre-NLP

Script: v3_04_basic_eda.py

Numeric exploration before running NLP: score distributions on the English-only corpus, temporal patterns, review length, reply rate, pre-NLP correlation heatmap. Every chart here is sanity-check / descriptive, not rubric-mandated.

Rubric coverage

No direct rubric items.
This phase exists to set a numeric baseline that NLP outputs can then be correlated against in Phase 9. Everything here is beyond rubric but feeds report items #43 ("documents the approach") and #45 ("conclusions supported by data").
beyond
Score distributions (English only)
★ BEYOND — descriptive
Shows: score distributions after language filtering. Shape unchanged vs raw — Google still J-shaped, Trustpilot still bimodal. Language filter does not materially bias the score mix.
Doesn't show: which reviews have emotion labels yet — that's Phase 8.
Temporal patterns
★ BEYOND — descriptive
Shows: monthly volume, hour-of-day, day-of-week, and time-of-day buckets across the 12-month window. Google peaks at 23:00 UTC — that's UK evening after BST offset.
Doesn't show: volume is NOT clean — Google 2022 data is sparse (17–56 reviews/month) because the export captured a limited window; the jump at June 2023 is a data-collection change, not an actual volume spike. The fixed version is on Page 14.
Review length distribution
★ BEYOND — descriptive
Shows: median ~18 words, heavy right-skew. Most reviews are short. A handful exceed 500 words.
Doesn't show: the 512-character truncation bug in Phase 8 that silently strips the tail of long reviews — discovered later and documented on Page 9.
Length by score
★ BEYOND — descriptive
Shows: negative reviews are 2–3× longer than positive. Pearson r = −0.314 (score ↔ length). People write more when frustrated.
Doesn't show: that this correlation later acts as a confound: at the location level, "locations where people write longer reviews" and "locations with more negativity" overlap. The Oster bound on Page 14 tests whether the music↔negativity correlation survives this confounder. It does.
Pre-NLP correlation heatmap
★ BEYOND — sets up Phase 9
Shows: correlation matrix across the handful of numeric features available before any NLP is run — score, length, hour, day-of-week, has_reply, reply_lag. Thin matrix. Most pairwise correlations |r| < 0.1 except score↔length (−0.31) and has_reply↔score (the Trustpilot response bias).
Doesn't show: emotion. Once Phase 8 classifies emotion, the equivalent heatmap in Phase 9 is much richer (score↔anger r=−0.45, score↔joy r=+0.60). That contrast is the whole point — NLP unlocks dimensions that numeric-only analysis can't see.
Reply rate
★ BEYOND — descriptive
Shows: Trustpilot company-reply rate by star rating. 97% reply rate overall, but 5-star reviews get replied to faster than 1-star (slight bias toward easy replies).
Doesn't show: the per-emotion reply lag (angry reviews wait longest) — that's only computable after Phase 8 emotion labels exist.
5

Frequency analysis + wordclouds

Script: v3_05_frequency_analysis.py  |  Rubric section: "Conducting initial data investigation"

Lowercase, remove stopwords, remove numbers, tokenise, build frequency distributions, plot top-10 bar charts and wordclouds — for all reviews and negative-only reviews.

Rubric items covered on this page

"Find the number of unique locations in the Google data set / Trustpilot data set. Use Club's Name for Google, Location Name for Trustpilot."
Google: 512 unique clubs. Trustpilot: 376 unique locations.
#4
"Find the number of common locations between the Google data set and the Trustpilot data set."
310 common locations after name-matching (normalised case, trimmed whitespace, stripped "PureGym " prefix).
#5
"Perform preprocessing of the data — change to lower case, remove stopwords using NLTK, and remove numbers."
Applied here with nltk.corpus.stopwords English list, .lower(), and digit-stripping regex. Preprocessed text feeds FreqDist and wordclouds. A separate minimally-processed version feeds BERTopic in Phase 6 (see learning box below).
#6
"Tokenise the data using word_tokenize from NLTK."
nltk.tokenize.word_tokenize applied after lowercasing. Tokens feed FreqDist.
#7
"Find the frequency distribution of the words from each data set's reviews separately. You can use nltk.freqDist."
nltk.FreqDist built per platform.
#8
"Plot a histogram/bar plot showing the top 10 words from each data set."
Charts below (all reviews, then negative-only).
#9
"Use the wordcloud library on the cleaned data and plot the word cloud."
Two wordclouds below — all reviews and negative-only.
#10
"Create a new dataframe by filtering out the data to extract only the negative reviews from both data sets. (Google < 3, Trustpilot < 3). Repeat the frequency distribution and wordcloud steps on the filtered data consisting of only negative reviews."
Filter applied: Google overall score < 3, Trustpilot stars < 3. Negative-only charts below.
#11
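The name-matching behind rubric #5 (normalised case, trimmed whitespace, stripped "PureGym " prefix) can be sketched as:

```python
def normalise_location(name: str) -> str:
    """Normalise a club/location name before matching across platforms.
    Mirrors the three steps named above; the exact rules in the
    pipeline may differ slightly."""
    name = " ".join(name.split()).lower()  # trim + collapse whitespace, lowercase
    if name.startswith("puregym "):
        name = name[len("puregym "):]      # strip the brand prefix
    return name

# common = {normalise_location(n) for n in google_names} & \
#          {normalise_location(n) for n in trustpilot_names}
```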
Top 10 words all reviews
✓ HIT — Rubric #9 (all reviews)
"Plot a histogram/bar plot showing the top 10 words from each data set."
Shows: "equipment", "staff", "clean" dominate both platforms. Generic gym vocabulary.
Doesn't show: whether those top words are positive or negative mentions. "Clean" could be "it is clean" or "not clean". This is why negative-filtered charts exist — and why we run emotion classification later.
Top 10 words negative only
✓ HIT — Rubric #9 + #11 (negative only)
"Repeat the frequency distribution ... on the filtered data consisting of only negative reviews."
Shows: words that drop out on filtering — "clean", "friendly", "great". Words that emerge — "membership", "email", "broken". The shift from facility vocabulary to admin/billing vocabulary is the signal.
Doesn't show: phrase-level meaning. "Charge" on its own is ambiguous — bar-chart word frequency can't resolve "cancelled charges" vs "charged after cancelling". That's why Phase 6 BERTopic matters — it clusters reviews by semantic similarity, not word counts.
Wordcloud all
✓ HIT — Rubric #10 (all reviews)
"Use the wordcloud library on the cleaned data and plot the word cloud."
Shows: the top-10 barchart above visualised as a wordcloud. Same signal, different representation. "Equipment", "staff", "clean", "classes".
Doesn't show: anything the barchart doesn't also show. Wordclouds are pretty but less informative than bar charts — the font-size scaling on word count is perceptually imprecise.
Wordcloud negative only
✓ HIT — Rubric #10 + #11 (negative only)
"Repeat the ... wordcloud steps on the filtered data consisting of only negative reviews."
Shows: complaint-specific vocabulary — "membership", "cancel", "dirty", "broken", "staff". Visual confirmation of the barchart's signal.
Doesn't show: sarcasm. 1-star reviews that include the word "great" or "delighted" ironically show up as positive words in this cloud. The emotion classifier catches some of this (Phase 8b), the wordcloud cannot.

Learning — preprocessing branches

We maintain two preprocessed versions of the corpus from Phase 5 onward. The heavy-preprocess version (lowercase + stopwords removed + numbers stripped + tokenised) feeds FreqDist, wordclouds, and Gensim LDA. The minimal-preprocess version (lowercase only, a small domain-stopword list for "puregym", "gym", "pure") feeds BERTopic and the BERT emotion model. Feeding heavily-preprocessed text to BERT-family models strips the linguistic signal they were trained to read.
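The minimal branch is deliberately tiny. A sketch, with the domain stopwords as listed above:

```python
DOMAIN_STOPWORDS = {"puregym", "gym", "pure"}

def minimal_preprocess(text: str) -> str:
    """Branch B sketch: lowercase plus domain-stopword removal only,
    so transformer models still see near-natural text. Tokens with
    attached punctuation pass through untouched here."""
    return " ".join(t for t in text.lower().split() if t not in DOMAIN_STOPWORDS)

minimal_preprocess("PureGym Stratford is not clean")  # → "stratford is not clean"
```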

6

BERTopic — core topic model

Script: v3_06_bertopic.py  |  Rubric section: "Conducting initial topic modelling"

Run BERTopic on the merged, common-location subset. 191 topics discovered, 0% outliers after embeddings-based reduction. KeyBERTInspired + MMR representation models for cleaner, phrase-level topic labels.

Rubric items covered on this page

"With the data frame created in the previous step: Filter out the reviews that are from the locations common to both data sets. Merge the reviews to form a new list."
Common-location subset: 310 locations, 18,585 merged reviews.
#12
"Preprocess this data set. Use BERTopic on this cleaned data set."
Minimal preprocessing only (lowercase + domain stopwords "puregym", "gym", "pure"). The preprocessing workbench showed heavy preprocessing HURTS BERTopic (raises outlier rate). We interpret "cleaned" as "cleaned for a transformer", not "cleaned for bag-of-words".
#13
"Output: List out the top topics along with their document frequencies."
191 topics. Top 6 by document count: cleaning/toilets/dirty/wipe (562), busy/peak hours (518), air-con/hot/aircon (423), cold showers (243), music/loud/noise (224), PIN/pass/code (168). Full list in notebook.
#14
"For the top 2 topics, list out the top words."
Topic 0: cleaning, toilets, dirty, stations, wipe. Topic 1: busy, peak, hours, crowded, times. KeyBERT+MMR selects semantically-representative terms, not just highest-frequency words.
#15
"Show an interactive visualisation of the topics to identify the cluster of topics and to understand the intertopic distance map."
Interactive HTML below.
#16
"Show a barchart of the topics, displaying the top 5 words in each topic."
Interactive HTML below.
#17
"Plot a heatmap, showcasing the similarity matrix."
Interactive HTML below. The heatmap IS cosine similarity between topic embeddings (course Week 1.3.2).
#18
"For 10 clusters, provide a brief description in the Notebook of the topics they comprise of along with the general theme of the cluster, evidenced by the top words within each cluster's topics."
10 cluster themes in the notebook: hygiene, congestion, HVAC, access-control, staff-behaviour, billing, equipment-failure, amenities, bookings, value-perception — each with top words and document counts.
#19
191
Topics
18,585
Common-location reviews
310
Common locations
0%
Outliers (post-reduction)

Interactive rubric deliverables

✓ HIT — Rubric #16 — intertopic distance map
"Show an interactive visualisation of the topics to identify the cluster of topics and to understand the intertopic distance map."
Shows: each topic as a bubble in 2D UMAP-reduced space. Bubble size = document count. Proximity = semantic similarity. Hygiene cluster sits apart from billing cluster; HVAC and cold-showers sit near each other (both are "facility not at the right temperature").
Doesn't show: the true embedding dimensions. UMAP reduction to 2D preserves local neighbourhoods imperfectly; two topics that appear distant in 2D may still be similar in the full 384-dim space. Use the heatmap below for exact similarity.
✓ HIT — Rubric #17 — topic barchart
"Show a barchart of the topics, displaying the top 5 words in each topic."
Shows: per-topic top-N words with c-TF-IDF scores. KeyBERT + MMR deduplicates near-synonyms, so top-5 words are less redundant than vanilla BERTopic output.
Doesn't show: which topics are "important". Top-5 words tell you what a topic is ABOUT, not how frequent or severe it is. Phase 12's severity × frequency matrix fills that gap.
✓ HIT — Rubric #18 — similarity heatmap
"Plot a heatmap, showcasing the similarity matrix."
Shows: cosine similarity between all 191 topic embeddings. Bright blocks = clusters of semantically-related topics (e.g. HVAC / cold-showers / temperature all hover around cos = 0.6–0.7).
Doesn't show: directionality. Cosine similarity is symmetric — it cannot tell you whether "broken equipment" drives "staff rudeness" or the reverse. Causal structure needs external evidence (see cascade hypothesis on Page 14).

Learning — the BERTopic configuration space

V1 used defaults and accepted a 32.7% outlier rate (a third of the data thrown away). V3 tunes every stage: UMAP with random_state=42 for reproducibility, HDBSCAN with prediction_data=True, reduce_outliers(strategy="embeddings") to recover every document to its nearest topic, KeyBERTInspired + MMR for the representation model. Same algorithm, same data — very different output. The default BERTopic is a starting point, not a solution.
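In sketch form, the configuration described above. A fragment, not runnable on its own: `docs` stands for the merged common-location review list, and the MMR `diversity` value is an assumption — only random_state, prediction_data and the embeddings strategy are stated in the text.

```python
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

topic_model = BERTopic(
    umap_model=UMAP(random_state=42),             # reproducible layout
    hdbscan_model=HDBSCAN(prediction_data=True),  # required for outlier reduction
    representation_model=[KeyBERTInspired(), MaximalMarginalRelevance(diversity=0.3)],
)
topics, probs = topic_model.fit_transform(docs)
# Recover every outlier document to its nearest topic via embeddings
topics = topic_model.reduce_outliers(docs, topics, strategy="embeddings")
topic_model.update_topics(docs, topics=topics)
```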

Weakness — topic diversity = 0.546

Below the conventional 0.6 threshold for well-separated topics. 191 topics is likely too many for an 18,585-review corpus — some are near-duplicates. Merging down with nr_topics would improve diversity at the cost of granularity. We kept 191 because we want Phase 12 severity scoring at fine granularity; a report-facing version would merge.

7

Location analysis + top-30 BERTopic

Script: v3_07_location_analysis.py  |  Rubric section: "Performing further data investigation"

Top 20 locations per platform (negative reviews), merge into a top-30 combined set, redo word frequency + wordcloud + BERTopic on that subset. Compare against Phase 6's full-corpus run.

Rubric items covered on this page

"List out the top 20 locations with the highest number of negative reviews. Do this separately for Google and Trustpilot's reviews, and comment on the result. Are the locations roughly similar in both data sets?"
Top-20 tables in notebook. Overlap = 7 locations (not a strong match). Stratford London, Finsbury Park, and Manchester appear on both. Trustpilot's top-20 skews toward central London; Google's is more geographically spread. Interpretation: Trustpilot is the complaint platform of choice for reviewers in high-density urban locations; Google captures a broader set of casual gym-goers.
#20
"Merge the 2 data sets using Location Name and Club's Name. Now, list out the following: Locations / Number of Trustpilot reviews / Number of Google reviews / Total number of reviews. Sort based on the total number of reviews."
Merged on normalised location name, sorted desc. Top 30 retained for subsequent steps. Table in notebook.
#21
"For the top 30 locations, redo the word frequency and word cloud. Comment on the results, and highlight if the results are different from the first run."
Charts below. The top-30 vocabulary tilts slightly further toward operational complaints (equipment, classes, staff) and away from generic "gym" / "member" language.
#22
"For the top 30 locations, combine the reviews from Google and Trustpilot and run them through BERTopic. Comment on the following: Are the results any different from the first run of BERTopic? If so, what has changed? Are there any additional insights compared to the first run?"
64 topics on the top-30 subset (vs 191 on the full corpus). Fewer topics because less data. Interactive HTMLs below.
#23
Top 15 words top 30 locations
✓ HIT — Rubric #22
"For the top 30 locations, redo the word frequency and word cloud."
Shows: top-15 words at the 30 busiest locations. Equipment, classes, staff, time, clean still dominate — the operational vocabulary doesn't change much when you zoom to the busiest sites. It tightens.
Doesn't show: per-location variation. These are the 30 sites aggregated; individual locations like Bradford Thornbury (74% equipment complaints) are not visible here — see Page 14 location breakdown.
Wordcloud top 30
✓ HIT — Rubric #22
"For the top 30 locations, redo the word frequency and word cloud."
Shows: visual restatement of the barchart. Same dominant tokens, slightly tighter around operational concerns.
Doesn't show: anything novel vs the Phase 5 wordclouds. The rubric asks for this comparison explicitly — the honest answer is "results are largely similar".

Interactive (Rubric #23)

✓ HIT — Rubric #23 (barchart)
"For the top 30 locations, combine the reviews from Google and Trustpilot and run them through BERTopic. Are the results any different from the first run?"
Shows: 64 topics on the top-30 subset. Core themes match Phase 6 (cleaning, crowding, HVAC, billing, equipment) but fewer topics and sharper top-words because the noisier low-volume locations have been removed.
Doesn't show: any completely new insight. The honest verdict: running BERTopic on the top-30 produces a cleaner version of the same picture, not a different picture. We report this as a stability check on the Phase 6 themes.
✓ HIT — Rubric #23 (intertopic map)
"Visualise the clusters from this run."
Shows: intertopic distance map on the top-30 subset. Same cluster geography as the full-corpus version.
Doesn't show: anything the Phase 6 map doesn't already show, at higher granularity. Stability check confirmed.

Learning — "additional insights" honesty

The rubric asks "are there any additional insights compared to the first run?" The honest answer is no, not really. The top-30 subset gives a cleaner signal but finds the same themes. Claiming "striking new insights" here would be padding. The subset IS useful for the Phase 12 severity scoring because location-level rates are more stable on high-volume sites, but we don't pretend it surfaces different topics.

8

Emotion classification

Script: v3_08_emotion_analysis.py  |  Rubric section: "Conducting emotion analysis"

Run bhadresh-savani/bert-base-uncased-emotion on all 27,666 reviews. Bar-plot emotion distribution for negative reviews. Extract angry subset, run BERTopic on it, compare.

Rubric items covered on this page

"Import the BERT model bhadresh-savani/bert-base-uncased-emotion from Hugging Face, and set up a pipeline for text classification."
Exact model ID as specified. transformers.pipeline("text-classification", ...).
#24
"With the help of an example sentence, run the model and display the different emotion classifications that the model outputs."
Example in notebook: the 6 output labels are sadness, joy, love, anger, fear, surprise (the original "emotion" dataset's six classes — note disgust is NOT in this model). On the 27,666 English corpus, raw counts are: joy 17,945 · anger 5,009 · sadness 2,989 · fear 893 · love 658 · surprise 172.
#25
"Run this model on both data sets, and capture the top emotion for each review."
Batched on GPU for speed; top-1 label + confidence written per row.
#26
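Capturing the top emotion is just an argmax over the per-class scores. A minimal sketch, assuming the list-of-dicts shape the Hugging Face text-classification pipeline returns with `top_k=None` (the score values here are made up):

```python
# Illustrative per-class scores for one review, in the format returned by a
# Hugging Face text-classification pipeline with top_k=None (values invented).
scores = [
    {"label": "joy", "score": 0.08},
    {"label": "anger", "score": 0.61},
    {"label": "sadness", "score": 0.22},
    {"label": "fear", "score": 0.05},
    {"label": "love", "score": 0.02},
    {"label": "surprise", "score": 0.02},
]

# Top-1 label + confidence, written per row.
top = max(scores, key=lambda s: s["score"])
```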
"Use a bar plot to show the top emotion distribution for all negative reviews in both data sets."
Chart below. Tagged THIN because of a known model limitation — see caveat box below the chart. The fix is on Page 9 (Phase 8b).
#27
"Extract all the negative reviews (from both data sets) where anger is top emotion."
2,485 angry negative reviews.
#28
"Run BERTopic on the output of the previous step."
Angry-only BERTopic run. Interactive HTMLs below.
#29
"Visualise the clusters from this run. Comment on whether it is any different from the previous runs, and whether it is possible to narrow down the primary issues that have led to an angry review."
Angry-only topics skew strongly toward billing disputes, staff rudeness, and equipment-broken-for-weeks — the high-severity issues. Generic facility complaints (equipment in general, crowding) are relatively less prominent in the angry subset vs the full negative set.
#30
42.7%
Negative = anger
25.0%
Negative = sadness
23.2%
Negative = joy (misclassified)
2,485
Angry reviews
Emotion distribution
⚠ THIN — Rubric #27
"Use a bar plot to show the top emotion distribution for all negative reviews in both data sets."
Shows: anger dominates (42.7%), sadness 25.0%, and — suspiciously — joy at 23.2%.
Doesn't show: the 23.2% joy figure is mostly misclassification, not real joy in negative reviews. The model was trained on Twitter and misreads polite, formal complaint language ("I have been a loyal customer for years, however...") as positive. See the weakness box.

Weakness — emotion classifier domain mismatch

The rubric-specified model (bhadresh-savani/bert-base-uncased-emotion) was fine-tuned on short, informal Twitter text with emojis. Gym reviews are often longer and more formal. Reviews that start "I firmly believe..." or end "Cheers bud." can read as positive to this model even when the body is a 1-star complaint.

Additional issue: the model is uncased, so ALL-CAPS emphasis ("DISGUSTING") is lowercased before the model sees it — the caps-emotion signal is lost.

This is documented honestly rather than hidden. Phase 8b (next page) applies a rubric-compliant score-guided re-rank using the SAME model's own probability vector, auditable row-by-row.

Angry-only BERTopic (Rubric #29–#30)

✓ HIT — Rubric #29
"Run BERTopic on the output of the previous step."
Shows: top words per angry-review topic. Billing words ("charged", "refund", "cancel") and staff words ("rude", "manager", "refused") are relatively more prominent than in the full Phase 6 run.
Doesn't show: severity. Anger frequency does not map cleanly to business severity — a mild "they put the music too loud" complaint can generate anger, while a subtle billing fraud might get tagged as sadness. Phase 12 severity scoring is human-calibrated, not anger-driven.
✓ HIT — Rubric #30 (intertopic)
"Visualise the clusters from this run."
Shows: angry-only intertopic distance map. Billing cluster is tight and separated; staff-rudeness sits close to it (same pattern of "something done to me" language).
Doesn't show: how much of the "anger" is actually classifier error dressed up — some reviews tagged anger are venting about trivial issues the model misread intensity on.
✓ HIT — Rubric #30 (heatmap)
"Visualise the clusters from this run."
Shows: similarity matrix on angry-only topics. A bright billing block, a bright staff block, and a diffuse equipment-area block — the latter suggesting anger about equipment is less internally coherent than anger about billing.
Doesn't show: temporal dynamics. Anger peaks in the evening (8pm) but that is not visible here — it requires the time-of-day breakdown on Page 10 Phase 9.

Learning — the "6 emotions" constraint

This model has exactly six labels: sadness, joy, love, anger, fear, surprise (the "emotion" dataset's classes — disgust is NOT one of them). There is no "neutral". Reviews that are flat/factual ("The gym. It's a gym.") get forced into one of the six — usually joy. That is a categorical-model limitation, not a model error. An honest analysis would add a neutral bucket using a second classifier; we note this as a residual limitation.

8b

Emotion correction — score-guided re-rank

Scripts: v3_08b_emotion_correction.py, v3_08b_validation.py, v3_robustness_phase8c.py  |  Status: BEYOND rubric, methodology fix

The Phase 8 joy-on-negative-reviews issue, addressed. We re-run the SAME rubric-specified model with top_k=None, get the full probability vector, and for low-star reviews flagged positive, replace the top-1 with the model's highest-ranked NEGATIVE emotion. Original label preserved as emotion_raw. Every change logged.

Rubric coverage

Not a rubric item. Methodology correction to the Phase 8 output.
Rule: IF score ≤ 2 AND top emotion ∈ {joy, love, surprise} → re-rank to highest negative emotion from the SAME model's probability vector. Applied to 1,512 rows. PhD-level internal panel reviewed the fix; independent classifier (j-hartmann/emotion-english-distilroberta-base, 7-class, multi-domain) validates the direction of the flip 3.7× more often than the original label.
beyond
1,512
Rows corrected
3.7×
Indie-model backing of flip direction
UNCHANGED
Phase 12 topic rankings
5.29%
Phase 8c sensitivity disagreement
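A minimal sketch of the 8b rule, assuming a per-row dict of the model's six class probabilities (the function name is ours, not the script's):

```python
NEGATIVE = {"anger", "sadness", "fear"}
POSITIVE = {"joy", "love", "surprise"}

def rerank(score, probs):
    """Phase 8b rule sketch. probs maps each of the model's six labels to its
    probability. If a low-star review's top label is positive, fall back to the
    SAME model's highest-ranked negative emotion. Returns (label, was_corrected);
    in the real script the original label is preserved as emotion_raw."""
    top = max(probs, key=probs.get)
    if score <= 2 and top in POSITIVE:
        best_neg = max((l for l in probs if l in NEGATIVE), key=probs.get)
        return best_neg, True
    return top, False

# A 1-star "polite complaint" the raw model calls joy:
label, changed = rerank(1, {"joy": 0.45, "anger": 0.30, "sadness": 0.15,
                            "fear": 0.05, "love": 0.03, "surprise": 0.02})
```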
Before/after
★ BEYOND — methodology artifact
Shows: emotion distribution on score≤2 reviews, before and after the 8b rule. Joy drops to 0% on the corrected subset (that's what the rule does — by construction).
Doesn't show: whether the fix is correct. This chart describes the fix, it does not validate it. The validation uses an independent second classifier on the same 1,512 rows (see validation box below).
Correlation caveat
⚠ CIRCULARITY CAVEAT
Shows: Phase 9 score↔emotion correlations pre-fix vs post-fix. Post-fix numbers are stronger.
Why this is not a finding: the 8b rule was defined as "replace positive emotion on low-star reviews". After the rule runs, the joint distribution of score and is_joy is narrower by construction. The Pearson correlation between them MUST increase — even if the rule were random. Citing post-fix r=+0.747 as "an improvement over r=+0.598" is a tautology, not a finding.
Honest baseline: correlations computed on the 26,154 untouched rows only (all p < 1e-10): score ↔ is_joy = +0.714, is_anger = −0.545, is_sadness = −0.402. This is the correlation on rows the fix did NOT touch — unbiased by the rule. This is the number we cite.
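The tautology is easy to demonstrate on a toy example: apply an 8b-style rule to fabricated labels and Pearson r rises mechanically, whether or not the flips were correct (scores and labels below are invented for illustration):

```python
from statistics import mean

def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

scores = [1, 1, 2, 2, 4, 5, 5, 5]
joy_pre = [1, 0, 1, 0, 1, 1, 1, 1]                               # toy raw labels
joy_post = [0 if s <= 2 else j for s, j in zip(scores, joy_pre)]  # 8b-style rule

r_pre = pearson(scores, joy_pre)
r_post = pearson(scores, joy_post)
# r_post > r_pre by construction: the rule narrows the joint distribution.
```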

Learning 1 — character truncation bug inherited from Phase 8

Line 94 of v3_08b_emotion_correction.py applies str(t)[:512], a character-level truncation, before the pipeline's own token-level truncation. It is redundant and strips roughly 80% of what BERT could have seen (BERT's 512-token limit corresponds to roughly 2,500 characters). The original Phase 8 script has the same bug at line 91. It matters because the 8b fix targets long, polite complaints: exactly the reviews whose actual complaint arrives after the 512-character preface.

Sensitivity test (Phase 8c, v3_robustness_phase8c.py): re-run 8b without the character truncation. Row-by-row label agreement = 94.71% (80 of 1,512 disagree). Zero disagreement on reviews ≤512 chars; 28.9% disagreement on reviews >512 chars — exactly what theory predicts. Dominant disagreement flow is anger↔sadness (within-negative swaps), which does not affect the positive/negative binary the downstream analysis relies on.
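The gap between the two truncation levels, as a toy illustration using naive whitespace tokens as a stand-in for BERT's WordPiece tokenizer (the review text is synthetic):

```python
# Naive whitespace "tokens" as a stand-in for BERT's WordPiece tokenizer.
review = ("complaint " * 400).strip()     # a long 400-word review

char_truncated = review[:512]             # the Phase 8/8b bug: str(t)[:512]
token_truncated = review.split()[:512]    # the pipeline's own budget (simplified)

words_seen_char = len(char_truncated.split())   # 52 chunks: 51 words + a fragment
words_seen_tok = len(token_truncated)           # all 400 words survive
```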

Learning 2 — validation by independent second classifier

We re-classified the 1,512 audited rows with j-hartmann/emotion-english-distilroberta-base (7-class, multi-domain: Reddit + reviews + self-reports + TV dialogue — deliberately NOT a Twitter model). Labels mapped: disgust→anger, neutral kept separate. On 46.2% of touched rows the indie model agrees with "this is a negative emotion" (backing the flip direction); on 15.4% it says positive (backing the original label). Ratio: 3.7× more flips agreed than originals agreed. Exact-emotion agreement is only 19.7% — the two classifiers disagree on which negative emotion, but they agree on the sign. That's enough for our downstream use. See VALIDATION_08B.md.

Residual error ceiling — the un-corrected side

The 8b rule only targets low-star reviews tagged positive. It does NOT touch high-star reviews the rubric model tagged negative. That's 8.54% of high-star reviews (1,628 of 19,053 score≥4 reviews tagged anger/sadness/fear). Some of those are genuine "1-star polite" being offset by "5-star sarcastically negative", some are real classifier error in the opposite direction. We flag this as the methodology's explicit error floor and do not correct it; the "score guides the re-rank" rule wouldn't have a defensible direction on 5-star anger (is the complaint real, or is the rating sarcastic?).

Sarcasm examples the fix caught

"Not happy and delighted with this experience"

Model matched "delighted", missed the "Not happy" negation.

"I'm looking forward to training again"

Positive sign-off after a 1-star complaint about an unfair £15 joining fee.

"Gym for all — joke. Happy to take your money..."

"Happy" appears literally; syntax inverts it. Twitter model reads the token, not the rhetoric.

9

Post-NLP correlations

Script: v3_09_post_nlp_correlations.py

The heatmap Phase 4 could not produce — because we now have emotion labels. Combine emotion × time × score × platform × length. Correlations below are on untouched rows only (excludes the 1,512 rows adjusted in Phase 8b), which keeps the claims honest.

Rubric coverage

Not a rubric item. Course 1 statistics meets Course 3 NLP.
Demonstrates that emotion labels from a language model can be used as features in a classical correlation analysis. Feeds the report's "final insights" item (#48).
beyond
Post-NLP correlation heatmap
★ BEYOND — feeds report #48
Shows: rich correlation matrix. Score ↔ is_anger = −0.545 (on untouched rows). Score ↔ is_joy = +0.714. Length ↔ is_anger is small and positive. Much more structure than the pre-NLP version on Page 4.
Doesn't show: causation. Even strong correlations here don't say anger causes low scores or low scores cause anger — both are expressions of the same underlying dissatisfaction. Also: the post-fix all-rows version of this heatmap looks stronger but is the circular one (see Page 9 caveat); we cite the untouched-rows version instead.
Emotion by score
★ BEYOND
Shows: a clean monotonic gradient: 1-star = 46% anger, 5-star = 91% joy. Emotion and score carry overlapping, but not redundant, information.
Doesn't show: which direction is "more truthful" on the ambiguous middle (3-star). 3-star reviews are the most emotionally mixed and the most informative per the J-shape literature — but they are also where the emotion classifier is least confident.
Emotion by time
★ BEYOND
Shows: emotion by time-of-day bucket. Joy dominates everywhere. The anger and sadness fraction shifts slightly toward evening.
Doesn't show: timezone. Timestamps are stored in UTC, so UK posts only land at "UK evening" after applying the BST offset. The apparent "8pm peak" is real, but its UTC stamp hides UK local time.
Emotion by hour
★ BEYOND
Shows: hourly anger and sadness track together; joy is inverse. Anger peaks around 19:00–21:00 UTC (UK evening post-workout frustration).
Doesn't show: whether evening anger is about evening (peak-time crowding) or is just when angry people choose to write. Both plausible.
Length by emotion
★ BEYOND
Shows: median review length by emotion. Surprise (38 words) and sadness (41) are longest. Joy (16) is shortest. Angry people write shorter than sad people — anger vents in bursts; sadness narrates.
Doesn't show: information density. Long review ≠ informative review. The "2-star reviews are most informative" finding in REFLECTIONS.md uses topic-breadth and both-sentiment-markers-present as metrics, not raw length.
10

Gensim LDA

Script: v3_10_gensim_lda.py  |  Rubric section: "Using Gensim"

Foundational topic model from Week 3.3.3. 10 topics, probabilistic (bag-of-words over preprocessed text). pyLDAvis interactive distance map now working.

Rubric items covered on this page

"Perform the preprocessing required to run the LDA model from Gensim. Use the list of negative reviews (combined Google and Trustpilot reviews)."
Full heavy-preprocess branch from Phase 5: lowercase + stopwords + numbers removed + tokenised + bigram phrases (gensim.models.Phrases) + dictionary + corpus. Fed to LDA.
#38
"Using Gensim, perform LDA on the tokenised data. Specify the number of topics = 10."
LdaModel, num_topics=10, 20 passes, random_state=42. Coherence c_v = 0.449.
#39
"Show the visualisations of the topics, displaying the distance maps and the bar chart listing out the most salient terms."
pyLDAvis interactive HTML below. Initially the notebook hung on prepare(); resolved by lowering the input dataset size and using mds='mmds'. Matplotlib fallbacks retained for static viewing.
#40
"Comment on the output and whether it is similar to other techniques, and whether any extra insights were obtained."
Commentary in notebook. Three concrete comparisons below.
#41
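The heavy-preprocess branch, sketched in plain Python with an abbreviated stopword list (the real pipeline additionally runs gensim.models.Phrases bigram detection and builds the Dictionary and corpus, all omitted here):

```python
import re

STOPWORDS = {"the", "a", "is", "and", "to", "of", "it", "was"}   # abbreviated list

def heavy_preprocess(text):
    """Lowercase, strip numbers/punctuation, tokenise, drop stopwords and short
    tokens -- the bag-of-words-friendly branch fed to Gensim LDA. Bigram
    phrasing and dictionary/corpus construction are omitted in this sketch."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # also removes digits
    return [t for t in text.split() if t not in STOPWORDS and len(t) > 2]

tokens = heavy_preprocess("The gym was DIRTY and 3 machines broken!")
# → ['gym', 'dirty', 'machines', 'broken']
```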
LDA topics
✓ HIT — Rubric #39–#40 (static view)
"Using Gensim, perform LDA on the tokenised data. Specify the number of topics = 10."
Shows: the 10 LDA topics with their top terms. Much broader themes than BERTopic's 191 — "equipment/facilities" is one LDA topic but spreads across ~8 BERTopic topics.
Doesn't show: document-level assignment. LDA gives each document a distribution over topics (mix), not a single label. The "dominant topic" assignment in the next chart is a chosen simplification.
LDA doc distribution
✓ HIT — descriptive
Shows: document count per dominant LDA topic. Equipment and staff dominate.
Doesn't show: topic mixtures. A review that is 40% equipment / 30% staff / 30% billing is counted here ONLY as equipment. LDA's real strength (soft assignment) is not visible in a bar chart.
✓ HIT — Rubric #40 (interactive)
"Show the visualisations of the topics, displaying the distance maps and the bar chart listing out the most salient terms."
Shows: pyLDAvis canonical visualisation. Left panel = intertopic distance map (Jensen-Shannon divergence reduced via MDS). Right panel = salient terms per selected topic with relevance slider (λ).
Doesn't show: topic quality at a glance. The MDS 2D projection is a lossy reduction; two topics that appear close can still differ on specific terms. Use the λ slider (0.3–0.7) to see distinctive terms rather than just most-frequent terms.

Learning — LDA vs BERTopic, what each shows the other doesn't

LDA gave us broader themes (10 vs BERTopic's 191) and soft document assignments. It also surfaced the "Danish topic" and "German topic" clusters in V1 — LDA's bag-of-words representation is sensitive to language-specific vocabulary, which is why it separated them. BERTopic would not isolate these as discrete topics because multilingual embedding similarity to the English cluster centroids was low enough for those reviews to land as outliers rather than their own cluster.

LDA strength: probabilistic, interpretable, language-sensitive, broad themes. BERTopic strength: semantic at phrase level, density-based clustering, no fixed topic count, deterministic with seed.

They answer different questions. LDA tells you "this review is 40% equipment, 30% staff". BERTopic tells you "this review is specifically about showers being cold after a workout, which is adjacent to showers being dirty". Use both.

Weakness — LDA coherence 0.449

Reasonable for a 10-topic gym-review model but not high. Could be pushed by (a) cleaner preprocessing (trigrams, named-entity filtering), (b) more topics, (c) tuning alpha/eta. We did not chase coherence because LDA is the minimum-rubric model here, not the primary analytical engine — BERTopic does the heavy lifting. Honesty over gaming-the-metric.

11

Falcon-7b — generative topic extraction + actionable insights

Notebook: notebook_01_falcon7b.ipynb (Colab, T4 GPU)  |  Rubric section: "Using a large language model from Hugging Face"

tiiuae/falcon-7b-instruct loaded on Colab. Two prompting passes: first to extract top-3 topics per review, then to synthesise actionable recommendations. BERTopic re-run on the Falcon-generated topic list as a meta-clustering step.

Rubric items covered on this page

"Load the following model: tiiuae/falcon-7b-instruct. Set the pipeline for text generation and a max length of 1,000 for each review."
Exact model ID. pipeline("text-generation", ..., max_length=1000).
#31
"Add the following prompt to every review, before passing it on to the model: 'In the following customer review, pick out the main 3 topics. Return them in a numbered list format, with each one on a new line.' Run the model. Note: If the execution time is too high, you can use a subset of the bad reviews (instead of the full set) to run this model."
Exact prompt used. 600-review subset (out of 5,825 negative reviews — 10.3%) to fit within free Colab T4 compute budget. Rubric explicitly permits subsetting. Tagged THIN because 600 is a modest sample; we disclose this in the report.
#32
"The output of the model will be the top 3 topics from each review. Append each of these topics from each review to create a comprehensive list."
1,349 unique topic strings extracted (after deduplication). Raw list in output/falcon_results.csv.
#33
"Use this list as input to run BERTopic again."
Meta-clustering step: BERTopic over the Falcon topic strings. Interactive output in output/falcon_meta_clusters.html.
#34
"Comment about the output of BERTopic. Highlight any changes, improvements, and if any further insights have been obtained."
Commentary in notebook: Falcon produces human-readable topic labels that BERTopic's c-TF-IDF cannot. "Charged after cancelling" carries intent that "charge, cancel, member" (bag of words) does not. Meta-clustering over Falcon outputs yields a smaller, semantically cleaner topic set than the c-TF-IDF version on raw text.
#35
"Use the comprehensive list from Step 3. Pass it to the model as the input, but pre-fix the following to the prompt: 'For the following text topics obtained from negative customer reviews, can you give some actionable insights that would help this gym company?' Run the Falcon-7b-Instruct model."
Exact prompt with prefix. Run in notebook Cells 23–24.
#36
"List the output, ideally in the form of suggestions, that the company can employ to address customer concerns."
Suggestions covered hygiene protocols, peak-time capacity comms, billing transparency, equipment SLA. Many overlap with Phase 12 OIR findings — arrived at independently, which is corroborative rather than redundant.
#37
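The prompt construction (#32) and topic harvesting (#33) can be sketched as follows; the numbered-list reply format and the helper names are assumptions, and real Falcon output is messier than this:

```python
import re

PROMPT = ("In the following customer review, pick out the main 3 topics. "
          "Return them in a numbered list format, with each one on a new line.")

def build_prompt(review):
    # Rubric #32: the instruction is prefixed to every review before generation.
    return f"{PROMPT}\n\n{review}"

def parse_topics(generated):
    """Pull topic strings out of the model's numbered-list reply (format assumed;
    real generations need more defensive parsing than this)."""
    topics = []
    for line in generated.splitlines():
        m = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if m:
            topics.append(m.group(1).strip().lower())
    return topics

reply = "1. Broken equipment\n2. Rude staff\n3. Charged after cancelling"
# parse_topics(reply) → ['broken equipment', 'rude staff', 'charged after cancelling']
```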
600
Reviews prompted
1,349
Unique topic strings
10.3%
Of the 5,825 negative reviews
T4
GPU tier (Colab free)
Falcon topics
✓ HIT — Rubric #33
"The output of the model will be the top 3 topics from each review. Append each of these topics from each review to create a comprehensive list."
Shows: distribution of Falcon-extracted topics. Long tail — most topics appear only a handful of times; a head of ~30 common topics covers most reviews.
Doesn't show: quality of the extraction. Falcon-7b-instruct occasionally hallucinates a topic that's not in the source review — we spotted ~5% noise rate in a hand-check of 50 outputs. Reported as a limitation, not scrubbed.
✓ HIT — Rubric #34
"Use this list as input to run BERTopic again."
Shows: BERTopic run over the Falcon-generated topic strings. Fewer, tighter meta-clusters than raw-text BERTopic because Falcon has already done one layer of abstraction.
Doesn't show: where the reduction came from — Falcon, or the c-TF-IDF over short text. Hard to disentangle. Useful as a corroboration of Phase 6 themes rather than a novel finding.
✓ HIT — Rubric #34 (intertopic)
Shows: intertopic distance map of the Falcon meta-clusters.
Doesn't show: a completely novel cluster structure. It echoes Phase 6.

Weakness — sample size + hallucination risk

600 of 5,825 negative reviews (10.3%). The rubric explicitly permits subsetting on compute grounds, and we document the choice, but a 600-review sample will miss low-frequency issues. Second: Falcon-7b-instruct is a 2023 model prone to hallucinating plausible-sounding topics not present in the source. Hand-check of 50 outputs found ~5% noise. Both acknowledged in the report.

Learning — why use a generative model here at all?

BERT-family encoders classify or cluster. Falcon is a decoder model — it generates. For a topic-labelling task, generation produces human-readable phrases ("charged after cancelling my membership") rather than bag-of-words ("charge cancel member"). That readability is the deliverable. We use the right architecture for the right task, not the most powerful model for every task.

12

Actionable findings — severity × frequency

Script: v3_12_actionable_findings.py

OIR chains (Observation → Implication → Recommendation) for each major theme. Severity is human-calibrated on business cost (churn, legal, capex). Frequency is BERTopic document count on the 18,585-review subset. This page feeds report items #45 ("conclusions supported by data") and #48 ("final insights").

Rubric coverage

Not a rubric item. OIR framework layered on top of BERTopic output.
Feeds the report section. Severity scoring is an editorial judgement (not algorithmic), which we disclose — we do NOT present it as model output.
beyond
Severity x frequency matrix
★ BEYOND
Shows: each theme as a bubble. x-axis = frequency (document count). y-axis = severity (1–10, human-calibrated). Bubble size = review volume. Cleaning sits top-right (fix immediately). Billing sits top-centre (low frequency but highest severity).
Doesn't show: cost of fix. Severity is "what it costs the business if we ignore it", not "what it costs to fix". A cold-shower fix is cheap (boiler resizing); an HVAC fix is expensive (capex). The Phase 12 recommendations sort on fix-cost implicitly but the chart does not.

1. Cleaning & Hygiene

Severity 9/10 562 reviews
Observation
BERTopic Topic 1 (562 docs): cleaning, toilets, dirty, stations, wipe. The #1 complaint theme by frequency.
Implication
Illustrative churn impact: 1% churn from cleanliness × ~1.8M members = ~18,000 lost memberships ≈ £4.3M/year. Number is an order-of-magnitude bound, not a point estimate.
Recommendation
Visible cleaning schedules at peak transitions. QR-based spray-bottle reporting via app. KPI: reduce cleaning-topic share 30% over 6 months.

2. Overcrowding at peak times

Severity 7/10 743 reviews
Observation
~743 reviews across multiple topics mention peak-time congestion — the single largest aggregated theme.
Implication
Budget model needs density. Exceeding the density tipping point drives churn within ~3 months of joining (industry benchmark, not our finding).
Recommendation
In-app capacity notifications. Gamify off-peak visits. Consider dynamic pricing for peak slots.

3. Air conditioning & ventilation

Severity 8/10 423 reviews
Observation
Topic 3 (423 docs): air, conditioning, hot. Infrastructure, not operational.
Implication
Capex investment. Post-COVID air-quality sensitivity makes this a reputational risk, not just a comfort issue.
Recommendation
HVAC audit at top-30 most-reviewed sites first. Interim: industrial fans during heatwave-forecast weeks.

4. Billing & cancellation disputes

Severity 10/10 157 reviews
Observation
Joining fees, double charges, cancellation friction. Trustpilot-heavy. Legal language ("solicitor", "trading standards") appears.
Implication
Highest severity — lowest frequency but regulator-adjacent. 5.4-day reply lag on these is too slow.
Recommendation
24-hour SLA for billing complaints. Self-service cancellation in app. Proactive refund workflow for duplicate charges.

5. Broken equipment — weeks, not days

Severity 7/10 265 reviews
Observation
The word "weeks" appears as a modifier on equipment complaints specifically. Members report items broken for weeks, not days.
Implication
Procurement/maintenance bottleneck. Compounds the overcrowding problem (fewer working units → more queueing).
Recommendation
72-hour repair SLA. Publish repair status in app. Track mean-time-to-repair as a KPI.

6. Staff behaviour — rude / unprofessional

Severity 8/10 315 reviews
Observation
Personal pronouns (she, her, he, him) dominate — complaints are about specific named or identifiable individuals, not "the staff" generically.
Implication
Training + per-site management, not systemic. One rude staff member = dozens of 1-star reviews.
Recommendation
Mystery-member programme. De-escalation training. Per-site complaint tracking surfaced to area managers.

7. Cold showers

Severity 6/10 243 reviews
Observation
Topic 5 (243 docs): cold showers after workout. "Moment of truth" failure — the final interaction before leaving.
Implication
Cheap to fix (boiler sizing) relative to HVAC or equipment. Highest ROI intervention by £-per-sentiment-point.
Recommendation
Audit water heater capacity at the 20 sites with most shower complaints. Quick win.
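One illustrative way to collapse the two axes into a single ordering is to multiply the human-assigned severity by document frequency. This composite is our illustration only; the deck deliberately plots the axes separately (billing tops severity despite low frequency):

```python
# Severity (1-10, human-calibrated) and review counts from the seven OIR themes.
themes = {
    "cleaning":     (9, 562),
    "overcrowding": (7, 743),
    "hvac":         (8, 423),
    "billing":      (10, 157),
    "equipment":    (7, 265),
    "staff":        (8, 315),
    "cold_showers": (6, 243),
}

# Illustrative composite: severity x frequency. Note it would rank overcrowding
# above cleaning -- one reason the real analysis keeps the axes separate.
ranked = sorted(themes, key=lambda t: themes[t][0] * themes[t][1], reverse=True)
```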

Learning — severity is editorial, not algorithmic

We deliberately keep severity as a human-assigned 1–10 on business cost, not a model output. A model that scored severity would either (a) proxy frequency (obvious cleaning=9) or (b) proxy anger confidence (which inflates generic venting). Neither maps to "what would it cost us if we ignored this". Editorial judgement is the right tool, disclosed as such.

E

Extended analysis, limitations & robustness

Scripts: v3_contextual_analysis.py, v3_postcode_weather_analysis.py, v3_robustness_oster.py, v3_panel_iterations.py  |  Status: BEYOND rubric

Everything from this page onward is beyond rubric. Included because it stress-tests the claims made in the main pipeline (Oster bound on the headline correlation) and provides context (weather, income, deprivation, music) that turns descriptive findings into actionable hypotheses.

Monthly volume — the honest view

Fixed monthly volume
★ BEYOND
Shows: annotated monthly volume. Red zone = Google 2022 data is sparse (export limitation). Green = full collection. Trustpilot doesn't start until May 2023 and has spiky months.
Doesn't show: why the export is sparse. We can't tell if it's a Google Takeout window limit or a scraping cutoff — we know the data is incomplete for 2022, not why. Don't draw temporal conclusions from pre-June 2023.

Music ↔ negativity — the single load-bearing correlation

★ BEYOND
Shows: location-level music complaint rate vs overall negativity rate. Unconditional Pearson r = +0.60 (p < 1e-39, n=390 locations).
Doesn't show: causation. "Locations where people write longer reviews about everything" is a real confounder (the complaint-amplifier effect). The partial correlation controlling for log(n_reviews) and median review length shrinks to r = +0.476, a 21% reduction; we treat the partial estimate as the interpretable floor.
Oster (2019) bound: unobservables would need to be δ* = 1.9× as informative as the observables (including the confounder above) to kill this effect. The correlation survives the observational stress test. We do NOT claim causal identification — that needs a volume-cap geo-experiment — but the finding is robust. Full detail: ROBUSTNESS_APPENDIX.md.
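The partial-correlation step can be sketched by residualising both variables on the controls. The data below is synthetic, constructed so a shared "volume" confounder drives both series (variable names are ours, not the script's):

```python
import numpy as np

def partial_corr(x, y, controls):
    """Correlation of x and y after regressing both on the controls
    (column-stacked covariates plus an intercept)."""
    Z = np.column_stack([np.ones(len(x))] + list(controls))
    rx = x - Z @ np.linalg.lstsq(Z, x, rcond=None)[0]
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(0)
n = 500
volume = rng.normal(size=n)                     # stand-in for log(n_reviews)
music = volume + 0.4 * rng.normal(size=n)       # both driven by the confounder
negativity = volume + 0.4 * rng.normal(size=n)

raw = np.corrcoef(music, negativity)[0, 1]
partial = partial_corr(music, negativity, [volume])
# partial shrinks toward 0 once the shared driver is controlled for
```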

Income ↔ negativity — the budget-gym paradox

★ BEYOND
Shows: median household income by Local Authority vs location negativity. r = +0.33. Higher-income areas complain more.
Doesn't show: individual-level dynamics. This is an ecological correlation — we don't know whether the individual affluent member is complaining more, or whether these locations draw different member types. Also can't distinguish "I expect more for the same price" from "I have more time to write reviews".

Weather ↔ facility complaints

Weather complaints
★ BEYOND
Shows: AC complaint rate 10.7% on >25°C days vs 3.5% on 5–10°C days — 3× increase. Cold-shower complaints spike on cold days. Clean signal that weather modulates specific facility complaints.
Doesn't show: the daily-noise variance. Single hot days get no complaints; sustained heatwaves do. We use daily correlation but the real driver is probably multi-day integrated heat exposure.

Cascade hypothesis — tested, half-right

Cascade hypothesis
★ BEYOND — hypothesis test
Shows: AC complaint rate vs staff-rudeness rate at location level. Pre-registered hypothesis: hot days → broken AC → angry customers → staff absorb frustration → staff become rude. At location level, r(AC, rudeness) = +0.069 — essentially zero. r(AC, overall anger) = +0.256 — moderate.
Verdict: the cascade hypothesis is half-right, half-wrong. Facility failures do generate anger (supported). But staff rudeness is an independent problem that does NOT correlate with AC failures (not supported). Don't assume fixing AC will fix staff behaviour — fix them as separate workstreams.

Deprivation ↔ reviews — weak signal

Deprivation scatter
★ BEYOND
Shows: Local Authority deprivation rank vs review metrics. r = +0.20 (deprivation, %negative) — weak positive.
Honest read: this is a weak correlation. r = +0.20 with a noisy scatter. Deprivation is a poor predictor of review tone compared to income (r = +0.33) or music complaint rate (r = +0.60). The headline "least deprived areas have worse reviews" is real but not the main story — income gradient is cleaner.

Panel findings — location hotspots

Location emotions
★ BEYOND
Shows: top-10 locations by negative review count, broken down by emotion. London Stratford leads (81 negative; 46% anger).
Doesn't show: per-capita exposure — high-traffic locations get more negatives in absolute terms but not necessarily per member. Rate-based view would be cleaner but we don't have membership counts per site.
Complaint trends
★ BEYOND
Shows: quarterly complaint-category trends. Equipment dominates throughout. Cleaning and billing track similarly.
Doesn't show: seasonality clearly — one-year panel is too short to separate trend from seasonal effect. Would need multi-year data.

Known limitations (combined list)

1. Emotion classifier domain mismatch
Twitter-trained model misreads polite complaint language as joy. Addressed in Phase 8b with a score-guided re-rank plus independent-classifier validation. Residual error ceiling: 8.54% on the uncorrected high-star-negative side.
2. Language detection on short texts
langdetect is unreliable on reviews under ~5 words. 4.8% disagreement with Trustpilot's own column. Partially mitigated by trusting Trustpilot's column when the text is short.
3. BERTopic topic diversity 0.546
Below 0.6 threshold — 191 topics has some near-duplicates. Kept as-is for Phase 12 granularity; merged-down version would be used for a report-facing summary.
4. Falcon-7b subset
600 of 5,825 negative reviews (10.3%). Rubric-permitted on compute grounds; disclosed in report. Hallucination rate ~5% on hand-checked sample.
5. Survivorship bias
We analyse only people who write reviews. Silent-quitters are invisible; the corpus is the vocal minority. Severity probably understated.
6. Ecological correlations
Income, deprivation, weather correlations are location-level (aggregated). Individual-level inference from these is not justified without further data.
7. 512-character truncation in emotion model
Line-level bug (char truncation on top of token truncation). Sensitivity-tested in Phase 8c: 5.29% label disagreement, 0% on reviews ≤512 chars, 28.9% on >512 chars. Dominant flow is within-negative (anger↔sadness), does not affect the positive/negative binary the pipeline relies on. Patch available but not applied to submission pickles (for reproducibility of the submitted artefact).
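For reference, the "topic diversity" in limitation 3 is, under one common definition (the proportion of unique words among the top-k words pooled across all topics), trivial to compute. A minimal sketch on invented toy topics — not the project's actual word lists:

```python
def topic_diversity(topics, top_k=10):
    """Fraction of unique words among the top-k words of every topic.
    1.0 means no overlap between topics; low values flag near-duplicates."""
    words = [w for topic in topics for w in topic[:top_k]]
    return len(set(words)) / len(words)

# Two near-duplicate topics drag diversity down:
toy_topics = [
    ["staff", "rude", "manager", "desk"],
    ["staff", "rude", "reception", "desk"],
    ["music", "loud", "volume", "speaker"],
]
print(topic_diversity(toy_topics, top_k=4))  # 9 unique / 12 words = 0.75
```

On 191 topics a score of 0.546 means roughly half the pooled top words repeat across topics — consistent with the near-duplicate diagnosis above.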

Overall robustness verdict

V3 is more defensible after the panel stress-tests than before. Both load-bearing claims survived their sensitivity tests in the same direction: the music ↔ negativity correlation withstands an Oster bound at default assumptions (δ* ≈ 1.9), and the emotion fix is stable under a truncation sensitivity test (94.7% label agreement on a re-run without the bug). Neither test is a proof — Oster is observational, Phase 8c removes one bug in isolation — but both were pre-specified as panel residual concerns and both returned clean.
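The Phase 8c length-split disagreement check (limitation 7) amounts to comparing labels between the two runs, grouped by whether the review exceeds the character limit. A sketch with an invented toy frame — the real column names and label set differ:

```python
import pandas as pd

def disagreement_by_length(df, char_limit=512):
    """Label-disagreement rate between two classifier runs,
    split by whether the review text exceeds the char limit."""
    over = df["text"].str.len() > char_limit
    disagree = df["label_buggy"] != df["label_patched"]
    return disagree.groupby(over).mean()

# Toy frame: two short reviews agree; one long review flips label.
toy = pd.DataFrame({
    "text": ["short review", "x" * 600, "y" * 700, "another short one"],
    "label_buggy":   ["anger", "sadness", "anger", "joy"],
    "label_patched": ["anger", "anger",   "anger", "joy"],
})
print(disagreement_by_length(toy))
```

The structure mirrors the reported result: zero disagreement under the limit, all disagreement concentrated in the over-limit bucket.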
