Data load & clean
DataFrame
What it is: A table in pandas. Columns have names. Rows are records.
Our data: `google_df` has 13,898 rows and 7 columns after the NaN drop. `trustpilot_df` has 15,815.
When to reach for it: Every step in this pipeline starts from a DataFrame and ends with a DataFrame.
NaN / missing value
What it is: A cell with no value. Not an empty string; a flag that says "nothing here".
Our data: ~9,000 Google rows had a star rating but no comment text. Not NLP-useful. We dropped them.
When to reach for it: `dropna(subset=['Comment'])` drops rows where the review text is missing, nothing else.
Column harmonisation
What it is: Making two sources' columns line up so you can merge them.
Our data: Google's `Comment` + `Overall Score` + `Club's Name` become one platform's `text` + `score` + `location`. Trustpilot same, from `Review Content` + `Review Stars` + `Location Name`.
When to reach for it: Do this before any cross-source analysis. Do it once, in one script.
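The harmonisation pattern, sketched with one-row toy frames (the rename maps use the column names from the text; `GOOGLE_MAP` / `TRUST_MAP` are illustrative names):

```python
import pandas as pd

google = pd.DataFrame({"Comment": ["ok"], "Overall Score": [4], "Club's Name": ["Leeds"]})
trust = pd.DataFrame({"Review Content": ["bad"], "Review Stars": [1], "Location Name": ["Leeds"]})

# Map each source's columns onto one shared schema, once, in one place.
GOOGLE_MAP = {"Comment": "text", "Overall Score": "score", "Club's Name": "location"}
TRUST_MAP = {"Review Content": "text", "Review Stars": "score", "Location Name": "location"}

google = google.rename(columns=GOOGLE_MAP)
trust = trust.rename(columns=TRUST_MAP)
google["platform"] = "google"
trust["platform"] = "trustpilot"

combined = pd.concat([google, trust], ignore_index=True)
print(list(combined.columns))
```

Keeping the maps as named constants in one script is what makes "do it once" enforceable.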
Title + Content merge (Trustpilot)
What it is: Join the short headline to the full review before any model sees the text.
Our data: 59% of Trustpilot Review Titles carry information NOT in Review Content. Ignoring titles loses signal.
When to reach for it: When a source gives you two text fields for the same thing, check whether both add info before picking one.
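One way to do the join while handling missing titles (toy rows; the `fillna`/`lstrip` combination is one approach, not the pipeline's verbatim code):

```python
import pandas as pd

trust = pd.DataFrame({
    "Review Title": ["Dirty showers", None],
    "Review Content": ["Machines were fine though.", "No staff at the desk."],
})

# Prepend the headline when present; fall back to content alone.
trust["text"] = (
    trust["Review Title"].fillna("")
        .str.cat(trust["Review Content"], sep=". ")
        .str.lstrip(". ")
)
print(trust["text"].tolist())
```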
Language detection filter
What it is: Drop reviews that are not in your target language.
Our data: `langdetect` flagged 9.5% as non-English: Danish, German, Welsh, French. 2,047 Google + 858 Trustpilot rows held aside.
When to reach for it: Filter on detected language only when your downstream models are monolingual. Keep the non-English set for audit.
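The hold-aside pattern, sketched with the detector injected so the logic is testable; `split_by_language` and `fake_detect` are illustrative names, and in the pipeline `detect` would be `langdetect.detect` (set `langdetect.DetectorFactory.seed` for determinism):

```python
import pandas as pd

def split_by_language(df, detect, target="en"):
    """Partition df into (target-language, held-aside) using any
    detector with a langdetect-style signature: detect(text) -> ISO code."""
    langs = df["text"].map(detect)
    return df[langs == target], df[langs != target]

# Stand-in detector for illustration only.
fake_detect = lambda t: "da" if "ikke" in t else "en"

df = pd.DataFrame({"text": ["Great gym", "ikke godt"]})
english, held_aside = split_by_language(df, fake_detect)
print(len(english), len(held_aside))  # 1 1
```

Returning both halves (rather than filtering in place) is what lets you keep the non-English set for audit.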
Preprocessing: philosophy and recipes
Lowercasing
What it is: Turn all letters into lowercase.
Our data: "The GYM is FANTASTIC" becomes "the gym is fantastic".
When to reach for it: Do it before counting words (FreqDist, LDA, wordcloud). Don't bother for BERT; its tokenizer already handles it.
Stopwords
What it is: High-frequency words that appear everywhere and mean nothing.
Our data: `the`, `and`, `is`, plus domain-specifics we added: `pure`, `puregym`, `gym`.
When to reach for it: Strip for the counting branch (FreqDist, LDA, wordcloud). Keep for BERT; it uses their positions.
Tokenisation
What it is: Split a string into individual word pieces.
Our data: "don't go" becomes `['do', "n't", 'go']` via NLTK `word_tokenize`.
When to reach for it: Needed for any counting step. BERT uses its own subword tokeniser; don't double-tokenise.
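A minimal regex stand-in that approximates NLTK's contraction splitting, for environments without the punkt data; `simple_tokenize` is an illustrative name, not the pipeline's function:

```python
import re

def simple_tokenize(text):
    # Split "don't" the way NLTK does: "do" + "n't"; keep other
    # words whole and punctuation as single tokens.
    return re.findall(r"\w+(?=n't)|n't|'\w+|\w+|[^\w\s]", text)

print(simple_tokenize("don't go"))  # ['do', "n't", 'go']
```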
Lemmatisation
What it is: Reduce words to their base form. `running` → `run`; `better` → `good`.
Our data: Used in the Gensim LDA branch. Not used for BERTopic.
When to reach for it: Collapses related forms for counting models. Never lemmatise before BERT; it destroys tense/mood signals.
Counting branch
What it is: The pipeline that wants clean, de-noised text: lowercase, stopword-strip, number-strip, maybe lemmatise.
Our data: Feeds FreqDist, wordclouds, Gensim LDA.
When to reach for it: Any model where the output is a count or probability over tokens.
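The counting-branch recipe in one function; the stopword set here is a hand-rolled stand-in (the pipeline uses NLTK's list plus the domain words `pure` / `puregym` / `gym`):

```python
import re

# Toy stopword set for the sketch only.
STOPWORDS = {"the", "and", "is", "a", "it", "pure", "puregym", "gym"}

def counting_clean(text):
    text = text.lower()                    # lowercase first
    text = re.sub(r"\d+", " ", text)       # strip numbers
    tokens = re.findall(r"[a-z']+", text)  # crude tokenise
    return [t for t in tokens if t not in STOPWORDS]

print(counting_clean("The GYM is FANTASTIC, open 24 hours"))
```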
Embedding branch
What it is: The pipeline that wants raw, unmodified sentences.
Our data: Feeds BERTopic, BERT emotion, Qwen LLM.
When to reach for it: Any model where the internal mechanism reads language. Preprocessing destroys its signal.
The one rule
What it is: If the model understands language, feed it language. If the model counts words, clean the words.
Our data: Same raw text, two branches, opposite recipes.
When to reach for it: Quote this when a teammate says "we always lowercase before modelling". They don't know their model.
Frequency & word counting
Frequency distribution (FreqDist)
What it is: A count of how often each word appears in a corpus.
Our data: Top 10 Google words after domain stopwords: `equipment`, `staff`, `clean`, `broken`, `busy`, ...
When to reach for it: Cheapest descriptive statistic for text. Use as sanity check, not as evidence.
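For plain counting, `collections.Counter` behaves like NLTK's FreqDist; toy tokens here:

```python
from collections import Counter

tokens = ["equipment", "staff", "clean", "equipment", "broken", "equipment"]

freq = Counter(tokens)
print(freq.most_common(1))  # [('equipment', 3)]
```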
Bar plot of top N words
What it is: A horizontal bar chart of the top N words by frequency.
Our data: Rubric 2.6. Saved as `output/v3_05_top10_google.png`, `_trustpilot.png`.
When to reach for it: Always horizontal; word labels don't fit vertically. Reverse the sort so the biggest bar sits at the top.
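A sketch of the horizontal-with-reverse-sort pattern (toy counts, illustrative filename; `barh` draws the first item at the bottom, hence the reversal):

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
from collections import Counter

freq = Counter({"equipment": 900, "staff": 740, "clean": 610, "broken": 480})
words, counts = zip(*freq.most_common(4)[::-1])  # reversed: biggest lands on top

fig, ax = plt.subplots()
ax.barh(words, counts)  # horizontal bars keep word labels readable
ax.set_xlabel("frequency")
fig.savefig("top_words.png")
```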
Word cloud
What it is: A visual where word size is proportional to frequency.
Our data: Rubric 2.7. Four files: all + negative, per platform.
When to reach for it: Decorative. Two words with frequencies 1,000 and 1,001 look identical. Don't use as analytic evidence.
Score-filtered subset
What it is: The subset of rows where the star rating is below a threshold.
Our data: Negatives = score < 3. 5,825 English negatives across both platforms.
When to reach for it: Filter the ORIGINAL DataFrame (not the tokenised one). Re-tokenise on the filtered set. The instructor was explicit.
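The order of operations the instructor asked for, on a toy frame: filter the raw rows first, then tokenise the filtered text:

```python
import pandas as pd

df = pd.DataFrame({
    "text": ["awful showers", "love it", "broken machines"],
    "score": [1, 5, 2],
})

# Filter the ORIGINAL frame, then re-tokenise the filtered text.
negatives = df[df["score"] < 3]
neg_tokens = negatives["text"].str.lower().str.split()
print(len(negatives))  # 2
```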
BERTopic & embedding models
BERTopic
What it is: A topic-modelling pipeline: sentence embeddings → UMAP → HDBSCAN → c-TF-IDF.
Our data: Found 191 topics on 18,000 common-location reviews. Top cluster: equipment + air + staff + showers.
When to reach for it: When you want specific, sharp clusters of complaints. Feed it RAW sentences, not preprocessed tokens.
Sentence transformer
What it is: A neural net that turns a sentence into a fixed-length vector. Similar sentences land near each other.
Our data: `all-MiniLM-L6-v2` (384-dimensional output), BERTopic's default embedder.
When to reach for it: The first step of BERTopic. Replace it with a larger model if you need better semantic matching.
UMAP
What it is: A dimensionality reducer. Turns a 384-dim vector into ~5 dims for clustering.
Our data: Configured with `random_state=42`. Without the seed, BERTopic gives different clusters each run.
When to reach for it: You MUST seed it for reproducibility. Also sort your documents before fit โ order matters too.
HDBSCAN
What it is: A density-based clusterer. Groups nearby points; leaves isolated points as outliers.
Our data: Labels unclustered points as topic -1 (the outlier bucket).
When to reach for it: Standard inside BERTopic. No Python 3.14 Windows wheel, so local preview uses KMeans; Colab uses the real thing.
c-TF-IDF
What it is: TF-IDF applied per cluster to find the words most distinctive to each topic.
Our data: How BERTopic gets from "cluster 0" to the label `[equipment, air, staff, showers]`.
When to reach for it: Always tune BERTopic's vectoriser with `stop_words` and `min_df` for cleaner labels.
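A pure-Python sketch of the idea (within-class term frequency, down-weighted for corpus-wide common words); BERTopic's exact weighting differs slightly, and the clusters and `ctfidf` function here are illustrative:

```python
import math
from collections import Counter

# Two toy clusters of documents.
clusters = {
    0: ["equipment broken equipment air", "air staff showers"],
    1: ["parking price price price contract"],
}

tf = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}
corpus_freq = Counter()
for counts in tf.values():
    corpus_freq.update(counts)
avg_words = sum(corpus_freq.values()) / len(clusters)

def ctfidf(cluster):
    # Simplified class-based TF-IDF: tf in class, scaled down for
    # words that are frequent across the whole corpus.
    return {t: n * math.log(1 + avg_words / corpus_freq[t])
            for t, n in tf[cluster].items()}

scores = ctfidf(1)
top = max(scores, key=scores.get)
print(top)  # price
```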
Topic (BERTopic cluster)
What it is: A group of semantically similar documents with a list of representative words.
Our data: 191 topics on the full set. 64 on the top-30-locations subset. Similar themes, different granularity.
When to reach for it: Report the topic count together with the subset it came from; granularity shifts with the slice you cluster.
Outlier topic (-1)
What it is: The "couldn't cluster" bucket. Not a topic; noise.
Our data: 43% of documents were outliers before reduction. Reduced to 0% by two-stage outlier reduction.
When to reach for it: Never report the outlier bucket as a finding. Reduce it. Accept what's left.
Intertopic distance map
What it is: A 2D scatter plot of topic positions. Close-on-screen means semantically similar.
Our data: `visualize_topics()` output in `output/v3_06_intertopic_distance.html`.
When to reach for it: Good for intuition. Bad for measurement; 2D distorts high-dimensional distances.
Similarity heatmap
What it is: A matrix showing cosine similarity between every pair of topics.
Our data: `visualize_heatmap()` in `output/v3_06_similarity_heatmap.html`.
When to reach for it: Use to defend merge decisions: "Topics 4 and 11 have 0.78 similarity, so we merge them".
Cosine similarity
What it is: The cosine of the angle between two vectors. 1 = identical direction; 0 = unrelated.
Our data: Used throughout embedding comparisons. Week 1.3.2 of the course taught it explicitly.
When to reach for it: The default similarity measure for word and sentence embeddings. Better than Euclidean for vectors of different magnitudes.
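The definition in a few lines, including the magnitude-invariance the entry claims:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0 -- identical direction
print(cosine([1.0, 0.0], [0.0, 2.0]))  # 0.0 -- unrelated
print(cosine([1.0, 1.0], [10.0, 10.0]))  # 1.0 despite very different magnitudes
```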
Reproducibility (BERTopic)
What it is: Same inputs → same outputs across runs.
Our data: Two-step fix: `UMAP(random_state=42)` AND `sort_values('Creation Date')` before `fit_transform`.
When to reach for it: Required for any BERTopic result you put in a report. Cohort failure mode: students forget the sort.
Emotion classification
Emotion classifier
What it is: A model that labels each text with one emotion from a fixed class set.
Our data: Rubric mandates `bhadresh-savani/bert-base-uncased-emotion`. Six classes: anger, sadness, joy, love, fear, surprise (Ekman's set with love in place of disgust). No `neutral` class.
When to reach for it: Needed for rubric section 5. Swap for `SamLowe/roberta-base-go_emotions` (28 classes) in a second pass for accuracy.
Domain mismatch
What it is: Training data ≠ inference data. The model misreads what it wasn't trained on.
Our data: bhadresh was trained on shouty tweets. British gym reviews are polite prose. Polite anger gets labelled as sadness.
When to reach for it: Always check the training distribution against your inference distribution. It's the #1 silent source of error.
Probability distribution (classifier output)
What it is: A score per class. Full distribution beats "top-1 only".
Our data: 51% anger / 49% sadness is very different from 95% anger. Store the full dist, not just the top class.
When to reach for it: Persist the richest form. Re-running inference is expensive; aggregating is cheap.
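Storing the full distribution means top-1 can always be derived later, but never the reverse; toy scores here:

```python
import pandas as pd

# Per-class scores in the shape a classifier pipeline returns them.
rows = [
    {"anger": 0.51, "sadness": 0.49, "joy": 0.00},
    {"anger": 0.95, "sadness": 0.03, "joy": 0.02},
]
dist = pd.DataFrame(rows)

# Derive top-1 from the stored distribution (cheap to recompute).
labels = dist.idxmax(axis=1)
scores = dist.max(axis=1)
dist["top_label"], dist["top_score"] = labels, scores
print(dist[["top_label", "top_score"]])
```

Both rows here are "anger" by top-1, yet 0.51 and 0.95 tell very different stories.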
Neutral class
What it is: A class meaning "no strong emotion".
Our data: Absent in bhadresh. Present in go_emotions. Accounts for 467/1,500 of our reviews = 31%.
When to reach for it: A classifier with no neutral class overstates emotion presence. Flag this in your writeup.
Reclassification pass (Phase 8c)
What it is: Running a second, richer classifier and noting where it disagrees with the first.
Our data: 10.5% of bhadresh-sadness labels reclassified to anger-family (annoyance, disappointment, disapproval) under go_emotions.
When to reach for it: When you can't swap the mandated model, run a second pass and report the delta.
LLMs & topic extraction
LLM (Large Language Model)
What it is: A neural net with billions of parameters that generates natural-language output from a prompt.
Our data: Qwen2.5-72B-Instruct via HF Inference Providers. Replaces the rubric's Falcon-7B.
When to reach for it: When you need structured natural-language summaries. When prompt engineering will be cheaper than training.
Prompt
What it is: The instruction you send to an LLM. Gets concatenated with the input data.
Our data: Verbatim rubric prompt: "In the following customer review, pick out the main 3 topics. Return them in a numbered list format, with each one on a new line."
When to reach for it: Write it verbatim when the rubric specifies. Test on 3-5 examples before running on hundreds.
Temperature (LLM)
What it is: A knob between 0 and 1+ that controls output randomness. 0 = deterministic. 1 = creative.
Our data: `temperature=0.1` for topic extraction (we want consistency). `temperature=0.3` for insights (we want some variation).
When to reach for it: Low temperatures for classification and extraction. Higher for creative generation. Never 0 for creative tasks.
HuggingFace Inference Providers
What it is: A paid API that routes your LLM calls to providers (Together, Fireworks, Groq, Novita).
Our data: 600 reviews × 2 API calls in 120 s total. 100% success rate. No GPU needed locally.
When to reach for it: The cheapest path to frontier-model output. Free tier has rate limits; HF Pro raises them.
Parse (LLM output)
What it is: Turning model output back into structured data (a list, a JSON, a table).
Our data: Regex to strip `1.`, `2.`, bullet markers. 100% success; clean prompts produce parseable output.
When to reach for it: Always validate after parsing. 70B models are reliable but not perfect.
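A sketch of the strip-and-validate step (the regex here is illustrative, not the pipeline's exact pattern):

```python
import re

raw = """1. Overcrowding at peak hours
2. Broken equipment
3. Showers not cleaned"""

# Strip leading "1." / "2." / bullet markers; keep the topic text.
topics = [re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
          for line in raw.splitlines() if line.strip()]
print(topics)

# Validate after parsing: exactly three non-empty topics expected.
assert len(topics) == 3 and all(topics)
```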
Meta-BERTopic
What it is: Running BERTopic on the LLM's distilled topic phrases, not on raw reviews.
Our data: Clusters 1,433 Qwen-extracted phrases into ~15 interpretable themes.
When to reach for it: A two-stage pipeline (LLM distill, then cluster) beats one-stage clustering on raw text when the audience is human.
Actionable insight
What it is: A recommendation with OWNER + COST + DEADLINE. Without those three, it's a suggestion, not an insight.
Our data: "Operations: introduce reservation system for peak hours. Cost: medium. This month." That's actionable. "Reduce overcrowding" is not.
When to reach for it: Required format for rubric 6.7. Required format for any consulting report you ever write.
Gensim LDA
LDA (Latent Dirichlet Allocation)
What it is: A probabilistic topic model. Every document is a mix of topics; every topic is a distribution over words.
Our data: `num_topics=10`. Runs on Colab A100 in 3 min. Heavy preprocessing helps.
When to reach for it: When you want smooth, general themes AND probabilistic topic-per-document assignment. BERTopic is sharper; LDA is more interpretable.
Gensim
What it is: A Python library for topic modelling. Canonical implementation of LDA.
Our data: `LdaMulticore` for parallel fit. Saved corpus + dictionary to pickle.
When to reach for it: The standard. Don't write LDA from scratch.
num_topics hyperparameter
What it is: How many topics LDA finds. You set it before training.
Our data: Set to 10 (rubric suggestion). The coherence curve is shallow from 5 to 20, so 10 is near-optimal.
When to reach for it: When the rubric or client gives you a number, use it and move on. Deep hyperparameter search is usually low-return.
pyLDAvis
What it is: An interactive visualisation for LDA topics. Distance map + salient-terms bar chart with relevance slider.
Our data: Output: `output/v3_10_pyldavis.html`.
When to reach for it: Much more informative than BERTopic's static visuals. Always include for LDA. Back it up with a static chart for PDF-safe sharing.
Reporting & consulting
OIR chain
What it is: Observation. Implication. Recommendation. Every finding needs all three.
Our data: O: Music volume correlates r=+0.60 with overall negativity. I: It's the cheapest irritant. R: Turn it down tomorrow; club manager owns it; zero budget.
When to reach for it: Senior consultants chain OIR automatically. Junior consultants jump to R. Academics stop at O. Your job: complete every chain.
Methodological triangulation
What it is: Running multiple methods on the same data and reporting where they agree.
Our data: FreqDist + BERTopic + LDA + LLM. Four methods on the same negatives. Cleaning and equipment dominate all four โ that's a fact, not an opinion.
When to reach for it: Use it when a stakeholder will challenge findings. Convergence across methods is your evidence.
Severity × frequency matrix
What it is: A grid that ranks findings by both how bad a complaint is and how often it appears.
Our data: Phase 12's 8 findings. Billing disputes: severity 10, frequency 148. Music: severity 5, frequency 201. Different fix-first candidates.
When to reach for it: Turns a long list of complaints into a decision. "What do we fix first" is the universal board question.
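The matrix reduces to one derived column; using the two findings quoted above (`priority` is an illustrative column name):

```python
import pandas as pd

findings = pd.DataFrame({
    "finding": ["billing disputes", "music volume"],
    "severity": [10, 5],
    "frequency": [148, 201],
})

# One number per finding answers "what do we fix first".
findings["priority"] = findings["severity"] * findings["frequency"]
ranked = findings.sort_values("priority", ascending=False)
print(ranked[["finding", "priority"]].to_string(index=False))
```

Here billing (10 × 148 = 1480) outranks music (5 × 201 = 1005) despite appearing less often, which is exactly why frequency alone misleads.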
Word count discipline
What it is: Writing to a strict word limit.
Our data: Rubric 8.1: 800-1000 words. Our report: 976.
When to reach for it: Over or under is a mark drop. Long reports aren't better reports; compressed reports are.
Citability
What it is: Every claim in the report traces to a specific figure, table, or cell in the notebook.
Our data: "r=0.60 correlation" → cite the cell. "London Stratford worst club" → cite the per-club brief.
When to reach for it: Graders and sceptical stakeholders test claims by picking one and asking "show me". Unsupported = drop.
Executive summary discipline
What it is: Write the compressed version first. Build the analysis that supports it.
Our data: The top 20 tips and rubric audio clips were produced BEFORE the final notebook refactor. They forced prioritisation.
When to reach for it: Changes what you build, because you know what has to fit.