DS301 NLP project · private preview

NLP Cribsheet · PureGym

Caveman-clear definitions. Precise grammar. Examples from our data. 54 entries across 9 sections.

Sections

  1. Data load & clean (5)
  2. Preprocessing: philosophy and recipes (7)
  3. Frequency & word counting (4)
  4. BERTopic & embedding models (11)
  5. Emotion classification (5)
  6. LLMs & topic extraction (7)
  7. Gensim LDA (4)
  8. Reporting & consulting (6)
  9. Meta / workflow / project (5)

Data load & clean

DataFrame

What it is: A table in pandas. Columns have names. Rows are records.

Our data: `google_df` has 13,898 rows and 7 columns after the NaN drop. `trustpilot_df` has 15,815.

When to reach for it: Every step in this pipeline starts from a DataFrame and ends with a DataFrame.

NaN / missing value

What it is: A cell with no value. Not an empty string; a flag that says "nothing here".

Our data: ~9,000 Google rows had a star rating but no comment text. Not NLP-useful. We dropped them.

When to reach for it: `dropna(subset=['Comment'])` drops rows where the review text is missing, nothing else.
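
A minimal pandas sketch of that drop (toy three-row frame; the real `google_df` uses the column names above):

```python
import pandas as pd

# Toy frame standing in for google_df: one row has a score but no comment.
google_df = pd.DataFrame({
    'Comment': ['Great gym', None, 'Too busy'],
    'Overall Score': [5, 4, 2],
})

# Drop only rows whose review text is missing; NaNs elsewhere are untouched.
google_df = google_df.dropna(subset=['Comment'])
print(len(google_df))  # 2
```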

Column harmonisation

What it is: Making two sources' columns line up so you can merge them.

Our data: Google's `Comment` + `Overall Score` + `Club's Name` become one platform's `text` + `score` + `location`. Trustpilot same, from `Review Content` + `Review Stars` + `Location Name`.

When to reach for it: Do this before any cross-source analysis. Do it once, in one script.
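
A sketch of the harmonisation, assuming the column names listed above (single-row toy frames):

```python
import pandas as pd

# Illustrative single-row frames; the real data has 13,898 / 15,815 rows.
google = pd.DataFrame({'Comment': ['Great gym'], "Club's Name": ['Leeds'],
                       'Overall Score': [5]})
trustpilot = pd.DataFrame({'Review Content': ['Too busy'],
                           'Location Name': ['Leeds'], 'Review Stars': [2]})

# One shared schema, tagged with the source platform.
google = google.rename(columns={'Comment': 'text', "Club's Name": 'location',
                                'Overall Score': 'score'}).assign(platform='google')
trustpilot = trustpilot.rename(columns={'Review Content': 'text',
                                        'Location Name': 'location',
                                        'Review Stars': 'score'}).assign(platform='trustpilot')

both = pd.concat([google, trustpilot], ignore_index=True)
print(list(both.columns))
```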

Title + Content merge (Trustpilot)

What it is: Join the short headline to the full review before any model sees the text.

Our data: 59% of Trustpilot Review Titles carry information NOT in Review Content. Dropping titles loses that signal.

When to reach for it: When a source gives you two text fields for the same thing, check whether both add info before picking one.
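
One way to do the merge in pandas (hypothetical two-row frame; real column names as above):

```python
import pandas as pd

tp = pd.DataFrame({
    'Review Title': ['Broken showers', None],
    'Review Content': ['Cold water for two weeks.', 'Friendly staff.'],
})

# Prepend the headline when it exists; otherwise keep the body alone.
tp['text'] = (tp['Review Title'].fillna('') + '. ' + tp['Review Content']).str.lstrip('. ')
print(tp['text'].tolist())
```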

Language detection filter

What it is: Drop reviews that are not in your target language.

Our data: `langdetect` flagged 9.5% as non-English: Danish, German, Welsh, French. 2,047 Google + 858 Trustpilot rows held aside.

When to reach for it: Filter on detected language only when your downstream models are monolingual. Keep the non-English set for audit.

Preprocessing: philosophy and recipes

Lowercasing

What it is: Turn all letters into lowercase.

Our data: "The GYM is FANTASTIC" becomes "the gym is fantastic".

When to reach for it: Do it before counting words (FreqDist, LDA, wordcloud). Don't bother for BERT: the uncased model's tokeniser already lowercases for you.

Stopwords

What it is: High-frequency words that appear everywhere and mean nothing.

Our data: `the`, `and`, `is`, plus domain-specifics we added: `pure`, `puregym`, `gym`.

When to reach for it: Strip for the counting branch (FreqDist, LDA, wordcloud). Keep for BERT: it uses their positions.

Tokenisation

What it is: Split a string into individual word pieces.

Our data: "don't go" becomes `['do', "n't", 'go']` via NLTK `word_tokenize`.

When to reach for it: Needed for any counting step. BERT uses its own subword tokeniser; don't double-tokenise.
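
A toy approximation of that contraction split (the project uses `nltk.word_tokenize`; this regex handles only the `n't` case and is purely illustrative):

```python
import re

def toy_tokenize(text):
    """Rough stand-in for NLTK word_tokenize: peels off n't contractions,
    otherwise grabs word runs."""
    return re.findall(r"\w+(?=n't)|n't|\w+", text)

print(toy_tokenize("don't go"))  # ['do', "n't", 'go']
```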

Lemmatisation

What it is: Reduce words to their base form. `running` → `run`; `better` → `good`.

Our data: Used in the Gensim LDA branch. Not used for BERTopic.

When to reach for it: Collapses related forms for counting models. Never lemmatise before BERT: it destroys tense/mood signals.

Counting branch

What it is: The pipeline that wants clean, de-noised text: lowercase, stopword-strip, number-strip, maybe lemmatise.

Our data: Feeds FreqDist, wordclouds, Gensim LDA.

When to reach for it: Any model where the output is a count or probability over tokens.
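
The whole counting-branch recipe fits in a few lines (plain-Python sketch; the real pipeline uses NLTK, and the stopword list here is a tiny illustrative subset):

```python
import re

# Standard stopwords plus the domain terms we added.
STOPWORDS = {'the', 'and', 'is', 'a', 'pure', 'puregym', 'gym'}

def counting_branch(text):
    """Lowercase, tokenise, strip stopwords and digits in one pass."""
    tokens = re.findall(r'[a-z]+', text.lower())  # letters only: drops numbers
    return [t for t in tokens if t not in STOPWORDS]

print(counting_branch('The GYM is FANTASTIC and the staff are great'))
# ['fantastic', 'staff', 'are', 'great']
```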

Embedding branch

What it is: The pipeline that wants raw, unmodified sentences.

Our data: Feeds BERTopic, BERT emotion, Qwen LLM.

When to reach for it: Any model where the internal mechanism reads language. Preprocessing destroys its signal.

The one rule

What it is: If the model understands language, feed it language. If the model counts words, clean the words.

Our data: Same raw text, two branches, opposite recipes.

When to reach for it: Quote this when a teammate says "we always lowercase before modelling". They don't know their model.

Frequency & word counting

Frequency distribution (FreqDist)

What it is: A count of how often each word appears in a corpus.

Our data: Top 10 Google words after domain stopwords: `equipment`, `staff`, `clean`, `broken`, `busy`, ...

When to reach for it: Cheapest descriptive statistic for text. Use as sanity check, not as evidence.
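
NLTK's `FreqDist` is essentially a `Counter` over tokens, so the idea can be sketched with the standard library (illustrative tokens):

```python
from collections import Counter

tokens = ['equipment', 'staff', 'clean', 'equipment', 'broken', 'equipment']
freq = Counter(tokens)
print(freq.most_common(2))  # [('equipment', 3), ...]
```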

Bar plot of top N words

What it is: A horizontal bar chart of the top N words by frequency.

Our data: Rubric 2.6. Saved as `output/v3_05_top10_google.png`, `_trustpilot.png`.

When to reach for it: Always horizontal: word labels don't fit vertically. Reverse the sort so the biggest bar sits at the top.

Word cloud

What it is: A visual where word size is proportional to frequency.

Our data: Rubric 2.7. Four files: all + negative, per platform.

When to reach for it: Decorative. Two words with frequencies 1,000 and 1,001 look identical. Don't use as analytic evidence.

Score-filtered subset

What it is: The subset of rows where the star rating is below a threshold.

Our data: Negatives = score < 3. 5,825 English negatives across both platforms.

When to reach for it: Filter the ORIGINAL DataFrame (not the tokenised one), then re-tokenise on the filtered set. The instructor was explicit about this order.
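
The order of operations, sketched in pandas (toy frame):

```python
import pandas as pd

df = pd.DataFrame({'text': ['Great gym', 'Broken machines', 'Too busy'],
                   'score': [5, 1, 2]})

# Filter the ORIGINAL frame first; tokenise the survivors afterwards.
negatives = df[df['score'] < 3]
print(len(negatives))  # 2
```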

BERTopic & embedding models

BERTopic

What it is: A topic-modelling pipeline: sentence embeddings → UMAP → HDBSCAN → c-TF-IDF.

Our data: Found 191 topics on 18,000 common-location reviews. Top cluster: equipment + air + staff + showers.

When to reach for it: When you want specific, sharp clusters of complaints. Feed it RAW sentences, not preprocessed tokens.

Sentence transformer

What it is: A neural net that turns a sentence into a fixed-length vector. Similar sentences land near each other.

Our data: `all-MiniLM-L6-v2`, 384-dimensional output; BERTopic's default embedder.

When to reach for it: The first step of BERTopic. Replace it with a larger model if you need better semantic matching.

UMAP

What it is: A dimensionality reducer. Turns a 384-dim vector into ~5 dims for clustering.

Our data: Configured with `random_state=42`. Without the seed, BERTopic gives different clusters each run.

When to reach for it: You MUST seed it for reproducibility. Also sort your documents before fitting: order matters too.

HDBSCAN

What it is: A density-based clusterer. Groups nearby points; leaves isolated points as outliers.

Our data: Labels unclustered points as topic -1 (the outlier bucket).

When to reach for it: Standard inside BERTopic. No Python 3.14 Windows wheel exists, so the local preview uses KMeans; Colab uses the real thing.

c-TF-IDF

What it is: TF-IDF applied per cluster to find the words most distinctive to each topic.

Our data: How BERTopic gets from "cluster 0" to the label `[equipment, air, staff, showers]`.

When to reach for it: Always tune BERTopic's vectoriser with `stop_words` and `min_df` for cleaner labels.

Topic (BERTopic cluster)

What it is: A group of semantically similar documents with a list of representative words.

Our data: 191 topics on the full set. 64 on the top-30-locations subset. Similar themes, different granularity.

When to reach for it: Match the granularity to the question: the full set for fine-grained detail, the top-locations subset for cross-club comparison.

Outlier topic (-1)

What it is: The "couldn't cluster" bucket. Not a topic; noise.

Our data: 43% of documents were outliers before reduction. Reduced to 0% by two-stage outlier reduction.

When to reach for it: Never report the outlier bucket as a finding. Reduce it. Accept what's left.

Intertopic distance map

What it is: A 2D scatter plot of topic positions. Close-on-screen means semantically similar.

Our data: `visualize_topics()` output in `output/v3_06_intertopic_distance.html`.

When to reach for it: Good for intuition. Bad for measurement: 2D distorts high-dimensional distances.

Similarity heatmap

What it is: A matrix showing cosine similarity between every pair of topics.

Our data: `visualize_heatmap()` in `output/v3_06_similarity_heatmap.html`.

When to reach for it: Use to defend merge decisions. "Topics 4 and 11 have 0.78 similarity → we merge them".

Cosine similarity

What it is: The cosine of the angle between two vectors. 1 = identical direction; 0 = unrelated.

Our data: Used throughout embedding comparisons. Week 1.3.2 of the course taught it explicitly.

When to reach for it: The default similarity measure for word and sentence embeddings. Better than Euclidean for vectors of different magnitudes.
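
The formula is small enough to write out (plain-Python sketch; in practice use a vectorised library implementation):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [1, 0]))   # 1.0
print(cosine_similarity([1, 0], [0, 1]))   # 0.0
print(cosine_similarity([1, 0], [10, 0]))  # 1.0 -- magnitude does not matter
```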

Reproducibility (BERTopic)

What it is: Same inputs → same outputs across runs.

Our data: Two-step fix: `UMAP(random_state=42)` AND `sort_values('Creation Date')` before `fit_transform`.

When to reach for it: Required for any BERTopic result you put in a report. Cohort failure mode: students forget the sort.
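
The two-step fix as a configuration sketch (assumes the `bertopic` and `umap-learn` packages; `df` stands for the merged review frame and is not defined here):

```python
from umap import UMAP
from bertopic import BERTopic

# Step 1: fix document order -- BERTopic's output depends on it.
docs = df.sort_values('Creation Date')['text'].tolist()

# Step 2: seed the dimensionality reducer.
topic_model = BERTopic(umap_model=UMAP(random_state=42))
topics, probs = topic_model.fit_transform(docs)
```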

Emotion classification

Emotion classifier

What it is: A model that labels each text with one emotion from a fixed class set.

Our data: Rubric mandates `bhadresh-savani/bert-base-uncased-emotion`. Six classes: anger, sadness, joy, love, fear, surprise (Ekman's set with `love` in place of `disgust`). No `neutral` class.

When to reach for it: Needed for rubric section 5. Swap for `SamLowe/roberta-base-go_emotions` (28 classes) in a second pass for accuracy.

Domain mismatch

What it is: Training data ≠ inference data. The model misreads what it wasn't trained on.

Our data: bhadresh was trained on shouty tweets. British gym reviews are polite prose. Polite anger gets labelled as sadness.

When to reach for it: Always check the training distribution against your inference distribution. It's the #1 silent source of error.

Probability distribution (classifier output)

What it is: A score per class. Full distribution beats "top-1 only".

Our data: 51% anger / 49% sadness is very different from 95% anger. Store the full dist, not just the top class.

When to reach for it: Persist the richest form you have. Re-running inference is expensive; aggregating is cheap.
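
A tiny illustration of why the margin matters, using the illustrative numbers above:

```python
# Two reviews with the same top-1 label but very different certainty.
confident = {'anger': 0.95, 'sadness': 0.03, 'joy': 0.02}
borderline = {'anger': 0.51, 'sadness': 0.49, 'joy': 0.00}

def top1_margin(dist):
    """Gap between the best and second-best class probabilities."""
    a, b = sorted(dist.values(), reverse=True)[:2]
    return round(a - b, 2)

print(top1_margin(confident))   # 0.92
print(top1_margin(borderline))  # 0.02
```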

Neutral class

What it is: A class meaning "no strong emotion".

Our data: Absent in bhadresh. Present in go_emotions, where it accounts for 467 of 1,500 reviews ≈ 31%.

When to reach for it: A classifier with no neutral class overstates emotion presence. Flag this in your writeup.

Reclassification pass (Phase 8c)

What it is: Running a second, richer classifier and noting where it disagrees with the first.

Our data: 10.5% of bhadresh-sadness labels reclassified to anger-family (annoyance, disappointment, disapproval) under go_emotions.

When to reach for it: When you can't swap the mandated model, run a second pass and report the delta.

LLMs & topic extraction

LLM (Large Language Model)

What it is: A neural net with billions of parameters that generates natural-language output from a prompt.

Our data: Qwen2.5-72B-Instruct via HF Inference Providers. Replaces the rubric's Falcon-7B.

When to reach for it: When you need structured natural-language summaries. When prompt engineering will be cheaper than training.

Prompt

What it is: The instruction you send to an LLM. Gets concatenated with the input data.

Our data: Verbatim rubric prompt: "In the following customer review, pick out the main 3 topics. Return them in a numbered list format, with each one on a new line."

When to reach for it: Write it verbatim when the rubric specifies. Test on 3-5 examples before running on hundreds.

Temperature (LLM)

What it is: A sampling knob, typically 0 to 2, that controls output randomness. 0 = near-deterministic. 1 = creative.

Our data: `temperature=0.1` for topic extraction (we want consistency). `temperature=0.3` for insights (we want some variation).

When to reach for it: Low temperatures for classification and extraction. Higher for creative generation. Never 0 for creative tasks.

HuggingFace Inference Providers

What it is: A paid API that routes your LLM calls to providers (Together, Fireworks, Groq, Novita).

Our data: 600 reviews × 2 API calls in 120 s total. 100% success rate. No GPU needed locally.

When to reach for it: The cheapest path to frontier-model output. Free tier has rate limits; HF Pro raises them.

Parse (LLM output)

What it is: Turning model output back into structured data (a list, a JSON, a table).

Our data: Regex to strip `1.`, `2.`, bullet markers. 100% success: clean prompts produce parseable output.

When to reach for it: Always validate after parsing. 70B models are reliable but not perfect.
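
A sketch of the parse-and-validate step (toy LLM output; the exact regex in the project may differ):

```python
import re

raw = """1. Overcrowding at peak hours
2. Broken equipment
3. Friendly staff"""

# Strip numbered-list / bullet markers from each line, then validate the count.
topics = [re.sub(r'^\s*(?:\d+\.|[-*])\s*', '', line).strip()
          for line in raw.splitlines() if line.strip()]
assert len(topics) == 3, 'expected exactly three topics'
print(topics[0])  # Overcrowding at peak hours
```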

Meta-BERTopic

What it is: Running BERTopic on the LLM's distilled topic phrases, not on raw reviews.

Our data: Clusters 1,433 Qwen-extracted phrases into ~15 interpretable themes.

When to reach for it: A two-stage pipeline (LLM distil, then cluster) beats one-stage clustering on raw text when the audience is human.

Actionable insight

What it is: A recommendation with OWNER + COST + DEADLINE. Without those three, it's a suggestion, not an insight.

Our data: "Operations: introduce reservation system for peak hours. Cost: medium. This month." That's actionable. "Reduce overcrowding" is not.

When to reach for it: Required format for rubric 6.7. Required format for any consulting report you ever write.

Gensim LDA

LDA (Latent Dirichlet Allocation)

What it is: A probabilistic topic model. Every document is a mix of topics; every topic is a distribution over words.

Our data: `num_topics=10`. Runs on Colab A100 in 3 min. Heavy preprocessing helps.

When to reach for it: When you want smooth, general themes AND probabilistic topic-per-document assignment. BERTopic is sharper; LDA is more interpretable.

Gensim

What it is: A Python library for topic modelling. Canonical implementation of LDA.

Our data: `LdaMulticore` for parallel fit. Saved corpus + dictionary to pickle.

When to reach for it: The standard. Don't write LDA from scratch.

num_topics hyperparameter

What it is: How many topics LDA finds. You set it before training.

Our data: Set to 10 (rubric suggestion). Coherence curve is shallow from 5 to 20: 10 is near-optimal.

When to reach for it: When the rubric or client gives you a number, use it and move on. Deep hyperparameter search is usually low-return.

pyLDAvis

What it is: An interactive visualisation for LDA topics. Distance map + salient-terms bar chart with relevance slider.

Our data: Output: `output/v3_10_pyldavis.html`.

When to reach for it: Much more informative than BERTopic's static visuals. Always include for LDA. Back it up with a static chart for PDF-safe sharing.

Reporting & consulting

OIR chain

What it is: Observation. Implication. Recommendation. Every finding needs all three.

Our data: O: Music volume correlates r=+0.60 with overall negativity. I: It's the cheapest irritant. R: Turn it down tomorrow; club manager owns it; zero budget.

When to reach for it: Senior consultants chain OIR automatically. Junior consultants jump to R. Academics stop at O. Your job: complete every chain.

Methodological triangulation

What it is: Running multiple methods on the same data and reporting where they agree.

Our data: FreqDist + BERTopic + LDA + LLM. Four methods on the same negatives. Cleaning and equipment dominate all four: that's a fact, not an opinion.

When to reach for it: Use it when a stakeholder will challenge findings. Convergence across methods is your evidence.

Severity × frequency matrix

What it is: A grid that ranks findings by both how bad a complaint is and how often it appears.

Our data: Phase 12's 8 findings. Billing disputes: severity 10, frequency 148. Music: severity 5, frequency 201. Different fix-first candidates.

When to reach for it: Turns a long list of complaints into a decision. "What do we fix first" is the universal board question.

Word count discipline

What it is: Writing to a strict word limit.

Our data: Rubric 8.1: 800-1000 words. Our report: 976.

When to reach for it: Over or under is a mark drop. Long reports aren't better reports; compressed reports are.

Citability

What it is: Every claim in the report traces to a specific figure, table, or cell in the notebook.

Our data: "r=0.60 correlation" → cite the cell. "London Stratford worst club" → cite the per-club brief.

When to reach for it: Graders and sceptical stakeholders test claims by picking one and asking "show me". Unsupported = drop.

Executive summary discipline

What it is: Write the compressed version first. Build the analysis that supports it.

Our data: The top 20 tips and rubric audio clips were produced BEFORE the final notebook refactor. They forced prioritisation.

When to reach for it: Changes what you build, because you know what has to fit.

Meta / workflow / project

Rubric

What it is: The grading criteria. A list of items the marker ticks.

Our data: 48 items. All 48 anchored in our submission notebook (Appendix A explicit tick cells).

When to reach for it: Always map the rubric to your artefacts before you start, not after. Drives structure.

Verbal rubric amendment

What it is: A change to the rubric made in a live session but not reflected in the written doc.

Our data: Instructor said "pick another one" for both the LLM and the emotion classifier. Written rubric still says Falcon and bhadresh.

When to reach for it: Re-watch every Q&A before submitting. The written rubric is a floor; the verbal amendments raise the ceiling.

Don't over-engineer

What it is: Stop at what the brief asks for. Label anything beyond as optional.

Our data: Instructor: "The codes he shared from weeks 3 and 4 are all that is needed." Our ABSA, churn, per-club briefs sit in a labelled "beyond rubric" section.

When to reach for it: Over-delivery confuses graders and clients. Clear brief compliance plus clearly-labelled extras wins.

Brain DB (doc_chunks)

What it is: A pgvector table on Hetzner with chunk + embedding + metadata. Retrievable by semantic and text search.

Our data: 775 chunks ingested: 57 Q&A transcript + 17 extracted learnings + 701 WhatsApp cohort messages.

When to reach for it: Ingest all course material once. Future sessions can recall "what did the instructor say about X" in one query.

Reproducibility as condition

What it is: Not a virtue. The line between "demo" and "work you can defend".

Our data: Every random step has a seed. Every script reads from parquet, not re-fetches.

When to reach for it: Set it once at the start. Retrofitting reproducibility mid-project is painful and error-prone.