Top 20 tips — PACE DS301 NLP project
Hand-curated from Q&A transcript (2026-04-16, Speaker 0 = instructor Russell) + CAM P4 WhatsApp cohort (Nov 2025 – Apr 2026). Auto-gen by Qwen was too noisy; this version is sourced to specific quotes.
Legend
- Scope — required = rubric tick; recommended = cohort consensus; advanced = goes beyond; meta = process/submission
- Source — QA = live Q&A 2026-04-16; WA = cohort WhatsApp; both = independently confirmed
- Conf — H/M/L
Summary table
| # | Tip | Rubric | Scope | Source | Conf |
|---|---|---|---|---|---|
| 1 | Add pure, pure gym, gym, puregym to stopwords (counting branch only) | 2.3, 2.5, 7.1 | required | both | H |
| 2 | For the negative-review wordcloud + freq, filter the original DataFrame (post-NaN, score<3) — NOT the tokenised data | 2.8 | required | QA | H |
| 3 | Feed full sentences to BERTopic and the LLM, not preprocessed tokens | 3.2, 6.2 | required | QA | H |
| 4 | You can swap Falcon for any HF model — "pick another one if you want" | 6.1 | required | QA | H |
| 5 | Cohort consensus: phi-4-mini via Ollama is an acceptable Falcon replacement | 6.1 | recommended | WA | H |
| 6 | Set the UMAP random_state so BERTopic is reproducible | 3.2, 5.6 | required | WA | H |
| 7 | Keep the data-order identical between BERTopic runs (sort before fit) | 3.2 | recommended | WA | H |
| 8 | Emotion classifier swap is also permitted — "pick another one if you want" | 5.1 | recommended | QA | H |
| 9 | Merge Google + Trustpilot negatives into one set for the LLM / emotion step — "don't worry too much about that" | 3.1, 5.3 | required | QA | H |
| 10 | For Trustpilot, concatenate Review Title + Review Content before passing to any model | 3.1 | recommended | WA | H |
| 11 | Do NOT over-engineer. The instructor has said the week 3 & 4 notebooks are all that is needed | — | meta | both | H |
| 12 | Top-30-locations BERTopic often returns only ~2 topics — this is fine, note it in the report | 4.4 | recommended | WA | M |
| 13 | Pre-processing is cheap — run it on the whole combined dataset, not a sample | 2.3 | recommended | WA | H |
| 14 | Match your notebook's headings 1-to-1 with the rubric wording — the cohort's "79 headings" pattern | 8.2, 8.5 | recommended | WA | M |
| 15 | Emojis: assess first (count distribution), then decide — don't blanket-strip | 2.3 | recommended | QA | M |
| 16 | Section 6.7 wants suggestions as a list, not a paragraph — use bullet points | 6.7 | required | QA | H |
| 17 | Falcon-on-T4 OOMs; if using Falcon specifically, request an A100 or skip Falcon entirely | 6.1 | meta | WA | H |
| 18 | Falcon needs an older transformers version that conflicts with BERTopic — run each in a separate Colab session | 6.1 | meta | WA | H |
| 19 | Report 800-1000 words; over/under is an explicit rubric item | 8.1 | required | rubric | H |
| 20 | The rubric has been verbally amended in lectures (Ollama option, stopwords list). Re-watch the last session before submitting | — | meta | WA | H |
1. Add pure, pure gym, gym, puregym to stopwords (counting branch only)
scope: required · source: both · conf: H · rubric 2.3, 2.5, 7.1
Action. Extend stopwords.words('english') with ['pure', 'gym', 'puregym', 'pure gym'] in the preprocessing step. Apply it to the counting branch (freq dist, wordcloud, LDA). Do NOT strip these from the embedding / BERT / LLM branches — those want full prose.
Risk if missed. The top 10 words will be dominated by "pure", "gym", "puregym" and reviewers will ask "why is the word cloud just the company name?"
[QA] "Pure, pure gem, and gem. Technically speaking two of them would deal with the other one. But just to make sure, pure, pure gym and gym are the three main ones" [WA] "Just a reminder don't forget to take out PureGym and pure and gym from the dataset when you start. It is not in the rubric but Russell was mentioning it a few times" (+44 7459, 13/04/2026)
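The extension above can be sketched as plain list/set operations. `BRAND_TERMS`, `counting_stopwords`, and `filter_tokens` are illustrative names, not course code; the base list is whatever `stopwords.words('english')` returns once `nltk.download('stopwords')` has run:

```python
# Sketch: brand terms added to the counting-branch stopword list only.
BRAND_TERMS = ['pure', 'gym', 'puregym', 'pure gym']

def counting_stopwords(base_stopwords):
    # Set union keeps membership checks O(1) during filtering
    return set(base_stopwords) | set(BRAND_TERMS)

def filter_tokens(tokens, stop):
    # Lowercase before the lookup so 'Pure' and 'pure' both match
    return [t for t in tokens if t.lower() not in stop]
```

The embedding / BERT / LLM branches skip `filter_tokens` entirely and receive the raw review text.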
2. For the negative-review wordcloud + freq, filter the original DataFrame, not the tokenised data
scope: required · source: QA · conf: H · rubric 2.8
Action. In the cell that creates the negative subset: start from the phase-1 or phase-3 DataFrame (post-NaN drop, language-filtered), apply score < 3, then redo word_tokenize + FreqDist + wordcloud on THAT set. Don't reuse the already-tokenised rows.
Risk if missed. You re-tokenise tokenised text → bugs. Also the rubric text says "create a new DataFrame by filtering out the data" — literal filtering on the source.
[QA] "create a new data frame by filtering out the data to extract only the negative reviews. So it doesn't say create create it from the tokenized data. It says create a new data frame."
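A minimal sketch of the "filter the source frame" step; `Score` and `Comment` are hypothetical column names, so adjust to your phase-1/phase-3 frame:

```python
import pandas as pd

def negative_subset(df, score_col='Score', threshold=3):
    """Filter the original (post-NaN, language-filtered) frame by score.
    Tokenisation, FreqDist and the wordcloud are then re-run on the result."""
    return df[df[score_col] < threshold].copy()
```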
3. Feed full sentences to BERTopic and the LLM, not preprocessed tokens
scope: required · source: QA · conf: H · rubric 3.2, 6.2
Action. Pass the raw Comment / Review Content string straight into bertopic.fit_transform() and into the LLM prompt. BERT handles grammar and position; tokenising + stopword-stripping destroys the signal.
Risk if missed. Topics degrade; reviewers get fragmentary keyword clusters.
[QA] "you wanna hand in the full sentence because it's got its own, as you say, it's already preset to go. All you have to do is hand in the full text, and it will understand what the text is." [QA] "BERT requires or doesn't require, but can take the full review and understand it better than if we split it up and take away the all the little stop words and things."
4. You can swap Falcon for any HF model
scope: required · source: QA · conf: H · rubric 6.1
Action. Rubric section 6 heading says "Using a large language model from Hugging Face". Instructor confirmed verbally: pick what works. Qwen2.5-72B via HF Inference Providers is fine.
Risk if missed. Trying to force Falcon-7B wastes Colab credits and delivers garbage output.
[QA] "So yeah. I know. Fine. Okay. Alex. ... Pick another one if you want." [WA] "he verbally changed the rubric … With reference to allowing us to use Ollama instead of Falcon" (+44 7359, 14/04/2026)
5. Cohort-recommended Falcon replacement: phi-4-mini via Ollama
scope: recommended · source: WA · conf: H · rubric 6.1
Action. If you want a name the graders can't argue with, phi-4-mini via Ollama is the cohort-pinned choice. Qwen2.5-72B is stronger but less "course-official"-feeling.
[WA] "For the rubric: gym, pure & puregym have to be removed. And phi 4 mini from Ollama can be used instead of Falcon. Do you agree then I will pin this message" (David Van De Vijver, 14/04/2026)
6. Set the UMAP random_state for BERTopic reproducibility
scope: required · source: WA · conf: H · rubric 3.2, 5.6
Action.
```python
from umap import UMAP
from bertopic import BERTopic

# Fix the seed on UMAP, the stochastic step in the BERTopic pipeline
umap_model = UMAP(n_neighbors=10, n_components=5, random_state=42)
topic_model = BERTopic(umap_model=umap_model)
```
HDBSCAN downstream is deterministic; UMAP is the source of randomness.
Risk if missed. Re-running the notebook gives different cluster numbers → grader sees drifting results.
[WA] "You can set the seed using UMAP. The other parts of BERTopic are deterministic" (David Van De Vijver, 14/04/2026)
7. Keep the BERTopic document order stable between runs
scope: recommended · source: WA · conf: H · rubric 3.2
Action. sort_values('Creation Date').reset_index(drop=True) before fit_transform. Even with the UMAP seed, if your docs list re-orders (e.g. different pandas operations upstream) you'll get different clusters.
[WA] "if you want results from BERTopic to be repeatable, make sure that the data it uses in docs is identical each run - including its order… I found out the hard way!" (+44 7404, 14/04/2026)
8. Emotion classifier swap is also permitted
scope: recommended · source: QA · conf: H · rubric 5.1
Action. bhadresh-savani/bert-base-uncased-emotion is tweet-trained and under-labels anger in polite UK prose. Swap to SamLowe/roberta-base-go_emotions (28 classes inc. "annoyance", "disappointment") and note the reason in the report.
[QA] "Pick another one if you want." (instructor, in response to the emotion-classifier question)
9. Merge Google + Trustpilot negatives into one set for the LLM / emotion step
scope: required · source: QA · conf: H · rubric 3.1, 5.3
Action. Concat both platforms' negatives into a single DataFrame, then run the emotion classifier and LLM on that merged set. The instructor explicitly said not to sweat location mismatch at this step.
[QA] "Oh, yeah. Yeah. Yeah. Sort of. But if you put them together into one thing, don't worry too much about that. Yeah. If you've already done it, then it'll be fine."
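The merge is a plain concat; `merged_negatives` and the `platform` column are illustrative, but tagging the source keeps per-platform splits possible later:

```python
import pandas as pd

def merged_negatives(google_neg, trustpilot_neg):
    """Stack both platforms' negative reviews into one frame,
    tagging each row's source platform."""
    return pd.concat(
        [google_neg.assign(platform='google'),
         trustpilot_neg.assign(platform='trustpilot')],
        ignore_index=True,
    )
```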
10. For Trustpilot, concatenate Review Title + Review Content before any model
scope: recommended · source: WA · conf: H · rubric 3.1
Action. tp['text'] = tp['Review Title'].fillna('') + '. ' + tp['Review Content'].fillna(''). Trustpilot titles often contain the core complaint; throwing them away loses signal.
[WA] "Project question. In trustpilot we have Review title and Review Content columns. Have you merged them?" (+44 7842, 12/04/2026) — cohort confirmed yes
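The one-liner above leaves a stray leading ". " when the title is missing; a slightly more careful sketch (function name is illustrative, column names are from the Trustpilot export):

```python
import pandas as pd

def coalesce_review_text(tp):
    """Join Review Title and Review Content; rows with no title
    fall back to the body alone instead of starting with '. '."""
    title = tp['Review Title'].fillna('').str.strip()
    body = tp['Review Content'].fillna('').str.strip()
    return [f"{t}. {b}" if t else b for t, b in zip(title, body)]
```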
11. Do NOT over-engineer. Week 3 & 4 notebooks are the benchmark
scope: meta · source: both · conf: H
Action. The instructor has said explicitly that the notebooks shared in weeks 3 and 4 contain all the code you need for the project. Advanced additions (ABSA, churn intent, per-club briefs, multilingual) should be clearly labelled "Beyond rubric" and separated from the main submission, not interleaved.
Risk if missed. A grader who expected a simple rubric-pass may dock for "over-scoped", or miss the rubric ticks under the bonus noise.
[WA] "the codes he shared from week 3 and 4 are all that is needed for this project" (Pierre's paraphrase of instructor, 2026-04-16)
12. Top-30-location BERTopic commonly returns only 2 topics — not necessarily wrong
scope: recommended · source: WA · conf: M · rubric 4.4
Action. If rubric 4.4's BERTopic on top-30-locations gives ~2 topics, don't panic. Two students independently reported this. Document it in the commentary and compare to run 1 explicitly.
[WA] "I got the same results: two topics for top 30 locations. Are you sure that's wrong - and not an accurate view of the topic structure?" (Mark Griffiths PACE, 12/04/2026)
13. Pre-processing is cheap — run on the whole dataset, not a sample
scope: recommended · source: WA · conf: H · rubric 2.3
Action. Avoid explaining "we sampled X for compute reasons" in the report. On a modern laptop, lowercase + stopword-strip + tokenise for ~28k reviews runs in seconds.
[WA] "The pre-processing isn't computationally expensive and doesn't take long so I ran it on the whole of each dataset." (+44 7404, 13/04/2026)
14. Match notebook headings 1-to-1 with rubric wording
scope: recommended · source: WA · conf: M · rubric 8.2, 8.5
Action. The cohort's best-structured submissions have ~79 headings/subheadings in the notebook, each mapping verbatim to a rubric line. Makes grading trivial for the marker. Your RUBRIC_TICK_MAP.md is already this structure — mirror it in the notebook's markdown cells.
[WA] "I've just structured the notebook matched to the rubric - only 79 headings / subheadings" (+44 7404, 10/04/2026)
15. Emojis: assess first, then decide
scope: recommended · source: QA · conf: M · rubric 2.3
Action. Before stripping or keeping emojis, run a count: how many reviews contain an emoji? If <2% it probably doesn't matter; if >10% it might carry emotional signal worth keeping.
[QA] "I would do an assessment to see how many emojis there were, how many reviews on emojis in them."
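The assessment step can be a one-cell check. This is a crude heuristic (a few common emoji/symbol Unicode blocks, not exhaustive); `EMOJI_RE` and `emoji_share` are illustrative names:

```python
import re

# Rough match on common emoji/symbol blocks; not a complete emoji detector
EMOJI_RE = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF]')

def emoji_share(reviews):
    """Fraction of reviews containing at least one emoji-like character."""
    if not reviews:
        return 0.0
    return sum(1 for r in reviews if EMOJI_RE.search(r)) / len(reviews)
```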
16. Section 6.7 actionable insights as a bullet list, not a paragraph
scope: required · source: QA · conf: H · rubric 6.7
Action. The rubric says "list the output, ideally in the form of suggestions". Use a numbered or bulleted list. Pierre's v3_11_qwen_insights.md already does this — keep the format in the notebook too.
[QA] "list the output, ideally in the form of suggestions"
17. Falcon-on-T4 OOMs. Use A100 or skip Falcon
scope: meta · source: WA · conf: H · rubric 6.1
Action. Falcon-7B at fp16 needs ~15GB VRAM; a T4 has 16GB. Expect OOM errors on free-tier Colab. Colab Pro (A100) solves it, but the LLM swap (tip 4) solves it cheaper.
[WA] "T4 GPU is too short (out-of-memory) for some models such as Falcon (row 34 in the rubric). I have purchased some computer units for running A100." (+32 497, 10/04/2026)
18. Falcon + BERTopic version conflict — separate Colab sessions
scope: meta · source: WA · conf: H · rubric 6.1
Action. Falcon needs transformers<=4.30; BERTopic wants newer. Split the workflow:
1. Runtime 1: BERTopic → save outputs as parquet
2. Runtime menu → Disconnect and delete runtime
3. Runtime 2: fresh env + Falcon
Or dodge the whole thing by using the HF Inference Providers swap (tip 4).
[WA] "Falcon requires an older version of the Transformers library, which is incompatible with BERTopic. As far as I can tell, the two can't coexist in the same environment." (David Van De Vijver, 12/04/2026)
19. Report: 800-1000 words, strict
scope: required · source: rubric · conf: H · rubric 8.1
Action. Word-count report.md now. If over, cut the "approach" section first — your notebook documents the approach. Reserve the 800-1000 words for insights and business implications.
[rubric 8.1] "The report is between 800-1000 words."
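A quick check can live in a notebook cell; this naive whitespace count treats markdown syntax as words, so strip headings/links first if you want a figure closer to what a grader would count (`check_word_count` is an illustrative helper, not course code):

```python
def check_word_count(text, lo=800, hi=1000):
    """Return (word count, whether it falls inside the rubric band)."""
    n = len(text.split())
    return n, lo <= n <= hi
```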
20. Rubric has been verbally amended in lectures — keep watching recordings
scope: meta · source: WA · conf: H
Action. At least two rubric amendments have been made verbally (Ollama option, custom stopwords). Before submitting, scan the Q&A transcript for anything that might have changed since. Pierre has the 2026-04-16 transcript ingested; future Q&A sessions need the same treatment.
[WA] "darent log off early in case he verbally changes the rubric again." (+44 7359, 14/04/2026) [WA] "he verbally changed the rubric … With reference to allowing us to use Ollama instead of Falcon" (same user)
Integration status
| Tip | Integrated? | Where |
|---|---|---|
| 1 stopwords | 🟡 code change needed | add to STOPWORDS list in v3/v3_05_frequency_analysis.py |
| 2 original df filter | 🟢 already correct (verified) | v3_05_frequency_analysis.py |
| 3 full sentences | 🟢 already correct | Phase 6 BERTopic, Phase 11 LLM |
| 4 model swap | 🟢 already done | Qwen2.5-72B |
| 5 phi-4-mini | ⚪ optional alt-path | keeping Qwen |
| 6 UMAP seed | 🟡 verify | random_state=42 on UMAP in v3_06_bertopic.py |
| 7 data order | 🟡 to add | .sort_values('Creation Date') before fit_transform |
| 8 emotion swap | 🟡 to build | v3_08c_roberta_go_emotions.py |
| 9 merge G+TP | 🟢 already done | combined_phase4.parquet |
| 10 Trustpilot title+body | 🟡 verify | text coalesce in v3_02_clean.py |
| 11 don't over-engineer | 🟢 done | bonus work in separate v3_11_bonus_* files |
| 12 top-30 → 2 topics | ⚪ annotate if seen | |
| 13 no sampling | 🟢 done | preprocessing runs on the full dataset |
| 14 79 headings | 🟡 to check | submission notebook markdown depth |
| 15 emoji assess | 🟡 to add | emoji-count cell |
| 16 bullet list | 🟢 done | |
| 17 T4 OOM | N/A | using Inference API |
| 18 Falcon/BERTopic conflict | N/A | no Falcon |
| 19 800-1000 words | 🟡 to do | word-count report.md now |
| 20 keep watching | 🟢 done | Q&A ingested, pattern documented |