Restricted · PACE NLP

DS301 NLP project · private preview

Top 20 tips — PACE DS301 NLP project

Hand-curated from Q&A transcript (2026-04-16, Speaker 0 = instructor Russell) + CAM P4 WhatsApp cohort (Nov 2025 – Apr 2026). Auto-gen by Qwen was too noisy; this version is sourced to specific quotes.

Legend

- Scope — required = rubric tick; recommended = cohort consensus; advanced = goes beyond; meta = process/submission
- Source — QA = live Q&A 2026-04-16; WA = cohort WhatsApp; both = independently confirmed
- Conf — H/M/L

Summary table

| # | Tip | Rubric | Scope | Source | Conf |
|---|-----|--------|-------|--------|------|
| 1 | Add pure, pure gym, gym, puregym to stopwords (counting branch only) | 2.3, 2.5, 7.1 | required | both | H |
| 2 | For the negative-review wordcloud + freq, filter the original DataFrame (post-NaN, score<3) — NOT the tokenised data | 2.8 | required | QA | H |
| 3 | Feed full sentences to BERTopic and the LLM, not preprocessed tokens | 3.2, 6.2 | required | QA | H |
| 4 | You can swap Falcon for any HF model — "pick another one if you want" | 6.1 | required | QA | H |
| 5 | Cohort consensus: phi-4-mini via Ollama is an acceptable Falcon replacement | 6.1 | recommended | WA | H |
| 6 | Set the UMAP random_state so BERTopic is reproducible | 3.2, 5.6 | required | WA | H |
| 7 | Keep the data order identical between BERTopic runs (sort before fit) | 3.2 | recommended | WA | H |
| 8 | Emotion classifier swap is also permitted — "pick another one if you want" | 5.1 | recommended | QA | H |
| 9 | Merge Google + Trustpilot negatives into one set for the LLM / emotion step — "don't worry too much about that" | 3.1, 5.3 | required | QA | H |
| 10 | For Trustpilot, concatenate Review Title + Review Content before passing to any model | 3.1 | recommended | WA | H |
| 11 | Do NOT over-engineer — the instructor has said the week 3 & 4 notebooks are all that is needed | — | meta | both | H |
| 12 | Top-30-locations BERTopic often returns only ~2 topics — this is fine, note it in the report | 4.4 | recommended | WA | M |
| 13 | Pre-processing is cheap — run it on the whole combined dataset, not a sample | 2.3 | recommended | WA | H |
| 14 | Match your notebook's headings 1-to-1 with the rubric wording — the cohort's "79 headings" pattern | 8.2, 8.5 | recommended | WA | M |
| 15 | Emojis: assess first (count distribution), then decide — don't blanket-strip | 2.3 | recommended | QA | M |
| 16 | Section 6.7 wants suggestions as a list, not a paragraph — use bullet points | 6.7 | required | QA | H |
| 17 | Falcon-on-T4 OOMs; if using Falcon specifically, request an A100 or skip Falcon entirely | 6.1 | meta | WA | H |
| 18 | Falcon needs an older transformers version that conflicts with BERTopic — run each in a separate Colab session | 6.1 | meta | WA | H |
| 19 | Report 800-1000 words; over/under is an explicit rubric item | 8.1 | required | rubric | H |
| 20 | The rubric has been verbally amended in lectures (Ollama option, stopwords list). Re-watch the last session before submitting | — | meta | WA | H |

1. Add pure, pure gym, gym, puregym to stopwords (counting branch only)

scope: required · source: both · conf: H · rubric 2.3, 2.5, 7.1

Action. Extend stopwords.words('english') with ['pure', 'gym', 'puregym', 'pure gym'] in the preprocessing step. Apply it to the counting branch (freq dist, wordcloud, LDA). Do NOT strip these from the embedding / BERT / LLM branches — those want full prose.

Risk if missed. The top 10 words will be dominated by "pure", "gym", "puregym" and reviewers will ask "why is the word cloud just the company name?"

[QA] "Pure, pure gem, and gem. Technically speaking two of them would deal with the other one. But just to make sure, pure, pure gym and gym are the three main ones" [WA] "Just a reminder don't forget to take out PureGym and pure and gym from the dataset when you start. It is not in the rubric but Russell was mentioning it a few times" (+44 7459, 13/04/2026)

2. For the negative-review wordcloud + freq, filter the original DataFrame, not the tokenised data

scope: required · source: QA · conf: H · rubric 2.8

Action. In the cell that creates the negative subset: start from the phase-1 or phase-3 DataFrame (post-NaN drop, language-filtered), apply score < 3, then redo word_tokenize + FreqDist + wordcloud on THAT set. Don't reuse the already-tokenised rows.
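
A sketch of the filter step, assuming a hypothetical score column name ('score'); the real column name may differ in your cleaned DataFrame:

```python
import pandas as pd

def negative_subset(df: pd.DataFrame, score_col: str = 'score') -> pd.DataFrame:
    """Filter the cleaned source DataFrame (post-NaN drop, language-filtered)
    down to negative reviews. Tokenise/FreqDist/wordcloud are then redone on
    the raw text in this subset, not on previously tokenised rows."""
    return df[df[score_col] < 3].reset_index(drop=True)
```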

Risk if missed. You re-tokenise tokenised text → bugs. Also the rubric text says "create a new DataFrame by filtering out the data" — literal filtering on the source.

[QA] "create a new data frame by filtering out the data to extract only the negative reviews. So it doesn't say create create it from the tokenized data. It says create a new data frame."

3. Feed full sentences to BERTopic and the LLM, not preprocessed tokens

scope: required · source: QA · conf: H · rubric 3.2, 6.2

Action. Pass the raw Comment / Review Content string straight into bertopic.fit_transform() and into the LLM prompt. BERT handles grammar and position; tokenising + stopword-stripping destroys the signal.
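
A toy illustration of why (the stopword set here is a hand-picked subset of NLTK's English list, just for the demo):

```python
# Tiny illustrative subset of NLTK's English stopword list, for the demo only
stop = {'i', 'was', 'not', 'with', 'the'}

review = "I was not happy with the changing rooms"
tokens = [w for w in review.lower().split() if w not in stop]
# tokens == ['happy', 'changing', 'rooms']; the negation 'not' is gone,
# which is exactly the context the embedding model would have used
```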

Risk if missed. Topics degrade; reviewers get fragmentary keyword clusters.

[QA] "you wanna hand in the full sentence because it's got its own, as you say, it's already preset to go. All you have to do is hand in the full text, and it will understand what the text is." [QA] "BERT requires or doesn't require, but can take the full review and understand it better than if we split it up and take away the all the little stop words and things."

4. You can swap Falcon for any HF model

scope: required · source: QA · conf: H · rubric 6.1

Action. Rubric section 6 heading says "Using a large language model from Hugging Face". Instructor confirmed verbally: pick what works. Qwen2.5-72B via HF Inference Providers is fine.

Risk if missed. Trying to force Falcon-7B wastes Colab credits and delivers garbage output.

[QA] "So yeah. I know. Fine. Okay. Alex. ... Pick another one if you want." [WA] "he verbally changed the rubric … With reference to allowing us to use Ollama instead of Falcon" (+44 7359, 14/04/2026)

5. Cohort-recommended Falcon replacement: phi-4-mini via Ollama

scope: recommended · source: WA · conf: H · rubric 6.1

Action. If you want a name the graders can't argue with, phi-4-mini via Ollama is the cohort-pinned choice. Qwen2.5-72B is stronger but less "course-official"-feeling.

[WA] "For the rubric: gym, pure & puregym have to be removed. And phi 4 mini from Ollama can be used instead of Falcon. Do you agree then I will pin this message" (David Van De Vijver, 14/04/2026)

6. Set the UMAP random_state for BERTopic reproducibility

scope: required · source: WA · conf: H · rubric 3.2, 5.6

Action. Pin UMAP's random_state and pass the model into BERTopic:

```python
from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=10, n_components=5, random_state=42)
topic_model = BERTopic(umap_model=umap_model, ...)
```

HDBSCAN downstream is deterministic; UMAP is the source of randomness.

Risk if missed. Re-running the notebook gives different cluster numbers → grader sees drifting results.

[WA] "You can set the seed using UMAP. The other parts of BERTopic are deterministic" (David Van De Vijver, 14/04/2026)

7. Keep the BERTopic document order stable between runs

scope: recommended · source: WA · conf: H · rubric 3.2

Action. Apply sort_values('Creation Date').reset_index(drop=True) before fit_transform. Even with the UMAP seed, if your docs list re-orders (e.g. after different pandas operations upstream) you'll get different clusters.

[WA] "if you want results from BERTopic to be repeatable, make sure that the data it uses in docs is identical each run - including its order… I found out the hard way!" (+44 7404, 14/04/2026)

8. Emotion classifier swap is also permitted

scope: recommended · source: QA · conf: H · rubric 5.1

Action. bhadresh-savani/bert-base-uncased-emotion is tweet-trained and under-labels anger in polite UK prose. Swap to SamLowe/roberta-base-go_emotions (28 classes inc. "annoyance", "disappointment") and note the reason in the report.

[QA] "Pick another one if you want." (instructor, in response to the emotion-classifier question)

9. Merge Google + Trustpilot negatives into one set for the LLM / emotion step

scope: required · source: QA · conf: H · rubric 3.1, 5.3

Action. Concat both platforms' negatives into a single DataFrame, then run the emotion classifier and LLM on that merged set. The instructor explicitly said not to sweat location mismatch at this step.
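
A sketch of the merge, assuming hypothetical frames google_neg and tp_neg each holding one platform's negatives with a shared text column; the platform tag is optional but keeps per-platform commentary possible later:

```python
import pandas as pd

def merge_negatives(google_neg: pd.DataFrame, tp_neg: pd.DataFrame) -> pd.DataFrame:
    """One combined negatives set for the emotion classifier and LLM steps."""
    return pd.concat(
        [google_neg.assign(platform='google'), tp_neg.assign(platform='trustpilot')],
        ignore_index=True,
    )
```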

[QA] "Oh, yeah. Yeah. Yeah. Sort of. But if you put them together into one thing, don't worry too much about that. Yeah. If you've already done it, then it'll be fine."

10. For Trustpilot, concatenate Review Title + Review Content before any model

scope: recommended · source: WA · conf: H · rubric 3.1

Action. tp['text'] = tp['Review Title'].fillna('') + '. ' + tp['Review Content'].fillna(''). Trustpilot titles often contain the core complaint; throwing them away loses signal.
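
The same coalesce as a reusable step, with the NaN case shown explicitly (column names as given above; the function name is illustrative):

```python
import pandas as pd

def coalesce_review_text(tp: pd.DataFrame) -> pd.Series:
    """Join Trustpilot title and body; fillna('') guards rows where
    either field is missing, so no row turns into NaN."""
    return tp['Review Title'].fillna('') + '. ' + tp['Review Content'].fillna('')
```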

[WA] "Project question. In trustpilot we have Review title and Review Content columns. Have you merged them?" (+44 7842, 12/04/2026) — cohort confirmed yes

11. Do NOT over-engineer. Week 3 & 4 notebooks are the benchmark

scope: meta · source: both · conf: H

Action. The instructor has said explicitly that the notebooks shared in weeks 3 and 4 contain all the code you need for the project. Advanced additions (ABSA, churn intent, per-club briefs, multilingual) should be clearly labelled "Beyond rubric" and separated from the main submission, not interleaved.

Risk if missed. A grader who expected a simple rubric-pass may dock for "over-scoped", or miss the rubric ticks under the bonus noise.

[WA] "the codes he shared from week 3 and 4 are all that is needed for this project" (Pierre's paraphrase of instructor, 2026-04-16)

12. Top-30-location BERTopic commonly returns only 2 topics — not necessarily wrong

scope: recommended · source: WA · conf: M · rubric 4.4

Action. If rubric 4.4's BERTopic on top-30-locations gives ~2 topics, don't panic. Two students independently reported this. Document it in the commentary and compare to run 1 explicitly.

[WA] "I got the same results: two topics for top 30 locations. Are you sure that's wrong - and not an accurate view of the topic structure?" (Mark Griffiths PACE, 12/04/2026)

13. Pre-processing is cheap — run on the whole dataset, not a sample

scope: recommended · source: WA · conf: H · rubric 2.3

Action. Avoid explaining "we sampled X for compute reasons" in the report. On a modern laptop, lowercase + stopword-strip + tokenise for ~28k reviews runs in seconds.

[WA] "The pre-processing isn't computationally expensive and doesn't take long so I ran it on the whole of each dataset." (+44 7404, 13/04/2026)

14. Match notebook headings 1-to-1 with rubric wording

scope: recommended · source: WA · conf: M · rubric 8.2, 8.5

Action. The cohort's best-structured submissions have ~79 headings/subheadings in the notebook, each mapping verbatim to a rubric line. Makes grading trivial for the marker. Your RUBRIC_TICK_MAP.md is already this structure — mirror it in the notebook's markdown cells.

[WA] "I've just structured the notebook matched to the rubric - only 79 headings / subheadings" (+44 7404, 10/04/2026)

15. Emojis: assess first, then decide

scope: recommended · source: QA · conf: M · rubric 2.3

Action. Before stripping or keeping emojis, run a count: how many reviews contain an emoji? If <2% it probably doesn't matter; if >10% it might carry emotional signal worth keeping.
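
One way to run that assessment, as a sketch: the regex covers the main Unicode emoji blocks only, which is an approximation good enough for a yes/no count, not an exhaustive emoji matcher.

```python
import re

# Rough pattern over the main Unicode emoji blocks (approximation)
EMOJI_RE = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')

def emoji_share(reviews):
    """Fraction of reviews containing at least one emoji."""
    hits = sum(bool(EMOJI_RE.search(r)) for r in reviews)
    return hits / len(reviews)
```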

[QA] "I would do an assessment to see how many emojis there were, how many reviews on emojis in them."

16. Section 6.7 actionable insights as a bullet list, not a paragraph

scope: required · source: QA · conf: H · rubric 6.7

Action. The rubric says "list the output, ideally in the form of suggestions". Use a numbered or bulleted list. Pierre's v3_11_qwen_insights.md already does this — keep the format in the notebook too.

[QA] "list the output, ideally in the form of suggestions"

17. Falcon-on-T4 OOMs. Use A100 or skip Falcon

scope: meta · source: WA · conf: H · rubric 6.1

Action. Falcon-7B at fp16 needs ~15GB VRAM; a T4 has 16GB. Expect OOM errors on a free Colab. Colab Pro (A100) solves it, but the LLM swap (tip 4) solves it cheaper.
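
The arithmetic behind the OOM, as a weights-only sketch (the function name is illustrative):

```python
# Back-of-envelope estimate for model weights alone (fp16 = 2 bytes/param);
# excludes KV cache, activations and framework overhead, which is why
# ~14 GB of weights still OOMs a 16 GB T4.
def weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    return params_billion * bytes_per_param

falcon_weights = weight_vram_gb(7)  # 14.0 GB before any overhead
```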

[WA] "T4 GPU is too short (out-of-memory) for some models such as Falcon (row 34 in the rubric). I have purchased some computer units for running A100." (+32 497, 10/04/2026)

18. Falcon + BERTopic version conflict — separate Colab sessions

scope: meta · source: WA · conf: H · rubric 6.1

Action. Falcon needs transformers<=4.30; BERTopic wants newer. Split the workflow across sessions:

1. Runtime 1: run BERTopic → save outputs as parquet
2. Runtime → Disconnect and delete runtime
3. Runtime 2: fresh environment + Falcon

Or dodge the whole thing by using the HF Inference Providers swap (tip 4).

[WA] "Falcon requires an older version of the Transformers library, which is incompatible with BERTopic. As far as I can tell, the two can't coexist in the same environment." (David Van De Vijver, 12/04/2026)

19. Report: 800-1000 words, strict

scope: required · source: rubric · conf: H · rubric 8.1

Action. Word-count report.md now. If over, cut the "approach" section first — your notebook documents the approach. Reserve the 800-1000 words for insights and business implications.

[rubric 8.1] "The report is between 800-1000 words."

20. Rubric has been verbally amended in lectures — keep watching recordings

scope: meta · source: WA · conf: H

Action. At least two rubric amendments have been made verbally (Ollama option, custom stopwords). Before submitting, scan the Q&A transcript for anything that might have changed since. Pierre has the 2026-04-16 transcript ingested; future Q&A sessions need the same treatment.

[WA] "darent log off early in case he verbally changes the rubric again." (+44 7359, 14/04/2026) [WA] "he verbally changed the rubric … With reference to allowing us to use Ollama instead of Falcon" (same user)

Integration status

| Tip | Integrated? | Where |
|-----|-------------|-------|
| 1 stopwords | 🟡 | code change needed — see v3/v3_05_frequency_analysis.py, add to STOPWORDS list |
| 2 original df filter | 🟢 | already correct in v3_05_frequency_analysis.py — verified |
| 3 full sentences | 🟢 | already correct in Phase 6 BERTopic and Phase 11 LLM |
| 4 model swap | 🟢 | already done (Qwen2.5-72B) |
| 5 phi-4-mini | ⚪ | optional alt-path; keep Qwen |
| 6 UMAP seed | 🟡 | verify v3_06_bertopic.py sets random_state=42 on UMAP |
| 7 data order | 🟡 | add .sort_values('Creation Date') before fit_transform |
| 8 emotion swap | 🟡 | build v3_08c_roberta_go_emotions.py |
| 9 merge G+TP | 🟢 | already done in combined_phase4.parquet |
| 10 Trustpilot title+body | 🟡 | verify text coalesce in v3_02_clean.py |
| 11 don't over-engineer | 🟢 | bonus work is in separate v3_11_bonus_* files |
| 12 top-30 → 2 topics | ⚪ | if we see it, annotate |
| 13 no sampling | 🟢 | preprocess runs full |
| 14 79 headings | 🟡 | check submission notebook markdown depth |
| 15 emoji assess | 🟡 | add emoji-count cell |
| 16 bullet list | 🟢 | done |
| 17 T4 OOM | N/A | using Inference API |
| 18 Falcon/BERTopic conflict | N/A | no Falcon |
| 19 800-1000 words | 🟡 | word-count report.md now |
| 20 keep watching | 🟢 | Q&A ingested, pattern documented |
🟢 done · 🟡 action needed · ⚪ optional · N/A irrelevant now