Restricted · PACE NLP

DS301 NLP project · private preview

Top 20 tips — PACE DS301 NLP project

Hand-curated from Q&A transcript (2026-04-16, Speaker 0 = instructor Russell) + CAM P4 WhatsApp cohort (Nov 2025 – Apr 2026). Auto-gen by Qwen was too noisy; this version is sourced to specific quotes.

Legend

- Scope — required = rubric tick; recommended = cohort consensus; advanced = goes beyond; meta = process/submission
- Source — QA = live Q&A 2026-04-16; WA = cohort WhatsApp; both = independently confirmed
- Conf — H/M/L

Summary table

| # | Tip | Rubric | Scope | Source | Conf |
|---|-----|--------|-------|--------|------|
| 1 | Add pure, pure gym, gym, puregym to stopwords (counting branch only) | 2.3, 2.5, 7.1 | required | both | H |
| 2 | For the negative-review wordcloud + freq, filter the original DataFrame (post-NaN, score<3) — NOT the tokenised data | 2.8 | required | QA | H |
| 3 | Feed full sentences to BERTopic and the LLM, not preprocessed tokens | 3.2, 6.2 | required | QA | H |
| 4 | You can swap Falcon for any HF model — "pick another one if you want" | 6.1 | required | QA | H |
| 5 | Cohort consensus: phi-4-mini via Ollama is an acceptable Falcon replacement | 6.1 | recommended | WA | H |
| 6 | Set the UMAP random_state so BERTopic is reproducible | 3.2, 5.6 | required | WA | H |
| 7 | Keep the data order identical between BERTopic runs (sort before fit) | 3.2 | recommended | WA | H |
| 8 | Emotion classifier swap is also permitted — "pick another one if you want" | 5.1 | recommended | QA | H |
| 9 | Merge Google + Trustpilot negatives into one set for the LLM / emotion step — "don't worry too much about that" | 3.1, 5.3 | required | QA | H |
| 10 | For Trustpilot, concatenate Review Title + Review Content before passing to any model | 3.1 | recommended | WA | H |
| 11 | Do NOT over-engineer — the instructor has said the week 3 & 4 notebooks are all that is needed | — | meta | both | H |
| 12 | Top-30-locations BERTopic often returns only ~2 topics — this is fine, note it in the report | 4.4 | recommended | WA | M |
| 13 | Pre-processing is cheap — run it on the whole combined dataset, not a sample | 2.3 | recommended | WA | H |
| 14 | Match your notebook's headings 1-to-1 with the rubric wording — the cohort's "79 headings" pattern | 8.2, 8.5 | recommended | WA | M |
| 15 | Emojis: assess first (count distribution), then decide — don't blanket-strip | 2.3 | recommended | QA | M |
| 16 | Section 6.7 wants suggestions as a list, not a paragraph — use bullet points | 6.7 | required | QA | H |
| 17 | Falcon-on-T4 OOMs; if using Falcon specifically, request an A100 or skip Falcon entirely | 6.1 | meta | WA | H |
| 18 | Falcon needs an older transformers version that conflicts with BERTopic — run each in a separate Colab session | 6.1 | meta | WA | H |
| 19 | Report 800-1000 words; over/under is an explicit rubric item | 8.1 | required | rubric | H |
| 20 | The rubric has been verbally amended in lectures (Ollama option, stopwords list). Re-watch the last session before submitting | — | meta | WA | H |

1. Add pure, pure gym, gym, puregym to stopwords (counting branch only)

scope: required · source: both · conf: H · rubric 2.3, 2.5, 7.1

Action. Extend stopwords.words('english') with ['pure', 'gym', 'puregym', 'pure gym'] in the preprocessing step. Apply it to the counting branch (freq dist, wordcloud, LDA). Do NOT strip these from the embedding / BERT / LLM branches — those want full prose.

Risk if missed. The top 10 words will be dominated by "pure", "gym", "puregym" and reviewers will ask "why is the word cloud just the company name?"

[QA] "Pure, pure gem, and gem. Technically speaking two of them would deal with the other one. But just to make sure, pure, pure gym and gym are the three main ones" [WA] "Just a reminder don't forget to take out PureGym and pure and gym from the dataset when you start. It is not in the rubric but Russell was mentioning it a few times" (+44 7459, 13/04/2026)

2. For the negative-review wordcloud + freq, filter the original DataFrame, not the tokenised data

scope: required · source: QA · conf: H · rubric 2.8

Action. In the cell that creates the negative subset: start from the phase-1 or phase-3 DataFrame (post-NaN drop, language-filtered), apply score < 3, then redo word_tokenize + FreqDist + wordcloud on THAT set. Don't reuse the already-tokenised rows.
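
A sketch of the filter step, assuming a hypothetical score column name ('score'); the real column name may differ in your cleaned DataFrame:

```python
import pandas as pd

def negative_subset(df: pd.DataFrame, score_col: str = 'score') -> pd.DataFrame:
    """Filter the cleaned source DataFrame (post-NaN drop, language-filtered)
    down to negative reviews. Tokenise/FreqDist/wordcloud are then redone on
    the raw text in this subset, not on previously tokenised rows."""
    return df[df[score_col] < 3].reset_index(drop=True)
```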

Risk if missed. You re-tokenise tokenised text → bugs. Also the rubric text says "create a new DataFrame by filtering out the data" — literal filtering on the source.

[QA] "create a new data frame by filtering out the data to extract only the negative reviews. So it doesn't say create create it from the tokenized data. It says create a new data frame."

3. Feed full sentences to BERTopic and the LLM, not preprocessed tokens

scope: required · source: QA · conf: H · rubric 3.2, 6.2

Action. Pass the raw Comment / Review Content string straight into bertopic.fit_transform() and into the LLM prompt. BERT handles grammar and position; tokenising + stopword-stripping destroys the signal.
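
A toy illustration of why (the stopword set here is a hand-picked subset of NLTK's English list, just for the demo):

```python
# Tiny illustrative subset of NLTK's English stopword list, for the demo only
stop = {'i', 'was', 'not', 'with', 'the'}

review = "I was not happy with the changing rooms"
tokens = [w for w in review.lower().split() if w not in stop]
# tokens == ['happy', 'changing', 'rooms']; the negation 'not' is gone,
# which is exactly the context the embedding model would have used
```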

Risk if missed. Topics degrade; reviewers get fragmentary keyword clusters.

[QA] "you wanna hand in the full sentence because it's got its own, as you say, it's already preset to go. All you have to do is hand in the full text, and it will understand what the text is." [QA] "BERT requires or doesn't require, but can take the full review and understand it better than if we split it up and take away the all the little stop words and things."

4. You can swap Falcon for any HF model

scope: required · source: QA · conf: H · rubric 6.1

Action. Rubric section 6 heading says "Using a large language model from Hugging Face". Instructor confirmed verbally: pick what works. Qwen2.5-72B via HF Inference Providers is fine.

Risk if missed. Trying to force Falcon-7B wastes Colab credits and delivers garbage output.

[QA] "So yeah. I know. Fine. Okay. Alex. ... Pick another one if you want." [WA] "he verbally changed the rubric … With reference to allowing us to use Ollama instead of Falcon" (+44 7359, 14/04/2026)

5. Cohort-recommended Falcon replacement: phi-4-mini via Ollama

scope: recommended · source: WA · conf: H · rubric 6.1

Action. If you want a name the graders can't argue with, phi-4-mini via Ollama is the cohort-pinned choice. Qwen2.5-72B is stronger but less "course-official"-feeling.

[WA] "For the rubric: gym, pure & puregym have to be removed. And phi 4 mini from Ollama can be used instead of Falcon. Do you agree then I will pin this message" (David Van De Vijver, 14/04/2026)

6. Set the UMAP random_state for BERTopic reproducibility

scope: required · source: WA · conf: H · rubric 3.2, 5.6

Action. Pin UMAP's random_state and pass the model into BERTopic:

```python
from umap import UMAP
from bertopic import BERTopic

umap_model = UMAP(n_neighbors=10, n_components=5, random_state=42)
topic_model = BERTopic(umap_model=umap_model, ...)
```

HDBSCAN downstream is deterministic; UMAP is the source of randomness.

Risk if missed. Re-running the notebook gives different cluster numbers → grader sees drifting results.

[WA] "You can set the seed using UMAP. The other parts of BERTopic are deterministic" (David Van De Vijver, 14/04/2026)

7. Keep the BERTopic document order stable between runs

scope: recommended · source: WA · conf: H · rubric 3.2

Action. Apply sort_values('Creation Date').reset_index(drop=True) before fit_transform. Even with the UMAP seed, if your docs list re-orders (e.g. after different pandas operations upstream) you'll get different clusters.

[WA] "if you want results from BERTopic to be repeatable, make sure that the data it uses in docs is identical each run - including its order… I found out the hard way!" (+44 7404, 14/04/2026)

8. Emotion classifier swap is also permitted

scope: recommended · source: QA · conf: H · rubric 5.1

Action. bhadresh-savani/bert-base-uncased-emotion is tweet-trained and under-labels anger in polite UK prose. Swap to SamLowe/roberta-base-go_emotions (28 classes inc. "annoyance", "disappointment") and note the reason in the report.

[QA] "Pick another one if you want." (instructor, in response to the emotion-classifier question)

9. Merge Google + Trustpilot negatives into one set for the LLM / emotion step

scope: required · source: QA · conf: H · rubric 3.1, 5.3

Action. Concat both platforms' negatives into a single DataFrame, then run the emotion classifier and LLM on that merged set. The instructor explicitly said not to sweat location mismatch at this step.
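
A sketch of the merge, assuming hypothetical frames google_neg and tp_neg each holding one platform's negatives with a shared text column; the platform tag is optional but keeps per-platform commentary possible later:

```python
import pandas as pd

def merge_negatives(google_neg: pd.DataFrame, tp_neg: pd.DataFrame) -> pd.DataFrame:
    """One combined negatives set for the emotion classifier and LLM steps."""
    return pd.concat(
        [google_neg.assign(platform='google'), tp_neg.assign(platform='trustpilot')],
        ignore_index=True,
    )
```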

[QA] "Oh, yeah. Yeah. Yeah. Sort of. But if you put them together into one thing, don't worry too much about that. Yeah. If you've already done it, then it'll be fine."

10. For Trustpilot, concatenate Review Title + Review Content before any model

scope: recommended · source: WA · conf: H · rubric 3.1

Action. tp['text'] = tp['Review Title'].fillna('') + '. ' + tp['Review Content'].fillna(''). Trustpilot titles often contain the core complaint; throwing them away loses signal.
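
The same coalesce as a reusable step, with the NaN case shown explicitly (column names as given above; the function name is illustrative):

```python
import pandas as pd

def coalesce_review_text(tp: pd.DataFrame) -> pd.Series:
    """Join Trustpilot title and body; fillna('') guards rows where
    either field is missing, so no row turns into NaN."""
    return tp['Review Title'].fillna('') + '. ' + tp['Review Content'].fillna('')
```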

[WA] "Project question. In trustpilot we have Review title and Review Content columns. Have you merged them?" (+44 7842, 12/04/2026) — cohort confirmed yes

11. Do NOT over-engineer. Week 3 & 4 notebooks are the benchmark

scope: meta · source: both · conf: H

Action. The instructor has said explicitly that the notebooks shared in weeks 3 and 4 contain all the code you need for the project. Advanced additions (ABSA, churn intent, per-club briefs, multilingual) should be clearly labelled "Beyond rubric" and separated from the main submission, not interleaved.

Risk if missed. A grader who expected a simple rubric-pass may dock for "over-scoped", or miss the rubric ticks under the bonus noise.

[WA] "the codes he shared from week 3 and 4 are all that is needed for this project" (Pierre's paraphrase of instructor, 2026-04-16)

12. Top-30-location BERTopic commonly returns only 2 topics — not necessarily wrong

scope: recommended · source: WA · conf: M · rubric 4.4

Action. If rubric 4.4's BERTopic on top-30-locations gives ~2 topics, don't panic. Two students independently reported this. Document it in the commentary and compare to run 1 explicitly.

[WA] "I got the same results: two topics for top 30 locations. Are you sure that's wrong - and not an accurate view of the topic structure?" (Mark Griffiths PACE, 12/04/2026)

13. Pre-processing is cheap — run on the whole dataset, not a sample

scope: recommended · source: WA · conf: H · rubric 2.3

Action. Avoid explaining "we sampled X for compute reasons" in the report. On a modern laptop, lowercase + stopword-strip + tokenise for ~28k reviews runs in seconds.

[WA] "The pre-processing isn't computationally expensive and doesn't take long so I ran it on the whole of each dataset." (+44 7404, 13/04/2026)

14. Match notebook headings 1-to-1 with rubric wording

scope: recommended · source: WA · conf: M · rubric 8.2, 8.5

Action. The cohort's best-structured submissions have ~79 headings/subheadings in the notebook, each mapping verbatim to a rubric line. Makes grading trivial for the marker. Your RUBRIC_TICK_MAP.md is already this structure — mirror it in the notebook's markdown cells.

[WA] "I've just structured the notebook matched to the rubric - only 79 headings / subheadings" (+44 7404, 10/04/2026)

15. Emojis: assess first, then decide

scope: recommended · source: QA · conf: M · rubric 2.3

Action. Before stripping or keeping emojis, run a count: how many reviews contain an emoji? If <2% it probably doesn't matter; if >10% it might carry emotional signal worth keeping.
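
One way to run that assessment, as a sketch: the regex covers the main Unicode emoji blocks only, which is an approximation good enough for a yes/no count, not an exhaustive emoji matcher.

```python
import re

# Rough pattern over the main Unicode emoji blocks (approximation)
EMOJI_RE = re.compile('[\U0001F300-\U0001FAFF\u2600-\u27BF]')

def emoji_share(reviews):
    """Fraction of reviews containing at least one emoji."""
    hits = sum(bool(EMOJI_RE.search(r)) for r in reviews)
    return hits / len(reviews)
```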

[QA] "I would do an assessment to see how many emojis there were, how many reviews on emojis in them."

16. Section 6.7 actionable insights as a bullet list, not a paragraph

scope: required · source: QA · conf: H · rubric 6.7

Action. The rubric says "list the output, ideally in the form of suggestions". Use a numbered or bulleted list. Pierre's v3_11_qwen_insights.md already does this — keep the format in the notebook too.

[QA] "list the output, ideally in the form of suggestions"

17. Falcon-on-T4 OOMs. Use A100 or skip Falcon

scope: meta · source: WA · conf: H · rubric 6.1

Action. Falcon-7B at fp16 needs ~15GB VRAM; a T4 has 16GB. Expect OOM errors on a free Colab. Colab Pro (A100) solves it, but the LLM swap (tip 4) solves it cheaper.
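
The arithmetic behind the OOM, as a weights-only sketch (the function name is illustrative):

```python
# Back-of-envelope estimate for model weights alone (fp16 = 2 bytes/param);
# excludes KV cache, activations and framework overhead, which is why
# ~14 GB of weights still OOMs a 16 GB T4.
def weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    return params_billion * bytes_per_param

falcon_weights = weight_vram_gb(7)  # 14.0 GB before any overhead
```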

[WA] "T4 GPU is too short (out-of-memory) for some models such as Falcon (row 34 in the rubric). I have purchased some computer units for running A100." (+32 497, 10/04/2026)

18. Falcon + BERTopic version conflict — separate Colab sessions

scope: meta · source: WA · conf: H · rubric 6.1

Action. Falcon needs transformers<=4.30; BERTopic wants newer. Split the workflow across sessions:

1. Runtime 1: run BERTopic → save outputs as parquet
2. Runtime → Disconnect and delete runtime
3. Runtime 2: fresh environment + Falcon

Or dodge the whole thing by using the HF Inference Providers swap (tip 4).

[WA] "Falcon requires an older version of the Transformers library, which is incompatible with BERTopic. As far as I can tell, the two can't coexist in the same environment." (David Van De Vijver, 12/04/2026)

19. Report: 800-1000 words, strict

scope: required · source: rubric · conf: H · rubric 8.1

Action. Word-count report.md now. If over, cut the "approach" section first — your notebook documents the approach. Reserve the 800-1000 words for insights and business implications.

[rubric 8.1] "The report is between 800-1000 words."

20. Rubric has been verbally amended in lectures — keep watching recordings

scope: meta · source: WA · conf: H

Action. At least two rubric amendments have been made verbally (Ollama option, custom stopwords). Before submitting, scan the Q&A transcript for anything that might have changed since. Pierre has the 2026-04-16 transcript ingested; future Q&A sessions need the same treatment.

[WA] "darent log off early in case he verbally changes the rubric again." (+44 7359, 14/04/2026) [WA] "he verbally changed the rubric … With reference to allowing us to use Ollama instead of Falcon" (same user)

Integration status

| Tip | Integrated? | Where |
|-----|-------------|-------|
| 1 stopwords | 🟡 | code change needed — see v3/v3_05_frequency_analysis.py, add to STOPWORDS list |
| 2 original df filter | 🟢 | already correct in v3_05_frequency_analysis.py — verified |
| 3 full sentences | 🟢 | already correct in Phase 6 BERTopic and Phase 11 LLM |
| 4 model swap | 🟢 | already done (Qwen2.5-72B) |
| 5 phi-4-mini | ⚪ | optional alt-path; keep Qwen |
| 6 UMAP seed | 🟡 | verify v3_06_bertopic.py sets random_state=42 on UMAP |
| 7 data order | 🟡 | add .sort_values('Creation Date') before fit_transform |
| 8 emotion swap | 🟡 | build v3_08c_roberta_go_emotions.py |
| 9 merge G+TP | 🟢 | already done in combined_phase4.parquet |
| 10 Trustpilot title+body | 🟡 | verify text coalesce in v3_02_clean.py |
| 11 don't over-engineer | 🟢 | bonus work is in separate v3_11_bonus_* files |
| 12 top-30 → 2 topics | ⚪ | if we see it, annotate |
| 13 no sampling | 🟢 | preprocess runs full |
| 14 79 headings | 🟡 | check submission notebook markdown depth |
| 15 emoji assess | 🟡 | add emoji-count cell |
| 16 bullet list | 🟢 | done |
| 17 T4 OOM | N/A | using Inference API |
| 18 Falcon/BERTopic conflict | N/A | no Falcon |
| 19 800-1000 words | 🟡 | word-count report.md now |
| 20 keep watching | 🟢 | Q&A ingested, pattern documented |
🟢 done · 🟡 action needed · ⚪ optional · N/A irrelevant now