Restricted · PACE NLP

DS301 NLP project · private preview
Nadia — NLP-specialist persona
Nadia · NLP specialist "The work" — what got submitted

Rubric audio deck

Clinical walk-through of 48 rubric items. ~30 seconds each; start here, skip around as needed. UK English (Sonia voice), -5% rate for walk listening.

Overview · 00_overview.mp3
Importing packages and data

#1 — Importing packages and data

Rubric line — listen ~30 s
Import the data file Google_12_months.xlsx into a dataframe.

Loaded with pandas. 13,898 rows after NaN drop.

#2 — Importing packages and data

Rubric line — listen ~30 s
Import the data file Trustpilot_12_months.xlsx into a dataframe.

Trustpilot separates Title from Content — we merge them in Phase 2.

#3 — Importing packages and data

Rubric line — listen ~30 s
Remove any rows with missing values in the Comment column (Google review) and Review Content column (Trustpilot).

dropna on Comment / Review Content.
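
A minimal sketch of the load-and-drop step, assuming pandas is installed (plus openpyxl for reading .xlsx). The tiny in-memory frame stands in for the real file so the snippet runs anywhere; its rows are made up, and `drop_missing` is a hypothetical helper name, not something taken from the submitted notebook.

```python
import pandas as pd


def drop_missing(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """Drop rows whose review text is missing and renumber the index."""
    return df.dropna(subset=[text_col]).reset_index(drop=True)


# In the project this frame would come from something like
#   pd.read_excel("Google_12_months.xlsx")
# A made-up stand-in is used here so the example is self-contained.
google = pd.DataFrame(
    {
        "Club's Name": ["Leeds", "Derby", "York"],
        "Comment": ["Great kit", None, "Showers were cold"],
    }
)

google_clean = drop_missing(google, "Comment")
print(len(google), "->", len(google_clean))
```

The same helper would apply to the Trustpilot frame with `"Review Content"` as the column.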

Conducting initial data investigation

#4 — Conducting initial data investigation

Rubric line — listen ~30 s
Find the number of unique locations in the Google data set. Find the number of unique locations in the Trustpilot data set. Use Club's Name for the Google data set. Use Location Name for the Trustpilot data set.

Google 512 clubs, Trustpilot 376 — different casing, normalise first.

#5 — Conducting initial data investigation

Rubric line — listen ~30 s
Find the number of common locations between the Google data set and the Trustpilot data set.

310 shared clubs.
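
The counting in items 4 and 5 can be sketched with plain sets. This assumes that lower-casing and collapsing whitespace is enough to align the two naming schemes; the real data may need more cleaning. The club names below are invented for illustration.

```python
def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so casing differences do not split a club."""
    return " ".join(name.lower().split())


def location_summary(google_names, trustpilot_names):
    """Return (unique Google, unique Trustpilot, common) location counts."""
    g = {normalise(n) for n in google_names if isinstance(n, str)}
    t = {normalise(n) for n in trustpilot_names if isinstance(n, str)}
    return len(g), len(t), len(g & t)


# Made-up names for illustration only
google_names = ["London Stratford", "LEEDS CITY", "York"]
trustpilot_names = ["london stratford", "Leeds  City", "Bath"]

counts = location_summary(google_names, trustpilot_names)
print(counts)
```

Counting before normalising would treat "LEEDS CITY" and "Leeds  City" as different clubs and understate the overlap.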

#6 — Conducting initial data investigation

Rubric line — listen ~30 s
Perform preprocessing of the data – change to lower case, remove stopwords using NLTK, and remove numbers.

NLTK pipeline plus domain stopwords: pure, puregym, gym.

#7 — Conducting initial data investigation

Rubric line — listen ~30 s
Tokenise the data using word_tokenize from NLTK.

nltk.tokenize.word_tokenize splits contractions into separate tokens (don't becomes do + n't).
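
A sketch of the cleaning pipeline from items 6 and 7, with the tokeniser and stopword list passed in as arguments so the example runs without NLTK data. Dropping non-alphabetic tokens is an added assumption, not a rubric requirement, and `preprocess` is a hypothetical name.

```python
import re

DOMAIN_STOPWORDS = {"pure", "puregym", "gym"}  # domain stopwords from the note above


def preprocess(text, tokenize, stopwords):
    """Lower-case, strip digits, tokenise, then drop stopwords and non-alphabetic tokens."""
    text = re.sub(r"\d+", " ", text.lower())
    blocked = set(stopwords) | DOMAIN_STOPWORDS
    return [tok for tok in tokenize(text) if tok.isalpha() and tok not in blocked]


# With NLTK installed and its tokeniser and stopword data downloaded, you would pass:
#   from nltk.tokenize import word_tokenize
#   from nltk.corpus import stopwords
#   preprocess(review, word_tokenize, stopwords.words("english"))
# The stand-ins below keep this example runnable without NLTK.
tiny_stopwords = {"the", "was", "at", "and"}
tokens = preprocess(
    "The PureGym at 24 High St was dirty and crowded", str.split, tiny_stopwords
)
print(tokens)
```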

#8 — Conducting initial data investigation

Rubric line — listen ~30 s
Find the frequency distribution of the words from each data set's reviews separately. You can use nltk.FreqDist.

FreqDist, saved as CSV. Run once for all, once for negatives.
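
A stand-in sketch for the frequency step using `collections.Counter`. NLTK's `FreqDist` subclasses `Counter`, so `most_common` behaves the same way there. The CSV layout (a `word,count` header) is an assumption, and the token lists are invented.

```python
from collections import Counter
import csv
import io


def top_words(token_lists, n=10):
    """Count tokens across all reviews and return the n most frequent."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    return counts.most_common(n)


def to_csv(rows):
    """Serialise (word, count) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["word", "count"])
    writer.writerows(rows)
    return buf.getvalue()


reviews = [["equipment", "broken"], ["showers", "broken"], ["broken", "equipment"]]
top = top_words(reviews, n=2)
print(to_csv(top))
```

Running `top_words` once on all reviews and once on the negative subset gives the two distributions the note describes.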

#9 — Conducting initial data investigation

Rubric line — listen ~30 s
Plot a histogram/bar plot showing the top 10 words from each data set.

Horizontal bar plots — long labels don't fit vertically.

#10 — Conducting initial data investigation

Rubric line — listen ~30 s
Use the wordcloud library on the cleaned data and plot the word cloud.

WordCloud. Visually cheap, analytically thin.

#11 — Conducting initial data investigation

Rubric line — listen ~30 s
Create a new dataframe by filtering out the data to extract only the negative reviews from both data sets. • For Google reviews, overall scores < 3 can be considered negative scores. • For Trustpilot reviews, stars < 3 can be considered negative scores. Repeat the frequency distribution and wordcloud steps on the filtered data consisting of only negative reviews.

Filter the ORIGINAL dataframe, not the tokenised data. Instructor was explicit.
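
A minimal sketch of the negative-review filter, applied to the original frame before any tokenising. It assumes pandas; the column name `"Overall Score"` and the rows are hypothetical stand-ins for the real Google schema.

```python
import pandas as pd


def negative_reviews(df: pd.DataFrame, score_col: str, threshold: int = 3) -> pd.DataFrame:
    """Return a copy of the rows whose score is strictly below the threshold."""
    return df[df[score_col] < threshold].copy()


# Stand-in for the original (untokenised) Google frame; values are made up.
google = pd.DataFrame(
    {
        "Comment": ["Loved it", "Broken lockers", "Too crowded", "Fine"],
        "Overall Score": [5, 1, 2, 3],
    }
)

google_neg = negative_reviews(google, "Overall Score")
print(google_neg["Comment"].tolist())
```

A score of exactly 3 is kept out of the negative set because the rubric threshold is strictly less than 3.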

Conducting initial topic modelling

#12 — Conducting initial topic modelling

Rubric line — listen ~30 s
With the data frame created in the previous step: • Filter out the reviews that are from the locations common to both data sets. • Merge the reviews to form a new list.

Intersection of club names. Sort by creation date before BERTopic fit.

#13 — Conducting initial topic modelling

Rubric line — listen ~30 s
Preprocess this data set. Use BERTopic on this cleaned data set.

Feed BERTopic RAW full sentences, not tokens.

#14 — Conducting initial topic modelling

Rubric line — listen ~30 s
Output: List out the top topics along with their document frequencies.

191 topics found. get_topic_info dataframe.

#15 — Conducting initial topic modelling

Rubric line — listen ~30 s
For the top 2 topics, list out the top words.

Topic 0: equipment, air, staff, showers. Topic 1: smell, cleaning, spray.

#16 — Conducting initial topic modelling

Rubric line — listen ~30 s
Show an interactive visualisation of the topics to identify the cluster of topics and to understand the intertopic distance map.

visualize_topics produces a plotly HTML distance map.

#17 — Conducting initial topic modelling

Rubric line — listen ~30 s
Show a barchart of the topics, displaying the top 5 words in each topic.

visualize_barchart for top-10 topics, 6 words each.

#18 — Conducting initial topic modelling

Rubric line — listen ~30 s
Plot a heatmap, showcasing the similarity matrix.

Similarity heatmap — close clusters are merge candidates.
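
Items 13 to 18 can be sketched in one pass, assuming the `bertopic` package is installed. The import sits inside the function so the snippet loads without it. `to_documents`, `fit_and_visualise` and the output file names are hypothetical; the BERTopic calls use default settings, which may differ from the submitted run.

```python
def to_documents(reviews):
    """Keep raw review strings, trimmed, and drop anything empty or non-text."""
    return [r.strip() for r in reviews if isinstance(r, str) and r.strip()]


def fit_and_visualise(docs, out_prefix="topics"):
    """Fit BERTopic on raw sentences and write the three interactive views to HTML."""
    from bertopic import BERTopic  # lazy import; needs `pip install bertopic`

    model = BERTopic()
    topics, _ = model.fit_transform(docs)

    info = model.get_topic_info()  # one row per topic with its document count
    top_two = {t: model.get_topic(t) for t in (0, 1)}  # (word, score) pairs

    # Each visualize_* call returns a plotly figure that can be saved as HTML
    model.visualize_topics().write_html(f"{out_prefix}_distance.html")
    model.visualize_barchart(top_n_topics=10, n_words=6).write_html(
        f"{out_prefix}_barchart.html"
    )
    model.visualize_heatmap().write_html(f"{out_prefix}_heatmap.html")
    return info, top_two


docs = to_documents(["  Showers were cold.  ", "", None, "Air con broken again"])
print(docs)
```

The documents go in as full sentences because BERTopic embeds them with a sentence-embedding model; stopword removal happens later, inside its own vectoriser step.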

#19 — Conducting initial topic modelling

Rubric line — listen ~30 s
For 10 clusters, provide a brief description in the Notebook of the topics they comprise of along with the general theme of the cluster, evidenced by the top words within each cluster's topics.

Appendix A maps clusters to English themes.

Performing further data investigation

#20 — Performing further data investigation

Rubric line — listen ~30 s
List out the top 20 locations with the highest number of negative reviews. Do this separately for Google and Trustpilot's reviews, and comment on the result. Are the locations roughly similar in both data sets?

London Stratford, Leicester Walnut Street, London Enfield — worst 3.

#21 — Performing further data investigation

Rubric line — listen ~30 s
Merge the 2 data sets using Location Name and Club's Name. Now, list out the following: • Locations • Number of Trustpilot reviews for this location • Number of Google reviews for this location • Total number of reviews for this location (sum of Google reviews and Trustpilot reviews) Sort based on the total number of reviews.

Full outer join, total reviews per club.
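
A sketch of the per-club tally and outer join, assuming pandas and that lower-casing is enough to align club names. The output column names are hypothetical and the rows are made up.

```python
import pandas as pd


def reviews_per_location(google: pd.DataFrame, trustpilot: pd.DataFrame) -> pd.DataFrame:
    """Count reviews per club on each platform, outer-join the tallies, sort by total."""
    g = (
        google["Club's Name"].str.lower().value_counts()
        .rename_axis("location").reset_index(name="google_reviews")
    )
    t = (
        trustpilot["Location Name"].str.lower().value_counts()
        .rename_axis("location").reset_index(name="trustpilot_reviews")
    )
    merged = t.merge(g, on="location", how="outer")
    # Clubs present on only one platform get a count of zero on the other
    merged = merged.fillna({"trustpilot_reviews": 0, "google_reviews": 0}).astype(
        {"trustpilot_reviews": int, "google_reviews": int}
    )
    merged["total_reviews"] = merged["trustpilot_reviews"] + merged["google_reviews"]
    return merged.sort_values("total_reviews", ascending=False, ignore_index=True)


google = pd.DataFrame({"Club's Name": ["Leeds", "leeds", "Leeds", "York"]})
trustpilot = pd.DataFrame({"Location Name": ["LEEDS", "York", "York", "Hull"]})

summary = reviews_per_location(google, trustpilot)
print(summary)
```

An inner join would silently drop clubs that appear on only one platform, which is why the outer join matters for the totals.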

#22 — Performing further data investigation

Rubric line — listen ~30 s
For the top 30 locations, redo the word frequency and word cloud. Comment on the results, and highlight if the results are different from the first run.

Long tail collapses — hygiene + equipment dominate top 30.

#23 — Performing further data investigation

Rubric line — listen ~30 s
For the top 30 locations, combine the reviews from Google and Trustpilot and run them through BERTopic. Comment on the following: • Are the results any different from the first run of BERTopic? • If so, what has changed? • Are there any additional insights compared to the first run?

64 topics vs 191 for the full set. Dominant labels stabilise.

Conducting emotion analysis

#24 — Conducting emotion analysis

Rubric line — listen ~30 s
Import the BERT model bhadresh-savani/bert-base-uncased-emotion from Hugging Face, and set up a pipeline for text classification.

HF pipeline, 6 emotion classes: sadness, joy, love, anger, fear, surprise (close to Ekman's six, with love in place of disgust). Model was tweet-trained.

#25 — Conducting emotion analysis

Rubric line — listen ~30 s
With the help of an example sentence, run the model and display the different emotion classifications that the model outputs.

Appendix A has a demo cell with full probability distribution.

#26 — Conducting emotion analysis

Rubric line — listen ~30 s
Run this model on both data sets, and capture the top emotion for each review.

Full run on 11,851 English negatives. Keep the full dist, not just top-1.
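
A sketch of the emotion step that keeps the full distribution and derives the top label afterwards. It assumes a recent `transformers` release, where `top_k=None` returns every class; older releases used `return_all_scores=True`. `classify` and `top_emotion` are hypothetical names, and the example distribution is hand-written, not model output.

```python
def top_emotion(distribution):
    """Pick the highest-scoring label from a list of {'label', 'score'} dicts."""
    best = max(distribution, key=lambda d: d["score"])
    return best["label"]


def classify(texts, batch_size=32):
    """Return the full emotion distribution per text (downloads the model on first use)."""
    from transformers import pipeline  # lazy import; needs `pip install transformers torch`

    clf = pipeline(
        "text-classification",
        model="bhadresh-savani/bert-base-uncased-emotion",
        top_k=None,  # return a score for every class, not only the best one
    )
    # truncation=True clips reviews longer than the model's input limit
    return clf(list(texts), truncation=True, batch_size=batch_size)


# Hand-written distribution in the pipeline's output shape, for illustration only
example = [
    {"label": "anger", "score": 0.71},
    {"label": "sadness", "score": 0.22},
    {"label": "fear", "score": 0.07},
]
label = top_emotion(example)
is_anger = label == "anger"
print(label, is_anger)
```

Storing the whole distribution makes it possible to revisit borderline reviews later, for example ones where sadness narrowly beats anger.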

#27 — Conducting emotion analysis

Rubric line — listen ~30 s
Use a bar plot to show the top emotion distribution for all negative reviews in both data sets.

Stacked by platform. Bhadresh over-labels sadness.

#28 — Conducting emotion analysis

Rubric line — listen ~30 s
Extract all the negative reviews (from both data sets) where anger is top emotion.

Anger subset filter on is_anger column.

#29 — Conducting emotion analysis

Rubric line — listen ~30 s
Run BERTopic on the output of the previous step.

BERTopic on anger — narrows to equipment, cleaning, staff, billing.

#30 — Conducting emotion analysis

Rubric line — listen ~30 s
Visualise the clusters from this run. Comment on whether it is any different from the previous runs, and whether it is possible to narrow down the primary issues that have led to an angry review.

Yes, you can narrow anger drivers to 3 themes. Phase 8 commentary.

Using a large language model from Hugging Face

#31 — Using a large language model from Hugging Face

Rubric line — listen ~30 s
Load the following model: tiiuae/falcon-7b-instruct. Set the pipeline for text generation and a max length of 1,000 for each review.

Falcon swapped for Qwen2.5-72B via HF Inference. 600 reviews in 2 min, 100% success.

#32 — Using a large language model from Hugging Face

Rubric line — listen ~30 s
Add the following prompt to every review, before passing it on to the model: In the following customer review, pick out the main 3 topics. Return them in a numbered list format, with each one on a new line. Run the model. Note: If the execution time is too high, you can use a subset of the bad reviews (instead of the full set) to run this model.

Verbatim rubric prompt. Parse as numbered list.
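
A sketch of the prompt-and-parse step. The prompt text is the rubric's; joining it to the review with a blank line, and accepting both `1.` and `1)` numbering, are assumptions. The sample reply is invented.

```python
import re

PROMPT = (
    "In the following customer review, pick out the main 3 topics. "
    "Return them in a numbered list format, with each one on a new line."
)


def build_prompt(review: str) -> str:
    """Prefix the rubric prompt to a single review."""
    return f"{PROMPT}\n\n{review.strip()}"


_ITEM = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")


def parse_numbered_list(reply: str, limit: int = 3):
    """Extract the text of lines that look like '1. topic' or '2) topic'."""
    items = []
    for line in reply.splitlines():
        match = _ITEM.match(line)
        if match:
            items.append(match.group(1))
    return items[:limit]


reply = (
    "Sure, here are the topics:\n"
    "1. Broken equipment\n"
    "2) Dirty showers\n"
    "3. Rude staff\n"
    "Hope this helps!"
)
topics = parse_numbered_list(reply)
print(topics)
```

Matching only numbered lines discards the chatty preamble and sign-off that instruction-tuned models often add.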

#33 — Using a large language model from Hugging Face

Rubric line — listen ~30 s
The output of the model will be the top 3 topics from each review. Append each of these topics from each review to create a comprehensive list.

1,433 unique phrases from 600 reviews.
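
Building the comprehensive list can be sketched as a flatten followed by an order-preserving de-duplication. Treating phrases as duplicates when they differ only by case is an assumption; the submitted count may have been computed differently.

```python
def collect_topics(per_review_topics):
    """Flatten per-review topic lists; also return case-insensitive unique phrases."""
    all_topics = [
        t.strip() for topics in per_review_topics for t in topics if t.strip()
    ]
    seen = set()
    unique = []
    for topic in all_topics:
        key = topic.lower()
        if key not in seen:
            seen.add(key)
            unique.append(topic)
    return all_topics, unique


# Invented parser output for two reviews
parsed = [["Broken equipment", "Dirty showers"], ["dirty showers", "Rude staff", ""]]
all_topics, unique_topics = collect_topics(parsed)
print(len(all_topics), len(unique_topics))
```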

#34 — Using a large language model from Hugging Face

Rubric line — listen ~30 s
Use this list as input to run BERTopic again.

Meta-BERTopic. Colab uses real BERTopic; local uses KMeans (no HDBSCAN wheel).

#35 — Using a large language model from Hugging Face

Rubric line — listen ~30 s
Comment about the output of BERTopic. Highlight any changes, improvements, and if any further insights have been obtained.

Fewer clusters, cleaner labels. Overcrowding-physical vs -management split apart.

#36 — Using a large language model from Hugging Face

Rubric line — listen ~30 s
Use the comprehensive list from Step 3. Pass it to the model as the input, but pre-fix the following to the prompt: For the following text topics obtained from negative customer reviews, can you give some actionable insights that would help this gym company? Run the Falcon-7b-Instruct model.

Verbatim prompt prefix. 6 suggestions tagged with owner + cost.

#37 — Using a large language model from Hugging Face

Rubric line — listen ~30 s
List the output, ideally in the form of suggestions, that the company can employ to address customer concerns.

v3_11_qwen_insights.md, rendered inline in Appendix A.

Using Gensim

#38 — Using Gensim

Rubric line — listen ~30 s
Perform the preprocessing required to run the LDA model from Gensim. Use the list of negative reviews (combined Google and Trustpilot reviews).

Colab A100, ~3 min. Heavy preprocessing (lemmatise, bigrams).

#39 — Using Gensim

Rubric line — listen ~30 s
Using Gensim, perform LDA on the tokenised data. Specify the number of topics = 10.

num_topics=10. LdaMulticore.

#40 — Using Gensim

Rubric line — listen ~30 s
Show the visualisations of the topics, displaying the distance maps and the bar chart listing out the most salient terms.

pyLDAvis interactive HTML — hover + relevance lambda.
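
A sketch of items 38 to 40, assuming `gensim` and `pyLDAvis` are installed; the imports are inside the function so the snippet loads without them. `passes`, `random_state`, `min_tokens` and the output file name are assumed values, and lemmatising and bigram detection are left out for brevity.

```python
def keep_usable(token_lists, min_tokens=2):
    """Drop reviews that are too short to carry a topic signal after cleaning."""
    return [tokens for tokens in token_lists if len(tokens) >= min_tokens]


def fit_lda(token_lists, num_topics=10, out_html="lda_vis.html"):
    """Fit an LDA model and save the pyLDAvis view as standalone HTML."""
    # Lazy imports; need `pip install gensim pyLDAvis`
    from gensim.corpora import Dictionary
    from gensim.models import LdaMulticore
    import pyLDAvis
    import pyLDAvis.gensim_models as gensimvis

    dictionary = Dictionary(token_lists)  # token -> integer id
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = LdaMulticore(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=10,        # assumed value
        random_state=42,  # assumed value, for repeatable topics
    )
    vis = gensimvis.prepare(lda, corpus, dictionary)
    pyLDAvis.save_html(vis, out_html)
    return lda


usable = keep_usable([["broken", "showers"], ["ok"], [], ["rude", "staff", "desk"]])
print(usable)
```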

#41 — Using Gensim

Rubric line — listen ~30 s
Comment on the output and whether it is similar to other techniques, and whether any extra insights were obtained.

LDA smoother, BERTopic sharper. LDA found a Danish-language topic BERTopic filtered.

Report: Communicating business impact and insights

#42 — Report: Communicating business impact and insights

Rubric line — listen ~30 s
The report is between 800–1000 words.

976 words — in range.

#43 — Report: Communicating business impact and insights

Rubric line — listen ~30 s
The report documents the approach used.

Approach walks through phases 1-12.

#44 — Report: Communicating business impact and insights

Rubric line — listen ~30 s
The report is clear, well-organised, and engaging to facilitate learning from the analysis.

Structure: intro, approach, findings, recs, caveats.

#45 — Report: Communicating business impact and insights

Rubric line — listen ~30 s
Conclusions drawn are clearly supported by the data.

Every claim cites a figure, table, or cell.

#46 — Report: Communicating business impact and insights

Rubric line — listen ~30 s
The code is well-organised and well-presented.

v3_01 through v3_12 scripts + submission notebook.

#47 — Report: Communicating business impact and insights

Rubric line — listen ~30 s
The report captures and summarises the comments requested in earlier steps.

Phase-by-phase commentary mirrors rubric 3.5, 4.4, 5.7, 6.5, 7.4.

#48 — Report: Communicating business impact and insights

Rubric line — listen ~30 s
The report is comprised of final insights, based on the output obtained from the various models employed.

Severity × frequency matrix synthesises BERTopic + LDA + emotion + LLM.

Top 20 cohort learnings

Synthesis of Q&A + WhatsApp cohort · ~5 min

Full written list with sources and quotes: v3/output/pace_top20_tips.md.

Closing

~30 s · the four things to remember

Where everything lives

Generated by v3/build_audio_deck_html.py. MP3s via v3/generate_rubric_audio.py (edge-tts, en-GB-SoniaNeural).
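
A hedged sketch of how one clip could be produced with edge-tts, using the voice and the -5% rate mentioned above. This is not the contents of `v3/generate_rubric_audio.py`; `synthesise` and `rate_string` are hypothetical names, and the call needs network access.

```python
import asyncio


def rate_string(percent: int) -> str:
    """Format a speed change as a signed percentage, e.g. -5 -> '-5%'."""
    return f"{percent:+d}%"


async def synthesise(text, out_path, voice="en-GB-SoniaNeural", rate_pct=-5):
    """Write one MP3 for one rubric item (contacts the speech service)."""
    import edge_tts  # lazy import; needs `pip install edge-tts`

    communicate = edge_tts.Communicate(text, voice, rate=rate_string(rate_pct))
    await communicate.save(out_path)


# Example call, left commented out because it needs network access:
# asyncio.run(synthesise("Overview of the rubric.", "00_overview.mp3"))
print(rate_string(-5))
```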