Nadia · NLP specialist · "The work": what got submitted
Rubric audio deck
Clinical walk-through of 48 rubric items. ~30 seconds each; start here, skip around as needed. UK English (Sonia voice), -5% rate for walk listening.
Overview · 00_overview.mp3
Importing packages and data
#1 — Importing packages and data
Rubric line — listen ~30 s
Import the data file Google_12_months.xlsx into a dataframe.
Loaded with pandas. 13,898 rows after NaN drop.
#2 — Importing packages and data
Rubric line — listen ~30 s
Import the data file Trustpilot_12_months.xlsx into a dataframe.
Trustpilot separates Title from Content — we merge them in Phase 2.
#3 — Importing packages and data
Rubric line — listen ~30 s
Remove any rows with missing values in the Comment column (Google review) and Review Content column (Trustpilot).
dropna on Comment / Review Content.
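A minimal sketch of the NaN drop, assuming pandas. The two small frames are stand-ins for the real spreadsheets (which would come from pd.read_excel); only the column names Comment and Review Content are taken from the rubric.

```python
import pandas as pd

# Stand-in frames; the real ones would be loaded with pd.read_excel(...).
google = pd.DataFrame({
    "Club's Name": ["A", "B", "C"],
    "Comment": ["Too busy", None, "Dirty showers"],
})
trustpilot = pd.DataFrame({
    "Location Name": ["a", "b"],
    "Review Content": [None, "Cancelled, still charged"],
})

# Drop only the rows where the review text itself is missing.
google_clean = google.dropna(subset=["Comment"]).reset_index(drop=True)
trustpilot_clean = trustpilot.dropna(subset=["Review Content"]).reset_index(drop=True)

print(len(google_clean), len(trustpilot_clean))
```

Passing subset keeps rows that are missing values in other columns, which matches the rubric wording.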
Conducting initial data investigation
#4 — Conducting initial data investigation
Rubric line — listen ~30 s
Find the number of unique locations in the Google data set.
Find the number of unique locations in the Trustpilot data set.
Use Club's Name for the Google data set.
Use Location Name for the Trustpilot data set.
Google 512 clubs, Trustpilot 376 — different casing, normalise first.
#5 — Conducting initial data investigation
Rubric line — listen ~30 s
Find the number of common locations between the Google data set and the Trustpilot data set.
310 shared clubs.
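One way to count unique and shared clubs after normalising case, sketched with pandas. The club names other than London Stratford are made up for illustration, and the normalise helper is hypothetical; the submitted notebook may normalise differently.

```python
import pandas as pd

def normalise(names: pd.Series) -> pd.Series:
    """Lower-case and trim club names so the two platforms compare cleanly."""
    return names.astype(str).str.strip().str.lower()

# Illustrative rows only.
google = pd.DataFrame({"Club's Name": ["London Stratford", "Leeds City ", "Bath"]})
trustpilot = pd.DataFrame({"Location Name": ["london stratford", "LEEDS CITY", "York"]})

google_clubs = set(normalise(google["Club's Name"]))
trustpilot_clubs = set(normalise(trustpilot["Location Name"]))
common = google_clubs & trustpilot_clubs  # set intersection = shared locations

print(len(google_clubs), len(trustpilot_clubs), len(common))
```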
#6 — Conducting initial data investigation
Rubric line — listen ~30 s
Perform preprocessing of the data – change to lower case, remove stopwords using NLTK, and remove numbers.
NLTK pipeline plus domain stopwords: pure, puregym, gym.
#7 — Conducting initial data investigation
Rubric line — listen ~30 s
Tokenise the data using word_tokenize from NLTK.
nltk.tokenize.word_tokenize handles contractions.
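A sketch of the cleaning step from items 6 and 7. The tokenizer and stopword list are passed in, so the demo runs with small stand-ins; load_nltk_tools shows where NLTK's word_tokenize and English stopwords would plug in (it needs the NLTK tokenizer and stopword data downloaded). The helper names are hypothetical.

```python
import re

DOMAIN_STOPWORDS = {"pure", "puregym", "gym"}  # domain terms from the note above

def load_nltk_tools():
    """Return (word_tokenize, stopword set). Requires NLTK and its data files."""
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    return word_tokenize, set(stopwords.words("english")) | DOMAIN_STOPWORDS

def clean_tokens(text, tokenize, stop_words):
    """Lower-case, strip digits, tokenise, then drop stopwords and non-words."""
    text = re.sub(r"\d+", " ", text.lower())
    tokens = tokenize(text)
    # isalpha() also drops punctuation and contraction fragments such as "n't".
    return [t for t in tokens if t.isalpha() and t not in stop_words]

# Demo with stand-ins (str.split and a tiny stopword set), not NLTK itself.
demo_stop = {"the", "is", "at", "and"} | DOMAIN_STOPWORDS
demo = clean_tokens("The PureGym at Leeds is closed 24 hours and dirty",
                    str.split, demo_stop)
print(demo)
```

With NLTK installed, the real call would be clean_tokens(text, *load_nltk_tools()).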
#8 — Conducting initial data investigation
Rubric line — listen ~30 s
Find the frequency distribution of the words from each data set's reviews separately. You can use nltk.FreqDist.
FreqDist, saved as CSV. Run once for all, once for negatives.
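A small sketch of the frequency count and CSV export. The token list is illustrative. FreqDist subclasses collections.Counter, so the fallback import behaves the same for this usage if NLTK is not installed; the CSV is written to an in-memory buffer here rather than a file.

```python
import csv
import io

try:
    from nltk import FreqDist  # the class the rubric names
except ImportError:
    from collections import Counter as FreqDist  # same most_common() behaviour

# Illustrative tokens, as they might look after cleaning.
tokens = ["equipment", "broken", "equipment", "staff",
          "equipment", "staff", "showers"]

freq = FreqDist(tokens)
top = freq.most_common(2)
print(top)

# Write word/count rows; swap the buffer for open("freq.csv", "w", newline="").
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["word", "count"])
writer.writerows(freq.most_common())
```

Running the same code on the negative-only tokens gives the second pass the note mentions.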
#9 — Conducting initial data investigation
Rubric line — listen ~30 s
Plot a histogram/bar plot showing the top 10 words from each data set.
Horizontal bar plots — long labels don't fit vertically.
#10 — Conducting initial data investigation
Rubric line — listen ~30 s
Use the wordcloud library on the cleaned data and plot the word cloud.
WordCloud. Visually cheap, analytically thin.
#11 — Conducting initial data investigation
Rubric line — listen ~30 s
Create a new dataframe by filtering out the data to extract only the negative reviews from both data sets.
• For Google reviews, overall scores < 3 can be considered negative scores.
• For Trustpilot reviews, stars < 3 can be considered negative scores.
Repeat the frequency distribution and wordcloud steps on the filtered data consisting of only negative reviews.
Filter the ORIGINAL dataframe, not the tokenised data. Instructor was explicit.
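The filter itself is a boolean mask on the original frames. In this sketch the score column names (Overall Score, Review Stars) are assumptions; substitute whatever the spreadsheets actually call them.

```python
import pandas as pd

# Stand-in frames with hypothetical score column names.
google = pd.DataFrame({
    "Comment": ["Filthy changing rooms", "Great club", "Always overcrowded"],
    "Overall Score": [1, 5, 2],
})
trustpilot = pd.DataFrame({
    "Review Content": ["It was fine", "Charged after cancelling"],
    "Review Stars": [3, 1],
})

# Scores below 3 count as negative on both platforms.
google_neg = google[google["Overall Score"] < 3].copy()
trustpilot_neg = trustpilot[trustpilot["Review Stars"] < 3].copy()

print(len(google_neg), len(trustpilot_neg))
```

Filtering the raw frames first keeps the untouched review text available for the later BERTopic and emotion steps.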
Conducting initial topic modelling
#12 — Conducting initial topic modelling
Rubric line — listen ~30 s
With the data frame created in the previous step:
• Filter out the reviews that are from the locations common to both data sets.
• Merge the reviews to form a new list.
Intersection of club names. Sort by creation date before BERTopic fit.
#13 — Conducting initial topic modelling
Rubric line — listen ~30 s
Preprocess this data set. Use BERTopic on this cleaned data set.
Feed BERTopic RAW full sentences, not tokens.
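A sketch of how items 12 and 13 could fit together: keep negative reviews from shared clubs, sort by date, and hand raw sentences to BERTopic. The date column names (Creation Date, Review Created) and both helper names are assumptions. The BERTopic import is kept inside fit_topics because it is heavy, and the demo does not call it since a corpus this small is too small to fit.

```python
import pandas as pd

def build_corpus(google_neg, trustpilot_neg, common_clubs):
    """Return raw review texts from shared clubs, oldest first."""
    g = pd.DataFrame({
        "club": google_neg["Club's Name"].str.strip().str.lower(),
        "text": google_neg["Comment"],
        "created": pd.to_datetime(google_neg["Creation Date"]),
    })
    t = pd.DataFrame({
        "club": trustpilot_neg["Location Name"].str.strip().str.lower(),
        "text": trustpilot_neg["Review Content"],
        "created": pd.to_datetime(trustpilot_neg["Review Created"]),
    })
    merged = pd.concat([g, t], ignore_index=True)
    merged = merged[merged["club"].isin(list(common_clubs))]
    return merged.sort_values("created")["text"].tolist()

def fit_topics(docs):
    """Fit BERTopic on raw sentences (no tokenising or stopword removal)."""
    from bertopic import BERTopic  # lazy import: pulls in embedding models
    model = BERTopic()
    topics, _ = model.fit_transform(docs)
    return model, topics

# Illustrative rows only.
google_neg = pd.DataFrame({
    "Club's Name": ["London Stratford", "Bath"],
    "Comment": ["Machines broken for weeks.", "No parking."],
    "Creation Date": ["2024-03-01", "2024-01-15"],
})
trustpilot_neg = pd.DataFrame({
    "Location Name": ["london stratford"],
    "Review Content": ["Charged after cancelling."],
    "Review Created": ["2024-02-10"],
})
docs = build_corpus(google_neg, trustpilot_neg, {"london stratford"})
print(docs)
```

On the full corpus, fit_topics(docs) returns the model, and model.get_topic_info() lists topics with their document counts.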
#14 — Conducting initial topic modelling
Rubric line — listen ~30 s
Output: List out the top topics along with their document frequencies.
Show an interactive visualisation of the topics to identify the cluster of topics and to understand the intertopic distance map.
visualize_topics produces a plotly HTML distance map.
#17 — Conducting initial topic modelling
Rubric line — listen ~30 s
Show a barchart of the topics, displaying the top 5 words in each topic.
visualize_barchart for top-10 topics, 6 words each.
#18 — Conducting initial topic modelling
Rubric line — listen ~30 s
Plot a heatmap, showcasing the similarity matrix.
Similarity heatmap — close clusters are merge candidates.
#19 — Conducting initial topic modelling
Rubric line — listen ~30 s
For 10 clusters, provide a brief description in the Notebook of the topics they comprise along with the general theme of the cluster, evidenced by the top words within each cluster's topics.
Appendix A maps clusters to English themes.
Performing further data investigation
#20 — Performing further data investigation
Rubric line — listen ~30 s
List out the top 20 locations with the highest number of negative reviews. Do this separately for Google and Trustpilot's reviews, and comment on the result. Are the locations roughly similar in both data sets?
London Stratford, Leicester Walnut Street, London Enfield — worst 3.
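A sketch of the per-platform ranking, assuming the negative-review frames already exist. The rows are illustrative and top_locations is a hypothetical helper; the overlap set is one simple way to answer the "roughly similar?" question.

```python
import pandas as pd

def top_locations(neg_df, column, n=20):
    """Count negative reviews per club (case-normalised) and keep the top n."""
    return neg_df[column].str.strip().str.lower().value_counts().head(n)

# Illustrative rows only.
google_neg = pd.DataFrame({"Club's Name": [
    "London Stratford", "London Stratford", "Bath", "london stratford "]})
trustpilot_neg = pd.DataFrame({"Location Name": [
    "Bath", "London Stratford", "Bath"]})

g_top = top_locations(google_neg, "Club's Name")
t_top = top_locations(trustpilot_neg, "Location Name")

# Clubs that appear in both top lists.
overlap = set(g_top.index) & set(t_top.index)
print(g_top.index[0], t_top.index[0], sorted(overlap))
```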
#21 — Performing further data investigation
Rubric line — listen ~30 s
Merge the 2 data sets using Location Name and Club's Name.
Now, list out the following:
• Locations
• Number of Trustpilot reviews for this location
• Number of Google reviews for this location
• Total number of reviews for this location (sum of Google reviews and Trustpilot reviews)
Sort based on the total number of reviews.
Full outer join, total reviews per club.
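The outer join can be sketched as per-club counts merged on a normalised location key. Rows and the counts helper are illustrative. Clubs present on only one platform get a zero for the other, which is the point of the full outer join.

```python
import pandas as pd

def counts(df, column, out_name):
    """Reviews per normalised club name, as a two-column frame."""
    loc = df[column].str.strip().str.lower()
    return (loc.value_counts()
               .rename_axis("location")
               .reset_index(name=out_name))

# Illustrative rows only.
google = pd.DataFrame({"Club's Name": ["London Stratford", "London Stratford", "Bath"]})
trustpilot = pd.DataFrame({"Location Name": ["london stratford", "York"]})

summary = counts(trustpilot, "Location Name", "trustpilot_reviews").merge(
    counts(google, "Club's Name", "google_reviews"),
    on="location", how="outer")

count_cols = ["trustpilot_reviews", "google_reviews"]
summary[count_cols] = summary[count_cols].fillna(0).astype(int)
summary["total_reviews"] = summary[count_cols].sum(axis=1)
summary = summary.sort_values("total_reviews", ascending=False).reset_index(drop=True)
print(summary)
```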
#22 — Performing further data investigation
Rubric line — listen ~30 s
For the top 30 locations, redo the word frequency and word cloud. Comment on the results, and highlight if the results are different from the first run.
Long tail collapses — hygiene + equipment dominate top 30.
#23 — Performing further data investigation
Rubric line — listen ~30 s
For the top 30 locations, combine the reviews from Google and Trustpilot and run them through BERTopic.
Comment on the following:
• Are the results any different from the first run of BERTopic?
• If so, what has changed?
• Are there any additional insights compared to the first run?
64 topics vs 191 for the full set. Dominant labels stabilise.
Conducting emotion analysis
#24 — Conducting emotion analysis
Rubric line — listen ~30 s
Import the BERT model bhadresh-savani/bert-base-uncased-emotion from Hugging Face, and set up a pipeline for text classification.
HF pipeline, 6 Ekman classes. Model was tweet-trained.
#25 — Conducting emotion analysis
Rubric line — listen ~30 s
With the help of an example sentence, run the model and display the different emotion classifications that the model outputs.
Appendix A has a demo cell with full probability distribution.
#26 — Conducting emotion analysis
Rubric line — listen ~30 s
Run this model on both data sets, and capture the top emotion for each review.
Full run on 11,851 English negatives. Keep the full dist, not just top-1.
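A sketch of the pipeline setup and the top-1 extraction. The model is loaded lazily inside load_emotion_classifier because it downloads weights on first use. The example scores are illustrative and shaped like one pipeline result; they are not real model output. Both helper names are hypothetical.

```python
def load_emotion_classifier():
    """Build the Hugging Face text-classification pipeline for the rubric model."""
    from transformers import pipeline  # lazy import: heavy, downloads weights
    return pipeline(
        "text-classification",
        model="bhadresh-savani/bert-base-uncased-emotion",
        top_k=None,  # ask for a score for every class, not just the best one
    )

def top_emotion(scores):
    """scores: list of {"label": str, "score": float} dicts for one review."""
    return max(scores, key=lambda s: s["score"])["label"]

# Illustrative distribution only (not produced by the model).
example = [
    {"label": "anger", "score": 0.71},
    {"label": "sadness", "score": 0.20},
    {"label": "fear", "score": 0.05},
    {"label": "joy", "score": 0.02},
    {"label": "surprise", "score": 0.01},
    {"label": "love", "score": 0.01},
]
print(top_emotion(example))
```

Storing the whole list per review alongside the top label keeps the option of thresholding or inspecting runner-up emotions later.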
#27 — Conducting emotion analysis
Rubric line — listen ~30 s
Use a bar plot to show the top emotion distribution for all negative reviews in both data sets.
Stacked by platform. Bhadresh over-labels sadness.
#28 — Conducting emotion analysis
Rubric line — listen ~30 s
Extract all the negative reviews (from both data sets) where anger is top emotion.
Anger subset filter on is_anger column.
#29 — Conducting emotion analysis
Rubric line — listen ~30 s
Run BERTopic on the output of the previous step.
BERTopic on anger — narrows to equipment, cleaning, staff, billing.
#30 — Conducting emotion analysis
Rubric line — listen ~30 s
Visualise the clusters from this run. Comment on whether it is any different from the previous runs, and whether it is possible to narrow down the primary issues that have led to an angry review.
Yes, you can narrow anger drivers to 3 themes. Phase 8 commentary.
Using a large language model from Hugging Face
#31 — Using a large language model from Hugging Face
Rubric line — listen ~30 s
Load the following model: tiiuae/falcon-7b-instruct. Set the pipeline for text generation and a max length of 1,000 for each review.
Falcon → Qwen2.5-72B via HF Inference. 600 reviews in 2 min, 100% success.
#32 — Using a large language model from Hugging Face
Rubric line — listen ~30 s
Add the following prompt to every review, before passing it on to the model: In the following customer review, pick out the main 3 topics. Return them in a numbered list format, with each one on a new line.
Run the model.
Note: If the execution time is too high, you can use a subset of the bad reviews (instead of the full set) to run this model.
Verbatim rubric prompt. Parse as numbered list.
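A sketch of the prompt construction and numbered-list parsing. The prompt text is the rubric's; the reply string is a made-up example of what a model might return, and the helper names and regex are assumptions about how the parsing could be done.

```python
import re

PROMPT = ("In the following customer review, pick out the main 3 topics. "
          "Return them in a numbered list format, with each one on a new line.")

def build_prompt(review):
    """Prepend the rubric prompt to one review."""
    return f"{PROMPT}\n\n{review}"

# Matches lines such as "1. Topic" or "2) Topic".
ITEM = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")

def parse_topics(reply, limit=3):
    """Pull numbered items out of a model reply, ignoring any other lines."""
    topics = []
    for line in reply.splitlines():
        match = ITEM.match(line)
        if match:
            topics.append(match.group(1))
    return topics[:limit]

# Made-up reply for illustration.
reply = "1. Broken equipment\n2) Rude staff\nSome chatter\n3. Billing errors\n"
parsed = parse_topics(reply)
print(parsed)
```

Extending a running list with each review's parsed topics builds the comprehensive list used in the later steps.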
#33 — Using a large language model from Hugging Face
Rubric line — listen ~30 s
The output of the model will be the top 3 topics from each review. Append each of these topics from each review to create a comprehensive list.
1,433 unique phrases from 600 reviews.
#34 — Using a large language model from Hugging Face
Rubric line — listen ~30 s
Use this list as input to run BERTopic again.
Meta-BERTopic. Colab uses real BERTopic; local uses KMeans (no HDBSCAN wheel).
#35 — Using a large language model from Hugging Face
Rubric line — listen ~30 s
Comment about the output of BERTopic. Highlight any changes, improvements, and if any further insights have been obtained.
Fewer clusters, cleaner labels. Overcrowding-physical vs -management split apart.
#36 — Using a large language model from Hugging Face
Rubric line — listen ~30 s
Use the comprehensive list from Step 3.
Pass it to the model as the input, but prefix the following to the prompt: For the following text topics obtained from negative customer reviews, can you give some actionable insights that would help this gym company?
Run the Falcon-7b-Instruct model.
Verbatim prompt prefix. 6 suggestions tagged with owner + cost.
#37 — Using a large language model from Hugging Face
Rubric line — listen ~30 s
List the output, ideally in the form of suggestions, that the company can employ to address customer concerns.
v3_11_qwen_insights.md, rendered inline in Appendix A.
Using Gensim
#38 — Using Gensim
Rubric line — listen ~30 s
Perform the preprocessing required to run the LDA model from Gensim. Use the list of negative reviews (combined Google and Trustpilot reviews).
Colab A100, ~3 min. Heavy preprocessing (lemmatise, bigrams).
#39 — Using Gensim
Rubric line — listen ~30 s
Using Gensim, perform LDA on the tokenised data. Specify the number of topics = 10.
num_topics=10. LdaMulticore.
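A sketch of the Gensim fit with num_topics=10. fit_lda imports Gensim lazily and is not called in the demo; passes, workers and random_state are illustrative choices, not the submitted settings. describe_topics is a hypothetical helper that flattens the output of lda.show_topics(formatted=False), shown here on a made-up sample.

```python
def fit_lda(token_lists, num_topics=10, workers=2):
    """Build the dictionary and bag-of-words corpus, then fit LdaMulticore."""
    from gensim.corpora import Dictionary   # lazy import: Gensim is optional here
    from gensim.models import LdaMulticore
    dictionary = Dictionary(token_lists)
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    lda = LdaMulticore(corpus=corpus, id2word=dictionary,
                       num_topics=num_topics, workers=workers,
                       passes=5, random_state=42)  # illustrative settings
    return lda, corpus, dictionary

def describe_topics(shown):
    """Map topic id to its top words, given show_topics(formatted=False) output."""
    return {topic_id: [word for word, _ in terms] for topic_id, terms in shown}

# Made-up sample in the shape show_topics(formatted=False) returns.
sample = [
    (0, [("equipment", 0.08), ("broken", 0.05)]),
    (1, [("membership", 0.07), ("cancel", 0.06)]),
]
summary = describe_topics(sample)
print(summary)
```

After a real fit, describe_topics(lda.show_topics(num_topics=10, formatted=False)) gives a quick text view before opening pyLDAvis.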
#40 — Using Gensim
Rubric line — listen ~30 s
Show the visualisations of the topics, displaying the distance maps and the bar chart listing out the most salient terms.
pyLDAvis interactive HTML — hover + relevance lambda.
#41 — Using Gensim
Rubric line — listen ~30 s
Comment on the output and whether it is similar to other techniques, and whether any extra insights were obtained.
LDA is smoother, BERTopic sharper. LDA surfaced a Danish-language topic that BERTopic filtered out.
Report: Communicating business impact and insights
#42 — Report: Communicating business impact and insights
Rubric line — listen ~30 s
The report is between 800–1000 words.
976 words — in range.
#43 — Report: Communicating business impact and insights
Rubric line — listen ~30 s
The report documents the approach used.
Approach walks through phases 1-12.
#44 — Report: Communicating business impact and insights
Rubric line — listen ~30 s
The report is clear, well-organised, and engaging to facilitate learning from the analysis.