Restricted · PACE Audio Deck

Pierre's rubric walk-through.
DS301 NLP project · private preview

PACE DS301 — rubric audio deck

Pierre Sutherland · 48 rubric items, one audio clip each · ~1 hour 40 min total · UK English, Sonia voice · click play-all for autoplay, or jump by section.

48 / 48 rubric items
~1h 40m total audio
~2 min per item
10 min overview
976 report words
Intro
Overview — ~10 min
Importing packages and data
#1 — Import the data file Google_12_months.xlsx into a dataframe.
#2 — Import the data file Trustpilot_12_months.xlsx into a dataframe.
#3 — Remove any rows with missing values in the Comment column (Google review) and Review Content column (Trustpilot).
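A minimal sketch of items 1-3, assuming the file and column names quoted in the brief (`Google_12_months.xlsx`, `Comment`, `Review Content`) and pandas as the dataframe library:

```python
import pandas as pd

def drop_missing_reviews(df: pd.DataFrame, text_col: str) -> pd.DataFrame:
    """Drop rows whose review text is missing, then reset the index."""
    return df.dropna(subset=[text_col]).reset_index(drop=True)

# File and column names follow the brief's wording:
# google = drop_missing_reviews(pd.read_excel("Google_12_months.xlsx"), "Comment")
# trustpilot = drop_missing_reviews(pd.read_excel("Trustpilot_12_months.xlsx"), "Review Content")
```

Reading the workbooks requires `openpyxl` alongside pandas; the dropna helper is the only part shown executing.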
Conducting initial data investigation
#4 — Find the number of unique locations in the Google data set. Find the number of unique locations in the Trustpilot data set. Use Club's Name for the Google data set. Use Location Name for the Trustpilot data set.
#5 — Find the number of common locations between the Google data set and the Trustpilot data set.
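Items 4-5 reduce to set arithmetic on the two location columns. A sketch, assuming `google` and `trustpilot` are the cleaned dataframes and the column names are those named in item 4:

```python
def location_stats(google, trustpilot,
                   g_col="Club's Name", t_col="Location Name") -> dict:
    """Count unique locations per data set and the overlap between them."""
    g_locs = set(google[g_col].dropna())
    t_locs = set(trustpilot[t_col].dropna())
    return {
        "google_unique": len(g_locs),
        "trustpilot_unique": len(t_locs),
        "common": len(g_locs & t_locs),  # set intersection
    }
```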
#6 — Perform preprocessing of the data – change to lower case, remove stopwords using NLTK, and remove numbers.
#7 — Tokenise the data using word_tokenize from NLTK.
#8 — Find the frequency distribution of the words from each data set's reviews separately. You can use nltk.FreqDist.
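Note the class is `nltk.FreqDist` (capital F). A sketch over a stand-in token list; in the project the tokens would come from the preprocessing step:

```python
from nltk import FreqDist

tokens = ["gym", "staff", "gym", "clean", "staff", "gym"]  # stand-in for real tokens
fdist = FreqDist(tokens)
top_words = fdist.most_common(10)  # (word, count) pairs, most frequent first
# fdist.plot(10) draws the same counts; a matplotlib bar plot also satisfies item 9
```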
#9 — Plot a histogram/bar plot showing the top 10 words from each data set.
#10 — Use the wordcloud library on the cleaned data and plot the word cloud.
#11 — Create a new dataframe by filtering out the data to extract only the negative reviews from both data sets.
• For Google reviews, overall scores < 3 can be considered negative scores.
• For Trustpilot reviews, stars < 3 can be considered negative scores.
Repeat the frequency distribution and wordcloud steps on the filtered data consisting of only negative reviews.
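The negative-review filter in item 11 is one comparison; a sketch, where the score column names are assumptions inferred from the brief's wording ("overall scores" and "stars"):

```python
import pandas as pd

def negative_reviews(df: pd.DataFrame, score_col: str, threshold: float = 3) -> pd.DataFrame:
    """Keep rows whose score is below the threshold (scores < 3 count as negative)."""
    return df[df[score_col] < threshold].copy()

# Score column names are assumptions; check the actual headers in the files:
# neg_google = negative_reviews(google, "Overall Score")
# neg_trustpilot = negative_reviews(trustpilot, "Stars")
```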
Conducting initial topic modelling
#12 — With the data frame created in the previous step:
• Filter out the reviews that are from the locations common to both data sets.
• Merge the reviews to form a new list.
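A sketch of the item 12 merge, assuming the negative-review dataframes from item 11 and the column names given in item 4:

```python
def reviews_for_common_locations(google, trustpilot,
                                 g_loc="Club's Name", t_loc="Location Name",
                                 g_text="Comment", t_text="Review Content") -> list[str]:
    """Keep only reviews from locations present in both data sets, merged into one list."""
    common = set(google[g_loc]) & set(trustpilot[t_loc])
    return (google.loc[google[g_loc].isin(common), g_text].tolist()
            + trustpilot.loc[trustpilot[t_loc].isin(common), t_text].tolist())
```

The resulting list of strings is the document set BERTopic consumes in item 13.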
#13 — Preprocess this data set. Use BERTopic on this cleaned data set.
#14 — Output: List out the top topics along with their document frequencies.
#15 — For the top 2 topics, list out the top words.
#16 — Show an interactive visualisation of the topics to identify the cluster of topics and to understand the intertopic distance map.
#17 — Show a barchart of the topics, displaying the top 5 words in each topic.
#18 — Plot a heatmap, showcasing the similarity matrix.
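Items 16-18 map directly onto three BERTopic methods. A sketch that assumes `topic_model` is an already-fitted `bertopic.BERTopic` instance (fitting itself is omitted because it downloads a sentence-transformer model):

```python
def make_topic_figures(topic_model, n_words: int = 5) -> dict:
    """Collect the three interactive Plotly figures items 16-18 ask for.

    `topic_model` is assumed to be a fitted bertopic.BERTopic instance,
    e.g. the result of BERTopic().fit(docs) on the merged review list.
    """
    return {
        "distance_map": topic_model.visualize_topics(),               # intertopic distance map
        "barchart": topic_model.visualize_barchart(n_words=n_words),  # top words per topic
        "heatmap": topic_model.visualize_heatmap(),                   # topic similarity matrix
    }

# In a notebook, each returned figure renders with .show()
```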
#19 — For 10 clusters, provide a brief description in the Notebook of the topics they comprise, along with the general theme of the cluster, evidenced by the top words within each cluster's topics.
Performing further data investigation
#20 — List out the top 20 locations with the highest number of negative reviews. Do this separately for Google and Trustpilot's reviews, and comment on the result. Are the locations roughly similar in both data sets?
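The item 20 ranking is a `value_counts` over the negative subset. A sketch, assuming `neg_df` is a negative-reviews dataframe from item 11:

```python
def top_negative_locations(neg_df, loc_col: str, n: int = 20):
    """Locations ranked by number of negative reviews, highest first."""
    return neg_df[loc_col].value_counts().head(n)

# top_negative_locations(neg_google, "Club's Name")
# top_negative_locations(neg_trustpilot, "Location Name")
```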
#21 — Merge the 2 data sets using Location Name and Club's Name. Now, list out the following:
• Locations
• Number of Trustpilot reviews for this location
• Number of Google reviews for this location
• Total number of reviews for this location (sum of Google reviews and Trustpilot reviews)
Sort based on the total number of reviews.
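A sketch of the item 21 table: count reviews per location in each data set, keep only locations common to both (inner join), and sort by the combined total. Column names are those from item 4:

```python
import pandas as pd

def review_counts(google, trustpilot,
                  g_loc="Club's Name", t_loc="Location Name") -> pd.DataFrame:
    """Per-location review counts for both sources, sorted by combined total."""
    g = google.groupby(g_loc).size().rename("google_reviews")
    t = trustpilot.groupby(t_loc).size().rename("trustpilot_reviews")
    merged = pd.concat([g, t], axis=1, join="inner")  # keep common locations only
    merged["total_reviews"] = merged["google_reviews"] + merged["trustpilot_reviews"]
    return merged.sort_values("total_reviews", ascending=False)
```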
#22 — For the top 30 locations, redo the word frequency and word cloud. Comment on the results, and highlight if the results are different from the first run.
#23 — For the top 30 locations, combine the reviews from Google and Trustpilot and run them through BERTopic. Comment on the following:
• Are the results any different from the first run of BERTopic?
• If so, what has changed?
• Are there any additional insights compared to the first run?
Conducting emotion analysis
#24 — Import the BERT model bhadresh-savani/bert-base-uncased-emotion from Hugging Face, and set up a pipeline for text classification.
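Items 24-26 can be structured around one helper that works with any Hugging Face classification pipeline. The pipeline construction is shown only in a comment because the model download is large; the docstring shows the assumed setup:

```python
def top_emotion(classifier, text: str) -> str:
    """Return the highest-scoring emotion label for one review.

    `classifier` is assumed to be a Hugging Face pipeline, e.g.:
        from transformers import pipeline
        classifier = pipeline("text-classification",
                              model="bhadresh-savani/bert-base-uncased-emotion",
                              top_k=None)  # score every emotion class, not just the best
    """
    scores = classifier(text)
    if scores and isinstance(scores[0], list):  # some versions nest per-input results
        scores = scores[0]
    return max(scores, key=lambda s: s["score"])["label"]

# Capturing the top emotion per review (item 26):
# google["top_emotion"] = google["Comment"].map(lambda r: top_emotion(classifier, r))
```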
#25 — With the help of an example sentence, run the model and display the different emotion classifications that the model outputs.
#26 — Run this model on both data sets, and capture the top emotion for each review.
#27 — Use a bar plot to show the top emotion distribution for all negative reviews in both data sets.
#28 — Extract all the negative reviews (from both data sets) where anger is the top emotion.
#29 — Run BERTopic on the output of the previous step.
#30 — Visualise the clusters from this run. Comment on whether it is any different from the previous runs, and whether it is possible to narrow down the primary issues that have led to an angry review.
Using a large language model from Hugging Face
#31 — Load the following model: tiiuae/falcon-7b-instruct. Set up the pipeline for text generation with a max length of 1,000 for each review.
#32 — Add the following prompt to every review, before passing it on to the model: "In the following customer review, pick out the main 3 topics. Return them in a numbered list format, with each one on a new line." Run the model. Note: If the execution time is too high, you can use a subset of the bad reviews (instead of the full set) to run this model.
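A sketch of the prompt assembly for items 31-32. The prompt text is the one quoted in the brief; the generation pipeline is shown only in comments because Falcon-7B is a multi-gigabyte download:

```python
TOPIC_PROMPT = ("In the following customer review, pick out the main 3 topics. "
                "Return them in a numbered list format, with each one on a new line.\n\n")

def build_prompt(review: str) -> str:
    """Prepend the brief's topic-extraction instruction to one review."""
    return TOPIC_PROMPT + review.strip()

# The generation pipeline itself (commented out due to model size):
# from transformers import pipeline
# generator = pipeline("text-generation", model="tiiuae/falcon-7b-instruct",
#                      max_length=1000)
# outputs = [generator(build_prompt(r))[0]["generated_text"] for r in bad_reviews]
```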
#33 — The output of the model will be the top 3 topics from each review. Append each of these topics from each review to create a comprehensive list.
#34 — Use this list as input to run BERTopic again.
#35 — Comment on the output of BERTopic. Highlight any changes or improvements, and whether any further insights have been obtained.
#36 — Use the comprehensive list from Step 3. Pass it to the model as the input, but prefix the following to the prompt: "For the following text topics obtained from negative customer reviews, can you give some actionable insights that would help this gym company?" Run the Falcon-7b-Instruct model.
#37 — List the output, ideally in the form of suggestions, that the company can employ to address customer concerns.
Using Gensim
#38 — Perform the preprocessing required to run the LDA model from Gensim. Use the list of negative reviews (combined Google and Trustpilot reviews).
#39 — Using Gensim, perform LDA on the tokenised data. Specify the number of topics = 10.
#40 — Show the visualisations of the topics, displaying the distance maps and the bar chart listing out the most salient terms.
#41 — Comment on the output and whether it is similar to other techniques, and whether any extra insights were obtained.
Report: Communicating business impact and insights
#42 — The report is between 800–1000 words.
#43 — The report documents the approach used.
#44 — The report is clear, well-organised, and engaging to facilitate learning from the analysis.
#45 — Conclusions drawn are clearly supported by the data.
#46 — The code is well-organised and well-presented.
#47 — The report captures and summarises the comments requested in earlier steps.
#48 — The report comprises final insights, based on the output obtained from the various models employed.
Learnings
Top 20 cohort learnings — ~5 min
Closing
Closing — ~90 s