[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]
The Public Relations Department has provided us with a dataset of customer reviews of their commercial products. The team wants to predict whether customers are satisfied and has asked us to develop a predictive model.
1 – Import libraries and data visualization
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
reviews_df = pd.read_csv("amazon_alexa.tsv", sep = "\t")
reviews_df
| | rating | date | variation | verified_reviews | feedback |
---|---|---|---|---|---|
0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 |
1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 |
2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 |
... | ... | ... | ... | ... | ... |
3145 | 5 | 30-Jul-18 | Black Dot | Perfect for kids, adults and everyone in betwe... | 1 |
3146 | 5 | 30-Jul-18 | Black Dot | Listening to music, searching locations, check... | 1 |
3147 | 5 | 30-Jul-18 | Black Dot | I do love these things, i have them running my... | 1 |
3148 | 5 | 30-Jul-18 | White Dot | Only complaint I have is that the sound qualit... | 1 |
3149 | 4 | 29-Jul-18 | Black Dot | Good | 1 |
reviews_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3150 entries, 0 to 3149
Data columns (total 5 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   rating            3150 non-null   int64
 1   date              3150 non-null   object
 2   variation         3150 non-null   object
 3   verified_reviews  3150 non-null   object
 4   feedback          3150 non-null   int64
dtypes: int64(2), object(3)
memory usage: 123.2+ KB
reviews_df.describe()
| | rating | feedback |
---|---|---|
count | 3150.000000 | 3150.000000 |
mean | 4.463175 | 0.918413 |
std | 1.068506 | 0.273778 |
min | 1.000000 | 0.000000 |
25% | 4.000000 | 1.000000 |
50% | 5.000000 | 1.000000 |
75% | 5.000000 | 1.000000 |
max | 5.000000 | 1.000000 |
- The mean of feedback is 0.918, so about 92% of the reviews carry positive feedback (feedback = 1). The mean rating is 4.46 out of 5.
reviews_df['verified_reviews']
0       Love my Echo!
1       Loved it!
2       Sometimes while playing a game, you can answer...
3       I have had a lot of fun with this thing. My 4 ...
4       Music
        ...
3145    Perfect for kids, adults and everyone in betwe...
3146    Listening to music, searching locations, check...
3147    I do love these things, i have them running my...
3148    Only complaint I have is that the sound qualit...
3149    Good
Name: verified_reviews, Length: 3150, dtype: object
Missing values
sns.heatmap(reviews_df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
Visualization
reviews_df.hist(bins = 30, figsize = (13, 5), color = 'r')
- We can see that the dataset is fairly unbalanced: there are many more positive than negative evaluations.
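The imbalance can also be read off as numbers rather than from the histogram. The following is a minimal sketch on a small made-up frame (`toy_df` is an assumption for illustration; on the real data the same calls would be applied to `reviews_df['feedback']`):

```python
import pandas as pd

# Toy stand-in for reviews_df: 9 positive reviews and 1 negative one (made-up data).
toy_df = pd.DataFrame({"feedback": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]})

# Absolute count and relative share of each class.
counts = toy_df["feedback"].value_counts()
shares = toy_df["feedback"].value_counts(normalize=True)

print(counts)
print(shares)
```

Knowing the exact share of each class is useful later on, because it tells us what accuracy a trivial model would already reach.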
– We add a column ‘length’ with the number of characters in each review:
reviews_df['length'] = reviews_df['verified_reviews'].apply(len)
reviews_df.head()
| | rating | date | variation | verified_reviews | feedback | length |
---|---|---|---|---|---|---|
0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 | 13 |
1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 | 9 |
2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 | 195 |
3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 | 172 |
4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 | 5 |
reviews_df['length'].plot(bins = 100, kind = 'hist')
- There are many customers with short reviews.
reviews_df[reviews_df['length'] == 2851]['verified_reviews'].iloc[0]
"Incredible piece of technology.I have this right center of my living room on an island kitchen counter. The mic and speaker goes in every direction and the quality of the sound is quite good. I connected the Echo via Bluetooth to my Sony soundbar on my TV but find the Echo placement and 360 sound more appealing. It's no audiophile equipment but there is good range and decent bass. The sound is more than adequate for any indoor entertaining and loud enough to bother neighbors in my building. The knob on the top works great for adjusting volume. This is my first Echo device and I would imagine having to press volume buttons (on the Echo 2) a large inconvenience and not as precise. For that alone I would recommend this over the regular Echo (2nd generation).The piece looks quality and is quite sturdy with some weight on it. The rubber material on the bottom has a good grip on the granite counter-- my cat can even rub her scent on it without tipping it over.This order came with a free Philips Hue Bulb which I installed along with an extra one I bought. I put the 2 bulbs into my living room floor lamp, turned on the light, and all I had to do was say "Alexa, connect my devices". The default names for each bulb was assigned as "First light" and "Second light", so I can have a dimmer floor lamp if I just turned on/off one of the lights by saying "Alexa, turn off the second light". In the Alexa app, I created a 'Group' with "First light" and "Second light" and named the group "The light", so to turn on the lamp with both bulbs shining I just say "Alexa, turn on The light".I was surprised how easily the bulbs connected to the Echo Plus with its built in hub. I thought I would have to buy a hub bridge to connect to my floor lamp power plug. Apparently there is some technology built directly inside the bulb! I was surprised by that. Awesome.You will feel like Tony Stark on this device. I added quite a few "Skills" like 'Thunderstorm sounds' and 'Quote of the day' . 
Alexa always loads them up quickly. Adding songs that you hear to specific playlists on Amazon Music is also a great feature.I can go on and on and this is only my second day of ownership.I was lucky to buy this for $100 on Prime Day, but I think for $150 is it pretty expensive considering the Echo 2 is only $100. In my opinion, you will be paying a premium for the Echo Plus and you have to decide if the value is there for you:1) Taller and 360 sound unit.2) Volume knob on top that you spin (I think this is a huge benefit over buttons)3) Built in hub for Hue bulbs. After researching more, there are some cons to this setup if you plan on having more advanced light setups. For me and my floor lamp, it's just perfect.I highly recommend it and will buy an Echo dot for my bedroom now."
reviews_df.length.describe()
count    3150.000000
mean      132.049524
std       182.099952
min         1.000000
25%        30.000000
50%        74.000000
75%       165.000000
max      2851.000000
Name: length, dtype: float64
reviews_df[reviews_df['length'] == 1]['verified_reviews'].iloc[0]
'?'
reviews_df[reviews_df['length'] == 133]['verified_reviews'].iloc[0]
'Fun item to play with and get used to using. Sometimes has hard time answering the questions you ask, but I think it will be better.'
- The longest review has 2851 characters and the shortest has just 1 (an emoticon, displayed above as '?').
- The average review length is about 132 characters.
Let’s separate the positive (1) and negative (0) feedback:
positive = reviews_df[reviews_df['feedback'] == 1]
negative = reviews_df[reviews_df['feedback'] == 0]
negative
| | rating | date | variation | verified_reviews | feedback | length |
---|---|---|---|---|---|---|
46 | 2 | 30-Jul-18 | Charcoal Fabric | It's like Siri, in fact, Siri answers more acc... | 0 | 163 |
111 | 2 | 30-Jul-18 | Charcoal Fabric | Sound is terrible if u want good music too get... | 0 | 53 |
141 | 1 | 30-Jul-18 | Charcoal Fabric | Not much features. | 0 | 18 |
162 | 1 | 30-Jul-18 | Sandstone Fabric | Stopped working after 2 weeks ,didn't follow c... | 0 | 87 |
176 | 2 | 30-Jul-18 | Heather Gray Fabric | Sad joke. Worthless. | 0 | 20 |
... | ... | ... | ... | ... | ... | ... |
3047 | 1 | 30-Jul-18 | Black Dot | Echo Dot responds to us when we aren't even ta... | 0 | 120 |
3048 | 1 | 30-Jul-18 | White Dot | NOT CONNECTED TO MY PHONE PLAYLIST 🙁 | 0 | 37 |
3067 | 2 | 30-Jul-18 | Black Dot | The only negative we have on this product is t... | 0 | 240 |
3091 | 1 | 30-Jul-18 | Black Dot | I didn’t order it | 0 | 17 |
3096 | 1 | 30-Jul-18 | White Dot | The product sounded the same as the emoji spea... | 0 | 210 |
positive
| | rating | date | variation | verified_reviews | feedback | length |
---|---|---|---|---|---|---|
0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 | 13 |
1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 | 9 |
2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 | 195 |
3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 | 172 |
4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 | 5 |
... | ... | ... | ... | ... | ... | ... |
3145 | 5 | 30-Jul-18 | Black Dot | Perfect for kids, adults and everyone in betwe... | 1 | 50 |
3146 | 5 | 30-Jul-18 | Black Dot | Listening to music, searching locations, check... | 1 | 135 |
3147 | 5 | 30-Jul-18 | Black Dot | I do love these things, i have them running my... | 1 | 441 |
3148 | 5 | 30-Jul-18 | White Dot | Only complaint I have is that the sound qualit... | 1 | 380 |
3149 | 4 | 29-Jul-18 | Black Dot | Good | 1 | 4 |
sns.countplot(x = 'feedback', data = reviews_df)
sns.countplot(x = 'rating', data = reviews_df)
reviews_df['rating'].hist(bins = 5)
- Most consumers give a 5.
– Let’s see the variation field that indicates the type of product:
plt.figure(figsize=(40, 15))
sns.barplot(x = 'variation', y = 'rating', data = reviews_df, palette='deep')
2 – Reviews
All Reviews
– We put the reviews in a single list, so that they can later be joined into one string separated by blank spaces:
sentences = reviews_df['verified_reviews'].tolist()
len(sentences)
3150
print(sentences)
['Love my Echo!', 'Loved it!', 'Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you. I like being able to turn lights on and off while away from home.', 'I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.', 'Music', 'I received the echo as a gift. I needed another Bluetooth or something to play music easily accessible, and found this smart speaker. Can’t wait to see what else it can do.', 'Without having a cellphone, I cannot use many of her features. I have an iPad but do not see that of any use. It IS a great alarm. If u r almost deaf, you can hear her alarm in the bedroom from out in the living room, so that is reason enough to keep her.It is fun to ask random questions to hear her response. She does not seem to be very smartbon politics yet.', "I think this is the 5th one I've purchased. I'm working on getting one in every room of my house. I really like what features they offer specifily playing music on all Echos and controlling the lights throughout my house.", 'looks great', 'Love it! I’ve listened to songs I haven’t heard since childhood! I get the news, weather, information! It’s great!', 'I sent it to my 85 year old Dad, and he talks to it constantly.', "I love it! Learning knew things with it eveyday! Still figuring out how everything works but so far it's been easy to use and understand. 
She does make me laugh at times", 'I purchased this for my mother who is having knee problems now, to give her something to do while trying to over come not getting around so fast like she did.She enjoys all the little and big things it can do...Alexa play this song, What time is it and where, and how to cook this and that!', 'Love, Love, Love!!', 'Just what I expected....', 'I love it, wife hates it.', ...................................................................................... ...................................................................................... ...................................................................................... "I was really happy with my original echo so i thought I'd get an echo dot to use in my bedroom. I was really disappointed in the audio quality so I connected an external speaker via bluetooth. The audio was much better but I started having problems with it loosing connection with the wifi, presumably due to interference from the bluetooth. Then I connected a speaker via the auxiliary jack. when i did that, the auxiliary jack picked up interference from the wifi and I was woken up in the middle of the night by a horrible buzzing sound. im hoping Amazon will take this thing back and give me a good deal on an echo spot which I hope will be a better nightstand device.", "Weak sound. Compared to the Google Home Mini the sound is bad. Also you need Prime to have a small selection of music and need prime music to play everything. Also if you get two you need prime family music to play music on both. I think it's lame that you need a family plan to use multiple devices.", "Echo Dot responds to us when we aren't even talking to it. I've unplugged it. It feels like it's "spying" on us.", 'NOT CONNECTED TO MY PHONE PLAYLIST :(', 'The only negative we have on this product is the terrible sound quality. A massive difference from the Alexa. 
Which to us was a big reason we wanted to purchase this.Won’t be buying another until the speaker and sound quality can improve.', 'I didn’t order it', 'The product sounded the same as the emoji speaker from five below my sister has ... and even that one has Bluetooth and doesn’t need to be plugged in. The only good thing about this is that you can speak to it.']
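from wordcloud import WordCloud
# Collect the negative reviews and join them into a single string
negative_list = negative['verified_reviews'].tolist()
negative_sentences_as_one_string = " ".join(negative_list)
plt.figure(figsize = (20,20))
plt.imshow(WordCloud().generate(negative_sentences_as_one_string))
- The words ‘device’, ‘work’ and ‘love’ stand out in the word cloud of the negative reviews.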
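A word cloud shows which words dominate, but not by how much. As a rough numeric complement, word frequencies can be counted directly. The sketch below uses three made-up reviews (`toy_negative` is an assumption, not data from the file) and, unlike a typical word cloud, does not filter out stopwords:

```python
from collections import Counter
import string

# Made-up negative reviews, used only to illustrate the counting step.
toy_negative = [
    "The device stopped working.",
    "Sound is terrible, the device does not work.",
    "Would love to return this device.",
]

# Join everything into one string, lowercase it and strip punctuation.
text = " ".join(toy_negative).lower()
text = text.translate(str.maketrans("", "", string.punctuation))

# Count the words and show the most frequent ones.
word_counts = Counter(text.split())
print(word_counts.most_common(3))
```

On the real data, the same steps could be applied to the joined string of negative reviews to check which words the cloud is emphasising.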
3 – Data Cleaning
reviews_df.head()
| | rating | date | variation | verified_reviews | feedback | length |
---|---|---|---|---|---|---|
0 | 5 | 31-Jul-18 | Charcoal Fabric | Love my Echo! | 1 | 13 |
1 | 5 | 31-Jul-18 | Charcoal Fabric | Loved it! | 1 | 9 |
2 | 4 | 31-Jul-18 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 | 195 |
3 | 5 | 31-Jul-18 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 | 172 |
4 | 5 | 31-Jul-18 | Charcoal Fabric | Music | 1 | 5 |
– Let’s remove some columns:
reviews_df = reviews_df.drop(['date', 'rating', 'length'], axis = 1)
reviews_df
| | variation | verified_reviews | feedback |
---|---|---|---|
0 | Charcoal Fabric | Love my Echo! | 1 |
1 | Charcoal Fabric | Loved it! | 1 |
2 | Walnut Finish | Sometimes while playing a game, you can answer... | 1 |
3 | Charcoal Fabric | I have had a lot of fun with this thing. My 4 ... | 1 |
4 | Charcoal Fabric | Music | 1 |
... | ... | ... | ... |
3145 | Black Dot | Perfect for kids, adults and everyone in betwe... | 1 |
3146 | Black Dot | Listening to music, searching locations, check... | 1 |
3147 | Black Dot | I do love these things, i have them running my... | 1 |
3148 | White Dot | Only complaint I have is that the sound qualit... | 1 |
3149 | Black Dot | Good | 1 |
– Convert ‘variation’ (the kind of product) into dummy variables:
variation_dummies = pd.get_dummies(reviews_df['variation'], drop_first=True)
variation_dummies
| | Black Dot | Black Plus | Black Show | Black Spot | Charcoal Fabric | Configuration: Fire TV Stick | Heather Gray Fabric | Oak Finish | Sandstone Fabric | Walnut Finish | White | White Dot | White Plus | White Show | White Spot |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3145 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3146 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3147 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3148 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3149 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
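The effect of `drop_first=True` is easier to see on a tiny example. This is a minimal sketch with made-up variation names (the `toy` series is an assumption; `dtype=int` is passed only so the output shows 0/1 regardless of the pandas version):

```python
import pandas as pd

# Made-up product variations, for illustration only.
toy = pd.Series(["Black", "White", "Black Dot", "White"], name="variation")

all_dummies = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(all_dummies.columns))  # one column per category
print(list(reduced.columns))      # the first category in sorted order is dropped
print(reduced)
```

A row that is all zeros in the reduced frame then stands for the dropped baseline category, so no information is lost while one redundant column is avoided.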
– Remove the ‘variation’ column from reviews_df and add the ‘variation_dummies’ columns:
reviews_df.drop(['variation'], axis = 1, inplace = True)
reviews_df = pd.concat([reviews_df, variation_dummies],axis = 1)
reviews_df
| | verified_reviews | feedback | Black Dot | Black Plus | Black Show | Black Spot | Charcoal Fabric | Configuration: Fire TV Stick | Heather Gray Fabric | Oak Finish | Sandstone Fabric | Walnut Finish | White | White Dot | White Plus | White Show | White Spot |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Love my Echo! | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | Loved it! | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | Sometimes while playing a game, you can answer... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | I have had a lot of fun with this thing. My 4 ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | Music | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3145 | Perfect for kids, adults and everyone in betwe... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3146 | Listening to music, searching locations, check... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3147 | I do love these things, i have them running my... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3148 | Only complaint I have is that the sound qualit... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3149 | Good | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 – Punctuation marks, StopWords and ‘Tokenization’
import string
string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'
– Test Part –
Removing punctuation marks:
Test = "Hello Mr. Future, I am so happy to be learning AI now!!!"
Test_punc_removed = [char for char in Test if char not in string.punctuation]
Test_punc_removed
['H', 'e', 'l', 'l', 'o', ' ', 'M', 'r', ' ', 'F', 'u', 't', 'u', 'r', 'e', ' ', 'I', ' ', 'a', 'm', ' ', 's', 'o', ' ', 'h', 'a', 'p', 'p', 'y', ' ', 't', 'o', ' ', 'b', 'e', ' ', 'l', 'e', 'a', 'r', 'n', 'i', 'n', 'g', ' ', 'A', 'I', ' ', 'n', 'o', 'w']
Test_punc_removed_join = ''.join(Test_punc_removed)
Test_punc_removed_join
'Hello Mr Future I am so happy to be learning AI now'
StopWords:
– We eliminate the stopwords: words such as ‘I’, ‘you’, ‘and’ or ‘but’ carry no useful information for our analysis:
import nltk
nltk.download('stopwords')
True
from nltk.corpus import stopwords
stopwords.words('english') # English stopword list
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
Test_punc_removed_join
'Hello Mr Future I am so happy to be learning AI now'
Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
Test_punc_removed_join_clean
['Hello', 'Mr', 'Future', 'happy', 'learning', 'AI']
mini_challenge = "Here is a mini challenge, that will teach you how to remove stopwords and punctuations!!"
challenge = [char for char in mini_challenge if char not in string.punctuation]
challenge = ''.join(challenge)
challenge = [word for word in challenge.split() if word.lower() not in stopwords.words('english')]
challenge
['mini', 'challenge', 'teach', 'remove', 'stopwords', 'punctuations']
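The two steps above can be wrapped into a single helper. The sketch below is one possible variant: it looks stopwords up in a set built once instead of rebuilding the list for every word, and it removes punctuation with `str.translate`. The name `clean_text` and the tiny `STOP_WORDS` set are assumptions made so the example runs without downloads; in the notebook the set would be `set(stopwords.words('english'))`.

```python
import string

# Assumption: a tiny hand-written stopword set so the sketch runs offline;
# in the notebook this would be set(stopwords.words('english')) from NLTK.
STOP_WORDS = {"i", "am", "so", "to", "be", "now"}

# Translation table that deletes every punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_text(text, stop_words=STOP_WORDS):
    """Remove punctuation, then drop stopwords (case-insensitive)."""
    no_punct = text.translate(PUNCT_TABLE)
    return [w for w in no_punct.split() if w.lower() not in stop_words]

cleaned = clean_text("Hello Mr. Future, I am so happy to be learning AI now!!!")
print(cleaned)
# ['Hello', 'Mr', 'Future', 'happy', 'learning', 'AI']
```

Building the stopword set once matters mainly for speed when the function is applied to thousands of reviews.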
‘Tokenization’:
from sklearn.feature_extraction.text import CountVectorizer
sample_data = ['This is the first document.','This document is the second document.','And this is the third one.','Is this the first document?']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())
['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.toarray())
[[0 1 1 1 0 0 1 0 1]
 [0 2 0 1 0 1 1 0 1]
 [1 0 0 1 1 0 1 1 1]
 [0 1 1 1 0 0 1 0 1]]
mini_challenge = ["Hello World", "Hello Hello World", "Hello World world world"]
vectorizer_challenge = CountVectorizer()
X_challenge = vectorizer_challenge.fit_transform(mini_challenge)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer_challenge.get_feature_names_out().tolist())
print(X_challenge.toarray())
['hello', 'world']
[[1 1]
 [2 1]
 [1 3]]
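To see what the vectorizer is doing, the same counts can be reproduced by hand. This is a minimal pure-Python sketch, not how scikit-learn is implemented internally: `tokenize` is a hypothetical helper that lowercases the text and keeps runs of two or more word characters, which mirrors the default token pattern of `CountVectorizer`.

```python
import re

sample_data = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(text):
    # Lowercase and keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: sorted unique tokens across all documents.
vocabulary = sorted({tok for doc in sample_data for tok in tokenize(doc)})

# One row per document, one column per vocabulary word.
counts = [[tokenize(doc).count(word) for word in vocabulary] for doc in sample_data]

print(vocabulary)
for row in counts:
    print(row)
```

Each column corresponds to one vocabulary word and each cell holds how often that word occurs in the document, which is exactly the matrix printed for the sample documents.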
– With our data –
We define a pipeline that cleans all the reviews and keeps only the most relevant words.
1.1 – We’re going to create a function to remove punctuation marks and stopwords:
def message_cleaning(message):
    # Remove punctuation characters
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    # Drop stopwords (case-insensitive comparison)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean
reviews_df_clean = reviews_df['verified_reviews'].apply(message_cleaning)
print(reviews_df_clean[3])
['lot', 'fun', 'thing', '4', 'yr', 'old', 'learns', 'dinosaurs', 'control', 'lights', 'play', 'games', 'like', 'categories', 'nice', 'sound', 'playing', 'music', 'well']
print(reviews_df['verified_reviews'][3])
I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.
reviews_df_clean
0       [Love, Echo]
1       [Loved]
2       [Sometimes, playing, game, answer, question, c...
3       [lot, fun, thing, 4, yr, old, learns, dinosaur...
4       [Music]
        ...
3145    [Perfect, kids, adults, everyone]
3146    [Listening, music, searching, locations, check...
3147    [love, things, running, entire, home, TV, ligh...
3148    [complaint, sound, quality, isnt, great, mostl...
3149    [Good]
Name: verified_reviews, Length: 3150, dtype: object
1.2 – Applying a pre-made function in a library:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(analyzer=message_cleaning)
reviews_countvectorizer = vectorizer.fit_transform(reviews_df['verified_reviews'])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(vectorizer.get_feature_names_out().tolist())
['072318', '1', '10', '100', '1000', '100X', '1010', '1030pm', '11', '1100sf', '1220', '129', '12am', '15', '150', '19', '1964', '1990s', '1990’s', '1GB', '1rst', '1st', '2', '20', '200', '2000', '2017', '2030', '229', '23', '2448', '247', '24GHZ', '24ghz', '25', '29', '299', '2999', '2Original', '2nd', '2or', '2package', '3', '30', '300', '30so', '334', '34', '342nd', '3434', '34A34', '34Alexa', '34Alexa34', '34Certified', '34Computer34', '34Dot34', '34Drop', '34First', '34Hub', '34I', '34Im', '34NEVER', '34Philips', '34Play', '34Second', '34Skills34', '34Tell', '34The', '34Things', '34Thongs', '34Try', '34Whats', '34alexa34', '34card34', '34cycle', '34cycle34', '34fixes34', '34fun34', '34group34', '34hear34', '34hmm', '34hmmm', '34it34', '34late', '34learn', '34light34', '34lights34', '34listen34', '34minor', '34outlet34', '34personal34', '34she34', '34show', '34smart', '34smart34', '34sorry', '34spying34', '34the', '34thick34', '34things', '34this', '34trouble', '34try', '34turn', '34visual34', '34wake34me', '34warehouse34', '35', '360', '39', '399', '3999', '3Dots', '3rd', '3xs', '4', '40000', '45', '48', '4K', '4am', '4k', '4th', '5', '50', '54', '5GHZ', '5GHz', '5am”', '5ghz', '5th', '6', '600', '62', '672', '6th', '7', '7000', '70s', '75', '7900', '8', '80s', '81', '83', '85', '88', '888', '8GB', '9', '90', '91', '911', '99', 'A1', 'A19', 'ABC', 'ABSOLUTELY', 'AF', 'AI', 'AIs', 'ALARM', 'ALEXA', 'ALEXUS', 'ALLRecipes', 'AMAZING', 'AMAZON', 'ANNOYING', 'ANOTHER', 'ASAP', 'ASK', 'AV', 'AVAILABLE', 'Able', 'Absolutely', 'Absolutly', 'Ac', 'AccentThe', 'Access', 'Acoustical', 'Acting', ....................................................................................................................... ....................................................................................................................... ....................................................................................................................... 
,'zero', 'zigbee', 'zonkedout', 'zwave', 'zzzz', 'í', 'útil', '—', '‘Drop', '‘appliance’', '‘technically', '‘til', '“', '“Alexa', '“Alexa”', '“Echo”', '“Hey', '“Jaws”', '“OK', '“apps”', '“check', '“dot”', '“drop', '“dropin”', '“dropping', '“free”', '“inactivity”', '“learn”', '“live”version', '“name”', '“no”', '“oh', '“oops', '“plusprimeetc”', '“ready', '“smart', '“themes”', '“things', '“updated', '“wake', '“wake”', '“white”', '“your”', '⏰', '❤', '⭐⭐⭐⭐⭐', '?', '?', '??', '??', '????', '?', '??', '?', '?', '?', '?????❤', '?', '?', '?', '?', '?', '?']
2.1 – Tokenization:
print(reviews_countvectorizer.toarray())
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
reviews_countvectorizer.shape
(3150, 5211)
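Converting this matrix with `.toarray()` stores every zero explicitly. A quick back-of-the-envelope calculation shows what that costs; the only assumption is that each count is stored as an 8-byte integer (int64), which is the default dtype of `CountVectorizer`:

```python
# Shape reported by reviews_countvectorizer.shape.
n_rows, n_cols = 3150, 5211

# Assumption: 8 bytes per value (int64, the CountVectorizer default dtype).
bytes_per_value = 8

dense_bytes = n_rows * n_cols * bytes_per_value
print(f"{dense_bytes / 1e6:.1f} MB for the dense array")
# 131.3 MB for the dense array
```

This is still manageable here, but with a larger corpus it may be worth keeping the sparse matrix and passing it to the classifier directly, since almost all entries are zero.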
reviews_df
| | verified_reviews | feedback | Black Dot | Black Plus | Black Show | Black Spot | Charcoal Fabric | Configuration: Fire TV Stick | Heather Gray Fabric | Oak Finish | Sandstone Fabric | Walnut Finish | White | White Dot | White Plus | White Show | White Spot |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Love my Echo! | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | Loved it! | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | Sometimes while playing a game, you can answer... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
3 | I have had a lot of fun with this thing. My 4 ... | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | Music | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3145 | Perfect for kids, adults and everyone in betwe... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3146 | Listening to music, searching locations, check... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3147 | I do love these things, i have them running my... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3148 | Only complaint I have is that the sound qualit... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
3149 | Good | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2.2 – Drop the raw text column and append the new tokenized (word count) columns:
reviews_df.drop(['verified_reviews'], axis = 1, inplace = True)
reviews = pd.DataFrame(reviews_countvectorizer.toarray())
reviews_df = pd.concat([reviews_df, reviews], axis = 1)
reviews_df
| | feedback | Black Dot | Black Plus | Black Show | Black Spot | Charcoal Fabric | Configuration: Fire TV Stick | Heather Gray Fabric | Oak Finish | Sandstone Fabric | ... | 5201 | 5202 | 5203 | 5204 | 5205 | 5206 | 5207 | 5208 | 5209 | 5210 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3145 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3146 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3147 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3148 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3149 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 – Separate our input data from our target variable (‘feedback’):
X = reviews_df.drop(['feedback'], axis = 1)
# Column names mix strings (dummies) and integers (word counts); recent scikit-learn versions require them all to be strings
X.columns = X.columns.astype(str)
X
| | Black Dot | Black Plus | Black Show | Black Spot | Charcoal Fabric | Configuration: Fire TV Stick | Heather Gray Fabric | Oak Finish | Sandstone Fabric | Walnut Finish | ... | 5201 | 5202 | 5203 | 5204 | 5205 | 5206 | 5207 | 5208 | 5209 | 5210 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
3145 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3146 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3147 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3148 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3149 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
y = reviews_df['feedback']
y
0       1
1       1
2       1
3       1
4       1
       ..
3145    1
3146    1
3147    1
3148    1
3149    1
Name: feedback, Length: 3150, dtype: int64
5 – Naive Bayes Classifier
X.shape
(3150, 5226)
y.shape
(3150,)
Creating the model:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
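Because the split is random and unseeded, the share of negative reviews in the test set changes from run to run. One option, shown here only as a sketch on made-up data (`X_toy` and `y_toy` are assumptions), is to pass `stratify` and `random_state` to `train_test_split` so that both parts keep the same class ratio and the split is reproducible:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 90 positive and 10 negative.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 90 + [0] * 10)

# stratify keeps the class ratio the same in both parts;
# random_state makes the split reproducible.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

print(y_tr.mean(), y_te.mean())  # share of positives in train and test
```

With so few negative reviews in the dataset, a stratified split helps make the reported metrics for class 0 less dependent on the luck of the draw.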
from sklearn.naive_bayes import MultinomialNB
nb_multinomial = MultinomialNB()
nb_multinomial.fit(X_train, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
Model validation and Confusion Matrix:
from sklearn.metrics import classification_report, confusion_matrix
y_predict_train = nb_multinomial.predict(X_train)
y_predict_train
cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True)
y_predict_test = nb_multinomial.predict(X_test)
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot = True)
print(classification_report(y_test, y_predict_test))
              precision    recall  f1-score   support
           0       0.58      0.30      0.40        60
           1       0.93      0.98      0.95       570
    accuracy                           0.91       630
   macro avg       0.76      0.64      0.67       630
weighted avg       0.90      0.91      0.90       630
- Our model captures the positive reviews very well (recall of 0.98), but it detects the negative ones quite poorly (recall of 0.30), most likely because our dataset is so unbalanced.
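It can help to see how precision and recall follow from the confusion matrix. The counts in this sketch are an assumption: they were chosen to be consistent with the Naive Bayes report (60 negative and 570 positive test reviews) and are not read from the notebook's heatmap.

```python
# Illustrative counts, chosen to agree with the report; not taken from the heatmap.
neg_as_neg, neg_as_pos = 18, 42    # the 60 reviews whose true label is 0
pos_as_neg, pos_as_pos = 13, 557   # the 570 reviews whose true label is 1

# Recall: share of a true class that the model finds.
recall_0 = neg_as_neg / (neg_as_neg + neg_as_pos)
recall_1 = pos_as_pos / (pos_as_pos + pos_as_neg)

# Precision: share of the predictions for a class that are correct.
precision_0 = neg_as_neg / (neg_as_neg + pos_as_neg)
precision_1 = pos_as_pos / (pos_as_pos + neg_as_pos)

accuracy = (neg_as_neg + pos_as_pos) / 630

print(round(precision_0, 2), round(recall_0, 2))  # 0.58 0.3
print(round(precision_1, 2), round(recall_1, 2))  # 0.93 0.98
print(round(accuracy, 2))                         # 0.91
```

Read this way, a recall of 0.30 for class 0 means that roughly 7 out of 10 dissatisfied customers would go unnoticed, even though overall accuracy looks high.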
6 – Logistic regression classifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
Creating the model:
model = LogisticRegression()
model.fit(X_train, y_train)
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True, intercept_scaling=1, l1_ratio=None, max_iter=100, multi_class='auto', n_jobs=None, penalty='l2', random_state=None, solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)
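The parameter list above shows `class_weight=None`, so every review counts equally during training. Since the dataset is unbalanced, one thing that could be tried is `class_weight='balanced'`. The sketch below only demonstrates the mechanics on synthetic data generated with `make_classification` (not the Alexa reviews); whether it improves results on the real data would have to be checked.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (about 10% class 0); not the review dataset.
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=20, weights=[0.1, 0.9], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# class_weight='balanced' weights samples inversely to their class frequency.
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Recall on the minority class (label 0) for both models.
recall_plain = recall_score(y_te, plain.predict(X_te), pos_label=0)
recall_balanced = recall_score(y_te, balanced.predict(X_te), pos_label=0)
print(recall_plain, recall_balanced)
```

Reweighting usually trades some precision on the majority class for better recall on the minority class, so both numbers should be compared before adopting it.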
Model validation and Confusion Matrix:
y_pred = model.predict(X_test)
y_pred
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
print("Accuracy {} %".format(100*accuracy_score(y_test, y_pred)))
Accuracy 93.33333333333333 %
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True)
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support
           0       0.78      0.42      0.54        60
           1       0.94      0.99      0.96       570
    accuracy                           0.93       630
   macro avg       0.86      0.70      0.75       630
weighted avg       0.93      0.93      0.92       630
- Our model captures the positive reviews very well (recall of 0.99), but it still detects the negative ones quite poorly (recall of 0.42). Nevertheless, it does better than the Naive Bayes model.
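To put the accuracy figures in context, it is worth comparing them with a trivial baseline that always predicts "positive". The short calculation below uses only the test-set class counts from the `support` column of the reports:

```python
# Test-set class counts from the 'support' column of the classification reports.
n_negative, n_positive = 60, 570

# A model that always predicts 1 is right on every positive review
# and wrong on every negative one.
baseline_accuracy = n_positive / (n_negative + n_positive)
print(f"{100 * baseline_accuracy:.1f} %")
# 90.5 %
```

Against this baseline, accuracies of 91% and 93% are only modest gains, which is why the recall on the negative class is the more informative number for this problem.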