Predictive Model for Public Relations Department

[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]

The Public Relations Department has provided us with a dataset of customer evaluations of their commercial products. The team wants to predict whether their customers are satisfied and has asked us to develop a predictive model.

1 – Import libraries and data visualization

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
reviews_df = pd.read_csv("amazon_alexa.tsv", sep = "\t")
reviews_df
          rating  date       variation        verified_reviews                                   feedback
    0     5       31-Jul-18  Charcoal Fabric  Love my Echo!                                      1
    1     5       31-Jul-18  Charcoal Fabric  Loved it!                                          1
    2     4       31-Jul-18  Walnut Finish    Sometimes while playing a game, you can answer...  1
    3     5       31-Jul-18  Charcoal Fabric  I have had a lot of fun with this thing. My 4 ...  1
    4     5       31-Jul-18  Charcoal Fabric  Music                                              1
    ...   ...     ...        ...              ...                                                ...
    3145  5       30-Jul-18  Black Dot        Perfect for kids, adults and everyone in betwe...  1
    3146  5       30-Jul-18  Black Dot        Listening to music, searching locations, check...  1
    3147  5       30-Jul-18  Black Dot        I do love these things, i have them running my...  1
    3148  5       30-Jul-18  White Dot        Only complaint I have is that the sound qualit...  1
    3149  4       29-Jul-18  Black Dot        Good                                               1
reviews_df.info()
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3150 entries, 0 to 3149
    Data columns (total 5 columns):
     #   Column            Non-Null Count  Dtype 
    ---  ------            --------------  ----- 
     0   rating            3150 non-null   int64 
     1   date              3150 non-null   object
     2   variation         3150 non-null   object
     3   verified_reviews  3150 non-null   object
     4   feedback          3150 non-null   int64 
    dtypes: int64(2), object(3)
    memory usage: 123.2+ KB
reviews_df.describe()
           rating       feedback
    count  3150.000000  3150.000000
    mean   4.463175     0.918413
    std    1.068506     0.273778
    min    1.000000     0.000000
    25%    4.000000     1.000000
    50%    5.000000     1.000000
    75%    5.000000     1.000000
    max    5.000000     1.000000
  • The mean of feedback is 0.918, so about 92% of the reviews are positive. The ratings are also high: the mean is 4.46 and the median is 5.
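Because feedback only takes the values 0 and 1, its mean is simply the share of positive reviews. A minimal, library-free sketch of that reading: the short list of labels is made up for illustration, while the second part only reuses the mean and count reported by describe().

```python
# Illustrative labels only: 1 = positive feedback, 0 = negative feedback.
feedback = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]

# The mean of a 0/1 column equals the fraction of ones.
share_positive = sum(feedback) / len(feedback)
print(f"{share_positive:.0%} positive")  # 80% positive

# Applied to the describe() output: mean 0.918413 over 3150 rows.
n_rows = 3150
n_positive = round(0.918413 * n_rows)
n_negative = n_rows - n_positive
print(n_positive, n_negative)  # 2893 257
```

So the class counts implied by the summary are roughly 2893 positive against 257 negative reviews.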
reviews_df['verified_reviews']
    0                                           Love my Echo!
    1                                               Loved it!
    2       Sometimes while playing a game, you can answer...
    3       I have had a lot of fun with this thing. My 4 ...
    4                                                   Music
                                  ...                        
    3145    Perfect for kids, adults and everyone in betwe...
    3146    Listening to music, searching locations, check...
    3147    I do love these things, i have them running my...
    3148    Only complaint I have is that the sound qualit...
    3149                                                 Good
    Name: verified_reviews, Length: 3150, dtype: object

Missing values

sns.heatmap(reviews_df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")

Visualization

reviews_df.hist(bins = 30, figsize = (13, 5), color = 'r')
  • We can see that the dataset is fairly unbalanced: there are many more positive than negative evaluations.

– We add a column ‘length’ with the number of characters of each review:

reviews_df['length'] = reviews_df['verified_reviews'].apply(len)
reviews_df.head()
          rating  date       variation        verified_reviews                                   feedback  length
    0     5       31-Jul-18  Charcoal Fabric  Love my Echo!                                      1         13
    1     5       31-Jul-18  Charcoal Fabric  Loved it!                                          1         9
    2     4       31-Jul-18  Walnut Finish    Sometimes while playing a game, you can answer...  1         195
    3     5       31-Jul-18  Charcoal Fabric  I have had a lot of fun with this thing. My 4 ...  1         172
    4     5       31-Jul-18  Charcoal Fabric  Music                                              1         5
reviews_df['length'].plot(bins = 100, kind = 'hist')
  • There are many customers with short reviews.
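To make the ‘length’ column concrete, here is a small sketch that applies len to three reviews copied from the table above. It does not need pandas; the list name is ours.

```python
# Three reviews copied from the first rows of the dataset.
reviews = ["Love my Echo!", "Loved it!", "Music"]

# len() counts every character, including spaces and punctuation.
lengths = [len(text) for text in reviews]
print(lengths)  # [13, 9, 5]
```

The values match the ‘length’ column of rows 0, 1 and 4. In pandas, reviews_df['verified_reviews'].str.len() should give the same result as .apply(len).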
reviews_df[reviews_df['length'] == 2851]['verified_reviews'].iloc[0]
    "Incredible piece of technology.I have this right center of my living room on an island kitchen counter. The mic and speaker goes in every direction and the quality of the sound is quite good. I connected the Echo via Bluetooth to my Sony soundbar on my TV but find the Echo placement and 360 sound more appealing. It's no audiophile equipment but there is good range and decent bass. The sound is more than adequate for any indoor entertaining and loud enough to bother neighbors in my building. The knob on the top works great for adjusting volume. This is my first Echo device and I would imagine having to press volume buttons (on the Echo 2) a large inconvenience and not as precise. For that alone I would recommend this over the regular Echo (2nd generation).The piece looks quality and is quite sturdy with some weight on it. The rubber material on the bottom has a good grip on the granite counter-- my cat can even rub her scent on it without tipping it over.This order came with a free Philips Hue Bulb which I installed along with an extra one I bought. I put the 2 bulbs into my living room floor lamp, turned on the light, and all I had to do was say &#34;Alexa, connect my devices&#34;. The default names for each bulb was assigned as &#34;First light&#34; and &#34;Second light&#34;, so I can have a dimmer floor lamp if I just turned on/off one of the lights by saying &#34;Alexa, turn off the second light&#34;. In the Alexa app, I created a 'Group' with &#34;First light&#34; and &#34;Second light&#34; and named the group &#34;The light&#34;, so to turn on the lamp with both bulbs shining I just say &#34;Alexa, turn on The light&#34;.I was surprised how easily the bulbs connected to the Echo Plus with its built in hub. I thought I would have to buy a hub bridge to connect to my floor lamp power plug. Apparently there is some technology built directly inside the bulb! I was surprised by that. Awesome.You will feel like Tony Stark on this device. 
I added quite a few &#34;Skills&#34; like 'Thunderstorm sounds' and 'Quote of the day' . Alexa always loads them up quickly. Adding songs that you hear to specific playlists on Amazon Music is also a great feature.I can go on and on and this is only my second day of ownership.I was lucky to buy this for $100 on Prime Day, but I think for $150 is it pretty expensive considering the Echo 2 is only $100. In my opinion, you will be paying a premium for the Echo Plus and you have to decide if the value is there for you:1) Taller and 360 sound unit.2) Volume knob on top that you spin (I think this is a huge benefit over buttons)3) Built in hub for Hue bulbs. After researching more, there are some cons to this setup if you plan on having more advanced light setups. For me and my floor lamp, it's just perfect.I highly recommend it and will buy an Echo dot for my bedroom now."
reviews_df.length.describe()
    count    3150.000000
    mean      132.049524
    std       182.099952
    min         1.000000
    25%        30.000000
    50%        74.000000
    75%       165.000000
    max      2851.000000
    Name: length, dtype: float64
reviews_df[reviews_df['length'] == 1]['verified_reviews'].iloc[0]
  '?'
reviews_df[reviews_df['length'] == 133]['verified_reviews'].iloc[0]
  'Fun item to play with and get used to using.  Sometimes has hard time answering the questions you ask, but I think it will be better.'
  • The longest review has 2851 characters and the shortest has only 1 (an emoticon).
  • The average review length is about 132 characters.

Let’s separate the positive (1) and negative (0) feedback:

positive = reviews_df[reviews_df['feedback'] == 1]
negative = reviews_df[reviews_df['feedback'] == 0]
negative
          rating  date       variation            verified_reviews                                   feedback  length
    46    2       30-Jul-18  Charcoal Fabric      It's like Siri, in fact, Siri answers more acc...  0         163
    111   2       30-Jul-18  Charcoal Fabric      Sound is terrible if u want good music too get...  0         53
    141   1       30-Jul-18  Charcoal Fabric      Not much features.                                 0         18
    162   1       30-Jul-18  Sandstone Fabric     Stopped working after 2 weeks ,didn't follow c...  0         87
    176   2       30-Jul-18  Heather Gray Fabric  Sad joke. Worthless.                               0         20
    ...   ...     ...        ...                  ...                                                ...       ...
    3047  1       30-Jul-18  Black Dot            Echo Dot responds to us when we aren't even ta...  0         120
    3048  1       30-Jul-18  White Dot            NOT CONNECTED TO MY PHONE PLAYLIST :(              0         37
    3067  2       30-Jul-18  Black Dot            The only negative we have on this product is t...  0         240
    3091  1       30-Jul-18  Black Dot            I didn’t order it                                  0         17
    3096  1       30-Jul-18  White Dot            The product sounded the same as the emoji spea...  0         210
positive
          rating  date       variation        verified_reviews                                   feedback  length
    0     5       31-Jul-18  Charcoal Fabric  Love my Echo!                                      1         13
    1     5       31-Jul-18  Charcoal Fabric  Loved it!                                          1         9
    2     4       31-Jul-18  Walnut Finish    Sometimes while playing a game, you can answer...  1         195
    3     5       31-Jul-18  Charcoal Fabric  I have had a lot of fun with this thing. My 4 ...  1         172
    4     5       31-Jul-18  Charcoal Fabric  Music                                              1         5
    ...   ...     ...        ...              ...                                                ...       ...
    3145  5       30-Jul-18  Black Dot        Perfect for kids, adults and everyone in betwe...  1         50
    3146  5       30-Jul-18  Black Dot        Listening to music, searching locations, check...  1         135
    3147  5       30-Jul-18  Black Dot        I do love these things, i have them running my...  1         441
    3148  5       30-Jul-18  White Dot        Only complaint I have is that the sound qualit...  1         380
    3149  4       29-Jul-18  Black Dot        Good                                               1         4
sns.countplot(x = 'feedback', data = reviews_df)
sns.countplot(x = 'rating', data = reviews_df)
reviews_df['rating'].hist(bins = 5)
  • Most consumers give a 5.

– Let’s see the variation field that indicates the type of product:

plt.figure(figsize=(40, 15))
sns.barplot(x = 'variation', y = 'rating', data = reviews_df, palette='deep')

2 – Reviews

All Reviews

– We put the reviews in a single list and separate them by a blank space:

sentences = reviews_df['verified_reviews'].tolist()
len(sentences)
    3150
print(sentences)
['Love my Echo!', 'Loved it!', 'Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you.  I like being able to turn lights on and off while away from home.', 'I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.', 'Music', 'I received the echo as a gift. I needed another Bluetooth or something to play music easily accessible, and found this smart speaker. Can’t wait to see what else it can do.', 'Without having a cellphone, I cannot use many of her features. I have an iPad but do not see that of any use.  It IS a great alarm.  If u r almost deaf, you can hear her alarm in the bedroom from out in the living room, so that is reason enough to keep her.It is fun to ask random questions to hear her response.  She does not seem to be very smartbon politics yet.', "I think this is the 5th one I've purchased. I'm working on getting one in every room of my house. I really like what features they offer specifily playing music on all Echos and controlling the lights throughout my house.", 'looks great', 'Love it! I’ve listened to songs I haven’t heard since childhood! I get the news, weather, information! It’s great!', 'I sent it to my 85 year old Dad, and he talks to it constantly.', "I love it! Learning knew things with it eveyday! Still figuring out how everything works but so far it's been easy to use and understand. She does make me laugh at times", 'I purchased this for my mother who is having knee problems now, to give her something to do while trying to over come not getting around so fast like she did.She enjoys all the little and big things it can do...Alexa play this song, What time is it and where, and how to cook this and that!', 'Love, Love, Love!!', 'Just what I expected....', 'I love it, wife hates it.', 
......................................................................................
......................................................................................
......................................................................................
"I was really happy with my original echo so i thought I'd get an echo dot to use in my bedroom. I was really disappointed in the audio quality so I connected an external speaker via bluetooth. The audio was much better but I started having problems with it loosing connection with the wifi, presumably due to interference from the bluetooth. Then  I connected a speaker via the auxiliary jack. when i did that, the auxiliary jack picked up interference from the wifi and I was woken up in the middle of the night by a horrible buzzing sound. im hoping Amazon will take this thing back and give me a good deal on an echo spot which I hope will be a better nightstand device.",
     "Weak sound. Compared to the Google Home Mini the sound is bad. Also you need Prime to have a small selection of music and need prime music to play everything. Also if you get two you need prime family music to play music on both. I think it's lame that you need a family plan to use multiple devices.",
     "Echo Dot responds to us when we aren't even talking to it. I've unplugged it. It feels like it's &#34;spying&#34; on us.",
     'NOT CONNECTED TO MY PHONE PLAYLIST :(',
     'The only negative we have on this product is the terrible sound quality.  A massive difference from the Alexa.  Which to us was a big reason we wanted to purchase this.Won’t be buying another until the speaker and sound quality can improve.',
     'I didn’t order it',
     'The product sounded the same as the emoji speaker from five below my sister has ... and even that one has Bluetooth and doesn’t need to be plugged in. The only good thing about this is that you can speak to it.']
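from wordcloud import WordCloud
# Collect the negative reviews (feedback == 0) before joining them
negative_list = negative['verified_reviews'].tolist()
negative_sentences_as_one_string = " ".join(negative_list)
plt.figure(figsize = (20,20))
plt.imshow(WordCloud().generate(negative_sentences_as_one_string))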
  • The words ‘device’, ‘work’ and ‘love’ stand out in the word cloud of negative reviews.
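A word cloud is essentially a picture of word frequencies. The sketch below counts words by hand with collections.Counter on three negative reviews quoted earlier. It is only an approximation of the idea: the WordCloud class applies its own tokenization and stopword handling, so its counts may differ.

```python
from collections import Counter
import string

# A few negative reviews quoted in the outputs above.
negative_reviews = [
    "Not much features.",
    "Sad joke. Worthless.",
    "NOT CONNECTED TO MY PHONE PLAYLIST :(",
]

# Join into one string, as done before generating the word cloud.
text = " ".join(negative_reviews)

# Lowercase and strip ASCII punctuation so "NOT" and "Not" count together.
cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
counts = Counter(cleaned.split())
print(counts.most_common(1))  # [('not', 2)]
```

Frequent but uninformative words such as "not" or "my" are one reason to remove stopwords before reading too much into the picture.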

3 – Data Cleaning

reviews_df.head()
          rating  date       variation        verified_reviews                                   feedback  length
    0     5       31-Jul-18  Charcoal Fabric  Love my Echo!                                      1         13
    1     5       31-Jul-18  Charcoal Fabric  Loved it!                                          1         9
    2     4       31-Jul-18  Walnut Finish    Sometimes while playing a game, you can answer...  1         195
    3     5       31-Jul-18  Charcoal Fabric  I have had a lot of fun with this thing. My 4 ...  1         172
    4     5       31-Jul-18  Charcoal Fabric  Music                                              1         5

– Let’s remove some columns:

reviews_df = reviews_df.drop(['date', 'rating', 'length'], axis = 1)
reviews_df
          variation        verified_reviews                                   feedback
    0     Charcoal Fabric  Love my Echo!                                      1
    1     Charcoal Fabric  Loved it!                                          1
    2     Walnut Finish    Sometimes while playing a game, you can answer...  1
    3     Charcoal Fabric  I have had a lot of fun with this thing. My 4 ...  1
    4     Charcoal Fabric  Music                                              1
    ...   ...              ...                                                ...
    3145  Black Dot        Perfect for kids, adults and everyone in betwe...  1
    3146  Black Dot        Listening to music, searching locations, check...  1
    3147  Black Dot        I do love these things, i have them running my...  1
    3148  White Dot        Only complaint I have is that the sound qualit...  1
    3149  Black Dot        Good                                               1

– Convert ‘variation’ (the kind of product) into dummy variables:

variation_dummies = pd.get_dummies(reviews_df['variation'], drop_first=True)
variation_dummies
Black DotBlack PlusBlack ShowBlack SpotCharcoal FabricConfiguration: Fire TV StickHeather Gray FabricOak FinishSandstone FabricWalnut FinishWhiteWhite DotWhite PlusWhite ShowWhite Spot
0000010000000000
1000010000000000
2000000000100000
3000010000000000
4000010000000000
................................................
3145100000000000000
3146100000000000000
3147100000000000000
3148000000000001000
3149100000000000000
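To show what pd.get_dummies(..., drop_first=True) does conceptually, here is a hand-written sketch on four variations taken from the rows above. The helper one_hot is hypothetical and written only for this illustration; it is not how pandas implements the encoding.

```python
# Product variations taken from rows shown above (illustrative subset).
variations = ["Charcoal Fabric", "Walnut Finish", "Black Dot", "White Dot"]

# Sorted unique categories; dropping the first mirrors drop_first=True.
categories = sorted(set(variations))
kept = categories[1:]

def one_hot(value, columns):
    """Return a 0/1 list with a 1 in the position of `value`, if present."""
    return [1 if value == col else 0 for col in columns]

encoded = [one_hot(v, kept) for v in variations]
print(kept)
for row in encoded:
    print(row)
```

The dropped category ("Black Dot" in this toy subset) is represented by a row of zeros, which avoids one redundant column. In the full table the dropped baseline is presumably the variation that has no column of its own.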

– Remove the ‘variation’ column from reviews_df and add the ‘variation_dummies’ columns:

reviews_df.drop(['variation'], axis = 1, inplace = True)
reviews_df = pd.concat([reviews_df, variation_dummies],axis = 1)
reviews_df
          verified_reviews                                   feedback  Black Dot  Black Plus  Black Show  Black Spot  Charcoal Fabric  Configuration: Fire TV Stick  Heather Gray Fabric  Oak Finish  Sandstone Fabric  Walnut Finish  White  White Dot  White Plus  White Show  White Spot
    0     Love my Echo!                                      1         0          0           0           0           1                0                             0                    0           0                 0              0      0          0           0           0
    1     Loved it!                                          1         0          0           0           0           1                0                             0                    0           0                 0              0      0          0           0           0
    2     Sometimes while playing a game, you can answer...  1         0          0           0           0           0                0                             0                    0           0                 1              0      0          0           0           0
    3     I have had a lot of fun with this thing. My 4 ...  1         0          0           0           0           1                0                             0                    0           0                 0              0      0          0           0           0
    4     Music                                              1         0          0           0           0           1                0                             0                    0           0                 0              0      0          0           0           0
    ...   ...                                                ...       ...        ...         ...         ...         ...              ...                           ...                  ...         ...               ...            ...    ...        ...         ...         ...
    3145  Perfect for kids, adults and everyone in betwe...  1         1          0           0           0           0                0                             0                    0           0                 0              0      0          0           0           0
    3146  Listening to music, searching locations, check...  1         1          0           0           0           0                0                             0                    0           0                 0              0      0          0           0           0
    3147  I do love these things, i have them running my...  1         1          0           0           0           0                0                             0                    0           0                 0              0      0          0           0           0
    3148  Only complaint I have is that the sound qualit...  1         0          0           0           0           0                0                             0                    0           0                 0              0      1          0           0           0
    3149  Good                                               1         1          0           0           0           0                0                             0                    0           0                 0              0      0          0           0           0

4 – Punctuation marks, StopWords and ‘Tokenization’

import string

string.punctuation
    '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

– Test Part –

Removing punctuation marks:

Test = "Hello Mr. Future, I am so happy to be learning AI now!!!"
Test_punc_removed = [char for char in Test if char not in string.punctuation]
Test_punc_removed
   ['H',
     'e',
     'l',
     'l',
     'o',
     ' ',
     'M',
     'r',
     ' ',
     'F',
     'u',
     't',
     'u',
     'r',
     'e',
     ' ',
     'I',
     ' ',
     'a',
     'm',
     ' ',
     's',
     'o',
     ' ',
     'h',
     'a',
     'p',
     'p',
     'y',
     ' ',
     't',
     'o',
     ' ',
     'b',
     'e',
     ' ',
     'l',
     'e',
     'a',
     'r',
     'n',
     'i',
     'n',
     'g',
     ' ',
     'A',
     'I',
     ' ',
     'n',
     'o',
     'w']
Test_punc_removed_join = ''.join(Test_punc_removed)
Test_punc_removed_join
    'Hello Mr Future I am so happy to be learning AI now'
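The same result can be obtained in one step with str.translate, which is a common alternative to the character-by-character list comprehension. This is an optional sketch, not the approach used in the rest of this post.

```python
import string

test = "Hello Mr. Future, I am so happy to be learning AI now!!!"

# Build a translation table that deletes every ASCII punctuation character.
remove_punct = str.maketrans("", "", string.punctuation)
cleaned = test.translate(remove_punct)
print(cleaned)  # Hello Mr Future I am so happy to be learning AI now
```

Note that string.punctuation only contains ASCII symbols, so typographic characters such as ’ or “ are not removed by either version.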

StopWords:

– We remove the stopwords: words like ‘I’, ‘you’, ‘and’, ‘but’… that carry little information for our analysis:

import nltk 

nltk.download('stopwords')
    True
from nltk.corpus import stopwords

stopwords.words('english') # English stopword list
['i',
     'me',
     'my',
     'myself',
     'we',
     'our',
     'ours',
     'ourselves',
     'you',
     "you're",
     "you've",
     "you'll",
     "you'd",
     'your',
     'yours',
     'yourself',
     'yourselves',
     'he',
     'him',
     'his',
     'himself',
     'she',
     "she's",
     'her',
     'hers',
     'herself',
     'it',
     "it's",
     'its',
     'itself',
     'they',
     'them',
     'their',
     'theirs',
     'themselves',
     'what',
     'which',
     'who',
     'whom',
     'this',
     'that',
     "that'll",
     'these',
     'those',
     'am',
     'is',
     'are',
     'was',
     'were',
     'be',
     'been',
     'being',
     'have',
     'has',
     'had',
     'having',
     'do',
     'does',
     'did',
     'doing',
     'a',
     'an',
     'the',
     'and',
     'but',
     'if',
     'or',
     'because',
     'as',
     'until',
     'while',
     'of',
     'at',
     'by',
     'for',
     'with',
     'about',
     'against',
     'between',
     'into',
     'through',
     'during',
     'before',
     'after',
     'above',
     'below',
     'to',
     'from',
     'up',
     'down',
     'in',
     'out',
     'on',
     'off',
     'over',
     'under',
     'again',
     'further',
     'then',
     'once',
     'here',
     'there',
     'when',
     'where',
     'why',
     'how',
     'all',
     'any',
     'both',
     'each',
     'few',
     'more',
     'most',
     'other',
     'some',
     'such',
     'no',
     'nor',
     'not',
     'only',
     'own',
     'same',
     'so',
     'than',
     'too',
     'very',
     's',
     't',
     'can',
     'will',
     'just',
     'don',
     "don't",
     'should',
     "should've",
     'now',
     'd',
     'll',
     'm',
     'o',
     're',
     've',
     'y',
     'ain',
     'aren',
     "aren't",
     'couldn',
     "couldn't",
     'didn',
     "didn't",
     'doesn',
     "doesn't",
     'hadn',
     "hadn't",
     'hasn',
     "hasn't",
     'haven',
     "haven't",
     'isn',
     "isn't",
     'ma',
     'mightn',
     "mightn't",
     'mustn',
     "mustn't",
     'needn',
     "needn't",
     'shan',
     "shan't",
     'shouldn',
     "shouldn't",
     'wasn',
     "wasn't",
     'weren',
     "weren't",
     'won',
     "won't",
     'wouldn',
     "wouldn't"]
Test_punc_removed_join
    'Hello Mr Future I am so happy to be learning AI now'
Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
Test_punc_removed_join_clean
    ['Hello', 'Mr', 'Future', 'happy', 'learning', 'AI']
mini_challenge = "Here is a mini challenge, that will teach you how to remove stopwords and punctuations!!"
challenge = [char for char in mini_challenge if char not in string.punctuation]
challenge = ''.join(challenge)
challenge = [word for word in challenge.split() if word.lower() not in stopwords.words('english')]
challenge
    ['mini', 'challenge', 'teach', 'remove', 'stopwords', 'punctuations']
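The list comprehensions above call stopwords.words('english') once per word and search a list each time. A sketch of the same filtering with a set built once is shown below. To keep it runnable without downloading NLTK data, it uses a small hand-picked subset of the stopword list shown above; STOPWORDS and remove_stopwords are names invented for this example.

```python
import string

# Small hand-picked subset of the NLTK English stopword list shown above.
# In the real notebook this would be set(stopwords.words('english')).
STOPWORDS = {"i", "am", "so", "to", "be", "now", "is", "a", "the", "and"}

def remove_stopwords(text, stopwords=STOPWORDS):
    """Strip ASCII punctuation, then drop words found in `stopwords`."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.split() if w.lower() not in stopwords]

result = remove_stopwords("Hello Mr. Future, I am so happy to be learning AI now!!!")
print(result)  # ['Hello', 'Mr', 'Future', 'happy', 'learning', 'AI']
```

Membership tests on a set are constant time on average, which matters once the function is applied to thousands of reviews.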

‘Tokenization’:

from sklearn.feature_extraction.text import CountVectorizer

sample_data = ['This is the first document.','This document is the second document.','And this is the third one.','Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data)

print(vectorizer.get_feature_names())
    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']
print(X.toarray())
    [[0 1 1 1 0 0 1 0 1]
     [0 2 0 1 0 1 1 0 1]
     [1 0 0 1 1 0 1 1 1]
     [0 1 1 1 0 0 1 0 1]]
mini_challenge = ["Hello World", "Hello Hello World", "Hello World world world"]

vectorizer_challenge = CountVectorizer()
X_challenge = vectorizer_challenge.fit_transform(mini_challenge)
print(vectorizer_challenge.get_feature_names())
print(X_challenge.toarray())
    ['hello', 'world']
    [[1 1]
     [2 1]
     [1 3]]
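To see what CountVectorizer computes, the following dependency-free sketch rebuilds the first example by hand. It assumes the default settings of CountVectorizer (lowercasing and tokens of at least two word characters); the function tokenize is ours.

```python
import re
from collections import Counter

sample_data = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third one.",
    "Is this the first document?",
]

def tokenize(doc):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: sorted unique tokens across all documents.
vocabulary = sorted({tok for doc in sample_data for tok in tokenize(doc)})

# One row per document, one column per vocabulary word.
matrix = []
for doc in sample_data:
    counts = Counter(tokenize(doc))
    matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in matrix:
    print(row)
```

The vocabulary and the four rows reproduce the output printed for sample_data above, for example [0, 2, 0, 1, 0, 1, 1, 0, 1] for the second sentence, where "document" appears twice.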

– With our data –

We define a pipeline to clean all the reviews and keep the most relevant words.

1.1 – We’re going to create a function to remove punctuation marks and stopwords:

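def message_cleaning(message):
    # Keep every character that is not ASCII punctuation
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    # Join the remaining characters back into a single string
    Test_punc_removed_join = ''.join(Test_punc_removed)
    # Split into words and drop English stopwords (case-insensitive match)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean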
reviews_df_clean = reviews_df['verified_reviews'].apply(message_cleaning)
print(reviews_df_clean[3])
    ['lot', 'fun', 'thing', '4', 'yr', 'old', 'learns', 'dinosaurs', 'control', 'lights', 'play', 'games', 'like', 'categories', 'nice', 'sound', 'playing', 'music', 'well']
print(reviews_df['verified_reviews'][3])
    I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.
reviews_df_clean
    0                                            [Love, Echo]
    1                                                 [Loved]
    2       [Sometimes, playing, game, answer, question, c...
    3       [lot, fun, thing, 4, yr, old, learns, dinosaur...
    4                                                 [Music]
                                  ...                        
    3145                    [Perfect, kids, adults, everyone]
    3146    [Listening, music, searching, locations, check...
    3147    [love, things, running, entire, home, TV, ligh...
    3148    [complaint, sound, quality, isnt, great, mostl...
    3149                                               [Good]
    Name: verified_reviews, Length: 3150, dtype: object

1.2 – Applying a pre-made function in a library:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer=message_cleaning)
reviews_countvectorizer = vectorizer.fit_transform(reviews_df['verified_reviews'])

print(vectorizer.get_feature_names())
['072318', '1', '10', '100', '1000', '100X', '1010', '1030pm', '11', '1100sf', '1220', '129', '12am', '15', '150', '19', '1964', '1990s', '1990’s', '1GB', '1rst', '1st', '2', '20', '200', '2000', '2017', '2030', '229', '23', '2448', '247', '24GHZ', '24ghz', '25', '29', '299', '2999', '2Original', '2nd', '2or', '2package', '3', '30', '300', '30so', '334', '34', '342nd', '3434', '34A34', '34Alexa', '34Alexa34', '34Certified', '34Computer34', '34Dot34', '34Drop', '34First', '34Hub', '34I', '34Im', '34NEVER', '34Philips', '34Play', '34Second', '34Skills34', '34Tell', '34The', '34Things', '34Thongs', '34Try', '34Whats', '34alexa34', '34card34', '34cycle', '34cycle34', '34fixes34', '34fun34', '34group34', '34hear34', '34hmm', '34hmmm', '34it34', '34late', '34learn', '34light34', '34lights34', '34listen34', '34minor', '34outlet34', '34personal34', '34she34', '34show', '34smart', '34smart34', '34sorry', '34spying34', '34the', '34thick34', '34things', '34this', '34trouble', '34try', '34turn', '34visual34', '34wake34me', '34warehouse34', '35', '360', '39', '399', '3999', '3Dots', '3rd', '3xs', '4', '40000', '45', '48', '4K', '4am', '4k', '4th', '5', '50', '54', '5GHZ', '5GHz', '5am”', '5ghz', '5th', '6', '600', '62', '672', '6th', '7', '7000', '70s', '75', '7900', '8', '80s', '81', '83', '85', '88', '888', '8GB', '9', '90', '91', '911', '99', 'A1', 'A19', 'ABC', 'ABSOLUTELY', 'AF', 'AI', 'AIs', 'ALARM', 'ALEXA', 'ALEXUS', 'ALLRecipes', 'AMAZING', 'AMAZON', 'ANNOYING', 'ANOTHER', 'ASAP', 'ASK', 'AV', 'AVAILABLE', 'Able', 'Absolutely', 'Absolutly', 'Ac', 'AccentThe', 'Access', 'Acoustical', 'Acting', 
.......................................................................................................................
.......................................................................................................................
.......................................................................................................................
,'zero', 'zigbee', 'zonkedout', 'zwave', 'zzzz', 'í', 'útil', '—', '‘Drop', '‘appliance’', '‘technically', '‘til', '“', '“Alexa', '“Alexa”', '“Echo”', '“Hey', '“Jaws”', '“OK', '“apps”', '“check', '“dot”', '“drop', '“dropin”', '“dropping', '“free”', '“inactivity”', '“learn”', '“live”version', '“name”', '“no”', '“oh', '“oops', '“plusprimeetc”', '“ready', '“smart', '“themes”', '“things', '“updated', '“wake', '“wake”', '“white”', '“your”', '⏰', '❤', '⭐⭐⭐⭐⭐', '?', '?', '??', '??', '????', '?', '??', '?', '?', '?', '?????❤', '?', '?', '?', '?', '?', '?']
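Tokens such as '34Alexa' in the vocabulary come from the HTML entity &#34; (a double quote) that appears in some raw reviews: removing '&', '#' and ';' leaves the digits behind. One possible fix, shown here as a sketch and not applied in the rest of this post, is to decode entities with html.unescape before cleaning. The helper strip_punct is hypothetical.

```python
import html
import string

# Fragment of the long review shown earlier.
raw = "all I had to do was say &#34;Alexa, connect my devices&#34;."

def strip_punct(text):
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

# Without decoding, '&', '#' and ';' are removed but the digits '34' remain.
without_decoding = strip_punct(raw).split()
print(without_decoding)

# Decoding first turns &#34; into a double quote, which is then removed.
with_decoding = strip_punct(html.unescape(raw)).split()
print(with_decoding)
```

Decoding first would shrink the vocabulary slightly and merge tokens like '34Alexa' with 'Alexa'. Lowercasing the words would similarly merge pairs such as 'ALEXA' and 'Alexa'.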

2.1 – Tokenization:

print(reviews_countvectorizer.toarray())
    [[0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]
     ...
     [0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]]
reviews_countvectorizer.shape
    (3150, 5211)
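A quick back-of-the-envelope calculation shows why the vectorizer returns a sparse matrix. The sketch assumes 8 bytes per cell (64-bit integers) once the matrix is converted with toarray(); the actual dtype may differ depending on the library version.

```python
# Shape reported above for the document-term matrix.
n_reviews, n_terms = 3150, 5211

cells = n_reviews * n_terms
# Assumption: 8 bytes per cell (int64) in the dense representation.
dense_megabytes = cells * 8 / 1e6

print(cells)                      # 16414650
print(round(dense_megabytes, 1))  # 131.3
```

Since most reviews are short, nearly all of those 16 million cells are zeros. That is manageable here, but on larger corpora it is usually better to keep the matrix sparse.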
reviews_df
          verified_reviews                                   feedback  Black Dot  Black Plus  Black Show  Black Spot  Charcoal Fabric  Configuration: Fire TV Stick  Heather Gray Fabric  Oak Finish  Sandstone Fabric  Walnut Finish  White  White Dot  White Plus  White Show  White Spot
    0     Love my Echo!                                      1         0          0           0           0           1                0                             0                    0           0                 0              0      0          0           0           0
    1     Loved it!                                          1         0          0           0           0           1                0                             0                    0           0                 0              0      0          0           0           0
    2     Sometimes while playing a game, you can answer...  1         0          0           0           0           0                0                             0                    0           0                 1              0      0          0           0           0
    3     I have had a lot of fun with this thing. My 4 ...  1         0          0           0           0           1                0                             0                    0           0                 0              0      0          0           0           0
    4     Music                                              1         0          0           0           0           1                0                             0                    0           0                 0              0      0          0           0           0
    ...   ...                                                ...       ...        ...         ...         ...         ...              ...                           ...                  ...         ...               ...            ...    ...        ...         ...         ...
    3145  Perfect for kids, adults and everyone in betwe...  1         1          0           0           0           0                0                             0                    0           0                 0              0      0          0           0           0
    3146  Listening to music, searching locations, check...  1         1          0           0           0           0                0                             0                    0           0                 0              0      0          0           0           0
    3147  I do love these things, i have them running my...  1         1          0           0           0           0                0                             0                    0           0                 0              0      0          0           0           0
    3148  Only complaint I have is that the sound qualit...  1         0          0           0           0           0                0                             0                    0           0                 0              0      1          0           0           0
    3149  Good                                               1         1          0           0           0           0                0                             0                    0           0                 0              0      0          0           0           0

2.2 – Drop the text column and append the tokenized columns:

reviews_df.drop(['verified_reviews'], axis = 1, inplace = True)
reviews = pd.DataFrame(reviews_countvectorizer.toarray())
reviews_df = pd.concat([reviews_df, reviews], axis = 1)
reviews_df
feedbackBlack DotBlack PlusBlack ShowBlack SpotCharcoal FabricConfiguration: Fire TV StickHeather Gray FabricOak FinishSandstone Fabric...5201520252035204520552065207520852095210
01000010000...0000000000
11000010000...0000000000
21000000000...0000000000
31000010000...0000000000
41000010000...0000000000
..................................................................
31451100000000...0000000000
31461100000000...0000000000
31471100000000...0000000000
31481000000000...0000000000
31491100000000...0000000000

3 – Separate the input data from the target variable (‘feedback’):

X = reviews_df.drop(['feedback'], axis = 1)
X
Black DotBlack PlusBlack ShowBlack SpotCharcoal FabricConfiguration: Fire TV StickHeather Gray FabricOak FinishSandstone FabricWalnut Finish...5201520252035204520552065207520852095210
00000100000...0000000000
10000100000...0000000000
20000000001...0000000000
30000100000...0000000000
40000100000...0000000000
..................................................................
31451000000000...0000000000
31461000000000...0000000000
31471000000000...0000000000
31480000000000...0000000000
31491000000000...0000000000
y = reviews_df['feedback']
y
    0       1
    1       1
    2       1
    3       1
    4       1
           ..
    3145    1
    3146    1
    3147    1
    3148    1
    3149    1
    Name: feedback, Length: 3150, dtype: int64

5 – Naive Bayes Classifier

X.shape
    (3150, 5226)
y.shape
    (3150,)
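As background for this section, here is a tiny from-scratch sketch of the idea behind a multinomial Naive Bayes classifier with additive smoothing. The training reviews are made up for illustration, and the code is a simplified teaching version, not scikit-learn's implementation.

```python
import math
from collections import Counter

# Toy training set (made up for illustration): tokenized reviews and labels.
train = [
    (["love", "it"], 1),
    (["great", "sound"], 1),
    (["love", "sound"], 1),
    (["stopped", "working"], 0),
    (["terrible", "sound"], 0),
]

vocabulary = sorted({w for words, _ in train for w in words})
classes = sorted({label for _, label in train})

# Count documents per class and word occurrences per class.
doc_counts = Counter(label for _, label in train)
word_counts = {c: Counter() for c in classes}
for words, label in train:
    word_counts[label].update(words)

def log_posterior(words, c, alpha=1.0):
    """Unnormalized log P(c | words) with additive (Laplace) smoothing."""
    score = math.log(doc_counts[c] / len(train))  # log prior
    total = sum(word_counts[c].values())
    for w in words:
        if w not in vocabulary:
            continue  # ignore words never seen in training
        score += math.log((word_counts[c][w] + alpha) / (total + alpha * len(vocabulary)))
    return score

def predict(words):
    """Pick the class with the highest log posterior."""
    return max(classes, key=lambda c: log_posterior(words, c))

print(predict(["love", "sound"]))       # 1
print(predict(["stopped", "working"]))  # 0
```

The prior term is where class imbalance enters: with about 92% positive reviews, the model starts every prediction heavily tilted towards class 1.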

Creating the model:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
from sklearn.naive_bayes import MultinomialNB

nb_multinomial = MultinomialNB()
nb_multinomial.fit(X_train, y_train)
    MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
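The split above is random and unseeded, so the share of negative reviews in the test set changes from run to run. The sketch below illustrates a stratified, reproducible split in plain Python, using the class counts implied by describe() (2893 positive, 257 negative). The function stratified_split is written for this example; in scikit-learn, passing stratify=y and random_state to train_test_split should serve the same purpose.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) keeping class proportions in both parts."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Class counts derived from the describe() output (an approximation of y).
labels = [1] * 2893 + [0] * 257
train_idx, test_idx = stratified_split(labels)

test_negatives = sum(1 for i in test_idx if labels[i] == 0)
print(len(train_idx), len(test_idx), test_negatives)  # 2520 630 51
```

With stratification the test set would hold about 51 negative reviews, in line with their overall share of the dataset.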

Model validation and Confusion Matrix:

from sklearn.metrics import classification_report, confusion_matrix

y_predict_train  = nb_multinomial.predict(X_train)
y_predict_train

cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True)
y_predict_test = nb_multinomial.predict(X_test)
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot = True)
print(classification_report(y_test, y_predict_test))
                  precision    recall  f1-score   support
    
               0       0.58      0.30      0.40        60
               1       0.93      0.98      0.95       570
    
        accuracy                           0.91       630
       macro avg       0.76      0.64      0.67       630
    weighted avg       0.90      0.91      0.90       630
  • Our model captures positive reviews very well (recall of 0.98); however, the negative ones are detected quite poorly (recall of 0.30) because our dataset is not well balanced.
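To put the 0.91 accuracy in context, it helps to compare it with a trivial baseline that always predicts "positive". The sketch uses only the support values from the report above.

```python
# Support values from the classification report above.
n_negative, n_positive = 60, 570
n_total = n_negative + n_positive

# A trivial classifier that always predicts "positive" (1).
baseline_accuracy = n_positive / n_total
baseline_recall_negative = 0 / n_negative  # it never finds a negative review

print(round(baseline_accuracy, 3))  # 0.905
print(baseline_recall_negative)     # 0.0
```

The baseline already reaches about 90.5% accuracy, so on this dataset accuracy alone says little. The recall on the negative class is the more informative number.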

6 – Logistic regression classifier

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Creating the model:

model = LogisticRegression()
model.fit(X_train, y_train)
    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='auto', n_jobs=None, penalty='l2',
                       random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                       warm_start=False)
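Conceptually, a fitted logistic regression computes a weighted sum of the features and passes it through a sigmoid to obtain a probability. The sketch below shows that prediction step only. The weights, the bias and the three word features are hypothetical values chosen for illustration, not coefficients of the model trained above.

```python
import math

def sigmoid(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

# Hypothetical weights for three word-count features: "love", "great", "terrible".
weights = [1.2, 0.8, -2.5]
bias = 0.5

results = []
for counts in ([1, 0, 0], [0, 0, 1]):
    p = predict_proba(counts, weights, bias)
    label = 1 if p >= 0.5 else 0  # default decision threshold
    results.append((p, label))
    print(counts, round(p, 3), label)
```

Lowering or raising the 0.5 threshold is one simple way to trade recall on the negative class against recall on the positive class.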

Model validation and Confusion Matrix:

y_pred = model.predict(X_test)
y_pred
    array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
           1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
           0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])
print("Accuracy {} %".format(100*accuracy_score(y_test, y_pred)))
    Accuracy 93.33333333333333 %
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True)
print(classification_report(y_test, y_pred))
                  precision    recall  f1-score   support
    
               0       0.78      0.42      0.54        60
               1       0.94      0.99      0.96       570
    
        accuracy                           0.93       630
       macro avg       0.86      0.70      0.75       630
    weighted avg       0.93      0.93      0.92       630
  • Our model captures positive reviews very well (recall of 0.99); however, the negative ones are detected quite poorly (recall of 0.42). Nevertheless, it performs better than the Naive Bayes model.
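One way to compare the two models on equal terms for both classes is the balanced accuracy, the unweighted mean of the per-class recalls. The sketch below computes it from the rounded recall values printed in the two reports, so the results are approximate.

```python
# Per-class recall values copied from the two classification reports above.
recalls = {
    "Naive Bayes": {"negative": 0.30, "positive": 0.98},
    "Logistic regression": {"negative": 0.42, "positive": 0.99},
}

# Balanced accuracy is the unweighted mean of the per-class recalls,
# the same quantity reported as "macro avg" recall.
balanced = {name: sum(r.values()) / len(r) for name, r in recalls.items()}

for name, score in balanced.items():
    print(f"{name}: {score:.3f}")
```

The values, roughly 0.64 against 0.70, agree with the macro avg recall in the reports and confirm that logistic regression handles the minority class better, although both models still miss more than half of the negative reviews.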