[This work is based on this course: Data Science for Business | 6 Real-world Case Studies.]

The Public Relations Department has provided us with a dataset containing customers’ evaluations of their commercial products. The team wants to predict whether customers are satisfied and has asked us to develop a predictive model.

1 – Import libraries and data visualization

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

reviews_df = pd.read_csv("amazon_alexa.tsv", sep = "\t")
reviews_df

	rating	date	variation	verified_reviews	feedback
0	5	31-Jul-18	Charcoal Fabric	Love my Echo!	1
1	5	31-Jul-18	Charcoal Fabric	Loved it!	1
2	4	31-Jul-18	Walnut Finish	Sometimes while playing a game, you can answer...	1
3	5	31-Jul-18	Charcoal Fabric	I have had a lot of fun with this thing. My 4 ...	1
4	5	31-Jul-18	Charcoal Fabric	Music	1
...	...	...	...	...	...
3145	5	30-Jul-18	Black Dot	Perfect for kids, adults and everyone in betwe...	1
3146	5	30-Jul-18	Black Dot	Listening to music, searching locations, check...	1
3147	5	30-Jul-18	Black Dot	I do love these things, i have them running my...	1
3148	5	30-Jul-18	White Dot	Only complaint I have is that the sound qualit...	1
3149	4	29-Jul-18	Black Dot	Good	1

reviews_df.info()

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 3150 entries, 0 to 3149
    Data columns (total 5 columns):
     #   Column            Non-Null Count  Dtype 
    ---  ------            --------------  ----- 
     0   rating            3150 non-null   int64 
     1   date              3150 non-null   object
     2   variation         3150 non-null   object
     3   verified_reviews  3150 non-null   object
     4   feedback          3150 non-null   int64 
    dtypes: int64(2), object(3)
    memory usage: 123.2+ KB

reviews_df.describe()

	rating	feedback
count	3150.000000	3150.000000
mean	4.463175	0.918413
std	1.068506	0.273778
min	1.000000	0.000000
25%	4.000000	1.000000
50%	5.000000	1.000000
75%	5.000000	1.000000
max	5.000000	1.000000

We can observe that about 92% of the reviews carry positive feedback (the mean of ‘feedback’ is 0.918), and the average rating is 4.46 out of 5.
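The share of each class can also be read directly from the label column instead of from the mean in describe(). A minimal sketch, assuming a toy `feedback` column (the values below are made up for illustration and are not the Alexa data):

```python
import pandas as pd

# Toy stand-in for reviews_df['feedback'] (illustrative values only)
toy_feedback = pd.Series([1, 1, 1, 0, 1, 1, 1, 1, 0, 1], name="feedback")

# normalize=True returns proportions instead of raw counts
class_share = toy_feedback.value_counts(normalize=True)
print(class_share)

# For a 0/1 column the mean equals the share of ones
positive_share = toy_feedback.mean()
print(f"Positive share: {positive_share:.1%}")
```

On the real data the same two calls would be applied to reviews_df['feedback'].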

reviews_df['verified_reviews']

    0                                           Love my Echo!
    1                                               Loved it!
    2       Sometimes while playing a game, you can answer...
    3       I have had a lot of fun with this thing. My 4 ...
    4                                                   Music
                                  ...                        
    3145    Perfect for kids, adults and everyone in betwe...
    3146    Listening to music, searching locations, check...
    3147    I do love these things, i have them running my...
    3148    Only complaint I have is that the sound qualit...
    3149                                                 Good
    Name: verified_reviews, Length: 3150, dtype: object

Missing values

sns.heatmap(reviews_df.isnull(), yticklabels=False, cbar = False, cmap = "Blues")
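A numeric check complements the heatmap, since an all-blank plot is easy to misread. The sketch below uses a small toy frame (an assumption for illustration) that does contain gaps; for the real reviews_df, info() above already reports 3150 non-null values in every column.

```python
import numpy as np
import pandas as pd

# Toy frame with one gap per column (illustrative only)
toy_df = pd.DataFrame({
    "rating": [5, 4, np.nan],
    "verified_reviews": ["Love it", None, "Good"],
})

# Number of missing entries per column
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)
```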

Visualization

reviews_df.hist(bins = 30, figsize = (13, 5), color = 'r')

We can see that the dataset is fairly unbalanced: there are many more positive than negative evaluations.

– We add a column ‘length’ with the number of characters in each review:

reviews_df['length'] = reviews_df['verified_reviews'].apply(len)
reviews_df.head()

	rating	date	variation	verified_reviews	feedback	length
0	5	31-Jul-18	Charcoal Fabric	Love my Echo!	1	13
1	5	31-Jul-18	Charcoal Fabric	Loved it!	1	9
2	4	31-Jul-18	Walnut Finish	Sometimes while playing a game, you can answer...	1	195
3	5	31-Jul-18	Charcoal Fabric	I have had a lot of fun with this thing. My 4 ...	1	172
4	5	31-Jul-18	Charcoal Fabric	Music	1	5

reviews_df['length'].plot(bins = 100, kind = 'hist')

There are many customers with short reviews.

reviews_df[reviews_df['length'] == 2851]['verified_reviews'].iloc[0]

    "Incredible piece of technology.I have this right center of my living room on an island kitchen counter. The mic and speaker goes in every direction and the quality of the sound is quite good. I connected the Echo via Bluetooth to my Sony soundbar on my TV but find the Echo placement and 360 sound more appealing. It's no audiophile equipment but there is good range and decent bass. The sound is more than adequate for any indoor entertaining and loud enough to bother neighbors in my building. The knob on the top works great for adjusting volume. This is my first Echo device and I would imagine having to press volume buttons (on the Echo 2) a large inconvenience and not as precise. For that alone I would recommend this over the regular Echo (2nd generation).The piece looks quality and is quite sturdy with some weight on it. The rubber material on the bottom has a good grip on the granite counter-- my cat can even rub her scent on it without tipping it over.This order came with a free Philips Hue Bulb which I installed along with an extra one I bought. I put the 2 bulbs into my living room floor lamp, turned on the light, and all I had to do was say &#34;Alexa, connect my devices&#34;. The default names for each bulb was assigned as &#34;First light&#34; and &#34;Second light&#34;, so I can have a dimmer floor lamp if I just turned on/off one of the lights by saying &#34;Alexa, turn off the second light&#34;. In the Alexa app, I created a 'Group' with &#34;First light&#34; and &#34;Second light&#34; and named the group &#34;The light&#34;, so to turn on the lamp with both bulbs shining I just say &#34;Alexa, turn on The light&#34;.I was surprised how easily the bulbs connected to the Echo Plus with its built in hub. I thought I would have to buy a hub bridge to connect to my floor lamp power plug. Apparently there is some technology built directly inside the bulb! I was surprised by that. Awesome.You will feel like Tony Stark on this device. 
I added quite a few &#34;Skills&#34; like 'Thunderstorm sounds' and 'Quote of the day' . Alexa always loads them up quickly. Adding songs that you hear to specific playlists on Amazon Music is also a great feature.I can go on and on and this is only my second day of ownership.I was lucky to buy this for $100 on Prime Day, but I think for $150 is it pretty expensive considering the Echo 2 is only $100. In my opinion, you will be paying a premium for the Echo Plus and you have to decide if the value is there for you:1) Taller and 360 sound unit.2) Volume knob on top that you spin (I think this is a huge benefit over buttons)3) Built in hub for Hue bulbs. After researching more, there are some cons to this setup if you plan on having more advanced light setups. For me and my floor lamp, it's just perfect.I highly recommend it and will buy an Echo dot for my bedroom now."

reviews_df.length.describe()

    count    3150.000000
    mean      132.049524
    std       182.099952
    min         1.000000
    25%        30.000000
    50%        74.000000
    75%       165.000000
    max      2851.000000
    Name: length, dtype: float64

reviews_df[reviews_df['length'] == 1]['verified_reviews'].iloc[0]

'?'

reviews_df[reviews_df['length'] == 133]['verified_reviews'].iloc[0]

  'Fun item to play with and get used to using.  Sometimes has hard time answering the questions you ask, but I think it will be better.'

The longest review has 2851 characters and the shortest has just 1 (an emoticon).
The average review length is about 132 characters.

Let’s separate the positive (1) and negative (0) feedback:

positive = reviews_df[reviews_df['feedback'] == 1]
negative = reviews_df[reviews_df['feedback'] == 0]
negative

	rating	date	variation	verified_reviews	feedback	length
46	2	30-Jul-18	Charcoal Fabric	It's like Siri, in fact, Siri answers more acc...	0	163
111	2	30-Jul-18	Charcoal Fabric	Sound is terrible if u want good music too get...	0	53
141	1	30-Jul-18	Charcoal Fabric	Not much features.	0	18
162	1	30-Jul-18	Sandstone Fabric	Stopped working after 2 weeks ,didn't follow c...	0	87
176	2	30-Jul-18	Heather Gray Fabric	Sad joke. Worthless.	0	20
...	...	...	...	...	...	...
3047	1	30-Jul-18	Black Dot	Echo Dot responds to us when we aren't even ta...	0	120
3048	1	30-Jul-18	White Dot	NOT CONNECTED TO MY PHONE PLAYLIST 🙁	0	37
3067	2	30-Jul-18	Black Dot	The only negative we have on this product is t...	0	240
3091	1	30-Jul-18	Black Dot	I didn’t order it	0	17
3096	1	30-Jul-18	White Dot	The product sounded the same as the emoji spea...	0	210

positive

	rating	date	variation	verified_reviews	feedback	length
0	5	31-Jul-18	Charcoal Fabric	Love my Echo!	1	13
1	5	31-Jul-18	Charcoal Fabric	Loved it!	1	9
2	4	31-Jul-18	Walnut Finish	Sometimes while playing a game, you can answer...	1	195
3	5	31-Jul-18	Charcoal Fabric	I have had a lot of fun with this thing. My 4 ...	1	172
4	5	31-Jul-18	Charcoal Fabric	Music	1	5
...	...	...	...	...	...	...
3145	5	30-Jul-18	Black Dot	Perfect for kids, adults and everyone in betwe...	1	50
3146	5	30-Jul-18	Black Dot	Listening to music, searching locations, check...	1	135
3147	5	30-Jul-18	Black Dot	I do love these things, i have them running my...	1	441
3148	5	30-Jul-18	White Dot	Only complaint I have is that the sound qualit...	1	380
3149	4	29-Jul-18	Black Dot	Good	1	4

sns.countplot(x = 'feedback', data = reviews_df)

sns.countplot(x = 'rating', data = reviews_df)

reviews_df['rating'].hist(bins = 5)

Most consumers give a 5.

– Let’s look at the ‘variation’ field, which indicates the type of product:

plt.figure(figsize=(40, 15))
sns.barplot(x = 'variation', y = 'rating', data = reviews_df, palette='deep')

2 – Reviews

All Reviews

– We collect the reviews in a single list, so that they can later be joined into one string separated by blank spaces:

sentences = reviews_df['verified_reviews'].tolist()
len(sentences)

print(sentences)

['Love my Echo!', 'Loved it!', 'Sometimes while playing a game, you can answer a question correctly but Alexa says you got it wrong and answers the same as you. I like being able to turn lights on and off while away from home.', 'I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.', 'Music', 'I received the echo as a gift. I needed another Bluetooth or something to play music easily accessible, and found this smart speaker. Can’t wait to see what else it can do.', 'Without having a cellphone, I cannot use many of her features. I have an iPad but do not see that of any use. It IS a great alarm. If u r almost deaf, you can hear her alarm in the bedroom from out in the living room, so that is reason enough to keep her.It is fun to ask random questions to hear her response. She does not seem to be very smartbon politics yet.', "I think this is the 5th one I've purchased. I'm working on getting one in every room of my house. I really like what features they offer specifily playing music on all Echos and controlling the lights throughout my house.", 'looks great', 'Love it! I’ve listened to songs I haven’t heard since childhood! I get the news, weather, information! It’s great!', 'I sent it to my 85 year old Dad, and he talks to it constantly.', "I love it! Learning knew things with it eveyday! Still figuring out how everything works but so far it's been easy to use and understand. She does make me laugh at times", 'I purchased this for my mother who is having knee problems now, to give her something to do while trying to over come not getting around so fast like she did.She enjoys all the little and big things it can do...Alexa play this song, What time is it and where, and how to cook this and that!', 'Love, Love, Love!!', 'Just what I expected....', 'I love it, wife hates it.',
......................................................................................
......................................................................................
......................................................................................
"I was really happy with my original echo so i thought I'd get an echo dot to use in my bedroom. I was really disappointed in the audio quality so I connected an external speaker via bluetooth. The audio was much better but I started having problems with it loosing connection with the wifi, presumably due to interference from the bluetooth. Then I connected a speaker via the auxiliary jack. when i did that, the auxiliary jack picked up interference from the wifi and I was woken up in the middle of the night by a horrible buzzing sound. im hoping Amazon will take this thing back and give me a good deal on an echo spot which I hope will be a better nightstand device.",
"Weak sound. Compared to the Google Home Mini the sound is bad. Also you need Prime to have a small selection of music and need prime music to play everything. Also if you get two you need prime family music to play music on both. I think it's lame that you need a family plan to use multiple devices.",
"Echo Dot responds to us when we aren't even talking to it. I've unplugged it. It feels like it's &#34;spying&#34; on us.",
'NOT CONNECTED TO MY PHONE PLAYLIST :(',
'The only negative we have on this product is the terrible sound quality. A massive difference from the Alexa. Which to us was a big reason we wanted to purchase this.Won’t be buying another until the speaker and sound quality can improve.',
'I didn’t order it',
'The product sounded the same as the emoji speaker from five below my sister has ... and even that one has Bluetooth and doesn’t need to be plugged in. The only good thing about this is that you can speak to it.']

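from wordcloud import WordCloud

# Collect the negative reviews (assumed definition of negative_list) and join them into one string
negative_list = negative['verified_reviews'].tolist()
negative_sentences_as_one_string = " ".join(negative_list)

plt.figure(figsize = (20,20))
plt.imshow(WordCloud().generate(negative_sentences_as_one_string))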

Words such as ‘device’, ‘work’ and ‘love’ stand out in the word cloud of the negative reviews.
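A word cloud is hard to read precisely, so a plain frequency count can back it up. A minimal sketch using only the standard library; the three sentences are taken from the negative reviews shown earlier, and the tokenization (lowercase, keep runs of letters) is a simplifying assumption:

```python
import re
from collections import Counter

sample_negative = [
    "Not much features.",
    "Sad joke. Worthless.",
    "NOT CONNECTED TO MY PHONE PLAYLIST",
]

# Lowercase each sentence and keep runs of letters/apostrophes as words
words = []
for sentence in sample_negative:
    words.extend(re.findall(r"[a-z']+", sentence.lower()))

word_counts = Counter(words)
print(word_counts.most_common(3))
```

On the full data, the same counting could be applied to negative_sentences_as_one_string, ideally after removing stopwords.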

3 – Data Cleaning

reviews_df.head()

	rating	date	variation	verified_reviews	feedback	length
0	5	31-Jul-18	Charcoal Fabric	Love my Echo!	1	13
1	5	31-Jul-18	Charcoal Fabric	Loved it!	1	9
2	4	31-Jul-18	Walnut Finish	Sometimes while playing a game, you can answer...	1	195
3	5	31-Jul-18	Charcoal Fabric	I have had a lot of fun with this thing. My 4 ...	1	172
4	5	31-Jul-18	Charcoal Fabric	Music	1	5

– Let’s remove some columns:

reviews_df = reviews_df.drop(['date', 'rating', 'length'], axis = 1)
reviews_df

	variation	verified_reviews	feedback
0	Charcoal Fabric	Love my Echo!	1
1	Charcoal Fabric	Loved it!	1
2	Walnut Finish	Sometimes while playing a game, you can answer...	1
3	Charcoal Fabric	I have had a lot of fun with this thing. My 4 ...	1
4	Charcoal Fabric	Music	1
...	...	...	...
3145	Black Dot	Perfect for kids, adults and everyone in betwe...	1
3146	Black Dot	Listening to music, searching locations, check...	1
3147	Black Dot	I do love these things, i have them running my...	1
3148	White Dot	Only complaint I have is that the sound qualit...	1
3149	Black Dot	Good	1

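– Convert ‘variation’ (the kind of product) to dummy variables:

# dtype=int keeps the 0/1 display shown below (newer pandas versions default to True/False)
variation_dummies = pd.get_dummies(reviews_df['variation'], drop_first=True, dtype=int)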
variation_dummies

	Black Dot	Black Plus	Black Show	Black Spot	Charcoal Fabric	Configuration: Fire TV Stick	Heather Gray Fabric	Oak Finish	Sandstone Fabric	Walnut Finish	White	White Dot	White Plus	White Show	White Spot
0	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
3	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3145	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3146	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3147	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3148	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
3149	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
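With drop_first=True, the first category in sorted order gets no column of its own and becomes the implicit baseline (a row of all zeros). A small sketch with three of the variation names; the toy Series is an assumption for illustration:

```python
import pandas as pd

toy_variation = pd.Series(["Black Dot", "Charcoal Fabric", "White Dot", "Black Dot"])

# Without drop_first: one column per category
full = pd.get_dummies(toy_variation)

# With drop_first: the first category ("Black Dot") becomes the implicit baseline
reduced = pd.get_dummies(toy_variation, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one column avoids a redundant feature, since its value is fully determined by the others.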

– Remove the ‘variation’ column from reviews_df and add the ‘variation_dummies’ columns:

reviews_df.drop(['variation'], axis = 1, inplace = True)
reviews_df = pd.concat([reviews_df, variation_dummies],axis = 1)
reviews_df

	verified_reviews	feedback	Black Dot	Black Plus	Black Show	Black Spot	Charcoal Fabric	Configuration: Fire TV Stick	Heather Gray Fabric	Oak Finish	Sandstone Fabric	Walnut Finish	White	White Dot	White Plus	White Show	White Spot
0	Love my Echo!	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
1	Loved it!	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
2	Sometimes while playing a game, you can answer...	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
3	I have had a lot of fun with this thing. My 4 ...	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
4	Music	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3145	Perfect for kids, adults and everyone in betwe...	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3146	Listening to music, searching locations, check...	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3147	I do love these things, i have them running my...	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3148	Only complaint I have is that the sound qualit...	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
3149	Good	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0

4 – Punctuation marks, StopWords and ‘Tokenization’

import string

string.punctuation

    '!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

– Test Part –

Removing punctuation marks:

Test = "Hello Mr. Future, I am so happy to be learning AI now!!!"

Test_punc_removed = [char for char in Test if char not in string.punctuation]
Test_punc_removed

   ['H',
     'e',
     'l',
     'l',
     'o',
     ' ',
     'M',
     'r',
     ' ',
     'F',
     'u',
     't',
     'u',
     'r',
     'e',
     ' ',
     'I',
     ' ',
     'a',
     'm',
     ' ',
     's',
     'o',
     ' ',
     'h',
     'a',
     'p',
     'p',
     'y',
     ' ',
     't',
     'o',
     ' ',
     'b',
     'e',
     ' ',
     'l',
     'e',
     'a',
     'r',
     'n',
     'i',
     'n',
     'g',
     ' ',
     'A',
     'I',
     ' ',
     'n',
     'o',
     'w']

Test_punc_removed_join = ''.join(Test_punc_removed)
Test_punc_removed_join

    'Hello Mr Future I am so happy to be learning AI now'
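The character-by-character filter followed by join can also be written with str.translate, which removes all punctuation in one pass. This is an alternative sketch, not the approach used in the rest of the post:

```python
import string

test_sentence = "Hello Mr. Future, I am so happy to be learning AI now!!!"

# Build a table that maps every punctuation character to None (i.e. deletes it)
remove_punct = str.maketrans("", "", string.punctuation)
no_punct = test_sentence.translate(remove_punct)
print(no_punct)
```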

StopWords:

– We remove the stopwords: words like ‘I’, ‘you’, ‘and’, ‘but’... They add no information to our analysis:

import nltk 

nltk.download('stopwords')

    True

from nltk.corpus import stopwords

stopwords.words('english') # stopwords for the English language

['i',
     'me',
     'my',
     'myself',
     'we',
     'our',
     'ours',
     'ourselves',
     'you',
     "you're",
     "you've",
     "you'll",
     "you'd",
     'your',
     'yours',
     'yourself',
     'yourselves',
     'he',
     'him',
     'his',
     'himself',
     'she',
     "she's",
     'her',
     'hers',
     'herself',
     'it',
     "it's",
     'its',
     'itself',
     'they',
     'them',
     'their',
     'theirs',
     'themselves',
     'what',
     'which',
     'who',
     'whom',
     'this',
     'that',
     "that'll",
     'these',
     'those',
     'am',
     'is',
     'are',
     'was',
     'were',
     'be',
     'been',
     'being',
     'have',
     'has',
     'had',
     'having',
     'do',
     'does',
     'did',
     'doing',
     'a',
     'an',
     'the',
     'and',
     'but',
     'if',
     'or',
     'because',
     'as',
     'until',
     'while',
     'of',
     'at',
     'by',
     'for',
     'with',
     'about',
     'against',
     'between',
     'into',
     'through',
     'during',
     'before',
     'after',
     'above',
     'below',
     'to',
     'from',
     'up',
     'down',
     'in',
     'out',
     'on',
     'off',
     'over',
     'under',
     'again',
     'further',
     'then',
     'once',
     'here',
     'there',
     'when',
     'where',
     'why',
     'how',
     'all',
     'any',
     'both',
     'each',
     'few',
     'more',
     'most',
     'other',
     'some',
     'such',
     'no',
     'nor',
     'not',
     'only',
     'own',
     'same',
     'so',
     'than',
     'too',
     'very',
     's',
     't',
     'can',
     'will',
     'just',
     'don',
     "don't",
     'should',
     "should've",
     'now',
     'd',
     'll',
     'm',
     'o',
     're',
     've',
     'y',
     'ain',
     'aren',
     "aren't",
     'couldn',
     "couldn't",
     'didn',
     "didn't",
     'doesn',
     "doesn't",
     'hadn',
     "hadn't",
     'hasn',
     "hasn't",
     'haven',
     "haven't",
     'isn',
     "isn't",
     'ma',
     'mightn',
     "mightn't",
     'mustn',
     "mustn't",
     'needn',
     "needn't",
     'shan',
     "shan't",
     'shouldn',
     "shouldn't",
     'wasn',
     "wasn't",
     'weren',
     "weren't",
     'won',
     "won't",
     'wouldn',
     "wouldn't"]

Test_punc_removed_join

    'Hello Mr Future I am so happy to be learning AI now'

Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
Test_punc_removed_join_clean

    ['Hello', 'Mr', 'Future', 'happy', 'learning', 'AI']

mini_challenge = "Here is a mini challenge, that will teach you how to remove stopwords and punctuations!!"

challenge = [char for char in mini_challenge if char not in string.punctuation]
challenge = ''.join(challenge)
challenge = [word for word in challenge.split() if word.lower() not in stopwords.words('english')]
challenge

    ['mini', 'challenge', 'teach', 'remove', 'stopwords', 'punctuations']

‘Tokenization’:

from sklearn.feature_extraction.text import CountVectorizer

sample_data = ['This is the first document.','This document is the second document.','And this is the third one.','Is this the first document?']

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(sample_data)

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

    ['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third', 'this']

print(X.toarray())

    [[0 1 1 1 0 0 1 0 1]
     [0 2 0 1 0 1 1 0 1]
     [1 0 0 1 1 0 1 1 1]
     [0 1 1 1 0 0 1 0 1]]

mini_challenge = ["Hello World", "Hello Hello World", "Hello World world world"]

vectorizer_challenge = CountVectorizer()
X_challenge = vectorizer_challenge.fit_transform(mini_challenge)
print(vectorizer_challenge.get_feature_names_out().tolist())
print(X_challenge.toarray())

    ['hello', 'world']
    [[1 1]
     [2 1]
     [1 3]]
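To make explicit what CountVectorizer computes, the same count matrix can be rebuilt by hand. A sketch under the assumption that the defaults are in use (lowercasing, tokens of two or more word characters, vocabulary in sorted order); the helper name `bag_of_words` is made up for this example:

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of strings."""
    # Lowercase, then keep tokens of at least two word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary word, in vocabulary order
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocab, counts = bag_of_words(["Hello World", "Hello Hello World", "Hello World world world"])
print(vocab)
print(counts)
```

Each row is one document and each column one vocabulary word, which is the layout the classifiers below receive.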

– With our data –

We define a pipeline to clean all the reviews and keep only the most relevant words.

1.1 – We’re going to create a function to remove punctuation marks and stopwords:

def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean

reviews_df_clean = reviews_df['verified_reviews'].apply(message_cleaning)
print(reviews_df_clean[3])

    ['lot', 'fun', 'thing', '4', 'yr', 'old', 'learns', 'dinosaurs', 'control', 'lights', 'play', 'games', 'like', 'categories', 'nice', 'sound', 'playing', 'music', 'well']

print(reviews_df['verified_reviews'][3])

    I have had a lot of fun with this thing. My 4 yr old learns about dinosaurs, i control the lights and play games like categories. Has nice sound when playing music as well.

reviews_df_clean

    0                                            [Love, Echo]
    1                                                 [Loved]
    2       [Sometimes, playing, game, answer, question, c...
    3       [lot, fun, thing, 4, yr, old, learns, dinosaur...
    4                                                 [Music]
                                  ...                        
    3145                    [Perfect, kids, adults, everyone]
    3146    [Listening, music, searching, locations, check...
    3147    [love, things, running, entire, home, TV, ligh...
    3148    [complaint, sound, quality, isnt, great, mostl...
    3149                                               [Good]
    Name: verified_reviews, Length: 3150, dtype: object
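The list comprehension in message_cleaning evaluates stopwords.words('english') for every word and then searches a list. Building a set once makes the membership test much cheaper on 3150 reviews. A sketch; `clean_message` and the tiny `STOP_WORDS` set are stand-ins so the example runs without downloading the NLTK corpus (in practice the set would be set(stopwords.words('english'))):

```python
import string

# Stand-in stop-word set (assumption); with NLTK: set(stopwords.words('english'))
STOP_WORDS = {"i", "have", "had", "a", "of", "with", "this", "my", "the", "and"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_message(message, stop_words=STOP_WORDS):
    """Strip punctuation, then drop stop words (case-insensitive)."""
    no_punct = message.translate(PUNCT_TABLE)
    return [word for word in no_punct.split() if word.lower() not in stop_words]

cleaned = clean_message("I have had a lot of fun with this thing.")
print(cleaned)
```

The function keeps the same contract as message_cleaning (string in, list of words out), so it could be passed as the analyzer in the same way.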

1.2 – Using CountVectorizer from scikit-learn with our cleaning function as the analyzer:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer=message_cleaning)
reviews_countvectorizer = vectorizer.fit_transform(reviews_df['verified_reviews'])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out().tolist())

['072318', '1', '10', '100', '1000', '100X', '1010', '1030pm', '11', '1100sf', '1220', '129', '12am', '15', '150', '19', '1964', '1990s', '1990’s', '1GB', '1rst', '1st', '2', '20', '200', '2000', '2017', '2030', '229', '23', '2448', '247', '24GHZ', '24ghz', '25', '29', '299', '2999', '2Original', '2nd', '2or', '2package', '3', '30', '300', '30so', '334', '34', '342nd', '3434', '34A34', '34Alexa', '34Alexa34', '34Certified', '34Computer34', '34Dot34', '34Drop', '34First', '34Hub', '34I', '34Im', '34NEVER', '34Philips', '34Play', '34Second', '34Skills34', '34Tell', '34The', '34Things', '34Thongs', '34Try', '34Whats', '34alexa34', '34card34', '34cycle', '34cycle34', '34fixes34', '34fun34', '34group34', '34hear34', '34hmm', '34hmmm', '34it34', '34late', '34learn', '34light34', '34lights34', '34listen34', '34minor', '34outlet34', '34personal34', '34she34', '34show', '34smart', '34smart34', '34sorry', '34spying34', '34the', '34thick34', '34things', '34this', '34trouble', '34try', '34turn', '34visual34', '34wake34me', '34warehouse34', '35', '360', '39', '399', '3999', '3Dots', '3rd', '3xs', '4', '40000', '45', '48', '4K', '4am', '4k', '4th', '5', '50', '54', '5GHZ', '5GHz', '5am”', '5ghz', '5th', '6', '600', '62', '672', '6th', '7', '7000', '70s', '75', '7900', '8', '80s', '81', '83', '85', '88', '888', '8GB', '9', '90', '91', '911', '99', 'A1', 'A19', 'ABC', 'ABSOLUTELY', 'AF', 'AI', 'AIs', 'ALARM', 'ALEXA', 'ALEXUS', 'ALLRecipes', 'AMAZING', 'AMAZON', 'ANNOYING', 'ANOTHER', 'ASAP', 'ASK', 'AV', 'AVAILABLE', 'Able', 'Absolutely', 'Absolutly', 'Ac', 'AccentThe', 'Access', 'Acoustical', 'Acting', 
.......................................................................................................................
.......................................................................................................................
.......................................................................................................................
,'zero', 'zigbee', 'zonkedout', 'zwave', 'zzzz', 'í', 'útil', '—', '‘Drop', '‘appliance’', '‘technically', '‘til', '“', '“Alexa', '“Alexa”', '“Echo”', '“Hey', '“Jaws”', '“OK', '“apps”', '“check', '“dot”', '“drop', '“dropin”', '“dropping', '“free”', '“inactivity”', '“learn”', '“live”version', '“name”', '“no”', '“oh', '“oops', '“plusprimeetc”', '“ready', '“smart', '“themes”', '“things', '“updated', '“wake', '“wake”', '“white”', '“your”', '⏰', '❤', '⭐⭐⭐⭐⭐', '?', '?', '??', '??', '????', '?', '??', '?', '?', '?', '?????❤', '?', '?', '?', '?', '?', '?']

2.1 – Tokenization:

print(reviews_countvectorizer.toarray())

    [[0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]
     ...
     [0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]
     [0 0 0 ... 0 0 0]]

reviews_countvectorizer.shape

    (3150, 5211)

reviews_df

	verified_reviews	feedback	Black Dot	Black Plus	Black Show	Black Spot	Charcoal Fabric	Configuration: Fire TV Stick	Heather Gray Fabric	Oak Finish	Sandstone Fabric	Walnut Finish	White	White Dot	White Plus	White Show	White Spot
0	Love my Echo!	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
1	Loved it!	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
2	Sometimes while playing a game, you can answer...	1	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0
3	I have had a lot of fun with this thing. My 4 ...	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
4	Music	1	0	0	0	0	1	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3145	Perfect for kids, adults and everyone in betwe...	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3146	Listening to music, searching locations, check...	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3147	I do love these things, i have them running my...	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0
3148	Only complaint I have is that the sound qualit...	1	0	0	0	0	0	0	0	0	0	0	0	1	0	0	0
3149	Good	1	1	0	0	0	0	0	0	0	0	0	0	0	0	0	0

2.2 – Drop the raw text column and append the token count columns:

reviews_df.drop(['verified_reviews'], axis = 1, inplace = True)

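reviews = pd.DataFrame(reviews_countvectorizer.toarray())
# Recent scikit-learn versions reject a mix of integer and string column names
reviews.columns = reviews.columns.astype(str)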

reviews_df = pd.concat([reviews_df, reviews], axis = 1)
reviews_df

	feedback	Black Dot	Black Plus	Black Show	Black Spot	Charcoal Fabric	Configuration: Fire TV Stick	Heather Gray Fabric	Oak Finish	Sandstone Fabric	...	5201	5202	5203	5204	5205	5206	5207	5208	5209	5210
0	1	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	1	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	1	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	1	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3145	1	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3146	1	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3147	1	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3148	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3149	1	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

3 – Separate the input features from the target variable (‘feedback’):

X = reviews_df.drop(['feedback'], axis = 1)
X

	Black Dot	Black Plus	Black Show	Black Spot	Charcoal Fabric	Configuration: Fire TV Stick	Heather Gray Fabric	Oak Finish	Sandstone Fabric	Walnut Finish	...	5201	5202	5203	5204	5205	5206	5207	5208	5209	5210
0	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	1	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	1	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
3145	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3146	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3147	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3148	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3149	1	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

y = reviews_df['feedback']
y

    0       1
    1       1
    2       1
    3       1
    4       1
           ..
    3145    1
    3146    1
    3147    1
    3148    1
    3149    1
    Name: feedback, Length: 3150, dtype: int64

5 – Naive Bayes Classifier

X.shape

    (3150, 5226)

y.shape

    (3150,)

Creating the model:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)

from sklearn.naive_bayes import MultinomialNB

nb_multinomial = MultinomialNB()
nb_multinomial.fit(X_train, y_train)

    MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
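Because only a small fraction of the reviews are negative, a plain random split can leave the test set with noticeably more or fewer negatives than the training set. Passing stratify to train_test_split keeps the class ratio the same in both parts. A sketch on toy labels (90 positives, 10 negatives); the data and random_state=0 are arbitrary assumptions, and this is an optional variation rather than the split used above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced data: 90 positives, 10 negatives (illustrative only)
toy_y = np.array([1] * 90 + [0] * 10)
toy_X = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=0
)

# With stratify, both splits keep the 90/10 class ratio
print(y_tr.mean(), y_te.mean())
```

Fixing random_state also makes the reported metrics reproducible from run to run.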

Model validation and Confusion Matrix:

from sklearn.metrics import classification_report, confusion_matrix

y_predict_train  = nb_multinomial.predict(X_train)
y_predict_train

cm = confusion_matrix(y_train, y_predict_train)
sns.heatmap(cm, annot = True)

y_predict_test = nb_multinomial.predict(X_test)
cm = confusion_matrix(y_test, y_predict_test)
sns.heatmap(cm, annot = True)

print(classification_report(y_test, y_predict_test))

                  precision    recall  f1-score   support
    
               0       0.58      0.30      0.40        60
               1       0.93      0.98      0.95       570
    
        accuracy                           0.91       630
       macro avg       0.76      0.64      0.67       630
    weighted avg       0.90      0.91      0.90       630

Our model captures positive evaluations very well (recall of 0.98); however, the negative ones are detected quite poorly (recall of 0.30), because our dataset is not well balanced.

6 – Logistic regression classifier
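The per-class figures in the report follow directly from the confusion matrix: recall for a class is the fraction of its true members that were predicted correctly, and precision is the fraction of predictions for that class that were right. A sketch with made-up counts (not the actual matrix from this run) showing how high accuracy and poor recall on the minority class can coexist:

```python
# Rows = true class, columns = predicted class, in the order [0, 1].
# Counts are invented for illustration.
toy_cm = [
    [3, 7],    # true 0: 3 caught, 7 missed
    [2, 88],   # true 1: 2 mislabeled, 88 correct
]

def recall(cm, k):
    """Correct predictions for class k divided by all true members of k."""
    return cm[k][k] / sum(cm[k])

def precision(cm, k):
    """Correct predictions for class k divided by all predictions of k."""
    predicted_k = sum(row[k] for row in cm)
    return cm[k][k] / predicted_k

total = sum(sum(row) for row in toy_cm)
accuracy = (toy_cm[0][0] + toy_cm[1][1]) / total

print(f"accuracy     = {accuracy:.2f}")
print(f"recall(0)    = {recall(toy_cm, 0):.2f}")
print(f"recall(1)    = {recall(toy_cm, 1):.2f}")
print(f"precision(0) = {precision(toy_cm, 0):.2f}")
```

Accuracy is dominated by the majority class, which is why the recall of class 0 is the number to watch here.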

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

Creating the model:

model = LogisticRegression()
model.fit(X_train, y_train)

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='auto', n_jobs=None, penalty='l2',
                       random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                       warm_start=False)

Model validation and Confusion Matrix:

y_pred = model.predict(X_test)
y_pred

    array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
           1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
           0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
           1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1])

print("Accuracy {} %".format(100*accuracy_score(y_test, y_pred)))

    Accuracy 93.33333333333333 %

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot = True)

print(classification_report(y_test, y_pred))

                  precision    recall  f1-score   support
    
               0       0.78      0.42      0.54        60
               1       0.94      0.99      0.96       570
    
        accuracy                           0.93       630
       macro avg       0.86      0.70      0.75       630
    weighted avg       0.93      0.93      0.92       630

Our model captures positive evaluations very well (recall of 0.99); however, the negative ones are still detected poorly (recall of 0.42). Nevertheless, this is better than the Naive Bayes model.
