Let's load the data and take a look at it.

import numpy as np
import pandas as pd
from IPython.display import display

# plotting modules
import seaborn as sns
import matplotlib.pyplot as plt
import self_made_visual as vs

%matplotlib inline

from time import time

# set seed for reproducibility
np.random.seed(45)

try:
    data = pd.read_csv("winequality_red.csv")
    print("Red Wine Quality dataset has {} samples with {} features each.".format(*data.shape))
    display(data.head())
except FileNotFoundError:
    print("Dataset could not be loaded. Is the dataset missing?")

Red Wine Quality dataset has 1599 samples with 12 features each.

The features above can be described as follows:

fixed acidity: most acids involved with wine are fixed or nonvolatile (they do not evaporate readily)
volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
chlorides: the amount of salt in the wine
free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale
sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol: the percent alcohol content of the wine
quality: output variable (based on sensory data, score between 0 and 10)

Assumption

We assume that wines with a quality score higher than 6 are good and that all other wines are not good.

Data Cleaning

Check for missing values in the dataset: missing values hinder proper predictions because they distort the calculation of the mean, variance, and other statistics.

# Checking for missing values
# Calculating total number of missing values in dataset

missingValues = data.isnull().sum()

display("Total number of missing values in our dataset are:", missingValues)

'Total number of missing values in our dataset are:'

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

From the result above we can see that there are no missing values.
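The per-column counts can also be collapsed into a single total. The following is a minimal sketch on a small hypothetical frame; the `toy` data is made up for illustration and is not taken from the wine dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing entries (illustration only)
toy = pd.DataFrame({
    "pH": [3.51, np.nan, 3.26],
    "alcohol": [9.4, 9.8, np.nan],
})

# Missing values per column, as in the check above
missing_per_column = toy.isnull().sum()

# Summing once more collapses the per-column counts into one total
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

On the wine data the same two calls would be applied to `data` instead of `toy`.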

Statistical distribution of data: before moving on to further analysis, let's look at some statistical aspects of the data.

# Displaying description of dataset

percent = [0.04, 0.25, 0.5, 0.75, 0.96]
display(data.describe(percentiles=percent))

The table above summarizes the distribution of the data: mean, standard deviation (std), minimum and maximum values, and percentiles. These will be helpful as we proceed further into the analysis.

Handling outliers: we won't handle outliers here because we want the model to be accurate at a fine level, not just approximately. The best wine may have very different attributes from other wines, so we cannot remove or modify outlier values in this scenario. The situation is similar to medical data, where all observations are needed to reach accurate conclusions. As a result, we may see some skewness in the data while examining correlations.
Correlation between data: before looking at correlations we will scale the features so that they all lie in a comparable range. Note that the Pearson correlation coefficient itself is not changed by min-max scaling; scaling mainly puts the features on a common footing for plotting and for the models used later.

# Import 'preprocessing' from sklearn
# Further MinMaxScaler() will be used for scaling data between [0,1]

from sklearn import preprocessing

# Setting all the column names in a variable 'features'
features = data.keys()

min_max_scaler = preprocessing.MinMaxScaler()
data_array_scaled = min_max_scaler.fit_transform(data)

# Rebuild a DataFrame from the scaled array, keeping the original column names
# (the 'join_axes' argument of pd.concat was removed in pandas 1.0)
data_scaled = pd.DataFrame(data_array_scaled, columns=features)
    
display(data_scaled.head())

Now our data is scaled to the [0, 1] range. Next we will look at the correlation among the different features.
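As a quick check of what min-max scaling does and does not change, here is a small sketch that applies the formula (x - min) / (max - min) by hand. The worked value uses the fixed acidity minimum (4.6) and maximum (15.9) from the describe() output and reproduces the 0.247788 shown in the first scaled row. The `toy` frame and the `min_max` helper are hypothetical and only there for illustration.

```python
import numpy as np
import pandas as pd

def min_max(column):
    """Scale a pandas Series to [0, 1] using (x - min) / (max - min)."""
    return (column - column.min()) / (column.max() - column.min())

# Worked value: fixed acidity 7.4 with min 4.6 and max 15.9
scaled_value = (7.4 - 4.6) / (15.9 - 4.6)
print(round(scaled_value, 6))

# Hypothetical toy data with two related columns (illustration only)
rng = np.random.default_rng(45)
toy = pd.DataFrame({"a": rng.normal(8, 2, 200)})
toy["b"] = 0.5 * toy["a"] + rng.normal(0, 1, 200)

toy_scaled = toy.apply(min_max)

# Pearson correlation before and after scaling
corr_raw = toy.corr().loc["a", "b"]
corr_scaled = toy_scaled.corr().loc["a", "b"]
print(corr_raw, corr_scaled)
```

The two printed correlations agree up to floating-point error, because min-max scaling is a linear transformation with a positive slope and Pearson correlation is invariant under such transformations.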

Visualizing Correlation using correlation graphs.

# Produce a scatter matrix for each pair of features in 'data'
pd.plotting.scatter_matrix(data, alpha = 0.3, figsize = (50,30), diagonal = 'kde');

From the graphs above we can see the relations between different features and their distributions, such as whether they are normally distributed or skewed. Note that because quality takes only a few integer values, the plots of each feature against the output appear as bar-like strips.

Fixed acidity has a positive correlation with citric acid and density, which suggests that if wine quality improves with increasing fixed acidity, citric acid and density are likely to show a similar effect; it has a negative correlation with volatile acidity and pH (a higher pH value corresponds to a more basic, that is less acidic, wine).
Volatile acidity has a negative correlation with fixed acidity and citric acid.

Similarly, we can look at the relations among the other features. Having insight into the correlations between features is very useful, as it helps us gauge the effect of different elements at an early stage.
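The relations read off the scatter matrix can also be checked numerically. The sketch below uses a synthetic stand-in for the wine data (the `toy` frame and its coefficients are assumptions made up for illustration) and ranks features by the strength of their Pearson correlation with the output using `DataFrame.corr()`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine data (illustration only)
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.53, 0.18, n)
noise = rng.normal(0, 0.5, n)
toy = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "quality": 0.5 * alcohol - 2.0 * volatile_acidity + noise,
})

# Pairwise Pearson correlations between all columns
corr = toy.corr()

# Correlation of each feature with the output, strongest first
with_quality = corr["quality"].drop("quality")
order = with_quality.abs().sort_values(ascending=False).index
ranking = with_quality.reindex(order)
print(ranking)
```

With the real dataset, `data.corr()` would replace `toy.corr()`; the sign of each entry tells whether the feature moves with or against quality.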

Let’s take a closer look at the relation between each feature and our output using bar graphs.

For Fixed Acidity

# Create an empty figure, then draw a bar graph using Seaborn
fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'fixed acidity', data = data)

From the graph above for fixed acidity vs. quality, we can note that wine quality tends to improve as acidity increases, but this alone does not clearly explain the result: when fixed acidity is above 8 we see two quality values (3 and 7), so the feature is not of much use independently.
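By default, each bar drawn by `sns.barplot` is the mean of the y variable within one x category, with an error bar showing a confidence interval. The same numbers can be inspected directly with a group-by; the sketch below uses a tiny hypothetical frame rather than the full dataset.

```python
import pandas as pd

# Tiny hypothetical sample (illustration only)
toy = pd.DataFrame({
    "quality": [5, 5, 6, 6, 7, 7],
    "fixed acidity": [7.4, 7.8, 11.2, 7.9, 7.3, 8.9],
})

# Mean fixed acidity per quality score: the height of each bar
mean_by_quality = toy.groupby("quality")["fixed acidity"].mean()
print(mean_by_quality)
```

Applying the same group-by to `data` gives the bar heights for any of the features plotted in this section.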

For Volatile Acidity

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'volatile acidity', data = data)

From this distribution we can say that wine quality improves as volatile acidity decreases, which is consistent with the negative correlation between fixed acidity and volatile acidity.

For Citric Acid

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'citric acid', data = data)

It clearly shows that the best wine quality is reached once citric acid rises to a value of about 0.4.

For Residual Sugar

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'residual sugar', data = data)

Residual sugar does not clearly explain wine quality.

For Chlorides

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'chlorides', data = data)

From the graph above we can clearly see that a decrease in chlorides corresponds to better wine quality.

For Free Sulfur Dioxide

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'free sulfur dioxide', data = data)

Free sulfur dioxide alone does not clearly explain wine quality.

For Total Sulfur Dioxide

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'total sulfur dioxide', data = data)

Similarly, total sulfur dioxide does not explain wine quality.

For Density

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'density', data = data)

The density of all wines is nearly the same, so it cannot be used to discriminate wine quality.

For pH

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'pH', data = data)

From the graph above we can see that wine quality is at its best once the pH decreases to around 3.3.

For Sulphates

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'sulphates', data = data)

From the graph above it is clear that wine quality improves as the value of sulphates increases.

For Alcohol

fig = plt.figure(figsize = (15,8))
sns.barplot(x = 'quality', y = 'alcohol', data = data)

Wine quality improves as alcohol content increases.

Interpretation

From the graphs above we concluded that density has no relation to wine quality, and that the features which on their own are associated with better wine quality are citric acid, chlorides, pH and sulphates; the rest are dependent variables.
So, for further analysis we will drop the density feature from our data, as it would only add overhead to our predictive model.

data.drop(['density'], axis = 1, inplace = True)
display(data.head())

Predictive Modeling

Now we will create a machine learning model that predicts which wines are good, based on all the given features.

Preparing dataset for Machine learning

We divide our data into good and bad wines, that is, into two buckets that form the basis of our final result.

# Dividing Quality into two labels -- Bad & Good

bucket = (2, 6.1, 8)
bucket_label = ['bad', 'good']
data['quality'] = pd.cut(data['quality'], bins = bucket, labels = bucket_label)
display(data.head())

With the code above, quality values in the interval (2, 6.1] are labeled ‘bad’ and values in (6.1, 8] are labeled ‘good’; the threshold can be changed depending on our requirements.
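To see how the bin edges translate into labels, here is a minimal sketch that runs the same `pd.cut` call on every quality score that occurs in the dataset (3 through 8, per the min and max in the describe() output). The `scores` Series is constructed by hand for illustration.

```python
import pandas as pd

# Every integer quality score present in the data (min 3, max 8)
scores = pd.Series([3, 4, 5, 6, 7, 8])

# Bins are right-inclusive by default: (2, 6.1] -> 'bad', (6.1, 8] -> 'good'
labels = pd.cut(scores, bins=(2, 6.1, 8), labels=["bad", "good"])

print(pd.DataFrame({"quality": scores, "label": labels}))
```

Scores 3 to 6 fall into the first bin and 7 and 8 into the second, which matches the assumption that only wines scoring above 6 count as good.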

Now, we will replace labels ‘bad’ and ‘good’ with 0 and 1 respectively.

# Convert to plain integers so the labels are not kept as a categorical column
data['quality'] = data['quality'].map({'bad':0,'good':1}).astype(int)
display(data.head(10))

Counting the number of wine samples in each class

# calculating count
display("Total number of different Wine samples: ",data['quality'].value_counts())

# representing counts using bar graph
sns.countplot(x = 'quality', data = data)

'Total number of different Wine samples: '

0    1382
1     217
Name: quality, dtype: int64
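The counts above show that the two classes are far from balanced. As a rough point of reference, the sketch below computes the accuracy that a trivial rule of always predicting the majority class would reach on the full dataset. This is only arithmetic on the printed counts; the class mix of the later test split is not known from them, so treat the figure as an approximate baseline.

```python
# Class counts taken from the value_counts() output above
bad_count, good_count = 1382, 217
total = bad_count + good_count

# Accuracy of always predicting the majority class ('bad')
majority_baseline = max(bad_count, good_count) / total

# Share of 'good' wines in the data
good_share = good_count / total

print("Majority-class baseline accuracy: {:.4f}".format(majority_baseline))
print("Share of good wines: {:.4f}".format(good_share))
```

Because accuracy alone is easy to inflate on imbalanced data, a precision-oriented measure such as the F-score is a useful companion metric.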

Implementation – Creating a Training and Predicting Pipeline

To properly evaluate the performance of each model we have chosen, it’s important that we create a training and predicting pipeline that allows us to quickly and effectively train models using various sizes of training data and perform predictions on the testing data.

from sklearn.metrics import fbeta_score, accuracy_score 

def train_predict(learner, sample_size, X_train, y_train, X_test, y_test):
    """Fit `learner` on the first `sample_size` training rows and return
    a dict of timings, accuracy and F0.5 scores on train and test data."""

    results = {}

    # Fit the learner on the first 'sample_size' training samples
    start = time() # Get start time
    learner = learner.fit(X_train[:sample_size],y_train[:sample_size])
    end = time() # Get end time
    
    #Calculate the training time
    results['train_time'] = end - start
        
    # Get the predictions on the test set(X_test),
    # then get predictions on the first 300 training samples(X_train) using .predict()
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    end = time() # Get end time
    
    # Calculate the total prediction time
    results['pred_time'] = end - start
            
    # Compute accuracy on the first 300 training samples which is y_train[:300]
    results['acc_train'] = accuracy_score(y_train[:300], predictions_train[:300])
        
    # Compute accuracy on test set using accuracy_score()
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    
    # Compute F-score on the first 300 training samples using fbeta_score()
    results['f_train'] = fbeta_score(y_train[:300], predictions_train, beta=0.5)
        
    # Compute F-score on the test set which is y_test
    results['f_test'] = fbeta_score(y_test, predictions_test, beta=0.5)
       
    # Success
    print("{} trained on {} samples.".format(learner.__class__.__name__, sample_size))
        
    # Return the results
    return results
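The pipeline scores models with `fbeta_score(..., beta=0.5)`, which weights precision more heavily than recall. The following sketch spells out the formula F_beta = (1 + beta^2) * P * R / (beta^2 * P + R) in plain Python; the helper name `f_beta` and the example counts are hypothetical. Returning 0.0 when nothing is predicted positive mirrors what scikit-learn reports in its "F-score is ill-defined" warning.

```python
def f_beta(tp, fp, fn, beta=0.5):
    """F-beta score from true positives, false positives and false negatives."""
    predicted_positive = tp + fp
    actual_positive = tp + fn
    # With no positive predictions (or no positives at all) precision or
    # recall is undefined; report 0.0 in that case
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0
    precision = tp / predicted_positive
    recall = tp / actual_positive
    if precision + recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts: 2 true positives, 1 false positive, 2 false negatives
print(f_beta(tp=2, fp=1, fn=2))

# A model that never predicts the positive class
print(f_beta(tp=0, fp=0, fn=5))
```

With beta = 0.5 the score sits closer to precision (2/3 in the first example) than to recall (1/2), which suits a setting where wrongly calling a wine good is the costlier mistake.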

Initial Model Evaluation

We will use Support Vector Machine (SVM) and AdaBoost for prediction, then based on this initial evaluation we will refine our chosen model.

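# Import train_test_split (it lives in sklearn.model_selection; the old
# sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split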

X = data.drop('quality', axis = 1)
y = data['quality']

# Split the features and 'quality' labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))

# Import two supervised learning models from sklearn
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

# Initialize the two models
clf_A = AdaBoostClassifier(random_state=40)
clf_B = SVC(random_state=35)

# Calculate the number of samples for 1%, 10%, and 100% of the training data
samples_100 = X_train.shape[0]
samples_10 = X_train.shape[0]//10
samples_1 = X_train.shape[0]//100

# Collect results on the learners
results = {}
for clf in [clf_A, clf_B]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    for i, samples in enumerate([samples_1, samples_10, samples_100]):
        results[clf_name][i] = \
        train_predict(clf, samples, X_train, y_train, X_test, y_test)

# Run metrics visualization for the two supervised learning models chosen
vs.visualize(results)

Training set has 1279 samples.
Testing set has 320 samples.
AdaBoostClassifier trained on 12 samples.
AdaBoostClassifier trained on 127 samples.
AdaBoostClassifier trained on 1279 samples.
SVC trained on 12 samples.
SVC trained on 127 samples.
SVC trained on 1279 samples.

C:\Users\nitgoyal\AppData\Local\Continuum\anaconda3\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.
  'precision', 'predicted', average, warn_for)

From the graphs above we can see that the Support Vector Machine has higher accuracy than AdaBoost, hence we will use SVM for our further analysis.

Fine Tuning Model

We will now tune our model by finding the best values for its hyperparameters, using Grid Search from the scikit-learn library.
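Conceptually, a grid search tries every combination of the candidate parameter values, scores each one, and keeps the best. The sketch below shows that loop in plain Python. It is not how `GridSearchCV` is implemented, and `toy_score` is a made-up stand-in for the cross-validated score that scikit-learn would compute for each candidate.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

def toy_score(params):
    # Hypothetical score surface that peaks at gamma = 1.0 (illustration only)
    return -(params["gamma"] - 1.0) ** 2

param_grid = {
    "kernel": ["rbf"],
    "gamma": [0.5, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 10, 100, 200],
}

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params, best_score)
```

The cost grows with the product of the list lengths, so adding a second tuned parameter with ten values would mean one hundred candidates instead of ten.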

# Import 'GridSearchCV', 'make_scorer'
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

# Initialize the classifier
clf = SVC(random_state=35)

# Create the parameters list you wish to tune, using a dictionary if needed.
parameters = {
    'kernel':['rbf'],
    'gamma' :[0.5,0.8,0.9,1,1.1,1.2,1.3, 10, 100, 200]
}

# Make an fbeta_score scoring object using make_scorer()
scorer = make_scorer(fbeta_score,beta=0.5)

# Perform grid search on the classifier using 'scorer' as the scoring method using GridSearchCV()
grid_obj = GridSearchCV(clf,param_grid=parameters,scoring=scorer)

# Fit the grid search object to the training data and find the optimal parameters using fit()
grid_fit = grid_obj.fit(X_train,y_train)

# Get the estimator
best_clf = grid_fit.best_estimator_

# Make predictions using the unoptimized and optimized models
predictions = (clf.fit(X_train, y_train)).predict(X_test)
best_predictions = best_clf.predict(X_test)

# Report the before-and-after scores
print("Unoptimized model\n------")
print("Accuracy score on testing data: {:.4f}".format(accuracy_score(y_test, predictions)))
print("F-score on testing data: {:.4f}".format(fbeta_score(y_test, predictions, beta = 0.5)))
print("\nOptimized Model\n------")
print("Final accuracy score on the testing data: {:.4f}".format(accuracy_score(y_test, best_predictions)))
print("Final F-score on the testing data: {:.4f}".format(fbeta_score(y_test, best_predictions, beta = 0.5)))

Unoptimized model
------
Accuracy score on testing data: 0.9219
F-score on testing data: 0.5556

Optimized Model
------
Final accuracy score on the testing data: 0.9313
Final F-score on the testing data: 0.6452

# Parameter value

print("gamma = {}".format(best_clf.get_params()['gamma']))

gamma = 1.2
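For the RBF kernel, gamma controls how quickly the similarity between two points falls off with distance: K(x, y) = exp(-gamma * ||x - y||^2). The sketch below evaluates this for a few of the gamma values from the grid on two hypothetical points; the helper `rbf_kernel` is written here for illustration and is not a scikit-learn function.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

# Two hypothetical scaled feature vectors (illustration only)
x = [0.2, 0.4]
y = [0.5, 0.8]

for gamma in (0.5, 1.2, 10, 100):
    print("gamma = {:>5}: K(x, y) = {:.6f}".format(gamma, rbf_kernel(x, y, gamma)))
```

Small gamma values make distant points still look similar, giving smoother decision boundaries, while very large values make each training point influence only its immediate neighborhood, which tends to overfit.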

Conclusion

After tuning, the model reaches 93.13% accuracy on the test set; that is, given data for some other wine, we can expect to predict whether it is good or bad with about 93.13% accuracy, which is a pretty good number.

Output of display(data.head()) after loading the data:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	0.9968	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	0.9970	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	0.9980	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	0.9978	3.51	0.56	9.4	5

Output of display(data.describe(percentiles=percent)):

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
count	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000	1599.000000
mean	8.319637	0.527821	0.270976	2.538806	0.087467	15.874922	46.467792	0.996747	3.311113	0.658149	10.422983	5.636023
std	1.741096	0.179060	0.194801	1.409928	0.047065	10.460157	32.895324	0.001887	0.154386	0.169507	1.065668	0.807569
min	4.600000	0.120000	0.000000	0.900000	0.012000	1.000000	6.000000	0.990070	2.740000	0.330000	8.400000	3.000000
4%	6.000000	0.260000	0.000000	1.500000	0.052000	4.000000	11.000000	0.993460	3.049200	0.460000	9.200000	5.000000
25%	7.100000	0.390000	0.090000	1.900000	0.070000	7.000000	22.000000	0.995600	3.210000	0.550000	9.500000	5.000000
50%	7.900000	0.520000	0.260000	2.200000	0.079000	14.000000	38.000000	0.996750	3.310000	0.620000	10.200000	6.000000
75%	9.200000	0.640000	0.420000	2.600000	0.090000	21.000000	62.000000	0.997835	3.400000	0.730000	11.100000	6.000000
96%	12.000000	0.870000	0.640000	5.600000	0.153320	37.000000	119.080000	1.000122	3.580000	0.970800	12.608000	7.000000
max	15.900000	1.580000	1.000000	15.500000	0.611000	72.000000	289.000000	1.003690	4.010000	2.000000	14.900000	8.000000

Output of display(data_scaled.head()) after min-max scaling:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	density	pH	sulphates	alcohol	quality
0	0.247788	0.397260	0.00	0.068493	0.106845	0.140845	0.098940	0.567548	0.606299	0.137725	0.153846	0.4
1	0.283186	0.520548	0.00	0.116438	0.143573	0.338028	0.215548	0.494126	0.362205	0.209581	0.215385	0.4
2	0.283186	0.438356	0.04	0.095890	0.133556	0.197183	0.169611	0.508811	0.409449	0.191617	0.215385	0.4
3	0.584071	0.109589	0.56	0.068493	0.105175	0.225352	0.190813	0.582232	0.330709	0.149701	0.215385	0.6
4	0.247788	0.397260	0.00	0.068493	0.106845	0.140845	0.098940	0.567548	0.606299	0.137725	0.153846	0.4

Output of display(data.head()) after dropping density:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	5
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	5
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	5
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	6
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	5

Output of display(data.head()) after bucketing quality into bad and good:

	fixed acidity	volatile acidity	citric acid	residual sugar	chlorides	free sulfur dioxide	total sulfur dioxide	pH	sulphates	alcohol	quality
0	7.4	0.70	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	bad
1	7.8	0.88	0.00	2.6	0.098	25.0	67.0	3.20	0.68	9.8	bad
2	7.8	0.76	0.04	2.3	0.092	15.0	54.0	3.26	0.65	9.8	bad
3	11.2	0.28	0.56	1.9	0.075	17.0	60.0	3.16	0.58	9.8	bad
4	7.4	0.70	0.00	1.9	0.076	11.0	34.0	3.51	0.56	9.4	bad

Prediction of Quality of Wine
