Data Mining to Model Wine Preferences

Final Project

Abstract: To sustain its immense growth over the last decade, the wine industry has started investing in new technologies that support superior winemaking and more efficient selling processes. This paper proposes how data mining techniques can be used to predict human wine preferences. The wine dataset addressed in this paper was collected between May 2004 and February 2007 and is available at: http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/. It should be noted that considerable research has already been done on this dataset, and the support vector machine (SVM) algorithm is known to yield the best results [1]. This paper addresses the dataset with supervised ML algorithms (multivariate regression and decision trees) to generate patterns of interest.

Introduction: The wine industry is investing in technologies that can improve both the winemaking and the selling process.

Quality assessment, part of the wine certification process, can be used to improve winemaking by identifying the most relevant factors in wine production. Wine certification is assessed by physicochemical and sensory tests [2]. Physicochemical laboratory tests may include determination of sugar, alcohol or pH values, while sensory tests are carried out by human experts. This paper addresses the wine dataset with a set of supervised ML algorithms to detect patterns of interest. These can be used to support the oenologist's wine evaluations, potentially improving the quality and speed of their decisions.

Dataset: The wine dataset is the red variant of the popular “Vinho Verde” wine. The inputs include objective tests (e.g. pH values) and the output is based on sensory data (the median of 3 compulsory evaluations made by wine experts). Wine quality is graded between 0 (very bad) and 10 (very excellent). Only the physicochemical (input) and sensory (output) variables have been made available, owing to privacy and logistic concerns; e.g. there is no data regarding grape type, maturation time, etc.
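For illustration, the file behind the UCI link above is a semicolon-separated CSV. A minimal parsing sketch, using an inline one-row sample in the same layout (the sample values come from the public dataset; the `load_wine` helper is ours, not part of the dataset distribution):

```python
import csv
import io

# One data row in the same semicolon-separated layout as
# winequality-red.csv from the UCI repository (the full file has
# 1599 such rows).
SAMPLE = """\
"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
"""

def load_wine(text):
    """Parse the semicolon-separated wine file into a list of dicts
    with float feature values and an int quality score."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter=";"):
        row = {k: float(v) for k, v in rec.items()}
        row["quality"] = int(row["quality"])
        rows.append(row)
    return rows

wines = load_wine(SAMPLE)
print(len(wines), wines[0]["quality"])
```

In practice the full winequality-red.csv contents would be passed to the same parser.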

Number of red wine instances: 1599
Number of attributes: 11 + output attribute
Missing attribute values: none

Input variables (based on physicochemical tests):
1 – fixed acidity
2 – volatile acidity
3 – citric acid
4 – residual sugar
5 – chlorides
6 – free sulfur dioxide
7 – total sulfur dioxide
8 – density
9 – pH
10 – sulphates
11 – alcohol

Output variable (based on sensory data):
12 – quality (score between 0 and 10)

Using Regression: A regression model estimates the value of a continuous target (y) as a function (F) of one or more predictors (x1, x2, …, xn), a set of parameters (β1, β2, …, βn), and a measure of error (e):

y = F(x, β) + e

The regression parameters are also referred to as coefficients. Training a regression model involves finding the set of parameter values that minimizes a measure of the error, for example the sum of squared errors.

Creating the regression model with WEKA. Preprocessing: All the given attributes are numeric, which fits the regression model. There seem to be a few irrelevant or redundant attributes present, but we continue without any pre-processing.
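As a minimal illustration of the equation above, a one-predictor least-squares fit chooses the coefficients β1 (slope) and β0 (intercept) that minimize the sum of squared errors. The data below is made up for illustration, not taken from the wine dataset:

```python
# Closed-form simple linear regression: y = b1*x + b0 + e,
# with b1 and b0 minimising the sum of squared errors.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope = covariance(x, y) / variance(x); intercept keeps the fit
# through the point (mean_x, mean_y).
b1_num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
b1_den = sum((x - mean_x) ** 2 for x in xs)
b1 = b1_num / b1_den
b0 = mean_y - b1 * mean_x

sse = sum((y - (b1 * x + b0)) ** 2 for x, y in zip(xs, ys))
print(round(b1, 2), round(b0, 2))
```

Multivariate regression, as used by WEKA below, generalizes this to several predictors with one coefficient per attribute.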

Looking at the dataset, it can be inferred that it has an uneven class distribution (i.e. many more medium-quality wines than good or bad ones).

Building the Regression Model: We select the Linear Regression leaf and choose "Use training set" as the test option. Quality is chosen as the output variable. Fig-1 below shows the built linear regression model:

quality =
  -1.0128 * volatile acidity +
  -2.0178 * chlorides +
  0.0051 * free sulfur dioxide +
  -0.0035 * total sulfur dioxide +
  -0.827 * pH +
  0.8827 * sulphates +
  0.2893 * alcohol +
  4.301

Mean absolute error: 0.5012
Root mean squared error: 0.6461
Relative absolute error: 73.3703 %
Root relative squared error: 80.0331 %
Total number of instances: 1599

WEKA keeps the columns that contribute (statistically) to the accuracy of the built model (usually measured in R-squared) and removes the other, irrelevant columns. Using different test options (cross-validation and percentage split) gave the same regression model with slight increases in RMSE.

Analysis: A random instance is chosen from the instance set.
Input: 7, 0.76, 11, 34, 3.51, 0.56, 9.4
Output (quality): 5

By feeding the input values to the regression model we get a predicted quality of 5.2; the model predicts the class with high precision for this instance. Feeding a few other instances into the model gives the results summarized below in Fig-1:

Quality   Predicted Quality
7         5.3
3         5.11
–         5.21
8         6.23
–         6.52
5         5.24
Fig-1

The unbalanced class distribution seems to have an impact on performance. We intuitively categorize the numeric quality values as follows:
Quality 3, 4 – Bad
Quality 5 – Ok
Quality 6 – Good
Quality 7, 8 – Superior
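The coarse categorization above can be expressed as a small helper; the bin boundaries are the ones just listed, while the `quality_label` function itself is ours for illustration:

```python
def quality_label(score):
    """Map a numeric quality score (3-8 in the red-wine file) to the
    coarse labels used in the analysis."""
    if score <= 4:
        return "Bad"
    if score == 5:
        return "Ok"
    if score == 6:
        return "Good"
    return "Superior"

# A regression prediction can be rounded before labelling,
# e.g. the 5.2 predicted above becomes 5, i.e. "Ok".
print(quality_label(round(5.2)))
```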

The above model works reasonably well for the "Ok" and "Good" quality wines. One possible reason for this is the heavy concentration of instances in these categories [1319/1599]. The "Superior" quality wines [217/1599] are predicted as "Good", and most of the "Bad" quality wines [63/1599] are predicted as "Good" or "Ok". Besides the uneven class distribution, the dataset may contain redundant or irrelevant attributes (for example, we soon find residual sugar to be an irrelevant attribute). The following observations can be made from the model:
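The imbalance can be made concrete by aggregating per-score counts into the coarse categories. The per-score totals below are an assumption taken from the published description of the UCI red-wine file, not recomputed here; they are consistent with the 1319/1599 and 63/1599 figures above:

```python
from collections import Counter

# Per-score instance counts for winequality-red.csv (assumed from
# the public dataset description; sums to 1599).
score_counts = {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18}

coarse = Counter()
for score, n in score_counts.items():
    if score <= 4:
        coarse["Bad"] += n
    elif score == 5:
        coarse["Ok"] += n
    elif score == 6:
        coarse["Good"] += n
    else:
        coarse["Superior"] += n

print(dict(coarse))
```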

Increase in volatile acidity (VA) degrades the wine quality: A balanced amount of VA is, in fact, necessary for aroma and flavor, but just as a fever indicates a problem in man, excess volatile acidity in wine signals trouble. VA can be caused by several acids, though its primary source is acetic acid. At higher levels, VA can give wine a sharp, vinegary tactile sensation signifying a seriously flawed wine [3].

Increase in pH degrades the wine quality: pH values higher than 4.0 are generally avoided, as spoilage is more likely to occur above this level.

A low-pH wine provides a fresher taste, increases ageing potential and shifts the color equilibrium toward more red pigments [4].

Fixed acidity and citric acid content do not matter: these attributes were dropped from the model. On the contrary, the fixed acidity content does have a significant impact on wine quality.

Using Decision Trees: Pre-processing: To build a decision-tree model we discretize all the numeric attributes using the Filters options, with 4 bins for each numeric attribute. We leave the class attribute (quality) as type numeric. Using the default parameters for J48, we get the model shown in Fig-2:

Correctly Classified Instances: 1049 (65.6035 %)
Incorrectly Classified Instances: 550 (34.3965 %)
Kappa statistic: 0.4291
Mean absolute error: 0.1589
Root mean squared error: 0.2819
Relative absolute error: 74.0911 %
Root relative squared error: 86.1237 %

[Fig-2 also reports the detailed accuracy by class (TP rate, FP rate, precision, recall, F-measure and ROC area per quality score; weighted-average TP rate 0.656) and the confusion matrix.]

We then re-bin the quality scores into three labels (Bad, Ok and Good, with quality 6, 7, 8 as Good). By choosing a seed size of 50 and the minimum number of objects per leaf as 10, we get the model shown below in Fig-3.
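The 4-bin discretization step can be sketched as equal-width binning, which is the default mode of WEKA's unsupervised Discretize filter (the filter also supports equal-frequency binning; the `discretize` helper and the alcohol values below are illustrative, not taken from the WEKA run):

```python
def discretize(values, bins=4):
    """Equal-width discretisation of a numeric column into `bins`
    integer bin indices (0 .. bins-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against a constant column
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]

alcohol = [9.4, 9.8, 10.5, 11.2, 12.0, 14.0]
print(discretize(alcohol))
```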

This model has better accuracy than the previous one, as the class labels have been reduced to just three (Good, Bad and Ok). The recall and precision values are quite low for predicting the class label "Bad".

Correctly Classified Instances: 1160 (72.5453 %)
Incorrectly Classified Instances: 439 (27.4547 %)
Kappa statistic: 0.4637
Mean absolute error: 0.264
Root mean squared error: 0.3633
Relative absolute error: 74.53 %
Root relative squared error: 86.3514 %
Total number of instances: 1599

[Fig-3 also reports the detailed accuracy by class (TP rate, FP rate, precision, recall, F-measure and ROC area for Good, Ok and Bad; weighted-average TP rate 0.725).]
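The weak "Bad"-class scores follow directly from the precision and recall definitions: with very few true "Bad" instances, a handful of misclassifications dominates both ratios. A sketch with hypothetical counts (illustrative numbers, not taken from Fig-3):

```python
def precision_recall(tp, fp, fn):
    """Per-class precision and recall from true positives, false
    positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical minority-class outcome: 2 correctly labelled "Bad",
# 3 other wines wrongly labelled "Bad", 61 "Bad" wines missed.
p, r = precision_recall(tp=2, fp=3, fn=61)
print(round(p, 3), round(r, 3))
```

Rebalancing the classes (or cost-sensitive training) is the usual remedy when a minority class matters.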
