Categories
Data Science

Pairplots In Python

Pairplots in Python

In this notebook we will explore making pairplots in Python using the seaborn visualization library. We'll start with the default sns.pairplot and then look at customizing our plots using sns.PairGrids.

In [ ]:
# Pandas and numpy for data manipulation
import pandas as pd
import numpy as np
In [ ]:
# matplotlib for plotting
import matplotlib.pyplot as plt
import matplotlib

# Set text size
matplotlib.rcParams['font.size'] = 18

# Seaborn for pairplots
import seaborn as sns

sns.set_context('talk', font_scale=1.2);

Gapminder Socioeconomic Data

We will be using GapMinder socioeconomic data that is available in the R package gapminder. The data has been saved to a csv file which we will read into a dataframe. There are six columns in the data:

  1. Country
  2. Continent: useful for grouping data
  3. Year: data coveres 1952-2007
  4. life_exp: the life expectancy at birth
  5. pop: population
  6. gdp_per_cap: the per capita (per person) GDP in international dollars
In [ ]:
df = pd.read_csv('../input/gapminder-data/gapminder_data.csv')
df.columns = ['country', 'continent', 'year', 'life_exp', 'pop', 'gdp_per_cap']
df.head()

We can quickly find summary stats for the data using the describe method of a dataframe.

In [ ]:
df.describe()

Default Pair Plot with All Data

Let's use the entire dataset and sns.pairplot to create a simple, yet useful plot.

In [ ]:
sns.pairplot(df);

The default pairplot shows scatter plots between variables on the upper and lower triangle and histograms along the diagonal. Already, we can see some trends such as a positive correlation between gdp_per_cap and life_exp and year and life_exp which suggests that people in richer countries live longer and that in general, people have been living longer as time increases. We can't say what causes theses trends, only that there is a correlation.

We can also see that the distribution of pop and gdp_per_cap is heavily skewed to the right. To better represent the data, we can take the log transform of those columns.

In [ ]:
df['log_pop'] = np.log10(df['pop'])
df['log_gdp_per_cap'] = np.log10(df['gdp_per_cap'])

df = df.drop(columns = ['pop', 'gdp_per_cap'])

Group and Color by a Variable

In order to better understand the data, we can color the pairplot using a categorical variable and the hue keyword. First, we will color the plots by the continent.

In [ ]:
matplotlib.rcParams['font.size'] = 40
sns.pairplot(df, hue = 'continent');

I don't find stacked histograms (on the diagonal) to be very useful, and there are some issues with overlapping data points (known as overplotting). We can fix these by adding in a few customizations to the pairplot call.

Customizing pairplot

First, let's change the diagonal from a histogram to a kde which can better show the differences between continents. We can also adjust the alpha (intensity) of the scatter plots to better show all the data and change the size of the markers on the scatter plot. Finally, I increase the size of all the plots to better show the data.

In [ ]:
sns.pairplot(df, hue = 'continent', diag_kind = 'kde', plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, size = 4);

That makes some of the trends more clear. We can see that Oceania and Europe tend to have the highest life expectancy and highest GDP with Asian countries tending to have the greatest population. The density plots on the diagonal are better for when we have data in multiple categories to make comparisons. We can color the plot by any variable we like. For example, here is a plot colored by a decade categorical variable we create from the year column.

In [ ]:
df['decade'] = pd.cut(df['year'], bins = range(1950, 2010, 10))
df.head()
In [ ]:
sns.pairplot(df, hue = 'decade', diag_kind = 'kde', vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'],
             plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, size = 4);

In this case, we can know see that life expectancy has increased over the decades as has population. Retaining the year variable might not make much sense when we are already coloring by the decade.

There is still quite a lot of noise on the scatter plots, mostly because we are plotting many years at once. Let's limit ourselves to the most recent year in the data. Notice how we must now use the vars keyword to specify the variables we want to plot. It does not make sense to plot the year variable since it no longer varies. We will limit the plot to the three remaining numerical variables.

In [ ]:
sns.pairplot(df[df['year'] >= 2000], vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'], 
             hue = 'continent', diag_kind = 'kde', plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'}, size = 4);
plt.suptitle('Pair Plot of Socioeconomic Data for 2000-2007', size = 28);

More Customization with sns.PairGrid

When the options offered by pairplot are not enough, we can move on to more powerful PairGrid. This allows us to define our own functions to map to the lower and upper triangles and the diagonal. For example, we might want a plot that instead of showing two instaces of the scatter plots, shows the Pearson Correlation coefficient (a measure of a linear trend) on one of the triangles. To do this, we can just write a function to calculate the statistic and then map it to the appropriate part of the plot.

First, we will show the basic usage of sns.PairGrid. Here, we map a scatter plot to the upper triangle, a density plot to the diagonal, and a 2D density plot to the lower triangle. PairGrid is a class and not a function, which means that we need to create an instance and then use methods of that instance to build a plot. Then, after we have added all the methods to the instance, we can show the resulting plot.

In [ ]:
# Create an instance of the PairGrid class.
grid = sns.PairGrid(data= df[df['year'] == 2007],
                    vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'], size = 4)

# Map different plots to different sections
grid = grid.map_upper(plt.scatter, color = 'darkred')
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')
grid = grid.map_diag(plt.hist, bins = 10, color = 'darkred', edgecolor = 'k');

Now that we see how to map different functions to the different elements, we can write out own function to put on the plot. We'll use a simple function to show the correlation coffiecients on the scatterplot. (Thanks to this Stack Overflow answer for help on how to write a custom function and map it onto the plot).

In [ ]:
# Function to calculate correlation coefficient between two arrays
def corr(x, y, **kwargs):
    
    # Calculate the value
    coef = np.corrcoef(x, y)[0][1]
    # Make the label
    label = r'$rho$ = ' + str(round(coef, 2))
    
    # Add the label to the plot
    ax = plt.gca()
    ax.annotate(label, xy = (0.2, 0.95), size = 20, xycoords = ax.transAxes)
    
# Create a pair grid instance
grid = sns.PairGrid(data= df[df['year'] == 2007],
                    vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'], size = 4)

# Map the plots to the locations
grid = grid.map_upper(plt.scatter, color = 'darkred')
grid = grid.map_upper(corr)
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')
grid = grid.map_diag(plt.hist, bins = 10, edgecolor =  'k', color = 'darkred');

We can map any function we would like to any of the areas. For example, maybe we would like to show the summary stats on the diagonal.

In [ ]:
# Define a summary function
def summary(x, **kwargs):
    # Convert to a pandas series
    x = pd.Series(x)
    
    # Get stats for the series
    label = x.describe()[['mean', 'std', 'min', '50%', 'max']]
    
    # Convert from log to regular scale
    # Adjust the column names for presentation
    if label.name == 'log_pop':
        label = 10 ** label
        label.name = 'pop stats'
    elif label.name == 'log_gdp_per_cap':
        label = 10 ** label
        label.name = 'gdp_per_cap stats'
    else:
        label.name = 'life_exp stats'
       
    # Round the labels for presentation
    label = label.round()
    ax = plt.gca()
    ax.set_axis_off()
    print(label)
    # Add the labels to the plot
    #ax.annotate(pd.DataFrame(label),xy = (0.1, 0.2), size = 20, xycoords = ax.transAxes)    
    

# Create a pair grid instance
grid = sns.PairGrid(data= df[df['year'] == 2007],
                    vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'], size = 4)

# Fill in the mappings
grid = grid.map_upper(plt.scatter, color = 'darkred')
grid = grid.map_upper(corr)
grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')
grid = grid.map_diag(summary);

We can extend this however we like in order to investigate the data. For most use cases, the sns.pairplot function will do everything we require, but if we need the extra options, we can always use the more powerful sns.PairGrid. Pair plots are a great method to get a first look at a dataset, and seaborn has extensive capabilities for producing these figures!

Categories
TensorFlow & Keras

Deep Learning With Keras

Coding the forward propagation algorithm

In [ ]:
import numpy as np





In [ ]:
weights = { 'node_0': np.array([2,4]),
            'node_1': np.array([4, -5]),
            'output': np.array([2,7])}

input_data = np.array([3,5])

# Calculate node 0 value: node_0_value
node_0_value = (weights['node_0'] * input_data).sum()

# Calculate node 1 value: node_1_value
node_1_value = (weights['node_1'] * input_data).sum()

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_value, node_1_value])

# Calculate output: output
output = (weights['output'] * hidden_layer_outputs).sum()

# Print output
print(output)

The Rectified Linear Activation Function

In [ ]:
def relu(input):
    '''Define your relu activation function here'''
    # Calculate the value for the output of the relu function: output
    output = max(input, 0)

    # Return the value just calculated
    return(output)

# Calculate node 0 value: node_0_output
node_0_input = (input_data * weights['node_0']).sum()
node_0_output = relu(node_0_input)

# Calculate node 1 value: node_1_output
node_1_input = (input_data * weights['node_1']).sum()
node_1_output = relu(node_1_input)

# Put node values into array: hidden_layer_outputs
hidden_layer_outputs = np.array([node_0_output, node_1_output])

# Calculate model output (do not apply relu)
model_output = (hidden_layer_outputs * weights['output']).sum()

# Print model output
print(model_output)
In [ ]:
relu(3)
In [ ]:
relu(-1)
In [ ]:
# Define predict_with_network()
def predict_with_network(input_data_row, weights):

    # Calculate node 0 value
    node_0_input = (input_data_row * weights['node_0']).sum()
    node_0_output = relu(node_0_input)

    # Calculate node 1 value
    node_1_input = (input_data_row * weights['node_1']).sum()
    node_1_output = relu(node_1_input)

    # Put node values into array: hidden_layer_outputs
    hidden_layer_outputs = np.array([node_0_output, node_1_output])
    
    # Calculate model output
    input_to_final_layer = (hidden_layer_outputs * weights['output']).sum()
    model_output = relu(input_to_final_layer)
    
    # Return model output
    return(model_output)
        
In [ ]:
weights = {'node_0': np.array([2, 4]), 'node_1': np.array([ 4, -5]), 'output': np.array([2, 7])}
input_data = [np.array([3, 5]), np.array([ 1, -1]), np.array([0, 0]), np.array([8, 4])]

# Create empty list to store prediction results
results = []
for input_data_row in input_data:
    # Append prediction to results
    results.append(predict_with_network(input_data_row, weights))

# Print results
print(results)

Multi-layer neural networks

In [ ]:
weights = {'node_0_0': np.array([2, 4]),
 'node_0_1': np.array([ 4, -5]),
 'node_1_0': np.array([-1,  2]),
 'node_1_1': np.array([1, 2]),
 'output': np.array([2, 7])}

input_data = np.array([3, 5])

def predict_with_network(input_data):
    # Calculate node 0 in the first hidden layer
    node_0_0_input = (input_data * weights['node_0_0']).sum()
    node_0_0_output = relu(node_0_0_input)

    # Calculate node 1 in the first hidden layer
    node_0_1_input = (input_data * weights['node_0_1']).sum()
    node_0_1_output = relu(node_0_1_input)

    # Put node values into array: hidden_0_outputs
    hidden_0_outputs = np.array([node_0_0_output, node_0_1_output])
    
    # Calculate node 0 in the second hidden layer
    node_1_0_input = (hidden_0_outputs * weights['node_1_0']).sum()
    node_1_0_output = relu(node_1_0_input)

    # Calculate node 1 in the second hidden layer
    node_1_1_input = (hidden_0_outputs * weights['node_1_1']).sum()
    node_1_1_output = relu(node_1_1_input)

    # Put node values into array: hidden_1_outputs
    hidden_1_outputs = np.array([node_1_0_output, node_1_1_output])

    # Calculate model output: model_output
    model_output = (hidden_1_outputs * weights['output']).sum()
    
    # Return model_output
    return(model_output)
In [ ]:
output = predict_with_network(input_data)
output

Q: Which layers of a model capture more complex or "higher level" interactions?

The last layers capture the most complex interactions.

Coding how weight changes affect accuracy

In [ ]:
def predict_with_network(input_data, weights):
    # Calculate node 0 in the first hidden layer
    node_0_0_input = (input_data * weights['node_0']).sum()
    node_0_0_output = relu(node_0_0_input)

    # Calculate node 1 in the first hidden layer
    node_0_1_input = (input_data * weights['node_1']).sum()
    node_0_1_output = relu(node_0_1_input)

    # Put node values into array: hidden_0_outputs
    hidden_0_outputs = np.array([node_0_0_output, node_0_1_output])

    # Calculate model output: model_output
    model_output = (hidden_0_outputs * weights['output']).sum()
    
    # Return model_output
    return(model_output)





In [ ]:
# The data point you will make a prediction for
input_data = np.array([0, 3])

# Sample weights
weights_0 = {'node_0': [2, 1],
             'node_1': [1, 2],
             'output': [1, 1]
            }

# The actual target value, used to calculate the error
target_actual = 3

# Make prediction using original weights
model_output_0 = predict_with_network(input_data, weights_0)

# Calculate error: error_0
error_0 = model_output_0 - target_actual

# Create weights that cause the network to make perfect prediction (3): weights_1
weights_1 = {'node_0': [2, 1],
             # changed node_1 weight as [1,0]
             'node_1': [1, 0],
             'output': [1,1]
            }

# Make prediction using new weights: model_output_1
model_output_1 = predict_with_network(input_data, weights_1)

# Calculate error: error_1
error_1 = model_output_1 - target_actual

# Print error_0 and error_1
print(error_0)
print(error_1)

Scaling up to multiple data points

In [ ]:
input_data = [np.array([0, 3]), np.array([1, 2]), np.array([-1, -2]), np.array([4, 0])]
target_actuals = [1, 3, 5, 7]

weights_0 ={'node_0': np.array([2, 1]), 'node_1': np.array([1, 2]), 'output': np.array([1, 1])}
weights_1 = {'node_0': np.array([2, 1]), 'node_1': np.array([1. , 1.5]), 'output': np.array([1. , 1.5])}
In [ ]:
from sklearn.metrics import mean_squared_error

# Create model_output_0 
model_output_0 = []
# Create model_output_0
model_output_1 = []

# Loop over input_data
for row in input_data:
    # Append prediction to model_output_0
    model_output_0.append(predict_with_network(row, weights_0))
    
    # Append prediction to model_output_1
    model_output_1.append(predict_with_network(row, weights_1))

# Calculate the mean squared error for model_output_0: mse_0
mse_0 = mean_squared_error(target_actuals, model_output_0)

# Calculate the mean squared error for model_output_1: mse_1
mse_1 = mean_squared_error(target_actuals, model_output_1)

# Print mse_0 and mse_1
print("Mean squared error with weights_0: %f" %mse_0)
print("Mean squared error with weights_1: %f" %mse_1)
In [ ]:
model_output_0
In [ ]:
model_output_1

Calculating slopes

When plotting the mean-squared error loss function against predictions, the slope is 2 * x * (y-xb), or 2 * input_data * error.

In [ ]:
input_data = np.array([1, 2, 3])
weights = np.array([0, 2, 1])
target = 0
In [ ]:
# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = 2 * input_data * error

# Print the slope
print(slope)

You've just calculated the slopes you need. Now it's time to use those slopes to improve your model.

Improving model weights

In [ ]:
# Set the learning rate: learning_rate
learning_rate = 0.01

# Calculate the predictions: preds
preds = (weights * input_data).sum()

# Calculate the error: error
error = preds - target

# Calculate the slope: slope
slope = 2 * input_data * error

# Update the weights: weights_updated
weights_updated = weights - ( learning_rate * slope)

# Get updated predictions: preds_updated
preds_updated = (weights_updated * input_data).sum()

# Calculate updated error: error_updated
error_updated = preds_updated - target

# Print the original error
print(error)

# Print the updated error
print(error_updated)

Making multiple updates to weights

In [ ]:
import matplotlib.pyplot as plt
In [ ]:
input_data = np.array([1, 2, 3])
weights = np.array([-0.49929916,  1.00140168, -0.49789747])
target = 0
In [ ]:
def get_slope(input_data, target, weights):
    # Calculate the predictions: preds
    preds = (weights * input_data).sum()

    # Calculate the error: error
    error = preds - target

    # Calculate the slope: slope
    slope = 2 * input_data * error
    
    return slope
In [ ]:
def get_mse(input_data, target, weights_updated):

    # Get updated predictions: preds_updated
    preds_updated = (weights_updated * input_data).sum()

    # Calculate updated error: error_updated
    error_updated = preds_updated - target
    
    return error_updated
In [ ]:
n_updates = 20
mse_hist = []

# Iterate over the number of updates
for i in range(n_updates):
    
    # Calculate the slope: slope
    slope = get_slope(input_data, target, weights)
    
    # Update the weights: weights
    weights = weights - 0.01 * slope
    
    # Calculate mse with new weights: mse
    mse = get_mse(input_data, target, weights)
    
    # Append the mse to mse_hist
    mse_hist.append(mse)

# Plot the mse history
plt.plot(mse_hist)
plt.xlabel('Iterations')
plt.ylabel('Mean Squared Error')
plt.show()

Creating a keras model

In [ ]:
# Import necessary modules
import pandas as pd
import keras
from keras.layers import Dense
from keras.models import Sequential

# Save the number of columns in predictors: n_cols
df = pd.read_csv('../input/hourly-wages/hourly_wages.csv')
#print(df.head())
predictors = df.drop(columns=['wage_per_hour'], axis=1).values
target = df['wage_per_hour'].values
n_cols = predictors.shape[1]

# Set up the model: model
model = Sequential()

# Add the first layer
model.add(Dense(50, activation='relu', input_shape=(n_cols,)))

# Add the second layer
model.add(Dense(32, activation='relu'))

# Add the output layer
model.add(Dense(1))

Compiling and fitting a model

In [ ]:
# Compile the model
model.compile(optimizer='adam', loss='mean_squared_error')

# Verify that model contains information from compiling
print("Loss function: " + model.loss)
In [ ]:
# Fit the model
model.fit(predictors, target)
In [ ]:
# Import necessary modules
import pandas as pd
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical

df = pd.read_csv('../input/titanic/titanic_all_numeric.csv')
print(df.head())
predictors = df.drop(columns=['survived'], axis=1).values
#target = df['wage_per_hour'].values
n_cols = predictors.shape[1]

# Convert the target to categorical: target
target = to_categorical(df.survived)

# Set up the model
model = Sequential()

# Add the first layer
model.add(Dense(32, activation='relu', input_shape=(n_cols,)))

# Add the output layer
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model
model.fit(predictors, target)

Making Predictions

In [ ]:
# Specify, compile, and fit the model
# Convert the target to categorical: target
#target = df['survived']
pred_data = np.array([[2,34.0,0,0,13.0,1,False,0,0,1] , [2,31.0,1,1,26.25,0,False,0,0,1] , [1,11.0,1,2,120.0,1,False,0,0,1] , [3,0.42,0,1,8.5167,1,False,1,0,0] , [3,27.0,0,0,6.975,1,False,0,0,1] , [3,31.0,0,0,7.775,1,False,0,0,1] , [1,39.0,0,0,0.0,1,False,0,0,1] , [3,18.0,0,0,7.775,0,False,0,0,1] , [2,39.0,0,0,13.0,1,False,0,0,1] , [1,33.0,1,0,53.1,0,False,0,0,1] , [3,26.0,0,0,7.8875,1,False,0,0,1] , [3,39.0,0,0,24.15,1,False,0,0,1] , [2,35.0,0,0,10.5,1,False,0,0,1] , [3,6.0,4,2,31.275,0,False,0,0,1] , [3,30.5,0,0,8.05,1,False,0,0,1] , [1,29.69911764705882,0,0,0.0,1,True,0,0,1] , [3,23.0,0,0,7.925,0,False,0,0,1] , [2,31.0,1,1,37.0042,1,False,1,0,0] , [3,43.0,0,0,6.45,1,False,0,0,1] , [3,10.0,3,2,27.9,1,False,0,0,1] , [1,52.0,1,1,93.5,0,False,0,0,1] , [3,27.0,0,0,8.6625,1,False,0,0,1] , [1,38.0,0,0,0.0,1,False,0,0,1] , [3,27.0,0,1,12.475,0,False,0,0,1] , [3,2.0,4,1,39.6875,1,False,0,0,1] , [3,29.69911764705882,0,0,6.95,1,True,0,1,0] , [3,29.69911764705882,0,0,56.4958,1,True,0,0,1] , [2,1.0,0,2,37.0042,1,False,1,0,0] , [3,29.69911764705882,0,0,7.75,1,True,0,1,0] , [1,62.0,0,0,80.0,0,False,0,0,0] , [3,15.0,1,0,14.4542,0,False,1,0,0] , [2,0.83,1,1,18.75,1,False,0,0,1] , [3,29.69911764705882,0,0,7.2292,1,True,1,0,0] , [3,23.0,0,0,7.8542,1,False,0,0,1] , [3,18.0,0,0,8.3,1,False,0,0,1] , [1,39.0,1,1,83.1583,0,False,1,0,0] , [3,21.0,0,0,8.6625,1,False,0,0,1] , [3,29.69911764705882,0,0,8.05,1,True,0,0,1] , [3,32.0,0,0,56.4958,1,False,0,0,1] , [1,29.69911764705882,0,0,29.7,1,True,1,0,0] , [3,20.0,0,0,7.925,1,False,0,0,1] , [2,16.0,0,0,10.5,1,False,0,0,1] , [1,30.0,0,0,31.0,0,False,1,0,0] , [3,34.5,0,0,6.4375,1,False,1,0,0] , [3,17.0,0,0,8.6625,1,False,0,0,1] , [3,42.0,0,0,7.55,1,False,0,0,1] , [3,29.69911764705882,8,2,69.55,1,True,0,0,1] , [3,35.0,0,0,7.8958,1,False,1,0,0] , [2,28.0,0,1,33.0,1,False,0,0,1] , [1,29.69911764705882,1,0,89.1042,0,True,1,0,0] , [3,4.0,4,2,31.275,1,False,0,0,1] , [3,74.0,0,0,7.775,1,False,0,0,1] , [3,9.0,1,1,15.2458,0,False,1,0,0] , [1,16.0,0,1,39.4,0,False,0,0,1] , [2,44.0,1,0,26.0,0,False,0,0,1] , [3,18.0,0,1,9.35,0,False,0,0,1] , [1,45.0,1,1,164.8667,0,False,0,0,1] , [1,51.0,0,0,26.55,1,False,0,0,1] , [3,24.0,0,3,19.2583,0,False,1,0,0] , [3,29.69911764705882,0,0,7.2292,1,True,1,0,0] , [3,41.0,2,0,14.1083,1,False,0,0,1] , [2,21.0,1,0,11.5,1,False,0,0,1] , [1,48.0,0,0,25.9292,0,False,0,0,1] , [3,29.69911764705882,8,2,69.55,0,True,0,0,1] , [2,24.0,0,0,13.0,1,False,0,0,1] , [2,42.0,0,0,13.0,0,False,0,0,1] , [2,27.0,1,0,13.8583,0,False,1,0,0] , [1,31.0,0,0,50.4958,1,False,0,0,1] , [3,29.69911764705882,0,0,9.5,1,True,0,0,1] , [3,4.0,1,1,11.1333,1,False,0,0,1] , [3,26.0,0,0,7.8958,1,False,0,0,1] , [1,47.0,1,1,52.5542,0,False,0,0,1] , [1,33.0,0,0,5.0,1,False,0,0,1] , [3,47.0,0,0,9.0,1,False,0,0,1] , [2,28.0,1,0,24.0,0,False,1,0,0] , [3,15.0,0,0,7.225,0,False,1,0,0] , [3,20.0,0,0,9.8458,1,False,0,0,1] , [3,19.0,0,0,7.8958,1,False,0,0,1] , [3,29.69911764705882,0,0,7.8958,1,True,0,0,1] , [1,56.0,0,1,83.1583,0,False,1,0,0] , [2,25.0,0,1,26.0,0,False,0,0,1] , [3,33.0,0,0,7.8958,1,False,0,0,1] , [3,22.0,0,0,10.5167,0,False,0,0,1] , [2,28.0,0,0,10.5,1,False,0,0,1] , [3,25.0,0,0,7.05,1,False,0,0,1] , [3,39.0,0,5,29.125,0,False,0,1,0] , [2,27.0,0,0,13.0,1,False,0,0,1] , [1,19.0,0,0,30.0,0,False,0,0,1] , [3,29.69911764705882,1,2,23.45,0,True,0,0,1] , [1,26.0,0,0,30.0,1,False,1,0,0] , [3,32.0,0,0,7.75,1,False,0,1,0]])
#print(type(pred_data))
predictors = df.drop(columns=['survived'], axis=1)
#target = df['wage_per_hour'].values
n_cols = predictors.shape[1]
# Convert the target to categorical: target
target = to_categorical(df.survived)
#print(target)

model = Sequential()
model.add(Dense(32, activation='relu', input_shape = (n_cols,)))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='sgd', 
              loss='categorical_crossentropy', 
              metrics=['accuracy'])
model.fit(predictors, target)

# Calculate predictions: predictions
predictions = model.predict(pred_data)

# Calculate predicted probability of survival: predicted_prob_true
predicted_prob_true = predictions[:,1]

# print predicted_prob_true
print(predicted_prob_true)

Changing optimization parameters

In [ ]:
def get_new_model():
    model = Sequential()
    model.add(Dense(100, activation='relu', input_shape = (n_cols,)))
    model.add(Dense(100, activation='relu'))
    model.add(Dense(2, activation='softmax'))
    return(model)
In [ ]:
# Import the SGD optimizer
from keras.optimizers import SGD

# Create list of learning rates: lr_to_test
lr_to_test = [.000001, 0.01, 1]

# Loop over learning rates
for lr in lr_to_test:
    print('nnTesting model with learning rate: %fn'%lr )
    
    # Build new model to test, unaffected by previous models
    model = get_new_model()
    
    # Create SGD optimizer with specified learning rate: my_optimizer
    my_optimizer = SGD(lr=lr)
    
    # Compile the model
    model.compile(optimizer=my_optimizer, loss='categorical_crossentropy')
    
    # Fit the model
    model.fit(predictors, target)
    

Model validation

In [ ]:
# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape = input_shape))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model
hist = model.fit(predictors, target, validation_split=.3)

Early stopping: Optimizing the optimization

In [ ]:
# Import EarlyStopping
from keras.callbacks import EarlyStopping

# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model = Sequential()
model.add(Dense(100, activation='relu', input_shape = input_shape))
model.add(Dense(100, activation='relu'))
model.add(Dense(2, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Fit the model
model.fit(predictors, target, epochs=30, validation_split=.3, callbacks=[early_stopping_monitor])

Experimenting with wider networks

In [ ]:
# Define early_stopping_monitor
early_stopping_monitor = EarlyStopping(patience=2)

# Save the number of columns in predictors: n_cols
n_cols = predictors.shape[1]
input_shape = (n_cols,)

# Specify the model
model_1 = Sequential()
model_1.add(Dense(100, activation='relu', input_shape = input_shape))
model_1.add(Dense(100, activation='relu'))
model_1.add(Dense(2, activation='softmax'))

# Compile the model
model_1.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model_1.fit(predictors, target, epochs=30, validation_split=.3, callbacks=[early_stopping_monitor])
# Create the new model: model_2
model_2 = Sequential()

# Add the first and second layers
model_2.add(Dense(100, activation='relu', input_shape=input_shape))
model_2.add(Dense(100, activation='relu', input_shape=input_shape))

# Add the output layer
model_2.add(Dense(2, activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit model_1
model_1_training = model_1.fit(predictors, target, epochs=15, validation_split=0.2, callbacks=[early_stopping_monitor], verbose=False)

# Fit model_2
model_2_training = model_2.fit(predictors, target, epochs=15, validation_split=0.2, callbacks=[early_stopping_monitor], verbose=False)

# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation score')
plt.show()

Adding layers to a network

In [ ]:
# The input shape to use in the first hidden layer
input_shape = (n_cols,)

# Create the new model: model_2
model_2 = Sequential()

# Add the first, second, and third hidden layers
model_2.add(Dense(50, activation='relu', input_shape=input_shape))
model_2.add(Dense(50, activation='relu'))
model_2.add(Dense(50, activation='relu'))

# Add the output layer
model_2.add(Dense(2, activation='softmax'))

# Compile model_2
model_2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit model 1
model_1_training = model_1.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Fit model 2
model_2_training = model_2.fit(predictors, target, epochs=20, validation_split=0.4, callbacks=[early_stopping_monitor], verbose=False)

# Create the plot
plt.plot(model_1_training.history['val_loss'], 'r', model_2_training.history['val_loss'], 'b')
plt.xlabel('Epochs')
plt.ylabel('Validation score')
plt.show()

The model with the lower loss value is the better model

Building your own digit recognition model

In [ ]:
# Import necessary modules
import pandas as pd
import keras
from keras.layers import Dense
from keras.models import Sequential
from keras.utils import to_categorical


df = pd.read_csv('../input/mnist-data/mnist.csv')
print(df.head())
X = df.drop(df.columns[0], axis=1).values
# Convert the target to categorical: target
y = to_categorical(df.iloc[:,0].values)
print(X.shape)
print(y.shape)

#n_cols = predictors.shape[1]


#target = to_categorical(df.survived)

# Create the model: model
model = Sequential()

# Add the first hidden layer
model.add(Dense(50, activation='relu', input_shape=(X[0].shape[0],)))

# Add the second hidden layer
model.add(Dense(50, activation='relu'))

# Add the output layer
model.add(Dense(10, activation='softmax'))

# Compile the model
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

# Fit the model
model.fit(X, y, validation_split=.3)

You should see better than 90% accuracy recognizing handwritten digits, even while using a small training set of only 1750 images!

Categories
TensorFlow & Keras

Advanced Deep Learning With Keras

Advanced Deep Learning with Keras in Python

The Keras Functional API

Input layers

In [ ]:
# Import Input from keras.layers
from keras.layers import Input

# Create an input layer of shape 1
input_tensor = Input(shape=(1,))

Dense layers

In [ ]:
# Load layers
from keras.layers import Input, Dense

# Input layer
input_tensor = Input(shape=(1,))

# Dense layer
output_layer = Dense(1)

# Connect the dense layer to the input_tensor
output_tensor = output_layer(input_tensor)

This network will take the input, apply a linear coefficient to it, and return the result.

Output layers

In [ ]:
# Load layers
from keras.layers import Input, Dense

# Input layer
input_tensor = Input(shape=(1,))

# Create a dense layer and connect the dense layer to the input_tensor in one step
# Note that we did this in 2 steps in the previous exercise, but are doing it in one step now
output_tensor = Dense(1)(input_tensor)

The output layer allows your model to make predictions.

Build a model

In [ ]:
# Input/dense/output layers
from keras.layers import Input, Dense
input_tensor = Input(shape=(1,))
output_tensor = Dense(1)(input_tensor)

# Build the model
from keras.models import Model
model = Model(input_tensor, output_tensor)

Compile a model

This finalizes your model, freezes all its settings, and prepares it to meet some data!

In [ ]:
# Compile the model
model.compile(optimizer='adam', loss='mean_absolute_error')

Visualize a model

In [ ]:
# Import the plotting function
from keras.utils import plot_model
import matplotlib.pyplot as plt

# Summarize the model
model.summary()

# # Plot the model
plot_model(model, to_file='model.png')

# # Display the image
data = plt.imread('model.png')
plt.imshow(data)
plt.show()

Fit the model to the tournament basketball data

In [ ]:
import pandas as pd

games_tourney_train = pd.read_csv('../input/games-tourney/games_tourney.csv')
games_tourney_train.head()
In [ ]:
games_tourney_train.shape
In [ ]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(games_tourney_train['seed_diff'], games_tourney_train['score_diff'])
In [ ]:
# Now fit the model
model.fit(X_train, y_train,
          epochs=1,
          batch_size=128,
          validation_split=.10,
          verbose=True)

Evaluate the model on a test set

In [ ]:
# Evaluate the model on the test data
model.evaluate(X_test, y_test)

Two Input Networks Using Categorical Embeddings, Shared Layers, and Merge Layers

Define team lookup

In [ ]:
games_season = pd.read_csv('../input/games-season/games_season.csv')
games_season.head()
In [ ]:
# Imports
from keras.layers import Embedding
from numpy import unique

# Count the unique number of teams
n_teams = unique(games_season['team_1']).shape[0]

# Create an embedding layer
team_lookup = Embedding(input_dim=n_teams,
                        output_dim=1,
                        input_length=1,
                        name='Team-Strength')

The embedding layer is a lot like a dictionary, but your model learns the values for each key.

Define team model

In [ ]:
# Imports
from keras.layers import Input, Embedding, Flatten
from keras.models import Model

# Create an input layer for the team ID
teamid_in = Input(shape=(1,))

# Lookup the input in the team strength embedding layer
strength_lookup = team_lookup(teamid_in)

# Flatten the output
strength_lookup_flat = Flatten()(strength_lookup)

# Combine the operations into a single, re-usable model
team_strength_model = Model(teamid_in, strength_lookup_flat, name='Team-Strength-Model')

Defining two inputs

In [ ]:
# Load the input layer from keras.layers
from keras.layers import Input

# Input layer for team 1
team_in_1 = Input((1,), name='Team-1-In')

# Separate input layer for team 2
team_in_2 = Input((1,), name='Team-2-In')

These two inputs will be used later for the shared layer.

Lookup both inputs in the same model

In [ ]:
# Lookup team 1 in the team strength model
team_1_strength = team_strength_model(team_in_1)

# Lookup team 2 in the team strength model
team_2_strength = team_strength_model(team_in_2)

Now your model knows how strong each team is.

Output layer using shared layer

In [ ]:
# Import the Subtract layer from keras
from keras.layers import Subtract

# Create a subtract layer using the inputs from the previous exercise
score_diff = Subtract()([team_1_strength, team_2_strength])

This setup subracts the team strength ratings to determine a winner.

Model using two inputs and one output

In [ ]:
# Imports
from keras.layers import Subtract
from keras.models import Model

# Subtraction layer from previous exercise
score_diff = Subtract()([team_1_strength, team_2_strength])

# Create the model
model = Model([team_in_1, team_in_2], score_diff)

# Compile the model
model.compile(optimizer='adam', loss='mean_absolute_error')
In [ ]:
model.summary()

Fit the model to the regular season training data

In [ ]:
# Get the team_1 column from the regular season data
input_1 = games_season['team_1']

# Get the team_2 column from the regular season data
input_2 = games_season['team_2']

# Fit the model to input 1 and 2, using score diff as a target
model.fit([input_1, input_2],
          games_season['score_diff'],
          epochs=1,
          batch_size=2048,
          validation_split=.1,
          verbose=True)

Now our model has learned a strength rating for every team.

Evaluate the model on the tournament test data

In [ ]:
games_tourney = pd.read_csv('../input/games-tourney/games_tourney.csv')
games_tourney.head()
In [ ]:
# Get team_1 from the tournament data
input_1 = games_tourney['team_1']

# Get team_2 from the tournament data
input_2 = games_tourney['team_2']

# Evaluate the model using these inputs
model.evaluate([input_1, input_2], games_tourney['score_diff'])

Multiple Inputs: 3 Inputs (and Beyond!)

Make an input layer for home vs. away

You know there is a well-documented home-team advantage in basketball, so you will add a new input to your model to capture this effect.

In [ ]:
from keras.layers.merge import Concatenate
In [ ]:
# Create an Input for each team
team_in_1 = Input(shape=(1,), name='Team-1-In')
team_in_2 = Input(shape=(1,), name='Team-2-In')

# Create an input for home vs away
home_in = Input(shape=(1,), name='Home-In')

# Lookup the team inputs in the team strength model
team_1_strength = team_strength_model(team_in_1)
team_2_strength = team_strength_model(team_in_2)

# Combine the team strengths with the home input using a Concatenate layer, then add a Dense layer
out = Concatenate()([team_1_strength, team_2_strength, home_in])
out = Dense(1)(out)

Now you have a model with 3 inputs!

Make a model and compile it

In [ ]:
# Import the model class
from keras.models import Model

# Make a Model
model = Model([team_in_1, team_in_2, home_in], out)

# Compile the model
model.compile(optimizer='adam', loss='mean_absolute_error')
In [ ]:
model.summary()

Fit the model and evaluate

In [ ]:
# Fit the model to the games_season dataset
model.fit([games_season['team_1'], games_season['team_2'], games_season['home']],
          games_season['score_diff'],
          epochs=1,
          verbose=True,
          validation_split=.1,
          batch_size=2048)

# Evaluate the model on the games_tourney dataset
model.evaluate([games_tourney['team_1'], games_tourney['team_2'], games_tourney['home']], games_tourney['score_diff'])

Plotting models

In [ ]:
# Imports
import matplotlib.pyplot as plt
from keras.utils import plot_model

# Plot the model
plot_model(model, to_file='model.png')

# Display the image
data = plt.imread('model.png')
plt.imshow(data)
plt.show()

Add the model predictions to the tournament data

In [ ]:
# Predict
games_tourney['pred'] = model.predict([games_tourney['team_1'],games_tourney['team_2'],games_tourney['home']])
In [ ]:
games_tourney.head()
In [ ]:
games_tourney.score_diff.unique()

Now you can try building a model for the tournament data based on your regular season predictions

Create an input layer with multiple columns

In [ ]:
# Create an input layer with 3 columns
input_tensor = Input((3,))

# Pass it to a Dense layer with 1 unit
output_tensor = Dense(1)(input_tensor)

# Create a model
model = Model(input_tensor, output_tensor)

# Compile the model
model.compile(optimizer='adam', loss='mean_absolute_error')

Fit the model

In [ ]:
games_tourney_train = games_tourney.query('season < 2010')
In [ ]:
games_tourney_train.shape
In [ ]:
# Fit the model
model.fit(games_tourney_train[['home', 'seed_diff', 'pred']],
          games_tourney_train['score_diff'],
          epochs=1,
          verbose=True)

Evaluate the model

In [ ]:
games_tourney_test = games_tourney.query('season >= 2010')
In [ ]:
games_tourney_test.shape

Evaluate the model

In [ ]:
# Evaluate the model on the games_tourney_test dataset
model.evaluate(games_tourney_test[['home', 'seed_diff', 'pred']], 
               games_tourney_test['score_diff'])

Multiple Outputs

Simple two-output model

"multiple target regression": one model making more than one prediction.

In [ ]:
# Define the input
input_tensor = Input((2,))

# Define the output
output_tensor = Dense(2)(input_tensor)

# Create a model
model = Model(input_tensor, output_tensor)

# Compile the model
model.compile(optimizer='adam', loss='mean_absolute_error')

Fit a model with two outputs

this model will predict the scores of both teams.

In [ ]:
games_tourney_train.shape
In [ ]:
# Fit the model
model.fit(games_tourney_train[['seed_diff', 'pred']],
  		  games_tourney_train[['score_1', 'score_2']],
  		  verbose=False,
  		  epochs=1000,
  		  batch_size=64)
In [ ]:
import numpy as np
In [ ]:
np.mean(model.history.history['loss'])

Inspect the model (I)

The input layer will have 4 weights: 2 for each input times 2 for each output.

The output layer will have 2 weights, one for each output.

In [ ]:
# Print the model's weights
print(model.get_weights())

# Print the column means of the training data
print(games_tourney_train.mean())

Did you notice that both output weights are about ~53? This is because, on average, a team will score about 53 points in the tournament.

Evaluate the model

In [ ]:
# Evaluate the model on the tournament test data
model.evaluate(games_tourney_test[['seed_diff', 'pred']],games_tourney_test[['score_1', 'score_2']])

Classification and regression in one model

In [ ]:
# Create an input layer with 2 columns
input_tensor = Input((2,))

# Create the first output
output_tensor_1 = Dense(1, activation='linear', use_bias=False)(input_tensor)

# Create the second output (use the first output as input here)
output_tensor_2 = Dense(1, activation='sigmoid', use_bias=False)(output_tensor_1)

# Create a model with 2 outputs
model = Model(input_tensor, [output_tensor_1, output_tensor_2])

Compile and fit the model

In [ ]:
# Import the Adam optimizer
from keras.optimizers import Adam

# Compile the model with 2 losses and the Adam optimzer with a higher learning rate
model.compile(loss=['mean_absolute_error', 'binary_crossentropy'], optimizer=Adam(lr=0.01))

# Fit the model to the tournament training data, with 2 inputs and 2 outputs
model.fit(games_tourney_train[['seed_diff', 'pred']],
          [games_tourney_train[['score_diff']], games_tourney_train[['won']]],
          epochs=10,
          verbose=True,
          batch_size=16384)

Inspect the model (II)

In [ ]:
# Print the model weights
print(model.get_weights())

# Print the training data means
print(games_tourney_train.mean())
In [ ]:
# Import the sigmoid function from scipy
from scipy.special import expit as sigmoid

# Weight from the model
weight = 0.14

# Print the approximate win probability predicted close game
print(sigmoid(1 * weight))

# Print the approximate win probability predicted blowout game
print(sigmoid(10 * weight))

So sigmoid(1 * 0.14) is 0.53, which represents a pretty close game and sigmoid(10 * 0.14) is 0.80, which represents a pretty likely win. In other words, if the model predicts a win of 1 point, it is less sure of the win than if it predicts 10 points. Who says neural networks are black boxes?

Evaluate on new data with two metrics

Keras will return 3 numbers:

the first number will be the sum of both the loss functions,

the next 2 numbers will be the loss functions you used when defining the model.

In [ ]:
# Evaluate the model on new data
model.evaluate(games_tourney_test[['seed_diff', 'pred']],
               [games_tourney_test[['score_diff']], games_tourney_test[['won']]])

model plays a role as a regressor and a good classifier

Categories
Data Science Scikit Learn

K Nearest Neighbours Iris Dataset

Data points without K Nearest Neighbour

In [ ]:
import matplotlib
#matplotlib.use('GTKAgg')

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

# import some data to play with
iris = datasets.load_iris()

# take the first two features
# choosing first 2 columns sepal length and sepal width
X = iris.data[:, :2]
y = iris.target
h = .02  # step size in the mesh

# Calculate min, max and limits
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Put the result into a color plot
plt.figure()
plt.scatter(X[:, 0], X[:, 1])
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("Data points")
plt.show()

Sepal length and sepal width analysis

In [ ]:
import matplotlib

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

n_neighbors = 6

# import some data to play with
iris = datasets.load_iris()

# prepare data
# choosing first 2 columns sepal length and sepal width
X = iris.data[:, :2]
y = iris.target
h = .02

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA','#00AAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00','#00AAFF'])

# we create an instance of Neighbours Classifier and fit the data.
clf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
clf.fit(X, y)

# calculate min, max and limits
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))

# predict class using data and kNN classifier
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())

plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i)" % (n_neighbors))
plt.show()

Petal length and petal width analysis

In [ ]:
import matplotlib

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import neighbors, datasets

n_neighbors = 6

# import some data to play with
iris = datasets.load_iris()

# prepare data
# choosing 3rd and 4th columns petal length and petal width
X = iris.data[:, [2,3]]
y = iris.target
h = .02

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA','#00AAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00','#00AAFF'])

# we create an instance of Neighbours Classifier and fit the data.
clf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
clf.fit(X, y)

# calculate min, max and limits
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))

# predict class using data and kNN classifier
print(np.c_[xx.ravel(), yy.ravel()])

Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
print(Z)

# Put the result into a color plot
Z = Z.reshape(xx.shape)
print(Z)

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)

plt.figure()
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())


plt.figure()
plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

# Plot also the training points
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
plt.title("3-Class classification (k = %i)" % (n_neighbors))
plt.show()

Predicting the output

In [ ]:
import numpy as np
from sklearn import neighbors, datasets
from sklearn import preprocessing

n_neighbors = 6

# import some data to play with
iris = datasets.load_iris()

# prepare data
# choosing first 2 columns sepal length and sepal width
X = iris.data[:, :2]
y = iris.target
h = .02

# we create an instance of Neighbours Classifier and fit the data.
clf = neighbors.KNeighborsClassifier(n_neighbors, weights='distance')
clf.fit(X, y)

# make prediction
#sl = raw_input('Enter sepal length (cm): ')
#sw = raw_input('Enter sepal width (cm): ')

sl = 6.4
sw = 2.8
dataClass = clf.predict([[sl,sw]])
print('Prediction: '),

if dataClass == 0:
    print('Iris Setosa')
elif dataClass == 1:
    print('Iris Versicolour')
else:
    print('Iris Virginica')
dianabol pills for sale
andriol testocaps for sale
deca durabolin cycle
natural hgh
Categories
Data Science Scikit Learn

K Nearest Neighbour With Multiple Features

In [ ]:
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
wine = datasets.load_wine()
In [ ]:
# print the names of the features
print(wine.feature_names)
In [ ]:
# print the label species(class_0, class_1, class_2)
print(wine.target_names)
In [ ]:
# print the wine data (top 5 records)
print(wine.data[0:5])
In [ ]:
# print the wine labels (0:Class_0, 1:Class_1, 2:Class_3)
print(wine.target)
In [ ]:
# print data(feature)shape
print(wine.data.shape)
In [ ]:
# print target(or label)shape
print(wine.target.shape)
In [ ]:
# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split dataset into training set and test set
X_train, X_test, y_train, y_test = train_test_split(wine.data, wine.target, test_size=0.3) # 70% training and 30% test

Model with K=5

In [ ]:
#Import knearest neighbors Classifier model
from sklearn.neighbors import KNeighborsClassifier

#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = knn.predict(X_test)
In [ ]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Model with K=7

In [ ]:
#Import knearest neighbors Classifier model
from sklearn.neighbors import KNeighborsClassifier

#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=7)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = knn.predict(X_test)
In [ ]:
#Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
In [ ]: