The data is related with direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be (or not) subscribed. The dataset is ordered by date (from May 2008 to November 2010).

In [1]:

```
# First we import the necessary packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE
%matplotlib inline
```

The .csv file which is separated by semicolons is loaded into a new dataframe, as required:

In [2]:

```
eurobank = pd.read_csv('bank_marketing.csv', sep=';')
```

In [3]:

```
eurobank.head()
```

Out[3]:

After examining the first five rows of data to add a visual point of reference, the shape of the dataset is inspected at precisely 45,211 rows and 17 columns.

In [4]:

```
print(eurobank.shape)
```

Further examination yields that most of the columns (variables) presented herein are objects, as stated explicitly by the datatypes call on the "eurobank" data frame (eurobank.dtypes). These objects are categorical variables. Only six out of seventeen columns are presented numerically as integers (int64). Only one other numerical variable is presented as a floating number with two decimal places (float64).

In [5]:

```
eurobank.dtypes
```

Out[5]:

Next, a determination is made on which columns have missing values by running the following code:

In [6]:

```
eurobank.isnull().sum()
```

Out[6]:

To handle missing values, we drop NA's for the 'default' and 'contact' variables and fill missing numerical values for age with median age values to smooth the data as shown below:

In [7]:

```
eurobank.dropna(subset=['default', 'contact'], inplace=True)
eurobank['age'].fillna(int(eurobank['age'].median()), inplace=True)
```

Taking a further look, we determine that several of the following variables are of no consequence to the marketing campaign's efficacy. For example, the housing and loan variables pertain to binary loan categories and are redundant because they are essentially reporting the same metrics. Whether a person has a loan at a bank or if it is tied to housing does not have any bearing or impact on this particular marketing outreach. There is enough information to move forward with establishing relationships between variables. Similarly, the contact variable is omitted because the contact method (cellular, telephone, and unknown) does not provide much relevant information. Day and month would present some interesting findings. However, since this dataset is explicitly between the range of years 2008 - 2010, these parameters are not well-defined to provide which year they belong to; hence, we omit them from the model. Lastly, 'pdays', and 'previous' refer to contact tracing via the number of days. This exercise aims to establish the outcome of the marketing campaign via demographic representation, outreach, and subscription to the product. Thus, the following relevant columns are included in a new subset to help achieve this goal.

In [8]:

```
cols_to_include = ['age', 'job','marital', 'education', 'default', 'balance', 'duration', 'campaign', 'poutcome','deposit']
eurobank2 = eurobank[cols_to_include]
eurobank2.head()
```

Out[8]:

In the accompanying table, deposit outcome is examined by age group. Whereas clients between the ages of 35-40 made 962 deposits, the subscription rate is only 11%. The highest percent of deposits in any age group are equal for 18-20 year olds and those 75+ years old, tying them both at a 42% subscription rate. Given larger numbers of customers in these age brackets, this can very well change. However, targeted marketing for those age groups may prove beneficial.

In [9]:

```
print("\033[1m"+'Deposit Outcome by Age (Maximum Values):'+"\033[1m")
def deposit_by_age():
pd.options.mode.chained_assignment = None # default='warn'
bins = [18, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
names = ['18-20', '20-25', '25-30', '30-35', '35-40', \
'40-45', '45-50', '50-55', '55-60', '60-65', '65-70', '70-75', '75+']
d = dict(enumerate(names, 1))
eurobank2['Age Range'] = np.vectorize(d.get)(np.digitize(eurobank2['age'],bins))
deposit_yes = eurobank2.loc[eurobank2.deposit == 'yes'].groupby(['Age Range'])[['deposit']].count()
deposit_yes.rename(columns={'deposit':'Deposit'}, inplace=True)
deposit_no = eurobank2.loc[eurobank2.deposit == 'no'].groupby(['Age Range'])[['deposit']].count()
deposit_no.rename(columns={'deposit':'No Deposit'}, inplace=True)
merged_df = pd.concat([deposit_yes, deposit_no], axis = 1)
merged_df['Deposit'] = merged_df['Deposit'].fillna(0)
merged_df['No Deposit'] = merged_df['No Deposit'].fillna(0)
merged_df
max = merged_df.max()
print(max)
merged_df.loc['Total'] = merged_df.sum(numeric_only=True, axis=0)
merged_df['% of Deposits'] = round((merged_df['Deposit'] / (merged_df['Deposit'] \
+ merged_df['No Deposit']))* 100, 2)
return merged_df.style.format("{:,.0f}")
deposit_by_age()
```

Out[9]:

The summary statistics table broken down by job type is presented below. While there are several ways to aggregate this data, looking at balances by job category sheds light on summary statistics like mean, median, and standard deviation. For example, clients that are retired exhibit the highest mean bank balance of 1,963.06. The next highest mean bank balance of 1,794.90 is for individuals who did not specify their career profiles (unknown). Whereas the third-highest recorded mean bank balance of 1,768.30 is for management professionals. Given that retirement income, visa vie savings generally produces larger bank balances, this make sense contextually. On average, management professionals are higher wage earners than hourly positions, larger bank balances make sense for this sample as well. The interesting finding here is that blue-collar workers recorded lower average bank balances (1,079.62) than those that are unemployed (1,499.55). The same job category rankings are shown for median recorded bank balances, but the balances themselves are all below 1,000.00.

The highest standard deviation is that of 4,420.12 for retired individuals, suggesting that it is dispersed over a wider range of values and further away from the mean. This makes sense because tracking retirement income, generally speaking, casts a wide net over an abstract group of individuals. Similar assertions can be made about entrepreneurs to capture a broad category; their standard deviation is recorded as 4,242.79.

In [10]:

```
print("\033[1m"+'Bank Balance Summary Statistics by Job:'+"\033[1m")
def summary_by_job():
pd.options.display.float_format = '{:,.2f}'.format
cols_to_include = ['age', 'job','marital', 'education', 'default', 'balance', 'duration',\
'campaign', 'poutcome','deposit']
new = eurobank[cols_to_include]
new = new.groupby('job')['balance'].agg(["mean", "median", "std", "min", "max"])
new.loc['Total'] = new.sum(numeric_only=True, axis=0)
column_rename = {'mean': 'Mean', 'median': 'Median','std': 'Standard Deviation',\
'min':'Minimum','max': 'Maximum'}
dfsummary = new.rename(columns = column_rename)
new
return dfsummary
summary_by_job()
```

Out[10]:

The summary statistics of bank balances by marital status show that married individuals have the highest average bank balance. In contrast, individuals who are single have the second-highest average bank balance. Divorced couples are shown as having the lowest average bank balance, perhaps in part due to income loss from legal filings, paperwork, and attorney fees. Ranking information (highest to lowest) for median bank balances is the same; however, the balances themselves are lower. Whereas married couples show the highest average bank balances, the mean (standard deviation) variance is also the highest in this category.

In [11]:

```
print("\033[1m"+'Bank Balance Summary Statistics by Marital Status:'+"\033[1m")
def summary_by_marital():
pd.options.display.float_format = '{:,.2f}'.format
new2 = eurobank2.groupby('marital')['balance'].agg(["mean", "median", "std", "min", "max"])
new2.loc['Total'] = new2.sum(numeric_only=True, axis=0)
column_rename = {'mean': 'Mean', 'median': 'Median','std': 'Standard Deviation',\
'min':'Minimum','max': 'Maximum'}
dfsummary2 = new2.rename(columns = column_rename)
return dfsummary2
summary_by_marital()
```

Out[11]:

The table below illustrates deposits by marital status. In examining this data, one must consider what all of this means from a marketing perspective. Each campaign has its unique premise for onboarding new clients to higher-tier services and offerings; hence this is why large conglomerate banks have promotional offers visa vie sales calls, pamphlets, and other promotional materials. In this case, the outcome of a successful campaign is that of a customer making a deposit.

Furthermore, it becomes apparent that the highest recorded deposits come from married clients (2,601), but even a larger (the largest at 23,050) share of non-deposits also come from married clients. Only 10% of married clients subscribed to this marketing campaign, the lowest subscription rate by marital status category. Divorced customers, the lowest count of all marital categories, came in second with an overall 12% subscription rate. Single clients subscribed at the highest rate of 15% by making 1,794 deposits. There were only 12,017 single clients, which is why this metric was higher than the rest. All in all, 12% (counted as 4,985) of all 42,569 clients made a deposit, thereby subscribing to this marketing campaign.

In [12]:

```
print("\033[1m"+'Deposit Outcome by Marital Status:'+"\033[1m")
def deposit_by_marital():
marital_status_yes = eurobank2.loc[eurobank2.deposit == 'yes'].groupby\
(['marital'])[['deposit']].count()
marital_status_yes.rename(columns={'deposit':'Deposit'}, inplace=True)
marital_status_no = eurobank2.loc[eurobank2.deposit == 'no'].groupby\
(['marital'])[['deposit']].count()
marital_status_no.rename(columns={'deposit':'No Deposit'}, inplace=True)
merged_df = pd.concat([marital_status_yes, marital_status_no], axis = 1)
merged_df.loc['Total'] = merged_df.sum(numeric_only=True, axis=0)
merged_df['# of Clients'] = merged_df.sum(axis=1)
merged_df['% Deposited'] = round((merged_df['Deposit'] / (merged_df['Deposit'] \
+ merged_df['No Deposit']))* 100, 2)
return merged_df.style.format("{:,.0f}")
deposit_by_marital()
```

Out[12]:

Upon further examination of the dataset once again in its entirety, it is apparent that each variable (column) has its unique characteristics in terms of type. For example, whereas age, balance, duration, campaign, pdays, and previous are numerical variables, some are not. To this point, job, marital, education, default, housing, loan, contact, poutcome, and deposit are categorical variables. Month, on the other hand, though categorical at first glance, is truly an ordinal variable, as it follows a strict order. To aide in the determination of each, 'eurobank.dtypes', and 'eurobank.head()' functions were already run in the prior section.

The following code block examines the summary statistics as measures of centrality for the selected variables. The average age of the bank's clients is approximately 41 years old, which is only about two years older than the median, so there is no remarkable difference. The minimum age is 18, which makes sense because that is close to the minimum age when banking clients can open new accounts. Not surprisingly, the maximum age of the clientele is 95 years old.

In [13]:

```
summary_stats = round(eurobank2.describe().T,2)
summary_stats.drop(['count'], axis=1, inplace=True)
summary_stats.rename (columns={'mean':'Mean', 'std': 'Standard Deviation', 'min': 'Minimum', \
'25%':'Q1', '50%': 'Median', '75%':'Q3', 'max':'Maximum'}, inplace=True)
summary_stats
```

Out[13]:

We examine the age distribution visually as a boxplot, and though it is normally distributed, it is skewed to the right.

In [14]:

```
# Boxplot of age as another way of showing distribution
fig = plt.figure(figsize = (8,4))
plt.title ('Age Distribution of Clients With Outliers')
plt.xlabel('Client Age')
plt.ylabel('Clients')
sns.boxplot(data=eurobank2['age'], palette="PuBu", orient='h')
plt.show()
# Computing IQR
Q1 = eurobank2['age'].quantile(0.25)
Q3 = eurobank2['age'].quantile(0.75)
IQR = Q3-Q1
# Computing Summary Statistics
mean_1 = round(eurobank2['age'].mean(),2)
std_1 = round(eurobank2['age'].std(),2)
median_1 = round(eurobank2['age'].median(),2)
print('The first quartile is %s. '%Q1)
print('The third quartile is %s. '%Q3)
print('The IQR is %s.'%IQR)
print('The mean is %s.'%mean_1)
print('The standard deviation is %s.'%std_1)
print('The median is %s.'%median_1)
```

Some outliers from past the age of 70 onward suggest that older clients past retirement age are few and far in between. These outliers are subsetted into a new data frame to examine the means of ages with and without them as bases of comparison.

In [15]:

```
lower_bound = Q1-(1.5*IQR)
upper_bound = Q3+(1.5*IQR)
outliers_eurobank2 = eurobank2.loc[(eurobank2['age'] < lower_bound) | (eurobank2['age'] > upper_bound)]
average_age = round(eurobank2.age.mean(), 2)
print('The average of age in the original data frame is %s.' % average_age)
average_age_outliers_eurobank2 = round(outliers_eurobank2.age.mean(),2)
print('The average of age of outliers is %s.' % average_age_outliers_eurobank2)
```

The average age of clients as outliers in this dataset is 76.8 years old. In this analysis, while we are generally interested in data sensitive to outliers, we are equally interested in examining the dataset that is resistant to these outliers as a basis of comparison.

Removing the outliers from the age variable yields little observable significance, as shown below. For example, the average age has only been reduced by 0.37, the standard deviation by 0.62, with no difference in the median age with or without outliers.

In [16]:

```
# Boxplot of age without outliers as another way of showing distribution
fig = plt.figure(figsize = (8,4))
plt.title ('Age Distribution of Clients Without Outliers')
plt.xlabel('Client Age')
plt.ylabel('Clients')
sns.boxplot(data=eurobank2['age'], palette="PuBu", orient='h', showfliers = False)
plt.show()
filter = (eurobank2['age'] >= Q1 - 1.5 * IQR) & (eurobank2['age'] <= Q3 + 1.5 *IQR)
summary_stats_2 = round(eurobank2.loc[filter].describe().T,2)
summary_stats_2.drop(['count'], axis=1, inplace=True)
summary_stats_2.rename (columns={'mean':'Mean', 'std': 'Standard Deviation', 'min': 'Minimum', \
'25%':'Q1', '50%': 'Median', '75%':'Q3', 'max':'Maximum'}, inplace=True)
summary_stats_2
```

Out[16]:

In [17]:

```
# Computing Means, Standard Deviations, and Observing Differences
mean_2 = round(eurobank2.loc[filter]['age'].mean(),2)
mean_difference = round((mean_1 - mean_2),2)
std_2 = round(eurobank2.loc[filter]['age'].std(),2)
std_difference = round((std_1 - std_2),2)
median_2 = round(eurobank2.loc[filter]['age'].median(),2)
median_difference = round((median_1 - median_2),2)
print('The mean age without outliers is %s.'%mean_2)
print('The standard deviation of the age variable without outliers is %s.'%std_2)
print('The median age without outliers is %s.'%median_2)
print("\n")
print('The difference between the mean with outliers and the \
mean without outliers is %s.'%mean_difference)
print('The difference between the standard deviation with outliers and the \
standard deviation without outliers is %s.'%std_difference)
print('The difference between the median with outliers and the \
median without outliers is %s.'%median_difference)
```

A histogram showing the distribution of the age variable is plotted below.

In [18]:

```
# Plotting a histogram to show distribution
fig = plt.figure(figsize = (8,8))
plt.hist(eurobank['age'], bins=10, color='steelblue', alpha=0.9, rwidth=.97)
plt.title ('Age Distribution of Clients')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()
# mode = eurobank2.mode('age', numeric_only=True)
# print('The mode is %s.'%mode)
print('The information below shows the mode for the age variable:')
eurobank.loc[:,'age'].mode()
```

Out[18]:

There are more clients who are 39 years old than any other age. While age is one relevant variable in this dataset, it is worthwile to examine the distribution of account balances as follows:

In [19]:

```
# Plotting a histogram to show distribution
fig = plt.figure(figsize = (8,8))
plt.hist(eurobank2['balance'], bins=10, color='steelblue', alpha=0.9, rwidth=0.7)
plt.title ('Balance Distribution of Clients')
plt.xlabel('Balance')
plt.ylabel('Count')
plt.show()
print('The following values represent the number of data points between each bin:')
counts, bin_edges = np.histogram(eurobank2['balance'], bins=10)
print(counts)
```

The majority of these clients have negative balances. This is not so alarming since the prior summary output table did show a minimum balance of -8,019.00. While there appear to be only four bins in this figure, the output clearly shows otherwise. For example, most of our dataset is below \$40,000, based on the four visible bins shown above. More importantly, most of the balance is either zero or below. This can potentially yield some additional insights in terms of the marketing campaign's direction and its predictability.

In [20]:

```
campaign_age = eurobank2.groupby('age', as_index=False)['campaign'].sum()
fig = plt.figure(figsize = (15,8))
plt.barh(campaign_age['age'], campaign_age['campaign'])
plt.xlabel("Campaign")
plt.ylabel("Age")
plt.title("Marketing Campaigns by Age")
plt.show()
```

Now, to prepare the dataset for a regression model, we assign numerical values to the categorical variables of deposit and prior outcome as follows:

In [21]:

```
eurobank2['deposit'] = eurobank2.deposit.map(dict(yes=1, no=0))
eurobank2['poutcome'] = eurobank2.poutcome.map(dict(unknown=0, other=1, failure=2, success=3))
```

Next, we examine correlations visually:

In [22]:

```
sns.set_context("paper", font_scale=1)
plot = sns.pairplot(eurobank2, diag_kind='hist', dropna = True, palette='Blues')
plot.fig.suptitle("Pairplot of European Bank Marketing Dataset", y = 1.0001)
plt.show()
```

At first glance it does not appear that any of these variables have any noticeable degrees of correlation. Therefore, a heatmap correlation with indices is produced below:

In [23]:

```
# Examining the correlations between variables using a heat map
heatmap = eurobank2[['age','balance','duration','campaign','poutcome','deposit']]
sns.set_context("paper", rc={"axes.labelsize":12}, font_scale = 1.3)
correlations = heatmap.corr()
plt.figure(figsize = (9,9))
ax = sns.heatmap(correlations[['age','balance','duration','campaign','poutcome', 'deposit']], \
annot = True, square = True, cbar_kws={'label': 'Correlation Index'})
ax.set_title('Heatmap of Correlations in European Bank Marketing Campaign')
ax.set_ylim(len(correlations), -0.5)
plt.show()
```

From here we can see that the duration of the marketing call is moderately correlated to the deposit (0.40). This is a moderate positive correlation. The second highest correlation is that of prior marketing campaign outcome (poutcome) to deposit (.22). This is a low positive correlation. Lastly, a low negative correlation (-.11) is seen between campaign and poutcome.

The following scatterplot modeling the moderate at best relationship between deposit and duration is produced below. An increase in the duration of the contact attempt produces a similar increase in the deposits (subscriptions) in the marketing campaign.

In [24]:

```
sns.regplot(data = eurobank2, x = 'duration', y = 'deposit')
plt.title('Deposit vs. Duration of Marketing Call')
plt.show()
print('The correlation coefficient r is equal to %s.' % \
round(np.corrcoef(eurobank2.duration,eurobank2.deposit)[0,1], 2))
```

In looking at linear regression as a potential model for this dataset, a determination is made that this model holds little to no merit for its weak variable associations (relationships). Therefore, the eurobank dataset requires a logistic regression model to make predictions based on the current variables. This is an example of supervised learning.

Since the marketing strategy focused on who opened an account and made an initial deposit, the analysis requires changing the outputs from yes and no to 1 and 0. Additionally, it will need dummy variables to help the program understand the categorical data. After the data preprocessing, the program will split the data into train and test sets to identify accuracy. Not all the variables will be selected for the final model, so the analysis will use the Recursive Feature Elimination (REF) method to reduce the number of variables.

The final portion of the analysis will show the final model, the model's accuracy, and a confusion matrix for the overall performance of the model.

In [25]:

```
# Change yes and no answers from deposit variable (column) to 0(no) and 1(yes) for linear regression
eurobank['deposit'] = eurobank.deposit.map(dict(yes=1, no=0))
```

In [26]:

```
# Created dummy variables for categorical variables
categorical_variables=['job','marital','education','default','housing','loan','contact','month','poutcome']
for var in categorical_variables:
categorical_list='var'+'_'+var
categorical_list = pd.get_dummies(eurobank[var], prefix=var)
eurobank10=eurobank.join(categorical_list)
eurobank=eurobank10
categorical_variables=['job','marital','education','default','housing','loan','contact','month','poutcome']
data_vars=eurobank.columns.values.tolist()
keep=[i for i in data_vars if i not in categorical_variables]
```

In [27]:

```
# Print all variables including dummy variables
data_final = eurobank[keep]
data_final.columns.values
```

Out[27]:

In [28]:

```
# Named dependent variable to y (deposit), while the rest are independent (X)
X = data_final.loc[:, data_final.columns != 'deposit']
y = data_final.loc[:, data_final.columns == 'deposit']
# Used 70% of train data and 30% of test data
os = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
columns = X_train.columns
os_data_X,os_data_y=os.fit_sample(X_train, y_train)
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['deposit'])
```

The Recursive Feature Elimination will identify which variables are significant for our model. It will help as a first filter, and a logistic regression analysis will help as a second filter to choose the most signficant variables.

In [29]:

```
import warnings
warnings.filterwarnings("ignore")
data_final_vars=data_final.columns.values.tolist()
y=['deposit']
X=[i for i in data_final_vars if i not in y]
#Used Recursive Feature Elimiation (RFE) to select significant feautures for our model
logreg = LogisticRegression()
rfe = RFE(logreg, 20)
rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
```

In [30]:

```
# Used variables that were given by the RFE output
cols = ['job_admin.', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid', 'job_retired', \
'job_self-employed', 'job_student', 'job_unemployed', 'job_unknown', 'marital_divorced', \
'marital_married', 'marital_single', 'education_primary', 'education_secondary', 'education_tertiary', \
'default_yes', 'housing_no', 'housing_yes', 'loan_no']
X = os_data_X[cols]
y = os_data_y['deposit']
```

Approaching the end result, we have a sound logistical model that maps the optimal job type, marital status, educational background, and housing strata to an effective targeted marketing campaign. Since all of the *p-*values associated with these variables are statistically significant (less than the alpha value of 0.05), they are all good candidates for measuring viability in marketing outreach.

In [31]:

```
# No features are removed since all are significant (<0.05)
logit_model = sm.Logit(y,X)
result = logit_model.fit()
print(result.summary2())
```

In [32]:

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
```

Out[32]:

The accuracy of this logistical model is determined to be 0.87 as shown below:

In [33]:

```
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))
```

The confusion matrix shows whether the predictions had false negatives (Type II) or false positives (Type I). The first row shows 5,782 true positives and 186 false positives, while the second row illustrates 1,333 falses negatives and 4715 true negatives. The prediction had a total of 10,497 accurate results and 1,519 not accurate results.

In [34]:

```
# The output shows 5782 and 4715 correct predictions
# It also shows 186 and 1333 incorrect predictions
from sklearn.metrics import confusion_matrix
confusion_matrix = confusion_matrix(y_test, y_pred)
print(confusion_matrix)
```

Ultimately, modeling an effective subscription model visa vie deposit outcome only hinges on one variable relationship between duration and deposits with a coefficient of determination of roughly 40%. Taking that value against itself produces a 16% explanation in terms of the variability between both of these variables. While increasing the duration of the contact attempt in the campaign can stand to benefit a higher subscription rate, it will only do so slightly. Therefore, a revision model based on logistical regression is carried out to solve the overarching dilemma of targeted marketing to a select demographic, with 87% accuracy.