In-Vehicle Marketing Engagement Optimization
Leonid Shpaner, Isabella Oakes, and Emanuel Lucban
Appendix [Python Code] - Pre-Processing
Appendix [Python Code] - Modeling
Abstract
End-user engagement is often too broadly confined into a “black box” of targeted marketing. Our paper takes a deep-dive approach, using selected machine learning algorithms suitable for classification and regression tasks to uncover end-user decision outcomes while in vehicle transit. We begin by modeling the relationships between relevant user characteristics (predictors) such as age, education, and marital status and our selected target: whether or not users accept the coupon recommended to them. Logistic regression is used as a “jumping-off point,” from which we establish a baseline accuracy of 59%. Ensuing models such as decision trees, neural networks, and support vector machines form the landscape for our algorithmic efforts, commencing with predictive analytics and culminating in prescriptive conclusions that leave room for subsequent iterative efforts.
Keywords: classification, regression, machine learning, targeted marketing analytics
Background: In-Vehicle Marketing Engagement Optimization
Consumers appreciate the feeling of receiving a discounted offer. It is “a positive feeling and reassuring (even if it is not as profitable for the consumer) to complete a transaction with some discount or incentive applied” (Ackner, n.d.). Moreover, digital coupon recommendations are estimated to “surpass $90 billion by 2022” (Ackner, n.d.). Studying a consumer’s decision-making trajectory is the underlying mechanism for establishing a sound targeted marketing practice. Wang et al. (2017) propose a set of rules to establish this trajectory, where “if a customer (goes to coffee houses ≥ once per month AND destination = no urgent place AND passenger ≠ kids) OR (goes to coffee houses ≥ once per month AND the time until coupon expires = one day) then predict the customer will accept the coupon for a coffee house” (Wang et al., 2017). Our endeavor begins with a baseline logistic regression model, from which we eliminate the passengers-with-children predictor because it is not statistically significant (p-value of 0.107), establishing a baseline accuracy of 59%. The pre-processed dataset is then subjected to eleven algorithms, each in an attempt to surpass this accuracy score.
Exploratory Data Analysis (EDA)
The dataset consists of 12,684 entries with 25 features and a binary Y output. Most features have no missing values; the car feature is missing the most data (99.1% of entries). The other five features with missing data (bar, coffee house, carry away, restaurant less than 20, and restaurant 20 to 50) are each missing between 0.84% and 1.7% of their values (see Table 1 in the supplemental materials for a list of features).
The distributions by target variable for destination are similar in home and work, with no urgent place having comparatively more people with a target result of 1 (yes). Examining the feature passenger, more people accept the coupon when with friends or a partner than if they are alone or have kids with them, as shown in Figure 1.
Figure 1
Normalized vs. Absolute Distributions (Passenger and Target)
Coupon acceptance in the ‘weather’ feature is higher when it is sunny than when it is rainy or snowy, but the responses are also skewed toward sunny weather. Temperature does not significantly impact whether a coupon is accepted. Time has a small effect, with 10 a.m. and 2 p.m. showing higher acceptance than early morning or evening. Figure 2 shows that the type of coupon affects acceptance likelihood, which is higher for take away and restaurants less than $20 than for other coupon types.
Figure 2
Normalized vs. Absolute Distributions (Coupon and Target)
In the ‘expiration’ feature, the one-day coupons are accepted more often than the two-hour coupons. Gender does not seem to affect the likelihood of acceptance. Age also has little effect, but respondents over 50 are less likely to accept coupons and respondents under 21 are slightly more likely to accept them. Marital status breaks down coupon acceptance fairly evenly, with widowed respondents being slightly less likely to accept. Respondents with children are slightly less likely to accept than respondents without children. Education shows a similarly even breakdown, with respondents with some high school education more likely to accept the coupon.
Occupation has 25 options, with construction & extraction, healthcare, and architecture having more respondents accept the coupon, and legal, retired, and social services having fewer. Response is not affected by income beyond slight variations, with the $75,000-$87,499 range having the highest share of non-acceptance. For car type, although a Mazda 5 (a car that is too old to install OnStar) has more acceptance of the coupon, this is not a reliable statistic because of the response rate and categories. Acceptance of the coupon does not seem affected by how often respondents visit bars. For coffee house, respondents in the one-to-three and four-to-eight visit categories are slightly more likely to accept coupons, while those in the “never” category are less likely to accept. The number of times people order carry away in a month does not change how often they accept the coupons by a significant amount. The number of times customers visit a restaurant and spend less than $20 also does not seem to change whether they accept the coupons. Restaurants in the $20-$50 range have more acceptance among people who ate out four to eight times or more than eight times. The feature “To coupon over 5 min” is entirely “yes,” with slightly more than half of the people accepting the coupon. If the coupon is over 15 minutes away, it is slightly less likely to be accepted. Like the coupon over 15 minutes away, if the response is “no,” there are more respondents accepting the coupon. The direction in which the respondent is traveling (toward or away from the coupon destination) does not seem to make a difference in whether someone responds “yes” or “no.”
Pre-Processing
To prepare the data for modeling, the features ‘car’, ‘toCoupon_GEQ5min’, and ‘direction_opp’ are first dropped due to sparse data, a single feature value, and redundancy, respectively. Time is converted to 24-hour time and expiration is converted to hours for consistency. ‘Bar’, ‘Coffee house’, ‘carry away’, ‘restaurant less than 20’, ‘restaurant 20-50’, ‘age’, ‘education’, and ‘income’ are all changed to ordinal values. The features ‘destination’, ‘passenger’, ‘weather’, ‘coupon’, ‘maritalStatus’, and ‘occupation’ are transformed using one-hot encoding. The data is then transformed using standard scaling for Gaussian features and normalization for non-Gaussian features. The target is assigned to a labels data frame and the data is split into a 75:25 train-test split.
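The encoding and splitting steps described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the appendix code: the toy records, the category labels in `FREQ_ORDER` and `WEATHER`, the random seed, and all helper names are hypothetical, and feature scaling is omitted for brevity.

```python
import random

# Hypothetical ordering for a visit-frequency feature such as 'Bar';
# the exact category labels are an assumption for illustration.
FREQ_ORDER = ["never", "less1", "1~3", "4~8", "gt8"]
WEATHER = ["Sunny", "Rainy", "Snowy"]


def ordinal_encode(value, order):
    """Map an ordered category to its integer rank."""
    return order.index(value)


def one_hot_encode(value, categories):
    """Return a 0/1 indicator list with a single 1 at the matching category."""
    return [1 if value == c else 0 for c in categories]


def expiration_to_hours(value):
    """Convert a string such as '2h' or '1d' to a number of hours."""
    amount, unit = int(value[:-1]), value[-1]
    return amount * 24 if unit == "d" else amount


def train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle a copy of the rows and split them into train and test lists."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy records (made up), not rows from the survey data.
records = [
    {"Bar": "never", "weather": "Sunny", "expiration": "1d", "Y": 1},
    {"Bar": "1~3", "weather": "Rainy", "expiration": "2h", "Y": 0},
    {"Bar": "gt8", "weather": "Snowy", "expiration": "2h", "Y": 0},
    {"Bar": "less1", "weather": "Sunny", "expiration": "1d", "Y": 1},
]

# Each encoded row: [ordinal Bar, expiration in hours, one-hot weather..., target]
encoded = [
    [ordinal_encode(r["Bar"], FREQ_ORDER), expiration_to_hours(r["expiration"])]
    + one_hot_encode(r["weather"], WEATHER)
    + [r["Y"]]
    for r in records
]
train, test = train_test_split(encoded)  # 75:25 split
```

Ordinal encoding keeps the natural order of frequency-style categories, whereas one-hot encoding avoids imposing an order on nominal features such as weather.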
Models
We look to K-Nearest Neighbors to determine the conditional probability \(Pr\) that a given target \(Y\) belongs to a class label \(j\) given that the feature vector \(X\) equals a test observation \(x_0\). We sum an indicator variable \(I\) over the k-nearest observations contained in the set \(\mathcal{N}_0\), where \(I\) takes a value of \(0\) or \(1\) depending on whether an observation belongs to class \(j\), and divide by \(k\). This is represented in the following form:
\[Pr(Y=j \mid X=x_0)=\frac{1}{k}\large \sum_{i\in \mathcal{N}_0} \normalsize I(y_i=j)\]
The K-Nearest Neighbors model using Euclidean distance was trained over odd values of k between 1 and 19, with the optimum number of neighbors being 5. On the test data, the area under the ROC curve was 65%, accuracy was 67%, recall was 78%, and the F1-score was 72%. Subsequently, applying the Manhattan distance metric over values between 1 and 31 (for broader scope) yielded a better accuracy score of 69%, with the optimum number of neighbors being 31. Moreover, the area under the curve was 67%, recall was 83%, and the F1-score was 75%.
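The class-probability estimate above, with either distance metric, can be sketched as follows. This is a minimal illustration on made-up points, not the tuned model; the function names and toy data are hypothetical.

```python
def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_class_probability(X_train, y_train, x0, k, label, distance=euclidean):
    """Estimate Pr(Y = label | X = x0) as the share of the k nearest
    training points whose class equals `label`."""
    ranked = sorted(zip(X_train, y_train), key=lambda pair: distance(pair[0], x0))
    neighbors = ranked[:k]  # the neighborhood N_0
    votes = sum(1 for _, y in neighbors if y == label)  # sum of I(y_i = j)
    return votes / k


# Toy data (made up): two well-separated clusters.
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [0, 0, 0, 1, 1, 1]
x0 = [1, 1]

p_k3 = knn_class_probability(X_train, y_train, x0, k=3, label=0)
p_k5 = knn_class_probability(X_train, y_train, x0, k=5, label=1, distance=manhattan)
```

With k = 3 the neighborhood contains only class-0 points, while widening it to k = 5 pulls in two class-1 points, which illustrates why the choice of k is tuned.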
The Random Forest model had better metrics and was trained over maximum depths from 1 to 20. The optimum max depth for the test data was determined to be 19, producing a model with an area under the ROC curve of 74%, test accuracy of 76%, recall of 84%, and an F1-score of 79%.
Subsequently, the Naïve Bayes model was implemented, calculating the posterior probability \(P(c \mid x)\) from \(P(c)\), \(P(x)\), and \(P(x \mid c)\). Under the conditional independence assumption, and up to the normalizing constant \(P(x)\), this gives us:
\[P(coupons \mid X) \propto P(x_1 \mid coupons) \times P(x_2 \mid coupons) \times \dotsb \times P(x_n \mid coupons) \times P(coupons)\]
\[\normalsize P[X \mid Y = coupons] = \Large \prod_{j=1}^{n} \normalsize P[X_j \mid Y = coupons]\]
The Naïve Bayes model had somewhat similar metrics to the K-Nearest Neighbors model, with an area under the ROC curve of 71%, accuracy of 62%, recall of 65%, and an F1-score of 66%.
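The posterior calculation can be sketched for categorical features as follows. This is a minimal illustration, assuming relative-frequency estimates with no smoothing; the toy observations and function names are hypothetical and are not taken from the survey data.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(X, y):
    """Estimate P(c) and the counts behind P(x_j = v | c) by relative frequency."""
    class_counts = Counter(y)
    priors = {c: n / len(y) for c, n in class_counts.items()}
    # cond_counts[(j, c)][v] = number of class-c rows whose feature j equals v
    cond_counts = defaultdict(Counter)
    for row, c in zip(X, y):
        for j, v in enumerate(row):
            cond_counts[(j, c)][v] += 1
    return priors, cond_counts, class_counts


def posterior(x, priors, cond_counts, class_counts):
    """Return P(c | x) for every class by normalizing
    P(c) * prod_j P(x_j | c) so that the values sum to one.
    Assumes at least one class has a non-zero score (no smoothing is applied)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for j, v in enumerate(x):
            score *= cond_counts[(j, c)][v] / class_counts[c]
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}


# Toy observations (made up): (passenger, weather) with coupon acceptance labels.
X = [
    ("Friend", "Sunny"), ("Friend", "Sunny"), ("Alone", "Rainy"), ("Alone", "Sunny"),
    ("Friend", "Rainy"), ("Alone", "Sunny"), ("Friend", "Rainy"), ("Alone", "Rainy"),
]
y = [1, 1, 0, 1, 0, 0, 1, 0]

priors, cond_counts, class_counts = fit_naive_bayes(X, y)
post = posterior(("Friend", "Sunny"), priors, cond_counts, class_counts)
```

Dividing by the sum of the class scores plays the role of \(P(x)\), turning the product of the prior and the per-feature likelihoods into a proper probability.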
In order to model any linear and nonlinear relationships that may be inherent in the data, a fully connected Neural Network was given consideration. During the hyperparameter tuning process, high training accuracy with low test data accuracy was observed, indicative of model overfitting. Therefore, dropout was implemented to introduce regularization, effectively reducing the predictive variance on unseen data. The tuned Neural Network included 6 layers, 124 hidden units, Rectified Linear Unit (ReLU) activations, 300 training epochs, and a learning rate of 0.001, producing an overall classification accuracy of 73%, recall of 79%, and F1-score of 77%.
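To illustrate how dropout regularizes a layer, the sketch below applies "inverted" dropout to a list of activations. It is a minimal illustration rather than the network described above: the dropout rate, the seed, and the function name are hypothetical, since the rate used in the tuned model is not stated here.

```python
import random


def dropout(activations, rate, rng, training=True):
    """Inverted dropout: during training, zero each activation with
    probability `rate` and rescale the survivors by 1 / (1 - rate) so the
    expected activation is unchanged; at inference, return the input as is."""
    if not training or rate == 0.0:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hidden = [1.0, 2.0, 3.0, 4.0]
train_out = dropout(hidden, rate=0.5, rng=rng)                  # some units zeroed
test_out = dropout(hidden, rate=0.5, rng=rng, training=False)   # unchanged
```

Because a different random subset of units is silenced on each training pass, the network cannot rely on any single unit, which is what reduces overfitting.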
In addition to Naïve Bayes, Gaussian Discriminant Analysis was considered as another generative approach. Quadratic Discriminant Analysis (QDA) and Linear Discriminant Analysis (LDA) models were trained and tested. With no hyperparameters to tune, the trained QDA and LDA model predictions were optimized using the ROC curve, finding optimal probability thresholds of 0.46 and 0.50, respectively. For the QDA model, this produced an overall test data accuracy of 67%, recall of 69%, and F1-score of 70%. The LDA model produced better metrics, with an overall accuracy of 69%, recall of 79%, and F1-score of 74%.
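One common way to pick a probability threshold from the ROC curve is to maximize Youden's J statistic (true positive rate minus false positive rate). The sketch below assumes that criterion, which is not stated in the text above; the scores, labels, and function names are made up for illustration.

```python
def roc_points(y_true, scores):
    """Return (threshold, TPR, FPR) for each distinct score used as a cutoff."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, tp / positives, fp / negatives))
    return points


def best_threshold(y_true, scores):
    """Pick the cutoff that maximizes Youden's J = TPR - FPR (an assumed criterion)."""
    return max(roc_points(y_true, scores), key=lambda p: p[1] - p[2])[0]


# Toy predicted probabilities (made up) and their true labels.
y_true = [0, 0, 1, 1, 0, 1, 1]
scores = [0.10, 0.30, 0.40, 0.46, 0.50, 0.70, 0.90]

threshold = best_threshold(y_true, scores)
y_pred = [1 if s >= threshold else 0 for s in scores]
```

Lowering the threshold below 0.50 trades some false positives for additional true positives, which raises recall.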
Furthermore, we used a decision tree classifier to trace the consumers’ path to accepting and/or rejecting a coupon recommendation. An untuned decision tree can take us down an “endless road” of decisionless consequences. Therefore, we tuned our max depth over a relatively broad range (three to ten) to give us a sense of the optimal hyperparameter given the highest test accuracy. An optimal maximum depth of ten produced an accuracy of 70%, recall of 76%, and F1-score of 74%. Figure 3 illustrates the decision tree for a maximum depth of three.
Figure 3
Decision Tree Classifier with Max Depth of 3
Note. Customers are more likely to accept the coupon if the coupon location is within 20 minutes, or if the coupon is for a coffee house.
Gradient Boosting, as another ensemble method, was tuned on a varying number of generated trees and the maximum depth of each tree to increase performance. The tuned Gradient Boosting Model included 500 trees, each with a maximum depth of 15. The Gradient Boosting Model outperformed the Random Forest, producing an overall test data accuracy of 76%, recall of 83%, and F1-score of 80%.
Subsequently, we revisit our baseline logistic regression model and tune it in the following manner. As a linear classifier, the model fits a hyperplane that separates the classes of observations in our coupon dataset according to the likelihood of each class. The model is simplified into an optimization of the regularized negative log-likelihood, where \(w\) and \(b\) are the estimated parameters:
\[\begin{equation*} (w^*,b^*) = \arg\min_{w,b} - \sum_{i=1}^N \bigg\{ y_i \log\bigg[\sigma(w^Tx_i + b)\bigg] + (1-y_i) \log\bigg[\sigma(-w^Tx_i - b)\bigg] \bigg\} + \frac{1}{C} \Omega([w,b]) \end{equation*}\]
We further tune our cost hyperparameter \(C\) such that the model complexity (regularized by \(\Omega\)) is varied from smallest to largest, with the aim of increasing classification accuracy at each iteration. Moreover, we rely on the default l2-norm penalty paired with the ‘lbfgs’ solver and cap the maximum number of iterations at 2,000 so that the model does not fail to converge. An optimal cost hyperparameter of 1 produced an accuracy of 69%, area under the curve of 67%, recall of 78%, and F1-score of 74%.
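The regularized objective can be evaluated directly for a given \(w\), \(b\), and \(C\). The sketch below is a minimal illustration, assuming \(\Omega\) is half the squared l2 norm of \(w\) with the bias left unpenalized (an assumption; the text does not spell out \(\Omega\)). The toy data and function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def regularized_nll(w, b, X, y, C):
    """Negative log-likelihood plus (1 / C) * Omega, where Omega is assumed
    to be 0.5 * ||w||^2 (bias not penalized)."""
    nll = 0.0
    for x_i, y_i in zip(X, y):
        z = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) + b
        # sigma(-z) equals 1 - sigma(z), the probability of class 0
        nll -= y_i * math.log(sigmoid(z)) + (1 - y_i) * math.log(sigmoid(-z))
    penalty = 0.5 * sum(w_j ** 2 for w_j in w) / C
    return nll + penalty


# Toy data (made up): two observations, one per class.
X = [[1.0, 0.0], [0.0, 1.0]]
y = [1, 0]
w = [1.0, -1.0]

loss_strong_reg = regularized_nll(w, 0.0, X, y, C=1.0)    # heavier penalty
loss_weak_reg = regularized_nll(w, 0.0, X, y, C=100.0)    # lighter penalty
```

Because the penalty is divided by \(C\), larger values of \(C\) weaken the regularization and let the fitted weights grow, which is why \(C\) controls model complexity.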
Similarly, we applied support vector machines (tuning the cost, gamma, and kernel hyperparameters) to supplement our modeling endeavors. A linear support vector machine model relies on estimating (\(\normalsize w^*\), \(\normalsize b^*\)) via constrained optimization of the following form:
\[\begin{eqnarray*} &&\min_{w^*,b^*,\{\xi_i\}} \frac{\|w\|^2}{2} + \frac{1}{C} \sum_i \xi_i \\ \textrm{s.t.} && \forall i: y_i\bigg[w^T \phi(x_i) + b\bigg] \ge 1 - \xi_i, \ \ \xi_i \ge 0 \end{eqnarray*}\]
However, our endeavor relies on the radial basis function kernel:
\[\begin{eqnarray*}K(x,x^{'}) = \exp \left(-\frac{\|x-x^{'}\|^2}{2\sigma^2} \right) \end{eqnarray*}\]
where \(\|x-x^{'}\|^2\) is the squared Euclidean distance between the two feature vectors, and \(\gamma = \frac{1}{2\sigma^2}\). Simplifying the equation, we have:
\[\begin{eqnarray*} K(x,x^{'}) = \exp (-\gamma \|x-x^{'}\|^2) \end{eqnarray*}\]
An optimal cost hyperparameter of 50 produced an accuracy of 76%, area under the curve of 75%, recall of 78%, and F1-score of 74%.
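The radial basis function kernel and its relationship to \(\gamma\) can be checked numerically with a short sketch. The function names and example vectors are hypothetical and chosen only for illustration.

```python
import math


def rbf_kernel(x, x_prime, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * squared_distance)


def gamma_from_sigma(sigma):
    """gamma = 1 / (2 * sigma^2)."""
    return 1.0 / (2.0 * sigma ** 2)


# Identical vectors have similarity 1; similarity decays with distance.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=gamma_from_sigma(1.0))
```

A larger \(\gamma\) (smaller \(\sigma\)) makes the kernel decay faster with distance, so each training point influences a narrower region of the feature space.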
Results – Model Summary Statistics and Performance Metrics
Figure 4 overlays the receiver operating characteristic (ROC) curves of all 11 models.
Figure 4
ROC Comparison for All 11 Models
Note. Gradient Boosting captures the highest area under the curve at 79%. The Neural Network boasts a close second place at 78%, and Support Vector Machines report an AUC of 75% (see Table 2 in the supplemental materials for an itemized breakdown of performance metrics).
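The accuracy, recall, and F1-score reported for each model follow from confusion-matrix counts. The sketch below is a minimal illustration on made-up labels; the function name and data are hypothetical and do not reproduce any figure reported here.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = accept)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0        # true positive rate
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy labels (made up).
y_true = [1, 1, 1, 1, 0, 0, 0, 1]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)

# A non-targeted approach that sends the coupon to everyone has perfect
# recall, but its precision falls to the base acceptance rate.
send_to_all = classification_metrics(y_true, [1] * len(y_true))
```

Recall measures how many of the users who would accept a coupon are actually reached, while precision measures how many of the offers sent are accepted.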
Conclusion
The nature of the project was to provide marketers with the means to leverage data in order to generate a measurable increase in engagement for in-vehicle, targeted promotions and advertising. Based on the Exploratory Data Analysis of the survey dataset, a non-targeted, naïve approach for distributing in-vehicle promotions may only yield an acceptance rate of approximately 57%. Modeling the complex relationships between the variables that contribute to the receptiveness of a target audience has several key benefits. First, employing an accurate predictive model allows for increased engagement through highly targeted distribution (i.e., only sending promotional offers to users that are receptive). Second, highly targeted promotions allow for the simultaneous distribution of different offers through audience segmentation, sending each segment the specific offers to which it is predicted to be most receptive. Third, an accurate predictive model reduces false negative rates and any associated opportunity costs.
All models that were tuned and tested outperformed the baseline model accuracy of 59%. However, the premise of the project relies on maximizing true positive rates, so the recall metric became the deciding factor in model selection. The Gradient Boosting Model with 500 estimators and a maximum tree depth of 15 was selected as the best model for this project, achieving the highest area under the ROC curve (79%) and F1-score (80%) of the tuned models, with an overall accuracy of 76% and a recall of 83%. Implementing the tuned Gradient Boosting Model should yield a measurable increase in engagement rates due to the lower false positive rates when compared to a non-targeted, naïve approach.
References
Ackner, R. (n.d.). Ecommerce Coupon Marketing Strategies: Give Discounts, Get a Lot More. Big Commerce.
Dua, D., & Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science. https://archive.ics.uci.edu/ml/datasets/in-vehicle+coupon+recommendation
Wang, T., Rudin, C., Doshi-Velez, F., Liu, Y., Klampfl, E., & MacNeille, P. (2017). A Bayesian Framework for Learning Rule Sets for Interpretable Classification. Journal of Machine Learning Research, 18, 1-37. https://arxiv.org/abs/1504.07614