Supervised Learning Techniques for E-Commerce Orders¶
This phase applies supervised learning techniques to the cleaned e-commerce dataset to predict the payment type of each order.
The primary objectives are:
- Preprocess the dataset by appropriately handling numerical and categorical features to prepare it for model training.
- Compare the performance of three supervised learning techniques (Logistic Regression, Decision Tree, and Random Forest) based on standard evaluation metrics, class imbalance considerations, model complexity, and practical deployability.
- Recommend, based on these comparisons, the most suitable model for the e-commerce firm to support its business objectives.
Setup¶
The environment set-up uses pandas and numpy for data manipulation, matplotlib and seaborn for visualisation, and scikit-learn for preprocessing, model training, and evaluation.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')
plt.style.use('ggplot')
# A random state is introduced to ensure reproducibility and stability in the machine learning workflows
RANDOM_STATE = 8
print(f"Random state set to: {RANDOM_STATE}")
Random state set to: 8
Part 1: Data Preparation¶
We will first prepare the transformed dataset so that it can be used for supervised classification. Proper data preparation is a critical step in the machine learning pipeline, as model performance and validity depend heavily on the quality, structure, and representation of the input data.
Data Loading and Splitting¶
In this section, we use the processed dataset ecommerce_orders_cleaned.csv from Assignment 1. To predict the payment type of an order, we set payment_type as the target variable, while all remaining columns are treated as input features. We then partition the dataset into training and test sets using train_test_split() with a test size of 0.2 and the pre-defined RANDOM_STATE to ensure reproducibility.
A test set is essential because it evaluates how well the trained model performs on previously unseen data. This helps to detect overfitting and provides a fair comparison of model performance.
df = pd.read_csv("ecommerce_orders_cleaned.csv")
print(f"Dataset shape: {df.shape}")
df.head()
Dataset shape: (46076, 17)
| order_id | order_status | order_purchase_hour | order_purchase_dayofweek | order_purchase_month | order_total_value | num_items | num_unique_products | num_unique_sellers | total_item_price | avg_item_price | total_freight_value | top_product_category | customer_state | payment_type | order_value_per_item | order_size_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | sdv-id-whzjUX | shipped | 10 | 4 | 4 | 744.312535 | 1 | 1 | 1 | 352.420029 | 369.966521 | 68.790159 | construction_tools_construction | Massachusetts | voucher | 352.420029 | Small |
| 1 | sdv-id-dbopoJ | delivered | 19 | 2 | 3 | 1556.667902 | 1 | 1 | 1 | 289.242639 | 1354.621410 | 15.394619 | health_beauty | Vermont | credit_card | 289.242639 | Small |
| 2 | sdv-id-FSEOvM | delivered | 15 | 4 | 8 | 62.060506 | 1 | 1 | 1 | 26.893468 | 48.485654 | 18.751282 | luggage_accessories | South Carolina | debit_card | 26.893468 | Small |
| 3 | sdv-id-bQcBUR | delivered | 21 | 0 | 8 | 73.873470 | 1 | 1 | 1 | 37.790896 | 75.704909 | 8.670875 | computers_accessories | Kentucky | credit_card | 37.790896 | Small |
| 4 | sdv-id-MPxIXB | delivered | 13 | 5 | 5 | 361.961537 | 3 | 3 | 3 | 169.528323 | 50.132979 | 34.731146 | pet_shop | Missouri | voucher | 56.509441 | Medium |
# Define target variable
y = df['payment_type']
# Define features (all columns except payment_type)
cols_to_select = [col for col in df.columns if col not in ('order_id', 'payment_type')]
X = df[cols_to_select]
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)
# Print shapes to verify
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
X_train shape: (36860, 15)
X_test shape: (9216, 15)
y_train shape: (36860,)
y_test shape: (9216,)
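One caveat: the split above does not stratify by class, even though the payment types are imbalanced. Passing `stratify=y` to `train_test_split()` keeps class proportions consistent across train and test sets. A minimal sketch on synthetic labels (the 80/20 split and variable names here are illustrative, not from the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 80% class "a", 20% class "b"
y_toy = np.array(["a"] * 80 + ["b"] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify preserves the 80/20 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=8, stratify=y_toy
)
print((y_te == "b").mean())  # 0.2: same minority share as the full data
```

Without `stratify`, a rare class can end up under-represented (or absent) in the test set purely by chance.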
Handling Categorical Variables¶
Categorical data has to be processed before it can be used for machine learning because the models work with numbers, not string objects. We do so by first identifying all categorical columns in the training data and then applying label encoding to convert category labels into numerical representations that the models can interpret.
Encoders are fitted on the training data only and then used to transform both the training and test sets. This prevents data leakage and ensures consistency throughout the workflow. Any unseen categories in the test set are mapped to a placeholder label, unknown_category, to prevent transformation errors and improve robustness.
# Identify categorical columns
categorical_cols = X_train.select_dtypes(include=['object', 'category']).columns.tolist()
print(f"Categorical columns: {categorical_cols}")
Categorical columns: ['order_status', 'top_product_category', 'customer_state', 'order_size_category']
# Create a copy of X_train to store encoded values
X_train_encoded = X_train.copy()
# Initialise LabelEncoder
le = LabelEncoder()
# Fit on training data
le.fit(X_train_encoded['order_status'].astype(str))
le.classes_
array(['approved', 'canceled', 'created', 'delivered', 'invoiced',
'processing', 'shipped', 'unavailable'], dtype=object)
# Apply Label Encoding to categorical columns
# Create copies to avoid modifying original data
X_train_encoded = X_train.copy()
X_test_encoded = X_test.copy()
# Dictionary to store label encoders for each column
label_encoders = {}
for col in categorical_cols:
le = LabelEncoder()
# Fit on training data
le.fit(X_train_encoded[col].astype(str))
# Transform train and test
X_train_encoded[col] = le.transform(X_train_encoded[col].astype(str))
# Handle unseen categories in test set
le.classes_ = np.append(le.classes_, "unknown_category")
test_col = X_test_encoded[col].astype(str)
X_test_encoded[col] = test_col.where(test_col.isin(le.classes_), "unknown_category")
X_test_encoded[col] = le.transform(X_test_encoded[col])
# Store encoder
label_encoders[col] = le
# Print the shape after encoding
print(f"X_train_encoded shape: {X_train_encoded.shape}")
print(f"X_test_encoded shape: {X_test_encoded.shape}")
X_train_encoded shape: (36860, 15)
X_test_encoded shape: (9216, 15)
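Worth noting: LabelEncoder imposes an arbitrary integer ordering on categories, which tree models tolerate but linear models like Logistic Regression can misread as a magnitude. A hedged alternative sketch using OneHotEncoder inside a ColumnTransformer (the toy frame and column names below are illustrative stand-ins, not the real dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy frame standing in for one categorical and one numerical column
toy = pd.DataFrame({
    "order_status": ["delivered", "shipped", "delivered"],
    "num_items": [1, 3, 2],
})

# handle_unknown="ignore" maps unseen test categories to all-zero rows,
# playing the same role as the "unknown_category" placeholder above
ct = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["order_status"])],
    remainder="passthrough",
)
encoded = ct.fit_transform(toy)
print(encoded.shape)  # (3, 3): two one-hot columns + num_items passed through
```

One-hot encoding avoids the spurious ordering at the cost of wider feature matrices, which matters for high-cardinality columns like top_product_category.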
Handling Numerical Variables¶
We handle the numerical variables by performing feature scaling. StandardScaler from scikit-learn standardises each numerical feature to zero mean and unit standard deviation, placing all numerical features on a common scale.
This is important because it prevents feature dominance by removing the effect of each feature's unit of measurement. For example, from X.describe() below, order_total_value ranges from 6.12 to 2890.55 while num_items only ranges from 1 to 8. In distance-based or gradient-based models, order_total_value would otherwise dominate num_items, biasing the model towards the former.
X.describe()
| order_purchase_hour | order_purchase_dayofweek | order_purchase_month | order_total_value | num_items | num_unique_products | num_unique_sellers | total_item_price | avg_item_price | total_freight_value | order_value_per_item | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 46076.000000 | 46076.000000 | 46076.000000 | 46076.000000 | 46076.000000 | 46076.000000 | 46076.000000 | 46076.000000 | 46076.000000 | 46076.000000 | 46076.000000 |
| mean | 14.968856 | 2.765583 | 6.235567 | 215.071101 | 1.589526 | 1.402965 | 1.163686 | 128.680045 | 104.606243 | 25.924525 | 87.919473 |
| std | 5.240562 | 2.036003 | 3.121741 | 268.827491 | 1.245700 | 1.071458 | 0.536977 | 183.724414 | 135.327151 | 23.721633 | 123.504255 |
| min | 0.000000 | 0.000000 | 1.000000 | 6.116141 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.850000 | 0.000000 | 0.000000 |
| 25% | 12.000000 | 1.000000 | 3.000000 | 77.835386 | 1.000000 | 1.000000 | 1.000000 | 39.647803 | 39.443877 | 13.721818 | 32.069555 |
| 50% | 15.000000 | 2.000000 | 7.000000 | 134.222752 | 1.000000 | 1.000000 | 1.000000 | 75.259779 | 73.906002 | 18.199461 | 56.677959 |
| 75% | 19.000000 | 5.000000 | 8.000000 | 215.567663 | 2.000000 | 1.000000 | 1.000000 | 139.397564 | 120.253773 | 30.271361 | 99.938720 |
| max | 23.000000 | 6.000000 | 12.000000 | 2890.551922 | 8.000000 | 8.000000 | 5.000000 | 2606.843945 | 1809.335277 | 277.552685 | 2606.843945 |
# After label encoding, all feature columns are numeric
numerical_cols = X_train_encoded.columns.tolist()
print(f"Numerical columns: {numerical_cols}")
print(f"Number of numerical columns: {len(numerical_cols)}")
Numerical columns: ['order_status', 'order_purchase_hour', 'order_purchase_dayofweek', 'order_purchase_month', 'order_total_value', 'num_items', 'num_unique_products', 'num_unique_sellers', 'total_item_price', 'avg_item_price', 'total_freight_value', 'top_product_category', 'customer_state', 'order_value_per_item', 'order_size_category']
Number of numerical columns: 15
# Initialise StandardScaler
scaler = StandardScaler()
# Fit on training data only, then transform both train and test
X_train_scaled = scaler.fit_transform(X_train_encoded[numerical_cols])
X_test_scaled = scaler.transform(X_test_encoded[numerical_cols])
# Print the shape after scaling
print(f"X_train_scaled shape: {X_train_scaled.shape}")
print(f"X_test_scaled shape: {X_test_scaled.shape}")
X_train_scaled shape: (36860, 15)
X_test_scaled shape: (9216, 15)
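As a quick self-contained sanity check on synthetic data (not the notebook's dataset): fitting StandardScaler on the training data yields exactly zero mean and unit standard deviation on that data, while the test set is transformed with the training statistics and only approximately standardised.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(8)
train = rng.normal(loc=50, scale=10, size=(1000, 3))
test = rng.normal(loc=50, scale=10, size=(200, 3))

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # fit on train only
test_scaled = scaler.transform(test)        # reuse train statistics

print(train_scaled.mean(axis=0).round(6))  # [0. 0. 0.]
print(train_scaled.std(axis=0).round(6))   # [1. 1. 1.]
```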
Part 2: Applying Machine Learning Models¶
In this section, we train and evaluate three supervised classification models to predict the payment type of an order. We then conduct a systematic comparison of their predictive performance and practical suitability for deployment.
Logistic Regression¶
Logistic Regression is a linear classification model that models the log-odds of the target class as a linear combination of the input features.
# Train Logistic Regression model
lr_model = LogisticRegression(random_state=RANDOM_STATE, max_iter=1000)
lr_model.fit(X_train_scaled, y_train)
# Generate predictions
y_pred_lr = lr_model.predict(X_test_scaled)
print("Logistic Regression training complete.")
Logistic Regression training complete.
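The log-odds formulation above can be verified numerically. The sketch below uses a synthetic binary problem for clarity (the notebook's payment_type target is multiclass, where a softmax generalisation applies): the predicted probability is exactly the sigmoid of the linear score returned by decision_function.

```python
import numpy as np
from scipy.special import expit
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Small synthetic binary problem, independent of the notebook's data
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=8)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

p = clf.predict_proba(X_demo)[:, 1]       # P(class 1)
log_odds = clf.decision_function(X_demo)  # X @ coef_.T + intercept_ (the log-odds)

# The predicted probability is the sigmoid of the linear score
print(np.allclose(p, expit(log_odds)))  # True
```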
Decision Tree¶
A Decision Tree is a tree-structured model that makes predictions by recursively splitting the feature space based on feature values.
# Train Decision Tree model
dt_model = DecisionTreeClassifier(random_state=RANDOM_STATE)
dt_model.fit(X_train_scaled, y_train)
# Generate predictions
y_pred_dt = dt_model.predict(X_test_scaled)
print("Decision Tree training complete.")
Decision Tree training complete.
Random Forest¶
Random Forest is an ensemble learning model that makes predictions by combining multiple decision trees built on different subsets of data and features. By using aggregated voting, it reduces overfitting and improves model performance.
# Train Random Forest
rf = RandomForestClassifier(
n_estimators=200,
max_depth=None,
random_state=RANDOM_STATE
)
rf.fit(X_train_scaled, y_train)
y_pred_rf = rf.predict(X_test_scaled)
print("Random Forest training complete.")
Random Forest training complete.
Part 3: Evaluation & Visualization¶
We evaluate and compare the performance of all three models using standard evaluation metrics and confusion-matrix visualisations for a comprehensive and robust analysis.
Evaluation metrics:
- Accuracy
- Measures the overall proportion of correct predictions which provides a general performance overview
- Accuracy = (TP + TN)/(TP + TN + FP + FN)
- Precision
- Measures how many predicted positives are actually correct
- A high precision implies low false positive errors.
- Precision = TP/(TP+ FP)
- Recall
- Measures how many actual positives are correctly identified
- A high recall implies low false negative errors
- Recall = TP/(TP + FN)
- F1-score
- Harmonic mean of precision and recall
- F1 = (2 x Precision x Recall)/(Precision + Recall)
Confusion matrix:
- The confusion matrices are presented as heatmaps to illustrate the distribution of correct and incorrect predictions across classes.
- This helps us to identify misclassification patterns and class-level performance differences.
# Compute evaluation metrics for all three models
def evaluate_model(y_true, y_pred, model_name):
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
f1 = f1_score(y_true, y_pred, average='weighted')
return {
'Model': model_name,
'Accuracy': accuracy,
'Precision': precision,
'Recall': recall,
'F1-Score': f1
}
# Evaluate all models
results = [
evaluate_model(y_test, y_pred_lr, 'Logistic Regression'),
evaluate_model(y_test, y_pred_dt, 'Decision Tree'),
evaluate_model(y_test, y_pred_rf, 'Random Forest')
]
# Create results DataFrame
results_df = pd.DataFrame(results)
results_df
| Model | Accuracy | Precision | Recall | F1-Score | |
|---|---|---|---|---|---|
| 0 | Logistic Regression | 0.606988 | 0.553678 | 0.606988 | 0.483541 |
| 1 | Decision Tree | 0.503798 | 0.511951 | 0.503798 | 0.507755 |
| 2 | Random Forest | 0.630534 | 0.582534 | 0.630534 | 0.569791 |
# Plot confusion matrices for all three models
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
def plot_confusion_matrix(y_true, y_pred, ax, title):
cm = confusion_matrix(y_true, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax)
ax.set_title(title)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plot_confusion_matrix(y_test, y_pred_lr, axes[0], 'Logistic Regression')
plot_confusion_matrix(y_test, y_pred_dt, axes[1], 'Decision Tree')
plot_confusion_matrix(y_test, y_pred_rf, axes[2], 'Random Forest')
plt.tight_layout()
plt.show()
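The metric formulas listed earlier can be checked directly against a confusion matrix. A minimal binary sketch on hand-made labels (not the notebook's data):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels {0, 1}, confusion_matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # TP / (TP + FP)
recall = tp / (tp + fn)     # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(precision == precision_score(y_true, y_pred))  # True
print(recall == recall_score(y_true, y_pred))        # True
```

For the multiclass payment-type problem, the weighted averages used above compute these per class and weight by class support.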
Based on the results above, we discuss the strengths and weaknesses of each model, as well as how each handles class imbalance.
Logistic Regression
- Strengths:
- Easy to implement
- Highly interpretable - the exponentiated coefficient of each feature gives its odds ratio, providing an intuitive understanding of which features drive the model's predictions
- Weaknesses:
- Prone to poor performance when there are non-linear relationships between the features and the target variable, which is common in real-world data with complex patterns
- Class imbalance handling:
- Can lower the decision threshold so that smaller predicted probabilities are treated as actionable, with consideration given to business risk tolerance
- Can apply resampling first (downsampling the majority class or oversampling the minority class)
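Both imbalance-handling bullets above can be sketched on synthetic data (the 90/10 split and the 0.3 threshold below are illustrative choices, not tuned values): `class_weight="balanced"` reweights the loss inversely to class frequency, and lowering the probability threshold flags more minority-class cases.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced synthetic data: roughly 90% majority class
X_imb, y_imb = make_classification(
    n_samples=1000, weights=[0.9, 0.1], random_state=8
)

# class_weight="balanced" scales errors inversely to class frequency
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_imb, y_imb)

# Threshold adjustment: flag the minority class at 0.3 instead of 0.5
proba = clf.predict_proba(X_imb)[:, 1]
y_pred_default = (proba >= 0.5).astype(int)
y_pred_lower = (proba >= 0.3).astype(int)
print(y_pred_lower.sum() >= y_pred_default.sum())  # True: lower threshold flags more positives
```

In practice, the threshold would be chosen from a validation set against the business's tolerance for false positives versus false negatives.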
Decision Tree
- Strengths:
- Able to capture non-linear relationships
- Easy to visualise (for smaller trees)
- Does not require scaling
- Weaknesses:
- Difficult to interpret large and complex trees with many leaves
- Highly prone to overfitting. A deep tree might "memorize" specific orders in the training set rather than learning general patterns, leading to poor performance on the test set.
- High variance, whereby small changes in the training data can lead to very different tree structures and predictions, making it less suitable for real-life production pipelines
- Class imbalance handling:
- Can apply the class_weight parameter to penalise minority class errors more heavily
- Can apply resampling first as well
Random Forest
- Strengths:
- Able to capture non-linear relationships
- Lower variance and overfitting compared to decision trees as the predictions are averaged across multiple trees
- Weaknesses:
- Difficult to interpret decision logic since there could be hundreds of trees
- High computational cost due to the training of multiple trees
- Class imbalance handling:
- Similar to the Decision Tree, the class_weight parameter can be applied; the "balanced_subsample" option recomputes weights within each bootstrap sample (the subset of data used by each tree)
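The class_weight mechanism mentioned for both tree models can be illustrated on a toy 90/10 split (synthetic data, not the notebook's): with "balanced", each class receives weight n_samples / (n_classes × class_count), so minority errors cost proportionally more.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

y_imb = np.array([0] * 90 + [1] * 10)  # 90/10 imbalance
X_imb = np.random.default_rng(8).normal(size=(100, 3))

# "balanced" weight per class: n_samples / (n_classes * class_count)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_imb)
print(weights)  # approx [0.556, 5.0]: minority errors cost ~9x more

dt = DecisionTreeClassifier(class_weight="balanced", random_state=8).fit(X_imb, y_imb)
# "balanced_subsample" recomputes the weights per bootstrap sample
rf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced_subsample", random_state=8
).fit(X_imb, y_imb)
```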
Part 4: Model Selection & Final Recommendation¶
Based on the evaluation and comparison discussed above, we provide our recommendation for the model to be selected.
# Display the results table
print("Model Performance Comparison:")
print(results_df.to_string(index=False))
# Identify the best model based on your criteria
best_model = results_df.loc[results_df['F1-Score'].idxmax(), 'Model']
print(f"\nBest performing model: {best_model}")
Model Performance Comparison:
Model Accuracy Precision Recall F1-Score
Logistic Regression 0.606988 0.553678 0.606988 0.483541
Decision Tree 0.503798 0.511951 0.503798 0.507755
Random Forest 0.630534 0.582534 0.630534 0.569791
Best performing model: Random Forest
Justifications¶
The model selection is based on the following justifications.
Evaluation Metrics:
- Random Forest outperformed the other models on all the metrics shown above; in particular, it achieved the best F1-score (0.57).
- Due to the class imbalance (credit_card: 60%, voucher: 30%, points and debit card <= 5%), accuracy is not a suitable evaluation metric. Meanwhile, there is no strong preference between false positive and false negative errors given that we are simply predicting payment types. Hence, the F1-score is the most suitable metric as it balances both Precision and Recall.
Class Imbalance Considerations:
- Random forest handles class imbalance in a more robust manner due to bootstrap sampling and the ensemble diversity of multiple trees.
- Logistic regression only learns a single global decision boundary, which fails to capture small local pockets of the minority class. A decision tree has high variance and may overfit to the majority class.
- On the other hand, the bootstrap sampling in random forest allows minority signals to be learned repeatedly and the ensemble helps to stabilise minority detection.
Model Complexity and Interpretability:
- Logistic regression is easy to implement and intuitive to understand via its feature coefficients. A decision tree can still have interpretable decision logic if it is shallow.
- A random forest is the most complex and has the lowest interpretability among the 3 models. However, alternative methods such as SHAP can be used to understand how each feature drives the prediction.
Practical Deployment Considerations:
- Logistic regression is the simplest to implement and runs quickly. A decision tree also runs quickly, but its instability can cause maintenance issues in a production pipeline.
- Random forest incurs a higher computational cost and memory usage due to its ensemble of many trees.
Why this model is preferred over others: Random Forest is preferred over Logistic Regression and Decision Trees because it provides stronger predictive performance, greater stability, and better robustness to class imbalance. By combining many decision trees trained on different bootstrap samples, the model captures complex nonlinear patterns while reducing the variance and overfitting risk of a single tree. Although more complex, interpretability can be maintained using SHAP values, which explain how each feature contributes to predictions. With modern production systems capable of handling large-scale computation, the additional training cost is generally outweighed by the model's improved accuracy and reliability, making Random Forest a strong overall modeling choice.
Model Explainability/Interpretability¶
Lastly, we compare the explainability and interpretability of the models.
Logistic Regression:
- Logistic Regression has the highest interpretability because it is the easiest to explain mathematically: each feature's coefficient provides a magnitude and direction that can be used to assess its impact on the prediction.
Decision Tree:
- A Decision Tree can have high interpretability if it is shallow, because its decision logic is essentially a sequence of if-else rules that can be traced. However, interpretability decreases if the tree becomes too deep or has too many leaves.
Random Forest:
- Being an ensemble of multiple decision trees, Random Forest is not directly interpretable. Instead, tools such as SHAP can be applied to help us understand feature importance.
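Short of a full SHAP analysis, Random Forest's built-in feature_importances_ offers a quick (though impurity-based and somewhat biased toward high-cardinality features) view of which features influence predictions. A minimal sketch on synthetic data with one informative feature:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# One informative feature among noise, so its importance should dominate
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=1, n_redundant=0,
    n_clusters_per_class=1, random_state=8,
)
rf = RandomForestClassifier(n_estimators=100, random_state=8).fit(X_demo, y_demo)

importances = rf.feature_importances_
print(importances.sum().round(6))  # 1.0: importances are normalised to sum to 1
```

For the notebook's trained model, pairing these values with the feature names in X_train_encoded.columns would rank the drivers of payment-type predictions.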
Part 5: Conclusion¶
- This assignment provided us with hands-on experience of processing a dataset and subsequently training models on the processed data to evaluate which one performs best.
- Among the models tested, we found that Random Forest achieved the strongest predictive performance and demonstrated greater model stability, largely due to its ensemble structure that reduces variance compared to a single decision tree. It also captures nonlinear relationships that logistic regression may not model effectively. Although Random Forest is more complex, its interpretability can still be supported through model explanation tools such as SHAP, which help quantify the contribution of individual features to predictions.
- Key takeaways:
- Firstly, model selection should not rely solely on predictive performance. Other factors such as interpretability, computational complexity and deployment feasibility must also be considered depending on the application context. This is particularly important in a business setting where resources and timeline are important considerations to factor into the development of machine learning pipelines.
- Another area to highlight would be how class imbalances are being handled by the different models. In many real-life scenarios (customer churn, fraud detection, medical diagnosis), class imbalance scenarios can be common. Appropriate techniques such as class weighting, resampling, or threshold adjustment should therefore be incorporated into the modeling process to ensure fair and meaningful evaluation.
- Lastly, we have to be mindful of how we pre-process the data before using them as features in a model. In this assignment, we touched upon Label Encoder to encode categorical columns, and Standard Scaler to normalise numerical columns. However, there are other techniques such as One-hot encoding or ordinal encoding, min-max scaling (for distance-based models) and robust scaling (to handle outliers), which may be more suitable depending on feature characteristics and the choice of model.
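The scaling alternatives mentioned above can be contrasted in a few lines (the toy column below, with one large outlier, is illustrative of a long-tailed feature like order_total_value): MinMaxScaler maps values to [0, 1] using the min and max, so a single outlier squashes the inliers, while RobustScaler centres on the median and scales by the interquartile range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler

# Toy column with one large outlier
x = np.array([[10.0], [12.0], [11.0], [13.0], [500.0]])

minmax = MinMaxScaler().fit_transform(x)
robust = RobustScaler().fit_transform(x)

# MinMax squashes the inliers near 0 because the outlier defines the max
print(minmax.ravel().round(3))
# RobustScaler centres on the median (12) and scales by the IQR (13 - 11 = 2)
print(robust.ravel().round(3))
```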
Overall, this assignment showcased the development of an end-to-end machine learning workflow, from preprocessing and model training to evaluation and practical deployment considerations. It prompted us to consider how an effective modelling workflow requires balancing predictive accuracy with interpretability and operational feasibility.