E-Commerce Customer Behavior Report

This report documents the end-to-end data science workflow applied to an e-commerce orders dataset, from raw relational data through supervised and unsupervised learning. The summary below outlines each phase.

Data Preprocessing (Phase 1) - Four raw relational tables (orders, order_items, order_shipping, payments) were consolidated into a single order-level analytical record. The pipeline addressed missing values via statistical imputation (median for numerical, mode for categorical), enforced logical business constraints (valid monetary and quantity fields), and engineered derived features including order_value_per_item and order_size_category. The result was a clean, deduplicated dataset persisted as ecommerce_orders_cleaned.csv.
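The Phase 1 steps can be illustrated with a minimal standard-library sketch. It is not the project's actual pipeline: the sample rows, the column names total_value and item_count, and the size cut-offs in size_category are assumptions made for illustration. Only payment_type, order_value_per_item, and order_size_category come from the report.

```python
from statistics import median, mode

# Hypothetical order-level records; None marks a missing value.
orders = [
    {"order_id": "A1", "total_value": 120.0, "item_count": 3, "payment_type": "credit_card"},
    {"order_id": "A2", "total_value": None, "item_count": 1, "payment_type": "voucher"},
    {"order_id": "A3", "total_value": 40.0, "item_count": 2, "payment_type": None},
    {"order_id": "A4", "total_value": 300.0, "item_count": 6, "payment_type": "credit_card"},
]


def impute(rows, column, strategy):
    """Fill None in `column` with the median or mode of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = median(observed) if strategy == "median" else mode(observed)
    for r in rows:
        if r[column] is None:
            r[column] = fill


def size_category(item_count):
    """Bucket an order by item count (assumed cut-offs, not the report's)."""
    if item_count <= 1:
        return "small"
    if item_count <= 4:
        return "medium"
    return "large"


# Median for the numerical field, mode for the categorical field.
impute(orders, "total_value", "median")
impute(orders, "payment_type", "mode")

# Business constraint: keep only rows with positive value and quantity.
orders = [r for r in orders if r["total_value"] > 0 and r["item_count"] > 0]

# Derived features named in the report.
for r in orders:
    r["order_value_per_item"] = r["total_value"] / r["item_count"]
    r["order_size_category"] = size_category(r["item_count"])
```

Filtering on the constraint before computing order_value_per_item also guards against division by zero when an order has no items.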

Supervised Learning (Phase 2) - Classification models (Logistic Regression, Decision Tree, and Random Forest) were trained to predict the payment method (payment_type) from order attributes. After encoding and scaling, Random Forest delivered the strongest performance across all evaluation metrics. Its ensemble of trees captures the non-linear patterns that characterize real-world transaction data and is less prone to overfitting than a single decision tree, although ensembling alone does not correct for class imbalance in payment_type.
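Because payment_type classes are unlikely to be evenly represented, the choice of evaluation metric matters when comparing models. The summary does not name the metrics used, so the sketch below assumes two common ones, accuracy and macro-averaged F1, and uses made-up labels to show how they can diverge on imbalanced data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels: 8 credit_card orders and 2 voucher orders.
y_true = ["credit_card"] * 8 + ["voucher"] * 2

# A trivial model that always predicts the majority class.
y_majority = ["credit_card"] * 10

acc = accuracy(y_true, y_majority)
f1 = macro_f1(y_true, y_majority)
print(f"accuracy={acc:.2f}, macro F1={f1:.2f}")
```

The majority-class predictor scores well on accuracy but poorly on macro F1, which is why a per-class view is a useful check before concluding that one model handles minority payment types better than another.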

Unsupervised Learning (Phase 3) - Documented in full in this notebook. K-Means clustering (k = 4) segmented orders into behaviorally distinct groups defined primarily by basket value and freight intensity. Market Basket Analysis then surfaced high-lift associations between product categories, payment types, and order sizes, most notably the strong link between large orders, furniture/decor categories, and credit card payment. Together, these findings translate raw transactional data into actionable customer segments and cross-sell intelligence.
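For readers unfamiliar with lift, the following sketch shows how it is computed for a single association rule. The six transactions and the item labels (size=large, category=furniture_decor, and so on) are invented for illustration and are not taken from the dataset; the notebook's own analysis may use a dedicated library rather than hand-written functions.

```python
def support(transactions, items):
    """Share of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def lift(transactions, antecedent, consequent):
    """lift(A -> B) = support(A and B) / (support(A) * support(B))."""
    joint = support(transactions, set(antecedent) | set(consequent))
    base = support(transactions, antecedent) * support(transactions, consequent)
    return joint / base if base else 0.0


# Hypothetical orders encoded as sets of attribute=value items.
transactions = [
    {"size=large", "category=furniture_decor", "payment=credit_card"},
    {"size=large", "category=furniture_decor", "payment=credit_card"},
    {"size=small", "category=toys", "payment=voucher"},
    {"size=small", "category=furniture_decor", "payment=credit_card"},
    {"size=large", "category=toys", "payment=credit_card"},
    {"size=small", "category=toys", "payment=voucher"},
]

rule_lift = lift(
    transactions,
    antecedent={"size=large", "category=furniture_decor"},
    consequent={"payment=credit_card"},
)
print(f"lift = {rule_lift:.2f}")
```

A lift above 1 means the antecedent and consequent co-occur more often than they would if they were independent; a value near 1 indicates no association.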

Contributors:

Aaron Chou
IT Analyst

Javier Lee
Data Scientist

Jayden Liaw
ML Engineer

Lee Chuan
Software Engineer