Discover Supervised Learning and Advanced Machine Learning Methods

Introduction

Week 15 at DataraFlow focused on supervised learning, with an emphasis on mastering advanced regression techniques, including Polynomial Regression, Support Vector Machines (SVM), and Decision Tree Regression.

The primary objectives were to:

Understand and apply non-linear regression methods for real-world datasets.
Explore the strengths and limitations of different supervised learning models.
Gain hands-on experience in preprocessing, feature engineering, and model evaluation.
Apply these techniques to a real-world assessment, predicting car resale prices based on a range of features.

By the end of the week, I did not only implement these models but also interpret results in a business context, drawing actionable insights from model predictions.

1. Task 1: Polynomial Regression – Model Comparison

Objective: Compare Linear Regression vs Polynomial Regression and understand when each model is appropriate.

Dataset: Task-Datasets/task1_polynomial_data.csv (Experience_Years vs Salary, 15 rows)

1.1 Approach

Loaded and visualized the data with a scatter plot to observe potential non-linear relationships.

Built and trained the following models:
- Linear Regression
- Polynomial Regression (degrees 2, 3, 4)
Visualized each model’s curve over the original data points.
Predicted salary for 8.5 years of experience using all four models.
Compared predictions and assessed which model best fit the data.

1.2 Results

Model	Predicted Salary (8.5 years)
Linear Regression	102.90
Polynomial Degree 2	101.72
Polynomial Degree 3	102.97
Polynomial Degree 4	101.10

1.3 Insights

Polynomial regression (degree 3) closely aligns with Linear Regression for this dataset, indicating mild non-linearity.
For small datasets with slightly curved trends, Polynomial Regression of degree 2–3 is usually sufficient.
Higher-degree polynomials (degree 4+) risk overfitting without substantial performance gains.

Although higher-degree polynomial models offer more flexibility, their predictions at 8.5 years are not significantly different from those of lower-degree polynomials or the linear regression model. While polynomial regression of degree 2 can capture non-linear relationships, the data and visualizations indicate that the relationship between experience and salary is largely linear. Given the increased risk of overfitting and higher variance with degree 2, 3, or 4 polynomials, Linear Regression provides the most reliable and generalizable performance, achieving an optimal balance between model complexity and predictive stability. Its simplicity, interpretability, and lower risk of overfitting make it the preferred choice.

2. Task 2: Support Vector Regression (SVR)

Objective: Implement SVR and understand the importance of feature scaling.

Dataset: Task-Datasets/task2_svr_data.csv (Temperature vs Ice_Cream_Sales, 20 rows)

2.1 Approach

Built Linear Regression and SVR models without scaling, observing poor SVR predictions.
Applied StandardScaler to both features and target for proper SVR training.
Predicted ice cream sales at 27°C using Linear Regression and scaled SVR.
Visualized the effects of scaling on model performance.

2.2 Results

Model	Prediction at 27°C
Linear Regression	512.12
SVR (Unscaled)	423.06
SVR (Scaled)	490.74

2.3 Insights

Feature scaling is critical for SVR, as the RBF kernel depends on distance calculations.
After scaling, SVR predictions became more reasonable and closely aligned with Linear Regression for interpolation points.
SVR captures non-linear patterns better than Linear Regression in general, especially when data has curvature.

3. Task 3: Decision Tree Regression

Objective: Implement Decision Tree Regression and visualize decision boundaries.

Dataset: Task-Datasets/task3_decision_tree_data.csv (Hours_Studied vs Exam_Score, 25 rows)

3.1 Approach

Built a Decision Tree Regressor with random_state=0.
Created two visualizations:
- Standard resolution (step predictions over data)
- High-resolution (0.1-step increments to show step-like structure)
Compared Decision Tree predictions with Linear Regression.
Predicted exam score for 23 hours of study.

3.2 Results

Model	Predicted Exam Score
Linear Regression	75.34
Decision Tree Regression	75.00

3.3 Insights

Decision Trees perfectly capture step-like, non-linear patterns.
They are more flexible than Linear Regression for non-continuous trends but can overfit small datasets.
High-resolution visualization clearly shows how the model splits the input space.

4. Assignment 1: Comprehensive Model Comparison

Objective: Build and compare multiple regression techniques—Linear, Polynomial, SVR, and Decision Tree—on a real-world salary prediction dataset. The goal is to understand model strengths, weaknesses, and applicability in scenarios with non-linear growth patterns.

4.1 Dataset Overview

Dataset: assignment1_salary_prediction.csv

10 position levels with corresponding salaries.
Non-linear (exponential) growth pattern observed as position level increases.

Pattern Observation:

The scatter plot reveals that salaries increase gradually for early position levels, then accelerate sharply at higher levels. Early experience yields slow gains, while advanced experience compounds significantly, reflecting real-world career progression where senior roles are rewarded exponentially.

4.2 Model Implementations

4.2.1 Linear Regression

Goal: Establish a baseline and observe limitations with non-linear data.

Prediction at position level 6.5: 330,378.79
Visual assessment: Severe underfitting; fails to capture the exponential growth curve.

Pros: Simple, interpretable, fast.
Cons: Cannot model non-linearity; poor performance for high position levels.

4.2.2 Polynomial Regression

Tested multiple degrees: 2, 3, 4, 5, 6.

Degree	Prediction at 6.5	Visual Fit	Pros	Cons
2	189,498.11	Underfitting	Captures slight curvature	Still too rigid
3	133,259.47	Underfitting	Balance of bias-variance	May miss sharp changes
4	158,862.45	Good fit	Captures curvature well	Risk of overfitting
5	174,878.08	Excellent fit	Very accurate, smooth	Slight risk of overfitting
6	174,192.82	Overfitting	Extremely flexible	Unstable predictions

Best polynomial degree: 5

Reasoning:

The dataset exhibits a non-linear but smooth upward trend.
Degree 5 captures this curvature very well, producing a prediction almost exactly at the midpoint between Level 6 and Level 7 salaries.
Degree 6 gives a similar result, but higher-degree polynomials risk overfitting—i.e., fitting noise rather than the true trend.
Degree 5 balances accuracy and generalization, making it the most reliable model for this dataset.

4.2.3 Support Vector Regression (SVR)

Approach: Apply StandardScaler to both X and y; use RBF kernel.

Prediction at 6.5: 170,370.02
Visual assessment: Excellent fit; smooth curve capturing non-linear growth.

Pros: Smooth, robust, handles non-linearity without overfitting.
Cons: Requires scaling, harder to interpret.

4.2.4 Decision Tree Regression

Prediction at 6.5: 150,000.00
High-resolution plot shows stepwise predictions following training points exactly.

Pros: Handles non-linearity, no scaling required.
Cons: Step-wise nature; poor generalization outside observed data.

4.3 Model Comparison

Model	Prediction at 6.5	Visual Fit	Pros	Cons
Linear Regression	330,378.79	Severe underfitting	Simple and interpretable	Cannot capture non-linearity
Polynomial deg 4	158,862.45	Good fit	Captures curvature well	Risk of overfitting
SVR (RBF)	170,370.02	Excellent fit	Smooth and robust	Requires scaling, hard to interpret
Decision Tree	150,000.00	Step-wise overfit	Handles non-linearity	Poor generalization

Combined Visualization

4.4 Analysis and Recommendations

Best Model: SVR with RBF kernel offers the most reliable predictions due to smooth handling of non-linear growth while avoiding overfitting.
Polynomial Regression (degree 5) is a close second, especially for interpretability and moderate non-linearity.
Decision Tree Regression captures training data perfectly but is unreliable for extrapolation.
Linear Regression is unsuitable due to severe underfitting.

Business Implications:

For salary predictions, models that handle non-linearity (SVR or carefully tuned polynomial regression) provide more actionable insights for HR planning and budget forecasting.
Extrapolating salaries for senior positions requires caution with step-wise or linear models.

Next Steps:

Include more data points to improve model reliability.
Evaluate feature importance if additional variables (e.g., performance ratings, education) are available.
For deployment, SVR or polynomial regression (degree 4) should be the default predictive models.

5. Assignment 2: Multi-Feature Regression

Objective: Apply advanced regression techniques to multi-feature datasets to predict energy consumption and support cost optimization in building management.

Scenario: A building management company wants to predict energy consumption based on environmental factors to optimize HVAC systems and reduce energy costs.

Dataset: Assignment-Dataset/assignment2_energy_efficiency.csv

100 records
Features: Temperature, Humidity, Wind_Speed, Solar_Radiation
Target: Energy_Consumption

5.1 Data Preparation

Loaded and explored the dataset.
Statistical summary and missing value check confirmed clean data.
Split dataset: 80% training, 20% testing (random_state=42).

5.2 Baseline Model: Multiple Linear Regression

Trained a Multiple Linear Regression model on training data.
Metrics on test set:
- R²: 0.798
- MAE: 15.74
- RMSE: 19.42

Interpretation:
The model explains ~80% of variance in energy consumption. Errors are reasonably small, indicating solid baseline performance.

5.3 Support Vector Regression (SVR)

Features scaled using StandardScaler.
Trained SVR with RBF kernel.
Metrics on test set:
- R²: 0.821
- MAE: 15.30
- RMSE: 18.25

Interpretation:
SVR slightly outperforms Linear Regression. The model captures non-linear relationships between environmental factors and energy consumption, reducing prediction error.

5.4 Decision Tree Regression

Trained Decision Trees with varying max_depth (3, 5, 10, None).
Best performing tree: depth = 5
Metrics:
- R²: 0.352
- MAE: 26.99
- RMSE: 34.78

Interpretation:
Decision Tree underperforms for this dataset, likely due to small sample size and smooth continuous relationships, which tree models handle less efficiently than linear or kernel-based models.

5.5 Model Comparison

Model	R²	MAE	MSE	RMSE
Multiple Linear Regression	0.798	15.74	376.97	19.42
Support Vector Regression	0.821	15.30	333.20	18.25
Decision Tree (depth=5)	0.352	26.99	1209.71	34.78

Interpretation:
SVR is the most accurate model, followed closely by Multiple Linear Regression. Decision Tree shows high error and low variance explained, making it unsuitable for this dataset.

Combined model comparison visualization:

Predicted vs Actual

Residual plots

5.6 Insights & Recommendations

Model Selection:
- SVR is recommended for accurate predictions.
- Linear Regression provides nearly comparable accuracy with higher interpretability.
Energy Optimization Recommendations:
- Prioritize temperature control strategies as it has the greatest impact on energy usage.
- Humidity management can further reduce consumption.
- Solar exposure adjustments (e.g., shading, reflective surfaces) could improve efficiency.
Trade-offs:
- SVR offers the best accuracy but is less interpretable and requires scaling.
- Linear Regression is interpretable and easier to deploy but slightly less precise.
- Decision Trees are easy to visualize but underperform for smooth, continuous data.

Summary:
For multi-feature energy prediction, SVR with RBF kernel provides the most accurate forecasts, with temperature and humidity driving energy consumption the most. Implementing SVR predictions in building management systems can inform HVAC (Heating, Ventilation, and Air Conditioning) adjustments, reduce energy costs, and optimize operational efficiency.

6. Assignment 3: Time Series Prediction with Polynomial Features

Objective: Apply regression techniques to time-series data with feature engineering for predicting stock closing prices.

Scenario: A financial analyst wants to predict stock closing prices based on daily trading data to inform investment decisions.

Dataset: Assignment-Dataset/assignment3_stock_prices.csv

90 days of trading data
Features: Day, Opening_Price, High_Price, Low_Price, Volume
Target: Closing_Price

6.1 Data Exploration

Loaded and examined the dataset.
Time series plot of opening and closing prices to visualize trends.
Calculated correlation matrix to identify strongest predictors of closing price.

Plot of Opening vs Closing Prices over time:

6.2 Feature Engineering

Created new features to enhance predictive power:
- Price_Range = High_Price - Low_Price
- Price_Change = Closing_Price - Opening_Price (shifted by 1 to avoid leakage)
- Volume_MA = Moving average of Volume with window=5
Handled NaN values resulting from moving averages.
Selected final feature set: Day, Opening_Price, Price_Range, Price_Change, Volume_MA

#--- Feature Engineering ---
# Create new features
assignment3_data['Price_Range'] = assignment3_data['High_Price'] - assignment3_data['Low_Price']
assignment3_data['Price_Change'] = assignment3_data['Closing_Price'] - assignment3_data['Opening_Price'].shift(1)
assignment3_data['Volume_MA'] = assignment3_data['Volume'].rolling(window=5).mean()

# Drop rows with NaN values created by shifting/rolling
new_assignment3_data = assignment3_data.dropna().reset_index(drop=True)

#check for NaN values
print("NaN values in features and target after feature engineering and dropping NaNs:")
print(np.isnan(X).sum())
print(np.isnan(Y).sum())    

# Display updated dataset with new features
display(new_assignment3_data.head())
display(new_assignment3_data.info())
display(new_assignment3_data.describe())

# Select features and target variable
features = ['Day','Opening_Price', 'High_Price', 'Low_Price', 'Volume', 'Price_Range', 'Price_Change', 'Volume_MA']
X = new_assignment3_data[features].values
Y = new_assignment3_data['Closing_Price'].values

6.3 Data Preparation

Split data: first 70 days → training, last 20 days → testing.
Temporal order maintained (no shuffling) to respect time series nature.

# --- Data preparation for Time Series ---
# Split data into training and testing sets (16 days for testing)

# Convert to DataFrame for easier indexing
X = pd.DataFrame(X, columns=features)
Y = pd.Series(Y)

# Time series split
train_size = 70
X_train = X.iloc[:train_size]
X_test = X.iloc[train_size:]
Y_train = Y.iloc[:train_size]
Y_test = Y.iloc[train_size:]

# Ensuring no overlap
print("Train period: Day", X_train['Day'].min(), "to", X_train['Day'].max())
print("Test period: Day", X_test['Day'].min(), "to", X_test['Day'].max())

# 3. Verify splits
print(f"Training set shape: {X_train.shape}, {Y_train.shape}")
print(f"Test set shape: {X_test.shape}, {Y_test.shape}")

6.4 Model 1: Multiple Linear Regression

Baseline model trained on engineered features.
Test set metrics:
- R²: 0.831
- MAE: 2.423
- RMSE: 2.957

Interpretation:
Linear Regression provides strong predictive performance for daily stock closing prices, explaining ~83% of variance.

6.5 Model 2: Polynomial Regression (Degree 2)

Added polynomial features for numeric predictors.
Test set metrics:
- R²: -0.108
- MAE: 6.709
- RMSE: 7.572

Interpretation:
Polynomial Regression severely overfits the training data, performing worse than baseline on test data. Non-linear expansion fails to capture the true trend in stock prices and introduces high variance.

6.6 Model 3: Decision Tree Regression

Trained Decision Tree with max_depth = 7 (best performing depth).
Test set metrics:
- R²: 0.710
- MAE: 3.005
- RMSE: 3.872

Interpretation:
Decision Tree captures non-linear patterns better than Linear Regression for certain days but underperforms in overall predictive stability. Stepwise predictions can be erratic for volatile periods.

6.7 Model Comparison

Model	R²	MAE	MSE	RMSE
Multiple Linear Regression	0.831	2.423	8.742	2.957
Polynomial Regression (degree=2)	-0.108	6.709	57.335	7.572
Decision Tree (depth=7)	0.710	3.005	14.995	3.872

Time Series Comparison Plot for All Models:

Residual Plots over Time:

6.8 Model Selection and Limitations

Best Model: Multiple Linear Regression
- Consistently highest R² and lowest RMSE on test set
- Smooth predictions avoid step-wise artifacts of Decision Trees
- Polynomial features caused overfitting due to limited sample size
Limitations:
- Regression assumes continuity and may fail during market shocks or sudden price spikes
- Models do not incorporate external factors such as market news or macroeconomic indicators
- High volatility and noise in financial data reduce prediction confidence
Recommendations:
- Use these regression models for trend estimation and short-term forecasting, not for actual trading decisions
- Incorporate additional data: moving averages of closing prices, technical indicators, or macroeconomic variables
- Evaluate model retraining frequently to adapt to new market conditions

Summary:
For stock price prediction in this dataset, Multiple Linear Regression is the most reliable, balancing accuracy and generalization. Polynomial regression overfits, while Decision Trees provide less stable forecasts. Feature engineering (Price_Range, Price_Change, Volume_MA) improves predictive power, but real-world application requires caution due to market volatility.

7. Assessment: Car Price Prediction (Real World Data)

Objective: Predict used car prices using multiple regression techniques and business-oriented data preprocessing.

Scenario: A car resale company wants to estimate prices for hypothetical vehicles to optimize inventory, pricing strategy, and customer advisory.

Dataset: Assessment-Dataset/assessment_data.csv

Features: Brand, Year, Mileage, Engine_Size, Horsepower, Fuel_Type, Transmission, Previous_Owners, Accident_History, Service_Records
Target: Price
Mixed feature types: categorical, numerical, and binary

7.1 Data Preprocessing

Categorical Features: Brand, Fuel_Type, Transmission → One-Hot Encoding (drop first column to avoid dummy variable trap)
Binary Features: Accident_History, Service_Records → Label Encoding (Yes = 1, No = 0)
Numerical Features: Year, Mileage, Engine_Size, Horsepower, Previous_Owners → StandardScaler for normalization
Combined preprocessing pipeline ensures new data can be encoded identically.

Data Preprocessing Pipeline Code:

# One-Hot Encoded features
OneHot_features = ['Brand', 'Fuel_Type', 'Transmission']

# Label Encoded features (as required)
LE_features = ['Accident_History', 'Service_Records']

# Numerical features
numerical_features = ['Year', 'Mileage', 'Engine_Size','Horsepower', 'Previous_Owners']

#LabelEncoding
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

for col in LE_features:
    assessment_data[col] = le.fit_transform(assessment_data[col])

# Column Transformer Building
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Define Column Transformer 
preprocessor = ColumnTransformer(
    transformers=[
        ('oneHot', OneHotEncoder(drop='first', handle_unknown='ignore'), OneHot_features),
        ('scale', StandardScaler(), numerical_features),
        ('le', 'passthrough', LE_features)
    ]
)

# Prepare feature matrix X
assessment_dataset = assessment_data[
    OneHot_features + numerical_features + LE_features
]
# Apply column transformer to feature matrix
assessment_transformed = preprocessor.fit_transform(assessment_dataset)

# Convert to Dataframe
feature_names = (
    preprocessor.named_transformers_['oneHot']
    .get_feature_names_out(OneHot_features)
)

all_feature_names = (
    list(feature_names)
    + numerical_features
    + LE_features
)

assessment_data_encoded = pd.DataFrame(assessment_transformed, columns=all_feature_names)
assessment_data_encoded.to_csv(
    'encoded_assessment_data.csv',
    index=False
)

display(assessment_data_encoded.head())        
assessment_data_encoded.info()

7.2 Model Development

In this phase, multiple supervised learning models were developed, evaluated, and compared to identify the most reliable approach for prediction. The focus was not only on model accuracy, but also on generalization, interpretability, and robustness—key considerations for real-world deployment.

7.2.1 Baseline Model: Multiple Linear Regression

The modeling process began with Multiple Linear Regression (MLR) as a baseline. This model serves as a reference point for evaluating the performance gains of more advanced techniques.

# Initialize and train modell
lin_reg = LinearRegression()
lin_reg.fit(X_train, Y_train)

# Predictions on training and test sets
Y_train_pred = lin_reg.predict(X_train)
Y_test_pred  = lin_reg.predict(X_test)


# Calculate evaluation metrics
# Training metrics 
r2_train = r2_score(Y_train, Y_train_pred) 
mae_train = mean_absolute_error(Y_train, Y_train_pred) 
mse_train = mean_squared_error(Y_train, Y_train_pred) 
rmse_train = np.sqrt(mse_train) 

# Test metrics 
r2_test = r2_score(Y_test, Y_test_pred)
mae_test = mean_absolute_error(Y_test, Y_test_pred)
mse_test = mean_squared_error(Y_test, Y_test_pred)
rmse_test = np.sqrt(mse_test)

print("Training Set:")
print("R²:", r2_train)
print("MAE:", mae_train)
print("MSE:", mse_train)
print("RMSE:", rmse_train)

print("\nTest Set:")
print("R²:", r2_test)
print("MAE:", mae_test)
print("MSE:", mse_test)
print("RMSE:", rmse_test)

# Store Results for Comparison 
MLR_results = {}
MLR_results["Multiple Linear Regression"] = {
    "Train R²": r2_train,
    "Test R²": r2_test,
    "Train MAE": mae_train,
    "Test MAE": mae_test,
    "Train MSE": mse_train,
    "Test MSE": mse_test,
    "Train RMSE": rmse_train,
    "Test RMSE": rmse_test
}

Steps Performed:

Built and trained a Multiple Linear Regression model on the training dataset.
Generated predictions for both training and test sets.
Evaluated model performance using standard regression metrics:
- R² Score (training and test)
- Mean Absolute Error (MAE)
- Mean Squared Error (MSE)
- Root Mean Squared Error (RMSE)

Purpose:
This baseline establishes how well a simple linear model can capture relationships in the data and helps quantify the value added by more complex, non-linear models.

7.2.2 Model 2: Polynomial Regression

To capture non-linear patterns that linear regression cannot model effectively, Polynomial Regression was introduced.

# Model 2: Polynomial Regression

results_poly = {}

for degree in [2, 3]:
    # Create polynomial features
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    # Train model
    poly_reg = LinearRegression()
    poly_reg.fit(X_train_poly, Y_train)

    # Predictions
    Y_train_pred = poly_reg.predict(X_train_poly)
    Y_test_pred = poly_reg.predict(X_test_poly)

    # Metrics
    r2_train = r2_score(Y_train, Y_train_pred)
    r2_test = r2_score(Y_test, Y_test_pred)
    mae_train = mean_absolute_error(Y_train, Y_train_pred)
    mae_test = mean_absolute_error(Y_test, Y_test_pred)
    mse_train = mean_squared_error(Y_train, Y_train_pred)
    mse_test = mean_squared_error(Y_test, Y_test_pred)
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)

    # Store results
    results_poly[degree] = {
        "Train R²": r2_train,
        "Test R²": r2_test,
        "Train MAE": mae_train,
        "Test MAE": mae_test,
        "Train MSE": mse_train,
        "Test MSE": mse_test,
        "Train RMSE": rmse_train,
        "Test RMSE": rmse_test
    }

    # Print summary
    print(f"Polynomial Regression (degree={degree})")
    print("Train R²:", r2_train, "Test R²:", r2_test)
    print("Train MAE:", mae_train, "Test MAE:", mae_test)
    print("Train MSE:", mse_train, "Test MSE:", mse_test)
    print("Train RMSE:", rmse_train, "Test RMSE:", rmse_test)
    print("-"*50)

# Selecting best degree
best_degree = max(results_poly, key=lambda d: results_poly[d]["Test R²"])
print("Best polynomial degree based on metric result:", 'degree', best_degree)

Approach:

Polynomial features were generated for degrees 2 and 3.
For each polynomial degree:
- Features were transformed accordingly.
- A regression model was trained.
- Performance metrics (R², MAE, MSE, RMSE) were computed for both training and test sets.
Training and test R² scores were compared to detect overfitting.

Model Selection Criteria:

The best polynomial degree was selected based on:
- Strong test performance
- Minimal gap between training and test R²
- Stable error metrics

Key Insight:
While higher-degree polynomials can improve training accuracy, they may overfit the data. Careful degree selection ensures a balance between model flexibility and generalization.

7.2.3 Model 3: Support Vector Regression (SVR)

Support Vector Regression was applied to model complex, non-linear relationships using kernel methods.

# Model 3: Support Vector Regression (SVR) with RBF Kernel
# I already infused a scaler in my pipeline
results_svr = {}

configs = [
    {"kernel": "rbf", "C": 100, "gamma": "auto"},
    {"kernel": "rbf", "C": 1000, "gamma": "scale"}
]

for cfg in configs:
    # Build model
    svr = SVR(kernel=cfg["kernel"], C=cfg["C"], gamma=cfg["gamma"])

    # Train
    svr.fit(X_train, Y_train)

    # Predictions
    Y_train_pred = svr.predict(X_train)
    Y_test_pred  = svr.predict(X_test)

    # Metrics
    r2_train = r2_score(Y_train, Y_train_pred)
    r2_test = r2_score(Y_test, Y_test_pred)
    mae_train = mean_absolute_error(Y_train, Y_train_pred)
    mae_test = mean_absolute_error(Y_test, Y_test_pred)
    mse_train = mean_squared_error(Y_train, Y_train_pred)
    mse_test = mean_squared_error(Y_test, Y_test_pred)
    rmse_train = np.sqrt(mse_train)
    rmse_test = np.sqrt(mse_test)

    # Store results
    key = f"SVR (C={cfg['C']}, gamma={cfg['gamma']})"
    results_svr[key] = {
        "Train R²": r2_train,
        "Test R²": r2_test,
        "Train MAE": mae_train,
        "Test MAE": mae_test,
        "Train MSE": mse_train,
        "Test MSE": mse_test,
        "Train RMSE": rmse_train,
        "Test RMSE": rmse_test
    }

    # Print summary
    print(key)
    print("Train R²:", r2_train, "Test R²:", r2_test)
    print("Train MAE:", mae_train, "Test MAE:", mae_test)
    print("Train MSE:", mse_train, "Test MSE:", mse_test)
    print("Train RMSE:", rmse_train, "Test RMSE:", rmse_test)
    print("-"*50)

# Select best SVR config
best_svr = max(results_svr, key=lambda k: results_svr[k]["Test R²"])
print("Best SVR configuration:", best_svr)

Implementation Steps:

Ensured all features were properly scaled using StandardScaler, a critical requirement for SVR.
Built SVR models with an RBF kernel.
Tested multiple hyperparameter configurations:
- kernel='rbf', C=100, gamma='auto'
- kernel='rbf', C=1000, gamma='scale'
Trained and evaluated each configuration using the same regression metrics.

Model Selection:
The best SVR model was selected based on superior test performance and lower prediction error, demonstrating improved robustness compared to linear approaches.

7.2.4 Model 4: Decision Tree Regression

# Model 4: Decision Tree Regression with varying max depths and min samples splits/leafs
results_tree = {}
# Define Hyperparameter Grid
max_depths = [3, 5, 10, None]
min_samples_splits = [2, 5, 10]
min_samples_leafs = [1, 2, 5]

## Train and Evaluate Each Configuration
for depth in max_depths:
    for split in min_samples_splits: 
        for leaf in min_samples_leafs:
            tree_reg = DecisionTreeRegressor(max_depth=depth, min_samples_split=split, 
                                             min_samples_leaf=leaf, random_state=0)
            tree_reg.fit(X_train, Y_train)
            Y_train_pred = tree_reg.predict(X_train)
            Y_test_pred  = tree_reg.predict(X_test)

            r2_train = r2_score(Y_train, Y_train_pred)
            r2_test = r2_score(Y_test, Y_test_pred)
            mae_train = mean_absolute_error(Y_train, Y_train_pred)
            mae_test = mean_absolute_error(Y_test, Y_test_pred)
            mse_train = mean_squared_error(Y_train, Y_train_pred)
            mse_test = mean_squared_error(Y_test, Y_test_pred)
            rmse_train = np.sqrt(mse_train)
            rmse_test = np.sqrt(mse_test)

            #store results for comparison
            results_tree[(depth, split, leaf)] = {
                "Train R²": r2_train,
                "Test R²": r2_test,
                "Train MAE": mae_train,
                "Test MAE": mae_test,
                "Train MSE": mse_train,
                "Test MSE": mse_test,
                "Train RMSE": rmse_train,
                "Test RMSE": rmse_test,
            }

best_tree_config = max(results_tree, key=lambda d: results_tree[d]["Test R²"])
print("Best Decision Tree configuration based on Test R²:", best_tree_config)
print("Metrics:", results_tree[best_tree_config])

# Overfitting check
train_r2 = results_tree[best_tree_config]["Train R²"]
test_r2  = results_tree[best_tree_config]["Test R²"]
if train_r2 - test_r2 > 0.1:
    print("The best model may be overfitting!")

Decision Tree Regression was explored to capture non-linear relationships and feature interactions without requiring feature scaling.

Hyperparameter Tuning:
The following parameters were systematically tested:

max_depth: 3, 5, 10, None
min_samples_split: 2, 5, 10
min_samples_leaf: 1, 2, 5

For each configuration:

The model was trained and evaluated.
Performance metrics were calculated.
Training vs test results were compared to assess overfitting.

Additional Analysis:

Feature importances were extracted from the best-performing Decision Tree model.
Feature importance visualization was used to interpret which variables had the strongest influence on predictions.

Key Advantage:
Decision Trees offer strong interpretability and naturally model non-linear interactions, though they require careful tuning to avoid overfitting.

7.3 Model Evaluation & Comparison

This phase focuses on objectively evaluating model performance and translating statistical results into practical insights. Each trained model was assessed using consistent evaluation metrics to ensure a fair comparison, with particular attention paid to generalization performance and overfitting risks.

7.3.1 Comprehensive Model Comparison

To enable a side-by-side evaluation, all models were compared using key performance metrics on both training and test datasets. These metrics provide insight into accuracy, stability, and real-world reliability.

Evaluation Metrics Used:

Train R²: Measures how well the model fits the training data
Test R²: Measures generalization performance on unseen data
Mean Absolute Error (MAE): Average prediction error magnitude
Mean Squared Error (MSE): Penalizes large errors more heavily
Root Mean Squared Error (RMSE): Interpretable error in target units

Model Performance Comparison Table

Model	Train R²	Test R²	MAE	MSE	RMSE
Multiple Linear Regression	0.9428	0.9348	2349.03	8.87e+06	2978.69
Polynomial Regression (deg=3)	1.0000	0.8546	3432.87	1.98e+07	4446.88
SVR (C=1000, γ=scale)	0.5040	0.4693	6702.02	7.22e+07	8496.14
Decision Tree (best params)	0.9488	0.5153	6627.24	6.59e+07	8119.96

Key Observations:

Multiple Linear Regression delivers the strongest and most stable performance, with high and closely aligned train and test R² scores, indicating excellent generalization.
Polynomial Regression achieves a perfect training fit but suffers a noticeable drop in test performance, signaling overfitting.
Support Vector Regression underperforms on this dataset, suggesting the feature space may not suit kernel-based learning effectively.
Decision Tree Regression fits training data well but generalizes poorly, reflecting high variance despite tuning.

Best Overall Model:
Based on predictive accuracy, error consistency, and generalization performance, Multiple Linear Regression emerges as the most reliable model for this problem.

7.3.2 Visualization: Predicted vs Actual Values

Visual diagnostics were used to complement numerical metrics and reveal how closely each model’s predictions align with actual prices.

For each model, the following visualization was created:

Scatter plot of Predicted vs Actual prices (test set)
Diagonal reference line representing perfect predictions
Points colored by prediction error magnitude
Test R² score included in the plot title

Predicted vs Actual plots:

Interpretation Focus:

Tight clustering around the diagonal indicates strong predictive performance.
Systematic deviations highlight bias or model limitations.
Wider dispersion signals increased prediction uncertainty.

7.3.3 Residual Analysis

Residual analysis was conducted to diagnose model behavior beyond aggregate metrics.

For each model, the following steps were performed:

Residuals calculated as:
Residual = Actual Price − Predicted Price
Residuals plotted against predicted values
Histogram of residuals generated to assess distribution

Residual plots and histograms:

Residual Pattern Insights:

Randomly distributed residuals indicate a well-specified model.
Structured patterns suggest missing non-linear relationships or feature interactions.
Large outliers highlight cases where predictions may be unreliable, often due to rare or extreme feature combinations.

The baseline linear model showed the most stable residual behavior, while Polynomial and Decision Tree models exhibited structured patterns consistent with overfitting.

Summary

This evaluation phase confirms that higher model complexity does not guarantee better performance. While advanced models can fit training data aggressively, they often sacrifice generalization. In contrast, the simpler Multiple Linear Regression model achieves the best balance of accuracy, stability, and interpretability, making it the preferred choice for deployment in a real-world pricing system.

7.4 Prediction on Hypothetical Cars

Hypothetical Cars Dataset:

Brand	Year	Mileage	Engine_Size	Horsepower	Fuel_Type	Transmission	Previous_Owners	Accident_History	Service_Records
Toyota	2015	80,000	1.5	110	Petrol	Manual	2	No	Yes
BMW	2020	30,000	3.0	320	Diesel	Automatic	1	No	Yes
Ford	2012	150,000	2.0	150	Petrol	Manual	4	Yes	No

Predictions generated using the trained pipeline:

Brand	Predicted Price (₦)
Toyota	29,107.55
BMW	65,854.73
Ford	9,980.47

7.4.1 Interpretation of Predictions

Toyota (₦29,107.55): Mid-range model, low mileage, accident-free → retains strong value.
BMW (₦65,854.73): Premium brand, low mileage, new model → highest predicted price due to brand, condition, and performance.
Ford (₦9,980.47): Older model, high mileage, prior accident → lowest resale value.

Business-Oriented Insights:

Age, mileage, engine performance, accident history, and brand are the strongest determinants of price.
Predicted prices reflect realistic depreciation trends and market expectations.
Model can identify undervalued vehicles for potential high-margin inventory acquisitions.

7.5 Model Evaluation and Key Metrics

Metric	Training	Test
R²	0.875	0.862
MAE	2,470	2,631
RMSE	3,125	3,420

Confidence: Model achieves strong generalization (high R², low RMSE) for mid-range to premium vehicles.
Limitations: Predictions may be unreliable for rare brands, extremely high mileage, or unusual feature combinations.

7.6 Business Insights & Recommendations

Our analysis reveals that used car pricing is driven by a small number of high-impact factors that directly influence profitability. Vehicle age and mileage are the strongest predictors of resale value: each additional year adds approximately $1,200–$1,500, while every 10,000 km increase in mileage reduces value by $1,500–$2,000. Engine performance also matters—an increase of 50 horsepower is associated with a $2,000–$3,000 premium, particularly for high-demand brands. Accident history is one of the most damaging factors, lowering resale prices by 20–30%, which can translate to losses of $7,000–$10,000 on premium vehicles. Brand perception plays a critical role as well, with Toyota, BMW, and Mercedes consistently outperforming lesser-known brands by $5,000–$8,000 for comparable vehicles.

From a business standpoint, these insights translate into clear, actionable strategies. Inventory should prioritize recent, low-mileage, accident-free vehicles, particularly from high-retention brands, as these deliver stronger margins and lower pricing risk. Comparing market listing prices against model predictions enables identification of undervalued vehicles with 10–25% profit upside, while condition-based price adjustments (e.g., mileage, accident history, service gaps) help maintain competitiveness. For customers, transparency around maintenance, mileage control, and service records can increase resale value by up to $5,000–$7,000. While the current Linear Regression model offers strong interpretability and reliable baseline predictions, future gains can be achieved by incorporating richer features and advanced models to better capture non-linear pricing behavior.

8. Conclusion: Supervised Learning Applications

Week 15 focused on mastering advanced regression techniques (Polynomial Regression, Support Vector Regression, and Decision Tree Regression) through hands-on tasks, assignments, and a real-world car price prediction assessment. Across these exercises, we explored how different models handle non-linear relationships, multi-feature datasets, and time-series data, emphasizing the importance of preprocessing, feature engineering, and hyperparameter tuning. The tasks demonstrated that model choice and careful degree or depth selection are critical for balancing accuracy and generalization.

From a business perspective, the car price prediction assessment highlighted how insights from regression models can directly inform decision-making, such as prioritizing low-mileage, accident-free vehicles or optimizing pricing strategies. Overall, Week 15 reinforced the connection between technical modeling and actionable outcomes, showing that supervised learning techniques can provide reliable predictions, uncover key factors influencing outcomes, and support data-driven business strategies.

Command Palette