The Machine Does Learn: A Journey Through Regression, Feature Engineering, and Model Optimization.

Introduction

When we hear the phrase “the machine learns,” it often conjures images of futuristic robots or self-driving cars. In practice, machine learning is usually far more grounded. It begins with data, patterns, and the careful minimization of error.

In this project, I worked with different real-world datasets to examine how regression models behave under common challenges such as multicollinearity, feature scaling, and model complexity. I applied core regression techniques (Linear Regression, Ridge Regression, and Lasso Regression) to explore how regularization and feature engineering transform raw predictors into actionable insights. Using GridSearchCV, I further tuned hyperparameters to optimize model performance and generalization.

This article documents the technical decisions made, the measurable results achieved, and the insights uncovered along the way. It is both a narrative of exploration and a practical guide for practitioners interested in how regression models evolve from simple baselines to carefully optimized solutions.

Phase 1: Foundational Techniques - Shaping the Data for Learning

Task 1: Missing Data Management

In real‑world datasets, missing values are inevitable. They can arise from human error, incomplete surveys, or system glitches. Left untreated, missing data can distort statistical analysis and weaken model performance. The first foundational technique in our journey was detecting, analyzing, and imputing missing values.

Before imputation, I examined the numerical variables (Age, Income, Product Rating) using boxplots and skewness statistics.

Boxplots revealed no significant outliers.
Skewness values were close to zero, indicating approximately symmetric distributions.
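The diagnostic above can be sketched in a few lines. This is a minimal illustration on made-up values: the `sample` frame and the 0.5 cutoff are assumptions, not taken from the project. It uses pandas' `skew()` and falls back to the median when a column looks heavily skewed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with missing values (illustrative numbers only)
sample = pd.DataFrame({
    "Age": [25, 32, np.nan, 45, 33, 31],
    "Income": [45000, 62000, np.nan, 78000, np.nan, 60000],
})

SKEW_CUTOFF = 0.5  # assumed rule of thumb for "approximately symmetric"

strategies = {}
for col in sample.columns:
    skewness = sample[col].skew()  # NaNs are ignored by default
    strategies[col] = "mean" if abs(skewness) < SKEW_CUTOFF else "median"
    print(f"{col}: skewness={skewness:.3f} -> impute with {strategies[col]}")
```

The chosen strategy string can be passed directly to `SimpleImputer(strategy=...)`.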

This diagnostic step was crucial: it confirmed that mean imputation would not materially bias the data, since the distributions were not heavily skewed. I then applied two imputation strategies: mean imputation for the numerical columns and mode imputation for the categorical column, which in this case was City.

# Working with missing data
import numpy as np
from sklearn.impute import SimpleImputer

# Mean imputation for numerical columns
num_cols = ['Age', 'Income', 'Product_Rating']
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
task1_data[num_cols] = mean_imputer.fit_transform(task1_data[num_cols])

# Mode imputation for categorical columns
mode_imputer = SimpleImputer(strategy='most_frequent')
task1_data[['City']] = mode_imputer.fit_transform(task1_data[['City']])
display(task1_data)

# Verify that all missing values have been handled
print(task1_data.isnull().sum())

Results After Imputation

Name	Age	City	Income	Product_Rating
John	25	New York	45000.00	4.50
Sarah	32	Los Angeles	62000.00	4.80
Emily	28	Houston	61416.67	4.70
David	45	Phoenix	78000.00	4.52
Jessica	33	Dallas	61416.67	4.80
Michelle	31	San Francisco	61416.67	4.70

Outcome: The dataset was transformed into a complete, robust structure, ready for downstream regression tasks.

Task 2: Encoding Categorical Variables

Machine learning models are mathematical at their core. They thrive on numbers, not words. While humans can easily understand categories like “Mumbai” or “Electronics”, algorithms cannot process them directly. Encoding bridges this gap by converting categorical variables into numerical representations without distorting their meaning.

Dataset Shape Before Encoding: (20, 6)

Index	CustomerID	City	Product_Type	Age	Purchase_Amount	Purchased
0	C001	Mumbai	Electronics	28	15000	Yes
1	C002	Delhi	Clothing	35	3500	No
2	C003	Bangalore	Electronics	42	22000	Yes
3	C004	Mumbai	Furniture	29	8500	No
4	C005	Chennai	Electronics	31	18000	Yes

# --- ENCODING CATEGORICAL INDEPENDENT VARIABLES ---
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder

X = dataset[['City', 'Product_Type', 'Age', 'Purchase_Amount']]
Y = dataset['Purchased']

# One-hot encode the two categorical columns; pass Age and Purchase_Amount through unchanged
column_transformer = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(sparse_output=False), ['City', 'Product_Type'])
    ],
    remainder='passthrough'
)

X_encoded = column_transformer.fit_transform(X)
encoded_feature_names = column_transformer.get_feature_names_out()

X_encoded_df = pd.DataFrame(X_encoded, columns=encoded_feature_names)

#--- ENCODING CATEGORICAL DEPENDENT VARIABLE (target)---

le = LabelEncoder()
Y_encoded = le.fit_transform(Y)
Y_encoded_df = pd.DataFrame(Y_encoded, columns=["Purchased"])

# Concatenate numerical features + encoded categorical + target
final_df = pd.concat([X_encoded_df, Y_encoded_df], axis=1)

# Display final shape and first 5 rows
print("Dataset Shape After Encoding:", final_df.shape)
display(final_df.head())

Dataset Shape After Encoding: (20, 10)

Index	onehot__City_Bangalore	onehot__City_Chennai	onehot__City_Delhi	onehot__City_Mumbai	onehot__Product_Type_Clothing	onehot__Product_Type_Electronics	onehot__Product_Type_Furniture	remainder__Age	remainder__Purchase_Amount	Purchased
0	0.0	0.0	0.0	1.0	0.0	1.0	0.0	28.0	15000.0	1
1	0.0	0.0	1.0	0.0	1.0	0.0	0.0	35.0	3500.0	0
2	1.0	0.0	0.0	0.0	0.0	1.0	0.0	42.0	22000.0	1
3	0.0	0.0	0.0	1.0	0.0	0.0	1.0	29.0	8500.0	0
4	0.0	1.0	0.0	0.0	0.0	1.0	0.0	31.0	18000.0	1

MY ENCODING APPROACH

The dataset contained categorical independent variables (City and Product_Type), numerical variables (Age and Purchase_Amount), and a categorical dependent variable (Purchased). OneHotEncoder was applied to City and Product_Type to convert each category into binary features without introducing ordinal relationships. LabelEncoder was used to transform the binary target variable (Purchased) into numerical form. The shapes and sample rows of the dataset were examined before and after encoding to verify the correctness of the transformations.

Task 3: Feature Scaling Comparison

Not all numerical features speak the same language.

In this dataset, every feature is numeric, but their ranges are wildly different:

Age: 23–46
Annual_Salary: 32,000–108,000
Years_Experience: 1–23
Performance_Score: 71–95

If left unscaled, features like Annual_Salary dominate distance calculations and gradient updates simply because of their magnitude—not because they are more informative. This silently biases many machine learning models.

The question is not “Can the model run without scaling?”
The real question is “Can the model learn fairly without scaling?”
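A quick way to see the problem is to measure how much of a Euclidean distance comes from each feature. The two employees below are hypothetical, with values picked from within the ranges listed above.

```python
import numpy as np

# Two hypothetical employees: [Age, Annual_Salary, Years_Experience, Performance_Score]
a = np.array([25.0, 40000.0, 2.0, 75.0])
b = np.array([45.0, 41000.0, 20.0, 90.0])

squared_diffs = (a - b) ** 2
salary_share = squared_diffs[1] / squared_diffs.sum()

print("Euclidean distance:", np.sqrt(squared_diffs.sum()))
print(f"Share of squared distance due to salary: {salary_share:.4f}")
```

Even though these two people differ by 20 years of age and 18 years of experience, a $1,000 salary gap accounts for almost all of the unscaled distance.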

The Solution: Standardization (Z-score Scaling)

I applied StandardScaler, which transforms each feature by rescaling to:

Mean ≈ 0
Standard deviation ≈ 1

Critical safeguard: No data leakage

The scaler is fit only on the training set, then applied to both training and test data.

Implementation

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd

# Features
X = task3_data[['Age', 'Annual_Salary', 'Years_Experience', 'Performance_Score']]

# Train-test split (80/20)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Initialize scaler
scaler = StandardScaler()

# Fit on training data only
X_train_scaled = scaler.fit_transform(X_train)

# Transform test data
X_test_scaled = scaler.transform(X_test)

Statistical Proof: Before vs After Scaling

Features	Mean (Before Scaling)	Mean (After Scaling)	Standard Deviation (Before Scaling)	Standard Deviation (After Scaling)
Age	34.28	3.44e-16	7.009	1.025978
Annual_Salary	66920.00	4.44e-17	23806.372	1.025978
Years_Experience	10.48	-1.22e-16	6.965	1.025978
Performance_Score	83.48	-3.55e-16	7.545	1.025978

All features are now:

Centered around zero
On the same scale
Directly comparable

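(The standard deviations come out slightly above 1 because StandardScaler divides by the population standard deviation (ddof=0), while the values in the table match the sample standard deviation (ddof=1). The ratio between the two is √(n/(n−1)), which is ≈ 1.026 for a training set of n = 20 rows.)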
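The gap between 1 and 1.026 can be reproduced with a short check. This sketch uses random ages and assumes a 20-row training set; it compares the population and sample standard deviations of a column after StandardScaler.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 20  # assumed training-set size
train = pd.DataFrame({"Age": rng.integers(23, 47, size=n).astype(float)})

scaled = pd.DataFrame(StandardScaler().fit_transform(train), columns=train.columns)

pop_std = scaled["Age"].std(ddof=0)  # what StandardScaler normalizes to 1
sample_std = scaled["Age"].std()     # pandas default, ddof=1

print(pop_std, sample_std, np.sqrt(n / (n - 1)))
```

The sample standard deviation equals √(n/(n−1)) regardless of the data, which is why every feature in the table shows the same value.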

VISUALIZATION

Visualizing the distributions before and after scaling highlighted the effect of standardization. In the box plots before scaling, Annual_Salary dominated because of its larger numerical range, while the other features appeared compressed near the bottom of the axis. After scaling, all features were centered around zero with approximately unit variance, making them directly comparable. The computed mean and standard deviation of each feature confirmed that the transformed features had means very close to zero and standard deviations of approximately one.

Overall, this process ensured that all features contributed proportionally during model training, improving model stability, convergence speed, and interpretability.

Phase 2: Assignments - Building and Evaluating Regression Models

In this phase, we worked on simple and multiple linear regression models.

Assignment 1: Simple Linear Regression Analysis

Dataset: assignment2_advertising_sales.csv

The goal of this analysis is to implement and evaluate a simple linear regression model to understand the relationship between advertising spend and sales revenue. This helps the company optimize its marketing budget by quantifying how changes in advertising investment affect revenue.

Deliverable

This section provides:

A complete regression analysis pipeline.
Visualizations of the relationship and regression line.
Model performance metrics (R², MSE, RMSE).
Regression equation and prediction example.
Actionable business recommendations for optimizing advertising spend.

Data Preparation

We begin by loading the dataset assignment2_advertising_sales.csv and performing basic exploration. The dataset contains two key variables:

Advertising Spend (X) — measured in thousands of dollars.
Sales Revenue (Y) — measured in thousands of dollars.

import pandas as pd
import matplotlib.pyplot as plt

# Load dataset
ad_sales = pd.read_csv('assignment2_advertising_sales.csv')

# Quick exploration
print(ad_sales.head())
print(ad_sales.info())
print(ad_sales.describe())

# Scatter plot to visualize relationship
plt.scatter(ad_sales['Advertising_Spend'], ad_sales['Sales_Revenue'], color='blue')
plt.xlabel('Advertising Spend (in $1000s)')
plt.ylabel('Sales Revenue (in $1000s)')
plt.title('Scatter Plot: Advertising Spend vs Sales Revenue')
plt.show()

Explanation:

The scatter plot provides a first look at the relationship. A clear upward trend indicates that higher advertising spend is associated with higher sales revenue, suggesting linear regression is appropriate.

Model Building

We split the dataset into training (70%) and test (30%) sets to evaluate generalization. Then, we fit a simple linear regression model.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Define features and target
X = ad_sales[['Advertising_Spend']]
Y = ad_sales['Sales_Revenue']

# Train-test split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

# Build and fit model
lr_model = LinearRegression()
lr_model.fit(X_train, Y_train)

# Display coefficients
print("Intercept:", lr_model.intercept_)
print("Slope:", lr_model.coef_[0])

Explanation:

The intercept (≈ 38.97) represents the predicted baseline sales revenue, about $38,970, when advertising spend is zero.
The slope (≈ 4.86) indicates that each additional $1,000 spent on advertising is associated with an increase of about $4,860 in sales revenue.

Predictions

We generate predictions for both training and test sets and compare them to actual values.

# Predictions
train_pred = lr_model.predict(X_train)
test_pred = lr_model.predict(X_test)

# Compare first 10 predictions vs actual on Test Set
comparison = pd.DataFrame({
    'Actual': Y_test[:10].values,
    'Predicted': test_pred[:10]
})
print(comparison)

Result:

First 10 Test Set Predictions:

Index	Actual (Test)	Predicted (Test)
0	256.8	255.729433
1	204.6	203.725528
2	142.5	140.057196
3	135.9	132.766929
4	165.3	167.760211
5	276.2	276.628198
6	233.2	235.802703
7	131.6	129.850822
8	174.3	175.536496
9	207.9	207.613671

Explanation:

The predicted values closely align with actual sales revenue, confirming the model captures the relationship well.

Visualization

We visualize the regression line overlaid on the scatter plots for both training and test sets.

# Training set visualization
plt.scatter(X_train, Y_train, color='blue', label='Actual')
plt.plot(X_train, train_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Advertising Spend (in $1000s)')
plt.ylabel('Sales Revenue (in $1000s)')
plt.title('Training Set: Advertising Spend vs Sales Revenue')
plt.legend()
plt.show()

# Test set visualization
plt.scatter(X_test, Y_test, color='green', label='Actual')
plt.plot(X_test, test_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Advertising Spend (in $1000s)')
plt.ylabel('Sales Revenue (in $1000s)')
plt.title('Test Set: Advertising Spend vs Sales Revenue')
plt.legend()
plt.show()

Explanation:

The regression line fits the scatter plots well, reinforcing the strong linear relationship between advertising spend and sales revenue.

Model Evaluation

We evaluate the model using R², MSE, and RMSE.

from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

# Training metrics (Y_train / Y_test match the names created by the split above)
r2_train = r2_score(Y_train, train_pred)
mse_train = mean_squared_error(Y_train, train_pred)
rmse_train = np.sqrt(mse_train)

# Test metrics
r2_test = r2_score(Y_test, test_pred)
mse_test = mean_squared_error(Y_test, test_pred)
rmse_test = np.sqrt(mse_test)

print("Training R²:", r2_train)
print("Training MSE:", mse_train)
print("Training RMSE:", rmse_train)
print("Test R²:", r2_test)
print("Test MSE:", mse_test)
print("Test RMSE:", rmse_test)

Result:

Metric	Training Set	Test Set
R²	0.9970	0.9982
MSE	7.2404	4.0575
RMSE	2.6908	2.0143

Explanation:

R² close to 1 indicates the model explains most of the variance in sales revenue.
Low MSE and RMSE confirm accurate predictions with minimal error.
Similar train and test scores show the model generalizes well without overfitting.
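For readers who want to see what these metrics compute, the sketch below derives them by hand with NumPy from the first ten test rows shown in the comparison table earlier. Because it uses only ten rows, the numbers will differ somewhat from the full test-set figures reported above.

```python
import numpy as np

# First ten test-set rows from the comparison table above
actual = np.array([256.8, 204.6, 142.5, 135.9, 165.3, 276.2, 233.2, 131.6, 174.3, 207.9])
predicted = np.array([255.729433, 203.725528, 140.057196, 132.766929, 167.760211,
                      276.628198, 235.802703, 129.850822, 175.536496, 207.613671])

residuals = actual - predicted
mse = np.mean(residuals ** 2)                   # mean squared error
rmse = np.sqrt(mse)                             # same units as the target ($1000s)
ss_res = np.sum(residuals ** 2)                 # unexplained variation
ss_tot = np.sum((actual - actual.mean()) ** 2)  # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R²={r2:.4f}")
```

RMSE is often the easiest of the three to communicate, since it is expressed in the same units as sales revenue.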

Business Insights

Regression Equation: The fitted regression line is:

Y = 38.97 + 4.86X

where Y = Sales Revenue (in $1000s), X = Advertising Spend (in $1000s).

Prediction for $50,000 Advertising Spend: Since $50,000 = 50 (in thousands),

Y = 38.97 + 4.86 × 50 = 281.97

→ Expected sales revenue ≈ $281,970.
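The same arithmetic can be wrapped in a small helper for scenario planning. The function name `predict_revenue` and the extra budget levels are illustrative assumptions; only the intercept and slope come from the fitted model above.

```python
INTERCEPT = 38.97  # baseline revenue in $1000s (from the fitted model above)
SLOPE = 4.86       # revenue change in $1000s per $1000 of ad spend

def predict_revenue(ad_spend_thousands: float) -> float:
    """Return predicted sales revenue in $1000s for a given ad spend in $1000s."""
    return INTERCEPT + SLOPE * ad_spend_thousands

# Hypothetical budget scenarios, in $1000s
for spend in (20, 50, 80):
    print(f"Ad spend ${spend}k -> predicted revenue ${predict_revenue(spend):.2f}k")
```

Predictions are most trustworthy for budgets inside the range of spend observed in the data; values far outside it are extrapolations.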

Recommendations
- Optimize Advertising Budget Allocation: Increasing advertising consistently boosts revenue. The company should scale investment gradually while monitoring ROI.
- Forecasting and Planning: Use the regression model to forecast revenue under different advertising budgets, aiding in realistic target setting.
- Monitor Diminishing Returns: While the current data shows a strong linear trend, returns may plateau. Continuous data collection and model re‑evaluation are essential to detect diminishing returns.

Assignment 2: Multiple Linear Regression Analysis

The aim was to build a multiple regression model to predict startup monthly profit using several business metrics, and then refine it through backward elimination to identify the most significant drivers of profitability.

Data Preprocessing (briefly noted)

We loaded the dataset containing 58 startups with variables such as R&D Spend, Marketing Spend, Administration Cost, Employee Count, Location, and Profit.

The categorical variable Location was encoded into dummy variables.
To avoid the dummy variable trap, one category was dropped.
The dataset was split into training (80%) and test (20%) sets.
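As a brief illustration of the dummy variable trap, the sketch below encodes a hypothetical Location column with pandas and drops one category. The rows are made up; the category names follow the Location values mentioned in this section.

```python
import pandas as pd

# Hypothetical rows (not the startup dataset)
startups = pd.DataFrame({"Location": ["Urban", "Rural", "Suburban", "Urban"]})

# drop_first=True removes the first category in sorted order ("Rural"),
# which then acts as the baseline that the other dummies are compared against
dummies = pd.get_dummies(startups, columns=["Location"], drop_first=True, dtype=int)
print(dummies)
```

Keeping all three dummies would make them sum to 1 in every row, which duplicates the intercept column and makes the regression coefficients non-unique.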

With preprocessing complete, we moved straight into model building.

Initial Model

The first regression model included all features: R&D Spend, Marketing Spend, Administration Cost, Employee Count, and the Location dummies (Urban and Suburban).

P-values (in descending order):

Location_Urban	0.843570
Employee_Count	0.598936
Administration_Cost	0.398697
Location_Suburban	0.365000
Marketing_Spend	0.032131
RD_Spend	0.000579
const	0.000003

The model produced a high R² (≈ 0.971), indicating that these variables collectively explained a large portion of profit variance.

Issue: Several predictors had high p-values (> 0.05), meaning their coefficients were not statistically distinguishable from zero. In other words, there was no evidence that they contributed to explaining profit.

This is where backward elimination came in.

Backward Elimination

Using OLS regression with p-values, we iteratively removed features whose p-values exceeded the 0.05 significance threshold, refitting the model after each removal.

Step 1: Administration Cost was dropped first. Its p-value was high, showing no statistically significant impact on profit.
Step 2: Employee Count was removed next. Despite being an intuitive predictor, it showed no statistically significant relationship with profit.
Step 3: The Location dummies (Urban and Suburban; the third category had already been dropped as the baseline) were eliminated. Neither had a significant coefficient, suggesting that office location did not help predict profit in this dataset.

After these eliminations, only R&D Spend and Marketing Spend remained. Both were statistically significant (R&D: p < 0.001, Marketing: p ≈ 0.03).
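The elimination loop can be sketched as follows. This is an illustrative version under stated assumptions: it computes OLS p-values with NumPy and SciPy rather than the OLS summary used in the project, it runs on synthetic data rather than the startup dataset, and it always removes the feature with the highest p-value first, which may differ from the order described above. The helper names are hypothetical.

```python
import numpy as np
from scipy import stats

def ols_p_values(X, y):
    """Two-sided t-test p-values for OLS coefficients; X must include a constant column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    sigma2 = residuals @ residuals / (n - k)  # residual variance
    std_err = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return 2 * stats.t.sf(np.abs(beta / std_err), df=n - k)

def backward_elimination(X, y, names, threshold=0.05):
    """Drop the least significant feature, refit, and repeat until all p-values <= threshold."""
    names = list(names)
    design = np.column_stack([np.ones(len(y)), X])  # column 0 is the intercept
    while names:
        p_values = ols_p_values(design, y)[1:]      # ignore the intercept
        worst = int(np.argmax(p_values))
        if p_values[worst] <= threshold:
            break
        design = np.delete(design, worst + 1, axis=1)
        names.pop(worst)
    return names

# Synthetic data: the target depends only on the first two columns
rng = np.random.default_rng(0)
n = 58
X = rng.uniform(0, 100, size=(n, 4))
y = 50 + 0.8 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 5, size=n)
names = ["rd_spend", "marketing_spend", "admin_cost", "employee_count"]

kept = backward_elimination(X, y, names)
print("Retained features:", kept)
```

Refitting after every removal matters because p-values change once a correlated feature leaves the model.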

Optimized Model

The optimized regression model was leaner, using only R&D Spend and Marketing Spend.

Performance:

Metric	Initial Model	Optimized Model	Improvement (%)
R²	0.9711	0.9741	0.30
Adjusted R²	0.9647	0.9683	0.37
MSE	9.977e+07	8.959e+07	10.20
RMSE	9,988.32	9,465.26	5.24

Adjusted R² improved from 0.9647 to 0.9683, showing the model explained profit more efficiently with fewer variables.
MSE dropped by about 10.2% and RMSE by about 5.2%, meaning predictions were closer to actual profits.
Interpretation: This is a favorable trade-off: a simpler model that performs slightly better because it focuses only on the strongest predictors.

Visualization

This visualization compares the actual profit with the initial model's and the optimized model's predictions for the first ten test samples. The closeness of the bars indicates that the optimized model predicts profit accurately with minimal deviation.

Business Recommendations

From the analysis, five clear strategies emerge:

Double down on R&D: Innovation is the lifeblood of profit. Startups should prioritize R&D budgets, even if it means trimming administrative expenses. In this model, R&D Spend is the most statistically significant predictor of profit.
Strategic marketing, not scattergun: Marketing spend matters, but it must be smart. Focus on channels that amplify innovative products. Tie campaigns to product launches and customer feedback loops rather than burning cash on generic ads.
Cut the noise: Administration costs and employee headcount showed no significant link to profit. Keep overhead lean. Adopt remote work or shared services where possible. Efficiency is key.
Location is overrated: Whether urban or suburban, location didn't significantly affect profit. In today's digital economy, customers care more about product quality and visibility than office address. Invest in online presence rather than fancy headquarters.
Balanced growth strategy: Think of R&D as the engine and marketing as the fuel. Without R&D, marketing is hype with no substance. Without marketing, R&D is a hidden gem nobody knows about. The two must work hand in hand.

Final Takeaway

The optimized model shows that simplicity beats complexity. By focusing on R&D and Marketing Spend, startups can predict profit more accurately and design smarter strategies. The lesson is clear: invest in innovation, amplify it with targeted marketing, and keep everything else lean.

REAL WORLD DATA SET - MEDICAL INSURANCE COST

Data source: Medical Insurance Cost Prediction

Predicting medical insurance charges is a classic supervised learning problem with real business impact—pricing fairness, risk management, and targeted wellness programs depend on it. In this publication, we build a robust, reproducible pipeline to model insurance charges using regularized linear models: Ridge (L2) and Lasso (L1). We emphasize:

Data preprocessing: log-transforming skewed targets, encoding categorical variables, scaling numericals.
Modeling rigor: train/test splits, cross-validation, hyperparameter tuning via GridSearchCV.
Interpretability: Lasso’s feature selection to identify drivers of cost.
Reproducibility: Colab + Google Drive mounting with explicit dataset paths.

Technically, the target variable (charges) is log-transformed to stabilize variance and improve linear model fit. Categorical features are one-hot encoded with a dropped baseline to avoid multicollinearity, and numerical features are standardized to ensure regularization behaves consistently across scales. We evaluate models using R², Adjusted R², MAE, MSE, and RMSE—balancing explanatory power and error magnitude.

Data preparation

Dataset loading and target transformation

We stored the dataset in Google Drive and accessed it via Colab for reproducibility. The target (charges) is log-transformed to reduce right skew and better satisfy linear-model assumptions.

import pandas as pd
import numpy as np

# Load dataset from Google Drive (adjust path to your folder)
data_path = '/content/drive/My Drive/Week14_Datasets/insurance.csv'
insurance_data = pd.read_csv(data_path)

# Log-transform the target to stabilize variance and reduce skew
y = np.log(insurance_data['charges'])

# Features: drop target
X = insurance_data.drop('charges', axis=1)

Implication for interpretation: Predictions are in log-space; if you need original units, exponentiate predictions (np.exp).
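A short sketch of the back-transformation follows. The charges and the log-space errors are made-up numbers standing in for real model output; the point is only to show how `np.exp` returns predictions to dollars so that errors can be reported in original units.

```python
import numpy as np

# Hypothetical charges in dollars (illustrative values, not from the dataset)
charges = np.array([1500.0, 4200.0, 9800.0, 32000.0])

y_log = np.log(charges)  # what the model is trained on

# Pretend model output in log-space: the true value plus a small error
pred_log = y_log + np.array([0.1, -0.2, 0.05, -0.1])

pred_dollars = np.exp(pred_log)  # back-transform to original units
mae_dollars = np.mean(np.abs(charges - pred_dollars))

print("Predictions ($):", np.round(pred_dollars, 2))
print("MAE ($):", round(mae_dollars, 2))
```

An error of 0.1 in log-space corresponds to a multiplicative error of about 10.5% (since exp(0.1) ≈ 1.105), so the same log error means a much larger dollar error on expensive policies.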

Feature Engineering and Preprocessing

We separate features by type: categorical vs numerical. We use OneHotEncoder with drop='first' to avoid the dummy variable trap and StandardScaler for the numerical features. The smoker column is binary, so it is mapped manually to 0/1 and passed through unchanged.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Define feature groups
onehot_features = ['sex', 'region']
numeric_features = ['age', 'bmi', 'children']

# Map smoker to binary (yes=1, no=0)
X['smoker'] = X['smoker'].map({'yes': 1, 'no': 0})

# ColumnTransformer: one-hot for categoricals, scale numericals, pass smoker through
preprocessor = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(drop='first', sparse_output=False), onehot_features),
        ('num', StandardScaler(), numeric_features)
    ],
    remainder='passthrough'  # keeps 'smoker' as-is
)

Train/test split and preprocessing application

We split the data to evaluate generalization and fit the preprocessor on the training set only to avoid leakage.

from sklearn.model_selection import train_test_split

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit preprocessor on training data; transform both train and test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Retrieve feature names post-transformation for interpretability
feature_names = preprocessor.get_feature_names_out()

Data leakage prevention: Fit transformations (scaling, encoding) on training data only; apply to test data using the fitted parameters.
Feature names: Useful for mapping coefficients back to human-readable features, especially for Lasso interpretation.

Modeling Approach

# Four models: Ridge and Lasso, each with and without GridSearchCV tuning
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV

# --- Ridge (baseline and tuned with GridSearchCV) ---

# Fit baseline Ridge regression
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_processed, y_train)

# Ridge predictions without GridSearchCV
ridgeY1_train_pred = ridge_model.predict(X_train_processed)
ridgeY1_test_pred = ridge_model.predict(X_test_processed)

# Ridge regression with GridSearchCV
ridge_params = {'alpha': [0.01, 0.1, 1, 10, 100]}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='r2')
ridge_grid.fit(X_train_processed, y_train)

best_ridge = ridge_grid.best_estimator_
ridgeY2_train_pred = best_ridge.predict(X_train_processed)
ridgeY2_test_pred = best_ridge.predict(X_test_processed)

# --- Lasso (baseline and tuned with GridSearchCV) ---

# Fit baseline Lasso regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_processed, y_train)

# Lasso predictions without GridSearchCV
lassoY1_train_pred = lasso_model.predict(X_train_processed)
lassoY1_test_pred = lasso_model.predict(X_test_processed)

# Lasso regression with GridSearchCV
lasso_params = {'alpha': [0.001, 0.01, 0.1, 1, 10], 'max_iter': [10000, 50000]}
lasso_grid = GridSearchCV(Lasso(), lasso_params, cv=5, scoring='r2')
lasso_grid.fit(X_train_processed, y_train)

best_lasso = lasso_grid.best_estimator_
lassoY2_train_pred = best_lasso.predict(X_train_processed)
lassoY2_test_pred = best_lasso.predict(X_test_processed)

Two regularized linear regression techniques were implemented:

Ridge Regression (L2 penalty): Shrinks coefficients but keeps all features. It’s robust against multicollinearity and stabilizes the model.
Lasso Regression (L1 penalty): Shrinks some coefficients to zero, effectively performing feature selection. This improves interpretability by highlighting the most influential predictors.
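The difference between the two penalties is easy to see on synthetic data. In this sketch (random data, with alpha values chosen only for illustration) just two of ten features drive the target; Ridge keeps small non-zero weights on all of them, while Lasso sets irrelevant ones exactly to zero.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: only the first two of ten features influence the target
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

print("Ridge zero coefficients:", int(np.sum(ridge.coef_ == 0)))
print("Lasso zero coefficients:", int(np.sum(lasso.coef_ == 0)))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
```

Note that Lasso also shrinks the coefficients it keeps, so a larger alpha trades some accuracy on the true drivers for a sparser model.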

Both models were evaluated in two configurations:

Baseline: Using a fixed alpha (regularization strength).
Optimized: Using GridSearchCV to tune hyperparameters with cross-validation.

Baseline and tuned models (context preview)

I trained four models:

Ridge (Baseline): alpha=1.0
Ridge (Optimized): tuned via GridSearchCV
Lasso (Baseline): alpha=0.1
Lasso (Optimized): tuned via GridSearchCV

EVALUATION

Evaluation covers both train and test metrics to check for under- or overfitting and the impact of tuning. The optimized Lasso coefficients are then used to identify retained versus dropped features, which yields interpretable business insights.

All four models were scored with R², Adjusted R², MAE, MSE, and RMSE through a single evaluation function, so every model is measured the same way.
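The evaluation function itself is not shown in this article, so the following is a plausible sketch rather than the original code. The name `evaluate` and its signature are assumptions; the metrics come from scikit-learn, and Adjusted R² uses the standard formula 1 − (1 − R²)(n − 1)/(n − p − 1).

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred, n_features):
    """Return R², Adjusted R², MAE, MSE, and RMSE for one set of predictions."""
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2,
        "Adj R2": adj_r2,
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": np.sqrt(mse),
    }

# Tiny worked example: every prediction is off by 0.5
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = y_true + 0.5
metrics = evaluate(y_true, y_pred, n_features=1)
print(metrics)
```

In the project's setting, the function would be called once per model and split, for example with the training targets, the training predictions, and the number of processed feature columns.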

RESULTS

The table below reports all four models on the training and test sets. Because the target was log-transformed, MAE, MSE, and RMSE are in log units of charges:

Index	Model	R²	Adj R²	MAE	MSE	RMSE
0	Ridge Train (Baseline)	0.757211	0.749711	0.282444	0.201573	0.448969
1	Ridge Train (Optimized)	0.757227	0.749728	0.282130	0.201559	0.448954
2	Lasso Train (Baseline)	0.647795	0.636916	0.373393	0.292414	0.540753
3	Lasso Train (Optimized)	0.757148	0.749647	0.281922	0.201625	0.449027
4	Ridge Test (Baseline)	0.804598	0.798563	0.270409	0.175694	0.419158
5	Ridge Test (Optimized)	0.804719	0.798687	0.269763	0.175585	0.419029
6	Lasso Test (Baseline)	0.681453	0.671614	0.383556	0.286418	0.535181
7	Lasso Test (Optimized)	0.804075	0.798023	0.270082	0.176164	0.419719

VISUALIZATIONS

Interpretation of Results

Ridge Regression: Both baseline and optimized versions consistently achieved R² ≈ 0.80 on test data, with low error metrics. Ridge is stable and reliable even without tuning.
Lasso Regression: Baseline performance was weaker (R² ≈ 0.68), but after tuning, Lasso matched Ridge’s accuracy (R² ≈ 0.80) and error metrics.
Train vs Test: Ridge showed consistent train/test performance, indicating no overfitting. Lasso baseline underfit, but optimization corrected this.
Error Metrics: Ridge and optimized Lasso both achieved test MAE ≈ 0.27 and RMSE ≈ 0.42 (in log units of charges, since the target was log-transformed), indicating comparable predictive accuracy.

Feature Selection

Out of curiosity, I wanted to know which features most influenced the Lasso model's predictions, so I used the model's ability to perform feature selection by shrinking some coefficients exactly to zero.

# Coefficients from the optimized Lasso model
lasso_coefs = best_lasso.coef_

# Create a DataFrame of features and coefficients
lasso_coef_df = pd.DataFrame({
    "Feature": feature_names,
    "Coefficient": lasso_coefs
})

# Identify dropped features (coefficients = 0)
dropped_features = lasso_coef_df[lasso_coef_df["Coefficient"] == 0]
retained_features = lasso_coef_df[lasso_coef_df["Coefficient"] != 0]

print("Dropped Features (Coefficient = 0):")
display(dropped_features)

print("Retained Features (Non-zero Coefficients):")
display(retained_features)

Insights

1. Retained Features

The optimized Lasso model kept all 8 features with non-zero coefficients:

Feature	Coefficient	Technical Interpretation
`remainder__smoker`	+1.545	Strongest positive driver. Smoking status massively increases predicted charges.
`num__age`	+0.481	Older age correlates with higher charges.
`num__children`	+0.111	More dependents slightly increase costs.
`num__bmi`	+0.080	Higher BMI adds moderate risk.
`onehot__sex_male`	−0.070	Males have slightly lower charges compared to females (baseline).
`onehot__region_northwest`	−0.040	Northwest residents pay less compared to Northeast (baseline).
`onehot__region_southeast`	−0.119	Southeast residents pay less compared to Northeast.
`onehot__region_southwest`	−0.106	Southwest residents pay less compared to Northeast.

2. Dropped Features

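The optimized Lasso model did not shrink any coefficient to zero, so the model itself dropped no features. The only dummy variables missing from the table are the baseline categories removed during encoding (drop='first'). That means:

Female is the baseline for sex.
Northeast is the baseline for region.

Their effects are absorbed into the model intercept, and all coefficients are interpreted relative to them.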

3. Technical Takeaways

Smoker status dominates: Its coefficient (+1.545) is more than three times the next largest (age, +0.481) and over ten times each of the remaining ones, confirming smoking is the single most important predictor.
Age is substantial: A steady positive effect, showing charges rise with age.
BMI and children are moderate: They contribute, but less dramatically.
Region and sex are comparative: Negative coefficients show relative reductions compared to the dropped baselines.
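Because the target is log(charges), each coefficient acts multiplicatively on the original scale. The sketch below exponentiates a few coefficients from the retained-features table to express them as approximate percentage effects. For the 0/1 features shown here, exp(coefficient) is the ratio of predicted charges relative to the baseline with all other features held fixed; for the standardized numeric features it would be the ratio per one standard deviation instead.

```python
import numpy as np

# Coefficients copied from the retained-features table above (log-charges scale)
coefficients = {
    "remainder__smoker": 1.545,
    "onehot__sex_male": -0.070,
    "onehot__region_southeast": -0.119,
}

smoker_multiplier = np.exp(coefficients["remainder__smoker"])

for feature, coef in coefficients.items():
    multiplier = np.exp(coef)
    print(f"{feature}: x{multiplier:.3f} ({(multiplier - 1) * 100:+.1f}% vs baseline)")
```

Read this way, the smoker coefficient implies predicted charges several times higher for smokers than for otherwise identical non-smokers, while the sex and region effects amount to single-digit or low double-digit percentage differences.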

4. Business Implications

Pricing strategy: Smoking and age should be the primary drivers of premium differentiation.
Wellness programs: Target BMI and smoking cessation to reduce claims.
Family coverage: Incremental pricing for dependents ensures fairness.
Regional adjustments: Premiums should reflect geographic cost differences.
Gender differences: Too small to justify pricing changes; focus on lifestyle factors instead.

Together, these implications show insurers how to balance risk-based pricing with customer-centric wellness initiatives. Smoking and age should drive premium differentiation, while BMI, children, and regional differences offer opportunities for nuanced strategies. Gender differences are minor, so the focus should remain on lifestyle and geography.

Final Conclusion

Across the series of assignments and the real‑world project, a clear progression emerges: from foundational data preprocessing, through simple regression, into multiple regression with feature selection, and finally into applied business insights. Each stage built upon the last, sharpening both technical skills and strategic thinking.

Data Preprocessing: I established a robust pipeline to clean, encode, and scale data. This ensured that every subsequent model was built on reliable inputs, highlighting the importance of preparation in data science.
Simple Linear Regression: By modeling the relationship between advertising spend and sales revenue, I demonstrated how even a single predictor can yield actionable insights. The regression equation provided a straightforward forecasting tool, and the business recommendations showed how companies can optimize budgets with confidence.
Multiple Linear Regression with Feature Selection: Expanding to multiple predictors, I applied backward elimination to strip away noise and reveal the true drivers of startup profit. The optimized model proved leaner yet more accurate, underscoring the principle that simplicity often outperforms complexity when guided by statistical rigor.
Real-World Project (Insurance Charges): Using Ridge and Lasso regression, I tackled a practical problem of predicting medical insurance costs. Ridge offered stability, while Lasso added interpretability by selecting the most influential features. The business insights translated technical findings into strategies for pricing, wellness programs, and market expansion.

Big Picture

Together, these works illustrate the end‑to‑end journey of applied machine learning:

Preparation ensures data integrity.
Modeling captures relationships and patterns.
Evaluation validates accuracy and generalization.
Feature selection sharpens focus on what truly matters.
Business insights bridge the gap between numbers and strategy.

The overarching lesson is clear: data science is most powerful when technical precision meets business relevance. By combining rigorous modeling with actionable recommendations, we’ve shown how analytics can guide smarter decisions, optimize resources, and unlock growth.
