The Machine Does Learn: A Journey Through Regression, Feature Engineering, and Model Optimization.

Introduction
When we hear the phrase “the machine learns,” it often conjures images of futuristic robots or self-driving cars. In practice, machine learning is usually far more grounded. It begins with data, patterns, and the careful minimization of error.
In this project, I worked with different real-world datasets to examine how regression models behave under common challenges such as multicollinearity, feature scaling, and model complexity. I applied core regression techniques (Linear Regression, Ridge Regression, and Lasso Regression) to explore how regularization and feature engineering transform raw predictors into actionable insights. Using GridSearchCV, I further tuned hyperparameters to optimize model performance and generalization.
This article documents the technical decisions made, the measurable results achieved, and the insights uncovered along the way. It is both a narrative of exploration and a practical guide for practitioners interested in how regression models evolve from simple baselines to carefully optimized solutions.
Phase 1: Foundational Techniques - Shaping the Data for Learning
Task 1: Missing Data Management
In real‑world datasets, missing values are inevitable. They can arise from human error, incomplete surveys, or system glitches. Left untreated, missing data can distort statistical analysis and weaken model performance. The first foundational technique in our journey was detecting, analyzing, and imputing missing values.
Before imputation, I examined the numerical variables (Age, Income, Product Rating) using boxplots and skewness statistics.
Boxplots revealed no significant outliers.
Skewness values were close to zero, indicating approximately symmetric distributions.
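The skewness check described above can be sketched as follows. This is a minimal illustration on invented values, not the project's data: the column names mirror the article, and the 0.5 cut-off is a common rule of thumb assumed here rather than a threshold taken from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; column names mirror the article, values are invented.
sample = pd.DataFrame({
    "Age": [20, 25, 30, np.nan, 35, 40],    # symmetric around 30
    "Income": [1, 1, 2, 2, 50, np.nan],     # strongly right-skewed
})

# DataFrame.skew() skips NaN by default, so it can run before imputation.
skewness = sample.skew()
print(skewness)

# Assumed rule of thumb: |skew| < 0.5 is roughly symmetric,
# so the mean is a reasonable fill value for that column.
roughly_symmetric = skewness.abs() < 0.5
print(roughly_symmetric)
```

For a column flagged as skewed, the median is the usual alternative to the mean, because it is less sensitive to the long tail.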
This diagnostic step was crucial: it confirmed that mean imputation would not noticeably distort the data, since the distributions were not heavily skewed. I then applied two imputation strategies: mean imputation for the numerical columns and mode imputation for the categorical column, which in this case was City.
# Working with missing data
import numpy as np
from sklearn.impute import SimpleImputer
# Mean imputation for numerical columns
num_cols = ['Age', 'Income', 'Product_Rating']
mean_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
task1_data[num_cols] = mean_imputer.fit_transform(task1_data[num_cols])
# Mode imputation for the categorical column
mode_imputer = SimpleImputer(strategy='most_frequent')
task1_data[['City']] = mode_imputer.fit_transform(task1_data[['City']])
display(task1_data)
# Verify that all missing values have been handled
print(task1_data.isnull().sum())
Results After Imputation
| Name | Age | City | Income | Product_Rating |
| John | 25 | New York | 45000.00 | 4.50 |
| Sarah | 32 | Los Angeles | 62000.00 | 4.80 |
| Emily | 28 | Houston | 61416.67 | 4.70 |
| David | 45 | Phoenix | 78000.00 | 4.52 |
| Jessica | 33 | Dallas | 61416.67 | 4.80 |
| Michelle | 31 | San Francisco | 61416.67 | 4.70 |
Outcome: The dataset was transformed into a complete, robust structure, ready for downstream regression tasks.
Task 2: Encoding Categorical Variables
Machine learning models are mathematical at their core. They thrive on numbers, not words. While humans can easily understand categories like “Mumbai” or “Electronics”, algorithms cannot process them directly. Encoding bridges this gap by converting categorical variables into numerical representations without distorting their meaning.
Dataset Shape Before Encoding: (20, 6)
| Index | CustomerID | City | Product_Type | Age | Purchase_Amount | Purchased |
| 0 | C001 | Mumbai | Electronics | 28 | 15000 | Yes |
| 1 | C002 | Delhi | Clothing | 35 | 3500 | No |
| 2 | C003 | Bangalore | Electronics | 42 | 22000 | Yes |
| 3 | C004 | Mumbai | Furniture | 29 | 8500 | No |
| 4 | C005 | Chennai | Electronics | 31 | 18000 | Yes |
#---ENCODING CATEGORICAL INDEPENDENT VARIABLE ---
X = dataset[['City', 'Product_Type', 'Age', 'Purchase_Amount']]
Y = dataset['Purchased']
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
# One-hot encode the two categorical columns; numeric columns pass through unchanged
column_transformer = ColumnTransformer(
    transformers=[
        ('onehot', OneHotEncoder(sparse_output=False), ['City', 'Product_Type'])
    ],
    remainder='passthrough'
)
X_encoded = column_transformer.fit_transform(X)
encoded_feature_names = column_transformer.get_feature_names_out()
X_encoded_df = pd.DataFrame(X_encoded, columns=encoded_feature_names)
#--- ENCODING CATEGORICAL DEPENDENT VARIABLE (target)---
le = LabelEncoder()
Y_encoded = le.fit_transform(Y)
Y_encoded_df = pd.DataFrame(Y_encoded, columns=["Purchased"])
# Concatenate numerical features + encoded categorical + target
final_df = pd.concat([X_encoded_df, Y_encoded_df], axis=1)
# Display final shape and first 5 rows
print("Dataset Shape After Encoding:", final_df.shape)
display(final_df.head())
Feature matrix shape after encoding: (20, 10)
| Index | onehot__City_Bangalore | onehot__City_Chennai | onehot__City_Delhi | onehot__City_Mumbai | onehot__Product_Type_Clothing | onehot__Product_Type_Electronics | onehot__Product_Type_Furniture | remainder__Age | remainder__Purchase_Amount | Purchased |
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 28.0 | 15000.0 | 1 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 35.0 | 3500.0 | 0 |
| 2 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 42.0 | 22000.0 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 29.0 | 8500.0 | 0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 31.0 | 18000.0 | 1 |
MY ENCODING APPROACH
The dataset contained categorical independent variables (City and Product_Type), numerical variables (Age and Purchase_Amount), and a categorical dependent variable (Purchased). OneHotEncoder was applied to City and Product_Type to convert each category into binary features without introducing ordinal relationships. LabelEncoder was used to transform the binary target variable (Purchased) into numerical form. The shapes and sample rows of the dataset were examined before and after encoding to verify the correctness of the transformations.
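To make the target encoding concrete, here is a small sketch of how LabelEncoder assigns integers. The label values below are illustrative stand-ins for the Purchased column; the point is that classes are sorted, so "No" maps to 0 and "Yes" to 1, which matches the encoded table above.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative target values; the real column comes from the dataset.
purchased = ["Yes", "No", "Yes", "No", "Yes"]

le = LabelEncoder()
encoded = le.fit_transform(purchased)

# classes_ is sorted, so position 0 is "No" and position 1 is "Yes".
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(encoded.tolist())

# inverse_transform recovers the original labels, which helps when reporting.
decoded = le.inverse_transform(encoded).tolist()
print(decoded)
```

Checking `classes_` after fitting is a cheap way to confirm which label became the positive class before interpreting any downstream metric.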
Task 3: Feature Scaling Comparison
Not all numerical features speak the same language.
In this dataset, every feature is numeric, but their ranges are wildly different:
Age: 23–46
Annual_Salary: 32,000–108,000
Years_Experience: 1–23
Performance_Score: 71–95
If left unscaled, features like Annual_Salary dominate distance calculations and gradient updates simply because of their magnitude—not because they are more informative. This silently biases many machine learning models.
The question is not “Can the model run without scaling?”
The real question is “Can the model learn fairly without scaling?”
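A quick way to see the problem is to measure how much each feature contributes to a Euclidean distance. The two employees below are hypothetical, chosen only to fall inside the ranges listed above.

```python
import numpy as np

feature_names = ["Age", "Annual_Salary", "Years_Experience", "Performance_Score"]

# Two hypothetical employees (values invented for illustration).
a = np.array([25.0, 40000.0, 2.0, 80.0])
b = np.array([45.0, 41000.0, 20.0, 90.0])

# Share of the squared Euclidean distance contributed by each feature.
squared_diff = (a - b) ** 2
share = squared_diff / squared_diff.sum()

for name, s in zip(feature_names, share):
    print(f"{name}: {s:.4%} of squared distance")
```

Even though the two people differ by 20 years of age and 18 years of experience, a salary gap of only 1,000 accounts for almost the entire distance on the raw scale.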
The Solution: Standardization (Z-score Scaling)
I applied StandardScaler, which transforms each feature by rescaling to:
Mean ≈ 0
Standard deviation ≈ 1
Critical safeguard: No data leakage
The scaler is fit only on the training set, then applied to both training and test data.
Implementation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import pandas as pd
# Features
X = task3_data[['Age', 'Annual_Salary', 'Years_Experience', 'Performance_Score']]
# Train-test split (80/20)
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)
# Initialize scaler
scaler = StandardScaler()
# Fit on training data only
X_train_scaled = scaler.fit_transform(X_train)
# Transform test data
X_test_scaled = scaler.transform(X_test)
Statistical Proof: Before vs After Scaling
| Features | Mean (Before Scaling) | Mean (After Scaling) | Standard Deviation (Before Scaling) | Standard Deviation (After Scaling) |
| Age | 34.28 | 3.44e-16 | 7.009 | 1.025978 |
| Annual_Salary | 66920.00 | 4.44e-17 | 23806.372 | 1.025978 |
| Years_Experience | 10.48 | -1.22e-16 | 6.965 | 1.025978 |
| Performance_Score | 83.48 | -3.55e-16 | 7.545 | 1.025978 |
All features are now:
Centered around zero
On the same scale
Directly comparable
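(The after-scaling standard deviation is 1.025978 rather than exactly 1 because StandardScaler divides by the population standard deviation (ddof=0), while the table reports the sample standard deviation (ddof=1). The ratio between the two is √(n/(n−1)), and √(20/19) ≈ 1.025978, which matches a 20-row training set.)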
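The sketch below reproduces a standard deviation of about 1.026 on synthetic data. It assumes a 20-row training set (inferred from the reported value, not stated in the article) and uses randomly generated ages in place of the real column.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a 20-row training split (assumption).
rng = np.random.default_rng(0)
X_train = pd.DataFrame({"Age": rng.normal(34, 7, size=20)})

scaled = StandardScaler().fit_transform(X_train)[:, 0]

pop_std = scaled.std(ddof=0)      # what StandardScaler normalizes to 1
sample_std = scaled.std(ddof=1)   # what pandas .std() reports by default

print("population std:", pop_std)
print("sample std:    ", sample_std)
print("sqrt(20/19):   ", np.sqrt(20 / 19))
```

Either convention is fine for modeling; the difference only matters when comparing reported statistics across libraries.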
VISUALIZATION


Visualizing the distributions before and after scaling highlighted the effect of standardization. In the box plots before scaling, Annual_Salary dominated because of its larger numerical range, while the other features appeared compressed near the bottom of the axis. After scaling, all features were centered around zero with approximately unit variance, making them directly comparable. The computed means were very close to zero and the standard deviations close to one (about 1.026 in the table above).
Overall, this process ensured that all features contributed proportionally during model training, improving model stability, convergence speed, and interpretability.
Phase 2: Assignments - Building and Evaluating Regression Models
In this phase, we worked on simple and multiple linear regression models.
Assignment 1: Simple Linear Regression Analysis
Dataset : assignment2_advertising_sales.csv
The goal of this analysis is to implement and evaluate a simple linear regression model to understand the relationship between advertising spend and sales revenue. This helps the company optimize its marketing budget by quantifying how changes in advertising investment affect revenue.
Deliverable
This section provides:
A complete regression analysis pipeline.
Visualizations of the relationship and regression line.
Model performance metrics (R², MSE, RMSE).
Regression equation and prediction example.
Actionable business recommendations for optimizing advertising spend.
Data Preparation
We begin by loading the dataset assignment2_advertising_sales.csv and performing basic exploration. The dataset contains two key variables:
Advertising Spend (X) — measured in thousands of dollars.
Sales Revenue (Y) — measured in thousands of dollars.
import pandas as pd
import matplotlib.pyplot as plt
# Load dataset
ad_sales = pd.read_csv('assignment2_advertising_sales.csv')
# Quick exploration
print(ad_sales.head())
print(ad_sales.info())
print(ad_sales.describe())
# Scatter plot to visualize relationship
plt.scatter(ad_sales ['Advertising_Spend'], ad_sales ['Sales_Revenue'], color='blue')
plt.xlabel('Advertising Spend (in $1000s)')
plt.ylabel('Sales Revenue (in $1000s)')
plt.title('Scatter Plot: Advertising Spend vs Sales Revenue')
plt.show()

Explanation:
- The scatter plot provides a first look at the relationship. A clear upward trend indicates that higher advertising spend is associated with higher sales revenue, suggesting linear regression is appropriate.
Model Building
We split the dataset into training (70%) and test (30%) sets to evaluate generalization. Then, we fit a simple linear regression model.
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# Define features and target
X = ad_sales [['Advertising_Spend']]
Y = ad_sales ['Sales_Revenue']
# Train-test split
X_train, X_test, Y_train, Y_test = train_test_split(
X, Y, test_size=0.3, random_state=42
)
# Build and fit model
lr_model = LinearRegression()
lr_model.fit(X_train, Y_train)
# Display coefficients
print("Intercept:", lr_model.intercept_)
print("Slope:", lr_model.coef_[0])
Explanation:
The intercept (≈ 38.97) represents baseline sales revenue of about $38,970 when advertising spend is zero.
The slope (≈ 4.86) indicates that each additional $1,000 spent on advertising is associated with about $4,860 more sales revenue.
Predictions
We generate predictions for both training and test sets and compare them to actual values.
# Predictions
train_pred = lr_model.predict(X_train)
test_pred = lr_model.predict(X_test)
# Compare first 10 predictions vs actual on Test Set
comparison = pd.DataFrame({
'Actual': Y_test[:10].values,
'Predicted': test_pred[:10]
})
print(comparison)
Result:
First 10 Test Set Predictions:
| Index | Actual (Test) | Predicted (Test) |
| 0 | 256.8 | 255.729433 |
| 1 | 204.6 | 203.725528 |
| 2 | 142.5 | 140.057196 |
| 3 | 135.9 | 132.766929 |
| 4 | 165.3 | 167.760211 |
| 5 | 276.2 | 276.628198 |
| 6 | 233.2 | 235.802703 |
| 7 | 131.6 | 129.850822 |
| 8 | 174.3 | 175.536496 |
| 9 | 207.9 | 207.613671 |
Explanation:
- The predicted values closely align with actual sales revenue, confirming the model captures the relationship well.
Visualization
We visualize the regression line overlaid on the scatter plots for both training and test sets.
# Training set visualization
plt.scatter(X_train, Y_train, color='blue', label='Actual')
plt.plot(X_train, train_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Advertising Spend (in $1000s)')
plt.ylabel('Sales Revenue (in $1000s)')
plt.title('Training Set: Advertising Spend vs Sales Revenue')
plt.legend()
plt.show()
# Test set visualization
plt.scatter(X_test, Y_test, color='green', label='Actual')
plt.plot(X_test, test_pred, color='red', linewidth=2, label='Regression Line')
plt.xlabel('Advertising Spend (in $1000s)')
plt.ylabel('Sales Revenue (in $1000s)')
plt.title('Test Set: Advertising Spend vs Sales Revenue')
plt.legend()
plt.show()


Explanation:
- The regression line fits the scatter plots well, reinforcing the strong linear relationship between advertising spend and sales revenue.
Model Evaluation
We evaluate the model using R², MSE, and RMSE.
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# Training metrics (Y_train / Y_test are the names produced by the split above)
r2_train = r2_score(Y_train, train_pred)
mse_train = mean_squared_error(Y_train, train_pred)
rmse_train = np.sqrt(mse_train)
# Test metrics
r2_test = r2_score(Y_test, test_pred)
mse_test = mean_squared_error(Y_test, test_pred)
rmse_test = np.sqrt(mse_test)
print("Training R²:", r2_train)
print("Training MSE:", mse_train)
print("Training RMSE:", rmse_train)
print("Test R²:", r2_test)
print("Test MSE:", mse_test)
print("Test RMSE:", rmse_test)
Result:
| Index | Training Set Evaluation | Test Set Evaluation |
| R² | 0.9970183835851744 | 0.99816296421656 |
| MSE | 7.2403875565144995 | 4.057518903568578 |
| RMSE | 2.6907968255731425 | 2.01432840012957 |
Explanation:
R² close to 1 indicates the model explains most of the variance in sales revenue.
Low MSE and RMSE confirm accurate predictions with minimal error.
Similar train and test scores show the model generalizes well without overfitting.
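To show what these metrics measure, the sketch below computes them by hand on the ten test rows listed in the predictions table earlier. Because it uses only those ten rows, the numbers will differ somewhat from the full test-set figures reported above.

```python
import numpy as np

# Actual vs predicted values copied from the ten test rows shown earlier.
actual = np.array([256.8, 204.6, 142.5, 135.9, 165.3,
                   276.2, 233.2, 131.6, 174.3, 207.9])
predicted = np.array([255.729433, 203.725528, 140.057196, 132.766929, 167.760211,
                      276.628198, 235.802703, 129.850822, 175.536496, 207.613671])

residuals = actual - predicted

# MSE: mean of squared residuals; RMSE: same quantity in the target's units.
mse = np.mean(residuals ** 2)
rmse = np.sqrt(mse)

# R²: 1 minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((actual - actual.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  R2={r2:.4f}")
```

Since revenue is measured in $1000s, an RMSE near 2 corresponds to a typical error of roughly $2,000.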
Business Insights
- Regression equation: the fitted regression line is
Y = 38.97 + 4.86X
where Y = Sales Revenue (in $1000s), X = Advertising Spend (in $1000s).
- Prediction for $50,000 advertising spend: since $50,000 = 50 (in thousands),
Y = 38.97 + 4.86 × 50 = 281.97
→ Expected sales revenue ≈ $281,970.
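The same calculation can be wrapped in a small helper for scenario planning. The function name and the budget values are illustrative; the intercept and slope are the rounded coefficients reported above.

```python
# Rounded coefficients from the fitted model (both in $1000s).
INTERCEPT = 38.97
SLOPE = 4.86

def predict_sales(ad_spend_thousands: float) -> float:
    """Predicted sales revenue (in $1000s) for a given ad spend (in $1000s)."""
    return INTERCEPT + SLOPE * ad_spend_thousands

# Hypothetical budgets to compare.
for spend in (0, 25, 50):
    print(f"Ad spend ${spend}k -> predicted revenue ${predict_sales(spend):.2f}k")
```

Predictions are most trustworthy inside the range of advertising spend seen in the training data; extrapolating far beyond it assumes the linear trend continues.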
Recommendations
Optimize Advertising Budget Allocation: Increasing advertising consistently boosts revenue. The company should scale investment gradually while monitoring ROI.
Forecasting and Planning: Use the regression model to forecast revenue under different advertising budgets, aiding in realistic target setting.
Monitor Diminishing Returns: While the current data shows a strong linear trend, returns may plateau. Continuous data collection and model re‑evaluation are essential to detect diminishing returns.
Assignment 2: Multiple Linear Regression Analysis
The aim was to build a multiple regression model to predict startup monthly profit using several business metrics, and then refine it through backward elimination to identify the most significant drivers of profitability.
Data Preprocessing (briefly noted)
We loaded the dataset containing 58 startups with variables such as R&D Spend, Marketing Spend, Administration Cost, Employee Count, Location, and Profit.
The categorical variable Location was encoded into dummy variables.
To avoid the dummy variable trap, one category was dropped.
The dataset was split into training (80%) and test (20%) sets.
With preprocessing complete, we moved straight into model building.
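The preprocessing code for this assignment is not shown, so the following is only a sketch of one way to create the Location dummies while avoiding the dummy variable trap. It uses pandas get_dummies on invented rows; the actual project may have used a different encoder.

```python
import pandas as pd

# Invented rows; only the column names follow the article.
startups = pd.DataFrame({
    "Location": ["Urban", "Rural", "Suburban", "Urban"],
    "Profit": [120000, 95000, 101000, 133000],
})

# drop_first=True removes the first category in sorted order ("Rural"),
# which then acts as the baseline absorbed by the intercept.
encoded = pd.get_dummies(startups, columns=["Location"], drop_first=True, dtype=int)
print(encoded)
```

With Rural as the baseline, the remaining columns are Location_Suburban and Location_Urban, the same two dummies that appear in the p-value table below.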
Initial Model
The first regression model included all features: R&D Spend, Marketing Spend, Administration Cost, Employee Count, and the Location dummies (Urban and Suburban).
P-values (descending) Results:
| Location_Urban | 0.843570 |
| Employee_Count | 0.598936 |
| Administration_Cost | 0.398697 |
| Location_Suburban | 0.365000 |
| Marketing_Spend | 0.032131 |
| RD_Spend | 0.000579 |
| const | 0.000003 |
The model produced a high R² (about 0.971), indicating that these variables collectively explained most of the variance in profit.
- Issue: Several predictors had high p-values (> 0.05), meaning they were statistically insignificant. In other words, there was no evidence that they contributed to explaining profit once the other predictors were included.
This is where backward elimination came in.
Backward Elimination
Using OLS regression, we iteratively removed features whose p-values exceeded the 0.05 significance threshold.
Step 1: Administration Cost was dropped first; its p-value (≈ 0.40) was well above the threshold, showing no meaningful impact on profit.
Step 2: Employee Count was removed next. Although headcount seems intuitively relevant, it showed no statistically significant effect on profit once the other predictors were in the model.
Step 3: The Location dummies (Urban and Suburban, with Rural as the dropped baseline) were eliminated. Neither had a significant coefficient, so the data gave no evidence that office location helps predict profit.
After these eliminations, only R&D Spend and Marketing Spend remained. Both had strong statistical significance (R&D: p < 0.001, Marketing: p ≈ 0.03).
Optimized Model
The optimized regression model was leaner, using only R&D Spend and Marketing Spend.
Performance:
| Metric | Initial Model | Optimized Model | Improvement (%) |
| R² | 9.711073e-01 | 9.740541e-01 | 0.303448 |
| Adjusted R² | 9.646867e-01 | 9.682883e-01 | 0.373349 |
| MSE | 9.976647e+07 | 8.959117e+07 | 10.199117 |
| RMSE | 9.988317e+03 | 9.465261e+03 | 5.236672 |
Adjusted R² improved from 0.9647 to 0.9683, showing the model explained profit more efficiently with fewer variables.
MSE dropped by about 10.2% and RMSE by about 5.2%, meaning predictions were closer to actual profits.
Interpretation: This is the ideal balance: a simpler model that performs slightly better because it focuses only on the strongest predictors.
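The improvement percentages in the table can be verified directly from the reported error values. The helper name below is illustrative.

```python
# Error metrics as reported in the comparison table.
mse_initial, mse_optimized = 9.976647e7, 8.959117e7
rmse_initial, rmse_optimized = 9.988317e3, 9.465261e3

def pct_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100

mse_drop = pct_reduction(mse_initial, mse_optimized)
rmse_drop = pct_reduction(rmse_initial, rmse_optimized)

print(f"MSE reduction:  {mse_drop:.2f}%")
print(f"RMSE reduction: {rmse_drop:.2f}%")
```

RMSE falls by a smaller percentage than MSE because it is the square root of MSE, so the two figures describe the same improvement on different scales.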
Visualization

This visualization compares actual, initial predicted, and optimized predicted profit values for the first ten test samples. The closeness of the bars indicates that the optimized model predicts profit accurately with minimal deviation.
Business Recommendations
From the analysis, five clear strategies emerge:
Double down on R&D: Innovation is the lifeblood of profit. Startups should prioritize R&D budgets, even if it means trimming administrative expenses. In this data, R&D Spend is the strongest predictor of profit.
Strategic marketing, not scattergun: Marketing spend matters, but it must be smart. Focus on channels that amplify innovative products. Tie campaigns to product launches and customer feedback loops rather than burning cash on generic ads.
Cut the noise: Administration costs and employee headcount don’t directly translate to profit. Keep overhead lean. Adopt remote work or shared services where possible. Efficiency is key.
Location is overrated: Whether urban or suburban, location didn’t significantly affect profit. In today’s digital economy, customers care more about product quality and visibility than office address. Invest in online presence rather than fancy headquarters.
Balanced growth strategy: Think of R&D as the engine and marketing as the fuel. Without R&D, marketing is hype with no substance. Without marketing, R&D is a hidden gem nobody knows about. The two must work hand in hand.
Final Takeaway
The optimized model shows that simplicity beats complexity. By focusing on R&D and Marketing Spend, startups can predict profit more accurately and design smarter strategies. The lesson is clear: invest in innovation, amplify it with targeted marketing, and keep everything else lean.
REAL WORLD DATA SET - MEDICAL INSURANCE COST
Data source: Medical Insurance Cost Prediction
Predicting medical insurance charges is a classic supervised learning problem with real business impact—pricing fairness, risk management, and targeted wellness programs depend on it. In this publication, we build a robust, reproducible pipeline to model insurance charges using regularized linear models: Ridge (L2) and Lasso (L1). We emphasize:
Data preprocessing: log-transforming skewed targets, encoding categorical variables, scaling numericals.
Modeling rigor: train/test splits, cross-validation, hyperparameter tuning via GridSearchCV.
Interpretability: Lasso’s feature selection to identify drivers of cost.
Reproducibility: Colab + Google Drive mounting with explicit dataset paths.
Technically, the target variable (charges) is log-transformed to stabilize variance and improve linear model fit. Categorical features are one-hot encoded with a dropped baseline to avoid multicollinearity, and numerical features are standardized to ensure regularization behaves consistently across scales. We evaluate models using R², Adjusted R², MAE, MSE, and RMSE—balancing explanatory power and error magnitude.
Data preparation
Dataset loading and target transformation
We stored the dataset in Google Drive and accessed it via Colab for reproducibility. The target (charges) is log-transformed to reduce right skew and better satisfy the assumptions of a linear model.
import pandas as pd
import numpy as np
# Load dataset from Google Drive (adjust path to your folder)
data_path = '/content/drive/My Drive/Week14_Datasets/insurance.csv'
insurance_data = pd.read_csv(data_path)
# Log-transform the target to stabilize variance and reduce skew
y = np.log(insurance_data['charges'])
# Features: drop target
X = insurance_data.drop('charges', axis=1)
Implication for interpretation: Predictions are in log-space; if you need original units, exponentiate predictions (np.exp).
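As a small illustration of moving between log-space and original units, the sketch below uses invented charges and invented log-scale errors in place of real model output.

```python
import numpy as np

# Hypothetical charges in dollars (not taken from the dataset).
charges = np.array([1500.0, 4200.0, 13000.0, 48000.0])
log_charges = np.log(charges)

# Pretend model output: true log value plus a small log-scale error.
log_errors = np.array([0.05, -0.10, 0.02, -0.03])
log_pred = log_charges + log_errors

# Back-transform predictions to dollars.
pred_dollars = np.exp(log_pred)

# A log-scale error e corresponds to a relative error of exp(e) - 1.
relative_error = pred_dollars / charges - 1
print(np.round(pred_dollars, 2))
print(np.round(relative_error * 100, 2), "% relative error")
```

This is why metrics such as MAE and RMSE computed on the log target are best read as approximate proportional errors rather than dollar amounts.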
Feature Engineering and Preprocessing
We separate features by type: categorical vs numerical. We use OneHotEncoder with drop='first' to avoid the dummy variable trap and StandardScaler for numericals. The smoker column is binary, so it is mapped manually to 0/1 and passed through unchanged.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Define feature groups
onehot_features = ['sex', 'region']
numeric_features = ['age', 'bmi', 'children']
# Map smoker to binary (yes=1, no=0)
X['smoker'] = X['smoker'].map({'yes': 1, 'no': 0})
# ColumnTransformer: one-hot for categoricals, scale numericals, pass smoker through
preprocessor = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(drop='first', sparse_output=False), onehot_features),
('num', StandardScaler(), numeric_features)
],
remainder='passthrough' # keeps 'smoker' as-is
)
Train/test split and preprocessing application
We split the data to evaluate generalization and fit the preprocessor on the training set only to avoid leakage.
from sklearn.model_selection import train_test_split
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Fit preprocessor on training data; transform both train and test
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
# Retrieve feature names post-transformation for interpretability
feature_names = preprocessor.get_feature_names_out()
Data leakage prevention: Fit transformations (scaling, encoding) on training data only; apply to test data using the fitted parameters.
Feature names: Useful for mapping coefficients back to human-readable features, especially for Lasso interpretation.
Modeling Approach
# 4 Models
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import GridSearchCV
#--- Ridge (with and without hyperparameter tuning via GridSearchCV) ---
# Fit Ridge Regression (y_train is the log-transformed target from the split above)
ridge_model = Ridge(alpha=1.0)
ridge_model.fit(X_train_processed, y_train)
# Ridge Predictions without GridSearchCV
ridgeY1_train_pred = ridge_model.predict(X_train_processed)
ridgeY1_test_pred = ridge_model.predict(X_test_processed)
# Ridge Regression with GridSearchCV
ridge_params = {'alpha': [0.01, 0.1, 1, 10, 100]}
ridge_grid = GridSearchCV(Ridge(), ridge_params, cv=5, scoring='r2')
ridge_grid.fit(X_train_processed, y_train)
best_ridge = ridge_grid.best_estimator_
ridgeY2_train_pred = best_ridge.predict(X_train_processed)
ridgeY2_test_pred = best_ridge.predict(X_test_processed)
#--- Lasso (with and without hyperparameter tuning via GridSearchCV) ---
# Fit Lasso Regression
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_train_processed, y_train)
# Lasso Predictions without GridSearchCV
lassoY1_train_pred = lasso_model.predict(X_train_processed)
lassoY1_test_pred = lasso_model.predict(X_test_processed)
# Lasso Regression with GridSearchCV
lasso_params = {'alpha': [0.001, 0.01, 0.1, 1, 10], 'max_iter': [10000, 50000]}
lasso_grid = GridSearchCV(Lasso(), lasso_params, cv=5, scoring='r2')
lasso_grid.fit(X_train_processed, y_train)
best_lasso = lasso_grid.best_estimator_
lassoY2_train_pred = best_lasso.predict(X_train_processed)
lassoY2_test_pred = best_lasso.predict(X_test_processed)
Two regularized linear regression techniques were implemented:
Ridge Regression (L2 penalty): Shrinks coefficients but keeps all features. It’s robust against multicollinearity and stabilizes the model.
Lasso Regression (L1 penalty): Shrinks some coefficients to zero, effectively performing feature selection. This improves interpretability by highlighting the most influential predictors.
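The difference between the two penalties is easy to see on synthetic data where only some features matter. The example below is a sketch with invented data and arbitrarily chosen alpha values; it is not the insurance dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only the first two of six features influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.2).fit(X, y)

# Ridge shrinks coefficients toward zero but does not set them exactly to zero.
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
# Lasso can set coefficients of uninformative features exactly to zero.
n_zero_lasso = int(np.sum(lasso.coef_ == 0))

print("Ridge coefficients:", np.round(ridge.coef_, 3))
print("Lasso coefficients:", np.round(lasso.coef_, 3))
print("Exact zeros -> Ridge:", n_zero_ridge, "Lasso:", n_zero_lasso)
```

Whether Lasso actually drops anything depends on alpha and on the data: with a small alpha and features that all carry signal, every coefficient can remain non-zero.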
Both models were evaluated in two configurations:
Baseline: Using a fixed alpha (regularization strength).
Optimized: Using GridSearchCV to tune hyperparameters with cross-validation.
Baseline and tuned models (context preview)
I trained four models:
Ridge (Baseline): alpha=1.0
Ridge (Optimized): tuned via GridSearchCV
Lasso (Baseline): alpha=0.1
Lasso (Optimized): tuned via GridSearchCV
EVALUATION
Evaluation covers both train and test metrics to check for under- or overfitting and the impact of tuning. The optimized Lasso coefficients are then used to identify retained versus dropped features, giving us interpretable business insights.
All four models were scored with R², Adjusted R², MAE, MSE, and RMSE, using a single evaluation function so that the metrics are computed consistently.
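The evaluation function itself is not shown in this write-up. A minimal sketch of such a helper, assuming the standard adjusted R² formula and a hypothetical name `evaluate`, could look like this:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(y_true, y_pred, n_features):
    """Return R², adjusted R², MAE, MSE and RMSE for one set of predictions."""
    n = len(y_true)  # number of rows in the set being scored (train or test)
    r2 = r2_score(y_true, y_pred)
    # Adjusted R² penalizes R² for the number of predictors used.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "R2": r2,
        "Adj R2": adj_r2,
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }

# Toy example with invented numbers.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8, 5.0])
metrics = evaluate(y_true, y_pred, n_features=1)
print(metrics)
```

Note that `n` should be the size of whichever set is being scored, so train and test rows get their own adjustment.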
RESULTS
| Index | Model | R² | Adj R² | MAE | MSE | RMSE |
| 0 | Ridge Train (Baseline) | 0.757211 | 0.749711 | 0.282444 | 0.201573 | 0.448969 |
| 1 | Ridge Train (Optimized) | 0.757227 | 0.749728 | 0.282130 | 0.201559 | 0.448954 |
| 2 | Lasso Train (Baseline) | 0.647795 | 0.636916 | 0.373393 | 0.292414 | 0.540753 |
| 3 | Lasso Train (Optimized) | 0.757148 | 0.749647 | 0.281922 | 0.201625 | 0.449027 |
| 4 | Ridge Test (Baseline) | 0.804598 | 0.798563 | 0.270409 | 0.175694 | 0.419158 |
| 5 | Ridge Test (Optimized) | 0.804719 | 0.798687 | 0.269763 | 0.175585 | 0.419029 |
| 6 | Lasso Test (Baseline) | 0.681453 | 0.671614 | 0.383556 | 0.286418 | 0.535181 |
| 7 | Lasso Test (Optimized) | 0.804075 | 0.798023 | 0.270082 | 0.176164 | 0.419719 |
VISUALIZATIONS



Interpretation of Results
Ridge Regression: Both baseline and optimized versions consistently achieved R² ≈ 0.80 on test data, with low error metrics. Ridge is stable and reliable even without tuning.
Lasso Regression: Baseline performance was weaker (R² ≈ 0.68), but after tuning, Lasso matched Ridge’s accuracy (R² ≈ 0.80) and error metrics.
Train vs Test: For Ridge, test R² (≈ 0.80) was slightly higher than train R² (≈ 0.76), so there is no sign of overfitting. Baseline Lasso underfit on both sets because its fixed alpha of 0.1 regularized too strongly, and tuning corrected this.
Error Metrics: Ridge and optimized Lasso both achieved MAE ≈ 0.27 and RMSE ≈ 0.42, confirming strong predictive accuracy.
Feature Selection
Out of curiosity, I wanted to know which features most influenced the Lasso model's predictions, so I used the model's ability to perform feature selection by shrinking some coefficients exactly to zero.
# Coefficients from the optimized Lasso model
lasso_coefs = best_lasso.coef_
# Create a DataFrame of features and coefficients
lasso_coef_df = pd.DataFrame({
"Feature": feature_names,
"Coefficient": lasso_coefs
})
# Identify dropped features (coefficients = 0)
dropped_features = lasso_coef_df[lasso_coef_df["Coefficient"] == 0]
retained_features = lasso_coef_df[lasso_coef_df["Coefficient"] != 0]
print("Dropped Features (Coefficient = 0):")
display(dropped_features)
print("Retained Features (Non-zero Coefficients):")
display(retained_features)
Insights
1. Retained Features
The optimized Lasso model kept all 8 features with non‑zero coefficients:
| Feature | Coefficient | Technical Interpretation |
| remainder__smoker | +1.545 | Strongest positive driver. Smoking status massively increases predicted charges. |
| num__age | +0.481 | Older age correlates with higher charges. |
| num__children | +0.111 | More dependents slightly increase costs. |
| num__bmi | +0.080 | Higher BMI adds moderate risk. |
| onehot__sex_male | −0.070 | Males have slightly lower charges compared to females (baseline). |
| onehot__region_northwest | −0.040 | Northwest residents pay less compared to Northeast (baseline). |
| onehot__region_southeast | −0.119 | Southeast residents pay less compared to Northeast. |
| onehot__region_southwest | −0.106 | Southwest residents pay less compared to Northeast. |
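Because the target is log(charges), each coefficient can be read as a multiplicative effect: exp(coefficient) is the factor applied to predicted charges for a one-unit increase in that feature. The sketch below applies this to three of the reported coefficients. For the standardized numeric features (age, bmi, children), "one unit" means one standard deviation, so they are left out here to keep the reading simple.

```python
import numpy as np

# Coefficients copied from the table above (binary 0/1 features only).
coefs = {
    "remainder__smoker": 1.545,
    "onehot__sex_male": -0.070,
    "onehot__region_southeast": -0.119,
}

# exp(coef) is the multiplicative change in predicted charges,
# holding the other features fixed.
multipliers = {name: float(np.exp(c)) for name, c in coefs.items()}
pct_change = {name: (m - 1) * 100 for name, m in multipliers.items()}

for name in coefs:
    print(f"{name}: x{multipliers[name]:.3f} ({pct_change[name]:+.1f}%)")
```

Read this way, the smoker coefficient implies predicted charges several times higher for smokers than for otherwise identical non-smokers, while the sex and region effects amount to differences of a few percent.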
2. Dropped Features
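Lasso did not shrink any coefficient to zero, so no feature was dropped by the model itself. The only categories missing from the table are the baseline levels removed during encoding (drop='first'). That means: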
Female is the baseline for sex.
Northeast is the baseline for region. Their effects are absorbed into the model intercept, and all coefficients are interpreted relative to them.
3. Technical Takeaways
Smoker status dominates: Its coefficient (1.545) is more than three times the next largest (age, 0.481) and over ten times each of the remaining ones, confirming smoking is the single most important predictor.
Age is substantial: A steady positive effect, showing charges rise with age.
BMI and children are moderate: They contribute, but less dramatically.
Region and sex are comparative: Negative coefficients show relative reductions compared to the dropped baselines.
4. Business Implications
Pricing strategy: Smoking and age should be the primary drivers of premium differentiation.
Wellness programs: Target BMI and smoking cessation to reduce claims.
Family coverage: Incremental pricing for dependents ensures fairness.
Regional adjustments: Premiums should reflect geographic cost differences.
Gender differences: Too small to justify pricing changes; focus on lifestyle factors instead.
Together, these implications show insurers how to balance risk-based pricing with customer-centric wellness initiatives. Smoking and age should drive premium differentiation, while BMI, children, and regional differences offer opportunities for nuanced strategies. Gender differences are minor, so the focus should remain on lifestyle and geography.
Final Conclusion
Across the series of assignments and the real‑world project, a clear progression emerges: from foundational data preprocessing, through simple regression, into multiple regression with feature selection, and finally into applied business insights. Each stage built upon the last, sharpening both technical skills and strategic thinking.
Data Preprocessing I established a robust pipeline to clean, encode, and scale data. This ensured that every subsequent model was built on reliable inputs, highlighting the importance of preparation in data science.
Simple Linear Regression By modeling the relationship between advertising spend and sales revenue, I demonstrated how even a single predictor can yield actionable insights. The regression equation provided a straightforward forecasting tool, and the business recommendations showed how companies can optimize budgets with confidence.
Multiple Linear Regression with Feature Selection Expanding to multiple predictors, I applied backward elimination to strip away noise and reveal the true drivers of startup profit. The optimized model proved leaner yet more accurate, underscoring the principle that simplicity often outperforms complexity when guided by statistical rigor.
Real‑World Project (Insurance Charges) Using Ridge and Lasso regression, I tackled a practical problem of predicting medical insurance costs. Ridge offered stability, while Lasso added interpretability by showing which features carried the most weight. The business insights translated technical findings into strategies for pricing, wellness programs, and regional adjustments.
Big Picture
Together, these works illustrate the end‑to‑end journey of applied machine learning:
Preparation ensures data integrity.
Modeling captures relationships and patterns.
Evaluation validates accuracy and generalization.
Feature selection sharpens focus on what truly matters.
Business insights bridge the gap between numbers and strategy.
The overarching lesson is clear: data science is most powerful when technical precision meets business relevance. By combining rigorous modeling with actionable recommendations, we’ve shown how analytics can guide smarter decisions, optimize resources, and unlock growth.




