Web Scraping and Data Visualization: Global Population Analysis (2025)

Introduction

By Week 13 of my Data Science Internship at DataraFlow, I had moved past simply running notebooks and plotting charts. This phase of the internship challenged me to think and work like a data professional: someone who not only extracts data but also understands its real-world implications and communicates insights clearly to stakeholders.

For this assessment, I chose to explore a topic that affects everyone on the planet: global population distribution.

Population is more than just a number. It shapes markets, influences investment decisions, drives urban planning, and determines humanitarian priorities. While the world’s population in 2025 is projected to reach 8.23 billion, an important question remains: how is this population distributed across countries, and what does that distribution mean for business, policy, and development?

This project allowed me to combine the web scraping and data visualization skills I developed during Weeks 10 and 13 of the internship.

In this article, I will walk through the task requirements, my technical approach, the challenges encountered during scraping and data preparation, and the insights derived from real-world population data, with the ultimate goal of turning raw numbers into actionable intelligence for stakeholders.

Task Overview

The Week 13 assessment was structured to evaluate both guided learning retention and independent problem-solving using web scraping techniques.

It was divided into two distinct sections.

Section 1: Replicating a Guided Web Scraping Exercise

In the first part of the assessment, I was required to replicate a web scraping workflow demonstrated in a learning video. The objective was not simply to reproduce results, but to demonstrate an understanding of how structured data can be programmatically extracted from a live webpage and transformed into an analytical dataset.

Using Python libraries, I extracted tabular population data from an online source, performed basic cleaning and restructuring, and visualized trends using simple plots. This section reinforced foundational scraping concepts, including working with HTML tables, handling structured web data, and preparing datasets for analysis.

Section 2: Independent Web Scraping Assessment

The second section formed the core of the assessment and required a more autonomous approach.

I was tasked with selecting a webpage of my choice, scraping a table containing at least 100 rows of data, and carrying out a complete analytical workflow. This included data extraction, cleaning, exploratory analysis, visualization, and the presentation of findings in a format suitable for non-technical stakeholders.

This phase tested my ability to:

identify a reliable and relevant data source,
handle inconsistencies and formatting issues common in real-world web data,
and translate numerical findings into meaningful insights beyond code and charts.

Tools & Libraries Used

To complete the assessment, I relied on a standard Python-based data scraping and analysis stack.

Requests was used to handle HTTP requests and retrieve webpage content, while BeautifulSoup enabled the parsing and extraction of relevant HTML elements. Pandas played a central role in cleaning, transforming, and organizing the scraped data into structured DataFrames. For analysis and visualization, Matplotlib and Seaborn were used to explore trends, distributions, and cumulative patterns within the data.

Together, these tools reflect a typical workflow used in real-world data collection and exploratory analysis tasks.

Section 1: Scraping Nigeria’s Population Data

I started by extracting Nigeria’s historical and projected population data from Worldometer.

Instead of manually copying data, I used:

pd.read_html(url)

This allowed me to:

automatically extract all HTML tables from the page,
separate historical and projected population data,
clean and sort the dataset by year.
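The steps above can be sketched roughly as follows. This is an illustration rather than the notebook's actual code: the helper names load_tables and tidy_by_year, the "Year" column label, and the toy rows are assumptions, and which table on the live page holds the historical or projected figures has to be checked by hand.

```python
import pandas as pd


def load_tables(url):
    """Return every HTML <table> found at `url` as a list of DataFrames.

    pd.read_html needs an HTML parser such as lxml installed, and some
    sites reject requests that do not look like a browser.
    """
    return pd.read_html(url)


def tidy_by_year(table, year_col="Year"):
    """Sort a population table chronologically and reset its index."""
    return table.sort_values(year_col).reset_index(drop=True)


# Toy rows standing in for one scraped table
# (placeholder values, not real population figures)
toy_table = pd.DataFrame({"Year": [2020, 2000, 2010], "Population": [30, 10, 20]})
historical = tidy_by_year(toy_table)
print(historical)
```

On a real page, load_tables(url) returns a list, so the historical and projected tables are picked out by position (for example tables[0] and tables[1]) after inspecting each one.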

Visualization

I plotted:

historical population growth (blue solid line)
projected growth (orange dashed line)

This made it easy to compare past trends against future expectations.
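A plot along these lines could be produced with Matplotlib as sketched below. The DataFrame names, column labels, and values are placeholders chosen for illustration, not the scraped figures.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder values for illustration only, not real population figures
historical = pd.DataFrame({"Year": [2000, 2010, 2020], "Population": [10, 20, 30]})
projected = pd.DataFrame({"Year": [2030, 2040, 2050], "Population": [40, 48, 55]})

fig, ax = plt.subplots(figsize=(8, 4))

# Historical data as a blue solid line
ax.plot(historical["Year"], historical["Population"],
        color="tab:blue", linestyle="-", label="Historical")

# Projected data as an orange dashed line
ax.plot(projected["Year"], projected["Population"],
        color="tab:orange", linestyle="--", label="Projected")

ax.set_xlabel("Year")
ax.set_ylabel("Population")
ax.set_title("Historical vs. projected population")
ax.legend()
fig.tight_layout()
```

Drawing both series on the same axes, with distinct colors and line styles, keeps the boundary between observed and forecast values visible at a glance.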

Output

Finally, I combined both datasets and exported them as an Excel file, demonstrating how scraped data can be packaged for downstream business use.
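One way to combine and export the two tables is sketched here. The function name combine_for_export, the "Type" column, the output filename, and the values are all hypothetical.

```python
import pandas as pd


def combine_for_export(historical, projected):
    """Stack both tables and tag each row with where it came from."""
    combined = pd.concat(
        [historical.assign(Type="Historical"), projected.assign(Type="Projected")],
        ignore_index=True,
    )
    return combined.sort_values("Year").reset_index(drop=True)


# Placeholder values for illustration only, not real population figures
historical = pd.DataFrame({"Year": [2010, 2020], "Population": [20, 30]})
projected = pd.DataFrame({"Year": [2030, 2040], "Population": [40, 48]})

combined = combine_for_export(historical, projected)
print(combined)

# Writing the workbook needs an Excel engine such as openpyxl installed:
# combined.to_excel("nigeria_population.xlsx", index=False)
```

Tagging each row before concatenating means the exported sheet can still be filtered back into historical and projected parts.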

Section 2: Scraping Global Population Data (Main Assessment)

For my independent task, I chose Wikipedia’s ‘List of countries and dependencies by population’.

This table was ideal because it:

contained over 200 rows,
was semi-structured,
required real data cleaning before analysis.

Data Acquisition & Cleaning

Using requests with a custom User-Agent, I fetched the page content and parsed it with BeautifulSoup.

Key steps included:

locating the table using its wikitable class,
extracting headers and rows manually,
converting the scraped HTML into a Pandas DataFrame.

#--- Scraping Wikipedia Table
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"

# Send a browser-like User-Agent with the request
request_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36"
}

# Make the request once and stop early on an HTTP error
response = requests.get(url, headers=request_headers, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Find the first table with class 'wikitable'
table = soup.find("table", {"class": "wikitable"})

# Extract column names (kept separate from the request headers above)
column_names = [th.get_text(strip=True) for th in table.find_all("th")]

# Extract data rows, skipping the header row and any near-empty rows
rows = []
for tr in table.find_all("tr")[1:]:
    cells = tr.find_all(["td", "th"])
    if len(cells) > 1:
        rows.append([cell.get_text(strip=True) for cell in cells])

# Create DataFrame
pop_data = pd.DataFrame(rows, columns=column_names)
pop_data.head()

At this stage, the data was raw: numbers contained commas, percentages had symbols, and totals were mixed with country-level data.

The dataset I scraped contained 6 columns:

Column	Meaning
Location	Country or territory name
Population	Total population (raw numbers, stored as text with commas)
% of world	Share of world population (the scraped header text is “% ofworld”)
Date	Date of the most recent population estimate
Source	Official or UN estimate
Notes	Additional information
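The missing space in the scraped “% ofworld” header has a plausible explanation, shown in the sketch below. It assumes the header cell contains a line break between “% of” and “world”; the HTML snippet is a hypothetical stand-in, not a copy of the Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical header row with a <br> inside one cell
html = "<table><tr><th>Location</th><th>% of<br>world</th></tr></table>"
soup = BeautifulSoup(html, "html.parser")
cells = soup.find_all("th")

# strip=True trims each text fragment, then joins them with no separator
joined = [th.get_text(strip=True) for th in cells]

# Passing a separator keeps a space between the fragments
spaced = [th.get_text(" ", strip=True) for th in cells]

print(joined)  # ['Location', '% ofworld']
print(spaced)  # ['Location', '% of world']
```

Either form works for analysis, as long as the column name used later matches the text that was actually scraped.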

Data Cleaning & Preparation

#--- Cleaning my population data
# Keep relevant columns ('% ofworld' is the header text exactly as scraped);
# .copy() avoids pandas' SettingWithCopyWarning on the assignments below
pop_data = pop_data[['Location', 'Population', '% ofworld']].copy()

# Clean Population column: drop bracketed annotations, then commas
pop_data['Population'] = (
    pop_data['Population']
    .astype(str)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.replace(",", "", regex=False)
    .astype(int)
)

# Clean percentage column
pop_data['% ofworld'] = (
    pop_data['% ofworld']
    .astype(str)
    .str.replace("%", "", regex=False)
    .astype(float)
)

# Rename the Location column to Country
pop_data = pop_data.rename(columns={"Location": "Country"})

# Remove the "World" total row
pop_data_countries = pop_data[pop_data['Country'] != "World"].reset_index(drop=True)
pop_data_countries.head()

The code above enabled me to:

Narrow the focus to three relevant columns: Country, Population, and % of world.
Remove commas and annotations from the population column.
Convert the percentage column to float.
Rename the “Location” column to “Country” for clarity.
Exclude the “World” aggregate row to focus on individual countries.
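A more defensive variant of these cleaning steps is sketched below. It swaps astype for pd.to_numeric with errors="coerce", so a malformed cell becomes NaN instead of raising an error. The function name clean_population_table and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows mimicking scraped strings (placeholder values, not real figures)
raw = pd.DataFrame({
    "Location": ["World", "Country A", "Country B"],
    "Population": ["1,500", "1,000[a]", "500"],
    "% ofworld": ["100%", "66.7%", "33.3%"],
})


def clean_population_table(df):
    """Return a cleaned copy with numeric columns and no 'World' row."""
    out = df.copy()
    out["Population"] = pd.to_numeric(
        out["Population"]
        .str.replace(r"\[.*?\]", "", regex=True)  # drop markers like [a]
        .str.replace(",", "", regex=False),       # drop thousands separators
        errors="coerce",                          # unparseable cells become NaN
    )
    out["% ofworld"] = pd.to_numeric(
        out["% ofworld"].str.replace("%", "", regex=False),
        errors="coerce",
    )
    out = out.rename(columns={"Location": "Country"})
    return out[out["Country"] != "World"].reset_index(drop=True)


cleaned = clean_population_table(raw)
print(cleaned)
```

Checking cleaned.isna().sum() afterwards shows how many cells failed to parse, which is a quick way to spot formats the cleaning rules missed.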

This preprocessing reinforced a key lesson: Scraping is only 30% of the work. Cleaning is the real task. I ensured the data was ready for visualization and analysis, allowing accurate computation of shares and cumulative population trends.

Exploring Population Distribution (Visualizations)

To better understand these patterns, I created several charts using Matplotlib and Seaborn:

Histogram of Country Populations

The histogram showed a heavily right-skewed distribution, with a few mega-population countries and a long tail of smaller nations. Most countries have populations under 50 million, while India and China stand out as extreme outliers.

This visual highlights that population is highly concentrated in a few nations, which has implications for markets, policy, and global priorities.
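A histogram like this, together with two numbers that quantify the skew, could be produced as sketched here. The values are placeholders in millions, chosen only to mimic a long-tailed shape; they are not the scraped data.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder populations in millions (illustrative, not real figures)
pops_millions = pd.Series([1400, 1350, 300, 200, 40, 30, 20, 10, 8, 5, 3, 2, 1, 1])

fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(pops_millions, bins=20, color="tab:blue", edgecolor="white")
ax.set_xlabel("Population (millions)")
ax.set_ylabel("Number of countries")
ax.set_title("Distribution of country populations")

# A positive skewness value indicates a long right tail
skewness = pops_millions.skew()

# Fraction of entries below the 50 million mark
share_under_50m = (pops_millions < 50).mean()

print(f"skewness={skewness:.2f}, share under 50M={share_under_50m:.0%}")
```

With data this skewed, calling ax.set_xscale("log") and using log-spaced bins is a common alternative that makes the smaller countries easier to tell apart.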

Cumulative Population Curve (Top 15)

The cumulative curve confirmed the concentration of population. The top 15 countries cover over 70% of the world’s people, visually demonstrating that global strategies must consider these nations first.
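The cumulative share behind such a curve can be computed as sketched below. The function name cumulative_share and the toy shares are hypothetical; only the “% ofworld” column name comes from the scraped table.

```python
import pandas as pd

# Toy shares that sum to 100 (placeholder values, not real figures)
shares = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "% ofworld": [10.0, 40.0, 5.0, 25.0, 20.0],
})


def cumulative_share(df, n, share_col="% ofworld"):
    """Return the n largest rows with a running total of their share."""
    top = df.nlargest(n, share_col).reset_index(drop=True)
    top["Cumulative %"] = top[share_col].cumsum()
    return top


top3 = cumulative_share(shares, 3)
print(top3)
```

Plotting the "Cumulative %" column against rank (1 to n) gives the curve; the last value is the combined share of the top n countries.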

Top 10 Countries Pie Chart

Insights:

- India and China together make up 34% of the global population.
- Other top 10 countries, such as the US, Indonesia, and Pakistan, still hold significant shares.
- The remaining 180+ countries collectively account for less than half of the world’s people.

This chart clearly demonstrates population concentration and inequality across nations.
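A pie chart of this kind needs the countries outside the top group folded into a single slice. The sketch below shows one way to do that; the helper name top_n_with_others, the "Others" label, and the toy shares are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy shares that sum to 100 (placeholder values, not real figures)
shares = pd.DataFrame({
    "Country": ["A", "B", "C", "D", "E"],
    "% ofworld": [10.0, 40.0, 5.0, 25.0, 20.0],
})


def top_n_with_others(df, n, label_col="Country", share_col="% ofworld"):
    """Keep the n largest rows and collapse the rest into one 'Others' row."""
    top = df.nlargest(n, share_col)[[label_col, share_col]]
    rest = df[share_col].sum() - top[share_col].sum()
    others = pd.DataFrame({label_col: ["Others"], share_col: [rest]})
    return pd.concat([top, others], ignore_index=True)


pie_data = top_n_with_others(shares, 2)

fig, ax = plt.subplots(figsize=(6, 6))
wedges, texts, autotexts = ax.pie(
    pie_data["% ofworld"], labels=pie_data["Country"], autopct="%1.1f%%"
)
ax.set_title("Top countries vs. the rest")
print(pie_data)
```

Including the "Others" slice keeps the chart honest: the wedges then represent shares of the whole world rather than shares of the top group alone.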

Top 10 Bar Chart

A bar chart highlighted each of the top 10 countries’ share of the world population, making it easy to see the gradual decrease from India and China down to countries like Russia and Mexico.

Analysis & Key Findings

The world population in 2025 is projected at 8.23 billion.
India and China each hold roughly 17% of the global population, together accounting for 34%.
The top 10 countries account for 57.1% of all people on Earth.
The largest 15 countries together make up more than 70% of the global population.

These figures show extreme top-heaviness: a handful of countries dominate the global population, while the remaining 200 or so countries and territories make up the rest.

Strategic Implications for Stakeholders

1. Businesses & Investors

Countries with large populations and emerging growth, such as India, Indonesia, Nigeria, Pakistan, and Bangladesh, represent prime markets for consumer goods, digital adoption, and infrastructure investment.

Meanwhile, mature markets like the US and China remain attractive for high-tech industries, healthcare, and high-value manufacturing due to their population scale and economic maturity.

2. Policy & Government Planning

High-population nations must focus on:

Urban development and housing
Energy scaling and water resource planning
Food system modernization

Smaller countries (<10 million population) face scale limitations and may need regional cooperation or international support.

3. NGO & Global Development Implications

Countries such as Pakistan, Ethiopia, DR Congo, and Bangladesh require targeted humanitarian aid, healthcare capacity building, and education support. Population size directly impacts the effectiveness of policies and aid programs.

Insights & Takeaways

The global population is extremely concentrated; the top 10 to 15 countries dominate.
Emerging Asian and African nations offer the largest growth opportunities.
Smaller countries need differentiated policy frameworks due to limited population scale.
Sustainability efforts must prioritize India, China, and the United States, as their decisions have global ripple effects.

Conclusion

Scraping and analyzing global population data turned out to be far more than a technical exercise. What began as a simple Wikipedia table evolved into a clear story about population concentration, opportunity, and strategic importance across the world.

This assessment helped me connect technical execution with real-world impact. Beyond scraping tables and plotting charts, it reinforced the importance of thinking critically about data sources, cleaning imperfect real-world datasets, and communicating insights in a way that decision-makers can understand and act upon.

Ultimately, this project highlighted a crucial lesson: data, even when freely available, holds powerful stories. With curiosity, careful preparation, and thoughtful visualization, those stories can shape decisions, inform strategy, and demonstrate how data literacy extends far beyond code.
