Day 2: Pandas and Data Analysis
Self Intro¶
Hi Everyone,
Welcome back to my blog! Here we are at Day 2 of learning Machine Learning.
I’m Rohan Sai, aka AiKnight.
A Joke to Lighten the Mood:¶
Why did the machine learning model break up with its dataset?
It just didn't find the relationship significant anymore!
Pretty Lame Ikr...
Today, we’re diving into one of the most powerful tools in the data science arsenal: the Pandas Library.
We'll cover what the Pandas library offers and what can be done with it, in detail.
Before starting, check out my Advanced Data Processor to clean, preprocess, and visualize datasets effortlessly.
Try it now at Streamlit App
AN INTERESTING MILESTONE:
In 2024, Google DeepMind's AlphaFold 3 made a groundbreaking leap by accurately predicting how proteins interact with DNA and other molecules, significantly accelerating research and drug discovery.
Enough talk. Let's dive in and unlock the power of Pandas together!
Pandas Library¶
The Pandas library is an open-source Python library that provides high-level data structures and methods designed to make data analysis and manipulation fast and easy. It is built on top of NumPy and is the most widely used library for data analysis tasks in Python.
Pandas has two primary data structures:
- Series – A one-dimensional labeled array.
- DataFrame – A two-dimensional labeled data structure, similar to a table or spreadsheet.
Pandas provides functionality to load, clean, manipulate, analyze, and visualize data, making it the go-to tool for data analysis in Python.
1. Pandas: Key Concepts and Data Structures¶
a. Pandas Series¶
A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.).
Key Features of Series:
- It has an index, which is used to access elements in the Series.
- It can be created from a list, a NumPy array, or a dictionary.
Example:
import pandas as pd
# Creating a Series
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
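The example above builds a Series from a list; the bullet before it notes that a Series can also be built from a dictionary or a NumPy array. Here is a minimal sketch of both (the values and index labels are just illustrative):
import pandas as pd
import numpy as np
# From a dictionary: the keys become the index labels
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})
print(s_dict['b'])  # 20 -- label-based access
# From a NumPy array, with an explicit custom index
s_arr = pd.Series(np.array([1.5, 2.5, 3.5]), index=['x', 'y', 'z'])
print(s_arr['z'])   # 3.5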
Key Operations on Series:
- Accessing Elements: You can access elements using the index.
print(s[0])  # Access the first element
- Indexing and Slicing: Like arrays, Series can be indexed or sliced.
print(s[:3])  # First three elements
- Element-wise Operations: Mathematical operations on Series elements.
print(s * 2)  # Multiply each element by 2
b. Pandas DataFrame¶
A DataFrame is a two-dimensional data structure that consists of rows and columns. It can be thought of as a table, where each column can hold different data types.
Key Features of DataFrame:
- It can be created from various data sources like CSV files, Excel files, dictionaries, or lists.
- A DataFrame has both row and column indices.
- Each column can have a different data type.
Example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)
Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Key Operations on DataFrame:
- Accessing Columns: You can access columns as attributes or using df['column_name'].
print(df['Name']) # Access the 'Name' column
- Selecting Rows: You can select rows using .iloc[] or .loc[].
print(df.iloc[0]) # Access first row using integer index
print(df.loc[0]) # Access first row using label
- Slicing Data: DataFrames support slicing by rows or columns.
print(df[['Name']]) # Slicing 'Name' column
- Descriptive Statistics: Compute mean, median, min, max, etc., for each column.
print(df.describe()) # Get summary statistics of numerical columns
2. Commonly Used Pandas Functions¶
Here are some commonly used functions in Pandas for data manipulation, analysis, and cleaning.
a. pd.read_csv()¶
This function is used to read CSV files into a Pandas DataFrame.
df = pd.read_csv('data.csv')
Example:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head()) # Display first few rows of the dataset
b. pd.DataFrame()¶
This is used to create a DataFrame from various data sources, such as dictionaries, lists, and arrays.
df = pd.DataFrame(data)
Example:
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
c. df.head() and df.tail()¶
head() returns the first 5 rows of the DataFrame by default; tail() returns the last 5 rows by default.
print(df.head())  # Displays the first 5 rows
print(df.tail())  # Displays the last 5 rows
d. df.info()¶
This function provides a concise summary of the DataFrame, including the number of non-null values, data types, and memory usage.
df.info()
e. df.describe()¶
This function generates descriptive statistics such as count, mean, standard deviation, minimum, and maximum values for numerical columns.
df.describe()
f. df.isnull() and df.dropna()¶
isnull() returns a DataFrame of boolean values indicating whether values are NaN (null); dropna() removes rows with missing data.
df.isnull()  # Check for missing values
df.dropna()  # Remove rows with missing values
g. df.groupby()¶
This function is used to group the DataFrame by one or more columns and then apply a function to each group.
grouped = df.groupby('column_name')
print(grouped.mean(numeric_only=True))  # Compute the mean of numeric columns for each group
h. df.merge()¶
This function is used to merge two DataFrames based on a common column.
merged_df = pd.merge(df1, df2, on='column_name')
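As a minimal, self-contained sketch (the df1/df2 frames and the 'id' column are made up for illustration):
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'score': [85, 90, 75]})
# Inner join (the default): keeps only ids present in both frames
merged_df = pd.merge(df1, df2, on='id')
print(merged_df)  # rows for id 1 and 2 only
Passing how='left', how='right', or how='outer' keeps unmatched rows from one or both sides.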
3. Advanced Pandas Operations¶
a. Data Cleaning with Pandas¶
- Handling Missing Data: You can use dropna(), fillna(), or forward/backward fill to handle missing data.
df.fillna(0)  # Replace missing values with 0
df.dropna()   # Drop rows with missing data
- Replacing Values: You can replace specific values using replace().
df.replace('old_value', 'new_value')
b. Data Transformation¶
- Normalization: Scale data to a specific range (usually 0 to 1).
df['Age'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
- One-Hot Encoding: Convert categorical variables into a numerical format using one-hot encoding.
df_encoded = pd.get_dummies(df['Category'])
c. Pivoting and Reshaping¶
- Pivot Table: Used to summarize and aggregate data, similar to Excel pivot tables.
pivot_table = df.pivot_table(values='value_column', index='index_column', columns='category_column')
- Melt: Reshape DataFrame to a long format.
df_melted = df.melt(id_vars='id_column')
d. Time Series Analysis¶
Pandas provides several functions to work with time series data, such as pd.to_datetime() to convert dates and times into a datetime format, and df.resample() to resample data.
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df_resampled = df.resample('M').mean() # Resample by month and compute mean
4. How Pandas is Used for Data Analysis¶
Pandas plays a crucial role in data analysis, offering a wide range of functionality to manipulate, analyze, and visualize data.
a. Data Cleaning and Preprocessing¶
Before performing analysis, it is crucial to clean the data. Pandas allows you to:
- Remove or fill missing values using dropna() or fillna().
- Convert data types with astype().
- Filter data based on conditions using boolean indexing.
Example:
df = df[df['Age'] > 30] # Filter rows where age > 30
b. Exploratory Data Analysis (EDA)¶
Pandas provides a set of methods to perform EDA, which is the process of analyzing and visualizing data to identify patterns, trends, and relationships. Functions such as groupby(), describe(), and pivot_table() help summarize and understand the data.
Example:
df.groupby('Category')['Sales'].mean() # Group data by 'Category' and compute mean sales
c. Statistical Analysis¶
Pandas allows for basic statistical analysis of data. You can calculate mean, median, variance, and perform more advanced techniques such as regression using external libraries (like statsmodels or scikit-learn).
Example:
df['Sales'].mean() # Calculate the mean of the 'Sales' column
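For instance, a simple linear regression with statsmodels might look like the sketch below (the 'Advertising' and 'Sales' columns and their values are hypothetical):
import pandas as pd
import statsmodels.api as sm
df = pd.DataFrame({'Advertising': [10, 20, 30, 40, 50],
                   'Sales': [25, 45, 70, 85, 110]})
X = sm.add_constant(df['Advertising'])  # add an intercept term
model = sm.OLS(df['Sales'], X).fit()    # ordinary least squares fit
print(model.params)                     # intercept and slope estimates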
d. Data Visualization¶
Though Pandas is not a visualization library, it integrates seamlessly with libraries like Matplotlib and Seaborn to plot graphs and visualize the data.
Example:
import matplotlib.pyplot as plt
df['Sales'].plot(kind='hist')  # Plot a histogram of 'Sales'
plt.show()
5. Benefits and Demerits of Pandas for Data Analysis¶
Benefits:¶
- Fast and Efficient: Pandas is optimized for performance and can handle large datasets efficiently.
- Versatile: Works well with data from various formats like CSV, Excel, SQL, and others.
- Powerful Data Manipulation: Supports a wide variety of operations like filtering, sorting, merging, and reshaping data.
- Integrated with Visualization: Easily integrates with Matplotlib and Seaborn for data visualization.
- Easy-to-Use: The API is intuitive and Pythonic, making it user-friendly.
Demerits:¶
- Memory Consumption: Pandas can be memory-intensive, especially with large datasets.
- Limited Performance for Extremely Large Datasets: While it handles large datasets well, it may struggle with datasets too large to fit in memory.
- Learning Curve: For beginners, there can be a learning curve due to the extensive functionality and methods.
Now let's take a dataset as an example and see how Pandas can be used for data analysis:
Pandas in Data Analysis¶
To provide a complete and practical overview of how Pandas is used in Data Analysis, we’ll work through a real dataset and explore how to perform various analysis tasks step by step using Pandas. This will involve:
- Data Loading and Preprocessing
- Exploratory Data Analysis (EDA)
- Data Cleaning
- Statistical Analysis
- Data Visualization
For this, let's use the commonly available "Titanic" dataset (bundled with the seaborn library), which contains information about passengers on the Titanic. This dataset is widely used for data analysis and machine learning exercises.
We’ll go through the tasks and show the Pandas functions used, along with the code and outputs at each step.
1. Data Loading and Preprocessing¶
Loading the Dataset¶
We will load the Titanic dataset directly from the seaborn library, which comes pre-packaged with some datasets, including Titanic.
import seaborn as sns
import pandas as pd
# Load Titanic dataset
df = sns.load_dataset('titanic')
# Show the first few rows of the DataFrame
print(df.head())
Output:
survived pclass sex age sibsp parch fare embarked class who ...
0 0 3 male 22.0 1 0 7.250 S Third man ...
1 1 1 female 38.0 1 0 71.283 C First woman ...
2 1 3 female 26.0 0 0 7.925 S Third woman ...
3 1 1 female 35.0 1 0 53.100 S First woman ...
4 0 3 male 35.0 0 0 8.050 S Third man ...
This dataset contains information about passengers on the Titanic, such as their survival status, pclass (passenger class), sex, age, fare, and embarked (port of embarkation).
2. Exploratory Data Analysis (EDA)¶
Summarizing the Data¶
To understand the dataset, we can start by getting basic descriptive statistics, such as mean, standard deviation, and percentiles.
# Get a summary of the numerical columns
print(df.describe())
Output:
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
Missing Data Analysis¶
To understand the completeness of the data, we check for missing values.
# Check for missing data
print(df.isnull().sum())
Output:
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
...
This indicates that the age column has 177 missing values and the embarked column has 2 missing values.
3. Data Cleaning¶
Handling Missing Values¶
We can either drop the rows with missing values or fill them with a reasonable value. Let’s fill the missing values in Age with the median and the missing values in Embarked with the mode (most frequent value).
# Fill missing 'age' with the median
df['age'] = df['age'].fillna(df['age'].median())
# Fill missing 'embarked' with the most frequent value
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
# Check again for missing values
print(df.isnull().sum())
Output:
survived 0
pclass 0
sex 0
age 0
sibsp 0
parch 0
fare 0
embarked 0
class 0
who 0
...
Converting Data Types¶
We can convert the Survived column to a boolean type (True or False), which will be more informative.
# Convert 'survived' to a boolean type
df['survived'] = df['survived'].astype(bool)
# Check the data types of the columns
print(df.dtypes)
Output:
survived bool
pclass int64
sex object
age float64
sibsp int64
parch int64
fare float64
embarked object
class object
who object
...
4. Statistical Analysis¶
We can perform some statistical analysis to understand relationships in the data.
Correlation Analysis¶
For numerical columns like age, fare, and survived, we can check the correlation to see how strongly variables are related.
# Calculate correlation between numerical features
print(df.corr(numeric_only=True))
Output:
survived pclass age sibsp parch fare
survived 1.000000 -0.338481 -0.077221 -0.035523 -0.017901 0.257308
pclass -0.338481 1.000000 -0.369226 0.105868 0.072712 -0.549500
age -0.077221 -0.369226 1.000000 0.035634 0.004070 -0.096067
sibsp -0.035523 0.105868 0.035634 1.000000 0.410798 0.010388
parch -0.017901 0.072712 0.004070 0.410798 1.000000 0.081629
fare 0.257308 -0.549500 -0.096067 0.010388 0.081629 1.000000
This tells us that pclass is negatively correlated with survived (higher class numbers, i.e. lower classes, had lower survival rates), and fare has a moderate positive correlation with survival (passengers who paid higher fares tended to have better chances of survival).
5. Data Visualization¶
Visualizing the Data¶
Data visualization is key to understanding patterns and trends in the dataset. We can use libraries like Matplotlib and Seaborn to visualize the data.
import seaborn as sns
import matplotlib.pyplot as plt
# Distribution of Age
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()
# Survival rate by Class
sns.barplot(x='pclass', y='survived', data=df)
plt.title('Survival Rate by Class')
plt.show()
# Survival rate by Gender
sns.barplot(x='sex', y='survived', data=df)
plt.title('Survival Rate by Gender')
plt.show()
Visualization Output:¶
- Age Distribution: A histogram that shows the distribution of passengers' ages.
- Survival Rate by Class: A bar plot showing the survival rate for each passenger class (1st, 2nd, 3rd).
- Survival Rate by Gender: A bar plot showing the survival rate for male and female passengers.
6. Conclusion: Insights from Data Analysis¶
- Age: Passengers' ages range from infants to elderly, with most passengers being between 20 and 40 years old.
- Survival Rate: Women had a higher survival rate than men. The first-class passengers had the highest survival rate.
- Class: Passengers in first class had a higher survival rate than those in third class.
- Fare: There seems to be a positive correlation between the fare paid and the survival rate.
Beyond this, Pandas has some wild tricks up its sleeve: brace yourself, it's like discovering your dataset's secret superpowers!
Advanced and Unique Ways to Use Pandas¶
While Pandas is a highly versatile library for data manipulation and analysis, there are many advanced and less commonly used functionalities that can help with different types of data analysis, especially when working with large datasets or performing complex analyses. These include methods that streamline workflows, integrate well with other libraries, and provide powerful capabilities beyond typical data cleaning, EDA, and basic analysis.
Let's explore some of these advanced techniques and unique functionalities offered by Pandas, with practical examples:
1. Multi-Indexing (Hierarchical Indexing)¶
Pandas allows the creation of multi-level indices or hierarchical indexing, which is useful for handling high-dimensional data in a more readable and efficient manner. Multi-indexing is particularly helpful when working with time-series data, grouped data, or when you need to organize data in a nested structure.
Example: Multi-Indexing with DataFrames¶
import pandas as pd
# Sample data with multi-level index
data = {
'Product': ['A', 'A', 'B', 'B'],
'Region': ['East', 'West', 'East', 'West'],
'Sales': [200, 300, 250, 350]
}
df = pd.DataFrame(data)
# Set multi-index
df.set_index(['Product', 'Region'], inplace=True)
print(df)
Output:
Sales
Product Region
A East 200
West 300
B East 250
West 350
Benefits:
- Allows for efficient querying and aggregation across multiple levels.
- Makes hierarchical data more intuitive to analyze.
- Useful for pivot tables, group-by operations, and more.
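For instance, the multi-index built above can be queried at either level. A small self-contained sketch:
import pandas as pd
df = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B'],
    'Region': ['East', 'West', 'East', 'West'],
    'Sales': [200, 300, 250, 350]
}).set_index(['Product', 'Region'])
print(df.loc['A'])                     # outer level: all rows for product 'A'
print(df.xs('East', level='Region'))   # inner level: 'East' across all products
print(df.loc[('B', 'West'), 'Sales'])  # a single cell via the full index tuple -> 350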
2. Window Functions¶
Pandas provides window functions (also called rolling windows) for performing calculations over a moving window of data. These are commonly used for time-series analysis, calculating moving averages, or smoothing out fluctuations.
Example: Rolling Window Calculations¶
# Create a sample time series
data = {'Date': pd.date_range('2023-01-01', periods=10, freq='D'),
'Sales': [120, 140, 130, 160, 180, 200, 210, 190, 230, 220]}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)
# Calculate 3-day moving average for 'Sales'
df['3_day_avg'] = df['Sales'].rolling(window=3).mean()
print(df)
Output:
Sales 3_day_avg
Date
2023-01-01 120 NaN
2023-01-02 140 NaN
2023-01-03 130 130.000000
2023-01-04 160 143.333333
2023-01-05 180 156.666667
2023-01-06 200 180.000000
2023-01-07 210 196.666667
2023-01-08 190 200.000000
2023-01-09 230 210.000000
2023-01-10 220 213.333333
Benefits:
- Smooths out short-term fluctuations.
- Provides a more realistic analysis for time-series forecasting.
- Allows for flexible window size and aggregation functions.
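As a small follow-up sketch of that flexibility, min_periods can suppress the leading NaN values, and other aggregations can run over the same window:
import pandas as pd
s = pd.Series([120, 140, 130, 160, 180])
# min_periods=1: emit a value even before the window is full
print(s.rolling(window=3, min_periods=1).mean())
# Other aggregations over the same 3-element window
print(s.rolling(window=3).max())
print(s.rolling(window=3).std())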
3. Custom Aggregations with GroupBy¶
Beyond simple aggregation functions (sum, mean, etc.), Pandas GroupBy allows custom aggregation functions to be applied to each group in a dataset. This is useful for advanced statistical analysis and personalized reporting.
Example: Custom Aggregation¶
# Sample data
data = {
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Sales': [500, 600, 450, 700, 550, 650, 400, 750],
'Profit': [50, 70, 45, 80, 60, 75, 40, 85]
}
df = pd.DataFrame(data)
# Group by Region and calculate custom aggregation
agg_funcs = {
'Sales': 'sum',
'Profit': lambda x: (x.mean() + x.std())
}
result = df.groupby('Region').agg(agg_funcs)
print(result)
Output:
         Sales     Profit
Region
East       850  46.035534
North     1050  62.071068
South     1250  76.035534
West      1450  86.035534
Benefits:
- Enables complex, domain-specific aggregations.
- Ideal for summarizing and analyzing grouped data in specialized ways.
- Flexible use of lambda functions for custom operations.
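A related option is named aggregation, where each output column is declared as an (input column, function) pair; a minimal sketch with the same kind of data:
import pandas as pd
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South'],
    'Sales': [500, 600, 550, 650]
})
# Each keyword becomes an output column: name=(input column, aggregation)
result = df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
)
print(result)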
4. Pivot Tables¶
Pandas pivot_table function creates pivot tables to summarize and reorganize data, which is a powerful way to aggregate data based on multiple categorical variables.
Example: Pivot Table for Data Aggregation¶
# Create sample sales data
data = {
'Date': pd.date_range('2023-01-01', periods=12, freq='M'),
'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Sales': [200, 300, 150, 250, 180, 230, 220, 270, 210, 320, 190, 210]
}
df = pd.DataFrame(data)
# Create a pivot table
pivot_df = df.pivot_table(values='Sales', index='Category', aggfunc='sum')
print(pivot_df)
Output:
          Sales
Category
A          1150
B          1580
Benefits:
- Useful for summarizing data with multiple variables.
- Can be customized with various aggregation functions like sum, mean, count, etc.
- Simplifies the interpretation of large datasets by summarizing key insights.
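To aggregate across multiple categorical variables at once, add a columns argument; a small sketch with a hypothetical Region column:
import pandas as pd
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B'],
    'Region': ['East', 'East', 'West', 'West'],
    'Sales': [200, 300, 150, 250]
})
# Rows = Category, columns = Region, cells = summed Sales
pivot_df = df.pivot_table(values='Sales', index='Category',
                          columns='Region', aggfunc='sum')
print(pivot_df)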
5. Melt and Pivot for Data Transformation¶
The melt function in Pandas is used to transform a dataset from wide format to long format, making it easier to analyze in many cases. Similarly, pivot allows you to go from long format to wide format.
Example: Using Melt and Pivot¶
# Sample data
data = {
'Year': [2020, 2020, 2021, 2021],
'Region': ['North', 'South', 'North', 'South'],
'Sales_Q1': [150, 180, 170, 210],
'Sales_Q2': [160, 190, 180, 220]
}
df = pd.DataFrame(data)
# Melt the dataset
melted_df = pd.melt(df, id_vars=['Year', 'Region'], value_vars=['Sales_Q1', 'Sales_Q2'], var_name='Quarter', value_name='Sales')
print(melted_df)
Output:
Year Region Quarter Sales
0 2020 North Sales_Q1 150
1 2020 South Sales_Q1 180
2 2021 North Sales_Q1 170
3 2021 South Sales_Q1 210
4 2020 North Sales_Q2 160
5 2020 South Sales_Q2 190
6 2021 North Sales_Q2 180
7 2021 South Sales_Q2 220
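Continuing the example, the melted frame can be pivoted back to wide format (a sketch assuming pandas >= 1.1, where pivot accepts a list of index columns):
# Reverse the melt: one column per Quarter again
wide_df = melted_df.pivot(index=['Year', 'Region'],
                          columns='Quarter', values='Sales').reset_index()
print(wide_df)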
Benefits:
- Easily converts between long and wide formats.
- Enables data transformation that aligns with analytical needs.
- Facilitates reshaping and restructuring data for more efficient analysis.
6. Apply Function for Complex Data Operations¶
The apply function allows you to apply a custom function across a DataFrame or Series, making it very powerful for data manipulation and transformations.
Example: Using Apply for Custom Operations¶
# Sample data
data = {'Age': [25, 30, 35, 40, 45], 'Income': [40000, 50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
# Define a custom function to categorize income
def income_category(income):
    if income < 50000:
        return 'Low'
    elif 50000 <= income < 70000:
        return 'Medium'
    else:
        return 'High'
# Apply the function to the Income column
df['Income Category'] = df['Income'].apply(income_category)
print(df)
Output:
Age Income Income Category
0 25 40000 Low
1 30 50000 Medium
2 35 60000 Medium
3 40 70000 High
4 45 80000 High
Benefits:
- Powerful for custom transformations.
- Helps with applying complex logic across DataFrames or Series.
- Allows for flexible use cases, like conditional operations, transformations, and more.
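One caveat: apply calls a Python function per element, which can be slow on large data. For binning like this, a vectorized alternative is pd.cut; a sketch reproducing the same buckets:
import pandas as pd
df = pd.DataFrame({'Income': [40000, 50000, 60000, 70000, 80000]})
# Same Low/Medium/High buckets, vectorized
df['Income Category'] = pd.cut(
    df['Income'],
    bins=[0, 50000, 70000, float('inf')],
    labels=['Low', 'Medium', 'High'],
    right=False,  # left-inclusive bins: [0, 50000), [50000, 70000), [70000, inf)
)
print(df)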
7. Efficient Memory Management with Categorical Data Type¶
Pandas allows for efficient memory usage by converting text or string columns to the Categorical data type. This can significantly reduce memory usage, especially with large datasets containing repetitive strings.
Example: Using Categorical Data Type¶
# Sample data
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B']}
df = pd.DataFrame(data)
# Convert 'Category' column to categorical
df['Category'] = df['Category'].astype('category')
print(df.memory_usage())
Output:
Index 128
Category 32
dtype: int64
Benefits:
- Reduces memory consumption for categorical data.
- Useful for datasets with repetitive textual data.
- Speeds up operations like group-by and pivoting due to faster comparisons.
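To see the saving more clearly, compare deep memory usage before and after the conversion (a small sketch; exact byte counts vary by pandas version and platform):
import pandas as pd
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'] * 1000})
print(df['Category'].memory_usage(deep=True))  # object dtype: one Python string per row
df['Category'] = df['Category'].astype('category')
print(df['Category'].memory_usage(deep=True))  # small integer codes + 3 categories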
8. Advanced Data Profiling with pandas-profiling¶
pandas-profiling (now maintained under the name ydata-profiling) is a Python library that enhances data exploration by automating the generation of comprehensive reports for a dataset. It is a powerful addition to the Pandas ecosystem, simplifying the identification of trends, patterns, and anomalies.
Example: Generating a Profiling Report¶
# Import required packages
import pandas as pd
from pandas_profiling import ProfileReport
# Load a dataset
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [24, 27, 22, 32, 29],
'Score': [85.5, 62.3, 88.8, None, 72.5]}
df = pd.DataFrame(data)
# Generate a profiling report
profile = ProfileReport(df, title='Dataset Profiling Report', explorative=True)
# Save the report as an HTML file
profile.to_file(output_file="profiling_report.html")
Output:
An interactive HTML report containing:
- Descriptive statistics: Mean, median, standard deviation, etc.
- Distributions: Histograms for numerical columns.
- Correlations: Pearson, Spearman, and Kendall coefficients.
- Warnings: Alerts for missing values, duplicates, and high cardinality.
Benefits:¶
- Automates exploratory data analysis.
- Saves time for data scientists and analysts.
- Generates ready-to-share, visually appealing reports.
- Facilitates quick identification of potential data quality issues.
Demerits:¶
- Computationally expensive for very large datasets.
- Limited customization for specific visualization styles.
Don't forget to check out the Advanced Data Processor project... Try it now at Streamlit App
Stay tuned for Day 3, where we’ll unlock even more machine learning...
Follow me on LinkedIn and X for updates.
Happy Analyzing!