Day 2: Pandas and Data Analysis
Self Intro¶
Hi Everyone,
Welcome back to my blog! Here we are at Day 2 of learning Machine Learning.
I’m Rohan Sai, aka AiKnight.
A Joke to Lighten the Mood:¶
Why did the machine learning model break up with its dataset?
It just didn't find the relationship significant anymore!
Pretty Lame Ikr...
Today, we’re diving into one of the most powerful tools in the data science arsenal: the Pandas Library.
We'll cover what the Pandas library offers and what can be done with it, in detail.
Before starting, check out my Advanced Data Processor to clean, preprocess, and visualize datasets effortlessly.
Try it now at Streamlit App
AN INTERESTING MILESTONE:
In 2024, Google DeepMind's AlphaFold 3 made a groundbreaking leap by accurately predicting how proteins interact with DNA and other molecules, significantly accelerating research and drug discovery.
Enough talk. Let's dive in and unlock the power of Pandas together!
Pandas Library¶
The Pandas library is an open-source Python library that provides high-level data structures and methods designed to make data analysis and manipulation fast and easy. It is built on top of NumPy and is the most widely used library for data analysis tasks in Python.
Pandas has two primary data structures:
- Series – A one-dimensional labeled array.
- DataFrame – A two-dimensional labeled data structure, similar to a table or spreadsheet.
Pandas provides functionality to load, clean, manipulate, analyze, and visualize data, making it the go-to tool for data analysis in Python.
1. Pandas: Key Concepts and Data Structures¶
a. Pandas Series¶
A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.).
Key Features of Series:
- It has an index, which is used to access elements in the Series.
- It can be created from a list, a NumPy array, or a dictionary.
Example:
import pandas as pd
# Creating a Series
data = [10, 20, 30, 40, 50]
s = pd.Series(data)
print(s)
Output:
0 10
1 20
2 30
3 40
4 50
dtype: int64
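The example above builds a Series from a list; the bullet before it notes that a Series can also be built from a dictionary or a NumPy array. Here is a minimal sketch of both (the values and index labels are just illustrative):
import pandas as pd
import numpy as np
# From a dictionary: the keys become the index labels
s_dict = pd.Series({'a': 10, 'b': 20, 'c': 30})
print(s_dict['b'])  # 20 -- label-based access
# From a NumPy array, with an explicit custom index
s_arr = pd.Series(np.array([1.5, 2.5, 3.5]), index=['x', 'y', 'z'])
print(s_arr['z'])   # 3.5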
Key Operations on Series:
- Accessing Elements: You can access elements using the index.
print(s[0])  # Access the first element
- Indexing and Slicing: Like arrays, Series can be indexed or sliced.
print(s[:3])  # First three elements
- Element-wise Operations: Mathematical operations on Series elements.
print(s * 2)  # Multiply each element by 2
b. Pandas DataFrame¶
A DataFrame is a two-dimensional data structure that consists of rows and columns. It can be thought of as a table, where each column can hold different data types.
Key Features of DataFrame:
- It can be created from various data sources like CSV files, Excel files, dictionaries, or lists.
- A DataFrame has both row and column indices.
- Each column can have a different data type.
Example:
import pandas as pd
# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)
Output:
Name Age
0 Alice 25
1 Bob 30
2 Charlie 35
Key Operations on DataFrame:
- Accessing Columns: You can access columns as attributes or using df['column_name'].
print(df['Name']) # Access the 'Name' column
- Selecting Rows: You can select rows using .iloc[] or .loc[].
print(df.iloc[0]) # Access first row using integer index
print(df.loc[0]) # Access first row using label
- Slicing Data: DataFrames support slicing by rows or columns.
print(df[['Name']]) # Slicing 'Name' column
- Descriptive Statistics: Compute mean, median, min, max, etc., for each column.
print(df.describe()) # Get summary statistics of numerical columns
2. Commonly Used Pandas Functions¶
Here are some commonly used functions in Pandas for data manipulation, analysis, and cleaning.
a. pd.read_csv()¶
This function is used to read CSV files into a Pandas DataFrame.
df = pd.read_csv('data.csv')
Example:
import pandas as pd
df = pd.read_csv('data.csv')
print(df.head()) # Display first few rows of the dataset
b. pd.DataFrame()¶
This is used to create a DataFrame from various data sources, such as dictionaries, lists, and arrays.
df = pd.DataFrame(data)
Example:
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
print(df)
c. df.head() and df.tail()¶
head() returns the first 5 rows of the DataFrame by default; tail() returns the last 5 rows by default.
print(df.head())  # Displays the first 5 rows
print(df.tail())  # Displays the last 5 rows
d. df.info()¶
This function provides a concise summary of the DataFrame, including the number of non-null values, data types, and memory usage.
df.info()
e. df.describe()¶
This function generates descriptive statistics such as count, mean, standard deviation, minimum, and maximum values for numerical columns.
df.describe()
f. df.isnull() and df.dropna()¶
isnull() returns a DataFrame of boolean values indicating whether values are NaN (null); dropna() removes rows with missing data.
df.isnull()  # Check for missing values
df.dropna()  # Remove rows with missing values
g. df.groupby()¶
This function is used to group the DataFrame by one or more columns and then apply a function to each group.
grouped = df.groupby('column_name')
print(grouped.mean(numeric_only=True))  # Compute the mean of numeric columns for each group
h. df.merge()¶
This function is used to merge two DataFrames based on a common column.
merged_df = pd.merge(df1, df2, on='column_name')
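As a minimal, self-contained sketch (the df1/df2 frames and the 'id' column are made up for illustration):
import pandas as pd
df1 = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'score': [85, 90, 75]})
# Inner join (the default): keeps only ids present in both frames
merged_df = pd.merge(df1, df2, on='id')
print(merged_df)  # rows for id 1 and 2 only
Passing how='left', how='right', or how='outer' keeps unmatched rows from one or both sides.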
3. Advanced Pandas Operations¶
a. Data Cleaning with Pandas¶
- Handling Missing Data: You can use dropna(), fillna(), or forward/backward fill to handle missing data.
df.fillna(0)  # Replace missing values with 0
df.dropna()   # Drop rows with missing data
- Replacing Values: You can replace specific values using replace().
df.replace('old_value', 'new_value')
b. Data Transformation¶
- Normalization: Scale data to a specific range (usually 0 to 1).
df['Age'] = (df['Age'] - df['Age'].min()) / (df['Age'].max() - df['Age'].min())
- One-Hot Encoding: Convert categorical variables into a numerical format using one-hot encoding.
df_encoded = pd.get_dummies(df['Category'])
c. Pivoting and Reshaping¶
- Pivot Table: Used to summarize and aggregate data, similar to Excel pivot tables.
pivot_table = df.pivot_table(values='value_column', index='index_column', columns='category_column')
- Melt: Reshape DataFrame to a long format.
df_melted = df.melt(id_vars='id_column')
d. Time Series Analysis¶
Pandas provides several functions to work with time series data, such as pd.to_datetime() to convert dates and times into a datetime format, and df.resample() to resample data.
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
df_resampled = df.resample('M').mean() # Resample by month and compute mean
4. How Pandas is Used for Data Analysis¶
Pandas plays a crucial role in data analysis, offering a wide range of functionality to manipulate, analyze, and visualize data.
a. Data Cleaning and Preprocessing¶
Before performing analysis, it is crucial to clean the data. Pandas allows you to:
- Remove or fill missing values using dropna() or fillna().
- Convert data types with astype().
- Filter data based on conditions using boolean indexing.
Example:
df = df[df['Age'] > 30] # Filter rows where age > 30
b. Exploratory Data Analysis (EDA)¶
Pandas provides a set of methods to perform EDA, which is the process of analyzing and visualizing data to identify patterns, trends, and relationships. Functions such as groupby(), describe(), and pivot_table() help summarize and understand the data.
Example:
df.groupby('Category')['Sales'].mean() # Group data by 'Category' and compute mean sales
c. Statistical Analysis¶
Pandas allows for basic statistical analysis of data. You can calculate mean, median, variance, and perform more advanced techniques such as regression using external libraries (like statsmodels or scikit-learn).
Example:
df['Sales'].mean() # Calculate the mean of the 'Sales' column
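For instance, a simple linear regression with statsmodels might look like the sketch below (the 'Advertising' and 'Sales' columns and their values are hypothetical):
import pandas as pd
import statsmodels.api as sm
df = pd.DataFrame({'Advertising': [10, 20, 30, 40, 50],
                   'Sales': [25, 45, 70, 85, 110]})
X = sm.add_constant(df['Advertising'])  # add an intercept term
model = sm.OLS(df['Sales'], X).fit()    # ordinary least squares fit
print(model.params)                     # intercept and slope estimates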
d. Data Visualization¶
Though Pandas is not a visualization library, it integrates seamlessly with libraries like Matplotlib and Seaborn to plot graphs and visualize the data.
Example:
import matplotlib.pyplot as plt
df['Sales'].plot(kind='hist')  # Plot a histogram of 'Sales'
plt.show()
5. Benefits and Demerits of Pandas for Data Analysis¶
Benefits:¶
- Fast and Efficient: Pandas is optimized for performance and can handle large datasets efficiently.
- Versatile: Works well with data from various formats like CSV, Excel, SQL, and others.
- Powerful Data Manipulation: Supports a wide variety of operations like filtering, sorting, merging, and reshaping data.
- Integrated with Visualization: Easily integrates with Matplotlib and Seaborn for data visualization.
- Easy-to-Use: The API is intuitive and Pythonic, making it user-friendly.
Demerits:¶
- Memory Consumption: Pandas can be memory-intensive, especially with large datasets.
- Limited Performance for Extremely Large Datasets: While it handles large datasets well, it may struggle with datasets too large to fit in memory.
- Learning Curve: For beginners, there can be a learning curve due to the extensive functionality and methods.
Now let's take a dataset as an example and see how Pandas can be used for data analysis:
Pandas in Data Analysis¶
To provide a complete and practical overview of how Pandas is used in Data Analysis, we’ll work through a real dataset and explore how to perform various analysis tasks step by step using Pandas. This will involve:
- Data Loading and Preprocessing
- Exploratory Data Analysis (EDA)
- Data Cleaning
- Statistical Analysis
- Data Visualization
For this, let's use the commonly available "Titanic" dataset (bundled with the seaborn library), which contains information about passengers on the Titanic. This dataset is widely used for data analysis and machine learning exercises.
We’ll go through the tasks and show the Pandas functions used, along with the code and outputs at each step.
1. Data Loading and Preprocessing¶
Loading the Dataset¶
We will load the Titanic dataset directly from the seaborn library, which comes pre-packaged with some datasets, including Titanic.
import seaborn as sns
import pandas as pd
# Load Titanic dataset
df = sns.load_dataset('titanic')
# Show the first few rows of the DataFrame
print(df.head())
Output:
survived pclass sex age sibsp parch fare embarked class who ...
0 0 3 male 22.0 1 0 7.250 S Third man ...
1 1 1 female 38.0 1 0 71.283 C First woman ...
2 1 3 female 26.0 0 0 7.925 S Third woman ...
3 1 1 female 35.0 1 0 53.100 S First woman ...
4 0 3 male 35.0 0 0 8.050 S Third man ...
This dataset contains information about passengers on the Titanic, such as their survival status, pclass (passenger class), sex, age, fare, and embarked (port of embarkation).
2. Exploratory Data Analysis (EDA)¶
Summarizing the Data¶
To understand the dataset, we can start by getting basic descriptive statistics, such as mean, standard deviation, and percentiles.
# Get a summary of the numerical columns
print(df.describe())
Output:
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
Missing Data Analysis¶
To understand the completeness of the data, we check for missing values.
# Check for missing data
print(df.isnull().sum())
Output:
survived 0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
...
This indicates that the age column has 177 missing values and the embarked column has 2 missing values.
3. Data Cleaning¶
Handling Missing Values¶
We can either drop the rows with missing values or fill them with a reasonable value. Let’s fill the missing values in Age with the median and the missing values in Embarked with the mode (most frequent value).
# Fill missing 'age' with the median
df['age'] = df['age'].fillna(df['age'].median())
# Fill missing 'embarked' with the most frequent value
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
# Check again for missing values
print(df.isnull().sum())
Output:
survived 0
pclass 0
sex 0
age 0
sibsp 0
parch 0
fare 0
embarked 0
class 0
who 0
...
Converting Data Types¶
We can convert the Survived column to a boolean type (True or False), which will be more informative.
# Convert 'survived' to a boolean type
df['survived'] = df['survived'].astype(bool)
# Check the data types of the columns
print(df.dtypes)
Output:
survived bool
pclass int64
sex object
age float64
sibsp int64
parch int64
fare float64
embarked object
class object
who object
...
4. Statistical Analysis¶
We can perform some statistical analysis to understand relationships in the data.
Correlation Analysis¶
For numerical columns like age, fare, and survived, we can check the correlation to see how strongly variables are related.
# Calculate correlation between numerical features
print(df.corr(numeric_only=True))
Output:
survived pclass age sibsp parch fare
survived 1.000000 -0.338481 -0.077221 -0.035523 -0.017901 0.257308
pclass -0.338481 1.000000 -0.369226 0.105868 0.072712 -0.549500
age -0.077221 -0.369226 1.000000 0.035634 0.004070 -0.096067
sibsp -0.035523 0.105868 0.035634 1.000000 0.410798 0.010388
parch -0.017901 0.072712 0.004070 0.410798 1.000000 0.081629
fare 0.257308 -0.549500 -0.096067 0.010388 0.081629 1.000000
This tells us that pclass is negatively correlated with survived (higher class numbers, i.e. lower classes, had lower survival rates), and fare has a moderate positive correlation with survival (passengers who paid higher fares tended to have better chances of survival).
5. Data Visualization¶
Visualizing the Data¶
Data visualization is key to understanding patterns and trends in the dataset. We can use libraries like Matplotlib and Seaborn to visualize the data.
import seaborn as sns
import matplotlib.pyplot as plt
# Distribution of Age
sns.histplot(df['age'], kde=True)
plt.title('Age Distribution')
plt.show()
# Survival rate by Class
sns.barplot(x='pclass', y='survived', data=df)
plt.title('Survival Rate by Class')
plt.show()
# Survival rate by Gender
sns.barplot(x='sex', y='survived', data=df)
plt.title('Survival Rate by Gender')
plt.show()
Visualization Output:¶
- Age Distribution: A histogram that shows the distribution of passengers' ages.
- Survival Rate by Class: A bar plot showing the survival rate for each passenger class (1st, 2nd, 3rd).
- Survival Rate by Gender: A bar plot showing the survival rate for male and female passengers.
6. Conclusion: Insights from Data Analysis¶
- Age: Passengers' ages range from infants to elderly, with most passengers being between 20 and 40 years old.
- Survival Rate: Women had a higher survival rate than men. The first-class passengers had the highest survival rate.
- Class: Passengers in first class had a higher survival rate than those in third class.
- Fare: There seems to be a positive correlation between the fare paid and the survival rate.
Beyond this, Pandas has some wild tricks up its sleeve: brace yourself, it's like discovering your dataset's secret superpowers!
Advanced and Unique Ways to Use Pandas¶
While Pandas is a highly versatile library for data manipulation and analysis, there are many advanced and less commonly used functionalities that can help with different types of data analysis, especially when working with large datasets or performing complex analyses. These include methods that streamline workflows, integrate well with other libraries, and provide powerful capabilities beyond typical data cleaning, EDA, and basic analysis.
Let's explore some of these advanced techniques and unique functionalities offered by Pandas, with practical examples:
1. Multi-Indexing (Hierarchical Indexing)¶
Pandas allows the creation of multi-level indices or hierarchical indexing, which is useful for handling high-dimensional data in a more readable and efficient manner. Multi-indexing is particularly helpful when working with time-series data, grouped data, or when you need to organize data in a nested structure.
Example: Multi-Indexing with DataFrames¶
import pandas as pd
# Sample data with multi-level index
data = {
'Product': ['A', 'A', 'B', 'B'],
'Region': ['East', 'West', 'East', 'West'],
'Sales': [200, 300, 250, 350]
}
df = pd.DataFrame(data)
# Set multi-index
df.set_index(['Product', 'Region'], inplace=True)
print(df)
Output:
Sales
Product Region
A East 200
West 300
B East 250
West 350
Benefits:
- Allows for efficient querying and aggregation across multiple levels.
- Makes hierarchical data more intuitive to analyze.
- Useful for pivot tables, group-by operations, and more.
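For instance, the multi-index built above can be queried at either level. A small self-contained sketch:
import pandas as pd
df = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'B'],
    'Region': ['East', 'West', 'East', 'West'],
    'Sales': [200, 300, 250, 350]
}).set_index(['Product', 'Region'])
print(df.loc['A'])                     # outer level: all rows for product 'A'
print(df.xs('East', level='Region'))   # inner level: 'East' across all products
print(df.loc[('B', 'West'), 'Sales'])  # a single cell via the full index tuple -> 350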
2. Window Functions¶
Pandas provides window functions (also called rolling windows) for performing calculations over a moving window of data. These are commonly used for time-series analysis, calculating moving averages, or smoothing out fluctuations.
Example: Rolling Window Calculations¶
# Create a sample time series
data = {'Date': pd.date_range('2023-01-01', periods=10, freq='D'),
'Sales': [120, 140, 130, 160, 180, 200, 210, 190, 230, 220]}
df = pd.DataFrame(data)
df.set_index('Date', inplace=True)
# Calculate 3-day moving average for 'Sales'
df['3_day_avg'] = df['Sales'].rolling(window=3).mean()
print(df)
Output:
Sales 3_day_avg
Date
2023-01-01 120 NaN
2023-01-02 140 NaN
2023-01-03 130 130.000000
2023-01-04 160 143.333333
2023-01-05 180 156.666667
2023-01-06 200 180.000000
2023-01-07 210 196.666667
2023-01-08 190 200.000000
2023-01-09 230 210.000000
2023-01-10 220 213.333333
Benefits:
- Smooths out short-term fluctuations.
- Provides a more realistic analysis for time-series forecasting.
- Allows for flexible window size and aggregation functions.
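As a small follow-up sketch of that flexibility, min_periods can suppress the leading NaN values, and other aggregations can run over the same window:
import pandas as pd
s = pd.Series([120, 140, 130, 160, 180])
# min_periods=1: emit a value even before the window is full
print(s.rolling(window=3, min_periods=1).mean())
# Other aggregations over the same 3-element window
print(s.rolling(window=3).max())
print(s.rolling(window=3).std())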
3. Custom Aggregations with GroupBy¶
Beyond simple aggregation functions (sum, mean, etc.), Pandas GroupBy allows custom aggregation functions to be applied to each group in a dataset. This is useful for advanced statistical analysis and personalized reporting.
Example: Custom Aggregation¶
# Sample data
data = {
'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
'Sales': [500, 600, 450, 700, 550, 650, 400, 750],
'Profit': [50, 70, 45, 80, 60, 75, 40, 85]
}
df = pd.DataFrame(data)
# Group by Region and calculate custom aggregation
agg_funcs = {
'Sales': 'sum',
'Profit': lambda x: (x.mean() + x.std())
}
result = df.groupby('Region').agg(agg_funcs)
print(result)
Output:
         Sales     Profit
Region
East       850  46.035534
North     1050  62.071068
South     1250  76.035534
West      1450  86.035534
Benefits:
- Enables complex, domain-specific aggregations.
- Ideal for summarizing and analyzing grouped data in specialized ways.
- Flexible use of lambda functions for custom operations.
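A related option is named aggregation, where each output column is declared as an (input column, function) pair; a minimal sketch with the same kind of data:
import pandas as pd
df = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'South'],
    'Sales': [500, 600, 550, 650]
})
# Each keyword becomes an output column: name=(input column, aggregation)
result = df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_sales=('Sales', 'mean'),
)
print(result)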
4. Pivot Tables¶
Pandas pivot_table function creates pivot tables to summarize and reorganize data, which is a powerful way to aggregate data based on multiple categorical variables.
Example: Pivot Table for Data Aggregation¶
# Create sample sales data
data = {
'Date': pd.date_range('2023-01-01', periods=12, freq='M'),
'Category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Sales': [200, 300, 150, 250, 180, 230, 220, 270, 210, 320, 190, 210]
}
df = pd.DataFrame(data)
# Create a pivot table
pivot_df = df.pivot_table(values='Sales', index='Category', aggfunc='sum')
print(pivot_df)
Output:
          Sales
Category
A          1150
B          1580
Benefits:
- Useful for summarizing data with multiple variables.
- Can be customized with various aggregation functions like sum, mean, count, etc.
- Simplifies the interpretation of large datasets by summarizing key insights.
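To aggregate across multiple categorical variables at once, add a columns argument; a small sketch with a hypothetical Region column:
import pandas as pd
df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'B'],
    'Region': ['East', 'East', 'West', 'West'],
    'Sales': [200, 300, 150, 250]
})
# Rows = Category, columns = Region, cells = summed Sales
pivot_df = df.pivot_table(values='Sales', index='Category',
                          columns='Region', aggfunc='sum')
print(pivot_df)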
5. Melt and Pivot for Data Transformation¶
The melt function in Pandas is used to transform a dataset from wide format to long format, making it easier to analyze in many cases. Similarly, pivot allows you to go from long format to wide format.
Example: Using Melt and Pivot¶
# Sample data
data = {
'Year': [2020, 2020, 2021, 2021],
'Region': ['North', 'South', 'North', 'South'],
'Sales_Q1': [150, 180, 170, 210],
'Sales_Q2': [160, 190, 180, 220]
}
df = pd.DataFrame(data)
# Melt the dataset
melted_df = pd.melt(df, id_vars=['Year', 'Region'], value_vars=['Sales_Q1', 'Sales_Q2'], var_name='Quarter', value_name='Sales')
print(melted_df)
Output:
Year Region Quarter Sales
0 2020 North Sales_Q1 150
1 2020 South Sales_Q1 180
2 2021 North Sales_Q1 170
3 2021 South Sales_Q1 210
4 2020 North Sales_Q2 160
5 2020 South Sales_Q2 190
6 2021 North Sales_Q2 180
7 2021 South Sales_Q2 220
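Continuing the example, the melted frame can be pivoted back to wide format (a sketch assuming pandas >= 1.1, where pivot accepts a list of index columns):
# Reverse the melt: one column per Quarter again
wide_df = melted_df.pivot(index=['Year', 'Region'],
                          columns='Quarter', values='Sales').reset_index()
print(wide_df)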
Benefits:
- Easily converts between long and wide formats.
- Enables data transformation that aligns with analytical needs.
- Facilitates reshaping and restructuring data for more efficient analysis.
6. Apply Function for Complex Data Operations¶
The apply function allows you to apply a custom function across a DataFrame or Series, making it very powerful for data manipulation and transformations.
Example: Using Apply for Custom Operations¶
# Sample data
data = {'Age': [25, 30, 35, 40, 45], 'Income': [40000, 50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
# Define a custom function to categorize income
def income_category(income):
    if income < 50000:
        return 'Low'
    elif 50000 <= income < 70000:
        return 'Medium'
    else:
        return 'High'
# Apply the function to the Income column
df['Income Category'] = df['Income'].apply(income_category)
print(df)
Output:
Age Income Income Category
0 25 40000 Low
1 30 50000 Medium
2 35 60000 Medium
3 40 70000 High
4 45 80000 High
Benefits:
- Powerful for custom transformations.
- Helps with applying complex logic across DataFrames or Series.
- Allows for flexible use cases, like conditional operations, transformations, and more.
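One caveat: apply calls a Python function per element, which can be slow on large data. For binning like this, a vectorized alternative is pd.cut; a sketch reproducing the same buckets:
import pandas as pd
df = pd.DataFrame({'Income': [40000, 50000, 60000, 70000, 80000]})
# Same Low/Medium/High buckets, vectorized
df['Income Category'] = pd.cut(
    df['Income'],
    bins=[0, 50000, 70000, float('inf')],
    labels=['Low', 'Medium', 'High'],
    right=False,  # left-inclusive bins: [0, 50000), [50000, 70000), [70000, inf)
)
print(df)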
7. Efficient Memory Management with Categorical Data Type¶
Pandas allows for efficient memory usage by converting text or string columns to the Categorical data type. This can significantly reduce memory usage, especially with large datasets containing repetitive strings.
Example: Using Categorical Data Type¶
# Sample data
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B']}
df = pd.DataFrame(data)
# Convert 'Category' column to categorical
df['Category'] = df['Category'].astype('category')
print(df.memory_usage())
Output:
Index 128
Category 32
dtype: int64
Benefits:
- Reduces memory consumption for categorical data.
- Useful for datasets with repetitive textual data.
- Speeds up operations like group-by and pivoting due to faster comparisons.
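To see the saving more clearly, compare deep memory usage before and after the conversion (a small sketch; exact byte counts vary by pandas version and platform):
import pandas as pd
df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B'] * 1000})
print(df['Category'].memory_usage(deep=True))  # object dtype: one Python string per row
df['Category'] = df['Category'].astype('category')
print(df['Category'].memory_usage(deep=True))  # small integer codes + 3 categories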
8. Advanced Data Profiling with pandas-profiling¶
pandas-profiling (now maintained under the name ydata-profiling) is a Python library that enhances data exploration by automating the generation of comprehensive reports for a dataset. It is a powerful addition to the Pandas ecosystem, simplifying the identification of trends, patterns, and anomalies.
Example: Generating a Profiling Report¶
# Import required packages
import pandas as pd
from pandas_profiling import ProfileReport
# Load a dataset
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
'Age': [24, 27, 22, 32, 29],
'Score': [85.5, 62.3, 88.8, None, 72.5]}
df = pd.DataFrame(data)
# Generate a profiling report
profile = ProfileReport(df, title='Dataset Profiling Report', explorative=True)
# Save the report as an HTML file
profile.to_file(output_file="profiling_report.html")
Output:
An interactive HTML report containing:
- Descriptive statistics: Mean, median, standard deviation, etc.
- Distributions: Histograms for numerical columns.
- Correlations: Pearson, Spearman, and Kendall coefficients.
- Warnings: Alerts for missing values, duplicates, and high cardinality.
Benefits:¶
- Automates exploratory data analysis.
- Saves time for data scientists and analysts.
- Generates ready-to-share, visually appealing reports.
- Facilitates quick identification of potential data quality issues.
Demerits:¶
- Computationally expensive for very large datasets.
- Limited customization for specific visualization styles.
Don't forget to check out the Advanced Data Processor project... Try it now at Streamlit App
Stay tuned for Day 3, where we’ll unlock even more machine learning...
Follow me on LinkedIn and X for updates.
Happy Analyzing!