Day 1. Python & NumPy + Project: Data Processing Pipeline
Introduction¶
Hello, everyone! My name is Rohan Sai, and I go by the pen name Aiknight. I’m thrilled to invite you on a structured journey through the world of machine learning (ML) with my series, "120 Days of Machine Learning." This series is designed to break down complex ML concepts into simple, digestible lessons, making it accessible to learners at all levels.
Over the next 120 days, we’ll delve deep into the core ideas, practical tools, and state-of-the-art advancements in machine learning. From understanding how machines learn patterns in data to implementing sophisticated algorithms, this series will cover:
- Foundational topics like linear regression and decision trees.
- Core areas such as supervised and unsupervised learning.
- Advanced techniques like neural networks, transformers, and GANs.

We’ll use popular ML libraries such as scikit-learn, TensorFlow, and PyTorch to turn theory into practice. Each post will build on the previous one, taking you step-by-step toward mastery in machine learning.
To kick off the series, Day 1 covers foundational concepts; the code for the data processing pipeline is available in the accompanying Kaggle notebook.
Transform your data like a pro! Explore the Advanced Data Processor to clean, preprocess, and visualize datasets effortlessly. Try it now via the Streamlit App and take your data analysis to the next level!
Whether you’re a beginner exploring ML for the first time or an experienced enthusiast looking to sharpen your skills, this series promises a blend of learning, experimentation, and innovation. Let’s begin this exciting journey together!
This session is dedicated to building a strong foundation in Python and NumPy, essential for working in Machine Learning (ML). Understanding these tools deeply will not only enhance your ability to write efficient code but also prepare you for solving real-world ML problems effectively.
Python: The Foundation of ML¶
Python is a versatile programming language with extensive libraries that make it a top choice for ML. Here's a deep dive into Python essentials:
1. Python Basics¶
Variables and Data Types¶
- Concept: Variables are used to store data, and Python dynamically assigns types to variables.
- Example:
```python
# Variables
integer_var = 10      # Integer
float_var = 3.14      # Float
string_var = "Hello"  # String
boolean_var = True    # Boolean

print(type(integer_var))  # Output: <class 'int'>
```
Type Conversion¶
- Concept: Converting one data type to another when necessary.
- Example:
num = "100" converted_num = int(num) # String to Integer print(type(converted_num)) # Output: <class 'int'>
2. Python Data Structures¶
Lists¶
- Concept: Mutable, ordered collections.
- Example:
```python
my_list = [1, 2, 3, 4]
my_list.append(5)  # Add an element
print(my_list)     # Output: [1, 2, 3, 4, 5]
```
Dictionaries¶
- Concept: Key-value pairs for fast lookups.
- Example:
```python
my_dict = {'name': 'Alice', 'age': 25}
print(my_dict['name'])  # Output: Alice
```
Sets¶
- Concept: Unordered collections with unique elements.
- Example:
```python
my_set = {1, 2, 3, 2}  # Duplicate 2 is ignored
print(my_set)          # Output: {1, 2, 3}
```
Tuples¶
- Concept: Immutable ordered collections.
- Example:
```python
my_tuple = (1, 2, 3)
print(my_tuple[1])  # Output: 2
```
3. Control Flow¶
Conditionals¶
- Concept: Decisions based on conditions.
- Example:
```python
x = 10
if x > 5:
    print("x is greater than 5")
else:
    print("x is less than or equal to 5")
```
Loops¶
For Loop:

```python
for i in range(5):
    print(i)  # Output: 0 1 2 3 4
```

While Loop:

```python
count = 0
while count < 3:
    print(count)
    count += 1
```
4. Functions¶
Defining Functions¶
- Concept: Reusable blocks of code.
- Example:
```python
def greet(name):
    return f"Hello, {name}"

print(greet("Alice"))  # Output: Hello, Alice
```
Lambda Functions¶
- Concept: Anonymous, one-line functions.
- Example:
```python
square = lambda x: x ** 2
print(square(4))  # Output: 16
```
5. File Handling¶
Reading and Writing Files¶
- Concept: Manage data storage and retrieval.
- Example:
```python
with open('sample.txt', 'w') as file:
    file.write("Hello, ML!")

with open('sample.txt', 'r') as file:
    print(file.read())  # Output: Hello, ML!
```
NumPy: Numerical Python for ML¶
NumPy is the backbone of numerical computing in Python, offering powerful array manipulation and mathematical functions.
1. Introduction to NumPy¶
Concept: NumPy arrays provide efficient storage and computation compared to Python lists.¶
- Installation:
```
pip install numpy
```
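To see why NumPy arrays beat plain Python lists for numerical work, here is a small timing sketch (exact numbers depend on your machine; the point is the relative gap between a list comprehension and a vectorized NumPy operation):

```python
import timeit

import numpy as np

n = 1_000_000
py_list = list(range(n))
np_arr = np.arange(n)

list_time = timeit.timeit(lambda: [x * 2 for x in py_list], number=10)
numpy_time = timeit.timeit(lambda: np_arr * 2, number=10)
print(f"list: {list_time:.3f}s, numpy: {numpy_time:.3f}s")  # NumPy is typically much faster
```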
2. NumPy Arrays¶
Creating Arrays¶
1D Array:

```python
import numpy as np

arr = np.array([1, 2, 3])
print(arr)  # Output: [1 2 3]
```

2D Array:

```python
mat = np.array([[1, 2], [3, 4]])
print(mat)
```

Special Arrays:

```python
zeros = np.zeros((2, 2))  # 2x2 zero matrix
ones = np.ones((3, 3))    # 3x3 matrix of ones
identity = np.eye(3)      # 3x3 identity matrix
```
3. Array Indexing and Slicing¶
Accessing Elements:

```python
print(arr[0])  # Output: 1
```

Slicing:

```python
print(mat[1, :])  # Output: [3 4]
```
4. Array Operations¶
Arithmetic Operations¶
- Concept: Element-wise operations.
- Example:
```python
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print(arr1 + arr2)  # Output: [5 7 9]
```
Broadcasting¶
- Concept: Operations on arrays with different shapes.
- Example:
```python
print(arr1 * 2)  # Output: [2 4 6]
```
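Broadcasting really pays off when the shapes differ; a small sketch (using the same NumPy arrays convention as above) that adds a 1D row vector to every row of a 2D matrix:

```python
matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])  # shape (2, 3)
row = np.array([10, 20, 30])    # shape (3,)

# The row is broadcast across both rows of the matrix.
print(matrix + row)
# [[11 22 33]
#  [14 25 36]]
```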
5. Mathematical Operations¶
Mean, Median, and Standard Deviation:

```python
print(np.mean(arr), np.median(arr), np.std(arr))
```

Dot Product:

```python
print(np.dot(arr1, arr2))  # Output: 32
```
6. Reshaping and Flattening¶
Reshape:

```python
reshaped = arr.reshape(3, 1)
print(reshaped)  # 3 rows, 1 column
```

Flatten:

```python
print(mat.flatten())  # Output: [1 2 3 4]
```
7. Random Numbers¶
- Random Array:

```python
rand_arr = np.random.rand(3, 3)  # 3x3 array of values drawn uniformly from [0, 1)
print(rand_arr)
```
Data Processing Pipeline¶
1. Imputation Methods¶
Imputation is the process of filling in missing values in a dataset. Missing data can negatively impact the performance of machine learning models, and imputing these values ensures that the dataset is usable.
Here's a comprehensive explanation of each method with formulas, examples, why we use them, and what they are used for in the context of data preprocessing.
1.1 Mean Imputation¶
Formula:
Replace missing values $ x_{\text{missing}} $ with the mean of the column:
$ x_{\text{imputed}} = \frac{1}{N} \sum_{i=1}^{N} x_i $
where $ N $ is the number of non-missing values.

Example:
Column: [5, 7, NaN, 9]
Mean = $ \frac{5 + 7 + 9}{3} = 7 $
Imputed column: [5, 7, 7, 9].

Why Use It:
Maintains central tendency and avoids dropping data.

Used For:
Numerical features with a normal distribution and minimal outliers.
1.2 Median Imputation¶
Formula:
Replace missing values with the median of the column (middle value when sorted).
For even-length arrays: $ \text{Median} = \frac{x_{(N/2)} + x_{(N/2+1)}}{2} $

Example:
Column: [4, NaN, 8, 6]
Sorted non-missing values: [4, 6, 8]
Median = 6.
Imputed column: [4, 6, 8, 6].

Why Use It:
Handles skewed data better than the mean.

Used For:
Numerical data with outliers.
1.3 Mode Imputation¶
Formula:
Replace missing values with the most frequent value: $ x_{\text{imputed}} = \text{Mode}(x) $

Example:
Column: [1, 2, 2, NaN, 3]
Mode = 2.
Imputed column: [1, 2, 2, 2, 3].

Why Use It:
Suitable for categorical data, preserving its most frequent category.

Used For:
Handling missing values in categorical features.
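Mean, median, and mode imputation are all available through scikit-learn's SimpleImputer; a minimal sketch using the toy columns from the examples above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

numeric = np.array([[5.0], [7.0], [np.nan], [9.0]])
categorical = np.array([["red"], ["blue"], [np.nan], ["blue"]], dtype=object)

# strategy can be "mean", "median", or "most_frequent" (mode)
print(SimpleImputer(strategy="mean").fit_transform(numeric).ravel())    # [5. 7. 7. 9.]
print(SimpleImputer(strategy="median").fit_transform(numeric).ravel())  # [5. 7. 7. 9.]
print(SimpleImputer(strategy="most_frequent").fit_transform(categorical).ravel())
# ['red' 'blue' 'blue' 'blue']
```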
1.4 K-Nearest Neighbors (KNN) Imputation¶
Formula:
Find the $k$ nearest neighbors and impute with their weighted average:
$ x_{\text{imputed}} = \frac{\sum_{i=1}^{k} w_i \cdot x_i}{\sum_{i=1}^{k} w_i} $
where $ w_i $ is a weight based on distance.

Example:
Column: [5, 7, NaN, 9]
If $ k = 2 $, the nearest neighbors of the missing value are 7 and 9.
Imputed value: $ \frac{7 + 9}{2} = 8 $.

Why Use It:
Leverages relationships between features.

Used For:
Complex datasets where missing values correlate with other features.
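In code, scikit-learn's KNNImputer handles this; a minimal sketch on a small two-feature toy dataset (made up for illustration, since KNN imputation needs other features to measure distance):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Two related features; the second value in the last row is missing.
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [2.5, np.nan],
])

imputer = KNNImputer(n_neighbors=2)  # average the 2 nearest rows
print(imputer.fit_transform(X))      # the NaN becomes 25.0, the mean of its neighbors
```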
1.5 Constant Imputation¶
Formula:
Replace missing values with a fixed constant, e.g., 0 or "Unknown."

Example:
Column: [NaN, 3, NaN, 5]
Imputed column: [0, 3, 0, 5].

Why Use It:
Highlights missing data explicitly without introducing assumptions.

Used For:
Features where missing values have specific meanings.
1.6 Drop¶
Formula:
Remove rows or columns with missing values.

Example:
Original dataset: $ \begin{bmatrix} 1 & 2 & \text{NaN} \\ 3 & 4 & 5 \\ \text{NaN} & 6 & 7 \end{bmatrix} $
After dropping rows with missing values: $ \begin{bmatrix} 3 & 4 & 5 \end{bmatrix} $

Why Use It:
Simplifies datasets with excessive missing values.

Used For:
When missing data is minimal or irrecoverable.
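Constant imputation and dropping are one-liners in pandas; a minimal sketch on a made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": [1.0, 3.0, np.nan],
    "b": [2.0, 4.0, 6.0],
    "c": [np.nan, 5.0, 7.0],
})

print(df.fillna(0))        # constant imputation with 0
print(df.dropna())         # drop rows containing any missing value
print(df.dropna(axis=1))   # drop columns containing any missing value
```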
2. Outlier Detection Methods¶
Outliers are extreme values in the dataset that can skew analyses and reduce model accuracy. These methods help detect and handle them.
2.1 Z-Score¶
Formula:
$ z = \frac{x - \mu}{\sigma} $
where $ \mu $ is the mean and $ \sigma $ is the standard deviation.

Example:
Column: [10, 12, 13, 100]
Mean = 33.75, $ \sigma \approx 38.27 $.
Z-scores: approximately [-0.62, -0.57, -0.54, 1.73].

Why Use It:
Identifies outliers beyond a threshold (e.g., $ |z| > 3 $).

Used For:
Normally distributed numerical data.
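A minimal NumPy sketch of z-score flagging (the usual cutoff is $|z| > 3$; it is loosened here only because the toy sample is tiny):

```python
import numpy as np

data = np.array([10, 12, 13, 100], dtype=float)
z_scores = (data - data.mean()) / data.std()  # population standard deviation

print(z_scores)                      # approx [-0.62 -0.57 -0.54  1.73]
print(data[np.abs(z_scores) > 1.5])  # [100.]
```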
2.2 IQR (Interquartile Range)¶
Formula:
$ \text{IQR} = Q3 - Q1 $
Outliers lie outside:
$ [Q1 - 1.5 \cdot \text{IQR}, Q3 + 1.5 \cdot \text{IQR}] $

Example:
Column: [4, 5, 6, 20]
With linear interpolation (NumPy's default), Q1 = 4.75 and Q3 = 9.5.
IQR = 4.75.
Outlier bounds: approximately [-2.4, 16.6], so 20 is flagged as an outlier.

Why Use It:
Robust against skewed data.

Used For:
Numerical features with non-normal distributions.
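The same rule in NumPy; a minimal sketch using np.percentile (linear interpolation is its default):

```python
import numpy as np

data = np.array([4, 5, 6, 20], dtype=float)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(lower, upper)                           # approx -2.375 16.625
print(data[(data < lower) | (data > upper)])  # [20.]
```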
2.3 Isolation Forest¶
Concept:
Randomly partitions the data; anomalies are isolated in fewer splits than normal points.

Example:
For 2D data points, anomalies take fewer random splits to isolate.

Why Use It:
Efficient for large, high-dimensional datasets.

Used For:
Unsupervised anomaly detection.
2.4 Local Outlier Factor (LOF)¶
Formula:
Computes the local density deviation of a point relative to its neighbors:
$ \text{LOF}(p) = \frac{\text{Average Local Density of Neighbors}}{\text{Local Density of } p} $

Why Use It:
Identifies local anomalies based on density.

Used For:
Datasets with variable densities.
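Both detectors are available in scikit-learn; a minimal sketch on made-up 2D data (the contamination value is an assumption about the expected outlier fraction, and both estimators return 1 for inliers and -1 for outliers):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])  # one obvious anomaly

iso = IsolationForest(contamination=0.01, random_state=0)
print(iso.fit_predict(X)[-1])  # expected: -1 (the injected point is flagged)

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
print(lof.fit_predict(X)[-1])  # expected: -1 as well
```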
3. Scaling Methods¶
Feature scaling ensures that numerical features contribute equally to model performance by normalizing or standardizing their ranges.
3.1 Standard Scaling¶
Formula: $ x_{\text{scaled}} = \frac{x - \mu}{\sigma} $ where $ \mu $ is the mean and $ \sigma $ is the standard deviation.
Example:
Column: [50, 60, 70]
Mean = 60; using the sample standard deviation (10), the scaled values are $[-1, 0, 1]$. (scikit-learn's StandardScaler uses the population standard deviation, giving approximately $[-1.22, 0, 1.22]$.)

Why Use It:
Centers data around 0 with a standard deviation of 1 and removes the impact of units.

Used For:
Algorithms sensitive to feature scales (e.g., SVM, Logistic Regression).
3.2 Min-Max Scaling¶
Formula:
$ x_{\text{scaled}} = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} $
Scaled values are in the range [0, 1].
Example:
Column: [50, 60, 70]
Min = 50, Max = 70.
Scaled values: [0, 0.5, 1].

Why Use It:
Scales all features to the same range while preserving the shape of the distribution.

Used For:
Algorithms requiring bounded inputs (e.g., Neural Networks).
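A minimal scikit-learn sketch of both scalers on the single column from the examples above:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[50.0], [60.0], [70.0]])

print(StandardScaler().fit_transform(X).ravel())  # approx [-1.22  0.    1.22]
print(MinMaxScaler().fit_transform(X).ravel())    # [0.  0.5 1. ]
```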
3.3 Robust Scaling¶
Formula: $ x_{\text{scaled}} = \frac{x - \text{Median}}{\text{IQR}} $
Example:
Column: [1, 2, 100]
Median = 2; with linear-interpolation quartiles, Q1 = 1.5 and Q3 = 51, so IQR = 49.5.
Scaled values: approximately [-0.02, 0, 1.98].

Why Use It:
Reduces the influence of outliers; note how 100 maps to roughly 2 instead of dominating the scale.

Used For:
Features with extreme outliers.
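The scikit-learn equivalent is RobustScaler, which centers on the median and scales by the IQR; a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.0], [2.0], [100.0]])
print(RobustScaler().fit_transform(X).ravel())  # approx [-0.02  0.    1.98]
```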
3.4 Power Scaling (Box-Cox, Yeo-Johnson)¶
Formula:
For Box-Cox: $ x' = \frac{x^\lambda - 1}{\lambda}, \, \lambda \neq 0 $
For Yeo-Johnson: a similar formula that also handles zero and negative values.

Example:
Column: [1, 10, 100].
Box-Cox with $ \lambda = 0.5 $: approximately [0, 4.32, 18].

Why Use It:
Makes data more Gaussian-like.

Used For:
Transforming skewed data.
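In scikit-learn both variants live in PowerTransformer, which also estimates $\lambda$ for you (and standardizes the output by default); a minimal sketch:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[1.0], [10.0], [100.0]])

boxcox = PowerTransformer(method="box-cox")          # strictly positive data only
yeojohnson = PowerTransformer(method="yeo-johnson")  # also handles zeros and negatives

print(boxcox.fit_transform(X).ravel())
print(yeojohnson.fit_transform(X).ravel())
```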
4. Feature Selection Methods¶
Feature selection identifies and retains the most relevant features for the model.
4.1 Variance Threshold¶
Concept:
Remove features with variance below a threshold:
$ \text{Variance} = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N} $

Example:
Column A (constant values): [3, 3, 3]. Variance = 0.
Remove Column A.

Why Use It:
Low-variance features contribute little to the model.

Used For:
Initial dimensionality reduction.
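A minimal sketch with scikit-learn's VarianceThreshold, where the first (constant) column is dropped:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([
    [3, 1],
    [3, 2],
    [3, 5],
])  # the first feature is constant

selector = VarianceThreshold(threshold=0.0)  # remove zero-variance features
print(selector.fit_transform(X))             # only the second column survives
```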
4.2 Correlation¶
Formula:
Pearson Correlation Coefficient:
$ r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i - \bar{x})^2 \sum(y_i - \bar{y})^2}} $

Example:
Features A and B: [1, 2, 3] and [2, 4, 6].
$ r = 1 $ (perfect correlation), so one of the two can be dropped.

Why Use It:
Prevents multicollinearity in models.

Used For:
Highly correlated features.
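A minimal pandas sketch of correlation-based filtering (the column names and the 0.95 cutoff are illustrative choices):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [2, 4, 6], "c": [7, 1, 5]})
corr = df.corr().abs()

# Drop one feature from every pair whose absolute correlation exceeds 0.95.
to_drop = [col for i, col in enumerate(corr.columns) if any(corr.iloc[i, :i] > 0.95)]
print(df.drop(columns=to_drop))  # 'b' is dropped because corr(a, b) == 1
```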
4.3 Mutual Information¶
Concept:
Measures the dependency between variables:
$ I(X; Y) = \sum_{x \in X, y \in Y} P(x, y) \log\left(\frac{P(x, y)}{P(x)P(y)}\right) $

Example:
Feature A strongly predicts target B, so it receives a high mutual information score.

Why Use It:
Selects features that contribute most to the target.

Used For:
Classification or regression problems.
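A minimal sketch with scikit-learn's mutual_info_classif on made-up data, where the first feature tracks the target and the second is pure noise:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)                   # binary target
informative = y + rng.normal(scale=0.1, size=200)  # closely follows the target
noise = rng.normal(size=200)                       # unrelated feature

X = np.column_stack([informative, noise])
print(mutual_info_classif(X, y, random_state=0))   # first score should be much higher
```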
4.4 Chi-Square Test¶
Formula:
$ \chi^2 = \sum \frac{(O - E)^2}{E} $ where $ O $ is observed, and $ E $ is expected frequency.
Example:
Categorical data (Observed): [30, 20], Expected: [25, 25].
$ \chi^2 = \frac{(30-25)^2}{25} + \frac{(20-25)^2}{25} = 2 $.

Why Use It:
Measures independence between categorical features and the target.

Used For:
Feature selection in categorical data.
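A minimal sketch with scikit-learn's chi2 scoring inside SelectKBest (the count-valued features and target below are made up; chi2 requires non-negative inputs):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

X = np.array([[5, 1], [8, 0], [1, 2], [0, 3], [7, 1], [2, 4]])  # non-negative counts
y = np.array([1, 1, 0, 0, 1, 0])                                # binary target

selector = SelectKBest(score_func=chi2, k=1)  # keep the single best feature
X_new = selector.fit_transform(X, y)
print(selector.scores_)  # chi-square score per feature
print(X_new.shape)       # (6, 1)
```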
5. Dimensionality Reduction Methods¶
Dimensionality reduction simplifies datasets by reducing the number of features while retaining essential information.
5.1 PCA (Principal Component Analysis)¶
Formula:
Transform the data using eigenvectors: $ Z = XW $ where $ W $ is the matrix of eigenvectors.

Example:
3D data → reduce to 2D while retaining most of the variance.

Why Use It:
Removes redundancy and reduces computation time.

Used For:
Large datasets with correlated features.
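A minimal scikit-learn sketch reducing made-up 3D data to 2 components, mirroring the example above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # 100 samples, 3 features

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained by each component
```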
5.2 ICA (Independent Component Analysis)¶
Concept:
Decomposes signals into statistically independent components.

Example:
Separating mixed audio signals.

Why Use It:
Uncovers hidden factors.

Used For:
Signal processing, image analysis.
5.3 NMF (Non-Negative Matrix Factorization)¶
Formula:
Decompose the matrix $ X $ into non-negative matrices $ W $ and $ H $: $ X \approx WH $

Why Use It:
Enforces non-negativity for interpretability.

Used For:
Text mining, image recognition.
6. Encoding Methods¶
Encoding converts categorical data into numerical representations for machine learning models.
6.1 Label Encoding¶
Concept:
Convert categories to integers: $ \text{Category} \to \text{Integer} $

Example:
Colors: [Red, Blue, Green] → [0, 1, 2].

Why Use It:
For models requiring numeric input.

Used For:
Ordinal categorical features.
6.2 One-Hot Encoding¶
Concept:
Convert categories into binary vectors.

Example:
Colors: [Red, Blue] → [1, 0] and [0, 1].

Why Use It:
Avoids implying an ordering between categories.

Used For:
Nominal categorical features.
6.3 Target Encoding¶
Concept:
Replace each category with the mean of the target for that category.

Example:
Categories A and B → target means 0.7 and 0.3.

Why Use It:
Captures the relationship between a category and the target.

Used For:
High-cardinality features.
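A minimal sketch of all three encodings: label and one-hot via scikit-learn, and target encoding via a simple pandas group-by (one common way to implement it). Note that sparse_output requires scikit-learn ≥ 1.2 (older versions use sparse=False), and LabelEncoder assigns integers in alphabetical order:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

colors = ["Red", "Blue", "Green", "Blue"]

# Label encoding (alphabetical: Blue=0, Green=1, Red=2)
print(LabelEncoder().fit_transform(colors))      # [2 0 1 0]

# One-hot encoding: one binary column per color
ohe = OneHotEncoder(sparse_output=False)
print(ohe.fit_transform([[c] for c in colors]))

# Target encoding via per-category target means
df = pd.DataFrame({"category": ["A", "A", "B", "B"], "target": [1.0, 0.4, 0.5, 0.1]})
means = df.groupby("category")["target"].mean()  # A -> 0.7, B -> 0.3
df["category_encoded"] = df["category"].map(means)
print(df)
```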
7. Transformation Methods¶
Transformations modify data to improve model performance or meet assumptions.
7.1 Log Transformation¶
Concept:
Apply the natural logarithm to reduce skewness: $ x' = \log(x + c) $
where $ c $ is a small constant added to avoid $ \log(0) $.

Example:
Column: [1, 10, 100] → [0, 2.3, 4.6].

Why Use It:
Reduces skewness and stabilizes variance.

When to Use:
Positively skewed data.

Limitations:
Only works for positive data.
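A minimal NumPy sketch; np.log1p computes $\log(1 + x)$, i.e. the $c = 1$ case, and is safe when zeros are present:

```python
import numpy as np

data = np.array([1.0, 10.0, 100.0])
print(np.log(data))                          # [0.    2.303 4.605]
print(np.log1p(np.array([0.0, 9.0, 99.0])))  # log(1 + x) handles zeros gracefully
```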
7.2 Box-Cox Transformation¶
Concept:
A more general transformation:
$ y = \begin{cases} \frac{x^\lambda - 1}{\lambda} & \lambda \neq 0 \\ \log(x) & \lambda = 0 \end{cases} $
where $ \lambda $ is a parameter chosen to make the result as close to normal as possible.

Why Use It:
Makes data more normal-like.

When to Use:
Continuous, positive data.

Limitations:
Requires strictly positive values.
7.3 Yeo-Johnson Transformation¶
Concept:
Similar to Box-Cox but allows zero and negative values:
$ y = \begin{cases} \frac{(x + 1)^\lambda - 1}{\lambda} & x \geq 0, \lambda \neq 0 \\ -\frac{(-x + 1)^{2-\lambda} - 1}{2-\lambda} & x < 0, \lambda \neq 2 \end{cases} $

Why Use It:
Normalizes both positive and negative values.

When to Use:
Non-normal data with mixed positive and negative values.
7.4 Quantile Transformation¶
Concept:
Maps data to a uniform or normal distribution based on its quantiles.

Why Use It:
Handles non-linear relationships.

When to Use:
Non-Gaussian data needing normalization.
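A minimal sketch with scikit-learn's QuantileTransformer (output_distribution can be "uniform" or "normal"; n_quantiles is lowered because the made-up sample is small):

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
X = rng.exponential(size=(100, 1))  # heavily skewed toy data

qt = QuantileTransformer(n_quantiles=100, output_distribution="normal")
X_t = qt.fit_transform(X)
print(round(X_t.mean(), 2), round(X_t.std(), 2))  # roughly 0 and 1 after the mapping
```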
Transform your data like a pro! Explore our Advanced Data Processor to clean, preprocess, and visualize datasets effortlessly. Try it now at https://dataproces.streamlit.app/ and take your data analysis to the next level!