Data Analysis, Preparation & EDA for Machine Learning
A comprehensive and modern guide to data analysis, preparation, and exploratory data analysis (EDA) techniques for machine learning. Covers practical workflows using Python libraries like pandas, polars, scikit-learn, and imbalanced-learn—designed to ensure clean, balanced, and model-ready datasets.
Data Analysis, Preparation & EDA for Machine Learning (2025 Edition)
1. Data Loading & Inspection
Understanding Data Sources & Formats
Why Format Matters: Different file formats have distinct advantages for ML workflows:
- CSV/TSV: Universal compatibility but slower I/O and larger file sizes
- Parquet: Columnar storage with compression, often an order of magnitude faster to read than CSV for large datasets
- Feather: Lightning-fast serialization, perfect for temporary storage between analysis steps
- JSON: Nested data structures, common in APIs but requires flattening for ML
Database Considerations:
- SQL databases provide ACID compliance and complex queries
- NoSQL handles unstructured data but requires more preprocessing
- Cloud storage enables scalable data pipelines but needs proper authentication
Essential Libraries & Their Use Cases
import pandas as pd # Standard tabular data manipulation
import polars as pl # 2-10x faster than pandas, Rust-based
import pyarrow as pa # Arrow format, used by Parquet/Feather
import duckdb # In-memory SQL, excellent for analytics
from smart_open import open # Direct cloud storage access
# Loading examples
df = pd.read_csv('data.csv')
df_fast = pl.read_parquet('data.parquet') # Much faster for large files
df_sql = duckdb.query("SELECT * FROM 'data.parquet' WHERE price > 100").to_df()
Pro Tip: Use the dtype parameter of pd.read_csv() to prevent automatic type inference issues that can cause memory bloat.
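For example, a minimal sketch with hypothetical column names (adjust to your own schema; integer dtypes assume no missing values in those columns):
import pandas as pd
# Explicit dtypes skip costly inference and shrink memory up front
dtypes = {'store_id': 'category', 'units_sold': 'int32', 'price': 'float32'}
df = pd.read_csv('data.csv', dtype=dtypes, parse_dates=['order_date'])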
2. Exploratory Data Analysis (EDA)
Data Structure Assessment
Why Structure Matters: Understanding your data's shape, types, and memory usage prevents downstream issues and informs preprocessing decisions.
# Essential first steps
df.info() # Data types, non-null counts, memory usage
df.shape # Dimensions
df.columns.tolist() # Column names as list
df.dtypes # Data types only
df.memory_usage(deep=True) # Actual memory consumption
Memory Optimization: Large datasets can consume excessive RAM. Use the category dtype for strings with < 50% unique values, and int32 instead of int64 when possible.
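A short sketch of both optimizations (column names are placeholders):
import pandas as pd
# Low-cardinality strings -> category dtype
df['city'] = df['city'].astype('category')
# Downcast numeric columns to the smallest safe type
df['views'] = pd.to_numeric(df['views'], downcast='integer')
df['score'] = pd.to_numeric(df['score'], downcast='float')
df.memory_usage(deep=True) # Compare before/after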
Missing Value Analysis
Missing Data Patterns: Understanding why data is missing is crucial for choosing the right imputation strategy:
- MCAR (Missing Completely at Random): Safe to delete or impute
- MAR (Missing at Random): Missing depends on observed variables
- MNAR (Missing Not at Random): Missing depends on unobserved factors
import missingno as msno
# Missing value assessment
missing_percent = df.isnull().mean() * 100
print(missing_percent[missing_percent > 0])
# Visualize missing patterns
msno.matrix(df) # Shows missing data patterns
msno.heatmap(df) # Correlation of missingness between variables
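One quick, informal check on the MCAR/MAR distinction above is whether missingness varies across another observed column; a sketch assuming hypothetical age and category columns:
# Share of missing 'age' values per 'category' group; large differences hint the data are not MCAR
df['age'].isnull().groupby(df['category']).mean()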
Descriptive Statistics Deep Dive
Beyond Basic Stats: describe() gives you central tendency and spread, but domain context is crucial for interpretation.
# Comprehensive statistical summary
df.describe(include='all') # All columns
df.describe(percentiles=[.1, .25, .5, .75, .9]) # Custom percentiles
# Grouped analysis
df.groupby('category')['price'].describe() # Stats by group
df.groupby('category').agg({
'price': ['mean', 'std', 'count'],
'rating': ['mean', 'median']
})
3. Univariate & Multivariate Analysis
Single Variable Analysis (Univariate)
Distribution Understanding: Each variable's distribution tells you about data quality, potential outliers, and appropriate preprocessing steps.
import seaborn as sns
import matplotlib.pyplot as plt
# Numerical variables
sns.histplot(df['price'], kde=True) # Distribution shape
sns.boxplot(x=df['price']) # Outlier detection
df['price'].describe() # Statistical summary
# Categorical variables
df['category'].value_counts() # Frequency counts
sns.countplot(data=df, x='category') # Visual frequency
Outlier Detection Methods:
- IQR Method: flag values below Q1 - 1.5×IQR or above Q3 + 1.5×IQR (conservative, robust for skewed data)
- Z-Score: |z| > 3 (assumes normal distribution)
- Domain Knowledge: Often the most reliable approach
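The z-score method from the list above is equally short; a sketch for the price column:
import numpy as np
from scipy import stats
# |z| > 3 flags potential outliers; only meaningful if 'price' is roughly normal
z_scores = np.abs(stats.zscore(df['price'].dropna()))
print((z_scores > 3).sum(), "potential outliers by z-score")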
Relationship Analysis (Bivariate/Multivariate)
Correlation vs. Causation: Correlation measures linear relationships but doesn't imply causation. Always combine statistical analysis with domain expertise.
# Correlation analysis
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
# Categorical vs numerical relationships
sns.boxplot(data=df, x='category', y='price')
df.groupby('category')['price'].mean()
# Multiple variable relationships
sns.pairplot(df, hue='target_variable') # All pairwise relationships
Automated EDA Tools
When to Use Automated EDA: Perfect for initial exploration of large datasets, but always follow up with targeted analysis.
from ydata_profiling import ProfileReport
import sweetviz as sv
# Comprehensive automated report
profile = ProfileReport(df, title="Dataset Analysis")
profile.to_notebook_iframe()
# Comparative analysis
report = sv.analyze(df)
report.show_html()
4. Data Cleaning
Missing Data Strategies
Imputation Strategy Selection:
- Numerical: Mean (normal distribution), Median (skewed), Mode (discrete)
- Categorical: Mode, "Unknown" category, or domain-specific defaults
- Advanced: KNN imputation, iterative imputation for complex patterns
from sklearn.impute import SimpleImputer, KNNImputer
# Basic imputation
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
df[['age']] = num_imputer.fit_transform(df[['age']]) # Double brackets keep both sides 2-D
df[['category']] = cat_imputer.fit_transform(df[['category']])
# Advanced imputation for complex missing patterns
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = knn_imputer.fit_transform(df.select_dtypes(include='number'))
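The iterative imputation mentioned above is available in scikit-learn as an experimental estimator; a minimal sketch:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required to register IterativeImputer
from sklearn.impute import IterativeImputer
# Models each feature with missing values as a function of the others, iterating until stable
iter_imputer = IterativeImputer(max_iter=10, random_state=42)
df_iter = iter_imputer.fit_transform(df.select_dtypes(include='number'))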
Outlier Treatment
Treatment Options:
- Remove: Only if you're certain they're errors
- Cap/Winsorize: Clip to percentile values (e.g., 1st and 99th)
- Transform: Log transformation can reduce outlier impact
- Keep: Often outliers contain valuable information
# IQR-based outlier detection
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
# Winsorization (capping)
df['price_capped'] = df['price'].clip(lower=lower_bound, upper=upper_bound)
Duplicate Handling
Duplicate Types:
- Exact duplicates: Same values across all columns
- Partial duplicates: Same key columns but different values elsewhere
- Near duplicates: Similar but not identical (requires fuzzy matching)
# Duplicate detection and removal
print(f"Duplicates: {df.duplicated().sum()}")
df_clean = df.drop_duplicates(subset=['id'], keep='first')
# Check for near-duplicates in text columns
from fuzzywuzzy import fuzz
# Custom logic needed based on domain
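For illustration only, a brute-force sketch with the fuzz import above, assuming a hypothetical name column (quadratic in the number of values, so only practical for small sets):
names = df['name'].dropna().unique()
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        # token_sort_ratio ignores word order; 90 is an arbitrary threshold
        if fuzz.token_sort_ratio(names[i], names[j]) > 90:
            print(names[i], '<->', names[j])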
5. Feature Engineering
Feature engineering transforms raw data into features that improve the performance of machine learning models. It involves selecting, creating, and transforming variables so they better suit the chosen algorithm.
Categorical Encoding Strategies
Encoding Method Selection:
- Ordinal: Natural order exists (low/medium/high)
- One-Hot: Nominal categories, low cardinality (< 10-15 categories)
- Target Encoding: High cardinality, but risk of overfitting
- Frequency Encoding: When frequency correlates with target
import category_encoders as ce
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Ordinal encoding for ordered categories
ordinal_map = {'low': 1, 'medium': 2, 'high': 3}
df['priority_encoded'] = df['priority'].map(ordinal_map)
# Target encoding (use with cross-validation to prevent overfitting)
target_encoder = ce.TargetEncoder()
df['category_target_encoded'] = target_encoder.fit_transform(df['category'], df['target'])
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat')
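Frequency encoding, listed above but not shown in the snippet, is a two-liner:
# Replace each category with its relative frequency in the training data
freq = df['category'].value_counts(normalize=True)
df['category_freq_encoded'] = df['category'].map(freq)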
Numerical Transformations
Scaling Necessity: Different algorithms have different scaling requirements:
- Tree-based models: Don't require scaling
- Linear models, SVM, Neural Networks: Require scaling
- Distance-based algorithms: Very sensitive to scale
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from scipy.stats import boxcox
import numpy as np
# Scaling options
standard_scaler = StandardScaler() # Mean=0, Std=1
minmax_scaler = MinMaxScaler() # Range [0,1]
robust_scaler = RobustScaler() # Uses median and IQR
# Handle skewed distributions
df['log_price'] = np.log1p(df['price']) # log(1+x) handles zeros
df['price_boxcox'], _ = boxcox(df['price'] + 1) # Box-Cox requires positive values; +1 shifts zeros
Text Feature Engineering
Text Preprocessing Pipeline:
- Clean: Remove HTML, special characters, normalize case
- Tokenize: Split into words/subwords
- Remove stopwords: Filter common words
- Vectorize: Convert to numerical representation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re
def clean_text(text):
text = text.lower()
text = re.sub(r'[^a-zA-Z\s]', '', text) # Remove non-alphabetic
return text
# Vectorization
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
text_features = tfidf.fit_transform(df['description'].apply(clean_text))
DateTime Feature Engineering
Time-Based Features: Extract meaningful temporal patterns that algorithms can use.
# Extract datetime components
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6])
# Cyclical encoding (preserves circular nature of time)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)
6. Data Balancing & Resampling
When Balancing is Necessary
Class Imbalance Impact: Severely imbalanced datasets (e.g., 90:10 or worse) can cause models to ignore minority classes. Balancing isn't always beneficial, though; weigh the business cost of false positives against false negatives.
Key Principles:
- Only balance training data, never test data
- Consider class weights as an alternative to resampling
- Evaluate using appropriate metrics (F1, AUC-ROC, not just accuracy)
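The class-weight alternative mentioned above avoids touching the data at all; a minimal sketch with scikit-learn, assuming X_train/y_train from a prior split:
from sklearn.linear_model import LogisticRegression
# 'balanced' reweights classes inversely to their frequency during training
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)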
Resampling Techniques
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTEENN
# Check current balance
print(y.value_counts(normalize=True))
# Oversampling techniques
smote = SMOTE(random_state=42) # Synthetic minority samples
adasyn = ADASYN(random_state=42) # Adaptive synthetic sampling
ros = RandomOverSampler(random_state=42) # Simple duplication
# Apply resampling
X_resampled, y_resampled = smote.fit_resample(X, y)
# Combined approach (often works best)
smote_enn = SMOTEENN(random_state=42)
X_combined, y_combined = smote_enn.fit_resample(X, y)
SMOTE Explanation: Creates synthetic examples by interpolating between existing minority class samples and their k-nearest neighbors, rather than simply duplicating existing samples.
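To make the interpolation concrete, a toy illustration of the idea (not imblearn's actual implementation):
import numpy as np
rng = np.random.default_rng(42)
x_i = np.array([1.0, 2.0])              # a minority-class sample
x_nn = np.array([2.0, 4.0])             # one of its k nearest minority neighbors
lam = rng.random()                      # random factor in [0, 1)
x_synthetic = x_i + lam * (x_nn - x_i)  # lies on the segment between the two points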
7. Data Splitting
Splitting Strategies
Split Ratios: Common splits depend on dataset size:
- Large datasets (>100k): 80/10/10 or 80/20
- Medium datasets (10k-100k): 70/15/15 or 80/20
- Small datasets (< 10k): Use cross-validation instead
Stratification: Ensures each split maintains the same class distribution as the original dataset.
from sklearn.model_selection import train_test_split, StratifiedKFold
# Basic stratified split
X_train, X_test, y_train, y_test = train_test_split(
X, y,
test_size=0.2,
stratify=y, # Maintains class distribution
random_state=42 # Reproducibility
)
# Three-way split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)
# Cross-validation for small datasets
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]
8. Pipelines & Automation
Pipeline Benefits
Why Pipelines Matter:
- Prevent data leakage: Ensures transformations are fit only on training data
- Reproducibility: Same preprocessing steps applied consistently
- Cleaner code: Encapsulates complex preprocessing logic
- Easy deployment: Entire pipeline can be serialized and deployed
Advanced Pipeline Construction
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Define column types
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
# Create preprocessing pipelines
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# Combine transformers
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)
# Full pipeline with model
from sklearn.ensemble import RandomForestClassifier
full_pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(random_state=42))
])
# Fit and predict
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)
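Because the whole pipeline is a single estimator, it can be serialized for deployment; a short sketch with joblib (file name is arbitrary):
import joblib
joblib.dump(full_pipeline, 'full_pipeline.joblib') # Persist the fitted pipeline
loaded_pipeline = joblib.load('full_pipeline.joblib')
predictions = loaded_pipeline.predict(X_test) # Same preprocessing + model at inference time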
9. Modern & Scalable Tools
High-Performance Libraries
When to Use Each:
- Polars: 2-10x faster than pandas, especially for large datasets
- Dask: Parallel computing, handles datasets larger than RAM
- Modin: Drop-in pandas replacement with automatic parallelization
- DuckDB: SQL analytics on DataFrames, excellent for complex queries
import polars as pl
import dask.dataframe as dd
# Polars for speed
df_pl = pl.read_csv("large_file.csv")
result = df_pl.filter(pl.col("price") > 100).group_by("category").agg(pl.col("price").mean())
# Dask for datasets larger than RAM
df_dask = dd.read_csv("huge_file.csv")
result = df_dask.groupby("category").price.mean().compute()
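DuckDB, listed above, can run plain SQL directly against an in-memory pandas DataFrame; a short sketch assuming a DataFrame named df with category and price columns:
import duckdb
# DuckDB resolves 'df' from the surrounding Python scope
result = duckdb.query("SELECT category, AVG(price) AS avg_price FROM df GROUP BY category").to_df()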
AutoML & Experiment Tracking
AutoML Tools: Automate feature engineering, model selection, and hyperparameter tuning:
- PyCaret: Low-code ML library
- AutoGluon: Amazon's AutoML toolkit
- FLAML: Microsoft's fast AutoML library
import pycaret.classification as pc
# AutoML setup and comparison
clf = pc.setup(df, target='target_column', session_id=123)
best_models = pc.compare_models(sort='AUC', n_select=3)
10. Visualization Best Practices
Choosing the Right Visualization
Visualization Purpose:
- Distribution: Histograms, box plots, violin plots
- Relationships: Scatter plots, correlation heatmaps
- Comparisons: Bar charts, grouped box plots
- Time series: Line plots, seasonal decomposition
import plotly.express as px
import plotly.graph_objects as go
# Interactive visualizations
fig = px.scatter(df, x='feature1', y='feature2', color='target',
hover_data=['id'], title='Feature Relationship')
fig.show()
# Statistical visualizations
sns.pairplot(df, hue='target', diag_kind='kde')
plt.show()
Visualization Guidelines
Best Practices:
- Use color meaningfully (categorical distinctions, not decoration)
- Always label axes, include units
- Choose appropriate scales (linear vs. log)
- Avoid 3D unless the third dimension adds meaningful information
- Consider colorblind-friendly palettes
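For the last point, seaborn ships a colorblind-friendly palette that can be set globally; a one-line sketch:
import seaborn as sns
sns.set_palette('colorblind') # Applies seaborn's colorblind-safe palette to subsequent plots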
11. Quick Project Template
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt
def complete_data_prep_pipeline(df, target_column):
"""
Complete data preparation pipeline for ML
"""
# 1. Initial EDA
print("Dataset Shape:", df.shape)
print("\nMissing Values:")
print(df.isnull().sum())
print("\nTarget Distribution:")
print(df[target_column].value_counts(normalize=True))
# 2. Separate features and target
X = df.drop(target_column, axis=1)
y = df[target_column]
# 3. Identify column types
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['object']).columns.tolist()
print(f"\nNumeric features: {len(numeric_features)}")
print(f"Categorical features: {len(categorical_features)}")
# 4. Create preprocessing pipelines
numeric_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
# 5. Combine transformers
preprocessor = ColumnTransformer(
transformers=[
('num', numeric_transformer, numeric_features),
('cat', categorical_transformer, categorical_features)
]
)
# 6. Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
# 7. Apply preprocessing
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
# 8. Handle class imbalance if needed
if len(y.unique()) == 2: # Binary classification
class_ratio = y.value_counts(normalize=True).min()
if class_ratio < 0.3: # If minority class < 30%
print(f"\nApplying SMOTE (minority class: {class_ratio:.2%})")
smote = SMOTE(random_state=42)
X_train_processed, y_train = smote.fit_resample(X_train_processed, y_train)
print(f"\nFinal training set shape: {X_train_processed.shape}")
print(f"Final test set shape: {X_test_processed.shape}")
return X_train_processed, X_test_processed, y_train, y_test, preprocessor
# Usage example:
# X_train, X_test, y_train, y_test, preprocessor = complete_data_prep_pipeline(df, 'target')
Key Takeaways & Best Practices
Critical Success Factors
- Data Quality > Model Complexity: Spend 80% of your time on data understanding and preparation
- Domain Knowledge: Statistical techniques are tools; domain expertise guides their application
- Reproducibility: Always set random seeds and document preprocessing steps
- Validation Strategy: Never touch test data during preprocessing—use cross-validation for model selection
- Iterative Process: EDA → Preprocess → Model → Analyze Results → Refine (repeat)
Common Pitfalls to Avoid
- Data Leakage: Fitting transformers on entire dataset instead of training only
- Target Leakage: Including features that wouldn't be available at prediction time
- Overfitting to Validation: Using validation performance to make too many preprocessing decisions
- Ignoring Business Context: Optimizing metrics that don't align with business objectives
- Premature Optimization: Jumping to complex techniques before understanding the data
Workflow Checklist
- Load and inspect data structure
- Perform comprehensive EDA
- Handle missing values appropriately
- Detect and treat outliers
- Engineer relevant features
- Encode categorical variables
- Scale numerical features if needed
- Split data properly (stratified)
- Handle class imbalance if necessary
- Create preprocessing pipeline
- Validate preprocessing steps
- Document all transformations