Data Analysis, Preparation & EDA for Machine Learning

A comprehensive and modern guide to data analysis, preparation, and exploratory data analysis (EDA) techniques for machine learning. Covers practical workflows using Python libraries like pandas, polars, scikit-learn, and imbalanced-learn—designed to ensure clean, balanced, and model-ready datasets.

Dennis Kibet Rono
11 min read

Data Analysis, Preparation & EDA for Machine Learning (2025 Edition)

1. Data Loading & Inspection

Understanding Data Sources & Formats

Why Format Matters: Different file formats have distinct advantages for ML workflows:

  • CSV/TSV: Universal compatibility but slower I/O and larger file sizes
  • Parquet: Columnar storage with compression; often an order of magnitude faster than CSV to read for large datasets
  • Feather: Lightning-fast serialization, perfect for temporary storage between analysis steps
  • JSON: Nested data structures, common in APIs but requires flattening for ML
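
Converting a CSV once to a columnar format usually pays off when the file is re-read many times. A minimal sketch (requires pyarrow; the file names are placeholders):

import pandas as pd

df = pd.read_csv('data.csv')
df.to_parquet('data.parquet')   # compressed, columnar; fast repeated analytical reads
df.to_feather('data.feather')   # very fast round-trips between analysis steps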

Database Considerations:

  • SQL databases provide ACID compliance and complex queries
  • NoSQL handles unstructured data but requires more preprocessing
  • Cloud storage enables scalable data pipelines but needs proper authentication

Essential Libraries & Their Use Cases

import pandas as pd          # Standard tabular data manipulation
import polars as pl         # 2-10x faster than pandas, Rust-based
import pyarrow as pa        # Arrow format, used by Parquet/Feather
import duckdb              # In-memory SQL, excellent for analytics
from smart_open import open # Direct cloud storage access
 
# Loading examples
df = pd.read_csv('data.csv')
df_fast = pl.read_parquet('data.parquet')  # Much faster for large files
df_sql = duckdb.query("SELECT * FROM 'data.parquet' WHERE price > 100").to_df()

Pro Tip: Use pd.read_csv() with dtype parameter to prevent automatic type inference issues that can cause memory bloat.
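
A sketch of that tip, assuming hypothetical column names and types:

# Declaring dtypes up front avoids costly inference and keeps memory predictable
dtypes = {'id': 'int32', 'price': 'float32', 'category': 'category'}
df = pd.read_csv('data.csv', dtype=dtypes, parse_dates=['order_date'])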

2. Exploratory Data Analysis (EDA)

Data Structure Assessment

Why Structure Matters: Understanding your data's shape, types, and memory usage prevents downstream issues and informs preprocessing decisions.

# Essential first steps
df.info()                    # Data types, non-null counts, memory usage
df.shape                     # Dimensions
df.columns.tolist()          # Column names as list
df.dtypes                    # Data types only
df.memory_usage(deep=True)   # Actual memory consumption

Memory Optimization: Large datasets can consume excessive RAM. Use category dtype for strings with < 50% unique values, and int32 instead of int64 when possible.
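
A quick sketch of both ideas, assuming hypothetical 'category' and 'quantity' columns:

import pandas as pd

# Downcast types and compare memory before/after
before = df.memory_usage(deep=True).sum()
df['category'] = df['category'].astype('category')                   # low-cardinality strings
df['quantity'] = pd.to_numeric(df['quantity'], downcast='integer')   # int64 -> smallest safe int
after = df.memory_usage(deep=True).sum()
print(f"Memory: {before / 1e6:.1f} MB -> {after / 1e6:.1f} MB")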

Missing Value Analysis

Missing Data Patterns: Understanding why data is missing is crucial for choosing the right imputation strategy:

  • MCAR (Missing Completely at Random): Safe to delete or impute
  • MAR (Missing at Random): Missing depends on observed variables
  • MNAR (Missing Not at Random): Missing depends on unobserved factors

import missingno as msno
 
# Missing value assessment
missing_percent = df.isnull().mean() * 100
print(missing_percent[missing_percent > 0])
 
# Visualize missing patterns
msno.matrix(df)      # Shows missing data patterns
msno.heatmap(df)     # Correlation of missingness between variables

Descriptive Statistics Deep Dive

Beyond Basic Stats: describe() gives you central tendency and spread, but domain context is crucial for interpretation.

# Comprehensive statistical summary
df.describe(include='all')                    # All columns
df.describe(percentiles=[.1, .25, .5, .75, .9])  # Custom percentiles
 
# Grouped analysis
df.groupby('category')['price'].describe()   # Stats by group
df.groupby('category').agg({
    'price': ['mean', 'std', 'count'],
    'rating': ['mean', 'median']
})

3. Univariate & Multivariate Analysis

Single Variable Analysis (Univariate)

Distribution Understanding: Each variable's distribution tells you about data quality, potential outliers, and appropriate preprocessing steps.

import seaborn as sns
import matplotlib.pyplot as plt
 
# Numerical variables
sns.histplot(df['price'], kde=True)          # Distribution shape
sns.boxplot(x=df['price'])                   # Outlier detection
df['price'].describe()                       # Statistical summary
 
# Categorical variables
df['category'].value_counts()                # Frequency counts
sns.countplot(data=df, x='category')         # Visual frequency

Outlier Detection Methods:

  • IQR Method: Flag values outside [Q1 − 1.5×IQR, Q3 + 1.5×IQR] (conservative, works well for skewed data)
  • Z-Score: |z| > 3 (assumes normal distribution)
  • Domain Knowledge: Often the most reliable approach
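
A small sketch of the first two rules side by side, assuming a numeric 'price' column:

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = (df['price'] < q1 - 1.5 * iqr) | (df['price'] > q3 + 1.5 * iqr)

# Z-score rule: flag |z| > 3 (only meaningful if the distribution is roughly normal)
z_scores = (df['price'] - df['price'].mean()) / df['price'].std()
z_outliers = z_scores.abs() > 3

print(f"IQR flags: {iqr_outliers.sum()}, z-score flags: {z_outliers.sum()}")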

Relationship Analysis (Bivariate/Multivariate)

Correlation vs. Causation: Correlation measures linear relationships but doesn't imply causation. Always combine statistical analysis with domain expertise.

# Correlation analysis
corr_matrix = df.corr(numeric_only=True)
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', center=0)
 
# Categorical vs numerical relationships
sns.boxplot(data=df, x='category', y='price')
df.groupby('category')['price'].mean()
 
# Multiple variable relationships
sns.pairplot(df, hue='target_variable')      # All pairwise relationships

Automated EDA Tools

When to Use Automated EDA: Perfect for initial exploration of large datasets, but always follow up with targeted analysis.

from ydata_profiling import ProfileReport
import sweetviz as sv
 
# Comprehensive automated report
profile = ProfileReport(df, title="Dataset Analysis")
profile.to_notebook_iframe()
 
# Comparative analysis
report = sv.analyze(df)
report.show_html()

4. Data Cleaning

Missing Data Strategies

Imputation Strategy Selection:

  • Numerical: Mean (normal distribution), Median (skewed), Mode (discrete)
  • Categorical: Mode, "Unknown" category, or domain-specific defaults
  • Advanced: KNN imputation, iterative imputation for complex patterns

from sklearn.impute import SimpleImputer, KNNImputer
 
# Basic imputation
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='most_frequent')
 
df['age'] = num_imputer.fit_transform(df[['age']])
df['category'] = cat_imputer.fit_transform(df[['category']])
 
# Advanced imputation for complex missing patterns
knn_imputer = KNNImputer(n_neighbors=5)
df_imputed = knn_imputer.fit_transform(df.select_dtypes(include='number'))

Outlier Treatment

Treatment Options:

  • Remove: Only if you're certain they're errors
  • Cap/Winsorize: Clip to percentile values (e.g., 1st and 99th)
  • Transform: Log transformation can reduce outlier impact
  • Keep: Often outliers contain valuable information

# IQR-based outlier detection
Q1 = df['price'].quantile(0.25)
Q3 = df['price'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
 
# Winsorization (capping)
df['price_capped'] = df['price'].clip(lower=lower_bound, upper=upper_bound)

Duplicate Handling

Duplicate Types:

  • Exact duplicates: Same values across all columns
  • Partial duplicates: Same key columns but different values elsewhere
  • Near duplicates: Similar but not identical (requires fuzzy matching)

# Duplicate detection and removal
print(f"Duplicates: {df.duplicated().sum()}")
df_clean = df.drop_duplicates(subset=['id'], keep='first')
 
# Check for near-duplicates in text columns (fuzzy matching)
from fuzzywuzzy import fuzz
# Example pairwise similarity score (0-100), assuming a text column such as 'description'
score = fuzz.ratio(df.loc[0, 'description'], df.loc[1, 'description'])
# The threshold (e.g., score > 90) and which columns to compare depend on the domain

5. Feature Engineering

Feature engineering transforms raw data into features that improve the performance of machine learning models. It involves selecting, creating, and transforming variables so they better suit the chosen machine learning algorithm.

Categorical Encoding Strategies

Encoding Method Selection:

  • Ordinal: Natural order exists (low/medium/high)
  • One-Hot: Nominal categories, low cardinality (< 10-15 categories)
  • Target Encoding: High cardinality, but risk of overfitting
  • Frequency Encoding: When frequency correlates with target (sketched after the snippet below)

import category_encoders as ce
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
 
# Ordinal encoding for ordered categories
ordinal_map = {'low': 1, 'medium': 2, 'high': 3}
df['priority_encoded'] = df['priority'].map(ordinal_map)
 
# Target encoding (use with cross-validation to prevent overfitting)
target_encoder = ce.TargetEncoder()
df['category_target_encoded'] = target_encoder.fit_transform(df['category'], df['target'])
 
# One-hot encoding
df_encoded = pd.get_dummies(df, columns=['category'], prefix='cat')
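
Frequency encoding from the list above isn't shown in the snippet; a minimal sketch (fit the frequency table on training data only to avoid leakage):

# Replace each category with its relative frequency
freq = df['category'].value_counts(normalize=True)
df['category_freq_encoded'] = df['category'].map(freq)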

Numerical Transformations

Scaling Necessity: Different algorithms have different scaling requirements:

  • Tree-based models: Don't require scaling
  • Linear models, SVM, Neural Networks: Require scaling
  • Distance-based algorithms: Very sensitive to scale

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from scipy.stats import boxcox
import numpy as np
 
# Scaling options
standard_scaler = StandardScaler()      # Mean=0, Std=1
minmax_scaler = MinMaxScaler()         # Range [0,1]
robust_scaler = RobustScaler()         # Uses median and IQR
 
# Handle skewed distributions
df['log_price'] = np.log1p(df['price'])  # log(1+x) handles zeros
df['price_boxcox'], _ = boxcox(df['price'] + 1)  # Box-Cox requires positive values; +1 shifts zeros
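
The scalers above are only instantiated; applying one follows a fit-on-train, transform-both pattern. A sketch assuming X_train/X_test splits and a numeric_features list defined elsewhere:

X_train_scaled = standard_scaler.fit_transform(X_train[numeric_features])
X_test_scaled = standard_scaler.transform(X_test[numeric_features])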

Text Feature Engineering

Text Preprocessing Pipeline:

  1. Clean: Remove HTML, special characters, normalize case
  2. Tokenize: Split into words/subwords
  3. Remove stopwords: Filter common words
  4. Vectorize: Convert to numerical representation
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re
 
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)  # Remove non-alphabetic
    return text
 
# Vectorization
tfidf = TfidfVectorizer(max_features=1000, stop_words='english')
text_features = tfidf.fit_transform(df['description'].apply(clean_text))

DateTime Feature Engineering

Time-Based Features: Extract meaningful temporal patterns that algorithms can use.

# Extract datetime components
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6])
 
# Cyclical encoding (preserves circular nature of time)
df['month_sin'] = np.sin(2 * np.pi * df['month'] / 12)
df['month_cos'] = np.cos(2 * np.pi * df['month'] / 12)

6. Data Balancing & Resampling

When Balancing is Necessary

Class Imbalance Impact: Severely imbalanced datasets (e.g., a class ratio worse than 90:10) can cause models to ignore minority classes. However, balancing isn't always beneficial—consider the business cost of false positives vs. false negatives.

Key Principles:

  • Only balance training data, never test data
  • Consider class weights as an alternative to resampling (see the sketch after this list)
  • Evaluate using appropriate metrics (F1, AUC-ROC, not just accuracy)
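
A minimal sketch of the class-weight alternative, using scikit-learn's built-in support (no resampling; the loss simply penalizes minority-class errors more heavily):

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)   # assumes a stratified train split as in Section 7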

Resampling Techniques

from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler, TomekLinks
from imblearn.combine import SMOTEENN
 
# Check current balance
print(y.value_counts(normalize=True))
 
# Oversampling techniques
smote = SMOTE(random_state=42)                    # Synthetic minority samples
adasyn = ADASYN(random_state=42)                  # Adaptive synthetic sampling
ros = RandomOverSampler(random_state=42)          # Simple duplication
 
# Apply resampling
X_resampled, y_resampled = smote.fit_resample(X, y)
 
# Combined approach (often works best)
smote_enn = SMOTEENN(random_state=42)
X_combined, y_combined = smote_enn.fit_resample(X, y)

SMOTE Explanation: Creates synthetic examples by interpolating between existing minority class samples and their k-nearest neighbors, rather than simply duplicating existing samples.
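
The interpolation step can be written in miniature (purely illustrative, with two made-up minority samples):

import numpy as np

# Pick a minority sample and one of its k nearest minority neighbours,
# then create a synthetic point somewhere on the segment between them
x_i = np.array([1.0, 2.0])
x_nn = np.array([2.0, 3.0])
lam = np.random.uniform(0, 1)
synthetic = x_i + lam * (x_nn - x_i)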

7. Data Splitting

Splitting Strategies

Split Ratios: Common splits depend on dataset size:

  • Large datasets (>100k): 80/10/10 or 80/20
  • Medium datasets (10k-100k): 70/15/15 or 80/20
  • Small datasets (< 10k): Use cross-validation instead

Stratification: Ensures each split maintains the same class distribution as the original dataset.

from sklearn.model_selection import train_test_split, StratifiedKFold
 
# Basic stratified split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,           # Maintains class distribution
    random_state=42       # Reproducibility
)
 
# Three-way split
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, stratify=y, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, stratify=y_temp, random_state=42)
 
# Cross-validation for small datasets
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X, y):
    X_train_fold, X_val_fold = X.iloc[train_idx], X.iloc[val_idx]
    y_train_fold, y_val_fold = y.iloc[train_idx], y.iloc[val_idx]

8. Pipelines & Automation

Pipeline Benefits

Why Pipelines Matter:

  • Prevent data leakage: Ensures transformations are fit only on training data
  • Reproducibility: Same preprocessing steps applied consistently
  • Cleaner code: Encapsulates complex preprocessing logic
  • Easy deployment: Entire pipeline can be serialized and deployed (see the joblib sketch after the pipeline code)

Advanced Pipeline Construction

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
 
# Define column types
numeric_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns
 
# Create preprocessing pipelines
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
 
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
 
# Combine transformers
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
 
# Full pipeline with model
from sklearn.ensemble import RandomForestClassifier
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(random_state=42))
])
 
# Fit and predict
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)
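
The "easy deployment" benefit above can be as simple as serializing the fitted pipeline with joblib (a sketch; the file name is arbitrary):

import joblib

# Persist preprocessing + model as a single artifact
joblib.dump(full_pipeline, 'model_pipeline.joblib')

# At serving time: load once and predict on raw, unprocessed feature rows
loaded_pipeline = joblib.load('model_pipeline.joblib')
preds = loaded_pipeline.predict(X_test)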

9. Modern & Scalable Tools

High-Performance Libraries

When to Use Each:

  • Polars: 2-10x faster than pandas, especially for large datasets
  • Dask: Parallel computing, handles datasets larger than RAM
  • Modin: Drop-in pandas replacement with automatic parallelization
  • DuckDB: SQL analytics on DataFrames, excellent for complex queries (sketched after the snippet below)

import polars as pl
import dask.dataframe as dd
 
# Polars for speed
df_pl = pl.read_csv("large_file.csv")
result = df_pl.filter(pl.col("price") > 100).group_by("category").agg(pl.col("price").mean())
 
# Dask for datasets larger than RAM
df_dask = dd.read_csv("huge_file.csv")
result = df_dask.groupby("category").price.mean().compute()
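
DuckDB from the list above can query a pandas DataFrame in place with plain SQL; a minimal sketch, assuming df is a pandas DataFrame with 'category' and 'price' columns:

import duckdb

result = duckdb.query(
    "SELECT category, AVG(price) AS avg_price FROM df GROUP BY category ORDER BY avg_price DESC"
).to_df()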

AutoML & Experiment Tracking

AutoML Tools: Automate feature engineering, model selection, and hyperparameter tuning:

  • PyCaret: Low-code ML library
  • AutoGluon: Amazon's AutoML toolkit
  • FLAML: Microsoft's fast AutoML library

import pycaret.classification as pc
 
# AutoML setup and comparison
clf = pc.setup(df, target='target_column', session_id=123)
best_models = pc.compare_models(sort='AUC', n_select=3)

10. Visualization Best Practices

Choosing the Right Visualization

Visualization Purpose:

  • Distribution: Histograms, box plots, violin plots
  • Relationships: Scatter plots, correlation heatmaps
  • Comparisons: Bar charts, grouped box plots
  • Time series: Line plots, seasonal decomposition

import plotly.express as px
import plotly.graph_objects as go
import seaborn as sns
import matplotlib.pyplot as plt
 
# Interactive visualizations
fig = px.scatter(df, x='feature1', y='feature2', color='target',
                 hover_data=['id'], title='Feature Relationship')
fig.show()
 
# Statistical visualizations
sns.pairplot(df, hue='target', diag_kind='kde')
plt.show()

Visualization Guidelines

Best Practices:

  • Use color meaningfully (categorical distinctions, not decoration)
  • Always label axes, include units
  • Choose appropriate scales (linear vs. log)
  • Avoid 3D unless the third dimension adds meaningful information
  • Consider colorblind-friendly palettes
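
The last point is a one-liner in seaborn, for example:

import seaborn as sns

sns.set_palette('colorblind')   # colorblind-friendly default palette for subsequent plots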

11. Quick Project Template

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from imblearn.over_sampling import SMOTE
import seaborn as sns
import matplotlib.pyplot as plt
 
def complete_data_prep_pipeline(df, target_column):
    """
    Complete data preparation pipeline for ML
    """
    # 1. Initial EDA
    print("Dataset Shape:", df.shape)
    print("\nMissing Values:")
    print(df.isnull().sum())
    print("\nTarget Distribution:")
    print(df[target_column].value_counts(normalize=True))
 
    # 2. Separate features and target
    X = df.drop(target_column, axis=1)
    y = df[target_column]
 
    # 3. Identify column types
    numeric_features = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
    categorical_features = X.select_dtypes(include=['object']).columns.tolist()
 
    print(f"\nNumeric features: {len(numeric_features)}")
    print(f"Categorical features: {len(categorical_features)}")
 
    # 4. Create preprocessing pipelines
    numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ])
 
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
    ])
 
    # 5. Combine transformers
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ]
    )
 
    # 6. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42
    )
 
    # 7. Apply preprocessing
    X_train_processed = preprocessor.fit_transform(X_train)
    X_test_processed = preprocessor.transform(X_test)
 
    # 8. Handle class imbalance if needed
    if len(y.unique()) == 2:  # Binary classification
        class_ratio = y.value_counts(normalize=True).min()
        if class_ratio < 0.3:  # If minority class < 30%
            print(f"\nApplying SMOTE (minority class: {class_ratio:.2%})")
            smote = SMOTE(random_state=42)
            X_train_processed, y_train = smote.fit_resample(X_train_processed, y_train)
 
    print(f"\nFinal training set shape: {X_train_processed.shape}")
    print(f"Final test set shape: {X_test_processed.shape}")
 
    return X_train_processed, X_test_processed, y_train, y_test, preprocessor
 
# Usage example:
# X_train, X_test, y_train, y_test, preprocessor = complete_data_prep_pipeline(df, 'target')

Key Takeaways & Best Practices

Critical Success Factors

  1. Data Quality > Model Complexity: Spend 80% of your time on data understanding and preparation
  2. Domain Knowledge: Statistical techniques are tools; domain expertise guides their application
  3. Reproducibility: Always set random seeds and document preprocessing steps
  4. Validation Strategy: Never touch test data during preprocessing—use cross-validation for model selection
  5. Iterative Process: EDA → Preprocess → Model → Analyze Results → Refine (repeat)

Common Pitfalls to Avoid

  • Data Leakage: Fitting transformers on entire dataset instead of training only
  • Target Leakage: Including features that wouldn't be available at prediction time
  • Overfitting to Validation: Using validation performance to make too many preprocessing decisions
  • Ignoring Business Context: Optimizing metrics that don't align with business objectives
  • Premature Optimization: Jumping to complex techniques before understanding the data

Workflow Checklist

  • Load and inspect data structure
  • Perform comprehensive EDA
  • Handle missing values appropriately
  • Detect and treat outliers
  • Engineer relevant features
  • Encode categorical variables
  • Scale numerical features if needed
  • Split data properly (stratified)
  • Handle class imbalance if necessary
  • Create preprocessing pipeline
  • Validate preprocessing steps
  • Document all transformations
