Word Embeddings: The Mathematical Revolution That Taught Machines to Understand Language


Dennis Kibet Rono
12 min read

How mathematical models transformed raw text into meaningful vectors, enabling machines to grasp the nuances of human language and laying the foundation for modern NLP.

Introduction: The Great Language Divide

In the early days of computing, machines faced a seemingly insurmountable challenge: how to understand human language. Computers excel at processing numbers and executing logical operations, but language—with its ambiguities, cultural nuances, and contextual dependencies—seemed to exist in a completely different realm.

The breakthrough came from an unexpected realization: meaning itself could be mathematically encoded through statistical patterns in how words are used together.

Chapter 1: From Symbols to Vectors

The One-Hot Encoding Era

Early NLP used one-hot encoding, representing each word as a binary vector with a single 1 and all other elements as 0. This approach treated "king" and "queen" as completely unrelated, despite their obvious semantic connection.

# One-hot encoding example
import numpy as np
 
vocabulary = ["king", "queen", "man", "woman", "royal"]
vocab_size = len(vocabulary)
 
def one_hot_encode(word, vocabulary):
    vector = np.zeros(len(vocabulary))
    if word in vocabulary:
        vector[vocabulary.index(word)] = 1
    return vector
 
# Example encodings
king_vector = one_hot_encode("king", vocabulary)
queen_vector = one_hot_encode("queen", vocabulary)
 
print(f"King: {king_vector}")
print(f"Queen: {queen_vector}")
print(f"Similarity: {np.dot(king_vector, queen_vector)}")  # Always 0!

The Distributional Hypothesis

The conceptual breakthrough was the distributional hypothesis, developed by linguist Zellig Harris and famously summarized by J.R. Firth: "You shall know a word by the company it keeps." Words with similar meanings tend to appear in similar contexts, providing a mathematical foundation for understanding meaning through statistical patterns.
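
The hypothesis can be made concrete in a few lines: represent each word by the counts of the words observed near it, and related words end up with measurably similar vectors. A toy sketch (the corpus and helper names here are illustrative, not from any library):

# Distributional hypothesis in miniature: shared contexts -> similar vectors
from collections import Counter
import numpy as np

corpus = [
    ["the", "king", "wears", "a", "golden", "crown"],
    ["the", "queen", "wears", "a", "silver", "crown"],
    ["the", "dog", "chases", "a", "red", "ball"],
]
vocab = sorted({w for sent in corpus for w in sent})

def context_vector(target, sentences, window=4):
    """Vector of counts of words seen within `window` positions of `target`."""
    counts = Counter()
    for sent in sentences:
        for i, w in enumerate(sent):
            if w == target:
                lo, hi = max(0, i - window), min(len(sent), i + window + 1)
                counts.update(sent[lo:hi])
                counts[w] -= 1  # don't count the target itself
    return np.array([counts[v] for v in vocab], dtype=float)

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

king, queen, dog = (context_vector(w, corpus) for w in ["king", "queen", "dog"])
print(f"king~queen: {cosine(king, queen):.2f}")  # high: many shared contexts
print(f"king~dog:   {cosine(king, dog):.2f}")    # lower: few shared contexts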

Chapter 2: The Neural Revolution

Word2Vec: The Game Changer

In 2013, Google's Word2Vec revolutionized the field by learning dense vector representations that captured semantic relationships. The famous example "king - man + woman ≈ queen" demonstrated that abstract concepts could be encoded as directions in vector space.

# Word2Vec implementation using Gensim
from gensim.models import Word2Vec
 
# Sample training data
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "is", "royal"],
    ["a", "man", "walks", "in", "the", "park"],
    ["a", "woman", "reads", "a", "book"],
    ["royal", "family", "lives", "in", "palace"],
    ["the", "prince", "and", "princess", "are", "royal"],
    ["men", "and", "women", "are", "equal"]
]
 
# Train Word2Vec model
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4, epochs=100)
 
# The famous analogy: king - man + woman ≈ queen
try:
    result = model.wv.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
    print("king - man + woman =", [word for word, score in result])
except KeyError:
    print("Need more training data for reliable analogies")
 
# Check similarity between related words
if 'king' in model.wv and 'queen' in model.wv:
    similarity = model.wv.similarity('king', 'queen')
    print(f"Similarity between 'king' and 'queen': {similarity:.3f}")

GloVe: Global Statistics Meet Local Context

Stanford's GloVe combined global matrix factorization with local context-window methods, using corpus-wide word co-occurrence statistics to learn representations.

# GloVe concept: Co-occurrence matrix
from collections import defaultdict, Counter
 
def build_cooccurrence_matrix(sentences, window_size=2):
    """Build word co-occurrence matrix"""
    cooccurrence = defaultdict(Counter)
 
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j in range(max(0, i - window_size),
                          min(len(sentence), i + window_size + 1)):
                if i != j:
                    cooccurrence[word][sentence[j]] += 1
 
    return cooccurrence
 
# Example usage
sentences = [
    ["king", "royal", "palace", "crown"],
    ["queen", "royal", "crown", "throne"],
    ["man", "person", "human"],
    ["woman", "person", "human"]
]
 
cooc_matrix = build_cooccurrence_matrix(sentences)
print("Co-occurrence for 'king':", dict(cooc_matrix['king']))
print("Co-occurrence for 'royal':", dict(cooc_matrix['royal']))

FastText: Embracing Subword Information

Facebook's FastText addressed the out-of-vocabulary problem by building word vectors from character n-grams, allowing meaningful representations even for words never seen during training.

# FastText concept: Subword representation
def get_character_ngrams(word, n=3):
    """Extract character n-grams from a word"""
    word = f"<{word}>"  # Add boundary markers
    ngrams = []
    for i in range(len(word) - n + 1):
        ngrams.append(word[i:i+n])
    return ngrams
 
# Example: "unhappiness" can be understood through its parts
word = "unhappiness"
trigrams = get_character_ngrams(word, 3)
print(f"Character trigrams for '{word}': {trigrams}")
 
# Even unseen words can be represented
unseen_word = "unhappy"
unseen_trigrams = get_character_ngrams(unseen_word, 3)
print(f"Character trigrams for '{unseen_word}': {unseen_trigrams}")
 
# Common substrings indicate semantic relationships
overlap = set(trigrams) & set(unseen_trigrams)
print(f"Shared trigrams: {overlap}")

Chapter 3: The Context Revolution

The Polysemy Problem

Static embeddings gave the same representation to "bank" whether it meant a financial institution or a river's edge. This limitation sparked the development of contextual embeddings.

BERT: Bidirectional Understanding

BERT revolutionized NLP by providing context-dependent representations using bidirectional transformers and masked language modeling.

# Using BERT embeddings with transformers library
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
 
# Initialize BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
 
def get_contextual_embedding(sentence, target_word):
    """Get contextual embedding for a word in a sentence"""
    # Tokenize and encode
    inputs = tokenizer(sentence, return_tensors='pt', padding=True, truncation=True)
 
    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs)
        embeddings = outputs.last_hidden_state
 
    # Find target word position (assumes the target appears as a single
    # WordPiece token; longer or rarer words would need subword handling)
    tokens = tokenizer.tokenize(sentence)
    try:
        word_idx = tokens.index(target_word) + 1  # +1 for the [CLS] token
        word_embedding = embeddings[0][word_idx]
        return word_embedding
    except ValueError:
        return None
 
# Example: "bank" has different meanings in different contexts
sentence1 = "I went to the bank to deposit money"
sentence2 = "We sat by the river bank"
 
bank_embedding1 = get_contextual_embedding(sentence1, "bank")
bank_embedding2 = get_contextual_embedding(sentence2, "bank")
 
if bank_embedding1 is not None and bank_embedding2 is not None:
    # Calculate cosine similarity
    similarity = torch.cosine_similarity(bank_embedding1, bank_embedding2, dim=0)
    print(f"Contextual similarity of 'bank' in different contexts: {similarity:.3f}")
    print("Lower similarity indicates different meanings!")

Chapter 4: Practical Applications

Semantic Search

# Semantic search using embeddings
from sentence_transformers import SentenceTransformer
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
 
# Load pre-trained sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')
 
# Document corpus
documents = [
    "The cat sits on the mat",
    "A feline rests on the carpet",
    "Dogs are loyal animals",
    "Python is a programming language",
    "Machine learning uses algorithms",
    "Artificial intelligence transforms technology"
]
 
# Query
query = "A cat on a rug"
 
# Generate embeddings
doc_embeddings = model.encode(documents)
query_embedding = model.encode([query])
 
# Calculate similarities
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
 
# Rank documents by similarity
ranked_docs = sorted(zip(documents, similarities), key=lambda x: x[1], reverse=True)
 
print("Semantic search results:")
for i, (doc, score) in enumerate(ranked_docs, 1):
    print(f"{i}. Score: {score:.3f} - {doc}")

Sentiment Analysis

# Sentiment analysis using embeddings
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
 
# Sample data (in practice, use larger datasets)
texts = [
    "I love this movie, it's amazing!",
    "This film is terrible and boring",
    "Great acting and wonderful story",
    "Worst movie I've ever seen",
    "Pretty good, I enjoyed it",
    "Fantastic cinematography and plot",
    "Disappointing and poorly written",
    "Excellent performance by the actors"
]
labels = [1, 0, 1, 0, 1, 1, 0, 1]  # 1 = positive, 0 = negative
 
# Generate embeddings (reusing the SentenceTransformer `model` loaded above)
embeddings = model.encode(texts)
 
# Train classifier
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.3, random_state=42
)
classifier = LogisticRegression()
classifier.fit(X_train, y_train)
 
# Predict sentiment for new text
new_texts = [
    "This movie is fantastic!",
    "I didn't like it at all",
    "It was okay, nothing special"
]
 
for text in new_texts:
    new_embedding = model.encode([text])
    prediction = classifier.predict(new_embedding)[0]
    probability = classifier.predict_proba(new_embedding)[0]
 
    print(f"Text: '{text}'")
    print(f"Sentiment: {'Positive' if prediction == 1 else 'Negative'}")
    print(f"Confidence: {max(probability):.3f}\n")

Chapter 5: The Bias Challenge

Detecting Bias in Embeddings

Word embeddings can inherit and amplify societal biases present in training data. Here's how to detect gender bias:

# Bias detection in word embeddings
import numpy as np

def detect_gender_bias(model, male_words, female_words, career_words, family_words):
    """Detect gender bias by projecting words onto a gender direction
    (in the spirit of WEAT and Bolukbasi et al.'s direct-bias measure)"""
 
    def get_embeddings(words):
        embeddings = []
        for word in words:
            if word in model.wv:
                embeddings.append(model.wv[word])
        return np.array(embeddings) if embeddings else np.array([])
 
    # Get embeddings for each category
    male_emb = get_embeddings(male_words)
    female_emb = get_embeddings(female_words)
    career_emb = get_embeddings(career_words)
    family_emb = get_embeddings(family_words)
 
    if len(male_emb) == 0 or len(female_emb) == 0:
        return "Insufficient vocabulary"
 
    # Calculate mean embeddings
    male_mean = np.mean(male_emb, axis=0)
    female_mean = np.mean(female_emb, axis=0)
 
    # Calculate bias direction (male - female)
    bias_direction = male_mean - female_mean
    bias_direction = bias_direction / np.linalg.norm(bias_direction)
 
    # Test career vs family associations
    career_scores = []
    family_scores = []
 
    for word in career_words:
        if word in model.wv:
            word_vec = model.wv[word]
            # Project onto bias direction
            bias_score = np.dot(word_vec, bias_direction)
            career_scores.append(bias_score)
 
    for word in family_words:
        if word in model.wv:
            word_vec = model.wv[word]
            bias_score = np.dot(word_vec, bias_direction)
            family_scores.append(bias_score)
 
    career_mean = np.mean(career_scores) if career_scores else 0
    family_mean = np.mean(family_scores) if family_scores else 0
 
    return {
        'career_bias_score': career_mean,
        'family_bias_score': family_mean,
        'bias_difference': career_mean - family_mean,
        'interpretation': 'Positive values indicate male association, negative indicate female association'
    }
 
# Example usage (requires trained Word2Vec model)
male_words = ['man', 'male', 'he', 'his', 'him', 'father', 'son']
female_words = ['woman', 'female', 'she', 'her', 'hers', 'mother', 'daughter']
career_words = ['executive', 'management', 'professional', 'corporation', 'salary', 'business']
family_words = ['home', 'parents', 'children', 'family', 'marriage', 'wedding']
 
# Uncomment to test with your trained model:
# bias_results = detect_gender_bias(model, male_words, female_words, career_words, family_words)
# print("Gender bias analysis:", bias_results)

Bias Mitigation

# Simple bias mitigation through vector projection
def debias_embeddings(model, bias_direction, words_to_debias):
    """Remove bias component from word embeddings"""
 
    # Normalize bias direction
    bias_direction = bias_direction / np.linalg.norm(bias_direction)
 
    debiased_embeddings = {}
 
    for word in words_to_debias:
        if word in model.wv:
            original_vec = model.wv[word].copy()
 
            # Project out the bias component
            bias_component = np.dot(original_vec, bias_direction) * bias_direction
            debiased_vec = original_vec - bias_component
 
            debiased_embeddings[word] = debiased_vec
 
    return debiased_embeddings
 
# Example: Remove gender bias from profession words
profession_words = ['doctor', 'nurse', 'engineer', 'teacher', 'programmer']
# debiased = debias_embeddings(model, bias_direction, profession_words)
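
As a sanity check (this projection mirrors the neutralize step of Bolukbasi et al.'s hard-debiasing method), a debiased vector should have an essentially zero component along the bias direction. Following the convention above, the calls stay commented out until a trained model and bias direction are available:

# Verify debiasing: the bias component should shrink to ~0 after projection
# (assumes `model` and `bias_direction` from the bias-detection step above)
# debiased = debias_embeddings(model, bias_direction, profession_words)
# for word, vec in debiased.items():
#     before = np.dot(model.wv[word], bias_direction)
#     after = np.dot(vec, bias_direction)
#     print(f"{word}: bias before = {before:+.3f}, after = {after:+.3f}")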

Chapter 6: Modern Developments

Transformer-Based Embeddings

# Using modern transformer embeddings
from transformers import AutoTokenizer, AutoModel
import torch
 
# Load the model once so repeated calls don't re-initialize the weights
tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def get_sentence_embedding(text):
    """Get a sentence-level embedding using transformers"""
 
    # Tokenize and encode
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
 
    # Get embeddings
    with torch.no_grad():
        outputs = model(**inputs)
        # Use mean pooling for sentence representation
        embeddings = outputs.last_hidden_state.mean(dim=1)
 
    return embeddings
 
# Example: Compare sentence meanings
sentences = [
    "The quick brown fox jumps over the lazy dog",
    "A fast brown fox leaps over a sleepy dog",
    "I love eating pizza for dinner"
]
 
embeddings = [get_sentence_embedding(sent) for sent in sentences]
 
# Calculate similarities
sim_1_2 = torch.cosine_similarity(embeddings[0], embeddings[1])
sim_1_3 = torch.cosine_similarity(embeddings[0], embeddings[2])
 
print(f"Similarity between sentences 1 and 2: {sim_1_2.item():.3f}")
print(f"Similarity between sentences 1 and 3: {sim_1_3.item():.3f}")

Multilingual Embeddings

# Multilingual embeddings example
from sentence_transformers import SentenceTransformer
 
# Load multilingual model
multilingual_model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
 
# Sentences in different languages with similar meanings
sentences = [
    "Hello, how are you?",  # English
    "Hola, ¿cómo estás?",   # Spanish
    "Bonjour, comment allez-vous?",  # French
    "Guten Tag, wie geht es Ihnen?",  # German
    "I love programming",  # English (different meaning)
    "Me encanta programar"  # Spanish (same as above)
]
 
# Generate embeddings
embeddings = multilingual_model.encode(sentences)
 
# Calculate cross-lingual similarities
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(embeddings)
 
print("Cross-lingual similarity examples:")
print(f"English 'Hello' vs Spanish 'Hola': {similarities[0][1]:.3f}")
print(f"English 'Hello' vs French 'Bonjour': {similarities[0][2]:.3f}")
print(f"English 'I love programming' vs Spanish 'Me encanta programar': {similarities[4][5]:.3f}")
print(f"English 'Hello' vs English 'I love programming': {similarities[0][4]:.3f}")

Best Practices and Implementation Tips

Choosing the Right Embedding

# Embedding selection guide
def recommend_embedding(use_case, resources, domain):
    """Guide for choosing appropriate embedding method"""
 
    recommendations = {
        ('real_time', 'low', 'general'): {
            'method': 'Word2Vec/GloVe',
            'reason': 'Fast inference, low memory usage',
            'trade_offs': 'Less contextual understanding'
        },
        ('accuracy', 'high', 'general'): {
            'method': 'BERT/RoBERTa',
            'reason': 'Superior contextual understanding',
            'trade_offs': 'Higher computational cost'
        },
        ('multilingual', 'medium', 'general'): {
            'method': 'mBERT/XLM-R',
            'reason': 'Cross-lingual capabilities',
            'trade_offs': 'May sacrifice monolingual performance'
        },
        ('specialized', 'medium', 'medical'): {
            'method': 'BioBERT/ClinicalBERT',
            'reason': 'Domain-specific knowledge',
            'trade_offs': 'Limited to specific domain'
        }
    }
 
    key = (use_case, resources, domain)
    return recommendations.get(key, {
        'method': 'Consider hybrid approach',
        'reason': 'Complex requirements may need multiple methods',
        'trade_offs': 'Increased complexity'
    })
 
# Example usage
print("Real-time app recommendation:")
print(recommend_embedding('real_time', 'low', 'general'))
 
print("\nHigh-accuracy NLP recommendation:")
print(recommend_embedding('accuracy', 'high', 'general'))

Evaluation Framework

# Comprehensive embedding evaluation
def evaluate_embeddings(model, word_pairs, similarity_scores):
    """Evaluate embedding quality using word similarity tasks"""
    from scipy.stats import spearmanr
    from sklearn.metrics.pairwise import cosine_similarity
 
    predicted_similarities = []
    valid_pairs = 0
 
    for (word1, word2), human_score in zip(word_pairs, similarity_scores):
        if hasattr(model, 'wv'):  # Word2Vec/FastText
            if word1 in model.wv and word2 in model.wv:
                similarity = model.wv.similarity(word1, word2)
                predicted_similarities.append(similarity)
                valid_pairs += 1
            else:
                predicted_similarities.append(0)
        else:  # Sentence transformers
            try:
                embeddings = model.encode([word1, word2])
                similarity = cosine_similarity([embeddings[0]], [embeddings[1]])[0][0]
                predicted_similarities.append(similarity)
                valid_pairs += 1
            except Exception:
                predicted_similarities.append(0)
 
    # Calculate correlation with human judgments
    if len(predicted_similarities) > 1:
        correlation, p_value = spearmanr(predicted_similarities, similarity_scores)
    else:
        correlation, p_value = 0, 1
 
    return {
        'correlation': correlation,
        'p_value': p_value,
        'coverage': valid_pairs / len(word_pairs),
        'valid_pairs': valid_pairs
    }
 
# Example evaluation
word_pairs = [
    ('king', 'queen'),
    ('man', 'woman'),
    ('car', 'automobile'),
    ('big', 'large'),
    ('cat', 'dog')
]
human_scores = [0.8, 0.7, 0.9, 0.85, 0.6]
 
# Uncomment to evaluate your model:
# results = evaluate_embeddings(model, word_pairs, human_scores)
# print("Evaluation results:", results)

Conclusion: The Continuing Journey

Word embeddings have fundamentally transformed how machines understand language, bridging the gap between human linguistic intuition and computational processing. This journey from simple one-hot vectors to sophisticated contextual representations represents one of the most significant advances in artificial intelligence.

Key Takeaways

  1. Statistical Learning Works: Meaning can be learned from patterns in data rather than explicitly programmed
  2. Context Matters: Modern embeddings adapt to context, solving the polysemy problem
  3. Bias is Real: Embeddings inherit societal biases and require careful evaluation and mitigation
  4. Applications are Vast: From search to translation to sentiment analysis, embeddings power countless applications
  5. The Future is Multimodal: Next-generation systems integrate text with vision, audio, and other modalities

Looking Forward

# The future of embeddings: Conceptual framework
class FutureEmbeddingSystem:
    """Conceptual future embedding system"""
 
    def __init__(self):
        self.capabilities = [
            'contextual_awareness',
            'cultural_sensitivity',
            'bias_mitigation',
            'multimodal_integration',
            'continuous_learning',
            'interpretability'
        ]
 
    def generate_embedding(self, input_data, context=None):
        """Generate contextual, unbiased, multimodal embeddings"""
        # Conceptual implementation
        features = {
            'semantic_meaning': self.extract_semantics(input_data),
            'contextual_info': self.analyze_context(input_data, context),
            'cultural_markers': self.detect_cultural_context(input_data),
            'bias_score': self.measure_bias(input_data),
            'confidence': self.calculate_confidence(input_data)
        }
 
        return {
            'embedding': self.compute_representation(features),
            'metadata': features,
            'explanations': self.generate_explanations(features)
        }
 
    def extract_semantics(self, data):
        return "Deep semantic understanding"
 
    def analyze_context(self, data, context):
        return "Contextual adaptation"
 
    def detect_cultural_context(self, data):
        return "Cultural awareness"
 
    def measure_bias(self, data):
        return "Bias detection and mitigation"
 
    def calculate_confidence(self, data):
        return "Uncertainty quantification"
 
    def compute_representation(self, features):
        return "Multimodal embedding vector"
 
    def generate_explanations(self, features):
        return "Interpretable explanations"
 
# The future is bright
future_system = FutureEmbeddingSystem()
print("Future embedding systems will be:")
for capability in future_system.capabilities:
    print(f"- {capability.replace('_', ' ').title()}")

The story of word embeddings is ultimately about humanity's quest to understand the nature of meaning and communication. While current technologies will eventually become obsolete, the insights gained from this journey—the importance of representation quality, the need for careful evaluation, and the responsibility to address bias and fairness—will continue to guide our progress toward more intelligent and equitable AI systems.

The mathematical revolution that taught machines to understand language continues, with each advancement bringing us closer to truly intelligent systems that can comprehend and communicate with the nuance and sophistication of human language.
