Here is a concise learning recap summarizing your journey from “Memory Crashes” to a “Production-Ready Clustering Pipeline.”

From OOM to Optimized: Scaling Unstructured Clustering

Goal: Run Leiden community detection on 30k–100k text embeddings whose pairwise-similarity graph is dense.
Challenge: The process was crashing (Out of Memory) even on moderate datasets.

  1. The “Graph Explosion” Problem
  We discovered that a standard Threshold Graph (e.g., “connect everyone > 70% similar”) scales quadratically and becomes untenable for large datasets.
  • The Math: 100k nodes at 50% density means ~5 billion stored edges (100,000² × 0.5, counting both directions). At roughly 12 bytes per edge, that is ~60 GB of RAM (see the back-of-envelope sketch after this list).
  • The Fix: Switch to k-Nearest Neighbors (k-NN).
    • Instead of “Connect to everyone,” we say “Connect to the top 30.”
    • This caps memory usage linearly (N × 30 edges) instead of quadratically (N² edges).
    • Result: Graph size dropped from ~60 GB to ~50 MB.
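A quick back-of-envelope check makes the gap concrete. This is a rough sketch assuming ~12 bytes per stored edge (two int32 indices plus one float32 weight); exact costs vary by graph library.

```python
# Order-of-magnitude memory comparison for a 100k-node graph.
# Assumption: ~12 bytes per stored edge (two int32 indices + one float32 weight).
n = 100_000
bytes_per_edge = 12

dense_edges = int(n * n * 0.5)   # 50% density, both directions ≈ 5 billion edges
knn_edges = n * 30               # top-30 neighbors per node = 3 million edges

print(f"Threshold graph: {dense_edges * bytes_per_edge / 1e9:.0f} GB")  # ~60 GB
print(f"k-NN graph:      {knn_edges * bytes_per_edge / 1e6:.0f} MB")    # ~36 MB raw; ~50 MB with overhead
```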
  2. The “Hybrid” Approach
  Pure k-NN was too aggressive: it bridged distinct clusters together, dropping the community count from ~100 to ~20.
  • The Fix: Hybrid k-NN + Thresholding.
    • Step 1: Use PyNNDescent to find the top 30 candidates (Fast).
    • Step 2: Apply a strict mask (Similarity > 0.7) to prune weak links (Precise).
  • Outcome: We got the speed and memory safety of k-NN with the high-quality separation of a threshold (radius) graph; a minimal sketch follows below.
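Here is a minimal sketch of the hybrid build, assuming `embeddings` is an (N, dim) float32 array; with PyNNDescent's cosine metric, distance = 1 − similarity, which is how the mask below translates the 0.7 threshold. Variable names are illustrative.

```python
import numpy as np
from pynndescent import NNDescent

K, SIM_THRESHOLD = 30, 0.7

# Step 1: approximate top-K neighbors (fast).
index = NNDescent(embeddings, metric="cosine", n_neighbors=K)
neighbors, distances = index.neighbor_graph          # both arrays are (N, K)

# Step 2: strict similarity mask to prune weak links (precise).
similarities = 1.0 - distances                       # cosine distance -> similarity
rows = np.repeat(np.arange(embeddings.shape[0]), K)
cols = neighbors.ravel()
sims = similarities.ravel()

# Keep only strong links; `rows != cols` also drops each point's self-match.
mask = (sims > SIM_THRESHOLD) & (rows != cols)
rows, cols, weights = rows[mask], cols[mask], sims[mask]
```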
  3. The “Embedding Spike” Problem
  Even before clustering, generating embeddings for 30k texts caused crashes.
  • The Cause:
    • Tensor Expansion: Sending all 30k texts to the model at once created massive temporary activation matrices (~20 GB of RAM).
    • The “VStack” Trap: Accumulating results in a Python list and then calling np.vstack(list) momentarily doubles memory usage, because the list of batch arrays and the newly stacked array coexist.
  • The Fix:
    • Batching: Process 256 texts at a time to keep peak “expansion” memory low (~200 MB).
    • Pre-allocation: Create an empty np.zeros((N, dim)) array first, then fill it slice-by-slice. Memory usage stays flat, with no spikes (see the sketch after this list).
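A sketch of the flat-memory loop. It assumes a sentence-transformers-style model whose `.encode()` returns a (batch, dim) NumPy array; the model name, `texts`, and the 384-dimension constant are illustrative stand-ins, not names from the original script.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding backend

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model (dim = 384)
texts = [...]                                    # the ~30k input strings

BATCH, EMB_DIM = 256, 384
embeddings = np.zeros((len(texts), EMB_DIM), dtype=np.float32)  # pre-allocate once

for start in range(0, len(texts), BATCH):
    end = min(start + BATCH, len(texts))
    # Fill the slice in place: no growing list, no final np.vstack spike.
    embeddings[start:end] = model.encode(texts[start:end])
```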
  4. Zero-Copy Engineering
  We learned that Python objects are heavy: millions of small tuples cost far more memory than the raw numbers they hold.
  • Bad: edges = list(zip(rows, cols)) creates millions of tuple objects (Slow, Heavy).
  • Good: edges = np.column_stack((rows, cols)) keeps data in pure C-contiguous memory (Fast, Tiny).
  • Result: We can pass millions of edges to igraph in milliseconds using almost no extra RAM.

Final Verdict
By combining Batched Inference, Pre-allocated Memory, and Hybrid k-NN, we turned a script that crashed on 30k rows into a pipeline that scales comfortably to 100k+ rows on a standard laptop, while maintaining the same clustering quality as the brute-force threshold method. The closing sketch below ties the pieces together.
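To tie it together, a closing sketch of handing the pruned edge list to igraph and running Leiden. It reuses `rows`, `cols`, and `weights` from the hybrid sketch above; `n_nodes` (the number of embedded texts) is assumed, as is a recent python-igraph (>= 0.10) that accepts a NumPy edge array directly.

```python
import numpy as np
import igraph as ig
import leidenalg as la

edges = np.column_stack((rows, cols))   # one C-contiguous array, zero tuple objects

g = ig.Graph(n=n_nodes, edges=edges)    # explicit n keeps isolated nodes alive
g.es["weight"] = weights
g.simplify(combine_edges="max")         # merge i->j / j->i duplicates from the k-NN pass

partition = la.find_partition(g, la.ModularityVertexPartition, weights="weight")
print(f"Found {len(partition)} communities")
```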