Talk
Embedding GPU-accelerated vector search into Pandas via the Extension API and an ANSI C module with a
zero-copy data path.
Problem
Pandas executes on a single CPU thread by design - once datasets grow beyond 1–5 GB, joins and
aggregations can take minutes
A typical embeddings pipeline extracts data from a DataFrame into NumPy, passes it to FAISS or
HNSWlib, and joins the results back - each step involves buffer copying and type conversion
On a dataset of 1M vectors × 768 dimensions (float32), this means ~3 GB of redundant allocations per search
Neither cuDF nor FAISS provides a unified API where vector search is a first-class citizen inside a
DataFrame
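The ~3 GB figure can be checked with simple arithmetic. The sketch below is illustrative only: the helper name embedding_bytes is hypothetical, it assumes float32 values (4 bytes each), and the number of copies a real pipeline makes is an assumption that depends on the libraries and dtypes involved.

```python
def embedding_bytes(n_vectors: int, dim: int, itemsize: int = 4) -> int:
    """Size in bytes of one dense copy of the embedding matrix."""
    return n_vectors * dim * itemsize


# 1M vectors x 768 dimensions, float32 (4 bytes per value)
one_copy = embedding_bytes(1_000_000, 768)
print(f"one copy: {one_copy / 1e9:.2f} GB")  # one copy: 3.07 GB

# Every extra materialization (DataFrame -> NumPy -> index input)
# adds another full copy to the peak footprint.
for copies in (1, 2, 3):
    print(f"{copies} copies: {copies * one_copy / 1e9:.2f} GB")
```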
Approach: Extension API + ANSI C Module
The pandas Extension API (stable since v1.5) allows creating a custom VectorDtype and VectorArray
that store dense vectors in a contiguous float32 buffer
register_series_accessor adds a vec accessor, enabling df['embedding'].vec.search(query, k=10)
natively in pandas
An ANSI C extension module obtains a direct data pointer via the CPython buffer protocol (Py_buffer) -
zero-copy, no intermediate allocations
GPU kernels invoked from C: CUDA for NVIDIA or OpenCL for cross-vendor portability
GIL released during GPU execution via Py_BEGIN_ALLOW_THREADS to avoid blocking Python
ANSI C (not pure CUDA) is a deliberate choice: the Python C Extension API is a C ABI, and GPU kernels
compile separately, which keeps the portable bridge apart from hardware-specific code
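The accessor layer of this approach can be sketched in a few lines. This is a minimal illustration, not the talk's implementation: the class name VecAccessor and the shape of the result are hypothetical, the column is a plain object Series rather than a VectorArray, and a brute-force NumPy search on the CPU stands in for the C module and GPU kernels. Only register_series_accessor is the real pandas API.

```python
import numpy as np
import pandas as pd


@pd.api.extensions.register_series_accessor("vec")
class VecAccessor:
    """Hypothetical `vec` accessor; CPU brute force stands in for the GPU path."""

    def __init__(self, series: pd.Series) -> None:
        self._series = series

    def search(self, query, k: int = 10) -> pd.DataFrame:
        # NOTE: this materializes a copy. A VectorArray as described in the
        # talk would hand over its contiguous float32 buffer instead.
        matrix = np.array(self._series.tolist(), dtype=np.float32)
        q = np.asarray(query, dtype=np.float32)
        dist = np.sum((matrix - q) ** 2, axis=1)  # squared L2 distance per row
        k = min(k, len(dist))
        top = np.argsort(dist, kind="stable")[:k]  # indices of the k nearest rows
        return pd.DataFrame({"distance": dist[top]}, index=self._series.index[top])


df = pd.DataFrame({"embedding": [[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]]})
hits = df["embedding"].vec.search([0.9, 0.1], k=2)
print(hits)
```

The result keeps the original index labels, so it can be joined back to the DataFrame without a separate ID mapping.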
GPU Vector Search: Numbers
Faiss v1.10 + cuVS: GPU IVF-PQ builds 4.7× faster than the classic GPU implementation, search latency reduced by 8.1×
CAGRA - a GPU-optimized graph index - builds 12.3× faster than CPU HNSW, searches 4.7× faster on
Deep100M
RAPIDS cuDF shows up to 150× acceleration on 5 GB datasets
Practical Edge Cases
Arrow interop (__arrow_array__) for Parquet serialization
Lazy device transfer with GPU-side caching
CPU fallback when no GPU is present - "GPU-first, CPU-fallback" strategy
Buffer lifetime management during pandas block reorganization
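The buffer lifetime concern can be illustrated with the standard library alone. The sketch below does not touch pandas internals. It only shows the buffer protocol rule that a C module relies on: while a buffer is exported, its owner cannot be resized, and the export must be released explicitly. The bytearray stands in for the array that owns the embeddings, which is an assumption made for illustration.

```python
buf = bytearray(16)     # stands in for the object that owns the embedding memory
view = memoryview(buf)  # an active export, like a Py_buffer held by C code

try:
    buf.extend(b"\x00" * 16)  # resizing could move the memory under the view
    resized_while_exported = True
except BufferError:
    resized_while_exported = False

view.release()            # analogous to PyBuffer_Release in a C module
buf.extend(b"\x00" * 16)  # allowed once no exports remain

print(resized_while_exported, len(buf))  # False 32
```

A C module that launches asynchronous GPU work would need to hold its Py_buffer until the kernel has finished reading from it.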
Audience Takeaways
Integration architecture: ExtensionArray → Buffer Protocol → C module → GPU kernels
Working code examples for custom dtype, series accessor, and C extension
Understanding where redundant copies occur and how to eliminate them
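The difference between a copy and a zero-copy view can be shown without any third-party packages. This is a sketch under stated assumptions: array.array stands in for the contiguous float32 buffer, and ctypes plays the role of the C module reading the same memory through a raw pointer.

```python
import array
import ctypes

vectors = array.array("f", [1.0, 2.0, 3.0, 4.0])  # contiguous C float storage

view = memoryview(vectors)          # buffer protocol view: no data copied
copied = array.array("f", vectors)  # explicit copy, for comparison

# ctypes maps the same memory, much as a C module would via Py_buffer.buf
c_floats = (ctypes.c_float * len(vectors)).from_buffer(vectors)
c_floats[0] = 42.0                  # write through the raw pointer

print(vectors[0], view[0], copied[0])  # 42.0 42.0 1.0
print(ctypes.addressof(c_floats) == vectors.buffer_info()[0])  # True
```

The write is visible through the original array and the view but not through the copy, which is the property the zero-copy data path depends on.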
About the Speaker
Aleksandr Borgardt is a high-performance computing engineer specializing in databases and data
processing, with over a decade of experience in R&D and solutions engineering. He has worked
across multiple industries including AdTech, FinTech, MedTech, and machine learning infrastructure,
building systems that handle large-scale data workloads efficiently.
Aleksandr is the creator of two open-source projects: otterbrix, a framework for semi-structured
data processing that supports columnar storage and GPU acceleration, and Otterstax, a Data Fabric
and Data Mesh platform designed for federated query execution without centralized data storage. Both
projects reflect his focus on pushing performance boundaries while keeping tools accessible to the
broader engineering community.
His technical interests span database internals, columnar storage design, GPU-accelerated query
execution via CUDA and OpenCL, and integrating machine learning directly into SQL pipelines. He has
hands-on experience with vector index architectures, parallel execution models, and building C/C++
extensions for Python-based data tools.
He is a firm believer that open source makes the world a better place.