Aleksandr

Open Source Enthusiast

OtterStax

Talk

Pandas + Vector Search on GPU: How to Ensure Compatibility
Track: Software Engineering
Duration: 50 minutes
Graph Analytics, GPU Computing, Vector Search, Pandas, Python

Embedding GPU-accelerated vector search into Pandas via the Extension API and an ANSI C module with a zero-copy data path.

Problem

Pandas is architecturally limited to single-threaded CPU execution: as datasets grow beyond 1–5 GB, joins and aggregations take minutes
A typical embeddings pipeline extracts data from a DataFrame into NumPy, passes it to FAISS or HNSWlib, and joins the results back; each step involves buffer copying and type conversion
On a dataset of 1M vectors × 768 dimensions (float32), this means ~3 GB of redundant allocations per search
Neither cuDF nor FAISS provides a unified API in which vector search is a first-class citizen inside a DataFrame
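
The ~3 GB figure matches the raw size of a single dense copy of the embedding matrix. A quick check, assuming 4-byte float32 elements and exactly 1,000,000 rows (both are assumptions made here for the arithmetic):

```python
# Size of one dense float32 copy of a 1M x 768 embedding matrix.
ROWS = 1_000_000
DIMS = 768
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32

copy_bytes = ROWS * DIMS * BYTES_PER_FLOAT32
copy_gb = copy_bytes / 1e9       # decimal gigabytes
copy_gib = copy_bytes / 2**30    # binary gibibytes

print(f"{copy_bytes:,} bytes = {copy_gb:.2f} GB = {copy_gib:.2f} GiB")
# prints: 3,072,000,000 bytes = 3.07 GB = 2.86 GiB
```

Every additional materialization of the matrix along the pipeline adds another multiple of this amount.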

Approach: Extension API + ANSI C Module

The Pandas Extension API (stable since v1.5) allows creating a custom VectorDtype and VectorArray that store dense vectors in a contiguous float32 buffer
register_series_accessor adds a vec accessor, enabling df['embedding'].vec.search(query, k=10) natively in pandas
An ANSI C extension module obtains a direct data pointer via the CPython Buffer Protocol (Py_buffer): zero-copy, with no intermediate allocations
GPU kernels are invoked from C: CUDA for NVIDIA hardware, or OpenCL for cross-vendor portability
The GIL is released during GPU execution via Py_BEGIN_ALLOW_THREADS so that other Python threads are not blocked
ANSI C (not pure CUDA) is deliberate: the Python C Extension API is a C ABI and GPU kernels compile separately, which separates the portable bridge from hardware-specific code
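
The Python-facing half of this approach can be sketched roughly as follows. This is a minimal illustration, not the talk's implementation: the names VectorDtype, VectorArray and vec come from the abstract, while the `_vectors` attribute, the method bodies, and the brute-force NumPy L2 search (a CPU stand-in for the C module and GPU kernels) are assumptions. Only part of the ExtensionArray interface is implemented.

```python
import numpy as np
import pandas as pd
from pandas.api.extensions import (
    ExtensionArray,
    ExtensionDtype,
    register_series_accessor,
)


class VectorDtype(ExtensionDtype):
    """Illustrative dtype: each element is one dense float32 vector."""
    name = "vector"
    type = np.ndarray

    @classmethod
    def construct_array_type(cls):
        return VectorArray


class VectorArray(ExtensionArray):
    """1-D (to pandas) array backed by one contiguous (n, dim) float32 buffer."""

    def __init__(self, data):
        buf = np.ascontiguousarray(data, dtype=np.float32)
        if buf.ndim != 2:
            raise ValueError("expected a 2-D (n_vectors, dim) array")
        self._vectors = buf  # hypothetical attribute name

    @classmethod
    def _from_sequence(cls, scalars, *, dtype=None, copy=False):
        return cls(np.array(list(scalars), dtype=np.float32))

    @classmethod
    def _concat_same_type(cls, to_concat):
        return cls(np.concatenate([a._vectors for a in to_concat]))

    @property
    def dtype(self):
        return VectorDtype()

    @property
    def nbytes(self):
        return self._vectors.nbytes

    def __len__(self):
        return self._vectors.shape[0]

    def __getitem__(self, item):
        if isinstance(item, (int, np.integer)):
            return self._vectors[item]          # one vector
        return type(self)(self._vectors[item])  # slice, mask or index list

    def isna(self):
        return np.isnan(self._vectors).any(axis=1)

    def copy(self):
        return type(self)(self._vectors.copy())

    def take(self, indices, allow_fill=False, fill_value=None):
        idx = np.asarray(indices, dtype=np.intp)
        if not allow_fill:
            return type(self)(self._vectors[idx])
        # Simplification: -1 positions become NaN vectors; fill_value is ignored.
        out = np.full((len(idx), self._vectors.shape[1]), np.nan, dtype=np.float32)
        valid = idx >= 0
        out[valid] = self._vectors[idx[valid]]
        return type(self)(out)


@register_series_accessor("vec")
class VectorAccessor:
    def __init__(self, series):
        if not isinstance(series.dtype, VectorDtype):
            raise AttributeError("'.vec' requires a VectorDtype column")
        self._series = series

    def search(self, query, k=10):
        """Brute-force L2 search on the CPU (stand-in for the GPU path)."""
        vectors = self._series.array._vectors
        q = np.asarray(query, dtype=np.float32)
        dist = np.linalg.norm(vectors - q, axis=1)
        top = np.argsort(dist)[:k]
        return pd.Series(dist[top], index=self._series.index[top], name="distance")


# Usage: search for the nearest neighbours of row 42.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 8)).astype(np.float32)
s = pd.Series(VectorArray(emb), name="embedding")
hits = s.vec.search(emb[42], k=3)
```

In the design described above, the body of search would hand the contiguous buffer to the C module rather than computing distances in NumPy.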

GPU Vector Search: Numbers

Faiss v1.10 + cuVS: GPU IVF-PQ builds 4.7× faster than classic GPU Faiss, with search latency reduced by 8.1×
CAGRA, a GPU-optimized graph index, builds 12.3× faster than CPU HNSW and searches 4.7× faster on Deep100M
RAPIDS cuDF shows up to 150× acceleration on 5 GB datasets

Practical Edge Cases

Arrow interop (__arrow_array__) for Parquet serialization
Lazy device transfer with GPU-side caching
CPU fallback when no GPU is present: a "GPU-first, CPU-fallback" strategy
Buffer lifetime management during pandas block reorganization
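
The buffer-lifetime concern can be seen in miniature with Python's own buffer protocol: while a consumer holds an exported view, the owner must not reallocate its memory. The sketch below uses a standard-library bytearray as a stand-in for the vector buffer; it is an illustration of the general rule, not of how pandas or the talk's C module handles it.

```python
# A resizable owner (stand-in for a vector buffer) and a consumer holding a view.
owner = bytearray(16)
view = memoryview(owner)        # exports the buffer, like PyObject_GetBuffer in C

try:
    owner.extend(b"\x00" * 16)  # would reallocate while the view is still live
    resized_while_exported = True
except BufferError:
    resized_while_exported = False  # CPython refuses the resize

view.release()                  # like PyBuffer_Release in C
owner.extend(b"\x00" * 16)      # allowed once no exports remain

print(resized_while_exported, len(owner))
# prints: False 32
```

A C extension would presumably follow the same discipline: keep the Py_buffer held for as long as a device transfer reads from the pointer, and release it afterwards.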

Audience Takeaways

Integration architecture: ExtensionArray → Buffer Protocol → C module → GPU kernels
Working code examples for custom dtype, series accessor, and C extension
Understanding where redundant copies occur and how to eliminate them
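
The Buffer Protocol step of the integration path can be illustrated without a compiled module. In this sketch the standard-library array and ctypes modules stand in for the vector buffer and the C extension respectively (both substitutions are assumptions made for illustration); it contrasts a zero-copy view with an ordinary copy.

```python
import array
import ctypes

# Contiguous C-float storage, standing in for the vector buffer.
vectors = array.array("f", [1.0, 2.0, 3.0, 4.0])

# The buffer protocol describes the memory without copying it.
view = memoryview(vectors)
layout = (view.format, view.itemsize, view.contiguous)

# Zero-copy: a ctypes array that shares the same memory as `vectors`.
c_floats = (ctypes.c_float * len(vectors)).from_buffer(vectors)
same_address = ctypes.addressof(c_floats) == vectors.buffer_info()[0]

c_floats[0] = 10.0             # a write through the shared pointer...
seen_by_owner = vectors[0]     # ...is visible to the original object

# Redundant copy: a new allocation that no longer tracks the original.
copied = array.array("f", vectors)
copied[1] = -1.0
original_unchanged = vectors[1] == 2.0
```

Here same_address is True and seen_by_owner is 10.0, whereas the change to copied never reaches vectors. A C module that calls PyObject_GetBuffer on the array would receive the same pointer the ctypes view exposes.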

About the Speaker

Aleksandr Borgardt is a high-performance computing engineer specializing in databases and data processing, with over a decade of experience in R&D and solutions engineering. He has worked across multiple industries including AdTech, FinTech, MedTech, and machine learning infrastructure, building systems that handle large-scale data workloads efficiently.
Aleksandr is the creator of two open-source projects: otterbrix, a framework for semi-structured data processing that supports columnar storage and GPU acceleration, and Otterstax, a Data Fabric and Data Mesh platform designed for federated query execution without centralized data storage. Both projects reflect his focus on pushing performance boundaries while keeping tools accessible to the broader engineering community.
His technical interests span database internals, columnar storage design, GPU-accelerated query execution via CUDA and OpenCL, and integrating machine learning directly into SQL pipelines. He has hands-on experience with vector index architectures, parallel execution models, and building C/C++ extensions for Python-based data tools.
He is a firm believer that open source makes the world a better place.

Recording

Video will be available after the conference.