Data Engineering
A data company needed a platform that would enable their teams to dynamically create and execute data pipelines for feature engineering in machine learning workflows. The platform needed to handle hundreds of gigabytes of data cost-effectively, provide a simple user interface for non-technical users, and remain cloud-agnostic to avoid vendor lock-in.
Traditional data pipeline solutions were either too expensive at scale, too complex for general use, or tightly coupled to specific cloud providers. We needed to build something different—a bespoke solution using open-source components that would deliver enterprise-grade performance at a fraction of the typical cost.
Core Requirement: Users should be able to define complex data transformations using simple SQL-like statements, without writing code or understanding the underlying distributed computing infrastructure.
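For illustration only, a transformation of this kind might be expressed as a single declarative statement. The bucket, table, and column names below are invented, not taken from the actual platform:

```python
# Hypothetical pipeline definition as a user might submit it through the
# platform's interface; every name here is illustrative.
pipeline_spec = """
SELECT user_id,
       AVG(session_length) AS avg_session_length,
       COUNT(*)            AS session_count
FROM   read_parquet('s3://example-bucket/events/*.parquet')
WHERE  event_date >= DATE '2024-01-01'
GROUP  BY user_id
"""
```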
We designed the platform around three principles: simplicity for end users, cost efficiency through intelligent resource usage, and flexibility through cloud-agnostic architecture. The solution leverages open-source components orchestrated to work seamlessly together, providing enterprise capabilities without enterprise licensing costs.
Our team selected DuckDB for its ability to query large remote datasets efficiently, Ray for distributed processing across compute clusters, and FastAPI for a modern, performant web interface. Kubernetes provides the orchestration layer, enabling horizontal scaling and efficient resource management.
Architecture overview: a FastAPI web application (SQL-like query interface for filtering) submits work to the Ray framework (parallel processing across cluster nodes), which runs DuckDB (efficient querying of large remote datasets) against AWS S3 or other cloud-agnostic storage, producing filtered and processed datasets.
Users interact with a FastAPI web application that provides a familiar SQL-like interface for defining data transformations. Behind the scenes, the application translates these high-level specifications into distributed processing tasks. This abstraction allows data scientists and analysts to focus on their work rather than infrastructure complexity.
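A minimal sketch of what such an endpoint could look like, assuming a Ray task carries out the execution. The request model, route, and `run_pipeline` task are assumptions for illustration, not the platform's actual API:

```python
from fastapi import FastAPI
from pydantic import BaseModel
import ray

ray.init(ignore_reinit_error=True)
app = FastAPI()

class PipelineRequest(BaseModel):
    query: str        # SQL-like transformation statement from the user
    output_path: str  # destination for the filtered dataset

@ray.remote
def run_pipeline(query: str, output_path: str) -> str:
    # Stand-in for the DuckDB execution sketched further below.
    return output_path

@app.post("/pipelines")
def submit_pipeline(req: PipelineRequest) -> dict:
    # Translate the high-level spec into a distributed task; the caller
    # never touches cluster details.
    ref = run_pipeline.remote(req.query, req.output_path)
    return {"task_id": ref.hex()}
```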
Ray handles the distribution of workload across the cluster. When a pipeline executes, Ray automatically parallelizes the work, distributing tasks to available compute nodes. This lets the platform scale horizontally: adding nodes increases processing capacity roughly in proportion, with no code changes.
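One way this fan-out could look, assuming the dataset is split into partition files that can be processed independently (paths and the per-partition work are placeholders):

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def process_partition(path: str) -> str:
    # In the real pipeline this would run the DuckDB query for one
    # partition; this body is a stand-in for that per-partition work.
    return path

# Illustrative partition layout; Ray schedules one task per partition
# onto whichever nodes have free CPUs.
partitions = [f"s3://example-bucket/events/part-{i:04d}.parquet" for i in range(8)]
refs = [process_partition.remote(p) for p in partitions]
results = ray.get(refs)  # blocks until every partition is processed
```

Because tasks are independent, adding worker nodes simply gives the scheduler more slots to fill.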
DuckDB serves as the query engine, with a critical capability: it can query data directly from cloud storage without downloading entire datasets first. This "query-in-place" approach dramatically reduces data transfer costs and processing time. DuckDB's columnar storage and vectorized execution provide excellent performance for analytical workloads.
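A minimal sketch of query-in-place using DuckDB's httpfs extension, which fetches only the byte ranges (columns and row groups) a query actually needs rather than whole objects. The bucket, region, and schema are placeholders, and credential setup is omitted:

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
con.execute("SET s3_region = 'us-east-1';")  # placeholder; credentials omitted

# Query Parquet files directly on S3 without downloading them first.
df = con.execute("""
    SELECT user_id, COUNT(*) AS events
    FROM read_parquet('s3://example-bucket/events/*.parquet')
    WHERE event_date >= DATE '2024-01-01'
    GROUP BY user_id
""").df()
```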
The platform supports complex, multi-stage filtering operations. Data can be filtered, transformed, aggregated, and filtered again—all before persisting the final results. This staged approach reduces intermediate data storage and allows for sophisticated data preparation workflows.
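One way to express such a staged filter-transform-aggregate-filter flow is as chained CTEs in a single DuckDB statement, so intermediate stages never touch storage. The stage names and columns below are illustrative:

```python
import duckdb

# Staged pipeline as chained CTEs: filter, transform, aggregate, then
# filter again, all before anything is persisted. Names are illustrative.
staged_query = """
WITH recent AS (                       -- stage 1: raw filter
    SELECT * FROM read_parquet('s3://example-bucket/events/*.parquet')
    WHERE event_date >= DATE '2024-01-01'
),
enriched AS (                          -- stage 2: transform
    SELECT user_id, session_length / 60.0 AS session_minutes FROM recent
),
per_user AS (                          -- stage 3: aggregate
    SELECT user_id, AVG(session_minutes) AS avg_minutes FROM enriched
    GROUP BY user_id
)
SELECT * FROM per_user                 -- stage 4: post-aggregation filter
WHERE avg_minutes > 5
"""

con = duckdb.connect()
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
# Persist only the final result; intermediate stages exist only in memory.
con.execute(f"COPY ({staged_query}) TO 'filtered.parquet' (FORMAT PARQUET)")
```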
The platform's cost profile represents a significant advantage over traditional data processing platforms. The combination of efficient open-source tools, intelligent architecture, and Kubernetes orchestration delivers enterprise-scale performance at startup-friendly costs.
This project demonstrated that open-source tools, when properly integrated, can deliver enterprise capabilities at a fraction of traditional costs. The key is understanding each component's strengths and designing an architecture that leverages them effectively.
The decision to prioritize simplicity in the user interface paid significant dividends. By hiding the complexity of distributed computing behind a SQL-like interface, we enabled a broader range of team members to create and manage data pipelines. This democratization of data processing capabilities accelerated the client's machine learning initiatives.
The cloud-agnostic design proved valuable not just for avoiding vendor lock-in, but also for enabling hybrid and multi-cloud deployments. The client can now run workloads where it makes the most economic and technical sense, rather than being constrained by platform dependencies.