Data Engineering

The modern data stack for Indian enterprises: what we've seen work (and what's still hype)

Rhea Rastogi

Co-Founder, SYSTEMBENDER · February 2025


The term "modern data stack" has been used to describe so many different things that it's almost stopped meaning anything. So let's be specific. Here's what we've actually implemented for real clients in India, and the decisions that genuinely matter at each layer.

What "modern data stack" actually means

At its core, the modern data stack is a set of architectural choices that prioritise cloud-native storage (not on-prem), ELT over ETL (load raw data first, then transform it inside the warehouse), SQL-first transformation (not Spark for everything), and self-service analytics (business users can query data themselves, not just data teams).

The stack most of our clients land on: S3 or GCS for raw storage → Redshift, BigQuery, or Snowflake for the warehouse → dbt for transformation → Airflow for orchestration → Tableau, QuickSight, or Looker for BI.
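To make "load raw, transform in place" concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the warehouse. The table names (raw_orders, stg_orders), the columns, and the sample rows are all hypothetical; a real setup would use Redshift, BigQuery, or Snowflake with dbt managing the SQL.

```python
import sqlite3

# Hypothetical raw records, exactly as a source system might emit them.
RAW_ORDERS = [
    ("o-1", "2025-02-01", "1200.50", "paid"),
    ("o-2", "2025-02-01", "not-a-number", "paid"),
    ("o-3", "2025-02-02", "310.00", "refunded"),
]


def load_raw(conn, rows):
    """The "L" step: land records untouched, every column stored as text."""
    conn.execute(
        "CREATE TABLE raw_orders "
        "(order_id TEXT, order_date TEXT, amount TEXT, status TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?, ?)", rows)


def transform(conn):
    """The "T" step: build a typed, filtered table with SQL inside the warehouse."""
    conn.execute("DROP TABLE IF EXISTS stg_orders")
    conn.execute(
        """
        CREATE TABLE stg_orders AS
        SELECT order_id, order_date, CAST(amount AS REAL) AS amount
        FROM raw_orders
        WHERE status = 'paid'
          AND amount GLOB '[0-9]*'  -- keep only amounts that start with a digit
        """
    )


conn = sqlite3.connect(":memory:")
load_raw(conn, RAW_ORDERS)
transform(conn)

staged = conn.execute(
    "SELECT order_id, amount FROM stg_orders ORDER BY order_id"
).fetchall()
raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
```

The point of the pattern is that raw_orders still holds all three records after the transform runs. If the filter logic changes, you rerun transform against the same raw table instead of re-extracting from the source.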

"The best data stack is the one your team can operate without a specialised platform engineer on call 24/7."

The storage layer: cloud object storage first, always

S3 (AWS) or GCS (GCP) as your raw data lake is non-negotiable at this point. The cost per GB is negligible, the durability is excellent, and having a raw immutable copy of everything means you can replay transformations when your logic changes. The one decision worth making carefully: partitioning strategy. Partition by date and by source system from day one. Retrofitting a partitioning strategy later is painful, because every existing object has to be copied to a new key.
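One way to bake the date-and-source rule into the object key is sketched below. It assumes a Hive-style key=value layout, which many query engines can use to skip irrelevant partitions; the "raw/" prefix, the partition names, and the function name are illustrative choices, not a standard.

```python
from datetime import date


def raw_object_key(source_system: str, event_date: date, filename: str) -> str:
    """Build a partitioned object key: source system first, then date."""
    return (
        f"raw/source_system={source_system}/"
        f"event_date={event_date.isoformat()}/{filename}"
    )


# Hypothetical example: one file from a CRM export.
key = raw_object_key("crm", date(2025, 2, 14), "part-0000.parquet")
```

Routing every writer through a single function like this keeps the layout consistent across pipelines, which is most of what "from day one" buys you.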

Transformation: where dbt has genuinely won

dbt has won the transformation layer argument, and it deserved to. The core insight (that transformation is a software engineering problem and should be treated as one, with version control, testing, and documentation) is correct. The things we consistently see clients underestimate:

dbt tests are not optional. Unique, not-null, accepted values, referential integrity: these aren't nice-to-haves. A data pipeline without tests is a liability disguised as infrastructure.
Model organisation matters more than you think. Staging → intermediate → mart is the right pattern. Teams that don't follow it end up with spaghetti SQL that nobody can maintain six months later.
dbt is not a replacement for Spark. If you're processing terabytes of event data or doing complex joins at scale, you still need Spark or Flink at the ingestion layer. dbt lives in the transformation layer, not the ingestion layer.
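The four checks named above are simple to state precisely. The sketch below expresses them in plain Python over lists of dicts, purely to show what each one asserts. It is not how dbt implements them (dbt compiles tests to SQL that runs in the warehouse), and the function names and sample rows are hypothetical.

```python
from collections import Counter


def failing_unique(rows, column):
    """Rows whose value in `column` appears more than once."""
    counts = Counter(r[column] for r in rows)
    return [r for r in rows if counts[r[column]] > 1]


def failing_not_null(rows, column):
    """Rows where `column` is missing a value."""
    return [r for r in rows if r[column] is None]


def failing_accepted_values(rows, column, accepted):
    """Rows whose non-null value falls outside the accepted set."""
    return [r for r in rows if r[column] is not None and r[column] not in accepted]


def failing_relationships(rows, column, parent_rows, parent_column):
    """Rows whose non-null foreign key has no match in the parent table."""
    parent_keys = {p[parent_column] for p in parent_rows}
    return [r for r in rows if r[column] is not None and r[column] not in parent_keys]


customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": "o-1", "customer_id": 1, "status": "paid"},
    {"order_id": "o-1", "customer_id": 2, "status": "paid"},        # duplicate id
    {"order_id": "o-2", "customer_id": None, "status": "shipped?"},  # null key, bad status
    {"order_id": "o-3", "customer_id": 9, "status": "refunded"},     # orphaned key
]

results = {
    "unique": len(failing_unique(orders, "order_id")),
    "not_null": len(failing_not_null(orders, "customer_id")),
    "accepted_values": len(failing_accepted_values(orders, "status", {"paid", "refunded"})),
    "relationships": len(failing_relationships(orders, "customer_id", customers, "customer_id")),
}
```

Each function returns the offending rows rather than a boolean, so a non-empty result both fails the check and tells you where to look.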

Orchestration: Airflow is still the answer, but it's not free

Airflow remains the standard for orchestration in most enterprise contexts, not because it's simple (it isn't) but because it's powerful, extensible, and most data teams already know how to operate it. Managed Airflow (MWAA on AWS, Cloud Composer on GCP) is worth the cost for teams that don't want to manage the infrastructure.
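Underneath the scheduling, retries, and UI, an orchestrator's core job is running tasks in dependency order. The sketch below shows that one idea using Python's standard-library graphlib module (Python 3.9+). It is not Airflow's API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
PIPELINE = {
    "land_crm_files": set(),
    "land_billing_files": set(),
    "load_warehouse": {"land_crm_files", "land_billing_files"},
    "dbt_run": {"load_warehouse"},
    "dbt_test": {"dbt_run"},
    "refresh_dashboards": {"dbt_test"},
}


def execution_order(graph):
    """Return one valid run order; raises graphlib.CycleError on circular dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(PIPELINE)
```

The two landing tasks have no dependency on each other, so either may come first. An orchestrator would typically run them in parallel, then add everything this sketch leaves out: schedules, retries, alerting, and backfills. Those are the parts that make Airflow "not free" to operate.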

Prefect and Dagster are worth evaluating if you're starting fresh and your team is Python-first. Both offer a better developer experience than Airflow. They're not yet the default in enterprise contexts, but that's changing.

BI: the self-service problem is still unsolved for most

Amazon Q in QuickSight is the closest thing we've seen to genuine self-service analytics: business users typing questions and getting answers without SQL. Tableau and Power BI still require someone who knows the tool to build the dashboards. Looker is powerful but expensive and opinionated. For most Indian mid-market clients, the pragmatic choice is QuickSight if you're already on AWS, Power BI if you're in the Microsoft ecosystem, and Tableau if your business users are already trained on it.

The honest truth: no BI tool fully solves the self-service problem yet. You still need someone maintaining the semantic layer, the data model, and the access controls. Budget for that person: they're often the difference between a data platform that gets used and one that doesn't.