ETL is Inferior

· 8 min read
Malcolm Navarro
Director of Engineering @ Messari

ELT > ETL

Pipelines are necessary and transformation is where data becomes useful. The debate is not whether those steps matter, but when they should happen.

Extract. Transform. Load.

That middle step is where the crime happens. You pull data from some source system, immediately start "cleaning" it, drop fields you think nobody needs, coerce types, flatten structures, rename things, dedupe records, and then finally write the polished version somewhere durable.

Congrats, you built a pipeline that can only answer the questions you predicted ahead of time.

The better default is ELT:

Extract. Load. Transform.

Get the data. Store the raw thing. Then transform it after you have preserved reality.

ETL Sounds Responsible

This is why ETL has stuck around for so long. It sounds clean.

Nobody wants messy data. Nobody wants random JSON blobs, CSVs with cursed headers, partner payloads with nullable everything, or timestamps that look like they were generated by three teams that hate each other. The instinct is understandable: clean it up before it touches the warehouse.

It feels like taking your shoes off before walking into the house.

Except data is not mud. Data is evidence.

When you transform before loading, you are not just cleaning. You are making assumptions and decisions. Some are obvious and harmless. Some are subtle. Some are wrong. The dangerous part is you are making those decisions at the exact moment where you know the least about the data.

You don't yet know which fields matter. You don't know which weird values are source bugs and which are meaningful business states. You don't know what finance, product, growth, compliance, support, or future-you will ask six months from now.

And yet classic ETL says: make the call now, write the shaped output, and move on.

That is not engineering discipline. That is throwing away optionality in a blazer.

Raw Data Is The Source Of Truth

The first job of a data pipeline is not to make pretty tables.

The first job is to preserve what happened.

The raw payload is the closest thing you have to the source of truth. It is what the API returned. It is what the file contained. It is what the event stream emitted. It is what the database snapshot said at that point in time.

Everything after that is interpretation.

This is the same reason application logs matter. You don't only log the clean happy-path summary and delete the context. You keep enough detail so that when production starts acting sus, you can reconstruct what happened. Data pipelines deserve the same respect.

If you load raw first, every downstream model is just a view of history. Maybe a table, maybe a materialized view, maybe a dbt model, maybe whatever flavor of warehouse wizardry your org uses. But it is derived.

Derived data can be rebuilt.

Lost raw data cannot.

ETL Turns Bugs Into Archaeology

Every data team eventually has the same conversation:

"Why did this metric change?"

Sometimes the answer is obvious. A customer churned. A product launch worked. A source table changed.

But very often the answer is buried in transformation logic. A join duplicated rows. A null filter excluded too much. A timezone conversion moved revenue across days. A partner started sending a new enum value and your code quietly mapped it to unknown. Someone "cleaned" a field that later became important.

With ELT, this is annoying but fixable.

You go back to the raw layer, update the transformation, rebuild the model, compare the before and after, and move on with your life.
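
Here is a rough sketch of that rebuild-and-compare step, using the timezone bug as the example. Everything in it is made up for illustration: the event shape, the `daily_revenue` helper, and the fixed UTC-5 reporting offset (a real pipeline would use proper time zone rules rather than a fixed offset).

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Hypothetical raw events, stored exactly as the source sent them.
RAW_EVENTS = [
    {"order_id": "a1", "amount": 100, "created_at": "2026-03-01T02:30:00+00:00"},
    {"order_id": "a2", "amount": 40, "created_at": "2026-03-01T15:00:00+00:00"},
]

UTC = timezone.utc
# Assumption: the business reports in a fixed UTC-5 offset (no DST handling).
REPORTING_TZ = timezone(timedelta(hours=-5))


def daily_revenue(events, tz):
    """Bucket revenue by calendar day in the given time zone."""
    totals = defaultdict(int)
    for event in events:
        instant = datetime.fromisoformat(event["created_at"])
        totals[instant.astimezone(tz).date().isoformat()] += event["amount"]
    return dict(totals)


# The old model bucketed by UTC day; the fixed model buckets by reporting day.
before = daily_revenue(RAW_EVENTS, UTC)
after = daily_revenue(RAW_EVENTS, REPORTING_TZ)

# Diff the two builds so the metric change can be explained day by day.
changed = {
    day: (before.get(day, 0), after.get(day, 0))
    for day in sorted(set(before) | set(after))
    if before.get(day, 0) != after.get(day, 0)
}
print(changed)
```

Because the original timestamps survived, both versions can be rebuilt from the same input and the moved revenue can be traced to specific orders. If only the old daily totals had been stored, that comparison would not be possible.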

With ETL, the question becomes much worse:

"Do we still have the original data?"

If the answer is no, welcome to the dig site. Hope the source system still has history. Hope the API still works the same way. Hope pagination didn't change. Hope the vendor didn't mutate old records. Hope nobody is asking for an audit-grade answer.

Hope is not a data strategy.

ELT Makes Iteration The Default

Data work is iteration. Anyone who says otherwise either hasn't done enough of it or is selling you a platform.

The first model is usually wrong. Not wildly wrong necessarily, but incomplete. Then you learn something. A status field has more states than the docs said. A timestamp means "created" for one integration and "received" for another. Deleted records are soft-deleted except when they are not. A duplicate is not always a duplicate. A null is sometimes the most important value in the row.

This is normal.

ELT accepts this reality. It says: load first, learn later.

That sounds almost too simple, but it changes the whole operating model:

  1. Extract from the source.
  2. Load the raw data with minimal mutation.
  3. Transform into clean models.
  4. Find the edge case.
  5. Fix the transformation.
  6. Re-transform from the original raw data.

That loop is the point. The faster and safer you can run it, the more trustworthy your data becomes.
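
To make the loop concrete, here is a minimal sketch using an in-memory SQLite database as a stand-in for the warehouse. The table names, the `paused` status, and the two transform functions are assumptions invented for this example, not a prescribed layout.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_subscriptions (payload TEXT NOT NULL)")

# Steps 1-2: extract and load the payloads untouched (hypothetical source rows).
source_rows = [
    '{"id": 1, "status": "active"}',
    '{"id": 2, "status": "canceled"}',
    '{"id": 3, "status": "paused"}',  # a state the docs never mentioned
]
conn.executemany(
    "INSERT INTO raw_subscriptions VALUES (?)", [(row,) for row in source_rows]
)


def transform_v1(record):
    # First attempt: anything unexpected quietly becomes "unknown".
    known = {"active", "canceled"}
    status = record["status"] if record["status"] in known else "unknown"
    return record["id"], status


def transform_v2(record):
    # Step 5: after finding the edge case, treat "paused" as a real state.
    known = {"active", "canceled", "paused"}
    status = record["status"] if record["status"] in known else "unknown"
    return record["id"], status


def rebuild(transform):
    """Steps 3 and 6: rebuild the clean table from the raw table."""
    conn.execute("DROP TABLE IF EXISTS subscriptions")
    conn.execute("CREATE TABLE subscriptions (id INTEGER, status TEXT)")
    raw = conn.execute("SELECT payload FROM raw_subscriptions").fetchall()
    conn.executemany(
        "INSERT INTO subscriptions VALUES (?, ?)",
        [transform(json.loads(payload)) for (payload,) in raw],
    )
    return conn.execute(
        "SELECT id, status FROM subscriptions ORDER BY id"
    ).fetchall()


first_build = rebuild(transform_v1)   # subscription 3 comes out as "unknown"
second_build = rebuild(transform_v2)  # same raw rows, now "paused"
print(first_build)
print(second_build)
```

The raw table is never modified. Only the derived `subscriptions` table is dropped and rebuilt, which is what makes step 6 cheap.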

ETL makes that loop fragile because the raw material may already be gone.

Storage Is Cheap. Being Wrong Is Expensive.

The historical defense of ETL made more sense when storage was painful and warehouses were weaker. If keeping raw data was prohibitively expensive, transforming early was a practical compromise.

But in 2026, defaulting to early lossy transformation because of old storage assumptions is wild.

Object storage is cheap. Warehouses are powerful. Compute can be scaled. Retention can be tiered. Partitions exist. Compression exists. Lifecycle policies exist. We have tools.

What remains expensive is being unable to explain your own numbers.

What remains expensive is rebuilding a backfill from a source system that barely supports it.

What remains expensive is telling leadership, "The metric is different now, but we can't fully prove why because the pipeline overwrote the useful context."

That is how data teams lose trust.

This Does Not Mean Dump Garbage Everywhere

To be clear, ELT is not a license to create a swamp and call it a lake.

You still need discipline:

  • Partition raw data sanely.
  • Capture extraction metadata.
  • Track source, cursor, batch, file, and ingestion time.
  • Protect PII and secrets.
  • Set retention rules intentionally.
  • Make transformations versioned and reviewable.
  • Build clean tables that normal humans can query.
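
A couple of those bullets (partitioning and extraction metadata) can be sketched as a small envelope stored with every raw payload. The `RawEnvelope` class, its field names, and the path layout are hypothetical. It does not cover PII handling or retention, which depend on your storage and legal setup.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class RawEnvelope:
    """Hypothetical wrapper stored alongside every raw payload."""

    source: str
    batch_id: str
    cursor: Optional[str]
    file_name: Optional[str]
    ingested_at: datetime
    payload: str  # the untouched source payload, as text

    def partition_path(self) -> str:
        # Partition by source and ingestion date so backfills and retention
        # rules can target a narrow slice of the raw layer.
        day = self.ingested_at.strftime("%Y-%m-%d")
        return (
            f"raw/source={self.source}/ingest_date={day}/"
            f"batch={self.batch_id}.jsonl"
        )

    def to_json_line(self) -> str:
        # Serialize the metadata and the payload together as one JSON line.
        record = asdict(self)
        record["ingested_at"] = self.ingested_at.isoformat()
        return json.dumps(record)


envelope = RawEnvelope(
    source="billing_api",
    batch_id="b-0001",
    cursor="page-3",
    file_name=None,
    ingested_at=datetime(2026, 1, 15, 8, 30, tzinfo=timezone.utc),
    payload='{"invoice": 42}',
)
print(envelope.partition_path())
print(envelope.to_json_line())
```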

The raw layer should not be the only layer. Nobody is saying analysts should have to spelunk through untyped vendor JSON every morning like Indiana Jones with a Looker license.

The point is that the raw layer must exist.

Clean, modeled, business-friendly tables are great. They are the product. But they should be downstream of preserved inputs, not the only artifact that survived ingestion.

The Right Pipeline Shape

A sane ELT pipeline is boring in the best way.

Extract

Pull from the source. Do not get cute. Capture enough metadata to know where the record came from and when: source name, extraction time, request parameters, cursor, file name, batch ID, response status, whatever applies.

Load

Write the raw record somewhere durable immediately. If you need to normalize a little to store it, keep that step mechanical and lossless. The goal is preservation, not interpretation.
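
"Mechanical and lossless" is a stricter bar than it sounds. As a small illustration, even parsing JSON and writing it back out can change the data. The payload and the `store_raw` helper below are invented for the example. The idea is to keep the exact bytes plus a checksum.

```python
import hashlib
import json

# Hypothetical response body, byte for byte as the source returned it.
raw_body = b'{"amount": 1.10, "id": 7, "id": 8}'

# Parsing and re-serializing looks harmless but is lossy here: the duplicate
# "id" key collapses to the last value and 1.10 is rewritten as 1.1.
round_tripped = json.dumps(json.loads(raw_body)).encode()


def store_raw(body: bytes) -> dict:
    """Keep the exact bytes, plus a checksum to verify them later."""
    return {"sha256": hashlib.sha256(body).hexdigest(), "body": body}


stored = store_raw(raw_body)
print(round_tripped == raw_body)  # the re-serialized copy no longer matches
print(hashlib.sha256(stored["body"]).hexdigest() == stored["sha256"])
```

Whether a duplicate key is a source bug or a meaningful signal is exactly the kind of question you can only answer later, and only if the original bytes are still around.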

Transform

Now apply meaning. Deduplicate. Type fields. Join dimensions. Apply business rules. Build facts, dimensions, aggregates, marts, semantic layers, whatever your stack calls them.

This is where the business logic belongs. In the open. Versioned. Tested. Re-runnable.

Not hidden inside ingestion code that nobody wants to touch because it has been "working" for three years.

When ETL Is Actually Fine

ETL is not literally always wrong. There are real cases where transforming before load makes sense:

  • You are legally not allowed to retain the raw payload.
  • The source contains secrets that must be dropped immediately.
  • The destination cannot support the source shape.
  • The transformation is truly lossless and mechanical.
  • The volume is so large that raw retention is impossible under actual constraints.

Cool. Make that tradeoff.

But make it consciously. Write down what you dropped and why. Be honest about what future debugging or analysis you just made impossible.
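
One way to "write down what you dropped" is to emit a small manifest next to the loaded record. This is only a sketch: the `DROP_RULES` mapping, the field names, and the `load_with_manifest` helper are hypothetical.

```python
# Assumption: these field names and reasons are illustrative only.
DROP_RULES = {
    "api_key": "secret; must not be persisted",
    "ssn": "legal: not allowed to retain",
}


def load_with_manifest(record):
    """Drop restricted fields before load and note what was removed and why."""
    kept = {key: value for key, value in record.items() if key not in DROP_RULES}
    # The manifest records field names and reasons, never the dropped values.
    manifest = [
        {"field": key, "reason": DROP_RULES[key]}
        for key in record
        if key in DROP_RULES
    ]
    return kept, manifest


kept, manifest = load_with_manifest(
    {"user_id": 9, "api_key": "sk-not-real", "plan": "pro"}
)
print(kept)
print(manifest)
```

Storing the manifest with the batch means future-you at least knows which questions the pipeline can no longer answer.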

What annoys me is not a carefully chosen ETL step under a real constraint. What annoys me is teams saying "ETL" because that is the default term, then casually deleting their best evidence before they know what questions the business will ask.

Plz don't.

Moral

ETL won the naming battle, but ELT is the better mental model.

Load first. Transform second. Keep the raw data so you can re-transform when the model changes, when the business changes its mind, when a bug shows up, or when someone asks a question you did not predict.

The goal is not to make the first version of the data look perfect. The goal is to preserve the truth long enough that every future version can get better.

ETL optimizes for looking clean early.

ELT optimizes for being correct eventually.

I know which one I want when the numbers matter.