A real-time data/ML platform builder builds a tool to help teams find out what's wrong with an attribute. The tool is serverless, low-maintenance, and queries terabytes in seconds.

Inside a Low-Cost, Serverless Data Lineage System Built on AWS

Problem Statement

I build and operate real-time data/ML platforms, and one recurring pain I see inside any data org is this: “Why does this attribute have this value?” When a company name, industry, or revenue looks wrong, investigations often stall without engineering help. I wanted a way for analysts, support, and product folks to self-serve the “why,” with history and evidence, without waking up an engineer.

This is the blueprint I shipped: A serverless, low‑maintenance traceability service that queries terabytes in seconds and costs peanuts.

What this tool needed to do (for non‑engineers)

  • Explain a value: Why does attribute X for company Y equal Z right now?
  • Show the history: When did it change? What were the past versions?
  • Show evidence: Which sources and rules produced that value?
  • Be self‑serve: An API + simple UI so teams can investigate without engineers
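
To make “explain a value” concrete, here is a sketch of the kind of JSON the API could hand back for a single attribute. The field names and values are illustrative placeholders, not the actual contract of the system described here.

# Hypothetical response shape for GET /lineage?entity_id=42&attribute_name=revenue
# (field names are placeholders; the real contract isn't shown in this post)
example_response = {
    "entity_id": 42,
    "attribute_name": "revenue",
    "current_value": "12500000",
    "history": [  # past versions, newest first
        {"value": "12500000", "valid_from": "2024-03-02",
         "source": "vendor_feed_a", "rule": "prefer_most_recent"},
        {"value": "11000000", "valid_from": "2023-11-15",
         "source": "web_crawler", "rule": "fallback_source"},
    ],
}

A response like this answers all three questions at once: the current value, its history, and the evidence behind it.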

The architecture (serverless on purpose)

  • Amazon API Gateway: a secure front door for the API

  • AWS Lambda: stateless request handlers (no idle compute to pay for)

  • Amazon S3 + Apache Hudi: cheap storage with time travel and upserts

  • AWS Glue Data Catalog: schema and partition metadata

  • Amazon Athena: SQL over S3/Hudi, pay per data scanned, zero infra

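To show how these pieces connect, here is a minimal sketch of a Lambda handler that turns an API Gateway request into an Athena query and returns JSON. It assumes a Python runtime and uses placeholder database, table, column, and result-bucket names; the real names and validation logic are not part of this post.

import json
import time

import boto3

athena = boto3.client("athena")

# Placeholder names for illustration only.
DATABASE = "lineage_db"
TABLE = "attribute_lineage"
RESULT_LOCATION = "s3://example-athena-results/lineage/"
NUM_BUCKETS = 10  # matches the 0-9 entity_id_mod projection range shown later


def handler(event, context):
    params = event.get("queryStringParameters") or {}
    entity_id = int(params["entity_id"])
    attribute_name = params["attribute_name"]  # validate against an allowlist in real code
    since = params.get("since", "2022-01-01")

    # Filter on all three partition keys so Athena can prune aggressively.
    query = f"""
        SELECT value, source, rule, updated_at
        FROM {TABLE}
        WHERE created_date >= DATE '{since}'
          AND attribute_name = '{attribute_name}'
          AND entity_id_mod = {entity_id % NUM_BUCKETS}
          AND entity_id = {entity_id}
        ORDER BY updated_at DESC
    """

    qid = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": DATABASE},
        ResultConfiguration={"OutputLocation": RESULT_LOCATION},
    )["QueryExecutionId"]

    # Poll until the query finishes; fine when queries run in 1-2 seconds.
    while True:
        state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(0.25)

    if state != "SUCCEEDED":
        return {"statusCode": 502, "body": json.dumps({"error": state})}

    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    header = [c["VarCharValue"] for c in rows[0]["Data"]]
    records = [
        dict(zip(header, (c.get("VarCharValue") for c in row["Data"])))
        for row in rows[1:]
    ]
    return {"statusCode": 200, "body": json.dumps(records)}

The handler is stateless, so there is nothing to keep warm or patch; API Gateway handles the front door and Athena does the heavy lifting behind it.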

Why these choices?

  • Cost: storage on S3 is cheap; Athena charges only for bytes scanned; Lambda is pay‑per‑invocation
  • Scale: S3/Hudi trivially supports TB→PB, and Athena scales horizontally
  • Maintenance: no fleet to patch; infra footprint stays tiny as usage grows

:::info Data layout: performance is a data problem (not a compute problem)
Athena is fast when it reads almost nothing, and slow when it plans or scans too much. The entire project hinged on getting partitions and projection right.

:::

Partitioning strategy (based on query patterns)

  • created_date (date): most queries are time‑bounded
  • attribute_name (enum): employees, revenue, linkedin_url, founded_year, industry, etc.
  • entity_id_mod (integer bucket): mod(entity_id, N) to spread hot keys evenly

This limits data scanned and, more importantly, narrows what partition metadata Athena needs to consider.
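
For illustration, a record for one attribute of one company would land under a Hive-style prefix built from those three keys. The bucket name and the 10-way modulus are placeholders (the modulus matches the 0–9 projection range shown later):

NUM_BUCKETS = 10  # illustrative; matches the 0-9 entity_id_mod range used below

def partition_prefix(entity_id: int, attribute_name: str, created_date: str) -> str:
    """Build the Hive-style S3 prefix a row would be written under."""
    entity_id_mod = entity_id % NUM_BUCKETS  # spreads high-cardinality keys evenly
    return (
        "s3://example-lineage-bucket/attribute_lineage/"
        f"created_date={created_date}/"
        f"attribute_name={attribute_name}/"
        f"entity_id_mod={entity_id_mod}/"
    )

print(partition_prefix(1234567, "revenue", "2024-03-02"))
# -> .../created_date=2024-03-02/attribute_name=revenue/entity_id_mod=7/

A query that filters on all three keys touches only one (or a few) of these prefixes, which is exactly what keeps scans small.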

The three things that made it fast:

  1. Partitioning
     • Put only frequently filtered columns in the partition spec.
     • Use integer bucketing (mod) for high‑cardinality keys like entity_id.

  2. Partition indexing (first win, partial)
     • We enabled partition indexing so Athena could prune partition metadata faster during planning.
     • This helped until the partition count grew large; planning was still the dominant cost.

  3. Partition projection (the actual game‑changer)
     • Instead of asking Glue to store millions of partitions, we taught Athena how partitions are shaped.
     • Result: planning time close to zero; queries went from “slow-ish with growth” to consistently 1–2 seconds for typical workloads.

Athena TBLPROPERTIES (minimal example)

TBLPROPERTIES (
  'projection.enabled'='true',
  'projection.attribute_name.type'='enum',
  'projection.attribute_name.values'='employees,revenue,linkedin_url,founded_year,industry',
  'projection.entity_id_mod.type'='integer',
  'projection.entity_id_mod.interval'='1',
  'projection.entity_id_mod.range'='0,9',
  'projection.created_date.type'='date',
  'projection.created_date.format'='yyyy-MM-dd',
  'projection.created_date.interval'='1',
  'projection.created_date.interval.unit'='days',
  'projection.created_date.range'='2022-01-01,NOW'
)


Why this works

  • Athena no longer fetches a huge partition list from Glue; it calculates partitions on the fly from the rules above
  • Scanning drops to “only the files that match the constraints”
  • Planning time becomes negligible, even as data and partitions grow


What surprised me (and what was hard)

  • The “gotcha” was query planning, not compute. We often optimize engines, but here the slowest part was enumerating partitions. Partition projection solved the right problem.
  • Picking partition keys is half art, half science. Over-partition and you drown in metadata; under-partition and you pay for it in bytes scanned. Start from your top 3 query predicates and work backwards.
  • Enum partitions are underrated. For low‑cardinality domains (attribute_name), enum projection is both simple and fast.
  • Bucketing (via mod) is pragmatic. True bucketing support is limited in Athena, but a mod-based integer partition gets you most of the benefits.


Cost & latency (real numbers)

  • Typical queries: 1–2 seconds end‑to‑end (Lambda cold starts excluded)
  • Data size: multiple TB in S3/Hudi
  • Cost: pennies per 100s of requests (Athena scan + Lambda invocations)
  • Ops: near‑zero—no servers, no manual compaction beyond Hudi maintenance cadence
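
As a rough sanity check on the “pennies” claim, at Athena's commonly cited $5-per-TB-scanned price the per-query cost is dominated by how little you scan. The 50 MB figure below is an assumed post-pruning scan size for illustration, not a measured number from the system:

ATHENA_PRICE_PER_TB_USD = 5.00      # commonly cited price per TB scanned
scan_per_query_tb = 50 / 1_000_000  # assume ~50 MB scanned after pruning

cost_per_query = scan_per_query_tb * ATHENA_PRICE_PER_TB_USD
print(f"~${cost_per_query:.5f} per query, ~${cost_per_query * 100:.3f} per 100 queries")
# roughly $0.00025 per query, about $0.025 per 100 queries, before Lambda/API Gateway

Lambda and API Gateway add fractions of a cent on top, which is why the bill stays in the pennies range even at hundreds of requests.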


Common pitfalls (so you can skip them)

  • Don’t partition by high‑cardinality fields directly (e.g., raw entity_id); you’ll explode the partition count
  • Don’t skip projection if you expect partitions to grow; indexing alone won’t save you
  • Don’t save partition metadata for every key if a rule can generate it (projection exists exactly for that reason)
  • Don’t leave Glue schemas to drift; version them and validate in CI
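
On the last point, a small CI check can catch drift by comparing the live Glue table against a schema file versioned in the repo. The file path and database/table names below are placeholders:

# CI guard against Glue schema drift; paths and names are placeholders.
import json
import sys

import boto3

EXPECTED_SCHEMA_FILE = "schemas/attribute_lineage.json"  # versioned alongside the code
DATABASE, TABLE = "lineage_db", "attribute_lineage"

glue = boto3.client("glue")
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
live = {c["Name"]: c["Type"] for c in table["StorageDescriptor"]["Columns"]}
# Partition keys live under table["PartitionKeys"] and can be checked the same way.

with open(EXPECTED_SCHEMA_FILE) as f:
    expected = json.load(f)  # e.g. {"entity_id": "bigint", "value": "string", ...}

if live != expected:
    print("Glue schema drift detected:", sorted(set(live.items()) ^ set(expected.items())))
    sys.exit(1)
print("Glue schema matches the versioned definition.")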


Try this at home (a minimal checklist)

  • Model your top 3 queries; pick partitions that match those predicates

  • Use enum projection for low‑cardinality fields; date projection for time; integer ranges for buckets

  • Store data in columnar formats (Parquet/ORC) via Hudi to keep scans small and enable time travel

  • Add a thin API (API Gateway + Lambda) to turn traceability SQL into JSON for your UI

  • Measure planning vs. scan time; optimize the former first
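
For the last item, Athena already reports how a query's time was spent, so you don't have to guess whether planning or scanning is the bottleneck. A minimal sketch, assuming boto3 and a query ID from an earlier start_query_execution call:

import boto3

athena = boto3.client("athena")

def planning_vs_scan(query_execution_id: str) -> None:
    """Print the planning/engine split and bytes scanned for a finished query."""
    stats = athena.get_query_execution(QueryExecutionId=query_execution_id)[
        "QueryExecution"
    ]["Statistics"]
    print("planning ms :", stats.get("QueryPlanningTimeInMillis"))
    print("engine ms   :", stats.get("EngineExecutionTimeInMillis"))
    print("total ms    :", stats.get("TotalExecutionTimeInMillis"))
    print("scanned MB  :", stats.get("DataScannedInBytes", 0) / 1_000_000)
    # If planning dominates, fix partition metadata (projection);
    # if scanning dominates, fix layout (partitions, Parquet, file sizes).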


What this unlocked for my users

  • Analysts and support can answer “why” without engineers
  • Product can audit attribute changes by time and cause
  • Engineering spends more time on fixes and less time on forensics
  • The org trusts the data more because the evidence is one click away


Closing thought

Great performance is usually a data layout story. Before you scale compute, fix how you store and find bytes. In serverless analytics, the fastest query is the one that plans instantly and reads almost nothing, and partition projection is the lever that gets you there.
