<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Alejandro Parra's Blog]]></title><description><![CDATA[Alejandro Parra's Blog]]></description><link>https://parraletz.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1767443809829/091229ff-2ce2-4cac-bab4-94ffc4c255c0.png</url><title>Alejandro Parra&apos;s Blog</title><link>https://parraletz.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 24 Apr 2026 22:09:36 GMT</lastBuildDate><atom:link href="https://parraletz.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Kubernetes Isn't Outdated. It's Evolving. And So Should You.]]></title><description><![CDATA[There's a narrative I keep hearing in tech communities across LATAM and beyond: "Kubernetes is too complex now. It had its moment."
I think that narrative confuses two very different things: the accid]]></description><link>https://parraletz.dev/kubernetes-isn-t-outdated-it-s-evolving-and-so-should-you</link><guid isPermaLink="true">https://parraletz.dev/kubernetes-isn-t-outdated-it-s-evolving-and-so-should-you</guid><category><![CDATA[CNCF]]></category><category><![CDATA[cloudnative]]></category><category><![CDATA[kcdguadalara]]></category><category><![CDATA[Kubecon]]></category><category><![CDATA[AI]]></category><category><![CDATA[mcp]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Thu, 26 Mar 2026 03:08:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/63a3eaa3f7fa1fc0ce9ebb2a/c854b823-664b-4e20-8fe5-f0a4c901a03c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There's a narrative I keep hearing in tech communities across LATAM and beyond: <strong>"Kubernetes is too complex now. It had its moment."</strong></p>
<p>I think that narrative confuses two very different things: the accidental complexity <em>we</em> engineers have built on top of it, and the obsolescence of the platform itself. Kubernetes isn't a trend, so it can't fall out of one. It's infrastructure. And infrastructure doesn't go anywhere.</p>
<p>What <em>is</em> changing — fast — is the <em>role we need to play as engineers</em> around it.</p>
<blockquote>
<p>"We won the scalability war. But we are losing the simplicity battle."</p>
</blockquote>
<hr />
<h2>The Problem Nobody Wants to Admit</h2>
<p>The original DevOps promise was simple: <strong>"You build it, you run it."</strong> It was a philosophy of ownership. But as the CNCF ecosystem grew from 10 to over 200 projects, that philosophy became a cognitive load trap.</p>
<p>Today, in many teams, a backend developer is expected to be an expert in:</p>
<ol>
<li><strong>Business logic</strong> — their actual job</li>
<li><strong>Containers and Kubernetes</strong> — manifests, Helm charts, CRDs</li>
<li><strong>Network security</strong> — NetworkPolicies, mTLS, Cilium</li>
<li><strong>Observability</strong> — OpenTelemetry, Prometheus, Grafana Loki</li>
<li><strong>Compliance &amp; Governance</strong> — OPA, Kyverno, RBAC policies</li>
<li><strong>Continuous Delivery</strong> — ArgoCD, Flux, GitOps workflows</li>
</ol>
<p>That is an absurd stack of responsibilities for a single person. The result: developers become bottlenecks, and delivery speed collapses.</p>
<p>The problem isn't Kubernetes. The problem is that <strong>nobody built the floor between Kubernetes and the developer.</strong></p>
<hr />
<h2>Platform Engineering and the Golden Path</h2>
<p>Platform Engineering didn't emerge to "run servers" on behalf of developers. It emerged to build an <strong>Internal Developer Platform (IDP)</strong> — a product that absorbs infrastructure complexity and delivers a self-service interface.</p>
<p>The goal is the <em>Golden Path</em>: a set of automated, pre-approved, secure, and observable-by-default workflows. If you follow the Golden Path, everything just works. If you deviate, you need to justify it.</p>
<blockquote>
<p>💡 <strong>What does a 2026 self-service workflow actually look like?</strong></p>
<p>A developer opens their internal portal and requests a new microservice. Within 5 minutes they have: a dev environment deployed on K8s, an isolated database provisioned via Crossplane, traces and logs enabled with OpenTelemetry, and GitOps sync active against their repo. Zero infra tickets. Zero manual YAML.</p>
</blockquote>
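<p>To make that concrete: the "isolated database provisioned via Crossplane" step usually means the portal template commits a claim to Git. This is a sketch with an illustrative API group and claim kind; the real ones depend on the XRDs your platform defines:</p>
<pre><code class="lang-yaml"># Hypothetical Crossplane claim committed by the portal template.
# API group, kind, and parameters are illustrative.
apiVersion: platform.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    storageGB: 20
  compositionSelector:
    matchLabels:
      provider: aws
  writeConnectionSecretToRef:
    name: orders-db-conn
</code></pre>
<p>From there, GitOps syncs the claim and Crossplane reconciles it into a real database. No ticket, no console clicking.</p>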
<hr />
<h2>Your CNCF Toolbox for 2026</h2>
<p>If you're a Platform Engineer, the CNCF is your toolbox. Here's how you connect the dots to build your software factory:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Key Technologies</th>
<th>Why It Matters</th>
</tr>
</thead>
<tbody><tr>
<td>The Portal (Frontend)</td>
<td>Backstage</td>
<td>The "storefront" of your platform. A central hub for software catalogs and self-service templates.</td>
</tr>
<tr>
<td>Infra as Code / API</td>
<td>Crossplane</td>
<td>Manage cloud resources (RDS, S3, GCS) using the Kubernetes API. Turns your infra into K8s objects.</td>
</tr>
<tr>
<td>Networking &amp; Security</td>
<td>Cilium / Istio</td>
<td>Fast, visible, and secure networking (mTLS) from day one.</td>
</tr>
<tr>
<td>Unified Observability</td>
<td>OpenTelemetry</td>
<td>A common language for metrics, logs, and traces. Essential for explainability.</td>
</tr>
<tr>
<td>Automated Policy</td>
<td>Kyverno / OPA</td>
<td>Governance as code. Blocks insecure deployments before they ever hit production.</td>
</tr>
<tr>
<td>GitOps Delivery</td>
<td>ArgoCD / Flux</td>
<td>The heartbeat of the platform. Git as the single source of truth.</td>
</tr>
</tbody></table>
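<p>To ground the "governance as code" row: a minimal Kyverno policy that rejects any Pod using an image pinned to <code>:latest</code> looks roughly like this. It's a sketch of Kyverno's well-known sample policy, not a production-ready ruleset:</p>
<pre><code class="lang-yaml"># Governance as code: block mutable :latest tags at admission time.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
</code></pre>
<p>Apply it once and the rule holds cluster-wide: the review happens at admission time, not in a human's head.</p>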
<hr />
<h2>The Next Challenge: The Agentic Platform</h2>
<p>This is where the conversation gets interesting — and where most Platform Engineers aren't looking yet.</p>
<p>This week, during KubeCon EU Amsterdam, the CNCF published a standards paper for agentic AI in cloud native environments. The document — produced by the AI TCG — doesn't talk about models or data science. It talks about <strong>how Kubernetes must orchestrate autonomous agents</strong> securely, observably, and with governance from day one.</p>
<blockquote>
<p>📄 <strong>Reference · CNCF AI TCG — Cloud Native Agentic Standards (March 2026)</strong></p>
<p>The paper defines four pillars for agentic systems on K8s: <strong>General (containers)</strong>, <strong>Control &amp; Communication</strong> (MCP, A2A, SPIFFE), <strong>Observability</strong> (OTel + token metrics + agent traces), and <strong>Governance + Security</strong> (agent identity, JIT access, tenancy isolation).</p>
<p>Read the full document at <a href="https://www.cncf.io/blog/2026/03/23/cloud-native-agentic-standards/">cncf.io</a>.</p>
</blockquote>
<p>The protocols emerging as standards — <strong>MCP</strong> (from Anthropic, now donated to the Agentic AI Foundation) and <strong>A2A</strong> (from Google for peer-to-peer agent communication) — are already vocabulary that a 2026 Platform Engineer needs to understand.</p>
<p>Why does this matter to Platform Engineering? Because AI Agents are not a separate layer living "in some vendor's cloud." They are workloads. They run in containers. They need ephemeral identities (SPIFFE/SVID), namespace isolation, NetworkPolicies, and observability of LLM inference tokens.</p>
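<p>Concretely, the baseline is the same as for any other untrusted workload. A default-deny NetworkPolicy for a namespace running agents is standard Kubernetes (the namespace name here is illustrative):</p>
<pre><code class="lang-yaml"># Default-deny for an agents namespace: no ingress, no egress,
# until the platform grants specific exceptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-agents
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
</code></pre>
<p>Everything an agent is allowed to reach, such as an MCP server or a model endpoint, then has to be granted explicitly. That is exactly the posture you want for an autonomous workload.</p>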
<blockquote>
<p>⚠️ <strong>What nobody is telling you</strong></p>
<p>If your IDP today has no identity model for agents, if your observability pipelines don't track LLM tokens, if your Kyverno policies don't account for MCP tool access — you already have technical debt in the 2026 stack. This isn't science fiction. KubeRay and Volcano are already optimizing how K8s handles these workloads.</p>
</blockquote>
<hr />
<h2>Kubernetes Isn't Going Anywhere. You Need to Grow.</h2>
<p>The message here isn't that DevOps is dead. It's that it <strong>matured</strong>. Platform Engineering is the next logical chapter of the same story: less manual heroism, more self-governing systems.</p>
<p>Kubernetes is the operating system of modern infrastructure. Nobody says Linux "went out of style." The same should be true of Kubernetes. What changes is the level of abstraction from which we operate. And for those of us building platforms, that is exactly the challenge that makes this work fascinating.</p>
<p>Next year, the most mature platforms won't just be the ones with the best DX for human developers. They'll be the ones capable of orchestrating autonomous developers — agents — with the same rigor, security, and observability we apply today to any microservice.</p>
<p>Are you building that platform, or are you still debugging YAML?</p>
<hr />
<p><em>Want to continue this conversation in person? On April 18th we're hosting <strong>KCD Guadalajara 2026</strong> at the Holiday Inn Guadalajara Expo. Platform Engineering, agentic AI, cloud native security, and the entire CNCF community of Mexico — all in one place.</em></p>
<p><em>👉 <a href="https://community.cncf.io/events/details/cncf-kcd-guadalajara-presents-kcd-guadalajara-2026/">community.cncf.io/kcd-guadalajara-2026</a></em></p>
<hr />
<p><em>#kubernetes #platformengineering #cncf #cloudnative #devops #mcp #agenticai #idp #kcdguadalajara</em></p>
]]></content:encoded></item><item><title><![CDATA[The Platform Is the Architecture]]></title><description><![CDATA[The Platform Is the Architecture
There's a distinction that comes up a lot in platform engineering conversations, and it's worth making explicit.
There's a difference between a team that operates infrastructure and a team that designs a platform.
Bot...]]></description><link>https://parraletz.dev/the-platform-is-the-architecture</link><guid isPermaLink="true">https://parraletz.dev/the-platform-is-the-architecture</guid><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Tue, 24 Mar 2026 16:10:40 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-the-platform-is-the-architecture">The Platform Is the Architecture</h1>
<p>There's a distinction that comes up a lot in platform engineering conversations, and it's worth making explicit.</p>
<p>There's a difference between a team that <em>operates</em> infrastructure and a team that <em>designs</em> a platform.</p>
<p>Both teams run Kubernetes. Both teams write Terraform. Both teams deal with incidents, manage clusters, and keep things running. From the outside, they look similar. From the inside, they make decisions in fundamentally different ways.</p>
<p>The operating team asks: <em>is this working?</em>
The platform team asks: <em>why is this designed this way, and what does that design decision cost us over time?</em></p>
<p>This series has been about developing the second kind of thinking — and about the vocabulary that makes it possible.</p>
<hr />
<h2 id="heading-what-we-covered">What we covered</h2>
<p>Six posts. Six patterns. Each one a different lens on a problem that platform engineers face every day.</p>
<p><strong>Factory and Builder</strong> — environment provisioning without a creation model is slow, inconsistent, and doesn't scale. When you centralize creation logic, you get consistency, self-service, and the ability to evolve the implementation without changing the interface.</p>
<p><strong>Facade</strong> — complexity leaks when there's no abstraction layer. The platform is a facade. The question isn't whether to build one — it's how opinionated to make it, and who it's designed for.</p>
<p><strong>Observer</strong> — manual reconciliation doesn't scale. The Kubernetes controller loop is the Observer pattern at cluster scale. Toil is a symptom of missing observers. Name the toil, find the missing observer, build or adopt it.</p>
<p><strong>Command</strong> — pipelines that apply changes directly are fast and invisible. GitOps decouples the command from its execution, and that decoupling gives you auditability, reversibility, and drift detection almost for free.</p>
<p><strong>Adapter and Glue</strong> — every platform connects systems that weren't designed to talk to each other. That connective tissue is glue. In most platforms, glue is 60-80% of what the team actually builds and maintains. The question isn't whether you'll write it — it's whether you'll design it intentionally.</p>
<p><strong>Strategy</strong> — deployment strategy hardcoded into pipelines is a hidden coupling that makes platforms less reliable and harder to evolve. When strategy is a first-class platform concern — declarative, observable, interchangeable — the whole delivery system gets more consistent.</p>
<hr />
<h2 id="heading-the-thread-that-connects-them">The thread that connects them</h2>
<p>Looking back at these six patterns, there's a common thread.</p>
<p>Every one of them is about the same thing: <strong>separating concerns that are accidentally coupled, and making that separation explicit.</strong></p>
<p>Factory separates <em>what</em> gets created from <em>how</em> it's created. Facade separates the consumer's interface from the system's complexity. Observer separates the thing that changes from the things that react to it. Command separates the definition of a change from its execution. Adapter separates incompatible interfaces without modifying either. Strategy separates the delivery mechanism from the deployment approach.</p>
<p>This isn't a coincidence. The Gang of Four didn't invent these patterns — they documented solutions that experienced engineers kept independently arriving at, because these couplings cause the same problems everywhere. In codebases. In distributed systems. In platforms.</p>
<p>The platform is a system. It has the same structural problems that software has. And it responds to the same structural solutions.</p>
<hr />
<h2 id="heading-from-glue-spaghetti-to-platform-control-plane">From glue spaghetti to platform control plane</h2>
<p>Earlier in this series, we talked about glue — the code that connects systems that weren't designed to talk to each other. And we mentioned an evolution that a lot of platform teams go through:</p>
<pre><code>scripts/
  deploy.sh
  create-service.sh
  add-monitor.sh
  update-ingress.sh
</code></pre><p>This is where most platforms start. It's not wrong — it's the natural first step when you're moving fast and the priority is getting things working.</p>
<p>But at some point, the scripts become a liability. They accumulate. They drift. They're owned by whoever wrote them last. They break in ways that are hard to debug because they have no reconciliation semantics, no observability, no clear interfaces.</p>
<p>The teams that scale past this point aren't necessarily the ones with the best tools. They're the ones that start treating the platform as a product with an architecture — not just a collection of scripts that happen to run in the right order.</p>
<p>That shift looks like:</p>
<ul>
<li>Glue scripts become operators with proper reconciliation semantics</li>
<li>Ad-hoc pipelines become platform APIs with clear contracts</li>
<li>Scattered strategy implementations become a first-class delivery layer</li>
<li>Manual processes become observers that react automatically</li>
<li>Implicit conventions become explicit facades that abstract complexity</li>
</ul>
<p>This is what a platform control plane looks like. Not a specific tool. A specific way of thinking about what the platform is and how it should behave.</p>
<p>And it maps almost exactly to the patterns in this series.</p>
<hr />
<h2 id="heading-the-vocabulary-is-the-point">The vocabulary is the point</h2>
<p>I want to be honest about what this series was and wasn't trying to do.</p>
<p>It wasn't trying to teach you Kubernetes. You already know Kubernetes. It wasn't trying to convince you to use Argo Rollouts or Crossplane or External Secrets Operator. Those are good tools, but the patterns exist independently of the tools.</p>
<p>What it was trying to do is give you vocabulary.</p>
<p>Vocabulary matters for a few reasons.</p>
<p>It makes architectural decisions explicit. When your team debates whether to expose Kubernetes directly to developers, "how much facade do we build" is a more productive framing than "should we use Backstage or not." The pattern gives the conversation a shape.</p>
<p>It helps you diagnose problems. When your pipelines are slow and tightly coupled, that's not just a pipeline problem — it's a symptom of missing separation of concerns. When you name the pattern that's missing or being violated, the solution space becomes clearer.</p>
<p>It connects you to a larger body of knowledge. The Gang of Four documented these patterns in 1994. Thirty years of software engineers have written about them, extended them, and argued about their limits. That literature is available to platform engineers too — it just needs translation.</p>
<p>And it levels the conversation. When a platform engineer and a software architect talk about the same problem using the same vocabulary, they can build on each other's thinking instead of talking past each other. The platform stops being "the infra people's problem" and starts being a shared architectural concern.</p>
<hr />
<h2 id="heading-what-comes-next">What comes next</h2>
<p>This series ends here, but the topic doesn't.</p>
<p>We covered six patterns from the structural and behavioral categories — the ones most directly applicable to platform engineering. There's more to explore: architectural styles like event-driven systems and how they map to platform data planes, distributed systems patterns and how they show up in multi-cluster architectures, and the evolution from platform-as-infrastructure to platform-as-product.</p>
<p>If any of these posts sparked a question, a disagreement, or a "this is exactly what we're dealing with right now" — I'd genuinely like to hear about it.</p>
<p>Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>. That's where the real conversation happens.</p>
]]></content:encoded></item><item><title><![CDATA[Your Deployment Strategy Shouldn't Live in Your Pipeline]]></title><description><![CDATA[Your Deployment Strategy Shouldn't Live in Your Pipeline
Here's a situation most platform teams have been in.
A team wants to do a canary deployment. So they modify the pipeline. They add a step that deploys to 10% of traffic, waits, checks some metr...]]></description><link>https://parraletz.dev/your-deployment-strategy-shouldnt-live-in-your-pipeline</link><guid isPermaLink="true">https://parraletz.dev/your-deployment-strategy-shouldnt-live-in-your-pipeline</guid><category><![CDATA[argo-rollouts]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[progressive delivery]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Fri, 20 Mar 2026 15:10:17 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-your-deployment-strategy-shouldnt-live-in-your-pipeline">Your Deployment Strategy Shouldn't Live in Your Pipeline</h1>
<p>Here's a situation most platform teams have been in.</p>
<p>A team wants to do a canary deployment. So they modify the pipeline. They add a step that deploys to 10% of traffic, waits, checks some metric, then either proceeds or rolls back. It works. Six months later, a different team wants the same thing — but their pipeline is structured differently, so they write their own version. Then someone wants blue/green. Another pipeline modification. Then progressive delivery with automated analysis. Another version, in another pipeline, maintained by another team.</p>
<p>You now have five different implementations of deployment strategies scattered across your CI/CD system. No two of them behave quite the same way. All of them are someone's responsibility to maintain. And when something goes wrong mid-deployment, the person on call has to figure out which version of the strategy they're dealing with before they can even start debugging.</p>
<p>This is what happens when deployment strategy is hardcoded into the pipeline instead of being a first-class concern of the platform.</p>
<hr />
<h2 id="heading-the-problem">The problem</h2>
<p>A deployment pipeline has two distinct jobs that are often collapsed into one.</p>
<p>The first job is delivery: take the artifact, get it to the cluster. The second job is strategy: decide <em>how</em> it gets there — all at once, gradually, with traffic splitting, with automated rollback triggers.</p>
<p>When these two jobs live in the same pipeline, the strategy becomes tightly coupled to the delivery mechanism. Changing the strategy means changing the pipeline. Testing a new approach means modifying infrastructure that other things depend on. And because pipelines are usually owned by individual teams rather than the platform, you end up with strategy logic that's duplicated, inconsistent, and invisible to anyone outside that team.</p>
<p>The deeper problem is that strategy decisions are cross-cutting. They affect reliability, blast radius, rollback time, and developer confidence. They deserve to be made once, implemented well, and available to every team — not reinvented in every pipeline.</p>
<hr />
<h2 id="heading-why-it-hurts">Why it hurts</h2>
<p>Imagine a deployment goes wrong mid-canary. Traffic is split 20/80 between the new version and the old one. The new version has a bug. You need to roll back.</p>
<p>If the canary logic lives in the pipeline, rollback means re-running the pipeline in reverse — or manually adjusting traffic weights in whatever tool manages them, if you can even find where that is. The pipeline ran its steps and moved on. It's not watching. It's not reacting. It doesn't know the deployment is in a bad state.</p>
<p>Now multiply this by the number of services your platform supports. Each one with its own pipeline, its own strategy implementation, its own rollback procedure. Your platform's deployment reliability is only as good as the worst pipeline in the system.</p>
<p>And when a postmortem asks "why did we deploy 100% of traffic at once to a service that had no rollback plan" — the answer is usually "because the pipeline didn't have that logic and nobody thought to add it."</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Strategy pattern says: define a family of algorithms, encapsulate each one, and make them interchangeable. The caller selects which strategy to use. The strategy handles the implementation. Neither needs to know the details of the other.</p>
<p>In software, this pattern appears in sorting algorithms, payment processors, compression formats — anywhere you have multiple ways to accomplish the same goal and want to switch between them without changing the code that uses them.</p>
<p>In platform engineering, deployment strategies are exactly this. Rolling update, blue/green, canary, progressive delivery with automated analysis — these are a family of algorithms for getting a new version of software in front of users. They have different risk profiles, different rollback characteristics, different observability requirements. But from the application's perspective, the goal is the same: deploy this version.</p>
<p>The Strategy pattern says that choice should be declarative and interchangeable — not hardcoded into a pipeline.</p>
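<p>Stripped to its essentials, the pattern is small. The sketch below is illustrative; the interface and names are invented for this post, not taken from any real tool's API:</p>
<pre><code class="lang-typescript">// Each deployment strategy is an interchangeable object behind one interface.
interface DeploymentStrategy {
  name: string
  // The traffic weights (% routed to the new version) the rollout steps through.
  trafficSteps(): number[]
}

class RollingUpdate implements DeploymentStrategy {
  name = "rolling"
  trafficSteps(): number[] {
    return [100] // everything at once, pod by pod
  }
}

class Canary implements DeploymentStrategy {
  name = "canary"
  constructor(private steps: number[]) {}
  trafficSteps(): number[] {
    return this.steps
  }
}

// The delivery code selects a strategy; it never changes when strategies do.
function planRollout(strategy: DeploymentStrategy): string[] {
  return strategy.trafficSteps().map(w =&gt; `${strategy.name}: route ${w}% to new version`)
}

planRollout(new Canary([20, 50, 100]))
</code></pre>
<p>Swapping <code>Canary</code> for <code>RollingUpdate</code> changes nothing in <code>planRollout</code>. That is the whole pattern: the caller declares which strategy, the strategy owns the how.</p>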
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>Argo Rollouts</strong> is the clearest implementation of this pattern in the Kubernetes ecosystem. You define a <code>Rollout</code> resource instead of a standard <code>Deployment</code>, and you declare your strategy in the spec:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">strategy:</span>
  <span class="hljs-attr">canary:</span>
    <span class="hljs-attr">steps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">20</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">pause:</span> {<span class="hljs-attr">duration:</span> <span class="hljs-string">10m</span>}
    <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">50</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">pause:</span> {<span class="hljs-attr">duration:</span> <span class="hljs-string">10m</span>}
    <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">100</span>
</code></pre>
<p>The application doesn't know what strategy is being used. The pipeline doesn't implement the strategy — it just triggers the rollout. Argo Rollouts is the executor that watches the rollout progress, manages traffic weights, and reacts to what it sees.</p>
<p>Switching from canary to blue/green is a change to the <code>Rollout</code> spec. Not a pipeline rewrite. Not a new set of scripts. A declarative change to the strategy definition.</p>
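<p>For comparison, the blue/green version of the same spec is just a different strategy block (the service names here are illustrative):</p>
<pre><code class="lang-yaml"># Same Rollout, different strategy: only this block changes.
strategy:
  blueGreen:
    activeService: my-service-active
    previewService: my-service-preview
    autoPromotionEnabled: false
</code></pre>
<p>Delivery stays identical; only the declared strategy changes. That interchangeability is the pattern doing its job.</p>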
<p><strong>Flagger</strong> takes the same idea and adds automated analysis. It watches metrics from Prometheus, Datadog, or other sources during a canary deployment and makes promotion decisions automatically. The strategy isn't just declared — it's reactive. If error rates spike, the rollout stops. If latency degrades, it rolls back. The strategy responds to real system state rather than waiting for a human to notice.</p>
<p><strong>Feature flags</strong> are a strategy pattern applied at the application layer. Tools like LaunchDarkly, Unleash, or Flagsmith let you decouple the deployment of code from the activation of features. You deploy 100% of traffic — but only 10% of users see the new behavior. The strategy is controlled by the platform, not the pipeline.</p>
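<p>The mechanics behind a percentage rollout are simple enough to sketch. This is a vendor-neutral illustration of the idea; real flag systems add targeting rules, persistence, and a control plane on top:</p>
<pre><code class="lang-typescript">// Gate a feature by user: deploy 100% of the code, but show the new
// behavior to only `rolloutPercent` of users. Hashing the user id
// keeps the decision stable: a given user always gets the same answer.
function isFeatureEnabled(rolloutPercent: number, userId: string): boolean {
  let hash = 0
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) &gt;&gt;&gt; 0 // simple stable hash
  }
  return rolloutPercent &gt; hash % 100 // user falls in the enabled bucket
}

isFeatureEnabled(10, "user-123") // same user, same answer, every call
</code></pre>
<p>Raising the rollout from 10% to 50% is a config change in the flag system, not a redeploy. The strategy lives in the platform; the pipeline just ships code.</p>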
<p>The common thread: in all of these, the strategy is a separate, interchangeable concern. It's not woven into the delivery mechanism. It can be changed, tested, and evolved independently.</p>
<hr />
<h2 id="heading-hardcoding-strategy-is-an-antipattern">Hardcoding strategy is an antipattern</h2>
<p>When deployment strategy lives in the pipeline, you've created a hidden coupling that's easy to miss until it causes a problem.</p>
<p>The pipeline owns delivery <em>and</em> strategy. Changing one risks breaking the other. Testing a new strategy means modifying production infrastructure. Rolling back a failed strategy change means a pipeline change, not a config change. And visibility into what strategy a given service uses requires reading pipeline code — not a platform API or a manifest in Git.</p>
<p>This is the antipattern. Not because pipelines are bad, but because strategy is a platform concern that's been pushed into a delivery concern. The blast radius of a mistake is larger than it needs to be. The iteration speed on strategy improvements is slower than it needs to be.</p>
<p>Separating strategy from delivery — making it declarative, observable, and interchangeable — is one of the higher-leverage improvements a platform team can make to their deployment reliability.</p>
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>When strategy is a first-class platform concern, a few things shift.</p>
<p>Reliability becomes consistent. Every service that uses the platform gets the same deployment strategy primitives — canary, blue/green, progressive delivery — without each team having to implement them. The platform team implements once. Every team benefits.</p>
<p>Iteration gets faster. When a team wants to try a more conservative canary rollout, they change a few lines in their <code>Rollout</code> spec. They don't open a ticket to the platform team. They don't modify a shared pipeline. They make a declarative change and the platform handles the rest.</p>
<p>Postmortems get cleaner. When a deployment goes wrong, the strategy execution is visible in the platform — Argo Rollouts has an API, a UI, a clear audit trail. The question "what happened during the rollout" has an answer that doesn't require reading pipeline logs.</p>
<p>The goal isn't to make deployment more complex. It's to make the complexity live in the right place — a dedicated, observable, interchangeable strategy layer — rather than scattered across dozens of pipelines that each solve the same problem in a slightly different way.</p>
<hr />
<p><em>Does your platform have a consistent deployment strategy story, or is it still living in individual pipelines? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[The Glue Is the Platform]]></title><description><![CDATA[The Glue Is the Platform
Here's something that doesn't show up in architecture diagrams.
You have Kubernetes. You have Terraform. You have GitHub Actions, ArgoCD, Datadog, Vault, and a developer portal. Each of those tools is well-documented, widely ...]]></description><link>https://parraletz.dev/the-glue-is-the-platform</link><guid isPermaLink="true">https://parraletz.dev/the-glue-is-the-platform</guid><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[integrations]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Mon, 16 Mar 2026 15:10:19 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-the-glue-is-the-platform">The Glue Is the Platform</h1>
<p>Here's something that doesn't show up in architecture diagrams.</p>
<p>You have Kubernetes. You have Terraform. You have GitHub Actions, ArgoCD, Datadog, Vault, and a developer portal. Each of those tools is well-documented, widely adopted, and excellent at what it does.</p>
<p>None of them were designed to talk to each other.</p>
<p>So you write code that connects them. Code that translates a GitHub push into a Terraform run. Code that creates a Datadog monitor when a new service is deployed. Code that bridges Vault secrets into Kubernetes. Code that turns a Backstage template into a repo, a pipeline, a namespace, and an ArgoCD application.</p>
<p>That code is glue. And in most platforms, it's not a minor detail — it's the majority of what the platform actually is.</p>
<hr />
<h2 id="heading-the-problem">The problem</h2>
<p>Platform teams spend a lot of time talking about the tools. Kubernetes vs. Nomad. ArgoCD vs. Flux. Terraform vs. Pulumi. The tools are visible, they have logos, they have docs, they have Slack communities.</p>
<p>The glue is invisible. It lives in <code>scripts/</code> directories, in Lambda functions nobody documented, in GitHub Actions workflows that have grown to 400 lines, in CronJobs that someone wrote two years ago, that still run in production, and that nobody fully understands anymore.</p>
<p>The problem isn't that glue exists. Glue is inevitable — any platform that connects more than two systems will have it. The problem is that most teams treat glue as an afterthought rather than as a first-class part of the platform. It accumulates without design. It grows without standards. And eventually it becomes the thing that breaks most often and is hardest to fix.</p>
<hr />
<h2 id="heading-what-glue-actually-is">What glue actually is</h2>
<p>In platform engineering, glue is any code or automation that:</p>
<ul>
<li>doesn't contain business logic</li>
<li>doesn't belong to any specific tool</li>
<li>exists solely to integrate systems</li>
</ul>
<p>It shows up in many forms. Scripts that generate manifests and inject variables. Webhooks that trigger downstream workflows. Small services that transform data between formats. Operators that watch Kubernetes resources and call external APIs. Lambda functions that bridge cloud events to internal systems.</p>
<p>Take a simple example. A developer pushes code. You want your internal platform API to know about it — who pushed, what SHA, what repo — so it can update the deployment record, trigger downstream automation, and eventually show up in your developer portal.</p>
<p>GitHub doesn't know about your platform API. Your platform API doesn't know about GitHub. So you write glue:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { Octokit } <span class="hljs-keyword">from</span> <span class="hljs-string">"@octokit/rest"</span>
<span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">"axios"</span>

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">notifyDeployment</span>(<span class="hljs-params">repo: <span class="hljs-built_in">string</span>, sha: <span class="hljs-built_in">string</span></span>) </span>{
  <span class="hljs-keyword">const</span> octokit = <span class="hljs-keyword">new</span> Octokit({ auth: process.env.GITHUB_TOKEN })

  <span class="hljs-keyword">const</span> commit = <span class="hljs-keyword">await</span> octokit.repos.getCommit({
    owner: <span class="hljs-string">"org"</span>,
    repo,
    ref: sha
  })

  <span class="hljs-keyword">await</span> axios.post(<span class="hljs-string">"https://platform-api/deployments"</span>, {
    repo,
    sha,
    author: commit.data.commit.author?.name
  })
}
</code></pre>
<p>This code is not part of GitHub. It's not part of Kubernetes. It's not part of Datadog. It exists only to connect them. That's glue.</p>
<hr />
<h2 id="heading-where-glue-lives-in-a-real-platform">Where glue lives in a real platform</h2>
<p>A typical delivery flow looks something like this:</p>
<pre><code>GitHub push
    ↓
GitHub Actions
    ↓
Terraform / Helm
    ↓
Kubernetes
    ↓
ArgoCD
    ↓
Datadog
    ↓
Slack notification
</code></pre><p>Everything in that diagram is a well-known tool. But everything <em>between</em> those tools — the thing that takes output from one and feeds it to the next, that translates formats, that triggers the next step, that handles errors in the transition — is glue.</p>
<p>In practice, platform glue tends to cluster around five areas:</p>
<p><strong>CI/CD glue</strong> — scripts and pipeline steps that connect your version control to your delivery system. Generating manifests from templates. Injecting environment-specific variables. Validating policy before apply. Running post-deploy checks.</p>
<p><strong>GitOps glue</strong> — the translation layer between developer intent and the manifests that ArgoCD or Flux actually apply. A <code>service.yaml</code> that describes what a developer wants gets translated into Helm values, which become an Argo Rollout, which triggers a Datadog monitor creation. Each step in that chain is glue.</p>
<p><strong>Observability glue</strong> — automation that creates dashboards, alerts, and SLOs when a new service is registered. Without this, observability setup is manual and inconsistent. With it, every service gets a standard monitoring baseline automatically.</p>
<p><strong>Security glue</strong> — the code that bridges Vault, IAM, Kubernetes service accounts, and CI/CD pipelines. External Secrets Operator is a well-known example of productized security glue, but most platforms have additional custom pieces connecting their specific identity model to their specific secret stores.</p>
<p><strong>Developer experience glue</strong> — the automation behind an IDP action. When a developer creates a service in Backstage, something has to run the repo template, call Terraform, register the ArgoCD application, and create the initial deployment. That workflow is pure glue logic.</p>
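<p>To make one of these concrete, here's a minimal sketch of observability glue in TypeScript. Every name in it (the fields, the query strings, the thresholds) is illustrative rather than a real Datadog API; the point is that the monitoring baseline is computed from the service record instead of hand-built per service.</p>

```typescript
// Sketch of observability glue: given a newly registered service,
// produce the standard monitoring baseline every service should get.
// Field names, queries, and thresholds are illustrative, not a real API.

interface ServiceRecord {
  name: string
  team: string
  tier: "critical" | "standard"
}

interface MonitorSpec {
  name: string
  query: string
  threshold: number
  notify: string
}

function baselineMonitors(svc: ServiceRecord): MonitorSpec[] {
  // Critical services get a tighter error budget than standard ones.
  const errorThreshold = svc.tier === "critical" ? 0.01 : 0.05
  const channel = "#team-" + svc.team

  return [
    {
      name: svc.name + " error rate",
      query: "sum:http.errors{service:" + svc.name + "}.as_rate()",
      threshold: errorThreshold,
      notify: channel
    },
    {
      name: svc.name + " p95 latency",
      query: "p95:http.request.duration{service:" + svc.name + "}",
      threshold: 500, // milliseconds
      notify: channel
    }
  ]
}
```

<p>Register a service, feed it through a function like this, post the result to your monitoring API: that loop is what turns observability setup from a manual checklist into a property of the platform.</p>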
<hr />
<h2 id="heading-the-adapter-pattern-a-frame-for-thinking-about-glue">The Adapter pattern: a frame for thinking about glue</h2>
<p>The Adapter pattern from software design is the closest conceptual frame for what glue does. It says: when two systems have incompatible interfaces, build a component that translates between them without changing either system.</p>
<p>That's exactly what glue is. ESO is an adapter between secret managers and Kubernetes. KEDA is an adapter between external event sources and the Kubernetes scaling API. Your <code>notifyDeployment</code> function is an adapter between GitHub and your platform API.</p>
<p>Thinking about glue as adapters is useful because it sets a design standard. A well-designed adapter:</p>
<ul>
<li>translates between two specific interfaces without changing either</li>
<li>reconciles continuously rather than running once and hoping</li>
<li>handles the source being unavailable gracefully</li>
<li>surfaces errors through observable, standard mechanisms</li>
<li>is maintainable by someone other than the person who wrote it</li>
</ul>
<p>Most ad-hoc glue fails on at least three of those. That's not a criticism — it's a pattern. Glue starts as a quick script because the need is urgent, and it evolves into a liability because nobody went back to design it properly.</p>
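<p>Here's what "reconciles continuously" and "handles the source being unavailable gracefully" look like at their smallest, as a TypeScript sketch. The state shapes are hypothetical; a real adapter would run this comparison in a loop and apply the returned actions to the target system.</p>

```typescript
// Sketch of an adapter's reconcile step: compare desired state (from
// the source system) against actual state (in the target system) and
// return the actions that close the gap.

type State = Record<string, string>

type Action =
  | { kind: "create"; key: string; value: string }
  | { kind: "update"; key: string; value: string }
  | { kind: "delete"; key: string }

function reconcile(desired: State | null, actual: State): Action[] {
  // Source unavailable: do nothing, rather than deleting everything
  // because the desired state "looks empty".
  if (desired === null) return []

  const actions: Action[] = []
  for (const [key, value] of Object.entries(desired)) {
    if (!(key in actual)) actions.push({ kind: "create", key, value })
    else if (actual[key] !== value) actions.push({ kind: "update", key, value })
  }
  for (const key of Object.keys(actual)) {
    if (!(key in desired)) actions.push({ kind: "delete", key })
  }
  return actions
}
```

<p>The null guard is the part most quick scripts skip, and it's the difference between an adapter and a run-once script that happens to work on good days.</p>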
<hr />
<h2 id="heading-the-60-80-problem">The 60-80% problem</h2>
<p>In practice, a significant portion of what a platform team actually builds and maintains is glue. Not Kubernetes. Not Terraform. The code that connects them.</p>
<p>This is a pattern that shows up repeatedly in mature platform teams: the tools are 20-40% of the work, the integrations between them are the other 60-80%. And yet most platform teams staff, document, and design for the tools — not the glue.</p>
<p>The teams that handle this well tend to go through the same evolution:</p>
<pre><code>scripts/
  deploy.sh
  create-service.sh
  add-monitor.sh
  update-ingress.sh
</code></pre><p>This works at small scale. It stops working when you have ten teams, three environments, and a platform that needs to be reliable rather than just functional.</p>
<p>So the scripts get promoted. Some become operators. Some become controllers. Some become platform services with proper APIs. The glue doesn't disappear — it gets intentional design applied to it. The platform stops being a collection of scripts that connect tools and starts being a system with clear interfaces, reconciliation semantics, and something approaching reliability.</p>
<p>That evolution — from glue spaghetti to platform control plane — is its own topic. But it starts with recognizing that the glue is not a side effect of the platform. In many cases, the glue <em>is</em> the platform.</p>
<hr />
<p><em>What does your glue look like today — scattered scripts, promoted operators, or something in between? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[A Git Commit Is Not Just a Change. It's a Command.]]></title><description><![CDATA[A Git Commit Is Not Just a Change. It's a Command.
Here's a scenario that plays out in a lot of platform teams.
Your CI/CD pipeline runs kubectl apply as a job step. It exits zero. The pipeline goes green. Deployment successful — or so the dashboard ...]]></description><link>https://parraletz.dev/a-git-commit-is-not-just-a-change-its-a-command</link><guid isPermaLink="true">https://parraletz.dev/a-git-commit-is-not-just-a-change-its-a-command</guid><category><![CDATA[ArgoCD]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Fri, 13 Mar 2026 15:10:20 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-a-git-commit-is-not-just-a-change-its-a-command">A Git Commit Is Not Just a Change. It's a Command.</h1>
<p>Here's a scenario that plays out in a lot of platform teams.</p>
<p>Your CI/CD pipeline runs <code>kubectl apply</code> as a job step. It exits zero. The pipeline goes green. Deployment successful — or so the dashboard says. But nobody actually knows if the manifest that was applied was valid, if the pods came up healthy, or if the service is responding. So you add another job to check. And another to verify the rollout. And a retry mechanism for when the check fails. And now your pipeline is a sequence of shell scripts duct-taped together, each one trying to compensate for the fact that the previous step had no way to confirm its own outcome.</p>
<p>And when something goes wrong — which it does — you're cross-referencing pipeline logs, Kubernetes events, and Slack messages trying to reconstruct what actually happened and in what order.</p>
<p>This is the problem that GitOps solves. But the usual explanation — "Git is your source of truth" — doesn't tell you <em>why</em> that works, or why it's architecturally sound rather than just a convention someone decided to follow.</p>
<p>The Command pattern does.</p>
<hr />
<h2 id="heading-the-problem">The problem</h2>
<p>Pipelines that apply changes directly are fast and simple. They're also invisible.</p>
<p>When a CI/CD pipeline runs <code>kubectl apply</code> or <code>terraform apply</code> directly, the action happens and disappears. You have a log entry somewhere, maybe. You have the pipeline run in your CI system, if you know where to look. But the <em>intent</em> — what was supposed to happen, who requested it, what the expected state was — isn't captured anywhere durable.</p>
<p>This creates a class of problems that compound over time. Drift between what's in your repo and what's actually running. No reliable way to roll back a change without re-running a pipeline in reverse. Audit trails that live in your CI system instead of with the infrastructure they describe. And a mental model problem: nobody has a single place to look to understand the current intended state of the system.</p>
<p>The deeper issue is that <strong>executing and recording are coupled</strong>. The change happens and the record of the change is scattered across logs, pipeline runs, and people's memory. When something goes wrong, you're reconstructing history instead of reading it.</p>
<hr />
<h2 id="heading-why-it-hurts">Why it hurts</h2>
<p>Imagine you need to answer a simple question: <em>what changed in the production cluster between Tuesday and Thursday?</em></p>
<p>Without GitOps, that question requires cross-referencing CI pipeline runs, Kubernetes audit logs, Slack messages, and whoever was on call. Each of those sources has partial information. None of them has the full picture. And if the change was a manual <code>kubectl apply</code> that someone ran from their laptop, it might not appear anywhere at all.</p>
<p>Now multiply that across ten services, three environments, and a team of fifteen. The cognitive overhead of understanding the current state of your platform becomes a full-time job. And when incidents happen — which they do — the time you spend reconstructing what changed is time you're not spending fixing the problem.</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Command pattern says: instead of executing an action directly, encapsulate it as an object — something that can be stored, passed around, queued, undone, or audited — and separate the act of <em>defining</em> the command from the act of <em>executing</em> it.</p>
<p>In software, this pattern shows up in undo/redo systems, job queues, and transaction logs. The key insight is that by making the command a first-class thing — not just a function call that disappears — you gain properties you couldn't have otherwise: history, reversibility, auditability, and the ability to replay or defer execution.</p>
<p>In platform engineering, a Git commit in a GitOps repository is exactly this. It's not just a text change. It's a persisted command — with an author, a timestamp, a description of intent, and the ability to be reversed with a <code>git revert</code>. The commit is the command object. ArgoCD or Flux is the executor that reads that command and applies it to the cluster.</p>
<p>The separation is the point. The commit and the apply are two distinct steps, with Git in between acting as a durable, auditable command queue.</p>
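<p>A toy version of that separation, in TypeScript. The append-only log stands in for Git history and the replay function stands in for the executor; this is not how ArgoCD works internally, just the shape of the idea.</p>

```typescript
// Sketch: a commit as a command object. Commands are defined and
// persisted in one step, executed (replayed) in another, and reverted
// by appending an inverse command rather than rewriting history.

interface Command {
  author: string
  message: string
  set: Record<string, string> // desired key/values this command asserts
}

// The "queue": an append-only log of commands, like a Git history.
const log: Command[] = []

function commit(cmd: Command): void {
  log.push(cmd)
}

// The executor derives current desired state by replaying the log.
function executeAll(): Record<string, string> {
  const state: Record<string, string> = {}
  for (const cmd of log) Object.assign(state, cmd.set)
  return state
}

// Revert appends a new command restoring the prior values, the same
// way `git revert` adds a commit instead of deleting one.
function revert(index: number, author: string): void {
  const prior: Record<string, string> = {}
  for (const cmd of log.slice(0, index)) Object.assign(prior, cmd.set)
  const bad = log[index]
  const undo: Record<string, string> = {}
  for (const key of Object.keys(bad.set)) undo[key] = prior[key] ?? ""
  commit({ author, message: "revert: " + bad.message, set: undo })
}
```

<p>Every change has an author and a message, the current state is always derivable from the log, and undo is itself just another command. Those are exactly the properties GitOps buys you.</p>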
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>ArgoCD and Flux</strong> are Command pattern executors. Their job is to watch a Git repository — the command queue — and continuously reconcile the cluster state against the commands stored there. When a new commit arrives, they execute it. When the cluster drifts from the desired state, they re-execute the last command to bring it back.</p>
<p>This is why GitOps gives you drift detection almost for free. The executor is always comparing what the command says against what's actually running. Drift is just a command that hasn't been fully executed yet.</p>
<p><strong>Terraform with remote state</strong> applies the same idea to infrastructure. The <code>.tf</code> files are the commands. The state file is the record of what's been executed. <code>terraform plan</code> shows you what the command would do before you run it. <code>terraform apply</code> executes it. The separation between plan and apply is the Command pattern in action.</p>
<p><strong>Pull Request workflows</strong> add another layer. Before a command is committed to the queue, it goes through review. The PR is the command in draft — visible, discussable, and rejectable before it's persisted. Merge is the moment the command becomes official. This is why "GitOps with PR-based workflows" gives you change management essentially for free, without a separate tool.</p>
<p><strong>Argo Workflows and Tekton</strong> extend this to pipeline definitions. The pipeline itself is a command — defined declaratively, stored in Git, executed by the platform. You're not running imperative scripts. You're defining commands that the platform knows how to execute.</p>
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>The Command framing answers a question that comes up constantly: <em>why GitOps instead of just running <code>kubectl apply</code> in a pipeline?</em></p>
<p>The answer isn't "because GitOps is best practice." The answer is that running <code>kubectl apply</code> in a pipeline couples execution to the pipeline run. The command and its execution happen at the same time and leave no durable record tied to the infrastructure itself. GitOps decouples them — the command lives in Git, the executor applies it, and the two can operate independently.</p>
<p>That decoupling gives you things that are very hard to retrofit:</p>
<ul>
<li><strong>Auditability</strong> — every change has an author, a timestamp, and a description, tied to the infrastructure it describes</li>
<li><strong>Reversibility</strong> — rolling back is a <code>git revert</code>, not a re-run of a pipeline in the right order with the right parameters</li>
<li><strong>Drift detection</strong> — the executor knows what the command says and can tell you when reality disagrees</li>
<li><strong>Separation of concerns</strong> — the team that defines what should be running is decoupled from the mechanism that makes it run</li>
</ul>
<p>Once you see a Git commit as a command object rather than just a file change, the architecture of GitOps stops being a convention and starts being a deliberate design choice with clear properties. And that makes it much easier to explain, defend, and extend.</p>
<hr />
<p><em>Is your team still running direct applies in pipelines? Or have you made the jump to GitOps? Either way — what's the hardest part of the transition? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Your Cluster Is Already Watching. That's the Observer Pattern.]]></title><description><![CDATA[Your Cluster Is Already Watching. That's the Observer Pattern.
At some point, most platform teams hit the same wall.
A service goes down. A secret expires. A node runs out of disk. And the first question in the postmortem is always the same: why did...]]></description><link>https://parraletz.dev/your-cluster-is-already-watching-thats-the-observer-pattern</link><guid isPermaLink="true">https://parraletz.dev/your-cluster-is-already-watching-thats-the-observer-pattern</guid><category><![CDATA[design patterns]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Operators]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Wed, 11 Mar 2026 22:49:45 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-your-cluster-is-already-watching-thats-the-observer-pattern">Your Cluster Is Already Watching. That's the Observer Pattern.</h1>
<p>At some point, most platform teams hit the same wall.</p>
<p>A service goes down. A secret expires. A node runs out of disk. And the first question in the postmortem is always the same: <em>why didn't we know sooner?</em></p>
<p>So you add more monitoring. More alerts. More dashboards. And somewhere along the way, someone suggests: "we should build a controller that watches for this and reacts automatically."</p>
<p>That controller is an Observer. You probably didn't call it that either.</p>
<hr />
<h2 id="heading-the-problem">The problem</h2>
<p>Manual reconciliation doesn't scale.</p>
<p>When your platform grows — more clusters, more teams, more services — the list of things someone needs to watch and react to grows with it. Certificate expiration. Secret rotation. Node pressure. Deployment drift. Config changes that need to propagate across namespaces.</p>
<p>If the reaction to any of these depends on a human noticing and acting, you have a fragile system. Not because your team is slow, but because humans aren't designed to watch dozens of state transitions simultaneously and react to all of them in real time.</p>
<p>The deeper problem is coupling. When your response logic lives in runbooks, scripts, or people's heads, it's tightly coupled to whoever wrote it. It doesn't compose. It doesn't scale. And it breaks in the exact moment you need it most — when things are moving fast and nobody has time to read the runbook.</p>
<hr />
<h2 id="heading-why-it-hurts">Why it hurts</h2>
<p>Think about secret rotation without automation. Someone sets a reminder. The reminder gets missed. The secret expires in production on a Friday at 5pm. Three people get paged. The incident takes two hours to resolve, not because the fix is hard, but because nobody had a clear picture of what was watching what.</p>
<p>Or think about configuration drift. A team manually edits a ConfigMap in staging. Nobody notices it diverged from what's in Git. Three weeks later, a deployment to production behaves differently than expected. The diff is one line. Finding it takes half a day.</p>
<p>These aren't rare edge cases. They're the steady background noise of a platform that reacts to change manually instead of automatically.</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Observer pattern says: when the state of something changes, every component that cares about that change gets notified and can react — without the thing that changed needing to know who's watching.</p>
<p>There are two roles. The <em>subject</em> holds state and emits events when that state changes. The <em>observers</em> subscribe to those events and decide what to do with them. The subject doesn't care how many observers there are or what they do. The observers don't care how the state changed or who triggered it.</p>
<p>The key property is decoupling. The thing that changes and the things that react to that change don't need to know about each other. You can add new observers without touching the subject. You can change how the subject works without touching the observers.</p>
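<p>The two roles fit in a few lines of TypeScript. The "secret expiring" event here is hypothetical, but the structure is exactly what this post describes:</p>

```typescript
// Minimal subject/observer sketch. The subject doesn't know what its
// observers do; the observers don't know who changed the state.

interface SecretEvent {
  name: string
  daysLeft: number
}

type Listener = (event: SecretEvent) => void

class Subject {
  private listeners: Listener[] = []

  subscribe(fn: Listener): void {
    this.listeners.push(fn)
  }

  emit(event: SecretEvent): void {
    // Notify every registered observer, whatever it plans to do.
    for (const fn of this.listeners) fn(event)
  }
}

const secrets = new Subject()
const reactions: string[] = []

// Two independent observers: one alerts, one triggers rotation.
// Adding a third requires no change to the subject.
secrets.subscribe(e => { if (e.daysLeft <= 7) reactions.push("alert:" + e.name) })
secrets.subscribe(e => { if (e.daysLeft <= 7) reactions.push("rotate:" + e.name) })

secrets.emit({ name: "db-password", daysLeft: 3 })
```

<p>Swap the in-memory subject for the Kubernetes API server and the callbacks for controllers, and this is the architecture of your cluster.</p>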
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>The Kubernetes controller loop</strong> is the Observer pattern implemented at cluster scale — and it's the most important example in this entire series.</p>
<p>Every controller in Kubernetes follows the same structure: watch a resource for changes, compare the current state against the desired state, and act to close the gap. The API server is the subject. Controllers are the observers. When you create a Deployment, the Deployment controller observes that event and starts reconciling — creating ReplicaSets, scheduling Pods, updating status.</p>
<p>You didn't wire them together explicitly. The controller registered its interest in Deployments, and the API server notifies it when something changes. That's the pattern.</p>
<p><strong>Operators</strong> extend this to your own domain. When you write a Kubernetes Operator — or adopt one like External Secrets Operator, Cert-Manager, or Argo Rollouts — you're writing an Observer for a specific kind of state change. External Secrets Operator watches for <code>ExternalSecret</code> resources and reacts by fetching secrets from Vault or AWS Secrets Manager and syncing them into Kubernetes Secrets. Cert-Manager watches for <code>Certificate</code> resources and reacts by requesting TLS certificates from Let's Encrypt.</p>
<p>Each of these is an automated response to a state change. No runbook. No human in the loop. The observer watches, the event happens, the reaction fires.</p>
<p><strong>Argo Rollouts</strong> takes this further into deployment strategy. It observes metrics from Prometheus during a canary deployment and reacts automatically — promote if the error rate stays below the threshold, roll back if it spikes. The deployment strategy becomes a set of Observer reactions to real-time system state rather than a manual decision made by an engineer watching a dashboard.</p>
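<p>The promote-or-rollback decision itself can be sketched as a pure function. The sample values and thresholds here are invented; a real rollout pulls them from Prometheus or Datadog through its analysis configuration.</p>

```typescript
// Sketch of a canary analysis decision: roll back on any bad sample,
// promote once enough clean samples have been observed, otherwise
// keep watching. Thresholds and sample counts are illustrative.

type Verdict = "promote" | "rollback" | "continue"

function analyzeCanary(
  errorRates: number[],    // error-rate samples observed so far
  threshold: number,       // maximum acceptable error rate
  requiredSamples: number  // clean samples needed before promoting
): Verdict {
  // Any sample over the threshold fails the canary immediately.
  if (errorRates.some(r => r > threshold)) return "rollback"
  // Not enough evidence yet: keep observing.
  if (errorRates.length < requiredSamples) return "continue"
  return "promote"
}
```

<p>An engineer watching a dashboard is running this function in their head. The Observer version runs it on every new sample, every time, without getting distracted.</p>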
<hr />
<h2 id="heading-when-to-write-one-and-when-not-to">When to write one and when not to</h2>
<p>Knowing this pattern helps you make a decision that comes up more often than it should: <em>should we write a custom operator for this?</em></p>
<p>The answer is usually: only if the thing you're watching and the reaction you need aren't covered by an existing operator, and only if the operational value justifies the maintenance cost.</p>
<p>Writing an operator is committing to a stateful, long-running Observer. It will need to handle edge cases — what happens if it misses an event? What if the reaction fails halfway? What if the resource it's watching gets deleted while it's processing? These aren't reasons not to write one, but they're reasons to reach for an existing operator first and build a custom one only when you have a clear gap.</p>
<p>The pattern also helps you evaluate existing tools more clearly. When you're comparing Cert-Manager vs. manually rotating certificates via a CronJob, you're really comparing "a well-designed Observer with proper reconciliation semantics" vs. "a polling loop that approximates observation." That framing makes the tradeoff obvious.</p>
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>When you see your platform through the Observer lens, the question shifts from <em>"who is responsible for watching this?"</em> to <em>"what should be watching this automatically?"</em></p>
<p>That's not just a semantic difference. It changes how you design for reliability. Instead of writing runbooks that tell humans what to watch and when to react, you build controllers that watch continuously and react consistently. The human moves from being the observer to being the designer of observers.</p>
<p>It also changes how you think about toil. Toil — the repetitive, manual, automatable work that plagues ops teams — is almost always a symptom of missing observers. If someone is manually rotating secrets, there's no secret observer. If someone is manually syncing configs across environments, there's no config observer. If someone is manually checking whether a canary deployment looks healthy enough to promote, there's no metrics observer.</p>
<p>Name the toil. Find the missing observer. Build or adopt it.</p>
<hr />
<p><em>What's something your team is still watching manually that should have an observer by now? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a> — I'd bet the pattern already exists.</em></p>
]]></content:encoded></item><item><title><![CDATA[The Facade Is the Product]]></title><description><![CDATA[The Facade Is the Product
There's a question that comes up in almost every platform team at some point:
Should we expose Kubernetes directly to developers, or should we abstract it?
It sounds like a tooling debate. Backstage vs. raw manifests. Helm c...]]></description><link>https://parraletz.dev/the-facade-is-the-product</link><guid isPermaLink="true">https://parraletz.dev/the-facade-is-the-product</guid><category><![CDATA[internal-developer-platform]]></category><category><![CDATA[Backstage]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[developer experience]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Wed, 11 Mar 2026 22:43:43 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-the-facade-is-the-product">The Facade Is the Product</h1>
<p>There's a question that comes up in almost every platform team at some point:</p>
<p><em>Should we expose Kubernetes directly to developers, or should we abstract it?</em></p>
<p>It sounds like a tooling debate. Backstage vs. raw manifests. Helm charts vs. a self-service portal. YAML vs. a form. But underneath all of that, it's actually a design question — and the Facade pattern is the lens that makes it clearer.</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Facade pattern is straightforward: when a system is too complex for its consumers to deal with directly, you build a simpler interface on top. Consumers interact with the facade. The complexity lives behind it.</p>
<p>The key word is <em>consumers</em>. A facade isn't about hiding things from people who need to know them. It's about not burdening people with complexity that isn't relevant to their job.</p>
<p>Your developers don't need to know which CNI the cluster runs. They don't need to understand how cert-manager issues TLS certificates, or whether secrets come from Vault or AWS Secrets Manager or an ESO sync. That's not their job. Their job is to ship features. Your job, as a platform engineer, is to make sure the infrastructure complexity doesn't get in the way of that.</p>
<p>The platform is the facade. That's not a metaphor — it's the actual pattern.</p>
<hr />
<h2 id="heading-the-cost-of-not-having-one">The cost of not having one</h2>
<p>When there's no facade, the complexity leaks.</p>
<p>Developers end up learning Kubernetes internals just to deploy an app. They copy-paste manifests they don't fully understand. They ping the platform team every time something doesn't work because they don't have the mental model to debug it themselves. The platform team becomes a bottleneck — not because they're slow, but because the interface they've built requires too much context to use independently.</p>
<p>This is what the CNCF calls <em>cognitive load</em>, and reducing it is one of the central arguments for Platform Engineering as a discipline. The Platform Engineering Maturity Model essentially describes a journey from "no facade" to "well-designed facade" — from teams managing raw infrastructure to teams consuming curated, opinionated abstractions.</p>
<p>The Facade pattern gives that journey a name and a shape.</p>
<hr />
<h2 id="heading-the-golden-path-is-a-facade-with-an-opinion">The Golden Path is a facade with an opinion</h2>
<p>The Golden Path concept — popularized by Spotify and now widely referenced in the platform engineering community — is a specific flavor of the Facade pattern.</p>
<p>It's not just an abstraction. It's an <em>opinionated</em> abstraction. You're not trying to support every possible way to deploy an app. You're defining the recommended way, making it the easiest path, and letting teams deviate only when they have a good reason.</p>
<p>That opinion is the point. A facade without an opinion is just indirection — you've added a layer but haven't reduced the decision space. A Golden Path says: we've already made the hard calls about logging, secrets management, resource limits, and deployment strategy. If you follow this path, those problems are solved for you.</p>
<p>The tradeoff is real though. The more opinionated the facade, the less flexibility consumers have. Getting that balance right — tight enough to provide value, loose enough not to become a cage — is one of the harder design problems in platform engineering.</p>
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>Backstage Software Templates</strong> are a facade over your entire delivery stack. The developer sees a form. Behind it: a repo is created, a CI pipeline is wired up, a Kubernetes namespace is provisioned, a service catalog entry is registered. The facade collapses a multi-step, multi-system process into a single interaction.</p>
<p><strong>Helm charts as internal abstractions</strong> are a more granular facade. Instead of exposing raw Kubernetes manifests, you expose a <code>values.yaml</code> with the knobs that actually matter for your context — image tag, replicas, resource tier. Everything else is a sensible default hidden inside the chart.</p>
<p><strong>Port, Cortex, or any internal developer portal</strong> — same pattern, different surface. The portal is the facade. It shows developers what they need: service health, deployment status, ownership, runbooks. It hides what they don't: the underlying observability stack, the alerting rules, the infrastructure topology.</p>
<p>Even something as simple as a well-designed <code>Makefile</code> or a CLI wrapper around your deployment scripts is a facade. You're reducing the interface to the things that matter for the person using it.</p>
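<p>Here's the Helm-values idea reduced to a sketch: a handful of knobs in, a full config out, with the opinions baked into the defaults. All field names here are illustrative, not a real chart's schema.</p>

```typescript
// Sketch of a facade over deployment config: the consumer chooses
// three things, the facade decides everything else.

interface ServiceKnobs {
  name: string
  image: string
  tier?: "small" | "large" // the only sizing decision exposed
}

interface DeploymentConfig {
  name: string
  image: string
  replicas: number
  cpu: string
  memory: string
  logFormat: string
  metricsPath: string
}

function renderDeployment(knobs: ServiceKnobs): DeploymentConfig {
  const tier = knobs.tier ?? "small"
  return {
    name: knobs.name,
    image: knobs.image,
    // Opinionated defaults: the hard calls are made once, behind
    // the facade, instead of once per team.
    replicas: tier === "large" ? 4 : 2,
    cpu: tier === "large" ? "1000m" : "250m",
    memory: tier === "large" ? "1Gi" : "256Mi",
    logFormat: "json",
    metricsPath: "/metrics"
  }
}
```

<p>Notice what the consumer never sees: log formats, metrics conventions, resource math. That's the decision space the facade removes.</p>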
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>Back to the original question: should you expose Kubernetes directly or abstract it?</p>
<p>The Facade pattern reframes it. The real question isn't about Kubernetes — it's about <em>who your consumer is</em> and <em>what complexity is relevant to them</em>.</p>
<p>If your consumers are platform engineers building on top of the platform, some Kubernetes exposure might be appropriate. They have the context to use it well. If your consumers are product developers who need to ship features, exposing Kubernetes is almost always the wrong call. The abstraction isn't hiding something useful from them — it's removing friction that shouldn't exist in the first place.</p>
<p>But the more practical shift is this: once you see the platform as a facade, you start asking different questions. Instead of "how do we document all this so developers can figure it out?", you ask "what should developers never need to figure out in the first place?" Instead of "how do we give developers more access?", you ask "what's the right interface for this consumer?"</p>
<p>The facade you build should match the consumer you're designing for. And that's a product decision, not a tooling one.</p>
<hr />
<p><em>What does your facade look like today — close to the metal, fully abstracted, or somewhere in between? Reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Your Platform Provisions Environments. That's a Factory.]]></title><description><![CDATA[Your Platform Provisions Environments. That's a Factory.
Every platform team eventually faces the same request, over and over again.
A developer needs a new environment — staging, a feature branch, a sandbox for testing. They open a ticket, fill out ...]]></description><link>https://parraletz.dev/your-platform-provisions-environments-thats-a-factory</link><guid isPermaLink="true">https://parraletz.dev/your-platform-provisions-environments-thats-a-factory</guid><category><![CDATA[Backstage]]></category><category><![CDATA[crossplane]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Mon, 09 Mar 2026 16:51:05 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-your-platform-provisions-environments-thats-a-factory">Your Platform Provisions Environments. That's a Factory.</h1>
<p>Every platform team eventually faces the same request, over and over again.</p>
<p>A developer needs a new environment — staging, a feature branch, a sandbox for testing. They open a ticket, fill out a form, ping someone on Slack, or run a script they half-understand. A few minutes or a few days later, the environment exists. Or it doesn't. Or it exists but it's missing the secrets. Or the resource limits are wrong. Or it was set up differently than the last one.</p>
<p>This is the problem: <strong>environment provisioning without a clear creation model is slow, inconsistent, and doesn't scale.</strong></p>
<hr />
<h2 id="heading-why-it-hurts">Why it hurts</h2>
<p>When there's no centralized creation logic, every team invents their own process. Some copy-paste Terraform from a previous project. Some run scripts that only the original author understands. Some open tickets and wait.</p>
<p>The result is environments that drift. Staging doesn't match production. The new service has a different secret management setup than everything else. Resource limits are either too tight or nonexistent. And every time something breaks, the platform team has to reverse-engineer how that environment was created in the first place.</p>
<p>Beyond the technical debt, there's a trust problem. Developers stop trusting that the environment they get will actually work. They start maintaining their own workarounds. The platform team loses credibility not because they're incompetent, but because the process itself is unreliable.</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Factory pattern says: instead of letting every caller decide how to create something, you centralize that creation logic in one place. The caller says <em>what</em> they want. The factory decides <em>how</em> to build it.</p>
<p>The Abstract Factory goes one level up — it creates families of related objects that are meant to work together. Not just a namespace, but a namespace <em>plus</em> secrets <em>plus</em> network policies <em>plus</em> resource quotas <em>plus</em> an observability entry. Things that belong together get created together, consistently, every time.</p>
<p>In software, this pattern exists to avoid the chaos of scattered instantiation logic. In platform engineering, it exists for the same reason: to avoid the chaos of scattered provisioning logic.</p>
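<p>In code terms, the idea looks something like this. A minimal Python sketch, not any real platform API; the function name and resource list are invented for the example:</p>
<pre><code class="lang-python">from dataclasses import dataclass, field

@dataclass
class Environment:
    name: str
    resources: list = field(default_factory=list)

def environment_factory(name, tier):
    """Abstract Factory sketch: callers say WHAT they want (name, tier);
    the factory decides HOW the family of related resources is built."""
    quotas = {"small": "2 CPU / 4Gi", "standard": "8 CPU / 16Gi"}[tier]
    env = Environment(name)
    # Things that belong together get created together, every time:
    env.resources.append(f"namespace/{name}")
    env.resources.append(f"secrets/{name}-app-secrets")
    env.resources.append(f"networkpolicy/{name}-default-deny")
    env.resources.append(f"resourcequota/{name} ({quotas})")
    env.resources.append(f"observability/{name}-dashboard")
    return env

env = environment_factory("checkout-staging", "standard")
print(env.resources)
</code></pre>
<p>The caller never decides which pieces to create; asking for a different factory output (a different tier) still yields the full, consistent family.</p>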
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>Crossplane</strong> is the clearest implementation of this pattern in the cloud-native ecosystem. You define a Composite Resource — say, an <code>AppEnvironment</code> — and Crossplane composes it from lower-level resources: an RDS instance, an S3 bucket, a Kubernetes namespace, an IAM role. The developer asks for an <code>AppEnvironment</code>. They don't know or care what's underneath. The factory handles it.</p>
<p>The <code>AppEnvironment</code> XRD is the interface. The composition is the implementation. You can swap the implementation — move from AWS to GCP, change the database engine — without touching the interface the developer uses.</p>
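<p>For illustration, the developer-facing side of that contract is just a small claim. The API group and spec fields below are hypothetical; your XRD defines the real ones:</p>
<pre><code class="lang-yaml">apiVersion: platform.example.org/v1alpha1   # hypothetical API group
kind: AppEnvironment
metadata:
  name: checkout-staging
spec:
  tier: standard      # the composition maps this to database size, quotas, etc.
  region: us-east-1
</code></pre>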
<p><strong>Backstage Software Templates</strong> work the same way. A developer picks a template — "Python microservice", "data pipeline", "internal tool" — and gets back a repo, a CI/CD pipeline, a Kubernetes manifest, and a service catalog entry. The template is the factory. The developer just chose the product type.</p>
<p><strong>Terraform modules</strong> are the most familiar version of this idea. A well-designed module is a factory for infrastructure — you pass in variables (the <em>what</em>), the module handles the implementation details (the <em>how</em>). The problem is that most teams end up with modules that are too thin, just wrappers around resources, instead of modules that encapsulate a meaningful unit of infrastructure with sensible defaults and guardrails built in.</p>
<hr />
<h2 id="heading-the-builder-pattern-when-creation-has-steps">The Builder pattern: when creation has steps</h2>
<p>The Builder pattern is a close relative. Where Factory focuses on <em>what</em> you're creating, Builder focuses on <em>how</em> you assemble something complex step by step.</p>
<p>Think of a cluster bootstrap pipeline. It doesn't create everything at once. It runs in stages: provision the cluster, install the CNI, configure cert-manager, deploy the ingress controller, register with ArgoCD, apply the base manifests. Each step depends on the previous one. The order matters.</p>
<p>Helm with dependencies, Terraform with <code>depends_on</code>, pipelines chaining multiple workspaces — these are all Builder implementations. You're constructing a complex result piece by piece, in a defined sequence, with a clear separation between "how we build this" and "what we end up with."</p>
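<p>Reduced to a Python sketch, the Builder shape looks like this. Stage names are illustrative, and the steps just record themselves instead of calling real tools:</p>
<pre><code class="lang-python">class ClusterBuilder:
    """Builder sketch: assemble a complex result step by step, in order.
    Each stage records what it did, so a failure points at a specific step."""
    def __init__(self, name):
        self.name = name
        self.completed = []

    def step(self, stage):
        # In a real pipeline each stage would call Terraform, Helm, etc.
        self.completed.append(stage)
        return self  # chaining keeps the sequence explicit

    def build(self):
        return {"cluster": self.name, "stages": list(self.completed)}

result = (
    ClusterBuilder("prod-us-east-1")
    .step("provision cluster")
    .step("install CNI")
    .step("configure cert-manager")
    .step("deploy ingress controller")
    .step("register with ArgoCD")
    .step("apply base manifests")
    .build()
)
print(result["stages"])
</code></pre>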
<p>AWS CDK is an interesting case here. It's deeply influenced by the Builder pattern — you construct infrastructure programmatically using loops, conditionals, and abstractions. The mental model is sound. But it also requires teams to fully commit to that abstraction layer, which is a significant shift if your team lives in Terraform or plain YAML. That might be part of why CDK hasn't seen the adoption you'd expect given how well-designed it is — the pattern is right, but the switching cost is real.</p>
<p>The distinction between Factory and Builder matters when something breaks. A Factory failure tells you <em>what</em> couldn't be created. A Builder failure tells you <em>which step</em> broke — which is much easier to recover from.</p>
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>When someone on your team says "we should let teams provision their own environments", the Factory framing turns a vague discussion into a concrete design question: <strong>how much of the creation logic do you centralize, and how much flexibility do you expose?</strong></p>
<p>Do you give teams raw Terraform and hope they follow conventions? That's no factory — just scattered creation logic waiting to drift. Do you give them a module with sensible defaults and a few knobs to turn? That's a thin factory. Do you give them a Backstage template that hides everything behind a form? That's a full factory.</p>
<p>Each of those is a valid choice depending on your context. The pattern doesn't tell you which one to pick — it gives you the vocabulary to make that decision intentionally instead of stumbling into it.</p>
<p>You're not debating Terraform vs. Crossplane. You're debating how much creation logic should live in one place, and who should own it. That's an architectural decision. It deserves to be treated like one.</p>
<hr />
<p><em>How does your team handle environment provisioning today — scattered scripts, thin modules, or full abstraction? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a> — curious where people have landed on this.</em></p>
]]></content:encoded></item><item><title><![CDATA[Design Patterns Are Not a Developer Thing]]></title><description><![CDATA[Design Patterns Are Not a Developer Thing
At some point, most platform engineers end up in a conversation with a senior dev or a software architect where you're trying to explain how your infrastructure works.
And somewhere in the middle of that conv...]]></description><link>https://parraletz.dev/design-patterns-are-not-a-developer-thing-1</link><guid isPermaLink="true">https://parraletz.dev/design-patterns-are-not-a-developer-thing-1</guid><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Fri, 06 Mar 2026 15:09:44 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-design-patterns-are-not-a-developer-thing">Design Patterns Are Not a Developer Thing</h1>
<p>At some point, most platform engineers end up in a conversation with a senior dev or a software architect where you're trying to explain how your infrastructure works.</p>
<p>And somewhere in the middle of that conversation, something clicks. You're both describing the same problems. You're reaching for the same solutions. The only difference is the vocabulary — they say "Facade", you say "developer abstraction layer". They say "Observer pattern", you say "controller loop". They say "Command", you say "GitOps".</p>
<p>Same architecture. Different words.</p>
<p>What I've found is that design patterns aren't really a software thing. They're a complexity thing. And if there's one domain that deals with complexity at scale every single day, it's platform engineering.</p>
<p>This post is the first in a series that explores exactly that.</p>
<hr />
<h2 id="heading-the-problem-with-how-design-patterns-are-taught">The problem with how design patterns are taught</h2>
<p>Design patterns come from a book called <em>Design Patterns</em>, published in 1994 by four authors the industry nicknamed the "Gang of Four" (GoF). The book is genuinely great. It's also full of examples like <em>"imagine you have a <code>WordDocument</code> class and you need a <code>PDFDocument</code> class..."</em></p>
<p>If your day-to-day doesn't involve classes and inheritance, that framing is just noise.</p>
<p>But the patterns themselves aren't really about code. They're about <strong>how to deal with complexity as a system grows</strong>. And platforms have exactly those problems — just at a different scale and with different tools.</p>
<p>The thesis of this series is pretty straightforward: <strong>the design patterns developers learned in school or from a book? You're already applying them in your platform. Nobody has ever introduced them to you by name.</strong></p>
<hr />
<h2 id="heading-three-quick-examples-to-make-the-case">Three quick examples to make the case</h2>
<h3 id="heading-the-facade-why-platform-teams-exist-in-the-first-place">The Facade: why platform teams exist in the first place</h3>
<p>The Facade pattern says that when a system gets too complex, you build a simplified interface on top. Consumers talk to the facade, not to the guts of the system.</p>
<p>That's exactly what an Internal Developer Platform does. Your developers shouldn't need to know whether a secret comes from Vault, AWS Secrets Manager, or a ConfigMap. They shouldn't need to care which CNI the cluster runs, or how the ingress controller works under the hood. The platform is the facade. You're the one designing it.</p>
<p>When your team debates whether to expose Kubernetes directly to developers or abstract it behind a Backstage template — that's a conversation about how much facade to build. That's design work, not ops work.</p>
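<p>A toy Python sketch of that shape, with stand-in subsystems rather than real integrations:</p>
<pre><code class="lang-python">class SecretsBackend:
    def fetch(self, key):
        return f"value-of-{key}"  # stand-in for Vault / AWS Secrets Manager

class DeploySystem:
    def rollout(self, service, version):
        return f"{service}:{version} rolled out"

class PlatformFacade:
    """Facade sketch: one simplified interface in front of several subsystems.
    Developers call deploy(); they never touch the guts directly."""
    def __init__(self):
        self._secrets = SecretsBackend()
        self._deploys = DeploySystem()

    def deploy(self, service, version):
        self._secrets.fetch(f"{service}/credentials")  # hidden detail
        return self._deploys.rollout(service, version)

platform = PlatformFacade()
print(platform.deploy("checkout", "1.4.2"))
</code></pre>
<p>Swapping the secrets backend changes nothing for the caller; that's the whole point of the facade.</p>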
<h3 id="heading-the-observer-the-controller-loop-you-already-know">The Observer: the controller loop you already know</h3>
<p>The Observer pattern says that when an object's state changes, everything that depends on it gets notified automatically.</p>
<p>The Kubernetes controller loop is exactly that, implemented at cluster scale. There's a desired state (what's in etcd), an actual state (what's running), and controllers that watch the gap and act on it. Argo Rollouts watches Prometheus metrics to decide whether a canary should keep going or roll back. External Secrets Operator watches Vault for changes and syncs secrets accordingly.</p>
<p>Every operator you write or adopt is an implementation of the Observer pattern. Knowing that helps you decide when writing one actually makes sense — and when it's just adding complexity for fun.</p>
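<p>Stripped of the Kubernetes machinery, the loop itself is small. A hedged sketch of the reconcile idea, using plain dicts for desired and actual state:</p>
<pre><code class="lang-python">def reconcile(desired, actual):
    """Observer/controller sketch: watch the gap between desired and
    actual state, and emit the actions that would close it."""
    actions = []
    for name, replicas in desired.items():
        if actual.get(name) != replicas:
            actions.append(f"scale {name} to {replicas}")
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")
    return actions

desired = {"api": 3, "worker": 2}   # what's in etcd
actual = {"api": 1, "legacy": 1}    # what's running
print(reconcile(desired, actual))
</code></pre>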
<h3 id="heading-the-command-why-gitops-makes-sense-from-first-principles">The Command: why GitOps makes sense from first principles</h3>
<p>The Command pattern says that instead of executing an action directly, you wrap it in an object that can be stored, passed around, undone, or audited.</p>
<p>A commit in your GitOps repo is exactly that. It's not just a text change. It's a persisted command — with an author, a timestamp, and the ability to be reversed with a <code>git revert</code>. ArgoCD or Flux is the executor that takes that command and applies it to the cluster.</p>
<p>The question "why GitOps instead of pipelines that just run <code>kubectl apply</code>?" has a deeper answer than "because it's best practice." It has an architectural answer: because separating the command from its execution gives you properties — auditability, reversibility, a single source of truth — that you'd otherwise have to build from scratch.</p>
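<p>The same property, in a toy Python sketch of the Command pattern. Here <code>apply</code> and <code>revert</code> just mutate a dict standing in for the cluster:</p>
<pre><code class="lang-python">import datetime

class ManifestChange:
    """Command sketch: the action is an object that can be stored,
    audited, and reversed, like a commit in a GitOps repo."""
    def __init__(self, author, old, new):
        self.author = author
        self.timestamp = datetime.datetime.now(datetime.timezone.utc)
        self.old, self.new = old, new

    def apply(self, cluster):
        cluster["state"] = self.new

    def revert(self, cluster):  # the `git revert` analogue
        cluster["state"] = self.old

cluster = {"state": "image: v1"}
log = []  # audit trail: every command is persisted, not just executed

change = ManifestChange("alex", cluster["state"], "image: v2")
change.apply(cluster)
log.append(change)
assert cluster["state"] == "image: v2"

log[-1].revert(cluster)
assert cluster["state"] == "image: v1"
</code></pre>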
<hr />
<h2 id="heading-where-this-is-going">Where this is going</h2>
<p>This is just the hook. The rest of the series goes pattern by pattern, connecting each one to real platform decisions:</p>
<ul>
<li><strong>Factory and Builder</strong> → how an IDP thinks when it provisions environments</li>
<li><strong>Facade</strong> → the cost of cognitive load and the Golden Path</li>
<li><strong>Observer</strong> → when to write an operator and when to step back</li>
<li><strong>Command</strong> → GitOps explained from first principles</li>
<li><strong>Adapter</strong> → why the Kubernetes ecosystem is basically a catalog of adapters, and why that's a feature</li>
<li><strong>Strategy</strong> → why hardcoding your deployment strategy into the pipeline is an antipattern</li>
</ul>
<p>The goal isn't for you to finish this series knowing more pattern trivia. The goal is that next time you make a platform decision, you have better vocabulary to explain it, defend it, or challenge it.</p>
<p>There's a practical side to this that I find underrated: patterns also help you diagnose problems. If your pipelines are slow and tightly coupled, that's not just a pipeline problem — that's a symptom of missing separation of concerns, which maps directly to specific patterns that exist precisely to solve it. Once you can name what's wrong architecturally, the solution space gets a lot clearer. You stop googling "how to speed up my CI pipeline" and start asking "what's the right pattern for decoupling this."</p>
<p>Because the difference between a team that operates infrastructure and one that designs a platform isn't the tool stack. It's whether they're making architectural decisions on purpose — or just stumbling into them.</p>
<hr />
<p><em>If this resonates, the next posts go deeper on each pattern. You can follow along or reply directly at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a> — I read everything.</em></p>
]]></content:encoded></item><item><title><![CDATA[Why Git Isn't Enough for ML]]></title><description><![CDATA[The typical setup I see:
bash
ml-project/
├── .git/
├── src/
│   └── train.py
├── data/
│   └── training.csv  # 5GB, in Git
└── models/
    └── model.pkl     # 2GB, in Git

First question: How is this stored in Git?
The usual answer:
bash
# .gitignore
...]]></description><link>https://parraletz.dev/por-que-git-no-es-suficiente-para-ml</link><guid isPermaLink="true">https://parraletz.dev/por-que-git-no-es-suficiente-para-ml</guid><category><![CDATA[mlops]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><category><![CDATA[DVCS]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Wed, 14 Jan 2026 13:07:34 GMT</pubDate><content:encoded><![CDATA[<p><strong>The typical setup I see:</strong></p>
<pre><code class="lang-bash">ml-project/
├── .git/
├── src/
│   └── train.py
├── data/
│   └── training.csv  <span class="hljs-comment"># 5GB, in Git</span>
└── models/
    └── model.pkl     <span class="hljs-comment"># 2GB, in Git</span>
</code></pre>
<p><strong>First question:</strong> How is this stored in Git?</p>
<p><strong>The usual answer:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># .gitignore</span>
data/
models/

<span class="hljs-comment"># So... how do you share the data?</span>
<span class="hljs-comment"># "We keep it in S3"</span>

<span class="hljs-comment"># Versioned?</span>
<span class="hljs-comment"># "...no"</span>
</code></pre>
<hr />
<p><strong>The fundamental problem:</strong></p>
<p><strong>Code is easy:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Version 1</span>
model = Model(lr=<span class="hljs-number">0.01</span>)

<span class="hljs-comment"># Version 2</span>
model = Model(lr=<span class="hljs-number">0.001</span>)

<span class="hljs-comment"># Git diff:</span>
- model = Model(lr=<span class="hljs-number">0.01</span>)
+ model = Model(lr=<span class="hljs-number">0.001</span>)

<span class="hljs-comment"># Clear, concise, mergeable</span>
</code></pre>
<p><strong>Data isn't:</strong></p>
<pre><code class="lang-bash">training.parquet v1: 50GB
training.parquet v2: 55GB

<span class="hljs-comment"># Git diff:</span>
Binary files differ

<span class="hljs-comment"># What changed?</span>
<span class="hljs-comment"># - 5GB of new rows?</span>
<span class="hljs-comment"># - Were existing rows modified?</span>
<span class="hljs-comment"># - Did the schema change?</span>
<span class="hljs-comment"># Who knows?</span>
</code></pre>
<p><strong>The 5 problems with versioning data:</strong></p>
<p><strong>PROBLEM 1: Size</strong></p>
<pre><code class="lang-bash">git add data/training.parquet  <span class="hljs-comment"># 50GB</span>

<span class="hljs-comment"># Git:</span>
error: File data/training.parquet is 50GB
hint: Use Git LFS <span class="hljs-keyword">for</span> files larger than 100MB

git lfs install
git lfs track <span class="hljs-string">"*.parquet"</span>
git add data/training.parquet
git push

<span class="hljs-comment"># GitHub/GitLab:</span>
error: File exceeds 5GB <span class="hljs-built_in">limit</span>
<span class="hljs-comment"># Even with LFS</span>
</code></pre>
<p><strong>PROBLEM 2: Performance</strong></p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> repo

<span class="hljs-comment"># With 50GB dataset in Git:</span>
<span class="hljs-comment"># Clone time: 45 minutes</span>
<span class="hljs-comment"># Disk usage: 100GB (working copy + .git)</span>
<span class="hljs-comment"># CI/CD: times out after 1 hour</span>
</code></pre>
<p><strong>PROBLEM 3: Not text-based</strong></p>
<pre><code class="lang-bash">git diff data/training.csv

<span class="hljs-comment"># A 10M-row CSV:</span>
+row_9874561,value1,value2,...
+row_9874562,value1,value2,...
-row_9874563,old_value1,old_value2,...
...
<span class="hljs-comment"># A 10-million-line diff</span>
<span class="hljs-comment"># Useless</span>
</code></pre>
<p><strong>PROBLEM 4: Merge conflicts</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Branch A: adds 1M rows</span>
<span class="hljs-comment"># Branch B: adds a different 1M rows</span>

git merge branch-b

CONFLICT <span class="hljs-keyword">in</span> data/training.csv
<span class="hljs-comment"># How do you resolve this?</span>
<span class="hljs-comment"># Manually, in a 15M-row CSV?</span>
</code></pre>
<p><strong>PROBLEM 5: Cost</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Git LFS on GitHub</span>
<span class="hljs-comment"># - Free: 1GB storage, 1GB bandwidth</span>
<span class="hljs-comment"># - Paid: $5/month per 50GB storage</span>

<span class="hljs-comment"># A 500GB dataset:</span>
<span class="hljs-comment"># $50/month for storage alone</span>
<span class="hljs-comment"># + bandwidth costs</span>

<span class="hljs-comment"># S3:</span>
<span class="hljs-comment"># $11.50/month per 500GB</span>
<span class="hljs-comment"># Much cheaper</span>
</code></pre>
<hr />
<p><strong>Why data is different from code:</strong></p>
<table>
<thead>
<tr><th>Characteristic</th><th>Code</th><th>Data</th></tr>
</thead>
<tbody>
<tr><td>Size</td><td>KB-MB</td><td>GB-TB</td></tr>
<tr><td>Format</td><td>Text</td><td>Binary (parquet, tfrecord)</td></tr>
<tr><td>Changes</td><td>Specific lines</td><td>Rows/distributions</td></tr>
<tr><td>Merge</td><td>Possible</td><td>Impossible</td></tr>
<tr><td>Diff</td><td>Meaningful</td><td>"Binary files differ"</td></tr>
<tr><td>Storage</td><td>Git works fine</td><td>Git doesn't work</td></tr>
<tr><td>Versioning</td><td>Commits</td><td>Checksums/hashes</td></tr>
</tbody>
</table>
<hr />
<p><strong>What you actually need:</strong></p>
<p>Not just versioned files. Versioned <strong>context</strong>:</p>
<p><strong>1. The dataset itself</strong></p>
<pre><code>training.parquet
  - Size: 50GB
  - Hash: a1b2c3d4e5f6
  - Location: s3://data/training/v2.3.1/
</code></pre>
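<p>That hash is the key idea: a content hash is a stable identity for a blob that Git can't meaningfully diff, and it's what data-versioning tools pin in their pointer files. A minimal sketch of computing one (the helper name is ours, not any tool's API):</p>
<pre><code class="lang-python">import hashlib

def file_sha256(path):
    """Content fingerprint of a dataset file: stable regardless of
    filename or location, so a tiny pointer can stand in for 50GB."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1MB chunks so a 50GB file never has to fit in memory
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()
</code></pre>
<p>Two copies of the same data produce the same fingerprint, which is what makes "did the dataset change?" answerable without downloading or diffing it.</p>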
<p><strong>2. Metadata</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">dataset:</span> <span class="hljs-string">training-v2.3.1</span>
<span class="hljs-attr">created:</span> <span class="hljs-number">2024-01-15</span>
<span class="hljs-attr">rows:</span> <span class="hljs-number">1</span><span class="hljs-string">,234,567</span>
<span class="hljs-attr">columns:</span> <span class="hljs-number">45</span>
<span class="hljs-attr">schema:</span>
  <span class="hljs-attr">user_id:</span> <span class="hljs-string">int64</span>
  <span class="hljs-attr">timestamp:</span> <span class="hljs-string">datetime64</span>
  <span class="hljs-attr">feature_1:</span> <span class="hljs-string">float64</span>
  <span class="hljs-string">...</span>
<span class="hljs-attr">source:</span> <span class="hljs-string">production-db</span>
<span class="hljs-attr">query:</span> <span class="hljs-string">|
  SELECT * FROM events
  WHERE date &gt; '2023-01-01'
  AND user_type = 'active'</span>
</code></pre>
<p><strong>3. Statistics</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">distributions:</span>
  <span class="hljs-attr">feature_1:</span>
    <span class="hljs-attr">mean:</span> <span class="hljs-number">0.547</span>
    <span class="hljs-attr">std:</span> <span class="hljs-number">0.231</span>
    <span class="hljs-attr">min:</span> <span class="hljs-number">0.0</span>
    <span class="hljs-attr">max:</span> <span class="hljs-number">1.0</span>
  <span class="hljs-attr">feature_2:</span>
    <span class="hljs-attr">unique_values:</span> <span class="hljs-number">47</span>
    <span class="hljs-attr">top_3:</span> [<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>]
    <span class="hljs-attr">missing:</span> <span class="hljs-number">2.3</span><span class="hljs-string">%</span>
</code></pre>
<p><strong>4. Lineage</strong></p>
<pre><code>training-v2.3.1
  ← derived from raw-events-v1.5.2
    ← extracted from production-db
      ← populated by application-v2.1.0
</code></pre>
<p><strong>5. Validation</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">checks:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">no_missing_values:</span> <span class="hljs-string">PASS</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">schema_match:</span> <span class="hljs-string">PASS</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">distribution_drift:</span> <span class="hljs-string">WARN</span> <span class="hljs-string">(feature_3</span> <span class="hljs-string">shifted)</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">no_duplicates:</span> <span class="hljs-string">PASS</span>
</code></pre>
<hr />
<p><strong>Real cases where Git failed:</strong></p>
<p><strong>CASE 1: The dataset that "disappeared"</strong></p>
<pre><code>DS: "I need the dataset from 3 months ago"
Me: "Is it in Git?"
DS: "No, it was too big"
Me: "Where is it?"
DS: "In s3://data/training.parquet"
Me: "That gets overwritten every week"
DS: "..."

# Dataset lost forever
# 3 months of work to regenerate it
</code></pre>
<p><strong>CASE 2: The impossible merge conflict</strong></p>
<pre><code class="lang-bash">git merge feature-new-data-source

CONFLICT <span class="hljs-keyword">in</span> data/training.csv

<span class="hljs-comment"># 10 million lines in conflict</span>
<span class="hljs-comment"># Impossible to resolve manually</span>
<span class="hljs-comment"># We had to regenerate the entire dataset</span>
<span class="hljs-comment"># 2 weeks of work lost</span>
</code></pre>
<p><strong>CASE 3: The CI/CD job that never finished</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># .github/workflows/train.yaml</span>
<span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span>
    <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v3</span>

  <span class="hljs-comment"># Downloads 50GB of data from Git LFS</span>
  <span class="hljs-comment"># Times out after 6 hours</span>
  <span class="hljs-comment"># CI/CD never completes</span>
  <span class="hljs-comment"># We can't do automated training</span>
</code></pre>
<hr />
<p><strong>The right solution:</strong></p>
<pre><code>Code   → Git
Data   → Data versioning tool (DVC, LakeFS, etc.)
Models → Model registry (MLflow, etc.)
</code></pre>
<p><strong>Separation of concerns:</strong></p>
<pre><code>┌─────────────────────────────────────┐
│           Git (Source Control)      │
│                                     │
│  - Code                             │
│  - Config                           │
│  - Pipelines                        │
│  - .dvc files (pointers)            │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│     DVC/LakeFS (Data Versioning)    │
│                                     │
│  - Training datasets                │
│  - Validation datasets              │
│  - Feature sets                     │
│  - Metadata                         │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│      MLflow (Model Registry)        │
│                                     │
│  - Trained models                   │
│  - Metrics                          │
│  - Hyperparameters                  │
│  - Artifacts                        │
└─────────────────────────────────────┘
</code></pre>
<hr />
<p><strong>A practical example:</strong></p>
<p><strong>WITHOUT data versioning:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
data = pd.read_parquet(<span class="hljs-string">"s3://data/training.parquet"</span>)
model = train(data)
mlflow.log_model(model, <span class="hljs-string">"model"</span>)

<span class="hljs-comment"># 3 meses después:</span>
<span class="hljs-comment"># "¿Con qué datos se entrenó este modelo?"</span>
<span class="hljs-comment"># "...no sé"</span>
</code></pre>
<p><strong>WITH data versioning:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> dvc.api

<span class="hljs-comment"># Pull specific data version</span>
data_version = <span class="hljs-string">"v2.3.1"</span>
<span class="hljs-keyword">with</span> dvc.api.open(
    <span class="hljs-string">'data/training.parquet'</span>,
    rev=data_version
) <span class="hljs-keyword">as</span> f:
    data = pd.read_parquet(f)

<span class="hljs-comment"># Train</span>
model = train(data)

<span class="hljs-comment"># Log everything</span>
mlflow.log_param(<span class="hljs-string">"data_version"</span>, data_version)
mlflow.log_param(<span class="hljs-string">"data_hash"</span>, get_data_hash())
mlflow.log_param(<span class="hljs-string">"data_rows"</span>, len(data))
mlflow.log_model(model, <span class="hljs-string">"model"</span>)

<span class="hljs-comment"># 3 meses después:</span>
<span class="hljs-comment"># "¿Con qué datos se entrenó?"</span>
<span class="hljs-comment"># "data_version: v2.3.1, hash: a1b2c3"</span>
<span class="hljs-comment"># "Puedo reproducirlo: dvc checkout v2.3.1"</span>
</code></pre>
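<p>The <code>get_data_hash()</code> call above is not part of MLflow or DVC; it's a small helper you write yourself. A minimal sketch (the function name and the 12-character truncation are my own choices), assuming you hash the raw dataset file:</p>
<pre><code class="lang-python">import hashlib

def get_data_hash(path, chunk_size=1024 * 1024):
    """Short, stable content hash of a dataset file.

    Reads in chunks so large parquet files never have to fit
    in memory; identical bytes always produce the same hash.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:12]
</code></pre>
<p>Logging this next to <code>data_version</code> also catches the case where the tag stayed the same but the bytes changed.</p>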
<hr />
<p><strong>The critical questions:</strong></p>
<p>You can now answer:</p>
<p><strong>1. Reproducibility</strong></p>
<pre><code class="lang-bash">¿Puedes reproducir el experimento <span class="hljs-comment">#142?</span>
→ Sí: data v2.1.3 + code commit abc123
</code></pre>
<p><strong>2. Debugging</strong></p>
<pre><code class="lang-bash">¿Por qué el modelo empeoró?
→ Data drift: v2.3.1 vs v2.3.2
→ mean(feature_x): 0.5 → 0.7
</code></pre>
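<p>The drift answer above assumes someone actually computed those means. A tiny, library-free sketch of that check (the function name and output shape are mine, not from any tool):</p>
<pre><code class="lang-python">def mean_shift_report(old, new):
    """Per-feature mean shift between two dataset versions.

    old/new map feature name to a list of values. Returns
    {feature: (old_mean, new_mean, relative_shift)} so a jump
    like mean(feature_x): 0.5 to 0.7 stands out immediately.
    """
    report = {}
    for name in old:
        m_old = sum(old[name]) / len(old[name])
        m_new = sum(new[name]) / len(new[name])
        rel = abs(m_new - m_old) / max(abs(m_old), 1e-9)
        report[name] = (round(m_old, 3), round(m_new, 3), round(rel, 3))
    return report

# v2.3.1 vs v2.3.2, as in the example above
v231 = {"feature_x": [0.4, 0.5, 0.6]}
v232 = {"feature_x": [0.6, 0.7, 0.8]}
print(mean_shift_report(v231, v232))
</code></pre>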
<p><strong>3. Compliance</strong></p>
<pre><code class="lang-bash">¿Qué datos personales contiene este modelo?
→ Dataset v2.1.3
→ Metadata dice: no PII
→ Validation logs confirman
</code></pre>
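<p>That compliance answer presumes the versioned dataset's metadata records its columns. A naive sketch of the "no PII" check (the metadata shape and keyword list are my assumptions; real compliance needs far more than name matching):</p>
<pre><code class="lang-python">def pii_suspects(metadata, keywords=("email", "phone", "ssn", "address")):
    """Flag column names that look like PII in a dataset's metadata.

    metadata: {"columns": [...]} stored alongside the versioned
    dataset. Returns the suspicious columns; an empty list supports
    a "no PII" claim (based on names only, not values).
    """
    columns = metadata.get("columns", [])
    return [c for c in columns if any(k in c.lower() for k in keywords)]
</code></pre>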
<p><strong>4. Rollback</strong></p>
<pre><code class="lang-bash">Necesito volver al dataset de la semana pasada
→ dvc checkout v2.2.9
→ 30 segundos
</code></pre>
<p><strong>Quick comparison:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Method</td><td>Pros</td><td>Cons</td><td>Use case</td></tr>
</thead>
<tbody>
<tr>
<td>Plain Git</td><td>Simple</td><td>Doesn't scale past 100MB</td><td>Small datasets (&lt;100MB)</td></tr>
<tr>
<td>Git LFS</td><td>Git workflow</td><td>Expensive, slow</td><td>Medium datasets (&lt;5GB)</td></tr>
<tr>
<td>DVC</td><td>Open source</td><td>Manual setup</td><td>Data science teams</td></tr>
<tr>
<td>LakeFS</td><td>Git-like</td><td>Complex</td><td>Large data lakes</td></tr>
<tr>
<td>Delta Lake</td><td>ACID</td><td>Databricks-specific</td><td>Spark workflows</td></tr>
</tbody>
</table>
</div><p><strong>Your turn:</strong> How do you version your data today? Or is it "latest.parquet" and a prayer?</p>
]]></content:encoded></item><item><title><![CDATA[Connecting the Worlds - Platform Engineering meets ML]]></title><description><![CDATA[Terraform, Helm, GitOps... to train models? Yes.
Today: how to manage all of it as code.

The problem:
You have your Platform Engineering stack:

Terraform for infrastructure

Helm for applications on K8s

ArgoCD for GitOps

CI/CD pipeline...]]></description><link>https://parraletz.dev/conectando-los-mundos-platform-engineering-meets-ml</link><guid isPermaLink="true">https://parraletz.dev/conectando-los-mundos-platform-engineering-meets-ml</guid><category><![CDATA[mlops]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[gitops]]></category><category><![CDATA[mlengineer]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Tue, 13 Jan 2026 16:01:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768319241143/4516961c-6d05-47ee-a949-1af65a46a4da.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><strong>Terraform, Helm, GitOps... to train models? Yes.</strong></p>
<p>Today: <strong>how to manage all of it as code.</strong></p>
<hr />
<p><strong>The problem:</strong></p>
<p>You have your Platform Engineering stack:</p>
<ul>
<li><p>Terraform for infrastructure</p>
</li>
<li><p>Helm for applications on K8s</p>
</li>
<li><p>ArgoCD for GitOps</p>
</li>
<li><p>CI/CD pipelines</p>
</li>
</ul>
<p>And now ML joins the game.</p>
<p><strong>First reaction:</strong> "It's just another application, right?"</p>
<p><strong>Spoiler: no.</strong></p>
<hr />
<p><strong>What I found out:</strong></p>
<p>ML is not just another application. It's a <strong>complete ecosystem</strong> that needs:</p>
<p><strong>Traditional infrastructure:</strong></p>
<ul>
<li><p>Compute (K8s clusters, VMs)</p>
</li>
<li><p>Storage (S3, persistent volumes)</p>
</li>
<li><p>Networking (load balancers, ingress)</p>
</li>
</ul>
<p><strong>BUT ALSO ML-specific infrastructure:</strong></p>
<ul>
<li><p>Feature stores</p>
</li>
<li><p>Model registries</p>
</li>
<li><p>Experiment tracking</p>
</li>
<li><p>Data versioning</p>
</li>
<li><p>Training pipelines</p>
</li>
<li><p>Serving infrastructure</p>
</li>
</ul>
<p><strong>And all of it must be code. Reproducible. Versioned. With GitOps.</strong></p>
<hr />
<p><strong>The 4 levels of IaC for MLOps:</strong></p>
<p>Terraform is the clear leader here, though AWS CDK and Pulumi are worthy competitors.</p>
<h3 id="heading-nivel-1-infraestructura-base-fundational"><strong>Nivel 1: Infraestructura base (Fundational)</strong></h3>
<pre><code class="lang-ini"><span class="hljs-comment"># terraform/main.tf</span>

<span class="hljs-comment"># Cluster para ML workloads</span>
resource "aws_eks_cluster" "ml_cluster" {
  <span class="hljs-attr">name</span>     = <span class="hljs-string">"ml-${var.environment}"</span>
  <span class="hljs-attr">role_arn</span> = aws_iam_role.ml_cluster.arn

  vpc_config {
    <span class="hljs-attr">subnet_ids</span> = var.private_subnets
  }
}

<span class="hljs-comment"># Node group con GPUs</span>
resource "aws_eks_node_group" "gpu_nodes" {
  <span class="hljs-attr">cluster_name</span>    = aws_eks_cluster.ml_cluster.name
  <span class="hljs-attr">node_group_name</span> = <span class="hljs-string">"gpu-nodes"</span>

  <span class="hljs-attr">instance_types</span> = [<span class="hljs-string">"p3.2xlarge"</span>]

  scaling_config {
    <span class="hljs-attr">desired_size</span> = <span class="hljs-number">2</span>
    <span class="hljs-attr">max_size</span>     = <span class="hljs-number">10</span>
    <span class="hljs-attr">min_size</span>     = <span class="hljs-number">0</span>  <span class="hljs-comment"># Scale to zero cuando no se use</span>
  }
}

<span class="hljs-comment"># Storage para datasets</span>
resource "aws_s3_bucket" "ml_data" {
  <span class="hljs-attr">bucket</span> = <span class="hljs-string">"ml-data-${var.environment}"</span>

  versioning {
    <span class="hljs-attr">enabled</span> = <span class="hljs-literal">true</span>  <span class="hljs-comment"># Importante para data versioning</span>
  }
}
</code></pre>
<p><strong>This should look familiar. It's plain Terraform.</strong></p>
<hr />
<h3 id="heading-nivel-2-ml-platform-components"><strong>Nivel 2: ML Platform components</strong></h3>
<pre><code class="lang-yaml"><span class="hljs-comment"># helm/values.yaml</span>

<span class="hljs-comment"># MLflow para experiment tracking</span>
<span class="hljs-attr">mlflow:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">backend_store:</span> <span class="hljs-string">postgresql</span>
  <span class="hljs-attr">artifact_store:</span> <span class="hljs-string">s3://ml-artifacts-prod</span>

<span class="hljs-comment"># Kubeflow Pipelines (si lo usas)</span>
<span class="hljs-attr">kubeflow:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">pipelines:</span>
    <span class="hljs-attr">persistence:</span>
      <span class="hljs-attr">storageClass:</span> <span class="hljs-string">gp3</span>
      <span class="hljs-attr">size:</span> <span class="hljs-string">100Gi</span>

<span class="hljs-comment"># Model serving (Seldon, KServe, etc.)</span>
<span class="hljs-attr">seldon:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">nvidia.com/gpu:</span> <span class="hljs-number">1</span>
</code></pre>
<p><strong>This is where Platform Engineering and ML meet.</strong></p>
<hr />
<h3 id="heading-nivel-3-training-pipelines-como-codigo"><strong>Nivel 3: Training pipelines como código</strong></h3>
<pre><code class="lang-yaml"><span class="hljs-comment"># pipelines/training-pipeline.yaml</span>

<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">argoproj.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Workflow</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">training-pipeline</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entrypoint:</span> <span class="hljs-string">train-model</span>

  <span class="hljs-attr">arguments:</span>
    <span class="hljs-attr">parameters:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">dataset-version</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"v2.3.1"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">environment</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"staging"</span>

  <span class="hljs-attr">templates:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">train-model</span>
    <span class="hljs-attr">steps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">ingest-data</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">ingest</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">validate-data</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">validate</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">train</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">training</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">evaluate</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">evaluation</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">register</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">model-registration</span>

  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">ingest</span>
    <span class="hljs-attr">container:</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">ml-pipeline/ingest:{{workflow.parameters.dataset-version}}</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">python</span>, <span class="hljs-string">ingest.py</span>]
      <span class="hljs-attr">env:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DATASET_VERSION</span>
        <span class="hljs-attr">value:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{workflow.parameters.dataset-version}}</span>"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">S3_BUCKET</span>
        <span class="hljs-attr">value:</span> <span class="hljs-string">"ml-data-<span class="hljs-template-variable">{{workflow.parameters.environment}}</span>"</span>
</code></pre>
<p><strong>This is where it all connects: infra + pipelines + GitOps.</strong></p>
<hr />
<h3 id="heading-nivel-4-gitops-para-ml"><strong>Nivel 4: GitOps para ML</strong></h3>
<p>And as you know, around here we are GitOps friendly.</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># argocd/ml-application.yaml</span>

<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">argoproj.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Application</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">ml-platform</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">argocd</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">project:</span> <span class="hljs-string">default</span>

  <span class="hljs-attr">source:</span>
    <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/company/ml-platform</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">main</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">k8s/</span>

    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">values:</span> <span class="hljs-string">|
        environment: staging
</span>
        <span class="hljs-attr">mlflow:</span>
          <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">version:</span> <span class="hljs-string">"2.9.0"</span>

        <span class="hljs-attr">training_pipelines:</span>
          <span class="hljs-attr">schedule:</span> <span class="hljs-string">"0 2 * * 0"</span>  <span class="hljs-comment"># Weekly</span>
          <span class="hljs-attr">resources:</span>
            <span class="hljs-attr">gpu:</span> <span class="hljs-literal">true</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">32Gi</span>

  <span class="hljs-attr">destination:</span>
    <span class="hljs-attr">server:</span> <span class="hljs-string">https://kubernetes.default.svc</span>
    <span class="hljs-attr">namespace:</span> <span class="hljs-string">ml-staging</span>

  <span class="hljs-attr">syncPolicy:</span>
    <span class="hljs-attr">automated:</span>
      <span class="hljs-attr">prune:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">selfHeal:</span> <span class="hljs-literal">true</span>
</code></pre>
<p><strong>Commit → Push → ArgoCD sync → Pipeline deployed.</strong></p>
<p><strong>Just like your other applications. But for ML.</strong></p>
<hr />
<p><strong>The complete integration:</strong></p>
<pre><code class="lang-ini">┌─────────────────────────────────────────────────┐
│              Git Repository                     │
│                                                 │
│  terraform/        → Infra base                 │
│  helm/            → ML platform                 │
│  pipelines/       → Training workflows          │
│  models/          → Model definitions           │
└─────────────┬───────────────────────────────────┘
              │
              │ git push
              │
     ┌────────▼────────┐
     │   CI Pipeline   │
     │                 │
     │  - Validate     │
     │  - Test         │
     │  - Build images │
     └────────┬────────┘
              │
              │ if tests pass
              │
     ┌────────▼────────┐
     │    ArgoCD       │
     │                 │
     │  - Sync infra   │
     │  - Deploy ML    │
     │  - Trigger      │
     └────────┬────────┘
              │
              │
     ┌────────▼─────────────────────────────┐
     │         Kubernetes                   │
     │                                      │
     │  - ML Platform running               │
     │  - Pipelines scheduled               │
     │  - Models serving                    │
     └──────────────────────────────────────┘
</code></pre>
<p><strong>Real example: deploying a new model</strong></p>
<p><strong>Before (manual):</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># 1. SSH al cluster</span>
ssh ml-cluster-prod

<span class="hljs-comment"># 2. Actualizar manualmente</span>
kubectl apply -f model-deployment.yaml

<span class="hljs-comment"># 3. Esperar y rezar</span>
kubectl get pods -w

<span class="hljs-comment"># 4. Si falla, rollback manual</span>
kubectl rollback deployment/model-server
</code></pre>
<p><strong>Now (GitOps):</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># 1. Update en Git</span>
cat &gt; k8s/model-deployment.yaml &lt;&lt;EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  template:
    spec:
      containers:
      - name: model
        image: models/fraud-detection:v2.3.1  <span class="hljs-comment"># The change goes here</span>
EOF

<span class="hljs-comment"># 2. Commit y push</span>
git add k8s/model-deployment.yaml
git commit -m <span class="hljs-string">"Deploy model v2.3.1 to staging"</span>
git push

<span class="hljs-comment"># 3. ArgoCD hace el resto</span>
<span class="hljs-comment"># - Detecta cambio</span>
<span class="hljs-comment"># - Valida</span>
<span class="hljs-comment"># - Canary deploy</span>
<span class="hljs-comment"># - Rollback automático si falla</span>
</code></pre>
<hr />
<p><strong>The key components:</strong></p>
<p><strong>Terraform for infrastructure:</strong></p>
<pre><code class="lang-ini">module "ml_platform" {
  <span class="hljs-attr">source</span> = <span class="hljs-string">"./modules/ml-platform"</span>

  <span class="hljs-attr">environment</span> = var.environment

  <span class="hljs-comment"># Compute</span>
  <span class="hljs-attr">gpu_nodes</span>     = <span class="hljs-number">4</span>
  <span class="hljs-attr">cpu_nodes</span>     = <span class="hljs-number">10</span>

  <span class="hljs-comment"># Storage</span>
  <span class="hljs-attr">data_bucket</span>   = <span class="hljs-string">"ml-data-${var.environment}"</span>
  <span class="hljs-attr">model_bucket</span>  = <span class="hljs-string">"ml-models-${var.environment}"</span>

  <span class="hljs-comment"># Networking</span>
  <span class="hljs-attr">vpc_id</span>        = module.vpc.vpc_id

  <span class="hljs-comment"># ML-specific</span>
  <span class="hljs-attr">mlflow_backend</span>    = module.rds.endpoint
  <span class="hljs-attr">feature_store</span>     = <span class="hljs-string">"enabled"</span>
  <span class="hljs-attr">model_registry</span>    = <span class="hljs-string">"enabled"</span>
}
</code></pre>
<p><strong>Helm for ML components:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Chart.yaml</span>
<span class="hljs-attr">dependencies:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">mlflow</span>
  <span class="hljs-attr">version:</span> <span class="hljs-string">"0.7.0"</span>
  <span class="hljs-attr">repository:</span> <span class="hljs-string">https://charts.mlflow.org</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">kubeflow-pipelines</span>
  <span class="hljs-attr">version:</span> <span class="hljs-string">"2.0.0"</span>
  <span class="hljs-attr">repository:</span> <span class="hljs-string">https://kubeflow.github.io/pipelines</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">seldon-core</span>
  <span class="hljs-attr">version:</span> <span class="hljs-string">"1.15.0"</span>
  <span class="hljs-attr">repository:</span> <span class="hljs-string">https://storage.googleapis.com/seldon-charts</span>
</code></pre>
<p><strong>ArgoCD for synchronization:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Sync policy para ML workloads</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">automated:</span>
    <span class="hljs-attr">prune:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">selfHeal:</span> <span class="hljs-literal">true</span>

  <span class="hljs-attr">syncOptions:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>

  <span class="hljs-attr">retry:</span>
    <span class="hljs-attr">limit:</span> <span class="hljs-number">5</span>
    <span class="hljs-attr">backoff:</span>
      <span class="hljs-attr">duration:</span> <span class="hljs-string">5s</span>
      <span class="hljs-attr">factor:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">maxDuration:</span> <span class="hljs-string">3m</span>
</code></pre>
<p><strong>CI/CD for pipelines:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># .github/workflows/ml-pipeline.yaml</span>
<span class="hljs-attr">name:</span> <span class="hljs-string">ML</span> <span class="hljs-string">Pipeline</span> <span class="hljs-string">CI</span>

<span class="hljs-attr">on:</span>
  <span class="hljs-attr">push:</span>
    <span class="hljs-attr">paths:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">'pipelines/**'</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">'models/**'</span>

<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">validate:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Validate</span> <span class="hljs-string">pipeline</span> <span class="hljs-string">syntax</span>
      <span class="hljs-attr">run:</span> <span class="hljs-string">|
        argo lint pipelines/training-pipeline.yaml
</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Test</span> <span class="hljs-string">with</span> <span class="hljs-string">sample</span> <span class="hljs-string">data</span>
      <span class="hljs-attr">run:</span> <span class="hljs-string">|
        pytest tests/pipeline_test.py
</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">container</span> <span class="hljs-string">images</span>
      <span class="hljs-attr">run:</span> <span class="hljs-string">|</span>
        <span class="hljs-string">docker</span> <span class="hljs-string">build</span> <span class="hljs-string">-t</span> <span class="hljs-string">ml-pipeline/ingest:${{</span> <span class="hljs-string">github.sha</span> <span class="hljs-string">}}</span>
</code></pre>
<hr />
<p><strong>The IaC challenges specific to ML:</strong></p>
<p><strong>CHALLENGE 1: State management for models</strong></p>
<pre><code class="lang-ini">Software tradicional:
- Deploy nueva versión
- Old version se borra

ML:
- Deploy modelo v2
- Modelo v1 sigue sirviendo (A/B testing)
- Modelo v0 en rollback standby
- ¿Cómo manejas state?
</code></pre>
<p><strong>Solution:</strong></p>
<pre><code class="lang-ini"><span class="hljs-comment"># Terraform state para múltiples versiones</span>
resource "kubernetes_deployment" "model" {
  <span class="hljs-attr">for_each</span> = var.model_versions

  metadata {
    <span class="hljs-attr">name</span> = <span class="hljs-string">"model-${each.key}"</span>
  }

  spec {
    <span class="hljs-attr">replicas</span> = each.value.traffic_percentage &gt; <span class="hljs-number">0</span> ? <span class="hljs-number">2</span> : <span class="hljs-number">0</span>
  }
}
</code></pre>
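<p>The <code>for_each</code> ternary above is easy to sanity-check outside Terraform. A sketch of the same replica rule in plain Python (the traffic map is a made-up example):</p>
<pre><code class="lang-python">def replicas_for(traffic):
    """Mirror of the Terraform rule above: any model version that
    receives traffic runs 2 replicas; a 0% version scales to 0 but
    keeps its Deployment defined for instant rollback."""
    return {version: (2 if pct else 0) for version, pct in traffic.items()}

print(replicas_for({"v2": 80, "v1": 20, "v0": 0}))
# {'v2': 2, 'v1': 2, 'v0': 0}
</code></pre>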
<hr />
<p><strong>CHALLENGE 2: Dynamic resources</strong></p>
<pre><code class="lang-ini">Training pipeline necesita:
- 8 GPUs por 4 horas
- Después, scale a 0

Serving necesita:
- 2 GPUs 24/7
- Auto-scale basado en requests
</code></pre>
<p><strong>Solution:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Karpenter para auto-scaling inteligente</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">karpenter.sh/v1alpha5</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Provisioner</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gpu-training</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">requirements:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">node.kubernetes.io/instance-type</span>
    <span class="hljs-attr">operator:</span> <span class="hljs-string">In</span>
    <span class="hljs-attr">values:</span> [<span class="hljs-string">"p3.2xlarge"</span>, <span class="hljs-string">"p3.8xlarge"</span>]

  <span class="hljs-attr">ttlSecondsAfterEmpty:</span> <span class="hljs-number">300</span>  <span class="hljs-comment"># Terminate si no se usa</span>

  <span class="hljs-attr">limits:</span>
    <span class="hljs-attr">resources:</span>
      <span class="hljs-attr">nvidia.com/gpu:</span> <span class="hljs-number">32</span>  <span class="hljs-comment"># Max GPUs</span>
</code></pre>
<hr />
<p><strong>CHALLENGE 3: Data versioning in IaC</strong></p>
<pre><code class="lang-ini">Código: Git
Containers: Registry
Infraestructura: Terraform state

¿Datasets de gigantes?
</code></pre>
<p><strong>Solution:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># DVC como parte del pipeline</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">dvc-config</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-string">.dvc/config:</span> <span class="hljs-string">|
    [core]
        remote = s3
    ['remote "s3"']
        url = s3://ml-data-prod/dvc-cache
</span>
<span class="hljs-comment"># Referencia en pipeline</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">fetch-data</span>
  <span class="hljs-attr">container:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">dvc:latest</span>
    <span class="hljs-attr">command:</span> [<span class="hljs-string">dvc</span>, <span class="hljs-string">pull</span>, <span class="hljs-string">data/training.dvc</span>]
    <span class="hljs-attr">env:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DVC_REMOTE</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">s3</span>
</code></pre>
<hr />
<p><strong>Recommended complete stack:</strong></p>
<pre><code class="lang-bash">Infrastructure Layer:
├── Terraform         → Cloud resources (K8s, S3, RDS)
├── Helm             → ML platform components
└── ArgoCD           → GitOps sync

ML Platform Layer:
├── MLflow           → Experiment tracking
├── DVC              → Data versioning
├── Kubeflow/Airflow → Pipeline orchestration
└── Seldon/KServe    → Model serving

Pipeline Layer:
├── Argo Workflows   → Training execution
├── GitHub Actions   → CI/CD
└── Prometheus       → Monitoring

Data Layer:
├── S3/GCS           → Data lake
├── PostgreSQL       → Metadata
└── Feature Store    → Feature management
</code></pre>
<p><strong>My actual workflow:</strong></p>
<p><strong>1. Define the infra (Terraform):</strong></p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> terraform/environments/staging
terraform plan
terraform apply
</code></pre>
<p><strong>2. Deploy ML platform (Helm + ArgoCD):</strong></p>
<pre><code class="lang-bash">git add helm/ml-platform/
git commit -m <span class="hljs-string">"Deploy MLflow + Kubeflow to staging"</span>
git push  <span class="hljs-comment"># ArgoCD picks it up</span>
</code></pre>
<p><strong>3. Define the pipeline (YAML):</strong></p>
<pre><code class="lang-bash">git add pipelines/training-v2.yaml
git commit -m <span class="hljs-string">"Update training pipeline"</span>
git push  <span class="hljs-comment"># ArgoCD deploys</span>
</code></pre>
<p><strong>4. Trigger training:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Scheduled (automático via CronWorkflow)</span>
<span class="hljs-comment"># O manual:</span>
argo submit pipelines/training-v2.yaml \
  --parameter dataset-version=v2.3.1
</code></pre>
<p><strong>5. Deploy the model (GitOps):</strong></p>
<pre><code class="lang-bash">git add k8s/model-server.yaml
git commit -m <span class="hljs-string">"Deploy model v2.3.1"</span>
git push  <span class="hljs-comment"># ArgoCD deploys con canary</span>
</code></pre>
<p><strong>Everything as code. Everything in Git. Everything reproducible.</strong></p>
<hr />
<p><strong>The mistakes I've seen:</strong></p>
<p><strong>MISTAKE 1: "Manual infra, automated pipelines"</strong></p>
<pre><code class="lang-bash">Resultado: Pipelines automatizados corriendo en infra manual
Alguien borra el cluster por error
Pipelines fallan, nadie sabe cómo reconstruir
</code></pre>
<p><strong>MISTAKE 2: "Terraform for everything, even models"</strong></p>
<pre><code class="lang-bash">Resultado: Model registry en Terraform state
Deploy nuevo modelo = terraform apply (lento)
Rollback = terraform plan/apply (muy lento)
No es el tool correcto para esto
</code></pre>
<p><strong>MISTAKE 3: "GitOps without CI"</strong></p>
<pre><code class="lang-bash">Resultado: Push directo a main sin validación
Pipeline syntax error deployado a prod
Cluster crashea
</code></pre>
<p><strong>MISTAKE 4: "IaC only in prod"</strong></p>
<pre><code class="lang-bash">Resultado: Dev y staging son snowflakes
<span class="hljs-string">"Funciona en mi máquina"</span> nivel infraestructura
Debugging imposible
</code></pre>
<hr />
<p><strong>My pragmatic recommendation:</strong></p>
<p><strong>Phase 1 - Infra as Code:</strong></p>
<pre><code class="lang-bash">- Terraform para cluster base
- Manual deploy de ML platform
- Scripts para pipelines
</code></pre>
<p><strong>Phase 2 - Platform as Code:</strong></p>
<pre><code class="lang-bash">- Helm para ML components
- CI/CD para validación
- Semi-automated deploys
</code></pre>
<p><strong>Phase 3 - Everything as Code:</strong></p>
<pre><code class="lang-bash">- Full GitOps con ArgoCD
- Pipelines como YAML en Git
- Automated canary deploys
</code></pre>
<p><strong>Start at Phase 1. Evolve when you need to scale.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Environments (dev / staging / prod)]]></title><description><![CDATA["I just need to change the ENV variable to 'prod'" - Spoiler: No
Yesterday: when to run your pipelines. Today: where to run them.
Dev, staging, production. You know this from traditional software.
But in ML, environments are much more than code.

A story:
Dat...]]></description><link>https://parraletz.dev/environments-dev-staging-prod</link><guid isPermaLink="true">https://parraletz.dev/environments-dev-staging-prod</guid><category><![CDATA[mlops]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Mon, 12 Jan 2026 06:00:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768193348200/4c7e8feb-0e98-4927-a727-e9ecee06a415.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong><em>"I just need to change the ENV variable to 'prod'"</em> - Spoiler: No</strong></p>
<p>Yesterday: when to run your pipelines. Today: <strong>where to run them.</strong></p>
<p>Dev, staging, production. You know this from traditional software.</p>
<p>But in ML, environments are much more than code.</p>
<hr />
<p><strong>A story:</strong></p>
<p>Data Scientist: "I finished the model, push it to production" Me: "Did you test it in staging?" DS: "We don't have staging for ML" Me: "How come?" DS: "It's the same code, you just change the database connection"</p>
<p><strong>3 weeks later:</strong></p>
<p>The model in production degraded by 15%.</p>
<p>Why? Because production data was different from development data.</p>
<p>Nobody had tested it with real prod data before the deploy.</p>
<p><strong>The fundamental problem:</strong></p>
<p>In traditional software:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> ENV == <span class="hljs-string">"dev"</span>:
    db = <span class="hljs-string">"localhost:5432"</span>
<span class="hljs-keyword">elif</span> ENV == <span class="hljs-string">"prod"</span>:
    db = <span class="hljs-string">"prod-db:5432"</span>
</code></pre>
<p><strong>That's it. Same code, different configuration.</strong></p>
<hr />
<p>In ML:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> ENV == <span class="hljs-string">"dev"</span>:
    db = <span class="hljs-string">"localhost:5432"</span>
    dataset = <span class="hljs-string">"sample_1000_rows.csv"</span>  <span class="hljs-comment"># Subset</span>
    model = <span class="hljs-string">"model_v1_local.pkl"</span>
    features = [<span class="hljs-string">"basic_feature_set"</span>]
    monitoring = <span class="hljs-literal">False</span>

<span class="hljs-keyword">elif</span> ENV == <span class="hljs-string">"prod"</span>:
    db = <span class="hljs-string">"prod-db:5432"</span>
    dataset = <span class="hljs-string">"s3://prod/full_dataset/"</span>  <span class="hljs-comment"># 500GB</span>
    model = <span class="hljs-string">"s3://models/prod/v2.3.1/"</span>
    features = [<span class="hljs-string">"basic + advanced + experimental"</span>]
    monitoring = <span class="hljs-literal">True</span>
    drift_detection = <span class="hljs-literal">True</span>
    A_B_testing = <span class="hljs-literal">True</span>
</code></pre>
<p><strong>It's not just configuration. These are completely different assets.</strong></p>
<hr />
<p><strong>The 3 kinds of separation you need:</strong></p>
<h3 id="heading-1-separacion-de-datos"><strong>1. Data Separation</strong></h3>
<pre><code class="lang-bash">DEV:
├── datasets/
│   ├── sample_1k.parquet      <span class="hljs-comment"># Subset for development</span>
│   ├── synthetic_data.parquet  <span class="hljs-comment"># Synthetic data</span>
│   └── anonymized_prod.parquet <span class="hljs-comment"># Anonymized sample</span>

STAGING:
├── datasets/
│   ├── prod_snapshot.parquet   <span class="hljs-comment"># Prod snapshot (last week)</span>
│   └── validation_set.parquet  <span class="hljs-comment"># Fixed validation set</span>

PROD:
├── datasets/
│   ├── realtime_stream/        <span class="hljs-comment"># Real data, streaming in</span>
│   └── historical/             <span class="hljs-comment"># Full history</span>
</code></pre>
<p><strong>Why?</strong></p>
<ul>
<li><p>Dev: you need to iterate fast (1k rows, not 500GB)</p>
</li>
<li><p>Staging: you need to validate with realistic data</p>
</li>
<li><p>Prod: you need all the real data</p>
</li>
</ul>
<p><strong>You can't train in dev on 1k rows and expect it to work in prod on 500GB.</strong></p>
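<p>One way to keep this honest is to make dataset selection a function of the environment, so nobody accidentally trains on the wrong slice. A minimal sketch (the helper name and split rules are illustrative, not a real loader):</p>

```python
import random

def select_dataset(env: str, all_rows: list) -> list:
    """Pick the slice of data each environment is allowed to see."""
    if env == "dev":
        # Small random sample: fast iteration, no full-data dependency
        return random.sample(all_rows, min(1000, len(all_rows)))
    if env == "staging":
        # Prod-like slice: here, the most recent 10% of rows
        cut = int(len(all_rows) * 0.9)
        return all_rows[cut:]
    if env == "prod":
        return all_rows  # everything, always
    raise ValueError(f"unknown environment: {env!r}")
```

<p>In practice the dev branch would read a pre-built sample file and staging a prod snapshot; the point is that the choice lives in one place instead of being scattered across notebooks.</p>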
<hr />
<h3 id="heading-2-separacion-de-modelos"><strong>2. Separación de Modelos</strong></h3>
<pre><code class="lang-markdown"># Model registry per environment

DEV:
- Experiments: 1000+ models
- No strict validation
- Anyone can register

STAGING:
- Candidates: ~10 models passing validation
- Metrics &gt; baseline required
- Review before promoting

PROD:
- Active: 1-2 models serving traffic
- A/B testing between versions
- Automatic rollback on failure
<span class="hljs-code">```

**El flujo:**
```</span>
DEV → train 100 experiments
<span class="hljs-code">    → pick best
    → promote to STAGING
</span>
STAGING → validate con prod-like data
<span class="hljs-code">        → compare vs prod baseline
        → stress test
        → promote to PROD (if approved)
</span>
PROD → shadow mode (sin tráfico real)
<span class="hljs-code">     → canary deploy (5% tráfico)
     → gradual rollout (100%)</span>
</code></pre>
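<p>That promotion flow can be encoded as a small gate so a model can never skip a stage. A sketch with illustrative names (<code>BASELINE_ACCURACY</code> and the metric keys are assumptions, not a real registry API):</p>

```python
STAGES = ["dev", "staging", "prod"]
BASELINE_ACCURACY = 0.85  # prod baseline to beat (illustrative number)

def next_stage(current: str) -> str:
    """Models move one stage at a time: dev -> staging -> prod."""
    i = STAGES.index(current)
    if i == len(STAGES) - 1:
        raise ValueError("already in prod")
    return STAGES[i + 1]

def can_promote(current_stage: str, metrics: dict) -> bool:
    """Gate: dev -> staging is cheap; staging -> prod must beat the baseline."""
    target = next_stage(current_stage)
    if target == "staging":
        return metrics.get("accuracy", 0.0) > 0.0  # any trained model qualifies
    # target == "prod": must beat the current prod baseline
    return metrics["accuracy"] > BASELINE_ACCURACY
```

<p>The real version would also run the stress test and canary check before prod, but the shape stays the same: one function, one place where promotion rules live.</p>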
<hr />
<h3 id="heading-3-separacion-de-features"><strong>3. Separación de Features</strong></h3>
<pre><code class="lang-python"><span class="hljs-comment"># Feature store per environment</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">FeatureStore</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_features</span>(<span class="hljs-params">self, env</span>):</span>
        <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
            <span class="hljs-keyword">return</span> [
                <span class="hljs-string">"basic_features"</span>,
                <span class="hljs-string">"experimental_features"</span>  <span class="hljs-comment"># Puedes romper cosas</span>
            ]

        <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"staging"</span>:
            <span class="hljs-keyword">return</span> [
                <span class="hljs-string">"basic_features"</span>,
                <span class="hljs-string">"tested_features"</span>  <span class="hljs-comment"># Solo features validadas</span>
            ]

        <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"prod"</span>:
            <span class="hljs-keyword">return</span> [
                <span class="hljs-string">"basic_features"</span>,
                <span class="hljs-string">"tested_features"</span>,
                <span class="hljs-string">"monitored_features"</span>  <span class="hljs-comment"># + monitoring activo</span>
            ]
</code></pre>
<p><strong>A real multi-environment pipeline example:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># A pipeline that works across all environments</span>

<span class="hljs-meta">@pipeline</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_pipeline</span>(<span class="hljs-params">env: str</span>):</span>

    <span class="hljs-comment"># 1. Datos diferentes por ambiente</span>
    <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
        data = load_sample_data(n_rows=<span class="hljs-number">1000</span>)
    <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"staging"</span>:
        data = load_snapshot(<span class="hljs-string">"prod_last_week"</span>)
    <span class="hljs-keyword">else</span>:  <span class="hljs-comment"># prod</span>
        data = load_full_dataset()

    <span class="hljs-comment"># 2. Validación diferente por ambiente</span>
    <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
        <span class="hljs-comment"># Validación light</span>
        <span class="hljs-keyword">assert</span> data <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>
    <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"staging"</span>:
        <span class="hljs-comment"># Validación completa</span>
        validate_schema(data)
        validate_distributions(data)
    <span class="hljs-keyword">else</span>:  <span class="hljs-comment"># prod</span>
        <span class="hljs-comment"># Validación + alerting</span>
        validate_and_alert(data)

    <span class="hljs-comment"># 3. Training con recursos diferentes</span>
    <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
        model = train(data, resources=<span class="hljs-string">"small"</span>)
    <span class="hljs-keyword">else</span>:
        model = train(data, resources=<span class="hljs-string">"large"</span>)

    <span class="hljs-comment"># 4. Evaluation con thresholds diferentes</span>
    metrics = evaluate(model, data)

    <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
        <span class="hljs-comment"># Cualquier resultado está OK</span>
        register_model(model, env=<span class="hljs-string">"dev"</span>)
    <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"staging"</span>:
        <span class="hljs-comment"># Debe superar baseline</span>
        <span class="hljs-keyword">if</span> metrics.accuracy &gt; BASELINE:
            register_model(model, env=<span class="hljs-string">"staging"</span>)
    <span class="hljs-keyword">else</span>:  <span class="hljs-comment"># prod</span>
        <span class="hljs-comment"># Debe superar baseline + pasar A/B test</span>
        <span class="hljs-keyword">if</span> metrics.accuracy &gt; BASELINE:
            deploy_canary(model)
            <span class="hljs-keyword">if</span> canary_metrics_ok():
                promote_to_prod(model)
</code></pre>
<hr />
<p><strong>The infrastructure changes too:</strong></p>
<pre><code class="lang-bash">DEV:
- Local machine or shared VM
- CPU only (GPUs are expensive for dev)
- No redundancy
- Can fail without consequences

STAGING:
- Dedicated cluster
- Similar to prod (but smaller)
- With monitoring (but no critical alerting)
- Prod data, but anonymized

PROD:
- High-availability cluster
- GPUs/resources as needed
- Redundancy + auto-scaling
- 24/7 monitoring + alerting
- Backup + disaster recovery
</code></pre>
<hr />
<p><strong>The most common mistakes:</strong></p>
<p><strong>ERROR 1: "We don't have staging for ML"</strong></p>
<pre><code class="lang-bash">Result: deploy straight to production
First time you see real data: in prod
When something fails: you hit users
</code></pre>
<p><strong>ERROR 2: "Staging uses the same data as dev"</strong></p>
<pre><code class="lang-bash">Result: you validate on a 1k-row sample
Works in staging (1k rows)
Fails in prod (500GB) on memory/performance
</code></pre>
<p><strong>ERROR 3: "Prod and staging share infrastructure"</strong></p>
<pre><code class="lang-bash">Result: an experiment in staging eats resources
Prod degrades because staging stole its memory
Incident at 3am
</code></pre>
<p><strong>ERROR 4: "A single model registry for everything"</strong></p>
<pre><code class="lang-bash">Result: 1000 experiments mixed in with prod models
Nobody knows which model is in production
Rollback impossible
</code></pre>
<hr />
<p><strong>Recommended architecture:</strong></p>
<pre><code class="lang-bash">┌─────────────────────────────────────────────────┐
│                    DEV                          │
│  - Local/shared VM                              │
│  - Sample data (1k-10k rows)                    │
│  - Rapid iteration                              │
│  - No strict validation                         │
└─────────────┬───────────────────────────────────┘
              │ promote best experiment
              ▼
┌─────────────────────────────────────────────────┐
│                  STAGING                        │
│  - Prod-like cluster (smaller)                  │
│  - Prod snapshot data (last week/month)         │
│  - Full validation pipeline                     │
│  - Metrics vs prod baseline                     │
└─────────────┬───────────────────────────────────┘
              │ if approved
              ▼
┌─────────────────────────────────────────────────┐
│                   PROD                          │
│  - HA cluster                                   │
│  - Real-time data                               │
│  - Shadow → Canary → Full rollout               │
│  - 24/7 monitoring                              │
└─────────────────────────────────────────────────┘
</code></pre>
<hr />
<p><strong>Practical implementation:</strong></p>
<p><strong>Option 1: Namespaces (if you use K8s)</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># dev namespace</span>
<span class="hljs-attr">namespace:</span> <span class="hljs-string">ml-dev</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-attr">memory:</span> <span class="hljs-string">4GB</span>
  <span class="hljs-attr">cpu:</span> <span class="hljs-number">2</span>

<span class="hljs-comment"># staging namespace  </span>
<span class="hljs-attr">namespace:</span> <span class="hljs-string">ml-staging</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-attr">memory:</span> <span class="hljs-string">16GB</span>
  <span class="hljs-attr">cpu:</span> <span class="hljs-number">8</span>

<span class="hljs-comment"># prod namespace</span>
<span class="hljs-attr">namespace:</span> <span class="hljs-string">ml-prod</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-attr">memory:</span> <span class="hljs-string">64GB</span>
  <span class="hljs-attr">cpu:</span> <span class="hljs-number">32</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
</code></pre>
<p><strong>Option 2: Separate accounts (AWS/GCP/Azure)</strong></p>
<pre><code class="lang-bash">dev-account/
├── s3://dev-ml-data/
├── dev-model-registry
└── dev-feature-store

staging-account/
├── s3://staging-ml-data/
├── staging-model-registry
└── staging-feature-store

prod-account/
├── s3://prod-ml-data/
├── prod-model-registry
└── prod-feature-store
</code></pre>
<p><strong>Option 3: Prefixes (if you can't separate infrastructure)</strong></p>
<pre><code class="lang-python">CONFIGS = {
    <span class="hljs-string">"dev"</span>: {
        <span class="hljs-string">"data_bucket"</span>: <span class="hljs-string">"ml-data"</span>,
        <span class="hljs-string">"data_prefix"</span>: <span class="hljs-string">"dev/"</span>,
        <span class="hljs-string">"model_registry"</span>: <span class="hljs-string">"models/dev/"</span>,
    },
    <span class="hljs-string">"staging"</span>: {
        <span class="hljs-string">"data_bucket"</span>: <span class="hljs-string">"ml-data"</span>,
        <span class="hljs-string">"data_prefix"</span>: <span class="hljs-string">"staging/"</span>,
        <span class="hljs-string">"model_registry"</span>: <span class="hljs-string">"models/staging/"</span>,
    },
    <span class="hljs-string">"prod"</span>: {
        <span class="hljs-string">"data_bucket"</span>: <span class="hljs-string">"ml-data-prod"</span>,  <span class="hljs-comment"># Bucket separado</span>
        <span class="hljs-string">"data_prefix"</span>: <span class="hljs-string">"prod/"</span>,
        <span class="hljs-string">"model_registry"</span>: <span class="hljs-string">"models/prod/"</span>,
    }
}
</code></pre>
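<p>A small helper on top of a dict like this keeps path construction in one place and fails loudly on typos. A sketch reusing the illustrative bucket names above:</p>

```python
CONFIGS = {
    "dev":     {"data_bucket": "ml-data",      "data_prefix": "dev/"},
    "staging": {"data_bucket": "ml-data",      "data_prefix": "staging/"},
    "prod":    {"data_bucket": "ml-data-prod", "data_prefix": "prod/"},
}

def data_uri(env: str, filename: str) -> str:
    """Build the full object URI for an environment."""
    if env not in CONFIGS:
        raise KeyError(f"unknown environment: {env!r}")
    cfg = CONFIGS[env]
    return f"s3://{cfg['data_bucket']}/{cfg['data_prefix']}{filename}"
```

<p>With this, no pipeline ever hardcodes a bucket: it asks for <code>data_uri(env, ...)</code> and the environment decides the rest.</p>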
<hr />
<p><strong>Environment-separation checklist:</strong></p>
<p><strong>Data:</strong></p>
<ul>
<li><p>[ ] Dev uses sample/synthetic data</p>
</li>
<li><p>[ ] Staging uses a prod snapshot</p>
</li>
<li><p>[ ] Prod uses real data</p>
</li>
<li><p>[ ] Prod data never in dev (privacy)</p>
</li>
</ul>
<p><strong>Models:</strong></p>
<ul>
<li><p>[ ] Model registry per environment</p>
</li>
<li><p>[ ] Clear promotion workflow (dev → staging → prod)</p>
</li>
<li><p>[ ] Easy rollback in prod</p>
</li>
<li><p>[ ] Experiments only in dev/staging</p>
</li>
</ul>
<p><strong>Features:</strong></p>
<ul>
<li><p>[ ] Feature store per environment</p>
</li>
<li><p>[ ] Experimental features only in dev</p>
</li>
<li><p>[ ] Feature monitoring in prod</p>
</li>
</ul>
<p><strong>Infrastructure:</strong></p>
<ul>
<li><p>[ ] Separate resources per environment</p>
</li>
<li><p>[ ] Prod doesn't share resources with dev/staging</p>
</li>
<li><p>[ ] Different credentials per environment</p>
</li>
<li><p>[ ] Separate logs/metrics</p>
</li>
</ul>
<p><strong>Process:</strong></p>
<ul>
<li><p>[ ] Different CI/CD per environment</p>
</li>
<li><p>[ ] Approvals to promote to prod</p>
</li>
<li><p>[ ] Testing in staging before prod</p>
</li>
<li><p>[ ] Canary/shadow deploys in prod</p>
</li>
</ul>
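<p>Part of this checklist can be machine-checked instead of trusted to memory. A sketch that flags any resource prod shares with another environment (the config shape is hypothetical):</p>

```python
def check_isolation(configs: dict) -> list:
    """Return a problem report for every prod resource reused elsewhere."""
    problems = []
    prod = configs.get("prod", {})
    for env, cfg in configs.items():
        if env == "prod":
            continue
        for key, value in cfg.items():
            if prod.get(key) == value:  # same bucket, credential, cluster...
                problems.append(f"prod shares {key}={value!r} with {env}")
    return problems
```

<p>Run something like this in CI against your environment configs and the "prod doesn't share resources" checkbox enforces itself.</p>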
<hr />
<p><strong>My recommendation as a Platform Engineer:</strong></p>
<p><strong>Phase 1 - Minimum viable:</strong></p>
<pre><code class="lang-markdown">- Dev: local with sample data
- Prod: cluster with real data
- Deploy: manual with a checklist
</code></pre>
<p><strong>Phase 2 - Add staging:</strong></p>
<pre><code class="lang-markdown">- Staging: weekly snapshot of prod
- Automatic validation in staging
- Deploy: semi-automatic with approval
</code></pre>
<p><strong>Phase 3 - Full separation:</strong></p>
<pre><code class="lang-markdown">- 3 fully separated environments
- Automated CI/CD
- Canary deploys
- A/B testing
</code></pre>
<p><strong>Don't try to do Phase 3 from day 1.</strong></p>
<p><strong>Tomorrow:</strong></p>
<p>You now know how to run pipelines in different environments.</p>
<p>But there's a problem: <strong>all of this is code that lives somewhere.</strong></p>
<p>How do you connect Terraform, Helm, K8s, and GitOps with MLOps?</p>
<p><strong>Infrastructure as Code for ML. And why it's not just</strong> <code>terraform apply</code>.</p>
<hr />
<p><strong>Your turn:</strong></p>
<ul>
<li><p>How many environments do you have for ML?</p>
</li>
<li><p>Or is it "dev is my laptop and prod is... also my laptop"?</p>
</li>
</ul>
<hr />
]]></content:encoded></item><item><title><![CDATA[Time-based vs Event-driven: Why it's not just cron vs webhooks]]></title><description><![CDATA[Yesterday: which orchestrator to use. Today: when to run that orchestrator.
The most common setup I see: (don't be like this, xD, love your PE)
# In the orchestrator of your choice
@schedule("0 2 * * *")  # Every day at 2am
def train_model():
    data = fetch_data()...]]></description><link>https://parraletz.dev/time-based-vs-event-driven-por-que-no-es-solo-cron-vs-webhooks</link><guid isPermaLink="true">https://parraletz.dev/time-based-vs-event-driven-por-que-no-es-solo-cron-vs-webhooks</guid><category><![CDATA[mlops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Orchestration]]></category><category><![CDATA[scheduling]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Fri, 09 Jan 2026 12:53:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767963121578/c11be0da-e890-4613-a7a9-597c74f3dbfa.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Yesterday: which orchestrator to use.<br />Today: <strong>when to run that orchestrator.</strong></p>
<p><strong>The most common setup I see: (don't be like this, xD, love your Platform Engineer)</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># In the orchestrator of your choice</span>
<span class="hljs-meta">@schedule("0 2 * * *")  # Every day at 2am</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train_model</span>():</span>
    data = fetch_data()
    model = train(data)
    deploy(model)
</code></pre>
<p><strong>Simple question:</strong> Why 2am? Why every day?</p>
<p><strong>Common answer:</strong> Because... low-traffic window?</p>
<p><strong>But the real question is:</strong></p>
<p><strong>"What event justifies spending $X in compute to retrain?"</strong></p>
<p>Because retraining costs:</p>
<ul>
<li><p>Compute (GPUs aren't free)</p>
</li>
<li><p>Time (your pipeline blocks resources)</p>
</li>
<li><p>Risk (every deploy is a potential incident)</p>
</li>
</ul>
<p><strong>If nothing changed, why retrain?</strong></p>
<hr />
<h2 id="heading-los-3-triggers-legitimos"><strong>Los 3 triggers legítimos:</strong></h2>
<p><strong>1. Datos nuevos</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Llegaron 10K nuevas transacciones</span>
<span class="hljs-comment"># El modelo PUEDE aprender algo nuevo</span>
<span class="hljs-keyword">if</span> new_data.count &gt;= <span class="hljs-number">10000</span>:
    trigger_training()
</code></pre>
<p><strong>2. Detected degradation</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># The model is failing in prod</span>
<span class="hljs-comment"># Accuracy dropped 5%</span>
<span class="hljs-keyword">if</span> current_accuracy &lt; baseline - <span class="hljs-number">0.05</span>:
    trigger_emergency_training()
</code></pre>
<p><strong>3. Code/features changed</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># You added a new feature</span>
<span class="hljs-comment"># You changed hyperparameters</span>
<span class="hljs-keyword">if</span> code_changed() <span class="hljs-keyword">or</span> features_changed():
    trigger_ci_training()
</code></pre>
<p><strong>If none of these 3 happened, you probably don't need to retrain.</strong></p>
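<p>The three triggers combine naturally into one decision function that every scheduled check can call. A sketch (the thresholds are the illustrative ones above, not universal values):</p>

```python
def should_retrain(new_rows: int, accuracy: float, baseline: float,
                   code_changed: bool) -> tuple:
    """Return (retrain?, reason) based on the 3 legitimate triggers."""
    if accuracy < baseline - 0.05:
        return True, "degradation"   # trigger 2: model failing in prod
    if new_rows >= 10_000:
        return True, "new-data"      # trigger 1: enough new data to learn from
    if code_changed:
        return True, "code-change"   # trigger 3: features/hyperparams changed
    return False, "nothing-changed"  # don't burn compute
```

<p>The returned reason is worth logging: it turns "why did we retrain at 2am?" into an answerable question.</p>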
<hr />
<h3 id="heading-pattern-1-time-based-batch"><strong>Pattern 1: Time-based (Batch)</strong></h3>
<pre><code class="lang-python"><span class="hljs-comment"># Airflow, Prefect, Kubeflow, etc.</span>
<span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_pipeline</span>():</span>
    train_model()
</code></pre>
<p><strong>Trade-offs:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>Simple to implement</td><td>Can be wasteful</td></tr>
<tr>
<td>Predictable</td><td>Doesn't respond to real changes</td></tr>
<tr>
<td>Easy to debug</td><td>May train on bad data</td></tr>
<tr>
<td>Familiar to teams</td><td>Latency until the next run</td></tr>
</tbody>
</table>
</div><p><strong>Use it when:</strong></p>
<ul>
<li><p>Your data arrives on a predictable schedule</p>
</li>
<li><p>Training cost is low</p>
</li>
<li><p>You prefer simplicity over optimization</p>
</li>
</ul>
<p><strong>Don't use it when:</strong></p>
<ul>
<li><p>Data arrives irregularly</p>
</li>
<li><p>Retraining is very expensive</p>
</li>
<li><p>You need an immediate response to changes</p>
</li>
</ul>
<hr />
<h3 id="heading-pattern-2-event-driven"><strong>Pattern 2: Event-driven</strong></h3>
<pre><code class="lang-python"><span class="hljs-comment"># Triggered by an external event</span>
<span class="hljs-meta">@trigger(bucket="s3://data", event="new_file")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_pipeline</span>():</span>
    <span class="hljs-keyword">if</span> new_data_meets_threshold():
        train_model()
</code></pre>
<p><strong>Trade-offs:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>You only train when needed</td><td>More complex to implement</td></tr>
<tr>
<td>Optimizes compute</td><td>Harder debugging</td></tr>
<tr>
<td>Responds to real events</td><td>You need event infrastructure</td></tr>
<tr>
<td>Scales better</td><td>Can fire too often</td></tr>
</tbody>
</table>
</div><p><strong>Use it when:</strong></p>
<ul>
<li><p>Data arrives irregularly</p>
</li>
<li><p>Retraining is expensive (&gt;$100/run)</p>
</li>
<li><p>You have event infrastructure (S3, Kafka, etc.)</p>
</li>
</ul>
<p><strong>Don't use it when:</strong></p>
<ul>
<li><p>Your team has no experience with events</p>
</li>
<li><p>Data arrives consistently</p>
</li>
<li><p>You prefer simplicity</p>
</li>
</ul>
<hr />
<h3 id="heading-pattern-3-metric-driven"><strong>Pattern 3: Metric-driven</strong></h3>
<pre><code class="lang-python"><span class="hljs-meta">@schedule("@hourly")  # Frequent check</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">monitor_and_train</span>():</span>
    metrics = get_production_metrics()

    <span class="hljs-keyword">if</span> metrics.accuracy &lt; THRESHOLD:
        trigger_emergency_retrain()
        alert_team()
</code></pre>
<p><strong>Trade-offs:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>Responds to real problems</td><td>Requires robust monitoring</td></tr>
<tr>
<td>Optimizes for quality</td><td>Can have false positives</td></tr>
<tr>
<td>Proactive about drift</td><td>Latency between detection and fix</td></tr>
<tr>
<td>Justifies every retrain</td><td>Complex to implement well</td></tr>
</tbody>
</table>
</div><p><strong>Use it when:</strong></p>
<ul>
<li><p>You can measure performance in production</p>
</li>
<li><p>Cost of a bad model &gt; cost of retraining</p>
</li>
<li><p>You have robust alerting</p>
</li>
</ul>
<p><strong>Don't use it when:</strong></p>
<ul>
<li><p>You don't have reliable metrics in prod</p>
</li>
<li><p>Retraining takes too long</p>
</li>
<li><p>Your model doesn't degrade predictably</p>
</li>
</ul>
<hr />
<h3 id="heading-pattern-4-hybrid-real-world"><strong>Pattern 4: Hybrid (Real-world)</strong></h3>
<pre><code class="lang-python"><span class="hljs-comment"># What we REALLY use in production</span>

<span class="hljs-comment"># Baseline: conservative time-based</span>
<span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">baseline_training</span>():</span>
    train_model(priority=<span class="hljs-string">"normal"</span>)

<span class="hljs-comment"># Emergency: metric-driven</span>
<span class="hljs-meta">@monitor(metric="accuracy", threshold=0.85)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">emergency_training</span>():</span>
    train_model(priority=<span class="hljs-string">"high"</span>)
    alert_oncall()

<span class="hljs-comment"># CI/CD: code-driven</span>
<span class="hljs-meta">@on_merge(branch="main")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ci_training</span>():</span>
    <span class="hljs-keyword">if</span> features_changed():
        train_model(priority=<span class="hljs-string">"normal"</span>)
</code></pre>
<p><strong>This is the reality in prod:</strong></p>
<ul>
<li><p>80% of trainings → time-based baseline</p>
</li>
<li><p>15% → triggered by CI/CD</p>
</li>
<li><p>5% → emergency, metric-driven</p>
</li>
</ul>
<hr />
<p><strong>Decision criteria:</strong></p>
<p><strong>Data frequency:</strong></p>
<pre><code class="lang-bash">Streaming (seconds) → event-driven
Daily batch → daily schedule
Weekly batch → weekly schedule
Monthly batch → monthly schedule
Irregular → event-driven
</code></pre>
<p><strong>Training cost:</strong></p>
<pre><code class="lang-bash">&lt;$10/run → frequent time-based is fine
$10-100/run → time-based + basic triggers
$100-1000/run → event-driven + monitoring
&gt;$1000/run → only when absolutely necessary
</code></pre>
<p><strong>How critical freshness is:</strong></p>
<pre><code class="lang-bash">High (fraud, spam) → hybrid with monitoring
Medium (recommendations) → time-based + events
Low (analytics) → conservative time-based
</code></pre>
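<p>These criteria can be folded into one helper that proposes a pattern. A sketch (the cost cutoffs are the rough figures above, not hard rules):</p>

```python
def scheduling_strategy(cost_per_run_usd: float, criticality: str) -> str:
    """Map training cost and freshness criticality to a trigger pattern."""
    if criticality == "high":              # fraud, spam: freshness wins
        return "hybrid + monitoring"
    if cost_per_run_usd < 10:
        return "frequent time-based"
    if cost_per_run_usd < 100:
        return "time-based + basic triggers"
    if cost_per_run_usd < 1000:
        return "event-driven + monitoring"
    return "only when strictly necessary"
```

<p>Even if you never run this, writing the rules down forces the conversation: what does a run actually cost, and how fresh does the model really need to be?</p>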
<hr />
<p><strong>A real architecture example:</strong></p>
<pre><code class="lang-bash">┌─────────────────────────────────────┐
│     Data Pipeline (Event-driven)     │
│   S3 → Trigger → Validation → ...   │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Training Orchestrator (Hybrid)     │
│                                     │
│  1. Baseline: @weekly              │
│  2. On data threshold: &gt;10K rows   │
│  3. On metric drop: &lt;85% accuracy  │
│  4. On code change: features/*     │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│    ML Pipeline (Orchestrated)      │
│  ingest → train → evaluate → ...   │
└─────────────────────────────────────┘
</code></pre>
<hr />
<p><strong>Mistakes I've seen:</strong></p>
<p><strong>ERROR 1: "Train every hour just in case"</strong></p>
<blockquote>
<p>Monthly cost: $432 in compute. Benefit: a 0.1% better model. ROI: negative.</p>
</blockquote>
<p><strong>ERROR 2: "Only train manually"</strong></p>
<blockquote>
<p>Result: it gets forgotten for weeks. The model degrades without anyone noticing. The customer complains first.</p>
</blockquote>
<p><strong>ERROR 3: "Retrain on every commit"</strong></p>
<blockquote>
<p>100 commits/day × $10/training = $1000/day. 99% of those trainings change nothing.</p>
</blockquote>
<p><strong>THE RIGHT WAY:</strong></p>
<blockquote>
<p>A predictable baseline + smart triggers. Monitoring that justifies every training run. A balance between freshness and cost.</p>
</blockquote>
<p><strong>My pragmatic recommendation:</strong></p>
<p><strong>Phase 1 - MVP:</strong></p>
<pre><code class="lang-python"><span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    <span class="hljs-comment"># Simple, predictable, enough to start</span>
    train_model()
</code></pre>
<p><strong>Phase 2 - Add monitoring:</strong></p>
<pre><code class="lang-python"><span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    train_model()

<span class="hljs-meta">@alert(metric="accuracy &lt; 0.85")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">notify</span>():</span>
    <span class="hljs-comment"># Alert only, no auto-retrain yet</span>
    <span class="hljs-keyword">pass</span>
</code></pre>
<p><strong>Phase 3 - Event-driven, if you need it:</strong></p>
<pre><code class="lang-python"><span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">baseline_train</span>():</span>
    <span class="hljs-keyword">pass</span>

<span class="hljs-meta">@trigger(on_data_threshold)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">event_train</span>():</span>
    <span class="hljs-keyword">pass</span>
</code></pre>
<p><strong>Don't jump to Phase 3 if Phase 1 is enough.</strong></p>
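<p>The three phases can be collapsed into a single decision function. A minimal sketch, assuming illustrative names and thresholds (<code>should_retrain</code>, the 10K-row and 85% figures are examples from this post, not any real framework's API):</p>

```python
from datetime import datetime, timedelta
from typing import Optional

def should_retrain(last_train: datetime, now: datetime,
                   new_rows: int, accuracy: float,
                   max_age: timedelta = timedelta(weeks=1),
                   row_threshold: int = 10_000,
                   min_accuracy: float = 0.85) -> Optional[str]:
    """Return the trigger that fires, or None if no retrain is needed."""
    if accuracy < min_accuracy:
        return "metric_drop"        # urgent: the model is degrading
    if new_rows > row_threshold:
        return "data_threshold"     # enough fresh data to matter
    if now - last_train >= max_age:
        return "baseline_schedule"  # the weekly safety net
    return None
```

<p>Checked before each run, this keeps the weekly baseline as a fallback while letting the urgent triggers fire first.</p>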
<hr />
<p><strong>Tomorrow we'll talk about environments:</strong></p>
<p>Dev, staging, production... but for ML.</p>
<p>Where an "environment" isn't just different code, but:</p>
<ul>
<li><p>Different datasets</p>
</li>
<li><p>Different models</p>
</li>
<li><p>Different metrics</p>
</li>
<li><p>Different infrastructure</p>
</li>
</ul>
<p><strong>Spoiler:</strong> <code>if ENV == "prod"</code> isn't enough.</p>
<hr />
<p><strong>A question:</strong></p>
<ul>
<li><p>What pattern do you use for scheduling?</p>
</li>
<li><p>Have you questioned whether it's the right one?</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[The Pipeline Orchestrator Concept]]></title><description><![CDATA[Airflow, Kubeflow, Prefect, Dagster: the question everyone asks wrong
Yesterday we saw why a script is not a pipeline.
Today, the obvious question: "Which orchestrator should I use?"
But that's not the right question.
A true story:
A while back, a colleague asked me...]]></description><link>https://parraletz.dev/concepto-de-pipeline-orquestrador</link><guid isPermaLink="true">https://parraletz.dev/concepto-de-pipeline-orquestrador</guid><category><![CDATA[mlops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Thu, 08 Jan 2026 12:59:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767877117040/5cfa8c06-8c01-454a-8b4f-aa51551965cc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Airflow, Kubeflow, Prefect, Dagster: the question everyone asks wrong</strong></p>
<p>Yesterday we saw why a script is not a pipeline.</p>
<p>Today, the obvious question: <strong>"Which orchestrator should I use?"</strong></p>
<p><strong>But that's not the right question.</strong></p>
<p><strong>A true story:</strong></p>
<p>A while back, a colleague asked me: <em>"Why don't we just use Airflow, like we do for everything else?"</em></p>
<p>A fair question. We had Airflow working perfectly for:</p>
<ul>
<li><p>Daily ETLs</p>
</li>
<li><p>Automated reports</p>
</li>
<li><p>Batch data pipelines</p>
</li>
</ul>
<p><strong>Why did we need to evaluate Kubeflow?</strong></p>
<hr />
<p><strong>The problem showed up with ML:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># What worked for ETL</span>
<span class="hljs-meta">@task</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_data</span>():</span>
    df = read_database()
    df = transform(df)
    save_to_warehouse(df)
    <span class="hljs-comment"># Always the same flow, predictable</span>

<span class="hljs-comment"># What we needed for ML</span>
<span class="hljs-meta">@task</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train_model</span>():</span>
    <span class="hljs-comment"># I need to run 100 hyperparameter</span>
    <span class="hljs-comment"># combinations IN PARALLEL</span>
    <span class="hljs-comment"># on GPUs</span>
    <span class="hljs-comment"># and only register the best one</span>
    <span class="hljs-comment"># How do I do that in Airflow?</span>
    <span class="hljs-keyword">pass</span>
</code></pre>
<p><strong>That's when I understood something fundamental:</strong></p>
<p><strong>Orchestrators come from different worlds and solve different problems.</strong></p>
<p>There is no "best one." Only the best one <strong>for your problem</strong>.</p>
<hr />
<h2 id="heading-primero-que-es-un-orquestador"><strong>First: what is an orchestrator?</strong></h2>
<p>If you come from CI/CD (like me), it's like GitHub Actions but for data/ML workflows.</p>
<p>Instead of:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># .github/workflows/deploy.yml</span>
<span class="hljs-string">build</span> <span class="hljs-string">→</span> <span class="hljs-string">test</span> <span class="hljs-string">→</span> <span class="hljs-string">deploy</span>
</code></pre>
<p>It's:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># ml_pipeline.py</span>
ingest → validate → train → evaluate → deploy
</code></pre>
<p><strong>With 3 key differences:</strong></p>
<ol>
<li><p><strong>Longer workflows</strong> (hours, not minutes)</p>
</li>
<li><p><strong>Data dependencies</strong> (not just code)</p>
</li>
<li><p><strong>Expected failures</strong> (bad data, low metrics)</p>
</li>
</ol>
<hr />
<p><strong>What ALL orchestrators give you:</strong></p>
<p>Conceptually, these 4 things:</p>
<p><strong>1. DAGs (Directed Acyclic Graphs)</strong></p>
<pre><code class="lang-markdown"><span class="hljs-code">    ingest
     / \
validate process
     \ /
    train → evaluate → deploy</span>
</code></pre>
<p>You define dependencies: "Step B only runs if Step A passed."</p>
<p><strong>2. Scheduling</strong></p>
<ul>
<li><p>"Run this every day at 2am"</p>
</li>
<li><p>"Run when new data arrives"</p>
</li>
<li><p>Custom triggers</p>
</li>
</ul>
<p><strong>3. Retries &amp; Error Handling</strong></p>
<ul>
<li><p>Automatic retries</p>
</li>
<li><p>Failure alerts</p>
</li>
<li><p>Recovery from specific checkpoints</p>
</li>
</ul>
<p><strong>4. Observability</strong></p>
<ul>
<li><p>Centralized logs for every step</p>
</li>
<li><p>Flow visualization</p>
</li>
<li><p>Execution metrics</p>
</li>
</ul>
<p><strong>That's an orchestrator. The rest is implementation.</strong></p>
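<p>Point 1 fits in a few lines of plain Python. A toy sketch of what every orchestrator does under the hood, using only the standard library (the <code>dag</code> dict and <code>run</code> helper are illustrative, not any real orchestrator's API):</p>

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Map each step to the steps it depends on (the DAG drawn above)
dag = {
    "validate": {"ingest"},
    "process":  {"ingest"},
    "train":    {"validate", "process"},
    "evaluate": {"train"},
    "deploy":   {"evaluate"},
}

def run(dag):
    """Execute steps in dependency order: a step only runs after its parents."""
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        print(f"running {step}")  # real orchestrators add retries, logs, alerts
    return order
```

<p>Everything else an orchestrator adds (scheduling, retries, observability) wraps around this core loop.</p>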
<hr />
<h2 id="heading-los-4-grandes-y-de-donde-vienen"><strong>The big 4 (and where they come from):</strong></h2>
<h3 id="heading-airflow"><strong>Airflow</strong></h3>
<p>From: Data Engineering<br />For: batch ETLs, robust scheduling</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.operators.python <span class="hljs-keyword">import</span> PythonOperator

<span class="hljs-keyword">with</span> DAG(<span class="hljs-string">'ml_pipeline'</span>, schedule=<span class="hljs-string">'@daily'</span>) <span class="hljs-keyword">as</span> dag:
    ingest = PythonOperator(task_id=<span class="hljs-string">'ingest'</span>, ...)
    train = PythonOperator(task_id=<span class="hljs-string">'train'</span>, ...)

    ingest &gt;&gt; train  <span class="hljs-comment"># Explicit dependencies</span>
</code></pre>
<p><strong>✅ Use it if:</strong></p>
<ul>
<li><p>You already have Airflow for ETLs</p>
</li>
<li><p>Traditional batch workflows</p>
</li>
<li><p>You need a mature ecosystem (1000+ operators)</p>
</li>
</ul>
<p><strong>❌ Avoid it if:</strong></p>
<ul>
<li><p>Complex ML workflows (hyperparameter tuning)</p>
</li>
<li><p>You need dynamic execution</p>
</li>
<li><p>You want a better developer experience</p>
</li>
</ul>
<hr />
<h3 id="heading-kubeflow-pipelines"><strong>Kubeflow Pipelines</strong></h3>
<p>From: ML on Kubernetes<br />For: distributed training, GPUs, ML-native workflows</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> kfp <span class="hljs-keyword">import</span> dsl

<span class="hljs-meta">@dsl.pipeline(name='ml_pipeline')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">pipeline</span>():</span>
    ingest = dsl.ContainerOp(
        name=<span class="hljs-string">'ingest'</span>,
        image=<span class="hljs-string">'gcr.io/my-project/ingest:v1'</span>
    )
    train = dsl.ContainerOp(
        name=<span class="hljs-string">'train'</span>,
        image=<span class="hljs-string">'gcr.io/my-project/train:v1'</span>
    )
    train.after(ingest)
</code></pre>
<p><strong>✅ Use it if:</strong></p>
<ul>
<li><p>You already live in Kubernetes</p>
</li>
<li><p>You need GPU scheduling</p>
</li>
<li><p>Native hyperparameter tuning</p>
</li>
</ul>
<p><strong>❌ Avoid it if:</strong></p>
<ul>
<li><p>You don't want to manage K8s</p>
</li>
<li><p>Simple workflows</p>
</li>
<li><p>Small team without K8s expertise</p>
</li>
</ul>
<h3 id="heading-prefect"><strong>Prefect</strong></h3>
<p>From: "Airflow, but with better UX"<br />For: dynamic workflows, developer experience</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> prefect <span class="hljs-keyword">import</span> task, flow

<span class="hljs-meta">@task</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ingest</span>():</span>
    <span class="hljs-keyword">return</span> fetch_data()

<span class="hljs-meta">@task</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>(<span class="hljs-params">data</span>):</span>
    <span class="hljs-keyword">return</span> train_model(data)

<span class="hljs-meta">@flow</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ml_pipeline</span>():</span>
    data = ingest()
    model = train(data)  <span class="hljs-comment"># Implicit dependencies</span>
</code></pre>
<p><strong>✅ Use it if:</strong></p>
<ul>
<li><p>You want Airflow but with better DX</p>
</li>
<li><p>Dynamic workflows (loops, conditionals)</p>
</li>
<li><p>Simpler deployment</p>
</li>
</ul>
<p><strong>❌ Avoid it if:</strong></p>
<ul>
<li><p>You need a giant plugin ecosystem</p>
</li>
<li><p>Your team already masters Airflow</p>
</li>
<li><p>You want a super-robust UI (though it has improved)</p>
</li>
</ul>
<hr />
<h3 id="heading-dagster"><strong>Dagster</strong></h3>
<p>From: "data pipelines as software engineering"<br />For: data platforms, software-defined assets</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> dagster <span class="hljs-keyword">import</span> asset, Definitions

<span class="hljs-meta">@asset</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">raw_data</span>():</span>
    <span class="hljs-keyword">return</span> fetch_data()

<span class="hljs-meta">@asset</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">trained_model</span>(<span class="hljs-params">raw_data</span>):</span>
    <span class="hljs-comment"># Dagster knows this depends on raw_data</span>
    <span class="hljs-keyword">return</span> train_model(raw_data)

defs = Definitions(assets=[raw_data, trained_model])
</code></pre>
<p><strong>✅ Use it if:</strong></p>
<ul>
<li><p>You're building a data platform from scratch</p>
</li>
<li><p>You want "assets" instead of "tasks"</p>
</li>
<li><p>Robust pipeline testing</p>
</li>
</ul>
<p><strong>❌ Avoid it if:</strong></p>
<ul>
<li><p>You only need simple scheduling</p>
</li>
<li><p>The learning curve isn't worth it for you</p>
</li>
<li><p>You're on a legacy stack</p>
</li>
</ul>
<hr />
<p><strong>Quick comparison:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Airflow</td><td>Kubeflow</td><td>Prefect</td><td>Dagster</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Origin</strong></td><td>Data Eng</td><td>ML/K8s</td><td>Airflow 2.0</td><td>Data Platform</td></tr>
<tr>
<td><strong>Learning curve</strong></td><td>Medium</td><td>High</td><td>Low</td><td>Medium-High</td></tr>
<tr>
<td><strong>Kubernetes</strong></td><td>Optional</td><td>Required</td><td>Optional</td><td>Optional</td></tr>
<tr>
<td><strong>ML-specific</strong></td><td>🟡 Plugins</td><td>✅ Native</td><td>🟡 Possible</td><td>🟡 Possible</td></tr>
<tr>
<td><strong>DX</strong></td><td>🟡 Improved in 2.0</td><td>🔴 Complex</td><td>✅ Excellent</td><td>✅ Very good</td></tr>
<tr>
<td><strong>Maturity</strong></td><td>✅ Battle-tested</td><td>✅ Google-backed</td><td>🟡 Growing</td><td>🟡 Emerging</td></tr>
</tbody>
</table>
</div><hr />
<p><strong>Decision framework:</strong></p>
<p><strong>STEP 1: Do you already have one?</strong> → Yes? Use it until you <strong>really</strong> can't<br />→ No? Continue ↓</p>
<p><strong>STEP 2: What's your stack?</strong> → Kubernetes everywhere? → Kubeflow<br />→ Python + simple infra? → Prefect<br />→ Multiple data teams? → Dagster<br />→ Legacy + you need stability? → Airflow</p>
<p><strong>STEP 3: How complex is your ML?</strong> → ETLs + simple models? → Airflow/Prefect<br />→ Hyperparameter tuning + GPUs? → Kubeflow<br />→ End-to-end data platform? → Dagster</p>
<p><strong>STEP 4: What can your team maintain?</strong> → This is THE most important question</p>
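<p>The four steps can be sketched as a plain function. The branch conditions below mirror the framework; the function name and string labels are mine, and these are heuristics to adapt, not rules:</p>

```python
def pick_orchestrator(already_have=None, stack="", ml_complexity="simple",
                      team_knows=()):
    """Hypothetical helper encoding the 4-step decision framework."""
    if already_have:
        return already_have       # Step 1: use what you have until it hurts
    if "kubernetes" in stack and ml_complexity == "complex":
        return "Kubeflow"         # Step 3: hyperparameter tuning + GPUs
    if "data platform" in stack:
        return "Dagster"          # Step 2: multiple data teams / platform
    if "Airflow" in team_knows:
        return "Airflow"          # Step 4: what the team can maintain
    return "Prefect"              # default: Python team + simple infra
```

<p>Note how Step 4 (what the team knows) outranks the shiny default: that ordering is the whole point of the framework.</p>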
<hr />
<p><strong>My honest take (as a Platform Engineer):</strong></p>
<p><strong>In 2025-2026:</strong></p>
<ul>
<li><p>Airflow already running? → Stay there until it hurts</p>
</li>
<li><p>K8s-native stack + GPUs? → Kubeflow is the obvious choice</p>
</li>
<li><p>Greenfield project + Python team? → Prefect (better DX)</p>
</li>
<li><p>Building a data platform? → Dagster (but weigh the cost)</p>
</li>
</ul>
<p><strong>What I learned:</strong></p>
<p>Don't look for "the best orchestrator."</p>
<p>Look for the one that solves <strong>your specific problem</strong> with <strong>your current stack</strong> and that <strong>your team can maintain</strong>.</p>
<p>The wrong questions:</p>
<ul>
<li><p>"Which one is best?"</p>
</li>
<li><p>"What does Google use?"</p>
</li>
<li><p>"Which one has the most GitHub stars?"</p>
</li>
</ul>
<p>The right questions:</p>
<ul>
<li><p>"What problem am I solving?"</p>
</li>
<li><p>"What can my team maintain?"</p>
</li>
<li><p>"Which one integrates with my stack?"</p>
</li>
</ul>
<hr />
<p><strong>Back to my story:</strong></p>
<p>What did we choose?</p>
<p><strong>Airflow for ETLs, Kubeflow for ML.</strong></p>
<p>Is it ideal? No.<br />Does it work? Yes.<br />Would I do it differently today? Maybe Prefect.</p>
<p>But we already have Airflow running and the team knows it.</p>
<p><strong>Don't be the company with 4 different orchestrators.</strong></p>
<hr />
<p><strong>Tomorrow we'll talk about scheduling:</strong></p>
<p>You have your orchestrator. Cool.</p>
<p>Now: <strong>when do you run your pipelines?</strong></p>
<ul>
<li><p>Every day at 2am?</p>
</li>
<li><p>Every hour?</p>
</li>
<li><p>When new data arrives?</p>
</li>
</ul>
<p><strong>Batch jobs vs event-driven. And why it matters for ML.</strong></p>
<hr />
<p><strong>Your turn:</strong></p>
<ul>
<li><p>Which orchestrator do you use (or would like to use)?</p>
</li>
<li><p>How many different orchestrators does your company have? 😅</p>
</li>
</ul>
<p>(I've seen companies with 3+... don't be that company. Or worse: the one running Windows servers in production.)</p>
]]></content:encoded></item><item><title><![CDATA[Script vs Pipeline: Why Your Cronjob Isn't Enough]]></title><description><![CDATA[Yesterday I promised to explain why your Platform Engineering tools aren't enough for ML. But first, let's talk about what happens when someone from PE asks to automate something, or asks whether it's already automated. For some reason, the answer is:
Why? ... becau...]]></description><link>https://parraletz.dev/script-vs-pipeline-por-que-tu-cronjob-no-es-suficiente</link><guid isPermaLink="true">https://parraletz.dev/script-vs-pipeline-por-que-tu-cronjob-no-es-suficiente</guid><category><![CDATA[mlops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Wed, 07 Jan 2026 06:00:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767447277167/c6b36fd0-5cfc-4dc6-b9b5-2518536fed37.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Yesterday I promised to explain why your Platform Engineering tools aren't enough for ML. But first, let's talk about what happens when someone from PE asks to automate something, or asks whether it's already automated. For some reason, the answer is:</p>
<p>Why? ... because I can and because I want to:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># crontab</span>
20 4 * * * <span class="hljs-built_in">cd</span> /ml &amp;&amp; python train.py
</code></pre>
<p>"It's already automated," they tell me. Don't be like that; love your PEs. Some are still locked in their own world of "I'm in charge, it's my infra, it's my workflow"... but most of us are the good ones xD.</p>
<p><strong>A while back, a Data Scientist showed me their work (I love my DS colleagues so much, truly xD):</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier

<span class="hljs-comment"># Read data</span>
df = pd.read_csv(<span class="hljs-string">'data.csv'</span>)

<span class="hljs-comment"># Train</span>
model = RandomForestClassifier()
model.fit(df[features], df[<span class="hljs-string">'target'</span>])

<span class="hljs-comment"># Save</span>
joblib.dump(model, <span class="hljs-string">'model.pkl'</span>)
</code></pre>
<p>"Just run this in production," they said.</p>
<p>Naive me thought: <em>"Easy. I'll stick it in a container, add a cronjob, and done," right?</em></p>
<p><strong>3 months later:</strong></p>
<ul>
<li><p>The model degraded and nobody knows why</p>
</li>
<li><p>"data.csv" changed and broke everything</p>
</li>
<li><p>We don't know which model version is in production</p>
</li>
<li><p>We can't reproduce the results</p>
</li>
<li><p>Every retrain is manual, and you pray it works</p>
</li>
</ul>
<p><strong>The problem?</strong> A script is not a pipeline.</p>
<hr />
<p><strong>The fundamental difference:</strong></p>
<p><strong>🔴 Script (<code>train.py</code>):</strong></p>
<pre><code class="lang-python">python train.py
<span class="hljs-comment"># Everything in a single file</span>
<span class="hljs-comment"># No separation of concerns</span>
<span class="hljs-comment"># No traceability</span>
<span class="hljs-comment"># No error recovery</span>
</code></pre>
<p><strong>🟢 Orchestrated pipeline:</strong></p>
<pre><code class="lang-python"><span class="hljs-number">1.</span> INGEST    → Pulls versioned data
<span class="hljs-number">2.</span> VALIDATE  → Checks data quality
<span class="hljs-number">3.</span> PREPROCESS→ Transforms reproducibly
<span class="hljs-number">4.</span> TRAIN     → Trains with tracked hyperparameters
<span class="hljs-number">5.</span> EVALUATE  → Validates metrics vs baseline
<span class="hljs-number">6.</span> REGISTER  → Only if it passes the threshold
</code></pre>
<p><strong>Why does it matter?</strong></p>
<p>With a <strong>script</strong>:</p>
<ul>
<li><p>If it fails at minute 45, you lose everything</p>
</li>
<li><p>You don't know which data you used</p>
</li>
<li><p>You can't reproduce it</p>
</li>
<li><p>Debugging = praying and <code>print()</code></p>
</li>
</ul>
<p>With a <strong>pipeline</strong>:</p>
<ul>
<li><p>Each step is independent and reproducible</p>
</li>
<li><p>Step 3 fails? Restart from there</p>
</li>
<li><p>Full traceability: data, code, parameters</p>
</li>
<li><p>Debugging with structured logs from each stage</p>
</li>
</ul>
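<p>The resume-from-failure behavior is easy to see in miniature. A sketch, assuming a made-up <code>run_pipeline</code> helper (no real framework): each step checkpoints its output, so a failed run keeps the work of every step that already passed:</p>

```python
def run_pipeline(steps, checkpoints=None):
    """Run (name, fn) steps in order, skipping ones already checkpointed."""
    checkpoints = dict(checkpoints or {})
    data = None
    for name, fn in steps:
        if name in checkpoints:       # finished in a previous run: skip it
            data = checkpoints[name]
            continue
        data = fn(data)               # may raise; earlier checkpoints survive
        checkpoints[name] = data
    return data, checkpoints

# Illustrative steps standing in for INGEST -> VALIDATE -> TRAIN
steps = [
    ("ingest",   lambda _: [1, 2, 3]),
    ("validate", lambda d: d),
    ("train",    lambda d: sum(d)),
]
```

<p>If "train" raises, a retry with the saved checkpoints re-runs only "train", not the whole script.</p>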
<hr />
<p><strong>The day I learned this:</strong></p>
<p>A model started failing in production.</p>
<p>With the script: "I don't know what changed; retrain everything and cross your fingers."</p>
<p>With the pipeline: "Ah, the input data failed validation. The previous model keeps serving. I fix the data and re-run only from INGEST."</p>
<p><strong>That's the difference between firefighting and engineering.</strong></p>
<hr />
<p><strong>Tomorrow I'll tell you:</strong> what makes a pipeline truly a pipeline (spoiler: it's not just splitting your script into functions)</p>
<p>And I'll introduce the orchestrators: Airflow, Kubeflow, Prefect, Dagster.</p>
<p>Which one should you use? It depends. Tomorrow we'll see why.</p>
<hr />
<p><strong>A question for you:</strong> does your ML team have scripts or pipelines? (Hint: if you don't know for sure, they're probably scripts.)</p>
]]></content:encoded></item><item><title><![CDATA[*Ops to MLOps]]></title><description><![CDATA[A while back, when some of my data or AI colleagues (as I knew them then) would come over asking for infrastructure or for "deployments outside the standard," they always challenged the platform team with their requirem...]]></description><link>https://parraletz.dev/ops-to-mlops</link><guid isPermaLink="true">https://parraletz.dev/ops-to-mlops</guid><category><![CDATA[mlops]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Tue, 06 Jan 2026 06:00:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767443931934/b7b1cd21-41fa-48b2-a88d-0c4de159fce0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>A while back, when some of my data or AI colleagues (as I knew them then) would come over asking for infrastructure or for "deployments outside the standard," they always challenged the platform team with their requirements, and honestly, that was fun</strong></p>
<hr />
<p>Today is January 6th, and like every year, the Three Wise Men bring gifts.</p>
<p>But mine didn't come wrapped in golden paper.</p>
<p>A while back, I was that Platform Engineer who:</p>
<ul>
<li><p>Built solid infrastructure</p>
</li>
<li><p>Automated CI/CD pipelines</p>
</li>
<li><p>Didn't understand why Data Scientists needed "weird stuff"</p>
</li>
</ul>
<p>The Data Scientists would show up with their requirements:</p>
<ul>
<li><p>"I need to version 500GB datasets"</p>
</li>
<li><p>"Why can't I reproduce this experiment?"</p>
</li>
<li><p>"This model worked yesterday and today it doesn't"</p>
</li>
</ul>
<p>And I thought: <em>"Why don't they just use Git like normal people?"</em></p>
<p><strong>Spoiler: ML is not traditional software. And neither are ML platforms.</strong></p>
<p>Until I discovered that <strong>MLOps is not DevOps under another name</strong>. It's 3 pillars that transform how you build platforms for ML teams:</p>
<h3 id="heading-regalo-1-oro-entender-el-problema-real"><strong>Gift 1: GOLD - Understanding the real problem</strong></h3>
<p>Why ML is different from traditional software. Why your CI/CD pipelines aren't enough. What ML teams really need.</p>
<h3 id="heading-regalo-2-incienso-infraestructura-para-datos-y-experimentos"><strong>Gift 2: FRANKINCENSE - Infrastructure for data and experiments</strong></h3>
<p>Not just versioning code: versioning datasets, tracking thousands of experiments, guaranteeing reproducibility. DVC, LakeFS, MLflow.</p>
<h3 id="heading-regalo-3-mirra-orquestacion-que-entiende-ml"><strong>Gift 3: MYRRH - Orchestration that understands ML</strong></h3>
<p>Airflow is fine for ETLs. But training pipelines? Model serving? A/B testing of models? You need other tools.</p>
<hr />
<p><strong>Over the next 3 weeks I'm going to share these 3 gifts with you.</strong></p>
<p>From the perspective of someone who comes from <strong>platforms and infrastructure</strong>, not Data Science.</p>
<p>📅 Each week = 1 gift<br />📝 5 daily posts (Mon-Fri) with practical content</p>
<p><strong>Week 1 (Jan 7-11):</strong> why ML breaks your traditional-software assumptions<br /><strong>Week 2 (Jan 13-18):</strong> infrastructure for versioning and tracking<br /><strong>Week 3 (Jan 20-25):</strong> orchestration and automation for ML</p>
<p><strong>The goal:</strong> that you build platforms your Data Scientists actually need.</p>
<p>No more:</p>
<ul>
<li><p>"We use Kubernetes" (but what for?)</p>
</li>
<li><p>"We have CI/CD" (does it version your data?)</p>
</li>
<li><p>"We deploy models" (do you monitor drift?)</p>
</li>
</ul>
<p>From zero to MLOps. From the perspective of the one who <strong>builds the platform</strong>.</p>
<p>Joining the journey? 🚀</p>
<p>See you tomorrow with the first post: "Why Git Is Not Enough for ML"</p>
<hr />
]]></content:encoded></item></channel></rss>