<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Alejandro Parra's Blog]]></title><description><![CDATA[Alejandro Parra's Blog]]></description><link>https://parraletz.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1767443809829/091229ff-2ce2-4cac-bab4-94ffc4c255c0.png</url><title>Alejandro Parra&apos;s Blog</title><link>https://parraletz.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 24 Apr 2026 22:09:36 GMT</lastBuildDate><atom:link href="https://parraletz.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Kubernetes Isn't Outdated. It's Evolving. And So Should You.]]></title><description><![CDATA[There's a narrative I keep hearing in tech communities across LATAM and beyond: "Kubernetes is too complex now. It had its moment."
I think that narrative confuses two very different things: the accid]]></description><link>https://parraletz.dev/kubernetes-isn-t-outdated-it-s-evolving-and-so-should-you</link><guid isPermaLink="true">https://parraletz.dev/kubernetes-isn-t-outdated-it-s-evolving-and-so-should-you</guid><category><![CDATA[CNCF]]></category><category><![CDATA[cloudnative]]></category><category><![CDATA[kcdguadalara]]></category><category><![CDATA[Kubecon]]></category><category><![CDATA[AI]]></category><category><![CDATA[mcp]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Thu, 26 Mar 2026 03:08:25 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/63a3eaa3f7fa1fc0ce9ebb2a/c854b823-664b-4e20-8fe5-f0a4c901a03c.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There's a narrative I keep hearing in tech communities across LATAM and beyond: <strong>"Kubernetes is too complex now. It had its moment."</strong></p>
<p>I think that narrative confuses two very different things: the accidental complexity <em>we</em> engineers have built on top of it, and the obsolescence of the platform itself. Kubernetes isn't a trend, so it can't fall out of one. It's infrastructure. And infrastructure doesn't go anywhere.</p>
<p>What <em>is</em> changing — fast — is the <em>role we need to play as engineers</em> around it.</p>
<blockquote>
<p>"We won the scalability war. But we are losing the simplicity battle."</p>
</blockquote>
<hr />
<h2>The Problem Nobody Wants to Admit</h2>
<p>The original DevOps promise was simple: <strong>"You build it, you run it."</strong> It was a philosophy of ownership. But as the CNCF ecosystem grew from 10 to over 200 projects, that philosophy became a cognitive load trap.</p>
<p>Today, in many teams, a backend developer is expected to be an expert in:</p>
<ol>
<li><strong>Business logic</strong> — their actual job</li>
<li><strong>Containers and Kubernetes</strong> — manifests, Helm charts, CRDs</li>
<li><strong>Network security</strong> — NetworkPolicies, mTLS, Cilium</li>
<li><strong>Observability</strong> — OpenTelemetry, Prometheus, Grafana Loki</li>
<li><strong>Compliance &amp; Governance</strong> — OPA, Kyverno, RBAC policies</li>
<li><strong>Continuous Delivery</strong> — ArgoCD, Flux, GitOps workflows</li>
</ol>
<p>That is an absurd stack of responsibilities for a single person. The result: developers become bottlenecks, and delivery speed collapses.</p>
<p>The problem isn't Kubernetes. The problem is that <strong>nobody built the floor between Kubernetes and the developer.</strong></p>
<hr />
<h2>Platform Engineering and the Golden Path</h2>
<p>Platform Engineering didn't emerge to "run servers" on behalf of developers. It emerged to build an <strong>Internal Developer Platform (IDP)</strong> — a product that absorbs infrastructure complexity and delivers a self-service interface.</p>
<p>The goal is the <em>Golden Path</em>: a set of automated, pre-approved, secure, and observable-by-default workflows. If you follow the Golden Path, everything just works. If you deviate, you need to justify it.</p>
<blockquote>
<p>💡 <strong>What does a 2026 self-service workflow actually look like?</strong></p>
<p>A developer opens their internal portal and requests a new microservice. Within 5 minutes they have: a dev environment deployed on K8s, an isolated database provisioned via Crossplane, traces and logs enabled with OpenTelemetry, and GitOps sync active against their repo. Zero infra tickets. Zero manual YAML.</p>
</blockquote>
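<p>To make that concrete: the "isolated database provisioned via Crossplane" step usually means the portal template commits a claim to Git. This is a sketch with an illustrative API group and claim kind; the real ones depend on the XRDs your platform defines:</p>
<pre><code class="lang-yaml"># Hypothetical Crossplane claim committed by the portal template.
# API group, kind, and parameters are illustrative.
apiVersion: platform.example.org/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: orders-db
  namespace: team-orders
spec:
  parameters:
    storageGB: 20
  compositionSelector:
    matchLabels:
      provider: aws
  writeConnectionSecretToRef:
    name: orders-db-conn
</code></pre>
<p>From there, GitOps syncs the claim and Crossplane reconciles it into a real database. No ticket, no console clicking.</p>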
<hr />
<h2>Your CNCF Toolbox for 2026</h2>
<p>If you're a Platform Engineer, the CNCF is your toolbox. Here's how you connect the dots to build your software factory:</p>
<table>
<thead>
<tr>
<th>Component</th>
<th>Key Technologies</th>
<th>Why It Matters</th>
</tr>
</thead>
<tbody><tr>
<td>The Portal (Frontend)</td>
<td>Backstage</td>
<td>The "storefront" of your platform. A central hub for software catalogs and self-service templates.</td>
</tr>
<tr>
<td>Infra as Code / API</td>
<td>Crossplane</td>
<td>Manage cloud resources (RDS, S3, GCS) using the Kubernetes API. Turns your infra into K8s objects.</td>
</tr>
<tr>
<td>Networking &amp; Security</td>
<td>Cilium / Istio</td>
<td>Fast, visible, and secure networking (mTLS) from day one.</td>
</tr>
<tr>
<td>Unified Observability</td>
<td>OpenTelemetry</td>
<td>A common language for metrics, logs, and traces. Essential for explainability.</td>
</tr>
<tr>
<td>Automated Policy</td>
<td>Kyverno / OPA</td>
<td>Governance as code. Blocks insecure deployments before they ever hit production.</td>
</tr>
<tr>
<td>GitOps Delivery</td>
<td>ArgoCD / Flux</td>
<td>The heartbeat of the platform. Git as the single source of truth.</td>
</tr>
</tbody></table>
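<p>To ground the "governance as code" row: a minimal Kyverno policy that rejects any Pod using an image pinned to <code>:latest</code> looks roughly like this. It's a sketch of Kyverno's well-known sample policy, not a production-ready ruleset:</p>
<pre><code class="lang-yaml"># Governance as code: block mutable :latest tags at admission time.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-latest-tag
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-pinned-image-tag
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must use a pinned tag, not :latest."
        pattern:
          spec:
            containers:
              - image: "!*:latest"
</code></pre>
<p>Apply it once and the rule holds cluster-wide: the review happens at admission time, not in a human's head.</p>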
<hr />
<h2>The Next Challenge: The Agentic Platform</h2>
<p>This is where the conversation gets interesting — and where most Platform Engineers aren't looking yet.</p>
<p>This week, during KubeCon EU Amsterdam, the CNCF published a standards paper for agentic AI in cloud native environments. The document — produced by the AI TCG — doesn't talk about models or data science. It talks about <strong>how Kubernetes must orchestrate autonomous agents</strong> securely, observably, and with governance from day one.</p>
<blockquote>
<p>📄 <strong>Reference · CNCF AI TCG — Cloud Native Agentic Standards (March 2026)</strong></p>
<p>The paper defines four pillars for agentic systems on K8s: <strong>General (containers)</strong>, <strong>Control &amp; Communication</strong> (MCP, A2A, SPIFFE), <strong>Observability</strong> (OTel + token metrics + agent traces), and <strong>Governance + Security</strong> (agent identity, JIT access, tenancy isolation).</p>
<p>Read the full document at <a href="https://www.cncf.io/blog/2026/03/23/cloud-native-agentic-standards/">cncf.io</a>.</p>
</blockquote>
<p>The protocols emerging as standards — <strong>MCP</strong> (from Anthropic, now donated to the Agentic AI Foundation) and <strong>A2A</strong> (from Google for peer-to-peer agent communication) — are already vocabulary that a 2026 Platform Engineer needs to understand.</p>
<p>Why does this matter to Platform Engineering? Because AI Agents are not a separate layer living "in some vendor's cloud." They are workloads. They run in containers. They need ephemeral identities (SPIFFE/SVID), namespace isolation, NetworkPolicies, and observability of LLM inference tokens.</p>
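<p>Concretely, the baseline is the same as for any other untrusted workload. A default-deny NetworkPolicy for a namespace running agents is standard Kubernetes (the namespace name here is illustrative):</p>
<pre><code class="lang-yaml"># Default-deny for an agents namespace: no ingress, no egress,
# until the platform grants specific exceptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
  namespace: ai-agents
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress
</code></pre>
<p>Everything an agent is allowed to reach, such as an MCP server or a model endpoint, then has to be granted explicitly. That is exactly the posture you want for an autonomous workload.</p>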
<blockquote>
<p>⚠️ <strong>What nobody is telling you</strong></p>
<p>If your IDP today has no identity model for agents, if your observability pipelines don't track LLM tokens, if your Kyverno policies don't account for MCP tool access — you already have technical debt in the 2026 stack. This isn't science fiction. KubeRay and Volcano are already optimizing how K8s handles these workloads.</p>
</blockquote>
<hr />
<h2>Kubernetes Isn't Going Anywhere. You Need to Grow.</h2>
<p>The message here isn't that DevOps is dead. It's that it <strong>matured</strong>. Platform Engineering is the next logical chapter of the same story: less manual heroism, more self-governing systems.</p>
<p>Kubernetes is the operating system of modern infrastructure. Nobody says Linux "went out of style." The same should be true of Kubernetes. What changes is the level of abstraction from which we operate. And for those of us building platforms, that is exactly the challenge that makes this work fascinating.</p>
<p>Next year, the most mature platforms won't just be the ones with the best DX for human developers. They'll be the ones capable of orchestrating autonomous developers — agents — with the same rigor, security, and observability we apply today to any microservice.</p>
<p>Are you building that platform, or are you still debugging YAML?</p>
<hr />
<p><em>Want to continue this conversation in person? On April 18th we're hosting <strong>KCD Guadalajara 2026</strong> at the Holiday Inn Guadalajara Expo. Platform Engineering, agentic AI, cloud native security, and the entire CNCF community of Mexico — all in one place.</em></p>
<p><em>👉 <a href="https://community.cncf.io/events/details/cncf-kcd-guadalajara-presents-kcd-guadalajara-2026/">community.cncf.io/kcd-guadalajara-2026</a></em></p>
<hr />
<p><em>#kubernetes #platformengineering #cncf #cloudnative #devops #mcp #agenticai #idp #kcdguadalajara</em></p>
]]></content:encoded></item><item><title><![CDATA[The Platform Is the Architecture]]></title><description><![CDATA[The Platform Is the Architecture
There's a distinction that comes up a lot in platform engineering conversations, and it's worth making explicit.
There's a difference between a team that operates infrastructure and a team that designs a platform.
Bot...]]></description><link>https://parraletz.dev/the-platform-is-the-architecture</link><guid isPermaLink="true">https://parraletz.dev/the-platform-is-the-architecture</guid><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[software architecture]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Tue, 24 Mar 2026 16:10:40 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-the-platform-is-the-architecture">The Platform Is the Architecture</h1>
<p>There's a distinction that comes up a lot in platform engineering conversations, and it's worth making explicit.</p>
<p>There's a difference between a team that <em>operates</em> infrastructure and a team that <em>designs</em> a platform.</p>
<p>Both teams run Kubernetes. Both teams write Terraform. Both teams deal with incidents, manage clusters, and keep things running. From the outside, they look similar. From the inside, they make decisions in fundamentally different ways.</p>
<p>The operating team asks: <em>is this working?</em>
The platform team asks: <em>why is this designed this way, and what does that design decision cost us over time?</em></p>
<p>This series has been about developing the second kind of thinking — and about the vocabulary that makes it possible.</p>
<hr />
<h2 id="heading-what-we-covered">What we covered</h2>
<p>Six posts. Six patterns. Each one a different lens on a problem that platform engineers face every day.</p>
<p><strong>Factory and Builder</strong> — environment provisioning without a creation model is slow, inconsistent, and doesn't scale. When you centralize creation logic, you get consistency, self-service, and the ability to evolve the implementation without changing the interface.</p>
<p><strong>Facade</strong> — complexity leaks when there's no abstraction layer. The platform is a facade. The question isn't whether to build one — it's how opinionated to make it, and who it's designed for.</p>
<p><strong>Observer</strong> — manual reconciliation doesn't scale. The Kubernetes controller loop is the Observer pattern at cluster scale. Toil is a symptom of missing observers. Name the toil, find the missing observer, build or adopt it.</p>
<p><strong>Command</strong> — pipelines that apply changes directly are fast and invisible. GitOps decouples the command from its execution, and that decoupling gives you auditability, reversibility, and drift detection almost for free.</p>
<p><strong>Adapter and Glue</strong> — every platform connects systems that weren't designed to talk to each other. That connective tissue is glue. In most platforms, glue is 60-80% of what the team actually builds and maintains. The question isn't whether you'll write it — it's whether you'll design it intentionally.</p>
<p><strong>Strategy</strong> — deployment strategy hardcoded into pipelines is a hidden coupling that makes platforms less reliable and harder to evolve. When strategy is a first-class platform concern — declarative, observable, interchangeable — the whole delivery system gets more consistent.</p>
<hr />
<h2 id="heading-the-thread-that-connects-them">The thread that connects them</h2>
<p>Looking back at these six patterns, there's a common thread.</p>
<p>Every one of them is about the same thing: <strong>separating concerns that are accidentally coupled, and making that separation explicit.</strong></p>
<p>Factory separates <em>what</em> gets created from <em>how</em> it's created. Facade separates the consumer's interface from the system's complexity. Observer separates the thing that changes from the things that react to it. Command separates the definition of a change from its execution. Adapter separates incompatible interfaces without modifying either. Strategy separates the delivery mechanism from the deployment approach.</p>
<p>This isn't a coincidence. The Gang of Four didn't invent these patterns — they documented solutions that experienced engineers kept independently arriving at, because these couplings cause the same problems everywhere. In codebases. In distributed systems. In platforms.</p>
<p>The platform is a system. It has the same structural problems that software has. And it responds to the same structural solutions.</p>
<hr />
<h2 id="heading-from-glue-spaghetti-to-platform-control-plane">From glue spaghetti to platform control plane</h2>
<p>Earlier in this series, we talked about glue — the code that connects systems that weren't designed to talk to each other. And we mentioned an evolution that a lot of platform teams go through:</p>
<pre><code>scripts/
  deploy.sh
  create-service.sh
  add-monitor.sh
  update-ingress.sh
</code></pre><p>This is where most platforms start. It's not wrong — it's the natural first step when you're moving fast and the priority is getting things working.</p>
<p>But at some point, the scripts become a liability. They accumulate. They drift. They're owned by whoever wrote them last. They break in ways that are hard to debug because they have no reconciliation semantics, no observability, no clear interfaces.</p>
<p>The teams that scale past this point aren't necessarily the ones with the best tools. They're the ones that start treating the platform as a product with an architecture — not just a collection of scripts that happen to run in the right order.</p>
<p>That shift looks like:</p>
<ul>
<li>Glue scripts become operators with proper reconciliation semantics</li>
<li>Ad-hoc pipelines become platform APIs with clear contracts</li>
<li>Scattered strategy implementations become a first-class delivery layer</li>
<li>Manual processes become observers that react automatically</li>
<li>Implicit conventions become explicit facades that abstract complexity</li>
</ul>
<p>This is what a platform control plane looks like. Not a specific tool. A specific way of thinking about what the platform is and how it should behave.</p>
<p>And it maps almost exactly to the patterns in this series.</p>
<hr />
<h2 id="heading-the-vocabulary-is-the-point">The vocabulary is the point</h2>
<p>I want to be honest about what this series was and wasn't trying to do.</p>
<p>It wasn't trying to teach you Kubernetes. You already know Kubernetes. It wasn't trying to convince you to use Argo Rollouts or Crossplane or External Secrets Operator. Those are good tools, but the patterns exist independently of the tools.</p>
<p>What it was trying to do is give you vocabulary.</p>
<p>Vocabulary matters for a few reasons.</p>
<p>It makes architectural decisions explicit. When your team debates whether to expose Kubernetes directly to developers, "how much facade do we build" is a more productive framing than "should we use Backstage or not." The pattern gives the conversation a shape.</p>
<p>It helps you diagnose problems. When your pipelines are slow and tightly coupled, that's not just a pipeline problem — it's a symptom of missing separation of concerns. When you name the pattern that's missing or being violated, the solution space becomes clearer.</p>
<p>It connects you to a larger body of knowledge. The Gang of Four documented these patterns in 1994. Thirty years of software engineers have written about them, extended them, and argued about their limits. That literature is available to platform engineers too — it just needs translation.</p>
<p>And it levels the conversation. When a platform engineer and a software architect talk about the same problem using the same vocabulary, they can build on each other's thinking instead of talking past each other. The platform stops being "the infra people's problem" and starts being a shared architectural concern.</p>
<hr />
<h2 id="heading-what-comes-next">What comes next</h2>
<p>This series ends here, but the topic doesn't.</p>
<p>We covered six patterns from the structural and behavioral categories — the ones most directly applicable to platform engineering. There's more to explore: architectural styles like event-driven systems and how they map to platform data planes, distributed systems patterns and how they show up in multi-cluster architectures, and the evolution from platform-as-infrastructure to platform-as-product.</p>
<p>If any of these posts sparked a question, a disagreement, or a "this is exactly what we're dealing with right now" — I'd genuinely like to hear about it.</p>
<p>Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>. That's where the real conversation happens.</p>
]]></content:encoded></item><item><title><![CDATA[Your Deployment Strategy Shouldn't Live in Your Pipeline]]></title><description><![CDATA[Your Deployment Strategy Shouldn't Live in Your Pipeline
Here's a situation most platform teams have been in.
A team wants to do a canary deployment. So they modify the pipeline. They add a step that deploys to 10% of traffic, waits, checks some metr...]]></description><link>https://parraletz.dev/your-deployment-strategy-shouldnt-live-in-your-pipeline</link><guid isPermaLink="true">https://parraletz.dev/your-deployment-strategy-shouldnt-live-in-your-pipeline</guid><category><![CDATA[argo-rollouts]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[progressive delivery]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Fri, 20 Mar 2026 15:10:17 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-your-deployment-strategy-shouldnt-live-in-your-pipeline">Your Deployment Strategy Shouldn't Live in Your Pipeline</h1>
<p>Here's a situation most platform teams have been in.</p>
<p>A team wants to do a canary deployment. So they modify the pipeline. They add a step that deploys to 10% of traffic, waits, checks some metric, then either proceeds or rolls back. It works. Six months later, a different team wants the same thing — but their pipeline is structured differently, so they write their own version. Then someone wants blue/green. Another pipeline modification. Then progressive delivery with automated analysis. Another version, in another pipeline, maintained by another team.</p>
<p>You now have five different implementations of deployment strategies scattered across your CI/CD system. No two of them behave quite the same way. All of them are someone's responsibility to maintain. And when something goes wrong mid-deployment, the person on call has to figure out which version of the strategy they're dealing with before they can even start debugging.</p>
<p>This is what happens when deployment strategy is hardcoded into the pipeline instead of being a first-class concern of the platform.</p>
<hr />
<h2 id="heading-the-problem">The problem</h2>
<p>A deployment pipeline has two distinct jobs that are often collapsed into one.</p>
<p>The first job is delivery: take the artifact, get it to the cluster. The second job is strategy: decide <em>how</em> it gets there — all at once, gradually, with traffic splitting, with automated rollback triggers.</p>
<p>When these two jobs live in the same pipeline, the strategy becomes tightly coupled to the delivery mechanism. Changing the strategy means changing the pipeline. Testing a new approach means modifying infrastructure that other things depend on. And because pipelines are usually owned by individual teams rather than the platform, you end up with strategy logic that's duplicated, inconsistent, and invisible to anyone outside that team.</p>
<p>The deeper problem is that strategy decisions are cross-cutting. They affect reliability, blast radius, rollback time, and developer confidence. They deserve to be made once, implemented well, and available to every team — not reinvented in every pipeline.</p>
<hr />
<h2 id="heading-why-it-hurts">Why it hurts</h2>
<p>Imagine a deployment goes wrong mid-canary. Traffic is split 20/80 between the new version and the old one. The new version has a bug. You need to roll back.</p>
<p>If the canary logic lives in the pipeline, rollback means re-running the pipeline in reverse — or manually adjusting traffic weights in whatever tool manages them, if you can even find where that is. The pipeline ran its steps and moved on. It's not watching. It's not reacting. It doesn't know the deployment is in a bad state.</p>
<p>Now multiply this by the number of services your platform supports. Each one with its own pipeline, its own strategy implementation, its own rollback procedure. Your platform's deployment reliability is only as good as the worst pipeline in the system.</p>
<p>And when a postmortem asks "why did we deploy 100% of traffic at once to a service that had no rollback plan" — the answer is usually "because the pipeline didn't have that logic and nobody thought to add it."</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Strategy pattern says: define a family of algorithms, encapsulate each one, and make them interchangeable. The caller selects which strategy to use. The strategy handles the implementation. Neither needs to know the details of the other.</p>
<p>In software, this pattern appears in sorting algorithms, payment processors, compression formats — anywhere you have multiple ways to accomplish the same goal and want to switch between them without changing the code that uses them.</p>
<p>In platform engineering, deployment strategies are exactly this. Rolling update, blue/green, canary, progressive delivery with automated analysis — these are a family of algorithms for getting a new version of software in front of users. They have different risk profiles, different rollback characteristics, different observability requirements. But from the application's perspective, the goal is the same: deploy this version.</p>
<p>The Strategy pattern says that choice should be declarative and interchangeable — not hardcoded into a pipeline.</p>
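<p>Stripped to its essentials, the pattern is small. The sketch below is illustrative; the interface and names are invented for this post, not taken from any real tool's API:</p>
<pre><code class="lang-typescript">// Each deployment strategy is an interchangeable object behind one interface.
interface DeploymentStrategy {
  name: string
  // The traffic weights (% routed to the new version) the rollout steps through.
  trafficSteps(): number[]
}

class RollingUpdate implements DeploymentStrategy {
  name = "rolling"
  trafficSteps(): number[] {
    return [100] // everything at once, pod by pod
  }
}

class Canary implements DeploymentStrategy {
  name = "canary"
  constructor(private steps: number[]) {}
  trafficSteps(): number[] {
    return this.steps
  }
}

// The delivery code selects a strategy; it never changes when strategies do.
function planRollout(strategy: DeploymentStrategy): string[] {
  return strategy.trafficSteps().map(w =&gt; `${strategy.name}: route ${w}% to new version`)
}

planRollout(new Canary([20, 50, 100]))
</code></pre>
<p>Swapping <code>Canary</code> for <code>RollingUpdate</code> changes nothing in <code>planRollout</code>. That is the whole pattern: the caller declares which strategy, the strategy owns the how.</p>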
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>Argo Rollouts</strong> is the clearest implementation of this pattern in the Kubernetes ecosystem. You define a <code>Rollout</code> resource instead of a standard <code>Deployment</code>, and you declare your strategy in the spec:</p>
<pre><code class="lang-yaml"><span class="hljs-attr">strategy:</span>
  <span class="hljs-attr">canary:</span>
    <span class="hljs-attr">steps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">20</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">pause:</span> {<span class="hljs-attr">duration:</span> <span class="hljs-string">10m</span>}
    <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">50</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">pause:</span> {<span class="hljs-attr">duration:</span> <span class="hljs-string">10m</span>}
    <span class="hljs-bullet">-</span> <span class="hljs-attr">setWeight:</span> <span class="hljs-number">100</span>
</code></pre>
<p>The application doesn't know what strategy is being used. The pipeline doesn't implement the strategy — it just triggers the rollout. Argo Rollouts is the executor that watches the rollout progress, manages traffic weights, and reacts to what it sees.</p>
<p>Switching from canary to blue/green is a change to the <code>Rollout</code> spec. Not a pipeline rewrite. Not a new set of scripts. A declarative change to the strategy definition.</p>
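<p>For comparison, the blue/green version of the same spec is just a different strategy block (the service names here are illustrative):</p>
<pre><code class="lang-yaml"># Same Rollout, different strategy: only this block changes.
strategy:
  blueGreen:
    activeService: my-service-active
    previewService: my-service-preview
    autoPromotionEnabled: false
</code></pre>
<p>Delivery stays identical; only the declared strategy changes. That interchangeability is the pattern doing its job.</p>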
<p><strong>Flagger</strong> takes the same idea and adds automated analysis. It watches metrics from Prometheus, Datadog, or other sources during a canary deployment and makes promotion decisions automatically. The strategy isn't just declared — it's reactive. If error rates spike, the rollout stops. If latency degrades, it rolls back. The strategy responds to real system state rather than waiting for a human to notice.</p>
<p><strong>Feature flags</strong> are a strategy pattern applied at the application layer. Tools like LaunchDarkly, Unleash, or Flagsmith let you decouple the deployment of code from the activation of features. You deploy 100% of traffic — but only 10% of users see the new behavior. The strategy is controlled by the platform, not the pipeline.</p>
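<p>The mechanics behind a percentage rollout are simple enough to sketch. This is a vendor-neutral illustration of the idea; real flag systems add targeting rules, persistence, and a control plane on top:</p>
<pre><code class="lang-typescript">// Gate a feature by user: deploy 100% of the code, but show the new
// behavior to only `rolloutPercent` of users. Hashing the user id
// keeps the decision stable: a given user always gets the same answer.
function isFeatureEnabled(rolloutPercent: number, userId: string): boolean {
  let hash = 0
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) &gt;&gt;&gt; 0 // simple stable hash
  }
  return rolloutPercent &gt; hash % 100 // user falls in the enabled bucket
}

isFeatureEnabled(10, "user-123") // same user, same answer, every call
</code></pre>
<p>Raising the rollout from 10% to 50% is a config change in the flag system, not a redeploy. The strategy lives in the platform; the pipeline just ships code.</p>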
<p>The common thread: in all of these, the strategy is a separate, interchangeable concern. It's not woven into the delivery mechanism. It can be changed, tested, and evolved independently.</p>
<hr />
<h2 id="heading-hardcoding-strategy-is-an-antipattern">Hardcoding strategy is an antipattern</h2>
<p>When deployment strategy lives in the pipeline, you've created a hidden coupling that's easy to miss until it causes a problem.</p>
<p>The pipeline owns delivery <em>and</em> strategy. Changing one risks breaking the other. Testing a new strategy means modifying production infrastructure. Rolling back a failed strategy change means a pipeline change, not a config change. And visibility into what strategy a given service uses requires reading pipeline code — not a platform API or a manifest in Git.</p>
<p>This is the antipattern. Not because pipelines are bad, but because strategy is a platform concern that's been pushed into a delivery concern. The blast radius of a mistake is larger than it needs to be. The iteration speed on strategy improvements is slower than it needs to be.</p>
<p>Separating strategy from delivery — making it declarative, observable, and interchangeable — is one of the higher-leverage improvements a platform team can make to their deployment reliability.</p>
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>When strategy is a first-class platform concern, a few things shift.</p>
<p>Reliability becomes consistent. Every service that uses the platform gets the same deployment strategy primitives — canary, blue/green, progressive delivery — without each team having to implement them. The platform team implements once. Every team benefits.</p>
<p>Iteration gets faster. When a team wants to try a more conservative canary rollout, they change a few lines in their <code>Rollout</code> spec. They don't open a ticket to the platform team. They don't modify a shared pipeline. They make a declarative change and the platform handles the rest.</p>
<p>Postmortems get cleaner. When a deployment goes wrong, the strategy execution is visible in the platform — Argo Rollouts has an API, a UI, a clear audit trail. The question "what happened during the rollout" has an answer that doesn't require reading pipeline logs.</p>
<p>The goal isn't to make deployment more complex. It's to make the complexity live in the right place — a dedicated, observable, interchangeable strategy layer — rather than scattered across dozens of pipelines that each solve the same problem in a slightly different way.</p>
<hr />
<p><em>Does your platform have a consistent deployment strategy story, or is it still living in individual pipelines? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[The Glue Is the Platform]]></title><description><![CDATA[The Glue Is the Platform
Here's something that doesn't show up in architecture diagrams.
You have Kubernetes. You have Terraform. You have GitHub Actions, ArgoCD, Datadog, Vault, and a developer portal. Each of those tools is well-documented, widely ...]]></description><link>https://parraletz.dev/the-glue-is-the-platform</link><guid isPermaLink="true">https://parraletz.dev/the-glue-is-the-platform</guid><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[integrations]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Mon, 16 Mar 2026 15:10:19 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-the-glue-is-the-platform">The Glue Is the Platform</h1>
<p>Here's something that doesn't show up in architecture diagrams.</p>
<p>You have Kubernetes. You have Terraform. You have GitHub Actions, ArgoCD, Datadog, Vault, and a developer portal. Each of those tools is well-documented, widely adopted, and excellent at what it does.</p>
<p>None of them were designed to talk to each other.</p>
<p>So you write code that connects them. Code that translates a GitHub push into a Terraform run. Code that creates a Datadog monitor when a new service is deployed. Code that bridges Vault secrets into Kubernetes. Code that turns a Backstage template into a repo, a pipeline, a namespace, and an ArgoCD application.</p>
<p>That code is glue. And in most platforms, it's not a minor detail — it's the majority of what the platform actually is.</p>
<hr />
<h2 id="heading-the-problem">The problem</h2>
<p>Platform teams spend a lot of time talking about the tools. Kubernetes vs. Nomad. ArgoCD vs. Flux. Terraform vs. Pulumi. The tools are visible, they have logos, they have docs, they have Slack communities.</p>
<p>The glue is invisible. It lives in <code>scripts/</code> directories, in Lambda functions nobody documented, in GitHub Actions workflows that have grown to 400 lines, in CronJobs that someone wrote two years ago, that still run in production, and that nobody fully understands anymore.</p>
<p>The problem isn't that glue exists. Glue is inevitable — any platform that connects more than two systems will have it. The problem is that most teams treat glue as an afterthought rather than as a first-class part of the platform. It accumulates without design. It grows without standards. And eventually it becomes the thing that breaks most often and is hardest to fix.</p>
<hr />
<h2 id="heading-what-glue-actually-is">What glue actually is</h2>
<p>In platform engineering, glue is any code or automation that:</p>
<ul>
<li>doesn't contain business logic</li>
<li>doesn't belong to any specific tool</li>
<li>exists solely to integrate systems</li>
</ul>
<p>It shows up in many forms. Scripts that generate manifests and inject variables. Webhooks that trigger downstream workflows. Small services that transform data between formats. Operators that watch Kubernetes resources and call external APIs. Lambda functions that bridge cloud events to internal systems.</p>
<p>Take a simple example. A developer pushes code. You want your internal platform API to know about it — who pushed, what SHA, what repo — so it can update the deployment record, trigger downstream automation, and eventually show up in your developer portal.</p>
<p>GitHub doesn't know about your platform API. Your platform API doesn't know about GitHub. So you write glue:</p>
<pre><code class="lang-typescript"><span class="hljs-keyword">import</span> { Octokit } <span class="hljs-keyword">from</span> <span class="hljs-string">"@octokit/rest"</span>
<span class="hljs-keyword">import</span> axios <span class="hljs-keyword">from</span> <span class="hljs-string">"axios"</span>

<span class="hljs-keyword">async</span> <span class="hljs-function"><span class="hljs-keyword">function</span> <span class="hljs-title">notifyDeployment</span>(<span class="hljs-params">repo: <span class="hljs-built_in">string</span>, sha: <span class="hljs-built_in">string</span></span>) </span>{
  <span class="hljs-keyword">const</span> octokit = <span class="hljs-keyword">new</span> Octokit({ auth: process.env.GITHUB_TOKEN })

  <span class="hljs-keyword">const</span> commit = <span class="hljs-keyword">await</span> octokit.repos.getCommit({
    owner: <span class="hljs-string">"org"</span>,
    repo,
    ref: sha
  })

  <span class="hljs-keyword">await</span> axios.post(<span class="hljs-string">"https://platform-api/deployments"</span>, {
    repo,
    sha,
    author: commit.data.commit.author?.name
  })
}
</code></pre>
<p>This code is not part of GitHub. It's not part of Kubernetes. It's not part of Datadog. It exists only to connect them. That's glue.</p>
<hr />
<h2 id="heading-where-glue-lives-in-a-real-platform">Where glue lives in a real platform</h2>
<p>A typical delivery flow looks something like this:</p>
<pre><code>GitHub push
    ↓
GitHub Actions
    ↓
Terraform / Helm
    ↓
Kubernetes
    ↓
ArgoCD
    ↓
Datadog
    ↓
Slack notification
</code></pre><p>Everything in that diagram is a well-known tool. But everything <em>between</em> those tools — the thing that takes output from one and feeds it to the next, that translates formats, that triggers the next step, that handles errors in the transition — is glue.</p>
<p>In practice, platform glue tends to cluster around five areas:</p>
<p><strong>CI/CD glue</strong> — scripts and pipeline steps that connect your version control to your delivery system. Generating manifests from templates. Injecting environment-specific variables. Validating policy before apply. Running post-deploy checks.</p>
<p><strong>GitOps glue</strong> — the translation layer between developer intent and the manifests that ArgoCD or Flux actually apply. A <code>service.yaml</code> that describes what a developer wants gets translated into Helm values, which become an Argo Rollout, which triggers a Datadog monitor creation. Each step in that chain is glue.</p>
<p><strong>Observability glue</strong> — automation that creates dashboards, alerts, and SLOs when a new service is registered. Without this, observability setup is manual and inconsistent. With it, every service gets a standard monitoring baseline automatically.</p>
<p><strong>Security glue</strong> — the code that bridges Vault, IAM, Kubernetes service accounts, and CI/CD pipelines. External Secrets Operator is a well-known example of productized security glue, but most platforms have additional custom pieces connecting their specific identity model to their specific secret stores.</p>
<p><strong>Developer experience glue</strong> — the automation behind an IDP action. When a developer creates a service in Backstage, something has to run the repo template, call Terraform, register the ArgoCD application, and create the initial deployment. That workflow is pure glue logic.</p>
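<p>To make one of these concrete, here's a minimal sketch of observability glue in TypeScript. Every name in it (the fields, the query strings, the thresholds) is illustrative rather than a real Datadog API; the point is that the monitoring baseline is computed from the service record instead of hand-built per service.</p>

```typescript
// Sketch of observability glue: given a newly registered service,
// produce the standard monitoring baseline every service should get.
// Field names, queries, and thresholds are illustrative, not a real API.

interface ServiceRecord {
  name: string
  team: string
  tier: "critical" | "standard"
}

interface MonitorSpec {
  name: string
  query: string
  threshold: number
  notify: string
}

function baselineMonitors(svc: ServiceRecord): MonitorSpec[] {
  // Critical services get a tighter error budget than standard ones.
  const errorThreshold = svc.tier === "critical" ? 0.01 : 0.05
  const channel = "#team-" + svc.team

  return [
    {
      name: svc.name + " error rate",
      query: "sum:http.errors{service:" + svc.name + "}.as_rate()",
      threshold: errorThreshold,
      notify: channel
    },
    {
      name: svc.name + " p95 latency",
      query: "p95:http.request.duration{service:" + svc.name + "}",
      threshold: 500, // milliseconds
      notify: channel
    }
  ]
}
```

<p>Register a service, feed it through a function like this, post the result to your monitoring API: that loop is what turns observability setup from a manual checklist into a property of the platform.</p>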
<hr />
<h2 id="heading-the-adapter-pattern-a-frame-for-thinking-about-glue">The Adapter pattern: a frame for thinking about glue</h2>
<p>The Adapter pattern from software design is the closest conceptual frame for what glue does. It says: when two systems have incompatible interfaces, build a component that translates between them without changing either system.</p>
<p>That's exactly what glue is. ESO is an adapter between secret managers and Kubernetes. KEDA is an adapter between external event sources and the Kubernetes scaling API. Your <code>notifyDeployment</code> function is an adapter between GitHub and your platform API.</p>
<p>Thinking about glue as adapters is useful because it sets a design standard. A well-designed adapter:</p>
<ul>
<li>translates between two specific interfaces without changing either</li>
<li>reconciles continuously rather than running once and hoping</li>
<li>handles the source being unavailable gracefully</li>
<li>surfaces errors through observable, standard mechanisms</li>
<li>is maintainable by someone other than the person who wrote it</li>
</ul>
<p>Most ad-hoc glue fails on at least three of those. That's not a criticism — it's a pattern. Glue starts as a quick script because the need is urgent, and it evolves into a liability because nobody went back to design it properly.</p>
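<p>Here's what "reconciles continuously" and "handles the source being unavailable gracefully" look like at their smallest, as a TypeScript sketch. The state shapes are hypothetical; a real adapter would run this comparison in a loop and apply the returned actions to the target system.</p>

```typescript
// Sketch of an adapter's reconcile step: compare desired state (from
// the source system) against actual state (in the target system) and
// return the actions that close the gap.

type State = Record<string, string>

type Action =
  | { kind: "create"; key: string; value: string }
  | { kind: "update"; key: string; value: string }
  | { kind: "delete"; key: string }

function reconcile(desired: State | null, actual: State): Action[] {
  // Source unavailable: do nothing, rather than deleting everything
  // because the desired state "looks empty".
  if (desired === null) return []

  const actions: Action[] = []
  for (const [key, value] of Object.entries(desired)) {
    if (!(key in actual)) actions.push({ kind: "create", key, value })
    else if (actual[key] !== value) actions.push({ kind: "update", key, value })
  }
  for (const key of Object.keys(actual)) {
    if (!(key in desired)) actions.push({ kind: "delete", key })
  }
  return actions
}
```

<p>The null guard is the part most quick scripts skip, and it's the difference between an adapter and a run-once script that happens to work on good days.</p>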
<hr />
<h2 id="heading-the-60-80-problem">The 60-80% problem</h2>
<p>In practice, a significant portion of what a platform team actually builds and maintains is glue. Not Kubernetes. Not Terraform. The code that connects them.</p>
<p>This is a pattern that shows up repeatedly in mature platform teams: the tools are 20-40% of the work, the integrations between them are the other 60-80%. And yet most platform teams staff, document, and design for the tools — not the glue.</p>
<p>The teams that handle this well tend to go through the same evolution:</p>
<pre><code>scripts/
  deploy.sh
  create-service.sh
  add-monitor.sh
  update-ingress.sh
</code></pre><p>This works at small scale. It stops working when you have ten teams, three environments, and a platform that needs to be reliable rather than just functional.</p>
<p>So the scripts get promoted. Some become operators. Some become controllers. Some become platform services with proper APIs. The glue doesn't disappear — it gets intentional design applied to it. The platform stops being a collection of scripts that connect tools and starts being a system with clear interfaces, reconciliation semantics, and something approaching reliability.</p>
<p>That evolution — from glue spaghetti to platform control plane — is its own topic. But it starts with recognizing that the glue is not a side effect of the platform. In many cases, the glue <em>is</em> the platform.</p>
<hr />
<p><em>What does your glue look like today — scattered scripts, promoted operators, or something in between? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[A Git Commit Is Not Just a Change. It's a Command.]]></title><description><![CDATA[A Git Commit Is Not Just a Change. It's a Command.
Here's a scenario that plays out in a lot of platform teams.
Your CI/CD pipeline runs kubectl apply as a job step. It exits zero. The pipeline goes green. Deployment successful — or so the dashboard ...]]></description><link>https://parraletz.dev/a-git-commit-is-not-just-a-change-its-a-command</link><guid isPermaLink="true">https://parraletz.dev/a-git-commit-is-not-just-a-change-its-a-command</guid><category><![CDATA[ArgoCD]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Fri, 13 Mar 2026 15:10:20 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-a-git-commit-is-not-just-a-change-its-a-command">A Git Commit Is Not Just a Change. It's a Command.</h1>
<p>Here's a scenario that plays out in a lot of platform teams.</p>
<p>Your CI/CD pipeline runs <code>kubectl apply</code> as a job step. It exits zero. The pipeline goes green. Deployment successful — or so the dashboard says. But nobody actually knows if the manifest that was applied was valid, if the pods came up healthy, or if the service is responding. So you add another job to check. And another to verify the rollout. And a retry mechanism for when the check fails. And now your pipeline is a sequence of shell scripts duct-taped together, each one trying to compensate for the fact that the previous step had no way to confirm its own outcome.</p>
<p>And when something goes wrong — which it does — you're cross-referencing pipeline logs, Kubernetes events, and Slack messages trying to reconstruct what actually happened and in what order.</p>
<p>This is the problem that GitOps solves. But the usual explanation — "Git is your source of truth" — doesn't tell you <em>why</em> that works, or why it's architecturally sound rather than just a convention someone decided to follow.</p>
<p>The Command pattern does.</p>
<hr />
<h2 id="heading-the-problem">The problem</h2>
<p>Pipelines that apply changes directly are fast and simple. They're also invisible.</p>
<p>When a CI/CD pipeline runs <code>kubectl apply</code> or <code>terraform apply</code> directly, the action happens and disappears. You have a log entry somewhere, maybe. You have the pipeline run in your CI system, if you know where to look. But the <em>intent</em> — what was supposed to happen, who requested it, what the expected state was — isn't captured anywhere durable.</p>
<p>This creates a class of problems that compound over time. Drift between what's in your repo and what's actually running. No reliable way to roll back a change without re-running a pipeline in reverse. Audit trails that live in your CI system instead of with the infrastructure they describe. And a mental model problem: nobody has a single place to look to understand the current intended state of the system.</p>
<p>The deeper issue is that <strong>executing and recording are coupled</strong>. The change happens and the record of the change is scattered across logs, pipeline runs, and people's memory. When something goes wrong, you're reconstructing history instead of reading it.</p>
<hr />
<h2 id="heading-why-it-hurts">Why it hurts</h2>
<p>Imagine you need to answer a simple question: <em>what changed in the production cluster between Tuesday and Thursday?</em></p>
<p>Without GitOps, that question requires cross-referencing CI pipeline runs, Kubernetes audit logs, Slack messages, and whoever was on call. Each of those sources has partial information. None of them has the full picture. And if the change was a manual <code>kubectl apply</code> that someone ran from their laptop, it might not appear anywhere at all.</p>
<p>Now multiply that across ten services, three environments, and a team of fifteen. The cognitive overhead of understanding the current state of your platform becomes a full-time job. And when incidents happen — which they do — the time you spend reconstructing what changed is time you're not spending fixing the problem.</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Command pattern says: instead of executing an action directly, encapsulate it as an object — something that can be stored, passed around, queued, undone, or audited — and separate the act of <em>defining</em> the command from the act of <em>executing</em> it.</p>
<p>In software, this pattern shows up in undo/redo systems, job queues, and transaction logs. The key insight is that by making the command a first-class thing — not just a function call that disappears — you gain properties you couldn't have otherwise: history, reversibility, auditability, and the ability to replay or defer execution.</p>
<p>In platform engineering, a Git commit in a GitOps repository is exactly this. It's not just a text change. It's a persisted command — with an author, a timestamp, a description of intent, and the ability to be reversed with a <code>git revert</code>. The commit is the command object. ArgoCD or Flux is the executor that reads that command and applies it to the cluster.</p>
<p>The separation is the point. The commit and the apply are two distinct steps, with Git in between acting as a durable, auditable command queue.</p>
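<p>A toy version of that separation, in TypeScript. The append-only log stands in for Git history and the replay function stands in for the executor; this is not how ArgoCD works internally, just the shape of the idea.</p>

```typescript
// Sketch: a commit as a command object. Commands are defined and
// persisted in one step, executed (replayed) in another, and reverted
// by appending an inverse command rather than rewriting history.

interface Command {
  author: string
  message: string
  set: Record<string, string> // desired key/values this command asserts
}

// The "queue": an append-only log of commands, like a Git history.
const log: Command[] = []

function commit(cmd: Command): void {
  log.push(cmd)
}

// The executor derives current desired state by replaying the log.
function executeAll(): Record<string, string> {
  const state: Record<string, string> = {}
  for (const cmd of log) Object.assign(state, cmd.set)
  return state
}

// Revert appends a new command restoring the prior values, the same
// way `git revert` adds a commit instead of deleting one.
function revert(index: number, author: string): void {
  const prior: Record<string, string> = {}
  for (const cmd of log.slice(0, index)) Object.assign(prior, cmd.set)
  const bad = log[index]
  const undo: Record<string, string> = {}
  for (const key of Object.keys(bad.set)) undo[key] = prior[key] ?? ""
  commit({ author, message: "revert: " + bad.message, set: undo })
}
```

<p>Every change has an author and a message, the current state is always derivable from the log, and undo is itself just another command. Those are exactly the properties GitOps buys you.</p>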
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>ArgoCD and Flux</strong> are Command pattern executors. Their job is to watch a Git repository — the command queue — and continuously reconcile the cluster state against the commands stored there. When a new commit arrives, they execute it. When the cluster drifts from the desired state, they re-execute the last command to bring it back.</p>
<p>This is why GitOps gives you drift detection almost for free. The executor is always comparing what the command says against what's actually running. Drift is just a command that hasn't been fully executed yet.</p>
<p><strong>Terraform with remote state</strong> applies the same idea to infrastructure. The <code>.tf</code> files are the commands. The state file is the record of what's been executed. <code>terraform plan</code> shows you what the command would do before you run it. <code>terraform apply</code> executes it. The separation between plan and apply is the Command pattern in action.</p>
<p><strong>Pull Request workflows</strong> add another layer. Before a command is committed to the queue, it goes through review. The PR is the command in draft — visible, discussable, and rejectable before it's persisted. Merge is the moment the command becomes official. This is why "GitOps with PR-based workflows" gives you change management essentially for free, without a separate tool.</p>
<p><strong>Argo Workflows and Tekton</strong> extend this to pipeline definitions. The pipeline itself is a command — defined declaratively, stored in Git, executed by the platform. You're not running imperative scripts. You're defining commands that the platform knows how to execute.</p>
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>The Command framing answers a question that comes up constantly: <em>why GitOps instead of just running <code>kubectl apply</code> in a pipeline?</em></p>
<p>The answer isn't "because GitOps is best practice." The answer is that running <code>kubectl apply</code> in a pipeline couples execution to the pipeline run. The command and its execution happen at the same time and leave no durable record tied to the infrastructure itself. GitOps decouples them — the command lives in Git, the executor applies it, and the two can operate independently.</p>
<p>That decoupling gives you things that are very hard to retrofit:</p>
<ul>
<li><strong>Auditability</strong> — every change has an author, a timestamp, and a description, tied to the infrastructure it describes</li>
<li><strong>Reversibility</strong> — rolling back is a <code>git revert</code>, not a re-run of a pipeline in the right order with the right parameters</li>
<li><strong>Drift detection</strong> — the executor knows what the command says and can tell you when reality disagrees</li>
<li><strong>Separation of concerns</strong> — the team that defines what should be running is decoupled from the mechanism that makes it run</li>
</ul>
<p>Once you see a Git commit as a command object rather than just a file change, the architecture of GitOps stops being a convention and starts being a deliberate design choice with clear properties. And that makes it much easier to explain, defend, and extend.</p>
<hr />
<p><em>Is your team still running direct applies in pipelines? Or have you made the jump to GitOps? Either way — what's the hardest part of the transition? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Your Cluster Is Already Watching. That's the Observer Pattern.]]></title><description><![CDATA[Your Cluster Is Already Watching. That's the Observer Pattern.
At some point, most platform teams hit the same wall.
A service goes down. A secret expires. A node runs out of disk. And the first question in the postmortem is always the same: why did...]]></description><link>https://parraletz.dev/your-cluster-is-already-watching-thats-the-observer-pattern</link><guid isPermaLink="true">https://parraletz.dev/your-cluster-is-already-watching-thats-the-observer-pattern</guid><category><![CDATA[design patterns]]></category><category><![CDATA[gitops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Operators]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Wed, 11 Mar 2026 22:49:45 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-your-cluster-is-already-watching-thats-the-observer-pattern">Your Cluster Is Already Watching. That's the Observer Pattern.</h1>
<p>At some point, most platform teams hit the same wall.</p>
<p>A service goes down. A secret expires. A node runs out of disk. And the first question in the postmortem is always the same: <em>why didn't we know sooner?</em></p>
<p>So you add more monitoring. More alerts. More dashboards. And somewhere along the way, someone suggests: "we should build a controller that watches for this and reacts automatically."</p>
<p>That controller is an Observer. You probably didn't call it that either.</p>
<hr />
<h2 id="heading-the-problem">The problem</h2>
<p>Manual reconciliation doesn't scale.</p>
<p>When your platform grows — more clusters, more teams, more services — the list of things someone needs to watch and react to grows with it. Certificate expiration. Secret rotation. Node pressure. Deployment drift. Config changes that need to propagate across namespaces.</p>
<p>If the reaction to any of these depends on a human noticing and acting, you have a fragile system. Not because your team is slow, but because humans aren't designed to watch dozens of state transitions simultaneously and react to all of them in real time.</p>
<p>The deeper problem is coupling. When your response logic lives in runbooks, scripts, or people's heads, it's tightly coupled to whoever wrote it. It doesn't compose. It doesn't scale. And it breaks in the exact moment you need it most — when things are moving fast and nobody has time to read the runbook.</p>
<hr />
<h2 id="heading-why-it-hurts">Why it hurts</h2>
<p>Think about secret rotation without automation. Someone sets a reminder. The reminder gets missed. The secret expires in production on a Friday at 5pm. Three people get paged. The incident takes two hours to resolve, not because the fix is hard, but because nobody had a clear picture of what was watching what.</p>
<p>Or think about configuration drift. A team manually edits a ConfigMap in staging. Nobody notices it diverged from what's in Git. Three weeks later, a deployment to production behaves differently than expected. The diff is one line. Finding it takes half a day.</p>
<p>These aren't rare edge cases. They're the steady background noise of a platform that reacts to change manually instead of automatically.</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Observer pattern says: when the state of something changes, every component that cares about that change gets notified and can react — without the thing that changed needing to know who's watching.</p>
<p>There are two roles. The <em>subject</em> holds state and emits events when that state changes. The <em>observers</em> subscribe to those events and decide what to do with them. The subject doesn't care how many observers there are or what they do. The observers don't care how the state changed or who triggered it.</p>
<p>The key property is decoupling. The thing that changes and the things that react to that change don't need to know about each other. You can add new observers without touching the subject. You can change how the subject works without touching the observers.</p>
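<p>The two roles fit in a few lines of TypeScript. The "secret expiring" event here is hypothetical, but the structure is exactly what this post describes:</p>

```typescript
// Minimal subject/observer sketch. The subject doesn't know what its
// observers do; the observers don't know who changed the state.

interface SecretEvent {
  name: string
  daysLeft: number
}

type Listener = (event: SecretEvent) => void

class Subject {
  private listeners: Listener[] = []

  subscribe(fn: Listener): void {
    this.listeners.push(fn)
  }

  emit(event: SecretEvent): void {
    // Notify every registered observer, whatever it plans to do.
    for (const fn of this.listeners) fn(event)
  }
}

const secrets = new Subject()
const reactions: string[] = []

// Two independent observers: one alerts, one triggers rotation.
// Adding a third requires no change to the subject.
secrets.subscribe(e => { if (e.daysLeft <= 7) reactions.push("alert:" + e.name) })
secrets.subscribe(e => { if (e.daysLeft <= 7) reactions.push("rotate:" + e.name) })

secrets.emit({ name: "db-password", daysLeft: 3 })
```

<p>Swap the in-memory subject for the Kubernetes API server and the callbacks for controllers, and this is the architecture of your cluster.</p>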
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>The Kubernetes controller loop</strong> is the Observer pattern implemented at cluster scale — and it's the most important example in this entire series.</p>
<p>Every controller in Kubernetes follows the same structure: watch a resource for changes, compare the current state against the desired state, and act to close the gap. The API server is the subject. Controllers are the observers. When you create a Deployment, the Deployment controller observes that event and starts reconciling — creating ReplicaSets, scheduling Pods, updating status.</p>
<p>You didn't wire them together explicitly. The controller registered its interest in Deployments, and the API server notifies it when something changes. That's the pattern.</p>
<p><strong>Operators</strong> extend this to your own domain. When you write a Kubernetes Operator — or adopt one like External Secrets Operator, Cert-Manager, or Argo Rollouts — you're writing an Observer for a specific kind of state change. External Secrets Operator watches for <code>ExternalSecret</code> resources and reacts by fetching secrets from Vault or AWS Secrets Manager and syncing them into Kubernetes Secrets. Cert-Manager watches for <code>Certificate</code> resources and reacts by requesting TLS certificates from Let's Encrypt.</p>
<p>Each of these is an automated response to a state change. No runbook. No human in the loop. The observer watches, the event happens, the reaction fires.</p>
<p><strong>Argo Rollouts</strong> takes this further into deployment strategy. It observes metrics from Prometheus during a canary deployment and reacts automatically — promote if the error rate stays below the threshold, roll back if it spikes. The deployment strategy becomes a set of Observer reactions to real-time system state rather than a manual decision made by an engineer watching a dashboard.</p>
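<p>The promote-or-rollback decision itself can be sketched as a pure function. The sample values and thresholds here are invented; a real rollout pulls them from Prometheus or Datadog through its analysis configuration.</p>

```typescript
// Sketch of a canary analysis decision: roll back on any bad sample,
// promote once enough clean samples have been observed, otherwise
// keep watching. Thresholds and sample counts are illustrative.

type Verdict = "promote" | "rollback" | "continue"

function analyzeCanary(
  errorRates: number[],    // error-rate samples observed so far
  threshold: number,       // maximum acceptable error rate
  requiredSamples: number  // clean samples needed before promoting
): Verdict {
  // Any sample over the threshold fails the canary immediately.
  if (errorRates.some(r => r > threshold)) return "rollback"
  // Not enough evidence yet: keep observing.
  if (errorRates.length < requiredSamples) return "continue"
  return "promote"
}
```

<p>An engineer watching a dashboard is running this function in their head. The Observer version runs it on every new sample, every time, without getting distracted.</p>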
<hr />
<h2 id="heading-when-to-write-one-and-when-not-to">When to write one and when not to</h2>
<p>Knowing this pattern helps you make a decision that comes up more often than it should: <em>should we write a custom operator for this?</em></p>
<p>The answer is usually: only if the thing you're watching and the reaction you need aren't covered by an existing operator, and only if the operational value justifies the maintenance cost.</p>
<p>Writing an operator is committing to a stateful, long-running Observer. It will need to handle edge cases — what happens if it misses an event? What if the reaction fails halfway? What if the resource it's watching gets deleted while it's processing? These aren't reasons not to write one, but they're reasons to reach for an existing operator first and build a custom one only when you have a clear gap.</p>
<p>The pattern also helps you evaluate existing tools more clearly. When you're comparing Cert-Manager vs. manually rotating certificates via a CronJob, you're really comparing "a well-designed Observer with proper reconciliation semantics" vs. "a polling loop that approximates observation." That framing makes the tradeoff obvious.</p>
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>When you see your platform through the Observer lens, the question shifts from <em>"who is responsible for watching this?"</em> to <em>"what should be watching this automatically?"</em></p>
<p>That's not just a semantic difference. It changes how you design for reliability. Instead of writing runbooks that tell humans what to watch and when to react, you build controllers that watch continuously and react consistently. The human moves from being the observer to being the designer of observers.</p>
<p>It also changes how you think about toil. Toil — the repetitive, manual, automatable work that plagues ops teams — is almost always a symptom of missing observers. If someone is manually rotating secrets, there's no secret observer. If someone is manually syncing configs across environments, there's no config observer. If someone is manually checking whether a canary deployment looks healthy enough to promote, there's no metrics observer.</p>
<p>Name the toil. Find the missing observer. Build or adopt it.</p>
<hr />
<p><em>What's something your team is still watching manually that should have an observer by now? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a> — I'd bet the pattern already exists.</em></p>
]]></content:encoded></item><item><title><![CDATA[The Facade Is the Product]]></title><description><![CDATA[The Facade Is the Product
There's a question that comes up in almost every platform team at some point:
Should we expose Kubernetes directly to developers, or should we abstract it?
It sounds like a tooling debate. Backstage vs. raw manifests. Helm c...]]></description><link>https://parraletz.dev/the-facade-is-the-product</link><guid isPermaLink="true">https://parraletz.dev/the-facade-is-the-product</guid><category><![CDATA[internal-developer-platform]]></category><category><![CDATA[Backstage]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[developer experience]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Wed, 11 Mar 2026 22:43:43 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-the-facade-is-the-product">The Facade Is the Product</h1>
<p>There's a question that comes up in almost every platform team at some point:</p>
<p><em>Should we expose Kubernetes directly to developers, or should we abstract it?</em></p>
<p>It sounds like a tooling debate. Backstage vs. raw manifests. Helm charts vs. a self-service portal. YAML vs. a form. But underneath all of that, it's actually a design question — and the Facade pattern is the lens that makes it clearer.</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Facade pattern is straightforward: when a system is too complex for its consumers to deal with directly, you build a simpler interface on top. Consumers interact with the facade. The complexity lives behind it.</p>
<p>The key word is <em>consumers</em>. A facade isn't about hiding things from people who need to know them. It's about not burdening people with complexity that isn't relevant to their job.</p>
<p>Your developers don't need to know which CNI the cluster runs. They don't need to understand how cert-manager issues TLS certificates, or whether secrets come from Vault or AWS Secrets Manager or an ESO sync. That's not their job. Their job is to ship features. Your job, as a platform engineer, is to make sure the infrastructure complexity doesn't get in the way of that.</p>
<p>The platform is the facade. That's not a metaphor — it's the actual pattern.</p>
<hr />
<h2 id="heading-the-cost-of-not-having-one">The cost of not having one</h2>
<p>When there's no facade, the complexity leaks.</p>
<p>Developers end up learning Kubernetes internals just to deploy an app. They copy-paste manifests they don't fully understand. They ping the platform team every time something doesn't work because they don't have the mental model to debug it themselves. The platform team becomes a bottleneck — not because they're slow, but because the interface they've built requires too much context to use independently.</p>
<p>This is what the CNCF calls <em>cognitive load</em>, and reducing it is one of the central arguments for Platform Engineering as a discipline. The Platform Engineering Maturity Model essentially describes a journey from "no facade" to "well-designed facade" — from teams managing raw infrastructure to teams consuming curated, opinionated abstractions.</p>
<p>The Facade pattern gives that journey a name and a shape.</p>
<hr />
<h2 id="heading-the-golden-path-is-a-facade-with-an-opinion">The Golden Path is a facade with an opinion</h2>
<p>The Golden Path concept — popularized by Spotify and now widely referenced in the platform engineering community — is a specific flavor of the Facade pattern.</p>
<p>It's not just an abstraction. It's an <em>opinionated</em> abstraction. You're not trying to support every possible way to deploy an app. You're defining the recommended way, making it the easiest path, and letting teams deviate only when they have a good reason.</p>
<p>That opinion is the point. A facade without an opinion is just indirection — you've added a layer but haven't reduced the decision space. A Golden Path says: we've already made the hard calls about logging, secrets management, resource limits, and deployment strategy. If you follow this path, those problems are solved for you.</p>
<p>The tradeoff is real though. The more opinionated the facade, the less flexibility consumers have. Getting that balance right — tight enough to provide value, loose enough not to become a cage — is one of the harder design problems in platform engineering.</p>
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>Backstage Software Templates</strong> are a facade over your entire delivery stack. The developer sees a form. Behind it: a repo is created, a CI pipeline is wired up, a Kubernetes namespace is provisioned, a service catalog entry is registered. The facade collapses a multi-step, multi-system process into a single interaction.</p>
<p><strong>Helm charts as internal abstractions</strong> are a more granular facade. Instead of exposing raw Kubernetes manifests, you expose a <code>values.yaml</code> with the knobs that actually matter for your context — image tag, replicas, resource tier. Everything else is a sensible default hidden inside the chart.</p>
<p><strong>Port, Cortex, or any internal developer portal</strong> — same pattern, different surface. The portal is the facade. It shows developers what they need: service health, deployment status, ownership, runbooks. It hides what they don't: the underlying observability stack, the alerting rules, the infrastructure topology.</p>
<p>Even something as simple as a well-designed <code>Makefile</code> or a CLI wrapper around your deployment scripts is a facade. You're reducing the interface to the things that matter for the person using it.</p>
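<p>Here's the Helm-values idea reduced to a sketch: a handful of knobs in, a full config out, with the opinions baked into the defaults. All field names here are illustrative, not a real chart's schema.</p>

```typescript
// Sketch of a facade over deployment config: the consumer chooses
// three things, the facade decides everything else.

interface ServiceKnobs {
  name: string
  image: string
  tier?: "small" | "large" // the only sizing decision exposed
}

interface DeploymentConfig {
  name: string
  image: string
  replicas: number
  cpu: string
  memory: string
  logFormat: string
  metricsPath: string
}

function renderDeployment(knobs: ServiceKnobs): DeploymentConfig {
  const tier = knobs.tier ?? "small"
  return {
    name: knobs.name,
    image: knobs.image,
    // Opinionated defaults: the hard calls are made once, behind
    // the facade, instead of once per team.
    replicas: tier === "large" ? 4 : 2,
    cpu: tier === "large" ? "1000m" : "250m",
    memory: tier === "large" ? "1Gi" : "256Mi",
    logFormat: "json",
    metricsPath: "/metrics"
  }
}
```

<p>Notice what the consumer never sees: log formats, metrics conventions, resource math. That's the decision space the facade removes.</p>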
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>Back to the original question: should you expose Kubernetes directly or abstract it?</p>
<p>The Facade pattern reframes it. The real question isn't about Kubernetes — it's about <em>who your consumer is</em> and <em>what complexity is relevant to them</em>.</p>
<p>If your consumers are platform engineers building on top of the platform, some Kubernetes exposure might be appropriate. They have the context to use it well. If your consumers are product developers who need to ship features, exposing Kubernetes is almost always the wrong call. The abstraction isn't hiding something useful from them — it's removing friction that shouldn't exist in the first place.</p>
<p>But the more practical shift is this: once you see the platform as a facade, you start asking different questions. Instead of "how do we document all this so developers can figure it out?", you ask "what should developers never need to figure out in the first place?" Instead of "how do we give developers more access?", you ask "what's the right interface for this consumer?"</p>
<p>The facade you build should match the consumer you're designing for. And that's a product decision, not a tooling one.</p>
<hr />
<p><em>What does your facade look like today — close to the metal, fully abstracted, or somewhere in between? Reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a>.</em></p>
]]></content:encoded></item><item><title><![CDATA[Your Platform Provisions Environments. That's a Factory.]]></title><description><![CDATA[Your Platform Provisions Environments. That's a Factory.
Every platform team eventually faces the same request, over and over again.
A developer needs a new environment — staging, a feature branch, a sandbox for testing. They open a ticket, fill out ...]]></description><link>https://parraletz.dev/your-platform-provisions-environments-thats-a-factory</link><guid isPermaLink="true">https://parraletz.dev/your-platform-provisions-environments-thats-a-factory</guid><category><![CDATA[Backstage]]></category><category><![CDATA[crossplane]]></category><category><![CDATA[design patterns]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Mon, 09 Mar 2026 16:51:05 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-your-platform-provisions-environments-thats-a-factory">Your Platform Provisions Environments. That's a Factory.</h1>
<p>Every platform team eventually faces the same request, over and over again.</p>
<p>A developer needs a new environment — staging, a feature branch, a sandbox for testing. They open a ticket, fill out a form, ping someone on Slack, or run a script they half-understand. A few minutes or a few days later, the environment exists. Or it doesn't. Or it exists but it's missing the secrets. Or the resource limits are wrong. Or it was set up differently than the last one.</p>
<p>This is the problem: <strong>environment provisioning without a clear creation model is slow, inconsistent, and doesn't scale.</strong></p>
<hr />
<h2 id="heading-why-it-hurts">Why it hurts</h2>
<p>When there's no centralized creation logic, every team invents their own process. Some copy-paste Terraform from a previous project. Some run scripts that only the original author understands. Some open tickets and wait.</p>
<p>The result is environments that drift. Staging doesn't match production. The new service has a different secret management setup than everything else. Resource limits are either too tight or nonexistent. And every time something breaks, the platform team has to reverse-engineer how that environment was created in the first place.</p>
<p>Beyond the technical debt, there's a trust problem. Developers stop trusting that the environment they get will actually work. They start maintaining their own workarounds. The platform team loses credibility not because they're incompetent, but because the process itself is unreliable.</p>
<hr />
<h2 id="heading-what-the-pattern-says">What the pattern says</h2>
<p>The Factory pattern says: instead of letting every caller decide how to create something, you centralize that creation logic in one place. The caller says <em>what</em> they want. The factory decides <em>how</em> to build it.</p>
<p>The Abstract Factory goes one level up — it creates families of related objects that are meant to work together. Not just a namespace, but a namespace <em>plus</em> secrets <em>plus</em> network policies <em>plus</em> resource quotas <em>plus</em> an observability entry. Things that belong together get created together, consistently, every time.</p>
<p>In software, this pattern exists to avoid the chaos of scattered instantiation logic. In platform engineering, it exists for the same reason: to avoid the chaos of scattered provisioning logic.</p>
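<p>In code terms, the idea looks something like this. A minimal Python sketch, not any real platform API; the function name and resource list are invented for the example:</p>
<pre><code class="lang-python">from dataclasses import dataclass, field

@dataclass
class Environment:
    name: str
    resources: list = field(default_factory=list)

def environment_factory(name, tier):
    """Abstract Factory sketch: callers say WHAT they want (name, tier);
    the factory decides HOW the family of related resources is built."""
    quotas = {"small": "2 CPU / 4Gi", "standard": "8 CPU / 16Gi"}[tier]
    env = Environment(name)
    # Things that belong together get created together, every time:
    env.resources.append(f"namespace/{name}")
    env.resources.append(f"secrets/{name}-app-secrets")
    env.resources.append(f"networkpolicy/{name}-default-deny")
    env.resources.append(f"resourcequota/{name} ({quotas})")
    env.resources.append(f"observability/{name}-dashboard")
    return env

env = environment_factory("checkout-staging", "standard")
print(env.resources)
</code></pre>
<p>The caller never decides which pieces to create; asking for a different factory output (a different tier) still yields the full, consistent family.</p>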
<hr />
<h2 id="heading-where-you-see-this-in-practice">Where you see this in practice</h2>
<p><strong>Crossplane</strong> is the clearest implementation of this pattern in the cloud-native ecosystem. You define a Composite Resource — say, an <code>AppEnvironment</code> — and Crossplane composes it from lower-level resources: an RDS instance, an S3 bucket, a Kubernetes namespace, an IAM role. The developer asks for an <code>AppEnvironment</code>. They don't know or care what's underneath. The factory handles it.</p>
<p>The <code>AppEnvironment</code> XRD is the interface. The composition is the implementation. You can swap the implementation — move from AWS to GCP, change the database engine — without touching the interface the developer uses.</p>
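<p>For illustration, the developer-facing side of that contract is just a small claim. The API group and spec fields below are hypothetical; your XRD defines the real ones:</p>
<pre><code class="lang-yaml">apiVersion: platform.example.org/v1alpha1   # hypothetical API group
kind: AppEnvironment
metadata:
  name: checkout-staging
spec:
  tier: standard      # the composition maps this to database size, quotas, etc.
  region: us-east-1
</code></pre>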
<p><strong>Backstage Software Templates</strong> work the same way. A developer picks a template — "Python microservice", "data pipeline", "internal tool" — and gets back a repo, a CI/CD pipeline, a Kubernetes manifest, and a service catalog entry. The template is the factory. The developer just chose the product type.</p>
<p><strong>Terraform modules</strong> are the most familiar version of this idea. A well-designed module is a factory for infrastructure — you pass in variables (the <em>what</em>), the module handles the implementation details (the <em>how</em>). The problem is that most teams end up with modules that are too thin, just wrappers around resources, instead of modules that encapsulate a meaningful unit of infrastructure with sensible defaults and guardrails built in.</p>
<hr />
<h2 id="heading-the-builder-pattern-when-creation-has-steps">The Builder pattern: when creation has steps</h2>
<p>The Builder pattern is a close relative. Where Factory focuses on <em>what</em> you're creating, Builder focuses on <em>how</em> you assemble something complex step by step.</p>
<p>Think of a cluster bootstrap pipeline. It doesn't create everything at once. It runs in stages: provision the cluster, install the CNI, configure cert-manager, deploy the ingress controller, register with ArgoCD, apply the base manifests. Each step depends on the previous one. The order matters.</p>
<p>Helm with dependencies, Terraform with <code>depends_on</code>, pipelines chaining multiple workspaces — these are all Builder implementations. You're constructing a complex result piece by piece, in a defined sequence, with a clear separation between "how we build this" and "what we end up with."</p>
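<p>Reduced to a Python sketch, the Builder shape looks like this. Stage names are illustrative, and the steps just record themselves instead of calling real tools:</p>
<pre><code class="lang-python">class ClusterBuilder:
    """Builder sketch: assemble a complex result step by step, in order.
    Each stage records what it did, so a failure points at a specific step."""
    def __init__(self, name):
        self.name = name
        self.completed = []

    def step(self, stage):
        # In a real pipeline each stage would call Terraform, Helm, etc.
        self.completed.append(stage)
        return self  # chaining keeps the sequence explicit

    def build(self):
        return {"cluster": self.name, "stages": list(self.completed)}

result = (
    ClusterBuilder("prod-us-east-1")
    .step("provision cluster")
    .step("install CNI")
    .step("configure cert-manager")
    .step("deploy ingress controller")
    .step("register with ArgoCD")
    .step("apply base manifests")
    .build()
)
print(result["stages"])
</code></pre>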
<p>AWS CDK is an interesting case here. It's deeply influenced by the Builder pattern — you construct infrastructure programmatically using loops, conditionals, and abstractions. The mental model is sound. But it also requires teams to fully commit to that abstraction layer, which is a significant shift if your team lives in Terraform or plain YAML. That might be part of why CDK hasn't seen the adoption you'd expect given how well-designed it is — the pattern is right, but the switching cost is real.</p>
<p>The distinction between Factory and Builder matters when something breaks. A Factory failure tells you <em>what</em> couldn't be created. A Builder failure tells you <em>which step</em> broke — which is much easier to recover from.</p>
<hr />
<h2 id="heading-what-changes-when-you-think-this-way">What changes when you think this way</h2>
<p>When someone on your team says "we should let teams provision their own environments", the Factory framing turns a vague discussion into a concrete design question: <strong>how much of the creation logic do you centralize, and how much flexibility do you expose?</strong></p>
<p>Do you give teams raw Terraform and hope they follow conventions? That's no factory — just scattered creation logic waiting to drift. Do you give them a module with sensible defaults and a few knobs to turn? That's a thin factory. Do you give them a Backstage template that hides everything behind a form? That's a full factory.</p>
<p>Each of those is a valid choice depending on your context. The pattern doesn't tell you which one to pick — it gives you the vocabulary to make that decision intentionally instead of stumbling into it.</p>
<p>You're not debating Terraform vs. Crossplane. You're debating how much creation logic should live in one place, and who should own it. That's an architectural decision. It deserves to be treated like one.</p>
<hr />
<p><em>How does your team handle environment provisioning today — scattered scripts, thin modules, or full abstraction? Hit reply at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a> — curious where people have landed on this.</em></p>
]]></content:encoded></item><item><title><![CDATA[Design Patterns Are Not a Developer Thing]]></title><description><![CDATA[Design Patterns Are Not a Developer Thing
At some point, most platform engineers end up in a conversation with a senior dev or a software architect where you're trying to explain how your infrastructure works.
And somewhere in the middle of that conv...]]></description><link>https://parraletz.dev/design-patterns-are-not-a-developer-thing-1</link><guid isPermaLink="true">https://parraletz.dev/design-patterns-are-not-a-developer-thing-1</guid><category><![CDATA[design patterns]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[SRE]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Fri, 06 Mar 2026 15:09:44 GMT</pubDate><content:encoded><![CDATA[<h1 id="heading-design-patterns-are-not-a-developer-thing">Design Patterns Are Not a Developer Thing</h1>
<p>At some point, most platform engineers end up in a conversation with a senior dev or a software architect where you're trying to explain how your infrastructure works.</p>
<p>And somewhere in the middle of that conversation, something clicks. You're both describing the same problems. You're reaching for the same solutions. The only difference is the vocabulary — they say "Facade", you say "developer abstraction layer". They say "Observer pattern", you say "controller loop". They say "Command", you say "GitOps".</p>
<p>Same architecture. Different words.</p>
<p>What I've found is that design patterns aren't really a software thing. They're a complexity thing. And if there's one domain that deals with complexity at scale every single day, it's platform engineering.</p>
<p>This post is the first in a series that explores exactly that.</p>
<hr />
<h2 id="heading-the-problem-with-how-design-patterns-are-taught">The problem with how design patterns are taught</h2>
<p>Design patterns come from a book called <em>Design Patterns</em>, published in 1994 by four authors the industry nicknamed the "Gang of Four" (GoF). The book is genuinely great. It's also full of examples like <em>"imagine you have a <code>WordDocument</code> class and you need a <code>PDFDocument</code> class..."</em></p>
<p>If your day-to-day doesn't involve classes and inheritance, that framing is just noise.</p>
<p>But the patterns themselves aren't really about code. They're about <strong>how to deal with complexity as a system grows</strong>. And platforms have exactly those problems — just at a different scale and with different tools.</p>
<p>The thesis of this series is pretty straightforward: <strong>the design patterns developers learned in school or from a book? You're already applying them in your platform. Nobody has ever introduced them to you by name.</strong></p>
<hr />
<h2 id="heading-three-quick-examples-to-make-the-case">Three quick examples to make the case</h2>
<h3 id="heading-the-facade-why-platform-teams-exist-in-the-first-place">The Facade: why platform teams exist in the first place</h3>
<p>The Facade pattern says that when a system gets too complex, you build a simplified interface on top. Consumers talk to the facade, not to the guts of the system.</p>
<p>That's exactly what an Internal Developer Platform does. Your developers shouldn't need to know whether a secret comes from Vault, AWS Secrets Manager, or a ConfigMap. They shouldn't need to care which CNI the cluster runs, or how the ingress controller works under the hood. The platform is the facade. You're the one designing it.</p>
<p>When your team debates whether to expose Kubernetes directly to developers or abstract it behind a Backstage template — that's a conversation about how much facade to build. That's design work, not ops work.</p>
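<p>A toy Python sketch of that shape, with stand-in subsystems rather than real integrations:</p>
<pre><code class="lang-python">class SecretsBackend:
    def fetch(self, key):
        return f"value-of-{key}"  # stand-in for Vault / AWS Secrets Manager

class DeploySystem:
    def rollout(self, service, version):
        return f"{service}:{version} rolled out"

class PlatformFacade:
    """Facade sketch: one simplified interface in front of several subsystems.
    Developers call deploy(); they never touch the guts directly."""
    def __init__(self):
        self._secrets = SecretsBackend()
        self._deploys = DeploySystem()

    def deploy(self, service, version):
        self._secrets.fetch(f"{service}/credentials")  # hidden detail
        return self._deploys.rollout(service, version)

platform = PlatformFacade()
print(platform.deploy("checkout", "1.4.2"))
</code></pre>
<p>Swapping the secrets backend changes nothing for the caller; that's the whole point of the facade.</p>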
<h3 id="heading-the-observer-the-controller-loop-you-already-know">The Observer: the controller loop you already know</h3>
<p>The Observer pattern says that when an object's state changes, everything that depends on it gets notified automatically.</p>
<p>The Kubernetes controller loop is exactly that, implemented at cluster scale. There's a desired state (what's in etcd), an actual state (what's running), and controllers that watch the gap and act on it. Argo Rollouts watches Prometheus metrics to decide whether a canary should keep going or roll back. External Secrets Operator watches Vault for changes and syncs secrets accordingly.</p>
<p>Every operator you write or adopt is an implementation of the Observer pattern. Knowing that helps you decide when writing one actually makes sense — and when it's just adding complexity for fun.</p>
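<p>Stripped of the Kubernetes machinery, the loop itself is small. A hedged sketch of the reconcile idea, using plain dicts for desired and actual state:</p>
<pre><code class="lang-python">def reconcile(desired, actual):
    """Observer/controller sketch: watch the gap between desired and
    actual state, and emit the actions that would close it."""
    actions = []
    for name, replicas in desired.items():
        if actual.get(name) != replicas:
            actions.append(f"scale {name} to {replicas}")
    for name in actual:
        if name not in desired:
            actions.append(f"delete {name}")
    return actions

desired = {"api": 3, "worker": 2}   # what's in etcd
actual = {"api": 1, "legacy": 1}    # what's running
print(reconcile(desired, actual))
</code></pre>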
<h3 id="heading-the-command-why-gitops-makes-sense-from-first-principles">The Command: why GitOps makes sense from first principles</h3>
<p>The Command pattern says that instead of executing an action directly, you wrap it in an object that can be stored, passed around, undone, or audited.</p>
<p>A commit in your GitOps repo is exactly that. It's not just a text change. It's a persisted command — with an author, a timestamp, and the ability to be reversed with a <code>git revert</code>. ArgoCD or Flux is the executor that takes that command and applies it to the cluster.</p>
<p>The question "why GitOps instead of pipelines that just run <code>kubectl apply</code>?" has a deeper answer than "because it's best practice." It has an architectural answer: because separating the command from its execution gives you properties — auditability, reversibility, a single source of truth — that you'd otherwise have to build from scratch.</p>
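<p>The same property, in a toy Python sketch of the Command pattern. Here <code>apply</code> and <code>revert</code> just mutate a dict standing in for the cluster:</p>
<pre><code class="lang-python">import datetime

class ManifestChange:
    """Command sketch: the action is an object that can be stored,
    audited, and reversed, like a commit in a GitOps repo."""
    def __init__(self, author, old, new):
        self.author = author
        self.timestamp = datetime.datetime.now(datetime.timezone.utc)
        self.old, self.new = old, new

    def apply(self, cluster):
        cluster["state"] = self.new

    def revert(self, cluster):  # the `git revert` analogue
        cluster["state"] = self.old

cluster = {"state": "image: v1"}
log = []  # audit trail: every command is persisted, not just executed

change = ManifestChange("alex", cluster["state"], "image: v2")
change.apply(cluster)
log.append(change)
assert cluster["state"] == "image: v2"

log[-1].revert(cluster)
assert cluster["state"] == "image: v1"
</code></pre>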
<hr />
<h2 id="heading-where-this-is-going">Where this is going</h2>
<p>This is just the hook. The rest of the series goes pattern by pattern, connecting each one to real platform decisions:</p>
<ul>
<li><strong>Factory and Builder</strong> → how an IDP thinks when it provisions environments</li>
<li><strong>Facade</strong> → the cost of cognitive load and the Golden Path</li>
<li><strong>Observer</strong> → when to write an operator and when to step back</li>
<li><strong>Command</strong> → GitOps explained from first principles</li>
<li><strong>Adapter</strong> → why the Kubernetes ecosystem is basically a catalog of adapters, and why that's a feature</li>
<li><strong>Strategy</strong> → why hardcoding your deployment strategy into the pipeline is an antipattern</li>
</ul>
<p>The goal isn't for you to finish this series knowing more pattern trivia. The goal is that next time you make a platform decision, you have better vocabulary to explain it, defend it, or challenge it.</p>
<p>There's a practical side to this that I find underrated: patterns also help you diagnose problems. If your pipelines are slow and tightly coupled, that's not just a pipeline problem — that's a symptom of missing separation of concerns, which maps directly to specific patterns that exist precisely to solve it. Once you can name what's wrong architecturally, the solution space gets a lot clearer. You stop googling "how to speed up my CI pipeline" and start asking "what's the right pattern for decoupling this."</p>
<p>Because the difference between a team that operates infrastructure and one that designs a platform isn't the tool stack. It's whether they're making architectural decisions on purpose — or just stumbling into them.</p>
<hr />
<p><em>If this resonates, the next posts go deeper on each pattern. You can follow along or reply directly at <a target="_blank" href="mailto:blog@parraletz.dev">blog@parraletz.dev</a> — I read everything.</em></p>
]]></content:encoded></item><item><title><![CDATA[Why Git Isn't Enough for ML]]></title><description><![CDATA[The typical setup I see:
bash
ml-project/
├── .git/
├── src/
│   └── train.py
├── data/
│   └── training.csv  # 5GB, in Git
└── models/
    └── model.pkl     # 2GB, in Git

First question: How is this stored in Git?
The usual answer:
bash
# .gitignore
...]]></description><link>https://parraletz.dev/por-que-git-no-es-suficiente-para-ml</link><guid isPermaLink="true">https://parraletz.dev/por-que-git-no-es-suficiente-para-ml</guid><category><![CDATA[mlops]]></category><category><![CDATA[Data Science]]></category><category><![CDATA[dataengineering]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><category><![CDATA[DVCS]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Wed, 14 Jan 2026 13:07:34 GMT</pubDate><content:encoded><![CDATA[<p><strong>The typical setup I see:</strong></p>
<pre><code class="lang-bash">ml-project/
├── .git/
├── src/
│   └── train.py
├── data/
│   └── training.csv  <span class="hljs-comment"># 5GB, in Git</span>
└── models/
    └── model.pkl     <span class="hljs-comment"># 2GB, in Git</span>
</code></pre>
<p><strong>First question:</strong> How is this stored in Git?</p>
<p><strong>The usual answer:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># .gitignore</span>
data/
models/

<span class="hljs-comment"># So... how do you share the data?</span>
<span class="hljs-comment"># "We keep it in S3"</span>

<span class="hljs-comment"># Versioned?</span>
<span class="hljs-comment"># "...no"</span>
</code></pre>
<hr />
<p><strong>The fundamental problem:</strong></p>
<p><strong>Code is easy:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Version 1</span>
model = Model(lr=<span class="hljs-number">0.01</span>)

<span class="hljs-comment"># Version 2</span>
model = Model(lr=<span class="hljs-number">0.001</span>)

<span class="hljs-comment"># Git diff:</span>
- model = Model(lr=<span class="hljs-number">0.01</span>)
+ model = Model(lr=<span class="hljs-number">0.001</span>)

<span class="hljs-comment"># Clear, concise, mergeable</span>
</code></pre>
<p><strong>Data isn't:</strong></p>
<pre><code class="lang-bash">training.parquet v1: 50GB
training.parquet v2: 55GB

<span class="hljs-comment"># Git diff:</span>
Binary files differ

<span class="hljs-comment"># What changed?</span>
<span class="hljs-comment"># - 5GB of new rows?</span>
<span class="hljs-comment"># - Were existing rows modified?</span>
<span class="hljs-comment"># - Did the schema change?</span>
<span class="hljs-comment"># Who knows?</span>
</code></pre>
<p><strong>The 5 problems with versioning data:</strong></p>
<p><strong>PROBLEM 1: Size</strong></p>
<pre><code class="lang-bash">git add data/training.parquet  <span class="hljs-comment"># 50GB</span>

<span class="hljs-comment"># Git:</span>
error: File data/training.parquet is 50GB
hint: Use Git LFS <span class="hljs-keyword">for</span> files larger than 100MB

git lfs install
git lfs track <span class="hljs-string">"*.parquet"</span>
git add data/training.parquet
git push

<span class="hljs-comment"># GitHub/GitLab:</span>
error: File exceeds 5GB <span class="hljs-built_in">limit</span>
<span class="hljs-comment"># Even with LFS</span>
</code></pre>
<p><strong>PROBLEM 2: Performance</strong></p>
<pre><code class="lang-bash">git <span class="hljs-built_in">clone</span> repo

<span class="hljs-comment"># With 50GB dataset in Git:</span>
<span class="hljs-comment"># Clone time: 45 minutes</span>
<span class="hljs-comment"># Disk usage: 100GB (working copy + .git)</span>
<span class="hljs-comment"># CI/CD: times out after 1 hour</span>
</code></pre>
<p><strong>PROBLEM 3: Not text-based</strong></p>
<pre><code class="lang-bash">git diff data/training.csv

<span class="hljs-comment"># A 10M-row CSV:</span>
+row_9874561,value1,value2,...
+row_9874562,value1,value2,...
-row_9874563,old_value1,old_value2,...
...
<span class="hljs-comment"># A 10-million-line diff</span>
<span class="hljs-comment"># Useless</span>
</code></pre>
<p><strong>PROBLEM 4: Merge conflicts</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Branch A: adds 1M rows</span>
<span class="hljs-comment"># Branch B: adds a different 1M rows</span>

git merge branch-b

CONFLICT <span class="hljs-keyword">in</span> data/training.csv
<span class="hljs-comment"># How do you resolve this?</span>
<span class="hljs-comment"># Manually, in a 15M-row CSV?</span>
</code></pre>
<p><strong>PROBLEM 5: Cost</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Git LFS on GitHub</span>
<span class="hljs-comment"># - Free: 1GB storage, 1GB bandwidth</span>
<span class="hljs-comment"># - Paid: $5/month per 50GB storage</span>

<span class="hljs-comment"># A 500GB dataset:</span>
<span class="hljs-comment"># $50/month for storage alone</span>
<span class="hljs-comment"># + bandwidth costs</span>

<span class="hljs-comment"># S3:</span>
<span class="hljs-comment"># $11.50/month per 500GB</span>
<span class="hljs-comment"># Much cheaper</span>
</code></pre>
<hr />
<p><strong>Why data is different from code:</strong></p>
<table>
<thead>
<tr><th>Characteristic</th><th>Code</th><th>Data</th></tr>
</thead>
<tbody>
<tr><td>Size</td><td>KB-MB</td><td>GB-TB</td></tr>
<tr><td>Format</td><td>Text</td><td>Binary (parquet, tfrecord)</td></tr>
<tr><td>Changes</td><td>Specific lines</td><td>Rows/distributions</td></tr>
<tr><td>Merge</td><td>Possible</td><td>Impossible</td></tr>
<tr><td>Diff</td><td>Meaningful</td><td>"Binary files differ"</td></tr>
<tr><td>Storage</td><td>Git works fine</td><td>Git doesn't work</td></tr>
<tr><td>Versioning</td><td>Commits</td><td>Checksums/hashes</td></tr>
</tbody>
</table>
<hr />
<p><strong>What you actually need:</strong></p>
<p>Not just versioned files. Versioned <strong>context</strong>:</p>
<p><strong>1. The dataset itself</strong></p>
<pre><code>training.parquet
  - Size: 50GB
  - Hash: a1b2c3d4e5f6
  - Location: s3://data/training/v2.3.1/
</code></pre>
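<p>That hash is the key idea: a content hash is a stable identity for a blob that Git can't meaningfully diff, and it's what data-versioning tools pin in their pointer files. A minimal sketch of computing one (the helper name is ours, not any tool's API):</p>
<pre><code class="lang-python">import hashlib

def file_sha256(path):
    """Content fingerprint of a dataset file: stable regardless of
    filename or location, so a tiny pointer can stand in for 50GB."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1MB chunks so a 50GB file never has to fit in memory
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            h.update(chunk)
    return h.hexdigest()
</code></pre>
<p>Two copies of the same data produce the same fingerprint, which is what makes "did the dataset change?" answerable without downloading or diffing it.</p>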
<p><strong>2. Metadata</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">dataset:</span> <span class="hljs-string">training-v2.3.1</span>
<span class="hljs-attr">created:</span> <span class="hljs-number">2024-01-15</span>
<span class="hljs-attr">rows:</span> <span class="hljs-number">1</span><span class="hljs-string">,234,567</span>
<span class="hljs-attr">columns:</span> <span class="hljs-number">45</span>
<span class="hljs-attr">schema:</span>
  <span class="hljs-attr">user_id:</span> <span class="hljs-string">int64</span>
  <span class="hljs-attr">timestamp:</span> <span class="hljs-string">datetime64</span>
  <span class="hljs-attr">feature_1:</span> <span class="hljs-string">float64</span>
  <span class="hljs-string">...</span>
<span class="hljs-attr">source:</span> <span class="hljs-string">production-db</span>
<span class="hljs-attr">query:</span> <span class="hljs-string">|
  SELECT * FROM events
  WHERE date &gt; '2023-01-01'
  AND user_type = 'active'</span>
</code></pre>
<p><strong>3. Statistics</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">distributions:</span>
  <span class="hljs-attr">feature_1:</span>
    <span class="hljs-attr">mean:</span> <span class="hljs-number">0.547</span>
    <span class="hljs-attr">std:</span> <span class="hljs-number">0.231</span>
    <span class="hljs-attr">min:</span> <span class="hljs-number">0.0</span>
    <span class="hljs-attr">max:</span> <span class="hljs-number">1.0</span>
  <span class="hljs-attr">feature_2:</span>
    <span class="hljs-attr">unique_values:</span> <span class="hljs-number">47</span>
    <span class="hljs-attr">top_3:</span> [<span class="hljs-string">'A'</span>, <span class="hljs-string">'B'</span>, <span class="hljs-string">'C'</span>]
    <span class="hljs-attr">missing:</span> <span class="hljs-number">2.3</span><span class="hljs-string">%</span>
</code></pre>
<p><strong>4. Lineage</strong></p>
<pre><code>training-v2.3.1
  ← derived from raw-events-v1.5.2
    ← extracted from production-db
      ← populated by application-v2.1.0
</code></pre>
<p><strong>5. Validation</strong></p>
<pre><code class="lang-yaml"><span class="hljs-attr">checks:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">no_missing_values:</span> <span class="hljs-string">PASS</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">schema_match:</span> <span class="hljs-string">PASS</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">distribution_drift:</span> <span class="hljs-string">WARN</span> <span class="hljs-string">(feature_3</span> <span class="hljs-string">shifted)</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">no_duplicates:</span> <span class="hljs-string">PASS</span>
</code></pre>
<hr />
<p><strong>Real cases where Git failed:</strong></p>
<p><strong>CASE 1: The dataset that "disappeared"</strong></p>
<pre><code>DS: "I need the dataset from 3 months ago"
Me: "Is it in Git?"
DS: "No, it was too big"
Me: "Where is it?"
DS: "In s3://data/training.parquet"
Me: "That gets overwritten every week"
DS: "..."

# Dataset lost forever
# 3 months of work to regenerate it
</code></pre>
<p><strong>CASE 2: The impossible merge conflict</strong></p>
<pre><code class="lang-bash">git merge feature-new-data-source

CONFLICT <span class="hljs-keyword">in</span> data/training.csv

<span class="hljs-comment"># 10 million lines in conflict</span>
<span class="hljs-comment"># Impossible to resolve manually</span>
<span class="hljs-comment"># We had to regenerate the entire dataset</span>
<span class="hljs-comment"># 2 weeks of work lost</span>
</code></pre>
<p><strong>CASE 3: The CI/CD job that never finished</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># .github/workflows/train.yaml</span>
<span class="hljs-attr">steps:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Checkout</span>
    <span class="hljs-attr">uses:</span> <span class="hljs-string">actions/checkout@v3</span>

  <span class="hljs-comment"># Downloads 50GB of data from Git LFS</span>
  <span class="hljs-comment"># Times out after 6 hours</span>
  <span class="hljs-comment"># CI/CD never completes</span>
  <span class="hljs-comment"># We can't do automated training</span>
</code></pre>
<hr />
<p><strong>The right solution:</strong></p>
<pre><code>Code   → Git
Data   → Data versioning tool (DVC, LakeFS, etc.)
Models → Model registry (MLflow, etc.)
</code></pre>
<p><strong>Separation of concerns:</strong></p>
<pre><code>┌─────────────────────────────────────┐
│           Git (Source Control)      │
│                                     │
│  - Code                             │
│  - Config                           │
│  - Pipelines                        │
│  - .dvc files (pointers)            │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│     DVC/LakeFS (Data Versioning)    │
│                                     │
│  - Training datasets                │
│  - Validation datasets              │
│  - Feature sets                     │
│  - Metadata                         │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│      MLflow (Model Registry)        │
│                                     │
│  - Trained models                   │
│  - Metrics                          │
│  - Hyperparameters                  │
│  - Artifacts                        │
└─────────────────────────────────────┘
</code></pre>
<hr />
<p><strong>A practical example:</strong></p>
<p><strong>WITHOUT data versioning:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
data = pd.read_parquet(<span class="hljs-string">"s3://data/training.parquet"</span>)
model = train(data)
mlflow.log_model(model, <span class="hljs-string">"model"</span>)

<span class="hljs-comment"># 3 meses después:</span>
<span class="hljs-comment"># "¿Con qué datos se entrenó este modelo?"</span>
<span class="hljs-comment"># "...no sé"</span>
</code></pre>
<p><strong>WITH data versioning:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> dvc.api

<span class="hljs-comment"># Pull specific data version</span>
data_version = <span class="hljs-string">"v2.3.1"</span>
<span class="hljs-keyword">with</span> dvc.api.open(
    <span class="hljs-string">'data/training.parquet'</span>,
    rev=data_version
) <span class="hljs-keyword">as</span> f:
    data = pd.read_parquet(f)

<span class="hljs-comment"># Train</span>
model = train(data)

<span class="hljs-comment"># Log everything</span>
mlflow.log_param(<span class="hljs-string">"data_version"</span>, data_version)
mlflow.log_param(<span class="hljs-string">"data_hash"</span>, get_data_hash())
mlflow.log_param(<span class="hljs-string">"data_rows"</span>, len(data))
mlflow.log_model(model, <span class="hljs-string">"model"</span>)

<span class="hljs-comment"># 3 meses después:</span>
<span class="hljs-comment"># "¿Con qué datos se entrenó?"</span>
<span class="hljs-comment"># "data_version: v2.3.1, hash: a1b2c3"</span>
<span class="hljs-comment"># "Puedo reproducirlo: dvc checkout v2.3.1"</span>
</code></pre>
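<p>The <code>get_data_hash()</code> call above is not part of MLflow or DVC; it's a small helper you write yourself. A minimal sketch (the function name and the 12-character truncation are my own choices), assuming you hash the raw dataset file:</p>
<pre><code class="lang-python">import hashlib

def get_data_hash(path, chunk_size=1024 * 1024):
    """Short, stable content hash of a dataset file.

    Reads in chunks so large parquet files never have to fit
    in memory; identical bytes always produce the same hash.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:12]
</code></pre>
<p>Logging this next to <code>data_version</code> also catches the case where the tag stayed the same but the bytes changed.</p>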
<hr />
<p><strong>The critical questions:</strong></p>
<p>You can now answer:</p>
<p><strong>1. Reproducibility</strong></p>
<pre><code class="lang-bash">¿Puedes reproducir el experimento <span class="hljs-comment">#142?</span>
→ Sí: data v2.1.3 + code commit abc123
</code></pre>
<p><strong>2. Debugging</strong></p>
<pre><code class="lang-bash">¿Por qué el modelo empeoró?
→ Data drift: v2.3.1 vs v2.3.2
→ mean(feature_x): 0.5 → 0.7
</code></pre>
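<p>The drift answer above assumes someone actually computed those means. A tiny, library-free sketch of that check (the function name and output shape are mine, not from any tool):</p>
<pre><code class="lang-python">def mean_shift_report(old, new):
    """Per-feature mean shift between two dataset versions.

    old/new map feature name to a list of values. Returns
    {feature: (old_mean, new_mean, relative_shift)} so a jump
    like mean(feature_x): 0.5 to 0.7 stands out immediately.
    """
    report = {}
    for name in old:
        m_old = sum(old[name]) / len(old[name])
        m_new = sum(new[name]) / len(new[name])
        rel = abs(m_new - m_old) / max(abs(m_old), 1e-9)
        report[name] = (round(m_old, 3), round(m_new, 3), round(rel, 3))
    return report

# v2.3.1 vs v2.3.2, as in the example above
v231 = {"feature_x": [0.4, 0.5, 0.6]}
v232 = {"feature_x": [0.6, 0.7, 0.8]}
print(mean_shift_report(v231, v232))
</code></pre>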
<p><strong>3. Compliance</strong></p>
<pre><code class="lang-bash">¿Qué datos personales contiene este modelo?
→ Dataset v2.1.3
→ Metadata dice: no PII
→ Validation logs confirman
</code></pre>
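<p>That compliance answer presumes the versioned dataset's metadata records its columns. A naive sketch of the "no PII" check (the metadata shape and keyword list are my assumptions; real compliance needs far more than name matching):</p>
<pre><code class="lang-python">def pii_suspects(metadata, keywords=("email", "phone", "ssn", "address")):
    """Flag column names that look like PII in a dataset's metadata.

    metadata: {"columns": [...]} stored alongside the versioned
    dataset. Returns the suspicious columns; an empty list supports
    a "no PII" claim (based on names only, not values).
    """
    columns = metadata.get("columns", [])
    return [c for c in columns if any(k in c.lower() for k in keywords)]
</code></pre>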
<p><strong>4. Rollback</strong></p>
<pre><code class="lang-bash">Necesito volver al dataset de la semana pasada
→ dvc checkout v2.2.9
→ 30 segundos
</code></pre>
<p><strong>Quick comparison:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Method</td><td>Pros</td><td>Cons</td><td>Use case</td></tr>
</thead>
<tbody>
<tr>
<td>Plain Git</td><td>Simple</td><td>Doesn't scale past 100MB</td><td>Small datasets (&lt;100MB)</td></tr>
<tr>
<td>Git LFS</td><td>Git workflow</td><td>Expensive, slow</td><td>Medium datasets (&lt;5GB)</td></tr>
<tr>
<td>DVC</td><td>Open source</td><td>Manual setup</td><td>Data science teams</td></tr>
<tr>
<td>LakeFS</td><td>Git-like</td><td>Complex</td><td>Large data lakes</td></tr>
<tr>
<td>Delta Lake</td><td>ACID</td><td>Databricks-specific</td><td>Spark workflows</td></tr>
</tbody>
</table>
</div><p><strong>Your turn:</strong> How do you version your data today? Or is it "latest.parquet" and a prayer?</p>
]]></content:encoded></item><item><title><![CDATA[Connecting the Worlds - Platform Engineering meets ML]]></title><description><![CDATA[Terraform, Helm, GitOps... to train models? Yes.
Today: how to manage all of it as code.

The problem:
You have your Platform Engineering stack:

Terraform for infrastructure

Helm for applications on K8s

ArgoCD for GitOps

CI/CD pipeline...]]></description><link>https://parraletz.dev/conectando-los-mundos-platform-engineering-meets-ml</link><guid isPermaLink="true">https://parraletz.dev/conectando-los-mundos-platform-engineering-meets-ml</guid><category><![CDATA[mlops]]></category><category><![CDATA[Terraform]]></category><category><![CDATA[gitops]]></category><category><![CDATA[mlengineer]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Tue, 13 Jan 2026 16:01:10 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768319241143/4516961c-6d05-47ee-a949-1af65a46a4da.png" length="0" type="image/png"/><content:encoded><![CDATA[<p><strong>Terraform, Helm, GitOps... to train models? Yes.</strong></p>
<p>Today: <strong>how to manage all of it as code.</strong></p>
<hr />
<p><strong>The problem:</strong></p>
<p>You have your Platform Engineering stack:</p>
<ul>
<li><p>Terraform for infrastructure</p>
</li>
<li><p>Helm for applications on K8s</p>
</li>
<li><p>ArgoCD for GitOps</p>
</li>
<li><p>CI/CD pipelines</p>
</li>
</ul>
<p>And now ML joins the game.</p>
<p><strong>First reaction:</strong> "It's just another application, right?"</p>
<p><strong>Spoiler: no.</strong></p>
<hr />
<p><strong>What I found out:</strong></p>
<p>ML is not just another application. It's a <strong>complete ecosystem</strong> that needs:</p>
<p><strong>Traditional infrastructure:</strong></p>
<ul>
<li><p>Compute (K8s clusters, VMs)</p>
</li>
<li><p>Storage (S3, persistent volumes)</p>
</li>
<li><p>Networking (load balancers, ingress)</p>
</li>
</ul>
<p><strong>BUT ALSO ML-specific infrastructure:</strong></p>
<ul>
<li><p>Feature stores</p>
</li>
<li><p>Model registries</p>
</li>
<li><p>Experiment tracking</p>
</li>
<li><p>Data versioning</p>
</li>
<li><p>Training pipelines</p>
</li>
<li><p>Serving infrastructure</p>
</li>
</ul>
<p><strong>And all of it must be code. Reproducible. Versioned. With GitOps.</strong></p>
<hr />
<p><strong>The 4 levels of IaC for MLOps:</strong></p>
<p>Terraform is the clear leader here, though AWS CDK and Pulumi are worthy competitors.</p>
<h3 id="heading-nivel-1-infraestructura-base-fundational"><strong>Nivel 1: Infraestructura base (Fundational)</strong></h3>
<pre><code class="lang-ini"><span class="hljs-comment"># terraform/main.tf</span>

<span class="hljs-comment"># Cluster para ML workloads</span>
resource "aws_eks_cluster" "ml_cluster" {
  <span class="hljs-attr">name</span>     = <span class="hljs-string">"ml-${var.environment}"</span>
  <span class="hljs-attr">role_arn</span> = aws_iam_role.ml_cluster.arn

  vpc_config {
    <span class="hljs-attr">subnet_ids</span> = var.private_subnets
  }
}

<span class="hljs-comment"># Node group con GPUs</span>
resource "aws_eks_node_group" "gpu_nodes" {
  <span class="hljs-attr">cluster_name</span>    = aws_eks_cluster.ml_cluster.name
  <span class="hljs-attr">node_group_name</span> = <span class="hljs-string">"gpu-nodes"</span>

  <span class="hljs-attr">instance_types</span> = [<span class="hljs-string">"p3.2xlarge"</span>]

  scaling_config {
    <span class="hljs-attr">desired_size</span> = <span class="hljs-number">2</span>
    <span class="hljs-attr">max_size</span>     = <span class="hljs-number">10</span>
    <span class="hljs-attr">min_size</span>     = <span class="hljs-number">0</span>  <span class="hljs-comment"># Scale to zero cuando no se use</span>
  }
}

<span class="hljs-comment"># Storage para datasets</span>
resource "aws_s3_bucket" "ml_data" {
  <span class="hljs-attr">bucket</span> = <span class="hljs-string">"ml-data-${var.environment}"</span>

  versioning {
    <span class="hljs-attr">enabled</span> = <span class="hljs-literal">true</span>  <span class="hljs-comment"># Importante para data versioning</span>
  }
}
</code></pre>
<p><strong>This should look familiar. It's plain Terraform.</strong></p>
<hr />
<h3 id="heading-nivel-2-ml-platform-components"><strong>Nivel 2: ML Platform components</strong></h3>
<pre><code class="lang-yaml"><span class="hljs-comment"># helm/values.yaml</span>

<span class="hljs-comment"># MLflow para experiment tracking</span>
<span class="hljs-attr">mlflow:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">backend_store:</span> <span class="hljs-string">postgresql</span>
  <span class="hljs-attr">artifact_store:</span> <span class="hljs-string">s3://ml-artifacts-prod</span>

<span class="hljs-comment"># Kubeflow Pipelines (si lo usas)</span>
<span class="hljs-attr">kubeflow:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">pipelines:</span>
    <span class="hljs-attr">persistence:</span>
      <span class="hljs-attr">storageClass:</span> <span class="hljs-string">gp3</span>
      <span class="hljs-attr">size:</span> <span class="hljs-string">100Gi</span>

<span class="hljs-comment"># Model serving (Seldon, KServe, etc.)</span>
<span class="hljs-attr">seldon:</span>
  <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
  <span class="hljs-attr">resources:</span>
    <span class="hljs-attr">limits:</span>
      <span class="hljs-attr">nvidia.com/gpu:</span> <span class="hljs-number">1</span>
</code></pre>
<p><strong>This is where Platform Engineering and ML meet.</strong></p>
<hr />
<h3 id="heading-nivel-3-training-pipelines-como-codigo"><strong>Nivel 3: Training pipelines como código</strong></h3>
<pre><code class="lang-yaml"><span class="hljs-comment"># pipelines/training-pipeline.yaml</span>

<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">argoproj.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Workflow</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">training-pipeline</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">entrypoint:</span> <span class="hljs-string">train-model</span>

  <span class="hljs-attr">arguments:</span>
    <span class="hljs-attr">parameters:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">dataset-version</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"v2.3.1"</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">environment</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">"staging"</span>

  <span class="hljs-attr">templates:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">train-model</span>
    <span class="hljs-attr">steps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">ingest-data</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">ingest</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">validate-data</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">validate</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">train</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">training</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">evaluate</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">evaluation</span>
    <span class="hljs-bullet">-</span> <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">register</span>
        <span class="hljs-attr">template:</span> <span class="hljs-string">model-registration</span>

  <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">ingest</span>
    <span class="hljs-attr">container:</span>
      <span class="hljs-attr">image:</span> <span class="hljs-string">ml-pipeline/ingest:{{workflow.parameters.dataset-version}}</span>
      <span class="hljs-attr">command:</span> [<span class="hljs-string">python</span>, <span class="hljs-string">ingest.py</span>]
      <span class="hljs-attr">env:</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DATASET_VERSION</span>
        <span class="hljs-attr">value:</span> <span class="hljs-string">"<span class="hljs-template-variable">{{workflow.parameters.dataset-version}}</span>"</span>
      <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">S3_BUCKET</span>
        <span class="hljs-attr">value:</span> <span class="hljs-string">"ml-data-<span class="hljs-template-variable">{{workflow.parameters.environment}}</span>"</span>
</code></pre>
<p><strong>This is where it all connects: infra + pipelines + GitOps.</strong></p>
<hr />
<h3 id="heading-nivel-4-gitops-para-ml"><strong>Nivel 4: GitOps para ML</strong></h3>
<p>And as you know, around here we are GitOps friendly.</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># argocd/ml-application.yaml</span>

<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">argoproj.io/v1alpha1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Application</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">ml-platform</span>
  <span class="hljs-attr">namespace:</span> <span class="hljs-string">argocd</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">project:</span> <span class="hljs-string">default</span>

  <span class="hljs-attr">source:</span>
    <span class="hljs-attr">repoURL:</span> <span class="hljs-string">https://github.com/company/ml-platform</span>
    <span class="hljs-attr">targetRevision:</span> <span class="hljs-string">main</span>
    <span class="hljs-attr">path:</span> <span class="hljs-string">k8s/</span>

    <span class="hljs-attr">helm:</span>
      <span class="hljs-attr">values:</span> <span class="hljs-string">|
        environment: staging
</span>
        <span class="hljs-attr">mlflow:</span>
          <span class="hljs-attr">enabled:</span> <span class="hljs-literal">true</span>
          <span class="hljs-attr">version:</span> <span class="hljs-string">"2.9.0"</span>

        <span class="hljs-attr">training_pipelines:</span>
          <span class="hljs-attr">schedule:</span> <span class="hljs-string">"0 2 * * 0"</span>  <span class="hljs-comment"># Weekly</span>
          <span class="hljs-attr">resources:</span>
            <span class="hljs-attr">gpu:</span> <span class="hljs-literal">true</span>
            <span class="hljs-attr">memory:</span> <span class="hljs-string">32Gi</span>

  <span class="hljs-attr">destination:</span>
    <span class="hljs-attr">server:</span> <span class="hljs-string">https://kubernetes.default.svc</span>
    <span class="hljs-attr">namespace:</span> <span class="hljs-string">ml-staging</span>

  <span class="hljs-attr">syncPolicy:</span>
    <span class="hljs-attr">automated:</span>
      <span class="hljs-attr">prune:</span> <span class="hljs-literal">true</span>
      <span class="hljs-attr">selfHeal:</span> <span class="hljs-literal">true</span>
</code></pre>
<p><strong>Commit → Push → ArgoCD sync → Pipeline deployed.</strong></p>
<p><strong>Just like your other applications. But for ML.</strong></p>
<hr />
<p><strong>The complete integration:</strong></p>
<pre><code class="lang-ini">┌─────────────────────────────────────────────────┐
│              Git Repository                     │
│                                                 │
│  terraform/        → Infra base                 │
│  helm/            → ML platform                 │
│  pipelines/       → Training workflows          │
│  models/          → Model definitions           │
└─────────────┬───────────────────────────────────┘
              │
              │ git push
              │
     ┌────────▼────────┐
     │   CI Pipeline   │
     │                 │
     │  - Validate     │
     │  - Test         │
     │  - Build images │
     └────────┬────────┘
              │
              │ if tests pass
              │
     ┌────────▼────────┐
     │    ArgoCD       │
     │                 │
     │  - Sync infra   │
     │  - Deploy ML    │
     │  - Trigger      │
     └────────┬────────┘
              │
              │
     ┌────────▼─────────────────────────────┐
     │         Kubernetes                   │
     │                                      │
     │  - ML Platform running               │
     │  - Pipelines scheduled               │
     │  - Models serving                    │
     └──────────────────────────────────────┘
</code></pre>
<p><strong>Real example: deploying a new model</strong></p>
<p><strong>Before (manual):</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># 1. SSH al cluster</span>
ssh ml-cluster-prod

<span class="hljs-comment"># 2. Actualizar manualmente</span>
kubectl apply -f model-deployment.yaml

<span class="hljs-comment"># 3. Esperar y rezar</span>
kubectl get pods -w

<span class="hljs-comment"># 4. Si falla, rollback manual</span>
kubectl rollback deployment/model-server
</code></pre>
<p><strong>Now (GitOps):</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># 1. Update en Git</span>
cat &gt; k8s/model-deployment.yaml &lt;&lt;EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  template:
    spec:
      containers:
      - name: model
        image: models/fraud-detection:v2.3.1  <span class="hljs-comment"># The change goes here</span>
EOF

<span class="hljs-comment"># 2. Commit y push</span>
git add k8s/model-deployment.yaml
git commit -m <span class="hljs-string">"Deploy model v2.3.1 to staging"</span>
git push

<span class="hljs-comment"># 3. ArgoCD hace el resto</span>
<span class="hljs-comment"># - Detecta cambio</span>
<span class="hljs-comment"># - Valida</span>
<span class="hljs-comment"># - Canary deploy</span>
<span class="hljs-comment"># - Rollback automático si falla</span>
</code></pre>
<hr />
<p><strong>The key components:</strong></p>
<p><strong>Terraform for infrastructure:</strong></p>
<pre><code class="lang-ini">module "ml_platform" {
  <span class="hljs-attr">source</span> = <span class="hljs-string">"./modules/ml-platform"</span>

  <span class="hljs-attr">environment</span> = var.environment

  <span class="hljs-comment"># Compute</span>
  <span class="hljs-attr">gpu_nodes</span>     = <span class="hljs-number">4</span>
  <span class="hljs-attr">cpu_nodes</span>     = <span class="hljs-number">10</span>

  <span class="hljs-comment"># Storage</span>
  <span class="hljs-attr">data_bucket</span>   = <span class="hljs-string">"ml-data-${var.environment}"</span>
  <span class="hljs-attr">model_bucket</span>  = <span class="hljs-string">"ml-models-${var.environment}"</span>

  <span class="hljs-comment"># Networking</span>
  <span class="hljs-attr">vpc_id</span>        = module.vpc.vpc_id

  <span class="hljs-comment"># ML-specific</span>
  <span class="hljs-attr">mlflow_backend</span>    = module.rds.endpoint
  <span class="hljs-attr">feature_store</span>     = <span class="hljs-string">"enabled"</span>
  <span class="hljs-attr">model_registry</span>    = <span class="hljs-string">"enabled"</span>
}
</code></pre>
<p><strong>Helm for ML components:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Chart.yaml</span>
<span class="hljs-attr">dependencies:</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">mlflow</span>
  <span class="hljs-attr">version:</span> <span class="hljs-string">"0.7.0"</span>
  <span class="hljs-attr">repository:</span> <span class="hljs-string">https://charts.mlflow.org</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">kubeflow-pipelines</span>
  <span class="hljs-attr">version:</span> <span class="hljs-string">"2.0.0"</span>
  <span class="hljs-attr">repository:</span> <span class="hljs-string">https://kubeflow.github.io/pipelines</span>

<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">seldon-core</span>
  <span class="hljs-attr">version:</span> <span class="hljs-string">"1.15.0"</span>
  <span class="hljs-attr">repository:</span> <span class="hljs-string">https://storage.googleapis.com/seldon-charts</span>
</code></pre>
<p><strong>ArgoCD for synchronization:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Sync policy para ML workloads</span>
<span class="hljs-attr">syncPolicy:</span>
  <span class="hljs-attr">automated:</span>
    <span class="hljs-attr">prune:</span> <span class="hljs-literal">true</span>
    <span class="hljs-attr">selfHeal:</span> <span class="hljs-literal">true</span>

  <span class="hljs-attr">syncOptions:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-string">CreateNamespace=true</span>

  <span class="hljs-attr">retry:</span>
    <span class="hljs-attr">limit:</span> <span class="hljs-number">5</span>
    <span class="hljs-attr">backoff:</span>
      <span class="hljs-attr">duration:</span> <span class="hljs-string">5s</span>
      <span class="hljs-attr">factor:</span> <span class="hljs-number">2</span>
      <span class="hljs-attr">maxDuration:</span> <span class="hljs-string">3m</span>
</code></pre>
<p><strong>CI/CD for pipelines:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># .github/workflows/ml-pipeline.yaml</span>
<span class="hljs-attr">name:</span> <span class="hljs-string">ML</span> <span class="hljs-string">Pipeline</span> <span class="hljs-string">CI</span>

<span class="hljs-attr">on:</span>
  <span class="hljs-attr">push:</span>
    <span class="hljs-attr">paths:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">'pipelines/**'</span>
    <span class="hljs-bullet">-</span> <span class="hljs-string">'models/**'</span>

<span class="hljs-attr">jobs:</span>
  <span class="hljs-attr">validate:</span>
    <span class="hljs-attr">runs-on:</span> <span class="hljs-string">ubuntu-latest</span>
    <span class="hljs-attr">steps:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Validate</span> <span class="hljs-string">pipeline</span> <span class="hljs-string">syntax</span>
      <span class="hljs-attr">run:</span> <span class="hljs-string">|
        argo lint pipelines/training-pipeline.yaml
</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Test</span> <span class="hljs-string">with</span> <span class="hljs-string">sample</span> <span class="hljs-string">data</span>
      <span class="hljs-attr">run:</span> <span class="hljs-string">|
        pytest tests/pipeline_test.py
</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">Build</span> <span class="hljs-string">container</span> <span class="hljs-string">images</span>
      <span class="hljs-attr">run:</span> <span class="hljs-string">|</span>
        <span class="hljs-string">docker</span> <span class="hljs-string">build</span> <span class="hljs-string">-t</span> <span class="hljs-string">ml-pipeline/ingest:${{</span> <span class="hljs-string">github.sha</span> <span class="hljs-string">}}</span>
</code></pre>
<hr />
<p><strong>The IaC challenges specific to ML:</strong></p>
<p><strong>CHALLENGE 1: State management for models</strong></p>
<pre><code class="lang-ini">Software tradicional:
- Deploy nueva versión
- Old version se borra

ML:
- Deploy modelo v2
- Modelo v1 sigue sirviendo (A/B testing)
- Modelo v0 en rollback standby
- ¿Cómo manejas state?
</code></pre>
<p><strong>Solution:</strong></p>
<pre><code class="lang-ini"><span class="hljs-comment"># Terraform state para múltiples versiones</span>
resource "kubernetes_deployment" "model" {
  <span class="hljs-attr">for_each</span> = var.model_versions

  metadata {
    <span class="hljs-attr">name</span> = <span class="hljs-string">"model-${each.key}"</span>
  }

  spec {
    <span class="hljs-attr">replicas</span> = each.value.traffic_percentage &gt; <span class="hljs-number">0</span> ? <span class="hljs-number">2</span> : <span class="hljs-number">0</span>
  }
}
</code></pre>
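<p>The <code>for_each</code> ternary above is easy to sanity-check outside Terraform. A sketch of the same replica rule in plain Python (the traffic map is a made-up example):</p>
<pre><code class="lang-python">def replicas_for(traffic):
    """Mirror of the Terraform rule above: any model version that
    receives traffic runs 2 replicas; a 0% version scales to 0 but
    keeps its Deployment defined for instant rollback."""
    return {version: (2 if pct else 0) for version, pct in traffic.items()}

print(replicas_for({"v2": 80, "v1": 20, "v0": 0}))
# {'v2': 2, 'v1': 2, 'v0': 0}
</code></pre>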
<hr />
<p><strong>CHALLENGE 2: Dynamic resources</strong></p>
<pre><code class="lang-ini">Training pipeline necesita:
- 8 GPUs por 4 horas
- Después, scale a 0

Serving necesita:
- 2 GPUs 24/7
- Auto-scale basado en requests
</code></pre>
<p><strong>Solution:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># Karpenter para auto-scaling inteligente</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">karpenter.sh/v1alpha5</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">Provisioner</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">gpu-training</span>
<span class="hljs-attr">spec:</span>
  <span class="hljs-attr">requirements:</span>
  <span class="hljs-bullet">-</span> <span class="hljs-attr">key:</span> <span class="hljs-string">node.kubernetes.io/instance-type</span>
    <span class="hljs-attr">operator:</span> <span class="hljs-string">In</span>
    <span class="hljs-attr">values:</span> [<span class="hljs-string">"p3.2xlarge"</span>, <span class="hljs-string">"p3.8xlarge"</span>]

  <span class="hljs-attr">ttlSecondsAfterEmpty:</span> <span class="hljs-number">300</span>  <span class="hljs-comment"># Terminate si no se usa</span>

  <span class="hljs-attr">limits:</span>
    <span class="hljs-attr">resources:</span>
      <span class="hljs-attr">nvidia.com/gpu:</span> <span class="hljs-number">32</span>  <span class="hljs-comment"># Max GPUs</span>
</code></pre>
<hr />
<p><strong>CHALLENGE 3: Data versioning in IaC</strong></p>
<pre><code class="lang-ini">Código: Git
Containers: Registry
Infraestructura: Terraform state

¿Datasets de gigantes?
</code></pre>
<p><strong>Solution:</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># DVC como parte del pipeline</span>
<span class="hljs-attr">apiVersion:</span> <span class="hljs-string">v1</span>
<span class="hljs-attr">kind:</span> <span class="hljs-string">ConfigMap</span>
<span class="hljs-attr">metadata:</span>
  <span class="hljs-attr">name:</span> <span class="hljs-string">dvc-config</span>
<span class="hljs-attr">data:</span>
  <span class="hljs-string">.dvc/config:</span> <span class="hljs-string">|
    [core]
        remote = s3
    ['remote "s3"']
        url = s3://ml-data-prod/dvc-cache
</span>
<span class="hljs-comment"># Referencia en pipeline</span>
<span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">fetch-data</span>
  <span class="hljs-attr">container:</span>
    <span class="hljs-attr">image:</span> <span class="hljs-string">dvc:latest</span>
    <span class="hljs-attr">command:</span> [<span class="hljs-string">dvc</span>, <span class="hljs-string">pull</span>, <span class="hljs-string">data/training.dvc</span>]
    <span class="hljs-attr">env:</span>
    <span class="hljs-bullet">-</span> <span class="hljs-attr">name:</span> <span class="hljs-string">DVC_REMOTE</span>
      <span class="hljs-attr">value:</span> <span class="hljs-string">s3</span>
</code></pre>
<hr />
<p><strong>Recommended complete stack:</strong></p>
<pre><code class="lang-bash">Infrastructure Layer:
├── Terraform         → Cloud resources (K8s, S3, RDS)
├── Helm             → ML platform components
└── ArgoCD           → GitOps sync

ML Platform Layer:
├── MLflow           → Experiment tracking
├── DVC              → Data versioning
├── Kubeflow/Airflow → Pipeline orchestration
└── Seldon/KServe    → Model serving

Pipeline Layer:
├── Argo Workflows   → Training execution
├── GitHub Actions   → CI/CD
└── Prometheus       → Monitoring

Data Layer:
├── S3/GCS           → Data lake
├── PostgreSQL       → Metadata
└── Feature Store    → Feature management
</code></pre>
<p><strong>My actual workflow:</strong></p>
<p><strong>1. Define the infra (Terraform):</strong></p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> terraform/environments/staging
terraform plan
terraform apply
</code></pre>
<p><strong>2. Deploy ML platform (Helm + ArgoCD):</strong></p>
<pre><code class="lang-bash">git add helm/ml-platform/
git commit -m <span class="hljs-string">"Deploy MLflow + Kubeflow to staging"</span>
git push  <span class="hljs-comment"># ArgoCD picks it up</span>
</code></pre>
<p><strong>3. Define the pipeline (YAML):</strong></p>
<pre><code class="lang-bash">git add pipelines/training-v2.yaml
git commit -m <span class="hljs-string">"Update training pipeline"</span>
git push  <span class="hljs-comment"># ArgoCD deploys</span>
</code></pre>
<p><strong>4. Trigger training:</strong></p>
<pre><code class="lang-bash"><span class="hljs-comment"># Scheduled (automático via CronWorkflow)</span>
<span class="hljs-comment"># O manual:</span>
argo submit pipelines/training-v2.yaml \
  --parameter dataset-version=v2.3.1
</code></pre>
<p><strong>5. Deploy the model (GitOps):</strong></p>
<pre><code class="lang-bash">git add k8s/model-server.yaml
git commit -m <span class="hljs-string">"Deploy model v2.3.1"</span>
git push  <span class="hljs-comment"># ArgoCD deploys con canary</span>
</code></pre>
<p><strong>Everything as code. Everything in Git. Everything reproducible.</strong></p>
<hr />
<p><strong>The mistakes I've seen:</strong></p>
<p><strong>MISTAKE 1: "Manual infra, automated pipelines"</strong></p>
<pre><code class="lang-bash">Resultado: Pipelines automatizados corriendo en infra manual
Alguien borra el cluster por error
Pipelines fallan, nadie sabe cómo reconstruir
</code></pre>
<p><strong>MISTAKE 2: "Terraform for everything, even models"</strong></p>
<pre><code class="lang-bash">Resultado: Model registry en Terraform state
Deploy nuevo modelo = terraform apply (lento)
Rollback = terraform plan/apply (muy lento)
No es el tool correcto para esto
</code></pre>
<p><strong>MISTAKE 3: "GitOps without CI"</strong></p>
<pre><code class="lang-bash">Resultado: Push directo a main sin validación
Pipeline syntax error deployado a prod
Cluster crashea
</code></pre>
<p><strong>MISTAKE 4: "IaC only in prod"</strong></p>
<pre><code class="lang-bash">Resultado: Dev y staging son snowflakes
<span class="hljs-string">"Funciona en mi máquina"</span> nivel infraestructura
Debugging imposible
</code></pre>
<hr />
<p><strong>My pragmatic recommendation:</strong></p>
<p><strong>Phase 1 - Infra as Code:</strong></p>
<pre><code class="lang-bash">- Terraform para cluster base
- Manual deploy de ML platform
- Scripts para pipelines
</code></pre>
<p><strong>Phase 2 - Platform as Code:</strong></p>
<pre><code class="lang-bash">- Helm para ML components
- CI/CD para validación
- Semi-automated deploys
</code></pre>
<p><strong>Phase 3 - Everything as Code:</strong></p>
<pre><code class="lang-bash">- Full GitOps con ArgoCD
- Pipelines como YAML en Git
- Automated canary deploys
</code></pre>
<p><strong>Start at Phase 1. Evolve when you need to scale.</strong></p>
]]></content:encoded></item><item><title><![CDATA[Environments (dev / staging / prod)]]></title><description><![CDATA["I just need to change the ENV variable to 'prod'" - Spoiler: No
Yesterday: when to run your pipelines. Today: where to run them.
Dev, staging, production. You know this from traditional software.
But in ML, environments are much more than code.

A story:
Dat...]]></description><link>https://parraletz.dev/environments-dev-staging-prod</link><guid isPermaLink="true">https://parraletz.dev/environments-dev-staging-prod</guid><category><![CDATA[mlops]]></category><category><![CDATA[Devops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Kubernetes]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Mon, 12 Jan 2026 06:00:52 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1768193348200/4c7e8feb-0e98-4927-a727-e9ecee06a415.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong><em>"I just need to change the ENV variable to 'prod'"</em> - Spoiler: No</strong></p>
<p>Yesterday: when to run your pipelines. Today: <strong>where to run them.</strong></p>
<p>Dev, staging, production. You know this from traditional software.</p>
<p>But in ML, environments are much more than code.</p>
<hr />
<p><strong>A story:</strong></p>
<p>Data Scientist: "I finished the model, push it to production" Me: "Did you test it in staging?" DS: "We don't have staging for ML" Me: "How come?" DS: "It's the same code, you just change the database connection"</p>
<p><strong>3 weeks later:</strong></p>
<p>The model in production degraded by 15%.</p>
<p>Why? Because production data was different from development data.</p>
<p>Nobody had tested it with real prod data before the deploy.</p>
<p><strong>The fundamental problem:</strong></p>
<p>In traditional software:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> ENV == <span class="hljs-string">"dev"</span>:
    db = <span class="hljs-string">"localhost:5432"</span>
<span class="hljs-keyword">elif</span> ENV == <span class="hljs-string">"prod"</span>:
    db = <span class="hljs-string">"prod-db:5432"</span>
</code></pre>
<p><strong>That's it. Same code, different configuration.</strong></p>
<hr />
<p>In ML:</p>
<pre><code class="lang-python"><span class="hljs-keyword">if</span> ENV == <span class="hljs-string">"dev"</span>:
    db = <span class="hljs-string">"localhost:5432"</span>
    dataset = <span class="hljs-string">"sample_1000_rows.csv"</span>  <span class="hljs-comment"># Subset</span>
    model = <span class="hljs-string">"model_v1_local.pkl"</span>
    features = [<span class="hljs-string">"basic_feature_set"</span>]
    monitoring = <span class="hljs-literal">False</span>

<span class="hljs-keyword">elif</span> ENV == <span class="hljs-string">"prod"</span>:
    db = <span class="hljs-string">"prod-db:5432"</span>
    dataset = <span class="hljs-string">"s3://prod/full_dataset/"</span>  <span class="hljs-comment"># 500GB</span>
    model = <span class="hljs-string">"s3://models/prod/v2.3.1/"</span>
    features = [<span class="hljs-string">"basic + advanced + experimental"</span>]
    monitoring = <span class="hljs-literal">True</span>
    drift_detection = <span class="hljs-literal">True</span>
    A_B_testing = <span class="hljs-literal">True</span>
</code></pre>
<p><strong>It's not just configuration. These are completely different assets.</strong></p>
<hr />
<p><strong>The 3 kinds of separation you need:</strong></p>
<h3 id="heading-1-separacion-de-datos"><strong>1. Data Separation</strong></h3>
<pre><code class="lang-bash">DEV:
├── datasets/
│   ├── sample_1k.parquet      <span class="hljs-comment"># Subset for development</span>
│   ├── synthetic_data.parquet  <span class="hljs-comment"># Synthetic data</span>
│   └── anonymized_prod.parquet <span class="hljs-comment"># Anonymized sample</span>

STAGING:
├── datasets/
│   ├── prod_snapshot.parquet   <span class="hljs-comment"># Prod snapshot (last week)</span>
│   └── validation_set.parquet  <span class="hljs-comment"># Fixed validation set</span>

PROD:
├── datasets/
│   ├── realtime_stream/        <span class="hljs-comment"># Real data, streaming in</span>
│   └── historical/             <span class="hljs-comment"># Full history</span>
</code></pre>
<p><strong>Why?</strong></p>
<ul>
<li><p>Dev: you need to iterate fast (1k rows, not 500GB)</p>
</li>
<li><p>Staging: you need to validate with realistic data</p>
</li>
<li><p>Prod: you need all the real data</p>
</li>
</ul>
<p><strong>You can't train in dev on 1k rows and expect it to work in prod on 500GB.</strong></p>
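<p>One way to keep this honest is to make dataset selection a function of the environment, so nobody accidentally trains on the wrong slice. A minimal sketch (the helper name and split rules are illustrative, not a real loader):</p>

```python
import random

def select_dataset(env: str, all_rows: list) -> list:
    """Pick the slice of data each environment is allowed to see."""
    if env == "dev":
        # Small random sample: fast iteration, no full-data dependency
        return random.sample(all_rows, min(1000, len(all_rows)))
    if env == "staging":
        # Prod-like slice: here, the most recent 10% of rows
        cut = int(len(all_rows) * 0.9)
        return all_rows[cut:]
    if env == "prod":
        return all_rows  # everything, always
    raise ValueError(f"unknown environment: {env!r}")
```

<p>In practice the dev branch would read a pre-built sample file and staging a prod snapshot; the point is that the choice lives in one place instead of being scattered across notebooks.</p>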
<hr />
<h3 id="heading-2-separacion-de-modelos"><strong>2. Separación de Modelos</strong></h3>
<pre><code class="lang-markdown"># Model registry per environment

DEV:
- Experiments: 1000+ models
- No strict validation
- Anyone can register

STAGING:
- Candidates: ~10 models passing validation
- Metrics &gt; baseline required
- Review before promoting

PROD:
- Active: 1-2 models serving traffic
- A/B testing between versions
- Automatic rollback on failure
<span class="hljs-code">```

**El flujo:**
```</span>
DEV → train 100 experiments
<span class="hljs-code">    → pick best
    → promote to STAGING
</span>
STAGING → validate con prod-like data
<span class="hljs-code">        → compare vs prod baseline
        → stress test
        → promote to PROD (if approved)
</span>
PROD → shadow mode (sin tráfico real)
<span class="hljs-code">     → canary deploy (5% tráfico)
     → gradual rollout (100%)</span>
</code></pre>
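<p>That promotion flow can be encoded as a small gate so a model can never skip a stage. A sketch with illustrative names (<code>BASELINE_ACCURACY</code> and the metric keys are assumptions, not a real registry API):</p>

```python
STAGES = ["dev", "staging", "prod"]
BASELINE_ACCURACY = 0.85  # prod baseline to beat (illustrative number)

def next_stage(current: str) -> str:
    """Models move one stage at a time: dev -> staging -> prod."""
    i = STAGES.index(current)
    if i == len(STAGES) - 1:
        raise ValueError("already in prod")
    return STAGES[i + 1]

def can_promote(current_stage: str, metrics: dict) -> bool:
    """Gate: dev -> staging is cheap; staging -> prod must beat the baseline."""
    target = next_stage(current_stage)
    if target == "staging":
        return metrics.get("accuracy", 0.0) > 0.0  # any trained model qualifies
    # target == "prod": must beat the current prod baseline
    return metrics["accuracy"] > BASELINE_ACCURACY
```

<p>The real version would also run the stress test and canary check before prod, but the shape stays the same: one function, one place where promotion rules live.</p>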
<hr />
<h3 id="heading-3-separacion-de-features"><strong>3. Separación de Features</strong></h3>
<pre><code class="lang-python"><span class="hljs-comment"># Feature store per environment</span>

<span class="hljs-class"><span class="hljs-keyword">class</span> <span class="hljs-title">FeatureStore</span>:</span>
    <span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">get_features</span>(<span class="hljs-params">self, env</span>):</span>
        <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
            <span class="hljs-keyword">return</span> [
                <span class="hljs-string">"basic_features"</span>,
                <span class="hljs-string">"experimental_features"</span>  <span class="hljs-comment"># Puedes romper cosas</span>
            ]

        <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"staging"</span>:
            <span class="hljs-keyword">return</span> [
                <span class="hljs-string">"basic_features"</span>,
                <span class="hljs-string">"tested_features"</span>  <span class="hljs-comment"># Solo features validadas</span>
            ]

        <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"prod"</span>:
            <span class="hljs-keyword">return</span> [
                <span class="hljs-string">"basic_features"</span>,
                <span class="hljs-string">"tested_features"</span>,
                <span class="hljs-string">"monitored_features"</span>  <span class="hljs-comment"># + monitoring activo</span>
            ]
</code></pre>
<p><strong>A real multi-environment pipeline example:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># A pipeline that works across all environments</span>

<span class="hljs-meta">@pipeline</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_pipeline</span>(<span class="hljs-params">env: str</span>):</span>

    <span class="hljs-comment"># 1. Datos diferentes por ambiente</span>
    <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
        data = load_sample_data(n_rows=<span class="hljs-number">1000</span>)
    <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"staging"</span>:
        data = load_snapshot(<span class="hljs-string">"prod_last_week"</span>)
    <span class="hljs-keyword">else</span>:  <span class="hljs-comment"># prod</span>
        data = load_full_dataset()

    <span class="hljs-comment"># 2. Validación diferente por ambiente</span>
    <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
        <span class="hljs-comment"># Validación light</span>
        <span class="hljs-keyword">assert</span> data <span class="hljs-keyword">is</span> <span class="hljs-keyword">not</span> <span class="hljs-literal">None</span>
    <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"staging"</span>:
        <span class="hljs-comment"># Validación completa</span>
        validate_schema(data)
        validate_distributions(data)
    <span class="hljs-keyword">else</span>:  <span class="hljs-comment"># prod</span>
        <span class="hljs-comment"># Validación + alerting</span>
        validate_and_alert(data)

    <span class="hljs-comment"># 3. Training con recursos diferentes</span>
    <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
        model = train(data, resources=<span class="hljs-string">"small"</span>)
    <span class="hljs-keyword">else</span>:
        model = train(data, resources=<span class="hljs-string">"large"</span>)

    <span class="hljs-comment"># 4. Evaluation con thresholds diferentes</span>
    metrics = evaluate(model, data)

    <span class="hljs-keyword">if</span> env == <span class="hljs-string">"dev"</span>:
        <span class="hljs-comment"># Cualquier resultado está OK</span>
        register_model(model, env=<span class="hljs-string">"dev"</span>)
    <span class="hljs-keyword">elif</span> env == <span class="hljs-string">"staging"</span>:
        <span class="hljs-comment"># Debe superar baseline</span>
        <span class="hljs-keyword">if</span> metrics.accuracy &gt; BASELINE:
            register_model(model, env=<span class="hljs-string">"staging"</span>)
    <span class="hljs-keyword">else</span>:  <span class="hljs-comment"># prod</span>
        <span class="hljs-comment"># Debe superar baseline + pasar A/B test</span>
        <span class="hljs-keyword">if</span> metrics.accuracy &gt; BASELINE:
            deploy_canary(model)
            <span class="hljs-keyword">if</span> canary_metrics_ok():
                promote_to_prod(model)
</code></pre>
<hr />
<p><strong>The infrastructure changes too:</strong></p>
<pre><code class="lang-bash">DEV:
- Local machine or shared VM
- CPU only (GPUs are expensive for dev)
- No redundancy
- Can fail without consequences

STAGING:
- Dedicated cluster
- Similar to prod (but smaller)
- With monitoring (but no critical alerting)
- Prod data, but anonymized

PROD:
- High-availability cluster
- GPUs/resources as needed
- Redundancy + auto-scaling
- 24/7 monitoring + alerting
- Backup + disaster recovery
</code></pre>
<hr />
<p><strong>The most common mistakes:</strong></p>
<p><strong>ERROR 1: "We don't have staging for ML"</strong></p>
<pre><code class="lang-bash">Result: deploy straight to production
First time you see real data: in prod
When something fails: you hit users
</code></pre>
<p><strong>ERROR 2: "Staging uses the same data as dev"</strong></p>
<pre><code class="lang-bash">Result: you validate on a 1k-row sample
Works in staging (1k rows)
Fails in prod (500GB) on memory/performance
</code></pre>
<p><strong>ERROR 3: "Prod and staging share infrastructure"</strong></p>
<pre><code class="lang-bash">Result: an experiment in staging eats resources
Prod degrades because staging stole its memory
Incident at 3am
</code></pre>
<p><strong>ERROR 4: "A single model registry for everything"</strong></p>
<pre><code class="lang-bash">Result: 1000 experiments mixed in with prod models
Nobody knows which model is in production
Rollback impossible
</code></pre>
<hr />
<p><strong>Recommended architecture:</strong></p>
<pre><code class="lang-bash">┌─────────────────────────────────────────────────┐
│                    DEV                          │
│  - Local/shared VM                              │
│  - Sample data (1k-10k rows)                    │
│  - Rapid iteration                              │
│  - No strict validation                         │
└─────────────┬───────────────────────────────────┘
              │ promote best experiment
              ▼
┌─────────────────────────────────────────────────┐
│                  STAGING                        │
│  - Prod-like cluster (smaller)                  │
│  - Prod snapshot data (last week/month)         │
│  - Full validation pipeline                     │
│  - Metrics vs prod baseline                     │
└─────────────┬───────────────────────────────────┘
              │ if approved
              ▼
┌─────────────────────────────────────────────────┐
│                   PROD                          │
│  - HA cluster                                   │
│  - Real-time data                               │
│  - Shadow → Canary → Full rollout               │
│  - 24/7 monitoring                              │
└─────────────────────────────────────────────────┘
</code></pre>
<hr />
<p><strong>Practical implementation:</strong></p>
<p><strong>Option 1: Namespaces (if you use K8s)</strong></p>
<pre><code class="lang-yaml"><span class="hljs-comment"># dev namespace</span>
<span class="hljs-attr">namespace:</span> <span class="hljs-string">ml-dev</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-attr">memory:</span> <span class="hljs-string">4GB</span>
  <span class="hljs-attr">cpu:</span> <span class="hljs-number">2</span>

<span class="hljs-comment"># staging namespace  </span>
<span class="hljs-attr">namespace:</span> <span class="hljs-string">ml-staging</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-attr">memory:</span> <span class="hljs-string">16GB</span>
  <span class="hljs-attr">cpu:</span> <span class="hljs-number">8</span>

<span class="hljs-comment"># prod namespace</span>
<span class="hljs-attr">namespace:</span> <span class="hljs-string">ml-prod</span>
<span class="hljs-attr">resources:</span>
  <span class="hljs-attr">memory:</span> <span class="hljs-string">64GB</span>
  <span class="hljs-attr">cpu:</span> <span class="hljs-number">32</span>
  <span class="hljs-attr">replicas:</span> <span class="hljs-number">3</span>
</code></pre>
<p><strong>Option 2: Separate accounts (AWS/GCP/Azure)</strong></p>
<pre><code class="lang-bash">dev-account/
├── s3://dev-ml-data/
├── dev-model-registry
└── dev-feature-store

staging-account/
├── s3://staging-ml-data/
├── staging-model-registry
└── staging-feature-store

prod-account/
├── s3://prod-ml-data/
├── prod-model-registry
└── prod-feature-store
</code></pre>
<p><strong>Option 3: Prefixes (if you can't separate infrastructure)</strong></p>
<pre><code class="lang-python">CONFIGS = {
    <span class="hljs-string">"dev"</span>: {
        <span class="hljs-string">"data_bucket"</span>: <span class="hljs-string">"ml-data"</span>,
        <span class="hljs-string">"data_prefix"</span>: <span class="hljs-string">"dev/"</span>,
        <span class="hljs-string">"model_registry"</span>: <span class="hljs-string">"models/dev/"</span>,
    },
    <span class="hljs-string">"staging"</span>: {
        <span class="hljs-string">"data_bucket"</span>: <span class="hljs-string">"ml-data"</span>,
        <span class="hljs-string">"data_prefix"</span>: <span class="hljs-string">"staging/"</span>,
        <span class="hljs-string">"model_registry"</span>: <span class="hljs-string">"models/staging/"</span>,
    },
    <span class="hljs-string">"prod"</span>: {
        <span class="hljs-string">"data_bucket"</span>: <span class="hljs-string">"ml-data-prod"</span>,  <span class="hljs-comment"># Bucket separado</span>
        <span class="hljs-string">"data_prefix"</span>: <span class="hljs-string">"prod/"</span>,
        <span class="hljs-string">"model_registry"</span>: <span class="hljs-string">"models/prod/"</span>,
    }
}
</code></pre>
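<p>A small helper on top of a dict like this keeps path construction in one place and fails loudly on typos. A sketch reusing the illustrative bucket names above:</p>

```python
CONFIGS = {
    "dev":     {"data_bucket": "ml-data",      "data_prefix": "dev/"},
    "staging": {"data_bucket": "ml-data",      "data_prefix": "staging/"},
    "prod":    {"data_bucket": "ml-data-prod", "data_prefix": "prod/"},
}

def data_uri(env: str, filename: str) -> str:
    """Build the full object URI for an environment."""
    if env not in CONFIGS:
        raise KeyError(f"unknown environment: {env!r}")
    cfg = CONFIGS[env]
    return f"s3://{cfg['data_bucket']}/{cfg['data_prefix']}{filename}"
```

<p>With this, no pipeline ever hardcodes a bucket: it asks for <code>data_uri(env, ...)</code> and the environment decides the rest.</p>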
<hr />
<p><strong>Environment-separation checklist:</strong></p>
<p><strong>Data:</strong></p>
<ul>
<li><p>[ ] Dev uses sample/synthetic data</p>
</li>
<li><p>[ ] Staging uses a prod snapshot</p>
</li>
<li><p>[ ] Prod uses real data</p>
</li>
<li><p>[ ] Prod data never in dev (privacy)</p>
</li>
</ul>
<p><strong>Models:</strong></p>
<ul>
<li><p>[ ] Model registry per environment</p>
</li>
<li><p>[ ] Clear promotion workflow (dev → staging → prod)</p>
</li>
<li><p>[ ] Easy rollback in prod</p>
</li>
<li><p>[ ] Experiments only in dev/staging</p>
</li>
</ul>
<p><strong>Features:</strong></p>
<ul>
<li><p>[ ] Feature store per environment</p>
</li>
<li><p>[ ] Experimental features only in dev</p>
</li>
<li><p>[ ] Feature monitoring in prod</p>
</li>
</ul>
<p><strong>Infrastructure:</strong></p>
<ul>
<li><p>[ ] Separate resources per environment</p>
</li>
<li><p>[ ] Prod doesn't share resources with dev/staging</p>
</li>
<li><p>[ ] Different credentials per environment</p>
</li>
<li><p>[ ] Separate logs/metrics</p>
</li>
</ul>
<p><strong>Process:</strong></p>
<ul>
<li><p>[ ] Different CI/CD per environment</p>
</li>
<li><p>[ ] Approvals to promote to prod</p>
</li>
<li><p>[ ] Testing in staging before prod</p>
</li>
<li><p>[ ] Canary/shadow deploys in prod</p>
</li>
</ul>
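<p>Part of this checklist can be machine-checked instead of trusted to memory. A sketch that flags any resource prod shares with another environment (the config shape is hypothetical):</p>

```python
def check_isolation(configs: dict) -> list:
    """Return a problem report for every prod resource reused elsewhere."""
    problems = []
    prod = configs.get("prod", {})
    for env, cfg in configs.items():
        if env == "prod":
            continue
        for key, value in cfg.items():
            if prod.get(key) == value:  # same bucket, credential, cluster...
                problems.append(f"prod shares {key}={value!r} with {env}")
    return problems
```

<p>Run something like this in CI against your environment configs and the "prod doesn't share resources" checkbox enforces itself.</p>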
<hr />
<p><strong>My recommendation as a Platform Engineer:</strong></p>
<p><strong>Phase 1 - Minimum viable:</strong></p>
<pre><code class="lang-markdown">- Dev: local with sample data
- Prod: cluster with real data
- Deploy: manual with a checklist
</code></pre>
<p><strong>Phase 2 - Add staging:</strong></p>
<pre><code class="lang-markdown">- Staging: weekly snapshot of prod
- Automatic validation in staging
- Deploy: semi-automatic with approval
</code></pre>
<p><strong>Phase 3 - Full separation:</strong></p>
<pre><code class="lang-markdown">- 3 fully separated environments
- Automated CI/CD
- Canary deploys
- A/B testing
</code></pre>
<p><strong>Don't try to do Phase 3 from day 1.</strong></p>
<p><strong>Tomorrow:</strong></p>
<p>You now know how to run pipelines in different environments.</p>
<p>But there's a problem: <strong>all of this is code that lives somewhere.</strong></p>
<p>How do you connect Terraform, Helm, K8s, and GitOps with MLOps?</p>
<p><strong>Infrastructure as Code for ML. And why it's not just</strong> <code>terraform apply</code>.</p>
<hr />
<p><strong>Your turn:</strong></p>
<ul>
<li><p>How many environments do you have for ML?</p>
</li>
<li><p>Or is it "dev is my laptop and prod is... also my laptop"?</p>
</li>
</ul>
<hr />
]]></content:encoded></item><item><title><![CDATA[Time-based vs Event-driven: Why it's not just cron vs webhooks]]></title><description><![CDATA[Yesterday: which orchestrator to use. Today: when to run that orchestrator.
The most common setup I see: (don't be like this, xD, love your PE)
# In the orchestrator of your choice
@schedule("0 2 * * *")  # Every day at 2am
def train_model():
    data = fetch_data()...]]></description><link>https://parraletz.dev/time-based-vs-event-driven-por-que-no-es-solo-cron-vs-webhooks</link><guid isPermaLink="true">https://parraletz.dev/time-based-vs-event-driven-por-que-no-es-solo-cron-vs-webhooks</guid><category><![CDATA[mlops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[data-engineering]]></category><category><![CDATA[Orchestration]]></category><category><![CDATA[scheduling]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Fri, 09 Jan 2026 12:53:58 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767963121578/c11be0da-e890-4613-a7a9-597c74f3dbfa.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Yesterday: which orchestrator to use.<br />Today: <strong>when to run that orchestrator.</strong></p>
<p><strong>The most common setup I see: (don't be like this, xD, love your Platform Engineer)</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># In the orchestrator of your choice</span>
<span class="hljs-meta">@schedule("0 2 * * *")  # Every day at 2am</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train_model</span>():</span>
    data = fetch_data()
    model = train(data)
    deploy(model)
</code></pre>
<p><strong>Simple question:</strong> Why 2am? Why every day?</p>
<p><strong>Common answer:</strong> Because... low-traffic window?</p>
<p><strong>But the real question is:</strong></p>
<p><strong>"What event justifies spending $X in compute to retrain?"</strong></p>
<p>Because retraining costs:</p>
<ul>
<li><p>Compute (GPUs aren't free)</p>
</li>
<li><p>Time (your pipeline blocks resources)</p>
</li>
<li><p>Risk (every deploy is a potential incident)</p>
</li>
</ul>
<p><strong>If nothing changed, why retrain?</strong></p>
<hr />
<h2 id="heading-los-3-triggers-legitimos"><strong>Los 3 triggers legítimos:</strong></h2>
<p><strong>1. Datos nuevos</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># Llegaron 10K nuevas transacciones</span>
<span class="hljs-comment"># El modelo PUEDE aprender algo nuevo</span>
<span class="hljs-keyword">if</span> new_data.count &gt;= <span class="hljs-number">10000</span>:
    trigger_training()
</code></pre>
<p><strong>2. Detected degradation</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># The model is failing in prod</span>
<span class="hljs-comment"># Accuracy dropped 5%</span>
<span class="hljs-keyword">if</span> current_accuracy &lt; baseline - <span class="hljs-number">0.05</span>:
    trigger_emergency_training()
</code></pre>
<p><strong>3. Code/features changed</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># You added a new feature</span>
<span class="hljs-comment"># You changed hyperparameters</span>
<span class="hljs-keyword">if</span> code_changed() <span class="hljs-keyword">or</span> features_changed():
    trigger_ci_training()
</code></pre>
<p><strong>If none of these 3 happened, you probably don't need to retrain.</strong></p>
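<p>The three triggers combine naturally into one decision function that every scheduled check can call. A sketch (the thresholds are the illustrative ones above, not universal values):</p>

```python
def should_retrain(new_rows: int, accuracy: float, baseline: float,
                   code_changed: bool) -> tuple:
    """Return (retrain?, reason) based on the 3 legitimate triggers."""
    if accuracy < baseline - 0.05:
        return True, "degradation"   # trigger 2: model failing in prod
    if new_rows >= 10_000:
        return True, "new-data"      # trigger 1: enough new data to learn from
    if code_changed:
        return True, "code-change"   # trigger 3: features/hyperparams changed
    return False, "nothing-changed"  # don't burn compute
```

<p>The returned reason is worth logging: it turns "why did we retrain at 2am?" into an answerable question.</p>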
<hr />
<h3 id="heading-pattern-1-time-based-batch"><strong>Pattern 1: Time-based (Batch)</strong></h3>
<pre><code class="lang-python"><span class="hljs-comment"># Airflow, Prefect, Kubeflow, etc.</span>
<span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_pipeline</span>():</span>
    train_model()
</code></pre>
<p><strong>Trade-offs:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>Simple to implement</td><td>Can be wasteful</td></tr>
<tr>
<td>Predictable</td><td>Doesn't respond to real changes</td></tr>
<tr>
<td>Easy to debug</td><td>May train on bad data</td></tr>
<tr>
<td>Familiar to teams</td><td>Latency until the next run</td></tr>
</tbody>
</table>
</div><p><strong>Use it when:</strong></p>
<ul>
<li><p>Your data arrives on a predictable schedule</p>
</li>
<li><p>Training cost is low</p>
</li>
<li><p>You prefer simplicity over optimization</p>
</li>
</ul>
<p><strong>Don't use it when:</strong></p>
<ul>
<li><p>Data arrives irregularly</p>
</li>
<li><p>Retraining is very expensive</p>
</li>
<li><p>You need an immediate response to changes</p>
</li>
</ul>
<hr />
<h3 id="heading-pattern-2-event-driven"><strong>Pattern 2: Event-driven</strong></h3>
<pre><code class="lang-python"><span class="hljs-comment"># Triggered by an external event</span>
<span class="hljs-meta">@trigger(bucket="s3://data", event="new_file")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">training_pipeline</span>():</span>
    <span class="hljs-keyword">if</span> new_data_meets_threshold():
        train_model()
</code></pre>
<p><strong>Trade-offs:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>You only train when needed</td><td>More complex to implement</td></tr>
<tr>
<td>Optimizes compute</td><td>Harder debugging</td></tr>
<tr>
<td>Responds to real events</td><td>You need event infrastructure</td></tr>
<tr>
<td>Scales better</td><td>Can fire too often</td></tr>
</tbody>
</table>
</div><p><strong>Use it when:</strong></p>
<ul>
<li><p>Data arrives irregularly</p>
</li>
<li><p>Retraining is expensive (&gt;$100/run)</p>
</li>
<li><p>You have event infrastructure (S3, Kafka, etc.)</p>
</li>
</ul>
<p><strong>Don't use it when:</strong></p>
<ul>
<li><p>Your team has no experience with events</p>
</li>
<li><p>Data arrives consistently</p>
</li>
<li><p>You prefer simplicity</p>
</li>
</ul>
<hr />
<h3 id="heading-pattern-3-metric-driven"><strong>Pattern 3: Metric-driven</strong></h3>
<pre><code class="lang-python"><span class="hljs-meta">@schedule("@hourly")  # Frequent check</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">monitor_and_train</span>():</span>
    metrics = get_production_metrics()

    <span class="hljs-keyword">if</span> metrics.accuracy &lt; THRESHOLD:
        trigger_emergency_retrain()
        alert_team()
</code></pre>
<p><strong>Trade-offs:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Pros</td><td>Cons</td></tr>
</thead>
<tbody>
<tr>
<td>Responds to real problems</td><td>Requires robust monitoring</td></tr>
<tr>
<td>Optimizes for quality</td><td>Can have false positives</td></tr>
<tr>
<td>Proactive about drift</td><td>Latency between detection and fix</td></tr>
<tr>
<td>Justifies every retrain</td><td>Complex to implement well</td></tr>
</tbody>
</table>
</div><p><strong>Use it when:</strong></p>
<ul>
<li><p>You can measure performance in production</p>
</li>
<li><p>Cost of a bad model &gt; cost of retraining</p>
</li>
<li><p>You have robust alerting</p>
</li>
</ul>
<p><strong>Don't use it when:</strong></p>
<ul>
<li><p>You don't have reliable metrics in prod</p>
</li>
<li><p>Retraining takes too long</p>
</li>
<li><p>Your model doesn't degrade predictably</p>
</li>
</ul>
<hr />
<h3 id="heading-pattern-4-hybrid-real-world"><strong>Pattern 4: Hybrid (Real-world)</strong></h3>
<pre><code class="lang-python"><span class="hljs-comment"># What we REALLY use in production</span>

<span class="hljs-comment"># Baseline: conservative time-based</span>
<span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">baseline_training</span>():</span>
    train_model(priority=<span class="hljs-string">"normal"</span>)

<span class="hljs-comment"># Emergency: metric-driven</span>
<span class="hljs-meta">@monitor(metric="accuracy", threshold=0.85)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">emergency_training</span>():</span>
    train_model(priority=<span class="hljs-string">"high"</span>)
    alert_oncall()

<span class="hljs-comment"># CI/CD: code-driven</span>
<span class="hljs-meta">@on_merge(branch="main")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ci_training</span>():</span>
    <span class="hljs-keyword">if</span> features_changed():
        train_model(priority=<span class="hljs-string">"normal"</span>)
</code></pre>
<p><strong>This is the reality in prod:</strong></p>
<ul>
<li><p>80% of trainings → time-based baseline</p>
</li>
<li><p>15% → triggered by CI/CD</p>
</li>
<li><p>5% → emergency, metric-driven</p>
</li>
</ul>
<hr />
<p><strong>Decision criteria:</strong></p>
<p><strong>Data frequency:</strong></p>
<pre><code class="lang-bash">Streaming (seconds) → event-driven
Daily batch → daily schedule
Weekly batch → weekly schedule
Monthly batch → monthly schedule
Irregular → event-driven
</code></pre>
<p><strong>Training cost:</strong></p>
<pre><code class="lang-bash">&lt;$10/run → frequent time-based is fine
$10-100/run → time-based + basic triggers
$100-1000/run → event-driven + monitoring
&gt;$1000/run → only when absolutely necessary
</code></pre>
<p><strong>How critical freshness is:</strong></p>
<pre><code class="lang-bash">High (fraud, spam) → hybrid with monitoring
Medium (recommendations) → time-based + events
Low (analytics) → conservative time-based
</code></pre>
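<p>These criteria can be folded into one helper that proposes a pattern. A sketch (the cost cutoffs are the rough figures above, not hard rules):</p>

```python
def scheduling_strategy(cost_per_run_usd: float, criticality: str) -> str:
    """Map training cost and freshness criticality to a trigger pattern."""
    if criticality == "high":              # fraud, spam: freshness wins
        return "hybrid + monitoring"
    if cost_per_run_usd < 10:
        return "frequent time-based"
    if cost_per_run_usd < 100:
        return "time-based + basic triggers"
    if cost_per_run_usd < 1000:
        return "event-driven + monitoring"
    return "only when strictly necessary"
```

<p>Even if you never run this, writing the rules down forces the conversation: what does a run actually cost, and how fresh does the model really need to be?</p>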
<hr />
<p><strong>A real architecture example:</strong></p>
<pre><code class="lang-bash">┌─────────────────────────────────────┐
│     Data Pipeline (Event-driven)     │
│   S3 → Trigger → Validation → ...   │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│  Training Orchestrator (Hybrid)     │
│                                     │
│  1. Baseline: @weekly              │
│  2. On data threshold: &gt;10K rows   │
│  3. On metric drop: &lt;85% accuracy  │
│  4. On code change: features/*     │
└──────────────┬──────────────────────┘
               │
               ▼
┌─────────────────────────────────────┐
│    ML Pipeline (Orchestrated)      │
│  ingest → train → evaluate → ...   │
└─────────────────────────────────────┘
</code></pre>
<hr />
<p><strong>Mistakes I've seen:</strong></p>
<p><strong>ERROR 1: "Train every hour just in case"</strong></p>
<blockquote>
<p>Monthly cost: $432 in compute. Benefit: a 0.1% better model. ROI: negative.</p>
</blockquote>
<p><strong>ERROR 2: "Only train manually"</strong></p>
<blockquote>
<p>Result: it gets forgotten for weeks. The model degrades without anyone noticing. The customer complains first.</p>
</blockquote>
<p><strong>ERROR 3: "Retrain on every commit"</strong></p>
<blockquote>
<p>100 commits/day × $10/training = $1000/day. 99% of those trainings change nothing.</p>
</blockquote>
<p><strong>THE RIGHT WAY:</strong></p>
<blockquote>
<p>A predictable baseline + smart triggers. Monitoring that justifies every training run. A balance between freshness and cost.</p>
</blockquote>
<p><strong>My pragmatic recommendation:</strong></p>
<p><strong>Phase 1 - MVP:</strong></p>
<pre><code class="lang-python"><span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    <span class="hljs-comment"># Simple, predictable, enough to start</span>
    train_model()
</code></pre>
<p><strong>Phase 2 - Add monitoring:</strong></p>
<pre><code class="lang-python"><span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>():</span>
    train_model()

<span class="hljs-meta">@alert(metric="accuracy &lt; 0.85")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">notify</span>():</span>
    <span class="hljs-comment"># Alert only, no auto-retrain yet</span>
    <span class="hljs-keyword">pass</span>
</code></pre>
<p><strong>Phase 3 - Event-driven, if you need it:</strong></p>
<pre><code class="lang-python"><span class="hljs-meta">@schedule("@weekly")</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">baseline_train</span>():</span>
    <span class="hljs-keyword">pass</span>

<span class="hljs-meta">@trigger(on_data_threshold)</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">event_train</span>():</span>
    <span class="hljs-keyword">pass</span>
</code></pre>
<p><strong>Don't jump to Phase 3 if Phase 1 is enough.</strong></p>
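<p>The three phases can be collapsed into a single decision function. A minimal sketch, assuming illustrative names and thresholds (<code>should_retrain</code>, the 10K-row and 85% figures are examples from this post, not any real framework's API):</p>

```python
from datetime import datetime, timedelta
from typing import Optional

def should_retrain(last_train: datetime, now: datetime,
                   new_rows: int, accuracy: float,
                   max_age: timedelta = timedelta(weeks=1),
                   row_threshold: int = 10_000,
                   min_accuracy: float = 0.85) -> Optional[str]:
    """Return the trigger that fires, or None if no retrain is needed."""
    if accuracy < min_accuracy:
        return "metric_drop"        # urgent: the model is degrading
    if new_rows > row_threshold:
        return "data_threshold"     # enough fresh data to matter
    if now - last_train >= max_age:
        return "baseline_schedule"  # the weekly safety net
    return None
```

<p>Checked before each run, this keeps the weekly baseline as a fallback while letting the urgent triggers fire first.</p>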
<hr />
<p><strong>Tomorrow we'll talk about environments:</strong></p>
<p>Dev, staging, production... but for ML.</p>
<p>Where an "environment" isn't just different code, but:</p>
<ul>
<li><p>Different datasets</p>
</li>
<li><p>Different models</p>
</li>
<li><p>Different metrics</p>
</li>
<li><p>Different infrastructure</p>
</li>
</ul>
<p><strong>Spoiler:</strong> <code>if ENV == "prod"</code> isn't enough.</p>
<hr />
<p><strong>A question:</strong></p>
<ul>
<li><p>What pattern do you use for scheduling?</p>
</li>
<li><p>Have you questioned whether it's the right one?</p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[The Pipeline Orchestrator Concept]]></title><description><![CDATA[Airflow, Kubeflow, Prefect, Dagster: the question everyone asks wrong
Yesterday we saw why a script is not a pipeline.
Today, the obvious question: "Which orchestrator should I use?"
But that's not the right question.
A true story:
A while back, a colleague asked me...]]></description><link>https://parraletz.dev/concepto-de-pipeline-orquestrador</link><guid isPermaLink="true">https://parraletz.dev/concepto-de-pipeline-orquestrador</guid><category><![CDATA[mlops]]></category><category><![CDATA[Kubernetes]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Thu, 08 Jan 2026 12:59:37 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767877117040/5cfa8c06-8c01-454a-8b4f-aa51551965cc.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Airflow, Kubeflow, Prefect, Dagster: the question everyone asks wrong</strong></p>
<p>Yesterday we saw why a script is not a pipeline.</p>
<p>Today, the obvious question: <strong>"Which orchestrator should I use?"</strong></p>
<p><strong>But that's not the right question.</strong></p>
<p><strong>A true story:</strong></p>
<p>A while back, a colleague asked me: <em>"Why don't we just use Airflow, like we do for everything else?"</em></p>
<p>A fair question. We had Airflow working perfectly for:</p>
<ul>
<li><p>Daily ETLs</p>
</li>
<li><p>Automated reports</p>
</li>
<li><p>Batch data pipelines</p>
</li>
</ul>
<p><strong>Why did we need to evaluate Kubeflow?</strong></p>
<hr />
<p><strong>The problem showed up with ML:</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># What worked for ETL</span>
<span class="hljs-meta">@task</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">process_data</span>():</span>
    df = read_database()
    df = transform(df)
    save_to_warehouse(df)
    <span class="hljs-comment"># Always the same flow, predictable</span>

<span class="hljs-comment"># What we needed for ML</span>
<span class="hljs-meta">@task</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train_model</span>():</span>
    <span class="hljs-comment"># I need to run 100 hyperparameter</span>
    <span class="hljs-comment"># combinations IN PARALLEL</span>
    <span class="hljs-comment"># on GPUs</span>
    <span class="hljs-comment"># and only register the best one</span>
    <span class="hljs-comment"># How do I do that in Airflow?</span>
    <span class="hljs-keyword">pass</span>
</code></pre>
<p><strong>That's when I understood something fundamental:</strong></p>
<p><strong>Orchestrators come from different worlds and solve different problems.</strong></p>
<p>There is no "best one." Only the best one <strong>for your problem</strong>.</p>
<hr />
<h2 id="heading-primero-que-es-un-orquestador"><strong>First: what is an orchestrator?</strong></h2>
<p>If you come from CI/CD (like me), it's like GitHub Actions but for data/ML workflows.</p>
<p>Instead of:</p>
<pre><code class="lang-yaml"><span class="hljs-comment"># .github/workflows/deploy.yml</span>
<span class="hljs-string">build</span> <span class="hljs-string">→</span> <span class="hljs-string">test</span> <span class="hljs-string">→</span> <span class="hljs-string">deploy</span>
</code></pre>
<p>It's:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># ml_pipeline.py</span>
ingest → validate → train → evaluate → deploy
</code></pre>
<p><strong>With 3 key differences:</strong></p>
<ol>
<li><p><strong>Longer workflows</strong> (hours, not minutes)</p>
</li>
<li><p><strong>Data dependencies</strong> (not just code)</p>
</li>
<li><p><strong>Expected failures</strong> (bad data, low metrics)</p>
</li>
</ol>
<hr />
<p><strong>What ALL orchestrators give you:</strong></p>
<p>Conceptually, these 4 things:</p>
<p><strong>1. DAGs (Directed Acyclic Graphs)</strong></p>
<pre><code class="lang-markdown"><span class="hljs-code">    ingest
     / \
validate process
     \ /
    train → evaluate → deploy</span>
</code></pre>
<p>You define dependencies: "Step B only runs if Step A passed."</p>
<p><strong>2. Scheduling</strong></p>
<ul>
<li><p>"Run this every day at 2am"</p>
</li>
<li><p>"Run when new data arrives"</p>
</li>
<li><p>Custom triggers</p>
</li>
</ul>
<p><strong>3. Retries &amp; Error Handling</strong></p>
<ul>
<li><p>Automatic retries</p>
</li>
<li><p>Failure alerts</p>
</li>
<li><p>Recovery from specific checkpoints</p>
</li>
</ul>
<p><strong>4. Observability</strong></p>
<ul>
<li><p>Centralized logs for every step</p>
</li>
<li><p>Flow visualization</p>
</li>
<li><p>Execution metrics</p>
</li>
</ul>
<p><strong>That's an orchestrator. The rest is implementation.</strong></p>
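<p>Point 1 fits in a few lines of plain Python. A toy sketch of what every orchestrator does under the hood, using only the standard library (the <code>dag</code> dict and <code>run</code> helper are illustrative, not any real orchestrator's API):</p>

```python
from graphlib import TopologicalSorter  # stdlib since Python 3.9

# Map each step to the steps it depends on (the DAG drawn above)
dag = {
    "validate": {"ingest"},
    "process":  {"ingest"},
    "train":    {"validate", "process"},
    "evaluate": {"train"},
    "deploy":   {"evaluate"},
}

def run(dag):
    """Execute steps in dependency order: a step only runs after its parents."""
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        print(f"running {step}")  # real orchestrators add retries, logs, alerts
    return order
```

<p>Everything else an orchestrator adds (scheduling, retries, observability) wraps around this core loop.</p>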
<hr />
<h2 id="heading-los-4-grandes-y-de-donde-vienen"><strong>The big 4 (and where they come from):</strong></h2>
<h3 id="heading-airflow"><strong>Airflow</strong></h3>
<p>From: Data Engineering<br />For: batch ETLs, robust scheduling</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> airflow <span class="hljs-keyword">import</span> DAG
<span class="hljs-keyword">from</span> airflow.operators.python <span class="hljs-keyword">import</span> PythonOperator

<span class="hljs-keyword">with</span> DAG(<span class="hljs-string">'ml_pipeline'</span>, schedule=<span class="hljs-string">'@daily'</span>) <span class="hljs-keyword">as</span> dag:
    ingest = PythonOperator(task_id=<span class="hljs-string">'ingest'</span>, ...)
    train = PythonOperator(task_id=<span class="hljs-string">'train'</span>, ...)

    ingest &gt;&gt; train  <span class="hljs-comment"># Explicit dependencies</span>
</code></pre>
<p><strong>✅ Use it if:</strong></p>
<ul>
<li><p>You already have Airflow for ETLs</p>
</li>
<li><p>Traditional batch workflows</p>
</li>
<li><p>You need a mature ecosystem (1000+ operators)</p>
</li>
</ul>
<p><strong>❌ Avoid it if:</strong></p>
<ul>
<li><p>Complex ML workflows (hyperparameter tuning)</p>
</li>
<li><p>You need dynamic execution</p>
</li>
<li><p>You want a better developer experience</p>
</li>
</ul>
<hr />
<h3 id="heading-kubeflow-pipelines"><strong>Kubeflow Pipelines</strong></h3>
<p>From: ML on Kubernetes<br />For: distributed training, GPUs, ML-native workflows</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> kfp <span class="hljs-keyword">import</span> dsl

<span class="hljs-meta">@dsl.pipeline(name='ml_pipeline')</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">pipeline</span>():</span>
    ingest = dsl.ContainerOp(
        name=<span class="hljs-string">'ingest'</span>,
        image=<span class="hljs-string">'gcr.io/my-project/ingest:v1'</span>
    )
    train = dsl.ContainerOp(
        name=<span class="hljs-string">'train'</span>,
        image=<span class="hljs-string">'gcr.io/my-project/train:v1'</span>
    )
    train.after(ingest)
</code></pre>
<p><strong>✅ Use it if:</strong></p>
<ul>
<li><p>You already live in Kubernetes</p>
</li>
<li><p>You need GPU scheduling</p>
</li>
<li><p>Native hyperparameter tuning</p>
</li>
</ul>
<p><strong>❌ Avoid it if:</strong></p>
<ul>
<li><p>You don't want to manage K8s</p>
</li>
<li><p>Simple workflows</p>
</li>
<li><p>Small team without K8s expertise</p>
</li>
</ul>
<h3 id="heading-prefect"><strong>Prefect</strong></h3>
<p>From: "Airflow, but with better UX"<br />For: dynamic workflows, developer experience</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> prefect <span class="hljs-keyword">import</span> task, flow

<span class="hljs-meta">@task</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ingest</span>():</span>
    <span class="hljs-keyword">return</span> fetch_data()

<span class="hljs-meta">@task</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">train</span>(<span class="hljs-params">data</span>):</span>
    <span class="hljs-keyword">return</span> train_model(data)

<span class="hljs-meta">@flow</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">ml_pipeline</span>():</span>
    data = ingest()
    model = train(data)  <span class="hljs-comment"># Implicit dependencies</span>
</code></pre>
<p><strong>✅ Use it if:</strong></p>
<ul>
<li><p>You want Airflow but with better DX</p>
</li>
<li><p>Dynamic workflows (loops, conditionals)</p>
</li>
<li><p>Simpler deployment</p>
</li>
</ul>
<p><strong>❌ Avoid it if:</strong></p>
<ul>
<li><p>You need a giant plugin ecosystem</p>
</li>
<li><p>Your team already masters Airflow</p>
</li>
<li><p>You want a super-robust UI (though it has improved)</p>
</li>
</ul>
<hr />
<h3 id="heading-dagster"><strong>Dagster</strong></h3>
<p>From: "data pipelines as software engineering"<br />For: data platforms, software-defined assets</p>
<pre><code class="lang-python"><span class="hljs-keyword">from</span> dagster <span class="hljs-keyword">import</span> asset, Definitions

<span class="hljs-meta">@asset</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">raw_data</span>():</span>
    <span class="hljs-keyword">return</span> fetch_data()

<span class="hljs-meta">@asset</span>
<span class="hljs-function"><span class="hljs-keyword">def</span> <span class="hljs-title">trained_model</span>(<span class="hljs-params">raw_data</span>):</span>
    <span class="hljs-comment"># Dagster knows this depends on raw_data</span>
    <span class="hljs-keyword">return</span> train_model(raw_data)

defs = Definitions(assets=[raw_data, trained_model])
</code></pre>
<p><strong>✅ Use it if:</strong></p>
<ul>
<li><p>You're building a data platform from scratch</p>
</li>
<li><p>You want "assets" instead of "tasks"</p>
</li>
<li><p>Robust pipeline testing</p>
</li>
</ul>
<p><strong>❌ Avoid it if:</strong></p>
<ul>
<li><p>You only need simple scheduling</p>
</li>
<li><p>The learning curve isn't worth it for you</p>
</li>
<li><p>You're on a legacy stack</p>
</li>
</ul>
<hr />
<p><strong>Quick comparison:</strong></p>
<div class="hn-table">
<table>
<thead>
<tr>
<td>Feature</td><td>Airflow</td><td>Kubeflow</td><td>Prefect</td><td>Dagster</td></tr>
</thead>
<tbody>
<tr>
<td><strong>Origin</strong></td><td>Data Eng</td><td>ML/K8s</td><td>Airflow 2.0</td><td>Data Platform</td></tr>
<tr>
<td><strong>Learning curve</strong></td><td>Medium</td><td>High</td><td>Low</td><td>Medium-High</td></tr>
<tr>
<td><strong>Kubernetes</strong></td><td>Optional</td><td>Required</td><td>Optional</td><td>Optional</td></tr>
<tr>
<td><strong>ML-specific</strong></td><td>🟡 Plugins</td><td>✅ Native</td><td>🟡 Possible</td><td>🟡 Possible</td></tr>
<tr>
<td><strong>DX</strong></td><td>🟡 Improved in 2.0</td><td>🔴 Complex</td><td>✅ Excellent</td><td>✅ Very good</td></tr>
<tr>
<td><strong>Maturity</strong></td><td>✅ Battle-tested</td><td>✅ Google-backed</td><td>🟡 Growing</td><td>🟡 Emerging</td></tr>
</tbody>
</table>
</div><hr />
<p><strong>Decision framework:</strong></p>
<p><strong>STEP 1: Do you already have one?</strong> → Yes? Use it until you <strong>really</strong> can't<br />→ No? Continue ↓</p>
<p><strong>STEP 2: What's your stack?</strong> → Kubernetes everywhere? → Kubeflow<br />→ Python + simple infra? → Prefect<br />→ Multiple data teams? → Dagster<br />→ Legacy + you need stability? → Airflow</p>
<p><strong>STEP 3: How complex is your ML?</strong> → ETLs + simple models? → Airflow/Prefect<br />→ Hyperparameter tuning + GPUs? → Kubeflow<br />→ End-to-end data platform? → Dagster</p>
<p><strong>STEP 4: What can your team maintain?</strong> → This is THE most important question</p>
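<p>The four steps can be sketched as a plain function. The branch conditions below mirror the framework; the function name and string labels are mine, and these are heuristics to adapt, not rules:</p>

```python
def pick_orchestrator(already_have=None, stack="", ml_complexity="simple",
                      team_knows=()):
    """Hypothetical helper encoding the 4-step decision framework."""
    if already_have:
        return already_have       # Step 1: use what you have until it hurts
    if "kubernetes" in stack and ml_complexity == "complex":
        return "Kubeflow"         # Step 3: hyperparameter tuning + GPUs
    if "data platform" in stack:
        return "Dagster"          # Step 2: multiple data teams / platform
    if "Airflow" in team_knows:
        return "Airflow"          # Step 4: what the team can maintain
    return "Prefect"              # default: Python team + simple infra
```

<p>Note how Step 4 (what the team knows) outranks the shiny default: that ordering is the whole point of the framework.</p>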
<hr />
<p><strong>My honest take (as a Platform Engineer):</strong></p>
<p><strong>In 2025-2026:</strong></p>
<ul>
<li><p>Airflow already running? → Stay there until it hurts</p>
</li>
<li><p>K8s-native stack + GPUs? → Kubeflow is the obvious choice</p>
</li>
<li><p>Greenfield project + Python team? → Prefect (better DX)</p>
</li>
<li><p>Building a data platform? → Dagster (but weigh the cost)</p>
</li>
</ul>
<p><strong>What I learned:</strong></p>
<p>Don't look for "the best orchestrator."</p>
<p>Look for the one that solves <strong>your specific problem</strong> with <strong>your current stack</strong> and that <strong>your team can maintain</strong>.</p>
<p>The wrong questions:</p>
<ul>
<li><p>"Which one is best?"</p>
</li>
<li><p>"What does Google use?"</p>
</li>
<li><p>"Which one has the most GitHub stars?"</p>
</li>
</ul>
<p>The right questions:</p>
<ul>
<li><p>"What problem am I solving?"</p>
</li>
<li><p>"What can my team maintain?"</p>
</li>
<li><p>"Which one integrates with my stack?"</p>
</li>
</ul>
<hr />
<p><strong>Back to my story:</strong></p>
<p>What did we choose?</p>
<p><strong>Airflow for ETLs, Kubeflow for ML.</strong></p>
<p>Is it ideal? No.<br />Does it work? Yes.<br />Would I do it differently today? Maybe Prefect.</p>
<p>But we already have Airflow running and the team knows it.</p>
<p><strong>Don't be the company with 4 different orchestrators.</strong></p>
<hr />
<p><strong>Tomorrow we'll talk about scheduling:</strong></p>
<p>You have your orchestrator. Cool.</p>
<p>Now: <strong>when do you run your pipelines?</strong></p>
<ul>
<li><p>Every day at 2am?</p>
</li>
<li><p>Every hour?</p>
</li>
<li><p>When new data arrives?</p>
</li>
</ul>
<p><strong>Batch jobs vs event-driven. And why it matters for ML.</strong></p>
<hr />
<p><strong>Your turn:</strong></p>
<ul>
<li><p>Which orchestrator do you use (or would like to use)?</p>
</li>
<li><p>How many different orchestrators does your company have? 😅</p>
</li>
</ul>
<p>(I've seen companies with 3+... don't be that company. Or worse: the one running Windows servers in production.)</p>
]]></content:encoded></item><item><title><![CDATA[Script vs Pipeline: Why Your Cronjob Isn't Enough]]></title><description><![CDATA[Yesterday I promised to explain why your Platform Engineering tools aren't enough for ML. But first, let's talk about what happens when someone from PE asks to automate something, or asks whether it's already automated. For some reason, the answer is:
Why? ... becau...]]></description><link>https://parraletz.dev/script-vs-pipeline-por-que-tu-cronjob-no-es-suficiente</link><guid isPermaLink="true">https://parraletz.dev/script-vs-pipeline-por-que-tu-cronjob-no-es-suficiente</guid><category><![CDATA[mlops]]></category><category><![CDATA[Platform Engineering ]]></category><category><![CDATA[Devops]]></category><category><![CDATA[AI]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Wed, 07 Jan 2026 06:00:30 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767447277167/c6b36fd0-5cfc-4dc6-b9b5-2518536fed37.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Yesterday I promised to explain why your Platform Engineering tools aren't enough for ML. But first, let's talk about what happens when someone from PE asks to automate something, or asks whether it's already automated. For some reason, the answer is:</p>
<p>Why? ... because I can and because I want to:</p>
<pre><code class="lang-bash"><span class="hljs-comment"># crontab</span>
20 4 * * * <span class="hljs-built_in">cd</span> /ml &amp;&amp; python train.py
</code></pre>
<p>"It's already automated," they tell me. Don't be like that; love your PEs. Some are still locked in their own world of "I'm in charge, it's my infra, it's my workflow"... but most of us are the good ones xD.</p>
<p><strong>A while back, a Data Scientist showed me their work (I love my DS colleagues so much, truly xD):</strong></p>
<pre><code class="lang-python"><span class="hljs-comment"># train.py</span>
<span class="hljs-keyword">import</span> joblib
<span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd
<span class="hljs-keyword">from</span> sklearn.ensemble <span class="hljs-keyword">import</span> RandomForestClassifier

<span class="hljs-comment"># Read data</span>
df = pd.read_csv(<span class="hljs-string">'data.csv'</span>)

<span class="hljs-comment"># Train</span>
model = RandomForestClassifier()
model.fit(df[features], df[<span class="hljs-string">'target'</span>])

<span class="hljs-comment"># Save</span>
joblib.dump(model, <span class="hljs-string">'model.pkl'</span>)
</code></pre>
<p>"Just run this in production," they said.</p>
<p>Naive me thought: <em>"Easy. I'll stick it in a container, add a cronjob, and done," right?</em></p>
<p><strong>3 months later:</strong></p>
<ul>
<li><p>The model degraded and nobody knows why</p>
</li>
<li><p>"data.csv" changed and broke everything</p>
</li>
<li><p>We don't know which model version is in production</p>
</li>
<li><p>We can't reproduce the results</p>
</li>
<li><p>Every retrain is manual, and you pray it works</p>
</li>
</ul>
<p><strong>The problem?</strong> A script is not a pipeline.</p>
<hr />
<p><strong>The fundamental difference:</strong></p>
<p><strong>🔴 Script (<code>train.py</code>):</strong></p>
<pre><code class="lang-python">python train.py
<span class="hljs-comment"># Everything in a single file</span>
<span class="hljs-comment"># No separation of concerns</span>
<span class="hljs-comment"># No traceability</span>
<span class="hljs-comment"># No error recovery</span>
</code></pre>
<p><strong>🟢 Orchestrated pipeline:</strong></p>
<pre><code class="lang-python"><span class="hljs-number">1.</span> INGEST    → Pulls versioned data
<span class="hljs-number">2.</span> VALIDATE  → Checks data quality
<span class="hljs-number">3.</span> PREPROCESS→ Transforms reproducibly
<span class="hljs-number">4.</span> TRAIN     → Trains with tracked hyperparameters
<span class="hljs-number">5.</span> EVALUATE  → Validates metrics vs baseline
<span class="hljs-number">6.</span> REGISTER  → Only if it passes the threshold
</code></pre>
<p><strong>Why does it matter?</strong></p>
<p>With a <strong>script</strong>:</p>
<ul>
<li><p>If it fails at minute 45, you lose everything</p>
</li>
<li><p>You don't know which data you used</p>
</li>
<li><p>You can't reproduce it</p>
</li>
<li><p>Debugging = praying and <code>print()</code></p>
</li>
</ul>
<p>With a <strong>pipeline</strong>:</p>
<ul>
<li><p>Each step is independent and reproducible</p>
</li>
<li><p>Step 3 fails? Restart from there</p>
</li>
<li><p>Full traceability: data, code, parameters</p>
</li>
<li><p>Debugging with structured logs from each stage</p>
</li>
</ul>
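<p>The resume-from-failure behavior is easy to see in miniature. A sketch, assuming a made-up <code>run_pipeline</code> helper (no real framework): each step checkpoints its output, so a failed run keeps the work of every step that already passed:</p>

```python
def run_pipeline(steps, checkpoints=None):
    """Run (name, fn) steps in order, skipping ones already checkpointed."""
    checkpoints = dict(checkpoints or {})
    data = None
    for name, fn in steps:
        if name in checkpoints:       # finished in a previous run: skip it
            data = checkpoints[name]
            continue
        data = fn(data)               # may raise; earlier checkpoints survive
        checkpoints[name] = data
    return data, checkpoints

# Illustrative steps standing in for INGEST -> VALIDATE -> TRAIN
steps = [
    ("ingest",   lambda _: [1, 2, 3]),
    ("validate", lambda d: d),
    ("train",    lambda d: sum(d)),
]
```

<p>If "train" raises, a retry with the saved checkpoints re-runs only "train", not the whole script.</p>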
<hr />
<p><strong>The day I learned this:</strong></p>
<p>A model started failing in production.</p>
<p>With the script: "I don't know what changed; retrain everything and cross your fingers."</p>
<p>With the pipeline: "Ah, the input data failed validation. The previous model keeps serving. I fix the data and re-run only from INGEST."</p>
<p><strong>That's the difference between firefighting and engineering.</strong></p>
<hr />
<p><strong>Tomorrow I'll tell you:</strong> what makes a pipeline truly a pipeline (spoiler: it's not just splitting your script into functions)</p>
<p>And I'll introduce the orchestrators: Airflow, Kubeflow, Prefect, Dagster.</p>
<p>Which one should you use? It depends. Tomorrow we'll see why.</p>
<hr />
<p><strong>A question for you:</strong> does your ML team have scripts or pipelines? (Hint: if you don't know for sure, they're probably scripts.)</p>
]]></content:encoded></item><item><title><![CDATA[*Ops to MLOps]]></title><description><![CDATA[A while back, when some of my data or AI colleagues (as I knew them then) would come over asking for infrastructure or for "deployments outside the standard," they always challenged the platform team with their requirem...]]></description><link>https://parraletz.dev/ops-to-mlops</link><guid isPermaLink="true">https://parraletz.dev/ops-to-mlops</guid><category><![CDATA[mlops]]></category><category><![CDATA[Platform Engineering ]]></category><dc:creator><![CDATA[Alex Parra (Parraletz)]]></dc:creator><pubDate>Tue, 06 Jan 2026 06:00:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1767443931934/b7b1cd21-41fa-48b2-a88d-0c4de159fce0.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>A while back, when some of my data or AI colleagues (as I knew them then) would come over asking for infrastructure or for "deployments outside the standard," they always challenged the platform team with their requirements, and honestly, that was fun</strong></p>
<hr />
<p>Today is January 6th, and like every year, the Three Wise Men bring gifts.</p>
<p>But mine didn't come wrapped in golden paper.</p>
<p>A while back, I was that Platform Engineer who:</p>
<ul>
<li><p>Built solid infrastructure</p>
</li>
<li><p>Automated CI/CD pipelines</p>
</li>
<li><p>Didn't understand why Data Scientists needed "weird stuff"</p>
</li>
</ul>
<p>The Data Scientists would show up with their requirements:</p>
<ul>
<li><p>"I need to version 500GB datasets"</p>
</li>
<li><p>"Why can't I reproduce this experiment?"</p>
</li>
<li><p>"This model worked yesterday and today it doesn't"</p>
</li>
</ul>
<p>And I thought: <em>"Why don't they just use Git like normal people?"</em></p>
<p><strong>Spoiler: ML is not traditional software. And neither are ML platforms.</strong></p>
<p>Until I discovered that <strong>MLOps is not DevOps under another name</strong>. It's 3 pillars that transform how you build platforms for ML teams:</p>
<h3 id="heading-regalo-1-oro-entender-el-problema-real"><strong>Gift 1: GOLD - Understanding the real problem</strong></h3>
<p>Why ML is different from traditional software. Why your CI/CD pipelines aren't enough. What ML teams really need.</p>
<h3 id="heading-regalo-2-incienso-infraestructura-para-datos-y-experimentos"><strong>Gift 2: FRANKINCENSE - Infrastructure for data and experiments</strong></h3>
<p>Not just versioning code: versioning datasets, tracking thousands of experiments, guaranteeing reproducibility. DVC, LakeFS, MLflow.</p>
<h3 id="heading-regalo-3-mirra-orquestacion-que-entiende-ml"><strong>Gift 3: MYRRH - Orchestration that understands ML</strong></h3>
<p>Airflow is fine for ETLs. But training pipelines? Model serving? A/B testing of models? You need other tools.</p>
<hr />
<p><strong>Over the next 3 weeks I'm going to share these 3 gifts with you.</strong></p>
<p>From the perspective of someone who comes from <strong>platforms and infrastructure</strong>, not Data Science.</p>
<p>📅 Each week = 1 gift<br />📝 5 daily posts (Mon-Fri) with practical content</p>
<p><strong>Week 1 (Jan 7-11):</strong> why ML breaks your traditional-software assumptions<br /><strong>Week 2 (Jan 13-18):</strong> infrastructure for versioning and tracking<br /><strong>Week 3 (Jan 20-25):</strong> orchestration and automation for ML</p>
<p><strong>The goal:</strong> that you build platforms your Data Scientists actually need.</p>
<p>No more:</p>
<ul>
<li><p>"We use Kubernetes" (but what for?)</p>
</li>
<li><p>"We have CI/CD" (does it version your data?)</p>
</li>
<li><p>"We deploy models" (do you monitor drift?)</p>
</li>
</ul>
<p>From zero to MLOps. From the perspective of the one who <strong>builds the platform</strong>.</p>
<p>Joining the journey? 🚀</p>
<p>See you tomorrow with the first post: "Why Git Is Not Enough for ML"</p>
<hr />
]]></content:encoded></item></channel></rss>