Connecting the worlds - Platform Engineering meets ML

Terraform, Helm, GitOps... for training models? Yes.

Today: how to manage all of that as code.

The problem:

You have your Platform Engineering stack:

Terraform for infrastructure
Helm for applications on K8s
ArgoCD for GitOps
CI/CD pipelines

And now ML enters the game.

First reaction: "It's just another application, right?"

Spoiler: No.

What I discovered:

ML is not just another application. It is a complete ecosystem that needs:

Traditional infrastructure:

Compute (K8s clusters, VMs)
Storage (S3, persistent volumes)
Networking (load balancers, ingress)

BUT ALSO ML-specific infrastructure:

Feature stores
Model registries
Experiment tracking
Data versioning
Training pipelines
Serving infrastructure

And all of it must be code. Reproducible. Versioned. With GitOps.

The 4 levels of IaC for MLOps:

Terraform is the clear front-runner, though AWS CDK and Pulumi are worthy competitors.

Level 1: Base infrastructure (Foundational)

# terraform/main.tf

# Cluster for ML workloads
resource "aws_eks_cluster" "ml_cluster" {
  name     = "ml-${var.environment}"
  role_arn = aws_iam_role.ml_cluster.arn

  vpc_config {
    subnet_ids = var.private_subnets
  }
}

# Node group with GPUs
resource "aws_eks_node_group" "gpu_nodes" {
  cluster_name    = aws_eks_cluster.ml_cluster.name
  node_group_name = "gpu-nodes"
  node_role_arn   = aws_iam_role.ml_nodes.arn  # Required; role assumed to be defined elsewhere
  subnet_ids      = var.private_subnets        # Required

  instance_types = ["p3.2xlarge"]

  scaling_config {
    desired_size = 2
    max_size     = 10
    min_size     = 0  # Scale to zero when not in use
  }
}

# Storage for datasets
resource "aws_s3_bucket" "ml_data" {
  bucket = "ml-data-${var.environment}"
}

# In AWS provider v4+ versioning is configured in its own resource
resource "aws_s3_bucket_versioning" "ml_data" {
  bucket = aws_s3_bucket.ml_data.id

  versioning_configuration {
    status = "Enabled"  # Important for data versioning
  }
}

This is familiar. It's plain Terraform.

Level 2: ML Platform components

yaml

# helm/values.yaml

# MLflow for experiment tracking
mlflow:
  enabled: true
  backend_store: postgresql
  artifact_store: s3://ml-artifacts-prod

# Kubeflow Pipelines (if you use it)
kubeflow:
  enabled: true
  pipelines:
    persistence:
      storageClass: gp3
      size: 100Gi

# Model serving (Seldon, KServe, etc.)
seldon:
  enabled: true
  replicas: 3
  resources:
    limits:
      nvidia.com/gpu: 1

This is where Platform Engineering and ML meet.
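
A values file like the one above usually ends up layered: a base file plus a small override per environment, merged map by map. Here is a minimal Python sketch of that merge behaviour. The `deep_merge` helper and the staging override values are hypothetical and only illustrate the idea; Helm's real merge has extra rules (for example, null values remove keys) that this sketch ignores.

```python
from copy import deepcopy


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


base_values = {
    "mlflow": {"enabled": True, "artifact_store": "s3://ml-artifacts-prod"},
    "seldon": {"enabled": True, "replicas": 3},
}

# Hypothetical staging override: fewer serving replicas, separate artifact bucket.
staging_override = {
    "mlflow": {"artifact_store": "s3://ml-artifacts-staging"},
    "seldon": {"replicas": 1},
}

staging_values = deep_merge(base_values, staging_override)
```

The base dictionary is left untouched, so the same base can be reused for every environment.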

Level 3: Training pipelines as code

yaml

# pipelines/training-pipeline.yaml

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: training-pipeline
spec:
  entrypoint: train-model

  arguments:
    parameters:
    - name: dataset-version
      value: "v2.3.1"
    - name: environment
      value: "staging"

  templates:
  - name: train-model
    steps:
    - - name: ingest-data
        template: ingest
    - - name: validate-data
        template: validate
    - - name: train
        template: training
    - - name: evaluate
        template: evaluation
    - - name: register
        template: model-registration

  - name: ingest
    container:
      image: ml-pipeline/ingest:{{workflow.parameters.dataset-version}}
      command: [python, ingest.py]
      env:
      - name: DATASET_VERSION
        value: "{{workflow.parameters.dataset-version}}"
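      - name: S3_BUCKET
        value: "ml-data-{{workflow.parameters.environment}}"

  # The validate, training, evaluation, and model-registration templates
  # are omitted here; each would follow the same shape as ingest.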

This is where everything connects: infra + pipelines + GitOps.
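
The workflow hands `DATASET_VERSION` and `S3_BUCKET` to the container as environment variables, so the script on the other side has to read and validate them. The article does not show `ingest.py`, so what follows is only a sketch of how its configuration loading could look. The `IngestConfig` class, the `load_config` function, and the `datasets/<version>/` prefix layout are all assumptions made for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class IngestConfig:
    dataset_version: str
    bucket: str

    @property
    def dataset_uri(self) -> str:
        # Hypothetical layout: one prefix per dataset version.
        return f"s3://{self.bucket}/datasets/{self.dataset_version}/"


def load_config(env=os.environ) -> IngestConfig:
    """Read the variables the workflow injects; fail fast if one is missing."""
    missing = [name for name in ("DATASET_VERSION", "S3_BUCKET") if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return IngestConfig(env["DATASET_VERSION"], env["S3_BUCKET"])
```

Failing fast on a missing variable keeps a misconfigured workflow from silently ingesting the wrong dataset.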

Level 4: GitOps for ML

And as you know, around here we are GitOps friendly.

# argocd/ml-application.yaml

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ml-platform
  namespace: argocd
spec:
  project: default

  source:
    repoURL: https://github.com/company/ml-platform
    targetRevision: main
    path: k8s/

    helm:
      values: |
        environment: staging

        mlflow:
          enabled: true
          version: "2.9.0"

        training_pipelines:
          schedule: "0 2 * * 0"  # Weekly
          resources:
            gpu: true
            memory: 32Gi

  destination:
    server: https://kubernetes.default.svc
    namespace: ml-staging

  syncPolicy:
    automated:
      prune: true
      selfHeal: true

Commit → Push → ArgoCD sync → Pipeline deployed.

Just like your other applications. But for ML.
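
The `schedule: "0 2 * * 0"` value in the manifest above is standard cron syntax: minute 0, hour 2, any day of month, any month, weekday 0 (Sunday). A small sketch of how the next run time falls out of that, assuming a hypothetical `next_weekly_run` helper and ignoring time zones (the effective time zone depends on how the scheduler is configured):

```python
from datetime import datetime, timedelta


def next_weekly_run(now: datetime, cron_weekday: int = 0, hour: int = 2, minute: int = 0) -> datetime:
    """Next firing time of a weekly cron entry such as "0 2 * * 0".

    Cron counts Sunday as 0; Python's datetime.weekday() counts Monday as 0.
    """
    python_weekday = (cron_weekday - 1) % 7  # cron 0 (Sunday) -> Python 6
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(python_weekday - now.weekday()) % 7)
    if candidate <= now:
        # Already past this week's slot: move to next week.
        candidate += timedelta(days=7)
    return candidate
```

Called with a Wednesday noon timestamp, it returns 02:00 on the following Sunday.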

The complete integration:

┌─────────────────────────────────────────────────┐
│              Git Repository                     │
│                                                 │
│  terraform/        → Infra base                 │
│  helm/            → ML platform                 │
│  pipelines/       → Training workflows          │
│  models/          → Model definitions           │
└─────────────┬───────────────────────────────────┘
              │
              │ git push
              │
     ┌────────▼────────┐
     │   CI Pipeline   │
     │                 │
     │  - Validate     │
     │  - Test         │
     │  - Build images │
     └────────┬────────┘
              │
              │ if tests pass
              │
     ┌────────▼────────┐
     │    ArgoCD       │
     │                 │
     │  - Sync infra   │
     │  - Deploy ML    │
     │  - Trigger      │
     └────────┬────────┘
              │
              │
     ┌────────▼─────────────────────────────┐
     │         Kubernetes                   │
     │                                      │
     │  - ML Platform running               │
     │  - Pipelines scheduled               │
     │  - Models serving                    │
     └──────────────────────────────────────┘
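
In the flow above, what a push triggers depends on which top-level directory changed. A minimal sketch of that routing, using the directory names from the diagram; the `affected_layers` function is a hypothetical helper, not part of any tool mentioned here.

```python
from pathlib import PurePosixPath

# Top-level directories from the repository layout and what each one drives.
LAYERS = {
    "terraform": "infra base",
    "helm": "ML platform",
    "pipelines": "training workflows",
    "models": "model definitions",
}


def affected_layers(changed_files: list[str]) -> set[str]:
    """Map changed file paths to the layers that need validation or sync."""
    layers = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if parts and parts[0] in LAYERS:
            layers.add(LAYERS[parts[0]])
    return layers
```

A CI job could feed it the output of `git diff --name-only` and skip the stages whose layer did not change.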

Real example: deploying a new model

Before (manual):

bash

# 1. SSH into the cluster
ssh ml-cluster-prod

# 2. Update manually
kubectl apply -f model-deployment.yaml

# 3. Wait and pray
kubectl get pods -w

# 4. If it fails, roll back manually
kubectl rollout undo deployment/model-server

Now (GitOps):

bash

# 1. Update in Git
cat > k8s/model-deployment.yaml <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: model-server
spec:
  selector:  # Required by apps/v1; the label value is illustrative
    matchLabels:
      app: model-server
  template:
    metadata:
      labels:
        app: model-server
    spec:
      containers:
      - name: model
        image: models/fraud-detection:v2.3.1  # Change here
EOF

# 2. Commit and push
git add k8s/model-deployment.yaml
git commit -m "Deploy model v2.3.1 to staging"
git push

# 3. ArgoCD does the rest
# - Detects the change
# - Validates
# - Canary deploy (needs a progressive-delivery controller such as Argo Rollouts)
# - Automatic rollback if it fails
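
Rewriting the whole manifest with a heredoc is fine for a demo, but in practice the only thing that changes per release is the image tag. A sketch of a tag bump that edits just that line, assuming a hypothetical `set_image_tag` helper working on the manifest text:

```python
import re


def set_image_tag(manifest: str, image: str, new_tag: str) -> str:
    """Return the manifest text with the tag of `image` replaced by `new_tag`."""
    pattern = re.compile(rf"(image:\s*{re.escape(image)}):[\w.\-]+")
    updated, count = pattern.subn(lambda m: f"{m.group(1)}:{new_tag}", manifest)
    if count == 0:
        # Refuse to commit a no-op change that looks like a deploy.
        raise ValueError(f"image {image!r} not found in manifest")
    return updated
```

The rest of the file, including comments and indentation, is left as it was, which keeps the Git diff down to one line.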

The key components:

Terraform for infrastructure:

hcl

module "ml_platform" {
  source = "./modules/ml-platform"

  environment = var.environment

  # Compute
  gpu_nodes     = 4
  cpu_nodes     = 10

  # Storage
  data_bucket   = "ml-data-${var.environment}"
  model_bucket  = "ml-models-${var.environment}"

  # Networking
  vpc_id        = module.vpc.vpc_id

  # ML-specific
  mlflow_backend    = module.rds.endpoint
  feature_store     = "enabled"
  model_registry    = "enabled"
}

Helm for ML components:

yaml

# Chart.yaml
dependencies:
- name: mlflow
  version: "0.7.0"
  repository: https://charts.mlflow.org

- name: kubeflow-pipelines
  version: "2.0.0"
  repository: https://kubeflow.github.io/pipelines

- name: seldon-core
  version: "1.15.0"
  repository: https://storage.googleapis.com/seldon-charts

ArgoCD for synchronization:

yaml

# Sync policy for ML workloads
syncPolicy:
  automated:
    prune: true
    selfHeal: true

  syncOptions:
  - CreateNamespace=true

  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m
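
To get a feel for what the retry block means in wall-clock time, here is a small sketch that lists the delays. It assumes the delay before the n-th retry is `duration * factor**n`, capped at `maxDuration`, which is my reading of the settings above rather than a statement about ArgoCD internals.

```python
def backoff_delays(limit: int, duration_s: float, factor: float, max_duration_s: float) -> list[float]:
    """Delay before each retry: duration * factor**n, capped at max_duration."""
    return [min(duration_s * factor**n, max_duration_s) for n in range(limit)]


# Values from the sync policy above: 5 retries, 5s base, factor 2, 3m cap.
delays = backoff_delays(limit=5, duration_s=5, factor=2, max_duration_s=180)
total_wait = sum(delays)
```

Under that assumption the five retries wait 5, 10, 20, 40 and 80 seconds, so the 3-minute cap only starts to matter if the limit is raised.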

CI/CD for pipelines:

# .github/workflows/ml-pipeline.yaml
name: ML Pipeline CI

on:
  push:
    paths:
    - 'pipelines/**'
    - 'models/**'

jobs:
  validate:
    runs-on: ubuntu-latest
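    steps:
    - name: Check out the repository
      uses: actions/checkout@v4

    - name: Validate pipeline syntax
      run: |
        argo lint pipelines/training-pipeline.yaml

    - name: Test with sample data
      run: |
        pytest tests/pipeline_test.py

    - name: Build container images
      run: |
        # The "." build context is an assumption; point it at the directory with the ingest Dockerfile
        docker build -t ml-pipeline/ingest:${{ github.sha }} .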
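
The article does not show what `tests/pipeline_test.py` contains. One cheap check that fits there is verifying that every step points at a template that actually exists. The sketch below assumes the workflow manifest has already been parsed into a dictionary (for example with a YAML library); `undefined_templates` is a hypothetical helper.

```python
def undefined_templates(workflow: dict) -> set[str]:
    """Return template names referenced by steps but never defined."""
    templates = workflow["spec"]["templates"]
    defined = {t["name"] for t in templates}
    referenced = {
        step["template"]
        for t in templates
        for group in t.get("steps", [])  # each group is a list of parallel steps
        for step in group
    }
    return referenced - defined


# Trimmed-down version of the training workflow, with one template missing on purpose.
example_workflow = {
    "spec": {
        "entrypoint": "train-model",
        "templates": [
            {
                "name": "train-model",
                "steps": [
                    [{"name": "ingest-data", "template": "ingest"}],
                    [{"name": "train", "template": "training"}],
                ],
            },
            {"name": "ingest", "container": {"image": "ml-pipeline/ingest"}},
        ],
    }
}

missing = undefined_templates(example_workflow)
```

A test would simply assert that the returned set is empty for every workflow in `pipelines/`.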

The specific challenges of IaC for ML:

CHALLENGE 1: State management for models

Traditional software:
- Deploy the new version
- The old version is removed

ML:
- Deploy model v2
- Model v1 keeps serving (A/B testing)
- Model v0 on rollback standby
- How do you manage state?

Solution:

hcl

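# Terraform state for multiple versions
resource "kubernetes_deployment" "model" {
  for_each = var.model_versions

  metadata {
    name = "model-${each.key}"
  }

  spec {
    # Versions receiving traffic run 2 replicas; standby versions scale to 0
    replicas = each.value.traffic_percentage > 0 ? 2 : 0

    # selector and template blocks omitted for brevity
  }
}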
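
The replica rule in that resource is easy to reason about outside Terraform. Below is a sketch of the same logic in Python, with an added sanity check that the traffic split adds up. The `plan_replicas` function and the 90/10/0 split are hypothetical, and the input is simplified to a version-to-percentage map instead of the object map used by `var.model_versions`.

```python
def plan_replicas(model_versions: dict[str, int], serving_replicas: int = 2) -> dict[str, int]:
    """Mirror the Terraform expression: versions with traffic get replicas, others scale to 0."""
    total = sum(model_versions.values())
    if total != 100:
        raise ValueError(f"traffic percentages must sum to 100, got {total}")
    return {
        f"model-{version}": serving_replicas if pct > 0 else 0
        for version, pct in model_versions.items()
    }


# Hypothetical split: v2 takes most traffic, v1 stays in the A/B test, v0 is on standby.
plan = plan_replicas({"v2": 90, "v1": 10, "v0": 0})
```

The standby version keeps its deployment object, so a rollback is a change to the traffic map rather than a new deploy.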

CHALLENGE 2: Dynamic resources

The training pipeline needs:
- 8 GPUs for 4 hours
- Then scale to 0

Serving needs:
- 2 GPUs 24/7
- Auto-scaling based on requests

Solution:

yaml

# Karpenter for smart auto-scaling
# Note: karpenter.sh/v1alpha5 Provisioner is the legacy API; newer Karpenter releases use NodePool
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: gpu-training
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["p3.2xlarge", "p3.8xlarge"]

  ttlSecondsAfterEmpty: 300  # Terminate when unused

  limits:
    resources:
      nvidia.com/gpu: 32  # Max GPUs
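
The numbers in that manifest interact: the 8-GPU training job has to fit in the allowed instance types without pushing the total past the 32-GPU limit. A sketch of that arithmetic follows. It is plain capacity math, not Karpenter's scheduling algorithm, and `nodes_needed` is a hypothetical helper; the GPU counts are 1 per p3.2xlarge and 4 per p3.8xlarge.

```python
import math

# GPUs per instance type for the two P3 sizes allowed by the provisioner.
GPUS_PER_INSTANCE = {"p3.2xlarge": 1, "p3.8xlarge": 4}
GPU_LIMIT = 32  # matches limits.resources in the manifest above


def nodes_needed(gpus_requested: int, instance_type: str, gpus_in_use: int = 0) -> int:
    """Nodes required for a job, or an error if it would exceed the GPU limit."""
    per_node = GPUS_PER_INSTANCE[instance_type]
    nodes = math.ceil(gpus_requested / per_node)
    if gpus_in_use + nodes * per_node > GPU_LIMIT:
        raise RuntimeError("request would exceed the provisioner GPU limit")
    return nodes
```

For the 8-GPU job from the example, that means two p3.8xlarge nodes or eight p3.2xlarge nodes.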

CHALLENGE 3: Data versioning in IaC

Code: Git
Containers: Registry
Infrastructure: Terraform state

And giant datasets?

Solution:

yaml

# DVC as part of the pipeline
apiVersion: v1
kind: ConfigMap
metadata:
  name: dvc-config
data:
  # ConfigMap keys cannot contain "/"; mount this key at .dvc/config
  config: |
    [core]
        remote = s3
    ['remote "s3"']
        url = s3://ml-data-prod/dvc-cache

# Referenced in the pipeline (template fragment)
- name: fetch-data
  container:
    image: dvc:latest
    command: [dvc, pull, data/training.dvc]
    env:
    - name: DVC_REMOTE
      value: s3
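
The reason a file like `data/training.dvc` can live in Git is that it is only a pointer: the large blob is stored in the remote under a name derived from its content hash. A minimal sketch of that idea follows. The `pointer_for` function and the path layout are simplified assumptions for illustration, not DVC's exact on-disk format.

```python
import hashlib


def pointer_for(data: bytes, remote: str = "s3://ml-data-prod/dvc-cache") -> dict:
    """Build a small, Git-friendly pointer to a large blob stored by content hash."""
    digest = hashlib.md5(data).hexdigest()
    return {
        "md5": digest,
        "size": len(data),
        # Simplified layout: first two hex characters as a directory prefix.
        "remote_path": f"{remote}/{digest[:2]}/{digest[2:]}",
    }


pointer = pointer_for(b"hello")
```

Two datasets with identical bytes get the same pointer, and any change to the data produces a new hash, which is what makes a Git commit enough to pin a dataset version.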

Recommended full stack:

Infrastructure Layer:
├── Terraform         → Cloud resources (K8s, S3, RDS)
├── Helm             → ML platform components
└── ArgoCD           → GitOps sync

ML Platform Layer:
├── MLflow           → Experiment tracking
├── DVC              → Data versioning
├── Kubeflow/Airflow → Pipeline orchestration
└── Seldon/KServe    → Model serving

Pipeline Layer:
├── Argo Workflows   → Training execution
├── GitHub Actions   → CI/CD
└── Prometheus       → Monitoring

Data Layer:
├── S3/GCS           → Data lake
├── PostgreSQL       → Metadata
└── Feature Store    → Feature management

My actual workflow:

1. Define infra (Terraform):

bash

cd terraform/environments/staging
terraform plan
terraform apply

2. Deploy ML platform (Helm + ArgoCD):

bash

git add helm/ml-platform/
git commit -m "Deploy MLflow + Kubeflow to staging"
git push  # ArgoCD picks it up

3. Define pipeline (YAML):

bash

git add pipelines/training-v2.yaml
git commit -m "Update training pipeline"
git push  # ArgoCD deploys

4. Trigger training:

bash

# Scheduled (automatic via CronWorkflow)
# Or manually:
argo submit pipelines/training-v2.yaml \
  --parameter dataset-version=v2.3.1
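
When the manual trigger is wrapped in a script, building the argument list in code avoids quoting mistakes. The sketch below assumes a hypothetical `argo_submit_command` helper; it only assembles the same `argo submit ... --parameter name=value` invocation shown above and does not run it.

```python
import shlex


def argo_submit_command(workflow_file: str, parameters: dict[str, str]) -> list[str]:
    """Build the argv list for `argo submit` with one --parameter flag per entry."""
    command = ["argo", "submit", workflow_file]
    for name, value in parameters.items():
        command += ["--parameter", f"{name}={value}"]
    return command


command = argo_submit_command("pipelines/training-v2.yaml", {"dataset-version": "v2.3.1"})
print(shlex.join(command))  # shell-safe rendering, handy for logs
```

The list can be passed to `subprocess.run(command, check=True)` on a machine where the `argo` CLI is installed and configured.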

5. Deploy model (GitOps):

bash

git add k8s/model-server.yaml
git commit -m "Deploy model v2.3.1"
git push  # ArgoCD deploys with canary

Everything as code. Everything in Git. Everything reproducible.

The mistakes I've seen:

MISTAKE 1: "Manual infra, automated pipelines"

Result: Automated pipelines running on manual infra
Someone deletes the cluster by mistake
Pipelines fail, and nobody knows how to rebuild it

MISTAKE 2: "Terraform for everything, even models"

Result: Model registry in Terraform state
Deploying a new model = terraform apply (slow)
Rollback = terraform plan/apply (very slow)
It's not the right tool for this

MISTAKE 3: "GitOps without CI"

Result: Direct pushes to main without validation
A pipeline syntax error gets deployed to prod
The cluster crashes

MISTAKE 4: "IaC only in prod"

Result: Dev and staging are snowflakes
"Works on my machine", at the infrastructure level
Debugging is impossible

My pragmatic recommendation:

Phase 1 - Infra as Code:

- Terraform for the base cluster
- Manual deploy of the ML platform
- Scripts for pipelines

Phase 2 - Platform as Code:

- Helm for ML components
- CI/CD for validation
- Semi-automated deploys

Phase 3 - Everything as Code:

- Full GitOps with ArgoCD
- Pipelines as YAML in Git
- Automated canary deploys

Start at Phase 1. Evolve when you need to scale.
