Why Git Isn't Enough for ML

The typical setup I see:

```
ml-project/
├── .git/
├── src/
│   └── train.py
├── data/
│   └── training.csv  # 5GB, in Git
└── models/
    └── model.pkl     # 2GB, in Git
```

First question: how is this even in Git?

The usual answer:

```bash
# .gitignore
data/
models/

# So... how do you share the data?
# "We keep it in S3"

# Versioned?
# "...no"
```

**The fundamental problem:**

**Code is easy:**

```python
# Version 1
model = Model(lr=0.01)

# Version 2
model = Model(lr=0.001)

# Git diff:
# - model = Model(lr=0.01)
# + model = Model(lr=0.001)

# Clear, concise, mergeable
```

**Data is not:**

```
training.parquet v1: 50GB
training.parquet v2: 55GB

# Git diff:
Binary files differ

# What changed?
# - 5GB of new rows?
# - Were existing rows modified?
# - Did the schema change?
# Who knows?
```
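To make those three questions concrete, here is a minimal sketch of what a data-aware diff could report instead of "Binary files differ". It assumes both versions fit in memory as pandas DataFrames and share a unique key column; `summarize_changes` and the toy column names are made up for illustration, and real tools do this far more efficiently.

```python
import pandas as pd


def summarize_changes(old: pd.DataFrame, new: pd.DataFrame, key: str) -> dict:
    """Summarize row-level and schema-level changes between two dataset versions.

    Assumes `key` is unique within each version. NaN values compare as
    different, so rows containing NaN may be counted as modified.
    """
    old_keys = set(old[key])
    new_keys = set(new[key])
    shared_cols = [c for c in old.columns if c in new.columns and c != key]

    # Compare rows present in both versions, on the columns they share.
    common = sorted(old_keys & new_keys)
    old_common = old.set_index(key).loc[common, shared_cols]
    new_common = new.set_index(key).loc[common, shared_cols]
    modified = int((old_common != new_common).any(axis=1).sum())

    return {
        "rows_added": len(new_keys - old_keys),
        "rows_removed": len(old_keys - new_keys),
        "rows_modified": modified,
        "columns_added": sorted(set(new.columns) - set(old.columns)),
        "columns_removed": sorted(set(old.columns) - set(new.columns)),
    }


# Toy data standing in for training.parquet v1 and v2.
v1 = pd.DataFrame({"user_id": [1, 2, 3], "feature_1": [0.1, 0.2, 0.3]})
v2 = pd.DataFrame(
    {"user_id": [1, 2, 4], "feature_1": [0.1, 0.9, 0.4], "feature_2": ["A", "B", "C"]}
)
summary = summarize_changes(v1, v2, key="user_id")
print(summary)
```

Even this crude summary answers "new rows, modified rows, or new schema?", which is exactly what Git cannot tell you for a binary file.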

**The 5 problems of versioning data:**

**PROBLEM 1: Size**

```bash
git add data/training.parquet  # 50GB

# On push, the host rejects it (message paraphrased):
# error: File data/training.parquet is 50GB
# hint: Use Git LFS for files larger than 100MB

git lfs install
git lfs track "*.parquet"
git add data/training.parquet
git push

# GitHub/GitLab (paraphrased; the exact cap depends on host and plan):
# error: File exceeds 5GB limit
# Even with LFS
```
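If you want to catch this before the push fails, a small pre-commit style check is enough. The sketch below walks a directory and lists files above a size threshold; `find_large_files` is a hypothetical helper, and the demo uses a tiny limit on a throwaway directory so it runs instantly.

```python
import tempfile
from pathlib import Path

LIMIT_BYTES = 100 * 1024 * 1024  # the 100MB threshold mentioned above


def find_large_files(root: Path, limit: int = LIMIT_BYTES) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for files larger than `limit`, skipping .git/."""
    large = []
    for path in sorted(root.rglob("*")):
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > limit:
            large.append((path.relative_to(root).as_posix(), size))
    return large


# Demo on a temporary directory with a 1KB limit instead of 100MB.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "training.csv").write_bytes(b"x" * 2048)
    (root / "train.py").write_bytes(b"print('hi')\n")
    oversized = find_large_files(root, limit=1024)
    print(oversized)
```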

**PROBLEM 2: Performance**

```bash
git clone repo

# With a 50GB dataset in Git:
# Clone time: 45 minutes
# Disk usage: 100GB (working copy + .git)
# CI/CD: times out after 1 hour
```

**PROBLEM 3: Not text-based (and even text diffs don't help)**

```bash
git diff data/training.csv

# CSV with 10M rows:
+row_9874561,value1,value2,...
+row_9874562,value1,value2,...
-row_9874563,old_value1,old_value2,...
...
# 10 million line diff
# Useless
```

**PROBLEM 4: Merge conflicts**

```bash
# Branch A: adds 1M rows
# Branch B: adds 1M different rows

git merge branch-b

CONFLICT in data/training.csv
# How do you resolve this?
# By hand, in a 15M-row CSV?
```
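What you actually want here is a merge that understands rows, not lines. As a rough sketch, assuming each row has a unique key and both branches only append rows (no edits or deletions of existing ones), the merge is a union by key. `merge_branches` and the toy frames are illustrative, not how any particular tool does it.

```python
import pandas as pd


def merge_branches(base: pd.DataFrame, a: pd.DataFrame, b: pd.DataFrame, key: str) -> pd.DataFrame:
    """Union the rows each branch added to `base`, failing on contradictory rows.

    Assumes all three frames share the same columns and that branches only append.
    """
    added_a = a[~a[key].isin(base[key])]
    added_b = b[~b[key].isin(base[key])]

    # The same new key on both sides is a real conflict unless the rows agree.
    both = added_a.merge(added_b, on=key, suffixes=("_a", "_b"))
    for col in (c for c in base.columns if c != key):
        clash = both[both[f"{col}_a"] != both[f"{col}_b"]]
        if not clash.empty:
            raise ValueError(f"conflicting values for keys {clash[key].tolist()} in column {col!r}")

    merged = pd.concat([base, added_a, added_b], ignore_index=True)
    return merged.drop_duplicates(subset=key).sort_values(key).reset_index(drop=True)


base = pd.DataFrame({"row_id": [1, 2], "value": [10, 20]})
branch_a = pd.concat([base, pd.DataFrame({"row_id": [3], "value": [30]})], ignore_index=True)
branch_b = pd.concat([base, pd.DataFrame({"row_id": [4], "value": [40]})], ignore_index=True)

merged = merge_branches(base, branch_a, branch_b, key="row_id")
print(merged)
```

Git's line-based merge has no notion of a row key, which is why two branches appending to the same file collide even when the data itself does not conflict.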

**PROBLEM 5: Cost**

```bash
# Git LFS on GitHub
# - Free: 1GB storage, 1GB bandwidth
# - Paid: $5/month per 50GB of storage

# 500GB dataset:
# $50/month for storage alone
# + bandwidth costs

# S3 (Standard, about $0.023 per GB-month):
# $11.50/month for 500GB
# Much cheaper
```
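The arithmetic behind those numbers, as a small sketch. It uses only the prices quoted above (which may be out of date by the time you read this) and ignores bandwidth, request costs, and free tiers.

```python
import math

DATASET_GB = 500
LFS_PACK_GB = 50             # storage per paid pack, as quoted above
LFS_PACK_USD = 5.00          # monthly price per pack, as quoted above
S3_USD_PER_GB = 11.50 / 500  # implied by the S3 figure above: $0.023 per GB-month

# LFS storage is bought in whole packs, so round up.
lfs_packs = math.ceil(DATASET_GB / LFS_PACK_GB)
lfs_monthly = lfs_packs * LFS_PACK_USD
s3_monthly = DATASET_GB * S3_USD_PER_GB

print(f"Git LFS: {lfs_packs} packs -> ${lfs_monthly:.2f}/month")
print(f"S3:      ${s3_monthly:.2f}/month")
print(f"Ratio:   {lfs_monthly / s3_monthly:.1f}x")
```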

---

**Why data is different from code:**

| Characteristic | Code | Data |
|----------------|------|------|
| Size | KB-MB | GB-TB |
| Format | Text | Binary (parquet, tfrecord) |
| Changes | Specific lines | Rows/distributions |
| Merge | Possible | Impractical |
| Diff | Meaningful | "Binary files differ" |
| Storage | Git is fine | Git doesn't work |
| Versioning | Commits | Checksums/hashes |

---

**What you actually need:**

Not just versioned files. Versioned **context**:

**1. The dataset itself**
```
training.parquet
  - Size: 50GB
  - Hash: a1b2c3d4e5f6
  - Location: s3://data/training/v2.3.1/
```

**2. Metadata**
```yaml
dataset: training-v2.3.1
created: 2024-01-15
rows: 1234567
columns: 45
schema:
  user_id: int64
  timestamp: datetime64
  feature_1: float64
  # ...
source: production-db
query: |
  SELECT * FROM events
  WHERE date > '2023-01-01'
  AND user_type = 'active'
```

**3. Statistics**
```yaml
distributions:
  feature_1:
    mean: 0.547
    std: 0.231
    min: 0.0
    max: 1.0
  feature_2:
    unique_values: 47
    top_3: ['A', 'B', 'C']
    missing: 2.3%
```

**4. Lineage**
```
training-v2.3.1
  ← derived from raw-events-v1.5.2
    ← extracted from production-db
      ← populated by application-v2.1.0
```

**5. Validation**
```yaml
checks:
  - no_missing_values: PASS
  - schema_match: PASS
  - distribution_drift: WARN (feature_3 shifted)
  - no_duplicates: PASS
```
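Most of this context can be generated automatically when a dataset version is created. Here is a minimal sketch with pandas, assuming the data fits in memory; `describe_dataset` and `validate` are hypothetical helpers, and the checks are deliberately basic (no drift detection, no lineage).

```python
import pandas as pd


def describe_dataset(df: pd.DataFrame, name: str) -> dict:
    """Collect the metadata and summary statistics worth storing next to a dataset version."""
    numeric = df.select_dtypes("number")
    return {
        "dataset": name,
        "rows": len(df),
        "columns": len(df.columns),
        "schema": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "distributions": {
            col: {
                "mean": float(numeric[col].mean()),
                "std": float(numeric[col].std()),
                "min": float(numeric[col].min()),
                "max": float(numeric[col].max()),
            }
            for col in numeric.columns
        },
    }


def validate(df: pd.DataFrame, expected_schema: dict) -> dict:
    """Run a few basic checks and report PASS/FAIL for each."""
    actual_schema = {col: str(dtype) for col, dtype in df.dtypes.items()}
    checks = {
        "no_missing_values": not df.isna().any().any(),
        "schema_match": actual_schema == expected_schema,
        "no_duplicates": not df.duplicated().any(),
    }
    return {check: "PASS" if ok else "FAIL" for check, ok in checks.items()}


# Toy data; dtypes are set explicitly so the schema is the same on every platform.
df = pd.DataFrame({"user_id": [1, 2, 3], "feature_1": [0.0, 0.5, 1.0]}).astype(
    {"user_id": "int64", "feature_1": "float64"}
)
meta = describe_dataset(df, "training-demo")
report = validate(df, expected_schema=meta["schema"])
print(meta)
print(report)
```

Store the resulting dictionaries alongside the dataset version (for example as YAML or JSON) so the context travels with the data.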

---

**Real cases where Git failed:**

**CASE 1: The dataset that "disappeared"**
```
DS: "I need the dataset from 3 months ago"
Me: "Is it in Git?"
DS: "No, it was too big"
Me: "Where is it?"
DS: "In s3://data/training.parquet"
Me: "That gets overwritten every week"
DS: "..."

# Dataset lost for good
# 3 months of work to regenerate it
```

**CASE 2: The impossible merge conflict**
```bash
git merge feature-new-data-source

CONFLICT in data/training.csv

# 10 million lines in conflict
# Impossible to resolve by hand
# We had to regenerate the whole dataset
# 2 weeks of work lost
```

**CASE 3: The CI/CD run that never finished**
```yaml
# .github/workflows/train.yaml
steps:
  - name: Checkout
    uses: actions/checkout@v3
    with:
      lfs: true  # LFS objects are only downloaded when this is set

  # Downloads 50GB of data from Git LFS
  # Times out after 6 hours
  # CI/CD never completes
  # We can't do automated training
```

---

**The right solution:**
```
Code → Git
Data → Data versioning tool (DVC, LakeFS, etc.)
Models → Model registry (MLflow, etc.)
```

**Separation of concerns:**
```
┌─────────────────────────────────────┐
│           Git (Source Control)      │
│                                     │
│  - Code                             │
│  - Config                           │
│  - Pipelines                        │
│  - .dvc files (pointers)            │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│     DVC/LakeFS (Data Versioning)    │
│                                     │
│  - Training datasets                │
│  - Validation datasets              │
│  - Feature sets                     │
│  - Metadata                         │
└─────────────────────────────────────┘

┌─────────────────────────────────────┐
│      MLflow (Model Registry)        │
│                                     │
│  - Trained models                   │
│  - Metrics                          │
│  - Hyperparameters                  │
│  - Artifacts                        │
└─────────────────────────────────────┘
```
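The "pointers" entry in the Git box is the key trick: the large file lives elsewhere, and Git only tracks a tiny text file that identifies it. The sketch below shows the idea with a content hash and a JSON file. It is a simplified stand-in to illustrate the concept, not DVC's actual file format or implementation; `write_pointer` and the `.pointer.json` suffix are invented for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_pointer(data_path: Path) -> Path:
    """Write a small JSON pointer (file name, size, content hash) next to a data file."""
    digest = hashlib.sha256()
    with data_path.open("rb") as f:
        # Hash in 1 MiB chunks so large files never have to fit in memory.
        for chunk in iter(lambda: f.read(1024 * 1024), b""):
            digest.update(chunk)
    pointer = {
        "path": data_path.name,
        "size": data_path.stat().st_size,
        "sha256": digest.hexdigest(),
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2) + "\n")
    return pointer_path


# Demo with a few placeholder bytes instead of a real 50GB Parquet file.
with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "training.parquet"
    data_file.write_bytes(b"not really parquet, just demo bytes")
    pointer_file = write_pointer(data_file)
    pointer = json.loads(pointer_file.read_text())
    print(pointer)
```

The pointer is a few hundred bytes, diffs cleanly, and changes whenever the data changes, so committing it ties a code commit to an exact dataset.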

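**Practical example:**

**WITHOUT data versioning:**

```python
# train.py
import mlflow
import pandas as pd

data = pd.read_parquet("s3://data/training.parquet")
model = train(data)  # train() is your own training function
mlflow.sklearn.log_model(model, "model")  # assuming a scikit-learn model

# 3 months later:
# "What data was this model trained on?"
# "...no idea"
```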

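**WITH data versioning:**

```python
# train.py
import dvc.api
import mlflow
import pandas as pd

# Pull a specific data version (a Git tag or commit in the DVC-tracked repo)
data_version = "v2.3.1"
with dvc.api.open(
    "data/training.parquet",
    rev=data_version,
    mode="rb",  # Parquet is binary; dvc.api.open defaults to text mode
) as f:
    data = pd.read_parquet(f)

# Train (train() is your own training function)
model = train(data)

# Log everything
mlflow.log_param("data_version", data_version)
mlflow.log_param("data_hash", get_data_hash())  # your own helper, not shown
mlflow.log_param("data_rows", len(data))
mlflow.sklearn.log_model(model, "model")  # assuming a scikit-learn model

# 3 months later:
# "What data was it trained on?"
# "data_version: v2.3.1, hash: a1b2c3"
# "I can reproduce it: git checkout v2.3.1 && dvc checkout"
```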

**The critical questions:**

With versioned data, you can answer:

**1. Reproducibility**
```
Can you reproduce experiment #142?
→ Yes: data v2.1.3 + code commit abc123
```

**2. Debugging**
```
Why did the model get worse?
→ Data drift: v2.3.1 vs v2.3.2
→ mean(feature_x): 0.5 → 0.7
```

**3. Compliance**
```
What personal data went into this model?
→ Dataset v2.1.3
→ Metadata says: no PII
→ Validation logs confirm it
```

**4. Rollback**
```
I need to go back to last week's dataset
→ git checkout v2.2.9 && dvc checkout
→ 30 seconds
```
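The debugging answer above only works if you can load two versions side by side and compare them. As a rough sketch, assuming both versions fit in memory, a mean-shift check per numeric column looks like this. `mean_shift` and the 0.1 threshold are arbitrary choices for illustration; real drift detection usually relies on proper statistical tests.

```python
import pandas as pd


def mean_shift(old: pd.DataFrame, new: pd.DataFrame, threshold: float = 0.1) -> dict:
    """Report numeric columns whose mean moved by more than `threshold` between versions."""
    shifted = {}
    shared = old.select_dtypes("number").columns.intersection(
        new.select_dtypes("number").columns
    )
    for col in shared:
        before, after = float(old[col].mean()), float(new[col].mean())
        if abs(after - before) > threshold:
            shifted[col] = (round(before, 3), round(after, 3))
    return shifted


# Toy stand-ins for dataset versions v2.3.1 and v2.3.2.
v231 = pd.DataFrame({"feature_x": [0.4, 0.5, 0.6], "feature_y": [1.0, 2.0, 3.0]})
v232 = pd.DataFrame({"feature_x": [0.6, 0.7, 0.8], "feature_y": [1.0, 2.0, 3.0]})
drift = mean_shift(v231, v232)
print(drift)
```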

**Quick comparison:**

| Method | Pros | Cons | Use case |
|--------|------|------|----------|
| Plain Git | Simple | Doesn't scale past 100MB | Small datasets (<100MB) |
| Git LFS | Git workflow | Expensive, slow | Medium datasets (<5GB) |
| DVC | Open source | Manual setup | Data science teams |
| LakeFS | Git-like | Complex | Large data lakes |
| Delta Lake | ACID | Spark/Databricks-centric | Spark workflows |

Your turn: how do you version your data today? Or is it "latest.parquet" and a prayer?
