Building a Declarative Training Platform for Generative AI with Crossplane and a Rails Control Plane

MLOps

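The fundamental challenge we face is not how to fine-tune a generative AI model. It is how to manage the full lifecycle of hundreds or thousands of fine-tuning experiments systematically, repeatably, and at scale. Each experiment needs expensive, short-lived GPU resources, a precisely versioned dataset, and an isolated, clean execution environment. As the team grows and experiments become more frequent, relying on hand-written scripts or cobbled-together CI/CD jobs to manage infrastructure and data flow quickly degenerates into a system that is hard to maintain and full of latent errors.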

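At its core, this is a problem of state management and declarative intent. When an engineer submits an experiment request, the intent is: "I need an environment with A100 GPUs that uses the dataset version datasets/project-alpha/v3.1.parquet from the S3 data lake and runs my training script." The engineer should not have to care about how to request virtual machines from the cloud provider, configure networking, install drivers, pull the data, and finally clean everything up.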

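Traditional solutions, such as calling Terraform or a cloud provider's CLI directly from a CI/CD pipeline, are essentially imperative. Even though a Terraform configuration is itself declarative, the pipeline around it runs as a one-shot sequence: it tells the cloud platform to "create a virtual machine," then "configure it," then "run the job." If any step fails, the whole environment is left in an indeterminate intermediate state. The actual state of the infrastructure drifts away from the expected state defined in some configuration file, and this drift is a root cause of operational complexity and intermittent errors.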

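We therefore decided to evaluate a completely different architecture: a declarative MLOps platform. In this architecture we describe only what we ultimately need, not how to get there step by step. A long-running controller works continuously to keep the real world (the cloud infrastructure) consistent with the desired state we declared.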
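
To make the idea of a long-running controller more concrete, here is a minimal Ruby sketch of a reconciliation step. It is a toy model under stated assumptions, not Crossplane's implementation: the "cloud" is an in-memory hash, and plan_actions and reconcile! are hypothetical names chosen for this illustration.

```ruby
# Toy reconciliation step: compare desired state with observed state and
# converge the observed state. All names here are illustrative assumptions.

# Compute the actions that move `actual` toward `desired`.
# Both arguments are hashes of resource name => spec hash.
def plan_actions(desired, actual)
  actions = []
  desired.each do |name, spec|
    if !actual.key?(name)
      actions << [:create, name, spec]
    elsif actual[name] != spec
      actions << [:update, name, spec]
    end
  end
  # Anything that exists but is no longer declared gets deleted.
  (actual.keys - desired.keys).each { |name| actions << [:delete, name, nil] }
  actions
end

# Apply the planned actions to an in-memory "cloud" (a plain hash).
def reconcile!(desired, cloud)
  plan_actions(desired, cloud).each do |verb, name, spec|
    case verb
    when :create, :update then cloud[name] = spec
    when :delete then cloud.delete(name)
    end
  end
  cloud
end

desired = { "pool-42" => { gpu: "nvidia-tesla-a100", nodes: 1 } }
cloud = {
  "pool-42"  => { gpu: "nvidia-tesla-a100", nodes: 3 }, # drifted by a manual change
  "pool-old" => { gpu: "nvidia-tesla-t4", nodes: 1 }    # leaked from an earlier run
}

reconcile!(desired, cloud)
puts cloud.inspect
```

A real controller runs this comparison repeatedly, so a second pass over an already converged state plans no actions. That idempotence is what separates it from a one-shot script.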

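Trade-offs: Imperative Scripts vs. a Declarative Control Plane

Option A: An Imperative Workflow Based on GitHub Actions and Terraform/CLI

This is the most direct option. A developer defines a training job in a Git repository, and pushing the code triggers GitHub Actions.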

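Implementation flow:
1. The GitHub Actions workflow is triggered.
2. terraform apply or gcloud compute instances create runs to create a virtual machine with GPUs.
3. ssh or a similar tool is used to configure the environment and install dependencies on the virtual machine.
4. The specified version of the data is pulled from the data lake (for example, S3).
5. python train.py is executed.
6. After training completes, terraform destroy or gcloud compute instances delete runs to clean up the resources.
Advantages:
- Quick to adopt, with a relatively gentle learning curve.
- Very effective for simple, one-off tasks.
- A mature ecosystem with plenty of documentation and examples.
Disadvantages:
- Fragility: the cleanup step (destroy) can fail because of network problems, permission changes, or script errors, leaving expensive GPU resources idle and still billing. This is an extremely common pitfall in real projects.
- Complex state management: the Terraform state file (tfstate) has to be managed carefully, and with several collaborators it is easy to run into conflicts and state corruption.
- No continuous reconciliation: if someone changes a cloud resource by hand (for example, editing a firewall rule while debugging), the imperative script knows nothing about it. The infrastructure drifts from its definition in code, and the next run may fail or behave unexpectedly.
- Coupled concerns: the CI/CD pipeline takes on both the "what" (business logic such as training) and the "how to prepare the environment" (infrastructure logic), which violates the single responsibility principle.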

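Option B: A Declarative Control Plane Based on Crossplane and Kubernetes

This option moves infrastructure management from one-off CI/CD jobs to a long-running, Kubernetes-based control plane. We use Crossplane to expose cloud resources (such as GPU node pools, S3 buckets, and IAM roles) as Kubernetes custom resources defined by CRDs.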

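Implementation flow:
1. Ahead of time, we install and configure Crossplane and the relevant cloud Provider (such as provider-aws) in a management Kubernetes cluster.
2. The developer submits a training request through an API. The API backend (we chose Ruby on Rails) validates the request, and the orchestration workflow it triggers renders the request into a YAML manifest that describes the training environment.
3. The YAML manifest is applied to the management cluster with kubectl apply. It describes a high-level abstract resource, for example a TrainingEnvironment.
4. Crossplane's controllers detect the new resource and, following a predefined Composition, start creating all the necessary underlying resources in the cloud (for example, a GKE NodePool, a Namespace, a ServiceAccount, and the corresponding IAM bindings).
5. Crossplane continuously monitors these cloud resources. If a resource is modified or deleted by hand, Crossplane restores it to the state defined in the YAML. This is the key difference, and it is known as the reconciliation loop.
6. Once the environment is ready, a Kubernetes Job is triggered. It mounts the required data volumes or has permission to access the data lake, and it starts the training task.
7. When the experiment finishes, deleting the TrainingEnvironment resource is enough: Crossplane then cleans up all the associated cloud resources automatically.
Advantages:
- Declarative and self-healing: we care only about the final state. Crossplane is responsible for reaching and maintaining it, which makes the system far more robust and minimizes the risk of leaked resources.
- A unified API: all infrastructure becomes part of the Kubernetes API. Databases, message queues, and GPU nodes can all be managed with kubectl and YAML, which gives platform engineering a solid foundation.
- Separation of concerns: GitHub Actions is reduced to a pure orchestrator that receives an event, generates a declaration, and runs kubectl apply. It no longer cares about how the infrastructure is implemented.
- Extensibility: with Crossplane Compositions we can package a set of complex cloud resources into a simple, developer-friendly abstract API (such as TrainingEnvironment) that hides the underlying complexity.
Disadvantages:
- High initial complexity: we need to build and maintain a Kubernetes cluster as the control plane and learn Crossplane concepts in depth, such as XRDs (CompositeResourceDefinitions) and Compositions.
- A still-maturing ecosystem: compared with Terraform, the Crossplane Provider for a given cloud may cover fewer features or be less stable.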

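Final decision: because we want a scalable, fault-tolerant, maintainable platform rather than a one-off solution, we chose Option B. The upfront investment pays for stability and automation later. In a real environment where dozens of engineers and data scientists run experiments in parallel, the value of a declarative, self-healing architecture far outweighs its initial complexity.

Core Implementation Overview

The core data flow and component interactions of the whole system are shown below.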

sequenceDiagram
    participant User as Developer/User
    participant Rails as Rails API Control Plane
    participant GitHub as GitHub Actions
    participant K8s_CP as Kubernetes Management Cluster (with Crossplane)
    participant CloudProvider as Cloud Provider (AWS/GCP/Azure)
    participant K8s_Worker as Kubernetes Training Cluster

    User->>+Rails: POST /api/v1/experiments (with configuration)
    Rails->>+GitHub: Trigger repository_dispatch event (payload)
    Rails-->>-User: 202 Accepted (returns experiment_id)

    GitHub->>+K8s_CP: 1. `kubectl apply -f environment.yaml`
    K8s_CP->>+CloudProvider: 2. Crossplane Provider creates resources (GPU node pool, IAM roles, etc.)
    CloudProvider-->>-K8s_CP: Resources created
    K8s_CP->>+K8s_Worker: 3. Crossplane adds the new nodes to the cluster
    K8s_CP->>+K8s_CP: 4. Create Namespace, ServiceAccount, etc.
    K8s_CP-->>-GitHub: `apply` command completes

    GitHub->>+K8s_Worker: 5. `kubectl apply -f training_job.yaml`
    K8s_Worker->>+K8s_Worker: 6. Kube-scheduler places the Pod on a new GPU node
    Note right of K8s_Worker: Pod starting...
    K8s_Worker->>DataLake: 7. Pull data from the data lake
    Note right of K8s_Worker: Model training...
    K8s_Worker->>ModelRegistry: 8. Training done, push the model

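1. The Crossplane Abstraction Layer: Defining `TrainingEnvironment`

This is the cornerstone of the declarative architecture. We do not want users to operate the underlying NodePool or IAMRole directly, so we create a higher-level abstraction called TrainingEnvironment.

First, we define a CompositeResourceDefinition (XRD), which describes the schema of our new API.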

# xrds/trainingenvironment.xrd.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: trainingenvironments.mlops.example.com
spec:
  group: mlops.example.com
  names:
    kind: TrainingEnvironment
    plural: trainingenvironments
  claimNames:
    kind: TrainingEnvironmentClaim
    plural: trainingenvironmentclaims
  versions:
  - name: v1alpha1
    served: true
    referenceable: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            properties:
              experimentId:
                type: string
                description: "Unique ID for the experiment, used for tagging resources."
              region:
                type: string
                description: "The cloud provider region."
                default: "us-central1"
              gpu:
                type: object
                description: "GPU configuration for the worker nodes."
                properties:
                  type:
                    type: string
                    description: "Type of the GPU, e.g., 'nvidia-tesla-a100'."
                    enum: ["nvidia-tesla-a100", "nvidia-tesla-t4"]
                  count:
                    type: integer
                    description: "Number of GPUs per node."
                    minimum: 1
                required: ["type", "count"]
              nodeCount:
                type: integer
                description: "Number of nodes in the dedicated pool."
                default: 1
            required: ["experimentId", "gpu"]
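
As a rough illustration of what this schema enforces, the following Ruby sketch mirrors its defaults and constraints in plain code. It is an assumption-laden stand-in for client-side pre-validation, not how the Kubernetes API server validates the resource; validate_training_environment_spec and the two constants are hypothetical names.

```ruby
# Mirrors the constraints declared in the XRD schema above (illustrative only).
ALLOWED_GPU_TYPES = %w[nvidia-tesla-a100 nvidia-tesla-t4].freeze
SPEC_DEFAULTS = { "region" => "us-central1", "nodeCount" => 1 }.freeze

# Returns [spec_with_defaults, errors]. `spec` uses string keys, as parsed YAML would.
def validate_training_environment_spec(spec)
  errors = []
  merged = SPEC_DEFAULTS.merge(spec) # explicit values win over defaults

  errors << "experimentId is required" unless merged["experimentId"].is_a?(String)

  gpu = merged["gpu"]
  if !gpu.is_a?(Hash)
    errors << "gpu is required"
  else
    unless ALLOWED_GPU_TYPES.include?(gpu["type"])
      errors << "gpu.type must be one of #{ALLOWED_GPU_TYPES.join(', ')}"
    end
    count = gpu["count"]
    errors << "gpu.count must be an integer >= 1" unless count.is_a?(Integer) && count >= 1
  end

  [merged, errors]
end

spec, errors = validate_training_environment_spec(
  "experimentId" => "42",
  "gpu" => { "type" => "nvidia-tesla-a100", "count" => 1 }
)
puts spec.inspect
puts errors.inspect
```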

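Next, we create a Composition, which tells Crossplane how to "translate" a TrainingEnvironment declaration into a set of concrete, low-level cloud resources. We use GCP as the example here.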

# compositions/gke-gpu-environment.composition.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: gke.gpu-environment.mlops.example.com
  labels:
    provider: gcp
spec:
  writeConnectionSecretsToNamespace: crossplane-system
  compositeTypeRef:
    apiVersion: mlops.example.com/v1alpha1
    kind: TrainingEnvironment
  resources:
    - name: gkeNodePool
      base:
        apiVersion: container.gcp.upbound.io/v1beta1
        kind: NodePool
        spec:
          forProvider:
            cluster: "main-training-cluster" # Hardcoded cluster name
            management:
              autoRepair: true
              autoUpgrade: true
            nodeConfig:
              machineType: "n1-standard-4"
              oauthScopes:
                - "https://www.googleapis.com/auth/cloud-platform"
              # Critical: This section configures the GPU
              guestAccelerator:
                - type: "nvidia-tesla-a100" # This will be patched
                  count: 1 # This will be patched
              # Taints to ensure only specific training jobs land here
              taint:
                - key: "mlops.example.com/workload"
                  value: "training"
                  effect: "NO_SCHEDULE"
            initialNodeCount: 1 # This will be patched
      patches:
        - fromFieldPath: "spec.region"
          toFieldPath: "spec.forProvider.location"
        - fromFieldPath: "spec.experimentId"
          toFieldPath: "metadata.labels.experiment-id"
        - fromFieldPath: "spec.gpu.type"
          toFieldPath: "spec.forProvider.nodeConfig.guestAccelerator[0].type"
        - fromFieldPath: "spec.gpu.count"
          toFieldPath: "spec.forProvider.nodeConfig.guestAccelerator[0].count"
        - fromFieldPath: "spec.nodeCount"
          toFieldPath: "spec.forProvider.initialNodeCount"
        # Generate a unique name for the node pool to avoid conflicts
        - fromFieldPath: "spec.experimentId"
          toFieldPath: "metadata.name"
          transforms:
            - type: string
              string:
                fmt: "pool-%s"

    - name: trainingNamespace
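      base:
        # Crossplane composes managed resources, so the Namespace is wrapped in a
        # provider-kubernetes Object. Assumptions: provider-kubernetes is installed
        # (the Object apiVersion depends on its release) and a ProviderConfig named
        # "training-cluster" (hypothetical) points at the training cluster.
        apiVersion: kubernetes.crossplane.io/v1alpha2
        kind: Object
        spec:
          providerConfigRef:
            name: training-cluster
          forProvider:
            manifest:
              apiVersion: v1
              kind: Namespace
      patches:
        - fromFieldPath: "spec.experimentId"
          toFieldPath: "spec.forProvider.manifest.metadata.name"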
          transforms:
            - type: string
              string:
                fmt: "exp-%s"
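
The patch entries in this Composition copy a value from the composite resource into a composed resource, optionally passing it through a transform. The Ruby sketch below imitates that mechanism for the simple case of dotted paths and string fmt transforms. It is a simplified model, not Crossplane's code: get_path, set_path, and apply_patch are hypothetical helpers, and array paths such as guestAccelerator[0].type are deliberately not handled.

```ruby
# Read a dotted field path such as "spec.gpu.type" from a nested hash.
def get_path(obj, path)
  path.split(".").reduce(obj) { |node, key| node.is_a?(Hash) ? node[key] : nil }
end

# Write a value at a dotted field path, creating intermediate hashes as needed.
def set_path(obj, path, value)
  *parents, leaf = path.split(".")
  target = parents.reduce(obj) { |node, key| node[key] ||= {} }
  target[leaf] = value
  obj
end

# Apply one patch: copy from the composite to the composed resource,
# running any string fmt transforms on the way.
def apply_patch(composite, composed, patch)
  value = get_path(composite, patch["fromFieldPath"])
  return composed if value.nil? # nothing to copy when the source field is unset

  (patch["transforms"] || []).each do |transform|
    next unless transform["type"] == "string"
    value = format(transform.dig("string", "fmt"), value)
  end
  set_path(composed, patch["toFieldPath"], value)
end

composite = { "spec" => { "experimentId" => "42", "region" => "us-central1" } }
node_pool = { "metadata" => {}, "spec" => { "forProvider" => {} } }
patches = [
  { "fromFieldPath" => "spec.region", "toFieldPath" => "spec.forProvider.location" },
  { "fromFieldPath" => "spec.experimentId", "toFieldPath" => "metadata.name",
    "transforms" => [{ "type" => "string", "string" => { "fmt" => "pool-%s" } }] }
]

patches.each { |patch| apply_patch(composite, node_pool, patch) }
puts node_pool.inspect
```

With these inputs the node pool ends up named pool-42 and located in us-central1, which is how one experimentId fans out into unique names across several composed resources.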

Now a developer or an automation system only needs to create a very simple TrainingEnvironmentClaim resource, and Crossplane does all the heavy lifting in the background.
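
As a sketch of how small such a claim can be, the following Ruby snippet builds one as a hash and prints it as YAML. It assumes the field layout of the XRD defined above, with fields directly under spec, plus the provider: gcp label used by the Composition. The helper name training_environment_claim and its defaults are hypothetical.

```ruby
require "yaml"

# Build a TrainingEnvironmentClaim as a plain hash (string keys serialize cleanly to YAML).
def training_environment_claim(experiment_id:, gpu_type:, gpu_count:,
                               region: "us-central1", namespace: "default")
  {
    "apiVersion" => "mlops.example.com/v1alpha1",
    "kind" => "TrainingEnvironmentClaim",
    "metadata" => { "name" => experiment_id.to_s, "namespace" => namespace },
    "spec" => {
      # Select the GCP Composition by its label.
      "compositionSelector" => { "matchLabels" => { "provider" => "gcp" } },
      "experimentId" => experiment_id.to_s,
      "region" => region,
      "gpu" => { "type" => gpu_type, "count" => gpu_count }
    }
  }
end

claim = training_environment_claim(
  experiment_id: 42, gpu_type: "nvidia-tesla-a100", gpu_count: 1
)
puts claim.to_yaml
```

Building the manifest as a data structure and serializing it avoids the quoting problems that string templating tends to introduce.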

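2. The Ruby on Rails API Control Plane

The Rails application handles no core business logic here. Its only responsibilities are:

1. Provide a RESTful API that receives, validates, and persists experiment requests.
2. Turn valid requests into events that trigger the GitHub Actions workflow.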

# app/models/experiment.rb
class Experiment < ApplicationRecord
  # enums for state management, gpu types, etc.
  enum status: { pending: 0, provisioning: 1, running: 2, completed: 3, failed: 4 }
  enum gpu_type: { t4: 'nvidia-tesla-t4', a100: 'nvidia-tesla-a100' }

  validates :dataset_uri, presence: true, format: { with: %r{\As3://[a-zA-Z0-9.-]+/[a-zA-Z0-9./_-]+\z} }
  validates :docker_image, presence: true
  validates :gpu_type, presence: true
  validates :gpu_count, numericality: { only_integer: true, greater_than: 0 }
end

# app/controllers/api/v1/experiments_controller.rb
module Api::V1
  class ExperimentsController < ApplicationController
    # Basic auth or more robust token auth should be here
    before_action :authenticate

    def create
      @experiment = Experiment.new(experiment_params)

      if @experiment.save
        # Use a background job for robustness
        OrchestratorDispatchJob.perform_later(@experiment.id)
        render json: { id: @experiment.id, status: @experiment.status }, status: :accepted
      else
        render json: { errors: @experiment.errors.full_messages }, status: :unprocessable_entity
      end
    end

    private

    def experiment_params
      params.require(:experiment).permit(:name, :dataset_uri, :docker_image, :gpu_type, :gpu_count)
    end

    def authenticate
      # Production: Implement proper token-based authentication
      # For demo: Simple HTTP Basic Auth
      authenticate_or_request_with_http_basic do |username, password|
        ActiveSupport::SecurityUtils.secure_compare(username, ENV['API_USER']) &&
        ActiveSupport::SecurityUtils.secure_compare(password, ENV['API_PASSWORD'])
      end
    end
  end
end

# app/jobs/orchestrator_dispatch_job.rb
class OrchestratorDispatchJob < ApplicationJob
  queue_as :default

  # A common pitfall is making the API call directly in the controller.
  # This makes the API slow and not resilient to network failures.
  # Using a background job is the correct production approach.
  def perform(experiment_id)
    experiment = Experiment.find(experiment_id)
    return unless experiment.pending?

    client = Octokit::Client.new(access_token: ENV['GITHUB_PAT'])
    repo = ENV['GITHUB_REPO'] # e.g., 'my-org/mlops-pipelines'

    # The payload is the contract between the Rails API and the GitHub Actions workflow
    payload = {
      event_type: 'run-experiment',
      client_payload: {
        experiment_id: experiment.id.to_s,
        region: 'us-central1', # Could be dynamic
        gpu: {
          type: experiment.gpu_type_before_type_cast, # sends the string value
          count: experiment.gpu_count
        },
        namespace: "exp-#{experiment.id}",
        dataset_uri: experiment.dataset_uri,
        docker_image: experiment.docker_image
      }
    }

    begin
      # Octokit exposes the repository_dispatch API as `dispatch_event`.
      client.dispatch_event(repo, payload[:event_type], client_payload: payload[:client_payload])
      experiment.update!(status: :provisioning)
    rescue Octokit::Error => e
      # Log the error and potentially set the experiment to a failed state
      Rails.logger.error "Failed to dispatch to GitHub for experiment #{experiment.id}: #{e.message}"
      experiment.update!(status: :failed, logs: "GitHub dispatch failed: #{e.message}")
    end
  end
end

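3. GitHub Actions: A Lightweight Orchestrator

The workflow's role becomes very simple. It no longer contains complex logic; it only responds to an event and executes commands.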

# .github/workflows/ml-experiment-orchestrator.yml
name: ML Experiment Orchestrator

on:
  repository_dispatch:
    types: [run-experiment]

jobs:
  provision-and-run:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3

      - name: 'Authenticate to GCP and GKE'
        uses: 'google-github-actions/auth@v1'
        with:
          credentials_json: '${{ secrets.GCP_SA_KEY }}'

      - name: 'Set up GKE cluster credentials'
        uses: 'google-github-actions/get-gke-credentials@v1'
        with:
          cluster_name: 'main-management-cluster' # The cluster where Crossplane is running
          location: 'us-central1'

      - name: 'Generate and Apply TrainingEnvironment Claim'
        id: apply-env
        run: |
          # The client_payload from the Rails API is available in github.event context
          cat <<EOF > environment-claim.yaml
          apiVersion: mlops.example.com/v1alpha1
          kind: TrainingEnvironmentClaim
          metadata:
            name: ${{ github.event.client_payload.experiment_id }}
            namespace: default # Claims are typically created in a user-facing namespace
          spec:
            compositionSelector:
              matchLabels:
                provider: gcp
            # The XRD defines these fields directly under spec, without a parameters wrapper.
            experimentId: "${{ github.event.client_payload.experiment_id }}"
            region: "${{ github.event.client_payload.region }}"
            gpu:
              type: "${{ github.event.client_payload.gpu.type }}"
              count: ${{ github.event.client_payload.gpu.count }}
          EOF

          echo "--- Applying TrainingEnvironmentClaim ---"
          cat environment-claim.yaml
          kubectl apply -f environment-claim.yaml

      # This step is crucial. We must wait for Crossplane to finish its work.
      - name: 'Wait for TrainingEnvironment to be Ready'
        run: |
          echo "Waiting for environment to be provisioned by Crossplane..."
          # The composite resource gets a generated name, so we wait on the claim,
          # which reports Ready once the composite and its resources are ready.
          # In a real project, add a more robust check on top of this timeout.
          kubectl wait --for=condition=Ready \
            trainingenvironmentclaim.mlops.example.com/${{ github.event.client_payload.experiment_id }} \
            --namespace default \
            --timeout=20m

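      # The Job runs on the training cluster, so switch kubectl credentials first.
      - name: 'Set up training cluster credentials'
        uses: 'google-github-actions/get-gke-credentials@v1'
        with:
          cluster_name: 'main-training-cluster' # The cluster that receives the GPU node pool
          location: 'us-central1'

      - name: 'Generate and Submit Training Job'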
        run: |
          NAMESPACE="${{ github.event.client_payload.namespace }}"
          
          cat <<EOF > training-job.yaml
          apiVersion: batch/v1
          kind: Job
          metadata:
            name: training-job-${{ github.event.client_payload.experiment_id }}
            namespace: ${NAMESPACE}
          spec:
            template:
              spec:
                # Tolerations to allow scheduling on the tainted GPU nodes
                tolerations:
                - key: "mlops.example.com/workload"
                  operator: "Equal"
                  value: "training"
                  effect: "NoSchedule"
                containers:
                - name: training-container
                  image: ${{ github.event.client_payload.docker_image }}
                  args: [
                    "--dataset-uri", "${{ github.event.client_payload.dataset_uri }}",
                    "--experiment-id", "${{ github.event.client_payload.experiment_id }}",
                    "--output-path", "s3://my-models-bucket/..."
                  ]
                  # This is where we would mount secrets for Data Lake access
                  # or use Workload Identity for pod-level GCP permissions.
                  resources:
                    limits:
                      nvidia.com/gpu: ${{ github.event.client_payload.gpu.count }}
                restartPolicy: Never
            backoffLimit: 1
          EOF
          
          echo "--- Submitting Training Job ---"
          cat training-job.yaml
          kubectl apply -f training-job.yaml
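
The workflow's wait step notes that a real project needs a more robust readiness check. One possible shape for it is a polling helper with a hard deadline and exponential backoff, sketched below in Ruby. The helper wait_until_ready is hypothetical. In practice its block would query the cluster, for example by shelling out to kubectl, but here a counter stands in so the sketch stays self-contained.

```ruby
# Poll the given block until it returns true or `timeout` seconds elapse.
# Returns true on success and false on timeout. The delay doubles after each
# failed check, capped at `max_interval`, and never overshoots the deadline.
def wait_until_ready(timeout:, interval: 1.0, max_interval: 30.0)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  delay = interval
  loop do
    return true if yield

    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    return false if remaining <= 0

    sleep([delay, remaining].min)
    delay = [delay * 2, max_interval].min
  end
end

# Stand-in readiness check: reports ready on the third poll.
attempts = 0
ready = wait_until_ready(timeout: 5, interval: 0.01) { (attempts += 1) >= 3 }
puts "ready=#{ready} after #{attempts} checks"
```

Using a monotonic clock keeps the deadline correct even if the wall clock is adjusted while the workflow is waiting.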

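Extensibility and Limitations of the Architecture

The real power of this architecture is its extensibility. We can easily add more parameters to TrainingEnvironment, support other cloud providers by creating new Compositions, or integrate specific storage volumes and network policies. The Rails control plane can evolve into a full-featured platform with a UI, experiment result tracking, cost accounting, and more, while the underlying declarative infrastructure stays consistent and stable.

That said, this approach is not without drawbacks.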

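- Feedback loop: when Crossplane fails to create a resource in the cloud, the debugging chain gets longer. You check the status of the TrainingEnvironmentClaim, then drill into the TrainingEnvironment bound to it, then into the specific NodePool or IAMPolicy resource, and finally read the logs of the Crossplane Provider pod. This is more involved than reading Terraform's output directly.
- Provider maturity: the Crossplane ecosystem depends on the quality of each Provider. The Providers for the major clouds are fairly mature, but support for niche services or the newest features may lag or contain bugs.
- Day-2 operations: maintaining a Kubernetes control plane that runs Crossplane takes expertise of its own, including upgrading Crossplane, upgrading Providers, and backing up the custom resource state stored in etcd.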
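
To show what walking that debugging chain looks like in practice, here is a Ruby sketch that follows the references from a claim to its composite resource and on to the managed resources, collecting everything that is not Ready. It works on hashes shaped like kubectl get -o json output and relies on the spec.resourceRef and spec.resourceRefs fields that Crossplane sets. The lookup table, the sample objects, and the error message are made up for illustration.

```ruby
# True when the object reports a Ready condition with status "True".
def ready?(obj)
  conditions = obj.dig("status", "conditions") || []
  conditions.any? { |c| c["type"] == "Ready" && c["status"] == "True" }
end

# Walk claim -> composite -> managed resources and return the objects that
# are not Ready, ordered from the top of the chain down.
# `lookup` maps "Kind/name" to a parsed object (a stand-in for cluster queries).
def not_ready_chain(claim, lookup)
  chain = []
  chain << claim unless ready?(claim)

  ref = claim.dig("spec", "resourceRef")
  composite = ref && lookup["#{ref['kind']}/#{ref['name']}"]
  return chain unless composite

  chain << composite unless ready?(composite)
  (composite.dig("spec", "resourceRefs") || []).each do |r|
    managed = lookup["#{r['kind']}/#{r['name']}"]
    chain << managed if managed && !ready?(managed)
  end
  chain
end

# Made-up sample data: the NodePool failed, so nothing above it is Ready.
lookup = {
  "TrainingEnvironment/42-abcde" => {
    "kind" => "TrainingEnvironment", "metadata" => { "name" => "42-abcde" },
    "spec" => { "resourceRefs" => [{ "kind" => "NodePool", "name" => "pool-42" }] },
    "status" => { "conditions" => [{ "type" => "Ready", "status" => "False" }] }
  },
  "NodePool/pool-42" => {
    "kind" => "NodePool", "metadata" => { "name" => "pool-42" },
    "status" => { "conditions" => [
      { "type" => "Synced", "status" => "False", "message" => "example: GPU quota exceeded" }
    ] }
  }
}
claim = {
  "kind" => "TrainingEnvironmentClaim", "metadata" => { "name" => "42" },
  "spec" => { "resourceRef" => { "kind" => "TrainingEnvironment", "name" => "42-abcde" } },
  "status" => { "conditions" => [] }
}

not_ready_chain(claim, lookup).each do |obj|
  puts "#{obj['kind']}/#{obj.dig('metadata', 'name')} is not Ready"
end
```

The last entry in the chain is usually the place to start, since the conditions on the managed resource tend to carry the provider's actual error.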

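Despite these challenges, for scenarios that involve managing complex, dynamic, multi-tenant AI/ML infrastructure, an architecture built on a declarative control plane addresses the fragility and state drift of imperative scripts at their root, and it makes for a platform that is robust, predictable, and highly automated.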