
Vertex AI & MLOps

Quick reference for Vertex AI services, pipeline components, feature store, model registry, monitoring, and AutoML.


Vertex AI Service Map

Vertex AI is Google Cloud's unified ML platform with over a dozen interconnected services. This map helps you quickly identify which service to use for each stage of the ML lifecycle.

| Service | Purpose | When to Use |
| --- | --- | --- |
| Vertex AI Training | Custom model training | Need full control over training |
| AutoML | No-code model training | Tabular, image, text, video classification |
| Vertex AI Pipelines | Orchestrated ML workflows | Reproducible training pipelines |
| Feature Store | Centralized feature management | Shared features across models |
| Model Registry | Model versioning and metadata | Track, compare, and promote models |
| Endpoints | Online model serving | Real-time predictions (batch prediction jobs run from the model, without an endpoint) |
| Vertex AI Studio | Prompt design and tuning | GenAI experimentation |
| Model Garden | Pre-trained model catalog | Use or fine-tune foundation models |
| Vertex AI Agent Builder | Build AI agents and RAG | Grounded generation, agent tools |
| Model Monitoring | Detect drift and anomalies | Production model health |
| Experiments | Track training runs | Compare hyperparameters, metrics |

AutoML vs Custom Training

AutoML gets you a baseline model in hours with zero code; custom training gives you full control at the cost of engineering effort. Start with AutoML to establish a benchmark, then switch to custom only if you need to beat it.

| Aspect | AutoML | Custom Training |
| --- | --- | --- |
| Code required | None (UI/API config) | Full training code |
| Model control | Limited (architecture chosen) | Full (any framework) |
| Data prep | Minimal (automatic) | Manual feature engineering |
| Training time | Hours | Flexible |
| Best for | Prototyping, tabular data | Research, complex architectures |
| Cost | Higher per training hour | Lower with spot/preemptible |
| Explainability | Built-in | Manual (SHAP, etc.) |
| Export | TF SavedModel, container | Any format |

AutoML Supported Tasks

| Data Type | Tasks |
| --- | --- |
| Tabular | Classification, regression, forecasting |
| Image | Classification, object detection, segmentation |
| Text | Classification, entity extraction, sentiment |
| Video | Classification, object tracking, action recognition |

Vertex AI Pipelines

Pipelines make your ML workflow reproducible, auditable, and automatable. Without them, training runs are one-off scripts that nobody can reproduce six months later.

Pipeline Structure

┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  Data Prep   │───>│   Training   │───>│  Evaluation  │
│  Component   │    │  Component   │    │  Component   │
└──────────────┘    └──────────────┘    └──────┬───────┘
                                               │
                                      ┌────────▼─────────┐
                                      │ Conditional Gate │
                                      │ (metrics check)  │
                                      └────────┬─────────┘
                                               │
                               ┌───────────────▼───────────────┐
                               │ Model Upload + Deploy to      │
                               │ Endpoint (if metrics pass)    │
                               └───────────────────────────────┘
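
The conditional gate in the diagram comes down to comparing evaluation metrics against minimum thresholds before anything is uploaded. A minimal sketch of that check in plain Python follows. The metric names (`auc`, `recall`) and the threshold values are illustrative assumptions, not part of the original pipeline, and in a real pipeline this logic would live inside a component or a pipeline condition.

```python
from typing import Dict, List


def failed_checks(metrics: Dict[str, float], minimums: Dict[str, float]) -> List[str]:
    """Return the names of metrics that are missing or below their minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None or value < minimum:
            failures.append(name)
    return failures


def passes_gate(metrics: Dict[str, float], minimums: Dict[str, float]) -> bool:
    """True only when every required metric meets its minimum."""
    return not failed_checks(metrics, minimums)


# Hypothetical thresholds and evaluation output, for illustration only.
MINIMUMS = {"auc": 0.90, "recall": 0.80}
candidate = {"auc": 0.93, "recall": 0.76}

if passes_gate(candidate, MINIMUMS):
    print("deploy")
else:
    print("blocked by:", failed_checks(candidate, MINIMUMS))
```

Treating a missing metric as a failure keeps a broken evaluation step from silently promoting a model.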

Pipeline Definition (KFP v2)

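from kfp import compiler, dsl

# Each component runs in its own container, so imports go inside the function body
# and every package it uses must be listed in packages_to_install.
# gcsfs lets pandas read gs:// paths such as the input_path used when running the pipeline.
@dsl.component(base_image="python:3.11", packages_to_install=["pandas", "scikit-learn", "gcsfs"])
def preprocess_data(input_path: str, output_path: dsl.OutputPath("Dataset")):
    import pandas as pd
    df = pd.read_csv(input_path)
    # preprocessing logic
    df.to_csv(output_path, index=False)

# pandas is imported below, so it has to be installed here as well.
@dsl.component(base_image="python:3.11", packages_to_install=["pandas", "scikit-learn", "joblib"])
def train_model(
    dataset_path: dsl.InputPath("Dataset"),
    model_path: dsl.OutputPath("Model"),
    learning_rate: float = 0.01
):
    import pandas as pd
    from sklearn.ensemble import GradientBoostingClassifier
    import joblib

    df = pd.read_csv(dataset_path)
    X, y = df.drop("target", axis=1), df["target"]
    model = GradientBoostingClassifier(learning_rate=learning_rate)
    model.fit(X, y)
    joblib.dump(model, model_path)

@dsl.pipeline(name="training-pipeline")
def training_pipeline(input_path: str, lr: float = 0.01):
    preprocess_task = preprocess_data(input_path=input_path)
    # Passing the upstream output wires the dependency: training runs after preprocessing.
    train_task = train_model(
        dataset_path=preprocess_task.outputs["output_path"],
        learning_rate=lr
    )

# Compile to the file that PipelineJob's template_path points at.
# Newer KFP releases may warn that JSON output is deprecated; a .yaml path also works.
compiler.Compiler().compile(
    pipeline_func=training_pipeline,
    package_path="pipeline.json",
)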

Running a Pipeline

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

job = aiplatform.PipelineJob(
    display_name="training-run-001",
    template_path="pipeline.json",  # compiled pipeline
    parameter_values={"input_path": "gs://bucket/data.csv", "lr": 0.01},
    pipeline_root="gs://bucket/pipeline-runs",
)
job.run(service_account="sa@project.iam.gserviceaccount.com")

Feature Store

Feature stores solve the problem of feature consistency between training and serving -- without one, teams recompute features inconsistently, leading to training-serving skew and duplicated engineering effort.

Key Concepts

| Concept | Description |
| --- | --- |
| Feature Group | Logical grouping of related features (e.g., "user_features") |
| Feature | Individual data column with type and metadata |
| Entity | Key that identifies the row (e.g., user_id) |
| Online Store | Low-latency serving for real-time predictions |
| Offline Store | BigQuery-backed for batch training |
| Point-in-time lookup | Get feature values as of a specific timestamp |
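
Point-in-time lookup is the concept that prevents label leakage: a training row may only see feature values that existed at the time of the event. The sketch below shows the idea in plain Python with a binary search over a sorted history. The function name and the sample `purchase_count` history are hypothetical, and this is not how the Feature Store implements it internally.

```python
from bisect import bisect_right
from datetime import datetime
from typing import List, Optional, Tuple


def point_in_time_value(
    history: List[Tuple[datetime, float]], as_of: datetime
) -> Optional[float]:
    """Return the latest value recorded at or before `as_of`.

    `history` must be sorted by timestamp in ascending order.
    Returns None when no value existed yet at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx else None


# Hypothetical purchase_count history for one user_id.
purchase_count_history = [
    (datetime(2024, 1, 1), 3.0),
    (datetime(2024, 1, 10), 5.0),
    (datetime(2024, 1, 20), 8.0),
]

# A training example dated Jan 15 must see 5.0, not the later value 8.0.
print(point_in_time_value(purchase_count_history, datetime(2024, 1, 15)))
```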

Feature Store Operations

# Assumes the preview feature_store module of the Vertex AI SDK;
# check class and method names against your installed SDK version.
from vertexai.resources.preview import feature_store

# Create feature group backed by a BigQuery table
fg = feature_store.FeatureGroup.create(
    name="user_features",
    source=feature_store.utils.FeatureGroupBigQuerySource(
        uri="bq://project.dataset.user_features_table",
        entity_id_columns=["user_id"],
    ),
)

# Create feature (column in the source table)
feature = fg.create_feature(name="purchase_count")

# Online serving: create a Bigtable-backed Feature Online Store with autoscaling.
# Store and view IDs use lowercase letters, digits, and underscores (no hyphens).
online_store = feature_store.FeatureOnlineStore.create_bigtable_store(
    name="production_store",
    min_node_count=1,
    max_node_count=3,
)

# Create feature view for online serving
view = online_store.create_feature_view(
    name="user_features_view",
    source=feature_store.utils.FeatureViewBigQuerySource(
        uri="bq://project.dataset.user_features_table",
        entity_id_columns=["user_id"],
    ),
)

Model Registry

The model registry is your single source of truth for what models exist, which version is in production, and how each version performed. Without it, model management becomes spreadsheets and tribal knowledge.

Uploading a Model

model = aiplatform.Model.upload(
    display_name="fraud-detector-v2",
    artifact_uri="gs://bucket/models/fraud-v2/",
    serving_container_image_uri="us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-3:latest",
    labels={"team": "fraud", "version": "2"},
    description="Gradient boosting fraud detection model"
)

Model Versioning

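# Upload as a new version of an existing model.
# existing_model is a previously uploaded aiplatform.Model, such as `model` from the upload above.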
model_v2 = aiplatform.Model.upload(
    display_name="fraud-detector-v2",
    parent_model=existing_model.resource_name,  # creates new version
    artifact_uri="gs://bucket/models/fraud-v2.1/",
    serving_container_image_uri="...",
    version_aliases=["champion"],
    version_description="Improved recall by 5%"
)

Endpoints and Serving

Endpoints handle the operational complexity of serving models at scale -- autoscaling, traffic splitting, and health checks. Use traffic splitting for A/B tests and canary deployments instead of risky all-at-once rollouts.

Deploy Model

endpoint = aiplatform.Endpoint.create(display_name="fraud-endpoint")

endpoint.deploy(
    model=model,
    deployed_model_display_name="fraud-v2",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=5,
    traffic_percentage=100,
    accelerator_type=None
)

Traffic Splitting (A/B Testing)

# Deploy new model alongside existing one
endpoint.deploy(
    model=model_v3,
    deployed_model_display_name="fraud-v3-candidate",
    machine_type="n1-standard-4",
    min_replica_count=1,
    max_replica_count=3,
    traffic_percentage=10  # 10% to new model
)
# Existing model automatically gets 90%
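
The 90/10 result comes from scaling the existing shares down to make room for the new model. The sketch below works through that arithmetic in plain Python, assuming existing shares are rescaled proportionally and any rounding remainder is handed out one point at a time. `rescale_traffic` is a hypothetical helper for illustration; the SDK's exact rounding behavior may differ.

```python
from typing import Dict


def rescale_traffic(
    current_split: Dict[str, int], new_model_id: str, new_percentage: int
) -> Dict[str, int]:
    """Give `new_percentage` to the new model and scale existing shares to fit.

    Assumes `current_split` holds integer percentages that sum to 100.
    """
    if not 0 <= new_percentage <= 100:
        raise ValueError("new_percentage must be between 0 and 100")

    remaining = 100 - new_percentage
    # Integer floor keeps every share a whole percentage.
    new_split = {mid: pct * remaining // 100 for mid, pct in current_split.items()}

    # Flooring can under-allocate; hand the leftover points out one at a time.
    leftover = remaining - sum(new_split.values())
    for mid in new_split:
        if leftover == 0:
            break
        new_split[mid] += 1
        leftover -= 1

    new_split[new_model_id] = new_percentage
    return new_split


print(rescale_traffic({"fraud-v2": 100}, "fraud-v3-candidate", 10))
```

With two existing models at 50% each, sending 25% to a third leaves 38% and 37% for the originals, so the total still adds up to 100.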

Online Prediction

instances = [
    {"feature1": 1.0, "feature2": "category_a", "feature3": 42}
]
prediction = endpoint.predict(instances=instances)
print(prediction.predictions)

Batch Prediction

batch_job = model.batch_predict(
    job_display_name="batch-fraud-scoring",
    gcs_source="gs://bucket/input/batch_data.jsonl",
    gcs_destination_prefix="gs://bucket/output/",
    machine_type="n1-standard-4",
    max_replica_count=10,
    accelerator_count=0
)
batch_job.wait()

Model Monitoring

Models degrade silently as the real world drifts away from training data. Monitoring for feature skew and prediction drift catches these problems before they impact business metrics.

Monitoring Setup

from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

# Skew: compare serving inputs against the training data
skew_config = model_monitoring.SkewDetectionConfig(
    data_source="bq://project.dataset.training_data",
    skew_thresholds={"feature1": 0.3, "feature2": 0.2},
    target_field="target",  # assumption: name of the label column in the training table
)

# Drift: compare recent serving inputs against earlier serving windows
drift_config = model_monitoring.DriftDetectionConfig(
    drift_thresholds={"feature1": 0.3, "feature2": 0.2}
)

# Bundle both checks into one objective
objective_config = model_monitoring.ObjectiveConfig(
    skew_detection_config=skew_config,
    drift_detection_config=drift_config,
)

# Create monitoring job
monitor_job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="fraud-monitoring",
    endpoint=endpoint,
    # Log and analyze 10% of prediction requests
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.1),
    # monitor_interval is expressed in hours, so 1 means hourly
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),
    # A single ObjectiveConfig applies to every model deployed on the endpoint
    objective_configs=objective_config,
    alert_config=model_monitoring.EmailAlertConfig(user_emails=["team@company.com"]),
)

Monitoring Metrics

| Metric Type | What It Detects | Method |
| --- | --- | --- |
| Feature skew | Training/serving data mismatch | Distribution comparison (Jensen-Shannon) |
| Feature drift | Input distribution shift over time | Distribution comparison over windows |
| Prediction drift | Output distribution shift | Distribution comparison |
| Attribution drift | Feature importance shift | Explanation-based monitoring |
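
To give thresholds such as 0.3 some intuition, here is a minimal sketch of Jensen-Shannon divergence between two binned distributions, for example a feature's training histogram versus its serving histogram. It assumes both histograms share the same bins and uses base-2 logarithms so the score falls between 0 (identical) and 1 (no overlap). The bin counts are made up, and the service's own binning is not reproduced here.

```python
import math
from typing import List, Sequence


def normalize(counts: Sequence[float]) -> List[float]:
    """Turn raw bin counts into probabilities that sum to 1."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return [c / total for c in counts]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Kullback-Leibler divergence in bits; bins where p is 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p_counts: Sequence[float], q_counts: Sequence[float]) -> float:
    """Jensen-Shannon divergence between two histograms over the same bins."""
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must use the same bins")
    p, q = normalize(p_counts), normalize(q_counts)
    # The mixture m is non-zero wherever p or q is, so the KL terms stay finite.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical bin counts for one feature.
training_hist = [50, 30, 20]
serving_hist = [20, 30, 50]
score = js_divergence(training_hist, serving_hist)
print(f"JS divergence: {score:.3f}")
```

An alert would fire when the score for a feature exceeds the threshold configured for it.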

Compute Options

Choosing the wrong machine type wastes money on over-provisioning or causes OOM failures during training. Match your compute to your workload -- use GPU machines for training deep learning models, and right-size CPU machines for preprocessing, classical models, and serving.

| Machine Type | vCPUs | Memory | GPU | Use Case |
| --- | --- | --- | --- | --- |
| n1-standard-4 | 4 | 15 GB | Optional | Small models, preprocessing |
| n1-standard-16 | 16 | 60 GB | Optional | Medium models |
| n1-highmem-8 | 8 | 52 GB | Optional | Memory-intensive |
| a2-highgpu-1g | 12 | 85 GB | 1x A100 | Large model training |
| a2-highgpu-8g | 96 | 680 GB | 8x A100 | Distributed training |
| g2-standard-4 | 4 | 16 GB | 1x L4 | Inference, light training |
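
Right-sizing can be made explicit by encoding the table and picking the smallest machine that meets a workload's requirements. The sketch below does that in plain Python. Because the table carries no prices, "smallest" is approximated by ordering on GPU count, then memory, then vCPUs, which is an assumption and not a cost model. `MachineType` and `pick_machine` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MachineType:
    name: str
    vcpus: int
    memory_gb: int
    gpu_count: int  # GPUs that come built in; 0 means none by default


# Values copied from the table above.
MACHINE_TYPES = [
    MachineType("n1-standard-4", 4, 15, 0),
    MachineType("n1-standard-16", 16, 60, 0),
    MachineType("n1-highmem-8", 8, 52, 0),
    MachineType("a2-highgpu-1g", 12, 85, 1),
    MachineType("a2-highgpu-8g", 96, 680, 8),
    MachineType("g2-standard-4", 4, 16, 1),
]


def pick_machine(
    min_memory_gb: int = 0, min_vcpus: int = 0, min_gpus: int = 0
) -> Optional[MachineType]:
    """Return the smallest listed machine meeting all minimums, or None."""
    candidates = [
        m for m in MACHINE_TYPES
        if m.memory_gb >= min_memory_gb
        and m.vcpus >= min_vcpus
        and m.gpu_count >= min_gpus
    ]
    # Proxy for cost: fewer GPUs first, then less memory, then fewer vCPUs.
    candidates.sort(key=lambda m: (m.gpu_count, m.memory_gb, m.vcpus))
    return candidates[0] if candidates else None


print(pick_machine(min_memory_gb=40))
print(pick_machine(min_gpus=1))
```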

Common Pitfalls

Most Vertex AI failures come from skipping automation and monitoring -- the same mistakes that plague any ML platform. These shortcuts save time initially but create painful debugging sessions later.

| Pitfall | Problem | Fix |
| --- | --- | --- |
| No pipeline versioning | Can't reproduce results | Use pipeline templates with version tags |
| Manual deployments | Error-prone, slow | Automate via CI/CD pipeline |
| No monitoring | Silent model degradation | Set up skew/drift alerts |
| Over-provisioned endpoints | Wasted cost | Use autoscaling, right-size machines |
| Ignoring Feature Store | Feature recomputation, inconsistency | Centralize features |
| No A/B testing | Risky all-at-once deployments | Use traffic splitting |
| Training on local machine | Not reproducible | Use Vertex AI Training |
| Hardcoded project/location | Not portable | Use environment variables |
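
For the last pitfall, one way to avoid hardcoding is to read the project and location from the environment at startup and fail fast when they are missing. The sketch below assumes the variable names `GOOGLE_CLOUD_PROJECT` (a common convention) and `VERTEX_LOCATION` (hypothetical), with `us-central1` as an illustrative default. Adapt the names to whatever your deployment already sets.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class VertexSettings:
    project: str
    location: str


def load_settings(env: Mapping[str, str] = os.environ) -> VertexSettings:
    """Read project and location from environment variables.

    The variable names are assumptions; a missing project is an error
    so a job never runs against an unintended default project.
    """
    project = env.get("GOOGLE_CLOUD_PROJECT")
    if not project:
        raise RuntimeError("Set GOOGLE_CLOUD_PROJECT before running this job")
    location = env.get("VERTEX_LOCATION", "us-central1")
    return VertexSettings(project=project, location=location)


# Passing a mapping explicitly keeps the function easy to test.
settings = load_settings({"GOOGLE_CLOUD_PROJECT": "my-project"})
print(settings)
```

The result can then be passed to `aiplatform.init(project=settings.project, location=settings.location)` in place of literal strings.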