Quick Reference 12

BigQuery ML

Quick reference for BQML syntax, supported model types, feature engineering, evaluation, and export.

7 min readGoogle CloudQuick ReferenceDownload PDF

What Is BigQuery ML?

BQML eliminates the biggest friction in enterprise ML: moving data out of the warehouse and into a separate training environment. BigQuery ML lets you create, train, evaluate, and predict with ML models using SQL directly in BigQuery -- no data movement, no separate ML infrastructure.

Core Syntax

These three SQL statements -- CREATE MODEL, ML.PREDICT, and ML.EVALUATE -- are the backbone of every BQML workflow. Master them and you can train, evaluate, and deploy models without leaving SQL.

CREATE MODEL

CREATE OR REPLACE MODEL `project.dataset.model_name`
OPTIONS(
  model_type = 'LOGISTIC_REG',          -- required
  input_label_cols = ['target_column'],   -- required for supervised
  auto_class_weights = TRUE,              -- optional: handle imbalanced classes
  max_iterations = 20,                    -- optional: training iterations
  learn_rate = 0.1,                       -- optional: learning rate
  l1_reg = 0.001,                         -- optional: L1 regularization
  l2_reg = 0.001,                         -- optional: L2 regularization
  early_stop = TRUE,                      -- optional: stop when no improvement
  data_split_method = 'AUTO_SPLIT'        -- optional: train/eval split
) AS
SELECT
  feature1,
  feature2,
  feature3,
  target_column
FROM `project.dataset.training_table`
WHERE date BETWEEN '2025-01-01' AND '2025-12-31';

ML.PREDICT

-- Batch prediction
SELECT *
FROM ML.PREDICT(
  MODEL `project.dataset.model_name`,
  (SELECT feature1, feature2, feature3
   FROM `project.dataset.new_data`)
);

-- With threshold for classification
SELECT *,
  CASE WHEN predicted_target_column_probs[OFFSET(1)].prob > 0.7
       THEN 'positive' ELSE 'negative'
  END AS high_confidence_label
FROM ML.PREDICT(
  MODEL `project.dataset.model_name`,
  (SELECT * FROM `project.dataset.new_data`)
);

ML.EVALUATE

-- Overall model evaluation
SELECT *
FROM ML.EVALUATE(
  MODEL `project.dataset.model_name`,
  (SELECT * FROM `project.dataset.eval_data`)
);

-- Evaluate with specific threshold
SELECT *
FROM ML.EVALUATE(
  MODEL `project.dataset.model_name`,
  (SELECT * FROM `project.dataset.eval_data`),
  STRUCT(0.5 AS threshold)
);

Supported Model Types

BQML supports everything from simple linear regression to imported TensorFlow models and remote LLM calls. Knowing which model_type value to use saves you from digging through documentation mid-query.

Model Typemodel_type ValueTaskNotes
Linear RegressionLINEAR_REGRegressionContinuous target
Logistic RegressionLOGISTIC_REGBinary/multi-classDefault for classification
K-MeansKMEANSClusteringUnsupervised
Matrix FactorizationMATRIX_FACTORIZATIONRecommendationUser-item interactions
PCAPCADimensionality reductionUnsupervised
XGBoostBOOSTED_TREE_CLASSIFIERClassificationGradient boosting
XGBoostBOOSTED_TREE_REGRESSORRegressionGradient boosting
Random ForestRANDOM_FOREST_CLASSIFIERClassificationEnsemble trees
Random ForestRANDOM_FOREST_REGRESSORRegressionEnsemble trees
DNNDNN_CLASSIFIERClassificationDeep neural network
DNNDNN_REGRESSORRegressionDeep neural network
Wide & DeepDNN_LINEAR_COMBINED_CLASSIFIERClassificationWide + deep combo
ARIMA PlusARIMA_PLUSTime series forecastAutomatic seasonality
AutoML TablesAUTOML_CLASSIFIERClassificationAutomatic architecture
AutoML TablesAUTOML_REGRESSORRegressionAutomatic architecture
TensorFlow (imported)TENSORFLOWAnyImport SavedModel
ONNX (imported)ONNXAnyImport ONNX model
LLM (remote)CLOUD_AI_LLM_V1Text generationGemini, PaLM

Model Selection Guide

When you are not sure which model type to pick, default to boosted trees for tabular data -- they consistently outperform other BQML options on structured datasets.

ScenarioRecommended ModelWhy
Quick binary classificationLOGISTIC_REGFast, interpretable
Tabular classification (best accuracy)BOOSTED_TREE_CLASSIFIERUsually best for tabular
Regression with many featuresBOOSTED_TREE_REGRESSORHandles non-linearity
Customer segmentationKMEANSUnsupervised grouping
Sales forecastingARIMA_PLUSTime series native
Recommendation engineMATRIX_FACTORIZATIONCollaborative filtering
Don't know what to pickAUTOML_CLASSIFIER/REGRESSORLet BQML decide
Use existing TF modelTENSORFLOWImport and serve

Feature Engineering in BQML

BQML's TRANSFORM clause lets you define feature engineering in SQL, and it automatically applies the same transforms at prediction time. This eliminates training-serving skew -- the most common cause of model performance drops in production.

Built-in Transforms

CREATE OR REPLACE MODEL `project.dataset.my_model`
TRANSFORM(
  -- Numeric: standardize
  ML.STANDARD_SCALER(age) OVER() AS age_scaled,
  ML.MIN_MAX_SCALER(income) OVER() AS income_scaled,

  -- Numeric: bucketize
  ML.BUCKETIZE(age, [18, 25, 35, 50, 65]) AS age_bucket,

  -- Quantile bucketize
  ML.QUANTILE_BUCKETIZE(income, 10) OVER() AS income_decile,

  -- Categorical: one-hot (automatic for STRING/BOOL)
  category,

  -- Text: TF-IDF or bag of words
  ML.NGRAMS(description, [1, 2]) AS description_ngrams,

  -- Polynomial
  ML.POLYNOMIAL_EXPAND(STRUCT(feature1, feature2), 2) AS poly_features,

  -- Feature cross
  ML.FEATURE_CROSS(STRUCT(city, device_type)) AS city_device,

  -- Label
  target_column
)
OPTIONS(model_type='BOOSTED_TREE_CLASSIFIER', input_label_cols=['target_column'])
AS SELECT * FROM `project.dataset.training_data`;

Common Feature Patterns

PatternSQL ExampleUse Case
Date partsEXTRACT(DAYOFWEEK FROM date)Cyclical patterns
Ratiosclicks / NULLIF(impressions, 0)Rate metrics
Lag featuresLAG(value, 7) OVER(ORDER BY date)Time series
Rolling avgAVG(value) OVER(ORDER BY date ROWS 7 PRECEDING)Smoothing
Log transformLN(value + 1)Skewed distributions
Null handlingIFNULL(value, 0)Missing data
String lengthLENGTH(text_field)Text features
Count encodingSubquery with COUNT(*) per categoryHigh cardinality

Time Series with ARIMA_PLUS

ARIMA_PLUS is BQML's most powerful built-in model -- it automatically handles seasonality, holidays, spike cleaning, and multiple time series in a single query. For most business forecasting tasks, it is the fastest path to production-quality predictions.

-- Create time series model
CREATE OR REPLACE MODEL `project.dataset.sales_forecast`
OPTIONS(
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'date',
  time_series_data_col = 'daily_sales',
  time_series_id_col = 'store_id',  -- multiple series
  auto_arima = TRUE,
  data_frequency = 'DAILY',
  holiday_region = 'US',
  clean_spikes_and_dips = TRUE,
  adjust_step_changes = TRUE
) AS
SELECT date, store_id, daily_sales
FROM `project.dataset.historical_sales`;

-- Forecast future values
SELECT *
FROM ML.FORECAST(
  MODEL `project.dataset.sales_forecast`,
  STRUCT(30 AS horizon, 0.9 AS confidence_level)
);

-- Explain forecast components
SELECT *
FROM ML.EXPLAIN_FORECAST(
  MODEL `project.dataset.sales_forecast`,
  STRUCT(30 AS horizon, 0.9 AS confidence_level)
);

LLM Integration in BQML

BQML can call Gemini and other Vertex AI models directly from SQL, enabling text generation and embedding at warehouse scale. This lets you enrich millions of rows with LLM outputs without building a separate pipeline.

-- Create a remote model connection to Gemini
CREATE OR REPLACE MODEL `project.dataset.gemini_model`
REMOTE WITH CONNECTION `project.us.vertex-ai-connection`
OPTIONS(endpoint = 'gemini-2.0-flash');

-- Use for text generation
SELECT *
FROM ML.GENERATE_TEXT(
  MODEL `project.dataset.gemini_model`,
  (SELECT
    CONCAT('Summarize this review: ', review_text) AS prompt
   FROM `project.dataset.reviews`
   LIMIT 100),
  STRUCT(
    0.2 AS temperature,
    1024 AS max_output_tokens,
    TRUE AS flatten_json_output
  )
);

-- Use for embeddings
CREATE OR REPLACE MODEL `project.dataset.embedding_model`
REMOTE WITH CONNECTION `project.us.vertex-ai-connection`
OPTIONS(endpoint = 'text-embedding-004');

SELECT *
FROM ML.GENERATE_EMBEDDING(
  MODEL `project.dataset.embedding_model`,
  (SELECT content AS content FROM `project.dataset.documents`),
  STRUCT(TRUE AS flatten_json_output)
);

Evaluation Metrics by Model Type

ML.EVALUATE returns different metrics depending on your model type. Knowing which metrics to expect -- and which evaluation queries to run -- prevents confusion when interpreting model performance.

Model TypeMetrics Returned by ML.EVALUATE
Classificationprecision, recall, accuracy, f1_score, log_loss, roc_auc
Regressionmean_absolute_error, mean_squared_error, r2_score, mean_squared_log_error
Clusteringdavies_bouldin_index, mean_squared_distance
Time SeriesAIC, variance, log_likelihood, seasonal_periods

Useful Evaluation Queries

-- Confusion matrix
SELECT *
FROM ML.CONFUSION_MATRIX(
  MODEL `project.dataset.model_name`,
  (SELECT * FROM `project.dataset.eval_data`)
);

-- ROC curve
SELECT *
FROM ML.ROC_CURVE(
  MODEL `project.dataset.model_name`,
  (SELECT * FROM `project.dataset.eval_data`)
);

-- Feature importance (tree models)
SELECT *
FROM ML.FEATURE_IMPORTANCE(MODEL `project.dataset.model_name`);

-- Feature info
SELECT *
FROM ML.FEATURE_INFO(MODEL `project.dataset.model_name`);

-- Model weights (linear/logistic)
SELECT *
FROM ML.WEIGHTS(MODEL `project.dataset.model_name`);

Model Export

BQML models trained in BigQuery can be exported and deployed to Vertex AI endpoints for low-latency serving. Not all model types support export -- check this table before planning your deployment strategy.

-- Export to Cloud Storage
EXPORT MODEL `project.dataset.model_name`
OPTIONS(uri = 'gs://bucket/exported_model/');
Model TypeExport FormatDeploy To
Linear/LogisticTensorFlow SavedModelVertex AI, TF Serving
Boosted TreeBooster + TF SavedModelVertex AI, XGBoost
DNNTensorFlow SavedModelVertex AI, TF Serving
ARIMA_PLUSNot exportableUse ML.FORECAST in BQ
AutoMLNot directlyUse Vertex AI

When to Use BQML

BQML shines when your data is already in BigQuery and your team thinks in SQL. It is not the right tool for every ML problem -- know its strengths and limitations before committing.

Use BQML When

  • Data already lives in BigQuery
  • Team is SQL-proficient but not ML-expert
  • Need quick prototyping or baseline models
  • Tabular data classification/regression
  • Time series forecasting
  • Want to avoid data movement

Do NOT Use BQML When

  • Need custom neural architectures
  • Working with images, audio, or video (train from scratch)
  • Need real-time online learning
  • Require model interpretability beyond built-in methods
  • Training budget is very tight (BQML can be expensive at scale)

Cost Considerations

BQML charges for both data processing and ML compute slots, and costs can escalate quickly with large datasets on on-demand pricing. Understand the pricing model before running CREATE MODEL on your full dataset.

OperationPricing
CREATE MODELCharged per bytes processed + ML training slot time
ML.PREDICTCharged per bytes processed
ML.EVALUATECharged per bytes processed
Model storageStandard BQ storage rates
On-demand training~$250/slot/hour for ML
Flat-rate (editions)Included in Enterprise/Enterprise Plus

Common Pitfalls

BQML's SQL simplicity makes it easy to train models quickly but also easy to make mistakes that go unnoticed. These pitfalls are especially common for SQL practitioners new to ML.

PitfallProblemFix
No eval splitOverfitting not detectedUse data_split_method='AUTO_SPLIT'
All features as-isPoor model performanceUse TRANSFORM for scaling, encoding
Ignoring class imbalanceBiased predictionsSet auto_class_weights=TRUE
SELECT * for trainingIncludes ID columns, data leakageExplicitly select features
No feature importanceCan't explain modelRun ML.FEATURE_IMPORTANCE
Forecasting without cleaningBad forecasts from outliersSet clean_spikes_and_dips=TRUE
Large training data with on-demandHigh costUse flat-rate pricing or sample data
Not exporting for productionStuck in BQ for servingExport and deploy to Vertex AI endpoint