Enterprises that standardized on Snowflake or the AWS analytics stack — Redshift, Glue, EMR, Step Functions, and PySpark — are increasingly evaluating Google Cloud Platform as their primary or secondary cloud. The reasons vary: multi-cloud strategy mandates, Google's leadership in AI/ML with Vertex AI and Gemini, BigQuery's serverless economics, contractual renegotiation leverage, or acquisition-driven platform consolidation. Whatever the driver, the technical challenge is the same: how do you migrate thousands of Snowflake SQL scripts, Snowpark Python programs, PySpark jobs, Glue ETL pipelines, and Redshift stored procedures to BigQuery, Dataform, Dataflow, and Cloud Composer without a multi-year manual rewrite?
This guide provides a comprehensive technical mapping of every major Snowflake and AWS/Spark concept to its GCP equivalent, demonstrates how MigryX automates the conversion with parser-driven accuracy, and outlines a practical migration methodology that minimizes risk and accelerates time to production.
Why Move from Snowflake or AWS to GCP?
Before diving into the technical mappings, it is worth examining the strategic drivers that make this migration path increasingly common across enterprise data organizations.
Serverless Economics at Scale
Snowflake's credit-based pricing model charges for compute time across virtual warehouses, with costs that scale directly with warehouse size and concurrency. Organizations running dozens of warehouses across development, staging, and production environments often find costs escalating faster than data volumes. BigQuery's on-demand pricing charges only for data scanned by queries — there are no idle compute costs, no warehouse sizing decisions, and no auto-suspend/resume complexity. For workloads with variable query patterns, the elimination of always-on warehouse costs can reduce analytics spend by 40-60%.
Similarly, the AWS analytics stack requires provisioning and managing multiple services independently: Redshift clusters with node sizing, EMR clusters with instance types and autoscaling policies, Glue job DPU allocations, and Step Functions state machine execution costs. BigQuery consolidates storage, compute, transformation, and orchestration into a single serverless platform with a unified cost model.
AI/ML Integration
Among the major clouds, Google Cloud's AI/ML ecosystem is the most deeply integrated with its analytics platform. BigQuery ML allows model training, evaluation, and prediction using SQL — directly on warehouse data without export. Vertex AI provides a full ML platform for custom models with native BigQuery connectors. Gemini models are callable from BigQuery through remote model endpoints (for example, ML.GENERATE_TEXT over a Vertex AI connection). Snowflake's Cortex and AWS SageMaker both require more infrastructure management and data movement to achieve comparable ML integration.
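As a minimal sketch of what SQL-native ML looks like in practice (the project, dataset, table, and column names below are hypothetical), a churn model can be trained and scored entirely inside BigQuery:
-- Train a logistic regression model directly on warehouse data
CREATE OR REPLACE MODEL `project.dataset.churn_model`
OPTIONS (
  model_type = 'LOGISTIC_REG',
  input_label_cols = ['churned']
) AS
SELECT tenure_months, monthly_spend, support_tickets, churned
FROM `project.dataset.customer_features`;
-- Score new records with the trained model
SELECT customer_id, predicted_churned, predicted_churned_probs
FROM ML.PREDICT(
  MODEL `project.dataset.churn_model`,
  TABLE `project.dataset.new_customers`
);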
Multi-Cloud and Data Sovereignty
BigQuery Omni allows federated queries across data stored in AWS S3 and Azure Blob Storage without data movement. Organizations can query Snowflake-exported Parquet files in S3 directly from BigQuery during a phased migration, reducing the pressure for a big-bang cutover. This capability enables a gradual transition where GCP becomes the primary analytics platform while legacy data remains in existing cloud storage.
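For illustration, once a BigQuery Omni connection exists in the relevant AWS region (and the dataset lives in that Omni location), Snowflake-exported Parquet files can be exposed as an external table; the connection, bucket, and table names here are placeholders:
CREATE EXTERNAL TABLE `project.legacy_aws.snowflake_orders_export`
WITH CONNECTION `aws-us-east-1.s3-read-conn`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://legacy-analytics-exports/orders/*.parquet']
);
-- Queried like any other BigQuery table, without moving the data
SELECT order_status, COUNT(*) AS orders
FROM `project.legacy_aws.snowflake_orders_export`
GROUP BY order_status;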
Snowflake to BigQuery migration — automated end-to-end by MigryX
Architecture Mapping: Snowflake to BigQuery
Snowflake and BigQuery share several architectural principles — separation of storage and compute, columnar storage, and SQL-first interfaces — making the migration conceptually straightforward. The differences lie in compute management, SQL dialect, and ecosystem tooling.
| Snowflake Concept | BigQuery Equivalent | Migration Notes |
|---|---|---|
| Virtual Warehouse | Slots (on-demand or flat-rate) | No warehouse sizing needed; BigQuery auto-allocates compute per query |
| Database / Schema | Project / Dataset | Snowflake database → BigQuery project; Snowflake schema → BigQuery dataset |
| Stage (internal/external) | Google Cloud Storage (GCS) bucket | External tables or LOAD DATA from GCS replaces Snowflake stages |
| Snowpipe | Pub/Sub BigQuery subscriptions / Storage Write API | Streaming ingestion replaces Snowpipe auto-ingest |
| Tasks & Streams | Dataform + Cloud Composer / Scheduled Queries | Tasks → Dataform models with scheduling; Streams → CDC via BigQuery change history |
| Stored Procedures (JavaScript/SQL) | BigQuery Stored Procedures (SQL/JavaScript) | Near-direct mapping; BigQuery supports procedural SQL and JavaScript UDFs |
| UDFs (SQL/JavaScript/Java/Python) | BigQuery UDFs (SQL/JavaScript) + Remote Functions | SQL and JS UDFs map directly; Python/Java UDFs migrate to Remote Functions via Cloud Run (example below the table) |
| Snowpark (Python/Scala/Java) | BigQuery DataFrames / Dataform / Dataflow | Snowpark Python → BigQuery DataFrames API or Dataform SQLX |
| Dynamic Tables | Dataform incremental models / Materialized Views | Declarative pipeline definitions replace Dynamic Tables |
| Time Travel (up to 90 days) | Time travel (up to 7 days, FOR SYSTEM_TIME AS OF) + table snapshots | BigQuery time travel is shorter; use scheduled snapshots for longer retention (example below the table) |
| Snowflake Cortex (ML) | BigQuery ML + Vertex AI | BigQuery ML for SQL-based ML; Vertex AI for advanced custom models |
| Data Sharing / Marketplace | Analytics Hub / BigQuery data sharing | Analytics Hub provides cross-organization data exchange |
| Iceberg Tables | BigLake / BigQuery external tables on Iceberg | BigLake provides a unified interface for Iceberg, Delta, and Hudi formats |
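Two of the rows above translate into concrete BigQuery DDL: a Snowflake Python or Java UDF becomes a remote function backed by a Cloud Run endpoint, and retention beyond the 7-day time-travel window is handled with snapshot tables. The connection, endpoint, and table names in this sketch are placeholders:
-- Remote function replacing a Snowflake Python/Java UDF
CREATE OR REPLACE FUNCTION `project.dataset.normalize_address`(addr STRING)
RETURNS STRING
REMOTE WITH CONNECTION `project.us.cloud-run-conn`
OPTIONS (endpoint = 'https://normalize-address-abc123-uc.a.run.app');
-- Snapshot table to extend retention beyond the time-travel window
CREATE SNAPSHOT TABLE `project.dataset.orders_snapshot_2026_01_01`
CLONE `project.dataset.orders`
OPTIONS (
  expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 365 DAY)
);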
Snowflake SQL to BigQuery SQL: Key Dialect Differences
While both platforms support ANSI SQL, there are dialect-specific differences that require systematic translation. MigryX's SQL transpiler handles these automatically through AST-level transformation rules.
-- Snowflake SQL: Common patterns that require translation
-- 1. QUALIFY clause (supported by both platforms with identical syntax)
SELECT customer_id, order_date, amount
FROM orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;
-- 2. FLATTEN for semi-structured data
SELECT
f.value:name::STRING AS feature_name,
f.value:score::FLOAT AS feature_score
FROM ml_predictions,
LATERAL FLATTEN(input => prediction_features) f;
-- 3. VARIANT / OBJECT / ARRAY types
SELECT
raw_json:customer.name::STRING AS customer_name,
raw_json:customer.addresses[0].city::STRING AS primary_city
FROM raw_events;
-- BigQuery SQL equivalent
-- 1. QUALIFY (identical syntax in BigQuery)
SELECT customer_id, order_date, amount
FROM `project.dataset.orders`
QUALIFY ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;
-- 2. UNNEST replaces FLATTEN
SELECT
feature.name AS feature_name,
feature.score AS feature_score
FROM `project.dataset.ml_predictions`,
UNNEST(prediction_features) AS feature;
-- 3. JSON functions replace VARIANT path notation
SELECT
JSON_VALUE(raw_json, '$.customer.name') AS customer_name,
JSON_VALUE(raw_json, '$.customer.addresses[0].city') AS primary_city
FROM `project.dataset.raw_events`;
Complete Snowflake SQL Function Mapping
| Snowflake Function | BigQuery Equivalent | Notes |
|---|---|---|
| DATEADD(day, 7, dt) | DATE_ADD(dt, INTERVAL 7 DAY) | Argument order differs |
| DATEDIFF(day, a, b) | DATE_DIFF(b, a, DAY) | Argument order reversed |
| TO_DATE(str, fmt) | PARSE_DATE(fmt, str) | Format strings differ (Snowflake: YYYY-MM-DD; BigQuery: %Y-%m-%d) |
| TO_TIMESTAMP(str) | PARSE_TIMESTAMP(fmt, str) | BigQuery requires explicit format |
| IFF(cond, a, b) | IF(cond, a, b) | Direct equivalent |
| NVL(a, b) | COALESCE(a, b) or IFNULL(a, b) | COALESCE supports multiple arguments |
| NVL2(a, b, c) | IF(a IS NOT NULL, b, c) | No direct NVL2 equivalent |
| ZEROIFNULL(x) | COALESCE(x, 0) | No direct equivalent; use COALESCE or IFNULL with 0 |
| TRY_CAST(x AS type) | SAFE_CAST(x AS type) | Same behavior — returns NULL on failure |
| FLATTEN() | UNNEST() | LATERAL FLATTEN → CROSS JOIN UNNEST |
| PARSE_JSON(str) | PARSE_JSON(str) (JSON type) | Field access via JSON_VALUE()/JSON_QUERY() instead of : path notation |
| OBJECT_CONSTRUCT() | STRUCT() / TO_JSON() | STRUCT for typed objects; TO_JSON for JSON strings |
| ARRAY_AGG() | ARRAY_AGG() | Identical syntax |
| LISTAGG(col, ',') | STRING_AGG(col, ',') | Direct equivalent |
| GET_PATH(v, 'a.b') | JSON_VALUE(v, '$.a.b') | JSONPath syntax in BigQuery |
| GENERATOR(ROWCOUNT => n) | GENERATE_ARRAY(1, n) with UNNEST | BigQuery uses array generation |
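The last row is the least obvious mapping, so a quick sketch helps: Snowflake's row generator (commonly used to build calendar or sequence dimensions) becomes an unnested array in BigQuery.
-- Snowflake: 10-row sequence (ROW_NUMBER avoids gaps in SEQ4)
SELECT ROW_NUMBER() OVER (ORDER BY SEQ4()) AS n
FROM TABLE(GENERATOR(ROWCOUNT => 10));
-- BigQuery: equivalent via GENERATE_ARRAY + UNNEST
SELECT n
FROM UNNEST(GENERATE_ARRAY(1, 10)) AS n;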
Snowpark Python to BigQuery DataFrames
Snowpark allows Python, Scala, and Java code to execute directly on Snowflake's compute engine using a DataFrame API. BigQuery provides an equivalent with the BigQuery DataFrames library (bigframes), which runs Python DataFrame operations as BigQuery jobs without data leaving the warehouse.
# Snowpark Python: DataFrame operations on Snowflake
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, count, sum as sf_sum, avg, when, lit
session = Session.builder.configs(connection_params).create()
orders = session.table("analytics.orders")
customers = session.table("analytics.customers")
# Join, aggregate, filter
result = (
orders
.join(customers, orders.customer_id == customers.customer_id)
.group_by(customers.segment, customers.region)
.agg(
sf_sum(orders.revenue).alias("total_revenue"),
avg(orders.revenue).alias("avg_revenue"),
col("*").count().alias("order_count")
)
.filter(col("total_revenue") > 10000)
.sort(col("total_revenue").desc())
)
result.write.save_as_table("analytics.segment_summary", mode="overwrite")
# BigQuery DataFrames (bigframes): Equivalent operations on BigQuery
import bigframes.pandas as bpd
bpd.options.bigquery.project = "my-gcp-project"
orders = bpd.read_gbq("project.dataset.orders")
customers = bpd.read_gbq("project.dataset.customers")
# Join, aggregate, filter (pandas-like API, executes as BigQuery jobs)
merged = orders.merge(customers, on="customer_id")
result = (
merged
.groupby(["segment", "region"])
.agg(
total_revenue=("revenue", "sum"),
avg_revenue=("revenue", "mean"),
order_count=("revenue", "count")
)
.reset_index()
.query("total_revenue > 10000")
.sort_values("total_revenue", ascending=False)
)
result.to_gbq("project.dataset.segment_summary", if_exists="replace")
BigQuery DataFrames uses a pandas-compatible API that translates operations to BigQuery SQL behind the scenes. Data never leaves BigQuery — the Python code runs locally as a query planner, but all computation happens on BigQuery's serverless engine. This makes it a natural migration path for Snowpark Python code.
MigryX: Idiomatic Code, Not Line-by-Line Translation
The difference between MigryX and manual migration is not just speed — it is code quality. MigryX generates idiomatic, platform-optimized code that leverages native features of your target platform. A SAS DATA step does not become a clunky row-by-row loop — it becomes a clean, vectorized DataFrame operation. A PROC SQL query does not become a literal translation — it becomes an optimized query that takes advantage of your platform’s pushdown capabilities.
Architecture Mapping: AWS Analytics Stack to GCP
Migrating from AWS is more complex than Snowflake because the AWS analytics ecosystem is disaggregated — multiple services (Redshift, Glue, EMR, Step Functions, S3, Athena, Lake Formation) must be mapped to their GCP equivalents, and the interaction patterns between services change.
| AWS Service | GCP Equivalent | Migration Notes |
|---|---|---|
| Amazon Redshift | BigQuery | Redshift SQL → BigQuery SQL; distribution/sort keys → partitioning/clustering |
| AWS Glue ETL (PySpark) | Dataform (SQL) / Dataflow (beam) / Dataproc (Spark) | SQL-convertible Glue jobs → Dataform; complex PySpark → Dataproc or Dataflow |
| AWS Glue Catalog | BigQuery metastore / Dataplex Catalog | Schema registry and table definitions migrate to BigQuery datasets |
| Amazon EMR (PySpark/Spark SQL) | Dataproc (managed Spark) / BigQuery | Spark SQL → BigQuery SQL; PySpark DataFrames → BigQuery DataFrames or Dataproc |
| AWS Step Functions | Cloud Composer (Airflow) / Cloud Workflows | State machine JSON → Airflow DAGs (Python); simple sequences → Cloud Workflows (YAML) |
| Amazon S3 | Google Cloud Storage (GCS) | Direct transfer via Storage Transfer Service; path mappings update in all pipelines |
| Amazon Athena | BigQuery (external tables) | Athena SQL → BigQuery SQL; Athena tables on S3 → BigQuery external tables on GCS |
| AWS Lake Formation | Dataplex | Column-level security, data quality, lineage → Dataplex policies and quality rules |
| Amazon SageMaker | Vertex AI + BigQuery ML | Simple models → BigQuery ML (SQL); complex models → Vertex AI |
| Amazon Kinesis | Pub/Sub + Dataflow | Kinesis streams → Pub/Sub topics; Kinesis Analytics → Dataflow streaming |
| AWS Lambda | Cloud Functions / Cloud Run | Event-driven functions map directly; container-based → Cloud Run |
| Amazon MWAA (Airflow) | Cloud Composer (Airflow) | DAG Python files transfer directly; operator imports change to GCP equivalents |
Redshift SQL to BigQuery SQL
Amazon Redshift uses a PostgreSQL-derived SQL dialect with extensions for distribution, sorting, and Redshift-specific functions. BigQuery SQL is ANSI-compliant with its own extensions. The key translation patterns affect data types, distribution strategy, and function names.
-- Redshift SQL: Table definition with distribution and sort keys
CREATE TABLE analytics.customer_orders (
order_id BIGINT IDENTITY(1,1),
customer_id INTEGER NOT NULL ENCODE lzo,
order_date DATE NOT NULL ENCODE delta,
product_category VARCHAR(100) ENCODE bytedict,
quantity INTEGER ENCODE mostly8,
revenue DECIMAL(12,2) ENCODE az64,
discount_pct FLOAT,
shipping_region VARCHAR(50),
order_status VARCHAR(20)
)
DISTKEY(customer_id)
SORTKEY(order_date, customer_id)
;
-- Redshift query with Redshift-specific functions
SELECT
customer_id,
DATE_TRUNC('month', order_date) AS order_month,
NVL(product_category, 'Unknown') AS category,
COUNT(*) AS order_count,
SUM(revenue) AS total_revenue,
MEDIAN(revenue) AS median_revenue,
LISTAGG(DISTINCT order_status, ', ')
WITHIN GROUP (ORDER BY order_status) AS statuses,
GETDATE() AS run_timestamp
FROM analytics.customer_orders
WHERE order_date >= DATEADD(year, -2, GETDATE())
GROUP BY 1, 2, 3
HAVING SUM(revenue) > 1000
ORDER BY total_revenue DESC;
-- BigQuery SQL equivalent
CREATE OR REPLACE TABLE `project.dataset.customer_orders` (
order_id INT64 NOT NULL,
customer_id INT64 NOT NULL,
order_date DATE NOT NULL,
product_category STRING,
quantity INT64,
revenue NUMERIC(12,2),
discount_pct FLOAT64,
shipping_region STRING,
order_status STRING
)
PARTITION BY order_date
CLUSTER BY customer_id, product_category
;
-- BigQuery query equivalent
SELECT
customer_id,
DATE_TRUNC(order_date, MONTH) AS order_month,
COALESCE(product_category, 'Unknown') AS category,
COUNT(*) AS order_count,
SUM(revenue) AS total_revenue,
APPROX_QUANTILES(revenue, 2)[OFFSET(1)] AS median_revenue,
STRING_AGG(DISTINCT order_status, ', '
ORDER BY order_status) AS statuses,
CURRENT_TIMESTAMP() AS run_timestamp
FROM `project.dataset.customer_orders`
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 2 YEAR)
GROUP BY 1, 2, 3
HAVING SUM(revenue) > 1000
ORDER BY total_revenue DESC;
AWS Glue PySpark to BigQuery SQL / Dataform
AWS Glue ETL jobs are PySpark programs that use Glue's DynamicFrame API or standard PySpark DataFrames. The majority of Glue ETL logic — joins, filters, aggregations, type conversions, and deduplication — translates directly to BigQuery SQL or Dataform SQLX models. Complex transformations that require Python logic can migrate to Dataflow (Apache Beam) or Dataproc (managed Spark).
# AWS Glue ETL job: PySpark with DynamicFrame
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, sum as spark_sum, avg, current_date, datediff, when
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# Read from Glue Catalog
orders_dyf = glueContext.create_dynamic_frame.from_catalog(
database="analytics_db", table_name="raw_orders"
)
customers_dyf = glueContext.create_dynamic_frame.from_catalog(
database="analytics_db", table_name="customers"
)
# Convert to DataFrame for complex operations
orders_df = orders_dyf.toDF()
customers_df = customers_dyf.toDF()
# Join, transform, aggregate
result = (
orders_df
.join(customers_df, "customer_id", "left")
.filter(col("order_date") >= "2025-01-01")
.withColumn("days_since_order",
datediff(current_date(), col("order_date")))
.groupBy("segment", "region")
.agg(
spark_sum("revenue").alias("total_revenue"),
avg("revenue").alias("avg_order_value"),
spark_sum(
when(col("days_since_order") <= 90, col("revenue"))
.otherwise(0)
).alias("recent_revenue")
)
.filter(col("total_revenue") > 5000)
)
# Write to Redshift
glueContext.write_dynamic_frame.from_options(
frame=DynamicFrame.fromDF(result, glueContext, "output"),
connection_type="redshift",
connection_options={"dbtable": "analytics.segment_summary", ...}
)
-- BigQuery SQL / Dataform SQLX equivalent
-- File: models/gold/segment_summary.sqlx
config {
type: "table",
schema: "gold",
description: "Segment revenue summary with recent activity",
bigquery: {
partitionBy: "DATE(calculated_at)",
clusterBy: ["segment", "region"]
},
assertions: {
nonNull: ["segment", "region", "total_revenue"],
rowConditions: ["total_revenue > 0"]
}
}
SELECT
c.segment,
c.region,
SUM(o.revenue) AS total_revenue,
AVG(o.revenue) AS avg_order_value,
SUM(
CASE WHEN DATE_DIFF(CURRENT_DATE(), o.order_date, DAY) <= 90
THEN o.revenue ELSE 0 END
) AS recent_revenue,
CURRENT_TIMESTAMP() AS calculated_at
FROM ${ref("silver", "orders")} o
LEFT JOIN ${ref("silver", "customers")} c
ON o.customer_id = c.customer_id
WHERE o.order_date >= '2025-01-01'
GROUP BY c.segment, c.region
HAVING SUM(o.revenue) > 5000
MigryX's PySpark parser analyzes Glue ETL jobs at the AST level, identifying DataFrame operations (joins, filters, aggregations, window functions) and converting them to equivalent BigQuery SQL. Operations that are purely SQL-expressible — which covers 80-90% of typical Glue ETL logic — become Dataform SQLX models. Complex Python logic that cannot be expressed in SQL is flagged for manual review or converted to Dataflow pipelines.
AWS Step Functions to Cloud Composer
AWS Step Functions define workflows as state machines in JSON (Amazon States Language). Cloud Composer uses Apache Airflow DAGs defined in Python. The migration converts state machine definitions to Airflow task dependencies, replacing AWS-specific SDK integrations with GCP operator equivalents.
// AWS Step Functions: State machine definition (ASL JSON)
{
"StartAt": "ExtractRawData",
"States": {
"ExtractRawData": {
"Type": "Task",
"Resource": "arn:aws:glue:us-east-1:123456789:job/extract-raw",
"Next": "CheckDataQuality"
},
"CheckDataQuality": {
"Type": "Task",
"Resource": "arn:aws:lambda:us-east-1:123456789:function:dq-check",
"Next": "QualityGate"
},
"QualityGate": {
"Type": "Choice",
"Choices": [
{
"Variable": "$.quality_passed",
"BooleanEquals": true,
"Next": "TransformAndLoad"
}
],
"Default": "SendAlert"
},
"TransformAndLoad": {
"Type": "Task",
"Resource": "arn:aws:glue:us-east-1:123456789:job/transform-load",
"Next": "UpdateDashboard"
},
"UpdateDashboard": {
"Type": "Task",
"Resource": "arn:aws:lambda:...:refresh-dashboard",
"End": true
},
"SendAlert": {
"Type": "Task",
"Resource": "arn:aws:sns:...:data-quality-alerts",
"End": true
}
}
}
# Cloud Composer (Airflow) DAG equivalent
from airflow import DAG
from airflow.operators.python import BranchPythonOperator
from airflow.providers.google.cloud.operators.bigquery import (
BigQueryInsertJobOperator
)
from airflow.providers.google.cloud.operators.dataform import (
DataformCreateCompilationResultOperator,
DataformCreateWorkflowInvocationOperator
)
from airflow.operators.email import EmailOperator
from datetime import datetime
dag = DAG(
"analytics_pipeline",
schedule_interval="0 6 * * *",
start_date=datetime(2026, 1, 1),
catchup=False,
)
extract_raw = BigQueryInsertJobOperator(
task_id="extract_raw_data",
configuration={
"query": {
"query": "CALL `project.ops.extract_raw_data`()",
"useLegacySql": False,
}
},
dag=dag,
)
check_quality = BigQueryInsertJobOperator(
task_id="check_data_quality",
configuration={
"query": {
"query": "CALL `project.ops.data_quality_check`()",
"useLegacySql": False,
}
},
dag=dag,
)
def quality_gate(**context):
    # Assumes the quality-check step publishes its result dict to XCom
    # (BigQueryInsertJobOperator by itself returns only the job reference).
    ti = context["ti"]
    result = ti.xcom_pull(task_ids="check_data_quality")
    if result and result.get("quality_passed"):
        return "compile_dataform"
    return "send_alert"
quality_branch = BranchPythonOperator(
task_id="quality_gate",
python_callable=quality_gate,
dag=dag,
)
compile_dataform = DataformCreateCompilationResultOperator(
    task_id="compile_dataform",
    project_id="my-project",
    region="us-central1",
    repository_id="analytics-repo",
    compilation_result={"git_commitish": "main"},
    dag=dag,
)
transform_load = DataformCreateWorkflowInvocationOperator(
    task_id="transform_and_load",
    project_id="my-project",
    region="us-central1",
    repository_id="analytics-repo",
    workflow_invocation={
        "compilation_result": (
            "{{ task_instance.xcom_pull('compile_dataform')['name'] }}"
        )
    },
    dag=dag,
)
send_alert = EmailOperator(
task_id="send_alert",
to="data-team@company.com",
subject="Data Quality Check Failed",
html_content="Quality gate failed. Check Dataform logs.",
dag=dag,
)
extract_raw >> check_quality >> quality_branch
quality_branch >> [compile_dataform, send_alert]
compile_dataform >> transform_load
MigryX precision parser — Deep AST-level analysis ensures every construct is understood before conversion begins
Platform-Specific Optimization by MigryX
MigryX maintains deep knowledge of every target platform’s strengths and best practices. When converting to Snowflake, it leverages Snowpark and native SQL functions. When targeting Databricks, it uses PySpark DataFrame operations optimized for distributed execution. When generating dbt models, it follows dbt best practices for modularity and testability. This platform awareness is what makes MigryX output production-ready from day one.
How MigryX Automates the Conversion
Manual migration from Snowflake or AWS to BigQuery requires developers to learn two SQL dialects, understand the semantic differences between hundreds of functions, and rewrite every script, stored procedure, and pipeline definition by hand. For organizations with thousands of Snowflake scripts or hundreds of Glue ETL jobs, this approach is impractical — it takes years, introduces human error, and the source platform continues to accumulate new code during the migration.
MigryX automates this conversion through a multi-stage parser-driven pipeline that operates at the Abstract Syntax Tree (AST) level, not through text-based find-and-replace.
Parser-Driven SQL Transpilation
MigryX parses source SQL (Snowflake SQL, Redshift SQL, Spark SQL, Athena/Presto SQL) into a language-independent AST, applies transformation rules to convert dialect-specific constructs to BigQuery equivalents, and emits syntactically correct BigQuery SQL. This approach handles edge cases that regex-based tools miss: nested function calls, correlated subqueries, complex CASE expressions, window function framing, and multi-statement procedures.
- Function mapping: 200+ Snowflake and Redshift functions automatically mapped to BigQuery equivalents, including argument reordering (DATEADD, DATEDIFF), format string translation (TO_DATE → PARSE_DATE), and semantic equivalents (NVL2 → IF expression)
- Data type translation: Snowflake VARIANT/OBJECT/ARRAY to BigQuery JSON/STRUCT/ARRAY; Redshift ENCODE directives removed; IDENTITY columns to BigQuery sequences or GENERATE_UUID()
- DDL conversion: DISTKEY/SORTKEY to PARTITION BY/CLUSTER BY; Snowflake TRANSIENT tables to BigQuery with appropriate expiration; external table definitions remapped from S3 to GCS paths
- Procedural code: Snowflake JavaScript stored procedures to BigQuery JavaScript UDFs or stored procedures; Snowflake SQL scripting (DECLARE, SET, IF/ELSE, LOOP) to BigQuery scripting
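As an illustration of the procedural mapping, the sketch below shows the kind of BigQuery scripting output that corresponds to a simple Snowflake Scripting block (declare a cutoff, adjust it with IF, then DELETE); table and variable names are hypothetical:
DECLARE v_cutoff DATE DEFAULT DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY);
-- Clamp the cutoff, mirroring the source procedure's IF/ELSE logic
IF v_cutoff < DATE '2025-01-01' THEN
  SET v_cutoff = DATE '2025-01-01';
END IF;
DELETE FROM `project.dataset.stale_sessions`
WHERE session_date < v_cutoff;
SELECT FORMAT('Purged sessions older than %t', v_cutoff) AS status;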
PySpark and Glue ETL Conversion
MigryX's PySpark parser reads Glue ETL scripts and EMR PySpark programs, builds an operation graph from DataFrame API calls, and determines whether each operation is SQL-expressible. SQL-convertible operations — filters, joins, aggregations, window functions, type casts, and conditional expressions — become Dataform SQLX models. Operations requiring Python logic (custom UDFs, ML model scoring, complex string parsing) are flagged with conversion recommendations.
- DynamicFrame operations: Glue-specific APIs (ApplyMapping, ResolveChoice, Relationalize) translated to BigQuery SQL equivalents
- Spark SQL queries: spark.sql() calls extracted and transpiled through the SQL transpilation engine
- DataFrame API chains: Method chains (.filter().join().groupBy().agg()) converted to equivalent SQL with CTEs
- Read/write operations: S3 paths remapped to GCS; Glue Catalog references converted to BigQuery dataset.table references; Redshift JDBC writes converted to BigQuery table writes
Column-Level Lineage and STTM Generation
For every converted script, MigryX generates column-level lineage documentation — a Source-to-Target Mapping (STTM) that traces each output column back to its source columns through every transformation step. This lineage is critical for migration validation, regulatory compliance (SOX, GDPR, Basel III), and post-migration governance in Dataplex.
On-Premise Deployment
MigryX deploys entirely within the customer's environment. Source code — Snowflake scripts, PySpark programs, Redshift stored procedures — never leaves the customer's network. The conversion engine runs as a containerized application on the customer's infrastructure (on-premise servers, private cloud VMs, or air-gapped environments), ensuring full compliance with data sovereignty and intellectual property requirements.
Migration Methodology: A Phased Approach
Migrating from an established Snowflake or AWS analytics platform to GCP requires a structured methodology that balances speed with risk management. The following four-phase approach has been proven across enterprise migrations.
Phase 1: Discovery and Assessment (2-4 weeks)
- Inventory all Snowflake objects (databases, schemas, tables, views, stored procedures, UDFs, tasks, streams, pipes) or AWS resources (Redshift clusters, Glue jobs, EMR notebooks, Step Functions, S3 data)
- Run MigryX analysis to generate complexity scores, dependency graphs, and conversion feasibility reports for every object
- Identify critical path workloads and quick-win candidates
- Map security policies, roles, and access controls to GCP IAM equivalents
- Establish BigQuery project structure, dataset naming conventions, and Dataform repository layout
Phase 2: Automated Conversion (4-8 weeks)
- Run MigryX batch conversion on all in-scope SQL scripts, stored procedures, and PySpark jobs
- Review conversion output: auto-converted code (typically 85-95% of total), flagged items requiring manual adjustment, and unconvertible patterns
- Convert orchestration: Snowflake Tasks → Dataform scheduling + Cloud Composer; Step Functions → Airflow DAGs
- Set up data transfer: Snowflake stages or S3 buckets → GCS via Storage Transfer Service
- Configure Dataplex governance policies mirroring existing Snowflake RBAC or Lake Formation policies
Phase 3: Validation and Parallel Run (4-6 weeks)
- Execute converted workloads on BigQuery alongside existing Snowflake/AWS pipelines
- Automated row-count, checksum, and statistical validation between source and target outputs (see the sketch after this list)
- Performance benchmarking: compare query execution times and costs
- User acceptance testing with downstream consumers (dashboards, reports, ML models)
- Iterate on flagged items and edge cases identified during validation
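A minimal sketch of the checksum-style comparison: the same aggregate profile is computed on the source platform (in its own dialect) and on BigQuery, and the two result sets are diffed. Table and column names below are placeholders.
-- BigQuery side; run the dialect-equivalent query on Snowflake/Redshift and compare
SELECT
  COUNT(*) AS row_count,
  COUNT(DISTINCT customer_id) AS distinct_customers,
  SUM(revenue) AS revenue_sum,
  MIN(order_date) AS first_order,
  MAX(order_date) AS last_order
FROM `project.dataset.customer_orders`;
-- Order-independent fingerprint for regression checks between parallel-run dates (BigQuery only)
SELECT BIT_XOR(FARM_FINGERPRINT(TO_JSON_STRING(t))) AS table_fingerprint
FROM `project.dataset.customer_orders` AS t;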
Phase 4: Cutover and Decommission (2-4 weeks)
- Switch production workloads to BigQuery pipelines
- Redirect downstream applications and dashboards to BigQuery endpoints
- Run Snowflake/AWS in read-only mode for rollback period (typically 30 days)
- Decommission source platform resources after rollback window closes
- Final governance review: verify lineage, access controls, and audit logging in Dataplex
Cost Comparison: Snowflake + AWS vs. BigQuery
| Cost Category | Snowflake + AWS | BigQuery + GCP |
|---|---|---|
| Compute | Snowflake credits (warehouse-hours) + EMR instance-hours + Glue DPU-hours | BigQuery slots (on-demand per-TB or flat-rate) — single billing model |
| Storage | Snowflake storage + S3 storage + Redshift managed storage | BigQuery storage (single rate, auto long-term discount after 90 days) |
| ETL / Orchestration | Glue job runs + Step Functions state transitions + MWAA environment | Dataform (included) + Cloud Composer (managed Airflow) |
| Data Transfer | Cross-region/cross-service data movement charges | Intra-GCP transfers free; BigQuery ↔ GCS ↔ Dataflow is zero-cost within region |
| ML Platform | SageMaker notebook + training + endpoint hosting | BigQuery ML (included) + Vertex AI (pay-per-use) |
| Governance | Lake Formation + manual catalog tools | Dataplex (included with BigQuery) |
| Idle Costs | Warehouses accrue credits while running idle before auto-suspend and bill a 60-second minimum on each resume; EMR clusters incur costs while idle | Zero idle cost — no infrastructure to suspend/resume |
Organizations running Snowflake with multi-cluster warehouses and AWS EMR/Glue pipelines typically see 30-50% cost reduction after migrating to BigQuery, driven primarily by the elimination of idle compute costs, the consolidation of multiple service charges into a single billing model, and BigQuery's automatic columnar compression reducing storage costs.
Key Takeaways
- Snowflake SQL translates to BigQuery SQL with systematic function mapping (DATEADD → DATE_ADD, FLATTEN → UNNEST, VARIANT path → JSON_VALUE), data type conversion (VARIANT → JSON/STRUCT), and DDL restructuring (stages → GCS, warehouses → serverless slots).
- Snowpark Python migrates to BigQuery DataFrames (bigframes) using a pandas-compatible API that executes operations as BigQuery jobs without data leaving the warehouse.
- AWS Redshift SQL converts to BigQuery SQL with distribution/sort key removal (replaced by partition/cluster), function renaming (NVL → COALESCE, GETDATE → CURRENT_TIMESTAMP), and encoding directive elimination.
- AWS Glue PySpark ETL jobs convert to Dataform SQLX models for SQL-expressible logic (80-90% of typical jobs) and to Dataproc or Dataflow for complex Python transformations.
- AWS Step Functions state machines convert to Cloud Composer (Airflow) DAGs with GCP-native operators replacing AWS SDK integrations.
- MigryX automates conversion at the AST level with 95%+ parser accuracy, handling 200+ function mappings, procedural code translation, and DDL restructuring automatically.
- Column-level lineage (STTM) is generated for every converted object, enabling regulatory compliance and post-migration governance in Dataplex.
- BigQuery's serverless architecture eliminates idle compute costs, warehouse sizing decisions, and multi-service billing complexity present in Snowflake and AWS architectures.
- BigQuery Omni enables federated queries on S3 data during migration, allowing a phased transition without a big-bang cutover.
- On-premise MigryX deployment ensures source code never leaves the customer environment, meeting data sovereignty and IP protection requirements.
Migrating from Snowflake or the AWS analytics stack to Google Cloud BigQuery is not a lateral move — it is an architectural simplification. The disaggregated AWS toolchain (Redshift + Glue + EMR + Step Functions + S3 + Athena + Lake Formation + SageMaker) and Snowflake's credit-based compute model are replaced by a single serverless platform that handles storage, compute, transformation, orchestration, governance, and machine learning. For organizations evaluating this path, MigryX eliminates the manual rewrite bottleneck by automating SQL transpilation, PySpark conversion, lineage documentation, and validation — reducing a multi-year manual effort to weeks of automated conversion followed by focused validation and tuning.
Why MigryX Delivers Superior Migration Results
The challenges described throughout this article are exactly what MigryX was built to solve. Here is how MigryX transforms this process:
- Production-ready output: MigryX generates code that passes code review and runs in production — not prototype-quality output that needs weeks of cleanup.
- Platform optimization: Converted code leverages target platform-specific features for maximum performance and cost efficiency.
- 25+ source technologies: Whether migrating from SAS, Informatica, DataStage, SSIS, or any of 25+ legacy technologies, MigryX handles it.
- Automated documentation: Every conversion decision is documented with before/after code mappings and transformation rationale.
MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.
Ready to migrate from Snowflake or AWS to BigQuery?
See how MigryX automates conversion of Snowflake SQL, Snowpark Python, Redshift SQL, Glue PySpark, and Step Functions to production-ready BigQuery SQL, Dataform, and Cloud Composer pipelines.
Explore BigQuery Migration Schedule a Demo