hdinsight-migration — by Microsoft
Install: `npx skills add https://github.com/microsoft/skills-for-fabric --skill hdinsight-migration`

**Update Check — ONCE PER SESSION (mandatory).** The first time this skill is used in a session, run the `check-updates` skill before proceeding.
- GitHub Copilot CLI / VS Code: invoke the `check-updates` skill.
- Claude Code / Cowork / Cursor / Windsurf / Codex: compare the local vs. remote `package.json` version.
- Skip if the check was already performed earlier in this session.
## CRITICAL NOTES
- To find workspace details (including its ID) from a workspace name: list all workspaces, then use JMESPath filtering
- To find item details (including its ID) from workspace ID, item type, and item name: list all items of that type in that workspace, then use JMESPath filtering
- HDInsight has no `mssparkutils` or `dbutils` equivalent — `notebookutils` is a net-new capability being introduced
- `HiveContext` and `SQLContext` are legacy Spark 1.x/2.x APIs — Fabric uses the Spark 3.x `SparkSession` exclusively
- `wasb://` paths are deprecated and require a Storage Account key or SAS — replace with OneLake shortcuts
# HDInsight → Microsoft Fabric Migration
## Prerequisite Knowledge
Read these companion documents before executing migration tasks:
- COMMON-CORE.md — Fabric REST API patterns, authentication, token audiences, item discovery
- COMMON-CLI.md — `az rest`, `az login`, token acquisition, Fabric REST via CLI
- SPARK-AUTHORING-CORE.md — Notebook deployment, lakehouse creation, Spark job execution
For notebook and Lakehouse creation, see spark-authoring-cli. For Fabric Warehouse DDL/DML authoring, see sqldw-authoring-cli.
## Table of Contents
| Topic | Reference |
|---|---|
| Migration Workload Map | § Migration Workload Map |
| SparkSession & Context API Changes | § SparkSession API Changes |
| WASB / ABFS → OneLake Path Migration | path-migration.md |
| Hive DDL → Delta Lake / Lakehouse Schemas | hive-to-delta.md |
| Oozie → Fabric Pipelines | § Oozie → Fabric Pipelines |
| Introducing notebookutils | § Introducing notebookutils |
| Before/After Code Patterns | code-patterns.md |
| Spark Configuration Differences | § Spark Configuration Differences |
| Must / Prefer / Avoid | § Must / Prefer / Avoid |
| Authentication & Token Acquisition | COMMON-CORE.md § Authentication |
| Lakehouse Management | SPARK-AUTHORING-CORE.md § Lakehouse Management |
## Migration Workload Map
| HDInsight Component | Fabric Target | Notes |
|---|---|---|
| Spark cluster (notebooks, scripts) | Fabric Spark (Lakehouse / Notebooks / SJD) | No persistent cluster — Starter Pool or Custom Pool provides on-demand Spark |
| Hive / HiveServer2 | Lakehouse SQL Endpoint + Lakehouse schemas | Delta Lake replaces Hive metastore; schemas provide namespace equivalent |
| HBase | Fabric Warehouse or Azure Cosmos DB (separate from Fabric) | HBase has no direct Fabric equivalent — assess workload access patterns |
| Oozie workflows | Fabric Data Pipelines | Map Oozie actions to Fabric activities; see § Oozie → Fabric Pipelines |
| YARN Resource Manager | Fabric Spark monitoring (Spark UI, Monitoring Hub) | No YARN — Fabric manages compute automatically |
| Ambari | Fabric Monitoring Hub + Admin Portal | Cluster health, capacity, and job monitoring |
| WASB / ABFS storage | OneLake Shortcuts → `abfss://<workspace>@onelake.dfs.fabric.microsoft.com/` | See path-migration.md |
| Ranger policies | Fabric workspace roles + OneLake data access roles | Map Ranger row/column filters to Lakehouse row-level security |
| Livy REST server | Fabric Livy API | Compatible endpoint — see SPARK-AUTHORING-CORE.md |
## SparkSession & Context API Changes
HDInsight Spark clusters often use legacy Spark 1.x / 2.x API styles. Replace all of these with the unified SparkSession:
| Legacy HDInsight Pattern | Fabric Spark 3.x Replacement |
|---|---|
| `from pyspark import SparkContext; sc = SparkContext()` | Not needed — `sc = spark.sparkContext` (pre-instantiated) |
| `from pyspark.sql import HiveContext; hc = HiveContext(sc)` | Not needed — the `spark` session has Hive-compatible SQL support via Delta schemas |
| `from pyspark.sql import SQLContext; sqlc = SQLContext(sc)` | Not needed — use `spark.sql(...)` directly |
| `SparkSession.builder.enableHiveSupport().getOrCreate()` | Not needed in Fabric — `spark` is pre-built and available |
| `sc.textFile("wasb://<container>@<account>.blob.core.windows.net/path")` | `spark.read.text("abfss://<workspace>@onelake.dfs.fabric.microsoft.com/lh.Lakehouse/Files/path")` |
| `sqlContext.sql("CREATE TABLE ... STORED AS ORC")` | See hive-to-delta.md for Delta DDL equivalent |
In Fabric notebooks, `spark` (SparkSession) and `sc` (SparkContext) are pre-instantiated — do not call `SparkContext()` or `SparkSession.builder...getOrCreate()` at the top of migrated notebooks.
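For example, a migrated notebook can use the provided objects directly (the schema and table names below are illustrative):

```python
# Fabric notebook: spark (SparkSession) and sc (SparkContext) already exist — no builder calls needed.
print(spark.version)        # pre-instantiated SparkSession
print(sc.applicationId)     # pre-instantiated SparkContext

# HiveContext / SQLContext usage collapses to spark.sql(...) against Lakehouse schemas.
df = spark.sql("SELECT order_id, amount FROM sales_db.fact_orders")  # illustrative table
df.show(5)
```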
## Oozie → Fabric Pipelines
Map Oozie workflow actions to Fabric Data Pipeline activities:
| Oozie Action Type | Fabric Pipeline Activity | Notes |
|---|---|---|
| `<spark>` action | Notebook activity or Spark Job Definition activity | Pass parameters via notebook cell parameters or SJD arguments |
| `<hive>` action | Script activity (SQL) against Lakehouse SQL Endpoint | Convert HiveQL to Spark SQL or Delta SQL |
| `<shell>` action | Azure Function activity or Web activity | Shell scripts must be refactored; no direct shell execution in Fabric Pipelines |
| `<java>` action | Azure Batch activity (external) or refactor to PySpark | Java MapReduce jobs must be rewritten |
| `<sqoop>` action | Copy Data activity (Fabric Data Factory connector) | Sqoop import/export maps to Fabric Copy Data with a JDBC source/sink |
| `<coordinator>` (time-based schedule) | Pipeline schedule trigger | Set recurrence in the pipeline trigger; supports cron-like expressions |
| `<coordinator>` (data-triggered) | Storage Event trigger | Trigger on OneLake file arrival |
Delegate to spark-authoring-cli for notebook and SJD creation after mapping pipeline activities.
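For the `<spark>` action mapping, parameter hand-off typically becomes a parameters cell in the migrated child notebook plus a parameterized invocation from the orchestrating notebook or the pipeline Notebook activity. A minimal sketch, assuming an illustrative child notebook named `ingest_orders` and a `run_date` parameter:

```python
# --- Child notebook "ingest_orders": a parameters cell replaces Oozie <arg>/<property> values ---
run_date = "1900-01-01"     # default; overridden by the caller or the Notebook activity's base parameters

# ... transformation logic ...
notebookutils.notebook.exit(f"ingested {run_date}")   # return a value to the caller

# --- Orchestrating notebook: run the child with parameters, replacing the Oozie workflow action ---
result = notebookutils.notebook.run("ingest_orders", 600, {"run_date": "2024-06-01"})
print(result)               # the value passed to notebookutils.notebook.exit above
```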
## Introducing notebookutils
HDInsight Spark had no built-in utility framework equivalent to mssparkutils or dbutils. When migrating to Fabric, introduce notebookutils for common operations:
| Operation | Old HDInsight Approach | notebookutils Equivalent |
|---|---|---|
| List files | dbutils (N/A) / HDFS CLI | notebookutils.fs.ls("abfss://...") |
| Copy file | HDFS API / shutil | notebookutils.fs.cp(src, dest) |
| Read secret | Azure Key Vault REST call | notebookutils.credentials.getSecret(keyVaultUrl, secretName) |
| Get notebook context | Not available | notebookutils.runtime.context — returns workspace ID, notebook ID, etc. |
| Run child notebook | Not available | notebookutils.notebook.run("notebook_name", timeout, {"param": "value"}) |
| Exit notebook with value | sys.exit() | notebookutils.notebook.exit("value") |
| Mount storage | WASB config in spark-defaults.conf | OneLake Shortcut (no runtime mount needed) |
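A short sketch of several table entries run inside a Fabric notebook — the Key Vault URL, secret name, and file paths are placeholders, and relative `Files/...` paths assume a default lakehouse is attached:

```python
# File system operations against OneLake — no mount, storage key, or HDFS CLI required.
for entry in notebookutils.fs.ls("Files/raw/orders/"):
    print(entry.name, entry.size)

notebookutils.fs.cp("Files/raw/orders/part-0000.parquet",
                    "Files/staging/orders/part-0000.parquet")

# Secret retrieval replaces hand-rolled Key Vault REST calls (placeholder vault/secret names).
jdbc_password = notebookutils.credentials.getSecret("https://my-kv.vault.azure.net/", "jdbc-password")

# Runtime context: workspace ID, notebook ID, and related metadata.
print(notebookutils.runtime.context)
```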
## Spark Configuration Differences
| HDInsight Concept | Fabric Spark Equivalent | Migration Action |
|---|---|---|
| `spark-defaults.conf` (cluster-wide) | Fabric Spark Workspace Settings + Environment item | Move config properties to an Environment or use `%%configure` in notebooks |
| `%%configure` magic | `%%configure` magic — identical | No change needed |
| YARN queue / resource allocation | Fabric Spark pool node size and autoscale settings | Map queue SLAs to Custom Pool configuration |
| Ambari service configs (HDFS, YARN tuning) | Not applicable — Fabric manages infrastructure | Remove; focus on application-level Spark configs |
| HDI Spark version (e.g., Spark 2.4) | Fabric Runtime 1.3 = Spark 3.5 (latest) | Test for deprecated API removals (e.g., HiveContext, RDD-style ML) |
| Conda environment / `bootstrap.sh` | Fabric Environment item with custom libraries | Recreate conda/pip dependencies in a Fabric Environment |
| `hive-site.xml` (metastore connection) | Not needed — Delta Lake IS the metastore in Fabric | Remove metastore config; use Lakehouse schemas for namespace organization |
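Where cluster-level `spark-defaults.conf` properties are genuinely application-specific, they can move into a session-level `%%configure` cell instead of an Environment. A minimal sketch — the property values below are illustrative placeholders, not tuning recommendations:

```
%%configure
{
    "driverMemory": "28g",
    "executorMemory": "28g",
    "executorCores": 4,
    "conf": {
        "spark.sql.shuffle.partitions": "200"
    }
}
```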
## Must / Prefer / Avoid
### MUST DO
- Replace all `wasb://` / `wasbs://` paths with OneLake `abfss://` paths or OneLake Shortcuts — `wasb://` requires storage account keys, which are not the Fabric-preferred auth model
- Replace `HiveContext`, `SQLContext`, and standalone `SparkContext()` — use the pre-instantiated `spark` session in Fabric notebooks
- Migrate Hive DDL (`STORED AS ORC`, `LOCATION`, `TBLPROPERTIES`) to Delta Lake DDL — see hive-to-delta.md
- Introduce `notebookutils` for file system operations, secret retrieval, and child notebook orchestration where HDInsight used custom scripts or direct API calls
- Replace Oozie XML workflows with Fabric Data Pipelines — see § Oozie → Fabric Pipelines
- Align library management to Fabric Environments — remove `bootstrap.sh`, conda envs, and runtime `%pip install` patterns for production workloads
### PREFER
- OneLake Shortcuts over copying data — mount existing ADLS Gen2 containers as shortcuts to avoid re-ingestion during migration
- Delta Lake for all tables migrated from Hive ORC/Parquet — ACID guarantees, time travel, and schema enforcement improve data quality
- Fabric Starter Pool for initial migration validation — no pool configuration overhead, fast session startup
- Lakehouse schemas (database namespaces) for organizing migrated Hive databases — one schema per Hive database within a single Lakehouse
- Medallion architecture for restructuring migrated data layers during migration — align Bronze/Silver/Gold with raw Hive → validated Delta → serving Gold patterns
### AVOID
- Do not use `SparkContext()` or `HiveContext()` constructors in Fabric notebooks — they conflict with the pre-instantiated `spark` session and will raise errors
- Do not use `hive-site.xml` or external Hive metastore configuration — Fabric's Delta Lake-backed Lakehouse IS the metastore
- Do not assume YARN queue mappings translate to Fabric pools — re-design resource allocation based on Fabric Spark pool SLAs
- Do not attempt to run Oozie shell actions or Java MapReduce jobs directly in Fabric — these must be refactored (see § Oozie → Fabric Pipelines)
- Do not use `%sh` magic for file system operations in production notebooks — use `notebookutils.fs.*` for portability and OneLake token-based auth (see the sketch after this list)
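As an illustration of the last point, a shell-based cleanup step might translate as follows (the path is a placeholder; relative `Files/...` paths assume a default lakehouse is attached):

```python
# HDInsight: shell magic against HDFS/WASB (avoid in Fabric)
#   %%sh
#   hdfs dfs -rm -r /data/staging/tmp

# Fabric: notebookutils file system API, authenticated with the user's OneLake token
notebookutils.fs.rm("Files/staging/tmp", True)   # recurse=True
```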
## Examples
See code-patterns.md for full before/after examples. Key quick references:
### Legacy context → Fabric pre-instantiated session

```python
# HDInsight (remove entirely)
from pyspark.sql import HiveContext
hc = HiveContext(sc)

# Fabric — use the pre-instantiated spark directly
df = spark.sql("SELECT * FROM sales.fact_orders")
```
### WASB path → OneLake path (after shortcut creation)

```python
# HDInsight
df = spark.read.parquet("wasb://<container>@<account>.blob.core.windows.net/orders/")

# Fabric
df = spark.read.parquet("Files/raw/orders/")
```
### Hive DDL → Delta DDL

```sql
-- HDInsight
CREATE TABLE sales_db.fact_orders (...) STORED AS ORC LOCATION 'wasb://...';

-- Fabric
CREATE SCHEMA IF NOT EXISTS sales_db;
CREATE TABLE sales_db.fact_orders (...) USING DELTA;
```