hdinsight-migration

Author: microsoft

npx skills add https://github.com/microsoft/skills-for-fabric --skill hdinsight-migration

Update Check — ONCE PER SESSION (mandatory) The first time this skill is used in a session, run the check-updates skill before proceeding.

  • GitHub Copilot CLI / VS Code: invoke the check-updates skill.
  • Claude Code / Cowork / Cursor / Windsurf / Codex: compare local vs remote package.json version.
  • Skip if the check was already performed earlier in this session.

CRITICAL NOTES

  1. To find workspace details (including its ID) from a workspace name: list all workspaces, then use JMESPath filtering
  2. To find item details (including its ID) from workspace ID, item type, and item name: list all items of that type in that workspace, then use JMESPath filtering
  3. HDInsight has no mssparkutils or dbutils equivalent — notebookutils is a net-new capability introduced during migration
  4. HiveContext and SQLContext are legacy Spark 1.x/2.x APIs — Fabric uses Spark 3.x SparkSession exclusively
  5. wasb:// paths are deprecated and require a Storage Account key or SAS — replace with OneLake shortcuts
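
Notes 1 and 2 describe a list-then-filter lookup. The following is a minimal sketch of that filter in plain Python, assuming the list response is a JSON object with a `value` array whose entries carry `id`, `displayName`, and `type` fields (an assumption about the payload shape; confirm it against COMMON-CORE.md). The helper name `find_by_name` and the sample IDs are hypothetical. A roughly equivalent JMESPath expression would be `value[?displayName=='Sales'] | [0]`.

```python
import json

def find_by_name(list_response, display_name, item_type=None):
    """Return the first entry whose displayName (and optionally type) matches, else None."""
    for entry in list_response.get("value", []):
        if entry.get("displayName") != display_name:
            continue
        if item_type is not None and entry.get("type") != item_type:
            continue
        return entry
    return None

# Hypothetical payload standing in for a "list workspaces" response.
sample = json.loads("""
{"value": [
  {"id": "00000000-0000-0000-0000-000000000001", "displayName": "Finance", "type": "Workspace"},
  {"id": "00000000-0000-0000-0000-000000000002", "displayName": "Sales", "type": "Workspace"}
]}
""")

workspace = find_by_name(sample, "Sales")
workspace_id = workspace["id"] if workspace else None
```

The same helper covers note 2 when `item_type` is passed, for example `find_by_name(items, "orders_lh", item_type="Lakehouse")` against a list-items response.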

HDInsight → Microsoft Fabric Migration

Prerequisite Knowledge

Read these companion documents before executing migration tasks:

  • COMMON-CORE.md — Fabric REST API patterns, authentication, token audiences, item discovery
  • COMMON-CLI.md — az rest, az login, token acquisition, Fabric REST via CLI
  • SPARK-AUTHORING-CORE.md — Notebook deployment, lakehouse creation, Spark job execution

For notebook and Lakehouse creation, see spark-authoring-cli. For Fabric Warehouse DDL/DML authoring, see sqldw-authoring-cli.


Table of Contents

| Topic | Reference |
| --- | --- |
| Migration Workload Map | § Migration Workload Map |
| SparkSession & Context API Changes | § SparkSession API Changes |
| WASB / ABFS → OneLake Path Migration | path-migration.md |
| Hive DDL → Delta Lake / Lakehouse Schemas | hive-to-delta.md |
| Oozie → Fabric Pipelines | § Oozie → Fabric Pipelines |
| Introducing notebookutils | § Introducing notebookutils |
| Before/After Code Patterns | code-patterns.md |
| Spark Configuration Differences | § Spark Configuration Differences |
| Must / Prefer / Avoid | § Must / Prefer / Avoid |
| Authentication & Token Acquisition | COMMON-CORE.md § Authentication |
| Lakehouse Management | SPARK-AUTHORING-CORE.md § Lakehouse Management |

Migration Workload Map

| HDInsight Component | Fabric Target | Notes |
| --- | --- | --- |
| Spark cluster (notebooks, scripts) | Fabric Spark (Lakehouse / Notebooks / SJD) | No persistent cluster — Starter Pool or Custom Pool provides on-demand Spark |
| Hive / HiveServer2 | Lakehouse SQL Endpoint + Lakehouse schemas | Delta Lake replaces Hive metastore; schemas provide namespace equivalent |
| HBase | Fabric Warehouse or Azure Cosmos DB (separate from Fabric) | HBase has no direct Fabric equivalent — assess workload access patterns |
| Oozie workflows | Fabric Data Pipelines | Map Oozie actions to Fabric activities; see § Oozie → Fabric Pipelines |
| YARN Resource Manager | Fabric Spark monitoring (Spark UI, Monitoring Hub) | No YARN — Fabric manages compute automatically |
| Ambari | Fabric Monitoring Hub + Admin Portal | Cluster health, capacity, and job monitoring |
| WASB / ABFS storage | OneLake Shortcuts | abfss://[email protected]/ (see path-migration.md) |
| Ranger policies | Fabric workspace roles + OneLake data access roles | Map Ranger row/column filters to Lakehouse row-level security |
| Livy REST server | Fabric Livy API | Compatible endpoint — see SPARK-AUTHORING-CORE.md |

SparkSession & Context API Changes

HDInsight Spark clusters often use legacy Spark 1.x / 2.x API styles. Replace all of these with the unified SparkSession:

| Legacy HDInsight Pattern | Fabric Spark 3.x Replacement |
| --- | --- |
| from pyspark import SparkContext; sc = SparkContext() | Not needed — sc = spark.sparkContext (pre-instantiated) |
| from pyspark.sql import HiveContext; hc = HiveContext(sc) | Not needed — spark session has Hive-compatible SQL support via Delta schemas |
| from pyspark.sql import SQLContext; sqlc = SQLContext(sc) | Not needed — use spark.sql(...) directly |
| SparkSession.builder.enableHiveSupport().getOrCreate() | Not needed in Fabric — spark is pre-built and available |
| sc.textFile("wasb://[email protected]/path") | spark.read.text("abfss://[email protected]/lh.Lakehouse/Files/path") |
| sqlContext.sql("CREATE TABLE ... STORED AS ORC") | See hive-to-delta.md for Delta DDL equivalent |

In Fabric notebooks, spark (SparkSession) and sc (SparkContext) are pre-instantiated — do not call SparkContext() or SparkSession.builder...getOrCreate() at the top of migrated notebooks.
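
Before removing these constructs by hand, it can help to locate them across exported notebook sources. The following is a rough sketch of such a scan using only the standard library; the function name `find_legacy_usage`, the pattern labels, and the sample source are hypothetical, and the comment stripping is deliberately naive (it does not understand `#` inside string literals).

```python
import re

# Patterns derived from the legacy constructs listed in this section.
LEGACY_PATTERNS = {
    "SparkContext() constructor": re.compile(r"\bSparkContext\s*\("),
    "HiveContext": re.compile(r"\bHiveContext\s*\("),
    "SQLContext": re.compile(r"\bSQLContext\s*\("),
    "SparkSession.builder": re.compile(r"\bSparkSession\s*\.\s*builder\b"),
    "wasb:// or wasbs:// path": re.compile(r"\bwasbs?://"),
}

def find_legacy_usage(source):
    """Return (line_number, label, stripped_line) for every legacy pattern hit."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        code = line.split("#", 1)[0]  # naive comment strip
        for label, pattern in LEGACY_PATTERNS.items():
            if pattern.search(code):
                hits.append((lineno, label, line.strip()))
    return hits

# Hypothetical notebook cell exported from an HDInsight cluster.
sample = '''from pyspark import SparkContext
from pyspark.sql import HiveContext
sc = SparkContext()
hc = HiveContext(sc)
df = spark.sql("SELECT 1")
'''

hits = find_legacy_usage(sample)
```

Only constructor calls are flagged here, so the import lines pass silently; they become dead code once the calls are removed and can be deleted in the same pass.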


Oozie → Fabric Pipelines

Map Oozie workflow actions to Fabric Data Pipeline activities:

| Oozie Action Type | Fabric Pipeline Activity | Notes |
| --- | --- | --- |
| <spark> action | Notebook activity or Spark Job Definition activity | Pass parameters via notebook cell parameters or SJD arguments |
| <hive> action | Script activity (SQL) against Lakehouse SQL Endpoint | Convert HiveQL to Spark SQL or Delta SQL |
| <shell> action | Azure Function activity or Web activity | Shell scripts must be refactored; no direct shell execution in Fabric Pipelines |
| <java> action | Azure Batch activity (external) or refactor to PySpark | Java MapReduce jobs must be rewritten |
| <sqoop> action | Copy Data activity (Fabric Data Factory connector) | Sqoop import/export maps to Fabric Copy Data with JDBC source/sink |
| <coordinator> (time-based schedule) | Pipeline schedule trigger | Set recurrence in pipeline trigger; supports cron-like expressions |
| <coordinator> (data-triggered) | Storage Event trigger | Trigger on OneLake file arrival |

Delegate to spark-authoring-cli for notebook and SJD creation after mapping pipeline activities.
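
To apply the mapping above to a real workflow, one option is to inventory the actions in each workflow.xml first. The sketch below parses a workflow with the standard library and reports the suggested Fabric activity per action. The sample workflow, its namespace versions, and the names `plan_activities` and `OOZIE_TO_FABRIC` are hypothetical; the mapping strings are copied from the table in this section.

```python
import xml.etree.ElementTree as ET

# Keys are Oozie action element names; values come from the mapping table.
OOZIE_TO_FABRIC = {
    "spark": "Notebook activity or Spark Job Definition activity",
    "hive": "Script activity (SQL) against Lakehouse SQL Endpoint",
    "shell": "Azure Function activity or Web activity",
    "java": "Azure Batch activity (external) or refactor to PySpark",
    "sqoop": "Copy Data activity (Fabric Data Factory connector)",
}

def local_name(tag):
    """Strip an XML namespace prefix: '{uri}name' -> 'name'."""
    return tag.rsplit("}", 1)[-1]

def plan_activities(workflow_xml):
    """Return (action_name, oozie_type, fabric_activity) for each <action>."""
    root = ET.fromstring(workflow_xml.strip())
    plan = []
    for element in root.iter():
        if local_name(element.tag) != "action":
            continue
        for child in element:
            kind = local_name(child.tag)
            if kind in ("ok", "error"):
                continue  # transitions, not the action body
            target = OOZIE_TO_FABRIC.get(kind, "UNMAPPED: review manually")
            plan.append((element.get("name"), kind, target))
            break
    return plan

# Hypothetical two-step workflow used only for illustration.
sample = """
<workflow-app name="daily-load" xmlns="uri:oozie:workflow:0.5">
  <start to="ingest"/>
  <action name="ingest">
    <sqoop xmlns="uri:oozie:sqoop-action:0.4"><command>import</command></sqoop>
    <ok to="transform"/><error to="fail"/>
  </action>
  <action name="transform">
    <spark xmlns="uri:oozie:spark-action:0.2"><name>t</name></spark>
    <ok to="end"/><error to="fail"/>
  </action>
  <kill name="fail"><message>failed</message></kill>
  <end name="end"/>
</workflow-app>
"""

plan = plan_activities(sample)
```

Action types missing from the table (for example `fs` or `email`) come back as "UNMAPPED: review manually" rather than being guessed.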


Introducing notebookutils

HDInsight Spark had no built-in utility framework equivalent to mssparkutils or dbutils. When migrating to Fabric, introduce notebookutils for common operations:

| Operation | Old HDInsight Approach | notebookutils Equivalent |
| --- | --- | --- |
| List files | dbutils (N/A) / HDFS CLI | notebookutils.fs.ls("abfss://...") |
| Copy file | HDFS API / shutil | notebookutils.fs.cp(src, dest) |
| Read secret | Azure Key Vault REST call | notebookutils.credentials.getSecret(keyVaultUrl, secretName) |
| Get notebook context | Not available | notebookutils.runtime.context — returns workspace ID, notebook ID, etc. |
| Run child notebook | Not available | notebookutils.notebook.run("notebook_name", timeout, {"param": "value"}) |
| Exit notebook with value | sys.exit() | notebookutils.notebook.exit("value") |
| Mount storage | WASB config in spark-defaults.conf | OneLake Shortcut (no runtime mount needed) |
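
Because notebookutils exists only inside a Fabric Spark session, one way to keep migrated helper code testable elsewhere is to pass the utility object in as an argument. The sketch below wraps two of the operations listed above that way. The wrapper names, the default timeout, and the `fake` stand-in are hypothetical, and it assumes the entries returned by `fs.ls` expose `path` and `isDir` attributes.

```python
from types import SimpleNamespace

def list_data_files(nbutils, folder):
    """Return paths of non-directory entries directly under `folder`."""
    return [entry.path for entry in nbutils.fs.ls(folder) if not entry.isDir]

def run_child(nbutils, notebook_name, params, timeout_seconds=600):
    """Run a child notebook and return whatever it passed to notebook.exit()."""
    return nbutils.notebook.run(notebook_name, timeout_seconds, params)

# Stand-in object for dry runs outside Fabric; inside a Fabric notebook,
# pass the built-in `notebookutils` object instead.
fake = SimpleNamespace(
    fs=SimpleNamespace(ls=lambda folder: [
        SimpleNamespace(path=folder + "/a.parquet", isDir=False),
        SimpleNamespace(path=folder + "/archive", isDir=True),
    ]),
    notebook=SimpleNamespace(
        run=lambda name, timeout, params: f"{name}:{params['run_date']}"
    ),
)

files = list_data_files(fake, "Files/raw")
result = run_child(fake, "load_orders", {"run_date": "2024-01-01"})
```

In a Fabric notebook the calls would read `list_data_files(notebookutils, "Files/raw")` and `run_child(notebookutils, "load_orders", {...})`, with no stand-in involved.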

Spark Configuration Differences

| HDInsight Concept | Fabric Spark Equivalent | Migration Action |
| --- | --- | --- |
| spark-defaults.conf (cluster-wide) | Fabric Spark Workspace Settings + Environment item | Move config properties to Environment or use %%configure in notebooks |
| %%configure magic | %%configure magic — identical | No change needed |
| YARN queue / resource allocation | Fabric Spark pool node size and autoscale settings | Map queue SLAs to Custom Pool configuration |
| Ambari service configs (HDFS, YARN tuning) | Not applicable — Fabric manages infrastructure | Remove; focus on application-level Spark configs |
| HDI Spark version (e.g., Spark 2.4) | Fabric Runtime 1.3 = Spark 3.5 (latest) | Test for deprecated API removals (e.g., HiveContext, RDD-style ML) |
| Conda environment / bootstrap.sh | Fabric Environment item with custom libraries | Recreate conda/pip dependencies in a Fabric Environment |
| hive-site.xml (metastore connection) | Not needed — Delta Lake IS the metastore in Fabric | Remove metastore config; use Lakehouse schemas for namespace organization |
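
As an illustration of moving spark-defaults.conf properties into a notebook, the sketch below parses a conf file and renders a `%%configure` cell body. It assumes the cell accepts a JSON object with a `conf` map, and the default list of dropped key prefixes (YARN settings and storage account keys) is a hypothetical starting point, not an exhaustive rule. The function names and the sample file are also hypothetical.

```python
import json
import re

# "key value" or "key=value"; the key is the first run of non-space, non-'=' text.
_LINE = re.compile(r"(\S+?)(?:\s*=\s*|\s+)(.*)")

# Hypothetical defaults: infrastructure and credential keys that should not move.
DROP_PREFIXES = ("spark.yarn.", "spark.hadoop.fs.azure.account.key.")

def parse_spark_defaults(text):
    """Parse spark-defaults.conf content into a dict, skipping comments and blanks."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = _LINE.fullmatch(line)
        if match:
            conf[match.group(1)] = match.group(2).strip()
    return conf

def build_configure_cell(conf, drop_prefixes=DROP_PREFIXES):
    """Render the remaining properties as a %%configure cell body."""
    kept = {k: v for k, v in conf.items() if not k.startswith(drop_prefixes)}
    return "%%configure\n" + json.dumps({"conf": kept}, indent=2)

# Hypothetical cluster-wide defaults.
sample = """
# cluster-wide defaults
spark.sql.shuffle.partitions 200
spark.sql.session.timeZone=UTC
spark.yarn.queue etl
"""

conf = parse_spark_defaults(sample)
cell = build_configure_cell(conf)
```

For settings shared by many notebooks, the same `kept` dictionary could be entered into an Environment item instead of being repeated per notebook.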

Must / Prefer / Avoid

MUST DO

  • Replace all wasb:// / wasbs:// paths with OneLake abfss:// paths or OneLake Shortcuts — wasb:// requires storage account keys which are not the Fabric-preferred auth model
  • Replace HiveContext, SQLContext, and standalone SparkContext() — use the pre-instantiated spark session in Fabric notebooks
  • Migrate Hive DDL (STORED AS ORC, LOCATION, TBLPROPERTIES) to Delta Lake DDL — see hive-to-delta.md
  • Introduce notebookutils for file system operations, secret retrieval, and child notebook orchestration where HDInsight used custom scripts or direct API calls
  • Replace Oozie XML workflows with Fabric Data Pipelines — see § Oozie → Fabric Pipelines
  • Align library management to Fabric Environments — remove bootstrap.sh, conda envs, and runtime %pip install patterns for production workloads

PREFER

  • OneLake Shortcuts over copying data — mount existing ADLS Gen2 containers as shortcuts to avoid re-ingestion during migration
  • Delta Lake for all tables migrated from Hive ORC/Parquet — ACID guarantees, time travel, and schema enforcement improve data quality
  • Fabric Starter Pool for initial migration validation — no pool configuration overhead, fast session startup
  • Lakehouse schemas (database namespaces) for organizing migrated Hive databases — one schema per Hive database within a single Lakehouse
  • Medallion architecture when restructuring data layers during migration — align Bronze/Silver/Gold with raw Hive → validated Delta → serving Gold patterns

AVOID

  • Do not use SparkContext() or HiveContext() constructors in Fabric notebooks — they conflict with the pre-instantiated spark session and will raise errors
  • Do not use hive-site.xml or external Hive metastore configuration — Fabric's Delta Lake-backed Lakehouse IS the metastore
  • Do not assume YARN queue mappings translate to Fabric pools — re-design resource allocation based on Fabric Spark pool SLAs
  • Do not attempt to run Oozie shell actions or Java MapReduce jobs directly in Fabric — these must be refactored (see § Oozie → Fabric Pipelines)
  • Do not use %sh magic for file system operations in production notebooks — use notebookutils.fs.* for portability and OneLake token-based auth

Examples

See code-patterns.md for full before/after examples. Key quick references:

Legacy context → Fabric pre-instantiated session

# HDInsight (remove entirely)
from pyspark.sql import HiveContext
hc = HiveContext(sc)

# Fabric — use pre-instantiated spark directly
df = spark.sql("SELECT * FROM sales.fact_orders")

WASB path → OneLake path (after shortcut creation)

# HDInsight
df = spark.read.parquet("wasb://[email protected]/orders/")

# Fabric
df = spark.read.parquet("Files/raw/orders/")
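
When many paths need the same treatment, the rewrite can be scripted. The sketch below assumes WASB URIs of the form `wasb://<container>@<account>.blob.core.windows.net/<path>` and a hand-maintained map from (account, container) to the name of a shortcut already created under the Lakehouse `Files` area. The account name `contosostore`, the shortcut name, and the function name are hypothetical.

```python
from urllib.parse import urlsplit

def to_lakehouse_path(wasb_uri, shortcuts):
    """Map a wasb:// or wasbs:// URI to a Lakehouse-relative Files/ path.

    `shortcuts` maps (storage_account, container) to a shortcut name.
    """
    parts = urlsplit(wasb_uri)
    if parts.scheme not in ("wasb", "wasbs"):
        raise ValueError(f"not a WASB URI: {wasb_uri}")
    container = parts.username                 # text before '@'
    account = parts.hostname.split(".", 1)[0]  # first label of the host name
    shortcut = shortcuts[(account, container)]  # KeyError if no shortcut exists yet
    return f"Files/{shortcut}/{parts.path.lstrip('/')}"

# Hypothetical shortcut inventory: container "raw" is exposed as Files/raw.
shortcuts = {("contosostore", "raw"): "raw"}

new_path = to_lakehouse_path(
    "wasb://raw@contosostore.blob.core.windows.net/orders/", shortcuts
)
```

A missing map entry raises `KeyError` on purpose, so a path whose shortcut has not been created yet is not silently rewritten.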

Hive DDL → Delta DDL

-- HDInsight
CREATE TABLE sales_db.fact_orders (...) STORED AS ORC LOCATION 'wasb://...';

-- Fabric
CREATE SCHEMA IF NOT EXISTS sales_db;
CREATE TABLE sales_db.fact_orders (...) USING DELTA;
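
For a first pass over a large set of simple table definitions, the conversion above can be approximated with a few regular expressions. This is only a sketch: it handles `STORED AS`, `LOCATION`, and flat `TBLPROPERTIES` clauses, and it does not attempt `PARTITIONED BY`, `ROW FORMAT`, or SerDe clauses, which need manual review against hive-to-delta.md. The function name and sample column list are hypothetical.

```python
import re

# Hive-only clauses to drop; each pattern consumes its leading whitespace.
_HIVE_CLAUSES = [
    re.compile(r"\s+STORED\s+AS\s+\w+", re.IGNORECASE),
    re.compile(r"\s+LOCATION\s+'[^']*'", re.IGNORECASE),
    re.compile(r"\s+TBLPROPERTIES\s*\([^)]*\)", re.IGNORECASE),
]
_CREATE = re.compile(
    r"CREATE\s+(?:EXTERNAL\s+)?TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)\.\w+",
    re.IGNORECASE,
)

def hive_to_delta(ddl):
    """Return a list of Spark SQL statements for one simple Hive CREATE TABLE."""
    stmt = ddl.strip().rstrip(";")
    for clause in _HIVE_CLAUSES:
        stmt = clause.sub("", stmt)
    statements = []
    match = _CREATE.match(stmt)
    if match:  # schema-qualified name: make sure the schema exists first
        statements.append(f"CREATE SCHEMA IF NOT EXISTS {match.group(1)};")
    # With LOCATION removed the table becomes managed, so EXTERNAL is dropped too.
    stmt = re.sub(r"^CREATE\s+EXTERNAL\s+TABLE", "CREATE TABLE", stmt, flags=re.IGNORECASE)
    statements.append(f"{stmt} USING DELTA;")
    return statements

converted = hive_to_delta(
    "CREATE TABLE sales_db.fact_orders (order_id INT, amount DOUBLE) "
    "STORED AS ORC LOCATION 'wasb://...';"
)
```

Each returned string can be passed to `spark.sql(...)` in order; treating the output as a draft for review is safer than executing it blindly.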
