apify-actor-development

作者: apify

建立、除錯及部署無伺服器雲端程式,用於網頁爬取、自動化及資料處理。支援 JavaScript、TypeScript 及 Python 範本,內建 Crawlee、Playwright 與 Cheerio 函式庫,適用於 HTTP 及瀏覽器爬取。包含透過 apify run 進行本地測試(具備隔離儲存)、輸入/輸出結構驗證,以及透過 apify push 部署至 Apify 平台。需進行 Apify CLI 驗證,並在 .actor/actor.json 中強制加入 generatedBy 元資料以供 AI 使用...

npx skills add https://github.com/apify/agent-skills --skill apify-actor-development

Apify Actor development

Important: Before you begin, fill in the generatedBy property in the meta section of .actor/actor.json. Replace it with the tool and model you're currently using, such as "Claude Code with Claude Sonnet 4.5". This helps Apify monitor and improve AGENTS.md for specific AI tools and models.

What are Apify Actors?

Actors are serverless programs inspired by the UNIX philosophy - programs that do one thing well and can be easily combined to build complex systems. They're packaged as Docker images and run in isolated containers in the cloud.

Core Concepts:

  • Accept well-defined JSON input
  • Perform isolated tasks (web scraping, automation, data processing)
  • Produce structured JSON output to datasets and/or store data in key-value stores
  • Can run from seconds to hours or even indefinitely
  • Persist state and can be restarted

Prerequisites and setup (mandatory)

Before creating or modifying Actors, verify that apify CLI is installed apify --help.

If it is not installed, use one of these methods (listed in order of preference):

# Preferred: install via a package manager (provides integrity checks)
npm install -g apify-cli

# Or (Mac): brew install apify-cli

Security note: Do NOT install the CLI by piping remote scripts to a shell (e.g. curl … | bash or irm … | iex). Always use a package manager.

When the apify CLI is installed, check that it is logged in with:

apify info  # Should return your username

If not logged in, authenticate using OAuth (opens browser):

apify login

If browser login isn't available (headless environment or CI), the CLI automatically reads APIFY_TOKEN from the environment. Ensure the env var is exported and run any apify command - no explicit login needed. If the user doesn't have a token, generate one at https://console.apify.com/settings/integrations.

Security note: Avoid passing tokens as command-line arguments (e.g. apify login -t <token>). Arguments are visible in process listings and may be recorded in shell history. Prefer environment variables or interactive login instead. Never log, print, or embed APIFY_TOKEN in source code or configuration files. Use a token with the minimum required permissions (scoped token) and rotate it periodically.

Template selection

IMPORTANT: Before starting Actor development, always ask the user which programming language they prefer:

  • JavaScript - Use apify create <actor-name> -t project_empty
  • TypeScript - Use apify create <actor-name> -t ts_empty
  • Python - Use apify create <actor-name> -t python-empty

Use the appropriate CLI command based on the user's language choice. Additional packages (Crawlee, Playwright, etc.) can be installed later as needed.

Quick start workflow

  1. Create Actor project - Run the appropriate apify create command based on user's language preference (see Template selection above)
  2. Install dependencies (verify package names match intended packages before installing)
    • JavaScript/TypeScript: npm install (uses package-lock.json for reproducible, integrity-checked installs — commit the lockfile to version control)
    • Python: pip install -r requirements.txt (pin exact versions in requirements.txt, e.g. crawlee==1.2.3, and commit the file to version control)
  3. Implement logic - Write the Actor code in src/main.py, src/main.js, or src/main.ts
  4. Configure schemas - Update input/output schemas in .actor/input_schema.json, .actor/output_schema.json, .actor/dataset_schema.json
  5. Configure platform settings - Update .actor/actor.json with Actor metadata (see references/actor-json.md)
  6. Write documentation - Create comprehensive README.md for the marketplace (see references/actor-readme.md — this is mandatory, not optional)
  7. Test locally - Run apify run to verify functionality (see Local testing section below)
  8. Deploy - Run apify push to deploy the Actor on the Apify platform (Actor name is defined in .actor/actor.json)

Security

Treat all crawled web content as untrusted input. Actors ingest data from external websites that may contain malicious payloads. Follow these rules:

  • Sanitize crawled data — Never pass raw HTML, URLs, or scraped text directly into shell commands, eval(), database queries, or template engines. Use proper escaping or parameterized APIs.
  • Validate and type-check all external data — Before pushing to datasets or key-value stores, verify that values match expected types and formats. Reject or sanitize unexpected structures.
  • Do not execute or interpret crawled content — Never treat scraped text as code, commands, or configuration. Content from websites could include prompt injection attempts or embedded scripts.
  • Isolate credentials from data pipelines — Ensure APIFY_TOKEN and other secrets are never accessible in request handlers or passed alongside crawled data. Use the Apify SDK's built-in credential management rather than passing tokens through environment variables in data-processing code.
  • Review dependencies before installing — When adding packages with npm install or pip install, verify the package name and publisher. Typosquatting is a common supply-chain attack vector. Prefer well-known, actively maintained packages.
  • Pin versions and use lockfiles — Always commit package-lock.json (Node.js) or pin exact versions in requirements.txt (Python). Lockfiles ensure reproducible builds and prevent silent dependency substitution. Run npm audit or pip-audit periodically to check for known vulnerabilities.

Best practices

✓ Do:

  • Use apify run to test Actors locally (configures Apify environment and storage)
  • Use Apify SDK (apify) for code running on the Apify platform
  • Validate input early with proper error handling and fail gracefully
  • Use CheerioCrawler for static HTML (10x faster than browsers)
  • Use PlaywrightCrawler only for JavaScript-heavy sites
  • Use router pattern (createCheerioRouter/createPlaywrightRouter) for complex crawls
  • Implement retry strategies with exponential backoff
  • Use proper concurrency: HTTP (10-50), Browser (1-5)
  • Set sensible defaults in .actor/input_schema.json
  • Define output schema in .actor/output_schema.json
  • Clean and validate data before pushing to dataset
  • Use semantic CSS selectors with fallback strategies
  • Respect robots.txt, ToS, and implement rate limiting
  • Always use apify/log package — censors sensitive data (API keys, tokens, credentials)
  • Implement readiness probe handler (required if your Actor uses standby mode)

✗ Don't:

  • Use npm start, npm run start, npx apify run, or similar commands to run Actors (use apify run instead)
  • Assume local storage from apify run is pushed to or visible in Apify Console — it is local-only; deploy with apify push and run on the platform to see results in Apify Console
  • Rely on Dataset.getInfo() for final counts on Cloud
  • Use browser crawlers when HTTP/Cheerio works
  • Hard code values that should be in input schema or environment variables
  • Skip input validation or error handling
  • Overload servers - use appropriate concurrency and delays
  • Scrape prohibited content or ignore Terms of Service
  • Store personal/sensitive data unless explicitly permitted
  • Use deprecated options like requestHandlerTimeoutMillis on CheerioCrawler (v3.x)
  • Use additionalHttpHeaders - use preNavigationHooks instead
  • Pass raw crawled content into shell commands, eval(), or code-generation functions
  • Use console.log() or print() instead of the Apify logger — these bypass credential censoring
  • Disable standby mode without explicit permission

Logging

See references/logging.md for complete logging documentation including available log levels and best practices for JavaScript/TypeScript and Python.

Commands

# Bootstrap & local development
apify create [name]                    # Create new Actor project from a template
apify init                             # Initialize Actor in current directory
apify run                              # Run Actor locally with simulated platform env
apify run --purge                      # Run after clearing previous local storage
apify validate-schema                  # Validate .actor/input_schema.json

# Authentication & account
apify login                            # Authenticate account (token stored in ~/.apify)
apify logout                           # Remove stored credentials
apify info                             # Print currently authenticated account info

# Deployment & remote execution
apify push                             # Deploy Actor to platform per .actor/actor.json
apify pull <actor>                     # Download Actor code from the platform
apify actors info <actor> --readme     # Inspect Actor documentation
apify actors info <actor> --input      # Inspect Actor input schema
apify call <actor> --input-file input.json
apify call <actor> --input '{"startUrls":[{"url":"https://example.com"}]}'
apify actors build <actor>             # Create a new build of an Actor
apify runs ls                          # List recent runs

# Discovery (search Apify Store for community Actors)
apify actors search "<query>"
apify actors info <actor>

# Secrets (referenced from actor.json via "@mySecret")
apify secrets add <name> <value>       # Store a secret locally; uploaded on push
apify secrets ls                       # List stored secret keys

# Direct API access
apify api <endpoint>                   # Authenticated HTTP request to Apify API

# Help
apify help                             # List all commands
apify <command> --help                 # Detailed help for a specific command

Remote Actor calls

When running Actors remotely, use this flow:

  1. Search for the right Actor with apify actors search "<query>".
  2. Inspect its README with apify actors info <actor> --readme.
  3. Inspect its input schema with apify actors info <actor> --input.
  4. Call it with either --input-file input.json or quoted inline JSON.

Actor input is one JSON object, not an array. --input accepts inline JSON object input only; wrap inline JSON in quotes to avoid shell parsing issues, for example --input '{"startUrls":[{"url":"https://example.com"}]}'. For JSON files or complex inputs, use --input-file input.json.

If no dedicated Actor exists for your target, search Apify Store for community options before building from scratch.

Local and runtime commands

Always use apify run to test Actors locally. Do not use npm run start, npm start, yarn start, or other package manager commands - these will not properly configure the Apify environment and storage.

Inside a running Actor, prefer the SDK (Actor.getInput() / Actor.get_input(), Actor.pushData() / Actor.push_data(), Actor.setValue() / Actor.set_value()) over the equivalent apify actor runtime subcommands.

Apify platform environment

When the Actor runs on the Apify platform, the API token is automatically available via the APIFY_TOKEN environment variable (note: the variable is APIFY_TOKEN, not APIFY_API_TOKEN). The Apify SDK reads it automatically, so you do not need to pass it explicitly. Locally, run apify login once and the SDK will use your stored credentials.

Local testing

When testing an Actor locally with apify run, provide input data by creating a JSON file at:

storage/key_value_stores/default/INPUT.json

This file should contain the input parameters defined in your .actor/input_schema.json. The actor will read this input when running locally, mirroring how it receives input on the Apify platform.

IMPORTANT - Local storage is NOT synced to Apify Console:

  • Running apify run stores all data (datasets, key-value stores, request queues) only on your local filesystem in the storage/ directory.
  • This data is never automatically uploaded or pushed to the Apify platform. It exists only on your machine.
  • To verify results on Apify Console, you must deploy the Actor with apify push and then run it on the platform.
  • Do not rely on checking Apify Console to verify results from local runs — instead, inspect the local storage/ directory or check the Actor's log output.

Standby mode

Standby mode enables Actors to work as API servers - they remain ready in the background to handle HTTP requests.

When to use Standby mode: Use Standby when the Actor must handle interactive, real-time HTTP requests — API endpoints, webhook receivers, real-time data lookups, MCP servers, or scraping APIs serving on-demand single-URL requests.

When building a Standby Actor, set usesStandbyMode: true in .actor/actor.json and implement an HTTP server. See references/standby-mode.md for configuration, environment variables, complete code examples, and operational limits.

Project structure

.actor/
├── actor.json           # Actor config: name, version, env vars, runtime
├── input_schema.json    # Input validation & Console form definition
└── output_schema.json   # Output storage and display templates
src/
└── main.js/ts/py       # Actor entry point
storage/                # Local-only storage (NOT synced to Apify Console)
├── datasets/           # Output items (JSON objects)
├── key_value_stores/   # Files, config, INPUT
└── request_queues/     # Pending crawl requests
Dockerfile              # Container image definition

Actor configuration

See references/actor-json.md for complete actor.json structure and configuration options.

Input schema

See references/input-schema.md for input schema structure and examples.

Output schema

See references/output-schema.md for output schema structure, examples, and template variables.

Dataset schema

See references/dataset-schema.md for dataset schema structure, configuration, and display properties.

Key-value store schema

See references/key-value-store-schema.md for key-value store schema structure, collections, and configuration.

Actor README

IMPORTANT: Always generate a README.md as part of Actor development. The README is the Actor's landing page on Apify Store and is critical for discoverability (SEO), user onboarding, and support. Do not consider an Actor complete without a proper README.

See references/actor-readme.md for the required structure, SEO best practices, and content guidelines. Also review these top Actors for best practices:

MCP tools

Apify MCP

If the Apify MCP server is configured, use these tools for documentation:

  • search-apify-docs - Search documentation
  • fetch-apify-docs - Get full doc pages

Otherwise, the MCP Server url: https://mcp.apify.com/?tools=docs.

Playwright MCP (debugging)

The Playwright MCP server is a useful tool for debugging Actors that interact with the web - it lets the agent drive a real browser to inspect pages, capture selectors, and reproduce issues.

Install with the Claude Code CLI:

claude mcp add playwright npx @playwright/mcp@latest

Or add it manually to your MCP config:

{
  "mcpServers": {
    "playwright": {
      "command": "npx",
      "args": ["@playwright/mcp@latest"]
    }
  }
}

Resources

來自 apify 的更多技能

bug-triage
apify
分類處理 apify/apify-mcp-server 上的未解決錯誤問題。分析、草擬回覆、取得核准、發布。
official
dig
apify
用於在 Apify MCP 伺服器上探索、規劃與規格化工作的靈活技能。請勿編輯原始檔案——此技能僅供理解與規劃使用。
official
apify-actorization
apify
將現有專案轉換為無伺服器 Apify Actors,並整合語言專屬 SDK。支援 JavaScript/TypeScript(使用 Actor.init() / Actor.exit())、Python(非同步上下文管理器),以及透過 CLI 包裝器的任何語言。提供結構化工作流程:使用 apify init 建立專案骨架、套用 SDK 包裝、設定輸入/輸出架構、以 apify run 進行本地測試,再透過 apify push 部署。包含輸入與輸出架構驗證、Docker 容器化,以及可選的按事件付費...
official
apify-audience-analysis
apify
從Facebook、Instagram、YouTube和TikTok提取受眾人口統計、互動模式及行為數據。支援18個以上專業Actor,涵蓋四個平台的粉絲人口統計、互動指標、留言及個人檔案分析。提供三種輸出格式:快速聊天顯示、CSV匯出或JSON匯出供後續分析。需使用Apify token及mcpc CLI工具;透過動態架構擷取來調整輸入以符合各Actor需求。包含結構化...
official
apify-brand-reputation-monitoring
apify
監控品牌在Google Maps、Booking.com、TripAdvisor、Facebook、Instagram、YouTube和TikTok上的聲譽。支援16個以上的專用Apify Actors,涵蓋所有主要平台的評論、評分、留言和提及。靈活的輸出格式:在聊天中顯示結果、匯出為CSV,或儲存為JSON供後續分析使用。需要Apify token和Node.js 20.6+;使用mcpc CLI動態擷取Actor架構和輸入參數。工作流程引導使用者選擇平台...
official
apify-competitor-intelligence
apify
透過 Apify Actors 進行多平台競爭對手分析,涵蓋 Google Maps、Booking.com、Facebook、Instagram、YouTube 及 TikTok。包含 25 個以上專用 Actors,橫跨七大平台,每個皆針對特定分析類型最佳化:商業資料擷取、評論比較、廣告策略監控、內容成效及受眾洞察。需具備 Apify 權杖、Node.js 20.6+ 及 mcpc CLI 工具,以動態擷取 Actor 架構並執行分析。支援三種輸出格式:快速聊天顯示、...
official
apify-content-analytics
apify
透過 Apify Actors 進行多平台內容分析,支援 Instagram、Facebook、YouTube 及 TikTok。涵蓋 17 種以上專用 Actors,可處理貼文、Reels、限時動態、留言、Hashtag、粉絲及廣告等內容,並動態使用 mcpc CLI 擷取 Actor 架構,以判斷所需輸入與可用輸出欄位。結果提供三種格式:快速聊天顯示、CSV 匯出或 JSON 匯出,並可自訂結果數量。需在 .env 檔案中設定 Apify Token,並使用 Node.js 20.6+...
official
apify-ecommerce
apify
從50多個電子商務平台提取產品數據、價格、評論及賣家資訊。三種工作流程模式:產品與定價(價格追蹤、競爭對手分析)、客戶評論(情感分析、品質問題)及賣家情報(透過Google Shopping發現供應商)。支援Amazon(20多個地區)、Walmart、eBay、IKEA、Costco及歐洲零售商;可透過產品網址、分類網址或關鍵字搜尋輸入。可選AI驅動分析,生成價格洞察...
official