ray-data

Scalable data processing for ML workloads. Streaming execution across CPU/GPU, supports Parquet/CSV/JSON/images. Integrates with Ray Train, PyTorch,…

npx skills add https://github.com/firecrawl/ai-research-skills --skill ray-data

Ray Data - Scalable ML Data Processing

Distributed data processing library for ML and AI workloads.

When to use Ray Data

Use Ray Data when:

  • Processing large datasets (>100GB) for ML training
  • Distributing data preprocessing across a cluster
  • Building batch inference pipelines
  • Loading multi-modal data (images, audio, video)
  • Scaling data processing from laptop to cluster

Key features:

  • Streaming execution: Process data larger than memory
  • GPU support: Accelerate transforms with GPUs
  • Framework integration: PyTorch, TensorFlow, HuggingFace
  • Multi-modal: Images, Parquet, CSV, JSON, audio, video

Use alternatives instead:

  • Pandas: Small data (<1GB) on a single machine
  • Dask: Tabular data, SQL-like operations
  • Spark: Enterprise ETL, SQL queries

Quick start

Installation

pip install -U 'ray[data]'

Load and transform data

import ray

# Read Parquet files
ds = ray.data.read_parquet("s3://bucket/data/*.parquet")

# Transform data (lazy execution). The .str accessor needs pandas batches.
ds = ds.map_batches(
    lambda batch: batch["text"].str.lower().to_frame("processed"),
    batch_format="pandas",
)

# Consume data
for batch in ds.iter_batches(batch_size=100):
    print(batch)
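
The transform above relies on pandas string methods, so it can be worth checking the function on a small in-memory frame before pointing it at a cluster. The sketch below assumes batches arrive as pandas DataFrames (which would mean passing batch_format="pandas" to map_batches). The function name lowercase_text and the sample rows are hypothetical.

```python
import pandas as pd

def lowercase_text(batch: pd.DataFrame) -> pd.DataFrame:
    """Return a one-column frame holding the lowercased "text" column."""
    return pd.DataFrame({"processed": batch["text"].str.lower()})

# Hypothetical sample batch standing in for one slice of the dataset.
sample = pd.DataFrame({"text": ["Hello World", "RAY Data"]})
result = lowercase_text(sample)
print(result["processed"].tolist())
```

With Ray installed, the same function would be passed as ds.map_batches(lowercase_text, batch_format="pandas").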

Integration with Ray Train

import ray
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer

# Create dataset
train_ds = ray.data.read_parquet("s3://bucket/train/*.parquet")

def train_func(config):
    # Access dataset in training
    train_ds = ray.train.get_dataset_shard("train")

    for epoch in range(10):
        for batch in train_ds.iter_batches(batch_size=32):
            # Train on batch
            pass

# Train with Ray
trainer = TorchTrainer(
    train_func,
    datasets={"train": train_ds},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True)
)
trainer.fit()

Reading data

From cloud storage

import ray

# Parquet (recommended for ML)
ds = ray.data.read_parquet("s3://bucket/data/*.parquet")

# CSV
ds = ray.data.read_csv("s3://bucket/data/*.csv")

# JSON
ds = ray.data.read_json("gs://bucket/data/*.json")

# Images
ds = ray.data.read_images("s3://bucket/images/")

From Python objects

# From list
ds = ray.data.from_items([{"id": i, "value": i * 2} for i in range(1000)])

# From range
ds = ray.data.range(1000000)  # Synthetic data

# From pandas
import pandas as pd
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})
ds = ray.data.from_pandas(df)

Transformations

Map batches (vectorized)

# Batch transformation (fast)
def process_batch(batch):
    batch["doubled"] = batch["value"] * 2
    return batch

ds = ds.map_batches(process_batch, batch_size=1000)
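
Because a batch function is an ordinary Python function, it can be exercised without a cluster. A minimal sketch, assuming the batch is a dict mapping column names to NumPy arrays (believed to be the default batch format in recent Ray releases; older releases may pass pandas DataFrames instead). The sample batch is made up.

```python
import numpy as np

def process_batch(batch: dict) -> dict:
    # One vectorized multiply covers every row in the batch.
    batch["doubled"] = batch["value"] * 2
    return batch

# Hypothetical three-row batch with a single "value" column.
sample_batch = {"value": np.array([1, 2, 3])}
out = process_batch(sample_batch)
print(out["doubled"])
```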

Row transformations

# Row-by-row (slower)
def process_row(row):
    row["squared"] = row["value"] ** 2
    return row

ds = ds.map(process_row)

Filter

# Filter rows
ds = ds.filter(lambda row: row["value"] > 100)
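
The row-wise predicate above is called once per row. If that becomes a bottleneck, the same condition can be written as a batch function that builds a boolean mask and applies it to every column. This is a sketch under the assumption that batches are dicts of NumPy arrays; filter_batch is a hypothetical helper, not a Ray API.

```python
import numpy as np

def filter_batch(batch: dict) -> dict:
    """Keep rows whose "value" exceeds 100, across all columns."""
    mask = batch["value"] > 100
    return {name: column[mask] for name, column in batch.items()}

# Hypothetical batch: only the rows with id 1 and 3 pass the threshold.
sample_batch = {
    "id": np.array([0, 1, 2, 3]),
    "value": np.array([50, 150, 100, 300]),
}
kept = filter_batch(sample_batch)
print(kept)
```

It would be applied with ds.map_batches(filter_batch) in place of ds.filter(...).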

Group by and aggregate

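# Count rows per category (kept in a separate variable so ds still has "value")
counts = ds.groupby("category").count()

# Custom aggregation: return one row per group as a length-1 array
import numpy as np
sums = ds.groupby("category").map_groups(
    lambda group: {"sum": np.array([group["value"].sum()])}
)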
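
To make the two aggregations concrete, the sketch below computes a per-category count and a per-category sum on a small pandas frame. It only illustrates the expected results; it does not use Ray, and the sample data is made up.

```python
import pandas as pd

# Hypothetical data: three rows in category "a", two in category "b".
df = pd.DataFrame({
    "category": ["a", "b", "a", "b", "a"],
    "value": [1, 10, 2, 20, 3],
})

# Rows per category, and the sum of "value" per category.
counts = df.groupby("category").size().to_dict()
sums = df.groupby("category")["value"].sum().to_dict()
print(counts, sums)
```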

GPU-accelerated transforms

# Use GPU for preprocessing
def preprocess_images_gpu(batch):
    import torch
    images = torch.tensor(batch["image"]).cuda()
    # GPU preprocessing
    processed = images * 255
    return {"processed": processed.cpu().numpy()}

ds = ds.map_batches(
    preprocess_images_gpu,
    batch_size=64,
    num_gpus=1  # Request GPU
)

Writing data

# Write to Parquet
ds.write_parquet("s3://bucket/output/")

# Write to CSV
ds.write_csv("output/")

# Write to JSON
ds.write_json("output/")

Performance optimization

Repartition

# Control parallelism
ds = ds.repartition(100)  # 100 blocks for 100-core cluster
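
One rough way to pick a block count is to divide the dataset size by a target block size and then make sure there are at least as many blocks as cores. The sketch below works through that arithmetic. The 128 MiB target and the helper name suggest_num_blocks are assumptions for illustration, not values taken from this guide.

```python
import math

def suggest_num_blocks(dataset_bytes: int, num_cores: int,
                       target_block_bytes: int = 128 * 1024**2) -> int:
    """Return enough blocks to keep every core busy at roughly the target size."""
    by_size = math.ceil(dataset_bytes / target_block_bytes)
    return max(by_size, num_cores)

# A 1 GiB dataset is bounded by the core count, a 100 GiB one by block size.
small = suggest_num_blocks(dataset_bytes=1 * 1024**3, num_cores=100)
large = suggest_num_blocks(dataset_bytes=100 * 1024**3, num_cores=100)
print(small, large)
```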

Batch size tuning

# Larger batches = faster vectorized ops
ds.map_batches(process_fn, batch_size=10000)  # vs batch_size=100
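
Part of the gain from larger batches is simply fewer function calls. The sketch below counts how often a batch function would run for the two batch sizes above, ignoring block boundaries; the row count and the helper name num_calls are hypothetical.

```python
import math

def num_calls(total_rows: int, batch_size: int) -> int:
    """Number of times the batch function runs for a given batch size."""
    return math.ceil(total_rows / batch_size)

total_rows = 1_000_000  # hypothetical dataset size
small_batches = num_calls(total_rows, 100)
large_batches = num_calls(total_rows, 10_000)
print(small_batches, large_batches)
```

Larger batches also hold more rows in memory at once, so the practical upper limit is set by worker (or GPU) memory.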

Streaming execution

# Process data larger than memory
ds = ray.data.read_parquet("s3://huge-dataset/")
for batch in ds.iter_batches(batch_size=1000):
    process(batch)  # Streamed, not loaded to memory
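
The idea behind streaming can be illustrated with plain Python generators: each stage pulls one batch at a time, so only a batch-sized slice is alive at any point. This is an analogy only, not how Ray Data is implemented, and the function names are hypothetical.

```python
from typing import Iterator, List

def read_batches(total_rows: int, batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive batches lazily instead of building the full list."""
    for start in range(0, total_rows, batch_size):
        yield list(range(start, min(start + batch_size, total_rows)))

def transform(batches: Iterator[List[int]]) -> Iterator[List[int]]:
    """Apply a per-batch transform as each batch arrives."""
    for batch in batches:
        yield [x * 2 for x in batch]

peak_rows_held = 0
running_total = 0
for batch in transform(read_batches(total_rows=10, batch_size=4)):
    # Only the current batch is in memory here, never the whole dataset.
    peak_rows_held = max(peak_rows_held, len(batch))
    running_total += sum(batch)
print(peak_rows_held, running_total)
```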

Common patterns

Batch inference

import ray

# Load model
def load_model():
    # Load once per worker
    return MyModel()

# Inference function
class BatchInference:
    def __init__(self):
        self.model = load_model()

    def __call__(self, batch):
        predictions = self.model(batch["input"])
        return {"prediction": predictions}

# Run distributed inference
ds = ray.data.read_parquet("s3://data/")
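# Recent Ray releases require concurrency for callable classes; it sets the
# number of model replicas (2 is a placeholder, adjust to the GPUs available)
predictions = ds.map_batches(BatchInference, batch_size=32, num_gpus=1, concurrency=2)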
predictions.write_parquet("s3://output/")
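
The reason for using a class rather than a function is that __init__ runs once per instance, so the model is loaded once and reused for every batch that instance handles. The sketch below demonstrates this locally with a stub; StubModel is a hypothetical stand-in for MyModel, and batches are assumed to be dicts of NumPy arrays.

```python
import numpy as np

class StubModel:
    """Hypothetical stand-in for MyModel; counts how often it is constructed."""
    loads = 0

    def __init__(self):
        StubModel.loads += 1

    def __call__(self, inputs: np.ndarray) -> np.ndarray:
        return inputs * 2

class BatchInference:
    def __init__(self):
        # Runs once per instance, so expensive setup is not repeated per batch.
        self.model = StubModel()

    def __call__(self, batch: dict) -> dict:
        return {"prediction": self.model(batch["input"])}

# One instance handles three batches; the model is constructed only once.
worker = BatchInference()
outputs = [worker({"input": np.array([i, i + 1])}) for i in range(3)]
print(StubModel.loads, outputs[-1]["prediction"])
```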

Data preprocessing pipeline

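# Multi-step pipeline (clean_data, tokenize, and augment are user-defined batch functions)
# write_parquet triggers execution and returns None, so the result is not assigned
(
    ray.data.read_parquet("s3://raw/")
    .map_batches(clean_data)
    .map_batches(tokenize)
    .map_batches(augment)
    .write_parquet("s3://processed/")
)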
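
Each map_batches stage receives the output of the stage before it, so any column a step reads must be produced by an earlier step. The sketch below mimics that hand-off locally with two made-up steps. The bodies of clean_data and tokenize are assumptions, since this guide does not define them, and run_pipeline is a hypothetical helper.

```python
from functools import reduce

def clean_data(batch: dict) -> dict:
    """Hypothetical cleaning step: trim whitespace and lowercase."""
    return {"text": [t.strip().lower() for t in batch["text"]]}

def tokenize(batch: dict) -> dict:
    """Hypothetical tokenizer: split on whitespace."""
    return {"tokens": [t.split() for t in batch["text"]]}

def run_pipeline(batch: dict, steps) -> dict:
    """Apply each step to the output of the previous one, in order."""
    return reduce(lambda acc, step: step(acc), steps, batch)

raw = {"text": ["  Hello Ray ", "DATA  pipelines"]}
result = run_pipeline(raw, [clean_data, tokenize])
print(result["tokens"])
```

Swapping the order of the two steps would fail here, because tokenize drops the "text" column that clean_data expects.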

Integration with ML frameworks

PyTorch

# Iterate over the dataset as batches of PyTorch tensors
for batch in ds.iter_torch_batches(batch_size=32):
    # batch is a dict mapping column names to tensors
    inputs, labels = batch["features"], batch["label"]

TensorFlow

# Convert to a tf.data.Dataset (the keyword is label_columns)
tf_ds = ds.to_tf(feature_columns="image", label_columns="label", batch_size=32)

for features, labels in tf_ds:
    # Train model
    pass

Supported data formats

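Format    Read           Write           Use case
Parquet   read_parquet   write_parquet   ML data (recommended)
CSV       read_csv       write_csv       Tabular data
JSON      read_json      write_json      Semi-structured
Images    read_images    write_images    Computer vision
NumPy     read_numpy     write_numpy     Arrays
Pandas    from_pandas    to_pandas       DataFrames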

Performance benchmarks

Scaling (processing 100GB data):

  • 1 node (16 cores): ~30 minutes
  • 4 nodes (64 cores): ~8 minutes
  • 16 nodes (256 cores): ~2 minutes

GPU acceleration (image preprocessing):

  • CPU only: 1,000 images/sec
  • 1 GPU: 5,000 images/sec
  • 4 GPUs: 18,000 images/sec
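
Taken at face value, the figures above imply close to linear scaling. The arithmetic can be worked through as follows, with efficiency defined here as speedup divided by the increase in resources. The variable names are illustrative, and the numbers are copied from the lists above rather than measured.

```python
# Runtime in minutes for 100GB, keyed by node count (from the list above).
cpu_minutes = {1: 30, 4: 8, 16: 2}
# Throughput in images/sec, keyed by GPU count (from the list above).
gpu_images_per_sec = {1: 5_000, 4: 18_000}

# Efficiency = (baseline time / scaled time) / (factor of added nodes).
cpu_efficiency = {
    nodes: (cpu_minutes[1] / minutes) / nodes
    for nodes, minutes in cpu_minutes.items()
}

# Efficiency = (scaled throughput / baseline throughput) / (factor of added GPUs).
gpu_efficiency = {
    gpus: (rate / gpu_images_per_sec[1]) / gpus
    for gpus, rate in gpu_images_per_sec.items()
}

print(cpu_efficiency)
print(gpu_efficiency)
```

Both node counts work out to about 94% efficiency, and four GPUs deliver about 90% of four times the single-GPU throughput.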

Use cases

Production deployments:

  • Pinterest: Last-mile data processing for model training
  • ByteDance: Scaling offline inference with multi-modal LLMs
  • Spotify: ML platform for batch inference
