cloud-solution-architect

遵循Azure架构中心最佳实践，设计架构完善、达到生产级别的云系统。该技能提供：

npx skills add https://github.com/microsoft/agent-skills --skill cloud-solution-architect

Cloud Solution Architect

Overview

Design well-architected, production-grade cloud systems following Azure Architecture Center best practices. This skill provides:

10 design principles for Azure applications
6 architecture styles with selection guidance
44 cloud design patterns mapped to WAF pillars
Technology choice frameworks for compute, storage, data, messaging
Performance antipatterns to avoid
Architecture review workflow for systematic design validation

Ten Design Principles for Azure Applications

#	Principle	Key Tactics
1	Design for self-healing	Retry with backoff, circuit breaker, bulkhead isolation, health endpoint monitoring, graceful degradation
2	Make all things redundant	Eliminate single points of failure, use availability zones, deploy multi-region, replicate data
3	Minimize coordination	Decouple services, use async messaging, embrace eventual consistency, use domain events
4	Design to scale out	Horizontal scaling, autoscaling rules, stateless services, avoid session stickiness, partition workloads
5	Partition around limits	Data partitioning (shard/hash/range), respect compute & network limits, use CDNs for static content
6	Design for operations	Structured logging, distributed tracing, metrics & dashboards, runbook automation, infrastructure as code
7	Use managed services	Prefer PaaS over IaaS, reduce operational burden, leverage built-in HA/DR/scaling
8	Use an identity service	Microsoft Entra ID, managed identity, RBAC, avoid storing credentials, zero-trust principles
9	Design for evolution	Loose coupling, versioned APIs, backward compatibility, async messaging for integration, feature flags
10	Build for business needs	Define SLAs/SLOs, establish RTO/RPO targets, domain-driven design, cost modeling, composite SLAs

Architecture Styles

Style	Description	When to Use	Key Services
N-tier	Horizontal layers (presentation, business, data)	Traditional enterprise apps, lift-and-shift	App Service, SQL Database, VNets
Web-Queue-Worker	Web frontend → message queue → backend worker	Moderate-complexity apps with long-running tasks	App Service, Service Bus, Functions
Microservices	Small autonomous services, bounded contexts, independent deploy	Complex domains, independent team scaling	AKS, Container Apps, API Management
Event-driven	Pub/sub model, event producers/consumers	Real-time processing, IoT, reactive systems	Event Hubs, Event Grid, Functions
Big data	Batch + stream processing pipeline	Analytics, ML pipelines, large-scale data	Synapse, Data Factory, Databricks
Big compute	HPC, parallel processing	Simulations, modeling, rendering, genomics	Batch, CycleCloud, HPC VMs

Selection Criteria

Domain complexity → Microservices (high), N-tier (low-medium)
Team autonomy → Microservices (independent teams), N-tier (single team)
Data volume → Big data (TB+), others (GB)
Latency requirements → Event-driven (real-time), Web-Queue-Worker (tolerant)

Cloud Design Patterns

44 patterns organized by primary concern. WAF pillar mapping: R=Reliability, S=Security, CO=Cost Optimization, OE=Operational Excellence, PE=Performance Efficiency.

Messaging & Communication

Pattern	Summary	Pillars
Asynchronous Request-Reply	Decouple request/response with polling or callbacks	R, PE
Claim Check	Split large messages; store payload separately, pass reference	R, PE
Choreography	Services coordinate via events without central orchestrator	R, OE
Competing Consumers	Multiple consumers process messages from shared queue concurrently	R, PE
Messaging Bridge	Connect incompatible messaging systems	R, OE
Pipes and Filters	Decompose complex processing into reusable filter stages	R, OE
Priority Queue	Prioritize requests so higher-priority work is processed first	R, PE
Publisher/Subscriber	Decouple senders from receivers via topics/subscriptions	R, PE
Queue-Based Load Leveling	Buffer requests with a queue to smooth intermittent loads	R, PE
Sequential Convoy	Process related messages in order while allowing parallel groups	R, PE

Reliability & Resilience

Pattern	Summary	Pillars
Bulkhead	Isolate resources per workload to prevent cascading failure	R
Circuit Breaker	Stop calling a failing service; fail fast to protect resources	R
Compensating Transaction	Undo previously committed steps when a later step fails	R
Health Endpoint Monitoring	Expose health checks for load balancers and orchestrators	R, OE
Leader Election	Coordinate distributed instances by electing a leader	R
Retry	Handle transient faults by retrying with exponential backoff	R
Saga	Manage data consistency across microservices with compensating transactions	R
Scheduler Agent Supervisor	Coordinate distributed actions with retry and failure handling	R

Data Management

Pattern	Summary	Pillars
Cache-Aside	Load data on demand into cache from data store	PE
CQRS	Separate read and write models for independent scaling	PE, R
Event Sourcing	Store state as append-only sequence of domain events	R, OE
Index Table	Create indexes over frequently queried fields in data stores	PE
Materialized View	Pre-compute views over data for efficient queries	PE
Sharding	Distribute data across partitions for scale and performance	PE, R
Static Content Hosting	Serve static content from cloud storage/CDN directly	PE, CO
Valet Key	Grant clients limited direct access to storage resources	S, PE

Design & Structure

Pattern	Summary	Pillars
Ambassador	Offload cross-cutting concerns to a helper sidecar proxy	OE
Anti-Corruption Layer	Translate between new and legacy system models	OE, R
Backends for Frontends	Create separate backends per frontend type (mobile, web, etc.)	OE, PE
Compute Resource Consolidation	Combine multiple workloads into fewer compute instances	CO
External Configuration Store	Externalize configuration from deployment packages	OE
Sidecar	Deploy helper components alongside the main service	OE
Strangler Fig	Incrementally migrate legacy systems by replacing pieces	OE, R

Security & Access

Pattern	Summary	Pillars
Federated Identity	Delegate authentication to an external identity provider	S
Gatekeeper	Protect services using a dedicated broker that validates requests	S
Quarantine	Isolate and validate external assets before allowing use	S
Rate Limiting	Control consumption rate of resources by consumers	R, S
Throttling	Control resource consumption to sustain SLAs under load	R, PE

Deployment & Scaling

Pattern	Summary	Pillars
Deployment Stamps	Deploy multiple independent copies of application components	R, PE
Edge Workload Configuration	Configure workloads differently across diverse edge devices	OE
Gateway Aggregation	Aggregate multiple backend calls into a single client request	PE
Gateway Offloading	Offload shared functionality (SSL, auth) to a gateway	OE, S
Gateway Routing	Route requests to multiple backends using a single endpoint	OE
Geode	Deploy backends to multiple regions for active-active serving	R, PE

See Design Patterns Reference for detailed implementation guidance.

Technology Choices

Decision Framework

For each technology area, evaluate: requirements → constraints → tradeoffs → select.

Area	Key Options	Selection Criteria
Compute	App Service, Functions, Container Apps, AKS, VMs, Batch	Hosting model, scaling, cost, team skills
Storage	Blob Storage, Data Lake, Files, Disks, Managed Lustre	Access patterns, throughput, cost tier
Data stores	SQL Database, Cosmos DB, PostgreSQL, Redis, Table Storage	Consistency model, query patterns, scale
Messaging	Service Bus, Event Hubs, Event Grid, Queue Storage	Ordering, throughput, pub/sub vs queue
Networking	Front Door, Application Gateway, Load Balancer, Traffic Manager	Global vs regional, L4 vs L7, WAF
AI services	Azure OpenAI, AI Search, AI Foundry, Document Intelligence	Model needs, data grounding, orchestration
Containers	Container Apps, AKS, Container Instances	Operational control vs simplicity

See Technology Choices Reference for detailed decision trees.

Best Practices

Practice	Key Guidance
API design	RESTful conventions, resource-oriented URIs, HATEOAS, versioning via URL path or header
API implementation	Async operations, pagination, idempotent PUT/DELETE, content negotiation, ETag caching
Autoscaling	Scale on metrics (CPU, queue depth, custom), cool-down periods, predictive scaling, scale-in protection
Background jobs	Use queues or scheduled triggers, idempotent processing, poison message handling, graceful shutdown
Caching	Cache-aside pattern, TTL policies, cache invalidation strategies, distributed cache for multi-instance
CDN	Static asset offloading, cache-busting with versioned URLs, geo-distribution, HTTPS enforcement
Data partitioning	Horizontal (sharding), vertical, functional partitioning; partition key selection for even distribution
Partitioning strategies	Hash-based, range-based, directory-based; rebalancing approach, cross-partition query avoidance
Host name preservation	Preserve original host header through proxies/gateways for cookies, redirects, auth flows
Message encoding	Schema evolution (Avro/Protobuf), backward/forward compatibility, schema registry
Monitoring & diagnostics	Structured logging, distributed tracing (W3C Trace Context), metrics, alerts, dashboards
Transient fault handling	Retry with exponential backoff + jitter, circuit breaker, idempotency keys, timeout budgets

See Best Practices Reference for implementation details.

Performance Antipatterns

Avoid these common patterns that degrade performance under load:

Antipattern	Problem	Fix
Busy Database	Offloading too much processing to the database	Move logic to application tier, use caching
Busy Front End	Resource-intensive work on frontend request threads	Offload to background workers/queues
Chatty I/O	Many small I/O requests instead of fewer large ones	Batch requests, use bulk APIs, buffer writes
Extraneous Fetching	Retrieving more data than needed	Project only required fields, paginate, filter server-side
Improper Instantiation	Recreating expensive objects per request	Use singletons, connection pooling, HttpClientFactory
Monolithic Persistence	Single data store for all data types	Polyglot persistence — right store for each workload
No Caching	Repeatedly fetching unchanged data	Cache-aside pattern, CDN, output caching, Redis
Noisy Neighbor	One tenant consuming all shared resources	Bulkhead isolation, per-tenant quotas, throttling
Retry Storm	Aggressive retries overwhelming a recovering service	Exponential backoff + jitter, circuit breaker, retry budgets
Synchronous I/O	Blocking threads on I/O operations	Async/await, non-blocking I/O, reactive streams

Mission-Critical Design

For workloads targeting 99.99%+ SLO, address these design areas:

Design Area	Key Considerations
Application platform	Multi-region active-active, availability zones, Container Apps or AKS with zone redundancy
Application design	Stateless services, idempotent operations, graceful degradation, bulkhead isolation
Networking	Azure Front Door (global LB), DDoS Protection, private endpoints, redundant connectivity
Data platform	Multi-region Cosmos DB, zone-redundant SQL, async replication, conflict resolution
Deployment & testing	Blue-green deployments, canary releases, chaos engineering, automated rollback
Health modeling	Composite health scores, dependency health tracking, automated remediation, SLI dashboards
Security	Zero-trust, managed identity everywhere, key rotation, WAF policies, threat modeling
Operational procedures	Automated runbooks, incident response playbooks, game days, postmortems

See Mission-Critical Reference for detailed guidance.

Well-Architected Framework (WAF) Pillars

Every architecture decision should be evaluated against all five pillars:

Pillar	Focus	Key Questions
Reliability	Resiliency, availability, disaster recovery	What is the RTO/RPO? How does it handle failures? Is there redundancy?
Security	Threat protection, identity, data protection	Is identity managed? Is data encrypted? Are there network controls?
Cost Optimization	Cost management, efficiency, right-sizing	Is compute right-sized? Are there reserved instances? Is there waste?
Operational Excellence	Monitoring, deployment, automation	Is deployment automated? Is there observability? Are there runbooks?
Performance Efficiency	Scaling, load testing, performance targets	Can it scale horizontally? Are there performance baselines? Is caching used?

WAF Tradeoff Matrix

Optimizing for...	May impact...
Reliability (redundancy)	Cost (more resources)
Security (isolation)	Performance (added latency)
Cost (consolidation)	Reliability (shared failure domains)
Performance (caching)	Cost (cache infrastructure), Reliability (stale data)

Architecture Review Workflow

When reviewing or designing a system, follow this structured approach:

Step 1: Identify Requirements

Functional: What must the system do?
Non-functional:
  - Availability target (e.g., 99.9%, 99.99%)
  - Latency requirements (p50, p95, p99)
  - Throughput (requests/sec, messages/sec)
  - Data residency and compliance
  - Recovery targets (RTO, RPO)
  - Cost constraints

Step 2: Select Architecture Style

Match requirements to architecture style using the selection criteria table above.

Step 3: Choose Technology Stack

Use the technology choices decision framework. Prefer managed services (PaaS) over IaaS.

Step 4: Apply Design Patterns

Select relevant patterns from the 44 cloud design patterns based on identified concerns.

Step 5: Address Cross-Cutting Concerns

Identity & access — Microsoft Entra ID, managed identity, RBAC
Monitoring — Application Insights, Azure Monitor, Log Analytics
Security — Network segmentation, encryption at rest/in transit, Key Vault
CI/CD — GitHub Actions, Azure DevOps Pipelines, infrastructure as code

Step 6: Validate Against WAF Pillars

Review each pillar systematically. Document tradeoffs explicitly.

Step 7: Document Decisions

Use Architecture Decision Records (ADRs):

# ADR-NNN: [Decision Title]

## Status: [Proposed | Accepted | Deprecated]

## Context
[What is the issue we're addressing?]

## Decision
[What did we decide and why?]

## Consequences
[What are the positive and negative impacts?]

References

Design Patterns Reference — Detailed pattern implementations
Technology Choices Reference — Decision trees for Azure services
Best Practices Reference — Implementation guidance
Mission-Critical Reference — High-availability design

Source

Content derived from the Azure Architecture Center — Microsoft's official guidance for cloud solution architecture on Azure. Covers design principles, architecture styles, cloud design patterns, technology choices, best practices, performance antipatterns, mission-critical design, and the Well-Architected Framework.