
Meet Forge

The AI Cloud Infrastructure Engineer

Forge builds production IaC across GCP, AWS, and Azure, audits cloud setups for cost waste and security misconfigurations, and diagnoses runtime infrastructure problems.

Forge · Infrastructure · 10 min read · April 20, 2026

Most cloud infrastructure goes wrong not in deployment but long before it, when someone copy-pastes a Terraform snippet from a blog post, skips the IAM module because it looks complicated, and ships to production with a storage bucket that is publicly readable and an instance type three times larger than the workload needs. The damage is silent: the app works, the tests pass, and the bill arrives at the end of the month as a surprise. The security misconfiguration sits there for months until a compliance review finds it. The pattern repeats because writing good infrastructure as code is a specialist skill that most teams cannot staff full-time and that generalist AI tools consistently get wrong: they produce plausible-looking Terraform that skips the opinions that make infrastructure actually production-grade. That gap is exactly what Forge fills.

Why the generalist approach breaks down

Ask ChatGPT to write you a VPC module in Terraform and you will get something that passes terraform validate. Ask a cloud infrastructure engineer to review it and they will immediately flag the missing private subnet routing, the overly permissive security group egress rules, the absence of VPC flow logs, and the fact that the NAT gateway configuration will create a single point of failure in a multi-AZ setup. The generalist tool has no position on those decisions; it produces output that satisfies the literal request and ignores everything that makes infrastructure reliable, secure, and cost-aware. The problems surface three months later, under load, when the on-call engineer is trying to figure out why connections are timing out at 2 a.m.

Cursor and GitHub Copilot have the same blind spot, compounded by the fact that they are editor-level completion tools. They will happily autocomplete a resource "aws_s3_bucket" block without ever mentioning the bucket policy, versioning configuration, or server-side encryption setting that turn a bucket from a liability into an asset. They are not making infrastructure decisions; they are completing patterns they have seen before. When your infrastructure has real requirements around compliance, cost governance, or resilience, autocomplete produces a first draft that looks done but is not. Every infrastructure decision that required judgment is missing.
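As a rough sketch of what those missing settings look like in Terraform, the block below pairs a bucket with versioning, default encryption, a public access block, and a TLS-only bucket policy. The bucket name and resource labels are placeholders, and using SSE-S3 (AES256) instead of a customer-managed KMS key is an assumption made for brevity, not a recommendation from this article.

```terraform
# Hypothetical bucket; the name is a placeholder and must be globally unique.
resource "aws_s3_bucket" "assets" {
  bucket = "acme-prod-assets-example"
}

# Keep prior object versions so overwrites and deletes are recoverable.
resource "aws_s3_bucket_versioning" "assets" {
  bucket = aws_s3_bucket.assets.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Default server-side encryption for every new object.
resource "aws_s3_bucket_server_side_encryption_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Block public ACLs and public bucket policies at the bucket level.
resource "aws_s3_bucket_public_access_block" "assets" {
  bucket = aws_s3_bucket.assets.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Bucket policy: deny any request that does not use TLS.
data "aws_iam_policy_document" "assets_tls_only" {
  statement {
    sid     = "DenyInsecureTransport"
    effect  = "Deny"
    actions = ["s3:*"]

    resources = [
      aws_s3_bucket.assets.arn,
      "${aws_s3_bucket.assets.arn}/*",
    ]

    principals {
      type        = "*"
      identifiers = ["*"]
    }

    condition {
      test     = "Bool"
      variable = "aws:SecureTransport"
      values   = ["false"]
    }
  }
}

resource "aws_s3_bucket_policy" "assets" {
  bucket = aws_s3_bucket.assets.id
  policy = data.aws_iam_policy_document.assets_tls_only.json
}
```

None of these resources is implied by the bare aws_s3_bucket block, which is why a completion tool that stops at the first resource leaves them out.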

The deeper problem is that cloud infrastructure is one of the few engineering domains where wrong decisions are expensive in three independent ways simultaneously: they create security exposure, they inflate cost, and they create operational failure modes that do not surface until the system is under real load. A generalist tool that does not have opinions about IAM least-privilege, right-sizing, multi-AZ design, and encrypted storage is not actually useful for infrastructure work; it is useful for infrastructure drafting, and the difference matters. Teams that use generalist tools for IaC tend to end up with a mix of production-grade and dangerously underspecified resources, and the only thing holding it together is that nobody has looked closely enough to notice.

What a cloud infrastructure engineer actually does

On a human engineering team, the infrastructure engineer is the person who owns what runs the application: the compute, the networking, the storage, the access controls, and the glue that holds all of it together across environments. They think in failure modes: what happens when an availability zone goes down, what happens when autoscaling does not trigger fast enough, what happens when a service account is compromised. They write Terraform that documents its own reasoning: variable descriptions, output explanations, and comment blocks that capture why a decision was made rather than what was decided. They review other people's infrastructure changes not for syntax but for the security and reliability consequences that the author may not have thought through.
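A minimal sketch of what "Terraform that documents its own reasoning" can look like is shown here. The variable names, defaults, and the reasoning in the descriptions are invented for illustration; the point is that the description and validation blocks carry the why, not only the what.

```terraform
# Hypothetical variables; names, defaults, and rationale are illustrative.
variable "db_instance_class" {
  description = "RDS instance class. Defaults to a burstable Graviton size because the workload is assumed to be small at launch; revisit if CPU credits run low."
  type        = string
  default     = "db.t4g.medium"

  validation {
    condition     = can(regex("^db\\.", var.db_instance_class))
    error_message = "db_instance_class must be an RDS class such as db.t4g.medium."
  }
}

variable "backup_retention_days" {
  description = "Days of automated backups to keep. 7 covers a week of operator error; RDS accepts up to 35."
  type        = number
  default     = 7

  validation {
    # 0 would disable automated backups, so it is rejected on purpose.
    condition     = var.backup_retention_days >= 1 && var.backup_retention_days <= 35
    error_message = "backup_retention_days must be between 1 and 35."
  }
}

output "db_instance_class" {
  description = "Instance class in effect, exposed so cost reviews can see sizing without reading the module."
  value       = var.db_instance_class
}
```

Because this fragment declares only variables and an output, it can be checked with terraform validate without configuring a provider.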

The infrastructure engineer is also the person your team calls when the cloud bill arrives with an unexpected number on it. They know which instance types are oversized for the workload, which reserved instance commitments are about to expire, which idle resources have been running for months because nobody deleted the staging environment from six sprints ago. That combination of operational knowledge, security instinct, and cost awareness is hard to find and harder to keep. Forge makes it available on demand, in the IaC language your team already uses, across the cloud providers you are actually running.

Meet Forge

Forge is Tonone's cloud infrastructure engineer, a purpose-built specialist agent for GCP, AWS, Azure, Cloudflare, and Fly.io, working in Terraform, Pulumi, or CDK depending on what your project already uses. Forge does not write infrastructure that looks production-grade; it writes infrastructure that is production-grade. That means IAM with least-privilege from the first resource, not bolted on later. It means subnet strategy and CIDR planning documented with their rationale, not left implicit. It means cost and security are first-class outputs of every infra build, not afterthoughts surfaced by audits.

Tonone's Forge builds production-grade infrastructure as code across GCP, AWS, Azure, Cloudflare, and Fly.io, with IAM, cost awareness, and security baked in from the first resource, not added as an afterthought.

What Forge actually does

Building production infrastructure from scratch

The forge-infra skill is where Forge earns its name. You describe what you need (a GKE cluster with a private node pool, an RDS instance behind a VPC, a multi-region CDN setup), and Forge detects your cloud provider and target platform from the existing project context, then produces complete, production-grade IaC. Not a starter template. Not a hello-world module. Compute with the right instance family for the workload, networking configured to isolate traffic correctly, storage with encryption and versioning on from day one, and IAM that grants each component the minimum permissions it actually needs. The output includes inline comments that explain why each decision was made (why this CIDR range, why this instance type, why the storage bucket policy is structured the way it is), so the infrastructure is maintainable by whoever works on it next, not just by whoever wrote it. For teams starting a new cloud environment or expanding to a new region, forge-infra compresses weeks of careful infrastructure work into hours, without cutting the corners that create problems later.

Designing networking infrastructure that holds

Networking is the part of cloud infrastructure that looks simple until it is not. A VPC that worked fine with three services starts behaving unexpectedly when you add a fourth, because the subnet strategy was never planned beyond what existed at the time. A firewall rule that was reasonable for a development environment accidentally ships to production with open egress. The forge-network skill addresses this directly: it designs and builds networking infrastructure with a coherent subnet strategy and CIDR planning that leaves room for growth, DNS configuration that handles internal and external resolution correctly, load balancers with health checks and SSL termination configured properly, and firewall rules that follow least-privilege ingress and egress at the rule level, not at the VPC level where it is too broad to be useful. Every networking decision is documented with its rationale, which means the next engineer who reads the configuration understands why it is the way it is rather than inheriting a structure they are afraid to change. For teams that have grown their cloud footprint organically and ended up with networking they do not fully understand, forge-network can also document and rationalize the existing setup before proposing changes.
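Two of the ideas above, CIDR planning with room to grow and least-privilege rules at the rule level, can be sketched in Terraform as follows. This is an illustration under stated assumptions, not forge-network output: the vpc_id variable, the security group names, and the Postgres port are hypothetical, and the rule resources assume AWS provider v5 or later.

```terraform
locals {
  vpc_cidr = "10.100.0.0/16"
  azs      = ["us-east-1a", "us-east-1b", "us-east-1c"]

  # cidrsubnet(prefix, newbits, netnum): 8 new bits turn the /16 into /24s.
  # Private subnets start at netnum 1 and public subnets at netnum 101,
  # which leaves the ranges in between unallocated for future tiers.
  private_subnets = [for i, az in local.azs : cidrsubnet(local.vpc_cidr, 8, i + 1)]
  public_subnets  = [for i, az in local.azs : cidrsubnet(local.vpc_cidr, 8, i + 101)]
}

# Yields 10.100.1.0/24, 10.100.2.0/24, 10.100.3.0/24.
output "private_subnets" {
  value = local.private_subnets
}

variable "vpc_id" {
  description = "ID of an existing VPC (assumed to exist for this sketch)."
  type        = string
}

# Security groups created by Terraform start with no egress rules, because
# the provider removes the default allow-all egress rule AWS would add.
resource "aws_security_group" "app" {
  name_prefix = "app-"
  description = "Application tasks"
  vpc_id      = var.vpc_id
}

resource "aws_security_group" "db" {
  name_prefix = "db-"
  description = "Postgres instances"
  vpc_id      = var.vpc_id
}

# Ingress: the database accepts Postgres traffic from the app group only.
resource "aws_vpc_security_group_ingress_rule" "db_from_app" {
  security_group_id            = aws_security_group.db.id
  referenced_security_group_id = aws_security_group.app.id
  description                  = "Postgres from app tasks"
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}

# Egress: the app may reach the database, and nothing else is opened here.
resource "aws_vpc_security_group_egress_rule" "app_to_db" {
  security_group_id            = aws_security_group.app.id
  referenced_security_group_id = aws_security_group.db.id
  description                  = "Postgres to database"
  ip_protocol                  = "tcp"
  from_port                    = 5432
  to_port                      = 5432
}
```

A real application group would also need egress for image pulls and external APIs; those rules are omitted so that each one gets added with its own stated reason.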

Auditing existing infrastructure for real risk

The forge-audit skill is what you run when you inherit a cloud environment and need an honest assessment of what you have inherited. Forge reads the existing IaC and cloud configuration and produces a prioritized finding list covering IAM permissions that are over-privileged, public exposure on storage buckets and database instances, resources that are unencrypted at rest or in transit, idle and oversized instances that are running but serving no traffic, and missing backup policies that mean a failure event would result in data loss. The output is not a generic checklist; it is one finding per resource, with the specific misconfiguration, the severity, and the remediation steps in the exact IaC language you are using. The prioritization reflects actual risk: a publicly readable bucket with customer data is critical severity; an idle dev instance without a backup policy is low. Security teams and compliance auditors can use the output directly; engineering teams can use it as a backlog of infrastructure improvements with enough context to act immediately.
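To make the remediation side concrete, here is a sketch of what a fix for a "missing backup policy" finding might look like in Terraform. The article does not show the format of forge-audit findings, so the comment header is an invented illustration, and the names, schedule, and 35-day retention are assumptions.

```terraform
# Illustrative finding (hypothetical format, not actual forge-audit output):
#   scope: resources tagged Backup = "daily"
#   issue: no backup plan covers these resources
#   fix:   tag-based AWS Backup plan with 35-day retention

resource "aws_backup_vault" "main" {
  name = "acme-prod-backups"
}

resource "aws_backup_plan" "daily" {
  name = "acme-prod-daily"

  rule {
    rule_name         = "daily-0500-utc"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 5 * * ? *)"

    lifecycle {
      delete_after = 35 # days
    }
  }
}

# Role that the AWS Backup service assumes to take and store backups.
data "aws_iam_policy_document" "backup_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["backup.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "backup" {
  name               = "acme-prod-backup"
  assume_role_policy = data.aws_iam_policy_document.backup_assume.json
}

resource "aws_iam_role_policy_attachment" "backup" {
  role       = aws_iam_role.backup.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSBackupServiceRolePolicyForBackup"
}

# Select resources by tag so new databases are covered once they are tagged.
resource "aws_backup_selection" "tagged" {
  name         = "tagged-daily"
  plan_id      = aws_backup_plan.daily.id
  iam_role_arn = aws_iam_role.backup.arn

  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "daily"
  }
}
```

Selecting by tag rather than by ARN is a design choice: the remediation keeps working as resources are added, provided the tagging convention is enforced.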

The forge-audit skill in Tonone's Forge audits existing cloud infrastructure for IAM over-privilege, public storage exposure, unencrypted resources, and cost waste, producing a prioritized finding list with remediation steps in your IaC language.

Finding what the cloud bill is actually paying for

The forge-cost skill turns cloud cost analysis from a monthly ritual of confusion into an actionable engineering conversation. Forge analyzes cloud spend to identify idle resources that are running but serving nothing, instances that are sized for peak load they have never actually seen, committed use discount gaps where on-demand pricing is paying for stable workloads that qualify for reservations, and architectural patterns that are more expensive than their alternatives without being more reliable. The output is not a list of metrics; it is a set of specific changes with expected monthly savings per change, so engineering and finance can agree on a prioritized cost reduction plan rather than staring at a bill and guessing. For growing teams where cloud spend is becoming a material line item, forge-cost provides the infrastructure expertise to distinguish necessary spend from waste without requiring a dedicated FinOps function.

Diagnosing runtime infrastructure problems

The forge-diagnose skill is what you reach for when something in the infrastructure is wrong and you cannot figure out why. Cold start latency on a service that was fine last week. Connection timeouts that happen intermittently and do not correlate with any obvious pattern. Autoscaling that triggers too late and leaves the service under-provisioned during traffic spikes. Connection pool exhaustion that looks like application errors but is actually an infrastructure configuration problem. Forge diagnoses these by reading logs, metrics, and configuration together, not just the application logs, not just the cloud console metrics, but the combination of signals that reveals whether the problem is in the application, the infrastructure, or the interaction between them. The output identifies the actual cause rather than the visible symptom, with a remediation plan that addresses the root issue rather than masking it. For teams running on Claude Code, forge-diagnose is the fastest path from an infrastructure incident to a grounded diagnosis that can be acted on.
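One of the symptoms above, autoscaling that triggers too late, often comes down to a scaling target set too high or a scale-out cooldown set too long. The sketch below shows where those knobs live for an ECS service using target tracking. The variable names and every number are illustrative assumptions, not values Forge would necessarily choose.

```terraform
variable "ecs_cluster_name" {
  description = "Name of an existing ECS cluster (assumed for this sketch)."
  type        = string
}

variable "ecs_service_name" {
  description = "Name of an existing ECS service in that cluster (assumed)."
  type        = string
}

resource "aws_appautoscaling_target" "app" {
  service_namespace  = "ecs"
  resource_id        = "service/${var.ecs_cluster_name}/${var.ecs_service_name}"
  scalable_dimension = "ecs:service:DesiredCount"
  min_capacity       = 2
  max_capacity       = 10
}

resource "aws_appautoscaling_policy" "cpu" {
  name               = "cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.app.service_namespace
  resource_id        = aws_appautoscaling_target.app.resource_id
  scalable_dimension = aws_appautoscaling_target.app.scalable_dimension

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }

    # A lower target leaves headroom, so scale-out starts before saturation.
    target_value = 50

    # Short scale-out cooldown reacts to spikes quickly; the longer
    # scale-in cooldown avoids flapping once traffic drops.
    scale_out_cooldown = 60
    scale_in_cooldown  = 300
  }
}
```

Whether these values are right depends on how long a new task takes to become healthy, which is the kind of signal that has to come from logs and metrics rather than from the configuration alone.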

Inventorying what is actually running

Before Forge can build, audit, or optimize anything, it needs to know what exists. The forge-recon skill performs infrastructure reconnaissance: it inventories all cloud resources across accounts and regions, maps the connections between services, identifies configuration drift between what the IaC definitions say should exist and what is actually running, and flags high-risk items that warrant immediate attention. The output is a readable map of the cloud environment: not a raw export from the cloud console, but an organized summary of what is running, how it is connected, and where the risks are. For teams that have grown their infrastructure faster than their documentation, forge-recon produces the inventory that should have existed from the start. It is also the natural entry point before any forge-audit or forge-cost engagement: grounded context before opinions.
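Once drift is known, one common follow-up is to bring a resource that exists in the cloud but not in the IaC under management. The sketch below uses a stock Terraform import block (Terraform 1.5 or later) for that. It illustrates the general technique, not how forge-recon works internally, and the bucket name is a placeholder.

```terraform
# Adopt an existing, unmanaged bucket into state on the next apply.
import {
  to = aws_s3_bucket.legacy_uploads
  id = "acme-legacy-uploads" # placeholder: the real bucket name goes here
}

# The resource block must describe the bucket as it currently exists;
# any difference shows up as a proposed change in the plan.
resource "aws_s3_bucket" "legacy_uploads" {
  bucket = "acme-legacy-uploads"
}
```

For resources already in state, running terraform plan -refresh-only is the standard way to see where the real configuration has moved away from what the state records.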

The forge-recon skill in Tonone's Forge inventories cloud resources across accounts and regions, maps service connections, and identifies configuration drift between IaC definitions and what is actually running.

A worked example

A startup is scaling from a single-region Fly.io deployment to AWS with a proper VPC, private subnets, and a CDN in front of the application. They hand Forge the brief: "Set up a production VPC on AWS with a private ECS cluster, RDS Postgres in a private subnet, and CloudFront in front." Forge starts with a forge-recon of the existing Fly.io setup to understand the current architecture and traffic patterns, then produces a Terraform skeleton with cost and IAM notes inline.

The output is not a template; it is a production-grade starting point with explicit decisions documented, cost considerations noted, and IAM roles scoped to minimum permissions from the beginning:

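```terraform
# forge-infra output, production VPC skeleton
# Cloud: AWS  Region: us-east-1  Estimated monthly baseline: ~$180-240

module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "5.5.2"

  name = "acme-prod"
  cidr = "10.100.0.0/16"

  azs             = ["us-east-1a", "us-east-1b", "us-east-1c"]
  private_subnets = ["10.100.1.0/24", "10.100.2.0/24", "10.100.3.0/24"]
  public_subnets  = ["10.100.101.0/24", "10.100.102.0/24", "10.100.103.0/24"]

  # Dedicated database tier. Declaring it makes the module create the DB
  # subnet group that the RDS module references below.
  database_subnets = ["10.100.201.0/24", "10.100.202.0/24", "10.100.203.0/24"]

  # One NAT gateway per AZ for HA, adds ~$135/mo. Drop to a single NAT
  # gateway if cost is priority over resilience in early stages.
  enable_nat_gateway     = true
  single_nat_gateway     = false # true saves ~$90/mo; acceptable for staging
  one_nat_gateway_per_az = true

  enable_flow_log                      = true
  flow_log_destination_type            = "cloud-watch-logs"
  create_flow_log_cloudwatch_log_group = true
  create_flow_log_cloudwatch_iam_role  = true
}

# ECS cluster, no EC2 launch type; Fargate for ops simplicity
resource "aws_ecs_cluster" "app" {
  name = "acme-prod"

  setting {
    name  = "containerInsights"
    value = "enabled"
  }
}

# RDS security group: Postgres reachable from the private subnets only
resource "aws_security_group" "rds" {
  name_prefix = "acme-prod-rds-"
  vpc_id      = module.vpc.vpc_id

  ingress {
    description = "Postgres from private subnets"
    from_port   = 5432
    to_port     = 5432
    protocol    = "tcp"
    cidr_blocks = module.vpc.private_subnets_cidr_blocks
  }
}

# RDS Postgres, database subnets only; no public endpoint
module "rds" {
  source  = "terraform-aws-modules/rds/aws"
  version = "6.6.0"

  identifier        = "acme-prod-pg"
  engine            = "postgres"
  engine_version    = "16"
  family            = "postgres16"    # parameter group family for engine 16
  instance_class    = "db.t4g.medium" # upgrade to db.r8g.large if p99 > 20ms
  allocated_storage = 100
  storage_encrypted = true # required; default RDS KMS key unless kms_key_id is set
  username          = "acme_admin" # placeholder master username

  db_subnet_group_name   = module.vpc.database_subnet_group_name
  vpc_security_group_ids = [aws_security_group.rds.id]
  publicly_accessible    = false # never true in prod

  backup_retention_period = 7
  deletion_protection     = true
}

# Trust policy: only ECS tasks may assume the execution role
data "aws_iam_policy_document" "ecs_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

# IAM, task execution role scoped to ECR pull + Secrets Manager only
resource "aws_iam_role" "ecs_task_exec" {
  name               = "acme-prod-ecs-task-exec"
  assume_role_policy = data.aws_iam_policy_document.ecs_assume.json
  # Inline policy is attached separately; no managed AdministratorAccess
}
```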

This is the kind of infrastructure starting point a senior cloud engineer would produce on their first day with a new client: complete enough to deploy, documented enough to understand, and opinionated enough to prevent the obvious mistakes. The cost notes mean the team can decide how much HA they want to pay for before the infrastructure exists. The IAM comments make the intended scope of the task execution role explicit, so an over-privileged role is much harder to slip into production unnoticed.
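The skeleton's comments describe an inline policy limited to ECR pull and Secrets Manager but do not show it. One way it could be written is sketched below; it attaches to the aws_iam_role.ecs_task_exec resource from the skeleton above. The two variables are hypothetical inputs introduced for this illustration, and this is not Forge's actual output.

```terraform
variable "ecr_repository_arn" {
  description = "ARN of the ECR repository the tasks pull from (assumed input)."
  type        = string
}

variable "app_secret_arns" {
  description = "ARNs of the Secrets Manager secrets injected into tasks (assumed input)."
  type        = list(string)
}

data "aws_iam_policy_document" "ecs_task_exec" {
  # The ECR login token is account-wide; this action cannot be scoped
  # to a single repository.
  statement {
    sid       = "EcrAuth"
    actions   = ["ecr:GetAuthorizationToken"]
    resources = ["*"]
  }

  # Image pull, limited to the one repository.
  statement {
    sid = "EcrPull"
    actions = [
      "ecr:BatchCheckLayerAvailability",
      "ecr:GetDownloadUrlForLayer",
      "ecr:BatchGetImage",
    ]
    resources = [var.ecr_repository_arn]
  }

  # Read only the secrets this service needs.
  statement {
    sid       = "ReadAppSecrets"
    actions   = ["secretsmanager:GetSecretValue"]
    resources = var.app_secret_arns
  }
}

resource "aws_iam_role_policy" "ecs_task_exec" {
  name   = "ecr-pull-and-secrets"
  role   = aws_iam_role.ecs_task_exec.id
  policy = data.aws_iam_policy_document.ecs_task_exec.json
}
```

If the tasks use the awslogs log driver, the execution role also needs logs:CreateLogStream and logs:PutLogEvents on the relevant log group; that statement is left out here to match the scope stated in the skeleton's comment.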

If you need production-grade infrastructure as code across AWS, GCP, Azure, or edge providers, whether you are building from scratch, auditing what exists, investigating a cloud bill, or debugging a runtime problem, Forge is the specialist for it. Run /forge-infra with a brief description of what you need and get IaC with IAM, cost, and security baked in from the start.

Forge vs the alternatives

Forge is not competing with Terraform documentation or a cloud provider's wizard; it is the specialist who knows when each tool applies, what the production requirements are, and what a generalist tool will skip. The comparison below captures the functional differences that matter when you are building or auditing real cloud infrastructure.

| Capability | Tonone | Generalist chatbot | Cursor / Copilot |
| --- | --- | --- | --- |
| IaC with IAM least-privilege from the start | Yes: IAM roles scoped to minimum permissions in every forge-infra output | No: IAM is typically left as an exercise or uses managed admin policies | No: autocomplete suggests patterns without IAM opinions |
| Cost awareness in infrastructure output | Yes: estimated monthly cost and right-sizing notes inline in the IaC | No: no cost context in generated code | No: no project-level cost reasoning |
| Security audit of existing cloud setup | Yes: forge-audit produces prioritized findings with remediation steps per resource | Partial: can review code you paste, but no cloud-native resource inventory | No: code suggestions only, no infrastructure audit capability |
| Runtime infrastructure diagnostics | Yes: forge-diagnose reads logs, metrics, and config together to find root cause | Partial: can reason about logs you paste but lacks cloud context | No: no runtime observability integration |
| Multi-cloud coverage (AWS, GCP, Azure, Fly, Cloudflare) | Yes: detects provider and produces idiomatic IaC per platform | Partial: knows syntax but no production opinions per provider | No: provider-specific completions vary widely in quality |
| Configuration drift detection | Yes: forge-recon compares IaC definitions against what is actually running | No: no cloud state access | No: file-level only, no cloud state awareness |

Tonone's Forge produces infrastructure as code that is production-grade from the first commit, not a starting template that requires a security review before it is safe to deploy.

Install and try

Tonone is free and MIT-licensed. Install it once and all 23 agents, including Forge, are available in your Claude Code session.

1. Add to marketplace

$ claude plugin marketplace add tonone-ai/tonone

2. Install Forge

$ claude plugin install forge@tonone-ai

Frequently asked questions

What does Tonone's Forge do?
Forge is Tonone's AI cloud infrastructure engineer. It builds production-grade infrastructure as code across GCP, AWS, Azure, Cloudflare, and Fly.io using Terraform, Pulumi, or CDK. It also audits existing cloud setups for security misconfigurations and cost waste, diagnoses runtime infrastructure problems, and inventories cloud resources across accounts and regions.
How is Forge different from asking ChatGPT to write Terraform?
ChatGPT produces Terraform that validates but typically skips IAM least-privilege, encryption settings, backup policies, and cost-aware instance sizing. Forge is a specialist agent that treats those as first-class requirements: every forge-infra output includes IAM scoped to minimum permissions, cost estimates, and security configuration from the start.
Can Forge audit an existing cloud environment I did not build?
Yes. The forge-audit skill reads your existing IaC and cloud configuration and produces a prioritized finding list covering IAM over-privilege, public storage exposure, unencrypted resources, idle instances, and missing backup policies. Each finding includes severity and remediation steps in your IaC language.
What AI can help me reduce my AWS or GCP cloud bill?
Tonone's forge-cost skill analyzes your cloud spend to find idle resources, oversized instances, committed use discount gaps, and architectural changes that reduce cost without reducing capacity. The output includes expected monthly savings per change so you can prioritize.
What does forge-diagnose do for infrastructure incidents?
forge-diagnose reads logs, metrics, and configuration together to find the actual root cause of runtime infrastructure problems: cold start latency, connection timeouts, autoscaling failures, network anomalies, and connection pool exhaustion. It identifies the cause rather than the symptom, with a remediation plan.
Does Forge work with AWS, GCP, and Azure?
Yes. Forge works across AWS, GCP, Azure, Cloudflare, and Fly.io. It detects your cloud provider from the existing project context and produces idiomatic IaC in Terraform, Pulumi, or CDK depending on what your project already uses.
How do I install Tonone's Forge agent?
Install Tonone via the get-started guide at tonone.ai/get-started. Forge is one of 23 agents included in the Tonone package. Invoke it with slash commands like /forge-infra, /forge-audit, or /forge-cost. Tonone is free and MIT-licensed.
What is forge-recon and when should I run it?
forge-recon performs infrastructure reconnaissance: inventorying all cloud resources across accounts and regions, mapping connections between services, and identifying configuration drift between your IaC definitions and what is actually running. Run it when inheriting a cloud environment or before any audit or cost analysis engagement.
