Verify AWS infrastructure configuration before deployment. Use when validating VPC endpoints, NAT Gateway capacity, security groups, or debugging network path issues that cause Lambda connection timeouts.
**Tech Stack**: AWS CLI, Terraform, VPC, CloudWatch, bash
**Source**: Extracted from PDF S3 upload timeout investigation (2026-01-05) and Infrastructure-Application Contract principle.
---
Use the infrastructure-verification skill when:
**DO NOT use this skill for:**
---
**From CLAUDE.md Principle #15:**
> "Before deploying code that depends on AWS infrastructure (S3, VPC endpoints, NAT Gateway), verify infrastructure exists and is correctly configured. Network path issues cause deterministic failure patterns."
**When to validate:**
**Failure Pattern Types:**
| Pattern | Root Cause | Investigation Priority |
|---------|------------|----------------------|
| **First N succeed, last M fail** | Infrastructure bottleneck (NAT, connection limits) | HIGH - VPC endpoint missing |
| **Random scattered failures** | Performance issue (slow API, memory) | MEDIUM - Optimize code |
| **All operations fail** | Configuration issue (permissions, endpoint) | HIGH - Fix config |
| **Intermittent failures** | Rate limiting, transient network | LOW - Add retries |
A **deterministic pattern** (first N succeed, last M fail) is the strongest signal of an infrastructure bottleneck.
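As a rough illustration of the table above, the sketch below classifies one batch of results into those pattern types. The function name `classify_failure_pattern` and the `S`/`F` encoding (one character per operation, in execution order) are assumptions made for this example, not part of the original investigation. A single batch cannot separate random scattered failures from intermittent ones, so both map to the same label.

```shell
# Hypothetical helper: classify a batch of outcomes.
# Input: one string, one character per operation in execution order
#        ("S" = success, "F" = failure), e.g. "SSSSSSSFFF".
classify_failure_pattern() {
  local outcomes="$1"

  # Reject empty input or any character other than S / F.
  case "$outcomes" in
    ""|*[!SF]*) echo "invalid"; return 1 ;;
  esac

  case "$outcomes" in
    *F*S*) echo "scattered-or-intermittent" ;;    # a success follows a failure
    *S*F*) echo "first-n-succeed-last-m-fail" ;;  # successes, then only failures
    *F*)   echo "all-fail" ;;                     # no success at all
    *)     echo "no-failures" ;;
  esac
}
```

For example, `classify_failure_pattern "SSSSSSSFFF"` prints `first-n-succeed-last-m-fail`, the pattern the table associates with an infrastructure bottleneck.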
---
**Use when:** Lambda-in-VPC needs to access S3 or DynamoDB
**Steps:**
```bash
# Confirm an S3 endpoint exists in the VPC
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=vpc-xxx" \
            "Name=service-name,Values=com.amazonaws.ap-southeast-1.s3" \
  --query 'VpcEndpoints[*].{ID:VpcEndpointId,State:State,Service:ServiceName}' \
  --output table

# Check the endpoint state (expect "available")
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-xxx \
  --query 'VpcEndpoints[0].State' \
  --output text

# List the route tables attached to the endpoint
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-xxx \
  --query 'VpcEndpoints[0].RouteTableIds' \
  --output table

# Describe the subnets the Lambda runs in
# (no -I here: xargs must split the tab-separated IDs into separate arguments)
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SubnetIds' \
  --output text | xargs aws ec2 describe-subnets --subnet-ids

# Look for a route that targets the endpoint
# (this checks only the first route table returned for the VPC)
ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxx" \
  --query 'RouteTables[0].RouteTableId' \
  --output text)
aws ec2 describe-route-tables \
  --route-table-ids "$ROUTE_TABLE_ID" \
  --query 'RouteTables[*].Routes[?GatewayId==`vpce-xxx`]'
```
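The state check above can be wrapped in a small function so a deploy script can gate on it. This is a minimal sketch: the function name `endpoint_is_available` is hypothetical, and it assumes the AWS CLI is configured with permission to call `ec2:DescribeVpcEndpoints`.

```shell
# Hypothetical helper: succeed only when the VPC endpoint reports "available".
# Usage: endpoint_is_available vpce-xxx
endpoint_is_available() {
  local endpoint_id="$1" state

  # Same query as the manual state check; return 2 if the CLI call itself fails.
  state=$(aws ec2 describe-vpc-endpoints \
    --vpc-endpoint-ids "$endpoint_id" \
    --query 'VpcEndpoints[0].State' \
    --output text) || return 2

  echo "endpoint $endpoint_id state: $state"
  [ "$state" = "available" ]
}
```

A deploy script could then run `endpoint_is_available vpce-xxx || exit 1` before publishing Lambda code that depends on the endpoint.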
**Verification checklist:**
**Common issues:**
**Use when:** Investigating Lambda connection timeouts with external services
**Steps:**
```bash
# List NAT Gateways in the VPC with their state and public IP
aws ec2 describe-nat-gateways \
  --filter "Name=vpc-id,Values=vpc-xxx" \
  --query 'NatGateways[*].{ID:NatGatewayId,State:State,PublicIp:NatGatewayAddresses[0].PublicIp}' \
  --output table

# Show which route tables send traffic through a NAT Gateway
# (RouteTableId belongs to the route table, not to each route, so select it at that level)
aws ec2 describe-route-tables \
  --filters "Name=vpc-id,Values=vpc-xxx" \
  --query 'RouteTables[*].{RouteTable:RouteTableId,NatRoutes:Routes[?NatGatewayId!=`null`].[DestinationCidrBlock,NatGatewayId]}' \
  --output json

# Print Lambda start times for the last 5 minutes
# (CloudWatch timestamps are in milliseconds; date -d @ expects seconds)
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '5 minutes ago' +%s)000" \
  --filter-pattern "START RequestId" \
  --query 'events[*].timestamp' \
  --output text | tr '\t' '\n' | while read -r ts; do
    [ -n "$ts" ] && date -d "@$((ts / 1000))"
  done

# Find connection timeout errors
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "ConnectTimeoutError" \
  --query 'events[*].message' \
  --output text

# Count Lambda starts in the last minute (a rough proxy for concurrency)
CONCURRENT_LAMBDAS=$(aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '1 minute ago' +%s)000" \
  --filter-pattern "START RequestId" \
  --query 'length(events)' \
  --output text)
echo "Concurrent Lambdas: $CONCURRENT_LAMBDAS"
echo "NAT Gateway connection limit: ~55,000 (but establishment rate limited)"
```
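The one-minute START count is only a coarse proxy for concurrency. Because the concern here is how fast connections are established, a per-second view of the burst can be more telling. The sketch below, with the hypothetical name `peak_starts_per_second`, reads millisecond timestamps (one per line) and prints the largest number of starts that fall into any single second. It assumes `awk` is available and says nothing about the NAT Gateway's actual rate limit.

```shell
# Hypothetical helper: peak number of Lambda starts within any one-second bucket.
# Input (stdin): epoch timestamps in milliseconds, one per line.
peak_starts_per_second() {
  awk '
    NF  { count[int($1 / 1000)]++ }          # bucket each timestamp by whole second
    END {
      max = 0
      for (b in count) if (count[b] > max) max = count[b]
      print max                               # prints 0 when there is no input
    }
  '
}

# Example wiring (not executed here):
#   aws logs filter-log-events \
#     --log-group-name /aws/lambda/my-function \
#     --filter-pattern "START RequestId" \
#     --query 'events[*].timestamp' \
#     --output text | tr '\t' '\n' | peak_starts_per_second
```

A high peak followed by timeouts on the later invocations is consistent with the "first N succeed, last M fail" pattern.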
**NAT Gateway saturation indicators:**
**Solution:** Add a VPC Gateway Endpoint for S3/DynamoDB so that traffic to those services bypasses the NAT Gateway.
**Use when:** Verifying Lambda can reach AWS services
**Steps:**
```bash
# Show the Lambda's VPC, subnets, and security groups
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.{VpcId:VpcId,SubnetIds:SubnetIds,SecurityGroupIds:SecurityGroupIds}' \
  --output json

# List the security group's egress rules
aws ec2 describe-security-groups \
  --group-ids sg-xxx \
  --query 'SecurityGroups[*].IpPermissionsEgress[*].{Proto:IpProtocol,Port:FromPort,Dest:IpRanges[0].CidrIp}' \
  --output table

# Find the route table associated with the Lambda's first subnet
SUBNET_ID=$(aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SubnetIds[0]' \
  --output text)
ROUTE_TABLE_ID=$(aws ec2 describe-route-tables \
  --filters "Name=association.subnet-id,Values=$SUBNET_ID" \
  --query 'RouteTables[0].RouteTableId' \
  --output text)

# Inspect that route table's routes
aws ec2 describe-route-tables \
  --route-table-ids "$ROUTE_TABLE_ID" \
  --query 'RouteTables[*].Routes[*].[DestinationCidrBlock,GatewayId,NatGatewayId]' \
  --output table

# Check recent logs from the network-test function
aws logs tail /aws/lambda/network-test --since 1m
```
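Reading the route listing by eye gets tedious across several subnets. As a sketch, the function below (hypothetical name `summarize_routes`) takes the same three columns, destination, gateway ID, and NAT Gateway ID, as whitespace-separated text and reports which paths exist. It assumes the routes query was run with `--output text`, where missing values appear as `None`.

```shell
# Hypothetical helper: summarize which network paths a route table offers.
# Input (stdin): one route per line: <DestinationCidrBlock> <GatewayId> <NatGatewayId>
summarize_routes() {
  awk '
    $2 ~ /^vpce-/                      { endpoint = "yes" }  # gateway endpoint route
    $1 == "0.0.0.0/0" && $3 ~ /^nat-/  { nat = "yes" }       # default route via NAT
    $1 == "0.0.0.0/0" && $2 ~ /^igw-/  { igw = "yes" }       # default route via internet gateway
    END {
      printf "gateway_endpoint=%s nat_default=%s igw_default=%s\n",
        (endpoint ? endpoint : "no"), (nat ? nat : "no"), (igw ? igw : "no")
    }
  '
}
```

A result of `gateway_endpoint=no nat_default=yes` means S3 and DynamoDB traffic from that subnet still goes through the NAT Gateway.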
**Network path checklist:**
**Use when:** After deploying infrastructure changes (VPC endpoints, security groups)
**Steps:**
```bash
# Confirm the Terraform outputs
cd terraform
terraform output s3_vpc_endpoint_id     # Should return vpce-xxx
terraform output s3_vpc_endpoint_state  # Should return "available"

# Invoke the function once
# (AWS CLI v2 needs --cli-binary-format raw-in-base64-out to accept a raw JSON payload)
aws lambda invoke \
  --function-name my-function \
  --payload '{"test": true}' \
  /tmp/response.json
jq . /tmp/response.json

# Watch the logs (Ctrl-C to stop following)
aws logs tail /aws/lambda/my-function --since 1m --follow

# Fire 10 concurrent asynchronous invocations
for i in {1..10}; do
  aws lambda invoke \
    --function-name my-function \
    --payload "{\"id\": $i}" \
    --invocation-type Event \
    "/tmp/response_$i.json" &
done
wait

# Count connection timeouts in the last 5 minutes (expect 0)
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '5 minutes ago' +%s)000" \
  --filter-pattern "ConnectTimeoutError" \
  --query 'length(events)' \
  --output text

# Count success markers in the last 5 minutes
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '5 minutes ago' +%s)000" \
  --filter-pattern "✅" \
  --query 'length(events)' \
  --output text
```
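The two counts from the last two commands can be turned into a pass/fail verdict for a deploy pipeline. The sketch below uses a hypothetical function name, `smoke_test_verdict`, and a deliberately strict rule chosen for this example: any timeout fails the check.

```shell
# Hypothetical helper: turn success / timeout counts into a verdict.
# Usage: smoke_test_verdict <success_count> <timeout_count>
# Exit status: 0 = pass, 1 = fail, 2 = no data.
smoke_test_verdict() {
  local successes="$1" timeouts="$2"
  local total=$((successes + timeouts))

  if [ "$total" -eq 0 ]; then
    echo "NO DATA: no matching log events found"
    return 2
  fi
  if [ "$timeouts" -gt 0 ]; then
    echo "FAIL: $timeouts of $total operations timed out"
    return 1
  fi
  echo "PASS: $successes operations succeeded, 0 timed out"
}
```

For instance, `smoke_test_verdict 7 3` prints `FAIL: 3 of 10 operations timed out` and returns a non-zero status.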
**Post-deployment checklist:**
---
**Symptom:**
**Diagnosis:**
```bash
aws ec2 describe-vpc-endpoints \
--filters "Name=vpc-id,Values=vpc-xxx" \
"Name=service-name,Values=com.amazonaws.region.s3"
```
**Fix:**
```hcl
data "aws_route_tables" "vpc_route_tables" {
  vpc_id = data.aws_vpc.default.id
}

resource "aws_vpc_endpoint" "s3" {
  vpc_id            = data.aws_vpc.default.id
  service_name      = "com.amazonaws.${var.aws_region}.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = data.aws_route_tables.vpc_route_tables.ids

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = "s3:*"
      Resource  = "*"
    }]
  })

  tags = {
    Name = "s3-endpoint"
  }
}

output "s3_vpc_endpoint_id" {
  value = aws_vpc_endpoint.s3.id
}

output "s3_vpc_endpoint_state" {
  value = aws_vpc_endpoint.s3.state
}
```
**Verification:**
```bash
cd terraform
terraform apply
terraform output s3_vpc_endpoint_state # Should be "available"
aws lambda invoke --function-name my-function --payload '{}' /tmp/response.json
aws logs tail /aws/lambda/my-function --since 1m
```
**Symptom:**
**Diagnosis:**
```bash
# Timeline of Lambda starts over the last 30 minutes (HH:MM:SS in UTC, then request ID)
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --start-time "$(date -d '30 minutes ago' +%s)000" \
  --filter-pattern "START RequestId" \
  | jq -r '.events[] | .timestamp as $ts | ($ts/1000 | strftime("%H:%M:%S")) + " " + (.message | split(" ")[2])'

# Request IDs of the invocations that hit connection timeouts
aws logs filter-log-events \
  --log-group-name /aws/lambda/my-function \
  --filter-pattern "ConnectTimeoutError" \
  | jq -r '.events[].message' | grep -o "RequestId: [a-z0-9-]*"
```
**Root Cause:**
**Fix:** Add S3 VPC Gateway Endpoint (see Issue 1)
**Why this works:**
**Symptom:**
**Diagnosis:**
```bash
aws lambda get-function-configuration \
--function-name my-function \
--query 'VpcConfig.SecurityGroupIds[0]' \
--output text | xargs -I {} aws ec2 describe-security-groups --group-ids {}
```
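The full security group dump still has to be read by eye. As a sketch, the function below (hypothetical name `allows_https_egress`) checks a trimmed listing of egress rules for one that covers TCP port 443. It assumes one rule per line in the form `<IpProtocol> <FromPort> <ToPort>`, for example from `--query 'SecurityGroups[0].IpPermissionsEgress[].[IpProtocol,FromPort,ToPort]' --output text`. It looks only at protocol and port range and ignores the destination CIDR or prefix list.

```shell
# Hypothetical helper: succeed if any egress rule covers TCP port 443.
# Input (stdin): one rule per line: <IpProtocol> <FromPort> <ToPort>
# A protocol of -1 means "all traffic" (ports then show as None).
allows_https_egress() {
  awk '
    $1 == "-1"                            { ok = 1 }  # all protocols, all ports
    $1 == "tcp" && $2 <= 443 && $3 >= 443 { ok = 1 }  # TCP range includes 443
    END { exit !ok }                                   # exit 0 only when a rule matched
  '
}
```

If the function fails for the Lambda's security group, outbound HTTPS is blocked at the port level and an egress rule for port 443 is needed.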
**Fix:**
```hcl
resource "aws_security_group_rule" "lambda_egress_https" {
  type              = "egress"
  from_port         = 443
  to_port           = 443
  protocol          = "tcp"
  cidr_blocks       = ["0.0.0.0/0"]
  security_group_id = aws_security_group.lambda.id
}
```
**Symptom:**
**Diagnosis:**
```bash
# Route tables the endpoint is attached to
aws ec2 describe-vpc-endpoints \
  --vpc-endpoint-ids vpce-xxx \
  --query 'VpcEndpoints[0].RouteTableIds' \
  --output table

# Route table explicitly associated with the Lambda's first subnet
# (prints "None" when the subnet relies on the VPC's main route table)
aws lambda get-function-configuration \
  --function-name my-function \
  --query 'VpcConfig.SubnetIds[0]' \
  --output text | xargs -I {} aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values={}" \
    --query 'RouteTables[0].RouteTableId' \
    --output text
```
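The comparison between the two outputs can be scripted. In this sketch, the function name `route_table_attached` is hypothetical; it takes the Lambda subnet's route table ID followed by the endpoint's route table IDs and succeeds only if the first appears among the rest.

```shell
# Hypothetical helper: is the Lambda's route table among the endpoint's route tables?
# Usage: route_table_attached <lambda_rtb_id> <endpoint_rtb_id>...
route_table_attached() {
  local lambda_rtb="$1" rtb
  shift

  for rtb in "$@"; do
    [ "$rtb" = "$lambda_rtb" ] && return 0
  done
  return 1
}
```

For example, `route_table_attached rtb-b rtb-a rtb-b` succeeds, while `route_table_attached rtb-c rtb-a rtb-b` fails, which is the mismatch this issue describes.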
**Fix:**
```hcl
data "aws_route_tables" "vpc_route_tables" {
  vpc_id = data.aws_vpc.default.id
}

resource "aws_vpc_endpoint" "s3" {
  # ... other config ...

  # Attach to ALL route tables (includes Lambda subnets)
  route_table_ids = data.aws_route_tables.vpc_route_tables.ids
}
```
---
- Investigating Lambda timeout patterns
- Debugging connection failures
- Analyzing deterministic failure patterns
- Infrastructure confirmed correct but errors persist
- Need to analyze application logs
- Debugging business logic failures
- BEFORE deploying Lambda-in-VPC code
- AFTER deploying infrastructure changes (Terraform apply)
- During post-deployment validation
---
| Type | Services | Cost | Use Case |
|------|----------|------|----------|
| **Gateway** | S3, DynamoDB | FREE | High-throughput data access |
| **Interface** | Most AWS services | ~$7.50/month per endpoint per AZ, plus data processing | Other services (Secrets Manager, etc.) |
| Limit | Value | Impact |
|-------|-------|--------|
| **Concurrent connections** | 55,000 per unique destination | Theoretical max |
| **Connection establishment rate** | Limited | Causes saturation with concurrent Lambdas |
| **Data transfer cost** | $0.045/GB | Expensive for large transfers |
**Recommendation:** Use VPC Gateway Endpoints for S3/DynamoDB (free, not subject to NAT Gateway connection limits, and faster).
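To put the data transfer row in concrete terms, the sketch below multiplies a transfer volume by the per-GB rate. The function name `nat_processing_cost` is hypothetical, and the default rate is the $0.045/GB figure from the table; actual pricing varies by region, so treat the result as an estimate.

```shell
# Hypothetical helper: estimate NAT Gateway data processing cost in USD.
# Usage: nat_processing_cost <gigabytes> [rate_per_gb]
# The default rate is the table's $0.045/GB figure (an assumption; check regional pricing).
nat_processing_cost() {
  awk -v gb="$1" -v rate="${2:-0.045}" 'BEGIN { printf "%.2f\n", gb * rate }'
}
```

For example, `nat_processing_cost 1000` prints `45.00`: moving 1 TB of S3 traffic through the NAT Gateway costs about $45 at that rate, while the same traffic through a Gateway Endpoint carries no such charge.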
```bash
aws ec2 describe-vpc-endpoints --vpc-endpoint-ids vpce-xxx
aws ec2 describe-nat-gateways --nat-gateway-ids nat-xxx
aws ec2 describe-security-groups --group-ids sg-xxx
aws ec2 describe-route-tables --route-table-ids rtb-xxx
aws lambda get-function-configuration --function-name my-function --query 'VpcConfig'
```
---
```
.claude/skills/infrastructure-verification/
└── SKILL.md # This file (complete skill)
```
---