This will be a series of blog posts where we build up the perfect infrastructure setup for the majority of use cases, aka “The Stack”. We’ll be building everything on top of AWS.

Before diving in, let’s first establish some goals:

Performant: Latency and performance are important, as this will serve end-users.
Low cost: We want our base cost to be low, and our costs to scale well with high traffic. Ideally, it should cost nothing if no users are using it.
Low operational overhead: It’s >=2023 and nobody wants to nurse servers or services anymore. Things should scale up and down without intervention or oversight.
High flexibility: Everything should be built with foresight for future scalability, whether organizationally, code-wise, or in the way things fit together.
Modular: Pieces of the infrastructure should be opt-out, e.g. if you don’t need Pub/Sub, it can be skipped.

Obviously, this reflects my personal opinion, but I’ll be sharing the thinking behind each of the choices as we go along.

Some technology choices upfront:

Everything will be infrastructure as code using AWS CDK.
We’ll be using Rust throughout for each service, as it allows us to squeeze out the most performance while still giving us a nice developer experience.
Federated GraphQL will be our way of facilitating microservices.

What will we be covering?

You can see a sneak peek of the final setup below. All of this will be covered in parts:

Part 0: The introduction and goals (this post)
- The goals of “The Stack” and architecture overview
Part 1: Setting up your AWS Account Structure
- Setting up Control Tower and all of our AWS Accounts
Part 2: Automating Deployments via CI
- Bootstrapping CDK and deploying to all accounts via CI
Part 3: Creating our Frontend
- Creating an SPA and deploying it to S3 + CloudFront
Part 4: A Federated GraphQL API
- Federated GraphQL in Lambda with three subgraphs
Part 5: Adding Databases (DynamoDB)
- Make our subgraphs use DynamoDB for storage
Part 6: Using our API from the Frontend
- Connecting our Frontend to the API and setting up our GraphQL clients
Part 7: Preview Environments in CI
- Spin up environments in Pull Requests using GitHub Actions
Part 8: Asynchronous work and processing
- Queuing up work with SQS and decoupling services via Pub/Sub using EventBridge
Part 9: Notifications and emails
- Sending emails and Push Notifications
Part 10: Monitoring, traces, and debugging
- X-Ray traces and CloudWatch Dashboards
Part 11 (Bonus): WebSocket support for GraphQL
- Support GraphQL subscriptions via API Gateway’s WebSocket support
Part 12 (Bonus): Video transcoding and image resizing
- Transcode video files with MediaConvert and resize images on-the-fly
Part 13 (Bonus): Mobile App
- Package the Frontend as a Mobile App
Part 14 (Bonus): Stripe integration
- Integrating Stripe for payments and billing
Part 15 (Bonus): Billing breakdown
- Forming an overview of service costs via Billing Tags and the Cost Explorer

Account Structure and Governance

Reorganizing your AWS Account structure is a pain, so let’s see if we can get this right from the beginning. There are a few things that direct our choices here:

Isolation of environments to avoid sharing soft and hard limits between environments (e.g. Lambda concurrency limits)
- We will be putting each environment in its own AWS Account
Security, audit, and control to ease GDPR requirements
- We’ll use AWS Control Tower to give us full control of sub-accounts in our Organization

Let’s first sketch out the Account Governance structure, before diving into the view of each individual AWS account:

Control Tower: This is your central place to control access and policies for all accounts in your organization
Production Multi-tenant: Your primary production account for the multi-tenant setup, and most likely where the majority of users will be
Production Single-tenant: While it’s desirable to avoid the operational overhead of single-tenant setups, it’s good to plan for them from the get-go
Integration Test: This will be the account that IaC deployments get tested on to ensure rollout works
Preview: This will be used to spin up Preview Environments later on
Individual Developer: Individual developer accounts to allow easy IaC testing and exploration
Monitoring: Centralize monitoring and observability into one account, allowing access to insights without access to sensitive logs or infrastructure from the other accounts
Logs: Centralized storage of logs, which may require different access considerations than metrics and traces
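
To make the grouping above a bit more tangible, here is a minimal sketch of the account list as plain data. It is written in TypeScript only because that is a common language for CDK apps, which is an assumption on my part. The `AccountSpec` type, the `purpose` categories, and the `accountsFor` helper are all hypothetical names for illustration, and treating Preview as a workload account is likewise an assumption. Only the account names come from the list above.

```typescript
// Hypothetical model of the account list; only the account names are from the post.
type AccountPurpose = "governance" | "workload" | "observability";

interface AccountSpec {
  name: string;
  purpose: AccountPurpose;
}

const accounts: AccountSpec[] = [
  { name: "Control Tower", purpose: "governance" },
  { name: "Production Multi-tenant", purpose: "workload" },
  { name: "Production Single-tenant", purpose: "workload" },
  { name: "Integration Test", purpose: "workload" },
  // Assumption: preview environments run the same services, so Preview is a workload account.
  { name: "Preview", purpose: "workload" },
  { name: "Individual Developer", purpose: "workload" },
  { name: "Monitoring", purpose: "observability" },
  { name: "Logs", purpose: "observability" },
];

// Return the names of all accounts that share a given purpose.
function accountsFor(purpose: AccountPurpose): string[] {
  return accounts
    .filter((account) => account.purpose === purpose)
    .map((account) => account.name);
}

const workloadAccounts = accountsFor("workload");
console.log(workloadAccounts.join(", "));
```

Keeping the list as data like this would make it easy to loop over accounts when deploying, but how the series actually wires this up is covered in the later parts, not here.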

graph TD
  subgraph ControlTower[AWS: Control Tower]
    AuditLog[Audit Log]
    GuardRails[Guard Rails]
  end
  ControlTower-->AWSProdMultiTenantAccount
  ControlTower-->AWSProdSingleTenantAccount
  ControlTower-->AWSIntegrationTestAccount
  ControlTower-->AWSPreviewAccount
  ControlTower-->AWSIndividualDeveloperAccount
  ControlTower-->AWSMonitoringAccount
  ControlTower-->AWSLogsAccount

  subgraph AWSProdMultiTenantAccount[AWS: Production Multi-tenant]
    AccountFillerProdMultiTenant[...]
  end

  subgraph AWSProdSingleTenantAccount[AWS: Production Single-tenant]
    AccountFillerProdSingleTenant[...]
  end

  subgraph AWSIntegrationTestAccount[AWS: Integration Test]
    AccountFillerIntegrationTest[...]
  end

  subgraph AWSPreviewAccount[AWS: Preview]
    AccountFillerPreview[...]
  end

  subgraph AWSIndividualDeveloperAccount[AWS: Individual Developer]
    AccountFillerIndividualDeveloper[...]
  end

  subgraph AWSMonitoringAccount[AWS: Monitoring]
    direction LR
    CloudWatchDashboards[CloudWatch Dashboards]
    CloudWatchMetrics[CloudWatch Metrics/Alarms]
    XRay[XRay Analytics]
  end

  subgraph AWSLogsAccount[AWS: Logs]
    CloudWatchLogs[CloudWatch Logs]
  end

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class ControlTower,AWSProdMultiTenantAccount,AWSProdSingleTenantAccount,AWSIntegrationTestAccount,AWSPreviewAccount,AWSIndividualDeveloperAccount,AWSMonitoringAccount,AWSLogsAccount container;

Service Infrastructure

The infrastructure accounts (Production, Integration, Developer) all hold the same services and follow the same setup. The infrastructure we will build might seem complex at first, and it is, but as we go through each piece everything will start to make sense.

The diagram gets quite large, so we will split it up into three parts:

Client to Frontend
Client to API
Asynchronous Work and Media

Client to Frontend

Let’s focus first on the Client to Frontend paths:

Public Frontend: A simple hosting of static files in S3 with CloudFront as the CDN in front of it.
Internal Frontend: Hosting of internal applications, secured behind Cognito, and otherwise similar to the Public Frontend.

graph TD
  Client
  Route53

  Client-->Route53
  Route53-->FrontendCloudFront
  Route53-->InternalCloudFront
  Route53-->CertificateACM

  Frontend-->API
  Internal-->API

  subgraph Certificate[ACM: Certificate]
    CertificateACM
  end

  subgraph Frontend[Frontend: Public]
    FrontendCloudFront[CloudFront]
    FrontendS3App[S3: Static UI Files]

    FrontendCloudFront-->FrontendS3App
  end

  subgraph Internal[Frontend: Internal]
    InternalCloudFront[CloudFront]
    InternalCognito[Cognito]
    InternalS3App[S3: Static UI Files]

    InternalCloudFront-->InternalCognito
    InternalCloudFront-->InternalS3App
  end

  subgraph API
    APIFiller[...]
  end

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class Certificate,Frontend,Internal,API,Media,Database,Notification,Async,Monitoring container;

Client to API

As we can see, the two Frontends need something to talk to. Let’s check out the APIs:

API: A Federated GraphQL setup, served using Lambda with API Gateway exposing it to the internet. WAF is added for security and protection, and CloudFront opens up the possibility of cheaper egress pricing by committing to a certain volume.
Database: DynamoDB tables for storing data.
Monitoring: X-Ray traces and CloudWatch metrics/alarms.
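
If the idea of a supergraph in front of several subgraphs feels abstract, the following toy sketch may help. It only illustrates the principle of dispatching by ownership: each root field belongs to one subgraph, and the router groups a request by owner. The subgraph names mirror the services in this post, while the field names, the `fieldOwners` table, and `planRequest` are hypothetical. A real router such as Apollo Router builds its query plans from a composed supergraph schema and does considerably more than this.

```typescript
// Toy model of supergraph routing; field names and the ownership table are hypothetical.
type Subgraph = "reviews" | "users" | "products";

const fieldOwners: { [field: string]: Subgraph } = {
  me: "users",
  product: "products",
  latestReviews: "reviews",
};

// Group the requested root fields by the subgraph that owns them, so the
// router knows which subgraph Lambdas it has to call for one client request.
function planRequest(rootFields: string[]): { [subgraph: string]: string[] } {
  const plan: { [subgraph: string]: string[] } = {};
  for (const field of rootFields) {
    if (!Object.prototype.hasOwnProperty.call(fieldOwners, field)) {
      throw new Error("No subgraph owns field: " + field);
    }
    const owner = fieldOwners[field];
    if (plan[owner] === undefined) {
      plan[owner] = [];
    }
    plan[owner].push(field);
  }
  return plan;
}

const examplePlan = planRequest(["me", "product", "latestReviews"]);
console.log(JSON.stringify(examplePlan));
```

The point of the split is organizational as much as technical: each subgraph can be owned and deployed by a separate team, while clients still see a single API.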

graph TD
  Client
  Route53

  Client-->Route53
  Route53-->APICloudFront

  subgraph API
    APICloudFront[CloudFront]
    APIWAF[WAF]
    APIAPIGateway[API Gateway]
    APILambdaAuthentication[Lambda: Custom Authorizer]
    APILambdaRouter[Lambda: GraphQL Supergraph<br/>Apollo Router]
    APILambdaServiceReviews[Lambda: GraphQL Subgraph<br/>Reviews Service]
    APILambdaServiceUsers[Lambda: GraphQL Subgraph<br/>Users Service]
    APILambdaServiceProducts[Lambda: GraphQL Subgraph<br/>Products Service]

    APICloudFront-->APIWAF-->APIAPIGateway

    APIAPIGateway--Cached-->APILambdaAuthentication

    APIAPIGateway-->APILambdaRouter
    APILambdaRouter-->APILambdaServiceReviews
    APILambdaRouter-->APILambdaServiceUsers
    APILambdaRouter-->APILambdaServiceProducts
  end

  subgraph Database
    DatabaseDynamoDB[DynamoDB]
  end

  %% APILambdaAuthentication-->Database
  APILambdaServiceReviews-->Database
  APILambdaServiceUsers-->Database
  APILambdaServiceProducts-->Database

  subgraph Monitoring[Monitoring]
    MonitoringXray[Xray]
    MonitoringCloudWatch[CloudWatch Metrics/Alarms]
  end

  API-->Monitoring

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class Certificate,Frontend,Internal,API,Media,Database,Notification,Async,Monitoring container;

Asynchronous Work and Media

And finally we can see the Media and Async work (the Database and some APIs reappear here as well):

Media: S3 buckets for images and videos, as well as MediaConvert for transcoding video for wider device support.
Async: SQS for asynchronous work via a queue and EventBridge for a Pub/Sub style architecture. An initial Analytics Lambda service is set up as the consumer of the Pub/Sub events.
Notification: SES for emails and SNS for mobile push notifications.
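
The difference between the queue and the Pub/Sub path is easiest to see side by side. Below is a small in-memory sketch, assuming nothing about the real AWS APIs: `WorkQueue` stands in for the SQS-style path, where a message is handed to one consumer, and `EventBus` stands in for the EventBridge-style path, where an event fans out to every matching subscriber. All class, function, and event names are hypothetical, and the second subscriber exists only to show the fan-out. Real SQS standard queues deliver at least once, so an actual consumer should also tolerate duplicates.

```typescript
interface StackEvent {
  type: string;
  payload: string;
}

type StackEventHandler = (event: StackEvent) => void;

// Queue-style delivery: each message is handed to the one consumer draining the queue.
class WorkQueue {
  private messages: StackEvent[] = [];

  send(message: StackEvent): void {
    this.messages.push(message);
  }

  // Process all pending messages in order and return how many were handled.
  drain(consumer: StackEventHandler): number {
    let handled = 0;
    let next = this.messages.shift();
    while (next !== undefined) {
      consumer(next);
      handled += 1;
      next = this.messages.shift();
    }
    return handled;
  }
}

// Pub/Sub-style delivery: an event fans out to every subscriber of its type.
class EventBus {
  private subscriptions: { type: string; handler: StackEventHandler }[] = [];

  subscribe(type: string, handler: StackEventHandler): void {
    this.subscriptions.push({ type: type, handler: handler });
  }

  // Deliver the event to all matching subscribers and return the delivery count.
  publish(event: StackEvent): number {
    let delivered = 0;
    for (const subscription of this.subscriptions) {
      if (subscription.type === event.type) {
        subscription.handler(event);
        delivered += 1;
      }
    }
    return delivered;
  }
}

const sentNotifications: string[] = [];
const analyticsLog: string[] = [];
const secondConsumerLog: string[] = []; // hypothetical extra subscriber

const notificationQueue = new WorkQueue();
notificationQueue.send({ type: "review.created", payload: "notify the product owner" });
const handledCount = notificationQueue.drain((message) => {
  sentNotifications.push(message.payload);
});

const stackEventBus = new EventBus();
stackEventBus.subscribe("review.created", (event) => { analyticsLog.push(event.payload); });
stackEventBus.subscribe("review.created", (event) => { secondConsumerLog.push(event.payload); });
const deliveredCount = stackEventBus.publish({ type: "review.created", payload: "review #1" });

console.log(handledCount, deliveredCount);
```

The producer of an event on the bus does not need to know who consumes it, which is what lets services be added later without touching the ones that already exist.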

graph TD
  Client
  Route53

  Client-->Route53
  Route53-->APICloudFront

  subgraph API
    APILambdaServiceReviews[Lambda: GraphQL Subgraph<br/>Reviews Service]
    APILambdaServiceProducts[Lambda: GraphQL Subgraph<br/>Products Service]
  end

  subgraph Media
    MediaConvert[MediaConvert]
    MediaS3[S3: Media Files<br/>Image Bucket + Video Bucket]
  end

  %% FrontendCloudFront-->MediaS3
  APILambdaServiceProducts--Create Signed URL-->MediaS3
  Client--Upload via Signed URL-->MediaS3
  APILambdaServiceProducts--Create Job-->MediaConvert

  subgraph Database
    DatabaseDynamoDB[DynamoDB]
  end

  subgraph Notification[Notification]
    NotificationSES[SES: Emails]
    NotificationSNS[SNS: Mobile Notification]
  end

  subgraph Async[Async Work]
    AsyncSQS[SQS]
    AsyncEventBridge[Event Bridge: Pub/Sub]
    AsyncLambdaAnalytics[Lambda: Analytics]
    AsyncLambdaNotification[Lambda: Notification]

    AsyncEventBridge-->AsyncLambdaAnalytics
    AsyncSQS-->AsyncLambdaNotification
  end

  DatabaseDynamoDB--Streams-->AsyncEventBridge
  AsyncLambdaAnalytics-->Database
  APILambdaServiceReviews-->AsyncSQS
  AsyncLambdaNotification-->Notification

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class Certificate,Frontend,Internal,API,Media,Database,Notification,Async,Monitoring container;

One diagram to rule them all

If we combine all the individual diagrams, we get:

graph TD
  Client
  Route53

  Client-->Route53
  Route53-->FrontendCloudFront
  Route53-->InternalCloudFront
  Route53-->APICloudFront
  Route53-->CertificateACM

  subgraph Certificate[ACM: Certificate]
    CertificateACM
  end

  subgraph Frontend[Frontend: Public]
    FrontendCloudFront[CloudFront]
    FrontendS3App[S3: Static UI Files]

    FrontendCloudFront-->FrontendS3App
  end

  subgraph Internal[Frontend: Internal]
    InternalCloudFront[CloudFront]
    InternalCognito[Cognito]
    InternalS3App[S3: Static UI Files]

    InternalCloudFront-->InternalCognito
    InternalCloudFront-->InternalS3App
  end

  subgraph API
    APICloudFront[CloudFront]
    APIWAF[WAF]
    APIAPIGateway[API Gateway]
    APILambdaAuthentication[Lambda: Custom Authorizer]
    APILambdaRouter[Lambda: GraphQL Supergraph<br/>Apollo Router]
    APILambdaServiceTodo[Lambda: GraphQL Subgraph<br/>Todo Service]
    APILambdaServiceMedia[Lambda: GraphQL Subgraph<br/>Media Service]

    APICloudFront-->APIWAF-->APIAPIGateway

    APIAPIGateway--Cached-->APILambdaAuthentication

    APIAPIGateway-->APILambdaRouter
    APILambdaRouter-->APILambdaServiceTodo
    APILambdaRouter-->APILambdaServiceMedia
  end

  subgraph Media
    MediaConvert[MediaConvert]
    MediaS3[S3: Media Files<br/>Image Bucket + Video Bucket]
  end

  %% FrontendCloudFront-->MediaS3
  APILambdaServiceMedia--Create Signed URL-->MediaS3
  Client--Upload via Signed URL-->MediaS3
  APILambdaServiceMedia--Create Job-->MediaConvert

  subgraph Database
    DatabaseDynamoDB[DynamoDB]
  end

  %% APILambdaAuthentication-->Database
  APILambdaServiceTodo-->Database
  APILambdaServiceMedia-->Database

  subgraph Notification[Notification]
    NotificationSES[SES: Emails]
    NotificationSNS[SNS: Mobile Notification]
  end

  subgraph Async[Async Work]
    AsyncSQS[SQS]
    AsyncEventBridge[Event Bridge: Pub/Sub]
    AsyncLambdaAnalytics[Lambda: Analytics]
    AsyncLambdaNotification[Lambda: Notification]

    AsyncEventBridge-->AsyncLambdaAnalytics
    AsyncSQS-->AsyncLambdaNotification
  end

  DatabaseDynamoDB--Streams-->AsyncEventBridge
  AsyncLambdaAnalytics-->Database
  APILambdaServiceTodo-->AsyncSQS
  AsyncLambdaNotification-->Notification

  subgraph Monitoring[Monitoring]
    MonitoringXray[Xray]
    MonitoringCloudWatch[CloudWatch Metrics/Alarms]
  end

  API-->Monitoring

  classDef container stroke:#333,stroke-width:2px,fill:transparent,padding:8px
  class Certificate,Frontend,Internal,API,Media,Database,Notification,Async,Monitoring container;

Next Steps

Next up is to start building! Follow along in Part 1 of the series here.