This will be a series of blog posts where we will build up the perfect infrastructure setup for the majority of usecase, aka “The Stack”. We’ll be building everything on top of AWS.
Before diving in, let’s first establish some goals:
- Performant: Latency and performance is important as this will serve end-users.
- Low cost: We want our base cost to be low, and our costs to scale well with high traffic. Ideally, it should cost nothing if no users are using it.
- Low operational overhead: It’s 2023, nobody wants to nurse servers or services anymore, things should scale up and down without intervention or oversight.
- High flexibility: Everything should be built with the foresight of future scalability, both organizationally, code-wise, and in the way things fit together.
- Modular: Pieces of the infrastructure should be opt-out, e.g. if you don’t need Pub/Sub, it can be skipped.
Obviously, this is my personal opinion on it, but I’ll be sharing the thinking behind each of the choices as we go along.
Some technology choices upfront:
- Everything will be infrastructure as code using AWS CDK.
- We’ll be using Rust throughout for each service, as it allows us to squeeze out the most performance while still giving us a nice developer experience.
- Federated GraphQL will be our way of facilitating microservices.
What will we be covering?
You can see a sneak peak of the final setup below. All of this will be covered in parts:
- Part 0: The introduction and goals (this post)
- The goals of “The Stack” and architecture overview
- Part 1: Setting up your AWS Account Structure
- Setting up Control Tower and all of our AWS Accounts
- Part 2: Automating Deployments via CI
- Bootstrapping CDK and deploying to all accounts via CI
- Part 3: Creating our Frontend
- Creating an SPA and deploying it to S3 + CloudFront
- Part 4: A Federated GraphQL API
- Federated GraphQL in Lambda with two subgraphs talking to DynamoDB
- Part 5: Asynchronous work and processing
- Queuing up work with SQS and decoupling services via Pub/Sub using EventBridge
- Part 6: Video transcoding and image resizing
- Transcode video files with MediaConvert and resize images on-the-fly
- Part 7: Notifications and emails
- Sending emails and Push Notifications
- Part 8: Monitoring, traces, and debugging
- XRay traces and CloudWatch Dashboards
- Part 9 (Bonus): Preview Environments in CI
- Spin up environments in Pull Requests using GitHub Actions
- Part 10 (Bonus): Websocket support for GraphQL
- Support GraphQL subscriptions via API Gateway’s websocket support
- Part 11 (Bonus): Mobile App
- Package the Frontend as a Mobile App
- Part 12 (Bonus): Billing breakdown
- Forming an overview of service costs via Billing Tags and the Cost Explorer
Account Structure and Governance
Reorganizing your AWS Account structure is a pain, so let’s see if we can get this right from the beginning. There are a few things that direct our choices here:
- Isolation of environments to avoid sharing soft and hard limits between environments (e.g. Lambda concurrency limits, etc)
- We will be putting each environment in their own AWS Account
- Security, audit, and control to ease GDPR requirements
- We’ll use AWS Control Tower to give us full control of sub-accounts in our Organization
Let’s first sketch out the Account Governance structure, before diving into the view of each individual AWS account:
- Control Tower: This is your central place to control access and policies for all accounts in your organization
- Production Multi-tenant: Your primary production account for multi-tenant setup, and most likely were the majority of users will be
- Production Single-tenant: While desirable to avoid the operation overhead for single-tenant setups, its good to think in this from the get-go
- Integration Test: This will be the account that IaC deployments get tested on to ensure rollout works
- Individual Developer: Individual developer accounts to allow easy testing of IaC testing and exploration
- Monitoring: Centralize monitoring and observability into one account, allowing access to insights without access to sensitive logs or infrastructure from the other accounts
- Logs: Centralized storage of logs, which may require different access considerations than metrics and traces
graph TD subgraph ControlTower[AWS: Control Tower] AuditLog[Audit Log] GuardRails[Guard Rails] end ControlTower-->AWSProdMultiTenantAccount ControlTower-->AWSProdSingleTenantAccount ControlTower-->AWSIntegrationTestAccount ControlTower-->AWSIndividualDeveloperAccount ControlTower-->AWSMonitoringAccount ControlTower-->AWSLogsAccount subgraph AWSProdMultiTenantAccount[AWS: Production Multi-tenant] AccountFillerProdMultiTenant[...] end subgraph AWSProdSingleTenantAccount[AWS: Production Single-tenant] AccountFillerProdSingleTenant[...] end subgraph AWSIntegrationTestAccount[AWS: Integration Test] AccountFillerIntegrationTest[...] end subgraph AWSIndividualDeveloperAccount[AWS: Individual Developer] AccountFillerIndividualDeveloper[...] end subgraph AWSMonitoringAccount[AWS: Monitoring] direction LR CloudWatchDashboards[CloudWatch Dashboards] CloudWatchMetrics[CloudWatch Metrics/Alarms] XRay[XRay Analytics] end subgraph AWSLogsAccount[AWS: Logs] CloudWatchLogs[CloudWatch Logs] end
Service Infrastructure
Each of the infrastructure accounts (Production, Integration, Developer) all hold the same services and follow the same setup. The infrastructure we will make might seem complex at first, and it is, but as we go through each piece everything will start to make sense.
Let’s zoom in a bit on the various pieces our infrastructure consists of:
- Public Frontend: A simple hosting of static files in S3 with CloudFront as the CDN in front of it.
- Internal Frontend: Hosting of internal applications, secured behind Cognito, and otherwise similar to the Public Frontend.
- API: A Federated GraphQL setup, served using Lambda with API Gateway exposing it to the internet. WAF is added for security and protection, and CloudFront for possibility of cheaper egress pricing via commitment to a certain volume.
- Database: DynamoDB tables for storing data.
- Media: S3 buckets for images and videos as well as MediaConvert for transcording video for wider device support.
- Async: SQS for asynchronous work via a queue and EventBridge for a Pub/Sub style architecture. An initial Analytics Lambda service is set up as the consumer of the Pub/Sub events.
- Notification: SES for emails and SNS for mobile push notifications.
- Monitoring: XRay traces and CloudWatch metrics/alarms.
graph TD Client Route53 Client-->Route53 Route53-->FrontendCloudFront Route53-->InternalCloudFront Route53-->APICloudFront Route53-->CertificateACM subgraph Certificate[ACM: Certificate] CertificateACM end subgraph Frontend[Frontend: Public] FrontendCloudFront[CloudFront] FrontendS3App[S3: Static UI Files] FrontendCloudFront-->FrontendS3App end subgraph Internal[Frontend: Internal] InternalCloudFront[CloudFront] InternalCognito[Cognito] InternalS3App[S3: Static UI Files] InternalCloudFront-->InternalCognito InternalCloudFront-->InternalS3App end subgraph API APICloudFront[CloudFront] APIWAF[WAF] APIAPIGateway[API Gateway] APILambdaAuthentication[Lambda: Custom Authorizer] APILambdaRouter[Lambda: GraphQL Supergraph
Apollo Router] APILambdaServiceTodo[Lambda: GraphQL Subgraph
Todo Service] APILambdaServiceMedia[Lambda: GraphQL Subgraph
Media Service] APICloudFront-->APIWAF-->APIAPIGateway APIAPIGateway--Cached-->APILambdaAuthentication APIAPIGateway-->APILambdaRouter APILambdaRouter-->APILambdaServiceTodo APILambdaRouter-->APILambdaServiceMedia end subgraph Media MediaConvert[MediaConvert] MediaS3[S3: Media Files
Image Bucket + Video Bucket] end %% FrontendCloudFront-->MediaS3 APILambdaServiceMedia--Create Signed URL-->MediaS3 Client--Upload via Signed URL-->MediaS3 APILambdaServiceMedia--Create Job-->MediaConvert subgraph Database DatabaseDynamoDB[DynamoDB] end %% APILambdaAuthentication-->Database APILambdaServiceTodo-->Database APILambdaServiceMedia-->Database subgraph Notification[Notification] NotificationSES[SES: Emails] NotificationSNS[SNS: Mobile Notification] end subgraph Async[Async Work] AsyncSQS[SQS] AsyncEventBridge[Event Bridge: Pub/Sub] AsyncLambdaAnalytics[Lambda: Analytics] AsyncLambdaNotification[Lambda: Notification] AsyncEventBridge-->AsyncLambdaAnalytics AsyncSQS-->AsyncLambdaNotification end DatabaseDynamoDB--Streams-->AsyncEventBridge AsyncLambdaAnalytics-->Database APILambdaServiceTodo-->AsyncSQS AsyncLambdaNotification-->Notification subgraph Monitoring[Monitoring] MonitoringXray[Xray] MonitoringCloudWatch[CloudWatch Metrics/Alarms] end API-->Monitoring
I’ve omitted a couple of lines/edges in the diagram above to not make it even more unwieldy.
Next Steps
Next up is to start building! Follow along in Part 1 of the series (will be posted soon).