Faster micro-frontends: optimising CDN behaviour for performance

22 June 2026 · 8 min read · 1,401 words

We've just finished migrating our CDN layer from Akamai to CloudFront, and took the opportunity to improve the caching strategy for our microfrontend architecture. Our performance metrics improved significantly: Time To First Byte (TTFB) fell by 54% at p90, which brought Largest Contentful Paint (LCP) down by ~36% at p90.

I previously wrote about our caching strategy for our microfrontend architecture a couple of months back. At the time we had disabled Akamai caching after an outage, had a three-hop architecture (Akamai -> CloudFront -> S3), and were relying on explicit cache rules at the CDN layer to control caching. Most assets weren't explicitly cache-controlled, and cache behaviour was split between S3 metadata, Akamai rules, and CDN/browser heuristics.

Dropping Akamai: from three hops to two

Akamai has long been an additional complexity in our architecture we've eyed for removal. Most of the rest of our infrastructure is in AWS, and since the addition of our microfrontend architecture through CloudFront, Akamai was an extra hop with added latency and complexity.

Our infrastructure team worked through the complexity of porting the various Akamai rules that had built up over the years to CloudFront, resolving a range of issues from legacy TLS version support to changes in default security behaviour. Traffic was shifted incrementally from Akamai to CloudFront over a three-week period.

In this post I'm going to focus specifically on how the migration affected our microfrontend architecture, and some of the changes we made to improve caching behaviour at the same time.

[Diagram: Before. User -> Akamai, which sends MFE routes to CloudFront -> S3 bucket, API routes to Python services, and MPA routes to Django MPA.]

[Diagram: After. User -> CloudFront, which sends MFE routes to S3 bucket, API routes to Python services, and MPA routes to Django MPA.]

Origin routing: explicit logic instead of 404-fallback

We previously relied on CloudFront's custom error responses to fall back to the index.html file for our single-page application routes. This meant that every time a user requested a route that didn't match a static asset, the request would go to the origin, get a missing object and then fall back to index.html with a 200 response code.

Since we were already building out a CloudFront function to handle many of the existing Akamai edge rules, we decided to implement the SPA fallback logic in the same function. This allowed us to handle routing at the edge, before the request even reached the origin.

function handler(event) {
  var request = event.request;
  var uri = request.uri || "/";
  var qs = request.querystring || {};
 
  // ... other routing logic, omitted for simplicity
 
  // SPA fallback: if the path is extensionless, rewrite to index.html
  var lastSegment = uri.split("/").pop();
  if (lastSegment === "" || lastSegment.indexOf(".") === -1) {
    request.uri = "/path/to/index.html";
  }
 
  return request;
}

This makes the behaviour more explicit and predictable, and mitigates the risk of masking real origin errors when static assets are requested, which previously affected us during a CloudFront/S3 outage.

It also improved TTFB, since an SPA route request no longer needs two requests to the origin.
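To make the rewrite rule concrete, here is a minimal sketch that pulls the extension check into a pure helper so it can be run outside CloudFront. The helper name resolveSpaUri and the sample paths are hypothetical; /path/to/index.html is the same placeholder used in the function above.

```javascript
// Hypothetical helper: the same extension check as the SPA fallback above,
// isolated so it can be exercised locally.
var INDEX_PATH = "/path/to/index.html"; // placeholder path, as in the function above

function resolveSpaUri(uri) {
  var path = uri || "/";
  var lastSegment = path.split("/").pop();
  // Extensionless (or trailing-slash) paths are treated as SPA routes.
  if (lastSegment === "" || lastSegment.indexOf(".") === -1) {
    return INDEX_PATH;
  }
  // Anything with a dot in the last segment is passed through unchanged.
  return path;
}

// Sample paths are illustrative only.
["/", "/dashboard/settings", "/assets/main.3f2a1c.js", "/users/jane.doe"].forEach(function (p) {
  console.log(p + " -> " + resolveSpaUri(p));
});
```

One consequence of a rule like this: a route whose last segment contains a dot (such as the hypothetical /users/jane.doe) is passed to the origin as if it were an asset, so route naming has to account for that.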

Explicit cache headers, per asset type

Within our microfrontend deployments, we have three types of assets served from S3:

  • index.html - the entrypoint for the host shell, which is mutable and changes with every shell deployment.
  • remoteEntry.js - the entrypoint for each microfrontend, which is also mutable and changes when the microfrontend is updated.
  • assets - the JS chunks and assets for each microfrontend, which are immutable and named with a hash of their content.

Rather than a mix of CDN rules, S3 metadata and browser heuristics, we've moved to explicitly setting cache headers for each of these asset types at the time of upload to S3.

Our deploy pipeline step now sets Cache-Control metadata on each object:

# Upload hashed assets with long-lived, immutable caching
rclone copy -vv \
  --exclude "index.html" \
  --exclude "remoteEntry.js" \
  -M --metadata-set "Cache-Control=max-age=31536000, immutable" \
  ./dist "$DEST"

# Upload remoteEntry.js: short browser lifetime, longer shared-cache lifetime
rclone copy -vv \
  -M --metadata-set "Cache-Control=max-age=30, s-maxage=86400" \
  ./dist/remoteEntry.js "$DEST"

# Upload index.html with no-cache (always revalidate before use)
rclone copy -vv \
  -M --metadata-set "Cache-Control=no-cache" \
  ./dist/index.html "$DEST"
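The three upload rules can also be written as a single lookup, which is handy for a post-deploy check that each object carries the header it should. This is a sketch only: the function name cacheControlFor and the sample chunk filename are hypothetical, and the header values are copied from the commands above.

```javascript
// Hypothetical helper mirroring the three upload rules above.
function cacheControlFor(filename) {
  var base = filename.split("/").pop();
  if (base === "index.html") {
    return "no-cache"; // always revalidate the shell entrypoint
  }
  if (base === "remoteEntry.js") {
    return "max-age=30, s-maxage=86400"; // short for browsers, longer for shared caches
  }
  return "max-age=31536000, immutable"; // content-hashed assets
}

// Illustrative filenames only.
["index.html", "remoteEntry.js", "static/js/main.3f2a1c.js"].forEach(function (f) {
  console.log(f + ": " + cacheControlFor(f));
});
```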

Assets

All of the assets generated by the microfrontend builds use Webpack's [contenthash] in their filenames, which means that any change to the content of a chunk will result in a new filename. This allows us to set long cache lifetimes for these assets without worrying about serving stale content.

remoteEntry.js

The remoteEntry.js file is the entrypoint for each microfrontend, and points to the various JS chunks that need to be loaded. Since each microfrontend is deployed independently, the file has to keep a fixed name at a well-known location. We set a short cache lifetime for end users to prevent browsers caching stale releases for too long. The s-maxage directive allows shared caches (like CloudFront) to cache the file for a longer period, reducing load on the origin.

We added a CloudFront invalidation step in our deploy pipeline to ensure that when remoteEntry.js is updated, the cached version in CloudFront is invalidated, and users will receive the new version on their next request:

aws cloudfront create-invalidation \
  --distribution-id $DISTRIBUTION_ID \
  --paths "/path/to/remoteEntry.js"

Note: Some of our customers access our applications through their own proxies which could cache based on the s-maxage directive. We strip out s-maxage from the Cache-Control header in the CloudFront response so only our own CDN caches the file for the longer period and downstream consumers always get the short cache lifetime.
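The post doesn't show how s-maxage is stripped, so the following is only one possible approach: a viewer-response CloudFront function that rewrites the Cache-Control header on the way out. The helper name stripSMaxAge is hypothetical; the event shape (lower-cased header names, each with a value property) follows the CloudFront Functions event structure.

```javascript
// Removes any s-maxage directive from a Cache-Control header value.
function stripSMaxAge(value) {
  return value
    .split(",")
    .map(function (directive) { return directive.trim(); })
    .filter(function (directive) {
      return directive !== "" && !/^s-maxage\s*=/i.test(directive);
    })
    .join(", ");
}

// Viewer-response handler sketch: downstream caches and browsers only
// ever see the remaining directives (for example max-age=30).
function handler(event) {
  var response = event.response;
  var header = response.headers["cache-control"];
  if (header && header.value) {
    var stripped = stripSMaxAge(header.value);
    if (stripped === "") {
      // s-maxage was the only directive, so drop the header entirely.
      delete response.headers["cache-control"];
    } else {
      header.value = stripped;
    }
  }
  return response;
}
```

Because this runs on the viewer response, the object stored at the edge keeps its original header and CloudFront can still honour s-maxage.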

index.html

For index.html, we set Cache-Control: no-cache to ensure that the browser always checks with the server for the latest version of the file. We could probably optimise this further using a similar approach to remoteEntry.js as there's now a single index.html cache key to bust with the CloudFront function logic, but for now we kept the existing no-cache behaviour to ensure shell updates are rolled out fast.

Measuring the impact

We use Grafana Faro to capture production RUM data, and used this to monitor the impact of the changes on performance metrics. As traffic was migrated from Akamai to the new CloudFront distribution we started to see improvements across the board for page load related metrics.

| Metric | p50 | p75 | p90 |
| --- | --- | --- | --- |
| TTFB (ms) | 1493 -> 507 (−66%) | 2484 -> 1074 (−57%) | 4483 -> 2041 (−54%) |
| FCP (ms) | 2284 -> 1140 (−50%) | 3524 -> 1988 (−44%) | 5308 -> 3440 (−35%) |
| LCP (ms) | 4312 -> 2688 (−38%) | 6040 -> 3788 (−37%) | 8608 -> 5516 (−36%) |

These were particularly driven by the reduction in TTFB, which cascaded into FCP and LCP improvements. But we can also see the difference between TTFB and LCP decreasing, suggesting our caching improvements are also helping those other resources load faster from browser or CDN caches.
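The shrinking gap between TTFB and LCP can be checked directly from the figures in the table. A small sketch of that arithmetic, using only the published values (the variable and function names are illustrative):

```javascript
// Values copied from the table above, in ms: [before, after].
var metrics = {
  p50: { ttfb: [1493, 507], lcp: [4312, 2688] },
  p75: { ttfb: [2484, 1074], lcp: [6040, 3788] },
  p90: { ttfb: [4483, 2041], lcp: [8608, 5516] }
};

// Time between first byte and largest contentful paint, before and after.
function lcpMinusTtfb(m) {
  return { before: m.lcp[0] - m.ttfb[0], after: m.lcp[1] - m.ttfb[1] };
}

Object.keys(metrics).forEach(function (percentile) {
  var gap = lcpMinusTtfb(metrics[percentile]);
  console.log(percentile + ": " + gap.before + " ms -> " + gap.after + " ms");
});
// p50: 2819 ms -> 2181 ms
// p75: 3556 ms -> 2714 ms
// p90: 4125 ms -> 3475 ms
```

On these numbers the time spent after the first byte drops by roughly 640 to 840 ms at each percentile, on top of the TTFB reduction itself.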

Figure: Core Web Vitals at the 75th percentile (ms), before vs after the migration.

| Metric | Before (Akamai) | After (CloudFront) |
| --- | --- | --- |
| TTFB | 2484 ms | 1074 ms |
| FCP | 3524 ms | 1988 ms |
| LCP | 6040 ms | 3788 ms |

We can break down the components of the TTFB request from the Faro data to see the large reduction in request_duration. Unfortunately Faro doesn't instrument request times for <script> tag resources by default, so we can't see exactly how much asset caching has improved request load times.

Figure: TTFB sub-components (p50, ms). request_duration, the server's time to respond, fell from ~1056 ms to ~184 ms and drove almost the entire gain. DNS, connection and cache barely moved.

| Metric | Before (Akamai) | After (CloudFront) |
| --- | --- | --- |
| DNS | 56 ms | 58 ms |
| Connection | 47 ms | 33 ms |
| Cache | 4 ms | 4 ms |
| Request | 1056 ms | 184 ms |
| Waiting | 11 ms | 22 ms |

While we can't see the asset request times in Faro, we can see the effect of caching from our CloudFront per-object cache statistics. Filtering to our microfrontend assets we can see the edge cache hit rate rose from 52% to 95%. Both cacheable asset types reach ~100%: remoteEntry.js climbed from 63%, and the immutable content-hashed chunks from 48%.

Figure: CloudFront edge cache hit rate for microfrontend assets (hits / requests), before vs after. remoteEntry.js and the immutable content-hashed chunks both reach ~100%; the "All MFE requests" aggregate is lower because index.html and the favicon are served no-cache by design.

| Metric | Before (Akamai) | After (CloudFront) |
| --- | --- | --- |
| remoteEntry.js | 63% | 100% |
| Immutable chunks | 48% | 100% |
| All MFE requests | 52% | 95% |

Most of that gain came from eliminating revalidation. Previously, with no explicit cache headers, around 40% of asset requests were revalidations: CloudFront held the file but checked back with the origin (a 304 Not Modified) on almost every request. Setting Cache-Control: immutable on the content-hashed chunks and s-maxage on remoteEntry.js turned those round-trips into true cache hits. That also cut traffic to the origin, lowering cost as well as improving performance.
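The difference between a revalidation and a true hit comes down to whether the cached copy is still fresh. The sketch below is a deliberately simplified model of that decision for a shared cache, not CloudFront's actual implementation: s-maxage takes precedence over max-age, no-cache always revalidates, and a response with no explicit lifetime is treated as needing revalidation to mirror the behaviour described above (a real CDN applies its own configured defaults). All function names are hypothetical.

```javascript
// Parses a Cache-Control value into a map of lower-cased directives.
function parseCacheControl(value) {
  var directives = {};
  (value || "").split(",").forEach(function (part) {
    var pieces = part.trim().split("=");
    var name = pieces[0].toLowerCase();
    if (name !== "") {
      directives[name] = pieces.length > 1 ? pieces[1] : true;
    }
  });
  return directives;
}

// Simplified shared-cache decision: "hit" or "revalidate".
function sharedCacheDecision(cacheControl, ageSeconds) {
  var d = parseCacheControl(cacheControl);
  if (d["no-cache"]) {
    return "revalidate";
  }
  // Shared caches prefer s-maxage over max-age when both are present.
  var lifetime = d["s-maxage"] !== undefined ? d["s-maxage"] : d["max-age"];
  if (lifetime === undefined) {
    return "revalidate"; // assumption: no explicit lifetime means check the origin
  }
  return ageSeconds < Number(lifetime) ? "hit" : "revalidate";
}

// An hour-old copy under each of the three header values used in this post.
["max-age=31536000, immutable", "max-age=30, s-maxage=86400", "no-cache"].forEach(function (cc) {
  console.log(cc + " -> " + sharedCacheDecision(cc, 3600));
});
```

In this model the hour-old remoteEntry.js is a hit only because of s-maxage; with max-age=30 alone it would go back to the origin.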

Faro tags each session with a reference to the user's previous session, which lets us separate first-time (cold cache) loads from returning (warm cache) ones. We'd expect returning users to benefit most from the new caching strategy. Looking at First Contentful Paint, returning users were ~17% faster than new users before the migration, and that advantage widened to ~29% afterwards.

Figure: FCP (p50, ms) split by new vs returning users (navigate navigations). Returning users, whose browsers already hold cached assets, paint faster in both periods, and their relative advantage over new users widened from 17% to 29% after the migration.

| Metric | Before (Akamai) | After (CloudFront) |
| --- | --- | --- |
| New users | 2296 ms | 1156 ms |
| Returning users | 1912 ms | 816 ms |
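The 17% and 29% figures are the share of the new-user FCP that returning users save in each period. A short sketch of that calculation from the published p50 values (names are illustrative):

```javascript
// FCP p50 values from the figure above, in ms.
var fcp = {
  before: { newUsers: 2296, returningUsers: 1912 },
  after: { newUsers: 1156, returningUsers: 816 }
};

// How much faster returning users are than new users, as a percentage.
function returningAdvantage(period) {
  return (1 - period.returningUsers / period.newUsers) * 100;
}

console.log("before: " + returningAdvantage(fcp.before).toFixed(1) + "%"); // 16.7%
console.log("after: " + returningAdvantage(fcp.after).toFixed(1) + "%"); // 29.4%
```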

We didn't see a comparable change in LCP, which is dominated by application render logic and data-fetching API calls rather than how quickly JS loads. There, new and returning users improved by the same ~38%.

Where next

There's still a lot more we could improve in terms of performance across our frontend applications. Web Vitals guidance treats an LCP of 2.5s or less at p75 as "good". At ~3.8s, we still have room for improvement.

Our architecture was designed for speed of feature development across distributed teams, and while SPAs and microfrontends help us achieve that, they do so at the cost of performance. There's more we could improve at the platform level around index.html caching and shifting a skeleton loading screen inline to the HTML file for FCP. Ultimately, further reductions in LCP will also require focus from our product teams to optimise critical paths to initial render across both the frontend and backend logic.

Also, we've added some complexity to our deployment pipeline by introducing a cache invalidation step. CloudFront cache tags offer a flexible way for us to structure our cache invalidation logic, and we're looking at using them to simplify the logic and improve our ability to invalidate cached assets on demand.
