Faster micro-frontends: optimising CDN behaviour for performance
We've just finished migrating our CDN layer from Akamai to CloudFront, and spent time improving the caching strategy for our microfrontend architecture along the way. We've seen a significant improvement in our performance metrics, particularly in Time To First Byte (TTFB), which fell by 54% at p90 and brought Largest Contentful Paint (LCP) down by ~36% at p90.
I previously wrote about our caching strategy for our microfrontend architecture a couple of months ago. At the time we had disabled Akamai caching after an outage, had a three-hop architecture (Akamai -> CloudFront -> S3), and were relying on explicit cache rules at the CDN layer to control caching. Most assets weren't explicitly cache-controlled, and cache behaviour was split between S3 metadata, Akamai rules, and CDN/browser heuristics.
Dropping Akamai: from three hops to two
Akamai has long been a part of our architecture we've wanted to remove. Most of the rest of our infrastructure is in AWS, and once our microfrontend architecture was being served through CloudFront, Akamai became an extra hop that added latency and complexity.
Our infrastructure team ported the various Akamai rules that had built up over the years to CloudFront, working through a range of issues from legacy TLS version support to changes in default security behaviour. Traffic was shifted incrementally from Akamai to CloudFront over a three-week period.
In this post I'm going to focus specifically on how the migration affected our microfrontend architecture, and some of the changes we made to improve caching behaviour at the same time.
[Figures: request path before and after the migration]
Origin routing: explicit logic instead of 404-fallback
We previously relied on CloudFront's custom error responses to fall back to the index.html file for our single-page application routes. This meant that every time a user requested a route that didn't match a static asset, the request would go to the origin, find no matching object, and then fall back to index.html with a 200 response code.
Since we were already building out a CloudFront function to handle many of the existing Akamai edge rules, we decided to implement the SPA fallback logic in the same function. This allowed us to handle routing at the edge, before the request even reached the origin.
function handler(event) {
  var request = event.request;
  var uri = request.uri || "/";
  var qs = request.querystring || {};
  // ... other routing logic, omitted for simplicity
  // SPA fallback: if the path is extensionless, rewrite to index.html
  var lastSegment = uri.split("/").pop();
  if (lastSegment === "" || lastSegment.indexOf(".") === -1) {
    request.uri = "/path/to/index.html";
  }
  return request;
}
This helped make the behaviour more explicit and predictable, and mitigates the risk of hiding real errors in the origin when requesting static assets, which previously affected us during a CloudFront/S3 outage.
It also improved TTFB, since an SPA route request no longer needs two trips to the origin (a miss followed by the index.html fallback).
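To illustrate how the extensionless-path check behaves, here is a minimal sketch that pulls the same test into a standalone function and runs it over a few sample paths. The name `isSpaRoute` and the sample paths are hypothetical and only for illustration; the check itself mirrors the handler shown above.

```javascript
// Hypothetical helper mirroring the extensionless-path check in the handler.
function isSpaRoute(uri) {
  var lastSegment = uri.split("/").pop();
  return lastSegment === "" || lastSegment.indexOf(".") === -1;
}

// Sample paths are made up for illustration.
var samples = [
  "/",
  "/dashboard",
  "/settings/profile/",
  "/assets/main.3f9a1c.js",
  "/remoteEntry.js"
];

samples.forEach(function (uri) {
  var outcome = isSpaRoute(uri) ? "rewrite to index.html" : "pass through to origin";
  console.log(uri + " -> " + outcome);
});
```

One trade-off of this heuristic: a route whose last segment contains a dot (for example a username such as `/users/jane.doe`) would be treated as a static asset and passed through to the origin unchanged.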
Explicit cache headers, per asset type
Within our microfrontend deployments, we have three types of assets served from S3:
- index.html - the entrypoint for the host shell, which is mutable and changes with every shell deployment.
- remoteEntry.js - the entrypoint for each microfrontend, which is also mutable and changes when the microfrontend is updated.
- assets - the JS chunks and assets for each microfrontend, which are immutable and named with a hash of their content.
Rather than a mix of CDN rules, S3 metadata and browser heuristics, we've moved to explicitly setting cache headers for each of these asset types at the time of upload to S3.
Our deploy pipeline step now sets Cache-Control metadata on each object:
# Upload assets with immutable caching
rclone copy -vv \
  --exclude "index.html" \
  --exclude "remoteEntry.js" \
  -M --metadata-set "Cache-Control=max-age=31536000, immutable" \
  ./dist "$DEST"
# Upload remoteEntry.js with short caching
rclone copy -vv \
  -M --metadata-set "Cache-Control=max-age=30, s-maxage=86400" \
  ./dist/remoteEntry.js "$DEST"
# Upload index.html with no caching
rclone copy -vv \
  -M --metadata-set "Cache-Control=no-cache" \
  ./dist/index.html "$DEST"
Assets
All of the assets generated by the microfrontend builds use Webpack's [contenthash] in their filenames, which means that any change to the content of a chunk will result in a new filename. This allows us to set long cache lifetimes for these assets without worrying about serving stale content.
remoteEntry.js
The remoteEntry.js file is the entrypoint for each microfrontend which points to the various JS chunks that need to be loaded. Since each microfrontend is deployed independently, the filename needs to be fixed at a well-known location. We set a short cache lifetime for end users to prevent browsers caching stale releases for too long. The s-maxage directive allows shared caches (like CloudFront) to cache the file for a longer period, reducing load on the origin.
We added a CloudFront invalidation step in our deploy pipeline to ensure that when remoteEntry.js is updated, the cached version in CloudFront is invalidated, and users will receive the new version on their next request:
aws cloudfront create-invalidation \
  --distribution-id "$DISTRIBUTION_ID" \
  --paths "/path/to/remoteEntry.js"
Note: Some of our customers access our applications through their own proxies which could cache based on the s-maxage directive. We strip out s-maxage from the Cache-Control header in the CloudFront response so only our own CDN caches the file for the longer period and downstream consumers always get the short cache lifetime.
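The post doesn't say which mechanism strips s-maxage, so the following is only a sketch of one way it could be done, assuming a CloudFront Function attached to the viewer-response event. The helper name `stripSMaxage` is hypothetical; the event shape (`event.response.headers["cache-control"].value`) follows the CloudFront Functions event structure.

```javascript
// Hypothetical helper: remove the s-maxage directive from a Cache-Control value.
function stripSMaxage(cacheControl) {
  return cacheControl
    .split(",")
    .map(function (directive) { return directive.trim(); })
    .filter(function (directive) {
      return directive !== "" && directive.toLowerCase().indexOf("s-maxage") !== 0;
    })
    .join(", ");
}

// Viewer-response handler sketch. Placement at viewer-response is an
// assumption; it runs after CloudFront has already cached the origin response.
function handler(event) {
  var response = event.response;
  var header = response.headers["cache-control"];
  if (header && header.value) {
    var stripped = stripSMaxage(header.value);
    if (stripped === "") {
      // Nothing left once s-maxage is removed: drop the header entirely.
      delete response.headers["cache-control"];
    } else {
      header.value = stripped;
    }
  }
  return response;
}
```

With the remoteEntry.js header from the upload step, `stripSMaxage("max-age=30, s-maxage=86400")` yields `"max-age=30"`, so downstream proxies and browsers only ever see the 30-second lifetime.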
index.html
For index.html, we set Cache-Control: no-cache to ensure that the browser always checks with the server for the latest version of the file. We could probably optimise this further using a similar approach to remoteEntry.js as there's now a single index.html cache key to bust with the CloudFront function logic, but for now we kept the existing no-cache behaviour to ensure shell updates are rolled out fast.
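The per-asset policy described in this section can be summarised as a small lookup. This is a sketch rather than part of our pipeline: the function name `cacheControlFor` and the sample paths are hypothetical, while the header values are the ones used in the upload commands.

```javascript
// Cache-Control values taken from the upload commands in this post.
var IMMUTABLE = "max-age=31536000, immutable";
var REMOTE_ENTRY = "max-age=30, s-maxage=86400";
var INDEX_HTML = "no-cache";

// Hypothetical helper: choose the header from the file name alone.
function cacheControlFor(path) {
  var fileName = path.split("/").pop();
  if (fileName === "index.html") return INDEX_HTML;
  if (fileName === "remoteEntry.js") return REMOTE_ENTRY;
  // Everything else is assumed to be a content-hashed, immutable asset.
  return IMMUTABLE;
}

// Sample paths are made up for illustration.
["index.html", "remoteEntry.js", "static/js/main.3f9a1c.js"].forEach(function (path) {
  console.log(path + " -> " + cacheControlFor(path));
});
```

A lookup like this could back a post-deploy check that compares the expected header against what the CDN actually returns for each uploaded file.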
Measuring the impact
We use Grafana Faro to capture production RUM data, and used this to monitor the impact of the changes on performance metrics. As traffic was migrated from Akamai to the new CloudFront distribution we started to see improvements across the board for page load related metrics.
| Metric | p50 | p75 | p90 |
|---|---|---|---|
| TTFB (ms) | 1493 -> 507 (−66%) | 2484 -> 1074 (−57%) | 4483 -> 2041 (−54%) |
| FCP (ms) | 2284 -> 1140 (−50%) | 3524 -> 1988 (−44%) | 5308 -> 3440 (−35%) |
| LCP (ms) | 4312 -> 2688 (−38%) | 6040 -> 3788 (−37%) | 8608 -> 5516 (−36%) |
These improvements were driven primarily by the reduction in TTFB, which cascaded into FCP and LCP. We can also see the gap between TTFB and LCP shrinking, which suggests our caching improvements are helping the other resources load faster from browser or CDN caches.
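The shrinking gap between TTFB and LCP can be checked with simple arithmetic on the table above. This is a rough sketch: the TTFB and LCP percentiles are not necessarily drawn from the same sessions, so the difference is an indicator rather than an exact measure. The function name `gap` is hypothetical.

```javascript
// Before/after values in ms, copied from the table above.
var metrics = {
  p50: { ttfb: [1493, 507], lcp: [4312, 2688] },
  p75: { ttfb: [2484, 1074], lcp: [6040, 3788] },
  p90: { ttfb: [4483, 2041], lcp: [8608, 5516] }
};

// Time between first byte and largest contentful paint.
// index 0 = before the migration, index 1 = after.
function gap(metric, index) {
  return metric.lcp[index] - metric.ttfb[index];
}

Object.keys(metrics).forEach(function (percentile) {
  var before = gap(metrics[percentile], 0);
  var after = gap(metrics[percentile], 1);
  console.log(percentile + ": " + before + "ms -> " + after + "ms");
});
// p50: 2819ms -> 2181ms
// p75: 3556ms -> 2714ms
// p90: 4125ms -> 3475ms
```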
| Metric (p75) | Before (Akamai) | After (CloudFront) |
|---|---|---|
| TTFB | 2484ms | 1074ms |
| FCP | 3524ms | 1988ms |
| LCP | 6040ms | 3788ms |
We can break down the components of the TTFB request from the Faro data to see the large reduction in request_duration. Unfortunately Faro doesn't instrument request times for <script> tag resources by default, so we can't see exactly how much asset caching has improved request load times.
| Metric | Before (Akamai) | After (CloudFront) |
|---|---|---|
| DNS | 56ms | 58ms |
| Connection | 47ms | 33ms |
| Cache | 4ms | 4ms |
| Request | 1056ms | 184ms |
| Waiting | 11ms | 22ms |
While we can't see the asset request times in Faro, we can see the effect of caching from our CloudFront per-object cache statistics. Filtering to our microfrontend assets we can see the edge cache hit rate rose from 52% to 95%. Both cacheable asset types reach ~100%: remoteEntry.js climbed from 63%, and the immutable content-hashed chunks from 48%.
| Edge cache hit rate | Before (Akamai) | After (CloudFront) |
|---|---|---|
| remoteEntry.js | 63% | 100% |
| Immutable chunks | 48% | 100% |
| All MFE requests | 52% | 95% |
Most of that gain came from eliminating revalidation. Previously, with no explicit cache headers, around 40% of asset requests were revalidations: CloudFront held the file but checked back with the origin (receiving a 304 Not Modified) on almost every request. Setting a long max-age with immutable on the content-hashed chunks and s-maxage on remoteEntry.js turned those round trips into true cache hits. This also reduced origin bandwidth, lowering cost as well as improving performance.
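As a rough illustration of why explicit headers matter to a shared cache, the sketch below follows the standard HTTP caching rules: s-maxage takes precedence over max-age for shared caches, and no-cache forces revalidation before reuse. It is a simplification and an assumption on our part, not CloudFront's implementation: it ignores CloudFront's own minimum, default and maximum TTL settings, and the function name `sharedCacheLifetime` is hypothetical.

```javascript
// Returns the freshness lifetime in seconds that a shared cache would take
// from the header, 0 when it must revalidate before every reuse, or null
// when the header gives no explicit lifetime (the cache falls back to its
// own defaults or heuristics).
function sharedCacheLifetime(cacheControl) {
  var directives = {};
  (cacheControl || "").split(",").forEach(function (part) {
    var pieces = part.trim().toLowerCase().split("=");
    if (pieces[0]) {
      directives[pieces[0]] = pieces.length > 1 ? pieces[1] : true;
    }
  });
  if (directives["no-cache"]) return 0;
  if (directives["s-maxage"] !== undefined) return parseInt(directives["s-maxage"], 10);
  if (directives["max-age"] !== undefined) return parseInt(directives["max-age"], 10);
  return null;
}

console.log(sharedCacheLifetime("max-age=31536000, immutable")); // 31536000
console.log(sharedCacheLifetime("max-age=30, s-maxage=86400"));  // 86400
console.log(sharedCacheLifetime("no-cache"));                    // 0
console.log(sharedCacheLifetime(""));                            // null
```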
Faro tags each session with a reference to the user's previous session, which lets us separate first-time (cold cache) loads from returning (warm cache) ones. We'd expect returning users to benefit most from the new caching strategy. Looking at First Contentful Paint, returning users were ~17% faster than new users before the migration, and that gap widened to ~29% afterwards.
| User type (FCP) | Before (Akamai) | After (CloudFront) |
|---|---|---|
| New users | 2296ms | 1156ms |
| Returning users | 1912ms | 816ms |
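For reference, the ~17% and ~29% figures follow directly from the table above. A minimal sketch of the arithmetic, with a hypothetical function name:

```javascript
// Percentage by which returning users are faster than new users.
function relativeGap(newUsersMs, returningUsersMs) {
  return ((newUsersMs - returningUsersMs) / newUsersMs) * 100;
}

// FCP values in ms, copied from the table above.
console.log(Math.round(relativeGap(2296, 1912)) + "%"); // 17% (before)
console.log(Math.round(relativeGap(1156, 816)) + "%");  // 29% (after)
```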
We didn't see a comparable change in LCP, which is dominated by application render logic and data-fetching API calls rather than by how quickly JS loads. There, new and returning users improved by the same ~38%.
Where next
There's still a lot more we could improve in terms of performance across our frontend applications. Web Vitals guidance treats an LCP of 2.5s or less at p75 as "good". At ~3.8s at p75, we still have room for improvement.
Our architecture was designed for speed of feature development across distributed teams, and while SPAs and microfrontends help us achieve that, they do so at the cost of performance. There's more we could improve at the platform level around index.html caching and shifting a skeleton loading screen inline to the HTML file for FCP. Ultimately, further reductions in LCP will also require focus from our product teams to optimise critical paths to initial render across both the frontend and backend logic.
Also, we've added some complexity to our deployment pipeline by introducing a cache invalidation step. CloudFront cache tags offer a flexible way for us to structure our cache invalidation logic, and we're looking at using them to simplify the logic and improve our ability to invalidate cached assets on demand.