Support wildcards in `save_cache.paths`
After 5 years, it seems like it's not a priority.
This is the primary reason I will never use CircleCI for my personal projects, and why I recommend that my company move off the platform. GitHub Actions supports it, and it's very important for Node.js monorepo maintainers to be able to cache based on globs.
So, I'm on month 2 of using CircleCI, writing my first config.yaml to build my second-ever npm-based repo, and I've just hit this "functional deficit": I can't tell it to cache "**/node_modules/".
So now I have to choose between going down the path of dynamic configs (and a multi-part one at that) or hard-coding a list, plus adding some test code to alert me when my hard-coded list no longer matches the folders I actually need; otherwise it'll end up wrong and I won't know.
Given just how important caching is, this shouldn't be that hard.
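For the "hard-coded list plus test code" route, here is a minimal sketch of what such a check might look like. It is only an illustration: `CACHED_PATHS`, `find_node_modules` and `check_cache_paths` are hypothetical names, and the list contents are made up to stand in for whatever is written under `save_cache.paths`.

```python
from pathlib import Path

# Hypothetical hard-coded list, mirroring the entries under save_cache.paths.
CACHED_PATHS = {
    "node_modules",
    "packages/app/node_modules",
}


def find_node_modules(root):
    """Return every node_modules directory under root, as POSIX paths relative to root.

    Directories nested inside another node_modules are skipped, since caching
    the outer directory already covers them.
    """
    root = Path(root)
    found = set()
    for path in root.rglob("node_modules"):
        rel = path.relative_to(root)
        if path.is_dir() and rel.parts.count("node_modules") == 1:
            found.add(rel.as_posix())
    return found


def check_cache_paths(root, cached=CACHED_PATHS):
    """Compare the hard-coded list against the directories actually on disk.

    Returns (missing, stale): directories on disk that the list lacks, and
    list entries that no longer exist on disk.
    """
    found = find_node_modules(root)
    cached = set(cached)
    return sorted(found - cached), sorted(cached - found)
```

Run after the install step, a job could fail the build when either returned list is non-empty, so the hard-coded list can't silently drift.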
My teams would love this as well. We're running a monorepo and need to cache from many subdirectories. The set of subdirectories varies by build, so we can't hardcode them.
This concept is required for so many projects, and now extends to workspace caches as well.
This is a blocker. If concrete open source projects are really to be supported by CircleCI, not implementing this feature actually makes CircleCI consume more bandwidth and CPU time, and requires developers to make their projects fit CircleCI's limitations.
A concrete example of this is the Heads open source firmware project, which builds firmware images for different boards.
This glob/wildcard support would let each board definition be responsible for caching only its own relative caches in the workspaces that are passed along, and would then let save_cache combine those workspaces. Today save_cache refuses to combine them: a single file existing in more than one workspace is enough for the cache save to be denied at the end of the build.
If CircleCI developers need an open source project's use case to justify the development time for this feature, requested since before 2018, here you go:
The reason the cache is passed from left to right is to work around this glob/wildcard limitation. Each time, this exponentially increases the left layer's "persisting to workspace" step, which compresses and uploads an oversized workspace; the next step to the right, "attaching workspace", then has to download and decompress many redundant workspace caches of compiled output (the number of rows seen there between prep_env and save_cache). All of this just so that workspace layers are passed along, bypassing save_cache's inability to combine different workspaces containing duplicate files, so that a cache can be created.
There are two possible solutions. Either save_cache is permitted to do what a workspace does, decompressing all workspace archives and letting them overwrite whatever is already there before saving the final cache; or, as this feature request asks, the cache layers can parameterize what is saved in a workspace, so that save_cache is happy. The latter is better for resource consumption and build speed. The former would let CircleCI users simply have save_cache depend on whichever previous steps it needs, download their workspace caches, decompress them, and save the result if no cache exists for the current keys.
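To make the first option concrete, here is a minimal sketch of "let later layers overwrite duplicates" semantics. It says nothing about how CircleCI works internally: `merge_layers` is a hypothetical helper, and already-extracted directories stand in for workspace archives.

```python
import shutil
from pathlib import Path


def merge_layers(layer_dirs, dest):
    """Copy each layer directory into dest, in order.

    A file present in several layers does not cause an error; the copy from
    the last layer that contains it wins. Requires Python 3.8+ for
    dirs_exist_ok.
    """
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    for layer in layer_dirs:
        # dirs_exist_ok=True lets copytree write into an existing tree and
        # overwrite files that are already there.
        shutil.copytree(layer, dest, dirs_exist_ok=True)
    return dest
```

With semantics like these, a final step could depend on every build step, merge their outputs once, and save a single cache, instead of each layer re-uploading everything to its left.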
In the latest tests under Heads, adding a new right layer (for a new architecture) nearly hits CircleCI's 1h step build time limit here, when building without a cache in order to create one, in the last step before the save_cache step:
Most of these 51 minutes are spent downloading/extracting workspace layers and building another workspace cache layer, not building stuff.
And re-building with a saved cache is counter-productive here, because the cache layers are too big and contain redundant paths and files:
Long story short: if workspace caches/save caches supported wildcard/glob definitions, programmable through step variables, a save cache step could depend on multiple steps being completed, with each step caching only what it needs to pass along as a workspace cache. CircleCI would then cache in an optimal way: saving space, compression/decompression time and the bandwidth needed to save/restore those caches, and minimizing the build time and resources used to build open source projects and others.
This, for example, would be a good step toward optimizing the Heads project use case:
With this CircleCI config change:
But CircleCI fails:
Saying that this glob/wildcard/configurable, CircleCI-known variable is not valid as a path:
"The specified paths did not match any files in /root"
Even though those parameters are known to CircleCI, and could be reused for caching directories/files:
- image: debian:11
Again, multiple posts have referred here over the past 4 years:
This feature request has 121 thumbs up as of today.
Yet it is not implemented, resulting in the CircleCI CI/CD platform losing precious bandwidth and CPU time, spent compressing/decompressing workspaces and passing multiple workspace layers along to get a working save_cache that otherwise complains it can't depend on steps containing a single common file. This results in CircleCI being used inefficiently by an unknown number of projects and, if that matters more than users' needs, in storage and bandwidth costs for CircleCI that could be eliminated by implementing this feature.
In the Heads use case, a single step properly using built caches with parameterized workspace caches/restore_cache could be reduced from 50 minutes right now to 5-6 minutes at worst, while the whole project build could be completed in under 30 minutes total (vs 1h50 now) if a cache were available for keys parameterized per CPU architecture and the versions actually needed, instead of passing along an overly fat cache just because save_cache needs to be called at the end of the build. The Heads use case can definitely be optimized apart from this missing feature, but the feature would resolve most of the problems, without having to think more about CI than about working on the project's requirements.
At the end of the day, a build of a new git commit without cache could go from nearly 3 hours: https://app.circleci.com/pipelines/github/SergiiDmytruk/heads/64/workflows/8b7c5694-8b9b-4490-b0a4-eb6d465a9364 to 45 minutes if parallelized properly, with a save_cache that contains no duplicate files and has all steps as dependencies, so that it downloads each step's workspace once instead of having them passed along just so save_cache doesn't complain...
Otherwise, developers need to think outside the box and create docker images containing a basic cache (and maintain that docker image...), since that one is used as is without being decompressed/recompressed/passed to save_cache even if unneeded (cache already exists for those keys...), resulting in a lot of wasted time/CPU/bandwidth.
The solution IS to have parameter-based paths defined in the CircleCI config, under the git tree, in both persist to workspace and save cache statements.
Please fix this today, so that CircleCI dispatches its resources more efficiently and developers are happier, no longer needing to adapt their workflow to CircleCI's limitations, with CircleCI supporting real use cases.
Please support this; we have hundreds of sub-packages in our monorepo.
Hi, I'm wondering if there's any update on this. This is a helpful feature to consider as npm and pnpm also provide workspace support now and it's very inconvenient not being able to properly cache without hectic workarounds.
This would be really useful for us too. Same use case with monorepo yarn workspaces.
It'd be nice to use a standard but capable format for this, maybe even the same conventions as .gitignore. It'd be very useful to flexibly select directories to cache for more complicated project setups.