S3 Files

(allthingsdistributed.com)

216 points | by werner 7 hours ago

29 comments

MontyCarloHall 5 hours ago
This is essentially S3FS using EFS (AWS's managed NFS service) as a cache layer for active data and small random accesses. Unfortunately, this also means that it comes with some of EFS's eye-watering pricing:
— All writes cost $0.06/GB, since everything is first written to the EFS cache. For write-heavy applications, this could be a dealbreaker.
— Reads hitting the cache get billed at $0.03/GB. Large reads (>128kB) get directly streamed from the underlying S3 bucket, which is free.
— Cache is charged at $0.30/GB/month. Even though everything is written to the cache (for consistency purposes), it seems like it's only used for persistent storage of small files (<128kB), so this shouldn't cost too much.
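To get a feel for how these line items add up, here is a minimal sketch in Python. The three prices are the ones quoted above; the function name `estimate_monthly_cost` and the sample workload are made up for illustration, and regular S3 storage and request charges are left out.

```python
# Prices as quoted in the comment above (USD). Treat them as assumptions,
# not as an authoritative price list.
WRITE_PER_GB = 0.06
CACHED_READ_PER_GB = 0.03
CACHE_STORAGE_PER_GB_MONTH = 0.30


def estimate_monthly_cost(written_gb, cached_read_gb, cache_resident_gb):
    """Return the cache-layer charges for one month, broken down by line item."""
    costs = {
        "writes": written_gb * WRITE_PER_GB,
        "cached_reads": cached_read_gb * CACHED_READ_PER_GB,
        "cache_storage": cache_resident_gb * CACHE_STORAGE_PER_GB_MONTH,
    }
    costs["total"] = sum(costs.values())
    return costs


if __name__ == "__main__":
    # Hypothetical workload: 10 TB written, 2 TB of small cached reads,
    # 50 GB of small files resident in the cache.
    for item, usd in estimate_monthly_cost(10_000, 2_000, 50).items():
        print(f"{item}: ${usd:,.2f}")
```

With a workload shaped like this one, the write charge is by far the largest item, which is why write-heavy applications are the ones to worry about.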
- ktimespi 18 minutes ago
  This was my concern too. The whole point of using S3 as a file system instead of EBS / EFS (for me at least) is to minimize cost and I don't really see why I would use this instead of s3fs.
- the8472 4 hours ago
  > Large reads (>128kB) get directly streamed from the underlying S3 bucket, which is free.
  Always uncached? S3 has pretty bad latency.
  - MontyCarloHall 3 hours ago
    The threshold at which the cache gets used is configurable, with 128kB the default. The assumption is that any read larger than the threshold will be a long sustained read, for which latency doesn't matter too much. My question is, do reads <128kB (or whatever the threshold is) from files >128kB get saved to the cache, or is it only used for files whose overall size is under the threshold? Frequent random access to large files is a textbook use case for a caching layer like this, but its cost will be substantial in this system.
    - the8472 2 hours ago
      NVMe read latency is in the 10-100µs range for 128kB blocks. S3 is about 100ms. That's 3-4 OOMs. The threshold where the total read duration starts to dominate latency would be somewhere in the dozens to hundreds of megabytes, not kilobytes.
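      The arithmetic behind that estimate can be sketched as follows. This is a simple model (total read time = first-byte latency + size / bandwidth), the 100ms figure is the one quoted above, and the two bandwidth values are assumptions picked for illustration rather than measured S3 numbers.

```python
def break_even_bytes(latency_s, bandwidth_bytes_per_s):
    """Read size at which transfer time equals first-byte latency."""
    return latency_s * bandwidth_bytes_per_s


def latency_share(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the total read time spent waiting on first-byte latency."""
    transfer_s = size_bytes / bandwidth_bytes_per_s
    return latency_s / (latency_s + transfer_s)


S3_LATENCY_S = 0.100  # ~100 ms, as quoted in the comment

if __name__ == "__main__":
    # Assumed per-stream bandwidths; real throughput varies widely.
    for label, bandwidth in [("100 MB/s", 100e6), ("1 GB/s", 1e9)]:
        size_mb = break_even_bytes(S3_LATENCY_S, bandwidth) / 1e6
        share = latency_share(128 * 1024, S3_LATENCY_S, bandwidth)
        print(f"{label}: break-even at {size_mb:.0f} MB, "
              f"latency is {share:.1%} of a 128 kB read")
```

      Under these assumptions a 128kB read is almost entirely latency, and transfer time only catches up once reads reach tens of megabytes.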
      - MontyCarloHall 2 hours ago
        I agree, it's an oddly low threshold. The latency differential of NFS vs. S3 is a couple OOMs, so a threshold of ~10MB seems more appropriate to me. Perhaps it's set intentionally low to avoid racking up immense EFS bills? Setting it higher would effectively mean getting billed $0.03/GB for a huge fraction of reads, which is untenable for most people's applications.
      - antonvs 2 hours ago
        > NVMe read latency is in the 10-100µs range for 128kB blocks. S3 is about 100ms. That's 3-4 OOMs.
        Aren't you comparing local in-process latency to network latency? That's multiple OOM right there.
        - the8472 2 hours ago
          No, within the same DC network latency does not add that much. After all EFS also manages 600µs average latency. It's really just S3 that's slow. I assume some large fraction of S3 is spread over HDDs, not SSDs.
everfrustrated 5 minutes ago
The best way to think of the architecture of this is it's EFS with a bidirectional sync to S3.
You can write into one and read out from the other and vice versa. Consistency guarantees kept within each but not between.
wbl 4 hours ago
"NFS provides the semantics your applications expect" is one of the funniest things I have ever read.
- danudey 2 hours ago
  Do your applications not expect any network hiccup to cause them to block indefinitely in a system call making them effectively unkillable and making the filesystem unmountable?
- boulos 1 hour ago
  Compared to roll-your-own with S3 or GCS it does :)
rdtsc 5 hours ago
Synchronization bits is what I was wondering about: https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-fil...
> For example, suppose you edit /mnt/s3files/report.csv through the file system. Before S3 Files synchronizes your changes back to the S3 bucket, another application uploads a new version of report.csv directly to the S3 bucket. When S3 Files detects the conflict, it moves your version of report.csv to the lost and found directory and replaces it with the version from the S3 bucket.
> The lost and found directory is located in your file system's root directory under the name .s3files-lost+found-file-system-id.
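A toy model of the rule described in that quote, in Python. This is not AWS's implementation: the class `ToyFileSystem`, its method names, and the choice to keep the original relative path inside the lost and found directory are all assumptions made for illustration. Only the directory naming pattern comes from the docs.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFileSystem:
    fs_id: str
    files: dict = field(default_factory=dict)  # path -> bytes, as seen through the mount
    dirty: set = field(default_factory=set)    # paths edited locally but not yet synced

    @property
    def lost_found(self):
        # Naming pattern from the docs: .s3files-lost+found-<file-system-id>
        return f".s3files-lost+found-{self.fs_id}"

    def edit(self, path, data):
        """Local edit through the file system; not yet synchronized to the bucket."""
        self.files[path] = data
        self.dirty.add(path)

    def on_bucket_change(self, path, bucket_data):
        """A new version of the object was uploaded directly to the bucket."""
        if path in self.dirty:
            # Conflict: keep the unsynced local version in lost and found.
            self.files[f"{self.lost_found}/{path}"] = self.files[path]
            self.dirty.discard(path)
        # The bucket version always wins at the original path.
        self.files[path] = bucket_data


if __name__ == "__main__":
    fs = ToyFileSystem("fs-example")
    fs.edit("report.csv", b"my local edit")
    fs.on_bucket_change("report.csv", b"uploaded directly to S3")
    for path, data in sorted(fs.files.items()):
        print(path, data)
```

The point to notice is that the local edit is not merged or rejected: it is silently moved aside, so an application that never looks in the lost and found directory will simply see its write disappear.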
abidlabs 2 hours ago
Hugging Face Buckets also recently added support for mounting Buckets as a filesystem: https://huggingface.co/changelog/hf-mount
dabinat 1 hour ago
The problem with using S3 as a filesystem is that it’s immutable, and that hasn’t changed with S3 Files. So if I have a large file and change 1 byte of it, or even just rename it, it needs to upload the entire file all over again. This seems most useful for read-heavy workflows of files that are small enough to fit in the cache.
- wolttam 1 hour ago
  That’s not that different than CoW filesystems - there is no rule that files must map 1:1 to objects; you can (transparently) divide a file into smaller chunks to enable more fine grained edits.
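  A rough sketch of that chunking idea in Python, assuming fixed-size chunks stored as separate objects. Nothing here describes how S3 Files actually stores data; the function names and the 1 MiB chunk size are hypothetical, and shrinking files (where trailing chunks would need deleting) are not handled.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MiB, an arbitrary choice for the sketch


def chunk_digests(data, chunk_size=CHUNK_SIZE):
    """SHA-256 digest of each fixed-size chunk of the file contents."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]


def changed_chunks(old, new, chunk_size=CHUNK_SIZE):
    """Indices of chunks that would need re-uploading after an edit."""
    old_digests = chunk_digests(old, chunk_size)
    new_digests = chunk_digests(new, chunk_size)
    return [
        i for i, digest in enumerate(new_digests)
        if i >= len(old_digests) or old_digests[i] != digest
    ]


if __name__ == "__main__":
    original = bytes(8 * CHUNK_SIZE)       # 8 MiB of zeros
    edited = bytearray(original)
    edited[3 * CHUNK_SIZE + 5] = 0x42      # change a single byte
    print(changed_chunks(original, bytes(edited)))
```

  The trade-off is that the bucket no longer holds one object per file, so anything reading the bucket directly needs to know the chunk layout.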
curt15 1 hour ago
How does this compare with ZFS's object storage backend? https://news.ycombinator.com/item?id=46620673
jitl 5 hours ago
I wish they offered some managed bridging to local NVMe storage. AWS NVMe is super fast compared to EBS, and EBS (node-exclusive access as block device) is faster than EFS (multi-node access). I imagine this can go fast if you put some kind of further-cache-to-NVMe FS on top, but a completely vertically integrated option would be much better.
- MontyCarloHall 3 hours ago
  Since EFS is just an NFS mount, I wonder if you could do this yourself by attaching an NVMe volume to your instance and setting up something like cachefilesd on the NFS mount, pointed to the NVMe.
  Would
```
   mkfs.ext4 /dev/nvme0n1 && \
   mount /dev/nvme0n1 /var/cache/fscache && \
   mount -t s3files -o fsc fs-0aa860d05df9afdfe:/ /home/ec2-user/s3files
```
  work out of the box? It does for EFS. It hardly seems worth it to offer a managed service that's effectively three shell commands, but this is AWS we're talking about.
  - jitl 1 hour ago
    AWS's [docs on EFS performance](https://docs.aws.amazon.com/efs/latest/ug/performance-tips.h...) say:
    > Don't use the following mount options:
    > - fsc – This option enables local file caching, but does not change NFS cache coherency, and does not reduce latencies.
    If the S3 Files sync logic ran client-side, we could almost entirely avoid file access latency for cached files and paying for new expensive EFS disks. I already pay for a lot of NVMe disks, let me just use those!
    - MontyCarloHall 44 minutes ago
      >This option enables local file caching, but does not change NFS cache coherency, and does not reduce latencies.
      That's true for any NFS setup, not just EFS. The benefit of local NFS caching is to speed up reads of large, immutable files, where latency is relatively negligible. I'm not sure why AWS specifically dissuades users from enabling caching, since it's not like bandwidth to an EFS volume is even in the ballpark of EBS/NVMe bandwidth.
nyc_pizzadev 5 hours ago
This is very close to its first official release: https://fiberfs.io/
Built in cache, CDN compatible, JSON metadata, concurrency safe and it targets all S3 compatible storage systems.
- mikestorrent 3 hours ago
  How would you compare this to Amazon's own FUSE implementation? I think it's on its 3rd major reincarnation now
gonzalohm 6 hours ago
I cannot 100% confirm this, but I believe AWS insisted a lot on NOT using S3 as a file system. Why the change now?
- yandie 6 hours ago
  It appears that they put an actual file system in front of S3 (AWS EFS basically) and then perform transparent syncing. The blog post discusses a lot of caveats, such as consistency and object naming (inconsistencies are emitted as events to customers).
  Having been a fan of S3 for such a long time, I'm really a fan of the design. It's a good compromise and kudos to whoever managed to push through the design.
- PedroBatista 3 hours ago
  People, and by people I mean architects and lead devs at big-account orgs ($$$), have been using S3 as a filesystem as one of the backbones of their usually wacky, mega-complex projects.
  So there has always been pressure on AWS to make it work like that. I suspect the number of support tickets AWS receives related to "My S3-backed project is slow/fails sometimes/runs into AWS limits (like the max number of buckets per account)", plus the "Why don't..." questions in the design phase, when AWS people are often in the room, add up to enough long-applied pressure to overcome the technical limitations of S3.
  I'm not a fan of this type of "let's put a fresh coat on top of it and pretend it's something it fundamentally is not" abstraction. But I suspect this is a case of social pressure turbocharged by $$$.
- munk-a 2 hours ago
  I think it opens them up to a huge customer base of less technically apt people who just downloaded some random "S3asYourFS.exe" program but also opens them up to needing to support that functionality and field support calls from less technically apt people. I don't know if that business decision makes sense (since AWS already lacks the CS infrastructure to even deal with professional clients) but the idea that you could get everyone and their brother paying monthly fees to AWS is likely too tempting of a fruit to pass up.
- PunchyHamster 5 hours ago
  Because people will use it as a filesystem regardless of the original intent, because it is a very convenient abstraction. So might as well do it in an optimal and supported way, I guess?
- LazyMans 6 hours ago
  They found a way to make money on it by putting a cache in front of it. Less load for them, better performance for you. Maybe you save money, maybe you don't.
- karmasimida 48 minutes ago
  This is how tech people think, but customers still want this, so it will be built eventually.
- jitl 5 hours ago
  Because without significant engineering effort (see the blog post), the mismatch between object store semantics and file semantics means you will probably Have A Bad Time. In much earlier eras of S3, there were also some implementation specifics like throughput limits based on key prefixes (that one vanished circa 2016) that made it even worse to use for hierarchical directory shapes.
koolba 5 hours ago
If you thought locking semantics over NFS were wonky, just wait till we throw a remote S3 backend into the mix!
miguel_martin 4 hours ago
Dumb Q: what would happen if you used this to store a SQLite database? Would it just... work?
My guess is this would only enable a read-replica and not backups as Litestream currently does?
- laurencerowe 2 hours ago
  SQLite’s locking is not NFS safe so this would not work.
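  For context, SQLite coordinates writers through locks on the database file, and its own documentation warns that such file locking is unreliable on many NFS implementations. A minimal Python sketch of that dependency, run against a local file where locking does work (the function name `second_writer_is_blocked` is made up for the example):

```python
import os
import sqlite3
import tempfile


def second_writer_is_blocked(db_path):
    """Return True if a second connection is refused while the first holds an exclusive lock."""
    # isolation_level=None: no implicit transactions are opened by the driver.
    # timeout=0: fail immediately instead of retrying when the file is locked.
    first = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    second = sqlite3.connect(db_path, timeout=0, isolation_level=None)
    try:
        first.execute("CREATE TABLE IF NOT EXISTS t (x INTEGER)")
        first.execute("BEGIN EXCLUSIVE")  # takes the lock on the database file
        try:
            second.execute("BEGIN EXCLUSIVE")
        except sqlite3.OperationalError:
            return True  # "database is locked": the file lock was honored
        second.execute("ROLLBACK")
        return False
    finally:
        second.close()
        first.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(second_writer_is_blocked(os.path.join(tmp, "demo.db")))
```

  On a file system where these locks are not reliably enforced across clients, both writers could proceed at once, which is how databases get corrupted.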
nvartolomei 6 hours ago
> changes are aggregated and committed back to S3 roughly every 60 seconds as a single PUT
Single PUT per file I assume?
- LazyMans 6 hours ago
  Based on docs, correct.
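  A minimal sketch of what "aggregated and committed as a single PUT per file" could look like, in Python. This is an illustration of write coalescing in general, not the S3 Files implementation: `WriteCoalescer` and the `put_object` callback are hypothetical names, every file is assumed to start empty, and the roughly 60-second timer that would trigger `flush` is left out.

```python
class WriteCoalescer:
    """Buffers file writes and emits one PUT per dirty file at each flush."""

    def __init__(self, put_object):
        self._put_object = put_object  # hypothetical uploader: callable(key, data)
        self._pending = {}             # path -> bytearray holding the latest contents

    def write(self, path, offset, data):
        """Apply a write at the given offset to the buffered copy of the file."""
        buf = self._pending.setdefault(path, bytearray())
        if len(buf) < offset:
            buf.extend(b"\0" * (offset - len(buf)))  # zero-fill any gap
        buf[offset:offset + len(data)] = data

    def flush(self):
        """Upload each dirty file once, then clear the buffer. Returns the PUT count."""
        count = 0
        for path, buf in self._pending.items():
            self._put_object(path, bytes(buf))
            count += 1
        self._pending.clear()
        return count


if __name__ == "__main__":
    puts = []
    coalescer = WriteCoalescer(lambda key, data: puts.append((key, data)))
    coalescer.write("a.txt", 0, b"hello")
    coalescer.write("a.txt", 5, b" world")  # second write to the same file
    coalescer.write("b.txt", 0, b"x")
    print(coalescer.flush(), puts)
```

  Three writes across two files result in two PUTs, one per file, regardless of how many writes each file received during the window.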
mgaunard 6 hours ago
Zero mention of s3fs which already did this for decades.
- huntaub 5 hours ago
  This is pretty different than s3fs. s3fs is a FUSE file system that is backed by S3.
  This means that all of the non-atomic operations that you might want to do on S3 (including edits to the middle of files, renames, etc.) are run on the machine running s3fs. As a result, if your machine crashes, it's not clear what's going to show up in your S3 bucket or whether it would corrupt things.
  s3fs is also slow, because the next stop after your machine is S3, which isn't suitable for many file-based applications.
  What AWS has built here is different: using EFS as the middle layer means that there's a safe, durable place for your file system operations to go while they're being assembled into object operations. It also means that the performance should be much better than s3fs (it's talking to SSDs where data is 1ms away instead of HDDs where data is 30ms away).
  - ChocolateGod 5 hours ago
    You can also use something like JuiceFS to make using S3 as a shared filesystem more sane, but you're moving all the metadata to a shared database.
- luke5441 6 hours ago
  A more solid (especially when it comes to caching) solution would be appreciated.
  I thought that would be their https://github.com/awslabs/mountpoint-s3 . But no mention about this one either.
  S3 files does have the advantage of having a "shared" cache via EFS, but then that would probably also make the cache slower.
  - PunchyHamster 5 hours ago
    I'd assume you can still have local cache in addition to that.
- bmurphy1976 1 hour ago
  There's also https://github.com/kahing/goofys, a Go equivalent. A bit of a dead project these days.
- moralestapia 2 hours ago
  Yeah, that blog post was written as if sliced bread has been invented again.
  Reading through it, I was only thinking "is this distinguished engineer TOC 2M aware that people have been doing this since forever?".
- rowanG077 6 hours ago
  I was thinking: "No way this has existed for decades". But the earliest I can find it existing is 2008. Strictly speaking not decades but much closer to it than I expected.
PunchyHamster 5 hours ago
Eagerly awaiting the first blog post where developers didn't read the eventually consistent part, lost the data and made some "genius" workaround with the help of the LLM that got them into that spot in the first place.
dang 4 hours ago
Since this is the thread that got attention, I've added the announcement link to the toptext and made the title work for both.
themafia 6 hours ago
> we locked a bunch of our most senior engineers in a room and said we weren’t going to let them out till they had a plan that they all liked.
That's one way to do it.
> When you create or modify files, changes are aggregated and committed back to S3 roughly every 60 seconds as a single PUT. Sync runs in both directions, so when other applications modify objects in the bucket, S3 Files automatically spots those modifications and reflects them in the filesystem view automatically.
That sounds about right given the above. I have trouble seeing this as something other than a giant "hack." I already don't enjoy projecting costs for new types of S3 access patterns, and I feel like this has the potential to double the complication I already experience here.
Maybe I'm too frugal, but I've been in the cloud for a decade now, and I've worked very hard to prevent any "surprise" bills from showing up. This seems like a great feature, if you don't care what your AWS bill is each month.
- avereveard 6 hours ago
  There is a staggering number of users doing this with extra steps using FSx for Lustre; their lives got greatly simplified today (unless they use GPUDirect Storage, I guess).
  - themafia 6 hours ago
    Good point. There's a wide gulf between being able to design your workflow for S3 and trying to map an existing workflow to it.
mbana 5 hours ago
Werner Vogels is awesome. I first discovered his writing when I learned about DynamoDB.
up2isomorphism 4 hours ago
This is why today’s sales pitches are often disguised as tech blogs.
goekjclo 6 hours ago
the "under the hood uses EFS" part is the most interesting bit here
gervwyk 5 hours ago
any recommendations for a lambda based sftp server setup?
Centigonal 3 hours ago
Terrible day for people who sloppily use filesystem vocabulary when referring to S3 objects and prefixes.
minutesmith 5 hours ago
[flagged]
- glenjamin 4 hours ago
  The way AWS keep their pricing section completely separate from their system and architecture docs, despite architecture being the primary driver of cost, is a major contributor to this
devnotes77 2 hours ago
[dead]
ovaistariq 5 hours ago
TLDR: EFS as an eventually consistent cache in front of S3.
mritchie712 5 hours ago
tldr: this caches your S3 data in EFS.
we run datalakes using DuckLake and this sounds really useful. GCP should follow suit quickly.
- hiyer 3 hours ago
  I was thinking of using it with DuckDB as well, but it seems it would be of limited benefit. Parquet objects are in MBs, so they would be streamed directly from S3. With raw Parquet objects, it might help with S3 listing if you have a lot of them (shave a couple of seconds off the query). If you are already on DuckLake, DuckDB will use that for getting the list of relevant objects anyway.
  - wenc 1 hour ago
    Maybe the OP is thinking of reading/writing to DuckDB native format files. Those require filesystem semantics for writing. Unfortunately, even NFS or SMB are not sufficiently FS-like for DuckDB.
    Parquet is static append only, so DuckDB has no problems with those living on S3.
- anentropic 4 hours ago
  I am curious about this use case
  How do you see it helping with DuckLake?
DenisM 6 hours ago
TLDR: Eventually consistent file system view on top of s3 with read/write cache.
CrzyLngPwd 6 hours ago
If there is ever a post that needs a TLDR or an AI summary it is that one.
Sell the benefits.
I have around 9 TB in 21m files on S3. How does this change benefit me?
- dijksterhuis 5 hours ago
  not everything should or needs to be some article geared towards the audience's convenience, or selling something to the audience. pretty much all allthingsdistributed articles are long form articles covering highly technical systems and contain a decent whack of detail/context. in my mind, they veer closer to "computer scientist does blog posts" compared to "5 ways React can boost your page visits" listicles.
  edited slightly ... i really need to turn 10 minute post delay back on.
- jz-amz 6 hours ago
  Check out the "what's new": https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-s3...