
[idea] Use higher level API for copying file chunks

by Guest on 2024/12/25 07:37:33 PM    
https://despairlabs.com/blog/posts/2024-10-27-openzfs-dedup-is-good-dont-use-it/

The thing is, it's actually pretty rare these days that you have a write operation coming from some kind of copy operation, but you don't know that it came from a copy operation. In the old days, a client program would read the source data and write it to the destination, and the storage system would see these as two unrelated operations. These days though, "copy offloading" is readily available, where instead of reading and writing, the program will tell the storage system "copy this source to that destination" and the storage system is free to do that however it wants. A naive implementation will just do the same read and write as the client would, but a smarter system can do something different, for example not doing the write at all and instead just reusing the existing data and bumping a refcount.

For Linux and FreeBSD filesystems, this “offload” facility is the copy_file_range() syscall. Most systems have an equivalent; macOS calls it copyfile(), Windows calls it FSCTL_SRV_COPYCHUNK. NFS and CIFS support something like it, OS block device drivers are getting equivalents, even disk protocols have something like it (eg SCSI EXTENDED COPY or NVMe Copy).
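For concreteness, here is a minimal sketch of that offload call on Linux (not taken from the article: the file names and the 1 MiB length are invented, and the wrapper needs glibc 2.27 or newer):

/* Minimal sketch: copy one chunk with copy_file_range() on Linux.
 * File names and the 1 MiB length are invented for illustration. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    int in  = open("source.bin", O_RDONLY);
    int out = open("dest.bin", O_WRONLY | O_CREAT, 0644);
    if (in < 0 || out < 0) { perror("open"); return 1; }

    loff_t off_in = 0, off_out = 0;
    size_t remaining = 1 << 20;              /* copy 1 MiB */

    while (remaining > 0) {
        /* The kernel and the filesystem decide how to satisfy this;
         * on a cloning filesystem it can become a refcount bump
         * instead of a physical read plus write. */
        ssize_t n = copy_file_range(in, &off_in, out, &off_out, remaining, 0);
        if (n < 0) { perror("copy_file_range"); return 1; }
        if (n == 0) break;                   /* reached end of source */
        remaining -= (size_t)n;
    }

    close(in);
    close(out);
    return 0;
}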

If you put all this together, you end up in a place where, so long as the client program (like /bin/cp) can issue the right copy offload call and all the layers in between can translate it (eg the Windows application does FSCTL_SRV_COPYCHUNK, which Samba converts to copy_file_range() and ships down to OpenZFS), the request arrives at the filesystem as a copy rather than as an unrelated read and write. And because there's that clear and unambiguous signal that the data already exists and is right there, OpenZFS can just bump the refcount in the BRT.

Most important is the space difference. If a block is never cloned, then we never pay for it, and if it is cloned, the BRT entry is only 16 bytes.

https://wiki.samba.org/index.php/Server-Side_Copy

The greatest advantage for Tixati would be optimizing the incomplete-pieces mechanism for the case where the cache is stored on the same disk/network location as the target download location.
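As an illustration of how that could look (a hypothetical sketch, not Tixati's actual internals; the function name and layout are invented), moving a verified piece from a cache file into the target could use the same syscall, and on a cloning filesystem the data would not be physically rewritten:

/* Hypothetical sketch (not Tixati code): move a verified piece from a
 * cache file into the target file at its final offset. On a cloning
 * filesystem, copy_file_range() can turn this into a block clone. */
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

int relocate_piece(int cache_fd, int target_fd,
                   uint32_t piece_index, size_t piece_len)
{
    loff_t src = 0;                                 /* piece file starts at 0 */
    loff_t dst = (loff_t)piece_index * (loff_t)piece_len;  /* final position */
    size_t left = piece_len;

    while (left > 0) {
        ssize_t n = copy_file_range(cache_fd, &src, target_fd, &dst, left, 0);
        if (n <= 0)
            return -1;           /* caller would fall back to read()/write() */
        left -= (size_t)n;
    }
    return 0;
}

(This assumes a single-file layout, where the final offset is simply piece_index * piece_length.)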

Thank you for making (and the other you for using) Tixati!
by Guest on 2024/12/28 05:11:52 PM    
The thing is, there's literally no need to be copying file chunks at all. The current implementation of a single cache folder on a single drive is highly wasteful of bandwidth and presents a significant system bottleneck, as I've argued before.

If the "empty" target files are pre-allocated on the target drive, then collating the pieces into them from a cache kept elsewhere, whether on a different drive or not, requires more system handles for all those small piece files and more disk accesses (since you are copying each completed piece into place in the target as well as deleting it from the cache). Critically, ALL the torrent activity (every write, every validation and copy read, the subsequent cache deletions, and the updates to the torrent completion bitfield, which is also stored in the cache) has to go through that SINGLE cache drive interface. The advantage of storing the target files on different physical drives, spreading the bandwidth load across many drives (which can help in some cases with cheaper "consumer" drives rather than server drives designed to handle heavy request loads), is COMPLETELY lost with a single cache on a single drive.

The far simpler implementation would be to use a single per-file handle and store all incoming data DIRECTLY into the target file(s) at its final location, so there is no central bandwidth bottleneck. There is no functional difference, except that you don't have to copy the data all over again; you can still verify each piece in situ in the target, since you know the byte location of every piece within the files; the system doesn't have to handle thousands of small files; and (most importantly) if there is a crash or other SNAFU, ALL the data that has been downloaded is IN PLACE in the target files, not sitting in piece files elsewhere that may not be usable by other clients or by Tixati on restart, and so wasted. The download may not be complete, but in the very worst case it might still leave a file viewable or usable instead of whole chunks being missing. And you still have your completion bitfield in the cache to track what should have been downloaded per piece.
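To make that concrete, here is a rough sketch of the direct-write idea for a single-file torrent. The helper names are hypothetical, not Tixati's internals, and OpenSSL's SHA1 stands in for whatever hashing the client actually uses:

/* Hypothetical sketch: write incoming blocks straight into the target
 * file at their final position, and verify each completed piece in situ
 * against the 20-byte hash from the .torrent metadata. */
#include <openssl/sha.h>     /* SHA1(); link with -lcrypto */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Store one received block directly at its final offset - no cache copy. */
ssize_t store_block(int target_fd, uint32_t piece_index, size_t piece_len,
                    uint32_t block_offset, const void *data, size_t data_len)
{
    off_t pos = (off_t)piece_index * (off_t)piece_len + block_offset;
    return pwrite(target_fd, data, data_len, pos);
}

/* Re-read a completed piece from the target and check its hash in place. */
int verify_piece(int target_fd, uint32_t piece_index, size_t piece_len,
                 const unsigned char expected[20])
{
    unsigned char digest[20];
    unsigned char *buf = malloc(piece_len);
    if (!buf)
        return -1;
    if (pread(target_fd, buf, piece_len, (off_t)piece_index * (off_t)piece_len)
            != (ssize_t)piece_len) {
        free(buf);
        return -1;
    }
    SHA1(buf, piece_len, digest);
    free(buf);
    return memcmp(digest, expected, 20) == 0;    /* 1 = piece verified */
}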

Now I agree that there might still need to be some cache mechanism for pieces that cross into files that are not being downloaded, or for the case where the empty target files are still waiting to be initialised/written while pieces are already coming in, and it would require a slight tweak to the initial piece validation algorithm where a piece crosses files (instead of the entire chunk of data conveniently sitting in its own file parcel in the cache). But these are minuscule programming inconveniences compared to the huge reduction in system resources this would provide, and I'd suggest it might also improve the end-user experience on some systems.
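For the cross-file case, the bookkeeping is just offset arithmetic over the torrent's file table. A hypothetical sketch (struct and function names invented for illustration):

/* Hypothetical sketch: split a piece's byte range into per-file slices,
 * so each slice can be written to the right target file (or diverted to
 * a small leftover cache when that file is deselected). */
#include <stddef.h>
#include <stdint.h>

struct tor_file {
    const char *path;       /* final on-disk path */
    uint64_t    offset;     /* start of this file within the torrent data */
    uint64_t    length;     /* file length in bytes */
    int         wanted;     /* 0 = user is not downloading this file */
};

struct span { size_t file_index; uint64_t file_offset; uint64_t length; };

size_t map_piece(const struct tor_file *files, size_t nfiles,
                 uint64_t piece_start, uint64_t piece_len,
                 struct span *out, size_t max_spans)
{
    uint64_t end = piece_start + piece_len;
    size_t n = 0;

    for (size_t i = 0; i < nfiles && n < max_spans; i++) {
        uint64_t f_start = files[i].offset;
        uint64_t f_end   = f_start + files[i].length;
        if (f_end <= piece_start || f_start >= end)
            continue;                     /* this file is outside the piece */
        uint64_t s = piece_start > f_start ? piece_start : f_start;
        uint64_t e = end < f_end ? end : f_end;
        out[n++] = (struct span){ i, s - f_start, e - s };
        /* Slices landing in files with wanted == 0 are exactly the ones
         * that would still need the small leftover cache mentioned above. */
    }
    return n;
}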



