Posit AI Blog: safetensors 0.1.0

safetensors is a new, simple, fast, and safe file format for storing tensors. The design of the file format and its original implementation are led by Hugging Face, and the format is being widely adopted in their popular ‘transformers’ framework. The safetensors R package is a pure-R implementation that can both read and write safetensors files.

The initial version (0.1.0) of safetensors is now on CRAN.

Motivation

The main motivation for safetensors in the Python community is security. As noted
in the official documentation:

The main rationale for this crate is to remove the need to use pickle on PyTorch which is used by default.

Pickle is considered an unsafe format, as the action of loading a Pickle file can
trigger the execution of arbitrary code. This has never been a concern for torch
for R users, since the Pickle parser that is included in LibTorch only supports a subset
of the Pickle format, which doesn’t include executing code.

However, the file format has additional advantages over other commonly used formats, including:

  • Support for lazy loading: You can choose to read a subset of the tensors stored in the file.

  • Zero copy: Reading the file does not require more memory than the file itself.
    (Technically, the current R implementation does make a single copy, but that can
    be optimized away if it ever becomes necessary.)

  • Simple: Implementing the file format is simple, and doesn’t require complex dependencies.
    This means that it’s a good format for exchanging tensors between ML frameworks and
    between different programming languages. For instance, you can write a safetensors file
    in R and load it in Python, and vice-versa.

The safetensors documentation includes a table comparing it with other file formats
common in this space.

Format

The safetensors format is described in the figure below. A file starts with a header
containing some metadata, followed by the raw tensor buffers.

[Figure: diagram of the safetensors file format]
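
To make the layout concrete, here is a minimal base-R sketch that builds a tiny file by hand and reads a tensor back. It is not the package's implementation. It relies on the published format: an 8-byte little-endian header length, then a JSON header, then the tensor bytes, with data_offsets counted from the first byte after the header. The helper read_tensor_values() is hypothetical, it only handles F32 data, and jsonlite is assumed to be installed.

```r
library(jsonlite) # assumption: installed; used only to parse the JSON header

# Build a tiny file by hand: one float32 tensor "x" with values 1, 2, 3.
header <- '{"x":{"dtype":"F32","shape":[3],"data_offsets":[0,12]}}'
header_raw <- charToRaw(header)

path <- tempfile(fileext = ".safetensors")
con <- file(path, "wb")
# Header length as 8 little-endian bytes (the upper 4 bytes are zero here).
writeBin(length(header_raw), con, size = 4L, endian = "little")
writeBin(0L, con, size = 4L, endian = "little")
writeBin(header_raw, con)                               # JSON header
writeBin(c(1, 2, 3), con, size = 4L, endian = "little") # raw float32 buffer
close(con)

# Hypothetical helper: read the values of one F32 tensor straight from the file.
read_tensor_values <- function(path, name) {
  con <- file(path, "rb")
  on.exit(close(con))
  len_bytes <- readBin(con, "raw", n = 8L)
  # Combine the little-endian bytes into a number (exact below 2^53).
  header_len <- sum(as.numeric(len_bytes) * 256^(0:7))
  header <- fromJSON(rawToChar(readBin(con, "raw", n = header_len)))
  meta <- header[[name]]
  # Skip the bytes of any tensors stored before this one.
  readBin(con, "raw", n = meta$data_offsets[1])
  readBin(con, "numeric", n = prod(meta$shape), size = 4L, endian = "little")
}

x <- read_tensor_values(path, "x")
x
#> [1] 1 2 3
```

Because the header records where each tensor starts and ends, a reader can jump to a single tensor without touching the rest of the file, which is what makes lazy loading possible.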

Basic usage

safetensors can be installed from CRAN using:

install.packages("safetensors")

We can then write any named list of torch tensors:

library(torch)
library(safetensors)

tensors <- list(
  x = torch_randn(10, 10),
  y = torch_ones(10, 10)
)

str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]

tmp <- tempfile()
safe_save_file(tensors, tmp)

You can attach additional metadata to the saved file by passing a named list as the
metadata argument.
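
As a minimal sketch of that call, assuming the argument is named metadata as described above (the keys and values below are made up for illustration):

```r
library(torch)
library(safetensors)

tensors <- list(x = torch_randn(10, 10))
tmp <- tempfile()

# Illustrative metadata; these keys and values are invented for this example.
safe_save_file(
  tensors, tmp,
  metadata = list(model = "demo", note = "saved from R")
)
```

In the safetensors format, such entries are stored as strings under a special `__metadata__` key in the file header, so it is safest to stick to character values.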

safe_load_file reads safetensors files. It returns the named list of tensors, with a
metadata attribute containing the parsed file header.

tensors <- safe_load_file(tmp)
str(tensors)
#> List of 2
#>  $ x:Float [1:10, 1:10]
#>  $ y:Float [1:10, 1:10]
#>  - attr(*, "metadata")=List of 2
#>   ..$ x:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 0 400
#>   ..$ y:List of 3
#>   .. ..$ shape       : int [1:2] 10 10
#>   .. ..$ dtype       : chr "F32"
#>   .. ..$ data_offsets: int [1:2] 400 800
#>  - attr(*, "max_offset")= int 929

Currently, safetensors only supports writing torch tensors, but we plan to add
support for writing plain R arrays and tensorflow tensors in the future.
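
The lazy loading mentioned in the motivation can be sketched with the package's lower-level safetensors class. This assumes the class exposes keys() and get_tensor() methods and accepts a framework argument, as in the package reference at the time of writing; treat the exact signature as an assumption and check the documentation of your installed version.

```r
library(torch)
library(safetensors)

tmp <- tempfile()
safe_save_file(list(x = torch_randn(10, 10), y = torch_ones(10, 10)), tmp)

# Open the file; this is assumed to parse only the header, not the tensor data.
f <- safetensors$new(tmp, framework = "torch")

f$keys()               # names of the tensors stored in the file
y <- f$get_tensor("y") # loads only the tensor named "y"
```

This is mostly useful for large checkpoints, where you may only need a few of the stored tensors.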

Future directions

The next version of torch will use safetensors as its serialization format,
meaning that when calling torch_save() on a model, list of tensors, or other
types of objects supported by torch_save, you will get a valid safetensors file.

This is an improvement over the previous implementation because:

  1. It’s much faster: more than 10x for medium-sized models, and possibly more for large files.
    This also improves the performance of parallel dataloaders by ~30%.

  2. It enhances cross-language and cross-framework compatibility. You can train your model
    in R and use it in Python (and vice-versa), or train your model in tensorflow and run it
    with torch.

If you want to try it out, you can install the development version of torch with:

remotes::install_github("mlverse/torch")

Photo by Nick Fewings on Unsplash

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don’t fall under this license and can be recognized by a note in their caption: “Figure from …”.

Citation

For attribution, please cite this work as

Falbel (2023, June 15). Posit AI Blog: safetensors 0.1.0. Retrieved from https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/

BibTeX citation

@misc{safetensors,
  author = {Falbel, Daniel},
  title = {Posit AI Blog: safetensors 0.1.0},
  url = {https://blogs.rstudio.com/tensorflow/posts/2023-06-15-safetensors/},
  year = {2023}
}
