Safe, Fast, and Memory Efficient Loading of LLMs with Safetensors

By default, PyTorch saves and loads models using Python’s pickle module. As pointed out by Python’s official documentation, pickle is not secure:

Warning The pickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.
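The warning above can be demonstrated in a few lines. This is a minimal, hypothetical illustration (the class name `Malicious` and the harmless `eval` payload are mine): any object can define `__reduce__` to make pickle run arbitrary code during unpickling.

```python
import pickle

class Malicious:
    def __reduce__(self):
        # __reduce__ returns a callable and its arguments; pickle invokes
        # the callable during unpickling. Here it is a harmless eval, but a
        # real attack could call os.system or download and run code.
        return (eval, ("6 * 7",))

payload = pickle.dumps(Malicious())
result = pickle.loads(payload)  # executes eval("6 * 7") during unpickling
print(result)  # 42
```

The code executes the moment `pickle.loads` is called, before the caller ever inspects the returned object, which is why trusting the source of a pickled checkpoint is the only defense.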

Not only is pickle unsafe, but loading large PyTorch models with it is also inefficient. When you load a model, PyTorch performs the following steps:

  1. Create an empty model
  2. Load the model weights into memory
  3. Copy the weights loaded at step 2 into the empty model created at step 1
  4. Move the model obtained at step 3 onto the device used for inference, e.g., a GPU
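The steps above correspond to the standard PyTorch loading idiom. Here is a self-contained sketch using a tiny `nn.Linear` as a stand-in for a full LLM (the filename `model.pt` is illustrative; the save call only exists so the example runs on its own):

```python
import torch
import torch.nn as nn

# Stand-in for a checkpoint already on disk.
torch.save(nn.Linear(16, 16).state_dict(), "model.pt")

model = nn.Linear(16, 16)                                # step 1: empty model
state_dict = torch.load("model.pt", map_location="cpu")  # step 2: weights into RAM
model.load_state_dict(state_dict)                        # step 3: copy weights into the model
# model.to("cuda")                                       # step 4: move to the inference device
```

During steps 2 and 3 the weights exist twice: once in `state_dict` and once inside `model`.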

Because the weights are first loaded into memory at step 2 and then copied into the model at step 3, instead of being loaded directly in place, PyTorch temporarily needs enough free memory to hold two copies of the model.

There are various solutions to secure the models and efficiently load them.

In this article, I present safetensors. It’s a model format designed for secure loading whose development was initiated by Hugging Face. In the following sections, I show you how to save, load, and convert models with safetensors. I also benchmark safetensors against PyTorch pickle using Llama 2 7B as an example.

Note: safetensors is distributed with the Apache 2.0 license.
