AirLLM Tutorial: Run 70B Models with Low Memory Usage

Seb Intel · 38s · Watch on YouTube · 7 sources

Decision Card

Effort: A focused evening. Run pip install airllm, point it at a 70B checkpoint on Hugging Face, and run one prompt. Budget extra time for the download (roughly 140GB for an unquantized fp16 70B checkpoint, about 40GB for a 4-bit one) and for the first-run layer-sharding step (lyogavin/airllm).

Honest take: The video is a 38-second ad (it ends with “comment ‘Air’ and I’ll send you the link”) and omits the one number that matters: speed. AirLLM’s disk-swapping approach runs at roughly 1–3 tokens/sec at best and can slow to several seconds or more per token, which makes it unusable for interactive chat. “Run a 70B model” is true; “run it usefully on your MacBook” is not.

Concrete next steps:

  • Read the author’s own write-up before installing to set expectations — HuggingFace blog: AirLLM (15 min).
  • If you proceed, test on an NVMe SSD with a single batch job (summarization, not chat) and measure tokens/sec yourself (~1 hr including download).
  • Skip if you need real-time responses or own a GPU with enough VRAM for a quantized 70B in llama.cpp (8–15 tok/s on a 4090) — that path is 10–20× faster.
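
One way to measure tokens/sec yourself is to time a single generation call. The sketch below is a minimal, model-agnostic helper, not AirLLM's own API: `measure_tokens_per_sec` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that sleeps instead of calling a real model. To use it for a real test you would wrap the library's generate call in a function with the same shape (prompt and token budget in, new token IDs out).

```python
import time
from typing import Callable, List, Sequence


def measure_tokens_per_sec(generate: Callable[[str, int], Sequence[int]],
                           prompt: str,
                           max_new_tokens: int) -> float:
    """Time one generation call and return newly generated tokens per second."""
    start = time.perf_counter()
    new_tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        return float("inf")
    return len(new_tokens) / elapsed


def fake_generate(prompt: str, max_new_tokens: int) -> List[int]:
    """Hypothetical stand-in for a model call; sleeps to imitate slow decoding."""
    tokens = []
    for i in range(max_new_tokens):
        time.sleep(0.01)  # pretend each token takes about 10 ms
        tokens.append(i)
    return tokens


if __name__ == "__main__":
    rate = measure_tokens_per_sec(fake_generate, "Summarize this report.", 5)
    print(f"{rate:.1f} tokens/sec")
```

Timing the whole call includes prompt processing, so for a layer-streaming setup it is worth running the measurement with more than a handful of new tokens and reporting the prompt length alongside the rate.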

TL;DR

AirLLM is an open-source Python library that runs 70B-parameter models on low-VRAM hardware (down to 4GB) by streaming the model into the GPU one layer at a time from disk; it also uses flash attention to keep memory nearly flat on long inputs. It makes large models technically runnable and fully private, but the video skips the catastrophic speed cost of constant disk loading.
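
A rough back-of-envelope calculation shows why one layer at a time can fit where the whole model cannot. This is a sketch under stated assumptions: "70B" is taken at face value, the layer count of 80 is an assumption for a Llama-style 70B model, and embeddings, the output head, activations, and quantization overhead are all ignored.

```python
def weights_gb(n_params: float, bytes_per_param: float) -> float:
    """Size of the raw weights in decimal gigabytes."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 70e9  # "70B" taken at face value
N_LAYERS = 80    # assumption: Llama-style 70B with 80 transformer blocks

fp16_total = weights_gb(N_PARAMS, 2)      # 2 bytes per fp16 weight
int4_total = weights_gb(N_PARAMS, 0.5)    # 4 bits per weight, overhead ignored
fp16_per_layer = fp16_total / N_LAYERS    # rough: ignores embeddings and output head

print(f"fp16 weights:   {fp16_total:.0f} GB")
print(f"4-bit weights:  {int4_total:.0f} GB")
print(f"fp16 per layer: {fp16_per_layer:.2f} GB")
```

Under these assumptions the full fp16 weights come to about 140 GB, far beyond any consumer GPU, while a single layer is under 2 GB and fits in 4GB of VRAM with room left for activations.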

Key Points

  • You can run a 70B-parameter model without a supercomputer using the open-source library AirLLM 00:01
  • The core trick is loading the model layer by layer from the hard drive instead of all at once 00:08
  • The author compares it to reading a book page by page rather than memorizing the whole thing first 00:14
  • A flash attention feature keeps memory usage nearly flat even with long inputs 00:17
  • Models like Llama 3.3 70B are claimed to run directly on a MacBook or gaming computer 00:23
  • Inference is fully private with zero API costs since it runs locally 00:27
  • It is pitched at students, researchers, and indie developers as worth trying 00:30
  • The video is a lead magnet: the link is gated behind commenting “Air” 00:34
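
The layer-by-layer idea from the points above can be illustrated with a toy example. This is not AirLLM's implementation: it uses plain Python lists and JSON files in place of GPU tensors and model shards, and every name in it (`save_layers`, `apply_layer`, `streamed_forward`) is hypothetical. It only shows the pattern of loading one layer, applying it, and releasing it before the next one is read.

```python
import json
import tempfile
from pathlib import Path
from typing import List

Matrix = List[List[float]]


def save_layers(layers: List[Matrix], directory: Path) -> List[Path]:
    """Write each layer's weights to its own file, like a one-time sharding step."""
    paths = []
    for i, weights in enumerate(layers):
        path = directory / f"layer_{i}.json"
        path.write_text(json.dumps(weights))
        paths.append(path)
    return paths


def apply_layer(weights: Matrix, x: List[float]) -> List[float]:
    """Toy layer: a plain matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def streamed_forward(layer_paths: List[Path], x: List[float]) -> List[float]:
    """Run the input through every layer while holding only one layer in memory."""
    for path in layer_paths:
        weights = json.loads(path.read_text())  # load this layer from disk
        x = apply_layer(weights, x)             # compute its output
        del weights                             # release it before loading the next
    return x


if __name__ == "__main__":
    toy_layers = [
        [[1.0, 0.0], [0.0, 2.0]],  # doubles the second component
        [[0.0, 1.0], [1.0, 0.0]],  # swaps the two components
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = save_layers(toy_layers, Path(tmp))
        print(streamed_forward(paths, [3.0, 4.0]))  # [8.0, 3.0]
```

Peak memory here is one layer plus the current activations, which is the point of the book analogy. The cost is that every forward pass reads every layer from disk again, so generation speed depends heavily on storage bandwidth.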

Notable Quotes

“Instead of loading the entire model at once, it loads it layer by layer from your hard drive.” 00:08

“There’s also a feature called flash attention, which keeps memory usage nearly flat even with long inputs.” 00:17

“Models like Llama 3.3 70B can run directly on your MacBook or gaming computer. Fully private with zero API costs.” 00:23

Verified Claims

AirLLM runs 70B models on a single 4GB GPU without quantization. 00:01

The technique works by loading the model layer by layer from disk and freeing memory after each. 00:08

Flash attention keeps memory usage nearly flat even with long inputs. 00:17

Llama 3.3 70B can run “directly on your MacBook or gaming computer.” 00:23

Inference is fully private with zero API costs. 00:27

Implied claim: this is a practical way to use a 70B model (speed is never mentioned). 00:30
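
For the flash attention claim, it helps to see what is being avoided. Standard attention materializes a score matrix of size sequence length squared for every head, so its memory grows quadratically with input length; flash attention computes the same result in tiles without storing that full matrix, so its memory grows linearly with input length. The sketch below only computes the size of the naive score matrix for one layer at batch size 1. The head count of 64 is an assumption for a Llama-style 70B model, and `naive_scores_gb` is a hypothetical helper.

```python
def naive_scores_gb(seq_len: int, n_heads: int, bytes_per_value: int = 2) -> float:
    """Memory for one layer's full seq_len x seq_len score matrix across all heads."""
    return n_heads * seq_len * seq_len * bytes_per_value / 1e9


N_HEADS = 64  # assumption: Llama-style 70B with 64 attention heads

for seq_len in (1024, 4096, 16384):
    size = naive_scores_gb(seq_len, N_HEADS)
    print(f"{seq_len:>6} tokens: {size:.2f} GB for naive attention scores")
```

Each doubling of the input length quadruples this figure, which is why the benefit of avoiding the full matrix only becomes visible on long inputs.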

Follow-up Questions

  1. What are real-world tokens/sec for AirLLM on an NVMe SSD vs. a SATA SSD vs. spinning disk, and at what input length does flash attention’s benefit actually show up?
  2. For a fixed 4GB-VRAM machine, how does AirLLM compare against a 4-bit GGUF quantized 70B in llama.cpp on both quality and speed — is layer-streaming ever the better choice?
  3. Does the author’s “3x speedup via block-wise compression” claim hold up, and does its “almost ignorable accuracy loss” survive independent benchmarking (e.g., MMLU)?
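
As a starting point for the first question, a crude lower bound can be computed before measuring anything. The sketch assumes every layer is re-read from disk for each generated token, with no caching in RAM, no compression, and no overlap between loading and compute. The disk speeds are illustrative round numbers, not measurements, and `disk_bound_seconds_per_token` is a hypothetical helper. Real results can differ substantially, especially on machines with enough RAM for the operating system to cache much of the model.

```python
def disk_bound_seconds_per_token(model_gb: float, read_gb_per_sec: float) -> float:
    """Lower bound on time per token if the whole model is re-read from disk each step."""
    return model_gb / read_gb_per_sec


MODEL_GB = 140.0  # assumption: 70B weights in fp16 (2 bytes each)

# Illustrative sequential-read speeds in GB/s; assumptions, not benchmarks.
DISKS = {"NVMe SSD": 3.0, "SATA SSD": 0.5, "spinning disk": 0.15}

for name, speed in DISKS.items():
    secs = disk_bound_seconds_per_token(MODEL_GB, speed)
    print(f"{name}: at least {secs:.0f} s/token under these assumptions")
```

Comparing a measured rate against this bound shows whether a given setup is limited by storage bandwidth or by something else, such as caching effects or compute.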

Sources