AirLLM Tutorial: Run 70B Models with Low Memory Usage
Decision Card
Effort: A focused evening. pip install airllm, point it at a 70B checkpoint on Hugging Face, and run one prompt; budget extra time for the download (roughly 140GB for unquantized fp16 weights, since 70B parameters × 2 bytes ≈ 140GB) and the first-run layer-sharding step (lyogavin/airllm).
Honest take: The video is a 38-second ad (it ends with “comment ‘Air’ and I’ll send you the link”) and conveniently omits the one number that matters: speed. AirLLM’s disk-swapping approach runs at roughly 1–3 tokens/sec at best and can drop to several seconds or more per token, which makes it unusable for interactive chat. “Run a 70B model” is true; “run it usefully on your MacBook” is not.
Concrete next steps:
- Read the author’s own write-up before installing to set expectations — HuggingFace blog: AirLLM (15 min).
- If you proceed, test on an NVMe SSD with a single batch job (summarization, not chat) and measure tokens/sec yourself (~1 hr including download).
- Skip if you need real-time responses or own a GPU with enough VRAM for a quantized 70B in llama.cpp (8–15 tok/s on a 4090). Against AirLLM's best case of 1–3 tok/s, that path is roughly 3–15× faster.
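To make the "measure tokens/sec yourself" step concrete, here is a minimal sketch of a timing helper. The helper names (tokens_per_second, make_airllm_generate) are hypothetical. The AirLLM calls are adapted from the project's README quick-start for the CUDA path and may differ between releases or on macOS, so treat them as an assumption to check against the current README.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[str, int], Sequence[int]],
                      prompt: str, max_new_tokens: int) -> float:
    """Time one generation call and return new tokens per wall-clock second."""
    start = time.perf_counter()
    new_tokens = generate(prompt, max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(new_tokens) / elapsed


def make_airllm_generate(model_id: str) -> Callable[[str, int], Sequence[int]]:
    """Build a generate(prompt, n) callable backed by AirLLM.

    Assumption: follows the README quick-start (CUDA path). The import is
    deferred so the timing helper works without airllm installed.
    """
    from airllm import AutoModel  # pip install airllm

    model = AutoModel.from_pretrained(model_id)

    def generate(prompt: str, max_new_tokens: int) -> Sequence[int]:
        tokens = model.tokenizer([prompt], return_tensors="pt",
                                 return_attention_mask=False,
                                 truncation=True, max_length=128,
                                 padding=False)
        input_ids = tokens["input_ids"].cuda()
        out = model.generate(input_ids, max_new_tokens=max_new_tokens,
                             use_cache=True, return_dict_in_generate=True)
        # Assumes the returned sequence starts with the prompt tokens;
        # slice them off so only newly generated tokens are counted.
        return out.sequences[0][input_ids.shape[1]:].tolist()

    return generate
```

A run would look like tokens_per_second(make_airllm_generate("<your 70B model id>"), "Summarize: ...", 20). Keep max_new_tokens small for a first attempt, and note that the first call also includes any one-time layer-sharding work.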
TL;DR
AirLLM is an open-source Python library that runs 70B-parameter models on low-VRAM hardware (down to 4GB) by streaming the model from disk into the GPU one layer at a time, and by using flash attention to limit memory growth on long inputs. It makes large models technically runnable and fully private, but the video skips the catastrophic speed cost of constant disk loading.
Key Points
- You can run a 70B-parameter model without a supercomputer using the open-source library AirLLM 00:01
- The core trick is loading the model layer by layer from the hard drive instead of all at once 00:08
- The author compares it to reading a book page by page rather than memorizing the whole thing first 00:14
- A flash attention feature keeps memory usage nearly flat even with long inputs 00:17
- Models like Llama 3.3 70B are claimed to run directly on a MacBook or gaming computer 00:23
- Inference is fully private with zero API costs since it runs locally 00:27
- It is pitched at students, researchers, and indie developers as worth trying 00:30
- The video is a lead magnet: the link is gated behind commenting “Air” 00:34
Notable Quotes
“Instead of loading the entire model at once, it loads it layer by layer from your hard drive.” 00:08
“There’s also a feature called flash attention, which keeps memory usage nearly flat even with long inputs.” 00:17
“Models like Llama 3.3 70B can run directly on your MacBook or gaming computer. Fully private with zero API costs.” 00:23
Verified Claims
AirLLM runs 70B models on a single 4GB GPU without quantization. 00:01
- GitHub: lyogavin/airllm — states it allows “70B large language models to run inference on a single 4GB GPU card without quantization, distillation and pruning”
- HuggingFace blog: AirLLM
- Verdict: Confirmed (as a memory claim).
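A quick back-of-the-envelope check shows why the 4GB figure is plausible as a memory claim. This sketch assumes fp16 weights (2 bytes per parameter) and the 80 decoder layers used by Llama 70B models, and it ignores embeddings, the output head, activations, and the KV cache, so the per-layer number is only an approximation.

```python
def weights_footprint_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Size of the raw weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


N_PARAMS = 70e9   # 70B parameters
N_LAYERS = 80     # assumption: decoder layer count of Llama 70B models

total_gb = weights_footprint_gb(N_PARAMS)   # whole model, fp16
per_layer_gb = total_gb / N_LAYERS          # rough size of one streamed layer

print(f"whole model: {total_gb:.0f} GB, one layer: {per_layer_gb:.2f} GB")
```

Under these assumptions a single layer is well under 4GB, while the full set of weights is far larger than any consumer GPU's VRAM.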
The technique works by loading the model layer by layer from disk and freeing memory after each. 00:08
- Benjamin Marie: AirLLM — Layered Inference for Low-Memory Hardware
- GitHub: lyogavin/airllm
- Verdict: Confirmed.
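The layer-by-layer idea can be illustrated with a toy model. This is a minimal sketch of the general technique using only the Python standard library, not AirLLM's actual implementation: the "layers" are tiny random matrices, and the file layout and function names are invented for illustration.

```python
import pickle
import random
import tempfile
from pathlib import Path


def apply_layer(weights, x):
    """Toy 'layer': matrix-vector product followed by ReLU."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]


def shard_to_disk(layers, directory: Path):
    """One-time preprocessing: write each layer to its own file."""
    paths = []
    for i, weights in enumerate(layers):
        path = directory / f"layer_{i:03d}.pkl"
        path.write_bytes(pickle.dumps(weights))
        paths.append(path)
    return paths


def streamed_forward(paths, x):
    """Forward pass that holds only one layer's weights at a time."""
    for path in paths:
        weights = pickle.loads(path.read_bytes())  # load this layer from disk
        x = apply_layer(weights, x)
        del weights  # drop the reference before loading the next layer
    return x


random.seed(0)
dim, n_layers = 4, 6
layers = [[[random.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
          for _ in range(n_layers)]
x0 = [1.0, 0.5, -0.5, 2.0]

# Reference result with every layer resident in memory at once.
expected = x0
for w in layers:
    expected = apply_layer(w, expected)

with tempfile.TemporaryDirectory() as tmp:
    paths = shard_to_disk(layers, Path(tmp))
    streamed = streamed_forward(paths, x0)

print(streamed == expected)
```

The streamed pass produces the same output as the in-memory pass; the trade is peak memory (one layer instead of all of them) for a disk read per layer on every forward pass.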
Flash attention keeps memory usage nearly flat even with long inputs. 00:17
- FlashAttention paper (Dao et al., arXiv:2205.14135) — achieves memory linear (rather than quadratic) in sequence length via block-wise computation
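- Verdict: Confirmed, with a caveat: memory grows linearly with sequence length rather than staying constant, so “nearly flat” holds only relative to the quadratic growth of standard attention.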
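The block-wise computation behind that result can be sketched for a single query in plain Python. This is an illustration of the online-softmax idea only, not the FlashAttention GPU kernel, and it omits the usual 1/sqrt(d) scaling for brevity. Because scores are consumed one block at a time, the full N×N score matrix never has to be stored, which is where the linear (rather than quadratic) memory comes from.

```python
import math


def naive_attention(q, keys, values):
    """Single-query attention that materializes every score at once."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def blockwise_attention(q, keys, values, block_size=2):
    """Same result, visiting keys and values one block at a time.

    Only a running max, a running normalizer, and a running weighted sum
    are kept between blocks.
    """
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale everything accumulated under the previous max.
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vj for a, vj in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [0.1, 0.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 2.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = naive_attention(q, keys, values)
tiled = blockwise_attention(q, keys, values, block_size=2)
```

Both functions return the same output up to floating-point rounding, so the block-wise version is exact attention rather than an approximation.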
Llama 3.3 70B can run “directly on your MacBook or gaming computer.” 00:23
- GitHub: lyogavin/airllm — lists Llama 2/3/3.1 support; Llama 3.3 not explicitly named, though it shares the Llama architecture
- Nerd Level Tech: AirLLM — Hype vs Reality
- Verdict: Inconclusive — technically runnable, but “directly on a MacBook” glosses over a large download and impractical speeds.
Inference is fully private with zero API costs. 00:27
- GitHub: lyogavin/airllm — local, self-hosted inference
- Verdict: Confirmed.
Implied claim: this is a practical way to use a 70B model (speed is never mentioned). 00:30
- Ai505: Run 70B Models on Your 4GB GPU (But Pack a Lunch)
- Nerd Level Tech: Hype vs Reality — reports ~1–3 tokens/sec on an RTX 3060 and far slower in some cases; disk I/O per layer is the bottleneck
- Verdict: Disputed — the omitted speed cost makes it impractical for interactive use.
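Why disk I/O dominates can be seen with simple arithmetic. The sketch below is a rough estimate rather than a benchmark: it assumes fp16 weights (about 140GB for 70B parameters), that every layer is re-read from disk for each generated token, and illustrative sequential-read speeds that are not measurements. Real runs may do better when the OS page cache keeps part of the model in RAM or when compressed weights reduce the bytes read.

```python
def seconds_per_token(model_gb: float, disk_gb_per_s: float) -> float:
    """Estimated per-token latency if the whole model is read from disk once."""
    return model_gb / disk_gb_per_s


MODEL_GB = 140.0  # 70B parameters at 2 bytes each

# Assumed sequential-read throughputs in GB/s (illustrative, not measured).
ASSUMED_DISKS = {
    "NVMe SSD": 3.5,
    "SATA SSD": 0.5,
}

estimates = {name: seconds_per_token(MODEL_GB, bw)
             for name, bw in ASSUMED_DISKS.items()}

for name, secs in estimates.items():
    print(f"{name}: ~{secs:.0f} s per token")
```

Under these assumptions even a fast NVMe drive spends tens of seconds per token on reads alone, which is consistent with the verdict that the approach suits batch jobs rather than interactive use.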
Tools, Papers & Standards Mentioned
- AirLLM — github.com/lyogavin/airllm (official repo; PyPI: pip install airllm)
- Flash Attention — FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (Dao et al., 2022)
- Llama 3.3 70B — meta-llama/Llama-3.3-70B-Instruct on Hugging Face
- Hugging Face Accelerate (meta device) — referenced as an underlying optimization in AirLLM coverage
Follow-up Questions
- What are real-world tokens/sec for AirLLM on an NVMe SSD vs. a SATA SSD vs. spinning disk, and at what input length does flash attention’s benefit actually show up?
- For a fixed 4GB-VRAM machine, how does AirLLM compare against a 4-bit GGUF quantized 70B in llama.cpp on both quality and speed — is layer-streaming ever the better choice?
- Does the author’s “3x speedup via block-wise compression” claim hold up, and does its “almost ignorable accuracy loss” survive independent benchmarking (e.g., MMLU)?
Sources
- github.com/lyogavin/airllm
- HuggingFace blog: AirLLM
- Benjamin Marie — AirLLM: Layered Inference for Low-Memory Hardware
- FlashAttention paper (arXiv:2205.14135)
- meta-llama/Llama-3.3-70B-Instruct (Hugging Face)
- Ai505 — Run 70B Models on Your 4GB GPU (But Pack a Lunch)
- Nerd Level Tech — AirLLM: Hype vs Reality