DeepSeek V4 AI Beats Billion Dollar Systems…For Free

Two Minute Papers · 10m 04s · Watch on YouTube · 15 sources

Decision Card

Effort: Weekend project — pull deepseek-ai/DeepSeek-V4-Flash from Hugging Face and run it locally on a single high-VRAM GPU, or skip the hardware entirely and hit the hosted API (~$0.14/$0.28 per M input/output tokens) to feed it a 1,500-page document and test long-context recall in an afternoon.

Honest take: The video presents “Engram” as part of V4 (08:31), but DeepSeek’s own repo describes Engram as a separate research line that is not in V4’s shipped architecture — so the most magical-sounding claim in the back half is misattributed. The host is also honest that the headline 1M window degrades hard near its limit (04:47), which matters more for real agent workloads than the benchmark wins.

Concrete next steps:

  • Read the official tech report DeepSeek_V4.pdf on the Hugging Face Pro repo to see CSA/HCA/DSA defined precisely (~1 hr).
  • Run a needle-in-haystack test at 200K vs 1M tokens against the hosted API to measure your recall cliff before trusting the window (~2 hrs).
  • Skip if your use case needs images, audio, or video — V4 is text-only/unimodal (06:49) and won’t help.

TL;DR

DeepSeek-V4 is an open-weights MoE model (1.6T-param Pro, 284B Flash) shipping a 1M-token context window with three stacked KV-cache compression tricks that cut long-context memory ~90% versus V3.2. It roughly matches recent frontier models at a fraction of the price, but it’s unimodal, degrades near its context limit, and parts of its training aren’t fully understood even by its authors.

Key Points

  • DeepSeek-V4 is open-weights with a 1-million-token context window — enough to ingest ~1,500 pages of documentation [00:22](https://www.youtube.com/watch?v=p7K3xfViWCE&t=22s)
  • The Pro model roughly matches billion-dollar frontier models from a few months prior [00:53](https://www.youtube.com/watch?v=p7K3xfViWCE&t=53s)
  • A much smaller “Flash” model is somewhat competitive with Pro [01:18](https://www.youtube.com/watch?v=p7K3xfViWCE&t=78s)
  • Pro needs ~3× less compute than its predecessor; Flash ~10× less [01:33](https://www.youtube.com/watch?v=p7K3xfViWCE&t=93s)
  • Three compression layers — token-level KV compression, 128-to-1 “Heavily Compressed Attention,” and “Compressed Sparse Attention” (an index) [01:57](https://www.youtube.com/watch?v=p7K3xfViWCE&t=117s)
  • Combined, they cut KV-cache memory by ~90% [03:42](https://www.youtube.com/watch?v=p7K3xfViWCE&t=222s)
  • It’s KV-cache compression, not model compression — you still must load the full model [04:05](https://www.youtube.com/watch?v=p7K3xfViWCE&t=245s)
  • Pro recalls hidden facts better than Google’s Gemini 3.1 Pro in their test, but degrades near the context limit [04:33](https://www.youtube.com/watch?v=p7K3xfViWCE&t=273s)
  • Pricing is dramatically cheaper than Anthropic’s Claude — up to ~30× with a discount, 8–20× without [06:14](https://www.youtube.com/watch?v=p7K3xfViWCE&t=374s)
  • Major limits: it’s unimodal (text only, no image/audio) [06:49](https://www.youtube.com/watch?v=p7K3xfViWCE&t=409s) and two training-stabilization techniques aren’t fully understood by its creators [06:57](https://www.youtube.com/watch?v=p7K3xfViWCE&t=417s)

Notable Quotes

“A 1 million token context window? In open weights AI? If you ask it to inhale about 1,500 pages of dense documentation it will do it.” [00:22](https://www.youtube.com/watch?v=p7K3xfViWCE&t=22s)

“Soon, intelligence will get too cheap to meter.” [06:05](https://www.youtube.com/watch?v=p7K3xfViWCE&t=365s)

“This system is unimodal. Not multimodal. No images or audio. It is blind and deaf, if you will.” [06:49](https://www.youtube.com/watch?v=p7K3xfViWCE&t=409s)

Verified Claims

DeepSeek-V4 is open-weights with a 1M-token context window. [00:22](https://www.youtube.com/watch?v=p7K3xfViWCE&t=22s)

Three stacked compression techniques (token-level, 128:1 HCA, Compressed Sparse Attention) cut KV-cache memory ~90%. [03:42](https://www.youtube.com/watch?v=p7K3xfViWCE&t=222s)

Pro needs roughly 3× less compute than the prior model. [01:33](https://www.youtube.com/watch?v=p7K3xfViWCE&t=93s)

  • HuggingFace blog: DeepSeek-V4
  • Verdict: Confirmed — at 1M tokens V4-Pro uses ~27% of V3.2’s inference FLOPs (≈3.7×) and ~10% of its KV-cache memory.

Pro recalls hidden facts better than Gemini 3.1 Pro. [04:33](https://www.youtube.com/watch?v=p7K3xfViWCE&t=273s)

Pricing is far cheaper than Anthropic’s Claude (8–20× without discount, ~30× with). [06:14](https://www.youtube.com/watch?v=p7K3xfViWCE&t=374s)

V4 is unimodal — text only, no image or audio. [06:49](https://www.youtube.com/watch?v=p7K3xfViWCE&t=409s)

V4 uses a technique called “Engram.” [08:31](https://www.youtube.com/watch?v=p7K3xfViWCE&t=511s)

Tools, Papers & Standards Mentioned

Follow-up Questions

  1. How steep is V4-Pro’s recall cliff between 200K and 1M tokens on a self-defined multi-fact task, versus the single-needle benchmarks vendors report?
  2. What is the real local-hosting cost (GPU VRAM, throughput) of V4-Flash, given that the 90% saving is on KV cache only and the full model must still be loaded?
  3. If Engram isn’t in shipped V4, what would a future DeepSeek model gain by adding conditional memory as a second sparsity axis on top of MoE — and at what RAM/CXL cost?

Sources