DeepSeek V4 AI Beats Billion Dollar Systems…For Free

Two Minute Papers · 10m 04s · 12 sources

Decision Card

Effort: Afternoon experiment — open DeepSeek’s chat or API (OpenAI-compatible endpoints), paste a 500+ page document, and test long-context recall and a coding task; ~2-3 hours including reading the Hugging Face release notes.

Honest take: Two of the video’s claims are shakier than presented. It states DeepSeek “also use a technique called Engram” (08:31), but external reporting indicates the Engram conditional-memory module, described in a separate January 2026 DeepSeek/Peking University paper, did not actually ship in V4. The “recalls better than Gemini 3.1 Pro” claim holds only on some benchmarks (MRCR and academic long-context), while Gemini 3.1 Pro wins standard single-needle NIAH at 1M tokens by a wide margin (~99% vs 78%).

Concrete next steps:

Read the official DeepSeek V4 preview release notes for model specs, API compatibility, and pricing (~20 min).
Skim the Hugging Face DeepSeek-V4 deep-dive for the CSA/HCA attention architecture details the video simplifies (~30 min).
If self-hosting interests you, check the DeepSeek-V4-Pro weights on Hugging Face — note it’s a 1.6T-parameter MoE, so API access is the realistic path for most (~15 min to assess).
Skip if you need multimodal input (images, audio, video) — V4 is text-only — or if your workloads never exceed ~100K tokens of context, where existing models already perform well and V4’s headline advantage matters less.
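
The long-context recall test suggested under Effort can be prepared offline before touching the chat or API. What follows is a minimal sketch, not the paper’s evaluation method: it hides a few made-up facts at evenly spaced depths in filler text and scores a reply by how many expected answers it repeats. The function names (build_haystack, score_recall), the filler sentence, and the facts are all hypothetical, and sending the prompt to a model is deliberately left out.

```python
def build_haystack(facts, filler, total_sentences):
    """Return filler text with each fact inserted at an evenly spaced depth."""
    sentences = [filler] * total_sentences
    if not facts:
        return " ".join(sentences)
    step = total_sentences // (len(facts) + 1)
    # Insert the deepest fact first so the shallower insert positions stay valid.
    for i in reversed(range(len(facts))):
        sentences.insert(step * (i + 1), facts[i])
    return " ".join(sentences)


def score_recall(reply, answers):
    """Fraction of expected answer strings found in the reply (case-insensitive)."""
    if not answers:
        return 0.0
    reply_lower = reply.lower()
    hits = sum(1 for answer in answers if answer.lower() in reply_lower)
    return hits / len(answers)


# Hypothetical needles and the short strings a correct reply should contain.
facts = [
    "The vault code is 4471.",
    "The courier's name is Imre.",
]
answers = ["4471", "Imre"]

# Filler length is arbitrary here; scale it up to approach the context limit.
prompt = build_haystack(facts, "The sky was grey that day.", total_sentences=1000)
```

Substring matching is a crude scorer and will miss paraphrased answers, so treat the score as a rough signal. Varying total_sentences lets you check whether recall drops as the prompt approaches the context limit, which is the degradation the video mentions.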

TL;DR

DeepSeek V4 ships as MIT-licensed open weights with a 1-million-token context window, using three stacked compression techniques (token-level compression, 128:1 Heavily Compressed Attention, and Compressed Sparse Attention) to cut KV-cache memory by ~90% versus its predecessor. The Pro model roughly matches recent frontier closed models at a fraction of the price, though it’s text-only, degrades near the context limit, and some of the video’s claims (Engram usage, beating Gemini 3.1 Pro on recall) are shakier than presented.

Key Points

DeepSeek V4 is described in a 58-page research paper and offers a 1M-token context window in an open-weights model — enough to ingest ~1,500 pages of documentation 00:22
The Pro model roughly matches billion-dollar frontier models from a few months prior, and a much smaller Flash model is somewhat competitive with Pro 00:53
For long outputs, the new Pro needs ~3x less compute than the previous generation and Flash needs ~10x less 01:33
Three compression layers work together: token-level compression (paragraph-to-sentence summaries), Heavily Compressed Attention at 128-to-1 (table of contents), and Compressed Sparse Attention (index lookup) 03:29
Together these cut KV-cache memory needs by about 90% 03:36
Important caveat: this is KV-cache compression only — you still must load the full model, so no running Pro “on a toaster” 04:05
In a test that hides 8 facts in a long context, the paper reports that Pro recalls them better than Gemini 3.1 Pro, though recall degrades as the input approaches the context limit 04:33
API pricing can be up to 30x cheaper than Anthropic’s Claude with discounts, 8-20x cheaper without 06:14
Key limitations: unimodal (no images or audio), training-stabilization techniques the creators themselves don’t fully understand, and breakdown near the context-window limit 06:49
Coding performance is strong for everyday tasks, but the model still failed to implement the presenter’s more advanced ray-tracing algorithms 05:35
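
To put the 128-to-1 ratio and the ~90% figure from the points above into numbers, here is a small arithmetic sketch. It assumes the ratio applies uniformly along the sequence and that a partial group still costs one entry; both are simplifications for illustration. The helper names and the 100 GB baseline are hypothetical. The code does not derive the ~90% figure, which depends on architecture details this summary does not cover; it only restates it.

```python
import math


def compressed_entries(num_tokens, ratio):
    """KV entries left when every `ratio` tokens are merged into one entry."""
    return math.ceil(num_tokens / ratio)


def cache_after_reduction(baseline_gb, reduction_percent):
    """Cache size remaining after cutting `reduction_percent` of the baseline."""
    return baseline_gb * (100 - reduction_percent) / 100


context_tokens = 1_000_000

# 128:1 is the Heavily Compressed Attention ratio quoted in the video.
hca_entries = compressed_entries(context_tokens, 128)

# Hypothetical 100 GB baseline, chosen only to make the percentage concrete.
remaining_gb = cache_after_reduction(100, 90)

print(hca_entries)   # 7813
print(remaining_gb)  # 10.0
```

As the caveat in the list notes, this shrinks only the cache that grows with context length; the model weights still have to be loaded in full.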

Notable Quotes

“A 1 million token context window? In open weights AI? If you ask it to inhale about 1,500 pages of dense documentation it will do it.” 00:22

“These three reduce memory needs for the KV-cache by about 90%. I had to look twice.” 03:36

“This system is unimodal. Not multimodal. No images or audio. It is blind and deaf, if you will.” 06:49

Verified Claims

1. DeepSeek V4 offers a 1M-token context window as open weights. 00:22 — Sources: DeepSeek V4 Preview Release (official API docs), DeepSeek-V4-Pro on Hugging Face. Official announcement confirms 1M context standard across DeepSeek services; V4-Pro (1.6T/49B active) and V4-Flash (284B/13B active) shipped April 24, 2026 as MIT-licensed open weights. Verdict: Confirmed

2. Heavily Compressed Attention uses 128-to-1 compression. 02:39 — Source: Hugging Face DeepSeek-V4 analysis, which states HCA “compresses KV entries by 128x” and CSA by 4x along the sequence dimension. Verdict: Confirmed

3. KV-cache memory needs drop by about 90%. 03:36 — Sources: Hugging Face DeepSeek-V4 analysis (Pro uses “10% of the KV cache memory” vs V3.2 at 1M tokens), vLLM blog on DeepSeek V4 support. Verdict: Confirmed

4. Pro needs ~3x less compute, Flash ~10x less, than the previous generation. 01:33 — Source: Hugging Face DeepSeek-V4 analysis: Pro requires 27% of the single-token inference FLOPs of DeepSeek-V3.2 (~3.7x reduction); Flash uses 10% (10x reduction). Both are consistent with the video’s rounded figures. Verdict: Confirmed

5. V4-Pro recalls hidden facts in long context better than Gemini 3.1 Pro. 04:33 — Sources: Long-Context Retrieval 2026: Needle-in-Haystack Test, BenchLM comparison. Mixed evidence: DeepSeek V4-Pro surpasses Gemini 3.1 Pro on MRCR and academic long-context benchmarks (83.5% at 1M), but on standard single-needle NIAH at 1M tokens Gemini 3.1 Pro scores ~99% vs DeepSeek’s 78%, and DeepSeek degrades far more between 200K and 1M tokens (-18 points vs -0.5). Verdict: Disputed (benchmark-dependent; the video’s blanket framing overstates it)

6. Pricing is 8-30x cheaper than Anthropic’s Claude. 06:14 — Sources: Anthropic vs DeepSeek pricing comparison, MindStudio pricing analysis. V4-Pro output is $3.48/M vs $25/M for Claude Opus (~7x); V4-Flash ($0.14/$0.28 per M) runs 35-100x cheaper than frontier closed models, and cached input hits $0.0028/M. The 8-30x range is plausible depending on model pairing and caching. Verdict: Confirmed (order of magnitude, varies by model pair)

7. The system is unimodal — no image or audio input. 06:49 — Sources: DeepSeek V4 Preview Release (official), Hugging Face DeepSeek-V4 analysis. Neither official source lists any multimodal capability, consistent with text-only, but neither states it explicitly. Verdict: Inconclusive (consistent with, but not explicitly confirmed by, primary sources checked)

8. DeepSeek V4 uses the Engram memory technique. 08:31 — Sources: Kili Technology: DeepSeek V4 Guide — Engram Memory & Release Status, Engram explained (Medium). Engram is a real DeepSeek/Peking University conditional-memory paper (January 2026, open-sourced), but reporting indicates it is absent from V4 itself: mHC (manifold-constrained hyper-connections) and sparse attention made it in, but Engram did not. Verdict: Disputed
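
The ratios quoted in claims 4 and 6 can be re-derived from the figures cited there. This is only arithmetic on numbers already in this section, not an independent measurement. The helper name times_less is hypothetical, and reading $0.28/M as V4-Flash’s output price (the second number of the quoted pair) is an assumption.

```python
def times_less(fraction_of_baseline):
    """Convert 'uses this fraction of the baseline' into an 'N times less' factor."""
    return 1.0 / fraction_of_baseline


# Claim 4: share of V3.2's single-token inference FLOPs.
pro_flops_factor = times_less(0.27)    # about 3.7
flash_flops_factor = times_less(0.10)  # 10

# Claim 6: output price per million tokens, in USD.
claude_opus_output = 25.00
v4_pro_output = 3.48
v4_flash_output = 0.28  # assumed to be the output side of "$0.14/$0.28"

opus_vs_pro = claude_opus_output / v4_pro_output      # about 7.2
opus_vs_flash = claude_opus_output / v4_flash_output  # about 89.3

print(round(pro_flops_factor, 1), round(flash_flops_factor, 1))
print(round(opus_vs_pro, 1), round(opus_vs_flash, 1))
```

The Opus-to-Pro ratio lands just under the bottom of the video’s 8-30x range, which is why the verdict on claim 6 is qualified as order-of-magnitude rather than exact.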

Tools, Papers & Standards Mentioned

DeepSeek V4 (Pro / Flash) — Official release announcement, DeepSeek-V4-Pro weights on Hugging Face
Compressed Sparse Attention (CSA) & Heavily Compressed Attention (HCA) — Hugging Face technical deep-dive
Engram (conditional memory module) — DeepSeek Engram architecture explainer, Kili Technology guide
Google Gemini 3.1 Pro — Artificial Analysis model comparison
Anthropic Claude — Anthropic vs DeepSeek API pricing
vLLM (serving support for V4’s attention) — vLLM blog: DeepSeek V4 efficient long-context attention

Follow-up Questions

Which benchmark family better predicts real-world long-context use — standard single-needle NIAH (where Gemini 3.1 Pro leads) or MRCR-style multi-round retrieval (where V4-Pro leads) — and how do both correlate with agentic document-analysis tasks?
What exactly are the two training-stabilization techniques the paper reports without full explanation, and has follow-up research clarified why they work?
Since Engram (with its reported NIAH gains from 84.2% to 97% on a 27B model) did not ship in V4, is it slated for a future DeepSeek release, and what does its “Sparsity Allocation Law” (20-25% of sparse parameters to memory) imply for other open-weight architectures?