DeepSeek V4 AI Beats Billion Dollar Systems…For Free
Decision Card
Effort: Weekend project — pull deepseek-ai/DeepSeek-V4-Flash from Hugging Face and run it locally on a single high-VRAM GPU, or skip the hardware entirely and hit the hosted API (~$0.14/$0.28 per M input/output tokens) to feed it a 1,500-page document and test long-context recall in an afternoon.
Honest take: The video presents “Engram” as part of V4 (08:31), but DeepSeek’s own repo describes Engram as a separate research line that is not in V4’s shipped architecture — so the most magical-sounding claim in the back half is misattributed. The host is also honest that the headline 1M window degrades hard near its limit (04:47), which matters more for real agent workloads than the benchmark wins.
Concrete next steps:
- Read the official tech report DeepSeek_V4.pdf on the Hugging Face Pro repo to see CSA/HCA/DSA defined precisely (~1 hr).
- Run a needle-in-haystack test at 200K vs 1M tokens against the hosted API to measure your recall cliff before trusting the window (~2 hrs).
- Skip if your use case needs images, audio, or video — V4 is text-only/unimodal (06:49) and won’t help.
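The needle-in-haystack step above can be sketched as a small harness. This is a minimal sketch, not the benchmark that vendors report: `build_haystack`, `recall_rate`, the filler sentence, and the 4-characters-per-token estimate are assumptions for illustration, and `ask` is a hypothetical callable you would wire to whichever API client you use.

```python
from typing import Callable, Sequence

# Repetitive filler text; a real test would use varied, realistic documents.
FILLER = "The quarterly report was filed on time and contained no surprises. "


def build_haystack(needle: str, approx_tokens: int, depth: float,
                   chars_per_token: int = 4) -> str:
    """Return filler of roughly approx_tokens tokens with needle placed at depth.

    depth is a fraction from 0.0 (start) to 1.0 (end). The token count is
    only estimated from characters (assumed ~4 chars per token).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    n_sentences = max(1, (approx_tokens * chars_per_token) // len(FILLER))
    sentences = [FILLER] * n_sentences
    sentences.insert(int(depth * n_sentences), needle + " ")
    return "".join(sentences)


def recall_rate(ask: Callable[[str], str], secret: str,
                approx_tokens: int, depths: Sequence[float]) -> float:
    """Fraction of depths at which the model's answer contains the secret."""
    needle = f"The secret passphrase is {secret}."
    hits = 0
    for depth in depths:
        prompt = (build_haystack(needle, approx_tokens, depth)
                  + "\n\nWhat is the secret passphrase?")
        if secret in ask(prompt):
            hits += 1
    return hits / len(depths)


def stand_in_model(prompt: str) -> str:
    """Fake model that only 'sees' the first 400,000 characters of the prompt."""
    visible = prompt[:400_000]
    marker = "The secret passphrase is "
    i = visible.find(marker)
    return visible[i + len(marker):].split(".")[0] if i >= 0 else "unknown"


if __name__ == "__main__":
    # Replace stand_in_model with a function that calls the hosted API.
    for size in (50_000, 200_000):
        rate = recall_rate(stand_in_model, "swordfish-42", size,
                           depths=[0.1, 0.3, 0.7, 0.9])
        print(f"{size:>7} tokens (approx): recall {rate:.2f}")
```

Running the same depths at 200K and at 1M tokens gives two recall rates to compare. The gap between them is your recall cliff for this particular task.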
TL;DR
DeepSeek-V4 is an open-weights MoE model (1.6T-param Pro, 284B Flash) shipping a 1M-token context window with three stacked KV-cache compression tricks that cut long-context memory ~90% versus V3.2. It roughly matches recent frontier models at a fraction of the price, but it’s unimodal, degrades near its context limit, and parts of its training aren’t fully understood even by its authors.
Key Points
- DeepSeek-V4 is open-weights with a 1-million-token context window, enough to ingest ~1,500 pages of documentation [00:22](https://www.youtube.com/watch?v=p7K3xfViWCE&t=22s)
- The Pro model roughly matches billion-dollar frontier models from a few months prior [00:53](https://www.youtube.com/watch?v=p7K3xfViWCE&t=53s)
- A much smaller “Flash” model is somewhat competitive with Pro [01:18](https://www.youtube.com/watch?v=p7K3xfViWCE&t=78s)
- Pro needs ~3× less compute than its predecessor; Flash ~10× less [01:33](https://www.youtube.com/watch?v=p7K3xfViWCE&t=93s)
- Three compression layers: token-level KV compression, 128-to-1 “Heavily Compressed Attention,” and “Compressed Sparse Attention” (an index) [01:57](https://www.youtube.com/watch?v=p7K3xfViWCE&t=117s)
- Combined, they cut KV-cache memory by ~90% [03:42](https://www.youtube.com/watch?v=p7K3xfViWCE&t=222s)
- It’s KV-cache compression, not model compression; you still must load the full model [04:05](https://www.youtube.com/watch?v=p7K3xfViWCE&t=245s)
- Pro recalls hidden facts better than Google’s Gemini 3.1 Pro in their test, but degrades near the context limit [04:33](https://www.youtube.com/watch?v=p7K3xfViWCE&t=273s)
- Pricing is dramatically cheaper than Anthropic’s Claude: up to ~30× with a discount, 8–20× without [06:14](https://www.youtube.com/watch?v=p7K3xfViWCE&t=374s)
- Major limits: it’s unimodal (text only, no image/audio) [06:49](https://www.youtube.com/watch?v=p7K3xfViWCE&t=409s), and two training-stabilization techniques aren’t fully understood by its creators [06:57](https://www.youtube.com/watch?v=p7K3xfViWCE&t=417s)
Notable Quotes
“A 1 million token context window? In open weights AI? If you ask it to inhale about 1,500 pages of dense documentation it will do it.”
[00:22](https://www.youtube.com/watch?v=p7K3xfViWCE&t=22s)
“Soon, intelligence will get too cheap to meter.”
[06:05](https://www.youtube.com/watch?v=p7K3xfViWCE&t=365s)
“This system is unimodal. Not multimodal. No images or audio. It is blind and deaf, if you will.”
[06:49](https://www.youtube.com/watch?v=p7K3xfViWCE&t=409s)
Verified Claims
DeepSeek-V4 is open-weights with a 1M-token context window.
[00:22](https://www.youtube.com/watch?v=p7K3xfViWCE&t=22s)
- deepseek-ai/DeepSeek-V4-Pro (Hugging Face), Hugging Face blog: DeepSeek-V4
- Verdict: Confirmed — Pro (1.6T MoE) and Flash (284B) both ship with a 1M-token window by default.
Three stacked compression techniques (token-level, 128:1 HCA, Compressed Sparse Attention) cut KV-cache memory ~90%.
[03:42](https://www.youtube.com/watch?v=p7K3xfViWCE&t=222s)
- MarkTechPost: DeepSeek-V4 CSA/HCA, vLLM blog: DeepSeek V4 attention
- Verdict: Confirmed — sources report DSA/CSA cut KV-cache memory ~90% and per-token FLOPs ~73% vs DeepSeek-V3.2.
Pro needs roughly 3× less compute than the prior model.
[01:33](https://www.youtube.com/watch?v=p7K3xfViWCE&t=93s)
- Hugging Face blog: DeepSeek-V4
- Verdict: Confirmed — at 1M tokens V4-Pro uses ~27% of V3.2’s inference FLOPs (≈3.7×) and ~10% of its KV-cache memory.
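The conversion in this verdict, from "fraction of the old cost remaining" to an "N× less" multiplier, is plain arithmetic, and a short sketch makes it easy to check. The function names are illustrative. The inputs are the ~27% and ~10% figures quoted above.

```python
def reduction_factor(remaining_fraction: float) -> float:
    """Turn 'uses X of the old cost' into an 'N times less' multiplier."""
    if not 0.0 < remaining_fraction <= 1.0:
        raise ValueError("remaining_fraction must be in (0, 1]")
    return 1.0 / remaining_fraction


def percent_saved(remaining_fraction: float) -> float:
    """Turn 'uses X of the old cost' into a percentage saved."""
    return (1.0 - remaining_fraction) * 100.0


if __name__ == "__main__":
    for label, remaining in (("inference FLOPs", 0.27), ("KV-cache memory", 0.10)):
        print(f"{label}: {reduction_factor(remaining):.1f}x less, "
              f"{percent_saved(remaining):.0f}% saved")
```

Read this way, "~27% of the FLOPs", "~73% fewer FLOPs", and "≈3.7× less compute" are the same measurement, and the video's "~3×" is a rounded-down version of it.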
Pro recalls hidden facts better than Gemini 3.1 Pro.
[04:33](https://www.youtube.com/watch?v=p7K3xfViWCE&t=273s)
- BenchLM: V4-Pro vs Gemini 3.1 Pro, Digital Applied: Needle-in-Haystack 2026
- Verdict: Disputed — V4-Pro leads on MRCR-1M, but on NIAH single-needle at 1M it trails Gemini 3 (78% vs 99%). Benchmark-dependent.
Pricing is far cheaper than Anthropic’s Claude (8–20× without discount, ~30× with).
[06:14](https://www.youtube.com/watch?v=p7K3xfViWCE&t=374s)
- VentureBeat: V4 at ~1/6 the cost of Opus 4.7, Anthropic vs DeepSeek pricing
- Verdict: Inconclusive — directionally true, but the multiplier depends on tier: Pro is ~6–7× cheaper than Opus 4.7, while Flash reaches 35–100× cheaper. The video’s “8–20×” overstates Pro and understates Flash.
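One further reason a single "N× cheaper" figure is slippery: when input and output tokens are priced at different ratios, the multiplier also shifts with the request mix. The sketch below illustrates that under stated assumptions. `HOSTED` uses the ~$0.14/$0.28 per million tokens quoted in the Decision Card. `PLACEHOLDER` is a made-up competitor price chosen only to show the effect, not any vendor's real rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


def request_cost(p: Pricing, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one request under pricing p."""
    return (input_tokens * p.input_per_m
            + output_tokens * p.output_per_m) / 1_000_000


def price_ratio(expensive: Pricing, cheap: Pricing,
                input_tokens: int, output_tokens: int) -> float:
    """How many times more the 'expensive' pricing costs for the same request."""
    cheap_cost = request_cost(cheap, input_tokens, output_tokens)
    if cheap_cost == 0:
        raise ValueError("request has zero cost under the cheaper pricing")
    return request_cost(expensive, input_tokens, output_tokens) / cheap_cost


HOSTED = Pricing(input_per_m=0.14, output_per_m=0.28)       # from the Decision Card
PLACEHOLDER = Pricing(input_per_m=1.40, output_per_m=8.40)  # hypothetical, illustration only

if __name__ == "__main__":
    # A long-document request (mostly input) vs. a generation-heavy one (mostly output).
    for name, tokens_in, tokens_out in (("long-context read", 1_000_000, 1_000),
                                        ("long generation", 1_000, 100_000)):
        ratio = price_ratio(PLACEHOLDER, HOSTED, tokens_in, tokens_out)
        print(f"{name}: hosted cost ${request_cost(HOSTED, tokens_in, tokens_out):.4f}, "
              f"placeholder is {ratio:.1f}x more")
```

With your own provider prices substituted for the placeholder, the same two calls show whether a quoted multiplier holds for your workload or only for a favorable mix.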
V4 is unimodal — text only, no image or audio.
[06:49](https://www.youtube.com/watch?v=p7K3xfViWCE&t=409s)
- DataCamp: DeepSeek V4 features
- Verdict: Confirmed — V4 is a text-only model.
V4 uses a technique called “Engram.”
[08:31](https://www.youtube.com/watch?v=p7K3xfViWCE&t=511s)
- deepseek-ai/Engram (GitHub), Kili: DeepSeek V4 & Engram
- Verdict: Disputed — Engram (conditional memory, Jan 2026) is a related DeepSeek research line, but sources state it is not part of V4’s shipped architecture.
Tools, Papers & Standards Mentioned
- DeepSeek-V4 technical report (“Towards Highly Efficient Million-Token Context Intelligence”) — PDF on Hugging Face
- DeepSeek-V4-Pro / V4-Flash model weights — Pro, Flash
- Compressed Sparse Attention (CSA) / Heavily Compressed Attention (HCA) — MarkTechPost overview
- DeepSeek Sparse Attention (DSA), originating in DeepSeek-V3.2-Exp — DSA from first principles, vLLM implementation notes
- Engram (“Conditional Memory via Scalable Lookup”) — GitHub repo + paper
- Gemini 3.1 Pro (comparison baseline) — LLM-Stats comparison
Follow-up Questions
- How steep is V4-Pro’s recall cliff between 200K and 1M tokens on a self-defined multi-fact task, versus the single-needle benchmarks vendors report?
- What is the real local-hosting cost (GPU VRAM, throughput) of V4-Flash, given that the 90% saving is on KV cache only and the full model must still be loaded?
- If Engram isn’t in shipped V4, what would a future DeepSeek model gain by adding conditional memory as a second sparsity axis on top of MoE — and at what RAM/CXL cost?
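The local-hosting question can be bounded from below with back-of-the-envelope arithmetic on weights alone. This is a minimal sketch, assuming every parameter is stored at one uniform bit-width and ignoring KV cache, activations, and runtime overhead. The bit-widths are assumptions for illustration, not the formats of the released checkpoints. The parameter counts are the 284B (Flash) and 1.6T (Pro) figures from the TL;DR.

```python
def weights_gb(params_billions: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 10**9 bytes)."""
    if params_billions <= 0 or bits_per_param <= 0:
        raise ValueError("parameter count and bit-width must be positive")
    # billions of params * bytes per param = billions of bytes = GB
    return params_billions * bits_per_param / 8


if __name__ == "__main__":
    for name, params in (("V4-Flash", 284), ("V4-Pro", 1600)):
        for bits in (16, 8, 4):
            print(f"{name} at {bits}-bit: {weights_gb(params, bits):,.0f} GB")
```

Because the ~90% saving applies to the KV cache only, these weight figures are a floor on memory, and the cache for a long context is added on top.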
Sources
- https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro
- https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf
- https://huggingface.co/deepseek-ai/DeepSeek-V4-Flash
- https://huggingface.co/blog/deepseekv4
- https://www.marktechpost.com/2026/04/24/deepseek-ai-releases-deepseek-v4-compressed-sparse-attention-and-heavily-compressed-attention-enable-one-million-token-contexts/
- https://vllm-project.github.io/2026/04/24/deepseek-v4.html
- https://www.tensoreconomics.com/p/deepseek-sparse-attention-from-first
- https://github.com/deepseek-ai/Engram
- https://kili-technology.com/blog/data-story-deepseek-v4
- https://benchlm.ai/compare/deepseek-v4-pro-base-vs-gemini-3-1-pro
- https://www.digitalapplied.com/blog/long-context-retrieval-needle-in-haystack-2026
- https://llm-stats.com/models/compare/gemini-3.1-pro-preview-vs-deepseek-v4-pro-max
- https://venturebeat.com/technology/deepseek-v4-arrives-with-near-state-of-the-art-intelligence-at-1-6th-the-cost-of-opus-4-7-gpt-5-5
- https://pricepertoken.com/compare/provider/anthropic-vs-deepseek
- https://www.datacamp.com/blog/deepseek-v4