DeepSeek-OCR Explained
Introduction
There’s been a lot of noise around the new DeepSeek OCR model, and it has caused a lot of confusion. The biggest headline circulating is that DeepSeek effectively compresses data to a tenth of its size, which seems almost impossible: information theory tells us there is a limit, known as entropy, to how much we can compress data without losing information.
So when DeepSeek OCR seemingly escaped that entropy limit by compressing information ten times smaller, it sparked a huge discourse around what this means and what downstream impact it has on AI. So let’s find out what’s really going on and understand the implications of the new technique DeepSeek is demonstrating.
Current Headlines and Claims
Quick shout out to Woven for sponsoring this video and more on them later.
Okay, some of the biggest headlines circulating right now around DeepSeek OCR are:
- DeepSeek compressed data 10 times smaller
- A picture is worth a thousand words, or just 10 tokens
- We can now easily get context windows of tens of millions of tokens
So hearing all this, it seems like this OCR model is a pretty big deal. I mean, compressing information 10 times smaller seems like a pretty incredible technological leap.
Information Theory Fundamentals
Let’s start by briefly reviewing some fundamental concepts in information theory. When you watch this video, you capture the information I’m speaking by listening to the sound waves coming through your speaker. Your brain has had decades and decades of practice piecing together different combinations of sound waves to derive the real meaning behind the sequences of sound waves that make up a language.
Now, the sound waves in and of themselves don’t actually mean anything. For example, if I say something to you in Korean, it probably carries no meaning to you because you aren’t used to those specific sequences of sound waves.
In a similar way, human language in written form is composed of sequences of letters drawn from a finite alphabet; in English, that alphabet contains 26 letters. And similar to sound waves, the letters in and of themselves don’t actually mean anything. But when we group them together into words and assign meaning to those words, the syntax and semantics provide a much richer representation than the individual letters.
Tokenization in AI
Now, in order to transfer this kind of understanding to AI, we mainly do it with what are called tokens, where we assign a number to a predetermined set of characters that make up words or word pieces. And you might be wondering at this point: this seems like a really roundabout way to teach computers to model language.
And the biggest reason we opted to do it this way is that we prioritized computation over compression. The goal wasn’t to model human language as compactly as possible, but rather to make it as easily processable as possible. So even though tokens may appear inefficient, they were effectively the best way to model human language because they provided the right balance between structure and scalability.
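To make this concrete, here is a minimal sketch of word-level tokenization. The vocabulary and IDs below are made up for illustration; real models typically use learned subword tokenizers with much larger vocabularies.

```python
# Minimal sketch of word-level tokenization.
# The vocabulary and IDs are invented for illustration;
# real LLM tokenizers use learned subword vocabularies, not whole words.

vocab = {"the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5, "<unk>": 0}

def tokenize(sentence: str) -> list[int]:
    """Map each whitespace-separated word to an integer ID."""
    return [vocab.get(word, vocab["<unk>"]) for word in sentence.lower().split()]

print(tokenize("The cat sat on the mat"))  # [1, 2, 3, 4, 1, 5]
```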
DeepSeek OCR Innovation
Okay, the next question here is: what does all this have to do with DeepSeek OCR? When we use tokens, we’re essentially trying to compress meaning into a sequence of symbols that computers can understand, and there’s nothing inherently wrong with that. But human language is by nature heavily redundant and repetitive.
And since a token needs to be generated for each word we give as input, there’s a limit to how much information we can cram in without losing the very information it represents. For example, the sentence “Caleb writes code” cannot be compressed further than the symbolic representation of each word. So if Caleb has a token ID of 100, writes has 59, and code has 67, then you really can’t compress beyond this three-number vector without losing its meaning.
And this symbolic entropy puts a hard limit on how much the representation can be compressed. So given this limit, how did DeepSeek overcome it and compress information by a factor of 10?
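To see that floor in code, here is a tiny illustration using the hypothetical IDs from the example above (these are not real tokenizer IDs): every word keeps its own symbol, so the encoded sentence can’t get shorter than one ID per word without losing the ability to recover the text.

```python
# The symbolic floor: one token ID per word (IDs are the hypothetical ones above).
token_ids = {"Caleb": 100, "writes": 59, "code": 67}
id_to_word = {v: k for k, v in token_ids.items()}

encoded = [token_ids[w] for w in "Caleb writes code".split()]
print(encoded)                                   # [100, 59, 67]

decoded = " ".join(id_to_word[i] for i in encoded)
print(decoded)                                   # "Caleb writes code"
# Dropping any ID loses a word, so text tokens can't compress below this length.
```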
Sponsor Message: Woven
Achieving this kind of technology at DeepSeek probably requires hiring a lot of high-quality talent to work on the model, and hiring can be tough. So here’s a quick sponsored message from Woven that helps support me in making videos like this.
When I was looking to hire software developers at my previous company, one thing I always found was that candidates had very different skill sets: some people were really good at code reviews, others at system debugging, and now at agentic AI programming. So coming up with coding evaluations for each role took a lot of time and effort to build the scenarios and give feedback. It just wasn’t fun for anyone involved in the process.
Woven is a human-powered technical assessment tool that streamlines hiring. So if you’re looking to hire engineers, Woven is offering a 14-day free trial with 20% off your first hire. Check the link in the description.
DeepSeek’s Technical Breakthrough
Okay, so the question is this: how did DeepSeek compress data without losing information? DeepSeek OCR uses a vision model to essentially sidestep the compression that happens in text tokens; instead, it takes images as input and uses the latent space to compress how the information is represented.
In one of my previous videos, “Autoencoders to Diffusion for Beginners”, I talked about how data compression and feature extraction work in images, if you want to learn more about them.
Anyway, DeepSeek shifted where the data compression happens: instead of compressing text into a symbolic representation, they compress it in latent space. And the result was quite astounding, with DeepSeek achieving:
- 10 times compression while maintaining 97% accuracy
- 20 times compression while maintaining 60% accuracy
And one important concept to clarify here is that images aren’t smaller than text in terms of storage. As a matter of fact, we all know that images take up more storage space than text. What we’re talking about here is representation efficiency: the latent representation of an image can be far more information-dense than text, whose structure is constrained by tokens as its lowest common denominator.
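As a back-of-the-envelope sketch of what the claimed ratio would mean in practice (the per-page token counts below are assumptions for illustration, not numbers from the DeepSeek-OCR paper):

```python
# Back-of-the-envelope sketch of "representation efficiency".
# The per-page counts are illustrative assumptions, not DeepSeek's measurements.

text_tokens_per_page = 1_000    # rough cost of a dense page as text tokens
vision_tokens_per_page = 100    # same page encoded as latent vision tokens (~10x claim)

ratio = text_tokens_per_page / vision_tokens_per_page
print(f"~{ratio:.0f}x fewer tokens per page")      # ~10x

# Scaled up: the same context budget covers ~10x more pages if they enter as images.
context_budget = 128_000
print(context_budget // text_tokens_per_page,      # 128 pages as text
      context_budget // vision_tokens_per_page)    # 1280 pages as images
```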
Andrej Karpathy’s Perspective
And given this shortcoming of text tokens, the pain point behind them was also expressed quite strongly by Andrej Karpathy in a post that said, “Delete the tokenizer at the input.”
I already ranted about how much I dislike the tokenizer. Tokenizers are ugly, a separate, non-end-to-end stage. They import all the ugliness of Unicode and byte encodings, inherit a lot of historical baggage, and carry security and jailbreak risks. They make two characters that look identical to the eye appear as two completely different tokens internally in the network. The tokenizer must go.
Model Architecture and Innovation
Now, as far as the model architecture of DeepSeek OCR is concerned, I haven’t had much time to dig through it to say anything substantial, but it doesn’t appear to be groundbreaking. In other words, it’s a clever mix of existing components like SAM, a CNN, and a vision model. So the true innovation here from DeepSeek isn’t in the parts, but in the composition.
Closing Thoughts
As a closing thought, I recently had a conversation with a friend of mine who has been saying that he thinks in pictures rather than words. And as someone who thinks more in words than in pictures, it was hard for me to understand what he meant by thinking in pictures until I started to read more about DeepSeek’s new model.
Could it be that, going forward, we’re seeing a shift toward models trained to think in pictures rather than words? What will context engineering look like going forward, when so many AI companies are built around managing context well? And how will image-based inputs change the AI industry and context engineering?
Original Video Link: https://www.youtube.com/watch?v=uWrBH4iN5y4