
From Base Models to Reasoning Models

How was Deepseek-R1 created?

Introduction

AI is evolving at an insane pace—what felt like a decade of progress now happens in a year. If you’ve been following recent developments, you’ve probably heard about reasoning models like Deepseek-R1 or OpenAI’s o3, which recently cracked the ARC-AGI benchmark—a test designed to challenge human-like reasoning.

But what exactly are reasoning models? How are they different from traditional base models? And why is everyone so excited about them?

In this article, we’ll break it all down in a clear and intuitive way—no unnecessary jargon, no PhD required. We’ll take a deep dive into the Deepseek release, which gives us rare behind-the-scenes insights into how these next-gen models are built. Since they’ve open-sourced a lot of details, we have a unique opportunity to understand what’s really happening under the hood.

To make things concrete, we’re going to walk through all the recent releases from Deepseek and break down how they built their models step by step. I’ve packed everything into the image below to give you a clear overview.

Let’s start at the top—what exactly is a Base Model?

Deepseek models

What is a Base Model?

A base model is a very large neural network trained on massive amounts of text scraped from the internet. Its primary function is simple: completing text (basically, it’s a super autocompletion machine). A base model alone isn’t very impressive, but it’s the first building block of every modern LLM.

Here’s how the Llama 3.1 405B base model responds to the prompt “What is 2+2?” (Spoiler alert: it’s not particularly useful. 🙂)

Base Model output - not really helpful, right?
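To make this concrete, here’s a minimal sketch of how you’d query a base model for raw text completion with the Hugging Face transformers library. The checkpoint name is just illustrative (the real 405B model won’t fit on a laptop); any base, non-instruct checkpoint behaves the same way.

```python
# Minimal sketch: a base model only continues text, it doesn't "answer".
# The checkpoint name is illustrative; any base (non-instruct) model behaves similarly.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-405B"  # base checkpoint, not the -Instruct variant

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "What is 2+2?"
inputs = tokenizer(prompt, return_tensors="pt")

# The model simply predicts the most likely continuation of the prompt, which
# often looks like more exam-style questions rather than a helpful answer.
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```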

What is a Chat Model?

A Chat Model is what most of us interact with daily—an AI assistant fine-tuned to engage in helpful, structured, and context-aware conversations with users (ChatGPT for example). Unlike a raw base model, which just predicts and completes text, a chat model is trained to follow instructions, answer questions, and assist meaningfully.

This transformation happens through Supervised Fine-Tuning (SFT), where the model learns from high-quality human demonstrations, and Reinforcement Learning from Human Feedback (RLHF), which helps it align better with user preferences by optimizing responses based on human ratings.

Here’s how Deepseek-V3-Chat responds to the prompt “What is 2+2?” (Spoiler alert: it’s much more useful, right? 😃)

Deepseek-V3-Chat output
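The difference also shows up in how you call the model: a chat model expects structured conversation turns (system/user/assistant) wrapped in a chat template, not raw text. Here’s a rough sketch with transformers; the checkpoint name is illustrative.

```python
# Minimal sketch: querying a chat model with structured conversation turns.
# The checkpoint name is illustrative; any instruction-tuned model works the same way.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-V3"  # chat-tuned checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"},
]

# The chat template wraps the turns in the special tokens the model was fine-tuned on.
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```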

Supervised Fine-Tuning (SFT)

The most crucial step in transforming a Base Model into a Chat Model is Supervised Fine-Tuning (SFT). This process teaches the model to become a helpful assistant by training it on carefully curated, high-quality human-written conversations.

In simple terms, SFT takes the base model and specializes it using a dataset of human-crafted dialogues, guiding it to follow instructions, provide useful answers, and engage in meaningful interactions.

Yes, this means that a lot of human effort goes into creating these datasets—real people write and refine conversations that the model then learns from.

Here’s an example of how a single training example might look:

Single datapoint that would be used during SFT
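In code, a single SFT example is usually just a short conversation, and training comes down to computing the usual next-token loss only on the assistant’s reply. Here’s an illustrative sketch (the exact format varies between training frameworks):

```python
# Illustrative sketch of a single SFT example and how it might be prepared for
# training: the loss is computed only on the assistant's reply, so the model
# learns to produce answers rather than to re-predict the prompt.
sft_example = {
    "messages": [
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "2 + 2 equals 4. Let me know if you need anything else!"},
    ]
}

def build_training_tokens(example, tokenizer):
    """Tokenize the conversation and mask everything except the assistant reply."""
    prompt_ids = tokenizer.apply_chat_template(
        example["messages"][:-1], add_generation_prompt=True
    )
    full_ids = tokenizer.apply_chat_template(example["messages"])
    # -100 tells the cross-entropy loss to ignore those positions (the prompt part).
    labels = [-100] * len(prompt_ids) + full_ids[len(prompt_ids):]
    return {"input_ids": full_ids, "labels": labels}

# Example usage with any chat-tuned tokenizer:
# tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")
# batch = build_training_tokens(sft_example, tokenizer)
```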

Reinforcement Learning (RL) - A quick intro

Let’s step back from the world of LLMs for a second and look at Reinforcement Learning (RL) through a classic example: the game of Go.

In 2016, AlphaGo, an AI trained with RL, defeated the world champion in Go, a game so complex that brute-force strategies don’t work. Instead of just memorizing moves, AlphaGo learned by playing millions of games against itself, improving over time by rewarding good decisions and penalizing bad ones.

This works well for Go because evaluating success is easy—either you win or you lose. The AI always knows whether a move led to a good or bad outcome.
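To make that “reward good decisions, penalize bad ones” loop concrete, here’s a tiny toy sketch. It has nothing to do with Go or AlphaGo’s actual algorithm; it just shows the core feedback loop when the reward is an unambiguous win or loss.

```python
# Toy sketch of the RL feedback loop: clear +1 / -1 rewards, as in a game.
# The "game" here is trivially simple (pick one of two moves), but the loop is
# the same idea: try, get a reward, make good moves more likely.
import random

preferences = {"move_a": 0.0, "move_b": 0.0}  # learned scores per move
win_prob = {"move_a": 0.8, "move_b": 0.3}     # hidden environment dynamics
lr = 0.1

for _ in range(1000):
    # Pick the currently preferred move, with a bit of exploration.
    if random.random() < 0.1:
        move = random.choice(list(preferences))
    else:
        move = max(preferences, key=preferences.get)

    # Clear reward signal: win (+1) or lose (-1), no ambiguity.
    reward = 1.0 if random.random() < win_prob[move] else -1.0
    preferences[move] += lr * reward  # reinforce good moves, penalize bad ones

print(preferences)  # move_a ends up strongly preferred
```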

Now, let’s bring this back to LLMs. Imagine asking an AI to summarize a text—how do we automatically determine if the summary is actually good? There’s no clear “win” or “lose” signal like in Go. Quality is subjective, making it much harder to scale RL for language models.

Reinforcement Learning from Human Feedback (RLHF)

After Supervised Fine-Tuning (SFT), our LLM is now capable of having decent conversations. But we want to take it further—aligning it better with human preferences so that its responses feel more natural, helpful, and safe.

Earlier, we mentioned that RL is hard to apply to LLMs because there's no clear win/loss signal like in a game of Go. But RLHF (Reinforcement Learning from Human Feedback) is a clever workaround that has proven to improve performance.

Here’s how it works:

  1. The LLM generates multiple responses to a prompt.

  2. Humans rank the responses from best to worst based on quality, helpfulness, and correctness.

  3. Since this is too slow and expensive to do at scale, we train a separate model (called a reward model) to predict how humans would rank responses.

  4. Once this reward model is trained, we use it to automatically rank future responses, allowing us to fine-tune the LLM using reinforcement learning.

And after this process? We now have our Chat Model (Deepseek-V3-Chat)!
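To make steps 3 and 4 a bit more concrete, here’s a minimal sketch of the idea behind the reward model: it’s trained with a pairwise ranking loss so that the response humans preferred gets a higher score. A real reward model is a full LLM with a scalar output head; the tiny network below is just a stand-in.

```python
# Minimal sketch of the reward-model idea behind RLHF (step 3 above).
# A real reward model is an LLM with a scalar head; a tiny stand-in network is
# used here to show the pairwise ranking loss that teaches it to score the
# human-preferred ("chosen") response higher than the rejected one.
import torch
import torch.nn as nn

embedding_dim = 128  # stand-in for "features of a (prompt, response) pair"
reward_model = nn.Sequential(nn.Linear(embedding_dim, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Fake batch: embeddings of chosen vs. rejected responses for the same prompts.
chosen = torch.randn(8, embedding_dim)
rejected = torch.randn(8, embedding_dim)

chosen_scores = reward_model(chosen)
rejected_scores = reward_model(rejected)

# Bradley-Terry style loss: push the chosen score above the rejected one.
loss = -torch.nn.functional.logsigmoid(chosen_scores - rejected_scores).mean()
loss.backward()
optimizer.step()
```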

What is a Reasoning Model?

A Reasoning Model is an AI that doesn’t just predict an answer but thinks through the problem using intermediate steps before responding.

Instead of the typical "Question → Answer" format you get with models like GPT-4o, reasoning models follow a different approach:
"Question → Thinking → Answer"

This structured reasoning process allows them to break down complex problems and explain their thought process. Some models, like Deepseek-R1, make this reasoning visible, while others hide it during inference.

Deepseek-R1 reasoning (truncated)

Where Do These Models Shine?

Reasoning models are particularly useful for tasks that require multi-step thinking and logical breakdown, such as:

  • Coding

  • Solving puzzles

  • Mathematical proofs

  • Advanced problem-solving

Are They Always the Best Choice?

Not necessarily. For simple tasks like translation or basic Q&A, traditional models work just fine—there’s no need for extra reasoning.

There’s also a tradeoff:

  • Slower inference time since the model takes additional steps before answering

  • Higher computational cost compared to standard LLMs

Reasoning models are powerful, but they should be used strategically where their capabilities truly matter.

Inference-time scaling

Let’s understand this with a real-world example. When you were in school taking an exam, did you instantly write down an answer, or did you take a moment to think through the problem?

Well, LLMs work the same way. Some questions can be answered immediately, while others require reasoning before reaching a conclusion.

The simplest way to encourage an LLM to reason step by step is through prompt engineering techniques. A well-known approach is "Chain of Thought" prompting, which explicitly asks the model to produce intermediate reasoning steps before answering.

A classic trick is adding "Let's think step by step." to the prompt. This simple phrase nudges the model to break down the problem, improving performance on tasks that require logical reasoning.

As you might guess, producing longer responses increases inference time and cost.
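In practice this is nothing more than changing the prompt. Here’s a sketch using the OpenAI Python client as an example of any chat-completion API (the model name is just a placeholder):

```python
# Sketch: the same question asked directly vs. with a Chain-of-Thought nudge.
from openai import OpenAI

client = OpenAI()
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
            "than the ball. How much does the ball cost?")

for prompt in [question, question + "\nLet's think step by step."]:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": prompt}],
    )
    # The second prompt usually produces intermediate reasoning before the
    # final answer (the ball costs $0.05), and therefore more tokens and cost.
    print(response.choices[0].message.content)
```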

Deepseek-R1-Zero

Now that we’ve covered all the key components, we can dive into Deepseek’s reasoning models, starting with R1-Zero.

The big novelty? They skipped Supervised Fine-Tuning (SFT) entirely. Traditionally, models go through SFT before reinforcement learning to shape them into useful assistants. But Deepseek jumped straight from a base model to a reasoning model using only RL—something we haven’t seen before at this scale.

Their Reinforcement Learning (RL) setup relied on two kinds of rewards. The accuracy reward used the LeetCode compiler to verify coding answers and a deterministic system to evaluate mathematical responses. The format reward relied on an LLM judge to ensure responses followed the expected format.
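To give a feel for what such rewards can look like, here’s a hedged sketch, not DeepSeek’s actual code: a deterministic accuracy check for a math answer, plus a simple rule-based stand-in for the format check (the format reward above reportedly used an LLM judge; a regex is used here just for illustration, with the <think>/<answer> tags as an assumed response template).

```python
# Hedged sketch of the two reward signals, not DeepSeek's actual implementation.
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Reward 1.0 if the final answer matches the known correct answer exactly."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match and match.group(1).strip() == ground_truth.strip():
        return 1.0
    return 0.0

def format_reward(response: str) -> float:
    """Reward responses that put their reasoning and answer in the expected tags."""
    has_think = bool(re.search(r"<think>.*?</think>", response, re.DOTALL))
    has_answer = bool(re.search(r"<answer>.*?</answer>", response, re.DOTALL))
    return 1.0 if (has_think and has_answer) else 0.0

response = "<think>2 + 2 = 4</think><answer>4</answer>"
print(accuracy_reward(response, "4"), format_reward(response))  # 1.0 1.0
```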

This model isn’t top-tier, but it does exhibit basic reasoning skills. More importantly, it demonstrated an "Aha! moment"—showing that reinforcement learning alone can push models toward reasoning without human-annotated conversations.

Deepseek-R1

Now let’s take a look at the main reasoning model, Deepseek-R1. The first important thing to note is that they started training from the base model again, but this time, they used Deepseek-R1-Zero to generate data for Supervised Fine-Tuning (SFT).

Once the SFT phase was completed, they applied the same Reinforcement Learning (RL) process as in R1-Zero, but with an additional improvement: a new reward to prevent language mixing, which had been an issue in the previous version.
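As a purely illustrative guess at what a language-consistency reward could look like (not DeepSeek’s implementation), one simple proxy is the fraction of the response written in the target language:

```python
# Illustrative sketch only: score the fraction of the response written in the
# target language. This is a guess at the mechanism, not DeepSeek's actual code.
def language_consistency_reward(response: str, target: str = "english") -> float:
    words = response.split()
    if not words:
        return 0.0
    # Crude heuristic: count ASCII-letter words as "target language" when the
    # target is English; a real system would use a proper language identifier.
    in_target = sum(w.isascii() and w.replace("'", "").isalpha() for w in words)
    return in_target / len(words)

print(language_consistency_reward("The answer is 4"))       # mostly English -> higher score
print(language_consistency_reward("答案 是 4 因为 2+2=4"))    # mixed languages -> lower score
```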

After that, the model went through the standard Chat Model training process, with one key difference: they heavily integrated Chain of Thought (CoT) reasoning in both the SFT dataset and RLHF stages. This ensured the model not only aligned well with human preferences but also developed stronger step-by-step reasoning capabilities.

Deepseek-R1-Distill

DeepSeek has also released smaller models through a process they refer to as distillation. Unlike classical distillation (where a student model is trained to match the teacher’s output probabilities), this simply involves fine-tuning smaller LLMs—such as Llama 8B and 70B, as well as Qwen models ranging from 1.5B to 32B parameters—using an SFT dataset generated by larger models, namely DeepSeek-V3 and an intermediate checkpoint of DeepSeek-R1. Notably, the same SFT data used in this distillation process was also employed in training DeepSeek-R1.
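In other words, this “distillation” is just another SFT run: the big model generates reasoning traces offline, and the small model is fine-tuned on them with the standard causal-LM loss. A rough sketch (the model name and the single example are placeholders):

```python
# Rough sketch of "distillation as SFT": traces generated by the large teacher
# model become ordinary supervised fine-tuning data for the small student model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-1.5B"  # small student; the teacher's outputs are offline data
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# One made-up teacher-generated example: prompt + reasoning trace + answer.
text = ("Question: What is 2+2?\n"
        "<think>2 plus 2 is 4.</think>\n"
        "Answer: 4")

batch = tokenizer(text, return_tensors="pt")
# Standard causal-LM loss: the student learns to reproduce the teacher's trace.
outputs = student(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
```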

Performance comparisons reveal that while the distilled models are not as robust as DeepSeek-R1, they perform surprisingly well relative to DeepSeek-R1-Zero, despite their significantly smaller size. This indicates the effectiveness of distillation in creating compact yet capable reasoning models.

What comes next?

Well… if we listen to Sam Altman, we should have the world’s best coder in our pocket soon enough 😄
