1. The Logit Pipeline
A language model doesn't output words directly — it outputs ~50,000 raw scores called logits, one per vocabulary token. These pass through a 5-stage pipeline: raw logits are scaled by temperature, converted to probabilities via softmax, filtered by top-k/top-p to remove unlikely tokens, and finally sampled to pick one token. Click any stage below to inspect the actual numbers and formulas at each step.
1. Raw Logits
2. Temperature
3. Softmax
4. Top-K / Top-P
5. Sample

(Each stage shows the same six candidate tokens: mat, floor, table, bed, ground, couch.)
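A minimal Python sketch of the whole pipeline, using the six tokens above with their logits from the Token Builder table (a toy 6-token vocabulary; real models run the same steps over the full ~50,000-token vocabulary on GPU tensors):

```python
import math
import random

def sample_pipeline(logits, temperature=1.0, top_k=None, top_p=None, seed=0):
    """Toy 5-stage pipeline: logits -> temperature -> softmax -> top-k/top-p -> sample."""
    # Stage 2: temperature scaling
    scaled = [z / temperature for z in logits]
    # Stage 3: softmax (max subtracted for numerical stability)
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Stage 4: top-k / top-p filtering
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep = set(order)
    if top_k is not None:
        keep &= set(order[:top_k])
    if top_p is not None:
        nucleus, cum = set(), 0.0
        for i in order:
            nucleus.add(i)
            cum += probs[i]
            if cum >= top_p:
                break
        keep &= nucleus
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    norm = sum(filtered)
    filtered = [p / norm for p in filtered]
    # Stage 5: sample one token index, weighted by the filtered probabilities
    return random.Random(seed).choices(range(len(filtered)), weights=filtered)[0]

logits = [13.8, 11.2, 10.1, 9.8, 9.5, 9.2]  # mat, floor, table, bed, ground, couch
print(sample_pipeline(logits, temperature=1.0, top_k=3))
```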
2. Token Builder
You replace the model's sampling step. The left panel shows your sentence — gray tokens are the prompt, colored tokens are your picks (green = confident, red = surprising). Click any chosen token to rewind. The right panel shows the live probability distribution — click any bar to pick that token manually. Use the sliders to reshape probabilities in real time: temperature flattens or sharpens the distribution, top-k/top-p filter out unlikely tokens, and repetition penalty discourages repeats. 'Sample' picks a random token weighted by probability (what real LLMs do in production), while 'Greedy' always picks the #1 token (deterministic, like setting temperature to zero).
Sentence Workspace
The cat sat on the
Distribution at Current Position
Temperature: 1.00
Top-K: Off
Top-P: Off
Rep. Penalty: Off
#    Token       Prob.    Raw Logit
1    mat         86.2%    13.8
2    floor       6.4%     11.2
3    table       2.1%     10.1
4    bed         1.6%     9.8
5    ground      1.2%     9.5
6    couch       0.87%    9.2
7    chair       0.64%    8.9
8    roof        0.29%    8.1
9    fence       0.17%    7.6
10   edge        0.13%    7.3
11   sofa        <0.1%    7.0
12   wall        <0.1%    6.8
13   top         <0.1%    6.5
14   other       <0.1%    6.2
15   window      <0.1%    6.0
16   porch       <0.1%    5.6
17   grass       <0.1%    5.3
18   counter     <0.1%    5.0
19   rug         <0.1%    4.7
20   step        <0.1%    4.4
21   however     <0.1%    4.0
22   although    <0.1%    3.7
23   perhaps     <0.1%    3.4
24   never       <0.1%    3.4
25   often       <0.1%    3.4
26   simply      <0.1%    3.3
27   maybe       <0.1%    3.3
28   always      <0.1%    3.2
29   sometimes   <0.1%    3.0
30   usually     <0.1%    2.6
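The "Raw Logit" column fully determines the probability column. A quick check in Python, using only the 30 logits listed above (the ~50,000 tail tokens are omitted; they hold so little mass that the rounded percentages don't change):

```python
import math

# The 30 raw logits from the table, top to bottom
logits = [13.8, 11.2, 10.1, 9.8, 9.5, 9.2, 8.9, 8.1, 7.6, 7.3,
          7.0, 6.8, 6.5, 6.2, 6.0, 5.6, 5.3, 5.0, 4.7, 4.4,
          4.0, 3.7, 3.4, 3.4, 3.4, 3.3, 3.3, 3.2, 3.0, 2.6]

m = max(logits)
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

print(f"P(mat)   = {probs[0]:.1%}")  # reproduces the 86.2% shown above
print(f"P(floor) = {probs[1]:.1%}")  # reproduces 6.4%
```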
3. Sampling Strategies Compared
Four strategies build sentences side-by-side from the same prompt. Click 'Next Token' to generate one token for each strategy simultaneously. Greedy always picks #1 (deterministic), while Top-K, Top-P, and Creative sample differently — so their sentences diverge over time. Keep clicking to watch how each strategy shapes the output differently.
Greedy (T=0)
Always picks the most likely token. Deterministic.
The cat sat on the
Top-K=40
Sample from top 40 tokens. Common default.
The cat sat on the
Top-P=0.9
Nucleus sampling. Adapts to distribution shape.
The cat sat on the
Creative (T=1.5)
High temperature + nucleus. More diverse outputs.
The cat sat on the
4. Entropy Landscape
Entropy measures the model's uncertainty at each position. Each token is colored by confidence: dark blue = nearly certain (<0.5 bits), cyan = confident, yellow = moderate, orange = high uncertainty, red = very uncertain (>5 bits, dozens of plausible tokens). Hover any token to see the top predictions. Dashed red borders mark 'surprises' where the chosen token wasn't the model's top pick. Compare passages — factual text is mostly blue, creative writing is yellow/red, and code reveals fascinating patterns.
The United States of America was founded in 1776. The Declaration of Independence was signed on July 4, 1776.
Color legend: <0.5 bits · 0.5–1.5 · 1.5–3.0 · 3.0–5.0 · >5.0 bits · Surprise (dashed border)
5. The Long Tail
The vocabulary has ~50,000 tokens but the distribution is extremely spiky. The top 10 tokens typically hold 60-80% of probability mass, the top 100 cover 95-99%, and the remaining ~49,900 share the last 1-5%. The chart shows this on a log scale — the Head (top 10) dominates, the Body (top 100) is plausible, and the Tail (100+) contains misspellings, rare languages, and encoding artifacts. Toggle the CDF overlay to visualize cumulative probability.
Vocabulary Size: 50,257
Top 1 Token: "mat" (86.2%)
Tokens for 50% Mass: 1
Tokens for 90% Mass: 2
Tokens for 99% Mass: 8
Effective Vocab (non-zero): 200
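The "Tokens for X% Mass" statistics fall out of a sorted distribution directly. A sketch using the 30-token distribution from the Token Builder above (the real ~50,000-token tail shifts the larger thresholds slightly):

```python
import math

logits = [13.8, 11.2, 10.1, 9.8, 9.5, 9.2, 8.9, 8.1, 7.6, 7.3,
          7.0, 6.8, 6.5, 6.2, 6.0, 5.6, 5.3, 5.0, 4.7, 4.4,
          4.0, 3.7, 3.4, 3.4, 3.4, 3.3, 3.3, 3.2, 3.0, 2.6]
m = max(logits)
exps = [math.exp(z - m) for z in logits]
total = sum(exps)
probs = sorted((e / total for e in exps), reverse=True)

def tokens_for_mass(sorted_probs, target):
    """Smallest number of top tokens whose cumulative probability reaches `target`."""
    cum = 0.0
    for k, p in enumerate(sorted_probs, start=1):
        cum += p
        if cum >= target:
            return k
    return len(sorted_probs)

print(tokens_for_mass(probs, 0.50))  # 1 token already covers half the mass
print(tokens_for_mass(probs, 0.90))  # 2 tokens cover 90%
```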
6. The Math
The exact formulas behind every step above — softmax, temperature scaling, top-k, nucleus (top-p) sampling, repetition penalty, log probabilities, perplexity, entropy, and KL divergence. Each topic includes the formula, plain-English explanation, and key properties. Expand any topic below.
6a: Softmax Function
P(\text{token}_i) = \frac{\exp(z_i)}{\sum_j \exp(z_j)}

Softmax converts arbitrary real-valued logits into a valid probability distribution. It exponentiates each logit (making them all positive) then divides by the sum (making them sum to 1).

Numerical stability: \text{softmax}(\mathbf{z}) = \text{softmax}(\mathbf{z} - \max(\mathbf{z}))

Without subtracting the max: exp(100) ≈ 2.69 × 10⁴³ → overflow!

With subtraction: exp(100 − 100) = exp(0) = 1 → safe
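A sketch of the stable form in Python (float64, where math.exp overflows around 709; the same subtraction trick protects float32 at its much lower threshold):

```python
import math

def softmax(logits):
    """Numerically stable softmax: shifting by max(z) leaves the result unchanged."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# math.exp(1000) would raise OverflowError; the shifted version never exceeds exp(0) = 1.
probs = softmax([1000.0, 999.0, 998.0])
print([round(p, 3) for p in probs])
```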

6b: Temperature Scaling
P(\text{token}_i \mid T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}

Temperature controls how "sharp" or "flat" the distribution is. It works by dividing all logits by T before softmax.

T → 0: All probability concentrates on the argmax (greedy/deterministic)

T = 1: Original distribution (model's true beliefs)

T → ∞: Uniform distribution (random)

Proof that T → 0 gives argmax: As T → 0, the gaps between scaled logits z_i / T grow without bound. The largest logit dominates exponentially: \exp(z_{\max}/T) \gg \exp(z_{\text{other}}/T), so P(\text{argmax}) \to 1.
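The three regimes can be seen with a few toy logits (values made up for illustration):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply (stable) softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for T in (0.1, 1.0, 10.0):
    # Low T concentrates mass on the argmax; high T pushes toward uniform.
    print(T, [round(p, 3) for p in softmax_with_temperature(logits, T)])
```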

6c: Top-K Sampling

1. Sort tokens by probability: P[1] ≥ P[2] ≥ … ≥ P[V]

2. Keep only the top K tokens: S = {token[1], …, token[K]}

3. Zero out the rest: P[i] = 0 for i > K

4. Renormalize: P'[i] = P[i] / Σ_{j∈S} P[j]

Top-K is a hard, fixed cutoff. The problem: K = 40 might be too many tokens for a peaked distribution (wasting probability on unlikely tokens) and too few for a flat one (cutting off plausible options).
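The four steps above, as a short sketch:

```python
def top_k_filter(probs, k):
    """Steps 1-4: sort, keep the top k tokens, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    keep = set(order[:k])
    filtered = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]
# The 0.2 of mass held by dropped tokens is redistributed to the survivors:
print([round(p, 3) for p in top_k_filter(probs, 2)])  # [0.625, 0.375, 0.0, 0.0, 0.0]
```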

6d: Nucleus (Top-P) Sampling

1. Sort tokens by probability: P[1] ≥ P[2] ≥ … ≥ P[V]

2. Find the smallest K such that: Σ_{i=1..K} P[i] ≥ p

3. Keep only these K tokens, zero out the rest

4. Renormalize

Top-P (nucleus sampling) adapts to the shape of the distribution. For peaked distributions, only a few tokens are kept. For flat distributions, many tokens are kept. The parameter p controls how much cumulative probability mass to retain.

Key insight: The same p value on a peaked vs. flat distribution yields very different effective K values. This adaptive behavior is why top-p is generally preferred over top-k.
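The adaptivity is easy to demonstrate with two made-up distributions:

```python
def nucleus_indices(probs, p):
    """Indices of the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept

peaked = [0.92, 0.05, 0.02, 0.01]
flat = [0.30, 0.28, 0.22, 0.20]
print(len(nucleus_indices(peaked, 0.9)))  # 1: effective K shrinks
print(len(nucleus_indices(flat, 0.9)))    # 4: effective K grows
```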

6e: Repetition Penalty

if token_i ∈ already_generated:

  logit_i ← logit_i / penalty   (if logit_i > 0)

  logit_i ← logit_i × penalty   (if logit_i < 0)

Repetition penalty directly modifies the logits of already-generated tokens. Positive logits are divided by the penalty (making them less likely), and negative logits are multiplied (making them even less likely). A penalty of 1.0 = no effect. Typical values: 1.1–1.5.
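The two-branch rule in code (a sketch of the scheme described above; penalty=1.3 is just one value in the typical range):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.3):
    """Divide positive logits by the penalty, multiply negative ones; both moves reduce probability."""
    out = list(logits)
    for i in set(generated_ids):
        if out[i] > 0:
            out[i] /= penalty
        else:
            out[i] *= penalty
    return out

print(apply_repetition_penalty([2.6, -1.0, 0.5], generated_ids=[0, 1]))
# token 0 shrinks (2.6 / 1.3 = 2.0), token 1 drops further (-1.0 * 1.3 = -1.3), token 2 untouched
```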

6f: Log Probabilities & Perplexity
\log P(\text{sequence}) = \sum_i \log P(\text{token}_i \mid \text{token}_{<i})

\text{perplexity} = 2^{-\frac{1}{N} \sum_i \log_2 P(\text{token}_i \mid \text{token}_{<i})}

Perplexity is the effective number of tokens the model was choosing between. A perplexity of 10 means the model was, on average, as uncertain as if it were picking uniformly among 10 options.

Human-written English: perplexity ~20–50. Highly predictable text (code, facts): perplexity ~5–15. Random text: perplexity ~50,000 (vocab size).
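Both formulas in a few lines (the per-token probabilities are made up for illustration):

```python
import math

token_probs = [0.8, 0.5, 0.9, 0.25]  # P(token_i | context) at each position

log_p = sum(math.log(p) for p in token_probs)  # sequence log-probability (natural log)
avg_neg_log2 = -sum(math.log2(p) for p in token_probs) / len(token_probs)
perplexity = 2 ** avg_neg_log2  # equivalently, the inverse geometric mean of the probabilities

print(round(log_p, 3))
print(round(perplexity, 3))
```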

6g: Entropy
H = -\sum_x P(x) \log_2 P(x)

Entropy measures the "uncertainty" or "information content" of a distribution. Maximum entropy = log₂(vocab_size) ≈ 15.6 bits for a 50k vocabulary (uniform distribution). Typical LLM entropy per position: 2–6 bits.

Relationship to perplexity: perplexity = 2^H. An entropy of 3 bits = perplexity of 8. This is why entropy and perplexity tell the same story in different units.
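A sketch confirming the 3-bits-equals-perplexity-8 relationship on a uniform 8-way distribution:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform8 = [1 / 8] * 8
print(entropy_bits(uniform8))       # 3.0 bits of uncertainty...
print(2 ** entropy_bits(uniform8))  # ...which is a perplexity of 8.0
```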

6h: KL Divergence
D_{\text{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}

KL divergence measures how different two distributions are. It's not symmetric: D_KL(P‖Q) ≠ D_KL(Q‖P). KL = 0 means the distributions are identical. Higher KL = more different.

Use this to quantify "how much does temperature change the distribution?" Comparing T=1.0 vs T=0.7 typically gives KL ≈ 0.1–0.5 bits, while T=1.0 vs T=2.0 gives KL ≈ 1–3 bits.
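A sketch comparing the same toy logits at two temperatures (the four logits are made up; the 0.1–0.5 bit range quoted above depends on the actual distribution):

```python
import math

def softmax_t(logits, T):
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_bits(p, q):
    """D_KL(P || Q) in bits (log base 2)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

logits = [3.0, 1.5, 0.5, 0.0]
p = softmax_t(logits, 1.0)  # the model's original distribution
q = softmax_t(logits, 0.7)  # sharpened by lower temperature
print(round(kl_bits(p, q), 3))
print(round(kl_bits(q, p), 3))  # a different number: KL is not symmetric
```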

7. Key Takeaways
The core ideas from every section above, distilled into single sentences you should be able to explain to someone else. If any point feels unclear, scroll back to the relevant section and interact with it.
What You Now Understand
  • Language models don't generate words, they generate probability distributions. Every "response" is a sequence of sampling events, each conditioned on the tokens chosen so far, over a ~50,000-token distribution.
  • Temperature doesn't add randomness — it reshapes existing uncertainty. At T=1, you see the model's true beliefs. Higher T flattens them. Lower T sharpens them. At T=0, only the top token survives.
  • Top-k and top-p are safety nets, not creativity dials. They truncate the tail to prevent sampling garbage tokens. Top-p adapts to distribution shape; top-k doesn't.
  • Greedy decoding ≠ the "best" sequence. The most probable sequence is not the sequence of most probable tokens. Beam search exists for this reason.
  • Perplexity is the model's confusion level. Perplexity of 10 means the model is, on average, choosing between 10 equally likely tokens. Human-written English: perplexity ~20-50. Random text: perplexity ~50,000.
  • Repetition penalty is a hack that works. Without it, models get stuck in loops. The penalty directly modifies logits of already-generated tokens.
  • The distribution is mostly empty. 99% of probability mass lives in the top ~100 tokens out of 50,000. The rest of the vocabulary is practically zero — but not exactly zero.
  • You've been the sampler this whole time. Every choice you made in the Token Builder is exactly what model.generate() does millions of times per day.