Peek into the model's mind — build sentences token-by-token from live probability distributions
Softmax converts arbitrary real-valued logits z into a valid probability distribution: p_i = exp(z_i) / Σ_j exp(z_j). It exponentiates each logit (making them all positive), then divides by the sum (making them sum to 1).
Numerical stability: compute softmax as exp(z_i − max(z)) / Σ_j exp(z_j − max(z)).
Without subtracting max: exp(1000) → overflow!
With subtraction: exp(1000 − 1000) = exp(0) = 1 → safe
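A minimal numpy sketch of the stable trick above (the function name `softmax` is ours):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    z = logits - np.max(logits)   # largest shifted logit is 0, so exp() cannot overflow
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# exp(1000) overflows a float64, but the shifted version is fine:
probs = softmax(np.array([1000.0, 999.0, 998.0]))
```

Without the subtraction, `np.exp(1000.0)` returns `inf` and the division produces NaNs.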
Temperature T controls how "sharp" or "flat" the distribution is. It works by dividing all logits by T before softmax: p_i = exp(z_i / T) / Σ_j exp(z_j / T).
T → 0: All probability concentrates on the argmax (greedy/deterministic)
T = 1: Original distribution (model's true beliefs)
T → ∞: Uniform distribution (random)
Proof that T → 0 gives argmax: As T → 0, the gap (z_max − z_i)/T → ∞ for every i ≠ argmax. The largest logit dominates exponentially: exp(z_max/T) ≫ exp(z_i/T), so p_argmax → 1.
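The limiting behavior is easy to see numerically. A sketch, assuming a stable `softmax` helper as defined earlier:

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def apply_temperature(logits, T):
    """Divide logits by T before softmax; T < 1 sharpens, T > 1 flattens."""
    return softmax(logits / T)

logits = np.array([2.0, 1.0, 0.5])
sharp = apply_temperature(logits, 0.1)   # nearly all mass on the argmax
flat = apply_temperature(logits, 10.0)   # close to uniform
```

At T = 0.1 the argmax gets essentially all the probability; at T = 10 every token is close to 1/3.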
1. Sort tokens by probability: p₁ ≥ p₂ ≥ … 
2. Keep only the top k tokens
3. Zero out the rest: p_i = 0 for i > k
4. Renormalize: p_i ← p_i / (p₁ + … + p_k)
Top-K is a hard, fixed-size cutoff. The problem: a fixed k might be too many tokens for a peaked distribution (wasting probability on unlikely tokens) and too few for a flat one (cutting off plausible options).
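The four steps above can be sketched in a few lines of numpy (the function name `top_k_filter` is ours):

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep the k most likely tokens, zero the rest, renormalize."""
    out = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]   # indices of the k largest probabilities
    out[top] = probs[top]
    return out / out.sum()

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
filtered = top_k_filter(probs, 2)   # only the 0.5 and 0.2 tokens survive
```

After filtering, the two surviving tokens share all the mass: 0.5/0.7 ≈ 0.714 and 0.2/0.7 ≈ 0.286.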
1. Sort tokens by probability: p₁ ≥ p₂ ≥ …
2. Find the smallest k such that: p₁ + p₂ + … + p_k ≥ p
3. Keep only these k tokens, zero out the rest
4. Renormalize
Top-P (nucleus sampling) adapts to the shape of the distribution. For peaked distributions, only a few tokens are kept. For flat distributions, many tokens are kept. The parameter p controls how much cumulative probability mass to retain.
Key insight: The same p value on a peaked vs flat distribution yields very different effective k values. This adaptive behavior is why top-p is generally preferred over top-k.
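A sketch of the adaptive behavior (the function name `top_p_filter` is ours; the example distributions are illustrative):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]       # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    k = np.searchsorted(cumulative, p) + 1  # smallest k with cumulative[k-1] >= p
    out = np.zeros_like(probs)
    keep = order[:k]
    out[keep] = probs[keep]
    return out / out.sum()

peaked = np.array([0.90, 0.05, 0.03, 0.02])
flat = np.array([0.30, 0.28, 0.22, 0.20])
few = top_p_filter(peaked, 0.9)   # nucleus = 1 token (effective k = 1)
many = top_p_filter(flat, 0.9)    # nucleus = all 4 tokens (effective k = 4)
```

Same p = 0.9, but the peaked distribution keeps one token while the flat one keeps four.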
For each already-generated token i:
z_i ← z_i / penalty (if z_i > 0)
z_i ← z_i × penalty (if z_i < 0)
Repetition penalty directly modifies the logits of already-generated tokens. Positive logits are divided by the penalty (making them less likely), and negative logits are multiplied (making them even less likely). A penalty of 1.0 = no effect. Typical values: 1.1–1.5.
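The two-branch rule can be sketched as follows (the function name and the example values are ours):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make already-generated tokens less likely: shrink positive logits,
    push negative logits further negative."""
    out = logits.copy()
    for i in set(generated_ids):
        if out[i] > 0:
            out[i] /= penalty   # positive logit -> smaller -> less likely
        else:
            out[i] *= penalty   # negative logit -> more negative -> less likely
    return out

logits = np.array([2.0, -1.0, 0.5])
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=1.2)
# token 0: 2.0 / 1.2; token 1: -1.0 * 1.2; token 2 untouched
```

With penalty = 1.0 both branches are no-ops, matching "1.0 = no effect" above.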
Perplexity = exp of the average negative log-probability the model assigned to the true tokens; intuitively, the weighted average number of tokens the model was choosing between. A perplexity of 10 means the model was, on average, as uncertain as if it were choosing uniformly among 10 options.
Human-written English: perplexity ~20–50. Highly predictable text (code, facts): perplexity ~5–15. Random text: perplexity ~50,000 (vocab size).
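The "choosing between N options" intuition in code (a sketch; the function name is ours):

```python
import numpy as np

def perplexity(token_probs: np.ndarray) -> float:
    """exp of the average negative log-probability of the observed tokens."""
    return float(np.exp(-np.mean(np.log(token_probs))))

# A model that always assigns p = 0.1 to the true token is
# "choosing between 10 options" -> perplexity 10:
ppl = perplexity(np.array([0.1, 0.1, 0.1]))
```
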
Entropy measures the "uncertainty" or "information content" of a distribution. Maximum entropy = log₂(50,000) ≈ 15.6 bits for a 50k vocabulary (uniform distribution). Typical LLM entropy per position: 2–6 bits.
Relationship to perplexity: perplexity = 2^H when entropy H is measured in bits. An entropy of 3 bits = perplexity of 2³ = 8. This is why entropy and perplexity tell the same story in different units.
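The 2^H relationship, checked numerically (the function name `entropy_bits` is ours):

```python
import numpy as np

def entropy_bits(probs: np.ndarray) -> float:
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = probs[probs > 0]
    return float(-np.sum(p * np.log2(p)))

uniform8 = np.full(8, 1 / 8)
H = entropy_bits(uniform8)   # uniform over 8 options -> 3 bits
ppl = 2 ** H                 # perplexity = 2^H -> 8
```
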
KL divergence measures how different two distributions are: D_KL(P ‖ Q) = Σ_i p_i log(p_i / q_i). It's not symmetric: D_KL(P ‖ Q) ≠ D_KL(Q ‖ P). KL = 0 means the distributions are identical. Higher KL = more different.
Use this to quantify "how much does temperature change the distribution?" Comparing the T = 1.0 distribution against a mildly tempered one (T close to 1) typically gives KL ~ 0.1–0.5 bits, while a strongly tempered one (T well below 1) gives KL ~ 1–3 bits.
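A sketch of that measurement (function names and example logits are ours; exact KL values depend on the distribution):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl_bits(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

logits = np.array([3.0, 1.0, 0.2, -1.0])
p1 = softmax(logits)              # T = 1.0
p_mild = softmax(logits / 0.9)    # mild tempering
p_strong = softmax(logits / 0.5)  # strong tempering

kl_mild = kl_bits(p1, p_mild)      # small shift
kl_strong = kl_bits(p1, p_strong)  # much larger shift
```

Note the asymmetry: `kl_bits(p1, p_strong)` and `kl_bits(p_strong, p1)` generally differ.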
This whole pipeline is what model.generate() does millions of times per day.