Peek into the model's mind — build sentences token-by-token from live probability distributions
Softmax converts arbitrary real-valued logits z into a valid probability distribution: p_i = exp(z_i) / Σ_j exp(z_j). It exponentiates each logit (making them all positive), then divides by the sum (making them sum to 1).
Numerical stability: compute softmax as exp(z_i − max(z)) / Σ_j exp(z_j − max(z)).
Without subtracting max: exp(1000) → overflow!
With subtraction: exp(1000 − 1000) = exp(0) = 1 → safe
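A minimal numpy sketch of the stable trick above (the function name `softmax` is ours):

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    z = logits - np.max(logits)   # largest shifted logit is 0, so exp() cannot overflow
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# exp(1000) overflows a float64, but the shifted version is fine:
probs = softmax(np.array([1000.0, 999.0, 998.0]))
```

Without the subtraction, `np.exp(1000.0)` returns `inf` and the division produces NaNs.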
Temperature T controls how "sharp" or "flat" the distribution is. It works by dividing all logits by T before softmax: p_i = exp(z_i / T) / Σ_j exp(z_j / T).
T → 0: All probability concentrates on the argmax (greedy/deterministic)
T = 1: Original distribution (model's true beliefs)
T → ∞: Uniform distribution (random)
Proof that T → 0 gives argmax: As T → 0, the gap (z_max − z_i)/T → ∞ for every i ≠ argmax. The largest logit dominates exponentially: exp(z_max/T) ≫ exp(z_i/T), so p_argmax → 1.
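The limiting behavior is easy to see numerically. A sketch, assuming a stable `softmax` helper as defined earlier:

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def apply_temperature(logits, T):
    """Divide logits by T before softmax; T < 1 sharpens, T > 1 flattens."""
    return softmax(logits / T)

logits = np.array([2.0, 1.0, 0.5])
sharp = apply_temperature(logits, 0.1)   # nearly all mass on the argmax
flat = apply_temperature(logits, 10.0)   # close to uniform
```

At T = 0.1 the argmax gets essentially all the probability; at T = 10 every token is close to 1/3.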
1. Sort tokens by probability: p₁ ≥ p₂ ≥ … 
2. Keep only the top k tokens
3. Zero out the rest: p_i = 0 for i > k
4. Renormalize: p_i ← p_i / (p₁ + … + p_k)
Top-K is a hard, fixed-size cutoff. The problem: a fixed k might be too many tokens for a peaked distribution (wasting probability on unlikely tokens) and too few for a flat one (cutting off plausible options).
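The four steps above can be sketched in a few lines of numpy (the function name `top_k_filter` is ours):

```python
import numpy as np

def top_k_filter(probs: np.ndarray, k: int) -> np.ndarray:
    """Keep the k most likely tokens, zero the rest, renormalize."""
    out = np.zeros_like(probs)
    top = np.argsort(probs)[-k:]   # indices of the k largest probabilities
    out[top] = probs[top]
    return out / out.sum()

probs = np.array([0.5, 0.2, 0.15, 0.1, 0.05])
filtered = top_k_filter(probs, 2)   # only the 0.5 and 0.2 tokens survive
```

After filtering, the two surviving tokens share all the mass: 0.5/0.7 ≈ 0.714 and 0.2/0.7 ≈ 0.286.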
1. Sort tokens by probability: p₁ ≥ p₂ ≥ …
2. Find the smallest k such that: p₁ + p₂ + … + p_k ≥ p
3. Keep only these k tokens, zero out the rest
4. Renormalize
Top-P (nucleus sampling) adapts to the shape of the distribution. For peaked distributions, only a few tokens are kept. For flat distributions, many tokens are kept. The parameter p controls how much cumulative probability mass to retain.
Key insight: The same p value on a peaked vs flat distribution yields very different effective k values. This adaptive behavior is why top-p is generally preferred over top-k.
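A sketch of the adaptive behavior (the function name `top_p_filter` is ours; the example distributions are illustrative):

```python
import numpy as np

def top_p_filter(probs: np.ndarray, p: float) -> np.ndarray:
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = np.argsort(probs)[::-1]       # token indices, most likely first
    cumulative = np.cumsum(probs[order])
    k = np.searchsorted(cumulative, p) + 1  # smallest k with cumulative[k-1] >= p
    out = np.zeros_like(probs)
    keep = order[:k]
    out[keep] = probs[keep]
    return out / out.sum()

peaked = np.array([0.90, 0.05, 0.03, 0.02])
flat = np.array([0.30, 0.28, 0.22, 0.20])
few = top_p_filter(peaked, 0.9)   # nucleus = 1 token (effective k = 1)
many = top_p_filter(flat, 0.9)    # nucleus = all 4 tokens (effective k = 4)
```

Same p = 0.9, but the peaked distribution keeps one token while the flat one keeps four.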
For each already-generated token i:
z_i ← z_i / penalty (if z_i > 0)
z_i ← z_i × penalty (if z_i < 0)
Repetition penalty directly modifies the logits of already-generated tokens. Positive logits are divided by the penalty (making them less likely), and negative logits are multiplied (making them even less likely). A penalty of 1.0 = no effect. Typical values: 1.1–1.5.
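The two-branch rule can be sketched as follows (the function name and the example values are ours):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make already-generated tokens less likely: shrink positive logits,
    push negative logits further negative."""
    out = logits.copy()
    for i in set(generated_ids):
        if out[i] > 0:
            out[i] /= penalty   # positive logit -> smaller -> less likely
        else:
            out[i] *= penalty   # negative logit -> more negative -> less likely
    return out

logits = np.array([2.0, -1.0, 0.5])
penalized = apply_repetition_penalty(logits, generated_ids=[0, 1], penalty=1.2)
# token 0: 2.0 / 1.2; token 1: -1.0 * 1.2; token 2 untouched
```

With penalty = 1.0 both branches are no-ops, matching "1.0 = no effect" above.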
Perplexity = exp of the average negative log-probability the model assigned to the true tokens; intuitively, the weighted average number of tokens the model was choosing between. A perplexity of 10 means the model was, on average, as uncertain as if it were choosing uniformly among 10 options.
Human-written English: perplexity ~20–50. Highly predictable text (code, facts): perplexity ~5–15. Random text: perplexity ~50,000 (vocab size).
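The "choosing between N options" intuition in code (a sketch; the function name is ours):

```python
import numpy as np

def perplexity(token_probs: np.ndarray) -> float:
    """exp of the average negative log-probability of the observed tokens."""
    return float(np.exp(-np.mean(np.log(token_probs))))

# A model that always assigns p = 0.1 to the true token is
# "choosing between 10 options" -> perplexity 10:
ppl = perplexity(np.array([0.1, 0.1, 0.1]))
```
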
Entropy measures the "uncertainty" or "information content" of a distribution. Maximum entropy = log₂(50,000) ≈ 15.6 bits for a 50k vocabulary (uniform distribution). Typical LLM entropy per position: 2–6 bits.
Relationship to perplexity: perplexity = 2^H when entropy H is measured in bits. An entropy of 3 bits = perplexity of 2³ = 8. This is why entropy and perplexity tell the same story in different units.
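The 2^H relationship, checked numerically (the function name `entropy_bits` is ours):

```python
import numpy as np

def entropy_bits(probs: np.ndarray) -> float:
    """Shannon entropy in bits, ignoring zero-probability entries."""
    p = probs[probs > 0]
    return float(-np.sum(p * np.log2(p)))

uniform8 = np.full(8, 1 / 8)
H = entropy_bits(uniform8)   # uniform over 8 options -> 3 bits
ppl = 2 ** H                 # perplexity = 2^H -> 8
```
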
KL divergence measures how different two distributions are: D_KL(P ‖ Q) = Σ_i p_i log(p_i / q_i). It's not symmetric: D_KL(P ‖ Q) ≠ D_KL(Q ‖ P). KL = 0 means the distributions are identical. Higher KL = more different.
Use this to quantify "how much does temperature change the distribution?" Comparing the T = 1.0 distribution against a mildly tempered one (T close to 1) typically gives KL ~ 0.1–0.5 bits, while a strongly tempered one (T well below 1) gives KL ~ 1–3 bits.
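A sketch of that measurement (function names and example logits are ours; exact KL values depend on the distribution):

```python
import numpy as np

def softmax(logits):
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def kl_bits(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(P || Q) in bits; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

logits = np.array([3.0, 1.0, 0.2, -1.0])
p1 = softmax(logits)              # T = 1.0
p_mild = softmax(logits / 0.9)    # mild tempering
p_strong = softmax(logits / 0.5)  # strong tempering

kl_mild = kl_bits(p1, p_mild)      # small shift
kl_strong = kl_bits(p1, p_strong)  # much larger shift
```

Note the asymmetry: `kl_bits(p1, p_strong)` and `kl_bits(p_strong, p1)` generally differ.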
This whole pipeline is what model.generate() does millions of times per day.