Inside the Transformer
Ongoing explorations into how Transformers really work (TBC)
This is an ongoing blog where I explore and improve my understanding of the Transformer family.
I’ll keep sharing important things I discover about Transformers over time.
Attention
How to reduce the KV cache (rough size estimate sketched below)
https://www.spaces.ac.cn/archives/10091/comment-page-1
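To see why the KV cache is worth reducing, a quick back-of-the-envelope estimate helps: the cache grows linearly in layers, KV heads, head dimension, sequence length, and batch size. The config below is an illustrative assumption on my part (roughly LLaMA-2-7B-like), not a measurement.

```python
# Rough KV-cache size estimate for a decoder-only Transformer.
# The numbers are illustrative assumptions (roughly LLaMA-2-7B-like).
n_layers   = 32
n_kv_heads = 32        # MQA/GQA shrink this; MLA replaces it with a latent dim
head_dim   = 128
seq_len    = 4096
batch      = 1
bytes_per  = 2         # fp16 / bf16

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per
#          ^ factor 2: one K and one V tensor per layer
print(f"KV cache: {kv_bytes / 2**30:.2f} GiB")   # ~2 GiB for this config
```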
Self-attention
Multi-head attention (minimal implementation sketched after the links below)
https://arxiv.org/pdf/2002.07028
https://arxiv.org/pdf/2106.09650
https://medium.com/@hassaanidrees7/exploring-multi-head-attention-why-more-heads-are-better-than-one-006a5823372b
https://medium.com/@nirashanelki/the-secret-of-multi-head-attention-2fdb72208b7f
https://sanjayasubedi.com.np/deeplearning/multihead-attention-from-scratch/
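A minimal sketch of multi-head self-attention, written by me rather than taken from any of the links above. It assumes PyTorch, a model dimension divisible by the number of heads, and omits dropout and KV caching.

```python
import torch
from torch import nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention (no dropout, no KV cache)."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x, mask=None):          # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split the model dimension into independent heads
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))        # (batch, heads, seq, head_dim)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5
        if mask is not None:                  # mask convention: True = attend, False = block
            scores = scores.masked_fill(~mask, float("-inf"))
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(out)
```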
Linear Attention (sketch after the items below)
Nyströmformer / Performer
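A sketch of the core linear-attention trick, in the spirit of Katharopoulos et al. (2020): replace softmax with a non-negative feature map phi and reorder the matmuls so cost is linear in sequence length. This is my own non-causal toy version with the elu(x)+1 feature map, not the exact formulation of Nyströmformer or Performer.

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """Non-causal linear attention with the elu(x)+1 feature map.
    Shapes: q, k: (batch, heads, seq, d); v: (batch, heads, seq, e)."""
    q = F.elu(q) + 1                                   # phi(q) >= 0
    k = F.elu(k) + 1                                   # phi(k) >= 0
    kv = torch.einsum("bhsd,bhse->bhde", k, v)         # sum_s phi(k_s) v_s^T, no (seq x seq) matrix
    z = 1.0 / (torch.einsum("bhsd,bhd->bhs", q, k.sum(dim=2)) + eps)   # normalizer per query
    return torch.einsum("bhsd,bhde,bhs->bhse", q, kv, z)
```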
Flash Attention
Multi-head Latent Attention (simplified sketch below)
DeepSeek
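A very simplified sketch of the idea behind DeepSeek's Multi-head Latent Attention: compress the hidden state into a small latent, cache only that latent, and re-expand K and V from it. This is my own toy rendering; the real MLA also has a decoupled RoPE branch and absorbs the up-projections into other matrices, all of which I omit here (dims and names are my assumptions).

```python
import torch
from torch import nn

class LatentKVAttentionSketch(nn.Module):
    """Toy sketch of MLA-style KV compression: only c_kv is cached,
    K and V are re-expanded from it (no causal mask, no RoPE here)."""
    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, d_model // n_heads
        self.q_proj  = nn.Linear(d_model, d_model, bias=False)
        self.kv_down = nn.Linear(d_model, d_latent, bias=False)   # compress
        self.k_up    = nn.Linear(d_latent, d_model, bias=False)   # expand
        self.v_up    = nn.Linear(d_latent, d_model, bias=False)
        self.out     = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):                          # x: (batch, seq, d_model)
        b, t, _ = x.shape
        c_kv = self.kv_down(x)                     # (batch, seq, d_latent) -- this is what gets cached
        q = self._split(self.q_proj(x), b, t)
        k = self._split(self.k_up(c_kv), b, t)
        v = self._split(self.v_up(c_kv), b, t)
        attn = (q @ k.transpose(-2, -1) / self.head_dim ** 0.5).softmax(-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, t, -1))

    def _split(self, z, b, t):
        return z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
```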
Scaling laws
Stacking attention layers to achieve better performance, via @Shuangfei Zhai’s tweets.
Masks
Causal / Chunk-based causal / Bi-directional (sketch below)
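A small sketch of the three mask types, using the same convention as the attention sketch above (True = allowed to attend). The chunked variant shown here (bidirectional within a chunk, causal across chunks) is my interpretation of "chunk-based causal".

```python
import torch

def causal_mask(t: int):
    """Standard left-to-right causal mask."""
    return torch.tril(torch.ones(t, t, dtype=torch.bool))

def chunked_causal_mask(t: int, chunk: int):
    """Bidirectional within each chunk, causal across chunks:
    a token sees its own chunk plus all earlier chunks."""
    idx = torch.arange(t) // chunk
    return idx[:, None] >= idx[None, :]

def bidirectional_mask(t: int):
    """Every token attends to every token (encoder-style)."""
    return torch.ones(t, t, dtype=torch.bool)

print(chunked_causal_mask(6, 2).int())
```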
Position embedding
Sinusoidal
Limited sequence length? Uniformity of the PE: is the difference between positions 1 and 2 represented the same way as between positions 2 and 500? (Does it break once positions go beyond the longest wavelength?)
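For reference, a minimal implementation of the sinusoidal encoding from "Attention Is All You Need" (my own sketch, assuming an even model dimension):

```python
import torch

def sinusoidal_pe(seq_len: int, d_model: int, base: float = 10000.0):
    """PE[pos, 2i]   = sin(pos / base^(2i/d_model))
       PE[pos, 2i+1] = cos(pos / base^(2i/d_model))"""
    assert d_model % 2 == 0
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]        # (seq, 1)
    i   = torch.arange(0, d_model, 2, dtype=torch.float32)[None, :]  # (1, d/2)
    angles = pos / base ** (i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```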
Absolute PE / Learnable PE / ALiBi
RoPE
Invariant to shift: attention scores depend only on the relative offset between positions.
1D / 2D RoPE (ViT) / β-base encoding
RoPE is a rotary transformation applied to the Query (Q) and Key (K) in attention (see the sketch after these notes).
Is RoPE the only PE that applies to linear attention so far?
The RoPE base that LLaMA 3 chose (500,000, up from the usual 10,000)
How to extend RoPE to multimodal inputs?
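A minimal sketch of applying RoPE to Q or K (my own version, assuming an even head dimension): rotate consecutive channel pairs by a position-dependent angle, so that dot products between queries and keys depend only on the position offset.

```python
import torch

def apply_rope(x, base: float = 10000.0):
    """Rotate consecutive channel pairs of q or k by a position-dependent angle.
    x: (batch, heads, seq, head_dim), head_dim even."""
    b, h, t, d = x.shape
    pos  = torch.arange(t, dtype=torch.float32)                        # (seq,)
    freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    angle = pos[:, None] * freq[None, :]                               # (seq, d/2)
    cos, sin = angle.cos(), angle.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                                # pair up channels
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
# <apply_rope(q), apply_rope(k)> scores depend only on the offset m - n (shift invariance).
```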
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation (ALiBi)
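ALiBi drops position embeddings entirely and instead adds a per-head linear penalty to the attention scores. A small sketch of the bias (my own, assuming the number of heads is a power of two so the paper's geometric slope schedule applies directly):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int):
    """Per-head bias -m * (i - j) added to attention scores before softmax
    (Press et al.). Slopes follow the paper's geometric sequence."""
    slopes = torch.tensor([2 ** (-8 * (h + 1) / n_heads) for h in range(n_heads)])
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    dist = (i - j).clamp(min=0)               # causal use: only past positions are penalized
    return -slopes[:, None, None] * dist      # (heads, seq, seq)
```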
Rotary Positional Encoding (RoPE) has an inductive bias towards left-to-right ordering (Sitan Chen’s “Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions”).