Large language models (LLMs) are remarkably efficient across a wide range of natural language processing tasks and well beyond them. However, a comprehensive theoretical analysis of the LLMs' generalization capabilities remains elusive. In our paper, we approach this task by drawing an equivalence between autoregressive transformer-based language models and Markov chains defined on a finite state space. This allows us to study the multi-step inference mechanism of LLMs from first principles. We relate the obtained results to the pathological behavior observed with LLMs such as repetitions and incoherent replies with high temperature. Finally, we leverage the proposed formalization to derive pre-training and in-context learning generalization bounds for LLMs under realistic data and model assumptions. Experiments with the most recent Llama and Gemma herds of models show that our theory correctly captures their behavior in practice.
Large language models as Markov chains
Submitted to ArXiV, 2 February 2025
Type:
Conference
Date:
2025-02-02
Department:
Communication systems
Eurecom Ref:
8082
Copyright:
© EURECOM. Personal use of this material is permitted. The definitive version of this paper was published in Submitted to ArXiV, 2 February 2025 and is available at :
See also:
PERMALINK : https://www.eurecom.fr/publication/8082