In-context learning (ICL) is an emerging capability of large language models that enables them to perform statistical inference at test time without parameter adaptation. Current theoretical studies of ICL typically analyze the mechanism of statistical learning tasks such as supervised learning. We instead study the mechanism of in-context learning for language translation, which is more closely aligned with the main application of LLMs in NLP. Our study is based on discrete optimal transport (OT) theory, namely the following entropy-regularized linear program [Cuturi 2013] $$ P_\lambda = \arg \min_{P} \sum_{i=1}^n \sum_{j=1}^n \| x_i - y_j \|^2 P_{ij} + \lambda \underbrace{\sum_{i=1}^n \sum_{j=1}^n P_{ij} \log(P_{ij})}_{\text{negative entropy (regularization)}} \; \text{subject to $P$ being a doubly stochastic matrix,}$$ where \( x_1, y_1, \dots, x_n, y_n \in \mathbb{R}^d \). OT has many applications, from sorting to cell-perturbation analysis in bioinformatics. Here, we show that OT can also be used to explain the internal mechanism of LLMs for translation.
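For concreteness, here is a minimal NumPy sketch of the Sinkhorn iteration of [Cuturi 2013] for the doubly stochastic formulation above; the function name, iteration budget, and toy data are illustrative choices, not part of the paper.

```python
import numpy as np

def sinkhorn(X, Y, lam, n_iters=500):
    """Entropy-regularized OT plan P_lambda between point clouds X, Y
    under the doubly stochastic constraint (all row/column sums equal 1)."""
    n = X.shape[0]
    # cost matrix C[i, j] = ||x_i - y_j||^2
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / lam)                 # Gibbs kernel
    u, v = np.ones(n), np.ones(n)
    for _ in range(n_iters):
        u = 1.0 / (K @ v)                # rescale rows to sum to 1
        v = 1.0 / (K.T @ u)              # rescale columns to sum to 1
    return u[:, None] * K * v[None, :]   # P_lambda = diag(u) K diag(v)

# toy check: Y is a shuffled copy of X, so P_lambda should be near a permutation
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
print(sinkhorn(X, X[::-1], lam=0.1).round(2))
```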
Observation: OT and Translation. The following plot shows the OT solution \( P_0 \) (the plan for \( \lambda = 0 \)) when \( x_i \) and \( y_i \) are word embeddings of English and French words, respectively, taken from the first layer of the multilingual BERT pre-trained model.
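One way such a plot could be reproduced, shown as a hypothetical sketch (the word lists, the `bert-base-multilingual-cased` checkpoint, and first-sub-token pooling are illustrative assumptions, not necessarily the paper's exact setup): extract first-layer embeddings with HuggingFace `transformers` and solve the unregularized assignment with SciPy's `linear_sum_assignment`.

```python
import numpy as np
import torch
from scipy.optimize import linear_sum_assignment
from transformers import AutoModel, AutoTokenizer

en = ["dog", "house", "water", "book"]
fr = ["chien", "maison", "eau", "livre"]

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

def first_layer_embedding(word):
    """Hidden state of the word's first sub-token after the first encoder layer."""
    ids = tok(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[1]   # index 0 = embeddings, 1 = layer 1
    return hidden[0, 1].numpy()                  # position 0 is [CLS]

X = np.stack([first_layer_embedding(w) for w in en])
Y = np.stack([first_layer_embedding(w) for w in fr])

C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared-distance cost
rows, cols = linear_sum_assignment(C)                # unregularized plan P_0 (a permutation)
print([(en[i], fr[j]) for i, j in zip(rows, cols)])
```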
There exists a choice of attention parameters, independent of \( n \), \( d \), and the input, such that the attention weights at layer \( \ell \) approximate the regularized optimal transport solution \( P_\lambda \) with error \( O\!\left( \frac{\sqrt{n}\, e^{1/\lambda}}{\sqrt{\ell}} \right) \).
We prove that each attention layer can implement one step of gradient descent (with an adaptive step size) on the dual Lagrangian function of the OT problem. This construction relies on a particular encoding of the inputs \( x_i, y_i \) in the word embeddings and a constructive choice of parameters. By induction, several attention layers can simulate several iterations of gradient descent, thereby optimizing the dual program iteratively.
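As a rough numerical companion to this construction (not the paper's encoding or adaptive step size; the function name and constants below are illustrative), one can run plain gradient ascent on the entropic dual and read the transport estimate off a softmax-like kernel, with each loop iteration playing the role of one attention layer.

```python
import numpy as np

def dual_gradient_steps(C, lam, depth, step):
    """Run `depth` gradient-ascent steps on the dual of entropic OT
    (unit marginals). The kernel exp((f_i + g_j - C_ij) / lam) has the
    softmax structure that attention weights can realize."""
    n = C.shape[0]
    f, g = np.zeros(n), np.zeros(n)          # dual potentials
    for _ in range(depth):
        P = np.exp((f[:, None] + g[None, :] - C) / lam)
        f += step * (1.0 - P.sum(axis=1))    # row-marginal residual
        g += step * (1.0 - P.sum(axis=0))    # column-marginal residual
    return np.exp((f[:, None] + g[None, :] - C) / lam)

rng = np.random.default_rng(1)
X = rng.uniform(size=(6, 2))
C = ((X[:, None, :] - X[None, ::-1, :]) ** 2).sum(-1)
P = dual_gradient_steps(C, lam=0.5, depth=2000, step=0.05)
# marginal violation shrinks as the depth grows
print(np.abs(P.sum(1) - 1).max(), np.abs(P.sum(0) - 1).max())
```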
The current analysis requires O(ε⁻²) depth to reach O(ε) accuracy, whereas Sinkhorn needs only O(log(1/ε)) iterations. Tightening this gap remains an open problem.
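A small numerical illustration of that gap, with ad hoc step size and iteration budget rather than anything from the paper's analysis: after the same number of iterations, Sinkhorn's alternating dual updates typically drive the marginal violation many orders of magnitude lower than plain gradient steps on the same dual.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 8, 0.5
X, Y = rng.uniform(size=(n, 2)), rng.uniform(size=(n, 2))
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def plan(f, g):
    return np.exp((f[:, None] + g[None, :] - C) / lam)

def marginal_error(P):
    return max(np.abs(P.sum(1) - 1).max(), np.abs(P.sum(0) - 1).max())

iters = 50

# Sinkhorn: exact alternating maximization over the dual potentials
f, g = np.zeros(n), np.zeros(n)
for _ in range(iters):
    f -= lam * np.log(plan(f, g).sum(axis=1))
    g -= lam * np.log(plan(f, g).sum(axis=0))
sinkhorn_err = marginal_error(plan(f, g))

# plain gradient ascent on the same dual (one step per "layer")
f, g = np.zeros(n), np.zeros(n)
for _ in range(iters):
    P = plan(f, g)
    f += 0.05 * (1.0 - P.sum(axis=1))
    g += 0.05 * (1.0 - P.sum(axis=0))
gradient_err = marginal_error(plan(f, g))

print(f"after {iters} iterations: sinkhorn {sinkhorn_err:.1e}, gradient {gradient_err:.1e}")
```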
A transformer with fixed parameters can solve OT for arbitrary n. What is the minimal n needed during training for the model to learn the assignment task?
We demonstrate the role of prompt engineering for in-context assignment. This suggests a computational expressivity perspective on why prompt engineering works.
Can standard softmax attention solve assignment with a general cost function C(xᵢ, yⱼ)? We conjecture that attention combined with feedforward layers can handle general cases.
@inproceedings{daneshmand2026incontextOT,
  title     = {In-Context Learning for Discrete Optimal Transport: Can Transformers Sort?},
  author    = {Daneshmand, Hadi},
  booktitle = {Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS)},
  year      = {2026},
  publisher = {AISTATS 2026}
}