In-context learning (ICL) is an emerging capability of large language models that enables them to perform statistical inference at test time without parameter adaptation. Current theoretical studies of ICL typically analyze the mechanism of statistical learning tasks such as supervised learning. We instead study the mechanism of in-context learning for language translation, which is more closely aligned with the main application of LLMs in NLP. Our study is based on discrete optimal transport (OT) theory, namely the following entropy-regularized linear program [Cuturi 2013] $$ P_\lambda = \arg \min_{P} \sum_{i=1}^n \sum_{j=1}^n \| x_i - y_j \|^2 P_{ij} + \lambda \underbrace{\sum_{i=1}^n \sum_{j=1}^n P_{ij} \log(P_{ij})}_{\text{negative entropy (regularization)}} \; \text{subject to $P$ being a doubly stochastic matrix,}$$ where \( x_1, y_1, \dots, x_n, y_n \in \mathbb{R}^d \). OT has many applications, from sorting to cell-perturbation analysis in bioinformatics. Here, we show that OT can also be used to explain the internal mechanism of LLMs for translation.
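For concreteness, here is a minimal NumPy sketch of the Sinkhorn iteration of [Cuturi 2013] for the doubly stochastic formulation above; the function name, iteration budget, and toy data are illustrative choices, not part of the paper.

```python
import numpy as np

def sinkhorn(X, Y, lam, n_iters=500):
    """Entropy-regularized OT plan P_lambda between point clouds X, Y
    under the doubly stochastic constraint (all row/column sums equal 1)."""
    n = X.shape[0]
    # cost matrix C[i, j] = ||x_i - y_j||^2
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    K = np.exp(-C / lam)                 # Gibbs kernel
    u, v = np.ones(n), np.ones(n)
    for _ in range(n_iters):
        u = 1.0 / (K @ v)                # rescale rows to sum to 1
        v = 1.0 / (K.T @ u)              # rescale columns to sum to 1
    return u[:, None] * K * v[None, :]   # P_lambda = diag(u) K diag(v)

# toy check: Y is a shuffled copy of X, so P_lambda should be near a permutation
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
print(sinkhorn(X, X[::-1], lam=0.1).round(2))
```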
Observation: OT and Translation. The following plot shows the OT solution \( P_0 \) (the plan for \( \lambda = 0 \)) when \( x_i \) and \( y_i \) are word embeddings of English and French words, respectively, taken from the first layer of the multilingual BERT pre-trained model.
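One way such a plot could be reproduced, shown as a hypothetical sketch (the word lists, the `bert-base-multilingual-cased` checkpoint, and first-sub-token pooling are illustrative assumptions, not necessarily the paper's exact setup): extract first-layer embeddings with HuggingFace `transformers` and solve the unregularized assignment with SciPy's `linear_sum_assignment`.

```python
import numpy as np
import torch
from scipy.optimize import linear_sum_assignment
from transformers import AutoModel, AutoTokenizer

en = ["dog", "house", "water", "book"]
fr = ["chien", "maison", "eau", "livre"]

tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased", output_hidden_states=True)

def first_layer_embedding(word):
    """Hidden state of the word's first sub-token after the first encoder layer."""
    ids = tok(word, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[1]   # index 0 = embeddings, 1 = layer 1
    return hidden[0, 1].numpy()                  # position 0 is [CLS]

X = np.stack([first_layer_embedding(w) for w in en])
Y = np.stack([first_layer_embedding(w) for w in fr])

C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared-distance cost
rows, cols = linear_sum_assignment(C)                # unregularized plan P_0 (a permutation)
print([(en[i], fr[j]) for i, j in zip(rows, cols)])
```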
There exists a choice of attention parameters, independent of \( n \), \( d \), and the input, such that the attention weights at layer \( \ell \) approximate the regularized optimal transport solution \( P_\lambda \) with error \( O\!\left( \frac{\sqrt{n}\, e^{1/\lambda}}{\sqrt{\ell}} \right) \).
We prove that each attention layer can implement one step of gradient descent (with an adaptive step size) on the dual Lagrangian function of the OT problem. This construction relies on a particular encoding of the inputs \( x_i, y_i \) in the word embeddings and a constructive choice of parameters. By induction, several attention layers can simulate several iterations of gradient descent, thereby optimizing the dual program iteratively.
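As a rough numerical companion to this construction (not the paper's encoding or adaptive step size; the function name and constants below are illustrative), one can run plain gradient ascent on the entropic dual and read the transport estimate off a softmax-like kernel, with each loop iteration playing the role of one attention layer.

```python
import numpy as np

def dual_gradient_steps(C, lam, depth, step):
    """Run `depth` gradient-ascent steps on the dual of entropic OT
    (unit marginals). The kernel exp((f_i + g_j - C_ij) / lam) has the
    softmax structure that attention weights can realize."""
    n = C.shape[0]
    f, g = np.zeros(n), np.zeros(n)          # dual potentials
    for _ in range(depth):
        P = np.exp((f[:, None] + g[None, :] - C) / lam)
        f += step * (1.0 - P.sum(axis=1))    # row-marginal residual
        g += step * (1.0 - P.sum(axis=0))    # column-marginal residual
    return np.exp((f[:, None] + g[None, :] - C) / lam)

rng = np.random.default_rng(1)
X = rng.uniform(size=(6, 2))
C = ((X[:, None, :] - X[None, ::-1, :]) ** 2).sum(-1)
P = dual_gradient_steps(C, lam=0.5, depth=2000, step=0.05)
# marginal violation shrinks as the depth grows
print(np.abs(P.sum(1) - 1).max(), np.abs(P.sum(0) - 1).max())
```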
The current analysis requires O(ε⁻²) depth to reach O(ε) accuracy, whereas Sinkhorn needs only O(log(1/ε)) iterations. Tightening this gap remains an open problem.
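A small numerical illustration of that gap, with ad hoc step size and iteration budget rather than anything from the paper's analysis: after the same number of iterations, Sinkhorn's alternating dual updates typically drive the marginal violation many orders of magnitude lower than plain gradient steps on the same dual.

```python
import numpy as np

rng = np.random.default_rng(0)
n, lam = 8, 0.5
X, Y = rng.uniform(size=(n, 2)), rng.uniform(size=(n, 2))
C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def plan(f, g):
    return np.exp((f[:, None] + g[None, :] - C) / lam)

def marginal_error(P):
    return max(np.abs(P.sum(1) - 1).max(), np.abs(P.sum(0) - 1).max())

iters = 50

# Sinkhorn: exact alternating maximization over the dual potentials
f, g = np.zeros(n), np.zeros(n)
for _ in range(iters):
    f -= lam * np.log(plan(f, g).sum(axis=1))
    g -= lam * np.log(plan(f, g).sum(axis=0))
sinkhorn_err = marginal_error(plan(f, g))

# plain gradient ascent on the same dual (one step per "layer")
f, g = np.zeros(n), np.zeros(n)
for _ in range(iters):
    P = plan(f, g)
    f += 0.05 * (1.0 - P.sum(axis=1))
    g += 0.05 * (1.0 - P.sum(axis=0))
gradient_err = marginal_error(plan(f, g))

print(f"after {iters} iterations: sinkhorn {sinkhorn_err:.1e}, gradient {gradient_err:.1e}")
```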
A transformer with fixed parameters can solve OT for arbitrary n. What is the minimal n needed during training for the model to learn the assignment task?
We demonstrate the role of prompt engineering for in-context assignment. This suggests a computational expressivity perspective on why prompt engineering works.
Can standard softmax attention solve assignment with a general cost function C(xᵢ, yⱼ)? We conjecture that attention combined with feedforward layers can handle general cases.
@inproceedings{daneshmand2026incontextOT,
  title     = {In-Context Learning for Discrete Optimal Transport: Can Transformers Sort?},
  author    = {Daneshmand, Hadi},
  booktitle = {Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS)},
  year      = {2026},
  publisher = {AISTATS 2026}
}