Large Language Models: Scaling New Heights

Abstract

In this article we derive the fundementals behind Large Language Models from first principles. We look at the history of language models/chatbots and important steps in Machine Learning research that enabled the development of LLMs. Furthermore, we look into what it takes to train an LLM, including training data and computing resources and explore the evolution of the architecture of LLMs to understand why ever more data and compute seem to be required to train better models. This will also enable us to understand the limitations of LLMs, when to use them, and how to leverage their capabilities effectively. Lastly, we look at the current state of the art in LLMs, how recent advances in performance have been achieved and what the future might hold for the field.

Posted: 05 May 2026

Introduction

Large Language Models have seen an incredible rise in popularity among a wide range of users in the last five or so years, following the release of ChatGPT [1] . They are being extensively used in a wide range of applications, from chatbots to code generation to content creation and more.
What seems to appeal to many users is the ability to converse with a machine in a natural way, asking it questions, requesting it to perform tasks, or simply having a chat. With this comes the probably human urge to anthropomorphise the interaction, i.e. personifying the model, attributing personality, emotions and maybe even intelligence to the text generation [2] . This anthropomorphisation leads to a rift between perceived intelligence/correctness of the chatbot answers and the actual underlying stochastic nature of the text generation. However, if we take a look back at the history of Language Processing and the development of artificial intelligence in this regards, we find that this is not just a recent development.

Brief History of LLMs

Brushing over some details to keep this section brief, let us take a look at the beginning of "Chatbots".

What stands out is that already with ELIZA, a "chatbot" that would look at the user input and dissect it according to a set of rules and generate an answer based on the language structure, users would anthropomorphise the interaction, attributing human-like understanding and emotions to the machine [2] . While it was not a feature to get the users of ELIZA captured through attribution of a personality to the machine, modern applications use exactly this attachement to keep users active.

Also noteworthy about ELIZA is the approach used to generate text: a dissection of the input, basically classifying what the user wrote into subject, object, verbs etc, to then use these parts to generate a fitting answer. Until the 2010s, one leading stream of Natural Language Processing kept this approach, basically labelling all parts of text input to then further process them. This is in line with how Artificial Intelligence (AI) research was conducted in general up until this time. But once the problem of how to (efficiently) train neural networks was solved and software became available to use Graphical Processing Units (GPUs) for training and inference, DeepLearning, i.e. machine learning with many layers of neural network components, exploded in popularity. This, in combination with the publication of more advanced architecture blocs, such as the Transformer in 2017 led to more and more capable machine learning models [6] .

OpenAI used the transformer in its architectures to build the first Generative Pre-trained Transformer in 2018. Back then the company still published research articles describing their model and even code and weights (which are needed to genearte text with the model without training it first) [7] . The follow-up models, i.e. GPT-2 and GPT-3, used essentially the same architecture, but were scaled up significantly in terms of number of parameters and training data, leading to significant improvements in performance [9] [10] . This leads us to the main topic of this article: the scaling of models. How far can we scale and what are the limiting factors to scaling models?

Model Scale Over Time

If we look at the evolution of LLMs over time and pay particular attention to the size of the models (measured in number of parameters in the model, i.e. how many numbers do we have to adjust during training and store to run the model), we can see a clear trend towards larger and larger models.

Log-scale y-axis Show model labels

Source: LLM Models Data Spreadsheet

As OpenAI had figured out with their initial GPT models, just scaling the model up, i.e. using the same architecture but with more layers and thus more trainable parameters, would improve its performance. However, to train a larger, optentially more capable model, tends to require more data to train the model on. To understand what data goes into the models, and how the model "sees" the data, we have to first understand how the data is prepared for training, how models are built and trained. Afterwards we will see how they become parts of applications, i.e. chatbots, ChatGPT and others.

Training Data

To get from a neural network architecture to a working language model, we first have to train the model using text data. The plot above shows the size of the text corpus used to train the model as the size of the circle. Note that larger models tend to require more data for training. Below, we visualize the same models colored by their training dataset composition.

Source: LLM Models Data Spreadsheet

The oldest models in the graph, i.e. GPT-1 and GPT-2, used only books as a source, then complementing with wiki entries and reddit scrapes. More recent models have been trained on as much text as is available on the internet, scraping websites and using repositories of text data, such as Library Genesis to put together massive datasets. The most recent models, however, require more training data than is available through internet text alone, including copyrighted material that is being used in the training. In these cases, synthetic data is added to the mix. The latter could, for example, be generated by another LLM to reach a ceratin size of training data. The size of modern training datasets, over 10 trillion tokens, sometimes up to 100 times this size, amounts to hundreds of Terabytes of uncompressed text [11] . Having this much text data from all over the internet also means that, inevitably, harmful content makes its way into these datasets. To filter out content that the model should not train on, companies like OpenAI hire small armies of often underpaid workers through subcontractors to sift through the content, for text, images and video, and flag content that is deemed not appropriate according to company rules [12] . Another way to go about it is to train another Neural Network to do the filtering by providing examples of content that is supposed to be filtered.

Since most training datasets are not public, it is hard to know exactly what data was used to train which model. An example for a, nowadays relatively small but high quality dataset is the FineWeb dataset, which consists of 15 Trillion tokens and is about 44TB in size. To illustrate how training data comes together, let us walk through one real entry from FineWeb — row #7 of the CC-MAIN-2013-20 snapshot — and follow it from a raw webpage all the way to the integer token IDs that a model actually trains on.

A Real FineWeb Entry: From Webpage to Model Input

Common Crawl — raw web snapshot

The Common Crawl project continuously crawls billions of webpages and stores the raw HTTP responses in compressed WARC archive files on S3. Each WARC record bundles the URL, crawl date, and the full HTML (scripts, ads, navigation — everything).

WARC/1.0
<!DOCTYPE html>
<html>
<head><title>Five Reasons I Love Boston | 17 and Baking</title>
  <link rel="stylesheet" href="/wp-content/themes/..." />
  <script src="http://cdn.ads.example.com/tracker.js"></script>
</head>
<body>
  <nav id="header-nav"> ... hundreds of lines of menus, sidebars, comments ... </nav>
  <article id="post-1234">
    <h1>Five Reasons I Love Boston</h1>
    <p>1. The water. The Atlantic Ocean, as deep and true as denim ...</p>
    ...
  </article>
  <footer> ... ads, tracking pixels ... </footer>
</body>
</html>

Text extraction — strip boilerplate

A text extractor (e.g. Trafilatura) parses the HTML, removes navigation, scripts, ads, and comment threads, and keeps only the main article text. The result is a clean UTF-8 string — the text field in the FineWeb row.

Five Reasons I Love Boston
1. The water.
The Atlantic Ocean, as deep and true as denim, so blue it melts
into the sky, horizonless. And the Charles River. Years from now,
I'll remember riding the Red Line from Boston into Cambridge at
night — the way the lights streak across the black water like
crayons lined up in a box.
After my childhood in Seattle to my college years in Boston, I
don't think I could live anywhere but a coast.
2. The seasons.
I always come back to school right at the tail end of summer. ...
[... 550 tokens total ...]

Quality filtering — is this worth keeping?

FineWeb applies a pipeline of heuristic and model-based filters before a document is kept. The row above passes all checks and gets the following metadata attached:

Documents that fail — e.g. short log-in pages, spam, non-English text, or near-duplicates of other pages — are dropped at this stage.

Tokenisation — text → integer numbers

Before the text can be fed to a neural network, it must be converted into numbers. A tokeniser splits the text into small pieces — sub-words or characters — and assigns each a unique integer ID. The model never sees letters: it only ever works with sequences of these integers. Tokenisation is covered in detail in the next section.

Training — predicting the next word, over and over

Once the dataset is tokenised, training works by repeatedly asking the model one simple question: given everything so far, what word comes next? The model reads a chunk of text (a context window), makes a prediction, and its weights are nudged based on whether the prediction was right. This happens billions of times across trillions of tokens.

Context fed to the model Correct next word

"Five" → Reasons

"Five Reasons" → I

"Five Reasons I" → Love

"Five Reasons I Love" → Boston

… and so on for every position in every document → …

The model gets no explicit rules about grammar or facts — everything it learns about language comes from training this prediction task repeatedly.

Tokenisation

As mentioned in the explanation box above, all text is converted to numbers for the model to work with. What we call a "model" is just a series of operations with numbers, usually matrix multiplications and sums with some functions that transform the outputs at some intermediary points. To get the model input (the numbers or as we will call them from now on "tokens") for the text that has been curated as the dataset, a tokeniser is used. The tokenisation tries to balance the number of different tokens and the legnth of the input. Assuming we are trying to encode/tokenise a text of about 5000 unicode (UTF-8) characters (what your browser provably uses), if we were to keep a binary representation of the text for example, we would have two tokens, 0 and 1, which is what we would call "bit level encoding". With this encoding we would have a very long input sequence of roughly 40,000 bits of text, as each 0 and 1 would represent a single token. As it turns out, models that were trained on character level encoding are not competitive with models trained on word (++) level encoded training data. The GPT-4 tokenisation for example uses 100,277 different tokens, which for our 5000 character unicode text would result in about 1,300 tokens. This is close to word level encoding, but importantly spaces are an exception in this, meaning tokens are usually a word in combination with a space [9] .

Try Tokenising Text

Choose the cl100k_base encoding and enter some text to see the token output. Try entering some numbers, letters, or symbols to see how they are tokenised.

Source: TikTokenizer

Something interesting to note about the tokenisation is how it handles numbers as input. If the text is large numbers, they are split and encoded in packets of three digits. This and the fact that the words are encoded as a whole word are reasons why LLMs are not very good at doing math or even counting letters in a word. A way to counteract this is of course to include repeated sequences of counting examples in the training data, such that the model associates the tokens for numbers with certain words.

Model Training

As mentioned above in the text-preprocessing section, the training of the model consists of repeatedly asking the model to predict the next token in a sequence of tokens. Based on a measure of how far off the model's predictions are from the actual next token, the model's parameters are adjusted to improve its predictions. This means that all the numbers that the model is made up of that are used to multiply with and transform the input, are slightly adjusted based on the error of the model's predictions towards a lower error. This is called back-propagation and forms the base of how neural networks are trained in general, not just LLMs.

To truly appreciate the sheer number of parameters that can be adjusted during training, take a look at the following illustration of the first GPT models from bbycroft.

Explore the model structure

Choose one of the models (maybe start with the smallest: nano-gpt) and use the slider on the left to go through the model's layers as the input does. Compare the different sizes of the models to see what the "scaling" of the networks looks like.

Use the mouse to interact with the visualisation.Scrolling is locked when the mouse is over the iframe to allow for easier interaction with the model visualisation, but will be unlocked again when the mouse leaves the iframe area.

Source: bbycroft.net/llm

Note how the larger models use the same building blocks as the small models. This is at the core of the philosophy of scale: taking an established architecture and making it bigger, possibly with some optimisations to account for the larger numbers of parameters, use more training data and hopefully get a "better" model.

As an example for model scaling: GPT-2 (released in 2019) had around 1.5 billion paramters and was trained on 100 billion tokens (which cost an estimated 40,000 USD). Llama 3 (from 2024) has 405 billion parameters and was trained on 15 trillion tokens (cost an estimated USD100 million) [13] .

However, as you saw in the graph of the first section, models first grew exponentially and became a lot more capable with scale, but it seems that we have reached a point where it is no longer possible to just scale further. For one, the models have to be trained using real physical resources (and a lot of money) and deploying ever larger models across the world to serve customers (you) also takes more and more resources. Every query to a language model will use more memory and electricity as the model grows. What is more is that the models are never trained on the most recent text from for example news outlets. By the time they are trained, some of the text are able to generate might be based on outdated training data. Lately this has not been a very visible problem as new models are being released quite frequently but the early GPT models, for example, were blamed to not being able to recite recent historical accounts. On the other hand, training and retraining a scaled version of the biggest models becomes prohibitively expensive and the returns in terms of performance improvements seem to be diminishing.

This is a good place for a reminder that the models do not have any concept of "facts" or meaning of words in general. What they generate is based on how probable it is that the words follow each other. So after the initial training of the model (as in Generative Pre-trained Transformer), the result is an "internet text generator" or an expensive auto-complete if you will that is trained to imitate the training data. So you might ask: how is a large language model like the one behind ChatGPT able to answer questions a user submits as input? This is the real advancement in language processing where OpenAI figured out that finetunining such a general model can lead to very capable chatbot/assistant like models.

Post-Training or Fine-Tuning

In order to turn a text generator into an assistant the pre-trained model can be fine-tuned on a smaller dataset that is more specific to a certain task, such as answering questions in a chatbot. This requires a way more carefully cureated dataset than the initial training for the text generator. Companies like OpenAI pay experts in their field to generate question and answer pairs, or even conversations with questions about a topic. This is then the "ground-truth" that is being used to fine-tune a model after the first training. The multiturn question/answer datasets and the post-training process using this data is described as "Reinforcement Learning with Human Feedback" [14] .

This phase of the training also shapes how the model generates answer, in what tone and "character", such as an optimistic, supportive and helpful assistant. An example dataset is the OpenAssistant dataset, a large collection of human generated conversations that are used to train assistant-like models.

Fine-Tuning Dataset: OpenAssistant Conversations Dataset

Use the dataset viewer below to see the conversations used for fine-tuning the model from the dataset that is available on HuggingFace.

Source: HuggingFace - OpenAssistant Dataset

Additonally, after training, most models that are used in an application, get additional text input that specifies how to behave and answer questions, before the user inputs any text. This can be seen in the standard Tokenisation example below. With the gpt-4o encoding, you can see system messages being injected into the "context window" before the user input.

Try Tokenising Text

Choose the gpt-4o encoding and see how the encoded text also includes system messages that have special delimiters to treat them differently than the user input.

Source: TikTokenizer

This post-training fine-tunining gets us from internet text generator to language model ready to include in an assitant application like ChatGPT.

Sharp Edges

While the model might generally generate output that sounds correct and satisfies the users of an application where the model is embedded, it remains a stochastic token generator. That means the same input submitted twice will generate most likely two different token sequences, and accordingly two different answers.

A few other sharp edges to be aware of are listed below.

Major Problem Points

Hallucination

Model does not "not know" anything, but if fine-tuned and instructed to say so will generate a statistically similar answer. The "boundary" of knowledge can be found with probing.

Models can't count

Tokenization makes representation for model complicated. If question aims for single token answer, model will likely fail to produce correct token.

Models can't spell

Model operates on token level, not character level.
Ideally training would be with each letter as token, but sequences would be too long.

Model needs tokens to think

Logic and computation must be distributed over many tokens (forward passes). For logic and math, try to use code instead of plain text descriptions.

A lot of these problems can be counteracted with different strategies, such as detecting certain types of statements and offloading them to other tools behind the scenes, such as running code snippets or searching on google and then summarising the results in the chat application. Especially the last point, that models need many tokens to compute something that is of better quality in the output, can be helped with by artificially filling the context window, i.e. the input that goes to the model. For example, a lot of recent models generate text mimicing a conversation without a second party to get more tokens to work with. If you use any application that flashes "thinking" or something similar while output is being produced, chances are the context window is being filled before a summary of the "self-talk" is sent to you. We will discuss this strategy in a bit more detail in the next section. However, it remains a fundamental problem that the model now sounds and replies like a human is that it hides the nature of a token generator behind confidence. This brings us back to the anthropomorphisation and the model is certainly finetuned to be as engaging as possible so users want to keep using it.

Better Models Beyond Scaling

The question now is of course: How can we get better models if scaling might have reached a ceiling? Well, it seems most likely that true innovation in architecture and maybe training of the language models is required. Similarly to other parts of Machine Learning, fundamental improvements in architectural blocks, such as the transformer have largely changed the way models are built and how well they perform. But the focus on scaling and the engineering challenges that come with very large models used many of the resources that could have been used to research more fundamental questions regarding deep learning.

Here, we look again at the development of LLMs over time and focus this time on one of the many benchmark scores, the MMLU as a bit of a proxy of how "well" the models are generating text, and the context window size of the models. As a reminder: the context window holds the input that the model is working with, you may see it as a sort of "working memory" for the model. The MMLU is a test to measure how well a text model does on a variety of different tasks [18] . However, especially non-open-source synthetic datasets can be manipulated to inject the actual test data into the training such that the score is artificially high and the model performance does not generalise well. As always when evaluating Machine Learning models, multiple metrics should be compared and ideally confidence intervals on the metrics reported and taken into account.

MoE models Reasoning models

Source: LLM Models Data Spreadsheet

By using the filters for Mixture of Experts (MoE) models and "Reasoning" models above, you will find that these are the two most important innovations from the last few years with MoE models appearing from 2023 and "reasoning" becoming popular in 2025. They led to models achieving high scores in many metrics while aiding to more parameter efficient models, in different ways. Let's first look at the MoE models.

Mixture of Experts (MoE) Models

The idea behind Mixture of Experts (MoE) models is to have a large number of parameters (since many parameters usually means that a lot "information" can be stored in the model), but only use a fraction of them for each forward pass (token prediction) in order to be more computationally efficient. This is achieved by having different "experts" in the model that are specialised in different tasks and a "gating" or "routing" mechanism that decides which experts to use for each input. "Experts" here are really just a sub-set of the parameters in the model, that were trained to be useful for certain token predictions based on particular input tokens. The gating/routing is usually done by a smaller neural network first "routing" the input tokens, i.e. taking the input and deciding which experts (sub-network) to activate. Using this approach allows for very larg models with many parameters, but only a fraction of them are used for each input, making it more efficient to train and deploy.

Gating / Routing — try it yourself

Click a token to see how the gating network decides which two of the eight experts to activate. The scores are illustrative — a real gating network outputs continuous values learned during training.

The routing itself is not explicitly trained during the training process but it becomes more optimal to use only certain parts of the network and those will then in turn be re-inforced through the training. It is important to emphasize that the expert selection is per token, not per prompt or "task". As in the illustration above, a particular token leads to the gating network to activate (a set of) experts. The activations of the experts are then weighted and merged together.

Reasoning Models

A second type of moddel that recently became popular are "reasoning" models. The term "reasoning" is in this context not well defined and we assume here that it refers to multi-step answering of questions. This is usually the behaviour the model training tries to produce: break down a task defined in a prompt and work it out in multiple steps. Some applications will show a blinked "reasoning" or "thinking" while a model is producing this kind of answer. However, as discussed earlier, this remains an anthropomorphism.

The reasoning itself can be part of the interaction with the user or hidden, so to speak internal to the application, and then sumarised for the user. See the examples of reasoning in two different models below.

Model answers: “Emily buys 3 apples and 2 oranges. Each orange costs USD2. The total cost of all the fruit is USD13. What is the cost of apples?”

ChatGPT answer to the apple/orange maths question

One thing to observe is how easily this kind of reasoning model tends to "overthink", i.e. produce a lot of output to answer a question that would have a single token as the most efficient answer.

As presented in the DeepSeek R1 paper, building such a model can be achieved using Reinforcement Learning [19] . As dicussed above for general post-training of an LLM, Reinforcement Learning with Human Feedback (RLHF) is often being used to fine-tune a model towards become an assitant instead of remaining an internet text generator. In the case of DeepSeekR1 however, this fine-tuning with human feedback on an instruction dataset is being skipped and instead the model trains this fine-tuning with itself. To judge how well it does in generating the answers at this stage, two systems can be used, an LLM judge that rewards the correct format of the responses, for example marking "reasoning" steps with special tags. The other system can be, like for DeepSeek, be an outside system verifying responses to mathematical questions and coding related tasks. And, surprinsingly, this setup alone, without explicitly instructing the model to "reason" leads to the model learning that filling its context window (working memory) using this reasoning approach, produces better responses.

DeepSeek further improved this kind of model (named "DeepSeek-R1-Zero") by adding another stage of "supervised fine-tuning", i.e. post-training with instructions and another round of Reinforcement Learning. This process then produces "DeepSeek-R1", the best performing of the R1 models.

How DeepSeek-R1 is trained — stage by stage

Both models start from the same base, but take different paths after that.

■ DeepSeek-V3-Base A large language model pre-trained on web text, code and books — knows a lot, but hasn't been taught to reason step-by-step yet.

Path A - R1-Zero

↓

Step 1 Reinforcement Learning only The model tries to solve math and coding problems. It gets a simple reward: correct answer = positive, wrong answer = negative. No human examples of how to reason are given - the model has to figure it out itself.

↓

□ DeepSeek-R1-Zero Reasoning model -ut responses can be hard to read and sometimes mix English and Chinese mid-sentence.

Path B - DeepSeek-R1

↓

Step 1 Cold-start fine-tuning A small collection of hand-crafted examples shows the model what clear, readable step-by-step reasoning looks like — setting a good baseline before the RL training begins.

↓

Step 2 Reinforcement Learning Same reward-based training as R1-Zero, but now with an extra bonus for keeping the language consistent (no mid-response language switching).

↓

Step 3 Fine-tuning on a large dataset The team generated ~800,000 high-quality examples by having the RL model solve problems and keeping only correct answers. This mix includes both reasoning tasks (math, code) and general tasks (writing, Q&A) to make the model broadly helpful.

↓

Step 4 Final Reinforcement Learning A final round of reward-based training to balance strong reasoning with being an assitant.

↓

★ DeepSeek-R1 Matching OpenAI o1 on math and coding benchmarks.

Model Destillation

One last interesting point to cover regarding recent model developments is "destillation". The DeepSeek team tested whether applying the pure Reinforcement Learning approach could also help pre-trained smaller models (such as Llama and Qwen based models with fewer than 70B parameters) and "destilling" knowledge of a large model into a smaller model can improve performance. As it turns out, "destillation" is far more effective than just pure Reinforcement Learning on a small model. "Destillation" refers in this context to training on an instruction dataset that has been prepared, not by human experts, but by another LLM, such as DeepSeek-V3. The fine-tuning on this dataset seems to have a significant positive impact on the model performance.

Smaller models which have been destilled (read: fine-tuned) with a dataset generated by a larger LLM, generally perform very well, making this approach very useful to building efficient but capable models when larger models are already available.

Do's and Dont's for Modern Models

We end this writeup with a few practical recommendations on how to use LLMs effectively. These are grounded in how LLMs actually work under the hood — not necessarily how most people use them.

✕ Don't treat it as a search engine

LLMs don't look facts up (some applications may and then summarise with an LLM) - they generally generate text based on patterns in their training data. As they are fine-tuned to be "helpful assitants" the answers can be confident but wrong. Always verify factual or knowledge-based outputs with a reliable source.

✓ Work collaboratively in steps

LLMs work best as a back-and-forth collaborator, not a single-shot oracle. Instead of asking one massive question and hoping for a perfect answer or perfectly generated code based on a single prompt, break your task into smaller pieces: ask it to outline a plan first, then tackle each part, review as you go and correct mistakes early. Modern AI agents (like those built into coding assistants) try to do exactly this as well.

✕ Don't use reasoning models for simple tasks

Reasoning models like DeepSeek-R1 or OpenAI o1 are designed for problems that require long chains of thought such as maths proofs, logic puzzles, complex coding challenges. For quick questions, drafting emails, or summarising text, a standard model is faster and cheaper.

✓ Let it write code for calculations — don't trust its raw numbers

As covered in the tokenisation section, LLMs don't actually compute numbers - they predict tokens. Generated code can be run and verified, rather than relying on a number the model had to produce as a token.

Always consider that training of these models and inference (text generation) with these models uses real-world resources (material, energy and water) and is based on sometimes questionable work practices and ethics and use them responsibly [20] .

Introduction

Brief History of LLMs

Model Scale Over Time

Training Data

A Real FineWeb Entry: From Webpage to Model Input

Tokenisation

Try Tokenising Text

Model Training

Explore the model structure

Post-Training or Fine-Tuning

Fine-Tuning Dataset: OpenAssistant Conversations Dataset

Try Tokenising Text

Sharp Edges

Major Problem Points

Better Models Beyond Scaling

Mixture of Experts (MoE) Models

Gating / Routing — try it yourself

Reasoning Models

Model answers: “Emily buys 3 apples and 2 oranges. Each orange costs USD2. The total cost of all the fruit is USD13. What is the cost of apples?”

How DeepSeek-R1 is trained — stage by stage

Model Destillation

Do's and Dont's for Modern Models

References