
5 Approaches to Solve LLM Token Limits


Introduction

Large language models (LLMs) like ChatGPT are gaining traction in the Natural Language Processing (NLP) domain. These models draw on an extensive vocabulary and broad learned knowledge to understand text and generate human-like responses. LLMs can understand contextual information, remember previous inputs, and generate responses at different levels of technicality (e.g., explaining to a 5-year-old vs. a college graduate).

Training an LLM requires a large text corpus, a robust neural network architecture, and a tokenizer module to break down the text. The tokenized text is passed to a neural network with a self-attention mechanism, which helps it focus on the key aspects of the text. The text tokens are stored in memory before the network processes them. The larger the number of tokens, the more memory is required, so LLMs limit the number of tokens they can process.

The token limit places certain boundaries on the applications of the language model. This article will discuss a few methods to work around the token limit problem. But first, let’s understand what a token is.

What Is a Token?

A token is the fundamental building block of a language model. It is generated by splitting a large text corpus into smaller bits. The tokenization process is vital to training NLP models as it reduces the input complexity, and the tokens can be converted into embeddings that the model can understand.

There is no strict definition of a token. It can be a word, a group of words, punctuation, or even part of a word (a subword). According to OpenAI’s guidance for the tokenizer behind ChatGPT, some general rules of thumb for tokens are:

  • 1 token ~= 4 chars in English
  • 1 token ~= ¾ words
  • 100 tokens ~= 75 words

Or

  • 1–2 sentences ~= 30 tokens
  • 1 paragraph ~= 100 tokens
  • 1,500 words ~= 2048 tokens

The OpenAI tokenization tool displays how the model splits the text.
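You can also inspect the splits programmatically. Here is a short sketch using OpenAI’s tiktoken library with the cl100k_base encoding (the encoding used by GPT-3.5/GPT-4-class models); the sample sentence is only illustrative.

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

text = "It's a sunny day!"
token_ids = encoding.encode(text)

print(len(token_ids))                             # number of tokens in the text
print([encoding.decode([t]) for t in token_ids])  # how the text was split into tokens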

Each tokenization method has advantages and drawbacks and is used according to the model and application requirements. Common tokenization techniques include:

Word-based Tokenization

A text document is broken down into words to form the required tokens. This text splitting is also called rule-based tokenization since the process involves certain hard-coded rules to identify an individual word. The most common rule is splitting based on white space. Let’s take the following sentence as an example:

“It’s a sunny day!”

Splitting the sentence on white space yields the following tokens: “It’s”, “a”, “sunny”, “day!”. These words provide more information to the machine-learning model because they can be processed individually, as in the sketch below.
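Plain white-space splitting is a one-liner in Python:

sentence = "It's a sunny day!"
tokens = sentence.split()  # split on white space
print(tokens)              # ["It's", 'a', 'sunny', 'day!']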

Splitting Further

The sentence in the previous example contains a contraction and punctuation, so plain white-space splitting is clearly not the best choice. We can apply additional rules to tokenize such sentence structures. This way, “day!” is split into “day” and “!”; the exclamation mark hints to the model that the sentence carries emphasis. Likewise, the contraction “it’s” is split into “it” and “’s” to preserve the sense of the full form “it is.”
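Rule-based tokenizers such as NLTK’s word_tokenize apply exactly this kind of extra rule. A small sketch (the exact splits depend on the tokenizer’s rules):

import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')  # tokenizer models used by word_tokenize

print(word_tokenize("It's a sunny day!"))
# e.g. ['It', "'s", 'a', 'sunny', 'day', '!']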

Keras Tokenizer

The Keras Tokenizer class lives in the keras.preprocessing.text module. The tokenizer builds a vocabulary dictionary from the text passed to it and lower-cases the text by default. Here’s a Python example of how you can use the Keras tokenizer.

from keras.preprocessing.text import Tokenizer

documents = ['Alot of random text that is to be tokenized by the Keras tokenizer',
             'The Keras Tokenizer takes in multiple text documents to fit its Tokenizer',
             "It's a convenient way to preprocess text for NLP training"]

# Keep the 100 most frequent words and build the vocabulary from the documents
tk = Tokenizer(num_words=100)
tk.fit_on_texts(documents)
print(tk.word_index)
{'text': 1, 'to': 2, 'tokenizer': 3, 'the': 4, 'keras': 5, 'alot': 6, 'of': 7, 'random': 8, 'that': 9, 'is': 10, 'be': 11, 'tokenized': 12, 'by': 13, 'takes': 14, 'in': 15, 'multiple': 16, 'documents': 17, 'fit': 18, 'its': 19, "it's": 20, 'a': 21, 'convenient': 22, 'way': 23, 'preprocess': 24, 'for': 25, 'nlp': 26, 'training': 27}

The tokenizer indexes each element of the corpus. These elements can be further used to generate vector representations of the text.
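Continuing the listing above, the fitted tokenizer can map new text onto those indices with texts_to_sequences:

print(tk.texts_to_sequences(["Keras text to be tokenized"]))
# [[5, 1, 2, 11, 12]] using the word_index shown above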


Limitations on Token Inputs for LLMs

An LLM’s context window is the maximum number of input tokens it can accept, and each model has a different limit. Since the tokens are stored and processed in memory, the limit helps keep the model efficient and optimizes resource utilization. The limits of some GPT models are listed below.

Model | Token Limit
----- | -----------
GPT-3.5 Turbo | 4,096
GPT-4 | 8,192
GPT-4 32K | 32,768

Although a maximum token limit is necessary, it is one of the model’s defining parameters and constrains its performance and usability. An upper bound on the token count means the model cannot process any text beyond it: contextual information outside the max token window is ignored during processing, which can degrade the results. It also hinders users who need to process large text documents. Let’s discuss a few ways to work around the token limit.
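Note that the limit covers the prompt and the completion together, so it is worth checking how much room a prompt leaves for the reply. A small sketch using the tiktoken library (4,096 is GPT-3.5 Turbo’s limit from the table; the prompt is illustrative):

import tiktoken

CONTEXT_WINDOW = 4096  # GPT-3.5 Turbo

prompt = "Explain self-attention to a college graduate."
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
prompt_tokens = len(encoding.encode(prompt))

print(prompt_tokens, "prompt tokens,", CONTEXT_WINDOW - prompt_tokens, "tokens left for the completion")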

Working Around Token Limitations

Certain techniques can help in overcoming token limitations. Let’s discuss these in detail.

1. Truncation

The easiest way to bring your text within the max token limit is to clip it from either end. Clipping means removing words or sentences from the start or end of the text. It is a simple fix, but it comes at the cost of lost information: the model never sees the truncated text and might miss important context.

Truncation can be done on a character or word level, depending on the requirement. The following Python code shows how to truncate the text from the end of the sentence.

def truncate(long_text: str, i: int):
    # Drop the last i words from the text
    return ' '.join(long_text.split()[:-i])

txt = "This is probably a long sentence!"
short_text = truncate(txt, 3)
print(short_text)  # "This is probably"
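Truncation can also be done at the token level, so the cut matches the model’s own counting. A short sketch using the tiktoken library:

import tiktoken

def truncate_tokens(long_text: str, max_tokens: int, model: str = "gpt-3.5-turbo") -> str:
    # Encode, keep only the first max_tokens tokens, and decode back to text
    encoding = tiktoken.encoding_for_model(model)
    return encoding.decode(encoding.encode(long_text)[:max_tokens])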


2. Chunk Processing

Another method of processing long text bodies is by breaking the text into smaller chunks. Each chunk is passed as input to the LLM individually and produces independent results. The results are then combined to form a single output.

However, the final result is prone to errors since the individual chunks contain only part of the overall information, and stitching the final results may still leave gaps.
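A minimal sketch of the idea, assuming a hypothetical ask_llm helper that wraps whatever client call you use; the chunk size and overlap are illustrative:

def chunk_text(text: str, chunk_size: int = 3000, overlap: int = 200) -> list:
    # Split into word chunks that overlap slightly so context is not cut mid-thought
    words = text.split()
    step = chunk_size - overlap
    return [' '.join(words[start:start + chunk_size]) for start in range(0, len(words), step)]

def process_long_document(document: str) -> str:
    partial_answers = [ask_llm(chunk) for chunk in chunk_text(document)]  # ask_llm is a hypothetical client call
    return '\n'.join(partial_answers)  # naive stitching; gaps between chunks remain possible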

3. Summarize

Text can convey the same meaning in multiple forms, and a lengthier passage does not necessarily add value to the overall message. Summarize your text so that it fits within the model’s token limit while retaining its valuable information. The following examples show how intelligently summarizing text keeps a request within bounds.

Long_text = “It’s such a fine day today, The sun is out, and the sky is blue. Can you tell me what the weather will be like tomorrow?”

Short_text = “It's sunny today. What will the weather be like tomorrow?”

Shorter_text = “Tell me the weather forecast for tomorrow”

The three versions of the text each ask the LLM about tomorrow’s weather forecast in a different way. Notice how `Short_text` and `Shorter_text` ask the same question with significantly fewer tokens. Summarizing text this way still yields the relevant output.
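Summarization itself can be automated. A sketch using the Hugging Face transformers summarization pipeline; the bart-large-cnn checkpoint and the length settings are illustrative choices, and in practice you would summarize a much longer document than this example:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

long_text = ("It's such a fine day today, The sun is out, and the sky is blue. "
             "Can you tell me what the weather will be like tomorrow?")

summary = summarizer(long_text, max_length=30, min_length=5, do_sample=False)
print(summary[0]["summary_text"])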

4. Remove Redundant Terms

Stop word removal is a common technique in NLP to reduce the corpus size. Stop words are low-information terms like “to” and “the” that appear frequently in text. They are important for sentence formation, but modern LLMs focus more on key terms. We can simply write the following:

“Weather forecast tomorrow Texas”

The LLM will analyze the terms and pick out actions closest to these entities. It will have enough information to understand that you are asking for tomorrow’s weather forecast in Texas.

However, this method is not reliable with complex sentences. Before applying this technique, manually verify that the stripped-down sentence still conveys its true meaning. Otherwise, the query will produce incorrect results.

The Python library NLTK provides a helpful collection of stop words for removal.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
 
nltk.download('stopwords')
nltk.download('punkt')

long_sentence = "It's such a fine day today, The sun is out, and the sky is blue. Can you tell me what the weather will be like tomorrow?"

word_tokens = word_tokenize(long_sentence)

# Build the stop-word set once and compare case-insensitively so "It" and "The" are also removed
stop_words = set(stopwords.words('english'))
short_sent = ' '.join([t for t in word_tokens if t.lower() not in stop_words])

print(short_sent)


5. Fine-tuning Language Models

Fine-tuning refers to training a model to perform better on niche tasks. A fine-tuned LLM produces better results for specific problems with less input data, so prompts can stay within the token limit.

Fine-tuning involves taking a model’s existing weights and continuing to train it on specific data. The model develops a richer understanding of the new information and performs well on similar cases. Providers of popular LLMs, such as OpenAI for ChatGPT, publish guides for fine-tuning their models. Moreover, Hugging Face provides an easy, no-code tool called AutoTrain for LLM fine-tuning. The tool lets you select the relevant parameters and the desired open-source model from the Hugging Face Hub.
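For readers who prefer code over AutoTrain’s UI, here is a minimal fine-tuning sketch with the Hugging Face transformers Trainer; the distilgpt2 checkpoint, the two-sentence dataset, and the hyperparameters are illustrative assumptions, not a recommended recipe:

from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilgpt2"  # assumption: any small causal LM from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 style models have no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Tiny illustrative dataset of domain-specific text
train_ds = Dataset.from_dict({"text": ["Domain-specific example sentence one.",
                                       "Domain-specific example sentence two."]})
train_ds = train_ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_ds,
    # Causal language modeling: labels are the input tokens themselves
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()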

Summary

For NLP-related training, large text collections are broken down into simpler units called tokens. A token can be a word, punctuation, part of a word, or a collection of words forming a partial sentence. These tokens are converted into embeddings, which the model processes to understand the text.

Every LLM has a maximum limit on the number of tokens it can process. These limitations are placed to maintain model efficiency and control resource utilization but limit the model’s usability. LLMs cannot accept large documents like books, and the model loses out on important contextual information.

However, certain techniques can be used to work around the token limit. These include truncation, chunk processing, summarization, stop-word removal, and fine-tuning. They let users feed in the essential parts of their text, but often at the cost of some lost context and accuracy.
