LLM Cost

Large language models (LLMs) are remarkable tools for understanding and generating human-like text, and they have become everyday tools for many organizations and individuals. Yet behind their impressive capabilities lies an unavoidable caveat: the significant cost of operating these models. To put things in perspective, the newly introduced GPT-4o model costs $5 per 1 million input tokens and $15 per 1 million output tokens. As request volume scales up, the cost increases rapidly and can become unmanageable. Effective strategies are required to reduce the operational costs of LLMs; this article explores a few promising techniques.
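To make the pricing concrete, here is a minimal sketch of the arithmetic using the GPT-4o prices quoted above. The function names and the workload (100,000 requests per day, 1,000 input and 500 output tokens each) are assumptions chosen only for illustration.

```python
# Prices quoted in this article for GPT-4o, in US dollars per 1 million tokens.
INPUT_PRICE_PER_MILLION = 5.00
OUTPUT_PRICE_PER_MILLION = 15.00


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the cost in dollars of a single request at the prices above."""
    return (
        input_tokens * INPUT_PRICE_PER_MILLION
        + output_tokens * OUTPUT_PRICE_PER_MILLION
    ) / 1_000_000


def monthly_cost(requests_per_day: int, input_tokens: int,
                 output_tokens: int, days: int = 30) -> float:
    """Return the cost in dollars of a steady workload over `days` days."""
    return requests_per_day * days * request_cost(input_tokens, output_tokens)


if __name__ == "__main__":
    # Hypothetical workload, not a measurement.
    print(f"per request: ${request_cost(1_000, 500):.4f}")
    print(f"per month:   ${monthly_cost(100_000, 1_000, 500):,.2f}")
```

Under these assumptions a single request costs $0.0125, which adds up to roughly $37,500 over 30 days. Note that the output tokens account for more of the bill than the input tokens even though there are half as many of them.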

Why do LLMs incur high costs?

Before looking at solutions, it’s crucial to understand why LLMs cost so much and which factors drive that cost. LLMs are complex and require high computational power to train and operate. Some of the key factors that affect costs are:

  • Size of the model
  • Number of requests made to the model
  • Amount of computational power required to process each request

LLM providers charge consumers based on the number of tokens processed. A token can be a word, a part of a word, or even a single character. The more tokens a request contains, the more it costs. Some providers opt for a tiered pricing model based on volume. This benefits large organizations with massive volumes, since the per-token rates are lower in the higher usage tiers. Generally speaking, though, the larger the model and the higher the number of tokens in a request, the steeper the operational costs become. Keeping these factors in mind, let’s look at some strategies to reduce the costs incurred by LLMs.
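Tiered, volume-based pricing can be sketched as follows. The tier thresholds and rates are invented for illustration and belong to no real provider; the sketch also assumes that the rate of the highest tier reached applies to the whole monthly volume, which is only one of several ways providers structure tiers.

```python
# Hypothetical tiers: (minimum monthly tokens, dollars per 1 million tokens).
# These numbers are made up for illustration only.
TIERS = [
    (0, 5.00),
    (100_000_000, 4.50),
    (1_000_000_000, 4.00),
]


def rate_for_volume(monthly_tokens: int) -> float:
    """Return the per-million rate of the highest tier the volume reaches."""
    rate = TIERS[0][1]
    for threshold, tier_rate in TIERS:
        if monthly_tokens >= threshold:
            rate = tier_rate
    return rate


def monthly_bill(monthly_tokens: int) -> float:
    """Return the bill in dollars, applying one rate to the whole volume."""
    return monthly_tokens / 1_000_000 * rate_for_volume(monthly_tokens)


if __name__ == "__main__":
    for volume in (50_000_000, 500_000_000, 2_000_000_000):
        print(f"{volume:>13,} tokens -> ${monthly_bill(volume):,.2f}")
```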

LLM cost optimization techniques

1. Change to smaller, task-specific models

The latest and greatest general-purpose LLMs may not always be required for your specific use case. Choosing a smaller, task-specific model can be beneficial in terms of cost. For narrow tasks such as named-entity recognition or text summarization, a smaller model fine-tuned for that task will often produce better results in its area of expertise than a larger generic counterpart.

Another strategy is to set up multiple agents or an LLM router so that a cascade of models can handle different types of questions. A cheaper model is used first to answer the question, and if the result is unsatisfactory, the question is passed on to a more expensive model. This technique offers the best of both worlds while significantly reducing operational costs. Such a setup can be observed in the image below.

[Image: a cascade of models behind an LLM router]

Image Credits – AI Jason
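The cascade described above can be sketched in a few lines of Python. Everything here is hypothetical: the two model functions are stand-ins for real API calls, and the self-reported confidence score with a fixed threshold is just one possible way to decide that an answer is unsatisfactory.

```python
from typing import Callable, List, Tuple

# A model is any callable returning (answer, confidence between 0 and 1).
Model = Callable[[str], Tuple[str, float]]


def cheap_model(question: str) -> Tuple[str, float]:
    """Stand-in for a small, inexpensive model; confident only on short questions."""
    if len(question.split()) <= 6:
        return f"cheap answer to: {question}", 0.9
    return "not sure", 0.3


def expensive_model(question: str) -> Tuple[str, float]:
    """Stand-in for a large, expensive model."""
    return f"detailed answer to: {question}", 0.95


# Ordered from cheapest to most expensive.
MODELS: List[Tuple[str, Model]] = [
    ("cheap", cheap_model),
    ("expensive", expensive_model),
]


def cascade(question: str, models: List[Tuple[str, Model]],
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try each model in order and stop at the first confident answer.

    Returns (model name, answer). If no model reaches the threshold,
    the answer of the last model tried is returned.
    """
    name, answer = "", ""
    for name, model in models:
        answer, confidence = model(question)
        if confidence >= threshold:
            break
    return name, answer


if __name__ == "__main__":
    print(cascade("What is a token?", MODELS))
    print(cascade("Compare tiered and flat pricing for a multilingual support chatbot", MODELS))
```

In this sketch the short question is answered by the cheap model, while the longer one falls through to the expensive model. In practice the quality check is the hard part, and it could be a classifier, a scoring model, or a rule set rather than a confidence number.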

2. Optimize LLM Prompts

Crafting concise and specific prompts reduces the number of tokens processed per request, which translates into lower costs. One powerful way to reduce prompt length is to use a prompt compression tool such as LLMLingua. LLMLingua uses a well-trained compact language model (e.g., GPT2-small or LLaMA-7B) to compress prompts into shorter, efficient representations without losing their original meaning. It can achieve up to 20x compression with minimal performance loss. A sample workflow that uses LLMLingua to compress prompts is summarized in the image below.

[Image: sample workflow using LLMLingua to compress prompts]

Image Credits – LLMLingua

3. Cache Responses

Traditional caching systems generally rely on an exact match for the query in the cache to determine whether the requested content is available. This is often ineffective for LLMs, because users rarely phrase the same question in exactly the same way. A semantic caching technique is more appropriate. Semantic caching stores queries and matches new queries against similar or related ones, increasing the cache hit probability and enhancing overall caching efficiency. This type of caching can be achieved using tools such as GPTCache.

When a new query is provided, GPTCache converts the query into embeddings and uses a vector store to perform a similarity search on these embeddings. If a matching query is found, GPTCache retrieves the stored response and serves it back to the user. On a cache hit, the query never reaches the LLM, which reduces cost and improves the response time for the user.
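The lookup flow described above can be illustrated with a toy semantic cache. This is not GPTCache's API: the `SemanticCache` class, the bag-of-words `embed` function, the in-memory list standing in for a vector store, and the 0.8 similarity threshold are all assumptions made to keep the sketch self-contained.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def fake_llm(query: str) -> str:
    """Stand-in for an expensive LLM call."""
    return f"answer to: {query}"


class SemanticCache:
    def __init__(self, llm: Callable[[str], str], threshold: float = 0.8) -> None:
        self.llm = llm
        self.threshold = threshold
        self.entries: List[Tuple[Counter, str]] = []  # stands in for a vector store
        self.llm_calls = 0

    def ask(self, query: str) -> str:
        vector = embed(query)
        # Linear scan for the most similar cached query.
        best = max(self.entries, key=lambda e: cosine(vector, e[0]), default=None)
        if best is not None and cosine(vector, best[0]) >= self.threshold:
            return best[1]  # cache hit: the LLM is never called
        self.llm_calls += 1  # cache miss: pay for one LLM call
        response = self.llm(query)
        self.entries.append((vector, response))
        return response


if __name__ == "__main__":
    cache = SemanticCache(fake_llm)
    cache.ask("How do I reset my password?")
    cache.ask("How can I reset my password?")  # similar enough to hit the cache
    cache.ask("What is the refund policy?")    # unrelated, goes to the LLM
    print("LLM calls:", cache.llm_calls)
```

The threshold controls the trade-off: a lower value raises the hit rate and the savings, but also the risk of serving a cached answer to a question that only looks similar.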


4. Chat History Summarization

LLMs can hold natural, coherent conversations when they are given the chat history of previous interactions. However, long chat histories increase the number of tokens that need to be processed, thereby increasing costs. One technique to alleviate this cost is to use a tool such as LangChain’s Conversation Memory to summarize the conversation and forward this summary to the LLM rather than the entire chat history.

Once LangChain is integrated, it processes and stores the conversation history. When the LLM requires further context to answer a particular prompt, the Conversation Memory interface generates a summary of the chat history and provides it to the LLM as additional context. This way, the LLM has to process fewer tokens throughout the conversation, which reduces the overall cost without compromising conversation quality. The diagram below shows the token usage of the summary technique compared with storing and passing the entire conversation in memory. Using the summary technique also reduces the chances of hitting the token limits of particular LLMs.

[Image: token usage with the summary technique versus passing the entire conversation]

Image Credits – James Briggs
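The summarization idea can be sketched in plain Python as a rolling summary plus a short window of recent turns. This is not LangChain's API: the `SummaryMemory` class is hypothetical, and `naive_summarizer` is a crude stand-in for what would normally be a call to an LLM.

```python
from typing import Callable, List


def naive_summarizer(previous_summary: str, turns: List[str]) -> str:
    """Stand-in for an LLM summarization call: keeps the first 8 words of each turn."""
    clipped = [" ".join(turn.split()[:8]) for turn in turns]
    return " | ".join(part for part in [previous_summary, *clipped] if part)


class SummaryMemory:
    """Keeps the latest turns verbatim and folds older turns into a summary."""

    def __init__(self, summarize: Callable[[str, List[str]], str],
                 keep_recent: int = 2) -> None:
        self.summarize = summarize
        self.keep_recent = keep_recent
        self.summary = ""
        self.recent: List[str] = []

    def add_turn(self, turn: str) -> None:
        self.recent.append(turn)
        extra = len(self.recent) - self.keep_recent
        if extra > 0:
            # Fold the oldest turns into the running summary.
            self.summary = self.summarize(self.summary, self.recent[:extra])
            self.recent = self.recent[extra:]

    def context(self) -> str:
        """Return the text to send to the LLM in place of the full history."""
        parts = []
        if self.summary:
            parts.append(f"Summary of earlier conversation: {self.summary}")
        parts.extend(self.recent)
        return "\n".join(parts)


if __name__ == "__main__":
    memory = SummaryMemory(naive_summarizer)
    history = [f"Turn {i}: " + "some long message text " * 10 for i in range(6)]
    for turn in history:
        memory.add_turn(turn)
    print("full history characters:", len("\n".join(history)))
    print("context characters:     ", len(memory.context()))
```

The summarization calls cost tokens too, so the savings appear mainly in longer conversations, where the full history would otherwise be resent on every turn.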

5. Model Distillation

Model distillation can help save on LLM costs by transferring (distilling) knowledge from a large, expensive model into a smaller one. The smaller, lightweight model learns to mimic the behavior of the larger model and absorbs its knowledge without going through the same extensive training process. While distilling, it’s also possible to tailor the model to business-specific use cases, making it more specialized and efficient.

The benefit of distillation is that the distilled model can achieve performance comparable to its larger counterpart while requiring fewer computational resources, which ultimately results in lower operational costs. Although technical expertise is required to perform this process effectively, the benefits are significant if it is done correctly. An early example of a distilled model is Microsoft’s Orca-2 model. Research by Google has also found that a smaller model (770 million parameters) trained with the distillation technique outperformed its larger counterpart (540 billion parameters) on benchmark datasets. This shows the effectiveness and promise of this approach when it comes to cost optimization.
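To give a feel for what "learning to mimic" can mean in practice, here is a minimal sketch of one widely used distillation objective: the KL divergence between temperature-softened teacher and student output distributions. The article does not say which objective the models mentioned above were trained with, so treat this as an illustrative assumption; the logits are made-up numbers.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits: List[float], student_logits: List[float],
                      temperature: float = 2.0) -> float:
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


if __name__ == "__main__":
    teacher = [4.0, 1.0, 0.5]          # made-up teacher logits for three classes
    close_student = [3.5, 1.2, 0.4]    # already mimics the teacher fairly well
    distant_student = [0.5, 1.0, 4.0]  # disagrees with the teacher
    print("close student loss:  ", distillation_loss(teacher, close_student))
    print("distant student loss:", distillation_loss(teacher, distant_student))
```

During training, this loss would be minimized with respect to the student's parameters, often combined with an ordinary loss on labeled data. The student that already agrees with the teacher receives the smaller loss.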