Getting Started with LlamaIndex

This blog post was written by Brain John Aboze as part of the Deepchecks Community Blog. If you would like to contribute your own blog post, feel free to reach out to us. We typically pay a symbolic fee for content that’s accepted by our reviewers.


Are you tired of language models’ generic answers? Imagine if they could learn directly from your notes, documents, or even your favorite websites. That’s the power of LlamaIndex. This framework lets you connect your data to cutting-edge large language models (LLMs), turning them into tools tailored to your needs. With LlamaIndex, those LLMs can answer complex questions based on your information, generate summaries aligned with your interests, and even become conversational AI assistants that understand your domain.


Photo by wildercr from Pixabay

Initially known as GPT Index, LlamaIndex is a data framework designed to enhance applications powered by LLMs. It provides the tools to ingest, structure, and access your private or specialized data, so that LLMs can use this data safely to generate more informed and accurate text. It’s available in both Python and TypeScript. Think of LlamaIndex as a bridge between your custom data and powerful LLMs like GPT-4. Whether your information lives in APIs, databases, or PDFs, LlamaIndex organizes it for smooth interaction with these language models. This bridge greatly enriches the data LLMs can access, enabling highly tailored applications and workflows. It acts as a multi-faceted orchestrator when working with data and LLMs:

  • Ingestion: LlamaIndex smoothly brings your data in, regardless of its source.
  • Structuring: It organizes your data in a way LLMs can easily understand and utilize.
  • Retrieval: When needed, LlamaIndex efficiently finds the most relevant information.
  • Integration: It simplifies connecting your data with different application frameworks.

Source: LlamaIndex

Power of Context Augmentation

LLMs are trained on massive amounts of public information, but they often can’t access domain-specific information, the latest updates, your company’s internal documents, or a specialized industry database.

Fine-tuning an LLM with your data is one option, but it has drawbacks:

  • Cost: Training LLMs is expensive.
  • Up-to-date Info: The costs make frequent updates to keep up with new knowledge challenging.
  • Trust: Knowing why an LLM provides a particular answer is hard.

Retrieval Augmented Generation (RAG) offers a smarter alternative. Here’s how it works:

  • Retrieve: When you ask a question, RAG finds relevant information from your data sources.
  • Augment: It enriches your question with this targeted knowledge and context.
  • Generate: The LLM creates an answer informed by its general knowledge and your specific data.

While fine-tuning can be useful, RAG offers a compelling alternative that addresses each of these concerns:

  • Affordability: No costly training means RAG keeps your application costs manageable.
  • Freshness: Data is retrieved at query time, so answers can reflect the latest version of your sources.
  • Transparency: LlamaIndex often lets you trace the LLM’s answers back to their source, increasing trust in the results.

Ultimately, the best approach depends on your application’s specific needs and the resources you have available. Consider the trade-offs between flexibility, cost, and customization when making your decision.

RAG Overview

You might wonder, “Why bother with RAG? Can’t I just dump all my data into the LLM and let it sort things out?” It’s a good question, but here’s the catch: LLMs have limits. Think of an LLM’s understanding as reading a book where you can only see a few lines at a time. That’s the context window. Every time you move forward, you forget what came before. This makes it hard for the LLM to grasp the bigger picture. RAG shines by focusing your LLM’s attention. Instead of overwhelming it with everything, RAG works like a spotlight:

  • Finds the Right Info: It pinpoints the most relevant parts of your data related to the question asked.
  • Provides Focused Context: RAG gives the LLM only what it needs to understand and generate a meaningful answer.

This makes the system more cost-effective and improves the quality of its answers. Let’s look at a RAG workflow in detail.

RAG workflow (source: LlamaIndex)

The workflow includes the following processes:

  • Indexing: RAG transforms your data (text, PDFs, databases, etc.) into a format LLMs easily understand. A key technique is creating vector embeddings – numerical representations of your information’s meaning.
  • Querying: When you ask a question, RAG pinpoints the most relevant data from your organized index.
  • Contextualizing: RAG feeds this targeted knowledge to your query, enhancing the LLM’s understanding.
  • Generating: Now, the LLM can create a highly informed answer using its general knowledge and the specific insights extracted from your data.
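The four stages above can be sketched in a few lines of plain Python. This is only a toy illustration under loud assumptions: it ranks chunks by shared words instead of vector embeddings, and `generate` is a hypothetical stand-in for a real LLM call. None of these function names come from LlamaIndex.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(chunks):
    """Indexing: store each chunk alongside its word counts."""
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]


def retrieve(index, question, top_k=1):
    """Querying: rank chunks by how many question words they contain."""
    question_words = set(tokenize(question))
    scored = [
        (sum(counts[word] for word in question_words), chunk)
        for chunk, counts in index
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop chunks that share no words with the question.
    return [chunk for score, chunk in scored[:top_k] if score > 0]


def build_prompt(question, context_chunks):
    """Contextualizing: prepend the retrieved chunks to the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


def generate(prompt):
    """Generating: placeholder for a real LLM call (an assumption)."""
    return f"[an LLM would answer here, given {len(prompt)} prompt characters]"


chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
]
index = build_index(chunks)
question = "How long do refunds take?"
prompt = build_prompt(question, retrieve(index, question))
print(prompt)
print(generate(prompt))
```

Only the chunk about refunds reaches the prompt, which is the "spotlight" effect described earlier: the LLM sees the relevant slice of your data rather than all of it.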


Key Stages of Building RAG Applications

We will now walk through the broader process of building LLM-powered applications fueled by your data using LlamaIndex, with RAG as the core concept. Let’s get started.

Installation and setup

Before we discuss all the cool things LlamaIndex can do, let’s set it up on your system.

For this tutorial, we will use the Python version of LlamaIndex.

Step 1: Create a Virtual Environment (Mac/Linux and Windows)

Virtual environments keep project dependencies separate, avoiding conflicts. Here’s how to create one (instructions will vary slightly based on your setup):

  • Mac/Linux:
    • Open a terminal
    • Run: python3 -m venv llama-env
    • Activate it with: source llama-env/bin/activate
  • Windows:
    • Open a command prompt
    • Run: python -m venv llama-env
    • Activate it with: llama-env\Scripts\activate

Step 2: Install LlamaIndex

Inside your activated virtual environment, run this command: pip install llama-index openai

Step 3: Set Up Your OpenAI API Key

Security Reminder: Never share your OpenAI API key publicly. It gives access to your OpenAI account.
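One common way to follow this advice is to keep the key out of your code and read it from an environment variable. The sketch below assumes you have already set `OPENAI_API_KEY` in your shell, which is the variable name the OpenAI client looks for by default. The helper name `get_openai_api_key` is hypothetical.

```python
import os


def get_openai_api_key():
    """Return the key stored in the OPENAI_API_KEY environment variable."""
    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        # Fail early with a clear message instead of a confusing API error later.
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it in your shell before running."
        )
    return api_key
```

Because the key lives in the environment rather than in the notebook or script, you can share your code without exposing it.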

Loading data

LlamaIndex provides data connectors that ingest information from all your sources: APIs, databases, PDFs, and more. You can view a comprehensive list of the available data loaders on LlamaHub. For this article, we will ingest a PDF file, so we will use SimpleDirectoryReader, one of the most widely used data connectors.

Let’s install pypdf (the successor to PyPDF2), a library for working with PDFs. It allows us to easily split, merge, crop, and modify PDF files. Here’s how to install it: pip install pypdf

Next, let’s fetch a PDF online (here, an NVIDIA PDF about the company’s story) and save it to local storage. I am using the same directory as my notebook.
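A download step like this can be sketched with the standard library alone. The helper name `fetch_pdf`, the `data` directory, and the file name are assumptions for illustration, and the URL in the usage comment is a placeholder rather than the address of the PDF used in this article.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_pdf(url, dest_dir="data", filename="document.pdf"):
    """Download `url` into `dest_dir` and return the path of the saved file."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the local directory if needed
    target = dest / filename
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)  # stream the bytes to disk
    return target


# Hypothetical usage; replace the placeholder with the real URL of your PDF.
# pdf_path = fetch_pdf("https://example.com/your-file.pdf")
```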


This will create a local directory and save the PDF locally. Next, we will load the PDF into a LlamaIndex document object.
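Loading the saved file might look like the sketch below. It assumes the llama-index 0.10+ package layout (`llama_index.core`) and a hypothetical `data` directory. The import sits inside the function so that the layout assumption only matters when the function is called.

```python
def load_pdf_documents(data_dir="data"):
    """Load every readable file in `data_dir` into LlamaIndex Document objects."""
    # Assumption: llama-index >= 0.10, where the core classes live in llama_index.core.
    from llama_index.core import SimpleDirectoryReader

    reader = SimpleDirectoryReader(input_dir=data_dir)
    return reader.load_data()


# Hypothetical usage:
# documents = load_pdf_documents("data")
# print(len(documents))
```

Depending on the reader used for PDFs, you may get one Document per page rather than one per file, so it is worth printing the number of documents returned.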


The LlamaIndex document object is a core abstraction within LlamaIndex. It acts as a container for a given data source, whether you create it manually or use one of LlamaIndex’s automatic data loaders. It has some attributes, such as metadata, which stores additional information about the document, and relationships, which help track how it connects to other documents.
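To make the attributes above concrete, here is a small sketch of creating a document by hand and inspecting it. It assumes the llama-index 0.10+ import path. The helper names and the `source` metadata key are hypothetical choices for illustration.

```python
def make_document(text, source):
    """Create a LlamaIndex Document manually and attach simple metadata."""
    # Assumption: llama-index >= 0.10 package layout.
    from llama_index.core import Document

    return Document(text=text, metadata={"source": source})


def describe(document):
    """Summarize the metadata and relationships of a document-like object."""
    return {
        "metadata": dict(document.metadata),
        "relationship_count": len(document.relationships),
    }


# Hypothetical usage:
# doc = make_document("LlamaIndex connects data to LLMs.", source="notes.txt")
# print(describe(doc))
```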

Chunking (Node Parsing)

The next step is to parse this document object into chunks (or nodes) and eventually turn each node into an embedding, which is a numeric representation of the meaning of that chunk of text. Transformer models (the architecture behind most large language models) work with a limited amount of text at a time, which is the context window we discussed earlier. Even if the context window is large enough to fit the entire text, a single numeric representation (vector) that tries to summarize whole pages loses important details. It’s like describing a whole movie based on a single blurry snapshot. Ideally, each chunk of text fed to the model carries a complete meaning, such as a sentence or paragraph that makes sense on its own.

Splitting your documents into meaningful chunks (like sentences or paragraphs) helps the transformer model focus on each part individually, leading to better understanding and more accurate results. Chunks that are too short might not have enough information for the model to work with. Chunks that are too big go back to the problem of losing detail within the model’s limited attention window. The perfect chunk size might depend on your specific model and the kind of text you’re working with.
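The effect of chunk size is easy to see with a toy splitter. The sketch below counts whitespace-separated words rather than model tokens, and `chunk_words` is a hypothetical helper, not a LlamaIndex function. The optional `overlap` repeats the last few words of one chunk at the start of the next so that content on a boundary appears in both.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of `chunk_size` words, repeating `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


text = "one two three four five six seven eight nine ten"
print(chunk_words(text, chunk_size=4))             # no overlap between chunks
print(chunk_words(text, chunk_size=4, overlap=2))  # neighbors share two words
```

Smaller chunks give you more, narrower pieces. Larger chunks give you fewer pieces that each mix more ideas together.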

Let’s explore the range of text node parsers offered by LlamaIndex, from basic options to more advanced parsers.


  • TokenTextSplitter: This parser breaks your text into chunks of roughly equal size, measured in tokens. Tokens are the units a model’s tokenizer produces: often whole words, but sometimes word fragments or punctuation. Its defaults are a chunk_size of 1024 tokens and a chunk_overlap of 20 tokens.
    Chunk overlap is a strategy to make sure you don’t miss important information that falls on the boundary between two chunks of text. Creating a small overlap between chunks ensures that related information isn’t split awkwardly, so if you need to recall details from that boundary area, you’ll find them in at least one of the overlapping chunks. The amount of overlap controls how much redundancy you want: a larger overlap means a better chance of capturing boundary information but slightly less efficient storage.
  • SentenceSplitter: This parser prioritizes keeping whole sentences and paragraphs together within each chunk. This helps avoid “hanging sentences” or sentence fragments at the end of a chunk, which can happen with the TokenTextSplitter because it focuses on token count. By respecting sentence boundaries, the SentenceSplitter ensures that each chunk represents a more cohesive unit of meaning. It is the same parser as the SimpleNodeParser. Its default settings are a chunk size of 1024 tokens and an overlap of 200 tokens.
  • LangchainNodeParser: This versatile parser lets you combine any LangChain text splitter with LlamaIndex nodes. LangChain, another popular framework for language model applications, offers a variety of text-splitting techniques. To use this node parser, you need LangChain installed.
  • SentenceWindowNodeParser: This powerful parser within LlamaIndex breaks your documents into sentence-sized nodes and provides valuable context. Each node contains the individual sentence and a “window” of surrounding sentences stored in the metadata. This context window offers these benefits:
    • Improved Understanding: The language model can process each sentence with greater awareness of its place within the text.
    • Enhanced Retrieval: Search results can be more accurate since the context helps pinpoint the most relevant sentences.

    It also allows customization through parameters like `window_size`, which controls the number of sentences captured on each side of the sentence, and `window_metadata_key` and `original_text_metadata_key`, which specify the metadata keys under which the context window and the original sentence are stored within the node.

  • SemanticSplitterNodeParser: This parser takes a smarter approach to splitting your text. Instead of using fixed sizes, it analyzes the meaning (semantics) of your sentences with the help of embeddings. Inspired by Greg Kamradt’s techniques, it uses embeddings (the default is OpenAI’s text-embedding-ada-002 model) to find natural breakpoints within your text. This ensures that each chunk contains sentences that are closely related conceptually, which helps language models process the text more effectively than conventional methods that may split text at less meaningful points. Key parameters include:
    • buffer_size: Sets the initial window size for comparing sentences.
    • breakpoint_percentile_threshold: Determines when the parser should create a new chunk. It analyzes the similarity scores between sentences: the lower the similarity, the bigger the difference in meaning. A high threshold means the parser is strict: it will only split when there’s a sharp drop in similarity, resulting in fewer but larger chunks. Conversely, a low threshold leads to more splits, since even smaller differences in similarity will trigger a new chunk, creating many smaller chunks.

    We will use its default parameter settings.

  • HierarchicalNodeParser: The HierarchicalNodeParser enables you to break down documents into a structured, multi-level hierarchy of chunks (nested chunks). This means that larger chunks can contain smaller nested chunks within them, providing different levels of context. When you search, relevant results may be clustered together. The HierarchicalNodeParser lets you retrieve an entire parent chunk (e.g., a paragraph or chapter) if several smaller chunks within it are relevant, ensuring you don’t miss the bigger picture. You can process the text at different levels of granularity – broader sections for general understanding and smaller chunks for fine-grained details. Imagine a document broken down into:
    • Top Level: Large chunks (e.g., entire chapters)
    • Mid-Level: Medium chunks within each chapter (e.g., paragraphs)
    • Leaf Nodes: Smallest chunks (e.g., individual sentences)

    The HierarchicalNodeParser helps preserve the relationships between different sections of your text, giving language models a richer understanding of the document’s structure. Let’s use its existing default settings and define the chunk sizes for the three levels as follows:
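The parsers described in this list could be instantiated roughly as follows. This is a sketch, not the article’s original code: it assumes the llama-index 0.10+ import path, the three hierarchical chunk sizes are an assumed choice, and `build_node_parsers` is a hypothetical helper. The LangchainNodeParser is left out because it needs LangChain installed, and the semantic splitter is only created when you pass in an embedding model.

```python
def build_node_parsers(embed_model=None):
    """Return a dict of the node parsers discussed above, keyed by a short name."""
    # Assumption: llama-index >= 0.10 package layout.
    from llama_index.core.node_parser import (
        HierarchicalNodeParser,
        SemanticSplitterNodeParser,
        SentenceSplitter,
        SentenceWindowNodeParser,
        TokenTextSplitter,
    )

    parsers = {
        "token": TokenTextSplitter(chunk_size=1024, chunk_overlap=20),
        "sentence": SentenceSplitter(chunk_size=1024, chunk_overlap=200),
        "sentence_window": SentenceWindowNodeParser.from_defaults(
            window_size=3,
            window_metadata_key="window",
            original_text_metadata_key="original_text",
        ),
        # Three levels, from large parent chunks down to small leaf chunks.
        # These sizes are an assumed choice; tune them for your documents.
        "hierarchical": HierarchicalNodeParser.from_defaults(
            chunk_sizes=[2048, 512, 128]
        ),
    }
    if embed_model is not None:
        # The semantic splitter needs an embedding model to compare sentences.
        parsers["semantic"] = SemanticSplitterNodeParser(
            buffer_size=1,
            breakpoint_percentile_threshold=95,
            embed_model=embed_model,
        )
    return parsers
```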

With all the different node parsers defined, we can get the respective node objects as follows:
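Collecting the nodes from several parsers can be sketched with a small helper. It assumes `parsers` is a dictionary mapping a name of your choice to a LlamaIndex node parser, each of which exposes `get_nodes_from_documents`. Both helper names are hypothetical.

```python
def get_nodes_by_parser(parsers, documents):
    """Run every parser over the documents and collect the resulting nodes."""
    return {
        name: parser.get_nodes_from_documents(documents)
        for name, parser in parsers.items()
    }


def count_nodes(nodes_by_parser):
    """Report how many nodes each parser produced, for a quick comparison."""
    return {name: len(nodes) for name, nodes in nodes_by_parser.items()}


# Hypothetical usage, given a `parsers` dict and loaded `documents`:
# nodes_by_parser = get_nodes_by_parser(parsers, documents)
# print(count_nodes(nodes_by_parser))
```

Comparing the counts is a quick way to see how differently each strategy slices the same document.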


These node types vary in size and carry important information. Each TextNode of a given node type has a unique ID and also contains .metadata and .relationships properties. These provide additional information about the node itself and its connections to other nodes, respectively, enhancing how the language model can understand and process your data. Let’s see how this looks in practice with SentenceWindowNodeParser below:
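The structure of a sentence-window node can be imitated in plain Python. This is not LlamaIndex’s implementation: the sentence splitting is a naive regular expression and the nodes are simple dictionaries. The `window` and `original_text` metadata keys follow the parser’s default key names.

```python
import re


def sentence_window_nodes(text, window_size=3):
    """Build one node per sentence, storing neighboring sentences as metadata."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "id": f"node-{i}",
                "text": sentence,
                "metadata": {
                    # The sentence plus up to window_size neighbors on each side.
                    "window": " ".join(sentences[start:end]),
                    "original_text": sentence,
                },
            }
        )
    return nodes


text = (
    "LlamaIndex ingests data. It builds an index. "
    "Queries hit the index. The LLM answers."
)
for node in sentence_window_nodes(text, window_size=1):
    print(node["text"], "|", node["metadata"]["window"])
```

Each node is matched on its single sentence, while the wider window can be handed to the LLM for context at answer time.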




Try to examine the other node structures and observe how they are different.


After you’ve loaded your data, LlamaIndex structures it for effortless retrieval. It does this using an Index, a specialized data structure designed for efficient searches by language models. LlamaIndex offers several index types, including:

  • Summary Index: Provides quick access to basic information about your data.
  • Tree Index: Organizes data hierarchically, useful for exploring relationships within your text.
  • Keyword Table Index: Allows for fast lookups based on specific keywords.
  • Vector Store Index (most popular): Each node is paired with a ‘vector embedding’, a numerical representation of its meaning. This enables fast similarity searches, letting you find data relevant to your query even if the exact words don’t match.
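The similarity search behind a vector store index boils down to comparing vectors, commonly with cosine similarity. The sketch below uses made-up three-dimensional vectors and hypothetical node IDs. Real embedding models produce vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def most_similar(query_vector, node_vectors, top_k=2):
    """Return the IDs of the top_k nodes closest in direction to the query."""
    ranked = sorted(
        node_vectors.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [node_id for node_id, _ in ranked[:top_k]]


# Made-up vectors standing in for real embeddings.
node_vectors = {
    "gpu-history": [0.9, 0.1, 0.0],
    "office-hours": [0.0, 0.2, 0.9],
    "gpu-products": [0.8, 0.3, 0.1],
}
print(most_similar([1.0, 0.2, 0.0], node_vectors))
```

The two GPU-related nodes rank highest because their vectors point in nearly the same direction as the query, regardless of which exact words the underlying text used.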

A vector embedding, often called an embedding, is a numerical representation of a text’s semantics or meaning. Vector embeddings are the key to semantic search. They turn the meaning of your text into a mathematical format that computers can easily compare. This means you can find related information based on concepts, not just keywords! LlamaIndex uses OpenAI’s powerful text-embedding-ada-002 by default, but many other embedding types exist for specialized tasks, which can be specified with the embed_model argument in VectorStoreIndex. Note that these Indices can be created from nodes or from documents (using the .from_documents() method); however, we will be building our indices from our nodes as follows:
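Building one index per node set might look like the sketch below. It assumes the llama-index 0.10+ import path and a dictionary that maps parser names to lists of nodes. The helper name is hypothetical. Constructing each index calls the embedding model for every node, so with the default OpenAI model this requires an API key and incurs usage costs.

```python
def build_vector_indices(nodes_by_parser, embed_model=None):
    """Build one VectorStoreIndex per parser from already-parsed nodes."""
    # Assumption: llama-index >= 0.10 package layout.
    from llama_index.core import VectorStoreIndex

    indices = {}
    for name, nodes in nodes_by_parser.items():
        if embed_model is None:
            # Fall back to the library's default embedding model.
            indices[name] = VectorStoreIndex(nodes)
        else:
            indices[name] = VectorStoreIndex(nodes, embed_model=embed_model)
    return indices


# Hypothetical usage:
# indices = build_vector_indices(nodes_by_parser)
```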


Storing Indices

While these indices make querying your data fast and easy, by default they live only in memory and must be recreated in each session. This can become computationally expensive, especially when working with large amounts of text. By persisting them to disk or to a vector store, you avoid the need to recreate them each time, saving significant computational resources. You then only need to add new data as your knowledge base grows.

Persisting index to local directory

To understand this process better, let’s take just one of the indices and persist it, say the setence_window_nodes_indices.
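A persist-or-load pattern along these lines is sketched below. It assumes the llama-index 0.10+ import path and the `local_storage` directory name. `load_or_build_index` is a hypothetical helper that handles a single index with `load_index_from_storage`, the single-index counterpart of `load_indices_from_storage`.

```python
import os


def load_or_build_index(nodes, persist_dir="local_storage"):
    """Reuse a persisted index when one exists; otherwise build and save it."""
    # Assumption: llama-index >= 0.10 package layout.
    from llama_index.core import (
        StorageContext,
        VectorStoreIndex,
        load_index_from_storage,
    )

    if not os.path.exists(persist_dir):
        # First run: embed the nodes, then write the index to disk as JSON files.
        index = VectorStoreIndex(nodes)
        index.storage_context.persist(persist_dir=persist_dir)
    else:
        # Later runs: rebuild the index from the saved files, with no re-embedding.
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
    return index


# Hypothetical usage:
# index = load_or_build_index(sentence_window_nodes)
```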


This yields a local_storage folder with the following file structure:
├── default__vector_store.json
├── docstore.json
├── graph_store.json
├── image__vector_store.json
└── index_store.json

This leads to the creation of JSON files for persistent storage. However, two files are particularly important for us: docstore.json and default__vector_store.json. The docstore.json file contains the node IDs with respective metadata and reference document information, while the default__vector_store.json contains all the embeddings. Take a look at these files on your local machine.

In the else statement, we use load_indices_from_storage to reload our indices from the local directory. This means you can skip all the previous setup steps in future sessions! Simply load your saved indices and start making queries immediately, saving time and effort.

Persisting index to Vector Store

Vector databases are designed to handle large-scale data. They excel at storing and searching massive collections of high-dimensional vectors, making them ideal for applications where speed and scalability are essential. With a well-designed vector database, you can perform similarity searches across billions of data points with lightning-fast results. There are two main categories of vector stores to consider:

  • Self-hosted (like Chroma DB): You manage these on your servers. This gives you maximum control over data privacy and security and eliminates the need for internet connectivity. It is ideal if you have strict data regulations or in-house technical expertise.
  • Cloud-based (like Pinecone): These are hosted by cloud providers. Advantages include easy setup, effortless scaling as your data grows, and access to them from anywhere. They are perfect for rapid development and when convenience is a priority.

While data persistence in vector stores is important, it’s outside the scope of this article.

LLM Application

Data-backed LLM applications streamline the way you interact with your knowledge. Instead of sifting through files or databases, you get direct answers to questions, engage in informative chats, or even delegate tasks to adaptive decision-making systems.

  • Query Engines: These systems enable direct question-and-answer interactions with your data. A query is processed, relevant information is retrieved from the knowledge base, and the LLM generates a response, often including source references.
  • Chat Engines: Designed for conversational interactions. They handle multiple turns of back-and-forth dialogue with your data, maintaining context across the conversation.
  • Agents: LLM-powered decision-makers that interact with the environment through tools. Unlike previous categories, agents can take multiple steps to complete a task, adapting their actions based on feedback. This allows them to handle complex, multi-stage problems.

In our use case, we need to make queries on the knowledge base; hence, we will create a query engine from our retrieved indices as follows:
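Creating and using a query engine can be sketched as follows. It assumes `index` is a LlamaIndex vector index such as the one loaded from storage. The helper name `ask`, the `top_k` default, and the 200-character snippet length are hypothetical choices.

```python
def ask(index, question, top_k=2):
    """Query an index and return the answer text plus short source snippets."""
    # similarity_top_k sets how many of the most similar nodes are retrieved.
    query_engine = index.as_query_engine(similarity_top_k=top_k)
    response = query_engine.query(question)
    # Each source node records which chunk of your data informed the answer.
    sources = [source.node.get_content()[:200] for source in response.source_nodes]
    return str(response), sources


# Hypothetical usage:
# answer, sources = ask(index, "What is this document about?")
# print(answer)
# for snippet in sources:
#     print("-", snippet)
```

Returning the source snippets alongside the answer is what makes it possible to trace a response back to the passages it came from.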



Throughout this guide, we’ve explored how LlamaIndex unlocks the hidden potential of your data. By structuring it intelligently, LLMs can move beyond generic responses and truly understand your world. We’ve covered:

  • Data Ingestion: How LlamaIndex connects to various sources, opening up new possibilities.
  • Node Parsing: The importance of breaking down information into meaningful chunks for the LLM to process.
  • Indexing: Transforming your data into searchable formats with different indices, each with its advantages.
  • Persistence: Avoiding redundant work by saving indices for future sessions.
  • The Potential: The building blocks of powerful LLM applications – query engines, chatbots, and even AI agents.

Think of LlamaIndex as the bridge between your knowledge and the incredible power of large language models. Whether you want to supercharge your search, build an insightful chatbot, or create a problem-solving AI assistant, the path starts here. This is just the beginning – as LLM technology and tools like LlamaIndex evolve, the possibilities are endless.

From RAGs to Riches! The future is customized. Thank you for reading.

