Data Indexing
We need to follow five steps for data indexing:
Step 1: Getting the uploaded PDF file
Step 2: Convert the uploaded PDF to text
Step 3: Split the extracted text into chunks
Step 4: Generate embeddings from chunks
Step 5: Delete the uploaded PDF after processing
Let's create a route, i.e. /api/v1/pdf/indexing/new, with the POST method for PDF indexing.
const express = require('express');
const httpServer = express();
const PORT = 4000;

// PDF Indexing Route
httpServer.post("/api/v1/pdf/indexing/new", (req, res) => {
});

httpServer.listen(PORT, () => {
  console.log(`HTTP Server is on : ${PORT}`);
});
Step 1: Getting the uploaded PDF file
First, we will handle the uploaded PDF using the multer library. Multer is a Node.js middleware for handling multipart/form-data, which is primarily used for uploading files.
Install multer from the terminal
npm install multer
Import multer, then set up disk storage for storing the PDF. In the disk storage configuration we specify two things: the destination and the filename of the PDF, as shown in the code below.
const express = require('express');
const httpServer = express();
// Importing multer
const multer = require('multer');
const PORT = 4000;

// Setup of disk storage for storing the pdf
const storage = multer.diskStorage({
  destination: function (req, file, cb) {
    // Pass the path of the destination folder in the second parameter
    cb(null, 'uploads/pdf');
  },
  filename: function (req, file, cb) {
    // Pass the file name in the second parameter
    cb(null, file.originalname);
  }
});

httpServer.post("/api/v1/pdf/indexing/new", (req, res) => {
});

httpServer.listen(PORT, () => {
  console.log(`HTTP Server is on : ${PORT}`);
});
Multer accepts an options object. The following options can be passed to multer:
| Key | Description |
| --- | --- |
| dest or storage | Where to store the files |
| fileFilter | Function to control which files are accepted |
| limits | Limits of the uploaded data |
| preservePath | Keep the full path of files instead of just the base name |
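As a quick illustration (not used in the rest of the tutorial), the simplest possible configuration passes only dest; multer then stores the uploads in that folder under randomly generated filenames.

// Simplest setup: only a destination folder; filenames are auto-generated
const simpleUpload = multer({ dest: 'uploads/pdf' });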
Let's pass storage, fileFilter, and limits (file size) for now.
const express = require('express');
const httpServer = express();
// Importing multer
const multer = require('multer');
const PORT = 4000;

// Setup of disk storage for storing the pdf
const storage = multer.diskStorage({
  destination: function (req, file, cb) {
    // Pass the path of the destination folder in the second parameter
    cb(null, 'uploads/pdf');
  },
  filename: function (req, file, cb) {
    // Pass the file name in the second parameter
    cb(null, file.originalname);
  }
});

// Initialize multer with storage, PDF-only filter, and 10MB size limit
const pdfMulterUpload = multer({
  // Specify custom storage engine
  storage: storage,
  // Filter to accept only PDF files
  fileFilter: (req, file, cb) => {
    // Reject file if not PDF
    if (file.mimetype !== 'application/pdf') {
      cb(null, false);
    } else {
      cb(null, true);
    }
  },
  // Limit file size to 10MB
  limits: {
    fileSize: 10 * 1024 * 1024
  }
});

httpServer.post("/api/v1/pdf/indexing/new", (req, res) => {
});

httpServer.listen(PORT, () => {
  console.log(`HTTP Server is on : ${PORT}`);
});
Let's add pdfMulterUpload as middleware.
First, we'll check for the uploaded PDF.
Then we destructure originalname as pdfOriginalName, path as pdfPath, and size as pdfSize, and log them.
Step 5: Now, let's also handle the fifth step, i.e. deleting the uploaded PDF after processing, using the unlinkSync method from the built-in fs module.
httpServer.post("/api/v1/pdf/indexing/new", pdfMulterUpload.single('pdf'), async (req, res) => {
try {
// Step 1: Getting the uploaded PDF file from multer
const { originalname: pdfOriginalName, path: pdfPath, size: pdfSize } = req.file;
console.log(pdfOriginalName, pdfPath, pdfSize);
// Step 2: Convert the uploaded PDF to text
// TODO: Implement PDF to text conversion logic here
// Step 3: Split the extracted text into chunks
// TODO: Implement text chunking logic here
// Step 4: Generate embeddings from chunks
// TODO: Convert each chunk into vector embeddings
// Step 5: Delete the uploaded PDF after processing
try {
await fs.unlinkSync(pdfPath);
console.log('PDF is deleted successfully');
} catch (err) {
throw new Error(`Unable to delete the pdf with name: ${pdfOriginalName}`);
}
// Final response after successful indexing
res.status(201).json({
success: true,
message: `PDF with the name ${pdfOriginalName} is indexed successfully`
});
} catch (err) {
// Catch any error during the whole process and send a failure response
console.log(`Error in pdf data indexing ${err.message}`);
res.status(500).send({
success: false,
message: "Something went wrong"
});
}
});
Step 2: Convert the uploaded PDF to text
Extract text from PDFs using the pdf-parse library. Install it from the terminal using the command below.
npm install pdf-parse
First, read the PDF into a buffer, then parse the buffer to extract the plain text.
const express = require('express');
const httpServer = express();
const PORT = 4000;
const multer = require('multer');
const fs = require('fs');
const pdf = require('pdf-parse'); // For extracting text from PDF

// Step 1: Setting up multer storage for uploading PDFs
const storage = multer.diskStorage({
  destination: function (req, file, cb) {
    cb(null, 'uploads/pdf');
  },
  filename: function (req, file, cb) {
    cb(null, file.originalname);
  }
});

const pdfMulterUpload = multer({
  storage: storage,
  fileFilter: (req, file, cb) => {
    if (file.mimetype !== 'application/pdf') {
      cb(null, false);
    } else {
      cb(null, true);
    }
  },
  limits: {
    fileSize: 10 * 1024 * 1024
  }
});

httpServer.post("/api/v1/pdf/indexing/new", pdfMulterUpload.single('pdf'), async (req, res) => {
  try {
    // Step 1: Getting the uploaded PDF file from multer
    if (!req.file) {
      return res.status(400).json({
        success: false,
        message: "Please upload a valid PDF file"
      });
    }
    const { originalname: pdfOriginalName, path: pdfPath, size: pdfSize } = req.file;
    console.log(pdfOriginalName, pdfPath, pdfSize);

    // Step 2: Convert the uploaded PDF to text
    let dataBuffer = fs.readFileSync(pdfPath); // Reading PDF file into buffer
    let pdfParseResult = await pdf(dataBuffer); // Parsing the buffer to extract text
    let pdfText = pdfParseResult.text; // Extracted plain text from PDF
    console.log(pdfText);

    // Step 3: Split the extracted text into chunks
    // TODO: Implement text chunking logic here

    // Step 4: Generate embeddings from chunks
    // TODO: Convert each chunk into vector embeddings

    // Step 5: Delete the uploaded PDF after processing
    try {
      fs.unlinkSync(pdfPath); // Removing the uploaded file from storage
      console.log('PDF is deleted successfully');
    } catch (err) {
      throw new Error(`Unable to delete the pdf with name: ${pdfOriginalName}`);
    }

    res.status(201).json({
      success: true,
      message: `PDF with the name ${pdfOriginalName} is indexed successfully`
    });
  } catch (err) {
    console.log(`Error in pdf data indexing ${err.message}`);
    res.status(500).send({
      success: false,
      message: "Something went wrong"
    });
  }
});

httpServer.listen(PORT, () => {
  console.log(`HTTP Server started on ${PORT}`);
});
Step 3: Split the extracted text into chunks
The next step is converting the text into chunks so that the computer (AI model) can understand it more effectively.
Imagine you’re reading a book with 320 pages. One option is to read all the pages in one go, while another is to read one page at a time. Which option makes more sense?
Obviously, reading one page, then pausing for a while before reading the next, is more manageable and effective. AI models operate in a similar way. Understanding a small portion of text is much easier than trying to process a large amount at once. This process is known as chunking.
There are three chunking strategies, described below.
- Fixed window size without token overlapping
- Fixed window size with token overlapping
- Page wise (Can be useful for presentation)
Fixed window size without token overlapping
Book example:
Using a 200-word chunk with no overlap:
Chunk 1 → Words 1–200
Chunk 2 → Words 201–400
Chunk 3 → Words 401–600
… and so on.
Downside? If something important spans the boundary between words 200 and 201, the meaning might get lost because the chunks don't "see" each other.
Fixed window size with token overlapping
Book example:
Using a 200-word chunk, but you overlap by 50 words:
Chunk 1 → Words 1–200
Chunk 2 → Words 150–350
Chunk 3 → Words 300–500
… and so on.
This helps AI or any reader understand better, because they get a little reminder of what was just said.
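Since the tutorial code below only covers the non-overlapping case, here is a minimal sketch of the overlapping variant; windowSize, overlap, and step are illustrative names, and pdfText is the text extracted in Step 2:

const totalWords = pdfText.split(' ');
const windowSize = 200; // words per chunk
const overlap = 50; // words shared with the previous chunk
const step = windowSize - overlap; // each new chunk starts 150 words after the last

let overlappedChunks = [];
for (let start = 0; start < totalWords.length; start += step) {
  overlappedChunks.push(totalWords.slice(start, start + windowSize).join(' '));
}

console.log(overlappedChunks.length);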
Page wise (Can be useful for presentation)
Book example:
Each page becomes its own chunk:
Chunk 1 → Page 1
Chunk 2 → Page 2
Chunk 3 → Page 3
… and so on.
Useful when you want chunks that match visual content like slides, or when you’re converting PDFs and don’t want to mess with layout.
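A small sketch of the page-wise idea, assuming the per-page text is already available in an array (pdf-parse returns one combined string by default, so collecting per-page text would need its pagerender option or a different extraction approach; pageTexts below is a hypothetical input):

// pageTexts is assumed to hold the extracted text of each page, in order
const pageTexts = ['...text of page 1...', '...text of page 2...'];

const pageChunks = pageTexts.map((pageText, index) => ({
  page: index + 1, // 1-based page number for reference
  text: pageText // the whole page becomes one chunk
}));

console.log(pageChunks.length); // one chunk per page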
For now, we will use the fixed window size without token overlapping strategy.
// Step 3: Split the extracted text into chunks
let chunks = []; // Initialize an array to store the text chunks
const totalWords = pdfText.split(' '); // Split the PDF text into individual words
const totalWordsCount = totalWords.length; // Get the total number of words (not characters)
const windowSize = 1000; // Define how many words should go into each chunk
const totalPossibleChunks = Math.ceil(totalWordsCount / windowSize); // Calculate the number of chunks we can create

for (let i = 0; i < totalPossibleChunks; i++) { // Loop through and create each chunk
  const start = windowSize * i; // Starting index for the chunk
  const end = start + windowSize; // Ending index (non-inclusive)
  const chunk = totalWords.slice(start, end).join(" "); // Slice words from start to end and join them back into a string
  chunks.push(chunk); // Push into chunks array
}

console.log(chunks); // Log all the generated chunks
Step 4: Generate embeddings from chunks
There are two steps involved in this task:
- Creating the vector embeddings from the chunks (text-embedding-3-small)
- Storing the embeddings in a vector database
Let's first create the vector embeddings from the chunks.
Install Required Packages
npm install openai dotenv
Create a .env file in the root directory:
OPENAI_API_KEY=your_openai_api_key_here
Set Up OpenAI Client
const OpenAI = require('openai');
require('dotenv').config();
const openaiClient = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});
Create Embeddings from Chunks
// forEach does not wait for async callbacks, so use a for...of loop
// to create the embeddings one after another
for (const chunk of chunks) {
  const vector_embedding_result = await openaiClient.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunk
  });
  console.log(vector_embedding_result?.data[0]?.embedding);
}
The embedding will look like this: [-0.0123, 0.0987, 0.0432, -0.0714, 0.0023, …, 0.0019], a 1536-dimensional array of numbers for text-embedding-3-small.
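The second sub-step, storing the embeddings in a vector database, can be sketched with a plain in-memory array as a stand-in. This only illustrates the shape of the records; a real setup would insert them into an actual vector database instead, and vectorStore below is a hypothetical name.

// Stand-in for a vector database: in practice each record would be
// inserted into a vector store rather than pushed onto an array
const vectorStore = [];

for (const [index, chunk] of chunks.entries()) {
  const result = await openaiClient.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunk
  });
  vectorStore.push({
    id: index, // position of the chunk in the document
    text: chunk, // keep the original text so search results can show it
    embedding: result.data[0].embedding // the vector itself
  });
}

console.log(`Created ${vectorStore.length} embedding records`);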