Data Indexing
We need to follow five steps for data indexing:
Step 1: Getting the uploaded PDF file
Step 2: Convert the uploaded PDF to text
Step 3: Split the extracted text into chunks
Step 4: Generate embeddings from chunks
Step 5: Delete the uploaded PDF after processing
Let's create a route, i.e. /api/v1/pdf/indexing/new, with the POST method for PDF indexing.
const express = require('express');
const httpServer = express();
const PORT = 4000;

// PDF Indexing Route
httpServer.post("/api/v1/pdf/indexing/new", (req, res) => {
});

httpServer.listen(PORT, () => {
  console.log(`HTTP Server is on : ${PORT}`);
});
Step 1: Getting the uploaded PDF file
First, we will handle the uploaded PDF using the multer library. Multer is a Node.js middleware for handling multipart/form-data, which is primarily used for uploading files.
Install multer from the terminal
npm install multer
Import multer, then set up disk storage for storing the PDF. In the disk storage configuration we specify two things: the destination and the filename of the PDF, as shown in the code below.
const express = require('express');
const httpServer = express();
// Importing multer
const multer = require('multer');
const PORT = 4000;

// Setup of disk storage for storing the pdf
const storage = multer.diskStorage({
  destination: function (req, file, cb) {
    // Pass the path of the destination folder in the second parameter
    cb(null, 'uploads/pdf');
  },
  filename: function (req, file, cb) {
    // Pass the file name in the second parameter
    cb(null, file.originalname);
  }
});

httpServer.post("/api/v1/pdf/indexing/new", (req, res) => {
});

httpServer.listen(PORT, () => {
  console.log(`HTTP Server is on : ${PORT}`);
});
Multer accepts an options object. The following options can be passed to multer:
| Key | Description |
| --- | --- |
| dest or storage | Where to store the files |
| fileFilter | Function to control which files are accepted |
| limits | Limits of the uploaded data |
| preservePath | Keep the full path of files instead of just the base name |
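As a quick illustration (not used in the rest of the tutorial), the simplest possible configuration passes only dest; multer then stores the uploads in that folder under randomly generated filenames.

// Simplest setup: only a destination folder; filenames are auto-generated
const simpleUpload = multer({ dest: 'uploads/pdf' });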
Let's pass storage, fileFilter, and limits (file size) for now.
const express = require('express');
const httpServer = express();
// Importing multer
const multer = require('multer');
const PORT = 4000;

// Setup of disk storage for storing the pdf
const storage = multer.diskStorage({
  destination: function (req, file, cb) {
    // Pass the path of the destination folder in the second parameter
    cb(null, 'uploads/pdf');
  },
  filename: function (req, file, cb) {
    // Pass the file name in the second parameter
    cb(null, file.originalname);
  }
});

// Initialize multer with storage, PDF-only filter, and 10MB size limit
const pdfMulterUpload = multer({
  // Specify custom storage engine
  storage: storage,
  // Filter to accept only PDF files
  fileFilter: (req, file, cb) => {
    // Reject file if not PDF
    if (file.mimetype !== 'application/pdf') {
      cb(null, false);
    } else {
      cb(null, true);
    }
  },
  // Limit file size to 10MB
  limits: {
    fileSize: 10 * 1024 * 1024
  }
});

httpServer.post("/api/v1/pdf/indexing/new", (req, res) => {
});

httpServer.listen(PORT, () => {
  console.log(`HTTP Server is on : ${PORT}`);
});
Let's add pdfMulterUpload as middleware.
First, we'll check for the uploaded PDF.
Then we destructure originalname as pdfOriginalName, path as pdfPath, and size as pdfSize, and log them.
Step 5: Now, let's also handle the fifth step, i.e. deleting the uploaded PDF after processing, using the unlinkSync method from the built-in fs module.
httpServer.post("/api/v1/pdf/indexing/new", pdfMulterUpload.single('pdf'), async (req, res) => {
try {
// Step 1: Getting the uploaded PDF file from multer
const { originalname: pdfOriginalName, path: pdfPath, size: pdfSize } = req.file;
console.log(pdfOriginalName, pdfPath, pdfSize);
// Step 2: Convert the uploaded PDF to text
// TODO: Implement PDF to text conversion logic here
// Step 3: Split the extracted text into chunks
// TODO: Implement text chunking logic here
// Step 4: Generate embeddings from chunks
// TODO: Convert each chunk into vector embeddings
// Step 5: Delete the uploaded PDF after processing
try {
await fs.unlinkSync(pdfPath);
console.log('PDF is deleted successfully');
} catch (err) {
throw new Error(`Unable to delete the pdf with name: ${pdfOriginalName}`);
}
// Final response after successful indexing
res.status(201).json({
success: true,
message: `PDF with the name ${pdfOriginalName} is indexed successfully`
});
} catch (err) {
// Catch any error during the whole process and send a failure response
console.log(`Error in pdf data indexing ${err.message}`);
res.status(500).send({
success: false,
message: "Something went wrong"
});
}
});
Step 2: Convert the uploaded PDF to text
Extract text from PDFs using the pdf-parse library. Install it from the terminal using the command below.
npm install pdf-parse
First, read the PDF into a buffer, then parse the buffer to extract the plain text.
const express = require('express');
const httpServer = express();
const PORT = 4000;
const multer = require('multer');
const fs = require('fs');
const pdf = require('pdf-parse'); // For extracting text from PDF

// Step 1: Setting up multer storage for uploading PDFs
const storage = multer.diskStorage({
  destination: function (req, file, cb) {
    cb(null, 'uploads/pdf');
  },
  filename: function (req, file, cb) {
    cb(null, file.originalname);
  }
});

const pdfMulterUpload = multer({
  storage: storage,
  fileFilter: (req, file, cb) => {
    if (file.mimetype !== 'application/pdf') {
      cb(null, false);
    } else {
      cb(null, true);
    }
  },
  limits: {
    fileSize: 10 * 1024 * 1024
  }
});

httpServer.post("/api/v1/pdf/indexing/new", pdfMulterUpload.single('pdf'), async (req, res) => {
  try {
    // Step 1: Getting the uploaded PDF file from multer
    if (!req.file) {
      return res.status(400).json({
        success: false,
        message: "Please upload a valid PDF file"
      });
    }
    const { originalname: pdfOriginalName, path: pdfPath, size: pdfSize } = req.file;
    console.log(pdfOriginalName, pdfPath, pdfSize);

    // Step 2: Convert the uploaded PDF to text
    let dataBuffer = fs.readFileSync(pdfPath); // Reading PDF file into buffer
    let pdfParseResult = await pdf(dataBuffer); // Parsing the buffer to extract text
    let pdfText = pdfParseResult.text; // Extracted plain text from PDF
    console.log(pdfText);

    // Step 3: Split the extracted text into chunks
    // TODO: Implement text chunking logic here

    // Step 4: Generate embeddings from chunks
    // TODO: Convert each chunk into vector embeddings

    // Step 5: Delete the uploaded PDF after processing
    try {
      fs.unlinkSync(pdfPath); // Removing the uploaded file from storage
      console.log('PDF is deleted successfully');
    } catch (err) {
      throw new Error(`Unable to delete the pdf with name: ${pdfOriginalName}`);
    }

    res.status(201).json({
      success: true,
      message: `PDF with the name ${pdfOriginalName} is indexed successfully`
    });
  } catch (err) {
    console.log(`Error in pdf data indexing ${err.message}`);
    res.status(500).send({
      success: false,
      message: "Something went wrong"
    });
  }
});

httpServer.listen(PORT, () => {
  console.log(`HTTP Server started on ${PORT}`);
});
Step 3: Split the extracted text into chunks
The next step is converting the text into chunks so that the computer (AI model) can understand it more effectively.
Imagine you’re reading a book with 320 pages. One option is to read all the pages in one go, while another is to read one page at a time. Which option makes more sense?
Obviously, reading one page, then pausing for a while before reading the next, is more manageable and effective. AI models operate in a similar way. Understanding a small portion of text is much easier than trying to process a large amount at once. This process is known as chunking.
There are three chunking strategies, described below.
- Fixed window size without token overlapping
- Fixed window size with token overlapping
- Page wise (Can be useful for presentation)
Fixed window size without token overlapping
Book example:
Using a 200-word chunk with no overlap:
Chunk 1 → Words 1–200
Chunk 2 → Words 201–400
Chunk 3 → Words 401–600
… and so on.
Downside? If something important spans the boundary between words 200 and 201, the meaning might get lost because the chunks don't "see" each other.
Fixed window size with token overlapping
Book example:
Using a 200-word chunk, but you overlap by 50 words:
Chunk 1 → Words 1–200
Chunk 2 → Words 150–350
Chunk 3 → Words 300–500
… and so on.
This helps AI or any reader understand better, because they get a little reminder of what was just said.
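Since the tutorial code below only covers the non-overlapping case, here is a minimal sketch of the overlapping variant; windowSize, overlap, and step are illustrative names, and pdfText is the text extracted in Step 2:

const totalWords = pdfText.split(' ');
const windowSize = 200; // words per chunk
const overlap = 50; // words shared with the previous chunk
const step = windowSize - overlap; // each new chunk starts 150 words after the last

let overlappedChunks = [];
for (let start = 0; start < totalWords.length; start += step) {
  overlappedChunks.push(totalWords.slice(start, start + windowSize).join(' '));
}

console.log(overlappedChunks.length);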
Page wise (Can be useful for presentation)
Book example:
Each page becomes its own chunk:
Chunk 1 → Page 1
Chunk 2 → Page 2
Chunk 3 → Page 3
… and so on.
Useful when you want chunks that match visual content like slides, or when you’re converting PDFs and don’t want to mess with layout.
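A small sketch of the page-wise idea, assuming the per-page text is already available in an array (pdf-parse returns one combined string by default, so collecting per-page text would need its pagerender option or a different extraction approach; pageTexts below is a hypothetical input):

// pageTexts is assumed to hold the extracted text of each page, in order
const pageTexts = ['...text of page 1...', '...text of page 2...'];

const pageChunks = pageTexts.map((pageText, index) => ({
  page: index + 1, // 1-based page number for reference
  text: pageText // the whole page becomes one chunk
}));

console.log(pageChunks.length); // one chunk per page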
For now, we will use the fixed window size without token overlapping strategy.
// Step 3: Split the extracted text into chunks
let chunks = []; // Initialize an array to store the text chunks
const totalWords = pdfText.split(' '); // Split the PDF text into individual words
const totalWordsCount = totalWords.length; // Get the total number of words (not characters)
const windowSize = 1000; // Define how many words should go into each chunk
const totalPossibleChunks = Math.ceil(totalWordsCount / windowSize); // Calculate the number of chunks we can create

for (let i = 0; i < totalPossibleChunks; i++) { // Loop through and create each chunk
  const start = windowSize * i; // Starting index for the chunk
  const end = start + windowSize; // Ending index (non-inclusive)
  const chunk = totalWords.slice(start, end).join(" "); // Slice words from start to end and join them back into a string
  chunks.push(chunk); // Push into chunks array
}

console.log(chunks); // Log all the generated chunks
Step 4: Generate embeddings from chunks
There are two steps involved in this task:
- Creating the vector embeddings from the chunks (text-embedding-3-small)
- Storing the embeddings in a vector database
Let's first create the vector embeddings from the chunks.
Install Required Packages
npm install openai dotenv
Create a .env file in the root directory:
OPENAI_API_KEY=your_openai_api_key_here
Set Up OpenAI Client
const OpenAI = require('openai');
require('dotenv').config();
const openaiClient = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
});
Create Embeddings from Chunks
// forEach does not wait for async callbacks, so use a for...of loop
// to create the embeddings one after another
for (const chunk of chunks) {
  const vector_embedding_result = await openaiClient.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunk
  });
  console.log(vector_embedding_result?.data[0]?.embedding);
}
The embedding will look like this: [-0.0123, 0.0987, 0.0432, -0.0714, 0.0023, …, 0.0019], a 1536-dimensional array of numbers for text-embedding-3-small.
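The second sub-step, storing the embeddings in a vector database, can be sketched with a plain in-memory array as a stand-in. This only illustrates the shape of the records; a real setup would insert them into an actual vector database instead, and vectorStore below is a hypothetical name.

// Stand-in for a vector database: in practice each record would be
// inserted into a vector store rather than pushed onto an array
const vectorStore = [];

for (const [index, chunk] of chunks.entries()) {
  const result = await openaiClient.embeddings.create({
    model: 'text-embedding-3-small',
    input: chunk
  });
  vectorStore.push({
    id: index, // position of the chunk in the document
    text: chunk, // keep the original text so search results can show it
    embedding: result.data[0].embedding // the vector itself
  });
}

console.log(`Created ${vectorStore.length} embedding records`);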