
Kunal Gupta
Aug 13, 2025
Document search has evolved beyond simple text extraction. This technical deep-dive explores building a production-ready search engine that combines Retrieval Augmented Generation (RAG) with Typesense’s powerful indexing capabilities. The system intelligently processes diverse PDF types, maintains visual context through image conversion, and leverages both text and vector search for superior accuracy. Based on real-world implementation experience, this guide covers everything from PDF classification and SQLite storage to advanced search strategies and future enhancement opportunities.
- Building a Hybrid Document Search Engine: From PDF Processing to AI-Powered Retrieval
- The Challenge: Beyond Simple PDF Text Extraction
- PDF to Text: A Multimodal Approach
- Custom Plain Text Search Solution
- Conclusion
Building a Hybrid Document Search Engine: From PDF Processing to AI-Powered Retrieval
In the era of information overload, finding specific content within large document repositories has become a critical challenge, especially for private documents. While ChatGPT and similar tools have revolutionized text processing, they still struggle with complex document formats, especially PDFs containing mixed content like handwritten notes, tables, and diagrams. This led me to build a custom document search engine that combines traditional text processing with modern AI capabilities.
The Challenge: Beyond Simple PDF Text Extraction
Most existing PDF processing solutions fall short when dealing with real-world documents. Basic libraries like PyPDF2 or pdfplumber work well for simple text-based PDFs but fail when documents contain:
- Handwritten annotations alongside typed text
- Complex tables and diagrams
- Scanned documents with varying quality
- Mixed content types within a single page, such as text embedded inside images alongside computer-generated text
Even ChatGPT’s default PDF processing capabilities have limitations when it comes to maintaining the visual context of the original document. For instance, when converting a table to text, ChatGPT may lose the structure, making it hard to understand the relationships between different data points. It also does not process text that is embedded inside images within PDFs.
PDF to Text: A Multimodal Approach
My solution begins with intelligent PDF classification and processing. Rather than applying a one-size-fits-all approach, the system first analyzes each PDF page to determine its content type:
import pytesseract
from pypdf import PageObject

def classify_pdf_page(page: PageObject) -> PDFPageClassification:
    """Classify a PDF page based on its content."""
    text = page.extract_text().strip()
    images = page.images
    if len(text) > 0 and len(images) > 0:
        return PDFPageClassification.TEXT_WITH_IMAGES
    elif len(text) > 0:
        return PDFPageClassification.TEXT_ONLY
    elif len(images) > 0:
        return PDFPageClassification.IMAGE_ONLY
    else:
        return PDFPageClassification.EMPTY
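The PDFPageClassification enum referenced above isn't shown in the snippet; a minimal sketch of what it could look like, with member names inferred from the return values (treat the exact definition as an assumption):
from enum import Enum, auto

class PDFPageClassification(Enum):
    """Possible content types of a single PDF page."""
    TEXT_WITH_IMAGES = auto()
    TEXT_ONLY = auto()
    IMAGE_ONLY = auto()
    EMPTY = auto()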
Based on this classification, we apply different processing strategies:
text: str | None = None
im_data_list: list[ImData] = []
# Dispatch on the page classification (the match subject is assumed here; the original snippet shows only the case arms)
match classify_pdf_page(page):
    case PDFPageClassification.TEXT_WITH_IMAGES:
        text = page.extract_text().strip()
        for i, image in enumerate(page.images):
            im_text = pytesseract.image_to_string(image.image)
            im_data_list.append({"im_index": i, "im_text": im_text})
    case PDFPageClassification.TEXT_ONLY:
        text = page.extract_text().strip()
    case PDFPageClassification.IMAGE_ONLY:
        for i, image in enumerate(page.images):
            im_text = pytesseract.image_to_string(image.image)
            im_data_list.append({"im_index": i, "im_text": im_text})
The extracted information is stored in a SQLite database that holds the original PDF blob, file metadata, and the processed page data. The extracted text, along with the metadata needed to retrieve the exact page blobs from the database, is also indexed in Typesense.
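As a rough illustration of that storage layer, a minimal SQLite schema might look like the following; the table and column names here are assumptions for the sketch, not the project's exact schema:
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pdf_files (
    file_sha256            TEXT PRIMARY KEY,  -- SHA-256 of the PDF, used as the join key
    file_name              TEXT NOT NULL,
    pdf_blob               BLOB NOT NULL,     -- original PDF bytes, needed later for page-to-image conversion
    imported_in_typesense  INTEGER DEFAULT 0  -- flag used later for incremental sync
);
CREATE TABLE IF NOT EXISTS pdf_page_data (
    file_sha256  TEXT NOT NULL REFERENCES pdf_files(file_sha256),
    page_index   INTEGER NOT NULL,
    text         TEXT,  -- text extracted directly from the page, if any
    im_data      TEXT,  -- JSON-encoded list of {"im_index": ..., "im_text": ...} OCR results
    PRIMARY KEY (file_sha256, page_index)
);
"""

def init_db(path: str = "documents.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn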
Typesense
Setting Up
Before integrating with Typesense, you need a running instance. The easiest way to get started is using Docker Compose for local development.
Docker Compose Setup
Create a docker-compose.yml file in your project root:
version: "3"
services:
  typesense:
    image: typesense/typesense:0.25.2
    entrypoint: sh -c "/opt/typesense-server --data-dir /data --api-key=your-secure-api-key --enable-cors"
    ports:
      - "8108:8108"
    volumes:
      - typesense-data:/data
volumes:
  typesense-data:
    driver: local
Starting the Service
# Start Typesense in the background
docker compose up -d
# Verify the server is running
curl 'http://localhost:8108/keys' \
-X GET \
-H "X-TYPESENSE-API-KEY: your-secure-api-key"
# Expected response: { "keys": [] }
Typesense Integration
Typesense stores the processed text and image data for fast search.
from typesense import Client
from tqdm import tqdm

basic_fields = [
    {"name": "file_sha256", "type": "string", "facet": True},
    {"name": "page_index", "type": "int32"},
    {"name": "text_source_type", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "im_index", "type": "int32", "optional": True},
]
collection_name = "TypesenseTextIndex"

def update_typesense_text_index():
    # Initialize the Typesense client (port must match the Docker Compose mapping above)
    ts_client = Client({
        "api_key": "your-api-key",
        "nodes": [{"host": "localhost", "port": 8108, "protocol": "http"}],
        "connection_timeout_seconds": 60,
    })
    # Create the collection with the schema defined above
    schema = {
        "name": collection_name,
        "enable_nested_fields": False,
        "fields": basic_fields,
    }
    ts_client.collections.create(schema)
    # Get the list of PDF files that haven't been imported to Typesense yet
    pdf_files = db_utils.get_pdf_file_data_not_imported_in_typesense()
    # Ingest PDF files into Typesense
    all_docs = []
    # Process each PDF file and collect its page documents
    for file_sha, file_name in tqdm(pdf_files):
        # Retrieve page data for this PDF from the SQLite database
        pdf_page_data_result = db_utils.get_pdf_page_data_by_file_sha(file_sha=file_sha)
        # Extract the list of page data objects
        pdf_page_data_list = [i.pdf_page_data for i in pdf_page_data_result]
        # Serialize the page data models into Typesense documents
        docs_to_import = [document.model_dump(mode="json") for document in pdf_page_data_list]
        all_docs.extend(docs_to_import)
    # Add all documents to the Typesense collection using bulk import
    collection = ts_client.collections[collection_name]
    responses = collection.documents.import_(all_docs)
    for response in responses:
        if not response["success"]:
            print(f"Failed to import document to Typesense. Skipping. Failure reason: {response}")
Why This Architecture?
- SQLite provides ACID compliance and complex queries for metadata. This can be replaced with any SQL database.
- Typesense offers both full-text and vector search capabilities in a single, optimized package.
- Separation of concerns allows storage and search to be optimized and scaled independently.
- Incremental sync ensures only new documents are processed after the initial setup (a sketch of this check follows below).
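A minimal sketch of how that incremental check could be implemented on the SQLite side, reusing the hypothetical imported_in_typesense flag from the schema sketch above; the function names mirror the calls used elsewhere in this post, but their bodies are assumptions:
import sqlite3

def get_pdf_file_data_not_imported_in_typesense(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Return (file_sha256, file_name) for PDFs not yet pushed to Typesense.

    Assumes the hypothetical imported_in_typesense flag on the pdf_files table.
    """
    return conn.execute(
        "SELECT file_sha256, file_name FROM pdf_files WHERE imported_in_typesense = 0"
    ).fetchall()

def mark_file_as_imported(conn: sqlite3.Connection, file_sha256: str) -> None:
    # Flip the flag after a successful bulk import so the next run skips this file.
    conn.execute(
        "UPDATE pdf_files SET imported_in_typesense = 1 WHERE file_sha256 = ?",
        (file_sha256,),
    )
    conn.commit()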
Comparison with Modern Libraries
While building this system, I evaluated several alternatives:
Traditional Libraries:
- PyPDF2/pypdf: Great for simple text extraction but struggles with complex layouts
- pdfplumber: Better table detection but limited OCR capabilities
- PyMuPDF (fitz): Fast processing but requires additional work for mixed content
Modern AI-Powered Solutions:
- Docling: Advanced document processing framework with custom model support and Hugging Face integration. Enables fully local processing without external API dependencies, making it ideal for batch processing large document volumes without additional costs. Supports multiple document formats and provides fine-grained control over extraction pipelines.
- Kreuzberg: Focuses on structured data extraction with good table support, but limited community adoption
- Marker: Recent AI-powered PDF to markdown converter, excellent for academic papers but struggles with complex business documents
- LangChain Document Loaders: Good integration with LLM workflows but basic PDF processing capabilities
Paid API-Based Solutions:
For production applications requiring high accuracy and reliability, several commercial APIs offer superior document processing capabilities:
- Azure Document Intelligence (formerly Form Recognizer): Excellent table extraction with high accuracy, processes small documents in under a minute. Supports 100+ languages and custom model training.
- AWS Textract: Robust table and form extraction with confidence scores. Good for financial documents and invoices. Pricing based on pages processed.
- Google Document AI: Strong OCR capabilities with specialized processors for different document types. Good integration with the Google Cloud ecosystem.
- Unstructured.io: Comprehensive document processing pipeline, but unreliable with tabular data extraction and can be overkill for focused use cases.
- LlamaIndex: Strong ecosystem integration, but I observed data loss with tables and slow processing speeds.
Key Advantage of Current Approach:
The custom classification system allows for optimized processing of each content type while maintaining cost control. For instance, handwritten annotations are processed through OCR while preserving the original text extraction for typed content. This hybrid approach ensures no information is lost while maintaining processing efficiency and avoiding per-page API costs for large document volumes.
Custom Plain Text Search Solution
The search system combines multiple retrieval strategies to provide comprehensive results.
Hybrid Search Implementation
Typesense supports hybrid search out of the box. The system uses a combination of text and vector search to retrieve relevant documents:
- q - The search query text; a bare wildcard can be used to match all documents.
- query_by - The fields to search against; field order matters for relevance ranking.
- vector_query - The vector search configuration, including:
  - An empty vector array, which tells Typesense to auto-embed the query text.
  - The alpha parameter for balancing semantic vs keyword search weights.
  - The ef parameter for the Hierarchical Navigable Small World (HNSW) search quality vs speed trade-off; it can be tuned to trade recall against latency.
- sort_by - Sorting criteria, including the tie-breaking mechanism and field limits.
- per_page - Number of results per page, subject to a maximum limit.
search_parameters = {
    "q": query,
    "query_by": "text,embedding",
    "vector_query": f"embedding:([], alpha: {vector_to_text_results_ratio}, ef:128)",
    "sort_by": "_text_match:desc,_vector_query(embedding:([])):asc",
    "per_page": 10,
}
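The query_by and vector_query above reference an embedding field that isn't part of the basic_fields schema shown earlier; Typesense can populate such a field automatically at index time. A minimal sketch of adding an auto-embedding field to the collection schema and issuing the hybrid search; the embedding model named here is an assumption, not necessarily the one used in the project:
# Auto-embedding field: Typesense builds the vector from the "text" field at index time.
embedding_field = {
    "name": "embedding",
    "type": "float[]",
    "embed": {
        "from": ["text"],
        "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},  # assumed model choice
    },
}

schema = {
    "name": collection_name,
    "enable_nested_fields": False,
    "fields": basic_fields + [embedding_field],
}

# Issue the hybrid search against the collection.
results = ts_client.collections[collection_name].documents.search(search_parameters)
hits = results["hits"]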
Visual Context Preservation
Here’s where the system truly shines. Instead of just returning text snippets, it reconstructs the visual context:
PAGE_SEPARATOR = "\n--------------------- End of Page ------------------------\n\n"
DOC_SEPARATOR = "####################### End of Document #######################\n\n"
TEMP_IMAGES_FOLDER = ".temp"
def format_result_text(results: Sequence[DBPDFPageData]) -> str:
    return f"\n{PAGE_SEPARATOR}\n".join([result.text for result in results])

def run_query_llm(cursor: QuerySQLite, search_utils: SearchUtils, query: str, number_of_results: int = 10) -> str:
    # Get search results from the vector database
    results = run_query_on_vector_db(search_utils=search_utils, query=query, number_of_results=number_of_results)
    # Convert relevant pages back to images
    images_paths = write_result_images_to_image_folder_and_get_image_paths(
        output_folder=TEMP_IMAGES_FOLDER,
    )
    # Format the text context
    text = format_result_text(results=results)
    # Send both text and images to the LLM
    answer = call_llm(
        ai_provider=AIProvider.OPEN_AI,
        system_prompt=SYSTEM_PROMPT,
        user_message=query + "\n\nDocs:\n\n" + text,
        image_paths=images_paths,
    )
    return answer
The run_query_llm function orchestrates the entire search and response generation process through several key steps:
Step 1: Vector Database Search
results = run_query_on_vector_db(search_utils=search_utils, query=query, number_of_results=number_of_results)
This calls the Typesense hybrid search with the following process:
- Converts the user query into embeddings using the configured embedding model
- Performs both text-based and vector-based search simultaneously
- Uses the alpha parameter (default 0.9) to weight semantic vs keyword matching
- Returns the top-k most relevant document pages ranked by combined relevance score
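A sketch of what run_query_on_vector_db could look like; the helper name matches the call above, but the attribute names on search_utils and the database lookup are assumptions for illustration:
def run_query_on_vector_db(search_utils: SearchUtils, query: str, number_of_results: int = 10) -> list[DBPDFPageData]:
    # Hybrid search: full-text match plus auto-embedded vector match, weighted by alpha.
    search_parameters = {
        "q": query,
        "query_by": "text,embedding",
        "vector_query": "embedding:([], alpha: 0.9, ef:128)",
        "per_page": number_of_results,
    }
    response = search_utils.ts_client.collections[collection_name].documents.search(search_parameters)
    # Map each hit back to its (file_sha256, page_index) so the corresponding page
    # rows can be fetched from SQLite for text context and image reconstruction.
    docs = [hit["document"] for hit in response["hits"]]
    return [
        search_utils.db_utils.get_pdf_page_data(file_sha=doc["file_sha256"], page_index=doc["page_index"])
        for doc in docs
    ]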
Step 2: Image Generation from PDF Pages
images_paths = write_result_images_to_image_folder_and_get_image_paths(output_folder=TEMP_IMAGES_FOLDER)
This function performs several operations:
- Groups search results by document SHA256 hash to avoid duplicate processing
- Retrieves original PDF blobs from SQLite database using file hashes
- For small documents (< 4 pages): Converts entire PDF to images
- For larger documents: Converts only the specific pages that matched the search
- Uses pdf2image.convert_from_bytes() with optimized settings (100 DPI, PNG format, multi-threading)
- Saves images to a temporary folder and returns the file paths for LLM consumption
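A minimal sketch of this conversion step under those settings, using pdf2image; the helper name and the way pages are selected are assumptions that mirror the description above:
from pathlib import Path
from pdf2image import convert_from_bytes

def pdf_pages_to_images(pdf_blob: bytes, page_indices: list[int], total_pages: int,
                        output_folder: str = TEMP_IMAGES_FOLDER) -> list[str]:
    Path(output_folder).mkdir(exist_ok=True)
    image_paths: list[str] = []
    if total_pages < 4:
        # Small document: convert every page in a single call.
        images = convert_from_bytes(pdf_blob, dpi=100, fmt="png", thread_count=4)
        pages = list(enumerate(images))
    else:
        # Larger document: convert only the pages that matched the search.
        pages = []
        for idx in page_indices:
            # pdf2image pages are 1-based; the index stored in Typesense is 0-based.
            images = convert_from_bytes(pdf_blob, dpi=100, fmt="png",
                                        first_page=idx + 1, last_page=idx + 1)
            pages.append((idx, images[0]))
    for idx, image in pages:
        path = str(Path(output_folder) / f"page_{idx}.png")
        image.save(path)
        image_paths.append(path)
    return image_paths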
Step 3: Text Context Formatting
text = format_result_text(results=results)
This creates a structured text representation:
- Concatenates all matched page texts with clear page separators
- Maintains document boundaries using predefined separators
- Preserves the original text extraction while providing clear context boundaries
- Format: Page content + "\n--------------------- End of Page ------------------------\n\n"
Step 4: Multimodal LLM Integration
SYSTEM_PROMPT = """Given a user question and support documents, \
answer the user’s question.
"""
answer = call_llm(
    ai_provider=AIProvider.OPEN_AI,
    system_prompt=SYSTEM_PROMPT,
    user_message=query + "\n\nDocs:\n\n" + text,
    image_paths=images_paths,
)  # You can use LangChain or the OpenAI API directly. This is just a stub.
The final step combines everything for the LLM:
- System Prompt: Provides instructions on how to analyze documents and respond
- User Message: Combines the original query with formatted text context
- Images: Sends the converted PDF page images for visual analysis
- Provider: Uses OpenAI’s GPT-4 Vision for multimodal understanding
This approach ensures the LLM has both textual content for semantic understanding and visual context for structural comprehension, leading to more accurate and contextually aware responses.
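For reference, here is one way the OpenAI branch of call_llm could be implemented with the official Python SDK, sending the text context and the page images in a single multimodal request; the model name and helper structure are assumptions:
import base64
from openai import OpenAI

def call_openai_with_images(system_prompt: str, user_message: str, image_paths: list[str]) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    # Build a multimodal user message: the text context first, then each page image.
    content = [{"type": "text", "text": user_message}]
    for path in image_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model works; this choice is an assumption
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content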
Why This Approach Works:
- Preserves Structure: Tables, diagrams, and layouts remain intact because the LLM's vision capability works directly on the page images.
- Context Awareness: LLM sees both text content and visual arrangement.
- Accuracy: Reduces hallucination by providing accurate context to the LLM to answer the query.
- Flexibility: Works with any vision-capable LLM. Similarly, Typesense lets you use any custom vector embedding model, and the default ones are free to use from its Hugging Face repository.
Conclusion
Building this document search engine taught me that effective information retrieval requires more than just throwing documents at an LLM. The key insights were:
- Content-aware processing: Different document types need different extraction strategies
- Hybrid storage: Combining traditional databases with modern search engines provides the best of both worlds
- Visual context matters: Preserving document structure is crucial for accurate information extraction
The system successfully handles complex, real-world documents that would challenge even modern AI tools. By combining traditional text processing with modern AI capabilities, it provides accurate, contextual answers while maintaining the visual integrity of the original documents.
As AI continues to evolve, this architecture provides a solid foundation for incorporating new capabilities while maintaining reliability and performance. The modular design ensures that individual components can be upgraded without rebuilding the entire system.
The future of document search lies not in replacing human understanding but in augmenting it with intelligent tools that preserve context, maintain accuracy, and provide actionable insights from our ever-growing private and public repositories of information. Custom solutions like this one can be built to support exactly that.