
Kunal Gupta
Aug 13, 2025
Document search has evolved beyond simple text extraction. This technical deep-dive explores building a production-ready search engine that combines Retrieval Augmented Generation (RAG) with Typesense’s powerful indexing capabilities. The system intelligently processes diverse PDF types, maintains visual context through image conversion, and leverages both text and vector search for superior accuracy. Based on real-world implementation experience, this guide covers everything from PDF classification and SQLite storage to advanced search strategies and future enhancement opportunities.
- Building a Hybrid Document Search Engine: From PDF Processing to AI-Powered Retrieval
- The Challenge: Beyond Simple PDF Text Extraction
- PDF to Text: A Multimodal Approach
- Custom Plain Text Search Solution
- Conclusion
Building a Hybrid Document Search Engine: From PDF Processing to AI-Powered Retrieval
In the era of information overload, finding specific content within large document repositories has become a critical challenge, especially for private documents. While ChatGPT and similar tools have revolutionized text processing, they still struggle with complex document formats, especially PDFs containing mixed content like handwritten notes, tables, and diagrams. This led me to build a custom document search engine that combines traditional text processing with modern AI capabilities.
The Challenge: Beyond Simple PDF Text Extraction
Most existing PDF processing solutions fall short when dealing with real-world documents. Basic libraries like PyPDF2 or pdfplumber work well for simple text-based PDFs but fail when documents contain:
- Handwritten annotations alongside typed text
- Complex tables and diagrams
- Scanned documents with varying quality
- Mixed content types within a single page, such as text embedded inside images alongside computer-generated text
Even ChatGPT’s default PDF processing capabilities have limitations when it comes to maintaining the visual context of the original document. For instance, when converting a table to text, ChatGPT may lose the structure, making it hard to understand the relationships between different data points. It also does not process text that is embedded inside images within PDFs.
PDF to Text: A Multimodal Approach
My solution begins with intelligent PDF classification and processing. Rather than applying a one-size-fits-all approach, the system first analyzes each PDF page to determine its content type:
import pytesseract
from pypdf import PageObject

def classify_pdf_page(page: PageObject) -> PDFPageClassification:
    """Classify a PDF page based on its content."""
    text = page.extract_text().strip()
    images = page.images
    if len(text) > 0 and len(images) > 0:
        return PDFPageClassification.TEXT_WITH_IMAGES
    elif len(text) > 0:
        return PDFPageClassification.TEXT_ONLY
    elif len(images) > 0:
        return PDFPageClassification.IMAGE_ONLY
    else:
        return PDFPageClassification.EMPTY
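The PDFPageClassification enum referenced above isn't shown in the snippet; a minimal sketch of what it could look like, with member names inferred from the return values (treat the exact definition as an assumption):
from enum import Enum, auto

class PDFPageClassification(Enum):
    """Possible content types of a single PDF page."""
    TEXT_WITH_IMAGES = auto()
    TEXT_ONLY = auto()
    IMAGE_ONLY = auto()
    EMPTY = auto()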
Based on this classification, we apply different processing strategies:
text: str | None = None
im_data_list: list[ImData] = []
# Dispatch on the page classification (the match subject is assumed here; the original snippet shows only the case arms)
match classify_pdf_page(page):
    case PDFPageClassification.TEXT_WITH_IMAGES:
        text = page.extract_text().strip()
        for i, image in enumerate(page.images):
            im_text = pytesseract.image_to_string(image.image)
            im_data_list.append({"im_index": i, "im_text": im_text})
    case PDFPageClassification.TEXT_ONLY:
        text = page.extract_text().strip()
    case PDFPageClassification.IMAGE_ONLY:
        for i, image in enumerate(page.images):
            im_text = pytesseract.image_to_string(image.image)
            im_data_list.append({"im_index": i, "im_text": im_text})
The extracted information is stored in a SQLite database that holds the original PDF blob, file metadata, and the processed page data. The extracted text, along with the metadata needed to retrieve the exact page blobs from the database, is also indexed in Typesense.
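As a rough illustration of that storage layer, a minimal SQLite schema might look like the following; the table and column names here are assumptions for the sketch, not the project's exact schema:
import sqlite3

# Hypothetical schema: table and column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS pdf_files (
    file_sha256            TEXT PRIMARY KEY,  -- SHA-256 of the PDF, used as the join key
    file_name              TEXT NOT NULL,
    pdf_blob               BLOB NOT NULL,     -- original PDF bytes, needed later for page-to-image conversion
    imported_in_typesense  INTEGER DEFAULT 0  -- flag used later for incremental sync
);
CREATE TABLE IF NOT EXISTS pdf_page_data (
    file_sha256  TEXT NOT NULL REFERENCES pdf_files(file_sha256),
    page_index   INTEGER NOT NULL,
    text         TEXT,  -- text extracted directly from the page, if any
    im_data      TEXT,  -- JSON-encoded list of {"im_index": ..., "im_text": ...} OCR results
    PRIMARY KEY (file_sha256, page_index)
);
"""

def init_db(path: str = "documents.db") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn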
Typesense
Setting Up
Before integrating with Typesense, you need a running instance. The easiest way to get started is using Docker Compose for local development.
Docker Compose Setup
Create a docker-compose.yml file in your project root:
version: "3"
services:
  typesense:
    image: typesense/typesense:0.25.2
    entrypoint: sh -c "/opt/typesense-server --data-dir /data --api-key=your-secure-api-key --enable-cors"
    ports:
      - "8108:8108"
    volumes:
      - typesense-data:/data
volumes:
  typesense-data:
    driver: local
Starting the Service
# Start Typesense in the background
docker compose up -d
# Verify the server is running
curl 'http://localhost:8108/keys' \
-X GET \
-H "X-TYPESENSE-API-KEY: your-secure-api-key"
# Expected response: { "keys": [] }
Typesense Integration
Typesense stores the processed text and image data for fast search.
from typesense import Client
from tqdm import tqdm

basic_fields = [
    {"name": "file_sha256", "type": "string", "facet": True},
    {"name": "page_index", "type": "int32"},
    {"name": "text_source_type", "type": "string"},
    {"name": "text", "type": "string"},
    {"name": "im_index", "type": "int32", "optional": True},
]
collection_name = "TypesenseTextIndex"

def update_typesense_text_index():
    # Initialize the Typesense client (port must match the Docker Compose mapping above)
    ts_client = Client({
        "api_key": "your-api-key",
        "nodes": [{"host": "localhost", "port": 8108, "protocol": "http"}],
        "connection_timeout_seconds": 60,
    })
    # Create the collection with the schema defined above
    schema = {
        "name": collection_name,
        "enable_nested_fields": False,
        "fields": basic_fields,
    }
    ts_client.collections.create(schema)
    # Get the list of PDF files that haven't been imported to Typesense yet
    pdf_files = db_utils.get_pdf_file_data_not_imported_in_typesense()
    # Ingest PDF files into Typesense
    all_docs = []
    # Process each PDF file and collect its page documents
    for file_sha, file_name in tqdm(pdf_files):
        # Retrieve page data for this PDF from the SQLite database
        pdf_page_data_result = db_utils.get_pdf_page_data_by_file_sha(file_sha=file_sha)
        # Extract the list of page data objects
        pdf_page_data_list = [i.pdf_page_data for i in pdf_page_data_result]
        # Serialize the page data models into Typesense documents
        docs_to_import = [document.model_dump(mode="json") for document in pdf_page_data_list]
        all_docs.extend(docs_to_import)
    # Add all documents to the Typesense collection using bulk import
    collection = ts_client.collections[collection_name]
    responses = collection.documents.import_(all_docs)
    for response in responses:
        if not response["success"]:
            print(f"Failed to import document to Typesense. Skipping. Failure reason: {response}")
Why This Architecture?
- SQLite provides ACID compliance and complex queries for metadata. This can be replaced with any SQL database.
- Typesense offers both full-text and vector search capabilities in a single, optimized package.
- Separation of concerns allows storage and search to be optimized and scaled independently.
- Incremental sync ensures only new documents are processed after the initial setup (a sketch of this check follows below).
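A minimal sketch of how that incremental check could be implemented on the SQLite side, reusing the hypothetical imported_in_typesense flag from the schema sketch above; the function names mirror the calls used elsewhere in this post, but their bodies are assumptions:
import sqlite3

def get_pdf_file_data_not_imported_in_typesense(conn: sqlite3.Connection) -> list[tuple[str, str]]:
    """Return (file_sha256, file_name) for PDFs not yet pushed to Typesense.

    Assumes the hypothetical imported_in_typesense flag on the pdf_files table.
    """
    return conn.execute(
        "SELECT file_sha256, file_name FROM pdf_files WHERE imported_in_typesense = 0"
    ).fetchall()

def mark_file_as_imported(conn: sqlite3.Connection, file_sha256: str) -> None:
    # Flip the flag after a successful bulk import so the next run skips this file.
    conn.execute(
        "UPDATE pdf_files SET imported_in_typesense = 1 WHERE file_sha256 = ?",
        (file_sha256,),
    )
    conn.commit()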
Comparison with Modern Libraries
While building this system, I evaluated several alternatives:
Traditional Libraries:
- PyPDF2/pypdf: Great for simple text extraction but struggles with complex layouts
- pdfplumber: Better table detection but limited OCR capabilities
- PyMuPDF (fitz): Fast processing but requires additional work for mixed content
Modern AI-Powered Solutions:
- Docling: Advanced document processing framework with custom model support and Hugging Face integration. Enables fully local processing without external API dependencies, making it ideal for batch processing large document volumes without additional costs. Supports multiple document formats and provides fine-grained control over extraction pipelines.
- Kreuzberg: Focuses on structured data extraction with good table support, but limited community adoption
- Marker: Recent AI-powered PDF to markdown converter, excellent for academic papers but struggles with complex business documents
- LangChain Document Loaders: Good integration with LLM workflows but basic PDF processing capabilities
Paid API-Based Solutions:
For production applications requiring high accuracy and reliability, several commercial APIs offer superior document processing capabilities:
- Azure Document Intelligence (formerly Form Recognizer): Excellent table extraction with high accuracy, processes small documents in under a minute. Supports 100+ languages and custom model training.
- AWS Textract: Robust table and form extraction with confidence scores. Good for financial documents and invoices. Pricing based on pages processed.
- Google Document AI: Strong OCR capabilities with specialized processors for different document types. Good integration with the Google Cloud ecosystem.
- Unstructured.io: Comprehensive document processing pipeline, but unreliable with tabular data extraction and can be overkill for focused use cases.
- LlamaIndex: Strong ecosystem integration, but I observed data loss with tables and slow processing speeds.
Key Advantage of Current Approach:
The custom classification system allows for optimized processing of each content type while maintaining cost control. For instance, handwritten annotations are processed through OCR while preserving the original text extraction for typed content. This hybrid approach ensures no information is lost while maintaining processing efficiency and avoiding per-page API costs for large document volumes.
Custom Plain Text Search Solution
The search system combines multiple retrieval strategies to provide comprehensive results.
Hybrid Search Implementation
Typesense supports hybrid search out of the box. The system uses a combination of text and vector search to retrieve relevant documents:
- q - The search query text; a bare wildcard can be used to match all documents.
- query_by - The fields to search against; field order matters for relevance ranking.
- vector_query - The vector search configuration, including:
  - An empty vector array, which tells Typesense to auto-embed the query text.
  - The alpha parameter for balancing semantic vs keyword search weights.
  - The ef parameter for the Hierarchical Navigable Small World (HNSW) search quality vs speed trade-off; it can be tuned to trade recall against latency.
- sort_by - Sorting criteria, including the tie-breaking mechanism and field limits.
- per_page - Number of results per page, subject to a maximum limit.
search_parameters = {
    "q": query,
    "query_by": "text,embedding",
    "vector_query": f"embedding:([], alpha: {vector_to_text_results_ratio}, ef:128)",
    "sort_by": "_text_match:desc,_vector_query(embedding:([])):asc",
    "per_page": 10,
}
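The query_by and vector_query above reference an embedding field that isn't part of the basic_fields schema shown earlier; Typesense can populate such a field automatically at index time. A minimal sketch of adding an auto-embedding field to the collection schema and issuing the hybrid search; the embedding model named here is an assumption, not necessarily the one used in the project:
# Auto-embedding field: Typesense builds the vector from the "text" field at index time.
embedding_field = {
    "name": "embedding",
    "type": "float[]",
    "embed": {
        "from": ["text"],
        "model_config": {"model_name": "ts/all-MiniLM-L12-v2"},  # assumed model choice
    },
}

schema = {
    "name": collection_name,
    "enable_nested_fields": False,
    "fields": basic_fields + [embedding_field],
}

# Issue the hybrid search against the collection.
results = ts_client.collections[collection_name].documents.search(search_parameters)
hits = results["hits"]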
Visual Context Preservation
Here’s where the system truly shines. Instead of just returning text snippets, it reconstructs the visual context:
PAGE_SEPARATOR = "\n--------------------- End of Page ------------------------\n\n"
DOC_SEPARATOR = "####################### End of Document #######################\n\n"
TEMP_IMAGES_FOLDER = ".temp"
def format_result_text(results: Sequence[DBPDFPageData]) -> str:
    return f"\n{PAGE_SEPARATOR}\n".join([result.text for result in results])

def run_query_llm(cursor: QuerySQLite, search_utils: SearchUtils, query: str, number_of_results: int = 10) -> str:
    # Get search results from the vector database
    results = run_query_on_vector_db(search_utils=search_utils, query=query, number_of_results=number_of_results)
    # Convert relevant pages back to images
    images_paths = write_result_images_to_image_folder_and_get_image_paths(
        output_folder=TEMP_IMAGES_FOLDER,
    )
    # Format the text context
    text = format_result_text(results=results)
    # Send both text and images to the LLM
    answer = call_llm(
        ai_provider=AIProvider.OPEN_AI,
        system_prompt=SYSTEM_PROMPT,
        user_message=query + "\n\nDocs:\n\n" + text,
        image_paths=images_paths,
    )
    return answer
The run_query_llm function orchestrates the entire search and response generation process through several key steps:
Step 1: Vector Database Search
results = run_query_on_vector_db(search_utils=search_utils, query=query, number_of_results=number_of_results)
This calls the Typesense hybrid search with the following process:
- Converts the user query into embeddings using the configured embedding model
- Performs both text-based and vector-based search simultaneously
- Uses the alpha parameter (default 0.9) to weight semantic vs keyword matching
- Returns the top-k most relevant document pages ranked by combined relevance score
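A sketch of what run_query_on_vector_db could look like; the helper name matches the call above, but the attribute names on search_utils and the database lookup are assumptions for illustration:
def run_query_on_vector_db(search_utils: SearchUtils, query: str, number_of_results: int = 10) -> list[DBPDFPageData]:
    # Hybrid search: full-text match plus auto-embedded vector match, weighted by alpha.
    search_parameters = {
        "q": query,
        "query_by": "text,embedding",
        "vector_query": "embedding:([], alpha: 0.9, ef:128)",
        "per_page": number_of_results,
    }
    response = search_utils.ts_client.collections[collection_name].documents.search(search_parameters)
    # Map each hit back to its (file_sha256, page_index) so the corresponding page
    # rows can be fetched from SQLite for text context and image reconstruction.
    docs = [hit["document"] for hit in response["hits"]]
    return [
        search_utils.db_utils.get_pdf_page_data(file_sha=doc["file_sha256"], page_index=doc["page_index"])
        for doc in docs
    ]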
Step 2: Image Generation from PDF Pages
images_paths = write_result_images_to_image_folder_and_get_image_paths(output_folder=TEMP_IMAGES_FOLDER)
This function performs several operations:
- Groups search results by document SHA256 hash to avoid duplicate processing
- Retrieves original PDF blobs from SQLite database using file hashes
- For small documents (< 4 pages): Converts entire PDF to images
- For larger documents: Converts only the specific pages that matched the search
- Uses pdf2image.convert_from_bytes() with optimized settings (100 DPI, PNG format, multi-threading)
- Saves images to a temporary folder and returns the file paths for LLM consumption
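A minimal sketch of this conversion step under those settings, using pdf2image; the helper name and the way pages are selected are assumptions that mirror the description above:
from pathlib import Path
from pdf2image import convert_from_bytes

def pdf_pages_to_images(pdf_blob: bytes, page_indices: list[int], total_pages: int,
                        output_folder: str = TEMP_IMAGES_FOLDER) -> list[str]:
    Path(output_folder).mkdir(exist_ok=True)
    image_paths: list[str] = []
    if total_pages < 4:
        # Small document: convert every page in a single call.
        images = convert_from_bytes(pdf_blob, dpi=100, fmt="png", thread_count=4)
        pages = list(enumerate(images))
    else:
        # Larger document: convert only the pages that matched the search.
        pages = []
        for idx in page_indices:
            # pdf2image pages are 1-based; the index stored in Typesense is 0-based.
            images = convert_from_bytes(pdf_blob, dpi=100, fmt="png",
                                        first_page=idx + 1, last_page=idx + 1)
            pages.append((idx, images[0]))
    for idx, image in pages:
        path = str(Path(output_folder) / f"page_{idx}.png")
        image.save(path)
        image_paths.append(path)
    return image_paths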
Step 3: Text Context Formatting
text = format_result_text(results=results)
This creates a structured text representation:
- Concatenates all matched page texts with clear page separators
- Maintains document boundaries using predefined separators
- Preserves the original text extraction while providing clear context boundaries
- Format: Page content + "\n--------------------- End of Page ------------------------\n\n"
Step 4: Multimodal LLM Integration
SYSTEM_PROMPT = """Given a user question and support documents, \
answer the user’s question.
"""
answer = call_llm(
    ai_provider=AIProvider.OPEN_AI,
    system_prompt=SYSTEM_PROMPT,
    user_message=query + "\n\nDocs:\n\n" + text,
    image_paths=images_paths,
)  # You can use LangChain or the OpenAI API directly. This is just a stub.
The final step combines everything for the LLM:
- System Prompt: Provides instructions on how to analyze documents and respond
- User Message: Combines the original query with formatted text context
- Images: Sends the converted PDF page images for visual analysis
- Provider: Uses OpenAI’s GPT-4 Vision for multimodal understanding
This approach ensures the LLM has both textual content for semantic understanding and visual context for structural comprehension, leading to more accurate and contextually aware responses.
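For reference, here is one way the OpenAI branch of call_llm could be implemented with the official Python SDK, sending the text context and the page images in a single multimodal request; the model name and helper structure are assumptions:
import base64
from openai import OpenAI

def call_openai_with_images(system_prompt: str, user_message: str, image_paths: list[str]) -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    # Build a multimodal user message: the text context first, then each page image.
    content = [{"type": "text", "text": user_message}]
    for path in image_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode("utf-8")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model works; this choice is an assumption
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": content},
        ],
    )
    return response.choices[0].message.content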
Why This Approach Works:
- Preserves Structure: Tables, diagrams, and layouts remain intact because the LLM's vision capability works directly on the page images.
- Context Awareness: LLM sees both text content and visual arrangement.
- Accuracy: Reduces hallucination by providing accurate context to the LLM to answer the query.
- Flexibility: Works with any vision-capable LLM. Similarly, Typesense lets you use any custom vector embedding model, and the default ones are free to use from its Hugging Face repository.
Conclusion
Building this document search engine taught me that effective information retrieval requires more than just throwing documents at an LLM. The key insights were:
- Content-aware processing: Different document types need different extraction strategies
- Hybrid storage: Combining traditional databases with modern search engines provides the best of both worlds
- Visual context matters: Preserving document structure is crucial for accurate information extraction
The system successfully handles complex, real-world documents that would challenge even modern AI tools. By combining traditional text processing with modern AI capabilities, it provides accurate, contextual answers while maintaining the visual integrity of the original documents.
As AI continues to evolve, this architecture provides a solid foundation for incorporating new capabilities while maintaining reliability and performance. The modular design ensures that individual components can be upgraded without rebuilding the entire system.
The future of document search lies not in replacing human understanding but in augmenting it with intelligent tools that preserve context, maintain accuracy, and provide actionable insights from our ever-growing private and public repositories of information. Custom solutions like this one can be built to support exactly that.