Chicago Historical Travel Assistant

Project Overview

The Chicago Historical Travel Assistant is a locally-hosted AI system designed to make historical Chicago documents interactive, searchable, and easy to explore. The system combines **PDF processing, large language model summarization, semantic search, and a web interface** to deliver factual answers from archival content.

Architecture and Workflow

The project is built as a modular pipeline with clear separation between data processing, AI summarization, retrieval, and user interface. The workflow has six stages:

PDF Processing: Historical PDFs are loaded from a local directory. Text is extracted with pdfplumber, then cleaned of artifacts from scanned pages while preserving the original structure of paragraphs and headings.
Text Chunking: Extracted text is split into 400–600 word chunks. This ensures that each portion of text is small enough for efficient AI summarization while retaining enough context for accurate information retrieval.
Summarization with Local LLM: Each chunk is processed by a locally-hosted LLaMA-based model through Ollama. Summaries are strictly factual, avoiding speculative language, and are limited to concise paragraphs. Retries are built in for failed chunk summarizations.
Data Storage: Summaries are saved in summary_chunks.json along with metadata such as the source PDF, chunk index, and a short preview. This structured format enables rapid access for search and display.
Semantic Retrieval: User queries are encoded using sentence-transformer embeddings and compared with chunk embeddings for relevance ranking. This allows the system to return the most contextually relevant summaries. The system previously supported keyword and year-based filtering but now relies on semantic search for accurate retrieval.
Web Interface: A clean Streamlit app allows users to input queries about Chicago's history, architecture, events, and landmarks. Results are displayed in expandable cards with source PDF and chunk citation, along with a relevance score computed from the semantic embeddings.
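
The Text Chunking stage above can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's actual code: the function name `chunk_words`, the 500-word target, and the rule for folding a short trailing chunk into its predecessor are all assumptions chosen to land roughly in the 400–600 word range described above.

```python
def chunk_words(text: str, target_words: int = 500) -> list[str]:
    """Split text into consecutive chunks of roughly `target_words` words.

    Hypothetical helper: the name, default size, and merge rule are
    illustrative assumptions, not taken from the project.
    """
    words = text.split()
    chunks = [
        " ".join(words[start:start + target_words])
        for start in range(0, len(words), target_words)
    ]
    # Fold a very short trailing chunk into the previous one so the last
    # chunk does not end up with too little context to summarize.
    if len(chunks) > 1 and len(chunks[-1].split()) < target_words // 5:
        tail = chunks.pop()
        chunks[-1] = chunks[-1] + " " + tail
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"word{i}" for i in range(1050))
    for index, chunk in enumerate(chunk_words(sample)):
        print(index, len(chunk.split()))
```

Splitting on whitespace ignores sentence and paragraph boundaries; a chunker that respects those boundaries would likely give the summarizer cleaner input, at the cost of less uniform chunk sizes.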

The modular design ensures that each stage of the workflow—data ingestion, summarization, retrieval, and UI—can be maintained and upgraded independently. For example, upgrading to a larger local LLM or integrating new historical PDFs does not require changing the retrieval or frontend code.
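
One way to get this kind of independence is to have the pipeline depend only on a small interface rather than on a specific model. The sketch below assumes a summarizer is any callable that maps chunk text to summary text; `Summarizer`, `summarize_chunks`, and `first_sentence_summarizer` are hypothetical names, and the placeholder summarizer merely stands in for the Ollama-backed model call.

```python
from typing import Callable

# Assumed interface: any function from chunk text to summary text.
Summarizer = Callable[[str], str]


def summarize_chunks(chunks: list[str], summarizer: Summarizer) -> list[str]:
    """Run whichever summarizer is supplied over every chunk, in order."""
    return [summarizer(chunk) for chunk in chunks]


def first_sentence_summarizer(chunk: str) -> str:
    """Placeholder standing in for the local LLM: returns the first sentence."""
    return chunk.split(". ")[0].strip().rstrip(".") + "."


if __name__ == "__main__":
    chunks = ["The fire began in 1871. It spread north across the river."]
    print(summarize_chunks(chunks, first_sentence_summarizer))
```

Swapping in a larger model then means passing a different callable to `summarize_chunks`; the code that stores, retrieves, or displays the summaries is unaffected.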

Final Outcome

The final Streamlit app lets users explore Chicago's rich history interactively. Users receive **factually accurate, cited summaries** of historical events, landmarks, and architecture sourced directly from archival PDFs. Semantic search lets queries like "why is the river reversed" return the most contextually relevant summaries. A live version of the app is available: Chicago Historical Travel Assistant.
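
To make the ranking step concrete, here is a minimal sketch of scoring chunk embeddings against a query embedding. It assumes cosine similarity as the relevance measure, which the description above does not state explicitly, and it uses tiny hand-written vectors in place of real embeddings. In the actual app the vectors would come from a sentence-transformer model (for example via its `encode` method); the function names here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: list[float],
    chunk_vecs: list[list[float]],
    top_k: int = 3,
) -> list[tuple[int, float]]:
    """Return (chunk_index, score) pairs for the top_k most similar chunks."""
    scored = [
        (index, cosine_similarity(query_vec, vec))
        for index, vec in enumerate(chunk_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 3-dimensional vectors standing in for real embeddings.
    query = [1.0, 0.0, 0.0]
    chunks = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
    for index, score in rank_chunks(query, chunks, top_k=2):
        print(index, round(score, 3))
```

The score returned here is the kind of value that could be shown as the relevance score on each result card, with the chunk index used to look up the stored summary and its source PDF.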

Technical Highlights

End-to-end AI pipeline integrating PDF extraction, text chunking, summarization, and semantic retrieval.
Local LLaMA-based model via Ollama ensures data privacy and offline capability.
Semantic search using sentence-transformer embeddings for highly relevant results.
JSON-based storage of summaries and metadata for fast retrieval and easy expansion.
Interactive Streamlit interface with expandable results, source citations, and clean design.
Robust error handling for summarization failures, including automatic retries.
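
The retry and storage points above can be illustrated together. The sketch below is written under stated assumptions: `summarize` is any callable that may raise or return an empty string, the attempt count and delay are arbitrary defaults, and the record field names (`source_pdf`, `chunk_index`, `preview`, `summary`) are guesses at the metadata listed earlier rather than the project's actual schema. Only the file name `summary_chunks.json` comes from the description.

```python
import json
import time
from typing import Callable, Optional


def summarize_with_retries(
    chunk: str,
    summarize: Callable[[str], str],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
) -> Optional[str]:
    """Call `summarize` until it returns non-empty text or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            summary = summarize(chunk)
            if summary.strip():
                return summary
        except Exception:
            # Treat any failure as retryable in this sketch.
            pass
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    return None  # Caller decides whether to skip or log the chunk.


def build_record(
    source_pdf: str, chunk_index: int, chunk: str, summary: str,
    preview_chars: int = 120,
) -> dict:
    """Assemble one entry; field names are illustrative assumptions."""
    return {
        "source_pdf": source_pdf,
        "chunk_index": chunk_index,
        "preview": chunk[:preview_chars],
        "summary": summary,
    }


def save_records(records: list[dict], path: str = "summary_chunks.json") -> None:
    """Write all records to a single JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, indent=2)
```

Returning `None` after the final attempt, rather than raising, lets a long batch run continue past a single bad chunk; a stricter pipeline might prefer to raise and stop instead.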