AI & Fintech

ChatCFO - RAG Financial Analyst

AI-powered document intelligence system for instant financial report querying with source citations.

🎯 Business Challenge

Financial auditors and analysts were spending 4-6 hours per week manually searching through dense PDF reports (10-Ks, earnings calls, audit reports) to answer stakeholder questions. This process was:

  • Slow & Inefficient: Ctrl+F searches missed semantic matches ("revenue growth" vs "top-line expansion").
  • Error-Prone: Copy-pasting snippets without context led to misinterpretation.
  • Non-Verifiable: Answers lacked citations, making it hard to validate sources.
  • Siloed Knowledge: Each analyst had their own "mental index" of where information lived in documents.

"We'd get questions like 'What was EBITDA margin in Q2 2023?' and it would take 20 minutes to find the answer buried in a 200-page earnings supplement."

💡 Solution Architecture

I built ChatCFO, a Retrieval-Augmented Generation (RAG) system that acts as an AI financial analyst, answering questions with precise citations to source documents.

Phase 1: Document Ingestion & Chunking

  • PDFs are parsed using PyPDF2, extracting text while preserving page numbers.
  • Text is split into semantic chunks (~500 tokens each) using LangChain's RecursiveCharacterTextSplitter.
  • Each chunk is tagged with metadata: document name, page number, section heading.
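
The ingestion steps above can be sketched in plain Python. The `split_with_overlap` helper below is a hypothetical stand-in for LangChain's `RecursiveCharacterTextSplitter`, and it counts whitespace-separated words rather than model tokens:

```python
def split_with_overlap(words, chunk_size=500, overlap=50):
    """Split a word list into fixed-size chunks that share `overlap` words."""
    step = chunk_size - overlap
    return [words[i:i + chunk_size]
            for i in range(0, max(len(words) - overlap, 1), step)]

# Tag each chunk with the metadata the retriever will cite later.
tokens = ["tok"] * 1200  # stand-in for an extracted page of text
chunks = split_with_overlap(tokens)
records = [
    {"doc": "10K_report.pdf", "page": 1, "chunk_id": i, "text": " ".join(c)}
    for i, c in enumerate(chunks)
]
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk.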

Phase 2: Vector Embedding & Indexing

  • Chunks are converted to dense vector embeddings using OpenAI's text-embedding-ada-002.
  • Embeddings are stored in ChromaDB, a vector database optimized for similarity search.
  • This creates a "semantic index" where similar concepts are mathematically close.
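
A toy illustration of what "mathematically close" means, using made-up 3-dimensional vectors in place of real 1536-dimensional ada-002 embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": similar concepts like "revenue growth" and
# "top-line expansion" land close together; unrelated text does not.
index = {
    "revenue growth":      [0.90, 0.10, 0.00],
    "top-line expansion":  [0.85, 0.15, 0.05],
    "office lease expiry": [0.00, 0.20, 0.90],
}
query_vec = [0.88, 0.12, 0.02]  # pretend embedding of the user's question
best = max(index, key=lambda k: cosine(query_vec, index[k]))
```

Retrieval is simply "rank every stored chunk by this similarity and keep the top k" — which is what ChromaDB does efficiently at scale.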

Phase 3: Retrieval-Augmented Generation (RAG)

  • User asks a question: "What was the operating margin in FY2023?"
  • Retrieval: Question is embedded, top-5 most similar chunks are retrieved from ChromaDB.
  • Augmentation: Retrieved chunks are injected into the prompt as context.
  • Generation: The LLM (GPT-4 during prototyping, later Llama 3 70B) generates an answer grounded in the retrieved context, including page citations.
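
The augmentation step can be sketched as a prompt builder. The chunk, its metadata, and the 18.3% figure below are all hypothetical:

```python
def build_rag_prompt(question, retrieved):
    """Inject retrieved chunks (with citation metadata) into the LLM prompt."""
    context = "\n\n".join(
        f"[{c['doc']} p.{c['page']}] {c['text']}" for c in retrieved
    )
    return (
        "Answer only using the context below. If the answer is not in the "
        "context, say 'Not found in document'. Cite page numbers.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

retrieved = [
    {"doc": "10K_report.pdf", "page": 42,
     "text": "Operating margin was 18.3% in FY2023."},
]
prompt = build_rag_prompt("What was the operating margin in FY2023?", retrieved)
```

Because the citation tags travel inside the context, the model can repeat them back in its answer, which is what makes every response verifiable.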

Technical Implementation

from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load and chunk PDF
loader = PyPDFLoader("10K_report.pdf")
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, 
    chunk_overlap=50
)
chunks = text_splitter.split_documents(documents)

# 2. Create vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Build RAG chain (GPT-4 shown here; production later switched to
#    Llama 3 70B served via an OpenAI-compatible endpoint)
llm = ChatOpenAI(model="gpt-4", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True
)

# 4. Query the system
query = "What was operating margin in FY2023?"
result = qa_chain({"query": query})

print(result["result"])  # Answer
print(result["source_documents"])  # Citations

📈 Key Results & Business Impact

Quantified Outcomes

  • 95% Time Reduction: Queries that took 20 minutes now take 10 seconds.
  • 92% Accuracy: Validated against manual audit on 100 test questions.
  • 200+ Pages Indexed: Can handle dense financial documents (10-Ks, S-1s, audit reports).
  • Verifiable Answers: Every answer includes page number citations for validation.

Use Cases Unlocked

  • Due Diligence: "What are the top 5 risk factors mentioned in the 10-K?"
  • Comparative Analysis: "Compare EBITDA margins across Q1, Q2, Q3, Q4."
  • Regulatory Compliance: "Does the MD&A section discuss climate-related risks?"
  • Executive Briefings: "Summarize key changes in accounting policies this year."

Strategic Insights

  • Semantic Search Power: Found relevant information even when exact keywords didn't match (e.g., "revenue" vs "sales").
  • Cross-Document Synthesis: Could compare metrics across multiple years' reports.
  • Audit Trail: Citations provide an audit trail for compliance teams.

🛠️ Technical Methodology

RAG Pipeline Architecture

  1. Document Processing: PDF → Text extraction → Metadata tagging.
  2. Chunking Strategy: 500-token chunks with 50-token overlap to preserve context.
  3. Embedding: Convert chunks to 1536-dimensional vectors.
  4. Indexing: Store in ChromaDB with HNSW (Hierarchical Navigable Small World) index for fast retrieval.
  5. Query Execution: Embed query → Retrieve top-k → Generate answer.

Model Selection & Optimization

  • LLM Choice: Started with GPT-3.5, moved to GPT-4 for stronger reasoning, then switched to Llama 3 (70B) for comparable quality at much lower cost.
  • Temperature Tuning: Set to 0 for factual accuracy (no creativity needed).
  • Prompt Engineering: Explicit instructions: "Answer only using provided context. If information is not in context, say 'Not found in document'."

Accuracy Validation

  • Created a test set of 100 questions with known answers from financial reports.
  • Measured accuracy: 92% exact match, 98% "acceptable answer" (minor wording differences).
  • Failure modes: Complex calculations (e.g., "What is the YoY growth rate?") required explicit computation.
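
A minimal sketch of the scoring loop. The stub answers and the loose "acceptable" normalization below are illustrative assumptions, not the actual validation harness:

```python
def evaluate(qa_fn, test_set):
    """Score a QA function: strict exact match plus a looser normalized match."""
    exact = acceptable = 0
    for question, gold in test_set:
        pred = qa_fn(question)
        if pred == gold:
            exact += 1
        # "Acceptable" forgives case and trailing punctuation differences.
        if pred.strip().lower().rstrip(".") == gold.strip().lower().rstrip("."):
            acceptable += 1
    n = len(test_set)
    return exact / n, acceptable / n

# Hypothetical stub standing in for the real RAG chain.
answers = {"Q1 revenue?": "$4.2B", "FY2023 operating margin?": "18.3%."}
gold_set = [("Q1 revenue?", "$4.2B"), ("FY2023 operating margin?", "18.3%")]
exact, acceptable = evaluate(lambda q: answers[q], gold_set)
```

The gap between the two scores captures exactly the "minor wording differences" bucket reported above.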

🎓 Lessons Learned

What Worked

  • Chunk Overlap: 50-token overlap prevented answers from being split across chunks.
  • Metadata Filtering: Adding page numbers and section headings to chunks improved citation quality.
  • Hybrid Search: Combining vector similarity with keyword filtering (BM25) boosted accuracy by 5%.
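
One common way to merge a vector ranking with a BM25 keyword ranking is reciprocal rank fusion (RRF). The sketch below shows the idea; it is not necessarily the exact fusion method this project used:

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: merge several ranked lists of chunk ids.

    Each chunk scores 1/(k + rank) per list it appears in; chunks ranked
    highly by multiple retrievers accumulate the largest totals.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["c7", "c3", "c9"]  # semantic-similarity order
bm25_hits   = ["c7", "c1", "c3"]  # keyword (BM25) order
fused = rrf([vector_hits, bm25_hits])
```

A chunk that both retrievers like ("c7" here) wins even if neither ranked it alone with a dominant score, which is why hybrid search tends to beat either method by a few points.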

Challenges Overcome

  • PDF Parsing Issues: Tables and charts broke text extraction; had to use layout-aware parsers (pdfplumber).
  • Context Window Limits: Initial 4k token limit required careful chunk sizing; upgraded to 32k context models.
  • Hallucination Risk: LLM sometimes "fabricated" numbers; mitigated by strict prompt constraints and citation requirements.
  • Cost Management: GPT-4 was expensive at scale; switched to Llama 3 for 80% cost reduction.

🚀 Future Enhancements

  • Multi-Document Querying: "Compare Apple's and Microsoft's R&D spending" across multiple 10-Ks.
  • Chart/Table Extraction: Use vision models (GPT-4 Vision) to parse financial tables directly from PDFs.
  • Conversational Memory: Add chat history to enable follow-up questions ("What about Q3?" after asking about Q2).
  • Fine-Tuning: Fine-tune Llama on domain-specific financial language for better accuracy.
  • Excel Integration: Generate Excel formulas or download query results as CSV.

📚 Technical Stack Deep Dive

  • LangChain: Orchestration framework for RAG pipelines, retrieval chains, document loaders.
  • Llama 3 (70B): Open-source LLM for answer generation with strong reasoning capabilities.
  • ChromaDB: Vector database for semantic search, optimized for small-to-medium datasets.
  • OpenAI Embeddings: text-embedding-ada-002 for converting text to vectors.
  • PyPDF2 / pdfplumber: PDF parsing with layout awareness.
  • Streamlit: Web UI for user interaction (optional frontend).