📚 Document Processing in LangChain RAG
The Library Analogy 📖
Imagine you have a HUGE library with thousands of books. You want to find answers to questions—but reading every book would take forever!
What if you could:
- Bring books into your library (Document Loading)
- Label each book with helpful info (Metadata)
- Cut books into easy-to-read cards (Text Splitting)
- Group cards by meaning (Semantic Chunking)
- Make cards the perfect size (Chunk Tuning)
- Clean and organize cards (Document Transformers)
That’s exactly what Document Processing does for AI! Let’s explore each step.
1. Document Loading Strategies 📥
What Is It?
Document loading is how we bring information INTO our AI system. Just like opening a book before you can read it!
The Story
Think of a librarian who can read ANY type of book:
- 📄 Regular paper books (PDF files)
- 💻 Computer screens (Web pages)
- 📝 Handwritten notes (Text files)
- 📊 Number charts (CSV/Excel)
LangChain has special “readers” called Document Loaders for each type!
Common Loaders
```python
# Load a PDF file
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("story.pdf")
docs = loader.load()

# Load a web page
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://example.com")
docs = loader.load()

# Load a text file
from langchain_community.document_loaders import TextLoader

loader = TextLoader("notes.txt")
docs = loader.load()
```
🎯 Key Point
Different files need different loaders—like needing different keys for different doors!
```mermaid
graph TD
    A["Your Files"] --> B{File Type?}
    B -->|PDF| C["PyPDFLoader"]
    B -->|Web| D["WebBaseLoader"]
    B -->|Text| E["TextLoader"]
    B -->|CSV| F["CSVLoader"]
    C --> G["Documents Ready!"]
    D --> G
    E --> G
    F --> G
```
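Since every file type needs its own loader, a handy pattern is a small dispatch table keyed by file extension. Here's a minimal sketch of that idea—`load_any` and the `LOADERS` table are hypothetical names for this example, not LangChain APIs:

```python
from pathlib import Path

from langchain_community.document_loaders import (
    CSVLoader,
    PyPDFLoader,
    TextLoader,
)

# Hypothetical lookup table: file extension -> loader class
LOADERS = {
    ".pdf": PyPDFLoader,
    ".txt": TextLoader,
    ".csv": CSVLoader,
}

def load_any(path: str):
    """Pick the right loader for a file and return its Documents."""
    suffix = Path(path).suffix.lower()
    if suffix not in LOADERS:
        raise ValueError(f"No loader registered for {suffix!r} files")
    return LOADERS[suffix](path).load()

docs = load_any("story.pdf")
```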
2. Document Object and Metadata 🏷️
What Is It?
A Document in LangChain is like a card with TWO parts:
- page_content - The actual text (what the book says)
- metadata - Extra info (who wrote it, when, which page)
The Story
Imagine a library card for each book:
- Front: The story itself
- Back: Title, author, page number, date added
This “back of the card” info helps you find things FAST!
Example
```python
from langchain_core.documents import Document

# Create a document with metadata
doc = Document(
    page_content="The sun is a star.",
    metadata={
        "source": "science_book.pdf",
        "page": 42,
        "author": "Dr. Smith",
        "topic": "astronomy",
    },
)

# Access the parts
print(doc.page_content)       # "The sun is a star."
print(doc.metadata["page"])   # 42
```
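You rarely have to build Documents by hand—most loaders fill in metadata for you. For example, PyPDFLoader records the source file and page number of every page it loads (the exact keys can vary a little by version):

```python
from langchain_community.document_loaders import PyPDFLoader

docs = PyPDFLoader("science_book.pdf").load()

# Each page becomes one Document with metadata attached by the loader
print(docs[0].metadata)
# e.g. {'source': 'science_book.pdf', 'page': 0}
```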
Why Metadata Matters 🌟
| Without Metadata | With Metadata |
|---|---|
| “The answer is 42” | “The answer is 42” from page 5 of math_guide.pdf |
| No context | Full context! |
| Can’t verify | Can check source |
🎯 Key Point
Metadata is your treasure map—it shows exactly WHERE information came from!
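Metadata also lets you narrow things down before the AI ever reads a chunk. A minimal sketch, using plain Python on a list of Documents:

```python
from langchain_core.documents import Document

docs = [
    Document(page_content="The sun is a star.", metadata={"topic": "astronomy"}),
    Document(page_content="2 + 2 = 4.", metadata={"topic": "math"}),
]

# Keep only the astronomy cards
astronomy_docs = [d for d in docs if d.metadata.get("topic") == "astronomy"]
print(len(astronomy_docs))  # 1
```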
3. Text Splitting Strategies ✂️
What Is It?
Text splitting means cutting BIG documents into SMALL pieces (called “chunks”).
The Story
Imagine you have a 500-page book. Your AI brain can only look at one small piece at a time. So we need to:
- Cut the book into small cards
- Make sure each card makes sense on its own
- Keep related ideas together
Why Split?
- AI models have limited memory (context window)
- Smaller chunks = faster searching
- Better chunks = better answers
Common Splitting Methods
1. Character Splitter (Simple cuts)
```python
from langchain_text_splitters import CharacterTextSplitter

splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
)
chunks = splitter.split_text(long_text)  # long_text: any big string you loaded
```
2. Recursive Splitter (Smart cuts)
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_documents(docs)
```
How Recursive Splitting Works
```mermaid
graph TD
    A["Big Document"] --> B{Try split by paragraph}
    B -->|Too big?| C{Try split by line}
    C -->|Too big?| D{Try split by word}
    D -->|Still big?| E["Split by character"]
    B -->|Good size!| F["Done ✓"]
    C -->|Good size!| F
    D -->|Good size!| F
    E --> F
```
🎯 Key Point
RecursiveCharacterTextSplitter is the BEST for most cases—it keeps ideas together!
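You can watch this behavior in a tiny, self-contained run (the sample text is made up):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "Stars are giant balls of gas.\n\n"
    "They shine because of fusion in their cores.\n\n"
    "Our sun is the closest star to Earth."
)

splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=0)
for chunk in splitter.split_text(text):
    print(repr(chunk))
# Paragraph breaks are tried first, so each thought stays whole.
```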
4. Semantic Chunking 🧠
What Is It?
Semantic chunking splits text by MEANING, not just by size. It keeps related ideas together!
The Story
Regular splitting is like cutting a cake with a ruler—you might cut through the best part!
Semantic splitting is like cutting between layers—each piece is complete and delicious!
How It Works
- Look at each sentence
- Check if it’s SIMILAR to nearby sentences
- Keep similar sentences together
- Split when meaning CHANGES
Example
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Create semantic chunker
chunker = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
)

# Split by meaning!
chunks = chunker.split_text(long_text)  # long_text: any big string you loaded
```
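Under the hood, the idea is roughly: embed each sentence, then split wherever neighbouring sentences stop being similar. Here's a minimal sketch of that intuition using OpenAIEmbeddings and numpy—not how SemanticChunker is actually implemented internally, just the core idea:

```python
import numpy as np
from langchain_openai import OpenAIEmbeddings

sentences = [
    "The sun is a star.",
    "Stars shine because of fusion.",
    "My cat likes to sleep all day.",
    "Cats are famous for napping.",
]

# Embed every sentence
vectors = np.array(OpenAIEmbeddings().embed_documents(sentences))

# Cosine similarity between each sentence and the next one
norms = np.linalg.norm(vectors, axis=1)
sims = [
    float(vectors[i] @ vectors[i + 1] / (norms[i] * norms[i + 1]))
    for i in range(len(sentences) - 1)
]

# Split wherever similarity drops below a hand-picked threshold
chunks, current = [], [sentences[0]]
for sentence, sim in zip(sentences[1:], sims):
    if sim < 0.8:  # assumption: tune this per corpus
        chunks.append(" ".join(current))
        current = []
    current.append(sentence)
chunks.append(" ".join(current))

print(chunks)  # expect a sun/star chunk and a cat chunk
```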
Regular vs Semantic Splitting
| Regular Splitting | Semantic Splitting |
|---|---|
| Cuts at character count | Cuts at meaning boundaries |
| May split mid-sentence | Keeps complete thoughts |
| Fast but rough | Smarter but slower |
| Good for simple text | Great for complex topics |
🎯 Key Point
Use semantic chunking when your content has different topics mixed together!
5. Chunk Size and Overlap Tuning ⚙️
What Is It?
- Chunk size = how big each piece should be
- Overlap = how much neighbouring pieces should share at the edges
The Story
Imagine cutting a photo into puzzle pieces:
- Too small: You can’t see the picture in each piece
- Too big: Pieces don’t fit together well
- Overlap: Like puzzle pieces with matching edges—they connect better!
The Magic Numbers
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # size of each chunk, in characters
    chunk_overlap=50,  # shared text between neighbouring chunks
)
```
Visualizing Overlap
```
Chunk 1: [AAAAAAAAAABBBB]
Chunk 2: [BBBBCCCCCCCCCC]
Chunk 3: [CCCCDDDDDDDDDD]
```

The "BBBB" and "CCCC" parts OVERLAP! This keeps context connected.
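You can watch the overlap happen with a tiny runnable example (the sample sentence is made up):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "the quick brown fox jumps over the lazy dog near the river bank"

splitter = RecursiveCharacterTextSplitter(
    chunk_size=30,
    chunk_overlap=10,
    separators=[" "],
)
for chunk in splitter.split_text(text):
    print(repr(chunk))
# Neighbouring chunks repeat a few words; that's the overlap at work.
```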
Tuning Guide
| Chunk Size (characters) | Best For |
|---|---|
| 100-300 | Short Q&A, definitions |
| 300-500 | General documents |
| 500-1000 | Technical/detailed content |
| 1000+ | Long-form analysis |

| Overlap | When To Use |
|---|---|
| 0% | Independent facts |
| 10-15% | General documents |
| 20-30% | Connected narratives |
Example Tuning
```python
# For a FAQ document (short answers)
faq_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=20,
)

# For a research paper (dense content)
paper_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
)
```
🎯 Key Point
Start with 500/50 (chunk_size/overlap), then adjust based on your results!
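When in doubt, measure! Split the same text with a few candidate settings and eyeball the result. A minimal sketch (the repeated sample text stands in for your real documents):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "All about stars. " * 200  # stand-in for your real document text

for size, overlap in [(200, 20), (500, 50), (800, 100)]:
    chunks = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=overlap,
    ).split_text(text)
    avg = sum(len(c) for c in chunks) / len(chunks)
    print(f"size={size} overlap={overlap}: {len(chunks)} chunks, avg {avg:.0f} chars")
```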
6. Document Transformers 🔄
What Is It?
Document Transformers clean up and improve your chunks AFTER splitting.
The Story
After cutting your book into cards, you might want to:
- Remove duplicate cards
- Clean up messy text
- Add more helpful labels
- Translate languages
Common Transformers
1. Remove Duplicates
```python
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain_openai import OpenAIEmbeddings

# "filter" shadows a Python builtin, so use a descriptive name
redundant_filter = EmbeddingsRedundantFilter(
    embeddings=OpenAIEmbeddings(),
)
unique_docs = redundant_filter.transform_documents(docs)
```
2. Clean HTML
```python
from langchain_community.document_transformers import Html2TextTransformer

transformer = Html2TextTransformer()
clean_docs = transformer.transform_documents(html_docs)
```
3. Reorder for Long Contexts

```python
from langchain_community.document_transformers import LongContextReorder

# Moves the most relevant documents to the start and end of the list,
# where models pay the most attention (the "lost in the middle" problem)
reorderer = LongContextReorder()
ordered_docs = reorderer.transform_documents(docs)
```
Transformer Pipeline
```mermaid
graph TD
    A["Raw Documents"] --> B["Split into Chunks"]
    B --> C["Remove Duplicates"]
    C --> D["Clean HTML/Formatting"]
    D --> E["Reorder for Context"]
    E --> F["Ready for AI! ✨"]
```
🎯 Key Point
Transformers are your clean-up crew—they make your chunks perfect for AI!
🎉 The Complete Pipeline
Now you understand the WHOLE process:
```mermaid
graph TD
    A["📄 Your Files"] --> B["📥 Load Documents"]
    B --> C["🏷️ Add Metadata"]
    C --> D["✂️ Split Text"]
    D --> E["🧠 Semantic Chunking"]
    E --> F["⚙️ Tune Size/Overlap"]
    F --> G["🔄 Transform & Clean"]
    G --> H["🚀 Ready for RAG!"]
```
Full Example
```python
# 1. Load
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("my_book.pdf")
docs = loader.load()

# 2. Split with good settings
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(docs)

# 3. Transform (remove duplicates)
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain_openai import OpenAIEmbeddings

redundant_filter = EmbeddingsRedundantFilter(embeddings=OpenAIEmbeddings())
final_docs = redundant_filter.transform_documents(chunks)

# Now ready for vector storage & RAG! 🎯
```
🌟 Summary
| Step | Purpose | Think Of It As… |
|---|---|---|
| Loading | Bring files in | Opening a book |
| Metadata | Add helpful labels | Library card |
| Splitting | Cut into pieces | Making flashcards |
| Semantic | Split by meaning | Grouping related cards |
| Tuning | Perfect the size | Finding the sweet spot |
| Transform | Clean and polish | Final inspection |
💪 You Did It!
You now understand how to prepare documents for RAG:
- ✅ Load any file type
- ✅ Attach useful metadata
- ✅ Split text smartly
- ✅ Keep meanings together
- ✅ Tune for best results
- ✅ Clean and transform
Your AI can now read your library like a pro! 📚🤖
