LangChain RAG Pipeline: Complete Production Tutorial
Category: AI Coding | Difficulty: Intermediate | Updated: 2026-05-28
Build a production-ready RAG pipeline with LangChain. Covers document loading, chunking, vector stores, retrieval chains, and streaming responses with OpenAI embeddings and ChromaDB.
Why LangChain for RAG?
LangChain is a widely used framework for building RAG systems. It provides reusable components for document loading, text splitting, vector stores, and retrieval chains, so you can focus on your application logic instead of boilerplate integration code.
1. Setup
pip install langchain langchain-community langchain-text-splitters langchain-openai langchain-chroma chromadb "unstructured[md]"
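The OpenAI-backed classes used in the later steps (OpenAIEmbeddings and ChatOpenAI) read the API key from the OPENAI_API_KEY environment variable by default. A minimal pre-flight check might look like the sketch below; `missing_env_vars` is a hypothetical helper introduced here for illustration, not part of LangChain.

```python
import os
from typing import Mapping, Optional


def missing_env_vars(required: list[str], env: Optional[Mapping[str, str]] = None) -> list[str]:
    """Return the names in `required` that are unset or empty in `env`.

    `env` defaults to the real process environment; passing a plain dict
    makes the helper easy to test. (Hypothetical helper, not a LangChain API.)
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    missing = missing_env_vars(["OPENAI_API_KEY"])
    if missing:
        print(f"Set these environment variables before running the pipeline: {missing}")
    else:
        print("Environment looks ready.")
```

Running this before the indexing step surfaces a missing key immediately, rather than partway through embedding the documents.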
2. Load & Split Documents
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load every Markdown file under ./docs/.
# DirectoryLoader's default file loader relies on the `unstructured` package.
loader = DirectoryLoader("./docs/", glob="**/*.md")
documents = loader.load()

# chunk_size and chunk_overlap are measured in characters by default.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", " ", ""],
)
chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
# Example output (depends on your documents): Split into 47 chunks
3. Create Vector Store
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
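vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)
# _collection is a private attribute of the Chroma wrapper; it is used here only as a quick sanity check.
print(f"Indexed {vectorstore._collection.count()} vectors")
4. Build the RAG Chain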
# These imports match the LangChain 0.2/0.3 package layout. On LangChain 1.x the
# chain helpers are expected to live in the separate langchain-classic package
# (langchain_classic.chains); adjust the two imports below if they fail.
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate

llm = ChatOpenAI(model="gpt-4o", temperature=0)

system_prompt = (
    "You are a helpful assistant. Answer the question using ONLY the provided context. "
    'If you cannot answer from the context, say "I don\'t have enough information." '
    "Always cite the source document names.\n\n"
    "Context: {context}"
)
prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{input}"),
])

# Prefix each chunk with its source path so the model can actually cite it.
# By default only page_content is placed into {context}.
document_prompt = PromptTemplate.from_template("Source: {source}\n{page_content}")
document_chain = create_stuff_documents_chain(llm, prompt, document_prompt=document_prompt)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
rag_chain = create_retrieval_chain(retriever, document_chain)

# Query
response = rag_chain.invoke({"input": "What are the system requirements?"})
print(response["answer"])
print(f"Sources: {[d.metadata['source'] for d in response['context']]}")
5. Key Parameters to Tune
- chunk_size: 500-1500 characters (RecursiveCharacterTextSplitter counts characters by default, not tokens). Smaller = more precise, larger = more context
- chunk_overlap: 10-20% of chunk size. Prevents information loss at boundaries
- k (retrieved chunks): 3-5 for most use cases. More = richer context but more tokens
- temperature: 0 for factual Q&A, 0.3-0.7 for creative synthesis
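To make the chunk_size and chunk_overlap trade-off concrete, here is a small pure-Python sketch of fixed-size chunking with overlap. It is a simplified stand-in and not how RecursiveCharacterTextSplitter works internally (the real splitter also respects the separators list); `sliding_window_chunks` and the sample text are hypothetical, chosen only for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split `text` into chunks of up to `chunk_size` characters.

    Each chunk starts `chunk_size - chunk_overlap` characters after the
    previous one, so neighbouring chunks share `chunk_overlap` characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "The service requires 8 GB of RAM and Python 3.10 or newer."
no_overlap = sliding_window_chunks(sample, chunk_size=20, chunk_overlap=0)
with_overlap = sliding_window_chunks(sample, chunk_size=20, chunk_overlap=5)

for label, chunks in [("overlap=0", no_overlap), ("overlap=5", with_overlap)]:
    print(label)
    for chunk in chunks:
        print(f"  {chunk!r}")
```

With overlap, the last few characters of each chunk reappear at the start of the next one, so a phrase that straddles a boundary has a better chance of surviving intact in at least one chunk. The cost is more chunks to embed and store.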