RAG Recommendations That Actually Work

Retrieval-Augmented Generation gets pitched as a chatbot trick. It's actually a great recommendation primitive. Here's the shape of an engine I built over 10,000+ service listings.

Embed once, query forever

Generate embeddings for every listing with text-embedding-3-small, store the vectors in Pinecone, and you've turned fuzzy semantic search into a fast nearest-neighbor lookup.

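from openai import OpenAI
from pinecone import Pinecone

client = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone().Index("listings")  # reads PINECONE_API_KEY; the index name is illustrative
# Embed one listing's text, then store the vector with its category as filterable metadata.
vector = client.embeddings.create(
    model="text-embedding-3-small",
    input=listing_text,
).data[0].embedding

index.upsert(vectors=[(listing_id, vector, {"category": category})])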
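With thousands of listings, calling the API once per item gets slow, so a batched version is worth sketching. This is a rough outline under assumptions: `embed_and_upsert`, the `batched` helper, the listing dict keys (`id`, `text`, `category`) and the batch size of 100 are all illustrative, and `client` and `index` are passed in so the function can be exercised with stand-ins.

```python
from typing import Iterator


def batched(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_and_upsert(client, index, listings: list[dict], batch_size: int = 100) -> int:
    """Embed listings in batches and upsert them; returns the number of vectors written."""
    written = 0
    for batch in batched(listings, batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-small",
            input=[item["text"] for item in batch],  # one request embeds the whole batch
        )
        # Each result carries the index of its input; sort on it to pair vectors with listings.
        embeddings = [d.embedding for d in sorted(response.data, key=lambda d: d.index)]
        index.upsert(vectors=[
            (item["id"], vec, {"category": item["category"]})
            for item, vec in zip(batch, embeddings)
        ])
        written += len(batch)
    return written
```

Tune the batch size against your own rate limits and payload sizes; 100 is only a placeholder.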

Context beats cleverness

The biggest accuracy gains didn't come from a fancier model — they came from what I embedded. Concatenating the title, description, category, and a few structured attributes into one normalized string outperformed any prompt tweak.
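One way that concatenation might look, as a minimal sketch. The field names (`title`, `description`, `category`, `attributes`), the `field: value` labeling, and the `build_listing_text` and `normalize` helpers are assumptions for illustration, not a fixed schema.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so formatting noise does not reach the embedding."""
    return re.sub(r"\s+", " ", text).strip().lower()


def build_listing_text(listing: dict) -> str:
    """Join title, category, description, and structured attributes into one labeled string."""
    parts = [
        f"title: {listing.get('title', '')}",
        f"category: {listing.get('category', '')}",
        f"description: {listing.get('description', '')}",
    ]
    # Sort attributes so the same listing always produces the same string.
    for key, value in sorted(listing.get("attributes", {}).items()):
        parts.append(f"{key}: {value}")
    return normalize(" | ".join(parts))


listing = {
    "title": "Emergency  Plumbing",
    "description": "24/7 leak repair\nand drain cleaning.",
    "category": "Plumbing",
    "attributes": {"price_tier": "mid", "city": "Austin"},
}
listing_text = build_listing_text(listing)
```

Keeping the output deterministic matters more than the exact format: if the same listing can produce two different strings, re-embedding it will move its vector for no reason.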

Filter in the vector store, not after

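Pinecone metadata filters let you constrain by category, location, or availability during retrieval. Filter after the fact and you hurt both recall and latency: the top-N fills up with listings you then discard, so you either return fewer results than you asked for or over-fetch to compensate.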
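A filtered query could be sketched like this. The `$eq` and `$in` operators and the `filter` argument to `index.query` are Pinecone's; the `build_filter` and `retrieve` helpers and the `city` and `available` metadata fields are hypothetical, and those fields would have to be written at upsert time alongside `category`.

```python
from typing import Any, Optional


def build_filter(category: Optional[str] = None,
                 cities: Optional[list[str]] = None,
                 available: Optional[bool] = None) -> dict[str, Any]:
    """Build a Pinecone metadata filter from whichever constraints are set."""
    flt: dict[str, Any] = {}
    if category is not None:
        flt["category"] = {"$eq": category}
    if cities:
        flt["city"] = {"$in": cities}
    if available is not None:
        flt["available"] = {"$eq": available}
    return flt  # multiple top-level keys are combined with AND


def retrieve(index, query_vector: list[float], top_k: int = 20, **constraints):
    """Query with the filter applied inside the vector store, so every hit satisfies it."""
    flt = build_filter(**constraints)
    return index.query(
        vector=query_vector,
        top_k=top_k,
        filter=flt or None,  # no constraints means an unfiltered query
        include_metadata=True,
    )
```

A call such as `retrieve(index, query_vector, top_k=20, category="plumbing", cities=["Austin"])` then returns up to 20 matches that already satisfy both constraints.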

Keep the LLM for the last mile

Use vector similarity to get the top-N candidates, then let the LLM re-rank or explain — not search from scratch. Cheaper, faster, and far easier to evaluate.
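The re-rank step might be sketched as follows. Everything here is an assumption made for illustration: the prompt wording, the JSON-array reply format, the candidate shape (dicts with string `id` and `text`), and the `llm` parameter, which stands in for whatever function sends a prompt to your model and returns its text.

```python
import json
from typing import Callable


def build_rerank_prompt(query: str, candidates: list[dict]) -> str:
    """Ask the model to order a fixed candidate set; it never sees the full catalog."""
    lines = [f"{c['id']}: {c['text']}" for c in candidates]
    return (
        "Rank these listings for the request below, best first.\n"
        "Reply with only a JSON array of listing ids.\n\n"
        f"Request: {query}\n\nListings:\n" + "\n".join(lines)
    )


def rerank(query: str, candidates: list[dict], llm: Callable[[str], str]) -> list[dict]:
    """Re-rank candidates with an LLM, falling back to vector order on unusable output."""
    by_id = {c["id"]: c for c in candidates}
    try:
        ranked_ids = json.loads(llm(build_rerank_prompt(query, candidates)))
    except (json.JSONDecodeError, TypeError):
        return candidates
    if not isinstance(ranked_ids, list):
        return candidates
    # Keep each known id once, ignoring anything the model invented.
    seen: set[str] = set()
    ordered: list[dict] = []
    for cid in ranked_ids:
        if isinstance(cid, str) and cid in by_id and cid not in seen:
            seen.add(cid)
            ordered.append(by_id[cid])
    # Append candidates the model dropped, in their original vector order.
    ordered += [c for c in candidates if c["id"] not in seen]
    return ordered
```

Because the model can only reorder ids it was given, evaluation reduces to comparing two orderings of the same candidate list, which is much easier to score than free-form generation.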

RAG isn't magic. It's a good index plus disciplined inputs. Get those right and the rest is tuning.