Maximum Chunk Size for RAG

#27

by mox - opened Jan 22, 2024

Jan 22, 2024

What is the maximum chunk size I can use with this embedding model if I want to split my documents into chunks for RAG?

intfloat

Owner Jan 23, 2024

It would be 512 tokens.

Mihail

Feb 21, 2024

Hi, I have a follow-up question. What is the expected behaviour when the input text is longer than 512 tokens? I assume it gets cut off at 512.
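
For illustration, here is a minimal sketch of what cutting off at a fixed length means. It assumes the tokenizer is invoked with truncation enabled and a maximum length of 512, which this thread does not confirm. Whitespace splitting stands in for the model's real subword tokenizer, and `truncate_tokens` is a hypothetical helper, not a library function.

```python
MAX_TOKENS = 512  # limit stated by the model owner in this thread


def truncate_tokens(tokens, max_len=MAX_TOKENS):
    """Return (kept, dropped): the first max_len tokens and everything cut off."""
    return tokens[:max_len], tokens[max_len:]


# Whitespace split is only a stand-in for the model's real subword tokenizer.
text = " ".join(f"word{i}" for i in range(600))
kept, dropped = truncate_tokens(text.split())
print(len(kept), len(dropped))  # prints: 512 88
```

If truncation works this way, anything in `dropped` never reaches the model and so cannot influence the embedding, which is the reason to chunk before embedding rather than rely on the cutoff.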

Heidi0039

Jun 24, 2024

edited Jun 24, 2024

Should we account for the "passage:" prefix when chunking the documents?
That is, should f"passage: {doc.page_content}" fit within 512 tokens, or only doc.page_content itself?

And with this being the max_len for a chunk, is there an optimal_len we should aim for?
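
One conservative way to handle the prefix question is to budget for it. The sketch below assumes the prefix and any special tokens added by the tokenizer all count toward the 512-token limit, which this thread does not confirm. `count_tokens`, `SPECIAL_TOKENS = 2`, and `chunk_with_prefix` are hypothetical stand-ins; in practice the counts would come from the model's own tokenizer.

```python
MAX_TOKENS = 512      # limit stated by the model owner in this thread
SPECIAL_TOKENS = 2    # assumption: tokenizer adds one start and one end token
PREFIX = "passage: "


def count_tokens(text):
    """Stand-in token counter (whitespace split); swap in the real tokenizer."""
    return len(text.split())


def chunk_with_prefix(text, max_tokens=MAX_TOKENS, prefix=PREFIX):
    """Split text into prefixed chunks that each fit within max_tokens."""
    # Reserve room for the prefix and the assumed special tokens.
    budget = max_tokens - SPECIAL_TOKENS - count_tokens(prefix)
    if budget <= 0:
        raise ValueError("prefix leaves no room for content")
    words = text.split()
    return [
        prefix + " ".join(words[i:i + budget])
        for i in range(0, len(words), budget)
    ]


doc = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_with_prefix(doc)
print(len(chunks), max(count_tokens(c) for c in chunks))  # prints: 3 510
```

With a real subword tokenizer the split would be done on token IDs rather than whitespace-separated words, but the budgeting step is the same: subtract the prefix and special tokens from the limit before sizing each chunk.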
