Medium severity: LanceDB create_fts_index (Tantivy & native FTS)

Full-text searches fail to match documents containing long words, URLs, base64 strings, or IDs longer than 40 characters, even when the exact text is present in the indexed data. Queries on such long tokens return no results, as if those terms had never been indexed.

Root cause

The default LanceDB FTS tokenizer ("simple") drops any token longer than max_token_length=40 characters during indexing. Long tokens such as base64 strings, URLs, or technical IDs over 40 characters are therefore never written to the index, which makes them unsearchable. [LanceDB FTS Docs](https://docs.lancedb.com/indexing/fts-index)
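The effect of a length filter like this can be reproduced without LanceDB. The sketch below is a pure-Python toy, not LanceDB's or Tantivy's actual code: `simple_tokenize`, `build_index`, and `search` are hypothetical helpers, and the splitting rule (runs of ASCII letters and digits, lowercased) is an assumption made for illustration. It only shows why a token over the limit disappears from both the index and the query.

```python
import re

MAX_TOKEN_LENGTH = 40  # default limit named in the root cause above


def simple_tokenize(text, max_token_length=MAX_TOKEN_LENGTH):
    """Hypothetical stand-in for a "simple" tokenizer: split on
    non-alphanumeric characters, lowercase, drop over-long tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if len(t) <= max_token_length]


def build_index(docs, max_token_length=MAX_TOKEN_LENGTH):
    """Toy inverted index: token -> set of document ids."""
    index = {}
    for doc_id, text in enumerate(docs):
        for token in simple_tokenize(text, max_token_length):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query, max_token_length=MAX_TOKEN_LENGTH):
    """Return ids of documents containing every surviving query token."""
    tokens = simple_tokenize(query, max_token_length)
    if not tokens:
        return set()  # the whole query was filtered away
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


# A 48-character alphanumeric ID, i.e. a single token over the limit.
long_id = "a" * 16 + "B" * 16 + "0123456789abcdef"
docs = [f"session token {long_id} issued", "short token abc123 issued"]

index = build_index(docs)
hits_default = search(index, long_id)   # empty set: the ID was never indexed
hits_short = search(index, "abc123")    # {1}: short tokens are unaffected

# Same data with a wider limit: the long ID is now indexed and found.
wide_index = build_index(docs, max_token_length=64)
hits_wide = search(wide_index, long_id, max_token_length=64)  # {0}
```

In this toy, the long ID is lost twice: once when the document is indexed and again when the query is tokenized, which matches the "no results" symptom. Whether your LanceDB version lets you raise the limit through a `max_token_length` argument to `create_fts_index` is worth checking against the linked docs before relying on it.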

Tags: lancedb, fts, tokenization, max_token_length, filtering, dropped tokens

Citations