The Theoretical Limitations of Embedding-Based Retrieval

(arxiv.org)

77 points | by fzliu 7 hours ago

3 comments

Straw 2 hours ago
In the theoretical section, they extrapolate assuming a polynomial from 40 to thousands of dimensions. Why do they trust a polynomial fit to extrapolate two orders of magnitude? Why do we even think it's polynomial instead of exponential in the first place? Most things like this increase exponentially with dimension.
In fact, I think we can do it in d=2k dimensions, if we're willing to have arbitrarily precise query vectors.
Embed our points as (sin(theta), cos(theta), sin(2 x theta), cos(2 x theta)..., sin(k x theta), cos(k x theta)), with theta uniformly spaced around the circle, and we should be able to select any k of them.
Using a few more dimensions we can then ease the precision requirements on the query.
[-]
- namibj 1 hour ago
  In practice you're actually hitting further problems because you don't have those synthetic top-k tasks but rather open-domain documents and queries to support. And if you hope to get better than "just" having the top-k correct and instead get some sort of inclusion/exclusion boundary between what should be matched and what should not be matched, you'll hit the same bounds as apply to context length limitations for kq dimenionality in a transformer's attention layers, as I mentioned about 6 weeks ago: https://news.ycombinator.com/item?id=44570650
gdiamos 6 hours ago
Their idea is that capacity of even 4096-wide vectors limits their performance.
Sparse models like BM25 have a huge dimension and thus don’t suffer from this limit, but they don’t capture semantics and can’t follow instructions.
It seems like the holy grail is a sparse semantic model. I wonder how splade would do?
[-]
- CuriouslyC 4 hours ago
  We already have "sparse" embeddings. Google's Matryoshka embedding schema can scale embeddings from ~150 dimensions to >3k, and it's the same embedding with layers of representational meaning. Imagine decomposing an embedding along principle components, then streaming the embedding vectors in order of their eigenvalue, kind of the idea.
  [-]
  - 3abiton 2 hours ago
    Doesn't PCA compress the embeddings in this case, ie reduce the accuracy? It's similar to quantization.
    [-]
    - CuriouslyC 54 minutes ago
      Component analysis doesn't fundamentally reduce information, it just rotates it into a more informative basis. People usually drop vectors using the eigenvalues to do dimensionality reduction, but you don't have to do that.
  - jxmorris12 3 hours ago
    Matryoshka embeddings are not sparse. And SPLADE can scale to tens or hundreds of thousands of dimensions.
    [-]
    - CuriouslyC 3 hours ago
      If you consider the actual latent space the full higher dimensional representation, and you take the first principle component, the other vectors are zero. Pretty sparse. No it's not a linked list sparse matrix. Don't be a pedant.
- tkfoss 4 hours ago
  Wouldn't holy grail then be parallel channels for candidate generation;
```
  euclidean embedding
  hyperbolic embedding
  sparse BM25 / SPLADE lexical search
  optional multi-vector signatures

  ↓ merge & deduplicate candidates
```
  followed by weight scoring, expansion (graph) & rerank (LLM)?
ArnavAgrawal03 4 hours ago
we used multi-vector models at Morphik, and I can confirm the real-world effectiveness, especially when compared with dense-vector retrieval.
[-]
- codingjaguar 3 hours ago
  Curious is that colbert-like ones?