Sparse Embedding Retrieval with Qdrant and FastEmbed

Last Updated: September 24, 2024

In this notebook, we will see how to use Sparse Embedding Retrieval techniques (such as SPLADE) in Haystack.

We will use the Qdrant Document Store and FastEmbed Sparse Embedders.

Why SPLADE?

Sparse Keyword-Based Retrieval (based on the BM25 algorithm or similar ones) is simple, fast, and requires few resources, but it relies on lexical matching and struggles to capture semantic meaning.
Dense Embedding-Based Retrieval takes semantics into account but requires considerable computational resources, usually does not work well on novel domains, and does not consider precise wording.

While good results can be achieved by combining the two approaches (tutorial), SPLADE (Sparse Lexical and Expansion Model for Information Retrieval) introduces a new method that combines the strengths of both techniques. In particular, SPLADE uses Language Models like BERT to weigh the relevance of different terms in the query and to perform automatic term expansion, which reduces the vocabulary mismatch problem (queries and relevant documents often lack term overlap).

Main features:

Better than dense embedding Retrievers on precise keyword matching
Better than BM25 on semantic matching
Slower than BM25
Still experimental compared to both BM25 and dense embeddings: few models; supported by few Document Stores
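
To make the vocabulary mismatch problem and term expansion more concrete, here is a toy sketch in plain Python. The term weights are invented for illustration and are not the output of a real SPLADE model; the only point is that a query expanded with related terms can match a document that shares no literal words with it.

```python
# Toy illustration of term expansion (all weights are made up, not model output).

def sparse_score(query_weights, doc_weights):
    """Dot product over the terms that the query and the document have in common."""
    return sum(w * doc_weights[t] for t, w in query_weights.items() if t in doc_weights)

# A document that never uses the literal query words "where" or "live".
doc = {"capybara": 1.5, "habitat": 1.2, "wetland": 0.9}

# Purely lexical query: only the literal query terms.
lexical_query = {"where": 1.0, "live": 1.0}

# Expanded query: a SPLADE-like model also gives weight to related terms.
expanded_query = {"live": 1.8, "habitat": 1.7, "location": 1.8}

lexical = sparse_score(lexical_query, doc)    # no shared terms
expanded = sparse_score(expanded_query, doc)  # "habitat" is shared
print(lexical, expanded)
```

The lexical query scores zero because no term overlaps, while the expanded query gets a positive score through the added term "habitat".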

Resources

Install dependencies

!pip install -U fastembed-haystack qdrant-haystack wikipedia transformers

Sparse Embedding Retrieval

Indexing

Create a Qdrant Document Store

from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    return_embedding=True,
    use_sparse_embeddings=True,  # required: otherwise the collection schema cannot store sparse vectors
)

Download Wikipedia pages and create raw documents

We download a few Wikipedia pages about animals and create Haystack documents from them.

import wikipedia
from haystack.dataclasses import Document

nice_animals = ["Capybara", "Dolphin"]

raw_docs = []
for title in nice_animals:
    # auto_suggest=False uses the title exactly as given instead of a suggested one
    page = wikipedia.page(title=title, auto_suggest=False)
    doc = Document(content=page.content, meta={"title": page.title, "url": page.url})
    raw_docs.append(doc)

Initialize a `FastembedSparseDocumentEmbedder`

The FastembedSparseDocumentEmbedder enriches a list of documents with their sparse embeddings.

We are using prithvida/Splade_PP_en_v1, a good sparse embedding model with a permissive license.

We also want to embed the title of the document, because it contains relevant information.

For more customization options, refer to the docs.
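
As a rough intuition for what embedding the title means, the sketch below prepends selected metadata fields to the document content before the text is embedded. The helper `text_to_embed` is hypothetical and the newline separator is an assumption; the component's internal behavior is not shown in this notebook, so treat this only as an illustration.

```python
# Hypothetical helper: mimics prepending metadata fields to the content.
def text_to_embed(content, meta, meta_fields_to_embed, separator="\n"):
    # Keep only the requested fields that are actually present in the metadata.
    parts = [str(meta[f]) for f in meta_fields_to_embed if meta.get(f) is not None]
    return separator.join(parts + [content])

example = text_to_embed(
    content="An example document",
    meta={"title": "Dolphin", "url": "https://en.wikipedia.org/wiki/Dolphin"},
    meta_fields_to_embed=["title"],
)
print(example)
```

With this approach, every chunk produced later by the splitter still carries the page title, even when the chunk text itself never mentions the animal by name.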

from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder

sparse_doc_embedder = FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1",
                                                      meta_fields_to_embed=["title"])
sparse_doc_embedder.warm_up()

# let's try the embedder
print(sparse_doc_embedder.run(documents=[Document(content="An example document")]))

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 12.05it/s]

{'documents': [Document(id=cd69a8e89f3c179f243c483a337c5ecb178c58373a253e461a64545b669de12d, content: 'An example document', sparse_embedding: vector with 19 non-zero elements)]}

Indexing pipeline

from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy
from haystack import Pipeline

indexing = Pipeline()
indexing.add_component("cleaner", DocumentCleaner())
indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=4))
indexing.add_component("sparse_doc_embedder", sparse_doc_embedder)
indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

indexing.connect("cleaner", "splitter")
indexing.connect("splitter", "sparse_doc_embedder")
indexing.connect("sparse_doc_embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21068632e0>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> writer.documents (List[Document])

Let’s index our documents!

⚠️ If you are running this notebook on Google Colab, please note that Google Colab only provides 2 CPU cores, so sparse embedding generation may be slower than on a standard machine.

indexing.run({"documents":raw_docs})

Calculating sparse embeddings: 100%|██████████| 152/152 [02:29<00:00,  1.02it/s]
200it [00:00, 2418.48it/s]             





{'writer': {'documents_written': 152}}

document_store.count_documents()

Retrieval

Retrieval pipeline

Now, we create a simple retrieval Pipeline:

FastembedSparseTextEmbedder: transforms the query into a sparse embedding
QdrantSparseEmbeddingRetriever: looks for relevant documents, based on the similarity of the sparse embeddings
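
To give an idea of what "similarity of the sparse embeddings" can mean, here is a minimal sketch that scores a query against a document with a dot product restricted to the indices they share. This is a common way to compare sparse vectors and is used here as an assumption; the exact scoring performed by the retriever is not shown in this notebook. The vectors are toy values.

```python
def sparse_dot(q_indices, q_values, d_indices, d_values):
    """Dot product of two sparse vectors given as (indices, values) pairs."""
    doc = dict(zip(d_indices, d_values))
    # Only indices present in both vectors contribute to the score.
    return sum(v * doc[i] for i, v in zip(q_indices, q_values) if i in doc)

# Toy vectors: indices 5 and 9 are shared, indices 1 and 12 are not.
score = sparse_dot([1, 5, 9], [0.5, 2.0, 1.0], [5, 9, 12], [1.0, 0.5, 3.0])
print(score)
```

Because only non-zero entries are stored and compared, the cost depends on the number of non-zero elements rather than on the full vocabulary size.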

from haystack import Pipeline
from haystack_integrations.components.retrievers.qdrant import QdrantSparseEmbeddingRetriever
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder

sparse_text_embedder = FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1")

query_pipeline = Pipeline()
query_pipeline.add_component("sparse_text_embedder", sparse_text_embedder)
query_pipeline.add_component("sparse_retriever", QdrantSparseEmbeddingRetriever(document_store=document_store))

query_pipeline.connect("sparse_text_embedder.sparse_embedding", "sparse_retriever.query_sparse_embedding")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f21067cf3d0>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - sparse_retriever: QdrantSparseEmbeddingRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> sparse_retriever.query_sparse_embedding (SparseEmbedding)

Try the retrieval pipeline

question = "Where do capybaras live?"

results = query_pipeline.run({"sparse_text_embedder": {"text": question}})

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00,  9.02it/s]

import rich

for d in results['sparse_retriever']['documents']:
  rich.print(f"\nid: {d.id}\n{d.content}\nscore: {d.score}\n---")

id: 6a485709ae51c55b78252571c0808ef17129b32e930ea7d461c12d9afaf40672

Its karyotype has 2n = 66 and FN = 102, meaning it has 66 chromosomes with a total of 102 arms. == Ecology == 
Capybaras are semiaquatic mammals found throughout all countries of South America except Chile. They live in 
densely forested areas near bodies of water, such as lakes, rivers, swamps, ponds, and marshes, as well as flooded 
savannah and along rivers in the tropical rainforest. They are superb swimmers and can hold their breath underwater
for up to five minutes at a time.
score: 0.5607053126371688
---

id: fcc9a816e7f2312988dbd20146e4a5c07d8d8b409a373c4c3986d85c26dc0d61

Capybaras have adapted well to urbanization in South America. They can be found in many areas in zoos and parks, 
and may live for 12 years in captivity, more than double their wild lifespan. Capybaras are docile and usually 
allow humans to pet and hand-feed them, but physical contact is normally discouraged, as their ticks can be vectors
to Rocky Mountain spotted fever. The European Association of Zoos and Aquaria asked Drusillas Park in Alfriston, 
Sussex, England, to keep the studbook for capybaras, to monitor captive populations in Europe.
score: 0.5577329835824506
---

id: d70f54cc66a83b56210c801ecd49c95bae5fef4ab38989d38b26dc53449b192d
 In 2011, one specimen was spotted on the Central Coast of California. These escaped populations occur in areas 
where prehistoric capybaras inhabited; late Pleistocene capybaras inhabited Florida and Hydrochoerus 
hesperotiganites in California and Hydrochoerus gaylordi in Grenada, and feral capybaras in North America may 
actually fill the ecological niche of the Pleistocene species. === Diet and predation === Capybaras are herbivores,
grazing mainly on grasses and aquatic plants, as well as fruit and tree bark. They are very selective feeders and 
feed on the leaves of one species and disregard other species surrounding it.
score: 0.5567185168202262
---

id: a1cb26bcd9d053fc8e7a3c8b6716801b37ca37940c6f8b7865d6f6bb50b38f2f
 The capybara inhabits savannas and dense forests, and lives near bodies of water. It is a highly social species 
and can be found in groups as large as 100 individuals, but usually live in groups of 10–20 individuals. The 
capybara is hunted for its meat and hide and also for grease from its thick fatty skin. == Etymology ==
Its common name is derived from Tupi ka'apiûara, a complex agglutination of kaá (leaf) + píi (slender) + ú (eat) + 
ara (a suffix for agent nouns), meaning "one who eats slender leaves", or "grass-eater".
score: 0.5562936393318461
---

id: 15755cebd1049a00c4656aaa7cf6c417966b81e482732e1c97288d58a08b53b2
The capybara or greater capybara (Hydrochoerus hydrochaeris) is a giant cavy rodent native to South America. It is 
the largest living rodent and a member of the genus Hydrochoerus. The only other extant member is the lesser 
capybara (Hydrochoerus isthmius). Its close relatives include guinea pigs and rock cavies, and it is more distantly
related to the agouti, the chinchilla, and the nutria.
score: 0.5559830683084014
---

id: d8bd93eefa0c2feeb162972cd915fe29e3b13c98181a976d4751296c51547c77
 Capybara have flourished in cattle ranches. They roam in home ranges averaging 10 hectares (25 acres) in 
high-density populations.
Many escapees from captivity can also be found in similar watery habitats around the world. Sightings are fairly 
common in Florida, although a breeding population has not yet been confirmed.
score: 0.5550636661344844
---

id: 5429437120fdd0611ba2f14db51a28dfd894cd7221264c805dfe43de6ba95f7e
 In Japan, following the lead of Izu Shaboten Zoo in 1982, multiple establishments or zoos in Japan that raise 
capybaras have adopted the practice of having them relax in onsen during the winter. They are seen as an attraction
by Japanese people. Capybaras became popular in Japan due to the popular cartoon character Kapibara-san.
In August 2021, Argentine and international media reported that capybaras had been causing serious problems for 
residents of Nordelta, an affluent gated community north of Buenos Aires built atop wetland habitat.
score: 0.5535668539880305
---

id: 4261a2ae42f4edf8885c6e4c483356c994ad7699fd814e3b8f5da115aa5560ea
 Alloparenting has been observed in this species. Breeding peaks between April and May in Venezuela and between 
October and November in Mato Grosso, Brazil. === Activities ===
Though quite agile on land, capybaras are equally at home in the water. They are excellent swimmers, and can remain
completely submerged for up to five minutes, an ability they use to evade predators.
score: 0.553533523584911
---

id: b819073f1e3b6973dd1d22299a34a882acc348ba45257456cc85c3960fedd9ce
 The studbook includes information about all births, deaths and movements of capybaras, as well as how they are 
related.
Capybaras are farmed for meat and skins in South America. The meat is considered unsuitable to eat in some areas, 
while in other areas it is considered an important source of protein. In parts of South America, especially in 
Venezuela, capybara meat is popular during Lent and Holy Week as the Catholic Church previously issued special 
dispensation to allow it to be eaten while other meats are generally forbidden.
score: 0.5527757086170884
---

id: a1fc3b907198e9d3e1d95d4c08f0fefae6861ce810d1b32479b50753326bdf58
 == Conservation and human interaction ==
Capybaras are not considered a threatened species; their population is stable throughout most of their South 
American range, though in some areas hunting has reduced their numbers. Capybaras are hunted for their meat and 
pelts in some areas, and otherwise killed by humans who see their grazing as competition for livestock. In some 
areas, they are farmed, which has the effect of ensuring the wetland habitats are protected. Their survival is 
aided by their ability to breed rapidly.
score: 0.5503703349299455
---

Understanding SPLADE vectors

(Inspiration: FastEmbed SPLADE notebook)

We have seen that our model encodes text into a sparse vector (= a vector with many zeros). An efficient representation of sparse vectors is to save the indices and values of nonzero elements.
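
The indices-and-values representation can be sketched in a few lines of plain Python. The helper names `to_sparse` and `to_dense` are illustrative and are not part of Haystack or FastEmbed.

```python
def to_sparse(dense):
    """Keep only the positions and values of the non-zero elements."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return {"indices": indices, "values": values}

def to_dense(sparse, size):
    """Rebuild the full vector, filling every other position with zero."""
    dense = [0.0] * size
    for i, v in zip(sparse["indices"], sparse["values"]):
        dense[i] = v
    return dense

dense_vector = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.25, 0.0]
sparse_vector = to_sparse(dense_vector)
roundtrip = to_dense(sparse_vector, len(dense_vector))
print(sparse_vector)
```

For a SPLADE-style model the dense vector would be as long as the tokenizer vocabulary, so storing only a few dozen index/value pairs saves a lot of space.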

Let’s try to understand what information resides in these vectors…

question = "Where do capybaras live?"
sparse_embedding = sparse_text_embedder.run(text=question)["sparse_embedding"]
rich.print(sparse_embedding.to_dict())

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00, 10.06it/s]

{
    'indices': [
        2015,
        2100,
        2427,
        2444,
        2555,
        2693,
        3224,
        3269,
        3295,
        3562,
        4111,
        4761,
        4982,
        5430,
        5917,
        6178,
        6552,
        7713,
        8843,
        9089,
        9230,
        9277,
        10746,
        14627,
        15267,
        20709
    ],
    'values': [
        0.7443006634712219,
        1.2232322692871094,
        0.7982208132743835,
        1.8504852056503296,
        0.031874656677246094,
        0.22175012528896332,
        0.17087453603744507,
        0.03717103973031044,
        1.8334054946899414,
        0.18768127262592316,
        0.03902499005198479,
        0.5681754946708679,
        0.07937325537204742,
        0.30040717124938965,
        0.33065155148506165,
        2.4437994956970215,
        1.7612168788909912,
        0.0731465145945549,
        0.18527895212173462,
        0.33103543519973755,
        0.29275140166282654,
        0.04728797823190689,
        0.04782348498702049,
        0.0030497254338115454,
        0.6497660875320435,
        2.6444451808929443
    ]
}

from transformers import AutoTokenizer

# we need the tokenizer vocabulary
tokenizer = AutoTokenizer.from_pretrained("Qdrant/Splade_PP_en_v1") # ONNX export of the original model

def get_tokens_and_weights(sparse_embedding, tokenizer):
    """Map each non-zero index to its vocabulary token and sort the tokens by weight."""
    token_weight_dict = {}
    for index, weight in zip(sparse_embedding.indices, sparse_embedding.values):
        # each index is a position in the tokenizer vocabulary
        token = tokenizer.decode([index])
        token_weight_dict[token] = weight

    # Sort the dictionary by weight, highest first
    return dict(sorted(token_weight_dict.items(), key=lambda item: item[1], reverse=True))


rich.print(get_tokens_and_weights(sparse_embedding, tokenizer))

{
    '##bara': 2.6444451808929443,
    'cap': 2.4437994956970215,
    'live': 1.8504852056503296,
    'location': 1.8334054946899414,
    'habitat': 1.7612168788909912,
    '##y': 1.2232322692871094,
    'species': 0.7982208132743835,
    '##s': 0.7443006634712219,
    'predator': 0.6497660875320435,
    'origin': 0.5681754946708679,
    'nest': 0.33103543519973755,
    'tribe': 0.33065155148506165,
    'cave': 0.30040717124938965,
    'migration': 0.29275140166282654,
    'move': 0.22175012528896332,
    'genus': 0.18768127262592316,
    'breed': 0.18527895212173462,
    'forest': 0.17087453603744507,
    'grow': 0.07937325537204742,
    'shelter': 0.0731465145945549,
    'habitats': 0.04782348498702049,
    'refuge': 0.04728797823190689,
    'animal': 0.03902499005198479,
    'plant': 0.03717103973031044,
    'region': 0.031874656677246094,
    'reproduction': 0.0030497254338115454
}

Very nice! 🦫

Tokens are ordered by weight, that is, by their relevance to the query.
The query is expanded with relevant tokens/terms that do not appear in it: “location”, “habitat”…
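
The two observations above can be checked with a small sketch that separates the tokens already present in the query from the ones added by expansion. The weights are a subset of the output printed above; the list of query word pieces is an assumption about how the tokenizer splits the question, not something computed in this notebook.

```python
# Subset of the token weights printed above.
token_weights = {
    "##bara": 2.6444451808929443,
    "cap": 2.4437994956970215,
    "live": 1.8504852056503296,
    "location": 1.8334054946899414,
    "habitat": 1.7612168788909912,
    "##y": 1.2232322692871094,
    "##s": 0.7443006634712219,
}

# Assumption: word pieces of "Where do capybaras live?" (not verified here).
query_pieces = {"where", "do", "cap", "##y", "##bara", "##s", "live", "?"}

# Tokens that come from the query itself vs. tokens added by the model.
original = [t for t in token_weights if t in query_pieces]
expansion = [t for t in token_weights if t not in query_pieces]
print("from the query:", original)
print("added by expansion:", expansion)
```

Under that assumption, uninformative pieces such as "where" and "do" receive no weight at all, while "location" and "habitat" are introduced by the model.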

Hybrid Retrieval

Ideally, techniques like SPLADE are intended to replace other approaches (BM25 and Dense Embedding Retrieval) and their combinations.

However, sometimes it may make sense to combine, for example, Dense Embedding Retrieval and Sparse Embedding Retrieval. You can find some positive examples in the appendix of this paper (An Analysis of Fusion Functions for Hybrid Retrieval). Before adopting this approach, run an evaluation to make sure it works for your use case.

Below we show how to create such an application in Haystack.

In the example, we use the Qdrant Hybrid Retriever: it compares dense and sparse query and document embeddings and retrieves the most relevant documents, merging the scores with Reciprocal Rank Fusion.

If you want to customize the behavior further, see Hybrid Retrieval Pipelines (tutorial).
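
The Reciprocal Rank Fusion step mentioned above can be sketched as follows: each document receives, from every ranked list in which it appears, a contribution that depends only on its rank, and the contributions are summed. The function name, the zero-based ranks, and the offset `k=2` are assumptions made for this sketch. They are consistent with the fused scores printed further down (for example 0.6666… = 1/3 + 1/3), but the retriever's actual implementation may differ.

```python
def reciprocal_rank_fusion(rankings, k=2):
    """Fuse ranked lists of ids; each hit adds 1 / (k + rank), with zero-based rank."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy rankings from a sparse and a dense retriever (ids are placeholders).
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["c", "b", "d"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 4))
```

Since only ranks are used, the sparse and dense scores do not need to be on the same scale, which is what makes this fusion method convenient for hybrid retrieval.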

from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack_integrations.components.embedders.fastembed import FastembedSparseDocumentEmbedder, FastembedDocumentEmbedder
from haystack.document_stores.types import DuplicatePolicy
from haystack import Pipeline

document_store = QdrantDocumentStore(
    ":memory:",
    recreate_index=True,
    return_embedding=True,
    use_sparse_embeddings=True,
    embedding_dim=384,  # must match the size of the dense embeddings (384 for BAAI/bge-small-en-v1.5)
)

hybrid_indexing = Pipeline()
hybrid_indexing.add_component("cleaner", DocumentCleaner())
hybrid_indexing.add_component("splitter", DocumentSplitter(split_by='sentence', split_length=4))
hybrid_indexing.add_component("sparse_doc_embedder", FastembedSparseDocumentEmbedder(model="prithvida/Splade_PP_en_v1", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("dense_doc_embedder", FastembedDocumentEmbedder(model="BAAI/bge-small-en-v1.5", meta_fields_to_embed=["title"]))
hybrid_indexing.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.OVERWRITE))

hybrid_indexing.connect("cleaner", "splitter")
hybrid_indexing.connect("splitter", "sparse_doc_embedder")
hybrid_indexing.connect("sparse_doc_embedder", "dense_doc_embedder")
hybrid_indexing.connect("dense_doc_embedder", "writer")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8292170>
🚅 Components
  - cleaner: DocumentCleaner
  - splitter: DocumentSplitter
  - sparse_doc_embedder: FastembedSparseDocumentEmbedder
  - dense_doc_embedder: FastembedDocumentEmbedder
  - writer: DocumentWriter
🛤️ Connections
  - cleaner.documents -> splitter.documents (List[Document])
  - splitter.documents -> sparse_doc_embedder.documents (List[Document])
  - sparse_doc_embedder.documents -> dense_doc_embedder.documents (List[Document])
  - dense_doc_embedder.documents -> writer.documents (List[Document])

hybrid_indexing.run({"documents":raw_docs})

Calculating sparse embeddings: 100%|██████████| 152/152 [02:14<00:00,  1.13it/s]
Calculating embeddings: 100%|██████████| 152/152 [00:41<00:00,  3.68it/s]
200it [00:00, 655.45it/s]





{'writer': {'documents_written': 152}}

document_store.filter_documents()[0]

Document(id=5e2d65ac05a8a238b359773c3d855e026aca6e617df8a011964b401d8b242a1e, content: ' Overall, they tend to be dwarfed by other Cetartiodactyls. Several species have female-biased sexua...', meta: {'title': 'Dolphin', 'url': 'https://en.wikipedia.org/wiki/Dolphin', 'source_id': '6584a10fad50d363f203669ff6efc19e7ae2a5a28ca9351f5cceb5ba88f8e847'}, embedding: vector of size 384, sparse_embedding: vector with 129 non-zero elements)

from haystack_integrations.components.retrievers.qdrant import QdrantHybridRetriever
from haystack_integrations.components.embedders.fastembed import FastembedSparseTextEmbedder, FastembedTextEmbedder


hybrid_query = Pipeline()
hybrid_query.add_component("sparse_text_embedder", FastembedSparseTextEmbedder(model="prithvida/Splade_PP_en_v1"))
hybrid_query.add_component("dense_text_embedder", FastembedTextEmbedder(model="BAAI/bge-small-en-v1.5", prefix="Represent this sentence for searching relevant passages: "))
hybrid_query.add_component("retriever", QdrantHybridRetriever(document_store=document_store))

hybrid_query.connect("sparse_text_embedder.sparse_embedding", "retriever.query_sparse_embedding")
hybrid_query.connect("dense_text_embedder.embedding", "retriever.query_embedding")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7f1fe8293190>
🚅 Components
  - sparse_text_embedder: FastembedSparseTextEmbedder
  - dense_text_embedder: FastembedTextEmbedder
  - retriever: QdrantHybridRetriever
🛤️ Connections
  - sparse_text_embedder.sparse_embedding -> retriever.query_sparse_embedding (SparseEmbedding)
  - dense_text_embedder.embedding -> retriever.query_embedding (List[float])

question = "Where do capybaras live?"

results = hybrid_query.run(
    {"dense_text_embedder": {"text": question},
     "sparse_text_embedder": {"text": question}}
)

Calculating sparse embeddings: 100%|██████████| 1/1 [00:00<00:00,  9.95it/s]
Calculating embeddings: 100%|██████████| 1/1 [00:00<00:00, 12.05it/s]

import rich

for d in results['retriever']['documents']:
  rich.print(f"\nid: {d.id}\n{d.content}\nscore: {d.score}\n---")

id: fcc9a816e7f2312988dbd20146e4a5c07d8d8b409a373c4c3986d85c26dc0d61

Capybaras have adapted well to urbanization in South America. They can be found in many areas in zoos and parks, 
and may live for 12 years in captivity, more than double their wild lifespan. Capybaras are docile and usually 
allow humans to pet and hand-feed them, but physical contact is normally discouraged, as their ticks can be vectors
to Rocky Mountain spotted fever. The European Association of Zoos and Aquaria asked Drusillas Park in Alfriston, 
Sussex, England, to keep the studbook for capybaras, to monitor captive populations in Europe.
score: 0.6666666666666666
---

id: 6a485709ae51c55b78252571c0808ef17129b32e930ea7d461c12d9afaf40672

Its karyotype has 2n = 66 and FN = 102, meaning it has 66 chromosomes with a total of 102 arms. == Ecology == 
Capybaras are semiaquatic mammals found throughout all countries of South America except Chile. They live in 
densely forested areas near bodies of water, such as lakes, rivers, swamps, ponds, and marshes, as well as flooded 
savannah and along rivers in the tropical rainforest. They are superb swimmers and can hold their breath underwater
for up to five minutes at a time.
score: 0.6666666666666666
---

id: d8bd93eefa0c2feeb162972cd915fe29e3b13c98181a976d4751296c51547c77
 Capybara have flourished in cattle ranches. They roam in home ranges averaging 10 hectares (25 acres) in 
high-density populations.
Many escapees from captivity can also be found in similar watery habitats around the world. Sightings are fairly 
common in Florida, although a breeding population has not yet been confirmed.
score: 0.6428571428571428
---

id: a1cb26bcd9d053fc8e7a3c8b6716801b37ca37940c6f8b7865d6f6bb50b38f2f
 The capybara inhabits savannas and dense forests, and lives near bodies of water. It is a highly social species 
and can be found in groups as large as 100 individuals, but usually live in groups of 10–20 individuals. The 
capybara is hunted for its meat and hide and also for grease from its thick fatty skin. == Etymology ==
Its common name is derived from Tupi ka'apiûara, a complex agglutination of kaá (leaf) + píi (slender) + ú (eat) + 
ara (a suffix for agent nouns), meaning "one who eats slender leaves", or "grass-eater".
score: 0.45
---

id: d70f54cc66a83b56210c801ecd49c95bae5fef4ab38989d38b26dc53449b192d
 In 2011, one specimen was spotted on the Central Coast of California. These escaped populations occur in areas 
where prehistoric capybaras inhabited; late Pleistocene capybaras inhabited Florida and Hydrochoerus 
hesperotiganites in California and Hydrochoerus gaylordi in Grenada, and feral capybaras in North America may 
actually fill the ecological niche of the Pleistocene species. === Diet and predation === Capybaras are herbivores,
grazing mainly on grasses and aquatic plants, as well as fruit and tree bark. They are very selective feeders and 
feed on the leaves of one species and disregard other species surrounding it.
score: 0.45
---

id: 15755cebd1049a00c4656aaa7cf6c417966b81e482732e1c97288d58a08b53b2
The capybara or greater capybara (Hydrochoerus hydrochaeris) is a giant cavy rodent native to South America. It is 
the largest living rodent and a member of the genus Hydrochoerus. The only other extant member is the lesser 
capybara (Hydrochoerus isthmius). Its close relatives include guinea pigs and rock cavies, and it is more distantly
related to the agouti, the chinchilla, and the nutria.
score: 0.30952380952380953
---

id: 4261a2ae42f4edf8885c6e4c483356c994ad7699fd814e3b8f5da115aa5560ea
 Alloparenting has been observed in this species. Breeding peaks between April and May in Venezuela and between 
October and November in Mato Grosso, Brazil. === Activities ===
Though quite agile on land, capybaras are equally at home in the water. They are excellent swimmers, and can remain
completely submerged for up to five minutes, an ability they use to evade predators.
score: 0.2361111111111111
---

id: a1fc3b907198e9d3e1d95d4c08f0fefae6861ce810d1b32479b50753326bdf58
 == Conservation and human interaction ==
Capybaras are not considered a threatened species; their population is stable throughout most of their South 
American range, though in some areas hunting has reduced their numbers. Capybaras are hunted for their meat and 
pelts in some areas, and otherwise killed by humans who see their grazing as competition for livestock. In some 
areas, they are farmed, which has the effect of ensuring the wetland habitats are protected. Their survival is 
aided by their ability to breed rapidly.
score: 0.20202020202020202
---

id: 5429437120fdd0611ba2f14db51a28dfd894cd7221264c805dfe43de6ba95f7e
 In Japan, following the lead of Izu Shaboten Zoo in 1982, multiple establishments or zoos in Japan that raise 
capybaras have adopted the practice of having them relax in onsen during the winter. They are seen as an attraction
by Japanese people. Capybaras became popular in Japan due to the popular cartoon character Kapibara-san.
In August 2021, Argentine and international media reported that capybaras had been causing serious problems for 
residents of Nordelta, an affluent gated community north of Buenos Aires built atop wetland habitat.
score: 0.125
---

id: a87985b28681d12e9897eae531fb5e93ecd0c702a5419708ddcbc03ae13c0ed0

They can have a lifespan of 8–10 years, but tend to live less than four years in the wild due to predation from big
cats like the jaguars and pumas and non-mammalian predators like eagles and the caimans. The capybara is also the 
preferred prey of the green anaconda. == Social organization == Capybaras are known to be gregarious. While they 
sometimes live solitarily, they are more commonly found in groups of around 10–20 individuals, with two to four 
adult males, four to seven adult females, and the remainder juveniles.
score: 0.1
---

📚 Docs on Sparse Embedding support in Haystack

(Notebook by Stefano Fiorucci)