Building RAG Applications with NVIDIA NIM and Haystack on K8s

Retrieval-augmented generation (RAG) systems combine generative AI with information retrieval for contextualized answer generation. Building reliable and performant RAG applications at scale is challenging. In this blog, we show how to use Haystack and NVIDIA NIM to create a RAG solution that is easy to deploy and maintain, standardized, and enterprise-ready. The recipe applies in the cloud, on-premises, or even in air-gapped environments.

About Haystack

Haystack, by deepset, is an open source framework for building production-ready LLM applications, RAG pipelines and state-of-the-art search systems that work intelligently over large document collections.

Figure 1 - Haystack Retrieval-augmented generation (RAG) pipeline.

Haystack’s growing ecosystem of community integrations provides tooling for evaluation, monitoring, transcription, data ingestion and more. The NVIDIA Haystack integration allows using NVIDIA models and NIMs in Haystack pipelines, giving the flexibility to pivot from prototyping in the cloud to deploying on-prem.

About NVIDIA NIM

NVIDIA NIM is a collection of containerized microservices designed for optimized inference of state-of-the-art AI models. Each container bundles the components needed to serve a model and exposes it via a standard API. Models are optimized with TensorRT or TensorRT-LLM (depending on the model type), applying techniques such as quantization, model distribution, optimized kernels and runtimes, and in-flight (continuous) batching, with room for further tuning if needed. Learn more about NIM here.

This tutorial shows how to build a Haystack RAG pipeline leveraging NVIDIA NIMs hosted on the NVIDIA API catalog. Then, we provide instructions on deploying NIMs on your infrastructure in a Kubernetes environment for self-hosting AI foundation models. Note that hosting NIMs requires an NVIDIA AI Enterprise license.

Build a Haystack RAG Pipeline with NVIDIA NIMs hosted on the NVIDIA API Catalog

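For RAG pipelines, Haystack provides three components that can be connected with NVIDIA NIM:

  • NvidiaGenerator: generates a response with an LLM NIM.
  • NvidiaDocumentEmbedder: embeds documents during indexing with an embedding NIM.
  • NvidiaTextEmbedder: embeds the query for retrieval with an embedding NIM.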

Figure 2 - Haystack Indexing and RAG pipelines with NVIDIA NIMs

For this section, we have provided scripts and instructions for building a RAG pipeline leveraging NIMs hosted on the NVIDIA API catalog as part of the GitHub repository. We also provide a Jupyter Notebook for building the same RAG pipeline using NIMs deployed on your infrastructure in a Kubernetes environment.

Vectorize Documents with Haystack Indexing Pipelines

Our indexing pipeline implementation is available in the indexing tutorial. Haystack provides several preprocessing components for cleaning, splitting and embedding documents, as well as converters that extract data from files in different formats. In this tutorial, we will store PDF files in a QdrantDocumentStore. NvidiaDocumentEmbedder is used to connect with NIMs hosted on the NVIDIA API catalog. Below is an example of how to initialize the embedder component with the snowflake/arctic-embed-l NIM hosted on the NVIDIA API catalog.

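from haystack.utils.auth import Secret
from haystack_integrations.components.embedders.nvidia import NvidiaDocumentEmbedder

# Embed documents with the snowflake/arctic-embed-l NIM hosted on the NVIDIA API catalog
embedder = NvidiaDocumentEmbedder(
    model="snowflake/arctic-embed-l",
    # Same key name the text embedder uses later in this post
    api_key=Secret.from_env_var("NVIDIA_EMBEDDINGS_KEY"),
    api_url="https://ai.api.nvidia.com/v1/retrieval/snowflake/arctic-embed-l",
    batch_size=1,
)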

Creating the Haystack RAG Pipeline

In our example, we will create a simple question-answering RAG pipeline using both the NVIDIA NeMo Retriever Embedding NIM and the LLM NIM. For this pipeline, we use the NvidiaTextEmbedder to embed the query for retrieval, and the NvidiaGenerator to generate a response. The example below shows how to instantiate the generator using the meta/llama3-70b-instruct LLM NIM hosted on the NVIDIA API catalog.

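from haystack_integrations.components.generators.nvidia import NvidiaGenerator

# No api_key is passed here; the deployment instructions below expect
# NVIDIA_API_KEY to be set in the environment for NvidiaGenerator
generator = NvidiaGenerator(
    model="meta/llama3-70b-instruct",
    api_url="https://integrate.api.nvidia.com/v1",
    model_arguments={
        "max_tokens": 1024
    }
)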

We use Haystack pipelines to connect various components of this RAG pipeline including query embedders and LLM generators. Below is an example of a RAG pipeline:

from haystack import Pipeline
from haystack.utils.auth import Secret
from haystack.components.builders import PromptBuilder
from haystack_integrations.components.embedders.nvidia import NvidiaTextEmbedder
from haystack_integrations.components.generators.nvidia import NvidiaGenerator
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore

document_store = QdrantDocumentStore(embedding_dim=1024, host="qdrant")

embedder = NvidiaTextEmbedder(
    model="snowflake/arctic-embed-l",
    api_key=Secret.from_env_var("NVIDIA_EMBEDDINGS_KEY"),
    api_url="https://ai.api.nvidia.com/v1/retrieval/snowflake/arctic-embed-l",
)

retriever = QdrantEmbeddingRetriever(document_store=document_store)

prompt = """Answer the question given the context.
Question: {{ query }}
Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}
Answer:"""
prompt_builder = PromptBuilder(template=prompt)

generator = NvidiaGenerator(
    model="meta/llama3-70b-instruct",
    api_url="https://integrate.api.nvidia.com/v1",
    model_arguments={
        "max_tokens": 1024
    }
)

rag = Pipeline()
rag.add_component("embedder", embedder)
rag.add_component("retriever", retriever)
rag.add_component("prompt", prompt_builder)
rag.add_component("generator", generator)

rag.connect("embedder.embedding", "retriever.query_embedding")
rag.connect("retriever.documents", "prompt.documents")
rag.connect("prompt", "generator")
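Once the components are connected, the pipeline can be queried by passing the question to both the embedder and the prompt builder. The sketch below is a minimal illustration, assuming the component names used above ("embedder", "prompt", "generator") and that the generator returns its text under a "replies" key. The helper names build_inputs and ask are hypothetical and not part of Haystack.

```python
def build_inputs(question: str) -> dict:
    """Map one question onto the inputs of the pipeline's entry components."""
    return {
        "embedder": {"text": question},   # embedded and used for retrieval
        "prompt": {"query": question},    # fills {{ query }} in the template
    }


def ask(pipeline, question: str) -> str:
    """Run the pipeline and return the first generated reply.

    `pipeline` is any object with a Haystack-style run(data) method,
    such as the `rag` pipeline built above.
    """
    result = pipeline.run(build_inputs(question))
    return result["generator"]["replies"][0]
```

With the pipeline above, a call such as ask(rag, "What is NVIDIA NIM?") would return the generated answer as a string.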

Indexing Files and Deploying the Haystack RAG Pipeline

Hayhooks allows the deployment of RAG pipelines in a containerized environment. In our example, we have provided a docker-compose file to set up both the Qdrant database and the RAG pipeline. As we are leveraging NIMs hosted on the NVIDIA API catalog, we need to set the API keys for the NIMs in the .env file. The instructions below expect NVIDIA_API_KEY (for NvidiaGenerator) and NVIDIA_EMBEDDINGS_KEY (for NvidiaDocumentEmbedder and NvidiaTextEmbedder).

Executing docker-compose up will launch 3 containers: qdrant, hayhooks and qdrant-setup (which will run our indexing pipeline and stop). The Qdrant database will be deployed on the localhost and exposed at port 6333. The Qdrant dashboard allows users to inspect the vectorized documents at localhost:6333/dashboard.
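If you run the Python scripts outside docker-compose, a quick check that both keys are visible to the process can save a failed run. The following is a small sketch, assuming the keys are exported as environment variables under the two names given above; the helper name missing_keys is hypothetical.

```python
import os

# Key names taken from the instructions above
REQUIRED_KEYS = ("NVIDIA_API_KEY", "NVIDIA_EMBEDDINGS_KEY")


def missing_keys(env=None) -> list:
    """Return the required key names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing API keys:", ", ".join(missing))
```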

Serializing Pipelines

Haystack pipelines defined in Python can be serialized to YAML by calling dump() on the pipeline object, as shown in our RAG pipeline tutorial. The YAML definition is as follows:

components:
  embedder:
    ...
    type: haystack_integrations.components.embedders.nvidia.text_embedder.NvidiaTextEmbedder
  generator:
    init_parameters:
      api_key:
        ...
    type: haystack_integrations.components.generators.nvidia.generator.NvidiaGenerator
  prompt:
    init_parameters:
      template: "Answer the question given the context.\nQuestion: {{ query }}\nContext:\n\
        {% for document in documents %}\n    {{ document.content }}\n{% endfor %}\n\
        Answer:"
    type: haystack.components.builders.prompt_builder.PromptBuilder
  retriever:
    init_parameters:
      document_store:
        init_parameters:
          ...
        type: haystack_integrations.document_stores.qdrant.document_store.QdrantDocumentStore
      ...
    type: haystack_integrations.components.retrievers.qdrant.retriever.QdrantEmbeddingRetriever

connections:
- receiver: retriever.query_embedding
  sender: embedder.embedding
- receiver: prompt.documents
  sender: retriever.documents
- receiver: generator.prompt
  sender: prompt.prompt
max_loops_allowed: 100
metadata: {}
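A file such as rag.yaml can be produced from the Python pipeline with a few lines. This is a minimal sketch, assuming dump() accepts a writable text file object as in Haystack 2.x; save_pipeline is a hypothetical helper name.

```python
def save_pipeline(pipeline, path: str) -> None:
    """Serialize a Haystack-style pipeline to a YAML file at `path`."""
    with open(path, "w", encoding="utf-8") as fp:
        # dump() writes the serialized pipeline definition to the file object
        pipeline.dump(fp)
```

Calling save_pipeline(rag, "rag.yaml") would then write the definition that Hayhooks deploys in the next step.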

Deploy the RAG Pipeline

To deploy the RAG pipeline, execute hayhooks deploy rag.yaml which will expose the pipeline on http://localhost:1416/rag by default. You can then visit http://localhost:1416/docs for the API docs and try out the pipeline.
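The deployed endpoint can also be called programmatically. The sketch below builds such a request with the Python standard library. It assumes that the endpoint accepts the same per-component inputs as the pipeline itself; the exact request schema is shown on the /docs page and may differ between Hayhooks versions. The function name build_rag_request is hypothetical.

```python
import json
import urllib.request


def build_rag_request(question: str, base_url: str = "http://localhost:1416"):
    """Build a POST request for the deployed 'rag' pipeline (not sent here)."""
    # Assumed body: one entry per pipeline input component
    payload = {
        "embedder": {"text": question},
        "prompt": {"query": question},
    }
    return urllib.request.Request(
        url=f"{base_url}/rag",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# To send the request against a running Hayhooks server:
#   with urllib.request.urlopen(build_rag_request("What is NIM?")) as resp:
#       print(json.load(resp))
```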

Figure 3 - API Doc UI interface for trying out the RAG Pipeline

For production, Haystack provides Helm charts and instructions to create services running Hayhooks with a container orchestrator like Kubernetes.

In the next sections, we will show how to deploy, monitor and autoscale NIMs on your infrastructure in a Kubernetes environment for self-hosting AI foundation models. Finally, we will provide instructions on how to use them in the Haystack RAG pipeline.

Self-hosting NVIDIA NIMs on a Kubernetes cluster

Kubernetes Cluster Environment

In this tutorial, the host is a DGX H100 with eight H100 GPUs (80 GB of memory each) running Ubuntu. Docker is used as the container runtime, and Kubernetes is deployed on the host using Minikube. To enable GPU utilization in Kubernetes, we install essential NVIDIA software components using the GPU Operator.

NVIDIA NIMs Deployment

As part of this setup, we deploy the following NIMs into the Kubernetes cluster using Helm charts:

  • LLM NIM (meta-llama3-8b-instruct)
  • NeMo Retriever Embedding NIM (NV-Embed-QA)

The LLM NIM Helm chart is on GitHub, while the NVIDIA NeMo Retriever Embedding NIM Helm chart is in the NGC private registry, requiring Early Access (apply for Early Access). Figure 4 illustrates the deployment of NIMs on a Kubernetes cluster running on a DGX H100. The GPU Operator components are deployed via its Helm chart and are part of the GPU Operator stack. Prometheus and Grafana are deployed via Helm charts for monitoring the Kubernetes cluster and the NIMs.

Figure 4 - NVIDIA NIMs and other components deployment on a Kubernetes cluster

The LLM NIM Helm chart contains the LLM NIM container, which runs within a pod and references the model via a Persistent Volume (PV) and Persistent Volume Claim (PVC). The LLM NIM pods are autoscaled using the Horizontal Pod Autoscaler (HPA) based on custom metrics and are exposed via a Kubernetes ClusterIP service. To access the LLM NIM, we deploy an ingress and expose it at the /llm endpoint.

Similarly, the NeMo Retriever Embedding NIM Helm chart includes the Retriever Embedding NIM container, which runs within a pod and references the model on the host via a PV and PVC. The NeMo Retriever Embedding NIM pods are also autoscaled via HPA and are exposed via a Kubernetes ClusterIP service. To access the NeMo Retriever Embedding NIM, we deploy an ingress and expose it at the /embedding endpoint.

Users and other applications can access the exposed NIMs via the ingress. The vector database Qdrant is deployed using this helm chart.
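To make the path mapping concrete, the sketch below emulates how an ingress rule with the path pattern /llm(/|$)(.*) and the rewrite target /$2 (the pattern used in the ingress manifests later in this post) turns an external path into the path the NIM service receives. It is an approximation written with Python's re module, not the ingress controller's own matching logic, and the function name rewrite_path is hypothetical.

```python
import re
from typing import Optional


def rewrite_path(path: str, prefix: str) -> Optional[str]:
    """Emulate `path: <prefix>(/|$)(.*)` with `rewrite-target: /$2`.

    Returns the rewritten path, or None when the rule does not match.
    """
    match = re.match(rf"^{re.escape(prefix)}(/|$)(.*)", path)
    if match is None:
        return None
    # $2 is the second capture group: everything after the prefix and slash
    return "/" + match.group(2)


if __name__ == "__main__":
    for external in ("/llm/v1/chat/completions", "/llm", "/llmx/v1"):
        print(external, "->", rewrite_path(external, "/llm"))
```

A request to /llm/v1/chat/completions therefore reaches the LLM NIM as /v1/chat/completions, while a path such as /llmx/v1 is not matched by the rule.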

Now, let’s take a closer look at the deployment process for each NIM:

LLM NIM deployment

  1. Create the namespace if it does not exist yet:

kubectl create namespace nim-llm

  2. Add a Docker registry secret for pulling NIM containers from NGC, replacing <ngc-cli-api-key> with your API key from NGC. Follow this link for generating an API key in NGC.
kubectl create secret -n nim-llm docker-registry nvcrimagepullsecret \
    --docker-server=nvcr.io \
    --docker-username='$oauthtoken' --docker-password=<ngc-cli-api-key>  
  3. Create a generic secret ngc-api, which is used to pull the model within the NIM container.
kubectl create secret -n nim-llm generic ngc-api \
    --from-literal=NGC_CLI_API_KEY=<ngc-cli-api-key> 
  4. Create a nim-llm-values.yaml file with the content below. Adjust the repository and tag values to match your environment.
image:
  repository: "nvcr.io/nvidia/nim/nim-llm/meta-llama3-8b-instruct" # container image location
  tag: "24.05" # LLM NIM version you want to deploy

model:
  ngcAPISecret: ngc-api  # name of a secret in the cluster that includes a key named NGC_CLI_API_KEY and is an NGC API key
resources:
  limits:
    nvidia.com/gpu: 1
  requests:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  size: 30Gi
imagePullSecrets:
  - name: nvcrimagepullsecret # secret created to pull nvcr.io image
  5. We assume that the Helm chart for the LLM NIM is located at ./nims/helm/nim-llm/; adjust the chart path in the command if yours is elsewhere. Deploy the LLM NIM by running the following command:

helm -n nim-llm install nim-llm -f nim-llm-values.yaml ./nims/helm/nim-llm/

  6. The deployment takes a few minutes to start the containers, download models, and become ready. You can monitor the pods with the command below:
kubectl get pods -n nim-llm

Example Output

NAME        READY   STATUS    RESTARTS   AGE
nim-llm-0   1/1     Running   0          8m21s
  7. Install an ingress controller if one is not installed already. Then, create a file ingress-nim-llm.yaml with the content below to create the ingress for the LLM NIM. Make sure to replace the host (here nims.example.com) with your fully qualified domain name.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nim-llm-ingress
  namespace: nim-llm
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2
spec:
  rules:
    - host: nims.example.com
      http:
        paths:
          - path: /llm(/|$)(.*)
            pathType: ImplementationSpecific
            backend:
              service:
                name: nim-llm
                port:
                  number: 8000

Deploy the ingress with the below command:

kubectl apply -f ingress-nim-llm.yaml
  8. Access the exposed service by making a curl request for testing (replace nims.example.com with your own fully qualified domain name):
curl -X 'POST' 'http://nims.example.com/llm/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "messages": [
    {
      "content": "You are a polite and respectful chatbot helping people plan a vacation.",
      "role": "system"
    },
    {
      "content": "What shall i do in France in one line?",
      "role": "user"
    }
  ],
  "model": "meta-llama3-8b-instruct",
  "temperature": 0.5,
  "max_tokens": 1024,
  "top_p": 1,
  "stream": false
}'

Example output:

{"id":"cmpl-0027fdbe808747e987c444d1f86b0543","object":"chat.completion","created":1716325880,"model":"meta-llama3-8b-instruct","choices":[{"index":0,"message":{"role":"assistant","content":"In France, you can stroll along the Seine River in Paris, visit the iconic Eiffel Tower, indulge in croissants and cheese, and explore the charming streets of Montmartre, or head to the French Riviera for a luxurious getaway."},"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":39,"total_tokens":92,"completion_tokens":53}}
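The same call can be made from Python. The sketch below uses only the standard library and mirrors the curl request above; it assumes the ingress host and /llm path from the previous steps, and the helper names build_chat_request and extract_reply are hypothetical.

```python
import json
import urllib.request


def build_chat_request(host: str, user_message: str,
                       model: str = "meta-llama3-8b-instruct"):
    """Build a chat completion request for the self-hosted LLM NIM (not sent here)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": 0.5,
        "max_tokens": 1024,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"http://{host}/llm/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"accept": "application/json",
                 "Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: str) -> str:
    """Return the assistant text from a chat completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


# To send the request against a running NIM:
#   with urllib.request.urlopen(build_chat_request("nims.example.com", "Hi")) as r:
#       print(extract_reply(r.read().decode("utf-8")))
```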

Now, we have the LLM NIM up and running.

NeMo Retriever Embedding NIM deployment

The deployment of the NeMo Retriever Embedding NIM is similar to the LLM NIM.

  1. Follow steps 1 to 3 of the LLM NIM deployment, replacing the namespace with nim-embedding in the commands.

  2. Create a nim-embedding-values.yaml file with the content below. Adjust the following:

    • ngcModel.org : The ID of the organization where the model is located in NGC.
    • ngcModel.path : Replace <org-id> with the ID of the organization and <team-name> with the team name under the organization where the model is located.
    • image.repository and image.tag values depending on your environment.
ngcModel:
  directoryName: nv-embed-qa_v4
  org: <org-id>
  path: <org-id>/<team-name>/nv-embed-qa:4
  template: NV-Embed-QA_template.yaml
  name: NV-Embed-QA-4.nemo

replicaCount: 1

image:
  repository: nvcr.io/nvidia/nim/nemo-retriever-embedding-microservice
  tag: "24.04"

imagePullSecrets:
  - name: nvcrimagepullsecret

envVars:
  - name: TRANSFORMERS_CACHE
    value: /scratch/.cache

modelStorage:
  class: ""
  size: 10Gi

service:
  type: ClusterIP
  port: 8080
  3. We assume that the Helm chart for the NeMo Retriever Embedding NIM is located at ./nims/helm/nim-embedding/; adjust the chart path in the commands if yours is elsewhere. Build the chart dependencies and deploy the NeMo Retriever Embedding NIM by running the following commands:
helm dependency build ./nims/helm/nim-embedding/

helm -n nim-embedding install nim-embedding -f nim-embedding-values.yaml ./nims/helm/nim-embedding/
  4. The deployment takes a few minutes to start the containers, download models, and become ready. You can monitor the pods with the command below:
kubectl get pods -n nim-embedding

Example Output

NAME                                     READY   STATUS    RESTARTS   AGE
nim-embedding-nemo-embedding-ms-d58c..   1/1     Running   0          87m
  5. Create a file ingress-nim-embedding.yaml similar to the LLM NIM ingress with service name nim-embedding-nemo-embedding-ms, port 8080, and path /embedding(/|$)(.*). Afterwards, deploy the ingress.

  6. Access the exposed service by making a curl request for testing (replace nims.example.com below with your fully qualified domain name).

curl -X 'GET' \
  'http://nims.example.com/embedding/v1/models' \
  -H 'accept: application/json'

Example output:

{"object":"list","data":[{"id":"NV-Embed-QA","created":0,"object":"model","owned_by":"organization-owner"}]}
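The model id returned by this endpoint is the value to pass as the model argument when the embedders are pointed at the self-hosted NIM. A minimal sketch for reading it from the response body, with model_ids as a hypothetical helper name:

```python
import json


def model_ids(models_response: str) -> list:
    """Return the model ids listed in a /v1/models response body."""
    body = json.loads(models_response)
    return [entry["id"] for entry in body.get("data", [])]


if __name__ == "__main__":
    # Response body copied from the example output above
    sample = ('{"object":"list","data":[{"id":"NV-Embed-QA","created":0,'
              '"object":"model","owned_by":"organization-owner"}]}')
    print(model_ids(sample))
```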

Now, we have the NeMo Retriever Embedding NIM up and running.

Once the above procedure is completed, you will have API endpoints of LLM NIM and NeMo Retriever Embedding NIM.

Operational Considerations

Monitoring and autoscaling are essential for deployed NIMs to ensure efficient, effective, and reliable operation. Monitoring tracks performance metrics, detects errors, and helps optimize resource utilization, while autoscaling dynamically adjusts resources to match changing workloads, so that the NIMs can handle sudden spikes or dips in demand. This enables NIMs to provide accurate and timely responses, even under heavy loads, while optimizing costs and maintaining high availability. In this section, we delve into the details of deploying monitoring and enabling autoscaling for NIMs.

Monitoring

NVIDIA NIM metrics are collected with the open-source tool Prometheus and visualized with the Grafana dashboards. NVIDIA dcgm-exporter is the preferred tool to collect GPU telemetry. We follow the instructions from here for the deployment of Prometheus and Grafana.

Visualizing NVIDIA NIM Metrics

By default, the NIM container exposes NVIDIA NIM metrics at http://localhost:8000/metrics. All the exposed metrics are listed here. A Prometheus ServiceMonitor, which defines the applications to scrape metrics from within a Kubernetes cluster, publishes them to Prometheus so that they can be viewed in the Grafana dashboard.

  1. Create a file service-monitor-nim-llm.yaml with the content below. We currently configure it to scrape metrics only from the LLM NIM, but it can be extended to other NIMs as well.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nim-llm-sm
  namespace: nim-llm
spec:
  endpoints:
  - interval: 30s
    targetPort: 8000
    path: /metrics
  namespaceSelector:
    matchNames:
    - nim-llm
  selector:
    matchLabels:
      app.kubernetes.io/name: nim-llm
  2. Create the Prometheus ServiceMonitor using the command below:
kubectl apply -f service-monitor-nim-llm.yaml

In the Prometheus UI under Status -> Targets, you will see the below ServiceMonitor once it’s deployed.

Figure 5 - Prometheus UI showing the deployed ServiceMonitor

  3. Let’s check some inference metrics on the Prometheus UI. Figure 6 shows the stacked graph for the request_success_total NIM metric.

Figure 6 - Prometheus UI showing the plot of request_success_total metric indicating number of finished requests.
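For a quick spot check without the Prometheus UI, the text returned by the /metrics endpoint can also be parsed directly. The sketch below sums all samples of one metric from text in the Prometheus exposition format. The sample text and its label names are made up for illustration; only the metric name request_success_total comes from this post, and read_metric is a hypothetical helper name.

```python
from typing import Optional


def read_metric(metrics_text: str, name: str) -> Optional[float]:
    """Sum all samples of metric `name`; return None if it is absent."""
    total, found = 0.0, False
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip HELP/TYPE comments
            continue
        # A sample looks like: name{label="value"} 12.0 [timestamp]
        sample_name = line.split("{", 1)[0].split(" ", 1)[0]
        if sample_name != name:
            continue
        if "}" in line:
            rest = line.rsplit("}", 1)[1]       # text after the label set
        else:
            rest = line[len(sample_name):]      # no labels on this sample
        total += float(rest.split()[0])
        found = True
    return total if found else None


if __name__ == "__main__":
    # Illustrative text only; real NIM output has more metrics and labels
    sample = "\n".join([
        "# TYPE request_success_total counter",
        'request_success_total{model_name="a"} 5.0',
        'request_success_total{model_name="b"} 2.0',
    ])
    print(read_metric(sample, "request_success_total"))
```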

Autoscaling NVIDIA NIM

In this tutorial, we use the Kubernetes Horizontal Pod Autoscaler (HPA) to scale the NIM pods. We have defined custom metrics that track the average GPU usage of each NVIDIA NIM; the HPA uses them to dynamically adjust the number of NIM pods. See the metric definitions below:

  • nim_llm_gpu_avg : avg by (kubernetes_node, pod, namespace, gpu) (DCGM_FI_DEV_GPU_UTIL{pod=~"nim-llm-.*"})
  • nim_embedding_gpu_avg : avg by (kubernetes_node, pod, namespace, gpu) (DCGM_FI_DEV_GPU_UTIL{pod=~"nim-embedding-.*"})

The average GPU usage metric is used as an example and must be adjusted to the specific application environment.

Let’s deploy the HPA.

  1. Create a file named prometheus_rule_nims.yaml with the content below to create the Prometheus rules for the above custom metrics. Adjust the labels (app and other Prometheus labels) to match your deployed Prometheus instance.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    app: kube-prometheus-stack
    app.kubernetes.io/instance: kube-prometheus-stack-1710254997
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/part-of: kube-prometheus-stack
    app.kubernetes.io/version: 56.8.2
    chart: kube-prometheus-stack-56.8.2
    heritage: Helm
    release: kube-prometheus-stack-1710254997
  name: kube-prometheus-stack-1709-gpu.rules
  namespace: prometheus
spec:
  groups:
  - name: gpu.rules
    rules:
    - expr: avg by (kubernetes_node, pod, namespace, gpu) (DCGM_FI_DEV_GPU_UTIL{pod=~"nim-llm-.*"})
      record: nim_llm_gpu_avg
    - expr: avg by (kubernetes_node, pod, namespace, gpu) (DCGM_FI_DEV_GPU_UTIL{pod=~"nim-embedding-.*"})
      record: nim_embedding_gpu_avg
  2. Create the custom Prometheus recording rules by running the command below:
kubectl apply -f prometheus_rule_nims.yaml
  3. In the Prometheus UI, under Status -> Rules, you can see the two rules created above, as shown in Figure 7.

Figure 7 - Prometheus rules tab showing the created custom rules to record GPU usage by NVIDIA NIM.

  4. Install prometheus-adapter to query the custom metrics based on the recording rules created above and register them with the custom metrics API for the HPA to fetch. In the command below, replace <prometheus-service-name> with the name of the Prometheus service in Kubernetes.
helm upgrade --install prometheus-adapter prometheus-community/prometheus-adapter --set prometheus.url="http://<prometheus-service-name>.prometheus.svc.cluster.local"
  5. Query the custom metrics API to check that the metrics have been registered, using the command below:
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep gpu_avg

Example Output:

"name": "namespaces/nim_embedding_gpu_avg",
"name": "pods/nim_embedding_gpu_avg",
"name": "pods/nim_llm_gpu_avg",
"name": "namespaces/nim_llm_gpu_avg",
  6. Create a separate HPA definition for each of the two NVIDIA NIMs. Each definition specifies the minimum and maximum number of replicas, the metric to monitor, and the target value for that metric. Below is the definition for the LLM NIM HPA; you can create a similar one for the NeMo Retriever Embedding NIM using the nim_embedding_gpu_avg metric.

LLM NIM HPA file:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nim-llm-hpa
  namespace: nim-llm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: nim-llm
  minReplicas: 1
  maxReplicas: 4
  metrics:
    - type: Pods
      pods:
        metric:
          name: nim_llm_gpu_avg
        target:
          type: AverageValue
          averageValue: 30
  7. Create the two HPAs using the commands below:
kubectl apply -f hpa_nim_llm.yaml
kubectl apply -f hpa_nim_embedding.yaml
  8. Check the status of the HPAs:

kubectl get hpa -A

Example Output:

NAMESPACE       NAME                REFERENCE                                    TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
nim-embedding   nim-embedding-hpa   Deployment/nim-embedding-nemo-embedding-ms   0/30      1         4         1          94s
nim-llm         nim-llm-hpa         StatefulSet/nim-llm                          0/30      1         4         1          94s
  9. Send some requests to the LLM NIM and watch the LLM NIM pods scale, as shown below:
NAME        READY   STATUS    RESTARTS   AGE
nim-llm-0   1/1     Running   0          3h47m
nim-llm-1   1/1     Running   0          3m30s

Also, Figure 8 shows the Prometheus graph showing the scaling of LLM NIM.

Figure 8 - Prometheus graph showing the scaling of LLM NIM.
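To get a feel for how the target value drives scaling, the sketch below applies the replica calculation documented for the Kubernetes HPA, desired = ceil(current replicas × current value / target value), clamped to the configured bounds. The defaults mirror the LLM NIM HPA above (target 30, 1 to 4 replicas). The 10% tolerance is the documented Kubernetes default and is an assumption about your cluster; stabilization windows and missing-metric handling are left out. The function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float,
                     target_avg: float = 30.0, min_replicas: int = 1,
                     max_replicas: int = 4, tolerance: float = 0.1) -> int:
    """Approximate the replica count an HPA would request."""
    ratio = current_avg / target_avg
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: the HPA leaves the replica count alone
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the minReplicas/maxReplicas bounds of the HPA definition
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    for replicas, gpu_avg in ((1, 75.0), (2, 90.0), (3, 5.0)):
        print(replicas, gpu_avg, "->", desired_replicas(replicas, gpu_avg))
```

For example, one pod averaging 75% GPU utilization against a target of 30 yields three replicas in this sketch, while two pods at 90% would ask for six and are capped at the maximum of four.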

We have now deployed NIMs on your infrastructure in a scalable fashion and can use them in the RAG pipeline. The next section shows how.

Use Self-hosted NVIDIA NIMs in the RAG Pipeline

This section shows how to use the NIMs previously deployed on your infrastructure in a Kubernetes cluster for NvidiaTextEmbedder, NvidiaDocumentEmbedder and NvidiaGenerator in the Haystack RAG pipeline. Replace <self-hosted-embedding-nim-url> with the endpoint of the NeMo Retriever Embedding NIM and <self-hosted-llm-nim-url> with the endpoint of the LLM NIM. The provided notebook in the repository has examples of how to use the self-hosted NIMs.

NvidiaDocumentEmbedder:

# initialize NvidiaDocumentEmbedder with the self-hosted NeMo Retriever Embedding NIM URL
embedder = NvidiaDocumentEmbedder(
    model=embedding_nim_model,
    api_url="http://<self-hosted-embedding-nim-url>/v1"
)

NvidiaTextEmbedder:

# initialize NvidiaTextEmbedder with the self-hosted NeMo Retriever Embedding NIM URL
embedder = NvidiaTextEmbedder(
    model=embedding_nim_model,
    api_url="http://<self-hosted-embedding-nim-url>/v1"
)

NvidiaGenerator:

# initialize NvidiaGenerator with the self-hosted LLM NIM URL
generator = NvidiaGenerator(
    model=llm_nim_model_name,
    api_url="http://<self-hosted-llm-nim-url>/v1",
    model_arguments={
        "temperature": 0.5,
        "top_p": 0.7,
        "max_tokens": 2048,
    },
)
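If the NIMs are exposed through the ingress rules from the deployment section, the two placeholder URLs follow directly from the ingress host and paths. The helper below is a small sketch under that assumption (host nims.example.com, paths /llm and /embedding, plain HTTP); nim_api_urls is a hypothetical name.

```python
def nim_api_urls(host: str, scheme: str = "http") -> dict:
    """Derive the api_url values for the self-hosted NIMs from the ingress host."""
    base = f"{scheme}://{host}"
    return {
        "llm": f"{base}/llm/v1",              # api_url for NvidiaGenerator
        "embedding": f"{base}/embedding/v1",  # api_url for both embedders
    }


if __name__ == "__main__":
    print(nim_api_urls("nims.example.com"))
```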

Summary

In this blog, we provide a comprehensive walkthrough for building robust and scalable RAG applications using Haystack and NVIDIA NIMs. We cover building the RAG pipeline by leveraging NIMs hosted on the NVIDIA API catalog and also using self-hosted NIMs deployed on your infrastructure in a Kubernetes environment. Our step-by-step instructions detail how to deploy NIMs in a Kubernetes cluster, monitor their performance, and scale them as needed.

By leveraging proven deployment patterns, our architecture ensures a responsive user experience and predictable query times, even under high or bursty query and document indexing workloads. Moreover, the deployment recipe is flexible and can be implemented in cloud, on-premises, or air-gapped environments. With this guide, we aim to provide a resource for anyone looking to build reliable and performant RAG applications at scale.