# QnA

### Overview

AquilaX QnA (also known as Securitron) is a compact, instruction‑tuned transformer optimized for low‑latency CPU inference. It combines general-domain Q\&A with AquilaX-specific expertise, and supports real‑time streaming with low resource usage.

***

#### Model Specs

* **Name**: AquilaX QnA (Securitron)
* **Architecture**: Instruction‑tuned transformer
* **Fine‑Tuning**: General conversational + AquilaX domain data
* **Context Window**: 8192 tokens
* **Memory**: ≥ 4 GB RAM (CPU)
* **Platforms**: CPU (quantized) & CUDA GPU
* **API Access**: Integrate via the [AquilaX Securitron API](https://developers.aquilax.ai/api-reference/genai/securitron?playground=open).
* **Interactive Demo**: Try the chatbot at [AquilaX Home](https://aquilax.ai/app/home).

***

### Key Features

**CPU-Optimized Performance:**

* Quantized for minimal memory usage and fast inference on CPUs.
* No GPU required for efficient operation.

**Dual Knowledge Base:**

* Handles general queries with clarity.
* Delivers precise answers for AquilaX-specific topics.

**Real-Time Streaming:**

* Supports token-by-token response generation for interactive experiences.

**Context-Aware Responses:**

* Maintains a limited conversation history for coherent follow-ups.
* Automatically manages history to optimize memory.

**Custom System Prompt:**

* Configured as Securitron, ensuring professional and consistent responses.

***

### Installation

#### Prerequisites

* Python 3.8+
* PyTorch (CPU or GPU version)
* Transformers library
* Optional: CUDA for GPU acceleration

#### Install Dependencies

```bash
pip install torch transformers
```

#### Download the Model

Load the model and tokenizer directly from Hugging Face:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("AquilaX-AI/QnA")
model = AutoModelForCausalLM.from_pretrained("AquilaX-AI/QnA")
```

***

### Inference Example

The following code demonstrates how to perform inference with the AquilaX QnA model:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("AquilaX-AI/QnA")
model = AutoModelForCausalLM.from_pretrained("AquilaX-AI/QnA")

# System prompt
prompt = "<|im_start|>system\nYou are Securitron, a helpful AI assistant.<|im_end|>\n"

# Initialize history
history = []

# Set device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model.to(device)

while True:
    user_input = input("\nUser Question: ")
    if user_input.lower() == 'break':
        break

    # Format user input
    user = f"<|im_start|>user\n{user_input}<|im_end|>\n<|im_start|>assistant"
    history.append(user)
    history = history[-5:]  # Keep the last 5 entries (user and assistant turns)

    # Build prompt
    full_prompt = prompt + "\n".join(history)
    inputs = tokenizer(full_prompt, return_tensors="pt", truncation=True).input_ids.to(device)

    # Stream response
    streamer = TextStreamer(tokenizer, skip_prompt=True)
    response = model.generate(
        input_ids=inputs,
        streamer=streamer,
        max_new_tokens=512,
        use_cache=True,
        pad_token_id=151645,
        eos_token_id=151645,
        num_return_sequences=1
    )

    # Update history
    decoded = tokenizer.decode(response[0]).split('<|im_start|>assistant')[-1].split('<|im_end|>')[0].strip()
    history.append(decoded + "<|im_end|>")
```

#### Key Notes

* Device: Automatically uses GPU if available; defaults to CPU.
* Streaming: TextStreamer enables real-time response display.
* History: Keeps only the 5 most recent entries (user and assistant turns) to limit memory use.
* API Alternative: Use the [Securitron API](https://developers.aquilax.ai/api-reference/genai/securitron?playground=open) for simpler integration.
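
The loop above trims `history` by entry count, where each entry is either a user turn or an assistant turn. If you would rather bound the history by complete question-and-answer pairs, one possible approach is sketched below. The `ExchangeHistory` class and its method names are hypothetical helpers written for illustration; they are not part of the model or the Transformers library.

```python
from collections import deque


class ExchangeHistory:
    """Keep only the most recent complete (user, assistant) exchanges."""

    def __init__(self, max_exchanges=5):
        # A deque with maxlen drops the oldest exchange once it is full
        self._exchanges = deque(maxlen=max_exchanges)

    def add(self, user_text, assistant_text):
        """Store one finished exchange."""
        self._exchanges.append((user_text, assistant_text))

    def render(self):
        """Return the stored exchanges as chat-formatted text."""
        parts = []
        for user_text, assistant_text in self._exchanges:
            parts.append(f"<|im_start|>user\n{user_text}<|im_end|>")
            parts.append(f"<|im_start|>assistant\n{assistant_text}<|im_end|>")
        return "\n".join(parts)

    def __len__(self):
        return len(self._exchanges)
```

With this helper, the prompt for a new question would be the system prompt, then `render()`, then the new user turn. Whether five full exchanges fit comfortably in memory depends on how long the answers are.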

***

### Input and Output Format

#### Input Format

```
<|im_start|>system
You are Securitron, a helpful AI assistant.
<|im_end|>
<|im_start|>user
{user_question}
<|im_end|>
<|im_start|>assistant
```

#### Output Format

```
<|im_start|>assistant
{generated_response}
<|im_end|>
```

* Responses are streamed in real-time with TextStreamer.
* Cleaned output is plain text for user display.
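
As a minimal sketch of both formats, the helpers below wrap a single question in the chat markers and recover the plain-text answer from decoded model output. They follow the inline layout used in the inference example (no line break before `<|im_end|>`). The function names `build_prompt` and `extract_response` are assumptions made for illustration.

```python
SYSTEM_PROMPT = "You are Securitron, a helpful AI assistant."


def build_prompt(user_question, system_prompt=SYSTEM_PROMPT):
    """Wrap one question in the chat markers, ending at the assistant turn."""
    return (
        f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
        f"<|im_start|>user\n{user_question}<|im_end|>\n"
        f"<|im_start|>assistant"
    )


def extract_response(decoded_text):
    """Return the text of the last assistant turn without chat markers."""
    last_turn = decoded_text.split("<|im_start|>assistant")[-1]
    return last_turn.split("<|im_end|>")[0].strip()
```

`extract_response` expects text decoded with special tokens kept, because it relies on the `<|im_start|>` and `<|im_end|>` markers being present.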

***

### Performance Optimization

* CPU Efficiency: Quantized model ensures low memory usage.
* History Management: Keep the history short (the example retains the last 5 entries) to reduce memory overhead.
* GPU Support: Enable CUDA for faster inference if available.
* Response Length: Adjust max\_new\_tokens for shorter or longer outputs.
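
The prompt and the generated reply share the 8192-token context window, so raising `max_new_tokens` leaves less room for history. The sketch below shows that arithmetic and one way to drop old history entries until the rest fits. The helper names are hypothetical, and `count_tokens` is an assumed callable; in practice it could be something like `lambda s: len(tokenizer(s).input_ids)`. The system prompt and separators are ignored here, so treat the result as an approximation.

```python
CONTEXT_WINDOW = 8192  # context window listed in the model specs


def prompt_budget(max_new_tokens, context_window=CONTEXT_WINDOW):
    """Tokens left for the prompt after reserving room for the reply."""
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be between 1 and context_window - 1")
    return context_window - max_new_tokens


def fit_history(history, count_tokens, budget):
    """Drop the oldest entries until the remaining ones fit within budget."""
    kept = list(history)
    while kept and sum(count_tokens(entry) for entry in kept) > budget:
        kept.pop(0)
    return kept
```

With the example's `max_new_tokens=512`, `prompt_budget(512)` leaves 7680 tokens for the system prompt, history, and current question.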

***

### Deployment Considerations

* Environment: Use a virtual environment to manage dependencies.
* Error Handling: Wrap model loading and generation in `try`/`except` blocks for robust error management.
* Scalability: For production, deploy via FastAPI or use the [Securitron API](https://developers.aquilax.ai/api-reference/genai/securitron?playground=open).
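
The error-handling point can be sketched as a thin wrapper around whatever function produces the answer. Here `generate_fn` stands in for your own generation call, and `safe_generate` and `FALLBACK_MESSAGE` are hypothetical names chosen for this example. Catching `Exception` keeps the sketch short; a real service would likely catch narrower exception types.

```python
import logging

logger = logging.getLogger("securitron")

FALLBACK_MESSAGE = "Sorry, something went wrong while generating a response."


def safe_generate(generate_fn, prompt, fallback=FALLBACK_MESSAGE):
    """Call generate_fn(prompt) and return a fallback string if it raises."""
    try:
        return generate_fn(prompt)
    except Exception:
        # Log the traceback for operators; keep the user-facing text generic
        logger.exception("Generation failed")
        return fallback
```

The same wrapper could sit inside a web handler so that a failed generation returns a controlled message rather than an unhandled error.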

***

### Support and Contributions

For support or updates, contact the [AquilaX](https://aquilax.ai/book-a-demo) team or visit the model’s Hugging Face repository, `AquilaX-AI/QnA`.

***

> Credit to the engineering team: [Suriya](https://www.linkedin.com/in/suriya-s-83b25524a) & [Pachaiappan](https://www.linkedin.com/in/pachaiappan/)

