
AquilaX QnA (Securitron)

Overview

AquilaX QnA, also known as Securitron, is a compact, instruction‑tuned transformer optimized for low‑latency CPU inference. It combines general-domain question answering with precise AquilaX-specific expertise, and supports real‑time streaming while keeping resource usage minimal.


Model Specs

  • Name: AquilaX QnA (Securitron)

  • Architecture: Instruction‑tuned transformer

  • Fine‑Tuning: General conversational + AquilaX domain data

  • Context Window: 8192 tokens

  • Memory: ≥ 4 GB RAM (CPU)

  • Platforms: CPU (quantized) & CUDA GPU

  • API Access: Integrate via the AquilaX Securitron API.

  • Interactive Demo: Try the chatbot at AquilaX Home.


Key Features

CPU-Optimized Performance:

  • Quantized for minimal memory usage and fast inference on CPUs.

  • No GPU required for efficient operation.

Dual Knowledge Base:

  • Handles general queries with clarity.

  • Delivers precise answers for AquilaX-specific topics.

Real-Time Streaming:

  • Supports token-by-token response generation for interactive experiences.

Context-Aware Responses:

  • Maintains a limited conversation history for coherent follow-ups.

  • Automatically manages history to optimize memory.

Custom System Prompt:

  • Configured as Securitron, ensuring professional and consistent responses.


Installation

Prerequisites

  • Python 3.8+

  • PyTorch (CPU or GPU version)

  • Transformers library

  • Optional: CUDA for GPU acceleration

Install Dependencies
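
Install PyTorch and the Transformers library. The default PyPI wheels are sufficient for CPU-only use; a CUDA build of PyTorch is needed for GPU acceleration:

```bash
pip install torch transformers
```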

Download the Model

Load the model and tokenizer directly from Hugging Face:
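
A minimal sketch using the standard Transformers auto classes; the repository id AquilaX-AI/QnA comes from the Support section below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "AquilaX-AI/QnA"  # Hugging Face repository (see Support section)

# Download (or load from the local cache) the tokenizer and model weights
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
```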


Inference Example

The following code demonstrates how to perform inference with the AquilaX QnA model:
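The snippet below is a minimal, illustrative sketch rather than an official reference implementation. It assumes the model exposes the standard Transformers causal-LM interface with a chat template, reuses the AquilaX-AI/QnA repository id from the Support section, and uses an invented wording for the Securitron system prompt. It covers device selection, token-by-token streaming, and the 5-exchange history limit described in the Key Notes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

MODEL_ID = "AquilaX-AI/QnA"   # Hugging Face repository id
MAX_HISTORY = 5               # keep only the last 5 exchanges

# Automatically use GPU if available; default to CPU otherwise
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID).to(device)

# Streams tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# System prompt configuring the assistant as Securitron (wording is illustrative)
history = [{"role": "system", "content": "You are Securitron, the AquilaX security assistant."}]

def ask(question: str, max_new_tokens: int = 256) -> str:
    """Append the question to the history, generate a streamed answer, and return it."""
    history.append({"role": "user", "content": question})

    # Build the prompt with the tokenizer's chat template
    inputs = tokenizer.apply_chat_template(
        history, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    outputs = model.generate(
        inputs,
        max_new_tokens=max_new_tokens,
        do_sample=False,
        streamer=streamer,
    )

    # Decode only the newly generated tokens (drop the prompt)
    answer = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)
    history.append({"role": "assistant", "content": answer})

    # Trim to the system prompt plus the last MAX_HISTORY exchanges
    if len(history) > 1 + 2 * MAX_HISTORY:
        del history[1:-2 * MAX_HISTORY]

    return answer

if __name__ == "__main__":
    print(ask("What is AquilaX?"))
```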

Key Notes

  • Device: Automatically uses GPU if available; defaults to CPU.

  • Streaming: TextStreamer enables real-time response display.

  • History: Limited to the last 5 exchanges to optimize memory.

  • API Alternative: Use the Securitron API for simpler integration.


Input and Output Format

Input Format
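
The canonical prompt layout is defined by the model's own chat template; the messages structure below is the conventional Transformers format and is shown as an assumption rather than a guaranteed schema. It reuses the tokenizer loaded in the inference example.

```python
# Conversation history as a list of role/content messages (assumed format)
messages = [
    {"role": "system", "content": "You are Securitron, the AquilaX security assistant."},
    {"role": "user", "content": "How do I start a scan in AquilaX?"},
]

# The tokenizer's chat template turns the messages into model-ready token ids
prompt_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
```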

Output Format

  • Responses are streamed in real-time with TextStreamer.

  • The cleaned output is plain text, ready for display to the user.


Performance Optimization

  • CPU Efficiency: The quantized model keeps memory usage low (see the sketch after this list).

  • History Management: Limit history to the last 5 exchanges to reduce memory overhead.

  • GPU Support: Enable CUDA for faster inference if available.

  • Response Length: Adjust max_new_tokens for shorter or longer outputs.
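
The published weights may already ship in a quantized form, so the sketch below is purely illustrative: it shows one way to apply PyTorch dynamic quantization to the model's Linear layers for lower memory use during CPU inference.

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("AquilaX-AI/QnA")

# Dynamically quantize the Linear layers to int8 for a smaller CPU footprint
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```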


Deployment Considerations

  • Environment: Use a virtual environment to manage dependencies.

  • Error Handling: Wrap model loading and inference calls in try/except blocks for robust error handling.

  • Scalability: For production, deploy via FastAPI or use the Securitron API.
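
For a self-hosted deployment, a thin FastAPI wrapper around the ask() helper from the inference example could look like the sketch below. The /ask endpoint and request fields are hypothetical, and the hosted Securitron API remains the simpler option.

```python
# Illustrative FastAPI wrapper (endpoint and field names are hypothetical)
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Question(BaseModel):
    text: str

@app.post("/ask")
def ask_endpoint(question: Question):
    try:
        # `ask` is the helper defined in the inference example above
        return {"answer": ask(question.text)}
    except Exception as exc:  # robust error handling, as recommended above
        return {"error": str(exc)}
```

Run it locally with uvicorn (for example, uvicorn main:app) after saving the file as main.py alongside the inference code.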


Support and Contributions

For support or updates, contact the AquilaX team or visit the model's Hugging Face repository, AquilaX-AI/QnA.


Engineering credits: Suriya & Pachaiappan
