QnA
AquilaX QnA (Securitron)
Overview
AquilaX QnA, aka Securitron, is a compact, instruction‑tuned transformer optimized for low‑latency CPU inference. It blends general-domain Q&A with precise AquilaX expertise, supporting real‑time streaming and minimal resource usage.
Model Specs
Name: AquilaX QnA (Securitron)
Architecture: Instruction‑tuned transformer
Fine‑Tuning: General conversational + AquilaX domain data
Context Window: 8192 tokens
Memory: ≥ 4 GB RAM (CPU)
Platforms: CPU (quantized) & CUDA GPU
API Access: Integrate via the AquilaX Securitron API.
Interactive Demo: Try the chatbot at AquilaX Home.
Key Features
CPU-Optimized Performance:
Quantized for minimal memory usage and fast inference on CPUs.
No GPU required for efficient operation.
Dual Knowledge Base:
Handles general queries with clarity.
Delivers precise answers for AquilaX-specific topics.
Real-Time Streaming:
Supports token-by-token response generation for interactive experiences.
Context-Aware Responses:
Maintains a limited conversation history for coherent follow-ups.
Automatically manages history to optimize memory.
Custom System Prompt:
Configured as Securitron, ensuring professional and consistent responses.
Installation
Prerequisites
Python 3.8+
PyTorch (CPU or GPU version)
Transformers library
Optional: CUDA for GPU acceleration
Install Dependencies
Download the Model
Load the model and tokenizer directly from Hugging Face:
Inference Example
The following code demonstrates how to perform inference with the AquilaX QnA model:
Key Notes
Device: Automatically uses GPU if available; defaults to CPU.
Streaming: TextStreamer enables real-time response display.
History: Limits to 5 exchanges to optimize memory.
API Alternative: Use the Securitron API for simpler integration.
Input and Output Format
Input Format
Output Format
Responses are streamed in real-time with TextStreamer.
Cleaned output is plain text for user display.
Performance Optimization
CPU Efficiency: Quantized model ensures low memory usage.
History Management: Limit to 5 exchanges to reduce memory overhead.
GPU Support: Enable CUDA for faster inference if available.
Response Length: Adjust max_new_tokens for shorter or longer outputs.
Deployment Considerations
Environment: Use a virtual environment to manage dependencies.
Error Handling: Add try-catch for robust error management.
Scalability: For production, deploy via FastAPI or use the Securitron API.
Support and Contributions
For support or updates, contact the AquilaX team or visit the model’s Hugging Face repository AquilaX-AI/QnA
Credit on Engineering team: Suriya & Pachaiappan
Last updated
Was this helpful?