NL-JSON Model
The NL-JSON model is designed to simplify the interaction between users and complex systems by translating natural language commands into structured JSON format.
Hugging Face (Model): https://huggingface.co/AquilaX-AI/NL-JSON-Start-Scan
GitHub (Code): https://github.com/AquilaX-AI/NL-JSON-Start-Scan
What is the model made for?
The NL-JSON model (also known as AquilaX AI Ask) is designed to simplify the interaction between users and complex systems by translating natural language commands into structured JSON format. Its primary purpose is to automate workflows such as security scanning, code analysis, and other technical tasks. By understanding user intent from everyday language, it generates machine-readable JSON instructions, enabling non-technical users to operate advanced tools efficiently. This improves usability, scalability, and integration across platforms.
What is NL -> JSON?
The Natural Language to JSON (NL-JSON) component translates human-readable text into a structured JSON format. This allows users to input commands in plain English, which the system then converts into machine-readable JSON. The generated JSON can then be fed into backend systems to perform specific tasks. This ensures accurate interpretation of user requests and seamless execution by the backend.
What problem does it resolve?
The NL-JSON model resolves the problem of translating human instructions into machine-executable commands, particularly in complex systems like security scanners or code analyzers. Traditionally, users needed to understand technical configurations and syntax to interact with these systems. The NL-JSON model eliminates this barrier by allowing users to input natural language commands, which it then converts into JSON format, automating processes and reducing the need for technical expertise. This simplifies user interaction, reduces errors, and enhances productivity in tasks like vulnerability scanning or code analysis.
Model Training
Training uses a sequence-to-sequence (Seq2Seq) model built with the Hugging Face Transformers library. The model is optimized for natural language generation tasks, and this section summarizes the setup and parameters used for training.
Data Collation
The `DataCollatorForSeq2Seq` was used to handle dynamic padding during training and evaluation, ensuring input sequences are properly padded for batch processing.
Optimizer
`adamw_torch` was chosen as the optimizer, and the learning rate was initialized at 3e-5.
Training Process
A `Seq2SeqTrainer` was used to manage the training process. The training involved:
Model: Custom Seq2Seq model
Datasets: Tokenized datasets for training and evaluation were used.
Metrics: A custom `compute_metrics` function was used to evaluate the model's performance after each epoch.
Training Script
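A minimal sketch of how the components named above could be wired together, assuming a recent Hugging Face Transformers release. Only the optimizer (`adamw_torch`), the learning rate (3e-5), and the use of FP16, label smoothing, warmup steps, and gradient accumulation come from this page. Every value marked as a placeholder, the output path, and the `build_trainer` helper are assumptions for illustration, not the settings actually used.

```python
# Hyperparameters for Seq2SeqTrainingArguments.
# "placeholder" marks assumed values that are NOT taken from the original setup.
TRAINING_KWARGS = {
    "output_dir": "./nl-json-checkpoints",  # placeholder path
    "optim": "adamw_torch",                 # optimizer named in the text
    "learning_rate": 3e-5,                  # initial learning rate named in the text
    "fp16": True,                           # mixed precision; requires a CUDA GPU
    "warmup_steps": 500,                    # placeholder
    "gradient_accumulation_steps": 2,       # placeholder
    "label_smoothing_factor": 0.1,          # placeholder
    "num_train_epochs": 3,                  # placeholder
    "predict_with_generate": True,          # generate text when evaluating
}
# Per-epoch evaluation is configured through a strategy argument whose name
# depends on the installed transformers version, so it is left out here.


def build_trainer(model, tokenizer, train_dataset, eval_dataset, compute_metrics):
    """Assemble a Seq2SeqTrainer from already-prepared pieces (hypothetical helper)."""
    # Imported lazily so the hyperparameters above can be inspected
    # without transformers installed.
    from transformers import (
        DataCollatorForSeq2Seq,
        Seq2SeqTrainer,
        Seq2SeqTrainingArguments,
    )

    args = Seq2SeqTrainingArguments(**TRAINING_KWARGS)
    # Pads inputs and labels dynamically to the longest sequence in each batch.
    collator = DataCollatorForSeq2Seq(tokenizer, model=model)
    return Seq2SeqTrainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=collator,
        compute_metrics=compute_metrics,
    )
```

Calling `build_trainer(...).train()` with a loaded Seq2Seq model, its tokenizer, and tokenized datasets would then start training.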
Conclusion
This setup is designed to optimize the model's performance with proper data collation, warmup steps, gradient accumulation, and effective learning rate scheduling. The chosen parameters, including FP16 and label smoothing, aim to enhance model performance while ensuring efficient GPU usage.
How is it used in AquilaX?
In AquilaX, we leverage the Natural Language to JSON (NL-JSON) model to convert user-provided scan instructions, typically written in plain English, into a structured JSON format. This transformation allows our backend systems to interpret and act on these commands accurately. For example, a user might request "Scan the network for vulnerabilities," and the model translates it into a specific JSON object that the backend understands. This JSON is then processed by the AquilaX system to execute the requested security scans or related tasks. By automating this conversion, we ensure seamless communication between user inputs and backend operations.
Natural Language Command Parsing
The NL-JSON model allows users to input queries or commands in natural language, such as "Execute a secrets scan on the repo https://github.com/AquilaX-AI/vulnapp." or "Execute a security scan on https://github.com/AquilaX-AI/vulnapp to uncover PII." The model then interprets these commands and translates them into structured JSON data, which can be understood by various scanners (e.g., SAST, Secrets, PII, SCA, IaC, Container, Malware, API).
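To make the parsing step concrete, the sketch below checks a model output against a hypothetical schema. The actual JSON layout produced by the model is not documented here, so the `repo` and `scanners` keys, the lowercase scanner names, and the `parse_scan_request` helper are all assumptions for illustration.

```python
import json

# Scanner names follow the list in the text; the lowercase spelling is an assumption.
KNOWN_SCANNERS = {"sast", "secrets", "pii", "sca", "iac", "container", "malware", "api"}


def parse_scan_request(model_output: str) -> dict:
    """Parse the model's text output and validate it against the assumed schema."""
    data = json.loads(model_output)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    repo = data.get("repo")
    if not isinstance(repo, str) or not repo.startswith("https://"):
        raise ValueError("missing or invalid 'repo' URL")

    scanners = data.get("scanners")
    if not isinstance(scanners, list) or not scanners:
        raise ValueError("'scanners' must be a non-empty list")
    if not all(isinstance(name, str) for name in scanners):
        raise ValueError("scanner names must be strings")

    unknown = set(scanners) - KNOWN_SCANNERS
    if unknown:
        raise ValueError(f"unknown scanners: {sorted(unknown)}")

    return {"repo": repo, "scanners": scanners}


# A possible output for "Execute a secrets scan on the repo ..." (hypothetical shape).
example_output = '{"repo": "https://github.com/AquilaX-AI/vulnapp", "scanners": ["secrets"]}'
request = parse_scan_request(example_output)
```

Validating the output before it reaches the backend keeps a malformed or unexpected generation from triggering a scan.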
Command Execution via JSON
Once the natural language input is parsed into JSON, the AquilaX system processes it. Each command is associated with specific parameters or actions in the JSON format. This structured JSON can trigger the appropriate scanners (e.g., for secrets, SAST, or PII) and provide necessary configurations.
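One way such a dispatch step could look is sketched below. It is not the AquilaX backend: the handler functions only return a status string, and the `repo` and `scanners` keys are the same hypothetical schema assumed for illustration.

```python
import json
from typing import Callable, Dict, List


def make_handler(scanner: str) -> Callable[[str], str]:
    """Build a stand-in handler; a real one would start the named scanner."""
    def handler(repo: str) -> str:
        return f"{scanner} scan queued for {repo}"
    return handler


# Maps each scanner name in the JSON to the function that runs it.
SCANNER_HANDLERS: Dict[str, Callable[[str], str]] = {
    name: make_handler(name)
    for name in ("sast", "secrets", "pii", "sca", "iac", "container", "malware", "api")
}


def dispatch(request_json: str) -> List[str]:
    """Trigger every scanner requested in the JSON, in the order listed."""
    request = json.loads(request_json)
    results = []
    for name in request["scanners"]:
        handler = SCANNER_HANDLERS.get(name)
        if handler is None:
            raise KeyError(f"no handler registered for scanner '{name}'")
        results.append(handler(request["repo"]))
    return results


for line in dispatch(
    '{"repo": "https://github.com/AquilaX-AI/vulnapp", "scanners": ["secrets", "pii"]}'
):
    print(line)
```

A lookup table keeps the mapping from JSON fields to scanners in one place, so adding a scanner means registering one more handler.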
Dataset Size
The dataset includes over 2,000 test samples covering various security scanning tasks, such as secrets detection, personally identifiable information (PII) identification, and Static Application Security Testing (SAST), all targeting the GitHub repository AquilaX-AI/vulnapp.
The model testing scenarios include combinations of tasks like secrets detection, PII identification, SAST (Static Application Security Testing), SCA (Software Composition Analysis), IaC (Infrastructure as Code), container security, and API security.
Individual cases
Types of Tests: The dataset tests the model across multiple domains:
Secrets Detection: Scanning the repository for hardcoded secrets, API keys, credentials, and other sensitive information.
PII Identification: Detecting personal data such as email addresses, social security numbers, and credit card details.
SAST (Static Application Security Testing): Identifying vulnerabilities such as SQL injection, cross-site scripting (XSS), and buffer overflow in the codebase.
SCA (Software Composition Analysis): Analyzing third-party dependencies and libraries for known vulnerabilities and outdated packages.
IaC (Infrastructure as Code): Scanning infrastructure configuration files for misconfigurations and security risks.
Container Security: Analyzing Docker configurations for vulnerabilities.
API Security: Checking for vulnerabilities in API endpoints, such as improper access controls and rate-limiting issues.
Combined cases
Secrets + PII
Queries in this category combine the detection of hardcoded secrets (e.g., API keys, passwords) with PII scanning (e.g., social security numbers, email addresses). Examples:
"Find AWS credentials and detect any personally identifiable information in the repo."
"Look for email addresses and any hardcoded passwords."
Secrets + SAST
This category involves scanning for hardcoded secrets while simultaneously performing static code analysis to detect security vulnerabilities. Example:
"Detect hardcoded secrets and perform static analysis for vulnerabilities."
PII + SAST
This category combines PII identification with static analysis to uncover potential security flaws such as SQL injection or insecure deserialization. Example:
"Find social security numbers and scan for SQL injection vulnerabilities."
SAST + SCA
This category pairs static analysis for code vulnerabilities with software composition analysis to detect outdated dependencies and known CVEs (Common Vulnerabilities and Exposures). Example:
"Perform static analysis for vulnerabilities and identify outdated dependencies."
SCA + IaC
This category combines scanning third-party dependencies for known vulnerabilities with Infrastructure as Code analysis for misconfigurations (e.g., in Terraform or Kubernetes files). Example:
"Check for known CVEs and scan for misconfigured cloud services."
API + Secrets
This category focuses on scanning for broken access controls in API endpoints while checking for hardcoded secrets in API configurations. Example:
"Scan API endpoints for broken access controls and detect any hardcoded secrets."
Model Testing Results:
In our test cases, the model produced the correct output in 99.25% of cases.
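The scoring method behind this figure is not described here. One plausible way to score NL-to-JSON output is exact-match accuracy on the parsed JSON, sketched below. The `exact_match_accuracy` helper and the sample data are hypothetical and do not reproduce the reported result.

```python
import json
from typing import List


def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions whose parsed JSON equals the reference JSON."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")

    correct = 0
    for predicted, expected in zip(predictions, references):
        expected_obj = json.loads(expected)  # references are assumed to be valid JSON
        try:
            predicted_obj = json.loads(predicted)
        except json.JSONDecodeError:
            continue  # malformed model output counts as incorrect
        # Dict comparison ignores key order; list order still matters.
        if predicted_obj == expected_obj:
            correct += 1
    return correct / len(references)


# Toy data (not from the real test set): three of four outputs match.
predictions = [
    '{"repo": "a", "scanners": ["pii"]}',
    '{"scanners": ["sast"], "repo": "b"}',
    "not json",
    '{"repo": "d", "scanners": ["sca"]}',
]
references = [
    '{"repo": "a", "scanners": ["pii"]}',
    '{"repo": "b", "scanners": ["sast"]}',
    '{"repo": "c", "scanners": ["iac"]}',
    '{"repo": "d", "scanners": ["sca"]}',
]
accuracy = exact_match_accuracy(predictions, references)
print(f"{accuracy:.2%}")  # prints 75.00%
```

Comparing parsed objects rather than raw strings avoids penalizing differences in whitespace or key order that do not change the meaning of the JSON.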
Conclusion
In this evaluation, we have successfully processed test cases for both individual and combination scanners, achieving varying levels of accuracy across different categories. However, there remain several questions related to both individual and combination test cases that were not fully processed by the model. Moving forward, our next steps will involve identifying and addressing these unprocessed questions. We plan to refine and enhance our model by incorporating these questions into our training data, allowing for more comprehensive coverage and improved performance across all categories. This continuous improvement process will ensure that the model is better equipped to handle diverse scenarios and achieve higher accuracy in future tests.