NL-JSON Model

The NL-JSON model is designed to simplify the interaction between users and complex systems by translating natural language commands into structured JSON format.

Hugging Face (Model): https://huggingface.co/AquilaX-AI/NL-JSON-Start-Scan

GitHub (Code): https://github.com/AquilaX-AI/NL-JSON-Start-Scan

What is the model made for?

The NL-JSON model (also known as AquilaX AI Ask) translates natural language commands into structured JSON, simplifying the interaction between users and complex systems. Its primary purpose is to automate workflows such as security scanning, code analysis, and other technical tasks. By understanding user intent expressed in everyday language, it generates machine-readable JSON instructions, enabling non-technical users to operate advanced tools efficiently. This enhances usability, scalability, and integration across platforms, making the model well suited for environments where non-experts need to drive technical workflows.

What is NL -> JSON?

The Natural Language to JSON (NL-JSON) component translates human-readable text into a structured JSON format. This allows users to input commands in plain English, which the system then converts into machine-readable JSON. The generated JSON can then be fed into backend systems to perform specific tasks. This ensures accurate interpretation of user requests and seamless execution by the backend.
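
For illustration, a request such as "Run a secrets scan on https://github.com/AquilaX-AI/vulnapp" might be translated into a JSON object along these lines (the field names here are purely illustrative; the actual schema is defined by the AquilaX backend):

{"action": "start_scan", "repo": "https://github.com/AquilaX-AI/vulnapp", "scanners": ["secrets"]}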

What problem does it resolve?

The NL-JSON model resolves the problem of translating human instructions into machine-executable commands, particularly in complex systems like security scanners or code analyzers. Traditionally, users needed to understand technical configurations and syntax to interact with these systems. The NL-JSON model eliminates this barrier by allowing users to input natural language commands, which it then converts into JSON format, automating processes and reducing the need for technical expertise. This simplifies user interaction, reduces errors, and enhances productivity in tasks like vulnerability scanning or code analysis.

Model Training

The training process fine-tunes a sequence-to-sequence (Seq2Seq) model using the Hugging Face Transformers library. The model is optimized for natural language generation tasks, and this section summarizes the setup and parameters used for training.

Data Collation

The DataCollatorForSeq2Seq was used to handle dynamic padding during training and evaluation, ensuring input sequences are properly padded for batch processing.
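
As a minimal sketch of this behavior (using the data_collator constructed in the training script below):

# Two examples of different lengths: the collator pads input_ids with the
# tokenizer's pad token and pads labels with -100 so the loss ignores them.
features = [
    {"input_ids": [101, 2054], "labels": [7, 8, 9]},
    {"input_ids": [101, 2054, 2003, 1037], "labels": [7]},
]
batch = data_collator(features)
print(batch["input_ids"].shape)  # padded to the longest input in the batch
print(batch["labels"])           # shorter label sequences padded with -100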

Optimizer

AdamW (`adamw_torch`) was chosen as the optimizer, with the learning rate initialized at 3e-5.

Training Process

A Seq2SeqTrainer was used for managing the training process. The training involved:

  • Model: The custom Seq2Seq model.

  • Datasets: Tokenized datasets for training and evaluation.

  • Metrics: A custom `compute_metrics` function evaluated the model's performance after each epoch.

Training Script

train.py
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq
from datasets import load_dataset, Dataset
import pandas as pd
import nltk
import evaluate
import numpy as np
import os
from dotenv import load_dotenv

load_dotenv()

# Check if GPU is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("AquilaX-AI/NL-JSON-Start-Scan")
model = AutoModelForSeq2SeqLM.from_pretrained("AquilaX-AI/NL-JSON-Start-Scan").to(device)

# Save the previous model version to the Hub before retraining
print("Saving previous model on the Hub...")
model.push_to_hub(repo_id="AquilaX-AI/Last-NL-JSON-Start-Scan", token=os.getenv('HF_WRITE'))
tokenizer.push_to_hub(repo_id="AquilaX-AI/Last-NL-JSON-Start-Scan", token=os.getenv('HF_WRITE'))

# Add '{' and '}' as special tokens so the tokenizer can represent JSON braces
special_tokens_dict = {'additional_special_tokens': ['{', '}']}
num_added_toks = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added_toks} special tokens")

# Load the dataset
print("Loading dataset...")
data = load_dataset('csv', data_files='final-11.csv')
df = pd.DataFrame(data['train'])

# Shuffle and preprocess dataframe
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
df['questions'] = df['questions'].str.replace(',', '', regex=False).str.lower()
df['json'] = df['json'].str.lower()

# Convert back to Dataset
dataset = Dataset.from_pandas(df)
split_data = dataset.train_test_split(test_size=0.2)
print("Dataset loaded and split into train and test sets")

# Preprocessing function
prefix = "Translate the following text to JSON: "

def preprocess_function(examples):
    inputs = [prefix + doc for doc in examples["questions"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True)
    labels = tokenizer(text_target=examples["json"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# Apply preprocessing
print("Tokenizing dataset...")
tokenized_dataset = split_data.map(preprocess_function, batched=True)
print("Tokenization complete")

# Load ROUGE metric
nltk.download("punkt", quiet=True)
metric = evaluate.load("rouge")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # Replace -100 (positions ignored by the loss) with the pad token id so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # rougeLsum expects sentences separated by newlines
    decoded_preds = ["\n".join(nltk.sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(nltk.sent_tokenize(label.strip())) for label in decoded_labels]
    result = metric.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
    print("ROUGE scores calculated")
    return result

# Data collator
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

# Global Parameters
L_RATE = 3e-5
BATCH_SIZE = 4
PER_DEVICE_EVAL_BATCH = 4
WEIGHT_DECAY = 0.01
SAVE_TOTAL_LIM = 1
NUM_EPOCHS = 4

# Set up training arguments
training_args = Seq2SeqTrainingArguments(
   output_dir="flag-model",
   evaluation_strategy="epoch",
   learning_rate=L_RATE,
   optim="adamw_torch",  # make the optimizer explicit (adamw_torch, as noted above)
   per_device_train_batch_size=BATCH_SIZE,
   per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH,
   gradient_accumulation_steps=4,
   weight_decay=WEIGHT_DECAY,
   lr_scheduler_type="cosine",
   warmup_ratio=0.1,  # ignored here: warmup_steps below takes precedence when both are set
   save_steps=1000000,  # effectively disables intermediate checkpoint saves
   warmup_steps=100,
   seed=42,
   logging_steps=100,
   max_grad_norm=1.0,
   fp16=True, 
   save_total_limit=SAVE_TOTAL_LIM,
   num_train_epochs=NUM_EPOCHS,
   dataloader_num_workers=4,
   label_smoothing_factor=0.1,
   predict_with_generate=True,  
   adam_beta1=0.9,  # AdamW beta1 parameter
   adam_beta2=0.999,  # AdamW beta2 parameter
   adam_epsilon=1e-8,  # AdamW epsilon parameter
   push_to_hub=True,
   hub_model_id="AquilaX-AI/NL-JSON-Start-Scan",
   hub_token=os.getenv('HF_WRITE'),
)

# Initialize the Trainer
trainer = Seq2SeqTrainer(
   model=model,
   args=training_args,
   train_dataset=tokenized_dataset['train'],
   eval_dataset=tokenized_dataset["test"],
   tokenizer=tokenizer,
   data_collator=data_collator,
   compute_metrics=compute_metrics,
)

# Start training
print("Starting training...")
trainer.train()
print("Training complete")

# push_to_hub=True only uploads at checkpoint saves, which the large save_steps
# value above effectively disables, so push the final model explicitly.
trainer.push_to_hub()
print("New model pushed to the Hugging Face Hub.")

requirements.txt
torch
nltk
transformers[torch]
tokenizers
evaluate
rouge_score
sentencepiece
huggingface_hub
datasets==2.16.0
pandas
python-dotenv
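
These dependencies can be installed with pip:

pip install -r requirements.txt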

Conclusion

This setup aims to optimize the model's performance through proper data collation, warmup steps, gradient accumulation (a per-device batch size of 4 with 4 accumulation steps, for an effective batch size of 16), and cosine learning rate scheduling. The chosen parameters, including FP16 mixed precision and label smoothing, are intended to improve model quality while keeping GPU usage efficient.

How is it used in AquilaX?

In AquilaX, we leverage the Natural Language to JSON (NL-JSON) model to convert user-provided scan instructions, typically written in plain English, into a structured JSON format. This transformation allows our backend systems to interpret and act on these commands accurately. For example, a user might request "Scan the network for vulnerabilities," and the model translates it into a specific JSON object that the backend understands. This JSON is then processed by the AquilaX system to execute the requested security scans or related tasks. By automating this conversion, we ensure seamless communication between user inputs and backend operations.

Natural Language Command Parsing

The NL-JSON model allows users to input queries or commands in natural language, such as "Execute a secrets scan on the repo https://github.com/AquilaX-AI/vulnapp" or "Execute a security scan on https://github.com/AquilaX-AI/vulnapp to uncover PII." The model then interprets these commands and translates them into structured JSON data that the various scanners (e.g., SAST, Secrets, PII, SCA, IaC, Container, Malware, API) can understand.
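
The snippet below is a minimal inference sketch against the published checkpoint. It mirrors the training-time preprocessing from the script above (the same prefix, lowercasing, and comma stripping); the generation settings are assumptions, not tuned values.

import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

device = "cuda:0" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("AquilaX-AI/NL-JSON-Start-Scan")
model = AutoModelForSeq2SeqLM.from_pretrained("AquilaX-AI/NL-JSON-Start-Scan").to(device)

query = "Execute a secrets scan on the repo https://github.com/AquilaX-AI/vulnapp"
prompt = "Translate the following text to JSON: " + query.replace(",", "").lower()
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128).to(device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# '{' and '}' were registered as special tokens during training, so
# skip_special_tokens=True would strip them; decode everything and trim instead.
text = tokenizer.decode(output_ids[0], skip_special_tokens=False)
print(text.replace(tokenizer.pad_token, "").replace(tokenizer.eos_token, "").strip())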

Command Execution via JSON

Once the natural language input is parsed into JSON, the AquilaX system processes it. Each command is associated with specific parameters or actions in the JSON format. This structured JSON can trigger the appropriate scanners (e.g., for secrets, SAST, or PII) and provide necessary configurations.
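
A hypothetical dispatch sketch is shown below; the field names are illustrative, not the real AquilaX schema.

import json

# JSON as generated by the NL-JSON model (illustrative schema)
generated = '{"action": "start_scan", "repo": "https://github.com/AquilaX-AI/vulnapp", "scanners": ["secrets", "pii"]}'
command = json.loads(generated)

# Route each requested scanner to the matching backend action.
for scanner in command["scanners"]:
    print(f"Triggering {scanner} scan on {command['repo']}")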

Dataset Size

The dataset includes over 2,000 test samples covering various security scanning tasks, such as detecting secrets and personally identifiable information (PII) and performing Static Application Security Testing (SAST) on the GitHub repository AquilaX-AI/vulnapp.

The model testing scenarios include combinations of tasks like secrets detection, PII identification, SAST (Static Application Security Testing), SCA (Software Composition Analysis), IaC (Infrastructure as Code), container security, and API security.

Individual cases

Types of Tests: The dataset tests the model across multiple domains:

  • Secrets Detection: Scanning the repository for hardcoded secrets, API keys, credentials, and other sensitive information.

  • PII Identification: Detecting personal data such as email addresses, social security numbers, and credit card details.

  • SAST (Static Application Security Testing): Identifying vulnerabilities such as SQL injection, cross-site scripting (XSS), and buffer overflows in the codebase.

  • SCA (Software Composition Analysis): Analyzing third-party dependencies and libraries for known vulnerabilities and outdated packages.

  • IaC (Infrastructure as Code): Scanning infrastructure configuration files for misconfigurations and security risks.

  • Container Security: Analyzing Docker configurations for vulnerabilities.

  • API Security: Checking for vulnerabilities in API endpoints, such as improper access controls and rate-limiting issues.

Combined cases

Queries in this category combine two scanner types in a single request.

Secrets + PII

Combines the detection of hardcoded secrets (e.g., API keys, passwords) with PII scanning (e.g., social security numbers, email addresses). Examples:

  • "Find AWS credentials and detect any personally identifiable information in the repo."

  • "Look for email addresses and any hardcoded passwords."

Secrets + SAST

Involves scanning for hardcoded secrets while simultaneously performing static code analysis to detect security vulnerabilities. Example:

  • "Detect hardcoded secrets and perform static analysis for vulnerabilities."

PII + SAST

Combines PII identification with static analysis to uncover potential security flaws such as SQL injection or insecure deserialization. Example:

  • "Find social security numbers and scan for SQL injection vulnerabilities."

SAST + SCA

Pairs static analysis for code vulnerabilities with software composition analysis to detect outdated dependencies and known CVEs (Common Vulnerabilities and Exposures). Example:

  • "Perform static analysis for vulnerabilities and identify outdated dependencies."

SCA + IaC

Combines scans for vulnerabilities in third-party dependencies with Infrastructure as Code analysis for misconfigurations (e.g., Terraform, Kubernetes files). Example:

  • "Check for known CVEs and scan for misconfigured cloud services."

API + Secrets

Focuses on scanning for broken access controls in API endpoints while checking for hardcoded secrets in API configurations. Example:

  • "Scan API endpoints for broken access controls and detect any hardcoded secrets."

Model Testing Results

Across our test cases, the model produced the correct output in 99.25% of cases.

Conclusion

In this evaluation, we have successfully processed test cases for both individual and combination scanners, achieving varying levels of accuracy across different categories. However, there remain several questions related to both individual and combination test cases that were not fully processed by the model. Moving forward, our next steps will involve identifying and addressing these unprocessed questions. We plan to refine and enhance our model by incorporating these questions into our training data, allowing for more comprehensive coverage and improved performance across all categories. This continuous improvement process will ensure that the model is better equipped to handle diverse scenarios and achieve higher accuracy in future tests.
