Review
An AI-powered Findings Review Model that significantly reduces noise and improves the accuracy of static analysis.
Static Application Security Testing (SAST) tools are essential for identifying vulnerabilities in source code. However, their reliance on deterministic rule-based engines often results in excessive noise, including a high number of false positives (FPs). These FPs, which may represent incorrect or non-critical issues, create inefficiencies in vulnerability management. AquilaX addresses this challenge by introducing an AI-powered Findings Review Model that significantly reduces noise and improves the accuracy of static analysis. By intelligently classifying findings as true positives (TPs) or false positives (FPs), AquilaX enables organizations to streamline vulnerability triage and focus on critical security risks.
Traditional SAST tools rely on static rules to identify vulnerabilities, which often fail to capture the complexity and logical intricacies of modern codebases. As a result, security teams are burdened with manually reviewing findings to distinguish between TPs and FPs, a process that is time-consuming, subjective, and prone to human error.
AquilaX tackles this inefficiency by introducing an AI-driven Findings Review Model that automates the classification of vulnerabilities. By leveraging anonymized data from real-world code reviews, the model significantly reduces noise, improves accuracy, and enables scalable vulnerability management. This white paper outlines the challenges of traditional SAST tools, AquilaX's innovative approach, and the technical implementation and results of the Findings Review Model.
1. Noise and False Positives
Static rules often fail to accurately assess complex logic in modern codebases.
High volumes of findings, many of which are false positives, create inefficiencies in vulnerability triage.
2. Human Dependence
Manual review of findings is time-consuming and requires significant domain expertise.
Subjectivity in human judgment can lead to inconsistent classification of vulnerabilities.
AquilaX adopts a novel approach to address the limitations of traditional SAST tools. Instead of modifying the scanners themselves, AquilaX focuses on improving the consumption of findings by automating the classification process. The Findings Review Model is trained on anonymized data derived from real-world code reviews, enabling it to classify vulnerabilities from any static scanner with high accuracy.
A) Data Collection:
Source code is scanned using over 30 parallel scanners grouped into 8 categories (e.g., 1st-party code scanning, 3rd-party library scanning, infrastructure scanning, and malware detection).
All findings are initially stored as unverified.
B) Human Review:
Security engineers review findings to classify them as FP or TP.
Anonymized review data is stored in relational databases specific to project groups.
C) Model Training:
The anonymized dataset is used to train our AI model.
The model learns to automate the classification process, creating a feedback loop for continuous improvement.
D) Automation:
The trained model reviews new findings and automates the FP/TP classification process.
E) Daily Model Retraining:
Every day, the model is retrained with new data from PostgreSQL to ensure continuous improvement and adaptability (a minimal sketch of this data pull follows).
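To make the retraining step concrete, the sketch below illustrates one plausible daily data pull. It assumes a PostgreSQL table named findings whose columns mirror the dataset fields described in the next section; the table name, column names, and connection handling are illustrative assumptions, not AquilaX's published schema.

```python
import psycopg2

def fetch_reviewed_findings(dsn: str) -> list[dict]:
    """Pull human-reviewed findings (TP/FP) from PostgreSQL for retraining."""
    query = """
        SELECT cwe_id, cwe_name, affected_line, partial_code,
               file_name, org_id, status
        FROM findings
        WHERE status IN ('TP', 'FP')  -- only human-verified rows are used
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        rows = cur.fetchall()
    return [
        {
            "cwe_id": r[0], "cwe_name": r[1], "affected_line": r[2],
            "partial_code": r[3], "file_name": r[4], "org_id": r[5],
            "label": 1 if r[6] == "TP" else 0,  # TP -> 1, FP -> 0
        }
        for r in rows
    ]
```

Restricting the query to human-verified rows keeps the feedback loop trained only on reviewed classifications.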
The Findings Review Model is trained on data extracted from a PostgreSQL database, including the following fields:
CWE ID: Common Weakness Enumeration identifier.
CWE Name: Descriptive name of the identified vulnerability.
Affected Line: Specific code line or segment affected.
Partial Code: A snippet of the code where the issue was detected.
File Name: File containing the affected code.
Org ID: Unique identifier for the organization.
Status: Human-reviewed classification (TP or FP).
Label: Binary label assigned for training (TP: 1, FP: 0).
Sample data used for training is illustrated below:
| cwe_id | cwe_name | affected_line | partial_code | file_name | org_id | status |
| --- | --- | --- | --- | --- | --- | --- |
| CWE-693 | Protection Mechanism Failure | Incompatible License Found | '@syncfusion/ej2-popups' is non-standard. | package-lock.json | 6717514af741711f86f27de5 | TP |
| CWE-798 | Use of Hard-coded Credentials | AWS-ID: AAGAA**********AAAAA | //# sourceMappingURL=pdfmake.min.js.map\nthis... | datatables.bundle.js | 6717514af731711f86d27de5 | FP |
| CWE-353 | Missing Support for Integrity Check | | | list.html | 4717514af741711f86327de5 | TP |
| CWE-178 | Improper Handling of Case Sensitivity | Vite dev server option server.fs.deny can be... | vite 4.5.1 | package-lock.json | 6717514af741711f86d27de5 | TP |
| CWE-353 | Missing Support for Integrity Check | | | create-app.html | 6717514af741711f86d27e575 | TP |
Base Model: microsoft/graphcodebert-base
Labels: TP (1) and FP (0)
Tokenization: Custom prompts constructed using fields from the dataset, formatted for optimal model understanding.
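The exact prompt template is not specified here, so the sketch below shows one plausible way to assemble the dataset fields into a prompt and tokenize it for graphcodebert-base; the field ordering and separators are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

def build_prompt(finding: dict) -> str:
    # Concatenate the dataset fields into one text input for the model.
    return (
        f"CWE: {finding['cwe_id']} ({finding['cwe_name']}) | "
        f"File: {finding['file_name']} | "
        f"Affected line: {finding['affected_line']} | "
        f"Code: {finding['partial_code']}"
    )

# Example row drawn from the sample data above.
example_finding = {
    "cwe_id": "CWE-798",
    "cwe_name": "Use of Hard-coded Credentials",
    "file_name": "datatables.bundle.js",
    "affected_line": "AWS-ID: AAGAA**********AAAAA",
    "partial_code": "//# sourceMappingURL=pdfmake.min.js.map",
}

encoded = tokenizer(
    build_prompt(example_finding),
    truncation=True,        # GraphCodeBERT's input limit is 512 tokens
    max_length=512,
    padding="max_length",
    return_tensors="pt",
)
```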
AquilaX selected its AI model based on performance, scalability, and adaptability. After evaluating traditional classifiers (like SVMs and random forests) and advanced architectures (like LSTMs), a transformer-based architecture was chosen for its superior handling of complex code structures. The graphcodebert-base model was selected for its balance between accuracy and computational efficiency, enabling the processing of large findings volumes. This architecture supports diverse findings from multiple code scanning tools and categories, with continuous retraining ensuring scalability for evolving security needs. Hugging Face Transformers and PyTorch allowed custom tokenization strategies for tailored input prompts. This binary classification model streamlines vulnerability triage, prioritizes critical issues, and maintains data privacy compliance through anonymized training.
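In practice, the automated review step reduces to a single forward pass per finding. The minimal sketch below assumes a fine-tuned checkpoint saved locally; the path is hypothetical (see the training setup that follows).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
# Hypothetical path to a fine-tuned checkpoint produced by the training run.
model = AutoModelForSequenceClassification.from_pretrained(
    "./findings-review-model"
)
model.eval()

def classify_finding(prompt: str) -> str:
    """Return the automated review verdict for one finding."""
    inputs = tokenizer(prompt, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Label 1 = true positive, label 0 = false positive.
    return "TP" if logits.argmax(dim=-1).item() == 1 else "FP"
```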
Frameworks and Tools: Hugging Face Transformers, PyTorch.
Training Configuration:
The model is trained and validated using an 80/20 train-test split.
Training Time: Less than 10 minutes on an NVIDIA RTX 4090 GPU.
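A minimal fine-tuning sketch matching this configuration is shown below. It reuses `records` (from the data-pull sketch), `tokenizer`, and `build_prompt` (from the tokenization sketch); the epoch count, batch size, and learning rate are illustrative assumptions, not published hyperparameters.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# `records` is the list of labeled dicts from fetch_reviewed_findings().
raw = Dataset.from_list(records)

def encode(row):
    # Fixed-length padding so the default data collator can batch examples.
    enc = tokenizer(build_prompt(row), truncation=True,
                    padding="max_length", max_length=512)
    enc["label"] = row["label"]  # TP = 1, FP = 0
    return enc

ds = raw.map(encode, remove_columns=raw.column_names)
ds = ds.train_test_split(test_size=0.2)  # the 80/20 split described above

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/graphcodebert-base", num_labels=2
)
args = TrainingArguments(
    output_dir="findings-review-model",
    num_train_epochs=3,               # illustrative hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
trainer.save_model("findings-review-model")
```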
The AquilaX Findings Review Model has been rigorously evaluated to assess its performance in classifying findings as true positives (TPs) or false positives (FPs). The results demonstrate the model's exceptional ability to reduce noise and improve the efficiency of static analysis.
Accuracy: The model achieved an accuracy of 93.54% on the test dataset, indicating its strong capability to correctly classify findings.
Precision and Recall:
Precision (TPs): 91.48%, reflecting the model's ability to minimize false positives.
Recall (TPs): 96.98%, indicating the model's effectiveness in identifying true vulnerabilities.
F1 Score: The F1 score of 94.15% balances precision and recall, showcasing the model's robustness in handling imbalanced datasets.
False Positive Classification:
Precision (FPs): 96%, demonstrating the model's ability to correctly identify false positives.
Recall (FPs): 90%, indicating the model's effectiveness in filtering out non-critical findings.
Reduction in Manual Effort: By automating the classification process, the model significantly reduces the manual review workload, enabling security teams to focus on critical vulnerabilities.
Classification Report
The detailed classification report highlights the model's performance across both classes (False Positives and True Positives):
| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| False Positive | 0.96 | 0.90 | 0.93 |
| True Positive | 0.91 | 0.97 | 0.94 |
| Accuracy | | | 0.94 |
| Macro Avg | 0.94 | 0.93 | 0.93 |
| Weighted Avg | 0.94 | 0.94 | 0.94 |
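A report in this format can be reproduced on the held-out split with scikit-learn, assuming the `trainer` and `ds` objects from the training sketch above.

```python
import numpy as np
from sklearn.metrics import classification_report

# Predictions on the 20% held-out split from the training sketch.
pred = trainer.predict(ds["test"])
y_true = pred.label_ids
y_pred = np.argmax(pred.predictions, axis=-1)

# target_names maps label 0 -> False Positive, label 1 -> True Positive.
print(classification_report(
    y_true, y_pred,
    target_names=["False Positive", "True Positive"],
))
```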
Performance Comparison
Baseline (Rule-Based Systems): Traditional rule-based systems typically achieve an accuracy of 65-70%, with a high false positive rate of 40-50%.
AquilaX Model: The AI-powered model outperforms rule-based systems, achieving 93.54% accuracy and significantly reducing false positives.
Limitations
The model's performance is dependent on the quality and diversity of the training data. Findings from niche or highly specialized codebases may require additional fine-tuning.
The current model focuses on static analysis findings and does not incorporate dynamic or runtime data, which could further enhance its accuracy.
The AquilaX Findings Review Model represents a significant advancement in static application security testing. By leveraging AI to automate the classification of findings, the model addresses the critical challenges of noise and false positives inherent in traditional rule-based systems. With an accuracy of 93.54%, precision of 91.48%, and recall of 96.98%, the model significantly outperforms traditional methods. By continuously retraining with new data and integrating feedback loops, AquilaX ensures that the system remains adaptive and effective in the face of evolving codebases and emerging vulnerabilities.
To further enhance the capabilities of the AquilaX Findings Review Model, the following areas are proposed for future development:
Integration with DAST and RASP: Expanding the training data to include findings from dynamic and runtime analysis will provide a more comprehensive view of application security.
Real-Time Monitoring: Implementing real-time monitoring and active learning will enable the model to adapt to new vulnerabilities and code patterns dynamically.
Multi-Modal Approaches: Combining code analysis with behavioral data and runtime insights will further improve the model's accuracy and applicability.
Cross-Language Support: Extending the model's capabilities to support additional programming languages and frameworks will broaden its utility.
We extend our gratitude to the AquilaX engineering and security teams for their dedication and contributions to this project. Special thanks to the Hugging Face community for their invaluable support and resources, which played a crucial role in the development and training of the AI model.