Review
An AI-powered Findings Review Model that significantly reduces noise and improves the accuracy of static analysis.
Static Application Security Testing (SAST) tools are essential for identifying vulnerabilities in source code. However, their reliance on deterministic rule-based engines often results in excessive noise, including a high number of false positives (FPs). These FPs, which may represent incorrect or non-critical issues, create inefficiencies in vulnerability management. AquilaX addresses this challenge by introducing an AI-powered Findings Review Model that significantly reduces noise and improves the accuracy of static analysis. By intelligently classifying findings as true positives (TPs) or false positives (FPs), AquilaX enables organizations to streamline vulnerability triage and focus on critical security risks.
Traditional SAST tools rely on static rules to identify vulnerabilities, which often fail to capture the complexity and logical intricacies of modern codebases. As a result, security teams are burdened with manually reviewing findings to distinguish between TPs and FPs, a process that is time-consuming, subjective, and prone to human error.
AquilaX tackles this inefficiency by introducing an AI-driven Findings Review Model that automates the classification of vulnerabilities. By leveraging anonymized data from real-world code reviews, the model significantly reduces noise, improves accuracy, and enables scalable vulnerability management. This white paper outlines the challenges of traditional SAST tools, AquilaX's innovative approach, and the technical implementation and results of the Findings Review Model.
1. Noise and False Positives
Static rules often fail to accurately assess complex logic in modern codebases.
High volumes of findings, many of which are false positives, create inefficiencies in vulnerability triage.
2. Human Dependence
Manual review of findings is time-consuming and requires significant domain expertise.
Subjectivity in human judgment can lead to inconsistent classification of vulnerabilities.
AquilaX adopts a novel approach to address the limitations of traditional SAST tools. Instead of modifying the scanners themselves, AquilaX focuses on improving the consumption of findings by automating the classification process. The Findings Review Model is trained on anonymized data derived from real-world code reviews, enabling it to classify vulnerabilities from any static scanner with high accuracy.
A) Data Collection:
Source code is scanned using over 30 parallel scanners grouped into 8 categories (e.g., 1st-party code scanning, 3rd-party library scanning, infrastructure scanning, and malware detection).
All findings are initially stored as unverified.
B) Human Review:
Security engineers review findings to classify them as FP or TP.
Anonymized review data is stored in relational databases specific to project groups.
C) Model Training:
The anonymized dataset is used to train our AI model.
The model learns to automate the classification process, creating a feedback loop for continuous improvement.
D) Automation:
The trained model reviews new findings and automates the FP/TP classification process.
E) Daily Model Retraining:
Every day, the model is retrained with new data from PostgreSQL to ensure continuous improvement and adaptability (a minimal sketch of this data pull follows).
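To make the retraining step concrete, the sketch below illustrates one plausible daily data pull. It assumes a PostgreSQL table named findings whose columns mirror the dataset fields described in the next section; the table name, column names, and connection handling are illustrative assumptions, not AquilaX's published schema.

```python
import psycopg2

def fetch_reviewed_findings(dsn: str) -> list[dict]:
    """Pull human-reviewed findings (TP/FP) from PostgreSQL for retraining."""
    query = """
        SELECT cwe_id, cwe_name, affected_line, partial_code,
               file_name, org_id, status
        FROM findings
        WHERE status IN ('TP', 'FP')  -- only human-verified rows are used
    """
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(query)
        rows = cur.fetchall()
    return [
        {
            "cwe_id": r[0], "cwe_name": r[1], "affected_line": r[2],
            "partial_code": r[3], "file_name": r[4], "org_id": r[5],
            "label": 1 if r[6] == "TP" else 0,  # TP -> 1, FP -> 0
        }
        for r in rows
    ]
```

Restricting the query to human-verified rows keeps the feedback loop trained only on reviewed classifications.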
The Findings Review Model is trained on data extracted from a PostgreSQL database, including the following fields:
CWE ID: Common Weakness Enumeration identifier.
CWE Name: Descriptive name of the identified vulnerability.
Affected Line: Specific code line or segment affected.
Partial Code: A snippet of the code where the issue was detected.
File Name: File containing the affected code.
Org ID: Unique identifier for the organization.
Status: Human-reviewed classification (TP or FP).
Label: Binary label assigned for training (TP: 1, FP: 0).
Sample data used for training is illustrated below:
| cwe_id | cwe_name | affected_line | partial_code | file_name | org_id | status |
| --- | --- | --- | --- | --- | --- | --- |
| CWE-693 | Protection Mechanism Failure | Incompatible License Found | '@syncfusion/ej2-popups' is non-standard. | package-lock.json | 6717514af741711f86f27de5 | TP |
| CWE-798 | Use of Hard-coded Credentials | AWS-ID: AAGAA**********AAAAA | //# sourceMappingURL=pdfmake.min.js.map\nthis... | datatables.bundle.js | 6717514af731711f86d27de5 | FP |
| CWE-353 | Missing Support for Integrity Check | | | list.html | 4717514af741711f86327de5 | TP |
| CWE-178 | Improper Handling of Case Sensitivity | Vite dev server option server.fs.deny can be... | vite 4.5.1 | package-lock.json | 6717514af741711f86d27de5 | TP |
| CWE-353 | Missing Support for Integrity Check | | | create-app.html | 6717514af741711f86d27e575 | TP |
Base Model: microsoft/graphcodebert-base
Labels: TP (1) and FP (0)
Tokenization: Custom prompts constructed using fields from the dataset, formatted for optimal model understanding.
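The exact prompt template is not specified here, so the sketch below shows one plausible way to assemble the dataset fields into a prompt and tokenize it for graphcodebert-base; the field ordering and separators are assumptions.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")

def build_prompt(finding: dict) -> str:
    # Concatenate the dataset fields into one text input for the model.
    return (
        f"CWE: {finding['cwe_id']} ({finding['cwe_name']}) | "
        f"File: {finding['file_name']} | "
        f"Affected line: {finding['affected_line']} | "
        f"Code: {finding['partial_code']}"
    )

# Example row drawn from the sample data above.
example_finding = {
    "cwe_id": "CWE-798",
    "cwe_name": "Use of Hard-coded Credentials",
    "file_name": "datatables.bundle.js",
    "affected_line": "AWS-ID: AAGAA**********AAAAA",
    "partial_code": "//# sourceMappingURL=pdfmake.min.js.map",
}

encoded = tokenizer(
    build_prompt(example_finding),
    truncation=True,        # GraphCodeBERT's input limit is 512 tokens
    max_length=512,
    padding="max_length",
    return_tensors="pt",
)
```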
AquilaX selected its AI model based on performance, scalability, and adaptability. After evaluating traditional classifiers (like SVMs and random forests) and advanced architectures (like LSTMs), a transformer-based architecture was chosen for its superior handling of complex code structures. The graphcodebert-base model was selected for its balance between accuracy and computational efficiency, enabling the processing of large findings volumes. This architecture supports diverse findings from multiple code scanning tools and categories, with continuous retraining ensuring scalability for evolving security needs. Hugging Face Transformers and PyTorch allowed custom tokenization strategies for tailored input prompts. This binary classification model streamlines vulnerability triage, prioritizes critical issues, and maintains data privacy compliance through anonymized training.
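In practice, the automated review step reduces to a single forward pass per finding. The minimal sketch below assumes a fine-tuned checkpoint saved locally; the path is hypothetical (see the training setup that follows).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/graphcodebert-base")
# Hypothetical path to a fine-tuned checkpoint produced by the training run.
model = AutoModelForSequenceClassification.from_pretrained(
    "./findings-review-model"
)
model.eval()

def classify_finding(prompt: str) -> str:
    """Return the automated review verdict for one finding."""
    inputs = tokenizer(prompt, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Label 1 = true positive, label 0 = false positive.
    return "TP" if logits.argmax(dim=-1).item() == 1 else "FP"
```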
Frameworks and Tools: Hugging Face Transformers, PyTorch.
Training Configuration:
The model is trained and validated using an 80/20 train-test split.
Training Time: Less than 10 minutes on an NVIDIA RTX 4090 GPU.
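A minimal fine-tuning sketch matching this configuration is shown below. It reuses `records` (from the data-pull sketch), `tokenizer`, and `build_prompt` (from the tokenization sketch); the epoch count, batch size, and learning rate are illustrative assumptions, not published hyperparameters.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# `records` is the list of labeled dicts from fetch_reviewed_findings().
raw = Dataset.from_list(records)

def encode(row):
    # Fixed-length padding so the default data collator can batch examples.
    enc = tokenizer(build_prompt(row), truncation=True,
                    padding="max_length", max_length=512)
    enc["label"] = row["label"]  # TP = 1, FP = 0
    return enc

ds = raw.map(encode, remove_columns=raw.column_names)
ds = ds.train_test_split(test_size=0.2)  # the 80/20 split described above

model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/graphcodebert-base", num_labels=2
)
args = TrainingArguments(
    output_dir="findings-review-model",
    num_train_epochs=3,               # illustrative hyperparameters
    per_device_train_batch_size=16,
    learning_rate=2e-5,
)
trainer = Trainer(model=model, args=args,
                  train_dataset=ds["train"], eval_dataset=ds["test"])
trainer.train()
trainer.save_model("findings-review-model")
```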
The AquilaX Findings Review Model has been rigorously evaluated to assess its performance in classifying findings as true positives (TPs) or false positives (FPs). The results demonstrate the model's exceptional ability to reduce noise and improve the efficiency of static analysis.
Accuracy: The model achieved an accuracy of 93.54% on the test dataset, indicating its strong capability to correctly classify findings.
Precision and Recall:
Precision (TPs): 91.48%, reflecting the model's ability to minimize false positives.
Recall (TPs): 96.98%, indicating the model's effectiveness in identifying true vulnerabilities.
F1 Score: The F1 score of 94.15% balances precision and recall, showcasing the model's robustness in handling imbalanced datasets.
False Positive Classification:
Precision (FPs): 96%, demonstrating the model's ability to correctly identify false positives.
Recall (FPs): 90%, indicating the model's effectiveness in filtering out non-critical findings.
Reduction in Manual Effort: By automating the classification process, the model significantly reduces the manual review workload, enabling security teams to focus on critical vulnerabilities.
Classification Report
The detailed classification report highlights the model's performance across both classes (False Positives and True Positives):
| Class | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| False Positive | 0.96 | 0.90 | 0.93 |
| True Positive | 0.91 | 0.97 | 0.94 |
| Accuracy | | | 0.94 |
| Macro Avg | 0.94 | 0.93 | 0.93 |
| Weighted Avg | 0.94 | 0.94 | 0.94 |
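A report in this format can be reproduced on the held-out split with scikit-learn, assuming the `trainer` and `ds` objects from the training sketch above.

```python
import numpy as np
from sklearn.metrics import classification_report

# Predictions on the 20% held-out split from the training sketch.
pred = trainer.predict(ds["test"])
y_true = pred.label_ids
y_pred = np.argmax(pred.predictions, axis=-1)

# target_names maps label 0 -> False Positive, label 1 -> True Positive.
print(classification_report(
    y_true, y_pred,
    target_names=["False Positive", "True Positive"],
))
```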
Performance Comparison
Baseline (Rule-Based Systems): Traditional rule-based systems typically achieve an accuracy of 65-70%, with a high false positive rate of 40-50%.
AquilaX Model: The AI-powered model outperforms rule-based systems, achieving 93.54% accuracy and significantly reducing false positives.
Limitations
The model's performance is dependent on the quality and diversity of the training data. Findings from niche or highly specialized codebases may require additional fine-tuning.
The current model focuses on static analysis findings and does not incorporate dynamic or runtime data, which could further enhance its accuracy.
The AquilaX Findings Review Model represents a significant advancement in static application security testing. By leveraging AI to automate the classification of findings, the model addresses the critical challenges of noise and false positives inherent in traditional rule-based systems. With an accuracy of 93.54%, precision of 91.48%, and recall of 96.98%, the model significantly outperforms traditional methods. By continuously retraining with new data and integrating feedback loops, AquilaX ensures that the system remains adaptive and effective in the face of evolving codebases and emerging vulnerabilities.
To further enhance the capabilities of the AquilaX Findings Review Model, the following areas are proposed for future development:
Integration with DAST and RASP: Expanding the training data to include findings from dynamic and runtime analysis will provide a more comprehensive view of application security.
Real-Time Monitoring: Implementing real-time monitoring and active learning will enable the model to adapt to new vulnerabilities and code patterns dynamically.
Multi-Modal Approaches: Combining code analysis with behavioral data and runtime insights will further improve the model's accuracy and applicability.
Cross-Language Support: Extending the model's capabilities to support additional programming languages and frameworks will broaden its utility.
We extend our gratitude to the AquilaX engineering and security teams for their dedication and contributions to this project. Special thanks to the Hugging Face community for their invaluable support and resources, which played a crucial role in the development and training of the AI model.