AI Scanner

AI-Powered Vulnerability Detection Model

Introduction

The AI Scanner is an advanced AI model designed to detect vulnerabilities in code snippets or entire files. Leveraging state-of-the-art machine learning techniques, the model analyzes provided code and generates structured reports detailing any security issues it identifies. Trained on a custom dataset of 100,000 files, Ai Scanner is capable of handling a wide range of programming languages and vulnerability types, making it a valuable tool for developers and security professionals.

Model Architecture

Ai Scanner is built upon the unsloth/Qwen2.5-Coder-3B-Instruct model, a variant of the Qwen2.5 language model fine-tuned for coding tasks. The base model is loaded with 4-bit quantization to optimize memory usage and computational efficiency. To specialize the model for vulnerability detection, we employ Low-Rank Adaptation (LoRA) with a rank of 512 on key modules, including:

  • q_proj, k_proj, v_proj, o_proj (projection layers in the attention mechanism)

  • gate_proj, up_proj, down_proj (MLP layers)

This parameter-efficient fine-tuning (PEFT) approach allows us to adapt the model effectively while minimizing resource requirements.

Training Data

The model was trained on a custom dataset comprising 100,000 filesEach training example consists of:

  • Source Code: The code to be analyzed.

  • Vulnerability Information: A list of vulnerability dictionaries (or an empty list if no vulnerabilities are present), containing details such as the type of vulnerability, its location, and a description.

During training, the data is formatted into prompts that simulate a conversation between a user providing code and an assistant generating a vulnerability report. The system message defines the assistant’s role as Securitron, an AI specialized in vulnerability detection.

Training Process

The training process involves fine-tuning the pre-trained model using the SFTTrainer from the TRL library. Key training parameters include:

  • Batch Size: 2 (with gradient accumulation steps of 1)

  • Learning Rate: 2e-4

  • Optimizer: AdamW with 8-bit precision

  • Epochs: 5

  • Warmup Steps: 5

  • LR Scheduler: Cosine annealing

  • Max Sequence Length: 8192 tokens

To focus the training on generating accurate vulnerability reports, the train_on_responses_only function is used, ensuring the model learns to produce the assistant’s response given the system and user messages. Gradient checkpointing is enabled via unsloth to manage memory usage during training.

Upon completion, the model is merged and uploaded to the Hugging Face hub under the name AquilaX-AI/Ai Scanner.

Inference

Ai Scanner provides flexible inference capabilities designed to support a variety of use cases, offering both batch and real-time processing options for analyzing source code and generating structured vulnerability reports in JSON format. Below is a detailed overview of how these inference options work:

Batch Processing:

  • Description: his option allows users to submit source code and receive a comprehensive vulnerability report in a single response. It is particularly well-suited for efficiently analyzing entire files or codebases.

  • Input Format: {"source_code": "<code_here>"}

  • Output Format: A JSON object containing the vulnerability report.

Real-Time Streaming

  • Description: This option streams the vulnerability report incrementally, delivering it token by token. It is ideal for applications requiring immediate feedback or for handling large responses in real time.

  • Input Format: {"source_code": "<code_here>"}

  • Output: A stream of text tokens forming the JSON report.

Prompt Construction and Processing

  • A system message is included to define the assistant’s role and set the context for the analysis.

  • The user’s source code is appended to this system message to create the full prompt.

  • The model then generates the vulnerability report based on this prompt, formatted as JSON.

Device-Agnostic Deployment

The Ai Scanner model is deployed in a flexible, device-agnostic manner, automatically leveraging available hardware to optimize performance:

  • CUDA: Utilized when a compatible GPU is available for accelerated processing.

  • MPS: Employed on Apple devices with Metal Performance Shaders support.

  • CPU: Used as a fallback or in environments without specialized hardware.

This adaptability ensures that Ai Scanner can seamlessly integrate into diverse workflows, such as automated CI/CD pipelines, interactive development environments, or standalone tools, providing consistent and efficient vulnerability analysis across different platforms and use cases.

Examples

Below is an example of using the /scan endpoint:

Input

{
  "swagger": "2.0",
  "info": {
    "title": "example",
    "version": "1.0.0"
  },
  "paths": {
    "/": {
      "get": {
        "operationId": "example",
        "summary": "example",
        "responses": {
          "200": {
            "description": "200 response"
          }
        },
        "parameters": [
          {
            "name": "limit2",
            "in": "body",
            "required": true,
            "schema": {
              "type": "object"
            }
          }
        ],
        "security": [
          {
            "api_key": []
          }
        ]
      }
    }
  },
  "securityDefinitions": {
    "petstore_auth": {
      "type": "oauth2",
      "authorizationUrl": "http://swagger.io/api/oauth/dialog",
      "flow": "implicit",
      "scopes": {
        "write:pets": "write",
        "read:pets": "read"
      }
    }
  }
}

Output

[
  {
    "code_snipped": "\"get\": {\n  \"operationId\": \"example\",\n  \"summary\": \"example\",\n  \"responses\": {\n    \"200\": {\n      \"description\": \"200 response\"\n    }\n  },\n  \"parameters\": [\n    {\n      \"name\": \"limit2\",\n      \"in\": \"body\",\n      \"required\": true,\n      \"schema\": {\n        \"type\": \"object\"\n      }\n    }\n  ],\n  \"security\": [{\"api_key\": []}]\n}",
    "cwe_id": "CWE-306",
    "description": "This endpoint references a security scheme 'api_key' that isn't defined in the securityDefinitions, potentially leaving it unprotected. An attacker could exploit this by accessing the GET / endpoint without authentication, exposing any data or functionality it returns. The issue stems from the undefined scheme in this spec; ensure all referenced security definitions are properly configured, like adding an 'api_key' definition with type 'apiKey'. Note that this analysis is based on the spec alone, so verify the actual implementation for full context.",
    "vuln_code": "\"security\": [{\"api_key\": []}]",
    "vulnerability": "Missing Authentication"
  }
]

This example demonstrates the model’s ability to identify a missing authentication vulnerability in an OpenAPI specification.

Performance

The model's structured vulnerability reports are designed to offer clear and actionable insights, making it a powerful tool for developers and security professionals. Future updates to Ai Scanner may include benchmark results on standard vulnerability detection datasets, offering a more comprehensive understanding of its capabilities. In the meantime, users can leverage the model's detailed analyses to enhance their code security practices and mitigate potential risks.

Limitations

While Ai Scanner excels at detecting vulnerabilities, it’s not a standalone solution and benefits from integration into a broader security strategy:

  • Coverage: Trained on 100k diverse files, it captures many vulnerabilities but may miss rare ones. Pair with other tools or manual review for full coverage.

  • Complexity: Handles simple to complex code, though highly intricate or obfuscated code might need extra scrutiny.

  • Generalization: Performs well across varied patterns, but unique structures may pose challenges. Updates will enhance its scope.

  • False Positives/Negatives: May occasionally misflag or miss issues. Expert review refines accuracy.

Future Work

Potential enhancements for Ai Scanner include:

  • Dataset Expansion: Adding more programming languages and vulnerability types to the training data.

  • Advanced Techniques: Implementing reinforcement learning from human feedback to improve report quality.

  • Remediation Suggestions: Enabling the model to suggest fixes alongside vulnerability reports.

  • Large Codebase Support: Enhancing the ability to analyze entire projects efficiently.


Credit on Engineering team: Suriya & Pachaiappan

Last updated

Was this helpful?