TL;DR #
Writing Kusto Query Language (KQL) queries to analyze security logs can be challenging, especially for non-technical users or those unfamiliar with its syntax.
In this blog, we explore how Vega Security leverages AI—specifically large language models (LLMs), text embeddings, and vector similarity search—to enable seamless translation from natural language (NL) to KQL.
We’ll discuss our approach, the challenges we encountered, and the impact of this technology on cybersecurity operations.
If you’re a security analyst, threat hunter, or detection engineer looking to simplify data analysis in any KQL-based environment, this blog post is for you.
Introduction #
Breaking Down Barriers #
KQL is a powerful query language used in cybersecurity platforms for searching, analyzing, and visualizing log data. Vega Application relies on KQL as its primary querying language.
However, writing KQL queries requires technical expertise, and many security analysts struggle with its syntax, leading to inefficiencies in data exploration.
Let’s consider these examples:
- “Show all failed login attempts from the last 24 hours.”

  KQL Equivalent:

  Events
  | where EventID == 4625
  | where TimeGenerated > ago(24h)
  | project Account, Computer, TimeGenerated
- “Find all network connections to external IPs in the last 7 days.”

  KQL Equivalent:

  NetworkConnections
  | where DestinationIPType == "Public"
  | where TimeGenerated > ago(7d)
For analysts without KQL expertise, writing such queries manually is time-consuming. Wouldn’t it be awesome if we could simply write in plain English to generate these queries?
That’s where Natural Language to KQL (NL2KQL) translation comes in—empowering users to query their data effortlessly.
Key Challenges of NL2KQL #
Ambiguity of Natural Language #
- Example: “Show login failures.” Should it return all failures or only those from a specific source?
- Solution: We use clarifying prompts and suggest multiple interpretations to ensure accuracy.
Ensuring Syntactic Accuracy #
- The generated KQL must be syntactically correct for the KQL-based search engine.
- The generation process must account for any syntax constraints specific to the search engine.
Ensuring Contextual Accuracy #
- The generated KQL must not only be valid but also contextually relevant to the logs, ensuring meaningful and useful query generation.
Our Approach #
At Vega, we aim to build a robust, real-time system that converts natural language queries into accurate KQL queries while ensuring high precision and correctness. Our system considers the specific security logs of a user and is designed to minimize latency.
To achieve these goals, we have adopted an architecture built around two main phases:
- Online Generation Workflow: When a user provides a natural language query, we use large language models (LLMs) with retrieval-augmented generation (RAG) to translate it into a KQL query.
- Offline Data Preparation Workflow: For each tenant and data source, security logs are processed to create an optimal setup for high-precision, low-latency real-time NL2KQL translation.
The Online Generation Phase #
Creating the Ideal Prompt #
Translating a natural language request into a query for structured, relational databases (e.g., MySQL, Postgres) is a well-researched task, with numerous solutions proposed to address it.
Most of these solutions leverage large language models (LLMs) by providing relevant context within the prompt, including the database schema, field formats, and data types.
However, logs are unstructured and often lack a predefined schema, making this task significantly more challenging.
In many cases, fields are not pre-defined—nor are their types or potential values.
In real-world scenarios, some data sources can become extremely complex; for example, AWS CloudTrail contains more than 1,000 unique fields!
Not to mention that some fields may have an unbounded number of possible values.
To generate precise queries, we must gather the most relevant context about the queried data source. How can we achieve this?
Multiple RAG workflows to the rescue!
Retrieval-Augmented Generation (RAG) is a technique that retrieves relevant data and provides it to the LLM, enriching it with crucial context.
In our case, we aim to retrieve the most relevant fields within the data source and identify their most meaningful values.
This process is executed in real time for each query.
Here’s how it works:
Vector Similarity Search #
To retrieve the most similar items, we perform a similarity search using the cosine similarity metric over a vector database that stores text embeddings representing the items. This retrieval process is illustrated in the Online Generation Workflow chart—here’s a closer look:
- Retrieving Relevant Fields

  - We identify the most relevant fields in the logs by searching for the most similar generated field descriptions (generated? We’ll get to that later).
  - To tune retrieval, we evaluated against our ground-truth NL2KQL pairs, making sure we retrieve enough fields to cover those needed for generation while keeping the context as lean as possible (extra context increases latency).
- Retrieving Relevant Observed Values

  - Logs often contain dynamic values (e.g., specific usernames, IP addresses).
  - We use similarity search to inject the most relevant actual values from the tenant’s logs into the query.
  - Providing actual values makes it possible for the LLM to generate precise KQL queries with filters that will actually work on the tenant’s data!
- Finding Similar Queries for Reference

  - Users often repeat similar queries with slight modifications.
  - By referencing past ground-truth queries, we improve response accuracy.
  - KQL queries that users were satisfied with are stored in our vector database for continuous learning and improvement.
We index and search the data using HNSW, a graph-based approximate nearest-neighbor algorithm, to optimize real-time querying while maintaining a recall-oriented approach.
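To make the retrieval step concrete, here is a minimal Python sketch. It embeds the user’s request once and uses exact cosine similarity over in-memory matrices for clarity (our production setup relies on the HNSW index mentioned above); every function name, variable name, and k value here is illustrative rather than taken from our codebase.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, matrix: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k rows of `matrix` most similar to `query_vec`."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(-scores)[:k]

def build_retrieval_context(nl_query: str, embed, field_index, value_index, example_index):
    """Gather the three kinds of context used to ground KQL generation.

    `embed` is any text-embedding function; each *_index is a tuple of
    (embedding_matrix, items) built during the offline phase.
    """
    q_vec = embed(nl_query)

    fields = [field_index[1][i] for i in cosine_top_k(q_vec, field_index[0], k=15)]
    values = [value_index[1][i] for i in cosine_top_k(q_vec, value_index[0], k=20)]
    examples = [example_index[1][i] for i in cosine_top_k(q_vec, example_index[0], k=3)]

    # Retrieved fields and values feed the schema refiner; the example
    # NL2KQL pairs are added to the prompt as few-shot references.
    return {"fields": fields, "values": values, "examples": examples}
```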
All of the retrieved data is then used to refine the generated schema.
The Schema Refiner #
Although we generate a log schema for each data source, it is often highly complex, nested, and cumbersome.
Using the raw schema as context in the LLM prompt increases response time and can lead to hallucinations.
To address this, we developed a schema refining process, inspired by recent academic research from Microsoft USA (available here).
This process leverages the relevant fields and values retrieved during the query generation phase to construct a refined schema—containing only the most relevant fields and possible values.
By refining the schema, we achieved low-latency generation for real-time user interactions while significantly reducing the likelihood of hallucinations.
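As a minimal sketch of the idea, assuming a flat mapping from field names to their schema entries and the values retrieved in the previous step (the real refiner also handles nested schemas and additional metadata; all names here are illustrative):

```python
def refine_schema(full_schema: dict, relevant_fields: list[str],
                  relevant_values: dict[str, list[str]], max_values: int = 10) -> dict:
    """Keep only the retrieved fields and attach a few observed values as examples."""
    refined = {}
    for field in relevant_fields:
        if field not in full_schema:
            continue  # Skip fields that were retrieved but are absent from this data source.
        entry = dict(full_schema[field])  # e.g. {"type": "string"}
        examples = relevant_values.get(field, [])[:max_values]
        if examples:
            entry["examples"] = examples  # Concrete values help the LLM build working filters.
        refined[field] = entry
    return refined
```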
Finding the Right LLM for the Task #
Even with a highly optimized prompt that includes the most relevant context, selecting the right LLM is a crucial part of the process.
Through our research and testing, we found that Claude Sonnet 3.5 by Anthropic delivered the best results for our needs.
This model strikes a balance between high performance and reasonable latency, making it an ideal choice for real-time query generation.
That said, our approach is designed to be LLM-agnostic, meaning it can be easily adapted to a better model in the future with minimal effort.
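For context, here is a rough sketch of how the refined schema and reference examples can be passed to the model through Anthropic’s Python SDK. The model identifier, prompt wording, and helper names below are placeholders, not our production prompt.

```python
import json
import anthropic

client = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment.

def generate_kql(nl_query: str, refined_schema: dict, examples: list[dict]) -> str:
    """Ask the model for a single KQL query grounded in the refined schema."""
    system_prompt = (
        "You translate natural language into KQL. "
        "Use only the fields and values listed in the schema. "
        "Return the KQL query only, with no explanation."
    )
    user_prompt = (
        f"Schema:\n{json.dumps(refined_schema, indent=2)}\n\n"
        f"Reference examples:\n{json.dumps(examples, indent=2)}\n\n"
        f"Request: {nl_query}"
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",  # Placeholder identifier; swap in any capable model.
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_prompt}],
    )
    return response.content[0].text.strip()
```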
Choosing the Right Text Embedding #
Our approach heavily relies on the quality of retrieval, making the choice of a text embedding model a critical factor in our success.
Several key aspects were considered:
- Quality: A high-performing model capable of effectively extracting information from cybersecurity-related text.
- Latency: A low-latency model that can generate text embeddings quickly enough for real-time processing.
- Lightweight: A model that is lightweight enough to be self-hosted, avoiding the need to send data to third-party APIs and reducing latency.
- Embedding Size: A model that produces low-dimensional embedding vectors to speed up cosine similarity calculations.
Based on these criteria, we selected GTE-Base from the General Text Embeddings model series, which outputs embeddings of size 768.
This model offers a balance of relatively small embedding size while still delivering high-quality embeddings—making it an ideal choice for our needs.
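As an illustration, a GTE-Base style model can be loaded through the sentence-transformers library. We assume the publicly available thenlper/gte-base checkpoint here, which produces 768-dimensional vectors.

```python
from sentence_transformers import SentenceTransformer

# Assumed public GTE-Base checkpoint; self-hosted, so no data leaves our environment.
model = SentenceTransformer("thenlper/gte-base")

texts = [
    "Show all failed login attempts from the last 24 hours.",
    "EventID: numeric identifier of the Windows event, e.g. 4625 for a failed logon.",
]
embeddings = model.encode(texts, normalize_embeddings=True)  # shape: (2, 768)

# With normalized vectors, cosine similarity is just a dot product.
similarity = float(embeddings[0] @ embeddings[1])
print(similarity)
```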
The Offline Data Preparation Phase #
To minimize processing time during query translation, we handle the heavy lifting offline, creating the ideal setup for real-time NL2KQL generation.
Here’s a closer look at the offline data preparation workflow:
For vector storage, we chose PGVector, as we already use a self-hosted Postgres database.
Extending our Postgres setup to support vector similarity search was a simple integration, making PGVector the optimal choice for our needs.
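Conceptually, the integration looks roughly like the sketch below. The table layout, column names, connection string, and driver (psycopg with the pgvector adapter) are illustrative, not our actual schema.

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("postgresql://localhost/vega", autocommit=True)  # Placeholder connection string.

with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
register_vector(conn)  # Teach the driver how to send/receive vector values.

with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS field_descriptions (
            tenant_id   text NOT NULL,
            field_name  text NOT NULL,
            description text NOT NULL,
            embedding   vector(768) NOT NULL  -- Matches the GTE-Base output size.
        );
    """)
    # HNSW index with cosine distance for fast approximate nearest-neighbor search.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS field_descriptions_embedding_idx
        ON field_descriptions USING hnsw (embedding vector_cosine_ops);
    """)

def top_fields(tenant_id: str, query_embedding: np.ndarray, k: int = 15):
    """Return the k field descriptions closest to the query embedding for one tenant."""
    with conn.cursor() as cur:
        # `<=>` is pgvector's cosine-distance operator; smaller distance means more similar.
        cur.execute(
            """
            SELECT field_name, description
            FROM field_descriptions
            WHERE tenant_id = %s
            ORDER BY embedding <=> %s
            LIMIT %s
            """,
            (tenant_id, query_embedding, k),
        )
        return cur.fetchall()
```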
Handling Fields and Values #
- Fields: For each field observed in the logs, we generate a description using the Tavily AI search engine (a rough sketch of this step follows the list below).

  - The engine gathers relevant information from official documentation (if available) and other sources that provide insights about the field.
  - The extracted information is then converted into embeddings using our text embedding model and stored for the subsequent online NL2KQL generation.
- Values: We collect unique observations from the logs’ fields and store their corresponding text embedding representations in the vector database.
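Here is a rough sketch of the fields step, assuming the Tavily Python client and the same embedding model as before; the search prompt, result handling, and helper name are illustrative.

```python
from tavily import TavilyClient
from sentence_transformers import SentenceTransformer

tavily = TavilyClient(api_key="...")                   # Placeholder API key.
embedder = SentenceTransformer("thenlper/gte-base")    # Assumed GTE-Base checkpoint.

def describe_and_embed_field(data_source: str, field_name: str) -> tuple[str, list[float]]:
    """Build a natural-language description of a log field and embed it for retrieval."""
    results = tavily.search(
        query=f"{data_source} log field '{field_name}' meaning and possible values",
        max_results=3,
    )
    # Concatenate the snippets returned by the search engine into one description.
    description = " ".join(r["content"] for r in results["results"])
    embedding = embedder.encode(description, normalize_embeddings=True)
    return description, embedding.tolist()
```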
Each tenant’s data is handled separately to ensure privacy. To achieve this, data and corresponding embeddings are strictly isolated in our vector databases.
Schema Building #
As we already know, the logs do not follow a pre-defined schema. To address this, we chose to generate schemas dynamically using a schema generator. For this task, we selected genson-rs, a high-performance library designed for generating schemas from JSON data, such as security logs.
"– 🔥 Generate JSON Schema from Gigabytes of JSON data in seconds
genson-rs is a Rust rewrite of the GenSON Python library, which can be used to generate JSON schema (Draft-04 and after) from one or multiple JSON objects."
— genson-rs documentation (crates.io)
The generated schemas are stored for later retrieval during the online NL2KQL process.
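Our pipeline runs genson-rs offline, but the idea is easy to illustrate with the original GenSON Python library it is based on; the sample log events below are made up.

```python
from genson import SchemaBuilder

builder = SchemaBuilder()

# Feed a couple of raw log events; the builder infers a schema that covers both.
builder.add_object({"eventName": "ConsoleLogin", "userIdentity": {"type": "IAMUser"}})
builder.add_object({"eventName": "AssumeRole", "requestParameters": {"roleArn": "arn:aws:iam::123456789012:role/Admin"}})

schema = builder.to_schema()  # JSON Schema describing the observed events.
print(schema)
```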
Ground-Truth NL2KQL Examples #
As described in the online generation process, ground-truth NL2KQL examples are highly valuable for improving model performance, as similar queries are often asked in different variations.
Our team of threat-hunting experts and detection engineers worked extensively to create and validate a large set of NL2KQL pairs across various domains, use cases, and data sources.
Making NL2KQL Intuitive #
Inline Comments for Better UX #
To improve the readability of generated queries, we enhance the output with auto-generated comments explaining key parts of the query.
Example: Show me all console logins without MFA authentication (over AWS CloudTrail):
Events
| where eventName == "ConsoleLogin" // Look for console logins
| where additionalEventData.MFAUsed == "No" // Filter for logins without MFA
Syntactically Correct Queries #
It’s not a great experience to receive a KQL query with syntax errors 😕.
Therefore, our engine strives to provide only syntactically correct queries, validating its results against the KQL-based search engine.
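Conceptually, this is a validate-and-retry loop. In the sketch below, translate_to_kql and validate_kql are placeholders for the generation step described earlier and the search engine’s parser, respectively.

```python
def generate_valid_kql(nl_query: str, max_attempts: int = 3) -> str:
    """Regenerate until the KQL-based engine accepts the query, or give up."""
    feedback = ""
    for _ in range(max_attempts):
        # Placeholder for the generation step; syntax errors from previous
        # attempts are appended so the model can correct them.
        query = translate_to_kql(nl_query + feedback)
        error = validate_kql(query)  # Placeholder: ask the engine to parse the query.
        if error is None:
            return query
        feedback = f"\nThe previous attempt failed with this syntax error, fix it: {error}"
    raise ValueError("Could not produce a syntactically valid KQL query.")
```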
Privacy & Security #
At Vega Security, we take client privacy seriously — it’s our top priority. Our architecture ensures complete separation between tenants, keeping each client’s data isolated and protected. We work carefully to secure data throughout every phase of the pipeline. Additionally, private data is never used internally for training AI models at Vega.
Conclusion #
At Vega Security, we are transforming security operations by bridging the gap between human intent and structured query languages.
Our Natural Language to KQL (NL2KQL) solution enables faster, more intuitive data analysis, empowering security teams to detect threats efficiently—without requiring deep expertise in KQL.
With real-time query translation, low-latency responses, and intelligent prompt engineering, we are making cybersecurity analytics more accessible and effective.
Want to see it in action? Stay tuned for upcoming demos, or reach out to learn more.