Using LLMs for Log Anomaly Detection
Our team recently investigated using AI for log anomaly detection. One of several approaches was querying Large Language Models (LLMs) through the Ollama API. This article documents the journey from initial exploration to a production-ready API call, including the tools we discovered, the prompt engineering techniques that worked, and best practices for the team moving forward.
The Original Problem
Traditional log monitoring relies on predefined patterns, regex matching, and threshold-based alerts. While these work for known issues, they struggle with:
- Novel anomalies that don’t match existing patterns
- Complex, multi-line log sequences that indicate problems
- Context-dependent issues (e.g., what’s normal at 3 AM vs 3 PM)
- Subtle correlations between seemingly unrelated events
We needed a solution that could ideally:
- Understand context and semantic meaning in logs
- Identify unusual patterns without predefined rules
- Explain findings in human-readable terms
- Scale to analyze large log files efficiently
Why Ollama API?
Ollama provides a local API server for running LLMs, making it ideal for log analysis where data privacy and control are important. The API is OpenAI-compatible, making it familiar to developers who’ve worked with ChatGPT or similar services.
Key advantages we discovered:
- Multiple model support: Easy to experiment with different models (we settled on qwen3:latest)
- File upload capability: Direct attachment of log files rather than embedding them in prompts
- Familiar REST API: Standard HTTP requests that integrate easily into existing tooling
- Knowledge Collections: Ability to build persistent context about your systems
Our Journey: From Basic Queries to Effective Analysis
Initial Attempts
The first approach was very simple and generic:
curl -X POST <Ollama URL>/api/chat/completions \
-H "Authorization: Bearer <API-Key>" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:latest",
"messages": [
{
"role": "user",
"content": "Analyze this log file for anomalies"
}
],
"files": [{"type": "file", "id": "93a77983-..."}]
}'
Don’t worry about the files field for now; these reference uploaded log files and are explained in the “Key Components Explained” section.
What didn’t work:
- Responses were too vague (“This log looks mostly normal”)
- No structured output format
- Missed context-specific anomalies
- No severity classification
The model needed more guidance about what constitutes an anomaly and how to report findings.
Iteration 2: Adding System Context
We introduced a system role to establish expertise:
{
"role": "system",
"content": "You are a log analyst. Find anomalies in the provided logs."
}
Improvements:
- Better context-aware output
- Better terminology usage
Still missing:
- Structured output
- Specific anomaly categories
- Severity levels
- Actionable insights
Iteration 3: Complex User and System Prompt, with Additional Parameters
After multiple iterations, here’s the final API call (user prompt was created with the help of Claude):
curl -X POST <Ollama URL>/api/chat/completions \
-H "Authorization: Bearer <API-Key>" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:latest",
"messages": [
{
"role": "system",
"content": "You are an expert log analyst specializing in anomaly detection. Your task is to identify unusual patterns, errors, security threats, and behavioral anomalies in log files. You have deep knowledge of common log formats, typical system behaviors, and red flags that indicate problems."
},
{
"role": "user",
"content": "Analyze the provided log files for anomalies. For each anomaly found, provide:\n\n## Analysis Format:\n**Line Number:** [exact line number]\n**Log Entry:** [full log line]\n**Anomaly Type:** [classification - e.g., Security, Performance, Error, Pattern Deviation]\n**Severity:** [Critical/High/Medium/Low]\n**Explanation:** [detailed reasoning]\n**Potential Impact:** [what this could indicate]\n\n## Look specifically for:\n- Authentication failures or suspicious login patterns\n- Unusual error rates or new error types\n- Performance degradation indicators (slow responses, timeouts)\n- Security-related events (failed access attempts, privilege escalations)\n- Resource exhaustion patterns\n- Timestamp anomalies or gaps\n- Unusual user agent strings or request patterns\n- Database connection issues\n- Memory leaks or resource spikes\n- Network connectivity problems\n- Configuration changes or system modifications\n- Repeated failed operations\n- Unusual traffic patterns or volumes\n\n## Context Analysis:\n- Compare patterns across different time periods\n- Identify deviations from baseline behavior\n- Note correlation between different types of events\n- Consider seasonal or time-based patterns\n\nIf no anomalies are found, state this clearly and provide a brief summary of the normal patterns observed. Focus on actionable insights rather than minor variations."
}
],
"files": [
{"type": "file", "id": "93a77983-..."},
{"type": "file", "id": "90eada25-..."}
],
"options": {
"temperature": 0.1,
"top_p": 0.9,
"max_tokens": 4000
}
}'
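The response comes back in the familiar OpenAI-style chat completion shape. As a minimal sketch, here is how the analysis text could be pulled out of a saved response (the choices[0].message.content layout is assumed from the OpenAI-compatible convention; verify against your server's actual output):

```python
import json

# Save the curl output above to a file first, e.g.:
#   curl ... > response.json
with open("response.json") as f:
    response = json.load(f)

# OpenAI-compatible chat endpoints normally put the generated text here;
# check the exact field names against your Ollama / Open WebUI version.
analysis = response["choices"][0]["message"]["content"]
print(analysis)
```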
Key Components Explained
1. System Role Design
The system message establishes the AI’s persona and expertise:
"You are an expert log analyst specializing in anomaly detection..."
Why this matters:
- Sets expectations for the type of analysis needed
- Primes the model with domain knowledge
- Improves consistency across multiple queries
2. Structured Output Format
By explicitly defining the output structure, we get consistent, parseable results:
**Line Number:** [exact line number] (this was sometimes finicky; the timestamp is an alternative)
**Log Entry:** [full log line]
**Anomaly Type:** [classification]
**Severity:** [Critical/High/Medium/Low]
This format makes it easy to:
- Parse responses programmatically
- Build dashboards or alerts
- Track anomaly trends over time
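Because the format is fixed by the prompt, a few lines of Python are enough to turn the model's report into structured records. This is only a rough sketch against the format defined above; real responses need more defensive parsing (multi-line explanations, missing fields, and so on):

```python
import re

def parse_anomalies(analysis_text: str) -> list[dict]:
    """Split the model's report into one dict per anomaly, keyed by the prompt's field names."""
    # Each finding starts with the "**Line Number:**" field from our output format.
    blocks = re.split(r"\n(?=\*\*Line Number:\*\*)", analysis_text)
    anomalies = []
    for block in blocks:
        fields = dict(re.findall(r"\*\*(.+?):\*\*\s*(.+)", block))
        if "Severity" in fields:
            anomalies.append(fields)
    return anomalies

# Example usage: feed in the analysis text extracted from the chat response
report = parse_anomalies(open("analysis.md").read())
high_priority = [a for a in report if a.get("Severity") in ("Critical", "High")]
print(f"{len(high_priority)} high-priority findings")
```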
3. Specific Detection Criteria
Rather than asking generically for “anomalies,” we enumerate specific categories:
- Authentication failures
- Performance degradation
- Security events
- Resource exhaustion
- Network issues
Lesson learned: LLMs perform better with specific guidance rather than open-ended tasks.
4. Context Analysis Instructions
The prompt explicitly asks the model to:
- Compare patterns across time periods
- Identify baseline deviations
- Find correlations between events
This encourages deeper analysis beyond simple pattern matching. Admittedly, though, none of our runs flagged any log entries based on patterns across time periods or correlations between specific events.
5. File Upload Functionality
Ollama’s file upload feature is critical for log analysis:
"files": [
{"type": "file", "id": "93a77983-a02f-4aec-89cb-6db09ed83d48"}
]
How it works:
- Upload log files to Ollama separately (via upload endpoint)
- Reference file IDs in your chat completion request
- The model receives the file contents as part of its context window
Advantages over embedding in prompts:
- Cleaner API calls
- Better handling of binary or formatted files
- Can reference multiple files in a single analysis
Upload example:
curl -X POST <Ollama URL>/api/files \
-H "Authorization: Bearer <API-Key>" \
-F "file=@/path/to/logfile.log"
This returns a file ID you can use in subsequent requests.
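Putting the two steps together, a minimal upload-then-analyze sketch might look like the following (the endpoints mirror the curl examples above; the id field in the upload response and the URL/key placeholders are assumptions to adapt to your deployment):

```python
import requests

BASE_URL = "https://<Ollama URL>"                 # placeholder, as in the curl examples
HEADERS = {"Authorization": "Bearer <API-Key>"}   # placeholder API key

# Step 1: upload the log file and capture the returned file ID.
with open("/path/to/logfile.log", "rb") as f:
    upload = requests.post(f"{BASE_URL}/api/files", headers=HEADERS, files={"file": f})
file_id = upload.json()["id"]                     # field name assumed from the upload response

# Step 2: reference that ID in the chat completion request.
payload = {
    "model": "qwen3:latest",
    "messages": [
        {"role": "system", "content": "You are an expert log analyst..."},
        {"role": "user", "content": "Analyze the provided log files for anomalies..."},
    ],
    "files": [{"type": "file", "id": file_id}],
}
resp = requests.post(f"{BASE_URL}/api/chat/completions", headers=HEADERS, json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```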
Understanding API Options (LLM Parameters)
The options parameter fine-tunes model behavior:
"options": {
"temperature": 0.1,
"top_p": 0.9,
"max_tokens": 4000
}
Temperature (0.0 - 2.0)
Temperature controls randomness in responses: at 0.0 the output is essentially deterministic and the model always picks the most likely next token, while at 2.0 it is very random and creative. The parameter defaults to 0.8.
Our choice: 0.1
- We want consistent, reproducible analysis
- Log anomaly detection should be deterministic
- Lower temperature reduces hallucinations
top_p (0.0 - 1.0)
top_p determines the “pool” of likely words the model can choose from - higher values give it more options, lower values make it more focused. Here is a quick example of how top_p works:
Imagine the model is predicting the next word after “The cat sat on the …”. The model assigns these probabilities to possible next words:
- mat -> 40%
- chair -> 30%
- floor -> 15%
- roof -> 10%
- table -> 3%
- fence -> 2%
With a top_p of 0.9, the model accumulates probabilities from most to least likely until it reaches 90%:
- mat (+40%) -> final: 40%
- chair (+30%) -> final: 70%
- floor (+15%) -> final: 85%
- roof (+10%) -> final: 95%, stops here!
Only the above four words are considered; “table” and “fence” are excluded even though they are possible.
- 0.1: Very focused, only highly probable tokens
- 1.0: Considers all possibilities
Our choice: 0.9
- Balances precision with the ability to catch unusual patterns
- Prevents overly narrow thinking
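To make the arithmetic concrete, here is a toy version of that selection step (purely illustrative; it is not what the model runs internally, and the probabilities are the made-up ones from the example above):

```python
import random

def top_p_pool(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    pool, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        pool[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return pool

probs = {"mat": 0.40, "chair": 0.30, "floor": 0.15, "roof": 0.10, "table": 0.03, "fence": 0.02}
pool = top_p_pool(probs, top_p=0.9)
print(pool)  # {'mat': 0.4, 'chair': 0.3, 'floor': 0.15, 'roof': 0.1} -- "table" and "fence" drop out

# The model then samples only from this reduced pool (relative weights handle renormalization).
tokens, weights = zip(*pool.items())
print(random.choices(tokens, weights=weights)[0])
```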
Max Tokens
Maximum length of the response:
- Too low: Analysis gets cut off mid-finding
- Too high: Unnecessary cost and latency
Our choice: 4000
- Sufficient for detailed analysis of multiple anomalies
- Allows comprehensive explanations
- Can cover several log files in one response
Monitoring note: Track actual token usage to optimize this value over time. As logs get longer, we may need to increase this limit so explanations aren't truncated.
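One lightweight way to do that monitoring is to log the usage block that OpenAI-compatible endpoints typically attach to each completion (the exact field names are an assumption; check your server's responses):

```python
import json

with open("response.json") as f:      # the saved chat completion response
    response = json.load(f)

# Typical OpenAI-style token accounting; adjust the keys if your server differs.
usage = response.get("usage", {})
print(
    f"prompt_tokens={usage.get('prompt_tokens')} "
    f"completion_tokens={usage.get('completion_tokens')} "
    f"total_tokens={usage.get('total_tokens')}"
)

# If completion_tokens regularly approaches max_tokens (4000 here), raise the limit.
```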
Here is an article that goes over the above in more detail along with examples.
Best Practices We Discovered
- Be Explicit About Output Format: Don’t assume the model knows how you want results structured. Define it clearly.
- Use Low Temperature for Analysis Tasks: Consistency matters more than creativity when detecting anomalies.
- Provide Exhaustive Detection Criteria: List every type of anomaly you care about - the model won’t infer them.
- Handle Multiple Files: When analyzing related logs (app logs + database logs), upload all files in one request for correlation analysis. Further work in this area will tell us which groups of files are best suited for correlation analysis.
- Iterate on Prompts with Real Data: Test prompts against actual logs, not synthetic examples. Real logs have messiness that tests your prompt design.
- Version Your Prompts: As you refine prompts, version them like code (see the sketch after this list).
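A minimal way to do this (the directory layout and file names below are our own convention, not anything Ollama requires) is to keep each prompt in its own file under version control and load it at request time:

```python
from pathlib import Path

# prompts/ lives in the same git repository as the rest of the tooling, e.g.:
#   prompts/system_v3.txt
#   prompts/user_anomaly_v7.txt
PROMPT_DIR = Path("prompts")

def load_prompt(name: str) -> str:
    return (PROMPT_DIR / name).read_text()

# Build the messages array from versioned prompt files instead of inline strings.
messages = [
    {"role": "system", "content": load_prompt("system_v3.txt")},
    {"role": "user", "content": load_prompt("user_anomaly_v7.txt")},
]
```

A prompt change then shows up as an ordinary diff in code review, and the prompt version can be recorded alongside each analysis result.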
Future Work
Several directions to explore:
1. Utilizing Knowledge Collections
Ollama supports Knowledge Collections - a way to provide persistent background information to the model. Instead of referencing files individually in each chat call, we can continually add files to a knowledge collection and simply reference the collection.
2. Real-Time Analysis Pipeline
Integrate the Ollama API into a log ingestion pipeline (see the sketch after this list):
- Stream logs to Ollama as they’re generated
- Cache model state for faster subsequent analysis
- Trigger alerts for Critical/High severity findings
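As a rough starting point, a batch-and-alert loop along these lines could sit between the log source and the API (the batch size, paths, and alerting hook are placeholders; for brevity this sketch inlines the batch in the prompt rather than using file uploads):

```python
import time
import requests

BASE_URL = "https://<Ollama URL>"                 # placeholders, as in the earlier examples
HEADERS = {"Authorization": "Bearer <API-Key>"}
BATCH_SIZE = 200                                  # tune so a batch fits the context window

def analyze_batch(lines: list[str]) -> str:
    """Send one batch of log lines through the anomaly-detection prompt."""
    payload = {
        "model": "qwen3:latest",
        "messages": [
            {"role": "system", "content": "You are an expert log analyst..."},
            {"role": "user", "content": "Analyze these log lines for anomalies:\n" + "\n".join(lines)},
        ],
        "options": {"temperature": 0.1, "top_p": 0.9, "max_tokens": 4000},
    }
    resp = requests.post(f"{BASE_URL}/api/chat/completions", headers=HEADERS, json=payload)
    return resp.json()["choices"][0]["message"]["content"]

def follow(path: str):
    """Yield lines appended to a log file, tail -f style."""
    with open(path) as f:
        f.seek(0, 2)                              # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(1)

batch = []
for line in follow("/var/log/app.log"):
    batch.append(line)
    if len(batch) >= BATCH_SIZE:
        report = analyze_batch(batch)
        if "**Severity:** Critical" in report or "**Severity:** High" in report:
            print("ALERT:\n" + report)            # replace with a real alerting hook
        batch.clear()
```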
3. Model Comparison Framework
Test different models systematically:
for model in qwen3 llama3 mistral; do
analyze_logs --model=$model --benchmark
done
4. Fine-Tuning on Our Logs
Explore fine-tuning models specifically on our infrastructure:
- Better understanding of our specific log formats
- Reduced false positives
- Improved context awareness
5. Interactive Investigation
Build a chat interface for follow-up questions:
User: "Explain the authentication failures around 2:30 AM"
AI: [Detailed analysis of that specific time window]
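A bare-bones version of that interface is just a loop that keeps the running message history (system prompt, the original analysis request, and every follow-up) and sends it back on each turn. A sketch under the same placeholder URL/key assumptions:

```python
import requests

BASE_URL = "https://<Ollama URL>"                 # placeholders, as before
HEADERS = {"Authorization": "Bearer <API-Key>"}

# Seed the conversation with the same system prompt and analysis request as above;
# file IDs could be attached via the "files" field exactly as in the earlier calls.
messages = [
    {"role": "system", "content": "You are an expert log analyst..."},
    {"role": "user", "content": "Analyze the provided log files for anomalies..."},
]

while True:
    question = input("User: ").strip()
    if not question:
        break
    messages.append({"role": "user", "content": question})
    resp = requests.post(
        f"{BASE_URL}/api/chat/completions",
        headers=HEADERS,
        json={"model": "qwen3:latest", "messages": messages},
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": answer})
    print("AI:", answer)
```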
Lessons Learned
- LLMs aren’t magic: They need clear instructions and structure, just like any other tool.
- Prompt engineering is iterative: Our final prompt took dozens of refinements. Don’t expect perfection on the first try.
- Temperature matters more than you think: The difference between 0.1 and 0.7 was significant for our use case.
- File uploads are underrated: Embedding large logs in prompts was the initial approach - file uploads are far superior.
- Context is king: The more context you provide (via system prompts, knowledge collections, or explicit instructions), the better the analysis.