Using LLMs for Log Anomaly Detection
Our team recently investigated using AI for log anomaly detection. One of several approaches was querying Large Language Models (LLMs) through the Ollama API. This article documents the journey from initial exploration to a production-ready API call, including the tools we discovered, the prompt engineering techniques that worked, and best practices for the team moving forward.
The Original Problem
Traditional log monitoring relies on predefined patterns, regex matching, and threshold-based alerts. While these work for known issues, they struggle with:
- Novel anomalies that don’t match existing patterns
- Complex, multi-line log sequences that indicate problems
- Context-dependent issues (e.g., what’s normal at 3 AM vs 3 PM)
- Subtle correlations between seemingly unrelated events
We needed a solution that could ideally:
- Understand context and semantic meaning in logs
- Identify unusual patterns without predefined rules
- Explain findings in human-readable terms
- Scale to analyze large log files efficiently
Why Ollama API?
Ollama provides a local API server for running LLMs, making it ideal for log analysis where data privacy and control are important. The API is OpenAI-compatible, making it familiar to developers who’ve worked with ChatGPT or similar services.
Key advantages we discovered:
- Multiple model support: Easy to experiment with different models (we settled on qwen3:latest)
- File upload capability: Direct attachment of log files rather than embedding them in prompts
- Familiar REST API: Standard HTTP requests that integrate easily into existing tooling
- Knowledge Collections: Ability to build persistent context about your systems
Our Journey: From Basic Queries to Effective Analysis
Initial Attempts
The first approach was very simple and generic:
curl -X POST <Ollama URL>/api/chat/completions \
-H "Authorization: Bearer <API-Key>" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:latest",
"messages": [
{
"role": "user",
"content": "Analyze this log file for anomalies"
}
],
"files": [{"type": "file", "id": "93a77983-..."}]
}'
Don’t worry about the files field for now; these reference uploaded log files and are explained in the “Key Components Explained” section.
What didn’t work:
- Responses were too vague (“This log looks mostly normal”)
- No structured output format
- Missed context-specific anomalies
- No severity classification
The model needed more guidance about what constitutes an anomaly and how to report findings.
Iteration 2: Adding System Context
We introduced a system role to establish expertise:
{
"role": "system",
"content": "You are a log analyst. Find anomalies in the provided logs."
}
Improvements:
- Better context-aware output
- Better terminology usage
Still missing:
- Structured output
- Specific anomaly categories
- Severity levels
- Actionable insights
Iteration 3: Complex User and System Prompt, with Additional Parameters
After multiple iterations, here’s the final API call (user prompt was created with the help of Claude):
curl -X POST <Ollama URL>/api/chat/completions \
-H "Authorization: Bearer <API-Key>" \
-H "Content-Type: application/json" \
-d '{
"model": "qwen3:latest",
"messages": [
{
"role": "system",
"content": "You are an expert log analyst specializing in anomaly detection. Your task is to identify unusual patterns, errors, security threats, and behavioral anomalies in log files. You have deep knowledge of common log formats, typical system behaviors, and red flags that indicate problems."
},
{
"role": "user",
"content": "Analyze the provided log files for anomalies. For each anomaly found, provide:\n\n## Analysis Format:\n**Line Number:** [exact line number]\n**Log Entry:** [full log line]\n**Anomaly Type:** [classification - e.g., Security, Performance, Error, Pattern Deviation]\n**Severity:** [Critical/High/Medium/Low]\n**Explanation:** [detailed reasoning]\n**Potential Impact:** [what this could indicate]\n\n## Look specifically for:\n- Authentication failures or suspicious login patterns\n- Unusual error rates or new error types\n- Performance degradation indicators (slow responses, timeouts)\n- Security-related events (failed access attempts, privilege escalations)\n- Resource exhaustion patterns\n- Timestamp anomalies or gaps\n- Unusual user agent strings or request patterns\n- Database connection issues\n- Memory leaks or resource spikes\n- Network connectivity problems\n- Configuration changes or system modifications\n- Repeated failed operations\n- Unusual traffic patterns or volumes\n\n## Context Analysis:\n- Compare patterns across different time periods\n- Identify deviations from baseline behavior\n- Note correlation between different types of events\n- Consider seasonal or time-based patterns\n\nIf no anomalies are found, state this clearly and provide a brief summary of the normal patterns observed. Focus on actionable insights rather than minor variations."
}
],
"files": [
{"type": "file", "id": "93a77983-..."},
{"type": "file", "id": "90eada25-..."}
],
"options": {
"temperature": 0.1,
"top_p": 0.9,
"max_tokens": 4000
}
}'
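The response comes back in the familiar OpenAI-style chat completion shape. As a minimal sketch, here is how the analysis text could be pulled out of a saved response (the choices[0].message.content layout is assumed from the OpenAI-compatible convention; verify against your server's actual output):

```python
import json

# Save the curl output above to a file first, e.g.:
#   curl ... > response.json
with open("response.json") as f:
    response = json.load(f)

# OpenAI-compatible chat endpoints normally put the generated text here;
# check the exact field names against your Ollama / Open WebUI version.
analysis = response["choices"][0]["message"]["content"]
print(analysis)
```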
Key Components Explained
1. System Role Design
The system message establishes the AI’s persona and expertise:
"You are an expert log analyst specializing in anomaly detection..."
Why this matters:
- Sets expectations for the type of analysis needed
- Primes the model with domain knowledge
- Improves consistency across multiple queries
2. Structured Output Format
By explicitly defining the output structure, we get consistent, parseable results:
**Line Number:** [exact line number] (this was sometimes finicky; the timestamp is an alternative)
**Log Entry:** [full log line]
**Anomaly Type:** [classification]
**Severity:** [Critical/High/Medium/Low]
This format makes it easy to:
- Parse responses programmatically
- Build dashboards or alerts
- Track anomaly trends over time
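Because the format is fixed by the prompt, a few lines of Python are enough to turn the model's report into structured records. This is only a rough sketch against the format defined above; real responses need more defensive parsing (multi-line explanations, missing fields, and so on):

```python
import re

def parse_anomalies(analysis_text: str) -> list[dict]:
    """Split the model's report into one dict per anomaly, keyed by the prompt's field names."""
    # Each finding starts with the "**Line Number:**" field from our output format.
    blocks = re.split(r"\n(?=\*\*Line Number:\*\*)", analysis_text)
    anomalies = []
    for block in blocks:
        fields = dict(re.findall(r"\*\*(.+?):\*\*\s*(.+)", block))
        if "Severity" in fields:
            anomalies.append(fields)
    return anomalies

# Example usage: feed in the analysis text extracted from the chat response
report = parse_anomalies(open("analysis.md").read())
high_priority = [a for a in report if a.get("Severity") in ("Critical", "High")]
print(f"{len(high_priority)} high-priority findings")
```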
3. Specific Detection Criteria
Rather than asking generically for “anomalies,” we enumerate specific categories:
- Authentication failures
- Performance degradation
- Security events
- Resource exhaustion
- Network issues
Lesson learned: LLMs perform better with specific guidance rather than open-ended tasks.
4. Context Analysis Instructions
The prompt explicitly asks the model to:
- Compare patterns across time periods
- Identify baseline deviations
- Find correlations between events
This encourages deeper analysis beyond simple pattern matching. Admittedly, though, none of our runs flagged any log entries based on patterns across time periods or correlations between specific events.
5. File Upload Functionality
Ollama’s file upload feature is critical for log analysis:
"files": [
{"type": "file", "id": "93a77983-a02f-4aec-89cb-6db09ed83d48"}
]
How it works:
- Upload log files to Ollama separately (via upload endpoint)
- Reference file IDs in your chat completion request
- The model receives the file contents as part of its context window
Advantages over embedding in prompts:
- Cleaner API calls
- Better handling of binary or formatted files
- Can reference multiple files in a single analysis
Upload example:
curl -X POST <Ollama URL>/api/files \
-H "Authorization: Bearer <API-Key>" \
-F "file=@/path/to/logfile.log"
This returns a file ID you can use in subsequent requests.
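Putting the two steps together, a minimal upload-then-analyze sketch might look like the following (the endpoints mirror the curl examples above; the id field in the upload response and the URL/key placeholders are assumptions to adapt to your deployment):

```python
import requests

BASE_URL = "https://<Ollama URL>"                 # placeholder, as in the curl examples
HEADERS = {"Authorization": "Bearer <API-Key>"}   # placeholder API key

# Step 1: upload the log file and capture the returned file ID.
with open("/path/to/logfile.log", "rb") as f:
    upload = requests.post(f"{BASE_URL}/api/files", headers=HEADERS, files={"file": f})
file_id = upload.json()["id"]                     # field name assumed from the upload response

# Step 2: reference that ID in the chat completion request.
payload = {
    "model": "qwen3:latest",
    "messages": [
        {"role": "system", "content": "You are an expert log analyst..."},
        {"role": "user", "content": "Analyze the provided log files for anomalies..."},
    ],
    "files": [{"type": "file", "id": file_id}],
}
resp = requests.post(f"{BASE_URL}/api/chat/completions", headers=HEADERS, json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```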
Understanding API Options (LLM Parameters)
The options parameter fine-tunes model behavior:
"options": {
"temperature": 0.1,
"top_p": 0.9,
"max_tokens": 4000
}
Temperature (0.0 - 2.0)
Temperature controls randomness in responses: at 0.0 the output is essentially deterministic and the model always picks the most likely next token, while at 2.0 it is very random and creative. The parameter defaults to 0.8.
Our choice: 0.1
- We want consistent, reproducible analysis
- Log anomaly detection should be deterministic
- Lower temperature reduces hallucinations
top_p (0.0 - 1.0)
top_p determines the “pool” of likely words the model can choose from - higher values give it more options, lower values make it more focused. Here is a quick example of how top_p works:
Imagine the model is predicting the next word after “The cat sat on the …”. The model assigns these probabilities to possible next words:
- mat -> 40%
- chair -> 30%
- floor -> 15%
- roof -> 10%
- table -> 3%
- fence -> 2%
With a top_p of 0.9, the model accumulates probabilities from most to least likely until it reaches 90%:
- mat (+40%) -> final: 40%
- chair (+30%) -> final: 70%
- floor (+15%) -> final: 85%
- roof (+10%) -> final: 95%, stops here!
Only the above four words are considered; “table” and “fence” are excluded even though they are possible.
- 0.1: Very focused, only highly probable tokens
- 1.0: Considers all possibilities
Our choice: 0.9
- Balances precision with the ability to catch unusual patterns
- Prevents overly narrow thinking
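To make the arithmetic concrete, here is a toy version of that selection step (purely illustrative; it is not what the model runs internally, and the probabilities are the made-up ones from the example above):

```python
import random

def top_p_pool(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    pool, cumulative = {}, 0.0
    for token, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        pool[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    return pool

probs = {"mat": 0.40, "chair": 0.30, "floor": 0.15, "roof": 0.10, "table": 0.03, "fence": 0.02}
pool = top_p_pool(probs, top_p=0.9)
print(pool)  # {'mat': 0.4, 'chair': 0.3, 'floor': 0.15, 'roof': 0.1} -- "table" and "fence" drop out

# The model then samples only from this reduced pool (relative weights handle renormalization).
tokens, weights = zip(*pool.items())
print(random.choices(tokens, weights=weights)[0])
```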
Max Tokens
Maximum length of the response:
- Too low: Analysis gets cut off mid-finding
- Too high: Unnecessary cost and latency
Our choice: 4000
- Sufficient for detailed analysis of multiple anomalies
- Allows comprehensive explanations
- Can cover several log files in one response
Monitoring note: Track actual token usage to optimize this value over time. As logs get longer, we may need to increase this limit so explanations aren't truncated.
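One lightweight way to do that monitoring is to log the usage block that OpenAI-compatible endpoints typically attach to each completion (the exact field names are an assumption; check your server's responses):

```python
import json

with open("response.json") as f:      # the saved chat completion response
    response = json.load(f)

# Typical OpenAI-style token accounting; adjust the keys if your server differs.
usage = response.get("usage", {})
print(
    f"prompt_tokens={usage.get('prompt_tokens')} "
    f"completion_tokens={usage.get('completion_tokens')} "
    f"total_tokens={usage.get('total_tokens')}"
)

# If completion_tokens regularly approaches max_tokens (4000 here), raise the limit.
```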
Here is an article that goes over the above in more detail along with examples.
Best Practices We Discovered
- Be Explicit About Output Format: Don’t assume the model knows how you want results structured. Define it clearly.
- Use Low Temperature for Analysis Tasks: Consistency matters more than creativity when detecting anomalies.
- Provide Exhaustive Detection Criteria: List every type of anomaly you care about - the model won’t infer them.
- Handle Multiple Files: When analyzing related logs (app logs + database logs), upload all files in one request for correlation analysis. Further work in this area will tell us which groups of files are best suited for correlation analysis.
- Iterate on Prompts with Real Data: Test prompts against actual logs, not synthetic examples. Real logs have messiness that tests your prompt design.
- Version Your Prompts: As you refine prompts, version them like code (see the sketch after this list).
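A minimal way to do this (the directory layout and file names below are our own convention, not anything Ollama requires) is to keep each prompt in its own file under version control and load it at request time:

```python
from pathlib import Path

# prompts/ lives in the same git repository as the rest of the tooling, e.g.:
#   prompts/system_v3.txt
#   prompts/user_anomaly_v7.txt
PROMPT_DIR = Path("prompts")

def load_prompt(name: str) -> str:
    return (PROMPT_DIR / name).read_text()

# Build the messages array from versioned prompt files instead of inline strings.
messages = [
    {"role": "system", "content": load_prompt("system_v3.txt")},
    {"role": "user", "content": load_prompt("user_anomaly_v7.txt")},
]
```

A prompt change then shows up as an ordinary diff in code review, and the prompt version can be recorded alongside each analysis result.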
Future Work
Several directions to explore:
1. Utilizing Knowledge Collections
Ollama supports Knowledge Collections - a way to provide persistent background information to the model. Instead of referencing files individually in each chat call, we can continually add files to a knowledge collection and simply reference the collection.
2. Real-Time Analysis Pipeline
Integrate the Ollama API into a log ingestion pipeline (see the sketch after this list):
- Stream logs to Ollama as they’re generated
- Cache model state for faster subsequent analysis
- Trigger alerts for Critical/High severity findings
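As a rough starting point, a batch-and-alert loop along these lines could sit between the log source and the API (the batch size, paths, and alerting hook are placeholders; for brevity this sketch inlines the batch in the prompt rather than using file uploads):

```python
import time
import requests

BASE_URL = "https://<Ollama URL>"                 # placeholders, as in the earlier examples
HEADERS = {"Authorization": "Bearer <API-Key>"}
BATCH_SIZE = 200                                  # tune so a batch fits the context window

def analyze_batch(lines: list[str]) -> str:
    """Send one batch of log lines through the anomaly-detection prompt."""
    payload = {
        "model": "qwen3:latest",
        "messages": [
            {"role": "system", "content": "You are an expert log analyst..."},
            {"role": "user", "content": "Analyze these log lines for anomalies:\n" + "\n".join(lines)},
        ],
        "options": {"temperature": 0.1, "top_p": 0.9, "max_tokens": 4000},
    }
    resp = requests.post(f"{BASE_URL}/api/chat/completions", headers=HEADERS, json=payload)
    return resp.json()["choices"][0]["message"]["content"]

def follow(path: str):
    """Yield lines appended to a log file, tail -f style."""
    with open(path) as f:
        f.seek(0, 2)                              # start at the end of the file
        while True:
            line = f.readline()
            if line:
                yield line.rstrip("\n")
            else:
                time.sleep(1)

batch = []
for line in follow("/var/log/app.log"):
    batch.append(line)
    if len(batch) >= BATCH_SIZE:
        report = analyze_batch(batch)
        if "**Severity:** Critical" in report or "**Severity:** High" in report:
            print("ALERT:\n" + report)            # replace with a real alerting hook
        batch.clear()
```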
3. Model Comparison Framework
Test different models systematically:
for model in qwen3 llama3 mistral; do
analyze_logs --model=$model --benchmark
done
4. Fine-Tuning on Our Logs
Explore fine-tuning models specifically on our infrastructure:
- Better understanding of our specific log formats
- Reduced false positives
- Improved context awareness
5. Interactive Investigation
Build a chat interface for follow-up questions:
User: "Explain the authentication failures around 2:30 AM"
AI: [Detailed analysis of that specific time window]
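A bare-bones version of that interface is just a loop that keeps the running message history (system prompt, the original analysis request, and every follow-up) and sends it back on each turn. A sketch under the same placeholder URL/key assumptions:

```python
import requests

BASE_URL = "https://<Ollama URL>"                 # placeholders, as before
HEADERS = {"Authorization": "Bearer <API-Key>"}

# Seed the conversation with the same system prompt and analysis request as above;
# file IDs could be attached via the "files" field exactly as in the earlier calls.
messages = [
    {"role": "system", "content": "You are an expert log analyst..."},
    {"role": "user", "content": "Analyze the provided log files for anomalies..."},
]

while True:
    question = input("User: ").strip()
    if not question:
        break
    messages.append({"role": "user", "content": question})
    resp = requests.post(
        f"{BASE_URL}/api/chat/completions",
        headers=HEADERS,
        json={"model": "qwen3:latest", "messages": messages},
    )
    answer = resp.json()["choices"][0]["message"]["content"]
    messages.append({"role": "assistant", "content": answer})
    print("AI:", answer)
```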
Lessons Learned
- LLMs aren’t magic: They need clear instructions and structure, just like any other tool.
- Prompt engineering is iterative: Our final prompt took dozens of refinements. Don’t expect perfection on the first try.
- Temperature matters more than you think: The difference between 0.1 and 0.7 was significant for our use case.
- File uploads are underrated: Embedding large logs in prompts was the initial approach - file uploads are far superior.
- Context is king: The more context you provide (via system prompts, knowledge collections, or explicit instructions), the better the analysis.