CS 591 Stochastic Modeling
From TaylorGroves
Project
The main goal of this project is to become more familiar with Bayes Nets, which seem to be one of the more powerful tools we've discussed in this class. The objectives of this project are:
- An overview of autonomous approaches to system maintenance
- Familiarity with Bayes Nets
- A review of systems/BBN-related work, e.g. root cause failure analysis from logs
Bayes Net Basics
- Can provide Inference, Data Collection, and Anomaly Detection.
- Are we examining Boolean random variables or multivariate ones?
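As a minimal sketch of the inference a Bayes Net can provide, the example below uses three Boolean random variables: a hardware fault and a software fault that both influence a node failure. The structure, the variable names, and every probability are assumptions made up for illustration; none of them come from measured data. The posterior is computed by plain enumeration over the joint distribution.

```python
from itertools import product

# Hypothetical priors for two Boolean causes (illustrative numbers only).
P_HW = {True: 0.01, False: 0.99}   # P(hardware fault)
P_SW = {True: 0.05, False: 0.95}   # P(software fault)

# Hypothetical conditional probability table:
# P(node failure = True | hardware fault, software fault).
P_FAIL = {
    (True, True): 0.99,
    (True, False): 0.90,
    (False, True): 0.50,
    (False, False): 0.001,
}


def joint(hw, sw, fail):
    """Joint probability from the factorization P(hw) P(sw) P(fail | hw, sw)."""
    p_fail = P_FAIL[(hw, sw)]
    return P_HW[hw] * P_SW[sw] * (p_fail if fail else 1.0 - p_fail)


def posterior_hw_given_failure():
    """P(hw = True | fail = True), summing the joint over the hidden variable sw."""
    num = sum(joint(True, sw, True) for sw in (True, False))
    den = sum(joint(hw, sw, True) for hw, sw in product((True, False), repeat=2))
    return num / den


if __name__ == "__main__":
    print(f"P(hardware fault | failure) = {posterior_hw_given_failure():.4f}")
```

With these made-up numbers the posterior comes out at roughly 0.26, far above the 0.01 prior. Enumeration is exponential in the number of variables, so it is only practical for toy networks like this one.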
Failure Prediction and Root Cause Analysis
- Categorize Failure Types (Hardware, CPU, Memory, Software, OS, Filesystem)
- Know what system information is correlated to failure
- Analyze failure distributions in time and space domain
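A rough sketch of the last point, assuming that the "space domain" means the distribution of failures across nodes and that each failure is reduced to a hypothetical `(node_id, time_in_seconds)` pair. Both assumptions, and the one-hour bucket size, are illustrative choices.

```python
from collections import Counter


def failures_per_node(events):
    """Spatial view: count failures for each node ID."""
    return Counter(node for node, _ in events)


def failures_per_hour(events):
    """Temporal view: count failures in each one-hour bucket (3600 s)."""
    return Counter(int(t // 3600) for _, t in events)


if __name__ == "__main__":
    # Hypothetical failures: (node_id, seconds since start of observation).
    sample = [("n1", 100), ("n2", 200), ("n1", 4000), ("n1", 7300)]
    print(failures_per_node(sample))
    print(failures_per_hour(sample))
```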
Getting the Data
Sahoo categorizes system information into three groups
- event logs
- SAR (usage) data
- topology
After the data is collected locally, it needs to be filtered/pre-processed and aligned.
Event Logs - Sahoo
- Node ID
- Event ID
- Timestamp
- Event Type
  - PEND: loss of availability of a device or component is imminent.
  - PERF: the performance of the device/component has degraded below an acceptable level.
  - PERM: permanent error (unrecoverable).
  - TEMP: condition recovered after a number of unsuccessful steps.
  - UNKN: unknown.
  - INFO: entry is informational or a warning.
- Event Class
  - Hardware
  - Software
  - Information Only
  - Undetermined
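The event log fields above could be represented as a small record type. This is only a sketch: the class names, the field types, and the comma-separated line format handled by `parse_event` are assumptions, not the format of the logs Sahoo describes.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class EventType(Enum):
    PEND = "PEND"   # loss of availability imminent
    PERF = "PERF"   # performance degraded
    PERM = "PERM"   # permanent, unrecoverable error
    TEMP = "TEMP"   # recovered after unsuccessful steps
    UNKN = "UNKN"   # unknown
    INFO = "INFO"   # information or warning


class EventClass(Enum):
    HARDWARE = "Hardware"
    SOFTWARE = "Software"
    INFORMATION_ONLY = "Information Only"
    UNDETERMINED = "Undetermined"


@dataclass(frozen=True)
class EventRecord:
    node_id: str
    event_id: str
    timestamp: datetime
    event_type: EventType
    event_class: EventClass


def parse_event(line: str) -> EventRecord:
    """Parse a hypothetical line: node,event,ISO timestamp,type,class."""
    node_id, event_id, ts, etype, eclass = (f.strip() for f in line.split(","))
    return EventRecord(
        node_id,
        event_id,
        datetime.fromisoformat(ts),
        EventType(etype),      # raises ValueError on an unknown type
        EventClass(eclass),    # raises ValueError on an unknown class
    )
```

Using enums means a malformed type or class is rejected at parse time, which is one simple form of the filtering step mentioned earlier.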
Performance (SAR) Logs - Sahoo
Unlike the event logs, which were mostly generated from system calls and kernel interrupts, the performance logs are collected at regular intervals. The six fields are:
- Time
- Processor ID
- User Time
- Idle Time
- CPU Time
- I/O Time
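Because the SAR samples arrive at regular intervals while events arrive irregularly, aligning the two means matching each event to a nearby sample. The sketch below pairs an event time with the most recent sample at or before it. The `SarSample` type, the use of seconds as the time unit, and the "latest sample before the event" rule are all assumptions chosen for illustration.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class SarSample:
    time: float          # seconds since an arbitrary epoch (assumed unit)
    processor_id: int
    user_time: float
    idle_time: float
    cpu_time: float
    io_time: float


def latest_sample_before(samples, event_time):
    """Return the most recent sample with time <= event_time, or None.

    `samples` must already be sorted by time.
    """
    times = [s.time for s in samples]
    idx = bisect_right(times, event_time)
    return samples[idx - 1] if idx > 0 else None
```

For many events against the same sample list, the `times` list would be worth building once outside the function; it is rebuilt here only to keep the sketch short.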
Clustering algorithms and Distance Metrics
Causal
Temporal
This metric is derived from the SAR and event log data.
- Distance is time between failures
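A small sketch of the temporal distance, assuming failure timestamps are plain numbers in seconds. The gap-based grouping is a simple stand-in for a real clustering algorithm, and `max_gap` is a hypothetical tuning parameter.

```python
def inter_failure_times(timestamps):
    """Time between each pair of consecutive failures, in time order."""
    ordered = sorted(timestamps)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def group_by_gap(timestamps, max_gap):
    """Group failures so that consecutive members are at most max_gap apart."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)   # close enough to the previous failure
        else:
            clusters.append([t])     # start a new cluster
    return clusters


if __name__ == "__main__":
    failures = [0, 5, 100, 104, 300]   # hypothetical failure times in seconds
    print(inter_failure_times(failures))
    print(group_by_gap(failures, max_gap=10))
```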
Why do we need it?
- Reduce the "human cost" of managing large-scale systems
- Enable proactive system management, which would benefit mechanisms such as checkpointing