CS 591 Stochastic Modeling

From TaylorGroves

Project

The main goal of this project is to become more familiar with Bayes Nets, which seem to be among the more powerful tools we've discussed in this class. The objectives of this project are:

  1. An overview of autonomous approaches to system maintenance
  2. Familiarity with Bayes Nets
  3. A review of systems work related to Bayesian belief networks (BBNs), e.g. root cause failure analysis from logs

Bayes Net Basics

  • Can provide Inference, Data Collection, and Anomaly Detection.
  • Are we examining Boolean random variables or multi-valued ones?
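
As a minimal sketch of inference over Boolean random variables, the snippet below builds a hypothetical three-node network (HighTemp -> Failure -> ErrorLogged) and computes a posterior by enumeration. The network structure, the variable names, and every probability are invented for illustration and are not taken from the project data.

```python
from itertools import product

# Hypothetical network: HighTemp -> Failure -> ErrorLogged.
# All probabilities are made-up illustrative values.
P_HIGH_TEMP = 0.1
P_FAILURE_GIVEN_TEMP = {True: 0.3, False: 0.02}
P_ERROR_GIVEN_FAILURE = {True: 0.9, False: 0.05}


def bernoulli(p_true, value):
    """Return P(X = value) for a Boolean variable with P(X = True) = p_true."""
    return p_true if value else 1.0 - p_true


def joint(high_temp, failure, error_logged):
    """Joint probability from the chain rule over the network's edges."""
    return (
        bernoulli(P_HIGH_TEMP, high_temp)
        * bernoulli(P_FAILURE_GIVEN_TEMP[high_temp], failure)
        * bernoulli(P_ERROR_GIVEN_FAILURE[failure], error_logged)
    )


def posterior_failure(error_logged):
    """P(Failure = True | ErrorLogged = error_logged) by enumeration."""
    numerator = sum(joint(t, True, error_logged) for t in (True, False))
    evidence = sum(
        joint(t, f, error_logged) for t, f in product((True, False), repeat=2)
    )
    return numerator / evidence
```

Calling `posterior_failure(True)` sums out the unobserved HighTemp variable. Enumeration is exponential in the number of variables, so it is only practical for very small networks like this one.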

Failure Prediction and Root Cause Analysis

  • Categorize Failure Types (Hardware, CPU, Memory, Software, OS, Filesystem)
  • Know what system information is correlated to failure
    • Analyze failure distributions in time and space domain
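
One simple way to look at failure distributions in both domains is to count failures per node (space) and per fixed-length time window (time). The sketch below assumes failures are already available as hypothetical `(node_id, timestamp_in_seconds)` pairs; that representation and the function names are assumptions of this example.

```python
from collections import Counter


def failures_per_node(failures):
    """Space domain: count failures for each node ID."""
    return Counter(node for node, _ in failures)


def failures_per_window(failures, window_seconds):
    """Time domain: count failures in consecutive windows of fixed length.

    The key is the window index, i.e. floor(timestamp / window_seconds).
    """
    return Counter(int(ts // window_seconds) for _, ts in failures)
```

Nodes or windows with unusually high counts are candidates for closer inspection.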


Getting the Data

Sahoo categorizes system information into three groups:

  • event logs
  • SAR (usage) data
  • topology

After the data is collected locally, it needs to be filtered/pre-processed and aligned.
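
A rough sketch of those two steps, assuming events are hypothetical `(node_id, event_id, timestamp)` tuples and that performance samples are identified by a sorted list of sample times in the same unit. Filtering here means dropping exact duplicates; aligning means pairing each event with the latest sample taken at or before it. Real logs would probably need a looser duplicate test and clock-skew handling.

```python
from bisect import bisect_right


def drop_duplicates(events):
    """Keep the first occurrence of each (node, event_id, timestamp) tuple."""
    seen = set()
    unique = []
    for event in events:
        if event not in seen:
            seen.add(event)
            unique.append(event)
    return unique


def align(events, sample_times):
    """Pair each event with the latest sample time at or before it.

    `sample_times` must be sorted ascending; events that precede the first
    sample are paired with None.
    """
    aligned = []
    for node, event_id, timestamp in events:
        index = bisect_right(sample_times, timestamp) - 1
        sample = sample_times[index] if index >= 0 else None
        aligned.append((node, event_id, timestamp, sample))
    return aligned
```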

Event Logs - Sahoo

  • Node ID
  • Event ID
  • Timestamp
  • Event Type
    • PEND: loss of availability of a device or component is imminent
    • PERF: the performance of the device or component has degraded below an acceptable level
    • PERM: permanent (unrecoverable) error
    • TEMP: condition recovered after a number of unsuccessful steps
    • UNKN: unknown
    • INFO: entry is informational or a warning
  • Event Class
    • Hardware
    • Software
    • Information Only
    • Undetermined
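
The fields above could be represented as a small record type. This is a sketch only: the field types (string IDs, `datetime` timestamps), the class names, and the rule that a PERM event counts as a failure are assumptions, not something stated by Sahoo.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class EventType(Enum):
    PEND = "PEND"  # loss of availability imminent
    PERF = "PERF"  # performance degraded
    PERM = "PERM"  # permanent, unrecoverable error
    TEMP = "TEMP"  # recovered after unsuccessful steps
    UNKN = "UNKN"  # unknown
    INFO = "INFO"  # information or warning


class EventClass(Enum):
    HARDWARE = "Hardware"
    SOFTWARE = "Software"
    INFORMATION_ONLY = "Information Only"
    UNDETERMINED = "Undetermined"


@dataclass(frozen=True)
class Event:
    node_id: str
    event_id: str
    timestamp: datetime
    event_type: EventType
    event_class: EventClass

    def is_failure(self) -> bool:
        """Treat permanent errors as failures (an assumption of this sketch)."""
        return self.event_type is EventType.PERM
```

Because the enum values match the labels in the list, a raw field such as `"PERM"` can be converted with `EventType("PERM")`, and an unexpected label raises `ValueError` instead of passing through silently.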

Performance (SAR) Logs - Sahoo

Unlike the event logs, which are mostly generated by system calls and kernel interrupts, the performance logs are collected at regular intervals. The six fields are:

  • Time
  • Processor ID
  • User Time
  • Idle Time
  • CPU Time
  • I/O Time

Clustering algorithms and Distance Metrics

Causal

Temporal

The temporal metric is derived from the SAR and event log data.

  • Distance is time between failures
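
A minimal sketch of that distance and of one simple way to cluster with it: failures whose consecutive gaps are no larger than a threshold fall into the same cluster. The function names and the `max_gap` threshold are hypothetical, and timestamps are assumed to be plain numbers in a common unit.

```python
def inter_failure_gaps(timestamps):
    """Return the time between each pair of consecutive failures."""
    ordered = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


def cluster_by_gap(timestamps, max_gap):
    """Group failures whose consecutive gaps are at most `max_gap`."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)  # close enough to the previous failure
        else:
            clusters.append([t])  # gap too large: start a new cluster
    return clusters
```

This is a one-dimensional, single-linkage style grouping. It is not necessarily the clustering algorithm used in the referenced work.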

Why do we need it?

  • Reduce the "human cost" of managing large-scale systems
  • Enable proactive system management, which would benefit mechanisms such as checkpointing

Challenges

Redundant Data

Times Misaligned

