Stupid Routing is Smart

From TaylorGroves
Jump to: navigation, search

For a brief explanation of the project with results see the pdf-presentation.

Contents

Summary

The basic idea of my project is to explore methods of routing using minimal knowledge of the network infrastructure. The traditional network is built from addresses and tables or maps of how to reach these addresses. This experiment attempts to look at other alternative methods of communication that allow for disassembly and reassembly of the IXM network. The goal is that I can implement a reasonable means of communication with an improved level of robustness.

Applications

A problem with sub-problems, e.g. matrix multiplication, brute force key cracking, etc, is given to an initial node via USB. As this node receives requests for “resources” it will hand out sub-problems from within this original problem. This chain of command can be extended as needed. As this group of “producers” receives requests for resources, they will assign “base problems” to “worker” nodes. Upon completion of a problem set, the worker sends back its result and requests another problem set in a single packet.

Specifically, the problem used as an example for my experiments is distributed prime factorization in a deterministic, brute force approach.

Vision, Plans and Milestones

Stealing from Nature

  • Ideas taken from strategies used by ants to find near-shortest paths.
  • Worker simply makes a request to his most active neighbor. Then that neighbor forwards to where he thinks is best, disregarding the node he received from.
  • Good paths enforced by activity.
  • Bad paths decay.
  • The result is that the network optimizes itself, distributing congestion when possible.


The goal is to implement a algorithm that robustly locates/assigns problems, sub-problems, and solutions without addressing, regardless to changes in the network, in ways more efficient than simple flooding.

  • Flooding of the entire network is very costly and has a tendency to result in significant packet loss. There are some exceptions, which I am allowing for flooding. In order to retrieve statistics back from the network I use a limited form of flooding; Eric Schulte’s Collector code.
  • The system-at-large avoids micro-management at all costs and in general downloads code to cells and lets them implement it with their immediate peers.
  • There is a bottleneck in the system. The USB cable. This should not affect performance of the algorithms in use, but is only in regards to issuing new sketches and retrieving metric statistics.

Milestone 1

Presented botched premise for project, which needed grounding in reality. Too many arbitrary rules/decisions in the network. Lots of detailed charts and design gone to waste.

Milestone 2

Simplified much of Milestone 1, took out the “game-like” rules and just started writing the basic networking code. Worked on the worker class. Quite a bit of the code was fairly disgusting at this point, which is somewhat expected when learning something new. At the end the workers were capable of sending request which if it reached the TTL or a dead end timed out. Also did some light testing of the Collector code.

Milestone 3

The underlying network is near complete, and presented. A large amount of code needed to be “cleaned up” from Milestone 2. Both producers and workers completed. The worker uses a field “Resource Forwarding Frequency” (RFF) to determine which face it should request a resource from. Additionally there is an ‘x’ percent chance of the request going through a random face, where ‘x’ is set by the user. If multiple connected faces share a “highest” RFF then the face to forward a request on is randomly selected from them. Every node who forwards a request adds the face that the request came from to a queue (queue structure added this week). If the request for a resource makes it to a producer, then the producer passes a resource along which is passed out of whatever face is in the queue.

In addition to the above, I implemented several alarms that trigger 1) the chance to request a resource, 2) RFF decay. RFF decay is set to remove a user set percentage of the RFF every decay cycle. This is to allow changes in the network structure to be taken advantage of. Decay is mostly heuristic, and largely dependent on how often users are moving around IXM’s.

For all of this code lights are set on the IXM which clearly show the forwarding of requests, resources, timeouts, and class roles. This leads to my description of problems found this week.

  1. Most of the lights are set with a .5-1 second delay in order to give the user time to see the results of the experiment. These delays can cause significant blocking of packages. The solution being, in final benchmarks, to remove the lights and delay from the code.
  2. Occasionally when an IXM pings the faces to check for neighbours, no neighbours are found. In this case I have determined the “best” solution will be to allow for the packet to drop, rather than delaying (possibly blocking) and trying to find faces again.
  3. Lots of trouble with alarms stemmed from the face I assumed the alarms were set to individual counters. This is not the case as they are all going off of the same global clock.
  4. I needed to find a concrete example of what these IXM’s could be doing and how they would benefit from the network structure. Additionally, this should help finalize some of the arbitrary values given to some fields.

Milestone 4

Presentation from class detailing MS4 and preliminary results.

The goal for this milestone is to:

  • Clean up and remove as many calls to delay in the functions as possible.
    *
    Redo alarms from previous week to take advantage of cancel features.
  • Start to return metrics with the collector package.
  • Change the code to drop packets when no faces are found.
  • Shift the premise of the project to assigning and solving distributed problems like brute force key cracking.
MS4 Update Nov. 20
  • COMPLETED
  • Redo alarms from previous week to take advantage of cancel features.
  • Change the code to drop packets when no faces are found.
  • Shift the premise of the project to assigning and solving distributed problems like brute force key cracking. *Incorporate Schulte’s collector package.


The premise of the project has been set to revolve around distributed prime factorization.

MS4 Update Nov. 23

Rolled back to a previous working version of the code and started from there. The same goals were met from the Nov. 20th update, but I was able to remove the SFB error codes that were seen.

Had significant trouble in thinking my code was jumping around on the IXM’s debugged this for about a day. See [FAQ]: Why does my sketch seem to jump around during execution?

From the presentation in class, a bug was seen where IXM cells would freeze with the orange “response” light.

Milestone 5

The goal for this milestone is to:

  • Break-up the prime factorization so that it is
    • distributed,
    • distributed in small pieces.
  • Write the code for flooding.
  • Continue working on the prime factorization distribution code.
  • Create a py script to sort through collector’s output


MS5 Update Nov. 29

Went back through the core code

  • Updated naming conventions to reflect generality better.
  • Updated metric names.
  • Added a large amount of commenting.
  • Fixed 2 bugs: Bias in random face selection, The orange light freeze-up from MS4.

Final

Lots of things tweaked to get a better perspective in the benchmarking among other things.

  • Sending the primes found across the network now - give the people what they want.
  • Set a lights flag to be on or off.
  • Did some more scripting to generate data files.
  • Resetting all stats/metrics to zero each time a number is input to the system.

Component construction

Core Code

Here is the code for the generic network simulation. It has a good amount of commenting to explain itself. This should be of use to anyone in need of a simple routing codebase. Update: with the release of QLED in late Nov. 09, this package could optionally have all the code using LED’s updated.

  • Single Path → Completed.
  • Random Walk → Completed.

The basic mechanisms for routing are:

The Request

  1. To begin a worker sends requests out random faces.
  2. If at any point a face successfully returns a job, that face becomes favored.
  3. Future requests will go out of your most active face, i.e. active being the one that returns jobs the most.

The Response

  1. If a producer receives a requests, he forwards a job out of the corresponding face received from.
  2. If a node receives a job packet and doesn’t need to find work himself, he forwards the packet onward based on a limited queue of faces that requested a job most recently. Otherwise he nabs the job off the network... selfish.

Other Mechanisms

  • Random chance for exploration: This enables a node to possibly discover shorter paths.
  • Decay: we want to keep the idea of a good producer fresh in each nodes mind. So we decay the activity metric (Job Forwarding Frequency in the code) over time.

Distributed Prime Factorization

There are several aspects to this distribution that would be ideal, but probably won’t be implemented due to time constraints.

  • Enable a recursive/scaling feature: This would scale TTL up when solving larger numbers and assign co-producers to further distribute the work. The TTL scaling seems like it is a requirement specific to a hardware/IXM network since a node is connected directly to at most, four other nodes.

For this project I am limiting the scale in several ways.

  • A max limit on input → MAX_SIGNED_INT
  • The worker-producer relationship is only going 1 hierarchy deep.
  • Producers utilising a “map” to feed results back to the usb cable / laptop.

While these constraints take away from some of the applicability of this implementation, the lessons learned from it should transfer to larger networks scaling to a greater level.

Complications

  • Loss of a producer or worse, the starting producer.
  • Optimal size of sub-problems ← Using ten at the moment, ideally this would be fully recursive.
  • Resource contention ← Addressed
  • Location\retrieval ← Addressed
  • Shortest path\time ← Room for improvement
  • Dead ends ← Addressed
  • Node deletion\network restructuring ← Nodes can be added into the system and contribute, but removed nodes lose their work at this point.
  • Queue size ← Addressed
  • Removal-movement of initial producer

Data and Results

Preliminary Results can be found at the end of the Milestone4 presentation.

Metrics for Comparing Routing Strategies

int mMyNumOfJobsCompleted = 0; // total # of jobs that this node has completed.
int mMyNumOfRequestsUnasnwered = 0; // total # of times this node timed out waiting for a response.
int mNumOfRequestsReceived = 0; // total # of times requestReceived() was entered by this node.
int mNumOfJobsForwarded = 0; // The total # of jobs arriving then leaving through this node.
int mNumOfRequestsForwarded = 0; // The total # of requests arriving then leaving through this node.

Getting the Data Out of the System

Eric Schulte’s Collector Code ← Potential bottleneck here. Need to use sparingly.

Option 1) See the progress per producer as it occurs by plugging into it via usb.

Log: subBlockChecklist:[0][0][0][0][0][0][0][0][0][0]
Log: subBlockChecklist:[0][0][1][0][0][0][0][0][0][0]
Log: subBlockChecklist:[0][0][1][0][0][0][0][0][0][0]
Log: subBlockChecklist:[0][0][1][0][0][0][0][0][0][0]
Log: subBlockChecklist:[0][0][1][0][0][0][0][0][0][0]
Log: subBlockChecklist:[0][0][1][0][0][0][0][0][1][0]
Log: subBlockChecklist:[0][0][1][0][0][0][0][1][1][0]
Log: subBlockChecklist:[0][0][1][0][0][1][0][1][1][0]
Log: subBlockChecklist:[0][0][1][0][0][1][0][1][1][0]
Log: subBlockChecklist:[0][0][1][1][0][1][0][1][1][0]
Log: subBlockChecklist:[0][0][1][1][0][1][1][1][1][0]
Log: subBlockChecklist:[0][0][1][1][0][1][1][1][1][0]
Log: subBlockChecklist:[0][0][1][1][0][1][1][1][1][0]
Log: subBlockChecklist:[0][0][1][1][0][1][1][1][1][0]
Log: subBlockChecklist:[0][0][1][1][0][1][1][1][1][0]
Log: subBlockChecklist:[0][0][1][1][0][1][1][1][1][0]
Log: subBlockChecklist:[0][0][1][1][0][1][1][1][1][1]
Log: subBlockChecklist:[0][0][1][1][1][1][1][1][1][1]
Log: subBlockChecklist:[1][0][1][1][1][1][1][1][1][1]
Log: reporting completed block:8

Option 2) View the output from the scrutinizer Ruby script.

Input:1511
Primes Factors Found:0
Time to complete:30.0065
Number of Unique Completed Blocks:7
Number of Redundantly Solved Blocks:5
*=======*
Input:22222
Primes Factors Found:271,2,41
Time to complete:10.1067
Number of Unique Completed Blocks:1
Number of Redundantly Solved Blocks:0

Option 3) CSV file - A large amount of data is collected and put into csv files, what is seen in the results section is this data averaged together over worksets, producer counts, etc.

Tests Conducted

Input

Note that all numbers with the exception of 22222 are prime numbers themselves and should not return any prime factor. These numbers were chosen to provide a guaranteed workload for the network since a prime must be exhaustively tested. “22222” was added into the inputs to:

  • Show Ackley I am now outputting the primes.
  • To see how long the network takes to find p.f.’s for a number that only requires 1/10th of the block space completed.
[ "1511", "3001", "6007", "12037", "22222", "24001", "48017", "96001" ]

5X3

4 tests ran 3 times each for averaging.

  1. Single Producer
  2. Top/Bottom Middle
  3. Top/Bottom/Left/Right Middle
  4. Four corners + Top/Bottom Middle - Time Permitting

detail.php?id=stupid_routing_is_smart&cache=cache&media=5x13num.jpg

USB output always attached to bottom middle.

SRIS
Parameters
  • Routing display lights: OFF
  • 10% chance to choose random face.
  • TTL: 4
  • 60% decay of forwarding metrics per cycle
  • Delivery Queues of capacity: 2
  • Blocksize: 10
Results

The graph below shows the number of producers on the network increasing and a u-shaped affect on time to solve a set of jobs. Also, notice the number of duplicate blocks solved increases as the number of producers does. Efficiency vs. Robustness

fetch.php?w=&h=&cache=cache&media=sris_ugraph.png

The above graph can be better explained by the spread below which consists of averages metrics per node per input. Notice Average Jobs Forwarded is close to identical for both one and two producers. Yet jobs completed per node is significantly lower when only using one producer. This suggests that the one producer was generally oversubscribed. When we move up to four producers, the producers appear to be under-utilized.

fetch.php?w=&h=&cache=cache&media=srisspread.png

In this graph you can see the input value graphed on the bottom and the corresponding time to solve with a given number of producers. It is worth noting that 2 seems to be the optimal number of producers for the network tested. There is a visible decrease in time to complete a job and from the previous graph we can see the increased level of robustness.

Additionally, notice the shifts in performance amongst producer levels, which fit a niche of workload duration, i.e. size of the input.

fetch.php?w=&h=&cache=cache&media=srisisvts.png

Random Walk
Parameters
  • Routing display lights: OFF
  • 100% chance to choose random face.
  • TTL: 4
  • 60% decay of forwarding metrics per cycle
  • Delivery Queues of capacity: 2
  • Blocksize: 10
Results

All data for these experiments can be found at http://cs.unm.edu/~tgroves/data.tar.bz2

fetch.php?w=&h=&cache=cache&media=rwisvts.png

detail.php?id=stupid_routing_is_smart&cache=cache&media=rwugraph.png

In the table below, it is important to note the large number of requests that do not receive a response (the ignored column). This is much higher than the SRIS implementation, in many cases near twice the amount of messages that are having to be sent on this network.

Table showing RW metric summary:

fetch.php?w=&h=&cache=cache&media=rwspread.png

An obvious problem with this routing strategy is that the worker’s request is frequently not making it to any producer.

An even bigger problem with random walk is that, even in a network that has a large number of producers, e.g. one third of the network in our case, there is no mechanism for consistently getting a result back to the producer who assigned the job. Since the workers do not favor any particular path, all faces are used randomly. The implications of this is that completed blocks are often sent to the wrong producer who discards them. While mechanisms are in place to discourage a node from throwing away its work and picking up a job from a new producer; if the worker goes without a response for a long enough time, it may give up and look for another job from a “more reliable” producer.

Summary Comparison Graph

From this Graph, it’s apparent that Random Walk Algorithms are not going to be a solution for workloads where jobs are owned by individuals – where repeated communication between two distinct nodes are needed. SRIS completes the tasks at hand in close to half of the amount of time required by RW. Furthermore, SRIS maintains a similar (if not better) level of workload redundancy.

fetch.php?w=&h=&cache=cache&media=comparison.png

For more graphs and a presentation of the results see the linked pdf. :final.pdf
Future Work/Conclusions

First I’ll list some of the specifics that I would like to implement:

  • Full recursion for the worker producer model.
  • Saving state in eeprom, so if disconnected you won’t lose what you’ve done.
  • Compare SRIS against several other routing mechanisms, and add some test cases that incorporate nodes being physically taken off and put back on the network.
  • Implement some error checking/correction when duplicate blocks fail to match.

In a broader sense I envision that this networking algorithm incorporates better abstraction that allows it to adapt to varying problems, not just prime factorization. Additionally, I would really like to see this network adapt to its demands on the fly, e.g. nodes that are being starved for work by greedy neighbors → become a producer. I would like to see the network – not the user – determine how to set the parameters, such as block-size, TTL, decay, and chance for random walk.

While SRIS is a good start towards a abstract, dumb-down, and hands off routing approach, ant colonies and similar systems have had millions of years to evolve into a finely tuned system with all the right parameters. It remains to be seen, in a network like SRIS, how could these settings be determined on-the-fly with as little meddling by the user as possible. If trends continue towards smaller and “more of” – these techniques will be essential to harnessing the power of a system.

Files

Summary

  1. Unpack the scripts
  2. Make a directory “collector”
  3. Unpack the IXM code into the collector directory
  4. Change any parameters you want in sketch.pde, e.g. lights on/off
  5. Put together your Makefile and sketch-wrapper
  6. Compile the IXM code
  7. Edit your path into scrutinizer.rb
  8. Run scrutinizer.rb

Ruby Scripts

http://cs.unm.edu/~tgroves/ruby_scripts.tar.bz2

This is a bundling of several individuals work. Within this zip is the libixm package which is used for interfacing with the IXM’s. The board, group and scrutinizer .rb files were created by Eric Schulte. scrutinizer.rb has undergone significant additions to add all the functionality for gathering solved blocks in the prime factors simulation and sending out new numbers to solve for.

If you want to change what numbers are going into the simulation just make a list as seen at the top of scrutinizer.rb.

Prime Factorization Code

http://cs.unm.edu/~tgroves/prime_factorization_sketch.tar.bz2

This is a package of code that accomplishes a distributed prime factorization on the IXM’s. all parameters are at the top of sketch.pde. The collector code was written by Eric Schulte with some functionality added by myself.

Personal tools
Namespaces
Variants
Actions
Site Map
Toolbox