I'm looking for a way to sample lists of probabilities.

7    29 Jan 2015 21:41 by u/gunfourhigher

I am working on a text synthesis program that reads in text and makes markov chains, but I need to weight the chains. Instead of having nodes that just have a list of words ex:(cat, dog) after "a", I need to have something like (cat:30, dog:70) where if I sample the letter a a hundred times, I get ~30ish cats and ~70ish dogs.

I am sure that this problem has been solved efficiently, or at least more so than what I am about to write. I am thinking of having a lists of 100 elements, and having the ratios of dogs to cats match the elements, and sampling the entire list randomly. I could then increase the size of the list later to account for granularity, but since I am going to be dealing with thousands of words, this isnt the best option as far as memory goes.

I hope someone understands what I'm trying to say!

It's worth noting that I can write whatever backend to this that I need, I'm not worried about details. I'm not trying to get the code written for me, I just need to know what this is called, and where I can find out more about it. I don't want to appear like someone trying to bamboozle you into doing my homework!

6 comments

1

This reminds me of a school task I worked on with MapReduce. Perhaps MapReduce might be helpful?

1

No, not really. I'm using Ruby for this project, and the solution will be pure ruby. I've run across probability vectors since my initial post, and I haven't had much time to look, but I think that's what I need.

1

Problem description is not too clear

1

I did this in CoffeeScript. Let me explain the algorithm:

  • gather all of your probability objects
  • sort them in ascending order
  • sum your probabilities
  • create a sum sequence of your probabilities
  • choose an int in the range [smallest..sum]
  • choose the first probability larger than your int chosen above

note, this assumes positive probabilities (but really, who uses negative probabilities)

a sample run could be below:

  • [10, 100, 500, 800, 1000] (sorted)
  • sum: 2410
  • [10, 110, 610, 1410, 2410] (sequenced)
  • chosen: 705
  • [10, 110, 610, 1410, 2410] (ignore anything <705)
  • result: 1410 (choose first result)

Let me know how this works out, or if you have questions.

1

That's basically a probability vector, but instead of using 1.0, you're running your own limit and adjusting it on the fly. Thanks for sharing though, I hadn't even thought about doing it that way!

0

No problem. Sorry I couldn't be more helpful.