v/programming: I'm looking for a way to sample lists of probabilities.

I'm looking for a way to sample lists of probabilities.

7 29 Jan 2015 21:41 by u/gunfourhigher

I am working on a text synthesis program that reads in text and makes markov chains, but I need to weight the chains. Instead of having nodes that just have a list of words ex:(cat, dog) after "a", I need to have something like (cat:30, dog:70) where if I sample the letter a a hundred times, I get ~30ish cats and ~70ish dogs.

I am sure that this problem has been solved efficiently, or at least more so than what I am about to write. I am thinking of having a lists of 100 elements, and having the ratios of dogs to cats match the elements, and sampling the entire list randomly. I could then increase the size of the list later to account for granularity, but since I am going to be dealing with thousands of words, this isnt the best option as far as memory goes.

I hope someone understands what I'm trying to say!

It's worth noting that I can write whatever backend to this that I need, I'm not worried about details. I'm not trying to get the code written for me, I just need to know what this is called, and where I can find out more about it. I don't want to appear like someone trying to bamboozle you into doing my homework!

6 comments

1 u/Atko 29 Jan 2015 21:55

This reminds me of a school task I worked on with MapReduce. Perhaps MapReduce might be helpful?

1 u/gunfourhigher [OP] 29 Jan 2015 23:24

No, not really. I'm using Ruby for this project, and the solution will be pure ruby. I've run across probability vectors since my initial post, and I haven't had much time to look, but I think that's what I need.

1 u/tabularassa 30 Jan 2015 03:25

Problem description is not too clear

1 u/seiyria 30 Jan 2015 22:18

I did this in CoffeeScript. Let me explain the algorithm:

gather all of your probability objects
sort them in ascending order
sum your probabilities
create a sum sequence of your probabilities
choose an int in the range [smallest..sum]
choose the first probability larger than your int chosen above

note, this assumes positive probabilities (but really, who uses negative probabilities)

a sample run could be below:

[10, 100, 500, 800, 1000] (sorted)
sum: 2410
[10, 110, 610, 1410, 2410] (sequenced)
chosen: 705
[10, 110, 610, 1410, 2410] (ignore anything <705)
result: 1410 (choose first result)

Let me know how this works out, or if you have questions.

1 u/gunfourhigher [OP] 31 Jan 2015 06:53

That's basically a probability vector, but instead of using 1.0, you're running your own limit and adjusting it on the fly. Thanks for sharing though, I hadn't even thought about doing it that way!

0 u/seiyria 31 Jan 2015 17:14

No problem. Sorry I couldn't be more helpful.