Code Answer: I am looking for a content addressing data structure

I'm trying to design a data structure that allows efficient extraction of entries from a part of their content.

Let's say I am looking for an entry that matches this: [ x 2 3 x x ]

If [ 0 2 3 4 5 ] or [ 3 2 3 7 8 ] are in my data structure, they should be returned by my find function.

I wrote such a data structure where I compare the "pattern" against all entries of the data structure, but of course it takes way too much time. I have few ideas about how to do this a faster way but they are quite heavy to implement. Does something like this already exists? If not, how would you do this?

From stackoverflow

What you're trying to implement looks a lot like tuple space.

Ben : Yes it is actually what I want to do but I need I wanted a hint for implementing it. I guess I'll to look at the code! Thanks

vartec : Let me guess. Distributed programming class homework? ;-)

Ben : Nop ! that is real work :/

vartec : Oh, then why not use already available solutions?

Ben : Well I have specific needs but I simplified a bit the problem to ask my question. Available solutions are not exactly what I want, but they are a good start for implementation, I am on it.
Well, a Suffix tree could do it in O(1), but it would take a LOT of memory.

j_random_hacker : +1. Assuming you have multiple entries, what you really want is called a generalised suffix tree, which is formed by concatenating all the entries together, using distinct (unique) separator symbols between each. Suffix arrays are a (theoretically) slower but much more memory-efficient alternative.

j_random_hacker : P.S.: Suffix trees and arrays rock, but they are *hard* to implement. If you can't find an existing implementation, give yourself a solid 2 weeks to try to understand Ukkonnen's or McCreight's original papers on linear-time construction algorithms.

Darius Bacon : How specifically does it do it in O(1)? You can retrieve all the entries containing a particular substring in O(length(substring)), if that's what you mean. I think the given problem does not restrict the known bits to be a single contiguous substring, either.

Vilx- : I think it does restrict it to a single contiguous substring, but yes, it's not very clear on this. As for O(1) - well, it depends on how you look at it. The Big O in this case depends only on the length of the substring, not the amount of entries in the dictionary.

Vilx- : Clasically when one talks about searching/sorting algorithms he measures the Big O in respect to the number of entries, not the length of them. From that point of view it's O(1).
The first thing that comes to mind is to have a hash table for each position in the tuple. For searching you intersect the results for all the positions with a specified value.

dmeister : To hash each item is a starting point, but for large answer sets the intersection takes time and consumes memory. Consider also to hash each subset of a specific window size.
You might take a look at the RETE algorithm. This general problem came up in the AI systems of the 70s; Chapter 14 in Paradigms of AI Programming covers it. (IIRC there's mainly elaborations of starblue's answer and decision trees.)

Code Answer

Monday, April 25, 2011

I am looking for a content addressing data structure

0 comments:

Post a Comment

Blog Archive