Monday, April 25, 2011

I am looking for a content addressing data structure

I'm trying to design a data structure that allows efficient extraction of entries from a part of their content.

Let's say I am looking for an entry that matches this: [ x 2 3 x x ]

If [ 0 2 3 4 5 ] or [ 3 2 3 7 8 ] are in my data structure, they should be returned by my find function.

I wrote such a data structure where I compare the "pattern" against all entries of the data structure, but of course it takes way too much time. I have few ideas about how to do this a faster way but they are quite heavy to implement. Does something like this already exists? If not, how would you do this?

From stackoverflow
  • What you're trying to implement looks a lot like tuple space.

    Ben : Yes it is actually what I want to do but I need I wanted a hint for implementing it. I guess I'll to look at the code! Thanks
    vartec : Let me guess. Distributed programming class homework? ;-)
    Ben : Nop ! that is real work :/
    vartec : Oh, then why not use already available solutions?
    Ben : Well I have specific needs but I simplified a bit the problem to ask my question. Available solutions are not exactly what I want, but they are a good start for implementation, I am on it.
  • Well, a Suffix tree could do it in O(1), but it would take a LOT of memory.

    j_random_hacker : +1. Assuming you have multiple entries, what you really want is called a generalised suffix tree, which is formed by concatenating all the entries together, using distinct (unique) separator symbols between each. Suffix arrays are a (theoretically) slower but much more memory-efficient alternative.
    j_random_hacker : P.S.: Suffix trees and arrays rock, but they are *hard* to implement. If you can't find an existing implementation, give yourself a solid 2 weeks to try to understand Ukkonnen's or McCreight's original papers on linear-time construction algorithms.
    Darius Bacon : How specifically does it do it in O(1)? You can retrieve all the entries containing a particular substring in O(length(substring)), if that's what you mean. I think the given problem does not restrict the known bits to be a single contiguous substring, either.
    Vilx- : I think it does restrict it to a single contiguous substring, but yes, it's not very clear on this. As for O(1) - well, it depends on how you look at it. The Big O in this case depends only on the length of the substring, not the amount of entries in the dictionary.
    Vilx- : Clasically when one talks about searching/sorting algorithms he measures the Big O in respect to the number of entries, not the length of them. From that point of view it's O(1).
  • The first thing that comes to mind is to have a hash table for each position in the tuple. For searching you intersect the results for all the positions with a specified value.

    dmeister : To hash each item is a starting point, but for large answer sets the intersection takes time and consumes memory. Consider also to hash each subset of a specific window size.
  • You might take a look at the RETE algorithm. This general problem came up in the AI systems of the 70s; Chapter 14 in Paradigms of AI Programming covers it. (IIRC there's mainly elaborations of starblue's answer and decision trees.)

0 comments:

Post a Comment