CVDB 2005: Paper Session III: High-Dimensional Data Analysis and Indexing

Persistent Clustered Main Memory Index for Accelerating k-NN Queries on High Dimensional Datasets

Lijuan Zhang, New Jersey Institute of Technology
Alexander Thomasian, New Jersey Institute of Technology

Similarity search implemented via k-Nearest-Neighbor (k-NN) queries is an extremely useful paradigm in content based image retrieval (CBIR), which is costly on high-dimensional indices due to the curse of dimensionality. We improve k-NN query processing by utilizing the double filltering effect of clustering and indexing on a persistent version of the Ordered-Partition tree (OP-tree) index, which is highly effficient in processing k-NN queries. The OP-tree is made persistent by writing it onto disk after serialization, i.e. arranging its nodes into contiguous memory locations, so that the high transfer rate of modern disk drives is exploited. We first report experimental results to optimize OP-tree parameters. We then compare OP-trees and sequential scans with options for the Karhunen-Loľeve transform and Euclidean distance calculation. Comparisons against OMNI-based sequential scan are also reported. We finally compare a clustered and persistent version of the OP-tree against a clustered version of the SR-tree and the VA-File method. It is observed that the OP-tree index outperforms the other two methods and that the improvement increases with the number of dimensions.