Fast Search and Retrieval of High-Dimensional Data

With advances in semiconductor technology and powerful signal processing tools, there has been a proliferation of personal digital media devices such as digital cameras, music players and digital video players. In parallel, storage media (both magnetic and optical) have become cheaper to manufacture and their capacities have correspondingly increased. This has spawned new applications, such as multimedia information systems, CAD/CAM, geographical information systems (GIS), medical imaging, and time-series analysis (in stock markets and sensor networks), that periodically store large amounts of high-dimensional data that are later retrieved and processed.

Thus, large repositories of high-dimensional data are central to a vast array of disciplines and applications, and the degree to which they can be exploited depends critically on the availability of efficient and smart tools for search, retrieval, analysis, and mining. This ongoing research project is primarily concerned with the problem of efficient, interactive, approximate and exact similarity search in high-dimensional data sets. Our methodology involves approaches whose origins lie in several disciplines besides classical data management, including optimization, information theory, pattern recognition and signal compression.

The research team includes:

Faculty: Prof. Kenneth Rose, Prof. B. S. Manjunath
Graduate Student Researchers: Sharadh Ramaswamy, Ankur Saxena

Past Researchers: Dr. Ertem Tuncel, Dr. Jayanth Nayak, Dr. Jelena Tesic, Dr. Peng Wu, Dr. Sitaram Bhagavathy

To our Sponsors - Thank You!: This project is mainly supported by the National Science Foundation (NSF) under grant IIS-0329267; and in part by the University of California MICRO program, Applied Signal Technology, Inc., Dolby Laboratories Inc., and Qualcomm Inc.

Some of our high-dimensional datasets are available for download here....

A list of publications is available here....

Some presented talks are available:

  1. Fusion Coding of Correlated Sources, Fusion Coder Design, Shared Descriptions Fusion Coder, Predictive Fusion Coder
  2. Cluster Distance Bounding, Relevance Feedback Indexing

Coming soon!! - simulation software packages for
(1) high-dimensional indexing (beta-version) and
(2) fusion coding (beta-version)

Exact search for nearest neighbors by adaptive cluster distance bounding

Researchers: Sharadh Ramaswamy, Prof. Kenneth Rose

Description: Substantial activity was devoted this year to the problem of exact nearest-neighbor search in high-dimensional spaces. The Vector Approximation File (VA-File) approach is based on uniform scalar quantization of the feature vectors, and is a powerful technique that scales well with the size and dimensionality of the data-set. However, since feature vectors often exhibit correlations and dependencies across dimensions, one should expect a search technique based on vector quantization to achieve further gains. We built on two observations: our earlier finding that clustering-based search methods are inherently advantageous as the overall framework within which compression-based search techniques are incorporated, and the fact that the K-means clustering algorithm is virtually identical to the Generalized Lloyd Algorithm from vector quantization. On this basis, we designed an indexing technique that competes with the VA-File by accurately identifying and retrieving only the necessary clusters.
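
As a rough illustration of this clustering-based organization, the sketch below (in Python, with illustrative names and parameters that are not the project's own code) partitions the feature vectors with K-means and keeps, for each cluster, the list of member vector ids; at query time only the lists of a few promising clusters need to be read from disk.

    import numpy as np
    from sklearn.cluster import KMeans

    def build_cluster_index(features, n_clusters=64, seed=0):
        """Partition the data-set with K-means (Generalized Lloyd) and keep, per
        cluster, the ids of its member vectors (conceptually, the on-disk pages)."""
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(features)
        members = [np.where(km.labels_ == c)[0] for c in range(n_clusters)]
        return km.cluster_centers_, members

At query time, the clusters are visited in increasing order of a lower bound on the query-cluster distance, and scanning stops once the bound exceeds the distance to the current k-th nearest neighbor; the bound itself is discussed next.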


Fig 1: Vector Approximation (VA)-File

 

The Hyperplane Bound: Central to this index is adaptive bounding of the query-cluster distances. We investigated the effectiveness of known distance bounding methods, namely those based on bounding hyper-rectangles, rotated bounding hyper-rectangles and bounding hyperspheres. This motivated our development of a new and superior bounding technique based on distance projections onto separating hyperplanes between clusters. We evaluated the performance of our search index over two high-dimensional data-sets and compared it with the known techniques. The data-sets used in our investigation consisted of Gabor feature vectors extracted from aerial photographs (AERIAL - 275,000 elements) and bio-molecular images (BIO-RETINA - 208,000 elements) obtained from ongoing research at the Center for Bio-image Informatics, UCSB.
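
A hedged sketch of such a separating-hyperplane lower bound is given below. It assumes the clusters form a nearest-centroid (Voronoi) partition, so every point of cluster j lies on j's side of the perpendicular-bisector hyperplane between centroids i and j, and the query's distance to that half-space therefore lower-bounds its distance to every point of cluster j; the exact bound and search procedure in the publication may differ in detail.

    import numpy as np

    def hyperplane_lower_bounds(q, centroids):
        """Lower bound on dist(q, cluster j), maximized over all separating hyperplanes."""
        k = len(centroids)
        bounds = np.zeros(k)
        for j in range(k):
            best = 0.0
            for i in range(k):
                if i == j:
                    continue
                n = centroids[j] - centroids[i]   # normal of the bisecting hyperplane
                b = 0.5 * (centroids[j] @ centroids[j] - centroids[i] @ centroids[i])
                # Distance from q to the half-space {x : n.x >= b} that contains cluster j.
                best = max(best, (b - n @ q) / np.linalg.norm(n))
            bounds[j] = best
        return bounds

    def knn_search(q, centroids, members, features, k=10):
        """Visit clusters in order of the bound; prune once it exceeds the k-th NN distance."""
        lb = hyperplane_lower_bounds(q, centroids)
        best = []                                  # (distance, id) pairs, kept sorted
        for c in np.argsort(lb):
            if len(best) == k and lb[c] > best[-1][0]:
                break                              # all remaining clusters can be skipped
            for idx in members[c]:
                best.append((np.linalg.norm(features[idx] - q), idx))
            best = sorted(best)[:k]
        return best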


Fig 2: The Hyperplane Bound

 

Results: We observed that our distance bounding technique based on separating hyperplanes is more effective than the known competitors at estimating query-cluster distances, and also in terms of storage overhead (see publication no. 16). Once incorporated in our search index, it reduced the number of costly random IO operations by a factor of nearly 100 relative to the VA-File, for both data-sets, given (roughly) the same number of sequential IOs (VA-File at 5 bits/dimension quantization). Our index also incurred 10-100 times less storage overhead than the VA-File and roughly 10 times lower computational complexity. It provided speed-ups over VA-Files (evaluated from 3 to 12 bits per dimension), iDistance and LDC (Local Digital Coding).


Fig 3: IO comparison Hyperplane Bound vs Sphere and MBR bounds

 


Fig 4: IO comparison VQ-Hyperplane Bounding vs VA-File, iDistance, LDC

Fast approximate similarity search of high-dimensional data

Introduction: The search for nearest neighbors in a high-dimensional database is complicated by both the size of the database and the dimensionality of the space. While in principle the exact nearest neighbors (defined through some distance metric) of a query point are desirable, the meaningfulness of such search results is often questionable. For one, the choice of feature vectors is never perfect: different features are useful in different settings. For another, canonical distance measures are poor estimators of perceptual similarity. This has led to the concept of an approximate nearest neighbor, or approximate similarity. Instead of incurring the high cost of searching for the exact nearest neighbors, a faster search returns database points that are close to the query point, but not necessarily the closest possible. The loss in accuracy (with respect to some distance metric) is traded against a reduction in the number of elements that need to be scanned, i.e., an increase in search speed.

Popular search methodologies either aim at reducing the number of retrieved elements while extracting full information about each of them (retrieved set reduction, e.g., clustering-based indexes), or aim at reducing the amount of information retrieved about each element while extracting some information from all database elements (retrieved information reduction, e.g., the Vector Approximation (VA) File). An important component of this research was the study of the applicability of a variety of tools, mostly inspired by signal compression, to approximate similarity search in databases.

An Index for Fast Approximate Search - the VQ-Index:

The VA-File is a powerful technique to counter the so-called 'curse of dimensionality', but because it is based on scalar quantization it does not exploit correlations and dependencies across dimensions. A vector quantization (VQ) technique can exploit such dependencies and hence yield a smaller, more compact approximation (representation) of the database. The VQ-Index, an index based on VQ, was proposed for approximate similarity search, where accuracy, measured through a soft variant of the popular recall metric, is traded off against the number of page accesses (see publication no. 3). Preprocessing storage was controlled through structural constraints on the VQ, namely split and multi-stage vector quantization. The result was a reduction in disk IOs by factors ranging from 26X to 35X over the VA-File, when evaluated on real data-sets.
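
The sketch below illustrates the multi-stage structural constraint with a generic two-stage (residual) vector quantizer; the names and codebook sizes are illustrative assumptions, not the VQ-Index implementation. The second stage quantizes the residual of the first, so each vector is stored as two small indices while the effective codebook size is the product of the two stages.

    import numpy as np
    from sklearn.cluster import KMeans

    def train_two_stage_vq(X, k1=64, k2=64, seed=0):
        """Stage 1 quantizes the vectors, stage 2 quantizes the stage-1 residuals."""
        stage1 = KMeans(n_clusters=k1, n_init=4, random_state=seed).fit(X)
        residual = X - stage1.cluster_centers_[stage1.labels_]
        stage2 = KMeans(n_clusters=k2, n_init=4, random_state=seed).fit(residual)
        return stage1, stage2

    def encode(X, stage1, stage2):
        i1 = stage1.predict(X)
        i2 = stage2.predict(X - stage1.cluster_centers_[i1])
        return i1, i2                               # two small indices per vector

    def decode(i1, i2, stage1, stage2):
        return stage1.cluster_centers_[i1] + stage2.cluster_centers_[i2]

Approximate search can then scan the compact decoded approximations sequentially and fetch only the most promising candidates exactly, the same two-phase pattern as the VA-File but with a vector rather than a scalar approximation.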


Fig. 1 Performance comparison of the VQ-Index with other popular indexing techniques (10 Nearest Neighbors mined)

Fig. 2 Performance comparison of the VQ-Index with other popular indexing techniques (50 Nearest Neighbors mined)

Trading Search Quality for Search Time - the Rate-distortion Viewpoint:

Additionally, the optimality of any search strategy also depends on the usage, i.e., the query statistics. We characterized the trade-off between search quality and search time when the database designer has access to the query statistics (see publications no. 1, 3 and 5). An alternate view of similarity search as a guessing game linked it to important problems in compression theory, namely successive refinement and the method of types (see publications no. 6 and 9). As a result of this observation, the optimal search strategy was found to be to sort the type classes P in increasing order of R_P(D), and create the guessing list as the concatenation of the codebooks (i.e., sets of codevectors) that cover type classes P_1, P_2, P_3, etc.

Combining Retrieved Set Reduction and Retrieved Information Reduction:

We have investigated the possibility of effectively combining quantization methods with clustering methods to enjoy the benefits of both approaches. Traditionally, vector quantization schemes and clustering schemes both involve a set of representatives or codevectors and the nearest-neighbor partition induced by these representatives, and they often optimize similar criteria. However, from the database perspective, there is an important difference in the operational framework: given a set of data points, a quantization system stores the mapping from points to their representatives, while a clustering scheme stores the inverse mapping. Since in any reasonable quantization/clustering scenario the mapping from points to representatives is many-to-one, it is advantageous to organize the data using the clustering rather than the quantization paradigm. Hence, while methods for approximate nearest-neighbor search benefit from both compression and clustering techniques, as demonstrated in our results, it is beneficial to embed them within the clustering framework.
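
The operational difference can be made concrete with a small sketch (hypothetical container layouts, not the project's storage format): a quantizer keeps the forward map from point to representative, while a clustering index keeps the inverse map from representative to its list of points, which is what lets a search touch only the lists of the few clusters it needs.

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.default_rng(0).normal(size=(10_000, 32))
    km = KMeans(n_clusters=16, n_init=4, random_state=0).fit(X)

    forward_map = dict(enumerate(km.labels_))          # quantization view: point id -> cluster id
    inverse_map = {c: np.where(km.labels_ == c)[0]     # clustering view: cluster id -> point ids
                   for c in range(km.n_clusters)}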

Exploiting Query Statistics and Global Optimization:

In practice, each database is subject to a certain usage pattern or query distribution relevant to its application. When query statistics, abstracted by a query training set, were incorporated in the design of our clustering-based search strategy, they provided substantial performance improvements over naive clustering techniques (see publication no. 2). The main features of this framework are: i) the contribution of any query to the cost function involves an explicit trade-off between the amount of data retrieved to answer the query and the accuracy with which the query is answered; ii) the total cost function is averaged over the query distribution. However, the cost surface is non-convex, and a greedy design algorithm gets "trapped" in locally optimal solutions.

We have been able to formulate a novel design algorithm based on deterministic annealing, which applies controlled randomization to the design process. The mappings are probabilistic and the cost is minimized subject to a constraint on the system randomness (measured by Shannon entropy). The level of randomness is gradually reduced in direct analogy to annealing in physics, thereby avoiding many of the suboptimal local minima that trap greedy techniques. Unlike simulated annealing, the system operates deterministically by optimizing expectation functionals, rather than simulating a stochastic process, and is hence considerably faster. Preliminary results show substantial gains.
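
A minimal deterministic-annealing sketch is given below for plain clustering, assuming the standard formulation with Gibbs (soft) assignments at a temperature T that is lowered on a fixed schedule; the project's design optimizes a more elaborate retrieval/accuracy cost, so this is only meant to make the annealing mechanism concrete.

    import numpy as np

    def da_cluster(X, k, T0=10.0, Tmin=1e-3, alpha=0.9, inner_iters=20, seed=0):
        rng = np.random.default_rng(seed)
        # All codevectors start near the data mean, slightly perturbed.
        C = X.mean(axis=0) + 1e-3 * rng.normal(size=(k, X.shape[1]))
        T = T0
        while T > Tmin:
            for _ in range(inner_iters):
                d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
                logits = -d2 / T
                logits -= logits.max(axis=1, keepdims=True)
                P = np.exp(logits)
                P /= P.sum(axis=1, keepdims=True)                   # soft (probabilistic) assignments
                C = (P.T @ X) / (P.sum(axis=0)[:, None] + 1e-12)    # centroid update
            T *= alpha                                              # cooling: reduce randomness (entropy)
        return C, P.argmax(axis=1)

    centroids, labels = da_cluster(np.random.default_rng(1).normal(size=(1000, 16)), k=8)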

Relevance feedback and image feature extraction

Description: It is in general impractical to store large collections of extracted feature vectors in random access memory (RAM). Since input/output (I/O) operations on hard storage devices are slow, the time spent accessing the feature vectors dominates the time complexity of the search. However, what is often ignored is that pre-defined feature vectors and standard distance measures are poor approximations of human perception, which can vary widely from one user to another. The first thrust of this work is on the extraction of better features to model images.

In developing image/video search techniques, ``relevance feedback'' has been widely used to associate the user's concept with the descriptors. Given a set of representative features, relevance feedback aims to learn the distance function from user feedback. An initial set of nearest neighbors (based on some naive distance function) is retrieved, the user labels these results as relevant or not, and the labels are used to estimate the "correct" similarity metric, based upon which a new set of nearest neighbors is retrieved. During this learning process, the algorithm needs to access the descriptors associated with the image/video objects. The time-complexity problem is further exacerbated because the search is performed multiple times, until the user is satisfied with the retrieved results. So far, the impact of such learning on the indexing structure has generally been ignored, and the feasibility of learning algorithms has not been evaluated for handling large amounts of image/video data in relevance feedback scenarios. We believe that this is one of the first attempts in the literature to address efficient nearest-neighbor search under relevance feedback.

Results: Our results fall into two categories. The first deals with feature extraction and distance function formulation for improved perceptual quality in image retrieval. The second focuses on optimizing indexing for relevance feedback in image databases.

Distance Formulation and Feature Extraction

We have been able to successfully model a broad variety of compound geo-spatial objects, such as harbors, golf courses, and airports, using Gabor-filter based texture features (see publication no. 12). For image databases (see publication no. 8), we considered inter-image distances that involve geometric mappings subject to global coherence constraints (implemented by imposing hidden Markov models). Such distances enable robust clustering and indexing techniques, which were then tested on various face image databases and were seen to outperform the Bayesian intra/extra-personal classifier.

Indexing for Relevance Feedback

Nearest-neighbor computation in each iteration is quadratic in the number of dimensions and linear in the number of items. In one proposed scheme (see publication no. 4), the user's feedback is used to improve not only the effectiveness of similarity retrieval, but also its efficiency in an interactive content-based image retrieval system. The scheme uses rectangular approximations for nearest-neighbor search under a quadratic distance metric and exploits correlations between two consecutive nearest-neighbor sets. This adaptive approach significantly reduces the overall search complexity, reducing the amount of data accessed on the hard storage devices by up to a factor of 90.
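
For reference, the cost quoted above follows from the form of the quadratic (Mahalanobis-type) metric itself; a minimal sketch, with an illustrative weight matrix W:

    import numpy as np

    def quadratic_distance(x, q, W):
        """d_W(x, q) = (x - q)^T W (x - q): O(d^2) work per item, so a linear
        scan of N items costs O(N d^2) in every relevance-feedback iteration."""
        diff = x - q
        return float(diff @ W @ diff)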

As an alternative to VA-File style indexing for relevance feedback, we consider VQ schemes, which produce more compact representations of the data-set. We derive an invariance property of the point-to-hyperplane distance under a general Mahalanobis distance. This property allows us to apply the hyperplane bound (see publication no. 16) for cluster distance bounding and selection, even as the Mahalanobis distance matrix changes. Substantial gains in I/O, storage and computational complexity were observed (see publication no. 22).
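
The flavor of such a property can be illustrated with the closed-form point-to-hyperplane distance under a Mahalanobis metric, checked numerically below; this is our own illustrative check, not necessarily the exact statement proved in publication no. 22. For the hyperplane H = {x : n.x = b}, the minimum over x in H of sqrt((q - x)^T W (q - x)) equals |n.q - b| / sqrt(n^T W^{-1} n).

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    d = 6
    A = rng.normal(size=(d, d))
    W = A @ A.T + d * np.eye(d)                  # a positive-definite weight matrix
    n, b, q = rng.normal(size=d), 1.3, rng.normal(size=d)

    closed_form = abs(n @ q - b) / np.sqrt(n @ np.linalg.solve(W, n))

    # Brute-force check: minimize the Mahalanobis distance over the hyperplane.
    res = minimize(lambda x: (q - x) @ W @ (q - x), q,
                   constraints=[{"type": "eq", "fun": lambda x: n @ x - b}])
    print(closed_form, np.sqrt(res.fun))         # the two values should agree closely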


Fig. 1: IO Performance Comparison for MARS

We describe the results of experiments on one such data-set. We selected images with strong texture features from the Caltech 101 data-set. The 11 classes of images chosen were ``accordion", ``barrel", ``bass", ``brain", ``beaver", ``cougar body", ``Leopards", ``wild cat", ``hedgehog", ``platypus", and ``soccer ball". These image classes are a priori unknown to the system, and the images are used as queries for performance evaluation. We embed these images within the CORTINA-48 (texture) data-set.

In one set of experiments, the weight matrix for image retrieval was computed according to the relevance feedback principles established in MARS, with each class having a separate weight matrix. In another, a random weight matrix was used for retrieval. For each query, the 10 nearest neighbors (10NN) were mined, and a page size of 8kB was assumed. In both cases, the data-set was clustered under the squared Euclidean distance, i.e., without knowledge of the Mahalanobis weight matrix used for retrieval. The hyperplane-based cluster retrieval shows large gains over the VA-File variant used for relevance feedback indexing (see Figures 1 and 2).


Fig. 2: IO Performance Comparison for Random Weight Matrix

Fast Selective Retrieval of Fusion Stored Time-Series/Sensor Data

Researchers: Sharadh Ramaswamy, Dr. Jayanth Nayak, Prof. Kenneth Rose

Introduction: We are concerned with optimized storage of correlated streams of data, such that the time required for retrieving subsets of the streams, as defined by user requests, is minimized. This problem is directly motivated by the emergence of sensor networks, and the need to store their output for future retrieval and analysis.


Fig 1: A 2D Sensor Field: Regions of Interest are Boxed and Shaded

See Figure 1 for a representative example. The inter-source correlation creates a new and challenging trade-off between the amount of storage space and the average retrieval time. Specifically, if all streams are jointly compressed, the storage cost is minimized, as the correlation between the streams can be fully exploited. However, this implies that all the stored data must be retrieved even when only a small subset of the data streams is requested, which is extremely inefficient in terms of retrieval cost. On the other hand, if we store a separate description for every possible subset of streams (every possible query), the retrieval cost is certainly minimized, but the resulting storage cost grows combinatorially.


Fig 2: Fusion Storage vs. Selective Retrieval

Lossless Fusion Coding and Characterization of Achievable Rates:

For memoryless sources, we investigated this tradeoff from an information theoretic perspective and obtained the corresponding performance bounds (see publication no. 7). We observed that it has direct connections with a known information-theoretic problem called multi-terminal source coding. By obtaining a characterization of the set of achievable rates for the source coding problem, we were able to completely characterize the storage versus retrieval tradeoff for the correlated streams storage problem. Since the rate region characterization that we obtained is asymptotic and not directly computable, we also developed a characterization of a (possibly strict) subset of the rate region that is computable at lower complexity.


Fig 3: Encoder Decomposition for the 2 Source Case

To develop the achievable rate region, we break up the monolithic encoder output into bit-streams generated by component encoders. We maintain a separate decoder for each possible query and partition the encoded bits into groups based on which decoders access them. For 2 sources, we have 3 possible queries and hence 2^3 - 1 = 7 bit-streams. In general, for M sources, there are 2^M - 1 decoders and 2^(2^M - 1) - 1 bit-streams. These bit-streams are termed "shared descriptions". Next, since all encoders have access to all sources, the achievable rate region remains the same if the encoders are assumed to be independent. We can then apply the Han and Kobayashi theorem for multi-terminal source coding and characterize the rates of the individual bit-streams. The available storage imposes a constraint on the rate sum, and over this region the retrieval rate is minimized.
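
A quick check of this counting argument (with hypothetical values of M):

    def counts(M):
        decoders = 2 ** M - 1              # one decoder per non-empty query
        bitstreams = 2 ** decoders - 1     # one shared description per non-empty subset of decoders
        return decoders, bitstreams

    print(counts(2))   # (3, 7), matching the 2-source example above
    print(counts(3))   # (7, 127)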

Fusion Coder for Optimizing Storage Rate-Retrieval Rate-Distortion Tradeoff:

Description: Based on our earlier work on optimized storage and fast selective retrieval of correlated streams of data, we developed information-theoretic models that estimate the gains possible with selective information retrieval over naive storage techniques. We then formulated a Lagrangian cost function that is a weighted sum of the source- and query-averaged distortion and the query-averaged retrieval rate, at a given storage complexity. The optimization of this cost over a training set of sample inputs and sample queries reduces to the design of an encoding rule, a mapping from queries to subsets of the encoded bits, and codebooks for the different subsets of bits. We derived the necessary conditions for optimality and minimized the cost function by iteratively enforcing them. Since each optimality condition reduces the Lagrangian cost, the algorithm is guaranteed to converge (and typically does so within a few iterations). We investigated the performance of the algorithm over synthetic and real data sets for query distributions representing expected user behavior, and compared the results with naive storage techniques, namely joint compression and separate compression.
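
The sketch below is a toy version of such an iterative design, under simplifying assumptions (two sources, a tiny storage rate, a uniform query distribution, exhaustive bit-subset search); the variable names and structure are ours, not the paper's. Each step enforces one necessary condition given the others, so the Lagrangian J = D + lam * R is non-increasing over the iterations.

    import itertools
    import numpy as np

    rng = np.random.default_rng(0)
    M, B, lam, N = 2, 3, 0.05, 200                   # sources, storage bits, multiplier, samples

    base = rng.normal(size=(N, 1))
    X = base + 0.3 * rng.normal(size=(N, M))         # correlated streams, one row per time instant

    queries = [q for r in range(1, M + 1) for q in itertools.combinations(range(M), r)]
    q_prob = np.full(len(queries), 1.0 / len(queries))
    n_idx = 2 ** B                                   # number of storage-bit patterns
    bit_of = (np.arange(n_idx)[:, None] >> np.arange(B)) & 1    # storage index -> its bits

    def pattern(idx, sel):
        """Retrieved-bit pattern of storage index idx for the bit subset sel."""
        return tuple(bit_of[idx, list(sel)]) if sel else ()

    def codebooks_for(assign, selectors):
        """Per-query codebooks: centroid of the requested sources per retrieved pattern."""
        books = []
        for q, sel in zip(queries, selectors):
            groups = {}
            for t in range(N):
                groups.setdefault(pattern(assign[t], sel), []).append(X[t, list(q)])
            books.append({p: np.mean(v, axis=0) for p, v in groups.items()})
        return books

    def cost(assign, selectors, books):
        """Lagrangian J = query-averaged distortion + lam * query-averaged retrieval rate."""
        D = R = 0.0
        for qi, (q, sel) in enumerate(zip(queries, selectors)):
            err = sum(np.sum((X[t, list(q)] - books[qi].get(pattern(assign[t], sel), 0.0)) ** 2)
                      for t in range(N))
            D += q_prob[qi] * err / N
            R += q_prob[qi] * len(sel)
        return D + lam * R, D, R

    assign = rng.integers(0, n_idx, size=N)          # encoder: time instant -> storage index
    selectors = [tuple(range(B))] * len(queries)     # bit-selector: query -> bits retrieved
    all_subsets = [s for r in range(B + 1) for s in itertools.combinations(range(B), r)]

    for it in range(5):
        books = codebooks_for(assign, selectors)
        # (1) Encoder update: each vector takes the index of least query-averaged distortion.
        for t in range(N):
            errs = [sum(q_prob[qi] * np.sum((X[t, list(q)]
                        - books[qi].get(pattern(idx, sel), 0.0)) ** 2)
                        for qi, (q, sel) in enumerate(zip(queries, selectors)))
                    for idx in range(n_idx)]
            assign[t] = int(np.argmin(errs))
        # (2) Codebook update for the new encoder mapping.
        books = codebooks_for(assign, selectors)
        # (3) Bit-selector update: exhaustive search over subsets of the stored bits.
        for qi in range(len(queries)):
            def J_with(sel, qi=qi):
                trial = list(selectors)
                trial[qi] = sel
                return cost(assign, trial, codebooks_for(assign, trial))[0]
            selectors[qi] = min(all_subsets, key=J_with)
        J, D, R = cost(assign, selectors, codebooks_for(assign, selectors))
        print(f"iter {it}: J={J:.4f}  distortion={D:.4f}  avg bits retrieved={R:.2f}")

Sweeping the multiplier lam traces out a retrieval rate-distortion trade-off at the given storage rate.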


Fig 4: Proposed Fusion Coder Operation

Results: We evaluated the retrieval rate-distortion curves of our fusion coder for correlated data streams and compared them with both 'naive' variants, joint and separate compression. Our method outperforms the naive techniques at all retrieval speeds/distortions. Preliminary results on a set of 335 randomly generated queries over real and synthetic data-sets indicate distortion gains of up to 3.5 dB and retrieval speed-ups of up to 5X (see publication no. 13).

Shared Descriptions Fusion Coding:

Description: The design complexity of the proposed fusion coder grows exponentially with the storage rate Rs. Hence, the design of practical storage systems that handle hundreds of streams, where streams must be encoded to a large number of storage bits, would be infeasible. Structural constraints on the fusion coder reduce this complexity. One such constrained approach is the shared descriptions fusion coder: queries are mapped to one of several shared descriptions, each generated by a low-complexity (low storage rate) encoder. By iteratively mapping queries to descriptions and designing descriptions that are optimal for the set of queries mapped to them, a constrained version of the fusion coder is designed. Note that the constraint lies in the fact that, while the encoders have access to all sources, the bit-selection for a query is restricted to the bits of the corresponding description. With K shared descriptions, the design complexity drops to O(K · 2^(Rs/K)).
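
For the configuration reported in the results below (Rs = 80 bits, K = 16 descriptions), the two complexity figures work out as follows:

    Rs, K = 80, 16
    print(2 ** Rs)             # ~1.2e24: unconstrained fusion coder, beyond computational reach
    print(K * 2 ** (Rs // K))  # 16 * 2^5 = 512: shared-descriptions design, easily tractable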



Fig 5: Proposed Shared Descriptions Fusion Coder

Results: Preliminary experiments with 100 sensors, Rs = 80 bits and K = 16 descriptions already show retrieval rate reductions of 7X and distortion reductions of 3-4 dB. Note that the corresponding unconstrained fusion coder design, with complexity O(2^80), would be beyond computational reach (see publication no. 19).

Predictive Fusion Coding:

To compress sources with memory efficiently, we need to exploit temporal correlations, which are nearly as important as the spatial (inter-source) correlations. Since fusion coding over long blocks is computationally intractable, we resort to a low-complexity predictive fusion coder. However, the design of optimal predictive fusion coders is complicated by the presence of a feedback loop and the need to handle an exponentially large query set at the encoder.
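
For concreteness, the snippet below shows a generic first-order predictive (DPCM-style) quantization loop, included only to make the "prediction loop" explicit; the project's predictive fusion coder adds query-dependent bit selection inside such a loop, which is what complicates its design.

    import numpy as np

    def dpcm(x, step=0.1, a=0.95):
        """First-order predictive quantization: quantize the residual, and predict the
        next sample from the reconstructed value so the decoder can mirror the encoder state."""
        x = np.asarray(x, dtype=float)
        recon = np.zeros_like(x)
        pred = 0.0
        for t in range(len(x)):
            residual = x[t] - pred
            q = step * np.round(residual / step)   # uniform residual quantizer
            recon[t] = pred + q
            pred = a * recon[t]
        return recon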


Fig. 6: Optimal Predictive Fusion Coding

In our approach, we constrain the number of encoder prediction loops that are allowed, so that queries and sources share prediction loops. We present an extreme case (see Figure 7) with only one prediction loop. This leads to the need for a predictor bit-selector S_P, in addition to the query bit-selector S(q), and because the predictor bit-selector sits inside the prediction loop, open-loop design is not possible. Stable closed-loop design is also difficult because of the low rates considered and because, if the loop is fully closed, we lose a "handle" on the error residuals. Hence, we resort to design based on the "asymptotic closed loop" principle, which guarantees stability of the design and optimality of performance (see Figure 8).


Fig. 7: Constrained Predictive Fusion Coding: Encoder (1 Prediction Loop allowed)

Fig. 8: Constrained Predictive Fusion Coding Design By Asymptotic Closed Loop: Decoder (1 Prediction Loop allowed)