Query Aware Determinization of Uncertain Objects
Keywords:
Determinization, uncertain data, data quality, query workload, branch and bound algorithm
Abstract:
This paper considers the problem of determinizing probabilistic data to enable such data to be stored in
legacy systems that accept only deterministic input. Probabilistic data may be generated by automated data
analysis/enrichment techniques such as entity resolution, information extraction, and speech processing. The legacy
system may correspond to pre-existing web applications such as Flickr, Picasa, etc. The goal is to generate a
deterministic representation of probabilistic data that optimizes the quality of the end-application built on deterministic
data. We explore this determinization problem in the context of two different data processing tasks: triggers and
selection queries. We show that approaches such as thresholding or top-1 selection, traditionally used for
determinization, lead to suboptimal performance for such applications. Instead, we develop a query-aware strategy and
show its advantages over existing solutions through a comprehensive empirical evaluation over real and synthetic
datasets.
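
To make the two baseline strategies named above concrete, the following is a minimal sketch of thresholding and top-1 selection applied to a single uncertain object. The representation (a dictionary mapping candidate tags to probabilities), the function names, the threshold value, and the sample photo are all illustrative assumptions rather than details taken from the paper, and the sketch does not implement the query-aware strategy itself.

```python
from typing import Dict, List


def threshold_determinize(tag_probs: Dict[str, float], tau: float = 0.5) -> List[str]:
    """Keep every candidate tag whose probability is at least tau (hypothetical helper)."""
    return sorted(tag for tag, p in tag_probs.items() if p >= tau)


def top1_determinize(tag_probs: Dict[str, float]) -> List[str]:
    """Keep only the single most probable tag; an empty input gives an empty list."""
    if not tag_probs:
        return []
    return [max(tag_probs, key=tag_probs.get)]


# Assumed example: tags produced by an automated tagger for one photo.
photo = {"beach": 0.6, "sunset": 0.45, "dog": 0.1}

print(threshold_determinize(photo, tau=0.4))  # ['beach', 'sunset']
print(top1_determinize(photo))                # ['beach']
```

Both baselines decide which tags to keep from the probabilities alone, without looking at the queries or triggers that will later run over the stored data, which is the limitation the query-aware approach is meant to address.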