Document assignment in multi-site search engines

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Authors

Assigning documents accurately to sites is critical for the performance of multi-site Web search engines. In such settings, sites crawl only documents they index and forward queries to obtain best-matching documents from other sites. Inaccurate assignments may lead to inefficiencies when crawling Web pages or processing user queries. In this work, we propose a machine-learned document assignment strategy that uses the locality of document views in search results to decide upon assignments. We evaluate the performance of our strategy using various document features extracted from a large Web collection. Our experimental setup uses query logs from a number of search front-ends spread across different geographic locations and uses these logs to learn the document access patterns. We compare our technique against baselines such as region- and language-based document assignment and observe that our technique achieves substantial performance improvements with respect to recall. With our technique, we are able to obtain a small query forwarding rate (0.04) requiring roughly 45% less replication of documents compared to replicating all documents across all sites.
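
The trade-off between query forwarding and replication mentioned in the abstract can be illustrated with a small sketch. It is not the machine-learned strategy of the paper: it swaps in a plain view-share threshold, and the function names, the `threshold` parameter, the toy data, and the exact definitions of forwarding rate and replication are all assumptions made for illustration.

```python
from collections import Counter

def assign_documents(views, threshold=0.2):
    """Map each document to the set of sites that store a copy.

    views: dict of doc_id -> Counter {site: number of result views}.
    Assumed heuristic (not the paper's learned model): a site stores a
    document if it accounts for at least `threshold` of that document's views.
    """
    assignment = {}
    for doc, per_site in views.items():
        total = sum(per_site.values())
        sites = {s for s, n in per_site.items() if total and n / total >= threshold}
        if not sites and per_site:
            # Fall back to the single site with the most views.
            sites = {max(per_site, key=per_site.get)}
        assignment[doc] = sites
    return assignment

def forwarding_rate(query_log, assignment):
    """Fraction of queries with at least one result not stored at the issuing site.

    query_log: list of (site, [result doc ids]). This definition is an assumption.
    """
    if not query_log:
        return 0.0
    forwarded = sum(
        1
        for site, docs in query_log
        if any(site not in assignment.get(d, set()) for d in docs)
    )
    return forwarded / len(query_log)

def replication_fraction(assignment, n_sites):
    """Stored copies relative to replicating every document on every site."""
    if not assignment or n_sites == 0:
        return 0.0
    copies = sum(len(sites) for sites in assignment.values())
    return copies / (len(assignment) * n_sites)

# Toy data: two hypothetical sites, "eu" and "us".
views = {
    "d1": Counter({"eu": 9, "us": 1}),
    "d2": Counter({"eu": 5, "us": 5}),
    "d3": Counter({"us": 4}),
}
query_log = [
    ("eu", ["d1", "d2"]),
    ("us", ["d1"]),
    ("us", ["d2", "d3"]),
    ("eu", ["d3"]),
]

assignment = assign_documents(views, threshold=0.2)
rate = forwarding_rate(query_log, assignment)
replication = replication_fraction(assignment, n_sites=2)
```

Lowering `threshold` in this sketch stores more copies and forwards fewer queries; raising it does the opposite, which is the balance the abstract's reported figures describe.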

Original language: English
Title of host publication: Proceedings of the fourth ACM international conference on Web search and data mining
Number of pages: 10
Place of Publication: New York
Publisher: Association for Computing Machinery, Inc
Publication date: 2011
Pages: 575-584
ISBN (print): 978-1-4503-0493-1
DOIs
Publication status: Published - 2011
Externally published: Yes
Event: 4th ACM International Conference on Web Search and Data Mining - WSDM '11 2011 - Hong Kong, China
Duration: 09.02.2011 - 12.02.2011
Conference number: 4
http://www.wsdm2011.org/wsdm2011/_media/wsdm2011-program-20110127.pdf

    Research areas

  • Informatics - Assignment strategies, Classification, Document access, Document replication, Experimental setup, Geographic location, Multi-site, Multi-site web search engines, Performance improvements, Query forwarding, Query logs, Search results, User query, Web collections, Web page, Web search engines
  • Business informatics
