A preliminary study on similarity-preserving digital book identifiers

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Authors

  • Klemo Vladimir
  • Marin Silic
  • Nenad Romic
  • Goran Delac
  • Sinisa Srbljic
Due to the proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBNs, metadata embedded in digital files (for which there is no accepted standard), and cryptographic hash functions to identify coderivative or near-duplicate content. However, the unreliability of metadata and the sensitivity of hashing to even the smallest changes prevent efficient detection of coderivative or similar digital books. The study focuses on books that exist in many versions, which differ in the number of OCR errors and contain a number of sentence-length variations. Similar books are identified using small fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy and report standard precision and recall measurements.
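The contrast drawn in the abstract between cryptographic hashes and similarity-preserving fingerprints can be illustrated with a minimal sketch. The abstract does not say which fingerprinting scheme the paper uses, so MinHash over word shingles is used here purely as an assumed stand-in. The function names, the shingle size `k=3`, the 128 hash seeds, and the sample passages are all illustrative choices, not taken from the paper.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of overlapping k-word shingles in a text."""
    words = text.lower().split()
    if len(words) <= k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def hash64(value, seed):
    """Hash a string to a 64-bit integer, salted with an integer seed."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash(text, num_hashes=128, k=3):
    """Fingerprint: for each seeded hash, keep the minimum over all shingles."""
    grams = shingles(text, k)
    return [min(hash64(g, seed) for g in grams) for seed in range(num_hashes)]


def similarity(fp_a, fp_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)


# Illustrative public-domain passages (assumed sample data).
original = (
    "It is a truth universally acknowledged, that a single man in possession "
    "of a good fortune, must be in want of a wife. However little known the "
    "feelings or views of such a man may be on his first entering a "
    "neighbourhood, this truth is so well fixed in the minds of the "
    "surrounding families, that he is considered the rightful property of "
    "some one or other of their daughters."
)
noisy = original.replace("fortune", "fortunc")  # simulate a single OCR error
unrelated = (
    "Call me Ishmael. Some years ago, never mind how long precisely, having "
    "little or no money in my purse, and nothing particular to interest me on "
    "shore, I thought I would sail about a little and see the watery part of "
    "the world."
)

# A cryptographic hash changes completely after a one-character edit.
sha_original = hashlib.sha256(original.encode("utf-8")).hexdigest()
sha_noisy = hashlib.sha256(noisy.encode("utf-8")).hexdigest()

# The MinHash fingerprints of the two versions stay close to each other.
fp_original = minhash(original)
fp_noisy = minhash(noisy)
fp_unrelated = minhash(unrelated)

print("SHA-256 equal:", sha_original == sha_noisy)
print("original vs noisy:", similarity(fp_original, fp_noisy))
print("original vs unrelated:", similarity(fp_original, fp_unrelated))
```

With these settings each fingerprint is 128 64-bit integers (1 KiB), small enough to share and compare without exchanging the book text itself. More hash seeds would give a tighter similarity estimate at the cost of a larger fingerprint.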
Original language: English
Title of host publication: Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities: LaTeCH 2015
Editors: Kalliopi A. Zervanou, Marieke van Erp, Beatrice Alex
Number of pages: 6
Place of publication: Beijing
Publisher: Association for Computational Linguistics (ACL)
Publication date: 2015
Pages: 78-83
ISBN (electronic): 978-1-941643-63-1
Publication status: Published - 2015
Event: 9th Socio-Economic Sciences and Humanities Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities - SIGHUM 2015 - Beijing, China
Duration: 26.07.2015 - 30.07.2015
Conference number: 9
https://aclanthology.info/volumes/proceedings-of-the-9th-sighum-workshop-on-language-technology-for-cultural-heritage-social-sciences-and-humanities-latech
https://sighum.wordpress.com/events/latech-2015/

Bibliographical note

Funding Information:
This work was supported in part by the Croatian Science Foundation through the Recommender System for Service-oriented Architecture research project and in part by Leuphana Universität Lüneburg, DCRL Digital Cultures Research Lab. The authors would like to thank Robert M. Ochshorn and Goran Glavaš for their invaluable comments and suggestions and Project Gutenberg for their book collection.

Publisher Copyright:
© 2015 Proceedings of the Annual Meeting of the Association for Computational Linguistics.
