A preliminary study on similarity-preserving digital book identifiers

Research output: Contributions to collected editions/works > Article in conference proceedings > Research > peer-review

Standard

A preliminary study on similarity-preserving digital book identifiers. / Vladimir, Klemo; Silic, Marin; Romic, Nenad et al.
Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities: LaTeCH 2015. ed. / Kalliopi A. Zervanou; Marieke van Erp; Beatrice Alex. Beijing: Association for Computational Linguistics (ACL), 2015. p. 78-83.

Harvard

Vladimir, K, Silic, M, Romic, N, Delac, G & Srbljic, S 2015, A preliminary study on similarity-preserving digital book identifiers. in KA Zervanou, M van Erp & B Alex (eds), Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities: LaTeCH 2015. Association for Computational Linguistics (ACL), Beijing, pp. 78-83, 9th Socio-Economic Sciences and Humanities Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities - SIGHUM 2015, Beijing, China, 26.07.15. <http://www.aclweb.org/anthology/W15-3700.pdf>

APA

Vladimir, K., Silic, M., Romic, N., Delac, G., & Srbljic, S. (2015). A preliminary study on similarity-preserving digital book identifiers. In K. A. Zervanou, M. van Erp, & B. Alex (Eds.), Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities: LaTeCH 2015 (pp. 78-83). Association for Computational Linguistics (ACL). http://www.aclweb.org/anthology/W15-3700.pdf

Vancouver

Vladimir K, Silic M, Romic N, Delac G, Srbljic S. A preliminary study on similarity-preserving digital book identifiers. In Zervanou KA, van Erp M, Alex B, editors, Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities: LaTeCH 2015. Beijing: Association for Computational Linguistics (ACL). 2015. p. 78-83

Bibtex

@inbook{75808cc943944961b64f45e3d92291c8,
title = "A preliminary study on similarity-preserving digital book identifiers",
abstract = "Due to proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBN, metadata embedded into digital files (without accepted standard) and cryptographic hash functions for the identification of coderivative or near-duplicate content. However, unreliability of metadata and sensitivity of hashing to even smallest changes prevents efficient detection of coderivative or similar digital books. Focus of the study are books with many versions that differ in certain amount of OCR errors and have a number of sentence-length variations. Identification of similar books is performed using small-sized fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy while providing standard precision and recall measurements.",
keywords = "Digital media",
author = "Klemo Vladimir and Marin Silic and Nenad Romic and Goran Delac and Sinisa Srbljic",
note = "Funding Information: This work was supported in part by the Croatian science foundation through the Recommender System for Service-oriented Architecture research project and in part by Leuphana Universit{\"a}t L{\"u}neburg, DCRL Digital Cultures Research Lab. The authors would like to thank Robert M. Ochshorn and Goran Glava{\v{s}} for their invaluable comments and suggestions and Project Gutenberg for their book collection. Publisher Copyright: {\textcopyright} 2015 Proceedings of the Annual Meeting of the Association for Computational Linguistics.; 9th Socio-Economic Sciences and Humanities Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities - SIGHUM 2015, LaTeCH SIGHUM workshop 2015 ; Conference date: 26-07-2015 Through 30-07-2015",
year = "2015",
language = "English",
pages = "78--83",
editor = "Zervanou, {Kalliopi A.} and {van Erp}, Marieke and Beatrice Alex",
booktitle = "Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities",
publisher = "Association for Computational Linguistics (ACL)",
address = "United States",
url = "https://aclanthology.info/volumes/proceedings-of-the-9th-sighum-workshop-on-language-technology-for-cultural-heritage-social-sciences-and-humanities-latech, https://sighum.wordpress.com/events/latech-2015/",

}

RIS

TY - CHAP

T1 - A preliminary study on similarity-preserving digital book identifiers

AU - Vladimir, Klemo

AU - Silic, Marin

AU - Romic, Nenad

AU - Delac, Goran

AU - Srbljic, Sinisa

N1 - Conference code: 9

PY - 2015

Y1 - 2015

N2 - Due to proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBN, metadata embedded into digital files (without accepted standard) and cryptographic hash functions for the identification of coderivative or near-duplicate content. However, unreliability of metadata and sensitivity of hashing to even smallest changes prevents efficient detection of coderivative or similar digital books. Focus of the study are books with many versions that differ in certain amount of OCR errors and have a number of sentence-length variations. Identification of similar books is performed using small-sized fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy while providing standard precision and recall measurements.

AB - Due to proliferation of digital publishing, e-book catalogs are abundant but noisy and unstructured. Tools for the digital librarian rely on ISBN, metadata embedded into digital files (without accepted standard) and cryptographic hash functions for the identification of coderivative or near-duplicate content. However, unreliability of metadata and sensitivity of hashing to even smallest changes prevents efficient detection of coderivative or similar digital books. Focus of the study are books with many versions that differ in certain amount of OCR errors and have a number of sentence-length variations. Identification of similar books is performed using small-sized fingerprints that can be easily shared and compared. We created synthetic datasets to evaluate fingerprinting accuracy while providing standard precision and recall measurements.

KW - Digital media

UR - http://www.scopus.com/inward/record.url?scp=85122502213&partnerID=8YFLogxK

M3 - Article in conference proceedings

SP - 78

EP - 83

BT - Proceedings of the 9th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities

A2 - Zervanou, Kalliopi A.

A2 - van Erp, Marieke

A2 - Alex, Beatrice

PB - Association for Computational Linguistics (ACL)

CY - Beijing

T2 - 9th Socio-Economic Sciences and Humanities Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities - SIGHUM 2015

Y2 - 26 July 2015 through 30 July 2015

ER -