Supervised clustering of streaming data for email batch detection
Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review
Standard
Proceedings of the 24th international conference on Machine learning. ed. / Zoubin Ghahramani. New York: Association for Computing Machinery, Inc, 2007. p. 345-352.
RIS
TY - CHAP
T1 - Supervised clustering of streaming data for email batch detection
AU - Haider, Peter
AU - Brefeld, Ulf
AU - Scheffer, Tobias
N1 - Conference code: 24
PY - 2007
Y1 - 2007
N2 - We address the problem of detecting batches of emails that have been created according to the same template. This problem is motivated by the desire to filter spam more effectively by exploiting collective information about entire batches of jointly generated messages. The application matches the problem setting of supervised clustering, because examples of correct clusterings can be collected. Known decoding procedures for supervised clustering are cubic in the number of instances. When decisions cannot be reconsidered once they have been made, owing to the streaming nature of the data, then the decoding problem can be solved in linear time. We devise a sequential decoding procedure and derive the corresponding optimization problem of supervised clustering. We study the impact of collective attributes of email batches on the effectiveness of recognizing spam emails.
AB - We address the problem of detecting batches of emails that have been created according to the same template. This problem is motivated by the desire to filter spam more effectively by exploiting collective information about entire batches of jointly generated messages. The application matches the problem setting of supervised clustering, because examples of correct clusterings can be collected. Known decoding procedures for supervised clustering are cubic in the number of instances. When decisions cannot be reconsidered once they have been made, owing to the streaming nature of the data, then the decoding problem can be solved in linear time. We devise a sequential decoding procedure and derive the corresponding optimization problem of supervised clustering. We study the impact of collective attributes of email batches on the effectiveness of recognizing spam emails.
KW - Informatics
KW - Business informatics
UR - http://www.scopus.com/inward/record.url?scp=34547983265&partnerID=8YFLogxK
U2 - 10.1145/1273496.1273540
DO - 10.1145/1273496.1273540
M3 - Article in conference proceedings
AN - SCOPUS:34547983265
SN - 978-1-59593-793-3
SP - 345
EP - 352
BT - Proceedings of the 24th international conference on Machine learning
A2 - Ghahramani, Zoubin
PB - Association for Computing Machinery, Inc
CY - New York
T2 - Proceedings of the 24th international conference on Machine learning - ICML 2007
Y2 - 20 June 2007 through 24 June 2007
ER -