Supervised clustering of streaming data for email batch detection

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Authors

We address the problem of detecting batches of emails that have been created according to the same template. This problem is motivated by the desire to filter spam more effectively by exploiting collective information about entire batches of jointly generated messages. The application matches the problem setting of supervised clustering, because examples of correct clusterings can be collected. Known decoding procedures for supervised clustering are cubic in the number of instances. When decisions cannot be reconsidered once they have been made, owing to the streaming nature of the data, the decoding problem can be solved in linear time. We devise a sequential decoding procedure and derive the corresponding optimization problem of supervised clustering. We study the impact of collective attributes of email batches on the effectiveness of recognizing spam emails.
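
The idea of sequential decoding can be illustrated with a minimal sketch. It is not the authors' procedure: the names `jaccard` and `sequential_decode`, the word-overlap similarity, and the `threshold` value are all assumptions made for illustration, and a fixed similarity stands in for the pairwise similarity that supervised clustering would learn from example clusterings. Each arriving message either joins the existing batch with the largest positive gain or opens a new batch, and that decision is never revisited. In this sketch each arrival costs one pass over the messages seen so far.

```python
from typing import Callable, Iterable, List


def jaccard(a: str, b: str) -> float:
    """Toy pairwise similarity (assumption): Jaccard overlap of word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def sequential_decode(
    stream: Iterable[str],
    sim: Callable[[str, str], float] = jaccard,
    threshold: float = 0.3,  # hypothetical cut-off, not taken from the paper
) -> List[int]:
    """Assign each message to a batch as it arrives; never revise a decision."""
    clusters: List[List[str]] = []  # members of each batch found so far
    labels: List[int] = []          # batch index chosen for each message
    for msg in stream:
        best_idx, best_gain = -1, 0.0
        for idx, members in enumerate(clusters):
            # Gain of joining a batch: summed similarity above the threshold.
            gain = sum(sim(msg, m) - threshold for m in members)
            if gain > best_gain:
                best_idx, best_gain = idx, gain
        if best_idx == -1:
            # No batch yields a positive gain: start a new one.
            clusters.append([msg])
            labels.append(len(clusters) - 1)
        else:
            clusters[best_idx].append(msg)
            labels.append(best_idx)
    return labels


if __name__ == "__main__":
    example_stream = [
        "buy cheap pills now",
        "buy cheap watches now",
        "meeting at noon tomorrow",
        "buy cheap pills today",
        "meeting at noon today",
    ]
    print(sequential_decode(example_stream))
```

Because earlier assignments are fixed, the result depends on arrival order; that is the price paid for avoiding a joint optimization over all instances.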

Original language: English
Title of host publication: Proceedings of the 24th international conference on Machine learning
Editors: Zoubin Ghahramani
Number of pages: 8
Place of Publication: New York
Publisher: Association for Computing Machinery, Inc
Publication date: 2007
Pages: 345-352
ISBN (print): 978-1-59593-793-3
Publication status: Published - 2007
Externally published: Yes
Event: Proceedings of the 24th international conference on Machine learning - ICML 2007 - Corvallis, OR, United States
Duration: 20.06.2007 - 24.06.2007
Conference number: 24
https://dl.acm.org/doi/proceedings/10.1145/1273496
