TextCSN: A semi-supervised approach for text clustering using pairwise constraints and convolutional siamese network

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Standard

TextCSN: A semi-supervised approach for text clustering using pairwise constraints and convolutional siamese network. / Vilhagra, Lucas Akayama; Fernandes, Eraldo Rezende; Nogueira, Bruno Magalhães.
The 35th Annual ACM Symposium on Applied Computing: Brno, Czech Republic, March 30 - April 3, 2020. New York: Association for Computing Machinery, Inc, 2020. p. 1135-1142.

Harvard

Vilhagra, LA, Fernandes, ER & Nogueira, BM 2020, TextCSN: A semi-supervised approach for text clustering using pairwise constraints and convolutional siamese network. in The 35th Annual ACM Symposium on Applied Computing: Brno, Czech Republic, March 30 - April 3, 2020. Association for Computing Machinery, Inc, New York, pp. 1135-1142, Annual ACM Symposium on Applied Computing - SAC 2020, Brno, Czech Republic, 30.03.20. https://doi.org/10.1145/3341105.3374018

APA

Vilhagra, L. A., Fernandes, E. R., & Nogueira, B. M. (2020). TextCSN: A semi-supervised approach for text clustering using pairwise constraints and convolutional siamese network. In The 35th Annual ACM Symposium on Applied Computing: Brno, Czech Republic, March 30 - April 3, 2020 (pp. 1135-1142). Association for Computing Machinery, Inc. https://doi.org/10.1145/3341105.3374018

Vancouver

Vilhagra LA, Fernandes ER, Nogueira BM. TextCSN: A semi-supervised approach for text clustering using pairwise constraints and convolutional siamese network. In The 35th Annual ACM Symposium on Applied Computing: Brno, Czech Republic, March 30 - April 3, 2020. New York: Association for Computing Machinery, Inc. 2020. p. 1135-1142. doi: 10.1145/3341105.3374018

Bibtex

@inproceedings{23c6deb5967d46eca3e7d8edcecf05e4,
title = "TextCSN: A semi-supervised approach for text clustering using pairwise constraints and convolutional siamese network",
abstract = "Clustering is a key problem in several applications. Although this task is originally unsupervised, there are many proposals leveraging different supervision signals in order to improve clustering performance. Some semi-supervised clustering methods employ pairwise constraints to inform the learning algorithm about pairs of instances that should be in the same cluster (must-link constraints or similar instances) and pairs that should be in different clusters (cannot-link constraints or dissimilar instances). In many applications, providing pairwise constraints is cheaper than asking users for explicit labels on the data. More recently, deep clustering methods have been proposed in the literature. Such methods consist of learning a deep neural representation of the input data in order to improve clustering. In this paper, we present TextCSN, a deep clustering approach that combines (i) a Convolutional Siamese Network (CSN) based on pairwise constraints to perform representation learning and (ii) the traditional K-Means algorithm for unsupervised clustering using the learned representation. As far as we know, this is the first semi-supervised deep learning method based on pairwise constraints applied to text clustering. By means of eight text clustering tasks, we assess our approach by comparing it with two baselines: MPC-KMeans, a semi-supervised clustering algorithm, and the ordinary K-Means algorithm. Results indicate that the proposed approach outperforms the baselines on six of these datasets, and its performance increases with the number of constraints provided.",
keywords = "Deep clustering, Neural networks, Representation learning, Semi-supervised clustering, Text clustering, Informatics, Business informatics",
author = "Vilhagra, {Lucas Akayama} and Fernandes, {Eraldo Rezende} and Nogueira, {Bruno Magalh{\~a}es}",
year = "2020",
month = mar,
day = "30",
doi = "10.1145/3341105.3374018",
language = "English",
pages = "1135--1142",
booktitle = "The 35th Annual ACM Symposium on Applied Computing",
publisher = "Association for Computing Machinery, Inc",
address = "New York",
note = "Annual ACM Symposium on Applied Computing - SAC 2020 ; Conference date: 30-03-2020 Through 03-04-2020",
url = "https://www.sigapp.org/sac/sac2020/",
}

RIS

TY - CHAP
T1 - TextCSN
T2 - Annual ACM Symposium on Applied Computing - SAC 2020
AU - Vilhagra, Lucas Akayama
AU - Fernandes, Eraldo Rezende
AU - Nogueira, Bruno Magalhães
N1 - Conference code: 35
PY - 2020/3/30
Y1 - 2020/3/30
N2 - Clustering is a key problem in several applications. Although this task is originally unsupervised, there are many proposals leveraging different supervision signals in order to improve clustering performance. Some semi-supervised clustering methods employ pairwise constraints to inform the learning algorithm about pairs of instances that should be in the same cluster (must-link constraints or similar instances) and pairs that should be in different clusters (cannot-link constraints or dissimilar instances). In many applications, providing pairwise constraints is cheaper than asking users for explicit labels on the data. More recently, deep clustering methods have been proposed in the literature. Such methods consist of learning a deep neural representation of the input data in order to improve clustering. In this paper, we present TextCSN, a deep clustering approach that combines (i) a Convolutional Siamese Network (CSN) based on pairwise constraints to perform representation learning and (ii) the traditional K-Means algorithm for unsupervised clustering using the learned representation. As far as we know, this is the first semi-supervised deep learning method based on pairwise constraints applied to text clustering. By means of eight text clustering tasks, we assess our approach by comparing it with two baselines: MPC-KMeans, a semi-supervised clustering algorithm, and the ordinary K-Means algorithm. Results indicate that the proposed approach outperforms the baselines on six of these datasets, and its performance increases with the number of constraints provided.
KW - Deep clustering
KW - Neural networks
KW - Representation learning
KW - Semi-supervised clustering
KW - Text clustering
KW - Informatics
KW - Business informatics
UR - http://www.scopus.com/inward/record.url?scp=85083023584&partnerID=8YFLogxK
U2 - 10.1145/3341105.3374018
DO - 10.1145/3341105.3374018
M3 - Article in conference proceedings
AN - SCOPUS:85083023584
SP - 1135
EP - 1142
BT - The 35th Annual ACM Symposium on Applied Computing
PB - Association for Computing Machinery, Inc
CY - New York
Y2 - 30 March 2020 through 3 April 2020
ER -
