TraceSim: An Alignment Method for Computing Stack Trace Similarity

Irving Muller Rodrigues; Aleksandr Khvorov; Daniel Aloise; Roman Vasiliev; Dmitrij Koznov; Eraldo Rezende Fernandes; George Chernishev; Dmitry Luciv; Nikita Povarov

doi:10.1007/s10664-021-10070-w

Research output: Journal contributions › Journal articles › Research › peer-review

Standard

TraceSim: An Alignment Method for Computing Stack Trace Similarity. / Rodrigues, Irving Muller; Khvorov, Aleksandr; Aloise, Daniel et al.
In: Empirical Software Engineering, Vol. 27, No. 2, 53, 01.03.2022.

Harvard

Rodrigues, IM, Khvorov, A, Aloise, D, Vasiliev, R, Koznov, D, Fernandes, ER, Chernishev, G, Luciv, D & Povarov, N 2022, 'TraceSim: An Alignment Method for Computing Stack Trace Similarity', Empirical Software Engineering, vol. 27, no. 2, 53. https://doi.org/10.1007/s10664-021-10070-w

APA

Rodrigues, I. M., Khvorov, A., Aloise, D., Vasiliev, R., Koznov, D., Fernandes, E. R., Chernishev, G., Luciv, D., & Povarov, N. (2022). TraceSim: An Alignment Method for Computing Stack Trace Similarity. Empirical Software Engineering, 27(2), Article 53. https://doi.org/10.1007/s10664-021-10070-w

Vancouver

Rodrigues IM, Khvorov A, Aloise D, Vasiliev R, Koznov D, Fernandes ER et al. TraceSim: An Alignment Method for Computing Stack Trace Similarity. Empirical Software Engineering. 2022 Mar 1;27(2):53. doi: 10.1007/s10664-021-10070-w

Bibtex

@article{706e6936dc8f4948b53e7e71e3ee6ca0,

title = "TraceSim: An Alignment Method for Computing Stack Trace Similarity",

abstract = "Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicate, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic grouping of duplicate crash reports can significantly ease and speed up analysis of software failures. This task is known as crash report deduplication. Given a huge volume of incoming reports, increasing quality of deduplication is an important task. The majority of studies address it via information retrieval or sequence matching methods based on the similarity of stack traces from two crash reports. While information retrieval methods disregard the position of a frame in a stack trace, the existing works based on sequence matching algorithms do not fully consider subroutine global frequency and unmatched frames. Besides, due to data distribution differences among software projects, parameters that are learned using machine learning algorithms are necessary to provide more flexibility to the methods. In this paper, we propose TraceSim – an approach for crash report deduplication which combines TF-IDF, optimum global alignment, and machine learning (ML) in a novel way. Moreover, we propose a new evaluation methodology for this task that is more comprehensive and robust than previously used evaluation approaches. TraceSim significantly outperforms seven baselines and state-of-the-art methods in the majority of the scenarios. It is the only approach that achieves competitive results on all datasets regarding all considered metrics. Moreover, we conduct an extensive ablation study that demonstrates the importance of each TraceSim{\textquoteright}s element to its final performance and robustness. Finally, we provide the source code for all considered methods and evaluation methodology as well as the created datasets.",

keywords = "Automatic crash reporting, Crash report deduplication, Duplicate crash report, Duplicate crash report detection, Stack trace, Business informatics",

author = "Rodrigues, {Irving Muller} and Aleksandr Khvorov and Daniel Aloise and Roman Vasiliev and Dmitrij Koznov and Fernandes, {Eraldo Rezende} and George Chernishev and Dmitry Luciv and Nikita Povarov",

note = "We would like to gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), Ericsson, Ciena, and EfficiOS for funding this project. Moreover, this research was enabled in part by the support provided by WestGrid (https://www.westgrid.ca/) and Compute Canada (www.computecanada.ca).",

year = "2022",

month = mar,

day = "1",

doi = "10.1007/s10664-021-10070-w",

language = "English",

volume = "27",

journal = "Empirical Software Engineering",

issn = "1382-3256",

publisher = "Springer Netherlands",

number = "2",

}

RIS

TY - JOUR

T1 - TraceSim

T2 - An Alignment Method for Computing Stack Trace Similarity

AU - Rodrigues, Irving Muller

AU - Khvorov, Aleksandr

AU - Aloise, Daniel

AU - Vasiliev, Roman

AU - Koznov, Dmitrij

AU - Fernandes, Eraldo Rezende

AU - Chernishev, George

AU - Luciv, Dmitry

AU - Povarov, Nikita

N1 - We would like to gratefully acknowledge the Natural Sciences and Engineering Research Council of Canada (NSERC), Ericsson, Ciena, and EfficiOS for funding this project. Moreover, this research was enabled in part by the support provided by WestGrid (https://www.westgrid.ca/) and Compute Canada (www.computecanada.ca).

PY - 2022/3/1

Y1 - 2022/3/1

N2 - Software systems can automatically submit crash reports to a repository for investigation when program failures occur. A significant portion of these crash reports are duplicate, i.e., they are caused by the same software issue. Therefore, if the volume of submitted reports is very large, automatic grouping of duplicate crash reports can significantly ease and speed up analysis of software failures. This task is known as crash report deduplication. Given a huge volume of incoming reports, increasing quality of deduplication is an important task. The majority of studies address it via information retrieval or sequence matching methods based on the similarity of stack traces from two crash reports. While information retrieval methods disregard the position of a frame in a stack trace, the existing works based on sequence matching algorithms do not fully consider subroutine global frequency and unmatched frames. Besides, due to data distribution differences among software projects, parameters that are learned using machine learning algorithms are necessary to provide more flexibility to the methods. In this paper, we propose TraceSim – an approach for crash report deduplication which combines TF-IDF, optimum global alignment, and machine learning (ML) in a novel way. Moreover, we propose a new evaluation methodology for this task that is more comprehensive and robust than previously used evaluation approaches. TraceSim significantly outperforms seven baselines and state-of-the-art methods in the majority of the scenarios. It is the only approach that achieves competitive results on all datasets regarding all considered metrics. Moreover, we conduct an extensive ablation study that demonstrates the importance of each TraceSim’s element to its final performance and robustness. Finally, we provide the source code for all considered methods and evaluation methodology as well as the created datasets.

KW - Automatic crash reporting

KW - Crash report deduplication

KW - Duplicate crash report

KW - Duplicate crash report detection

KW - Stack trace

KW - Business informatics

UR - http://www.scopus.com/inward/record.url?scp=85125623765&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/fec2bb19-97d3-3284-92d4-c1a34b8447fe/

U2 - 10.1007/s10664-021-10070-w

DO - 10.1007/s10664-021-10070-w

M3 - Journal articles

AN - SCOPUS:85125623765

VL - 27

JO - Empirical Software Engineering

JF - Empirical Software Engineering

SN - 1382-3256

IS - 2

M1 - 53

ER -

Other publications by the same author(s)

Data practices in apps from Brazil: What do privacy policies inform us about?

Quadros dos Reis, V., Rabello, M. E. R., Lima, A. C., Jardim, G. P. S., Fernandes, E. R. & Brefeld, U., 10.02.2023, In: Journal on Interactive Systems. 14, 1, p. 1-8 8 p.

Research output: Journal contributions › Journal articles › Research › peer-review

Entity Extraction from Portuguese Legal Documents Using Distant Supervision

Navarezi, L. M., Sakiyama, K., Rodrigues, L. S., Robaldo, C. M. O., Lobato, G. R., Vilela, P. A., Matsubara, E. T. & Fernandes, E. R., 2022, Computational Processing of the Portuguese Language : 15th International Conference, PROPOR 2022, Fortaleza, Brazil, March 21-23, 2022, Proceedings. Pinheiro, V., Gamallo, P., Amaro, R., Scarton, C., Batista, F., Silva, D., Magro, C. & Pinto, H. (eds.). Cham: Springer Nature Switzerland AG, p. 166-176 11 p. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); vol. 13208 LNAI).

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

FaST: A linear time stack trace alignment heuristic for crash report deduplication

Rodrigues, I. M., Aloise, D. & Fernandes, E. R., 17.10.2022, The 2022 Mining Software Repositories Conference: MSR 2022, Proceedings; 18-20 May 2022, Virtual; 23-24 May 2022, Pittsburgh, Pennsylvania. New York: Institute of Electrical and Electronics Engineers Inc., p. 549-560 12 p. (Proceedings - IEEE/ACM International Conference on Mining Software Repositories).

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

Performance predictors for graphics processing units applied to dark-silicon-aware design space exploration

Sonohata, R., Arigoni, D. C. A., Fernandes, E. R., Ribeiro dos Santos, R. & Dessandre Duenha, L., 01.08.2023, In: Concurrency and Computation: Practice and Experience. 35, 17, 16 p., e6877.

Research output: Journal contributions › Journal articles › Research › peer-review

Rhetorical Role Identification for Portuguese Legal Documents

Aragy, R., Fernandes, E. R. & Caceres, E. N., 2021, Intelligent Systems: 10th Brazilian Conference, BRACIS 2021, Virtual Event, November 29 – December 3, 2021, Proceedings, Part II. Britto, A. & Valdivia Delgado, K. (eds.). Cham: Springer Schweiz, p. 557-571 15 p. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); vol. 13074 LNAI).

Research output: Contributions to collected editions/works › Article in conference proceedings › Research › peer-review

DOI

https://doi.org/10.1007/s10664-021-10070-w
Final published version