Systematic feature evaluation for gene name recognition

Jörg Hakenberg; Steffen Bickel; Conrad Plake; Ulf Brefeld; Hagen Zahn; Lukas Faulstich; Ulf Leser; Tobias Scheffer

doi:10.1186/1471-2105-6-S1-S9

Research output: Journal contributions › Journal articles › Research › peer-review

Standard

Systematic feature evaluation for gene name recognition. / Hakenberg, Jörg; Bickel, Steffen; Plake, Conrad et al.
In: BMC Bioinformatics, Vol. 6, No. SUPPL.1, S9, 24.05.2005.

Harvard

Hakenberg, J, Bickel, S, Plake, C, Brefeld, U, Zahn, H, Faulstich, L, Leser, U & Scheffer, T 2005, 'Systematic feature evaluation for gene name recognition', BMC Bioinformatics, vol. 6, no. SUPPL.1, S9. https://doi.org/10.1186/1471-2105-6-S1-S9

APA

Hakenberg, J., Bickel, S., Plake, C., Brefeld, U., Zahn, H., Faulstich, L., Leser, U., & Scheffer, T. (2005). Systematic feature evaluation for gene name recognition. BMC Bioinformatics, 6(SUPPL.1), Article S9. https://doi.org/10.1186/1471-2105-6-S1-S9

Vancouver

Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L et al. Systematic feature evaluation for gene name recognition. BMC Bioinformatics. 2005 May 24;6(SUPPL.1):S9. doi: 10.1186/1471-2105-6-S1-S9

Bibtex

@article{145a0b45e5f644dca22d39068f96349a,
  title = "Systematic feature evaluation for gene name recognition",
  abstract = "In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.",
  keywords = "Informatics, Business informatics",
  author = "J{\"o}rg Hakenberg and Steffen Bickel and Conrad Plake and Ulf Brefeld and Hagen Zahn and Lukas Faulstich and Ulf Leser and Tobias Scheffer",
  year = "2005",
  month = may,
  day = "24",
  doi = "10.1186/1471-2105-6-S1-S9",
  language = "English",
  volume = "6",
  journal = "BMC Bioinformatics",
  issn = "1471-2105",
  publisher = "BioMed Central Ltd.",
  number = "SUPPL.1",
}

RIS

TY - JOUR
T1 - Systematic feature evaluation for gene name recognition
AU - Hakenberg, Jörg
AU - Bickel, Steffen
AU - Plake, Conrad
AU - Brefeld, Ulf
AU - Zahn, Hagen
AU - Faulstich, Lukas
AU - Leser, Ulf
AU - Scheffer, Tobias
PY - 2005/5/24
Y1 - 2005/5/24
N2 - In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.
KW - Informatics
KW - Business informatics
UR - http://www.scopus.com/inward/record.url?scp=33947304479&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/483dea0a-b292-3915-b21a-ee2b226bc166/
U2 - 10.1186/1471-2105-6-S1-S9
DO - 10.1186/1471-2105-6-S1-S9
M3 - Journal articles
C2 - 15960843
AN - SCOPUS:33947304479
VL - 6
JO - BMC Bioinformatics
JF - BMC Bioinformatics
SN - 1471-2105
IS - SUPPL.1
M1 - S9
ER -
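
Illustrative sketch

The abstract describes recursive feature elimination (RFE) around a Support Vector Machine: train on the current feature set, discard the least useful features, and repeat. The Python sketch below illustrates that general loop only and is not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the toy data, the Pegasos-style stochastic subgradient trainer standing in for a full SVM solver, the squared-weight ranking criterion (a common choice for linear SVMs), and the schedule of dropping half of the remaining features per round. The abstract does not state the ranking criterion or elimination schedule used in the paper.

```python
import random


def train_linear_svm(X, y, epochs=50, lam=0.01, seed=0):
    """Train a linear SVM (hinge loss, L2 penalty, no bias term) with
    Pegasos-style stochastic subgradient steps. Labels must be -1 or +1."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    order = list(range(len(X)))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            # Shrink the weights (gradient of the L2 penalty).
            w = [(1.0 - eta * lam) * wj for wj in w]
            # Hinge-loss subgradient step for examples inside the margin.
            if margin < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def recursive_feature_elimination(X, y, n_keep, drop_fraction=0.5):
    """Return the indices of the n_keep features that survive RFE."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        # Retrain on the currently active features only.
        X_sub = [[row[j] for j in active] for row in X]
        w = train_linear_svm(X_sub, y)
        # Rank positions by squared weight, least important first.
        ranked = sorted(range(len(active)), key=lambda k: w[k] ** 2)
        n_drop = min(max(1, int(len(active) * drop_fraction)),
                     len(active) - n_keep)
        dropped = set(ranked[:n_drop])
        active = [f for k, f in enumerate(active) if k not in dropped]
    return active


def make_toy_data(n=300, n_noise=4, seed=42):
    """Hypothetical data: features 0 and 1 carry the label, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([-1, 1])
        informative = [label * rng.uniform(0.5, 1.5) for _ in range(2)]
        noise = [rng.uniform(-1.0, 1.0) for _ in range(n_noise)]
        X.append(informative + noise)
        y.append(label)
    return X, y


X, y = make_toy_data()
selected = recursive_feature_elimination(X, y, n_keep=2)
print(selected)  # indices of the surviving features
```

On this toy data the two label-carrying features should outlast the noise features. In the setting the abstract describes, the columns of X would instead hold word-level features such as character n-grams or capitalization patterns computed over a sliding window.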

DOI

https://doi.org/10.1186/1471-2105-6-S1-S9
Final published version
