Systematic feature evaluation for gene name recognition
Research output: Journal contributions › Journal articles › Research › peer-review
Standard
In: BMC Bioinformatics, Vol. 6, No. SUPPL.1, S9, 24.05.2005.
RIS
TY - JOUR
T1 - Systematic feature evaluation for gene name recognition
AU - Hakenberg, Jörg
AU - Bickel, Steffen
AU - Plake, Conrad
AU - Brefeld, Ulf
AU - Zahn, Hagen
AU - Faulstich, Lukas
AU - Leser, Ulf
AU - Scheffer, Tobias
PY - 2005/5/24
Y1 - 2005/5/24
AB - In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.
KW - Informatics
KW - Business informatics
UR - http://www.scopus.com/inward/record.url?scp=33947304479&partnerID=8YFLogxK
UR - https://www.mendeley.com/catalogue/483dea0a-b292-3915-b21a-ee2b226bc166/
U2 - 10.1186/1471-2105-6-S1-S9
DO - 10.1186/1471-2105-6-S1-S9
M3 - Journal articles
C2 - 15960843
AN - SCOPUS:33947304479
VL - 6
JO - BMC Bioinformatics
JF - BMC Bioinformatics
SN - 1471-2105
IS - SUPPL.1
M1 - S9
ER -