Systematic feature evaluation for gene name recognition

Research output: Journal contributions > Journal articles > Research > peer-review

Standard

Systematic feature evaluation for gene name recognition. / Hakenberg, Jörg; Bickel, Steffen; Plake, Conrad et al.
In: BMC Bioinformatics, Vol. 6, No. SUPPL.1, S9, 24.05.2005.

Harvard

Hakenberg, J, Bickel, S, Plake, C, Brefeld, U, Zahn, H, Faulstich, L, Leser, U & Scheffer, T 2005, 'Systematic feature evaluation for gene name recognition', BMC Bioinformatics, vol. 6, no. SUPPL.1, S9. https://doi.org/10.1186/1471-2105-6-S1-S9

APA

Hakenberg, J., Bickel, S., Plake, C., Brefeld, U., Zahn, H., Faulstich, L., Leser, U., & Scheffer, T. (2005). Systematic feature evaluation for gene name recognition. BMC Bioinformatics, 6(SUPPL.1), Article S9. https://doi.org/10.1186/1471-2105-6-S1-S9

Vancouver

Hakenberg J, Bickel S, Plake C, Brefeld U, Zahn H, Faulstich L et al. Systematic feature evaluation for gene name recognition. BMC Bioinformatics. 2005 May 24;6(SUPPL.1):S9. doi: 10.1186/1471-2105-6-S1-S9

Bibtex

@article{145a0b45e5f644dca22d39068f96349a,
title = "Systematic feature evaluation for gene name recognition",
abstract = "In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.",
keywords = "Informatics, Business informatics",
author = "J{\"o}rg Hakenberg and Steffen Bickel and Conrad Plake and Ulf Brefeld and Hagen Zahn and Lukas Faulstich and Ulf Leser and Tobias Scheffer",
year = "2005",
month = may,
day = "24",
doi = "10.1186/1471-2105-6-S1-S9",
language = "English",
volume = "6",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central Ltd.",
number = "SUPPL.1",

}

RIS

TY - JOUR

T1 - Systematic feature evaluation for gene name recognition

AU - Hakenberg, Jörg

AU - Bickel, Steffen

AU - Plake, Conrad

AU - Brefeld, Ulf

AU - Zahn, Hagen

AU - Faulstich, Lukas

AU - Leser, Ulf

AU - Scheffer, Tobias

PY - 2005/5/24

Y1 - 2005/5/24

N2 - In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine, combined with a pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination, RFE. Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7%, compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.

KW - Informatics

KW - Business informatics

UR - http://www.scopus.com/inward/record.url?scp=33947304479&partnerID=8YFLogxK

UR - https://www.mendeley.com/catalogue/483dea0a-b292-3915-b21a-ee2b226bc166/

U2 - 10.1186/1471-2105-6-S1-S9

DO - 10.1186/1471-2105-6-S1-S9

M3 - Journal articles

C2 - 15960843

AN - SCOPUS:33947304479

VL - 6

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

IS - SUPPL.1

M1 - S9

ER -
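
The abstract describes recursive feature elimination (RFE): train the classifier, discard the least useful features, and repeat on the reduced set. The following is a minimal sketch of that loop, not the authors' implementation. It assumes a small linear SVM trained by subgradient descent on the hinge loss, removes one feature per round, and ranks features by squared weight; the function names, hyperparameters, and toy data are all hypothetical.

```python
def train_linear_svm(X, y, features, epochs=50, lr=0.1, lam=0.01):
    """Fit a linear SVM restricted to the given feature indices.

    Uses plain subgradient descent on the L2-regularized hinge loss.
    Labels must be +1 or -1. Returns (weights keyed by feature index, bias).
    """
    w = {f: 0.0 for f in features}
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[f] * xi[f] for f in features) + b)
            violated = margin < 1  # sample lies inside the margin or is misclassified
            for f in features:
                grad = lam * w[f] - (yi * xi[f] if violated else 0.0)
                w[f] -= lr * grad
            if violated:
                b += lr * yi
    return w, b


def rfe(X, y, n_keep):
    """Recursive feature elimination, one feature per round.

    Assumption: the ranking criterion is the squared SVM weight, so the
    feature with the smallest w**2 is dropped before retraining.
    Returns (surviving feature indices, eliminated indices in removal order).
    """
    features = list(range(len(X[0])))
    eliminated = []
    while len(features) > n_keep:
        w, _ = train_linear_svm(X, y, features)
        worst = min(features, key=lambda f: w[f] ** 2)
        features.remove(worst)
        eliminated.append(worst)
    return features, eliminated


# Hypothetical toy data: column 0 separates the classes, column 1 is
# constant (uninformative), column 2 is a scaled-down copy of column 0.
X = [[2.0, 0.0, 0.2], [1.0, 0.0, 0.1], [-1.0, 0.0, -0.1], [-2.0, 0.0, -0.2]]
y = [1, 1, -1, -1]

selected, eliminated = rfe(X, y, n_keep=1)
print("kept:", selected, "eliminated in order:", eliminated)
```

On this toy data the constant column is removed first and the scaled-down copy second, leaving only column 0. Recording the classifier's score after each round would give the kind of performance-versus-feature-count curve the abstract refers to.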