Systematic feature evaluation for gene name recognition

Research output: Journal contributions › Journal articles › Research › peer-review

Authors

  • Jörg Hakenberg
  • Steffen Bickel
  • Conrad Plake
  • Ulf Brefeld
  • Hagen Zahn
  • Lukas Faulstich
  • Ulf Leser
  • Tobias Scheffer

In task 1A of the BioCreAtIvE evaluation, systems had to be devised that recognize words and phrases forming gene or protein names in natural language sentences. We approach this problem by building a word classification system based on a sliding window approach with a Support Vector Machine (SVM), combined with pattern-based post-processing for the recognition of phrases. The performance of such a system crucially depends on the type of features chosen for consideration by the classification method, such as pre- or postfixes, character n-grams, patterns of capitalization, or classification of preceding or following words. We present a systematic approach to evaluate the performance of different feature sets based on recursive feature elimination (RFE). Based on a systematic reduction of the number of features used by the system, we can quantify the impact of different feature sets on the results of the word classification problem. This helps us to identify descriptive features, to learn about the structure of the problem, and to design systems that are faster and easier to understand. We observe that the SVM is robust to redundant features. RFE improves the performance by 0.7% compared to using the complete set of attributes. Moreover, a performance that is only 2.3% below this maximum can be obtained using fewer than 5% of the features.
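
The idea of a sliding window over tokens combined with recursive feature elimination can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the article: the function names (`shape`, `window_features`, `train_linear`, `rfe`), the window size, the elimination schedule, and the toy sentence are hypothetical, and a plain hinge-loss linear model trained by subgradient steps stands in for a real SVM solver. The feature types loosely follow those named in the abstract (prefixes, suffixes, character n-grams, capitalization patterns).

```python
from typing import Dict, List, Set


def shape(word: str) -> str:
    """Collapse a word into a capitalization pattern, e.g. 'BRCA1' -> 'A0'."""
    out: List[str] = []
    for ch in word:
        c = "A" if ch.isupper() else "a" if ch.islower() else "0" if ch.isdigit() else "-"
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def window_features(tokens: List[str], i: int, window: int = 1) -> Dict[str, float]:
    """Binary features for token i and its neighbours inside the window."""
    feats: Dict[str, float] = {}
    for off in range(-window, window + 1):
        j = i + off
        if not 0 <= j < len(tokens):
            continue
        w = tokens[j]
        feats[f"{off}:prefix={w[:2].lower()}"] = 1.0
        feats[f"{off}:suffix={w[-2:].lower()}"] = 1.0
        feats[f"{off}:shape={shape(w)}"] = 1.0
        for k in range(len(w) - 2):  # character trigrams
            feats[f"{off}:tri={w[k:k + 3].lower()}"] = 1.0
    return feats


def train_linear(X: List[Dict[str, float]], y: List[int], active: Set[str],
                 epochs: int = 30, lr: float = 0.1, lam: float = 0.01) -> Dict[str, float]:
    """Hinge-loss linear model (stand-in for an SVM), restricted to active features."""
    w = {f: 0.0 for f in active}
    for _ in range(epochs):
        for x, label in zip(X, y):
            margin = label * sum(w[f] * v for f, v in x.items() if f in w)
            for f in w:  # L2 regularization shrinks every weight
                w[f] *= 1.0 - lr * lam
            if margin < 1.0:  # margin violated: step towards the example
                for f, v in x.items():
                    if f in w:
                        w[f] += lr * label * v
    return w


def rfe(X: List[Dict[str, float]], y: List[int], keep: int,
        drop_fraction: float = 0.5) -> Set[str]:
    """Retrain repeatedly, each time dropping the features with the smallest squared weight."""
    active = {f for x in X for f in x}
    while len(active) > keep:
        w = train_linear(X, y, active)
        ranked = sorted(active, key=lambda f: (w[f] ** 2, f))
        n_drop = min(len(active) - keep, max(1, int(len(active) * drop_fraction)))
        active -= set(ranked[:n_drop])
    return active


# Toy, hand-labelled sentence (hypothetical data): +1 = gene/protein token, -1 = other.
tokens = ["The", "BRCA1", "gene", "interacts", "with", "p53", "protein"]
labels = [-1, 1, -1, -1, -1, 1, -1]
X = [window_features(tokens, i) for i in range(len(tokens))]
selected = rfe(X, labels, keep=5)
print(sorted(selected))
```

Dropping a fraction of the remaining features per round, rather than one at a time, keeps the number of retraining passes small when the initial feature set is large. Comparing classification performance at each stage of the elimination is what would allow the impact of individual feature groups to be quantified.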

Original language: English
Article number: S9
Journal: BMC Bioinformatics
Volume: 6
Issue number: SUPPL.1
Number of pages: 11
ISSN: 1471-2105
Publication status: Published - 24.05.2005
Externally published: Yes
