Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers

Research output: Journal contributions, Journal articles, Research, peer-review

Automating the processing of incoming invoices promises vast efficiency improvements in accounting. Until fully electronic invoice exchange formats are universally adopted, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning is especially helpful for invoices from suppliers who send invoices only infrequently, because the models can capture the semantic and visual cues of invoices and generalize them to previously unseen invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines how training data drawn from such populations affects the predictive quality of different machine-learning approaches for extracting information from invoices. Comparing the approaches, we find that they are affected to varying degrees by skewed layout populations: the accuracy gap between in-sample and out-of-sample layouts is much larger in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts when splitting the data and evaluating the models.
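The abstract does not describe how the pipeline splits data by layout, so the following is only a minimal sketch of one way such a layout-aware split and accuracy-gap evaluation could look. It is not the authors' implementation. The function names `split_by_layout` and `accuracy_gap`, the `(invoice_id, layout_id)` input format, the `holdout_fraction` parameter, and the toy data are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def split_by_layout(invoices, holdout_fraction=0.25, seed=0):
    """Hold out whole layouts so the evaluation set contains unseen layouts.

    `invoices` is a list of (invoice_id, layout_id) pairs. This input format
    is hypothetical and not taken from the article.
    """
    by_layout = defaultdict(list)
    for invoice_id, layout_id in invoices:
        by_layout[layout_id].append(invoice_id)

    # Shuffle layouts (not invoices) so a layout is never split across sets.
    layouts = sorted(by_layout)
    random.Random(seed).shuffle(layouts)
    n_holdout = max(1, round(len(layouts) * holdout_fraction))
    holdout_layouts = set(layouts[:n_holdout])

    train, out_of_sample = [], []
    for layout_id, ids in by_layout.items():
        target = out_of_sample if layout_id in holdout_layouts else train
        target.extend(ids)
    return train, out_of_sample, holdout_layouts


def accuracy_gap(in_sample_correct, out_of_sample_correct):
    """Accuracy on known layouts minus accuracy on unseen layouts."""
    acc_in = sum(in_sample_correct) / len(in_sample_correct)
    acc_out = sum(out_of_sample_correct) / len(out_of_sample_correct)
    return acc_in - acc_out


# Toy, skewed population: layout "A" dominates, the others form the long tail.
toy_invoices = [(f"inv{i}", "A") for i in range(8)] + [
    ("inv8", "B"), ("inv9", "B"), ("inv10", "C"), ("inv11", "D"),
]
train_ids, unseen_ids, unseen_layouts = split_by_layout(toy_invoices)
```

Splitting at the layout level rather than the invoice level is what keeps the held-out invoices "out-of-sample" in the sense used above: a random invoice-level split on a skewed population would mostly test layouts the model has already seen.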
Original language: English
Article number: 200285
Journal: Intelligent Systems with Applications
Volume: 20
Number of pages: 14
ISSN: 2667-3053
DOIs
Publication status: Published - 01.11.2023

Bibliographical note

Funding Information:
We acknowledge support by the German Research Foundation (DFG).

Publisher Copyright:
© 2023 The Authors

    Research areas

  • Business informatics - Layout-rich documents, Document analysis, Natural language processing
