Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers

Research output: Journal contributions, Journal articles, Research, peer-review

Standard

Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers. / Krieger, Felix; Drews, Paul; Funk, Burkhardt.
In: Intelligent Systems with Applications, Vol. 20, 200285, 01.11.2023.

Bibtex

@article{75e4eebe29b84ec990facd0d3a84d8cd,
title = "Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers",
abstract = "Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.",
keywords = "Business informatics, Layout-rich documents, Document analysis, Natural language processing",
author = "Felix Krieger and Paul Drews and Burkhardt Funk",
note = "Funding Information: We acknowledge support by the German Research Foundation (DFG). Publisher Copyright: {\textcopyright} 2023 The Authors",
year = "2023",
month = nov,
day = "1",
doi = "10.1016/j.iswa.2023.200285",
language = "English",
volume = "20",
journal = "Intelligent Systems with Applications",
issn = "2667-3053",
publisher = "Elsevier B.V.",
}

RIS

TY - JOUR

T1 - Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers

AU - Krieger, Felix

AU - Drews, Paul

AU - Funk, Burkhardt

N1 - Funding Information: We acknowledge support by the German Research Foundation (DFG). Publisher Copyright: © 2023 The Authors

PY - 2023/11/1

Y1 - 2023/11/1

N2 - Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.

AB - Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.

KW - Business informatics

KW - Layout-rich documents

KW - Document analysis

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85174540281&partnerID=8YFLogxK

U2 - 10.1016/j.iswa.2023.200285

DO - 10.1016/j.iswa.2023.200285

M3 - Journal articles

VL - 20

JO - Intelligent Systems with Applications

JF - Intelligent Systems with Applications

SN - 2667-3053

M1 - 200285

ER -
