Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers

Research output: Journal contributions > Journal articles > Research > peer-review

Standard

Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers. / Krieger, Felix; Drews, Paul; Funk, Burkhardt.
In: Intelligent Systems with Applications, Vol. 20, 200285, 01.11.2023.

BibTeX

@article{75e4eebe29b84ec990facd0d3a84d8cd,
title = "Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers",
abstract = "Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.",
keywords = "Business informatics, Layout-rich documents, Document analysis, Natural language processing",
author = "Felix Krieger and Paul Drews and Burkhardt Funk",
note = "Funding Information: We acknowledge support by the German Research Foundation (DFG). Publisher Copyright: {\textcopyright} 2023 The Authors",
year = "2023",
month = nov,
day = "1",
doi = "10.1016/j.iswa.2023.200285",
language = "English",
volume = "20",
journal = "Intelligent Systems with Applications",
issn = "2667-3053",
publisher = "Elsevier B.V.",

}

RIS

TY  - JOUR
T1  - Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers
AU  - Krieger, Felix
AU  - Drews, Paul
AU  - Funk, Burkhardt
N1  - Funding Information: We acknowledge support by the German Research Foundation (DFG). Publisher Copyright: © 2023 The Authors
PY  - 2023/11/1
Y1  - 2023/11/1
N2  - Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.
AB  - Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.
KW  - Business informatics
KW  - Layout-rich documents
KW  - Document analysis
KW  - Natural language processing
UR  - http://www.scopus.com/inward/record.url?scp=85174540281&partnerID=8YFLogxK
U2  - 10.1016/j.iswa.2023.200285
DO  - 10.1016/j.iswa.2023.200285
M3  - Journal articles
VL  - 20
JO  - Intelligent Systems with Applications
JF  - Intelligent Systems with Applications
SN  - 2667-3053
M1  - 200285
ER  -
