Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers

Research output: Journal contributions › Journal articles › Research › peer-review

Standard

Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers. / Krieger, Felix; Drews, Paul; Funk, Burkhardt.
In: Intelligent Systems with Applications, Vol. 20, 200285, 01.11.2023.

BibTeX

@article{75e4eebe29b84ec990facd0d3a84d8cd,
title = "Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers",
abstract = "Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.",
keywords = "Business informatics, Layout-rich documents, Document analysis, Natural language processing",
author = "Felix Krieger and Paul Drews and Burkhardt Funk",
note = "Funding Information: We acknowledge support by the German Research Foundation (DFG). Publisher Copyright: {\textcopyright} 2023 The Authors",
year = "2023",
month = nov,
day = "1",
doi = "10.1016/j.iswa.2023.200285",
language = "English",
volume = "20",
journal = "Intelligent Systems with Applications",
issn = "2667-3053",
publisher = "Elsevier B.V.",

}

RIS

TY - JOUR

T1 - Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers

AU - Krieger, Felix

AU - Drews, Paul

AU - Funk, Burkhardt

N1 - Funding Information: We acknowledge support by the German Research Foundation (DFG). Publisher Copyright: © 2023 The Authors

PY - 2023/11/1

Y1 - 2023/11/1

N2 - Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.

AB - Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.

KW - Business informatics

KW - Layout-rich documents

KW - Document analysis

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85174540281&partnerID=8YFLogxK

U2 - 10.1016/j.iswa.2023.200285

DO - 10.1016/j.iswa.2023.200285

M3 - Journal articles

VL - 20

JO - Intelligent Systems with Applications

JF - Intelligent Systems with Applications

SN - 2667-3053

M1 - 200285

ER -
