Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers
Publication: Contributions to journals › Journal articles › Research › peer-reviewed
Standard
In: Intelligent Systems with Applications, Vol. 20, 200285, 01.11.2023.
RIS
TY - JOUR
T1 - Automated Invoice Processing: Machine Learning-Based Information Extraction for Long Tail Suppliers
AU - Krieger, Felix
AU - Drews, Paul
AU - Funk, Burkhardt
N1 - Funding Information: We acknowledge support by the German Research Foundation (DFG). Publisher Copyright: © 2023 The Authors
PY - 2023/11/1
Y1 - 2023/11/1
AB - Automation of incoming invoices processing promises to yield vast efficiency improvements in accounting. Until a universal adoption of fully electronic invoice exchange formats has been achieved, machine learning can help bridge the adoption gaps in electronic invoicing by extracting structured information from unstructured invoice formats. Machine learning especially helps the processing of invoices of suppliers who only send invoices infrequently, as the models are able to capture the semantic and visual cues of invoices and generalize them to previously unknown invoice layouts. Since the population of invoices in many companies is skewed toward a few frequent suppliers and their layouts, this research examines the effects of training data taken from such populations on the predictive quality of different machine-learning approaches for the extraction of information from invoices. Comparing the different approaches, we find that they are affected to varying degrees by skewed layout populations: The accuracy gap between in-sample and out-of-sample layouts is much higher in the Chargrid and random forest models than in the LayoutLM transformer model, which also exhibits the best overall predictive quality. To arrive at this finding, we designed and implemented a research pipeline that pays special attention to the distribution of layouts in the splitting of data and the evaluation of the models.
KW - Business informatics
KW - Layout-rich documents
KW - Document analysis
KW - Natural language processing
UR - http://www.scopus.com/inward/record.url?scp=85174540281&partnerID=8YFLogxK
U2 - 10.1016/j.iswa.2023.200285
DO - 10.1016/j.iswa.2023.200285
M3 - Journal articles
VL - 20
JO - Intelligent Systems with Applications
JF - Intelligent Systems with Applications
SN - 2667-3053
M1 - 200285
ER -