Estimation of minimal data sets sizes for machine learning predictions in digital mental health interventions

Research output: Journal contributions › Journal articles › Research › peer-review

Authors

  • Kirsten Zantvoort
  • Barbara Nacke
  • Dennis Görlich
  • Silvan Hornstein
  • Corinna Jacobi
  • Burkhardt Funk

Artificial intelligence promises to revolutionize mental health care, but small dataset sizes and lack of robust methods raise concerns about result generalizability. To provide insights on minimal necessary data set sizes, we explore domain-specific learning curves for digital intervention dropout predictions based on 3654 users from a single study (ISRCTN13716228, 26/02/2016). Prediction performance is analyzed based on dataset size (N = 100–3654), feature groups (F = 2–129), and algorithm choice (from Naive Bayes to Neural Networks). The results substantiate the concern that small datasets (N ≤ 300) overestimate predictive power. For uninformative feature groups, in-sample prediction performance was negatively correlated with dataset size. Sophisticated models overfitted in small datasets but maximized holdout test results in larger datasets. While N = 500 mitigated overfitting, performance did not converge until N = 750–1500. Consequently, we propose minimum dataset sizes of N = 500–1000. As such, this study offers an empirical reference for researchers designing or interpreting AI studies on Digital Mental Health Intervention data.
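The finding that in-sample performance on uninformative features falls as the dataset grows can be illustrated with a small simulation. The sketch below is not the authors' pipeline or data. It assumes purely synthetic Gaussian noise features, random binary labels, and a simple nearest-centroid classifier as a stand-in model. All function names and parameters (`make_noise_data`, `fit_centroids`, `learning_curve`, `n_features`, `n_test`, `repeats`) are hypothetical and chosen only for illustration.

```python
import random


def make_noise_data(n, n_features, rng):
    """Draw n samples whose features carry no information about the label."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    return X, y


def fit_centroids(X, y):
    """Nearest-centroid 'training': compute the mean feature vector per class."""
    n_features = len(X[0])
    sums = {0: [0.0] * n_features, 1: [0.0] * n_features}
    counts = {0: 0, 1: 0}
    for row, label in zip(X, y):
        counts[label] += 1
        for j, value in enumerate(row):
            sums[label][j] += value
    # max(..., 1) guards against an empty class in very small samples
    return {c: [s / max(counts[c], 1) for s in sums[c]] for c in (0, 1)}


def predict(centroids, row):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return 0 if sqdist(centroids[0]) <= sqdist(centroids[1]) else 1


def accuracy(centroids, X, y):
    hits = sum(predict(centroids, row) == label for row, label in zip(X, y))
    return hits / len(y)


def learning_curve(sizes, n_features=20, n_test=500, repeats=5, seed=0):
    """Return (N, mean in-sample accuracy, mean holdout accuracy) per size."""
    rng = random.Random(seed)
    curve = []
    for n in sizes:
        train_acc = 0.0
        test_acc = 0.0
        for _ in range(repeats):
            X_train, y_train = make_noise_data(n, n_features, rng)
            X_test, y_test = make_noise_data(n_test, n_features, rng)
            model = fit_centroids(X_train, y_train)
            train_acc += accuracy(model, X_train, y_train)
            test_acc += accuracy(model, X_test, y_test)
        curve.append((n, train_acc / repeats, test_acc / repeats))
    return curve


if __name__ == "__main__":
    for n, train_acc, test_acc in learning_curve([100, 300, 500, 1000, 2000]):
        print(f"N={n:5d}  in-sample={train_acc:.3f}  holdout={test_acc:.3f}")
```

Because the labels are independent of the features, holdout accuracy should stay near chance at every size, while in-sample accuracy is inflated for small N and shrinks toward chance as N grows. The sizes and feature count used here are arbitrary and do not reproduce the thresholds reported in the abstract.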

Original language: English
Article number: 361
Journal: npj Digital Medicine
Volume: 7
Issue number: 1
Number of pages: 10
DOIs
Publication status: Published - 12.2024

Bibliographical note

Publisher Copyright:
© The Author(s) 2024.
