Methodological Foundations for Merging Structured and Unstructured Sources in ML Pipelines

Authors

  • Olusesan Ogundulu, Data Engineer at Alvarez & Marsal Holdings, Tampa, FL, United States

Keywords

data integration, multimodal pipelines, structured sources, unstructured data, machine learning, language models, agent systems, matching metrics, adaptive architectures, data lakes

Abstract

The article presents a theoretical and applied analysis of the methodological foundations for merging structured and unstructured data sources in machine learning systems. The study takes an interdisciplinary approach that combines the architectural design of ML pipelines, data representation theory, and practices for integrating heterogeneous formats. Particular attention is paid to recent publications on Retrieve–Merge–Predict architectures, agent-based discovery systems, and multimodal frameworks built on large language models. Four stable strategies for data merging are identified, ranging from static unification to end-to-end processing within a unified training loop. The importance of selecting matching metrics and adaptation schemes for unstable data streams is demonstrated using the experiments reported by Cappuzzo et al. (2025) and Eltabakh et al. (2023). Special emphasis is placed on the methodological limitations of universal solutions, including the generalization paradox, sensitivity to structural evolution, and the lack of formalized testing scenarios for agent-oriented pipelines. It is shown that the sustainable development of ML architectures requires a shift from linear ETL pipelines to coherent, iterative systems with internal adaptation and feedback from the model to the data. The article will be of interest to researchers in data preparation automation, developers of multimodal ML systems, data engineering specialists, and digital platform architects working with multi-format sources.

References

Cappuzzo, R., Coelho, A., Lefebvre, F., Papotti, P., & Varoquaux, G. (2025). Retrieve, merge, predict: Augmenting tables with data lakes (arXiv:2402.06282). arXiv. https://doi.org/10.48550/arXiv.2402.06282

Carlson, J., & Dell, M. (2025). A unifying framework for robust and efficient inference with unstructured data (arXiv:2505.00282). arXiv. https://doi.org/10.48550/arXiv.2505.00282

D'Alessandro, M., Calabrés, E., & Elkano, M. (2024). A modular end-to-end multimodal learning method for structured and unstructured data (arXiv:2403.04866). arXiv. https://doi.org/10.48550/arXiv.2403.04866

Dritsas, E., & Trigka, M. (2025). Exploring the intersection of machine learning and big data: A survey. Machine Learning and Knowledge Extraction, 7(1), 13. https://doi.org/10.3390/make7010013

Eltabakh, M. Y., Kunjir, M., Elmagarmid, A., & Ahmad, M. S. (2023). Cross modal data discovery over structured and unstructured data lakes (arXiv:2306.00932). arXiv. https://doi.org/10.48550/arXiv.2306.00932

Hilprecht, B., Hammacher, C., Reis, E., Abdelaal, M., & Binnig, C. (2022). DiffML: End-to-end differentiable ML pipelines (arXiv:2207.01269). arXiv. https://doi.org/10.48550/arXiv.2207.01269

Jandoubi, B., & Akhloufi, M. A. (2025). Multimodal artificial intelligence in medical diagnostics. Information, 16(7), 591. https://doi.org/10.3390/info16070591

Li, B., Jiang, G., Li, N., & Song, C. (2024). Research on large-scale structured and unstructured data processing based on a large language model. Preprints. https://doi.org/10.20944/preprints202407.1364.v1

Sedlakova, J., Daniore, P., Horn Wintsch, A., Wolf, M., Stanikic, M., Haag, C., Sieber, C., Schneider, G., Staub, K., Alois Ettlin, D., Grübner, O., Rinaldi, F., von Wyl, V., & University of Zurich Digital Society Initiative (UZH-DSI) Health Community. (2023). Challenges and best practices for digital unstructured data enrichment in health research: A systematic narrative review. PLOS Digital Health, 2(10), e0000347. https://doi.org/10.1371/journal.pdig.0000347

Wu, S.-W., Li, C.-C., Chien, T.-N., & Chu, C.-M. (2024). Integrating structured and unstructured data with BERTopic and machine learning: A comprehensive predictive model for mortality in ICU heart failure patients. Applied Sciences, 14(17), 7546. https://doi.org/10.3390/app14177546

Chandra, R., Lulla, K., & Sirigiri, K. (2025). Automation frameworks for end-to-end testing of large language models (LLMs). Journal of Information Systems Engineering and Management, 10(43s), e464–e472. https://doi.org/10.55278/jisem.2025.10.43s.8400

Chandra, R., Bansal, R., & Lulla, K. (2025). Benchmarking techniques for real-time evaluation of LLMs in production systems. International Journal of Engineering, Science and Information Technology, 5(3), 363–372. https://doi.org/10.52088/ijesty.v5i3.955

Published

2025-09-13