Preface: From January 2016 to July 2016, I wrote my master’s thesis at fme in Brunswick, Germany. In the following blog post I summarize my thesis and thank everyone who supported me, especially my advisor and the IT department.
Tobias Stein, September 2016
Exponential data growth rates make the management of content an enormous challenge for organizations. Terms like »information overload« and »content chaos« express how inefficient content management has become on an enterprise-wide scale. Employees search for documents in different versions, languages and formats across various repositories and systems throughout the entire company. Yet these documents contain important, decision-relevant information that is becoming a key business resource. Moreover, complying with statutory regulations and meeting the technical requirements for storing huge amounts of data is complicated.
Many organizations have adopted Enterprise Content Management (ECM) solutions to overcome this information overload and data complexity on an enterprise-wide scale. Nonetheless, it remains difficult to extract useful and relevant information from large data collections with the established technologies. Alongside the tremendous data growth, new technologies have emerged, such as the Apache Hadoop framework for distributed storage and processing of very large data sets on computer clusters. Hadoop comes with a distributed filesystem and the MapReduce programming model, which has been enhanced by Apache Spark – another framework for fast, fault-tolerant and large-scale data processing. These technologies, especially Spark’s machine learning library MLlib, have gained a lot of attention. In my opinion, they are well suited to analyzing the unstructured data residing in ECM systems and to improving on the traditional technologies. At the same time, the Cloud Computing paradigm is moving on-premise software into the Cloud, where it is provided as a service in a subscription model. It has become easy to get access to vast amounts of computing resources on demand. So far, academia has not considered the potential impact of Cloud Computing and new Big Data technologies on ECM, even though it was identified as an emerging topic by Alalwan & Weistroffer in 2012.
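To give a feel for the MapReduce programming model mentioned above, here is a minimal, single-machine sketch of its two phases using the classic word-count example. This is purely illustrative – it is not Hadoop code and not part of the thesis; the function names are my own:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield word, 1

def reduce_phase(pairs):
    """Reduce: group pairs by key and sum the counts per word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Word counts over a tiny "corpus" of two documents.
counts = reduce_phase(map_phase(["big data", "big cluster"]))
# counts == {"big": 2, "data": 1, "cluster": 1}
```

In a real Hadoop or Spark deployment, the map and reduce steps run in parallel on many cluster nodes, with the framework handling data distribution and fault tolerance.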
Against this background, my thesis investigated whether the Apache Hadoop framework is capable of analyzing unstructured data from multiple ECM systems. In Apache Spark, I implemented a similarity search algorithm that finds documents similar to a given reference document. The proof-of-concept implementation was successful.
The algorithm is based on Spark and its machine learning library MLlib. It utilizes the computing resources of a Hadoop cluster and Elasticsearch as NoSQL data storage. Technically, each document is transformed into a vector representation based on its term frequencies. The Euclidean distance was used as the distance measure to calculate the distances between the vectors and to separate them into clusters.
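The core idea – term-frequency vectors compared by Euclidean distance – can be sketched in a few lines of plain Python. This is a simplified illustration of the technique, not the thesis implementation (which used Spark and MLlib on a cluster); all function names here are my own:

```python
from collections import Counter
from math import sqrt

def tf_vector(text, vocabulary):
    """Term-frequency vector of a document over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def most_similar(reference, documents, vocabulary):
    """Return the document whose TF vector lies closest to the reference's."""
    ref_vec = tf_vector(reference, vocabulary)
    return min(documents, key=lambda d: euclidean(tf_vector(d, vocabulary), ref_vec))

# Toy corpus: the vocabulary is built from all documents plus the reference.
docs = ["hadoop spark cluster", "invoice payment due", "spark mllib machine learning"]
reference = "spark machine learning"
vocabulary = sorted({w for d in docs + [reference] for w in d.lower().split()})
best = most_similar(reference, docs, vocabulary)
# best == "spark mllib machine learning"
```

In the distributed setting, the same idea scales out: the vectorization runs per document in parallel, and the pairwise distances feed a clustering step that groups similar documents.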
My similarity search algorithm can be deployed to a Cloud environment like Amazon EC2. Thanks to its linear scaling capabilities, it is able to handle large amounts of data.
My research addressed the emerging topics of Cloud Computing and Big Data that Alalwan & Weistroffer (2012) proposed in their paper. Right now, this implementation is at an early development stage; academia has not yet combined machine learning with Enterprise Content Management. More research could be devoted to evaluating this application, as well as to detailed comparisons of machine learning algorithms for similarity search and related use cases. The variety of Apache Spark’s machine learning components and the easy deployment to Cloud Computing providers make it an interesting topic for future research – extending the capabilities of ECM to a degree relational databases would not be able to.
- Alalwan, J. A., & Weistroffer, H. R. (2012). Enterprise content management research: A comprehensive review. Journal of Enterprise Information Management, 25(5), 441–461.
- Apache Hadoop: Framework for distributed storage and distributed processing of very large data sets on computer clusters > http://hadoop.apache.org/
- Apache Spark: Spark – Lightning-fast cluster computing > http://spark.apache.org/