Basic Data Mining Algorithms and their Scalability for Big Data

The last date for online registration has been extended to August 10, 2016.

August 16-21, 2016



Overview: This course introduces the basic ideas behind data mining algorithms and their scalability to Big Data settings. Relatively simple classification and clustering algorithms are presented and analyzed in detail. The basic ideas of Map-Reduce algorithms for asynchronous computing in cloud/Hadoop environments are introduced, and the course then presents and analyzes how the simple data mining algorithms need to be redesigned for the Map-Reduce environment.
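
As a rough illustration of the Map-Reduce idea mentioned in the overview, the sketch below computes a per-key mean in three steps: map, shuffle, and reduce. It is not taken from the course material. It simulates the phases in a single Python process rather than on Hadoop, and all function names (`map_phase`, `shuffle`, `reduce_phase`, `mapreduce_mean`) and the sample data are hypothetical.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Map: emit one (key, (partial_sum, count)) pair per record in a chunk."""
    return [(key, (value, 1)) for key, value in chunk]


def shuffle(pairs):
    """Shuffle: group the emitted partial results by key."""
    groups = defaultdict(list)
    for key, partial in pairs:
        groups[key].append(partial)
    return groups


def reduce_phase(key, partials):
    """Reduce: add up the partial sums and counts, then divide once."""
    total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partials)
    return key, total / count


def mapreduce_mean(chunks):
    """Run the three phases in one process (an assumption; no cluster involved)."""
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical data, split into three chunks as if stored on three nodes.
chunks = [[("a", 2.0), ("b", 10.0)], [("a", 4.0)], [("b", 20.0), ("a", 6.0)]]
means = mapreduce_mean(chunks)
```

Emitting (sum, count) pairs rather than local means is the key design choice: sums and counts can be combined in any order, whereas averaging local averages would give a wrong answer when chunks differ in size.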

Objectives: This course aims to impart knowledge and understanding of the following topics to the participants:

  1. Details of the decision tree induction algorithms and various issues related to their outcomes and performance.
  2. Details of association rule mining algorithms and various issues related to their outcomes and performance.
  3. Sequential and partitional clustering algorithms.
  4. Basics of the Map-Reduce paradigm for designing algorithms, and application of this paradigm to redesigning the decision tree and association rule induction algorithms.
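
To make the association rule objective above a little more concrete, here is a minimal sketch of the three standard rule metrics (support, confidence, and lift), computed directly from their textbook definitions. It is an illustration only, not the course's lab code; the function names and the small transaction set are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the consequent's support; 1.0 suggests independence."""
    return (confidence(transactions, antecedent, consequent)
            / support(transactions, consequent))


# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
rule_lift = lift(transactions, {"bread"}, {"milk"})
```

In this toy data the rule bread -> milk has a lift below 1, meaning that buying bread makes milk slightly less likely than its overall frequency would suggest.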

Every session will be followed by lab and practice assignments (MATLAB and R programming).

Topic List:

  1. Introduction to Data Mining (4 hours)
    1. Lecture (2 hours)
      1. Applications and Need for data mining algorithms
      2. Scalability issues for data mining tasks
      3. Relationship to other fields such as statistics and machine learning
      4. Different types of data: Relations, Graphs, Sequences, and Text
      5. Types and nature of patterns and knowledge to be discovered in data
    2. Lab and Practice (2 hours)
      1. Practice with representation and processing of data in MATLAB
  2. Classification Algorithms: Decision Trees (6 hours)
    1. Lecture (3 hours)
      1. What is a Decision Tree: How does it work
      2. Algorithms for inducing decision trees from data
      3. Characteristics of decision tree induction algorithms
      4. Overfitting and underfitting
      5. Evaluating the performance of a decision tree
      6. Applications and Real life cases
      7. Learning of tree ensembles
      8. Induction of Decision Trees for Big Data: Issues of performance and algorithms
    2. Lab and Practice (3 hours)
      1. Build decision trees from test datasets using MATLAB functions
  3. Association Analysis (6 hours)
    1. Lecture (3 hours)
      1. What are association rules
      2. Apriori principle for frequent itemset generation
      3. Association Rule Generation
      4. Rule metrics: support, confidence, lift, etc.
    2. Lab and Practice (3 hours)
      1. MATLAB functions to generate association rules: test and practice
  4. Basic Clustering Algorithms (5 hours)
    1. Lecture (3 hours)
      1. Why clustering?
      2. Sequential Clustering Algorithms
      3. Partitional Clustering Algorithms: K-means, bisecting k-means
      4. Evaluating performance of clustering algorithms
    2. Exercises and Practice (3 hours)
      1. Exercises with clustering algorithms
  5. Scalability of Algorithms for Big Data (8 hours)
    1. Lecture (4 hours)
      1. Types of Hardware for scaling: Scaling Up vs. Scaling Out
      2. Hadoop Architecture and Map-Reduce Algorithms
      3. Foundational ideas of Map-Reduce Algorithms
      4. Simple statistical Functions using MapReduce Formulations
    2. Lab and Practice (4 hours)
      1. Exercises in designing algorithms using MapReduce Paradigm
  6. Design of Clustering Algorithms for Hadoop (5 hours)
    1. Lecture (2.5 hours)
      1. K-means algorithm using Map-Reduce
      2. Other clustering algorithms using Map-Reduce
    2. Lab and Practice (2.5 hours)
      1. Exercises using MapReduce for clustering
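
The final topic, K-means using Map-Reduce, can be sketched roughly as follows. This is one common formulation, given here as an assumption rather than as the course's own design: each mapper assigns the points in its chunk to the nearest current centroid, and each reducer averages the points assigned to one centroid. The code runs in a single Python process rather than on Hadoop, and all names and data are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_phase(chunk, centroids):
    """Map: emit (centroid index, (point, 1)) for every point in the chunk."""
    return [(nearest(point, centroids), (point, 1)) for point in chunk]


def reduce_phase(partials):
    """Reduce: average all points that were assigned to one centroid."""
    dim = len(partials[0][0])
    total, count = [0.0] * dim, 0
    for vector, n in partials:
        for d in range(dim):
            total[d] += vector[d]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(chunks, centroids):
    """One Map-Reduce round; a centroid with no points keeps its old position."""
    groups = defaultdict(list)
    for chunk in chunks:
        for index, partial in map_phase(chunk, centroids):
            groups[index].append(partial)
    return [
        reduce_phase(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans(chunks, centroids, max_iter=20):
    """Repeat rounds until the centroids stop changing or max_iter is reached."""
    for _ in range(max_iter):
        updated = kmeans_iteration(chunks, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Hypothetical 2-D data, split into three chunks as if stored on three nodes.
chunks = [
    [(1.0, 1.0), (1.0, 2.0)],
    [(9.0, 9.0), (10.0, 9.0)],
    [(2.0, 1.0), (9.0, 10.0)],
]
final_centroids = kmeans(chunks, [(0.0, 0.0), (10.0, 10.0)])
```

Each iteration is a separate Map-Reduce job whose only shared state is the small list of centroids, which is why the algorithm scales out: the data never has to leave the node that holds it during the map step.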

6 hours for evaluation and presentation by participants.

Resource Person:
Prof. Raj Bhatnagar

Raj K Bhatnagar
Professor of Computer Science
Department of Electrical Engineering and Computing Systems
University of Cincinnati, Cincinnati, OH 45221, USA
+1 (513) 556-4932
Prof. Raj Bhatnagar is Professor of Computer Science at the University of Cincinnati, Ohio, USA. His area of research is data mining and pattern recognition, and he has worked on problems in this area for more than twenty-five years. His research projects have been funded by NSF, the US Air Force, US DARPA, and a number of industrial sponsors. He has supervised graduate students for eleven Ph.D. dissertations and seventy M.S. theses. His recent research projects include the design of mining and analysis algorithms for Big Data situations in biomedical, manufacturing, GIS, and security applications. These problems have involved various types of structured and unstructured data. He has published more than eighty peer-reviewed publications. He has designed and taught graduate-level classes on Data Mining, Big Data Analysis, and Artificial Intelligence. He recently published three papers at the IEEE International Conference on Big Data (October 2015) and delivered a 3.5-hour tutorial on Design of Analytics Algorithms for Big Data at the Big Data Analytics 2015 (BDA2015) conference held in Hyderabad in December 2015.

Course Coordinators:
Dr. Pritee Khanna
Associate Professor, Computer Science and Engineering
Phone: +91 761 2794222 (O), +91 9425324241 (M)

Dr. Sraban Kumar Mohanty
Assistant Professor, Computer Science and Engineering
Phone: +91 761 2794224 (O), +91 9425807609 (M)

Contact us:
For course-related queries, kindly write to:
The Course Coordinators
Data Mining Algorithms and their Scalability for Big Data