CSCI 4380/6380 Data Mining
Fall 2017: Mondays 3:35pm - 4:25pm & Tuesdays
and Thursdays 3:30pm - 4:45pm, Boyd GSRC Room 208
Instructor: Prof. Khaled Rasheed
Telephone: (706)542-0881
Office Hours: Monday 4:30-6:00pm and Thursday 2:00-3:00pm or by email
appointment
Office Location: Room 111, Boyd GSRC
Email: khaled@uga.edu
Teaching Assistant: Roxana Attar
Office Hours: Wednesday 1:30pm-4:00pm
Office Location: Room 537, Boyd GSRC
Email: roxana.attar@uga.edu
Objectives:
The course aims to provide students with a broad introduction to the
field of Data Mining and related areas and to teach students how to
apply these methods to solve problems in complex domains.
The course is appropriate both for students preparing for research in
Data Mining and Machine Learning and for Bioinformatics, Science
and Engineering students who want to apply Data Mining techniques to
solve problems in their fields of study.
Recommended Background:
CSCI 2720 Data Structures. Familiarity with basic computer algorithms
and data structures. Familiarity with the Java programming language is
recommended but not required.
Topics to be Covered:
Part I: Data Mining techniques: Selected from: Association and
Classification Rule Mining, Linear Models, Decision Trees and Random
Forests, Neural Network approaches, Support Vector Machines, Bayesian
Learning, Instance-based Learning, Pre-processing and Feature
Selection, Performance evaluation, Ensemble Learning and clustering.
Part II: Data Mining applications: Selected from: Bioinformatics,
Biomedical/Physical/Chemical modeling, medical diagnosis, text/web
mining, pattern recognition and/or other contemporary applications.
Expected Work:
Reading; assignments (including running experiments using the Weka
package); a paper presentation; two midterms; and a term project (may
require programming or running existing packages) and paper.
Unless otherwise announced by the instructor, all assignments and all
exams must be done entirely on your own.
Academic Honesty and Integrity:
All academic work must meet the standards contained in
"A Culture of Honesty." Students are responsible for informing
themselves about those standards before performing any academic
work. The penalties for academic dishonesty are severe and ignorance
is not an acceptable defense.
Grading Policy:
Assignments: 20% (Programs, homeworks, attendance, paper presentation)
Midterm Examinations: 40%
Term Project: 40% (includes term paper and presentation)
Students may work on their term projects in groups of up to
FOUR students each. The above distribution is only
tentative and may change later. The instructor will announce any
changes.
Assignment Submission Policy
Assignments must be turned in by the assigned deadline. Late
assignments will not be accepted. Rare exceptions may be made by the
instructor only under extenuating circumstances and in accordance with
the university policies.
Course Home-page
A variety of materials will be made available on the DM Class
Home-page at
http://cobweb.cs.uga.edu/~khaled/DMcourse/, including handouts,
lecture notes and assignments. Announcements may be posted between
class meetings. You are responsible for being aware of whatever
information is posted there.
Lecture Notes
Copies of some of Dr. Rasheed's lecture notes will be
available at the bottom of the class home page. Not all lectures
will have electronic notes, though, so students should be prepared
to take notes during the lecture at any time.
Textbook in Bookstore
"Data Mining: Practical Machine Learning Tools and Techniques
(4th edition)", Ian Witten, Eibe Frank, Mark Hall and Christopher Pal. Morgan Kaufmann,
2016. (Required)
ISBN-10: 0128042915 & ISBN-13: 978-0128042915
Web Resources
The WEKA Machine Learning Project
University of California at Irvine ML Repository
David Aha's Machine Learning Resources
Announcements:
[12-4-2017] The course evaluations for CSCI courses are to be done on-line for Fall 2017. You can access the following URL, which will allow you to log in to the course evaluation system using your MyID and password:
eval.franklin.uga.edu
The course evaluation system will be on-line until December 7, 12:00AM (finals begin on December 7). Please remember that you will get 1% extra credit for your total class grade just for doing the evaluations!
[12-4-2017] The course project presentations will be done on
12-12-2017 at the time of the final exam (because there is no final
exam). They will be done from 3:30 pm to 6:30 pm with about 15 minutes per
group in Boyd room 208. The course project reports are due on the same
day. For the project report format, please write it as a conference
paper of about 8 two-column pages or 12 single-column pages (there is
no restriction on size though). You should include a title, an
abstract, an introduction, a mention of related work if any, a
description of your experiments and results and a conclusion. In the
introduction or elsewhere in the paper you should describe the domain
that you applied your Data Mining technique(s) to, in enough detail
for the reader to appreciate the significance and difficulty of the
problem. Please bring a hard copy. Please also include your email
addresses as well as the URLs of any demo/supporting web pages. There
is a slight chance that I might contact you soon after the submission
deadline (within 48 hours) requesting code, clarifications or more
data.
[11-28-2017] The second midterm will be this Thursday
11-30-2017. It will focus on Chapter 5 and the sections of Chapters 7
and 12 that are covered in the lecture notes, but please bring the
lecture notes for all chapters with you. It will be open notes, but the
use of laptops or phones will not be allowed. You should also bring
all handouts, and you may bring any additional notes, homeworks etc.
[11-22-2017] There are many things you can do to improve the value
of your course project. I list some of them in this
announcement. Feel free to contact me regarding your specific
project. A good project will typically exercise many of the ideas
discussed in the class. You should apply several data mining
methods. For classification problems, you can try tree-based,
SVM-based, neural network, Bayesian, nearest neighbor, logistic
regression or other suitable methods. For clustering problems, you can
try several clustering approaches including hierarchical, k-means,
spectral, EM or other suitable methods. You should try to improve
performance using ideas such as hyperparameter tuning,
discretization, preprocessing or other suitable ideas. You should
analyze the performance of the different methods using appropriate
performance measures. You should also try to combine multiple
classifiers using bagging, boosting and/or stacking if applicable. In
summary, I recommend you start with one method and close the loop,
then add the other methods, do the tuning, analyze performance and
finish with ensemble learning.
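As a rough illustration of "start with one method and close the loop",
the sketch below evaluates one classifier against a trivial baseline
under the same k-fold cross-validation. It is only a sketch under
stated assumptions: it uses plain Python rather than Weka, the toy
dataset is made up for illustration, and the function names
(k_fold_indices, majority_class, one_nearest_neighbor,
cross_val_accuracy) are hypothetical helpers, not part of any course
material or package.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def majority_class(train_X, train_y):
    """Baseline learner: always predict the most common training label."""
    label = Counter(train_y).most_common(1)[0][0]
    return lambda x: label


def one_nearest_neighbor(train_X, train_y):
    """Instance-based learner: 1-NN with squared Euclidean distance."""
    def predict(x):
        nearest = min(
            range(len(train_X)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
        )
        return train_y[nearest]
    return predict


def cross_val_accuracy(learner, X, y, k=3):
    """Train on k-1 folds, test on the held-out fold, pool the results."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        predict = learner(train_X, train_y)
        correct += sum(predict(X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Made-up toy data: two well-separated classes, interleaved so that
# every contiguous fold contains both labels.
X = [(0.0, 0.1), (1.0, 1.1), (0.2, 0.0), (0.9, 1.0), (0.1, 0.2), (1.1, 0.9)]
y = ["a", "b", "a", "b", "a", "b"]

if __name__ == "__main__":
    for name, learner in [("baseline", majority_class),
                          ("1-NN", one_nearest_neighbor)]:
        print(name, cross_val_accuracy(learner, X, y, k=3))
```

Once a loop like this works for one method, other learners can be
passed in the same way, so every method is compared on the same folds
and with the same performance measure.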
[10-10-2017] The first midterm exam will be this Thursday
10-12-2017. It will cover all the topics discussed in the course till
last Thursday (i.e. up to Chapter 5 Page 166). It will be open notes
but the use of books, laptops or phones will not be allowed. You
should bring a calculator to the exam. If you do not have a calculator
you may use your phone as a calculator. You should also bring your
lecture notes and all handouts and you may also bring any additional
notes, homeworks etc.
Papers
"Unsupervised feature selection for multi-cluster data" 2010. [Qinglin Dong][11/06] {download}
"Clustering by Passing Messages Between Data Points" 2007. [Joshwa Shannon][11/07] {download}
"Extending market basket analysis with graph mining techniques: A real case" 2014. [Zach Baker][11/07] {download}
"Text and Structural Data Mining of Influenza Mentions in Web and Social Media" 2010. [Amy Giuntini][11/09] {download}
"ImageNet classification with deep convolutional neural networks" 2017. [Zach Jones][11/09] {download}
"Authorship Verification for Short Messages using Stylometry" 2013. [Isela Diaz Martinez][11/09] {download}
"Feature Mining for Image Classification" 2014. [Hari Teja Tatavarti][11/13] {download}
"Emotional state classification from EEG data using machine learning approach" 2014. [Shulin Zhang][11/14] {download}
"Sparse Bayesian Learning for Identifying Imaging Biomarkers in AD Prediction" 2011. [Christian McDaniel][11/16] {download}
"Speech Recognition with Deep Recurrent Neural Networks" 2013. [Yuanming Shi][11/27] {download}
"Deep Bilateral Learning for Real-Time Image Enhancement" 2017. [Mahdi Kashanipour][11/27] {download}
"Why didn’t my (great!) protocol get adopted?" 2015. [I-Huei Ho][11/27] {download}
"Least squares support vector machines ensemble models for credit scoring" 2010. [Nicholas Klepp][11/28] {download}
"An Efficient Approach for Image Recognition using Data Mining" 2011. [Kang Yuan][12/4] {download}
"Neural Turing Machines" 2014. [Layton Hayes][12/4] {download}
Assignments:
Homework 1: Exercise 17.1 on pages 559 - 565 of the Weka exercises
handout given in class today. You can also download all the exercises
from HERE. [Due 9-7-2017 in class]
Homework 2
Homework 3
Homework 4: Exercise 17.4 on page 574 of the Weka exercises handout. [Due
10-31-2017 in class]
Homework 5
Homework 6
Lecture Notes:
Chapter 1
Chapter 2
Chapter 3
Weka Tutorial Slides by Roxana Attar
Chapter 4
Chapter 5
Chapter 7
Chapter 12
The course syllabus is a general plan for the course;
deviations announced to the class by the instructor may be
necessary.
Last modified: December 4, 2017.
Khaled Rasheed
(khaled[at]uga.edu)