Spring 2024
I 320D Topics in Human-Centered Data Science: Text Mining and NLP Essentials
DESCRIPTION
Applies text mining, natural language processing (NLP), and computational linguistics to real-world textual data challenges, including document processing, keyword extraction, question answering, translation, summarization, sentiment analysis, search, recommendation, and information extraction. Each week, classes include (a) Theory and Methods for NLP concepts and (b) Lab Tutorials for practical application with Python on multilingual text datasets.
COURSE NOTES
Natural language processing (NLP) is concerned with interactions between computers and humans through the medium of human languages. It involves analyzing, understanding, and generating human language, making it possible for machines to interpret and respond to human speech and text. NLP contributes significantly to modern technological advancements and serves as the backbone of crucial applications such as large-scale document processing, keyword/topic extraction, question answering, human language translation, summarization, sentiment and emotion analysis, search and recommendation, and information extraction in healthcare.
This undergraduate course covers fundamental concepts in Natural Language Processing / Computational Linguistics and how they are used to solve real-world problems. Each week's classes are divided into two segments: (a) Theory and Methods, a concise description of an NLP concept, and (b) Lab Tutorial, a hands-on session on applying the theory to a real-world task using publicly available multilingual text datasets. The course aims to give students a broad overview of NLP and prepare them for various career paths, such as working on cutting-edge text analysis products, taking NLP-centric industry roles, or pursuing doctoral studies in NLP or computational linguistics.
By the end of the course, students will:
1. Learn to collect and preprocess multilingual text data from diverse sources.
2. Handle multilingual data, apply language processing techniques (e.g., normalization, tokenization, lemmatization), and extract machine-readable representations.
3. Train machine learning models for natural language understanding and generation, and assess their performance.
4. Extract information from unstructured text and create knowledge graphs.
5. Utilize existing knowledge graphs, ontologies, and lexical networks for predictive text analysis.
6. Develop innovative product or research ideas through iterative experimentation and present them to the class.
PREREQUISITES
Upper-division standing; Informatics 310D and Informatics 304 (or one of the following approved substitutions: C S 303E, C S 312, C S 312H, C S 313E).
RESTRICTIONS
Registration is prioritized for undergraduate Informatics majors through registration period 1, with access extended to Informatics minors beginning in period 2. Outside students will be permitted to join the waitlists beginning with period 3.