Schedule and Readings
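The course schedule is subject to change. The textbooks and materials for this course are:
Text data analysis: 《文本数据挖掘(第2版)》 (Text Data Mining, 2nd edition), by Zong Chengqing, Xia Rui, and Zhang Jiajun
Natural language processing: 《统计自然语言处理(第2版)》 (Statistical Natural Language Processing, 2nd edition), by Zong Chengqing
Natural language processing and large language models: 《自然语言处理基础与大模型:案例与实践》 (Foundations of Natural Language Processing and Large Models: Cases and Practice), by Zong Chengqing, Zhao Yang, and the PaddlePaddle (飞桨) Textbook Writing Group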
Week 1, Sep. 7
Introduction to this course with the following outline:
- Why study this course
- How to study this course
- Case analysis
- Issues and challenges faced
Lecture
Slides with code: h7c4
Jupyter notebook tutorial with code: npa7
Jupyter notebook with code: ng6k
Reading
Private traits and attributes are predictable from digital records of human behavior
You are what you tweet: Personality expression and perception on Twitter
Unlocking Personalities: Predicting MBTI Types from Text Posts
Week 2, Sep. 21
Data collection and preprocessing
- How to crawl data from a website, utilize APIs, and leverage existing corpora
- How to preprocess text data crawled from the internet
Lecture
Slides with code: zjwz
HTML examples with code: ww72
Jupyter notebook with code: vydm
Week 3, Sep. 28
Language model
- N-gram model
- Parameter estimation
- Data smoothing methods
- Feedforward neural network language model
- Recurrent neural network language model
- Self-attention language model
- Applications of traditional and neural language models
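As a rough illustration of the first three topics (n-gram model, parameter estimation, data smoothing), the sketch below estimates bigram probabilities from counts and applies add-one (Laplace) smoothing. The toy corpus and the function names `train_bigram` and `bigram_prob` are made up for this example and are not taken from the course materials; counting the boundary markers as vocabulary items is a simplification.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams):
    """Add-one (Laplace) smoothed estimate of P(word | prev)."""
    vocab_size = len(unigrams)  # simplification: boundary markers count as vocabulary
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Hypothetical toy corpus, invented for illustration only.
corpus = [["i", "like", "nlp"], ["i", "like", "text", "mining"]]
unigrams, bigrams = train_bigram(corpus)

p_seen = bigram_prob("i", "like", unigrams, bigrams)      # bigram observed twice
p_unseen = bigram_prob("i", "mining", unigrams, bigrams)  # bigram never observed
print(p_seen, p_unseen)
```

Without smoothing, the unseen bigram would get probability zero; add-one smoothing moves a little probability mass to it at the cost of the observed bigrams.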
Lecture
Slides with code: vbdc
Reading
A Neural Probabilistic Language Model
Week 4, Oct. 19
Text representation method
Vector space model
Representation learning model
- word representations
- phrase representations
- sentence representations
- document representations
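To make the vector space model concrete, here is a minimal sketch that represents documents as raw term-frequency vectors over a shared vocabulary and compares them with cosine similarity. The three example documents and the helper names `tf_vector` and `cosine_similarity` are assumptions for illustration; the lecture may use a different weighting scheme such as TF-IDF.

```python
import math
from collections import Counter

def tf_vector(tokens, vocabulary):
    """Represent a document as raw term counts, one dimension per vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy documents, invented for illustration only.
docs = [
    "text mining finds patterns in text".split(),
    "text mining and language models".split(),
    "cats sleep all day".split(),
]
vocabulary = sorted({term for doc in docs for term in doc})
vectors = [tf_vector(doc, vocabulary) for doc in docs]

sim_related = cosine_similarity(vectors[0], vectors[1])    # share "text" and "mining"
sim_unrelated = cosine_similarity(vectors[0], vectors[2])  # share no terms
print(sim_related, sim_unrelated)
```

Documents with no terms in common are orthogonal in this representation, which is one motivation for the dense learned representations listed above.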
Lecture
Slides with code: 1t7h
Reading
Efficient Estimation of Word Representations in Vector Space
GloVe: Global Vectors for Word Representation
Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks
Week 5, Nov. 2
Text representation and application
- Playing around with word embeddings
- Bias in word embeddings
- Quantifying bias
- Hands-on: mitigating bias
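One common way to quantify and mitigate bias, in the spirit of the debiasing paper in this week's reading, is to measure how strongly a word vector aligns with a gender direction and then remove that component. The sketch below is a simplified assumption, not the course notebook: it uses invented 3-dimensional toy vectors and a single he/she pair to define the direction, whereas the paper derives the direction from several definitional pairs.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def bias_score(word_vec, direction):
    """Cosine between a word vector and the bias direction (0 means no alignment)."""
    return dot(normalize(word_vec), normalize(direction))

def neutralize(word_vec, direction):
    """Remove the component of word_vec that lies along the bias direction."""
    g = normalize(direction)
    projection = dot(word_vec, g)
    return [x - projection * gi for x, gi in zip(word_vec, g)]

# Toy 3-dimensional vectors, invented for illustration; real embeddings have hundreds of dimensions.
toy = {
    "he": [1.0, 0.2, 0.0],
    "she": [-1.0, 0.2, 0.0],
    "programmer": [0.4, 0.5, 0.6],
}
gender_direction = [a - b for a, b in zip(toy["he"], toy["she"])]

before = bias_score(toy["programmer"], gender_direction)
after = bias_score(neutralize(toy["programmer"], gender_direction), gender_direction)
print(before, after)
```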
Lecture
Slides with code: dk0r
Jupyter notebook with code: jp71
Data used in jupyter notebook with code: 5a1l
Reading
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
Replicating 'Man is to Computer Programmer as Woman is to Homemakers? Debiasing Word Embeddings'
Week 6, Nov. 9
Pre-trained language model
- Why pre-trained language models
- ELMo
- GPT
- BERT
Lecture
Slides with code: roqu
Reading
Deep Contextualized Word Representations
Improving Language Understanding by Generative Pre-Training
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Week 7, Nov. 16
Text classification
- Applications of text classification method
- Machine learning method
- Deep learning method
- Evaluation
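For the evaluation topic, classifiers are typically scored with precision, recall, and F1 for a chosen positive class. The sketch below computes these from gold and predicted labels; the label lists and the function name `precision_recall_f1` are hypothetical examples, and the lecture may cover additional metrics such as accuracy or macro-averaging.

```python
def precision_recall_f1(gold, predicted, positive):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical labels, invented for illustration only.
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "neg", "pos", "pos"]

precision, recall, f1 = precision_recall_f1(gold, predicted, positive="pos")
print(precision, recall, f1)
```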
Lecture
Slides with code: 813f
Code & data with code: l69h
Week 8, Nov. 23
Text clustering
- Applications of text clustering method
- Text similarity measurement
- Text clustering algorithms
- Evaluation
Lecture
Slides with code: t6o4
Code with questions & data with code: becw
Code with answers & data with code: