Schedule and Readings

The course schedule is subject to change. The textbooks and materials for this course are:

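Text data analysis: 《文本数据挖掘(第2版)》 (Text Data Mining, 2nd ed.), 宗成庆 (Chengqing Zong), 夏睿 (Rui Xia), 张家俊 (Jiajun Zhang)
Natural language processing: 《统计自然语言处理(第2版)》 (Statistical Natural Language Processing, 2nd ed.), 宗成庆 (Chengqing Zong)
Natural language processing and large language models: 《自然语言处理基础与大模型:案例与实践》 (Fundamentals of Natural Language Processing and Large Models: Cases and Practice), 宗成庆 (Chengqing Zong), 赵阳 (Yang Zhao), 飞桨教材编写组 (PaddlePaddle textbook writing group)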


Week 1, Sep. 7

Week 2, Sep. 21

Data collection and preprocessing

  • How to crawl data from a website, utilize APIs, and leverage existing corpora
  • How to preprocess text data crawled from the internet
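
As a rough illustration of the preprocessing step, the sketch below strips markup from a crawled page and tokenizes the result using only the Python standard library. The sample page, the `TextExtractor` class, and the helper names are assumptions made for this example; they are not taken from the course materials.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(page):
    """Remove tags, decode entities, and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens only."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())


# Hypothetical page standing in for crawled content.
sample_page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Text Mining</h1>"
    "<p>Crawled &amp; cleaned text, ready for analysis.</p></body></html>"
)
clean = html_to_text(sample_page)
tokens = tokenize(clean)
print(clean)
print(tokens)
```

Real crawled pages usually need more than this (encoding detection, boilerplate removal, language-specific tokenization), so treat the regular expression as a placeholder for a proper tokenizer.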

Lecture

Slides with code: zjwz

HTML examples with code: ww72

Jupyter notebook with code: vydm


Week 3, Sep. 28

Language model

  • N-gram model
  • Parameter estimation
  • Data smoothing methods
  • Feedforward neural network language model
  • Recurrent neural network language model
  • Self-attention language model
  • Applications of traditional and neural language models
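
The first three topics above (n-gram model, parameter estimation, smoothing) can be sketched together in a few lines. The example below estimates a bigram model by counting and applies add-k (Laplace) smoothing; the two-sentence corpus and the function names are assumptions chosen for illustration, and add-k is only one of several smoothing methods.

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their left contexts, with sentence boundary markers."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>"}
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return context_counts, bigram_counts, vocab


def bigram_prob(word, prev, context_counts, bigram_counts, vocab, k=1.0):
    """Add-k estimate of P(word | prev); k=1 gives add-one smoothing."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * len(vocab))


# Toy corpus, made up for this example.
corpus = ["the cat sat", "the dog sat"]
context_counts, bigram_counts, vocab = train_bigram(corpus)

p_cat = bigram_prob("cat", "the", context_counts, bigram_counts, vocab)
p_sat = bigram_prob("sat", "the", context_counts, bigram_counts, vocab)
total = sum(bigram_prob(w, "the", context_counts, bigram_counts, vocab) for w in vocab)
print(p_cat, p_sat, total)
```

Here "the sat" never occurs in the corpus, yet it receives a non-zero probability, which is the point of smoothing. The probabilities over the vocabulary still sum to one for a given context.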

Lecture

Slides with code: vbdc

Reading

A Neural Probabilistic Language Model


Week 4, Oct. 19

Text representation methods

Vector space model

Representation learning model

  • word representations
  • phrase representations
  • sentence representations
  • document representations
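
For the vector space model, a minimal sketch of TF-IDF document vectors is given below. It assumes whitespace tokenization, relative term frequency, and the unsmoothed weight log(N / df); libraries often use smoothed variants, so the exact numbers will differ from theirs. The three documents are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return the sorted vocabulary and one TF-IDF vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append([tf[tok] / len(toks) * idf[tok] for tok in vocab])
    return vocab, vectors


# Toy documents, made up for this example.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, [round(x, 3) for x in vec])
```

A term that appears in every document ("the") gets weight zero, while a term unique to one document ("dog") gets the largest weight in that document.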

Lecture

Slides with code: 1t7h

Reading

Efficient Estimation of Word Representations in Vector Space

GloVe: Global Vectors for Word Representation

Learning Continuous Phrase Representations and Syntactic Parsing with Recursive Neural Networks

Skip-Thought Vectors


Week 5, Nov. 2

Text representation and application

  • Playing around with word embeddings
  • Bias in word embeddings
  • Quantifying bias
  • Hands-on: mitigating bias
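
One common way to quantify bias is to project a word vector onto a bias direction, and one way to mitigate it is to remove that component. The sketch below does both on made-up 3-dimensional vectors. It is a simplification under stated assumptions: the direction comes from a single he/she pair, and the "neutralize" step omits renormalization; the embeddings are hypothetical and not the data used in the course notebook.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def bias_score(word_vec, direction):
    """Cosine between a word vector and the bias direction."""
    return dot(word_vec, direction) / (norm(word_vec) * norm(direction))


def neutralize(word_vec, direction):
    """Subtract the component of word_vec that lies along the bias direction."""
    length = norm(direction)
    unit_dir = [a / length for a in direction]
    proj = dot(word_vec, unit_dir)
    return [a - proj * b for a, b in zip(word_vec, unit_dir)]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
emb = {
    "he": [1.0, 0.2, 0.0],
    "she": [-1.0, 0.2, 0.0],
    "programmer": [0.4, 0.5, 0.7],
}
gender_direction = [a - b for a, b in zip(emb["he"], emb["she"])]

before = bias_score(emb["programmer"], gender_direction)
debiased = neutralize(emb["programmer"], gender_direction)
after = bias_score(debiased, gender_direction)
print(round(before, 3), round(after, 3))
```

A score near zero means the word is roughly orthogonal to the chosen direction. Whether that amounts to a meaningful reduction of bias is exactly what the readings for this week discuss.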

Lecture

Slides with code: dk0r

Jupyter notebook with code: jp71

Data used in the Jupyter notebook with code: 5a1l

Reading

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings

Replicating 'Man is to Computer Programmer as Woman is to Homemakers? Debiasing Word Embeddings'


Week 6, Nov. 9

Week 7, Nov. 16

Text classification

  • Applications of text classification methods
  • Machine learning methods
  • Deep learning methods
  • Evaluation
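
To make the machine learning and evaluation topics concrete, here is a small sketch of a multinomial Naive Bayes classifier with add-one smoothing, followed by precision, recall, and F1. Naive Bayes is chosen here only as a familiar example of a classical method; the training and test sentences are made up, and all function names are assumptions.

```python
import math
from collections import Counter, defaultdict


def train_nb(examples):
    """examples: list of (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_counts[label] += 1
        for tok in text.lower().split():
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_counts, word_counts, vocab


def predict(text, model):
    """Return the label with the highest log posterior; unseen words are ignored."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.lower().split():
            if tok in vocab:
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall_f1(gold, pred, positive):
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data, made up for this example.
train = [("great fun movie", "pos"), ("great acting", "pos"),
         ("boring plot", "neg"), ("dull boring movie", "neg")]
test_texts = ["great movie", "boring plot", "not boring"]
gold = ["pos", "neg", "pos"]

model = train_nb(train)
pred = [predict(text, model) for text in test_texts]
p, r, f1 = precision_recall_f1(gold, pred, positive="pos")
print(pred, p, r, round(f1, 3))
```

The third test sentence is misclassified because a bag-of-words model cannot see the negation, which lowers recall for the positive class while precision stays at 1.0.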

Lecture

Slides with code: 813f

Code & data with code: l69h

Kaggle notebook


Week 8, Nov. 23

Text clustering

  • Applications of text clustering methods
  • Text similarity measurement
  • Text clustering algorithms
  • Evaluation
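
Two of the topics above, similarity measurement and evaluation, are easy to illustrate without a full clustering algorithm. The sketch below computes cosine similarity between bag-of-words vectors and scores a clustering with purity. Both are only examples of the many measures available; the sentences, cluster assignments, and gold labels are made up.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between term-count vectors of two texts."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def purity(cluster_ids, gold_labels):
    """Fraction of items that carry the majority gold label of their cluster."""
    clusters = {}
    for cid, label in zip(cluster_ids, gold_labels):
        clusters.setdefault(cid, Counter())[label] += 1
    majority_total = sum(c.most_common(1)[0][1] for c in clusters.values())
    return majority_total / len(gold_labels)


sim_close = cosine_similarity("the cat sat", "the cat ran")
sim_far = cosine_similarity("the cat sat", "stock prices fell")

# Hypothetical clustering output and gold labels for five documents.
cluster_ids = [0, 0, 1, 1, 1]
gold_labels = ["sports", "sports", "sports", "tech", "tech"]
score = purity(cluster_ids, gold_labels)
print(round(sim_close, 3), sim_far, score)
```

Purity rewards many small clusters, so it is usually reported alongside other measures rather than on its own.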

Lecture

Slides with code: t6o4

Code with question & data with code: becw

Code with answer & data with code:


Week 9, Nov. 30

Information extraction

Week 10, Dec. 7

Text summarization