Title: Latent Dirichlet Allocation (LDA): a three-level hierarchical Bayesian model
Author: D. Blei, A. Ng, and M. Jordan
Published in: Journal of Machine Learning Research, 3:993–1022, January 2003
LDA is a generative model for a corpus, learned from training data.
we want to learn: 1) for each topic, the probability of each word
2) for each document, the distribution over topics
first, how do we generate a new document once the corpus-level parameters (α, β) have been trained?
1. draw N, the word count of this document, from some distribution (Poisson)
2. draw θ, the topic distribution of this document: θ ~ Dir(α) (α is assumed known)
3. draw the words (N words in total!)
1) choose a topic Zn ~ Multinomial(θ)
2) given that topic, choose a word ~ p(w | Zn, β), a multinomial probability conditioned on the topic Zn
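The three steps above can be sketched directly in code. This is a minimal illustration with made-up values for K, V, α, and β (none come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical corpus-level parameters (illustrative values only)
K, V = 3, 6                          # number of topics, vocabulary size
alpha = np.full(K, 0.5)              # Dirichlet parameter over topics
beta = rng.dirichlet(np.ones(V), K)  # K x V topic-word probabilities

# 1. draw the document length N from a Poisson
N = rng.poisson(lam=8)
# 2. draw this document's topic distribution theta ~ Dir(alpha)
theta = rng.dirichlet(alpha)
# 3. for each of the N words: pick a topic z_n ~ Mult(theta),
#    then pick a word w_n ~ p(w | z_n, beta), i.e. Mult(beta[z_n])
doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)
    w = rng.choice(V, p=beta[z])
    doc.append(w)
```

The result `doc` is a bag of word ids, which is all LDA models: word identities, not word order.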
multinomial model: we know each possible outcome and its probability
θ : (p1 ~ pK)
the probability of each topic showing up (each document has its own distinctive θ, so this parameter can be used to distinguish documents)
my understanding: θ describes the topic probabilities of this document; a high probability means this document likely belongs to that topic (like the probability that each face 1~6 of a die shows up)
corpus-level parameters (α & β):
- α :
the Dirichlet parameter: roughly a pseudo-count of how often each topic shows up (like how many times die faces 1~6 show up)
- β :
a K x V matrix; β(i, j) is the probability of word j under topic i
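Concretely, β is row-stochastic: each row is one topic's distribution over the vocabulary. A toy example with made-up numbers (K=2, V=3):

```python
import numpy as np

# hypothetical beta for K=2 topics over a V=3-word vocabulary
# (values invented for illustration; each row must sum to 1)
beta = np.array([
    [0.7, 0.2, 0.1],   # topic 0 mostly emits word 0
    [0.1, 0.1, 0.8],   # topic 1 mostly emits word 2
])

p = beta[1, 2]  # beta(i, j): probability of word j=2 under topic i=1
```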
- generate a document:
when we want to generate a document for this corpus,
we first need α & β (these belong to the whole corpus)
then,
α generates θ (the topic distribution); θ generates topics (which topics this document belongs to); each topic generates words (for each chosen topic, the words it tends to emit)
the words form the document (combining these N words, we get a document)
exchangeability: assume the order of words in a document does not matter; given θ, the words are drawn independently
so,
[issue]
parameter estimation:
which α, β have high probability of generating this corpus?
how do we choose the number of topics for this corpus?
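The paper answers the first question with variational EM. As a rough illustration of fitting topic parameters on a toy corpus, here is a minimal collapsed Gibbs sampler instead (a different, simpler inference method than the paper's; corpus, K, and hyperparameter values are all made up):

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA; docs is a list of word-id lists."""
    rng = np.random.default_rng(seed)
    D = len(docs)
    z = [rng.integers(K, size=len(doc)) for doc in docs]  # topic of each word
    ndk = np.zeros((D, K))   # topic counts per document
    nkw = np.zeros((K, V))   # word counts per topic
    nk = np.zeros(K)         # total words per topic
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[d][n]
                # remove this word's current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # conditional posterior over topics for this word
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())
                z[d][n] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
    # posterior-mean estimates: theta (per-document topics), beta (topic-word)
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)
    beta = (nkw + eta) / (nkw.sum(1, keepdims=True) + V * eta)
    return theta, beta

# toy corpus: first two docs use words 0-2, last two use words 3-5
docs = [[0, 1, 0, 2, 1], [0, 0, 2, 1], [3, 4, 5, 4], [5, 3, 4, 4, 3]]
theta, beta = gibbs_lda(docs, K=2, V=6)
```

For the second question (choosing K), a common practice is to fit models at several values of K and compare held-out likelihood or perplexity, as the paper does in its experiments.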


