[PaperReading] Latent Dirichlet allocation

2010-03-11

Title: Latent Dirichlet allocation
Authors: D. Blei, A. Ng, and M. Jordan
Source: Journal of Machine Learning Research, 3:993–1022, January 2003
LDA: a three-level hierarchical Bayesian model,
used to learn a generative model of a corpus from training data.

we want to know:
1) for each topic, the probability of each word
2) for each document, the probability of each topic

first, how do we generate a new document from the trained model (α and β already estimated)?
1. draw N : the word count of this document, from some distribution (e.g. Poisson)
2. draw θ : the topic distribution of this document  ~ Dir(α)      p.s. α is already known
3. draw each word (N words in total!)
     1) choose a topic Zn from Multinomial(θ)
     2) from this topic, choose a word ~ p(w | Zn,β)  - a multinomial probability conditioned on the topic Zn
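
The three steps above can be sketched in Python with NumPy. This is only a minimal sketch: the function name `generate_document`, the `mean_length` parameter, and the toy values of alpha and beta are my own assumptions for illustration, not something taken from the paper.

```python
import numpy as np

def generate_document(alpha, beta, mean_length=8, rng=None):
    """Sample one document (as word ids) following the LDA generative steps."""
    rng = np.random.default_rng() if rng is None else rng
    n_topics, vocab_size = beta.shape
    n_words = rng.poisson(mean_length)                    # step 1: N ~ Poisson
    theta = rng.dirichlet(alpha)                          # step 2: theta ~ Dir(alpha)
    topics = rng.choice(n_topics, size=n_words, p=theta)  # step 3.1: Zn ~ Multinomial(theta)
    # step 3.2: each word is drawn from the row of beta that belongs to its topic
    words = np.array([rng.choice(vocab_size, p=beta[z]) for z in topics], dtype=int)
    return theta, topics, words

# Toy parameters (assumed): K = 3 topics, V = 4 vocabulary words.
alpha = np.array([0.5, 0.5, 0.5])
beta = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])

theta, topics, words = generate_document(alpha, beta, rng=np.random.default_rng(0))
print(theta, topics, words)
```

Each call draws a fresh θ, so two generated documents from the same alpha and beta can have very different topic mixes.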




multinomial: we know each possible outcome and its probability

 θ : (p1 ~ pK)
the probability of each topic showing up (each document has its own θ, so this parameter can be used to distinguish between documents)
my understanding:  θ describes the topic probabilities of this document; a high probability for a topic means the document is largely about that topic  ---like the probability that each face 1~6 of a die shows up
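
To check the dice analogy for myself, here is a small sketch: roll a "topic die" with a made-up θ many times and compare the observed frequencies with θ. The values in `theta` are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

theta = np.array([0.6, 0.3, 0.1])  # hypothetical theta for one document, K = 3 topics

# Roll the "topic die" many times: each draw is the topic of one word.
draws = rng.choice(len(theta), size=10_000, p=theta)

# The observed share of each topic should be close to theta.
freq = np.bincount(draws, minlength=len(theta)) / len(draws)
print(freq)
```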



corpus-level parameters (alpha & beta):

  • α :  

the parameter of the Dirichlet prior on θ; it acts like a count of how often each topic has been sampled (like how many times each face 1~6 of a die has shown up)

  • β :  

a K x V matrix; beta(i,j) is the probability of word j in topic i
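
A small sketch of how the two parameters behave, assuming NumPy. The values of `alpha` and the random `beta` are made up for illustration. It relies on the standard fact that the mean of Dir(α) is α divided by its sum, which is why α reads like relative counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha as "counts": topic 0 has been seen most often, so theta leans towards it.
alpha = np.array([4.0, 2.0, 1.0, 1.0])      # hypothetical values, K = 4 topics
thetas = rng.dirichlet(alpha, size=20_000)  # one sampled theta per row

expected_mean = alpha / alpha.sum()         # mean of a Dirichlet distribution
empirical_mean = thetas.mean(axis=0)
print(expected_mean, empirical_mean)

# beta as a K x V matrix: every row is a probability distribution over the vocabulary.
raw = rng.random((4, 6))                    # K = 4 topics, V = 6 words (made up)
beta = raw / raw.sum(axis=1, keepdims=True) # normalise each row so it sums to 1
print(beta.sum(axis=1))
```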


  • generate a document:

when we want to generate a document for this corpus,
we first need alpha & beta (shared by the whole corpus)
then,
alpha generates theta (the document's topic distribution), theta generates a topic for each word position (which topics this document draws on), and each topic generates a word (from that topic's row of beta)
the words make up the document (combine these N words and we get a document)
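
The chain alpha -> theta -> topic -> word can be written as one joint distribution. As far as I can tell this matches the form given in the paper, with z the N topic assignments and w the N words of one document; integrating out θ and summing over the topics gives the probability of the document itself:

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta
```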



exchangeability: assume the order of the words in a document does not matter (bag of words); given θ, each word is generated independently
so,

[issue]
parameter estimation:
which alpha, beta have the highest probability of generating this corpus?
how do we choose the number of topics for this corpus?
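
Both questions can be stated a bit more precisely. For a corpus of M documents, estimation looks for the alpha and beta that maximise the log likelihood of the corpus. For the number of topics, one option (a heuristic, not a definitive answer) is to train models with different K and compare their perplexity on held-out documents, where lower is better. Here N_d is the number of words in document d:

```latex
\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)

\mathrm{perplexity}(D_{\mathrm{test}})
  = \exp\left\{ - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\}
```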


[ About ]

Welcome :P
I am Saphina Cheng (anon),
a master's student in the MiRA (Multimedia indexing, Retrieval, and Analysis) group of the Communication & Multimedia Laboratory at National Taiwan University

This blog is about the papers I read.

Any opinion is appreciated.
