lemur::cluster::PLSA Class Reference

Probabilistic Latent Semantic Analysis (PLSA), based on the Java reference implementation by Andrew Schein and Alexandrin Popescul (Penn): PennAspect (GPL).

#include <PLSA.hpp>

Public Member Functions

 PLSA (const lemur::api::Index &dbIndex, int numCats, lemur::utility::HashFreqVector **train, lemur::utility::HashFreqVector **validate, int numIter, int numRestarts, double betastart, double betastop, double anneal, double betaMod)
 building with provided train/test partitions

 PLSA (const lemur::api::Index &dbIndex, int testPercentage, int numCats, int numIter, int numRestarts, double betastart, double betastop, double anneal, double betaMod)
 building without provided train/test partitions

 PLSA (const lemur::api::Index &dbIndex)
 for using prebuilt tables.

virtual ~PLSA ()
void iterateWithRestarts ()
 Start things going.

double * get_p_z () const
double ** get_p_w_z () const
 P(w|z) matrix.

double ** get_p_d_z () const
 P(d|z) matrix.

double getProb (int d, int w) const
 P(d,w).

int numWords () const
 number of terms

int numDocs () const
 number of docs

int numCats () const
 number of categories

bool readArrays ()

Private Types

enum  pType { P_Z = 0, P_W_Z = 1, P_D_Z = 2 }
 read/write array options.


Private Member Functions

void setPrevToCurrent ()
 Copy current iteration data to previous iteration data.

void setCurrentToBest ()
 Copy best iteration data to current iteration data.

void setBestToCurrent ()
 Copy current iteration data to best iteration data.

void setBestToPrev ()
 Copy previous iteration data to best iteration data.

void setPrevToBest ()
 Copy best iteration data to previous iteration data.

double getAverageLikelihood ()
double getAverageLikelihoodPrev ()
double jointEstimate (int indexD, int indexW)
 Estimates P(d,w) using previous parameter estimates.

double jointEstimateCurrent (int indexD, int indexW)
 Estimates P(d,w) using current parameter estimates.

double jointEstimateBest (int indexD, int indexW)
 Estimates P(d,w) using best parameter estimates.

double jointEstimateBeta (int indexD, int indexW)
void iterate ()
 main routine for model training.

void initializeParameters ()
 Initialize the prev probability arrays to random values.

double doLogLikelihood (jointfuncType, lemur::utility::HashFreqVector **&myData)
double logLikelihood ()
 Calculate the training data log-likelihood using prev parameters.

double validateDataLogLikelihood ()
 Calculate the hold out data log-likelihood using prev parameters.

double validateCurrentLogLikelihood ()
 Calculate the hold out data log-likelihood using current parameters.

double bestDataLogLikelihood ()
 Calculate the hold out data log-likelihood using the best parameters.

double interleavedIterationEM ()
 performs one EM iteration, returns log likelihood of training data

void selectTestTrain (int testPercent)
 Select training/test events.

void init ()
 Initialize attributes.

void initR ()
 Initialize R and w->d inverted list.

void writeArrays ()
 write out all the arrays to file.

bool readArray (ifstream &infile, enum pType which)
 Read a probability array (matrix) from a file.

void writeArray (ofstream &ofile, enum pType which)
 Write a probability array (matrix) to a file.


Private Attributes

const lemur::api::Index & ind
 Index to use.

int sizeZ
 number of categories

int sizeD
 number of documents

int sizeW
 number of words

lemur::utility::HashFreqVector ** data
 train d->w freq list

lemur::utility::HashFreqVector ** testData
 test (validation) d->w freq list

set< int, less< int > > * invIndex
 w->d inverted index for M step of P(w | z)

double startBeta
 Starting beta for tempered EM (TEM).

double beta
 Current beta for TEM.

double betaMin
 Minimum beta for TEM.

double betaModifier
 eta for TEM (beta = eta * beta;)

double annealcue
 annealcue value (delta)

int R
 used in M step for p_z

int numberOfIterations
 How many iterations.

int numberOfRestarts
 How many restarts.

double bestTestLL
 Best log likelihood on the test data so far.

double bestA
 Best average log likelihood on the test data so far.

bool bestOnly
 have we only loaded existing tables from files

bool ownMem
 did we allocate the test/train vectors?

double * p_z_current
 P(z) vector current iteration.

double ** p_w_z_current
 P(w|z) matrix current iteration.

double ** p_d_z_current
 P(d|z) matrix current iteration.

double * p_z_prev
 P(z) vector previous iteration.

double ** p_w_z_prev
 P(w|z) matrix previous iteration.

double ** p_d_z_prev
 P(d|z) matrix previous iteration.

double * p_z_best
 P(z) vector best iteration.

double ** p_w_z_best
 P(w|z) matrix best iteration.

double ** p_d_z_best
 P(d|z) matrix best iteration.


Detailed Description

Probabilistic Latent Semantic Analysis (PLSA), based on the Java reference implementation by Andrew Schein and Alexandrin Popescul (Penn): PennAspect (GPL).


Member Enumeration Documentation

enum lemur::cluster::PLSA::pType [private]
 

read/write array options.

Enumeration values:
P_Z 
P_W_Z 
P_D_Z 


Constructor & Destructor Documentation

lemur::cluster::PLSA::PLSA (const lemur::api::Index & dbIndex,
    int numCats,
    lemur::utility::HashFreqVector ** train,
    lemur::utility::HashFreqVector ** validate,
    int numIter,
    int numRestarts,
    double betastart,
    double betastop,
    double anneal,
    double betaMod)

building with provided train/test partitions

lemur::cluster::PLSA::PLSA (const lemur::api::Index & dbIndex,
    int testPercentage,
    int numCats,
    int numIter,
    int numRestarts,
    double betastart,
    double betastop,
    double anneal,
    double betaMod)

building without provided train/test partitions

lemur::cluster::PLSA::PLSA (const lemur::api::Index & dbIndex)

for using prebuilt tables.

pass in a filestem.

lemur::cluster::PLSA::~PLSA () [virtual]


Member Function Documentation

double lemur::cluster::PLSA::bestDataLogLikelihood () [private]

Calculate the hold out data log-likelihood using the best parameters.

double lemur::cluster::PLSA::doLogLikelihood (jointfuncType, lemur::utility::HashFreqVector **& myData) [private]

Calculate the log likelihood of a given data set using the supplied joint estimate method.
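
As an illustration of the idea only, the sketch below sums log P(d,w) over the events of a data set, with the joint estimate passed in as a callable. The names `FreqVector`, `JointFn`, and `logLikelihoodOf` are hypothetical stand-ins, not Lemur types, and weighting each term by its count is an assumption of this sketch rather than something this page states.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <map>
#include <vector>

// Hypothetical stand-in for a per-document frequency vector: term id -> count.
using FreqVector = std::map<int, int>;
// Hypothetical joint-estimate callback: (document index, term index) -> P(d,w).
using JointFn = std::function<double(int, int)>;

// Sum of count(d,w) * log P(d,w) over every (d,w) pair present in the data.
double logLikelihoodOf(const std::vector<FreqVector>& docs, const JointFn& joint) {
    double ll = 0.0;
    for (std::size_t d = 0; d < docs.size(); ++d) {
        for (const auto& entry : docs[d]) {
            const double p = joint(static_cast<int>(d), entry.first);
            assert(p > 0.0); // log is undefined for a zero estimate
            ll += entry.second * std::log(p);
        }
    }
    return ll;
}
```

Passing a different callable (previous, current, or best estimates) reuses the same loop, which is presumably why the method takes the joint estimate as a parameter.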

double** lemur::cluster::PLSA::get_p_d_z () const [inline]

P(d|z) matrix.

double** lemur::cluster::PLSA::get_p_w_z () const [inline]

P(w|z) matrix.

double* lemur::cluster::PLSA::get_p_z () const [inline]

Get the final P(z) vector.

double lemur::cluster::PLSA::getAverageLikelihood () [private]

Calculate the average likelihood of an event in the test data using the current iteration estimates.

double lemur::cluster::PLSA::getAverageLikelihoodPrev () [private]

Calculate the average likelihood of an event in the test data using the previous iteration estimates.

double lemur::cluster::PLSA::getProb (int d, int w) const

P(d,w).
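
In the standard PLSA aspect model, the joint probability is P(d,w) = sum over z of P(z) P(d|z) P(w|z). A minimal sketch of that sum follows. The `AspectModel` struct, the `jointProb` name, and the `[z][d]` / `[z][w]` indexing are assumptions made for illustration; this page does not state how the arrays returned by get_p_d_z() and get_p_w_z() are laid out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for the three probability tables (not a Lemur type).
struct AspectModel {
    std::vector<double> p_z;                // P(z), one entry per category
    std::vector<std::vector<double>> p_d_z; // P(d|z), indexed [z][d] (assumed layout)
    std::vector<std::vector<double>> p_w_z; // P(w|z), indexed [z][w] (assumed layout)
};

// P(d,w) = sum over z of P(z) * P(d|z) * P(w|z).
double jointProb(const AspectModel& m, std::size_t d, std::size_t w) {
    assert(m.p_d_z.size() == m.p_z.size() && m.p_w_z.size() == m.p_z.size());
    double sum = 0.0;
    for (std::size_t z = 0; z < m.p_z.size(); ++z) {
        sum += m.p_z[z] * m.p_d_z[z][d] * m.p_w_z[z][w];
    }
    return sum;
}
```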

void lemur::cluster::PLSA::init () [private]

Initialize attributes.

void lemur::cluster::PLSA::initializeParameters () [private]

Initialize the prev probability arrays to random values.

void lemur::cluster::PLSA::initR () [private]

Initialize R and w->d inverted list.

double lemur::cluster::PLSA::interleavedIterationEM () [private]

performs one EM iteration, returns log likelihood of training data
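
For orientation, here is a sketch of one textbook (untempered) PLSA EM pass that accumulates the M-step counts while it computes each posterior, so no full P(z|d,w) table is stored. It is not the code of this class: `Event`, `Params`, `emIteration`, and the `[z][d]` / `[z][w]` layout are hypothetical, and tempering with beta is left out for brevity.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Event { int d; int w; int n; }; // n = count of term w in document d
using Matrix = std::vector<std::vector<double>>;

struct Params {            // hypothetical container, not a Lemur type
    std::vector<double> p_z; // P(z)
    Matrix p_d_z;            // P(d|z), indexed [z][d] (assumed layout)
    Matrix p_w_z;            // P(w|z), indexed [z][w] (assumed layout)
};

// One EM pass: reads `prev`, writes the re-estimated tables to `cur`,
// and returns the training log-likelihood under `prev`.
double emIteration(const std::vector<Event>& events, const Params& prev, Params& cur) {
    assert(!events.empty() && !prev.p_z.empty());
    const std::size_t Z = prev.p_z.size();
    const std::size_t D = prev.p_d_z[0].size();
    const std::size_t W = prev.p_w_z[0].size();
    cur.p_z.assign(Z, 0.0);
    cur.p_d_z.assign(Z, std::vector<double>(D, 0.0));
    cur.p_w_z.assign(Z, std::vector<double>(W, 0.0));
    double logLik = 0.0, total = 0.0;
    std::vector<double> post(Z);
    for (const Event& e : events) {
        // E step: P(z|d,w) is proportional to P(z) P(d|z) P(w|z).
        double joint = 0.0;
        for (std::size_t z = 0; z < Z; ++z) {
            post[z] = prev.p_z[z] * prev.p_d_z[z][e.d] * prev.p_w_z[z][e.w];
            joint += post[z];
        }
        assert(joint > 0.0);
        logLik += e.n * std::log(joint);
        total += e.n;
        // M step accumulation: expected counts n(d,w) * P(z|d,w).
        for (std::size_t z = 0; z < Z; ++z) {
            const double c = e.n * post[z] / joint;
            cur.p_z[z] += c;
            cur.p_d_z[z][e.d] += c;
            cur.p_w_z[z][e.w] += c;
        }
    }
    // Normalise P(d|z) and P(w|z) by the mass of z, and P(z) by the total count.
    for (std::size_t z = 0; z < Z; ++z) {
        const double mass = cur.p_z[z];
        if (mass > 0.0) {
            for (double& v : cur.p_d_z[z]) v /= mass;
            for (double& v : cur.p_w_z[z]) v /= mass;
        }
        cur.p_z[z] = mass / total;
    }
    return logLik;
}
```

Calling the function again with `cur` as the new `prev` gives the next iteration; for plain EM the returned log-likelihood should not decrease from one pass to the next.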

void lemur::cluster::PLSA::iterate () [private]

main routine for model training.

void lemur::cluster::PLSA::iterateWithRestarts ()

Start things going.
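
The attribute names (numberOfRestarts, bestTestLL) suggest a loop that trains several times and keeps the best result on held-out data. The sketch below shows only that general pattern; `bestOverRestarts` and the `trainOnce` callback are hypothetical, and what the class actually does per restart is not documented on this page.

```cpp
#include <cassert>
#include <functional>
#include <limits>

// Run `numRestarts` independent trainings and return the best held-out score.
// `trainOnce(r)` is a hypothetical callback that trains from a fresh random
// initialisation for restart r and returns its held-out log-likelihood.
double bestOverRestarts(int numRestarts, const std::function<double(int)>& trainOnce) {
    assert(numRestarts > 0);
    double best = -std::numeric_limits<double>::infinity();
    for (int r = 0; r < numRestarts; ++r) {
        const double score = trainOnce(r);
        if (score > best) {
            best = score; // a full implementation would also save the tables here
        }
    }
    return best;
}
```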

double lemur::cluster::PLSA::jointEstimate (int indexD, int indexW) [private]

Estimates P(d,w) using previous parameter estimates.

double lemur::cluster::PLSA::jointEstimateBest (int indexD, int indexW) [private]

Estimates P(d,w) using best parameter estimates.

double lemur::cluster::PLSA::jointEstimateBeta (int indexD, int indexW) [private]

Joint estimate using previous parameter estimates where the terms in summation are raised to power beta.
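
Read literally, this is the sum over z of (P(z) P(d|z) P(w|z))^beta. A small sketch under that reading follows; `temperedJoint` is a hypothetical free function that takes, for one fixed (d, w) pair, the per-category values as plain vectors.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum over z of (P(z) * P(d|z) * P(w|z))^beta for one fixed (d, w) pair.
// pz[z], pdz[z], and pwz[z] hold P(z), P(d|z), and P(w|z) for category z.
double temperedJoint(const std::vector<double>& pz,
                     const std::vector<double>& pdz,
                     const std::vector<double>& pwz,
                     double beta) {
    assert(pz.size() == pdz.size() && pz.size() == pwz.size());
    double sum = 0.0;
    for (std::size_t z = 0; z < pz.size(); ++z) {
        sum += std::pow(pz[z] * pdz[z] * pwz[z], beta);
    }
    return sum;
}
```

With beta equal to 1 this reduces to the ordinary joint estimate; smaller values of beta flatten the differences between categories.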

double lemur::cluster::PLSA::jointEstimateCurrent (int indexD, int indexW) [private]

Estimates P(d,w) using current parameter estimates.

double lemur::cluster::PLSA::logLikelihood () [private]

Calculate the training data log-likelihood using prev parameters.

int lemur::cluster::PLSA::numCats () const [inline]

number of categories

int lemur::cluster::PLSA::numDocs () const [inline]

number of docs

int lemur::cluster::PLSA::numWords () const [inline]

number of terms

bool lemur::cluster::PLSA::readArray (ifstream & infile, enum pType which) [private]

Read a probability array (matrix) from a file.

bool lemur::cluster::PLSA::readArrays ()

should these be public? On ctor if not constructing.

void lemur::cluster::PLSA::selectTestTrain (int testPercent) [private]

Select training/test events.
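
One plausible way to hold out a percentage of events is to assign each term occurrence to the test side with probability testPercent / 100. The sketch below does exactly that and nothing more. Whether the class splits per occurrence, per (d,w) pair, or per document is not stated here, so the per-occurrence choice, the `splitEvents` name, the `FreqVector` alias, and the explicit seed are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-document frequency vector: term id -> count.
using FreqVector = std::map<int, int>;

// Split every (document, term) occurrence into train or test; each occurrence
// goes to the test side with probability testPercent / 100.
std::pair<std::vector<FreqVector>, std::vector<FreqVector>>
splitEvents(const std::vector<FreqVector>& docs, int testPercent, unsigned seed) {
    assert(testPercent >= 0 && testPercent <= 100);
    std::mt19937 rng(seed);
    std::uniform_int_distribution<int> pick(0, 99);
    std::vector<FreqVector> train(docs.size()), test(docs.size());
    for (std::size_t d = 0; d < docs.size(); ++d) {
        for (const auto& entry : docs[d]) {
            for (int i = 0; i < entry.second; ++i) {
                if (pick(rng) < testPercent) {
                    ++test[d][entry.first];
                } else {
                    ++train[d][entry.first];
                }
            }
        }
    }
    return {train, test};
}
```

The two returned vectors keep the document indexing of the input, so train and test counts for the same (d,w) pair always add up to the original count.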

void lemur::cluster::PLSA::setBestToCurrent () [private]

Copy current iteration data to best iteration data.

void lemur::cluster::PLSA::setBestToPrev () [private]

Copy previous iteration data to best iteration data.

void lemur::cluster::PLSA::setCurrentToBest () [private]

Copy best iteration data to current iteration data.

void lemur::cluster::PLSA::setPrevToBest () [private]

Copy best iteration data to previous iteration data.

void lemur::cluster::PLSA::setPrevToCurrent () [private]

Copy current iteration data to previous iteration data.

double lemur::cluster::PLSA::validateCurrentLogLikelihood () [private]

Calculate the hold out data log-likelihood using current parameters.

double lemur::cluster::PLSA::validateDataLogLikelihood () [private]

Calculate the hold out data log-likelihood using prev parameters.

void lemur::cluster::PLSA::writeArray (ofstream & ofile, enum pType which) [private]

Write a probability array (matrix) to a file.

void lemur::cluster::PLSA::writeArrays () [private]

write out all the arrays to file.


Member Data Documentation

double lemur::cluster::PLSA::annealcue [private]
 

annealcue value (delta)

double lemur::cluster::PLSA::bestA [private]
 

Best average log likelihood on the test data so far.

bool lemur::cluster::PLSA::bestOnly [private]
 

have we only loaded existing tables from files

double lemur::cluster::PLSA::bestTestLL [private]
 

Best log likelihood on the test data so far.

double lemur::cluster::PLSA::beta [private]
 

Current beta for tempered EM (TEM).

double lemur::cluster::PLSA::betaMin [private]
 

Minimum beta for tempered EM (TEM).

double lemur::cluster::PLSA::betaModifier [private]
 

eta for TEM (beta = eta * beta;)
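
The update beta = eta * beta gives a geometric decay. The sketch below simply lists the values such a schedule would visit between a start value and a minimum; `betaSchedule` is a hypothetical helper, and when the class actually applies the update (for instance, in response to the annealcue threshold) is not described on this page.

```cpp
#include <cassert>
#include <vector>

// List the beta values visited when beta is repeatedly multiplied by eta,
// starting at startBeta and stopping once it would fall below betaMin.
std::vector<double> betaSchedule(double startBeta, double betaMin, double eta) {
    assert(eta > 0.0 && eta < 1.0 && betaMin > 0.0);
    std::vector<double> schedule;
    for (double beta = startBeta; beta >= betaMin; beta *= eta) {
        schedule.push_back(beta);
    }
    return schedule;
}
```

For example, a start of 1.0, a minimum of 0.2, and eta of 0.5 yields 1.0, 0.5, 0.25.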

lemur::utility::HashFreqVector** lemur::cluster::PLSA::data [private]
 

train d->w freq list

const lemur::api::Index& lemur::cluster::PLSA::ind [private]
 

Index to use.

set<int, less<int> >* lemur::cluster::PLSA::invIndex [private]
 

w->d inverted index for M step of P(w | z)

int lemur::cluster::PLSA::numberOfIterations [private]
 

How many iterations.

int lemur::cluster::PLSA::numberOfRestarts [private]
 

How many restarts.

bool lemur::cluster::PLSA::ownMem [private]
 

did we allocate the test/train vectors?

double** lemur::cluster::PLSA::p_d_z_best [private]
 

P(d|z) matrix best iteration.

double** lemur::cluster::PLSA::p_d_z_current [private]
 

P(d|z) matrix current iteration.

double** lemur::cluster::PLSA::p_d_z_prev [private]
 

P(d|z) matrix previous iteration.

double** lemur::cluster::PLSA::p_w_z_best [private]
 

P(w|z) matrix best iteration.

double** lemur::cluster::PLSA::p_w_z_current [private]
 

P(w|z) matrix current iteration.

double** lemur::cluster::PLSA::p_w_z_prev [private]
 

P(w|z) matrix previous iteration.

double* lemur::cluster::PLSA::p_z_best [private]
 

P(z) vector best iteration.

double* lemur::cluster::PLSA::p_z_current [private]
 

P(z) vector current iteration.

double* lemur::cluster::PLSA::p_z_prev [private]
 

P(z) vector previous iteration.

int lemur::cluster::PLSA::R [private]
 

used in M step for p_z

int lemur::cluster::PLSA::sizeD [private]
 

number of documents

int lemur::cluster::PLSA::sizeW [private]
 

number of words

int lemur::cluster::PLSA::sizeZ [private]
 

number of categories

double lemur::cluster::PLSA::startBeta [private]
 

Starting beta for tempered EM (TEM).

lemur::utility::HashFreqVector** lemur::cluster::PLSA::testData [private]
 

test (validation) d->w freq list


The documentation for this class was generated from the following files:
Generated on Tue Jun 15 11:03:05 2010 for Lemur by doxygen 1.3.4