DD API Operations


Acquire Content

class models.domain_discovery_model.DomainModel[source]
queryWeb(terms, max_url_count=100, session=None)[source]

Issue a query on the web. Results are stored in Elasticsearch; nothing is returned here.

Parameters:
terms (string): Search query string
max_url_count (int): Number of pages to query. Maximum allowed = 100
session (json): should have domainId
Returns:
None
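Since max_url_count is capped at 100, a caller can clamp the requested page count before issuing the query. A minimal sketch, assuming a small clamping helper and a hypothetical domain id (neither is part of the API):

```python
# Clamp the requested page count to the documented maximum of 100.
MAX_URL_COUNT = 100

def clamp_url_count(requested):
    """Return a max_url_count value that respects the API limit."""
    return max(1, min(requested, MAX_URL_COUNT))

# A session is a JSON-like dict that must carry the domainId.
session = {"domainId": "my_domain_id"}  # hypothetical id

print(clamp_url_count(250))  # clamped to 100
print(clamp_url_count(40))   # unchanged
```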
uploadUrls(urls_str, session)[source]

Download pages corresponding to an already known set of domain URLs.

Parameters:
urls_str (string): Space-separated list of URLs
session (json): should have domainId
Returns:
number of pages downloaded (int)
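Because uploadUrls expects a single space-separated string rather than a list, known URLs must be joined first. A sketch under that assumption (the URLs and domain id are hypothetical):

```python
# URLs already known to belong to the domain (hypothetical examples).
known_urls = [
    "http://example.com/page1",
    "http://example.com/page2",
]

# uploadUrls takes one space-separated string, not a list.
urls_str = " ".join(known_urls)

session = {"domainId": "my_domain_id"}  # hypothetical id
print(urls_str)
```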

The content can be extended by crawling the given pages one level forward. The assumption here is that a relevant page will contain links to other relevant pages.

Parameters:
urls (list): list of URLs to crawl forward
session (json): should have domainId
Return:
None (results are downloaded into Elasticsearch)

The content can be extended by crawling the given pages one level back to the pages that link to them. The assumption here is that a page containing the link to the given relevant page will contain links to other relevant pages.

Parameters:
urls (list): list of URLs to crawl backward
session (json): should have domainId
Return:
None (results are downloaded into Elasticsearch)

Annotate Content

class models.domain_discovery_model.DomainModel[source]
setPagesTag(pages, tag, applyTagFlag, session)[source]

Tag the pages with the given tag, which can be a custom tag or ‘Relevant’/’Irrelevant’ to indicate relevance or irrelevance to the domain of interest. Tags help cluster and categorize the pages, and they also help build computational models of the domain.

Parameters:
pages (urls): list of urls to apply tag
tag (string): custom tag, ‘Relevant’, ‘Irrelevant’
applyTagFlag (bool): True - Add tag, False - Remove tag
session (json): Should contain domainId
Returns:
The string “Completed Process”
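The applyTagFlag semantics (True adds the tag, False removes it) can be illustrated with a small stand-in that keeps a url-to-tags mapping. This is a sketch of the flag's behavior only, not the actual implementation:

```python
def set_pages_tag(page_tags, pages, tag, apply_tag_flag):
    """Mimic applyTagFlag semantics on a url -> set-of-tags dict."""
    for url in pages:
        tags = page_tags.setdefault(url, set())
        if apply_tag_flag:
            tags.add(tag)        # True: add the tag
        else:
            tags.discard(tag)    # False: remove the tag if present
    return "Completed Process"   # the documented return value

page_tags = {}
set_pages_tag(page_tags, ["http://example.com/a"], "Relevant", True)
set_pages_tag(page_tags, ["http://example.com/a"], "Relevant", False)
```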
setTermsTag(terms, tag, applyTagFlag, session)[source]

Tag the terms as ‘Positive’/’Negative’ to indicate relevance or irrelevance to the domain of interest. Tags help rerank terms so that the ones relevant to the domain are shown first.

Parameters:
terms (string): list of terms to apply tag
tag (string): ‘Positive’ or ‘Negative’
applyTagFlag (bool): True - Add tag, False - Remove tag
session (json): Should contain domainId
Returns:
None

Summarize Content

class models.domain_discovery_model.DomainModel[source]
extractTerms(opt_maxNumberOfTerms=40, session=None)[source]

Extract the most relevant unigrams, bigrams and trigrams that summarize the pages. These could reveal previously unknown information about the domain, which in turn could suggest further queries for acquiring content.

Parameters:

opt_maxNumberOfTerms (int): Number of terms to return

session (json): should have domainId

Returns:
array: [[term, frequencyInRelevantPages, frequencyInIrrelevantPages, tags], ...]
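Each entry of the returned array pairs a term with its frequency in relevant and irrelevant pages, so a caller can, for instance, keep the terms that occur mostly in relevant pages. A sketch over hypothetical sample output in the documented format:

```python
# Hypothetical extractTerms output:
# [term, frequencyInRelevantPages, frequencyInIrrelevantPages, tags]
terms = [
    ["ebola outbreak", 40, 2, ["Positive"]],
    ["cookie policy", 5, 30, []],
    ["virus transmission", 25, 3, []],
]

# Keep terms that appear more often in relevant pages than irrelevant ones.
relevant_terms = [t[0] for t in terms if t[1] > t[2]]
print(relevant_terms)
```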
make_topic_model(session, tokenizer, vectorizer, model, ntopics)[source]

Build topic model from the corpus of the supplied DDT domain.

The topic model is represented as a topik.TopikProject object and is persisted to disk, recording the model parameters and the location of the data. The output of the topic model itself is stored in Elasticsearch.

Parameters:

domain (str): DDT domain name as stored in Elasticsearch, so lowercase and with underscores in place of spaces.

tokenizer (str): A tokenizer from topik.tokenizer.registered_tokenizers

vectorizer (str): A vectorization method from topik.vectorizers.registered_vectorizers

model (str): A topic model from topik.vectorizers.registered_models

ntopics (int): The number of topics to be used when modeling the corpus.

Returns:

model: topik model, encoding things like term frequencies, etc.
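The domain parameter above is the DDT domain name as stored in Elasticsearch: lowercase, with underscores in place of spaces. That normalization can be sketched as (the helper name and display name are assumptions for illustration):

```python
def to_es_domain_name(display_name):
    """Normalize a DDT domain display name to its Elasticsearch form:
    lowercase, with underscores in place of spaces."""
    return display_name.lower().replace(" ", "_")

print(to_es_domain_name("Ebola Outbreak"))  # ebola_outbreak
```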

Organize Content

class models.domain_discovery_model.DomainModel[source]
getPagesProjection(session)[source]

Organize content by criteria such as relevance, similarity or category, which makes it easy to analyze groups of pages. The ‘x’,’y’ coordinates returned project each page into 2D while maintaining the clustering of the chosen projection. The projection criterion is specified in the session object.

Parameters:
session (json): Should contain ‘domainId’. Should also contain ‘activeProjectionAlg’, which currently takes the values ‘tsne’, ‘pca’ or ‘kmeans’
Returns:
Dictionary in the format:
{
‘last_downloaded_url_epoch’: 1432310403 (in seconds),
‘pages’: [
[url1, x, y, tags, retrieved], (tags are a list, potentially empty)
[url2, x, y, tags, retrieved],
[url3, x, y, tags, retrieved],
]
}
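The returned dictionary can be unpacked into a timestamp and per-page 2D coordinates. A sketch over a hypothetical result in the documented format:

```python
from datetime import datetime, timezone

# Hypothetical getPagesProjection result in the documented format.
result = {
    "last_downloaded_url_epoch": 1432310403,  # seconds since the epoch
    "pages": [
        ["http://example.com/a", 0.12, -0.34, ["Relevant"], 1432310000],
        ["http://example.com/b", 0.56, 0.78, [], 1432310100],
    ],
}

# Convert the epoch to a datetime and index coordinates by URL.
last_download = datetime.fromtimestamp(
    result["last_downloaded_url_epoch"], tz=timezone.utc
)
coords = {url: (x, y) for url, x, y, tags, retrieved in result["pages"]}
print(last_download.year, coords)
```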

Filter Content

class models.domain_discovery_model.DomainModel[source]
getPages(session)[source]

Find pages that satisfy the specified criteria. One or more of the following criteria are specified in the session object as ‘pageRetrievalCriteria’:

‘Most Recent’, ‘More like’, ‘Queries’, ‘Tags’, ‘Model Tags’, ‘Maybe relevant’, ‘Maybe irrelevant’, ‘Unsure’

and filtered by keywords specified in the session object as ‘filter’.

Parameters:
session (json): Should contain ‘domainId’, and ‘pageRetrievalCriteria’ or ‘filter’
Returns:
json: {url1: {snippet, image_url, title, tags, retrieved}} (tags are a list, potentially empty)
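A retrieval session combines the domainId with either a pageRetrievalCriteria value or a keyword filter. A sketch of the two session shapes (the id and keyword are hypothetical):

```python
# Retrieve pages by one of the documented criteria.
session = {
    "domainId": "my_domain_id",       # hypothetical id
    "pageRetrievalCriteria": "Tags",  # one of the documented criteria
}

# Alternatively, filter by keyword instead of a retrieval criterion.
filter_session = {
    "domainId": "my_domain_id",
    "filter": "ebola",                # hypothetical keyword
}
print(session["pageRetrievalCriteria"], filter_session["filter"])
```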

Generate Model

class models.domain_discovery_model.DomainModel[source]
createModel(session, zip=True)[source]

Create an ACHE model to be used by the SeedFinder and the focused crawler. It saves the classifiers, features, and training data in the <project>/data/<domain> directory. If zip=True, all generated files and folders are zipped into a single file.

Parameters:
session (json): should have domainId
Returns:
None