DD API Operations


Acquire Content

class models.domain_discovery_model.DomainModel[source]
queryWeb(terms, max_url_count=100, session=None)[source]

Issue a query on the web. Results are stored in Elasticsearch; nothing is returned here.

Parameters:
terms (string): Search query string
max_url_count (int): Number of pages to query. Maximum allowed = 100
session (json): should have domainId
Returns:
None
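Since max_url_count is capped at 100, a caller can clamp the requested page count before issuing the query. A minimal sketch, assuming a small clamping helper and a hypothetical domain id (neither is part of the API):

```python
# Clamp the requested page count to the documented maximum of 100.
MAX_URL_COUNT = 100

def clamp_url_count(requested):
    """Return a max_url_count value that respects the API limit."""
    return max(1, min(requested, MAX_URL_COUNT))

# A session is a JSON-like dict that must carry the domainId.
session = {"domainId": "my_domain_id"}  # hypothetical id

print(clamp_url_count(250))  # clamped to 100
print(clamp_url_count(40))   # unchanged
```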
uploadUrls(urls_str, session)[source]

Download pages corresponding to an already known set of domain URLs.

Parameters:
urls_str (string): Space-separated list of URLs
session (json): should have domainId
Returns:
number of pages downloaded (int)
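Because uploadUrls expects a single space-separated string rather than a list, known URLs must be joined first. A sketch under that assumption (the URLs and domain id are hypothetical):

```python
# URLs already known to belong to the domain (hypothetical examples).
known_urls = [
    "http://example.com/page1",
    "http://example.com/page2",
]

# uploadUrls takes one space-separated string, not a list.
urls_str = " ".join(known_urls)

session = {"domainId": "my_domain_id"}  # hypothetical id
print(urls_str)
```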

The content can be extended by crawling the given pages one level forward. The assumption here is that a relevant page will contain links to other relevant pages.

Parameters:
urls (list): list of URLs to crawl forward
session (json): should have domainId
Return:
None (results are downloaded into Elasticsearch)

The content can be extended by crawling the given pages one level back to the pages that link to them. The assumption here is that a page containing the link to the given relevant page will contain links to other relevant pages.

Parameters:
urls (list): list of URLs to crawl backward
session (json): should have domainId
Return:
None (results are downloaded into Elasticsearch)

Annotate Content

class models.domain_discovery_model.DomainModel[source]
setPagesTag(pages, tag, applyTagFlag, session)[source]

Tag the pages with the given tag, which can be a custom tag or ‘Relevant’/’Irrelevant’ to indicate relevance or irrelevance to the domain of interest. Tags help cluster and categorize the pages, and they also help build computational models of the domain.

Parameters:
pages (urls): list of urls to apply tag
tag (string): custom tag, ‘Relevant’, ‘Irrelevant’
applyTagFlag (bool): True - Add tag, False - Remove tag
session (json): Should contain domainId
Returns:
The string “Completed Process”
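The applyTagFlag semantics (True adds the tag, False removes it) can be illustrated with a small stand-in that keeps a url-to-tags mapping. This is a sketch of the flag's behavior only, not the actual implementation:

```python
def set_pages_tag(page_tags, pages, tag, apply_tag_flag):
    """Mimic applyTagFlag semantics on a url -> set-of-tags dict."""
    for url in pages:
        tags = page_tags.setdefault(url, set())
        if apply_tag_flag:
            tags.add(tag)        # True: add the tag
        else:
            tags.discard(tag)    # False: remove the tag if present
    return "Completed Process"   # the documented return value

page_tags = {}
set_pages_tag(page_tags, ["http://example.com/a"], "Relevant", True)
set_pages_tag(page_tags, ["http://example.com/a"], "Relevant", False)
```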
setTermsTag(terms, tag, applyTagFlag, session)[source]

Tag the terms as ‘Positive’/’Negative’ to indicate relevance or irrelevance to the domain of interest. Tags help rerank terms so that the ones relevant to the domain are shown first.

Parameters:
terms (string): list of terms to apply tag
tag (string): ‘Positive’ or ‘Negative’
applyTagFlag (bool): True - Add tag, False - Remove tag
session (json): Should contain domainId
Returns:
None

Summarize Content

class models.domain_discovery_model.DomainModel[source]
extractTerms(opt_maxNumberOfTerms=40, session=None)[source]

Extract the most relevant unigrams, bigrams and trigrams that summarize the pages. These could reveal previously unknown information about the domain, which in turn could suggest further queries for acquiring content.

Parameters:

opt_maxNumberOfTerms (int): Number of terms to return

session (json): should have domainId

Returns:
array: [[term, frequencyInRelevantPages, frequencyInIrrelevantPages, tags], ...]
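Each entry of the returned array pairs a term with its frequency in relevant and irrelevant pages, so a caller can, for instance, keep the terms that occur mostly in relevant pages. A sketch over hypothetical sample output in the documented format:

```python
# Hypothetical extractTerms output:
# [term, frequencyInRelevantPages, frequencyInIrrelevantPages, tags]
terms = [
    ["ebola outbreak", 40, 2, ["Positive"]],
    ["cookie policy", 5, 30, []],
    ["virus transmission", 25, 3, []],
]

# Keep terms that appear more often in relevant pages than irrelevant ones.
relevant_terms = [t[0] for t in terms if t[1] > t[2]]
print(relevant_terms)
```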
make_topic_model(session, tokenizer, vectorizer, model, ntopics)[source]

Build topic model from the corpus of the supplied DDT domain.

The topic model is represented as a topik.TopikProject object and is persisted to disk, recording the model parameters and the location of the data. The output of the topic model itself is stored in Elasticsearch.

Parameters:

domain (str): DDT domain name as stored in Elasticsearch, so lowercase and with underscores in place of spaces.

tokenizer (str): A tokenizer from topik.tokenizer.registered_tokenizers

vectorizer (str): A vectorization method from topik.vectorizers.registered_vectorizers

model (str): A topic model from topik.vectorizers.registered_models

ntopics (int): The number of topics to be used when modeling the corpus.

Returns:

model: topik model, encoding things like term frequencies, etc.
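The domain parameter above is the DDT domain name as stored in Elasticsearch: lowercase, with underscores in place of spaces. That normalization can be sketched as (the helper name and display name are assumptions for illustration):

```python
def to_es_domain_name(display_name):
    """Normalize a DDT domain display name to its Elasticsearch form:
    lowercase, with underscores in place of spaces."""
    return display_name.lower().replace(" ", "_")

print(to_es_domain_name("Ebola Outbreak"))  # ebola_outbreak
```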

Organize Content

class models.domain_discovery_model.DomainModel[source]
getPagesProjection(session)[source]

Organize content by criteria such as relevance, similarity or category, which makes it easy to analyze groups of pages. The ‘x’,’y’ coordinates returned project each page into 2D while maintaining the clustering of the chosen projection. The projection criterion is specified in the session object.

Parameters:
session (json): Should contain ‘domainId’. Should also contain ‘activeProjectionAlg’, which currently takes the values ‘tsne’, ‘pca’ or ‘kmeans’
Returns:
Dictionary in the format:
{
‘last_downloaded_url_epoch’: 1432310403 (in seconds),
‘pages’: [
[url1, x, y, tags, retrieved], (tags are a list, potentially empty)
[url2, x, y, tags, retrieved],
[url3, x, y, tags, retrieved],
]
}
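The returned dictionary can be unpacked into a timestamp and per-page 2D coordinates. A sketch over a hypothetical result in the documented format:

```python
from datetime import datetime, timezone

# Hypothetical getPagesProjection result in the documented format.
result = {
    "last_downloaded_url_epoch": 1432310403,  # seconds since the epoch
    "pages": [
        ["http://example.com/a", 0.12, -0.34, ["Relevant"], 1432310000],
        ["http://example.com/b", 0.56, 0.78, [], 1432310100],
    ],
}

# Convert the epoch to a datetime and index coordinates by URL.
last_download = datetime.fromtimestamp(
    result["last_downloaded_url_epoch"], tz=timezone.utc
)
coords = {url: (x, y) for url, x, y, tags, retrieved in result["pages"]}
print(last_download.year, coords)
```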

Filter Content

class models.domain_discovery_model.DomainModel[source]
getPages(session)[source]

Find pages that satisfy the specified criteria. One or more of the following criteria are specified in the session object as ‘pageRetrievalCriteria’:

‘Most Recent’, ‘More like’, ‘Queries’, ‘Tags’, ‘Model Tags’, ‘Maybe relevant’, ‘Maybe irrelevant’, ‘Unsure’

and filtered by keywords specified in the session object as ‘filter’.

Parameters:
session (json): Should contain ‘domainId’, and ‘pageRetrievalCriteria’ or ‘filter’
Returns:
json: {url1: {snippet, image_url, title, tags, retrieved}} (tags are a list, potentially empty)
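A retrieval session combines the domainId with either a pageRetrievalCriteria value or a keyword filter. A sketch of the two session shapes (the id and keyword are hypothetical):

```python
# Retrieve pages by one of the documented criteria.
session = {
    "domainId": "my_domain_id",       # hypothetical id
    "pageRetrievalCriteria": "Tags",  # one of the documented criteria
}

# Alternatively, filter by keyword instead of a retrieval criterion.
filter_session = {
    "domainId": "my_domain_id",
    "filter": "ebola",                # hypothetical keyword
}
print(session["pageRetrievalCriteria"], filter_session["filter"])
```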

Generate Model

class models.domain_discovery_model.DomainModel[source]
createModel(session, zip=True)[source]

Create an ACHE model to be used by the SeedFinder and the focused crawler. It saves the classifiers, features, and training data in the <project>/data/<domain> directory. If zip=True, all generated files and folders are zipped into a single file.

Parameters:
session (json): should have domainId
Returns:
None