DD API Operations
models.domain_discovery_model.random() → x in the interval [0, 1)
Acquire Content
class models.domain_discovery_model.DomainModel

queryWeb(terms, max_url_count=100, session=None)
Issue a query on the web. The results are stored in Elasticsearch; nothing is returned here.
- Parameters:
  - terms (string): search query string
  - max_url_count (int): number of pages to query; maximum allowed is 100
  - session (json): should contain 'domainId'
- Returns:
  - None
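A minimal usage sketch. The `DomainModel` import, the `dd` instance, and the domain id are hypothetical placeholders, and the commented-out call assumes DDT is installed with a running Elasticsearch backend:

```python
# Assumes a running DDT/Elasticsearch backend; 'ebola_dd' is a placeholder domain id.
# from models.domain_discovery_model import DomainModel
# dd = DomainModel()

session = {"domainId": "ebola_dd"}      # queryWeb reads the domain id from the session
terms = "ebola outbreak treatment"      # space-separated search terms

# Issues the query; results are stored in Elasticsearch, nothing is returned:
# dd.queryWeb(terms, max_url_count=100, session=session)
```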
uploadUrls(urls_str, session)
Download pages corresponding to an already known set of domain URLs.
- Parameters:
  - urls_str (string): space-separated list of URLs
  - session (json): should contain 'domainId'
- Returns:
  - number of pages downloaded (int)
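Note that this method takes one space-separated string of URLs rather than a list. A sketch, with placeholder URLs and domain id; the actual call (commented out) needs a running backend:

```python
known_urls = [
    "http://www.cdc.gov/vhf/ebola/",
    "http://www.who.int/csr/disease/ebola/en/",
]
urls_str = " ".join(known_urls)        # uploadUrls wants a single space-separated string
session = {"domainId": "ebola_dd"}     # placeholder domain id

# count = dd.uploadUrls(urls_str, session)   # returns number of pages downloaded (int)
```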
getForwardLinks(urls, session)
Extend the content by crawling the given pages one level forward. The assumption is that a relevant page will contain links to other relevant pages.
- Parameters:
  - urls (list): list of URLs to crawl forward
  - session (json): should contain 'domainId'
- Returns:
  - None (results are downloaded into Elasticsearch)
getBackwardLinks(urls, session)
Extend the content by crawling one level back to the pages that link to the given pages. The assumption is that a page containing a link to a given relevant page will also contain links to other relevant pages.
- Parameters:
  - urls (list): list of URLs to crawl backward
  - session (json): should contain 'domainId'
- Returns:
  - None (results are downloaded into Elasticsearch)
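Unlike uploadUrls, both link-crawling methods take a Python list of URLs. A sketch with a placeholder URL and domain id; the commented-out calls need a running backend:

```python
relevant_pages = [
    "http://www.cdc.gov/vhf/ebola/outbreaks/index.html",  # placeholder URL
]
session = {"domainId": "ebola_dd"}  # placeholder domain id

# Follow links found *on* these pages (forward), or find pages that link *to* them
# (backward); either way, results go into Elasticsearch and nothing is returned:
# dd.getForwardLinks(relevant_pages, session)
# dd.getBackwardLinks(relevant_pages, session)
```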
Annotate Content
setPagesTag(pages, tag, applyTagFlag, session)
Tag the pages with the given tag, which can be a custom tag or 'Relevant'/'Irrelevant' to indicate relevance or irrelevance to the domain of interest. Tags help in clustering and categorizing the pages, and in building computational models of the domain.
- Parameters:
  - pages (urls): list of URLs to apply the tag to
  - tag (string): custom tag, 'Relevant', or 'Irrelevant'
  - applyTagFlag (bool): True adds the tag, False removes it
  - session (json): should contain 'domainId'
- Returns:
  - the string "Completed Process"
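The applyTagFlag semantics (True adds, False removes) can be illustrated with a small local helper; the helper, the URL, and the `dd` instance are illustrative only, not part of the DDT API:

```python
def apply_tag(tags, tag, apply_flag):
    """Local illustration of applyTagFlag: True adds the tag, False removes it."""
    return sorted(set(tags) | {tag}) if apply_flag else sorted(set(tags) - {tag})

tags = apply_tag([], "Relevant", True)      # adds: ['Relevant']
tags = apply_tag(tags, "Relevant", False)   # removes: []

# The actual call (needs a running backend; 'dd' and the URL are placeholders):
# dd.setPagesTag(["http://www.cdc.gov/vhf/ebola/"], "Relevant", True, session)
```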
setTermsTag(terms, tag, applyTagFlag, session)
Tag the terms as 'Positive'/'Negative' to indicate relevance or irrelevance to the domain of interest. Tags help in reranking terms so that those relevant to the domain are shown first.
- Parameters:
  - terms (string): list of terms to apply the tag to
  - tag (string): 'Positive' or 'Negative'
  - applyTagFlag (bool): True adds the tag, False removes it
  - session (json): should contain 'domainId'
- Returns:
  - None
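A sketch of tagging terms; note that terms use 'Positive'/'Negative' where pages use 'Relevant'/'Irrelevant'. The term list, domain id, and `dd` instance are placeholders:

```python
terms = "hemorrhagic fever quarantine"   # terms to tag, passed as a string per the docs
session = {"domainId": "ebola_dd"}       # placeholder domain id

# Mark the terms as relevant to the domain (needs a running backend):
# dd.setTermsTag(terms, "Positive", True, session)
```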
Summarize Content
extractTerms(opt_maxNumberOfTerms=40, session=None)
Extract the most relevant unigrams, bigrams, and trigrams that summarize the pages. These can reveal unknown information about the domain and, in turn, suggest further queries for acquiring content.
- Parameters:
  - opt_maxNumberOfTerms (int): number of terms to return
  - session (json): should contain 'domainId'
- Returns:
  - array: [[term, frequencyInRelevantPages, frequencyInIrrelevantPages, tags], ...]
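The documented return shape lends itself to simple post-processing. The entries below are illustrative values, not real output, and the commented-out call needs a running backend:

```python
# terms = dd.extractTerms(opt_maxNumberOfTerms=40, session={"domainId": "ebola_dd"})
# Each entry is [term, frequencyInRelevantPages, frequencyInIrrelevantPages, tags].
terms = [                                   # illustrative values, not real output
    ["ebola virus", 120, 3, ["Positive"]],
    ["privacy policy", 15, 40, []],
    ["hemorrhagic fever", 80, 1, []],
]
# Rank terms by how much more often they appear in relevant pages:
ranked = sorted(terms, key=lambda t: t[1] - t[2], reverse=True)
```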
make_topic_model(session, tokenizer, vectorizer, model, ntopics)
Build a topic model from the corpus of the supplied DDT domain.
The topic model is represented as a topik.TopikProject object and persisted to disk, recording the model parameters and the location of the data. The output of the topic model itself is stored in Elasticsearch.
- Parameters:
  - domain (str): DDT domain name as stored in Elasticsearch, i.e. lowercase and with underscores in place of spaces
  - tokenizer (str): a tokenizer from topik.tokenizers.registered_tokenizers
  - vectorizer (str): a vectorization method from topik.vectorizers.registered_vectorizers
  - model (str): a topic model from topik.models.registered_models
  - ntopics (int): the number of topics to be used when modeling the corpus
- Returns:
  - model: topik model, encoding things like term frequencies
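A hedged parameter sketch. The registry names "simple", "bag_of_words", and "lda" are assumptions about common topik registry entries and should be checked against the registries on your install; the `dd` instance and domain id are placeholders:

```python
session = {"domainId": "ebola_dd"}  # placeholder domain id

# The tokenizer/vectorizer/model strings must be names registered with topik;
# the values below are assumed, not confirmed by the DDT docs:
# topic_model = dd.make_topic_model(session, tokenizer="simple",
#                                   vectorizer="bag_of_words",
#                                   model="lda", ntopics=10)
```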
Organize Content
getPagesProjection(session)
Organize content by criteria such as relevance, similarity, or category, which makes it easy to analyze groups of pages. The 'x', 'y' coordinates returned project each page into 2D while preserving the clustering of the chosen projection. The projection criterion is specified in the session object.
- Parameters:
  - session (json): should contain 'domainId', and 'activeProjectionAlg', which currently takes the values 'tsne', 'pca', or 'kmeans'
- Returns:
  - dictionary in the format (tags is a list, potentially empty):
    {
      'last_downloaded_url_epoch': 1432310403 (in seconds),
      'pages': [
        [url1, x, y, tags, retrieved],
        [url2, x, y, tags, retrieved],
        [url3, x, y, tags, retrieved],
        ...
      ]
    }
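A sketch of consuming the documented return shape; the URLs, coordinates, and timestamp are illustrative values, and the commented-out call needs a running backend:

```python
session = {"domainId": "ebola_dd", "activeProjectionAlg": "tsne"}  # placeholders

# projection = dd.getPagesProjection(session)
# Documented return shape, filled with illustrative values:
projection = {
    "last_downloaded_url_epoch": 1432310403,
    "pages": [
        ["http://example.org/a", 0.12, -1.3, ["Relevant"], True],
        ["http://example.org/b", 2.5, 0.75, [], True],
    ],
}
# 2D coordinates of the pages that carry at least one tag:
tagged = [(url, x, y) for url, x, y, tags, retrieved in projection["pages"] if tags]
```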
Filter Content
getPages(session)
Find pages that satisfy the specified criteria. One or more of the following criteria are specified in the session object as 'pageRetrievalCriteria':
'Most Recent', 'More like', 'Queries', 'Tags', 'Model Tags', 'Maybe relevant', 'Maybe irrelevant', 'Unsure'
Pages can additionally be filtered by keywords specified in the session object as 'filter'.
- Parameters:
  - session (json): should contain 'domainId', and 'pageRetrievalCriteria' or 'filter'
- Returns:
  - json: {url1: {snippet, image_url, title, tags, retrieved}} (tags is a list, potentially empty)
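A sketch of building the session criteria and consuming the documented return shape; the URL, snippet, and title are illustrative values, and the commented-out call needs a running backend:

```python
session = {
    "domainId": "ebola_dd",            # placeholder domain id
    "pageRetrievalCriteria": "Tags",   # one of the criteria listed above
    "filter": "outbreak",              # optional keyword filter
}
# pages = dd.getPages(session)
# Documented return shape, filled with illustrative values:
pages = {
    "http://example.org/a": {
        "snippet": "Ebola outbreak response...", "image_url": "", "title": "Outbreak",
        "tags": ["Relevant"], "retrieved": True,
    },
}
relevant = [url for url, meta in pages.items() if "Relevant" in meta["tags"]]
```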
Generate Model
createModel(session, zip=True)
Create an ACHE model to be used by the SeedFinder and the focused crawler. It saves the classifiers, features, and training data in the <project>/data/<domain> directory. If zip=True, all generated files and folders are zipped into a single file.
- Parameters:
  - session (json): should contain 'domainId'
- Returns:
  - None
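A final sketch of model generation; the `dd` instance and domain id are placeholders, and the commented-out call needs a running backend:

```python
session = {"domainId": "ebola_dd"}  # placeholder domain id

# Saves classifiers, features, and training data under <project>/data/<domain>;
# with zip=True everything is also bundled into a single zip file:
# dd.createModel(session, zip=True)
```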