Domain Discovery (DD) API Documentation
Domain Discovery is the process of acquiring, understanding, and exploring data for a specific domain. Example domains include human trafficking, illegal weapon sales, and micro-cap fraud. When acquiring knowledge about a domain, humans usually start with a conception of that domain, based on prior knowledge of parts of it. Gaining a more complete picture of the domain involves using this prior knowledge to obtain content that provides additional, previously unknown information about the domain. This new knowledge then becomes prior knowledge, leading to an iterative process of domain discovery, as illustrated in Figure 2. The goals of this iterative domain discovery process are to:
- complete the human’s knowledge of the domain
- acquire sufficient content capturing the human cognition of the domain to translate it into a computational model

The Domain Discovery API formalizes the human domain discovery process by defining a set of operations that capture the essential tasks leading to domain discovery on the Web, as we have observed in our interactions with Subject Matter Experts (SMEs). The API facilitates:
- creating different user interfaces to satisfy different DD needs
- configuring and understanding different DD workflows
- scripting DD
Installation
Building and deploying the Domain Discovery API can be done using its Makefile, which creates a local development environment. The conda build environment is currently only supported on 64-bit OS X and Linux.
First install conda, either through the Anaconda or Miniconda installers provided by Continuum. You will also need Git, a Java Development Kit, and Maven. These are system tools that are generally not provided by conda.
Clone the DD API repository and enter it:
>>> git clone https://github.com/ViDA-NYU/domain_discovery_API
>>> cd domain_discovery_API
Use the make command to build the DD API and download/install its dependencies:
>>> make
Now you can use the API.
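The sketches in the operation sections below assume a DomainModel instance and a session object. As a hedged illustration (neither the constructor arguments nor the full session schema is documented here, so the no-argument constructor and the domain id below are assumptions):
>>> from models.domain_discovery_model import DomainModel
>>> dd = DomainModel()  # assumed no-argument constructor
>>> session = {'domainId': 'my_domain'}  # hypothetical domain id; sessions must carry domainId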
DD API Operations
models.domain_discovery_model.random() → x in the interval [0, 1)
Acquire Content

class models.domain_discovery_model.DomainModel

queryWeb(terms, max_url_count=100, session=None)
Issue a query on the web. Results are stored in Elasticsearch; nothing is returned here.
- Parameters:
  - terms (string): search query string
  - max_url_count (int): number of pages to query; maximum allowed is 100
  - session (json): should contain domainId
- Returns:
  - None
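A minimal usage sketch, reusing the dd instance and session dict set up in the Installation section; the query string is illustrative:
>>> # Results are stored in Elasticsearch; the call itself returns None.
>>> dd.queryWeb('micro-cap stock fraud', max_url_count=100, session=session)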
uploadUrls(urls_str, session)
Download pages corresponding to an already known set of domain URLs.
- Parameters:
  - urls_str (string): space-separated list of URLs
  - session (json): should contain domainId
- Returns:
  - number of pages downloaded (int)
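A sketch of seeding a domain with known pages, reusing dd and session from above; the example.com URLs are placeholders:
>>> urls = 'http://example.com/page1 http://example.com/page2'  # space-separated
>>> count = dd.uploadUrls(urls, session)  # downloads the pages
>>> print(count)  # number of pages downloaded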
getForwardLinks(urls, session)
Extend the content by crawling the given pages one level forward. The assumption is that a relevant page will contain links to other relevant pages.
- Parameters:
  - urls (list): list of URLs to crawl forward
  - session (json): should contain domainId
- Returns:
  - None (results are downloaded into Elasticsearch)
getBackwardLinks(urls, session)
Extend the content by crawling one level back to the pages that link to the given pages. The assumption is that a page linking to a relevant page will also contain links to other relevant pages.
- Parameters:
  - urls (list): list of URLs to crawl backward
  - session (json): should contain domainId
- Returns:
  - None (results are downloaded into Elasticsearch)
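A sketch of both link-expansion directions, reusing dd and session from above; the URL list is illustrative:
>>> seeds = ['http://example.com/page1', 'http://example.com/page2']
>>> dd.getForwardLinks(seeds, session)   # pages these URLs link to
>>> dd.getBackwardLinks(seeds, session)  # pages that link to these URLs
>>> # Both return None; results are downloaded into Elasticsearch.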
Annotate Content

class models.domain_discovery_model.DomainModel

setPagesTag(pages, tag, applyTagFlag, session)
Tag the pages with the given tag, which can be a custom tag or ‘Relevant’/‘Irrelevant’ to indicate relevance or irrelevance to the domain of interest. Tags help in clustering and categorizing the pages; they also help build computational models of the domain.
- Parameters:
  - pages (urls): list of URLs to apply the tag to
  - tag (string): custom tag, ‘Relevant’, or ‘Irrelevant’
  - applyTagFlag (bool): True to add the tag, False to remove it
  - session (json): should contain domainId
- Returns:
  - the string “Completed Process”
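A sketch of tagging pages, reusing dd and session from above; the URLs are placeholders:
>>> pages = ['http://example.com/page1', 'http://example.com/page2']
>>> dd.setPagesTag(pages, 'Relevant', True, session)   # add the tag
'Completed Process'
>>> dd.setPagesTag(pages, 'Relevant', False, session)  # remove it again
'Completed Process'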
-
setTermsTag
(terms, tag, applyTagFlag, session)[source]¶ Tag the terms as ‘Positive’/’Negative’ which indicate relevance or irrelevance to the domain of interest. Tags help in reranking terms to show the ones relevan to the domain.
- Parameters:
- terms (string): list of terms to apply tag tag (string): ‘Positive’ or ‘Negative’ applyTagFlag (bool): True - Add tag, False - Remove tag session (json): Should contain domainId
- Returns:
- None
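A sketch of marking terms, reusing dd and session from above. The docs type terms as a string but describe a list; a list is assumed here, and the terms themselves are illustrative:
>>> dd.setTermsTag(['penny stock', 'pump and dump'], 'Positive', True, session)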
Summarize Content

class models.domain_discovery_model.DomainModel

extractTerms(opt_maxNumberOfTerms=40, session=None)
Extract the most relevant unigrams, bigrams, and trigrams that summarize the pages. These can surface previously unknown information about the domain, which in turn can suggest further queries for acquiring content.
- Parameters:
  - opt_maxNumberOfTerms (int): number of terms to return
  - session (json): should contain domainId
- Returns:
  - array: [[term, frequencyInRelevantPages, frequencyInIrrelevantPages, tags], ...]
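A sketch of summarizing the acquired pages, reusing dd and session from above; the loop unpacks the array format described above:
>>> terms = dd.extractTerms(opt_maxNumberOfTerms=20, session=session)
>>> for term, freq_relevant, freq_irrelevant, tags in terms:
...     print(term, freq_relevant, freq_irrelevant, tags)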
make_topic_model(session, tokenizer, vectorizer, model, ntopics)
Build a topic model from the corpus of the supplied DDT domain.
The topic model is represented as a topik.TopikProject object and is persisted to disk, recording the model parameters and the location of the data. The output of the topic model itself is stored in Elasticsearch.
- Parameters:
  - domain (str): DDT domain name as stored in Elasticsearch, i.e. lowercase and with underscores in place of spaces
  - tokenizer (str): a tokenizer from topik.tokenizer.registered_tokenizers
  - vectorizer (str): a vectorization method from topik.vectorizers.registered_vectorizers
  - model (str): a topic model from topik.vectorizers.registered_models
  - ntopics (int): the number of topics to be used when modeling the corpus
- Returns:
  - model: topik model, encoding things like term frequencies
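A sketch of building a topic model, reusing dd and session from above. The tokenizer, vectorizer, and model names are assumptions; valid values come from topik’s registered_tokenizers, registered_vectorizers, and registered_models registries:
>>> m = dd.make_topic_model(session, tokenizer='simple',
...                         vectorizer='bag_of_words', model='lda', ntopics=10)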
Organize Content

class models.domain_discovery_model.DomainModel

getPagesProjection(session)
Organize content by some criterion, such as relevance, similarity, or category, which makes it easy to analyze groups of pages. The ‘x’, ‘y’ coordinates returned project each page into 2D while maintaining clustering under the chosen projection. The projection criterion is specified in the session object.
- Parameters:
  - session (json): should contain domainId; should also contain activeProjectionAlg, which currently takes the values ‘tsne’, ‘pca’, or ‘kmeans’
- Returns:
  - dictionary in the format:
    {
      ‘last_downloaded_url_epoch’: 1432310403 (in seconds),
      ‘pages’: [
        [url1, x, y, tags, retrieved],
        [url2, x, y, tags, retrieved],
        [url3, x, y, tags, retrieved],
        ...
      ]
    }
    (tags are a list, potentially empty)
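A sketch of projecting pages into 2D, reusing dd and session from above:
>>> session['activeProjectionAlg'] = 'tsne'  # or 'pca', 'kmeans'
>>> projection = dd.getPagesProjection(session)
>>> for url, x, y, tags, retrieved in projection['pages']:
...     print(url, x, y, tags)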
Filter Content

class models.domain_discovery_model.DomainModel

getPages(session)
Find pages that satisfy the specified criteria. One or more of the following criteria are specified in the session object as ‘pageRetrievalCriteria’:
‘Most Recent’, ‘More like’, ‘Queries’, ‘Tags’, ‘Model Tags’, ‘Maybe relevant’, ‘Maybe irrelevant’, ‘Unsure’
Pages can additionally be filtered by keywords specified in the session object as ‘filter’.
- Parameters:
  - session (json): should contain ‘domainId’ and ‘pageRetrievalCriteria’ or ‘filter’
- Returns:
  - json: {url1: {snippet, image_url, title, tags, retrieved}} (tags are a list, potentially empty)
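A sketch of filtering pages, reusing dd and session from above, and assuming each returned value is a dict with the snippet, image_url, title, tags, and retrieved keys described above:
>>> session['pageRetrievalCriteria'] = 'Tags'
>>> session['filter'] = 'fraud'  # optional keyword filter
>>> pages = dd.getPages(session)
>>> for url, info in pages.items():
...     print(url, info['title'], info['tags'])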
Generate Model

class models.domain_discovery_model.DomainModel

createModel(session, zip=True)
Create an ACHE model to be used by the SeedFinder and the focused crawler. It saves the classifiers, features, and training data in the <project>/data/<domain> directory. If zip=True, all generated files and folders are zipped into a single file.
- Parameters:
  - session (json): should contain domainId
- Returns:
  - None
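A sketch of exporting the model, reusing dd and session from above:
>>> dd.createModel(session, zip=True)  # writes and zips <project>/data/<domain>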