Crawling FAQ

Q: How long will my crawl take?

The initial crawl of any datasource takes time. The total duration depends on two key factors:

  1. The size of the datasource (e.g. the number of documents or messages, and the size of each).
  2. The rate limit of the datasource's API.

If an API has a low rate limit, Glean can only retrieve items from it slowly. Likewise, datasources containing many documents, files, or messages take longer to crawl.

For a typical enterprise datasource, expect the initial crawl to take anywhere from 3 to 10 days, with the upper end applying to large datasources with low API rate limits.
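
As a rough illustration of how the two factors above combine, the following Python sketch computes a lower bound on crawl time from an item count and an API rate limit. It is a back-of-the-envelope calculation, not how Glean schedules crawls: the function name, the requests-per-item figure, and the example numbers are all assumptions chosen for illustration, and real crawls are also affected by item size, retries, and processing time.

```python
SECONDS_PER_DAY = 86_400


def estimate_crawl_days(item_count: int,
                        requests_per_item: float,
                        requests_per_second: float) -> float:
    """Return a lower-bound crawl duration in days.

    Assumes the API rate limit is the only bottleneck (hypothetical model).
    """
    if item_count < 0 or requests_per_item <= 0 or requests_per_second <= 0:
        raise ValueError("counts and rates must be positive")
    total_requests = item_count * requests_per_item
    return total_requests / requests_per_second / SECONDS_PER_DAY


# Illustrative numbers only: 5 million items, 2 API requests per item,
# and an API that allows 20 requests per second.
days = estimate_crawl_days(5_000_000, 2, 20)
print(f"{days:.1f} days")  # prints "5.8 days"
```

Halving the rate limit or doubling the item count doubles the estimate, which is why large datasources behind slow APIs sit at the upper end of the range.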

Q: What if my crawl is taking a long time?

The duration of the crawl depends on the amount of data in your datasource. Larger datasets, or apps with low API rate limits, will take longer to crawl. If your crawl is taking longer than expected, please contact Glean support.

Q: I don't want Glean to crawl everything. How do I restrict what is crawled?

Some connectors support configuring restrictions at setup time from the UI (e.g. GitHub), but for most datasources, you will need to contact Glean support to have crawl restrictions applied. There are multiple ways to restrict the data crawled, including:

  1. Time-based restrictions (e.g. only crawl content created or accessed in the last 6 months)
  2. User-based restrictions (e.g. only crawl content from the specified users)
  3. Group-based restrictions (e.g. only crawl content from the specified AD group)
  4. Site/channel-based restrictions (e.g. only crawl content from the specified site or channel)
  5. Folder-based restrictions (e.g. only crawl content from within the specified folders)

The restrictions that can be applied depend on the datasource and on what is available via its API. Most apps typically support both greenlisting (explicit inclusion) and redlisting (explicit exclusion).
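
To make the difference between greenlisting, redlisting, and time-based restrictions concrete, here is a minimal conceptual sketch in Python. It does not represent Glean's configuration format or implementation (restrictions are applied through the UI or by Glean support). The `CrawlRule` class, the `should_crawl` function, and the choice that an exclusion takes precedence over an inclusion are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class CrawlRule:
    """Hypothetical restriction set for one datasource."""
    include_folders: frozenset = frozenset()  # greenlist; empty means "include everything"
    exclude_folders: frozenset = frozenset()  # redlist
    max_age: Optional[timedelta] = None       # time-based restriction; None means no limit


def should_crawl(folder: str, last_activity: datetime,
                 rule: CrawlRule, now: datetime) -> bool:
    """Decide whether an item falls inside the crawl restrictions."""
    # Assumption: an explicit exclusion always wins over an inclusion.
    if folder in rule.exclude_folders:
        return False
    # If a greenlist is set, anything not on it is skipped.
    if rule.include_folders and folder not in rule.include_folders:
        return False
    # Skip items with no activity inside the allowed time window.
    if rule.max_age is not None and now - last_activity > rule.max_age:
        return False
    return True


now = datetime(2024, 6, 1)
rule = CrawlRule(
    include_folders=frozenset({"Engineering", "Sales"}),
    exclude_folders=frozenset({"Sales"}),
    max_age=timedelta(days=180),  # roughly "the last 6 months"
)

print(should_crawl("Engineering", datetime(2024, 5, 1), rule, now))  # True
print(should_crawl("Sales", datetime(2024, 5, 1), rule, now))        # False: redlisted
print(should_crawl("HR", datetime(2024, 5, 1), rule, now))           # False: not greenlisted
print(should_crawl("Engineering", datetime(2023, 1, 1), rule, now))  # False: too old
```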

Q: What if I see errors in my crawl status?

If you see errors in your crawl status, this could indicate a problem with the connection to your datasource or with the data itself. Please check your datasource configuration and contact Glean support if the issue persists.

Q: Can I crawl multiple datasources at the same time?

Yes, Glean supports crawling multiple datasources simultaneously.

Q: How can I see the progress of a crawl?

If a crawl is in progress, the status under the Content crawling heading in the table of apps will be Job in progress. When a crawl is complete, this field will show Synced. It is not currently possible to see more detailed crawl progress or an ETA.

Q: Why does my crawl seem stuck at Job in progress?

Job in progress means that there is an active crawl of that datasource underway. Remember: a full crawl of the datasource will take several days to complete, so the status of the datasource will be Job in progress during this time.

Q: How do I delete a datasource?

Contact Glean support, who will remove it for you.

Q: How do I stop or restart a crawl?

Contact Glean support, who can do this for you.