Connect Datasources

In this section, you will learn how to connect the sources of data that Glean will crawl and index for search.

About Datasources and Connectors

Datasources

Datasources are the platforms, services, or cloud apps where your data resides. These could be:

Category	Example Apps
Cloud Storage	Box, OneDrive
Email	Outlook, Gmail
Communication	Slack, Teams
Documentation	Confluence, Docusign
Ticketing & Support	Jira, Zendesk
Code & Engineering	GitHub, Bitbucket
HR	Workday, Lattice
Sales & Marketing	Salesforce, Marketo
Project Management	Asana, Monday
...and more!

Connectors

Connectors are the tools that Glean uses to connect to your datasources and crawl data from them. Glean currently supports more than 80 connectors.

Connectors typically pull data from your datasources securely over API, but may also receive data from your datasources via a webhook.

Select a Datasource to Connect

From the Glean UI, navigate to Workspace Settings > Setup > Apps and click the Add app button at the top-right.

Select the datasource that you want to connect Glean to and follow the instructions that are presented on-screen.

Connectors are typically configured through OAuth, by installing Glean from your cloud app's marketplace or store (e.g. Atlassian Marketplace), or both.

As part of the setup flow for each connector, your API credentials and permissions will be validated.

Error prevention

You must apply any API access permissions in the setup documents exactly as referenced.

For each item within a datasource, Glean will crawl three things:

The item itself (e.g. a spreadsheet, document, message, email, or event)
Access permissions for the item (i.e. which users have access to the item)
Activities performed on the item (i.e. when the item was created, posted, modified, viewed, and so on, and by which users)

Glean asks only for the minimum permissions needed to crawl the above. What counts as the minimum varies between datasources, because it depends on the capabilities of the API provided by the cloud service. For example, some cloud services expose document permissions only through a ReadWrite or FullControl scope and offer no ReadOnly API scope.

Failure to set the correct API access permissions will cause your Glean crawl to fail.
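
To make the three crawled components concrete, here is a minimal Python sketch of how a single crawled item could be modeled. It is an illustration only and not Glean's actual schema: the class names (CrawledItem, Activity), the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Set


@dataclass
class Activity:
    """One event on an item: who did what, and when (hypothetical shape)."""
    user: str
    action: str  # e.g. "created", "modified", "viewed"
    timestamp: datetime


@dataclass
class CrawledItem:
    """Hypothetical record combining the three things crawled per item."""
    item_id: str
    content: str                                              # 1. the item itself
    allowed_users: Set[str] = field(default_factory=set)      # 2. access permissions
    activities: List[Activity] = field(default_factory=list)  # 3. activity history

    def visible_to(self, user: str) -> bool:
        # Illustrative check: is this user in the crawled permission set?
        return user in self.allowed_users

    def last_activity(self) -> Optional[datetime]:
        # Most recent activity timestamp, or None if nothing was recorded.
        return max((a.timestamp for a in self.activities), default=None)


# Sample data, invented for this example.
doc = CrawledItem(
    item_id="doc-1",
    content="Q3 planning notes",
    allowed_users={"alice@example.com"},
    activities=[
        Activity("alice@example.com", "created", datetime(2024, 1, 5)),
        Activity("alice@example.com", "modified", datetime(2024, 2, 1)),
    ],
)
```

If the connector's API scope cannot read one of these components (for example, permissions), the corresponding field cannot be populated, which is why the scopes in the setup documents need to be applied exactly.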

Start Crawling

Once you have connected your datasource, you can initiate the crawl for it. This is the process where Glean goes through the data in your connected datasources and indexes it for search.

Info

You will not be able to complete this step until your Glean tenant has been provisioned. If you were not able to switch from Magic Links to SSO in the last section, you will need to return to this step later.

Warning

If you would like to restrict the content that Glean crawls, DO NOT start crawling yet. Apply your crawling restrictions first from Workspace Settings > Setup > Apps; they become available once the initial configuration for the datasource has been saved.

The restrictions that are supported vary between apps, but most datasources support at least two of the following restrictions:

Time-based restrictions (e.g. only crawl content created or accessed in the last 6 months)
User-based restrictions (e.g. only crawl content from the specified users)
Group-based restrictions (e.g. only crawl content from the specified AD group)
Site/channel-based restrictions (e.g. only crawl content from the specified site or channel)
Folder-based restrictions (e.g. only crawl content from within the specified folders)

Most apps support both greenlisting (explicit inclusion) and redlisting (explicit exclusion).

Not all crawling restrictions are available in the UI: some can only be applied by Glean. Contact your Glean account team or Glean support for additional information.
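
As a rough illustration of how time-based, user-based, and folder-based restrictions can combine, the Python sketch below filters documents against a set of rules. This is not Glean's implementation or configuration format: the Doc and CrawlRules classes, their fields, and the sample paths are hypothetical, and "6 months" is approximated as 180 days.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Set, Tuple


@dataclass
class Doc:
    path: str
    owner: str
    last_accessed: datetime


@dataclass
class CrawlRules:
    """Hypothetical rule set; None or empty means 'no restriction of that kind'."""
    max_age: Optional[timedelta] = None         # time-based restriction
    greenlist_users: Optional[Set[str]] = None  # user-based explicit inclusion
    redlist_folders: Tuple[str, ...] = ()       # folder-based explicit exclusion

    def allows(self, doc: Doc, now: datetime) -> bool:
        if self.max_age is not None and now - doc.last_accessed > self.max_age:
            return False
        if self.greenlist_users is not None and doc.owner not in self.greenlist_users:
            return False
        # Trailing slashes keep "/hr/" from also matching "/hr-archive/".
        if any(doc.path.startswith(folder) for folder in self.redlist_folders):
            return False
        return True


now = datetime(2024, 6, 1)
rules = CrawlRules(
    max_age=timedelta(days=180),
    greenlist_users={"alice@example.com", "bob@example.com"},
    redlist_folders=("/hr/",),
)

recent = Doc("/eng/design.md", "alice@example.com", datetime(2024, 5, 1))
stale = Doc("/eng/old.md", "alice@example.com", datetime(2023, 1, 1))
private = Doc("/hr/salaries.xlsx", "bob@example.com", datetime(2024, 5, 20))

for d in (recent, stale, private):
    print(d.path, rules.allows(d, now))
```

In this sketch a document must pass every configured rule to be crawled, so only the recent engineering document is allowed. The stale one fails the time limit, and the HR spreadsheet is excluded by the redlisted folder even though its owner is greenlisted.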

To start the crawl, click on the Start crawl button.

You can also start the crawl later by selecting the app under Workspace Settings > Setup > Apps, and selecting Start crawl.

How long does a crawl take?

The initial crawl for any datasource takes time. How long depends on two key factors:

The size of the datasource (e.g. the number of documents or messages, and the size of each).
The rate limit of the datasource's API.

If an API has a low rate limit, Glean can only request items from it slowly, so the crawl takes longer. Likewise, datasources containing a large number of documents, files, or messages take longer to crawl.

For a typical enterprise datasource, expect the initial crawl to take anywhere from 3 to 10 days, with the upper end applying to large datasources with low API rate limits.
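
The relationship between datasource size, API rate limit, and crawl time can be approximated with simple arithmetic. The Python sketch below is a back-of-the-envelope estimate under stated assumptions, not a Glean tool. The function name, the assumption of three API requests per item (one each for content, permissions, and activity), and the sample numbers are all hypothetical. It also ignores retries, throttling back-off, and processing time, so it is at best a lower bound.

```python
def estimate_crawl_days(item_count: int,
                        requests_per_item: float,
                        rate_limit_per_minute: float) -> float:
    """Rough lower bound on crawl duration, in days.

    Assumes the crawler is limited only by the API rate limit and uses it fully.
    """
    if rate_limit_per_minute <= 0:
        raise ValueError("rate_limit_per_minute must be positive")
    total_requests = item_count * requests_per_item
    minutes = total_requests / rate_limit_per_minute
    return minutes / (60 * 24)


# Hypothetical datasource: 5 million items, 3 requests per item,
# and an API that allows 2,000 requests per minute.
days = estimate_crawl_days(5_000_000, 3, 2_000)
print(f"{days:.1f} days")  # prints "5.2 days"
```

Halving the rate limit or doubling the item count doubles the estimate, which is why large datasources behind low rate limits sit at the long end of the range above.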

Checking the Crawl Status

You can check the status of your crawl at any time by going to Workspace Settings > Setup > Apps and reviewing the table of configured apps.

Here, you will see information about the progress of the crawl, including how many documents have been indexed and any errors that may have occurred.

Tip

For crawls of large datasources, or datasources with low rate limits, it is normal for the document count to be low initially and then increase rapidly over a few days.

If the document count remains low after a few days, please check the permissions granted to the Glean connector and contact Glean support.

FAQ

See Crawling FAQ for a list of common questions and answers regarding crawling.