How Discord Indexes Billions of Messages (2017)

technical requirements: fast, self-healing, linearly scalable, lazily indexed
- lazily indexed - don’t index until someone attempts to search messages at least once
decided to use elasticsearch (evaluated solr vs elasticsearch)
wanted to avoid large es clusters since they’re difficult to maintain and team didn’t have much experience
idea: delegate sharding to & routing to application layer which would allow them to index messages into a pool of smaller es clusters
- cluster outage would only affect limited messages
- can throw away unrecoverable clusters easily (and lazily re-index when needed)
es prefers bulk indexing
- queue messages and process in bulk
- delay is reasonable since users mostly search for historical messages and not recent ones
tooling
- message queue (used Celery)
- index workers (routing and bulk inserts from queue)
- historical index workers
- shard mapping (discord_guild -> es cluster + index)
  - persistent (on Cassandra)
  - cache (used mget in Redis)
- shard allocator (one-time assignment of “shard”)
  - store shards ranked by score that represents their load (used sorted sets in Redis)
  - score is incremented w/ each new allocation
  - each indexed message in es had a probability to increment the score of its shard
- discovery (used etcd)
  - of clusters and hosts within them
raw message data not stored in ES
- only fields actually stored (store: "yes" in es) are message id, server id & channel id
- message metadata is stored in inverted index
- timestamp isn’t included in index since ids contain a timestamp (snowflake ids)
for search results, message is fetched from Cassandra along w/ message context (2 messages before & after)
when testing
- observed unexpected disk & cpu usage
- es’s index refresh interval is 1s by default (to support near real-time search) so every second es was flushing in-memory buffer to a lucene segment & opening the segment to make it searchable across a thousand indexes
- just increasing the refresh interval to 1hr didn’t work since a discord server could go hours w/o needing to execute a single query
  - control refreshing from application layer
    - after bulk insert, marking guild ids & es shard containing them as dirty
      - expire after 1 hour (refresh would’ve happened by then)
    - on search, check if the es shard AND guild id are dirty & refresh the shard’s es index if so

Appendix

“guild” is just the internal name for a server
“shard” in es vs. “shard” in this post
- es index is split into shards for distributing data
- in this post, a shard just maps to a guild
- an es shard can contain data for multiple guild ids
Index Template

{
    'template': 'm-*',
    'settings': {
        'number_of_shards': 1,
        'number_of_replicas': 1,
        'index.refresh_interval': '3600s'
    },
    'mappings': {
        'message': {
            '_source': {
                'includes': [
                    'id',
                    'channel_id',
                    'guild_id'
                ]
            },
            'properties': {
                // This is the message_id, we index by this to allow
                // for greater than/less than queries, so we can search
                // before, on, and after.
                'id': {
                    'type': 'long'
                },
                // Lets us search with the "in:#channel-name" modifier.
                'channel_id': {
                    'type': 'long'
                },
                // Lets us scope a search to a given server.
                'guild_id': {
                    'type': 'long'
                },
                // Lets us search "from:Someone#0001"
                'author_id': {
                    'type': 'long'
                },
                // Is the author a user, bot or webhook? Not yet exposed in client.
                'author_type': {
                    'type': 'byte'
                },
                // Regular chat message, system message...
                'type': {
                    'type': 'short'
                },
                // Who was mentioned, "mentions:Person#1234"
                'mentions': {
                    'type': 'long'
                },
                // Was "@everyone" mentioned
                // (only true if the author had permission to @everyone at the time).
                // This accounts for the case where "@everyone" could be in a
                // message, but it had no effect, 
                // because the user doesn't have permissions to ping everyone. 
                'mention_everyone': {
                    'type': 'boolean'
                },
                // Array of [message content, embed title, embed author,
                // embed description, ...] for full-text search.
                'content': {
                    'type': 'text',
                    'fields': {
                        'lang_analyzed': {
                            'type': 'text',
                            'analyzer': 'english'
                        }
                    }
                },
                // An array of shorts, specifying what type of media the message has.
                // "has:link|image|video|embed|file".
                'has': {
                    'type': 'short'
                },
                // An array of normalized hostnames in the message,
                // traverse up to the domain. Not yet exposed in client.
                // "http://foo.bar.com" gets turned into ["foo.bar.com", "bar.com"]
                'link_hostnames': {
                    'type': 'keyword'
                },
                // Embed providers as returned by oembed, i.e. "Youtube".
                // Not yet exposed in client.
                'embed_providers': {
                    'type': 'keyword'
                },
                // Embed type as returned by oembed. Not yet exposed in client.
                'embed_types': {
                    'type': 'keyword'
                },
                // File extensions of attachments, i.e. "fileType:mp3"
                'attachment_extensions': {
                    'type': 'keyword'
                },
                // The filenames of the attachments. Not yet exposed in client.
                'attachment_filenames': {
                    'type': 'text',
                    'analyzer': 'simple'
                }
            }
        }
    }
}

Machines Fail

How Discord Indexes Billions of Messages (2017)