- technical requirements: fast, self-healing, linearly scalable, lazily indexed
- lazily indexed - don’t index until someone attempts to search messages at least once
- decided to use elasticsearch (evaluated solr vs elasticsearch)
- wanted to avoid large es clusters since they’re difficult to maintain and team didn’t have much experience
- idea: delegate sharding & routing to the application layer, which would allow them to index messages into a pool of smaller es clusters
- cluster outage would only affect limited messages
- can throw away unrecoverable clusters easily (and lazily re-index when needed)
- es prefers bulk indexing
- queue messages and process in bulk
- delay is reasonable since users mostly search for historical messages and not recent ones
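A minimal sketch of the queue-then-bulk idea above, using an in-memory buffer rather than Celery or a real es client. `BulkIndexBuffer`, its `flush` callback, and the batch size of 3 are hypothetical names and values chosen for illustration; in practice the callback would wrap a bulk insert into es.

```python
from typing import Callable, Dict, List


class BulkIndexBuffer:
    """Collects messages and hands them off in batches (hypothetical helper)."""

    def __init__(self, flush: Callable[[List[Dict]], None], batch_size: int = 3) -> None:
        self._flush = flush            # stand-in for a bulk insert into es
        self._batch_size = batch_size  # illustrative value, not from the source
        self._pending: List[Dict] = []

    def add(self, message: Dict) -> None:
        """Queue one message; flush automatically once a full batch is pending."""
        self._pending.append(message)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is pending as one batch (no-op when empty)."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._flush(batch)


# Usage: seven messages with a batch size of three give batches of 3, 3 and 1.
sent_batches: List[List[Dict]] = []
buffer = BulkIndexBuffer(flush=sent_batches.append, batch_size=3)
for message_id in range(1, 8):
    buffer.add({"id": message_id, "content": f"message {message_id}"})
buffer.flush()  # drain the partial final batch
```

The trade-off is the one noted above: messages become searchable a little later, in exchange for far fewer and larger index requests.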
- tooling
- message queue (used Celery)
- index workers (routing and bulk inserts from queue)
- historical index workers
- shard mapping (discord_guild -> es cluster + index)
- persistent (on Cassandra)
- cache (used mget in Redis)
- shard allocator (one-time assignment of “shard”)
- store shards ranked by score that represents their load (used sorted sets in Redis)
- score is incremented w/ each new allocation
- each indexed message in es had a probability to increment the score of its shard
- discovery (used etcd)
- of clusters and hosts within them
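The shard allocator described above can be sketched roughly as follows. This is an in-memory stand-in for the Redis sorted set, not Discord's implementation: the class name, the shard labels such as `cluster-a/m-0`, the tie-breaking rule, and the probability value are all assumptions made for illustration.

```python
import random
from typing import Dict, Iterable, Optional


class ShardAllocator:
    """Assigns each guild to the least-loaded shard exactly once (hypothetical sketch)."""

    def __init__(
        self,
        shards: Iterable[str],
        message_bump_probability: float = 0.01,  # assumed value, not from the source
        rng: Optional[random.Random] = None,
    ) -> None:
        # shard -> load score; plays the role of the sorted set in Redis
        self._scores: Dict[str, float] = {shard: 0.0 for shard in shards}
        # guild_id -> shard; plays the role of the persistent shard mapping
        self._assignments: Dict[int, str] = {}
        self._p = message_bump_probability
        self._rng = rng or random.Random()

    def allocate(self, guild_id: int) -> str:
        """Return the guild's shard, assigning the lowest-scored shard on first use."""
        if guild_id in self._assignments:
            return self._assignments[guild_id]
        # Ties are broken by shard name so the result is deterministic.
        shard = min(self._scores, key=lambda s: (self._scores[s], s))
        self._scores[shard] += 1  # score is incremented with each new allocation
        self._assignments[guild_id] = shard
        return shard

    def record_indexed_message(self, guild_id: int) -> None:
        """Each indexed message has a probability of bumping its shard's score."""
        shard = self.allocate(guild_id)
        if self._rng.random() < self._p:
            self._scores[shard] += 1


# Usage: probability 1.0 makes the message bump deterministic for the example.
allocator = ShardAllocator(["cluster-a/m-0", "cluster-b/m-0"], message_bump_probability=1.0)
first = allocator.allocate(1)        # both shards at 0, name breaks the tie
allocator.record_indexed_message(1)  # busy guild raises its shard's score
second = allocator.allocate(2)
third = allocator.allocate(3)
```

Because busy guilds keep nudging their shard's score upward, new guilds drift toward shards that hold quieter ones.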
- raw message data not stored in ES
- only fields actually stored (store: "yes" in es) are message id, server id & channel id - message metadata is stored in inverted index
- timestamp isn’t included in index since ids contain a timestamp (snowflake ids)
- for search results, message is fetched from Cassandra along w/ message context (2 messages before & after)
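To illustrate why a separate timestamp field is unnecessary: a Discord snowflake id carries milliseconds since the Discord epoch (2015-01-01 UTC) in its upper bits, above the low 22 bits. The helper names below are hypothetical; the sample id is the one used in Discord's public API documentation.

```python
from datetime import datetime, timedelta, timezone

DISCORD_EPOCH_MS = 1420070400000  # 2015-01-01T00:00:00Z in Unix milliseconds
TIMESTAMP_SHIFT = 22              # low 22 bits hold worker, process and sequence


def snowflake_to_unix_ms(snowflake: int) -> int:
    """Extract the creation time (Unix milliseconds) embedded in a snowflake id."""
    return (snowflake >> TIMESTAMP_SHIFT) + DISCORD_EPOCH_MS


def unix_ms_to_min_snowflake(unix_ms: int) -> int:
    """Smallest possible snowflake for a given millisecond, useful as a range bound."""
    return (unix_ms - DISCORD_EPOCH_MS) << TIMESTAMP_SHIFT


def snowflake_to_datetime(snowflake: int) -> datetime:
    """Same as snowflake_to_unix_ms, but as a timezone-aware datetime."""
    unix_epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return unix_epoch + timedelta(milliseconds=snowflake_to_unix_ms(snowflake))


example_id = 175928847299117063
created_ms = snowflake_to_unix_ms(example_id)
created_at = snowflake_to_datetime(example_id)
lower_bound = unix_ms_to_min_snowflake(created_ms)
```

Since ids sort by creation time, a "before/after" search can be expressed as a numeric range over the `id` field.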
- when testing
- observed unexpected disk & cpu usage
- es’s index refresh interval is 1s by default (to support near real-time search), so every second, across a thousand indexes, es was flushing its in-memory buffer to a lucene segment & opening the segment to make it searchable
- just increasing the refresh interval to 1hr didn’t work since a discord server could go hours w/o needing to execute a single query
- control refreshing from application layer
- after bulk insert, marking guild ids & es shard containing them as dirty
- expire after 1 hour (refresh would’ve happened by then)
- on search, check if the es shard AND guild id are dirty & refresh the shard’s es index if so
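The application-level refresh control above could look roughly like this. It is a sketch with an in-memory dict standing in for an expiring store; `RefreshTracker`, the `refresh_index` callback, and the injectable clock are hypothetical. Clearing every guild on the index after a refresh is an assumption here, based on a refresh making all pending documents in that index searchable.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple

DIRTY_TTL_SECONDS = 3600  # matches the 1 hour expiry noted above


class RefreshTracker:
    """Tracks which (es index, guild) pairs have unrefreshed writes (hypothetical)."""

    def __init__(
        self,
        refresh_index: Callable[[str], None],  # stand-in for an es index refresh call
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._refresh_index = refresh_index
        self._clock = clock
        self._dirty: Dict[Tuple[str, int], float] = {}  # (index, guild_id) -> expiry

    def mark_dirty(self, es_index: str, guild_ids: Iterable[int]) -> None:
        """Call after a bulk insert for every guild touched by the batch."""
        expires_at = self._clock() + DIRTY_TTL_SECONDS
        for guild_id in guild_ids:
            self._dirty[(es_index, guild_id)] = expires_at

    def before_search(self, es_index: str, guild_id: int) -> bool:
        """Refresh the index only if this guild has unexpired dirty writes."""
        expires_at = self._dirty.get((es_index, guild_id))
        if expires_at is None or expires_at <= self._clock():
            # Never dirty, or the hourly es refresh has already covered it.
            self._dirty.pop((es_index, guild_id), None)
            return False
        self._refresh_index(es_index)
        # Assumption: one refresh cleans every guild stored in this index.
        for key in [k for k in self._dirty if k[0] == es_index]:
            del self._dirty[key]
        return True


# Usage with a fake clock so the expiry path can be shown without waiting.
now = [0.0]
refreshed: List[str] = []
tracker = RefreshTracker(refreshed.append, clock=lambda: now[0])
tracker.mark_dirty("m-0", [1, 2])
r1 = tracker.before_search("m-0", 1)  # dirty, triggers a refresh
r2 = tracker.before_search("m-0", 2)  # already cleaned by the refresh above
tracker.mark_dirty("m-0", [3])
now[0] = 3601.0                       # more than an hour later
r3 = tracker.before_search("m-0", 3)  # expired, es refreshed on its own
```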
Appendix
- “guild” is just the internal name for a server
- “shard” in es vs. “shard” in this post
- es index is split into shards for distributing data
- in this post, a shard just maps to a guild
- an es shard can contain data for multiple guild ids
- Index Template
{
    'template': 'm-*',
    'settings': {
        'number_of_shards': 1,
        'number_of_replicas': 1,
        'index.refresh_interval': '3600s'
    },
    'mappings': {
        'message': {
            '_source': {
                'includes': [
                    'id',
                    'channel_id',
                    'guild_id'
                ]
            },
            'properties': {
                // This is the message_id, we index by this to allow
                // for greater than/less than queries, so we can search
                // before, on, and after.
                'id': {
                    'type': 'long'
                },
                // Lets us search with the "in:#channel-name" modifier.
                'channel_id': {
                    'type': 'long'
                },
                // Lets us scope a search to a given server.
                'guild_id': {
                    'type': 'long'
                },
                // Lets us search "from:Someone#0001"
                'author_id': {
                    'type': 'long'
                },
                // Is the author a user, bot or webhook? Not yet exposed in client.
                'author_type': {
                    'type': 'byte'
                },
                // Regular chat message, system message...
                'type': {
                    'type': 'short'
                },
                // Who was mentioned, "mentions:Person#1234"
                'mentions': {
                    'type': 'long'
                },
                // Was "@everyone" mentioned
                // (only true if the author had permission to @everyone at the time).
                // This accounts for the case where "@everyone" could be in a
                // message, but it had no effect,
                // because the user doesn't have permissions to ping everyone.
                'mention_everyone': {
                    'type': 'boolean'
                },
                // Array of [message content, embed title, embed author,
                // embed description, ...] for full-text search.
                'content': {
                    'type': 'text',
                    'fields': {
                        'lang_analyzed': {
                            'type': 'text',
                            'analyzer': 'english'
                        }
                    }
                },
                // An array of shorts, specifying what type of media the message has.
                // "has:link|image|video|embed|file".
                'has': {
                    'type': 'short'
                },
                // An array of normalized hostnames in the message,
                // traverse up to the domain. Not yet exposed in client.
                // "http://foo.bar.com" gets turned into ["foo.bar.com", "bar.com"]
                'link_hostnames': {
                    'type': 'keyword'
                },
                // Embed providers as returned by oembed, i.e. "Youtube".
                // Not yet exposed in client.
                'embed_providers': {
                    'type': 'keyword'
                },
                // Embed type as returned by oembed. Not yet exposed in client.
                'embed_types': {
                    'type': 'keyword'
                },
                // File extensions of attachments, i.e. "fileType:mp3"
                'attachment_extensions': {
                    'type': 'keyword'
                },
                // The filenames of the attachments. Not yet exposed in client.
                'attachment_filenames': {
                    'type': 'text',
                    'analyzer': 'simple'
                }
            }
        }
    }
}
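As a rough illustration of how the mapping above might be queried, here is a sketch that builds an es query body scoped to one guild, with optional channel and "after this message id" filters. The function name and parameters are hypothetical, and this is not the query Discord actually runs; it only uses standard `bool`, `match`, `term` and `range` clauses against fields from the template.

```python
from typing import Any, Dict, List, Optional


def build_search_body(
    guild_id: int,
    text: str,
    channel_id: Optional[int] = None,
    after_message_id: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a query body for the message mapping (hypothetical helper)."""
    # Exact-match conditions go in "filter" since they need no relevance scoring.
    filters: List[Dict[str, Any]] = [{"term": {"guild_id": guild_id}}]
    if channel_id is not None:
        filters.append({"term": {"channel_id": channel_id}})
    if after_message_id is not None:
        # ids are snowflakes, so a range on "id" is effectively a time range.
        filters.append({"range": {"id": {"gt": after_message_id}}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {"content": text}}],  # full-text part
                "filter": filters,
            }
        }
    }


body = build_search_body(guild_id=42, text="release notes", after_message_id=1000)
```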