Refine start field boosting

Decay scales that are too small can cause scores to quickly drop to 0, leaving you with no start field boosting at all. This change also switches to Gaussian start field decay, which better suits our boosting needs. The Gaussian function decays slowly, then rapidly, then slowly again. This creates a cluster of high-scoring results whose start field is near the present, and a spread of results whose starts are further from the present, either in the past or in the future. LEARNER-993
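
To make the difference between the old and new settings concrete, here is a minimal sketch of the linear and Gaussian decay formulas as described in the Elasticsearch function score documentation. The helper names (`gauss_decay`, `linear_decay`) and the simplification of measuring distance in whole days are assumptions made for illustration; they are not part of this commit, and the result is the raw decay value before the `weight` of 5.0 is applied.

```python
import math


def gauss_decay(distance_days, scale_days, decay, offset_days=0.0):
    """Gaussian decay: exp(-d^2 / (2 * sigma^2)), sigma^2 = -scale^2 / (2 * ln(decay))."""
    d = max(0.0, abs(distance_days) - offset_days)
    sigma_squared = -(scale_days ** 2) / (2.0 * math.log(decay))
    return math.exp(-(d ** 2) / (2.0 * sigma_squared))


def linear_decay(distance_days, scale_days, decay, offset_days=0.0):
    """Linear decay: max((s - d) / s, 0), s = scale / (1 - decay)."""
    d = max(0.0, abs(distance_days) - offset_days)
    s = scale_days / (1.0 - decay)
    return max((s - d) / s, 0.0)


if __name__ == '__main__':
    # Compare the old config (linear, scale 1d) with the new one (gauss, scale 30d).
    for days in (0, 1, 20, 30, 90, 365):
        old = linear_decay(days, scale_days=1, decay=0.95)
        new = gauss_decay(days, scale_days=30, decay=0.95)
        print('{:>4} days  linear/1d: {:.3f}  gauss/30d: {:.3f}'.format(days, old, new))
```

By construction, each function returns `decay` when the distance equals `scale`. With the old settings the linear score reaches 0 at roughly `scale / (1 - decay)` = 20 days from the present, so any course run starting further out received no start field boost. The Gaussian curve with a 30 day scale approaches 0 without ever being clamped to it.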

Commit 1fece291 authored May 31, 2017 by Renzo Lucioni

Showing with 50 additions and 9 deletions

course_discovery/apps/edx_haystack_extensions/elasticsearch_boost_config.py
+49 -1

course_discovery/settings/devstack.py
+1 -8

--- a/course_discovery/apps/edx_haystack_extensions/elasticsearch_boost_config.py
+++ b/course_discovery/apps/edx_haystack_extensions/elasticsearch_boost_config.py
+# pylint: disable=line-too-long
 def get_elasticsearch_boost_config():
+    """
+    Custom boosting config used to control relevance scores.
+
+    For a good primer on the theory behind relevance scoring, read
+    https://www.elastic.co/guide/en/elasticsearch/guide/1.x/scoring-theory.html.
+
+    If you're trying to tweak this config locally with a small dataset and are
+    seeing strange relevance scores, keep in mind that Elasticsearch computes
+    shard-local relevance scores. This causes small discrepancies between relevance
+    scores across shards that become more pronounced with small data sets. For more
+    on this, see https://www.elastic.co/blog/understanding-query-then-fetch-vs-dfs-query-then-fetch.
+
+    Use search_type=dfs_query_then_fetch to counteract this effect when querying
+    Elasticsearch while debugging. The search_type parameter must be passed in the
+    querystring. For more, see https://www.elastic.co/guide/en/elasticsearch/reference/1.5/search-request-body.html
+    and https://www.elastic.co/guide/en/elasticsearch/reference/1.5/search-request-search-type.html.
+
+    To see how a given hit's score was computed, use the explain parameter:
+    https://www.elastic.co/guide/en/elasticsearch/reference/1.5/search-request-explain.html
+    """
    elasticsearch_boost_config = {
        'function_score': {
            'boost_mode': 'sum',
@@ -8,7 +29,34 @@ def get_elasticsearch_boost_config():
                {'filter': {'term': {'pacing_type_exact': 'self_paced'}}, 'weight': 1.0},
                {'filter': {'term': {'type_exact': 'Professional Certificate'}}, 'weight': 1.0},
                {'filter': {'term': {'type_exact': 'MicroMasters'}}, 'weight': 1.0},
-                {'linear': {'start': {'origin': 'now', 'decay': 0.95, 'scale': '1d'}}, 'weight': 5.0},
+
+                # Decay function for modifying scores based on the value of the
+                # start field. The Gaussian function decays slowly, then rapidly,
+                # then slowly again. This creates a cluster of high-scoring results
+                # whose start field is near the present, and a spread of results
+                # whose starts are further from the present, either in the past
+                # or in the future.
+                #
+                # Be careful with scales less than 30 days! Scales that are too
+                # small can cause scores to quickly drop to 0, leaving you with
+                # no start field boosting at all.
+                #
+                # For more on how decay functions work, especially if you're thinking
+                # about changing this, read https://www.elastic.co/guide/en/elasticsearch/guide/1.x/decay-functions.html
+                # and https://www.elastic.co/guide/en/elasticsearch/reference/1.5/query-dsl-function-score-query.html#_decay_functions.
+                #
+                # For help visualizing the effect different decay functions can
+                # have on relevance scores, try https://codepen.io/xyu/full/MyQYjN.
+                {
+                    'gauss': {
+                        'start': {
+                            'origin': 'now',
+                            'decay': 0.95,
+                            'scale': '30d'
+                        }
+                    },
+                    'weight': 5.0
+                },

                # Boost function for CourseRuns with enrollable paid Seats.
                # We want to boost if:

--- a/course_discovery/settings/devstack.py
+++ b/course_discovery/settings/devstack.py
@@ -11,14 +11,7 @@ LOGGING['handlers']['local'] = {
 # Determine which requests should render Django Debug Toolbar
 INTERNAL_IPS = ('127.0.0.1',)

-
-HAYSTACK_CONNECTIONS = {
-    'default': {
-        'ENGINE': 'haystack.backends.elasticsearch_backend.ElasticsearchSearchEngine',
-        'URL': 'http://edx.devstack.elasticsearch:9200/',
-        'INDEX_NAME': 'catalog',
-    },
-}
+HAYSTACK_CONNECTIONS['default']['URL'] = 'http://edx.devstack.elasticsearch:9200/'

 SOCIAL_AUTH_REDIRECT_IS_HTTPS = False
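
The debugging advice in the `get_elasticsearch_boost_config` docstring (pass `search_type=dfs_query_then_fetch` in the querystring and use the `explain` parameter) could be sketched as follows. This only builds the URL and request body and does not send anything. The helper name `build_debug_search` and the inner `query_string` query are hypothetical, the index name `catalog` is taken from the removed devstack block and assumed to still be the default, and the real service builds its queries through Haystack rather than by hand.

```python
import json
from urllib.parse import urlencode, urljoin

# URL from the devstack settings in this commit. The index name is an
# assumption carried over from the removed HAYSTACK_CONNECTIONS block.
ES_URL = 'http://edx.devstack.elasticsearch:9200/'
INDEX_NAME = 'catalog'


def build_debug_search(query_text, size=5):
    """Return (url, json_body) for a search request suited to debugging scores."""
    # search_type goes in the querystring, as the docstring notes.
    params = urlencode({'search_type': 'dfs_query_then_fetch'})
    url = urljoin(ES_URL, '{}/_search?{}'.format(INDEX_NAME, params))

    body = {
        # Ask Elasticsearch to include a score explanation with each hit.
        'explain': True,
        'size': size,
        'query': {
            'function_score': {
                # Hypothetical base query, for illustration only.
                'query': {'query_string': {'query': query_text}},
                'functions': [
                    {
                        'gauss': {
                            'start': {'origin': 'now', 'decay': 0.95, 'scale': '30d'}
                        },
                        'weight': 5.0,
                    },
                ],
                'boost_mode': 'sum',
            }
        },
    }
    return url, json.dumps(body)


if __name__ == '__main__':
    url, body = build_debug_search('python')
    print(url)
    print(body)
```

The resulting URL and body can be sent with any HTTP client. Each hit in the response should then carry an `_explanation` entry showing how the decay function contributed to its score.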