Commit 20e1fcd9 by Steve Komarov

Merge branch 'master' into stkomarov-fixes

parents 3eaa178f 76ab893e
......@@ -24,18 +24,99 @@ to on-line batched analytics (e.g. for an instructor dashboard), to
on-line realtime analytics (e.g. for the system to react to an event
the analytics detects).
The model behind Insights is the app store model:
![App store](docs/app_store.png)
As with an app store (Android shown above), we provide a runtime. This
runtime provides a fixed set of technologies (Python, numpy, scipy,
pylab, pandas, mongo, a cache, etc.). If you restrict yourself to this
runtime, anyone running Insights can host your analytic. If you'd like
to move outside this set of tools, you can do that too, but then you
may have to host your own analytics server.
Comparison to other systems:
* Tincan is an SOA and a format for streaming analytics. Insights is
  an API and runtime for handling those events. The two are very
  complementary.
* Twitter Storm is a framework for sending events around. Insights is
  an API and runtime which would benefit from moving to something like
  Storm.
* Hadoop is a distributed computation engine. For most learning
  analytics, Hadoop is overkill, but it could be embedded in an
  analytics module if desired.
Examples
--------
Views show up in the dashboards. To define an analytic which just
shows "Hello World" in the analytics dashboard:
    @view()
    def hello_world():
        return "<html>Hello world!</html>"
Queries return data for use in other parts of the system. If you would
like to define a new analytic which shows a histogram of grades, the
first step would be to define a query which will return grades. How
this is done depends on your LMS, but it is often convenient to define
a dummy one which does not rely on having a functioning LMS present;
this allows off-line development without live student data:
    @query()
    def get_grades(course):
        ''' Dummy data module. Returns grades
        '''
        grades = 3*numpy.random.randn(1000,4) + \
                 12*numpy.random.binomial(1,0.3,(1000,4)) + 40
        return grades
Once this is in place, you can define a view which will call this query:
    @view()
    def plot_grades(fs, query, course):
        grades = query.get_grades(course)
        filename = course+"_"+str(time.time())+".png"
        title("Histogram of course grades")
        hist(grades)
        f = fs.open(filename, "w")
        savefig(f)
        f.close()
        fs.expire(filename, 5*60)
        return "<img src="+fs.get_url(filename)+">"
At this point, the following will show up in the instructor dashboard:
![Grade histogram](docs/grade_histogram.png)
Note that the query and the view don't have to live on the same
machine. If someone wants to reuse your grade histogram in a different
LMS, all they need to do is define a new get_grades query.
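For instance, a replacement query backed by an LMS database read
replica might look roughly like the sketch below. The table and
column names, and the use of sqlite, are assumptions for
illustration; in the framework the function would carry the same
`@query()` decorator as the dummy version:

```python
import sqlite3

# Hypothetical replacement for the dummy get_grades: same name, but
# backed by a database instead of random data. The schema here is
# invented for illustration only.
def get_grades(course, connection):
    '''Return the list of grades recorded for one course.'''
    rows = connection.execute(
        "SELECT grade FROM courseware_grades WHERE course_id = ?",
        (course,)).fetchall()
    return [r[0] for r in rows]

# Self-contained demo against an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courseware_grades (course_id TEXT, grade REAL)")
conn.executemany("INSERT INTO courseware_grades VALUES (?, ?)",
                 [("6.002x", 85.0), ("6.002x", 72.5), ("8.02x", 90.0)])
grades = get_grades("6.002x", conn)
```

Because the view only talks to `query.get_grades`, swapping the dummy
query for a database-backed one requires no change to the view.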
To build a module which takes all incoming events and dumps them into
a database:
    @event_handler()
    def dump_to_db(mongodb, events):
        collection = mongodb['event_log']
        collection.insert([e.event for e in events])
Except for imports, that's all that's required.
Architecture
------------
A block diagram of where the analytics might fit into an overall
learning system is:
![System structure](docs/system_structure.png)
The learning management system (and potentially other sources)
streams events to the analytics framework. In addition, the modules
in the framework will typically have access to read replicas of
production databases. In practice, a lot of analytics can be
performed directly from the LMS databases with a lot less effort than
processing events.
A single module
---------------
A rough diagram of a single analytics module is:
......@@ -65,11 +146,48 @@ The views and queries are automatically inspected for parameters, and
the system will do the right thing. If you would like to have a
per-module database, simply take a db parameter. Etc.
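The idea behind this can be shown with a toy sketch. This is not the
framework's actual implementation; the resource table and the
`call_with_injection` helper are invented here purely to illustrate
that taking a parameter with a known name is all a module author does:

```python
import inspect

# Toy illustration of dependency injection by argument name. The
# names 'db' and 'fs' mirror the parameters the README mentions;
# everything else is made up for this sketch.
RESOURCES = {
    'db': 'per-module database handle',
    'fs': 'per-module filesystem handle',
}

def call_with_injection(f, **user_args):
    '''Fill in any parameter whose name matches a known resource.'''
    kwargs = dict(user_args)
    for name in inspect.signature(f).parameters:
        if name in RESOURCES and name not in kwargs:
            kwargs[name] = RESOURCES[name]
    return f(**kwargs)

def my_view(db, course):
    return "course {0} using {1}".format(course, db)

result = call_with_injection(my_view, course="6.002x")
```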
To understand how to build modules in more detail, the best place to
start is by reading the module which defines testcases -- the file
modules/testmodule/__init__.py. The next place to look is the code
for the decorators, and finally the main views and dashboard.
Using with other LMSes
----------------------
The architecture is designed to be usable with common analytics shared
between multiple LMSes. The structure for this is:
![Multiple LMSes](docs/multilms.png)
Here, each instance has a data layer module. This module translates
the data generated by the particular LMS into a common
representation. Higher-level analytics are built on top of that
common representation. We're trying to come up with a process for
creating this data layer, but it's not essential we get it 100%
right. In most cases, it is relatively easy to include
backwards-compatibility queries.
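As a sketch, a data-layer query might translate an LMS-native record
into the common representation. Every field name below is an
illustrative assumption, not the project's actual schema:

```python
# Illustrative data-layer translation: map an LMS-native tracking
# record into a common representation that higher-level analytics
# consume. Field names are invented for this sketch.
def native_to_common(native_event):
    '''Translate one edX-style tracking record to common fields,
       dropping LMS-specific fields the common layer does not use.'''
    return {
        'user': native_event['username'],
        'verb': native_event['event_type'],
        'time': native_event['time'],
    }

common = native_to_common({'username': 'alice',
                           'event_type': 'play_video',
                           'time': '2013-07-21T12:00:00',
                           'agent': 'Mozilla/5.0'})
```

A different LMS would ship its own translation with the same output
fields, and the shared analytics above it would not change.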
Structuring servers
-------------------
The system is transparent to how analytics are split across
servers. There are several models for how this might be used.
First, we might have production-grade code on e.g. a critical server
which keeps student profile/grading/etc. information, while still
maintaining prototype analytics servers, which may be on-line more
intermittently:
![Multiple servers](docs/multiserver.png)
A second way to use this might be by function. For example, we might
embed analytics in the LMS, in the forums, in the wiki, in the student
registration system, and in other parts of the system. Those would
provide access to data from those subsystems. We may also wish to have
specialized runtimes providing access to additional tools like Hadoop
or R. A single computer can query across all of these servers from the
Insights API:
![Per-system analytic](docs/heterogenous.png)
Installing
----------
......@@ -145,11 +263,14 @@ per-course/per-student. An instructor of that course might want to
have that fixed to the course (so it transforms into a per-student
analytic). djobject's transform_embed defines a DSL for restricting
permissions to analytics, as well as for fixing specific command-line
parameters. This DSL should be cleaned up, but it's good enough for
now.
Multiple analytics servers can be merged into one djobject.
There is an issue of network reliability and timeouts when accessing
remotely. You can set timeouts on djembed objects to manage those
issues.
Shortcuts/invariants
--------------------
......@@ -218,8 +339,10 @@ Gotchas
* For events to flow in, a decorator in core.views must be
  called. This must be imported from the main application.
* Sometimes, the network transparency isn't quite right. This is a
  bug.
* Are there still any Python path issues if you have this installed
  and are developing from source?
Product Backlog
---------------
......@@ -289,44 +412,16 @@ students, instructors, researchers, marketers, etc.
Architecture Expansions
=======================
This section lists some long-term architectural design goals of the
system.
The architecture is explicitly designed to eventually scale to running
different analytics on different servers. edinsights.core.djobject
(TODO: change to insights.core) provides a query object and a view
object, which can be used to access queries and views in an identical
way, regardless of whether or not there is a network in between. In
the future, we would like to support an architecture where we have
multiple analytics servers:
![Multiple servers](docs/multiserver.png)
This way, we can have production-grade code on e.g. a critical server
which keeps student profile/grading/etc. information, while still
maintaining prototype analytics servers, which may be on-line more
intermittently. In order to support this, the djobject abstraction
would have to be extended to support multiple servers. In addition,
the current way the analytics embed in the courseware would have to
change substantially.
In addition, the architecture is designed to scale to sharing
analytics between LMSes. A potential structure for this is:
![Multiple LMSes](docs/multilms.png)
Here, each instance would have a data layer module. This module would
translate the data generated by the particular LMS into a common
representation. Analytics would be built on top of that common
representation.
We would like to also support FERPA-compliance. This could be built in
one of two ways. Per-school stacks, including analytics:
Split analytics:
The API supports either. Building out back-end support for either
would be substantial work.
Other edX Code
==============
......
......@@ -67,6 +67,10 @@ def view(category = None, name = None, description = None, args = None):
    args: Optional argspec for the function. This is generally better
      omitted.

    TODO: human_name: Name without Python name restrictions -- e.g.
      "Daily uploads" instead of "daily_uploads" -- for use in
      human-usable dashboards.
    '''
    def view_factory(f):
        registry.register_handler('view', category, name, description, f, args)
......
......@@ -33,11 +33,21 @@ def register_handler(cls, category, name, description, f, args):
        category += "+"
    if cls not in request_handlers:
        request_handlers[cls] = {}
    # We may want to register under multiple names. E.g.
    # edx.get_grades and (once adopted globally) generic
    # get_grades
    if isinstance(name, list):
        names = name
    else:
        names = [name]
    for n in names:
        if n in request_handlers[cls]:
            # We used to have this be an error.
            # We changed to a warning for the way we handle dummy values.
            log.warn("{0} already in {1}".format(n, category))  # raise KeyError(name+" already in "+category)
        request_handlers[cls][n] = {'function': f, 'name': n, 'doc': description, 'category': category}
class StreamingEvent:
''' Event object. Behaves like the normal JSON event dictionary,
......
......@@ -164,3 +164,15 @@ def djt_fake_user_count(query):
the network, as well as optional parameters like fs, db, etc.
'''
return "<html>Users: {uc}</html>".format(uc = query.djt_fake_user_count())
@query(name=['djt_three_name', 'edx_djt_three_name', 'edx.djt_three_name'])
def djt_three_name():
    return "I have three names"

@query(name='djt_check_three_name')
def check_three_name(query):
    if query.djt_three_name() != "I have three names":
        raise Exception("oops")
    if query.edx_djt_three_name() != "I have three names":
        raise Exception("oops")
    return "Works"
......@@ -151,3 +151,8 @@ class SimpleTest(TestCase):
        c = Client()
        response = c.get('/view/djt_fake_user_count').content
        self.assertEqual(response, "<html>Users: 2</html>")

    def test_multiname(self):
        c = Client()
        response = c.get('/query/djt_check_three_name').content
        self.assertEqual(response, "Works")
......@@ -196,9 +196,23 @@ LOGGING = {
#initialize celery
import djcelery
djcelery.setup_loader()
#import the settings for celery from the edinsights module and for cache
try:
    from celerysettings_dev import *
    from djangocachesettings_dev import *
except ImportError:
    # The code had the imports below. These fail when running test
    # cases stand-alone. I think the above fixes this, but I'm
    # leaving this in for now in case there are configurations I
    # haven't thought of. If the exception is raised, remove this
    # comment, remove the exception, and add a comment explaining
    # when the second set of imports is necessary.
    #
    # If it's, say, October, and no one has run into the exception,
    # we should kill the extra code.
    #
    # pmitros -- 21/July/2013.
    raise Exception("Import failed. See instructions in settings.py")
    from edinsights.djangocachesettings_dev import *
    from edinsights.celerysettings_dev import *