Commit 20e1fcd9 by Steve Komarov

Merge branch 'master' into stkomarov-fixes

parents 3eaa178f 76ab893e
......@@ -24,18 +24,99 @@ to on-line batched analytics (e.g. for an instructor dashboard), to
on-line realtime analytics (e.g. for the system to react to an event
the analytics detects).
The model behind Insights is the app store model:
![App store](docs/app_store.png)
As with an app store (Android shown above), we provide a runtime. This
runtime provides a fixed set of technologies (Python, numpy, scipy,
pylab, pandas, mongo, a cache, etc.). If you restrict yourself to this
runtime, anyone running Insights can host your analytic. If you'd like
to move outside this set of tools, you can do that too, but then you
may have to host your own analytics server.
Comparison to other systems:
* Tincan is an SOA and a format for streaming analytics. Insights is
  an API and runtime for handling those events. The two are very
  complementary.
* Twitter Storm is a framework for sending events around. Insights is
  an API and runtime which would benefit from moving to something like
  Storm.
* Hadoop is a distributed computation engine. For most learning
  analytics, Hadoop is overkill, but it could be embedded in an
  analytics module if desired.
Examples
--------
Views show up in the dashboards. To define an analytic which just
shows "Hello World" in the analytics dashboard:
    @view()
    def hello_world():
        return "<html>Hello world!</html>"
Queries return data for use in other parts of the system. If you would
like to define a new analytic which shows a histogram of grades, the
first step would be to define a query which will return grades. How
this is done depends on your LMS, but it is often convenient to define
a dummy one which does not rely on having a functioning LMS present;
this allows off-line development without live student data:
    @query()
    def get_grades(course):
        ''' Dummy data module. Returns grades
        '''
        grades = 3*numpy.random.randn(1000,4) + \
                 12*numpy.random.binomial(1,0.3,(1000,4)) + 40
        return grades
Once this is in place, you can define a view which will call this query:
    @view()
    def plot_grades(fs, query, course):
        grades = query.get_grades(course)
        filename = course+"_"+str(time.time())+".png"
        title("Histogram of course grades")
        hist(grades)
        f = fs.open(filename, "w")
        savefig(f)
        f.close()
        fs.expire(filename, 5*60)
        return "<img src="+fs.get_url(filename)+">"
At this point, the following will show up in the instructor dashboard:
![Grade histogram](docs/grade_histogram.png)
Note that the query and the view don't have to live on the same
machine. If someone wants to reuse your grade histogram in a different
LMS, all they need to do is define a new get_grades query.
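For instance, a replacement query backed by an LMS database read
replica might look roughly like the sketch below. The table and
column names, and the use of sqlite, are assumptions for
illustration; in the framework the function would carry the same
`@query()` decorator as the dummy version:

```python
import sqlite3

# Hypothetical replacement for the dummy get_grades: same name, but
# backed by a database instead of random data. The schema here is
# invented for illustration only.
def get_grades(course, connection):
    '''Return the list of grades recorded for one course.'''
    rows = connection.execute(
        "SELECT grade FROM courseware_grades WHERE course_id = ?",
        (course,)).fetchall()
    return [r[0] for r in rows]

# Self-contained demo against an in-memory database:
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE courseware_grades (course_id TEXT, grade REAL)")
conn.executemany("INSERT INTO courseware_grades VALUES (?, ?)",
                 [("6.002x", 85.0), ("6.002x", 72.5), ("8.02x", 90.0)])
grades = get_grades("6.002x", conn)
```

Because the view only talks to `query.get_grades`, swapping the dummy
query for a database-backed one requires no change to the view.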
To build a module which takes all incoming events and dumps them into
a database:
    @event_handler()
    def dump_to_db(mongodb, events):
        collection = mongodb['event_log']
        collection.insert([e.event for e in events])
Except for imports, that's all that's required.
Architecture
------------
A block diagram of where the analytics might fit into an overall
learning system is:
![System structure](docs/system_structure.png)
The learning management system (and potentially other sources)
streams events to the analytics framework. In addition, the modules
in the framework will typically have access to read replicas of
production databases. In practice, a lot of analytics can be
performed directly from the LMS databases with a lot less effort than
processing events.
A single module
---------------
A rough diagram of a single analytics module is:
......@@ -65,11 +146,48 @@ The views and queries are automatically inspected for parameters, and
the system will do the right thing. If you would like to have a
per-module database, simply take a db parameter. Etc.
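The idea behind this can be shown with a toy sketch. This is not the
framework's actual implementation; the resource table and the
`call_with_injection` helper are invented here purely to illustrate
that taking a parameter with a known name is all a module author does:

```python
import inspect

# Toy illustration of dependency injection by argument name. The
# names 'db' and 'fs' mirror the parameters the README mentions;
# everything else is made up for this sketch.
RESOURCES = {
    'db': 'per-module database handle',
    'fs': 'per-module filesystem handle',
}

def call_with_injection(f, **user_args):
    '''Fill in any parameter whose name matches a known resource.'''
    kwargs = dict(user_args)
    for name in inspect.signature(f).parameters:
        if name in RESOURCES and name not in kwargs:
            kwargs[name] = RESOURCES[name]
    return f(**kwargs)

def my_view(db, course):
    return "course {0} using {1}".format(course, db)

result = call_with_injection(my_view, course="6.002x")
```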
To understand how to build modules in more detail, the best place to
start is by reading the module which defines testcases -- the file
modules/testmodule/__init__.py. The next place to look is the code
for the decorators, and finally the main views and dashboard.
Using with other LMSes
----------------------
The architecture is designed to be usable with common analytics shared
between multiple LMSes. The structure for this is:
![Multiple LMSes](docs/multilms.png)
Here, each instance has a data layer module. This module translates
the data generated by the particular LMS into a common
representation. Higher-level analytics are built on top of that
common representation. We're trying to come up with a process for
creating this data layer, but it's not essential we get it 100%
right. In most cases, it is relatively easy to include
backwards-compatibility queries.
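As a sketch, a data-layer query might translate an LMS-native record
into the common representation. Every field name below is an
illustrative assumption, not the project's actual schema:

```python
# Illustrative data-layer translation: map an LMS-native tracking
# record into a common representation that higher-level analytics
# consume. Field names are invented for this sketch.
def native_to_common(native_event):
    '''Translate one edX-style tracking record to common fields,
       dropping LMS-specific fields the common layer does not use.'''
    return {
        'user': native_event['username'],
        'verb': native_event['event_type'],
        'time': native_event['time'],
    }

common = native_to_common({'username': 'alice',
                           'event_type': 'play_video',
                           'time': '2013-07-21T12:00:00',
                           'agent': 'Mozilla/5.0'})
```

A different LMS would ship its own translation with the same output
fields, and the shared analytics above it would not change.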
Structuring servers
-------------------
The system is transparent to how analytics are split across
servers. There are several models for how this might be used.
First, we might have production-grade code on e.g. a critical server
which keeps student profile/grading/etc. information, while still
maintaining prototype analytics servers, which may be on-line more
intermittently:
![Multiple servers](docs/multiserver.png)
A second way to use this might be by function. For example, we might
embed analytics in the LMS, in the forums, in the wiki, in the student
registration system, and in other parts of the system. Those would
provide access to data from those subsystems. We may also wish to have
specialized runtimes providing access to additional tools like Hadoop
or R. A single computer can query across all of these servers from the
Insights API:
![Per-system analytic](docs/heterogenous.png)
Installing
----------
......@@ -145,11 +263,14 @@ per-course/per-student. An instructor of that course might want to
have that fixed to the course (so it transforms into a per-student
analytic). djobject's transform_embed defines a DSL for restricting
permissions to analytics, as well as for fixing specific command-line
parameters. This DSL should be cleaned up, but it's good enough for
now.
Multiple analytics servers can be merged into one djobject.
There is an issue of network reliability and timeouts when accessing
remotely. You can set timeouts on djembed objects to manage those
issues.
Shortcuts/invariants
--------------------
......@@ -218,8 +339,10 @@ Gotchas
* For events to flow in, a decorator in core.views must be
  called. This must be imported from the main application.
* Sometimes, the network transparency isn't quite right. This is a
  bug.
* Are there still any Python path issues if you have this installed
  and are developing from source?
Product Backlog
---------------
......@@ -289,44 +412,16 @@ students, instructors, researchers, marketers, etc.
Architecture Expansions
=======================
This section lists some long-term architectural design goals of the
system.
The architecture is explicitly designed to eventually scale to running
different analytics on different servers. edinsights.core.djobject
(TODO: change to insights.core) provides a query object and a view
object, which can be used to access queries and views in an identical
way, regardless of whether or not there is a network in between. In
the future, we would like to support an architecture where we have
multiple analytics servers:
![Multiple servers](docs/multiserver.png)
This way, we can have production-grade code on e.g. a critical server
which keeps student profile/grading/etc. information, while still
maintaining prototype analytics servers, which may be on-line more
intermittently. In order to support this, the djobject abstraction
would have to be extended to support multiple servers. In addition,
the current way the analytics embed in the courseware would have to
change substantially.
In addition, the architecture is designed to scale to sharing
analytics between LMSes. A potential structure for this is:
![Multiple LMSes](docs/multilms.png)
Here, each instance would have a data layer module. This module would
translate the data generated by the particular LMS into a common
representation. Analytics would be built on top of that common
representation.
We would like to also support FERPA-compliance. This could be built in
one of two ways. Per-school stacks, including analytics:
Split analytics:
The API supports either. Building out back-end support for either
would be substantial work.
Other edX Code
==============
......
......@@ -67,6 +67,10 @@ def view(category = None, name = None, description = None, args = None):
    args: Optional argspec for the function. This is generally better
      omitted.

    TODO: human_name: Name without Python name restrictions -- e.g.
      "Daily uploads" instead of "daily_uploads" -- for use in
      human-usable dashboards.
    '''
    def view_factory(f):
        registry.register_handler('view', category, name, description, f, args)
......
......@@ -33,11 +33,21 @@ def register_handler(cls, category, name, description, f, args):
        category += "+"
    if cls not in request_handlers:
        request_handlers[cls] = {}
    # We may want to register under multiple names. E.g.
    # edx.get_grades and (once adopted globally) generic
    # get_grades
    if isinstance(name, list):
        names = name
    else:
        names = [name]
    for n in names:
        if n in request_handlers[cls]:
            # We used to have this be an error.
            # We changed to a warning for the way we handle dummy values.
            log.warn("{0} already in {1}".format(n, category))  # raise KeyError(name+" already in "+category)
        request_handlers[cls][n] = {'function': f, 'name': n, 'doc': description, 'category': category}
class StreamingEvent:
''' Event object. Behaves like the normal JSON event dictionary,
......
......@@ -164,3 +164,15 @@ def djt_fake_user_count(query):
the network, as well as optional parameters like fs, db, etc.
'''
return "<html>Users: {uc}</html>".format(uc = query.djt_fake_user_count())
@query(name=['djt_three_name', 'edx_djt_three_name', 'edx.djt_three_name'])
def djt_three_name():
    return "I have three names"

@query(name='djt_check_three_name')
def check_three_name(query):
    if query.djt_three_name() != "I have three names":
        raise Exception("oops")
    if query.edx_djt_three_name() != "I have three names":
        raise Exception("oops")
    return "Works"
......@@ -151,3 +151,8 @@ class SimpleTest(TestCase):
        c = Client()
        response = c.get('/view/djt_fake_user_count').content
        self.assertEqual(response, "<html>Users: 2</html>")

    def test_multiname(self):
        c = Client()
        response = c.get('/query/djt_check_three_name').content
        self.assertEqual(response, "Works")
......@@ -196,9 +196,23 @@ LOGGING = {
#initialize celery
import djcelery
djcelery.setup_loader()
#import the settings for celery from the edinsights module and for cache
try:
    from celerysettings_dev import *
    from djangocachesettings_dev import *
except ImportError:
    # The code had the imports below. These fail when running test
    # cases stand-alone. I think the above fixes this, but I'm
    # leaving this in for now in case there are configurations I
    # haven't thought of. If the exception is raised, remove this
    # comment, remove the exception, and add a comment explaining
    # when the second set of imports is necessary.
    #
    # If it's, say, October, and no one has run into the exception,
    # we should kill the extra code.
    #
    # pmitros -- 21/July/2013.
    raise Exception("Import failed. See instructions in settings.py")
    from edinsights.djangocachesettings_dev import *
    from edinsights.celerysettings_dev import *