This document describes the split mongostore representation, which separates course structure from content and lets each course run have its own structure. It does not describe the original mongostore representation, which combined structure and content and used the key to distinguish draft from published elements.

This document does not describe mongo nor its operations. See `http://www.mongodb.org/`_ for information on Mongo.

Product Goals and Discussion
----------------------------

(Mark Chang)

This work was instigated by the studio team's need to correctly do metadata inheritance. As we moved from an on-startup load of the courseware, the system was able to inflate and perform an inheritance calculation step such that the intended properties of children could be set through inheritance. While not strictly a requirement from the studio authoring approach, where inheritance really rears its head is on import of existing courseware that was designed assuming inheritance.

A short term patch was applied that allowed inheritance to act correctly, but it was felt to be insufficient, and this would be an opportunity to make a cleaner datastore representation. After much difficulty with how draft objects would work, Calen Pennington worked through a split data store model à la the FAT filesystem (Mark's metaphor, not Cale's) to split the structure from the content. The goal would be a sea of content documents that would not know about the structure they were utilized within. Cale began the work and handed it off to Don Mitchell.

In the interim, great discussion was had at the Architect's Council that firmed up the design and strategy for implementation, adding great richness and completeness to the new data structure.

The immediate needs are two, and only two.

#. functioning metadata inheritance
#. good groundwork for versioning

While the discussions of the atomic unit of courseware available for sharing, how these are shared, and how they refer back to the parent definition are all valuable, they will not be built in the near term. I understand and expect there to be many refactorings, improvements, and migrations in the future.

I fully anticipate much more detail to be uncovered even in this first thin implementation. When that happens, we will need as much advice as possible from those watching this page to make sure we move in the right direction. We also must have the right design artifacts to document where we stand relative to the overall design that has loftier goals.

Representation
--------------

The xmodule collections:

+ `modulestore.active_versions`: this collection maps the org, course, and run to the current draft and published versions of the course.
+ `modulestore.structures`: this collection has one entry per course run and one for the template.
+ `modulestore.definitions`: this collection has one entry per "module" or "block" version.

modulestore.active_versions: 2 simple maps for dereferencing the correct course from the structures collection. Every course run will have a draft version. Not every course run will have a published version. No course run will have more than one of each of these.

::

    {
        '_id' : uniqueid,
        'versions' : { <versionName> : versionGuid, ..},
        'creator' : user_id,
        'created' : date (native mongo rep)
    }

+ `_id` is a unique id for finding this course run. It's a location-reference string, like 'edu.mit.eng.eecs.6002x.industry.spring2013'.
+ `versions`: These are references to `modulestore.structures`. A location-reference like `edu.mit.eng.eecs.6002x.industry.spring2013;draft` refers to the value associated with `draft` for this document.
+ `versionName` is `draft`, `published`, or another user-defined string.
+ `versionGuid` is a system generated globally unique id (hash). It points to the entry in `modulestore.structures`.

`draftVersion`: the design will try to generate a new draft version for each change to the course object: that is, for each move, deletion, node creation, or metadata change. Cloning a course (creating a new run of a course or such) will create a new entry in this table with just a `draftVersion` and will cause a copy of the corresponding entry in `modulestore.structures`. The entry in `structures` will point to its version parent in the source course.

modulestore.structures : the entries in this collection follow this definition:

::

    {
        '_id' : course_guid,
        'blocks' : {
            block_guid : // the guid is an arbitrary id to represent this node in the course tree
                {
                    'children' : [ block_guid* ],
                    'metadata' : { property map },
                    'definition' : definition_guid,
                    'category' : 'section' | 'sequence' | ...
                },
            ... // more guids
        },
        'root' : block_guid,
        'original' : course_guid, // the first version of this course from which all others were derived
        'previous' : course_guid | null, // the previous revision of this course (null if this is the original)
        'version_entry' : uniqueid, // from the active_versions collection
        'creator' : user_id
    }

+ `blocks`: each block is a node in the course such as the course, a section, a subsection, a unit, or a component. The block ids remain the same over edits (they're not versioned).
+ `root`: the true top of the course. Not all nodes without parents are truly roots. Some are orphans.
+ `course_guid, block_guid, definition_guid` are not those specific strings but instead some system generated globally unique id.
+ The one which gets passed around and pointed to by urls is the `block_guid`; so, it will be the one the system ensures is readable. Unlike the other guids, this one stays the same over revisions and can even be the same between course runs (although the course run contextualizes it to distinguish its instantiated version).
+ `definition` points to the specific revision of the given element in `modulestore.definitions` which this version of the course includes.
+ `children` lists the block_guids which are the children of this node in the course tree. It's an error if a guid in the `children` list does not occur in the `blocks` dictionary.
+ `metadata` is the node's explicitly defined metadata, some of which may be inherited by its children.

For debugging purposes, there may be value in adding a courseId field (org, course, run) for use via db browsers.

modulestore.definitions : the data associated with each version of each node in the structures. Many courses may point to the same definition or may point to different versions derived from the same original definition.

::

    {
        '_id' : guid,
        'data' : ..,
        'default_settings' : {'display_name':..,..}, // a starting point for new uses of this definition
        'category' : xblocktype, // the xmodule/xblock type such as course, problem, html, video, about
        'original' : guid, // the first kept version of this definition from which all others were derived
        'previous' : guid | null, // the previous revision of this definition (null if this is the original)
        'creator' : user_id // the id of whoever pressed the draft or publish button
    }

+ `_id`: a guid to uniquely identify the definition.
+ `data` is the payload used by the xmodule and follows the xmodule's data representation.
+ `category` is the xmodule type and is used to figure out which xmodule to instantiate.

There may be some debugging value to adding a courseId field, but it may also be misleading if the element is used in more than one course.

Templates
~~~~~~~~~

(I'm refactoring templates quite a bit from their representation prior to this design)

All field defaults will be defined through the xblock field.default mechanism. Templates, on the other hand, are for representing optional boilerplate, usually for examples such as a multiple-choice problem or a video component with the fields all filled in.
Templates are stored in yaml files which provide a template name, sorting and filtering information (e.g., requires advanced editor vs. allows simple editor), and then field: value pairs for setting xblocks' fields upon template selection.

Most of the pre-existing templates, including all of the 'empty' ones, will go away. The ones which will stay are the ones truly just giving examples or starting points for variants. This change will require that the template choice code provide a default 'blank' choice to the user, which just instantiates the model with its defaults, versus a choice of the boilerplates. The client can therefore populate its own model of the xblock and then send a create-item request to the server when the user is ready to save it.

Import/export
~~~~~~~~~~~~~

Export should allow the user to select the version of the course to export, which can be any of the draft or published versions. At a minimum, the user should choose between draft or published.

Import should import the course as a draft course regardless of whether it was exported as a published or draft one, I believe. If there's already a draft for the same course, in the best of all worlds, the import would carry the structure's guid so that we can check whether that guid exists in the structures collection and, if so, just make that the current draftVersion (don't do any actual data changes). If there's no guid or the guid doesn't exist in the structures collection, then we'll need to work out the logic for how to decide what definitions to create vs. update vs. point to.

Course ID
~~~~~~~~~

Currently, we use a triple to identify a run of a course. The triple is organization, course name, and run identity (e.g., 2013Q1). The system does not care what the id consists of, only that it uniquely identifies an edition of the course. The system uses this id to organize the course composition and find the course elements. It distinguishes between a current being-edited version (aka, draft) and a publicly viewable version (published).
Not every course has a published version, but every course will have a draft version. The application specifies whether it wants the draft or published version. This system allows the application to easily switch between the 2; however, it will have a configuration in which it's impossible to access the draft so that we can add access optimizations and extraction filtering later if needed.

Location
~~~~~~~~

The purpose of `Location` is to identify content. That is, to be able to locate content by providing sufficient addressing. The `Location` object is ubiquitous throughout the current code and thus will be difficult to adapt and make more flexible. Right now, it's a very simple `namedtuple` and a lot of code presumes this. This refactoring generalizes and subclasses it to handle various addressing schemes and remove direct manipulations.

Our code needs to locate several types of things and should probably use several different types of locators for these. These are the types of things we need to address. Some of these can be the same as others, but I wanted to lay them out fairly fine grained here before proposing my distinctions:

#. Courses: an object representing a course as an offering but not any of its content. Used for dashboards and other such navigators. These may specify a version or merely reference the idea of the course's existence.
#. Course structures: the names (and other metadata), `Locations`, and children pointers but not definitions for all the blocks in a course or a subtree of a course. Our applications often display contextual, outline, or other such structural information which does not need to include definitions but needs to show display names, graded as, and other status info. This document's design makes fetching these a single document fetch; however, if it has to fetch the full course, it will require far more work (getting all definitions too) than the apps need.
#. Blocks (uses of definitions within a version of a course including metadata, pointers to children, and type specific content)
#. Definitions: use-independent definitions of content without metadata (and currently w/o pointers to children).
#. Version trees: fetching the time history portrayal of a definition, course, or block including branching.
#. Collections of courses, definitions, or blocks matching some partial descriptors (e.g., all courses for org x, all definitions of type foo, all blocks in course y of type x, all currently accessible courses (published with startdate < today and enddate > today)).
#. Fetching of courses, blocks, or definitions via "human readable" urls. Item 6 (partial descriptors) may suffice for this, as human readable does not guarantee uniqueness.

Some of these differ not so much in how to address them but in what should be returned. The content should be up to the functions, not the addressing scheme. So, I think the addressable things are:

#. Course as in #1 above: usually a specific offering of a course. Often used as a context for the other queries.
#. Blocks (aka usages) as in #3 above: a specific block contextualized in a course
#. Definitions (#4): a specific definition
#. Collections of courses, blocks within a specific course, or definitions matching a partial descriptor

Course locator (course_loc)
```````````````````````````

There are 3 ways to locate a course:

#. By its unique id in the `active_versions` collection with an implied or specified selection of draft or published version.
#. By its unique id in the `structures` collection.

Block locator (block_loc)
`````````````````````````

A block locator finds a specific node in a specific version of a course. Thus, it needs a course locator plus a `usage_id`.

Definition locator (definition_loc)
```````````````````````````````````

Just a `guid`.
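The three locator kinds above can be sketched as follows. This is only an illustration under stated assumptions: the class names, field names, sample guids, and the in-memory dictionaries standing in for the mongo collections are all hypothetical, and defaulting to `draft` when no version is specified is one possible reading of "implied".

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical in-memory stand-ins for the three collections, keyed by _id.
ACTIVE_VERSIONS = {
    "edu.mit.eng.eecs.6002x.industry.spring2013": {
        "versions": {"draft": "struct-2", "published": "struct-1"},
    },
}
STRUCTURES = {
    "struct-1": {"root": "course", "blocks": {
        "course": {"children": [], "metadata": {},
                   "definition": "def-course", "category": "course"},
    }},
    "struct-2": {"root": "course", "blocks": {
        "course": {"children": ["week1"], "metadata": {},
                   "definition": "def-course", "category": "course"},
        "week1": {"children": [], "metadata": {"display_name": "Week 1"},
                  "definition": "def-week1", "category": "section"},
    }},
}
DEFINITIONS = {
    "def-course": {"category": "course", "data": None},
    "def-week1": {"category": "section", "data": None},
}


@dataclass(frozen=True)
class CourseLocator:
    """Either (course_id, version_name) or a structures guid."""
    course_id: Optional[str] = None      # _id in active_versions
    version_name: str = "draft"          # assumed default when implied
    version_guid: Optional[str] = None   # _id in structures


@dataclass(frozen=True)
class BlockLocator:
    """A course locator plus a usage_id, as described above."""
    course: CourseLocator
    usage_id: str


def resolve_structure(loc: CourseLocator) -> dict:
    # A structures guid addresses one version directly.
    if loc.version_guid is not None:
        return STRUCTURES[loc.version_guid]
    # Otherwise dereference through active_versions.
    entry = ACTIVE_VERSIONS[loc.course_id]
    return STRUCTURES[entry["versions"][loc.version_name]]


def resolve_block(loc: BlockLocator) -> dict:
    return resolve_structure(loc.course)["blocks"][loc.usage_id]


def resolve_definition(definition_guid: str) -> dict:
    # A definition locator is just a guid.
    return DEFINITIONS[definition_guid]


run_id = "edu.mit.eng.eecs.6002x.industry.spring2013"
draft_block = resolve_block(BlockLocator(CourseLocator(course_id=run_id), "week1"))
published = resolve_structure(CourseLocator(course_id=run_id, version_name="published"))
```

In this sample data the same `usage_id` resolves in the draft but is absent from the published structure, which is the situation the draft/published distinction is meant to handle.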
Partial descriptor collections locators (partial)
`````````````````````````````````````````````````

In the most general case, and to simplify implementation, these can be any payload passable to mongo for doing the lookup. The specification of which collection to look into can be implied by which lookup function your code calls (get_courses, get_blocks, get_definitions) or we could add it as another property. For now, I will leave this as merely a search string. Thus, to find all courses for org = mitx, use `{"org": "mitx"}`. To find all blocks in a course whose display name contains "circuit example", call `get_blocks` with the course locator plus `{"metadata.display_name" : /circuit example/i}` (the i makes it case insensitive and is just an example). To find if a definition is used in a course, call `get_blocks` with the course locator plus `{definition : definition_guid}`. Note, this looks for a specific version of the definition. If you wanted to see if it used any of a set of versions, use `{definition : {"$in" : [definition_guid*]}}`.

i4x locator
```````````

To support existing xml based courses and any urls, we need to support i4x locators. These are tuples of `(org, course, category, id, ['draft'])`. The trouble with these is that they don't uniquely identify a course run from which to dereference the element. There's also no requirement that `id` have any uniqueness outside the scope of the other elements. There's some debate as to whether these address blocks or definitions. To me, they seem to address blocks; however, in the current system there is no distinction between blocks and definitions; so, either could be argued.

This version will define an `i4x_location` class for representing these and using them for xml based courses if necessary.

Current code munges strings to make them 'acceptable' by replacing 'illegal' chars with underscores. I'd like to suggest leaving strings as is and using url escaping to make acceptable urls.
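To illustrate the difference between munging and url escaping, here is a small sketch using Python's standard `urllib.parse`. The `munge_name` function and its notion of which characters are 'illegal' are assumptions made for the example; they do not reproduce the actual current code.

```python
import re
from urllib.parse import quote, unquote


def munge_name(name: str) -> str:
    # Assumed stand-in for the current behavior: anything outside
    # letters, digits, '_', '.', and '-' becomes an underscore.
    return re.sub(r"[^\w.-]", "_", name)


def escape_name(name: str) -> str:
    # safe="" also escapes '/', so the result fits in one url path segment.
    return quote(name, safe="")


a = "Circuits: Lab 1"
b = "Circuits/ Lab 1"

munged_a, munged_b = munge_name(a), munge_name(b)
escaped_a, escaped_b = escape_name(a), escape_name(b)
```

Munging is lossy: both names collapse to the same string, so the original cannot be recovered. Escaping keeps the two names distinct, and `unquote` returns the original string.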
As to making human readable names from display strings, that should be the responsibility of the naming module, not the Location representation, imo.

Use cases (expository)
~~~~~~~~~~~~~~~~~~~~~~

There's a section below walking through a specific use case. This one just tries to review potential functionality.

Inheritance
```````````

Our system has the notion of policies which should control the behavior of whole courses or subtrees within courses. Such policies include graceperiods, discussion forum controls, dates, whether to show answers, how to randomize, etc. It's important that the course authors' intent propagates to all relevant course sections. The desired behavior is that (some? all?) metadata attributes on modules flow down to all children unless overridden.

This design addresses inheritance by making course structure and metadata separate from content, thus enabling a single or small number of db queries to get these and then compute the inheritance.

Separating editing from live production
```````````````````````````````````````

Course authors should be able to make changes in isolation from production and then push out consistent chunks of changes for all students to see as atomic and consistent. The current system allows authors to change text and content without affecting production, but not metadata nor course structure. This design separates all changes from production until pushed.

Sharing of content, part 1
``````````````````````````

Authors want to share content between course runs and even between different courses. The current system requires copying all such content and losing the provenance information which could be used to take advantage of other people's changes. This design allows multiple courses and multiple places within a course to point to the same definitions and thus potentially, some day, see other changes to the content.
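Returning to the inheritance use case described earlier in this section, the computation over a single fetched structure document might look like the sketch below. It assumes the `blocks`/`root`/`children`/`metadata` layout from the Representation section; the function name and the explicit `inheritable` set are hypothetical, since which attributes flow down ("some? all?") is still an open question.

```python
def compute_effective_metadata(structure: dict, inheritable: set) -> dict:
    """Return {block_id: metadata after inheritance} for blocks reachable from root."""
    blocks = structure["blocks"]
    effective = {}
    # Walk down from the root, carrying the inheritable values seen so far.
    stack = [(structure["root"], {})]
    while stack:
        block_id, inherited = stack.pop()
        own = blocks[block_id]["metadata"]
        # A block's explicit metadata overrides what it inherits.
        merged = {**inherited, **own}
        effective[block_id] = merged
        passed_down = {k: v for k, v in merged.items() if k in inheritable}
        for child_id in blocks[block_id]["children"]:
            stack.append((child_id, passed_down))
    return effective


# Sample structure: course C -> section S -> unit U, plus an orphan.
structure = {
    "root": "C",
    "blocks": {
        "C": {"children": ["S"],
              "metadata": {"graceperiod": "2 days", "display_name": "Course"}},
        "S": {"children": ["U"], "metadata": {"showanswer": "never"}},
        "U": {"children": [], "metadata": {"graceperiod": "1 hour"}},
        "orphan": {"children": [], "metadata": {}},
    },
}
effective = compute_effective_metadata(structure, {"graceperiod", "showanswer"})
```

Because the walk starts at `root`, orphaned blocks are never visited, and no definition documents need to be fetched to compute the result. The sketch assumes the blocks form a tree; a block listed under two parents would simply take the values from whichever path is visited last.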
Sharing of content, part 2: course structure
````````````````````````````````````````````

Because course structures are separate from their identities, courses can share structure and track changes in the same way as definitions. That is, a new course run can point to an existing course instance with its version history and then branch it from there.

Sharing of content, part 3: modules
```````````````````````````````````

Suppose a course includes a soldering tutorial (or a required lab safety lesson). Other courses want to use the same tutorial and possibly allow the student to skip it if the student succeeded at it in another course. As the tutorial updates, other courses may want to track the updates or choose to move to the updates without having to copy the modules from the module's authoritative parent course.

This design enables sharing of composed modules, but it does not track the revisions of those modules separately from their courses. It does not adequately address this but may be extensible enough to do so. That is, we could represent these shared units as separate "courses" and allow ids in block.children[] to point to courses as well as other blocks in the same course.

We should decide on the behaviors we want. For example: does the student sometimes have to repeat the content, or never? Should progress be tracked by the owning course or as a stand alone minicourse type element? Because it's a safety lesson, should all courses track the current published head and not have their own heads, or should they choose when to promote the head?

Are these shared elements rare and large grained enough to make the indirection not expensive, or will it result in devolving to the current one entry per module design for deducing course structure?

Functional differences from existing modulestore:
-------------------------------------------------

+ Courses and definitions support trees of versions knowing from where they were derived.
  For now, I will not implement the server functions for retrieving and manipulating these version trees and will leave those for a future effort. I will only implement functions which extend the trees.
+ Changes to course structure don't immediately affect production: note, we need to figure out the granularity of the user's publish behavior for pushing out these actions. That is, do they publish a whole subtree which may include new children in order to make these effective, do they publish all structural (deletion, move) changes under a subtree but not insertions as an action, do they publish each action individually, or what? How do they know that any of these are not yet published? Do we have phantom placeholders for deleted nodes w/ "publish deletion" buttons?

  + Element deletion
  + Element move
  + metadata changes

+ No location objects used as ids! This implementation will use guids instead. There's a reasonable objection to guids as being too ugly, long, and indecipherable. I will check mongo, pymongo, and python guid generation mechanisms to find out if there's a way to make ones which include a prepended string (such as course and run or an explicitly stated prepend string) and minimize guid length (e.g., by using sequential serial # from a global or local pool).

Use case walkthrough:
---------------------

Simple course creation with no precursor course: Note, this shows that publishing creates subsets and side copies, not in-line versions of nodes.
.. list-table::
   :header-rows: 1

   * - user
     - db
   * - create course for org, course id, run id
     - + active_versions.draftVersion: add entry
       + definitions: add entry C w/ category = 'course', no data
       + structures: add entry w/ 1 child C, original = self, no previous, author = user
   * - add section S
     - + copy structures entry, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + definitions: add entry S w/ category = 'section'
       + structures entry: add S to children of the course block; add S to blocks w/ no children
   * - add subsection T
     - + copy structures entry, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + definitions: add entry T w/ category = 'sequential'
       + structures entry: add T to children of the S block entry; add T to blocks w/ no children
   * - add unit U
     - + copy structures entry, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + definitions: add entry U w/ category = 'vertical'
       + structures entry: add U to children of the T block entry; add U to blocks w/ no children
   * - publish U
     - + create structures entry, new one points to self as original (no pointer to draft course b/c it's not really a clone)
       + active_versions.publishedVersion points to new
       + block: add U, T, S, C pointers with each as respective child (regardless of other children they may have in draft), and their metadata
   * - add units V, W, X under T
     - + copy structures entry of the draftVersion, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + definitions: add entries V, W, X w/ category = 'vertical'
       + structures entry: add V, W, X to children of the T block entry; add V, W, X to blocks w/ no children
   * - edit U
     - + copy structures entry, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + definitions: copy entry U to U_2 w/ updates, U_2 points to U as original and previous
       + structures entry: replace U w/ U_2 in children of the T block entry; copy entry U in blocks to entry U_2 and remove U
   * - add subsection Z under S
     - + copy structures entry, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + definitions: add entry Z w/ category = 'sequential'
       + structures entry: add Z to children of the S block entry; add Z to blocks w/ no children
   * - edit S's name (metadata)
     - + copy structures entry, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + structures entry: update S's metadata w/ new name
   * - publish U, V
     - + copy publishedCourse structures entry, new one points to old published as original and previous
       + active_versions.publishedVersion points to new
       + block: update T to point to new U & V and not old U
       + Note: does not update S's name
   * - publish C
     - + copy publishedCourse structures entry, new one points to old published as original and previous
       + active_versions.publishedVersion points to new
       + blocks: note that C child S == published(S) but metadata !=, update metadata
       + note that S has unpublished children: publish them (recurse on this)
       + note that Z is unpublished: add pointer to blocks and children of S
       + note that W, X unpublished: add to blocks, add to children of T
   * - edit C metadata (e.g., graceperiod)
     - + copy draft structures entry, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + structures entry: update C's metadata
   * - add Y under Z
     - ...
   * - publish C's metadata change
     - + copy publishedCourse structures entry, new one points to old published as original and previous
       + active_versions.publishedVersion points to new
       + blocks: update C's metadata
       + Note: no copying of Y or any other changes to published
   * - move X under Z
     - + copy draft structures entry, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + structures entry: remove X from T's children and add to Z's
       + Note: making it persistently clear to the user that X still exists under T in the published version will be crucial
   * - delete W
     - + copy draft structures entry, new one points to old as original and previous
       + active_versions.draftVersion points to new
       + structures entry: remove W from T's children and remove W from blocks
       + Note: no actual deletion of W, just no longer reachable w/in the draft course, but still in published; so, need to keep user aware of that.
   * - publish Z
     - + Note: the interesting thing here is that X cannot occur under both Z and T, but the user's not publishing T. Here's where having a consistent definition of original may help. If the original of a new element == original of an existing, then it's an update?
       + copy publishedCourse entry...
       + definitions: add Y, copy/update Z, X if either have any data changes (they don't)
       + blocks: remove X from T's children and add to Z's, add Y to Z, add Y
   * - publish deletion of W
     - + copy publishedCourse entry...
       + structures entry: remove W from T's children and remove W from blocks

Conflict detection:

Need a scenario where 2 authors make edits to different parts of course, to parts while parents being moved, while parents being deleted, to same parts, ...

.. _http://www.mongodb.org/: http://www.mongodb.org/
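The first rows of the walkthrough (create course, add section S, add subsection T) can be sketched in memory as below. This is a rough illustration, not the implementation: the dictionaries stand in for the three collections, the function names and readable block ids are hypothetical, publishing is left out, and `original` is kept pointing at the first version, following the schema comment in the Representation section.

```python
import copy
import uuid

# Hypothetical in-memory stand-ins for the three collections.
active_versions = {}
structures = {}
definitions = {}


def new_guid() -> str:
    return uuid.uuid4().hex


def add_definition(category: str, user: str) -> str:
    guid = new_guid()
    definitions[guid] = {"category": category, "data": None,
                         "original": guid, "previous": None, "creator": user}
    return guid


def create_course(course_id: str, user: str, root_id: str = "C") -> str:
    struct_guid = new_guid()
    structures[struct_guid] = {
        "blocks": {root_id: {"children": [], "metadata": {},
                             "definition": add_definition("course", user),
                             "category": "course"}},
        "root": root_id,
        "original": struct_guid,   # original = self
        "previous": None,          # no previous
        "creator": user,
    }
    active_versions[course_id] = {"versions": {"draft": struct_guid},
                                  "creator": user}
    return struct_guid


def add_block(course_id, parent_id, block_id, category, user) -> str:
    versions = active_versions[course_id]["versions"]
    old_guid = versions["draft"]
    # Copy the structures entry; the old entry is never modified.
    draft = copy.deepcopy(structures[old_guid])
    draft["previous"] = old_guid
    draft["creator"] = user
    draft["blocks"][block_id] = {"children": [], "metadata": {},
                                 "definition": add_definition(category, user),
                                 "category": category}
    draft["blocks"][parent_id]["children"].append(block_id)
    guid = new_guid()
    structures[guid] = draft
    versions["draft"] = guid       # draft version points to new
    return guid


v1 = create_course("org.course.run", "author")
v2 = add_block("org.course.run", "C", "S", "section", "author")
v3 = add_block("org.course.run", "S", "T", "sequential", "author")
```

Each edit leaves every earlier structure entry intact and reachable through `previous`, which is what makes the version tree possible; only the pointer in `active_versions` moves.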