The Problem With Document References and How Knowledge Management Fails Us
The other day a colleague called and asked about the best way to find a concept within an enterprise document repository.
As we talked, it became clear that “enterprise document repository” was a bit of a stretch. What she meant was finding the concept across the entire enterprise, regardless of where it was stored.
The target concept consisted of a set of management principles. In the current incarnation, the enterprise touted seven principles. After working with a number of consultants, it had honed its principles to five. The goal was to find all references to the seven principles and replace those references with the five.
I informed my colleague that there was no way to accomplish that task completely, even with the most sophisticated knowledge management system. I shared the following reasoning:
- Structured documents do not structure around concepts; they structure around document organization, such as headings, tables, and other elements. This type of structure adds no value in a discovery task like this.
- Search would likely find many, but not all, documents, for many reasons, including inaccurate references (such as a reference to leadership principles rather than management principles), colloquialization, partial references (such as “in management principle number 7” or “our principles”), and various technical issues like the precision and recall associated with the search algorithm. Also, because modern search engines typically rank by relevance, intranet pages with demonstrated use and inbound links may well appear, but documents that live on their own gain nothing from the social signals that the search engine accounts for.
- While most documents in formal repositories use tags, it is likely that the number of documents that reference the concept will be relatively small compared to the number of times those formal documents are referenced. Further, tags require their own management. The idea of “management principles” may not be an assigned tag, and if not, the tags will not contribute to the discoverability of the target documents.
- Not all documents that reference the concept will be discoverable, such as downloaded copies, copies stored on removable media, copies e-mailed outside of the organization, or copies stored on servers not part of the enterprise index.
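The search limitation in the list above can be illustrated with a minimal Python sketch. The sample documents, file names, and search patterns are all hypothetical, invented for illustration. The point is only that each anticipated phrasing improves recall while an unanticipated one still escapes.

```python
import re

# Hypothetical sample documents; names and text are invented for illustration.
documents = {
    "handbook.txt": "Our seven management principles guide every decision.",
    "memo.txt": "As stated in management principle number 7, we value candor.",
    "slides.txt": "The leadership principles apply to all teams.",
    "notes.txt": "Remember the big seven when planning.",
}

# Phrasings an editor might anticipate in advance (assumed, not exhaustive).
patterns = [
    re.compile(r"management principles?", re.IGNORECASE),
    re.compile(r"leadership principles?", re.IGNORECASE),
]


def find_references(docs, pats):
    """Return the sorted names of documents matching at least one pattern."""
    return sorted(
        name for name, text in docs.items()
        if any(p.search(text) for p in pats)
    )


found = find_references(documents, patterns)
missed = sorted(set(documents) - set(found))
print("found:", found)
print("missed:", missed)
```

Here `notes.txt` refers to the principles colloquially and is never found, no matter how well the search engine performs on the patterns it was given.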
Document references: The reference problem
A reference problem quickly follows the discovery problem. Just finding the concept is not enough. The documents that reference the concept may reference it in several places and in different ways. A concept may be referenced in the text, and it may be referenced in a link (as a URL embedded in text that may not directly reference the content of the link). The concept may also be referenced in a footnote or endnote or perhaps in an illustration or table. While highly curated documents may use the concept’s “full official name,” less formal content is just as likely to refer to it in some other way.
The ambiguity of references makes it impossible to discover all the instances through search. The search set will likely prove incomplete because no one maintained a record of all the ways the concept was referenced; some documents will therefore escape discovery because they reference the concept in a unique way.
Practically, tools such as internal web metrics that count document hits based on keyword searches will discover the majority of the documents that people actually look for and reference; they will not surface documents that exist but are seldom clicked on and are perhaps not referenced elsewhere. Again, practically, those documents may not matter, but they will still exist and could be discovered in the future, offering a less-than-accurate view to readers.
A secondary issue comes in references to the references, such as documents that speak to particular principles, like three through five. Four and five in that document, however, no longer apply, making the document mostly irrelevant, even if the discussion about item three proves brilliant. The document would need to be rewritten to reflect the new principles if it were to retain any value.
Further, the secondary issue also includes which principles are assigned to which numbers. Are the new five the first five of the old seven, or are they a different five? If they are different, any document that goes beyond referencing the principles in total as a distinctive object will need to be rewritten. All references become invalid because the object, despite having the same name, now represents a different object.
Revisiting Hypertext
Although the problem of managing internal references has been known for decades, no deployed enterprise system adequately addresses it. Most knowledge management and document management systems deal with documents as objects, sometimes as a collection of markup structures, but rarely deconstruct them to the most primitive level of sentences, paragraphs, and concepts. Documents built using Hypertext presume well-formed content that leverages object reference across documents.
Ted Nelson, the inventor of Hypertext, designed Hypertext systems to implement bi-directional links, not the unidirectional links that we know from the World Wide Web. That means in a well-implemented system, a reference object, like the management principles, would be able to be queried for all the documents in which it is referenced. Hypertext adherents refer to the content most of us work with today as Lump Files, meaning that they are lumps of content together that can’t be parsed by systems to expose their links, and other documents can only link to them in total, not as components.
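The difference that bi-directional links make can be sketched in a few lines of Python. This is a toy illustration of the general idea, not Nelson’s design or any Xanadu implementation; the `LinkIndex` class and the document and concept names are hypothetical.

```python
from collections import defaultdict


class LinkIndex:
    """Toy bi-directional link store: each link is recorded in both directions."""

    def __init__(self):
        self._outbound = defaultdict(set)  # document -> concepts it references
        self._inbound = defaultdict(set)   # concept -> documents referencing it

    def link(self, document, concept):
        """Record that a document references a concept."""
        self._outbound[document].add(concept)
        self._inbound[concept].add(document)

    def references_from(self, document):
        """Forward lookup, the only direction a one-way web link supports."""
        return sorted(self._outbound.get(document, ()))

    def referenced_by(self, concept):
        """Reverse lookup: every document that references the concept."""
        return sorted(self._inbound.get(concept, ()))


index = LinkIndex()
index.link("onboarding-guide", "management-principles")
index.link("annual-report", "management-principles")
index.link("annual-report", "code-of-conduct")

print(index.referenced_by("management-principles"))
# ['annual-report', 'onboarding-guide']
```

The reverse lookup only works because every reference was registered when it was created. Content that never passed through such an index remains as invisible as before.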
The use of Hypertext, in its most complete form, however, never took off. Some initial Hypertext ideas persist, like those on the Project Xanadu site (see also its Wikipedia entry). After a number of moves and acquisitions and continued research, the idea now languishes on edge websites that keep the concept alive without moving it forward.
Transcopyright
Hypertext was devised as a universal publishing system. Because of that, Nelson’s implementations require a form of copyright that recognizes references to even small portions of content, so that each portion retains its copyright and its owner can perhaps even be paid when it is read. Nelson and his colleagues refer to this as transcopyright. The system is designed to manage attribution and links between references and their original source.
There are systems, such as TheBrain, that offer an object-oriented construct that does include bi-directional links, but it requires constructing the document entirely inside TheBrain. TheBrain does accommodate documents, but placing them inside TheBrain turns the container into metadata. Because of the navigation constructs in TheBrain, storing documents within a topic container will likely make them more discoverable because that container exists only once within TheBrain, but it does not solve the problem of text, spreadsheets, presentations, and other documents with embedded references that cannot be easily teased from their structures.
Some purpose-built systems do manage content at the component level. Proposal management systems, for example, compile each proposal from the latest versions of products, services, schedules, and other elements. They suit a company with a standard set of offers that must be configured uniquely for each proposal but that consist, for the most part, of reusable components.
While such a system would make it easy to modify something like the management principles in one place, the change would apply only to documents created in the future; it could not be applied retroactively to existing proposals.
Knowledge and content at this level should be managed much like a bill of materials. A capacitor has a part number. It can be used in a number of assemblies. A manufacturing system can easily be queried to show all of the assemblies that require that part number. The part number, the abstract representation of the part, includes lifecycle management, meaning that if it is replaced by a different component with the same functional profile, a new part number can be referenced, and all bills of material will now point to the new part, not the old part.
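The bill-of-materials analogy can be made concrete with a short Python sketch. The `PartCatalog` class, its method names, and the part numbers are hypothetical and do not reflect any real manufacturing system; the sketch shows only the two operations described above, a where-used query and a part replacement that updates every affected assembly.

```python
from dataclasses import dataclass, field


@dataclass
class PartCatalog:
    """Toy bill-of-materials store (illustrative only)."""

    boms: dict = field(default_factory=dict)           # assembly -> set of part numbers
    superseded_by: dict = field(default_factory=dict)  # old part number -> new part number

    def add_assembly(self, assembly, part_numbers):
        self.boms[assembly] = set(part_numbers)

    def where_used(self, part_number):
        """Return every assembly whose bill of materials includes the part."""
        return sorted(a for a, parts in self.boms.items() if part_number in parts)

    def supersede(self, old, new):
        """Replace a part everywhere it is used and return the affected assemblies."""
        self.superseded_by[old] = new
        affected = self.where_used(old)
        for assembly in affected:
            self.boms[assembly].discard(old)
            self.boms[assembly].add(new)
        return affected


catalog = PartCatalog()
catalog.add_assembly("power-supply", ["CAP-100", "RES-220"])
catalog.add_assembly("amplifier", ["CAP-100", "IC-555"])
catalog.add_assembly("chassis", ["SCR-004"])

print(catalog.where_used("CAP-100"))            # ['amplifier', 'power-supply']
updated = catalog.supersede("CAP-100", "CAP-101")
print(updated)                                  # ['amplifier', 'power-supply']
print(catalog.where_used("CAP-101"))            # ['amplifier', 'power-supply']
```

If “management principles” were a part number, replacing seven principles with five would be a single `supersede` call. Enterprise documents offer no equivalent identifier to query.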
Unlike content, however, manufacturing systems don’t include commentary or other references that belie precision. Larger engineering systems that capture auxiliary content about parts, their performance in assemblies, quality, and other factors fall into the same traps as content management systems. Further, most of those systems, along with other structured systems like Customer Relationship Management (CRM) systems, associate content back to the original record, be it a part number or a customer number. While the underlying content may be difficult to repurpose or revise if a change is made, at least its context is clear.
How knowledge management fails us
Knowledge management purports to make knowledge more discoverable, but the levels of abstraction often sit too high above the knowledge. In the management principles example, a shift in codified knowledge (which statements constitute the principles) means that most content related to the principles requires rewriting once discovered. And for all the reasons stated above, there is no systematic guarantee that any non-exhaustive examination of available content will retrieve all references to the principles.
Despite decades of work on tags, taxonomies, ontologies, document structures, indexing, pattern recognition and other techniques and technologies, the complexity of content—and the simple fact that most content isn’t written to be managed—become a considerable burden to those charged with the curation and refitting of ideas deeply embedded in that content. Unfortunately, none of knowledge management’s promises can be fulfilled when working at the detail level of most enterprise content repositories.
This example of content change management also demonstrates the ongoing issues related to content that moves from a managed space into a non-managed one. In today’s IT environment, it is far easier to share a source document, or at minimum to reference it via a URL (even if some readers of the reference lack authorization to open it), than to e-mail a copy and exacerbate the proliferation of unmanaged content.
So, what’s the answer?
As much as organizations want to rely on automation to slog through the chaos that is document creation, storage, and management, the best answer, for critical documents, is curation. The problems listed above will persist, but at least the foundational documents, those referenced by the organization’s policies and practices, onboarding material, and marketing content, will be discoverable. That will prove a heavy lift for most, however, without the discipline and systems associated with programs like spacecraft manufacturing; our more mundane documents, even important ones, never arrive in systems that store their components as objects and keep track of where those components get referenced.
If this becomes a serious issue for organizations going forward, they will need to once again invoke Nelson’s vision of Xanadu…a reference to a Coleridge poem that, perhaps fittingly, was never completed.