.. post:: 2010-11-16 21:01:00 Using Haystack to index non-database content ============================================ Over on ReadTheDocs, I wanted to build `search `_ around the documentation that we're hosting. I chose `Haystack `_ and `Solr `_ for this, because it's the best way to do search in Django these days. However, I've only ever used Haystack to index content that is in the database. I thought about trying to add all the rendered HTML from the documentation into the database, but that was a non-starter. I ended up adding a ImportedFile model to the database, which would contain the metadata for the HTML file: :: #!python class ImportedFile(models.Model): project = models.ForeignKey(Project, related_name='imported_files') name = models.CharField(max_length=255) slug = models.SlugField() path = models.CharField(max_length=255) md5 = models.CharField(max_length=255) This allows me to link the SearchIndex in haystack to a model. Then the interesting part is in the Haystack SearchIndex, where I override the prepare\_text method, allowing me to read the data in from the filesystem instead of from the database. :: #!python class ImportedFileIndex(SearchIndex): text = CharField(document=True) author = CharField(model_attr='project__user') project = CharField(model_attr='project__name') title = CharField(model_attr='name') def prepare_text(self, obj): full_path = obj.project.full_html_path to_read = os.path.join(full_path, obj.path.lstrip('/')) try: content = codecs.open(to_read, encoding="utf-8", mode='r').read() return content except IOError: print "%s not found" % full_path site.register(ImportedFile, ImportedFileIndex) This means that I don't have to bloat my database with all my rendered HTML, but have the full HTML stored in Solr which works for querying.