Updating Your Old Log Links



June 16th, 2020 by Diana Coman

As previously announced, I'll soon decommission the old logger at logs.ossasepia.com and redirect that url to the saner, persistent logs. Since I don't want to break anyone's links though, I took the time to extract the full mapping of urls from old to new so that anyone interested can simply run a replace on any text containing the old links. If you get stuck with this and ask in the comments below, I can try to help if it's clear enough what the problem is but do note that *not everything can be fixed* nor do I promise to be able to help everyone.

For completeness and future reference, here are the steps I followed to extract this mapping and then to update the links on my own blog for #ossasepia and #eulora chans:

  1. Dumped numbers and dates for all loglines from old bot's database into file1
  2. Dumped publication date, log date and database title for all log articles from the blog's database into file2
  3. Processed file2 above to obtain the full url matched with corresponding log date for all log articles
  4. Merged file1 and file2 on the log date [1] so that each logline is matched with the full url of the article that contains it
  5. Using the output of the above, I ran directly a script that generates the needed sql query [2] to replace the old link with the new (a rough sketch of these steps follows this list); for your convenience, I also dumped to a separate file the exact mapping of old link next to new link for all lines, so that all you have to do is find the old link in there and take from the same line its corresponding new link: oldnewloglinks.tar.gz.
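
For anyone who'd rather see the shape of it than just the description, here's a minimal sketch of roughly these steps in shell. Note that every database, table and column name in it is invented, step 3 is only a named placeholder (the url construction is specific to my blog) and the last step builds the mapping file rather than the sql directly, so treat it as an outline to adapt, not as something to copy/paste:

  # 1. logline ids and dates from the old bot's database (all names here are invented)
  mysql -N -u dbuser -p logsdb -e \
    "select date(logged_at), id from loglines where channel='ossasepia';" > file1.txt

  # 2. publication date and title for all log articles from the blog's database
  #    (the log date then comes out of the title)
  mysql -N -u dbuser -p blogdb -e \
    "select post_date, post_title from blog_posts where post_title like '%Logs for%';" > file2_raw.txt

  # 3. build "full_url log_date" pairs out of the raw dump, with the log date as field 2;
  #    the url/slug construction is specific to the blog, so it stays a named placeholder here
  build_urls_and_dates < file2_raw.txt > file2.txt

  # 4. merge on the log date: field 1 of file1.txt, field 2 of file2.txt (footnote 1)
  join -1 1 -2 2 <(sort -k1 file1.txt) <(sort -k2 file2.txt) > merged.txt

  # 5. one "old_link new_link" pair per logline, i.e. the format of the files in
  #    oldnewloglinks.tar.gz; generating the sql updates from it is sketched further below
  awk '{ print "logs.ossasepia.com/log/ossasepia/" $1 "#" $2, $3 "#" $2 }' merged.txt > oldnewlinks.txt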

The above archive covers log lines up to the end of April 2020 (as the new logs were already up by then) and includes both #ossasepia and #eulora logs, as those are published as articles on my blog. For logs of other chans (e.g. #trilema), the mapping rules might be different. For loglines from May 2020 onwards, there's only one article for each month (see the logs category), so you can easily get the appropriate new link for those too, if you need them (the line number is used as anchor on purpose, so that there is direct correspondence there, to make this sort of update easier). In principle, to the extent that other loggers are/were in sync with mine - and this is something you can spend way too much time on, to check! - you can use the above to update those links too: simply replace logs.ossasepia.com with the other logger's url in the old links and the same mapping applies, though you'll have a hell of a time finding and then correcting the errors caused by even the smallest point where they fell out of sync.
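
To make the "replace" part concrete, here's a sketch of turning such a mapping file into the sql updates; the table and column names are the ones from footnote 2 and the user/database names are invented, so adjust them all to your own setup (the hostname swap for another logger is shown only as an illustration):

  # turn each "old_link new_link" line of the mapping into one sql update;
  # yourblog_posts / post_content are the names from footnote 2, adjust to your own schema
  awk -v q="'" '{ printf "update yourblog_posts set post_content=replace(post_content, %s%s%s, %s%s%s);\n", q, $1, q, q, $2, q }' \
    oldnewlinks.txt > updates.sql

  # for another logger that stayed in sync, swap the hostname in the old links first, e.g.:
  #   sed 's/^logs\.ossasepia\.com/logs.examplelogger.com/' oldnewlinks.txt > otherlogger.txt

  # back up first (see the next paragraph!), then apply:
  mysql -u dbuser -p yourblogdb < updates.sql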

Just in case you haven't yet messed up anything important enough to have already gotten into the habit of backups - do remember to make a backup of whatever it is you plan to change *before* rushing to change it!
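
For a mysql-backed blog, that can be as simple as dumping the one table you're about to touch, e.g. (again with invented user/database names and the table name from footnote 2):

  mysqldump -u dbuser -p yourblogdb yourblog_posts > yourblog_posts-backup-$(date +%F).sql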


  1. join -1 1 -2 2 <(sort -k1 file1.txt) <(sort -k2 file2.txt)  # merge on the log date: field 1 of file1.txt, field 2 of file2.txt

  2. E.g. update yourblog_posts set post_content=replace(post_content, 'ossasepia.com/2020/04/30/ossasepia-logs-for-30-Apr-2020#1025541', 'ossasepia.com/2020/04/30/ossasepia-logs-for-30-Apr-2020#1025541'); 


7 Responses to “Updating Your Old Log Links”

  1. Jacob Welsh says:

    I'm a bit embarrassed to still not have even tried the recipe nearly 18 months later, especially seeing I was the one asking for it, and managed the *more* involved part of adapting and deploying my own blog-logger.

    2. Dumped publication date, log date and database title for all log articles from the blog's database into file2

    How did you extract the log dates from the articles? Regex processing on the title or something more clever? I realize I don't need this step to apply the results, just curious.

    In footnote 2, I assume the "from" string (the second argument to "replace()") is meant to be 'logs.ossasepia.com/log/ossasepia/2020-04-30#1025541' ie the old rather than new URL.

    Though I don't have any #eulora links on Fixpoint, in checking the provided eulorapub.txt (from oldnewloglinks.tar.gz) I noticed the "from" (first column) is using a /ossasepia prefix rather than /eulora. Assuming that's the only problem with it, as appears to be the case, I might publish a corrected version because I've also simplified the mappings based on the observation that the ID numbers are identical and thus not even really relevant to the substitution.

  2. Diana Coman says:

    How did you extract the log dates from the articles? Regex processing on the title or something more clever? I realize I don't need this step to apply the results, just curious.

    Tbh I don't even recall exactly and the scripts are archived already seeing how it's more than a year ago and there weren't otherwise any questions on it until now. Most likely using mysql's string functions directly in the select though so I guess it counts as regex processing on the title, nothing more clever than that. (I tend to avoid clever if not really needed - possibly due at least in part to too much experience with other people's "cleverness" blowing up in my face.)
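
    If it helps, it was probably something along these lines, assuming titles that end with the log date as the published ones do (the table and column names here are made up, since the originals are long archived):

      mysql -N -u dbuser -p blogdb -e \
        "select str_to_date(substring_index(post_title, ' ', -3), '%d %b %Y') as log_date, post_title
         from blog_posts where post_title like '%Logs for%';"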

    Though I don't have any #eulora links on Fixpoint, in checking the provided eulorapub.txt (from oldnewloglinks.tar.gz) I noticed the "from" (first column) is using a /ossasepia prefix rather than /eulora. I might publish a corrected version because I've also simplified the mappings based on the observation that the ID numbers are identical and thus not even really relevant to the substitution.

    Right you are, thanks for pointing that out: it should indeed have the /eulora prefix but otherwise the links are correct, as I just checked again and everything matches. I must have somehow mangled the prefixes when I made the archive itself, as they were clearly correct when running the update for my blog.

    The line numbers are indeed identical for #eulora but I think at the time I preferred to keep the format of the 2 files as similar as possible, without removing any potential information even if it is redundant in there, since any user can easily take out anything they want but would not really be able to add something back in with much confidence.

    In principle, if one relies on the line numbers truly being the same across all the loggers that were used, one would then change the prefix to similarly update the links to #o from other loggers (or at least from the rest that were based on the same flask-python code), but I can't say that I really looked into that.

    The slightly more involved problem remaining would be, at any rate, also updating the #trilema links to the canonical pages on trilema.com, because there the correspondence of links is not straightforward and there was no similar archive to this one published at any time. It can be obtained with some scripts though; I don't see why it couldn't be done.

  3. Diana Coman says:

    In footnote 2, I assume the "from" string (the second argument to "replace()") is meant to be 'logs.ossasepia.com/log/ossasepia/2020-04-30#1025541' ie the old rather than new URL.

    Yes, of course, the footnote is meant simply as a pointer to update + replace, not as a working copy/paste example of replacing anything.
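
    Spelled out though, a working version of it would rather be something like (with dbuser and yourblogdb standing in for your own):

      mysql -u dbuser -p yourblogdb -e \
        "update yourblog_posts set post_content=replace(post_content,
           'logs.ossasepia.com/log/ossasepia/2020-04-30#1025541',
           'ossasepia.com/2020/04/30/ossasepia-logs-for-30-Apr-2020#1025541');"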

  4. Jacob Welsh says:

    Perhaps "more structured" is closer to what I meant than "more clever", in the sense of more sql-like than perl-like. Basically wondering whether there was something handy there that I didn't know about, as indeed that "join" command was.

    For other loggers with out-of-sync IDs I guess you'd have to get their raw logs and match them up by message content and context, producing the fully expanded mapping; possibly easier to just search the db and fix them manually if they're not too numerous.

    The slightly more involved problem remaining would be, at any rate, also updating the #trilema links to the canonical pages on trilema.com

    More involved but perhaps still interesting, at minimum because there's more of them, at least in the Fixpoint sample, not to mention in the old logs themselves if there's anything to be done about that.

    because there the correspondence of links is not straightforward and there was no similar archive to this one published at any time. It can be obtained with some scripts though; I don't see why it couldn't be done.

    The barbaric approach coming to mind is 1) collect all the html-rendered log pages; 2) unpack them at least into ID and formatted content; 3) assume a relatively small number of jumps in the ID offset between those and the historical loggers; 4) find lines that match exactly and uniquely within a day-or-two range to seed some known offsets; 5) expand outward from these according to some heuristic, perhaps "fill holes smaller than X lines where the offset is the same on both ends"; 6) see what's left and perhaps fill in manually. The fuzziness comes about because I don't think the html transformation is unambiguously reversible.
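
    For step 4 in particular, something along these lines might do as a first pass (file formats entirely made up: one id<TAB>content file per day and per side):

      # old.txt from the historical logger, new.txt from the trilema rendering, same day,
      # both as "id<TAB>content" lines; exact-content matches vote on the id offset
      join -t $'\t' -1 2 -2 2 <(sort -t $'\t' -k 2 old.txt) <(sort -t $'\t' -k 2 new.txt) \
        | awk -F '\t' '{ print $3 - $2 }' | sort -n | uniq -c | sort -rn | head
      # the dominant difference is the candidate offset for the day; the leftovers are either
      # false matches or the holes for the step 5 heuristic to fill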

    It's sounding a lot like a "diff" implementation, actually.

    Or do you see a simpler way? Sure would be easier with a raw archive matching the IDs now on trilema.

  5. [...] Read. Wrap my head around the logger-link problem; publish my thoughts on what the issue is, and [...]

  6. [...] to the #ossasepia channel logs formerly served by ossabot and its flask of pythons, based on the pointers provided by Diana. Since I took the time to work it out, I'll expand on those pointers to show my full process in [...]
