diff --git a/README.md b/README.md
index 77ad2d2..5ccc3f9 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,43 @@
 # CryoLens
 
-CryoLens Script Suite
-
-Measures how “chaotic” each dataset is based on how many different extensions it contains.
-
-Scripts detect datasets with too many file types, weak extension dominance, or mixed‑purpose content.
-
-Generated summaries give an LLM a compact understanding of each dataset’s identity so it can propose reorganizations.
\ No newline at end of file
+## GlacierLens Script Suite — Overview
+
+The GlacierLens suite is a collection of tools designed to help an LLM understand the structure, purpose, and internal logic of your GlacierEdge archive. Each script contributes a different layer of insight, allowing the system to reason about how your datasets are organized today and how they should be organized in the future.
+
+## Dataset Semantic Summarizer
+
+The Dataset Semantic Summarizer creates a short narrative profile for each dataset.
+It describes how many files the dataset contains, how large it is, what kinds of categories dominate it, and how those categories shape the dataset’s identity.
+
+By examining timestamps, hash coverage, and representative file paths, the summarizer helps the LLM infer whether a dataset is primarily a photo collection, a document archive, a code repository, or something else entirely.
+This script gives the LLM a clear sense of what each dataset is fundamentally about.
+
+## Extension Entropy Analyzer
+
+The Extension Entropy Analyzer evaluates how coherent or chaotic each dataset is by measuring how many different file extensions it contains.
+A dataset with only a handful of extensions tends to have a clear purpose, while one with dozens or even hundreds of extensions is usually a mixed‑purpose dump.
+The analyzer assigns an entropy score and explains what that score means, helping the LLM identify datasets that need cleanup or splitting.
+
+## Top‑50 Extension Profiler
+
+The Top‑50 Extension Profiler examines the most common extensions in each dataset and describes how they are used.
+For each extension, it reports how many files use it, how large those files are, and which categories dominate that extension.
+This allows the LLM to see that a .jpg in one dataset may represent faces, while a .jpg in another dataset may represent scanned documents or sensitive content.
+The profiler gives a detailed, sentence‑level explanation of how extensions behave differently across datasets.
+
+## Category–Extension Cross‑Mapper
+
+The Category–Extension Cross‑Mapper reveals how categories and extensions interact within each dataset.
+It shows which extensions are associated with photos, documents, archives, source code, or sensitive material, and it highlights mismatches such as source code appearing in a media dataset or tax documents appearing in a photo collection.
+This mapping helps the LLM understand the semantic fingerprint of each extension and detect files that are out of place.
+
+## Anomaly Detector
+
+The Anomaly Detector scans for files that do not belong in their dataset based on category, extension, or path patterns.
+It identifies misplaced tax files, source code in media folders, adult content outside secure datasets, and oversized archives in working directories.
+These anomalies become strong signals for reorganization and help the LLM focus on structural problems that need attention.
+
+## Purpose Inference Engine
+The Purpose Inference Engine synthesizes all available metadata to infer the intended role of each dataset.
+It determines whether a dataset is meant for photos, documents, backups, sensitive material, source code, or long‑term archives.
+By expressing these conclusions in natural language, the engine gives the LLM a clear understanding of what each dataset is supposed to be, which is essential for proposing a cleaner, more logical folder structure.
\ No newline at end of file
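
The Extension Entropy Analyzer section of the new README says the analyzer "assigns an entropy score" but does not say how that score is computed. One plausible reading is sketched below, assuming the score is the Shannon entropy (in bits) of a dataset's extension distribution, reported with the distinct-extension count and the share held by the most common extension. The function name `extension_entropy` and the sample paths are hypothetical and do not come from the repository.

```python
import math
from collections import Counter
from pathlib import PurePosixPath


def extension_entropy(paths):
    """Return (distinct_extensions, dominance, entropy_bits) for a list of file paths.

    Hypothetical helper: the README does not specify the real scoring formula.
    """
    # Lowercase the final suffix; files without a suffix are grouped under "".
    counts = Counter(PurePosixPath(p).suffix.lower() for p in paths)
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0, 0.0
    # Shannon entropy in bits: 0.0 when a single extension covers every file,
    # log2(k) when k extensions are equally common.
    entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
    # Dominance: fraction of files that use the most common extension.
    dominance = max(counts.values()) / total
    return len(counts), dominance, entropy


photos = ["2021/a.jpg", "2021/b.JPG", "2022/c.jpg", "2022/d.jpg"]
mixed = ["a.jpg", "b.pdf", "c.py", "d.zip"]
print(extension_entropy(photos))  # (1, 1.0, 0.0)
print(extension_entropy(mixed))   # (4, 0.25, 2.0)
```

Under this assumption, a single-purpose photo dataset scores 0 bits, while four equally common extensions score 2 bits. Any threshold for flagging a dataset as a "mixed-purpose dump" would still have to be chosen by the suite's authors.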