This function anonymizes or redacts sensitive information from various R objects including character vectors, factors, data frames, and lists. It uses pattern matching to find and replace sensitive content, with options for targeted anonymization based on variable names or classes and warnings about approximate matches.
Usage
anon(
  x,
  pattern_list = list(),
  default_replacement = getOption("anon.default_replacement", default = "[REDACTED]"),
  check_approximate = getOption("anon.check_approximate", default = FALSE),
  max_distance = 2,
  df_variable_names = NULL,
  df_classes = NULL,
  check_names = TRUE,
  check_labels = TRUE,
  nlp_auto = getOption("anon.nlp_auto"),
  .self = FALSE,
  .pattern_replacements = NULL,
  .compiled = NULL
)
Arguments
- x
The object to anonymize. Can be a character vector, factor, data frame, or list.
- pattern_list
A list of patterns to search for and replace. Can include:
  - Named elements, where each name is the replacement value and each value is one or more patterns to match
  - Unnamed elements, where one or more patterns are replaced with default_replacement
This parameter is combined with the global option getOption("anon.pattern_list").
- default_replacement
The replacement value used when no specific replacement is provided. Default is getOption("anon.default_replacement", default = "[REDACTED]").
- check_approximate
Logical indicating whether to check for approximate matches using string distance. Default is getOption("anon.check_approximate", default = FALSE).
- max_distance
Maximum string distance for approximate matching when check_approximate is TRUE. Default is 2.
- df_variable_names
For data frames, a character vector or named list specifying which variable names should be anonymized:
  - Unnamed elements: variables are replaced with default_replacement
  - Named elements: variable names are keys; each value can be either a replacement value or a function
This parameter is combined with the global option getOption("anon.df_variable_names").
- df_classes
For data frames, a character vector or named list specifying which variable classes should be anonymized:
  - Unnamed elements: variables with matching classes are replaced with default_replacement
  - Named elements: class names are keys; each value can be either a replacement value or a function
This parameter is combined with the global option getOption("anon.df_classes").
- check_names
Logical indicating whether to anonymize object names (column names, row names, list names). Default is TRUE.
- check_labels
Logical indicating whether to anonymize labels (attributes). Default is TRUE.
- nlp_auto
List of logical values with names corresponding to entity names. Can be generated with nlp_auto() and can be set as the anon.nlp_auto global option. This argument overrides the global option.
- .self
Logical for internal use only. Used in recursive calls. Default is FALSE. When TRUE, warnings are collected as attributes instead of being issued immediately, global options are ignored, and only explicitly provided parameters are used.
- .pattern_replacements
List for internal use only. Pre-computed pattern replacement pairs passed down during recursive calls to avoid recomputing them for each column or list element. Default is NULL, which triggers normal computation.
- .compiled
List for internal use only. Pre-compiled pattern groups containing grouped regex objects, digest tokens, and fixed token matchers. Passed down during recursive calls to avoid recompiling patterns. Default is NULL.
Value
An object of class anon_context with the same structure as x but with sensitive
information replaced. If approximate matches are found and .self is FALSE, warnings are issued.
If .self is TRUE, warnings are attached as an attribute.
Details
anon() operates recursively on nested structures. For data frames:
- Individual columns are processed based on their content
- Entire columns can be replaced if specified in df_variable_names or df_classes
- Column names, row names, and labels are anonymized when enabled
Pattern matching is case-insensitive. When check_approximate is enabled,
anon() will warn about remaining potential matches that are similar but not exact.
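As a rough illustration of the two behaviours described above, the sketch below uses base R stand-ins: gsub() with ignore.case = TRUE for case-insensitive replacement, and agrepl() to flag remaining near-matches within a fixed edit distance. These calls are assumptions chosen for illustration; the package's own matching code is not shown here and may work differently.

```r
# Illustrative only: base R stand-ins, not anon()'s internal code.
text <- c("John Smith called", "Jon Smith replied", "Nothing here")
pattern <- "john smith"

# Case-insensitive replacement of exact matches.
redacted <- gsub(pattern, "[REDACTED]", text, ignore.case = TRUE)

# Flag strings that still contain something within 2 edits of the pattern.
near <- agrepl(pattern, redacted, max.distance = 2L, ignore.case = TRUE)

# These are the kind of values a warning about approximate matches would cover.
redacted[near]
#> [1] "Jon Smith replied"
```

Here "Jon Smith" survives the exact replacement but sits one edit away from the pattern, so it is the sort of leftover value that check_approximate is meant to surface.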
Replacement functions can be provided in df_variable_names and df_classes as:
- R functions that take the column/variable as input
- Formula notation (e.g., ~ .x + rnorm(length(.x), mean = 1))
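To make the formula notation more concrete, here is a minimal sketch of how a one-sided formula that refers to .x could be turned into an ordinary function in base R. The helper as_replacement_fn() is hypothetical and is not part of the package; how anon() actually evaluates formulas is not documented here.

```r
# Hypothetical helper (not part of the package): accept either a function
# or a one-sided formula using `.x`, and return a function of one argument.
as_replacement_fn <- function(f) {
  if (is.function(f)) {
    return(f)
  }
  stopifnot(inherits(f, "formula"), length(f) == 2L)
  body_expr <- f[[2L]]   # right-hand side of the one-sided formula
  env <- environment(f)  # where the formula was created
  function(.x) eval(body_expr, list(.x = .x), env)
}

number_people <- as_replacement_fn(~ paste("Person", seq_along(.x)))
number_people(c("Alice", "Bob"))
#> [1] "Person 1" "Person 2"
```

Under this reading, the formula and the equivalent function(.x) paste("Person", seq_along(.x)) behave the same; the formula is simply a shorter way to write it.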
The returned object has class anon_context which allows it to be combined with other
anonymized objects using c().
Global Options
The following global options affect function behavior:
- anon.default_replacement
Default replacement text (default: "[REDACTED]").
- anon.pattern_list
Global patterns to combine with (after) the pattern_list parameter.
- anon.df_variable_names
Global variable name specifications to combine with (after) the df_variable_names parameter.
- anon.df_classes
Global class specifications to combine with (after) the df_classes parameter.
- anon.nlp_auto
List of logical values indicating which NLP entity types should be automatically anonymized. Use nlp_auto() to generate this list. Override the option by setting the nlp_auto argument.
- anon.nlp_default_replacements
Default NLP replacement labels. Use nlp_default_replacements() to generate this list.
- anon.example_values_n
Default example_values_n used by anon_data_summary() and anon_report().
- anon.example_rows
Default example_rows specification used by anon_data_summary() and anon_report(). Use anon_example_rows() to generate this value.
See anon_options() for a central helper that lists and sets all supported
anon.* options.
To set global options:
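A minimal sketch using base R's options(), with option names taken from the list above. The values are illustrative only, and saving and restoring the previous settings is a general R habit rather than something the package requires.

```r
# Option names are those documented above; the values are examples only.
old <- options(
  anon.default_replacement = "[HIDDEN]",
  anon.pattern_list = list("EMAIL" = "@\\S+"),
  anon.df_variable_names = "name"
)

getOption("anon.default_replacement")
#> [1] "[HIDDEN]"

# Restore the previous settings when done.
options(old)
```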
Examples
# Basic string anonymization
text <- c("John Smith", "jane.doe@email.com", "Call 555-1234")
anon(text, pattern_list = c("John Smith", "@\\S+", "\\d{3}-\\d{4}"))
#> [1] "[REDACTED]" "jane.doe[REDACTED]" "Call [REDACTED]"
# Using named patterns for specific replacements
anon(text, pattern_list = list("PERSON" = "John Smith",
                               "EMAIL" = "@\\S+",
                               "PHONE" = "\\d{3}-\\d{4}"))
#> [1] "PERSON" "jane.doeEMAIL" "Call PHONE"
# Data frame anonymization
df <- data.frame(
  name = c("Alice", "Bob"),
  email = c("alice@test.com", "bob@test.com"),
  score = c(95, 87)
)
# Anonymize specific columns by name
anon(df, df_variable_names = c("name", "email"))
#>         name      email score
#> 1 [REDACTED] [REDACTED]    95
#> 2 [REDACTED] [REDACTED]    87
# Anonymize columns by class with custom replacements
anon(df, df_classes = list("character" = "HIDDEN"))
#>     name  email score
#> 1 HIDDEN HIDDEN    95
#> 2 HIDDEN HIDDEN    87
# Using functions for dynamic replacement
anon(df, df_variable_names = list("name" = ~ paste("Person", seq_along(.x))))
#>       name          email score
#> 1 Person 1 alice@test.com    95
#> 2 Person 2   bob@test.com    87
anon_df <- df |>
  anon(
    df_variable_names = list(
      "name" = ~ paste("Person", seq_along(.x)),
      "email"
    )
  )
# Using global options
options(anon.pattern_list = list("EMAIL" = "@\\S+"))
options(anon.df_variable_names = "name")
anon(df) # Will anonymize emails and names using global settings
#>         name      email score
#> 1 [REDACTED] aliceEMAIL    95
#> 2 [REDACTED]   bobEMAIL    87
# Combine anonymized objects
anon_summary <- anon_data_summary(list(df = df))
combined <- c(anon_df, anon_summary)
combined
#> === ANONYMIZED DATA CONTEXT ===
#>
#>       name      email score
#> 1 Person 1 [REDACTED]    95
#> 2 Person 2 [REDACTED]    87
#>
#> Environment Data Summary
#> ========================
#>
#>   total_objects data_frames other_objects total_memory
#> 1             1           1             0         1272
#>
#> Data Frames:
#> ------------
#>   name       type n_rows n_cols memory_size
#> 1   df data.frame      2      3      1.2 Kb
#>
#>
#> Variable Details (df):
#>
#> --------------------
#>   variable data_type n_distinct n_missing n_total pct_missing label
#> 1     name character          2         0       2           0  <NA>
#> 2    email character          2         0       2           0  <NA>
#> 3    score   numeric          2         0       2           0  <NA>
#>
#>
#>
