How I generated Python test data with LLMs
View the series
- See how I used the OpenAI API to generate audio and images
- See why structured outputs also need hard guardrails
- Grab ready-to-use pytest snippets mocking the OpenAI API
- Add context to OpenAI API error messages for easier debugging
- Learn how to log OpenAI API calls in JSON format
- Learn how I parametrized tests and generated test data with GPT-5.2
- Cut down your Python import time with this 6-step method
While I was writing my phrasebook CLI ↗ (which generates translations, audio, and images with the OpenAI API), I needed a test that ensures the generated translations are saved correctly on disk.
There are several cases to test. For instance:
- One record in the original file, no generated file.
- One record in the original file, the generated file already exists, with a translation generated in a previous run.
- Etc.
After writing a test for the first case, I thought: hey, if I parametrize it so that input and output data are passed as parameters to a generic test, I could ask an LLM to generate the various sets of parameters that cover all the cases.
This is what I did, and it worked well with almost no guidance to the LLM (GPT-5.2), except for the first prompt.
Here's the prompt I used:
I'll provide you below with a Python pytest test already parametrized that tests a Typer app. For now I only have one set of parameters. I want you to provide appropriate parameters for the following cases (don't touch the logic of the test nor the code tested, I just want the data). If some case is not clear, do not assume, just report to me what's not clear:

- `phrasebook_content` 1 record / enriched file 3 records (corresponding to 1 record in `phrasebook_content` from a previous run) (`english` not in `phrasebook_content`)
  should create 3 new enriched records and keep the original 3
- `phrasebook_content` 1 record / enriched file 3 records (corresponding to 1 record in `phrasebook_content` from a previous run) (`english` same as in `phrasebook_content`)
  should not create new enriched records
  "Skip..." in the logs
- `phrasebook_content` 3 records / no enriched file
  should create 9 enriched records
- `phrasebook_content` 3 records / enriched file 6 records (corresponding to 2 records in `phrasebook_content` from some previous run) (`english` not in `phrasebook_content`)
  should create 9 enriched records and keep the original 6
- `phrasebook_content` 3 records / enriched file 6 records (corresponding to 2 records in `phrasebook_content` from some previous run) (first record `english` same as in `phrasebook_content`)
  should create only 6 new enriched records corresponding to 2 records in `phrasebook_content`
  "Skip..." in the logs

Note that if `enriched_content` is not `None`, it should be a string.
Along with this prompt, I provided GPT-5.2 with the parametrized test and the CLI entry point.
@pytest.mark.parametrize(
"phrasebook_content,translations,enriched_content,enriched_expected,logs",
[
# 1 record to be enriched + enriched_phrasebook.tsv doesn't exist
(
"date\tfrench\tenglish\n2025-12-15\tfr1\ten1",
[[("fr2", "en2"), ("fr3", "en3")]],
None,
[
(1, "fr1", "en1", pd.NA, "audio/1.mp3", "img/1.png", "2025-12-15"),
(2, "fr2", "en2", 1, "audio/2.mp3", "img/2.png", "2025-12-15"),
(3, "fr3", "en3", 1, "audio/3.mp3", "img/3.png", "2025-12-15"),
],
["Record has been enriched: ('2025-12-15', 'fr1', 'en1')"],
)
],
ids=["1_record_no_enriched_file"],
)
def test_app_records_saved(
tmp_path_factory: pytest.TempPathFactory,
monkeypatch: pytest.MonkeyPatch,
caplog: pytest.LogCaptureFixture,
phrasebook_content,
translations,
enriched_content,
enriched_expected,
logs,
):
pass
@app.command()
def run(file: Path) -> None:
"""Enrich French to English phrasebooks with OpenAI API."""
pass
It generated some promising data. It also told me what was missing to complete the task correctly:
Paste the exact contents (or at least the header + one example row) of an `enriched_phrasebook.tsv` file produced by your app, or paste `cli.ENRICHED_COLUMNS` and confirm whether `read_enriched()` reads with `header=0` and expects those exact column names.
I provided GPT-5.2 with `ENRICHED_COLUMNS`. It gave me a patch for the generated data. Then I asked for the whole test data, not just a patch. And we were done after I asked for the `ids` to name each test case.
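For reference, `ENRICHED_COLUMNS` at that stage presumably looked something like the list below. I'm reconstructing it from the enriched-file headers in the test data further down, so take the exact definition as an approximation rather than the real source.

# Reconstructed from the enriched-file headers in the test data below;
# the real constant is defined in the phrasebook CLI source.
ENRICHED_COLUMNS = [
    "id",
    "french",
    "english",
    "generated_from",
    "audio_path",
    "img_path",
    "date",
]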
The data it generated was good. There was just one small error, and it was an easy fix. The `ids` it picked were maybe not the names I'd have chosen.
Overall, asking only for data, and not for logic, made working with GPT-5.2 smooth and pleasant.
Are you also using LLMs this way?
That's all I have for today! Talk soon 👋
Parametrizing pytest tests
This strategy works only if you can parametrize your tests so the data is separated from the test logic.
Here's an example of parametrization with pytest ↗ tests in Python.
In Python, we can parametrize pytest tests using `@pytest.mark.parametrize(...)`.
For instance, to test the function
def is_greater(foo, bar):
    return foo > bar
we can use the following test
def test_is_greater():
    foo = 2
    bar = 1
assert is_greater(foo, bar)
and we can parametrize it like this:
import pytest
@pytest.mark.parametrize("foo,bar", [(1, 2)], ids=["one_two"])
def test_is_greater(foo, bar):
assert is_greater(foo, bar)
This way, adding another test case only requires changing the data we pass to `@pytest.mark.parametrize`, not the test code itself.
For instance, to test `is_greater(4, 3)`, we just have to add the tuple `(4, 3)` like this:
import pytest
@pytest.mark.parametrize("foo,bar", [(1, 2), (3, 4)], ids=["one_two", "three_four"])
def test_is_greater(foo, bar):
assert is_greater(foo, bar)
And now, instead of filling in this test data by hand, you can ask an LLM to do it for you.
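One nice side effect of passing `ids`: when you run pytest in verbose mode, each parametrized case shows up under its own name, so you can see at a glance which LLM-generated data set fails. Assuming the test lives in a file named test_is_greater.py, the output looks roughly like this (shape, not verbatim):

$ uv run pytest -v test_is_greater.py
test_is_greater.py::test_is_greater[two_one] PASSED
test_is_greater.py::test_is_greater[four_three] PASSED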
Simplified version of phrasebook-fr-to-en cli
Here's a skeleton of the phrasebook CLI ↗, simplified with no calls to the OpenAI API, that focuses on the part I asked GPT-5.2 to generate test data for.
I'm sharing it here in case you want to play with smaller code, while still getting the benefit of seeing what I was trying to test. And maybe you can apply it in your own context.
It's a typer ↗ CLI app. It uses pandas ↗ to read and write CSV data. It takes one CSV file with two columns, `foo` and `bar`, as input, and it generates another CSV file named `generated.csv`. That's it.
The interesting part is in the tests. There you can see a "real" test parametrization example.
After initializing the project and adding the dependencies:
$ uv init
$ uv add typer pandas pytest
You can run it like this:
$ uv run cli.py mydata.csv
assuming the `mydata.csv` file is shaped like this:
foo,bar
foo_1,bar_1
foo_2,bar_2
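After a run, `generated.csv` should hold six rows: three per input record, each drawn at random (with replacement) from the three canned records in `generate_data`. So the exact rows vary between runs; one possible output:

foo,bar
foo_b,bar_b
foo_a,bar_a
foo_b,bar_b
foo_c,bar_c
foo_c,bar_c
foo_a,bar_a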
You can run the tests like this:
$ uv run pytest test_cli.py
# cli.py
import typer
import pandas as pd
from pathlib import Path
from typing import Any
import random
app = typer.Typer(pretty_exceptions_enable=False)
COLUMNS = ["foo", "bar"]
def read_input_file(input_file: Path) -> pd.DataFrame:
return pd.read_csv(input_file, dtype="string")
def read_generated_file(generated_file: Path) -> pd.DataFrame:
if not generated_file.exists():
return pd.DataFrame(columns=pd.Index(COLUMNS), dtype="string")
return pd.read_csv(generated_file, dtype="string")
def generate_data(original_record: tuple[str, str]) -> list[dict[str, str]]:
    # Simulating the randomness of an LLM.
    # In the `phrasebook-fr-to-en` cli, we would call the OpenAI API with an
    # input formatted with the `original_record` variable
    random_records = [
        {"foo": "foo_a", "bar": "bar_a"},
        {"foo": "foo_b", "bar": "bar_b"},
        {"foo": "foo_c", "bar": "bar_c"},
    ]
    return [random.choice(random_records) for _ in range(3)]
def save_data(
new_records: list[dict[str, Any]], generated_df: pd.DataFrame, generated_file: Path
):
new_df = pd.DataFrame(new_records, columns=pd.Index(COLUMNS), dtype="string")
updated_df = (
pd.concat([generated_df, new_df], ignore_index=True)
if not generated_df.empty
else new_df
)
updated_df.to_csv(generated_file, index=False)
return updated_df
@app.command()
def run(input_file: Path):
generated_file = input_file.parent / "generated.csv"
original_df = read_input_file(input_file)
generated_df = read_generated_file(generated_file)
for original_record in original_df.itertuples(index=False, name=None):
new_records = generate_data(original_record)
generated_df = save_data(new_records, generated_df, generated_file)
if __name__ == "__main__":
app()
# test_cli.py
import cli
from pathlib import Path
import pandas as pd
from unittest.mock import Mock
import pytest
from typer.testing import CliRunner
runner = CliRunner()
def test_cli_1(tmp_path: Path, monkeypatch: pytest.MonkeyPatch):
input_file = tmp_path / "mydata.csv"
generated_file = tmp_path / "generated.csv"
input_file_data = "foo,bar\nfoo_1,bar_1"
generated_file_data = None
generated_data = [
{"foo": "foo_a", "bar": "bar_a"},
{"foo": "foo_b", "bar": "bar_b"},
{"foo": "foo_c", "bar": "bar_c"},
]
generated_data_expected = [
("foo_a", "bar_a"),
("foo_b", "bar_b"),
("foo_c", "bar_c"),
]
input_file.write_text(input_file_data)
if generated_file_data:
generated_file.write_text(generated_file_data)
monkeypatch.setattr(cli, "generate_data", Mock(return_value=generated_data))
result = runner.invoke(cli.app, [str(input_file)], catch_exceptions=False)
assert result.exit_code == 0, result.output
generated_df_expected = pd.DataFrame(
generated_data_expected,
columns=pd.Index(cli.COLUMNS),
dtype="string",
)
generated_df = pd.read_csv(generated_file, dtype="string")
pd.testing.assert_frame_equal(generated_df, generated_df_expected, check_dtype=True)
@pytest.mark.parametrize(
"input_file_data,generated_file_data,generated_data,generated_data_expected",
[
# 1 record in `input_file` and `generated_file` doesn't exist
(
"foo,bar\nfoo_1,bar_1",
None,
[
{"foo": "foo_a", "bar": "bar_a"},
{"foo": "foo_b", "bar": "bar_b"},
{"foo": "foo_c", "bar": "bar_c"},
],
[
("foo_a", "bar_a"),
("foo_b", "bar_b"),
("foo_c", "bar_c"),
],
),
],
)
def test_cli_2(
tmp_path: Path,
monkeypatch: pytest.MonkeyPatch,
input_file_data,
generated_file_data,
generated_data,
generated_data_expected,
):
input_file = tmp_path / "mydata.csv"
generated_file = tmp_path / "generated.csv"
input_file.write_text(input_file_data)
if generated_file_data:
generated_file.write_text(generated_file_data)
monkeypatch.setattr(cli, "generate_data", Mock(return_value=generated_data))
result = runner.invoke(cli.app, [str(input_file)], catch_exceptions=False)
assert result.exit_code == 0, result.output
generated_df_expected = pd.DataFrame(
generated_data_expected,
columns=pd.Index(cli.COLUMNS),
dtype="string",
)
generated_df = pd.read_csv(generated_file, dtype="string")
pd.testing.assert_frame_equal(generated_df, generated_df_expected, check_dtype=True)
@pytest.mark.parametrize(
"input_file_data,generated_file_data,generated_data,generated_data_expected",
[
# 1 record in `input_file` and `generated_file` doesn't exist
(
"foo,bar\nfoo_1,bar_1",
None,
[
{"foo": "foo_a", "bar": "bar_a"},
{"foo": "foo_b", "bar": "bar_b"},
{"foo": "foo_c", "bar": "bar_c"},
],
[
("foo_a", "bar_a"),
("foo_b", "bar_b"),
("foo_c", "bar_c"),
],
),
# 1 record in `input_file` and 3 records in `generated_file`
(
"foo,bar\nfoo_1,bar_1",
"foo,bar\nfoo_a,bar_a\nfoo_b,bar_b\nfoo_c,bar_c",
[
{"foo": "foo_d", "bar": "bar_d"},
{"foo": "foo_e", "bar": "bar_e"},
{"foo": "foo_f", "bar": "bar_f"},
],
[
("foo_a", "bar_a"),
("foo_b", "bar_b"),
("foo_c", "bar_c"),
("foo_d", "bar_d"),
("foo_e", "bar_e"),
("foo_f", "bar_f"),
],
),
],
ids=[
"1_record_in_input_no_generated_file",
"1_record_in_input_3_records_in_generated_file",
],
)
def test_cli_3(
tmp_path: Path,
monkeypatch: pytest.MonkeyPatch,
input_file_data,
generated_file_data,
generated_data,
generated_data_expected,
):
input_file = tmp_path / "mydata.csv"
generated_file = tmp_path / "generated.csv"
input_file.write_text(input_file_data)
if generated_file_data:
generated_file.write_text(generated_file_data)
monkeypatch.setattr(cli, "generate_data", Mock(return_value=generated_data))
result = runner.invoke(cli.app, [str(input_file)], catch_exceptions=False)
assert result.exit_code == 0, result.output
generated_df_expected = pd.DataFrame(
generated_data_expected,
columns=pd.Index(cli.COLUMNS),
dtype="string",
)
generated_df = pd.read_csv(generated_file, dtype="string")
pd.testing.assert_frame_equal(generated_df, generated_df_expected, check_dtype=True)
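One detail worth flagging before the real test: `test_cli_*` mocks `generate_data` with a single `return_value`, so every call returns the same records. The real test below instead passes `translations` as a `side_effect`. When a `Mock` gets an iterable `side_effect`, each call returns the next item, and a call past the end raises `StopIteration`. A minimal sketch:

from unittest.mock import Mock

# One item per expected call: the mock hands out the next item each time.
mock_generate = Mock(side_effect=[
    [("fr2", "en2"), ("fr3", "en3")],  # returned on the first call
    [("fr5", "en5"), ("fr6", "en6")],  # returned on the second call
])
assert mock_generate("record 1") == [("fr2", "en2"), ("fr3", "en3")]
assert mock_generate("record 2") == [("fr5", "en5"), ("fr6", "en6")]
# A third call would raise StopIteration: the list is exhausted.

This also means skipped phrasebook records consume no `side_effect` item, since the CLI checks `existing_english` before calling `generate_translations`.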
Parametrized test and CLI entry point provided to GPT-5.2
@pytest.mark.parametrize(
"phrasebook_content,translations,enriched_content,enriched_expected,logs",
[
# 1 record to be enriched + enriched_phrasebook.tsv doesn't exist
(
"date\tfrench\tenglish\n2025-12-15\tfr1\ten1",
[[("fr2", "en2"), ("fr3", "en3")]],
None,
[
(1, "fr1", "en1", pd.NA, "audio/1.mp3", "img/1.png", "2025-12-15"),
(2, "fr2", "en2", 1, "audio/2.mp3", "img/2.png", "2025-12-15"),
(3, "fr3", "en3", 1, "audio/3.mp3", "img/3.png", "2025-12-15"),
],
["Record has been enriched: ('2025-12-15', 'fr1', 'en1')"],
)
],
ids=["1_record_no_enriched_file"],
)
def test_app_records_saved(
tmp_path_factory: pytest.TempPathFactory,
monkeypatch: pytest.MonkeyPatch,
caplog: pytest.LogCaptureFixture,
phrasebook_content,
translations,
enriched_content,
enriched_expected,
logs,
):
caplog.set_level(logging.INFO, logger="phrasebook_fr_to_en.cli")
mock_generate_translations = Mock(side_effect=translations)
mock_generate_audio = Mock(return_value=None)
mock_generate_img = Mock(return_value=None)
monkeypatch.setattr(cli, "generate_translations", mock_generate_translations)
monkeypatch.setattr(cli, "generate_audio", mock_generate_audio)
monkeypatch.setattr(cli, "generate_img", mock_generate_img)
tmp_path = tmp_path_factory.mktemp("phrasebook")
phrasebook_path = tmp_path / "phrasebook.tsv"
phrasebook_path.write_text(phrasebook_content)
enriched_path = cli.enrich_path(phrasebook_path)
if enriched_content:
enriched_path.write_text(enriched_content)
result = runner.invoke(cli.app, [str(phrasebook_path)])
enriched_df = pd.read_csv(enriched_path, sep="\t", dtype="string")
# Match the dtypes produced by save_new_records
enriched_df["id"] = enriched_df["id"].astype("Int64")
enriched_df["generated_from"] = enriched_df["generated_from"].astype("Int64")
enriched_df_expected = pd.DataFrame(
enriched_expected,
columns=cli.ENRICHED_COLUMNS,
dtype="string",
)
enriched_df_expected["id"] = enriched_df_expected["id"].astype("Int64")
enriched_df_expected["generated_from"] = enriched_df_expected[
"generated_from"
].astype("Int64")
pd.testing.assert_frame_equal(enriched_df, enriched_df_expected, check_dtype=True)
assert result.exit_code == 0, result.output
for log in logs:
assert log in caplog.text
@app.command()
def run(
file: Annotated[
Path,
typer.Argument(
help=(
"Filename of the phrasebook to be enriched. "
"It must be a TSV format file (TAB separation) with the "
"header fields: date, french, english. For intance:\n\n\n\n"
"date french english\n\n"
"2025-12-15 J'aime l'eau. I like water.\n\n"
"2025-12-16 Il fait froid. It is cold.\n\n"
)
),
],
) -> None:
"""Enrich French to English phrasebooks with OpenAI API."""
setup_logging()
phrasebook_path = file.absolute()
enriched_path = enrich_path(phrasebook_path)
try:
phrasebook_df = read_phrasebook(phrasebook_path)
enriched_df = read_enriched(enriched_path)
except Exception as err:
logger.error(err)
raise typer.Exit(code=1) from err
existing_english: set[str] = (
set(enriched_df["english"].dropna().to_list())
if not enriched_df.empty
else set()
)
for record_original in phrasebook_df.itertuples(index=False, name=None):
_, _, english = record_original
if english in existing_english:
logger.info(f"Skip existing record: {record_original}")
continue
new_records = enrich_record(record_original, next_id(enriched_df))
if not new_records:
raise typer.Exit(code=1)
try:
enriched_df = save_new_records(enriched_df, new_records, enriched_path)
except Exception:
logger.exception(
f"Failed to save enriched records from record {record_original} in file {enriched_path}"
)
raise typer.Exit(code=1)
existing_english.add(english)
logger.info(f"Record has been enriched: {record_original} -> {enriched_path}")
Test data generated by GPT-5.2
[
# 1 record / enriched file 3 records (corresponding to 1 record from a previous run)
# english NOT in phrasebook_content
# should create 3 new enriched records and keep the original 3
(
"date\tfrench\tenglish\n2025-12-15\tfr_new\ten_new",
[[("fr_new2", "en_new2"), ("fr_new3", "en_new3")]],
(
"id\tfrench\tenglish\tgenerated_from\taudio_path\timg_path\tdate\n"
"1\tfr_old\ten_old\t\taudio/1.mp3\timg/1.png\t2025-12-01\n"
"2\tfr_old2\ten_old2\t1\taudio/2.mp3\timg/2.png\t2025-12-01\n"
"3\tfr_old3\ten_old3\t1\taudio/3.mp3\timg/3.png\t2025-12-01"
),
[
(1, "fr_old", "en_old", pd.NA, "audio/1.mp3", "img/1.png", "2025-12-01"),
(2, "fr_old2", "en_old2", 1, "audio/2.mp3", "img/2.png", "2025-12-01"),
(3, "fr_old3", "en_old3", 1, "audio/3.mp3", "img/3.png", "2025-12-01"),
(4, "fr_new", "en_new", pd.NA, "audio/4.mp3", "img/4.png", "2025-12-15"),
(5, "fr_new2", "en_new2", 4, "audio/5.mp3", "img/5.png", "2025-12-15"),
(6, "fr_new3", "en_new3", 4, "audio/6.mp3", "img/6.png", "2025-12-15"),
],
["Record has been enriched: ('2025-12-15', 'fr_new', 'en_new')"],
),
# 1 record / enriched file 3 records (corresponding to 1 record from a previous run)
# english SAME as in phrasebook_content
# should not create new enriched records
# "Skip..." in the logs
(
"date\tfrench\tenglish\n2025-12-15\tfr_whatever\ten_old",
[
# not used (skipped), but must exist as side_effect list item
[("unused_fr2", "unused_en2"), ("unused_fr3", "unused_en3")]
],
(
"id\tfrench\tenglish\tgenerated_from\taudio_path\timg_path\tdate\n"
"1\tfr_old\ten_old\t\taudio/1.mp3\timg/1.png\t2025-12-01\n"
"2\tfr_old2\ten_old2\t1\taudio/2.mp3\timg/2.png\t2025-12-01\n"
"3\tfr_old3\ten_old3\t1\taudio/3.mp3\timg/3.png\t2025-12-01"
),
[
(1, "fr_old", "en_old", pd.NA, "audio/1.mp3", "img/1.png", "2025-12-01"),
(2, "fr_old2", "en_old2", 1, "audio/2.mp3", "img/2.png", "2025-12-01"),
(3, "fr_old3", "en_old3", 1, "audio/3.mp3", "img/3.png", "2025-12-01"),
],
["Skip existing record: ('2025-12-15', 'fr_whatever', 'en_old')"],
),
# phrasebook_content 3 records / no enriched file
# should create 9 enriched records
(
"date\tfrench\tenglish\n"
"2025-12-15\tfr1\ten1\n"
"2025-12-16\tfr4\ten4\n"
"2025-12-17\tfr7\ten7",
[
[("fr2", "en2"), ("fr3", "en3")],
[("fr5", "en5"), ("fr6", "en6")],
[("fr8", "en8"), ("fr9", "en9")],
],
None,
[
(1, "fr1", "en1", pd.NA, "audio/1.mp3", "img/1.png", "2025-12-15"),
(2, "fr2", "en2", 1, "audio/2.mp3", "img/2.png", "2025-12-15"),
(3, "fr3", "en3", 1, "audio/3.mp3", "img/3.png", "2025-12-15"),
(4, "fr4", "en4", pd.NA, "audio/4.mp3", "img/4.png", "2025-12-16"),
(5, "fr5", "en5", 4, "audio/5.mp3", "img/5.png", "2025-12-16"),
(6, "fr6", "en6", 4, "audio/6.mp3", "img/6.png", "2025-12-16"),
(7, "fr7", "en7", pd.NA, "audio/7.mp3", "img/7.png", "2025-12-17"),
(8, "fr8", "en8", 7, "audio/8.mp3", "img/8.png", "2025-12-17"),
(9, "fr9", "en9", 7, "audio/9.mp3", "img/9.png", "2025-12-17"),
],
[
"Record has been enriched: ('2025-12-15', 'fr1', 'en1')",
"Record has been enriched: ('2025-12-16', 'fr4', 'en4')",
"Record has been enriched: ('2025-12-17', 'fr7', 'en7')",
],
),
# phrasebook_content 3 records / enriched file 6 records (corresponding to 2 records from a previous run)
# english NOT in phrasebook_content
# should create 9 enriched records and keep the original 6
(
"date\tfrench\tenglish\n"
"2025-12-15\tfr1\ten1\n"
"2025-12-16\tfr4\ten4\n"
"2025-12-17\tfr7\ten7",
[
[("fr2", "en2"), ("fr3", "en3")],
[("fr5", "en5"), ("fr6", "en6")],
[("fr8", "en8"), ("fr9", "en9")],
],
(
"id\tfrench\tenglish\tgenerated_from\taudio_path\timg_path\tdate\n"
"1\tfr_oldA\ten_oldA\t\taudio/1.mp3\timg/1.png\t2025-12-01\n"
"2\tfr_oldA2\ten_oldA2\t1\taudio/2.mp3\timg/2.png\t2025-12-01\n"
"3\tfr_oldA3\ten_oldA3\t1\taudio/3.mp3\timg/3.png\t2025-12-01\n"
"4\tfr_oldB\ten_oldB\t\taudio/4.mp3\timg/4.png\t2025-12-02\n"
"5\tfr_oldB2\ten_oldB2\t4\taudio/5.mp3\timg/5.png\t2025-12-02\n"
"6\tfr_oldB3\ten_oldB3\t4\taudio/6.mp3\timg/6.png\t2025-12-02"
),
[
(1, "fr_oldA", "en_oldA", pd.NA, "audio/1.mp3", "img/1.png", "2025-12-01"),
(2, "fr_oldA2", "en_oldA2", 1, "audio/2.mp3", "img/2.png", "2025-12-01"),
(3, "fr_oldA3", "en_oldA3", 1, "audio/3.mp3", "img/3.png", "2025-12-01"),
(4, "fr_oldB", "en_oldB", pd.NA, "audio/4.mp3", "img/4.png", "2025-12-02"),
(5, "fr_oldB2", "en_oldB2", 4, "audio/5.mp3", "img/5.png", "2025-12-02"),
(6, "fr_oldB3", "en_oldB3", 4, "audio/6.mp3", "img/6.png", "2025-12-02"),
(7, "fr1", "en1", pd.NA, "audio/7.mp3", "img/7.png", "2025-12-15"),
(8, "fr2", "en2", 7, "audio/8.mp3", "img/8.png", "2025-12-15"),
(9, "fr3", "en3", 7, "audio/9.mp3", "img/9.png", "2025-12-15"),
(10, "fr4", "en4", pd.NA, "audio/10.mp3", "img/10.png", "2025-12-16"),
(11, "fr5", "en5", 10, "audio/11.mp3", "img/11.png", "2025-12-16"),
(12, "fr6", "en6", 10, "audio/12.mp3", "img/12.png", "2025-12-16"),
(13, "fr7", "en7", pd.NA, "audio/13.mp3", "img/13.png", "2025-12-17"),
(14, "fr8", "en8", 13, "audio/14.mp3", "img/14.png", "2025-12-17"),
(15, "fr9", "en9", 13, "audio/15.mp3", "img/15.png", "2025-12-17"),
],
[
"Record has been enriched: ('2025-12-15', 'fr1', 'en1')",
"Record has been enriched: ('2025-12-16', 'fr4', 'en4')",
"Record has been enriched: ('2025-12-17', 'fr7', 'en7')",
],
),
# phrasebook_content 3 records / enriched file 6 records (corresponding to 2 records from a previous run)
# first record english SAME as in phrasebook_content
# should create only 6 new enriched records for the other 2 phrasebook records
# "Skip..." in the logs
(
"date\tfrench\tenglish\n"
"2025-12-15\tfr1\ten_oldA\n"
"2025-12-16\tfr4\ten4\n"
"2025-12-17\tfr7\ten7",
[
# not used (skipped), but must exist as side_effect list item
[("unused_fr2", "unused_en2"), ("unused_fr3", "unused_en3")],
[("fr5", "en5"), ("fr6", "en6")],
[("fr8", "en8"), ("fr9", "en9")],
],
(
"id\tfrench\tenglish\tgenerated_from\taudio_path\timg_path\tdate\n"
"1\tfr_oldA\ten_oldA\t\taudio/1.mp3\timg/1.png\t2025-12-01\n"
"2\tfr_oldA2\ten_oldA2\t1\taudio/2.mp3\timg/2.png\t2025-12-01\n"
"3\tfr_oldA3\ten_oldA3\t1\taudio/3.mp3\timg/3.png\t2025-12-01\n"
"4\tfr_oldB\ten_oldB\t\taudio/4.mp3\timg/4.png\t2025-12-02\n"
"5\tfr_oldB2\ten_oldB2\t4\taudio/5.mp3\timg/5.png\t2025-12-02\n"
"6\tfr_oldB3\ten_oldB3\t4\taudio/6.mp3\timg/6.png\t2025-12-02"
),
[
(1, "fr_oldA", "en_oldA", pd.NA, "audio/1.mp3", "img/1.png", "2025-12-01"),
(2, "fr_oldA2", "en_oldA2", 1, "audio/2.mp3", "img/2.png", "2025-12-01"),
(3, "fr_oldA3", "en_oldA3", 1, "audio/3.mp3", "img/3.png", "2025-12-01"),
(4, "fr_oldB", "en_oldB", pd.NA, "audio/4.mp3", "img/4.png", "2025-12-02"),
(5, "fr_oldB2", "en_oldB2", 4, "audio/5.mp3", "img/5.png", "2025-12-02"),
(6, "fr_oldB3", "en_oldB3", 4, "audio/6.mp3", "img/6.png", "2025-12-02"),
(7, "fr4", "en4", pd.NA, "audio/7.mp3", "img/7.png", "2025-12-16"),
(8, "fr5", "en5", 7, "audio/8.mp3", "img/8.png", "2025-12-16"),
(9, "fr6", "en6", 7, "audio/9.mp3", "img/9.png", "2025-12-16"),
(10, "fr7", "en7", pd.NA, "audio/10.mp3", "img/10.png", "2025-12-17"),
(11, "fr8", "en8", 10, "audio/11.mp3", "img/11.png", "2025-12-17"),
(12, "fr9", "en9", 10, "audio/12.mp3", "img/12.png", "2025-12-17"),
],
[
"Skip existing record: ('2025-12-15', 'fr1', 'en_oldA')",
"Record has been enriched: ('2025-12-16', 'fr4', 'en4')",
"Record has been enriched: ('2025-12-17', 'fr7', 'en7')",
],
),
]
Final test I kept
As the requirements changed after this test data was produced, the final test (which you can find here ↗, along with the other tests) ended up being this:
@pytest.mark.parametrize(
"phrasebook_content,translations,enriched_content,enriched_expected,logs",
[
# 1 record to be enriched + enriched_phrasebook.tsv doesn't exist
(
"date\tfrench\tenglish\n2025-12-15\tfr1\ten1",
[[("fr2", "en2"), ("fr3", "en3")]],
None,
[
("fr1", "en1", "[sound:phrasebook-fr-to-en-1.mp3]", "<img src=\"phrasebook-fr-to-en-1.png\">", pd.NA, 1, "phrasebook-fr-to-en-1.mp3", "phrasebook-fr-to-en-1.png", "2025-12-15"),
("fr2", "en2", "[sound:phrasebook-fr-to-en-2.mp3]", "<img src=\"phrasebook-fr-to-en-2.png\">", 1, 2, "phrasebook-fr-to-en-2.mp3", "phrasebook-fr-to-en-2.png", "2025-12-15"),
("fr3", "en3", "[sound:phrasebook-fr-to-en-3.mp3]", "<img src=\"phrasebook-fr-to-en-3.png\">", 1, 3, "phrasebook-fr-to-en-3.mp3", "phrasebook-fr-to-en-3.png", "2025-12-15"),
],
["Record has been enriched: ('2025-12-15', 'fr1', 'en1')"],
),
# 1 record / enriched file 3 records (corresponding to 1 record from a previous run)
# english NOT in phrasebook_content
# should create 3 new enriched records and keep the original 3
(
"date\tfrench\tenglish\n2025-12-15\tfr4\ten4",
[[("fr5", "en5"), ("fr6", "en6")]],
(
"french\tenglish\tanki_audio\tanki_img\tgenerated_from\tid\taudio_filename\timg_filename\tdate\n"
'fr1\ten1\t[sound:phrasebook-fr-to-en-1.mp3]\t"<img src=""phrasebook-fr-to-en-1.png"">"\t\t1\tphrasebook-fr-to-en-1.mp3\tphrasebook-fr-to-en-1.png\t2025-12-01\n'
'fr2\ten2\t[sound:phrasebook-fr-to-en-2.mp3]\t"<img src=""phrasebook-fr-to-en-2.png"">"\t1\t2\tphrasebook-fr-to-en-2.mp3\tphrasebook-fr-to-en-2.png\t2025-12-01\n'
'fr3\ten3\t[sound:phrasebook-fr-to-en-3.mp3]\t"<img src=""phrasebook-fr-to-en-3.png"">"\t1\t3\tphrasebook-fr-to-en-3.mp3\tphrasebook-fr-to-en-3.png\t2025-12-01'
),
[
("fr1", "en1", "[sound:phrasebook-fr-to-en-1.mp3]", "<img src=\"phrasebook-fr-to-en-1.png\">", pd.NA, 1, "phrasebook-fr-to-en-1.mp3", "phrasebook-fr-to-en-1.png", "2025-12-01",),
("fr2", "en2", "[sound:phrasebook-fr-to-en-2.mp3]", "<img src=\"phrasebook-fr-to-en-2.png\">", 1, 2, "phrasebook-fr-to-en-2.mp3", "phrasebook-fr-to-en-2.png", "2025-12-01"),
("fr3", "en3", "[sound:phrasebook-fr-to-en-3.mp3]", "<img src=\"phrasebook-fr-to-en-3.png\">", 1, 3, "phrasebook-fr-to-en-3.mp3", "phrasebook-fr-to-en-3.png", "2025-12-01"),
("fr4", "en4", "[sound:phrasebook-fr-to-en-4.mp3]", "<img src=\"phrasebook-fr-to-en-4.png\">", pd.NA, 4, "phrasebook-fr-to-en-4.mp3", "phrasebook-fr-to-en-4.png", "2025-12-15",),
("fr5", "en5", "[sound:phrasebook-fr-to-en-5.mp3]", "<img src=\"phrasebook-fr-to-en-5.png\">", 4, 5, "phrasebook-fr-to-en-5.mp3", "phrasebook-fr-to-en-5.png", "2025-12-15"),
("fr6", "en6", "[sound:phrasebook-fr-to-en-6.mp3]", "<img src=\"phrasebook-fr-to-en-6.png\">", 4, 6, "phrasebook-fr-to-en-6.mp3", "phrasebook-fr-to-en-6.png", "2025-12-15"),
],
["Record has been enriched: ('2025-12-15', 'fr4', 'en4')"],
),
# 1 record / enriched file 3 records (corresponding to 1 record from a previous run)
# 'en1' english field is present in both phrasebook_content and enriched_content
# should not create new enriched records
# "Skip..." in the logs
(
"date\tfrench\tenglish\n2025-12-15\tfr_whatever\ten1",
None, # No translations generated
(
"french\tenglish\tanki_audio\tanki_img\tgenerated_from\tid\taudio_filename\timg_filename\tdate\n"
'fr1\ten1\t[sound:phrasebook-fr-to-en-1.mp3]\t"<img src=""phrasebook-fr-to-en-1.png"">"\t\t1\tphrasebook-fr-to-en-1.mp3\tphrasebook-fr-to-en-1.png\t2025-12-01\n'
'fr2\ten2\t[sound:phrasebook-fr-to-en-2.mp3]\t"<img src=""phrasebook-fr-to-en-2.png"">"\t1\t2\tphrasebook-fr-to-en-2.mp3\tphrasebook-fr-to-en-2.png\t2025-12-01\n'
'fr3\ten3\t[sound:phrasebook-fr-to-en-3.mp3]\t"<img src=""phrasebook-fr-to-en-3.png"">"\t1\t3\tphrasebook-fr-to-en-3.mp3\tphrasebook-fr-to-en-3.png\t2025-12-01'
),
[
("fr1", "en1", "[sound:phrasebook-fr-to-en-1.mp3]", "<img src=\"phrasebook-fr-to-en-1.png\">", pd.NA, 1, "phrasebook-fr-to-en-1.mp3", "phrasebook-fr-to-en-1.png", "2025-12-01"),
("fr2", "en2", "[sound:phrasebook-fr-to-en-2.mp3]", "<img src=\"phrasebook-fr-to-en-2.png\">", 1, 2, "phrasebook-fr-to-en-2.mp3", "phrasebook-fr-to-en-2.png", "2025-12-01"),
("fr3", "en3", "[sound:phrasebook-fr-to-en-3.mp3]", "<img src=\"phrasebook-fr-to-en-3.png\">", 1, 3, "phrasebook-fr-to-en-3.mp3", "phrasebook-fr-to-en-3.png", "2025-12-01"),
],
["Skip existing record: ('2025-12-15', 'fr_whatever', 'en1')"],
),
# phrasebook_content 3 records / no enriched file
# should create 9 enriched records
(
"date\tfrench\tenglish\n"
"2025-12-15\tfr1\ten1\n"
"2025-12-16\tfr4\ten4\n"
"2025-12-17\tfr7\ten7",
[
[("fr2", "en2"), ("fr3", "en3")],
[("fr5", "en5"), ("fr6", "en6")],
[("fr8", "en8"), ("fr9", "en9")],
],
None,
[
("fr1", "en1", "[sound:phrasebook-fr-to-en-1.mp3]", "<img src=\"phrasebook-fr-to-en-1.png\">", pd.NA, 1, "phrasebook-fr-to-en-1.mp3", "phrasebook-fr-to-en-1.png", "2025-12-15"),
("fr2", "en2", "[sound:phrasebook-fr-to-en-2.mp3]", "<img src=\"phrasebook-fr-to-en-2.png\">", 1, 2, "phrasebook-fr-to-en-2.mp3", "phrasebook-fr-to-en-2.png", "2025-12-15"),
("fr3", "en3", "[sound:phrasebook-fr-to-en-3.mp3]", "<img src=\"phrasebook-fr-to-en-3.png\">", 1, 3, "phrasebook-fr-to-en-3.mp3", "phrasebook-fr-to-en-3.png", "2025-12-15"),
("fr4", "en4", "[sound:phrasebook-fr-to-en-4.mp3]", "<img src=\"phrasebook-fr-to-en-4.png\">", pd.NA, 4, "phrasebook-fr-to-en-4.mp3", "phrasebook-fr-to-en-4.png", "2025-12-16"),
("fr5", "en5", "[sound:phrasebook-fr-to-en-5.mp3]", "<img src=\"phrasebook-fr-to-en-5.png\">", 4, 5, "phrasebook-fr-to-en-5.mp3", "phrasebook-fr-to-en-5.png", "2025-12-16"),
("fr6", "en6", "[sound:phrasebook-fr-to-en-6.mp3]", "<img src=\"phrasebook-fr-to-en-6.png\">", 4, 6, "phrasebook-fr-to-en-6.mp3", "phrasebook-fr-to-en-6.png", "2025-12-16"),
("fr7", "en7", "[sound:phrasebook-fr-to-en-7.mp3]", "<img src=\"phrasebook-fr-to-en-7.png\">", pd.NA, 7, "phrasebook-fr-to-en-7.mp3", "phrasebook-fr-to-en-7.png", "2025-12-17"),
("fr8", "en8", "[sound:phrasebook-fr-to-en-8.mp3]", "<img src=\"phrasebook-fr-to-en-8.png\">", 7, 8, "phrasebook-fr-to-en-8.mp3", "phrasebook-fr-to-en-8.png", "2025-12-17"),
("fr9", "en9", "[sound:phrasebook-fr-to-en-9.mp3]", "<img src=\"phrasebook-fr-to-en-9.png\">", 7, 9, "phrasebook-fr-to-en-9.mp3", "phrasebook-fr-to-en-9.png", "2025-12-17"),
],
[
"Record has been enriched: ('2025-12-15', 'fr1', 'en1')",
"Record has been enriched: ('2025-12-16', 'fr4', 'en4')",
"Record has been enriched: ('2025-12-17', 'fr7', 'en7')",
],
),
# phrasebook_content 3 records / enriched file 6 records (corresponding to 2 records from a previous run)
# english NOT in phrasebook_content
# should create 9 enriched records and keep the original 6
(
"date\tfrench\tenglish\n"
"2025-12-15\tfr7\ten7\n"
"2025-12-16\tfr10\ten10\n"
"2025-12-17\tfr13\ten13",
[
[("fr8", "en8"), ("fr9", "en9")],
[("fr11", "en11"), ("fr12", "en12")],
[("fr14", "en14"), ("fr15", "en15")],
],
(
"french\tenglish\tanki_audio\tanki_img\tgenerated_from\tid\taudio_filename\timg_filename\tdate\n"
'fr1\ten1\t[sound:phrasebook-fr-to-en-1.mp3]\t"<img src=""phrasebook-fr-to-en-1.png"">"\t\t1\tphrasebook-fr-to-en-1.mp3\tphrasebook-fr-to-en-1.png\t2025-12-01\n'
'fr2\ten2\t[sound:phrasebook-fr-to-en-2.mp3]\t"<img src=""phrasebook-fr-to-en-2.png"">"\t1\t2\tphrasebook-fr-to-en-2.mp3\tphrasebook-fr-to-en-2.png\t2025-12-01\n'
'fr3\ten3\t[sound:phrasebook-fr-to-en-3.mp3]\t"<img src=""phrasebook-fr-to-en-3.png"">"\t1\t3\tphrasebook-fr-to-en-3.mp3\tphrasebook-fr-to-en-3.png\t2025-12-01\n'
'fr4\ten4\t[sound:phrasebook-fr-to-en-4.mp3]\t"<img src=""phrasebook-fr-to-en-4.png"">"\t\t4\tphrasebook-fr-to-en-4.mp3\tphrasebook-fr-to-en-4.png\t2025-12-02\n'
'fr5\ten5\t[sound:phrasebook-fr-to-en-5.mp3]\t"<img src=""phrasebook-fr-to-en-5.png"">"\t4\t5\tphrasebook-fr-to-en-5.mp3\tphrasebook-fr-to-en-5.png\t2025-12-02\n'
'fr6\ten6\t[sound:phrasebook-fr-to-en-6.mp3]\t"<img src=""phrasebook-fr-to-en-6.png"">"\t4\t6\tphrasebook-fr-to-en-6.mp3\tphrasebook-fr-to-en-6.png\t2025-12-02'
),
[
("fr1", "en1", "[sound:phrasebook-fr-to-en-1.mp3]", "<img src=\"phrasebook-fr-to-en-1.png\">", pd.NA, 1, "phrasebook-fr-to-en-1.mp3", "phrasebook-fr-to-en-1.png", "2025-12-01",),
("fr2", "en2", "[sound:phrasebook-fr-to-en-2.mp3]", "<img src=\"phrasebook-fr-to-en-2.png\">", 1, 2, "phrasebook-fr-to-en-2.mp3", "phrasebook-fr-to-en-2.png", "2025-12-01",),
("fr3", "en3", "[sound:phrasebook-fr-to-en-3.mp3]", "<img src=\"phrasebook-fr-to-en-3.png\">", 1, 3, "phrasebook-fr-to-en-3.mp3", "phrasebook-fr-to-en-3.png", "2025-12-01",),
("fr4", "en4", "[sound:phrasebook-fr-to-en-4.mp3]", "<img src=\"phrasebook-fr-to-en-4.png\">", pd.NA, 4, "phrasebook-fr-to-en-4.mp3", "phrasebook-fr-to-en-4.png", "2025-12-02",),
("fr5", "en5", "[sound:phrasebook-fr-to-en-5.mp3]", "<img src=\"phrasebook-fr-to-en-5.png\">", 4, 5, "phrasebook-fr-to-en-5.mp3", "phrasebook-fr-to-en-5.png", "2025-12-02",),
("fr6", "en6", "[sound:phrasebook-fr-to-en-6.mp3]", "<img src=\"phrasebook-fr-to-en-6.png\">", 4, 6, "phrasebook-fr-to-en-6.mp3", "phrasebook-fr-to-en-6.png", "2025-12-02",),
("fr7", "en7", "[sound:phrasebook-fr-to-en-7.mp3]", "<img src=\"phrasebook-fr-to-en-7.png\">", pd.NA, 7, "phrasebook-fr-to-en-7.mp3", "phrasebook-fr-to-en-7.png", "2025-12-15"),
("fr8", "en8", "[sound:phrasebook-fr-to-en-8.mp3]", "<img src=\"phrasebook-fr-to-en-8.png\">", 7, 8, "phrasebook-fr-to-en-8.mp3", "phrasebook-fr-to-en-8.png", "2025-12-15"),
("fr9", "en9", "[sound:phrasebook-fr-to-en-9.mp3]", "<img src=\"phrasebook-fr-to-en-9.png\">", 7, 9, "phrasebook-fr-to-en-9.mp3", "phrasebook-fr-to-en-9.png", "2025-12-15"),
("fr10", "en10", "[sound:phrasebook-fr-to-en-10.mp3]", "<img src=\"phrasebook-fr-to-en-10.png\">", pd.NA, 10, "phrasebook-fr-to-en-10.mp3", "phrasebook-fr-to-en-10.png", "2025-12-16"),
("fr11", "en11", "[sound:phrasebook-fr-to-en-11.mp3]", "<img src=\"phrasebook-fr-to-en-11.png\">", 10, 11, "phrasebook-fr-to-en-11.mp3", "phrasebook-fr-to-en-11.png", "2025-12-16"),
("fr12", "en12", "[sound:phrasebook-fr-to-en-12.mp3]", "<img src=\"phrasebook-fr-to-en-12.png\">", 10, 12, "phrasebook-fr-to-en-12.mp3", "phrasebook-fr-to-en-12.png", "2025-12-16"),
("fr13", "en13", "[sound:phrasebook-fr-to-en-13.mp3]", "<img src=\"phrasebook-fr-to-en-13.png\">", pd.NA, 13, "phrasebook-fr-to-en-13.mp3", "phrasebook-fr-to-en-13.png", "2025-12-17"),
("fr14", "en14", "[sound:phrasebook-fr-to-en-14.mp3]", "<img src=\"phrasebook-fr-to-en-14.png\">", 13, 14, "phrasebook-fr-to-en-14.mp3", "phrasebook-fr-to-en-14.png", "2025-12-17"),
("fr15", "en15", "[sound:phrasebook-fr-to-en-15.mp3]", "<img src=\"phrasebook-fr-to-en-15.png\">", 13, 15, "phrasebook-fr-to-en-15.mp3", "phrasebook-fr-to-en-15.png", "2025-12-17"),
],
[
"Record has been enriched: ('2025-12-15', 'fr7', 'en7')",
"Record has been enriched: ('2025-12-16', 'fr10', 'en10')",
"Record has been enriched: ('2025-12-17', 'fr13', 'en13')",
],
),
# phrasebook_content 3 records / enriched file 6 records (corresponding to 2 records from a previous run)
# 'en1' english field is present in both phrasebook_content and enriched_content
# should create only 6 new enriched records for the other 2 phrasebook records
# "Skip..." in the logs
(
"date\tfrench\tenglish\n"
"2025-12-15\tfr_whatever\ten1\n"
"2025-12-16\tfr7\ten7\n"
"2025-12-17\tfr10\ten10",
[
[("fr8", "en8"), ("fr9", "en9")],
[("fr11", "en11"), ("fr12", "en12")],
],
(
"french\tenglish\tanki_audio\tanki_img\tgenerated_from\tid\taudio_filename\timg_filename\tdate\n"
'fr1\ten1\t[sound:phrasebook-fr-to-en-1.mp3]\t"<img src=""phrasebook-fr-to-en-1.png"">"\t\t1\tphrasebook-fr-to-en-1.mp3\tphrasebook-fr-to-en-1.png\t2025-12-01\n'
'fr2\ten2\t[sound:phrasebook-fr-to-en-2.mp3]\t"<img src=""phrasebook-fr-to-en-2.png"">"\t1\t2\tphrasebook-fr-to-en-2.mp3\tphrasebook-fr-to-en-2.png\t2025-12-01\n'
'fr3\ten3\t[sound:phrasebook-fr-to-en-3.mp3]\t"<img src=""phrasebook-fr-to-en-3.png"">"\t1\t3\tphrasebook-fr-to-en-3.mp3\tphrasebook-fr-to-en-3.png\t2025-12-01\n'
'fr4\ten4\t[sound:phrasebook-fr-to-en-4.mp3]\t"<img src=""phrasebook-fr-to-en-4.png"">"\t\t4\tphrasebook-fr-to-en-4.mp3\tphrasebook-fr-to-en-4.png\t2025-12-02\n'
'fr5\ten5\t[sound:phrasebook-fr-to-en-5.mp3]\t"<img src=""phrasebook-fr-to-en-5.png"">"\t4\t5\tphrasebook-fr-to-en-5.mp3\tphrasebook-fr-to-en-5.png\t2025-12-02\n'
'fr6\ten6\t[sound:phrasebook-fr-to-en-6.mp3]\t"<img src=""phrasebook-fr-to-en-6.png"">"\t4\t6\tphrasebook-fr-to-en-6.mp3\tphrasebook-fr-to-en-6.png\t2025-12-02'
),
[
("fr1", "en1", "[sound:phrasebook-fr-to-en-1.mp3]", "<img src=\"phrasebook-fr-to-en-1.png\">", pd.NA, 1, "phrasebook-fr-to-en-1.mp3", "phrasebook-fr-to-en-1.png", "2025-12-01",),
("fr2", "en2", "[sound:phrasebook-fr-to-en-2.mp3]", "<img src=\"phrasebook-fr-to-en-2.png\">", 1, 2, "phrasebook-fr-to-en-2.mp3", "phrasebook-fr-to-en-2.png", "2025-12-01",),
("fr3", "en3", "[sound:phrasebook-fr-to-en-3.mp3]", "<img src=\"phrasebook-fr-to-en-3.png\">", 1, 3, "phrasebook-fr-to-en-3.mp3", "phrasebook-fr-to-en-3.png", "2025-12-01",),
("fr4", "en4", "[sound:phrasebook-fr-to-en-4.mp3]", "<img src=\"phrasebook-fr-to-en-4.png\">", pd.NA, 4, "phrasebook-fr-to-en-4.mp3", "phrasebook-fr-to-en-4.png", "2025-12-02",),
("fr5", "en5", "[sound:phrasebook-fr-to-en-5.mp3]", "<img src=\"phrasebook-fr-to-en-5.png\">", 4, 5, "phrasebook-fr-to-en-5.mp3", "phrasebook-fr-to-en-5.png", "2025-12-02",),
("fr6", "en6", "[sound:phrasebook-fr-to-en-6.mp3]", "<img src=\"phrasebook-fr-to-en-6.png\">", 4, 6, "phrasebook-fr-to-en-6.mp3", "phrasebook-fr-to-en-6.png", "2025-12-02",),
("fr7", "en7", "[sound:phrasebook-fr-to-en-7.mp3]", "<img src=\"phrasebook-fr-to-en-7.png\">", pd.NA, 7, "phrasebook-fr-to-en-7.mp3", "phrasebook-fr-to-en-7.png", "2025-12-16"),
("fr8", "en8", "[sound:phrasebook-fr-to-en-8.mp3]", "<img src=\"phrasebook-fr-to-en-8.png\">", 7, 8, "phrasebook-fr-to-en-8.mp3", "phrasebook-fr-to-en-8.png", "2025-12-16"),
("fr9", "en9", "[sound:phrasebook-fr-to-en-9.mp3]", "<img src=\"phrasebook-fr-to-en-9.png\">", 7, 9, "phrasebook-fr-to-en-9.mp3", "phrasebook-fr-to-en-9.png", "2025-12-16"),
("fr10", "en10", "[sound:phrasebook-fr-to-en-10.mp3]", "<img src=\"phrasebook-fr-to-en-10.png\">", pd.NA, 10, "phrasebook-fr-to-en-10.mp3", "phrasebook-fr-to-en-10.png", "2025-12-17"),
("fr11", "en11", "[sound:phrasebook-fr-to-en-11.mp3]", "<img src=\"phrasebook-fr-to-en-11.png\">", 10, 11, "phrasebook-fr-to-en-11.mp3", "phrasebook-fr-to-en-11.png", "2025-12-17"),
("fr12", "en12", "[sound:phrasebook-fr-to-en-12.mp3]", "<img src=\"phrasebook-fr-to-en-12.png\">", 10, 12, "phrasebook-fr-to-en-12.mp3", "phrasebook-fr-to-en-12.png", "2025-12-17"),
],
[
"Skip existing record: ('2025-12-15', 'fr_whatever', 'en1')",
"Record has been enriched: ('2025-12-16', 'fr7', 'en7')",
"Record has been enriched: ('2025-12-17', 'fr10', 'en10')",
],
),
],
ids=[
"1_record_no_enriched_file",
"1_record_enriched_file_3_records_english_not_in_phrasebook_creates_3_keeps_3",
"1_record_enriched_file_3_records_english_same_skips_creates_0",
"3_records_no_enriched_file_creates_9",
"3_records_enriched_file_6_records_english_not_in_phrasebook_creates_9_keeps_6",
"3_records_enriched_file_6_records_first_english_same_skips_1_creates_6",
],
) # fmt: skip
def test_app_records_saved(
tmp_path_factory: pytest.TempPathFactory,
monkeypatch: pytest.MonkeyPatch,
caplog: pytest.LogCaptureFixture,
phrasebook_content,
translations,
enriched_content,
enriched_expected,
logs,
):
caplog.set_level(logging.INFO, logger="phrasebook_fr_to_en.cli")
# OPENAI_API_KEY must be set to run the app
# As we mock `generate_...` functions, we don't hit OpenAI API,
# so we don't have to use a real API key
monkeypatch.setenv("OPENAI_API_KEY", "foo-api-key")
mock_generate_translations = Mock(side_effect=translations)
mock_generate_audio = Mock(return_value=None)
mock_generate_img = Mock(return_value=None)
monkeypatch.setattr(cli, "generate_translations", mock_generate_translations)
monkeypatch.setattr(cli, "generate_audio", mock_generate_audio)
monkeypatch.setattr(cli, "generate_img", mock_generate_img)
tmp_path = tmp_path_factory.mktemp("phrasebook")
phrasebook_path = tmp_path / "phrasebook.tsv"
phrasebook_path.write_text(phrasebook_content)
enriched_path = cli.enriched_path_func(phrasebook_path)
if enriched_content:
enriched_path.write_text(enriched_content)
result = runner.invoke(cli.app, [str(phrasebook_path)], catch_exceptions=False)
assert result.exit_code == 0, result.output
enriched_df = pd.read_csv(enriched_path, sep="\t", dtype="string")
# Match the dtypes produced by save_new_records
enriched_df["id"] = enriched_df["id"].astype("Int64")
enriched_df["generated_from"] = enriched_df["generated_from"].astype("Int64")
enriched_df_expected = pd.DataFrame(
enriched_expected,
columns=pd.Index(cli.ENRICHED_COLUMNS),
dtype="string",
)
enriched_df_expected["id"] = enriched_df_expected["id"].astype("Int64")
enriched_df_expected["generated_from"] = enriched_df_expected[
"generated_from"
].astype("Int64")
pd.testing.assert_frame_equal(enriched_df, enriched_df_expected, check_dtype=True)
for log in logs:
assert log in caplog.text
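A closing note on the `Int64` casts in this test: reading the TSV with `dtype="string"` brings empty `generated_from` cells back as `pd.NA`, and casting both frames to pandas' nullable `Int64` dtype lets `assert_frame_equal` compare the numeric columns, missing values included. A minimal illustration of that roundtrip:

import pandas as pd

# Numeric strings cast cleanly to the nullable integer dtype,
# and missing cells stay <NA> instead of turning into float NaN.
s = pd.Series(["1", pd.NA, "3"], dtype="string").astype("Int64")
print(s.tolist())  # [1, <NA>, 3]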